1. What is the difference between map and mapPartitions? Explain with an example, and when should each be used?
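A minimal Scala sketch of the difference (the RDD contents are illustrative): map calls the function once per element, while mapPartitions calls it once per partition and receives an iterator, which pays off when there is expensive per-partition setup such as opening a database connection.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("map-vs-mapPartitions").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 10, numSlices = 2)

    // map: the function runs once for every element.
    val doubled = rdd.map(x => x * 2)

    // mapPartitions: the function runs once per partition and receives an iterator,
    // so any expensive setup happens once per partition rather than once per element.
    val doubledPerPartition = rdd.mapPartitions { iter =>
      // placeholder for per-partition setup, e.g. opening a connection
      iter.map(x => x * 2)
    }

    doubled.collect()
    doubledPerPartition.collect()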
2. Scenario: suppose an RDD is deleted during a Spark job. How can we recover it, and what is the backend mechanism for recovering RDDs?
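For context, a hedged sketch of the mechanism (the paths are illustrative): every RDD carries its lineage, the chain of transformations used to build it, and lost partitions are recomputed from that lineage rather than restored from a copy. Checkpointing to reliable storage truncates long lineages so recovery does not replay the whole chain.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext

    val data = sc.textFile("hdfs:///data/input")             // illustrative input path
    val cleaned = data.filter(_.nonEmpty).map(_.toUpperCase)

    // The lineage Spark would use to recompute lost partitions:
    println(cleaned.toDebugString)

    // Optional: checkpoint to reliable storage so recovery does not replay the full lineage.
    sc.setCheckpointDir("hdfs:///checkpoints")               // illustrative checkpoint dir
    cleaned.checkpoint()
    cleaned.count()                                          // an action materialises the checkpoint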
3. Do you connect to your cluster using data nodes or edge nodes, and what is the reason for choosing one over the other?

4. Have you ever received a "spaceout" error on your DataNode?

5. How do you allocate buffer memory to your DataNode?

6. How much buffer space have you allocated to your map tasks and reduce tasks on your DataNode?
7. How do you achieve a broadcast join automatically, without doing it manually? How do you set up your driver program to detect where a broadcast join would be beneficial, and how do you automate the process?
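One common answer, sketched below with illustrative paths and column names: Spark SQL broadcasts the smaller side automatically when its estimated size is under spark.sql.autoBroadcastJoinThreshold (10 MB by default), and the broadcast() hint covers joins the size estimate misses.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().getOrCreate()

    // Raise the automatic-broadcast threshold to 50 MB (the default is 10 MB).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

    val large = spark.read.parquet("hdfs:///warehouse/transactions")   // illustrative paths
    val small = spark.read.parquet("hdfs:///warehouse/dim_country")

    // Automatic: Spark broadcasts `small` if its estimated size is below the threshold.
    val joined = large.join(small, Seq("country_id"))

    // Explicit hint for a table the optimiser's size estimate gets wrong.
    val hinted = large.join(broadcast(small), Seq("country_id"))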
8. How do you achieve an in-memory cache?
Scenario: imagine you are working on a cluster and have already cached your RDD, with the output stored in the cache. Now you want to clear that memory and use the space for caching another RDD. How do you achieve this?
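A short sketch covering both question 8 and this scenario (the paths are illustrative): persist()/cache() keep an RDD in executor memory once an action has run, and unpersist() releases that storage so the space can be used to cache another RDD.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext

    val firstRdd = sc.textFile("hdfs:///data/first")        // illustrative path
      .persist(StorageLevel.MEMORY_ONLY)                    // for RDDs, cache() is shorthand for this level
    firstRdd.count()                                        // an action materialises the cache

    // Scenario: free the memory held by firstRdd, then cache a different RDD in its place.
    firstRdd.unpersist(blocking = true)

    val secondRdd = sc.textFile("hdfs:///data/second").cache()
    secondRdd.count()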
9. What packages have you worked with in Scala? Name the packages you have imported in your current project.

10. What modules have you worked with in Scala? Name the modules you have worked with to date.
11. Kafka scenario: suppose your producer is producing more than your consumer can consume. How will you deal with such a situation, and what preventive measures do you take to prevent data loss?
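On the data-loss side, one hedged sketch (broker address and topic name are illustrative): durability is configured on the producer with acks=all, retries and idempotence, while the consumption gap itself is usually handled by generous topic retention and by adding consumers to the consumer group so partitions are processed in parallel.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")                 // illustrative broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all")                // wait for all in-sync replicas before acknowledging
    props.put("retries", "2147483647")      // retry transient failures instead of dropping records
    props.put("enable.idempotence", "true") // avoid duplicates introduced by those retries

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("events", "key-1", "value-1")) // illustrative topic
    producer.close()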
12. How do you achieve rebalancing in Kafka, and in what way is it useful?
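Rebalancing is what lets the group coordinator redistribute partitions when consumers join, leave or fail, so the group keeps consuming without manual reassignment. A hedged sketch of hooking into it through the standard ConsumerRebalanceListener (broker, group and topic names are illustrative):

    import java.time.Duration
    import java.util.{Collection => JCollection, Collections, Properties}
    import org.apache.kafka.clients.consumer.{ConsumerRebalanceListener, KafkaConsumer}
    import org.apache.kafka.common.TopicPartition

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")     // illustrative broker
    props.put("group.id", "reporting-group")           // illustrative group id
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)

    consumer.subscribe(Collections.singletonList("events"), new ConsumerRebalanceListener {
      // Called before partitions are taken away from this consumer: commit progress here.
      override def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit =
        consumer.commitSync()

      // Called after the group coordinator hands this consumer its new partitions.
      override def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit =
        println(s"partitions assigned: ${partitions.size()}")
    })

    consumer.poll(Duration.ofMillis(500))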
13. Kafka scenario: suppose the producer is writing structured data in CSV format. How will the consumer know what schema the data is arriving in, and how and where do you specify that schema?
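A plain CSV record carries no schema, so the consumer has to be told it explicitly; in many production setups the stronger answer is to switch to Avro or JSON with a schema registry. A hedged sketch of the Spark consumer side, assuming the consumed records land as CSV files (paths and columns are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType, TimestampType}

    val spark = SparkSession.builder().getOrCreate()

    // The consuming side declares the schema, because the CSV bytes themselves carry none.
    val orderSchema = StructType(Seq(
      StructField("order_id",   IntegerType,   nullable = false),
      StructField("customer",   StringType,    nullable = true),
      StructField("amount",     DoubleType,    nullable = true),
      StructField("ordered_at", TimestampType, nullable = true)
    ))

    val orders = spark.read
      .schema(orderSchema)             // specified here, on the reading side
      .option("header", "false")
      .csv("hdfs:///landing/orders")   // illustrative landing path for the consumed records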
14. How do you manage your offsets?
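A minimal sketch of manual offset management (broker, group and topic names are illustrative): disable auto-commit and commit only after a batch has been processed, so a crash replays unprocessed records rather than silently skipping them.

    import java.time.Duration
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // illustrative broker
    props.put("group.id", "etl-group")               // illustrative group id
    props.put("enable.auto.commit", "false")         // we decide when an offset counts as done
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events"))

    val records = consumer.poll(Duration.ofMillis(500))
    records.forEach(r => println(s"${r.offset()}: ${r.value()}"))  // stand-in for real processing

    // Commit only after processing succeeded; offsets are stored in the __consumer_offsets topic.
    consumer.commitSync()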
15. Kafka scenario: suppose consumer A has read 10 offsets from the topic and then fails. How will consumer B pick up from those offsets, how is that position stored, and what mechanism do we need to configure to achieve this?
16. Hive scenario: imagine we have two tables, A and B. B is the master table, and A is the table that receives updates to certain information. I want to update table B with the latest updated columns, matching on id. How do we achieve that, and what is the exact query we use?
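A hedged sketch of the usual answer, with illustrative table and column names: if B is a transactional (ACID) Hive table, MERGE (Hive 2.2+) applies the updates from A in one statement. It is shown here as a HiveQL string in Scala; it would normally be executed in Hive or beeline rather than through Spark.

    // Assumes b is a transactional (ACID) Hive table; id, name, city, updated_at are illustrative columns.
    val mergeIntoMaster =
      """
        |MERGE INTO b AS target
        |USING a AS updates
        |ON target.id = updates.id
        |WHEN MATCHED THEN
        |  UPDATE SET name = updates.name,
        |             city = updates.city,
        |             updated_at = updates.updated_at
        |WHEN NOT MATCHED THEN
        |  INSERT VALUES (updates.id, updates.name, updates.city, updates.updated_at)
        |""".stripMargin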
17. What is the use of Row-index, and in which scenarios have you used it in Hive?
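If Row-index here refers to the ROW_NUMBER window function, one common scenario is keeping only the latest record per key. A hedged sketch using the Spark DataFrame API over a Hive table (table and column names are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    val updates = spark.table("a")   // illustrative Hive table with id and updated_at columns

    // Number rows within each id, newest first, then keep row 1 = the latest record per id.
    val latestPerId = updates
      .withColumn("rn", row_number().over(Window.partitionBy("id").orderBy(col("updated_at").desc)))
      .filter(col("rn") === 1)
      .drop("rn")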
18. What do you know about NTILE?
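NTILE(n) splits an ordered window into n roughly equal buckets and tags each row with its bucket number, which is handy for quartiles or deciles. A small sketch (the sample data is illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, ntile}

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val sales = Seq(("a", 120.0), ("b", 90.0), ("c", 300.0), ("d", 45.0)).toDF("customer", "amount")

    // NTILE(4) assigns each row a quartile (1 to 4) based on its position in the ordered window.
    val withQuartile = sales.withColumn(
      "amount_quartile",
      ntile(4).over(Window.orderBy(col("amount").desc))
    )
    withQuartile.show()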
19. Spark scenario: suppose I am running 10 SQL jobs that generally take 10 minutes to complete, but on one occasion they took 1 hour. If this is the case, how do you report the issue, how will you debug your code, and how do you provide a solution?