Task FOUR)
**************************************************************************************************************************
This task requires selecting the proper label for each word of an input text. From the serial implementation
and the nature of the problem, the work breaks down into specific steps: a) training-file feature collection, b) Bayes
probability calculation, c) processing-file feature collection, and d) word labeling. Steps (a) and (b) are the central goals of the
MapReduce application defined in task 2, and I use its output as the trained model for labeling the
processing file. Requirement (c) is covered by the mapper of task 3, which collects the features of an input file. Based on
the existing results and solutions of the previous tasks, I chose to reuse the mapper of task 3 and the
trained model of task 2 inside a new reducer. The mapper operates on the input file chunks in the way described in task
three. The reducers operate on the partitioned data emitted by the mappers and select the proper label for each word. To support
the selection procedure of the reduction, the output data of task 2 are treated as a functional part of the new reducer,
which means that each node that executes a reduction operation needs a local copy of the trained model. The output is the pair of
each word with its associated label.
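
As an illustration only, below is a minimal sketch of such a labeling reducer. It assumes (hypothetically) that the
task 2 output is available on the local node as a file model.tsv with one "word<TAB>label<TAB>probability" line per
entry; the class and file names are placeholders, not the actual implementation. For each incoming word the reducer
simply picks the label with the highest trained probability for that word, which is the Bayes-based selection described above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch: LabelReducer and model.tsv are assumed names, not the original code.
public class LabelReducer extends Reducer<Text, IntWritable, Text, Text> {

    // word -> (label -> probability), loaded from the task-2 output kept locally on the node
    private final Map<String, Map<String, Double>> model = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumed model layout: one "word<TAB>label<TAB>probability" line per entry.
        try (BufferedReader in = new BufferedReader(new FileReader("model.tsv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                model.computeIfAbsent(parts[0], k -> new HashMap<>())
                     .put(parts[1], Double.parseDouble(parts[2]));
            }
        }
    }

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Select the label with the highest trained probability for this word.
        Map<String, Double> labelProbs = model.getOrDefault(word.toString(), Collections.emptyMap());
        String best = null;
        double bestProb = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : labelProbs.entrySet()) {
            if (e.getValue() > bestProb) {
                bestProb = e.getValue();
                best = e.getKey();
            }
        }
        if (best != null) {
            context.write(word, new Text(best));  // emit (word, selected label)
        }
    }
}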

Considering scalability: on the one hand, we have the typical MapReduce scheme in which increasing the number of
mappers and reducers, in a logical way, increases performance; on the other hand, copying the trained data to
each reduction executor creates a one-to-all dependence, where a single point shares the data with all the nodes. The latter is not really
a desirable effect: as the number of machines increases, sharing the trained data could become a bottleneck. Two possible scenarios for avoiding
this situation are:
1. using a module or library with access to the distributed filesystem of Hadoop and taking advantage of its features (see the sketch after this list);
2. using distributed algorithms implemented as multiple MapReduce executions.
The latter is meaningful only for huge datasets and should
be further investigated before moving it to production.
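
For scenario 1, Hadoop's distributed cache is one feature that fits this description: the driver registers a model file
that already lives in HDFS, and the framework localizes it once on every node where a reducer runs, instead of the job
copying it point to point. The sketch below is only an assumption of how that wiring could look; the class name
LabelingDriver and the HDFS path /models/task2/model.tsv are hypothetical, and the reducer is the one sketched earlier.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver sketch: names and paths are placeholders.
public class LabelingDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word labeling");
        job.setJarByClass(LabelingDriver.class);

        // job.setMapperClass(...);  // the task-3 feature mapper would be set here
        job.setReducerClass(LabelReducer.class);   // reducer sketched above
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Ship the task-2 model through HDFS; the "#model.tsv" fragment makes it
        // visible as model.tsv in each task's working directory.
        job.addCacheFile(new URI("/models/task2/model.tsv#model.tsv"));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}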