- Task FOUR
- **************************************************************************************************************************
- This task requires selecting the proper label for each word of an input text. From the serial implementation and the nature of
- the problem, the necessary steps are: (a) training-file feature collection, (b) Bayes probability calculation, (c) processing-file
- feature collection, and (d) word labeling. Steps (a) and (b) are the central goals of the MapReduce application defined in task 2,
- and its output is used as the trained model for the label categorization of the processing file. Step (c) is covered by the mapper
- of task 3, which collects the features of an input file. Building on the results of the previous tasks, I chose to reuse the mapper
- of task 3 and the trained model of task 2 inside a new reducer. The mapper operates on the input-file chunks as described in task 3.
- The reducers operate on the partitioned data emitted by the mappers and select the proper label for each word. To support this
- selection, the output of task 2 is treated as a functional part of the new reducer, which means that each node executing a
- reduction needs a local copy of the trained model. The output is the pair of each word with its associated label.
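The label-selection step inside the reducer can be sketched as a Naive Bayes argmax over the trained model. This is a minimal illustration, assuming the Task 2 output has been loaded into per-label priors and per-label word likelihoods; the variable names, label names, and probability values below are purely illustrative:

```python
import math

# Hypothetical trained model, as produced by the Task 2 job
# (structure and values are illustrative, not the actual output format).
priors = {"POS": 0.6, "NEG": 0.4}
likelihoods = {
    "POS": {"good": 0.05, "bad": 0.001},
    "NEG": {"good": 0.002, "bad": 0.04},
}

def label_word(word, priors, likelihoods, smoothing=1e-6):
    """Select the label maximising log P(label) + log P(word | label)."""
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        # Fall back to a small smoothing constant for unseen words.
        p_word = likelihoods[label].get(word, smoothing)
        score = math.log(prior) + math.log(p_word)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def reduce_words(words):
    """Reducer body: emit a (word, label) pair for each partitioned word."""
    return [(w, label_word(w, priors, likelihoods)) for w in words]
```

Working in log space avoids floating-point underflow when the per-word probabilities are small, which is the usual practice for Naive Bayes scoring.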
- Regarding scalability: on one hand, we have the typical MapReduce scheme, where increasing the number of mappers and reducers
- naturally increases performance; on the other hand, copying the trained data to every reduction executor creates a one-to-all
- dependence, where a single point shares the data with all nodes. This is not a desirable effect: as the number of machines
- increases, sharing the trained data could become a bottleneck. Two possible scenarios for avoiding this situation are:
- 1. using a module or library with access to the distributed filesystem of Hadoop, taking advantage of its features;
- 2. using distributed algorithms implemented as multiple MapReduce executions.
- The latter is meaningful only for huge datasets and should be investigated further before moving it to production.
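Scenario (1) can be sketched with Hadoop Streaming's distributed cache: instead of one node pushing the model to all reducers, the framework ships the file to each task node once, and every reducer opens it from its local working directory. A minimal sketch, assuming the Task 2 output has been serialized to a JSON file (the file name `model.json` and its structure are illustrative assumptions):

```python
import json

def load_model(path="model.json"):
    """Load the trained model shipped to the task node.

    With Hadoop Streaming, passing `-files hdfs://.../model.json` to the
    job symlinks the file into each task's working directory, so a
    reducer can open it by its bare name instead of receiving a copy
    from a single sharing point.
    """
    with open(path) as fh:
        model = json.load(fh)
    return model["priors"], model["likelihoods"]
```

Each reducer would call `load_model()` once during setup and then score incoming words against the returned tables.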