- Task FOUR
- **************************************************************************************************************************
- This task requires selecting the proper label for each word of an input text. From the serial implementation and the nature of
- the problem, the necessary steps are: (a) training-file feature collection, (b) Bayes probability calculation, (c) processing-file
- feature collection, and (d) word labeling. Steps (a) and (b) are the central goals of the MapReduce application defined in task 2,
- and its output is used as the trained model for the label categorization of the processing file. Step (c) is covered by the mapper
- of task 3, which collects the features of an input file. Building on the results of the previous tasks, I chose to reuse the mapper
- of task 3 and the trained model of task 2 inside a new reducer. The mapper operates on the input-file chunks as described in task 3.
- The reducers operate on the partitioned data emitted by the mappers and select the proper label for each word. To support this
- selection, the output of task 2 is treated as a functional part of the new reducer, which means that each node executing a
- reduction needs a local copy of the trained model. The output is the pair of each word with its associated label.
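The label-selection step inside the reducer can be sketched as a Naive Bayes argmax over the trained model. This is a minimal illustration, assuming the Task 2 output has been loaded into per-label priors and per-label word likelihoods; the variable names, label names, and probability values below are purely illustrative:

```python
import math

# Hypothetical trained model, as produced by the Task 2 job
# (structure and values are illustrative, not the actual output format).
priors = {"POS": 0.6, "NEG": 0.4}
likelihoods = {
    "POS": {"good": 0.05, "bad": 0.001},
    "NEG": {"good": 0.002, "bad": 0.04},
}

def label_word(word, priors, likelihoods, smoothing=1e-6):
    """Select the label maximising log P(label) + log P(word | label)."""
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        # Fall back to a small smoothing constant for unseen words.
        p_word = likelihoods[label].get(word, smoothing)
        score = math.log(prior) + math.log(p_word)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def reduce_words(words):
    """Reducer body: emit a (word, label) pair for each partitioned word."""
    return [(w, label_word(w, priors, likelihoods)) for w in words]
```

Working in log space avoids floating-point underflow when the per-word probabilities are small, which is the usual practice for Naive Bayes scoring.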
- Regarding scalability: on one hand, we have the typical MapReduce scheme, where increasing the number of mappers and reducers
- naturally increases performance; on the other hand, copying the trained data to every reduction executor creates a one-to-all
- dependence, where a single point shares the data with all nodes. This is not a desirable effect: as the number of machines
- increases, sharing the trained data could become a bottleneck. Two possible scenarios for avoiding this situation are:
- 1. using a module or library with access to the distributed filesystem of Hadoop, taking advantage of its features;
- 2. using distributed algorithms implemented as multiple MapReduce executions.
- The latter is meaningful only for huge datasets and should be investigated further before moving it to production.
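Scenario (1) can be sketched with Hadoop Streaming's distributed cache: instead of one node pushing the model to all reducers, the framework ships the file to each task node once, and every reducer opens it from its local working directory. A minimal sketch, assuming the Task 2 output has been serialized to a JSON file (the file name `model.json` and its structure are illustrative assumptions):

```python
import json

def load_model(path="model.json"):
    """Load the trained model shipped to the task node.

    With Hadoop Streaming, passing `-files hdfs://.../model.json` to the
    job symlinks the file into each task's working directory, so a
    reducer can open it by its bare name instead of receiving a copy
    from a single sharing point.
    """
    with open(path) as fh:
        model = json.load(fh)
    return model["priors"], model["likelihoods"]
```

Each reducer would call `load_model()` once during setup and then score incoming words against the returned tables.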