- File Formats:
- Input Formats
- TextInputFormat - File of values only; Hadoop generates the keys (the byte offset of each line), which we are usually not interested in
- KeyValueTextInputFormat - File of keys & values where the default separator is "\t" (TAB)
- - We can change the separator by adding the conf below:
- conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
- SequenceFileInputFormat - Binary key/value format, optionally compressed, which can be used to store information using less disk space
- - It is useful when output of one job is input to another job since it requires less
- disk writing and reading which speeds up the job
- NLineInputFormat - Lets us specify the size of a split as a number of lines; each split contains this many lines
- - We set the size of the split using the conf below:
- job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 1000);
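To make the KeyValueTextInputFormat separator setting above concrete, here is a minimal plain-Java sketch of splitting a line at the first occurrence of the separator. It does not use Hadoop; the class name KeyValueSplitSketch and the method splitLine are hypothetical names chosen for illustration, and the code only mirrors the documented behaviour (everything before the first separator is the key, the rest is the value, and a line with no separator becomes a key with an empty value).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Illustration only: a plain-Java stand-in, not Hadoop's own record reader.
class KeyValueSplitSketch {

    /** Splits a line into (key, value) at the FIRST occurrence of the separator. */
    public static Map.Entry<String, String> splitLine(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator on the line: the whole line is the key, the value is empty.
            return new SimpleEntry<>(line, "");
        }
        // Later separators stay inside the value untouched.
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        // Default separator (TAB) and the "," override from the conf line above.
        Map.Entry<String, String> tabbed = splitLine("user1\tclicked\thome", '\t');
        Map.Entry<String, String> comma = splitLine("user1,clicked,home", ',');
        System.out.println("key=" + tabbed.getKey() + " value=" + tabbed.getValue());
        System.out.println("key=" + comma.getKey() + " value=" + comma.getValue());
    }
}
```

Only the first separator matters, so a value may itself contain the separator character.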
- Output Formats
- TextOutputFormat - Default output format of Hadoop, which writes one key-value pair per line, separated by TAB
- SequenceFileOutputFormat - Output counterpart of SequenceFileInputFormat
- By default, the size of each split is 128 MB (the default HDFS block size in Hadoop 2 and later)
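As a rough illustration of the two sizing rules in these notes (a fixed number of lines per split with NLineInputFormat, versus a byte size such as 128 MB), the sketch below estimates how many splits an input produces. It is plain Java with hypothetical names (SplitCountSketch, splitsByLines, splitsByBytes) and uses simple ceiling division; it assumes a single splittable file and ignores the small tolerance Hadoop allows on the last split, so treat the byte-based figure as an approximation.

```java
// Illustration only: back-of-the-envelope split counts, not Hadoop's getSplits().
class SplitCountSketch {

    /** Number of splits when every split holds a fixed number of lines. */
    public static long splitsByLines(long totalLines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        return (totalLines + linesPerSplit - 1) / linesPerSplit; // ceiling division
    }

    /** Approximate number of splits when every split holds up to splitSizeBytes. */
    public static long splitsByBytes(long fileSizeBytes, long splitSizeBytes) {
        if (splitSizeBytes <= 0) {
            throw new IllegalArgumentException("splitSizeBytes must be positive");
        }
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 10,000 lines at 1000 lines per split (the linespermap value above).
        System.out.println(splitsByLines(10_000, 1000));        // prints 10
        // A 1 GB file at the 128 MB split size.
        System.out.println(splitsByBytes(1024 * mb, 128 * mb)); // prints 8
    }
}
```

Each split is handled by one map task, so these counts also approximate the number of mappers the job starts.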
- To override a parameter at runtime (the driver must parse generic options, e.g. by running through ToolRunner):
- hadoop jar FileName.jar ClassName -D mapreduce.input.keyvaluelinerecordreader.key.value.separator=% input output
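- // in production: submit the job and return without waiting for it to finish
- job.submit();
- return 0;
- // in standalone: wait for the job to finish and return 0 on success, 1 on failure
- return job.waitForCompletion(true) ? 0 : 1;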