/*
Used for multi-line comments.
*/
-- Used for single-line comments.
-- LOAD reads the data. PigStorage() is the load function; we pass it a comma as the field delimiter.
batting = load 'Batting.csv' using PigStorage(',');
/*
The FOREACH statement iterates through the rows of the 'batting' relation.
The GENERATE statement pulls out selected fields and assigns them names.
The result is stored in a new relation called 'runs'.
'$<number>' designates the column in the .CSV where the field resides (e.g. $0 is the first field in Batting.csv).
*/
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
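-- A hedged sketch, not part of the original script: no schema was declared at load
-- time, so every field above is a bytearray. Explicit casts give the fields concrete
-- types. If the CSV starts with a header row, its year value cannot be cast to int and
-- becomes null, so the row can be filtered out.
-- 'typed_runs' and 'clean_runs' are hypothetical names; the rest of the script does not use them.
typed_runs = FOREACH batting GENERATE (chararray)$0 AS playerID, (int)$1 AS year, (int)$8 AS runs;
clean_runs = FILTER typed_runs BY year IS NOT NULL;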
-- The GROUP statement groups the rows in the 'runs' relation by the 'year' field.
grp_data = GROUP runs by (year);
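-- Optional check (an addition, not in the original script): DESCRIBE prints the schema of a relation.
-- For grp_data it should show a 'group' field holding the year and a bag named 'runs'
-- holding the rows that share that year.
DESCRIBE grp_data;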
/*
We iterate through 'grp_data' one year group at a time.
For each group, MAX finds the highest run count for that year.
GENERATE emits the year and that maximum into a new relation called 'max_runs'.
*/
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
/*
The JOIN statement matches 'max_runs' against 'runs' on (year, runs) in order to recover the playerID.
The result is a dataset with year, playerID, and max runs.
Finally, DUMP prints the result to the console.
*/
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
-- After the JOIN, $0 is the year (grp), $1 is max_runs, and $2 is playerID.
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;
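-- A hedged alternative, not part of the original script: DUMP only displays the result.
-- To write it to storage instead, STORE can be used. 'max_runs_by_year' is a
-- hypothetical output directory name; the job fails if that directory already exists.
STORE join_data INTO 'max_runs_by_year' USING PigStorage(',');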