a guest
Oct 9th, 2015
/*
Used for multi-line comments.
*/

-- Used for single-line comments.

-- LOAD statement reads the data. PigStorage() is the load function; we pass it a comma as the field delimiter.
batting = LOAD 'Batting.csv' USING PigStorage(',');

/*
FOREACH statement iterates through the 'batting' data object.
GENERATE statement pulls out selected fields and assigns them names.
The new data is stored in a new data object called 'runs'.
'$<number>' designates the column in the .csv file where the field resides (e.g. $0 is the first field in Batting.csv).
*/
runs = FOREACH batting GENERATE $0 AS playerID, $1 AS year, $8 AS runs;

-- GROUP statement groups the elements in the 'runs' data object by the 'year' field.
grp_data = GROUP runs BY (year);

/*
We iterate through the 'grp_data' object year by year.
With the FOREACH statement we find the maximum runs for each year.
The GENERATE statement stores the result in a new data object called 'max_runs'.
*/
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;

/*
JOIN statement joins 'max_runs' with the 'runs' data object on year and runs in order to get the playerID.
The result is a dataset with year, playerID, and max runs.
Finally, DUMP prints the data to the console.
*/
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;
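
/*
A hedged variant of the steps above, offered as a sketch rather than a definitive version.
Assumptions (not in the original): Batting.csv begins with a header row, and
'max_runs_by_year' is a hypothetical output directory.
Fields loaded without a schema are bytearrays, so this variant casts them explicitly.
The header row should then fail the int cast, become null, and be filtered out.
*/
typed_runs = FOREACH batting GENERATE (chararray)$0 AS playerID, (int)$1 AS year, (int)$8 AS runs;
clean_runs = FILTER typed_runs BY year IS NOT NULL AND runs IS NOT NULL;
typed_grp = GROUP clean_runs BY year;
typed_max = FOREACH typed_grp GENERATE group AS year, MAX(clean_runs.runs) AS max_runs;
typed_join = JOIN typed_max BY (year, max_runs), clean_runs BY (year, runs);
typed_result = FOREACH typed_join GENERATE typed_max::year AS year, clean_runs::playerID AS playerID, typed_max::max_runs AS runs;
-- STORE writes the relation to the file system instead of printing it to the console.
STORE typed_result INTO 'max_runs_by_year' USING PigStorage(',');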