May 28th, 2015
The pyspark documentation doesn't include an example for the `aggregateByKey` RDD method. I didn't find any nice examples online, so I wrote my own.

Here's what the documentation does say:

`aggregateByKey(self, zeroValue, seqFunc, combFunc, numPartitions=None)`

> Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.

`reduceByKey` and `aggregateByKey` are much more efficient than `groupByKey` and should be used for aggregations as much as possible.
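To see why, here's a plain-Python sketch (not pyspark itself; the partition layout and counts are made up for illustration) of how `reduceByKey`-style map-side combining shrinks the data that has to cross the shuffle boundary, compared to `groupByKey`, which ships every record:

```python
from collections import defaultdict

# Two hypothetical partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey-style: every record crosses the shuffle boundary.
shuffled_records = sum(len(p) for p in partitions)  # 6 records shuffled

# reduceByKey-style: combine within each partition first (map-side combine),
# so at most one record per key per partition is shuffled.
def combine_partition(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

combined = [combine_partition(p) for p in partitions]
shuffled_after_combine = sum(len(p) for p in combined)  # 4 records shuffled

# Merge the per-partition results to get the final counts.
final = defaultdict(int)
for part in combined:
    for k, v in part:
        final[k] += v
# final == {"a": 3, "b": 3}
```

With larger partitions the gap grows: `groupByKey` shuffles every value, while the combined version shuffles at most one record per key per partition.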

In the example below, I create an RDD that is a short list of characters. My functions aggregate the characters together with concatenation. I added brackets to the two types of concatenation to help give you an idea of what `aggregateByKey` is doing.

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Python version 2.7.5 (default, Mar 9 2014 22:15:05)
SparkContext available as sc.

In [1]: # Create rdd that is a list of characters

In [2]: sc.parallelize(list("aaaaabbbbcccdd")) \
   ...:   .map(lambda letter: (letter, {"value": letter})) \
   ...:   .aggregateByKey(
   ...:       # Value to start aggregation (passed as s to `lambda s, d`)
   ...:       "start",
   ...:       # Function to join final data type (string) and rdd data type
   ...:       lambda s, d: "[ %s %s ]" % (s, d["value"]),
   ...:       # Function to join two final data types.
   ...:       lambda s1, s2: "{ %s %s }" % (s1, s2),
   ...:   ) \
   ...:   .collect()

Out[2]:
[('a', '{ { [ start a ] [ [ start a ] a ] } [ [ start a ] a ] }'),
 ('c', '{ [ start c ] [ [ start c ] c ] }'),
 ('b', '{ { [ [ start b ] b ] [ start b ] } [ start b ] }'),
 ('d', '[ [ start d ] d ]')]
```
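You can reproduce the nesting without a cluster. Below is a plain-Python model of `aggregateByKey`'s semantics — `aggregate_by_key` and its hard-coded partition layout are my own invention for illustration, not part of the pyspark API. Brackets come from `seq_func` (within a partition), braces from `comb_func` (between partitions):

```python
def aggregate_by_key(partitions, zero, seq_func, comb_func):
    """Plain-Python model of RDD.aggregateByKey (illustrative, not pyspark).

    `partitions` is a list of partitions, each a list of (key, value) pairs.
    seq_func folds each value into a per-partition accumulator starting at
    `zero`; comb_func then merges accumulators from different partitions.
    """
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = seq_func(acc.get(k, zero), v)
        per_partition.append(acc)
    merged = {}
    for acc in per_partition:
        for k, u in acc.items():
            merged[k] = comb_func(merged[k], u) if k in merged else u
    return merged

# Three 'a' records split across two made-up partitions.
parts = [[("a", {"value": "a"}), ("a", {"value": "a"})],
         [("a", {"value": "a"})]]
result = aggregate_by_key(
    parts,
    "start",
    lambda s, d: "[ %s %s ]" % (s, d["value"]),
    lambda s1, s2: "{ %s %s }" % (s1, s2),
)
# result["a"] == '{ [ [ start a ] a ] [ start a ] }'
```

The exact bracket/brace pattern you get from real pyspark depends on how Spark happens to partition the data, which is why the keys in the session above each show a different shape.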