Tashietash

Apr 6th, 2020
I am trying to get some help finishing up my assignment using PySpark. Below is the expected outcome and my results so far.

Expected output

After reading and processing the input file, your code should create an output file, report.csv, with as many lines as there are unique pairs of product and year (of "Date received") in the input file.

Each line in the output file should list the following fields, in this order:

product (name written in all lowercase)
year
total number of complaints received for that product and year
total number of companies receiving at least one complaint for that product and year
highest percentage (rounded to the nearest whole number) of total complaints filed against one company for that product and year; use standard rounding conventions (i.e., any percentage between 0.5% and 1%, inclusive, should round to 1%, and anything less than 0.5% should round to 0% - see the rounding sketch after this section)

The lines in the output file should be sorted by product (alphabetically) and year (ascending).

Given the above complaints.csv input file, we'd expect an output file, report.csv, in the following format:

  17. "credit reporting, credit repair services, or other personal consumer reports",2019,3,2,67
  18. "credit reporting, credit repair services, or other personal consumer reports",2020,1,1,100
  19. debt collection,2019,1,1,100
Notice that because debt collection was only listed for 2019 and not 2020, the output file has only a single entry for debt collection. Also notice that when a product has a comma (,) in its name, the name should be enclosed in double quotation marks ("). Finally, notice that percentages are listed as plain numbers and do not have % in them.

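One subtlety with the rounding rule above: Python's built-in round() uses banker's rounding (round(0.5) == 0 but round(1.5) == 2), which would send a 0.5% share down to 0. A minimal sketch of a half-up helper that matches the assignment's convention:

# Python 3's round() rounds halves to even, so round(0.5) gives 0.
# Adding 0.5 and truncating implements the required half-up rounding.
def round_half_up(pct):
    return int(pct + 0.5)

round_half_up(2 / 3 * 100)   # 67, matching the sample output above
round_half_up(0.5)           # 1, as the assignment requires
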
My results so far:
from pyspark.sql.functions import col, lower, to_date, unix_timestamp, year

# Read the complaints CSV with a header row and inferred column types
report = spark.read.load(cons_report, format='csv', header=True, inferSchema=True)

# Parse 'Date received' (MM/dd/yyyy strings) into a proper date column
report = report.withColumn(
    'Date received',
    to_date(unix_timestamp(col('Date received'), 'MM/dd/yyyy').cast('timestamp')))

# Keep the lowercased product name and the year the complaint was received
report = report.select(lower(col('Product')).alias('Product'),
                       year('Date received').alias('Year'))

# Drop rows with missing values, then count complaints per product/year
report = report.na.drop()
report = report.groupBy('Product', 'Year').count()
report.show()
+--------------------+----+-----+
|             Product|Year|count|
+--------------------+----+-----+
|credit reporting,...|2019| 3114|
|         credit card|2016|    4|
|checking or savin...|2020|    3|
|money transfer, v...|2019|   87|
|            mortgage|2018|   39|
|credit card or pr...|2019|  437|
|       consumer loan|2015|    1|
|     debt collection|2017|   13|
|            mortgage|2019|  415|
|        student loan|2019|  157|
|payday loan, titl...|2020|    1|
|     debt collection|2015|    4|
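
To finish from here, the per-company counts are needed before collapsing to product/year, so the select above would also have to keep the company name. A sketch of the remaining steps, assuming the input has a 'Company' column (as the CFPB complaints data does); the output path is a placeholder:

from pyspark.sql import functions as F

# Assumes the earlier select was extended to keep the company, e.g.:
# report = report.select(lower(col('Product')).alias('Product'),
#                        year('Date received').alias('Year'),
#                        col('Company'))

# Complaints per company within each product/year
per_company = report.groupBy('Product', 'Year', 'Company').count()

# Collapse to one row per product/year:
#   total     - total complaints for the product/year
#   companies - companies with at least one complaint
#   top       - complaints against the most-complained-about company
summary = (per_company.groupBy('Product', 'Year')
           .agg(F.sum('count').alias('total'),
                F.count('Company').alias('companies'),
                F.max('count').alias('top')))

# Highest share of complaints against one company, rounded half-up;
# floor(x + 0.5) keeps the 0.5%-rounds-to-1% convention explicit
summary = summary.withColumn(
    'highest_pct',
    F.floor(F.col('top') * 100 / F.col('total') + 0.5).cast('int'))

# Sort by product then year and write out; Spark's CSV writer quotes
# fields that contain the delimiter (the comma) by default
(summary.orderBy('Product', 'Year')
        .coalesce(1)
        .write.csv('report_out', header=False))

Note that write.csv produces a directory of part files ('report_out' here), so the single part file inside still has to be renamed to report.csv; for a summary this small, collecting the rows and writing them with Python's csv module would work just as well.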