- I am trying to get some help finishing up my assignment using PySpark. Below are the expected output and my results so far.
- Expected output
- After reading and processing the input file, your code should create an output file, report.csv, with as many lines as unique pairs of product and year (of Date received) in the input file.
- Each line in the output file should list the following fields in the following order:
- product (name should be written in all lowercase)
- year
- total number of complaints received for that product and year
- total number of companies receiving at least one complaint for that product and year
- highest percentage (rounded to the nearest whole number) of total complaints filed against one company for that product and year. Use standard rounding conventions (i.e., any percentage between 0.5% and 1%, inclusive, should round to 1%, and anything less than 0.5% should round to 0%).
- The lines in the output file should be sorted by product (alphabetically) and year (ascending)
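The rounding rule in the field list is round-half-up, which is worth checking against whatever tool does the rounding. Python's built-in `round` rounds halves to the nearest even number, while Spark's `round` column function rounds half up (`bround` is the half-even variant). Below is a small sketch of the rule in plain Python using only integer arithmetic; the helper name `pct_half_up` is made up for illustration.

```python
def pct_half_up(part: int, total: int) -> int:
    """Percentage of `part` in `total`, rounded half up, using only integers."""
    if total <= 0:
        raise ValueError("total must be positive")
    # floor(100 * part / total + 1/2) == (200 * part + total) // (2 * total)
    return (200 * part + total) // (2 * total)


print(pct_half_up(2, 3))    # 67  (2 of 3 complaints)
print(pct_half_up(1, 200))  # 1   (exactly 0.5% rounds up)
print(pct_half_up(1, 201))  # 0   (just under 0.5% rounds down)
print(round(0.5))           # 0   (built-in round is half-to-even)
```

Working in integers avoids floating-point surprises right at the 0.5 boundary.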
- Given the above complaints.csv input file, we'd expect an output file, report.csv, in the following format:
- "credit reporting, credit repair services, or other personal consumer reports",2019,3,2,67
- "credit reporting, credit repair services, or other personal consumer reports",2020,1,1,100
- debt collection,2019,1,1,100
- Notice that because debt collection was only listed for 2019 and not 2020, the output file only has a single entry for debt collection. Also, notice that when a product has a comma (,) in the name, the name should be enclosed by double quotation marks ("). Finally, notice that percentages are listed as numbers and do not have % in them.
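As a cross-check for the Spark results, the expected output can be reproduced with a small plain-Python reference using the standard `csv` module, which quotes only the fields that need it (such as product names containing commas). This is a sketch under assumptions: the function names are hypothetical, the rows are already reduced to (product, year, company) tuples, and the sample rows and company names are made up solely so that the result matches the three expected lines above.

```python
import csv
import io
from collections import Counter, defaultdict


def build_report(rows):
    """rows: iterable of (product, year, company) -> sorted list of report tuples."""
    per_group = defaultdict(Counter)  # (product, year) -> complaints per company
    for product, year, company in rows:
        per_group[(product.lower(), int(year))][company] += 1

    report = []
    for product, year in sorted(per_group):  # product alphabetically, then year ascending
        companies = per_group[(product, year)]
        total = sum(companies.values())
        top = max(companies.values())
        pct = (200 * top + total) // (2 * total)  # percentage, rounded half up
        report.append((product, year, total, len(companies), pct))
    return report


def to_csv(report):
    """Render report rows as CSV text; default quoting only quotes where required."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(report)
    return buf.getvalue()


# Made-up sample rows (not the real complaints.csv), for illustration only
CREDIT = "Credit reporting, credit repair services, or other personal consumer reports"
sample = [
    (CREDIT, 2019, "Company A"),
    (CREDIT, 2019, "Company A"),
    (CREDIT, 2019, "Company B"),
    (CREDIT, 2020, "Company A"),
    ("Debt collection", 2019, "Company C"),
]
print(to_csv(build_report(sample)), end="")
```

With these rows the printed text is exactly the three expected lines, including the double quotes around the product name that contains commas.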
- Results so far
- from pyspark.sql.functions import col, lower, to_date, unix_timestamp, year
- # cons_report is the path to the input CSV; spark is the active SparkSession
- report = spark.read.load(cons_report, format='csv', header=True, inferSchema=True)
- # Parse 'Date received' (MM/dd/yyyy) into a date column
- report = report.withColumn('Date received',
-     to_date(unix_timestamp(col('Date received'), 'MM/dd/yyyy').cast('timestamp')))
- # Keep only the lowercased product name and the year the complaint was received
- report = report.select(lower(col('Product')).alias('Product'),
-     year('Date received').alias('Year'))
- report = report.na.drop()
- # Count complaints per (Product, Year)
- report = report.groupBy('Product', 'Year').count()
- +--------------------+----+-----+
- |             Product|Year|count|
- +--------------------+----+-----+
- |credit reporting,...|2019| 3114|
- |         credit card|2016|    4|
- |checking or savin...|2020|    3|
- |money transfer, v...|2019|   87|
- |            mortgage|2018|   39|
- |credit card or pr...|2019|  437|
- |       consumer loan|2015|    1|
- |     debt collection|2017|   13|
- |            mortgage|2019|  415|
- |        student loan|2019|  157|
- |payday loan, titl...|2020|    1|
- |     debt collection|2015|    4|