Tashietash

Apr 6th, 2020
I am trying to get some help finishing up my assignment using PySpark. Below is the expected outcome and my results so far.

Expected output

After reading and processing the input file, your code should create an output file, report.csv, with as many lines as there are unique pairs of product and year (of "Date received") in the input file.

Each line in the output file should list the following fields, in this order:

product (name written in all lowercase)
year
total number of complaints received for that product and year
total number of companies receiving at least one complaint for that product and year
highest percentage (rounded to the nearest whole number) of total complaints filed against one company for that product and year; use standard rounding conventions (i.e., any percentage between 0.5% and 1%, inclusive, should round to 1%, and anything less than 0.5% should round to 0% - see the rounding sketch after this section)

The lines in the output file should be sorted by product (alphabetically) and year (ascending).

Given the above complaints.csv input file, we'd expect an output file, report.csv, in the following format:

  17. "credit reporting, credit repair services, or other personal consumer reports",2019,3,2,67
  18. "credit reporting, credit repair services, or other personal consumer reports",2020,1,1,100
  19. debt collection,2019,1,1,100
Notice that because debt collection was only listed for 2019 and not 2020, the output file has only a single entry for debt collection. Also notice that when a product has a comma (,) in its name, the name should be enclosed in double quotation marks ("). Finally, notice that percentages are listed as plain numbers and do not have % in them.

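One subtlety with the rounding rule above: Python's built-in round() uses banker's rounding (round(0.5) == 0 but round(1.5) == 2), which would send a 0.5% share down to 0. A minimal sketch of a half-up helper that matches the assignment's convention:

# Python 3's round() rounds halves to even, so round(0.5) gives 0.
# Adding 0.5 and truncating implements the required half-up rounding.
def round_half_up(pct):
    return int(pct + 0.5)

round_half_up(2 / 3 * 100)   # 67, matching the sample output above
round_half_up(0.5)           # 1, as the assignment requires
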
My results so far:
from pyspark.sql.functions import col, lower, to_date, unix_timestamp, year

# Read the complaints CSV with a header row and inferred column types
report = spark.read.load(cons_report, format='csv', header=True, inferSchema=True)

# Parse 'Date received' (MM/dd/yyyy strings) into a proper date column
report = report.withColumn(
    'Date received',
    to_date(unix_timestamp(col('Date received'), 'MM/dd/yyyy').cast('timestamp')))

# Keep the lowercased product name and the year the complaint was received
report = report.select(lower(col('Product')).alias('Product'),
                       year('Date received').alias('Year'))

# Drop rows with missing values, then count complaints per product/year
report = report.na.drop()
report = report.groupBy('Product', 'Year').count()
report.show()
+--------------------+----+-----+
|             Product|Year|count|
+--------------------+----+-----+
|credit reporting,...|2019| 3114|
|         credit card|2016|    4|
|checking or savin...|2020|    3|
|money transfer, v...|2019|   87|
|            mortgage|2018|   39|
|credit card or pr...|2019|  437|
|       consumer loan|2015|    1|
|     debt collection|2017|   13|
|            mortgage|2019|  415|
|        student loan|2019|  157|
|payday loan, titl...|2020|    1|
|     debt collection|2015|    4|
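
To finish from here, the per-company counts are needed before collapsing to product/year, so the select above would also have to keep the company name. A sketch of the remaining steps, assuming the input has a 'Company' column (as the CFPB complaints data does); the output path is a placeholder:

from pyspark.sql import functions as F

# Assumes the earlier select was extended to keep the company, e.g.:
# report = report.select(lower(col('Product')).alias('Product'),
#                        year('Date received').alias('Year'),
#                        col('Company'))

# Complaints per company within each product/year
per_company = report.groupBy('Product', 'Year', 'Company').count()

# Collapse to one row per product/year:
#   total     - total complaints for the product/year
#   companies - companies with at least one complaint
#   top       - complaints against the most-complained-about company
summary = (per_company.groupBy('Product', 'Year')
           .agg(F.sum('count').alias('total'),
                F.count('Company').alias('companies'),
                F.max('count').alias('top')))

# Highest share of complaints against one company, rounded half-up;
# floor(x + 0.5) keeps the 0.5%-rounds-to-1% convention explicit
summary = summary.withColumn(
    'highest_pct',
    F.floor(F.col('top') * 100 / F.col('total') + 0.5).cast('int'))

# Sort by product then year and write out; Spark's CSV writer quotes
# fields that contain the delimiter (the comma) by default
(summary.orderBy('Product', 'Year')
        .coalesce(1)
        .write.csv('report_out', header=False))

Note that write.csv produces a directory of part files ('report_out' here), so the single part file inside still has to be renamed to report.csv; for a summary this small, collecting the rows and writing them with Python's csv module would work just as well.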