Tashietash

SQL using PYSPARK

Apr 13th, 2020
1,084
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
  1. Input dataset
  2. For this challenge, when we grade your submission, an input file, complaints.csv, will be moved to the top-most input directory of your repository. Your code must read that input file, process it and write the results to an output file, report.csv that your code must place in the top-most output directory of your repository.
  3.  
  4. Below are the contents of an example complaints.csv file:
  5. Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
  6. 2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,"transworld systems inc. is trying to collect a debt that is not mine, not owed and is inaccurate.",,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,N/A,3384392
  7. 2019-09-19,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the CFPB and chooses not to provide a public response,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes,N/A,3379500
  8. 2020-01-06,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Incorrect information on your report,Information belongs to someone else,,,Experian Information Solutions Inc.,CA,92532,,N/A,Email,2020-01-06,In progress,Yes,N/A,3486776
  9. 2019-10-24,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the CFPB and chooses not to provide a public response,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",CA,925XX,,Other,Web,2019-10-24,Closed with explanation,Yes,N/A,3416481
  10. 2019-11-20,"Credit reporting, credit repair services, or other personal consumer reports",Credit reporting,Incorrect information on your report,Account information incorrect,I would like the credit bureau to correct my XXXX XXXX XXXX XXXX balance. My correct balance is XXXX,Company has responded to the consumer and the CFPB and chooses not to provide a public response,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",TX,77004,,Consent provided,Web,2019-11-20,Closed with explanation,Yes,N/A,3444592
  11. Each line of the input file, except for the first-line header, represents one complaint. Consult the Consumer Finance Protection Bureau's technical documentation for a description of each field.
  12. Notice that complaints were not listed in chronological order
  13. In 2019, there was a complaint against TRANSWORLD SYSTEMS INC for Debt collection
  14. Also in 2019, Experian Information Solutions Inc. received one complaint for Credit reporting, credit repair services, or other personal consumer reports while TRANSUNION INTERMEDIATE HOLDINGS, INC. received two
  15. In 2020, Experian Information Solutions Inc. received a complaint for Credit reporting, credit repair services, or other personal consumer reports
  16. In summary that means
  17. In 2019, there was one complaint for Debt collection, and 100% of it went to one company
  18. Also in 2019, three complaints against two companies were received for Credit reporting, credit repair services, or other personal consumer reports and 2/3rd of them (or 67% if we rounded the percentage to the nearest whole number) were against one company (TRANSUNION INTERMEDIATE HOLDINGS, INC.)
  19. In 2020, only one complaint was received for Credit reporting, credit repair services, or other personal consumer reports, and so the highest percentage received by one company would be 100%
  20. For this challenge, we want for each product and year that complaints were received, the total number of complaints, number of companies receiving a complaint and the highest percentage of complaints directed at a single company.
  21. For the purposes of this challenge, all names, including company and product, should be treated as case insensitive. For example, "Acme", "ACME", and "acme" would represent the same company.
  22.  
  23. Expected output
  24. After reading and processing the input file, your code should create an output file, report.csv, with as many lines as unique pairs of product and year (of Date received) in the input file.
  25. Each line in the output file should list the following fields in the following order:
  26. product (name should be written in all lowercase)
  27. year
  28. total number of complaints received for that product and year
  29. total number of companies receiving at least one complaint for that product and year
  30. highest percentage (rounded to the nearest whole number) of total complaints filed against one company for that product and year. Use standard rounding conventions (i.e., Any percentage between 0.5% and 1%, inclusive, should round to 1% and anything less than 0.5% should round to 0%)
  31. The lines in the output file should be sorted by product (alphabetically) and year (ascending)
  32. Given the above complaints.csv input file, we'd expect an output file, report.csv, in the following format
  33. "credit reporting, credit repair services, or other personal consumer reports",2019,3,2,67
  34. "credit reporting, credit repair services, or other personal consumer reports",2020,1,1,100
  35. debt collection,2019,1,1,100
  36. Notice that because debt collection was only listed for 2019 and not 2020, the output file only has a single entry for debt collection. Also, notice that when a product has a comma (,) in the name, the name should be enclosed by double quotation marks ("). Finally, notice that percentages are listed as numbers and do not have % in the
  37.  
  38. codes in python
  39.  
  40. from pyspark.sql.functions import lower, col
  41. from pyspark.sql.functions import col, unix_timestamp, to_date, year, count, avg
  42. report = spark.read.load(cons_report, format = 'csv', header = True, inferschema=True)
  43.  
  44. report = report.withColumn('Date received',
  45.                   to_date(unix_timestamp(col('Date received'), 'MM/dd/yyyy').cast("timestamp")))
  46.            
  47. report = report.select(lower(col('Product')).alias('Product'),
  48.                       year('Date received').alias('Year'))
  49.                      
  50. report = report.na.drop()
  51. report = report.groupBy('Product', 'Year').count()
  52.  
  53. --------------------+----+-----+
  54. |             Product|Year|count|
  55. +--------------------+----+-----+
  56. |credit reporting,...|2019| 3114|
  57. |         credit card|2016|    4|
  58. |checking or savin...|2020|    3|
  59. |money transfer, v...|2019|   87|
  60. |            mortgage|2018|   39|
  61. |credit card or pr...|2019|  437|
  62. |       consumer loan|2015|    1|
  63. |     debt collection|2017|   13|
  64. |            mortgage|2019|  415|
  65. |        student loan|2019|  157|
  66. |payday loan, titl...|2020|    1|
  67. |     debt collection|2015|    4|
  68. |checking or savin...|2019|  461|
  69. |checking or savin...|2017|
RAW Paste Data

Adblocker detected! Please consider disabling it...

We've detected AdBlock Plus or some other adblocking software preventing Pastebin.com from fully loading.

We don't have any obnoxious sound, or popup ads, we actively block these annoying types of ads!

Please add Pastebin.com to your ad blocker whitelist or disable your adblocking software.

×