Untitled

a guest
Jun 17th, 2019
ID  Day         Name     Description
1   2016-09-01  Sam      Retail
2   2016-01-28  Chris    Retail
3   2016-02-06  ChrisTY  Retail
4   2016-02-26  Christa  Retail
3   2016-12-06  ChrisTu  Retail
4   2016-12-31  Christi  Retail

ID  SkEY
1   1.1
2   1.2
3   1.3

The following query works but takes a long time, because the tables have
around 60 columns (the sample shows only 3). I also didn't join Table C
because I wasn't sure how to join it without causing a Cartesian join.
Performance isn't good, and I'm not sure how to optimise the query.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Session with Hive support, so that tables A, B and "Table" resolve
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def UDF_df(i):
    print(i[0])
    # Rows of A for a single day
    ABC2 = spark.sql("select * from A where day = '{0}'".format(i[0]))
    # Join to B on ID; keep the key from B plus the columns of A
    Join = (ABC2.join(Tab2, ABC2.ID == Tab2.ID)
            .select(Tab2.skey, ABC2.Day, ABC2.Name, ABC2.Description))
    # Append the joined rows to the target table
    Join.write.mode("append").format("parquet").insertInto("Table")

# Distinct days in A within 2016
ABC = spark.sql("select distinct day from A where day >= '2016-01-01' and day <= '2016-12-31'")
Tab2 = spark.sql("select * from B where day is not null")
for i in ABC.collect():
    UDF_df(i)

The expected output is PySpark code that joins A to B and processes every
month. The value of i should be incremented automatically from 1 to 12,
along with the matching month dates. A should be joined to B on ID, and
the output should contain ID along with the other columns of A.
Performance should also be good.
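
The month-by-month iteration described above could be sketched as follows. This is a minimal illustration in plain Python, not a tuned Spark job: the helper names month_ranges and month_query are hypothetical, and the table name A and the day column are taken from the paste. It only generates the counter from 1 to 12 together with each month's first and last date, and builds the filter string that could be passed to spark.sql.

```python
import calendar
from datetime import date


def month_ranges(year):
    """Yield (month, first_day, last_day) for months 1..12 of `year`."""
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month)
        last = calendar.monthrange(year, month)[1]
        yield month, date(year, month, 1), date(year, month, last)


def month_query(table, start, end):
    """Build a per-month filter on the `day` column (names assumed from the paste)."""
    return ("select * from {0} where day >= '{1}' and day <= '{2}'"
            .format(table, start.isoformat(), end.isoformat()))


# i runs from 1 to 12 together with the matching month boundaries
for i, start, end in month_ranges(2016):
    print(i, month_query("A", start, end))
```

Each generated string could replace the per-day query inside the loop, which would reduce the number of Spark jobs from one per distinct day to twelve. Whether that is fast enough depends on the data. A single join over the whole date range may avoid the loop altogether.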