Table A:

ID  Day         Name     Description
1   2016-09-01  Sam      Retail
2   2016-01-28  Chris    Retail
3   2016-02-06  ChrisTY  Retail
4   2016-02-26  Christa  Retail
3   2016-12-06  ChrisTu  Retail
4   2016-12-31  Christi  Retail

Table B:

ID  SKey
1   1.1
2   1.2
3   1.3
The following query works but takes a long time, since there are around 60 columns (the sample shows just 3). I also didn't join Table C because I wasn't sure how to join it without producing a Cartesian product. Performance isn't good, and I'm not sure how to optimise the query.
from pyspark.sql import SparkSession

# Hive-enabled session (replaces the deprecated HiveContext)
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def UDF_df(i):
    print(i[0])
    # Pull one day's worth of rows from A
    ABC2 = spark.sql("select * from A where day = '{0}'".format(i[0]))
    # Join to B on ID and keep the needed columns
    joined = ABC2.join(Tab2, ABC2.ID == Tab2.ID)
    joined.select(Tab2.SKey, ABC2.Day, ABC2.Name, ABC2.Description) \
          .write \
          .mode("append") \
          .insertInto("Table")

ABC = spark.sql("select distinct day from A "
                "where day >= '2016-01-01' and day <= '2016-12-31'")
Tab2 = spark.sql("select * from B where day is not null")
for i in ABC.collect():
    UDF_df(i)
The expected output should be PySpark code where A joins B and every month is processed: the value of i should automatically increment from 1 to 12 along with the corresponding month dates. A should join B on ID, the output should include ID along with the other columns of A, and performance should be good.
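One way to get the month number i and its date range without hard-coding twelve filters is to generate the (month, first day, last day) triples up front and loop over them. This is a minimal plain-Python sketch of just the date generation; `month_ranges` is a hypothetical helper name, and the Spark SQL call that would consume each range is left out:

```python
import calendar
from datetime import date

def month_ranges(year):
    """Yield (month_number, first_day, last_day) for each month of the year."""
    for i in range(1, 13):
        # monthrange returns (weekday of day 1, number of days in the month)
        last_day = calendar.monthrange(year, i)[1]
        yield i, date(year, i, 1), date(year, i, last_day)

# Each triple can then drive one query, e.g. filtering A on
# day >= first_day and day <= last_day for that month.
ranges = list(month_ranges(2016))
```

`calendar.monthrange` handles month lengths and leap years automatically, so February 2016 correctly ends on the 29th.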