Table A:

ID  Day         Name     Description
1   2016-09-01  Sam      Retail
2   2016-01-28  Chris    Retail
3   2016-02-06  ChrisTY  Retail
4   2016-02-26  Christa  Retail
3   2016-12-06  ChrisTu  Retail
4   2016-12-31  Christi  Retail

Table B:

ID  SKey
1   1.1
2   1.2
3   1.3
The following query works but takes a long time, since there are around 60 columns (the sample shows just 3). I also didn't join Table C because I wasn't sure how to join it without producing a Cartesian product. Performance isn't good, and I'm not sure how to optimise the query.
from pyspark.sql import SparkSession

# Hive-enabled session (replaces the deprecated HiveContext)
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def UDF_df(i):
    print(i[0])
    # Pull one day's worth of rows from A
    ABC2 = spark.sql("select * from A where day = '{0}'".format(i[0]))
    # Join to B on ID and keep the needed columns
    joined = ABC2.join(Tab2, ABC2.ID == Tab2.ID)
    joined.select(Tab2.SKey, ABC2.Day, ABC2.Name, ABC2.Description) \
          .write \
          .mode("append") \
          .insertInto("Table")

ABC = spark.sql("select distinct day from A "
                "where day >= '2016-01-01' and day <= '2016-12-31'")
Tab2 = spark.sql("select * from B where day is not null")
for i in ABC.collect():
    UDF_df(i)
The expected output should be PySpark code where A joins B and every month is processed: the value of i should automatically increment from 1 to 12 along with the corresponding month dates. A should join B on ID, the output should include ID along with the other columns of A, and performance should be good.
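One way to get the month number i and its date range without hard-coding twelve filters is to generate the (month, first day, last day) triples up front and loop over them. This is a minimal plain-Python sketch of just the date generation; `month_ranges` is a hypothetical helper name, and the Spark SQL call that would consume each range is left out:

```python
import calendar
from datetime import date

def month_ranges(year):
    """Yield (month_number, first_day, last_day) for each month of the year."""
    for i in range(1, 13):
        # monthrange returns (weekday of day 1, number of days in the month)
        last_day = calendar.monthrange(year, i)[1]
        yield i, date(year, i, 1), date(year, i, last_day)

# Each triple can then drive one query, e.g. filtering A on
# day >= first_day and day <= last_day for that month.
ranges = list(month_ranges(2016))
```

`calendar.monthrange` handles month lengths and leap years automatically, so February 2016 correctly ends on the 29th.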