Table A
ID  Day         Name     Description
1   2016-09-01  Sam      Retail
2   2016-01-28  Chris    Retail
3   2016-02-06  ChrisTY  Retail
4   2016-02-26  Christa  Retail
3   2016-12-06  ChrisTu  Retail
4   2016-12-31  Christi  Retail

Table B
ID  SKey
1   1.1
2   1.2
3   1.3
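
To make the example reproducible, here is a minimal sketch that registers the two sample tables as temp views (the view names A and B match the queries below; a SparkSession named spark is assumed, and the types are illustrative only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Sample rows from Table A above (only 3 of the ~60 real columns)
a_rows = [(1, "2016-09-01", "Sam", "Retail"),
          (2, "2016-01-28", "Chris", "Retail"),
          (3, "2016-02-06", "ChrisTY", "Retail"),
          (4, "2016-02-26", "Christa", "Retail"),
          (3, "2016-12-06", "ChrisTu", "Retail"),
          (4, "2016-12-31", "Christi", "Retail")]
spark.createDataFrame(a_rows, ["ID", "Day", "Name", "Description"]) \
     .createOrReplaceTempView("A")

# Sample rows from Table B above
b_rows = [(1, 1.1), (2, 1.2), (3, 1.3)]
spark.createDataFrame(b_rows, ["ID", "SKey"]).createOrReplaceTempView("B")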
The following query works but is very slow: the real tables have around 60 columns (only 3 are shown in the sample), and processing 20 days of partitions takes about 1 hour. Can you figure out what is wrong and optimise the query?
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)

def UDF_df(i):
    print(i[0])
    # Pull a single day's partition from A
    ABC2 = spark.sql("select * from A where day = '{0}'".format(i[0]))
    # Join that day's slice with B on ID, keeping B's key plus A's columns
    Join = ABC2.join(Tab2, ABC2.ID == Tab2.ID) \
               .select(Tab2.SKey, ABC2.Day, ABC2.Name, ABC2.Description)
    Join.write \
        .mode("append") \
        .format("parquet") \
        .insertInto("Table")

ABC = spark.sql("select distinct day from A where day >= '2016-01-01' and day <= '2016-12-31'")
Tab2 = spark.sql("select * from B where day is not null")
# One join-and-insert job per distinct day
for i in ABC.collect():
    UDF_df(i)
Above is the PySpark code for one month of data that I ran just to test the total time: A is joined with B on ID, and the output has the ID along with the other columns of A. It takes 1 hour to complete. Is there a better way of optimising the query, taking either 1 month or 1 year of data at once? Note also that the output table is partitioned on 2 columns into which the data is inserted, which is why Hive contexts are used.
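
One direction worth trying (a sketch under assumptions, not a tested fix): drop the per-day loop and run a single set-based join over the whole date range, letting Hive's dynamic partitioning route rows into the output partitions in one write. Table, column, and output names below are taken from the sample tables and the original code; the broadcast hint assumes B is small enough to fit in executor memory.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# Allow dynamic partition inserts into the partitioned output table
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

# Read the whole range once; partition pruning on `day` still applies
A = spark.sql("select * from A where day >= '2016-01-01' and day <= '2016-12-31'")
B = spark.sql("select ID, SKey from B")

# One join for the whole range; broadcasting B avoids a shuffle if B is small
out = (A.join(F.broadcast(B), "ID")
        .select("SKey", "Day", "Name", "Description"))

# Single append; column order must match the output table's schema
out.write.mode("append").insertInto("Table")

This replaces one Spark job per day with a single job, which removes the per-job scheduling, query-planning, and small-file write overhead that usually dominates loops like the one above.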