Table A
ID  Day         Name     Description
1   2016-09-01  Sam      Retail
2   2016-01-28  Chris    Retail
3   2016-02-06  ChrisTY  Retail
4   2016-02-26  Christa  Retail
3   2016-12-06  ChrisTu  Retail
4   2016-12-31  Christi  Retail

Table B
ID  SKey
1   1.1
2   1.2
3   1.3
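
To make the example reproducible, here is a minimal sketch that registers the two sample tables as temp views (the view names A and B match the queries below; a SparkSession named spark is assumed, and the types are illustrative only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Sample rows from Table A above (only 3 of the ~60 real columns)
a_rows = [(1, "2016-09-01", "Sam", "Retail"),
          (2, "2016-01-28", "Chris", "Retail"),
          (3, "2016-02-06", "ChrisTY", "Retail"),
          (4, "2016-02-26", "Christa", "Retail"),
          (3, "2016-12-06", "ChrisTu", "Retail"),
          (4, "2016-12-31", "Christi", "Retail")]
spark.createDataFrame(a_rows, ["ID", "Day", "Name", "Description"]) \
     .createOrReplaceTempView("A")

# Sample rows from Table B above
b_rows = [(1, 1.1), (2, 1.2), (3, 1.3)]
spark.createDataFrame(b_rows, ["ID", "SKey"]).createOrReplaceTempView("B")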
The following query works but is very slow: the real tables have around 60 columns (only 3 are shown in the sample), and processing 20 days of partitions takes about 1 hour. Can you figure out what is wrong and optimise the query?
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)

def UDF_df(i):
    print(i[0])
    # Pull a single day's partition from A
    ABC2 = spark.sql("select * from A where day = '{0}'".format(i[0]))
    # Join that day's slice with B on ID, keeping B's key plus A's columns
    Join = ABC2.join(Tab2, ABC2.ID == Tab2.ID) \
               .select(Tab2.SKey, ABC2.Day, ABC2.Name, ABC2.Description)
    Join.write \
        .mode("append") \
        .format("parquet") \
        .insertInto("Table")

ABC = spark.sql("select distinct day from A where day >= '2016-01-01' and day <= '2016-12-31'")
Tab2 = spark.sql("select * from B where day is not null")
# One join-and-insert job per distinct day
for i in ABC.collect():
    UDF_df(i)
Above is the PySpark code for one month of data that I ran just to test the total time: A is joined with B on ID, and the output has the ID along with the other columns of A. It takes 1 hour to complete. Is there a better way of optimising the query, taking either 1 month or 1 year of data at once? Note also that the output table is partitioned on 2 columns into which the data is inserted, which is why Hive contexts are used.
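
One direction worth trying (a sketch under assumptions, not a tested fix): drop the per-day loop and run a single set-based join over the whole date range, letting Hive's dynamic partitioning route rows into the output partitions in one write. Table, column, and output names below are taken from the sample tables and the original code; the broadcast hint assumes B is small enough to fit in executor memory.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# Allow dynamic partition inserts into the partitioned output table
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

# Read the whole range once; partition pruning on `day` still applies
A = spark.sql("select * from A where day >= '2016-01-01' and day <= '2016-12-31'")
B = spark.sql("select ID, SKey from B")

# One join for the whole range; broadcasting B avoids a shuffle if B is small
out = (A.join(F.broadcast(B), "ID")
        .select("SKey", "Day", "Name", "Description"))

# Single append; column order must match the output table's schema
out.write.mode("append").insertInto("Table")

This replaces one Spark job per day with a single job, which removes the per-job scheduling, query-planning, and small-file write overhead that usually dominates loops like the one above.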