Untitled

a guest
Feb 15th, 2019
val partitions = 5 // number of output files; depends on data size and volume, so tune it per case
val df = spark.read.parquet("URI://path/to/parquet/files/") // the source files are Parquet, so read them with the Parquet reader
df.createOrReplaceTempView("df")
val df_output = spark
  .sql("SELECT DISTINCT * FROM df") // removes duplicate rows; if that's not needed, use "SELECT * FROM df" instead
  .coalesce(partitions) // merge down to at most `partitions` partitions without a full shuffle
df_output.write.parquet("URI://path/to/destination")