Advertisement
AliaksandrLet

Спринт 7/12 → Тема 4/8: PySpark для инженера данных → Урок 9/13 → Задание 3

Sep 6th, 2023 (edited)
480
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
SPARK 0.68 KB | None | 0 0
  1. import pyspark
  2. from pyspark.sql import SparkSession
  3. from pyspark.sql.window import Window
  4. import pyspark.sql.functions as F
  5.  
  6. spark = SparkSession.builder \
  7.                     .master("yarn") \
  8.                     .appName("Learning DataFrames") \
  9.                     .getOrCreate()
  10.  
  11. events = spark.read.json("/user/master/data/events/date=2022-05-01")
  12.  
  13. window = Window().partitionBy('event.message_from').orderBy('event.datetime')
  14.  
  15. dfWithLag = events.withColumn("lag_7", F.lag("event.message_to", 7).over(window))
  16.  
  17. dfWithLag.select("event.message_from", "lag_7") \
  18.     .filter(dfWithLag.lag_7.isNotNull()) \
  19.     .orderBy(F.col("event.message_from").desc()) \
  20.     .show(10, False)
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement