- Install Spark and run master and slaves (workers) in standalone mode.
- ```
- brew install apache-spark
- /usr/local/Cellar/apache-spark/2.3.1/bin/spark-class org.apache.spark.deploy.master.Master
- /usr/local/Cellar/apache-spark/2.3.1/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<MASTER_IP>:7077 -c 1 -m 512M
- ```
- In PostgreSQL
- ```
- CREATE TABLE items(
- id VARCHAR (100),
- description VARCHAR (100)
- );
- ```
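The COPY step below expects the CSV file to carry a header row matching the table's columns. A minimal sketch that writes such a file with Python's `csv` module (the rows and the output filename are hypothetical):

```python
import csv

# Hypothetical sample rows matching the items(id, description) table.
rows = [
    {"id": "1", "description": "blue widget"},
    {"id": "2", "description": "red widget"},
]

with open("items.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "description"])
    writer.writeheader()  # the HEADER option in COPY skips this first row
    writer.writerows(rows)
```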
- Load CSV from file
- ```
- COPY items FROM '/Users/<pathTo>/items.csv' DELIMITER ',' CSV HEADER;
- ```
- Launch Spark Shell
- ```
- pyspark --conf spark.executor.extraClassPath=/Users/<pathTo>/postgresql-42.2.4.jar --driver-class-path /Users/<pathTo>/postgresql-42.2.4.jar --master spark://192.168.1.199:7077 --executor-memory 512m
- ```
- Connect to existing PostgreSQL
- ```
- df = spark.read \
- .format("jdbc") \
- .option("driver", "org.postgresql.Driver") \
- .option("url", "jdbc:postgresql:retailme") \
- .option("dbtable", "items") \
- .option("user", "<postgres_user>") \
- .option("password", "") \
- .load()
- df.count() # This triggers the query and returns the count; progress can be checked in the Spark UI
- # Left join: count records whose ids exist in df1 but not in df2
- left_join = df1.join(df2, df1.id == df2.id, how='left') # 'left_outer' is equivalent
- left_join.filter(df2.id.isNull()).count() # df2.id avoids the ambiguous column reference 'id'
- # Write data to tables
- mode = "overwrite"
- url = "jdbc:postgresql:retailme"
- properties = {"user": "<postgres_user>","password": "","driver": "org.postgresql.Driver"}
- df.write.jdbc(url=url, table="items", mode=mode, properties=properties)
- ```
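The left-join-plus-null-filter above counts the ids present in df1 but missing from df2. The same logic, sketched with plain Python lists so it runs without a Spark cluster (the id values are hypothetical):

```python
# Hypothetical id columns standing in for df1 and df2.
df1_ids = ["a", "b", "c", "d"]
df2_ids = ["b", "d"]

# A left join keeps every df1 row; where no df2 match exists, the joined
# df2.id is null -- filtering on that null yields the unmatched ids.
df2_set = set(df2_ids)
missing = [i for i in df1_ids if i not in df2_set]
print(len(missing))  # number of ids in df1 but not in df2 -> 2
```

Spark can also do this in a single step with an anti join: `df1.join(df2, df1.id == df2.id, how='left_anti')` keeps only the df1 rows with no match in df2.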