- Type "help", "copyright", "credits" or "license" for more information.
- 19/09/26 13:35:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
- Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
- Setting default log level to "WARN".
- To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
- Welcome to
-       ____              __
-      / __/__  ___ _____/ /__
-     _\ \/ _ \/ _ `/ __/  '_/
-    /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
-       /_/
- Using Python version 3.7.4 (default, Sep 7 2019 18:27:02)
- SparkSession available as 'spark'.
- >>> from pyspark.sql.functions import col, lit, sum as _sum, when
- >>> valid = ['Messi', 'Ronaldo', 'Virgil']
- >>> from pyspark.sql import Row
- >>> my_cols = Row("Column1", "Column2", "Column3", "Column4")
- >>> my_cols
- <Row(Column1, Column2, Column3, Column4)>
- >>> row_1 = my_cols('Ronaldo', 'Salah', 'Messi', None)
- >>> row_1
- Row(Column1='Ronaldo', Column2='Salah', Column3='Messi', Column4=None)
- >>> row_2 = my_cols('Ronaldo', 'Messi', 'Virgil', 'Messi')
- >>> row_3 = my_cols('Ronaldo', 'Ronaldo', 'Messi', 'Ronaldo')
- >>> row_seq = [row_1, row_2, row_3]
- >>> df = spark.createDataFrame(row_seq)
- >>> df
- DataFrame[Column1: string, Column2: string, Column3: string, Column4: string]
- >>> display(df)
- Traceback (most recent call last):
-   File "<stdin>", line 1, in <module>
- NameError: name 'display' is not defined
- >>> from pyspark.sql import display
- Traceback (most recent call last):
-   File "<stdin>", line 1, in <module>
- ImportError: cannot import name 'display' from 'pyspark.sql' (/usr/local/lib/python3.7/site-packages/pyspark/sql/__init__.py)
- >>> valid = ['Messi', 'Ronaldo', 'Virgil']
- >>> invalid_counts = df.select( *[_sum(when(col(c).isin(valid), lit(0)).otherwise(lit(1))).alias(c) for c in df.columns] ).collect()
- >>> print(invalid_counts)
- [Row(Column1=0, Column2=1, Column3=0, Column4=1)]
- >>> valid_columns = [k for k,v in invalid_counts[0].asDict().items() if v == 0]
- >>> print(valid_columns)
- ['Column1', 'Column3']
- >>> valid_columns = sorted(valid_columns, key=df.columns.index)
- >>> valid_columns
- ['Column1', 'Column3']
- >>> df.select(valid_columns).show()
- +-------+-------+
- |Column1|Column3|
- +-------+-------+
- |Ronaldo|  Messi|
- |Ronaldo| Virgil|
- |Ronaldo|  Messi|
- +-------+-------+
- >>>
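
The session above keeps only the columns whose every value is in `valid`. As a rough cross-check, the same logic can be sketched in plain Python with no Spark dependency. The helper name `columns_with_only_valid_values` and the dict-based `rows` are assumptions made for illustration, not part of PySpark. `None` is counted as invalid, which is how the `when(...).otherwise(lit(1))` expression treats the null in Column4.

```python
# Plain-Python sketch of the column filter from the session (no Spark needed).
# `columns_with_only_valid_values` is a hypothetical helper, not a PySpark API.

def columns_with_only_valid_values(rows, columns, valid):
    """Return (columns with zero invalid values, per-column invalid counts)."""
    valid_set = set(valid)
    # Count entries per column that are not in the valid set.
    # None is never in valid_set, so it counts as invalid, mirroring
    # when(col(c).isin(valid), lit(0)).otherwise(lit(1)) on a null value.
    invalid_counts = {
        c: sum(0 if row[c] in valid_set else 1 for row in rows)
        for c in columns
    }
    # Iterating over `columns` preserves the original column order,
    # so no separate sort step is needed here.
    kept = [c for c in columns if invalid_counts[c] == 0]
    return kept, invalid_counts


columns = ["Column1", "Column2", "Column3", "Column4"]
valid = ["Messi", "Ronaldo", "Virgil"]

# Same data as row_1, row_2, row_3 in the session.
rows = [
    dict(zip(columns, ("Ronaldo", "Salah", "Messi", None))),
    dict(zip(columns, ("Ronaldo", "Messi", "Virgil", "Messi"))),
    dict(zip(columns, ("Ronaldo", "Ronaldo", "Messi", "Ronaldo"))),
]

valid_columns, invalid_counts = columns_with_only_valid_values(rows, columns, valid)
print(invalid_counts)  # {'Column1': 0, 'Column2': 1, 'Column3': 0, 'Column4': 1}
print(valid_columns)   # ['Column1', 'Column3']
```

The counts agree with the `invalid_counts` row printed in the session, and the surviving columns match the final `df.select(valid_columns).show()` output.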