avisrivastava254084

Untitled

Sep 26th, 2019
Type "help", "copyright", "credits" or "license" for more information.
19/09/26 13:35:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 3.7.4 (default, Sep  7 2019 18:27:02)
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import col, lit, sum as _sum, when
>>> valid = ['Messi', 'Ronaldo', 'Virgil']
>>> from pyspark.sql import Row
>>> my_cols = Row("Column1", "Column2", "Column3", "Column4")
>>> my_cols
<Row(Column1, Column2, Column3, Column4)>
>>> row_1 = my_cols('Ronaldo', 'Salah', 'Messi', None)
>>> row_1
Row(Column1='Ronaldo', Column2='Salah', Column3='Messi', Column4=None)
>>> row_2 = my_cols('Ronaldo', 'Messi', 'Virgil', 'Messi')
>>> row_3 = my_cols('Ronaldo', 'Ronaldo', 'Messi', 'Ronaldo')
>>> row_seq = [row_1, row_2, row_3]
>>> df = spark.createDataFrame(row_seq)
>>> df
DataFrame[Column1: string, Column2: string, Column3: string, Column4: string]
>>> display(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'display' is not defined
>>> from pyspark.sql import display
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'display' from 'pyspark.sql' (/usr/local/lib/python3.7/site-packages/pyspark/sql/__init__.py)
>>> valid = ['Messi', 'Ronaldo', 'Virgil']
>>> invalid_counts = df.select( *[_sum(when(col(c).isin(valid), lit(0)).otherwise(lit(1))).alias(c) for c in df.columns] ).collect()
>>> print(invalid_counts)
[Row(Column1=0, Column2=1, Column3=0, Column4=1)]
>>> valid_columns = [k for k,v in invalid_counts[0].asDict().items() if v == 0]
>>> print(valid_columns)
['Column1', 'Column3']
>>> valid_columns = sorted(valid_columns, key=df.columns.index)
>>> valid_columns
['Column1', 'Column3']
>>> df.select(valid_columns).show()
+-------+-------+
|Column1|Column3|
+-------+-------+
|Ronaldo|  Messi|
|Ronaldo| Virgil|
|Ronaldo|  Messi|
+-------+-------+

>>>
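
The column filter in the session above can be hard to read as a one-liner, so here is a minimal plain-Python sketch of the same logic, with no Spark involved. It reuses the session's data and its `valid` list; the helper name `count_invalid` is hypothetical and not part of PySpark. It assumes, as the session's output for Column4 suggests, that a missing value (`None`) counts as invalid.

```python
# Plain-Python sketch of the per-column validity check (no Spark required).
# Data and `valid` come from the session above; `count_invalid` is a
# hypothetical helper introduced only for illustration.
valid = {"Messi", "Ronaldo", "Virgil"}
columns = ["Column1", "Column2", "Column3", "Column4"]
rows = [
    ("Ronaldo", "Salah", "Messi", None),
    ("Ronaldo", "Messi", "Virgil", "Messi"),
    ("Ronaldo", "Ronaldo", "Messi", "Ronaldo"),
]


def count_invalid(rows, columns, valid):
    """Return {column: number of values not in `valid`}; None counts as invalid."""
    counts = {name: 0 for name in columns}
    for row in rows:
        for name, value in zip(columns, row):
            if value not in valid:
                counts[name] += 1
    return counts


invalid_counts = count_invalid(rows, columns, valid)

# Keep only columns with zero invalid values, preserving the original order.
valid_columns = [name for name in columns if invalid_counts[name] == 0]

print(invalid_counts)
print(valid_columns)
```

Iterating over `columns` when building `valid_columns` keeps the original column order, which is the job the `sorted(valid_columns, key=df.columns.index)` step does in the session.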