# Practice Problem:
> Using Sqoop, import the orders table into HDFS under /user/cloudera/problem1/orders. Use any file type and compression codec of your choice.

> Using Sqoop, import the order_items table into HDFS under /user/cloudera/problem1/order-items. Use any file type and compression codec of your choice.

> Using Spark Scala, load the data at /user/cloudera/problem1/orders and /user/cloudera/problem1/order-items as dataframes.

> Find the total orders and total amount per status per day. The result should be sorted by order date descending, order status ascending, total amount ascending, and total orders ascending.

> Use Spark SQL and the Spark dataframe API to solve this.

> Save the result as CSV and load the CSV back into the MySQL db, into a table named result-a.


# SOLUTION:

```
# Preview the source table before importing
sqoop eval \
--connect 'jdbc:mysql://localhost:3306/retail_db' \
--username root \
--password cloudera \
--query 'select * from orders'

sqoop import \
--connect 'jdbc:mysql://localhost:3306/retail_db' \
--username root \
--password cloudera \
--table orders \
--target-dir /user/cloudera/problem1/orders/ \
--as-avrodatafile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec

sqoop import \
--connect 'jdbc:mysql://localhost:3306/retail_db' \
--username root \
--password cloudera \
--table order_items \
--target-dir /user/cloudera/problem1/order-items \
--as-avrodatafile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec


sqlContext.setConf("spark.sql.shuffle.partitions", "4")
val orders = sqlContext.read.format("com.databricks.spark.avro").load("/user/cloudera/problem1/orders")
val order_items = sqlContext.read.format("com.databricks.spark.avro").load("/user/cloudera/problem1/order-items")

orders.registerTempTable("orders")
order_items.registerTempTable("order_items")

// Sort: order_date desc, order_status asc, total_amount asc, total_orders asc
val res = sqlContext.sql("""
select o.order_date, o.order_status, count(distinct(o.order_id)) total_orders, sum(oi.order_item_subtotal) total_amount
from orders o
join order_items oi on oi.order_item_order_id = o.order_id
group by o.order_date, o.order_status
order by o.order_date desc, o.order_status, total_amount, total_orders
""")
```
Sample output of `res.show()`:
```
+-------------+---------------+------------+------------------+
|   order_date|   order_status|total_orders|      total_amount|
+-------------+---------------+------------+------------------+
|1406185200000|       CANCELED|           2|1254.9200382232666|
|1406185200000|         CLOSED|          26|16333.160339355469|
|1406185200000|       COMPLETE|          55| 34552.03063583374|
|1406185200000|        ON_HOLD|           4|1709.7400207519531|
|1406185200000| PAYMENT_REVIEW|           1|499.95001220703125|
|1406185200000|        PENDING|          22|12729.490217208862|
|1406185200000|PENDING_PAYMENT|          34|17680.700359344482|
|1406185200000|     PROCESSING|          17| 9964.740190505981|
|1406185200000|SUSPECTED_FRAUD|           4|2351.6100215911865|
|1406098800000|       CANCELED|          10| 5777.330112457275|
|1406098800000|         CLOSED|          18|  13312.7202835083|
|1406098800000|       COMPLETE|          40|25482.510496139526|
|1406098800000|        ON_HOLD|           6| 4514.460060119629|
|1406098800000| PAYMENT_REVIEW|           2|1699.8200302124023|
|1406098800000|        PENDING|          11|   6161.3701171875|
|1406098800000|PENDING_PAYMENT|          30|19279.810424804688|
|1406098800000|     PROCESSING|          15| 7962.790130615234|
|1406098800000|SUSPECTED_FRAUD|           6|3799.5700721740723|
|1406012400000|       CANCELED|           4| 3209.730094909668|
|1406012400000|         CLOSED|          20| 12688.79024887085|
+-------------+---------------+------------+------------------+
```
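
The `order_date` values above are epoch milliseconds, because the Avro import stores MySQL datetimes as longs. If a human-readable date is wanted, a `from_unixtime` conversion can be applied. A minimal sketch, assuming it runs in the same spark-shell session against the temp tables registered above (illustrative variant, not part of the required answer):

```
// Format the epoch-millis order_date as yyyy-MM-dd before aggregating output
val resReadable = sqlContext.sql("""
select from_unixtime(o.order_date/1000, 'yyyy-MM-dd') order_date,
       o.order_status,
       count(distinct(o.order_id)) total_orders,
       sum(oi.order_item_subtotal) total_amount
from orders o
join order_items oi on oi.order_item_order_id = o.order_id
group by o.order_date, o.order_status
order by o.order_date desc, o.order_status
""")
resReadable.show()
```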

### Export data from HDFS to MySQL
```
scala> res.map(x => x(0) + "," + x(1) + "," + x(2) + "," + x(3)).saveAsTextFile("/user/cloudera/problem1/resulta-csv")

mysql> create table `result-a` (col1 varchar(30), col2 varchar(30), col3 varchar(30), col4 varchar(30));

sqoop export \
--connect 'jdbc:mysql://localhost:3306/retail_db' \
--username root \
--password cloudera \
--table result-a \
--export-dir /user/cloudera/problem1/resulta-csv \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n'
```
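
After the export, the table can be spot-checked from the gateway with `sqoop eval`, mirroring the eval command used at the start (a quick sanity check, not part of the required answer):

```
sqoop eval \
--connect 'jdbc:mysql://localhost:3306/retail_db' \
--username root \
--password cloudera \
--query 'select count(*) from `result-a`'
```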

### Using the dataframe API
```
// Sort: order_date desc, order_status asc, total_amount asc, total_orders asc
// Here the same tables are read as parquet from the Hive warehouse
import org.apache.spark.sql.functions.{countDistinct, sum, desc}

sqlContext.setConf("spark.sql.shuffle.partitions", "4")
val orders = sqlContext.read.parquet("/user/hive/warehouse/retail_db.db/orders/")
val order_items = sqlContext.read.parquet("/user/hive/warehouse/retail_db.db/order_items/")

orders.join(order_items, $"order_id" === $"order_item_order_id")
  .groupBy($"order_date", $"order_status")
  .agg(countDistinct($"order_id").alias("total_orders"), sum($"order_item_subtotal").alias("total_amount"))
  .sort(desc("order_date"), $"order_status", $"total_amount", $"total_orders")
  .show()
```
Sample output:
```
+-------------+---------------+------------+------------------+
|   order_date|   order_status|total_orders|      total_amount|
+-------------+---------------+------------+------------------+
|1406185200000|       CANCELED|           2|1254.9200382232666|
|1406185200000|         CLOSED|          26|16333.160339355469|
|1406185200000|       COMPLETE|          55| 34552.03063583374|
|1406185200000|        ON_HOLD|           4|1709.7400207519531|
|1406185200000| PAYMENT_REVIEW|           1|499.95001220703125|
|1406185200000|        PENDING|          22|12729.490217208862|
|1406185200000|PENDING_PAYMENT|          34|17680.700359344482|
|1406185200000|     PROCESSING|          17| 9964.740190505981|
|1406185200000|SUSPECTED_FRAUD|           4|2351.6100215911865|
|1406098800000|       CANCELED|          10| 5777.330112457275|
|1406098800000|         CLOSED|          18|  13312.7202835083|
|1406098800000|       COMPLETE|          40|25482.510496139526|
|1406098800000|        ON_HOLD|           6| 4514.460060119629|
|1406098800000| PAYMENT_REVIEW|           2|1699.8200302124023|
|1406098800000|        PENDING|          11|   6161.3701171875|
|1406098800000|PENDING_PAYMENT|          30|19279.810424804688|
|1406098800000|     PROCESSING|          15| 7962.790130615234|
|1406098800000|SUSPECTED_FRAUD|           6|3799.5700721740723|
|1406012400000|       CANCELED|           4| 3209.730094909668|
|1406012400000|         CLOSED|          20| 12688.79024887085|
+-------------+---------------+------------+------------------+
```
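
For completeness, the aggregated dataframe could also be written out as CSV through the dataframe API instead of `saveAsTextFile`, assuming the `com.databricks.spark.csv` package (same family as the spark-avro package used above) is available on the cluster. A sketch under that assumption, with a hypothetical output directory to avoid clashing with the one already exported:

```
// Alternative: write res directly as CSV via the spark-csv data source
res.write
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .save("/user/cloudera/problem1/resulta-csv-df")
```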