# Things that I need to be able to do

## Import from csv into memory
To create a dataframe from a csv, you can use the following:
```py
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv')
```
If you have a different delimiter than a comma, you can use the `sep` parameter:
```py
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', sep="\|", engine="python")
```
Notice the escape character before the delimiter: the `python` engine treats `sep` as a regular expression, and `|` is a regex special character. (With the default C engine, a plain single-character `sep="|"` works unescaped.)

With no headers, you need to add the following:
```py
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', sep="\|", engine="python", header=None)
```
That will give you something like the following:
```
                          0            1    2       3
0  2018-06-22T23:59:47.965Z  123-456-789  200  12.203
1  2018-06-22T23:60:47.965Z  132-456-789  200  14.203
2  2018-06-22T23:61:47.965Z  312-456-789  400  15.203
3  2018-06-22T23:62:47.965Z  231-456-789  200  11.203
```
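Since `header=None` leaves you with integer column labels, you can also pass `names=` to assign your own labels at read time. A minimal sketch, using `io.StringIO` as a stand-in for the csv file; the column names here are made up for illustration:

```python
import io
import pandas as pd

# Stand-in for the csv file; the "|" delimiter matches the examples above.
raw = io.StringIO(
    "2018-06-22T23:59:47.965Z|123-456-789|200|12.203\n"
    "2018-06-22T23:59:48.965Z|132-456-789|200|14.203\n"
)

# names= assigns labels up front instead of the default integer labels.
df = pd.read_csv(raw, sep="|", header=None,
                 names=["timestamp", "id", "status", "latency"])
print(df["latency"].min())
```

You can then refer to columns by name (`df["latency"]`) instead of by number.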

## Calculate the min for a column
Assuming that you know the column you want to take the min of, you'll use the column number as an index into the dataframe. For example, indexing alone gives you all the column values:
```py
In [13]: df[3]
Out[13]:
0    12.203
1    14.203
2    15.203
3    11.203
Name: 3, dtype: float64
```
You can then take the `df[i]` expression and chain the `.min()` method onto it to get the min value:
```py
In [11]: df[3].min()
Out[11]: 11.203
```

## Calculate the max for a column
Same as with `.min()`, only we're switching in the `.max()` method:
```py
In [12]: df[3].max()
Out[12]: 15.203
```

## Calculate the average for a column
Second grade fun fact: mean = average. Evidently I lost that knowledge over the years.

Same as min and max:
```py
In [19]: df[3].mean()
Out[19]: 13.203
```
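The three reductions above can be sanity-checked on a small hand-built Series holding the latency values from the sample frame:

```python
import pandas as pd

# The column-3 values from the sample dataframe above.
s = pd.Series([12.203, 14.203, 15.203, 11.203])

print(s.min())   # smallest value
print(s.max())   # largest value
print(s.mean())  # arithmetic mean of the four values
```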

## Fun side note... use `describe()`
You can use the `describe()` method on a dataframe to get a whole bunch of info on all of your numeric columns super quickly. See the following:
```py
In [18]: df.describe()
Out[18]:
           2          3
count    4.0   4.000000
mean   250.0  13.203000
std    100.0   1.825742
min    200.0  11.203000
25%    200.0  11.953000
50%    200.0  13.203000
75%    250.0  14.453000
max    400.0  15.203000
```

## Calculate the 95th percentile of a column
You can calculate percentiles by using the `quantile()` method. This takes a fractional value and returns that percentile. For example, to get the 95th, you'd use `.95`:
```py
In [28]: df[3].quantile(.95)
Out[28]: 15.052999999999999
```

## Calculate the 99th percentile of a column
Same as the previous, only using `.99`:
```py
In [26]: df[3].quantile(.99)
Out[26]: 15.173
```
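`quantile()` also accepts a list of fractions, which is handy when you want several percentiles in one call. A small sketch on the same latency values:

```python
import pandas as pd

s = pd.Series([12.203, 14.203, 15.203, 11.203])

# A list of fractions returns a Series indexed by those fractions.
q = s.quantile([0.95, 0.99])
print(q[0.95])  # same value as the single-quantile calls above
print(q[0.99])
```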

## Filter by column value
You can filter down columns by doing conditional evaluations, and even combine them to do more complex queries:
```py
In [40]: filtered_data = df[df[0] > '2018-06-22T23:60:47.965Z']

In [41]: filtered_data
Out[41]:
                          0            1    2       3
2  2018-06-22T23:61:47.965Z  312-456-789  400  15.203
3  2018-06-22T23:62:47.965Z  231-456-789  200  11.203

In [42]: filtered_data = df[(df[0] > '2018-06-22T23:60:47.965Z') & (df[2] != 200)]

In [43]: filtered_data
Out[43]:
                          0            1    2       3
2  2018-06-22T23:61:47.965Z  312-456-789  400  15.203
```
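One note on the syntax: each comparison produces a boolean Series, and `&` combines them element-wise, so the parentheses are required (`&` binds tighter than `>` and `!=`). A self-contained sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    "status": [200, 200, 400, 200],
    "latency": [12.203, 14.203, 15.203, 11.203],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# Without the parentheses, Python would try 12.0 & df["status"] first.
mask = (df["latency"] > 12.0) & (df["status"] != 200)
filtered = df[mask]
print(filtered)
```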

## Turn a column into a list
Surprise, there's another built-in method to handle this too. Using the `tolist()` method, you can turn a specific column into a list:
```py
In [30]: df[3].tolist()
Out[30]: [12.203, 14.203, 15.203, 11.203]
```

## Compare csv lengths (whatever that term is in pandas)
Another method, `count()`, is your friend:
```py
In [32]: df[3].count()
Out[32]: 4
```
However, this is potentially dangerous, as it only counts rows with non-NaN values present. In that case, you can use `shape[0]` instead:
```py
In [36]: df.shape[0]
Out[36]: 4
```
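The difference only shows up once NaNs are involved. A quick sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([12.203, np.nan, 15.203, 11.203])

print(s.count())   # NaN entries are skipped, so this gives 3
print(s.shape[0])  # total length, NaN included: 4
print(len(s))      # equivalent to shape[0] for a Series: 4
```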

## Handling dates
When importing the csv, pass the `parse_dates` parameter set to the list of columns that contain dates:
```py
In [37]: df = pd.read_csv('t1.csv', sep="\|", engine="python", header=None, parse_dates=[0])
```
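A quick way to confirm the parse worked is to check the column's dtype. A sketch using `io.StringIO` in place of the file:

```python
import io
import pandas as pd

raw = io.StringIO(
    "2018-06-22T23:59:47.965Z|123-456-789|200|12.203\n"
    "2018-06-23T00:00:47.965Z|132-456-789|200|14.203\n"
)

# parse_dates=[0] converts column 0 into a datetime64 column at read time;
# without it, the timestamps would stay as plain object (string) values.
df = pd.read_csv(raw, sep="|", header=None, parse_dates=[0])
print(df[0].dtype)
```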

## Merge two csv by value in column
Can be done using the `merge()` method. The two tables then produce a new dataframe joined on the key that you provided:
```py
In [46]: df = pd.merge(df1, df2, on=[1], how='left', indicator='Exist')

In [47]: df
Out[47]:
                        0_x            1  2_x     3_x                       0_y  2_y     3_y Exist
0  2018-06-22T23:59:47.965Z  123-456-789  200  12.203  2018-06-23T23:59:47.965Z  200  12.303  both
1  2018-06-22T23:60:47.965Z  132-456-789  200  14.203  2018-06-23T23:60:47.965Z  200  14.303  both
2  2018-06-22T23:61:47.965Z  312-456-789  400  15.203  2018-06-23T23:61:47.965Z  400  15.303  both
3  2018-06-22T23:62:47.965Z  231-456-789  200  11.203  2018-06-23T23:62:47.965Z  200  11.303  both
```
If the key can't be found, then a NaN value will appear in the merged version:
```py
In [49]: df = pd.merge(df1, df2, on=[1], how='left', indicator='Exist')
In [56]: df['Exist'] = np.where(df.Exist == 'both', True, False)

In [50]: df
Out[50]:
                        0_x            1  2_x     3_x                       0_y    2_y     3_y  Exist
0  2018-06-22T23:59:47.965Z  123-456-789  200  12.203  2018-06-23T23:59:47.965Z  200.0  12.303   True
1  2018-06-22T23:60:47.965Z  132-456-789  200  14.203  2018-06-23T23:60:47.965Z  200.0  14.303   True
2  2018-06-22T23:61:47.965Z  312-456-789  400  15.203  2018-06-23T23:61:47.965Z  400.0  15.303   True
3  2018-06-22T23:62:47.965Z  231-456-789  200  11.203                       NaN    NaN     NaN  False
```
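A self-contained sketch of the same pattern, with made-up keys so the frames are easy to see; the `== 'both'` comparison does the same job as the `np.where` call above:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["a", "b"], "right_val": [10, 20]})

# how='left' keeps every df1 row; indicator= adds a provenance column
# whose values are 'both' or 'left_only' for a left join.
merged = pd.merge(df1, df2, on="key", how="left", indicator="Exist")

# Collapse the indicator to True/False: was the key found in df2?
merged["Exist"] = merged["Exist"] == "both"
print(merged)
```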

## Check to see if all values are present
Can either count the number of `NaN`s present in the table, or filter on the created `Exist` column that we've overwritten to be `True` or `False` depending on whether the record exists:
```py
In [53]: df.isnull().sum().sum()
Out[53]: 3

In [54]: df.isnull().sum()
Out[54]:
0_x      0
1        0
2_x      0
3_x      0
0_y      1
2_y      1
3_y      1
Exist    0
dtype: int64

In [55]: df['0_y'].isnull().sum()
Out[55]: 1

In [58]: df[df['Exist'] == False]
Out[58]:
                        0_x            1  2_x     3_x  0_y  2_y  3_y  Exist
3  2018-06-22T23:62:47.965Z  231-456-789  200  11.203  NaN  NaN  NaN  False
```
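The same checks work on any frame. A tiny sketch with one missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

print(df.isnull().sum())        # per-column NaN counts
print(df.isnull().sum().sum())  # grand total of NaNs in the frame
```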

## Generate graphs