Note: this is a fork of @Conormacd converted to markdown from raw text. All content is Conor's.

# R to Python: useful data wrangling snippets

The dplyr package in R makes data wrangling significantly easier.
The beauty of dplyr is that, by design, the options available are limited.
Specifically, a set of key verbs forms the core of the package.
Using these verbs you can solve a wide range of data problems effectively and in a shorter timeframe.
Whilst transitioning to Python I have greatly missed the ease with which I can think through and solve problems using dplyr in R.
The purpose of this document is to demonstrate how to execute the key dplyr verbs when manipulating data using Python (with the pandas package).

dplyr is organised around six key verbs:

- `filter`: subset a dataframe according to condition(s) in a variable(s)
- `select`: choose a specific variable or set of variables
- `arrange`: order dataframe by index or variable
- `group_by`: create a grouped dataframe
- `summarise`: reduce variable to summary variable (e.g. `mean`)
- `mutate`: transform dataframe by adding new variables

The excellent pandas package in Python easily allows you to implement all of these actions (and much, much more!). Below are some snippets to highlight some of the more basic conversions.

I'll update this on a regular basis with more complex snippets.
Thanks!
Conor @Conormacd

# Function Equivalents

## Filter
### R
```
filter(df, var > 20000 & var < 30000)
filter(df, var == 'string') # or with the pipe: df %>% filter(var == 'string')
df %>% filter(var != 'string')
df %>% group_by(group) %>% filter(sum(var) > 2000000)
```

### Python
```
df[(df['var'] > 20000) & (df['var'] < 30000)]
df[df['var'] == 'string']
df[df['var'] != 'string']
df.groupby('group').filter(lambda x: sum(x['var']) > 2000000)
```
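
The grouped filter above keeps or drops whole groups at once, which can be surprising at first. A minimal, self-contained sketch (column names and numbers are made up for illustration):

```
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'],
                   'var': [1_500_000, 700_000, 900_000]})

# group 'a' sums to 2,200,000 and is kept in full; group 'b' (900,000) is dropped
df.groupby('group').filter(lambda x: x['var'].sum() > 2_000_000)
```
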
## Select
### R
```
select(df, var1, var2)
select(df, -var3)
```

### Python
```
df[['var1', 'var2']]
df.drop(columns='var3') # same as df.drop('var3', axis=1); avoids the deprecated positional axis argument
```

## Arrange
### R
```
arrange(df, var1)
arrange(df, desc(var1))
```

### Python
```
df.sort_values('var1')
df.sort_values('var1', ascending=False)
```
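
dplyr also lets you mix sort directions, e.g. `arrange(df, var1, desc(var2))`; in pandas you pass a list to `ascending`. A minimal sketch with placeholder column names:

```
import pandas as pd

df = pd.DataFrame({'var1': [1, 1, 2], 'var2': [5, 9, 7]})

# arrange(df, var1, desc(var2)) in dplyr terms
df.sort_values(['var1', 'var2'], ascending=[True, False])
```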

## Grouping
### R
```
df %>% group_by(group)
df %>% group_by(group1, group2)
df %>% ungroup()
```

### Python
```
df.groupby('group1')
df.groupby(['group1', 'group2'])
df.reset_index() # to ungroup an aggregated result; or keep the group as a column while grouping: df.groupby('group1', as_index=False)
```
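
To illustrate the two ungrouping routes above, a minimal sketch with a made-up dataframe (`group1` and `var1` are placeholder names):

```
import pandas as pd

df = pd.DataFrame({'group1': ['a', 'a', 'b'], 'var1': [1, 2, 3]})

# aggregate with the group as the index, then flatten it back into a column
df.groupby('group1')['var1'].sum().reset_index()

# equivalent in one step: keep the group as an ordinary column while grouping
df.groupby('group1', as_index=False)['var1'].sum()
```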

## Summarise / Aggregate df by group
### R
```
df %>% group_by(group) %>% summarise(mean_var1 = mean(var1))
df %>% group_by(group1, group2) %>% summarise(mean_var1 = mean(var1),
                                              sum_var1 = sum(var1),
                                              count_var1 = n())
```

### Python
```
df.groupby('group1')['var1'].agg(mean_col='mean') # named aggregation to specify the output column name (the older dict-renaming syntax is no longer supported)

df.groupby(['group1', 'group2'])['var1'].agg(['mean', 'sum', 'count']) # for count also consider 'size': 'size' counts rows including NaN values, whereas 'count' does not
```
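
To make the `'size'` vs `'count'` distinction concrete, a small sketch with made-up data:

```
import numpy as np
import pandas as pd

df = pd.DataFrame({'group1': ['a', 'a', 'b'],
                   'var1': [1.0, np.nan, 3.0]})

df.groupby('group1')['var1'].agg(['size', 'count'])
#         size  count
# group1
# a          2      1   <- 'size' counts the NaN row, 'count' does not
# b          1      1
```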

## Mutate / transform df by group
### R
```
df %>% group_by(group) %>% mutate(mean_var1 = mean(var1))
```

### Python
```
df.groupby('group')['var1'].transform('mean') # the string alias avoids needing numpy; transform(np.mean) is equivalent
```
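
`transform` returns a Series aligned to the original rows rather than adding a column, so to mirror `mutate` you would typically assign the result back. A minimal sketch with placeholder names:

```
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'], 'var1': [1, 2, 3]})

# attach the group mean as a new column, as dplyr's mutate would
df['mean_var1'] = df.groupby('group')['var1'].transform('mean')

# non-destructive alternative, closer in spirit to a dplyr pipeline
df2 = df.assign(mean_var1=df.groupby('group')['var1'].transform('mean'))
```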

## Distinct
Remove duplicate observations (rows) from a data frame.
### R
```
df %>% distinct()
df %>% distinct(col1) # returns a dataframe with the unique values of col1
```

### Python
```
df.drop_duplicates()
df.drop_duplicates(subset='col1') # dedupes on col1 but keeps all columns (like distinct(col1, .keep_all = TRUE) in dplyr)
```

## Sample
Generate random samples of the data, by number of rows or by fraction.
### R
```
sample_n(df, 100)
sample_frac(df, 0.5)
```

### Python
```
df.sample(100)
df.sample(frac=0.5)
```
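
Finally, several of these verbs chain together much like a dplyr pipeline. A minimal sketch (made-up column names) of `df %>% filter(var1 > 10) %>% group_by(group1) %>% summarise(mean_var2 = mean(var2)) %>% arrange(desc(mean_var2))` in pandas:

```
import pandas as pd

df = pd.DataFrame({'group1': ['a', 'a', 'b', 'b'],
                   'var1': [10, 20, 30, 40],
                   'var2': [1, 2, 3, 4]})

(df[df['var1'] > 10]
   .groupby('group1', as_index=False)
   .agg(mean_var2=('var2', 'mean'))
   .sort_values('mean_var2', ascending=False))
```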