Untitled

a guest
Aug 1st, 2015
  1. {
  2. "metadata": {
  3. "name": "",
  4. "signature": "sha256:4c4c6b06002301ad7afbc7789449b54b94a4135218ccea4c9b9400e20f4f3062"
  5. },
  6. "nbformat": 3,
  7. "nbformat_minor": 0,
  8. "worksheets": [
  9. {
  10. "cells": [
  11. {
  12. "cell_type": "heading",
  13. "level": 1,
  14. "metadata": {},
  15. "source": [
  16. "All Code For Data Cleaning"
  17. ]
  18. },
  19. {
  20. "cell_type": "heading",
  21. "level": 2,
  22. "metadata": {},
  23. "source": [
  24. "1. Summary"
  25. ]
  26. },
  27. {
  28. "cell_type": "markdown",
  29. "metadata": {},
  30. "source": [
  31. "This notebook shows the main data cleaning used by the other IPython notebooks; those notebooks still do their own general subsetting and aggregation. It includes the create_list library, which contains two functions, make_empty_dataframe and clean_date, along with the code that creates a .csv file for each individual team."
  32. ]
  33. },
  34. {
  35. "cell_type": "code",
  36. "collapsed": false,
  37. "input": [
  38. "# Directory that holds all the datasets for the project. They will also be added to the project's GitHub repository.\n",
  39. "cd C:\\tmp"
  40. ],
  41. "language": "python",
  42. "metadata": {},
  43. "outputs": []
  44. },
  45. {
  46. "cell_type": "code",
  47. "collapsed": false,
  48. "input": [
  49. "# Import all libraries and have plots show in their respective blocks.\n",
  50. "import pandas as pd\n",
  51. "import datetime\n",
  52. "import time\n",
  53. "import numpy as np\n",
  54. "import matplotlib as mpl\n",
  55. "import matplotlib.pyplot as plt\n",
  56. "import seaborn as sns\n",
  57. "%matplotlib inline"
  58. ],
  59. "language": "python",
  60. "metadata": {},
  61. "outputs": []
  62. },
  63. {
  64. "cell_type": "code",
  65. "collapsed": false,
  66. "input": [
  67. "# Reads the .csv data\n",
  68. "df = pd.read_csv(\"june_final.csv\", header=None, names=['id', 'name', 'followers', 'current_viewers', 'date_time'])"
  69. ],
  70. "language": "python",
  71. "metadata": {},
  72. "outputs": []
  73. },
  74. {
  75. "cell_type": "code",
  76. "collapsed": false,
  77. "input": [
  78. "# Extracts individual pieces to be used later for the new datetime\n",
  79. "df['years'] = pd.DatetimeIndex(df['date_time']).year\n",
  80. "df['months'] = pd.DatetimeIndex(df['date_time']).month\n",
  81. "df['days'] = pd.DatetimeIndex(df['date_time']).day\n",
  82. "df['hours'] = pd.DatetimeIndex(df['date_time']).hour\n",
  83. "df['minutes'] = pd.DatetimeIndex(df['date_time']).minute"
  84. ],
  85. "language": "python",
  86. "metadata": {},
  87. "outputs": []
  88. },
  89. {
  90. "cell_type": "heading",
  91. "level": 2,
  92. "metadata": {},
  93. "source": [
  94. "2. Library: create_list"
  95. ]
  96. },
  97. {
  98. "cell_type": "markdown",
  99. "metadata": {},
  100. "source": [
  101. "make_empty_dataframe creates a dataframe whose index runs from 5/30/2015 through 6/30/2015 in 10-minute steps, with every value set to zero.\n",
  102. "It is later combined with other dataframes so that periods of downtime are filled in, giving a \n",
  103. "complete time series. It is included in the create_list library."
  104. ]
  105. },
  106. {
  107. "cell_type": "code",
  108. "collapsed": false,
  109. "input": [
  110. "def make_empty_dataframe(df):\n",
  111. "    date_tmp = pd.date_range(start=datetime.datetime(2015, 5, 30), end=datetime.datetime(2015, 6, 30)).tolist()\n",
  112. "    date_df = pd.DataFrame(date_tmp).reset_index()\n",
  113. "    date_df.columns = ['index', 'date']\n",
  114. "    date_list = pd.to_datetime(date_df['date']).tolist()\n",
  115. "\n",
  116. "    datetime_list = []\n",
  117. "    hour_list = np.arange(0, 24)\n",
  118. "    min_list = (0, 10, 20, 30, 40, 50)\n",
  119. "\n",
  120. "    for x in date_list:\n",
  121. "        day1 = x.day\n",
  122. "        month1 = x.month\n",
  123. "        year1 = x.year\n",
  124. "\n",
  125. "        for y in hour_list:\n",
  126. "            for z in min_list:\n",
  127. "                datetime_list.append(datetime.datetime(year1, month1, day1, y, z))\n",
  128. "\n",
  129. "    datetime_df = pd.DataFrame(index=datetime_list, columns=['A'])\n",
  130. "    datetime_df.fillna(0, inplace=True)\n",
  131. "\n",
  132. "    return datetime_df"
  133. ],
  134. "language": "python",
  135. "metadata": {},
  136. "outputs": []
  137. },
  138. {
  139. "cell_type": "markdown",
  140. "metadata": {},
  141. "source": [
  142. "clean_date is also included in the create_list library.\n",
  143. "It rounds the minutes down to the nearest multiple of ten (00, 10, 20, 30, 40, or 50)\n",
  144. "and then stores each row's rounded timestamp in the test column as a datetime object."
  145. ]
  146. },
  147. {
  148. "cell_type": "code",
  149. "collapsed": false,
  150. "input": [
  151. "def clean_date(df):\n",
  152. "    zeroes = np.arange(0, 10)\n",
  153. "    tens = np.arange(10, 20)\n",
  154. "    twenties = np.arange(20, 30)\n",
  155. "    thirties = np.arange(30, 40)\n",
  156. "    fourties = np.arange(40, 50)\n",
  157. "    fifties = np.arange(50, 60)\n",
  158. "    df['minutes'].replace(fourties, 40, inplace=True)\n",
  159. "    df['minutes'].replace(zeroes, 0, inplace=True)\n",
  160. "    df['minutes'].replace(tens, 10, inplace=True)\n",
  161. "    df['minutes'].replace(twenties, 20, inplace=True)\n",
  162. "    df['minutes'].replace(thirties, 30, inplace=True)\n",
  163. "    df['minutes'].replace(fifties, 50, inplace=True)\n",
  164. "\n",
  165. "    for index, row in df.iterrows():\n",
  166. "        df.loc[index, 'test'] = datetime.datetime(row['years'], row['months'], row['days'], row['hours'], row['minutes'])\n",
  167. "\n",
  168. "    return df"
  169. ],
  170. "language": "python",
  171. "metadata": {},
  172. "outputs": []
  173. },
  174. {
  175. "cell_type": "heading",
  176. "level": 2,
  177. "metadata": {},
  178. "source": [
  179. "3. Create Team Datasets"
  180. ]
  181. },
  182. {
  183. "cell_type": "code",
  184. "collapsed": false,
  185. "input": [
  186. "# Twitch.tv names of the players on each team\n",
  187. "archon = ['amazhs', 'deernadia', 'hero_firebat', 'xixo', 'zalaehs', 'orange_hs', 'backspacehs', 'purpledrank_hs']\n",
  188. "c9 = ['gnimsh', 'itshafu', 'kolento', 'ek0p', 'strifecro', 'tidesoftime', 'massansc']\n",
  189. "complexity = ['superjj102', 'hsdogdog', 'thejordude', 'ryzentv']\n",
  190. "liquid = ['savjz', 'neirea', 'sjow']\n",
  191. "nihilum = ['thijshs', 'lifecoach1981', 'lotharhs', 'radu_hs']\n",
  192. "tempostorm = ['reynad27', 'gaarabestshaman', 'ratsmah', 'reckful', 'hyp3d', 'eloise_ailv', 'justsaiyanhs']\n",
  193. "tsm = ['nl_kripp', 'trumpsc']"
  194. ],
  195. "language": "python",
  196. "metadata": {},
  197. "outputs": []
  198. },
  199. {
  200. "cell_type": "code",
  201. "collapsed": false,
  202. "input": [
  203. "df = clean_date(df)  # build the 'test' datetime column first, then make individual team datasets\n",
  204. "archon_df = df[df['name'].isin(archon)]\n",
  205. "c9_df = df[df['name'].isin(c9)]\n",
  206. "complexity_df = df[df['name'].isin(complexity)]\n",
  207. "liquid_df = df[df['name'].isin(liquid)]\n",
  208. "nihilum_df = df[df['name'].isin(nihilum)]\n",
  209. "tempostorm_df = df[df['name'].isin(tempostorm)]\n",
  210. "tsm_df = df[df['name'].isin(tsm)]"
  211. ],
  212. "language": "python",
  213. "metadata": {},
  214. "outputs": []
  215. },
  216. {
  217. "cell_type": "code",
  218. "collapsed": false,
  219. "input": [
  220. "# Make separate files for each team\n",
  221. "archon_df.to_csv('archon.csv', columns=('name', 'followers', 'current_viewers', 'days', 'hours', 'minutes', 'test'))\n",
  222. "c9_df.to_csv('c9.csv', columns=('name', 'followers', 'current_viewers', 'days', 'hours', 'minutes', 'test'))\n",
  223. "complexity_df.to_csv('complexity.csv', columns=('name', 'followers', 'current_viewers', 'days', 'hours', 'minutes', 'test'))\n",
  224. "liquid_df.to_csv('liquid.csv', columns=('name', 'followers', 'current_viewers', 'days', 'hours', 'minutes', 'test'))\n",
  225. "nihilum_df.to_csv('nihilum.csv', columns=('name', 'followers', 'current_viewers', 'days', 'hours', 'minutes', 'test'))\n",
  226. "tempostorm_df.to_csv('tempostorm.csv', columns=('name', 'followers', 'current_viewers', 'days', 'hours', 'minutes', 'test'))\n",
  227. "tsm_df.to_csv('tsm.csv', columns=('name', 'followers', 'current_viewers', 'days', 'hours', 'minutes', 'test'))"
  228. ],
  229. "language": "python",
  230. "metadata": {},
  231. "outputs": []
  232. }
  233. ],
  234. "metadata": {}
  235. }
  236. ]
  237. }
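
The notebook describes combining the zero-filled dataframe with other dataframes so that downtime shows up in a complete time series, but that step lives in the other notebooks. The following is a minimal sketch of one way to get the same effect, not the author's code: it assumes a pandas version that provides `Series.dt.floor`, uses `reindex` in place of `make_empty_dataframe`, and the sample rows (including the viewer counts) are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows standing in for one streamer's records.
sample = pd.DataFrame({
    "name": ["savjz", "savjz", "savjz"],
    "current_viewers": [1200, 1350, 900],
    "date_time": ["2015-06-01 20:03:11", "2015-06-01 20:14:45", "2015-06-01 20:41:02"],
})

# Round each timestamp down to its 10-minute bin
# (the same rounding clean_date applies to the minutes).
sample["test"] = pd.to_datetime(sample["date_time"]).dt.floor("10min")

# Complete 10-minute grid; a one-hour window is used here to keep the example short.
grid = pd.date_range("2015-06-01 20:00", "2015-06-01 20:50", freq="10min")

# Reindex onto the grid; bins with no observation (downtime) become 0 viewers.
viewers = sample.set_index("test")["current_viewers"].reindex(grid, fill_value=0)
print(viewers)
```

Note that `reindex` requires a unique index, so this sketch only works as written when each streamer has at most one record per 10-minute bin; duplicate bins would need to be aggregated first.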