a guest
Oct 16th, 2019
# our trade/craft
gaining insights from data to drive actionable decisions
data science -> an applied, interdisciplinary science of using data to make decisions

## practicing data science is a function
1. input
2. process
3. output

## data science workflow
- plan
- acquire
- prepare
- explore
- model
- present
- maintain your data products

## inputs:
- csv
- sql
- raw text
- images
- audio or video
- various types of data in various types of formats

- we have 2 main kinds of data
  - labeled data:
    - we have our targets
    - we have our labels (some human interaction)
    - example: hitting the spam button adds a label to the data inside that email
      - sender, subject, recipient, attachments, language, wording
  - unlabeled data:
    - raw text
    - images where a human has not added any input
- sometimes our data is structured and sometimes not.
  - either way, it requires work to prepare or integrate

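A minimal sketch of the labeled vs. unlabeled split, using pandas. The email records, the column names, and the `is_spam` label are all made up for illustration.

```python
import pandas as pd

# Hypothetical email records; the field names are illustrative only.
emails = pd.DataFrame(
    {
        "sender": ["a@example.com", "b@example.com", "c@example.com"],
        "subject": ["WIN a prize!!!", "Meeting notes", "Cheap meds"],
        "has_attachment": [False, True, False],
        # The label: True/False where a human made a call (e.g. hit the
        # spam button), None where nobody has looked at the email yet.
        "is_spam": [True, False, None],
    }
)

# Labeled data: rows where a human supplied the target.
labeled = emails[emails["is_spam"].notna()]

# Unlabeled data: rows that have features but no target yet.
unlabeled = emails[emails["is_spam"].isna()]

print(len(labeled), "labeled rows,", len(unlabeled), "unlabeled row(s)")
```

The features (sender, subject, attachments) exist either way; only the target column tells the two kinds of data apart.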
## acquire
- sometimes, it's as easy as pd.read_csv()
- sometimes, we've got to go collect the data ourselves
  - how do we sample a population to get a representative sample?
- this is where you need to be good w/ sql
- may have to talk to different data sources
- overlap w/ other software dev skills
- you may get data from a data engineer
- most of the time, this is on you.

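A small sketch of the two easy cases above: reading a CSV and pulling rows out of a SQL database. The `orders` table and its columns are invented for the example, `io.StringIO` stands in for a real file, and an in-memory SQLite database stands in for a real server.

```python
import io
import sqlite3

import pandas as pd

# Case 1: the data already lives in a CSV file.
# io.StringIO stands in for a real file path here.
csv_text = "order_id,units,price\n1,3,9.99\n2,1,24.50\n"
orders_from_csv = pd.read_csv(io.StringIO(csv_text))

# Case 2: the data lives in a SQL database.
# An in-memory SQLite database stands in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, units INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 3, 9.99), (2, 1, 24.50), (3, 5, 2.00)],
)
conn.commit()

# Push the filtering into SQL so only the rows we need come back.
orders_from_sql = pd.read_sql_query(
    "SELECT order_id, units, price FROM orders WHERE units >= ?",
    conn,
    params=(3,),
)
conn.close()

print(orders_from_csv.shape, orders_from_sql.shape)
```

Filtering in the query instead of in pandas is the design choice here: on a real database it keeps the transfer small.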
## prepare
- as a rule of thumb, ~80% of our time goes to preparing the data
- pandas, pandas, pandas
- it's pandamonium.
- derived values, like gross revenue (units * price) if you have units and price
- feature engineering

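The gross revenue example can be sketched in a few lines of pandas. The `sales` frame, its column names, and the bulk-order threshold are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sales rows; the column names are assumptions.
sales = pd.DataFrame(
    {
        "product": ["widget", "gadget", "widget"],
        "units": [3, 1, 5],
        "price": [10.0, 25.0, 10.0],
    }
)

# Derived value: gross revenue from units and price.
sales["gross_revenue"] = sales["units"] * sales["price"]

# Feature engineering: a simple flag a model could use.
# The threshold of 5 units is arbitrary.
sales["is_bulk_order"] = sales["units"] >= 5

print(sales)
```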
## explore
- statistics
- probability
- visualization
- we explore the data to figure out what model is a good candidate
- set aside 30% of our data for testing our model

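One way to set aside 30% of the rows, sketched with plain pandas on a toy frame. The data and the `random_state` value are placeholders; `sklearn.model_selection.train_test_split` with `test_size=0.3` is a common alternative.

```python
import pandas as pd

# Toy data standing in for a real prepared dataset.
df = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})

# Hold out 30% of the rows for testing; random_state makes it repeatable.
test = df.sample(frac=0.3, random_state=42)

# Everything that was not sampled becomes the training set.
train = df.drop(test.index)

# Explore only the training rows so the test rows stay unseen.
print(train.describe())
```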
## model
- run a model or two (maybe three)
- measure the effectiveness
  - true positives, true negatives, false positives, false negatives
  - accuracy
  - precision
  - recall
- hyperparameter tuning (we're modifying the parts of the ML equation other than the inputs)
  - weights on weighted averages
  - number of groups in k-means
  - think of these like tweaking a performance automobile
    - experiment with your fuel-air mixture
- we need to test our model
  - ideally, we've got new data coming in

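The effectiveness measures above all come from the same four counts. A plain-Python sketch, with made-up labels and predictions (1 = spam, 0 = not spam):

```python
# Hypothetical true labels and model predictions (1 = spam, 0 = not spam).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

# The four cells of the confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

# Accuracy: share of all predictions that were right.
accuracy = (tp + tn) / len(pairs)

# Precision: of everything flagged positive, how much really was.
# Real code should guard against a zero denominator here and below.
precision = tp / (tp + fp)

# Recall: of everything truly positive, how much we caught.
recall = tp / (tp + fn)

print(f"tp={tp} tn={tn} fp={fp} fn={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

`sklearn.metrics` offers `accuracy_score`, `precision_score`, and `recall_score` for the same calculations on real data.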
## present
- building data products to share
- write a paper or whitepaper
- produce a talk or a handout for a talk
- bokeh
- tableau
- handouts

## maintain
- you may retrain your model
- get new data
- engineer your model to run on streaming inputs vs. a one-time dump of data
- maintain your tool

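To make the streaming vs. one-time dump distinction concrete, here is a sketch that uses a running mean as a stand-in for "a model". The function names and the readings are hypothetical; a real model would need its own incremental update rule.

```python
from typing import Iterable, Iterator, Sequence


def batch_mean(values: Sequence[float]) -> float:
    """One-time dump: needs the whole dataset in memory at once."""
    return sum(values) / len(values)


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming: update the estimate one value at a time."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update; no need to store earlier values.
        mean += (value - mean) / count
        yield mean


readings = [4.0, 8.0, 6.0, 2.0]

# The streaming version yields an up-to-date estimate after every input.
streamed = list(running_mean(readings))

print(streamed, batch_mean(readings))
```

The last streamed value matches the batch result, but the streaming version could have consumed the readings from a socket or queue without ever holding them all.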
## Technologies and their place in the Data Science Workflow
Reminder: Workflow is plan, acquire, prepare, explore, model, present, maintain

Plan -> people think, discuss, and collaborate; project management tools like Trello

### Acquire
- Technologies: python, pandas, numpy, SQL, python libraries (web scraping, etc.)
- Skills: programming, troubleshooting, keeping the big picture in mind, keeping the next step in mind
- Tasks: getting raw data

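Keeping the next step in mind includes the earlier question of how to get a representative sample. A rough sketch of simple vs. stratified sampling with pandas, on an invented `population` frame; `groupby(...).sample` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical population: 70 customers in the north, 30 in the south.
population = pd.DataFrame(
    {
        "customer_id": range(100),
        "region": ["north"] * 70 + ["south"] * 30,
    }
)

# Simple random sample: region proportions can drift by chance.
simple = population.sample(frac=0.1, random_state=0)

# Stratified sample: take 10% from each region separately,
# so the sample keeps the population's 70/30 split.
stratified = population.groupby("region").sample(frac=0.1, random_state=0)

print(stratified["region"].value_counts())
```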
### Prepare
- Technologies: serious python. lots and lots of pandas.
- Skills: attention to detail, debug other people's data, debug our own code
- Tasks: cleaning the data, this is the "data wrangling" or "data munging" part, feature engineering

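A small taste of the data wrangling part. The messy `raw` frame is invented, and the cleaning rules (drop rows with no name, trim whitespace, coerce bad ages to missing) are just one reasonable set of choices.

```python
import pandas as pd

# Hypothetical messy input: stray whitespace, inconsistent case,
# a duplicated row, a missing name, and a non-numeric age.
raw = pd.DataFrame(
    {
        "name": [" Alice", "bob ", "bob ", None],
        "age": ["34", "n/a", "n/a", "29"],
    }
)

clean = (
    raw
    # Rows without a name are unusable for this example.
    .dropna(subset=["name"])
    .assign(
        # Trim whitespace and normalize capitalization.
        name=lambda d: d["name"].str.strip().str.title(),
        # Turn ages into numbers; anything unparseable becomes NaN.
        age=lambda d: pd.to_numeric(d["age"], errors="coerce"),
    )
    # Remove exact duplicate rows left after normalization.
    .drop_duplicates()
    .reset_index(drop=True)
)

print(clean)
```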
### Explore
- Tasks: get a sense of the relationships between the data's variables, visualization, statistical analysis
- Skills: statistics, visualization
- Technologies: matplotlib, seaborn, pandas, python, numpy, scipy, statistical packages

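A quick first look at how variables relate, sketched with pandas alone. The numbers are invented; on real data a scatter plot in matplotlib or seaborn would be the usual next step.

```python
import pandas as pd

# Hypothetical data: revenue rises with units, returns tend to fall.
df = pd.DataFrame(
    {
        "units": [1, 2, 3, 4, 5],
        "revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
        "returns": [5, 3, 4, 1, 0],
    }
)

# Summary statistics: count, mean, std, min, quartiles, max per column.
summary = df.describe()

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()

print(summary)
print(corr.round(2))
```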
### Model
- Technologies: scipy, sklearn (ML algorithms), keras or tensorflow
- Tasks: Evaluating the efficacy of a model, hyperparameter tuning
- Skills: determining overfitting vs. underfitting

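Hyperparameter tuning and the overfitting vs. underfitting question can be shown together. This sketch uses numpy only: the polynomial degree plays the role of the hyperparameter, and the noisy sine data, the seed, and the candidate degrees are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy sine curve.
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Alternate points between a training set and a validation set.
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


# Try a few values of the hyperparameter (polynomial degree).
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        mse(coeffs, x_train, y_train),  # error on data the fit has seen
        mse(coeffs, x_val, y_val),      # error on data it has not seen
    )

for degree, (train_err, val_err) in results.items():
    print(f"degree={degree} train_mse={train_err:.4f} val_mse={val_err:.4f}")

# Pick the degree with the lowest validation error.
best_degree = min(results, key=lambda d: results[d][1])
print("best degree by validation error:", best_degree)
```

High error on both sets suggests underfitting; training error that keeps dropping while validation error stalls or rises suggests overfitting.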
### Present
- Tasks: get your point across quickly, be prepared to support your thesis, but start w/ the bottom line (your point)
- Skills: public speaking, visual design and a sense for how people process information
- Technologies: Tableau, Microsoft Word to make a handout