- # our trade/craft
- gaining insights from data for actionable decisions
- data science -> applied interdisciplinary science around using data to make decisions
- ## practicing data science is a function
- 1. input
- 2. process
- 3. output
- ## data science workflow
- - plan
- - acquire
- - prepare
- - explore
- - model
- - present
- - maintain your data products
- ## inputs:
- - csv
- - sql
- - raw text
- - images
- - audio or video
- - various types of data in various types of formats
- - we have 2 main kinds of data
- - labeled data:
- - we have our targets
- - we have our labels (some human interaction)
- - example: hitting the spam button adds a label to the data inside that email
- - sender, subject, recipient, attachments, language, wording
- - unlabeled data:
- - raw text
- - images where a human has not added any input
- - sometimes our data is structured and sometimes not.
- - either way, it requires work to prepare or integrate
- ## acquire
- - sometimes, it's as easy as pd.read_csv()
- - sometimes, we've got to go collect the data ourselves
- - how do we sample a population to get a representative sample?
- - this is where you need to be good w/ sql
- - may have to talk to different data sources
- - overlap w/ other software dev skills
- - you may get data from a data engineer
- - most of the time, this is on you.
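A minimal sketch of the easy case plus a simple random sample, assuming pandas is installed. The inline CSV text, the column names, and the sample size are made up for illustration; a real project would pass a file path or the result of a SQL query instead.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk.
raw_csv = io.StringIO(
    "customer_id,region,units\n"
    "1,north,3\n"
    "2,south,5\n"
    "3,north,2\n"
    "4,west,7\n"
    "5,south,1\n"
    "6,west,4\n"
)

# The easy case: one call loads the whole table.
df = pd.read_csv(raw_csv)

# Simple random sample of 3 rows; random_state makes the draw repeatable.
sample = df.sample(n=3, random_state=42)
```

A simple random sample is only one way to aim for a representative sample; sampling within each group (for example, per `region`) is another option when groups are uneven.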
- ## prepare
- - roughly 80% of the time, we're preparing the data
- - pandas, pandas, pandas
- - it's pandamonium.
- - derived values like gross revenue if you have units and price
- - feature engineering
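The derived-value idea can be sketched in pandas as follows. The table and its column names (`units`, `price`, `gross_revenue`) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical order data; the column names are assumptions.
orders = pd.DataFrame(
    {
        "product": ["widget", "gadget", "gizmo"],
        "units": [10, 3, 5],
        "price": [2.50, 20.00, 7.00],
    }
)

# Derived value: gross revenue is units sold times unit price.
orders["gross_revenue"] = orders["units"] * orders["price"]
```

The new column is computed row by row from existing columns, which is the simplest form of feature engineering.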
- ## explore
- - statistics
- - probability
- - visualization
- - we explore the data to figure out what model is a good candidate
- - set aside 30% of our data for testing our model
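One way to hold out 30% of the rows, sketched with plain pandas. The toy DataFrame and the variable names `train` and `test` are assumptions.

```python
import pandas as pd

# Hypothetical toy data: 10 rows.
df = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})

# Hold out 30% of the rows for testing; random_state makes it repeatable.
test = df.sample(frac=0.3, random_state=0)

# Every row not in the test set stays available for exploring and training.
train = df.drop(test.index)
```

scikit-learn's `train_test_split` with `test_size=0.3` does the same job and is the more common choice once sklearn is already in use.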
- ## model
- - run a model or two (maybe three)
- - measure the effectiveness
- - true positives, true negatives, false positives, false negatives
- - accuracy
- - precision
- - recall
- - hyperparameter tuning (we're adjusting the settings of the model that are chosen before training, as opposed to the input data)
- - weights on weighted averages
- - number of groups in k-means
- - think of these like tweaking a performance automobile
- - experiment with your fuel-air mixture
- - we need to test our model
- - ideally, we've got new data coming in
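The effectiveness measures listed above can be computed by hand from the four outcome counts. This is a sketch on made-up labels (1 = spam, 0 = not spam), not output from a real model.

```python
# Hypothetical labels: 1 = spam, 0 = not spam.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))

# Count each of the four possible outcomes.
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)  # share of all predictions that were right
precision = tp / (tp + fp)         # of the predicted positives, how many were real
recall = tp / (tp + fn)            # of the real positives, how many were found
```

With these labels all three measures come out to 0.75; on real data they usually differ, which is why it helps to look at more than accuracy.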
- ## present
- - building data products to share
- - write a paper or whitepaper
- - produce a talk or a handout for talk
- - bokeh
- - tableau
- - handouts
- ## Maintain
- - you may retrain your model
- - get new data
- - engineer your model to run on streaming inputs vs. a one-time dump of data
- - maintain your tool
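A rough sketch of the difference between scoring a one-time dump and scoring streaming inputs. The `score` function is a hypothetical stand-in for a trained model, and the record layout is an assumption.

```python
from typing import Iterable, Iterator


def score(record: dict) -> float:
    """Hypothetical stand-in for a trained model's prediction."""
    return 0.5 * record["units"] + 1.0


def score_batch(records: list) -> list:
    """One-time dump: all records are in memory and scored together."""
    return [score(r) for r in records]


def score_stream(records: Iterable[dict]) -> Iterator[float]:
    """Streaming: score each record as it arrives, one at a time."""
    for record in records:
        yield score(record)
```

For example, `score_batch([{"units": 2}, {"units": 4}])` returns a full list, while `score_stream(...)` yields one result per incoming record and never needs the whole dataset at once.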
- ## Technologies in their place of the Data Science Workflow
- Reminder: Workflow is plan, acquire, prepare, explore, model, present, maintain
- Plan -> people think, discuss, and collaborate; project management tools like Trello
- ### Acquire
- - Technologies: python, pandas, numpy, SQL, python libraries (web scraping, etc...)
- - Skills: programming, troubleshooting, keeping the big picture in mind, keeping the next step in mind
- - Tasks: getting raw data
- ### Prepare
- - Technologies: serious python. lots and lots of pandas.
- - Skills: attention to detail, debug other people's data, debug our own code
- - Tasks: cleaning the data, this is the "data wrangling" or "data munging" part, feature engineering
- ### Explore
- - Tasks: get a sense of the relationship of data's variables, visualization, statistical analysis
- - Skills: statistics, visualization
- - Technologies: matplotlib, seaborn, pandas, python, numpy, scipy, statistical packages
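A small sketch of getting a first sense of how variables relate, using pandas summary statistics and pairwise correlation. The data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data: revenue rises with units, returns fall with units.
df = pd.DataFrame(
    {
        "units": [1, 2, 3, 4, 5],
        "revenue": [10, 20, 30, 40, 50],
        "returns": [5, 4, 3, 2, 1],
    }
)

# Per-column summary statistics (count, mean, std, min, quartiles, max).
summary = df.describe()

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()
```

Here `units` and `revenue` are perfectly positively correlated and `units` and `returns` perfectly negatively correlated; real data is rarely this clean, so a plot alongside the numbers is usually worth the time.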
- ### Model
- - Technologies: scipy, sklearn (ML algorithms), keras or tensorflow
- - Tasks: evaluating the efficacy of a model, hyperparameter tuning
- - Skills: determining overfitting vs. underfitting
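Hyperparameter tuning can be sketched as trying several candidate settings and keeping the one that scores best on held-out data. This example tunes the weight of a weighted average of two models' predictions; the predictions, the true values, and the candidate weights are all made up.

```python
# Hypothetical held-out data: two models' predictions and the true values.
preds_a = [2.0, 4.0, 6.0, 8.0]
preds_b = [4.0, 6.0, 8.0, 10.0]
truth = [3.0, 5.0, 7.0, 9.0]


def mse(weight: float) -> float:
    """Mean squared error of the blend weight*a + (1 - weight)*b."""
    blended = [weight * a + (1 - weight) * b for a, b in zip(preds_a, preds_b)]
    return sum((p - t) ** 2 for p, t in zip(blended, truth)) / len(truth)


# The hyperparameter being tuned is the blend weight.
candidates = [0.0, 0.25, 0.5, 0.75, 1.0]

# Keep the candidate with the lowest error on the held-out data.
best_weight = min(candidates, key=mse)
```

Picking the setting on held-out data rather than on the training data is one guard against overfitting: a setting that only looks good on data the model has already seen will tend to score worse here.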
- ### Present
- - Tasks: get your point across quickly, be prepared to support your thesis, but start w/ the bottom line (your point)
- - Skills: public speaking, visual design and a sense for how people process information
- - Technologies: Tableau, Microsoft Word to make a handout