# our trade/craft
- gaining insights from data to drive actionable decisions
- data science -> an applied, interdisciplinary science of using data to make decisions

## practicing data science is a function
1. input
2. process
3. output

## data science workflow
- plan
- acquire
- prepare
- explore
- model
- present
- maintain your data products

## inputs:
- csv
- sql
- raw text
- images
- audio or video
- various types of data in various types of formats

- we have 2 main kinds of data
    - labeled data:
        - we have our targets
        - we have our labels (some human interaction)
            - example: hitting the spam button adds a label to the data inside that email
                - sender, subject, recipient, attachments, language, wording
    - unlabeled data:
        - raw text
        - images where a human has not added any input
- sometimes our data is structured and sometimes not.
    - either way, it requires work to prepare or integrate
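
A minimal sketch of the labeled vs. unlabeled distinction, using the spam example above. Every record, field name, and value here is made up for illustration.

```python
# Hypothetical email records; every field name and value is invented for illustration.
emails = [
    {"sender": "promo@example.com", "subject": "You won!", "label": "spam"},
    {"sender": "boss@example.com", "subject": "Meeting notes", "label": "not spam"},
    {"sender": "unknown@example.com", "subject": "Hello", "label": None},  # nobody has labeled this one yet
]

# Labeled rows carry a target we can train against; unlabeled rows do not.
labeled = [e for e in emails if e["label"] is not None]
unlabeled = [e for e in emails if e["label"] is None]

print(len(labeled), "labeled,", len(unlabeled), "unlabeled")
```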

## acquire
- sometimes, it's as easy as `pd.read_csv()`
- sometimes, we've got to go collect the data ourselves
- how do we sample a population so that the sample is representative?
- this is where you need to be good w/ sql
- may have to talk to different data sources
- overlap w/ other software dev skills
- you may get data from a data engineer
- most of the time, this is on you.
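
A rough sketch of the two easiest acquisition paths mentioned above: reading a CSV with `pd.read_csv()` and pulling rows out of a SQL database. The `orders` table, its columns, and all values are assumptions for illustration. An in-memory string and an in-memory SQLite database stand in for a real file and a real server.

```python
import io
import sqlite3

import pandas as pd

# Case 1: the data already sits in a CSV. A real project would pass a file path;
# an in-memory string stands in for the file here.
csv_text = "order_id,units,price\n1,3,9.99\n2,1,24.50\n"
orders_from_csv = pd.read_csv(io.StringIO(csv_text))

# Case 2: the data lives in a SQL database. An in-memory SQLite database with a
# hypothetical "orders" table stands in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, units INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 3, 9.99), (2, 1, 24.50), (3, 5, 4.00)],
)
conn.commit()

# Let SQL do the filtering, then hand the result to pandas.
orders_from_sql = pd.read_sql_query(
    "SELECT order_id, units, price FROM orders WHERE units >= 3", conn
)
conn.close()

print(orders_from_csv.shape, orders_from_sql.shape)
```

Filtering in the query rather than in pandas keeps the transferred data small, which is one reason being good with SQL pays off at this stage.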

## prepare
- roughly 80% of the time, we're preparing the data (a common rule of thumb)
- pandas, pandas, pandas
- it's pandamonium.
- derived values, like gross revenue (units × price) when you have units and price
- feature engineering
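
A small sketch of a derived value in pandas, following the gross revenue example above. The column names and numbers are assumptions.

```python
import pandas as pd

# Hypothetical sales data; column names and values are assumptions.
sales = pd.DataFrame({"units": [3, 1, 5], "price": [10.0, 24.5, 4.0]})

# Derived value: gross revenue is units sold times unit price.
sales["gross_revenue"] = sales["units"] * sales["price"]

print(sales)
```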

## explore
- statistics
- probability
- visualization
- we explore the data to figure out what model is a good candidate
- set aside 30% of our data for testing our model
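
One possible way to hold out 30% of the data, sketched with plain pandas. The DataFrame is a made-up stand-in, and `random_state=42` is an arbitrary choice that just makes the split repeatable.

```python
import pandas as pd

# Hypothetical dataset of 10 rows.
df = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})

# Randomly hold out 30% of the rows for testing; random_state makes the split repeatable.
test = df.sample(frac=0.3, random_state=42)
train = df.drop(test.index)

print(len(train), "training rows,", len(test), "test rows")
```

scikit-learn's `train_test_split` is a common alternative that does the same job in one call.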


## model
- run a model or two (maybe three)
- measure the effectiveness
    - true positives, true negatives, false positives, false negatives
    - accuracy
    - precision
    - recall
- hyperparameter tuning (we're adjusting the settings of the ML algorithm that we choose ourselves, as opposed to the input data)
    - weights on weighted averages
    - number of groups in k-means
    - think of this like tuning a performance automobile
    - e.g., experimenting with your fuel-air mixture
- we need to test our model
- ideally, we've got new data coming in
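
A minimal sketch of how accuracy, precision, and recall fall out of the four counts listed above. The actual and predicted labels are invented for illustration.

```python
# Hypothetical labels for a spam classifier: 1 = spam, 0 = not spam.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)  # share of all predictions that were right
precision = tp / (tp + fp)         # of everything flagged as spam, how much really was
recall = tp / (tp + fn)            # of all real spam, how much we caught

print(tp, tn, fp, fn)
print(accuracy, precision, recall)
```

Which metric matters most depends on the cost of each kind of mistake: a spam filter that hides real mail (a false positive) hurts differently than one that lets spam through (a false negative).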

## present
- building data products to share
- write a paper or whitepaper
- produce a talk or a handout for talk
- bokeh
- tableau
- handouts

## maintain
- you may retrain your model
- get new data
- engineer your model to run on streaming inputs vs. a one-time dump of data
- maintain your tool
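
A rough sketch of the difference between scoring a one-time dump and scoring streaming inputs. The `score` function is a hypothetical hand-written rule standing in for a trained model, and the records are made up.

```python
from typing import Iterable, Iterator


def score(record: dict) -> float:
    """Stand-in for a trained model: a hypothetical hand-written scoring rule."""
    return 0.9 if "free" in record["subject"].lower() else 0.1


def score_batch(records: list) -> list:
    """One-time dump: everything is already in memory, so score it all at once."""
    return [score(r) for r in records]


def score_stream(records: Iterable[dict]) -> Iterator[float]:
    """Streaming: score each record as it arrives, without holding the rest."""
    for r in records:
        yield score(r)


emails = [{"subject": "FREE prize"}, {"subject": "Lunch?"}]
print(score_batch(emails))
print(list(score_stream(iter(emails))))
```

The generator version never needs the full dataset at once, so the same scoring logic can sit behind a queue or a socket.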


## Technologies and their place in the Data Science Workflow
Reminder: the workflow is plan, acquire, prepare, explore, model, present, maintain

Plan -> people think, discuss, and collaborate; project management tools like Trello

### Acquire
- Technologies: python, pandas, numpy, SQL, python libraries (web scraping, etc.)
- Skills: programming, troubleshooting, keeping the big picture in mind, keeping the next step in mind
- Tasks: getting raw data

### Prepare
- Technologies: serious python. lots and lots of pandas.
- Skills: attention to detail, debug other people's data, debug our own code
- Tasks: cleaning the data, this is the "data wrangling" or "data munging" part, feature engineering

### Explore
- Tasks: get a sense of the relationships among the data's variables, visualization, statistical analysis
- Skills: statistics, visualization
- Technologies: matplotlib, seaborn, pandas, python, numpy, scipy, statistical packages
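
A quick sketch of a first pass at exploration with pandas: summary statistics plus a correlation between two variables. The data (hours studied vs. exam score) is made up for illustration.

```python
import pandas as pd

# Hypothetical data: hours studied vs. exam score.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 70, 74, 83],
})

summary = df.describe()                      # count, mean, std, min, quartiles, max per column
correlation = df["hours"].corr(df["score"])  # Pearson correlation by default

print(summary)
print(round(correlation, 3))
```

With matplotlib installed, `df.plot.scatter(x="hours", y="score")` would show the same relationship visually.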

### Model
- Technologies: scipy, sklearn (ML algorithms), keras or tensorflow
- Tasks: Evaluating the efficacy of a model, hyperparameter tuning
- Skills: determining overfitting vs. underfitting
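
One way to get a feel for overfitting vs. underfitting is to vary a single hyperparameter and compare the error on training data with the error on held-out data. The sketch below swaps in a plain NumPy polynomial fit rather than the sklearn or keras models listed above, and the noisy sine data is synthetic. The polynomial degree plays the role of the hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy sine curve, split into alternating train and test points.
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


# The polynomial degree is the hyperparameter: we choose it, the fit does not learn it.
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])
```

High error on both training and test data points to underfitting. Training error that keeps dropping while test error stalls or climbs points to overfitting.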

### Present
- Tasks: get your point across quickly, be prepared to support your thesis, but start w/ the bottom line (your point)
- Skills: public speaking, visual design and a sense for how people process information
- Technologies: Tableau, Microsoft Word to make a handout