Tutorial for getting started with R.

This is a step-by-step tutorial for getting started with R, a powerful programming language for data analysis and
visualization. It is aimed at near complete beginners. You'll basically want to be comfortable with spreadsheets and with
using your computer's command line.

I slapped this together quickly, so expect some weirdness. Feel free to email me with comments or questions at
jpvelez | at | gmail.com

I learned the following stuff from the UCLA Statistics Department's great R tutorials, so check those out:
https://www.ats.ucla.edu/stat/r/learning_modules.htm


## GETTING DATA

First, we need to get some data to analyze. We'll be using a dataset of NYC school SAT scores
(nyc_schools_sat_scores_clean.csv), which is attached to this gist. Download the data to your computer and pop it into Excel to
examine it.

Every row in this dataset represents an NYC school that had MORE THAN FIVE students take the SAT exam in 2010. The columns
include a school's "DBN" number (a unique id for every school), its name, the number of students who took the SAT that year,
and the mean reading, math, and writing scores of those students. Think of each row as a school, and each column as that
school's attributes.

NOTE: the attached csv file is a cleaned version of the raw data available on the NYC data portal:
https://nycopendata.socrata.com/

A number of schools in the original data had 's' values instead of numbers in the SAT reading, math, and writing score
columns. According to the data portal, these schools had fewer than 5 students take the SAT, so those scores have been
suppressed (hence the 's') to protect their anonymity. To simplify things, I've gone ahead and removed those rows from the
dataset. If you don't use the clean data, the following tutorial won't work.


## FIRING UP R

Ok, use the command line to navigate to the directory where you saved the data, and type "R" (capital R) to fire up R. You can also
use the official R console, but then you'll need to explicitly set your working directory to the directory where the SAT
data lives. I won't cover that. Google it. Don't be lazy.


## READING CSV FILES INTO R

Now, we need to read our data into R so we can do stuff with it. For that, we use the read.csv() function:

```R
sat_scores = read.csv('nyc_school_sat_scores_clean.csv')
```

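Here the `read.csv` function reads in the csv, which needs to
be located in the working directory, and returns your data in a dataframe object that is then saved to a variable
named `sat_scores`. The filename in quotes has to match the name of the file you downloaded exactly, so double-check it
if R complains that it cannot open the file.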
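
After reading a file in, it's worth a quick sanity check. Here's a minimal sketch of that. It writes a tiny made-up CSV to a temporary file so it runs anywhere; the `dbn` column name and all the values are invented for illustration. With the real data you'd call the same functions on `sat_scores`.

```R
# Toy stand-in for the real file: three made-up schools written to a temp CSV.
toy_path = tempfile(fileext = '.csv')
writeLines(c(
  'dbn,school_names,reading,math',
  '01X001,School A,380,400',
  '01X002,School B,420,450',
  '01X003,School C,350,340'
), toy_path)

toy_scores = read.csv(toy_path)

nrow(toy_scores)   # number of rows (schools)
ncol(toy_scores)   # number of columns
head(toy_scores)   # first few rows, handy when the table is long
str(toy_scores)    # each column's type: numbers vs. text
```

If `str()` reports a score column as text instead of numbers, something in that column isn't a number.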


## R DATA TYPES: DATAFRAMES AND VECTORS

What is a dataframe object, you ask? In technicalese, it's a data structure that makes it easy to store and access
tabular data with named columns. Think of it as a spreadsheet or table you can do stuff with.

To see the contents of your dataframe, just type it in:

```R
sat_scores
```

```R
names(sat_scores)
```

Throw your dataframe into the `names` function to see what columns of data are in there. This is the
same thing as the column names in the first row of a spreadsheet.

The other big type of object in R is a vector. A vector is basically just an ordered list of values that are all the
same kind of thing. It could be a list of text, or of numbers, but it's usually numbers.

From here on out, type in the code first, try to understand what it does, and then read the description.

```R
vector = c(1, 2, 3, 4)
```

This is how you make a new vector and save it to a variable named `vector`. If you don't save dataframes or vectors
to variables, you can't use them later.

```R
vector
```

This is how you inspect the contents of your new vector.
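
Here are a few more things you can do with a vector, as a rough sketch. The variable name `scores` and its numbers are made up for illustration.

```R
scores = c(380, 420, 350, 410)   # four made-up numbers

length(scores)        # how many values are in the vector
scores[2]             # the second value (R counts from 1, not 0)
scores + 10           # arithmetic applies to every value at once
scores[scores > 400]  # only the values greater than 400
```

That last line uses the same bracket idea you'll use to filter dataframes.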

## SUBSETTING DATA

Now we're going to slice and dice the data in our dataframe. This is called 'subsetting' data.

```R
sat_scores$reading
```

This is the easiest way of accessing all the data in one of your columns. The style is `dataframe$column`. You punch this
in, and the computer will return a vector of all the values in that column, in this case, all of the mean SAT
reading scores for NYC schools (that had more than 5 students take the SAT in 2010).

```R
reading_scores = sat_scores$reading
```

You can save the data in the reading column, i.e. the vector of mean reading scores, by saving it to a new variable just
like above.

```R
sat_scores[, 2]
```

You can also select columns using brackets like this. Actually, these brackets let you select both columns and rows.
The first 'slot' in the brackets, before the comma, lets you specify what rows you want. We want all of them, so
leave that blank. The second slot lets you specify which columns you want. So this code will get you a vector of
school names, because `school_names` is column 2. If you don't remember a column's number (or name), use the `names()`
function.

```R
sat_scores[, c(2, 3, 4)]
```

You can select multiple columns. The way you do this is by putting a vector of the column
numbers you want in that second column slot.

```R
sat_scores[, 'school_names']
```

You can also specify what columns you want by using their names. The names must have quotes around them, because
technically these are 'strings', or text objects. If you don't put quotes around it, R thinks you're talking about a variable.
If you haven't used that variable anywhere, it'll get pissy and throw an error at you.

```R
sat_scores[, c('school_names', 'reading')]
```

You can also specify multiple columns using their names by putting them in a vector, just like we did for the column
numbers. This code will return a new, two-column dataframe of school_names and reading scores. Every school will be
in this new dataframe, because you left the first 'slot' in the brackets blank.
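
One detail that's easy to miss: asking for one column gives you a vector, while asking for several gives you a dataframe. A small sketch on made-up data (the `toy` dataframe, its `dbn` column name, and its values are invented stand-ins for `sat_scores`):

```R
# Made-up stand-in for sat_scores with three schools.
toy = data.frame(
  dbn = c('01X001', '01X002', '01X003'),
  school_names = c('School A', 'School B', 'School C'),
  reading = c(380, 420, 350),
  math = c(400, 450, 340)
)

one_col = toy[, 'reading']                       # one column: a plain vector
two_cols = toy[, c('school_names', 'reading')]   # several columns: a dataframe

is.vector(one_col)        # TRUE
is.data.frame(two_cols)   # TRUE
names(two_cols)           # the two column names we asked for
```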

Let's try to filter out rows now.

```R
sat_scores[sat_scores$math > 350, ]
```

This code will return a new dataframe that will contain only those rows
(i.e. those schools) which had math scores ABOVE 350. This new dataframe will have all the columns - school name,
number of testers, reading scores, etc - because you left the second slot blank, but it will only include schools
that scored above 350 in math. In other words, this command says "get me every school that had math scores greater
than 350." For some silly reason, you can't just write `[math > 350,]` because R doesn't know which dataframe that
column belongs to. Maybe you have several dataframes with columns named 'math'. So you need to specify which column
you're talking about by writing `[sat_scores$math > 350,]`, the same syntax you used to access the reading scores
above.
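
It can help to see what the row 'slot' actually receives. Here's a minimal sketch on made-up data (the `toy` dataframe and the `keep` variable are invented for illustration): the comparison produces one TRUE or FALSE per row, and the brackets keep the TRUE rows.

```R
# Made-up stand-in for sat_scores with three schools.
toy = data.frame(
  school_names = c('School A', 'School B', 'School C'),
  reading = c(380, 420, 350),
  math = c(400, 450, 340)
)

keep = toy$math > 350   # one TRUE/FALSE value for every row
keep                    # TRUE TRUE FALSE

toy[keep, ]             # same result as toy[toy$math > 350, ]

# Combine tests with & (and) or | (or).
toy[toy$math > 350 & toy$reading > 400, ]
```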

Now let's subset on both rows and columns.

```R
sat_scores[sat_scores$math > 350, c(2, 4, 5)]
```

This code says "get me every row where math score > 350, but only show me the data in columns 2, 4, and 5." In other
words, take our sat_scores table and spit out a new table that only shows the school name, reading, and math scores of
schools that had mean math scores above 350. Got it?

```R
sat = sat_scores[sat_scores$math != 's' ,]
```

This shows you another way you can subset rows. This code says "return every row that DOES NOT have an 's' value in its
math column." != stands for DOES NOT EQUAL, while == stands for EQUALS. If you want to start with the raw, not-cleaned data
from the NYC data portal, you could use this code to remove schools that have suppressed scores, i.e. 's' strings in many
of their columns. On the clean data there are no 's' values, so `sat` ends up with the same rows as `sat_scores`. The
rest of this tutorial uses `sat`.
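
If you do start from the raw data, dropping the 's' rows is probably not enough on its own: a column that contains 's' gets read in as text, so the scores would still need converting to numbers. A rough sketch, assuming that layout (the `raw` dataframe and its values are made up):

```R
# Made-up stand-in for the raw data, where one school has a suppressed score.
raw = data.frame(
  school_names = c('School A', 'School B', 'School C'),
  math = c('400', 's', '340'),
  stringsAsFactors = FALSE
)

sat = raw[raw$math != 's', ]                    # drop the suppressed rows

# Turn the remaining text into numbers. as.character() is a safety step in
# case the column was read in as a factor rather than plain text.
sat$math = as.numeric(as.character(sat$math))

sat$math > 350   # numeric comparisons now behave as expected
```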

Alright, so now you can filter tables and access the data in their columns. That's nice, but a big vector (list)
of numbers isn't very helpful. It doesn't give us insight. We need a way to summarize some of the data.

## SUMMARIZING DATA

```R
summary(sat_scores)
```

The summary function does just that. For columns that contain numbers, it prints summary statistics like
the smallest, largest, median, and mean number in the list. For columns that contain text, like school_names, it counts
how many times each unique text string occurs, at least in R versions before 4.0, where `read.csv` stores text as
factors; newer versions just report the column's length and type. You can also use this function on individual vectors -
`summary(sat_scores$reading)` or `summary(reading_scores)` - not just entire dataframes.
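
Each number that `summary` prints also has its own function, in case you only want one of them. A quick sketch with made-up scores:

```R
reading_scores = c(380, 420, 350, 410, 390)   # five made-up scores

summary(reading_scores)   # min, quartiles, median, mean, and max in one go

min(reading_scores)       # smallest value
median(reading_scores)    # middle value once sorted
mean(reading_scores)      # average
max(reading_scores)       # largest value
```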

It's time for a little data viz. R makes it stupid simple to generate charts. Let's start with a histogram, which is a great
way to visualize the distribution of the data in a single column.

```R
hist(sat$math)
```

The hist function takes in a vector and draws a histogram chart. It won't work if you feed it an entire dataframe -
`hist(sat_scores)` - you need to specify a column. Remember, whenever you type `data_frame$column`, the computer returns a
vector of all the values in the column, which then get fed into the `hist` function.

```R
hist(sat$reading)
```

So this will show you the distribution of mean reading scores across NYC schools. Notice that a lot of them cluster
between 350 and 450. This is a low score, and consistent with the median values we got from the summary function.
Takeaway: NYC schools aren't doing very well.

```R
hist(sat$writing)
```

You can do this for writing and math as well.
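
The default histogram is fine for a quick look, but `hist` takes a few optional arguments for tidying it up. A minimal sketch using randomly generated fake scores (the `fake_math` vector is invented so this runs without the data file):

```R
set.seed(1)   # make the fake data repeatable
fake_math = round(rnorm(200, mean = 400, sd = 50))   # 200 made-up scores

hist(fake_math,
     breaks = 20,                                # suggest roughly 20 bins
     main = 'Mean SAT math scores (fake data)',  # chart title
     xlab = 'Mean math score')                   # x axis label
```

With the real data you'd pass `sat$math` instead of `fake_math`.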

## VISUALIZING DATA

Now let's make a scatterplot. These let us see two variables at once, and examine whether there's a relationship
between the two.

```R
plot(math ~ reading, sat)
```

This `plot` function looks similar to hist, but it's a little peculiar. First, it has two arguments or 'slots'.
Instead of specifying columns with the `dataframe$column` syntax as you've been doing, the first argument tells R
which columns to plot, and the second argument tells R which dataframe these columns belong to. Also there's a weird
~ in the first argument. Basically the (math ~ reading, ..) code says I want a scatterplot with math scores on the
y axis and reading scores on the x axis: (y ~ x, ..). I think of it as "math mashed up with reading."
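
If the ~ syntax feels odd, it may help to see it next to the plain `dataframe$column` style. This sketch uses a made-up `toy` dataframe; both calls plot the same points, they just label the axes differently.

```R
# Five made-up schools.
toy = data.frame(
  reading = c(350, 380, 400, 420, 450),
  math    = c(340, 390, 410, 440, 470)
)

plot(math ~ reading, toy)     # formula style: y ~ x, then the dataframe
plot(toy$reading, toy$math)   # vector style: x first, then y
```

Note the order flips: the formula puts y first, while the vector style puts x first.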

OK, so we have a scatterplot! Two observations: 1. most schools cluster between 350 and 450 on BOTH their math and reading scores,
which is consistent with our histograms and summaries. 2. schools that have higher math scores tend to have higher reading
scores. They tend to move together, that's why you see the dots moving up and to the right. That means there's an
association between math and reading scores. Cool! If we didn't see that pattern, if dots were all over the place, then
there would be no association.

```R
library(lattice)
```

Lattice is a library that has functions that make fancier graphs than the ones that come built in to R.
Use the `library` function to load it into R so you can use some of them.

```R
xyplot(math ~ reading, sat)
```

`xyplot` is lattice's equivalent to the `plot()` function. It works the same way, but gives
you pretty colors.

So we've got some charts. That's great. Before the end, I will tease you with a tiny bit of stats.

## RUNNIN' STATS ON DATA

```R
fit = lm(math ~ reading, sat)
```

This `lm()` function runs a linear regression on the data we visualized with a scatterplot. Very crudely, it tries
to measure to what extent there's a linear relationship between math and reading scores. A linear relationship here means
"as reading goes up, math goes up by a steady amount." The scatterplot suggested that schools with higher math scores tend
to have higher reading scores; this is a rigorous way of capturing that relationship.
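
Saving the regression to `fit` doesn't print anything, so here's a hedged sketch of how you might look inside it. It uses a made-up `toy` dataframe so it runs on its own; with the real data you'd use the `fit` from above.

```R
# Five made-up schools.
toy = data.frame(
  reading = c(350, 380, 400, 420, 450),
  math    = c(340, 390, 410, 440, 470)
)

fit = lm(math ~ reading, toy)

coef(fit)      # intercept and slope of the fitted line
summary(fit)   # fuller report, including R-squared

# Predicted mean math score for a hypothetical school with reading = 410.
predict(fit, data.frame(reading = 410))
```

The slope (the number labeled `reading`) says how many points the mean math score rises, on average, for each extra point of mean reading score.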

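```R
plot(math ~ reading, sat)  # redraw the base R scatterplot first
abline(fit)                # then add the regression line on top of it
```

The `abline` function takes the linear regression object generated above and adds a 'line of best fit' to
our scatterplot. It draws on the most recent base R plot, not on lattice's `xyplot`, which is why the `plot` call
comes first.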