- This is a step-by-step tutorial for getting started with R, a powerful programming language for data analysis and
- visualization. It is aimed at near complete beginners. You'll basically want to be comfortable with spreadsheets and with
- using your computer's command line.
- I slapped this together quickly, so expect some weirdness. Feel free to email me with comments or questions at
- jpvelez | at | gmail.com
- I learned the following stuff using the UCLA Statistics Department's great R tutorials, so check those out:
- https://www.ats.ucla.edu/stat/r/learning_modules.htm
- ## GETTING DATA
- First, we need to get some data to analyze. We'll be using a dataset of NYC school SAT scores
- (nyc_schools_sat_scores_clean.csv), which is attached to this gist. Download the data to your computer and pop it into Excel to
- examine it.
- Every row in this dataset represents an NYC school that had MORE THAN FIVE students take the SAT exam in 2010. The columns
- include a school's "DBN" number (a unique id for every school), its name, the number of students who took the SAT that year,
- and the mean reading, math, and writing scores of those students. Think of each row as a school, and each column as one of that
- school's attributes.
- NOTE: the attached csv file is a cleaned version of the raw data available on the NYC data portal:
- https://nycopendata.socrata.com/
- A number of schools in the original data had 's' values instead of numbers in the SAT reading, math, and writing score
- columns. According to the data portal, these schools had fewer than 5 students take the SAT, so those scores have been
- suppressed (hence the 's') to protect their anonymity. To simplify things, I've gone ahead and removed those rows from the
- dataset. If you don't use the clean data, the following tutorial won't work.
- ## FIRING UP R
- Ok, use the command line to navigate to the directory where you saved the data, and type "R" (capital letter; the command
- is case-sensitive on most systems) to fire up R. You can also use the official R console, but then you'll need to
- explicitly set your working directory to the directory where the SAT data lives. I won't cover that. Google it. Don't be lazy.
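If you do go the console route, here's a minimal sketch of checking and setting the working directory. `getwd`, `setwd`, `dir.exists`, and `list.files` are all base R functions; the `~/Downloads` path is only a placeholder I made up, so swap in wherever you actually saved the csv.

```R
# Show the directory R is currently working in.
getwd()

# Point R at the folder that holds the csv. This path is a made-up
# placeholder; replace it with the folder where you saved the file.
data_dir = "~/Downloads"
if (dir.exists(data_dir)) {
  setwd(data_dir)
}

# List the csv files R can see in the working directory.
list.files(pattern = "\\.csv$")
```

If the SAT file shows up in that last listing, `read.csv` will be able to find it.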
- ## READING CSV FILES INTO R
- Now, we need to read our data into R so we can do stuff with it. For that, we use the read.csv() function:
- ```R
- sat_scores = read.csv('nyc_school_sat_scores_clean.csv')
- ```
- Here the `read.csv` function reads in the csv, which needs to
- be located in the working directory, and returns your data in a dataframe object that is then saved to a variable
- named `sat_scores`. The filename in quotes has to match the name of the file you downloaded exactly.
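To get a feel for what `read.csv` hands back without needing the real file, here's a small sketch that writes three made-up schools to a temporary csv and reads them in. The rows are invented for illustration, and the `dbn` and `num_test_takers` column names are my assumptions; check the real ones with `names()`.

```R
# Write three made-up rows to a temporary csv so this runs anywhere.
# The dbn and num_test_takers column names are assumptions.
toy_file = tempfile(fileext = ".csv")
writeLines(c(
  "dbn,school_names,num_test_takers,reading,math,writing",
  "X001,Toy High School A,25,380,400,375",
  "X002,Toy High School B,60,450,470,440",
  "X003,Toy High School C,12,340,335,330"
), toy_file)

toy_scores = read.csv(toy_file)

dim(toy_scores)     # number of rows, then number of columns
head(toy_scores)    # the first few rows
str(toy_scores)     # the type of each column
```

`head` and `str` are handy on the real data too, since printing a whole dataframe of several hundred schools scrolls right off the screen.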
- ## R DATA TYPES: DATAFRAMES AND VECTORS
- What is a dataframe object, you ask? In technicalese, it's a data structure that makes it easy to store and access
- tabular data with named columns. Think of it as a spreadsheet or table you can do stuff with.
- To see the contents of your dataframe, just type it in:
- ```R
- sat_scores
- ```
- ```R
- names(sat_scores)
- ```
- Throw your dataframe into the `names` function to see what columns of data are in there. this is the
- same thing as the column names in the first row of a spreadsheet.
- The other big type of object in R is a vector. A vector is basically just a list of values that are all the same type.
- It could be a list of text, or of numbers, but it's usually numbers.
- From here on out, type in the code first, try to understand what it does, and then read the description.
- ```R
- vector = c(1, 2, 3, 4)
- ```
- this is how you make a new vector and save it to a variable named `vector`. if you don't save dataframes or vectors
- to variables, you can't use them later.
- ```R
- vector
- ```
- this is how you inspect the contents of your new vector.
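Here's a short sketch of a few things you can do with a vector once you have one. The numbers are made up, and everything used is base R.

```R
scores = c(380, 450, 340, 410)   # made-up numbers

length(scores)    # how many values are in the vector
scores[2]         # the second value (R starts counting at 1)
scores + 10       # arithmetic applies to every value at once
mean(scores)      # the average of all the values

words = c("math", "reading", "writing")   # a vector of text
words[1]
```

That "applies to every value at once" behavior is what makes the row filtering later in this tutorial work.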
- ## SUBSETTING DATA
- Now we're going to slice and dice the data in our dataframe. This is called 'subsetting' data.
- ```R
- sat_scores$reading
- ```
- this is the easiest way of accessing all the data in one of your columns. the style is `dataframe$column`. you punch this
- in, and the computer will return a vector of all the values in that column, in this case, all of the mean SAT
- reading scores for nyc schools (that had more than 5 students take the SAT in 2010).
- ```R
- reading_scores = sat_scores$reading
- ```
- you can save the data in the reading column, i.e. the vector of mean reading scores, by saving it to a new variable just
- like above
- ```R
- sat_scores[, 2]
- ```
- you can also select columns using brackets like this. actually, these brackets let you select both columns and rows.
- the first 'slot' in the brackets, before the comma, lets you specify what rows you want. we want all of them, so
- leave that blank. the second slot lets you specify which columns you want. so this code will get you a vector of
- `school_names`, because school_names are in column 2. if you don't remember a column's number (or name), use the `names()`
- function.
- ```R
- sat_scores[, c(2, 3, 4)]
- ```
- you can select multiple columns. the way you do this is by putting a vector of the column
- numbers you want in that second column slot.
- ```R
- sat_scores[, 'school_names']
- ```
- you can also specify what columns you want by using their names. the names must have quotes around them, because
- technically these are 'strings', or text objects. if you don't put quotes around it, R thinks you're talking about a variable.
- if you haven't used that variable anywhere, it'll get pissy and throw an error at
- you.
- ```R
- sat_scores[, c('school_names', 'reading')]
- ```
- you can also specify multiple columns using their names by putting them in a vector, just like we did for the column
- numbers. this code will return a new, two-column dataframe of school_names and reading scores. every school will be
- in this new dataframe, because you left the first 'slot' in the brackets blank.
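One wrinkle worth knowing about the bracket syntax: picking a single column hands back a vector, while picking several hands back a dataframe. Here's a small sketch on a made-up three-school table (the schools and scores are invented; only the column names follow this tutorial).

```R
toy = data.frame(
  school_names = c("Toy High School A", "Toy High School B", "Toy High School C"),
  reading = c(380, 450, 340),
  math = c(400, 470, 335)
)

one_col = toy[, "reading"]                       # a plain numeric vector
two_cols = toy[, c("school_names", "reading")]   # a two-column dataframe
still_df = toy[, "reading", drop = FALSE]        # drop = FALSE keeps a one-column dataframe

class(one_col)
class(two_cols)
class(still_df)
```

If a function complains that it wanted a dataframe and got a vector (or the other way around), this is usually the reason.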
- Let's try to filter out rows now.
- ```R
- sat_scores[sat_scores$math > 350, ]
- ```
- This code will return a new dataframe that will contain only those rows
- (i.e. those schools) which had math scores ABOVE 350. this new dataframe will have all the columns - school name,
- number of testers, reading scores, etc - because you left the second slot blank, but it will only include schools
- that scored above 350 in math. in other words, this command says "get me every school that had math scores greater
- than 350." For some silly reason, you can't just write `[math > 350,]` because R doesn't know which dataframe that
- column belongs to. maybe you have several dataframes with columns named 'math'. so you need to specify which column
- you're talking about by writing `[sat_scores$math > 350,]`, the same syntax you used to access the reading scores
- above.
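It helps to look at what the comparison produces by itself before it goes inside the brackets. A minimal sketch, again on a made-up three-school table:

```R
toy = data.frame(
  school_names = c("Toy High School A", "Toy High School B", "Toy High School C"),
  reading = c(380, 450, 340),
  math = c(400, 470, 335)
)

# The comparison gives one TRUE or FALSE per row.
keep = toy$math > 350
keep

# Putting that vector in the row slot keeps only the TRUE rows.
toy[keep, ]

# sum() counts the TRUEs, i.e. how many schools pass the filter.
sum(keep)
```

So the row slot isn't doing anything magical: it just takes a vector of TRUEs and FALSEs, one per row, and keeps the rows marked TRUE.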
- Now let's subset on both rows and columns.
- ```R
- sat_scores[sat_scores$math > 350, c(2, 4, 5)]
- ```
- This code says "get me every row where math score > 350, but only show me the data in columns 2, 4, and 5." in other
- words, take our sat_scores table and spit out a new table that only shows the school name, reading, and math scores of
- schools that had mean math scores above 350. Got it?
- ```R
- sat = sat_scores[sat_scores$math != 's' ,]
- ```
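- This shows you another way you can subset rows. This code says "return every row that DOES NOT have an 's' value in its
- math column." != stands for DOES NOT EQUAL, while == stands for EQUAL. If you want to start with the raw, not-cleaned data
- from the NYC data portal, you could use this code to remove schools that have suppressed scores i.e. 's' strings in many
- of their columns. Keep in mind that R reads a column containing 's' values as text, so after filtering you'd still need
- to convert the score columns to numbers (with `as.numeric(as.character(...))`, which works whether R stored the text as
- strings or as a factor) before summarizing or plotting them.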
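Here's a rough sketch of what that raw-data cleanup could look like from start to finish. The table below is made up to mimic the 's' problem; I haven't run this against the actual portal data, so treat it as an outline rather than a tested recipe.

```R
# A made-up stand-in for the raw data, where one school has a suppressed score.
raw = data.frame(
  school_names = c("Toy High School A", "Toy High School B", "Toy High School C"),
  math = c("400", "s", "335"),
  stringsAsFactors = FALSE
)

# Drop the rows with suppressed scores.
sat_toy = raw[raw$math != "s", ]

# The column is still text at this point, so math on it would fail.
class(sat_toy$math)

# Convert text to numbers. as.character() first makes this safe
# even if R stored the column as a factor.
sat_toy$math = as.numeric(as.character(sat_toy$math))

mean(sat_toy$math)
```

You would repeat the conversion for the reading and writing columns as well.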
- Alright, so now you can filter tables and access the data in their columns. That's nice, but a big vector (list)
- of numbers isn't very helpful. It doesn't give us insight. We need a way to summarize some of the data.
- ## SUMMARIZING DATA
- ```R
- summary(sat_scores)
- ```
- The summary function does just that. For vectors that contain numbers, it prints summary statistics like
- the smallest, largest, and mean number in the list. For vectors that contain text, like school_names, it counts
- how many times each unique text string occurs in the vector (in R 4.0 and later, where `read.csv` keeps text as plain
- strings rather than factors by default, it just reports the column's length and type instead). you can also use this
- function on individual vectors - `summary(sat_scores$reading)` or `summary(reading_scores)` - not just entire dataframes.
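The numbers `summary` prints can also be pulled out one at a time, which is handy when you want to reuse them. A small sketch with made-up scores, using only base R functions:

```R
reading_toy = c(380, 450, 340, 410, 395)   # made-up scores

min(reading_toy)
max(reading_toy)
mean(reading_toy)
median(reading_toy)

summary(reading_toy)   # all of the above (plus quartiles) in one go

# table() counts how often each value occurs, which works for text too.
boroughs_toy = c("Bronx", "Queens", "Bronx")   # made-up labels
table(boroughs_toy)
```

`table` gives you the "how many times does each string occur" count regardless of which version of R you're running.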
- It's time for a little data viz. R makes it stupid simple to generate charts. Let's start with a histogram, which is a great
- way to visualize the distribution of the data in a single column.
- ```R
- hist(sat$math)
- ```
- The hist function takes in a vector and returns a histogram chart. it won't work if you feed it an entire dataframe -
- `hist(sat_scores)` - you need to specify a column. remember, whenever you type `data_frame$column`, the computer returns a
- vector of all the values in the column, which then get fed into the `hist` function.
- ```R
- hist(sat$reading)
- ```
- So this will show you the distribution of mean reading scores across nyc schools. notice that a lot of them cluster
- between 350 - 450. This is a low score, and consistent with the median values we got from the summary function.
- Takeaway: NYC schools aren't doing very well.
- ```R
- hist(sat$writing)
- ```
- You can do this for writing and math as well.
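The default histogram is fine for a quick look, but `hist` takes a few optional arguments that make the chart easier to read. A minimal sketch on randomly generated stand-in scores (not the real data); `breaks`, `main`, and `xlab` are standard `hist` arguments.

```R
set.seed(1)                                         # make the random numbers repeatable
math_toy = round(rnorm(200, mean = 410, sd = 60))   # 200 made-up scores

h = hist(
  math_toy,
  breaks = 20,                            # suggest roughly 20 bars
  main = "Mean SAT math score (toy data)",
  xlab = "Mean math score"
)

h$counts   # hist also returns the bar heights, if you want the numbers
```

Note that `breaks` is only a suggestion; R rounds it to "pretty" bin edges, so you may not get exactly 20 bars.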
- ## VISUALIZING DATA
- Now let's make a scatterplot. These let us see two variables at once, and examine whether there's a relationship
- between the two.
- ```R
- plot(math ~ reading, sat)
- ```
- This `plot` function looks similar to hist, but it's a little peculiar. first, it has two arguments or 'slots'.
- instead of specifying columns with the `dataframe$column` syntax as you've been doing, the first argument tells R
- which columns to plot, and the second argument tells R which dataframe these columns belong to. Also there's a weird
- ~ in the first argument. Basically the (math ~ reading, .. ) code says I want a scatterplot with math scores on the
- y axis and reading scores on the x axis: (y ~ x, ..) I think of it as "math mashed up with reading."
- OK, so we have a scatterplot! Two observations: 1. most schools cluster between 350-450 on BOTH their math and reading scores,
- which is consistent with our histograms and summaries. 2. schools that have higher math scores tend to have higher reading
- scores. they tend to move together, that's why you see the dots moving up and to the right. that means there's an
- association between math and reading scores. cool! if we didn't see that pattern, if dots were all over the place, then
- there would be no association.
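If you want a single number to go with the "up and to the right" eyeball test, base R's `cor` function computes the correlation between two vectors. A quick sketch on five made-up schools:

```R
reading_toy = c(340, 380, 395, 410, 450)   # made-up scores
math_toy = c(335, 400, 390, 430, 470)

# Correlation runs from -1 to 1. Values near 1 mean the two move up
# together, values near 0 mean no linear association.
r = cor(math_toy, reading_toy)
r
```

On the real data you'd write `cor(sat$math, sat$reading)`. Keep in mind this only measures straight-line association.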
- ```R
- library(lattice)
- ```
- Lattice is a library that has functions that make fancier graphs than the ones that come built into R.
- use the `library` function to load it into R so you can use some of them.
- ```R
- xyplot(math ~ reading, sat)
- ```
- `xyplot` is lattice's equivalent to the `plot()` function. it works the same way, but gives
- you pretty colors.
- So we've got some charts. That's great. Before the end, I will tease you with a tiny bit of stats.
- ## RUNNIN' STATS ON DATA
- ```R
- fit = lm(math ~ reading, sat)
- ```
- This `lm()` function runs a linear regression on the data we visualized with a scatterplot. Very crudely, it tries
- to measure to what extent there's a linear relationship between math and reading scores. A positive linear relationship
- means "as reading goes up, math goes up at a steady rate." The scatterplot suggested that schools with higher math scores
- tend to have higher reading scores; this is a more rigorous way of capturing that relationship.
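The fitted object isn't very informative until you look inside it. Here's a minimal sketch on five made-up schools; `coef` and `summary` are base R functions that work on any `lm` fit.

```R
toy = data.frame(
  reading = c(340, 380, 395, 410, 450),   # made-up scores
  math = c(335, 400, 390, 430, 470)
)

toy_fit = lm(math ~ reading, toy)

# The intercept and slope of the fitted line. The slope says how many
# points of math score go with one extra point of reading score.
coef(toy_fit)

# R-squared: the share of the variation in math that the line accounts for.
summary(toy_fit)$r.squared
```

Typing `summary(fit)` on the real model prints these numbers along with standard errors and p-values.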
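- ```R
- plot(math ~ reading, sat)   # redraw the base scatterplot first
- abline(fit)                 # then add the regression line on top of it
- ```
- This function will take the linear regression object generated above and add a 'line of best fit' to
- our scatterplot. `abline` draws on the most recent base R plot, not on a lattice `xyplot`, so make the scatterplot
- with `plot` again right before calling it.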