HackerRIZLA

Tutorial for getting started with R.

Sep 22nd, 2012
This is a step-by-step tutorial for getting started with R, a powerful programming language for data analysis and
visualization. It is aimed at near-complete beginners. You'll basically want to be comfortable with spreadsheets and with
using your computer's command line.

I slapped this together quickly, so expect some weirdness. Feel free to email me with comments or questions at
jpvelez | at | gmail.com

I learned the following stuff using the UCLA Statistics Department's great R tutorials, so check those out:
https://www.ats.ucla.edu/stat/r/learning_modules.htm


## GETTING DATA

First, we need to get some data to analyze. We'll be using a dataset of NYC school SAT scores
(nyc_schools_sat_scores_clean.csv), which is attached to this gist. Download the data to your computer and pop it into
Excel to examine it.

Every row in this dataset represents an NYC school that had MORE THAN FIVE students take the SAT exam in 2010. The columns
include a school's "DBN" number (a unique id for every school), its name, the number of students who took the SAT that year,
and the mean reading, math, and writing scores of those students. Think of each row as a school, and each column as one of
that school's attributes.

NOTE: the attached csv file is a cleaned version of the raw data available on the NYC data portal:
https://nycopendata.socrata.com/

A number of schools in the original data had 's' values instead of numbers in the SAT reading, math, and writing score
columns. According to the data portal, these schools had fewer than 5 students take the SAT, so those scores have been
suppressed (hence the 's') to protect the students' anonymity. To simplify things, I've gone ahead and removed those rows
from the dataset. If you don't use the clean data, the following tutorial won't work.



## FIRING UP R

Ok, use the command line to navigate to the directory where you saved the data, and type "R" (capital R) to fire up R.
You can also use the official R console, but then you'll need to explicitly set your working directory to the directory
where the SAT data lives. I won't cover that. Google it. Don't be lazy.
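
For reference, here is a minimal sketch of checking the working directory from inside R. The path in the commented-out `setwd()` line is a made-up example, not a real location on your machine.

```R
# Show the directory R is currently working in.
getwd()

# Change it if needed. The path below is hypothetical; use your own folder.
# setwd("~/Downloads/sat_data")

# List the csv files R can see in the working directory.
csv_files = list.files(pattern = "\\.csv$")
csv_files
```

If the SAT csv doesn't show up in that list, `read.csv` won't find it either.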



## READING CSV FILES INTO R

Now, we need to read our data into R so we can do stuff with it. For that, we use the read.csv() function:

```R
sat_scores = read.csv('nyc_school_sat_scores_clean.csv')
```

Here the `read.csv` function reads in the csv, which needs to be located in the working directory, and returns your data
in a dataframe object that is then saved to a variable named `sat_scores`. The filename in quotes has to match the name
of the file you downloaded exactly.
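
A quick way to check that a file loaded properly is to look at its size and its first few rows. The sketch below is self-contained so it runs anywhere: it writes a tiny made-up csv to a temporary file and reads it back the same way. The `dbn` and `num_test_takers` column names and every value are invented for illustration; only `school_names`, `reading`, `math`, and `writing` appear in this tutorial.

```R
# Write a tiny, made-up csv to a temporary file.
csv_path = tempfile(fileext = ".csv")
writeLines(c(
  "dbn,school_names,num_test_takers,reading,math,writing",
  "01A001,Example High School A,30,380,400,370",
  "01A002,Example High School B,55,450,470,440",
  "01A003,Example High School C,12,340,330,335"
), csv_path)

# Read it in exactly like the real data.
toy_scores = read.csv(csv_path)

nrow(toy_scores)     # how many rows (schools)
ncol(toy_scores)     # how many columns
head(toy_scores, 2)  # just the first two rows
str(toy_scores)      # column names and what type of data each one holds
```

On the real data, `head(sat_scores)` is much easier to read than printing the whole dataframe.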



## R DATA TYPES: DATAFRAMES AND VECTORS

What is a dataframe object, you ask? In technicalese, it's a data structure that makes it easy to store and access
tabular data with named columns. Think of it as a spreadsheet or table you can do stuff with.

To see the contents of your dataframe, just type it in:

```R
sat_scores
```

```R
names(sat_scores)
```

Throw your dataframe into this `names` function to see what columns of data are in there. This is the
same thing as the column names in the first row of a spreadsheet.

The other big type of object in R is a vector. A vector is basically just a list of values that are all the same kind of
thing. It could be a list of text, or of numbers, but it's usually numbers.

From here on out, type in the code first, try to understand what it does, and then read the description.

```R
vector = c(1, 2, 3, 4)
```

This is how you make a new vector and save it to a variable named `vector`. If you don't save dataframes or vectors
to variables, you can't use them later.

```R
vector
```

This is how you inspect the contents of your new vector.
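
Here are a few more things you can do with a vector once you have one, as a small sketch. The numbers are made up and the variable names `scores` and `subjects` are just for illustration.

```R
# A vector of made-up scores.
scores = c(380, 450, 340, 410)

scores[2]        # grab the second value (R starts counting at 1)
length(scores)   # how many values are in the vector
scores + 10      # math gets applied to every value at once
mean(scores)     # the average of all the values

# Vectors can hold text too, as long as every value is text.
subjects = c("reading", "math", "writing")
subjects[1]
```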

## SUBSETTING DATA

Now we're going to slice and dice the data in our dataframe. This is called 'subsetting' data.

```R
sat_scores$reading
```

This is the easiest way of accessing all the data in one of your columns. The style is `dataframe$column`. You punch this
in, and the computer will return a vector of all the values in that column, in this case all of the mean SAT
reading scores for NYC schools (that had more than 5 students take the SAT in 2010).

```R
reading_scores = sat_scores$reading
```

You can save the data in the reading column, i.e. the vector of mean reading scores, by assigning it to a new variable,
just like above.

```R
sat_scores[, 2]
```

You can also select columns using brackets like this. Actually, these brackets let you select both columns and rows.
The first 'slot' in the brackets, before the comma, lets you specify what rows you want. We want all of them, so
leave that blank. The second slot lets you specify which columns you want. So this code will get you a vector of
`school_names`, because school_names are in column 2. If you don't remember a column's number (or name), use the `names()`
function.

```R
sat_scores[, c(2, 3, 4)]
```

You can select multiple columns. The way you do this is by putting a vector of the column
numbers you want in that second column slot.

```R
sat_scores[, 'school_names']
```

You can also specify what columns you want by using their names. The names must have quotes around them, because
technically these are 'strings', or text objects. If you don't put quotes around a name, R thinks you're talking about a
variable. If you haven't used that variable anywhere, it'll get pissy and throw an error at you.

```R
sat_scores[, c('school_names', 'reading')]
```

You can also specify multiple columns using their names by putting them in a vector, just like we did for the column
numbers. This code will return a new, two-column dataframe of school_names and reading scores. Every school will be
in this new dataframe, because you left the first 'slot' in the brackets blank.
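
One thing worth seeing for yourself: asking for a single column gives you back a vector, while asking for several gives you back a dataframe. The sketch below uses a tiny made-up dataframe called `toy` (the values are invented) so it runs without the csv.

```R
# A tiny made-up dataframe with the same kind of columns as sat_scores.
toy = data.frame(
  school_names = c("School A", "School B", "School C"),
  reading = c(380, 450, 340),
  math = c(400, 470, 330)
)

# One column comes back as a plain vector...
one_col = toy[, 'reading']
is.data.frame(one_col)

# ...but two or more columns come back as a smaller dataframe.
two_cols = toy[, c('school_names', 'reading')]
is.data.frame(two_cols)

# Add drop = FALSE if you want a one-column dataframe instead of a vector.
kept = toy[, 'reading', drop = FALSE]
is.data.frame(kept)
```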

Let's try to filter out rows now.

```R
sat_scores[sat_scores$math > 350, ]
```

This code will return a new dataframe that contains only those rows
(i.e. those schools) which had math scores ABOVE 350. This new dataframe will have all the columns - school name,
number of testers, reading scores, etc - because you left the second slot blank, but it will only include schools
that scored above 350 in math. In other words, this command says "get me every school that had math scores greater
than 350." For some silly reason, you can't just write `[math > 350,]` because R doesn't know which dataframe that
column belongs to. Maybe you have several dataframes with columns named 'math'. So you need to specify which column
you're talking about by writing `[sat_scores$math > 350,]`, the same syntax you used to access the reading scores
above.
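
It helps to see what the part before the comma actually is: a vector of TRUE/FALSE values, one per row, and R keeps the rows marked TRUE. Here is a sketch on a made-up dataframe (the `toy` and `keep` names and all values are just for illustration). It also shows base R's `subset()` function, which does let you write the bare column name.

```R
# A tiny made-up dataframe.
toy = data.frame(
  school_names = c("School A", "School B", "School C"),
  reading = c(380, 450, 340),
  math = c(400, 470, 330)
)

# The comparison on its own gives one TRUE/FALSE per row.
keep = toy$math > 350
keep

# Putting that vector in the row slot keeps only the TRUE rows.
toy[keep, ]

# subset() is a shortcut that looks up 'math' inside the dataframe for you.
subset(toy, math > 350)
```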

Now let's subset on both rows and columns.

```R
sat_scores[sat_scores$math > 350, c(2, 4, 5)]
```

This code says "get me every row where math score > 350, but only show me the data in columns 2, 4, and 5." In other
words, take our sat_scores table and spit out a new table that only shows the school name, math, and reading scores of
schools that had mean math scores above 350. Got it?

```R
sat = sat_scores[sat_scores$math != 's', ]
```

This shows you another way you can subset rows. This code says "return every row that DOES NOT have an 's' value in its
math column." != stands for DOES NOT EQUAL, while == stands for EQUALS. If you want to start with the raw, not-cleaned data
from the NYC data portal, you could use this code to remove schools that have suppressed scores, i.e. 's' strings in many
of their columns. With the clean data, no rows contain an 's', so this just copies every row into a new dataframe named
`sat`, which is the one we'll use from here on.
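
One catch if you do start from the raw data: a column that contains 's' values gets read in as text (or as a 'factor' in older versions of R), and it stays that way even after the 's' rows are gone. A sketch of the extra conversion step, using a made-up two-column dataframe named `raw`:

```R
# Made-up stand-in for the raw data: math scores stored as text because of the 's'.
raw = data.frame(
  school_names = c("School A", "School B", "School C"),
  math = c("400", "s", "330"),
  stringsAsFactors = FALSE
)

# Drop the suppressed row.
sat_toy = raw[raw$math != 's', ]

# The column is still text, so number functions won't work on it yet.
is.numeric(sat_toy$math)

# Convert it to numbers. Going through as.character first also handles factors.
sat_toy$math = as.numeric(as.character(sat_toy$math))
mean(sat_toy$math)
```

You'd need to repeat the conversion for the reading and writing columns too.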

Alright, so now you can filter tables and access the data in their columns. That's nice, but a big vector (list)
of numbers isn't very helpful. It doesn't give us insight. We need a way to summarize some of the data.

## SUMMARIZING DATA

```R
summary(sat_scores)
```

The summary function does just that. For columns that contain numbers, it prints summary statistics like
the smallest, largest, and mean number in the list. For columns that contain text, like school_names, it counts
how many times each unique text string occurs (at least when R has stored the text as a 'factor', which is what
`read.csv` did by default before R 4.0; newer versions just report the length and type). You can also use this function
on individual vectors - `summary(sat_scores$reading)` or `summary(reading_scores)` - not just entire dataframes.
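
If you only want one of those numbers, each statistic has its own function. A small sketch on a made-up vector (the name `toy_reading` and its values are invented):

```R
# Made-up mean reading scores for five schools.
toy_reading = c(380, 450, 340, 410, 395)

min(toy_reading)      # smallest value
max(toy_reading)      # largest value
mean(toy_reading)     # average
median(toy_reading)   # middle value when sorted

# summary() bundles these (plus the quartiles) into one printout.
toy_summary = summary(toy_reading)
toy_summary
```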

It's time for a little data viz. R makes it stupid simple to generate charts. Let's start with a histogram, which is a
great way to visualize the distribution of the data in a single column.

```R
hist(sat$math)
```

The hist function takes in a vector and draws a histogram chart. It won't work if you feed it an entire dataframe -
`hist(sat_scores)` - you need to specify a column. Remember, whenever you type `dataframe$column`, the computer returns a
vector of all the values in the column, which then gets fed into the `hist` function.

```R
hist(sat$reading)
```

So this will show you the distribution of mean reading scores across NYC schools. Notice that a lot of them cluster
between 350 and 450. This is a low score, and consistent with the median values we got from the summary function.
Takeaway: NYC schools aren't doing very well.

```R
hist(sat$writing)
```

You can do this for writing and math as well.
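
The default chart titles are ugly, so here is a sketch of a few optional `hist` arguments: `breaks` suggests how many bars to use, and `main` and `xlab` set the title and x axis label. The data is a made-up vector, and the title text is just an example.

```R
# Made-up mean reading scores.
toy_reading = c(340, 355, 380, 395, 410, 420, 450, 510)

# Draw the histogram with a suggested number of bars and readable labels.
h = hist(toy_reading,
         breaks = 5,
         main = "Mean SAT reading scores (made-up data)",
         xlab = "Mean reading score")

# hist() also hands back the numbers behind the chart.
h$breaks   # where each bar starts and ends
h$counts   # how many values fell in each bar
```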

## VISUALIZING DATA

Now let's make a scatterplot. These let us see two variables at once, and examine whether there's a relationship
between the two.

```R
plot(math ~ reading, sat)
```

This `plot` function looks similar to `hist`, but it's a little peculiar. First, it has two arguments or 'slots'.
Instead of specifying columns with the `dataframe$column` syntax as you've been doing, the first argument tells R
which columns to plot, and the second argument tells R which dataframe these columns belong to. Also there's a weird
~ in the first argument. Basically the (math ~ reading, ..) code says I want a scatterplot with math scores on the
y axis and reading scores on the x axis: (y ~ x, ..). I think of it as "math mashed up with reading."

OK, so we have a scatterplot! Two observations: 1. most schools cluster between 350-450 on BOTH their math and reading
scores, which is consistent with our histograms and summaries. 2. schools that have higher math scores tend to have higher
reading scores. They tend to move together, that's why you see the dots moving up and to the right. That means there's an
association between math and reading scores. Cool! If we didn't see that pattern, if the dots were all over the place,
then there would be no association.
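
If you want a number to go with the picture, base R's `cor` function computes the correlation between two vectors: it lands between -1 and 1, and values near 1 mean the two move up together. A sketch with made-up scores (the vector names and values are invented):

```R
# Made-up mean scores for five schools.
toy_reading = c(340, 380, 395, 410, 450)
toy_math = c(330, 400, 390, 430, 470)

# Correlation between the two: positive here, because they rise together.
r = cor(toy_math, toy_reading)
r
```

On the real data you'd write `cor(sat$math, sat$reading)`.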

```R
library(lattice)
```

Lattice is a library that has functions that make fancier graphs than the ones that come built in to R.
Use the `library` function to load it into R so you can use some of them.

```R
xyplot(math ~ reading, sat)
```

`xyplot` is lattice's equivalent to the `plot()` function. It works the same way, but gives
you pretty colors.

So we've got some charts. That's great. Before the end, I will tease you with a tiny bit of stats.

## RUNNIN' STATS ON DATA

```R
fit = lm(math ~ reading, sat)
```

This `lm()` function runs a linear regression on the data we visualized with a scatterplot. Very crudely, it tries
to measure to what extent there's a linear relationship between math and reading scores. A linear relationship means
"as one goes up, so does the other, at a roughly steady rate." The scatterplot suggested that schools with higher math
scores tend to have higher reading scores; this is a rigorous way of capturing that relationship.
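
To peek inside a fitted regression, here is a sketch using a made-up dataframe (the names `toy` and `toy_fit` and all values are invented). `coef` pulls out the intercept and slope of the line, and `predict` uses the line to estimate a math score for a given reading score.

```R
# Made-up mean scores for five schools.
toy = data.frame(
  reading = c(340, 380, 395, 410, 450),
  math = c(330, 400, 390, 430, 470)
)

toy_fit = lm(math ~ reading, toy)

# Intercept and slope of the fitted line. The slope is how much math
# tends to change for each extra point of reading.
coef(toy_fit)

# R-squared: the share of the variation in math the line accounts for (0 to 1).
summary(toy_fit)$r.squared

# Estimated math score for a school with a mean reading score of 400.
predict(toy_fit, data.frame(reading = 400))
```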
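
```R
plot(math ~ reading, sat)  # redraw the base scatterplot; abline can't draw on the lattice chart
abline(fit)
```

This function will take the linear regression object generated above, and add a 'line of best fit' to
our scatterplot. It draws on the most recent `plot()` chart, not on the lattice one, which is why the scatterplot
gets redrawn first.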