Advertisement
Guest User

Untitled

a guest
Feb 20th, 2019
173
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
R 29.71 KB | None | 0 0
  1. ---
  2. title: Regression
  3. author: Blanca Calvo
  4. date: "Generation date: `r format(Sys.time(), '%b %d, %Y - %H:%M:%S')`"
  5. output:
  6.   html_document:
  7.     toc: true
  8.     code_folding: show
  9.     toc_float:
  10.         collapsed: false
  11.         smooth_scroll: true
  12.     number_sections: true
  13. ---
  14.  
  15. ```{r setup, message=F, echo=F}
  16. # some custom options to generate a nice html file
  17. options(digits = 3)
  18. options(width=100)
  19. packages <- c("knitr","pander")
  20. if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  21.   install.packages(setdiff(packages, rownames(installed.packages())), repos='http://cran.us.r-project.org')
  22. }
  23.  
  24.  
  25. library(knitr)
  26. opts_chunk$set(cache=F, comment='#', tidy = T, tidy.opts=list(blank=T, width.cutoff=100), fig.align='center', message=F, warning=T, fig.width=6, fig.height=6, dev='svg',eval=T)
  27.  
  28. library(pander)
  29. panderOptions('table.alignment.default',function(df) ifelse(sapply(df, is.numeric), 'right', 'left'))
  30.  
  31. library(dplyr)
  32. library(ggplot2)
  33. library(cowplot)
  34.  
  35. ```
  36.  
  37. # Regression
  38.  
  39. ## Introduction
  40.  
  41. The data we will use in the first part is a hypothetical study on child language acquisition (created by a [colleague of mine](http://coltekin.net/cagri/R/r-exercisesse8.html) and slightly adapted by me). We want to investigate the effects of amount of time spend in front of TV to two-year-old children's language development. The response variable in this data set, `cdi`, is a standard measure of children's language abilities based on parental reports. The predictor of interest is `tv.hours`, which is the weekly hours of TV time for each child.
  42.  
  43. You will have to fill in most commands yourself, but this should be feasible given the slides of the lecture, which can be viewed here: http://www.let.rug.nl/wieling/Statistics/Regression-large. While you can just enter the commands in RStudio, it is also possible to modify the source of this so-called R-markdown file directly in RStudio and press the "Knit HTML" button to generate an html file which contains both the commands you've used and their output. You can download the file to your current working directory in R by pasting the following command: `download.file('http://www.let.rug.nl/wieling/Statistics/Regression-large/lab/lab.Rmd', 'lab.Rmd')`. You can then open this file in RStudio. In this file, all R commands which are located within chunks (beginning and ending with three backticks) will be evaluated. Creating an R markdown file is very useful as your analysis becomes reproducible and easy to check for others. Note that chunks have options, with which you can customize the output. See for more information: http://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf
  44.  
  45. Note that the significance threshold $\alpha$ is set to 0.05 as usual.
  46.  
  47. ## Importing the data
  48. We will first download the data (which was saved as a dataframe in R: an `.rda` file), and load it into R.
  49.  
  50. ```{r import}
  51. download.file('http://www.let.rug.nl/wieling/Statistics/Regression-large/lab/tv.rda', 'tv.rda')
  52. load('tv.rda') # an rda file can be loaded with the command load
  53. ```
  54.  
  55. ## Structure of the data
  56.  
  57. Investigate the data with descriptive statistics and plots. Also calculate the correlations between the numerical variables.
  58.  
  59. ### Summary of the data
  60. ```{r}
  61. summary(tv)
  62. ```
  63. From the summary we see it's a table of 80 observations with 40 female and 40 male. The cdi mean is 99.9 and the median 100 and the value go from 97 to 103. The mean of the hours spent in front of tv is 11.7 and the median is 11.0, being 2 hours the minmun value and 23 hours the maximum.
  64.  
  65. ### Point plots for relation of the different columns and cdi score
  66.  
  67. Here we want to explore the possible effects of the collected data on the cdi score based on gender, by plotting in a points graph.
  68. ```{r}
  69. plot1 <- tv%>%
  70.   ggplot(aes(x = tv.hours, y = cdi, col = gender))+
  71.   geom_point()+
  72.   #geom_line(aes (y = mean(tv$cdi)), col = "grey") +
  73.   labs(y = "cdi score", x = "hours in front of tv") +
  74.   ggtitle('hours in tv')+
  75.   theme(text = element_text(size=10))
  76.  
  77. plot2 <- tv%>%
  78.   ggplot(aes(x = daycare.hours, y = cdi, col = gender))+
  79.   geom_point()+
  80.   #geom_line(aes (y = mean(tv$cdi)), col = "grey") +
  81.   labs(y = "cdi score", x = "daycare hours") +
  82.   ggtitle('daycare hours')+
  83.   theme(text = element_text(size=10))
  84.  
  85. plot3 <- tv%>%
  86.   ggplot(aes(x = book.reading, y = cdi, col = gender))+
  87.   geom_point()+
  88.   #geom_line(aes (y = mean(tv$cdi)), col = "grey") +
  89.   labs(y = "cdi score", x = "book reading hours") +
  90.   ggtitle('reading hours')+
  91.   theme(text = element_text(size=10))
  92.  
  93. plot4 <- tv%>%
  94.   ggplot(aes(x = mot.education, y = cdi, col = gender))+
  95.   geom_point()+
  96.   #geom_line(aes (y = mean(tv$cdi)), col = "grey") +
  97.   labs(y = "cdi score", x = "mot.education") +
  98.   ggtitle('mot.education')+
  99.   theme(text = element_text(size=10))
  100.  
  101. plot_grid(plot1, plot2, plot3, plot4)
  102. ```
  103. The first thing that seems clear by these plots is that male cdi scores are lower than female cdi scores. The increase of the hours in front of the tv seem to have a negative effect on the cdi score. The daycare hours don't show a clear effect on cdi score in this plot. Both the reading hours and the mot.education seem to have a positive effect on the cdi score.
  104.  
  105.  
  106. ### Differences between gender
  107.  
  108. Seeing the previous results, we want to confirm the existence of a difference between cdi scores based on gender. We calculate the mean and the median grouped by gender and boxplot the grouped data.
  109. ```{r}
  110. # mean of cdi per gender
  111.  
  112. tv%>%
  113.  group_by(gender)%>%
  114.  summarise(mean= mean(cdi),
  115.            median = median(cdi))
  116.  
  117. #quartiles
  118.  
  119. tapply(tv$cdi, tv$gender, quantile)
  120.  
  121. #boxplot
  122.  
  123. boxplot(tv$cdi ~ tv$gender, xlab="Gender", ylab="cdi")
  124.  
  125. ```
  126.  
  127. We see that both the mean and the median are higher for female kids than for male kids. We can also see that quantile 1 of cdi score for females is the same value of quantile 3 for male kids.
  128.  
  129. ### Correlation of the different values
  130.  
  131. ```{r}
  132.  
  133. cor(tv[, c("cdi","tv.hours", "mot.education", "daycare.hours", "book.reading")])
  134.  
  135. ```
  136.  
  137. With the new data we can confirm that the correllation between cdi and hours in tv is negative. We also see that mot.education and book.reading have a positive correlation with cdi and daycare.hours has a weak correlation with cdi. For the rest of the correlations we can higlight the positive relation between mot.education and book.reading (the meaning of the variable mot.education is not known).
  138.  
  139. ## Regression modeling
  140.  
  141. In this section, we will fit the best model for the data, predicting the `cdi` language development score on the basis of various predictors.  
  142.  
  143. ### The best model without interactions
  144.  
  145. Fit the best model without taking into account interactions.
  146. ```{r lm}
  147. m1 <- lm (cdi~tv.hours, tv)
  148. m2 <- lm (cdi~tv.hours + gender, tv)
  149. AIC(m1) - AIC(m2)
  150. m3 <- lm (cdi~tv.hours + mot.education + gender, tv)
  151. AIC(m2) - AIC(m3)
  152. m4 <- lm (cdi~tv.hours + mot.education + book.reading + gender, tv)
  153. AIC(m3) - AIC(m4)
  154. m5 <- lm (cdi~tv.hours + mot.education + daycare.hours + book.reading + gender, tv)
  155. AIC(m4) - AIC(m5)
  156.  
  157. summary(m4)
  158. ```
  159. The AIC comparison is positive and higher than 2 for all the more complex models except from the last, in which daycare.hours is included. These results are in line with our expectations from the exploration part. Fort this reason, we exclude daycare.hours from our model and keep all the other variables.
  160.  
  161.  
  162. ### Visualize the results
  163.  
  164. ```{r vis}
  165. library(visreg)
  166. visreg(m4)
  167.  
  168. ```
  169.  
  170. As expected in the exploration of the data, tv.hours has a negative and significant effect in cdi score, while mot.education and book.reading have a positive effect in cdi score. Being a female has a positive effect in the cdi score.
  171.  
  172.  
  173. ### Assess the regression assumptions discussed in the lecture
  174.  
  175. #### Linearity
  176. ```{r linearity}
  177. # linearity -> que la relació tingui una forma molt diferent a la lineal. ex: parabòlica
  178. par(mfrow = c(2, 2))
  179. library(car)
  180. crPlot(m4, var = "tv.hours")
  181. crPlot(m4, var = "mot.education")
  182. crPlot(m4, var = "book.reading")
  183. crPlot(m4, var = "gender")
  184.  
  185. ```
  186.  
  187. The variables tv.hours and book.reading seem to have a linear effect on cdi score. Nontheless, the variable mot.education suggests a stagnation after 17. How to assess linearity in categorical variables??
  188.  
  189. #### Autocorrelation
  190. ```{r autocorrelation}
  191. # autocorrelation in residuals
  192. durbinWatsonTest(m4)
  193.  
  194. acf(resid(m4))
  195. ```
  196.  
  197. Autocorrelation in mainly important for time series data.
  198.  
  199. #### Multicollinearity
  200. ```{r multicollinearity}
  201. # multicollinearity -> que unes variables independents expliquin les altres, llavors no són independents
  202. vif(m4) #interpretar i arreglar si cal
  203.  
  204. summary(lm(tv.hours ~ mot.education, data = tv))$r.squared
  205. summary(lm(tv.hours ~ book.reading, data = tv))$r.squared
  206. summary(lm(book.reading ~ mot.education, data = tv))$r.squared
  207. summary(lm(tv.hours ~ gender, data = tv))$r.squared
  208. summary(lm(mot.education ~ gender, data = tv))$r.squared
  209. summary(lm(book.reading ~ gender, data = tv))$r.squared
  210.  
  211. ```
  212. None of the variables used in the model explains more than 50% of another variable. For this reason, there is no collinearity.
  213.  
  214.  
  215. #### Homoscedacity
  216. ```{r homoscedacity}
  217.  
  218. # homoscedasticity
  219.  
  220. plot(fitted(m4), resid(m4)) # interpretar i arreglar si cal
  221. ncvTest(m4)
  222. ```
  223. There seems to be no heteroscedacity looking at the graph. The p-value is too big to asses a significant heteroscedacity.
  224.  
  225. #### Distribution of residuals
  226. ```{r distribution}
  227. # distribution of residuals
  228.  
  229. shapiro.test(resid(m4))$p.value
  230. qqnorm(resid(m4))
  231. qqline(resid(m4))
  232.  
  233. ```
  234. Residuals are normally distributed.
  235.  
  236. ### Are interactions necessary?
  237.  
  238. Fit the best model while taking into account interactions. For simplicity and speed, we'll only investigate potential interactions with gender. Also visualize the results if an interaction is necessary.
  239.  
  240. ```{r inter}
  241.  
  242. m6 <- lm (cdi~tv.hours + mot.education + book.reading + gender + tv.hours:gender, tv)
  243. AIC(m4)-AIC(m6)
  244. m7 <- lm (cdi~tv.hours + mot.education + book.reading + gender + mot.education:gender, tv)
  245. AIC(m4)-AIC(m7)
  246. m8 <- lm (cdi~tv.hours + mot.education + book.reading + gender + book.reading:gender, tv)
  247. AIC(m4)-AIC(m8)
  248.  
  249. ```
  250. None of the AIC is lower than the one obtained by our best model. This indicates that interactions are not needed.
  251.  
  252. ### Effect size
  253.  
  254. Obtain the effect size, both of the full model and the individual predictors
  255.  
  256. ```{r eff}
  257. library(lmSupport)
  258. modelEffectSizes(m4)
  259. summary(m4)
  260. ```
  261. The model overall effect size of the model is of 0.636 (adjusted), which is a large effect. Individually, gender is the only predictor with a large effect (0.229).
  262.  
  263.  
  264. ### Model criticism
  265.  
  266. Apply model criticism by excluding observations with residuals outside 2.5 SD.  
  267.  
  268. ```{r mc}
  269. tv2 <- tv[abs(scale(resid(m4))) < 2.5, ]
  270. sum(abs(scale(resid(m4))) >= 2.5)
  271.  
  272. m42 <- lm(cdi~tv.hours + mot.education + book.reading + gender, tv2)
  273. summary(m42)
  274.  
  275. ```
  276. Excluding observations with absolute residuals larger than 2.5 standard deviations above the mean deletes 2 observations from out data. The p-value doesn't change after excluding the outliers, and the patterns look the same.
  277.  
  278.  
  279. ### Bootstrap sampling
  280.  
  281. Conduct bootstrap sampling with 1000 repetitions.
  282.  
  283. ```{r bs}
  284. summary(m4.boot <- Boot(m42, R = 1000))
  285.  
  286.  
  287. ```
  288. The results of the bootstrap sampling are very similar to the original correlation results of the model, which proofs that the model is robust.
  289.  
  290. # Logistic regression
  291.  
  292. ## Introduction
  293. In this assignment we will use logistic regression to assess the differences in pronunciation at the end of syllables in New York during the 1950s and 1960s depending on the socioeconomic class (phoneme [r] vs [&#601;] (schwa)).
  294.  
  295. We use the following data:
  296.  
  297.  
  298. <table style="width:28%;"><colgroup><col width="13%"></col><col width="13%"></col></colgroup><thead><tr class="header"><th align="left">outcome</th><th align="left">status</th></tr></thead><tbody><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">upper</td></tr><tr class="even"><td align="left">r</td><td align="left">upper</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">upper</td></tr><tr class="even"><td align="left">schwa</td><td align="left">upper</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">upper</td></tr><tr class="even"><td align="left">schwa</td><td align="left">upper</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">upper</td></tr><tr class="even"><td align="left">schwa</td><td align="left">upper</td></tr><tr class="odd"><td align="left">r</td><td align="left">middle</td></tr><tr class="even"><td align="left">r</td><td align="left">middle</td></tr><tr class="odd"><td align="left">r</td><td align="left">middle</td></tr><tr class="even"><td align="left">r</td><td align="left">middle</td></tr><tr class="odd"><td align="left">r</td><td align="left">middle</td></tr><tr class="even"><td align="left">r</td><td align="left">middle</td></tr><tr class="odd"><td align="left">r</td><td align="left">middle</td></tr><tr class="even"><td align="left">r</td><td align="left">middle</td></tr><tr class="odd"><td align="left">r</td><td align="left">middle</td></tr><tr class="even"><td align="left">r</td><td align="left">middle</td></tr><tr class="odd"><td align="left">r</td><td align="left">middle</td></tr><tr class="even"><td align="left">r</td><td align="left">middle</td></tr><tr class="odd"><td align="left">r</td><td align="left">middle</td></tr><tr class="even"><td align="left">r</td><td align="left">middle</td></tr><tr class="odd"><td align="left">r</td><td align="left">middle</td></tr><tr class="even"><td align="left">r</td><td align="left">middle</td></tr><tr class="odd"><td align="left">r</td><td align="left">middle</td></tr><tr class="even"><td align="left">r</td><td align="left">middle</td></tr><tr class="odd"><td align="left">r</td><td align="left">middle</td></tr><tr class="even"><td align="left">r</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="even"><td align="left">schwa</td><td align="left">middle</td></tr><tr class="odd"><td align="left">r</td><td align="left">lower</td></tr><tr class="even"><td align="left">r</td><td align="left">lower</td></tr><tr class="odd"><td align="left">r</td><td align="left">lower</td></tr><tr class="even"><td align="left">r</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="odd"><td align="left">schwa</td><td align="left">lower</td></tr><tr class="even"><td align="left">schwa</td><td align="left">lower</td></tr></tbody></table>
  299.  
  300. ```{r, results='asis', echo=F, eval=F}
  301. status <- c(rep("upper", 36), rep("middle",94), rep("lower",54))
  302. pronunciation <- c(rep("r", 30), rep("schwa", 6), rep("r", 20), rep("schwa", 74), rep("r", 4), rep("schwa", 50))
  303.  
  304. # create the data frame
  305. dat <- data.frame(pronunciation, status)
  306. pandoc.table(dat)
  307. ```
  308.  
  309. ## Create data frame
  310.  
  311. Use R code to generate this data without having to type all the observations (there are 184 observations...). Hint: you can use the function `c()` to concatenate information and function `rep()` to repeat one element multiple times, e.g. `rep("upper",36)` will create a vector in which `"upper"` is repeated 36 times. In the end you should have a data frame with 1 dependent variable (`pronunciation`, with 2 levels: `schwa` and `r`) and 1 predictor (`status`, with 3 levels: `lower`, `middle` and `upper`).
  312.  
  313. ```{r createtable}
  314. status <- c(rep("upper", 36), rep("middle",94), rep("lower",54))
  315. pronunciation <- c(rep("r", 30), rep("schwa", 6), rep("r", 20), rep("schwa", 74), rep("r", 4), rep("schwa", 50))
  316.  
  317. # create the data frame
  318. dat <- data.frame(pronunciation, status)
  319. pandoc.table(dat)
  320.  
  321. ```
  322.  
  323. Furthermore, make sure that both variables are coded as factors and show that this indeed is the case. Set `lower` as your reference level for the predictor `status`. Now add an additional predictor `IsR` which is equal to 1 if `pronunciation` equal [r] and 0 otherwise. This will be our dependent variable.  
  324.  
  325. ```{r factors}
  326. levels(dat$status)
  327. levels(dat$pronunciation)
  328.  
  329. library(stringr)
  330.  
  331. for (i in dat){
  332.  if (str_detect(pronunciation[i], "r")){
  333.    dat$Isr <- 1
  334.  } else{
  335.    dat$Isr <- 0
  336.  }
  337. }
  338.  
  339. summary(dat)
  340.  
  341. ```
  342.  
  343. ## Logistic regression model
  344.  
  345. Now fit a binomial logistic regression model to the data.
  346.  
  347. ```{r logmod}
  348. #m <- glm(SpeciesVirginica ~ Petal.Area, data = iris, family = "binomial")
  349.  
  350. ```
  351.  
  352. ## Visualize the results
  353.  
  354. Now visualize the results.
  355.  
  356. ```{r vislog}
  357. # your code goes here
  358. ```
  359.  
  360.  
  361. ## Assumptions
  362.  
  363. Check the logistic regression assumptions.
  364.  
  365. ```{r checkassumptions}
  366. # your code goes here
  367. ```
  368.  
  369. ## Goodness of fit
  370.  
  371. Evaluate how well the model performs, by calculating the index of concordance.
  372.  
  373. ```{r goodnessoffit}
  374. # your code goes here
  375. ```
  376.  
  377. ## Interpretation
  378.  
  379. Given your model, what are the odds of `r` to `schwa` when the predictor `status` is at its reference level (`lower`)?
  380. Hint: the estimates in a logistic regression model are in logits, which is the natural logarithm of the odds. To back transform to odds, you'll need to invert the logartithm (how?). Add the R code which calculates these odds. Also calculate the *probability* of the outcome being `r` when status is at its reference level (`lower`).
  381.  
  382. ```{r interpretation1}
  383. # your code goes here
  384. ```
  385.  
  386. Which `status` is associated with a greater probability of pronouncing the [r], instead of [&#601;]? What are the odds of hearing [r] with this status as opposed to the `lower` status?
  387.  
  388. ```{r interpretation2}
  389. # your code goes here
  390. ```
  391.  
  392. # Answers
  393. The answers to the questions in this file can be viewed here: http://www.let.rug.nl/wieling/Statistics/Regression-large/lab/answers. The associated R markdown file can be downloaded here: http://www.let.rug.nl/wieling/Statistics/Regression-large/lab/answers/answers.Rmd
  394.  
  395. # Replication
  396.  
  397. From within RStudio, you can simply download this file using the commands:
  398. ```{r knitrs, eval=F}
  399. # download original file if not already exists (to prevent overwriting)
  400. #if (!file.exists('lab.Rmd')) {
  401. #  download.file('http://www.let.rug.nl/wieling/Statistics/Regression-large/lab/lab.Rmd', 'lab.Rmd')
  402. #}
  403. ```
  404. Subsequently, open it in the editor and use the Knit HMTL button to generate the html file.
  405.  
  406. If you use plain R, you first have to install [Pandoc](http://johnmacfarlane.net/pandoc/). Then copy the following lines to the most recent version of R.
  407.  
  408. ```{r knitr, eval=F}
  409. # install rmarkdown package if not installed
  410. #if(!"rmarkdown" %in% rownames(installed.packages())) {
  411. #   install.packages("rmarkdown")
  412. #}
  413. #library(rmarkdown) # load rmarkdown package
  414.  
  415. # download original file if not already exists (to prevent overwriting)
  416. #if (!file.exists('lab.Rmd')) {
  417. #  download.file('http://www.let.rug.nl/wieling/Statistics/Regression-large/lab/lab.Rmd', 'lab.Rmd')
  418. #}
  419.  
  420. # generate output
  421. #render('lab.Rmd') # generates html file with results
  422.  
  423. # view output in browser
  424. #browseURL(paste('file://', file.path(getwd(),'lab.html'), sep='')) # shows result
  425. ```
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement