Mar 27th, 2017
Recall from the swirl lesson called "Working with Variables" that if you have missing values, R will return
NA when you ask for statistics such as the mean and sd. Therefore, the first two lines of code
in the chunk below add na.rm=TRUE, which tells R to ignore the missing values when it computes the mean and sd.
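As a minimal sketch of that behavior, the chunk below uses a small made-up vector to show the result with and without na.rm=TRUE. The name `toy_hours` and its values are hypothetical and are not part of the GSS data.

```{r}
# Hypothetical toy data, not taken from the GSS: five responses, one missing
toy_hours <- c(2, 4, NA, 1, 3)

# Without na.rm, a single missing value makes the result NA
mean(toy_hours)

# With na.rm=TRUE, R drops the NA and uses the remaining four values
mean(toy_hours, na.rm=TRUE)   # (2 + 4 + 1 + 3) / 4 = 2.5
sd(toy_hours, na.rm=TRUE)
```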
```{r}
mean(GSS$tvhours, na.rm=TRUE)
sd(GSS$tvhours, na.rm=TRUE)

# Note the added code section for geom_vline in the histogram below. Add title and labels.
ggplot_tvhours <- ggplot(GSS, aes(tvhours))
ggplot_tvhours + geom_histogram(binwidth=1, aes(y=(..count../sum(..count..))*100)) +
  ggtitle("") +
  labs(y="Percent", x="") +
  geom_vline(xintercept=mean(GSS$tvhours, na.rm=TRUE), color="blue", linetype="dashed", size=1)
```
Describe what the code segment that begins with geom_vline adds to your histogram. Why is the line where it is?
Are you surprised by the placement of the vertical line? Be specific.

Now let's convert (that is, transform) the unit of measurement from hours to minutes and
create a new variable that measures the number of minutes spent watching TV per day.
```{r}
GSS$tvmins <- GSS$tvhours*60
mean(GSS$tvmins, na.rm=TRUE)
sd(GSS$tvmins, na.rm=TRUE)
```
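To see what a change of units does without the distraction of real data, here is a small sketch on made-up values. The vectors `toy_hours` and `toy_mins` are hypothetical and only illustrate the conversion; compare the pattern with what you get for the GSS variables.

```{r}
# Hypothetical toy data, not taken from the GSS
toy_hours <- c(0, 1, 2, 3, 4)
toy_mins <- toy_hours * 60   # same conversion as GSS$tvmins

mean(toy_hours)   # 2
mean(toy_mins)    # 120

# Compare the two standard deviations and their ratio
sd(toy_hours)
sd(toy_mins)
sd(toy_mins) / sd(toy_hours)
```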

Now let's see what happens if we convert the units to standard deviations.
Note that multiplying by 1/sd is the same as dividing by sd.
```{r}
GSS$tvsd <- GSS$tvhours * 1/sd(GSS$tvhours, na.rm=TRUE)
mean(GSS$tvsd, na.rm=TRUE)
sd(GSS$tvsd, na.rm=TRUE)
```
Describe how the means of tvmins and tvsd are related to the mean of tvhours.

What is unique about the standard deviation of tvsd?

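One way to check your answers to the two questions above is the general rule for multiplying every value by a positive constant. The symbols here are notation introduced for this sketch, not taken from the lab: $x_1, \dots, x_n$ are the observed values, $\bar{x}$ is their mean, $s$ is their standard deviation (with the $n-1$ denominator that R's sd uses), and $c > 0$ is the constant.

$$
\operatorname{mean}(c\,x) = \frac{1}{n}\sum_{i=1}^{n} c\,x_i = c\,\bar{x},
\qquad
\operatorname{sd}(c\,x) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(c\,x_i - c\,\bar{x}\right)^2} = c\,s
$$

Setting $c = 60$ corresponds to tvmins, and setting $c = 1/s$ corresponds to tvsd.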

Now let's make histograms of the new variables. Notice that we are changing the binwidth.
Think about why we are changing the binwidth to 60 for minutes and 1/sd(tvhours)
for standard deviations.
If you want, after you run them this way, you can change the binwidth back to 1 and see what happens.
```{r}
ggplot_tvmins <- ggplot(GSS, aes(tvmins))
ggplot_tvmins + geom_histogram(binwidth=60, aes(y=(..count../sum(..count..))*100)) +
  ggtitle("") +
  labs(y="Percent", x="") +
  geom_vline(xintercept=mean(GSS$tvmins, na.rm=TRUE), color="green", linetype="dashed", size=1)
```

```{r}
ggplot_tvsd <- ggplot(GSS, aes(tvsd))
ggplot_tvsd +
  geom_histogram(binwidth=1/sd(GSS$tvhours, na.rm=TRUE), aes(y=(..count../sum(..count..))*100)) +
  ggtitle("") +
  labs(y="Percent", x="") +
  geom_vline(xintercept=mean(GSS$tvsd, na.rm=TRUE), color="purple", linetype="dashed", size=1)
```
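The binwidth choices in the two chunks above can be explored with a small sketch in base R. It uses hist() with explicit breaks rather than ggplot, and the toy vector and break points are made up for illustration. The idea is that a bin one hour wide covers the same observations as a bin 60 minutes wide, or a bin 1/sd standard deviations wide, once the break points are rescaled along with the data.

```{r}
# Hypothetical toy data, not taken from the GSS
toy_hours <- c(0, 1, 1, 2, 2, 2, 3, 5)
toy_mins <- toy_hours * 60
toy_sd <- toy_hours / sd(toy_hours)

# Bins one hour wide, then the same bins expressed in the other two units
breaks_hours <- seq(-0.5, 5.5, by=1)
breaks_mins <- breaks_hours * 60
breaks_sd <- breaks_hours / sd(toy_hours)

counts_hours <- hist(toy_hours, breaks=breaks_hours, plot=FALSE)$counts
counts_mins <- hist(toy_mins, breaks=breaks_mins, plot=FALSE)$counts
counts_sd <- hist(toy_sd, breaks=breaks_sd, plot=FALSE)$counts

counts_hours   # 1 2 3 1 0 1
identical(counts_hours, counts_mins)
identical(counts_hours, counts_sd)
```

Because each observation lands in the matching bin in every unit, the three histograms have the same shape and only the x-axis labels differ.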

Now let's create some new variables where, instead of the actual values, we have the difference between each value and the mean.
This is sometimes called "recentering." Then we will get the standard deviation and histogram for each new variable.
```{r}
GSS$tvhours0 <- GSS$tvhours - mean(GSS$tvhours, na.rm=TRUE)
sd(GSS$tvhours0, na.rm=TRUE)
ggplot_tvhours0 <- ggplot(GSS, aes(tvhours0))
ggplot_tvhours0 + geom_histogram(binwidth=1, aes(y=(..count../sum(..count..))*100)) +
  ggtitle("") +
  labs(y="Percent", x="") +
  geom_vline(xintercept=mean(GSS$tvhours0, na.rm=TRUE), color="blue", linetype="dashed", size=.5)
```

```{r}
GSS$tvmins0 <- GSS$tvmins - mean(GSS$tvmins, na.rm=TRUE)
sd(GSS$tvmins0, na.rm=TRUE)
ggplot_tvmins0 <- ggplot(GSS, aes(tvmins0))
ggplot_tvmins0 +
  geom_histogram(binwidth=60, aes(y=(..count../sum(..count..))*100)) +
  ggtitle("") +
  labs(y="Percent", x="") +
  geom_vline(xintercept=mean(GSS$tvmins0, na.rm=TRUE), color="green", linetype="dashed", size=.5)
```

```{r}
GSS$tvsd0 <- GSS$tvsd - mean(GSS$tvsd, na.rm=TRUE)
sd(GSS$tvsd0, na.rm=TRUE)
ggplot_tvsd0 <- ggplot(GSS, aes(tvsd0))
ggplot_tvsd0 +
  geom_histogram(binwidth=1/sd(GSS$tvhours0, na.rm=TRUE), aes(y=(..count../sum(..count..))*100)) +
  ggtitle("") +
  labs(y="Percent", x="") +
  geom_vline(xintercept=mean(GSS$tvsd0, na.rm=TRUE), color="purple", linetype="dashed", size=.5)
```
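Before answering the questions below, it may help to try the same subtraction on a few made-up numbers. This is only a sketch: `toy_hours` and `toy_hours0` are hypothetical names that mirror the tvhours and tvhours0 pattern used above.

```{r}
# Hypothetical toy data, not taken from the GSS
toy_hours <- c(1, 2, 3, 6)
mean(toy_hours)   # 3

# Subtract the mean from every value, as in the chunks above
toy_hours0 <- toy_hours - mean(toy_hours)
toy_hours0        # -2 -1 0 3

# Compare the mean and sd before and after the subtraction
mean(toy_hours0)
sd(toy_hours)
sd(toy_hours0)
```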
Why is subtracting the mean from each value sometimes called "recentering"?

What does a negative value on these variables mean?

Describe how the standard deviations for these recentered variables compare to the standard deviations
for the previous (comparable) variables.

When we convert the value of an observation into units of "standard deviations above the mean"
or "standard deviations below the mean," those new scores are called Z-SCORES.
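
As a short illustration of that definition, the chunk below computes z-scores by hand for a made-up vector and compares them with base R's scale() function, which centers and rescales a variable in one step. The names `toy_hours` and `toy_z` are hypothetical, and this is a sketch rather than a required part of the lab.

```{r}
# Hypothetical toy data, not taken from the GSS
toy_hours <- c(1, 2, 3, 6)

# Recenter (subtract the mean), then convert to standard deviation units
toy_z <- (toy_hours - mean(toy_hours)) / sd(toy_hours)
toy_z

# Negative z-scores are below the mean, positive ones are above it
mean(toy_z)
sd(toy_z)

# scale() performs the same two steps; as.vector() drops its matrix attributes
all.equal(toy_z, as.vector(scale(toy_hours)))
```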