- Recall from the swirl lesson "Working with Variables" that if a variable has missing values, R returns NA when you
- ask for statistics such as the mean and sd. The first two lines of code in the chunk below therefore add
- na.rm=TRUE, which tells R to ignore the missing values when computing the mean and sd.
- ```{r}
- mean(GSS$tvhours, na.rm=TRUE)
- sd(GSS$tvhours, na.rm=TRUE)
- # Note the added geom_vline() layer in the histogram below. Add a title and axis labels.
- ggplot_tvhours <-ggplot(GSS, aes(tvhours))
- ggplot_tvhours + geom_histogram(binwidth =1, aes(y=(..count../sum(..count..))*100)) +
- ggtitle("") +
- labs(y="Percent", x="") +
- geom_vline(xintercept=mean(GSS$tvhours, na.rm=TRUE), color="blue", linetype="dashed", size=1)
- ```
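As a quick check of what na.rm=TRUE does, here is a minimal sketch on a made-up vector. The name `hours_toy` and its values are hypothetical and are not taken from the GSS data.

```r
# Hypothetical toy vector with one missing value (not GSS data)
hours_toy <- c(1, 2, NA, 5)

# Without na.rm, the missing value propagates and the result is NA
mean_with_na <- mean(hours_toy)

# With na.rm = TRUE, R drops the NA and uses the remaining three values
mean_without_na <- mean(hours_toy, na.rm = TRUE)
sd_without_na <- sd(hours_toy, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
print(sd_without_na)
```

The same pattern applies to GSS$tvhours: without na.rm=TRUE, a single missing response is enough to make the summary NA.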
- Describe what the code segment that begins with geom_vline adds to your histogram. Why is the line where it is?
- Are you surprised by the placement of the vertical line? Be specific.
- Now let's convert (AKA transform) the unit of measurement from hours to minutes and
- create a new variable that measures the number of minutes spent watching tv per day.
- ```{r}
- GSS$tvmins <-GSS$tvhours*60
- mean(GSS$tvmins, na.rm=TRUE)
- sd(GSS$tvmins, na.rm=TRUE)
- ```
- Now let's see what happens if we convert the units to standard deviations.
- Note that multiplying by 1/sd is the same as dividing by sd.
- ```{r}
- GSS$tvsd <- GSS$tvhours*1/sd(GSS$tvhours, na.rm=TRUE)
- mean(GSS$tvsd, na.rm=TRUE)
- sd(GSS$tvsd, na.rm=TRUE)
- ```
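The note that multiplying by 1/sd equals dividing by sd can be checked directly. This is a small sketch on a hypothetical vector (`hours_toy` is made up, not GSS data). It relies on R evaluating `*` and `/` left to right with equal precedence.

```r
# Hypothetical toy vector with one missing value (not GSS data)
hours_toy <- c(0, 1, 2, 2, 4, NA)
s <- sd(hours_toy, na.rm = TRUE)

# * and / have equal precedence and evaluate left to right,
# so hours_toy * 1 / s is computed as (hours_toy * 1) / s
scaled_a <- hours_toy * 1 / s
scaled_b <- hours_toy / s

# TRUE when both versions agree (the NA stays NA in both)
print(isTRUE(all.equal(scaled_a, scaled_b)))
```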
- Describe how the means of tvmins and tvsd are related to the mean of tvhours.
- What is unique about the standard deviation of tvsd?
- Now let's make histograms of the new variables. Notice that we are changing the binwidth.
- Think about why we are changing the binwidth to 60 for minutes and to 1/sd(tvhours)
- for standard deviations.
- If you want, after you run them this way, you can change the binwidth back to 1 and see what happens.
- ```{r}
- ggplot_tvmins <-ggplot(GSS, aes(tvmins))
- ggplot_tvmins + geom_histogram(binwidth =60, aes(y=(..count../sum(..count..))*100)) +
- ggtitle("")+
- labs(y="Percent", x="") +
- geom_vline(xintercept=mean(GSS$tvmins, na.rm=TRUE), color="green", linetype="dashed", size=1)
- ```
- ```{r}
- ggplot_tvsd <-ggplot(GSS, aes(tvsd))
- ggplot_tvsd +
- geom_histogram(binwidth =1/sd(GSS$tvhours, na.rm=TRUE), aes(y=(..count../sum(..count..))*100)) +
- ggtitle("") +
- labs(y="Percent", x="") +
- geom_vline(xintercept=mean(GSS$tvsd, na.rm=TRUE), color="purple", linetype="dashed", size=1)
- ```
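One way to reason about the binwidth choice is to bin a few values by hand. The sketch below uses a hypothetical vector and integer division as a simplified stand-in for binning. It is not ggplot2's exact binning algorithm, and `hours_toy` is made up rather than drawn from the GSS data.

```r
# Hypothetical toy values in hours (not GSS data)
hours_toy <- c(0, 1, 1, 2, 2, 2, 3, 5)
mins_toy <- hours_toy * 60

# Assign each value to a bin by integer division by the bin width
# (a simplification; ggplot2 also chooses where bins start)
bins_hours <- table(hours_toy %/% 1)   # bins that are 1 hour wide
bins_mins <- table(mins_toy %/% 60)    # bins that are 60 minutes wide

print(as.vector(bins_hours))
print(as.vector(bins_mins))
```

Scaling the bin width by the same factor as the data keeps every observation in the same bin, so the two sets of counts match.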
- Now let's create some new variables where, instead of the actual values, we store the difference between each value and the mean.
- This is sometimes called "recentering." Then we will get the standard deviation and a histogram for each new variable.
- ```{r}
- GSS$tvhours0<-GSS$tvhours - mean(GSS$tvhours, na.rm=TRUE)
- sd(GSS$tvhours0, na.rm=TRUE)
- ggplot_tvhours0 <-ggplot(GSS, aes(tvhours0))
- ggplot_tvhours0 + geom_histogram(binwidth =1, aes(y=(..count../sum(..count..))*100)) +
- ggtitle("") +
- labs(y="Percent", x="") +
- geom_vline(xintercept=mean(GSS$tvhours0, na.rm=TRUE), color="blue", linetype="dashed", size=.5)
- ```
- ```{r}
- GSS$tvmins0<-GSS$tvmins - mean(GSS$tvmins, na.rm=TRUE)
- sd(GSS$tvmins0, na.rm=TRUE)
- ggplot_tvmins0 <-ggplot(GSS, aes(tvmins0))
- ggplot_tvmins0 +
- geom_histogram(binwidth =60, aes(y=(..count../sum(..count..))*100)) +
- ggtitle("") + labs(y="Percent", x="") +
- geom_vline(xintercept=mean(GSS$tvmins0, na.rm=TRUE), color="green", linetype="dashed", size=.5)
- ```
- ```{r}
- GSS$tvsd0<-GSS$tvsd - mean(GSS$tvsd, na.rm=TRUE)
- sd(GSS$tvsd0, na.rm=TRUE)
- ggplot_tvsd0 <-ggplot(GSS, aes(tvsd0))
- ggplot_tvsd0 +
- geom_histogram(binwidth =1/sd(GSS$tvhours0, na.rm=TRUE), aes(y=(..count../sum(..count..))*100)) +
- ggtitle("") +
- labs(y="Percent", x="") +
- geom_vline(xintercept=mean(GSS$tvsd0, na.rm=TRUE), color="purple", linetype="dashed", size=.5)
- ```
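To see what recentering does to individual values, here is a minimal sketch on a hypothetical vector whose mean is exactly 2 (`hours_toy` is made up, not GSS data).

```r
# Hypothetical toy values in hours with mean 2 (not GSS data)
hours_toy <- c(0, 1, 2, 2, 5)

# Subtract the mean from every value
centered_toy <- hours_toy - mean(hours_toy)

print(centered_toy)        # -2 -1  0  0  3
print(mean(centered_toy))  # 0
```

Each recentered value is the distance from the original mean, with the sign showing which side of the mean the observation falls on.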
- Why is subtracting the mean from each value sometimes called "recentering"?
- What does a negative value on these variables mean?
- Describe how the standard deviations for these recentered variables compare to the standard deviations
- for the previous (comparable) variables.
- When we convert the value of an observation into units of "standard deviations above the mean"
- or "standard deviations below the mean," those new scores are called Z-SCORES.
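A z-score combines the two steps used above: subtract the mean, then divide by the standard deviation. The sketch below computes z-scores by hand on a hypothetical vector (`hours_toy` is made up, not GSS data) and compares the result with base R's scale(), which centers and scales in one call.

```r
# Hypothetical toy values in hours with one missing value (not GSS data)
hours_toy <- c(0, 1, 2, 2, 5, NA)

# Manual z-scores: recenter, then express in standard deviation units
z_manual <- (hours_toy - mean(hours_toy, na.rm = TRUE)) /
  sd(hours_toy, na.rm = TRUE)

# scale() returns a one-column matrix; as.numeric() drops that structure
z_scale <- as.numeric(scale(hours_toy))

print(round(z_manual, 2))
print(isTRUE(all.equal(z_manual, z_scale)))
```

For the lab data, the same formula applied to GSS$tvhours would give the values stored in tvsd0.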