- ### Exercise 1:
- ```{r}
- ## Compute population p directly
- ames %>% select(Central.Air) %>% table() %>% prop.table()
- ```
- ### Exercise 2:
- ```{r}
- ## new variable / extract air
- ames <- ames %>% mutate(air = as.numeric(Central.Air=='Y'))
- air <- ames$air
- ## Compute population p in a new way; save it
- pop_p <- sum(air) / nrow(ames)
- pop_p
- ```
- I computed the number of homes with central air, divided it by the total number of homes in `ames`, and saved the result as a separate value.
- ### Exercise 3:
- ```{r}
- ## Compute pop sd in two different ways
- sd(air, na.rm = TRUE)
- sqrt(pop_p*(1-pop_p))
- ```
- **written answer**
- No, they are not exactly the same: `sd()` uses the sample formula (dividing by n - 1), while `sqrt(pop_p*(1-pop_p))` is the population formula (dividing by n). They agree out to four significant figures.
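- As a quick check (a sketch assuming `air` has no missing values), rescaling the sample standard deviation by sqrt((n - 1)/n) recovers the population formula:
- ```{r}
- n <- length(air)
- ## sample sd rescaled to the population sd
- sd(air) * sqrt((n - 1) / n)
- ## should match the population formula sqrt(p * (1 - p))
- sqrt(pop_p * (1 - pop_p))
- ```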
- ### Exercise 4:
- ```{r}
- ## Draw sample
- samp <- sample(air, size=50)
- ## Compute p_hat
- p_hat <- sum(samp) / 50
- p_hat
- ```
- When I run the code repeatedly, the value changes, generally falling anywhere from about 0.86 to 0.98.
- ### Exercise 5:
- After running the line of code over 20 times, most of the answers were larger than `pop_p` (0.933). The answers ranged from 0.86 to 0.98, but most of the time they were 0.94 and above. However, if I ran the line of code many more times and calculated the mean of the results, I wouldn't be surprised if it was very close to `pop_p`.
- ### Exercise 6:
- ```{r}
- ### Try out samples of all different sizes
- samp1 <- sample(air, size=20)
- p_hat1 <- sum(samp1) / 20
- samp2 <- sample(air, size=50)
- p_hat2 <- sum(samp2) / 50
- samp3 <- sample(air, size=200)
- p_hat3 <- sum(samp3) / 200
- p_hat1
- p_hat2
- p_hat3
- ```
- The sample size that tends to be closest to the truth is `samp3` (size 200), while `samp1` (size 20) shows the most variability.
- ### Exercise 7:
- ```{r}
- ## Make a plot
- set.seed(111)
- phats_20 <- replicate(100000, mean(sample(air, size=20)))
- ggplot(data = NULL, aes(x = phats_20)) + geom_histogram()
- ```
- The plot is left-skewed with its center near 0.95. The values range from about 0.7 to 1.0, with most of the data falling between 0.9 and 1.0.
- ### Exercise 8
- ```{r}
- mean(phats_20)
- ```
- This value is very close to the true population proportion (0.9331058), matching it out to four decimal places. It is a slight over-estimate but is very close.
- ### Exercise 9
- ```{r}
- ## Compute SD
- sd(phats_20)
- ```
- The population standard deviation and `sd(phats_20)` are far apart, which makes sense: `sd(phats_20)` measures the spread of sample proportions, not of the raw data. When I instead calculated the standard error with the formula sqrt(p(1 - p)/n), it was very close to the true standard error.
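- The comparison described above can be sketched (assuming `pop_p` and `phats_20` from the earlier chunks) as:
- ```{r}
- ## empirical standard error of the simulated p-hats
- sd(phats_20)
- ## theoretical standard error from the formula sqrt(p(1 - p)/n)
- sqrt(pop_p * (1 - pop_p) / 20)
- ```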
- ### Exercise 10
- ```{r}
- set.seed(111)
- phats_20 <- replicate(100000, mean(sample(air, size=20)))
- ### Fill in for size
- set.seed(111)
- phats_50 <- replicate(100000, mean(sample(air, size=50)))
- ### Fill in for size 200.
- set.seed(111)
- phats_200 <- replicate(100000, mean(sample(air, size=200)))
- ```
- ### Exercise 11
- ```{r}
- ## Two histograms
- ggplot(data = NULL, aes(x = phats_50)) + geom_histogram()
- ggplot(data = NULL, aes(x = phats_200)) + geom_histogram()
- ```
- As sample size increases, the shape of the distribution becomes more normal, and the center starts to approach the true proportion value. There is also a decrease in the spread of the data as sample size increases.
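- One way to see the decreasing spread numerically (assuming the three vectors from Exercise 10) is to compare their standard errors, which shrink roughly like 1/sqrt(n):
- ```{r}
- ## spread of the sampling distribution at each sample size
- sd(phats_20)
- sd(phats_50)
- sd(phats_200)
- ```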
- ### Exercise 12
- The distribution of `phats_20` is not normal because of its left skew. `phats_50` is approaching normality but still has a left skew, though weaker than in `phats_20`. `phats_200` is the closest of the three to a normal distribution, with only a slight left skew.
- ### Exercise 13
- The first and third conditions are met for all three sample sizes; however, the second one isn't. With such a high population proportion, the smaller sample sizes weren't able to meet the success-failure condition, since n(1 - p) falls well below 10.
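- The check can be sketched (assuming `pop_p` from Exercise 2) by computing the expected counts of successes and failures for each sample size:
- ```{r}
- ## success-failure condition: need n*p >= 10 AND n*(1-p) >= 10
- for (n in c(20, 50, 200)) {
-   cat("n =", n, ": successes =", n * pop_p,
-       ", failures =", n * (1 - pop_p), "\n")
- }
- ```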
- ### Exercise 14
- ```{r}
- ## Include the histogram here
- ggplot(data = NULL, aes(x = phats_200)) +
- geom_blank() +
- geom_histogram(bins=30,aes(y = ..density..)) +
- stat_function(fun = dnorm, args = c(mean = pop_p, sd = sqrt((pop_p)*(1-pop_p)/200)), col = "tomato")
- ```
- The empirical and theoretical distributions are very close but don't quite match; the peaks of the distributions are off slightly.
- ### Exercise 15
- ```{r}
- ## And calculations
- pop_sd <- sqrt(pop_p * (1 - pop_p))  ## population sd was not saved earlier
- lower <- pop_p - 1.96*pop_sd/sqrt(200)
- upper <- pop_p + 1.96*pop_sd/sqrt(200)
- sum(phats_200 > lower & phats_200 < upper) / 100000
- ```
- The proportion of my samples that fell between the bounds was 0.96.