- ### Exercise 1:
- ```{r}
- ## Compute population p directly
- ames %>% select(Central.Air) %>% table() %>% prop.table()
- ```
- ### Exercise 2:
- ```{r}
- ## new variable / extract air
- ames <- ames %>% mutate(air = as.numeric(Central.Air=='Y'))
- air <- ames$air
- ## Compute population p in a new way; save it
- pop_p <- sum(air) / nrow(ames)
- pop_p
- ```
- I computed the number of homes with central air, divided it by the total number of homes in `ames`, and saved the result as a separate value.
- ### Exercise 3:
- ```{r}
- ## Compute pop sd in two different ways
- sd(air, na.rm = TRUE)
- sqrt(pop_p*(1-pop_p))
- ```
- **written answer**
- No, they are not exactly the same: `sd()` uses the sample formula (dividing by n - 1), while `sqrt(pop_p*(1-pop_p))` is the population formula (dividing by n). They agree out to four significant figures.
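- As a quick check (a sketch assuming `air` has no missing values), rescaling the sample standard deviation by sqrt((n - 1)/n) recovers the population formula:
- ```{r}
- n <- length(air)
- ## sample sd rescaled to the population sd
- sd(air) * sqrt((n - 1) / n)
- ## should match the population formula sqrt(p * (1 - p))
- sqrt(pop_p * (1 - pop_p))
- ```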
- ### Exercise 4:
- ```{r}
- ## Draw sample
- samp <- sample(air, size=50)
- ## Compute p_hat
- p_hat <- sum(samp) / 50
- p_hat
- ```
- When I run the code repeatedly, the value changes, generally falling anywhere from about 0.86 to 0.98.
- ### Exercise 5:
- After running the line of code over 20 times, most of the answers were larger than `pop_p` (0.933). The answers ranged from 0.86 to 0.98, but most of the time they were 0.94 and above. However, if I ran the line of code many more times and calculated the mean of the results, I wouldn't be surprised if it was very close to `pop_p`.
- ### Exercise 6:
- ```{r}
- ### Try out samples of all different sizes
- samp1 <- sample(air, size=20)
- p_hat1 <- sum(samp1) / 20
- samp2 <- sample(air, size=50)
- p_hat2 <- sum(samp2) / 50
- samp3 <- sample(air, size=200)
- p_hat3 <- sum(samp3) / 200
- p_hat1
- p_hat2
- p_hat3
- ```
- The sample size that tends to be closest to the truth is `samp3` (size 200), while `samp1` (size 20) shows the most variability.
- ### Exercise 7:
- ```{r}
- ## Make a plot
- set.seed(111)
- phats_20 <- replicate(100000, mean(sample(air, size=20)))
- ggplot(data = NULL, aes(x = phats_20)) + geom_histogram()
- ```
- The plot is left-skewed with its center near 0.95. The values range from about 0.7 to 1.0, with most of the data falling between 0.9 and 1.0.
- ### Exercise 8
- ```{r}
- mean(phats_20)
- ```
- This value is very close to the true population proportion (0.9331058), matching it out to four decimal places. It is a slight over-estimate but is very close.
- ### Exercise 9
- ```{r}
- ## Compute SD
- sd(phats_20)
- ```
- The population standard deviation and `sd(phats_20)` are far apart, which makes sense: `sd(phats_20)` measures the spread of sample proportions, not of the raw data. When I instead calculated the standard error with the formula sqrt(p(1 - p)/n), it was very close to the true standard error.
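- The comparison described above can be sketched (assuming `pop_p` and `phats_20` from the earlier chunks) as:
- ```{r}
- ## empirical standard error of the simulated p-hats
- sd(phats_20)
- ## theoretical standard error from the formula sqrt(p(1 - p)/n)
- sqrt(pop_p * (1 - pop_p) / 20)
- ```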
- ### Exercise 10
- ```{r}
- set.seed(111)
- phats_20 <- replicate(100000, mean(sample(air, size=20)))
- ### Fill in for size
- set.seed(111)
- phats_50 <- replicate(100000, mean(sample(air, size=50)))
- ### Fill in for size 200.
- set.seed(111)
- phats_200 <- replicate(100000, mean(sample(air, size=200)))
- ```
- ### Exercise 11
- ```{r}
- ## Two histograms
- ggplot(data = NULL, aes(x = phats_50)) + geom_histogram()
- ggplot(data = NULL, aes(x = phats_200)) + geom_histogram()
- ```
- As sample size increases, the shape of the distribution becomes more normal, and the center starts to approach the true proportion value. There is also a decrease in the spread of the data as sample size increases.
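- One way to see the decreasing spread numerically (assuming the three vectors from Exercise 10) is to compare their standard errors, which shrink roughly like 1/sqrt(n):
- ```{r}
- ## spread of the sampling distribution at each sample size
- sd(phats_20)
- sd(phats_50)
- sd(phats_200)
- ```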
- ### Exercise 12
- The distribution of `phats_20` is not normal because of its left skew. `phats_50` is approaching normality but still has a left skew, though weaker than in `phats_20`. `phats_200` is the closest of the three to a normal distribution, with only a slight left skew.
- ### Exercise 13
- The first and third conditions are met for all three sample sizes; however, the second one isn't. With such a high population proportion, the smaller sample sizes weren't able to meet the success-failure condition, since n(1 - p) falls well below 10.
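- The check can be sketched (assuming `pop_p` from Exercise 2) by computing the expected counts of successes and failures for each sample size:
- ```{r}
- ## success-failure condition: need n*p >= 10 AND n*(1-p) >= 10
- for (n in c(20, 50, 200)) {
-   cat("n =", n, ": successes =", n * pop_p,
-       ", failures =", n * (1 - pop_p), "\n")
- }
- ```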
- ### Exercise 14
- ```{r}
- ## Include the histogram here
- ggplot(data = NULL, aes(x = phats_200)) +
- geom_blank() +
- geom_histogram(bins=30,aes(y = ..density..)) +
- stat_function(fun = dnorm, args = c(mean = pop_p, sd = sqrt((pop_p)*(1-pop_p)/200)), col = "tomato")
- ```
- The empirical and theoretical distributions are very close but don't quite match; the peaks of the distributions are off slightly.
- ### Exercise 15
- ```{r}
- ## And calculations
- pop_sd <- sqrt(pop_p * (1 - pop_p))  ## population sd was not saved earlier
- lower <- pop_p - 1.96*pop_sd/sqrt(200)
- upper <- pop_p + 1.96*pop_sd/sqrt(200)
- sum(phats_200 > lower & phats_200 < upper) / 100000
- ```
- The proportion of my samples that fell between the bounds was 0.96.