Untitled

---
title: "Final Project Rough Draft"
author: "Kendall Clancy, Jesse Keyton, Blake Wilhelm"
output: pdf_document
---

```{r setup, include=FALSE}
      knitr::opts_chunk$set(echo = TRUE, fig.width = 5, fig.height = 3, fig.align = "center")
      library(here)
      library(openintro)
      library(ggplot2)
      library(knitr)
      library(dplyr)
      library(mgcv)
      library(tidyverse)
      library(gridExtra)
      library(data.table)
      library(formattable)
      library(tidyr)
      library(tm)


```

**Recent College Graduates**

```{r, echo=FALSE}
grads_df <- read.csv(here("tidytuesday", "data", "2018", "2018-10-16", "recent-grads.csv"))
engineering <- grads_df[-c(7,8,20:22,25,27,28,30,33,35:38,40:50,52:58,60:65,68:173),]
grads_df <- grads_df[-22,]
grads_df <- grads_df[-109,]
```

This first data set is about college majors, and it has been collected from the American Community Survey 2010-2012. There are 173 majors considered, from Petroleum Engineering to Library Science. For each major, the researchers accumulated a sample of recent graduates and then counted how many of them fit into different employment categories. Among the categories are the number recent graduates that have full time employment, part time employment, unemployment, and they also record a median salary. Some other categories we consider in our tests are how many of the jobs actually required a degree, how many did not, and the number of women in each college major sample.

The manner in which the data was collected (namely, counting how many fit into an employment category) forces our tests to be mainly about the proportions of different categories within different college majors. After the observation based on the plot below that engineering majors have a higher median income right out of college than the other majors, we first test whether or not the data on median incomes for engineering majors is normally distributed. Next, we compare business majors and social science majors to investigate what greater opportunities out of college really means. In particular, for these two major types, we study the proportion of jobs not requiring a degree and the unemployment rates.

```{r, echo=FALSE, fig.height=4.5}
ggplot(grads_df, las=2, aes(x=Major_category, y=Median))+ geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

```{r, echo=FALSE, fig.height=3}
g1 <- grads_df %>% subset(Major_category == "Engineering") %>% ggplot(aes(x=Median))+geom_histogram()
g2 <- grads_df %>% subset(Major_category == "Engineering") %>% ggplot(aes(sample=Median))+geom_qq()+geom_qq_line(color="red")
grid.arrange(g1, g2, ncol = 2)
```

Based on these two figures, we conclude that the median salaries of recent engineering graduates are not normally distributed. Petroleum Engineering is an outlier in this group, skewing the median salary of engineering majors up. This knowledge should change how one responds to reports that the median salaries of engineering majors are significantly higher than other majors. With that being said, one problem with this test is that we are testing for normality on summary statistics. In particular, there were a different number of people surveyed about their salaries from the different engineering majors.

Other studies conclude that women make less than men in the workplace. Within engineering majors, then, it would be interesting to know if there are less women in the engineering majors with high median salaries. It should be noted again that the data from each engineering major is the result of different sample sizes. But since we want to compare all engineering majors at once, we rely on the summary statistics for the median income and proportion of women. We will run a non-linear test with the rank of the engineering major as a predictor of the share of women in the major. The engineering major is listed with its rank and median salary in the appendix.

```{r, echo=FALSE,fig.height= 8, fig.width = 8}
engineering$new_rank <- c(1:29)
fit <- gam(ShareWomen~ s(new_rank), data = engineering)
summary(fit)
engineering$yhat <- predict(fit, type = "response")

ggplot(data = engineering, aes(x = new_rank, y = yhat)) + labs(title = "Rank + Yhat graph") +
geom_line(lwd = 2) +
geom_point(data = engineering, aes(x = new_rank, y = ShareWomen), alpha = 0.5)

engineering_table <- engineering[,-c(1,2,4:15,17:21,23)]
#formattable(engineering_table, align =c("l","c","c"))


#ggplot(grads_df, aes(x=Rank, y=grads_df$Unemployment_rate)) +
#  stat_smooth(method = "lm")+
#  geom_point() +
#    xlab("Rank") +
#    ylab("Unemployment Rate")

#fit <- gam(Unemployed ~ s(ShareWomen), data = grads_df)
#summary(fit)
#grads_df$yhat <- predict(fit, type = "response")

#ggplot(data = grads_df, aes(x = ShareWomen, y = yhat, group = Major_category, color = Major_category)) +
#geom_line(lwd = 2) +
#geom_point(data = grads_df, aes(x = ShareWomen, y = Unemployed), alpha = 0.5)


#fit <- gam(Unemployment_rate ~ s(Rank) + s(ShareWomen) + s(Non_college_jobs), data = grads_df)
#grads_df$xhat <- predict(fit, type = "response")

#ggplot(data = grads_df, aes(x = Rank, y = xhat, group = Major_category, color = Major_category)) +
#geom_line(lwd = 2) +
#geom_point(data = grads_df, aes(x = Rank, y = Unemployment_rate), alpha = 0.5)
```

Next we examine subtleties behind the data set by comparing business majors and social science majors. First, we test the claim that there are significantly more recent business majors who have jobs that do not require degrees than recent social science majors. Let $p_b$ and $p_s$ be the true proportion of recent graduates who have jobs that do not require a degree from business majors and social science majors, respectively.

\begin{align*}
H_0: & p_b = p_s \\
H_a: & p_b > p_s
\end{align*}

Our significance level is $\alpha=0.01$.

```{r, echo=FALSE}
business_grads <- subset(grads_df, Major_category == "Business")
social <- subset(grads_df, Major_category == "Social Science")

n_b <- sum(business_grads$College_jobs)+sum(business_grads$Non_college_jobs)
n_s <- sum(social$College_jobs)+sum(social$Non_college_jobs)

x_b <- sum(business_grads$Non_college_jobs)
x_s <- sum(social$Non_college_jobs)

p_b_hat <- x_b/n_b
p_s_hat <- x_s/n_s

p_hat <- (x_s + x_b) / (n_s + n_b)
## calculate test statstic
#Z_test <- (p_b_hat - p_s_hat - 0) / sqrt(p_hat * (1 - p_hat) / n_b + p_hat * (1 - p_hat) / n_s)
#Z_test
#1-pnorm(Z_test)

prop.test(x = c(x_b, x_s), n = c(n_b, n_s), alternative = "greater", conf.level = 0.99, correct = FALSE)
```


With a $p$-value of $0$, we reject the null hypothesis and conclude that the proportion of recent business graduates with jobs not requiring a degree is significantly greater than the proportion of recent social science graduates with jobs not requiring a degree. The next test reveals why for some majors, like business majors, this may not be troublesome. Indeed, by showing that business majors have a lower unemployment rate, compared to a major with a lower rate of jobs not requiring a degree, we reveal a characteristic of business majors; namely, that their career goals take longer to achieve.

We now test the proportions of unemployed recent graduates in the social science and business major categories.

\begin{align*}
H_0: & p_b = p_s \\
H_a: & p_b < p_s
\end{align*}

Our significance level is $\alpha=0.01$.

```{r, echo=FALSE, fig.height= 8, fig.width = 8}
n_b <- sum(business_grads$Unemployed)+sum(business_grads$Employed)
n_s <- sum(social$Unemployed)+sum(social$Employed)

x_b <- sum(business_grads$Unemployed)
x_s <- sum(social$Unemployed)

p_b_hat <- x_b/n_b
p_s_hat <- x_s/n_s

p_hat <- (x_b + x_s) / (n_b + n_s)
## calculate test statstic
#Z_test <- (p_b_hat - p_s_hat - 0) / sqrt(p_hat * (1 - p_hat) / n_b + p_hat * (1 - p_hat) / n_s)
#Z_test
#pnorm(Z_test)

prop.test(x = c(x_b, x_s), n = c(n_b, n_s), alternative = "less", conf.level = 0.95, correct = FALSE)
```

With a $p$-value of $0$, we reject the null hypothesis and conclude that the proportion of unemployed recent business majors is significantly less than the proportion of recent social science majors.

Comparing business and social science majors shows us the following: employment categories for recent graduates should not necessarily determine the most rewarding majors. There are certain majors which have immediate payoffs, but this data says nothing about the long term status of those with different degrees, which is perhaps more important than initial status. Business majors may not land jobs requiring degrees at first, but compared to other majors with higher proportions (like social science majors), we see that the employment rate for business majors can actually be significantly higher.

\newpage

**UFO Sightings**

The second data set is a compilation of reported UFO "sightings" across the globe from the National UFO Reporting Center that was cleaned up and uploaded by a Github user named Sigmond Axel. The data includes 11 variables: date_time (the date and time of the reported sighting), city_area (city or area of the sighting), state, country, ufo_shape (reported shape of the UFO), encounter_length (how long the sighting was in seconds), described_encounter_length (the description of how long the sighting was), description (the description of the sighting), date_documented, latitude, and longitude. Because this is all from self reports, there absolutely could be some issues with the overall authenticity of the data. For example, what one person may describe as one shape they saw, another may have a different definition for that shape. The encounter length could also be off, as shocking events can make time feel like it is moving differently and without an actual way to track it the reports could be the people's best estimate. Despite this, we felt the data was still interesting enough and still had enough substance to run tests on, with the disclosure that the results from any tests are entirely based on the data provided and there could be inaccuracies depending on how closely the self-reports mirrored the actual encounter. We did decide to focus our tests on three of the variables: state, shape of the UFO, and the length of the encounter.

Before running any tests, we chose to filter and clean the data up a bit. We decided to focus exclusively on reports coming from the United States, so we filtered and dropped the data collected from any other countries. We then filtered and removed the states of Hawaii, Alaska, Washington DC, and Puerto Rico, and the reported UFO shapes of changed, crescent, delta, flare, hexagon, pyramid, and round, because of the low report numbers that were messing with our tests. We used this inital step of cleaned up data to run a Shapiro test to test if the encounter lengths were normal.

```{r, echo=FALSE}
ufo_df <- read.csv(here("tidytuesday", "data", "2019", "2019-06-25", "ufo_sightings.csv"))
```

```{r, echo = FALSE}
#Filtering out
ufo_us <- ufo_df %>% filter(country == "us" & state != "ak" & state != "hi" & state != "dc" & state != "pr") %>% filter(ufo_shape != "changed" & ufo_shape != "crescent" & ufo_shape != "delta" & ufo_shape != "flare" & ufo_shape != "hexagon" & ufo_shape != "pyramid" & ufo_shape != "round") %>% droplevels()
ufo_regions <- ufo_us %>% mutate(state =
                                   recode(state,
                                          'wa' = "Pacific",
                                          'or' = "Pacific",
                                          'ca' = "Pacific",
                                          'nm' = "South West",
                                          'az' = "South West",
                                          'tx' = "South West",
                                          'ok' = "South West",
                                          'ar' = "South East",
                                          'la' = "South East",
                                          'ms' = "South East",
                                          'al' = "South East",
                                          'tn' = "South East",
                                          'ga' = "South East",
                                          'ky' = "South East",
                                          'wv' = "South East",
                                          'sc' = "South East",
                                          'nc' = "South East",
                                          'va' = "South East",
                                          'fl' = "South East",
                                          'pa' = "North East",
                                          'de' = "South East",
                                          'nj' = "North East",
                                          'md' = "South East",
                                          'ny' = "North East",
                                          'ct' = "North East",
                                          'nh' = "North East",
                                          'vt' = "North East",
                                          'me' = "North East",
                                          'ma' = "North East",
                                          'ri' = "North East",
                                          'nd' = "Midwest",
                                          'sd' = "Midwest",
                                          'ne' = "Midwest",
                                          'ks' = "Midwest",
                                          'mn' = "Midwest",
                                          'mo' = "Midwest",
                                          'ia' = "Midwest",
                                          'mi' = "Midwest",
                                          'il' = "Midwest",
                                          'wi' = "Midwest",
                                          'in' = "Midwest",
                                          'oh' = "Midwest",
                                          'mt' = "Rocky Mountains",
                                          'co' = "Rocky Mountains",
                                          'id' = "Rocky Mountains",
                                          'nv' = "Rocky Mountains",
                                          'ut' = "Rocky Mountains",
                                          'wy' = "Rocky Mountains"))
ufo_clean <- subset(ufo_regions, encounter_length > 0 & encounter_length <= 1800)
ufo_clean$encounter_length2[ufo_clean$encounter_length > 0 & ufo_clean$encounter_length < 360] <- "0-360"
ufo_clean$encounter_length2[ufo_clean$encounter_length > 361 & ufo_clean$encounter_length < 720] <- "361-720"
ufo_clean$encounter_length2[ufo_clean$encounter_length > 721 & ufo_clean$encounter_length < 1080] <- "721-1080"
ufo_clean$encounter_length2[ufo_clean$encounter_length > 1081 & ufo_clean$encounter_length < 1440] <- "1081-1440"
ufo_clean$encounter_length2[ufo_clean$encounter_length > 1441 & ufo_clean$encounter_length < 1800] <- "1441-1800"
```


The initial data set was too large to run a Shapiro test on so we then took a sample of 4,000 random rows of data.
```{r, echo=FALSE}
set.seed(1234)
ufo_dfus<- dplyr::sample_n(ufo_us, 4000) #take a sample of those in the US
```

Something that interested us was the encounter length of each UFO observation. What do the lengths of encounters look like? We wondered if there was a normal distribution of encounter lengths centering around a certain timeframe, or if there was non-normality and a skew that could tell us a story.

This is the graph of the encounter_length per observation.
```{r, echo=FALSE, fig.height= 9, fig.width = 9}
ggplot(ufo_dfus, aes(x = encounter_length)) + geom_dotplot() + labs(title = "Encounter length by observations")
```

For this test, we used a Shapiro Test to test for the normality of the encounter lengths.

Our significance level is = 0.05

$H_0$ = The data is approximately normal with a p-value of above 0.05

$H_A$ = The data is not approximately normal with a p-value of below 0.05

```{r, echo=FALSE, fig.height= 15, fig.width = 15}

shapiro.test(ufo_dfus$encounter_length) #Normality test

#ad.test(ufo_dfus$encounter_length)

```


The p-value is below 0.05, therefore we reject $H_0$ and conclude that the data is not normal.


The data does not match the red line on a QQ plot, and has biases toward the upper quantiles. Data points of very large encounter lengths skew the data to the right.

```{r, echo=FALSE, fig.height= 8, fig.width = 8}
qqnorm(ufo_dfus$encounter_length, pch = 1)
qqline(ufo_dfus$encounter_length, col = "red")
```

It seems that most encounters with UFOs are relatively short. Skews to the right would logically be associated with alien abduction, as opposed to seeing a UFO in the sky briefly.

Another thing we are interested in is whether or not the shape of the UFO sighted is dependent on the location. Due to the small amount of reports given by some individual states, we decided to first group all the states into regions: Pacific, South West, South East, North East, Rocky Mountains, and Midwest.


```{r, fig.height= 10, fig.width = 10, echo=FALSE}
ggplot(ufo_regions, aes(state, fill = ufo_shape)) + labs(title = "Shapes by Region")+
  geom_histogram(stat = "count")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
We determined that Pearson's Chi-squared test of independence was the best test for what we were looking for. Our null hypothesis, $H_0$, is that the region of encounter and the shape reported are independent of each other, meaning there is no evidence that the region has any effect on what shapes are reported. Our alternative hypothesis, $H_A$, is that the region of the encounter and the shape reported are dependent. In other words, there is evidence that the region where the encounter was reported had an impact on the shape reported. We used a significance level of $\alpha$ = 0.05.

```{r, echo=FALSE}

tbl <- table(ufo_regions$state, ufo_regions$ufo_shape)
chisq.test(tbl)
```
With a $p$-value of $0$, we reject the null hypothesis and can reasonably conclude there is a relationship between the region of the encounter and the shape observed.

While we are happy to have found evidence of a relationship, we still are interested in what regions may be more associated with certain shapes. To dive deeper into this, we used the Pearson standardized residuals.
```{r, echo = FALSE}
tab2 <- chisq.test(tbl)
```
The Pearson standardized residuals is set up where positive numbers indicate positive association, and vice versa for the negative numbers. Anything over the absolute value of 2 or 3 is said to be a significant enough association to possibly lead to the rejection of the null. Based on this, we found that one of the largest associations is the Midwest and a triangle shaped UFO at 6.208. Likewise, the South West is associated with very few reports of the fireball shape.


```{r ref.label=knitr::all_labels(), echo = T, eval = F}
#pastes all code at the end
```