---
title: "Final project: Is College Worth It?"
date: "Due date: December 9, 2019"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Data description

This dataset is an extract of two surveys conducted by the National Science Foundation (NSF) in 2013: the National Survey of College Graduates (SESTAT 2013) and the Survey of Doctorate Recipients (SESTAT 2013). Information on the survey and sampling methods can be found here:

\url{https://highered.ipums.org/highered/survey_designs.shtml}

The dataset is public and can be cited in your report as: Minnesota Population Center. IPUMS Higher Ed: Version 1.0 [dataset]. Minneapolis, MN: University of Minnesota, 2016.
https://doi.org/10.18128/D100.V1.0

On Canvas, you will find the following files:

* **data.formatted.csv**: the dataset downloaded from IPUMS Higher Ed, with missing values or logical skips recoded to NA and the error in the variable CHTOT fixed.

* **dataset.RData**: an R workspace that contains data.formatted.csv pre-loaded as a dataframe called \texttt{dataset}, with each variable given the correct type.

It is recommended that you start with this file.

For regression you may find it convenient to recode some yes/no variables as binary 1/0 numeric variables (see the short sketch after this list).

* **codebook-basic.txt**: a list of variables and the meaning of their values. Note that missing values or logical skips have been recoded to NA.

* **codebook.xml**: an XML version of the codebook, with more detailed explanations of the variables and hyperlinks. You can open this in your browser.

* **final-project.rmd / final-project.pdf**: instructions and questions

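A minimal sketch of such a recode, assuming the yes/no variable (here JOBINS, picked purely for illustration) is stored as a factor with levels "0"/"1" as described in codebook-basic.txt; shown with `eval=FALSE` since the data is loaded later in the report:

```{r, eval=FALSE}
# Recode a yes/no factor to a numeric 0/1 variable; NA values stay NA.
dataset$JOBINS.num <- as.numeric(as.character(dataset$JOBINS) == "1")
table(dataset$JOBINS.num, useNA = "ifany")
```
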
The goals of this analysis are as follows.

* Give a general description of the work landscape for those with a college degree in the US, as surveyed in 2013.

* Build a regression model to predict annual salary.

* Build a regression model to predict job satisfaction.

* Use our analysis to fact-check news outlets.

* Convey our findings in a technical report and in plain terms.

## General instructions on formatting
You should hand in two files in total: an rmd file and a pdf file.

However, it should look less like homework and more like a professional report.

A good standard is the PEW research reports, such as this:

https://www.pewsocialtrends.org/2014/02/11/the-rising-cost-of-not-going-to-college/

Here is what the lay summary from that article looks like:

https://www.pewresearch.org/fact-tank/2014/02/11/6-key-findings-about-going-to-college/

Please answer all questions asked and write in full sentences with good formatting (eg: clear paragraphs).

For hypothesis testing, use a 5% significance level (95% confidence) unless otherwise specified.

\newpage
# The Report

Your report should contain the same headings as the sections below. Under each heading, put answers to these questions.

For each question/bullet, summarize in ONE paragraph, with appropriate plots and/or numbers/tables.

## Basic analysis

### Population and sampling

1. This dataset consists of two different surveys. Briefly describe the population, the sample, and the sampling method for each of the surveys. Name TWO possible biases that each sample can have. Do we introduce further biases when we analyze the results of these surveys together (ie: treat it as one big dataset)?
The first survey is the National Survey of College Graduates (NSCG). Its population consists of college graduates living in the US who are under the age of 76. The sample is selected using a multi-stage stratified sampling scheme based on age, race, highest degree type, and occupation (sex used to be a stratification variable but was dropped in the 2010 census). Survey data were collected by mail at first, with computer-assisted telephone interviews following up on initial non-respondents. Two possible biases: non-response bias, since a significant share of the sampled population never responds; and language bias, since the survey is not offered in every language and is therefore inaccessible to some potential respondents.

The second survey is the Survey of Doctorate Recipients (SDR). Its population consists of those who earned a doctorate in science, engineering, or related fields in the US. The sample is a stratified sample of these doctorate recipients; respondents were contacted by mail, and non-respondents were followed up with computer-assisted telephone interviews. One potential bias is that the survey excludes those who got their degrees abroad; another is that it excludes those who got their degrees in the US but currently live abroad.

Treating the two surveys as one big dataset does introduce a further bias: doctorate recipients fall in both sampling frames, so pooling the samples over-represents them relative to the population of all college graduates.

### Demographics

2. Summarize the demographics of the survey.
Specifically, you should describe the distribution of gender, minority, race/ethnicity, and total number of children.
```{r}
library(ggplot2)
# Load the pre-formatted workspace provided on Canvas.
load("~/Downloads/dataset.RData")
table(dataset$GENDER)
ggplot(dataset, aes(x = GENDER)) + geom_bar()
```

The study is about 43.4% women and 56.6% men.

```{r}
table(dataset$MINRTY)
ggplot(dataset, aes(x = MINRTY)) + geom_bar()
```
About 79.6% of respondents belong to a majority group and 20.4% belong to a minority group.

```{r}
table(dataset$RACETH)
ggplot(dataset, aes(x = RACETH)) + geom_bar()
```

About 17.1% of respondents are Asian, 62.5% are white, and 20.4% belong to other under-represented minority groups.

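The question also asks about the total number of children. A minimal sketch, assuming CHTOT is the count-of-children variable described in the codebook (with NA where the question did not apply):

```{r}
# Distribution of total number of children (CHTOT).
summary(dataset$CHTOT)
ggplot(dataset, aes(x = CHTOT)) + geom_bar() + xlab("Total number of children")
```
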
### Education

3. Summarize the distribution of highest degrees and bachelor degrees by field and year obtained.
```{r}
table(dataset$NDGMEMG)
ggplot(dataset, aes(x = NDGMEMG)) + geom_bar() + xlab("Field highest degree is in")

table(dataset$HD03Y5)
ggplot(dataset, aes(x = HD03Y5)) + geom_bar() + xlab("Year of highest degree obtained")

library(vcd)
boxplot(dataset$HD03Y5 ~ dataset$NDGMEMG,
        names = c("Computer", "Biology", "Physical", "Social", "Engineering", "Science", "Non-science"),
        xlab = "Field", ylab = "Year obtained")

dataset$HD03Y5.cat <- as.factor(dataset$HD03Y5)
mosaic(dataset$HD03Y5.cat ~ dataset$NDGMEMG, xlab = "Field", ylab = "Year obtained")
```

There were 10,483 (9.1%) people whose highest degree is in computer and mathematical sciences, 16,039 (13.9%) in biological, agricultural and environmental life sciences, 9,646 (8.4%) in physical and related sciences, 24,291 (21.1%) in social and related sciences, 23,451 (20.4%) in engineering, 16,614 (14.4%) in science and engineering-related fields, and 14,628 (12.7%) in non-science and engineering fields. Social and related sciences is the most popular field and physical and related sciences the least. The number of people attaining their highest degree increases steadily over time, with a spike in the 2006-2010 range, which can be attributed to the increase in people going on to get bachelor's degrees. The smallest count occurs in the last segment, 2011 or later, most likely because only about two years of data had accumulated by the time the survey was taken in 2013.

4. For those who obtained more than a bachelor degree, is there a significant association between the field of major of their bachelor degree and that of their highest degree? State any tests you use, your p-value, and draw conclusions.

H0: there is no association between bachelor-degree field and highest-degree field.
HA: there is an association between bachelor-degree field and highest-degree field.
Test: a permutation test, using as test statistic the proportion of people whose bachelor field matches their highest-degree field.
p-value $\approx$ 0, so we reject the null at the 5% significance level: there is sufficient evidence of an association between the field of the bachelor degree and the field of the highest degree.
```{r}
# Keep only people whose highest degree is above a bachelor's, and drop
# bachelor-field codes that do not map to a field ("9" and "96").
select <- dataset$DGRDG != "1" & dataset$NBAMEMG != "9" & dataset$NBAMEMG != "96"
dataset.higher <- dataset[select, ]
dataset.higher$NBAMEMG <- droplevels(dataset.higher$NBAMEMG)
levels(dataset.higher$NDGMEMG)
levels(dataset.higher$NBAMEMG)

# Test statistic: proportion of people whose bachelor field matches
# the field of their highest degree.
match.rate <- function(bach, highest) {
  mean(as.character(bach) == as.character(highest))
}
observed <- match.rate(dataset.higher$NBAMEMG, dataset.higher$NDGMEMG)
observed

# Permutation null: shuffling the bachelor fields breaks any association
# between the two variables while preserving both marginal distributions.
set.seed(2019)
m <- 100
D <- replicate(m, match.rate(sample(dataset.higher$NBAMEMG), dataset.higher$NDGMEMG))
pvalue <- sum(D >= observed) / m
pvalue
```
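As a cross-check, a standard chi-squared test of independence on the two-way table leads to the same conclusion (a sketch; `chisq.test` may warn if some cells are sparse):

```{r}
# Parametric cross-check: chi-squared test of independence between
# bachelor-degree field and highest-degree field.
chisq.test(table(dataset.higher$NBAMEMG, dataset.higher$NDGMEMG))
```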
### Job status

5. What does the labor force look like?
* Describe general statistics: % of people working, % working part-time, number of hours per week and number of weeks per year.
* Do most people work in short bursts (few weeks but high number of hours per week), or do most people work with regular hours year-round?
* What are the major reasons that led people to not work at the time of survey?
```{r}
# Percent of people working (LFSTAT: 1 = employed)
summary(dataset$LFSTAT)
98051 / (98051 + 3375 + 13726)  # employed

# Percent of people working part time (HRSWKGR levels 1-2, i.e. 35 hours or fewer)
summary(dataset$HRSWKGR)
(7349 + 8110) / (7349 + 8110 + 37019 + 45573)  # part time

# Number of weeks per year
summary(dataset$WKSWKGR)

# Hours per week vs. weeks per year
mosaic(HRSWKGR ~ WKSWKGR, data = dataset)
```
Based on the mosaic plot, most people work the majority of the year at 36-40+ hours a week, so most people work regular hours year-round. The major reasons people were not working at the time of the survey can be read off the reason-for-not-working indicator variables (see the sketch below).
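
A sketch for the last bullet, assuming (as in the IPUMS Higher Ed codebook) that the reason-for-not-working indicators share the "NW" name prefix and that a value of 1 means the reason applies; check codebook-basic.txt before relying on this:

```{r}
# Tally each reason-for-not-working indicator among those not employed.
nw.vars <- grep("^NW", names(dataset), value = TRUE)
not.working <- dataset[dataset$LFSTAT != 1, ]
sort(sapply(nw.vars, function(v) sum(not.working[[v]] == 1, na.rm = TRUE)),
     decreasing = TRUE)
```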
6. Degree relevance
* How relevant are people's degrees to their principal job? (Do people work in the field they were trained for, or in unrelated areas?)
* Is there a statistically significant difference in relevance of degree vs
 - job type,
 - the degree that they are trained for, and
 - the type of job that people do?

Note: state the tests you use, the p-value, and draw conclusions.
You may find the variables MGRNAT, MGROTH, MGRSOC, NOCPRMG, OCEDRLP, NDGMEMG, WAPRSM and WASCSM relevant.

```{r}
# Degree relevance vs. job type
# relevant variables: NOCPRMG, OCEDRLP, NDGMEMG
tab <- table(dataset$OCEDRLP, dataset$NOCPRMG)
tab1 <- table(dataset$OCEDRLP, dataset$NDGMEMG)
chisq.test(tab)
chisq.test(tab1)
```
H0: There is no association between degree relevance and job type.
HA: There is an association between degree relevance and job type.
Test statistic: the chi-squared statistic of the two-way table of degree relevance by job type.
Conclusion: at the 5% significance level, we reject the null and conclude that degree relevance and job type are associated.

```{r}
# Degree relevance vs. the degree they trained for
# relevant variables: OCEDRLP, MGRNAT, MGROTH, MGRSOC
tab2 <- table(dataset$OCEDRLP, dataset$MGRNAT)
tab3 <- table(dataset$OCEDRLP, dataset$MGROTH)
tab4 <- table(dataset$OCEDRLP, dataset$MGRSOC)
chisq.test(tab2)
chisq.test(tab3)
chisq.test(tab4)
```
H0: There is no association between degree relevance and the degree field they trained for.
HA: There is an association between degree relevance and the degree field they trained for.
Test statistic: the chi-squared statistic of each two-way table (degree relevance by MGRNAT, MGROTH, and MGRSOC).
Conclusion: at the 5% significance level, we reject the null and conclude that degree relevance is associated with natural science, social science, and other technical expertise.

```{r}
# Degree relevance vs. principal activity in job
# relevant variables: OCEDRLP, WAPRSM
tab5 <- table(dataset$OCEDRLP, dataset$WAPRSM)
chisq.test(tab5)
```
H0: There is no association between degree relevance and principal activity in job.
HA: There is an association between degree relevance and principal activity in job.
Test statistic: the chi-squared statistic of the two-way table of degree relevance by principal activity.
Conclusion: at the 5% significance level, we reject the null and conclude that degree relevance and principal activity in job are associated.

7. Job satisfaction
* Summarize overall job satisfaction
* Among those who reported "somewhat/very satisfied", which aspects of their jobs are they most satisfied with? Among those who reported "somewhat/very dissatisfied", which aspects of their jobs are they least satisfied with?
* Based on the above, which factors are most important to job satisfaction?
```{r}
summary(dataset$JOBSATIS)
```
Most people (87,655, about 89%) report being very or somewhat satisfied with their job; the remaining 11% report being somewhat or very dissatisfied.

```{r}
# Restrict to the "somewhat/very satisfied" group (JOBSATIS 1 or 2).
satisfied <- dataset[dataset$JOBSATIS %in% c(1, 2), ]

# Tabulate each satisfaction-aspect variable and stack the counts
# into long form for plotting.
aspects <- c("SATADV", "SATBEN", "SATCHAL", "SATIND", "SATLOC",
             "SATRESP", "SATSAL", "SATSEC", "SATSOC")
why.sat <- do.call(rbind, lapply(aspects, function(v) {
  tab <- table(satisfied[[v]])
  data.frame(reason = v, answer = names(tab), value = as.numeric(tab))
}))
head(why.sat)
ggplot(why.sat, aes(fill = answer, x = reason, y = value)) +
  geom_bar(position = "stack", stat = "identity") +
  ggtitle("Which aspects are the satisfied group satisfied with?") +
  scale_x_discrete(breaks = aspects,
                   labels = c("Advancement", "Benefits", "Challenge", "Independence",
                              "Location", "Responsibility", "Salary", "Job Security",
                              "Contribution to Society"))
```
## Regression 1: SALARY vs other variables

Build a linear regression model to predict SALARY based on the other relevant variables.
```{r}
load("dataset.RData")
dim(dataset)
names(dataset)
# Keep only the employed (LFSTAT == 1), since SALARY is only meaningful for them.
dataset <- dataset[dataset$LFSTAT == 1, ]

par(mfrow = c(2, 2))

# Split into part-time (HRSWKGR 1-2) and full-time (HRSWKGR 3-4) workers.
data.parttime <- dataset[as.numeric(dataset$HRSWKGR) <= 2, ]
data.fulltime <- dataset[as.numeric(dataset$HRSWKGR) > 2 & as.numeric(dataset$HRSWKGR) <= 4, ]

model.degree <- lm(SALARY ~ DGRDG, data = data.fulltime)
summary(model.degree)
plot(model.degree)

# Note: DGRDG is a factor, so squaring it is meaningless (and inside a
# formula ^ denotes interaction expansion, not a power), hence plain DGRDG.
model.DEGGENDER <- lm(SALARY ~ DGRDG + GENDER, data = data.fulltime)
summary(model.DEGGENDER)

model.hrsperweek <- lm(SALARY ~ DGRDG + GENDER + HRSWKGR, data = data.fulltime)
summary(model.hrsperweek)

model.hrsperweek2 <- lm(SALARY ~ DGRDG + GENDER + HRSWKGR + MGRNAT + MGRSOC + MGROTH, data = data.fulltime)
summary(model.hrsperweek2)

model.three <- lm(SALARY ~ DGRDG + GENDER + HRSWKGR + MGRNAT + MGROTH + JOBSATIS + SATCHAL + SATLOC + SATSOC + SATSAL + NDGMEMG + HD03Y5 + BA03Y5 + MINRTY + OCEDRLP + NOCPRMG + EMSEC + WAPRSM, data = data.fulltime)
summary(model.three)

par(mfrow = c(2, 2))
plot(model.three)
```
1. Detail how you did variable selection: which models did you run, why did you discard certain models or variables, any variable transformations or recoding you did and why, which diagnostic tests did you run and what they showed, justifications if you removed outliers. How did you decide to deal with missing values in this dataset?

I removed all entries that were unemployed or not in the labor force, since they report no salary. I did not discard any other variables. I tried dropping every row containing an NA, but that removed the whole dataset, since all of the columns have NA values. I also tried splitting the dataset, but in the end that had no effect on the $R^2$ and adjusted $R^2$. My process was forward selection by hand: adding variables as I went along and watching their effect on $R^2$ and adjusted $R^2$. I did not perform any variable transformations either, because they had no effect on which coefficients were significant, so I kept the variables simple. As for missing values, every single column contains at least one NA, which makes it hard to handle them explicitly; I did not omit them myself and instead relied on lm()'s default of dropping incomplete rows when fitting. I plotted the diagnostic plots as I went to check that each model looked like a reasonable fit. (The sketch below shows how widespread the missing values are.)
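
A quick sketch backing up the claim that every column has missing values (the exact percentages depend on the extract):

```{r}
# Share of missing values per column, largest first.
na.share <- sort(colMeans(is.na(dataset)), decreasing = TRUE)
round(head(na.share, 10), 3)
# Number of columns with at least one NA.
sum(na.share > 0)
```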

2. Call your final regression model \texttt{model.lm}. Clearly show your final regression model: the R command, and the R output summary. Write down the equation that R gives you. Interpret all the coefficients and the $p$-values associated with the coefficients.
```{r}
data.fulltime <- dataset[as.numeric(dataset$HRSWKGR) > 2 & as.numeric(dataset$HRSWKGR) <= 4, ]
model.lm <- lm(SALARY ~ DGRDG + RACETH + GENDER + HRSWKGR + MGRNAT + MGROTH + JOBSATIS + SATCHAL + SATLOC + SATSOC + SATSAL + NDGMEMG + HD03Y5 + BA03Y5 + MINRTY + OCEDRLP + NOCPRMG + EMSEC + WAPRSM, data = data.fulltime)
summary(model.lm)
```

$$
\begin{aligned}
\widehat{\mathrm{SALARY}} = {} & 1255760.46 \\
& + 11143.19\,\mathrm{DGRDG2} + 29469.26\,\mathrm{DGRDG3} + 28931.74\,\mathrm{DGRDG4} + 5788.44\,\mathrm{GENDER2} \\
& - 2450.67\,\mathrm{RACETH2} - 3998.64\,\mathrm{RACETH3} \\
& + 23779.36\,\mathrm{HRSWKGR2} + 39229.53\,\mathrm{HRSWKGR3} + 48386.25\,\mathrm{HRSWKGR4} \\
& + 7078.43\,\mathrm{MGRNAT1} + 4070.35\,\mathrm{MGROTH1} \\
& + 296.36\,\mathrm{JOBSATIS2} + 1829.88\,\mathrm{JOBSATIS3} + 1747.88\,\mathrm{JOBSATIS4} \\
& - 623.64\,\mathrm{SATCHAL2} - 1258.28\,\mathrm{SATCHAL3} - 2468.13\,\mathrm{SATCHAL4} \\
& + 976.60\,\mathrm{SATLOC2} + 1948.57\,\mathrm{SATLOC3} + 2409.70\,\mathrm{SATLOC4} \\
& + 2360.24\,\mathrm{SATSOC2} + 5021.46\,\mathrm{SATSOC3} + 6565.20\,\mathrm{SATSOC4} \\
& - 15534.55\,\mathrm{SATSAL2} - 28652.44\,\mathrm{SATSAL3} - 36491.70\,\mathrm{SATSAL4} \\
& - 6707.93\,\mathrm{NDGMEMG2} - 3305.93\,\mathrm{NDGMEMG3} - 2934.31\,\mathrm{NDGMEMG4} + 2164.69\,\mathrm{NDGMEMG5} \\
& - 3161.74\,\mathrm{NDGMEMG6} - 1577.50\,\mathrm{NDGMEMG7} - 623.28\,\mathrm{HD03Y5} \\
& - 374.00\,\mathrm{BA03Y51961} + 10475.85\,\mathrm{BA03Y51966} + 14808.66\,\mathrm{BA03Y51971} + 19602.42\,\mathrm{BA03Y51976} \\
& + 22265.41\,\mathrm{BA03Y51981} + 22989.28\,\mathrm{BA03Y51986} + 22094.36\,\mathrm{BA03Y51991} + 18497.62\,\mathrm{BA03Y51996} \\
& + 12906.85\,\mathrm{BA03Y52001} + 7926.76\,\mathrm{BA03Y52006} + 15595.03\,\mathrm{BA03Y59996} + 20808.76\,\mathrm{BA03Y59999} \\
& - 2096.47\,\mathrm{MINRTY1} - 3316.69\,\mathrm{OCEDRLP2} - 12388.84\,\mathrm{OCEDRLP3} \\
& - 11620.86\,\mathrm{NOCPRMG2} - 9617.93\,\mathrm{NOCPRMG3} - 3481.28\,\mathrm{NOCPRMG4} - 2560.77\,\mathrm{NOCPRMG5} \\
& + 51.58\,\mathrm{NOCPRMG6} - 3719.88\,\mathrm{NOCPRMG7} \\
& + 24.09\,\mathrm{EMSEC2} + 11315.79\,\mathrm{EMSEC3} + 13215.75\,\mathrm{EMSEC4} \\
& - 9446.28\,\mathrm{WAPRSM2} + 4327.07\,\mathrm{WAPRSM3} + 2088.80\,\mathrm{WAPRSM4} - 3114.62\,\mathrm{WAPRSM5}
\end{aligned}
$$

Every p-value in the summary table below 0.05 indicates that the corresponding coefficient is significantly different from zero, i.e. that variable has a significant effect on the predicted salary. Since almost all predictors here are categorical, a coefficient Z for a dummy level means that belonging to that level instead of the reference level changes the predicted salary by Z dollars (an increase if Z is positive, a decrease if negative), holding the other variables fixed; for a numeric predictor, a one-unit increase changes the predicted salary by its coefficient.

3. Report the $R^2$ and adjusted $R^2$ of your model. What are the meaning of these values? Run a diagnostic plot for your model. Is your model a good fit? Is it easy to interpret?

The $R^2$ of the model is 0.5013 and the adjusted $R^2$ is 0.5009. $R^2$ is the proportion of the variance in SALARY explained by the model; adjusted $R^2$ penalizes the number of predictors, so it only increases when an added variable genuinely improves the fit. The closer both are to 1, the better the fit; here the model explains about half of the variation in salary.
```{r}
par(mfrow = c(2, 2))
plot(model.lm)
```
4. Suppose you want to choose a career path to maximize your SALARY. Which career path would you choose based on your model? (Detail which highest degree you should obtain in which major, which sector your employer should be in, etc.)

Based on the model summary, I would get a doctorate in engineering, work full time with my employer in the business or industry sector, and make management and administration my primary work activity. I chose these options because their coefficients give the largest increases in predicted salary.
## Regression 2: job satisfaction vs other variables
Recode JOBSATIS into two categories: "satisfied" = "somewhat/very satisfied", and "not satisfied" = "somewhat/very dissatisfied". Build a logistic regression model to predict the recoded job satisfaction based on the other variables.
1. Detail how you did variable selection: which models did you run, why did you discard certain models or variables, any variable transformations you did and why, which diagnostic tests did you run and what they showed, justifications if you removed outliers. How did you decide to deal with missing values in this dataset?

The first step in building the logistic regression model is to remove all observations with NA values for JOBSATIS: keeping them adds no predictive ability, increases the error and variability in the model, and would also distort the AUROC.

```{r}
# Drop respondents with no JOBSATIS answer.
dataset$JOBSATIS <- as.numeric(dataset$JOBSATIS)
dataset2 <- subset(dataset, !is.na(JOBSATIS))
```

The next step is to recode JOBSATIS as instructed: "satisfied" = "somewhat/very satisfied" and "not satisfied" = "somewhat/very dissatisfied". We create a new column that is boolean TRUE for entries that are "very satisfied" (1) or "somewhat satisfied" (2) and FALSE for "somewhat dissatisfied" and "very dissatisfied". R can fit a logistic regression on this boolean directly, since it is equivalent to a binary categorical variable.

```{r}
dataset2$is.satisfied <- dataset2$JOBSATIS == 1 | dataset2$JOBSATIS == 2
```

A preliminary logistic regression model for is.satisfied is then built from the categorical and numerical variables that are qualitatively relevant to job satisfaction. All of the satisfaction-aspect variables are included initially, as they are specific components of job satisfaction. Categorical variables whose values have no logical ordering, such as $FTPRET$ (previously retired; No = 0, Yes = 1), are treated as factors.

To treat NRREA as a factor we must first convert its NA values into 0.
```{r}
# Recode NRREA: convert the factor to numeric codes, replace NA with 0,
# then treat the result as a factor again.
dataset2$NRREA <- as.numeric(dataset2$NRREA)
dataset2[is.na(dataset2$NRREA), ]$NRREA <- 0
dataset2$NRREA <- as.factor(dataset2$NRREA)
```

```{r}
model.lm.pre <- glm(is.satisfied ~ as.factor(GENDER) + as.factor(EMSEC) + as.factor(MINRTY) +
                      as.factor(RACETH) + as.factor(NBAMEMG) + as.factor(DGRDG) + as.factor(NDGMEMG) +
                      HRSWKGR + WKSWKGR + as.factor(JOBINS) + as.factor(JOBPENS) + as.factor(JOBPROFT) +
                      as.factor(JOBVAC) + as.factor(OCEDRLP) + SATADV + SATBEN + SATCHAL + SATIND +
                      SATLOC + SATRESP + SATSAL + SATSEC + SATSOC,
                    data = dataset2, family = "binomial")
summary(model.lm.pre)
```

The ROC curve (and AUROC) for the preliminary model is plotted below, using plotROC(), here assumed to come from the InformationValue package (or an equivalent course-provided helper).

```{r}
# plotROC(actuals, predictedScores) is assumed to come from the
# InformationValue package; swap in your course's ROC helper if different.
library(InformationValue)
plotROC(dataset2$is.satisfied == TRUE, model.lm.pre$fitted.values)
```

To trim down the variables, R's automated stepwise selection by AIC is a good starting point (left commented out here; note that stepAIC() requires the MASS package).

```{r}
# library(MASS)
# model.lm.pre.aic <- stepAIC(model.lm.pre)
# summary(model.lm.pre.aic)
# plot(model.lm.pre.aic)
# plotROC(dataset2$is.satisfied == TRUE, model.lm.pre.aic$fitted.values)
```

Comparing AUROC values, the stepwise model gains only 0.001 in AUROC, but it drops a significant number of variables and is therefore much simpler. The model that R produces is:

```{r}
model.lm.pre.aic <- glm(formula = is.satisfied ~ as.factor(GENDER) + as.factor(EMSEC) + as.factor(NBAMEMG) +
                          as.factor(DGRDG) + as.factor(NDGMEMG) + HRSWKGR + WKSWKGR +
                          as.factor(JOBINS) + as.factor(JOBPROFT) + as.factor(JOBVAC) +
                          as.factor(OCEDRLP) + SATADV + SATBEN + SATCHAL + SATIND +
                          SATLOC + SATRESP + SATSAL + SATSEC + SATSOC,
                        family = "binomial", data = dataset2)
summary(model.lm.pre.aic)
```

However, the summary shows that the only significant level of $NBAMEMG$ is 96, which means the question was left blank, so it is logical to omit that variable entirely. Furthermore, none of the $WKSWKGR$ or $GENDER$ levels are significant, so those variables are dropped as well. The new model is:

```{r}
model.lm.pre <- glm(formula = is.satisfied ~
                      as.factor(DGRDG) + as.factor(EMSEC) + as.factor(NDGMEMG) + HRSWKGR +
                      as.factor(JOBINS) + as.factor(JOBPROFT) + as.factor(JOBVAC) +
                      as.factor(OCEDRLP) + SATADV + SATBEN + SATCHAL + SATIND +
                      SATLOC + SATRESP + SATSAL + SATSEC + SATSOC,
                    family = "binomial", data = dataset2)
summary(model.lm.pre)
plotROC(dataset2$is.satisfied == TRUE, model.lm.pre$fitted.values)
```
Trying out various interaction terms not only increased model complexity but also decreased AUROC. For example, an interaction among the three satisfaction variables with the highest average coefficients yields:

```{r}
model.lm.pre <- glm(formula = is.satisfied ~ as.factor(EMSEC) +
                      as.factor(DGRDG) + as.factor(NDGMEMG) + HRSWKGR +
                      as.factor(JOBINS) + as.factor(JOBPROFT) + as.factor(JOBVAC) +
                      as.factor(OCEDRLP) + SATADV + SATBEN + SATIND +
                      SATSAL:SATSOC:SATCHAL,
                    family = "binomial", data = dataset2)
summary(model.lm.pre)
plotROC(dataset2$is.satisfied == TRUE, model.lm.pre$fitted.values)
```
Thus, the final logistic model is the one preceding this experiment: it is the simplest logistic model with the most significant terms that also has, for practical purposes, the (tied) highest AUROC.

2. Call your final regression model \texttt{model.lm}. Clearly show your final regression model: the R command, and the R output summary. Write down the equation that R gives you. Interpret all the coefficients and the $p$-values associated with the coefficients.

```{r}
model.lm <- glm(formula = is.satisfied ~ as.factor(EMSEC) +
                  as.factor(DGRDG) + as.factor(NDGMEMG) + HRSWKGR +
                  as.factor(JOBINS) + as.factor(JOBPROFT) + as.factor(JOBVAC) +
                  as.factor(OCEDRLP) + SATADV + SATBEN + SATIND +
                  SATSAL + SATSOC + SATCHAL,
                family = "binomial", data = dataset2)
summary(model.lm)
model.probs <- model.lm$fitted.values
plotROC(dataset2$is.satisfied == TRUE, model.probs)
```
$$
\begin{aligned}
\log\frac{\hat p}{1-\hat p} = {} & 6.2702 \\
& - 0.0758\,\mathrm{DGRDG2} - 0.1895\,\mathrm{DGRDG3} - 0.404\,\mathrm{DGRDG4} \\
& - 0.0729\,\mathrm{NDGMEMG2} - 0.1224\,\mathrm{NDGMEMG3} - 0.1759\,\mathrm{NDGMEMG4} - 0.017\,\mathrm{NDGMEMG5} \\
& - 0.1755\,\mathrm{NDGMEMG6} - 0.0834\,\mathrm{NDGMEMG7} \\
& - 0.2774\,\mathrm{HRSWKGR2} - 0.2583\,\mathrm{HRSWKGR3} - 0.4947\,\mathrm{HRSWKGR4} \\
& - 0.2894\,\mathrm{JOBINS1} + 0.1141\,\mathrm{JOBPROFT1} - 0.0691\,\mathrm{JOBVAC1} \\
& - 0.1123\,\mathrm{OCEDRLP2} - 0.081\,\mathrm{OCEDRLP3} \\
& - 0.4127\,\mathrm{SATADV2} - 1.313\,\mathrm{SATADV3} - 2.1613\,\mathrm{SATADV4} \\
& - 0.0172\,\mathrm{SATBEN2} - 0.4514\,\mathrm{SATBEN3} - 0.775\,\mathrm{SATBEN4} \\
& - 0.5753\,\mathrm{SATIND2} - 1.4001\,\mathrm{SATIND3} - 1.7337\,\mathrm{SATIND4} \\
& + 0.0413\,\mathrm{SATSAL2} - 1.0657\,\mathrm{SATSAL3} - 1.9186\,\mathrm{SATSAL4} \\
& - 0.4947\,\mathrm{SATSOC2} - 1.3166\,\mathrm{SATSOC3} - 1.6419\,\mathrm{SATSOC4} \\
& - 0.2982\,\mathrm{SATCHAL2} - 1.1864\,\mathrm{SATCHAL3} - 1.6942\,\mathrm{SATCHAL4}
\end{aligned}
$$

where $\hat p$ is the predicted probability of being satisfied (so the left-hand side is the log-odds, not the probability itself).

At first glance the coefficients look counterintuitive: if an employee is satisfied with their advancement opportunities, why would that lower overall job satisfaction? The key is that each satisfaction-aspect variable is coded on a scale from 1 to 4, with 1 being the most satisfied, so a higher code means less satisfied and the negative coefficients actually make sense.

```{r}
# Exponentiated coefficients: odds ratios relative to the reference level.
exp(model.lm$coefficients)
```
#### Interpreting Coefficients

For these coefficient interpretations we establish the idea of a "baseline" person: the person who, for all these variables, falls in the reference category "1" (the person who selected 1 for $SATADV$, $SATIND$, etc.). The exponentiated coefficients above are odds ratios: each one says how many times larger (or smaller) the odds of being satisfied are, relative to the baseline person, all else equal.

$EMSEC$: relative to the baseline person, the odds of being satisfied are

* 0.848 times as large for someone whose employer sector is a 4-year college or medical institution,
* 0.875 times as large for someone whose employer sector is the government,
* 0.751 times as large for someone whose employer sector is business or industry.

$DGRDG$: relative to the baseline, the odds of being satisfied are

* 0.915 times as large when the highest certificate or degree is a Master's,
* 0.812 times as large when it is a Doctorate,
* 0.688 times as large when it is Professional.

$JOBINS$: a person whose job provides health insurance has 0.734 times the baseline odds of being satisfied.

$JOBPROFT$: a person whose job provides a profit-sharing plan has 1.168 times the baseline odds of being satisfied.

$JOBVAC$: a person whose job provides paid vacation/sick/personal days has 0.928 times the baseline odds of being satisfied.

$OCEDRLP$: a person whose principal job is somewhat related to their degree has 0.908 times the baseline odds of being satisfied; a person whose principal job is not related to their degree has 0.949 times the baseline odds.

$NDGMEMG$: relative to the baseline, the odds of being satisfied are

* 0.917 times as large when the field of the highest degree is biological, agricultural and environmental life sciences,
* 0.880 times as large for physical and related sciences,
* 0.826 times as large for social and related sciences,
* 1.001 times as large for engineering,
* 0.833 times as large for science and engineering-related fields,
* 0.896 times as large for non-science and engineering fields.

$HRSWKGR$: relative to the baseline, the odds of being satisfied are

* 0.768 times as large for someone who typically works 21-35 hours per week at their primary job,
* 0.793 times as large for 36-40 hours per week,
* 0.628 times as large for more than 40 hours per week.

For the satisfaction-aspect variables, each ratio compares against someone who selected 1 (very satisfied) on that item; code 2 = somewhat satisfied, 3 = somewhat dissatisfied, 4 = very dissatisfied. Relative to that baseline, the odds of overall satisfaction are:

* $SATADV$ (advancement): 0.735 (code 2), 0.352 (code 3), 0.177 (code 4)
* $SATBEN$ (benefits): 1.089 (code 2), 0.841 (code 3), 0.874 (code 4)
* $SATCHAL$ (challenge): 0.842 (code 2), 0.403 (code 3), 0.266 (code 4)
* $SATIND$ (independence): 0.632 (code 2), 0.317 (code 3), 0.257 (code 4)
* $SATSAL$ (salary): 1.051 (code 2), 0.346 (code 3), 0.145 (code 4)
* $SATSOC$ (contribution to society): 0.671 (code 2), 0.314 (code 3), 0.232 (code 4)

#### Interpreting p-values
All variables had $p < 0.05$ except the following: $SATSAL2$, $SATBEN2$, $OCEDRLP3$, $JOBVAC1$, $NDGMEMG7$, $NDGMEMG5$, $NDGMEMG3$, $NDGMEMG2$. In other words, every coefficient except these has a significant effect on job satisfaction at the 5% significance level.

3. Report your model's ROC curve and pseudo R-squared, and report any diagnostic plots or statistics that you used. Is your model a good fit? Is it easy to interpret?

The final model, as can be seen above, has an $AUROC = 0.9232$. The AUROC measures how well the model ranks satisfied above dissatisfied respondents; since it is close to 1, which signifies perfect discrimination, the model fits very well. Note, though, that the predictors include the satisfaction-aspect items themselves, which partly encode the outcome, so a high AUROC is to be expected.
```{r}
# Confusion matrix at a 0.5 cutoff and the implied accuracy.
cutoff <- 0.5
model.pred <- model.probs >= cutoff
conf.mat <- table(dataset2$is.satisfied, model.pred)
conf.mat
accuracy <- (conf.mat[1, 1] + conf.mat[2, 2]) / sum(conf.mat)
accuracy
```
Based on the confusion matrix with a cutoff of $0.5$, the accuracy is about $0.925$. For context, a model that simply predicted "satisfied" for everyone would already be right about 89% of the time, so the model does improve on the base rate (a quick check below).

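A minimal sketch of that base-rate comparison (the majority-class accuracy), using the recoded outcome:

```{r}
# Accuracy of always predicting the majority class ("satisfied").
mean(dataset2$is.satisfied)
```
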
4. Suppose you want to choose a career path to maximize your job satisfaction. Which career path would you choose based on your model? (Detail which highest degree you should obtain in which major, which sector your employer should be in, etc.)
```{r}
summary(model.lm)
```
Based on the logistic model constructed, the career path with the highest predicted satisfaction would be:

* Work in a 2-year college or school system
* Get a Bachelor's in Engineering
* Work less than 20 hours at your principal job
* Work at a job with health insurance
* Work at a job with profit sharing
* Get paid vacation/sick/personal days
* Have your principal job closely related to your highest degree
* Be satisfied with your job's opportunity for advancement
* Be satisfied with your job's benefits
* Be satisfied with your job's intellectual challenge
* Be satisfied with your job's degree of independence
* Be satisfied with your job's salary
* Be satisfied with your job's contribution to society

## Fact-check news outlets

News outlets regularly examine relationships between degrees, job satisfaction and income.
Here are various claims from three different outlets.

1. Gallup: Does Higher Learning = Higher Job Satisfaction?
\url{https://news.gallup.com/poll/6871/does-higher-learning-higher-job-satisfaction.aspx}

This article claims that:
a. Education level has very little to do with job satisfaction, or satisfaction with income and time flexibility.
b. Having the opportunity to do what you do best is the one factor that correlates most highly with overall job satisfaction.

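One way to probe claim 1a with this dataset (a sketch, not part of the analysis above): test the association between highest-degree type and the recoded satisfaction outcome, with DGRDG as a rough proxy for Gallup's "education level".

```{r}
# Association between education level (DGRDG) and being satisfied.
tab.edu <- table(dataset2$is.satisfied, dataset2$DGRDG)
chisq.test(tab.edu)
# Satisfaction rate within each degree type (columns sum to 1).
round(prop.table(tab.edu, margin = 2), 3)
```
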
2. Diverse Education: College-educated Americans More Likely Experience Job Satisfaction, Lead Healthier Lives, Study Says
\url{https://diverseeducation.com/article/14156/}

This article claims that:
a. Certain race groups earn less than others when they have the same education level.
This is true; we check by regressing SALARY on race/ethnicity separately within each education level.
```{r}
# Regress SALARY on race/ethnicity within each degree level
# (distinct names to avoid clobbering dataset2 from the logistic section).
data.bach <- dataset[dataset$DGRDG == 1, ]
race.bachelors <- lm(SALARY ~ RACETH, data = data.bach)
summary(race.bachelors)

data.masters <- dataset[dataset$DGRDG == 2, ]
race.masters <- lm(SALARY ~ RACETH, data = data.masters)
summary(race.masters)

data.phd <- dataset[dataset$DGRDG == 3, ]
race.phd <- lm(SALARY ~ RACETH, data = data.phd)
summary(race.phd)

data.prof <- dataset[dataset$DGRDG == 4, ]
race.prof <- lm(SALARY ~ RACETH, data = data.prof)
summary(race.prof)
```
b. STEM (science, technology, engineering and mathematics) careers, in which minorities are underrepresented, tend to pay more than careers in social sciences. This is also true, as the regression of salary on occupation group below shows.
```{r}
occupation <- lm(SALARY ~ NOCPRMG, data = dataset)
summary(occupation)
```

3. PEW: the rising cost of not going to college
\url{https://www.pewsocialtrends.org/2014/02/11/the-rising-cost-of-not-going-to-college/}

This article claims that:
a. Those who studied science or engineering are the most likely to say that their current job is “very closely” related to their college or graduate field of study.

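A sketch for checking claim 3a with this dataset, assuming OCEDRLP code 1 means "closely related" (verify the exact label in codebook-basic.txt):

```{r}
# Share of respondents whose job is closely related to their degree
# (OCEDRLP == 1), broken down by field of highest degree (NDGMEMG).
tab.rel <- table(dataset$NDGMEMG, dataset$OCEDRLP)
round(prop.table(tab.rel, margin = 1)[, "1"], 3)
```
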
1. For each of the claims above, use your analysis to verify or disprove it.
2. If you disprove any claims, explain why your conclusions could differ from theirs. For example, you could elaborate on major differences between the dataset you are using and the survey used by the article, or between your method of analysis and theirs.

## Lay summary

Give a two to three-page summary to highlight the findings in the technical report for the general public. Your summary should contain four sections:

- highlights from the basic analysis

- highlights from the salary model

- highlights from the job satisfaction model

- highlights from the fact-check section