
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

### Directions: Complete two (2) of the 3 Tasks listed in the project.

> 1. Complete either Task 1 or Task 2

> 2. Complete Task 3

**Note:** If your `Rmd` file submission knits, you will receive a total of **(5 points)**

```{r packages, echo=TRUE, message=FALSE}
# load the packages needed
library(PASWR2)
library(ggplot2)
library(dplyr)
library(lattice)
```

#### **1. (5 points)** How many packages were loaded?

Answer: Four (4)


## Task 1: **This is problem 8 on page 196 in the text** with added questions

Note: **Problem 8/p. 196 is modified**

Some claim that the final hours aboard the Titanic were marked by class warfare; others claim they were characterized by male chivalry. The data frame `TITANIC3` from the `PASWR2` package contains information pertaining to class status (`pclass`), survival of passengers (`survived`), and gender (`sex`), among others. Based on the information in the data frame:

### Load and Access the Data from the package

A description of the variables can be found by running the code:

```{r data description}
help("TITANIC3")
data("TITANIC3")
```

#### **2. (5 points)** How many observations and variables are in the `TITANIC3` data?
Hint: Use the function `dim()`, `glimpse()` or `str()`.

```{r dimensions}
str(TITANIC3)
```

**Answer:** There are 1,309 rows and 14 columns in `TITANIC3`.

#### **3. (5 points)** Write code to show the first (or last) 6 observations in the `TITANIC3` data.

```{r}
tail(TITANIC3, 6)
```

#### **4. (5 points)** Using the `survived` variable in the `TITANIC3` data, which is of type integer `(0/1)`, mutate it to a factor variable by running the code below and create a **new** data frame `TITANIC`.

What are the new levels of `survived` and its type?

```{r}
TITANIC <- TITANIC3 %>% mutate(survived = factor(survived, levels = 0:1, labels = c("No", "Yes")))

# inspect the levels and the type of the recoded variable
levels(TITANIC$survived)
class(TITANIC$survived)
```

**Answer:** The levels of `survived` are now `"No"` and `"Yes"`, and its type is factor.


#### **5. (5 points)** The code below produces a summary for the `TITANIC` data. Write code using the pipe `%>%` operator that produces the same result.


```{r}
summary(TITANIC)

```

YOUR CODE HERE:
```{r}
#TITANIC %>% ...
TITANIC %>% summary()
```


#### **a) (5 points)** Determine the fraction of survivors (`survived`) according to class (`pclass`).

**Hint:** Uncomment one of the first 3 lines in the code chunk below and then use the `prop.table` function.

```{r part-a}
T1 <- xtabs(~survived + pclass, data = TITANIC)

#T1 <- table(TITANIC$survived,TITANIC$pclass)

T1 <- TITANIC %>% select(survived, pclass) %>% table()

T1

prop.table(T1, margin = 2) # to produce the proportion per column (2), per row would be margin = 1
```
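
To see what `margin = 2` does, here is a minimal sketch that rebuilds the survived-by-class counts by hand and divides each column by its own total. The object names `counts`, `col_props`, and `by_hand` are illustrative only and are not part of the assignment.

```r
# Counts of survival by passenger class, typed in from the table printed by T1
counts <- matrix(c(123, 200, 158, 119, 528, 181), nrow = 2,
                 dimnames = list(survived = c("No", "Yes"),
                                 pclass = c("1st", "2nd", "3rd")))

# margin = 2 divides every cell by its column total, so each class sums to 1
col_props <- prop.table(counts, margin = 2)
round(col_props, 3)

# the same result by hand: divide each column by its column sum
by_hand <- sweep(counts, 2, colSums(counts), "/")
all.equal(col_props, by_hand)
```

With `margin = 1` the division would instead use the row totals, which answers a different question (the class make-up of survivors rather than the survival rate within a class).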


**Answer:** In 1st class, 200 of 323 passengers survived (about 61.9%); in 2nd class, 119 of 277 (about 43.0%); in 3rd class, 181 of 709 (about 25.5%).


#### **b) (10 points)** Compute the fraction of survivors according to class and gender.  Did men in the first class or women in the third class have a higher survival rate?

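Hint: Use the code below, which creates a 3-way table, and then use `prop.table()` similarly to part a).

```{r part-b}
T2 <- TITANIC %>% select(pclass, sex, survived) %>% table()

T2

# proportions within each class-by-sex group, so "No" and "Yes" sum to 1 for every group
prop.table(T2, margin = c(1, 2))

```

**Answer:** About 49% of women in third class survived, compared with about 34% of men in first class, so women in the third class had the higher survival rate.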
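
The difference between a share of all passengers and a survival rate within a group can be sketched on a small made-up table. The data frame `toy` below is hypothetical and is not the `TITANIC` data; it only illustrates how the `margin` argument of `prop.table()` changes the denominator in a 3-way table.

```r
# Hypothetical toy passengers (not the TITANIC data)
toy <- data.frame(
  pclass   = c("1st", "1st", "1st", "1st", "3rd", "3rd", "3rd", "3rd"),
  sex      = c("male", "male", "female", "female", "male", "male", "female", "female"),
  survived = c("No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes")
)
tab <- table(toy)   # 3-way table: pclass x sex x survived

# No margin: every cell is a share of ALL passengers, so the cells sum to 1 overall
overall <- prop.table(tab)

# margin = c(1, 2): condition on pclass and sex, so "No" + "Yes" sum to 1 in each group
within_group <- prop.table(tab, margin = c(1, 2))

overall["3rd", "female", "Yes"]       # share of all passengers
within_group["3rd", "female", "Yes"]  # survival rate among third-class women
```

Only the second quantity answers a question of the form "what fraction of group X survived".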


#### **c) (10 points)** How would you characterize the distribution of age (e.g., is it symmetric, positively/negatively skewed, unimodal, multimodal)?

Hint: Run the code below, which produces some summary statistics and the density plot.
The commented code is the older style of R programming; it is shown because it may resemble the textbook examples.

```{r part-c}
# Finding summary statistics

#median(TITANIC$age, na.rm = TRUE) # old style
#mean(TITANIC$age, na.rm = TRUE) # old style

# dplyr style
TITANIC %>% summarise(mean = mean(age, na.rm = TRUE), median = median(age, na.rm = TRUE))

# IQR(TITANIC$age, na.rm = TRUE)

TITANIC %>% pull(age) %>% IQR(na.rm = TRUE) # pull() extracts the column from the data frame as a vector

# look at the density plot to see whether the distribution is unimodal or bimodal
ggplot(data = TITANIC, aes(x = age)) +
geom_density(fill = "lightgreen") +
theme_bw()

```
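
As a rough numeric companion to the density plot, comparing the mean with the median gives a quick hint about skewness. The sketch below uses a made-up vector `ages` (hypothetical values, not the `TITANIC` ages) and is only a heuristic check, not a formal test.

```r
# Hypothetical right-skewed ages: most values are moderate, a few are large
ages <- c(2, 18, 21, 22, 24, 25, 28, 30, 36, 45, 58, 71, NA)

avg <- mean(ages, na.rm = TRUE)
med <- median(ages, na.rm = TRUE)

# a mean above the median is a common (not infallible) sign of positive skew
c(mean = avg, median = med, mean_minus_median = avg - med)
```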

**Answer:** The distribution of age is positively (right) skewed.


#### **d) (5 points)** Were the median and mean ages for females who survived higher or lower than for females who did not survive?  Report the median and mean ages as well as an appropriate measure of spread for each statistic.

**Hint:** Using the `dplyr` package functions and the pipe operator ` %>% `, elegant code can produce summaries for each statistic: mean, median, sd, IQR.


##### Without considering the `pclass` variable, namely regardless of passenger class:

##### Mean Summaries
```{r}
# mean summaries
TITANIC %>% group_by(sex, survived) %>% summarise(avg = mean(age, na.rm = TRUE))

```

##### Standard Deviation Summaries
```{r}
# sd summaries
TITANIC %>% group_by(sex, survived) %>% summarise(stdev = sd(age, na.rm = TRUE))

```

##### Median Summaries

```{r}
# median summaries
TITANIC %>% group_by(sex, survived) %>% summarise(med = median(age, na.rm = TRUE))

```

##### IQR Summaries

```{r}
# IQR summaries
TITANIC %>% group_by(sex, survived) %>% summarise(IQR = IQR(age, na.rm = TRUE))

```
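
The four separate summaries above can also be collected in a single `summarise()` call, which puts a measure of centre next to its measure of spread. This is a sketch on a made-up data frame `toy` (hypothetical values, with column names chosen to mirror `sex`, `survived`, and `age`); it assumes `dplyr` version 1.0 or later for the `.groups` argument.

```r
library(dplyr)

# Hypothetical toy data, standing in for the sex, survived, and age columns
toy <- data.frame(
  sex      = c("female", "female", "female", "female", "male", "male", "male", "male"),
  survived = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"),
  age      = c(20, 30, 40, NA, 10, 30, 50, 60)
)

# one summarise() call returns centre and spread side by side for each group
toy_summary <- toy %>%
  group_by(sex, survived) %>%
  summarise(avg   = mean(age, na.rm = TRUE),
            stdev = sd(age, na.rm = TRUE),
            med   = median(age, na.rm = TRUE),
            iqr   = IQR(age, na.rm = TRUE),
            .groups = "drop")
toy_summary
```

Reporting the mean with the standard deviation and the median with the IQR keeps each statistic paired with a spread measure of the same kind.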

Based on the summaries, answer the question below:

**d-1)**

For those who survived, is the mean age for females GREATER than the mean age for males?

**d-2)**
For those who survived, is the median age for females GREATER than the median age for males?


**Answer:** _ _ _


#### **6. (10 points)** Now consider the `pclass` variable in the `TITANIC` data too, create similar summary statistics, and answer the questions below.

For those who survived, in which class is the mean age for females **less** than the mean age for males?

For those who survived, in which class is the median age for females **greater** than the median age for males?


Write your code in the chunk below:

```{r}
# mean summaries
TITANIC %>% group_by(pclass, sex, survived) %>% summarise(avg = mean(age, na.rm = TRUE))

# median summaries
TITANIC %>% group_by(pclass, sex, survived) %>% summarise(med = median(age, na.rm = TRUE))
```

#### **e) (5 points)**  Were the median and mean ages for males who survived higher or lower than for males who did not survive?  Report the median and mean ages as well as an appropriate measure of spread for each statistic.

**Hint:** Read the output of the code in part d)

**Answer:** The mean and median ages for males who did not survive (31.5 and 27) were at least as high as those for males who survived (27.0 and 27), so male survivors were younger on average.


#### **f) (5 points)**  What was the age of the youngest female in the first class who survived?

**Hint:** Complete the code below by specifying which variable you want to be arranged.

```{r}
# sort by age in ascending order so the youngest passenger is in the first row
TITANIC %>% filter(sex == "female" & survived == "Yes" & pclass == "1st") %>% arrange(age)
```

Arranging in descending order is achieved by specifying in the `arrange()` function `desc(var_name)`.

#### **7. (5 points)** What was the age of the oldest female (male) in the first class who survived?

YOUR CODE HERE:
```{r}
TITANIC %>% filter (sex =="female" & survived =="Yes" & pclass == "1st") %>% arrange(desc(age))

TITANIC %>% filter (sex =="male" & survived =="Yes" & pclass == "1st") %>% arrange(desc(age))
```

**Answer:** The oldest male was Barkworth, Mr. Algernon Henry W, and the oldest female was Cavendish, Mrs. Tyrell William.

The oldest female survivor in 1st class was 76 years of age.

The oldest male survivor in 1st class was 80 years of age.

#### **g) (10 points)** Do the data suggest that the final hours aboard the Titanic were characterized by class warfare, male chivalry, some combination of both, or neither? Justify your answer based on computations above, or based on other explorations of the data.

**Hint:** Review and explain the exploratory graphs created by the code chunk. How do they support your justification?

```{r part extra}

TITANIC %>%  ggplot(aes(x = survived)) +
  geom_bar(aes(fill = sex), stat = "count", position = "stack" ) +
  theme_bw()

TITANIC %>%  ggplot(aes(x = survived)) +
  geom_bar(aes(fill = pclass), stat = "count", position = "stack" ) +
  theme_bw()

```

Of those who survived the sinking of the Titanic, most were from the 3rd and 1st classes, so I don't think class played as big a role in survival. The survivors were overwhelmingly female, which supports the male chivalry theory.

## Task 1 (Extra Credit, 10 pts): Produce CLEAN data from the TITANIC data by removing all observations with `NA`

Comment: In most of the code you used/wrote in **Task 1**, functions were called with the argument `na.rm = TRUE`, instructing the `NA` values to be dropped for the computations.

**part 1) (5 points)** Use the function `na.omit()` (or the `filter()` function from the `dplyr` package) to create a **clean** data set that removes subjects if any observations on the subject are **unknown**. Store the modified data frame in a data frame named `CLEAN`. Run the function `dim()` on the data frame `CLEAN` to find the number of observations (rows) in the `CLEAN` data.

COMPLETE THE CODE HERE, uncomment necessary lines before running:
```{r part 1 extra_credit}
CLEAN <- na.omit(TITANIC)
#or
#CLEAN <- TITANIC %>% filter(complete.cases(_ _ _))
#print the dimensions
dim(CLEAN)
```

**part 2) (5 points)** How many missing values in the data frame `TITANIC` are there? How many rows of `TITANIC` have no missing values, one missing value, two missing values, and three missing values, respectively? Note: the number of rows in `CLEAN` should agree with your answer for the number of rows in `TITANIC` that have no missing values.

What are the cons of cleaning the data in the suggested way?

Use the code below and explain what it does.

```{r part 2 extra_credit}
#get the number of missing values in columns
colNAs<- colSums(is.na(TITANIC))
(colNAs <- as.vector(colSums(is.na(TITANIC)))) # coerce to a vector
rowNAs <- table(rowSums(is.na(TITANIC)))
(rowNAs <- as.vector(table(rowSums(is.na(TITANIC))))) # coerce to a vector
```
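
What `colSums(is.na(...))` and `table(rowSums(is.na(...)))` count can be sketched on a small made-up data frame. The object `toy` and its values are hypothetical; the point is that the number of rows kept by `na.omit()` equals the number of rows with zero missing values.

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(
  age  = c(22, NA, 35, NA),
  fare = c(7.25, 71.3, NA, NA),
  sex  = c("male", "female", "female", "male")
)

col_na <- colSums(is.na(toy))          # missing values per column
row_na <- table(rowSums(is.na(toy)))   # how many rows have 0, 1, 2, ... missing values

col_na
row_na

# na.omit() keeps only complete rows, so its row count matches the "0" entry above
clean <- na.omit(toy)
nrow(clean) == as.integer(row_na["0"])
```

In this toy case three of the four rows are dropped, which illustrates the main cost of listwise deletion: a row is lost entirely even when only one of its values is missing.
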

**Comment:** The missing values are for variables _ _ _ _ _ _.

**Comment:** There are **`r rowNAs[1]`** rows with no missing values, **`r rowNAs[2]`** rows with 1 missing value, and **`r rowNAs[3]`** rows with 2 missing values.

Comment on how this aligns with the dimensions of your `CLEAN` data.

> **Your comment:**

**Good practice:** Save your customized data frame `CLEAN` in your working directory as a `*.csv` file using the function `write.csv()` with the argument `row.names = FALSE`.

```{r save}
write.csv(CLEAN, file="TITANIC_CLEAN.csv", row.names=FALSE)
```
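
A quick way to check a saved file is to read it back and compare it with the original. The sketch below uses a made-up data frame `toy` and a temporary file (both hypothetical stand-ins for `CLEAN` and `TITANIC_CLEAN.csv`) so that nothing is written to the working directory.

```r
# Hypothetical small data frame standing in for CLEAN
toy <- data.frame(name = c("a", "b"), age = c(29.5, 41))

# write to a temporary file so the working directory is left untouched
path <- tempfile(fileext = ".csv")
write.csv(toy, file = path, row.names = FALSE)

# reading the file back should reproduce the same columns and values
back <- read.csv(path)
names(back)
all.equal(toy, back)
```

Without `row.names = FALSE`, the row names would be written as an extra first column and the round trip would no longer return the same set of columns.
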

## Task 2: **This is problem 9 on page 197 in the text** as is.

**Note:** This is a **not guided** task; you have to write your own code from scratch!

Use the `CARS2004` data frame from the `PASWR2` package, which contains the numbers of cars per `1000` inhabitants (`cars`), the total number of known mortal accidents (`deaths`), and the country population/1000 (`population`) for the 25 member countries of the European Union for the year 2004.

#### **a) (10 points)**

YOUR CODE:
```{r}
# cars per 1000 inhabitants in each country
total.cars <- CARS2004[, "cars"]
total.cars

# known mortal accidents in each country, used as death.rate in the parts below
death.rate <- CARS2004[, "deaths"]
death.rate
```

**Answer:**

#### **b) (10 points)**
YOUR CODE:
```{r}
death.rate <- CARS2004[,'deaths']
barplot(death.rate, horiz=TRUE, names.arg=(CARS2004[,'country']))
```

**Answer:**

#### **c) (10 points)**

YOUR CODE:
```{r}
```

**Answer:** Hungary, Italy

#### **d) (10 points)**

YOUR CODE:
```{r}
total.cars <- CARS2004[,'cars']
ggplot(data = CARS2004, mapping = aes(x = total.cars, y = population)) +
    geom_point()
```

**Answer:** It seems that countries with lower populations have a far greater number of cars per 1000 inhabitants.

#### **e) (10 points)**

YOUR CODE:
```{r}
```

**Answer:**

#### **f) (10 points)**

YOUR CODE:
```{r}
death.rate <- CARS2004[,'deaths']
ggplot(data = CARS2004, mapping = aes(x = total.cars, y = death.rate)) +
    geom_point()
```

**Answer:** On average, countries with more than 300 but fewer than 500 cars per 1000 inhabitants had a higher number of deaths, above 100.

#### **g) (10 points)**

YOUR CODE:
```{r}
cor(x = total.cars, y = death.rate, method = "spearman")
```
**Answer:** -0.4693878
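
Spearman's rank correlation is Pearson's correlation computed on the ranks of the data, which is why it measures monotone rather than strictly linear association. A minimal sketch on made-up vectors `x` and `y` (hypothetical values, not the `CARS2004` data) shows the two computations agree:

```r
# Hypothetical toy values (not the CARS2004 data)
x <- c(222, 280, 350, 420, 455, 500, 546, 641)
y <- c(160, 170, 120, 140, 95, 110, 60, 80)

rho_spearman <- cor(x, y, method = "spearman")

# Spearman's rho is Pearson's correlation applied to the ranks
rho_by_ranks <- cor(rank(x), rank(y), method = "pearson")

c(spearman = rho_spearman, pearson_on_ranks = rho_by_ranks)
```

A negative value means that larger values of one variable tend to go with smaller values of the other.
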

#### **h) (10 points)**

YOUR CODE:
```{r}
ggplot(data = CARS2004, mapping = aes(x = total.cars, y = death.rate)) +
    geom_point() + scale_x_continuous(trans = 'log2') +
    scale_y_continuous(trans = 'log2')
```
**Answer:** The bulk of the data sees a higher death rate in direct correlation to a higher number of cars.
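
Plotting both axes on a log scale is useful because a power-law relationship becomes a straight line there, and its slope can be read off directly. The sketch below uses made-up data that follow an exact power law (an assumption for illustration only, not a claim about `CARS2004`) and recovers the exponent with `lm()` on the `log2` values.

```r
# Hypothetical power-law data: y = 400 * x^(-0.5), not the CARS2004 values
x <- c(2, 4, 8, 16, 32, 64)
y <- 400 * x^(-0.5)

# on the log2 scale the relationship is a straight line:
# log2(y) = log2(400) - 0.5 * log2(x)
fit <- lm(log2(y) ~ log2(x))
coef(fit)

# correlation computed on the transformed values, matching what a log-log plot shows
cor(log2(x), log2(y))
```

A downward-sloping cloud on the log-log plot corresponds to a negative slope and a negative correlation on the transformed scale.
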

## Task 3 **(10 points)** Create a map with the leaflet package, by completing the code below, that displays 5 UNC system schools using their geographic locations. Draw circles with radius proportionate to the school size using the `addCircles()` function.

Try: E.g. `addCircles(weight = 1, radius = sqrt(UNC_schools$size)*100)`

```{r}
set.seed(2020-02-01)
library(leaflet)
# The code below creates a data frame of 5 UNC system schools
# with columns name (of UNC school), size (number of students), lat, and lng
UNC_schools <- data.frame(name = c("NC State", "UNC Chapel Hill", "FSU", "ECU", "UNC Charlotte"),
                        size = c(30130, 28136, 6000, 25990, 25990),
                        lat = c(36.0373638, 35.9050353, 35.0726, 35.6073769, 35.2036325),
                        lng = c(-79.0355663, -79.0477533, -78.8924739, -77.3671566, -80.8401145))
# Use the data frame to draw map and circles proportional to the school sizes of the cities
UNC_schools %>%
  leaflet() %>%
  addTiles() %>%
  addCircles(weight = 1, radius = sqrt(UNC_schools$size)*100) # try adjusting the radius by multiplying with 50 instead of 100. What do you notice?
```
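
The reason for taking `sqrt()` of the size can be checked with a little arithmetic: a circle's area grows with the square of its radius, so a radius proportional to the square root of enrollment makes the drawn area proportional to enrollment. The sketch below reuses the five sizes from `UNC_schools`; the names `radius` and `area` are illustrative only.

```r
# School sizes copied from the UNC_schools data frame above
size <- c(30130, 28136, 6000, 25990, 25990)

radius <- sqrt(size) * 100   # radius in metres, as passed to addCircles()
area   <- pi * radius^2      # area of each drawn circle

# area / size is the same constant for every school, so area is proportional to size
area / size

# halving the multiplier (50 instead of 100) halves each radius and quarters each area
(pi * (sqrt(size) * 50)^2) / area
```

Using `size` directly as the radius would instead make the area grow with the square of enrollment and visually exaggerate the larger schools.
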

## Task 3 (Extra Credit, 5 pts): Add at least two more UNC schools, using their location data and enrollment numbers. Modify the code above and update the map for all schools included.

```{r}
UNC_schools <- data.frame(name = c("NC State", "UNC Chapel Hill", "FSU", "ECU", "UNC Charlotte", "UNC Greensboro", "UNC Wilmington"),
                          size = c(30130, 28136, 6000, 25990, 25990, 15995, 661),
                          lat = c(36.0373638, 35.9050353, 35.0726, 35.6073769, 35.2036325, 36.0683663932, 34.2226257762),
                          lng = c(-79.0355663, -79.0477533, -78.8924739, -77.3671566, -80.8401145, -79.8068367726, -77.873491506))
UNC_schools %>%
  leaflet() %>%
  addTiles() %>%
  addCircles(weight = 1, radius = sqrt(UNC_schools$size)*25)
```

```{r, echo=FALSE}
## DO NOT CHANGE ANYTHING IN THIS CODE CHUNK!
date()
sessionInfo()
R.Version()
```