Advertisement
SixPathsOfMen

White Lives Dont Matter

Feb 12th, 2023
1,119
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
R 15.06 KB | None | 0 0
  1. ```{r setup, include=FALSE}
  2. knitr::opts_chunk$set(echo = TRUE)
  3. ```
  4.  
  5. ### Directions: Complete two (2) of the 3 Tasks listed in the project.
  6.  
  7. > 1. Complete either Task 1 or Task 2
  8.  
  9. > 2. Complete Task 3
  10.  
  11. **Note:** If you `Rmd` file submission knits you will receive total of **(5 points)**
  12.  
  13. ```{r packages, echo=TRUE, message=FALSE}
  14. # load the packages needed
  15. library(PASWR2)
  16. library(ggplot2)
  17. library(dplyr)
  18. library(lattice)  
  19. ```
  20.  
  21. #### **1. (5 points)** How many packages were loaded?
  22.  
  23. Answer: Four (4)
  24.  
  25.  
  26. ## Task 1: **This is problem 8 on page 196 in the text** w/t added questions
  27.  
  28. Note: **Problem 8/p. 196 is modified**
  29.  
  30. Some claim that the final hours aboard the Titanic were marked by class warfare other  claim  it  was  characterized  by male chivalry. The  data  frame `TITANIC3` from the `PASWR2` package contains information pertaining to class status `pclass`,survival of passengers `survived`,  and gender `sex`,  among others.  Based on the information in the data frame:
  31.  
  32. ### Load and Access the Data from the package  
  33.  
  34. A description of the variables can be found by running the code:
  35.  
  36. ```{r data description}
  37. help("TITANIC3")
  38. data("TITANIC3")
  39. ```
  40.  
  41. #### **2. (5 points)** How many observations and variables are in the `TITANIC3` data?
  42. Hint: Use the function `dim()`, `glimpse()` or `str()`.
  43.  
  44. 1,309 Observations & 14 variables
  45.  
  46. ```{r dimentions}
  47. str(TITANIC3)
  48. ```
  49.  
  50. **Answer:** There are 1,309  rows and 14 columns in `TITANIC3`.
  51.  
  52. #### **3. (5 points)** Write code to show the first (or last) 6 observation in the `TITANIC3` data?
  53.  
  54. ```{r}
  55. tail(TITANIC3, 6)
  56.  
  57. #### **4. (5 points)** Using the `survived` variable in the `TITANIC3` data, which is of type integer `(0/1)` mutate it to a factor variable by running the code below and create **new** data frame `TITANIC`.
  58.  
  59. #What are the new levels of `survived` and its type?
  60.  
  61. #```{r}
  62. #TITANIC <- TITANIC3 %>% mutate(survived = factor(survived, levels = 0:1, labels = c("No", "Yes")))
  63.  
  64. #**Answer:**
  65. TITANIC <- mutate(TITANIC3, survived = factor(survived, levels = 0:1, labels = c("No", "Yes")))
  66. sapply(TITANIC, levels)
  67.  
  68.  
  69. #### **5. (5 points)** The code below produces summary for the `TITANIC` Data. Write code using the pipe %>% operator the produces the same result.
  70.  
  71.  
  72.  
  73. ```{r}
  74. summary(TITANIC)
  75.  
  76. ```
  77.  
  78. YOUR CODE HERE:
  79. ```{r}
  80. #TITANIC %>% ...
  81. TITANIC %>% summary()
  82. ```
  83.  
  84.  
  85. #### **a) (5 points)** Determine the fraction of survivors (`survived`) according to class (`pclass`).
  86.  
  87. **Hint:** Uncomment one of the first 3 lines in the code chunk below and then use the `prop.table` function.
  88.  
  89. ```{r part-a}
  90. T1 <- xtabs(~survived + pclass, data = TITANIC)
  91.  
  92. #T1 <- table(TITANIC$survived,TITANIC$pclass)
  93.  
  94. T1 <- TITANIC %>% select(survived, pclass) %>% table()
  95.  
  96. T1
  97.  
  98. prop.table(T1, margin = 2) # to produce the proportion per column (2), per row would be margin = 1
  99. ```
  100.  
  101.  
  102. **Answer:** In 1st class percent survived is ..., 2nd class ..., 3rd class ...
  103.         pclass
  104. survived 1st 2nd 3rd
  105.      No  123 158 528
  106.      Yes 200 119 181
  107.  
  108.  
  109. #### **b) (10 points)** Compute the fraction of survivors according to class and gender.  Did men in the first class or women in the third class have a higher survival rate?
  110.  
  111. Hint: Use the code below that creates 3-way table and then use `prop.table()` similarly to part a).
  112.  
  113. ```{r part-b}
  114. T2 <- TITANIC %>% select(pclass, sex, survived) %>% table()
  115.  
  116. T2
  117.  
  118. prop.table(T2)
  119.  
  120. ```
  121.  
  122. **Answer:** 8% of women in third class survived while 9% of men in first class survived.
  123.  
  124.  
  125. #### **c) (10 points)** How  would  you  characterize  the  distribution  of age(e.g.,  is  it symmetric,positively/negatively skewed, unimodal, multimodal)?
  126.  
  127. Hint: Run the code below that produces some summary statistics and the density distribution.
  128. The commented code is old style of R programming, it is shown as it may resemble the textbook examples.
  129.  
  130. ```{r part-c}
  131. # Finding summary statistics
  132.  
  133. #median(TITANIC$age, na.rm = TRUE) # old style
  134. #mean(TITANIC$age, na.rm = TRUE) # old style
  135.  
  136. # dplyr style
  137. TITANIC %>% summarise(mean = mean(age, na.rm = TRUE), meadian = median(age, na.rm = TRUE))
  138.  
  139. # IQR(TITANIC$age, na.rm = TRUE)
  140.  
  141. TITANIC %>% pull(age) %>% IQR(na.rm = TRUE) # pull() does extract the column from the data frame as vector object
  142.  
  143. # look at the density function to see if it is uni or bi-modal distribution
  144. ggplot(data = TITANIC, aes(x = age)) +
  145. geom_density(fill = "lightgreen") +
  146. theme_bw()
  147.  
  148. ```
  149.  
  150. **Answer:** Positive Skew
  151.  
  152.  
  153. #### **d) (5 points)** Were the median and mean ages for females who survived higher or lower than for females who did not survive?  Report the median and mean ages as well as an appropriate measure of spread for each statistic.
  154.  
  155. **Hint:** Using the `dplyr` package functions and the pipes operator ` %>% ` elegant code can produce summaries for each statistics - mean, median, sd, IRQ.
  156.  
  157.  
  158. ##### Without considering the `pclass` variable, namely regardless of passenger class:
  159.  
  160. ##### Mean Summaries
  161. ```{r}
  162. # mean summaries
  163. TITANIC %>% group_by(sex, survived) %>% summarise(avg = mean(age, na.rm = TRUE))
  164.  
  165. ```
  166.  
  167. ##### Standard deviation Summarries
  168. ```{r}
  169. # sd summaries
  170. TITANIC %>% group_by(sex, survived) %>% summarise(stdev = sd(age, na.rm = TRUE))
  171.  
  172. ```
  173.  
  174. #### Median Summarries
  175.  
  176. ```{r}
  177. # median summaries
  178. TITANIC %>% group_by(sex, survived) %>% summarise(med = median(age, na.rm = TRUE))
  179.  
  180. ```
  181.  
  182. #### IQR Summarries
  183.  
  184. ```{r}
  185. # IQR summaries
  186. TITANIC %>% group_by(sex, survived) %>% summarise(IQR = IQR(age, na.rm = TRUE))
  187.  
  188. ```
  189.  
  190. Based on the summaries, answer the question below:
  191.  
  192. **d-1)**
  193.  
  194. For those who survived, the mean age for females is GREATER than the mean age for males?
  195.  
  196. **d-2)**
  197. For those who survived, the median age for females is GREATER than the median age for males?
  198.  
  199.  
  200.  
  201. **Answer:** _ _ _
  202.  
  203.  
  204. #### **6. (10 points)** Now Consider the `survived` variable in the `TITANIC` data too, create similar summary statistics and answer the question below.
  205.  
  206. For those who survived, which class the mean age for females is *less** than the mean age for males?
  207.  
  208. For those who survived, which class the median age for females is **greater** than the median age for males?
  209.  
  210.  
  211. Write your code in the chunk below:
  212.  
  213. ```{r}
  214. # mean summaries
  215. TITANIC %>% group_by(pclass, sex, survived) %>% summarise(avg = mean(age, na.rm = TRUE))
  216.  
  217. # median summaries
  218. # write your code here
  219. TITANIC %>% group_by(pclass, sex, survived) %>% summarise(avg = median(age, na.rm = TRUE))
  220.  
  221.  
  222. #### **e) (5 points)**  Were the median and mean ages for males who survived higher or lower than for males who did not survive?  Report the median and mean ages as well as an appropriate measure of spread for each statistic.
  223.  
  224. **Hint:** Read the output of the code in part d)
  225.  
  226. **Answer:** The Median & Mean ages(31.5 & 27) for Males who did not survive were both greater than the Mean & Median for Men who did survive (27.0 & 27)
  227.  
  228.  
  229. #### **f) (5 points)**  What was the age of the youngest female in the first class who survived?
  230.  
  231. **Hint:** Complete the code below by specify which variable you want to be arranged.
  232.  
  233. ```{r}
  234. TITANIC %>% filter (sex =="female" & survived =="Yes" & pclass == "1st") %>% arrange(survived, pclass, sex)
  235. ```
  236.  
  237. Arranging in descending order is achieved by specifying in the `arrange()` function `desc(var_name)`.
  238.  
  239. #### **7. (5 points)** What was the age of the oldest female (male) in the first class who survived?
  240.  
  241. YOUR CODE HERE:
  242. ```{r}
  243. TITANIC %>% filter (sex =="female" & survived =="Yes" & pclass == "1st") %>% arrange(desc(age))
  244.  
  245. TITANIC %>% filter (sex =="male" & survived =="Yes" & pclass == "1st") %>% arrange(desc(age))
  246. ```
  247.  
  248. **Answer:** The oldest male was Barkworth, Mr. Algernon Henry W & the oldest female was Cavendish, Mrs. Tyrell William
  249.  
  250. Oldest female in 1st class survived was 76 years of age.
  251.  
  252. Oldest male in 1st class survived was 80 years of age.
  253.  
  254. #### **g) (10 points)** Do the data suggest that the final hours aboard the Titanic were characterized by class warfare, male chivalry, some combination of both, or neither? Justify your answer based on computations above, or based on other explorations of the data.
  255.  
  256. **Hint:** Review and explain the exploratory graphs created by the code chunk. How they support you justification?
  257.  
  258. ```{r part extra}
  259.  
  260. TITANIC %>%  ggplot(aes(x = survived)) +
  261.   geom_bar(aes(fill = sex), stat = "count", position = "stack" ) +
  262.   theme_bw()
  263.  
  264. TITANIC %>%  ggplot(aes(x = survived)) +
  265.   geom_bar(aes(fill = pclass), stat = "count", position = "stack" ) +
  266.   theme_bw()
  267.  
  268. ```
  269.  
  270. Of those who did survive the ice berg sinking the Titanic, most of them were from the 3rd & 1st classes, so I don't think class played as big as a role in the survival of the people. And of the survived, they were overwhelmingly female which supports the male chivalry theory.
  271. #TITANIC %>%  ggplot(aes(x = survived)) +
  272.    geom_bar(aes(fill = sex), stat = "count", position = "stack" ) +
  273.    theme_bw()
  274. ## Task 1 (Extra Credit, 10 pts): Produce CLEAN data from the TITANIC data by removing all observation with `NA`
  275. Comment: In most of the code you used/wrote in **Task 1**, functions were called with argument `na.rm = TRUE`, instructing the `NA` values to be dropped for the computations.
  276. **part 1) (5 points)** Use the function `na.omit()`(or the `filter()`) function from `dplyr` package to create a **clean** data set that removes subjects if any observations on the subject are **unknown** Store the modified data frame in a data frame named `CLEAN`.  Run the function `dim()` on the data frame `CLEAN` to find the number of observations(rows) in the `CLEAN` data.  
  277. COMPLETE THE CODE HERE, uncomment necessary lines before running:
  278. ```{r part 1 extra_credit}
  279. CLEAN <- na.omit(TITANIC)
  280. #or
  281. #CLEAN <- TITANIC %>% filter(complete.cases(_ _ _))
  282. #print the dimensions
  283. dim(CLEAN)
  284. ```
  285. **part 2) (5 points)**  How many missing values in the data frame `TITANIC` are there?  How many rows of `TITANIC` have no missing values, one missing value, two missing values, and three missing values, respectively?  Note: the number of rows in `CLEAN` should agree with your answer for the number of rows in `TITANIC` that have no missing values.
  286. What are the cons of cleaning the data in the suggested way?
  287. Use the code, explain what it does.
  288. ```{r part 2 extra_credit}
  289. #get the number of missing values in columns
  290. colNAs<- colSums(is.na(TITANIC))
  291. (colNAs <- as.vector(colSums(is.na(TITANIC)))) # coerce to a vector
  292. rowNAs <- table(rowSums(is.na(TITANIC)))
  293. (rowNAs <- as.vector(table(rowSums(is.na(TITANIC))))) # coerce to a vector
  294. ```
  295. **Comment:** The missing values are for variables _ _ _ _ _ _.
  296. **Comment:** There are **`r rowNAs[1]`** rows with no missing values, **`r rowNAs[2]`** rows with 1 missing value, and **`r rowNAs[3]`** rows with 2 missing values.
  297. Comment how this align with the dimensions of your `CLEAN` data.
  298. > **Your comment:**
  299. **Good practice:** Save your customized data frame `CLEAN` in your working directory as a `*.csv` file using the function `write.csv()` using the argument `row.names = FALSE`.
  300. ```{r save}
  301. write.csv(CLEAN, file="TITANIC_CLEAN.csv", row.names=FALSE)
  302. ```
  303. ## Task 2: **This is problem 9 on page 197 in the text** as is.
  304. **Note:** This is **not guided** task, you have to write your own code from scratch!
  305. Use the CARS2004 data frame from the `PASWR2` package, which contains the numbers of cars per `1000` inhabitants (`cars`), the total number of known mortal accidents (`deaths`), and the country population/1000 (`population`) for the 25 member countries of the European Union for the year 2004.
  306. #### **a) (10 points)**
  307. YOUR CODE:
  308. ```{r}
  309. library()
  310. ```
  311. **Answer:**
  312. total.cars <- CARS2004[,'cars']
  313. total.cars
  314. proto.death.rate <- sum(CARS2004[,'deaths'])
  315. death.rate <- proto.death.rate / total.cars
  316. death.rate
  317.  
  318. #### **b) (10 points)**
  319. YOUR CODE:
  320. ```{r}
  321. death.rate <- CARS2004[,'deaths']
  322. barplot(death.rate, horiz=TRUE, names.arg=(CARS2004[,'country']))
  323. ```
  324. **Answer:**
  325. #### **c) (10 points)**
  326. YOUR CODE:
  327. ```{r}
  328. ```
  329. **Answer:** Hungarym Italy
  330. #### **d) (10 points)**
  331. YOUR CODE:
  332. ```{r}
  333. total.cars <- CARS2004[,'cars']
  334. ggplot(data = CARS2004, mapping = aes(x = total.cars, y = population)) +
  335.    geom_point()
  336. ```
  337. **Answer:** It seems countries with lower populations have a far greater number of cars
  338. #### **e) (10 points)**
  339. YOUR CODE:
  340. ```{r}
  341. ```
  342. **Answer:**
  343. #### **f) (10 points)**
  344. YOUR CODE:
  345. ```{r}
  346. death.rate <- CARS2004[,'deaths']
  347. ggplot(data = CARS2004, mapping = aes(x = total.cars, y = death.rate)) +
  348.    geom_point()
  349. ```
  350. **Answer:** On average, countries with more than 300 cars but less than 500 cars had a higher amount of deaths, above 100
  351. #### **g) (10 points)**
  352. YOUR CODE:
  353. ```{r}
  354. cor(x = total.cars, y = death.rate, method = "spearman")
  355. ```
  356. **Answer:** -0.4693878
  357. #### **h) (10 points)**
  358. YOUR CODE:
  359. ```{r}
  360. ggplot(data = CARS2004, mapping = aes(x = total.cars, y = death.rate)) +
  361.    geom_point() + scale_x_continuous(trans = 'log2') +
  362.    scale_y_continuous(trans = 'log2')
  363. ```
  364. **Answer:** The bulk of the data sees a higher death rate in direct correlation to a higher numbe rof cars
  365. ## Task 3 **(10 points)** Create a map with leaflet package, by completing the code below, that displays 5 UNC system schools using their geographic locations. Draw circles with radius proportionate to the school size using the `addCircles()` function.
  366. Try: E.g. `addCircles(weight = 1, radius = sqrt(UNC_schools$size)*100)`
  367. ```{r}
  368. set.seed(2020-02-01)
  369. library(leaflet)
  370. # The code below will create list of 5 UNC university data points with lat & lng, name and school size
  371. # Create data frame with column variables name (of UNC school), students (size), lat, lng)
  372. UNC_schools <- data.frame(name = c("NC State", "UNC Chapel Hill", "FSU", "ECU", "UNC Charlotte"),
  373.                        size = c(30130, 28136, 6000, 25990, 25990),
  374.                        lat = c(36.0373638, 35.9050353, 35.0726, 35.6073769, 35.2036325),
  375.                        lng = c(-79.0355663, -79.0477533, -78.8924739, -77.3671566, -80.8401145))
  376. # Use the data frame to draw map and circles proportional to the school sizes of the cities
  377. UNC_schools %>%
  378.  leaflet() %>%
  379.  addTiles() %>%
  380.  addCircles(weight = 1, radius = sqrt(UNC_schools$size)*100) # try adjusting the radius by multiplying with 50 instead of 100. What do you notice?
  381. ```
  382. ## Task 3 (Extra Credit, 5 pts): Add at least two more UNC schools, using their location data and enrollment numbers. Modify the code above and update the map for all schools included.
  383. UNC_schools <- data.frame(name = c("NC State", "UNC Chapel Hill", "FSU", "ECU", "UNC Charlotte", "UNC Greensboro", "UNC Wilmington"),
  384.                          size = c(30130, 28136, 6000, 25990, 25990, 15995, 661),
  385.                          lat = c(36.0373638, 35.9050353, 35.0726, 35.6073769, 35.2036325, 36.0683663932, 34.2226257762),
  386.                          lng = c(-79.0355663, -79.0477533, -78.8924739, -77.3671566, -80.8401145, -79.8068367726, -77.873491506))
  387. UNC_schools %>%
  388.    leaflet() %>%
  389.    addTiles() %>%
  390.    addCircles(weight = 1, radius = sqrt(UNC_schools$size)*25)
  391. ```{r, echo=FALSE}
  392. ## DO NOT CHANGE ANYTHING IN THIS CODE CHUNK!
  393. date()
  394. sessionInfo()
  395. R.Version()
  396. ```
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement