Untitled

a guest
Apr 23rd, 2019
  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "metadata": {},
  6. "source": [
  7. "# Sean Dunn IST 707 Week 2 BLT 2.6 "
  8. ]
  9. },
  10. {
  11. "cell_type": "markdown",
  12. "metadata": {},
  13. "source": [
  14. "## Question 1: How many values are missing for each variable?\n"
  15. ]
  16. },
  17. {
  18. "cell_type": "code",
  19. "execution_count": 2,
  20. "metadata": {},
  21. "outputs": [],
  22. "source": [
  23. "# Read the titanic data \n",
  24. "titanic <- read.csv(\"train.csv\", na.strings = c(\"\"))"
  25. ]
  26. },
  27. {
  28. "cell_type": "code",
  29. "execution_count": 12,
  30. "metadata": {},
  31. "outputs": [
  32. {
  33. "name": "stdout",
  34. "output_type": "stream",
  35. "text": [
  36. " [1] \"PassengerId\" \"Survived\" \"Pclass\" \"Name\" \"Sex\" \n",
  37. " [6] \"Age\" \"SibSp\" \"Parch\" \"Ticket\" \"Fare\" \n",
  38. "[11] \"Cabin\" \"Embarked\" \n"
  39. ]
  40. }
  41. ],
  42. "source": [
  43. "print(colnames(titanic)) # One way to list the variable names"
  44. ]
  45. },
  46. {
  47. "cell_type": "code",
  48. "execution_count": 13,
  49. "metadata": {},
  50. "outputs": [
  51. {
  52. "name": "stdout",
  53. "output_type": "stream",
  54. "text": [
  55. "[1] \"PassengerId\"\n",
  56. "[1] \"Survived\"\n",
  57. "[1] \"Pclass\"\n",
  58. "[1] \"Name\"\n",
  59. "[1] \"Sex\"\n",
  60. "[1] \"Age\"\n",
  61. "[1] \"SibSp\"\n",
  62. "[1] \"Parch\"\n",
  63. "[1] \"Ticket\"\n",
  64. "[1] \"Fare\"\n",
  65. "[1] \"Cabin\"\n",
  66. "[1] \"Embarked\"\n"
  67. ]
  68. }
  69. ],
  70. "source": [
  71. "## Another way: loop over the columns and print each variable name\n",
  72. "for (i in seq_len(ncol(titanic))) {\n",
  73. " print(colnames(titanic)[i])\n",
  74. "}\n"
  75. ]
  76. },
  77. {
  78. "cell_type": "code",
  79. "execution_count": 14,
  80. "metadata": {},
  81. "outputs": [
  82. {
  83. "data": {
  84. "text/html": [
  85. "0"
  86. ],
  87. "text/latex": [
  88. "0"
  89. ],
  90. "text/markdown": [
  91. "0"
  92. ],
  93. "text/plain": [
  94. "[1] 0"
  95. ]
  96. },
  97. "metadata": {},
  98. "output_type": "display_data"
  99. },
  100. {
  101. "data": {
  102. "text/html": [
  103. "0"
  104. ],
  105. "text/latex": [
  106. "0"
  107. ],
  108. "text/markdown": [
  109. "0"
  110. ],
  111. "text/plain": [
  112. "[1] 0"
  113. ]
  114. },
  115. "metadata": {},
  116. "output_type": "display_data"
  117. },
  118. {
  119. "data": {
  120. "text/html": [
  121. "0"
  122. ],
  123. "text/latex": [
  124. "0"
  125. ],
  126. "text/markdown": [
  127. "0"
  128. ],
  129. "text/plain": [
  130. "[1] 0"
  131. ]
  132. },
  133. "metadata": {},
  134. "output_type": "display_data"
  135. },
  136. {
  137. "data": {
  138. "text/html": [
  139. "0"
  140. ],
  141. "text/latex": [
  142. "0"
  143. ],
  144. "text/markdown": [
  145. "0"
  146. ],
  147. "text/plain": [
  148. "[1] 0"
  149. ]
  150. },
  151. "metadata": {},
  152. "output_type": "display_data"
  153. },
  154. {
  155. "data": {
  156. "text/html": [
  157. "0"
  158. ],
  159. "text/latex": [
  160. "0"
  161. ],
  162. "text/markdown": [
  163. "0"
  164. ],
  165. "text/plain": [
  166. "[1] 0"
  167. ]
  168. },
  169. "metadata": {},
  170. "output_type": "display_data"
  171. },
  172. {
  173. "data": {
  174. "text/html": [
  175. "0"
  176. ],
  177. "text/latex": [
  178. "0"
  179. ],
  180. "text/markdown": [
  181. "0"
  182. ],
  183. "text/plain": [
  184. "[1] 0"
  185. ]
  186. },
  187. "metadata": {},
  188. "output_type": "display_data"
  189. },
  190. {
  191. "data": {
  192. "text/html": [
  193. "0"
  194. ],
  195. "text/latex": [
  196. "0"
  197. ],
  198. "text/markdown": [
  199. "0"
  200. ],
  201. "text/plain": [
  202. "[1] 0"
  203. ]
  204. },
  205. "metadata": {},
  206. "output_type": "display_data"
  207. },
  208. {
  209. "data": {
  210. "text/html": [
  211. "0"
  212. ],
  213. "text/latex": [
  214. "0"
  215. ],
  216. "text/markdown": [
  217. "0"
  218. ],
  219. "text/plain": [
  220. "[1] 0"
  221. ]
  222. },
  223. "metadata": {},
  224. "output_type": "display_data"
  225. },
  226. {
  227. "data": {
  228. "text/html": [
  229. "0"
  230. ],
  231. "text/latex": [
  232. "0"
  233. ],
  234. "text/markdown": [
  235. "0"
  236. ],
  237. "text/plain": [
  238. "[1] 0"
  239. ]
  240. },
  241. "metadata": {},
  242. "output_type": "display_data"
  243. },
  244. {
  245. "data": {
  246. "text/html": [
  247. "0"
  248. ],
  249. "text/latex": [
  250. "0"
  251. ],
  252. "text/markdown": [
  253. "0"
  254. ],
  255. "text/plain": [
  256. "[1] 0"
  257. ]
  258. },
  259. "metadata": {},
  260. "output_type": "display_data"
  261. },
  262. {
  263. "data": {
  264. "text/html": [
  265. "687"
  266. ],
  267. "text/latex": [
  268. "687"
  269. ],
  270. "text/markdown": [
  271. "687"
  272. ],
  273. "text/plain": [
  274. "[1] 687"
  275. ]
  276. },
  277. "metadata": {},
  278. "output_type": "display_data"
  279. },
  280. {
  281. "data": {
  282. "text/html": [
  283. "2"
  284. ],
  285. "text/latex": [
  286. "2"
  287. ],
  288. "text/markdown": [
  289. "2"
  290. ],
  291. "text/plain": [
  292. "[1] 2"
  293. ]
  294. },
  295. "metadata": {},
  296. "output_type": "display_data"
  297. }
  298. ],
  299. "source": [
  300. "# Count the missing values in each variable \n",
  301. "\n",
  302. "length(which(is.na(titanic$PassengerId)))\n",
  303. "length(which(is.na(titanic$Survived)))\n",
  304. "length(which(is.na(titanic$Pclass)))\n",
  305. "length(which(is.na(titanic$Name)))\n",
  306. "length(which(is.na(titanic$Sex)))\n",
  307. "length(which(is.na(titanic$Age)))\n",
  308. "length(which(is.na(titanic$SibSp)))\n",
  309. "length(which(is.na(titanic$Parch)))\n",
  310. "length(which(is.na(titanic$Ticket)))\n",
  311. "length(which(is.na(titanic$Fare)))\n",
  312. "length(which(is.na(titanic$Cabin)))\n",
  313. "length(which(is.na(titanic$Embarked)))"
  314. ]
  315. },
  316. {
  317. "cell_type": "code",
  318. "execution_count": 15,
  319. "metadata": {},
  320. "outputs": [
  321. {
  322. "name": "stdout",
  323. "output_type": "stream",
  324. "text": [
  325. "[1] 0\n",
  326. "[1] 0\n",
  327. "[1] 0\n",
  328. "[1] 0\n",
  329. "[1] 0\n",
  330. "[1] 0\n",
  331. "[1] 0\n",
  332. "[1] 0\n",
  333. "[1] 0\n",
  334. "[1] 0\n",
  335. "[1] 687\n",
  336. "[1] 2\n"
  337. ]
  338. }
  339. ],
  340. "source": [
  341. "## Alternate method: loop over the columns and count the missing values in each\n",
  342. "for (i in seq_len(ncol(titanic))) {\n",
  343. " print(sum(is.na(titanic[[i]])))\n",
  344. "}\n"
  345. ]
  346. },
  347. {
  348. "cell_type": "markdown",
  349. "metadata": {},
  350. "source": [
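  351. "Every variable except Age, Cabin, and Embarked has 0 missing values. The outputs above report 0 for Age, most likely because those cells were re-run after the mean imputation in Question 2 (compare the execution counts); before imputation, Age had 177 missing values.\n",
  352. "\n",
  353. " - **Age**: 177 missing values\n",
  354. " - **Cabin**: 687 missing values\n",
  355. " - **Embarked**: 2 missing values"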
  356. ]
  357. },
  358. {
  359. "cell_type": "markdown",
  360. "metadata": {},
  361. "source": [
  362. "## Question 2: How do you handle the missing values in each variable?\n",
  363. " "
  364. ]
  365. },
  366. {
  367. "cell_type": "code",
  368. "execution_count": 8,
  369. "metadata": {},
  370. "outputs": [
  371. {
  372. "data": {
  373. "text/html": [
  374. "0"
  375. ],
  376. "text/latex": [
  377. "0"
  378. ],
  379. "text/markdown": [
  380. "0"
  381. ],
  382. "text/plain": [
  383. "[1] 0"
  384. ]
  385. },
  386. "metadata": {},
  387. "output_type": "display_data"
  388. }
  389. ],
  390. "source": [
  391. "# For Age, replace each missing value with the mean of the observed ages \n",
  392. "\n",
  393. "titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm = TRUE)\n",
  394. "length(which(is.na(titanic$Age))) # Confirm that Age has no missing values left"
  395. ]
  396. },
  397. {
  398. "cell_type": "markdown",
  399. "metadata": {},
  400. "source": [
  401. "As we can see, Age now has **zero** missing values."
  402. ]
  403. },
  404. {
  405. "cell_type": "code",
  406. "execution_count": 21,
  407. "metadata": {},
  408. "outputs": [
  409. {
  410. "data": {
  411. "text/html": [
  412. "<table>\n",
  413. "<thead><tr><th scope=col>Pclass</th><th scope=col>Sex</th><th scope=col>Age</th><th scope=col>SibSp</th><th scope=col>Parch</th><th scope=col>Ticket</th><th scope=col>Fare</th><th scope=col>Embarked</th></tr></thead>\n",
  414. "<tbody>\n",
  415. "\t<tr><td>3 </td><td>male </td><td>22 </td><td>1 </td><td>0 </td><td>A/5 21171 </td><td> 7.2500 </td><td>S </td></tr>\n",
  416. "\t<tr><td>1 </td><td>female </td><td>38 </td><td>1 </td><td>0 </td><td>PC 17599 </td><td>71.2833 </td><td>C </td></tr>\n",
  417. "\t<tr><td>3 </td><td>female </td><td>26 </td><td>0 </td><td>0 </td><td>STON/O2. 3101282</td><td> 7.9250 </td><td>S </td></tr>\n",
  418. "</tbody>\n",
  419. "</table>\n"
  420. ],
  421. "text/latex": [
  422. "\\begin{tabular}{r|llllllll}\n",
  423. " Pclass & Sex & Age & SibSp & Parch & Ticket & Fare & Embarked\\\\\n",
  424. "\\hline\n",
  425. "\t 3 & male & 22 & 1 & 0 & A/5 21171 & 7.2500 & S \\\\\n",
  426. "\t 1 & female & 38 & 1 & 0 & PC 17599 & 71.2833 & C \\\\\n",
  427. "\t 3 & female & 26 & 0 & 0 & STON/O2. 3101282 & 7.9250 & S \\\\\n",
  428. "\\end{tabular}\n"
  429. ],
  430. "text/markdown": [
  431. "\n",
  432. "| Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | \n",
  433. "|---|---|---|---|---|---|---|---|\n",
  434. "| 3 | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | S | \n",
  435. "| 1 | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C | \n",
  436. "| 3 | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | \n",
  437. "\n",
  438. "\n"
  439. ],
  440. "text/plain": [
  441. " Pclass Sex Age SibSp Parch Ticket Fare Embarked\n",
  442. "1 3 male 22 1 0 A/5 21171 7.2500 S \n",
  443. "2 1 female 38 1 0 PC 17599 71.2833 C \n",
  444. "3 3 female 26 0 0 STON/O2. 3101282 7.9250 S "
  445. ]
  446. },
  447. "metadata": {},
  448. "output_type": "display_data"
  449. }
  450. ],
  451. "source": [
  452. "# Given the large number of values missing from Cabin, we can simply leave Cabin out \n",
  453. "# of our analysis. We can also drop other variables that we may not need to examine\n",
  454. "newVars <- c(\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Ticket\", \"Fare\", \"Embarked\")\n",
  455. "titanic_new <- titanic[newVars]\n",
  456. "head(titanic_new,3)"
  457. ]
  458. },
  459. {
  460. "cell_type": "markdown",
  461. "metadata": {},
  462. "source": [
  463. "Embarked is the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). With only two missing values, we can choose to ignore them."
  464. ]
  465. }
  466. ],
  467. "metadata": {
  468. "kernelspec": {
  469. "display_name": "R",
  470. "language": "R",
  471. "name": "ir"
  472. },
  473. "language_info": {
  474. "codemirror_mode": "r",
  475. "file_extension": ".r",
  476. "mimetype": "text/x-r-source",
  477. "name": "R",
  478. "pygments_lexer": "r",
  479. "version": "3.5.1"
  480. }
  481. },
  482. "nbformat": 4,
  483. "nbformat_minor": 2
  484. }
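
The missing-value counts, mean imputation, and column selection in the notebook above can also be sketched in a few vectorised lines. This is a minimal illustration rather than the author's code: the `toy` data frame and its values are made up to stand in for `train.csv`, and dropping the rows with a missing `Embarked` is only one assumed reading of "ignore those missing values".

```r
# Hypothetical toy data standing in for train.csv (values are illustrative only).
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c(NA, "C85", NA, NA),
  Embarked = c("S", "C", NA, "S"),
  stringsAsFactors = FALSE
)

# Count the missing values in every column with one vectorised call.
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Replace missing ages with the mean of the observed ages, as the notebook does.
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)

# Leave out the sparse Cabin column and, as an assumption, drop the rows
# where Embarked is missing.
toy_clean <- toy[!is.na(toy$Embarked), c("Age", "Embarked")]
print(toy_clean)
```

On the real data, `colSums(is.na(titanic))` would return the counts for all twelve variables as a single named vector, which avoids both the repeated `length(which(...))` calls and the explicit loop.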