{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from pandas import Series, DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df = pd.read_csv('/root/hackerday/01_titanic/train.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read more about this function by running `pd.read_csv?`"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"To see the list of functions available inside `pandas`\n",
"\n",
"pd.[tab]\n",
"\n",
"----\n",
"\n",
"## Building a Classification Model using the Titanic Dataset\n",
"\n",
"#### Data Field details:\n",
"\n",
"PassengerId -- A numerical id assigned to each passenger.\n",
"Survived -- Whether the passenger survived (1) or didn't (0). We'll be making predictions for this column.\n",
"Pclass -- The class the passenger was in -- first class (1), second class (2), or third class (3).\n",
"Name -- The name of the passenger.\n",
"Sex -- The gender of the passenger -- male or female.\n",
"Age -- The age of the passenger. Fractional.\n",
"SibSp -- The number of siblings and spouses the passenger had on board.\n",
"Parch -- The number of parents and children the passenger had on board.\n",
"Ticket -- The ticket number of the passenger.\n",
"Fare -- How much the passenger paid for the ticket.\n",
"Cabin -- Which cabin the passenger was in.\n",
"Embarked -- Where the passenger boarded the Titanic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Accessing columns\n",
"print(titanic_df['Name'].head())  # one column\n",
"print(titanic_df[['Name', 'Pclass']].head())  # two columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Looking at the data\n",
"\n",
"- We know that women and children were more likely to survive. Thus, Age and Sex are probably good predictors. \n",
"\n",
"- It's also logical to think that passenger class might affect the outcome, as first class cabins were closer to the deck of the ship. \n",
"\n",
"- Fare is tied to passenger class and will probably be highly correlated with it, but might add some additional information. \n",
"\n",
"- The number of siblings and parents/children will probably be correlated with survival one way or the other, as either there are more people to help you, or more people to think about and try to save."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some more information from inspecting the data\n",
"\n",
"- Age and Cabin have missing values (see the count below)\n",
"\n",
"---"
]
},
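{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to confirm which columns have missing values is to count the nulls per column. This check is an addition to the original walkthrough."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Count missing values per column; on the standard Kaggle train set,\n",
"# Age and Cabin (and Embarked, slightly) should show non-zero counts\n",
"titanic_df.isnull().sum()"
]
},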
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Steps in the Data Science Process\n",
"\n",
"1. Data Inspection\n",
"2. Data Understanding\n",
"3. Data Preparation (cleaning - treat missing values, treat outliers and other bad data, normalization)\n",
"4. Split the Data\n",
"5. Visualize\n",
"6. Train the Model\n",
"7. Test the Model (deploy on unseen data)\n",
"8. Evaluate the performance of the Model\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Statistical Summaries using `describe()`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"titanic_df.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"# Numeric Series - Binary\n",
"titanic_df['Survived'].describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Numeric Series - Float\n",
"titanic_df['Fare'].describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Categorical Series\n",
"titanic_df['Sex'].describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"titanic_df['Pclass'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The `plot()` method to inspect data visually"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"# This enables plotting functionality and makes sure the plots\n",
"# are displayed inside the notebook. (%matplotlib inline is\n",
"# preferred over %pylab inline, which also floods the namespace\n",
"# with numpy names.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Fare'].plot(kind=\"hist\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 1\n",
"#\n",
"# MAKE A HISTOGRAM FOR AGE - Make a note of what you see\n",
"#\n",
"#"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Fare'].plot(kind='hist')\n",
"# Fare is an example of a SKEWED variable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print(titanic_df.Fare.mean())\n",
"print(titanic_df.Fare.median())"
]
},
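{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a right-skewed variable like Fare, the mean sits well above the median. A common optional remedy is a log transform; the sketch below uses `np.log1p`, which is an addition of this write-up rather than part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# log1p = log(1 + x), safe for zero fares; the transformed histogram\n",
"# should look much more symmetric than the raw one above\n",
"np.log1p(titanic_df['Fare']).plot(kind='hist')"
]
},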
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Age'].plot(kind=\"hist\")\n",
"# Age is an example of an (approximately) normally distributed variable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 2\n",
"#\n",
"# MAKE A BAR CHART FOR Parch - Make a note of what you see\n",
"#\n",
"#"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Parch'].value_counts().plot(kind='bar')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['SibSp'].value_counts().plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### The `groupby()` method to inspect relationships in the data\n",
"\n",
"- Pretty much the same as the SQL GROUP BY.\n",
"- You can also think of it as having the same functionality as pivot tables in Excel (see the `pivot_table` sketch below)."
]
},
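{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of the pivot-table analogy (an addition, not part of the original walkthrough), `pd.pivot_table` can express the same aggregation as a `groupby`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Equivalent to titanic_df[['Survived', 'Age', 'Fare']].groupby('Survived').mean()\n",
"pd.pivot_table(titanic_df, index='Survived', values=['Age', 'Fare'], aggfunc='mean')"
]
},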
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df[['Survived', 'Age', 'Fare']].groupby('Survived').mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df[['Pclass', 'Survived', 'Age', 'Fare']].groupby('Pclass').median()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 3\n",
"#\n",
"# Group by number of siblings and see how much they paid on average for a ticket\n",
"#\n",
"#"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df[['SibSp', 'Fare']].groupby('SibSp').mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 4\n",
"#\n",
"# What is the % of survivors by Pclass?\n",
"#\n",
"#"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Dot access and bracket access are equivalent\n",
"titanic_df.Survived.mean() == titanic_df['Survived'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"titanic_df[['Pclass', 'Survived']].groupby('Pclass').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### Handling Missing Data with `fillna()`\n",
"\n",
"As we can see, the PassengerId count is 891 whereas the Age count is 714. That means there are 177 rows in which Age is missing.\n",
"\n",
"So the data isn't perfectly clean, and we're going to have to clean it ourselves. \n",
"\n",
"Note:\n",
"\n",
"- We don't want to remove the rows with missing values, because more data helps us train a better algorithm. \n",
"- We also don't want to get rid of the whole column, as Age is probably fairly important to our analysis.\n",
"\n",
"There are many strategies for cleaning up missing data, but a simple one is to just fill in all the missing values with the median of all the values in the column. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"#### Fill Missing Values for the Age Variable (Numeric)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Age'].isnull().tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# We can get the median of the column by applying the median function to it.\n",
"print(titanic_df[\"Age\"].median())\n",
"print(titanic_df[\"Age\"].mean())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"Age_median = titanic_df[\"Age\"].median()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Using the fillna method to impute data\n",
"titanic_df[\"Age\"].fillna(Age_median, inplace=True)\n",
"# Alternate method (reassignment instead of inplace)\n",
"titanic_df[\"Age\"] = titanic_df[\"Age\"].fillna(Age_median)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"titanic_df['Age'].tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df.Age.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"#### Fill Missing Values for the Embarked Variable (Categorical)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 5\n",
"# Find some rows that have missing values for the variable Embarked\n",
"titanic_df[titanic_df.Embarked.isnull()].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# To find the mode ('top' in the summary)\n",
"titanic_df.Embarked.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 6\n",
"# Find the mode of Embarked\n",
"titanic_df.Embarked.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 7\n",
"# Use the .fillna() method to impute missing values with the mode, 'S'\n",
"titanic_df.Embarked.fillna('S', inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### Convert Categoricals into Binary Variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df.loc[titanic_df['Sex'] == 'male', 'Sex'] = 0\n",
"titanic_df.loc[titanic_df['Sex'] == 'female', 'Sex'] = 1"
]
},
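{
"cell_type": "markdown",
"metadata": {},
"source": [
"An aside (not in the original walkthrough): for categoricals with more than two levels, one-hot encoding via `pd.get_dummies` is often preferable to assigning arbitrary integer codes, because it doesn't impose an ordering on the categories."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# One-hot encode Embarked into Embarked_C / Embarked_Q / Embarked_S columns.\n",
"# Shown for illustration only; the walkthrough below sticks to integer codes.\n",
"pd.get_dummies(titanic_df['Embarked'], prefix='Embarked').head()"
]
},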
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Sex'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 8\n",
"#\n",
"# Convert Embarked into a numeric variable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df.Embarked.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Convert \"S\" to 0, \"C\" to 1 and \"Q\" to 2 in the Embarked column\n",
"titanic_df.loc[titanic_df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"titanic_df.loc[titanic_df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"titanic_df.loc[titanic_df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
"\n",
"# Print the unique values of the Embarked column\n",
"print(titanic_df[\"Embarked\"].unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Essentials of Modeling (Overfitting and Cross Validation)\n",
"\n",
"> The aim of all machine learning is generalization.\n",
"\n",
"We want to train the algorithm on different data than we make predictions on. This is critical if we want to avoid overfitting. Overfitting is what happens when a model fits itself to \"noise\", not signal. Every dataset has its own quirks that don't exist in the full population. For example, if I asked you to predict the top speed of a car from its horsepower and other characteristics, and gave you a dataset that happened to contain cars with unusually high top speeds, you would create a model that overstated speed. The way to figure out whether your model is doing this is to evaluate its performance on data it hasn't been trained on.\n",
"\n",
"Every machine learning algorithm can overfit, although some (like linear regression) are much less prone to it. If you evaluate your algorithm on the same dataset that you train it on, it's impossible to know whether it's performing well because it overfit itself to the noise, or because it actually is a good algorithm.\n",
"\n",
"Luckily, cross validation is a simple way to detect overfitting. To cross validate, you split your data into some number of parts (or \"folds\"). Let's use 3 as an example. You then do this:\n",
"* Combine the first two parts, train a model, make predictions on the third.\n",
"\n",
"* Combine the first and third parts, train a model, make predictions on the second.\n",
"\n",
"* Combine the second and third parts, train a model, make predictions on the first.\n",
"\n",
"This way, we generate predictions for the whole dataset without ever evaluating accuracy on the same data we train our model on. (A manual sketch of this 3-fold procedure follows below.)"
]
},
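{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of the 3-fold procedure just described, written out by hand with `KFold` so you can see what the helper function will later do for us. The feature list and the use of `sklearn.model_selection` are assumptions of this sketch, not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import KFold\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"features = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\"]\n",
"X = titanic_df[features].values.astype(float)\n",
"y = titanic_df[\"Survived\"].values\n",
"\n",
"fold_scores = []\n",
"for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=1).split(X):\n",
"    model = LogisticRegression()\n",
"    model.fit(X[train_idx], y[train_idx])  # train on two folds\n",
"    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # score on the held-out fold\n",
"\n",
"print(fold_scores, np.mean(fold_scores))"
]
},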
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df.columns.values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# The columns we'll use to predict the target\n",
"predictors_dim = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short note on using scikit-learn:\n",
"\n",
"- There are model families from which we import estimators\n",
"- Declaring an estimator object exposes its methods to us\n",
"- These include `fit`, `transform` and `predict`\n",
"- Parameters are defined when declaring the estimator object (see the sketch below)"
]
},
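{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that estimator pattern (hyperparameters in the constructor, then `fit`/`predict`); the particular estimator and parameter shown are just an example added here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"est = LogisticRegression(C=1.0, random_state=1)  # parameters set at construction\n",
"est.fit(titanic_df[predictors_dim], titanic_df[\"Survived\"])  # learn from data\n",
"est.predict(titanic_df[predictors_dim][:5])  # predict on (here: the same) data"
]
},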
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic Regression\n",
"\n",
"One good way to think of logistic regression is that it takes the output of a linear regression and maps it so that it lies between 0 and 1. We do this with the logistic (sigmoid) function: passing any value through it maps that value into the (0, 1) interval by 'squeezing' the extreme values. This is perfect for us, because we only care about two outcomes.\n",
"\n",
"Sklearn has a class for logistic regression that we can use. We'll also make things easier by using an sklearn helper function to do all of our cross validation and evaluation for us."
]
},
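{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the 'squeezing' concrete, here is a quick plot of the sigmoid, sigma(z) = 1 / (1 + e^(-z)). This illustration is an addition, not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"z = np.linspace(-6, 6, 200)\n",
"sigmoid = 1 / (1 + np.exp(-z))  # maps any real z into (0, 1)\n",
"plt.plot(z, sigmoid)\n",
"plt.xlabel('linear model output z')\n",
"plt.ylabel('P(Survived = 1)')"
]
},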
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# model_selection replaces the sklearn.cross_validation module,\n",
"# which was deprecated in 0.18 and later removed\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# Initialize our algorithm\n",
"algo2 = LogisticRegression(random_state=1)\n",
"\n",
"# Compute the accuracy score for all the cross validation folds\n",
"# (much simpler than doing it by hand!)\n",
"scores = cross_val_score(algo2, titanic_df[predictors_dim], titanic_df[\"Survived\"], cv=5)\n",
"\n",
"# Take the mean of the scores (because we have one for each fold)\n",
"print(scores.mean())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Type algo2. and press Tab to explore the estimator's methods and attributes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Support Vector Machines\n",
"from sklearn.svm import SVC\n",
"svc_obj = SVC(kernel='linear')\n",
"svc_scores = cross_val_score(svc_obj, titanic_df[predictors_dim], titanic_df[\"Survived\"], cv=5)\n",
"print(svc_scores.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Let's process the test set\n",
"\n",
"Process titanic_test the same way we processed titanic_df.\n",
"\n",
"This involves:\n",
"\n",
"1. Replace the missing values in the \"Age\" column with the median age from the train set. It has to be the exact same value we used to fill the training set (it can't be the median of the test set, because that is different). Use titanic_df[\"Age\"].median() to find it.\n",
"2. Replace any \"male\" values in the Sex column with 0, and any \"female\" values with 1.\n",
"3. Fill any missing values in the Embarked column with \"S\".\n",
"4. In the Embarked column, replace \"S\" with 0, \"C\" with 1, and \"Q\" with 2.\n",
"5. Replace the missing value in the Fare column, using .fillna with the median of the column in the test set. There are no missing values in the Fare column of the training set, but test sets can sometimes be different."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_test = pd.read_csv('/root/hackerday/01_titanic/test.csv')\n",
"titanic_test[\"Age\"] = titanic_test[\"Age\"].fillna(titanic_df[\"Age\"].median())\n",
"titanic_test[\"Fare\"] = titanic_test[\"Fare\"].fillna(titanic_test[\"Fare\"].median())\n",
"titanic_test.loc[titanic_test[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
"titanic_test.loc[titanic_test[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
"titanic_test[\"Embarked\"] = titanic_test[\"Embarked\"].fillna(\"S\")\n",
"\n",
"titanic_test.loc[titanic_test[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"titanic_test.loc[titanic_test[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"titanic_test.loc[titanic_test[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Initialize the algorithm class\n",
"alg = LogisticRegression(random_state=1)\n",
"\n",
"# Train the algorithm using all the training data\n",
"alg.fit(titanic_df[predictors_dim], titanic_df[\"Survived\"])\n",
"\n",
"# Make predictions using the test set\n",
"predictions = alg.predict(titanic_test[predictors_dim])\n",
"\n",
"# Create a new dataframe with the id and the prediction\n",
"submission = pd.DataFrame({\n",
"    \"PassengerId\": titanic_test[\"PassengerId\"],\n",
"    \"Survived\": predictions\n",
"})\n",
"\n",
"print(submission.head())"
]
},
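{
"cell_type": "markdown",
"metadata": {},
"source": [
"To actually submit to Kaggle you would write the dataframe to CSV; the filename below is just an assumption for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# index=False keeps the row index out of the file, so it contains\n",
"# exactly two columns: PassengerId and Survived\n",
"submission.to_csv('submission.csv', index=False)"
]
},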
{
"cell_type": "code",
"execution_count": 165,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>892</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>893</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>894</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>895</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>896</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   PassengerId  Survived\n",
"0          892         0\n",
"1          893         0\n",
"2          894         0\n",
"3          895         0\n",
"4          896         1"
]
},
"execution_count": 165,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"submission.head()"
]
},
{
"cell_type": "code",
"execution_count": 166,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>892</td>\n",
" <td>3</td>\n",
" <td>Kelly, Mr. James</td>\n",
" <td>0</td>\n",
" <td>34.5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330911</td>\n",
" <td>7.8292</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   PassengerId  Pclass              Name  Sex   Age  SibSp  Parch  Ticket  \\\n",
"0          892       3  Kelly, Mr. James    0  34.5      0      0  330911   \n",
"\n",
"     Fare Cabin  Embarked  \n",
"0  7.8292   NaN         2  "
]
},
"execution_count": 166,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic_test.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 163,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0.99196504,  2.60300021, -0.0341005 , -0.30937218, -0.07846287,\n",
"         0.00329483,  0.23985491]])"
]
},
"execution_count": 163,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alg.coef_"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']"
]
},
"execution_count": 164,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictors_dim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fitted coefficients give the log-odds as a linear combination of the predictors:\n",
"\n",
"y = -0.99 * Pclass + 2.60 * Sex - 0.03 * Age - ...."
]
},
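{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small added convenience: pairing each coefficient with its predictor name makes the equation above easier to read off."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Positive coefficients push the log-odds (and hence the survival\n",
"# probability) up; negative ones push it down\n",
"list(zip(predictors_dim, alg.coef_[0]))"
]
},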
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}