{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from pandas import Series, DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df = pd.read_csv('/root/hackerday/01_titanic/train.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read more about this function by running `pd.read_csv?`"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"To see the list of functions available inside `pandas`\n",
"\n",
"pd.[tab]\n",
"\n",
"----\n",
"\n",
"## Building a Classification Model using the Titanic Dataset\n",
"\n",
"#### Data Field details:\n",
"\n",
"PassengerId -- A numerical id assigned to each passenger.\n",
"Survived -- Whether the passenger survived (1) or didn't (0). We'll be making predictions for this column.\n",
"Pclass -- The class the passenger was in -- first class (1), second class (2), or third class (3).\n",
"Name -- The name of the passenger.\n",
"Sex -- The gender of the passenger -- male or female.\n",
"Age -- The age of the passenger. Fractional.\n",
"SibSp -- The number of siblings and spouses the passenger had on board.\n",
"Parch -- The number of parents and children the passenger had on board.\n",
"Ticket -- The ticket number of the passenger.\n",
"Fare -- How much the passenger paid for the ticket.\n",
"Cabin -- Which cabin the passenger was in.\n",
"Embarked -- Where the passenger boarded the Titanic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Accessing columns\n",
"print(titanic_df['Name'].head())  # one column\n",
"print(titanic_df[['Name', 'Pclass']].head())  # two columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Looking at the data\n",
"\n",
"- We know that women and children were more likely to survive. Thus, Age and Sex are probably good predictors. \n",
"\n",
"- It's also logical to think that passenger class might affect the outcome, as first class cabins were closer to the deck of the ship. \n",
"\n",
"- Fare is tied to passenger class and will probably be highly correlated with it, but might add some additional information. \n",
"\n",
"- The number of siblings and parents/children will probably be correlated with survival one way or the other, as either there are more people to help you, or more people to think about and try to save."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some more information from inspecting the data\n",
"\n",
"- Age and Cabin have missing values (see the count below)\n",
"\n",
"---"
]
},
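{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to confirm which columns have missing values is to count the nulls per column. This check is an addition to the original walkthrough."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Count missing values per column; on the standard Kaggle train set,\n",
"# Age and Cabin (and Embarked, slightly) should show non-zero counts\n",
"titanic_df.isnull().sum()"
]
},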
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Steps in the Data Science Process\n",
"\n",
"1. Data Inspection\n",
"2. Data Understanding\n",
"3. Data Preparation (cleaning - treat missing values, treat outliers and other bad data, normalization)\n",
"4. Split the Data\n",
"5. Visualize\n",
"6. Train the Model\n",
"7. Test the Model (deploy on unseen data)\n",
"8. Evaluate the performance of the Model\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Statistical Summaries using `describe()`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"titanic_df.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"# Numeric Series - Binary\n",
"titanic_df['Survived'].describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Numeric Series - Float\n",
"titanic_df['Fare'].describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Categorical Series\n",
"titanic_df['Sex'].describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"titanic_df['Pclass'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The `plot()` method to inspect data visually"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"# This enables plotting functionality and makes sure the plots\n",
"# are displayed inside the notebook. (%matplotlib inline is\n",
"# preferred over %pylab inline, which also floods the namespace\n",
"# with numpy names.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Fare'].plot(kind=\"hist\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 1\n",
"#\n",
"# MAKE A HISTOGRAM FOR AGE - Make a note of what you see\n",
"#\n",
"#"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Fare'].plot(kind='hist')\n",
"# Fare is an example of a SKEWED variable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print(titanic_df.Fare.mean())\n",
"print(titanic_df.Fare.median())"
]
},
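{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a right-skewed variable like Fare, the mean sits well above the median. A common optional remedy is a log transform; the sketch below uses `np.log1p`, which is an addition of this write-up rather than part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# log1p = log(1 + x), safe for zero fares; the transformed histogram\n",
"# should look much more symmetric than the raw one above\n",
"np.log1p(titanic_df['Fare']).plot(kind='hist')"
]
},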
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Age'].plot(kind=\"hist\")\n",
"# Age is an example of an (approximately) normally distributed variable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 2\n",
"#\n",
"# MAKE A BAR CHART FOR Parch - Make a note of what you see\n",
"#\n",
"#"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Parch'].value_counts().plot(kind='bar')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['SibSp'].value_counts().plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### The `groupby()` method to inspect relationships in the data\n",
"\n",
"- Pretty much the same as the SQL GROUP BY.\n",
"- You can also think of it as having the same functionality as pivot tables in Excel (see the `pivot_table` sketch below)."
]
},
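{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of the pivot-table analogy (an addition, not part of the original walkthrough), `pd.pivot_table` can express the same aggregation as a `groupby`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Equivalent to titanic_df[['Survived', 'Age', 'Fare']].groupby('Survived').mean()\n",
"pd.pivot_table(titanic_df, index='Survived', values=['Age', 'Fare'], aggfunc='mean')"
]
},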
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df[['Survived', 'Age', 'Fare']].groupby('Survived').mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df[['Pclass', 'Survived', 'Age', 'Fare']].groupby('Pclass').median()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 3\n",
"#\n",
"# Group by number of siblings and see how much they paid on average for a ticket\n",
"#\n",
"#"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df[['SibSp', 'Fare']].groupby('SibSp').mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 4\n",
"#\n",
"# What is the % of survivors by Pclass?\n",
"#\n",
"#"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Dot access and bracket access are equivalent\n",
"titanic_df.Survived.mean() == titanic_df['Survived'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"titanic_df[['Pclass', 'Survived']].groupby('Pclass').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### Handling Missing Data with `fillna()`\n",
"\n",
"As we can see, the PassengerId count is 891 whereas the Age count is 714. That means there are 177 rows in which Age is missing.\n",
"\n",
"So the data isn't perfectly clean, and we're going to have to clean it ourselves. \n",
"\n",
"Note:\n",
"\n",
"- We don't want to remove the rows with missing values, because more data helps us train a better algorithm. \n",
"- We also don't want to get rid of the whole column, as Age is probably fairly important to our analysis.\n",
"\n",
"There are many strategies for cleaning up missing data, but a simple one is to just fill in all the missing values with the median of all the values in the column. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"#### Fill Missing Values for the Age Variable (Numeric)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Age'].isnull().tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# We can get the median of the column by applying the median function to it.\n",
"print(titanic_df[\"Age\"].median())\n",
"print(titanic_df[\"Age\"].mean())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"Age_median = titanic_df[\"Age\"].median()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Using the fillna method to impute data\n",
"titanic_df[\"Age\"].fillna(Age_median, inplace=True)\n",
"# Alternate method (reassignment instead of inplace)\n",
"titanic_df[\"Age\"] = titanic_df[\"Age\"].fillna(Age_median)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"titanic_df['Age'].tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df.Age.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"#### Fill Missing Values for the Embarked Variable (Categorical)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 5\n",
"# Find some rows that have missing values for the variable Embarked\n",
"titanic_df[titanic_df.Embarked.isnull()].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# To find the mode ('top' in the summary)\n",
"titanic_df.Embarked.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 6\n",
"# Find the mode of Embarked\n",
"titanic_df.Embarked.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 7\n",
"# Use the .fillna() method to impute missing values with the mode, 'S'\n",
"titanic_df.Embarked.fillna('S', inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### Convert Categoricals into Binary Variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df.loc[titanic_df['Sex'] == 'male', 'Sex'] = 0\n",
"titanic_df.loc[titanic_df['Sex'] == 'female', 'Sex'] = 1"
]
},
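{
"cell_type": "markdown",
"metadata": {},
"source": [
"An aside (not in the original walkthrough): for categoricals with more than two levels, one-hot encoding via `pd.get_dummies` is often preferable to assigning arbitrary integer codes, because it doesn't impose an ordering on the categories."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# One-hot encode Embarked into Embarked_C / Embarked_Q / Embarked_S columns.\n",
"# Shown for illustration only; the walkthrough below sticks to integer codes.\n",
"pd.get_dummies(titanic_df['Embarked'], prefix='Embarked').head()"
]
},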
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df['Sex'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Task 8\n",
"#\n",
"# Convert Embarked into a numeric variable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df.Embarked.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Convert \"S\" to 0, \"C\" to 1 and \"Q\" to 2 in the Embarked column\n",
"titanic_df.loc[titanic_df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"titanic_df.loc[titanic_df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"titanic_df.loc[titanic_df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
"\n",
"# Print the unique values of the Embarked column\n",
"print(titanic_df[\"Embarked\"].unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Essentials of Modeling (Overfitting and Cross Validation)\n",
"\n",
"> The aim of all machine learning is generalization.\n",
"\n",
"We want to train the algorithm on different data than we make predictions on. This is critical if we want to avoid overfitting. Overfitting is what happens when a model fits itself to \"noise\", not signal. Every dataset has its own quirks that don't exist in the full population. For example, if I asked you to predict the top speed of a car from its horsepower and other characteristics, and gave you a dataset that happened to contain cars with unusually high top speeds, you would create a model that overstated speed. The way to figure out whether your model is doing this is to evaluate its performance on data it hasn't been trained on.\n",
"\n",
"Every machine learning algorithm can overfit, although some (like linear regression) are much less prone to it. If you evaluate your algorithm on the same dataset that you train it on, it's impossible to know whether it's performing well because it overfit itself to the noise, or because it actually is a good algorithm.\n",
"\n",
"Luckily, cross validation is a simple way to detect overfitting. To cross validate, you split your data into some number of parts (or \"folds\"). Let's use 3 as an example. You then do this:\n",
"* Combine the first two parts, train a model, make predictions on the third.\n",
"\n",
"* Combine the first and third parts, train a model, make predictions on the second.\n",
"\n",
"* Combine the second and third parts, train a model, make predictions on the first.\n",
"\n",
"This way, we generate predictions for the whole dataset without ever evaluating accuracy on the same data we train our model on. (A manual sketch of this 3-fold procedure follows below.)"
]
},
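{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of the 3-fold procedure just described, written out by hand with `KFold` so you can see what the helper function will later do for us. The feature list and the use of `sklearn.model_selection` are assumptions of this sketch, not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import KFold\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"features = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\"]\n",
"X = titanic_df[features].values.astype(float)\n",
"y = titanic_df[\"Survived\"].values\n",
"\n",
"fold_scores = []\n",
"for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=1).split(X):\n",
"    model = LogisticRegression()\n",
"    model.fit(X[train_idx], y[train_idx])  # train on two folds\n",
"    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # score on the held-out fold\n",
"\n",
"print(fold_scores, np.mean(fold_scores))"
]
},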
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_df.columns.values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# The columns we'll use to predict the target\n",
"predictors_dim = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short note on using scikit-learn:\n",
"\n",
"- There are model families from which we import estimators\n",
"- Declaring an estimator object exposes its methods to us\n",
"- These include `fit`, `transform` and `predict`\n",
"- Parameters are defined when declaring the estimator object (see the sketch below)"
]
},
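{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that estimator pattern (hyperparameters in the constructor, then `fit`/`predict`); the particular estimator and parameter shown are just an example added here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"est = LogisticRegression(C=1.0, random_state=1)  # parameters set at construction\n",
"est.fit(titanic_df[predictors_dim], titanic_df[\"Survived\"])  # learn from data\n",
"est.predict(titanic_df[predictors_dim][:5])  # predict on (here: the same) data"
]
},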
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic Regression\n",
"\n",
"One good way to think of logistic regression is that it takes the output of a linear regression and maps it so that it lies between 0 and 1. We do this with the logistic (sigmoid) function: passing any value through it maps that value into the (0, 1) interval by 'squeezing' the extreme values. This is perfect for us, because we only care about two outcomes.\n",
"\n",
"Sklearn has a class for logistic regression that we can use. We'll also make things easier by using an sklearn helper function to do all of our cross validation and evaluation for us."
]
},
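{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the 'squeezing' concrete, here is a quick plot of the sigmoid, sigma(z) = 1 / (1 + e^(-z)). This illustration is an addition, not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"z = np.linspace(-6, 6, 200)\n",
"sigmoid = 1 / (1 + np.exp(-z))  # maps any real z into (0, 1)\n",
"plt.plot(z, sigmoid)\n",
"plt.xlabel('linear model output z')\n",
"plt.ylabel('P(Survived = 1)')"
]
},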
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# model_selection replaces the sklearn.cross_validation module,\n",
"# which was deprecated in 0.18 and later removed\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# Initialize our algorithm\n",
"algo2 = LogisticRegression(random_state=1)\n",
"\n",
"# Compute the accuracy score for all the cross validation folds\n",
"# (much simpler than doing it by hand!)\n",
"scores = cross_val_score(algo2, titanic_df[predictors_dim], titanic_df[\"Survived\"], cv=5)\n",
"\n",
"# Take the mean of the scores (because we have one for each fold)\n",
"print(scores.mean())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Type algo2. and press Tab to explore the estimator's methods and attributes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Support Vector Machines\n",
"from sklearn.svm import SVC\n",
"svc_obj = SVC(kernel='linear')\n",
"svc_scores = cross_val_score(svc_obj, titanic_df[predictors_dim], titanic_df[\"Survived\"], cv=5)\n",
"print(svc_scores.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Let's process the test set\n",
"\n",
"Process titanic_test the same way we processed titanic_df.\n",
"\n",
"This involves:\n",
"\n",
"1. Replace the missing values in the \"Age\" column with the median age from the train set. It has to be the exact same value we used to fill the training set (it can't be the median of the test set, because that is different). Use titanic_df[\"Age\"].median() to find it.\n",
"2. Replace any \"male\" values in the Sex column with 0, and any \"female\" values with 1.\n",
"3. Fill any missing values in the Embarked column with \"S\".\n",
"4. In the Embarked column, replace \"S\" with 0, \"C\" with 1, and \"Q\" with 2.\n",
"5. Replace the missing value in the Fare column, using .fillna with the median of the column in the test set. There are no missing values in the Fare column of the training set, but test sets can sometimes be different."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic_test = pd.read_csv('/root/hackerday/01_titanic/test.csv')\n",
"titanic_test[\"Age\"] = titanic_test[\"Age\"].fillna(titanic_df[\"Age\"].median())\n",
"titanic_test[\"Fare\"] = titanic_test[\"Fare\"].fillna(titanic_test[\"Fare\"].median())\n",
"titanic_test.loc[titanic_test[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
"titanic_test.loc[titanic_test[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
"titanic_test[\"Embarked\"] = titanic_test[\"Embarked\"].fillna(\"S\")\n",
"\n",
"titanic_test.loc[titanic_test[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"titanic_test.loc[titanic_test[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"titanic_test.loc[titanic_test[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Initialize the algorithm class\n",
"alg = LogisticRegression(random_state=1)\n",
"\n",
"# Train the algorithm using all the training data\n",
"alg.fit(titanic_df[predictors_dim], titanic_df[\"Survived\"])\n",
"\n",
"# Make predictions using the test set\n",
"predictions = alg.predict(titanic_test[predictors_dim])\n",
"\n",
"# Create a new dataframe with the id and the prediction\n",
"submission = pd.DataFrame({\n",
"    \"PassengerId\": titanic_test[\"PassengerId\"],\n",
"    \"Survived\": predictions\n",
"})\n",
"\n",
"print(submission.head())"
]
},
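{
"cell_type": "markdown",
"metadata": {},
"source": [
"To actually submit to Kaggle you would write the dataframe to CSV; the filename below is just an assumption for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# index=False keeps the row index out of the file, so it contains\n",
"# exactly two columns: PassengerId and Survived\n",
"submission.to_csv('submission.csv', index=False)"
]
},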
{
"cell_type": "code",
"execution_count": 165,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>892</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>893</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>894</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>895</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>896</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   PassengerId  Survived\n",
"0          892         0\n",
"1          893         0\n",
"2          894         0\n",
"3          895         0\n",
"4          896         1"
]
},
"execution_count": 165,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"submission.head()"
]
},
{
"cell_type": "code",
"execution_count": 166,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>892</td>\n",
" <td>3</td>\n",
" <td>Kelly, Mr. James</td>\n",
" <td>0</td>\n",
" <td>34.5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330911</td>\n",
" <td>7.8292</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   PassengerId  Pclass              Name  Sex   Age  SibSp  Parch  Ticket  \\\n",
"0          892       3  Kelly, Mr. James    0  34.5      0      0  330911   \n",
"\n",
"     Fare Cabin  Embarked  \n",
"0  7.8292   NaN         2  "
]
},
"execution_count": 166,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic_test.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 163,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0.99196504,  2.60300021, -0.0341005 , -0.30937218, -0.07846287,\n",
"         0.00329483,  0.23985491]])"
]
},
"execution_count": 163,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alg.coef_"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']"
]
},
"execution_count": 164,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictors_dim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fitted coefficients give the log-odds as a linear combination of the predictors:\n",
"\n",
"y = -0.99 * Pclass + 2.60 * Sex - 0.03 * Age - ...."
]
},
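{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small added convenience: pairing each coefficient with its predictor name makes the equation above easier to read off."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Positive coefficients push the log-odds (and hence the survival\n",
"# probability) up; negative ones push it down\n",
"list(zip(predictors_dim, alg.coef_[0]))"
]
},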
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}