{
"cells": [
{
"cell_type": "code",
"execution_count": 153,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"### Load data\n",
"\n",
"We know that we will want to include sex.\n",
"\n",
"If we assume a linear model, then we can regress our target variable on sex, and use the residuals to pick among the remaining variables."
]
},
{
"cell_type": "code",
"execution_count": 201,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"dat = pd.read_csv('train_data.txt', sep=' ', header=None)\n",
"\n",
"y = dat.values[:, 0]\n",
"X = dat.values[:, 1:]\n",
"\n",
"sex = X[:, 1].reshape(-1,1)\n",
"markers = X[:, 1:]\n",
"\n",
"lin = LinearRegression()\n",
"lin.fit(sex, y)\n",
"err = y - lin.predict(sex)"
]
},
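{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a sketch added here, not part of the original analysis): OLS residuals from a regression with an intercept are mean-zero and uncorrelated with the regressor, so `err` should carry no remaining linear signal from sex."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Residuals from the sex-only regression should be (numerically)\n",
"# mean-zero and uncorrelated with sex.\n",
"print(np.mean(err))\n",
"print(np.corrcoef(sex.reshape(-1), err)[0, 1])"
]
},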
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"### Lasso\n",
"\n",
"A first attempt might be to run Lasso with the markers as predictors and the residuals as the target. This gives us a couple of markers, which is a good start!"
]
},
{
"cell_type": "code",
"execution_count": 211,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"from sklearn.linear_model import LassoCV\n",
"\n",
"def lasso_selection(X, y):\n",
"    model = LassoCV(cv=3)\n",
"    model.fit(X, y)\n",
"    selected = np.argwhere(model.coef_ != 0).reshape(-1)\n",
"    best_score = np.min(model.mse_path_.mean(1))\n",
"    return selected, best_score"
]
},
{
"cell_type": "code",
"execution_count": 218,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(array([4860, 4999]), 9.636899707219222, 8.2643213236732)"
]
},
"execution_count": 218,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"simple_lasso_markers, best_score = lasso_selection(markers, err)\n",
"simple_lasso_markers, best_score, score_ols(simple_lasso_markers, sex, markers, y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"### Prescreen and run OLS\n",
"\n",
"Another option might be to do a simple prescreening based on partial correlations, and then do an exhaustive search over the space of linear models created by the top _ markers by partial correlation.\n",
"\n",
"This gives us 4 markers, and a much better cross-validated MSE."
]
},
{
"cell_type": "code",
"execution_count": 205,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"from itertools import combinations\n",
"\n",
"def prescreen(X, y, top):\n",
"    cors = [pd.Series(y).corr(m) for _, m in pd.DataFrame(X).T.iterrows()]\n",
"    sorted_corrs = list(pd.Series(cors).abs().sort_values(ascending=False).index)\n",
"    x = [[np.array(l) for l in list(combinations(sorted_corrs[:top], i))] for i in range(1, 5)]\n",
"    return [i for l in x for i in l], sorted_corrs\n",
"\n",
"def score_ols(i, sex, markers, y):\n",
"    final_X = np.concatenate([sex, markers[:, i]], 1)\n",
"    l = LinearRegression()\n",
"    return -np.mean(cross_val_score(l, final_X, y, cv=3, scoring='neg_mean_squared_error'))"
]
},
{
"cell_type": "code",
"execution_count": 206,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(array([4999, 4998, 4860, 3211]), 2.7520014965420962)"
]
},
"execution_count": 206,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"screened, sorted_corrs = prescreen(markers, err, 6)\n",
"scores = [score_ols(i, sex, markers, y) for i in screened]\n",
"combinatorial_markers = screened[np.argmin(scores)]\n",
"\n",
"combinatorial_markers, np.min(scores)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"### Use Lasso on the prescreened variables\n",
"\n",
"Another option could be to combine Lasso with the prescreening, running Lasso on only the top 50 variables by partial correlation."
]
},
{
"cell_type": "code",
"execution_count": 217,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(27, 3.548344879465436, 4.006063575106914)"
]
},
"execution_count": 217,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idx = sorted_corrs[:50]\n",
"model = LassoCV()\n",
"model.fit(markers[:, idx], err)\n",
"# map coefficient positions in `idx` back to marker indices\n",
"lasso_markers = np.array(idx)[np.argwhere(model.coef_ != 0).reshape(-1)]\n",
"best_score = np.min(np.mean(model.mse_path_, axis=1))\n",
"\n",
"len(lasso_markers), best_score, score_ols(lasso_markers, sex, markers, y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"### Adding an L2 regularization\n",
"\n",
"Our Lasso didn't do quite as well as our simple prescreening, but it also picked many more markers (27 vs 4). Now that we have our variables selected, we find that adding a regularization parameter improves the cross-validation scores. This is intuitive: our dataset is very small, and even with a small number of selected markers, the results of a simple OLS can be sensitive to high-leverage points. Regularization reduces that sensitivity, and an L2 penalty lets us keep all our chosen markers."
]
},
{
"cell_type": "code",
"execution_count": 219,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"from sklearn.linear_model import RidgeCV, Ridge\n",
"\n",
"def ridger(ids, sex, markers, y):\n",
"    final_X = np.concatenate([sex, markers[:, ids]], 1)\n",
"    model = RidgeCV(cv=3)\n",
"    model.fit(final_X, y)\n",
"\n",
"    model = Ridge(alpha=model.alpha_)\n",
"    score = -np.mean(cross_val_score(model, final_X, y, cv=3, scoring='neg_mean_squared_error'))\n",
"    return model, score"
]
},
{
"cell_type": "code",
"execution_count": 220,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1.6131415123627557"
]
},
"execution_count": 220,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model, score = ridger(lasso_markers, sex, markers, y)\n",
"score"
]
}
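,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final step (a sketch added here, not part of the original run): the ridge model returned by `ridger` is unfitted, so we refit it on the full training set before producing predictions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Refit the selected ridge model on all training rows and predict in-sample.\n",
"final_X = np.concatenate([sex, markers[:, lasso_markers]], 1)\n",
"model.fit(final_X, y)\n",
"preds = model.predict(final_X)\n",
"preds[:5]"
]
}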
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"name": "Untitled.ipynb"
},
"nbformat": 4,
"nbformat_minor": 2
}