{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# scikit-learn feature order bug (?)\n",
    "\n",
    "I ran into a possible bug in scikit-learn today: the order of the features that I pass to a tree-based classifier affects the classifier's performance. I've created a minimal working example below. As far as I can tell, it only affects decision tree and random forest classifiers; I've tested a few different types of classifiers below to check this.\n",
    "\n",
    "First, simulate a data set with two features and a class, and divide it into training and testing sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>class</th>\n",
       "      <th>group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.374540</td>\n",
       "      <td>0.185133</td>\n",
       "      <td>1</td>\n",
       "      <td>training</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.950714</td>\n",
       "      <td>0.541901</td>\n",
       "      <td>0</td>\n",
       "      <td>testing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.731994</td>\n",
       "      <td>0.872946</td>\n",
       "      <td>0</td>\n",
       "      <td>training</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.598658</td>\n",
       "      <td>0.732225</td>\n",
       "      <td>0</td>\n",
       "      <td>training</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.156019</td>\n",
       "      <td>0.806561</td>\n",
       "      <td>1</td>\n",
       "      <td>training</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          A         B  class     group\n",
       "0  0.374540  0.185133      1  training\n",
       "1  0.950714  0.541901      0   testing\n",
       "2  0.731994  0.872946      0  training\n",
       "3  0.598658  0.732225      0  training\n",
       "4  0.156019  0.806561      1  training"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.cross_validation import StratifiedShuffleSplit\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "test_df = pd.DataFrame({'A': np.random.random(1000),\n",
    "                        'B': np.random.random(1000),\n",
    "                        'class': np.random.randint(0, 2, 1000)})\n",
    "\n",
    "# Single stratified 75%/25% split of the row indices into training and testing sets\n",
    "training_indices, testing_indices = next(iter(StratifiedShuffleSplit(test_df['class'].values,\n",
    "                                                                     n_iter=1,\n",
    "                                                                     train_size=0.75,\n",
    "                                                                     test_size=0.25)))\n",
    "\n",
    "# Label each row with the set it belongs to\n",
    "test_df.loc[training_indices, 'group'] = 'training'\n",
    "test_df.loc[testing_indices, 'group'] = 'testing'\n",
    "\n",
    "test_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Random forests\n",
    "\n",
    "Now fit a random forest classifier to the data, using the same random state every time. Note that here I pass the features ordered column 'A' then column 'B'. The printed values are the accuracy on the testing set. I repeat this procedure 10 times to make sure the results are reproducible with the same feature order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.536\n",
      "0.536\n",
      "0.536\n",
      "0.536\n",
      "0.536\n",
      "0.536\n",
      "0.536\n",
      "0.536\n",
      "0.536\n",
      "0.536\n"
     ]
    }
   ],
   "source": [
    "for repeat in range(10):\n",
    "    rfc = RandomForestClassifier(random_state=42)\n",
    "\n",
    "    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
    "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
    "\n",
    "    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
    "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
    "\n",
    "    rfc.fit(training_features, training_classes)\n",
    "\n",
    "    print(rfc.score(testing_features, testing_classes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now fit a random forest classifier again with the same random state, but this time pass the features ordered column 'B' then column 'A'. Again, I repeat this procedure 10 times to make sure the results are reproducible with the same feature order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.532\n",
      "0.532\n",
      "0.532\n",
      "0.532\n",
      "0.532\n",
      "0.532\n",
      "0.532\n",
      "0.532\n",
      "0.532\n",
      "0.532\n"
     ]
    }
   ],
   "source": [
    "for repeat in range(10):\n",
    "    rfc = RandomForestClassifier(random_state=42)\n",
    "\n",
    "    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
    "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
    "\n",
    "    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
    "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
    "\n",
    "    rfc.fit(training_features, training_classes)\n",
    "\n",
    "    print(rfc.score(testing_features, testing_classes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The testing accuracy changed from 0.536 to 0.532 when all I changed was the order of the features. Why does the order of features affect classification performance?\n",
    "\n",
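    "# Decision tree classifiers\n",
    "\n",
    "The same comparison with a single decision tree: first with the features ordered column 'A' then column 'B', then with column 'B' then column 'A'."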
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.476\n",
      "0.476\n",
      "0.476\n",
      "0.476\n",
      "0.476\n",
      "0.476\n",
      "0.476\n",
      "0.476\n",
      "0.476\n",
      "0.476\n"
     ]
    }
   ],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "for repeat in range(10):\n",
    "    dtc = DecisionTreeClassifier(random_state=42)\n",
    "\n",
    "    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
    "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
    "\n",
    "    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
    "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
    "\n",
    "    dtc.fit(training_features, training_classes)\n",
    "\n",
    "    print(dtc.score(testing_features, testing_classes))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n"
     ]
    }
   ],
   "source": [
    "for repeat in range(10):\n",
    "    dtc = DecisionTreeClassifier(random_state=42)\n",
    "\n",
    "    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
    "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
    "\n",
    "    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
    "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
    "\n",
    "    dtc.fit(training_features, training_classes)\n",
    "\n",
    "    print(dtc.score(testing_features, testing_classes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
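    "# SVM\n",
    "\n",
    "The same comparison with a support vector classifier. Here the testing accuracy is 0.512 for both feature orders."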
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n"
     ]
    }
   ],
   "source": [
    "from sklearn.svm import SVC\n",
    "\n",
    "for repeat in range(10):\n",
    "    svc = SVC(random_state=42)\n",
    "\n",
    "    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
    "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
    "\n",
    "    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
    "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
    "\n",
    "    svc.fit(training_features, training_classes)\n",
    "\n",
    "    print(svc.score(testing_features, testing_classes))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n"
     ]
    }
   ],
   "source": [
    "for repeat in range(10):\n",
    "    svc = SVC(random_state=42)\n",
    "\n",
    "    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
    "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
    "\n",
    "    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
    "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
    "\n",
    "    svc.fit(training_features, training_classes)\n",
    "\n",
    "    print(svc.score(testing_features, testing_classes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
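    "# Logistic regression\n",
    "\n",
    "The same comparison with logistic regression. Again, the testing accuracy is 0.512 for both feature orders."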
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "for repeat in range(10):\n",
    "    lrc = LogisticRegression(random_state=42)\n",
    "\n",
    "    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
    "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
    "\n",
    "    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
    "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
    "\n",
    "    lrc.fit(training_features, training_classes)\n",
    "\n",
    "    print(lrc.score(testing_features, testing_classes))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n",
      "0.512\n"
     ]
    }
   ],
   "source": [
    "for repeat in range(10):\n",
    "    lrc = LogisticRegression(random_state=42)\n",
    "\n",
    "    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
    "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
    "\n",
    "    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
    "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
    "\n",
    "    lrc.fit(training_features, training_classes)\n",
    "\n",
    "    print(lrc.score(testing_features, testing_classes))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}