Untitled

a guest
Oct 13th, 2015
  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "metadata": {},
  6. "source": [
  7. "# scikit-learn feature order bug (?)\n",
  8. "\n",
  9. "I ran into a possible bug in scikit-learn today: the order in which I pass features to a tree-based classifier affects the classifier's performance. I've created a minimal working example below. As far as I can tell, only decision tree and random forest classifiers are affected. To check this, I also test two other types of classifiers below (an SVM and logistic regression).\n",
  10. "\n",
  11. "First, simulate a data set with two features and a class label, and divide it into training and testing sets."
  12. ]
  13. },
  14. {
  15. "cell_type": "code",
  16. "execution_count": 1,
  17. "metadata": {
  18. "collapsed": false
  19. },
  20. "outputs": [
  21. {
  22. "data": {
  23. "text/html": [
  24. "<div>\n",
  25. "<table border=\"1\" class=\"dataframe\">\n",
  26. " <thead>\n",
  27. " <tr style=\"text-align: right;\">\n",
  28. " <th></th>\n",
  29. " <th>A</th>\n",
  30. " <th>B</th>\n",
  31. " <th>class</th>\n",
  32. " <th>group</th>\n",
  33. " </tr>\n",
  34. " </thead>\n",
  35. " <tbody>\n",
  36. " <tr>\n",
  37. " <th>0</th>\n",
  38. " <td>0.374540</td>\n",
  39. " <td>0.185133</td>\n",
  40. " <td>1</td>\n",
  41. " <td>training</td>\n",
  42. " </tr>\n",
  43. " <tr>\n",
  44. " <th>1</th>\n",
  45. " <td>0.950714</td>\n",
  46. " <td>0.541901</td>\n",
  47. " <td>0</td>\n",
  48. " <td>testing</td>\n",
  49. " </tr>\n",
  50. " <tr>\n",
  51. " <th>2</th>\n",
  52. " <td>0.731994</td>\n",
  53. " <td>0.872946</td>\n",
  54. " <td>0</td>\n",
  55. " <td>training</td>\n",
  56. " </tr>\n",
  57. " <tr>\n",
  58. " <th>3</th>\n",
  59. " <td>0.598658</td>\n",
  60. " <td>0.732225</td>\n",
  61. " <td>0</td>\n",
  62. " <td>training</td>\n",
  63. " </tr>\n",
  64. " <tr>\n",
  65. " <th>4</th>\n",
  66. " <td>0.156019</td>\n",
  67. " <td>0.806561</td>\n",
  68. " <td>1</td>\n",
  69. " <td>training</td>\n",
  70. " </tr>\n",
  71. " </tbody>\n",
  72. "</table>\n",
  73. "</div>"
  74. ],
  75. "text/plain": [
  76. " A B class group\n",
  77. "0 0.374540 0.185133 1 training\n",
  78. "1 0.950714 0.541901 0 testing\n",
  79. "2 0.731994 0.872946 0 training\n",
  80. "3 0.598658 0.732225 0 training\n",
  81. "4 0.156019 0.806561 1 training"
  82. ]
  83. },
  84. "execution_count": 1,
  85. "metadata": {},
  86. "output_type": "execute_result"
  87. }
  88. ],
  89. "source": [
  90. "from sklearn.ensemble import RandomForestClassifier\n",
  91. "from sklearn.cross_validation import StratifiedShuffleSplit\n",
  92. "import numpy as np\n",
  93. "import pandas as pd\n",
  94. "\n",
  95. "np.random.seed(42)\n",
  96. "\n",
  97. "test_df = pd.DataFrame({'A': np.random.random(1000),\n",
  98. "                        'B': np.random.random(1000),\n",
  99. "                        'class': np.random.randint(0, 2, 1000)})\n",
  100. "\n",
  101. "training_indices, testing_indices = next(iter(StratifiedShuffleSplit(test_df['class'].values,\n",
  102. "                                                                     n_iter=1,\n",
  103. "                                                                     train_size=0.75,\n",
  104. "                                                                     test_size=0.25)))\n",
  105. "\n",
  106. "test_df.loc[training_indices, 'group'] = 'training'\n",
  107. "test_df.loc[testing_indices, 'group'] = 'testing'\n",
  108. "\n",
  109. "test_df.head()"
  110. ]
  111. },
  112. {
  113. "cell_type": "markdown",
  114. "metadata": {},
  115. "source": [
  116. "# Random forests\n",
  117. "\n",
  118. "Now fit a random forest classifier to the data, using the same random state every time. Here I pass the features with column 'A' first, then column 'B'. The printed values are accuracy scores on the testing set. I repeat the procedure 10 times to confirm that the results are reproducible for a fixed feature order."
  119. ]
  120. },
  121. {
  122. "cell_type": "code",
  123. "execution_count": 2,
  124. "metadata": {
  125. "collapsed": false
  126. },
  127. "outputs": [
  128. {
  129. "name": "stdout",
  130. "output_type": "stream",
  131. "text": [
  132. "0.536\n",
  133. "0.536\n",
  134. "0.536\n",
  135. "0.536\n",
  136. "0.536\n",
  137. "0.536\n",
  138. "0.536\n",
  139. "0.536\n",
  140. "0.536\n",
  141. "0.536\n"
  142. ]
  143. }
  144. ],
  145. "source": [
  146. "for repeat in range(10):\n",
  147. "    rfc = RandomForestClassifier(random_state=42)\n",
  148. "\n",
  149. "    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
  150. "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
  151. "\n",
  152. "    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
  153. "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
  154. "\n",
  155. "    rfc.fit(training_features, training_classes)\n",
  156. "\n",
  157. "    print(rfc.score(testing_features, testing_classes))"
  158. ]
  159. },
  160. {
  161. "cell_type": "markdown",
  162. "metadata": {},
  163. "source": [
  164. "Now fit a random forest classifier with the same random state again, but this time pass the features with column 'B' first, then column 'A'. As before, I repeat the procedure 10 times to confirm that the results are reproducible for a fixed feature order."
  165. ]
  166. },
  167. {
  168. "cell_type": "code",
  169. "execution_count": 3,
  170. "metadata": {
  171. "collapsed": false
  172. },
  173. "outputs": [
  174. {
  175. "name": "stdout",
  176. "output_type": "stream",
  177. "text": [
  178. "0.532\n",
  179. "0.532\n",
  180. "0.532\n",
  181. "0.532\n",
  182. "0.532\n",
  183. "0.532\n",
  184. "0.532\n",
  185. "0.532\n",
  186. "0.532\n",
  187. "0.532\n"
  188. ]
  189. }
  190. ],
  191. "source": [
  192. "for repeat in range(10):\n",
  193. "    rfc = RandomForestClassifier(random_state=42)\n",
  194. "\n",
  195. "    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
  196. "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
  197. "\n",
  198. "    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
  199. "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
  200. "\n",
  201. "    rfc.fit(training_features, training_classes)\n",
  202. "\n",
  203. "    print(rfc.score(testing_features, testing_classes))"
  204. ]
  205. },
  206. {
  207. "cell_type": "markdown",
  208. "metadata": {},
  209. "source": [
  210. "Note that the testing accuracy differs (0.536 vs. 0.532) even though all I changed was the order of the features. Why does the order of features affect classification performance?\n",
  211. "\n",
  212. "# Decision tree classifiers"
  213. ]
  214. },
  215. {
  216. "cell_type": "code",
  217. "execution_count": 4,
  218. "metadata": {
  219. "collapsed": false
  220. },
  221. "outputs": [
  222. {
  223. "name": "stdout",
  224. "output_type": "stream",
  225. "text": [
  226. "0.476\n",
  227. "0.476\n",
  228. "0.476\n",
  229. "0.476\n",
  230. "0.476\n",
  231. "0.476\n",
  232. "0.476\n",
  233. "0.476\n",
  234. "0.476\n",
  235. "0.476\n"
  236. ]
  237. }
  238. ],
  239. "source": [
  240. "from sklearn.tree import DecisionTreeClassifier\n",
  241. "\n",
  242. "for repeat in range(10):\n",
  243. "    dtc = DecisionTreeClassifier(random_state=42)\n",
  244. "\n",
  245. "    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
  246. "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
  247. "\n",
  248. "    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
  249. "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
  250. "\n",
  251. "    dtc.fit(training_features, training_classes)\n",
  252. "\n",
  253. "    print(dtc.score(testing_features, testing_classes))"
  254. ]
  255. },
  256. {
  257. "cell_type": "code",
  258. "execution_count": 5,
  259. "metadata": {
  260. "collapsed": false
  261. },
  262. "outputs": [
  263. {
  264. "name": "stdout",
  265. "output_type": "stream",
  266. "text": [
  267. "0.512\n",
  268. "0.512\n",
  269. "0.512\n",
  270. "0.512\n",
  271. "0.512\n",
  272. "0.512\n",
  273. "0.512\n",
  274. "0.512\n",
  275. "0.512\n",
  276. "0.512\n"
  277. ]
  278. }
  279. ],
  280. "source": [
  281. "for repeat in range(10):\n",
  282. "    dtc = DecisionTreeClassifier(random_state=42)\n",
  283. "\n",
  284. "    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
  285. "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
  286. "\n",
  287. "    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
  288. "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
  289. "\n",
  290. "    dtc.fit(training_features, training_classes)\n",
  291. "\n",
  292. "    print(dtc.score(testing_features, testing_classes))"
  293. ]
  294. },
  295. {
  296. "cell_type": "markdown",
  297. "metadata": {},
  298. "source": [
  299. "# SVM"
  300. ]
  301. },
  302. {
  303. "cell_type": "code",
  304. "execution_count": 6,
  305. "metadata": {
  306. "collapsed": false
  307. },
  308. "outputs": [
  309. {
  310. "name": "stdout",
  311. "output_type": "stream",
  312. "text": [
  313. "0.512\n",
  314. "0.512\n",
  315. "0.512\n",
  316. "0.512\n",
  317. "0.512\n",
  318. "0.512\n",
  319. "0.512\n",
  320. "0.512\n",
  321. "0.512\n",
  322. "0.512\n"
  323. ]
  324. }
  325. ],
  326. "source": [
  327. "from sklearn.svm import SVC\n",
  328. "\n",
  329. "for repeat in range(10):\n",
  330. "    svc = SVC(random_state=42)\n",
  331. "\n",
  332. "    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
  333. "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
  334. "\n",
  335. "    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
  336. "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
  337. "\n",
  338. "    svc.fit(training_features, training_classes)\n",
  339. "\n",
  340. "    print(svc.score(testing_features, testing_classes))"
  341. ]
  342. },
  343. {
  344. "cell_type": "code",
  345. "execution_count": 7,
  346. "metadata": {
  347. "collapsed": false
  348. },
  349. "outputs": [
  350. {
  351. "name": "stdout",
  352. "output_type": "stream",
  353. "text": [
  354. "0.512\n",
  355. "0.512\n",
  356. "0.512\n",
  357. "0.512\n",
  358. "0.512\n",
  359. "0.512\n",
  360. "0.512\n",
  361. "0.512\n",
  362. "0.512\n",
  363. "0.512\n"
  364. ]
  365. }
  366. ],
  367. "source": [
  368. "for repeat in range(10):\n",
  369. "    svc = SVC(random_state=42)\n",
  370. "\n",
  371. "    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
  372. "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
  373. "\n",
  374. "    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
  375. "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
  376. "\n",
  377. "    svc.fit(training_features, training_classes)\n",
  378. "\n",
  379. "    print(svc.score(testing_features, testing_classes))"
  380. ]
  381. },
  382. {
  383. "cell_type": "markdown",
  384. "metadata": {},
  385. "source": [
  386. "# Logistic regression"
  387. ]
  388. },
  389. {
  390. "cell_type": "code",
  391. "execution_count": 8,
  392. "metadata": {
  393. "collapsed": false
  394. },
  395. "outputs": [
  396. {
  397. "name": "stdout",
  398. "output_type": "stream",
  399. "text": [
  400. "0.512\n",
  401. "0.512\n",
  402. "0.512\n",
  403. "0.512\n",
  404. "0.512\n",
  405. "0.512\n",
  406. "0.512\n",
  407. "0.512\n",
  408. "0.512\n",
  409. "0.512\n"
  410. ]
  411. }
  412. ],
  413. "source": [
  414. "from sklearn.linear_model import LogisticRegression\n",
  415. "\n",
  416. "for repeat in range(10):\n",
  417. "    lrc = LogisticRegression(random_state=42)\n",
  418. "\n",
  419. "    training_features = test_df.loc[test_df['group'] == 'training', ['A', 'B']].values\n",
  420. "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
  421. "\n",
  422. "    testing_features = test_df.loc[test_df['group'] == 'testing', ['A', 'B']].values\n",
  423. "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
  424. "\n",
  425. "    lrc.fit(training_features, training_classes)\n",
  426. "\n",
  427. "    print(lrc.score(testing_features, testing_classes))"
  428. ]
  429. },
  430. {
  431. "cell_type": "code",
  432. "execution_count": 9,
  433. "metadata": {
  434. "collapsed": false
  435. },
  436. "outputs": [
  437. {
  438. "name": "stdout",
  439. "output_type": "stream",
  440. "text": [
  441. "0.512\n",
  442. "0.512\n",
  443. "0.512\n",
  444. "0.512\n",
  445. "0.512\n",
  446. "0.512\n",
  447. "0.512\n",
  448. "0.512\n",
  449. "0.512\n",
  450. "0.512\n"
  451. ]
  452. }
  453. ],
  454. "source": [
  455. "for repeat in range(10):\n",
  456. "    lrc = LogisticRegression(random_state=42)\n",
  457. "\n",
  458. "    training_features = test_df.loc[test_df['group'] == 'training', ['B', 'A']].values\n",
  459. "    training_classes = test_df.loc[test_df['group'] == 'training', 'class'].values\n",
  460. "\n",
  461. "    testing_features = test_df.loc[test_df['group'] == 'testing', ['B', 'A']].values\n",
  462. "    testing_classes = test_df.loc[test_df['group'] == 'testing', 'class'].values\n",
  463. "\n",
  464. "    lrc.fit(training_features, training_classes)\n",
  465. "\n",
  466. "    print(lrc.score(testing_features, testing_classes))"
  467. ]
  468. },
  469. {
  470. "cell_type": "code",
  471. "execution_count": null,
  472. "metadata": {
  473. "collapsed": true
  474. },
  475. "outputs": [],
  476. "source": []
  477. }
  478. ],
  479. "metadata": {
  480. "kernelspec": {
  481. "display_name": "Python 3",
  482. "language": "python",
  483. "name": "python3"
  484. },
  485. "language_info": {
  486. "codemirror_mode": {
  487. "name": "ipython",
  488. "version": 3
  489. },
  490. "file_extension": ".py",
  491. "mimetype": "text/x-python",
  492. "name": "python",
  493. "nbconvert_exporter": "python",
  494. "pygments_lexer": "ipython3",
  495. "version": "3.4.3"
  496. }
  497. },
  498. "nbformat": 4,
  499. "nbformat_minor": 0
  500. }
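
One possible explanation for the question the notebook raises, offered as an assumption rather than something the notebook confirms: the scikit-learn documentation for `DecisionTreeClassifier` notes that features are randomly permuted at each split, so when several splits are equally good, the one that wins can depend on the random state. A fixed `random_state` fixes the order of column *positions*, not of column *contents*, so swapping the columns could change which feature wins a tie. The toy stump below is a hypothetical, standard-library-only sketch of that mechanism; `fit_stump`, `run`, and the tiny data set are made up for illustration and are not scikit-learn's implementation.

```python
import random

# Toy data: two training samples that either column separates perfectly,
# so both candidate splits tie with zero training errors.
TRAIN = {"A": [0.1, 0.8], "B": [0.9, 0.2]}
LABELS = [0, 1]
TEST = {"A": 0.3, "B": 0.3}


def fit_stump(rows, labels, seed):
    """Fit a one-split tree, visiting columns in a seeded random order."""
    rng = random.Random(seed)
    order = list(range(len(rows[0])))
    rng.shuffle(order)  # depends only on the seed, not on column contents

    best = None  # (errors, column index, threshold, left label, right label)
    for j in order:
        values = sorted({row[j] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            left_label = max(set(left), key=left.count)
            right_label = max(set(right), key=right.count)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            # Strict "<" means a tied split never replaces an earlier one.
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    return best[1:]


def run(column_order, seed=42):
    """Return (name of the column chosen for the split, test prediction)."""
    rows = [tuple(TRAIN[c][i] for c in column_order) for i in range(len(LABELS))]
    j, threshold, left_label, right_label = fit_stump(rows, LABELS, seed)
    test_value = TEST[column_order[j]]
    prediction = left_label if test_value <= threshold else right_label
    return column_order[j], prediction


result_ab = run(["A", "B"])
result_ba = run(["B", "A"])
print(result_ab, result_ba)
```

With the same seed, both runs pick the same column position, which holds 'A' in one ordering and 'B' in the other. The two stumps therefore split on different features and disagree on the test point, even though the data and the seed are identical. Whether this fully accounts for the random forest and decision tree scores above would need to be checked against the scikit-learn version used in the notebook.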