- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Sklearn, XGBoost"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## sklearn.ensemble.RandomForestClassifier"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
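- "from sklearn import ensemble, metrics\n",
- "try:  # scikit-learn >= 0.18 provides cross_val_score in model_selection\n",
- "    from sklearn import model_selection as cross_validation\n",
- "except ImportError:  # older releases only have sklearn.cross_validation\n",
- "    from sklearn import cross_validation\n",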
- "\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "import xgboost as xgb"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "%pylab inline"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Kaggle competition: https://www.kaggle.com/c/bioresponse\n",
- "\n",
- "Data: https://www.kaggle.com/c/bioresponse/data\n",
- "\n",
- "Given a molecule's features, the task is to predict whether it produces a biological response.\n",
- "\n",
- "The features are normalized.\n",
- "\n",
- "This demo uses the training set train.csv from the original data; the data file is attached."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "bioresponce = pd.read_csv('bioresponse.csv', header=0, sep=',')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "bioresponce.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "bioresponce_target = bioresponce.Activity.values"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "bioresponce_data = bioresponce.iloc[:, 1:]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### RandomForestClassifier model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Accuracy as a function of the number of trees"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "n_trees = [1] + list(range(10, 55, 5))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "%%time\n",
- "# Cross-validated accuracy for each forest size: one row per n_tree, one column per fold\n",
- "scoring = []\n",
- "for n_tree in n_trees:\n",
- "    estimator = ensemble.RandomForestClassifier(n_estimators=n_tree, min_samples_split=5, random_state=1)\n",
- "    score = cross_validation.cross_val_score(estimator, bioresponce_data, bioresponce_target,\n",
- "                                             scoring='accuracy', cv=3)\n",
- "    scoring.append(score)\n",
- "scoring = np.asarray(scoring)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "scoring"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "pylab.plot(n_trees, scoring.mean(axis = 1), marker='.', label='RandomForest')\n",
- "pylab.grid(True)\n",
- "pylab.xlabel('n_trees')\n",
- "pylab.ylabel('score')\n",
- "pylab.title('Accuracy score')\n",
- "pylab.legend(loc='lower right')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Learning curves for deeper trees"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "%%time\n",
- "# Same cross-validation loop for gradient boosting: one row per n_tree, one column per fold\n",
- "xgb_scoring = []\n",
- "for n_tree in n_trees:\n",
- "    estimator = xgb.XGBClassifier(learning_rate=0.1, max_depth=5, n_estimators=n_tree, min_child_weight=3)\n",
- "    score = cross_validation.cross_val_score(estimator, bioresponce_data, bioresponce_target,\n",
- "                                             scoring='accuracy', cv=3)\n",
- "    xgb_scoring.append(score)\n",
- "xgb_scoring = np.asarray(xgb_scoring)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "xgb_scoring"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "pylab.plot(n_trees, scoring.mean(axis = 1), marker='.', label='RandomForest')\n",
- "pylab.plot(n_trees, xgb_scoring.mean(axis = 1), marker='.', label='XGBoost')\n",
- "pylab.grid(True)\n",
- "pylab.xlabel('n_trees')\n",
- "pylab.ylabel('score')\n",
- "pylab.title('Accuracy score')\n",
- "pylab.legend(loc='lower right')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### **If you are interested in xgboost:**\n",
- "Python API: http://xgboost.readthedocs.org/en/latest/python/python_api.html\n",
- "\n",
- "Installation: http://xgboost.readthedocs.io/en/latest/build.html"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 2",
- "language": "python",
- "name": "python2"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 2
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython2",
- "version": "2.7.12"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
- }