- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Problems for Lecture 5\n",
- "\n",
- "__Source data__ \n",
- "\n",
- "You are given the file **\"mlbootcamp5_train.csv\"**. It contains survey data for 70,000 patients, collected to determine the presence of cardiovascular disease (CVD). The data are labeled: if a person has CVD, the value of **cardio** is 1, otherwise it is 0. The fields and their values are described in the second lecture.\n",
- "\n",
- "__Loading the file__"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>age</th>\n",
- " <th>gender</th>\n",
- " <th>height</th>\n",
- " <th>weight</th>\n",
- " <th>ap_hi</th>\n",
- " <th>ap_lo</th>\n",
- " <th>cholesterol</th>\n",
- " <th>gluc</th>\n",
- " <th>smoke</th>\n",
- " <th>alco</th>\n",
- " <th>active</th>\n",
- " <th>cardio</th>\n",
- " <th>chol_1</th>\n",
- " <th>chol_2</th>\n",
- " <th>chol_3</th>\n",
- " <th>gluc_1</th>\n",
- " <th>gluc_2</th>\n",
- " <th>gluc_3</th>\n",
- " <th>gender_bin</th>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>id</th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " <th></th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>18393</td>\n",
- " <td>2</td>\n",
- " <td>168</td>\n",
- " <td>62.0</td>\n",
- " <td>110</td>\n",
- " <td>80</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>20228</td>\n",
- " <td>1</td>\n",
- " <td>156</td>\n",
- " <td>85.0</td>\n",
- " <td>140</td>\n",
- " <td>90</td>\n",
- " <td>3</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>18857</td>\n",
- " <td>1</td>\n",
- " <td>165</td>\n",
- " <td>64.0</td>\n",
- " <td>130</td>\n",
- " <td>70</td>\n",
- " <td>3</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>17623</td>\n",
- " <td>2</td>\n",
- " <td>169</td>\n",
- " <td>82.0</td>\n",
- " <td>150</td>\n",
- " <td>100</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>17474</td>\n",
- " <td>1</td>\n",
- " <td>156</td>\n",
- " <td>56.0</td>\n",
- " <td>100</td>\n",
- " <td>60</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " age gender height weight ap_hi ap_lo cholesterol gluc smoke \\\n",
- "id \n",
- "0 18393 2 168 62.0 110 80 1 1 0 \n",
- "1 20228 1 156 85.0 140 90 3 1 0 \n",
- "2 18857 1 165 64.0 130 70 3 1 0 \n",
- "3 17623 2 169 82.0 150 100 1 1 0 \n",
- "4 17474 1 156 56.0 100 60 1 1 0 \n",
- "\n",
- " alco active cardio chol_1 chol_2 chol_3 gluc_1 gluc_2 gluc_3 \\\n",
- "id \n",
- "0 0 1 0 1 0 0 1 0 0 \n",
- "1 0 1 1 0 0 1 1 0 0 \n",
- "2 0 0 1 0 0 1 1 0 0 \n",
- "3 0 1 1 1 0 0 1 0 0 \n",
- "4 0 0 0 1 0 0 1 0 0 \n",
- "\n",
- " gender_bin \n",
- "id \n",
- "0 1 \n",
- "1 0 \n",
- "2 0 \n",
- "3 1 \n",
- "4 0 "
- ]
- },
- "execution_count": 3,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "%matplotlib inline\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "import seaborn as sns\n",
- "import sklearn\n",
- "from matplotlib import pyplot as plt\n",
- "import warnings\n",
- "warnings.filterwarnings('ignore')\n",
- "plt.rcParams['figure.figsize'] = [10, 5]\n",
- "\n",
- "\n",
- "df = pd.read_csv(\"../data/mlbootcamp5_train.csv\", \n",
- " sep=\";\", \n",
- " index_col=\"id\")\n",
- "# One-hot encode the cholesterol and gluc columns\n",
- "chol = pd.get_dummies(df[\"cholesterol\"], prefix=\"chol\")\n",
- "gluc = pd.get_dummies(df[\"gluc\"], prefix=\"gluc\")\n",
- "df = pd.concat([df, chol, gluc], axis=1)\n",
- "\n",
- "# Turn gender into a binary feature\n",
- "df[\"gender_bin\"] = df[\"gender\"].map({1: 0, 2: 1})\n",
- "df.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Tasks"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "__1. Although sklearn already provides an implementation of the k-nearest neighbors method, I suggest that you try writing it yourself.__\n",
- "\n",
- "* __Create a classifier using only pandas, numpy, and scipy. The number of nearest neighbors must be a hyperparameter of the classifier. (Optional) You may also add a choice of distance metric and of weights.__\n",
- "* __Use cross-validation to find the optimal number of nearest neighbors and (optionally) the feature set.__\n",
- "\n",
- "How the classifier works:\n",
- " 1. For a given example $\\vec{x}$, compute the distance to every example in the training set.\n",
- " 2. Sort the examples by their distance to $\\vec{x}$.\n",
- " 3. Select the $k$ examples with the smallest distances.\n",
- " 4. Hold a vote among the selected examples."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [],
- "source": [
- "# A lot of code here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Comments:** Your comments here."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**2. Determine which of the three classifiers (kNN, naive Bayes, decision tree) is best on each metric separately: accuracy, F1 score, ROC AUC, and the loss function. Use the feature set 'age', 'weight', 'height', 'ap_lo', 'ap_hi'.**\n",
- "\n",
- "**(Optional) Find the optimal feature set.**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Your code here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Comments:** Your comments here."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.5.2"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
- }
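The four steps listed in task 1 (compute distances, sort, keep the $k$ nearest, vote) and the cross-validation search for $k$ could be sketched roughly as below, using only numpy. This is one possible starting point, not the intended solution: the names `KNNClassifier` and `cross_val_accuracy`, the Euclidean metric, the unweighted vote, and the toy data are all assumptions made for illustration.

```python
import numpy as np


class KNNClassifier:
    """Minimal k-nearest-neighbors classifier (illustrative sketch)."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # kNN is a lazy learner: fitting only stores the training data.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        preds = []
        for x in X:
            # Step 1: Euclidean distance from x to every training example.
            dist = np.sqrt(((self.X_ - x) ** 2).sum(axis=1))
            # Steps 2-3: sort by distance and keep the k nearest indices.
            nearest = np.argsort(dist, kind="stable")[: self.n_neighbors]
            # Step 4: unweighted majority vote; ties go to the smallest label.
            labels, counts = np.unique(self.y_[nearest], return_counts=True)
            preds.append(labels[np.argmax(counts)])
        return np.array(preds)


def cross_val_accuracy(X, y, n_neighbors, n_folds=5, seed=0):
    """Mean accuracy over a shuffled k-fold split (hypothetical helper)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)
    scores = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = KNNClassifier(n_neighbors).fit(X[train], y[train])
        scores.append(np.mean(model.predict(X[test]) == y[test]))
    return float(np.mean(scores))


# Toy data, made up for illustration: two well-separated clusters.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn = KNNClassifier(n_neighbors=3).fit(X_toy, y_toy)
pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])
cv_score = cross_val_accuracy(X_toy, y_toy, n_neighbors=1, n_folds=3)
```

On the real data one would presumably loop `cross_val_accuracy` over several values of `n_neighbors` and keep the best one. Note that the sketch does not scale features, so columns with large ranges such as `age` (in days) would dominate the Euclidean distance, and the per-example Python loop may be slow on 70,000 rows.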