Untitled

a guest
Mar 22nd, 2019
  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "metadata": {},
  6. "source": [
  7. "### Exercises for Lecture 5\n",
  8. "\n",
  9. "__Source data__ \n",
  10. "\n",
  11. "The file **\"mlbootcamp5_train.csv\"** contains survey data for 70,000 patients, collected to determine the presence of cardiovascular disease (CVD). The data are labeled: if a person has CVD, **cardio** equals 1, otherwise 0. The fields and their values are described in the second lecture.\n",
  12. "\n",
  13. "__Loading the file__"
  14. ]
  15. },
  16. {
  17. "cell_type": "code",
  18. "execution_count": 3,
  19. "metadata": {},
  20. "outputs": [
  21. {
  22. "data": {
  23. "text/html": [
  24. "<div>\n",
  25. "<style scoped>\n",
  26. " .dataframe tbody tr th:only-of-type {\n",
  27. " vertical-align: middle;\n",
  28. " }\n",
  29. "\n",
  30. " .dataframe tbody tr th {\n",
  31. " vertical-align: top;\n",
  32. " }\n",
  33. "\n",
  34. " .dataframe thead th {\n",
  35. " text-align: right;\n",
  36. " }\n",
  37. "</style>\n",
  38. "<table border=\"1\" class=\"dataframe\">\n",
  39. " <thead>\n",
  40. " <tr style=\"text-align: right;\">\n",
  41. " <th></th>\n",
  42. " <th>age</th>\n",
  43. " <th>gender</th>\n",
  44. " <th>height</th>\n",
  45. " <th>weight</th>\n",
  46. " <th>ap_hi</th>\n",
  47. " <th>ap_lo</th>\n",
  48. " <th>cholesterol</th>\n",
  49. " <th>gluc</th>\n",
  50. " <th>smoke</th>\n",
  51. " <th>alco</th>\n",
  52. " <th>active</th>\n",
  53. " <th>cardio</th>\n",
  54. " <th>chol_1</th>\n",
  55. " <th>chol_2</th>\n",
  56. " <th>chol_3</th>\n",
  57. " <th>gluc_1</th>\n",
  58. " <th>gluc_2</th>\n",
  59. " <th>gluc_3</th>\n",
  60. " <th>gender_bin</th>\n",
  61. " </tr>\n",
  62. " <tr>\n",
  63. " <th>id</th>\n",
  64. " <th></th>\n",
  65. " <th></th>\n",
  66. " <th></th>\n",
  67. " <th></th>\n",
  68. " <th></th>\n",
  69. " <th></th>\n",
  70. " <th></th>\n",
  71. " <th></th>\n",
  72. " <th></th>\n",
  73. " <th></th>\n",
  74. " <th></th>\n",
  75. " <th></th>\n",
  76. " <th></th>\n",
  77. " <th></th>\n",
  78. " <th></th>\n",
  79. " <th></th>\n",
  80. " <th></th>\n",
  81. " <th></th>\n",
  82. " <th></th>\n",
  83. " </tr>\n",
  84. " </thead>\n",
  85. " <tbody>\n",
  86. " <tr>\n",
  87. " <th>0</th>\n",
  88. " <td>18393</td>\n",
  89. " <td>2</td>\n",
  90. " <td>168</td>\n",
  91. " <td>62.0</td>\n",
  92. " <td>110</td>\n",
  93. " <td>80</td>\n",
  94. " <td>1</td>\n",
  95. " <td>1</td>\n",
  96. " <td>0</td>\n",
  97. " <td>0</td>\n",
  98. " <td>1</td>\n",
  99. " <td>0</td>\n",
  100. " <td>1</td>\n",
  101. " <td>0</td>\n",
  102. " <td>0</td>\n",
  103. " <td>1</td>\n",
  104. " <td>0</td>\n",
  105. " <td>0</td>\n",
  106. " <td>1</td>\n",
  107. " </tr>\n",
  108. " <tr>\n",
  109. " <th>1</th>\n",
  110. " <td>20228</td>\n",
  111. " <td>1</td>\n",
  112. " <td>156</td>\n",
  113. " <td>85.0</td>\n",
  114. " <td>140</td>\n",
  115. " <td>90</td>\n",
  116. " <td>3</td>\n",
  117. " <td>1</td>\n",
  118. " <td>0</td>\n",
  119. " <td>0</td>\n",
  120. " <td>1</td>\n",
  121. " <td>1</td>\n",
  122. " <td>0</td>\n",
  123. " <td>0</td>\n",
  124. " <td>1</td>\n",
  125. " <td>1</td>\n",
  126. " <td>0</td>\n",
  127. " <td>0</td>\n",
  128. " <td>0</td>\n",
  129. " </tr>\n",
  130. " <tr>\n",
  131. " <th>2</th>\n",
  132. " <td>18857</td>\n",
  133. " <td>1</td>\n",
  134. " <td>165</td>\n",
  135. " <td>64.0</td>\n",
  136. " <td>130</td>\n",
  137. " <td>70</td>\n",
  138. " <td>3</td>\n",
  139. " <td>1</td>\n",
  140. " <td>0</td>\n",
  141. " <td>0</td>\n",
  142. " <td>0</td>\n",
  143. " <td>1</td>\n",
  144. " <td>0</td>\n",
  145. " <td>0</td>\n",
  146. " <td>1</td>\n",
  147. " <td>1</td>\n",
  148. " <td>0</td>\n",
  149. " <td>0</td>\n",
  150. " <td>0</td>\n",
  151. " </tr>\n",
  152. " <tr>\n",
  153. " <th>3</th>\n",
  154. " <td>17623</td>\n",
  155. " <td>2</td>\n",
  156. " <td>169</td>\n",
  157. " <td>82.0</td>\n",
  158. " <td>150</td>\n",
  159. " <td>100</td>\n",
  160. " <td>1</td>\n",
  161. " <td>1</td>\n",
  162. " <td>0</td>\n",
  163. " <td>0</td>\n",
  164. " <td>1</td>\n",
  165. " <td>1</td>\n",
  166. " <td>1</td>\n",
  167. " <td>0</td>\n",
  168. " <td>0</td>\n",
  169. " <td>1</td>\n",
  170. " <td>0</td>\n",
  171. " <td>0</td>\n",
  172. " <td>1</td>\n",
  173. " </tr>\n",
  174. " <tr>\n",
  175. " <th>4</th>\n",
  176. " <td>17474</td>\n",
  177. " <td>1</td>\n",
  178. " <td>156</td>\n",
  179. " <td>56.0</td>\n",
  180. " <td>100</td>\n",
  181. " <td>60</td>\n",
  182. " <td>1</td>\n",
  183. " <td>1</td>\n",
  184. " <td>0</td>\n",
  185. " <td>0</td>\n",
  186. " <td>0</td>\n",
  187. " <td>0</td>\n",
  188. " <td>1</td>\n",
  189. " <td>0</td>\n",
  190. " <td>0</td>\n",
  191. " <td>1</td>\n",
  192. " <td>0</td>\n",
  193. " <td>0</td>\n",
  194. " <td>0</td>\n",
  195. " </tr>\n",
  196. " </tbody>\n",
  197. "</table>\n",
  198. "</div>"
  199. ],
  200. "text/plain": [
  201. "      age  gender  height  weight  ap_hi  ap_lo  cholesterol  gluc  smoke  \\\n",
  202. "id                                                                          \n",
  203. "0   18393       2     168    62.0    110     80            1     1      0   \n",
  204. "1   20228       1     156    85.0    140     90            3     1      0   \n",
  205. "2   18857       1     165    64.0    130     70            3     1      0   \n",
  206. "3   17623       2     169    82.0    150    100            1     1      0   \n",
  207. "4   17474       1     156    56.0    100     60            1     1      0   \n",
  208. "\n",
  209. "    alco  active  cardio  chol_1  chol_2  chol_3  gluc_1  gluc_2  gluc_3  \\\n",
  210. "id                                                                        \n",
  211. "0      0       1       0       1       0       0       1       0       0   \n",
  212. "1      0       1       1       0       0       1       1       0       0   \n",
  213. "2      0       0       1       0       0       1       1       0       0   \n",
  214. "3      0       1       1       1       0       0       1       0       0   \n",
  215. "4      0       0       0       1       0       0       1       0       0   \n",
  216. "\n",
  217. "    gender_bin  \n",
  218. "id              \n",
  219. "0            1  \n",
  220. "1            0  \n",
  221. "2            0  \n",
  222. "3            1  \n",
  223. "4            0  "
  224. ]
  225. },
  226. "execution_count": 3,
  227. "metadata": {},
  228. "output_type": "execute_result"
  229. }
  230. ],
  231. "source": [
  232. "%matplotlib inline\n",
  233. "import numpy as np\n",
  234. "import pandas as pd\n",
  235. "import seaborn as sns\n",
  236. "import sklearn\n",
  237. "from matplotlib import pyplot as plt\n",
  238. "import warnings\n",
  239. "warnings.filterwarnings('ignore')\n",
  240. "import matplotlib.pyplot as plt\n",
  241. "plt.rcParams['figure.figsize'] = [10, 5]\n",
  242. "\n",
  243. "\n",
  244. "df = pd.read_csv(\"../data/mlbootcamp5_train.csv\", \n",
  245. " sep=\";\", \n",
  246. " index_col=\"id\")\n",
  247. "# One-hot encode the cholesterol and gluc columns\n",
  248. "chol = pd.get_dummies(df[\"cholesterol\"], prefix=\"chol\")\n",
  249. "gluc = pd.get_dummies(df[\"gluc\"], prefix=\"gluc\")\n",
  250. "df = pd.concat([df, chol, gluc], axis=1)\n",
  251. "\n",
  252. "# Turn gender into a binary feature (1 -> 0, 2 -> 1)\n",
  253. "df[\"gender_bin\"] = df[\"gender\"].map({1: 0, 2: 1})\n",
  254. "df.head()"
  255. ]
  256. },
  257. {
  258. "cell_type": "markdown",
  259. "metadata": {},
  260. "source": [
  261. "## Exercises"
  262. ]
  263. },
  264. {
  265. "cell_type": "markdown",
  266. "metadata": {},
  267. "source": [
  268. "__1. Although sklearn ships an implementation of the k-nearest neighbors method, I suggest you try writing it yourself.__\n",
  269. "\n",
  270. "* __Build a classifier using only pandas, numpy, and scipy. Its hyperparameter must be the number of nearest neighbors. (Optional) You may also add a choice of distance metric and of weights.__\n",
  271. "* __Use cross-validation to find the optimal number of nearest neighbors and (optionally) the feature set.__\n",
  272. "\n",
  273. "How the classifier works:\n",
  274. " 1. For a given example $\\vec{x}$, compute the distance to every example in the training set.\n",
  275. " 2. Sort the examples by their distance to $\\vec{x}$.\n",
  276. " 3. Select the $k$ examples with the smallest distances.\n",
  277. " 4. Hold a vote among the selected examples; the majority class is the prediction."
  278. ]
  279. },
  280. {
  281. "cell_type": "code",
  282. "execution_count": 9,
  283. "metadata": {},
  284. "outputs": [],
  285. "source": [
  286. "# A lot of code here"
  287. ]
  288. },
  289. {
  290. "cell_type": "markdown",
  291. "metadata": {},
  292. "source": [
  293. "**Comments:** Your comments here."
  294. ]
  295. },
  296. {
  297. "cell_type": "markdown",
  298. "metadata": {},
  299. "source": [
  300. "**2. Determine which of the three classifiers (kNN, naive Bayes, decision tree) is best on each metric separately: accuracy, F1 score, ROC AUC, and the loss function. Use the feature set: 'age', 'weight', 'height', 'ap_lo', 'ap_hi'.**\n",
  301. "\n",
  302. "**(Optional) Find the optimal feature set.**"
  303. ]
  304. },
  305. {
  306. "cell_type": "code",
  307. "execution_count": 6,
  308. "metadata": {},
  309. "outputs": [],
  310. "source": [
  311. "# Your code here"
  312. ]
  313. },
  314. {
  315. "cell_type": "markdown",
  316. "metadata": {},
  317. "source": [
  318. "**Comments:** Your comments here."
  319. ]
  320. }
  321. ],
  322. "metadata": {
  323. "kernelspec": {
  324. "display_name": "Python 3",
  325. "language": "python",
  326. "name": "python3"
  327. },
  328. "language_info": {
  329. "codemirror_mode": {
  330. "name": "ipython",
  331. "version": 3
  332. },
  333. "file_extension": ".py",
  334. "mimetype": "text/x-python",
  335. "name": "python",
  336. "nbconvert_exporter": "python",
  337. "pygments_lexer": "ipython3",
  338. "version": "3.5.2"
  339. }
  340. },
  341. "nbformat": 4,
  342. "nbformat_minor": 2
  343. }
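
The four steps listed under exercise 1 (distances, sort, take the k nearest, vote) can be sketched with numpy alone, roughly as follows. This is only one possible starting point, not the intended solution: the class name `KNNClassifier`, the helper `cross_val_accuracy`, the Euclidean metric, the unweighted vote, and the toy data are all assumptions made for illustration. Ties in the vote go to the smallest label, which is an arbitrary choice.

```python
import numpy as np


class KNNClassifier:
    """Minimal k-nearest-neighbors classifier (hypothetical sketch)."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # kNN is a lazy learner: fitting only stores the training set.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        predictions = []
        for x in np.asarray(X, dtype=float):
            # Step 1: Euclidean distance from x to every training example.
            distances = np.sqrt(((self.X_train - x) ** 2).sum(axis=1))
            # Steps 2-3: sort by distance and keep the k closest examples.
            nearest = np.argsort(distances, kind="stable")[: self.n_neighbors]
            # Step 4: unweighted majority vote among the selected labels.
            labels, counts = np.unique(self.y_train[nearest], return_counts=True)
            predictions.append(labels[np.argmax(counts)])
        return np.array(predictions)


def cross_val_accuracy(X, y, n_neighbors, n_folds=3, seed=0):
    """Mean accuracy over a shuffled k-fold split (assumed helper)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, evaluate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = KNNClassifier(n_neighbors).fit(X[train_idx], y[train_idx])
        scores.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return float(np.mean(scores))


# Toy data (not from the course file): two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = KNNClassifier(n_neighbors=3).fit(X_train, y_train)
predicted = model.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
cv_score = cross_val_accuracy(X_train, y_train, n_neighbors=1)
```

On the real data one would probably pass `df[features].values` and `df["cardio"].values`, and compare `cross_val_accuracy` over a range of `n_neighbors`. Because columns such as `age` (in days) and `ap_hi` live on very different scales, standardizing the features before computing distances is likely to matter. The per-row Python loop is also slow for 70,000 patients, so a vectorized distance computation (for example with scipy) may be worth considering.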