Advertisement
Guest User

Untitled

a guest
Apr 22nd, 2019
58
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 17.74 KB | None | 0 0
  1. {
  2. "cells": [
  3. {
  4. "cell_type": "code",
  5. "execution_count": 1,
  6. "metadata": {},
  7. "outputs": [],
  8. "source": [
  9. "# Python libraries\n",
  10. "import pandas as pd \n",
  11. "import seaborn as sns\n",
  12. "import matplotlib.pyplot as plt\n",
  13. "import plotly\n",
  14. "import plotly.graph_objs as go\n",
  15. "import imblearn.under_sampling as imb # to downsample the datasaet \n",
  16. "from sklearn.model_selection import train_test_split # to create train and test datasets"
  17. ]
  18. },
  19. {
  20. "cell_type": "code",
  21. "execution_count": 2,
  22. "metadata": {},
  23. "outputs": [],
  24. "source": [
  25. "data = pd.read_csv('2016_cens_locals_plantabaixa.csv', header = 0, index_col = None, sep = ',', low_memory=False)"
  26. ]
  27. },
  28. {
  29. "cell_type": "markdown",
  30. "metadata": {},
  31. "source": [
  32. "# Feature Engineering & Data Visualization"
  33. ]
  34. },
  35. {
  36. "cell_type": "code",
  37. "execution_count": 3,
  38. "metadata": {},
  39. "outputs": [
  40. {
  41. "name": "stdout",
  42. "output_type": "stream",
  43. "text": [
  44. "<class 'pandas.core.frame.DataFrame'>\n",
  45. "RangeIndex: 78033 entries, 0 to 78032\n",
  46. "Data columns (total 34 columns):\n",
  47. "ID_BCN 78033 non-null int64\n",
  48. "ID_PRINCIP 78033 non-null int64\n",
  49. "N_PRINCIP 78033 non-null object\n",
  50. "ID_SECTOR 78033 non-null int64\n",
  51. "N_SECTOR 78033 non-null object\n",
  52. "ID_GRUPACT 78033 non-null int64\n",
  53. "N_GRUPACT 78033 non-null object\n",
  54. "ID_ACT 78033 non-null int64\n",
  55. "N_ACT 78033 non-null object\n",
  56. "N_LOCAL 78033 non-null object\n",
  57. "SN_CARRER 78033 non-null int64\n",
  58. "SN_MERCAT 78033 non-null int64\n",
  59. "ID_MERCAT 78033 non-null int64\n",
  60. "N_MERCAT 78033 non-null object\n",
  61. "SN_GALERIA 78033 non-null int64\n",
  62. "N_GALERIA 78033 non-null object\n",
  63. "SN_CCOMERC 78033 non-null int64\n",
  64. "ID_CCOMERC 78033 non-null object\n",
  65. "N_CCOMERC 78033 non-null object\n",
  66. "N_CARRER 78033 non-null object\n",
  67. "NUM_POLICI 78017 non-null float64\n",
  68. "REF_CAD 78033 non-null object\n",
  69. "DATA 78033 non-null object\n",
  70. "Codi_Barri 78033 non-null int64\n",
  71. "Nom_Barri 78033 non-null object\n",
  72. "Codi_Districte 78033 non-null int64\n",
  73. "N_DISTRI 78033 non-null object\n",
  74. "N_EIX 26218 non-null object\n",
  75. "SN_EIX 78033 non-null int64\n",
  76. "SEC_CENS 78033 non-null int64\n",
  77. "Y_UTM_ETRS 78033 non-null float64\n",
  78. "X_UTM_ETRS 78033 non-null float64\n",
  79. "LATITUD 78033 non-null float64\n",
  80. "LONGITUD 78033 non-null float64\n",
  81. "dtypes: float64(5), int64(14), object(15)\n",
  82. "memory usage: 20.2+ MB\n"
  83. ]
  84. }
  85. ],
  86. "source": [
  87. "data.info()"
  88. ]
  89. },
  90. {
  91. "cell_type": "markdown",
  92. "metadata": {},
  93. "source": [
  94. "We want to predict 'ID_PRINCIP' (1: is active, 2: no information available, 0: is not active) based on the actual information available. "
  95. ]
  96. },
  97. {
  98. "cell_type": "code",
  99. "execution_count": 4,
  100. "metadata": {},
  101. "outputs": [
  102. {
  103. "data": {
  104. "text/plain": [
  105. "Actiu 60265\n",
  106. "Sense informació 13023\n",
  107. "Sense activitat 4745\n",
  108. "Name: N_PRINCIP, dtype: int64"
  109. ]
  110. },
  111. "execution_count": 4,
  112. "metadata": {},
  113. "output_type": "execute_result"
  114. }
  115. ],
  116. "source": [
  117. "data['N_PRINCIP'].value_counts()"
  118. ]
  119. },
  120. {
  121. "cell_type": "markdown",
  122. "metadata": {},
  123. "source": [
  124. "We removed the observationes labeled as 'Sense informació' (without information)"
  125. ]
  126. },
  127. {
  128. "cell_type": "code",
  129. "execution_count": 5,
  130. "metadata": {},
  131. "outputs": [],
  132. "source": [
  133. "data = data[data['N_PRINCIP'] != 'Sense informació']"
  134. ]
  135. },
  136. {
  137. "cell_type": "code",
  138. "execution_count": 6,
  139. "metadata": {},
  140. "outputs": [],
  141. "source": [
  142. "plot_data = data.copy()"
  143. ]
  144. },
  145. {
  146. "cell_type": "markdown",
  147. "metadata": {},
  148. "source": [
  149. "We decided to drop the 'SN_EIX', 'SN_CCOMERC', 'SN_GALERIA', 'SN_CARRER' and 'SN_MERCAT' features, because these column are a binary feature that indicate when N_EIX, ID_CCOMERC, N_GALERIA, N_CARRER and N_MERCAT respectively are NaN or not. \n",
  150. "\n",
  151. "We drop 'ID_BCN' because is unique for every observation, it doesn't give us relevant information.\n",
  152. "\n",
  153. "Y_UTM_ETRS and X_UTM_ETRS has the same information as LATITUD and LONGITUD, we keep only the last two.\n",
  154. "\n",
  155. "For the following features we decide to keep the ID version (numerical map of the categorical feature):\n",
  156. "- N_PRINCIP\n",
  157. "- N_SECTOR\n",
  158. "- N_GRUPACT\n",
  159. "- N_ACT\n",
  160. "- N_DISTRI\n",
  161. "- Nom_Barri\n",
  162. "- N_MERCAT\n",
  163. "- ID_CCOMERC\n",
  164. "\n",
  165. "'N_GALERIA' is empty, we dropped this feature too."
  166. ]
  167. },
  168. {
  169. "cell_type": "code",
  170. "execution_count": 7,
  171. "metadata": {},
  172. "outputs": [],
  173. "source": [
  174. "data = data.drop(['ID_BCN', 'SN_EIX', 'SN_GALERIA', 'SN_CCOMERC', 'SN_CARRER', 'SN_MERCAT', 'N_PRINCIP', 'N_SECTOR', 'N_GRUPACT', 'N_ACT', 'N_DISTRI', 'Nom_Barri', 'N_MERCAT', \n",
  175. " 'ID_CCOMERC', 'Y_UTM_ETRS', 'X_UTM_ETRS', 'N_GALERIA'], axis = 1)"
  176. ]
  177. },
  178. {
  179. "cell_type": "markdown",
  180. "metadata": {},
  181. "source": [
  182. "The initial dataset has 78033 rows and 34 columns (features). "
  183. ]
  184. },
  185. {
  186. "cell_type": "markdown",
  187. "metadata": {},
  188. "source": [
  189. "We transform the 'DATA' feature from string to datetime. Therefore, we keep the year and month values."
  190. ]
  191. },
  192. {
  193. "cell_type": "code",
  194. "execution_count": 8,
  195. "metadata": {},
  196. "outputs": [],
  197. "source": [
  198. "data['DATA'] = pd.to_datetime(data['DATA'])"
  199. ]
  200. },
  201. {
  202. "cell_type": "code",
  203. "execution_count": 9,
  204. "metadata": {},
  205. "outputs": [],
  206. "source": [
  207. "data['YEAR'] = data['DATA'].dt.year \n",
  208. "data = data.drop('DATA', axis = 1)"
  209. ]
  210. },
  211. {
  212. "cell_type": "code",
  213. "execution_count": 10,
  214. "metadata": {},
  215. "outputs": [
  216. {
  217. "data": {
  218. "text/plain": [
  219. "2016 64740\n",
  220. "2017 257\n",
  221. "1899 12\n",
  222. "2001 1\n",
  223. "Name: YEAR, dtype: int64"
  224. ]
  225. },
  226. "execution_count": 10,
  227. "metadata": {},
  228. "output_type": "execute_result"
  229. }
  230. ],
  231. "source": [
  232. "data['YEAR'].value_counts()"
  233. ]
  234. },
  235. {
  236. "cell_type": "markdown",
  237. "metadata": {},
  238. "source": [
  239. "The date isn't relevant, because is not the date of when the local starts is just when was checked the activity and it has some erros (Hotel W Barcelona, Desigual and World Trade Center appears checked on 1899, when the 3 of them were inagurated many decades after). We drop the DATA feature from the dataset."
  240. ]
  241. },
  242. {
  243. "cell_type": "code",
  244. "execution_count": 11,
  245. "metadata": {},
  246. "outputs": [],
  247. "source": [
  248. "data = data.drop('YEAR', axis = 1)"
  249. ]
  250. },
  251. {
  252. "cell_type": "markdown",
  253. "metadata": {},
  254. "source": [
  255. "After we remove the features with repeated or useless information we end up with 16 features, including the response."
  256. ]
  257. },
  258. {
  259. "cell_type": "code",
  260. "execution_count": 12,
  261. "metadata": {},
  262. "outputs": [
  263. {
  264. "data": {
  265. "text/html": [
  266. "<div>\n",
  267. "<style scoped>\n",
  268. " .dataframe tbody tr th:only-of-type {\n",
  269. " vertical-align: middle;\n",
  270. " }\n",
  271. "\n",
  272. " .dataframe tbody tr th {\n",
  273. " vertical-align: top;\n",
  274. " }\n",
  275. "\n",
  276. " .dataframe thead th {\n",
  277. " text-align: right;\n",
  278. " }\n",
  279. "</style>\n",
  280. "<table border=\"1\" class=\"dataframe\">\n",
  281. " <thead>\n",
  282. " <tr style=\"text-align: right;\">\n",
  283. " <th></th>\n",
  284. " <th>ID_PRINCIP</th>\n",
  285. " <th>ID_SECTOR</th>\n",
  286. " <th>ID_GRUPACT</th>\n",
  287. " <th>ID_ACT</th>\n",
  288. " <th>N_LOCAL</th>\n",
  289. " <th>ID_MERCAT</th>\n",
  290. " <th>N_CCOMERC</th>\n",
  291. " <th>N_CARRER</th>\n",
  292. " <th>NUM_POLICI</th>\n",
  293. " <th>REF_CAD</th>\n",
  294. " <th>Codi_Barri</th>\n",
  295. " <th>Codi_Districte</th>\n",
  296. " <th>N_EIX</th>\n",
  297. " <th>SEC_CENS</th>\n",
  298. " <th>LATITUD</th>\n",
  299. " <th>LONGITUD</th>\n",
  300. " </tr>\n",
  301. " </thead>\n",
  302. " <tbody>\n",
  303. " <tr>\n",
  304. " <th>0</th>\n",
  305. " <td>1</td>\n",
  306. " <td>1</td>\n",
  307. " <td>4</td>\n",
  308. " <td>4003</td>\n",
  309. " <td>INSIDE</td>\n",
  310. " <td>0</td>\n",
  311. " <td>Maremàgnum</td>\n",
  312. " <td>MOLL ESPANYA</td>\n",
  313. " <td>5.0</td>\n",
  314. " <td>1813205DF3811D</td>\n",
  315. " <td>2</td>\n",
  316. " <td>1</td>\n",
  317. " <td>NaN</td>\n",
  318. " <td>1030</td>\n",
  319. " <td>41.375377</td>\n",
  320. " <td>2.182944</td>\n",
  321. " </tr>\n",
  322. " <tr>\n",
  323. " <th>1</th>\n",
  324. " <td>1</td>\n",
  325. " <td>2</td>\n",
  326. " <td>14</td>\n",
  327. " <td>14000</td>\n",
  328. " <td>KURTZ & GUT</td>\n",
  329. " <td>0</td>\n",
  330. " <td>Maremàgnum</td>\n",
  331. " <td>MOLL ESPANYA</td>\n",
  332. " <td>5.0</td>\n",
  333. " <td>1813205DF3811D</td>\n",
  334. " <td>2</td>\n",
  335. " <td>1</td>\n",
  336. " <td>NaN</td>\n",
  337. " <td>1030</td>\n",
  338. " <td>41.375390</td>\n",
  339. " <td>2.183092</td>\n",
  340. " </tr>\n",
  341. " <tr>\n",
  342. " <th>2</th>\n",
  343. " <td>1</td>\n",
  344. " <td>1</td>\n",
  345. " <td>4</td>\n",
  346. " <td>4003</td>\n",
  347. " <td>SPRINGFIELD</td>\n",
  348. " <td>0</td>\n",
  349. " <td>Maremàgnum</td>\n",
  350. " <td>MOLL ESPANYA</td>\n",
  351. " <td>5.0</td>\n",
  352. " <td>1813205DF3811D</td>\n",
  353. " <td>2</td>\n",
  354. " <td>1</td>\n",
  355. " <td>NaN</td>\n",
  356. " <td>1030</td>\n",
  357. " <td>41.375404</td>\n",
  358. " <td>2.183217</td>\n",
  359. " </tr>\n",
  360. " <tr>\n",
  361. " <th>3</th>\n",
  362. " <td>1</td>\n",
  363. " <td>1</td>\n",
  364. " <td>4</td>\n",
  365. " <td>4000</td>\n",
  366. " <td>CASAS KIDS</td>\n",
  367. " <td>0</td>\n",
  368. " <td>Maremàgnum</td>\n",
  369. " <td>MOLL ESPANYA</td>\n",
  370. " <td>5.0</td>\n",
  371. " <td>1813205DF3811D</td>\n",
  372. " <td>2</td>\n",
  373. " <td>1</td>\n",
  374. " <td>NaN</td>\n",
  375. " <td>1030</td>\n",
  376. " <td>41.375420</td>\n",
  377. " <td>2.183354</td>\n",
  378. " </tr>\n",
  379. " <tr>\n",
  380. " <th>4</th>\n",
  381. " <td>1</td>\n",
  382. " <td>2</td>\n",
  383. " <td>14</td>\n",
  384. " <td>14000</td>\n",
  385. " <td>BARITIMO LOUGE CLUB</td>\n",
  386. " <td>0</td>\n",
  387. " <td>Maremàgnum</td>\n",
  388. " <td>MOLL ESPANYA</td>\n",
  389. " <td>5.0</td>\n",
  390. " <td>1813205DF3811D</td>\n",
  391. " <td>2</td>\n",
  392. " <td>1</td>\n",
  393. " <td>NaN</td>\n",
  394. " <td>1030</td>\n",
  395. " <td>41.375425</td>\n",
  396. " <td>2.183517</td>\n",
  397. " </tr>\n",
  398. " </tbody>\n",
  399. "</table>\n",
  400. "</div>"
  401. ],
  402. "text/plain": [
  403. " ID_PRINCIP ID_SECTOR ID_GRUPACT ID_ACT N_LOCAL ID_MERCAT \\\n",
  404. "0 1 1 4 4003 INSIDE 0 \n",
  405. "1 1 2 14 14000 KURTZ & GUT 0 \n",
  406. "2 1 1 4 4003 SPRINGFIELD 0 \n",
  407. "3 1 1 4 4000 CASAS KIDS 0 \n",
  408. "4 1 2 14 14000 BARITIMO LOUGE CLUB 0 \n",
  409. "\n",
  410. " N_CCOMERC N_CARRER NUM_POLICI REF_CAD Codi_Barri \\\n",
  411. "0 Maremàgnum MOLL ESPANYA 5.0 1813205DF3811D 2 \n",
  412. "1 Maremàgnum MOLL ESPANYA 5.0 1813205DF3811D 2 \n",
  413. "2 Maremàgnum MOLL ESPANYA 5.0 1813205DF3811D 2 \n",
  414. "3 Maremàgnum MOLL ESPANYA 5.0 1813205DF3811D 2 \n",
  415. "4 Maremàgnum MOLL ESPANYA 5.0 1813205DF3811D 2 \n",
  416. "\n",
  417. " Codi_Districte N_EIX SEC_CENS LATITUD LONGITUD \n",
  418. "0 1 NaN 1030 41.375377 2.182944 \n",
  419. "1 1 NaN 1030 41.375390 2.183092 \n",
  420. "2 1 NaN 1030 41.375404 2.183217 \n",
  421. "3 1 NaN 1030 41.375420 2.183354 \n",
  422. "4 1 NaN 1030 41.375425 2.183517 "
  423. ]
  424. },
  425. "execution_count": 12,
  426. "metadata": {},
  427. "output_type": "execute_result"
  428. }
  429. ],
  430. "source": [
  431. "data.head()"
  432. ]
  433. },
  434. {
  435. "cell_type": "code",
  436. "execution_count": 13,
  437. "metadata": {},
  438. "outputs": [
  439. {
  440. "name": "stdout",
  441. "output_type": "stream",
  442. "text": [
  443. "<class 'pandas.core.frame.DataFrame'>\n",
  444. "Int64Index: 65010 entries, 0 to 78031\n",
  445. "Data columns (total 16 columns):\n",
  446. "ID_PRINCIP 65010 non-null int64\n",
  447. "ID_SECTOR 65010 non-null int64\n",
  448. "ID_GRUPACT 65010 non-null int64\n",
  449. "ID_ACT 65010 non-null int64\n",
  450. "N_LOCAL 65010 non-null object\n",
  451. "ID_MERCAT 65010 non-null int64\n",
  452. "N_CCOMERC 65010 non-null object\n",
  453. "N_CARRER 65010 non-null object\n",
  454. "NUM_POLICI 64996 non-null float64\n",
  455. "REF_CAD 65010 non-null object\n",
  456. "Codi_Barri 65010 non-null int64\n",
  457. "Codi_Districte 65010 non-null int64\n",
  458. "N_EIX 22844 non-null object\n",
  459. "SEC_CENS 65010 non-null int64\n",
  460. "LATITUD 65010 non-null float64\n",
  461. "LONGITUD 65010 non-null float64\n",
  462. "dtypes: float64(3), int64(8), object(5)\n",
  463. "memory usage: 8.4+ MB\n"
  464. ]
  465. }
  466. ],
  467. "source": [
  468. "data.info()"
  469. ]
  470. },
  471. {
  472. "cell_type": "code",
  473. "execution_count": 14,
  474. "metadata": {},
  475. "outputs": [
  476. {
  477. "data": {
  478. "text/plain": [
  479. "1 60265\n",
  480. "0 4745\n",
  481. "Name: ID_PRINCIP, dtype: int64"
  482. ]
  483. },
  484. "execution_count": 14,
  485. "metadata": {},
  486. "output_type": "execute_result"
  487. }
  488. ],
  489. "source": [
  490. "data['ID_PRINCIP'].value_counts()"
  491. ]
  492. },
  493. {
  494. "cell_type": "code",
  495. "execution_count": 15,
  496. "metadata": {},
  497. "outputs": [],
  498. "source": [
  499. "data[['ID_PRINCIP','N_LOCAL', 'N_CCOMERC', 'N_CARRER', 'REF_CAD','N_EIX']] = data[['ID_PRINCIP','N_LOCAL', 'N_CCOMERC', 'N_CARRER', 'REF_CAD','N_EIX']].astype('category')"
  500. ]
  501. },
  502. {
  503. "cell_type": "code",
  504. "execution_count": 16,
  505. "metadata": {},
  506. "outputs": [],
  507. "source": [
  508. "for col in ['ID_PRINCIP','N_LOCAL', 'N_CCOMERC', 'N_CARRER', 'REF_CAD','N_EIX']:\n",
  509. " data[col] = data[col].cat.codes"
  510. ]
  511. },
  512. {
  513. "cell_type": "code",
  514. "execution_count": 17,
  515. "metadata": {},
  516. "outputs": [
  517. {
  518. "name": "stdout",
  519. "output_type": "stream",
  520. "text": [
  521. "<class 'pandas.core.frame.DataFrame'>\n",
  522. "Int64Index: 65010 entries, 0 to 78031\n",
  523. "Data columns (total 16 columns):\n",
  524. "ID_PRINCIP 65010 non-null int8\n",
  525. "ID_SECTOR 65010 non-null int64\n",
  526. "ID_GRUPACT 65010 non-null int64\n",
  527. "ID_ACT 65010 non-null int64\n",
  528. "N_LOCAL 65010 non-null int32\n",
  529. "ID_MERCAT 65010 non-null int64\n",
  530. "N_CCOMERC 65010 non-null int8\n",
  531. "N_CARRER 65010 non-null int16\n",
  532. "NUM_POLICI 64996 non-null float64\n",
  533. "REF_CAD 65010 non-null int32\n",
  534. "Codi_Barri 65010 non-null int64\n",
  535. "Codi_Districte 65010 non-null int64\n",
  536. "N_EIX 65010 non-null int8\n",
  537. "SEC_CENS 65010 non-null int64\n",
  538. "LATITUD 65010 non-null float64\n",
  539. "LONGITUD 65010 non-null float64\n",
  540. "dtypes: float64(3), int16(1), int32(2), int64(7), int8(3)\n",
  541. "memory usage: 6.3 MB\n"
  542. ]
  543. }
  544. ],
  545. "source": [
  546. "data.info()"
  547. ]
  548. }
  549. ],
  550. "metadata": {
  551. "kernelspec": {
  552. "display_name": "Python 3",
  553. "language": "python",
  554. "name": "python3"
  555. },
  556. "language_info": {
  557. "codemirror_mode": {
  558. "name": "ipython",
  559. "version": 3
  560. },
  561. "file_extension": ".py",
  562. "mimetype": "text/x-python",
  563. "name": "python",
  564. "nbconvert_exporter": "python",
  565. "pygments_lexer": "ipython3",
  566. "version": "3.6.8"
  567. },
  568. "varInspector": {
  569. "cols": {
  570. "lenName": 16,
  571. "lenType": 16,
  572. "lenVar": 40
  573. },
  574. "kernels_config": {
  575. "python": {
  576. "delete_cmd_postfix": "",
  577. "delete_cmd_prefix": "del ",
  578. "library": "var_list.py",
  579. "varRefreshCmd": "print(var_dic_list())"
  580. },
  581. "r": {
  582. "delete_cmd_postfix": ") ",
  583. "delete_cmd_prefix": "rm(",
  584. "library": "var_list.r",
  585. "varRefreshCmd": "cat(var_dic_list()) "
  586. }
  587. },
  588. "position": {
  589. "height": "647.7310180664062px",
  590. "left": "1298.6956787109375px",
  591. "right": "20px",
  592. "top": "120px",
  593. "width": "350px"
  594. },
  595. "types_to_exclude": [
  596. "module",
  597. "function",
  598. "builtin_function_or_method",
  599. "instance",
  600. "_Feature"
  601. ],
  602. "window_display": false
  603. }
  604. },
  605. "nbformat": 4,
  606. "nbformat_minor": 2
  607. }
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement