Untitled

a guest
Oct 10th, 2017
  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "metadata": {},
  6. "source": [
  7. "This notebook reviews the most important Pandas functions for use in machine learning applications. It is based on the Udemy course \"The Numpy Stack in Python\", which is free and a good way to get started with the most common machine learning libraries (NumPy, Pandas, Matplotlib, and SciPy)."
  8. ]
  9. },
  10. {
  11. "cell_type": "code",
  12. "execution_count": 95,
  13. "metadata": {},
  14. "outputs": [],
  15. "source": [
  16. "import matplotlib.pyplot as plt\n",
  17. "import numpy as np\n",
  18. "import os\n",
  19. "import pandas as pd\n",
  20. "from datetime import datetime"
  21. ]
  22. },
  23. {
  24. "cell_type": "markdown",
  25. "metadata": {},
  26. "source": [
  27. "We change to the folder that contains the data we want to load."
  28. ]
  29. },
  30. {
  31. "cell_type": "code",
  32. "execution_count": 96,
  33. "metadata": {
  34. "scrolled": true
  35. },
  36. "outputs": [
  37. {
  38. "name": "stdout",
  39. "output_type": "stream",
  40. "text": [
  41. "/home/david/ProyectosTF/machine_learning_examples/linear_regression_class\n"
  42. ]
  43. }
  44. ],
  45. "source": [
  46. "os.chdir('/home/david/ProyectosTF/machine_learning_examples/linear_regression_class/')\n",
  47. "ruta_actual = os.getcwd()\n",
  48. "print(ruta_actual)"
  49. ]
  50. },
  51. {
  52. "cell_type": "markdown",
  53. "metadata": {},
  54. "source": [
  55. "We load our dataset into a Pandas DataFrame. We pass header=None because this file has no header row; by default Pandas treats the first row as a header containing the column names."
  56. ]
  57. },
  58. {
  59. "cell_type": "code",
  60. "execution_count": 99,
  61. "metadata": {},
  62. "outputs": [],
  63. "source": [
  64. "X = pd.read_csv('data_2d.csv', header=None)"
  65. ]
  66. },
  67. {
  68. "cell_type": "markdown",
  69. "metadata": {},
  70. "source": [
  71. "Pandas DataFrames have an info() method that reports the number of rows and the name and data type of each column."
  72. ]
  73. },
  74. {
  75. "cell_type": "code",
  76. "execution_count": 100,
  77. "metadata": {},
  78. "outputs": [
  79. {
  80. "name": "stdout",
  81. "output_type": "stream",
  82. "text": [
  83. "<class 'pandas.core.frame.DataFrame'>\n",
  84. "RangeIndex: 100 entries, 0 to 99\n",
  85. "Data columns (total 3 columns):\n",
  86. "0 100 non-null float64\n",
  87. "1 100 non-null float64\n",
  88. "2 100 non-null float64\n",
  89. "dtypes: float64(3)\n",
  90. "memory usage: 2.4 KB\n"
  91. ]
  92. }
  93. ],
  94. "source": [
  95. "X.info()"
  96. ]
  97. },
  98. {
  99. "cell_type": "markdown",
  100. "metadata": {},
  101. "source": [
  102. "The head() method shows the first rows of the dataset (five by default). To see a specific number of rows, pass that number as an argument."
  103. ]
  104. },
  105. {
  106. "cell_type": "code",
  107. "execution_count": 101,
  108. "metadata": {},
  109. "outputs": [
  110. {
  111. "data": {
  112. "text/html": [
  113. "<div>\n",
  114. "<style>\n",
  115. " .dataframe thead tr:only-child th {\n",
  116. " text-align: right;\n",
  117. " }\n",
  118. "\n",
  119. " .dataframe thead th {\n",
  120. " text-align: left;\n",
  121. " }\n",
  122. "\n",
  123. " .dataframe tbody tr th {\n",
  124. " vertical-align: top;\n",
  125. " }\n",
  126. "</style>\n",
  127. "<table border=\"1\" class=\"dataframe\">\n",
  128. " <thead>\n",
  129. " <tr style=\"text-align: right;\">\n",
  130. " <th></th>\n",
  131. " <th>0</th>\n",
  132. " <th>1</th>\n",
  133. " <th>2</th>\n",
  134. " </tr>\n",
  135. " </thead>\n",
  136. " <tbody>\n",
  137. " <tr>\n",
  138. " <th>0</th>\n",
  139. " <td>17.930201</td>\n",
  140. " <td>94.520592</td>\n",
  141. " <td>320.259530</td>\n",
  142. " </tr>\n",
  143. " <tr>\n",
  144. " <th>1</th>\n",
  145. " <td>97.144697</td>\n",
  146. " <td>69.593282</td>\n",
  147. " <td>404.634472</td>\n",
  148. " </tr>\n",
  149. " <tr>\n",
  150. " <th>2</th>\n",
  151. " <td>81.775901</td>\n",
  152. " <td>5.737648</td>\n",
  153. " <td>181.485108</td>\n",
  154. " </tr>\n",
  155. " <tr>\n",
  156. " <th>3</th>\n",
  157. " <td>55.854342</td>\n",
  158. " <td>70.325902</td>\n",
  159. " <td>321.773638</td>\n",
  160. " </tr>\n",
  161. " <tr>\n",
  162. " <th>4</th>\n",
  163. " <td>49.366550</td>\n",
  164. " <td>75.114040</td>\n",
  165. " <td>322.465486</td>\n",
  166. " </tr>\n",
  167. " <tr>\n",
  168. " <th>5</th>\n",
  169. " <td>3.192702</td>\n",
  170. " <td>29.256299</td>\n",
  171. " <td>94.618811</td>\n",
  172. " </tr>\n",
  173. " <tr>\n",
  174. " <th>6</th>\n",
  175. " <td>49.200784</td>\n",
  176. " <td>86.144439</td>\n",
  177. " <td>356.348093</td>\n",
  178. " </tr>\n",
  179. " <tr>\n",
  180. " <th>7</th>\n",
  181. " <td>21.882804</td>\n",
  182. " <td>46.841505</td>\n",
  183. " <td>181.653769</td>\n",
  184. " </tr>\n",
  185. " <tr>\n",
  186. " <th>8</th>\n",
  187. " <td>79.509863</td>\n",
  188. " <td>87.397356</td>\n",
  189. " <td>423.557743</td>\n",
  190. " </tr>\n",
  191. " <tr>\n",
  192. " <th>9</th>\n",
  193. " <td>88.153887</td>\n",
  194. " <td>65.205642</td>\n",
  195. " <td>369.229245</td>\n",
  196. " </tr>\n",
  197. " </tbody>\n",
  198. "</table>\n",
  199. "</div>"
  200. ],
  201. "text/plain": [
  202. " 0 1 2\n",
  203. "0 17.930201 94.520592 320.259530\n",
  204. "1 97.144697 69.593282 404.634472\n",
  205. "2 81.775901 5.737648 181.485108\n",
  206. "3 55.854342 70.325902 321.773638\n",
  207. "4 49.366550 75.114040 322.465486\n",
  208. "5 3.192702 29.256299 94.618811\n",
  209. "6 49.200784 86.144439 356.348093\n",
  210. "7 21.882804 46.841505 181.653769\n",
  211. "8 79.509863 87.397356 423.557743\n",
  212. "9 88.153887 65.205642 369.229245"
  213. ]
  214. },
  215. "execution_count": 101,
  216. "metadata": {},
  217. "output_type": "execute_result"
  218. }
  219. ],
  220. "source": [
  221. "X.head(10)"
  222. ]
  223. },
  224. {
  225. "cell_type": "markdown",
  226. "metadata": {},
  227. "source": [
  228. "DataFrames do not behave like two-dimensional NumPy arrays, but we can convert one to an array as follows:"
  229. ]
  230. },
  231. {
  232. "cell_type": "code",
  233. "execution_count": 102,
  234. "metadata": {},
  235. "outputs": [
  236. {
  237. "data": {
  238. "text/plain": [
  239. "numpy.ndarray"
  240. ]
  241. },
  242. "execution_count": 102,
  243. "metadata": {},
  244. "output_type": "execute_result"
  245. }
  246. ],
  247. "source": [
  248. "M = X.values\n",
  249. "type(M)"
  250. ]
  251. },
  252. {
  253. "cell_type": "markdown",
  254. "metadata": {},
  255. "source": [
  256. "To access a column of the DataFrame we pass its name in brackets. Here we display the first column, which is named 0. This can be counterintuitive: in NumPy the first index refers to the row, whereas in Pandas it is the name of a column, and bracket indexing does not accept a (row, column) pair of indices.\n",
  257. "\n",
  258. "numpy: X[0] # Row number 0 \n",
  259. "pandas: X[0] # Column whose name is 0"
  260. ]
  261. },
  262. {
  263. "cell_type": "code",
  264. "execution_count": 103,
  265. "metadata": {},
  266. "outputs": [
  267. {
  268. "data": {
  269. "text/plain": [
  270. "0 17.930201\n",
  271. "1 97.144697\n",
  272. "2 81.775901\n",
  273. "3 55.854342\n",
  274. "4 49.366550\n",
  275. "5 3.192702\n",
  276. "6 49.200784\n",
  277. "7 21.882804\n",
  278. "8 79.509863\n",
  279. "9 88.153887\n",
  280. "10 60.743854\n",
  281. "11 67.415582\n",
  282. "12 48.318116\n",
  283. "13 28.829972\n",
  284. "14 43.853743\n",
  285. "15 25.313694\n",
  286. "16 10.807727\n",
  287. "17 98.365746\n",
  288. "18 29.146910\n",
  289. "19 65.100302\n",
  290. "20 24.644113\n",
  291. "21 37.559805\n",
  292. "22 88.164506\n",
  293. "23 13.834621\n",
  294. "24 64.410844\n",
  295. "25 68.925992\n",
  296. "26 39.488442\n",
  297. "27 52.463178\n",
  298. "28 48.484787\n",
  299. "29 8.062088\n",
  300. " ... \n",
  301. "70 30.187692\n",
  302. "71 11.788418\n",
  303. "72 18.292424\n",
  304. "73 96.712668\n",
  305. "74 31.012739\n",
  306. "75 11.397261\n",
  307. "76 17.392556\n",
  308. "77 72.182694\n",
  309. "78 73.980021\n",
  310. "79 94.493058\n",
  311. "80 84.562821\n",
  312. "81 51.742474\n",
  313. "82 53.748590\n",
  314. "83 85.050835\n",
  315. "84 46.777250\n",
  316. "85 49.758434\n",
  317. "86 24.119257\n",
  318. "87 27.201576\n",
  319. "88 7.009596\n",
  320. "89 97.646950\n",
  321. "90 1.382983\n",
  322. "91 22.323530\n",
  323. "92 45.045406\n",
  324. "93 40.163991\n",
  325. "94 53.182740\n",
  326. "95 46.456779\n",
  327. "96 77.130301\n",
  328. "97 68.600608\n",
  329. "98 41.693887\n",
  330. "99 4.142669\n",
  331. "Name: 0, Length: 100, dtype: float64"
  332. ]
  333. },
  334. "execution_count": 103,
  335. "metadata": {},
  336. "output_type": "execute_result"
  337. }
  338. ],
  339. "source": [
  340. "X[0]"
  341. ]
  342. },
  343. {
  344. "cell_type": "markdown",
  345. "metadata": {},
  346. "source": [
  347. "DataFrames are made up of one-dimensional objects called Series."
  348. ]
  349. },
  350. {
  351. "cell_type": "code",
  352. "execution_count": 104,
  353. "metadata": {},
  354. "outputs": [
  355. {
  356. "data": {
  357. "text/plain": [
  358. "pandas.core.series.Series"
  359. ]
  360. },
  361. "execution_count": 104,
  362. "metadata": {},
  363. "output_type": "execute_result"
  364. }
  365. ],
  366. "source": [
  367. "type(X[0])"
  368. ]
  369. },
  370. {
  371. "cell_type": "markdown",
  372. "metadata": {},
  373. "source": [
  374. "We have seen how to get a column from a DataFrame. To get a row by position we use the iloc indexer (with square brackets, not parentheses), which also returns a Series."
  375. ]
  376. },
  377. {
  378. "cell_type": "code",
  379. "execution_count": 105,
  380. "metadata": {},
  381. "outputs": [
  382. {
  383. "data": {
  384. "text/plain": [
  385. "0 17.930201\n",
  386. "1 94.520592\n",
  387. "2 320.259530\n",
  388. "Name: 0, dtype: float64"
  389. ]
  390. },
  391. "execution_count": 105,
  392. "metadata": {},
  393. "output_type": "execute_result"
  394. }
  395. ],
  396. "source": [
  397. "X.iloc[0]"
  398. ]
  399. },
  400. {
  401. "cell_type": "markdown",
  402. "metadata": {},
  403. "source": [
  404. "To select more than one column, we pass a list of the column names inside the brackets."
  405. ]
  406. },
  407. {
  408. "cell_type": "code",
  409. "execution_count": 106,
  410. "metadata": {},
  411. "outputs": [
  412. {
  413. "data": {
  414. "text/html": [
  415. "<div>\n",
  416. "<style>\n",
  417. " .dataframe thead tr:only-child th {\n",
  418. " text-align: right;\n",
  419. " }\n",
  420. "\n",
  421. " .dataframe thead th {\n",
  422. " text-align: left;\n",
  423. " }\n",
  424. "\n",
  425. " .dataframe tbody tr th {\n",
  426. " vertical-align: top;\n",
  427. " }\n",
  428. "</style>\n",
  429. "<table border=\"1\" class=\"dataframe\">\n",
  430. " <thead>\n",
  431. " <tr style=\"text-align: right;\">\n",
  432. " <th></th>\n",
  433. " <th>0</th>\n",
  434. " <th>2</th>\n",
  435. " </tr>\n",
  436. " </thead>\n",
  437. " <tbody>\n",
  438. " <tr>\n",
  439. " <th>0</th>\n",
  440. " <td>17.930201</td>\n",
  441. " <td>320.259530</td>\n",
  442. " </tr>\n",
  443. " <tr>\n",
  444. " <th>1</th>\n",
  445. " <td>97.144697</td>\n",
  446. " <td>404.634472</td>\n",
  447. " </tr>\n",
  448. " <tr>\n",
  449. " <th>2</th>\n",
  450. " <td>81.775901</td>\n",
  451. " <td>181.485108</td>\n",
  452. " </tr>\n",
  453. " <tr>\n",
  454. " <th>3</th>\n",
  455. " <td>55.854342</td>\n",
  456. " <td>321.773638</td>\n",
  457. " </tr>\n",
  458. " <tr>\n",
  459. " <th>4</th>\n",
  460. " <td>49.366550</td>\n",
  461. " <td>322.465486</td>\n",
  462. " </tr>\n",
  463. " <tr>\n",
  464. " <th>5</th>\n",
  465. " <td>3.192702</td>\n",
  466. " <td>94.618811</td>\n",
  467. " </tr>\n",
  468. " <tr>\n",
  469. " <th>6</th>\n",
  470. " <td>49.200784</td>\n",
  471. " <td>356.348093</td>\n",
  472. " </tr>\n",
  473. " <tr>\n",
  474. " <th>7</th>\n",
  475. " <td>21.882804</td>\n",
  476. " <td>181.653769</td>\n",
  477. " </tr>\n",
  478. " <tr>\n",
  479. " <th>8</th>\n",
  480. " <td>79.509863</td>\n",
  481. " <td>423.557743</td>\n",
  482. " </tr>\n",
  483. " <tr>\n",
  484. " <th>9</th>\n",
  485. " <td>88.153887</td>\n",
  486. " <td>369.229245</td>\n",
  487. " </tr>\n",
  488. " <tr>\n",
  489. " <th>10</th>\n",
  490. " <td>60.743854</td>\n",
  491. " <td>427.605804</td>\n",
  492. " </tr>\n",
  493. " <tr>\n",
  494. " <th>11</th>\n",
  495. " <td>67.415582</td>\n",
  496. " <td>292.471822</td>\n",
  497. " </tr>\n",
  498. " <tr>\n",
  499. " <th>12</th>\n",
  500. " <td>48.318116</td>\n",
  501. " <td>395.529811</td>\n",
  502. " </tr>\n",
  503. " <tr>\n",
  504. " <th>13</th>\n",
  505. " <td>28.829972</td>\n",
  506. " <td>319.031348</td>\n",
  507. " </tr>\n",
  508. " <tr>\n",
  509. " <th>14</th>\n",
  510. " <td>43.853743</td>\n",
  511. " <td>287.428144</td>\n",
  512. " </tr>\n",
  513. " <tr>\n",
  514. " <th>15</th>\n",
  515. " <td>25.313694</td>\n",
  516. " <td>292.768909</td>\n",
  517. " </tr>\n",
  518. " <tr>\n",
  519. " <th>16</th>\n",
  520. " <td>10.807727</td>\n",
  521. " <td>159.663308</td>\n",
  522. " </tr>\n",
  523. " <tr>\n",
  524. " <th>17</th>\n",
  525. " <td>98.365746</td>\n",
  526. " <td>438.798964</td>\n",
  527. " </tr>\n",
  528. " <tr>\n",
  529. " <th>18</th>\n",
  530. " <td>29.146910</td>\n",
  531. " <td>250.986309</td>\n",
  532. " </tr>\n",
  533. " <tr>\n",
  534. " <th>19</th>\n",
  535. " <td>65.100302</td>\n",
  536. " <td>231.711508</td>\n",
  537. " </tr>\n",
  538. " <tr>\n",
  539. " <th>20</th>\n",
  540. " <td>24.644113</td>\n",
  541. " <td>163.398161</td>\n",
  542. " </tr>\n",
  543. " <tr>\n",
  544. " <th>21</th>\n",
  545. " <td>37.559805</td>\n",
  546. " <td>83.480155</td>\n",
  547. " </tr>\n",
  548. " <tr>\n",
  549. " <th>22</th>\n",
  550. " <td>88.164506</td>\n",
  551. " <td>466.265806</td>\n",
  552. " </tr>\n",
  553. " <tr>\n",
  554. " <th>23</th>\n",
  555. " <td>13.834621</td>\n",
  556. " <td>100.886430</td>\n",
  557. " </tr>\n",
  558. " <tr>\n",
  559. " <th>24</th>\n",
  560. " <td>64.410844</td>\n",
  561. " <td>365.641048</td>\n",
  562. " </tr>\n",
  563. " <tr>\n",
  564. " <th>25</th>\n",
  565. " <td>68.925992</td>\n",
  566. " <td>426.140015</td>\n",
  567. " </tr>\n",
  568. " <tr>\n",
  569. " <th>26</th>\n",
  570. " <td>39.488442</td>\n",
  571. " <td>235.532389</td>\n",
  572. " </tr>\n",
  573. " <tr>\n",
  574. " <th>27</th>\n",
  575. " <td>52.463178</td>\n",
  576. " <td>283.291640</td>\n",
  577. " </tr>\n",
  578. " <tr>\n",
  579. " <th>28</th>\n",
  580. " <td>48.484787</td>\n",
  581. " <td>298.581440</td>\n",
  582. " </tr>\n",
  583. " <tr>\n",
  584. " <th>29</th>\n",
  585. " <td>8.062088</td>\n",
  586. " <td>309.234109</td>\n",
  587. " </tr>\n",
  588. " <tr>\n",
  589. " <th>...</th>\n",
  590. " <td>...</td>\n",
  591. " <td>...</td>\n",
  592. " </tr>\n",
  593. " <tr>\n",
  594. " <th>70</th>\n",
  595. " <td>30.187692</td>\n",
  596. " <td>89.539008</td>\n",
  597. " </tr>\n",
  598. " <tr>\n",
  599. " <th>71</th>\n",
  600. " <td>11.788418</td>\n",
  601. " <td>181.550683</td>\n",
  602. " </tr>\n",
  603. " <tr>\n",
  604. " <th>72</th>\n",
  605. " <td>18.292424</td>\n",
  606. " <td>224.773383</td>\n",
  607. " </tr>\n",
  608. " <tr>\n",
  609. " <th>73</th>\n",
  610. " <td>96.712668</td>\n",
  611. " <td>219.567094</td>\n",
  612. " </tr>\n",
  613. " <tr>\n",
  614. " <th>74</th>\n",
  615. " <td>31.012739</td>\n",
  616. " <td>298.490216</td>\n",
  617. " </tr>\n",
  618. " <tr>\n",
  619. " <th>75</th>\n",
  620. " <td>11.397261</td>\n",
  621. " <td>199.944045</td>\n",
  622. " </tr>\n",
  623. " <tr>\n",
  624. " <th>76</th>\n",
  625. " <td>17.392556</td>\n",
  626. " <td>43.915692</td>\n",
  627. " </tr>\n",
  628. " <tr>\n",
  629. " <th>77</th>\n",
  630. " <td>72.182694</td>\n",
  631. " <td>256.068378</td>\n",
  632. " </tr>\n",
  633. " <tr>\n",
  634. " <th>78</th>\n",
  635. " <td>73.980021</td>\n",
  636. " <td>159.372581</td>\n",
  637. " </tr>\n",
  638. " <tr>\n",
  639. " <th>79</th>\n",
  640. " <td>94.493058</td>\n",
  641. " <td>447.132704</td>\n",
  642. " </tr>\n",
  643. " <tr>\n",
  644. " <th>80</th>\n",
  645. " <td>84.562821</td>\n",
  646. " <td>233.078830</td>\n",
  647. " </tr>\n",
  648. " <tr>\n",
  649. " <th>81</th>\n",
  650. " <td>51.742474</td>\n",
  651. " <td>131.070180</td>\n",
  652. " </tr>\n",
  653. " <tr>\n",
  654. " <th>82</th>\n",
  655. " <td>53.748590</td>\n",
  656. " <td>298.814333</td>\n",
  657. " </tr>\n",
  658. " <tr>\n",
  659. " <th>83</th>\n",
  660. " <td>85.050835</td>\n",
  661. " <td>451.803523</td>\n",
  662. " </tr>\n",
  663. " <tr>\n",
  664. " <th>84</th>\n",
  665. " <td>46.777250</td>\n",
  666. " <td>368.366436</td>\n",
  667. " </tr>\n",
  668. " <tr>\n",
  669. " <th>85</th>\n",
  670. " <td>49.758434</td>\n",
  671. " <td>254.706774</td>\n",
  672. " </tr>\n",
  673. " <tr>\n",
  674. " <th>86</th>\n",
  675. " <td>24.119257</td>\n",
  676. " <td>168.308433</td>\n",
  677. " </tr>\n",
  678. " <tr>\n",
  679. " <th>87</th>\n",
  680. " <td>27.201576</td>\n",
  681. " <td>146.342260</td>\n",
  682. " </tr>\n",
  683. " <tr>\n",
  684. " <th>88</th>\n",
  685. " <td>7.009596</td>\n",
  686. " <td>176.810149</td>\n",
  687. " </tr>\n",
  688. " <tr>\n",
  689. " <th>89</th>\n",
  690. " <td>97.646950</td>\n",
  691. " <td>219.160280</td>\n",
  692. " </tr>\n",
  693. " <tr>\n",
  694. " <th>90</th>\n",
  695. " <td>1.382983</td>\n",
  696. " <td>252.905653</td>\n",
  697. " </tr>\n",
  698. " <tr>\n",
  699. " <th>91</th>\n",
  700. " <td>22.323530</td>\n",
  701. " <td>127.570479</td>\n",
  702. " </tr>\n",
  703. " <tr>\n",
  704. " <th>92</th>\n",
  705. " <td>45.045406</td>\n",
  706. " <td>375.822340</td>\n",
  707. " </tr>\n",
  708. " <tr>\n",
  709. " <th>93</th>\n",
  710. " <td>40.163991</td>\n",
  711. " <td>80.389019</td>\n",
  712. " </tr>\n",
  713. " <tr>\n",
  714. " <th>94</th>\n",
  715. " <td>53.182740</td>\n",
  716. " <td>142.718183</td>\n",
  717. " </tr>\n",
  718. " <tr>\n",
  719. " <th>95</th>\n",
  720. " <td>46.456779</td>\n",
  721. " <td>336.876154</td>\n",
  722. " </tr>\n",
  723. " <tr>\n",
  724. " <th>96</th>\n",
  725. " <td>77.130301</td>\n",
  726. " <td>438.460586</td>\n",
  727. " </tr>\n",
  728. " <tr>\n",
  729. " <th>97</th>\n",
  730. " <td>68.600608</td>\n",
  731. " <td>355.900287</td>\n",
  732. " </tr>\n",
  733. " <tr>\n",
  734. " <th>98</th>\n",
  735. " <td>41.693887</td>\n",
  736. " <td>284.834637</td>\n",
  737. " </tr>\n",
  738. " <tr>\n",
  739. " <th>99</th>\n",
  740. " <td>4.142669</td>\n",
  741. " <td>168.034401</td>\n",
  742. " </tr>\n",
  743. " </tbody>\n",
  744. "</table>\n",
  745. "<p>100 rows × 2 columns</p>\n",
  746. "</div>"
  747. ],
  748. "text/plain": [
  749. " 0 2\n",
  750. "0 17.930201 320.259530\n",
  751. "1 97.144697 404.634472\n",
  752. "2 81.775901 181.485108\n",
  753. "3 55.854342 321.773638\n",
  754. "4 49.366550 322.465486\n",
  755. "5 3.192702 94.618811\n",
  756. "6 49.200784 356.348093\n",
  757. "7 21.882804 181.653769\n",
  758. "8 79.509863 423.557743\n",
  759. "9 88.153887 369.229245\n",
  760. "10 60.743854 427.605804\n",
  761. "11 67.415582 292.471822\n",
  762. "12 48.318116 395.529811\n",
  763. "13 28.829972 319.031348\n",
  764. "14 43.853743 287.428144\n",
  765. "15 25.313694 292.768909\n",
  766. "16 10.807727 159.663308\n",
  767. "17 98.365746 438.798964\n",
  768. "18 29.146910 250.986309\n",
  769. "19 65.100302 231.711508\n",
  770. "20 24.644113 163.398161\n",
  771. "21 37.559805 83.480155\n",
  772. "22 88.164506 466.265806\n",
  773. "23 13.834621 100.886430\n",
  774. "24 64.410844 365.641048\n",
  775. "25 68.925992 426.140015\n",
  776. "26 39.488442 235.532389\n",
  777. "27 52.463178 283.291640\n",
  778. "28 48.484787 298.581440\n",
  779. "29 8.062088 309.234109\n",
  780. ".. ... ...\n",
  781. "70 30.187692 89.539008\n",
  782. "71 11.788418 181.550683\n",
  783. "72 18.292424 224.773383\n",
  784. "73 96.712668 219.567094\n",
  785. "74 31.012739 298.490216\n",
  786. "75 11.397261 199.944045\n",
  787. "76 17.392556 43.915692\n",
  788. "77 72.182694 256.068378\n",
  789. "78 73.980021 159.372581\n",
  790. "79 94.493058 447.132704\n",
  791. "80 84.562821 233.078830\n",
  792. "81 51.742474 131.070180\n",
  793. "82 53.748590 298.814333\n",
  794. "83 85.050835 451.803523\n",
  795. "84 46.777250 368.366436\n",
  796. "85 49.758434 254.706774\n",
  797. "86 24.119257 168.308433\n",
  798. "87 27.201576 146.342260\n",
  799. "88 7.009596 176.810149\n",
  800. "89 97.646950 219.160280\n",
  801. "90 1.382983 252.905653\n",
  802. "91 22.323530 127.570479\n",
  803. "92 45.045406 375.822340\n",
  804. "93 40.163991 80.389019\n",
  805. "94 53.182740 142.718183\n",
  806. "95 46.456779 336.876154\n",
  807. "96 77.130301 438.460586\n",
  808. "97 68.600608 355.900287\n",
  809. "98 41.693887 284.834637\n",
  810. "99 4.142669 168.034401\n",
  811. "\n",
  812. "[100 rows x 2 columns]"
  813. ]
  814. },
  815. "execution_count": 106,
  816. "metadata": {},
  817. "output_type": "execute_result"
  818. }
  819. ],
  820. "source": [
  821. "X[[0,2]]"
  822. ]
  823. },
  824. {
  825. "cell_type": "markdown",
  826. "metadata": {},
  827. "source": [
  828. "You can also select specific rows based on a condition. For example, suppose we want all rows where the value in column 0 is less than 5.\n",
  829. "\n",
  830. "What may seem odd is that the bracket notation '[]' selected columns before, yet when we pass it a boolean Series it selects rows. Both operations use the same bracket notation."
  831. ]
  832. },
  833. {
  834. "cell_type": "code",
  835. "execution_count": 107,
  836. "metadata": {},
  837. "outputs": [
  838. {
  839. "data": {
  840. "text/html": [
  841. "<div>\n",
  842. "<style>\n",
  843. " .dataframe thead tr:only-child th {\n",
  844. " text-align: right;\n",
  845. " }\n",
  846. "\n",
  847. " .dataframe thead th {\n",
  848. " text-align: left;\n",
  849. " }\n",
  850. "\n",
  851. " .dataframe tbody tr th {\n",
  852. " vertical-align: top;\n",
  853. " }\n",
  854. "</style>\n",
  855. "<table border=\"1\" class=\"dataframe\">\n",
  856. " <thead>\n",
  857. " <tr style=\"text-align: right;\">\n",
  858. " <th></th>\n",
  859. " <th>0</th>\n",
  860. " <th>1</th>\n",
  861. " <th>2</th>\n",
  862. " </tr>\n",
  863. " </thead>\n",
  864. " <tbody>\n",
  865. " <tr>\n",
  866. " <th>5</th>\n",
  867. " <td>3.192702</td>\n",
  868. " <td>29.256299</td>\n",
  869. " <td>94.618811</td>\n",
  870. " </tr>\n",
  871. " <tr>\n",
  872. " <th>44</th>\n",
  873. " <td>3.593966</td>\n",
  874. " <td>96.252217</td>\n",
  875. " <td>293.237183</td>\n",
  876. " </tr>\n",
  877. " <tr>\n",
  878. " <th>54</th>\n",
  879. " <td>4.593463</td>\n",
  880. " <td>46.335932</td>\n",
  881. " <td>145.818745</td>\n",
  882. " </tr>\n",
  883. " <tr>\n",
  884. " <th>90</th>\n",
  885. " <td>1.382983</td>\n",
  886. " <td>84.944087</td>\n",
  887. " <td>252.905653</td>\n",
  888. " </tr>\n",
  889. " <tr>\n",
  890. " <th>99</th>\n",
  891. " <td>4.142669</td>\n",
  892. " <td>52.254726</td>\n",
  893. " <td>168.034401</td>\n",
  894. " </tr>\n",
  895. " </tbody>\n",
  896. "</table>\n",
  897. "</div>"
  898. ],
  899. "text/plain": [
  900. " 0 1 2\n",
  901. "5 3.192702 29.256299 94.618811\n",
  902. "44 3.593966 96.252217 293.237183\n",
  903. "54 4.593463 46.335932 145.818745\n",
  904. "90 1.382983 84.944087 252.905653\n",
  905. "99 4.142669 52.254726 168.034401"
  906. ]
  907. },
  908. "execution_count": 107,
  909. "metadata": {},
  910. "output_type": "execute_result"
  911. }
  912. ],
  913. "source": [
  914. "X[ X[0] < 5]"
  915. ]
  916. },
  917. {
  918. "cell_type": "markdown",
  919. "metadata": {},
  920. "source": [
  921. "Now we switch to a different dataset. We will rename its columns and drop the last few lines of the file, which contain no useful data."
  922. ]
  923. },
  924. {
  925. "cell_type": "code",
  926. "execution_count": 108,
  927. "metadata": {},
  928. "outputs": [
  929. {
  930. "name": "stdout",
  931. "output_type": "stream",
  932. "text": [
  933. "/home/david/ProyectosTF/machine_learning_examples/airline\n"
  934. ]
  935. }
  936. ],
  937. "source": [
  938. "os.chdir('/home/david/ProyectosTF/machine_learning_examples/airline/')\n",
  939. "print(os.getcwd())"
  940. ]
  941. },
  942. {
  943. "cell_type": "markdown",
  944. "metadata": {},
  945. "source": [
  946. "We load the dataset with Pandas' read_csv() function as before, but this time we omit the header parameter: by default Pandas reads the first row as the header, and here the columns do have names. We also tell it to skip the last 3 lines with skipfooter=3. Since skipfooter is not supported by the default C engine, we specify engine=\"python\"."
  947. ]
  948. },
  949. {
  950. "cell_type": "code",
  951. "execution_count": 109,
  952. "metadata": {},
  953. "outputs": [],
  954. "source": [
  955. "df = pd.read_csv('international-airline-passengers.csv', engine = \"python\" , skipfooter=3)"
  956. ]
  957. },
  958. {
  959. "cell_type": "markdown",
  960. "metadata": {},
  961. "source": [
  962. "As shown below, the column names are quite long and not very descriptive."
  963. ]
  964. },
  965. {
  966. "cell_type": "code",
  967. "execution_count": 110,
  968. "metadata": {},
  969. "outputs": [
  970. {
  971. "data": {
  972. "text/html": [
  973. "<div>\n",
  974. "<style>\n",
  975. " .dataframe thead tr:only-child th {\n",
  976. " text-align: right;\n",
  977. " }\n",
  978. "\n",
  979. " .dataframe thead th {\n",
  980. " text-align: left;\n",
  981. " }\n",
  982. "\n",
  983. " .dataframe tbody tr th {\n",
  984. " vertical-align: top;\n",
  985. " }\n",
  986. "</style>\n",
  987. "<table border=\"1\" class=\"dataframe\">\n",
  988. " <thead>\n",
  989. " <tr style=\"text-align: right;\">\n",
  990. " <th></th>\n",
  991. " <th>Month</th>\n",
  992. " <th>International airline passengers: monthly totals in thousands. Jan 49 ? Dec 60</th>\n",
  993. " </tr>\n",
  994. " </thead>\n",
  995. " <tbody>\n",
  996. " </tbody>\n",
  997. "</table>\n",
  998. "</div>"
  999. ],
  1000. "text/plain": [
  1001. "Empty DataFrame\n",
  1002. "Columns: [Month, International airline passengers: monthly totals in thousands. Jan 49 ? Dec 60]\n",
  1003. "Index: []"
  1004. ]
  1005. },
  1006. "execution_count": 110,
  1007. "metadata": {},
  1008. "output_type": "execute_result"
  1009. }
  1010. ],
  1011. "source": [
  1012. "df.head(0)"
  1013. ]
  1014. },
  1015. {
  1016. "cell_type": "markdown",
  1017. "metadata": {},
  1018. "source": [
  1019. "So we replace them with better names by assigning a list to df.columns."
  1020. ]
  1021. },
  1022. {
  1023. "cell_type": "code",
  1024. "execution_count": 111,
  1025. "metadata": {},
  1026. "outputs": [
  1027. {
  1028. "data": {
  1029. "text/plain": [
  1030. "Index(['month', 'passengers'], dtype='object')"
  1031. ]
  1032. },
  1033. "execution_count": 111,
  1034. "metadata": {},
  1035. "output_type": "execute_result"
  1036. }
  1037. ],
  1038. "source": [
  1039. "df.columns = ['month', 'passengers']\n",
  1040. "df.columns"
  1041. ]
  1042. },
  1043. {
  1044. "cell_type": "markdown",
  1045. "metadata": {},
  1046. "source": [
  1047. "We have already seen that we can access a column by passing its name in brackets, like df['passengers']. When the column names are strings that are valid Python identifiers, we can also access the column as an attribute:"
  1048. ]
  1049. },
  1050. {
  1051. "cell_type": "code",
  1052. "execution_count": 112,
  1053. "metadata": {},
  1054. "outputs": [
  1055. {
  1056. "data": {
  1057. "text/plain": [
  1058. "0 112\n",
  1059. "1 118\n",
  1060. "2 132\n",
  1061. "3 129\n",
  1062. "4 121\n",
  1063. "5 135\n",
  1064. "6 148\n",
  1065. "7 148\n",
  1066. "8 136\n",
  1067. "9 119\n",
  1068. "10 104\n",
  1069. "11 118\n",
  1070. "12 115\n",
  1071. "13 126\n",
  1072. "14 141\n",
  1073. "15 135\n",
  1074. "16 125\n",
  1075. "17 149\n",
  1076. "18 170\n",
  1077. "19 170\n",
  1078. "20 158\n",
  1079. "21 133\n",
  1080. "22 114\n",
  1081. "23 140\n",
  1082. "24 145\n",
  1083. "25 150\n",
  1084. "26 178\n",
  1085. "27 163\n",
  1086. "28 172\n",
  1087. "29 178\n",
  1088. " ... \n",
  1089. "114 491\n",
  1090. "115 505\n",
  1091. "116 404\n",
  1092. "117 359\n",
  1093. "118 310\n",
  1094. "119 337\n",
  1095. "120 360\n",
  1096. "121 342\n",
  1097. "122 406\n",
  1098. "123 396\n",
  1099. "124 420\n",
  1100. "125 472\n",
  1101. "126 548\n",
  1102. "127 559\n",
  1103. "128 463\n",
  1104. "129 407\n",
  1105. "130 362\n",
  1106. "131 405\n",
  1107. "132 417\n",
  1108. "133 391\n",
  1109. "134 419\n",
  1110. "135 461\n",
  1111. "136 472\n",
  1112. "137 535\n",
  1113. "138 622\n",
  1114. "139 606\n",
  1115. "140 508\n",
  1116. "141 461\n",
  1117. "142 390\n",
  1118. "143 432\n",
  1119. "Name: passengers, Length: 144, dtype: int64"
  1120. ]
  1121. },
  1122. "execution_count": 112,
  1123. "metadata": {},
  1124. "output_type": "execute_result"
  1125. }
  1126. ],
  1127. "source": [
  1128. "df.passengers"
  1129. ]
  1130. },
  1131. {
  1132. "cell_type": "markdown",
  1134. "metadata": {},
  1145. "source": [
  1146. "To add a new column of ones to our DataFrame, we do the following:"
  1147. ]
  1148. },
  1149. {
  1150. "cell_type": "code",
  1151. "execution_count": 114,
  1152. "metadata": {},
  1153. "outputs": [
  1154. {
  1155. "data": {
  1156. "text/html": [
  1157. "<div>\n",
  1158. "<style>\n",
  1159. " .dataframe thead tr:only-child th {\n",
  1160. " text-align: right;\n",
  1161. " }\n",
  1162. "\n",
  1163. " .dataframe thead th {\n",
  1164. " text-align: left;\n",
  1165. " }\n",
  1166. "\n",
  1167. " .dataframe tbody tr th {\n",
  1168. " vertical-align: top;\n",
  1169. " }\n",
  1170. "</style>\n",
  1171. "<table border=\"1\" class=\"dataframe\">\n",
  1172. " <thead>\n",
  1173. " <tr style=\"text-align: right;\">\n",
  1174. " <th></th>\n",
  1175. " <th>month</th>\n",
  1176. " <th>passengers</th>\n",
  1177. " <th>ones</th>\n",
  1178. " </tr>\n",
  1179. " </thead>\n",
  1180. " <tbody>\n",
  1181. " <tr>\n",
  1182. " <th>0</th>\n",
  1183. " <td>1949-01</td>\n",
  1184. " <td>112</td>\n",
  1185. " <td>1</td>\n",
  1186. " </tr>\n",
  1187. " <tr>\n",
  1188. " <th>1</th>\n",
  1189. " <td>1949-02</td>\n",
  1190. " <td>118</td>\n",
  1191. " <td>1</td>\n",
  1192. " </tr>\n",
  1193. " <tr>\n",
  1194. " <th>2</th>\n",
  1195. " <td>1949-03</td>\n",
  1196. " <td>132</td>\n",
  1197. " <td>1</td>\n",
  1198. " </tr>\n",
  1199. " <tr>\n",
  1200. " <th>3</th>\n",
  1201. " <td>1949-04</td>\n",
  1202. " <td>129</td>\n",
  1203. " <td>1</td>\n",
  1204. " </tr>\n",
  1205. " <tr>\n",
  1206. " <th>4</th>\n",
  1207. " <td>1949-05</td>\n",
  1208. " <td>121</td>\n",
  1209. " <td>1</td>\n",
  1210. " </tr>\n",
  1211. " </tbody>\n",
  1212. "</table>\n",
  1213. "</div>"
  1214. ],
  1215. "text/plain": [
  1216. " month passengers ones\n",
  1217. "0 1949-01 112 1\n",
  1218. "1 1949-02 118 1\n",
  1219. "2 1949-03 132 1\n",
  1220. "3 1949-04 129 1\n",
  1221. "4 1949-05 121 1"
  1222. ]
  1223. },
  1224. "execution_count": 114,
  1225. "metadata": {},
  1226. "output_type": "execute_result"
  1227. }
  1228. ],
  1229. "source": [
  1230. "df['ones'] = 1\n",
  1231. "df.head(5)"
  1232. ]
  1233. },
  1234. {
  1235. "cell_type": "markdown",
  1236. "metadata": {},
  1237. "source": [
  1238. "To add a new column whose values are not constant but derived from other columns, we use the apply() function, which is similar to Python's map. For example, if we want the new column to be the product of the values in two existing columns, we would write something like this:\n",
  1239. "\n",
  1240. "`df[\"x1x2\"] = df.apply(lambda row: row['x1'] * row['x2'], axis=1)`\n",
  1241. "\n",
  1242. "We have to pass axis=1 so that the function is applied row by row instead of column by column.\n",
  1243. "\n",
  1244. "Now let's try this on our dataset by adding a new column with the dates converted to datetime."
  1245. ]
  1246. },
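{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the apply() call described above, the next cell builds a small made-up DataFrame (the name toy and its columns x1 and x2 are assumptions for illustration, not part of the dataset) and compares the row-wise apply() with plain column arithmetic. For a simple element-wise product like this one, multiplying the columns directly should give the same values and is usually faster, so apply() is mostly worth reaching for when the per-row logic cannot be written as column operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data; x1 and x2 are illustrative column names\n",
"toy = pd.DataFrame({\"x1\": [1, 2, 3], \"x2\": [10, 20, 30]})\n",
"\n",
"# Row-wise apply: the lambda receives one row at a time because axis=1\n",
"toy[\"x1x2_apply\"] = toy.apply(lambda row: row[\"x1\"] * row[\"x2\"], axis=1)\n",
"\n",
"# Vectorized alternative: multiply the two columns element by element\n",
"toy[\"x1x2_vec\"] = toy[\"x1\"] * toy[\"x2\"]\n",
"\n",
"# Check that both approaches produced the same values\n",
"print((toy[\"x1x2_apply\"] == toy[\"x1x2_vec\"]).all())"
]
},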
  1247. {
  1248. "cell_type": "code",
  1249. "execution_count": 115,
  1250. "metadata": {},
  1251. "outputs": [
  1252. {
  1253. "data": {
  1254. "text/html": [
  1255. "<div>\n",
  1256. "<style>\n",
  1257. " .dataframe thead tr:only-child th {\n",
  1258. " text-align: right;\n",
  1259. " }\n",
  1260. "\n",
  1261. " .dataframe thead th {\n",
  1262. " text-align: left;\n",
  1263. " }\n",
  1264. "\n",
  1265. " .dataframe tbody tr th {\n",
  1266. " vertical-align: top;\n",
  1267. " }\n",
  1268. "</style>\n",
  1269. "<table border=\"1\" class=\"dataframe\">\n",
  1270. " <thead>\n",
  1271. " <tr style=\"text-align: right;\">\n",
  1272. " <th></th>\n",
  1273. " <th>month</th>\n",
  1274. " <th>passengers</th>\n",
  1275. " <th>ones</th>\n",
  1276. " <th>datetime</th>\n",
  1277. " </tr>\n",
  1278. " </thead>\n",
  1279. " <tbody>\n",
  1280. " <tr>\n",
  1281. " <th>0</th>\n",
  1282. " <td>1949-01</td>\n",
  1283. " <td>112</td>\n",
  1284. " <td>1</td>\n",
  1285. " <td>1949-01-01</td>\n",
  1286. " </tr>\n",
  1287. " <tr>\n",
  1288. " <th>1</th>\n",
  1289. " <td>1949-02</td>\n",
  1290. " <td>118</td>\n",
  1291. " <td>1</td>\n",
  1292. " <td>1949-02-01</td>\n",
  1293. " </tr>\n",
  1294. " <tr>\n",
  1295. " <th>2</th>\n",
  1296. " <td>1949-03</td>\n",
  1297. " <td>132</td>\n",
  1298. " <td>1</td>\n",
  1299. " <td>1949-03-01</td>\n",
  1300. " </tr>\n",
  1301. " <tr>\n",
  1302. " <th>3</th>\n",
  1303. " <td>1949-04</td>\n",
  1304. " <td>129</td>\n",
  1305. " <td>1</td>\n",
  1306. " <td>1949-04-01</td>\n",
  1307. " </tr>\n",
  1308. " <tr>\n",
  1309. " <th>4</th>\n",
  1310. " <td>1949-05</td>\n",
  1311. " <td>121</td>\n",
  1312. " <td>1</td>\n",
  1313. " <td>1949-05-01</td>\n",
  1314. " </tr>\n",
  1315. " </tbody>\n",
  1316. "</table>\n",
  1317. "</div>"
  1318. ],
  1319. "text/plain": [
  1320. " month passengers ones datetime\n",
  1321. "0 1949-01 112 1 1949-01-01\n",
  1322. "1 1949-02 118 1 1949-02-01\n",
  1323. "2 1949-03 132 1 1949-03-01\n",
  1324. "3 1949-04 129 1 1949-04-01\n",
  1325. "4 1949-05 121 1 1949-05-01"
  1326. ]
  1327. },
  1328. "execution_count": 115,
  1329. "metadata": {},
  1330. "output_type": "execute_result"
  1331. }
  1332. ],
  1333. "source": [
  1334. "datetime.strptime(\"1949-01\", \"%Y-%m\")  # sanity check: parses to 1949-01-01 00:00:00; the result is discarded\n",
  1335. "df[\"datetime\"] = df.apply(lambda row: datetime.strptime(row[\"month\"], \"%Y-%m\"), axis=1)\n",
  1336. "df.head(5)"
  1337. ]
  1338. },
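{
"cell_type": "markdown",
"metadata": {},
"source": [
"The conversion above goes through apply() one row at a time. As a possible alternative, pd.to_datetime() can parse a whole column in a single call. The sketch below uses a small made-up DataFrame (the name sample is an assumption for illustration) with a month column in the same YYYY-MM format as our dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical sample that mirrors the format of the \"month\" column\n",
"sample = pd.DataFrame({\"month\": [\"1949-01\", \"1949-02\", \"1949-03\"]})\n",
"\n",
"# Parse the whole column at once; the day is not in the string, so it defaults to 1\n",
"sample[\"datetime\"] = pd.to_datetime(sample[\"month\"], format=\"%Y-%m\")\n",
"\n",
"# The new column has a datetime dtype instead of plain strings\n",
"print(sample.dtypes)\n",
"sample"
]
},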
  1339. {
  1340. "cell_type": "markdown",
  1341. "metadata": {},
  1342. "source": [
  1343. "Now let's see how to join two DataFrames, much like a JOIN in SQL."
  1344. ]
  1345. },
  1346. {
  1347. "cell_type": "code",
  1348. "execution_count": 116,
  1349. "metadata": {},
  1350. "outputs": [
  1351. {
  1352. "name": "stdout",
  1353. "output_type": "stream",
  1354. "text": [
  1355. "/home/david/ProyectosTF/machine_learning_examples/numpy_class\n"
  1356. ]
  1357. }
  1358. ],
  1359. "source": [
  1360. "os.chdir('/home/david/ProyectosTF/machine_learning_examples/numpy_class/')\n",
  1361. "print(os.getcwd())"
  1362. ]
  1363. },
  1364. {
  1365. "cell_type": "code",
  1366. "execution_count": 117,
  1367. "metadata": {},
  1368. "outputs": [
  1369. {
  1370. "data": {
  1371. "text/html": [
  1372. "<div>\n",
  1373. "<style>\n",
  1374. " .dataframe thead tr:only-child th {\n",
  1375. " text-align: right;\n",
  1376. " }\n",
  1377. "\n",
  1378. " .dataframe thead th {\n",
  1379. " text-align: left;\n",
  1380. " }\n",
  1381. "\n",
  1382. " .dataframe tbody tr th {\n",
  1383. " vertical-align: top;\n",
  1384. " }\n",
  1385. "</style>\n",
  1386. "<table border=\"1\" class=\"dataframe\">\n",
  1387. " <thead>\n",
  1388. " <tr style=\"text-align: right;\">\n",
  1389. " <th></th>\n",
  1390. " <th>user_id</th>\n",
  1391. " <th>email</th>\n",
  1392. " <th>age</th>\n",
  1393. " </tr>\n",
  1394. " </thead>\n",
  1395. " <tbody>\n",
  1396. " <tr>\n",
  1397. " <th>0</th>\n",
  1398. " <td>1</td>\n",
  1399. " <td>alice@gmail.com</td>\n",
  1400. " <td>20</td>\n",
  1401. " </tr>\n",
  1402. " <tr>\n",
  1403. " <th>1</th>\n",
  1404. " <td>2</td>\n",
  1405. " <td>bob@gmail.com</td>\n",
  1406. " <td>25</td>\n",
  1407. " </tr>\n",
  1408. " <tr>\n",
  1409. " <th>2</th>\n",
  1410. " <td>3</td>\n",
  1411. " <td>carol@gmail.com</td>\n",
  1412. " <td>30</td>\n",
  1413. " </tr>\n",
  1414. " </tbody>\n",
  1415. "</table>\n",
  1416. "</div>"
  1417. ],
  1418. "text/plain": [
  1419. " user_id email age\n",
  1420. "0 1 alice@gmail.com 20\n",
  1421. "1 2 bob@gmail.com 25\n",
  1422. "2 3 carol@gmail.com 30"
  1423. ]
  1424. },
  1425. "execution_count": 117,
  1426. "metadata": {},
  1427. "output_type": "execute_result"
  1428. }
  1429. ],
  1430. "source": [
  1431. "t1 = pd.read_csv(\"table1.csv\")\n",
  1432. "t2 = pd.read_csv(\"table2.csv\")\n",
  1433. "t1\n"
  1434. ]
  1435. },
  1436. {
  1437. "cell_type": "code",
  1438. "execution_count": 118,
  1439. "metadata": {},
  1440. "outputs": [
  1441. {
  1442. "data": {
  1443. "text/html": [
  1444. "<div>\n",
  1445. "<style>\n",
  1446. " .dataframe thead tr:only-child th {\n",
  1447. " text-align: right;\n",
  1448. " }\n",
  1449. "\n",
  1450. " .dataframe thead th {\n",
  1451. " text-align: left;\n",
  1452. " }\n",
  1453. "\n",
  1454. " .dataframe tbody tr th {\n",
  1455. " vertical-align: top;\n",
  1456. " }\n",
  1457. "</style>\n",
  1458. "<table border=\"1\" class=\"dataframe\">\n",
  1459. " <thead>\n",
  1460. " <tr style=\"text-align: right;\">\n",
  1461. " <th></th>\n",
  1462. " <th>user_id</th>\n",
  1463. " <th>ad_id</th>\n",
  1464. " <th>click</th>\n",
  1465. " </tr>\n",
  1466. " </thead>\n",
  1467. " <tbody>\n",
  1468. " <tr>\n",
  1469. " <th>0</th>\n",
  1470. " <td>1</td>\n",
  1471. " <td>1</td>\n",
  1472. " <td>1</td>\n",
  1473. " </tr>\n",
  1474. " <tr>\n",
  1475. " <th>1</th>\n",
  1476. " <td>1</td>\n",
  1477. " <td>2</td>\n",
  1478. " <td>0</td>\n",
  1479. " </tr>\n",
  1480. " <tr>\n",
  1481. " <th>2</th>\n",
  1482. " <td>1</td>\n",
  1483. " <td>5</td>\n",
  1484. " <td>0</td>\n",
  1485. " </tr>\n",
  1486. " <tr>\n",
  1487. " <th>3</th>\n",
  1488. " <td>2</td>\n",
  1489. " <td>3</td>\n",
  1490. " <td>0</td>\n",
  1491. " </tr>\n",
  1492. " <tr>\n",
  1493. " <th>4</th>\n",
  1494. " <td>2</td>\n",
  1495. " <td>4</td>\n",
  1496. " <td>1</td>\n",
  1497. " </tr>\n",
  1498. " <tr>\n",
  1499. " <th>5</th>\n",
  1500. " <td>2</td>\n",
  1501. " <td>1</td>\n",
  1502. " <td>0</td>\n",
  1503. " </tr>\n",
  1504. " <tr>\n",
  1505. " <th>6</th>\n",
  1506. " <td>3</td>\n",
  1507. " <td>2</td>\n",
  1508. " <td>0</td>\n",
  1509. " </tr>\n",
  1510. " <tr>\n",
  1511. " <th>7</th>\n",
  1512. " <td>3</td>\n",
  1513. " <td>1</td>\n",
  1514. " <td>0</td>\n",
  1515. " </tr>\n",
  1516. " <tr>\n",
  1517. " <th>8</th>\n",
  1518. " <td>3</td>\n",
  1519. " <td>3</td>\n",
  1520. " <td>0</td>\n",
  1521. " </tr>\n",
  1522. " <tr>\n",
  1523. " <th>9</th>\n",
  1524. " <td>3</td>\n",
  1525. " <td>4</td>\n",
  1526. " <td>0</td>\n",
  1527. " </tr>\n",
  1528. " <tr>\n",
  1529. " <th>10</th>\n",
  1530. " <td>3</td>\n",
  1531. " <td>5</td>\n",
  1532. " <td>1</td>\n",
  1533. " </tr>\n",
  1534. " </tbody>\n",
  1535. "</table>\n",
  1536. "</div>"
  1537. ],
  1538. "text/plain": [
  1539. " user_id ad_id click\n",
  1540. "0 1 1 1\n",
  1541. "1 1 2 0\n",
  1542. "2 1 5 0\n",
  1543. "3 2 3 0\n",
  1544. "4 2 4 1\n",
  1545. "5 2 1 0\n",
  1546. "6 3 2 0\n",
  1547. "7 3 1 0\n",
  1548. "8 3 3 0\n",
  1549. "9 3 4 0\n",
  1550. "10 3 5 1"
  1551. ]
  1552. },
  1553. "execution_count": 118,
  1554. "metadata": {},
  1555. "output_type": "execute_result"
  1556. }
  1557. ],
  1558. "source": [
  1559. "t2"
  1560. ]
  1561. },
  1562. {
  1563. "cell_type": "markdown",
  1564. "metadata": {},
  1565. "source": [
  1566. "Naturally, we are going to join the tables on the user_id column, which is the one they share. Pandas has a very useful function for this called merge(). We just pass it the two tables we want to join and the join column through the on parameter. If we do not pass any column, merge() joins on the columns that the two tables have in common; to join on the row index instead, pass left_index=True and right_index=True."
  1567. ]
  1568. },
  1569. {
  1570. "cell_type": "code",
  1571. "execution_count": 119,
  1572. "metadata": {},
  1573. "outputs": [
  1574. {
  1575. "data": {
  1576. "text/html": [
  1577. "<div>\n",
  1578. "<style>\n",
  1579. " .dataframe thead tr:only-child th {\n",
  1580. " text-align: right;\n",
  1581. " }\n",
  1582. "\n",
  1583. " .dataframe thead th {\n",
  1584. " text-align: left;\n",
  1585. " }\n",
  1586. "\n",
  1587. " .dataframe tbody tr th {\n",
  1588. " vertical-align: top;\n",
  1589. " }\n",
  1590. "</style>\n",
  1591. "<table border=\"1\" class=\"dataframe\">\n",
  1592. " <thead>\n",
  1593. " <tr style=\"text-align: right;\">\n",
  1594. " <th></th>\n",
  1595. " <th>user_id</th>\n",
  1596. " <th>email</th>\n",
  1597. " <th>age</th>\n",
  1598. " <th>ad_id</th>\n",
  1599. " <th>click</th>\n",
  1600. " </tr>\n",
  1601. " </thead>\n",
  1602. " <tbody>\n",
  1603. " <tr>\n",
  1604. " <th>0</th>\n",
  1605. " <td>1</td>\n",
  1606. " <td>alice@gmail.com</td>\n",
  1607. " <td>20</td>\n",
  1608. " <td>1</td>\n",
  1609. " <td>1</td>\n",
  1610. " </tr>\n",
  1611. " <tr>\n",
  1612. " <th>1</th>\n",
  1613. " <td>1</td>\n",
  1614. " <td>alice@gmail.com</td>\n",
  1615. " <td>20</td>\n",
  1616. " <td>2</td>\n",
  1617. " <td>0</td>\n",
  1618. " </tr>\n",
  1619. " <tr>\n",
  1620. " <th>2</th>\n",
  1621. " <td>1</td>\n",
  1622. " <td>alice@gmail.com</td>\n",
  1623. " <td>20</td>\n",
  1624. " <td>5</td>\n",
  1625. " <td>0</td>\n",
  1626. " </tr>\n",
  1627. " <tr>\n",
  1628. " <th>3</th>\n",
  1629. " <td>2</td>\n",
  1630. " <td>bob@gmail.com</td>\n",
  1631. " <td>25</td>\n",
  1632. " <td>3</td>\n",
  1633. " <td>0</td>\n",
  1634. " </tr>\n",
  1635. " <tr>\n",
  1636. " <th>4</th>\n",
  1637. " <td>2</td>\n",
  1638. " <td>bob@gmail.com</td>\n",
  1639. " <td>25</td>\n",
  1640. " <td>4</td>\n",
  1641. " <td>1</td>\n",
  1642. " </tr>\n",
  1643. " <tr>\n",
  1644. " <th>5</th>\n",
  1645. " <td>2</td>\n",
  1646. " <td>bob@gmail.com</td>\n",
  1647. " <td>25</td>\n",
  1648. " <td>1</td>\n",
  1649. " <td>0</td>\n",
  1650. " </tr>\n",
  1651. " <tr>\n",
  1652. " <th>6</th>\n",
  1653. " <td>3</td>\n",
  1654. " <td>carol@gmail.com</td>\n",
  1655. " <td>30</td>\n",
  1656. " <td>2</td>\n",
  1657. " <td>0</td>\n",
  1658. " </tr>\n",
  1659. " <tr>\n",
  1660. " <th>7</th>\n",
  1661. " <td>3</td>\n",
  1662. " <td>carol@gmail.com</td>\n",
  1663. " <td>30</td>\n",
  1664. " <td>1</td>\n",
  1665. " <td>0</td>\n",
  1666. " </tr>\n",
  1667. " <tr>\n",
  1668. " <th>8</th>\n",
  1669. " <td>3</td>\n",
  1670. " <td>carol@gmail.com</td>\n",
  1671. " <td>30</td>\n",
  1672. " <td>3</td>\n",
  1673. " <td>0</td>\n",
  1674. " </tr>\n",
  1675. " <tr>\n",
  1676. " <th>9</th>\n",
  1677. " <td>3</td>\n",
  1678. " <td>carol@gmail.com</td>\n",
  1679. " <td>30</td>\n",
  1680. " <td>4</td>\n",
  1681. " <td>0</td>\n",
  1682. " </tr>\n",
  1683. " <tr>\n",
  1684. " <th>10</th>\n",
  1685. " <td>3</td>\n",
  1686. " <td>carol@gmail.com</td>\n",
  1687. " <td>30</td>\n",
  1688. " <td>5</td>\n",
  1689. " <td>1</td>\n",
  1690. " </tr>\n",
  1691. " </tbody>\n",
  1692. "</table>\n",
  1693. "</div>"
  1694. ],
  1695. "text/plain": [
  1696. " user_id email age ad_id click\n",
  1697. "0 1 alice@gmail.com 20 1 1\n",
  1698. "1 1 alice@gmail.com 20 2 0\n",
  1699. "2 1 alice@gmail.com 20 5 0\n",
  1700. "3 2 bob@gmail.com 25 3 0\n",
  1701. "4 2 bob@gmail.com 25 4 1\n",
  1702. "5 2 bob@gmail.com 25 1 0\n",
  1703. "6 3 carol@gmail.com 30 2 0\n",
  1704. "7 3 carol@gmail.com 30 1 0\n",
  1705. "8 3 carol@gmail.com 30 3 0\n",
  1706. "9 3 carol@gmail.com 30 4 0\n",
  1707. "10 3 carol@gmail.com 30 5 1"
  1708. ]
  1709. },
  1710. "execution_count": 119,
  1711. "metadata": {},
  1712. "output_type": "execute_result"
  1713. }
  1714. ],
  1715. "source": [
  1716. "m = pd.merge(t1, t2, on=\"user_id\")  # equivalent: t1.merge(t2, on=\"user_id\")\n",
  1717. "m"
  1718. ]
  1719. },
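{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the merge above every user_id appears in both tables, so no rows are lost. By default merge() performs an inner join, and the how parameter selects other SQL-style joins. The sketch below uses two small made-up tables (users and clicks are assumed names, not the CSV files loaded earlier) in which one user has no clicks, to show how an inner join and a left join differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical tables: user 4 has no rows in clicks\n",
"users = pd.DataFrame({\"user_id\": [1, 2, 4], \"age\": [20, 25, 40]})\n",
"clicks = pd.DataFrame({\"user_id\": [1, 1, 2], \"click\": [1, 0, 0]})\n",
"\n",
"# Inner join (the default): only user_ids present in both tables survive\n",
"inner = pd.merge(users, clicks, on=\"user_id\")\n",
"\n",
"# Left join: every row of users is kept; missing clicks become NaN\n",
"left = pd.merge(users, clicks, on=\"user_id\", how=\"left\")\n",
"\n",
"print(len(inner))  # 3 rows: user 4 is dropped\n",
"print(len(left))   # 4 rows: user 4 is kept with click = NaN\n",
"left"
]
},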
  1720. {
  1721. "cell_type": "markdown",
  1722. "metadata": {},
  1723. "source": [
  1724. "This concludes our tour of the most commonly used Pandas functions. There is plenty of functionality we have not covered, but this is enough to get by; for everything else, there is StackOverflow."
  1725. ]
  1726. }
  1727. ],
  1728. "metadata": {
  1729. "kernelspec": {
  1730. "display_name": "Python 3",
  1731. "language": "python",
  1732. "name": "python3"
  1733. },
  1734. "language_info": {
  1735. "codemirror_mode": {
  1736. "name": "ipython",
  1737. "version": 3
  1738. },
  1739. "file_extension": ".py",
  1740. "mimetype": "text/x-python",
  1741. "name": "python",
  1742. "nbconvert_exporter": "python",
  1743. "pygments_lexer": "ipython3",
  1744. "version": "3.5.2"
  1745. }
  1746. },
  1747. "nbformat": 4,
  1748. "nbformat_minor": 2
  1749. }