Mar 27th, 2019
  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "metadata": {},
  6. "source": [
  7. "# Data Cleaning and Preparation"
  8. ]
  9. },
  10. {
  11. "cell_type": "code",
  12. "execution_count": 1,
  13. "metadata": {},
  14. "outputs": [],
  15. "source": [
  16. "import numpy as np\n",
  17. "import pandas as pd"
  18. ]
  19. },
  20. {
  21. "cell_type": "code",
  22. "execution_count": 2,
  23. "metadata": {},
  24. "outputs": [
  25. {
  26. "name": "stdout",
  27. "output_type": "stream",
  28. "text": [
  29. "Numpy version: 1.16.1\n",
  30. "Pandas version: 0.24.1\n"
  31. ]
  32. }
  33. ],
  34. "source": [
  35. "print(f'Numpy version: {np.__version__}')\n",
  36. "print(f'Pandas version: {pd.__version__}')"
  37. ]
  38. },
  39. {
  40. "cell_type": "markdown",
  41. "metadata": {},
  42. "source": [
  43. "## Handling Missing Data\n",
  44. "\n",
  45. "* NA (missing data) handling methods\n",
  46. "\n",
  47. "Method | Description\n",
  48. ":--- | :---\n",
  49. "`dropna` | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.\n",
  50. "`fillna` | Fill in missing data with some value or using an interpolation method such as `ffill` or `bfill`.\n",
  51. "`isnull` | Return boolean values indicating which values are missing.\n",
  52. "`notnull` | Negation of `isnull`.\n",
  53. "\n",
  54. "### Filtering Out Missing Data"
  55. ]
  56. },
  57. {
  58. "cell_type": "code",
  59. "execution_count": 3,
  60. "metadata": {},
  61. "outputs": [
  62. {
  63. "data": {
  64. "text/plain": [
  65. "0 1.0\n",
  66. "1 NaN\n",
  67. "2 3.5\n",
  68. "3 NaN\n",
  69. "4 7.0\n",
  70. "dtype: float64"
  71. ]
  72. },
  73. "execution_count": 3,
  74. "metadata": {},
  75. "output_type": "execute_result"
  76. }
  77. ],
  78. "source": [
  79. "data = pd.Series([1, np.nan, 3.5, np.nan, 7])\n",
  80. "data"
  81. ]
  82. },
  83. {
  84. "cell_type": "markdown",
  85. "metadata": {},
  86. "source": [
  87. "```py\n",
  88. "Series.dropna(\n",
  89. "    axis=0,\n",
  90. "    inplace=False,\n",
  91. "    **kwargs\n",
  92. ")\n",
  93. "```"
  94. ]
  95. },
  96. {
  97. "cell_type": "code",
  98. "execution_count": 4,
  99. "metadata": {},
  100. "outputs": [
  101. {
  102. "data": {
  103. "text/plain": [
  104. "0 1.0\n",
  105. "2 3.5\n",
  106. "4 7.0\n",
  107. "dtype: float64"
  108. ]
  109. },
  110. "execution_count": 4,
  111. "metadata": {},
  112. "output_type": "execute_result"
  113. }
  114. ],
  115. "source": [
  116. "data.dropna()"
  117. ]
  118. },
  119. {
  120. "cell_type": "code",
  121. "execution_count": 5,
  122. "metadata": {},
  123. "outputs": [
  124. {
  125. "data": {
  126. "text/html": [
  127. "<div>\n",
  128. "<style scoped>\n",
  129. " .dataframe tbody tr th:only-of-type {\n",
  130. " vertical-align: middle;\n",
  131. " }\n",
  132. "\n",
  133. " .dataframe tbody tr th {\n",
  134. " vertical-align: top;\n",
  135. " }\n",
  136. "\n",
  137. " .dataframe thead th {\n",
  138. " text-align: right;\n",
  139. " }\n",
  140. "</style>\n",
  141. "<table border=\"1\" class=\"dataframe\">\n",
  142. " <thead>\n",
  143. " <tr style=\"text-align: right;\">\n",
  144. " <th></th>\n",
  145. " <th>0</th>\n",
  146. " <th>1</th>\n",
  147. " <th>2</th>\n",
  148. " </tr>\n",
  149. " </thead>\n",
  150. " <tbody>\n",
  151. " <tr>\n",
  152. " <th>0</th>\n",
  153. " <td>1.0</td>\n",
  154. " <td>6.5</td>\n",
  155. " <td>3.0</td>\n",
  156. " </tr>\n",
  157. " <tr>\n",
  158. " <th>1</th>\n",
  159. " <td>1.0</td>\n",
  160. " <td>NaN</td>\n",
  161. " <td>NaN</td>\n",
  162. " </tr>\n",
  163. " <tr>\n",
  164. " <th>2</th>\n",
  165. " <td>NaN</td>\n",
  166. " <td>NaN</td>\n",
  167. " <td>NaN</td>\n",
  168. " </tr>\n",
  169. " <tr>\n",
  170. " <th>3</th>\n",
  171. " <td>NaN</td>\n",
  172. " <td>6.5</td>\n",
  173. " <td>3.0</td>\n",
  174. " </tr>\n",
  175. " </tbody>\n",
  176. "</table>\n",
  177. "</div>"
  178. ],
  179. "text/plain": [
  180. " 0 1 2\n",
  181. "0 1.0 6.5 3.0\n",
  182. "1 1.0 NaN NaN\n",
  183. "2 NaN NaN NaN\n",
  184. "3 NaN 6.5 3.0"
  185. ]
  186. },
  187. "execution_count": 5,
  188. "metadata": {},
  189. "output_type": "execute_result"
  190. }
  191. ],
  192. "source": [
  193. "data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],\n",
  194. "                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])\n",
  195. "data"
  196. ]
  197. },
  198. {
  199. "cell_type": "markdown",
  200. "metadata": {},
  201. "source": [
  202. "```py\n",
  203. "DataFrame.dropna(\n",
  204. "    axis=0,\n",
  205. "    how='any',\n",
  206. "    thresh=None,\n",
  207. "    subset=None,\n",
  208. "    inplace=False\n",
  209. ")\n",
  210. "```"
  211. ]
  212. },
  213. {
  214. "cell_type": "code",
  215. "execution_count": 6,
  216. "metadata": {},
  217. "outputs": [
  218. {
  219. "data": {
  220. "text/html": [
  221. "<div>\n",
  222. "<style scoped>\n",
  223. " .dataframe tbody tr th:only-of-type {\n",
  224. " vertical-align: middle;\n",
  225. " }\n",
  226. "\n",
  227. " .dataframe tbody tr th {\n",
  228. " vertical-align: top;\n",
  229. " }\n",
  230. "\n",
  231. " .dataframe thead th {\n",
  232. " text-align: right;\n",
  233. " }\n",
  234. "</style>\n",
  235. "<table border=\"1\" class=\"dataframe\">\n",
  236. " <thead>\n",
  237. " <tr style=\"text-align: right;\">\n",
  238. " <th></th>\n",
  239. " <th>0</th>\n",
  240. " <th>1</th>\n",
  241. " <th>2</th>\n",
  242. " </tr>\n",
  243. " </thead>\n",
  244. " <tbody>\n",
  245. " <tr>\n",
  246. " <th>0</th>\n",
  247. " <td>1.0</td>\n",
  248. " <td>6.5</td>\n",
  249. " <td>3.0</td>\n",
  250. " </tr>\n",
  251. " </tbody>\n",
  252. "</table>\n",
  253. "</div>"
  254. ],
  255. "text/plain": [
  256. " 0 1 2\n",
  257. "0 1.0 6.5 3.0"
  258. ]
  259. },
  260. "execution_count": 6,
  261. "metadata": {},
  262. "output_type": "execute_result"
  263. }
  264. ],
  265. "source": [
  266. "cleaned = data.dropna()\n",
  267. "cleaned"
  268. ]
  269. },
  270. {
  271. "cell_type": "code",
  272. "execution_count": 7,
  273. "metadata": {},
  274. "outputs": [
  275. {
  276. "data": {
  277. "text/html": [
  278. "<div>\n",
  279. "<style scoped>\n",
  280. " .dataframe tbody tr th:only-of-type {\n",
  281. " vertical-align: middle;\n",
  282. " }\n",
  283. "\n",
  284. " .dataframe tbody tr th {\n",
  285. " vertical-align: top;\n",
  286. " }\n",
  287. "\n",
  288. " .dataframe thead th {\n",
  289. " text-align: right;\n",
  290. " }\n",
  291. "</style>\n",
  292. "<table border=\"1\" class=\"dataframe\">\n",
  293. " <thead>\n",
  294. " <tr style=\"text-align: right;\">\n",
  295. " <th></th>\n",
  296. " <th>0</th>\n",
  297. " <th>1</th>\n",
  298. " <th>2</th>\n",
  299. " </tr>\n",
  300. " </thead>\n",
  301. " <tbody>\n",
  302. " <tr>\n",
  303. " <th>0</th>\n",
  304. " <td>1.0</td>\n",
  305. " <td>6.5</td>\n",
  306. " <td>3.0</td>\n",
  307. " </tr>\n",
  308. " <tr>\n",
  309. " <th>1</th>\n",
  310. " <td>1.0</td>\n",
  311. " <td>NaN</td>\n",
  312. " <td>NaN</td>\n",
  313. " </tr>\n",
  314. " <tr>\n",
  315. " <th>3</th>\n",
  316. " <td>NaN</td>\n",
  317. " <td>6.5</td>\n",
  318. " <td>3.0</td>\n",
  319. " </tr>\n",
  320. " </tbody>\n",
  321. "</table>\n",
  322. "</div>"
  323. ],
  324. "text/plain": [
  325. " 0 1 2\n",
  326. "0 1.0 6.5 3.0\n",
  327. "1 1.0 NaN NaN\n",
  328. "3 NaN 6.5 3.0"
  329. ]
  330. },
  331. "execution_count": 7,
  332. "metadata": {},
  333. "output_type": "execute_result"
  334. }
  335. ],
  336. "source": [
  337. "data.dropna(how='all')  # drop only the rows in which every value is NA"
  338. ]
  339. },
  340. {
  341. "cell_type": "code",
  342. "execution_count": 8,
  343. "metadata": {},
  344. "outputs": [
  345. {
  346. "data": {
  347. "text/html": [
  348. "<div>\n",
  349. "<style scoped>\n",
  350. " .dataframe tbody tr th:only-of-type {\n",
  351. " vertical-align: middle;\n",
  352. " }\n",
  353. "\n",
  354. " .dataframe tbody tr th {\n",
  355. " vertical-align: top;\n",
  356. " }\n",
  357. "\n",
  358. " .dataframe thead th {\n",
  359. " text-align: right;\n",
  360. " }\n",
  361. "</style>\n",
  362. "<table border=\"1\" class=\"dataframe\">\n",
  363. " <thead>\n",
  364. " <tr style=\"text-align: right;\">\n",
  365. " <th></th>\n",
  366. " <th>0</th>\n",
  367. " <th>1</th>\n",
  368. " <th>2</th>\n",
  369. " <th>3</th>\n",
  370. " </tr>\n",
  371. " </thead>\n",
  372. " <tbody>\n",
  373. " <tr>\n",
  374. " <th>0</th>\n",
  375. " <td>1.0</td>\n",
  376. " <td>6.5</td>\n",
  377. " <td>3.0</td>\n",
  378. " <td>NaN</td>\n",
  379. " </tr>\n",
  380. " <tr>\n",
  381. " <th>1</th>\n",
  382. " <td>1.0</td>\n",
  383. " <td>NaN</td>\n",
  384. " <td>NaN</td>\n",
  385. " <td>NaN</td>\n",
  386. " </tr>\n",
  387. " <tr>\n",
  388. " <th>2</th>\n",
  389. " <td>NaN</td>\n",
  390. " <td>NaN</td>\n",
  391. " <td>NaN</td>\n",
  392. " <td>NaN</td>\n",
  393. " </tr>\n",
  394. " <tr>\n",
  395. " <th>3</th>\n",
  396. " <td>NaN</td>\n",
  397. " <td>6.5</td>\n",
  398. " <td>3.0</td>\n",
  399. " <td>NaN</td>\n",
  400. " </tr>\n",
  401. " </tbody>\n",
  402. "</table>\n",
  403. "</div>"
  404. ],
  405. "text/plain": [
  406. " 0 1 2 3\n",
  407. "0 1.0 6.5 3.0 NaN\n",
  408. "1 1.0 NaN NaN NaN\n",
  409. "2 NaN NaN NaN NaN\n",
  410. "3 NaN 6.5 3.0 NaN"
  411. ]
  412. },
  413. "execution_count": 8,
  414. "metadata": {},
  415. "output_type": "execute_result"
  416. }
  417. ],
  418. "source": [
  419. "data[3] = np.nan\n",
  420. "data"
  421. ]
  422. },
  423. {
  424. "cell_type": "code",
  425. "execution_count": 9,
  426. "metadata": {},
  427. "outputs": [
  428. {
  429. "data": {
  430. "text/html": [
  431. "<div>\n",
  432. "<style scoped>\n",
  433. " .dataframe tbody tr th:only-of-type {\n",
  434. " vertical-align: middle;\n",
  435. " }\n",
  436. "\n",
  437. " .dataframe tbody tr th {\n",
  438. " vertical-align: top;\n",
  439. " }\n",
  440. "\n",
  441. " .dataframe thead th {\n",
  442. " text-align: right;\n",
  443. " }\n",
  444. "</style>\n",
  445. "<table border=\"1\" class=\"dataframe\">\n",
  446. " <thead>\n",
  447. " <tr style=\"text-align: right;\">\n",
  448. " <th></th>\n",
  449. " <th>0</th>\n",
  450. " <th>1</th>\n",
  451. " <th>2</th>\n",
  452. " </tr>\n",
  453. " </thead>\n",
  454. " <tbody>\n",
  455. " <tr>\n",
  456. " <th>0</th>\n",
  457. " <td>1.0</td>\n",
  458. " <td>6.5</td>\n",
  459. " <td>3.0</td>\n",
  460. " </tr>\n",
  461. " <tr>\n",
  462. " <th>1</th>\n",
  463. " <td>1.0</td>\n",
  464. " <td>NaN</td>\n",
  465. " <td>NaN</td>\n",
  466. " </tr>\n",
  467. " <tr>\n",
  468. " <th>2</th>\n",
  469. " <td>NaN</td>\n",
  470. " <td>NaN</td>\n",
  471. " <td>NaN</td>\n",
  472. " </tr>\n",
  473. " <tr>\n",
  474. " <th>3</th>\n",
  475. " <td>NaN</td>\n",
  476. " <td>6.5</td>\n",
  477. " <td>3.0</td>\n",
  478. " </tr>\n",
  479. " </tbody>\n",
  480. "</table>\n",
  481. "</div>"
  482. ],
  483. "text/plain": [
  484. " 0 1 2\n",
  485. "0 1.0 6.5 3.0\n",
  486. "1 1.0 NaN NaN\n",
  487. "2 NaN NaN NaN\n",
  488. "3 NaN 6.5 3.0"
  489. ]
  490. },
  491. "execution_count": 9,
  492. "metadata": {},
  493. "output_type": "execute_result"
  494. }
  495. ],
  496. "source": [
  497. "data.dropna(axis=1, how='all')  # drop only the columns in which every value is NA"
  498. ]
  499. },
  500. {
  501. "cell_type": "code",
  502. "execution_count": 10,
  503. "metadata": {},
  504. "outputs": [
  505. {
  506. "data": {
  507. "text/html": [
  508. "<div>\n",
  509. "<style scoped>\n",
  510. " .dataframe tbody tr th:only-of-type {\n",
  511. " vertical-align: middle;\n",
  512. " }\n",
  513. "\n",
  514. " .dataframe tbody tr th {\n",
  515. " vertical-align: top;\n",
  516. " }\n",
  517. "\n",
  518. " .dataframe thead th {\n",
  519. " text-align: right;\n",
  520. " }\n",
  521. "</style>\n",
  522. "<table border=\"1\" class=\"dataframe\">\n",
  523. " <thead>\n",
  524. " <tr style=\"text-align: right;\">\n",
  525. " <th></th>\n",
  526. " <th>0</th>\n",
  527. " <th>1</th>\n",
  528. " <th>2</th>\n",
  529. " <th>3</th>\n",
  530. " </tr>\n",
  531. " </thead>\n",
  532. " <tbody>\n",
  533. " <tr>\n",
  534. " <th>0</th>\n",
  535. " <td>1.0</td>\n",
  536. " <td>6.5</td>\n",
  537. " <td>3.0</td>\n",
  538. " <td>NaN</td>\n",
  539. " </tr>\n",
  540. " <tr>\n",
  541. " <th>3</th>\n",
  542. " <td>NaN</td>\n",
  543. " <td>6.5</td>\n",
  544. " <td>3.0</td>\n",
  545. " <td>NaN</td>\n",
  546. " </tr>\n",
  547. " </tbody>\n",
  548. "</table>\n",
  549. "</div>"
  550. ],
  551. "text/plain": [
  552. " 0 1 2 3\n",
  553. "0 1.0 6.5 3.0 NaN\n",
  554. "3 NaN 6.5 3.0 NaN"
  555. ]
  556. },
  557. "execution_count": 10,
  558. "metadata": {},
  559. "output_type": "execute_result"
  560. }
  561. ],
  562. "source": [
  563. "data.dropna(thresh=2) # drop those rows with < 2 non-NA values"
  564. ]
  565. },
  566. {
  567. "cell_type": "code",
  568. "execution_count": 11,
  569. "metadata": {},
  570. "outputs": [
  571. {
  572. "data": {
  573. "text/html": [
  574. "<div>\n",
  575. "<style scoped>\n",
  576. " .dataframe tbody tr th:only-of-type {\n",
  577. " vertical-align: middle;\n",
  578. " }\n",
  579. "\n",
  580. " .dataframe tbody tr th {\n",
  581. " vertical-align: top;\n",
  582. " }\n",
  583. "\n",
  584. " .dataframe thead th {\n",
  585. " text-align: right;\n",
  586. " }\n",
  587. "</style>\n",
  588. "<table border=\"1\" class=\"dataframe\">\n",
  589. " <thead>\n",
  590. " <tr style=\"text-align: right;\">\n",
  591. " <th></th>\n",
  592. " <th>0</th>\n",
  593. " <th>1</th>\n",
  594. " <th>2</th>\n",
  595. " </tr>\n",
  596. " </thead>\n",
  597. " <tbody>\n",
  598. " <tr>\n",
  599. " <th>0</th>\n",
  600. " <td>1.0</td>\n",
  601. " <td>6.5</td>\n",
  602. " <td>3.0</td>\n",
  603. " </tr>\n",
  604. " <tr>\n",
  605. " <th>1</th>\n",
  606. " <td>1.0</td>\n",
  607. " <td>NaN</td>\n",
  608. " <td>NaN</td>\n",
  609. " </tr>\n",
  610. " <tr>\n",
  611. " <th>2</th>\n",
  612. " <td>NaN</td>\n",
  613. " <td>NaN</td>\n",
  614. " <td>NaN</td>\n",
  615. " </tr>\n",
  616. " <tr>\n",
  617. " <th>3</th>\n",
  618. " <td>NaN</td>\n",
  619. " <td>6.5</td>\n",
  620. " <td>3.0</td>\n",
  621. " </tr>\n",
  622. " </tbody>\n",
  623. "</table>\n",
  624. "</div>"
  625. ],
  626. "text/plain": [
  627. " 0 1 2\n",
  628. "0 1.0 6.5 3.0\n",
  629. "1 1.0 NaN NaN\n",
  630. "2 NaN NaN NaN\n",
  631. "3 NaN 6.5 3.0"
  632. ]
  633. },
  634. "execution_count": 11,
  635. "metadata": {},
  636. "output_type": "execute_result"
  637. }
  638. ],
  639. "source": [
  640. "data.dropna(axis='columns', thresh=2)"
  641. ]
  642. },
  643. {
  644. "cell_type": "markdown",
  645. "metadata": {},
  646. "source": [
  647. "### Filling In Missing Data\n",
  648. "\n",
  649. "```py\n",
  650. "fillna(\n",
  651. "    value=None,\n",
  652. "    method=None,\n",
  653. "    axis=None,\n",
  654. "    inplace=False,\n",
  655. "    limit=None,\n",
  656. "    downcast=None,\n",
  657. "    **kwargs\n",
  658. ")\n",
  659. "```\n",
  660. "\n",
  661. "* `fillna` function arguments\n",
  662. "\n",
  663. "Argument | Description\n",
  664. ":--- | :---\n",
  665. "`value` | Scalar value or dict-like object to use to fill missing values.\n",
  666. "`method` | Interpolation method, such as `ffill` or `bfill`; defaults to `None`, in which case `value` must be given.\n",
  667. "`axis` | Axis to fill on; default `axis=0`.\n",
  668. "`inplace` | Modify the calling object without producing a copy.\n",
  669. "`limit` | For forward and backward filling, maximum number of consecutive periods to fill."
  670. ]
  671. },
  672. {
  673. "cell_type": "code",
  674. "execution_count": 12,
  675. "metadata": {},
  676. "outputs": [
  677. {
  678. "data": {
  679. "text/html": [
  680. "<div>\n",
  681. "<style scoped>\n",
  682. " .dataframe tbody tr th:only-of-type {\n",
  683. " vertical-align: middle;\n",
  684. " }\n",
  685. "\n",
  686. " .dataframe tbody tr th {\n",
  687. " vertical-align: top;\n",
  688. " }\n",
  689. "\n",
  690. " .dataframe thead th {\n",
  691. " text-align: right;\n",
  692. " }\n",
  693. "</style>\n",
  694. "<table border=\"1\" class=\"dataframe\">\n",
  695. " <thead>\n",
  696. " <tr style=\"text-align: right;\">\n",
  697. " <th></th>\n",
  698. " <th>0</th>\n",
  699. " <th>1</th>\n",
  700. " <th>2</th>\n",
  701. " <th>3</th>\n",
  702. " </tr>\n",
  703. " </thead>\n",
  704. " <tbody>\n",
  705. " <tr>\n",
  706. " <th>0</th>\n",
  707. " <td>1.0</td>\n",
  708. " <td>6.5</td>\n",
  709. " <td>3.0</td>\n",
  710. " <td>NaN</td>\n",
  711. " </tr>\n",
  712. " <tr>\n",
  713. " <th>1</th>\n",
  714. " <td>1.0</td>\n",
  715. " <td>NaN</td>\n",
  716. " <td>NaN</td>\n",
  717. " <td>NaN</td>\n",
  718. " </tr>\n",
  719. " <tr>\n",
  720. " <th>2</th>\n",
  721. " <td>NaN</td>\n",
  722. " <td>NaN</td>\n",
  723. " <td>NaN</td>\n",
  724. " <td>NaN</td>\n",
  725. " </tr>\n",
  726. " <tr>\n",
  727. " <th>3</th>\n",
  728. " <td>NaN</td>\n",
  729. " <td>6.5</td>\n",
  730. " <td>3.0</td>\n",
  731. " <td>NaN</td>\n",
  732. " </tr>\n",
  733. " </tbody>\n",
  734. "</table>\n",
  735. "</div>"
  736. ],
  737. "text/plain": [
  738. " 0 1 2 3\n",
  739. "0 1.0 6.5 3.0 NaN\n",
  740. "1 1.0 NaN NaN NaN\n",
  741. "2 NaN NaN NaN NaN\n",
  742. "3 NaN 6.5 3.0 NaN"
  743. ]
  744. },
  745. "execution_count": 12,
  746. "metadata": {},
  747. "output_type": "execute_result"
  748. }
  749. ],
  750. "source": [
  751. "data"
  752. ]
  753. },
  754. {
  755. "cell_type": "code",
  756. "execution_count": 13,
  757. "metadata": {},
  758. "outputs": [
  759. {
  760. "data": {
  761. "text/html": [
  762. "<div>\n",
  763. "<style scoped>\n",
  764. " .dataframe tbody tr th:only-of-type {\n",
  765. " vertical-align: middle;\n",
  766. " }\n",
  767. "\n",
  768. " .dataframe tbody tr th {\n",
  769. " vertical-align: top;\n",
  770. " }\n",
  771. "\n",
  772. " .dataframe thead th {\n",
  773. " text-align: right;\n",
  774. " }\n",
  775. "</style>\n",
  776. "<table border=\"1\" class=\"dataframe\">\n",
  777. " <thead>\n",
  778. " <tr style=\"text-align: right;\">\n",
  779. " <th></th>\n",
  780. " <th>0</th>\n",
  781. " <th>1</th>\n",
  782. " <th>2</th>\n",
  783. " <th>3</th>\n",
  784. " </tr>\n",
  785. " </thead>\n",
  786. " <tbody>\n",
  787. " <tr>\n",
  788. " <th>0</th>\n",
  789. " <td>1.0</td>\n",
  790. " <td>6.5</td>\n",
  791. " <td>3.0</td>\n",
  792. " <td>0.0</td>\n",
  793. " </tr>\n",
  794. " <tr>\n",
  795. " <th>1</th>\n",
  796. " <td>1.0</td>\n",
  797. " <td>0.0</td>\n",
  798. " <td>0.0</td>\n",
  799. " <td>0.0</td>\n",
  800. " </tr>\n",
  801. " <tr>\n",
  802. " <th>2</th>\n",
  803. " <td>0.0</td>\n",
  804. " <td>0.0</td>\n",
  805. " <td>0.0</td>\n",
  806. " <td>0.0</td>\n",
  807. " </tr>\n",
  808. " <tr>\n",
  809. " <th>3</th>\n",
  810. " <td>0.0</td>\n",
  811. " <td>6.5</td>\n",
  812. " <td>3.0</td>\n",
  813. " <td>0.0</td>\n",
  814. " </tr>\n",
  815. " </tbody>\n",
  816. "</table>\n",
  817. "</div>"
  818. ],
  819. "text/plain": [
  820. " 0 1 2 3\n",
  821. "0 1.0 6.5 3.0 0.0\n",
  822. "1 1.0 0.0 0.0 0.0\n",
  823. "2 0.0 0.0 0.0 0.0\n",
  824. "3 0.0 6.5 3.0 0.0"
  825. ]
  826. },
  827. "execution_count": 13,
  828. "metadata": {},
  829. "output_type": "execute_result"
  830. }
  831. ],
  832. "source": [
  833. "data.fillna(0) # fill NA with value 0"
  834. ]
  835. },
  836. {
  837. "cell_type": "code",
  838. "execution_count": 14,
  839. "metadata": {},
  840. "outputs": [
  841. {
  842. "data": {
  843. "text/html": [
  844. "<div>\n",
  845. "<style scoped>\n",
  846. " .dataframe tbody tr th:only-of-type {\n",
  847. " vertical-align: middle;\n",
  848. " }\n",
  849. "\n",
  850. " .dataframe tbody tr th {\n",
  851. " vertical-align: top;\n",
  852. " }\n",
  853. "\n",
  854. " .dataframe thead th {\n",
  855. " text-align: right;\n",
  856. " }\n",
  857. "</style>\n",
  858. "<table border=\"1\" class=\"dataframe\">\n",
  859. " <thead>\n",
  860. " <tr style=\"text-align: right;\">\n",
  861. " <th></th>\n",
  862. " <th>0</th>\n",
  863. " <th>1</th>\n",
  864. " <th>2</th>\n",
  865. " <th>3</th>\n",
  866. " </tr>\n",
  867. " </thead>\n",
  868. " <tbody>\n",
  869. " <tr>\n",
  870. " <th>0</th>\n",
  871. " <td>1.0</td>\n",
  872. " <td>6.5</td>\n",
  873. " <td>3.0</td>\n",
  874. " <td>99.0</td>\n",
  875. " </tr>\n",
  876. " <tr>\n",
  877. " <th>1</th>\n",
  878. " <td>1.0</td>\n",
  879. " <td>11.0</td>\n",
  880. " <td>NaN</td>\n",
  881. " <td>99.0</td>\n",
  882. " </tr>\n",
  883. " <tr>\n",
  884. " <th>2</th>\n",
  885. " <td>NaN</td>\n",
  886. " <td>11.0</td>\n",
  887. " <td>NaN</td>\n",
  888. " <td>99.0</td>\n",
  889. " </tr>\n",
  890. " <tr>\n",
  891. " <th>3</th>\n",
  892. " <td>NaN</td>\n",
  893. " <td>6.5</td>\n",
  894. " <td>3.0</td>\n",
  895. " <td>99.0</td>\n",
  896. " </tr>\n",
  897. " </tbody>\n",
  898. "</table>\n",
  899. "</div>"
  900. ],
  901. "text/plain": [
  902. " 0 1 2 3\n",
  903. "0 1.0 6.5 3.0 99.0\n",
  904. "1 1.0 11.0 NaN 99.0\n",
  905. "2 NaN 11.0 NaN 99.0\n",
  906. "3 NaN 6.5 3.0 99.0"
  907. ]
  908. },
  909. "execution_count": 14,
  910. "metadata": {},
  911. "output_type": "execute_result"
  912. }
  913. ],
  914. "source": [
  915. "data.fillna({1: 11, 3: 99}) # use a different value for each column"
  916. ]
  917. },
  918. {
  919. "cell_type": "code",
  920. "execution_count": 15,
  921. "metadata": {},
  922. "outputs": [
  923. {
  924. "data": {
  925. "text/html": [
  926. "<div>\n",
  927. "<style scoped>\n",
  928. " .dataframe tbody tr th:only-of-type {\n",
  929. " vertical-align: middle;\n",
  930. " }\n",
  931. "\n",
  932. " .dataframe tbody tr th {\n",
  933. " vertical-align: top;\n",
  934. " }\n",
  935. "\n",
  936. " .dataframe thead th {\n",
  937. " text-align: right;\n",
  938. " }\n",
  939. "</style>\n",
  940. "<table border=\"1\" class=\"dataframe\">\n",
  941. " <thead>\n",
  942. " <tr style=\"text-align: right;\">\n",
  943. " <th></th>\n",
  944. " <th>0</th>\n",
  945. " <th>1</th>\n",
  946. " <th>2</th>\n",
  947. " <th>3</th>\n",
  948. " </tr>\n",
  949. " </thead>\n",
  950. " <tbody>\n",
  951. " <tr>\n",
  952. " <th>0</th>\n",
  953. " <td>1.0</td>\n",
  954. " <td>6.5</td>\n",
  955. " <td>3.0</td>\n",
  956. " <td>NaN</td>\n",
  957. " </tr>\n",
  958. " <tr>\n",
  959. " <th>1</th>\n",
  960. " <td>1.0</td>\n",
  961. " <td>6.5</td>\n",
  962. " <td>3.0</td>\n",
  963. " <td>NaN</td>\n",
  964. " </tr>\n",
  965. " <tr>\n",
  966. " <th>2</th>\n",
  967. " <td>1.0</td>\n",
  968. " <td>6.5</td>\n",
  969. " <td>3.0</td>\n",
  970. " <td>NaN</td>\n",
  971. " </tr>\n",
  972. " <tr>\n",
  973. " <th>3</th>\n",
  974. " <td>1.0</td>\n",
  975. " <td>6.5</td>\n",
  976. " <td>3.0</td>\n",
  977. " <td>NaN</td>\n",
  978. " </tr>\n",
  979. " </tbody>\n",
  980. "</table>\n",
  981. "</div>"
  982. ],
  983. "text/plain": [
  984. " 0 1 2 3\n",
  985. "0 1.0 6.5 3.0 NaN\n",
  986. "1 1.0 6.5 3.0 NaN\n",
  987. "2 1.0 6.5 3.0 NaN\n",
  988. "3 1.0 6.5 3.0 NaN"
  989. ]
  990. },
  991. "execution_count": 15,
  992. "metadata": {},
  993. "output_type": "execute_result"
  994. }
  995. ],
  996. "source": [
  997. "data.fillna(method='ffill')"
  998. ]
  999. },
  1000. {
  1001. "cell_type": "code",
  1002. "execution_count": 16,
  1003. "metadata": {},
  1004. "outputs": [
  1005. {
  1006. "data": {
  1007. "text/html": [
  1008. "<div>\n",
  1009. "<style scoped>\n",
  1010. " .dataframe tbody tr th:only-of-type {\n",
  1011. " vertical-align: middle;\n",
  1012. " }\n",
  1013. "\n",
  1014. " .dataframe tbody tr th {\n",
  1015. " vertical-align: top;\n",
  1016. " }\n",
  1017. "\n",
  1018. " .dataframe thead th {\n",
  1019. " text-align: right;\n",
  1020. " }\n",
  1021. "</style>\n",
  1022. "<table border=\"1\" class=\"dataframe\">\n",
  1023. " <thead>\n",
  1024. " <tr style=\"text-align: right;\">\n",
  1025. " <th></th>\n",
  1026. " <th>0</th>\n",
  1027. " <th>1</th>\n",
  1028. " <th>2</th>\n",
  1029. " <th>3</th>\n",
  1030. " </tr>\n",
  1031. " </thead>\n",
  1032. " <tbody>\n",
  1033. " <tr>\n",
  1034. " <th>0</th>\n",
  1035. " <td>1.0</td>\n",
  1036. " <td>6.5</td>\n",
  1037. " <td>3.0</td>\n",
  1038. " <td>3.0</td>\n",
  1039. " </tr>\n",
  1040. " <tr>\n",
  1041. " <th>1</th>\n",
  1042. " <td>1.0</td>\n",
  1043. " <td>1.0</td>\n",
  1044. " <td>1.0</td>\n",
  1045. " <td>NaN</td>\n",
  1046. " </tr>\n",
  1047. " <tr>\n",
  1048. " <th>2</th>\n",
  1049. " <td>NaN</td>\n",
  1050. " <td>NaN</td>\n",
  1051. " <td>NaN</td>\n",
  1052. " <td>NaN</td>\n",
  1053. " </tr>\n",
  1054. " <tr>\n",
  1055. " <th>3</th>\n",
  1056. " <td>NaN</td>\n",
  1057. " <td>6.5</td>\n",
  1058. " <td>3.0</td>\n",
  1059. " <td>3.0</td>\n",
  1060. " </tr>\n",
  1061. " </tbody>\n",
  1062. "</table>\n",
  1063. "</div>"
  1064. ],
  1065. "text/plain": [
  1066. " 0 1 2 3\n",
  1067. "0 1.0 6.5 3.0 3.0\n",
  1068. "1 1.0 1.0 1.0 NaN\n",
  1069. "2 NaN NaN NaN NaN\n",
  1070. "3 NaN 6.5 3.0 3.0"
  1071. ]
  1072. },
  1073. "execution_count": 16,
  1074. "metadata": {},
  1075. "output_type": "execute_result"
  1076. }
  1077. ],
  1078. "source": [
  1079. "data.fillna(axis=1, method='ffill', limit=2) # fill at most 2 consecutive NAs"
  1080. ]
  1081. },
  1082. {
  1083. "cell_type": "markdown",
  1084. "metadata": {},
  1085. "source": [
  1086. "## Data Transformation\n",
  1087. "\n",
  1088. "### Removing Duplicates"
  1089. ]
  1090. },
  1091. {
  1092. "cell_type": "code",
  1093. "execution_count": 17,
  1094. "metadata": {},
  1095. "outputs": [
  1096. {
  1097. "data": {
  1098. "text/html": [
  1099. "<div>\n",
  1100. "<style scoped>\n",
  1101. " .dataframe tbody tr th:only-of-type {\n",
  1102. " vertical-align: middle;\n",
  1103. " }\n",
  1104. "\n",
  1105. " .dataframe tbody tr th {\n",
  1106. " vertical-align: top;\n",
  1107. " }\n",
  1108. "\n",
  1109. " .dataframe thead th {\n",
  1110. " text-align: right;\n",
  1111. " }\n",
  1112. "</style>\n",
  1113. "<table border=\"1\" class=\"dataframe\">\n",
  1114. " <thead>\n",
  1115. " <tr style=\"text-align: right;\">\n",
  1116. " <th></th>\n",
  1117. " <th>k1</th>\n",
  1118. " <th>k2</th>\n",
  1119. " </tr>\n",
  1120. " </thead>\n",
  1121. " <tbody>\n",
  1122. " <tr>\n",
  1123. " <th>0</th>\n",
  1124. " <td>one</td>\n",
  1125. " <td>1</td>\n",
  1126. " </tr>\n",
  1127. " <tr>\n",
  1128. " <th>1</th>\n",
  1129. " <td>two</td>\n",
  1130. " <td>1</td>\n",
  1131. " </tr>\n",
  1132. " <tr>\n",
  1133. " <th>2</th>\n",
  1134. " <td>one</td>\n",
  1135. " <td>2</td>\n",
  1136. " </tr>\n",
  1137. " <tr>\n",
  1138. " <th>3</th>\n",
  1139. " <td>two</td>\n",
  1140. " <td>3</td>\n",
  1141. " </tr>\n",
  1142. " <tr>\n",
  1143. " <th>4</th>\n",
  1144. " <td>one</td>\n",
  1145. " <td>3</td>\n",
  1146. " </tr>\n",
  1147. " <tr>\n",
  1148. " <th>5</th>\n",
  1149. " <td>two</td>\n",
  1150. " <td>4</td>\n",
  1151. " </tr>\n",
  1152. " <tr>\n",
  1153. " <th>6</th>\n",
  1154. " <td>two</td>\n",
  1155. " <td>4</td>\n",
  1156. " </tr>\n",
  1157. " </tbody>\n",
  1158. "</table>\n",
  1159. "</div>"
  1160. ],
  1161. "text/plain": [
  1162. " k1 k2\n",
  1163. "0 one 1\n",
  1164. "1 two 1\n",
  1165. "2 one 2\n",
  1166. "3 two 3\n",
  1167. "4 one 3\n",
  1168. "5 two 4\n",
  1169. "6 two 4"
  1170. ]
  1171. },
  1172. "execution_count": 17,
  1173. "metadata": {},
  1174. "output_type": "execute_result"
  1175. }
  1176. ],
  1177. "source": [
  1178. "data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],\n",
  1179. "                     'k2': [1, 1, 2, 3, 3, 4, 4]})\n",
  1180. "data"
  1181. ]
  1182. },
  1183. {
  1184. "cell_type": "code",
  1185. "execution_count": 18,
  1186. "metadata": {},
  1187. "outputs": [
  1188. {
  1189. "data": {
  1190. "text/plain": [
  1191. "0 False\n",
  1192. "1 False\n",
  1193. "2 False\n",
  1194. "3 False\n",
  1195. "4 False\n",
  1196. "5 False\n",
  1197. "6 True\n",
  1198. "dtype: bool"
  1199. ]
  1200. },
  1201. "execution_count": 18,
  1202. "metadata": {},
  1203. "output_type": "execute_result"
  1204. }
  1205. ],
  1206. "source": [
  1207. "data.duplicated()"
  1208. ]
  1209. },
  1210. {
  1211. "cell_type": "code",
  1212. "execution_count": 19,
  1213. "metadata": {},
  1214. "outputs": [
  1215. {
  1216. "data": {
  1217. "text/html": [
  1218. "<div>\n",
  1219. "<style scoped>\n",
  1220. " .dataframe tbody tr th:only-of-type {\n",
  1221. " vertical-align: middle;\n",
  1222. " }\n",
  1223. "\n",
  1224. " .dataframe tbody tr th {\n",
  1225. " vertical-align: top;\n",
  1226. " }\n",
  1227. "\n",
  1228. " .dataframe thead th {\n",
  1229. " text-align: right;\n",
  1230. " }\n",
  1231. "</style>\n",
  1232. "<table border=\"1\" class=\"dataframe\">\n",
  1233. " <thead>\n",
  1234. " <tr style=\"text-align: right;\">\n",
  1235. " <th></th>\n",
  1236. " <th>k1</th>\n",
  1237. " <th>k2</th>\n",
  1238. " </tr>\n",
  1239. " </thead>\n",
  1240. " <tbody>\n",
  1241. " <tr>\n",
  1242. " <th>0</th>\n",
  1243. " <td>one</td>\n",
  1244. " <td>1</td>\n",
  1245. " </tr>\n",
  1246. " <tr>\n",
  1247. " <th>1</th>\n",
  1248. " <td>two</td>\n",
  1249. " <td>1</td>\n",
  1250. " </tr>\n",
  1251. " <tr>\n",
  1252. " <th>2</th>\n",
  1253. " <td>one</td>\n",
  1254. " <td>2</td>\n",
  1255. " </tr>\n",
  1256. " <tr>\n",
  1257. " <th>3</th>\n",
  1258. " <td>two</td>\n",
  1259. " <td>3</td>\n",
  1260. " </tr>\n",
  1261. " <tr>\n",
  1262. " <th>4</th>\n",
  1263. " <td>one</td>\n",
  1264. " <td>3</td>\n",
  1265. " </tr>\n",
  1266. " <tr>\n",
  1267. " <th>5</th>\n",
  1268. " <td>two</td>\n",
  1269. " <td>4</td>\n",
  1270. " </tr>\n",
  1271. " </tbody>\n",
  1272. "</table>\n",
  1273. "</div>"
  1274. ],
  1275. "text/plain": [
  1276. " k1 k2\n",
  1277. "0 one 1\n",
  1278. "1 two 1\n",
  1279. "2 one 2\n",
  1280. "3 two 3\n",
  1281. "4 one 3\n",
  1282. "5 two 4"
  1283. ]
  1284. },
  1285. "execution_count": 19,
  1286. "metadata": {},
  1287. "output_type": "execute_result"
  1288. }
  1289. ],
  1290. "source": [
  1291. "data.drop_duplicates()"
  1292. ]
  1293. },
  1294. {
  1295. "cell_type": "code",
  1296. "execution_count": 20,
  1297. "metadata": {},
  1298. "outputs": [
  1299. {
  1300. "data": {
  1301. "text/html": [
  1302. "<div>\n",
  1303. "<style scoped>\n",
  1304. " .dataframe tbody tr th:only-of-type {\n",
  1305. " vertical-align: middle;\n",
  1306. " }\n",
  1307. "\n",
  1308. " .dataframe tbody tr th {\n",
  1309. " vertical-align: top;\n",
  1310. " }\n",
  1311. "\n",
  1312. " .dataframe thead th {\n",
  1313. " text-align: right;\n",
  1314. " }\n",
  1315. "</style>\n",
  1316. "<table border=\"1\" class=\"dataframe\">\n",
  1317. " <thead>\n",
  1318. " <tr style=\"text-align: right;\">\n",
  1319. " <th></th>\n",
  1320. " <th>k1</th>\n",
  1321. " <th>k2</th>\n",
  1322. " <th>v1</th>\n",
  1323. " </tr>\n",
  1324. " </thead>\n",
  1325. " <tbody>\n",
  1326. " <tr>\n",
  1327. " <th>0</th>\n",
  1328. " <td>one</td>\n",
  1329. " <td>1</td>\n",
  1330. " <td>0</td>\n",
  1331. " </tr>\n",
  1332. " <tr>\n",
  1333. " <th>1</th>\n",
  1334. " <td>two</td>\n",
  1335. " <td>1</td>\n",
  1336. " <td>1</td>\n",
  1337. " </tr>\n",
  1338. " </tbody>\n",
  1339. "</table>\n",
  1340. "</div>"
  1341. ],
  1342. "text/plain": [
  1343. " k1 k2 v1\n",
  1344. "0 one 1 0\n",
  1345. "1 two 1 1"
  1346. ]
  1347. },
  1348. "execution_count": 20,
  1349. "metadata": {},
  1350. "output_type": "execute_result"
  1351. }
  1352. ],
  1353. "source": [
  1354. "data['v1'] = range(7)\n",
  1355. "data.drop_duplicates(['k1'])"
  1356. ]
  1357. },
  1358. {
  1359. "cell_type": "code",
  1360. "execution_count": 21,
  1361. "metadata": {},
  1362. "outputs": [
  1363. {
  1364. "data": {
  1365. "text/html": [
  1366. "<div>\n",
  1367. "<style scoped>\n",
  1368. " .dataframe tbody tr th:only-of-type {\n",
  1369. " vertical-align: middle;\n",
  1370. " }\n",
  1371. "\n",
  1372. " .dataframe tbody tr th {\n",
  1373. " vertical-align: top;\n",
  1374. " }\n",
  1375. "\n",
  1376. " .dataframe thead th {\n",
  1377. " text-align: right;\n",
  1378. " }\n",
  1379. "</style>\n",
  1380. "<table border=\"1\" class=\"dataframe\">\n",
  1381. " <thead>\n",
  1382. " <tr style=\"text-align: right;\">\n",
  1383. " <th></th>\n",
  1384. " <th>k1</th>\n",
  1385. " <th>k2</th>\n",
  1386. " <th>v1</th>\n",
  1387. " </tr>\n",
  1388. " </thead>\n",
  1389. " <tbody>\n",
  1390. " <tr>\n",
  1391. " <th>0</th>\n",
  1392. " <td>one</td>\n",
  1393. " <td>1</td>\n",
  1394. " <td>0</td>\n",
  1395. " </tr>\n",
  1396. " <tr>\n",
  1397. " <th>1</th>\n",
  1398. " <td>two</td>\n",
  1399. " <td>1</td>\n",
  1400. " <td>1</td>\n",
  1401. " </tr>\n",
  1402. " <tr>\n",
  1403. " <th>2</th>\n",
  1404. " <td>one</td>\n",
  1405. " <td>2</td>\n",
  1406. " <td>2</td>\n",
  1407. " </tr>\n",
  1408. " <tr>\n",
  1409. " <th>3</th>\n",
  1410. " <td>two</td>\n",
  1411. " <td>3</td>\n",
  1412. " <td>3</td>\n",
  1413. " </tr>\n",
  1414. " <tr>\n",
  1415. " <th>4</th>\n",
  1416. " <td>one</td>\n",
  1417. " <td>3</td>\n",
  1418. " <td>4</td>\n",
  1419. " </tr>\n",
  1420. " <tr>\n",
  1421. " <th>6</th>\n",
  1422. " <td>two</td>\n",
  1423. " <td>4</td>\n",
  1424. " <td>6</td>\n",
  1425. " </tr>\n",
  1426. " </tbody>\n",
  1427. "</table>\n",
  1428. "</div>"
  1429. ],
  1430. "text/plain": [
  1431. " k1 k2 v1\n",
  1432. "0 one 1 0\n",
  1433. "1 two 1 1\n",
  1434. "2 one 2 2\n",
  1435. "3 two 3 3\n",
  1436. "4 one 3 4\n",
  1437. "6 two 4 6"
  1438. ]
  1439. },
  1440. "execution_count": 21,
  1441. "metadata": {},
  1442. "output_type": "execute_result"
  1443. }
  1444. ],
  1445. "source": [
  1446. "data.drop_duplicates(['k1', 'k2'], keep='last')"
  1447. ]
  1448. },
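{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how `drop_duplicates` relates to `duplicated`, assuming a small hypothetical frame `df` (not the `data` frame used above). By default both methods compare all columns and treat the first occurrence as the one to keep.\n",
"\n",
"```py\n",
"import pandas as pd\n",
"\n",
"# Hypothetical frame for illustration only.\n",
"df = pd.DataFrame({'k1': ['one', 'two', 'one', 'two'],\n",
"                   'k2': [1, 1, 1, 2]})\n",
"\n",
"# duplicated() flags rows that repeat an earlier row (keep='first' by default).\n",
"mask = df.duplicated()\n",
"\n",
"# Filtering with the inverted mask gives the same rows as drop_duplicates().\n",
"same = df[~mask].equals(df.drop_duplicates())\n",
"\n",
"# keep=False discards every row that belongs to a duplicated group.\n",
"unique_only = df.drop_duplicates(keep=False)\n",
"```\n",
"\n",
"Here row 2 repeats row 0, so `mask` is `True` only at index 2, and `unique_only` retains the rows at index 1 and 3."
]
},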
  1449. {
  1450. "cell_type": "markdown",
  1451. "metadata": {},
  1452. "source": [
  1453. "### Transforming Data Using a Function or Mapping"
  1454. ]
  1455. },
  1456. {
  1457. "cell_type": "code",
  1458. "execution_count": 22,
  1459. "metadata": {},
  1460. "outputs": [
  1461. {
  1462. "data": {
  1463. "text/html": [
  1464. "<div>\n",
  1465. "<style scoped>\n",
  1466. " .dataframe tbody tr th:only-of-type {\n",
  1467. " vertical-align: middle;\n",
  1468. " }\n",
  1469. "\n",
  1470. " .dataframe tbody tr th {\n",
  1471. " vertical-align: top;\n",
  1472. " }\n",
  1473. "\n",
  1474. " .dataframe thead th {\n",
  1475. " text-align: right;\n",
  1476. " }\n",
  1477. "</style>\n",
  1478. "<table border=\"1\" class=\"dataframe\">\n",
  1479. " <thead>\n",
  1480. " <tr style=\"text-align: right;\">\n",
  1481. " <th></th>\n",
  1482. " <th>food</th>\n",
  1483. " <th>ounces</th>\n",
  1484. " </tr>\n",
  1485. " </thead>\n",
  1486. " <tbody>\n",
  1487. " <tr>\n",
  1488. " <th>0</th>\n",
  1489. " <td>bacon</td>\n",
  1490. " <td>4.0</td>\n",
  1491. " </tr>\n",
  1492. " <tr>\n",
  1493. " <th>1</th>\n",
  1494. " <td>pulled pork</td>\n",
  1495. " <td>3.0</td>\n",
  1496. " </tr>\n",
  1497. " <tr>\n",
  1498. " <th>2</th>\n",
  1499. " <td>bacon</td>\n",
  1500. " <td>12.0</td>\n",
  1501. " </tr>\n",
  1502. " <tr>\n",
  1503. " <th>3</th>\n",
  1504. " <td>Pastrami</td>\n",
  1505. " <td>6.0</td>\n",
  1506. " </tr>\n",
  1507. " <tr>\n",
  1508. " <th>4</th>\n",
  1509. " <td>corned beef</td>\n",
  1510. " <td>7.5</td>\n",
  1511. " </tr>\n",
  1512. " <tr>\n",
  1513. " <th>5</th>\n",
  1514. " <td>Bacon</td>\n",
  1515. " <td>8.0</td>\n",
  1516. " </tr>\n",
  1517. " <tr>\n",
  1518. " <th>6</th>\n",
  1519. " <td>pastrami</td>\n",
  1520. " <td>3.0</td>\n",
  1521. " </tr>\n",
  1522. " <tr>\n",
  1523. " <th>7</th>\n",
  1524. " <td>honey ham</td>\n",
  1525. " <td>5.0</td>\n",
  1526. " </tr>\n",
  1527. " <tr>\n",
  1528. " <th>8</th>\n",
  1529. " <td>nova lox</td>\n",
  1530. " <td>6.0</td>\n",
  1531. " </tr>\n",
  1532. " </tbody>\n",
  1533. "</table>\n",
  1534. "</div>"
  1535. ],
  1536. "text/plain": [
  1537. " food ounces\n",
  1538. "0 bacon 4.0\n",
  1539. "1 pulled pork 3.0\n",
  1540. "2 bacon 12.0\n",
  1541. "3 Pastrami 6.0\n",
  1542. "4 corned beef 7.5\n",
  1543. "5 Bacon 8.0\n",
  1544. "6 pastrami 3.0\n",
  1545. "7 honey ham 5.0\n",
  1546. "8 nova lox 6.0"
  1547. ]
  1548. },
  1549. "execution_count": 22,
  1550. "metadata": {},
  1551. "output_type": "execute_result"
  1552. }
  1553. ],
  1554. "source": [
  1555. "data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',\n",
  1556. " 'Pastrami', 'corned beef', 'Bacon',\n",
  1557. " 'pastrami', 'honey ham', 'nova lox'],\n",
  1558. " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n",
  1559. "data"
  1560. ]
  1561. },
  1562. {
  1563. "cell_type": "code",
  1564. "execution_count": 23,
  1565. "metadata": {},
  1566. "outputs": [],
  1567. "source": [
  1568. "meat_to_animal = {'bacon': 'pig',\n",
  1569. " 'pulled pork': 'pig',\n",
  1570. " 'pastrami': 'cow',\n",
  1571. " 'corned beef': 'cow',\n",
  1572. " 'honey ham': 'pig',\n",
  1573. " 'nova lox': 'salmon'}"
  1574. ]
  1575. },
  1576. {
  1577. "cell_type": "code",
  1578. "execution_count": 24,
  1579. "metadata": {},
  1580. "outputs": [
  1581. {
  1582. "data": {
  1583. "text/plain": [
  1584. "0 bacon\n",
  1585. "1 pulled pork\n",
  1586. "2 bacon\n",
  1587. "3 pastrami\n",
  1588. "4 corned beef\n",
  1589. "5 bacon\n",
  1590. "6 pastrami\n",
  1591. "7 honey ham\n",
  1592. "8 nova lox\n",
  1593. "Name: food, dtype: object"
  1594. ]
  1595. },
  1596. "execution_count": 24,
  1597. "metadata": {},
  1598. "output_type": "execute_result"
  1599. }
  1600. ],
  1601. "source": [
  1602. "lowercased = data['food'].str.lower()  # equivalent: data['food'].map(str.lower)\n",
  1603. "lowercased"
  1604. ]
  1605. },
  1606. {
  1607. "cell_type": "code",
  1608. "execution_count": 25,
  1609. "metadata": {},
  1610. "outputs": [
  1611. {
  1612. "data": {
  1613. "text/html": [
  1614. "<div>\n",
  1615. "<style scoped>\n",
  1616. " .dataframe tbody tr th:only-of-type {\n",
  1617. " vertical-align: middle;\n",
  1618. " }\n",
  1619. "\n",
  1620. " .dataframe tbody tr th {\n",
  1621. " vertical-align: top;\n",
  1622. " }\n",
  1623. "\n",
  1624. " .dataframe thead th {\n",
  1625. " text-align: right;\n",
  1626. " }\n",
  1627. "</style>\n",
  1628. "<table border=\"1\" class=\"dataframe\">\n",
  1629. " <thead>\n",
  1630. " <tr style=\"text-align: right;\">\n",
  1631. " <th></th>\n",
  1632. " <th>food</th>\n",
  1633. " <th>ounces</th>\n",
  1634. " <th>animal</th>\n",
  1635. " </tr>\n",
  1636. " </thead>\n",
  1637. " <tbody>\n",
  1638. " <tr>\n",
  1639. " <th>0</th>\n",
  1640. " <td>bacon</td>\n",
  1641. " <td>4.0</td>\n",
  1642. " <td>pig</td>\n",
  1643. " </tr>\n",
  1644. " <tr>\n",
  1645. " <th>1</th>\n",
  1646. " <td>pulled pork</td>\n",
  1647. " <td>3.0</td>\n",
  1648. " <td>pig</td>\n",
  1649. " </tr>\n",
  1650. " <tr>\n",
  1651. " <th>2</th>\n",
  1652. " <td>bacon</td>\n",
  1653. " <td>12.0</td>\n",
  1654. " <td>pig</td>\n",
  1655. " </tr>\n",
  1656. " <tr>\n",
  1657. " <th>3</th>\n",
  1658. " <td>Pastrami</td>\n",
  1659. " <td>6.0</td>\n",
  1660. " <td>cow</td>\n",
  1661. " </tr>\n",
  1662. " <tr>\n",
  1663. " <th>4</th>\n",
  1664. " <td>corned beef</td>\n",
  1665. " <td>7.5</td>\n",
  1666. " <td>cow</td>\n",
  1667. " </tr>\n",
  1668. " <tr>\n",
  1669. " <th>5</th>\n",
  1670. " <td>Bacon</td>\n",
  1671. " <td>8.0</td>\n",
  1672. " <td>pig</td>\n",
  1673. " </tr>\n",
  1674. " <tr>\n",
  1675. " <th>6</th>\n",
  1676. " <td>pastrami</td>\n",
  1677. " <td>3.0</td>\n",
  1678. " <td>cow</td>\n",
  1679. " </tr>\n",
  1680. " <tr>\n",
  1681. " <th>7</th>\n",
  1682. " <td>honey ham</td>\n",
  1683. " <td>5.0</td>\n",
  1684. " <td>pig</td>\n",
  1685. " </tr>\n",
  1686. " <tr>\n",
  1687. " <th>8</th>\n",
  1688. " <td>nova lox</td>\n",
  1689. " <td>6.0</td>\n",
  1690. " <td>salmon</td>\n",
  1691. " </tr>\n",
  1692. " </tbody>\n",
  1693. "</table>\n",
  1694. "</div>"
  1695. ],
  1696. "text/plain": [
  1697. " food ounces animal\n",
  1698. "0 bacon 4.0 pig\n",
  1699. "1 pulled pork 3.0 pig\n",
  1700. "2 bacon 12.0 pig\n",
  1701. "3 Pastrami 6.0 cow\n",
  1702. "4 corned beef 7.5 cow\n",
  1703. "5 Bacon 8.0 pig\n",
  1704. "6 pastrami 3.0 cow\n",
  1705. "7 honey ham 5.0 pig\n",
  1706. "8 nova lox 6.0 salmon"
  1707. ]
  1708. },
  1709. "execution_count": 25,
  1710. "metadata": {},
  1711. "output_type": "execute_result"
  1712. }
  1713. ],
  1714. "source": [
  1715. "data['animal'] = lowercased.map(meat_to_animal)\n",
  1716. "data"
  1717. ]
  1718. },
  1719. {
  1720. "cell_type": "code",
  1721. "execution_count": 26,
  1722. "metadata": {},
  1723. "outputs": [
  1724. {
  1725. "data": {
  1726. "text/plain": [
  1727. "0 pig\n",
  1728. "1 pig\n",
  1729. "2 pig\n",
  1730. "3 cow\n",
  1731. "4 cow\n",
  1732. "5 pig\n",
  1733. "6 cow\n",
  1734. "7 pig\n",
  1735. "8 salmon\n",
  1736. "Name: food, dtype: object"
  1737. ]
  1738. },
  1739. "execution_count": 26,
  1740. "metadata": {},
  1741. "output_type": "execute_result"
  1742. }
  1743. ],
  1744. "source": [
  1745. "data['food'].map(lambda x: meat_to_animal[x.lower()])"
  1746. ]
  1747. },
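{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two approaches above differ when a value has no entry in the mapping. The sketch below assumes hypothetical data (`foods`, `to_animal`) in which `'tofu'` is unmapped: passing the dict to `map` yields `NaN`, while indexing the dict inside a function would raise `KeyError`.\n",
"\n",
"```py\n",
"import pandas as pd\n",
"\n",
"# Hypothetical data: 'tofu' has no entry in the mapping.\n",
"foods = pd.Series(['bacon', 'Pastrami', 'tofu'])\n",
"to_animal = {'bacon': 'pig', 'pastrami': 'cow'}\n",
"\n",
"# Mapping with a dict leaves unmatched values as NaN.\n",
"mapped = foods.str.lower().map(to_animal)\n",
"\n",
"# to_animal[x.lower()] would raise KeyError for 'tofu';\n",
"# dict.get with a default avoids that.\n",
"safe = foods.map(lambda x: to_animal.get(x.lower(), 'unknown'))\n",
"```\n",
"\n",
"Which behaviour is preferable depends on whether unmapped values should surface as missing data or as an explicit placeholder."
]
},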
  1748. {
  1749. "cell_type": "markdown",
  1750. "metadata": {},
  1751. "source": [
  1752. "### Replacing Values\n",
  1753. "\n",
  1754. "```py\n",
  1755. "Series.replace(\n",
  1756. " to_replace=None,\n",
  1757. " value=None,\n",
  1758. " inplace=False,\n",
  1759. " limit=None,\n",
  1760. " regex=False,\n",
  1761. " method='pad'\n",
  1762. ")\n",
  1763. "```"
  1764. ]
  1765. },
  1766. {
  1767. "cell_type": "code",
  1768. "execution_count": 27,
  1769. "metadata": {},
  1770. "outputs": [
  1771. {
  1772. "data": {
  1773. "text/plain": [
  1774. "0 1.0\n",
  1775. "1 -999.0\n",
  1776. "2 2.0\n",
  1777. "3 -999.0\n",
  1778. "4 -1000.0\n",
  1779. "5 3.0\n",
  1780. "dtype: float64"
  1781. ]
  1782. },
  1783. "execution_count": 27,
  1784. "metadata": {},
  1785. "output_type": "execute_result"
  1786. }
  1787. ],
  1788. "source": [
  1789. "data = pd.Series([1., -999, 2, -999, -1000, 3.])\n",
  1790. "data"
  1791. ]
  1792. },
  1793. {
  1794. "cell_type": "code",
  1795. "execution_count": 28,
  1796. "metadata": {},
  1797. "outputs": [
  1798. {
  1799. "data": {
  1800. "text/plain": [
  1801. "0 1.0\n",
  1802. "1 NaN\n",
  1803. "2 2.0\n",
  1804. "3 NaN\n",
  1805. "4 -1000.0\n",
  1806. "5 3.0\n",
  1807. "dtype: float64"
  1808. ]
  1809. },
  1810. "execution_count": 28,
  1811. "metadata": {},
  1812. "output_type": "execute_result"
  1813. }
  1814. ],
  1815. "source": [
  1816. "data.replace(-999, np.nan)"
  1817. ]
  1818. },
  1819. {
  1820. "cell_type": "code",
  1821. "execution_count": 29,
  1822. "metadata": {},
  1823. "outputs": [
  1824. {
  1825. "data": {
  1826. "text/plain": [
  1827. "0 1.0\n",
  1828. "1 NaN\n",
  1829. "2 2.0\n",
  1830. "3 NaN\n",
  1831. "4 NaN\n",
  1832. "5 3.0\n",
  1833. "dtype: float64"
  1834. ]
  1835. },
  1836. "execution_count": 29,
  1837. "metadata": {},
  1838. "output_type": "execute_result"
  1839. }
  1840. ],
  1841. "source": [
  1842. "data.replace([-999, -1000], np.nan)"
  1843. ]
  1844. },
  1845. {
  1846. "cell_type": "code",
  1847. "execution_count": 30,
  1848. "metadata": {},
  1849. "outputs": [
  1850. {
  1851. "data": {
  1852. "text/plain": [
  1853. "0 1.0\n",
  1854. "1 NaN\n",
  1855. "2 2.0\n",
  1856. "3 NaN\n",
  1857. "4 0.0\n",
  1858. "5 3.0\n",
  1859. "dtype: float64"
  1860. ]
  1861. },
  1862. "execution_count": 30,
  1863. "metadata": {},
  1864. "output_type": "execute_result"
  1865. }
  1866. ],
  1867. "source": [
  1868. "data.replace({-999: np.nan, -1000: 0}) # or data.replace([-999, -1000], [np.nan, 0])"
  1869. ]
  1870. },
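{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `replace` returns a new object unless `inplace=True` is passed. A minimal sketch, assuming a hypothetical Series `readings` that uses `-999` and `-1000` as sentinel values:\n",
"\n",
"```py\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical readings with sentinel values for missing data.\n",
"readings = pd.Series([1., -999., 2., -1000.])\n",
"\n",
"# replace returns a new Series; the result must be assigned to be kept.\n",
"cleaned = readings.replace({-999: np.nan, -1000: 0})\n",
"\n",
"n_missing = cleaned.isnull().sum()     # one value became NaN\n",
"still_raw = (readings == -999).any()   # the original is unchanged\n",
"```"
]
},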
  1871. {
  1872. "cell_type": "markdown",
  1873. "metadata": {},
  1874. "source": [
  1875. "### Renaming Axis Indexes\n",
  1876. "\n",
  1877. "```py\n",
  1878. "DataFrame.rename(\n",
  1879. " mapper=None,\n",
  1880. " index=None,\n",
  1881. " columns=None,\n",
  1882. " axis=None,\n",
  1883. " copy=True,\n",
  1884. " inplace=False,\n",
  1885. " level=None\n",
  1886. ")\n",
  1887. "```"
  1888. ]
  1889. },
  1890. {
  1891. "cell_type": "code",
  1892. "execution_count": 31,
  1893. "metadata": {},
  1894. "outputs": [
  1895. {
  1896. "data": {
  1897. "text/html": [
  1898. "<div>\n",
  1899. "<style scoped>\n",
  1900. " .dataframe tbody tr th:only-of-type {\n",
  1901. " vertical-align: middle;\n",
  1902. " }\n",
  1903. "\n",
  1904. " .dataframe tbody tr th {\n",
  1905. " vertical-align: top;\n",
  1906. " }\n",
  1907. "\n",
  1908. " .dataframe thead th {\n",
  1909. " text-align: right;\n",
  1910. " }\n",
  1911. "</style>\n",
  1912. "<table border=\"1\" class=\"dataframe\">\n",
  1913. " <thead>\n",
  1914. " <tr style=\"text-align: right;\">\n",
  1915. " <th></th>\n",
  1916. " <th>one</th>\n",
  1917. " <th>two</th>\n",
  1918. " <th>three</th>\n",
  1919. " <th>four</th>\n",
  1920. " </tr>\n",
  1921. " </thead>\n",
  1922. " <tbody>\n",
  1923. " <tr>\n",
  1924. " <th>Ohio</th>\n",
  1925. " <td>0</td>\n",
  1926. " <td>1</td>\n",
  1927. " <td>2</td>\n",
  1928. " <td>3</td>\n",
  1929. " </tr>\n",
  1930. " <tr>\n",
  1931. " <th>Colorado</th>\n",
  1932. " <td>4</td>\n",
  1933. " <td>5</td>\n",
  1934. " <td>6</td>\n",
  1935. " <td>7</td>\n",
  1936. " </tr>\n",
  1937. " <tr>\n",
  1938. " <th>New York</th>\n",
  1939. " <td>8</td>\n",
  1940. " <td>9</td>\n",
  1941. " <td>10</td>\n",
  1942. " <td>11</td>\n",
  1943. " </tr>\n",
  1944. " </tbody>\n",
  1945. "</table>\n",
  1946. "</div>"
  1947. ],
  1948. "text/plain": [
  1949. " one two three four\n",
  1950. "Ohio 0 1 2 3\n",
  1951. "Colorado 4 5 6 7\n",
  1952. "New York 8 9 10 11"
  1953. ]
  1954. },
  1955. "execution_count": 31,
  1956. "metadata": {},
  1957. "output_type": "execute_result"
  1958. }
  1959. ],
  1960. "source": [
  1961. "data = pd.DataFrame(np.arange(12).reshape((3, 4)),\n",
  1962. " index=['Ohio', 'Colorado', 'New York'],\n",
  1963. " columns=['one', 'two', 'three', 'four'])\n",
  1964. "data"
  1965. ]
  1966. },
  1967. {
  1968. "cell_type": "code",
  1969. "execution_count": 32,
  1970. "metadata": {},
  1971. "outputs": [
  1972. {
  1973. "data": {
  1974. "text/plain": [
  1975. "Index(['OHIO', 'COLO', 'NEW '], dtype='object')"
  1976. ]
  1977. },
  1978. "execution_count": 32,
  1979. "metadata": {},
  1980. "output_type": "execute_result"
  1981. }
  1982. ],
  1983. "source": [
  1984. "transform = lambda x: x[:4].upper()\n",
  1985. "data.index.map(transform)"
  1986. ]
  1987. },
  1988. {
  1989. "cell_type": "code",
  1990. "execution_count": 33,
  1991. "metadata": {},
  1992. "outputs": [
  1993. {
  1994. "data": {
  1995. "text/html": [
  1996. "<div>\n",
  1997. "<style scoped>\n",
  1998. " .dataframe tbody tr th:only-of-type {\n",
  1999. " vertical-align: middle;\n",
  2000. " }\n",
  2001. "\n",
  2002. " .dataframe tbody tr th {\n",
  2003. " vertical-align: top;\n",
  2004. " }\n",
  2005. "\n",
  2006. " .dataframe thead th {\n",
  2007. " text-align: right;\n",
  2008. " }\n",
  2009. "</style>\n",
  2010. "<table border=\"1\" class=\"dataframe\">\n",
  2011. " <thead>\n",
  2012. " <tr style=\"text-align: right;\">\n",
  2013. " <th></th>\n",
  2014. " <th>one</th>\n",
  2015. " <th>two</th>\n",
  2016. " <th>three</th>\n",
  2017. " <th>four</th>\n",
  2018. " </tr>\n",
  2019. " </thead>\n",
  2020. " <tbody>\n",
  2021. " <tr>\n",
  2022. " <th>OHIO</th>\n",
  2023. " <td>0</td>\n",
  2024. " <td>1</td>\n",
  2025. " <td>2</td>\n",
  2026. " <td>3</td>\n",
  2027. " </tr>\n",
  2028. " <tr>\n",
  2029. " <th>COLO</th>\n",
  2030. " <td>4</td>\n",
  2031. " <td>5</td>\n",
  2032. " <td>6</td>\n",
  2033. " <td>7</td>\n",
  2034. " </tr>\n",
  2035. " <tr>\n",
  2036. " <th>NEW</th>\n",
  2037. " <td>8</td>\n",
  2038. " <td>9</td>\n",
  2039. " <td>10</td>\n",
  2040. " <td>11</td>\n",
  2041. " </tr>\n",
  2042. " </tbody>\n",
  2043. "</table>\n",
  2044. "</div>"
  2045. ],
  2046. "text/plain": [
  2047. " one two three four\n",
  2048. "OHIO 0 1 2 3\n",
  2049. "COLO 4 5 6 7\n",
  2050. "NEW 8 9 10 11"
  2051. ]
  2052. },
  2053. "execution_count": 33,
  2054. "metadata": {},
  2055. "output_type": "execute_result"
  2056. }
  2057. ],
  2058. "source": [
  2059. "data.index = data.index.map(transform)\n",
  2060. "data"
  2061. ]
  2062. },
  2063. {
  2064. "cell_type": "code",
  2065. "execution_count": 34,
  2066. "metadata": {},
  2067. "outputs": [
  2068. {
  2069. "data": {
  2070. "text/html": [
  2071. "<div>\n",
  2072. "<style scoped>\n",
  2073. " .dataframe tbody tr th:only-of-type {\n",
  2074. " vertical-align: middle;\n",
  2075. " }\n",
  2076. "\n",
  2077. " .dataframe tbody tr th {\n",
  2078. " vertical-align: top;\n",
  2079. " }\n",
  2080. "\n",
  2081. " .dataframe thead th {\n",
  2082. " text-align: right;\n",
  2083. " }\n",
  2084. "</style>\n",
  2085. "<table border=\"1\" class=\"dataframe\">\n",
  2086. " <thead>\n",
  2087. " <tr style=\"text-align: right;\">\n",
  2088. " <th></th>\n",
  2089. " <th>ONE</th>\n",
  2090. " <th>TWO</th>\n",
  2091. " <th>THREE</th>\n",
  2092. " <th>FOUR</th>\n",
  2093. " </tr>\n",
  2094. " </thead>\n",
  2095. " <tbody>\n",
  2096. " <tr>\n",
  2097. " <th>Ohio</th>\n",
  2098. " <td>0</td>\n",
  2099. " <td>1</td>\n",
  2100. " <td>2</td>\n",
  2101. " <td>3</td>\n",
  2102. " </tr>\n",
  2103. " <tr>\n",
  2104. " <th>Colo</th>\n",
  2105. " <td>4</td>\n",
  2106. " <td>5</td>\n",
  2107. " <td>6</td>\n",
  2108. " <td>7</td>\n",
  2109. " </tr>\n",
  2110. " <tr>\n",
  2111. " <th>New</th>\n",
  2112. " <td>8</td>\n",
  2113. " <td>9</td>\n",
  2114. " <td>10</td>\n",
  2115. " <td>11</td>\n",
  2116. " </tr>\n",
  2117. " </tbody>\n",
  2118. "</table>\n",
  2119. "</div>"
  2120. ],
  2121. "text/plain": [
  2122. " ONE TWO THREE FOUR\n",
  2123. "Ohio 0 1 2 3\n",
  2124. "Colo 4 5 6 7\n",
  2125. "New 8 9 10 11"
  2126. ]
  2127. },
  2128. "execution_count": 34,
  2129. "metadata": {},
  2130. "output_type": "execute_result"
  2131. }
  2132. ],
  2133. "source": [
  2134. "# Pass functions to transform both the row index and the column labels.\n",
  2135. "data.rename(index=str.title, columns=str.upper)"
  2136. ]
  2137. },
  2138. {
  2139. "cell_type": "code",
  2140. "execution_count": 35,
  2141. "metadata": {},
  2142. "outputs": [
  2143. {
  2144. "data": {
  2145. "text/html": [
  2146. "<div>\n",
  2147. "<style scoped>\n",
  2148. " .dataframe tbody tr th:only-of-type {\n",
  2149. " vertical-align: middle;\n",
  2150. " }\n",
  2151. "\n",
  2152. " .dataframe tbody tr th {\n",
  2153. " vertical-align: top;\n",
  2154. " }\n",
  2155. "\n",
  2156. " .dataframe thead th {\n",
  2157. " text-align: right;\n",
  2158. " }\n",
  2159. "</style>\n",
  2160. "<table border=\"1\" class=\"dataframe\">\n",
  2161. " <thead>\n",
  2162. " <tr style=\"text-align: right;\">\n",
  2163. " <th></th>\n",
  2164. " <th>one</th>\n",
  2165. " <th>two</th>\n",
  2166. " <th>peekaboo</th>\n",
  2167. " <th>four</th>\n",
  2168. " </tr>\n",
  2169. " </thead>\n",
  2170. " <tbody>\n",
  2171. " <tr>\n",
  2172. " <th>INDIANA</th>\n",
  2173. " <td>0</td>\n",
  2174. " <td>1</td>\n",
  2175. " <td>2</td>\n",
  2176. " <td>3</td>\n",
  2177. " </tr>\n",
  2178. " <tr>\n",
  2179. " <th>COLO</th>\n",
  2180. " <td>4</td>\n",
  2181. " <td>5</td>\n",
  2182. " <td>6</td>\n",
  2183. " <td>7</td>\n",
  2184. " </tr>\n",
  2185. " <tr>\n",
  2186. " <th>NEW</th>\n",
  2187. " <td>8</td>\n",
  2188. " <td>9</td>\n",
  2189. " <td>10</td>\n",
  2190. " <td>11</td>\n",
  2191. " </tr>\n",
  2192. " </tbody>\n",
  2193. "</table>\n",
  2194. "</div>"
  2195. ],
  2196. "text/plain": [
  2197. " one two peekaboo four\n",
  2198. "INDIANA 0 1 2 3\n",
  2199. "COLO 4 5 6 7\n",
  2200. "NEW 8 9 10 11"
  2201. ]
  2202. },
  2203. "execution_count": 35,
  2204. "metadata": {},
  2205. "output_type": "execute_result"
  2206. }
  2207. ],
  2208. "source": [
  2209. "# Pass dicts to rename only the listed labels.\n",
  2210. "data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})"
  2211. ]
  2212. },
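{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike assigning to `data.index`, `rename` leaves the original object untouched and returns a relabelled copy. A small sketch, assuming a hypothetical frame `df` shaped like the one above:\n",
"\n",
"```py\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical frame for illustration only.\n",
"df = pd.DataFrame(np.arange(6).reshape((2, 3)),\n",
"                  index=['OHIO', 'COLO'],\n",
"                  columns=['one', 'two', 'three'])\n",
"\n",
"# Labels that are missing from the dicts are left as they are.\n",
"renamed = df.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})\n",
"\n",
"# df keeps its original labels; only renamed carries the new ones.\n",
"original_labels = (list(df.index), list(df.columns))\n",
"new_labels = (list(renamed.index), list(renamed.columns))\n",
"```\n",
"\n",
"Passing `inplace=True` modifies the object directly instead of returning a copy."
]
},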
  2213. {
  2214. "cell_type": "markdown",
  2215. "metadata": {},
  2216. "source": [
  2217. "### Discretization and Binning\n",
  2218. "\n",
  2219. "```py\n",
  2220. "pd.cut(\n",
  2221. " x,\n",
  2222. " bins,\n",
  2223. " right=True,\n",
  2224. " labels=None,\n",
  2225. " retbins=False,\n",
  2226. " precision=3,\n",
  2227. " include_lowest=False,\n",
  2228. " duplicates='raise'\n",
  2229. ")\n",
  2230. "```"
  2231. ]
  2232. },
  2233. {
  2234. "cell_type": "code",
  2235. "execution_count": 36,
  2236. "metadata": {},
  2237. "outputs": [
  2238. {
  2239. "data": {
  2240. "text/plain": [
  2241. "[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]\n",
  2242. "Length: 12\n",
  2243. "Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]"
  2244. ]
  2245. },
  2246. "execution_count": 36,
  2247. "metadata": {},
  2248. "output_type": "execute_result"
  2249. }
  2250. ],
  2251. "source": [
  2252. "ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]\n",
  2253. "bins = [18, 25, 35, 60, 100]\n",
  2254. "cats = pd.cut(ages, bins)\n",
  2255. "cats"
  2256. ]
  2257. },
  2258. {
  2259. "cell_type": "code",
  2260. "execution_count": 37,
  2261. "metadata": {},
  2262. "outputs": [
  2263. {
  2264. "data": {
  2265. "text/plain": [
  2266. "array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)"
  2267. ]
  2268. },
  2269. "execution_count": 37,
  2270. "metadata": {},
  2271. "output_type": "execute_result"
  2272. }
  2273. ],
  2274. "source": [
  2275. "cats.codes"
  2276. ]
  2277. },
  2278. {
  2279. "cell_type": "code",
  2280. "execution_count": 38,
  2281. "metadata": {},
  2282. "outputs": [
  2283. {
  2284. "data": {
  2285. "text/plain": [
  2286. "IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],\n",
  2287. " closed='right',\n",
  2288. " dtype='interval[int64]')"
  2289. ]
  2290. },
  2291. "execution_count": 38,
  2292. "metadata": {},
  2293. "output_type": "execute_result"
  2294. }
  2295. ],
  2296. "source": [
  2297. "cats.categories"
  2298. ]
  2299. },
  2300. {
  2301. "cell_type": "code",
  2302. "execution_count": 39,
  2303. "metadata": {},
  2304. "outputs": [
  2305. {
  2306. "data": {
  2307. "text/plain": [
  2308. "(18, 25] 5\n",
  2309. "(35, 60] 3\n",
  2310. "(25, 35] 3\n",
  2311. "(60, 100] 1\n",
  2312. "dtype: int64"
  2313. ]
  2314. },
  2315. "execution_count": 39,
  2316. "metadata": {},
  2317. "output_type": "execute_result"
  2318. }
  2319. ],
  2320. "source": [
  2321. "pd.value_counts(cats)"
  2322. ]
  2323. },
  2324. {
  2325. "cell_type": "code",
  2326. "execution_count": 40,
  2327. "metadata": {},
  2328. "outputs": [
  2329. {
  2330. "data": {
  2331. "text/plain": [
  2332. "[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]\n",
  2333. "Length: 12\n",
  2334. "Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]"
  2335. ]
  2336. },
  2337. "execution_count": 40,
  2338. "metadata": {},
  2339. "output_type": "execute_result"
  2340. }
  2341. ],
  2342. "source": [
  2343. "group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']\n",
  2344. "pd.cut(ages, bins, labels=group_names)"
  2345. ]
  2346. },
  2347. {
  2348. "cell_type": "code",
  2349. "execution_count": 41,
  2350. "metadata": {},
  2351. "outputs": [
  2352. {
  2353. "data": {
  2354. "text/plain": [
  2355. "[(0.73, 0.96], (0.73, 0.96], (0.73, 0.96], (0.049, 0.28], (0.73, 0.96], ..., (0.73, 0.96], (0.73, 0.96], (0.73, 0.96], (0.28, 0.51], (0.049, 0.28]]\n",
  2356. "Length: 20\n",
  2357. "Categories (4, interval[float64]): [(0.049, 0.28] < (0.28, 0.51] < (0.51, 0.73] < (0.73, 0.96]]"
  2358. ]
  2359. },
  2360. "execution_count": 41,
  2361. "metadata": {},
  2362. "output_type": "execute_result"
  2363. }
  2364. ],
  2365. "source": [
  2366. "data = np.random.rand(20)\n",
  2367. "pd.cut(data, 4, precision=2)"
  2368. ]
  2369. },
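{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `right` parameter controls which side of each interval is closed. A minimal sketch with a few hypothetical ages that sit exactly on the bin edges:\n",
"\n",
"```py\n",
"import pandas as pd\n",
"\n",
"ages = [20, 25, 35, 60]\n",
"bins = [18, 25, 35, 60, 100]\n",
"\n",
"# Default right=True: intervals are (18, 25], (25, 35], ...\n",
"# so an age of 25 falls in the first bin.\n",
"right_closed = pd.cut(ages, bins)\n",
"right_closed.codes.tolist()   # [0, 0, 1, 2]\n",
"\n",
"# right=False: intervals are [18, 25), [25, 35), ...\n",
"# so an age of 25 moves to the second bin.\n",
"left_closed = pd.cut(ages, bins, right=False)\n",
"left_closed.codes.tolist()    # [0, 1, 2, 3]\n",
"```\n",
"\n",
"Values outside the outermost edges receive the code `-1` and show up as `NaN` in the result."
]
},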
  2370. {
  2371. "cell_type": "markdown",
  2372. "metadata": {},
  2373. "source": [
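  2374. "`qcut` bins the data at sample quantiles rather than at fixed edges; with an integer `q`, each bin receives roughly the same number of values:\n",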
  2375. "\n",
  2376. "```py\n",
  2377. "pd.qcut(\n",
  2378. " x,\n",
  2379. " q,\n",
  2380. " labels=None,\n",
  2381. " retbins=False,\n",
  2382. " precision=3,\n",
  2383. " duplicates='raise'\n",
  2384. ")\n",
  2385. "```"
  2386. ]
  2387. },
  2388. {
  2389. "cell_type": "code",
  2390. "execution_count": 42,
  2391. "metadata": {},
  2392. "outputs": [
  2393. {
  2394. "data": {
  2395. "text/plain": [
  2396. "[(0.654, 2.764], (-3.239, -0.694], (-3.239, -0.694], (0.00261, 0.654], (-0.694, 0.00261], ..., (-3.239, -0.694], (-0.694, 0.00261], (0.00261, 0.654], (-0.694, 0.00261], (0.654, 2.764]]\n",
  2397. "Length: 1000\n",
  2398. "Categories (4, interval[float64]): [(-3.239, -0.694] < (-0.694, 0.00261] < (0.00261, 0.654] < (0.654, 2.764]]"
  2399. ]
  2400. },
  2401. "execution_count": 42,
  2402. "metadata": {},
  2403. "output_type": "execute_result"
  2404. }
  2405. ],
  2406. "source": [
  2407. "data = np.random.randn(1000)\n",
  2408. "cats = pd.qcut(data, 4) # Cut into quartiles\n",
  2409. "cats"
  2410. ]
  2411. },
  2412. {
  2413. "cell_type": "code",
  2414. "execution_count": 43,
  2415. "metadata": {},
  2416. "outputs": [
  2417. {
  2418. "data": {
  2419. "text/plain": [
  2420. "(0.654, 2.764] 250\n",
  2421. "(0.00261, 0.654] 250\n",
  2422. "(-0.694, 0.00261] 250\n",
  2423. "(-3.239, -0.694] 250\n",
  2424. "dtype: int64"
  2425. ]
  2426. },
  2427. "execution_count": 43,
  2428. "metadata": {},
  2429. "output_type": "execute_result"
  2430. }
  2431. ],
  2432. "source": [
  2433. "pd.value_counts(cats)"
  2434. ]
  2435. },
  2436. {
  2437. "cell_type": "code",
  2438. "execution_count": 44,
  2439. "metadata": {},
  2440. "outputs": [
  2441. {
  2442. "data": {
  2443. "text/plain": [
  2444. "[(0.00261, 1.298], (-3.239, -1.241], (-1.241, 0.00261], (0.00261, 1.298], (-1.241, 0.00261], ..., (-1.241, 0.00261], (-1.241, 0.00261], (0.00261, 1.298], (-1.241, 0.00261], (0.00261, 1.298]]\n",
  2445. "Length: 1000\n",
  2446. "Categories (4, interval[float64]): [(-3.239, -1.241] < (-1.241, 0.00261] < (0.00261, 1.298] < (1.298, 2.764]]"
  2447. ]
  2448. },
  2449. "execution_count": 44,
  2450. "metadata": {},
  2451. "output_type": "execute_result"
  2452. }
  2453. ],
  2454. "source": [
  2455. "pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])"
  2456. ]
  2457. },
  2458. {
  2459. "cell_type": "code",
  2460. "execution_count": 45,
  2461. "metadata": {},
  2462. "outputs": [
  2463. {
  2464. "data": {
  2465. "text/plain": [
  2466. "(0.00261, 1.298] 400\n",
  2467. "(-1.241, 0.00261] 400\n",
  2468. "(1.298, 2.764] 100\n",
  2469. "(-3.239, -1.241] 100\n",
  2470. "dtype: int64"
  2471. ]
  2472. },
  2473. "execution_count": 45,
  2474. "metadata": {},
  2475. "output_type": "execute_result"
  2476. }
  2477. ],
  2478. "source": [
  2479. "pd.value_counts(pd.qcut(data, [0, 0.1, 0.5, 0.9, 1]))"
  2480. ]
  2481. },
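{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like `cut`, `qcut` accepts a `labels` argument. The sketch below assumes evenly spaced hypothetical values and hypothetical label names `'Q1'` to `'Q4'`, so that the quartile boundaries are unambiguous.\n",
"\n",
"```py\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Evenly spaced hypothetical values: 0, 1, ..., 99.\n",
"values = np.arange(100)\n",
"\n",
"# Four quantile-based bins, named instead of shown as intervals.\n",
"quartiles = pd.qcut(values, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])\n",
"\n",
"# Each label covers 25 of the 100 values.\n",
"counts = pd.Series(quartiles).value_counts()\n",
"```\n",
"\n",
"With real data that contains many repeated values, the quantile edges may coincide; the `duplicates` parameter shown in the signature above controls whether that raises an error or drops the repeated edges."
]
},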
  2482. {
  2483. "cell_type": "markdown",
  2484. "metadata": {},
  2485. "source": [
  2486. "### Detecting and Filtering Outliers"
  2487. ]
  2488. },
  2489. {
  2490. "cell_type": "code",
  2491. "execution_count": 46,
  2492. "metadata": {},
  2493. "outputs": [
  2494. {
  2495. "data": {
  2496. "text/html": [
  2497. "<div>\n",
  2498. "<style scoped>\n",
  2499. " .dataframe tbody tr th:only-of-type {\n",
  2500. " vertical-align: middle;\n",
  2501. " }\n",
  2502. "\n",
  2503. " .dataframe tbody tr th {\n",
  2504. " vertical-align: top;\n",
  2505. " }\n",
  2506. "\n",
  2507. " .dataframe thead th {\n",
  2508. " text-align: right;\n",
  2509. " }\n",
  2510. "</style>\n",
  2511. "<table border=\"1\" class=\"dataframe\">\n",
  2512. " <thead>\n",
  2513. " <tr style=\"text-align: right;\">\n",
  2514. " <th></th>\n",
  2515. " <th>0</th>\n",
  2516. " <th>1</th>\n",
  2517. " <th>2</th>\n",
  2518. " <th>3</th>\n",
  2519. " </tr>\n",
  2520. " </thead>\n",
  2521. " <tbody>\n",
  2522. " <tr>\n",
  2523. " <th>count</th>\n",
  2524. " <td>1000.000000</td>\n",
  2525. " <td>1000.000000</td>\n",
  2526. " <td>1000.000000</td>\n",
  2527. " <td>1000.000000</td>\n",
  2528. " </tr>\n",
  2529. " <tr>\n",
  2530. " <th>mean</th>\n",
  2531. " <td>-0.026046</td>\n",
  2532. " <td>0.008732</td>\n",
  2533. " <td>-0.018339</td>\n",
  2534. " <td>-0.009127</td>\n",
  2535. " </tr>\n",
  2536. " <tr>\n",
  2537. " <th>std</th>\n",
  2538. " <td>1.028878</td>\n",
  2539. " <td>0.992479</td>\n",
  2540. " <td>0.977812</td>\n",
  2541. " <td>0.974988</td>\n",
  2542. " </tr>\n",
  2543. " <tr>\n",
  2544. " <th>min</th>\n",
  2545. " <td>-3.288240</td>\n",
  2546. " <td>-3.827553</td>\n",
  2547. " <td>-3.431789</td>\n",
  2548. " <td>-2.624219</td>\n",
  2549. " </tr>\n",
  2550. " <tr>\n",
  2551. " <th>25%</th>\n",
  2552. " <td>-0.743113</td>\n",
  2553. " <td>-0.669691</td>\n",
  2554. " <td>-0.697851</td>\n",
  2555. " <td>-0.672852</td>\n",
  2556. " </tr>\n",
  2557. " <tr>\n",
  2558. " <th>50%</th>\n",
  2559. " <td>-0.040368</td>\n",
  2560. " <td>0.028459</td>\n",
  2561. " <td>-0.022622</td>\n",
  2562. " <td>0.020490</td>\n",
  2563. " </tr>\n",
  2564. " <tr>\n",
  2565. " <th>75%</th>\n",
  2566. " <td>0.694826</td>\n",
  2567. " <td>0.694573</td>\n",
  2568. " <td>0.669563</td>\n",
  2569. " <td>0.643025</td>\n",
  2570. " </tr>\n",
  2571. " <tr>\n",
  2572. " <th>max</th>\n",
  2573. " <td>3.497946</td>\n",
  2574. " <td>3.062780</td>\n",
  2575. " <td>2.808596</td>\n",
  2576. " <td>3.446911</td>\n",
  2577. " </tr>\n",
  2578. " </tbody>\n",
  2579. "</table>\n",
  2580. "</div>"
  2581. ],
  2582. "text/plain": [
  2583. " 0 1 2 3\n",
  2584. "count 1000.000000 1000.000000 1000.000000 1000.000000\n",
  2585. "mean -0.026046 0.008732 -0.018339 -0.009127\n",
  2586. "std 1.028878 0.992479 0.977812 0.974988\n",
  2587. "min -3.288240 -3.827553 -3.431789 -2.624219\n",
  2588. "25% -0.743113 -0.669691 -0.697851 -0.672852\n",
  2589. "50% -0.040368 0.028459 -0.022622 0.020490\n",
  2590. "75% 0.694826 0.694573 0.669563 0.643025\n",
  2591. "max 3.497946 3.062780 2.808596 3.446911"
  2592. ]
  2593. },
  2594. "execution_count": 46,
  2595. "metadata": {},
  2596. "output_type": "execute_result"
  2597. }
  2598. ],
  2599. "source": [
  2600. "data = pd.DataFrame(np.random.randn(1000, 4))\n",
  2601. "data.describe()"
  2602. ]
  2603. },
  2604. {
  2605. "cell_type": "code",
  2606. "execution_count": 47,
  2607. "metadata": {},
  2608. "outputs": [
  2609. {
  2610. "data": {
  2611. "text/plain": [
  2612. "424 -3.431789\n",
  2613. "Name: 2, dtype: float64"
  2614. ]
  2615. },
  2616. "execution_count": 47,
  2617. "metadata": {},
  2618. "output_type": "execute_result"
  2619. }
  2620. ],
  2621. "source": [
  2622. "col = data[2]\n",
  2623. "col[np.abs(col) > 3]"
  2624. ]
  2625. },
  2626. {
  2627. "cell_type": "code",
  2628. "execution_count": 48,
  2629. "metadata": {},
  2630. "outputs": [
  2631. {
  2632. "data": {
  2633. "text/html": [
  2634. "<div>\n",
  2635. "<style scoped>\n",
  2636. " .dataframe tbody tr th:only-of-type {\n",
  2637. " vertical-align: middle;\n",
  2638. " }\n",
  2639. "\n",
  2640. " .dataframe tbody tr th {\n",
  2641. " vertical-align: top;\n",
  2642. " }\n",
  2643. "\n",
  2644. " .dataframe thead th {\n",
  2645. " text-align: right;\n",
  2646. " }\n",
  2647. "</style>\n",
  2648. "<table border=\"1\" class=\"dataframe\">\n",
  2649. " <thead>\n",
  2650. " <tr style=\"text-align: right;\">\n",
  2651. " <th></th>\n",
  2652. " <th>0</th>\n",
  2653. " <th>1</th>\n",
  2654. " <th>2</th>\n",
  2655. " <th>3</th>\n",
  2656. " </tr>\n",
  2657. " </thead>\n",
  2658. " <tbody>\n",
  2659. " <tr>\n",
  2660. " <th>197</th>\n",
  2661. " <td>0.849447</td>\n",
  2662. " <td>-3.085049</td>\n",
  2663. " <td>-0.550219</td>\n",
  2664. " <td>-0.120688</td>\n",
  2665. " </tr>\n",
  2666. " <tr>\n",
  2667. " <th>235</th>\n",
  2668. " <td>-1.140439</td>\n",
  2669. " <td>3.062780</td>\n",
  2670. " <td>-0.292776</td>\n",
  2671. " <td>-1.541634</td>\n",
  2672. " </tr>\n",
  2673. " <tr>\n",
  2674. " <th>287</th>\n",
  2675. " <td>-3.288240</td>\n",
  2676. " <td>0.637092</td>\n",
  2677. " <td>0.750347</td>\n",
  2678. " <td>0.852326</td>\n",
  2679. " </tr>\n",
  2680. " <tr>\n",
  2681. " <th>310</th>\n",
  2682. " <td>1.198193</td>\n",
  2683. " <td>0.001196</td>\n",
  2684. " <td>1.068577</td>\n",
  2685. " <td>3.446911</td>\n",
  2686. " </tr>\n",
  2687. " <tr>\n",
  2688. " <th>376</th>\n",
  2689. " <td>2.660516</td>\n",
  2690. " <td>-0.788964</td>\n",
  2691. " <td>0.578800</td>\n",
  2692. " <td>3.279157</td>\n",
  2693. " </tr>\n",
  2694. " <tr>\n",
  2695. " <th>395</th>\n",
  2696. " <td>-1.226703</td>\n",
  2697. " <td>1.154297</td>\n",
  2698. " <td>-0.712612</td>\n",
  2699. " <td>3.047792</td>\n",
  2700. " </tr>\n",
  2701. " <tr>\n",
  2702. " <th>401</th>\n",
  2703. " <td>3.497946</td>\n",
  2704. " <td>-0.929906</td>\n",
  2705. " <td>0.213705</td>\n",
  2706. " <td>-0.062713</td>\n",
  2707. " </tr>\n",
  2708. " <tr>\n",
  2709. " <th>424</th>\n",
  2710. " <td>1.320339</td>\n",
  2711. " <td>-0.201304</td>\n",
  2712. " <td>-3.431789</td>\n",
  2713. " <td>-0.039907</td>\n",
  2714. " </tr>\n",
  2715. " <tr>\n",
  2716. " <th>570</th>\n",
  2717. " <td>3.278339</td>\n",
  2718. " <td>0.979884</td>\n",
  2719. " <td>-0.542488</td>\n",
  2720. " <td>0.147562</td>\n",
  2721. " </tr>\n",
  2722. " <tr>\n",
  2723. " <th>697</th>\n",
  2724. " <td>0.512636</td>\n",
  2725. " <td>-3.107288</td>\n",
  2726. " <td>0.475335</td>\n",
  2727. " <td>1.160560</td>\n",
  2728. " </tr>\n",
  2729. " <tr>\n",
  2730. " <th>770</th>\n",
  2731. " <td>0.701224</td>\n",
  2732. " <td>-3.137252</td>\n",
  2733. " <td>0.442069</td>\n",
  2734. " <td>0.241852</td>\n",
  2735. " </tr>\n",
  2736. " <tr>\n",
  2737. " <th>776</th>\n",
  2738. " <td>-1.825486</td>\n",
  2739. " <td>-3.827553</td>\n",
  2740. " <td>1.281648</td>\n",
  2741. " <td>-0.328060</td>\n",
  2742. " </tr>\n",
  2743. " <tr>\n",
  2744. " <th>813</th>\n",
  2745. " <td>0.408299</td>\n",
  2746. " <td>-3.120840</td>\n",
  2747. " <td>-0.708262</td>\n",
  2748. " <td>-0.382290</td>\n",
  2749. " </tr>\n",
  2750. " </tbody>\n",
  2751. "</table>\n",
  2752. "</div>"
  2753. ],
  2754. "text/plain": [
  2755. " 0 1 2 3\n",
  2756. "197 0.849447 -3.085049 -0.550219 -0.120688\n",
  2757. "235 -1.140439 3.062780 -0.292776 -1.541634\n",
  2758. "287 -3.288240 0.637092 0.750347 0.852326\n",
  2759. "310 1.198193 0.001196 1.068577 3.446911\n",
  2760. "376 2.660516 -0.788964 0.578800 3.279157\n",
  2761. "395 -1.226703 1.154297 -0.712612 3.047792\n",
  2762. "401 3.497946 -0.929906 0.213705 -0.062713\n",
  2763. "424 1.320339 -0.201304 -3.431789 -0.039907\n",
  2764. "570 3.278339 0.979884 -0.542488 0.147562\n",
  2765. "697 0.512636 -3.107288 0.475335 1.160560\n",
  2766. "770 0.701224 -3.137252 0.442069 0.241852\n",
  2767. "776 -1.825486 -3.827553 1.281648 -0.328060\n",
  2768. "813 0.408299 -3.120840 -0.708262 -0.382290"
  2769. ]
  2770. },
  2771. "execution_count": 48,
  2772. "metadata": {},
  2773. "output_type": "execute_result"
  2774. }
  2775. ],
  2776. "source": [
  2777. "data[(np.abs(data) > 3).any(axis=1)] # rows with a value whose abs > 3"
  2778. ]
  2779. },
  2780. {
  2781. "cell_type": "code",
  2782. "execution_count": 49,
  2783. "metadata": {},
  2784. "outputs": [
  2785. {
  2786. "data": {
  2787. "text/html": [
  2788. "<div>\n",
  2789. "<style scoped>\n",
  2790. " .dataframe tbody tr th:only-of-type {\n",
  2791. " vertical-align: middle;\n",
  2792. " }\n",
  2793. "\n",
  2794. " .dataframe tbody tr th {\n",
  2795. " vertical-align: top;\n",
  2796. " }\n",
  2797. "\n",
  2798. " .dataframe thead th {\n",
  2799. " text-align: right;\n",
  2800. " }\n",
  2801. "</style>\n",
  2802. "<table border=\"1\" class=\"dataframe\">\n",
  2803. " <thead>\n",
  2804. " <tr style=\"text-align: right;\">\n",
  2805. " <th></th>\n",
  2806. " <th>0</th>\n",
  2807. " <th>1</th>\n",
  2808. " <th>2</th>\n",
  2809. " <th>3</th>\n",
  2810. " </tr>\n",
  2811. " </thead>\n",
  2812. " <tbody>\n",
  2813. " <tr>\n",
  2814. " <th>count</th>\n",
  2815. " <td>1000.000000</td>\n",
  2816. " <td>1000.000000</td>\n",
  2817. " <td>1000.000000</td>\n",
  2818. " <td>1000.000000</td>\n",
  2819. " </tr>\n",
  2820. " <tr>\n",
  2821. " <th>mean</th>\n",
  2822. " <td>-0.026534</td>\n",
  2823. " <td>0.009948</td>\n",
  2824. " <td>-0.017908</td>\n",
  2825. " <td>-0.009901</td>\n",
  2826. " </tr>\n",
  2827. " <tr>\n",
  2828. " <th>std</th>\n",
  2829. " <td>1.025554</td>\n",
  2830. " <td>0.988028</td>\n",
  2831. " <td>0.976398</td>\n",
  2832. " <td>0.972450</td>\n",
  2833. " </tr>\n",
  2834. " <tr>\n",
  2835. " <th>min</th>\n",
  2836. " <td>-3.000000</td>\n",
  2837. " <td>-3.000000</td>\n",
  2838. " <td>-3.000000</td>\n",
  2839. " <td>-2.624219</td>\n",
  2840. " </tr>\n",
  2841. " <tr>\n",
  2842. " <th>25%</th>\n",
  2843. " <td>-0.743113</td>\n",
  2844. " <td>-0.669691</td>\n",
  2845. " <td>-0.697851</td>\n",
  2846. " <td>-0.672852</td>\n",
  2847. " </tr>\n",
  2848. " <tr>\n",
  2849. " <th>50%</th>\n",
  2850. " <td>-0.040368</td>\n",
  2851. " <td>0.028459</td>\n",
  2852. " <td>-0.022622</td>\n",
  2853. " <td>0.020490</td>\n",
  2854. " </tr>\n",
  2855. " <tr>\n",
  2856. " <th>75%</th>\n",
  2857. " <td>0.694826</td>\n",
  2858. " <td>0.694573</td>\n",
  2859. " <td>0.669563</td>\n",
  2860. " <td>0.643025</td>\n",
  2861. " </tr>\n",
  2862. " <tr>\n",
  2863. " <th>max</th>\n",
  2864. " <td>3.000000</td>\n",
  2865. " <td>3.000000</td>\n",
  2866. " <td>2.808596</td>\n",
  2867. " <td>3.000000</td>\n",
  2868. " </tr>\n",
  2869. " </tbody>\n",
  2870. "</table>\n",
  2871. "</div>"
  2872. ],
  2873. "text/plain": [
  2874. " 0 1 2 3\n",
  2875. "count 1000.000000 1000.000000 1000.000000 1000.000000\n",
  2876. "mean -0.026534 0.009948 -0.017908 -0.009901\n",
  2877. "std 1.025554 0.988028 0.976398 0.972450\n",
  2878. "min -3.000000 -3.000000 -3.000000 -2.624219\n",
  2879. "25% -0.743113 -0.669691 -0.697851 -0.672852\n",
  2880. "50% -0.040368 0.028459 -0.022622 0.020490\n",
  2881. "75% 0.694826 0.694573 0.669563 0.643025\n",
  2882. "max 3.000000 3.000000 2.808596 3.000000"
  2883. ]
  2884. },
  2885. "execution_count": 49,
  2886. "metadata": {},
  2887. "output_type": "execute_result"
  2888. }
  2889. ],
  2890. "source": [
  2891. "data[np.abs(data) > 3] = np.sign(data) * 3\n",
  2892. "data.describe()"
  2893. ]
  2894. },
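{
"cell_type": "markdown",
"metadata": {},
"source": [
"The capping step above can probably be written more directly with `DataFrame.clip`. The block below is a minimal sketch, assuming a freshly generated frame; the names `frame`, `masked`, `clipped` and the seed `0` are illustrative and not part of the original notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical data: the seed and the shape are illustrative only.\n",
"rng = np.random.RandomState(0)\n",
"frame = pd.DataFrame(rng.randn(1000, 4))\n",
"\n",
"# Cap values to the interval [-3, 3] with boolean-mask assignment.\n",
"masked = frame.copy()\n",
"masked[np.abs(masked) > 3] = np.sign(masked) * 3\n",
"\n",
"# DataFrame.clip expresses the same capping in a single call.\n",
"clipped = frame.clip(lower=-3, upper=3)\n",
"\n",
"# Both approaches should agree element by element.\n",
"same_result = np.allclose(masked.values, clipped.values)\n",
"```\n",
"\n",
"`clip` returns a new frame and leaves `frame` untouched, whereas the mask assignment modifies its target in place."
]
},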
  2895. {
  2896. "cell_type": "markdown",
  2897. "metadata": {},
  2898. "source": [
  2899. "### Permutation and Random Sampling"
  2900. ]
  2901. },
  2902. {
  2903. "cell_type": "code",
  2904. "execution_count": 50,
  2905. "metadata": {},
  2906. "outputs": [
  2907. {
  2908. "data": {
  2909. "text/plain": [
  2910. "array([0, 1, 3, 4, 2])"
  2911. ]
  2912. },
  2913. "execution_count": 50,
  2914. "metadata": {},
  2915. "output_type": "execute_result"
  2916. }
  2917. ],
  2918. "source": [
  2919. "df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))\n",
  2920. "sampler = np.random.permutation(5)\n",
  2921. "sampler"
  2922. ]
  2923. },
  2924. {
  2925. "cell_type": "code",
  2926. "execution_count": 51,
  2927. "metadata": {},
  2928. "outputs": [
  2929. {
  2930. "data": {
  2931. "text/html": [
  2932. "<div>\n",
  2933. "<style scoped>\n",
  2934. " .dataframe tbody tr th:only-of-type {\n",
  2935. " vertical-align: middle;\n",
  2936. " }\n",
  2937. "\n",
  2938. " .dataframe tbody tr th {\n",
  2939. " vertical-align: top;\n",
  2940. " }\n",
  2941. "\n",
  2942. " .dataframe thead th {\n",
  2943. " text-align: right;\n",
  2944. " }\n",
  2945. "</style>\n",
  2946. "<table border=\"1\" class=\"dataframe\">\n",
  2947. " <thead>\n",
  2948. " <tr style=\"text-align: right;\">\n",
  2949. " <th></th>\n",
  2950. " <th>0</th>\n",
  2951. " <th>1</th>\n",
  2952. " <th>2</th>\n",
  2953. " <th>3</th>\n",
  2954. " </tr>\n",
  2955. " </thead>\n",
  2956. " <tbody>\n",
  2957. " <tr>\n",
  2958. " <th>0</th>\n",
  2959. " <td>0</td>\n",
  2960. " <td>1</td>\n",
  2961. " <td>2</td>\n",
  2962. " <td>3</td>\n",
  2963. " </tr>\n",
  2964. " <tr>\n",
  2965. " <th>1</th>\n",
  2966. " <td>4</td>\n",
  2967. " <td>5</td>\n",
  2968. " <td>6</td>\n",
  2969. " <td>7</td>\n",
  2970. " </tr>\n",
  2971. " <tr>\n",
  2972. " <th>2</th>\n",
  2973. " <td>8</td>\n",
  2974. " <td>9</td>\n",
  2975. " <td>10</td>\n",
  2976. " <td>11</td>\n",
  2977. " </tr>\n",
  2978. " <tr>\n",
  2979. " <th>3</th>\n",
  2980. " <td>12</td>\n",
  2981. " <td>13</td>\n",
  2982. " <td>14</td>\n",
  2983. " <td>15</td>\n",
  2984. " </tr>\n",
  2985. " <tr>\n",
  2986. " <th>4</th>\n",
  2987. " <td>16</td>\n",
  2988. " <td>17</td>\n",
  2989. " <td>18</td>\n",
  2990. " <td>19</td>\n",
  2991. " </tr>\n",
  2992. " </tbody>\n",
  2993. "</table>\n",
  2994. "</div>"
  2995. ],
  2996. "text/plain": [
  2997. " 0 1 2 3\n",
  2998. "0 0 1 2 3\n",
  2999. "1 4 5 6 7\n",
  3000. "2 8 9 10 11\n",
  3001. "3 12 13 14 15\n",
  3002. "4 16 17 18 19"
  3003. ]
  3004. },
  3005. "execution_count": 51,
  3006. "metadata": {},
  3007. "output_type": "execute_result"
  3008. }
  3009. ],
  3010. "source": [
  3011. "df"
  3012. ]
  3013. },
  3014. {
  3015. "cell_type": "code",
  3016. "execution_count": 52,
  3017. "metadata": {},
  3018. "outputs": [
  3019. {
  3020. "data": {
  3021. "text/html": [
  3022. "<div>\n",
  3023. "<style scoped>\n",
  3024. " .dataframe tbody tr th:only-of-type {\n",
  3025. " vertical-align: middle;\n",
  3026. " }\n",
  3027. "\n",
  3028. " .dataframe tbody tr th {\n",
  3029. " vertical-align: top;\n",
  3030. " }\n",
  3031. "\n",
  3032. " .dataframe thead th {\n",
  3033. " text-align: right;\n",
  3034. " }\n",
  3035. "</style>\n",
  3036. "<table border=\"1\" class=\"dataframe\">\n",
  3037. " <thead>\n",
  3038. " <tr style=\"text-align: right;\">\n",
  3039. " <th></th>\n",
  3040. " <th>0</th>\n",
  3041. " <th>1</th>\n",
  3042. " <th>2</th>\n",
  3043. " <th>3</th>\n",
  3044. " </tr>\n",
  3045. " </thead>\n",
  3046. " <tbody>\n",
  3047. " <tr>\n",
  3048. " <th>0</th>\n",
  3049. " <td>0</td>\n",
  3050. " <td>1</td>\n",
  3051. " <td>2</td>\n",
  3052. " <td>3</td>\n",
  3053. " </tr>\n",
  3054. " <tr>\n",
  3055. " <th>1</th>\n",
  3056. " <td>4</td>\n",
  3057. " <td>5</td>\n",
  3058. " <td>6</td>\n",
  3059. " <td>7</td>\n",
  3060. " </tr>\n",
  3061. " <tr>\n",
  3062. " <th>3</th>\n",
  3063. " <td>12</td>\n",
  3064. " <td>13</td>\n",
  3065. " <td>14</td>\n",
  3066. " <td>15</td>\n",
  3067. " </tr>\n",
  3068. " <tr>\n",
  3069. " <th>4</th>\n",
  3070. " <td>16</td>\n",
  3071. " <td>17</td>\n",
  3072. " <td>18</td>\n",
  3073. " <td>19</td>\n",
  3074. " </tr>\n",
  3075. " <tr>\n",
  3076. " <th>2</th>\n",
  3077. " <td>8</td>\n",
  3078. " <td>9</td>\n",
  3079. " <td>10</td>\n",
  3080. " <td>11</td>\n",
  3081. " </tr>\n",
  3082. " </tbody>\n",
  3083. "</table>\n",
  3084. "</div>"
  3085. ],
  3086. "text/plain": [
  3087. " 0 1 2 3\n",
  3088. "0 0 1 2 3\n",
  3089. "1 4 5 6 7\n",
  3090. "3 12 13 14 15\n",
  3091. "4 16 17 18 19\n",
  3092. "2 8 9 10 11"
  3093. ]
  3094. },
  3095. "execution_count": 52,
  3096. "metadata": {},
  3097. "output_type": "execute_result"
  3098. }
  3099. ],
  3100. "source": [
  3101. "df.take(sampler)"
  3102. ]
  3103. },
  3104. {
  3105. "cell_type": "code",
  3106. "execution_count": 53,
  3107. "metadata": {},
  3108. "outputs": [
  3109. {
  3110. "data": {
  3111. "text/html": [
  3112. "<div>\n",
  3113. "<style scoped>\n",
  3114. " .dataframe tbody tr th:only-of-type {\n",
  3115. " vertical-align: middle;\n",
  3116. " }\n",
  3117. "\n",
  3118. " .dataframe tbody tr th {\n",
  3119. " vertical-align: top;\n",
  3120. " }\n",
  3121. "\n",
  3122. " .dataframe thead th {\n",
  3123. " text-align: right;\n",
  3124. " }\n",
  3125. "</style>\n",
  3126. "<table border=\"1\" class=\"dataframe\">\n",
  3127. " <thead>\n",
  3128. " <tr style=\"text-align: right;\">\n",
  3129. " <th></th>\n",
  3130. " <th>0</th>\n",
  3131. " <th>1</th>\n",
  3132. " <th>2</th>\n",
  3133. " <th>3</th>\n",
  3134. " </tr>\n",
  3135. " </thead>\n",
  3136. " <tbody>\n",
  3137. " <tr>\n",
  3138. " <th>0</th>\n",
  3139. " <td>0</td>\n",
  3140. " <td>1</td>\n",
  3141. " <td>2</td>\n",
  3142. " <td>3</td>\n",
  3143. " </tr>\n",
  3144. " <tr>\n",
  3145. " <th>1</th>\n",
  3146. " <td>4</td>\n",
  3147. " <td>5</td>\n",
  3148. " <td>6</td>\n",
  3149. " <td>7</td>\n",
  3150. " </tr>\n",
  3151. " <tr>\n",
  3152. " <th>3</th>\n",
  3153. " <td>12</td>\n",
  3154. " <td>13</td>\n",
  3155. " <td>14</td>\n",
  3156. " <td>15</td>\n",
  3157. " </tr>\n",
  3158. " <tr>\n",
  3159. " <th>4</th>\n",
  3160. " <td>16</td>\n",
  3161. " <td>17</td>\n",
  3162. " <td>18</td>\n",
  3163. " <td>19</td>\n",
  3164. " </tr>\n",
  3165. " <tr>\n",
  3166. " <th>2</th>\n",
  3167. " <td>8</td>\n",
  3168. " <td>9</td>\n",
  3169. " <td>10</td>\n",
  3170. " <td>11</td>\n",
  3171. " </tr>\n",
  3172. " </tbody>\n",
  3173. "</table>\n",
  3174. "</div>"
  3175. ],
  3176. "text/plain": [
  3177. " 0 1 2 3\n",
  3178. "0 0 1 2 3\n",
  3179. "1 4 5 6 7\n",
  3180. "3 12 13 14 15\n",
  3181. "4 16 17 18 19\n",
  3182. "2 8 9 10 11"
  3183. ]
  3184. },
  3185. "execution_count": 53,
  3186. "metadata": {},
  3187. "output_type": "execute_result"
  3188. }
  3189. ],
  3190. "source": [
  3191. "df.iloc[sampler]"
  3192. ]
  3193. },
  3194. {
  3195. "cell_type": "markdown",
  3196. "metadata": {},
  3197. "source": [
  3198. "```py\n",
  3199. "sample(\n",
  3200. " n=None,\n",
  3201. " frac=None,\n",
  3202. " replace=False,\n",
  3203. " weights=None,\n",
  3204. " random_state=None,\n",
  3205. " axis=None\n",
  3206. ")\n",
  3207. "```"
  3208. ]
  3209. },
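{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some parameters in the signature above are easier to follow with a small example. This is a sketch under stated assumptions: the frame `frame`, the seed `42`, and the weight list are hypothetical values chosen for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical 5 x 4 frame, mirroring the shape used in this section.\n",
"frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))\n",
"\n",
"# frac draws a fraction of the rows instead of a fixed count;\n",
"# frac=1.0 without replacement amounts to a full shuffle.\n",
"shuffled = frame.sample(frac=1.0, random_state=42)\n",
"\n",
"# Reusing the same random_state reproduces the same draw.\n",
"again = frame.sample(frac=1.0, random_state=42)\n",
"\n",
"# weights biases the draw; rows with weight 0 are never selected.\n",
"weighted = frame.sample(n=2, weights=[0, 0, 1, 1, 1], random_state=42)\n",
"```\n",
"\n",
"Weights that do not sum to 1 are normalized by pandas, so relative sizes are what matter."
]
},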
  3210. {
  3211. "cell_type": "code",
  3212. "execution_count": 54,
  3213. "metadata": {},
  3214. "outputs": [
  3215. {
  3216. "data": {
  3217. "text/html": [
  3218. "<div>\n",
  3219. "<style scoped>\n",
  3220. " .dataframe tbody tr th:only-of-type {\n",
  3221. " vertical-align: middle;\n",
  3222. " }\n",
  3223. "\n",
  3224. " .dataframe tbody tr th {\n",
  3225. " vertical-align: top;\n",
  3226. " }\n",
  3227. "\n",
  3228. " .dataframe thead th {\n",
  3229. " text-align: right;\n",
  3230. " }\n",
  3231. "</style>\n",
  3232. "<table border=\"1\" class=\"dataframe\">\n",
  3233. " <thead>\n",
  3234. " <tr style=\"text-align: right;\">\n",
  3235. " <th></th>\n",
  3236. " <th>0</th>\n",
  3237. " <th>1</th>\n",
  3238. " <th>2</th>\n",
  3239. " <th>3</th>\n",
  3240. " </tr>\n",
  3241. " </thead>\n",
  3242. " <tbody>\n",
  3243. " <tr>\n",
  3244. " <th>2</th>\n",
  3245. " <td>8</td>\n",
  3246. " <td>9</td>\n",
  3247. " <td>10</td>\n",
  3248. " <td>11</td>\n",
  3249. " </tr>\n",
  3250. " <tr>\n",
  3251. " <th>1</th>\n",
  3252. " <td>4</td>\n",
  3253. " <td>5</td>\n",
  3254. " <td>6</td>\n",
  3255. " <td>7</td>\n",
  3256. " </tr>\n",
  3257. " <tr>\n",
  3258. " <th>4</th>\n",
  3259. " <td>16</td>\n",
  3260. " <td>17</td>\n",
  3261. " <td>18</td>\n",
  3262. " <td>19</td>\n",
  3263. " </tr>\n",
  3264. " </tbody>\n",
  3265. "</table>\n",
  3266. "</div>"
  3267. ],
  3268. "text/plain": [
  3269. " 0 1 2 3\n",
  3270. "2 8 9 10 11\n",
  3271. "1 4 5 6 7\n",
  3272. "4 16 17 18 19"
  3273. ]
  3274. },
  3275. "execution_count": 54,
  3276. "metadata": {},
  3277. "output_type": "execute_result"
  3278. }
  3279. ],
  3280. "source": [
  3281. "df.sample(n=3) # select a random subset without replacement"
  3282. ]
  3283. },
  3284. {
  3285. "cell_type": "code",
  3286. "execution_count": 55,
  3287. "metadata": {},
  3288. "outputs": [
  3289. {
  3290. "data": {
  3291. "text/html": [
  3292. "<div>\n",
  3293. "<style scoped>\n",
  3294. " .dataframe tbody tr th:only-of-type {\n",
  3295. " vertical-align: middle;\n",
  3296. " }\n",
  3297. "\n",
  3298. " .dataframe tbody tr th {\n",
  3299. " vertical-align: top;\n",
  3300. " }\n",
  3301. "\n",
  3302. " .dataframe thead th {\n",
  3303. " text-align: right;\n",
  3304. " }\n",
  3305. "</style>\n",
  3306. "<table border=\"1\" class=\"dataframe\">\n",
  3307. " <thead>\n",
  3308. " <tr style=\"text-align: right;\">\n",
  3309. " <th></th>\n",
  3310. " <th>3</th>\n",
  3311. " <th>0</th>\n",
  3312. " <th>2</th>\n",
  3313. " </tr>\n",
  3314. " </thead>\n",
  3315. " <tbody>\n",
  3316. " <tr>\n",
  3317. " <th>0</th>\n",
  3318. " <td>3</td>\n",
  3319. " <td>0</td>\n",
  3320. " <td>2</td>\n",
  3321. " </tr>\n",
  3322. " <tr>\n",
  3323. " <th>1</th>\n",
  3324. " <td>7</td>\n",
  3325. " <td>4</td>\n",
  3326. " <td>6</td>\n",
  3327. " </tr>\n",
  3328. " <tr>\n",
  3329. " <th>2</th>\n",
  3330. " <td>11</td>\n",
  3331. " <td>8</td>\n",
  3332. " <td>10</td>\n",
  3333. " </tr>\n",
  3334. " <tr>\n",
  3335. " <th>3</th>\n",
  3336. " <td>15</td>\n",
  3337. " <td>12</td>\n",
  3338. " <td>14</td>\n",
  3339. " </tr>\n",
  3340. " <tr>\n",
  3341. " <th>4</th>\n",
  3342. " <td>19</td>\n",
  3343. " <td>16</td>\n",
  3344. " <td>18</td>\n",
  3345. " </tr>\n",
  3346. " </tbody>\n",
  3347. "</table>\n",
  3348. "</div>"
  3349. ],
  3350. "text/plain": [
  3351. " 3 0 2\n",
  3352. "0 3 0 2\n",
  3353. "1 7 4 6\n",
  3354. "2 11 8 10\n",
  3355. "3 15 12 14\n",
  3356. "4 19 16 18"
  3357. ]
  3358. },
  3359. "execution_count": 55,
  3360. "metadata": {},
  3361. "output_type": "execute_result"
  3362. }
  3363. ],
  3364. "source": [
  3365. "df.sample(axis=1, n=3)"
  3366. ]
  3367. },
  3368. {
  3369. "cell_type": "code",
  3370. "execution_count": 56,
  3371. "metadata": {},
  3372. "outputs": [
  3373. {
  3374. "data": {
  3375. "text/plain": [
  3376. "4 4\n",
  3377. "2 -1\n",
  3378. "4 4\n",
  3379. "1 7\n",
  3380. "4 4\n",
  3381. "2 -1\n",
  3382. "3 6\n",
  3383. "1 7\n",
  3384. "1 7\n",
  3385. "2 -1\n",
  3386. "dtype: int64"
  3387. ]
  3388. },
  3389. "execution_count": 56,
  3390. "metadata": {},
  3391. "output_type": "execute_result"
  3392. }
  3393. ],
  3394. "source": [
  3395. "choices = pd.Series([5, 7, -1, 6, 4])\n",
  3396. "draws = choices.sample(n=10, replace=True) # allow repeat choices\n",
  3397. "draws"
  3398. ]
  3399. },
  3400. {
  3401. "cell_type": "markdown",
  3402. "metadata": {},
  3403. "source": [
  3404. "### Computing Indicator/Dummy Variables\n",
  3405. "\n",
  3406. "Use `pd.get_dummies()` to get the one-hot (indicator) representation of a **categorical** variable:\n",
  3407. "\n",
  3408. "```py\n",
  3409. "pd.get_dummies(\n",
  3410. " data,\n",
  3411. " prefix=None,\n",
  3412. " prefix_sep='_',\n",
  3413. " dummy_na=False,\n",
  3414. " columns=None,\n",
  3415. " sparse=False,\n",
  3416. " drop_first=False,\n",
  3417. " dtype=None\n",
  3418. ")\n",
  3419. "```"
  3420. ]
  3421. },
  3422. {
  3423. "cell_type": "code",
  3424. "execution_count": 57,
  3425. "metadata": {},
  3426. "outputs": [
  3427. {
  3428. "data": {
  3429. "text/html": [
  3430. "<div>\n",
  3431. "<style scoped>\n",
  3432. " .dataframe tbody tr th:only-of-type {\n",
  3433. " vertical-align: middle;\n",
  3434. " }\n",
  3435. "\n",
  3436. " .dataframe tbody tr th {\n",
  3437. " vertical-align: top;\n",
  3438. " }\n",
  3439. "\n",
  3440. " .dataframe thead th {\n",
  3441. " text-align: right;\n",
  3442. " }\n",
  3443. "</style>\n",
  3444. "<table border=\"1\" class=\"dataframe\">\n",
  3445. " <thead>\n",
  3446. " <tr style=\"text-align: right;\">\n",
  3447. " <th></th>\n",
  3448. " <th>key</th>\n",
  3449. " <th>data1</th>\n",
  3450. " </tr>\n",
  3451. " </thead>\n",
  3452. " <tbody>\n",
  3453. " <tr>\n",
  3454. " <th>0</th>\n",
  3455. " <td>b</td>\n",
  3456. " <td>0</td>\n",
  3457. " </tr>\n",
  3458. " <tr>\n",
  3459. " <th>1</th>\n",
  3460. " <td>b</td>\n",
  3461. " <td>1</td>\n",
  3462. " </tr>\n",
  3463. " <tr>\n",
  3464. " <th>2</th>\n",
  3465. " <td>a</td>\n",
  3466. " <td>2</td>\n",
  3467. " </tr>\n",
  3468. " <tr>\n",
  3469. " <th>3</th>\n",
  3470. " <td>c</td>\n",
  3471. " <td>3</td>\n",
  3472. " </tr>\n",
  3473. " <tr>\n",
  3474. " <th>4</th>\n",
  3475. " <td>a</td>\n",
  3476. " <td>4</td>\n",
  3477. " </tr>\n",
  3478. " <tr>\n",
  3479. " <th>5</th>\n",
  3480. " <td>b</td>\n",
  3481. " <td>5</td>\n",
  3482. " </tr>\n",
  3483. " </tbody>\n",
  3484. "</table>\n",
  3485. "</div>"
  3486. ],
  3487. "text/plain": [
  3488. " key data1\n",
  3489. "0 b 0\n",
  3490. "1 b 1\n",
  3491. "2 a 2\n",
  3492. "3 c 3\n",
  3493. "4 a 4\n",
  3494. "5 b 5"
  3495. ]
  3496. },
  3497. "execution_count": 57,
  3498. "metadata": {},
  3499. "output_type": "execute_result"
  3500. }
  3501. ],
  3502. "source": [
  3503. "df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],\n",
  3504. " 'data1': range(6)})\n",
  3505. "df"
  3506. ]
  3507. },
  3508. {
  3509. "cell_type": "code",
  3510. "execution_count": 58,
  3511. "metadata": {},
  3512. "outputs": [
  3513. {
  3514. "data": {
  3515. "text/html": [
  3516. "<div>\n",
  3517. "<style scoped>\n",
  3518. " .dataframe tbody tr th:only-of-type {\n",
  3519. " vertical-align: middle;\n",
  3520. " }\n",
  3521. "\n",
  3522. " .dataframe tbody tr th {\n",
  3523. " vertical-align: top;\n",
  3524. " }\n",
  3525. "\n",
  3526. " .dataframe thead th {\n",
  3527. " text-align: right;\n",
  3528. " }\n",
  3529. "</style>\n",
  3530. "<table border=\"1\" class=\"dataframe\">\n",
  3531. " <thead>\n",
  3532. " <tr style=\"text-align: right;\">\n",
  3533. " <th></th>\n",
  3534. " <th>a</th>\n",
  3535. " <th>b</th>\n",
  3536. " <th>c</th>\n",
  3537. " </tr>\n",
  3538. " </thead>\n",
  3539. " <tbody>\n",
  3540. " <tr>\n",
  3541. " <th>0</th>\n",
  3542. " <td>0</td>\n",
  3543. " <td>1</td>\n",
  3544. " <td>0</td>\n",
  3545. " </tr>\n",
  3546. " <tr>\n",
  3547. " <th>1</th>\n",
  3548. " <td>0</td>\n",
  3549. " <td>1</td>\n",
  3550. " <td>0</td>\n",
  3551. " </tr>\n",
  3552. " <tr>\n",
  3553. " <th>2</th>\n",
  3554. " <td>1</td>\n",
  3555. " <td>0</td>\n",
  3556. " <td>0</td>\n",
  3557. " </tr>\n",
  3558. " <tr>\n",
  3559. " <th>3</th>\n",
  3560. " <td>0</td>\n",
  3561. " <td>0</td>\n",
  3562. " <td>1</td>\n",
  3563. " </tr>\n",
  3564. " <tr>\n",
  3565. " <th>4</th>\n",
  3566. " <td>1</td>\n",
  3567. " <td>0</td>\n",
  3568. " <td>0</td>\n",
  3569. " </tr>\n",
  3570. " <tr>\n",
  3571. " <th>5</th>\n",
  3572. " <td>0</td>\n",
  3573. " <td>1</td>\n",
  3574. " <td>0</td>\n",
  3575. " </tr>\n",
  3576. " </tbody>\n",
  3577. "</table>\n",
  3578. "</div>"
  3579. ],
  3580. "text/plain": [
  3581. " a b c\n",
  3582. "0 0 1 0\n",
  3583. "1 0 1 0\n",
  3584. "2 1 0 0\n",
  3585. "3 0 0 1\n",
  3586. "4 1 0 0\n",
  3587. "5 0 1 0"
  3588. ]
  3589. },
  3590. "execution_count": 58,
  3591. "metadata": {},
  3592. "output_type": "execute_result"
  3593. }
  3594. ],
  3595. "source": [
  3596. "pd.get_dummies(df['key'])"
  3597. ]
  3598. },
  3599. {
  3600. "cell_type": "code",
  3601. "execution_count": 59,
  3602. "metadata": {},
  3603. "outputs": [
  3604. {
  3605. "data": {
  3606. "text/html": [
  3607. "<div>\n",
  3608. "<style scoped>\n",
  3609. " .dataframe tbody tr th:only-of-type {\n",
  3610. " vertical-align: middle;\n",
  3611. " }\n",
  3612. "\n",
  3613. " .dataframe tbody tr th {\n",
  3614. " vertical-align: top;\n",
  3615. " }\n",
  3616. "\n",
  3617. " .dataframe thead th {\n",
  3618. " text-align: right;\n",
  3619. " }\n",
  3620. "</style>\n",
  3621. "<table border=\"1\" class=\"dataframe\">\n",
  3622. " <thead>\n",
  3623. " <tr style=\"text-align: right;\">\n",
  3624. " <th></th>\n",
  3625. " <th>data1</th>\n",
  3626. " <th>key_a</th>\n",
  3627. " <th>key_b</th>\n",
  3628. " <th>key_c</th>\n",
  3629. " </tr>\n",
  3630. " </thead>\n",
  3631. " <tbody>\n",
  3632. " <tr>\n",
  3633. " <th>0</th>\n",
  3634. " <td>0</td>\n",
  3635. " <td>0</td>\n",
  3636. " <td>1</td>\n",
  3637. " <td>0</td>\n",
  3638. " </tr>\n",
  3639. " <tr>\n",
  3640. " <th>1</th>\n",
  3641. " <td>1</td>\n",
  3642. " <td>0</td>\n",
  3643. " <td>1</td>\n",
  3644. " <td>0</td>\n",
  3645. " </tr>\n",
  3646. " <tr>\n",
  3647. " <th>2</th>\n",
  3648. " <td>2</td>\n",
  3649. " <td>1</td>\n",
  3650. " <td>0</td>\n",
  3651. " <td>0</td>\n",
  3652. " </tr>\n",
  3653. " <tr>\n",
  3654. " <th>3</th>\n",
  3655. " <td>3</td>\n",
  3656. " <td>0</td>\n",
  3657. " <td>0</td>\n",
  3658. " <td>1</td>\n",
  3659. " </tr>\n",
  3660. " <tr>\n",
  3661. " <th>4</th>\n",
  3662. " <td>4</td>\n",
  3663. " <td>1</td>\n",
  3664. " <td>0</td>\n",
  3665. " <td>0</td>\n",
  3666. " </tr>\n",
  3667. " <tr>\n",
  3668. " <th>5</th>\n",
  3669. " <td>5</td>\n",
  3670. " <td>0</td>\n",
  3671. " <td>1</td>\n",
  3672. " <td>0</td>\n",
  3673. " </tr>\n",
  3674. " </tbody>\n",
  3675. "</table>\n",
  3676. "</div>"
  3677. ],
  3678. "text/plain": [
  3679. " data1 key_a key_b key_c\n",
  3680. "0 0 0 1 0\n",
  3681. "1 1 0 1 0\n",
  3682. "2 2 1 0 0\n",
  3683. "3 3 0 0 1\n",
  3684. "4 4 1 0 0\n",
  3685. "5 5 0 1 0"
  3686. ]
  3687. },
  3688. "execution_count": 59,
  3689. "metadata": {},
  3690. "output_type": "execute_result"
  3691. }
  3692. ],
  3693. "source": [
  3694. "pd.get_dummies(df) # only object/category columns are encoded, so numeric data1 is left as-is"
  3695. ]
  3696. },
  3697. {
  3698. "cell_type": "code",
  3699. "execution_count": 60,
  3700. "metadata": {},
  3701. "outputs": [
  3702. {
  3703. "data": {
  3704. "text/html": [
  3705. "<div>\n",
  3706. "<style scoped>\n",
  3707. " .dataframe tbody tr th:only-of-type {\n",
  3708. " vertical-align: middle;\n",
  3709. " }\n",
  3710. "\n",
  3711. " .dataframe tbody tr th {\n",
  3712. " vertical-align: top;\n",
  3713. " }\n",
  3714. "\n",
  3715. " .dataframe thead th {\n",
  3716. " text-align: right;\n",
  3717. " }\n",
  3718. "</style>\n",
  3719. "<table border=\"1\" class=\"dataframe\">\n",
  3720. " <thead>\n",
  3721. " <tr style=\"text-align: right;\">\n",
  3722. " <th></th>\n",
  3723. " <th>data1_0</th>\n",
  3724. " <th>data1_1</th>\n",
  3725. " <th>data1_2</th>\n",
  3726. " <th>data1_3</th>\n",
  3727. " <th>data1_4</th>\n",
  3728. " <th>data1_5</th>\n",
  3729. " <th>key_a</th>\n",
  3730. " <th>key_b</th>\n",
  3731. " <th>key_c</th>\n",
  3732. " </tr>\n",
  3733. " </thead>\n",
  3734. " <tbody>\n",
  3735. " <tr>\n",
  3736. " <th>0</th>\n",
  3737. " <td>1</td>\n",
  3738. " <td>0</td>\n",
  3739. " <td>0</td>\n",
  3740. " <td>0</td>\n",
  3741. " <td>0</td>\n",
  3742. " <td>0</td>\n",
  3743. " <td>0</td>\n",
  3744. " <td>1</td>\n",
  3745. " <td>0</td>\n",
  3746. " </tr>\n",
  3747. " <tr>\n",
  3748. " <th>1</th>\n",
  3749. " <td>0</td>\n",
  3750. " <td>1</td>\n",
  3751. " <td>0</td>\n",
  3752. " <td>0</td>\n",
  3753. " <td>0</td>\n",
  3754. " <td>0</td>\n",
  3755. " <td>0</td>\n",
  3756. " <td>1</td>\n",
  3757. " <td>0</td>\n",
  3758. " </tr>\n",
  3759. " <tr>\n",
  3760. " <th>2</th>\n",
  3761. " <td>0</td>\n",
  3762. " <td>0</td>\n",
  3763. " <td>1</td>\n",
  3764. " <td>0</td>\n",
  3765. " <td>0</td>\n",
  3766. " <td>0</td>\n",
  3767. " <td>1</td>\n",
  3768. " <td>0</td>\n",
  3769. " <td>0</td>\n",
  3770. " </tr>\n",
  3771. " <tr>\n",
  3772. " <th>3</th>\n",
  3773. " <td>0</td>\n",
  3774. " <td>0</td>\n",
  3775. " <td>0</td>\n",
  3776. " <td>1</td>\n",
  3777. " <td>0</td>\n",
  3778. " <td>0</td>\n",
  3779. " <td>0</td>\n",
  3780. " <td>0</td>\n",
  3781. " <td>1</td>\n",
  3782. " </tr>\n",
  3783. " <tr>\n",
  3784. " <th>4</th>\n",
  3785. " <td>0</td>\n",
  3786. " <td>0</td>\n",
  3787. " <td>0</td>\n",
  3788. " <td>0</td>\n",
  3789. " <td>1</td>\n",
  3790. " <td>0</td>\n",
  3791. " <td>1</td>\n",
  3792. " <td>0</td>\n",
  3793. " <td>0</td>\n",
  3794. " </tr>\n",
  3795. " <tr>\n",
  3796. " <th>5</th>\n",
  3797. " <td>0</td>\n",
  3798. " <td>0</td>\n",
  3799. " <td>0</td>\n",
  3800. " <td>0</td>\n",
  3801. " <td>0</td>\n",
  3802. " <td>1</td>\n",
  3803. " <td>0</td>\n",
  3804. " <td>1</td>\n",
  3805. " <td>0</td>\n",
  3806. " </tr>\n",
  3807. " </tbody>\n",
  3808. "</table>\n",
  3809. "</div>"
  3810. ],
  3811. "text/plain": [
  3812. " data1_0 data1_1 data1_2 data1_3 data1_4 data1_5 key_a key_b key_c\n",
  3813. "0 1 0 0 0 0 0 0 1 0\n",
  3814. "1 0 1 0 0 0 0 0 1 0\n",
  3815. "2 0 0 1 0 0 0 1 0 0\n",
  3816. "3 0 0 0 1 0 0 0 0 1\n",
  3817. "4 0 0 0 0 1 0 1 0 0\n",
  3818. "5 0 0 0 0 0 1 0 1 0"
  3819. ]
  3820. },
  3821. "execution_count": 60,
  3822. "metadata": {},
  3823. "output_type": "execute_result"
  3824. }
  3825. ],
  3826. "source": [
  3827. "# Specify the columns to be encoded manually.\n",
  3828. "pd.get_dummies(df, columns=['data1', 'key'])"
  3829. ]
  3830. },
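{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two other parameters from the `get_dummies` signature, `drop_first` and `dummy_na`, can be sketched as follows. The series `keys` is a hypothetical example with one missing value; it is not taken from the notebook's data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical categorical column containing one missing value.\n",
"keys = pd.Series(['b', 'b', 'a', None, 'a', 'c'])\n",
"\n",
"# drop_first=True drops the first level ('a'), leaving k - 1 columns.\n",
"reduced = pd.get_dummies(keys, drop_first=True)\n",
"\n",
"# dummy_na=True adds one extra indicator column for missing values.\n",
"with_na = pd.get_dummies(keys, dummy_na=True)\n",
"```\n",
"\n",
"Dropping one level avoids perfectly collinear indicator columns, which matters for models such as linear regression with an intercept. The dtype of the indicator columns depends on the pandas version, so pass `dtype` explicitly if a specific one is needed."
]
},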
  3831. {
  3832. "cell_type": "code",
  3833. "execution_count": 61,
  3834. "metadata": {},
  3835. "outputs": [
  3836. {
  3837. "data": {
  3838. "text/html": [
  3839. "<div>\n",
  3840. "<style scoped>\n",
  3841. " .dataframe tbody tr th:only-of-type {\n",
  3842. " vertical-align: middle;\n",
  3843. " }\n",
  3844. "\n",
  3845. " .dataframe tbody tr th {\n",
  3846. " vertical-align: top;\n",
  3847. " }\n",
  3848. "\n",
  3849. " .dataframe thead th {\n",
  3850. " text-align: right;\n",
  3851. " }\n",
  3852. "</style>\n",
  3853. "<table border=\"1\" class=\"dataframe\">\n",
  3854. " <thead>\n",
  3855. " <tr style=\"text-align: right;\">\n",
  3856. " <th></th>\n",
  3857. " <th>prefix_a</th>\n",
  3858. " <th>prefix_b</th>\n",
  3859. " <th>prefix_c</th>\n",
  3860. " </tr>\n",
  3861. " </thead>\n",
  3862. " <tbody>\n",
  3863. " <tr>\n",
  3864. " <th>0</th>\n",
  3865. " <td>0</td>\n",
  3866. " <td>1</td>\n",
  3867. " <td>0</td>\n",
  3868. " </tr>\n",
  3869. " <tr>\n",
  3870. " <th>1</th>\n",
  3871. " <td>0</td>\n",
  3872. " <td>1</td>\n",
  3873. " <td>0</td>\n",
  3874. " </tr>\n",
  3875. " <tr>\n",
  3876. " <th>2</th>\n",
  3877. " <td>1</td>\n",
  3878. " <td>0</td>\n",
  3879. " <td>0</td>\n",
  3880. " </tr>\n",
  3881. " <tr>\n",
  3882. " <th>3</th>\n",
  3883. " <td>0</td>\n",
  3884. " <td>0</td>\n",
  3885. " <td>1</td>\n",
  3886. " </tr>\n",
  3887. " <tr>\n",
  3888. " <th>4</th>\n",
  3889. " <td>1</td>\n",
  3890. " <td>0</td>\n",
  3891. " <td>0</td>\n",
  3892. " </tr>\n",
  3893. " <tr>\n",
  3894. " <th>5</th>\n",
  3895. " <td>0</td>\n",
  3896. " <td>1</td>\n",
  3897. " <td>0</td>\n",
  3898. " </tr>\n",
  3899. " </tbody>\n",
  3900. "</table>\n",
  3901. "</div>"
  3902. ],
  3903. "text/plain": [
  3904. " prefix_a prefix_b prefix_c\n",
  3905. "0 0 1 0\n",
  3906. "1 0 1 0\n",
  3907. "2 1 0 0\n",
  3908. "3 0 0 1\n",
  3909. "4 1 0 0\n",
  3910. "5 0 1 0"
  3911. ]
  3912. },
  3913. "execution_count": 61,
  3914. "metadata": {},
  3915. "output_type": "execute_result"
  3916. }
  3917. ],
  3918. "source": [
  3919. "dummies = pd.get_dummies(df['key'], prefix='prefix')\n",
  3920. "dummies"
  3921. ]
  3922. },
  3923. {
  3924. "cell_type": "code",
  3925. "execution_count": 62,
  3926. "metadata": {},
  3927. "outputs": [
  3928. {
  3929. "data": {
  3930. "text/html": [
  3931. "<div>\n",
  3932. "<style scoped>\n",
  3933. " .dataframe tbody tr th:only-of-type {\n",
  3934. " vertical-align: middle;\n",
  3935. " }\n",
  3936. "\n",
  3937. " .dataframe tbody tr th {\n",
  3938. " vertical-align: top;\n",
  3939. " }\n",
  3940. "\n",
  3941. " .dataframe thead th {\n",
  3942. " text-align: right;\n",
  3943. " }\n",
  3944. "</style>\n",
  3945. "<table border=\"1\" class=\"dataframe\">\n",
  3946. " <thead>\n",
  3947. " <tr style=\"text-align: right;\">\n",
  3948. " <th></th>\n",
  3949. " <th>data1</th>\n",
  3950. " <th>prefix_a</th>\n",
  3951. " <th>prefix_b</th>\n",
  3952. " <th>prefix_c</th>\n",
  3953. " </tr>\n",
  3954. " </thead>\n",
  3955. " <tbody>\n",
  3956. " <tr>\n",
  3957. " <th>0</th>\n",
  3958. " <td>0</td>\n",
  3959. " <td>0</td>\n",
  3960. " <td>1</td>\n",
  3961. " <td>0</td>\n",
  3962. " </tr>\n",
  3963. " <tr>\n",
  3964. " <th>1</th>\n",
  3965. " <td>1</td>\n",
  3966. " <td>0</td>\n",
  3967. " <td>1</td>\n",
  3968. " <td>0</td>\n",
  3969. " </tr>\n",
  3970. " <tr>\n",
  3971. " <th>2</th>\n",
  3972. " <td>2</td>\n",
  3973. " <td>1</td>\n",
  3974. " <td>0</td>\n",
  3975. " <td>0</td>\n",
  3976. " </tr>\n",
  3977. " <tr>\n",
  3978. " <th>3</th>\n",
  3979. " <td>3</td>\n",
  3980. " <td>0</td>\n",
  3981. " <td>0</td>\n",
  3982. " <td>1</td>\n",
  3983. " </tr>\n",
  3984. " <tr>\n",
  3985. " <th>4</th>\n",
  3986. " <td>4</td>\n",
  3987. " <td>1</td>\n",
  3988. " <td>0</td>\n",
  3989. " <td>0</td>\n",
  3990. " </tr>\n",
  3991. " <tr>\n",
  3992. " <th>5</th>\n",
  3993. " <td>5</td>\n",
  3994. " <td>0</td>\n",
  3995. " <td>1</td>\n",
  3996. " <td>0</td>\n",
  3997. " </tr>\n",
  3998. " </tbody>\n",
  3999. "</table>\n",
  4000. "</div>"
  4001. ],
  4002. "text/plain": [
  4003. " data1 prefix_a prefix_b prefix_c\n",
  4004. "0 0 0 1 0\n",
  4005. "1 1 0 1 0\n",
  4006. "2 2 1 0 0\n",
  4007. "3 3 0 0 1\n",
  4008. "4 4 1 0 0\n",
  4009. "5 5 0 1 0"
  4010. ]
  4011. },
  4012. "execution_count": 62,
  4013. "metadata": {},
  4014. "output_type": "execute_result"
  4015. }
  4016. ],
  4017. "source": [
  4018. "df_with_dummy = df[['data1']].join(dummies)\n",
  4019. "df_with_dummy"
  4020. ]
  4021. },
  4022. {
  4023. "cell_type": "code",
  4024. "execution_count": 63,
  4025. "metadata": {},
  4026. "outputs": [
  4027. {
  4028. "data": {
  4029. "text/html": [
  4030. "<div>\n",
  4031. "<style scoped>\n",
  4032. " .dataframe tbody tr th:only-of-type {\n",
  4033. " vertical-align: middle;\n",
  4034. " }\n",
  4035. "\n",
  4036. " .dataframe tbody tr th {\n",
  4037. " vertical-align: top;\n",
  4038. " }\n",
  4039. "\n",
  4040. " .dataframe thead th {\n",
  4041. " text-align: right;\n",
  4042. " }\n",
  4043. "</style>\n",
  4044. "<table border=\"1\" class=\"dataframe\">\n",
  4045. " <thead>\n",
  4046. " <tr style=\"text-align: right;\">\n",
  4047. " <th></th>\n",
  4048. " <th>movie_id</th>\n",
  4049. " <th>title</th>\n",
  4050. " <th>genres</th>\n",
  4051. " </tr>\n",
  4052. " </thead>\n",
  4053. " <tbody>\n",
  4054. " <tr>\n",
  4055. " <th>0</th>\n",
  4056. " <td>1</td>\n",
  4057. " <td>Toy Story (1995)</td>\n",
  4058. " <td>Animation|Children's|Comedy</td>\n",
  4059. " </tr>\n",
  4060. " <tr>\n",
  4061. " <th>1</th>\n",
  4062. " <td>2</td>\n",
  4063. " <td>Jumanji (1995)</td>\n",
  4064. " <td>Adventure|Children's|Fantasy</td>\n",
  4065. " </tr>\n",
  4066. " <tr>\n",
  4067. " <th>2</th>\n",
  4068. " <td>3</td>\n",
  4069. " <td>Grumpier Old Men (1995)</td>\n",
  4070. " <td>Comedy|Romance</td>\n",
  4071. " </tr>\n",
  4072. " <tr>\n",
  4073. " <th>3</th>\n",
  4074. " <td>4</td>\n",
  4075. " <td>Waiting to Exhale (1995)</td>\n",
  4076. " <td>Comedy|Drama</td>\n",
  4077. " </tr>\n",
  4078. " <tr>\n",
  4079. " <th>4</th>\n",
  4080. " <td>5</td>\n",
  4081. " <td>Father of the Bride Part II (1995)</td>\n",
  4082. " <td>Comedy</td>\n",
  4083. " </tr>\n",
  4084. " <tr>\n",
  4085. " <th>5</th>\n",
  4086. " <td>6</td>\n",
  4087. " <td>Heat (1995)</td>\n",
  4088. " <td>Action|Crime|Thriller</td>\n",
  4089. " </tr>\n",
  4090. " <tr>\n",
  4091. " <th>6</th>\n",
  4092. " <td>7</td>\n",
  4093. " <td>Sabrina (1995)</td>\n",
  4094. " <td>Comedy|Romance</td>\n",
  4095. " </tr>\n",
  4096. " <tr>\n",
  4097. " <th>7</th>\n",
  4098. " <td>8</td>\n",
  4099. " <td>Tom and Huck (1995)</td>\n",
  4100. " <td>Adventure|Children's</td>\n",
  4101. " </tr>\n",
  4102. " <tr>\n",
  4103. " <th>8</th>\n",
  4104. " <td>9</td>\n",
  4105. " <td>Sudden Death (1995)</td>\n",
  4106. " <td>Action</td>\n",
  4107. " </tr>\n",
  4108. " <tr>\n",
  4109. " <th>9</th>\n",
  4110. " <td>10</td>\n",
  4111. " <td>GoldenEye (1995)</td>\n",
  4112. " <td>Action|Adventure|Thriller</td>\n",
  4113. " </tr>\n",
  4114. " </tbody>\n",
  4115. "</table>\n",
  4116. "</div>"
  4117. ],
  4118. "text/plain": [
  4119. " movie_id title genres\n",
  4120. "0 1 Toy Story (1995) Animation|Children's|Comedy\n",
  4121. "1 2 Jumanji (1995) Adventure|Children's|Fantasy\n",
  4122. "2 3 Grumpier Old Men (1995) Comedy|Romance\n",
  4123. "3 4 Waiting to Exhale (1995) Comedy|Drama\n",
  4124. "4 5 Father of the Bride Part II (1995) Comedy\n",
  4125. "5 6 Heat (1995) Action|Crime|Thriller\n",
  4126. "6 7 Sabrina (1995) Comedy|Romance\n",
  4127. "7 8 Tom and Huck (1995) Adventure|Children's\n",
  4128. "8 9 Sudden Death (1995) Action\n",
  4129. "9 10 GoldenEye (1995) Action|Adventure|Thriller"
  4130. ]
  4131. },
  4132. "execution_count": 63,
  4133. "metadata": {},
  4134. "output_type": "execute_result"
  4135. }
  4136. ],
  4137. "source": [
  4138. "mnames = ['movie_id', 'title', 'genres']\n",
  4139. "movies = pd.read_csv('../datasets/movielens/movies.dat', sep='::',\n",
  4140. " header=None, names=mnames, engine='python')\n",
  4141. "movies[:10]"
  4142. ]
  4143. },
  4144. {
  4145. "cell_type": "code",
  4146. "execution_count": 64,
  4147. "metadata": {},
  4148. "outputs": [
  4149. {
  4150. "data": {
  4151. "text/plain": [
  4152. "array(['Animation', \"Children's\", 'Comedy', 'Adventure', 'Fantasy',\n",
  4153. " 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',\n",
  4154. " 'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',\n",
  4155. " 'Western'], dtype=object)"
  4156. ]
  4157. },
  4158. "execution_count": 64,
  4159. "metadata": {},
  4160. "output_type": "execute_result"
  4161. }
  4162. ],
  4163. "source": [
  4164. "# extract the list of (unique) genres\n",
  4165. "all_genres = []\n",
  4166. "for x in movies['genres']:\n",
  4167. " all_genres.extend(x.split('|'))\n",
  4168. "genres = pd.unique(all_genres)\n",
  4169. "genres"
  4170. ]
  4171. },
  4172. {
  4173. "cell_type": "code",
  4174. "execution_count": 65,
  4175. "metadata": {},
  4176. "outputs": [
  4177. {
  4178. "data": {
  4179. "text/plain": [
  4180. "movie_id 1\n",
  4181. "title Toy Story (1995)\n",
  4182. "genres Animation|Children's|Comedy\n",
  4183. "Genre_Animation 1\n",
  4184. "Genre_Children's 1\n",
  4185. "Genre_Comedy 1\n",
  4186. "Genre_Adventure 0\n",
  4187. "Genre_Fantasy 0\n",
  4188. "Genre_Romance 0\n",
  4189. "Genre_Drama 0\n",
  4190. "Genre_Action 0\n",
  4191. "Genre_Crime 0\n",
  4192. "Genre_Thriller 0\n",
  4193. "Genre_Horror 0\n",
  4194. "Genre_Sci-Fi 0\n",
  4195. "Genre_Documentary 0\n",
  4196. "Genre_War 0\n",
  4197. "Genre_Musical 0\n",
  4198. "Genre_Mystery 0\n",
  4199. "Genre_Film-Noir 0\n",
  4200. "Genre_Western 0\n",
  4201. "Name: 0, dtype: object"
  4202. ]
  4203. },
  4204. "execution_count": 65,
  4205. "metadata": {},
  4206. "output_type": "execute_result"
  4207. }
  4208. ],
  4209. "source": [
  4210. "# construct an all-zeros DataFrame to hold the dummies\n",
  4211. "zero_matrix = np.zeros((len(movies), len(genres)))\n",
  4212. "dummies = pd.DataFrame(zero_matrix, columns=genres)\n",
  4213. "\n",
  4214. "# set dummies for each movie\n",
  4215. "for i, gen in enumerate(movies['genres']):\n",
  4216. " indices = dummies.columns.get_indexer(gen.split('|'))\n",
  4217. " dummies.iloc[i, indices] = 1\n",
  4218. " \n",
  4219. "# combine the dummies with movies\n",
  4220. "movies_windic = movies.join(dummies.add_prefix('Genre_'))\n",
  4221. "movies_windic.iloc[0]"
  4222. ]
  4223. },
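The loop in the cell above builds one indicator column per genre for rows that belong to several categories at once. A minimal sketch of the same idea follows, assuming a tiny hand-made `movies` frame (three rows copied from the output above) in place of the MovieLens file, which is not available here. It also compares the result against `Series.str.get_dummies`, which splits on a separator directly; that comparison is an addition, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movies.dat: three rows taken from the output above.
movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Heat (1995)', 'Sabrina (1995)'],
    'genres': ["Animation|Children's|Comedy", 'Action|Crime|Thriller',
               'Comedy|Romance'],
})

# Collect the unique genres in order of first appearance.
all_genres = []
for entry in movies['genres']:
    all_genres.extend(entry.split('|'))
genres = pd.unique(pd.Series(all_genres))

# Start from all zeros, then set a 1 for every genre a movie belongs to.
dummies = pd.DataFrame(np.zeros((len(movies), len(genres)), dtype=int),
                       columns=genres)
for i, entry in enumerate(movies['genres']):
    indices = dummies.columns.get_indexer(entry.split('|'))
    dummies.iloc[i, indices] = 1

# Alternative: let pandas split on '|' and build the indicators itself.
# Its columns come back sorted alphabetically rather than by first appearance.
alt = movies['genres'].str.get_dummies('|')

movies_windic = movies.join(dummies.add_prefix('Genre_'))
```

For large tables the explicit loop can be slow; `str.get_dummies` avoids the Python-level iteration at the cost of a different column order.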
  4224. {
  4225. "cell_type": "code",
  4226. "execution_count": 66,
  4227. "metadata": {},
  4228. "outputs": [
  4229. {
  4230. "data": {
  4231. "text/plain": [
  4232. "array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,\n",
  4233. " 0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])"
  4234. ]
  4235. },
  4236. "execution_count": 66,
  4237. "metadata": {},
  4238. "output_type": "execute_result"
  4239. }
  4240. ],
  4241. "source": [
  4242. "np.random.seed(12345)\n",
  4243. "values = np.random.rand(10)\n",
  4244. "values"
  4245. ]
  4246. },
  4247. {
  4248. "cell_type": "code",
  4249. "execution_count": 67,
  4250. "metadata": {},
  4251. "outputs": [
  4252. {
  4253. "data": {
  4254. "text/plain": [
  4255. " (0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]\n",
  4256. "0 0 0 0 0 1\n",
  4257. "1 0 1 0 0 0\n",
  4258. "2 1 0 0 0 0\n",
  4259. "3 0 1 0 0 0\n",
  4260. "4 0 0 1 0 0\n",
  4261. "5 0 0 1 0 0\n",
  4262. "6 0 0 0 0 1\n",
  4263. "7 0 0 0 1 0\n",
  4264. "8 0 0 0 1 0\n",
  4265. "9 0 0 0 1 0"
  4266. ]
  4267. },
  4268. "execution_count": 67,
  4269. "metadata": {},
  4270. "output_type": "execute_result"
  4271. }
  4272. ],
  4273. "source": [
  4274. "bins = [0, 0.2, 0.4, 0.6, 0.8, 1]\n",
  4275. "pd.get_dummies(pd.cut(values, bins))"
  4276. ]
  4277. },
  4278. {
  4279. "cell_type": "markdown",
  4280. "metadata": {},
  4281. "source": [
  4282. "## String Manipulation\n",
  4283. "\n",
  4284. "* Python built-in string methods\n",
  4285. "\n",
  4286. "Method | Description\n",
  4287. ":--- | :---\n",
  4288. "`count` | Return the number of *non-overlapping* occurrences of substring in the string.\n",
  4289. "`endswith` | Return `True` if string ends with suffix.\n",
  4290. "`startswith` | Return `True` if string starts with prefix.\n",
  4291. "`join` | Use string as delimiter for concatenating a sequence of other strings.\n",
  4292. "`index` | Return position of first character in substring if found in the string; raise `ValueError` if not found.\n",
  4293. "`find` | Return position of first character of *first* occurrence of substring in the string; like `index`, but returns -1 if not found.\n",
  4294. "`rfind` | Return position of first character of *last* occurrence of substring in the string; return -1 if not found.\n",
  4295. "`replace` | Replace occurrences of string with another string.\n",
  4296. "`strip`, `rstrip`, `lstrip` | Trim whitespace, including newlines.\n",
  4297. "`split` | Break string into list of substrings using passed delimiter.\n",
  4298. "`lower` | Convert alphabetic characters to lowercase.\n",
  4299. "`upper` | Convert alphabetic characters to uppercase.\n",
  4300. "`casefold` | Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form.\n",
  4301. "`ljust`, `rjust` | Left justify or right justify; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.\n",
  4302. "\n",
  4303. "### Regular Expressions\n",
  4304. "\n",
  4305. "* Regular expression methods\n",
  4306. "\n",
  4307. "Method | Description\n",
  4308. ":--- | :---\n",
  4309. "`findall` | Return all non-overlapping matching patterns in a string as a list.\n",
  4310. "`finditer` | Like `findall`, but return an iterator.\n",
  4311. "`match` | Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, return a match object; otherwise return `None`.\n",
  4312. "`search` | Scan string for match to pattern; return a match object if found; unlike `match`, the match can be anywhere in the string as opposed to only at the beginning.\n",
  4313. "`split` | Break string into pieces at each occurrence of pattern.\n",
  4314. "`sub`, `subn` | Replace occurrences of pattern in string with replacement expression (all of them by default, or at most `count` of them if given); `subn` does the same but returns a tuple of the new string and the number of substitutions made; use symbols \\1, \\2, ... to refer to match group elements in the replacement string.\n",
  4315. "\n",
  4316. "### Vectorized String Functions in pandas\n",
  4317. "\n",
  4318. "**Series** (and Index) has array-oriented methods for string operations that skip NA values. These are accessed through the `str` attribute."
  4319. ]
  4320. },
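A few rows of the two tables above are easier to tell apart when run side by side, in particular `find` versus `index`, `match` versus `search`, and `sub` versus `subn`. The following is a minimal sketch using only the standard library; the sample strings and the e-mail pattern are illustrative assumptions, not taken from the notebook.

```python
import re

# --- Built-in string methods ---
raw = "a,b,  guido"
pieces = [x.strip() for x in raw.split(",")]   # ['a', 'b', 'guido']
joined = "::".join(pieces)                     # 'a::b::guido'
found = joined.find(":")                       # 1
missing = joined.find("?")                     # -1; joined.index("?") would raise ValueError

# --- Regular expressions ---
text = "foo    bar\t baz  \tqux"
ws = re.compile(r"\s+")                        # one or more whitespace characters
tokens = ws.split(text)                        # ['foo', 'bar', 'baz', 'qux']

# match only succeeds at the start of the string; search scans the whole string.
email = re.compile(r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})",
                   flags=re.IGNORECASE)
line = "contact: dave@google.com"
m1 = email.match(line)                         # None: the line does not start with an address
m2 = email.search(line)
groups = m2.groups()                           # ('dave', 'google', 'com')

# sub replaces every occurrence (or at most `count` of them);
# subn also reports how many replacements were made.
replaced = ws.sub(" ", text)                   # 'foo bar baz qux'
replaced_n, n_subs = ws.subn(" ", text)        # ('foo bar baz qux', 3)
first_only = ws.sub(" ", text, count=1)        # only the first run is replaced
```

Compiling the pattern once with `re.compile` is mainly worthwhile when the same expression is applied to many strings.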
  4321. {
  4322. "cell_type": "code",
  4323. "execution_count": 68,
  4324. "metadata": {},
  4325. "outputs": [
  4326. {
  4327. "data": {
  4328. "text/plain": [
  4329. "Dave dave@google.com\n",
  4330. "Steve steve@gmail.com\n",
  4331. "Rob rog@gmail.com\n",
  4332. "Wes NaN\n",
  4333. "dtype: object"
  4334. ]
  4335. },
  4336. "execution_count": 68,
  4337. "metadata": {},
  4338. "output_type": "execute_result"
  4339. }
  4340. ],
  4341. "source": [
  4342. "data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',\n",
  4343. " 'Rob': 'rog@gmail.com', 'Wes': np.nan}\n",
  4344. "data = pd.Series(data)\n",
  4345. "data"
  4346. ]
  4347. },
  4348. {
  4349. "cell_type": "code",
  4350. "execution_count": 69,
  4351. "metadata": {},
  4352. "outputs": [
  4353. {
  4354. "data": {
  4355. "text/plain": [
  4356. "Dave False\n",
  4357. "Steve True\n",
  4358. "Rob True\n",
  4359. "Wes NaN\n",
  4360. "dtype: object"
  4361. ]
  4362. },
  4363. "execution_count": 69,
  4364. "metadata": {},
  4365. "output_type": "execute_result"
  4366. }
  4367. ],
  4368. "source": [
  4369. "data.str.contains('gmail')"
  4370. ]
  4371. },
  4372. {
  4373. "cell_type": "code",
  4374. "execution_count": 70,
  4375. "metadata": {},
  4376. "outputs": [
  4377. {
  4378. "data": {
  4379. "text/plain": [
  4380. "Dave dave@\n",
  4381. "Steve steve\n",
  4382. "Rob rog@g\n",
  4383. "Wes NaN\n",
  4384. "dtype: object"
  4385. ]
  4386. },
  4387. "execution_count": 70,
  4388. "metadata": {},
  4389. "output_type": "execute_result"
  4390. }
  4391. ],
  4392. "source": [
  4393. "data.str[:5]"
  4394. ]
  4395. },
  4396. {
  4397. "cell_type": "code",
  4398. "execution_count": 71,
  4399. "metadata": {},
  4400. "outputs": [
  4401. {
  4402. "data": {
  4403. "text/plain": [
  4404. "Dave a\n",
  4405. "Steve t\n",
  4406. "Rob o\n",
  4407. "Wes NaN\n",
  4408. "dtype: object"
  4409. ]
  4410. },
  4411. "execution_count": 71,
  4412. "metadata": {},
  4413. "output_type": "execute_result"
  4414. }
  4415. ],
  4416. "source": [
  4417. "data.str.get(1) # or data.str[1]"
  4418. ]
  4419. },
  4420. {
  4421. "cell_type": "markdown",
  4422. "metadata": {},
  4423. "source": [
  4424. "* Some vectorized string methods\n",
  4425. "\n",
  4426. "Method | Description\n",
  4427. ":--- | :---\n",
  4428. "`cat` | Concatenate strings element-wise with optional delimiter.\n",
  4429. "`contains` | Return boolean array indicating whether each string contains pattern/regex.\n",
  4430. "`count` | Count occurrences of pattern.\n",
  4431. "`extract` | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group.\n",
  4432. "`endswith` | Equivalent to `x.endswith(pattern)` for each element.\n",
  4433. "`startswith` | Equivalent to `x.startswith(pattern)` for each element.\n",
  4434. "`findall` | Compute list of all occurrences of pattern/regex for each string.\n",
  4435. "`get` | Index into each element (retrieve i-th element).\n",
  4436. "`isalnum` | Equivalent to built-in `str.isalnum`.\n",
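  4437. "`isalpha` | Equivalent to built-in `str.isalpha`.\n",
  4438. "`isdecimal` | Equivalent to built-in `str.isdecimal`.\n",
  4439. "`isdigit` | Equivalent to built-in `str.isdigit`.\n",
  4440. "`islower` | Equivalent to built-in `str.islower`.\n",
  4441. "`isnumeric` | Equivalent to built-in `str.isnumeric`.\n",
  4442. "`isupper` | Equivalent to built-in `str.isupper`.\n",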
  4443. "`join` | Join strings in each element of the Series with passed separator.\n",
  4444. "`len` | Compute length of each string.\n",
  4445. "`lower`, `upper` | Convert case.\n",
  4446. "`match` | `re.match` on each element.\n",
  4447. "`pad` | Add whitespace to left, right, or both sides of strings.\n",
  4448. "`center` | Equivalent to `pad(side='both')`.\n",
  4449. "`repeat` | Duplicate values.\n",
  4450. "`replace` | Replace occurrences of pattern/regex with some other string.\n",
  4451. "`slice` | Slice each string in the Series.\n",
  4452. "`split` | Split strings on delimiter or regex.\n",
  4453. "`strip`, `lstrip`, `rstrip` | Trim whitespace on both, left, right side."
  4454. ]
  4455. }
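Several of the vectorized methods listed above can be tried on the same e-mail Series used earlier. This is a minimal sketch: the Series is rebuilt under the name `emails` so the block runs on its own, and the regular expression and the column names `user`, `domain`, `suffix` are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Same values as the `data` Series in the cells above.
emails = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                    'Rob': 'rog@gmail.com', 'Wes': np.nan})

# contains: passing na=False yields a plain boolean mask instead of propagating NaN.
is_gmail = emails.str.contains('gmail', na=False)

# extract: one output column per capture group; rows with no match are all-NaN.
pattern = r'([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+)\.([A-Za-z]{2,4})'
parts = emails.str.extract(pattern)
parts.columns = ['user', 'domain', 'suffix']

# split + get: split each address on '@' and keep the first piece.
user = emails.str.split('@').str.get(0)

# len: length of each string; the missing entry stays missing.
lengths = emails.str.len()
```

A mask such as `is_gmail` can be used directly for boolean indexing, for example `emails[is_gmail]`.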
  4456. ],
  4457. "metadata": {
  4458. "kernelspec": {
  4459. "display_name": "Python 3",
  4460. "language": "python",
  4461. "name": "python3"
  4462. },
  4463. "language_info": {
  4464. "codemirror_mode": {
  4465. "name": "ipython",
  4466. "version": 3
  4467. },
  4468. "file_extension": ".py",
  4469. "mimetype": "text/x-python",
  4470. "name": "python",
  4471. "nbconvert_exporter": "python",
  4472. "pygments_lexer": "ipython3",
  4473. "version": "3.6.7"
  4474. },
  4475. "toc": {
  4476. "base_numbering": 1,
  4477. "nav_menu": {},
  4478. "number_sections": true,
  4479. "sideBar": false,
  4480. "skip_h1_title": false,
  4481. "title_cell": "Table of Contents",
  4482. "title_sidebar": "Contents",
  4483. "toc_cell": false,
  4484. "toc_position": {
  4485. "height": "267px",
  4486. "left": "1065px",
  4487. "right": "0px",
  4488. "top": "33px",
  4489. "width": "215px"
  4490. },
  4491. "toc_section_display": false,
  4492. "toc_window_display": false
  4493. }
  4494. },
  4495. "nbformat": 4,
  4496. "nbformat_minor": 2
  4497. }