- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Data Cleaning and Preparation"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import pandas as pd"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Numpy version: 1.16.1\n",
- "Pandas version: 0.24.1\n"
- ]
- }
- ],
- "source": [
- "print(f'Numpy version: {np.__version__}')\n",
- "print(f'Pandas version: {pd.__version__}')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Handling Missing Data\n",
- "\n",
- "* NA (missing data) handling methods\n",
- "\n",
- "Methods | Description\n",
- ":--- | :---\n",
- "`dropna` | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.\n",
- "`fillna` | Fill in missing data with some value or using an interpolation method such as `ffill` or `bfill`.\n",
- "`isnull` | Return boolean values indicating which values are missing.\n",
- "`notnull` | Negation of `isnull`.\n",
- "\n",
- "### Filtering Out Missing Data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 1.0\n",
- "1 NaN\n",
- "2 3.5\n",
- "3 NaN\n",
- "4 7.0\n",
- "dtype: float64"
- ]
- },
- "execution_count": 3,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data = pd.Series([1, np.nan, 3.5, np.nan, 7])\n",
- "data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "```py\n",
- "Series.dropna(\n",
- " axis=0,\n",
- " inplace=False,\n",
- " **kwargs\n",
- ")\n",
- "```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 1.0\n",
- "2 3.5\n",
- "4 7.0\n",
- "dtype: float64"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.dropna()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1.0</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>NaN</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2\n",
- "0 1.0 6.5 3.0\n",
- "1 1.0 NaN NaN\n",
- "2 NaN NaN NaN\n",
- "3 NaN 6.5 3.0"
- ]
- },
- "execution_count": 5,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],\n",
- " [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])\n",
- "data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "```py\n",
- "DataFrame.dropna(\n",
- " axis=0,\n",
- " how='any',\n",
- " thresh=None,\n",
- " subset=None,\n",
- " inplace=False\n",
- ")\n",
- "```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2\n",
- "0 1.0 6.5 3.0"
- ]
- },
- "execution_count": 6,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "cleaned = data.dropna()\n",
- "cleaned"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1.0</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>NaN</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2\n",
- "0 1.0 6.5 3.0\n",
- "1 1.0 NaN NaN\n",
- "3 NaN 6.5 3.0"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.dropna(how='all') # drop only the rows in which every value is NA"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1.0</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>NaN</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "0 1.0 6.5 3.0 NaN\n",
- "1 1.0 NaN NaN NaN\n",
- "2 NaN NaN NaN NaN\n",
- "3 NaN 6.5 3.0 NaN"
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data[3] = np.nan\n",
- "data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1.0</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>NaN</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2\n",
- "0 1.0 6.5 3.0\n",
- "1 1.0 NaN NaN\n",
- "2 NaN NaN NaN\n",
- "3 NaN 6.5 3.0"
- ]
- },
- "execution_count": 9,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.dropna(axis=1, how='all') # drop the column with all NA values"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>NaN</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "0 1.0 6.5 3.0 NaN\n",
- "3 NaN 6.5 3.0 NaN"
- ]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.dropna(thresh=2) # keep only rows with at least 2 non-NA values"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1.0</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>NaN</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2\n",
- "0 1.0 6.5 3.0\n",
- "1 1.0 NaN NaN\n",
- "2 NaN NaN NaN\n",
- "3 NaN 6.5 3.0"
- ]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.dropna(axis='columns', thresh=2)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Filling In Missing Data\n",
- "\n",
- "```py\n",
- "fillna(\n",
- " value=None,\n",
- " method=None,\n",
- " axis=None,\n",
- " inplace=False,\n",
- " limit=None,\n",
- " downcast=None,\n",
- " **kwargs\n",
- ")\n",
- "```\n",
- "\n",
- "* `fillna` function arguments\n",
- "\n",
- "Arg | Description\n",
- ":--- | :---\n",
- "`value` | Scalar value or dict-like object to use to fill missing values.\n",
- "`method` | Interpolation method, such as `ffill` or `bfill`; defaults to `None`, in which case `value` must be supplied.\n",
- "`axis` | Axis to fill on; default `axis=0`.\n",
- "`inplace` | Modify the calling object without producing a copy.\n",
- "`limit` | For forward and backward filling, maximum number of consecutive periods to fill."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1.0</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>NaN</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "0 1.0 6.5 3.0 NaN\n",
- "1 1.0 NaN NaN NaN\n",
- "2 NaN NaN NaN NaN\n",
- "3 NaN 6.5 3.0 NaN"
- ]
- },
- "execution_count": 12,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>0.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>0.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>0.0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "0 1.0 6.5 3.0 0.0\n",
- "1 1.0 0.0 0.0 0.0\n",
- "2 0.0 0.0 0.0 0.0\n",
- "3 0.0 6.5 3.0 0.0"
- ]
- },
- "execution_count": 13,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.fillna(0) # fill NA with value 0"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>99.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1.0</td>\n",
- " <td>11.0</td>\n",
- " <td>NaN</td>\n",
- " <td>99.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>NaN</td>\n",
- " <td>11.0</td>\n",
- " <td>NaN</td>\n",
- " <td>99.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>NaN</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>99.0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "0 1.0 6.5 3.0 99.0\n",
- "1 1.0 11.0 NaN 99.0\n",
- "2 NaN 11.0 NaN 99.0\n",
- "3 NaN 6.5 3.0 99.0"
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.fillna({1: 11, 3: 99}) # use a different value for each column"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "0 1.0 6.5 3.0 NaN\n",
- "1 1.0 6.5 3.0 NaN\n",
- "2 1.0 6.5 3.0 NaN\n",
- "3 1.0 6.5 3.0 NaN"
- ]
- },
- "execution_count": 15,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.fillna(method='ffill')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1.0</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1.0</td>\n",
- " <td>1.0</td>\n",
- " <td>1.0</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " <td>NaN</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>NaN</td>\n",
- " <td>6.5</td>\n",
- " <td>3.0</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "0 1.0 6.5 3.0 3.0\n",
- "1 1.0 1.0 1.0 NaN\n",
- "2 NaN NaN NaN NaN\n",
- "3 NaN 6.5 3.0 3.0"
- ]
- },
- "execution_count": 16,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.fillna(axis=1, method='ffill', limit=2) # forward-fill along each row, at most 2 consecutive NAs"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Transformation\n",
- "\n",
- "### Removing Duplicates"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>k1</th>\n",
- " <th>k2</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>one</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>two</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>one</td>\n",
- " <td>2</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>two</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>one</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>two</td>\n",
- " <td>4</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>6</th>\n",
- " <td>two</td>\n",
- " <td>4</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " k1 k2\n",
- "0 one 1\n",
- "1 two 1\n",
- "2 one 2\n",
- "3 two 3\n",
- "4 one 3\n",
- "5 two 4\n",
- "6 two 4"
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],\n",
- " 'k2': [1, 1, 2, 3, 3, 4, 4]})\n",
- "data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 False\n",
- "1 False\n",
- "2 False\n",
- "3 False\n",
- "4 False\n",
- "5 False\n",
- "6 True\n",
- "dtype: bool"
- ]
- },
- "execution_count": 18,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.duplicated()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>k1</th>\n",
- " <th>k2</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>one</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>two</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>one</td>\n",
- " <td>2</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>two</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>one</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>two</td>\n",
- " <td>4</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " k1 k2\n",
- "0 one 1\n",
- "1 two 1\n",
- "2 one 2\n",
- "3 two 3\n",
- "4 one 3\n",
- "5 two 4"
- ]
- },
- "execution_count": 19,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.drop_duplicates()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>k1</th>\n",
- " <th>k2</th>\n",
- " <th>v1</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>one</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>two</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " k1 k2 v1\n",
- "0 one 1 0\n",
- "1 two 1 1"
- ]
- },
- "execution_count": 20,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data['v1'] = range(7)\n",
- "data.drop_duplicates(['k1'])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>k1</th>\n",
- " <th>k2</th>\n",
- " <th>v1</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>one</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>two</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>one</td>\n",
- " <td>2</td>\n",
- " <td>2</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>two</td>\n",
- " <td>3</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>one</td>\n",
- " <td>3</td>\n",
- " <td>4</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>6</th>\n",
- " <td>two</td>\n",
- " <td>4</td>\n",
- " <td>6</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " k1 k2 v1\n",
- "0 one 1 0\n",
- "1 two 1 1\n",
- "2 one 2 2\n",
- "3 two 3 3\n",
- "4 one 3 4\n",
- "6 two 4 6"
- ]
- },
- "execution_count": 21,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.drop_duplicates(['k1', 'k2'], keep='last')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Transforming Data Using a Function or Mapping"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>food</th>\n",
- " <th>ounces</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>bacon</td>\n",
- " <td>4.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>pulled pork</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>bacon</td>\n",
- " <td>12.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>Pastrami</td>\n",
- " <td>6.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>corned beef</td>\n",
- " <td>7.5</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>Bacon</td>\n",
- " <td>8.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>6</th>\n",
- " <td>pastrami</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>7</th>\n",
- " <td>honey ham</td>\n",
- " <td>5.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>8</th>\n",
- " <td>nova lox</td>\n",
- " <td>6.0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " food ounces\n",
- "0 bacon 4.0\n",
- "1 pulled pork 3.0\n",
- "2 bacon 12.0\n",
- "3 Pastrami 6.0\n",
- "4 corned beef 7.5\n",
- "5 Bacon 8.0\n",
- "6 pastrami 3.0\n",
- "7 honey ham 5.0\n",
- "8 nova lox 6.0"
- ]
- },
- "execution_count": 22,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',\n",
- " 'Pastrami', 'corned beef', 'Bacon',\n",
- " 'pastrami', 'honey ham', 'nova lox'],\n",
- " 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})\n",
- "data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [],
- "source": [
- "meat_to_animal = {'bacon': 'pig',\n",
- " 'pulled pork': 'pig',\n",
- " 'pastrami': 'cow',\n",
- " 'corned beef': 'cow',\n",
- " 'honey ham': 'pig',\n",
- " 'nova lox': 'salmon'}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 bacon\n",
- "1 pulled pork\n",
- "2 bacon\n",
- "3 pastrami\n",
- "4 corned beef\n",
- "5 bacon\n",
- "6 pastrami\n",
- "7 honey ham\n",
- "8 nova lox\n",
- "Name: food, dtype: object"
- ]
- },
- "execution_count": 24,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "lowercased = data['food'].str.lower() # equivalent to data['food'].map(str.lower)\n",
- "lowercased"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>food</th>\n",
- " <th>ounces</th>\n",
- " <th>animal</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>bacon</td>\n",
- " <td>4.0</td>\n",
- " <td>pig</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>pulled pork</td>\n",
- " <td>3.0</td>\n",
- " <td>pig</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>bacon</td>\n",
- " <td>12.0</td>\n",
- " <td>pig</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>Pastrami</td>\n",
- " <td>6.0</td>\n",
- " <td>cow</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>corned beef</td>\n",
- " <td>7.5</td>\n",
- " <td>cow</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>Bacon</td>\n",
- " <td>8.0</td>\n",
- " <td>pig</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>6</th>\n",
- " <td>pastrami</td>\n",
- " <td>3.0</td>\n",
- " <td>cow</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>7</th>\n",
- " <td>honey ham</td>\n",
- " <td>5.0</td>\n",
- " <td>pig</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>8</th>\n",
- " <td>nova lox</td>\n",
- " <td>6.0</td>\n",
- " <td>salmon</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " food ounces animal\n",
- "0 bacon 4.0 pig\n",
- "1 pulled pork 3.0 pig\n",
- "2 bacon 12.0 pig\n",
- "3 Pastrami 6.0 cow\n",
- "4 corned beef 7.5 cow\n",
- "5 Bacon 8.0 pig\n",
- "6 pastrami 3.0 cow\n",
- "7 honey ham 5.0 pig\n",
- "8 nova lox 6.0 salmon"
- ]
- },
- "execution_count": 25,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data['animal'] = lowercased.map(meat_to_animal)\n",
- "data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 pig\n",
- "1 pig\n",
- "2 pig\n",
- "3 cow\n",
- "4 cow\n",
- "5 pig\n",
- "6 cow\n",
- "7 pig\n",
- "8 salmon\n",
- "Name: food, dtype: object"
- ]
- },
- "execution_count": 26,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data['food'].map(lambda x: meat_to_animal[x.lower()])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Replacing Values\n",
- "\n",
- "```py\n",
- "Series.replace(\n",
- " to_replace=None,\n",
- " value=None,\n",
- " inplace=False,\n",
- " limit=None,\n",
- " regex=False,\n",
- " method='pad'\n",
- ")\n",
- "```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 1.0\n",
- "1 -999.0\n",
- "2 2.0\n",
- "3 -999.0\n",
- "4 -1000.0\n",
- "5 3.0\n",
- "dtype: float64"
- ]
- },
- "execution_count": 27,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data = pd.Series([1., -999, 2, -999, -1000, 3.])\n",
- "data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 1.0\n",
- "1 NaN\n",
- "2 2.0\n",
- "3 NaN\n",
- "4 -1000.0\n",
- "5 3.0\n",
- "dtype: float64"
- ]
- },
- "execution_count": 28,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.replace(-999, np.nan)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 1.0\n",
- "1 NaN\n",
- "2 2.0\n",
- "3 NaN\n",
- "4 NaN\n",
- "5 3.0\n",
- "dtype: float64"
- ]
- },
- "execution_count": 29,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.replace([-999, -1000], np.nan)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 1.0\n",
- "1 NaN\n",
- "2 2.0\n",
- "3 NaN\n",
- "4 0.0\n",
- "5 3.0\n",
- "dtype: float64"
- ]
- },
- "execution_count": 30,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.replace({-999: np.nan, -1000: 0}) # or data.replace([-999, -1000], [np.nan, 0])"
- ]
- },
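- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "`Series.replace` matches whole values, while `Series.str.replace` substitutes substrings inside each string element. A minimal sketch of the difference, assuming a made-up `foods` Series that is not part of this notebook's data:\n",
- "\n",
- "```py\n",
- "import pandas as pd\n",
- "\n",
- "# Hypothetical example data, for illustration only\n",
- "foods = pd.Series(['bacon', 'bacon bits', 'ham'])\n",
- "\n",
- "# Value-level replacement: only elements equal to 'bacon' change\n",
- "by_value = foods.replace('bacon', 'pork')\n",
- "\n",
- "# Substring replacement inside every string element\n",
- "by_substring = foods.str.replace('bacon', 'pork', regex=False)\n",
- "```\n",
- "\n",
- "Here `by_value` changes only the first element, while `by_substring` also rewrites `'bacon bits'`."
- ]
- },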
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Renaming Axis Indexes\n",
- "\n",
- "```py\n",
- "DataFrame.rename(\n",
- " mapper=None,\n",
- " index=None,\n",
- " columns=None,\n",
- " axis=None,\n",
- " copy=True,\n",
- " inplace=False,\n",
- " level=None\n",
- ")\n",
- "```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 31,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>one</th>\n",
- " <th>two</th>\n",
- " <th>three</th>\n",
- " <th>four</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>Ohio</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>2</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>Colorado</th>\n",
- " <td>4</td>\n",
- " <td>5</td>\n",
- " <td>6</td>\n",
- " <td>7</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>New York</th>\n",
- " <td>8</td>\n",
- " <td>9</td>\n",
- " <td>10</td>\n",
- " <td>11</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " one two three four\n",
- "Ohio 0 1 2 3\n",
- "Colorado 4 5 6 7\n",
- "New York 8 9 10 11"
- ]
- },
- "execution_count": 31,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data = pd.DataFrame(np.arange(12).reshape((3, 4)),\n",
- " index=['Ohio', 'Colorado', 'New York'],\n",
- " columns=['one', 'two', 'three', 'four'])\n",
- "data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Index(['OHIO', 'COLO', 'NEW '], dtype='object')"
- ]
- },
- "execution_count": 32,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "transform = lambda x: x[:4].upper()\n",
- "data.index.map(transform)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 33,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>one</th>\n",
- " <th>two</th>\n",
- " <th>three</th>\n",
- " <th>four</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>OHIO</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>2</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>COLO</th>\n",
- " <td>4</td>\n",
- " <td>5</td>\n",
- " <td>6</td>\n",
- " <td>7</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>NEW</th>\n",
- " <td>8</td>\n",
- " <td>9</td>\n",
- " <td>10</td>\n",
- " <td>11</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " one two three four\n",
- "OHIO 0 1 2 3\n",
- "COLO 4 5 6 7\n",
- "NEW 8 9 10 11"
- ]
- },
- "execution_count": 33,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.index = data.index.map(transform)\n",
- "data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 34,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>ONE</th>\n",
- " <th>TWO</th>\n",
- " <th>THREE</th>\n",
- " <th>FOUR</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>Ohio</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>2</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>Colo</th>\n",
- " <td>4</td>\n",
- " <td>5</td>\n",
- " <td>6</td>\n",
- " <td>7</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>New</th>\n",
- " <td>8</td>\n",
- " <td>9</td>\n",
- " <td>10</td>\n",
- " <td>11</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " ONE TWO THREE FOUR\n",
- "Ohio 0 1 2 3\n",
- "Colo 4 5 6 7\n",
- "New 8 9 10 11"
- ]
- },
- "execution_count": 34,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Pass functions to transform both the row index and the column labels.\n",
- "data.rename(index=str.title, columns=str.upper)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>one</th>\n",
- " <th>two</th>\n",
- " <th>peekaboo</th>\n",
- " <th>four</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>INDIANA</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>2</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>COLO</th>\n",
- " <td>4</td>\n",
- " <td>5</td>\n",
- " <td>6</td>\n",
- " <td>7</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>NEW</th>\n",
- " <td>8</td>\n",
- " <td>9</td>\n",
- " <td>10</td>\n",
- " <td>11</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " one two peekaboo four\n",
- "INDIANA 0 1 2 3\n",
- "COLO 4 5 6 7\n",
- "NEW 8 9 10 11"
- ]
- },
- "execution_count": 35,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Pass dicts to rename only specific labels.\n",
- "data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})"
- ]
- },
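- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "With the default `inplace=False`, `rename` returns a new object and leaves the original untouched. A small sketch of that behavior, using a stand-in `frame` that mirrors the DataFrame above (the names `frame` and `renamed` are illustrative):\n",
- "\n",
- "```py\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "\n",
- "frame = pd.DataFrame(np.arange(12).reshape((3, 4)),\n",
- "                     index=['OHIO', 'COLO', 'NEW'],\n",
- "                     columns=['one', 'two', 'three', 'four'])\n",
- "\n",
- "# rename builds a new DataFrame; frame keeps its original labels\n",
- "renamed = frame.rename(index={'OHIO': 'INDIANA'})\n",
- "```"
- ]
- },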
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Discretization and Binning\n",
- "\n",
- "```py\n",
- "pd.cut(\n",
- " x,\n",
- " bins,\n",
- " right=True,\n",
- " labels=None,\n",
- " retbins=False,\n",
- " precision=3,\n",
- " include_lowest=False,\n",
- " duplicates='raise'\n",
- ")\n",
- "```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]\n",
- "Length: 12\n",
- "Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]"
- ]
- },
- "execution_count": 36,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]\n",
- "bins = [18, 25, 35, 60, 100]\n",
- "cats = pd.cut(ages, bins)\n",
- "cats"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 37,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)"
- ]
- },
- "execution_count": 37,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "cats.codes"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 38,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],\n",
- " closed='right',\n",
- " dtype='interval[int64]')"
- ]
- },
- "execution_count": 38,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "cats.categories"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 39,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(18, 25] 5\n",
- "(35, 60] 3\n",
- "(25, 35] 3\n",
- "(60, 100] 1\n",
- "dtype: int64"
- ]
- },
- "execution_count": 39,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.value_counts(cats)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 40,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]\n",
- "Length: 12\n",
- "Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]"
- ]
- },
- "execution_count": 40,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']\n",
- "pd.cut(ages, bins, labels=group_names)"
- ]
- },
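- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The `right` parameter in the signature above controls which side of each interval is closed. A short sketch, reusing the same `ages` and `bins` (the names `default_cut` and `left_closed` are illustrative):\n",
- "\n",
- "```py\n",
- "import pandas as pd\n",
- "\n",
- "ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]\n",
- "bins = [18, 25, 35, 60, 100]\n",
- "\n",
- "# Default right=True: intervals look like (18, 25], so 25 falls in the first bin\n",
- "default_cut = pd.cut(ages, bins)\n",
- "\n",
- "# right=False: intervals look like [18, 25), so 25 falls in the second bin\n",
- "left_closed = pd.cut(ages, bins, right=False)\n",
- "```"
- ]
- },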
- {
- "cell_type": "code",
- "execution_count": 41,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[(0.73, 0.96], (0.73, 0.96], (0.73, 0.96], (0.049, 0.28], (0.73, 0.96], ..., (0.73, 0.96], (0.73, 0.96], (0.73, 0.96], (0.28, 0.51], (0.049, 0.28]]\n",
- "Length: 20\n",
- "Categories (4, interval[float64]): [(0.049, 0.28] < (0.28, 0.51] < (0.51, 0.73] < (0.73, 0.96]]"
- ]
- },
- "execution_count": 41,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data = np.random.rand(20)\n",
- "pd.cut(data, 4, precision=2)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "`qcut` bins the data by sample quantiles, so each bin holds roughly the same number of points:\n",
- "\n",
- "```py\n",
- "pd.qcut(\n",
- " x,\n",
- " q,\n",
- " labels=None,\n",
- " retbins=False,\n",
- " precision=3,\n",
- " duplicates='raise'\n",
- ")\n",
- "```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[(0.654, 2.764], (-3.239, -0.694], (-3.239, -0.694], (0.00261, 0.654], (-0.694, 0.00261], ..., (-3.239, -0.694], (-0.694, 0.00261], (0.00261, 0.654], (-0.694, 0.00261], (0.654, 2.764]]\n",
- "Length: 1000\n",
- "Categories (4, interval[float64]): [(-3.239, -0.694] < (-0.694, 0.00261] < (0.00261, 0.654] < (0.654, 2.764]]"
- ]
- },
- "execution_count": 42,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data = np.random.randn(1000)\n",
- "cats = pd.qcut(data, 4) # Cut into quartiles\n",
- "cats"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 43,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(0.654, 2.764] 250\n",
- "(0.00261, 0.654] 250\n",
- "(-0.694, 0.00261] 250\n",
- "(-3.239, -0.694] 250\n",
- "dtype: int64"
- ]
- },
- "execution_count": 43,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.value_counts(cats)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 44,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[(0.00261, 1.298], (-3.239, -1.241], (-1.241, 0.00261], (0.00261, 1.298], (-1.241, 0.00261], ..., (-1.241, 0.00261], (-1.241, 0.00261], (0.00261, 1.298], (-1.241, 0.00261], (0.00261, 1.298]]\n",
- "Length: 1000\n",
- "Categories (4, interval[float64]): [(-3.239, -1.241] < (-1.241, 0.00261] < (0.00261, 1.298] < (1.298, 2.764]]"
- ]
- },
- "execution_count": 44,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 45,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(0.00261, 1.298] 400\n",
- "(-1.241, 0.00261] 400\n",
- "(1.298, 2.764] 100\n",
- "(-3.239, -1.241] 100\n",
- "dtype: int64"
- ]
- },
- "execution_count": 45,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.value_counts(pd.qcut(data, [0, 0.1, 0.5, 0.9, 1]))"
- ]
- },
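- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Like `cut`, `qcut` accepts `labels` to name the resulting bins. A minimal sketch, assuming freshly seeded random data rather than the `data` array above (the label names are arbitrary):\n",
- "\n",
- "```py\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "\n",
- "# Seeded stand-in data, not the array used in the cells above\n",
- "values = np.random.RandomState(0).randn(1000)\n",
- "\n",
- "# Name the four quartile bins instead of showing interval bounds\n",
- "quartiles = pd.qcut(values, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])\n",
- "counts = pd.Series(quartiles).value_counts()\n",
- "```"
- ]
- },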
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Detecting and Filtering Outliers"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 46,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>count</th>\n",
- " <td>1000.000000</td>\n",
- " <td>1000.000000</td>\n",
- " <td>1000.000000</td>\n",
- " <td>1000.000000</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>mean</th>\n",
- " <td>-0.026046</td>\n",
- " <td>0.008732</td>\n",
- " <td>-0.018339</td>\n",
- " <td>-0.009127</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>std</th>\n",
- " <td>1.028878</td>\n",
- " <td>0.992479</td>\n",
- " <td>0.977812</td>\n",
- " <td>0.974988</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>min</th>\n",
- " <td>-3.288240</td>\n",
- " <td>-3.827553</td>\n",
- " <td>-3.431789</td>\n",
- " <td>-2.624219</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>25%</th>\n",
- " <td>-0.743113</td>\n",
- " <td>-0.669691</td>\n",
- " <td>-0.697851</td>\n",
- " <td>-0.672852</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>50%</th>\n",
- " <td>-0.040368</td>\n",
- " <td>0.028459</td>\n",
- " <td>-0.022622</td>\n",
- " <td>0.020490</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>75%</th>\n",
- " <td>0.694826</td>\n",
- " <td>0.694573</td>\n",
- " <td>0.669563</td>\n",
- " <td>0.643025</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>max</th>\n",
- " <td>3.497946</td>\n",
- " <td>3.062780</td>\n",
- " <td>2.808596</td>\n",
- " <td>3.446911</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "count 1000.000000 1000.000000 1000.000000 1000.000000\n",
- "mean -0.026046 0.008732 -0.018339 -0.009127\n",
- "std 1.028878 0.992479 0.977812 0.974988\n",
- "min -3.288240 -3.827553 -3.431789 -2.624219\n",
- "25% -0.743113 -0.669691 -0.697851 -0.672852\n",
- "50% -0.040368 0.028459 -0.022622 0.020490\n",
- "75% 0.694826 0.694573 0.669563 0.643025\n",
- "max 3.497946 3.062780 2.808596 3.446911"
- ]
- },
- "execution_count": 46,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data = pd.DataFrame(np.random.randn(1000, 4))\n",
- "data.describe()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 47,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "424 -3.431789\n",
- "Name: 2, dtype: float64"
- ]
- },
- "execution_count": 47,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "col = data[2]\n",
- "col[np.abs(col) > 3]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 48,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>197</th>\n",
- " <td>0.849447</td>\n",
- " <td>-3.085049</td>\n",
- " <td>-0.550219</td>\n",
- " <td>-0.120688</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>235</th>\n",
- " <td>-1.140439</td>\n",
- " <td>3.062780</td>\n",
- " <td>-0.292776</td>\n",
- " <td>-1.541634</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>287</th>\n",
- " <td>-3.288240</td>\n",
- " <td>0.637092</td>\n",
- " <td>0.750347</td>\n",
- " <td>0.852326</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>310</th>\n",
- " <td>1.198193</td>\n",
- " <td>0.001196</td>\n",
- " <td>1.068577</td>\n",
- " <td>3.446911</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>376</th>\n",
- " <td>2.660516</td>\n",
- " <td>-0.788964</td>\n",
- " <td>0.578800</td>\n",
- " <td>3.279157</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>395</th>\n",
- " <td>-1.226703</td>\n",
- " <td>1.154297</td>\n",
- " <td>-0.712612</td>\n",
- " <td>3.047792</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>401</th>\n",
- " <td>3.497946</td>\n",
- " <td>-0.929906</td>\n",
- " <td>0.213705</td>\n",
- " <td>-0.062713</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>424</th>\n",
- " <td>1.320339</td>\n",
- " <td>-0.201304</td>\n",
- " <td>-3.431789</td>\n",
- " <td>-0.039907</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>570</th>\n",
- " <td>3.278339</td>\n",
- " <td>0.979884</td>\n",
- " <td>-0.542488</td>\n",
- " <td>0.147562</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>697</th>\n",
- " <td>0.512636</td>\n",
- " <td>-3.107288</td>\n",
- " <td>0.475335</td>\n",
- " <td>1.160560</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>770</th>\n",
- " <td>0.701224</td>\n",
- " <td>-3.137252</td>\n",
- " <td>0.442069</td>\n",
- " <td>0.241852</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>776</th>\n",
- " <td>-1.825486</td>\n",
- " <td>-3.827553</td>\n",
- " <td>1.281648</td>\n",
- " <td>-0.328060</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>813</th>\n",
- " <td>0.408299</td>\n",
- " <td>-3.120840</td>\n",
- " <td>-0.708262</td>\n",
- " <td>-0.382290</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "197 0.849447 -3.085049 -0.550219 -0.120688\n",
- "235 -1.140439 3.062780 -0.292776 -1.541634\n",
- "287 -3.288240 0.637092 0.750347 0.852326\n",
- "310 1.198193 0.001196 1.068577 3.446911\n",
- "376 2.660516 -0.788964 0.578800 3.279157\n",
- "395 -1.226703 1.154297 -0.712612 3.047792\n",
- "401 3.497946 -0.929906 0.213705 -0.062713\n",
- "424 1.320339 -0.201304 -3.431789 -0.039907\n",
- "570 3.278339 0.979884 -0.542488 0.147562\n",
- "697 0.512636 -3.107288 0.475335 1.160560\n",
- "770 0.701224 -3.137252 0.442069 0.241852\n",
- "776 -1.825486 -3.827553 1.281648 -0.328060\n",
- "813 0.408299 -3.120840 -0.708262 -0.382290"
- ]
- },
- "execution_count": 48,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data[(np.abs(data) > 3).any(axis=1)]  # rows with at least one value exceeding 3 in absolute value"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 49,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>count</th>\n",
- " <td>1000.000000</td>\n",
- " <td>1000.000000</td>\n",
- " <td>1000.000000</td>\n",
- " <td>1000.000000</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>mean</th>\n",
- " <td>-0.026534</td>\n",
- " <td>0.009948</td>\n",
- " <td>-0.017908</td>\n",
- " <td>-0.009901</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>std</th>\n",
- " <td>1.025554</td>\n",
- " <td>0.988028</td>\n",
- " <td>0.976398</td>\n",
- " <td>0.972450</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>min</th>\n",
- " <td>-3.000000</td>\n",
- " <td>-3.000000</td>\n",
- " <td>-3.000000</td>\n",
- " <td>-2.624219</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>25%</th>\n",
- " <td>-0.743113</td>\n",
- " <td>-0.669691</td>\n",
- " <td>-0.697851</td>\n",
- " <td>-0.672852</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>50%</th>\n",
- " <td>-0.040368</td>\n",
- " <td>0.028459</td>\n",
- " <td>-0.022622</td>\n",
- " <td>0.020490</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>75%</th>\n",
- " <td>0.694826</td>\n",
- " <td>0.694573</td>\n",
- " <td>0.669563</td>\n",
- " <td>0.643025</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>max</th>\n",
- " <td>3.000000</td>\n",
- " <td>3.000000</td>\n",
- " <td>2.808596</td>\n",
- " <td>3.000000</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "count 1000.000000 1000.000000 1000.000000 1000.000000\n",
- "mean -0.026534 0.009948 -0.017908 -0.009901\n",
- "std 1.025554 0.988028 0.976398 0.972450\n",
- "min -3.000000 -3.000000 -3.000000 -2.624219\n",
- "25% -0.743113 -0.669691 -0.697851 -0.672852\n",
- "50% -0.040368 0.028459 -0.022622 0.020490\n",
- "75% 0.694826 0.694573 0.669563 0.643025\n",
- "max 3.000000 3.000000 2.808596 3.000000"
- ]
- },
- "execution_count": 49,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data[np.abs(data) > 3] = np.sign(data) * 3\n",
- "data.describe()"
- ]
- },
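- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As an alternative to the sign trick above, `DataFrame.clip` caps values to a range in one call. A sketch on freshly seeded data (so the numbers differ from the cells above; `frame`, `capped` and `clipped` are illustrative names):\n",
- "\n",
- "```py\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "\n",
- "# Seeded stand-in data, not the DataFrame used above\n",
- "frame = pd.DataFrame(np.random.RandomState(0).randn(1000, 4))\n",
- "\n",
- "# Cap with the sign trick, working on a copy\n",
- "capped = frame.copy()\n",
- "capped[np.abs(capped) > 3] = np.sign(capped) * 3\n",
- "\n",
- "# clip() limits every value to the interval [-3, 3]\n",
- "clipped = frame.clip(lower=-3, upper=3)\n",
- "```\n",
- "\n",
- "Both approaches should give the same result; `clip` simply avoids the boolean mask."
- ]
- },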
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Permutation and Random Sampling"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 50,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "array([0, 1, 3, 4, 2])"
- ]
- },
- "execution_count": 50,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))\n",
- "sampler = np.random.permutation(5)\n",
- "sampler"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 51,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>2</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>4</td>\n",
- " <td>5</td>\n",
- " <td>6</td>\n",
- " <td>7</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>8</td>\n",
- " <td>9</td>\n",
- " <td>10</td>\n",
- " <td>11</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>12</td>\n",
- " <td>13</td>\n",
- " <td>14</td>\n",
- " <td>15</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>16</td>\n",
- " <td>17</td>\n",
- " <td>18</td>\n",
- " <td>19</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "0 0 1 2 3\n",
- "1 4 5 6 7\n",
- "2 8 9 10 11\n",
- "3 12 13 14 15\n",
- "4 16 17 18 19"
- ]
- },
- "execution_count": 51,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 52,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>2</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>4</td>\n",
- " <td>5</td>\n",
- " <td>6</td>\n",
- " <td>7</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>12</td>\n",
- " <td>13</td>\n",
- " <td>14</td>\n",
- " <td>15</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>16</td>\n",
- " <td>17</td>\n",
- " <td>18</td>\n",
- " <td>19</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>8</td>\n",
- " <td>9</td>\n",
- " <td>10</td>\n",
- " <td>11</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "0 0 1 2 3\n",
- "1 4 5 6 7\n",
- "3 12 13 14 15\n",
- "4 16 17 18 19\n",
- "2 8 9 10 11"
- ]
- },
- "execution_count": 52,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df.take(sampler)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 53,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>2</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>4</td>\n",
- " <td>5</td>\n",
- " <td>6</td>\n",
- " <td>7</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>12</td>\n",
- " <td>13</td>\n",
- " <td>14</td>\n",
- " <td>15</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>16</td>\n",
- " <td>17</td>\n",
- " <td>18</td>\n",
- " <td>19</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>8</td>\n",
- " <td>9</td>\n",
- " <td>10</td>\n",
- " <td>11</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "0 0 1 2 3\n",
- "1 4 5 6 7\n",
- "3 12 13 14 15\n",
- "4 16 17 18 19\n",
- "2 8 9 10 11"
- ]
- },
- "execution_count": 53,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df.iloc[sampler]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "```py\n",
- "sample(\n",
- " n=None,\n",
- " frac=None,\n",
- " replace=False,\n",
- " weights=None,\n",
- " random_state=None,\n",
- " axis=None\n",
- ")\n",
- "```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 54,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>0</th>\n",
- " <th>1</th>\n",
- " <th>2</th>\n",
- " <th>3</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>8</td>\n",
- " <td>9</td>\n",
- " <td>10</td>\n",
- " <td>11</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>4</td>\n",
- " <td>5</td>\n",
- " <td>6</td>\n",
- " <td>7</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>16</td>\n",
- " <td>17</td>\n",
- " <td>18</td>\n",
- " <td>19</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 0 1 2 3\n",
- "2 8 9 10 11\n",
- "1 4 5 6 7\n",
- "4 16 17 18 19"
- ]
- },
- "execution_count": 54,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df.sample(n=3) # select a random subset without replacement"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 55,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>3</th>\n",
- " <th>0</th>\n",
- " <th>2</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>3</td>\n",
- " <td>0</td>\n",
- " <td>2</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>7</td>\n",
- " <td>4</td>\n",
- " <td>6</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>11</td>\n",
- " <td>8</td>\n",
- " <td>10</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>15</td>\n",
- " <td>12</td>\n",
- " <td>14</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>19</td>\n",
- " <td>16</td>\n",
- " <td>18</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " 3 0 2\n",
- "0 3 0 2\n",
- "1 7 4 6\n",
- "2 11 8 10\n",
- "3 15 12 14\n",
- "4 19 16 18"
- ]
- },
- "execution_count": 55,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df.sample(axis=1, n=3)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 56,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "4 4\n",
- "2 -1\n",
- "4 4\n",
- "1 7\n",
- "4 4\n",
- "2 -1\n",
- "3 6\n",
- "1 7\n",
- "1 7\n",
- "2 -1\n",
- "dtype: int64"
- ]
- },
- "execution_count": 56,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "choices = pd.Series([5, 7, -1, 6, 4])\n",
- "draws = choices.sample(n=10, replace=True) # allow repeat choices\n",
- "draws"
- ]
- },
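- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The `random_state` parameter in the signature above makes a sample reproducible. A minimal sketch, assuming an arbitrary seed of 42 and a stand-in `frame` shaped like the DataFrame above:\n",
- "\n",
- "```py\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "\n",
- "frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))\n",
- "\n",
- "# The same integer seed selects the same rows on every call\n",
- "first = frame.sample(n=3, random_state=42)\n",
- "second = frame.sample(n=3, random_state=42)\n",
- "```"
- ]
- },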
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Computing Indicator/Dummy Variables\n",
- "\n",
- "Use `get_dummies()` to get the one-hot representation of a **categorical** variable:\n",
- "\n",
- "```py\n",
- "pd.get_dummies(\n",
- " data,\n",
- " prefix=None,\n",
- " prefix_sep='_',\n",
- " dummy_na=False,\n",
- " columns=None,\n",
- " sparse=False,\n",
- " drop_first=False,\n",
- " dtype=None\n",
- ")\n",
- "```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 57,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>key</th>\n",
- " <th>data1</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>b</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>b</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>a</td>\n",
- " <td>2</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>c</td>\n",
- " <td>3</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>a</td>\n",
- " <td>4</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>b</td>\n",
- " <td>5</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " key data1\n",
- "0 b 0\n",
- "1 b 1\n",
- "2 a 2\n",
- "3 c 3\n",
- "4 a 4\n",
- "5 b 5"
- ]
- },
- "execution_count": 57,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],\n",
- " 'data1': range(6)})\n",
- "df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 58,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>a</th>\n",
- " <th>b</th>\n",
- " <th>c</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " a b c\n",
- "0 0 1 0\n",
- "1 0 1 0\n",
- "2 1 0 0\n",
- "3 0 0 1\n",
- "4 1 0 0\n",
- "5 0 1 0"
- ]
- },
- "execution_count": 58,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.get_dummies(df['key'])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 59,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>data1</th>\n",
- " <th>key_a</th>\n",
- " <th>key_b</th>\n",
- " <th>key_c</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>2</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>3</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>4</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>5</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " data1 key_a key_b key_c\n",
- "0 0 0 1 0\n",
- "1 1 0 1 0\n",
- "2 2 1 0 0\n",
- "3 3 0 0 1\n",
- "4 4 1 0 0\n",
- "5 5 0 1 0"
- ]
- },
- "execution_count": 59,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.get_dummies(df) # data1 is not encoded"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 60,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>data1_0</th>\n",
- " <th>data1_1</th>\n",
- " <th>data1_2</th>\n",
- " <th>data1_3</th>\n",
- " <th>data1_4</th>\n",
- " <th>data1_5</th>\n",
- " <th>key_a</th>\n",
- " <th>key_b</th>\n",
- " <th>key_c</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " data1_0 data1_1 data1_2 data1_3 data1_4 data1_5 key_a key_b key_c\n",
- "0 1 0 0 0 0 0 0 1 0\n",
- "1 0 1 0 0 0 0 0 1 0\n",
- "2 0 0 1 0 0 0 1 0 0\n",
- "3 0 0 0 1 0 0 0 0 1\n",
- "4 0 0 0 0 1 0 1 0 0\n",
- "5 0 0 0 0 0 1 0 1 0"
- ]
- },
- "execution_count": 60,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Specify the columns to be encoded manually.\n",
- "pd.get_dummies(df, columns=['data1', 'key'])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 61,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>prefix_a</th>\n",
- " <th>prefix_b</th>\n",
- " <th>prefix_c</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " prefix_a prefix_b prefix_c\n",
- "0 0 1 0\n",
- "1 0 1 0\n",
- "2 1 0 0\n",
- "3 0 0 1\n",
- "4 1 0 0\n",
- "5 0 1 0"
- ]
- },
- "execution_count": 61,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "dummies = pd.get_dummies(df['key'], prefix='prefix')\n",
- "dummies"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 62,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>data1</th>\n",
- " <th>prefix_a</th>\n",
- " <th>prefix_b</th>\n",
- " <th>prefix_c</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>2</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>3</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>4</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>5</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " data1 prefix_a prefix_b prefix_c\n",
- "0 0 0 1 0\n",
- "1 1 0 1 0\n",
- "2 2 1 0 0\n",
- "3 3 0 0 1\n",
- "4 4 1 0 0\n",
- "5 5 0 1 0"
- ]
- },
- "execution_count": 62,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df_with_dummy = df[['data1']].join(dummies)\n",
- "df_with_dummy"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 63,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>movie_id</th>\n",
- " <th>title</th>\n",
- " <th>genres</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1</td>\n",
- " <td>Toy Story (1995)</td>\n",
- " <td>Animation|Children's|Comedy</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>2</td>\n",
- " <td>Jumanji (1995)</td>\n",
- " <td>Adventure|Children's|Fantasy</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>3</td>\n",
- " <td>Grumpier Old Men (1995)</td>\n",
- " <td>Comedy|Romance</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>4</td>\n",
- " <td>Waiting to Exhale (1995)</td>\n",
- " <td>Comedy|Drama</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>5</td>\n",
- " <td>Father of the Bride Part II (1995)</td>\n",
- " <td>Comedy</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>6</td>\n",
- " <td>Heat (1995)</td>\n",
- " <td>Action|Crime|Thriller</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>6</th>\n",
- " <td>7</td>\n",
- " <td>Sabrina (1995)</td>\n",
- " <td>Comedy|Romance</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>7</th>\n",
- " <td>8</td>\n",
- " <td>Tom and Huck (1995)</td>\n",
- " <td>Adventure|Children's</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>8</th>\n",
- " <td>9</td>\n",
- " <td>Sudden Death (1995)</td>\n",
- " <td>Action</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>9</th>\n",
- " <td>10</td>\n",
- " <td>GoldenEye (1995)</td>\n",
- " <td>Action|Adventure|Thriller</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " movie_id title genres\n",
- "0 1 Toy Story (1995) Animation|Children's|Comedy\n",
- "1 2 Jumanji (1995) Adventure|Children's|Fantasy\n",
- "2 3 Grumpier Old Men (1995) Comedy|Romance\n",
- "3 4 Waiting to Exhale (1995) Comedy|Drama\n",
- "4 5 Father of the Bride Part II (1995) Comedy\n",
- "5 6 Heat (1995) Action|Crime|Thriller\n",
- "6 7 Sabrina (1995) Comedy|Romance\n",
- "7 8 Tom and Huck (1995) Adventure|Children's\n",
- "8 9 Sudden Death (1995) Action\n",
- "9 10 GoldenEye (1995) Action|Adventure|Thriller"
- ]
- },
- "execution_count": 63,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "mnames = ['movie_id', 'title', 'genres']\n",
- "movies = pd.read_csv('../datasets/movielens/movies.dat', sep='::',\n",
- " header=None, names=mnames, engine='python')\n",
- "movies[:10]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 64,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "array(['Animation', \"Children's\", 'Comedy', 'Adventure', 'Fantasy',\n",
- " 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',\n",
- " 'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',\n",
- " 'Western'], dtype=object)"
- ]
- },
- "execution_count": 64,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# extract the list of (unique) genres\n",
- "all_genres = []\n",
- "for x in movies['genres']:\n",
- " all_genres.extend(x.split('|'))\n",
- "genres = pd.unique(all_genres)\n",
- "genres"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 65,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "movie_id 1\n",
- "title Toy Story (1995)\n",
- "genres Animation|Children's|Comedy\n",
- "Genre_Animation 1\n",
- "Genre_Children's 1\n",
- "Genre_Comedy 1\n",
- "Genre_Adventure 0\n",
- "Genre_Fantasy 0\n",
- "Genre_Romance 0\n",
- "Genre_Drama 0\n",
- "Genre_Action 0\n",
- "Genre_Crime 0\n",
- "Genre_Thriller 0\n",
- "Genre_Horror 0\n",
- "Genre_Sci-Fi 0\n",
- "Genre_Documentary 0\n",
- "Genre_War 0\n",
- "Genre_Musical 0\n",
- "Genre_Mystery 0\n",
- "Genre_Film-Noir 0\n",
- "Genre_Western 0\n",
- "Name: 0, dtype: object"
- ]
- },
- "execution_count": 65,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# construct an all-zero matrix of dummies (one row per movie, one column per genre)\n",
- "zero_matrix = np.zeros((len(movies), len(genres)))\n",
- "dummies = pd.DataFrame(zero_matrix, columns=genres)\n",
- "\n",
- "# set dummies for each movie\n",
- "for i, gen in enumerate(movies['genres']):\n",
- " indices = dummies.columns.get_indexer(gen.split('|'))\n",
- " dummies.iloc[i, indices] = 1\n",
- " \n",
- "# combine the dummies with movies\n",
- "movies_windic = movies.join(dummies.add_prefix('Genre_'))\n",
- "movies_windic.iloc[0]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 66,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,\n",
- " 0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])"
- ]
- },
- "execution_count": 66,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "np.random.seed(12345)\n",
- "values = np.random.rand(10)\n",
- "values"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 67,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " (0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]\n",
- "0 0 0 0 0 1\n",
- "1 0 1 0 0 0\n",
- "2 1 0 0 0 0\n",
- "3 0 1 0 0 0\n",
- "4 0 0 1 0 0\n",
- "5 0 0 1 0 0\n",
- "6 0 0 0 0 1\n",
- "7 0 0 0 1 0\n",
- "8 0 0 0 1 0\n",
- "9 0 0 0 1 0"
- ]
- },
- "execution_count": 67,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "bins = [0, 0.2, 0.4, 0.6, 0.8, 1]\n",
- "pd.get_dummies(pd.cut(values, bins))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## String Manipulation\n",
- "\n",
- "* Python built-in string methods\n",
- "\n",
- "Method | Description\n",
- ":--- | :---\n",
- "`count` | Return the number of *non-overlapping* occurrences of substring in the string.\n",
- "`endswith` | Return `True` if string ends with suffix.\n",
- "`startswith` | Return `True` if string starts with prefix.\n",
- "`join` | Use string as delimiter for concatenating a sequence of other strings.\n",
- "`index` | Return position of first character in substring if found in the string; raise `ValueError` if not found.\n",
- "`find` | Return position of first character of *first* occurrence of substring in the string; like `index`, but returns -1 if not found.\n",
- "`rfind` | Return position of first character of *last* occurrence of substring in the string; return -1 if not found.\n",
- "`replace` | Replace occurrences of string with another string.\n",
- "`strip`, `rstrip`, `lstrip` | Trim whitespace, including newlines.\n",
- "`split` | Break string into list of substrings using passed delimiter.\n",
- "`lower` | Convert alphabet characters to lowercase.\n",
- "`upper` | Convert alphabet characters to uppercase.\n",
- "`casefold` | Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form.\n",
- "`ljust`, `rjust` | Left justify or right justify; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.\n",
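- "\n",
- "A minimal sketch of a few of these methods; the sample string `val` is made up for illustration:\n",
- "\n",
- "```py\n",
- "val = 'a,b,  guido'\n",
- "pieces = [x.strip() for x in val.split(',')]  # split on commas, then trim whitespace\n",
- "joined = '::'.join(pieces)                    # 'a::b::guido'\n",
- "\n",
- "val.index(',')        # 1; raises ValueError if the substring is absent\n",
- "val.find(':')         # -1, because ':' does not occur in val\n",
- "val.count(',')        # 2\n",
- "val.replace(',', '')  # 'ab  guido'\n",
- "```\n",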
- "\n",
- "### Regular Expressions\n",
- "\n",
- "* Regular expression methods\n",
- "\n",
- "Method | Description\n",
- ":--- | :---\n",
- "`findall` | Return all non-overlapping matching patterns in a string as a list.\n",
- "`finditer` | Like `findall`, but return an iterator.\n",
- "`match` | Match pattern at start of string and optionally segment pattern components into groups; return a match object if the pattern matches, otherwise return `None`.\n",
- "`search` | Scan string for match to pattern; return a match object if found; unlike `match`, the match can be anywhere in the string as opposed to only at the beginning.\n",
- "`split` | Break string into pieces at each occurrence of pattern.\n",
- "`sub`, `subn` | Replace occurrences of pattern in string with replacement expression (all of them by default, or at most `count`); `sub` returns the new string, while `subn` returns the new string together with the number of substitutions made; use symbols \\1, \\2, ... to refer to match group elements in the replacement string.\n",
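- "\n",
- "A short sketch of these functions using the standard `re` module; the sample `text` and the simplified email pattern are assumptions made for illustration:\n",
- "\n",
- "```py\n",
- "import re\n",
- "\n",
- "text = 'Dave dave@google.com, Steve steve@gmail.com'\n",
- "# simplified email pattern with three groups; [.] matches a literal dot\n",
- "regex = re.compile('([A-Z0-9._%+-]+)@([A-Z0-9.-]+)[.]([A-Z]{2,4})', flags=re.IGNORECASE)\n",
- "\n",
- "regex.findall(text)           # [('dave', 'google', 'com'), ('steve', 'gmail', 'com')]\n",
- "regex.search(text).groups()   # ('dave', 'google', 'com'), the first match anywhere in text\n",
- "regex.match(text)             # None, because text does not start with an email address\n",
- "regex.subn('REDACTED', text)  # ('Dave REDACTED, Steve REDACTED', 2)\n",
- "re.split(', ', text)          # ['Dave dave@google.com', 'Steve steve@gmail.com']\n",
- "```\n",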
- "\n",
- "### Vectorized String Functions in pandas\n",
- "\n",
- "**Series** (and Index) objects have array-oriented methods for string operations that skip NA values. These are accessed through the `str` attribute."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 68,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Dave dave@google.com\n",
- "Steve steve@gmail.com\n",
- "Rob rog@gmail.com\n",
- "Wes NaN\n",
- "dtype: object"
- ]
- },
- "execution_count": 68,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',\n",
- " 'Rob': 'rog@gmail.com', 'Wes': np.nan}\n",
- "data = pd.Series(data)\n",
- "data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 69,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Dave False\n",
- "Steve True\n",
- "Rob True\n",
- "Wes NaN\n",
- "dtype: object"
- ]
- },
- "execution_count": 69,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.str.contains('gmail')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 70,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Dave dave@\n",
- "Steve steve\n",
- "Rob rog@g\n",
- "Wes NaN\n",
- "dtype: object"
- ]
- },
- "execution_count": 70,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.str[:5]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 71,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Dave a\n",
- "Steve t\n",
- "Rob o\n",
- "Wes NaN\n",
- "dtype: object"
- ]
- },
- "execution_count": 71,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data.str.get(1) # or data.str[1]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "* Some vectorized string methods\n",
- "\n",
- "Method | Description\n",
- ":--- | :---\n",
- "`cat` | Concatenate strings element-wise with optional delimiter.\n",
- "`contains` | Return boolean array indicating whether each string contains pattern/regex.\n",
- "`count` | Count occurrences of pattern.\n",
- "`extract` | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group.\n",
- "`endswith` | Equivalent to `x.endswith(pattern)` for each element.\n",
- "`startswith` | ... `x.startswith(pattern)`...\n",
- "`findall` | Compute list of all occurrences of pattern/regex for each string.\n",
- "`get` | Index into each element (retrieve i-th element).\n",
- "`isalnum` | Equivalent to built-in `str.isalnum`.\n",
- "`isalpha` | ... `str.isalpha`.\n",
- "`isdecimal` | ... `str.isdecimal`.\n",
- "`isdigit` | ...\n",
- "`islower` | ...\n",
- "`isnumeric` | ...\n",
- "`isupper` | ...\n",
- "`join` | Join strings in each element of the Series with passed separator.\n",
- "`len` | Compute length of each string.\n",
- "`lower`, `upper` | Convert case.\n",
- "`match` | `re.match` on each element.\n",
- "`pad` | Add whitespace to left, right, or both sides of strings.\n",
- "`center` | Equivalent to `pad(side='both')`.\n",
- "`repeat` | Duplicate values (e.g., `s.str.repeat(3)` is equivalent to `x * 3` for each string).\n",
- "`replace` | Replace occurrences of pattern/regex with some other string.\n",
- "`slice` | Slice each string in the Series.\n",
- "`split` | Split strings on delimiter or regex.\n",
- "`strip`, `lstrip`, `rstrip` | Trim whitespace from both sides, the left side, or the right side, respectively."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.7"
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": false,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "267px",
- "left": "1065px",
- "right": "0px",
- "top": "33px",
- "width": "215px"
- },
- "toc_section_display": false,
- "toc_window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
- }