  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "metadata": {},
  6. "source": [
  7. "# Project 1\n",
  8. "\n",
  9. "In this first project you will create a framework to scope out data science projects. This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible."
  10. ]
  11. },
  12. {
  13. "cell_type": "markdown",
  14. "metadata": {},
  15. "source": [
  16. "## Exercise 1: Read and evaluate the following problem statement: \n",
  17. "Determine which free-tier customers will covert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer useage data (days since last log in, and activity score 1 = active user, 0= inactive user) based on Hooli data from Jan-Apr 2015. \n"
  18. ]
  19. },
  20. {
  21. "cell_type": "markdown",
  22. "metadata": {},
  23. "source": [
  24. "#### 1. What is the outcome?"
  25. ]
  26. },
  27. {
  28. "cell_type": "markdown",
  29. "metadata": {},
  30. "source": [
  31. "Answer: Will the customer convert to a paying customer? Use a convert indicator (1=Y/0=N)"
  32. ]
  33. },
  34. {
  35. "cell_type": "markdown",
  36. "metadata": {},
  37. "source": [
  38. "#### 2. What are the predictors/covariates? "
  39. ]
  40. },
  41. {
  42. "cell_type": "markdown",
  43. "metadata": {},
  44. "source": [
  45. "Answer: age, gender, lcoation, profession, days since last log-in, activity score (1=active user, 0=inactive user)"
  46. ]
  47. },
  48. {
  49. "cell_type": "markdown",
  50. "metadata": {},
  51. "source": [
  52. "#### 3. What timeframe is this data relevent for?"
  53. ]
  54. },
  55. {
  56. "cell_type": "markdown",
  57. "metadata": {},
  58. "source": [
  59. "Answer: January 2015 to April 2015"
  60. ]
  61. },
  62. {
  63. "cell_type": "markdown",
  64. "metadata": {},
  65. "source": [
  66. "#### 4. What is the hypothesis?"
  67. ]
  68. },
  69. {
  70. "cell_type": "markdown",
  71. "metadata": {},
  72. "source": [
  73. "Answer: Hooli Customer demographic data and customer usage data will allow us to predict if a free-tier customer will convert to a paying customer."
  74. ]
  75. },
  76. {
  77. "cell_type": "markdown",
  78. "metadata": {},
  79. "source": [
  80. "## Exercise 2: Let's get started with our dataset (use admissions.csv)"
  81. ]
  82. },
  83. {
  84. "cell_type": "code",
  85. "execution_count": 1,
  86. "metadata": {
  87. "collapsed": true
  88. },
  89. "outputs": [],
  90. "source": [
  91. "import pandas as pd\n",
  92. "import os"
  93. ]
  94. },
  95. {
  96. "cell_type": "code",
  97. "execution_count": 3,
  98. "metadata": {
  99. "collapsed": true
  100. },
  101. "outputs": [],
  102. "source": [
  103. "data = pd.read_csv('/Users/chadkenney/Documents/GitHub/DAT-DEN-03/projects/unit-projects/project-1/assets/admissions.csv')"
  104. ]
  105. },
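{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of a more portable alternative (assuming the notebook lives in the project-1 folder next to the assets directory), building the path relative to the notebook with the os module imported above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical relative path; adjust if the notebook lives elsewhere\n",
"path = os.path.join('assets', 'admissions.csv')\n",
"data = pd.read_csv(path)"
]
},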
  106. {
  107. "cell_type": "code",
  108. "execution_count": 4,
  109. "metadata": {},
  110. "outputs": [
  111. {
  112. "data": {
  113. "text/html": [
  114. "<div>\n",
  115. "<style>\n",
  116. " .dataframe thead tr:only-child th {\n",
  117. " text-align: right;\n",
  118. " }\n",
  119. "\n",
  120. " .dataframe thead th {\n",
  121. " text-align: left;\n",
  122. " }\n",
  123. "\n",
  124. " .dataframe tbody tr th {\n",
  125. " vertical-align: top;\n",
  126. " }\n",
  127. "</style>\n",
  128. "<table border=\"1\" class=\"dataframe\">\n",
  129. " <thead>\n",
  130. " <tr style=\"text-align: right;\">\n",
  131. " <th></th>\n",
  132. " <th>admit</th>\n",
  133. " <th>gre</th>\n",
  134. " <th>gpa</th>\n",
  135. " <th>prestige</th>\n",
  136. " </tr>\n",
  137. " </thead>\n",
  138. " <tbody>\n",
  139. " <tr>\n",
  140. " <th>0</th>\n",
  141. " <td>0</td>\n",
  142. " <td>380.0</td>\n",
  143. " <td>3.61</td>\n",
  144. " <td>3.0</td>\n",
  145. " </tr>\n",
  146. " <tr>\n",
  147. " <th>1</th>\n",
  148. " <td>1</td>\n",
  149. " <td>660.0</td>\n",
  150. " <td>3.67</td>\n",
  151. " <td>3.0</td>\n",
  152. " </tr>\n",
  153. " <tr>\n",
  154. " <th>2</th>\n",
  155. " <td>1</td>\n",
  156. " <td>800.0</td>\n",
  157. " <td>4.00</td>\n",
  158. " <td>1.0</td>\n",
  159. " </tr>\n",
  160. " <tr>\n",
  161. " <th>3</th>\n",
  162. " <td>1</td>\n",
  163. " <td>640.0</td>\n",
  164. " <td>3.19</td>\n",
  165. " <td>4.0</td>\n",
  166. " </tr>\n",
  167. " <tr>\n",
  168. " <th>4</th>\n",
  169. " <td>0</td>\n",
  170. " <td>520.0</td>\n",
  171. " <td>2.93</td>\n",
  172. " <td>4.0</td>\n",
  173. " </tr>\n",
  174. " </tbody>\n",
  175. "</table>\n",
  176. "</div>"
  177. ],
  178. "text/plain": [
  179. " admit gre gpa prestige\n",
  180. "0 0 380.0 3.61 3.0\n",
  181. "1 1 660.0 3.67 3.0\n",
  182. "2 1 800.0 4.00 1.0\n",
  183. "3 1 640.0 3.19 4.0\n",
  184. "4 0 520.0 2.93 4.0"
  185. ]
  186. },
  187. "execution_count": 4,
  188. "metadata": {},
  189. "output_type": "execute_result"
  190. }
  191. ],
  192. "source": [
  193. "data.head()"
  194. ]
  195. },
  196. {
  197. "cell_type": "code",
  198. "execution_count": 11,
  199. "metadata": {},
  200. "outputs": [
  201. {
  202. "data": {
  203. "text/html": [
  204. "<div>\n",
  205. "<style>\n",
  206. " .dataframe thead tr:only-child th {\n",
  207. " text-align: right;\n",
  208. " }\n",
  209. "\n",
  210. " .dataframe thead th {\n",
  211. " text-align: left;\n",
  212. " }\n",
  213. "\n",
  214. " .dataframe tbody tr th {\n",
  215. " vertical-align: top;\n",
  216. " }\n",
  217. "</style>\n",
  218. "<table border=\"1\" class=\"dataframe\">\n",
  219. " <thead>\n",
  220. " <tr style=\"text-align: right;\">\n",
  221. " <th></th>\n",
  222. " <th>admit</th>\n",
  223. " <th>gre</th>\n",
  224. " <th>gpa</th>\n",
  225. " <th>prestige</th>\n",
  226. " </tr>\n",
  227. " </thead>\n",
  228. " <tbody>\n",
  229. " <tr>\n",
  230. " <th>count</th>\n",
  231. " <td>400.0</td>\n",
  232. " <td>398.0</td>\n",
  233. " <td>398.0</td>\n",
  234. " <td>399.0</td>\n",
  235. " </tr>\n",
  236. " <tr>\n",
  237. " <th>mean</th>\n",
  238. " <td>0.0</td>\n",
  239. " <td>588.0</td>\n",
  240. " <td>3.0</td>\n",
  241. " <td>2.0</td>\n",
  242. " </tr>\n",
  243. " <tr>\n",
  244. " <th>std</th>\n",
  245. " <td>0.0</td>\n",
  246. " <td>116.0</td>\n",
  247. " <td>0.0</td>\n",
  248. " <td>1.0</td>\n",
  249. " </tr>\n",
  250. " <tr>\n",
  251. " <th>min</th>\n",
  252. " <td>0.0</td>\n",
  253. " <td>220.0</td>\n",
  254. " <td>2.0</td>\n",
  255. " <td>1.0</td>\n",
  256. " </tr>\n",
  257. " <tr>\n",
  258. " <th>25%</th>\n",
  259. " <td>0.0</td>\n",
  260. " <td>520.0</td>\n",
  261. " <td>3.0</td>\n",
  262. " <td>2.0</td>\n",
  263. " </tr>\n",
  264. " <tr>\n",
  265. " <th>50%</th>\n",
  266. " <td>0.0</td>\n",
  267. " <td>580.0</td>\n",
  268. " <td>3.0</td>\n",
  269. " <td>2.0</td>\n",
  270. " </tr>\n",
  271. " <tr>\n",
  272. " <th>75%</th>\n",
  273. " <td>1.0</td>\n",
  274. " <td>660.0</td>\n",
  275. " <td>4.0</td>\n",
  276. " <td>3.0</td>\n",
  277. " </tr>\n",
  278. " <tr>\n",
  279. " <th>max</th>\n",
  280. " <td>1.0</td>\n",
  281. " <td>800.0</td>\n",
  282. " <td>4.0</td>\n",
  283. " <td>4.0</td>\n",
  284. " </tr>\n",
  285. " </tbody>\n",
  286. "</table>\n",
  287. "</div>"
  288. ],
  289. "text/plain": [
  290. " admit gre gpa prestige\n",
  291. "count 400.0 398.0 398.0 399.0\n",
  292. "mean 0.0 588.0 3.0 2.0\n",
  293. "std 0.0 116.0 0.0 1.0\n",
  294. "min 0.0 220.0 2.0 1.0\n",
  295. "25% 0.0 520.0 3.0 2.0\n",
  296. "50% 0.0 580.0 3.0 2.0\n",
  297. "75% 1.0 660.0 4.0 3.0\n",
  298. "max 1.0 800.0 4.0 4.0"
  299. ]
  300. },
  301. "execution_count": 11,
  302. "metadata": {},
  303. "output_type": "execute_result"
  304. }
  305. ],
  306. "source": [
  307. "data.describe().round()"
  308. ]
  309. },
  310. {
  311. "cell_type": "code",
  312. "execution_count": 14,
  313. "metadata": {},
  314. "outputs": [
  315. {
  316. "data": {
  317. "text/plain": [
  318. "2.0 150\n",
  319. "3.0 121\n",
  320. "4.0 67\n",
  321. "1.0 61\n",
  322. "Name: prestige, dtype: int64"
  323. ]
  324. },
  325. "execution_count": 14,
  326. "metadata": {},
  327. "output_type": "execute_result"
  328. }
  329. ],
  330. "source": [
  331. "data['prestige'].value_counts()"
  332. ]
  333. },
  334. {
  335. "cell_type": "code",
  336. "execution_count": 15,
  337. "metadata": {},
  338. "outputs": [
  339. {
  340. "data": {
  341. "text/html": [
  342. "<div>\n",
  343. "<style>\n",
  344. " .dataframe thead tr:only-child th {\n",
  345. " text-align: right;\n",
  346. " }\n",
  347. "\n",
  348. " .dataframe thead th {\n",
  349. " text-align: left;\n",
  350. " }\n",
  351. "\n",
  352. " .dataframe tbody tr th {\n",
  353. " vertical-align: top;\n",
  354. " }\n",
  355. "</style>\n",
  356. "<table border=\"1\" class=\"dataframe\">\n",
  357. " <thead>\n",
  358. " <tr style=\"text-align: right;\">\n",
  359. " <th></th>\n",
  360. " <th>admit</th>\n",
  361. " <th>gre</th>\n",
  362. " <th>gpa</th>\n",
  363. " <th>prestige</th>\n",
  364. " </tr>\n",
  365. " </thead>\n",
  366. " <tbody>\n",
  367. " <tr>\n",
  368. " <th>admit</th>\n",
  369. " <td>1.000000</td>\n",
  370. " <td>0.182919</td>\n",
  371. " <td>0.175952</td>\n",
  372. " <td>-0.241355</td>\n",
  373. " </tr>\n",
  374. " <tr>\n",
  375. " <th>gre</th>\n",
  376. " <td>0.182919</td>\n",
  377. " <td>1.000000</td>\n",
  378. " <td>0.382408</td>\n",
  379. " <td>-0.124533</td>\n",
  380. " </tr>\n",
  381. " <tr>\n",
  382. " <th>gpa</th>\n",
  383. " <td>0.175952</td>\n",
  384. " <td>0.382408</td>\n",
  385. " <td>1.000000</td>\n",
  386. " <td>-0.059031</td>\n",
  387. " </tr>\n",
  388. " <tr>\n",
  389. " <th>prestige</th>\n",
  390. " <td>-0.241355</td>\n",
  391. " <td>-0.124533</td>\n",
  392. " <td>-0.059031</td>\n",
  393. " <td>1.000000</td>\n",
  394. " </tr>\n",
  395. " </tbody>\n",
  396. "</table>\n",
  397. "</div>"
  398. ],
  399. "text/plain": [
  400. " admit gre gpa prestige\n",
  401. "admit 1.000000 0.182919 0.175952 -0.241355\n",
  402. "gre 0.182919 1.000000 0.382408 -0.124533\n",
  403. "gpa 0.175952 0.382408 1.000000 -0.059031\n",
  404. "prestige -0.241355 -0.124533 -0.059031 1.000000"
  405. ]
  406. },
  407. "execution_count": 15,
  408. "metadata": {},
  409. "output_type": "execute_result"
  410. }
  411. ],
  412. "source": [
  413. "data.corr()"
  414. ]
  415. },
  416. {
  417. "cell_type": "markdown",
  418. "metadata": {},
  419. "source": [
  420. "\n",
  421. "#### 1. Create a data dictionary "
  422. ]
  423. },
  424. {
  425. "cell_type": "markdown",
  426. "metadata": {},
  427. "source": [
  428. "Answer: \n",
  429. "\n",
  430. "Variable | Description | Type of Variable\n",
  431. "---| ---| ---\n",
  432. "admit | admittance indicator (1=admitted) | binary\n",
  433. "gre | GRE Score, Integer (200 to 800) | continuous \n",
  434. "gpa |Grade Point Average, Float (0-4) | continuous\n",
  435. "prestige | Integer (0-4) | continuous\n"
  436. ]
  437. },
  438. {
  439. "cell_type": "markdown",
  440. "metadata": {},
  441. "source": [
  442. "We would like to explore the association between admittance and GRE Score, Grade Point Average, and Prestige of the school. "
  443. ]
  444. },
  445. {
  446. "cell_type": "markdown",
  447. "metadata": {},
  448. "source": [
  449. "#### 2. What is the outcome?"
  450. ]
  451. },
  452. {
  453. "cell_type": "markdown",
  454. "metadata": {},
  455. "source": [
  456. "Answer: Will the student be admitted into the university program? Use the admit indicator (1=Y/0=N)"
  457. ]
  458. },
  459. {
  460. "cell_type": "markdown",
  461. "metadata": {},
  462. "source": [
  463. "#### 3. What are the predictors/covariates? "
  464. ]
  465. },
  466. {
  467. "cell_type": "markdown",
  468. "metadata": {},
  469. "source": [
  470. "Answer: GRE Score, GPA, Prestige of school"
  471. ]
  472. },
  473. {
  474. "cell_type": "markdown",
  475. "metadata": {},
  476. "source": [
  477. "#### 4. What timeframe is this data relevant for?"
  478. ]
  479. },
  480. {
  481. "cell_type": "markdown",
  482. "metadata": {},
  483. "source": [
  484. "Answer: There is no timeframe associated with this dataset as far as I can tell"
  485. ]
  486. },
  487. {
  488. "cell_type": "markdown",
  489. "metadata": {},
  490. "source": [
  491. "#### 5. What is the hypothesis?"
  492. ]
  493. },
  494. {
  495. "cell_type": "markdown",
  496. "metadata": {},
  497. "source": [
  498. "Answer: That student GRE score, student GPA, and unversity Prestige can predict a student's admittance into a university program."
  499. ]
  500. },
  501. {
  502. "cell_type": "markdown",
  503. "metadata": {},
  504. "source": [
  505. "#### 6. Using the above information, write a well-formed problem statement. \n"
  506. ]
  507. },
  508. {
  509. "cell_type": "markdown",
  510. "metadata": {},
  511. "source": [
  512. "Answer: Using the admissions dataset, determine how likely it is a student will be admitted into a university program based on student's scholastic performance (GRE score & GPA) and the school's prestige level (scored from 1 - 4)."
  513. ]
  514. },
  515. {
  516. "cell_type": "markdown",
  517. "metadata": {},
  518. "source": [
  519. "## Exercise 3: Exploratory Analysis Plan (Materials will be covered on Tuesday's class)"
  520. ]
  521. },
  522. {
  523. "cell_type": "markdown",
  524. "metadata": {},
  525. "source": [
  526. "Using the lab from a class as a guide, create an exploratory analysis plan. "
  527. ]
  528. },
  529. {
  530. "cell_type": "markdown",
  531. "metadata": {},
  532. "source": [
  533. "#### 1. What are the goals of the exploratory analysis? "
  534. ]
  535. },
  536. {
  537. "cell_type": "markdown",
  538. "metadata": {
  539. "collapsed": true
  540. },
  541. "source": [
  542. "Answer: A goal of the exploratory analysis is to better understand the data you are working with. This includes understanding how many fields (columns) there are & what each data field represents, and how big is the data set (# of rows). You can use data.shape() and len(data) to get these answers. Exploratory analysis would also involve an analysis of which columns have sufficient data to be useful (i.e. a field in a 100 row dataset with only one non-empty value would probably not help in the analysis). For columns with sufficient data, each column can then be analyzed for distribution (mean, median, mode, standard deviation, variance). Categorical variables can be analyzed for distribution across the categories. The outcome variable can be plotted against several of the key predictors to get an initial sense of correlation. The \"best\" single variable predictor could be used as a baseline to benchmark performance lift gained from more sophisticated machine learnings models developed. "
  543. ]
  544. },
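{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of those first-pass checks on the admissions data loaded above (note that data.shape is an attribute, not a method):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Size of the dataset: (rows, columns) and row count\n",
"print(data.shape)\n",
"print(len(data))\n",
"\n",
"# Field names and summary statistics for the numeric columns\n",
"print(list(data.columns))\n",
"data.describe()"
]
},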
  545. {
  546. "cell_type": "markdown",
  547. "metadata": {},
  548. "source": [
  549. "#### 2a. What are the assumptions of the distribution of data? "
  550. ]
  551. },
  552. {
  553. "cell_type": "markdown",
  554. "metadata": {},
  555. "source": [
  556. "Answer: I am not sure on this one. I assume that baseline assumption is that data is distributed normally and then through exploratory analysis you may determine that certain variables are significantly skewed positive or negative. Another assumption may be that the data collected is a representative sample of the population. An additional assumption is that the data collection instrument was valid/reliable."
  557. ]
  558. },
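{
"cell_type": "markdown",
"metadata": {},
"source": [
"One quick, rough check on the normality assumption is skewness; a sketch using pandas (values near 0 suggest a roughly symmetric distribution, large positive or negative values suggest skew):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Skewness of each numeric column in the admissions data\n",
"data.skew()"
]
},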
  559. {
  560. "cell_type": "markdown",
  561. "metadata": {},
  562. "source": [
  563. "#### 2b. How will determine the distribution of your data? "
  564. ]
  565. },
  566. {
  567. "cell_type": "markdown",
  568. "metadata": {
  569. "collapsed": true
  570. },
  571. "source": [
  572. "Answer: There are several options using the pandas package. One is to do a data.describe() which will provide several summary and distribution statistics for each of the numerical variables. For the categorical variables, you can use a data[\"Column Name\"].value_counts() to get the distribution across the unique categories. You can also plot histograms df.plot.hist() for each of the categories to get a visual description of the distribution. If you want to see distribution/correlation df.plot.scatter(x=predictor y=outcome)"
  573. ]
  574. },
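{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting those commands together, a sketch on the admissions data (plotting assumes matplotlib is installed):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"# Summary statistics for the numeric variables\n",
"print(data.describe())\n",
"\n",
"# Frequency table for the categorical variable\n",
"print(data['prestige'].value_counts())\n",
"\n",
"# Visual checks: histogram of gpa, and gpa plotted against admit\n",
"data['gpa'].plot.hist()\n",
"data.plot.scatter(x='gpa', y='admit')"
]
},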
  575. {
  576. "cell_type": "markdown",
  577. "metadata": {},
  578. "source": [
  579. "#### 3a. How might outliers impact your analysis? "
  580. ]
  581. },
  582. {
  583. "cell_type": "markdown",
  584. "metadata": {
  585. "collapsed": true
  586. },
  587. "source": [
  588. "Answer: For the exploratory analysis, outliers have more of an impact on the mean than the median. In general, this is what makes the median a more reliable single value descriptor than the mean. If outliers are then included in the modeling, it may skew (overfit?) the model's predictions and hurt model performance. "
  589. ]
  590. },
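{
"cell_type": "markdown",
"metadata": {},
"source": [
"A toy illustration of that point, using made-up GPA-like values rather than the admissions data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = pd.Series([3.0, 3.1, 3.2, 3.3, 30.0])  # one extreme outlier\n",
"print(s.mean())    # pulled up to 8.52 by the single outlier\n",
"print(s.median())  # stays at 3.2"
]
},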
  591. {
  592. "cell_type": "markdown",
  593. "metadata": {},
  594. "source": [
  595. "#### 3b. How will you test for outliers? "
  596. ]
  597. },
  598. {
  599. "cell_type": "markdown",
  600. "metadata": {},
  601. "source": [
  602. "Answer: I think the easiest way to test for outliers is to use histograms and scatterplots because outliers jump out very easily when the data is displayed visually. You can also use a more quantitative approach by calculated data.std() for the standard deviation, data.mean() for the mean, and then comparing suspected outlier data points to those two data points. If the outlier is 2+ standard deviations away from the mean in either direction, then you can consider it an outlier. "
  603. ]
  604. },
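{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of that check on the gre column (the 2-standard-deviation cutoff is a rule of thumb, not a hard rule):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Flag rows whose GRE score is more than 2 standard deviations from the mean\n",
"gre_mean = data['gre'].mean()\n",
"gre_std = data['gre'].std()\n",
"outliers = data[(data['gre'] - gre_mean).abs() > 2 * gre_std]\n",
"print(outliers)"
]
},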
  605. {
  606. "cell_type": "markdown",
  607. "metadata": {},
  608. "source": [
  609. "#### 4a. What is colinearity? "
  610. ]
  611. },
  612. {
  613. "cell_type": "markdown",
  614. "metadata": {
  615. "collapsed": true
  616. },
  617. "source": [
  618. "Answer: Colinearity is when the values of one of the predictor variables can be easily/reliable predicted from one or more of the other predictor variables. Another way to say this is that the variable is very highly correlated with one of the other variables (or a combination of the other variables). I believe that in machine learning, if you have colinear variables in a model, this can hurt model performance in the long run because it makes it look like a combination of variables is a particularly strong predictor when in fact it is like including the same variable twice in the model. "
  619. ]
  620. },
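{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny synthetic illustration (not part of the assignment data): a column built as a linear function of another column is perfectly collinear with it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"demo = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0]})\n",
"demo['x_doubled'] = demo['x'] * 2  # a linear function of x\n",
"print(demo.corr())  # x and x_doubled correlate at exactly 1.0"
]
},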
  621. {
  622. "cell_type": "markdown",
  623. "metadata": {},
  624. "source": [
  625. "#### 4b. How will you test for colinearity? "
  626. ]
  627. },
  628. {
  629. "cell_type": "markdown",
  630. "metadata": {},
  631. "source": [
  632. "Answer: For testing if two different variables are co-linear, you could use data.corr() which will give you an N by N matrix (n= # of columns in the dataframe) with the correlation coefficient for each pair of columns. In the admissions example above, all of the pair-wise correlations seem pretty weak. I do not know how you test co-linearity of 1 variable to a combination of variables, but I will do some googling :) "
  633. ]
  634. },
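{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the variance inflation factor, computed from first principles with numpy (the VIF for column j is 1 / (1 - R^2), where R^2 comes from regressing column j on the other predictors; values near 1 indicate little collinearity, while values above roughly 5-10 are a common warning sign):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def vif(df):\n",
"    # Regress each column on the others and convert R^2 to a VIF\n",
"    results = {}\n",
"    for col in df.columns:\n",
"        y = df[col].values\n",
"        X = df.drop(col, axis=1).values\n",
"        X = np.column_stack([np.ones(len(X)), X])  # add an intercept\n",
"        beta = np.linalg.lstsq(X, y)[0]\n",
"        resid = y - X.dot(beta)\n",
"        r2 = 1.0 - resid.var() / y.var()\n",
"        results[col] = 1.0 / (1.0 - r2)\n",
"    return pd.Series(results)\n",
"\n",
"# Drop rows with missing values before fitting\n",
"print(vif(data[['gre', 'gpa', 'prestige']].dropna()))"
]
},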
  635. {
  636. "cell_type": "markdown",
  637. "metadata": {},
  638. "source": [
  639. "#### 5. What is your exploratory analysis plan?\n",
  640. "Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis 1 year from now. "
  641. ]
  642. },
  643. {
  644. "cell_type": "markdown",
  645. "metadata": {},
  646. "source": [
  647. "Answer: \n",
  648. "- Import the pandas package using python and use .read_csv command to read your data into python. \n",
  649. "- Do a high level description of the dataset. print a list of each columns (for i in data.columns(): print i) len(data), data.describe(), data.head(), \n",
  650. "- Check for columns with a significant number of null values relative to the total number of rows i.e. len(data). See bolow\n",
  651. "- for i in data.columns.values:\n",
  652. " print i + \" Missing values = \" , \n",
  653. " print data[i].isnull().sum() ,\n",
  654. " print \"Out of \" + str(len(data))\n",
  655. "- For each column that is not missing a significant number of values, check visual distribution using df.plot.hist(). - Calculate data['Column Name'].value_counts for categorical variables. \n",
  656. "- use data.corr() to get a pairwise matrix to check if any predictor variables are colinear\n",
  657. "- Use data.plot.scatter(x='predictor column here', y='outcome columne here') on the numerical variables to see which individual variables are the most highly correlated with the outcome.\n"
  658. ]
  659. },
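{
"cell_type": "markdown",
"metadata": {},
"source": [
"A runnable version of the missing-value check from the plan above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count missing values per column, relative to the total row count\n",
"for col in data.columns:\n",
"    n_missing = data[col].isnull().sum()\n",
"    print('%s: %d missing out of %d' % (col, n_missing, len(data)))"
]
},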
  660. {
  661. "cell_type": "markdown",
  662. "metadata": {},
  663. "source": [
  664. "## Bonus Questions:\n",
  665. "1. Outline your analysis method for predicting your outcome\n",
  666. "2. Write an alternative problem statement for your dataset\n",
  667. "3. Articulate the assumptions and risks of the alternative model"
  668. ]
  669. }
  670. ],
  671. "metadata": {
  672. "kernelspec": {
  673. "display_name": "Python 2",
  674. "language": "python",
  675. "name": "python2"
  676. },
  677. "language_info": {
  678. "codemirror_mode": {
  679. "name": "ipython",
  680. "version": 2
  681. },
  682. "file_extension": ".py",
  683. "mimetype": "text/x-python",
  684. "name": "python",
  685. "nbconvert_exporter": "python",
  686. "pygments_lexer": "ipython2",
  687. "version": "2.7.14"
  688. }
  689. },
  690. "nbformat": 4,
  691. "nbformat_minor": 1
  692. }