Advertisement
Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- Jupyter Notebook Logout Control Panelhw03 Last Checkpoint: 38 minutes ago (autosaved)
- Python 3
- File
- Edit
- View
- Insert
- Cell
- Kernel
- Widgets
- Help
- CellToolbarMem: 735 / 2048 (MB)Verify Drive
- Homework 3: Tables and Charts
- Reading: Textbook chapters 5 and 6.
- Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.
- In [93]:
- # Don't change this cell; just run it.
-
- import numpy as np
- from datascience import *
-
- %matplotlib inline
- import matplotlib.pyplot as plots
- plots.style.use('fivethirtyeight')
-
- from client.api.notebook import Notebook
- ok = Notebook('hw03.ok')
- _ = ok.auth(inline=True)
- =====================================================================
- Assignment: Homework 3: Tables and Charts
- OK, version v1.9.5
- =====================================================================
- Successfully logged in as dkathuria@berkeley.edu
- Important: The ok tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach).
- Once you're finished, select "Save and Checkpoint" in the File menu and then execute the submit cell below. The result will contain a link that you can use to check that your assignment has been submitted successfully. If you submit more than once before the deadline, we will only grade your final submission.
- In [94]:
- _ = ok.submit()
- Saving notebook... Saved 'hw03.ipynb'.
- Submit... 100% complete
- Submission successful for user: dkathuria@berkeley.edu
- URL: https://okpy.org/cal/data8/sp17/hw03/submissions/rkZvEK
- 1. Unemployment
- The Federal Reserve Bank of St. Louis publishes data about jobs in the US. Below we've loaded data on unemployment in the United States. There are many ways of defining unemployment, and our dataset includes two notions of the unemployment rate:
- Among people who are able to work and are looking for a full-time job, the percentage who can't find a job. This is called the Non-Employment Index, or NEI.
- Among people who are able to work and are looking for a full-time job, the percentage who can't find any job or are only working at a part-time job. The latter group is called "Part-Time for Economic Reasons", so the acronym for this index is NEI-PTER. (Economists are great at marketing.)
- The source of the data is here.
- Question 1. The data are in a CSV file called unemployment.csv. Load that file into a table called unemployment.
- In [95]:
- unemployment = Table.read_table("unemployment.csv")
- unemployment
- Out[95]:
- Date NEI NEI-PTER
- 1994-01-01 10.0974 11.172
- 1994-04-01 9.6239 10.7883
- 1994-07-01 9.3276 10.4831
- 1994-10-01 9.1071 10.2361
- 1995-01-01 8.9693 10.1832
- 1995-04-01 9.0314 10.1071
- 1995-07-01 8.9802 10.1084
- 1995-10-01 8.9932 10.1046
- 1996-01-01 9.0002 10.0531
- 1996-04-01 8.9038 9.9782
- ... (80 rows omitted)
- In [5]:
- _ = ok.grade('q1_1')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 2. Sort the data in decreasing order by NEI, naming the sorted table by_nei. Create another table called by_nei_pter that's sorted in decreasing order by NEI-PTER instead.
- In [84]:
- by_nei = unemployment.sort("NEI", descending = True)
- by_nei_pter = unemployment.sort("NEI-PTER", descending = True)
- by_nei = unemployment.sort("NEI", descending = True)
- by_nei_pter = unemployment.sort("NEI-PTER", descending = True)
- In [85]:
- _ = ok.grade('q1_2')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 3. Use take to make a table containing the data for the 10 quarters when NEI was greatest. Call that table greatest_nei.
- In [18]:
- greatest_nei = by_nei.take(np.arange(10))
- greatest_nei
- Out[18]:
- Date NEI NEI-PTER
- 2009-10-01 10.9698 12.8557
- 2010-01-01 10.9054 12.7311
- 2009-07-01 10.8089 12.7404
- 2009-04-01 10.7082 12.5497
- 2010-04-01 10.6597 12.5664
- 2010-10-01 10.5856 12.4329
- 2010-07-01 10.5521 12.3897
- 2011-01-01 10.5024 12.3017
- 2011-07-01 10.4856 12.2507
- 2011-04-01 10.4409 12.247
- In [19]:
- _ = ok.grade('q1_3')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 4. It's believed that many people became PTER (recall: "Part-Time for Economic Reasons") in the "Great Recession" of 2008-2009. NEI-PTER is the percentage of people who are unemployed (and counted in the NEI) plus the percentage of people who are PTER. Compute an array containing the percentage of people who were PTER in each quarter. (The first element of the array should correspond to the first row of unemployment, and so on.)
- Note: Use the original unemployment table for this.
- In [31]:
- pter = unemployment.column("NEI-PTER") - unemployment.column("NEI")
- pter
- Out[31]:
- array([ 1.0746, 1.1644, 1.1555, 1.129 , 1.2139, 1.0757, 1.1282,
- 1.1114, 1.0529, 1.0744, 1.1004, 1.0747, 1.0705, 1.0455,
- 1.008 , 0.9734, 0.9753, 0.8931, 0.9451, 0.8367, 0.8208,
- 0.8105, 0.8248, 0.7578, 0.7251, 0.7445, 0.7543, 0.7423,
- 0.7399, 0.7687, 0.8418, 0.9923, 0.9181, 0.9629, 0.9703,
- 0.9575, 1.0333, 1.0781, 1.0675, 1.0354, 1.0601, 1.01 ,
- 1.0042, 1.0368, 0.9704, 0.923 , 0.9759, 0.93 , 0.889 ,
- 0.821 , 0.9409, 0.955 , 0.898 , 0.8948, 0.9523, 0.9579,
- 1.0149, 1.0762, 1.2873, 1.4335, 1.7446, 1.8415, 1.9315,
- 1.8859, 1.8257, 1.9067, 1.8376, 1.8473, 1.7993, 1.8061,
- 1.7651, 1.7927, 1.7286, 1.6387, 1.6808, 1.6805, 1.6629,
- 1.6253, 1.6477, 1.6298, 1.4796, 1.5131, 1.4866, 1.4345,
- 1.3675, 1.3097, 1.2319, 1.1735, 1.1844, 1.1746])
- In [32]:
- _ = ok.grade('q1_4')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 5. Add pter as a column to unemployment (named "PTER") and sort the resulting table by that column in decreasing order. Call the table by_pter.
- Try to do this with a single line of code, if you can.
- In [33]:
- by_pter = unemployment.with_column("PTER", pter).sort("PTER", descending = True)
- by_pter
- Out[33]:
- Date NEI NEI-PTER PTER
- 2009-07-01 10.8089 12.7404 1.9315
- 2010-04-01 10.6597 12.5664 1.9067
- 2009-10-01 10.9698 12.8557 1.8859
- 2010-10-01 10.5856 12.4329 1.8473
- 2009-04-01 10.7082 12.5497 1.8415
- 2010-07-01 10.5521 12.3897 1.8376
- 2010-01-01 10.9054 12.7311 1.8257
- 2011-04-01 10.4409 12.247 1.8061
- 2011-01-01 10.5024 12.3017 1.7993
- 2011-10-01 10.3287 12.1214 1.7927
- ... (80 rows omitted)
- In [34]:
- _ = ok.grade('q1_5')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 6. Does it seem true that the PTER rate was very high during the Great Recession, compared to other periods in the dataset? Justify your answer by referring to specific values in the table or by generating a chart.
- In [102]:
-
- Out[102]:
- numpy.str_
- *Write your answer here, replacing this text.*
- 2. Birth Rates
- The following table gives census-based population estimates for each state on July 1, 2015 and July 1, 2016. The last four columns describe the components of the estimated change in population during this time interval. For all questions below, assume that the word "states" refers to all 52 rows including Puerto Rico & the District of Columbia.
- In [96]:
- # Don't change this cell; just run it.
- # From http://www2.census.gov/programs-surveys/popest/datasets/2010-2016/national/totals/nst-est2016-alldata.csv
- # See http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/totals/nst-est2015-alldata.pdf
- # for column descriptions. (As of Feb 2017, no descriptions were posted for 2010-2016.)
- pop = Table.read_table('nst-est2016-alldata.csv').where('SUMLEV', 40).select([1, 4, 12, 13, 27, 34, 62, 69])
- pop = pop.relabeled(2, '2015').relabeled(3, '2016')
- pop = pop.relabeled(4, 'BIRTHS').relabeled(5, 'DEATHS')
- pop = pop.relabeled(6, 'MIGRATION').relabeled(7, 'OTHER')
- pop.set_format([2, 3, 4, 5, 6, 7], NumberFormatter(decimals=0)).show(5)
- REGION NAME 2015 2016 BIRTHS DEATHS MIGRATION OTHER
- 3 Alabama 4,853,875 4,863,300 58,556 52,405 3,874 -600
- 4 Alaska 737,709 741,894 11,255 4,511 -2,557 -2
- 4 Arizona 6,817,565 6,931,071 87,204 56,564 76,405 6,461
- 3 Arkansas 2,977,853 2,988,248 37,936 30,581 3,530 -490
- 4 California 38,993,940 39,250,017 502,848 273,850 33,530 -6,451
- ... (47 rows omitted)
- Question 1. Assign us_birth_rate to the total US annual birth rate during this time interval. The annual birth rate for a year-long period is the number of births in that period as a proportion of the population at the start of the period.
- In [41]:
- us_birth_rate = sum(pop.column("BIRTHS") / pop.column("2015"))
- us_birth_rate
- Out[41]:
- 0.64106727268407149
- In [42]:
- _ = ok.grade('q2_1')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 2. Assign fastest_growth to an array of the names of the five states with the fastest population growth rates in descending order of growth rate.
- In [51]:
- fastest_growth = pop.with_column("test", (pop.column("2016") - pop.column("2015"))/pop.column("2015")).sort("test", descending = True).column("NAME") [0:5]
- fastest_growth
- Out[51]:
- array(['Utah', 'Nevada', 'Idaho', 'Florida', 'Washington'],
- dtype='<U20')
- In [52]:
- _ = ok.grade('q2_2')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 3. Assign movers to the number of states for which the absolute annual rate of migration was higher than 1%. The annual rate of migration for a year-long period is the net number of migrations (in and out) as a proportion of the population at the start of the period. The MIGRATION column contains estimated annual net migration counts by state.
- In [66]:
- movers = pop.with_column("test", pop.column("MIGRATION")/ pop.column("2015")).where("test", lambda x: abs(x)>.01).num_rows
- movers
- Out[66]:
- 9
- In [67]:
- _ = ok.grade('q2_3')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 4. Assign west_births to the total number of births that occurred in region 4 (the Western US).
- In [78]:
- west_births = sum(pop.where("REGION", lambda x: x == "4").column("BIRTHS"))
- west_births
- Out[78]:
- 979657
- In [79]:
- _ = ok.grade('q2_4')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 5. Assign less_than_west_births to the number of states that had a total population in 2016 that was smaller than the number of babies born in region 4 (the Western US) during this time interval.
- In [81]:
- less_than_west_births = pop.where("2016", lambda x: x < west_births).num_rows
- less_than_west_births
- Out[81]:
- 7
- In [82]:
- _ = ok.grade('q2_5')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 6. Was there an association between birth rate and death rate during this time interval? Use the code cell below to support your conclusion with a chart. If an association exists, what might explain it?
- There is some association between the birth rate and death rate in a slight inverted correlation. As the scatter plot shows, there is a decrease in the death rate as the birth rate increases. This could possibly be attributed to a younger population overall in the state; younger people imply more babies being born with less overall deaths.
- In [99]:
- Death
- # Generate a chart here to support your conclusion
- pop.with_columns("Birth Rate", pop.column("BIRTHS") / pop.column("2015"), "Death Rate", pop.column("DEATHS") / pop.column("2015")).scatter("Birth Rate", "Death Rate")
- 3. Consumer Financial Protection Bureau Complaints
- The Consumer Financial Protection Bureau has collected and published consumer complaints against financial companies since 2011. The data are available here (or at this direct link. For this exercise, to make your code run faster, we've selected only the data from May 2016.
- Run the next cell to load the data. Each row represents one consumer's complaint.
- In [83]:
- # Just run this cell.
- complaints = Table.read_table("complaints.csv")
- complaints
- Out[83]:
- company company_public_response company_response complaint_id complaint_what_happened consumer_consent_provided consumer_disputed date_received date_sent_to_company issue product state sub_issue sub_product submitted_via tags timely zip_code
- TransUnion Intermediate Holdings, Inc. Company has responded to the consumer and the CFPB and c ... Closed with explanation 1920073 (None) (None) Yes 2016-05-11T15:39:07.000 2016-05-11T15:39:07.000 Credit reporting company's investigation Credit reporting VT Inadequate help over the phone (None) Phone (None) Yes 05035
- TransUnion Intermediate Holdings, Inc. Company has responded to the consumer and the CFPB and c ... Closed with explanation 1914777 (None) Consent not provided No 2016-05-08T00:53:47.000 2016-05-12T18:40:34.000 Incorrect information on credit report Credit reporting MO Information is not mine (None) Web (None) Yes 63020
- Bank of America Company has responded to the consumer and the CFPB and c ... Closed with explanation 1907306 I became aware of several charges on a Bank of America c ... Consent provided No 2016-05-03T16:49:33.000 2016-05-03T16:49:34.000 Other Credit card VA (None) (None) Web (None) Yes 239XX
- Finance of America Reverse LLC Company believes it acted appropriately as authorized by ... Closed with explanation 1919055 I applied for a reverse mortgage and everthing was going ... Consent provided No 2016-05-10T20:13:22.000 2016-05-10T20:13:23.000 Application, originator, mortgage broker Mortgage TX (None) Reverse mortgage Web Older American Yes 774XX
- Acceptance Solutions Group, INC Company believes it acted appropriately as authorized by ... Closed with explanation 1908628 Keeps calling numbers that are not mine. And talking to ... Consent provided No 2016-05-03T21:05:42.000 2016-05-06T13:42:45.000 Improper contact or sharing of info Debt collection OH Talked to a third party about my debt Payday loan Web (None) Yes 430XX
- Equifax (None) Closed with explanation 1909176 (None) (None) No 2016-05-04T20:08:06.000 2016-05-09T15:11:00.000 Incorrect information on credit report Credit reporting NC Information is not mine (None) Postal mail (None) Yes 28052
- TransUnion Intermediate Holdings, Inc. Company has responded to the consumer and the CFPB and c ... Closed with explanation 1914477 When I enter my personal information to receive my credi ... Consent provided No 2016-05-06T23:09:50.000 2016-05-08T22:40:19.000 Unable to get credit report/credit score Credit reporting OH Problem getting my free annual report (None) Web (None) Yes 450XX
- Encore Capital Group (None) Closed with non-monetary relief 1919937 (None) Consent not provided (None) 2016-05-11T18:58:25.000 2016-05-11T21:53:54.000 Cont'd attempts collect debt not owed Debt collection CT Debt is not mine Credit card Web Older American Yes 06801
- Nationstar Mortgage (None) Closed with explanation 1920517 I am livid with Nation Star for refusing to work with me ... Consent provided (None) 2016-05-11T20:38:09.000 2016-05-11T20:38:09.000 Application, originator, mortgage broker Mortgage IL (None) Conventional adjustable mortgage (ARM) Web (None) Yes 606XX
- Convergent Resources, Inc. (None) Closed with explanation 1920464 (None) Consent not provided No 2016-05-11T12:16:31.000 2016-05-11T12:16:32.000 Cont'd attempts collect debt not owed Debt collection TX Debt is not mine Other (i.e. phone, health club, etc.) Web (None) Yes 78109
- ... (15021 rows omitted)
- Question 1. Financial companies offer a variety of products. How many complaints were made against each kind of product? Make a table called complaints_per_product with one row per product category and 2 columns: "product" (the name of the product) and "number of complaints" (the number of complaints made against that kind of product).
- In [88]:
- complaints_per_product = complaints.group("product").relabel("count", "number of complaints")
- complaints_per_product
- Out[88]:
- product number of complaints
- Bank account or service 1687
- Consumer Loan 775
- Credit card 1566
- Credit reporting 3820
- Debt collection 3022
- Money transfers 142
- Mortgage 3468
- Other financial service 16
- Payday loan 119
- Prepaid card 110
- ... (1 rows omitted)
- In [89]:
- _ = ok.grade('q3_1')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 2. Make a bar chart showing how many complaints were made about each product category. Sort the bars from shortest to longest.
- In [ ]:
- ...
- Question 3. Make a table of the number of complaints made against each company. Call it complaints_per_company. It should have one row per company and 2 columns: "company" (the name of the company) and "number of complaints" (the number of complaints made against that company).
- In [91]:
- number of complaints
- complaints_per_company = complaints.group("company").relabel("count", "number of complaints")
- complaints_per_company
- Out[91]:
- company number of complaints
- 1st Preference Mortgage 2
- 21st Mortgage Corporation 7
- 2288984 Ontario Inc. 3
- 360 Mortgage 1
- 3rd Generation, Inc. 1
- 4M Collections, LLC 1
- A.R.M. Solutions, Inc. 2
- AC Autopay, LLC 1
- ACE Cash Express Inc. 21
- ACS Education Services 8
- ... (1131 rows omitted)
- In [92]:
- _ = ok.grade('q3_3')
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Running tests
- ---------------------------------------------------------------------
- Test summary
- Passed: 1
- Failed: 0
- [ooooooooook] 100.0% passed
- Question 4. It wouldn't be a good idea to make a bar chart of that data. (Don't try it!) Why not?
- Write your answer here, replacing this text.
- Question 5. Make a bar chart of just the 10 companies with the most complaints.
- In [ ]:
- ...
- Question 6. Make a bar chart like the one above, with one difference: The size of each company's bar should be the proportion (among all complaints made against any company in complaints) that were made against that company.
- Note: Graphs aren't very useful without accurate labels. Make sure that the text on the horizontal axis of the graph makes sense.
- 4. Marginal Histograms
- Consider the following scatter plot:
- The axes of the plot represent values of two variables: xx and yy.
- Suppose we have a table called t that has two columns in it:
- x: a column containing the x-values of the points in the scatter plot
- y: a column containing the y-values of the points in the scatter plot
- Question 1: Match each of the following lines of code to the histograms they produce. Explain your reasoning.
- Line 1: t.hist('x')
- Line 2: t.hist('y')
- Histogram A: Histogram B:
- Histogram for Line 1:
- Explanation:
- Histogram for Line 2:
- Explanation:
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement