Untitled

a guest
Jul 24th, 2022
  1. # -*- coding: utf-8 -*-
  2. """life_expectancy_gdp.ipynb
  3.  
  4. # Introduction
  5.  
  6. For this project, you will act as a data researcher for the World Health Organization. You will investigate if there is a strong correlation between the economic output of a country and the life expectancy of its citizens.
  7.  
  8. During this project, you will analyze, prepare, and plot data, and seek to answer questions in a meaningful way.
  9.  
  10. After you perform analysis, you'll be creating an article with your visualizations to be featured in the fictional "Time Magazine".
  11.  
  12. **Focusing Questions**:
  13. + Has life expectancy increased over time in the six nations?
  14. + Has GDP increased over time in the six nations?
  15. + Is there a correlation between GDP and life expectancy of a country?
  16. + What is the average life expectancy in these nations?
  17. + What is the distribution of that life expectancy?
  18.  
  19. GDP Source: [World Bank](https://data.worldbank.org/indicator/NY.GDP.MKTP.CD) national accounts data, and OECD National Accounts data files.
  20.  
  21. Life expectancy Data Source: [World Health Organization](http://apps.who.int/gho/data/node.main.688)
  22.  
  23. ## Step 1. Import Python Modules
  24.  
  25. Import the modules that you'll be using in this project:
  26. - `from matplotlib import pyplot as plt`
  27. - `import pandas as pd`
  28. - `import seaborn as sns`
  29. """
  30.  
  31. from matplotlib import pyplot as plt
  32. import pandas as pd
  33. import seaborn as sns
  34.  
  35. """## Step 2. Prep The Data
  36.  
  37. To look for connections between GDP and life expectancy you will need to load the datasets into DataFrames so that they can be visualized.
  38.  
  39. Load **all_data.csv** into a DataFrame called `df`. Then, quickly inspect the DataFrame using `.head()`.
  40.  
  41. Hint: Use `pd.read_csv()`
  42. """
  43.  
  44. df = pd.read_csv("all_data.csv")
  45. print(df.head())
  46.  
  47. """## Step 3. Examine The Data
  48.  
  49. The datasets are large and it may be easier to view the entire dataset locally on your computer. You can open the CSV files directly from the folder you downloaded for this project.
  50.  
  51. Let's learn more about our data:
  52. - GDP stands for **G**ross **D**omestic **P**roduct. GDP is a monetary measure of the market value of all final goods and services produced in a time period.
  53. - The GDP values are in current US dollars.
  54.  
  55. What six countries are represented in the data?
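
The country and year questions in this step can also be answered in code instead of by reading the CSV. The following is a minimal sketch, assuming the columns are named `Country` and `Year` as in `all_data.csv`. The `sample` frame holds made-up placeholder rows (not real figures) so that the snippet runs on its own.

```python
import pandas as pd

# Made-up placeholder rows standing in for all_data.csv (not real figures).
sample = pd.DataFrame({
    "Country": ["Chile", "Chile", "Zimbabwe", "Zimbabwe"],
    "Year": [2000, 2015, 2000, 2015],
    "GDP": [1.0e11, 2.0e11, 1.0e10, 2.0e10],
})

# Distinct country names, sorted alphabetically.
countries = sorted(sample["Country"].unique())

# First and last year present, and how many distinct years there are.
first_year = int(sample["Year"].min())
last_year = int(sample["Year"].max())
n_years = int(sample["Year"].nunique())

print(countries)
print(first_year, last_year, n_years)
```

Running the same calls on the full `df` lists every country and the exact year range. Note that a range of 2000 to 2015 inclusive covers 16 distinct years.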
  56. """
  57.  
  58. # The six countries represented in the data are Chile, China, Germany, Mexico, United States of America and Zimbabwe.
  59.  
  60. """What years are represented in the data?"""
  61.  
  62. # The data represents the years 2000 to 2015 (inclusive)
  63.  
  64. """## Step 4. Tweak The DataFrame
  65.  
  66. Look at the column names of the DataFrame `df` using `.head()`.
  67. """
  68.  
  69. df.head()
  70.  
  71. """What do you notice? The first two column names are one word each, and the third is five words long! `Life expectancy at birth (years)` is descriptive, which will be good for labeling the axis, but a little difficult to wrangle for coding the plot itself.
  72.  
  73. **Revise The DataFrame Part A:**
  74.  
  75. Use Pandas to rename the `Life expectancy at birth (years)` column to `LEABY`.
  76.  
  77. Hint: Use `.rename()`. [You can read the documentation here.](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html)
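
As a small illustration of how `.rename()` behaves: by default it returns a new DataFrame and leaves the original untouched, so the result has to be assigned (or `inplace=True` passed). This is only a sketch on a one-row, made-up frame that reuses the column names from `all_data.csv`.

```python
import pandas as pd

# One made-up row with the same column names as all_data.csv.
sample = pd.DataFrame({
    "Country": ["Chile"],
    "Year": [2000],
    "Life expectancy at birth (years)": [77.0],
    "GDP": [1.0e11],
})

# rename() returns a renamed copy; `sample` itself keeps the long name.
renamed = sample.rename(columns={"Life expectancy at birth (years)": "LEABY"})

print(list(sample.columns))
print(list(renamed.columns))
```

Assigning the result to a new name keeps the original frame available, which can be handy later when the long column name is wanted as an axis label.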
  78. """
  79.  
  80. df.rename(columns={"Life expectancy at birth (years)": "LEABY"}, inplace=True)
  81.  
  82. """Run `df.head()` again to check your new column name worked."""
  83.  
  84. df.head()
  85.  
  86. """---
  87.  
  88. ## Step 5. Bar Charts To Compare Average
  89.  
  90. To take a first high-level look at both variables, create a bar chart for each one from `df`:
  91.  
  92. A) Create a bar chart from the data in `df` using `Country` on the x-axis and `GDP` on the y-axis.
  93. Remember to `plt.show()` your chart!
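
It may help to know what the bar heights represent. When several rows share the same `x` value, `sns.barplot()` aggregates them and, by default, draws their mean. A sketch of the equivalent pandas aggregation, on made-up placeholder values:

```python
import pandas as pd

# Made-up placeholder values: two yearly GDP readings per country.
sample = pd.DataFrame({
    "Country": ["Chile", "Chile", "Zimbabwe", "Zimbabwe"],
    "GDP": [1.0e11, 3.0e11, 1.0e10, 2.0e10],
})

# Mean GDP per country: the quantity each bar in the chart would show.
mean_gdp = sample.groupby("Country")["GDP"].mean()

print(mean_gdp)
```

On the full `df`, each bar is therefore the average over all years for that country, which is why this step compares averages rather than trends.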
  94. """
  95.  
  96. plt.figure(figsize=(12, 7))
  97. sns.barplot(data=df, x="Country", y="GDP")
  98. plt.show()
  99.  
  100. """B) Create a bar chart using the data in `df` with `Country` on the x-axis and `LEABY` on the y-axis.
  101. Remember to `plt.show()` your chart!
  102. """
  103.  
  104. plt.figure(figsize=(12, 7))
  105. sns.barplot(data=df, x="Country", y="LEABY")
  106. plt.show()
  107.  
  108. """What do you notice about the two bar charts? Do they look similar?"""
  109.  
  110. # The bar plots show a similar life expectancy across countries, largely independent of GDP. The exception is Zimbabwe, whose life expectancy is considerably lower, in line with its GDP also being the lowest of the six countries.
  111.  
  112. """## Step 6. Violin Plots To Compare Life Expectancy Distributions
  113.  
  114. Another way to compare two datasets is to visualize the distributions of each and to look for patterns in the shapes.
  115.  
  116. We have added the code to instantiate a figure with the correct dimensions to observe detail.
  117. 1. Create an `sns.violinplot()` for the dataframe `df` and map `Country` and `LEABY` as its respective `x` and `y` axes.
  118. 2. Be sure to show your plot
  119. """
  120.  
  121. fig, ax = plt.subplots(figsize=(15, 10))
  122. sns.violinplot(data=df, x="Country", y="LEABY")
  123. plt.savefig("violin.png")  # save before show(); saving afterwards can write a blank image
  124. plt.show()
  125.  
  126. """What do you notice about this distribution? Which country's life expectancy has changed the most?
  127.  
  128. **The spread of `LEABY` for most countries is relatively compact, with the interquartile range spanning only a couple of years. Zimbabwe is the exception: its interquartile range spans more than a decade and its full range more than 20 years, suggesting a disparity in socio-economic factors over the period.**
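
The interquartile range and full range read off a violin plot can be checked numerically. The following is a sketch using placeholder countries `A` and `B` with made-up values; on the real data the same calls would be run on `df`.

```python
import pandas as pd

# Made-up values: "A" is tightly clustered, "B" is widely spread.
sample = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 5,
    "LEABY": [78.0, 79.0, 80.0, 81.0, 82.0, 45.0, 50.0, 55.0, 60.0, 65.0],
})

grouped = sample.groupby("Country")["LEABY"]

# 25th and 75th percentiles per country, one column per quantile.
quartiles = grouped.quantile([0.25, 0.75]).unstack()
iqr = quartiles[0.75] - quartiles[0.25]

# Full range (max minus min) per country.
full_range = grouped.max() - grouped.min()

print(iqr)
print(full_range)
```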
  129.  
  130. ## Step 7. Bar Plots Of GDP and Life Expectancy over time
  131.  
  132. We want to compare the GDPs of the countries over time, in order to get a sense of the relationship between GDP and life expectancy.
  133.  
  134. First, plot the progression of GDP over the years by country in a bar plot using Seaborn.
  135. We have set up a figure with the correct dimensions for your plot. Under that declaration:
  136. 1. Save `sns.barplot()` to a variable named `ax`
  137. 2. Chart `Country` on the x axis and `GDP` on the y axis of the bar plot. Hint: `ax = sns.barplot(x="Country", y="GDP", data=df)`
  138. 3. Use the `Year` as a `hue` to differentiate the 16 years in our data. Hint: `ax = sns.barplot(x="Country", y="GDP", hue="Year", data=df)`
  139. 4. Since the names of the countries are long, let's rotate their labels by 90 degrees so that they are legible. Use `plt.xticks(rotation=90)`
  140. 5. Since our GDP is in trillions of US dollars, make sure your Y label reflects that by changing it to `"GDP in Trillions of U.S. Dollars"`. Hint: `plt.ylabel("GDP in Trillions of U.S. Dollars")`
  141. 6. Be sure to show your plot.
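
One detail worth noting for step 5: the GDP column is stored in plain US dollars, so a label that says "trillions" only matches the tick values if the data is rescaled first. A minimal sketch, where `GDP_trillions` is a hypothetical column name introduced for illustration and the values are made up:

```python
import pandas as pd

# Made-up GDP values in plain US dollars.
sample = pd.DataFrame({
    "Country": ["A", "A"],
    "Year": [2000, 2015],
    "GDP": [1.5e12, 3.0e12],
})

# Rescale so the plotted numbers really are trillions of dollars.
sample["GDP_trillions"] = sample["GDP"] / 1e12

print(sample[["Year", "GDP_trillions"]])
```

Passing `y="GDP_trillions"` to `sns.barplot()` would then make the axis ticks agree with the `"GDP in Trillions of U.S. Dollars"` label.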
  142. """
  143.  
  144. f, ax = plt.subplots(figsize=(10, 15))
  145. sns.barplot(data=df, x="Country", y="GDP", hue="Year")
  146. plt.xticks(rotation=90)
  147. ax.set_ylabel("GDP in Trillions of U.S. Dollars")
  148. plt.show()
  149.  
  150. """Now that we have plotted a barplot that clusters GDP over time by Country, let's do the same for Life Expectancy.
  151.  
  152. The code will essentially be the same as above! The beauty of Seaborn lies in its flexibility and extensibility. Paste the code from above in the cell below, and:
  153. 1. Change your `y` value to `LEABY` in order to plot life expectancy instead of GDP. Hint: `ax = sns.barplot(x="Country", y="LEABY", hue="Year", data=df)`
  154. 2. Tweak the name of your `ylabel` to reflect this change, by making the label `"Life expectancy at birth in years"` Hint: `ax.set(ylabel="Life expectancy at birth in years")`
  155.  
  156. """
  157.  
  158. f, ax = plt.subplots(figsize=(10, 15))
  159. sns.barplot(data=df, x="Country", y="LEABY", hue="Year")
  160. plt.xticks(rotation=90)
  161. ax.set_ylabel("Life expectancy at birth in years")
  162. plt.show()
  163.  
  164. """What are your first impressions looking at the visualized data?
  165.  
  166. - Which countries' bars change the most?
  167. - In which years are the biggest changes in the data?
  168. - Which country has had the least change in GDP over time?
  169. - How do countries compare to one another?
  170. - Now that you can see both bar charts, what do you think about the relationship between GDP and life expectancy?
  171. - Can you think of any reasons that the data looks like this for particular countries?
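
Questions about "how much did it change" can be backed with numbers by computing the percentage change between the first and last year for each country. The helper below, `pct_change_over_period`, is a hypothetical name, and the frame holds made-up values for placeholder countries `A` and `B`; it is a sketch, not the project's prescribed method.

```python
import pandas as pd

# Made-up values for two placeholder countries.
sample = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2015, 2000, 2015],
    "GDP": [1.0e12, 1.5e12, 2.0e10, 3.0e10],
    "LEABY": [80.0, 82.0, 50.0, 60.0],
})


def pct_change_over_period(frame, column):
    """Percentage change of `column` from the first to the last year, per country."""
    # One row per country, one column per year.
    wide = frame.pivot(index="Country", columns="Year", values=column)
    first, last = wide.columns.min(), wide.columns.max()
    return (wide[last] - wide[first]) / wide[first] * 100


gdp_change = pct_change_over_period(sample, "GDP")
leaby_change = pct_change_over_period(sample, "LEABY")

print(gdp_change)
print(leaby_change)
```

Comparing the two results side by side makes it easy to see whether GDP and life expectancy grew at similar relative rates.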
  172. """
  173.  
  174. # Looking first at the GDP trends, China has seen rapid, accelerating growth in its GDP year on year. The USA follows a similar upward trend, but with a slight flattening around 2008. Germany saw good growth in the early 2000s but has since started to plateau, while Chile, Mexico and Zimbabwe saw the smallest absolute growth year on year.
  175.  
  176. # Thankfully, life expectancy has followed an upward trend overall for all 6 countries, generally rising by around 3-5 years between 2000 and 2015. Zimbabwe's life expectancy has seen the sharpest rise.
  177.  
  178. # China's GDP grew by the most, around $10tn.
  179.  
  180. # There is enough of a relationship to say that an increase in GDP correlates with an increase in life expectancy, but a country's GDP grows at a much higher rate than its life expectancy. For example, the USA saw an increase of around 75% in GDP between 2000 and 2015, but its life expectancy only grew by around 5%.
  181.  
  182. # For the first 5 countries, this is most likely because a high standard of living was already present at the earliest point in the data (2000). Zimbabwe, however, is only just seeing its standard of living increase, so it is seeing greater leaps in life expectancy.
  183.  
  184. """Note: You've mapped two bar plots showcasing a variable over time by country. However, bar charts are not traditionally used for this purpose; a great way to visualize a variable over time is a line plot. While the bar charts tell us some information, the data would be better illustrated on a line plot. We will complete this in steps 9 and 10. For now, let's switch gears and create another type of chart.
  185.  
  186. ## Step 8. Scatter Plots of GDP and Life Expectancy Data
  187.  
  188. To create a visualization that will make it easier to see the possible correlation between GDP and life expectancy, you can plot each set of data on its own subplot, on a shared figure.
  189.  
  190. To create multiple plots for comparison, Seaborn has a special [class](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html) called `FacetGrid`. A FacetGrid builds a grid of subplots, one per subset of the data, and its `.map()` method takes a plotting function and draws an individual graph on each subplot with the arguments you specify!
  191.  
  192. Since this may be the first time you've learned about FacetGrid, we have prepped a fill-in-the-blank code snippet below.
  193. Here are the instructions to fill in the blanks from the commented word bank:
  194.  
  195. 1. In this graph, we want GDP on the X axis and Life Expectancy on the Y axis.
  196. 2. We want the columns to be split up for every Year in the data
  197. 3. We want the data points to be differentiated (hue) by Country.
  198. 4. We want to use a Matplotlib scatter plot to visualize the different graphs
  199.  
  200.  
  201. Be sure to show your plot!
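
Alongside the scatter plots, the strength of the GDP and life expectancy relationship can be put into a number. The sketch below computes a Pearson correlation per country with `Series.corr()`. The values are made up for placeholder countries `A` and `B`, and a high correlation here would not by itself show that one variable causes the other.

```python
import pandas as pd

# Made-up values: "A" rises perfectly in step, "B" is only loosely related.
sample = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "GDP": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "LEABY": [70.0, 71.0, 72.0, 73.0, 60.0, 58.0, 59.0, 61.0],
})

# Pearson correlation between GDP and LEABY, computed within each country.
corr_by_country = {
    name: group["GDP"].corr(group["LEABY"])
    for name, group in sample.groupby("Country")
}

print(corr_by_country)
```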
  202. """
  203.  
  204. # WORDBANK:
  205. # "Year"
  206. # "Country"
  207. # "GDP"
  208. # "LEABY"
  209. # plt.scatter
  210.  
  211.  
  212. # Uncomment the code below and fill in the blanks
  213. g = sns.FacetGrid(df, col="Year", hue="Country", col_wrap=4, height=2)
  214. g = (g.map(plt.scatter, "GDP", "LEABY", edgecolor="w").add_legend())
  215. plt.show()
  216.  
  217. """+ Which country moves the most along the X axis over the years?
  218. + Which country moves the most along the Y axis over the years?
  219. + Is this surprising?
  220. + Do you think these scatter plots are easy to read? Maybe there's a way to plot that!
  221. """
  222.  
  223. # China moves the furthest along the X axis over the years, closely followed by the USA. This is due to their GDP increasing by the largest amounts.
  224.  
  225. # Zimbabwe moves the furthest along the Y axis over the years. This is due to its life expectancy increasing the most, made more apparent by it starting much lower.
  226.  
  227. """## Step 9. Line Plots for Life Expectancy
  228.  
  229. In the scatter plot grid above, it was hard to isolate the change for GDP and Life expectancy over time.
  230. It would be better illustrated with a line graph for each of GDP and life expectancy, by country.
  231.  
  232. FacetGrid also allows you to do that! Instead of passing in `plt.scatter` as your Matplotlib function, you pass in `plt.plot` to see a line graph. A few other things have to change as well, so we have created a different fill-in-the-blank code snippet that makes use of a line chart. We will make two separate FacetGrids, one for GDP and one for life expectancy.
  233.  
  234. Here are the instructions to fill in the blanks from the commented word bank:
  235.  
  236. 1. In this graph, we want Years on the X axis and Life Expectancy on the Y axis.
  237. 2. We want the columns to be split up by Country
  238. 3. We want to use a Matplotlib line plot to visualize the different graphs
  239.  
  240.  
  241. Be sure to show your plot!
  242. """
  243.  
  244. # WORDBANK:
  245. # plt.plot
  246. # "LEABY"
  247. # "Year"
  248. # "Country"
  249.  
  250.  
  251. # Uncomment the code below and fill in the blanks
  252. g3 = sns.FacetGrid(df, col="Country", col_wrap=3, height=4)
  253. g3 = (g3.map(plt.plot, "Year", "LEABY").add_legend())
  254. plt.show()
  255. """What are your first impressions looking at the visualized data?
  256.  
  257. - Which country's line changes the most?
  258. - In which years are the biggest changes in the data?
  259. - Which country has had the least change in life expectancy over time?
  260. - Can you think of any reasons that the data looks like this for particular countries?
  261.  
  262. **Zimbabwe's line changes the most, with a decline in the early 2000s followed by a sharp increase. The USA has the smallest increase in life expectancy, presenting a very flat line. Chile and, to a lesser extent, Mexico have less stable lines. Zimbabwe was likely affected by the famine crisis in the early 2000s, which would explain the decrease around this time.**
  263.  
  264. ## Step 10. Line Plots for GDP
  265.  
  266. Let's recreate the same FacetGrid for GDP now. Instead of Life Expectancy on the Y axis, we now want GDP.
  267.  
  268. Once you complete and successfully run the code above, copy and paste it into the cell below. Change the variable for the Y axis. Change the color on your own! Be sure to show your plot.
  269. """
  270.  
  271. # WORDBANK:
  272. # plt.plot
  273. # "GDP"
  274. # "Year"
  275. # "Country"
  276.  
  277.  
  278. # Uncomment the code below and fill in the blanks
  279. g3 = sns.FacetGrid(df, col="Country", col_wrap=3, height=4)
  280. g3 = (g3.map(plt.plot, "Year", "GDP").add_legend())
  281. plt.show()
  282.  
  283. """Which countries have the highest and lowest GDP?"""
  284.  
  285. # The USA has the highest GDP and Zimbabwe has the lowest.
  286.  
  287. # The other countries are more scattered. Germany begins the millennium with the second-highest GDP but is overtaken by China from around 2010 onwards.
  288.  
  289. """Which countries have the highest and lowest life expectancy?"""
  290.  
  291. # Germany has the highest life expectancy, followed by the USA. Zimbabwe has the lowest life expectancy at every point in the data.
  292.  
  293. """## Step 11. Researching Data Context
  294.  
  295. Based on the visualization, choose one part of the data to research a little further so you can add some real-world context to the visualization. You can choose anything you like, or use the example question below.
  296.  
  297. What happened in China in the past 10 years that increased the GDP so drastically?
  298. """
  299.  
  300. # Studies report that the main reasons for China's increase in GDP are large-scale capital investment (financed by both domestic savings and foreign investment) and rapid productivity growth.
  301.  
  302. # Source: https://www.everycrsreport.com/reports/RL33534.html
  303.  
  304. """## Step 12. Create Blog Post
  305.  
  306. Use the content you have created in this Jupyter notebook to create a blog post reflecting on this data.
  307. Include the following visuals in your blogpost:
  308.  
  309. 1. The violin plot of the life expectancy distribution by country
  310. 2. The facet grid of scatter graphs mapping GDP against Life Expectancy by country
  311. 3. The facet grid of line graphs mapping GDP by country
  312. 4. The facet grid of line graphs mapping Life Expectancy by country
  313.  
  314.  
  315. We encourage you to spend some time customizing the color and style of your plots! Remember to use `plt.savefig("filename.png")` to save your figures as a `.png` file.
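
When saving figures, the order of the calls matters: `plt.show()` can leave the current figure empty afterwards (notably in notebooks), so saving after showing can produce a blank image. A minimal sketch of the safer order, using made-up placeholder values and a hypothetical output file name in the system temp directory:

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
from matplotlib import pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([2000, 2005, 2010, 2015], [1.0, 2.0, 3.0, 4.0])  # placeholder values
ax.set_xlabel("Year")

# Hypothetical output location, chosen so the sketch leaves no file in the project.
out_path = os.path.join(tempfile.gettempdir(), "example_plot.png")

fig.savefig(out_path)  # save first ...
plt.show()             # ... then show
plt.close(fig)

print(os.path.getsize(out_path) > 0)
```

Calling `savefig` on the figure object rather than through `plt` also removes any doubt about which figure is being written.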
  316.  
  317. When authoring your blog post, here are a few questions to guide your research and writing:
  318. + How do you think the histories and the cultural values of each country relate to its GDP and life expectancy?
  319. + What would have helped make the project data more reliable? What were the limitations of the dataset?
  320. + Which graphs better illustrate different relationships?
  321. """
  322.  
  323.  