Advertisement
Guest User

udacity query

a guest
Sep 19th, 2017
323
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 26.94 KB | None | 0 0
  1. WHITE WINE PROJECT
  2. ========================================================
  3.  
  4. ```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
  5. # libraries potentially needed
  6. library(ggplot2)
  7. library(GGally)
  8. library(scales)
  9. library(memisc)
  10. library(dplyr)
  11. library(tidyr)
  12. library(gridExtra)
  13. library(RColorBrewer)
  14. library (corrplot)
  15.  
  16. library(ggthemes)
  17. theme_set(theme_economist())
  18.  
  19. ```
  20.  
  21. ```{r echo=FALSE, Load_the_Data}
  22. whitewine <- read.csv('wineQualityWhites.csv')
  23.  
  24. ```
  25.  
  26.  
  27.  
  28.  
  29. ## INTRODUCTION TO DATASET
  30. This report explores a datset containing an expert quality assessment and chemical composition data (11 variables) for 4,898 white wines (all Portuguese 'Vinho Verde').
  31.  
  32. # Univariate Plots Section
  33.  
  34. ```{r echo=FALSE, Univariate_Plots}
  35.  
  36. print('whitewine dataframe')
  37. str(whitewine)
  38. writeLines('\n\n')
  39. summary(whitewine)
  40.  
  41. ```
  42.  
  43.  
  44. Preliminary assessment of the data suggests that the quality score may often be usefully treated as more usefully treated as as factor variable, as it only takes integer scores (and only scores from 3-9 are seen - so it is effectively a 6 point scale fo this dataset). An additional variable (quality.factor) is introduced for when a factor scale is more useful.
  45.  
  46. ```{r echo=FALSE, Univariate_Plots}
  47. whitewine$quality.factor <- factor(whitewine$quality)
  48.  
  49. ggplot(whitewine, aes(x=quality.factor)) +
  50. geom_histogram(stat='count')
  51.  
  52. ```
  53.  
  54. A quick view shows that quality scores range from 3-8, with the majority between 5-7. Scores show a relatively normal distribution. It is also worth noting that there are very few wines with a quality score of 3 (x wines) or 9 (x wines)
  55.  
  56.  
  57.  
  58.  
  59. ```{r echo=FALSE, Univariate_Plots}
  60. p1 = ggplot(whitewine, aes(x=residual.sugar)) +
  61. geom_histogram(bins = 75)
  62.  
  63. p2 = p1 + scale_x_log10() +
  64. geom_vline(aes(xintercept = 1.3), colour="red", linetype = 20) +
  65. geom_vline(aes(xintercept = 10), colour="red", linetype = 20) +
  66. geom_vline(aes(xintercept = 3), colour="orange", linetype = 20)
  67.  
  68. #arrange plots in grid
  69. grid.arrange(p1,p2, ncol=1)
  70.  
  71. ```
  72.  
  73. Residual sugar, when expanded to a log_10 scale, shows a bimodal distribution, with peaks at c. 1.3 and 10 (red lines). White wines are commonly regarded as either 'dry' or 'sweet' - so perhaps it is worth splitting the whitewines into these two categories for the analysis, split at around residual.sugar = 3 (the orange line)?
  74.  
  75. ```{r echo=TRUE, Univariate_Plots}
  76. whitewine$sweetness <- NA
  77. whitewine$sweetness <- factor(ifelse(whitewine$residual.sugar >=3, 'sweet','dry'))
  78.  
  79. whitewine.sweet <- whitewine[whitewine$sweetness == 'sweet',]
  80. whitewine.dry <- whitewine[whitewine$sweetnes == 'dry',]
  81. ```
  82.  
  83.  
  84.  
  85. ```{r echo=FALSE, Univariate_Plots}
  86.  
  87. p1 = ggplot(whitewine, aes(x=residual.sugar, fill = sweetness)) +
  88. geom_histogram(bins = 75)
  89.  
  90. p2 = p1 + scale_x_log10() +
  91. geom_vline(aes(xintercept = 1.3), colour="red", linetype = 20) +
  92. geom_vline(aes(xintercept = 10), colour="red", linetype = 20) +
  93. geom_vline(aes(xintercept = 3), colour="orange", linetype = 20)
  94.  
  95. #arrange plots in grid
  96. grid.arrange(p1,p2, ncol=1)
  97.  
  98. ```
  99.  
  100. I have created a new variable 'sweetness', with values 'dry' and 'sweet'.
  101.  
  102. ```{r echo=FALSE, Univariate_Plots}
  103. print('whitewine.sweet dataframe')
  104. summary(whitewine.sweet)
  105. writeLines('\n\n')
  106. print('whitewine.dry dataframe')
  107. summary(whitewine.dry)
  108.  
  109. ```
  110.  
  111. Let's look at quality again.
  112.  
  113. ```{r echo=FALSE, Univariate_Plots}
  114. ggplot(whitewine, aes(x=quality.factor, fill = sweetness)) +
  115. geom_histogram(stat='count')
  116.  
  117. ```
  118.  
  119. Quality distribution of sweet and dry whitewines appears similar.
  120.  
  121. ```{r echo=FALSE, Univariate_Plots}
  122. ggplot(whitewine, aes(x=fixed.acidity, fill = sweetness)) +
  123. geom_histogram(binwidth= 0.2) +
  124. scale_x_continuous(limits = c(2, 12.5), breaks = seq (0,15,2)) +
  125. geom_vline(aes(xintercept = median(fixed.acidity)), colour="black", linetype = 20)
  126.  
  127. ```
  128.  
  129. Fixed acidity shows a relatively thin normal distribution about a median of 6.8 (black line), with no noticeable sweet / dry difference.
  130.  
  131. ```{r echo=FALSE, Univariate_Plots}
  132. ggplot(whitewine, aes(x=volatile.acidity,fill = sweetness)) +
  133. geom_histogram(binwidth= 0.02) +
  134. scale_x_continuous(limits = c(0, 1), breaks = seq (0,1,.2)) +
  135. geom_vline(aes(xintercept = median(volatile.acidity)), colour="black", linetype = 20)
  136. ```
  137.  
  138. Volatile acidity shows a slightly right skewed normal distribution about a median of 0.26 (black line), with a few outliers at >0.8.
  139.  
  140. ```{r echo=FALSE, Univariate_Plots}
  141. ggplot(whitewine, aes(x=citric.acid, fill = sweetness)) +
  142. geom_histogram(binwidth= 0.02) +
  143. scale_x_continuous(limits = c(0, 1), breaks = seq (0,1,.2)) +
  144. geom_vline(aes(xintercept = 0.49), colour="red", linetype = 20) +
  145. geom_vline(aes(xintercept = 0.74), colour="red", linetype = 20) +
  146. geom_vline(aes(xintercept = median(citric.acid)), colour="black", linetype = 20)
  147.  
  148. ```
  149.  
  150. Citric acid shows a generally normal distribution about a median of 0.32 (black line) , with an odd peak at c. 0.49, and a smaller one at c. 0.74 (red lines).
  151.  
  152. ```{r echo=FALSE, message=FALSE, warning=FALSE}
  153. ggplot(whitewine, aes(x=citric.acid, fill = sweetness)) +
  154. geom_histogram(binwidth= 0.005) +
  155. scale_x_continuous(limits = c(0.4, 0.6), breaks = seq (0,1,.1)) +
  156. geom_vline(aes(xintercept = 0.490), colour="red", linetype = 20)
  157.  
  158. ```
  159.  
  160. A 'zoom' into the histogram in this region shows that there certainly appears to be a local 'spike' in the data at citric.acid = 0.49. I wonder if there is some form of target / guideline to aim for 'below 0.5' for citric.acid during the wine making process? (Could this be Goodharts Law in action?). Or alternatively, could this be some measurement artefact?
  161.  
  162. ```{r echo=FALSE, message=FALSE, warning=FALSE}
  163. ggplot(whitewine, aes(x=citric.acid, fill = sweetness)) +
  164. geom_histogram(binwidth= 0.005) +
  165. scale_x_continuous(limits = c(0.7, 0.8), breaks = seq (0,1,.1)) +
  166. geom_vline(aes(xintercept = 0.740), colour="red", linetype = 20)
  167.  
  168. ```
  169.  
  170. There is a similar (though smaller) local 'spike' at 0.74 - again, is there some effect clustering values below a 'round' value of 0.75?
  171.  
  172. There appears to be some effect causing a local spike in citric.acid at 0.49 and 0.74, just below the 'round' values of 0.5 and 0.75. My suspicion at this point is that 0.5 and 0.75 could be some form of 'target levels' which winemakers aim to be below. I wonder how I might investigate this further?
  173.  
  174.  
  175. ```{r echo=FALSE, Univariate_Plots}
  176. ggplot(whitewine, aes(x=chlorides, fill = sweetness)) +
  177. geom_histogram(bins=100) +
  178. scale_x_continuous(limits = c(0, 0.2), breaks = seq (0,0.2,.05)) +
  179. geom_vline(aes(xintercept = median(chlorides)), colour="black", linetype = 20)
  180.  
  181. ```
  182.  
  183. Chlorides show a tight normal distribution about a median of 0.043, with a few outliers above 0.1. Here the dry wines show slightly lower levels of chlorides than sweet wines.
  184.  
  185. ```{r echo=FALSE, Univariate_Plots}
  186. ggplot(whitewine, aes(x=free.sulfur.dioxide, fill = sweetness)) +
  187. geom_histogram(binwidth = 2) +
  188. scale_x_continuous(limits = c(0, 125), breaks = seq (0,300,50))+
  189. geom_vline(aes(xintercept = median(free.sulfur.dioxide)), colour="black", linetype = 20)
  190.  
  191.  
  192. ```
  193.  
  194. Free sulfur dioxide shows a (very) slightly right skewed distribution about a median of 34.0. There is a noticeable difference here between sweet and dry, with dry wines showing lower free.sulfur.dioxide.
  195.  
  196.  
  197. ```{r echo=FALSE, Univariate_Plots}
  198. ggplot(whitewine, aes(x=total.sulfur.dioxide, fill = sweetness)) +
  199. geom_histogram(binwidth = 4)+
  200. scale_x_continuous(limits = c(0, 325), breaks = seq (0,500,50))+
  201. geom_vline(aes(xintercept = median(total.sulfur.dioxide)), colour="black", linetype = 20)
  202.  
  203.  
  204. ```
  205.  
  206. Total sulfur dioxide shows a more symmetrical normal distribution about a median of 134.0. Again, there is a noticeable difference here between sweet and dry, with dry wines showing lower total.sulfur.dioxide.
  207.  
  208. ```{r echo=FALSE, Univariate_Plots}
  209. ggplot(whitewine, aes(x=density, fill = sweetness)) +
  210. geom_histogram(binwidth = 0.0005)+
  211. scale_x_continuous(limits = c(0.98, 1.01), breaks = seq (0,2,.01))+
  212. geom_vline(aes(xintercept = median(density)), colour="black", linetype = 20)
  213.  
  214.  
  215. ```
  216.  
  217. Density shows a normal distribution in a very tight range (mostly 0.99 < density < 1.00). A marked difference between sweet and dry is clear: dry wines are lower density.
  218.  
  219. ```{r echo=FALSE, Univariate_Plots}
  220. ggplot(whitewine, aes(x=pH, fill = sweetness)) +
  221. geom_histogram(binwidth = 0.03) +
  222. scale_x_continuous(limits = c(2.7, 3.7), breaks = seq (0,14,.1)) +
  223. geom_vline(aes(xintercept = median(pH)), colour="black", linetype = 20)
  224.  
  225. ```
  226.  
  227. pH shows a 'very' neat normal distribution around a median of 3.18 (black line).
  228.  
  229.  
  230. ```{r echo=FALSE, Univariate_Plots}
  231. ggplot(whitewine, aes(x=sulphates, fill = sweetness)) +
  232. geom_histogram(binwidth = 0.02) +
  233. scale_x_continuous(limits = c(0.2, 1.0), breaks = seq (0,2,.2))+
  234. geom_vline(aes(xintercept = median(sulphates)), colour="black", linetype = 20)
  235.  
  236.  
  237. ```
  238.  
  239. Sulphates show a slightly right skewed distribution around a median of 0.47.
  240.  
  241.  
  242. ```{r echo=FALSE, Univariate_Plots}
  243. ggplot(whitewine, aes(x=alcohol, fill = sweetness)) +
  244. geom_histogram(binwidth = 0.1)+
  245. scale_x_continuous(limits = c(8, 15), breaks = seq (0,20,1)) +
  246. geom_vline(aes(xintercept = median(alcohol)), colour="black", linetype = 20)#+
  247. #geom_vline(aes(xintercept = mean(alcohol)), colour="orange", linetype = 20)
  248.  
  249. ```
  250.  
  251. Alcohol shows a more 'spread' distribution, with median of 10.4. And a clear difference between sweet and dry wines. Let's look at the separately.
  252.  
  253. ```{r}
  254. ggplot(whitewine, aes(x=alcohol, fill = sweetness)) +
  255. geom_histogram(binwidth = 0.1) +
  256. facet_wrap(~sweetness, ncol=1) +
  257. theme(legend.position = 'right')
  258.  
  259. ```
  260.  
  261. Dry wines now a more symmetrical normal distribution , while sweet wines show a noticeable 'right skew'.
  262.  
  263.  
  264.  
  265. # Univariate Analysis
  266.  
  267. ### What is the structure of your dataset?
  268. The data contains data on 11 chemical parameters and a expert quality assessment (on a scale of 1-10) for 4,898 white wines. All of the chemical parameters are measurements on a continuous scale. The quality score is integers (only) from 1-10, and only scores from 3-9 are observed, so for some aspects of this analysis it makes sense to consider it as an ordered factor varaible.
  269.  
  270. Most wines have a quality score of 5-7, with few scoring either very high or very low quality scores.
  271.  
  272. Most of the chemical parameters show normal or slightly skewed distributions, with a few exceptions worth noting:
  273. - citric.acid has two 'local' maxima / spikes in distribution at 0.49 and 0.74
  274. - residual.sugar (viewed on a log scale) shows a bimodal distribution
  275. - alcohol shows a broad distribution
  276.  
  277.  
  278. ### What is/are the main feature(s) of interest in your dataset?
  279. At this point, most of the chemical parameters are potentially interesting. Both chlorides and density show relatively narrow distributions, and so may prove to be less interesting than other parameters.
  280.  
  281. The sulfur.dioxide variables (free. and total.) both show differences between sweet and dry wines, as do density and alcohol.
  282.  
  283. ### What other features in the dataset do you think will help support your \
  284. investigation into your feature(s) of interest?
  285. I certainly think it will be useful to consider the difference between the sweet and dry wines for future analysis.
  286.  
  287.  
  288. ### Did you create any new variables from existing variables in the dataset?
  289. It appears worthwhile to add a 'quality.factor' variable - for views where quality is more usefully considered as a factor ub subsequent plots.
  290. I also added a 'sweetness' variable, to split the data up into 'sweet' and 'dry' wines.
  291.  
  292.  
  293. ### Of the features you investigated, were there any unusual distributions? \
  294. Did you perform any operations on the data to tidy, adjust, or change the form \
  295. of the data? If so, why did you do this?
  296.  
  297. I used a log_10 scale for the residual sugar to highlight the bimodality of the distribution, and enable the splitting between sweet and dry.
  298.  
  299.  
  300. # Bivariate Plots Section
  301.  
  302. ```{r echo=FALSE, Bivariate_Plots}
  303. m <- cor(whitewine[c(13, 2:12)])
  304. corrplot(m, method = 'square')
  305. ```
  306.  
  307. ```{r echo=FALSE, Bivariate_Plots}
  308. corrplot(m, method = 'number', number.cex = 0.65)
  309. ```
  310.  
  311. An initial look at a correlation matrix indicates a few areas of interest:
  312. - There is a strong (-0.78) negative correlation between alcohol and density, and quite some correlations between alcohol and residual.sugar (-0.45).
  313. - Unsurprisingly, pH shows a negative association (-0.43) with fixed.acidity (low pH means acidic).
  314. - Density shows a strong positive correlation (0.84) with residual.sugar, and some correlation (0.53) with total.sulfur.dioxide.
  315. - The strongest correlations shown quality are with alcohol (0.44) and density (-0.31).
  316. - total.sulfur.dioxide and free.sulfur.dioxide correlate (0.62) with each other.
  317.  
  318. I thought it might also be interesting to see where the correlation matrices show the *greatest differences* when evaluated separately for the 'sweet' and 'dry' white wines:
  319.  
  320.  
  321. ```{r echo=TRUE, Bivariate_Plots}
  322. m.sweet <- cor(whitewine.sweet[c(13, 2:12)])
  323. m.dry <- cor(whitewine.dry[c(13, 2:12)])
  324.  
  325. m.diff <- m.sweet - m.dry
  326.  
  327. m.diff
  328.  
  329. ```
  330.  
  331. If we look for the largest differences:
  332. - >0.75 difference: density / residual.sugar
  333. - >0.5 difference: residual.sugar / alcohol
  334. - >0.25 difference residual.sugar / quality; chlorides / citric.acid; alcohol / free.sulfur.dioxide;
  335. density / free.sulfur.dioxide
  336.  
  337. ```{r echo=FALSE, Bivariate_Plots}
  338.  
  339.  
  340. ggpairs(whitewine[,c(14,2:6)],
  341. lower = list(continuous = wrap('points', alpha=0.5, size = 0.1), combo=wrap('facetdensity')),
  342. upper = list(continuous = wrap('cor', size=4), combo = wrap('box', size=0.3,
  343. outlier.size = 0.1)),
  344. axisLabels = 'show') +
  345. theme(axis.text.x = element_text(angle=90, vjust=1, hjust=1,size = 2),
  346. axis.text.y = element_text(size=6),
  347. strip.text = element_text(size=8))
  348.  
  349.  
  350. ```
  351.  
  352. ```{r echo=FALSE, Bivariate_Plots}
  353. ggpairs(whitewine[,c(14,7:12)],
  354. lower = list(continuous = wrap('points', alpha=0.5, size = 0.1), combo=wrap('facetdensity')),
  355. upper = list(continuous = wrap('cor', size=4), combo = wrap('box', size=0.3,
  356. outlier.size = 0.1)),
  357. axisLabels = 'show') +
  358. theme(axis.text.x = element_text(angle=90, vjust=1, hjust=1,size = 2),
  359. axis.text.y = element_text(size=6),
  360. strip.text = element_text(size=8))
  361. ```
  362.  
  363. A review of the 'pair plot' matrix (split into 2, and using quality as a factor in each), highlights a few features worth investigating:
  364. - TBD
  365.  
  366.  
  367. ```{r echo=FALSE, Bivariate_Plots}
  368.  
  369. COLUMNS <- colnames(whitewine)[2:12]
  370.  
  371. showBoxplt <- function(df, y, x='quality.factor', lower=0.01, upper=0.98) {
  372. ggplot(df, aes_string(x=x, y=y)) +
  373. geom_boxplot(size = 0.4, outlier.size = 0.1)+
  374. ylim(quantile(df[, y], prob=lower),
  375. quantile(df[, y], prob=upper)) +
  376. theme(axis.text.x = element_text(size = 6),
  377. axis.text.y = element_text(size=6),
  378. axis.title.x = element_text(size=8),
  379. axis.title.y = element_text(size=8))
  380.  
  381. }
  382. boxp1 <- lapply(COLUMNS, FUN=function(var) showBoxplt(whitewine, var))
  383. do.call(grid.arrange, args=c(boxp1, list(ncol=4)))
  384.  
  385.  
  386.  
  387. ```
  388.  
  389. It is also worth looking (at a slightly greater scale) at boxplots of the various chemical parameters across quality scores. An initial look suggests that density and alcohol and chlorides show the clearest and most consistrent trends with quality. But it is worth examining some of these parameters more closely, and see how the quality relationship might vary between sweet and dry whitewines.
  390.  
  391.  
  392. ```{r echo=FALSE, Bivariate_Plots}
  393.  
  394. y.data = whitewine$density
  395. y.label = 'density'
  396.  
  397. ggplot(whitewine, aes( x = quality.factor, y = y.data) ) +
  398. geom_jitter(alpha = 0.4, aes(colour = sweetness)) +
  399. geom_boxplot(alpha = 0.5, aes(fill = sweetness), outlier.size = 0.6) +
  400. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  401. ylab(y.label)
  402.  
  403.  
  404.  
  405. ```
  406.  
  407. Density shows a clear nagative trend with quality for both dry and sweet, but the variation across quality is more pronounced for the sweet wines. And there is a clear variation between optimal densities for sweet vs. dry white wines.
  408.  
  409. ```{r echo=FALSE, Bivariate_Plots}
  410.  
  411. y.data = whitewine$alcohol
  412. y.label = 'alcohol'
  413.  
  414. ggplot(whitewine, aes( x = quality.factor, y = y.data) ) +
  415. geom_jitter(alpha = 0.4, aes(colour = sweetness)) +
  416. geom_boxplot(alpha = 0.5, aes(fill = sweetness), outlier.size = 0.6) +
  417. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  418. ylab(y.label)
  419.  
  420.  
  421. ```
  422.  
  423. Alcohol shows a broadly positive correlation with quality for both sweet and dry wines, but the relationship appears more pronounced for sweet wines.
  424.  
  425.  
  426. ```{r echo=FALSE, Bivariate_Plots}
  427. y.data = whitewine$chlorides
  428. y.label = 'chlorides'
  429.  
  430. ggplot(whitewine, aes( x = quality.factor, y = y.data) ) +
  431. geom_jitter(alpha = 0.4, aes(colour = sweetness)) +
  432. geom_boxplot(alpha = 0.5, aes(fill = sweetness), outlier.size = 0.6) +
  433. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.95))) +
  434. ylab(y.label)
  435.  
  436.  
  437. ```
  438.  
  439. Chlorides do indeed show a (negative) correlation with quality, at slightly lower levels for dry wines than for sweet.
  440.  
  441.  
  442. ```{r echo=FALSE, Bivariate_Plots}
  443. y.data = whitewine$residual.sugar
  444. y.label = 'residual.sugar'
  445.  
  446. ggplot(whitewine, aes( x = quality.factor, y = y.data) ) +
  447. geom_jitter(alpha = 0.4, aes(colour = sweetness)) +
  448. geom_boxplot(alpha = 0.5, aes(fill = sweetness), outlier.size = 0.6) +
  449. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  450. ylab(y.label)
  451.  
  452.  
  453. ```
  454.  
  455. As the basis for the separation between dry & sweet, clearly residual sugar shows very different quality variation across the two groups. Let's examine these groups separately for residual.sugar.
  456.  
  457. ```{r echo=FALSE, Bivariate_Plots}
  458. y.data = whitewine.sweet$residual.sugar
  459. y.label = 'residual.sugar'
  460.  
  461. ggplot(whitewine.sweet, aes( x = quality.factor, y = y.data) ) +
  462. geom_jitter(alpha = 0.4, color = '#07bfc4') +
  463. geom_boxplot(alpha = 0.5, fill = '#07bfc4', outlier.size = 0.6) +
  464. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  465. ylab(y.label)
  466.  
  467.  
  468. ```
  469.  
  470. The trend for sweet whitewines is slightly negative: higher quality wines have less residual.sugar
  471.  
  472.  
  473. ```{r echo=FALSE, Bivariate_Plots}
  474. y.data = whitewine.dry$residual.sugar
  475. y.label = 'residual.sugar'
  476.  
  477. ggplot(whitewine.dry, aes( x = quality.factor, y = y.data) ) +
  478. geom_jitter(alpha = 0.4, color = '#f8766d') +
  479. geom_boxplot(alpha = 0.5, fill = '#f8766d', outlier.size = 0.6) +
  480. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  481. ylab(y.label)
  482.  
  483.  
  484. ```
  485.  
  486. However for dry whitewines, the trend is reversed - with higher quality wines showing slightly higher levels of residual.sugar.
  487.  
  488. ```{r echo=FALSE, Bivariate_Plots}
  489. y.data = whitewine$total.sulfur.dioxide
  490. y.label = 'total.sulfur.dioxide'
  491.  
  492. ggplot(whitewine, aes( x = quality.factor, y = y.data) ) +
  493. geom_jitter(alpha = 0.4, aes(colour = sweetness)) +
  494. geom_boxplot(alpha = 0.5, aes(fill = sweetness), outlier.size = 0.6) +
  495. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  496. ylab(y.label)
  497.  
  498.  
  499. ```
  500.  
  501. Total.sulfur.dioxide shows a clear negative trend agains increasing quality for sweet wines, but a much less pronounced trend for dry wines.
  502.  
  503. ```{r echo=FALSE, Bivariate_Plots}
  504. y.data = whitewine$sulphates
  505. y.label = 'sulphates'
  506.  
  507. ggplot(whitewine, aes( x = quality.factor, y = y.data) ) +
  508. geom_jitter(alpha = 0.4, aes(colour = sweetness)) +
  509. geom_boxplot(alpha = 0.5, aes(fill = sweetness), outlier.size = 0.6) +
  510. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.95))) +
  511. ylab(y.label)
  512.  
  513.  
  514. ```
  515.  
  516. The trend of sulphates with quality is slight, but does appear to be positive for dry wines and negative for sweet wines.
  517.  
  518. ```{r echo=FALSE, Bivariate_Plots}
  519. y.data = whitewine$pH
  520. y.label = 'pH'
  521.  
  522. ggplot(whitewine, aes( x = quality.factor, y = y.data) ) +
  523. geom_jitter(alpha = 0.4, aes(colour = sweetness)) +
  524. geom_boxplot(alpha = 0.5, aes(fill = sweetness), outlier.size = 0.6) +
  525. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.95))) +
  526. ylab(y.label)
  527.  
  528.  
  529. ```
  530.  
  531. There is little consistent variation in pH with quality for sweet wines, but a very clear positive correlation for dry wines (higher pH means better wine)
  532.  
  533.  
  534.  
  535. ```{r echo=FALSE, Bivariate_Plots}
  536. y.data = whitewine$density
  537. y.label = 'density'
  538. x.data = whitewine$alcohol
  539. x.label = 'alcohol'
  540.  
  541.  
  542. ggplot(whitewine, aes(x = x.data, y = y.data, colour = sweetness)) +
  543. geom_point(alpha = 0.2) +
  544. scale_x_continuous(limits = c(quantile(x.data, prob = 0.01), quantile(x.data, prob = 0.99))) +
  545. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  546. geom_smooth(method = 'loess') +
  547. xlab(x.label) + ylab(y.label)
  548.  
  549. ```
  550.  
  551. Alcohol and density show a clear nagative correlation, and also a clear separation between sweet and dry wines (dry wines generally having lower density for a given alcohol level).
  552.  
  553.  
  554.  
  555. ```{r echo=FALSE, Bivariate_Plots}
  556. y.data = whitewine$density
  557. y.label = 'density'
  558. x.data = whitewine$residual.sugar
  559. x.label = 'residual.sugar'
  560.  
  561.  
  562. ggplot(whitewine, aes(x = x.data, y = y.data, colour = sweetness)) +
  563. geom_point(alpha = 0.2) +
  564. scale_x_continuous(limits = c(quantile(x.data, prob = 0.01), quantile(x.data, prob = 0.99))) +
  565. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  566. geom_smooth(method = 'lm') +
  567. xlab(x.label) + ylab(y.label)
  568.  
  569. ```
  570.  
  571. There is a clear positive correlation between residual.sugar and density for sweet wines, but no indication of any significant correlation for dry wines (albeit with a restricted range of residual.sugar)
  572.  
  573.  
  574. ```{r echo=FALSE, Bivariate_Plots}
  575. y.data = whitewine$density
  576. y.label = 'density'
  577. x.data = whitewine$fixed.acidity
  578. x.label = 'fixed.acidity'
  579.  
  580.  
  581. ggplot(whitewine, aes(x = x.data, y = y.data, colour = sweetness)) +
  582. geom_point(alpha = 0.2) +
  583. scale_x_continuous(limits = c(quantile(x.data, prob = 0.01), quantile(x.data, prob = 0.99))) +
  584. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  585. geom_smooth(method = 'lm') +
  586. xlab(x.label) + ylab(y.label)
  587.  
  588. ```
  589.  
  590. Fixed.acidity shows a slight positive correlation with density, and also a reasonable separation between sweet and dry wines (dry wines are generally lower density for a given fixed.acidity)
  591.  
  592.  
  593. ```{r echo=FALSE, Bivariate_Plots}
  594. y.data = whitewine$alcohol
  595. y.label = 'alcohol'
  596. x.data = whitewine$residual.sugar
  597. x.label = 'residual.sugar'
  598.  
  599.  
  600. ggplot(whitewine, aes(x = x.data, y = y.data, colour = sweetness)) +
  601. geom_point(alpha = 0.2) +
  602. scale_x_continuous(limits = c(quantile(x.data, prob = 0.01), quantile(x.data, prob = 0.99))) +
  603. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  604. geom_smooth(method = 'lm') +
  605. xlab(x.label) + ylab(y.label)
  606.  
  607. ```
  608.  
  609. The trends for residual.sugar vs. alcohol are less clear, but regression lines show different slope directions in sweet vs. dry wines.
  610.  
  611.  
  612.  
  613. ```{r echo=FALSE, Bivariate_Plots}
  614. y.data = whitewine$free.sulfur.dioxide
  615. y.label = 'free.sulfur.dioxide'
  616. x.data = whitewine$total.sulfur.dioxide
  617. x.label = 'total.sulfur.dioxide'
  618.  
  619.  
  620. ggplot(whitewine, aes(x = x.data, y = y.data, colour = sweetness)) +
  621. geom_point(alpha = 0.2) +
  622. scale_x_continuous(limits = c(quantile(x.data, prob = 0.01), quantile(x.data, prob = 0.99))) +
  623. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  624. geom_smooth(method = 'lm') +
  625. xlab(x.label) + ylab(y.label)
  626.  
  627. ```
  628.  
  629. There is some positive correlation between total.sulfur.dioxide and free.sulfur.dioxide, but no clear separation between dry and sweet wines.
  630.  
  631.  
  632.  
  633.  
  634. # Bivariate Analysis
  635.  
  636.  
  637. ### Talk about some of the relationships you observed in this part of the \
  638. investigation. How did the feature(s) of interest vary with other features in \
  639. the dataset?
  640.  
  641. ### Did you observe any interesting relationships between the other features \
  642. (not the main feature(s) of interest)?
  643.  
  644. ### What was the strongest relationship you found?
  645.  
  646.  
  647. # Multivariate Plots Section
  648.  
  649. There are only a few wines with quality scores <4 or >8, and they make insights harder to visualise on some plots, so for this section I will create a new set of 'clipped' data, with quality scores of 3 and 9 removed.
  650. (NOTE there are only 5 wines of quality 9 and 20 of quality 3 out of 4,898 wines in the data)
  651.  
  652. ```{r}
  653.  
  654. whitewine.sweet.clip <- whitewine.sweet[whitewine.sweet$quality.factor != 3
  655. & whitewine.sweet$quality.factor != 9 ,]
  656. whitewine.dry.clip <- whitewine.dry[whitewine.dry$quality.factor != 3
  657. & whitewine.dry$quality.factor != 9 ,]
  658. whitewine.clip <- whitewine[whitewine$quality.factor != 3
  659. & whitewine$quality.factor != 9 ,]
  660.  
  661. ```
  662.  
  663.  
  664.  
  665.  
  666.  
  667. ```{r echo=FALSE, Multivariate_Plots}
  668.  
  669. data = whitewine.dry.clip
  670. x.data = data$alcohol
  671. x.label = 'alcohol'
  672. y.data = data$density
  673. y.label = 'density'
  674. pl.title = 'Dry wines (residual.sugar <3)'
  675.  
  676.  
  677. ggplot(data, aes(x = x.data, y = y.data, colour = quality.factor)) +
  678. geom_point(alpha = 0.9) +
  679. scale_x_continuous(limits = c(quantile(x.data, prob = 0.01), quantile(x.data, prob = 0.99))) +
  680. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  681. xlab(x.label) + ylab(y.label) +
  682. scale_color_brewer(type = 'div') +
  683. theme(legend.position = 'right') +
  684. ggtitle(pl.title)
  685.  
  686. ```
  687.  
  688. TEXT
  689.  
  690.  
  691.  
  692.  
  693. ```{r echo=FALSE, Multivariate_Plots}
  694.  
  695. data = whitewine.sweet.clip
  696. x.data = data$alcohol
  697. x.label = 'alcohol'
  698. y.data = data$density
  699. y.label = 'density'
  700. pl.title = 'Sweet wines (residual.sugar >3)'
  701.  
  702.  
  703. ggplot(data, aes(x = x.data, y = y.data, colour = quality.factor)) +
  704. geom_point(alpha = 0.9) +
  705. scale_x_continuous(limits = c(quantile(x.data, prob = 0.01), quantile(x.data, prob = 0.99))) +
  706. scale_y_continuous(limits = c(quantile(y.data, prob = 0.01), quantile(y.data, prob = 0.99))) +
  707. xlab(x.label) + ylab(y.label) +
  708. scale_color_brewer(type = 'div') +
  709. theme(legend.position = 'right') +
  710. ggtitle(pl.title)
  711.  
  712. ```
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement