---
title: "Assignment 1"
author: "Koen Kahlman (2076861), Jacob Sonnenberg (2634644), Britt van Leeuwen (2575802), group 042"
date: "5 February 2020"
output: pdf_document
fontsize: 11pt
highlight: tango
---

```{r settings, echo=FALSE}
options(digits=3)
options(scipen=999)
```

## Exercise 1
**a-c)**
We set `n = 30`, `m = 30`, `mu = 180`, `sd = 5`, and let `nu` range from $175$ to $185$: `nu = seq(175,185,by=0.25)`.

```{r Parameters, echo=FALSE}
n = 30
m = 30
mu = 180
sd = 5
nu = seq(175,185,by=0.25)
```
We calculate the power of the t-test with the following function:

```{r q1 power function}
# Simulate B two-sample t-tests and return the fraction of rejections,
# i.e. an estimate of the power at significance level 0.05
ttest.power = function(n, m, mu, nu, sd, B=1000) {
  p = numeric(B)
  for (b in 1:B) {
    x = rnorm(n, mu, sd)
    y = rnorm(m, nu, sd)
    p[b] = t.test(x, y, var.equal=TRUE)$p.value
  }
  mean(p < 0.05)
}
```

We plot the rejection power of the t-test as a function of `nu`. When `nu = mu = 180`, $H_0$ holds, so the power should be around $0.05$, the significance level of the test (a $5\%$ chance of incorrectly rejecting $H_0$). When `nu` $\neq$ `mu`, the power should grow, approaching $1$ as the means move further apart.
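
Before plotting, a quick sanity check (a sketch, not evaluated here): calling the function at `nu = mu` should return roughly the significance level.

```{r q1 size check, eval=FALSE}
# Under H_0 (nu = mu) the rejection rate estimates the type I error rate, ~0.05
ttest.power(n=30, m=30, mu=180, nu=180, sd=5)
```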

```{r q1abc plots, fig.height=3, fig.width=6}
par(mfrow=c(1,3)) # three plots next to each other
C = length(nu)

power = numeric(C)
for (c in 1:C) {
  power[c] = ttest.power(n, m, mu, nu[c], sd)
}
plot(nu, power, main = "1a: n=m=30, sd=5")

n = 100; m = 100
power = numeric(C)
for (c in 1:C) {
  power[c] = ttest.power(n, m, mu, nu[c], sd)
}
plot(nu, power, main = "1b: n=m=100, sd=5")

n = 30; m = 30; sd = 15
power = numeric(C)
for (c in 1:C) {
  power[c] = ttest.power(n, m, mu, nu[c], sd)
}
# y-axis fixed to [0,1] so this panel is comparable to the other two
plot(nu, power, main = "1c: n=m=30, sd=15", ylim = c(0, 1))
```

**d)**
1b compared to 1a: the sample size is larger, so the rejection power rises more steeply as soon as $H_0$ fails (a sharper curve).

1c compared to 1a: the variance of the samples is larger, so at this sample size the curve becomes very wide: the difference between the means must be very large before the t-test can reliably detect it.
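
These simulated curves can be cross-checked against the exact power of the two-sample t-test using the built-in `power.t.test`; a minimal sketch for the setting of 1a with a true difference in means of $2$:

```{r q1d exact power, eval=FALSE}
# Exact power for n = m = 30, sd = 5 and a true difference of 2,
# comparable to the simulated curve of 1a at nu = 182
power.t.test(n = 30, delta = 2, sd = 5, sig.level = 0.05,
             type = "two.sample")$power
```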

## Exercise 2
We read the data from the `light1879.txt`, `light1882.txt`, and `light.txt` files.
```{r q2 input, echo = FALSE}
light1879 = unlist(read.table("light1879.txt"))

light1882 = na.omit(unlist(read.table("light1882.txt", fill = TRUE))) # uneven rows

light = unlist(read.table("light.txt"))
```

**a)**
Investigating the normality of the three datasets:

```{r q2a plots, fig.height=2.8, fig.width=6, echo = FALSE}
par(mfrow=c(1,2)) # two plots next to each other
hist(light1879); qqnorm(light1879)
hist(light1882); qqnorm(light1882)
hist(light); qqnorm(light)
```

Based on the above plots, we can assume that `light1879` comes from a normal distribution. The `light1882` and `light` datasets, however, do not appear to come from a normal distribution. This is surprising: the speed of light is a physical constant, so we would expect each measurement to be that constant plus some normally distributed noise.

There do appear to be some outliers, which are possibly measurement errors. In the `light1882` dataset there are two unusually high and three unusually low values, and in `light` there are two unusually low (negative) values.

We assume that those data points were measurement errors, remove them from the data, and plot the remaining data again.

```{r q2a removing outliers}
lightfixed = light[light > 0]  # drop the negative measurements
light1882fixed = light1882[light1882 > 650 & light1882 < 1000]  # drop the five extremes
```

```{r q2a no outliers, fig.height=3, fig.width=6, echo = FALSE}
par(mfrow=c(1,2)) # two plots next to each other

hist(light1882fixed); qqnorm(light1882fixed)
hist(lightfixed); qqnorm(lightfixed)
```

Based on these plots, we conclude that `light1882` and `light` also come from a normal distribution. To be sure, we also run the Shapiro-Wilk normality test; p-values above $0.05$ give no reason to reject normality.

```{r shapiro tests}
shapiro.test(light1879)$p.value
shapiro.test(light1882fixed)$p.value
shapiro.test(lightfixed)$p.value
```

**b)**
We have normal data, so we use the normal approximation for the mean.

The $95\%$ confidence interval for the mean is $\bar{X} \pm t_{N-1,\,0.975} \cdot s / \sqrt{N}$, with $t_{N-1,\,0.975} \approx 1.96$ for these sample sizes; we obtain it by running `t.test`.

```{r q2b}
# the 1879 and 1882 data are recorded as km/s minus 299000, so we shift back;
# $conf.int extracts the 95% confidence interval from the t.test output
t.test(light1879)$conf.int + 299000

t.test(light1882fixed)$conf.int + 299000

# light contains deviations (in nanoseconds) from 24.8 microseconds, the time
# for light to travel the 7.442 km of the experiment; convert to km/s
speedlight = 7.442 / ((lightfixed / 1000000000) + 0.0000248)

t.test(speedlight)$conf.int
```
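
As a sanity check, the interval for `light1879` can also be computed by hand; a minimal sketch (assuming, as above, that the values are km/s minus 299000):

```{r q2b manual, eval=FALSE}
# Hand-computed 95% confidence interval for the mean of light1879
x = light1879
se = sd(x) / sqrt(length(x))          # standard error of the mean
t975 = qt(0.975, df = length(x) - 1)  # t quantile, close to 1.96
c(mean(x) - t975 * se, mean(x) + t975 * se) + 299000
```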

**c)**
The speed of light is currently defined as exactly $299792.458\;km/s$, so this value is exact. It is not inside the first confidence interval (those measurements are too high), it is inside the second interval, and it is not inside the third interval (those measurements are too low).
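
This containment can also be checked programmatically; a small sketch for the first interval:

```{r q2c check, eval=FALSE}
truth = 299792.458                         # defined speed of light in km/s
ci = t.test(light1879)$conf.int + 299000   # interval from b)
truth >= ci[1] & truth <= ci[2]            # FALSE per the conclusion above
```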

## Exercise 3
We read the data from `telephone.txt`.

**a)**
We make a histogram and boxplot of the `Bills` data.

```{r q3 input+plots, echo = FALSE, fig.height=3, fig.width=6}
telephone = read.table("telephone.txt", header = TRUE)
par(mfrow=c(1,2)) # two plots next to each other
hist(telephone$Bills); boxplot(telephone$Bills)
```

### Marketing advice

Appeal to the segment of the market paying over 70 euro on their bill: there is a clear segment above this amount, and it represents the majority of the market value.

```{r eval=FALSE}
above70 = telephone$Bills[telephone$Bills > 70]
below70 = telephone$Bills[telephone$Bills <= 70]
length(above70) # => 66, or 33%
length(below70) # => 134, or 67%
sum(above70) / sum(below70) # => ~2.5x
```

### Anomaly in the data
A sizeable proportion of the bills, 26%, is under 10 euro. This could reflect either pay-per-usage or promotional plans. This segment represents less than 3% of the market value.

```{r eval=FALSE}
below10 = telephone$Bills[telephone$Bills < 10]
length(below10)/length(telephone$Bills) # => 0.26
sum(below10)/sum(telephone$Bills) # => 0.02734952
```

**b)**
We run the bootstrap test, with the sample median as test statistic, for lambda in the range $[0.01, 0.1]$.

```{r q3b, fig.height=3, fig.width=6}
lambda = seq(0.01, 0.1, by=0.0005)
L = length(lambda)
B = 1000
t = median(telephone$Bills)  # observed test statistic

pvalues = numeric(L)
for (l in 1:L) {
  currentLambda = lambda[l]
  tstar = numeric(B)
  for (b in 1:B) {
    # surrogate sample under H0: data ~ Exp(currentLambda)
    xstar = rexp(length(telephone$Bills), currentLambda)
    tstar[b] = median(xstar)
  }

  pl = sum(tstar < t) / B
  pr = sum(tstar > t) / B
  pvalues[l] = 2 * min(pl, pr)  # two-sided bootstrap p-value
}

plot(lambda, pvalues, main = "Bootstrap test p-values per lambda", type="l")

lambda[which.max(pvalues)]  # the lambda that is rejected least strongly
```

**c)**
We generate a $95\%$ bootstrap confidence interval for the population median of `telephone`.

```{r q3c}
t = median(telephone$Bills)
B = 1000

Tstar = numeric(B)
for (b in 1:B) {
  # resample with replacement and recompute the test statistic (the median)
  Xstar = sample(telephone$Bills, replace=TRUE)
  Tstar[b] = median(Xstar)
}
Tstar25 = quantile(Tstar, 0.025)
Tstar975 = quantile(Tstar, 0.975)
c(2*t - Tstar975, 2*t - Tstar25)
```
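
The last line uses the reflected bootstrap interval: if the distribution of $T^* - t$ approximates that of $T - \theta$, then
$$
P\left(T^*_{0.025} - t \le T - \theta \le T^*_{0.975} - t\right) \approx 0.95
\quad\Longrightarrow\quad
\theta \in \left[\,2t - T^*_{0.975},\ 2t - T^*_{0.025}\,\right].
$$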

**d)**
We assume that the data is exponentially distributed with parameter lambda. We estimate lambda and construct a $95\%$ confidence interval for the population median.
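
For $X \sim \mathrm{Exp}(\lambda)$ the median $m$ solves $P(X \le m) = 1 - e^{-\lambda m} = \frac{1}{2}$, so
$$ m = \frac{\ln 2}{\lambda} = \ln 2 \cdot E[X], $$
which is why the confidence interval for the mean is multiplied by $\log(2)$ below.
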
```{r q3d}
n = length(telephone$Bills)
se = sd(telephone$Bills) / sqrt(n)  # standard error of the mean
interval_for_mean = c(mean(telephone$Bills) - 1.96 * se,
                      mean(telephone$Bills) + 1.96 * se)

interval_for_median = interval_for_mean * log(2)
print(interval_for_median)

estimate_lambda = 1 / mean(telephone$Bills)  # for Exp(lambda), E[X] = 1/lambda
print(estimate_lambda)
```
If this interval differs noticeably from the bootstrap interval of c), that casts doubt on the exponential assumption.

**e)**
We use the binomial test to test the null hypothesis that the median bill is greater than or equal to 40 euro: under $H_0$, the number of bills above 40 follows a binomial distribution with success probability at least $0.5$, so a low count is evidence against $H_0$.
We use the same test to check whether the fraction of the bills below 10 euro is at most $25\%$, i.e. whether the fraction above 10 euro is at least $75\%$.

```{r q3e}
larger_than_40 = sum(telephone$Bills > 40)

binom.test(larger_than_40, length(telephone$Bills), p = 0.5, alternative = "less")

larger_than_10 = sum(telephone$Bills > 10)

binom.test(larger_than_10, length(telephone$Bills), p = 0.75, alternative = "less")
```
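
For reference, the one-sided p-value of such a test is simply a binomial CDF value; a minimal sketch reproducing the first `binom.test` p-value by hand:

```{r q3e manual, eval=FALSE}
# P(Bin(n, 0.5) <= observed count), the p-value of the first test above
n = length(telephone$Bills)
pbinom(larger_than_40, n, 0.5)
```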

We reject a null hypothesis only if the corresponding p-value is below $0.05$.

## Exercise 4
We read the data from `run.txt`.

```{r q4 input, echo = FALSE, fig.height=3, fig.width=6}
run = read.table("run.txt", header = TRUE)
```

**a)**
We want to test for correlation between the running times before and after drinking, preferably with Pearson's correlation test. We first check the normality of the data, since Pearson's test assumes it. As shown below, the data looks normally distributed, so we perform the Pearson test.

```{r q4a, fig.height=3, fig.width=6}
# check the normality assumption
par(mfrow=c(1,2))
hist(run$before); qqnorm(run$before)
hist(run$after); qqnorm(run$after)

# the default method of cor.test is Pearson, which assumes normality
cor.test(run$before, run$after)
```
Based on the test result we reject the null hypothesis of no correlation and conclude that the run times before and after are correlated: the p-value is $0.0008$, which is lower than $0.05$.

**b)**
We test the null hypothesis that the mean difference between the running times before and after drinking is equal to 0. For this we use the paired t-test, which assumes that the differences are a random sample from a normal population.

```{r q4b}
# split the data by drink
lemo = subset(run, drink == "lemo", select = c(before, after))
energy = subset(run, drink == "energy", select = c(before, after))

# the data is paired: two measurements on the same experimental unit (child)
t.test(lemo$before, lemo$after, paired = TRUE)
t.test(energy$before, energy$after, paired = TRUE)
```

In both cases we cannot reject the null hypothesis, since the p-values are $0.4$ and $0.1$, both higher than $0.05$.

**c)**
We test the null hypothesis that the time difference between the two running tasks is not affected by the type of drink, i.e. that the two groups of differences have equal means. For this we use the two-sample t-test.

```{r q4c}
lemoDiff = lemo$after - lemo$before
energyDiff = energy$after - energy$before
# the differences are still normally distributed,
# and the two samples of differences are independent

t.test(lemoDiff, energyDiff)
```

We cannot reject the hypothesis that there is no difference between the mean running-time differences of the two groups: the p-value is $0.2$, which is higher than $0.05$.

**d)**
- Even if the energy-drink group had shown a significant before/after difference in speed, that alone would not prove that the energy drink was the cause: if the soft drink also led to increased speed, the cause might be something common to both conditions rather than the drink itself. We only tested whether we could reject the possibility that the drinks have no effect on the speed, and we could not.

## Exercise 5
We use the data from `chickwts`.

```{r q5 input, echo = FALSE}
summary(chickwts)
```

**a)**

**b)**

**c)**

**d)**