---
title: "Assignment 1"
author: "Koen Kahlman (2076861), Jacob Sonnenberg (2634644), Britt van Leeuwen (2575802), group 042"
date: "5 February 2020"
output: pdf_document
fontsize: 11pt
highlight: tango
---

```{r settings, echo=FALSE}
options(digits=3)
options(scipen=999)
```

## Exercise 1
**a-c)**
We set `n = 30`, `m = 30`, `mu = 180`, `sd = 5`, and let `nu` range from $175$ to $185$: `nu = seq(175,185,by=0.25)`.

```{r Parameters, echo=FALSE}
n = 30
m = 30
mu = 180
sd = 5
nu = seq(175,185,by=0.25)
```
We calculate the power of the t-test with the following function:

```{r q1 power function}
# Simulate B two-sample t-tests and return the fraction of rejections,
# i.e. an estimate of the power at significance level 0.05
ttest.power = function(n, m, mu, nu, sd, B=1000) {
  p = numeric(B)
  for (b in 1:B) {
    x = rnorm(n, mu, sd)
    y = rnorm(m, nu, sd)
    p[b] = t.test(x, y, var.equal=TRUE)$p.value
  }
  mean(p < 0.05)
}
```

We plot the rejection power of the t-test as a function of `nu`. When `nu = mu = 180`, $H_0$ holds, so the power should be around $0.05$, the significance level of the test (a $5\%$ chance of incorrectly rejecting $H_0$). When `nu` $\neq$ `mu`, the power should grow, approaching $1$ as the means move further apart.
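
Before plotting, a quick sanity check (a sketch, not evaluated here): calling the function at `nu = mu` should return roughly the significance level.

```{r q1 size check, eval=FALSE}
# Under H_0 (nu = mu) the rejection rate estimates the type I error rate, ~0.05
ttest.power(n=30, m=30, mu=180, nu=180, sd=5)
```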

```{r q1abc plots, fig.height=3, fig.width=6}
par(mfrow=c(1,3)) # three plots next to each other
C = length(nu)

power = numeric(C)
for (c in 1:C) {
  power[c] = ttest.power(n, m, mu, nu[c], sd)
}
plot(nu, power, main = "1a: n=m=30, sd=5")

n = 100; m = 100
power = numeric(C)
for (c in 1:C) {
  power[c] = ttest.power(n, m, mu, nu[c], sd)
}
plot(nu, power, main = "1b: n=m=100, sd=5")

n = 30; m = 30; sd = 15
power = numeric(C)
for (c in 1:C) {
  power[c] = ttest.power(n, m, mu, nu[c], sd)
}
# y-axis fixed to [0,1] so this panel is comparable to the other two
plot(nu, power, main = "1c: n=m=30, sd=15", ylim = c(0, 1))
```

**d)**
1b compared to 1a: the sample size is larger, so the rejection power rises more steeply as soon as $H_0$ fails (a sharper curve).

1c compared to 1a: the variance of the samples is larger, so at this sample size the curve becomes very wide: the difference between the means must be very large before the t-test can reliably detect it.
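
These simulated curves can be cross-checked against the exact power of the two-sample t-test using the built-in `power.t.test`; a minimal sketch for the setting of 1a with a true difference in means of $2$:

```{r q1d exact power, eval=FALSE}
# Exact power for n = m = 30, sd = 5 and a true difference of 2,
# comparable to the simulated curve of 1a at nu = 182
power.t.test(n = 30, delta = 2, sd = 5, sig.level = 0.05,
             type = "two.sample")$power
```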

## Exercise 2
We read the data from the `light1879.txt`, `light1882.txt`, and `light.txt` files.
```{r q2 input, echo = FALSE}
light1879 = unlist(read.table("light1879.txt"))

light1882 = na.omit(unlist(read.table("light1882.txt", fill = TRUE))) # uneven rows

light = unlist(read.table("light.txt"))
```

**a)**
Investigating the normality of the three datasets:

```{r q2a plots, fig.height=2.8, fig.width=6, echo = FALSE}
par(mfrow=c(1,2)) # two plots next to each other
hist(light1879); qqnorm(light1879)
hist(light1882); qqnorm(light1882)
hist(light); qqnorm(light)
```

Based on the above plots, we can assume that `light1879` comes from a normal distribution. The `light1882` and `light` datasets, however, do not appear to come from a normal distribution. This is surprising: the speed of light is a physical constant, so we would expect each measurement to be that constant plus some normally distributed noise.

There do appear to be some outliers, which are possibly measurement errors. In the `light1882` dataset there are two unusually high and three unusually low values, and in `light` there are two unusually low (negative) values.

We assume that those data points were measurement errors, remove them from the data, and plot the remaining data again.

```{r q2a removing outliers}
lightfixed = light[light > 0]  # drop the negative measurements
light1882fixed = light1882[light1882 > 650 & light1882 < 1000]  # drop the five extremes
```

```{r q2a no outliers, fig.height=3, fig.width=6, echo = FALSE}
par(mfrow=c(1,2)) # two plots next to each other

hist(light1882fixed); qqnorm(light1882fixed)
hist(lightfixed); qqnorm(lightfixed)
```

Based on these plots, we conclude that `light1882` and `light` also come from a normal distribution. To be sure, we also run the Shapiro-Wilk normality test; p-values above $0.05$ give no reason to reject normality.

```{r shapiro tests}
shapiro.test(light1879)$p.value
shapiro.test(light1882fixed)$p.value
shapiro.test(lightfixed)$p.value
```

**b)**
We have normal data, so we use the normal approximation for the mean.

The $95\%$ confidence interval for the mean is $\bar{X} \pm t_{N-1,\,0.975} \cdot s / \sqrt{N}$, with $t_{N-1,\,0.975} \approx 1.96$ for these sample sizes; we obtain it by running `t.test`.

```{r q2b}
# the 1879 and 1882 data are recorded as km/s minus 299000, so we shift back;
# $conf.int extracts the 95% confidence interval from the t.test output
t.test(light1879)$conf.int + 299000

t.test(light1882fixed)$conf.int + 299000

# light contains deviations (in nanoseconds) from 24.8 microseconds, the time
# for light to travel the 7.442 km of the experiment; convert to km/s
speedlight = 7.442 / ((lightfixed / 1000000000) + 0.0000248)

t.test(speedlight)$conf.int
```
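
As a sanity check, the interval for `light1879` can also be computed by hand; a minimal sketch (assuming, as above, that the values are km/s minus 299000):

```{r q2b manual, eval=FALSE}
# Hand-computed 95% confidence interval for the mean of light1879
x = light1879
se = sd(x) / sqrt(length(x))          # standard error of the mean
t975 = qt(0.975, df = length(x) - 1)  # t quantile, close to 1.96
c(mean(x) - t975 * se, mean(x) + t975 * se) + 299000
```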

**c)**
The speed of light is currently defined as exactly $299792.458\;km/s$, so this value is exact. It is not inside the first confidence interval (those measurements are too high), it is inside the second interval, and it is not inside the third interval (those measurements are too low).
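
This containment can also be checked programmatically; a small sketch for the first interval:

```{r q2c check, eval=FALSE}
truth = 299792.458                         # defined speed of light in km/s
ci = t.test(light1879)$conf.int + 299000   # interval from b)
truth >= ci[1] & truth <= ci[2]            # FALSE per the conclusion above
```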

## Exercise 3
We read the data from `telephone.txt`.

**a)**
We make a histogram and boxplot of the `Bills` data.

```{r q3 input+plots, echo = FALSE, fig.height=3, fig.width=6}
telephone = read.table("telephone.txt", header = TRUE)
par(mfrow=c(1,2)) # two plots next to each other
hist(telephone$Bills); boxplot(telephone$Bills)
```

### Marketing advice

Appeal to the segment of the market paying over 70 euro on their bill: there is a clear segment above this amount, and it represents the majority of the market value.

```{r eval=FALSE}
above70 = telephone$Bills[telephone$Bills > 70]
below70 = telephone$Bills[telephone$Bills <= 70]
length(above70) # => 66, or 33%
length(below70) # => 134, or 67%
sum(above70) / sum(below70) # => ~2.5x
```

### Anomaly in the data
A sizeable proportion of the bills, 26%, is under 10 euro. This could reflect either pay-per-usage or promotional plans. This segment represents less than 3% of the market value.

```{r eval=FALSE}
below10 = telephone$Bills[telephone$Bills < 10]
length(below10)/length(telephone$Bills) # => 0.26
sum(below10)/sum(telephone$Bills) # => 0.02734952
```

**b)**
We run the bootstrap test, with the sample median as test statistic, for lambda in the range $[0.01, 0.1]$.

```{r q3b, fig.height=3, fig.width=6}
lambda = seq(0.01, 0.1, by=0.0005)
L = length(lambda)
B = 1000
t = median(telephone$Bills)  # observed test statistic

pvalues = numeric(L)
for (l in 1:L) {
  currentLambda = lambda[l]
  tstar = numeric(B)
  for (b in 1:B) {
    # surrogate sample under H0: data ~ Exp(currentLambda)
    xstar = rexp(length(telephone$Bills), currentLambda)
    tstar[b] = median(xstar)
  }

  pl = sum(tstar < t) / B
  pr = sum(tstar > t) / B
  pvalues[l] = 2 * min(pl, pr)  # two-sided bootstrap p-value
}

plot(lambda, pvalues, main = "Bootstrap test p-values per lambda", type="l")

lambda[which.max(pvalues)]  # the lambda that is rejected least strongly
```

**c)**
We generate a $95\%$ bootstrap confidence interval for the population median of `telephone`.

```{r q3c}
t = median(telephone$Bills)
B = 1000

Tstar = numeric(B)
for (b in 1:B) {
  # resample with replacement and recompute the test statistic (the median)
  Xstar = sample(telephone$Bills, replace=TRUE)
  Tstar[b] = median(Xstar)
}
Tstar25 = quantile(Tstar, 0.025)
Tstar975 = quantile(Tstar, 0.975)
c(2*t - Tstar975, 2*t - Tstar25)
```
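
The last line uses the reflected bootstrap interval: if the distribution of $T^* - t$ approximates that of $T - \theta$, then
$$
P\left(T^*_{0.025} - t \le T - \theta \le T^*_{0.975} - t\right) \approx 0.95
\quad\Longrightarrow\quad
\theta \in \left[\,2t - T^*_{0.975},\ 2t - T^*_{0.025}\,\right].
$$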

**d)**
We assume that the data is exponentially distributed with parameter lambda. We estimate lambda and construct a $95\%$ confidence interval for the population median.
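
For $X \sim \mathrm{Exp}(\lambda)$ the median $m$ solves $P(X \le m) = 1 - e^{-\lambda m} = \frac{1}{2}$, so
$$ m = \frac{\ln 2}{\lambda} = \ln 2 \cdot E[X], $$
which is why the confidence interval for the mean is multiplied by $\log(2)$ below.
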
```{r q3d}
n = length(telephone$Bills)
se = sd(telephone$Bills) / sqrt(n)  # standard error of the mean
interval_for_mean = c(mean(telephone$Bills) - 1.96 * se,
                      mean(telephone$Bills) + 1.96 * se)

interval_for_median = interval_for_mean * log(2)
print(interval_for_median)

estimate_lambda = 1 / mean(telephone$Bills)  # for Exp(lambda), E[X] = 1/lambda
print(estimate_lambda)
```
If this interval differs noticeably from the bootstrap interval of c), that casts doubt on the exponential assumption.

**e)**
We use the binomial test to test the null hypothesis that the median bill is greater than or equal to 40 euro: under $H_0$, the number of bills above 40 follows a binomial distribution with success probability at least $0.5$, so a low count is evidence against $H_0$.
We use the same test to check whether the fraction of the bills below 10 euro is at most $25\%$, i.e. whether the fraction above 10 euro is at least $75\%$.

```{r q3e}
larger_than_40 = sum(telephone$Bills > 40)

binom.test(larger_than_40, length(telephone$Bills), p = 0.5, alternative = "less")

larger_than_10 = sum(telephone$Bills > 10)

binom.test(larger_than_10, length(telephone$Bills), p = 0.75, alternative = "less")
```
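
For reference, the one-sided p-value of such a test is simply a binomial CDF value; a minimal sketch reproducing the first `binom.test` p-value by hand:

```{r q3e manual, eval=FALSE}
# P(Bin(n, 0.5) <= observed count), the p-value of the first test above
n = length(telephone$Bills)
pbinom(larger_than_40, n, 0.5)
```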

We reject a null hypothesis only if the corresponding p-value is below $0.05$.

## Exercise 4
We read the data from `run.txt`.

```{r q4 input, echo = FALSE, fig.height=3, fig.width=6}
run = read.table("run.txt", header = TRUE)
```

**a)**
We want to test for correlation between the running times before and after drinking, preferably with Pearson's correlation test. We first check the normality of the data, since Pearson's test assumes it. As shown below, the data looks normally distributed, so we perform the Pearson test.

```{r q4a, fig.height=3, fig.width=6}
# check the normality assumption
par(mfrow=c(1,2))
hist(run$before); qqnorm(run$before)
hist(run$after); qqnorm(run$after)

# the default method of cor.test is Pearson, which assumes normality
cor.test(run$before, run$after)
```
Based on the test result we reject the null hypothesis of no correlation and conclude that the run times before and after are correlated: the p-value is $0.0008$, which is lower than $0.05$.

**b)**
We test the null hypothesis that the mean difference between the running times before and after drinking is equal to 0. For this we use the paired t-test, which assumes that the differences are a random sample from a normal population.

```{r q4b}
# split the data by drink
lemo = subset(run, drink == "lemo", select = c(before, after))
energy = subset(run, drink == "energy", select = c(before, after))

# the data is paired: two measurements on the same experimental unit (child)
t.test(lemo$before, lemo$after, paired = TRUE)
t.test(energy$before, energy$after, paired = TRUE)
```

In both cases we cannot reject the null hypothesis, since the p-values are $0.4$ and $0.1$, both higher than $0.05$.

**c)**
We test the null hypothesis that the time difference between the two running tasks is not affected by the type of drink, i.e. that the two groups of differences have equal means. For this we use the two-sample t-test.

```{r q4c}
lemoDiff = lemo$after - lemo$before
energyDiff = energy$after - energy$before
# the differences are still normally distributed,
# and the two samples of differences are independent

t.test(lemoDiff, energyDiff)
```

We cannot reject the hypothesis that there is no difference between the mean running-time differences of the two groups: the p-value is $0.2$, which is higher than $0.05$.

**d)**
- Even if the energy-drink group had shown a significant before/after difference in speed, that alone would not prove that the energy drink was the cause: if the soft drink also led to increased speed, the cause might be something common to both conditions rather than the drink itself. We only tested whether we could reject the possibility that the drinks have no effect on the speed, and we could not.

## Exercise 5
We use the data from `chickwts`.

```{r q5 input, echo = FALSE}
summary(chickwts)
```

**a)**

**b)**

**c)**

**d)**