---
title: "Assignment 1"
author: "Vincent Viola, Twan Kerkhof, Jaymon Veldkamp"
date: "19 May 2019"
output: html_document
---

```{r setup, include=FALSE}
library(Hmisc)
library(AER)
library(mfx)
library(ggplot2)
data("HMDA")
df <- HMDA
```

## Assignment 1.1(A)

```{r}
summary(df)
```

The `summary()` function in R provides a concise overview of the data. What it returns depends on the type of each variable. For factor variables it returns the count of each level. For a binary factor such as 'deny' (yes/no), you can see that there are 2095 'no' values and 285 'yes' values in the dataset. The same holds for factors with more levels, such as 'chist', which ranges from 1 to 6: there are 1353 observations with the value 1, 441 with the value 2, 126 with the value 3, and so forth.

For numeric variables it returns six statistics: the minimum and the maximum value, the mean (the average of all values), the median (the middle value of the ordered observations, which separates the lower half from the upper half), the 1st quartile (the value that separates the lowest quarter of the observations from the other three quarters) and the 3rd quartile (the value that separates the lower 75% from the upper 25%).
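
To make this output concrete, the sketch below applies `summary()` to a small made-up vector and a made-up factor. The values are assumptions for illustration and are not taken from HMDA; `quantile()` is used to recompute the quartiles directly.

```{r}
# Toy vector, invented for illustration; not part of the HMDA data.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# For a numeric vector, summary() reports min, 1st quartile, median, mean,
# 3rd quartile and max.
summary(x)

# The same quartiles computed directly.
quantile(x, probs = c(0.25, 0.5, 0.75))

# For a factor, summary() returns the count of each level instead.
f <- factor(c("no", "yes", "no", "no"))
summary(f)
```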


## Assignment 1.1(B)
```{r}
# Share of applications that were denied (deny == "yes").
rej_appl <- mean(df$deny == "yes")
print(rej_appl)
```
By this calculation, the probability of being denied a mortgage loan is about 0.12 (285 of the 2380 applications). Of course, one can doubt whether this is close to the real probability for a given applicant, since it is only the overall sample share and ignores applicant characteristics such as the payment-to-income ratio or credit history. To get a more accurate probability, it would be wise to take more variables into account.
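
The share of denied applications can also be recomputed from the counts that `summary(df)` reports for 'deny' (2095 'no' and 285 'yes'). The sketch below does this and attaches a rough standard error; the standard error is an addition for illustration and assumes the applications are independent observations.

```{r}
# Counts of deny as reported by summary(df).
n_no  <- 2095
n_yes <- 285
n     <- n_no + n_yes

# Sample share of denied applications.
p_deny <- n_yes / n
p_deny

# Approximate standard error of a sample proportion
# (assumes independent observations).
se_deny <- sqrt(p_deny * (1 - p_deny) / n)
se_deny
```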


## Assignment 1.1(C)
```{r}
ggplot(df, aes(x=deny, y=pirat, fill=deny)) +
    geom_boxplot() + ggtitle("Boxplot of variable pirat grouped by variable deny")
```
It seems that when 'deny' is 'yes', the median of 'pirat' is a little higher than when 'deny' is 'no'. This is hard to see from this boxplot, however, because an outlier in the dataset stretches the y-axis and compresses the boxes. Overall, the 'yes' group seems to have a slightly higher pirat score than the 'no' group.

## Assignment 1.1(D)
```{r}
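# Keep only observations within 12 standard deviations of the mean of pirat,
# which removes the extreme value seen in the previous plot.
pirat_mean <- mean(df$pirat)
pirat_sd <- sd(df$pirat)
df1 <- df[abs(df$pirat - pirat_mean) < 12 * pirat_sd, ]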

ggplot(df1, aes(x=deny, y=pirat, fill=deny)) +
    geom_boxplot() + ggtitle("Boxplot of variable pirat grouped by variable deny without outlier")
```
Now that the outlier has been removed, the boxplot is much clearer. You can still see, and more clearly now, that applications with the value 'yes' for 'deny' tend to have a higher pirat score than those with the value 'no'. The cleaning process has hardly changed the boxes themselves, because a boxplot is quite robust against outliers: it does not show the mean, but only the median and the quartiles, which a single extreme value barely moves. What changes is mainly the scale of the y-axis.
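
The robustness of the median compared with the mean can be illustrated with a few made-up ratios. The numbers below are assumptions chosen for illustration and are not taken from HMDA.

```{r}
# Made-up ratios for illustration; not taken from HMDA.
x <- c(0.2, 0.3, 0.3, 0.4, 0.5)
x_out <- c(x, 3)  # same values plus one extreme observation

# The mean moves a lot when the extreme value is added.
c(mean = mean(x), mean_with_outlier = mean(x_out))

# The median moves only slightly.
c(median = median(x), median_with_outlier = median(x_out))
```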

## Assignment 1.2(A)

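The model is the linear probability model:
$deny = \beta_0 + \beta_1 \times pirat + u$
The first component of the model, $deny$, is a binary indicator that equals 1 if the mortgage application was denied and 0 otherwise. Because it is binary, its conditional expectation is a probability: $E(deny \mid pirat) = P(deny = 1 \mid pirat) = \beta_0 + \beta_1 \times pirat$, assuming $E(u \mid pirat) = 0$.
The second component, $\beta_0$, is the intercept. This is the predicted probability of denial if $pirat$ were 0.
The third component, $\beta_1$, is the slope coefficient. It gives the change in the predicted probability of denial associated with a one-unit increase in $pirat$.
The last component, $u$, is the error term, which captures all other factors that affect denial.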
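
Since the outcome in this model is coded 0/1 while 'deny' is stored as a factor with the levels 'no' and 'yes', it has to be recoded before estimation. The sketch below shows the recoding on a toy factor invented for illustration.

```{r}
# Toy factor with the same levels as deny; invented for illustration.
f <- factor(c("no", "yes", "no"), levels = c("no", "yes"))

as.numeric(f)           # level codes: 1 for "no", 2 for "yes"
as.numeric(f) - 1       # 0/1 indicator
as.numeric(f == "yes")  # equivalent, and independent of the level order
```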
```{r}
# Convert the deny factor ("no"/"yes") to the numeric values 0 and 1.
# as.numeric() returns the level codes 1 and 2, so we subtract 1.
df$deny2 <- as.numeric(df$deny) - 1
```

```{r}
# Regress the 0/1 indicator on pirat; lm() needs a numeric response, not a factor.
lindeny <- lm(formula = deny2 ~ pirat, data = df)
summary(lindeny)
```
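
To illustrate how the coefficients of such a model are read, the sketch below fits the same kind of regression to simulated data. The variable names (`pirat_sim`, `deny_sim`, `lpm_sim`) and the data-generating values are assumptions made up for this example; they are not estimates from HMDA.

```{r}
set.seed(1)
n <- 500
pirat_sim <- runif(n, min = 0, max = 1)

# Hypothetical data-generating process: the denial probability
# rises linearly with the ratio.
deny_sim <- rbinom(n, size = 1, prob = 0.05 + 0.3 * pirat_sim)

lpm_sim <- lm(deny_sim ~ pirat_sim)
coef(lpm_sim)

# Predicted denial probabilities at two ratios.
pred <- predict(lpm_sim, newdata = data.frame(pirat_sim = c(0.3, 0.4)))
pred

# Their difference equals 0.1 times the slope coefficient.
diff(pred)
```

Note that predictions from a linear probability model are not restricted to the interval from 0 to 1, so for extreme values of the regressor they can fall outside it.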