Untitled

\documentclass[12pt,a4paper]{report}

\linespread{1.3}

\usepackage{graphicx}
\usepackage{wrapfig}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{booktabs}
\usepackage{float}
\usepackage{apalike}
\usepackage[margin=1.2in]{geometry}
\usepackage{listings}
\usepackage{fancyhdr}

\pagestyle{fancy}
\fancyhf{}
\rhead{Dilan Hayes}
\lhead{Classification Using Logistic Regression And R}
\rfoot{Page \thepage}

\bibliographystyle{apalike}

\author{Dilan Hayes}
\title{Classification Using Logistic Regression And R}

\begin{document}
\maketitle

\tableofcontents
\newpage

\section*{Abstract}
This project involves the study of logistic regression models for classification and applying this in practice by fitting a model to a dataset consisting of anonymized credit card data using the statistical language R. The fitted model was used to classify credit card holders by whether they would default or not. The data was investigated and processed in order for a model to be fit. The data was cleaned of erroneous observations and various techniques were explored to produce the best model from the given data. This involved creating new derived variables from existing ones and dealing with highly correlated predictor variables. Various graphs were generated in order to help understand the processes of logistic regression and to understand the given dataset. A model was fit and the resulting model was interpreted to gain insight into the data and the effect of different predictors on the response. The predictive power of the model was assessed. Finally other techniques for classification were investigated and compared to logistic regression.


\newpage
\chapter{Introduction}
Banks issue thousands of credit cards each year to customers from a wide variety of demographics such as age, sex, marital status etc. Many of these however are eventually defaulted on. If the bank could predict with high accuracy who is likely to default it would be of great value and potentially save the bank large sums of money each year by avoiding issuing credit to people likely to default.

A dataset has been provided consisting of anonymized information from a Taiwanese bank about customers who were issued credit cards. The data includes personal information such as the age of the customer as well as financial information such as historic billing and repayment amounts. The dataset also specifies weather each customer ended up defaulting on their credit card or not. A logistic regression model will be fitted to this data. The model will then be interpreted and the predictive power of the model will be assessed.

Before creating the model the dataset must first be investigated and cleaned of missing and erroneous data. The categorical data must be regrouped where necessary (such as a variable having too many categories or categories with too few observations) and continuous data discretized. After splitting the cleaned data into training and validation sets. The model will then be assessed for its accuracy. The data was provided with a key describing the variables and their values.

The statistical package used for this project was R, using the RStudio IDE. Some of the 3rd party libraries used include ggplot2, ROCR and binr.
\newpage
\section{Overview of data}
The given dataset consists of 10091 observations of 24 variables. Of which 14 are continuous, 9 categorical and a binary response variable. The variables are summarized in table 1.
\begin{table}[H]
\centering
\begin{tabular}{|l l l|}
    \hline
    \textbf{Variable} & \textbf{Description} & \textbf{Type} \\ \hline
    Age & The customers age & Continuous \\ \hline
    Sex & The customers sex & Categorical \\ \hline
    Marriage & The customers marital status & Categorical \\ \hline
    Education & The customers level of education & Categorical \\ \hline
    Credit Limit & The customers credit limit & Continuous \\ \hline
    Payed 1-6 & How late each bill was payed & Categorical \\ \hline
    Bill Amount 1-6 & The total bill each month & Continuous \\ \hline
    Pay Amount 1-6 & The amount payed off on previous months bill & Continuous \\ \hline
    Defaulted & Indicates if customer defaulted & Binary \\ \hline
\end{tabular}
\caption{Overview of variables in dataset}
\label{table:1}
\end{table}

An outline of the dataset was also provided along with the dataset itself. According to this outline, the time frame for the data is over six month beginning in April and ending in September. The variables are numbered in reverse, the variables labeled as month 1 pertain to September and the variables labeled as month 6 pertain to April. The outline also states that for the variables indicating how late a given months bill was payed, -1 means it was payed duly and positive numbers indicate how many months late the payment was made.
For the sex variable it states that 1=Male and 2=Female.
For the marriage variable it states that 1=single, 2=married and 3=divorced
For the education variable it states that higher numbers correspond to a higher level of education.

The data was checked for missing values and was found to contain no missing values. One of the variables, named pay\_0 was found to be named inconstantly. This variable was renamed to pay\_1 to bring it in line with the naming scheme for the other variables.
\newpage
\section{Introduction to credit cards}
As this project involves performing regression analysis on credit card data it is useful to understand how credit cards work and the concepts of credit limit, balance, and defaulting. A credit card effectively allows the card holder to borrow money on a day to day basis which they will eventually have to pay back. Each credit card has a specific credit limit. This is the maximum amount that the customer can spend using the card. How the credit limit is determined will differ between different creditors and is not public information, however it generally depends on the applicants existing credit score and factors such as income \cite{little_2007}.

The balance of a credit card is the total amount owed. Each time the customer makes a purchase using the card the balance will increase. Fees and interest will also add to the balance. The customer will receive a monthly bill consisting of the total amount owed; The minimum amount due; and the date by which the bill must be paid.

The card holder has the option of either paying off the total amount or just paying the minimum amount. Should they pay off the full amount they will not have to pay any interest on the amount borrowed. If they pay just the minimum amount they will have to pay off the remaining balance at a later date with interest. It is also possible to pay more than the total balance. There are various reasons for doing this for example if a card holder wishes to use there card to pay for something that costs more than their credit limit. Missing a payment by failing to pay the minimum amount can result in an increase in the interest rate and be detrimental to the customers credit score.

How the minimum payment amount is calculated varies between institutions however it is usually calculated as some percentage of the balance plus fees. The utilization rate of a card is the ratio of the balance of the card to the credit limit. Utilization rate may effect a card holders credit score. If a card holder does not pay their bill for an extended period of time (the specific period differs from creditor to creditor) the creditor will write of the amount owed as bad debt and will consider the customer to have defaulted.

\chapter{Logistic regression}
Logistic regression is used when fitting a model where the response variable is a binary true or false value and the predictors consist of one or more continuous or categorical variables \cite{crawley_2014}. A simple example of this might be predicting weather or not a person contracts the flu given their age. There is a binary response indicating weather they do or do not contract the flu and a single continuous predictor, their age. A logistic regression model can be used to estimate the probability of an event occurring based on one or more independent predictor variables. This probability can then be used for classification. Logistic regression can also be used to model how changes in the predictors will effect the response.

In logistic regression the response variable is distributed according to a Bernoulli distribution with an unknown $p$ for any given linear combination of the predictor variables. The process of fitting a logistic model involves finding coefficients of the predictor variables similarly to linear regression and then using the logit function to map the linear combination of predictor variables to a value between zero and one which can represent a probability. Unlike linear regression the residuals are not normally distributed and therefore the coefficients can not be calculated using least squares. Instead the maximum likelihood estimator is used. The aim of logistic regression is to estimate an unknown $p$ for any given linear combination of the independent predictor variables.
\newpage
\section{Generalized linear models}
Linear regression relies on various assumptions which are violated when dealing with a binary response variable. Generalized linear models are a class of models that work with a larger class of response types. They can be used when the variance is not constant and the errors are not normally distributed. The generalized linear model can be thought of as a framework consisting of three parts: the error structure, the linear predictor, and the link function \cite{dobson_2008}.

The errors can be distributed according to any member of the exponential family, such as Poisson, binomial or exponential. In the case of logistic regression on a binary response, the error is distributed according to a binomial distribution.

The linear predictor $\eta=X\beta$ is a linear combination of the predictor variables. Where $X$ is a matrix of the predictor variables. And $\beta$ is a vector of the unknown parameters. The predicted value is obtained by transforming the value given by the linear predictor using the link function. In the case of linear regression the link function was the identity function. In logistic regression the link function used is the logit function.
\newpage
\section{The binomial distribution}
With a binary response variable the response can take on only two values, as in the case of the credit card data
\[
y_i=
\begin{cases}
1 & \text{If the customer defaults}\\
0 & \text{Otherwise}
\end{cases}
\]
Each observed $y_i$ is considered to be the outcome of a random variable $Y_i$ where
\[
P(Y_i=1) = \pi_i
\]
\[
P(Y_i=0) = 1-\pi_i
\]
$Y_i$ is therefore distributed according to a Bernoulli distribution parameterized by $\pi_i$.
\[
P(Y_i = y_i) = \pi_i^{y_i}(1-\pi_i)^{1-y_i}
\]
\[
E[Y_i] = \pi_i
\]
\[
Var[Y_i] = \pi_i(1-\pi_i)
\]
This is a special case of the binomial distribution where the number of trials is taken to be one.
\newpage
\section{The logit link function}
In a logistic regression model the predicted value for the response should be a probability. As the linear predictor can take any value we need a way of transforming the output of the linear predictor into the range $[0, 1]$. This function is the logit link function which is defined as follows.

If $p$ is the probability of an event occurring, the odds of the event occurring are defined as

\[
odds = \frac{p}{1-p}
\]

The log odds or logit function is defined by

\[
logit(p) = ln(\frac{p}{1-p})
\]
\[
p\mapsto(-\infty, +\infty)
\]

This function maps a probability $p\in(0, 1)$ onto ${\rm I\!R}$. The inverse of the logit function called the logistic function

\[
logit^{-1}(\alpha ) = logistic(\alpha ) = \frac{1}{1+exp(-\alpha )} = \frac{exp(\alpha )}{exp(\alpha ) + 1}
\]

Maps a value $\alpha\in{\rm I\!R}$ onto $(0, 1)$. So, equating the logit of the probability of the outcome with the linear predictor gives


\begin{equation}
\begin{split}
logit(p_{i}) & = \eta_i \\
& = \beta_{0}+\beta_{1}x_{1,i}...+\beta_{n}x_{n,i}
\end{split}
\end{equation}

and
\[
p_{i} = logit^{-1}(\beta_{0}+\beta_{1}x_{1,i}...+\beta_{n}x_{n,i} )=\frac{1}{1+exp(-(\beta_{0}+\beta_{1}x_{1,i}...+\beta_{n}x_{n,i}) )}
\]
Where $\beta_{n}$ is the $n$th coefficient and $x_{n,i}$ is a specific value for $n$th predictor variable. The logit and inverse logit functions are graphed in figures 1 and 2. The inverse logit function forms a sigmoid curve.
\begin{figure}[H]
    \centering
    \begin{minipage}{0.45\textwidth}
        \centering
        \includegraphics[width=0.9\textwidth]{logit_plot} % first figure itself
        \caption{Plot of the logit function}
    \end{minipage}\hfill
    \begin{minipage}{0.45\textwidth}
        \centering
        \includegraphics[width=0.9\textwidth]{logistic_plot} % second figure itself
        \caption{Plot of the inverse logit function}
    \end{minipage}
\end{figure}
\section{Estimating the model parameters}
The process of fitting the model involves finding point and interval estimates for the model parameters by maximizing the likelihood function.
\subsection{The likelihood function}
Given an assumed model $F$ with parameters $\theta$, The probability of an outcome $O$ is $P(O) = F(O|\theta)$. However in statistics $\theta$ will generally be unknown and only specific sample data will be known. The likelihood function is a function of the model parameters where the observations are fixed. $\mathcal{L}(\theta|O) = P(O|\theta)$. The likelihood function is not a probability density function as its integral over the parameters does not necessarily equal $1$ or the integral may not exist at all.
\subsection{Maximum likelihood estimation}
When fitting a statistical model, the goal is to find the parameters $\theta$ which maximize the likelihood of producing the observed data. The model also assumes that the observations are independent and identically distributed. Given $n$ observations, the probability of making these observations is the joint density function
\[
F(y_1, y_2, ..., y_n|\theta) = \prod_{i=1}^{n} F(y_i|\theta)
\]
The likelihood function for the observed data is therefore also defined by
\[
\mathcal{L}(\theta|y_1, y_2, ..., y_n) = \prod_{i=1}^{n} F(y_i|\theta)
\]
With the distinction that $y_1, y_2, ..., y_n$ are constants and $\mathcal{L}$ is a function of $\theta$. As the logarithm is a monotonically increasing function, the values that maximize a function will also maximize the log of that function. This means that it is more convenient to take the log of the likelihood function and maximize that instead, as multiplicative terms become additive after taking the log of a function. The log likelihood is given by
\[
l(\theta|y_1, y_2, ..., y_n) = \sum_{i=1}^{n}log(F(y_i|\theta))
\]
Substituting in the Bernoulli distribution and link function then gives the log likelihood function for a logistic regression model as
\begin{equation}
\begin{split}
l(\beta) & = \sum_{i=1}^{n}\Big[y_i \eta_i - log[1+exp(\eta_i)]\Big]\\
& = \sum_{i=1}^{n}\Big[y_i (\beta_{0}+\beta_{1}x_{1,i}...+\beta_{n}x_{n,i}) - log[1+exp(\beta_{0}+\beta_{1}x_{1,i}...+\beta_{n}x_{n,i})]\Big]
\end{split}
\end{equation}
To visualize maximum likelihood estimation the log likelihood function for a model containing credit limit as a single predictor was plotted. The x and y axes are $\beta_0$ and $\beta_1$, the coefficients for the intercept and credit limit predictor respectively. The z axis is the log likelihood. The coefficients where calculated using the glm \cite{rbase} function in R and plotted onto the surface. As can be seen the calculated parameter coefficients correspond maxima of the plotted log likelihood function.
\begin{figure}[H]
\centering
\includegraphics[width=0.9\textwidth]{loglikelihoodplot}
\caption{Plot of the log likelihood for a model with the lateness score as a single predictor}
\end{figure}


\subsection{The score function}
To obtain the maximum likelihood estimator for each parameter $\beta_j$ the derivative of the log likelihood is taken with respect to $\beta_j$. This is the scoring function for $\beta_j$ denoted $U_j$ given by
\[
\frac{\partial l}{\partial \beta_j} = U_j = \sum_{i=1}^{n}\Big[y_i x_{j, i} - \frac{exp(\beta_{0}+\beta_{1}x_{1,i}...+\beta_{n}x_{n,i})}{1+exp(\beta_{0}+\beta_{1}x_{1,i}...+\beta_{n}x_{n,i})}\Big]
\]
$U = (U_0, U_1, ..., U_n)$ is the scoring vector for the parameters. The maximum likelihood estimator $\hat{\beta}$ is found by solving $U(\beta)=0$ \cite{dobson_2008}

Unlike with least squares, the log likelihood function does not necessarily have a closed form solution meaning it cannot be solved analytically. Because of this the maximum likelihood estimator is computed numerically using an  iterative method, improving with each iteration until it converges on a satisfactory solution. There are multiple algorithms which can be used to perform this process. Commonly Newton-Raphson, Fisher Scoring, or Iteratively Reweighted Least Squares (the statistical language R by default uses Fisher Scoring). If the algorithm fails to converge it could indicate issues with the data, such as not enough data or that the variables are highly correlated. Highly correlated variables violates the model assumption that the variables are independent.


\subsection{The Newton-Raphson method}
The Newton-Raphson method is an iterative algorithm for approximating the roots of a real valued function. It works by guessing an initial value $x_0$ and approximating the function at this point with a first order Taylor series i.e by the tangent line at that point. The x-intercept will generally be a closer approximation of the root and becomes the next guess $x_{1}$. This process can be iterated until the desired accuracy is acquired \cite{anton_bivens_davis_2009}.

When used for optimization - finding the maxima or minima of a function - the function is approximated by a second order Taylor series around $x_n$. The maxima of this approximation is then obtained and taken to be a closer approximation than the original guess. When $f$ is a function of a single variable the approximation will involve the first and second derivatives. In the case of many variables this generalizes and the approximation will involve the gradient (the vector of all partial derivatives of $f$) and the Hessian (the square matrix of all second-order partial derivatives of $f$). This gives the following relation which is used to maximize the log likely hood \cite{dobson_2008}
\[
\beta_{n+1} = \beta_n - [Hl(\beta_n)]^{-1}\nabla l(\beta_n)
\]
Where $\beta_n$ is a vector of the parameter guesses, $Hl(\beta)$ is the Hessian of the log likelihood function, and $\nabla l(\beta)$ is the gradient of the log likelihood function which is the scoring vector. The negative of the Hessian is also called the observed information index denoted $\mathcal{J}$ \cite{hardin_hilbe_2007}. This gives the the more compact notation
\[
\beta_{n+1} = \beta_n + \mathcal{J}_n^{-1}U_n
\]
To illustrate MLE using Newton-Raphson the algorithm was implemented in R. The code used for this can be found in the appendix. The same model used earlier to visualize the log likelihood function of the late score as a single predictor and default as the response was used. Table 2 shows the iterations of the algorithm as it converges. Figure 2.4 shows the algorithm converging graphically.
\begin{table}[h!]
\centering
\begin{tabular}{|l|l|l|l|l|l|l|}
    \hline
    Iteration      & Initial guess & 1           & 2           & 3           & 4           & 5          \\ \hline
    $\beta_0$             & 0             & -0.208 & -0.142 & -0.108 & -0.105 & -0.105 \\ \hline
    $\beta_1$             & 0             &  0.026  &  0.037  &  0.041  &  0.041  &  0.041  \\ \hline
    log-likelihood & -20.79442     & -10.799   & -10.158   & -10.130   & -10.129   & -10.129  \\ \hline
    \end{tabular}
\caption{Iterations of the Newton-Raphson method maximizing the parameters of a model with lateness score as a single predictor}
\label{table:1}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{nrp}
\caption{Plot of the iterations of the Newton-Raphson method converging on the parameter estimates for maximum likelyhood.}
\end{figure}

\newpage
\section{Visualizing the model}
It is useful to visualize the logistic model graphically in order to better understand the process. Using the given dataset, a pot of credit limit was plotted against the response. A model was then fitted with credit limit as a single predictor and the curve of the fitted logistic function was plotted on the same graph (Fig 3). It can be visualized from this plot how the fitted logistic curve maps the values of the predictor variable to a probability. This was then repeated using two predictors credit limit and age. It can be seen how the logistic function of two variables forms a surface in three dimensions to map values of the two predictors onto a probability (Fig 4). The code used to generate these plots can be found in the appendix.
\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{logistic_graph_2d_2}
\caption{fitted logit function overlaid on scatter plot of single predictor variable}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{logistic_graph_3d}
\caption{fitted logit function for two predictor variables}
\end{figure}
\newpage
\section{Assessing the model fit}
When performing regression analysis we are in a sense attempting to approximate a hypothetical "true" model say $f$ by a model $g$. If $f$ was known, the loss in information or "relative entropy" could be measured using the Kullback-Leibler information \cite{burnham_anderson_2010}
\begin{equation*}
\mathbf{I}(f, g) = \int_{}{} f(x)log\left( \frac{f(x}{g(x|\theta)} \right)dx
\end{equation*}
This is equivalent to
\begin{equation*}
\mathbf{I}(f, g) = \int_{}{} f(x)log(f(x)) dx - \int_{}{}f(x)log(g(x|\theta)) dx
\end{equation*}
The first integral involves only the true model and does not depend on $g$, it is constant for any model $g_i$ being compared. This leads to the result that the relative entropy between two models can be estimated without knowing anything about the true model they are attempting to approximate.

Akaike showed that for large samples, the K-L distance can be estimated by
\[
log(\mathcal{L}(g(\hat{\theta}|x))) - k
\]
Where $\hat{\theta}$ is the estimated model parameters, $x$ is the observed data and $k$ is the number of parameters in the model. Thus giving the Akaike information criterion
\[
AIC = -2log(\mathcal{L}(g(\hat{\theta}|x))) + 2k
\]
This measure can be used to compare the relative quality of models by estimating the relative information lost when using the model to approximate the "true" from which the observed data came \cite{akaike_1974}. The term $-2log(\mathcal{L}(g(\hat{\theta}|x)))$ is also the deviance of the model \cite{crawley_2014}. As the number of parameters increases the AIC, additional parameters are penalized, only being justified in the model if they reduce the deviance by more than the penalty the introduced.
\newpage
\section{Model selection}
Model selection is the process of selecting which terms to include in the model. A saturated model is a model with one parameter for each observation. It is simply interpolating between every data point. A null model is a model with only the intercept which is equal to the mean of the response. Neither of these models are useful for prediction. A maximal model is a model where all predictors which my be of interest are included, including interactions, polynomial terms etc. Many of the terms included in a maximal model will be insignificant \cite{crawley_2014}.

The goal is to select a model using the principle of parsimony which is the simplest model that satisfactorily fits the data. All other things being equal, a model with less rather than more parameters; less rather than more predictors; lower order polynomial terms rather than higher order polynomial terms; and no interactions, is preferable.

To select a desirable model stepwise selection can be used. In backwards stepwise selection the maximal model is fit first. Terms are then removed from the model one by one in order of which give the greatest improvement to the fit criterion when removed. Terms continue to be removed until there are no terms left which can be removed without inuring a significant loss of fit. The fit criterion used is the AIC. This procedure is implemented in R using the "step" function. Forward stepwise selection involves starting with the null model and iteratively adding terms which best improve the fit.
\newpage
\section{Validating the model}
Validating the model involves evaluating the models ability to generalize to independent data that was not used to train the model. The data is split into training and validation sets and the model is fitted using only the training data. The fitted model is then used to make predictions on the validation set. A measure of fit is then used to asses how well the model predicted the response values in comparison to the observed values in the validation set \cite{hardin_hilbe_2007}.

\subsection{Holdout validation}
One common method and possibly the most simple is to split the data into training and validation sets usually at a 70:30 ratio. The model is then fitted on the training set and tested on the validation set. Generally the data should be split at random rather than simply taking the first n rows for the training set and the remaining as the validation set. This is to prevent misfitting the data should the dataset happen to be ordered according to any variables. There are some regression models however where the opposite is true such as time series where the most recent period of the data should be used for validation. If the dataset is small this method can lead to high variability in the results as different data being in each set may affect the measure of fit.

\subsection{K-fold cross validation}
With k-fold cross validation the data is divided into k subsets. For each of the k subsets a model is fitted using the given subset as a validation set and fitting the model using the remaining k-1 subsets. The performance of each of the k models is then averaged to give a more accurate measure of the models performance. With this method each observation is utilized validating the model exactly k times and in fitting the model exactly k-1 times.

\subsection{Leave one out cross validation}
Leave one out cross validation is k-fold cross validation where k is the number over observations in the dataset. For every observation a model is fitted using all but the selected observation and then validated on the removed observation. This method can however lead to high variance in the measure of fit \cite{Kohavi:1995:SCB:1643031.1643047}.

\subsection{ROC curves}
The predicted response of a logistic regression model a probability of a success or failure given specific values for the predictors. In order to perform classification a threshold value must chosen. Predicted response values are then classified by weather they fall above or below the threshold. Different threshold values will affect the sensitivity and specificity of the model i.e the number of false positives or false negatives. It is sometime suggested to set the threshold to be the mean of the response \cite{hardin_hilbe_2007}.

In different situations a false positive may be more detrimental or costly than a false negative or vice versa. In the case of credit card defaulters a bank may be able to estimate how much it will lose from a defaulter and can find a balance between an acceptable rate of false negatives when estimating if a customer will default in order to maximize profit.

A receiver operating characteristic curve or ROC curve, is used to show the performance of a binary classifier at different classification thresholds. An ROC curve plots the true positive rate against the false positive rate at varying thresholds. Categorical predictors will cause the curve to behave like a step function. Figure 6 shows an ROC curve for a model fitted with the single predictor of credit limit and the default variable a binary response. Holdout validation was used with the model being trained on a subset consisting of about $70\%$ of the dataset. The remaining subset was then used to predict the values which generated the shown plot.

\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{limit_bal_roc}
\caption{ROC curve for model with default as the response and credit limit as a single predictor}
\end{figure}

The area under the curve or AUC can be used to asses the predictive power of the model across all threshold values. An AUC of 0.5 is equivalent to random guessing and indicates the model has no predictive power. A perfect model would have an AUC of 1 indicating every prediction is correct for any threshold. The AUC is one type of criterion that can be used to evaluate the predictive power of a model.

\subsection{Misclassified error}
For a specific threshold the misclassified error can be calculated. This is the proportion of predictions which were classified incorrectly. A confusion matrix can also be generated which tallies the total number of true or false positives and negatives. Using the same model from the example above, a threshold of 0.5 was chosen. The misclassified error was found to be 0.217. This translates to $~78.3\%$ of the validation set being classified correctly.

\newpage
\section{Interpreting the model parameters}
Once the model has been fit the parameters can give insight into how the predictors affect the response. Recall the odds of an event with probability p occurring are
\[
\frac{p}{1-p}
\]
and that the probability of the response being a success given specific values for the predictors is
\[
p_i = \frac{exp(\beta_0+\beta_1 x_1+...+\beta_n x_n)}{1 + exp(\beta_0 + \beta_1 x_1 + ... + \beta_n x_n)}
\]
Substituting this in, the odds of the response being a success are given by
\[
odds(Y=1|X_1=x_1, X_2=x_2, ..., X_n=x_n) = e^{\beta_0+\beta_1 x_1 + ... + \beta_n x_n}
\]
Incrementing a predictor $X_i$ by 1 will give
\[
odds(Y=1|X_1=x_1, ..., X_i=(x_i+1), ..., X_n=x_n) = e^{\beta_0 + \beta_1 x_1 + ... + \beta_n x_n}e^{\beta_i}
\]
Taking the ratio of these then gives
\[
\frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_n x_n}e^{\beta_i}}{e^{\beta_0+\beta_1 x_1 + ... + \beta_n x_n}} = e^{\beta_i}
\]
Thus (all other things being equal) a unit increase in the $i$th predictor will increase the odds of the response being a success by a factor of $e^{\beta_i}$ where $\beta_i$ is the corresponding coefficient for the parameter. This property enables an understanding of the model simply by observing the coefficients alone.

\chapter{Predicting credit card defaults using logistic regression}

Before building a model several procedures should be completed involving the dataset. The dataset must be cleaned and formatted correctly. This means checking for missing values, outliers or other errors or aspects of the dataset which may cause issues when fitting a model \cite{kabacoff_2011}.

Missing or invalid values should be removed or recoded. For categorical variables dummy variables must be created. By encoding categorical variables as dummy variables each level is encoded a new binary variable.

\section{Investigating the data}

\subsection{Miscategorized data}
First the categorical variables were examined. Contingency tables were created using the table function in R. The contingency table for the sex variable appeared to show that the data is clean. All customers are specified as belonging to one of the two designated categories and had roughly an equal amount in each as would be expected. The contingency tables for the marriage and education variables are displayed in tables 2 and 3. These tables show that these variables contain categories with very few observations. This would indicate that those observations are potentially erroneous may need to be removed from the data set or recategorized.

\begin{table}[h!]
\centering
\begin{tabular}{|l|l|l|l|l|}
    \hline
    Marital status  & 0  & 1    & 2    & 3  \\ \hline
    Did not default & 17 & 3477 & 4247 & 78 \\ \hline
    Did default     & 4  & 1092 & 1147 & 29 \\ \hline
    \end{tabular}
\caption{Contingency table of customers marital status}
\label{table:1}
\end{table}
\begin{table}[h!]
\centering
 \begin{tabular}{|l|l|l|l|l|l|l|l|}
    \hline
    Education level & 1 & 1    & 2    & 3    & 4  & 5  & 6  \\ \hline
    Did not default & 7 & 2869 & 3529 & 1267 & 45 & 92 & 10 \\ \hline
    Did default     & 0 & 713  & 1126 & 425  & 0  & 4  & 4  \\ \hline
    \end{tabular}
\caption{Contingency table of customers education levels}
\label{table:1}
\end{table}

The remaining categorical variables are the set of variables which indicate how behind a customer is on payments each month. Examining these variables reveals values that were not explained in the description of the data provided such as values of -2 and 0. Bar plots were generated for each of these variables (Figure 5).
\begin{figure}[h]
\centering
\includegraphics[width=1\textwidth]{pay_n_barplot}
\caption{Bar plots of the variables indicating late payment for April and May}
\end{figure}
If the categories do in fact represent how many months overdue a payment is, it would be expected to see observations in both the 1 and 2 categories from month to month. As the two bar plots show these consecutive months do not exhibit this behavior. As well as this there are a large amount of observations in the -1 and -2 categories. To understand this further a new variable was derived by dividing the amount payed into the amount billed. This represents the percentage of the bill each customer paid for the given month. In the case where the bill was zero it was taken that the customer had paid 100\% of their bill. This new variable was then plotted for 1000 observations and colored using the late status variable (Figure 6).
\begin{figure}[H]
\centering
\includegraphics[width=1\textwidth]{pay_5_scatter_plot}
\caption{Scatter plot of \% of bill payed for 1000 customers in May}
\end{figure}
It is immediately apparent from this plot that most of the values labeled as 0 fall somewhere between 1 and 10. Referring back to how credit cards work. This would correspond to customers paying off the minimum amount. There does not seem to be any obvious distinction between the values of -1 and -2, both seem to indicate that the customer paid duly as apart from a small number of exceptions all observations categorized as such sit on or above 100\%. One conclusion that could be drawn is that positive values indicate late payment, negative indicate full or overpayment and zero indicates the customer paid the minimum payment.
\subsection{Outliers}
The continuous variables were checked for outliers using the boxplot function in R \cite{rbase}. The credit limit and age variables were found to contain 54 and 94 outliers respectively. The box plots and histograms for the credit limit with and without outliers is displayed in figure 7. As this is relatively small compared to the number of observations in the data set these observations could potentially be removed to aid in fitting the model. The monthly pay amount and bill amount variables each contained a large number of outliers (800 to 1000). The large amount of outliers for these variables could suggest that the data is multi-modal \cite{crawley_2014}.
\begin{figure}[H]
\centering
\includegraphics[width=1\textwidth]{outlier_check}
\caption{Boxplots and histograms of credit limit before and after outliers have been removed}
\end{figure}
\subsection{Density plots}
For each of the continuous variables a density plot was generated of the distribution split by defaulters and non defaulters. Two of these plots are displayed in figures 8 and 9. These were generated using the geomdensity function in the ggplot2 library. This function computes and plots the kernel density curve over the data \cite{ggplot2}. Outliers were excluded for readability. These plots give an indication of how the variables affect the response. From the limit balance plot it is clear that defaulters will tend to have a lower credit limit than non defaulters. When looking at the plot for age, it seems to indicate that ages of roughly under 25 or over 50 are more likely to default. This makes sense intuitively as middle aged people may tend to be more financially stable. The plots for the monthly amounts payed were similar to the graph for limit balance where non defaulters tended to pay off higher amounts each month. The plots for the monthly bill amounts were less clear as the differences between the two density curves were not as great. For variables such as age, where the relationship is non linear, a polynomial term may have to be introduced or the variable could be discretized.

\begin{figure}[h]
\centering
\includegraphics[width=1\textwidth]{limit_balance_density}
\caption{Density plot of customers credit limits split by defaulters/non defaulters}
\end{figure}
\begin{figure}[H]
    \centering
    \includegraphics[width=1\textwidth]{age_density}
    \caption{Density plot of customers ages split by defaulters/non defaulters}
\end{figure}


\subsection{Correlation}
The continuous variables were tested for correlation. Highly correlated predictor variables can cause problems when fitting a model so it is necessary to check for this beforehand. A plot of the correlation was generated (Figure 10).
\begin{figure}[H]
\centering
\includegraphics[width=0.9\textwidth]{correlation_plot}
\caption{Chart displaying correlation between all continuous variables}
\end{figure}

The cluster of dark blue dots at the intersection of each of the monthly bill amounts indicates that there is a high correlation between all the bill amounts. If a customer has a high bill one month, they will tend to also have a high bill the next. There is also a correlation between the monthly bill amount and the amount a customer pays off on that months bill. The limit balance also appears to be correlated somewhat with all of the variables. The correlations between the bill and pay amounts suggest that they should be consolidated into a smaller number of derived variables as each conveys much the same information.

\subsection{Hypothesis testing}
Each variable was tested for independence with the default variable. The continuous variables were first discretized by binning them into roughly equal categories using functions from the binr library. Contingency tables were then created of each categorical variable and the default variable. The contingency table for the discretized age variable and the default response variable is displayed in table 4.

\begin{table}[h!]
\centering
\begin{tabular}{|l|l|l|l|l|}
    \hline
    Age             & 20s  & 30s  & 40s  & 50+ \\ \hline
    Defaulted       & 2449 & 2989 & 1703 & 678 \\ \hline
    Did not default & 762  & 766  & 500  & 244 \\ \hline
    \end{tabular}
\caption{Contingency table of discretized age and default variables}
\label{table:1}
\end{table}

Chi-square tests for independence where then performed on each contingency table. The results of some of the Chi-Square tests are displayed in table 5.

\begin{table}[h!]
\centering
    \begin{tabular}{|l|l|l|l|}
    \hline
    ~                  & Age     & Marriage & Credit limit \\ \hline
    Cases              & 10091   & 10091    & 10091        \\ \hline
    Factors            & 2       & 2        & 2            \\ \hline
    Chi-Sq             & 20.638  & 10.644   & 327.3        \\ \hline
    Degrees of freedom & 3       & 2        & 5            \\ \hline
    P value            & 0.00013 & 0.0049   & $<$0.00001     \\ \hline
    \end{tabular}
\caption{Results of Chi-Square tests for 3 (discretized) variables}
\label{table:1}
\end{table}

For each Chi-Square test the null hypothesis states that the variables are independent and the alternative hypothesis states that they are not independent. Taking the significance level to 0.05, as is the accepted standard in hypothesis testing, it can be seen that the p-value is less than the chosen significance level for each of these variables and the result is statistically significant. The null hypothesis is therefore rejected and it is concluded that each of these predictors do have an effect on weather a customer will default or not. The rest of the variables produced similar results, all having p values below 0.05.
\newpage
\section{Cleaning and preparing the data}

While the preliminary analysis revealed that the data was largely free of missing values of erroneous data some variables required attention. The 0 and 3 categories of the marriage variable which contained a very small amount of observations were merged into a single new category. To decide weather to keep these observations or drop them from the data set, the valid categories were grouped and a contingency table was constructed as seen in table 6.

\begin{table}[h!]
\centering
\begin{tabular}{|l|l|l|}
    \hline
    ~               & Single or married & Other/unknown \\ \hline
    Did not default & 7724              & 95            \\ \hline
    Did default     & 2239              & 33            \\ \hline
    \end{tabular}
\caption{Contingency table for regrouped marital status and default variables}
\label{table:1}
\end{table}

A Chi-Square test was conducted on this table. The null hypothesis states that the new regrouped marital status and default variables are independent. The alternative hypothesis states that they are not independent. The p-value was calculated to be 0.4331. As this is grater than the significance level $\alpha=0.5$ we fail to reject the null hypothesis. This result implies that the rate of defaulters in the potentially erroneous observations does not differ significantly from the rest of the population and these observations can be dropped from the data set without fear of losing possibly meaningful observations.

The same process was conducted for the education variable however the p-vale was calculated to be 0.0000001128. This is highly significant and so the customers who's education level was recategorized as other/unknown were not dropped from the dataset.

For each of the variables indicating payment status (late, early etc) each month, the values of -2 were corrected to -1. As well as this, after inspecting the bar plots for each of these variables the simplest answer as to the irregular distribution from month to month was that for the months of June, July and August, the values of 1 and 2 were swapped. This was corrected by swapping them back. These corrections produced better results in the model and produced a smoother distribution of the derived variable "lateness score" which will be discussed later.

The three categorical variables, sex, marriage and education were converted to factor variables. By doing this R will encode them as dummy variables when fitting the model. Rather than being included in the model a set of binary predictor variables will be included in their place indicating membership of the respective factor level. The number of dummy variables included for each categorical variable is one less than the number of levels in the variable corresponding to the degrees of freedom of the factor.
\newpage
\section{Creating derived variables}
Several derived variables were created from the existing variables. for each new derived variable a model was fitted using the derived variable as a single predictor to assess the predictive power of the variable.

\subsection{Lateness score}
In the dataset there are 6 variables relating to monthly payment status. (After cleaning and regrouping) They are numeric variables where -1 indicates full payment, 0 indicates minimum payment and positive numbers indicate degrees of lateness. A new variable called lateness score was created by summing the 6 monthly variables together. As they are coded such that lower numbers indicate positive  This new variable gives an indication of how good a given customer is at paying off the balance on their card. A model was fitted using this new variable. It was found to be highly significant. An ROC curve was generated and it was determined to be a satisfactory predictor of the response variable with an AUC of 0.705.

This concept was then taken further. By summing these values, the new variable is essentially assigning a score to each case, where the customer either pays in full, pays the minimum amount or pays late. Where the scores are -1, 0 or a positive number for how late the payment was. However these initial values coded in the data set may not reflect the actual severity of how each case reflects on how likely they are to default. To determine if they should be weighted differently each monthly payment variable was recoded to a factor with 3 levels. As a crude way to quickly determine more appropriate values for each factor level a function was defined such that the input was 3 scalers. The values of the 3 scalers were assigned to the 3 factor levels for each month, the lateness score was calculated, a model was fitted with the new re-weighted lateness score as the single predictor and the output of the function was the AUC of the models ROC plot. 10-fold cross validation was used to reduce the variance of the output. This function was then maximized for the 3 input variables using the optim function in R. The new optimal weights were calculated to be roughly -10.5 for full payment, -8 for minimum payment, and 5.5 for late payment. A model was fitted using this re-weighted lateness score as a single predictor and the AUC was calculated to be ~0.733, a reasonable improvement on the first unweighted lateness score.

\subsection{Payment score}
For each of the monthly billing and payment amounts, the payment amount was divided by the corresponding bill amount to give the proportion of the bill that was payed off. For the case where the bill amount was zero or negative the customer was considered to have payed off all the required amount and the value was set to 1. This incurs some loss of information such how much the customer overpaid by if they overpaid when their bill was zero or negative however still encapsulates cases when the customer overpaid when the bill was a positive amount. As the monthly payment amounts correspond to the previous months bill, the information in the September bill amount is lost as it is left unused.

This results in 5 new variables indicating the proportion of the bill paid off for the months of April, May, June, July and August. As can be seen in figure 11 these new variables are also highly correlated.
\begin{figure}[H]
\centering
\includegraphics[width=0.6\textwidth]{bill_prop_cor}
\caption{Chart displaying correlation between the variables of the proportion bill payed each month}
\end{figure}
The high correlation indicates that they comprise of much the same information and should not all be included in the model. Two methods were tried to reduce these variables to a smaller number of predictors. The first method was simply to take the average of the 5 and call this new variable the payment score. A model was fitted using this as a single predictor and the ROC curve was found to have an AUC of 0.605.

\subsection{Principal component analysis}
The second method tried was to use principle component analysis. Principle component analysis can be used to transform a set of correlated variables into a set of orthogonal uncorrelated variables \cite{hardin_hilbe_2007}. Consider a vector of variables.
\[
\mathbf{X} = (X_1, X_2, ..., X_n)
\]
By taking a linear combination of each $X_i$ we can form a new set of random variables $Y$
\[
\mathbf{Y} =
 \begin{bmatrix}
 Y_{1} \\
 Y_{2} \\
 \vdots \\
 Y_{n}
 \end{bmatrix} =
 \begin{bmatrix}
 a_{11} & a_{12} & \cdots & a_{1n} \\
 a_{21} & a_{22} & \cdots & a_{2n} \\
 \vdots & \vdots & \ddots & \vdots \\
 a_{m1} & a_{m2} & \cdots & a_{mn}
 \end{bmatrix}
 \begin{bmatrix}
 X_{1} \\
 X_{2} \\
 \vdots \\
 X_{n}
 \end{bmatrix}
\]
The first component is found by finding the coefficients which maximize the variance of $Y_1$. The subsequent components are then found similarly by finding the coefficients that maximize their variance with the constraints that the correlation between the new component and all previous components is 0 and that the squares of the coefficients sum to 1. This is equivalent to finding the eigendecomposition of the covariance matrix of $Y$ such that
\[
Var(Y_i) = \lambda_i\
\text{where}\
\lambda_1 \geqslant \lambda_2 \geqslant \cdots \geqslant \lambda_n
\]
And the coefficients for the $i$th component are given by the elements of the corresponding eigenvector. Effectively PCA fits an n-dimensional ellipsoid to the data and then performs a change of basis where the unit vectors corresponding to the ellipsoids axises form the new basis. As PCA involves maximizing the variance the original variables should be scaled so that they have equal variance. Alternatively the correlation matrix can be used instead of the covariance matrix as correlations are in the range of 0 to 1 \cite{rbase}.

All principle components of the 5  variables were fitted and the first two were found to be significant. However they were found to offer no significant improvement over simply taking the average of the 5 variables. It was decided to include the average rather than the principle components as it offers a more meaningful interpretation when included in the model, i.e the coefficient of the average of the payment scores will indicate how paying off a higher proportion of the bill will affect the response.

\subsection{Volatility}
As well as the payment score variable a payment volatility variable was derived. This was simply the variance of the five monthly payment scores. This gives a measure of the volatility or how sporadic a given customer is in paying their bills. One might suppose that a customer who consistently pays off their bill each month may be less likely to default than one who pays varying amounts from month to month. This variable could perhaps offer insight that is lost when averaging over the months as in the payment score.

When fit as a single predictor as was done with the previous derived variables to give an indication of usefulness, the variable had little to no predictive power producing a model with an AUC of roughly 0.5 indicating a model fitted using only the payment volatility measure was no better than random guessing.
\newpage
\section{Fitting and assessing the model}
\subsection{Fitting the model}
A model was fit using the predictors age, sex, marriage, education, credit limit, late score, and payment score. Backwards stepwise selection was then used using AIC as the election criterion to select which variables to include in the model. After performing stepwise selection the age predictor was dropped from the model. The resulting model had an AIC of 9092.88.

\subsection{Interpreting the model}
The summary function in R displays various information about the model including the parameter estimates. They are displayed in table tk below.

\begin{table}[h!]
\centering
 \begin{tabular}{|l|l|l|}
    \hline
    Predictor                      & Coefficient estimate & Significant \\ \hline
    Intercept                      & 0.396                & Yes         \\ \hline
    Sex (0=Male, 1=Female)         & -0.166               & Yes         \\ \hline
    Marriage (0=Single, 1=Married) & -0.206               & Yes         \\ \hline
    Education (Second Level)       & -0.001               & No          \\ \hline
    Education (Third Level)        & -0.06                & No          \\ \hline
    Education (Other)              & -1.393               & Yes         \\ \hline
    Credit Limit                   & -0.001               & Yes         \\ \hline
    Late Score                     & 0.341                & Yes         \\ \hline
    Payment Score                  & -0.462               & Yes         \\ \hline
    \end{tabular}
\caption{Table of parameter estimates for the fitted model}
\label{table:1}
\end{table}

As discussed in section 2.10 the parameter estimates of a logistic model can be used to understand the relationship between the predictors and the response. A unit increase in a predictor will increase the odds of the response being a `success' by a factor of the predictors coefficient. For the fitted model a unit increase in one of the predictors will multiply the odds of a customer defaulting by the corresponding parameter estimate.

Interpreting the results above, all other things being equal, Customers with a higher credit limit are less likely to default and customers with a higher payment score i.e those who on average pay off a higher percent of their bill each month are less likely to default than those who pay off a lower percent of their bill. Both of these support what would be expected intuitively. As well as this the model suggests Females are less likely to default then males.

The fitted model also suggests that those whose education level does not fit into first, second, or third level (perhaps customers with a postgraduate level of education although this was not explicitly stated or particularly evident from the dataset) are much less likely to default than others. Finally the estimate for the late score predictor, the aggregated score of monthly overdue payments, indicates as expected that those who tend to be late on payments are more likely to default than those who tend to pay on time.

\subsection{Validating and evaluating the model}
To evaluate the model k-fold cross validation was used with number of folds set to ten. The models predictive power was evaluated by calculating an average ROC curve and AUC value across all folds. The resulting averaged area under the curve was calculated to be 0.757. The variance of this average value was calculated to be 0.00053 over the 10 AUC values. The low variance implies that this AUC value is reasonably accurate. The ROC curve for one of the folds is shown below (Fig tk).

\begin{figure}[H]
\centering
\includegraphics[width=0.65\textwidth]{fin_roc_plot}
\caption{ROC curve for fitted model}
\end{figure}

\chapter{Other classification techniques}
Logistic regression is a powerful technique for classification however there are many other methods which achieve varying success in different situations. Other popular techniques for classification that could be used for this problem of predicting defaulters and non defaulters include decision trees, random forests, support vector machines and artificial neural networks \cite{kabacoff_2011}. One of these techniques artificial neural networks was explored, An artificial neural network was trained on the dataset and results were compared to those of the logistic regression model.

\section{Artificial neural network}
An artificial neural network is a structure of nodes and pathways between them. The type of neural network often used for classification is a feed forward neural network. A feed forward neural network consists of an input layer, several hidden layers, and an output layer. The input layer consists of nodes corresponding to the predictor variables, each predictor corresponds to one input node. The output layer consists of nodes corresponding to the response variables. In the case of the given dataset there will be a single output node corresponding to the `default' variable. The hidden layers are one or more layers consisting of one or more nodes. There is no generally accepted rule for choosing the number of hidden layers or number of nodes in each hidden layer however there are various guidelines which have been observed to give desirable results. For example it is common to have a single hidden layer containing a number of nodes roughly two thirds the number of input nodes \cite{montavon_orr_musller_2012}.

The network is trained on a training set using a backpropagation algorithm whereby the inputs are propagated through the nodes in the network until reaching the output layer. The weights between each node are iteratively updated so as the output layer moves closer to the known true output values values. A neural network can approximate any function given enough hidden nodes however training can be computationally and time intensive. The `neuralnet' library in R provides functions for creating and training feed forward neural networks \cite{neuralnet}.

Before building a neural network the inputs should be normalized so that they have a mean and variance of approximately 0 and 1 respectively. Highly correlated variables should also be decorrelated. This can be achieved using principal component analysis \cite{montavon_orr_musller_2012}.

A neural network was fit to the given credit card dataset. First the dataset was processed. The highly correlated monthly bill amount variables where decorrelated using PCA. The continuous variables where then normalized to have a mean of 0 and variance of 1 using the `scale' function in R. Dummy variables were created for each level of each categorical variable, this can be done efficiently using the `model.matrix' function in R. The variables indicating how late each payment was were treated as categorical. The resulting dataset after processing contained 42 predictor variables and 1 response variable.

The data was split into training and validation sets and a neural network was fitted using the neuralnet function from the `neuralnet' package in R. The fitting process can be highly computationally intensive and time consuming. To mitigate this the computation was performed using an Amazon elastic cloud compute (ec2) instance. Amazon ec2 provides scalable easily accessible cloud computing. A network was fitted with 42 input nodes, one hidden layer containing 10 nodes and a single output node. The training process took several hours to complete. The validation set was then used to generate an ROC curve and the AUC was calculated to be 0.732. A plot of the fitted network is shown in fig tk.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{nn_10_hl_plot}
\caption{Plot of the trained neural network}
\end{figure}

The performance of the trained neural network was comparable to the fitted regression model and did not offer any major predictive improvements. The time taken to train the network was also far longer and computationally expensive than for fitting the regression model. The predictive ability could potentially be improved by increasing the number of hidden nodes or layers however this is not guaranteed and would incur even more computational effort. A further downside is that a neural network is in a sense a black box. It is practically impossible make any meaningful interpretation of the weights between the nodes and what is going on between the input and output layers. This is unlike logistic regression where the model parameters can be used to understand the relationships between the predictors and the response. The major benefit of this approach is that it required little thought and preparation, satisfactory results were achieved without needing a deep understanding of the data or manually creating derived variables.

\chapter{Conclusion}
The aim of this project was to understand logistic regression and to fit a model to a specific dataset which could be used to classify credit card holders by whether they would eventually default on their credit. The dataset came with sparse documentation and much investigation had to be undergone in order to fully understand some aspects of it. This investigation led to a more thorough understanding of the data and enabled more informed decisions to be made on how to process and transform the data in order to fit better models.

Researching logistic regression made it apparent that the data must be processed in order to avoid issues when fitting the model. Specifically issues with categorical predictors with categories containing too few observations and highly correlated predictor variables became apparent. Some categories were dropped from the dataset entirely and some regrouped. Principle component analysis was researched as a way to mitigate issues with some highly correlated variables.

After aggregating the predictors into a smaller number of cleaned, regrouped and derived variables, a model was fit using stepwise selection. The resulting model contained some predictors relating to the customer themselves such as sex and marital status but the most significant predictors were those pertaining to the customers previous financial history. It is clear that while a persons sex, weather they are married or how educated they are have some effect on determining how likely a person is to default, the strongest indicators are financial ones such as how timely they paid off previous bills.

The fitted model had reasonable predictive power however was not exceptional. Due to the random nature of the dataset involving actions of many different people and the relatively few predictor variables it is impossible to predict the response with a very high accuracy. Perhaps with more numerous and more granular predictor variables relating to a more varied array of attributes a more accurate model could be developed. Many different approaches and techniques still failed to produce a model with an AUC grater than 0.76.

A neural network model was trained in order to investigate if that technique could possible yield better results on the chance that there may be complex interactions which a logistic regression model may fail to account for. While this provided similar classification accuracy to the logistic model it failed to surpass it. As well as this it offers less insight into the relationship between predictor and response.

In conclusion logistic regression is a powerful technique for binary classification and can produce accurate models which also give insight into the data itself. Techniques such as principle component analysis can be used to produce better predictor variables and give better results. At the same time there all real world data contains some element of randomness and no model will ever be perfect. Some datasets will be more noisy than others and some techniques will lend themselves better to different types of data.

\appendix

\chapter{Code}
\section{Loading and cleaning the data}
\begin{lstlisting}[basicstyle=\tiny, language=R]
library(readxl) #load library

data <- read_excel("data.xlsx") #read data into dataframe

#correct column name pay_0 to pay_1
names(data)[names(data) == "pay_0"] <- "pay_1"

data <- data[data$marriage %in% c(1, 2),]

#correct marriage, education, and age categories
data <- within(data, {
  education[!(education %in% c(1, 2, 3))] <- 4
  pay_2[pay_2==2] <-1
  pay_3[pay_3==2] <-1
  pay_4[pay_4==2] <-1
})
\end{lstlisting}
\newpage
\section{Optimizing the values of late payment variables}
\begin{lstlisting}[basicstyle=\tiny, language=R]
#load and clean data

set.seed(6734742)

#we are only concerned with these variables so remove all others
data <- data[c("pay_1", "pay_2", "pay_3", "pay_4", "pay_5", "pay_6", "defaultnm")]

#this function takes a list of 4 weights and a data frame containing variables pay_1 to pay_6
#it then discretizes each of the 6 variables as payed in full, minimum, late or very late
#it then assigns the weights to the their respective cetegories
#next the late score is computed by summing the 6 variables now with the new weights
#a model is then fitted using this new score as a single predictor
#with k fold cross validation to reduce variance
#the mean of the auc for this model is then returned
f <- function(param, d) {
  f <- param[1]
  m <- param[2]
  l <- param[3]
  v <- param[4]
  #assign weights to respective categories
  d <- within(d, {
    pay_1[pay_1%in%c(-1, -2)] <- f
    pay_1[pay_1%in%c(0, 0)] <- m
    pay_1[pay_1%in%c(1)] <- l
    pay_1[pay_1%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- v
    pay_2[pay_2%in%c(-1, -2)] <- f
    pay_2[pay_2%in%c(0, 0)] <- m
    pay_2[pay_2%in%c(1)] <- l
    pay_2[pay_2%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- v
    pay_3[pay_3%in%c(-1, -2)] <- f
    pay_3[pay_3%in%c(0, 0)] <- m
    pay_3[pay_3%in%c(1)] <- l
    pay_3[pay_3%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- v
    pay_4[pay_4%in%c(-1, -2)] <- f
    pay_4[pay_4%in%c(0, 0)] <- m
    pay_4[pay_4%in%c(1)] <- l
    pay_4[pay_4%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- v
    pay_5[pay_5%in%c(-1, -2)] <- f
    pay_5[pay_5%in%c(0, 0)] <- m
    pay_5[pay_5%in%c(1)] <- l
    pay_5[pay_5%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- v
    pay_6[pay_6%in%c(-1, -2)] <- f
    pay_6[pay_6%in%c(0, 0)] <- m
    pay_6[pay_6%in%c(1)] <- l
    pay_6[pay_6%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- v
  })
  #compute late score
  d$late_score <- as.numeric(d$pay_1) + as.numeric(d$pay_2) + as.numeric(d$pay_3) +
  	as.numeric(d$pay_4) + as.numeric(d$pay_5) + as.numeric(d$pay_6)
  #implement k fold cross validation
  k <- 10
  d$split <- sample(1:k, nrow(d), replace = T)
  auc <- lapply(1:k, function(x) {
    d.training <- d[!d$split==x,]
    d.validation <- d[d$split==x,]
    #fit model
    m <- glm(defaultnm~late_score, data = d.training, family = binomial(link = "logit"))
    #predict and calculate auc
    p <- predict(m, newdata=d.validation, type="response")
    pr <- prediction(p, d.validation$defaultnm)
    auc <- performance(pr, measure = "auc")
    auc <- auc@y.values[[1]]
    return(as.numeric(auc))
  })
  #return average auc
  return(c(mean(as.numeric(auc)), var(as.numeric(auc))))
}

#this function can be used to try many different weights and select those which perform the best

#create a dataframe to record output of trials
output <- data.frame(f= numeric(0), m= numeric(0), l = numeric(0), v= numeric(0), auc <- numeric(0))
names(output) <- c("full", "min", "late", "v_late", "auc")

#try many different weights by iterating over different ranges
#the percentage of tested weights is ouputted periodically to indicate how much time remains
#the values used in the following block of code can be changed as required to give more granular results
it <- 0
max <- f(c(0, 0, 0, 0), data)[1]
print(max)
for (i in seq(-1, 0, 0.5)) {
  for (j in seq(-1, 0, 0.5)) {
    print(paste("percent=", ((it/256)*100), "%", sep = "") )
    print(c(i, j))
    for (k in seq(0, 1, 0.5)) {
      for (l in seq(0, 1, 0.5)) {
        it <- it+1
        n <- f(c(i, j, k, l), data)[1]
        output <- rbind(output, c(i, j, k, l, n))
        if (n>max) {
          max <- n
          print(paste("percent=", ((it/256)*100), "%, params=(", i, j, k, l, ")", "auc=", max, sep=" "))
        }
      }
    }
  }
}
#the output can then be inspected or graphed to roughly see which weights work best

#individual weights can be fine tuned as follows
it <- 0
for (i in seq(-1, -0.5, 0.1)) {
  it <- it+1
  n <- f(c(-0.5, i, 0, 1), data)[1]
  output <- rbind(output, c(-1, i, 0, 1, n))
}
\end{lstlisting}
\newpage
\section{Creating derived variables}
\begin{lstlisting}[basicstyle=\tiny, language=R]

data$bill_1 <- ifelse (data$bill_amt6>0 & data$pay_amt5<data$bill_amt6, data$pay_amt5/data$bill_amt6, 1)
data$bill_2 <- ifelse (data$bill_amt5>0 & data$pay_amt4<data$bill_amt5, data$pay_amt4/data$bill_amt5, 1)
data$bill_3 <- ifelse (data$bill_amt4>0 & data$pay_amt3<data$bill_amt4, data$pay_amt3/data$bill_amt4, 1)
data$bill_4 <- ifelse (data$bill_amt3>0 & data$pay_amt2<data$bill_amt3, data$pay_amt2/data$bill_amt3, 1)
data$bill_5 <- ifelse (data$bill_amt2>0 & data$pay_amt1<data$bill_amt2, data$pay_amt1/data$bill_amt2, 1)

bill_comps <- as.data.frame(princomp(data[c("bill_1", "bill_2", "bill_3", "bill_4", "bill_5")])$scores)
data$bill_comp_1 <- bill_comps$Comp.1
data$bill_comp_2 <- bill_comps$Comp.2

data$payment_score <- (data$bill_1+data$bill_2+data$bill_3+data$bill_4+data$bill_5)/5

data$payment_volatility <- apply(data[c("bill_1", "bill_2", "bill_3", "bill_4", "bill_5")], 1, var)

data$sex <- factor(data$sex)
data$education <- factor(data$education)
data$marriage <- factor(data$marriage)

data <- within(data, {
  pay_1[pay_1%in%c(-1, -2)] <- -0.5
  pay_1[pay_1%in%c(0, 0)] <- -0.8
  pay_1[pay_1%in%c(1)] <- 0.5
  pay_1[pay_1%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 1
  pay_2[pay_2%in%c(-1, -2)] <- -0.5
  pay_2[pay_2%in%c(0, 0)] <- -0.8
  pay_2[pay_2%in%c(1)] <- 0.5
  pay_2[pay_2%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 1
  pay_3[pay_3%in%c(-1, -2)] <- -0.5
  pay_3[pay_3%in%c(0, 0)] <- -0.8
  pay_3[pay_3%in%c(1)] <- 0.5
  pay_3[pay_3%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 1
  pay_4[pay_4%in%c(-1, -2)] <- -0.5
  pay_4[pay_4%in%c(0, 0)] <- -0.8
  pay_4[pay_4%in%c(1)] <- 0.5
  pay_4[pay_4%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 1
  pay_5[pay_5%in%c(-1, -2)] <- -0.5
  pay_5[pay_5%in%c(0, 0)] <- -0.8
  pay_5[pay_5%in%c(1)] <- 0.5
  pay_5[pay_5%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 1
  pay_6[pay_6%in%c(-1, -2)] <- -0.5
  pay_6[pay_6%in%c(0, 0)] <- -0.8
  pay_6[pay_6%in%c(1)] <- 0.5
  pay_6[pay_6%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 1
})

data$late_score <- as.numeric(data$pay_1) + as.numeric(data$pay_2) +
	as.numeric(data$pay_3) + as.numeric(data$pay_4) + as.numeric(data$pay_5) + as.numeric(data$pay_6)

\end{lstlisting}
\newpage
\section{Visualizing maximum likelihood estimation}
\begin{lstlisting}[basicstyle=\tiny, language=R]
#load and clean data
#create derived variables

library(boot)
library(ggplot2)
library(mgcv)
library(rgl)
library(Brobdingnag)

data <- as.data.frame(data[1:30,])

ll <- function(b0, b1) {
  t <- b0+(b1*data$late_score)
  sum((data$defaultnm)*(t)-log(1+exp(t)))
}

l <- function(b0, b1) {
  e <- as.brob( b0+(b1*data$late_score) )
  p <- as.brob( 1/(1+exp(-e)))
  l <- as.brob( ( ( (p)^(1-data$defaultnm) ) * ( (1-p) ^ (data$defaultnm) ) ) )
  as.numeric(prod(l))
}

x <- seq(-5, 5, length.out = 50)
y <- seq(-5, 5, length.out = 50)
z <- outer(x, y, Vectorize(ll))
pmat <- persp(x, y, exp(z), phi=45, theta = 45)
fit <- glm(defaultnm~late_score, data = data, family = binomial(link = "logit"))
x_ <- coef(fit)[1]
y_ <- coef(fit)[2]
points(trans3d(x_, y_, ll(x_, y_), pmat=pmat), pch=20, cex=1, col=4)

b_1 <- function(b_0) {
  b_0 <- as.matrix(b_0)
  X <- matrix(c(rep(1, nrow(data)), data$late_score), ncol = 2)
  e <- X%*%(b_0)
  p <- 1/(1+exp(-e))
  J <- matrix(nrow = 2, ncol = 2)
  J[1] <- sum(p*(1-p))
  J[2] <- sum(data$late_score*p*(1-p))
  J[3] <- sum(data$late_score*p*(1-p))
  J[4] <- sum(data$late_score*data$late_score*p*(1-p))
  U <- c(sum(data$defaultnm - p), sum(data$late_score*(data$defaultnm - p)))
  (b_0+(solve(J)%*%U))
}
\end{lstlisting}
\newpage
\section{Visualizing Newton-Rhapson}
\begin{lstlisting}[basicstyle=\tiny, language=R]
#load and clean data
#create derived variables
#this follows on from the likelyhood script

library(ggplot2) #for graphing
library(reshape2) #needed for melt function
library(plotrix) #needed for rescale

#reduce number of observations so that the likelyhoods are smaller
#this will produce more manageble values and nicer graphs
data <- as.data.frame(data[1:30,])

#implement NR using matrices
b_1 <- function(b_0) {
  b_0 <- as.matrix(b_0)
  X <- matrix(c(rep(1, nrow(data)), data$late_score), ncol = 2)
  e <- X%*%(b_0)
  p <- 1/(1+exp(-e))
  J <- matrix(nrow = 2, ncol = 2)
  J[1] <- sum(p*(1-p))
  J[2] <- sum(data$late_score*p*(1-p))
  J[3] <- sum(data$late_score*p*(1-p))
  J[4] <- sum(data$late_score*data$late_score*p*(1-p))
  U <- c(sum(data$defaultnm - p), sum(data$late_score*(data$defaultnm - p)))
  (b_0+(solve(J)%*%U))
}

#calculate each iteration of the model parameters
b0 <- c(0, 0)
b1 <- b_1(b0)
b2 <- b_1(b1)
b3 <- b_1(b2)
b4 <- b_1(b3)
b5 <- b_1(b4)

#format likelyhood as a dataframe
df <- melt(z)
names(df) <- c("x", "y", "z")
df$x <- rescale(df$x, c(-1, 1))
df$y <- rescale(df$y, c(-1, 1))
#plot gradient plot
v <- ggplot(df, aes(x, y, z = z))
v + geom_raster(aes(fill = z))
\end{lstlisting}
\newpage
\section{Fitting and evaluating the model}
\begin{lstlisting}[basicstyle=\tiny, language=R]
set.seed(6734742)

#select model using stepwise aic
model <- glm(defaultnm ~age+sex+marriage+education+limit_bal+late_score+
	payment_score,family=binomial(link='logit'),data=data)

summary(model)
model <- step(model)
summary(model)

#validate model using 10 fold cross validation
k <- 10
data$split <- sample(1:k, nrow(data), replace = T)
auc <- lapply(1:k, function(x) {
  data.training <- data[!data$split==x,]
  data.validation <- data[data$split==x,]
  #fit model
  m <- glm(defaultnm~sex+marriage+education+limit_bal+late_score+payment_score, data =
  	 data.training, family = binomial(link = "logit"))
  #predict and calculate auc
  p <- predict(m, newdata=data.validation, type="response")
  pr <- prediction(p, data.validation$defaultnm)
  prf <- performance(pr, measure = "tpr", x.measure = "fpr")
  plot(prf, xaxs="i", yaxs="i")
  auc <- performance(pr, measure = "auc")
  auc <- auc@y.values[[1]]
  return(as.numeric(auc))
})
auc <- as.numeric(auc)
print(paste('AUC=',mean(auc)))

\end{lstlisting}
\newpage
\section{Processing data for nerual network}
\begin{lstlisting}[basicstyle=\tiny, language=R]
library(readxl)
library(corrplot)
data <- read_excel("dilan.xlsx")
names(data)[names(data) == "pay_0"] <- "pay_1" #correct column name pay_0 to pay_1
#correct marriage, education, and age categories
data <- data[data$marriage %in% c(1, 2),]
data <- within(data, {
  education[!(education %in% c(1, 2, 3))] <- 4
  pay_1[pay_1%in%c(-1,-2)] <- -1
  pay_2[pay_2%in%c(-1,-2)] <- -1
  pay_3[pay_3%in%c(-1,-2)] <- -1
  pay_4[pay_4%in%c(-1,-2)] <- -1
  pay_5[pay_5%in%c(-1,-2)] <- -1
  pay_6[pay_6%in%c(-1,-2)] <- -1
  pay_1[pay_1%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 2
  pay_2[pay_2%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 2
  pay_3[pay_3%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 2
  pay_4[pay_4%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 2
  pay_5[pay_5%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 2
  pay_6[pay_6%in%c(2, 3, 4, 5, 6, 7, 8, 9)] <- 2
})


#standardize vars
data$limit_bal <- scale(data$limit_bal)
data$age <- scale(data$age)
data$pay_amt1 <- scale(data$pay_amt1)
data$pay_amt2 <- scale(data$pay_amt2)
data$pay_amt3 <- scale(data$pay_amt3)
data$pay_amt4 <- scale(data$pay_amt4)
data$pay_amt5 <- scale(data$pay_amt5)
data$pay_amt6 <- scale(data$pay_amt6)

#decorrelate and standardize correlated vars using pca
scores <- princomp(data[, c("bill_amt1", "bill_amt2", "bill_amt3", "bill_amt4", "bill_amt5", "bill_amt6")])$scores
scores <- as.data.frame(scores)
data$bill_amt1 <- scale(scores$Comp.1)
data$bill_amt2 <- scale(scores$Comp.2)
data$bill_amt3 <- scale(scores$Comp.3)
data$bill_amt4 <- scale(scores$Comp.4)
data$bill_amt5 <- scale(scores$Comp.5)
data$bill_amt6 <- scale(scores$Comp.6)

data$sex <- factor(data$sex)
data$marriage <- factor(data$marriage)
data$education <- factor(data$education)
data$pay_1 <- factor(data$pay_1)
data$pay_2 <- factor(data$pay_2)
data$pay_3 <- factor(data$pay_3)
data$pay_4 <- factor(data$pay_4)
data$pay_5 <- factor(data$pay_5)
data$pay_6 <- factor(data$pay_6)

data$sex <- as.data.frame(with(data["sex"], model.matrix(~ sex + 0)))$sex1
data$marriage <- as.data.frame(with(data["marriage"], model.matrix(~ marriage + 0)))$marriage1
data <- cbind(data, with(data["education"], model.matrix(~ education + 0)))
data <- cbind(data, with(data["pay_1"], model.matrix(~ pay_1 + 0)))
data <- cbind(data, with(data["pay_2"], model.matrix(~ pay_2 + 0)))
data <- cbind(data, with(data["pay_3"], model.matrix(~ pay_3 + 0)))
data <- cbind(data, with(data["pay_4"], model.matrix(~ pay_4 + 0)))
data <- cbind(data, with(data["pay_5"], model.matrix(~ pay_5 + 0)))
data <- cbind(data, with(data["pay_6"], model.matrix(~ pay_6 + 0)))

#remove unused variables
data <- within(data, rm(education, pay_6, pay_5, pay_4, pay_3, pay_2, pay_1))

data$sex <- ifelse(data$sex==1, 1, -1)
data$marriage <- ifelse(data$marriage==1, 1, -1)

names(data) <- gsub("-", "m", names(data)) #minus signs in column names cause issues
\end{lstlisting}
\newpage
\section{Train and evaluate neural network}
\begin{lstlisting}[basicstyle=\tiny, language=R]
library(neuralnet)

set.seed(56476857685)

bound <- floor((nrow(data)/4)*3)         #define % of training and test set
data <- data[sample(nrow(data)), ]           #sample rows
train_ <- data[1:bound, ]              #get training set
test_ <- data[(bound+1):nrow(data), ]    #get test set

n <- names(data)
f <- as.formula(paste("defaultnm ~", paste(n[!n %in% "defaultnm"], collapse = " + ")))
nn <- neuralnet(f, data=train_, hidden=10)

vars <- names(data) %in% c("defaultnm")
pr.nn <- compute(nn,test_[!vars])
pr.nn_ <- pr.nn$net.result
fitted.results <- as.numeric(pr.nn_)
fitted.results <- ifelse(fitted.results > mean(data$defaultnm),1,0)
misClasificError <- mean(fitted.results != test_$defaultnm)
print(paste('Accuracy',1-misClasificError))

library(ROCR)
p <- pr.nn_
pr <- prediction(p, test_$defaultnm)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
\end{lstlisting}
\newpage

\bibliography{ref}

\end{document}