flutedaddyfunk
Classification Mini Project.Rmd
Apr 23rd, 2017
---
title: "Classification Mini Project: Comparison of Methods"
output: html_notebook
---
Group Members: \
Rachel G: rachamin12@gmail.com, \
James Hick: redsoxfan765@gmail.com, \
Quientin Morrison: morriq@rpi.edu
 
### Import Libraries:
```{r}
library(readr)
library(MASS)
library(devtools)
library(ggplot2)
```
### Read the Data:
Our comparison of classification methods will use data for which the classification is already known, chemsrus.csv.
 
This reads in the published URL from Google Drive corresponding to chemsdata <- chemsrus.csv.
The function suppressMessages() hides the parsing output; chemsdata still exists in the coding environment.
 
```{r}
chemsdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1arHUuWJrVjpZboLOJa97iIbPzCszX8stE-fbYhw2OCA/pub?gid=1533528387&single=true&output=csv")))
```
 
### Training and Testing Sets:
 
In order to construct classifiers with the Mean Method and Fisher Linear Discriminant Analysis (LDA), we need data with known classifications so that we can compare the two methods and assess their viability. As such, the data set used will be chemsdata <- chemsrus.csv.
 
The classifying hyperplane will be constructed from a randomized subset containing 90% of the original data in chemsdata; this data set will be called chemstrain. The remaining 10% of the data will be arranged into a data set called chemstest, which will be used to measure how well the classifying hyperplane does its job.
 
Steps:
 
1) Create a vector train that contains a randomized selection of 90% of the numbers ranging from 1 to the number of rows in chemsdata
 
```{r}
ss <- as.integer(.90*nrow(chemsdata)) # as.integer is a necessary addition to ensure ss is an integer
 
set.seed(79586879); train <- sample(1:nrow(chemsdata),ss)
```
 
2) Create chemstrain and chemstest using the vector train. In addition, strip the first column of each, as it is just a sample label
 
```{r}
chemstrain <- chemsdata[train, -1]
chemstest <- chemsdata[-train, -1] # Using -train gives us all rows except the training rows.
```
 
3) The last column of chemstrain and chemstest is the class label; we split this off into the column vectors trainclass and testclass respectively. These vectors will be used in analyzing how effective our methods were in classifying the data
 
```{r}
trainclass <- chemstrain[, ncol(chemstrain)]
testclass <- chemstest[,ncol(chemstest)]
```
 
4) Once we've split off the last columns of chemstrain and chemstest, we can remove this column from these data sets to get the matrices of features, trainmatrix and testmatrix
 
```{r}
trainmatrix <- as.matrix(chemstrain[ ,1:(ncol(chemstrain)-1)]) # parentheses ensure the -1 applies to ncol, not to the whole sequence
testmatrix <- as.matrix(chemstest[ ,1:(ncol(chemstest)-1)])
```
 
### Mean Method
 
Here we construct a function that encapsulates the steps of the Mean Method and returns the components we need:
 
1) Take in an id-stripped training data set and separate it into two matrices, degrade and nondegrade. Once we've separated the matrices, strip the last (class) column from each.
 
2) Find the mean of each of these classes, m_degrade and m_nondegrade
 
3) Subtract one mean from the other to get a vector called m_normal; this is the normal vector of the classifying hyperplane
 
4) Use m_normal to find the threshold, m_threshold
 
Our function will return a list of the following values:
 
- "m_degrade" is the mean of the entries with class = 1 in the input data
 
- "m_nondegrade" is the mean of the entries with class = -1 in the input data
 
- "normal" is the unit normal vector of the separating hyperplane calculated using the mean method
 
- "threshold" is the threshold of the separating hyperplane
 
```{r}
mean_method <- function(training){
 
## Separate the data by class and then strip the classification column.
 
degrade1 <- training[training[ ,ncol(training)]>0, ]
degrade <- degrade1[ ,1:(ncol(training)-1)]
 
nondegrade1 <-training[training[ ,ncol(training)] <0, ]
nondegrade <- nondegrade1[ , 1:(ncol(training)-1)]
 
## Find the mean of both classes
 
m_degrade <- (1/nrow(degrade))*rep(1,nrow(degrade)) %*% as.matrix(degrade)
m_nondegrade <- (1/nrow(nondegrade))*rep(1,nrow(nondegrade)) %*% as.matrix(nondegrade)
difference <- m_degrade-m_nondegrade
 
## Find the normal vector
m_normal <- t(difference) # should be a column vector, so take the transpose
m_normal <- m_normal/(as.numeric(sqrt((difference)%*%t(difference)))) ## Make into a unit vector
 
## Find the threshold
 
m_threshold <- ((m_degrade+m_nondegrade)/2)%*% m_normal
 
return(list("m_degrade" = m_degrade, "m_nondegrade"= m_nondegrade, "normal"=m_normal, "threshold"=m_threshold))
}
```
 
 
### LDA
 
The Fisher LDA command, lda, is a function in R (from the MASS package) that performs a Linear Discriminant Analysis on an input training data set. Its arguments here are:
 
- The classification column in our input data is called "class", which is used in the formula given as the first argument.
 
- The input data set; this data set still has the original classification column.
 
- The "prior" option specifies the weighting between the classes. We use c(1/2,1/2), meaning the two classes are weighted equally rather than by their sizes.
 
Given that we will be using the threshold of the hyperplane formed by Fisher LDA often, it is beneficial to wrap the lda function inside another function, fisher_method, that takes id-stripped data as its input (chemstrain or chemstest).
 
fisher_method returns a list of information that will be valuable for our analysis:
 
- "means" is a 2 x k matrix, where k is the number of feature columns in the input data. Each row of means contains the mean of all the data in the input data set associated with a specific class value
 
- "normal" is a unit normal vector to the separating hyperplane calculated using lda
 
- "threshold" is the threshold of the separating hyperplane
 
- "z" is the original output of the lda command
 
```{r}
fisher_method<- function(trainingdata){
z <- lda(class ~ .,trainingdata,prior=c(1,1)/2)
 
#Calculate the Fisher threshold from the means and the normal vector.
 
z_threshold <- ((z$means[1,] + z$means[2,])/2)%*%z$scaling
 
return(list("z"=z,"means"=z$means, "normal"= z$scaling, "threshold"= z_threshold))
}
```
 
### Classification Method:
 
Given the equation of a hyperplane
$$ X \cdot\hat{w} = t $$
where $\hat{w}$ is a unit vector, $X$ is a point, and $t$ is the threshold of the hyperplane, we can rewrite this equation as
$$X \cdot\hat{w}-t =0 $$
A point $X$ lies on the positive side of the hyperplane, and is thus designated class = 1 = biodegradable, if
$$X \cdot\hat{w}-t > 0 $$
 
A point $X$ lies on the negative side of the hyperplane, and is thus designated class = -1 = nonbiodegradable, if
$$X \cdot\hat{w}-t < 0$$
 
So, given the unit vector $\hat{w}$ and the threshold $t$ from a classification method such as the mean method or linear discriminant analysis, it is possible to iterate over the rows of a matrix and, using the above equations, classify each row. It is important to note that the normal vector and threshold result from applying these methods to the training data; the testing data is not used in the construction of the hyperplane.
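As a quick numeric illustration of this sign rule, consider a made-up 2-D hyperplane and two made-up points (these values are purely hypothetical and are not taken from the chemsrus data):

```r
# Hypothetical 2-D example of the sign rule; w, t, X1 and X2 are invented.
w  <- c(1, 0)    # unit normal vector to the hyperplane
t  <- 0.5        # threshold
X1 <- c(2, 3)    # a point on the positive side
X2 <- c(-1, 3)   # a point on the negative side

score <- function(X) sum(X*w) - t   # computes X . w - t

sign(score(X1))  # 1, so X1 would be classified as biodegradable
sign(score(X2))  # -1, so X2 would be classified as nonbiodegradable
```

This is exactly the quantity the classify function below computes for every row of a feature matrix.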
 
Steps:
1) Define a function classify that takes as its input a feature matrix, a normal vector, and a threshold
 
2) Create an empty column vector A that has the same number of rows as the input matrix
 
3) For each row in the input matrix (without a classification), calculate $row \cdot \hat{w} - t$ and store this value in A
 
4) Adjust A such that all the entries are either -1 for values less than 0 or 1 for values greater than 0
 
```{r}
classify<- function(matrix,normalvector,threshold){
A <- vector("numeric",nrow(matrix)) # this is what we end up returning
 
for (i in 1:nrow(matrix)){A[i]= (matrix[i, ] %*% normalvector) - threshold}
ans <- 2*as.numeric(A >0) -1 # maps positive scores to 1 and nonpositive scores to -1
return(ans)
}
```
 
### Testing Accuracy
 
Once we have classified the rows of a data set and obtained a column vector of 1s and -1s, we can test the accuracy of our methods. To achieve this, we define a function called accuracy.
 
accuracy takes the following inputs:
 
- ans: the column vector of 1s and -1s obtained from using either classify (for the mean method) or predict (for LDA) on a data set
 
- class: the column vector corresponding to ans, with the set of 1s and -1s that was part of the original data before it was stripped. For example, if ans was calculated using chemstrain, then class should be trainclass
 
accuracy returns the following outputs:
 
- "p_as_m" is the percentage of biodegradable samples which the model classified as nonbiodegradable
 
- "m_as_p" is the percentage of nonbiodegradable samples which the model classified as biodegradable (this is the more important of the two)
 
- "total_misclass" is the percentage of total misclassified samples
 
Steps:
 
1) Find out how many samples were misclassified by subtracting ans from class; call the result cc
 
2) Count how many of the values in cc are positive; this is the number of positive samples that were classified as negative
 
3) Count how many of the values in cc are negative; this is the number of negative samples that were classified as positive
 
4) Calculate the percent errors
 
```{r}
accuracy<- function(ans,class) {
cc <- class - as.matrix(as.numeric(as.matrix(ans)))
plusasminus <- sum(cc >0) #Number of positive samples which the model classified as negative
minusasplus <- sum(cc <0) #Number of negative samples which the model classified as positive
 
p_as_m <- plusasminus/nrow(class)
 
m_as_p <- minusasplus/nrow(class)
 
total_misclass <- (plusasminus+minusasplus)/nrow(class)
 
return(list("p_as_m"=p_as_m, "m_as_p"=m_as_p, "total_misclass"= total_misclass))
}
```
 
### Comparison of methods:
 
To reduce confusion:
 
- The letter M will precede any variable associated with the Mean Method
 
- The letter L will precede any variable associated with the Fisher LDA Method
 
For each method:
 
1) Create the classification hyperplane from the training data chemstrain, and print the threshold and normal vector
 
```{r}
M <- mean_method(chemstrain)
 
cat("Normal to Mean Method Hyperplane: ")
M$normal
 
cat("Threshold of Mean Method Hyperplane: ", M$threshold)
 
L <- fisher_method(chemstrain)
cat("Normal to Fisher LDA Hyperplane: ")
L$normal
 
cat("Threshold of Fisher LDA Hyperplane:", L$threshold)
 
```
 
2) Classify the data points in chemstrain using classify and trainmatrix, or predict and chemstrain:
 
```{r}
M_train_ans <- classify(trainmatrix,M$normal,M$threshold)
 
L_train_ans <- classify(trainmatrix,L$normal,L$threshold)
 
```
 
3) Compare the accuracy of both methods on the training data using the accuracy function
 
```{r}
M_train_accuracy <- accuracy(M_train_ans,trainclass)
L_train_accuracy <- accuracy(L_train_ans,trainclass)
```
 
```{r}
M_train_accuracy
```
```{r}
L_train_accuracy
```
 
The accuracy of Fisher LDA is much better than that of the Mean Method.
 
4) Classify the data points in chemstest using classify and testmatrix, or predict and chemstest
 
```{r}
M_test_ans <- classify(testmatrix,M$normal,M$threshold)
 
L_test_ans <- classify(testmatrix,L$normal,L$threshold)
 
## L_test_ans <- predict(L$z,chemstest)$class (leads to the same result as classify, but less intuitive)
```
 
5) Compare the accuracy of both methods on the testing data using the accuracy function
 
```{r}
M_test_accuracy <- accuracy(M_test_ans,testclass)
L_test_accuracy <- accuracy(L_test_ans,testclass)
```
 
```{r}
M_test_accuracy
```
 
```{r}
L_test_accuracy
```
 
Once again, Fisher LDA is much more accurate than the Mean Method.
 
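Beyond the percentages returned by accuracy, the raw misclassification counts can be inspected with R's built-in table(). This is a sketch on made-up predicted and actual class vectors, not the chemsrus results:

```r
# Toy vectors of predicted and actual classes (+1 / -1), for illustration only
predicted <- c( 1,  1, -1, -1,  1, -1)
actual    <- c( 1, -1, -1,  1,  1, -1)

# Rows are the actual class, columns the predicted class.
# The off-diagonal cells are the misclassification counts:
# row 1, column -1 corresponds to p_as_m,
# row -1, column 1 corresponds to m_as_p.
table(actual, predicted)
```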
## Making some histograms:
To better visualize the differences in accuracy from the previous section, we can create a histogram of the data with respect to a threshold. Points to the right of the threshold are those that have been classified as biodegradable, and points to the left of the threshold are those that have been classified as nonbiodegradable. The greater the number of points within a particular region, the taller the histogram is at that point.
 
The functions we use here produce two such histograms:
- The top histogram represents the points that, in the original data, were considered to be nonbiodegradable. If a model perfectly classifies these points, then we should expect no points to the right of the threshold
 
- The bottom histogram represents the points that, in the original data, were considered to be biodegradable. If a model perfectly classifies these points, then we should expect no points to the left of the threshold
 
 
Here histopair is a function that creates a pair of histograms given class projections, a threshold and some bounds, and histoplot is a support function that helps create some of the inputs to histopair.
 
```{r}
#A routine to plot a pair of histograms and a threshold
histopair <-function(pminus,pplus,thresh,yy=c(0,60),label2="Plus",label1="Minus") {
require(ggplot2); require(gridExtra)
hist1 <- ggplot(as.data.frame(pminus), aes(pminus)) + geom_histogram(col="blue",fill="blue")
hist2 <- ggplot(as.data.frame(pplus), aes(pplus)) + geom_histogram(col="red",fill="red")
df <- data.frame(x1=c(thresh,thresh),y1=c(yy[1],yy[2]))
pmin <- min(pminus,pplus)
pmax<- max(pminus,pplus)
me1 <- hist1 + expand_limits(x=c(pmin,pmax)) + geom_line(data=df,aes(x=x1,y=y1)) + xlab(label1)
me2 <- hist2 + expand_limits(x=c(pmin,pmax)) + geom_line(data=df,aes(x=x1,y=y1)) + xlab(label2)
pl <- vector("list",2)
pl[[1]] <- me1; pl[[2]]<-me2;
grid.arrange(grobs=pl,ncol=1) # grid.arrange's argument is ncol, not ncols
}
 
## A function to streamline the histogram analysis
## matrix is either trainmatrix or testmatrix; these are the feature matrices
## normalvector is a normal vector to a classifying hyperplane; either M$normal or L$normal
## threshold is the threshold of a classifying hyperplane; either M$threshold or L$threshold
## classvector is the vector of class labels for the associated data set
 
histoplot<- function(matrix,normalvector,threshold,classvector){
proj <- matrix %*% as.matrix(normalvector)
pplus<- proj[classvector[]>0] #All the class 1 projections
pminus<- proj[classvector[]<0] #All the class -1 projections
histopair(pminus,pplus,threshold,label1="Not Readily Biodegradable",label2="Readily Biodegradable")}
 
```
 
 
Now we'll create some histograms of the training data.
 
Histogram of training data using the Mean Method
```{r}
histoplot(trainmatrix,M$normal,M$threshold,trainclass)
```
 
Histogram of training data using Fisher LDA
```{r}
histoplot(trainmatrix,L$normal,L$threshold,trainclass)
```
 
It's clear that Fisher LDA is the much better method to use.
 
 
Now we'll create some histograms of the testing data. Points to the right of the threshold indicate biodegradable chemicals, and points to the left of the threshold indicate nonbiodegradable chemicals. It should be noted that the counts on these histograms will be much lower than those of the training data, as we are looking at a much smaller sample.
 
 
Histogram of testing data using the Mean Method
```{r}
histoplot(testmatrix,M$normal,M$threshold,testclass)
```
 
Histogram of testing data using Fisher LDA
```{r}
histoplot(testmatrix,L$normal,L$threshold,testclass)
```
 
Once again, it's clear that Fisher LDA is the much better method to use.