---
title: "Classification Mini Project: Classification of Chemstest"
output: html_notebook
---
Group Members: \
Rachel G: rachamin12@gmail.com, \
James Hick: redsoxfan765@gmail.com, \
Quientin Morrison: morriq@rpi.edu
### Introduction:
Our earlier research comparing classification methods showed that Fisher Linear Discriminant Analysis (LDA) was superior to the mean method for constructing a classifying hyperplane.
We will now use Fisher LDA to classify the data in chemstest.csv.
Our procedure for creating and testing a hyperplane is quite similar to the one used in our comparative analysis; however, we now use the entirety of chemsrus.csv to create the hyperplane, making it our "training" data. Likewise, chemstest.csv is roughly equivalent to "testing" data, with the exception that we do not know how each sample is classified. As such, we will need to develop a predictive model that can estimate the errors our method will produce.
### Import Libraries:
```{r}
library(readr)
library(MASS)
library(devtools)
library(ggplot2)
```
### Read the Data:
There are two sets of data that we read in:
- chemsdata corresponds to chemsrus.csv and is our "training set", which we use to build a classifier
- targetdata corresponds to chemstest.csv and is our "testing set", which we are seeking to classify
This reads in published URLs from Google Drive, assigning chemsrus.csv to chemsdata and chemstest.csv to targetdata.\
The function suppressMessages() hides the parsing output; chemsdata and targetdata still exist in the coding environment.
```{r}
chemsdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1arHUuWJrVjpZboLOJa97iIbPzCszX8stE-fbYhw2OCA/pub?gid=1533528387&single=true&output=csv")))
targetdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1NtoMaw06IlCDJ3k9Rxtcv01u2wUwdAh8D5rajytiNpQ/pub?gid=899540023&single=true&output=csv")))
```
### Data Parsing
Steps:
1) Fisher LDA will make use of the data in chemsdata and targetdata with the id numbers stripped; we call these data sets known and target respectively. However, we also need to save the id numbers of targetdata for our final result, so we save the first column of targetdata as labels.
```{r}
labels <- targetdata[ ,1]
known <- chemsdata[ , -1]
target <- targetdata[ , -1]
```
2) The last column of known is the class label; we split this off into the column vector knownclass. This vector will be used in analyzing how well Fisher LDA classifies the entirety of chemsrus.csv.
```{r}
knownclass <- known[, ncol(known)]
```
3) Our classification method takes as its input a feature matrix, that is, a data set without id numbers or classifications. As such, we remove the classification column from known and target to get the feature matrices knownmatrix and targetmatrix.
```{r}
knownmatrix <- as.matrix(known[ , 1:(ncol(known)-1)])
targetmatrix <- as.matrix(target[ , 1:(ncol(target)-1)])
```
### LDA
The Fisher LDA command, lda, is a function in the R package MASS that performs a Linear Discriminant Analysis on an input training data set.
- The classification column in our input data is called "class", which appears in the formula given as the first argument.
- The second argument is the input data; this data still contains the original classification column.
- The "prior" option specifies the weighting between classes. We use (1/2,1/2), so the two classes are weighted equally rather than by their sizes.
Given that we will often use the threshold of the hyperplane formed by Fisher LDA, it will be beneficial to wrap the lda function inside another function, called fisher_method, that takes the id-stripped training data as its input.
fisher_method returns a list of information that will be valuable for our analysis:
- "means" is a 2 x k matrix, where k is the number of feature columns in the input data. Each row of means contains the mean of all the data in the input data set associated with a specific class value
- "normal" is a unit normal vector to the separating hyperplane calculated using lda
- "threshold" is the threshold of the separating hyperplane
- "z" is the original output of the lda command
```{r}
fisher_method <- function(trainingdata){
  z <- lda(class ~ ., trainingdata, prior = c(1,1)/2)
  # Calculate the Fisher threshold from the means and the normal vector.
  z_threshold <- ((z$means[1,] + z$means[2,])/2) %*% z$scaling
  return(list("z" = z, "means" = z$means, "normal" = z$scaling, "threshold" = z_threshold))
}
```
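As a quick sanity check, fisher_method can first be exercised on a small synthetic data frame; the toy data below is purely illustrative (not part of the chems data) and assumes MASS has been loaded as above:

```r
# Toy illustration only: two well-separated 2D classes
set.seed(1)
toy <- data.frame(x1 = c(rnorm(5, 0), rnorm(5, 3)),
                  x2 = c(rnorm(5, 0), rnorm(5, 3)),
                  class = rep(c(-1, 1), each = 5))
H_toy <- fisher_method(toy)
H_toy$normal    # 2 x 1 matrix: the normal direction to the separating hyperplane
H_toy$threshold # the scalar threshold
```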
### Classification Method:
Consider the equation of a hyperplane
$$ X \cdot\hat{w} = t $$
where $\hat{w}$ is a unit vector, $X$ is a point, and $t$ is the threshold of the hyperplane. Rewriting this equation as
$$X \cdot\hat{w}-t =0 $$
a point $X$ lies on the positive side of the hyperplane, and is thus designated with class = 1 = biodegradable, if
$$X \cdot\hat{w}-t > 0 $$
A point $X$ lies on the negative side of the hyperplane, and is thus designated with class = -1 = nonbiodegradable, if
$$X \cdot\hat{w}-t < 0$$
So, given the unit vector $\hat{w}$ and the threshold $t$ from a classification method such as the mean method or linear discriminant analysis, it is possible to iterate over the rows of a matrix and, using the above equations, classify each row. It is important to note that the normal vector and threshold result from applying these methods to the training data; the testing data is not used in the construction of the hyperplane.
Steps:
1) Define a function classify that takes as its input a feature matrix, a normal vector, and a threshold
2) Create an empty column vector A that has the same number of rows as the input matrix
3) For each row in the input matrix (without a classification), calculate $\text{row} \cdot \hat{w} - t$ and store this value in A
4) Adjust A such that all the entries are either -1 for values less than 0 or 1 for values greater than 0
```{r}
classify <- function(matrix, normalvector, threshold){
  A <- vector("numeric", nrow(matrix)) # this is what we end up returning
  for (i in 1:nrow(matrix)){ A[i] <- (matrix[i, ] %*% normalvector) - threshold }
  ans <- 2*as.numeric(A > 0) - 1
  return(ans)
}
```
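As a side note, the same rule can be written without an explicit loop, since a single matrix multiplication computes every row's score at once; the sketch below is equivalent to classify above:

```r
# Vectorized equivalent of classify: one matrix product instead of a loop
classify_vectorized <- function(matrix, normalvector, threshold){
  scores <- matrix %*% normalvector - c(threshold) # row-wise X . w-hat - t
  2*as.numeric(scores > 0) - 1                     # map TRUE/FALSE to 1/-1
}
```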
### Testing Accuracy of Fisher LDA on a known data set
Once we have classified the rows of a set of data and obtained a column vector of 1s and -1s, we can test how well this classification matches the original classification. To achieve this, we define a function called accuracy.
accuracy takes the following information as inputs:
- ans: the column vector of 1s and -1s obtained from using classify on a data set
- class: the column vector corresponding to ans with the set of 1s and -1s that was part of the original data before it was stripped. For example, if ans was calculated using knownmatrix, then class should be knownclass
accuracy returns the following outputs:
- "p_as_m" is the percent of biodegradable samples which the model classified as nonbiodegradable
- "m_as_p" is the percent of nonbiodegradable samples which the model classified as biodegradable (this is the more important of the two)
- "total_misclass" is the percentage of total misclassified samples
Steps:
1) Find out which samples were misclassified by subtracting ans from class; call the result cc
2) Count how many of the values in cc are positive; this represents the number of positive samples that were classified as negative
3) Count how many of the values in cc are negative; this represents the number of negative samples that were classified as positive
4) Calculate the percent errors
```{r}
accuracy <- function(ans, class) {
  cc <- class - as.matrix(as.numeric(as.matrix(ans)))
  plusasminus <- sum(cc > 0) # Number of positive samples which the model classified as negative
  minusasplus <- sum(cc < 0) # Number of negative samples which the model classified as positive
  p_as_m <- (plusasminus/nrow(class))*100
  m_as_p <- (minusasplus/nrow(class))*100
  total_misclass <- ((plusasminus+minusasplus)/nrow(class))*100
  return(list("p_as_m" = p_as_m, "m_as_p" = m_as_p, "total_misclass" = total_misclass))
}
```
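Before using accuracy on the real data, it can be checked on small hypothetical vectors (the values below are illustrative only, not drawn from the chems data):

```r
# Hypothetical check: one false negative and one false positive out of 4 samples
toy_ans <- c(1, -1, 1, 1)           # classifier output
toy_class <- matrix(c(1, 1, 1, -1)) # "true" classes as a column vector
accuracy(toy_ans, toy_class)        # p_as_m = 25, m_as_p = 25, total_misclass = 50
```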
### Creating the classifying hyperplane
Here, we use H (for hyperplane) as the variable to store all the information created by running fisher_method() on known.
```{r}
H <- fisher_method(known)
cat("Normal to Fisher LDA Hyperplane: ")
H$normal
cat("Threshold of Fisher LDA Hyperplane:", H$threshold)
```
### Estimating the accuracy of Fisher Linear Discriminant Analysis:
Predicting the misclassification rate for a random data set is difficult, as we have no idea what the "true" classifications of our target set are; this is exactly why we go through the trouble of classifying the data.
However, we can approximate this by creating hyperplanes from a randomized subset of our known data (say 90%) and testing on the remaining subset of the known data (10%), comparing the classifications obtained from our hyperplane with the known classifications of the testing subset. Given these two classifications, we can calculate the percentage of biodegradable points (class = 1) classified as nonbiodegradable (class = -1), the reverse, and the total percentage of misclassified points.
To obtain the best approximation of the errors for a random data set given our known data (assuming both are described by the same features), we simply run the above process an arbitrary number of times, k, and take the average.
It is important to note that the testing data is not used in the creation of the classifying hyperplanes. For all intents and purposes, these testing data sets are "unknown data". As such, while we may know the classification of these "unknown" data points, and use it to calculate the percentage of misclassified points, the errors calculated will be a good indication of how well our classification model will do on any unknown data set.
Our function here makes use of the functions classify, accuracy, and fisher_method.
```{r}
estimate_inaccuracy_test <- function(knowndata, k){
  # Each of these vectors holds the k values generated by the function accuracy
  P_As_M <- vector("numeric", k)
  M_As_P <- vector("numeric", k)
  Total_Misclass <- vector("numeric", k)
  for (i in 1:k){
    ss <- as.integer(.90*nrow(knowndata))
    value <- sample(1:nrow(knowndata), ss)
    knowntrain <- knowndata[value, -1]
    knowntest <- knowndata[-value, -1]
    testclass <- knowntest[, ncol(knowntest)]
    testmatrix <- as.matrix(knowntest[ , 1:(ncol(knowntest)-1)])
    H <- fisher_method(knowntrain)
    H_test_ans <- classify(testmatrix, H$normal, H$threshold)
    H_test_accuracy <- accuracy(H_test_ans, testclass)
    P_As_M[i] <- H_test_accuracy$p_as_m
    M_As_P[i] <- H_test_accuracy$m_as_p
    Total_Misclass[i] <- H_test_accuracy$total_misclass
  }
  ## Now all we need to do is return the average of each of these values
  return(list("P_As_M" = mean(P_As_M), "M_As_P" = mean(M_As_P), "Total_Misclass" = mean(Total_Misclass)))
}
```
We can follow a similar process for the training data by checking the accuracy of the separating hyperplane on the data it was created from. If these error percentages are too low relative to the testing errors, that is an indication that our hyperplane is overfitting the data.
```{r}
estimate_inaccuracy_train <- function(knowndata, k){
  # Each of these vectors holds the k values generated by the function accuracy
  P_As_M <- vector("numeric", k)
  M_As_P <- vector("numeric", k)
  Total_Misclass <- vector("numeric", k)
  for (i in 1:k){
    ss <- as.integer(.90*nrow(knowndata))
    value <- sample(1:nrow(knowndata), ss)
    knowntrain <- knowndata[value, -1]
    trainclass <- knowntrain[, ncol(knowntrain)]
    trainmatrix <- as.matrix(knowntrain[ , 1:(ncol(knowntrain)-1)])
    H <- fisher_method(knowntrain)
    H_train_ans <- classify(trainmatrix, H$normal, H$threshold)
    H_train_accuracy <- accuracy(H_train_ans, trainclass)
    P_As_M[i] <- H_train_accuracy$p_as_m
    M_As_P[i] <- H_train_accuracy$m_as_p
    Total_Misclass[i] <- H_train_accuracy$total_misclass
  }
  ## Now all we need to do is return the average of each of these values
  return(list("P_As_M" = mean(P_As_M), "M_As_P" = mean(M_As_P), "Total_Misclass" = mean(Total_Misclass)))
}
```
Running estimate_inaccuracy_train for a large number of trials (say 400) will give us a good indication of how well the Fisher method predicts the classification of the known data. If the percentage of misclassified points is too low, that is an indication that our hyperplane is overfitting the data.
```{r}
estimate_inaccuracy_train(chemsdata,400)
```
Running estimate_inaccuracy_test for a large number of trials (say 400) will give us a good indication of what the percentage of misclassified points will be for an unknown data set.
```{r}
estimate_inaccuracy_test(chemsdata,400)
```
### Applying Fisher LDA on chemstest.csv
Given a separating hyperplane H, we can classify the chemicals in chemstest.csv and determine whether they are biodegradable (class = 1) or nonbiodegradable (class = -1).
Steps:
1) Apply classify to targetmatrix (and, for comparison in the next step, knownmatrix) to get H_target_ans and H_known_ans
```{r}
H_target_ans <- classify(targetmatrix, H$normal, H$threshold)
H_known_ans <- classify(knownmatrix, H$normal, H$threshold)
```
1.5) Out of curiosity, compare the percentage of biodegradable and nonbiodegradable points with respect to all points for both the known data and the target data.
Reminder: known corresponds to chemsrus.csv and targetdata corresponds to chemstest.csv
- known_biodegradable is the percentage of biodegradable points (class = 1) in the known data
- known_nonbiodegradable is the percentage of nonbiodegradable points (class = -1) in the known data
- target_biodegradable is the percentage of biodegradable points (class = 1) in the target data
- target_nonbiodegradable is the percentage of nonbiodegradable points (class = -1) in the target data
```{r}
known_biodegradable <- (sum(H_known_ans > 0)/nrow(known))*100
known_nonbiodegradable <- (sum(H_known_ans < 0)/nrow(known))*100
target_biodegradable <- (sum(H_target_ans > 0)/nrow(target))*100
target_nonbiodegradable <- (sum(H_target_ans < 0)/nrow(target))*100
```
```{r}
known_biodegradable
known_nonbiodegradable
target_biodegradable
target_nonbiodegradable
```
2) Stitch together H_target_ans with labels to get a matrix with 2 columns. The first column is the id numbers of the chemicals, and the second column is the classification of the chemicals.
To do this, we first rename H_target_ans to prediction and ensure the result is a column vector. Then we apply the function cbind, which takes the columns labels and prediction as input and outputs a matrix of the two.
```{r}
prediction <- as.matrix(H_target_ans)
M <- cbind(labels, prediction)
```
3) Write the matrix M to a csv file called "results.csv" using the write_csv command.
```{r}
write_csv(M, "results.csv")
```