Final Classification.Rmd

---
title: "Classification Mini Project: Classification of Chemstest"
output: html_notebook

Group Members: \
Rachel G: rachamin12@gmail.com,\
James Hick: redsoxfan765@gmail.com, \
Quientin Morrison: morriq@rpi.edu
---

### Introduction:
Our research comparing classification methods clearly showed that Fisher Linear Discrimminant Analysis was superior to the Mean Mean method in the creation of a classifying hyperplane.

We will now use Fisher LDA in the classification of the data in chemstest.csv.

Our procedure for creating and testing a hyperplane will be quite similar to the method we used in our comparitive analysis, however, we will now use the entirety of chemsrus.csv as a resource to create a hyperplane; this is our "training" data. Likewise, chemstest.csv is roughly equivalent to "testing" data, with the exception that we do not know what each sample is classified as. As such, we will need to develop a predictive model that can esitimate the errors our method will produce.

### Import Libraries:
```{r}
library(readr)
library(MASS)
library(devtools)
library(ggplot2)
```

### Read the Data:
There are two sets of data that we read in:
- chemsdata corresponds to chemsrus.csv and is our "training set", what we use to make a classifier

- targetdata corresponds to chemstest.csv and is our "testing set", what we are seeking to classify

This reads in published urls from google drive cooresponding to chemsdata <- chemsrus.csv and targetdata <- chemstest.csv\
The function suppressMessages() hides the parsing output; chemsdata and targetdata still exist in the coding enviornment

```{r}
chemsdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1arHUuWJrVjpZboLOJa97iIbPzCszX8stE-fbYhw2OCA/pub?gid=1533528387&single=true&output=csv")))

targetdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1NtoMaw06IlCDJ3k9Rxtcv01u2wUwdAh8D5rajytiNpQ/pub?gid=899540023&single=true&output=csv")))
```

### Data Parsing

Steps:

1) Fisher LDA  will make use of the data in chemsdata and target data for which the id numbers have been stripped. We call these data sets known and target respectivly. However, we also need to save the id numbers of targetdata for our final result, so we save the first column of targetdata as labels

```{r}
labels <- targetdata[ ,1]

known <- chemsdata[ , -1]

target <- targetdata[ , -1]
```

2) The last column of train is the class label; we split this off into the column vector knownclass. This vector will be used in anaylzing how well Fisher LDA does on the entriety of chemsrus.csv in terms of classification errors

```{r}
knownclass <- known[, ncol(known)]
```

3) Our classification method takes as it's input a feature matrix; that is a data set without id numbers or classification. As such, we remove the classification column from known and target to get feture matrixies, knownmatrix and targetmatrix

```{r}
knownmatrix <- as.matrix(known[ ,c(1:ncol(known)-1)])
targetmatrix <- as.matrix(target[ ,c(1:ncol(target)-1)])
```

### LDA

The Fisher LDA command, lda, is a modual in r that performs a Linear Discriminant Analysis on an input training data set.

- The classification column in our input matrix is called "class" which is used in the first argument.

- The input matrix; this matrix will still have the original classification vector

- The "prior" option specifies weighting between classes. This uses (1/2,1/2) saying they are weighted only by size.

Given that we will be using the threshold to the hyperplane formed by Fisher LDA often, it will be beneficial to encompass the lda function inside another function called fisher_method that takes the id-stripped traingdata as its input.

fisher_method returns a list of information that will be valuable for our analysis:

- "means" is a 2 x k matrix, where k is the number of columns in the input matrix. Each row in means contains the mean of the all the data in the input data set associated with a specific class value

- "normal" is a unit normal vector to the seperating hyperplane calculated using lda;

- "threshold" is the threshold of the seperating hyperplane

- "z"" is the original output of the lda command

```{r}
fisher_method<- function(trainingdata){
  z <- lda(class ~ .,trainingdata,prior=c(1,1)/2)

  #Calculate the Fisher threshold from the means and the normal vector.

  z_threshold <- ((z$means[1,] + z$means[2,])/2)%*%z$scaling

  return(list("z"=z,"means"=z$means, "normal"= z$scaling, "threshold"= z_threshold))
  }
```

### Classification Method:

Given the equation of a hyperplane
$$ X \cdot\hat{w} = t $$
Where $\hat{w}$ is a unit vector, $X$ is a point and $t$ is the threhold of the hyperplane. Then by rewritting this equation
$$X \cdot\hat{w}-t =0 $$
A point $X$ lies on the positive side of the hyperplane and is thus designated with class=1=biodegradable if
$$X \cdot\hat{w}-t > 0 $$

A point $X$ lies on the negative side of the hyperplane and is thus designated with class = -1 = nonbiodegradable if
$$X \cdot\hat{w}-t < 0$$

So, given the unit vector $\hat{w}$ and the threshold $t$ from a classification method such as the mean method or linear discrimminant analysis, it is possible to iterate over the rows of a matrix and using the above equations, classify each row. It is important to note that the normal vector and threshold result from applying these methods on the training data, the testing data is not used in the construction of the hyperplane.

Steps:
1) Define a function classify that takes as it's input a feature matrix, a normal vector , and a threshold

2) Create an empty column vector A that has the same number of rows as the input matrix

3) For each row in the input matrix (without a classification), calculate $row \cdot normal vector - threshold$ and store this value in A

4) Adjust A such that all the entries are either -1 for values less than 0 or 1 for values greater than 1

```{r}
classify<- function(matrix,normalvector,threshold){
  A <- vector("numeric",nrow(matrix)) # this is what we end up returning

  for (i in 1:nrow(matrix)){A[i]= (matrix[i, ] %*% normalvector) - threshold}
  ans <- 2*as.numeric(A >0) -1
  return(ans)
}
```


### Testing Accuracy of Fisher LDA on a known data set

Once we have classified the rows of a set of data and obtained a column vector of 1s and -1s, we can test how well this classifiation matches to the original classification. To achieve this, we define a fuction called accuracy.

Accuracy takes the following information as inputs:

- ans: the column vector of 1s and -1s obtained from using either mean_classify (for the mean method) or predict (for LDA) on a data set

- class: the corresponding column vector to ans with the set of 1s and -1s that was part of the original data before it was stripped. For example, if ans was calculated using chemstrain, then the class should be trainclass

Accuracy returns the following outputs:

- "p_as_m" is the percent of biodegradable samples which the model classified as nonbiodegradable

- "m_as_p" is the percent of nonbiodegradable samples which the model classified as biodegradable (this is the more important of the two)

- "total_misclass" is the percentage of total misclassified samples

Steps:

1) Find out how many samples were misclassified by subtracting ans from class, call the result cc

2) Isolate how many of these values in cc are positive; this respresent the number of positive samples that were classified as negative

3) Isolate how many of these values in cc are negative; this represents the number of negative samples that were classified as positive

4) Calculate the percent errors

```{r}
accuracy<- function(ans,class) {
  cc <- class - as.matrix(as.numeric(as.matrix(ans)))
  plusasminus <- sum(cc >0) #Number of positive samples which the model classified as negative
  minusasplus <- sum(cc <0) #Number of negative samples which the model classified as positive

  p_as_m <- plusasminus/nrow(class)

  m_as_p <- minusasplus/nrow(class)

  total_misclass <- (plusasminus+minusasplus)/nrow(class)

  return(list("p_as_m"=p_as_m, "m_as_p"=m_as_p, "total_misclass"= total_misclass))
}
```

### Estimating accuracy of Fisher LDA on an unknown data set


### Creating the classifying hyperplane

Here, we use H (for hyperplane) as the variable to store all the information created by running the fisher_method() on known
```{r}
H <- fisher_method(known)
cat("Normal to Fisher LDA Hyperplane: ")
H$normal

cat("Threshold of Fisher LDA Hyperplane:", H$threshold)
```

### Testing Accuracy of Fisher LDA on chemsrus.csv

```{r}
H_known_ans <- classify(knownmatrix, H$normal, H$threshold)
H_known_accuracy <- accuracy(H_known_ans,knownclass)
```

How well Fisher LDA does in chemsrus.csv

```{r}
H_known_accuracy
```


### Applying Fisher LDA on chemstest.csv
Given a seperating hyperplane H, we can classify the chemicals in chemstest.csv and determine whether they are biodegradable, class = 1 or nonbiodegradable, class=-1.

Steps:
1) Apply classify to the targetmatrix to get H_target_ans

```{r}
H_target_ans <- classify(targetmatrix, H$normal, H$threshold)
```

1.5) Out of curiosity, compare the percentage of biodegradable and nonbiodegradable points with respect to all points for both knowndata and targetdata.

Reminder: known data corresponds to chemsrus.csv and targetdata corresponds to chemstest.csv

- known_biodegradable is the percentage of biodegradable points (class = 1) for the known data
- known_nonbiodegradable is the percentage of nonbiodegrade points (class = -1) for the unknown data

- target_biodegradable is the percentage of biodegradable points (class = 1) in the target data
- target_nonbiodegradable is the percentage of nonbiodegradable points (class = -1) in the target data

```{r}
known_biodegradable<- sum(H_known_ans[H_known_ans>0])/nrow(known)
known_nonbiodegradable<- -1*sum(H_known_ans[H_known_ans<0])/nrow(known)

target_biodegradable <- sum(H_target_ans[H_target_ans>0])/nrow(target)
target_nonbiodegradable <- -1* sum(H_target_ans[H_target_ans<0])/nrow(target)
```

```{r}
known_biodegradable
known_nonbiodegradable

target_biodegradable
target_nonbiodegradable
```


2) Stitch together H_taget_ans with labels to get a matrix with 2 columns. The first column is the id numbers of the chemicals, and the second column is the classification of the chemicals.

To do this, we first rename H_target_ans to prediciton and ensure the result is a column vector. Then, we apply the function cbind, which will take the column vectors labels and prediction as an input and outputs a matrix of the two.

```{r}
prediction<- as.matrix(H_target_ans)
M <- cbind(labels, prediction)
```
3) Transform the matrix M into a csv file called "result.csv" using the write_csv command.

```{r}
write_csv(M,"results.csv")
```