flutedaddyfunk

Final Classification.Rmd

Apr 23rd, 2017
---
title: "Classification Mini Project: Classification of Chemstest"
output: html_notebook

Group Members: \
Rachel G: rachamin12@gmail.com,\
James Hick: redsoxfan765@gmail.com, \
Quientin Morrison: morriq@rpi.edu
---

### Introduction:
Our research comparing classification methods showed that Fisher Linear Discriminant Analysis (LDA) outperformed the Mean method in creating a classifying hyperplane.

We will now use Fisher LDA to classify the data in chemstest.csv.

Our procedure for creating and testing a hyperplane is similar to the one used in our comparative analysis. However, we now use the entirety of chemsrus.csv to create the hyperplane; this is our "training" data. Likewise, chemstest.csv is roughly equivalent to "testing" data, except that we do not know the true class of each sample. As such, we need a predictive model that can estimate the errors our method will produce.

### Import Libraries:
```{r}
library(readr)
library(MASS)
library(devtools)
library(ggplot2)
```

### Read the Data:
We read in two sets of data:

- chemsdata corresponds to chemsrus.csv and is our "training set", which we use to build a classifier

- targetdata corresponds to chemstest.csv and is our "testing set", which we are seeking to classify

This reads in the published URLs from Google Drive corresponding to chemsdata <- chemsrus.csv and targetdata <- chemstest.csv.\
The function suppressMessages() hides the parsing output; chemsdata and targetdata still exist in the coding environment.

```{r}
chemsdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1arHUuWJrVjpZboLOJa97iIbPzCszX8stE-fbYhw2OCA/pub?gid=1533528387&single=true&output=csv")))

targetdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1NtoMaw06IlCDJ3k9Rxtcv01u2wUwdAh8D5rajytiNpQ/pub?gid=899540023&single=true&output=csv")))
```

### Data Parsing

Steps:

1) Fisher LDA uses the data in chemsdata and targetdata with the id numbers stripped. We call these data sets known and target, respectively. We also need the id numbers of targetdata for our final result, so we save the first column of targetdata as labels.

```{r}
labels <- targetdata[ ,1]

known <- chemsdata[ , -1]

target <- targetdata[ , -1]
```

2) The last column of known is the class label; we split this off into the column vector knownclass. This vector will be used to analyze how well Fisher LDA does on the entirety of chemsrus.csv in terms of classification errors.

```{r}
knownclass <- known[, ncol(known)]
```
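How the class column is extracted matters later, because accuracy calls nrow() on it. The sketch below uses a made-up base data frame, toy_df (not part of the project data), to show that single-column indexing drops to a plain vector in base R unless drop = FALSE is given. read_csv returns a tibble, which keeps a one-column result as a table, so the chunk above behaves like the drop = FALSE case.

```{r}
# Hypothetical toy data; the column names are illustrative only.
toy_df <- data.frame(x1 = c(0.5, 1.2, 3.1), x2 = c(2.0, 0.3, 1.7), class = c(1, -1, 1))

toy_vec <- toy_df[, ncol(toy_df)]                # base R drops to a numeric vector
toy_col <- toy_df[, ncol(toy_df), drop = FALSE]  # stays a one-column data frame

is.null(nrow(toy_vec))  # TRUE: a plain vector has no rows
nrow(toy_col)           # 3
```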

3) Our classification method takes as its input a feature matrix, that is, a data set without id numbers or classifications. As such, we remove the classification column from known and target to get the feature matrices knownmatrix and targetmatrix.

```{r}
# Keep columns 1..(n - 1), dropping the last (class) column.
knownmatrix <- as.matrix(known[ , seq_len(ncol(known) - 1)])
targetmatrix <- as.matrix(target[ , seq_len(ncol(target) - 1)])
```
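Selecting all but the last column is easy to get subtly wrong, because in R the colon operator binds more tightly than subtraction. The sketch below uses a hypothetical column count, toy_n, and made-up column names to compare three ways of writing the index.

```{r}
toy_n <- 4  # hypothetical number of columns

1:toy_n - 1         # parsed as (1:4) - 1, i.e. 0 1 2 3
1:(toy_n - 1)       # 1 2 3
seq_len(toy_n - 1)  # 1 2 3, and an empty index when toy_n is 1

# A 0 index is silently ignored, so the first form still selects columns 1 to 3.
toy_cols <- c("a", "b", "c", "class")
toy_cols[1:toy_n - 1]
```

The first form gives the intended columns only because of the ignored 0 index; seq_len states the intent directly.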

### LDA

The Fisher LDA command, lda, is a function in the MASS package in R that performs a Linear Discriminant Analysis on an input training data set. We call it with the following arguments:

- The formula class ~ . as the first argument: the classification column in our input data is called "class", and it is modeled using all remaining columns.

- The input data set; this must still contain the original classification column.

- The "prior" option specifies the weighting between classes. We use (1/2, 1/2), so both classes are weighted equally regardless of their sizes.

Since we will use the normal vector and threshold of the hyperplane formed by Fisher LDA often, we wrap the lda function inside another function called fisher_method that takes the id-stripped training data as its input.

fisher_method returns a list of information that will be valuable for our analysis:

- "means" is a 2 x k matrix, where k is the number of feature columns in the input data set. Each row of means contains the mean of all the samples in the input data set that belong to one class.

- "normal" is a normal vector to the separating hyperplane calculated using lda. It is the scaling vector returned by lda, which is not rescaled to unit length; this does not change which side of the hyperplane a point falls on.

- "threshold" is the threshold of the separating hyperplane

- "z" is the original output of the lda command

```{r}
fisher_method <- function(trainingdata) {
  # Fit Fisher LDA with equal priors for the two classes.
  z <- lda(class ~ ., trainingdata, prior = c(1, 1) / 2)

  # Calculate the Fisher threshold from the means and the normal vector:
  # project the midpoint of the two class means onto the normal.
  z_threshold <- ((z$means[1, ] + z$means[2, ]) / 2) %*% z$scaling

  return(list("z" = z, "means" = z$means, "normal" = z$scaling, "threshold" = z_threshold))
}
```
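As a sanity check on the threshold formula, the projections of the two class means should sit at equal distances on opposite sides of the threshold. The sketch below checks this on simulated two-feature data; toy_train, toy_fit, and the other toy_ names are made up for illustration and are not part of the project data.

```{r}
library(MASS)

set.seed(1)
# Simulated training data: 20 samples per class, two features.
toy_train <- data.frame(
  x1 = c(rnorm(20, mean = 0), rnorm(20, mean = 3)),
  x2 = c(rnorm(20, mean = 0), rnorm(20, mean = 1)),
  class = rep(c(-1, 1), each = 20)
)

toy_fit <- lda(class ~ ., toy_train, prior = c(1, 1) / 2)
toy_w <- toy_fit$scaling
toy_t <- ((toy_fit$means[1, ] + toy_fit$means[2, ]) / 2) %*% toy_w

# Signed offsets of the two class means from the threshold:
# equal size, opposite sign (row 1 is class -1, row 2 is class 1).
toy_offsets <- as.numeric(toy_fit$means %*% toy_w) - as.numeric(toy_t)
toy_offsets

# Length of the scaling vector; lda does not normalize it to 1.
sqrt(sum(toy_w^2))
```

If the offset for class 1 came out negative, the scaling vector would be pointing toward class -1, and its sign would need to be flipped before treating the positive side as class 1.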

### Classification Method:

Consider the equation of a hyperplane
$$ X \cdot\hat{w} = t $$
where $\hat{w}$ is a unit vector, $X$ is a point, and $t$ is the threshold of the hyperplane. Rewriting this equation gives
$$X \cdot\hat{w}-t =0 $$
A point $X$ lies on the positive side of the hyperplane, and is thus assigned class = 1 = biodegradable, if
$$X \cdot\hat{w}-t > 0 $$

A point $X$ lies on the negative side of the hyperplane, and is thus assigned class = -1 = nonbiodegradable, if
$$X \cdot\hat{w}-t < 0$$

So, given the unit vector $\hat{w}$ and the threshold $t$ from a classification method such as the mean method or linear discriminant analysis, we can iterate over the rows of a matrix and classify each row using the above inequalities. Note that the normal vector and threshold result from applying these methods to the training data; the testing data is not used in the construction of the hyperplane.

Steps:

1) Define a function classify that takes as its input a feature matrix, a normal vector, and a threshold

2) Create an empty column vector A that has the same number of rows as the input matrix

3) For each row in the input matrix (without a classification), calculate $\text{row} \cdot \text{normal vector} - \text{threshold}$ and store this value in A

4) Adjust A so that each entry is 1 for values greater than 0 and -1 otherwise

```{r}
classify <- function(matrix, normalvector, threshold) {
  A <- vector("numeric", nrow(matrix)) # signed offset of each row from the hyperplane

  # Project each row onto the normal vector and subtract the threshold.
  for (i in seq_len(nrow(matrix))) {
    A[i] <- (matrix[i, ] %*% normalvector) - threshold
  }

  # Map positive offsets to 1 and everything else to -1; this is what we return.
  ans <- 2 * as.numeric(A > 0) - 1
  return(ans)
}
```
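To see the sign rule on numbers that can be checked by hand, the sketch below classifies three made-up points against the hyperplane with normal (1, 0) and threshold 2. It uses a vectorized matrix product in place of the row loop, which is an alternative formulation rather than the function defined above; demo_X, demo_w, and demo_t are illustrative names.

```{r}
demo_X <- rbind(c(3, 1), c(1, 5), c(2, 0))  # one point per row
demo_w <- c(1, 0)                           # unit normal vector
demo_t <- 2                                 # threshold

demo_score <- as.numeric(demo_X %*% demo_w) - demo_t  # 1, -1, 0
demo_label <- 2 * as.numeric(demo_score > 0) - 1
demo_label  # 1 -1 -1: the point exactly on the hyperplane is labeled -1
```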


### Testing Accuracy of Fisher LDA on a known data set

Once we have classified the rows of a data set and obtained a column vector of 1s and -1s, we can test how well this classification matches the original classification. To achieve this, we define a function called accuracy.

accuracy takes the following information as inputs:

- ans: the column vector of 1s and -1s obtained by applying classify to a data set

- class: the column vector of 1s and -1s corresponding to ans that was part of the original data before it was stripped. For example, if ans was calculated using knownmatrix, then class should be knownclass

accuracy returns the following outputs:

- "p_as_m" is the fraction of all samples that are biodegradable but were classified as nonbiodegradable

- "m_as_p" is the fraction of all samples that are nonbiodegradable but were classified as biodegradable (this is the more important of the two)

- "total_misclass" is the fraction of all samples that were misclassified

Steps:

1) Find the misclassified samples by subtracting ans from class; call the result cc. Entries of cc are 0 for correctly classified samples, 2 for positive samples classified as negative, and -2 for negative samples classified as positive

2) Count how many of the values in cc are positive; this represents the number of positive samples that were classified as negative

3) Count how many of the values in cc are negative; this represents the number of negative samples that were classified as positive

4) Calculate the error rates by dividing each count by the total number of samples

```{r}
accuracy <- function(ans, class) {
  # 0 where the prediction matches, 2 for plus-as-minus, -2 for minus-as-plus.
  cc <- class - as.matrix(as.numeric(as.matrix(ans)))
  plusasminus <- sum(cc > 0) # Number of positive samples which the model classified as negative
  minusasplus <- sum(cc < 0) # Number of negative samples which the model classified as positive

  # Error rates as fractions of all samples.
  p_as_m <- plusasminus / nrow(class)

  m_as_p <- minusasplus / nrow(class)

  total_misclass <- (plusasminus + minusasplus) / nrow(class)

  return(list("p_as_m" = p_as_m, "m_as_p" = m_as_p, "total_misclass" = total_misclass))
}
```
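A small hand-checkable case helps confirm what the three error rates measure. The sketch below uses made-up label vectors (toy_truth and toy_pred are illustrative, not project data) and also prints a confusion table, which is a common complementary view of the same counts.

```{r}
toy_truth <- c(1, 1, -1, -1, 1)  # hypothetical true classes
toy_pred  <- c(1, -1, -1, 1, 1)  # hypothetical predicted classes

toy_cc <- toy_truth - toy_pred   # 0 = correct, 2 = plus as minus, -2 = minus as plus
toy_rates <- c(
  p_as_m = sum(toy_cc > 0) / length(toy_truth),
  m_as_p = sum(toy_cc < 0) / length(toy_truth),
  total_misclass = sum(toy_cc != 0) / length(toy_truth)
)
toy_rates  # 0.2 0.2 0.4

# Rows are true classes, columns are predicted classes.
table(truth = toy_truth, predicted = toy_pred)
```

One of the five samples is a positive labeled negative and one is a negative labeled positive, so each rate is 1/5 of all samples and the total is 2/5.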

### Estimating accuracy of Fisher LDA on an unknown data set
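One common way to estimate how a classifier will do on unlabeled data is a hold-out split: set aside part of the labeled data, build the hyperplane on the rest, and measure the error on the part that was set aside. The sketch below illustrates the idea on simulated data. It is an assumption about how such an estimate could be made, not necessarily the procedure used for this project, and all ho_ names are hypothetical.

```{r}
library(MASS)

set.seed(42)
# Simulated labeled data standing in for a training set: 100 samples per class.
ho_data <- data.frame(
  x1 = c(rnorm(100, mean = 0), rnorm(100, mean = 2)),
  x2 = c(rnorm(100, mean = 0), rnorm(100, mean = 1)),
  class = rep(c(-1, 1), each = 100)
)

# Hold out 40 of the 200 rows (20%) for testing.
ho_test_idx <- sample(nrow(ho_data), size = 40)
ho_train <- ho_data[-ho_test_idx, ]
ho_test  <- ho_data[ho_test_idx, ]

# Build the hyperplane from the training rows only.
ho_fit <- lda(class ~ ., ho_train, prior = c(1, 1) / 2)
ho_w <- ho_fit$scaling

# Orient the normal so the class 1 mean lies on the positive side;
# this sketch does not assume a particular sign for lda's scaling vector.
ho_gap <- as.numeric((ho_fit$means["1", ] - ho_fit$means["-1", ]) %*% ho_w)
if (ho_gap < 0) ho_w <- -ho_w
ho_t <- as.numeric(((ho_fit$means[1, ] + ho_fit$means[2, ]) / 2) %*% ho_w)

# Classify the held-out rows and compare with their true classes.
ho_score <- as.numeric(as.matrix(ho_test[, c("x1", "x2")]) %*% ho_w) - ho_t
ho_pred <- 2 * as.numeric(ho_score > 0) - 1
ho_error <- mean(ho_pred != ho_test$class)
ho_error
```

Repeating the split with different seeds and averaging the errors gives a steadier estimate than a single split.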


### Creating the classifying hyperplane

Here, we use H (for hyperplane) as the variable to store all the information created by running fisher_method() on known.

```{r}
H <- fisher_method(known)
cat("Normal to Fisher LDA Hyperplane: ")
H$normal

cat("Threshold of Fisher LDA Hyperplane:", H$threshold)
```

### Testing Accuracy of Fisher LDA on chemsrus.csv

```{r}
H_known_ans <- classify(knownmatrix, H$normal, H$threshold)
H_known_accuracy <- accuracy(H_known_ans, knownclass)
```

How well Fisher LDA does on chemsrus.csv. Since the hyperplane was built from this same data, these are training error rates and are likely optimistic for new samples.

```{r}
H_known_accuracy
```


### Applying Fisher LDA on chemstest.csv
Given a separating hyperplane H, we can classify the chemicals in chemstest.csv and determine whether they are biodegradable (class = 1) or nonbiodegradable (class = -1).

Steps:

1) Apply classify to targetmatrix to get H_target_ans

```{r}
H_target_ans <- classify(targetmatrix, H$normal, H$threshold)
```

1.5) Out of curiosity, compare the fractions of points classified as biodegradable and nonbiodegradable, with respect to all points, for both the known data and the target data.

Reminder: the known data corresponds to chemsrus.csv and the target data corresponds to chemstest.csv

- known_biodegradable is the fraction of points classified as biodegradable (class = 1) in the known data
- known_nonbiodegradable is the fraction of points classified as nonbiodegradable (class = -1) in the known data

- target_biodegradable is the fraction of points classified as biodegradable (class = 1) in the target data
- target_nonbiodegradable is the fraction of points classified as nonbiodegradable (class = -1) in the target data

```{r}
# Fraction of predicted labels on each side of the hyperplane.
known_biodegradable <- sum(H_known_ans > 0) / nrow(known)
known_nonbiodegradable <- sum(H_known_ans < 0) / nrow(known)

target_biodegradable <- sum(H_target_ans > 0) / nrow(target)
target_nonbiodegradable <- sum(H_target_ans < 0) / nrow(target)
```
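The same class fractions can also be read off in one step with table and prop.table. The sketch below uses a made-up prediction vector, toy_ans, purely for illustration.

```{r}
toy_ans <- c(1, -1, 1, 1)  # hypothetical predicted labels

toy_frac <- prop.table(table(toy_ans))
toy_frac  # fraction of -1 labels, then fraction of 1 labels

mean(toy_ans > 0)  # 0.75, the fraction labeled biodegradable
```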

```{r}
known_biodegradable
known_nonbiodegradable

target_biodegradable
target_nonbiodegradable
```


2) Stitch together H_target_ans with labels to get a matrix with 2 columns. The first column is the id numbers of the chemicals, and the second column is the classification of the chemicals.

To do this, we first rename H_target_ans to prediction and ensure the result is a column vector. Then, we apply the function cbind, which takes the column vectors labels and prediction as inputs and outputs a matrix of the two.

```{r}
prediction <- as.matrix(H_target_ans)
M <- cbind(labels, prediction)
```

3) Transform the matrix M into a csv file called "results.csv" using the write_csv command.

```{r}
write_csv(M, "results.csv")
```
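Before handing in the file, it can be worth reading it back to confirm its shape. The sketch below does this with base R's write.csv and read.csv on a temporary file, used here in place of readr so the example has no package dependencies; the ids, column names, and toy_ objects are made up.

```{r}
# Hypothetical ids and predicted labels.
toy_result <- data.frame(id = c(101, 102, 103), prediction = c(1, -1, 1))

toy_path <- tempfile(fileext = ".csv")
write.csv(toy_result, toy_path, row.names = FALSE)

toy_back <- read.csv(toy_path)
dim(toy_back)                           # 3 rows, 2 columns
all(toy_back$prediction %in% c(-1, 1))  # every label is -1 or 1
unlink(toy_path)
```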