---
title: "Classification Mini Project: Classification of Chemstest"
output: html_notebook

Group Members: \
Rachel G: rachamin12@gmail.com, \
James Hick: redsoxfan765@gmail.com, \
Quientin Morrison: morriq@rpi.edu
---

### Introduction:
Our research comparing classification methods showed clearly that Fisher Linear Discriminant Analysis was superior to the mean method in the creation of a classifying hyperplane.

We will now use Fisher LDA in the classification of the data in chemstest.csv.

Our procedure for creating and testing a hyperplane will be quite similar to the method we used in our comparative analysis; however, we will now use the entirety of chemsrus.csv as the resource for creating a hyperplane. This is our "training" data. Likewise, chemstest.csv is roughly equivalent to "testing" data, with the exception that we do not know how each sample is classified. As such, we will need to develop a predictive model that can estimate the errors our method will produce.

### Import Libraries:
```{r}
library(readr)    # read_csv / write_csv
library(MASS)     # lda
library(devtools)
library(ggplot2)
```

### Read the Data:
There are two sets of data that we read in:
- chemsdata corresponds to chemsrus.csv and is our "training set", what we use to make a classifier

- targetdata corresponds to chemstest.csv and is our "testing set", what we are seeking to classify

The chunk below reads in published Google Drive URLs corresponding to chemsdata (chemsrus.csv) and targetdata (chemstest.csv).\
The function suppressMessages() hides the parsing output; chemsdata and targetdata still exist in the coding environment.

```{r}
chemsdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1arHUuWJrVjpZboLOJa97iIbPzCszX8stE-fbYhw2OCA/pub?gid=1533528387&single=true&output=csv")))

targetdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1NtoMaw06IlCDJ3k9Rxtcv01u2wUwdAh8D5rajytiNpQ/pub?gid=899540023&single=true&output=csv")))
```
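
As an optional sanity check (not part of the procedure itself), we can confirm that both data sets were read in and see how many rows and columns each has; this assumes only that the reads above succeeded.

```{r}
# Optional sanity check: number of rows and columns in each data set
dim(chemsdata)
dim(targetdata)
```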

### Data Parsing

Steps:

1) Fisher LDA will make use of the data in chemsdata and targetdata with the id numbers stripped off. We call these data sets known and target respectively. However, we also need to save the id numbers of targetdata for our final result, so we save the first column of targetdata as labels.

```{r}
labels <- targetdata[ ,1]    # id numbers of the target samples, saved for the final output

known <- chemsdata[ , -1]    # known data with the id column stripped

target <- targetdata[ , -1]  # target data with the id column stripped
```

2) The last column of known is the class label; we split this off into the column vector knownclass. This vector will be used in analyzing how well Fisher LDA does on the entirety of chemsrus.csv in terms of classification errors.

```{r}
knownclass <- known[, ncol(known)]
```
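
Since fisher_method below will weight the two classes equally through the prior, it is worth a quick optional glance at how balanced the classes actually are. A minimal sketch, assuming the class labels are 1 and -1:

```{r}
# Optional: count the samples in each class (1 = biodegradable, -1 = nonbiodegradable)
table(known$class)
```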

3) Our classification method takes as its input a feature matrix, that is, a data set without id numbers or classifications. As such, we remove the classification column from known and target to get the feature matrices knownmatrix and targetmatrix.

```{r}
knownmatrix <- as.matrix(known[ , 1:(ncol(known) - 1)])
targetmatrix <- as.matrix(target[ , 1:(ncol(target) - 1)])
```

### LDA

The Fisher LDA command, lda, is a function in the MASS package for R that performs a Linear Discriminant Analysis on an input training data set.

- The classification column in our input data is called "class", and it appears in the first argument, the formula class ~ .

- The second argument is the input data; this data set still has the original classification column.

- The "prior" option specifies the weighting between classes. We use (1/2, 1/2), saying the two classes are weighted equally regardless of their sizes.

Given that we will often need the normal vector and threshold of the hyperplane formed by Fisher LDA, it will be beneficial to wrap the lda function inside another function called fisher_method that takes the id-stripped training data as its input.

fisher_method returns a list of information that will be valuable for our analysis:

- "means" is a 2 x k matrix, where k is the number of feature columns in the input data. Each row of means contains, for one class value, the mean of all the data in the input data set associated with that class

- "normal" is a normal vector to the separating hyperplane calculated using lda

- "threshold" is the threshold of the separating hyperplane

- "z" is the original output of the lda command

```{r}
fisher_method <- function(trainingdata){
  # Fit the LDA model; "class" is the classification column, with equal priors
  z <- lda(class ~ ., trainingdata, prior = c(1, 1)/2)

  # Calculate the Fisher threshold from the means and the normal vector
  z_threshold <- ((z$means[1, ] + z$means[2, ])/2) %*% z$scaling

  return(list("z" = z, "means" = z$means, "normal" = z$scaling, "threshold" = z_threshold))
}
```

### Classification Method:

Consider the equation of a hyperplane
$$ X \cdot \hat{w} = t $$
where $\hat{w}$ is the normal vector, $X$ is a point, and $t$ is the threshold of the hyperplane. Rewriting this equation as
$$ X \cdot \hat{w} - t = 0, $$
a point $X$ lies on the positive side of the hyperplane, and is thus designated class = 1 (biodegradable), if
$$ X \cdot \hat{w} - t > 0. $$

A point $X$ lies on the negative side of the hyperplane, and is thus designated class = -1 (nonbiodegradable), if
$$ X \cdot \hat{w} - t < 0. $$

So, given the normal vector $\hat{w}$ and the threshold $t$ from a classification method such as the mean method or linear discriminant analysis, it is possible to iterate over the rows of a matrix and, using the above equations, classify each row. It is important to note that the normal vector and threshold result from applying these methods to the training data; the testing data is not used in the construction of the hyperplane.
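
To make the rule concrete, here is a tiny hypothetical example in two dimensions; the point, normal vector, and threshold below are invented purely for illustration.

```{r}
# Hypothetical 2D illustration of the sign rule (all values made up)
w <- c(1, 1)/sqrt(2)  # a normal vector
t <- 0.5              # a threshold
X <- c(2, 1)          # a point to classify

X %*% w - t           # about 1.62 > 0, so X would be designated class = 1
```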

Steps:
1) Define a function classify that takes as its input a feature matrix, a normal vector, and a threshold

2) Create an empty column vector A that has the same number of rows as the input matrix

3) For each row in the input matrix (which has no classification column), calculate $\text{row} \cdot \hat{w} - t$ and store this value in A

4) Adjust A such that all the entries are either -1, for values less than 0, or 1, for values greater than 0

```{r}
classify <- function(matrix, normalvector, threshold){
  A <- vector("numeric", nrow(matrix))  # signed value for each row

  for (i in 1:nrow(matrix)){
    A[i] <- (matrix[i, ] %*% normalvector) - threshold
  }
  ans <- 2*as.numeric(A > 0) - 1  # convert the signed values to 1 / -1
  return(ans)
}
```
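
Since %*% already operates on whole matrices, the row-by-row loop above can be collapsed into a single matrix product. The vectorized variant below is a sketch that should return the same vector of 1s and -1s as classify:

```{r}
# Vectorized alternative to classify: one matrix product instead of a loop
classify_vec <- function(matrix, normalvector, threshold){
  A <- as.vector(matrix %*% normalvector) - as.vector(threshold)  # signed value for every row at once
  return(2*as.numeric(A > 0) - 1)
}
```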


### Testing Accuracy of Fisher LDA on a known data set

Once we have classified the rows of a set of data and obtained a column vector of 1s and -1s, we can test how well this classification matches the original classification. To achieve this, we define a function called accuracy.

accuracy takes the following information as inputs:

- ans: the column vector of 1s and -1s obtained from using classify on a data set

- class: the column vector corresponding to ans with the set of 1s and -1s that was part of the original data before it was stripped. For example, if ans was calculated using knownmatrix, then class should be knownclass

accuracy returns the following outputs:

- "p_as_m" is the percentage of samples (out of all samples) that are biodegradable but which the model classified as nonbiodegradable

- "m_as_p" is the percentage of samples (out of all samples) that are nonbiodegradable but which the model classified as biodegradable (this is the more important of the two)

- "total_misclass" is the percentage of total misclassified samples

Steps:

1) Find out which samples were misclassified by subtracting ans from class; call the result cc

2) Count how many of the values in cc are positive; this represents the number of positive samples that were classified as negative

3) Count how many of the values in cc are negative; this represents the number of negative samples that were classified as positive

4) Calculate the percent errors

```{r}
accuracy <- function(ans, class) {
  # cc is 0 where the prediction matches; +2 where a +1 was classified as -1; -2 where a -1 was classified as +1
  cc <- class - as.matrix(as.numeric(as.matrix(ans)))
  plusasminus <- sum(cc > 0)  # number of positive samples which the model classified as negative
  minusasplus <- sum(cc < 0)  # number of negative samples which the model classified as positive

  p_as_m <- (plusasminus/nrow(class))*100

  m_as_p <- (minusasplus/nrow(class))*100

  total_misclass <- ((plusasminus + minusasplus)/nrow(class))*100

  return(list("p_as_m" = p_as_m, "m_as_p" = m_as_p, "total_misclass" = total_misclass))
}
```
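
As a quick hypothetical check of accuracy (the labels below are made up), four samples with exactly one biodegradable sample misclassified should give p_as_m = 25, m_as_p = 0, and total_misclass = 25:

```{r}
# Toy check of accuracy(): invented labels, one +1 sample classified as -1
toy_class <- data.frame(class = c(1, 1, -1, -1))  # "true" classes
toy_ans <- c(1, -1, -1, -1)                       # predicted classes
accuracy(toy_ans, toy_class)
```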

### Creating the classifying hyperplane

Here, we use H (for hyperplane) as the variable to store all the information created by running fisher_method() on known.
```{r}
H <- fisher_method(known)
cat("Normal to Fisher LDA Hyperplane: ")
H$normal

cat("Threshold of Fisher LDA Hyperplane:", H$threshold)
```

### Estimating the accuracy of Fisher Linear Discriminant Analysis:

Predicting the misclassification rate for a random data set is difficult, as we have no idea what the "true" classifications of our target set are; that is, after all, why we go through the trouble of classifying the data.

However, we can approximate it by creating hyperplanes from a randomized subset of our known data (say 90%) and testing on the remaining subset of the known data (10%), comparing the classifications obtained from our hyperplane with the original classifications of our testing data. Given these two different classifications, we can calculate the percentage of biodegradable points (class = 1) classified as nonbiodegradable (class = -1), the reverse, and the percentage of total misclassified points.

To obtain the best approximation of the errors for a random data set given our known data (assuming both are described by the same features), we simply run the above process an arbitrary number of times, k, and take the average.

It is important to note that the testing data is not used in the creation of the classifying hyperplanes. For all intents and purposes, these testing data sets are "unknown data". As such, while we may know the classification of these "unknown" data points, and use them to calculate the percentage of misclassified points, the errors calculated will be a good indication of how well our classification model will do on any unknown data set.


Our function here makes use of the functions classify, accuracy, and fisher_method.

```{r}
estimate_inaccuracy_test <- function(knowndata, k){

  # Each of these vectors holds the k values of the corresponding quantity returned by accuracy
  P_As_M <- vector("numeric", k)
  M_As_P <- vector("numeric", k)
  Total_Misclass <- vector("numeric", k)

  for (i in 1:k){
    # Randomly select 90% of the rows for training; the remaining 10% are held out for testing
    ss <- as.integer(.90*nrow(knowndata))
    value <- sample(1:nrow(knowndata), ss)

    knowntrain <- knowndata[value, -1]   # training rows, id column stripped
    knowntest <- knowndata[-value, -1]   # held-out rows, id column stripped
    testclass <- knowntest[, ncol(knowntest)]
    testmatrix <- as.matrix(knowntest[ , 1:(ncol(knowntest) - 1)])

    H <- fisher_method(knowntrain)

    H_test_ans <- classify(testmatrix, H$normal, H$threshold)

    H_test_accuracy <- accuracy(H_test_ans, testclass)

    P_As_M[i] <- H_test_accuracy$p_as_m
    M_As_P[i] <- H_test_accuracy$m_as_p
    Total_Misclass[i] <- H_test_accuracy$total_misclass
  }

  ## Now all we need to do is return the average of each of these values

  return(list("P_As_M" = mean(P_As_M), "M_As_P" = mean(M_As_P), "Total_Misclass" = mean(Total_Misclass)))
}
```
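
Because each trial draws a random 90/10 split, repeated runs of this function will give slightly different averages. If reproducible numbers are wanted, one option is to fix the random seed before calling it; the seed value and the small k below are arbitrary:

```{r}
# Optional: fix the RNG seed so the random splits (and hence the estimates) are reproducible
set.seed(2017)
estimate_inaccuracy_test(chemsdata, 10)  # small k for a quick look; larger k gives steadier averages
```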

We can follow a similar process for the training data by checking the accuracy of the separating hyperplane on the data it is created from. If these training percentages are much lower than the corresponding test percentages, that's an indication that our hyperplane is overfitting the data.

```{r}
estimate_inaccuracy_train <- function(knowndata, k){

  # Each of these vectors holds the k values of the corresponding quantity returned by accuracy
  P_As_M <- vector("numeric", k)
  M_As_P <- vector("numeric", k)
  Total_Misclass <- vector("numeric", k)

  for (i in 1:k){
    # Randomly select 90% of the rows, as in estimate_inaccuracy_test,
    # but this time test on the same rows used to build the hyperplane
    ss <- as.integer(.90*nrow(knowndata))
    value <- sample(1:nrow(knowndata), ss)

    knowntrain <- knowndata[value, -1]
    trainclass <- knowntrain[, ncol(knowntrain)]
    trainmatrix <- as.matrix(knowntrain[ , 1:(ncol(knowntrain) - 1)])

    H <- fisher_method(knowntrain)

    H_train_ans <- classify(trainmatrix, H$normal, H$threshold)

    H_train_accuracy <- accuracy(H_train_ans, trainclass)

    P_As_M[i] <- H_train_accuracy$p_as_m
    M_As_P[i] <- H_train_accuracy$m_as_p
    Total_Misclass[i] <- H_train_accuracy$total_misclass
  }

  ## Now all we need to do is return the average of each of these values

  return(list("P_As_M" = mean(P_As_M), "M_As_P" = mean(M_As_P), "Total_Misclass" = mean(Total_Misclass)))
}
```


Running estimate_inaccuracy_train for a large number of trials (say 400) will give us a good indication of how well the Fisher method classifies the data it was trained on. If the percentage of misclassified points here is much lower than the test estimate, that's an indication that our hyperplane is overfitting the data.

```{r}
estimate_inaccuracy_train(chemsdata, 400)
```


Running estimate_inaccuracy_test for a large number of trials (say 400) will give us a good indication of what the percentage of misclassified points will be for an unknown data set.

```{r}
estimate_inaccuracy_test(chemsdata, 400)
```

### Applying Fisher LDA on chemstest.csv
Given the separating hyperplane H, we can classify the chemicals in chemstest.csv and determine whether they are biodegradable (class = 1) or nonbiodegradable (class = -1).

Steps:
1) Apply classify to targetmatrix to get H_target_ans (and, for the comparison in step 1.5, to knownmatrix to get H_known_ans)

```{r}
H_target_ans <- classify(targetmatrix, H$normal, H$threshold)
H_known_ans <- classify(knownmatrix, H$normal, H$threshold)
```

1.5) Out of curiosity, compare the percentage of biodegradable and nonbiodegradable points, with respect to all points, for both the known data and the target data.

Reminder: known corresponds to chemsrus.csv and target corresponds to chemstest.csv

- known_biodegradable is the percentage of biodegradable points (class = 1) in the known data
- known_nonbiodegradable is the percentage of nonbiodegradable points (class = -1) in the known data

- target_biodegradable is the percentage of biodegradable points (class = 1) in the target data
- target_nonbiodegradable is the percentage of nonbiodegradable points (class = -1) in the target data

```{r}
# A classification of 1 counts as biodegradable; -1 counts as nonbiodegradable
known_biodegradable <- (sum(H_known_ans > 0)/nrow(known))*100
known_nonbiodegradable <- (sum(H_known_ans < 0)/nrow(known))*100

target_biodegradable <- (sum(H_target_ans > 0)/nrow(target))*100
target_nonbiodegradable <- (sum(H_target_ans < 0)/nrow(target))*100
```

```{r}
known_biodegradable
known_nonbiodegradable

target_biodegradable
target_nonbiodegradable
```


2) Stitch together H_target_ans with labels to get a matrix with 2 columns. The first column is the id numbers of the chemicals, and the second column is the classification of the chemicals.

To do this, we first rename H_target_ans to prediction and ensure the result is a column vector. Then we apply the function cbind, which takes the column vectors labels and prediction as input and outputs a matrix of the two.

```{r}
prediction <- as.matrix(H_target_ans)
M <- cbind(labels, prediction)
```
3) Write the matrix M to a csv file called "results.csv" using the write_csv command.

```{r}
write_csv(M, "results.csv")
```
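
As a final optional check, we can read results.csv back in and look at the first few rows to confirm that the id numbers and predicted classes were written as expected.

```{r}
# Optional: read the output file back in and peek at the first rows
head(suppressMessages(read_csv("results.csv")))
```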