flutedaddyfunk
Classification Mini Project.Rmd
Apr 23rd, 2017
---
title: "Classification Mini Project: Comparison of Methods"
output: html_notebook
---
Group Members: \
Rachel G: rachamin12@gmail.com, \
James Hick: redsoxfan765@gmail.com, \
Quientin Morrison: morriq@rpi.edu
 
### Import Libraries:
```{r}
library(readr)
library(MASS)
library(devtools)
library(ggplot2)
```
### Read the Data:
Our comparison of classification methods will use data for which the classification is already known, chemsrus.csv.
 
This reads in the published URL from Google Drive corresponding to chemsdata <- chemsrus.csv.
The function suppressMessages() hides the parsing output; chemsdata still exists in the coding environment.
 
```{r}
chemsdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1arHUuWJrVjpZboLOJa97iIbPzCszX8stE-fbYhw2OCA/pub?gid=1533528387&single=true&output=csv")))
```
 
### Training and Testing Sets:
 
In order to construct classifiers with the Mean Method and Fisher Linear Discriminant Analysis (LDA), we need data with known classifications so that we can compare the two methods and assess their viability. As such, the data set used will be chemsdata <- chemsrus.csv.
 
The classifying hyperplane will be constructed from a randomized subset containing 90% of the original data in chemsdata; this data set will be called chemstrain. The remaining 10% of the data will be arranged into a data set called chemstest, which will be used to measure how well the classifying hyperplane does its job.
 
Steps:
 
1) Create a vector train that contains a randomized selection of 90% of the numbers ranging from 1 to the number of rows in chemsdata
 
```{r}
ss <- as.integer(.90*nrow(chemsdata)) # as.integer is a necessary addition to ensure ss is an integer
 
set.seed(79586879); train <- sample(1:nrow(chemsdata),ss)
```
 
2) Create chemstrain and chemstest using the vector train. In addition, strip the first column of each, as it is just a sample label
 
```{r}
chemstrain <- chemsdata[train, -1]
chemstest <- chemsdata[-train, -1] # Using -train gives us all rows except the training rows.
```
 
3) The last column of chemstrain and chemstest is the class label; we split this off into the column vectors trainclass and testclass respectively. These vectors will be used in analyzing how effective our methods were in classifying the data
 
```{r}
trainclass <- chemstrain[, ncol(chemstrain)]
testclass <- chemstest[,ncol(chemstest)]
```
 
4) Once we've split off the last columns of chemstrain and chemstest, we can remove this column from these data sets to get the matrices of features, trainmatrix and testmatrix
 
```{r}
trainmatrix <- as.matrix(chemstrain[ ,1:(ncol(chemstrain)-1)]) # parentheses ensure the -1 applies to ncol, not to the whole sequence
testmatrix <- as.matrix(chemstest[ ,1:(ncol(chemstest)-1)])
```
 
### Mean Method
 
Here we construct a function that encapsulates the steps of the Mean Method and returns the components we need:
 
1) Take in an id-stripped training data set and separate it into two matrices, degrade and nondegrade. Once we've separated the matrices, strip the last (class) column from each.
 
2) Find the mean of each of these classes, m_degrade and m_nondegrade
 
3) Subtract one mean from the other to get a vector called m_normal; this is the normal vector of the classifying hyperplane
 
4) Use m_normal to find the threshold, m_threshold
 
Our function will return a list of the following values:
 
- "m_degrade" is the mean of the entries with class = 1 in the input data
 
- "m_nondegrade" is the mean of the entries with class = -1 in the input data
 
- "normal" is the unit normal vector of the separating hyperplane calculated using the mean method
 
- "threshold" is the threshold of the separating hyperplane
 
```{r}
mean_method <- function(training){
 
## Separate the data by class and then strip the classification column.
 
degrade1 <- training[training[ ,ncol(training)]>0, ]
degrade <- degrade1[ ,1:(ncol(training)-1)]
 
nondegrade1 <-training[training[ ,ncol(training)] <0, ]
nondegrade <- nondegrade1[ , 1:(ncol(training)-1)]
 
## Find the mean of both classes
 
m_degrade <- (1/nrow(degrade))*rep(1,nrow(degrade)) %*% as.matrix(degrade)
m_nondegrade <- (1/nrow(nondegrade))*rep(1,nrow(nondegrade)) %*% as.matrix(nondegrade)
difference <- m_degrade-m_nondegrade
 
## Find the normal vector
m_normal <- t(difference) # should be a column vector, so take the transpose
m_normal <- m_normal/(as.numeric(sqrt((difference)%*%t(difference)))) ## Make into a unit vector
 
## Find the threshold
 
m_threshold <- ((m_degrade+m_nondegrade)/2)%*% m_normal
 
return(list("m_degrade" = m_degrade, "m_nondegrade"= m_nondegrade, "normal"=m_normal, "threshold"=m_threshold))
}
```
 
 
### LDA
 
The Fisher LDA command, lda, is a function in R (from the MASS package) that performs a Linear Discriminant Analysis on an input training data set. Its arguments here are:
 
- The classification column in our input data is called "class", which is used in the formula given as the first argument.
 
- The input data set; this data set still has the original classification column.
 
- The "prior" option specifies the weighting between the classes. We use c(1/2,1/2), meaning the two classes are weighted equally rather than by their sizes.
 
Given that we will be using the threshold of the hyperplane formed by Fisher LDA often, it is beneficial to wrap the lda function inside another function, fisher_method, that takes id-stripped data as its input (chemstrain or chemstest).
 
fisher_method returns a list of information that will be valuable for our analysis:
 
- "means" is a 2 x k matrix, where k is the number of feature columns in the input data. Each row of means contains the mean of all the data in the input data set associated with a specific class value
 
- "normal" is a unit normal vector to the separating hyperplane calculated using lda
 
- "threshold" is the threshold of the separating hyperplane
 
- "z" is the original output of the lda command
 
```{r}
fisher_method<- function(trainingdata){
z <- lda(class ~ .,trainingdata,prior=c(1,1)/2)
 
#Calculate the Fisher threshold from the means and the normal vector.
 
z_threshold <- ((z$means[1,] + z$means[2,])/2)%*%z$scaling
 
return(list("z"=z,"means"=z$means, "normal"= z$scaling, "threshold"= z_threshold))
}
```
 
### Classification Method:
 
Given the equation of a hyperplane
$$ X \cdot\hat{w} = t $$
where $\hat{w}$ is a unit vector, $X$ is a point, and $t$ is the threshold of the hyperplane, we can rewrite this equation as
$$X \cdot\hat{w}-t =0 $$
A point $X$ lies on the positive side of the hyperplane, and is thus designated class = 1 = biodegradable, if
$$X \cdot\hat{w}-t > 0 $$
 
A point $X$ lies on the negative side of the hyperplane, and is thus designated class = -1 = nonbiodegradable, if
$$X \cdot\hat{w}-t < 0$$
 
So, given the unit vector $\hat{w}$ and the threshold $t$ from a classification method such as the mean method or linear discriminant analysis, it is possible to iterate over the rows of a matrix and, using the above equations, classify each row. It is important to note that the normal vector and threshold result from applying these methods to the training data; the testing data is not used in the construction of the hyperplane.
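As a quick numeric illustration of this sign rule, consider a made-up 2-D hyperplane and two made-up points (these values are purely hypothetical and are not taken from the chemsrus data):

```r
# Hypothetical 2-D example of the sign rule; w, t, X1 and X2 are invented.
w  <- c(1, 0)    # unit normal vector to the hyperplane
t  <- 0.5        # threshold
X1 <- c(2, 3)    # a point on the positive side
X2 <- c(-1, 3)   # a point on the negative side

score <- function(X) sum(X*w) - t   # computes X . w - t

sign(score(X1))  # 1, so X1 would be classified as biodegradable
sign(score(X2))  # -1, so X2 would be classified as nonbiodegradable
```

This is exactly the quantity the classify function below computes for every row of a feature matrix.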
 
Steps:
1) Define a function classify that takes as its input a feature matrix, a normal vector, and a threshold
 
2) Create an empty column vector A that has the same number of rows as the input matrix
 
3) For each row in the input matrix (without a classification), calculate $row \cdot \hat{w} - t$ and store this value in A
 
4) Adjust A such that all the entries are either -1 for values less than 0 or 1 for values greater than 0
 
```{r}
classify<- function(matrix,normalvector,threshold){
A <- vector("numeric",nrow(matrix)) # this is what we end up returning
 
for (i in 1:nrow(matrix)){A[i]= (matrix[i, ] %*% normalvector) - threshold}
ans <- 2*as.numeric(A >0) -1 # maps positive scores to 1 and nonpositive scores to -1
return(ans)
}
```
 
### Testing Accuracy
 
Once we have classified the rows of a data set and obtained a column vector of 1s and -1s, we can test the accuracy of our methods. To achieve this, we define a function called accuracy.
 
accuracy takes the following inputs:
 
- ans: the column vector of 1s and -1s obtained from using either classify (for the mean method) or predict (for LDA) on a data set
 
- class: the column vector corresponding to ans, with the set of 1s and -1s that was part of the original data before it was stripped. For example, if ans was calculated using chemstrain, then class should be trainclass
 
accuracy returns the following outputs:
 
- "p_as_m" is the percentage of biodegradable samples which the model classified as nonbiodegradable
 
- "m_as_p" is the percentage of nonbiodegradable samples which the model classified as biodegradable (this is the more important of the two)
 
- "total_misclass" is the percentage of total misclassified samples
 
Steps:
 
1) Find out how many samples were misclassified by subtracting ans from class; call the result cc
 
2) Count how many of the values in cc are positive; this is the number of positive samples that were classified as negative
 
3) Count how many of the values in cc are negative; this is the number of negative samples that were classified as positive
 
4) Calculate the percent errors
 
```{r}
accuracy<- function(ans,class) {
cc <- class - as.matrix(as.numeric(as.matrix(ans)))
plusasminus <- sum(cc >0) #Number of positive samples which the model classified as negative
minusasplus <- sum(cc <0) #Number of negative samples which the model classified as positive
 
p_as_m <- plusasminus/nrow(class)
 
m_as_p <- minusasplus/nrow(class)
 
total_misclass <- (plusasminus+minusasplus)/nrow(class)
 
return(list("p_as_m"=p_as_m, "m_as_p"=m_as_p, "total_misclass"= total_misclass))
}
```
 
### Comparison of methods:
 
To reduce confusion:
 
- The letter M will precede any variable associated with the Mean Method
 
- The letter L will precede any variable associated with the Fisher LDA Method
 
For each method:
 
1) Create the classification hyperplane from the training data chemstrain, and print the threshold and normal vector
 
```{r}
M <- mean_method(chemstrain)
 
cat("Normal to Mean Method Hyperplane: ")
M$normal
 
cat("Threshold of Mean Method Hyperplane: ", M$threshold)
 
L <- fisher_method(chemstrain)
cat("Normal to Fisher LDA Hyperplane: ")
L$normal
 
cat("Threshold of Fisher LDA Hyperplane:", L$threshold)
 
```
 
2) Classify the data points in chemstrain using classify and trainmatrix, or predict and chemstrain:
 
```{r}
M_train_ans <- classify(trainmatrix,M$normal,M$threshold)
 
L_train_ans <- classify(trainmatrix,L$normal,L$threshold)
 
```
 
3) Compare the accuracy of both methods on the training data using the accuracy function
 
```{r}
M_train_accuracy <- accuracy(M_train_ans,trainclass)
L_train_accuracy <- accuracy(L_train_ans,trainclass)
```
 
```{r}
M_train_accuracy
```
```{r}
L_train_accuracy
```
 
The accuracy of Fisher LDA is much better than that of the Mean Method.
 
4) Classify the data points in chemstest using classify and testmatrix, or predict and chemstest
 
```{r}
M_test_ans <- classify(testmatrix,M$normal,M$threshold)
 
L_test_ans <- classify(testmatrix,L$normal,L$threshold)
 
## L_test_ans <- predict(L$z,chemstest)$class (leads to the same result as classify, but less intuitive)
```
 
5) Compare the accuracy of both methods on the testing data using the accuracy function
 
```{r}
M_test_accuracy <- accuracy(M_test_ans,testclass)
L_test_accuracy <- accuracy(L_test_ans,testclass)
```
 
```{r}
M_test_accuracy
```
 
```{r}
L_test_accuracy
```
 
Once again, Fisher LDA is much more accurate than the Mean Method.
 
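Beyond the percentages returned by accuracy, the raw misclassification counts can be inspected with R's built-in table(). This is a sketch on made-up predicted and actual class vectors, not the chemsrus results:

```r
# Toy vectors of predicted and actual classes (+1 / -1), for illustration only
predicted <- c( 1,  1, -1, -1,  1, -1)
actual    <- c( 1, -1, -1,  1,  1, -1)

# Rows are the actual class, columns the predicted class.
# The off-diagonal cells are the misclassification counts:
# row 1, column -1 corresponds to p_as_m,
# row -1, column 1 corresponds to m_as_p.
table(actual, predicted)
```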
## Making some histograms:
To better visualize the differences in accuracy from the previous section, we can create a histogram of the data with respect to a threshold. Points to the right of the threshold are those that have been classified as biodegradable, and points to the left of the threshold are those that have been classified as nonbiodegradable. The greater the number of points within a particular region, the taller the histogram is at that point.
 
The functions we use here produce two such histograms:
- The top histogram represents the points that, in the original data, were considered to be nonbiodegradable. If a model perfectly classifies these points, then we should expect no points to the right of the threshold
 
- The bottom histogram represents the points that, in the original data, were considered to be biodegradable. If a model perfectly classifies these points, then we should expect no points to the left of the threshold
 
 
Here histopair is a function that creates a pair of histograms given class projections, a threshold and some bounds, and histoplot is a support function that helps create some of the inputs to histopair.
 
```{r}
#A routine to plot a pair of histograms and a threshold
histopair <-function(pminus,pplus,thresh,yy=c(0,60),label2="Plus",label1="Minus") {
require(ggplot2); require(gridExtra)
hist1 <- ggplot(as.data.frame(pminus), aes(pminus)) + geom_histogram(col="blue",fill="blue")
hist2 <- ggplot(as.data.frame(pplus), aes(pplus)) + geom_histogram(col="red",fill="red")
df <- data.frame(x1=c(thresh,thresh),y1=c(yy[1],yy[2]))
pmin <- min(pminus,pplus)
pmax<- max(pminus,pplus)
me1 <- hist1 + expand_limits(x=c(pmin,pmax)) + geom_line(data=df,aes(x=x1,y=y1)) + xlab(label1)
me2 <- hist2 + expand_limits(x=c(pmin,pmax)) + geom_line(data=df,aes(x=x1,y=y1)) + xlab(label2)
pl <- vector("list",2)
pl[[1]] <- me1; pl[[2]]<-me2;
grid.arrange(grobs=pl,ncol=1) # grid.arrange's argument is ncol, not ncols
}
 
## A function to streamline the histogram analysis
## matrix is either trainmatrix or testmatrix; these are the feature matrices
## normalvector is a normal vector to a classifying hyperplane; either M$normal or L$normal
## threshold is the threshold of a classifying hyperplane; either M$threshold or L$threshold
## classvector is the vector of class labels for the associated data set
 
histoplot<- function(matrix,normalvector,threshold,classvector){
proj <- matrix %*% as.matrix(normalvector)
pplus<- proj[classvector[]>0] #All the class 1 projections
pminus<- proj[classvector[]<0] #All the class -1 projections
histopair(pminus,pplus,threshold,label1="Not Readily Biodegradable",label2="Readily Biodegradable")}
 
```
 
 
Now we'll create some histograms of the training data.
 
Histogram of training data using the Mean Method
```{r}
histoplot(trainmatrix,M$normal,M$threshold,trainclass)
```
 
Histogram of training data using Fisher LDA
```{r}
histoplot(trainmatrix,L$normal,L$threshold,trainclass)
```
 
It's clear that Fisher LDA is the much better method to use.
 
 
Now we'll create some histograms of the testing data. Points to the right of the threshold indicate biodegradable chemicals, and points to the left of the threshold indicate nonbiodegradable chemicals. It should be noted that the counts on these histograms will be much lower than those of the training data, as we are looking at a much smaller sample.
 
 
Histogram of testing data using the Mean Method
```{r}
histoplot(testmatrix,M$normal,M$threshold,testclass)
```
 
Histogram of testing data using Fisher LDA
```{r}
histoplot(testmatrix,L$normal,L$threshold,testclass)
```
 
Once again, it's clear that Fisher LDA is the much better method to use.