flutedaddyfunk

a) Data Description.Rmd

Apr 23rd, 2017
---
title: "Classification Mini Project: Data Description"
output: html_notebook
---
Group Members: \
Rachel G: rachamin12@gmail.com, \
James Hick: redsoxfan765@gmail.com, \
Quientin Morrison: morriq@rpi.edu

### Import Libraries:
```{r}
library(readr)
library(MASS)
library(devtools)
library(ggplot2)
```
### Read the Data:
This reads the CSV files published from Google Drive: chemsdata <- chemsrus.csv and targetdata <- chemstest.csv.\
The function suppressMessages() hides the parsing output; chemsdata and targetdata are still created in the coding environment.

```{r}
chemsdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1arHUuWJrVjpZboLOJa97iIbPzCszX8stE-fbYhw2OCA/pub?gid=1533528387&single=true&output=csv")))

targetdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1NtoMaw06IlCDJ3k9Rxtcv01u2wUwdAh8D5rajytiNpQ/pub?gid=899540023&single=true&output=csv")))
```
### Data Manipulation: targetdata <- chemstest.csv
Steps:

1) Show the dimensions of targetdata. The first number is the number of samples and the second is the number of columns. Column 1 = ID Number, Column 43 = Class (Undetermined), Columns 2-42 = features.

```{r}
cat("Dimensions of chemstest.csv: ", dim(targetdata))
```

2) Strip the first and last columns of targetdata (the ID number and the class); call the result target.

```{r}
target <- targetdata[, 2:(ncol(targetdata) - 1)]
```
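As a minimal sketch of what this column selection does, here is the same indexing on a small made-up data frame. The names `toy`, `toy_features`, and `toy_features_alt`, and all of the values, are hypothetical and are not part of the project data.

```r
# Hypothetical toy data: an ID column, two feature columns, and a class column
toy <- data.frame(
  id    = 1:3,
  f1    = c(0.5, 1.5, 2.5),
  f2    = c(10, 20, 30),
  class = c(1, -1, 1)
)

# Keep columns 2 through (ncol - 1), which drops the ID and class columns
toy_features <- toy[, 2:(ncol(toy) - 1)]

# Negative indexing drops the same two columns and gives the same result
toy_features_alt <- toy[, -c(1, ncol(toy))]

names(toy_features)
```

Either form should work here; the positive range mirrors the notebook, while the negative index states directly which columns are removed.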
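
3) Calculate the mean of each feature in target; call this m_target. Multiplying a row vector of ones by the data matrix sums each column, and scaling by 1/nrow(target) turns those sums into column means.

```{r}
m_target <- (1 / nrow(target)) * rep(1, nrow(target)) %*% as.matrix(target)
m_target
```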
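The matrix product above should give the same numbers as base R's `colMeans()`. The sketch below checks this on a small made-up data frame; `toy`, `m_matrix`, and `m_builtin` are hypothetical names, and the values are invented for illustration.

```r
# Hypothetical toy data with two numeric feature columns
toy <- data.frame(f1 = c(1, 2, 3), f2 = c(10, 20, 60))
n <- nrow(toy)

# Ones-vector approach: (1/n) * 1' X gives a 1 x p matrix of column means
m_matrix <- (1 / n) * rep(1, n) %*% as.matrix(toy)

# Built-in approach: a named numeric vector of column means
m_builtin <- colMeans(toy)

# Compare the values only, since one result is a matrix and the other a vector
all.equal(as.numeric(m_matrix), as.numeric(m_builtin))
```

The two results differ only in shape: the product is a one-row matrix, while `colMeans()` returns a named vector.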

4) Boxplot of target
- I set the outliers to be colored orange for visual appeal
- I set the factor levels of the index to its unique values in order of appearance; that is, the features are plotted in the same order in which they are arranged in targetdata
- I perform a coordinate flip: normally the feature names would overlap; this avoids that problem and makes the graph more visually appealing

```{r}
ggplot(data = stack(target), aes(x = factor(ind, levels = unique(ind)), y = values)) +
  geom_boxplot(varwidth = TRUE, fill = "white", outlier.color = "orange") +
  ggtitle("Boxplot of chemstest.csv") +
  labs(x = "Features", y = "Values") +
  coord_flip()
```
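To illustrate the reshaping and level ordering used in the plot, the sketch below runs `stack()` on a made-up data frame and compares alphabetical factor levels with levels taken in order of appearance. The names `toy`, `long`, `default_order`, and `column_order` are hypothetical.

```r
# Hypothetical toy data whose column order is deliberately not alphabetical
toy <- data.frame(zeta = c(1, 2), alpha = c(3, 4))

# stack() reshapes to long format: `values` holds the data, `ind` the column name
long <- stack(toy)

# Building a factor from plain strings sorts the levels alphabetically
default_order <- levels(factor(as.character(long$ind)))

# Supplying levels = unique(ind) keeps the original column order instead
column_order <- levels(factor(long$ind, levels = unique(long$ind)))

default_order
column_order
```

Passing the levels explicitly makes the intended axis order visible in the plotting call rather than leaving it to how the index column happens to be stored.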

### Data Manipulation: chemsdata <- chemsrus.csv
Steps:

1) Show the dimensions of chemsdata

```{r}
cat("Dimensions of chemsrus.csv: ", dim(chemsdata))
```

2) Create two separate data frames, degrade and nondegrade, that consist of the data with class 1 and -1 respectively. I achieve this with the subset() function, which takes as its input the original data and a logical condition on the column you want to classify by (here, the class column compared with the classification value).

```{r}
degrade <- subset(chemsdata, class == 1)
nondegrade <- subset(chemsdata, class == -1)
```
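As a small illustration of this split, the sketch below applies `subset()` to a made-up data frame and compares it with plain logical indexing. The names `toy`, `pos`, `neg`, and `pos_alt`, and the values, are hypothetical.

```r
# Hypothetical toy data: an ID, one feature, and a class label of 1 or -1
toy <- data.frame(
  id    = 1:4,
  f1    = c(0.1, 0.2, 0.3, 0.4),
  class = c(1, -1, 1, -1)
)

# subset() keeps the rows for which the condition is TRUE
pos <- subset(toy, class == 1)
neg <- subset(toy, class == -1)

# Logical row indexing selects the same rows when the class column has no NAs
pos_alt <- toy[toy$class == 1, ]

nrow(pos)
nrow(neg)
```

One difference worth knowing: `subset()` drops rows where the condition is `NA`, whereas logical indexing returns a row of `NA`s for them.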

3) For each of these, strip the first and last columns

```{r}
degrade <- degrade[, 2:(ncol(degrade) - 1)]
nondegrade <- nondegrade[, 2:(ncol(nondegrade) - 1)]
```

4) Find the mean of degrade and nondegrade: m_degrade and m_nondegrade

```{r}
m_degrade <- (1 / nrow(degrade)) * rep(1, nrow(degrade)) %*% as.matrix(degrade)

m_degrade

m_nondegrade <- (1 / nrow(nondegrade)) * rep(1, nrow(nondegrade)) %*% as.matrix(nondegrade)

m_nondegrade
```

5) Boxplots for degrade and nondegrade:
- I set the outliers to be colored (green for degrade, red for nondegrade) for visual appeal
- I set the factor levels of the index to its unique values in order of appearance; that is, the features are plotted in the same order in which they are arranged in degrade and nondegrade
- I perform a coordinate flip: normally the feature names would overlap; this avoids that problem and makes the graph more visually appealing

```{r}
ggplot(data = stack(degrade), aes(x = factor(ind, levels = unique(ind)), y = values)) +
  geom_boxplot(varwidth = TRUE, fill = "white", outlier.color = "green") +
  ggtitle("Boxplot of Biodegradable Items in chemsrus.csv") +
  labs(x = "Features", y = "Values") +
  coord_flip()

ggplot(data = stack(nondegrade), aes(x = factor(ind, levels = unique(ind)), y = values)) +
  geom_boxplot(varwidth = TRUE, fill = "white", outlier.color = "red") +
  ggtitle("Boxplot of Nonbiodegradable Items in chemsrus.csv") +
  labs(x = "Features", y = "Values") +
  coord_flip()
```