---
title: "Classification Mini Project: Data Description"
output: html_notebook
---
Group Members: \
Rachel G: rachamin12@gmail.com, \
James Hick: redsoxfan765@gmail.com, \
Quientin Morrison: morriq@rpi.edu
### Import Libraries:
```{r}
library(readr)
library(MASS)
library(devtools)
library(ggplot2)
```
### Read the Data:
This reads the published Google Drive URLs corresponding to chemsdata (chemsrus.csv) and targetdata (chemstest.csv).\
The function suppressMessages() hides the parsing output; chemsdata and targetdata still exist in the coding environment.
```{r}
# Read each published CSV; suppressMessages() silences the column-type report from read_csv()
chemsdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1arHUuWJrVjpZboLOJa97iIbPzCszX8stE-fbYhw2OCA/pub?gid=1533528387&single=true&output=csv")))
targetdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1NtoMaw06IlCDJ3k9Rxtcv01u2wUwdAh8D5rajytiNpQ/pub?gid=899540023&single=true&output=csv")))
```
### Data Manipulation: targetdata <- chemstest.csv
Steps:

1) Show the dimensions of targetdata. The first number is the number of samples and the second is the number of columns. Column 1 = ID number, Column 43 = class (undetermined), Columns 2-42 = features.
```{r}
cat("Dimensions of chemstest.csv: ", dim(targetdata))
```
2) Strip the first and last columns of targetdata; call the result target.
```{r}
# Keep only the feature columns (drop the ID column and the class column)
target <- targetdata[ ,2:(ncol(targetdata)-1)]
```
3) Calculate the column means of target; call the result m_target.
```{r}
# A row vector of ones times the data matrix sums each column; dividing by nrow gives the means
m_target <- (1/nrow(target))*rep(1,nrow(target)) %*% as.matrix(target)
m_target
```
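As a sanity check on the ones-vector formula used in step 3, the sketch below compares it with base R's colMeans() on a small made-up data frame. The names toy and m_toy and the values are illustrative assumptions, not taken from chemstest.csv.

```{r}
# Hypothetical toy data standing in for target
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# Same formula as in step 3: (1/n) * (row of ones) %*% X
m_toy <- (1/nrow(toy)) * rep(1, nrow(toy)) %*% as.matrix(toy)

# colMeans() computes the column means directly; the two should agree
all.equal(as.numeric(m_toy), as.numeric(colMeans(toy)))
```

If the two agree on the toy data, colMeans(target) could be used as a shorter alternative to the matrix product.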
4) Boxplot of target

- I color the outliers orange for visual appeal.
- I set the factor levels to unique(ind), so the features appear in the same order as the columns of targetdata.
- I flip the coordinates: on a horizontal axis the feature names would overlap, so placing them on the vertical axis keeps them legible.

```{r}
# Stack the feature columns into long format (values, ind) and draw one box per feature
ggplot(data=stack(target), aes(x=factor(ind, levels=unique(ind)), y=values)) +
  geom_boxplot(varwidth=TRUE, fill="white", outlier.color="orange") +
  ggtitle("Boxplot of chemstest.csv") +
  labs(x="Features", y="Values") +
  coord_flip()
```
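To show what stack() hands to ggplot in the boxplot step, here is a minimal sketch on a made-up two-column data frame. The names toy_wide, toy_long and toy_levels are hypothetical and used only for illustration.

```{r}
# Hypothetical wide data: columns deliberately not in alphabetical order
toy_wide <- data.frame(b = c(1, 2), a = c(3, 4))

# stack() returns a long data frame with a `values` column and an `ind` column of source names
toy_long <- stack(toy_wide)
toy_long

# Using unique(ind) as the levels keeps the original column order (b before a)
toy_levels <- levels(factor(toy_long$ind, levels = unique(toy_long$ind)))
toy_levels
```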
### Data Manipulation: chemsdata <- chemsrus.csv
Steps:

1) Show the dimensions of chemsdata.
```{r}
cat("Dimensions of chemsrus.csv: ", dim(chemsdata))
```
2) Create two separate data frames, degrade and nondegrade, consisting of the rows with class 1 and -1 respectively. I do this with subset(), which takes the original data and a logical condition on a column (here, class) and returns the rows where that condition holds.
```{r}
# Split the rows by class label: 1 = biodegradable, -1 = nonbiodegradable
degrade <- subset(chemsdata, class==1)
nondegrade <- subset(chemsdata, class==-1)
```
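To illustrate how subset() splits rows by class, here is a minimal sketch on a made-up data frame. The name toy_chems, its columns and its values are assumptions for illustration and are not drawn from chemsrus.csv.

```{r}
# Hypothetical data with an ID column, one feature, and a class label of 1 or -1
toy_chems <- data.frame(id = 1:4,
                        f1 = c(0.5, 1.2, 3.3, 0.1),
                        class = c(1, -1, 1, -1))

# subset() keeps the rows for which the condition is TRUE
toy_degrade <- subset(toy_chems, class == 1)
toy_nondegrade <- subset(toy_chems, class == -1)

# When every label is 1 or -1, each row lands in exactly one of the two groups
nrow(toy_degrade) + nrow(toy_nondegrade) == nrow(toy_chems)
```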
3) For each of these, strip the first and last columns.
```{r}
# Keep only the feature columns (drop the ID column and the class column)
degrade <- degrade[ ,2:(ncol(degrade)-1)]
nondegrade <- nondegrade[ , 2:(ncol(nondegrade)-1)]
```
4) Find the column means of degrade and nondegrade: m_degrade and m_nondegrade.
```{r}
# Same ones-vector formula as for m_target, applied to each class separately
m_degrade <- (1/nrow(degrade))*rep(1,nrow(degrade)) %*% as.matrix(degrade)
m_degrade
m_nondegrade <- (1/nrow(nondegrade))*rep(1,nrow(nondegrade)) %*% as.matrix(nondegrade)
m_nondegrade
```
5) Boxplots for degrade and nondegrade:

- I color the outliers (green for degrade, red for nondegrade) for visual appeal.
- I set the factor levels to unique(ind), so the features appear in the same order as the columns of degrade and nondegrade.
- I flip the coordinates: on a horizontal axis the feature names would overlap, so placing them on the vertical axis keeps them legible.

```{r}
# One boxplot per class, each with one box per feature
ggplot(data=stack(degrade), aes(x=factor(ind, levels=unique(ind)), y=values)) +
  geom_boxplot(varwidth=TRUE, fill="white", outlier.color="green") +
  ggtitle("Boxplot of Biodegradable Items in chemsrus.csv") +
  labs(x="Features", y="Values") +
  coord_flip()
ggplot(data=stack(nondegrade), aes(x=factor(ind, levels=unique(ind)), y=values)) +
  geom_boxplot(varwidth=TRUE, fill="white", outlier.color="red") +
  ggtitle("Boxplot of Nonbiodegradable Items in chemsrus.csv") +
  labs(x="Features", y="Values") +
  coord_flip()
```