a) Data Description.Rmd

---
title: "Classification Mini Project: Data Description"
output: html_notebook
---
Group Members: \
Rachel G: rachamin12@gmail.com, \
James Hick: redsoxfan765@gmail.com, \
Quientin Morrison: morriq@rpi.edu

### Import Libraries:
```{r}
library(readr)
library(MASS)
library(devtools)
library(ggplot2)
```
### Read the Data:
This reads in the published Google Drive URLs corresponding to chemsrus.csv (stored as chemsdata) and chemstest.csv (stored as targetdata).\
The function suppressMessages() hides the parsing output; chemsdata and targetdata still exist in the coding environment.

```{r}
chemsdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1arHUuWJrVjpZboLOJa97iIbPzCszX8stE-fbYhw2OCA/pub?gid=1533528387&single=true&output=csv")))

targetdata <- suppressMessages(read_csv(url("https://docs.google.com/spreadsheets/d/1NtoMaw06IlCDJ3k9Rxtcv01u2wUwdAh8D5rajytiNpQ/pub?gid=899540023&single=true&output=csv")))
```
### Data Manipulation: targetdata <- chemstest.csv
Steps:

1) Show the dimensions of targetdata: the first number is the number of samples and the second is the number of columns. Column 1 = ID number, column 43 = class (undetermined), columns 2-42 = features

```{r}
cat("Dimensions of chemstest.csv: ",  dim(targetdata))
```

2) Strip the first and last columns (ID number and class) of targetdata; call the result target

```{r}
target <- targetdata[ ,2:(ncol(targetdata)-1)]
```
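
As a minimal sketch of this column-stripping step, the toy data frame below (the names `toy_data`, `toy_features`, and `toy_features_alt` are hypothetical and not part of the project data) shows that the positive range `2:(ncol(x)-1)` and negative indexing select the same feature columns, assuming the ID is the first column and the class is the last.

```{r}
# Hypothetical toy data: first column is an ID, last column is the class label
toy_data <- data.frame(id = 1:3,
                       f1 = c(0.1, 0.2, 0.3),
                       f2 = c(5, 6, 7),
                       class = c(1, -1, 1))

# Keep columns 2 through (last - 1), as in the step above
toy_features <- toy_data[, 2:(ncol(toy_data) - 1)]

# Equivalent: drop the first and last columns by negative index
toy_features_alt <- toy_data[, -c(1, ncol(toy_data))]

names(toy_features)
identical(toy_features, toy_features_alt)
```

Either form depends on the column layout, so it is worth checking the dimensions first, as in step 1.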

3) Calculate the mean of each feature (column) of target; call this m_target

```{r}
# Column means as (1/n) * 1^T X: a row vector of ones times the data matrix, scaled by 1/n
m_target <- (1/nrow(target))*rep(1,nrow(target)) %*% as.matrix(target)
m_target
```
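
The expression above computes the column means as a matrix product: a vector of ones times the data matrix, scaled by 1/n. The sketch below checks, on a hypothetical toy data frame (`toy_X` and the other `toy_` names are illustrative assumptions), that this agrees with base R's `colMeans()`; only the shape differs (a 1 x p matrix versus a named vector).

```{r}
# Hypothetical toy features: 4 samples, 2 features
toy_X <- data.frame(f1 = c(1, 2, 3, 6), f2 = c(10, 20, 30, 40))
toy_n <- nrow(toy_X)

# Matrix form used in this notebook: (1/n) * 1^T X, giving a 1 x p matrix
toy_mean_matrix <- (1 / toy_n) * rep(1, toy_n) %*% as.matrix(toy_X)

# Built-in equivalent, giving a named numeric vector of length p
toy_mean_vector <- colMeans(toy_X)

toy_mean_matrix
toy_mean_vector
all.equal(as.numeric(toy_mean_matrix), as.numeric(toy_mean_vector))
```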

4) Boxplot of target
  - I color the outliers orange for visual appeal
  - I set the factor levels of the indices to their unique values in order of appearance; that is, the features are plotted in the same order as they are arranged in targetdata
  - I perform a coordinate flip: the feature names would normally overlap along the horizontal axis; flipping avoids that problem and makes the graph more visually appealing

```{r}
ggplot(data = stack(target), aes(x = factor(ind, levels = unique(ind)), y = values)) +
  geom_boxplot(varwidth = TRUE, fill = "white", outlier.color = "orange") +
  ggtitle("Boxplot of chemstest.csv") +
  labs(x = "Features", y = "Values") +
  coord_flip()
```
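
To illustrate what the plotting call relies on, here is a small sketch on a hypothetical toy data frame (`toy_wide`, `toy_long`, and `toy_ind` are illustrative names). Base R's `stack()` reshapes wide data into two columns, `values` and `ind`, which is where the names in `aes()` come from, and `factor(ind, levels = unique(ind))` fixes the level order to the order in which the columns appear rather than alphabetical order.

```{r}
# Hypothetical toy data whose column order is deliberately not alphabetical
toy_wide <- data.frame(zeta = c(1, 2), alpha = c(3, 4))

# stack() returns a long data frame with columns `values` and `ind`
toy_long <- stack(toy_wide)

# Levels follow order of first appearance, i.e. the original column order
toy_ind <- factor(toy_long$ind, levels = unique(toy_long$ind))

toy_long
levels(toy_ind)
```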

### Data Manipulation: chemsdata <- chemsrus.csv
Steps:

1) Show the dimensions of chemsdata

```{r}
cat("Dimensions of chemsrus.csv: ", dim(chemsdata))
```

2) Create two separate data sets, degrade and nondegrade, that consist of the rows with class 1 and -1 respectively. I achieve this with the subset() function, which takes as its input the original data and a logical condition on the column to classify by (here, class equal to the desired classification value).

```{r}
degrade <-subset(chemsdata, class==1)
nondegrade <-subset(chemsdata, class==-1)
```
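
As a rough sketch of how this split behaves, the toy example below (all `toy_` names and values are hypothetical) applies `subset()` with the same conditions and compares it with plain logical indexing. It also checks that the two classes together account for every row, which assumes the class column only contains 1 and -1.

```{r}
# Hypothetical toy data with an ID, one feature, and a class label of 1 or -1
toy_chems <- data.frame(id = 1:5,
                        f1 = c(0.5, 1.5, 2.5, 3.5, 4.5),
                        class = c(1, -1, 1, -1, -1))

# Split by class, as in the step above
toy_degrade <- subset(toy_chems, class == 1)
toy_nondegrade <- subset(toy_chems, class == -1)

# The same rows selected with logical indexing
toy_degrade_alt <- toy_chems[toy_chems$class == 1, ]

# Class counts, and a check that no rows were lost in the split
table(toy_chems$class)
nrow(toy_degrade) + nrow(toy_nondegrade) == nrow(toy_chems)
```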

3) For each of these, strip the first and last columns

```{r}
degrade <- degrade[ ,2:(ncol(degrade)-1)]
nondegrade <-  nondegrade[ , 2:(ncol(nondegrade)-1)]
```


4) Find the mean of each feature in degrade and nondegrade; call these m_degrade and m_nondegrade

```{r}
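# Feature means of the class 1 samples: (1/n) * 1^T X
m_degrade <- (1/nrow(degrade))*rep(1,nrow(degrade)) %*% as.matrix(degrade)

m_degrade

# Feature means of the class -1 samples: (1/n) * 1^T X
m_nondegrade <- (1/nrow(nondegrade))*rep(1,nrow(nondegrade)) %*% as.matrix(nondegrade)

m_nondegrade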
```
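
For comparison, the per-class means can also be computed in one pass. The sketch below is an alternative formulation on hypothetical toy data (the `toy_` names are assumptions), not the method used in this notebook: it splits the feature columns by class and applies `colMeans()` to each group.

```{r}
# Hypothetical toy data: ID, two features, and a class label of 1 or -1
toy_labeled <- data.frame(id = 1:4,
                          f1 = c(1, 3, 10, 20),
                          f2 = c(2, 4, 30, 50),
                          class = c(1, 1, -1, -1))

# Feature columns only (strip the ID and class columns)
toy_feats <- toy_labeled[, 2:(ncol(toy_labeled) - 1)]

# One column of feature means per class; columns are named "-1" and "1"
toy_class_means <- sapply(split(toy_feats, toy_labeled$class), colMeans)

toy_class_means

# Difference between the class means for each feature
toy_class_means[, "1"] - toy_class_means[, "-1"]
```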


5) Boxplots for degrade and nondegrade:
  - I color the outliers (green for degrade, red for nondegrade) for visual appeal
  - I set the factor levels of the indices to their unique values in order of appearance; that is, the features are plotted in the same order as they are arranged in degrade and nondegrade
  - I perform a coordinate flip: the feature names would normally overlap along the horizontal axis; flipping avoids that problem and makes the graph more visually appealing

```{r}
ggplot(data = stack(degrade), aes(x = factor(ind, levels = unique(ind)), y = values)) +
  geom_boxplot(varwidth = TRUE, fill = "white", outlier.color = "green") +
  ggtitle("Boxplot of Biodegradable Items in chemsrus.csv") +
  labs(x = "Features", y = "Values") +
  coord_flip()

ggplot(data = stack(nondegrade), aes(x = factor(ind, levels = unique(ind)), y = values)) +
  geom_boxplot(varwidth = TRUE, fill = "white", outlier.color = "red") +
  ggtitle("Boxplot of Nonbiodegradable Items in chemsrus.csv") +
  labs(x = "Features", y = "Values") +
  coord_flip()
```