- 1. Find the range, mean, median, variance, and standard deviation of the vector:
- c(78L, 67L, 62L, 37L, 86L, 53L, 59L, 7L, 57L, 95L, 52L, 62L, 73L, 93L, 60L, 2L, 8L, 1L, 36L, 48L, 92L, 1L, 80L, 19L, 62L, 6L, 20L, 75L, 19L, 18L, 45L, 97L, 75L, 14L, 24L, 48L, 63L, 90L, 77L, 4L, 96L, 7L, 89L, 45L, 95L, 68L, 2L, 56L, 48L, 54L, 65L, 29L, 68L, 31L, 7L, 14L, 92L, 59L, 42L, 9L, 48L, 33L, 82L, 62L, 88L, 70L, 34L, 55L, 48L, 46L, 76L, 45L, 62L, 100L, 47L, 2L, 46L, 99L, 28L, 27L, 31L, 64L, 17L, 19L, 82L, 8L, 23L, 7L, 87L, 15L, 83L, 12L, 36L, 36L, 64L, 67L, 48L, 94L, 8L, 43L, 34L, 23L, 64L, 20L, 55L, 21L, 63L, 61L, 1L, 46L, 82L, 33L, 36L, 1L, 82L, 32L, 55L, 58L, 44L, 3L, 3L, 40L, 45L, 78L, 62L, 23L, 87L, 86L, 82L, 28L, 24L, 23L, 91L, 86L, 69L, 98L, 2L, 73L, 64L, 28L, 66L, 61L, 19L, 57L, 70L, 49L, 68L, 62L, 63L, 26L, 5L, 45L, 81L, 17L, 44L, 74L, 97L, 44L, 8L, 34L, 97L, 42L, 58L, 71L, 65L, 68L, 35L, 81L, 87L, 47L, 20L, 89L, 63L, 49L, 23L, 2L, 36L, 100L, 64L, 99L, 92L, 53L, 48L, 66L, 12L, 86L, 27L, 50L, 96L, 68L, 78L, 40L, 82L, 35L, 39L, 22L, 19L, 18L, 37L, 23L, 76L, 100L, 10L, 39L, 87L, 38L, 25L, 82L, 16L, 68L, 18L, 5L, 94L, 19L, 47L, 82L, 31L, 89L, 54L, 52L, 85L, 62L, 34L, 52L, 34L, 71L, 42L, 95L, 86L, 31L, 41L, 74L, 44L, 50L, 63L, 94L, 96L, 28L, 13L, 35L, 85L, 55L, 27L, 32L, 32L, 5L, 1L, 25L, 62L, 51L, 80L, 30L, 52L, 56L, 64L, 50L, 5L, 78L, 98L, 26L, 77L, 62L, 73L, 42L, 18L, 70L, 55L, 98L, 93L, 53L, 38L, 71L, 85L, 50L, 23L, 21L, 64L, 88L, 91L, 70L, 19L, 83L, 25L, 38L, 48L, 65L, 68L, 8L, 43L, 53L, 9L, 69L, 58L, 17L, 80L, 95L, 14L, 75L, 5L, 90L, 71L, 65L, 57L, 42L, 49L, 70L, 65L, 1L, 88L, 62L, 11L, 30L, 3L, 53L, 11L, 17L, 52L, 96L, 44L, 1L, 21L, 69L, 82L, 57L, 23L, 69L, 29L, 85L, 16L, 27L, 32L, 100L, 25L, 62L, 3L, 83L, 62L, 85L, 48L, 57L, 84L, 10L, 54L, 80L, 50L, 89L, 55L, 50L, 20L, 77L, 76L, 90L, 23L, 56L, 72L, 77L, 12L, 38L, 36L, 1L, 30L, 75L, 47L, 97L, 46L, 99L, 78L, 94L, 65L, 60L, 30L, 69L, 94L, 50L, 10L, 30L, 8L, 28L, 37L, 21L, 89L, 80L, 53L, 84L, 64L, 8L, 66L, 39L, 30L, 53L, 29L, 93L, 94L, 12L, 82L, 97L, 40L, 64L, 62L, 29L)
- !! x <- c(...)  # first assign the vector above to x
- !! c(range(x), mean(x), median(x), var(x), sd(x))
- !! [1] 1.00000 100.00000 51.13750 52.00000 794.10385 28.17985
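As a quick sanity check on these functions (a toy sketch, not part of the assignment data): sd() is just the square root of var(), and both use the sample (n - 1) denominator.

```r
# Toy vector to illustrate the summary statistics from problem 1
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

range(x)    # 2 9
mean(x)     # 40 / 8 = 5
median(x)   # average of the two middle values: (4 + 5) / 2 = 4.5
var(x)      # sample variance, n - 1 denominator: 32 / 7
sd(x)       # equals sqrt(var(x))
```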
- 2. Use the quantile() function to find the 0%, 25%, 50%, 75% and 100% quantiles
- !! quantile(x)
- !! 0% 25% 50% 75% 100%
- !! 1.00 28.00 52.00 74.25 100.00
- 3. Use the quantile() function with a probs argument to find what the 40% mark is
- !! > quantile(x, probs=0.4)
- !! 40%
- !! 44
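How quantile() arrives at a value like 44: with the default type = 7 it linearly interpolates between the sorted values. A small sketch on a toy vector (not the assignment data):

```r
# With the default type = 7, quantile() interpolates between sorted values:
# the 40% point of 1..10 sits at position (10 - 1) * 0.4 + 1 = 4.6
x <- 1:10
quantile(x, probs = 0.4)            # 40% -> 4.6
quantile(x, probs = c(0.1, 0.9))    # several probabilities at once
```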
- 4. The following matrix has points from two clusters. How many are in the cluster with smaller x values? How many in the cluster with larger x values?
- structure(c(7.6, 3.4, 10.2, 4.4, 8, 3.7, 3.3, 3.1, 2.1, 8.6,
- 9.1, 3.7, 7.6, 3, 8.8, 10.1, 2.9, 3.2, 9.1, 8.4, 0.2, 4.7, 2.9,
- 1.8, 10.5, 9, 3.6, 1.6, 2.6, 8, 8.3, 8.2, 3.7, 3.4, 9.6, 10.8,
- 4.4, 4.1, 3, 10.1, 3.6, 2.6, 2, 7.4, 8.8, 10.1, 8.7, 1.7, 8,
- 3.8, 4.1, 9.8, 2.4, 9.9, 1.5, 9.9, 10, 9.8, 8.6, 5.1, 4, 9.1,
- 4.3, 7.5, 1.8, 2.8, 7.6, 8.2, 3.3, 2.3, 9.2, 7.6, 7.7, 9.5, 1.6,
- 1.4, 10, 9.2, 8.4, 4.2, 3.5, 4.3, 9.4, 8.9, 4.6, 3.5, 9.7, 9.7,
- 8.8, 3.7, 9.3, 11.6, 9.1, 3, 4.7, 4.5, 3.6, 8.8, 5.1, 8), .Dim = c(50L,
- 2L)) . Try plotting the points with the first column as x values and the second column as y values. Does that help?
- !! plot(x)
- !! > sum(x[,1]>6)
- !! [1] 23
- !! > sum(x[,1]<6)
- !! [1] 27
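The x[,1] indexing above pulls out the first column of the matrix, and summing a logical comparison counts the points on each side of the gap. The same pattern on a toy matrix where the counts are easy to verify by eye:

```r
# Toy 5x2 matrix: matrix() fills column by column
m <- matrix(c(1, 9, 2, 8, 1.5,    # first column: x values
              5, 5, 5, 5, 5),     # second column: y values
            ncol = 2)
sum(m[, 1] > 6)   # larger-x cluster: 2 points (9 and 8)
sum(m[, 1] < 6)   # smaller-x cluster: 3 points (1, 2, 1.5)
```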
- 5. This problem uses the "College" data frame from the ISLR library. You may need to run install.packages("ISLR") first. Then use library(ISLR) to load the library. For example, College["Harvard University",] is the row on Harvard. You can use grep("harvard", rownames(College), ignore.case=TRUE) to find a college. For this problem, return how many universities are Private: compare the College$Private column with the "Yes" string, and sum up the result.
- !! > summary(College$Private)
- !! No Yes
- !! 212 565
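summary() on the factor gives the counts directly, but the comparison-plus-sum approach the question asks for looks like this (sketched on a tiny made-up stand-in for College, since the real data frame needs ISLR installed):

```r
# Toy stand-in for ISLR's College data frame (the real one has 777 rows)
College <- data.frame(Private = factor(c("Yes", "No", "Yes", "Yes", "No")),
                      row.names = c("A", "B", "C", "D", "E"))

# The == comparison yields a logical vector; sum() counts the TRUEs
sum(College$Private == "Yes")   # 3 private schools in the toy data
```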
- 6. For example, with ISLR::College you can check whether private school students spend more money on books than public school students with aggregate(Books~Private,College,mean), which shows $547 for private school students and $554 for public school students. For this problem, compute the graduation rates for private vs public schools.
- !! > aggregate(Grad.Rate ~ Private, College, mean)
- !! Private Grad.Rate
- !! 1 No 56.04245
- !! 2 Yes 68.99823
- 7. Compute a vector for whether a school spends more than the average instructional Expenditure per student. What is the graduation rate for high spenders vs low spenders?
- !! > aggregate(Grad.Rate ~ (Expend > mean(Expend)), College, mean)
- !! Expend > mean(Expend) Grad.Rate
- !! 1 FALSE 61.60971
- !! 2 TRUE 73.03817
- 8. How many schools spend (Expend) more than two standard deviations above the mean?
- !! > summary(College$Expend > mean(College$Expend) + 2*sd(College$Expend))
- !! Mode FALSE TRUE NA's
- !! logical 749 28 0
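The same mean-plus-two-sd pattern on deterministic toy numbers, where the answer can be checked by hand (one large outlier among 99 identical values):

```r
# 99 schools spending 10 and one outlier spending 1000
Expend <- c(rep(10, 99), 1000)
mean(Expend)                                  # 19.9
sd(Expend)                                    # 99
sum(Expend > mean(Expend) + 2 * sd(Expend))   # only the outlier: 1
```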
- 9. Which school spends the MOST per student? You may find which.max a useful function
- !! College[which.max(College$Expend),]
- 10. Use sort(College$Accept / College$Apps, index.return=TRUE)$ix to find the order of College acceptance rates. Which 5 colleges have the HIGHEST acceptance rate? (Not Harvard, which has one of the lowest.)
- !! > tail(College[sort(College$Accept / College$Apps, index.return=TRUE)$ix,1:4])  # tail() shows the last 6 rows by default
- !! Private Apps Accept Enroll
- !! Emporia State University No 1256 1256 853
- !! Mayville State University No 233 233 153
- !! MidAmerica Nazarene College Yes 331 331 225
- !! Southwest Baptist University Yes 1093 1093 642
- !! University of Wisconsin-Superior No 910 910 342
- !! Wayne State College No 1373 1373 724
- 11. What is the acceptance rate (College$Accept/College$Apps) for schools where 90% or more of the students are from the Top 10% of their high school class (College$Top10perc) versus those with 89% or less?
- !! > aggregate(Accept / Apps ~ Top10perc >= 90, College, mean)
- !! Top10perc >= 90 Accept/Apps
- !! 1 FALSE 0.7511380
- !! 2 TRUE 0.2837915
- 12. Take a look at the histogram (hist function) for the College acceptance rate. Notice how it is left skewed (there is a long tail on the left). Is College$Expend left skewed or right skewed?
- !! hist(College$Expend) #right skewed
- 13. We can see here that Colleges with lower acceptance rates have higher graduation rates:
- Acceptance Graduation
- <0.25 99.50000
- >0.25 78.50000
- >0.5 64.29485
- Create a similar chart for how graduation rate is affected by per student spending (0 to $8377, $8377 to $10830, and greater than $10830)
- !! > aggregate(Grad.Rate ~ ifelse(Expend <= 8377,"<= 8377",ifelse(Expend > 10830,"> 10830","<>Middle")), College, mean)
- !! ifelse(Expend <= 8377, "<= 8377", ifelse(Expend > 10830, "> 10830", "<>Middle")) Grad.Rate
- !! 1 <= 8377 59.71722
- !! 2 <>Middle 67.84536
- !! 3 > 10830 74.60309
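Nested ifelse() calls work, but cut() is the idiomatic way to bin a numeric column into ranges. A self-contained sketch on made-up numbers (the breaks mirror the $8377 / $10830 cutoffs above; this is not the real College data):

```r
# cut() turns a numeric column into a factor of bins for aggregate()
d <- data.frame(
  Expend    = c(5000, 9000, 12000, 8000, 20000, 10000),
  Grad.Rate = c(50,   65,   80,    55,   90,    70)
)
d$bin <- cut(d$Expend, breaks = c(0, 8377, 10830, Inf),
             labels = c("<= 8377", "8377-10830", "> 10830"))
aggregate(Grad.Rate ~ bin, d, mean)
```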
- 14. 72% of the schools here are private. For schools where fewer than 10% of alumni donate (perc.alumni) and more than 90% of students graduate (Grad.Rate), what percentage are private?
- !! > summary(College[College$perc.alumni < 10 & College$Grad.Rate > 90,"Private"])
- !! No Yes
- !! 1 3
- 15. How much more on average do private colleges with over 95% graduation rates spend compared to public colleges that also have an over 95% graduation rate?
- !! > aggregate(Expend ~ Private, College[College$Grad.Rate > 95,],mean)
- !! Private Expend
- !! 1 No 4692.00
- !! 2 Yes 15611.91
- --- Open ended project questions ---
- 16. What are the 20 most common character bigrams in the Dutch language? (Nederlands)
- !! > head(bifreq,20)
- !! bigrams
- !! en er de an in te ee nd or he
- !! 0.031121026 0.022308979 0.017512549 0.016062465 0.015504741 0.015393196 0.014054657 0.011489124 0.011489124 0.011154490
- !! el ge et ie ar is ch ng st ri
- !! 0.011042945 0.010819855 0.010039041 0.010039041 0.009927496 0.009927496 0.009704406 0.009481316 0.009146682 0.009035137
- 17. Download the top 5 books by popularity from Project Gutenberg (https://www.gutenberg.org/ebooks/search/%3Fsort_order%3Ddownloads) and compute the top 50 English bigrams for them. Compare the bigram frequencies in a table with those from 10 random Wikipedia articles. Which are common in Wikipedia but not in Project Gutenberg, and vice versa?
- !! library(RCurl)  # provides getURL()
- !! gutenberg <- c("http://www.gutenberg.org/cache/epub/1342/pg1342.txt","http://www.gutenberg.org/cache/epub/11/pg11.txt","http://www.gutenberg.org/cache/epub/1661/pg1661.txt","http://www.gutenberg.org/cache/epub/98/pg98.txt","http://www.gutenberg.org/files/4300/4300-0.txt")
- !! > books <- sapply(gutenberg, getURL)
- !! > allbooktext <- paste(books, collapse="")
- !! > allpages <- allbooktext
- !! > language <- "englishgutenberg"
- !! > allpages <- gsub("[[:space:]]+", "", allpages)
- !! > allpages <- gsub("[[:digit:]]+", "", allpages)
- !! > replacePunctByBlank <- function(x) gsub("[[:punct:]]+", " ", x)
- !! > allpages <- replacePunctByBlank(allpages)
- !! > allpages <- tolower(allpages)
- !! > getBigrams <- function(x) { sapply(seq(from=1, to=nchar(x), by=2), function(i) substr(x, i, i+1)) } # http://stackoverflow.com/questions/26497583/split-a-string-every-5-characters
- !! > bigrams <- c(getBigrams(allpages), getBigrams(substring(allpages, 2)))
- !! > bigrams <- sort(table(bigrams), decreasing=TRUE)
- !! > bifreq <- bigrams / sum(bigrams)
- !! > bifreqByLanguage <- list()  # initialize before the first assignment
- !! > bifreqByLanguage[[language]] <- bifreq
- !! > round(english[1:10], 3)  # "english" holds frequencies computed the same way from Wikipedia text
- !! bigrams
- !! in he es ch th el ll an is hi
- !! 0.034 0.024 0.019 0.019 0.016 0.015 0.015 0.014 0.013 0.012
- !! > round(bifreq[1:10], 3)  # Gutenberg
- !! bigrams
- !! th he er in an re en ha ou on
- !! 0.026 0.026 0.021 0.019 0.015 0.013 0.012 0.012 0.011 0.011
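What the getBigrams trick above is doing, checked on a short string: one pass takes the 2-character slices at odd offsets, a second pass on the string shifted by one character takes the even offsets, and together they cover every overlapping bigram. (With an odd-length string the last slice is a single character, which also ends up in the counts.)

```r
getBigrams <- function(x) {
  sapply(seq(from = 1, to = nchar(x), by = 2), function(i) substr(x, i, i + 1))
}

getBigrams("abcde")                 # "ab" "cd" "e"  (odd offsets)
getBigrams(substring("abcde", 2))   # "bc" "de"      (even offsets)
c(getBigrams("abcde"), getBigrams(substring("abcde", 2)))
```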
- 18. Write a program that computes the most common 100 words in a corpus. Which words are common in Wikipedia but not in Project Gutenberg, and vice versa? You may find the strsplit function useful.
- !! zz <- sort(table(unlist(strsplit(tolower(allbooktext), " "))))  # unlist() flattens the list strsplit returns
- !! > tail(round(sort(zz/sum(zz)),3),20)
- !! at is had as for her you with it was that his he i in a to and of the
- !! 0.006 0.006 0.006 0.007 0.007 0.007 0.008 0.008 0.009 0.010 0.010 0.011 0.011 0.012 0.016 0.021 0.023 0.027 0.028 0.048
- !! library(RCurl); library(XML)  # getURL(), htmlParse(), xpathSApply()
- !! url <- "https://en.wikipedia.org/wiki/Special:Random"
- !! language <- "english"
- !! numPages <- 10
- !! pages <- c() # empty character vector
- !! for(i in 1:numPages) {
- !! message("Downloading page #",i," of ",language," \r",appendLF=FALSE)
- !! flush.console()
- !! html <- getURL(url, followlocation = TRUE)
- !! # parse html
- !! doc <- htmlParse(html, asText=TRUE)
- !! plain.text <- xpathSApply(doc, "//p", xmlValue)
- !! pages <- c(pages, paste(plain.text, collapse = ""))
- !! }
- !!
- !! allpages <- paste(pages,collapse='')
- !! zz <- sort(table(unlist(strsplit(tolower(allpages), " "))))
- !! tail(round(sort(zz/sum(zz)),3),20)
- !! it at be from by lottery as that for kansas with on was is a
- !! 0.005 0.005 0.005 0.006 0.006 0.006 0.007 0.007 0.008 0.009 0.009 0.009 0.010 0.011 0.020
- !! to in and of the
- !! 0.021 0.023 0.031 0.031 0.071
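The word-counting pattern from problem 18, sketched on a toy string so the counts are easy to verify (unlist() flattens the list that strsplit returns):

```r
txt <- "the cat and the dog and the bird"
words <- unlist(strsplit(tolower(txt), " "))
freq <- sort(table(words), decreasing = TRUE)
freq                          # the: 3, and: 2, bird/cat/dog: 1 each
round(freq / sum(freq), 3)    # relative frequencies; "the" is 3/8 = 0.375
```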
- 19. What kind of other websites would it be interesting to do frequency analysis on? Take a look at their data. How difficult would it be to collect data?
- 20. Other than frequency analysis, what other simple descriptive statistics do you think you could generate from the text of a web site? What about the full HTML?