- 1. Find the range, mean, median, variance, and standard deviation of the vector:
- c(78L, 67L, 62L, 37L, 86L, 53L, 59L, 7L, 57L, 95L, 52L, 62L, 73L, 93L, 60L, 2L, 8L, 1L, 36L, 48L, 92L, 1L, 80L, 19L, 62L, 6L, 20L, 75L, 19L, 18L, 45L, 97L, 75L, 14L, 24L, 48L, 63L, 90L, 77L, 4L, 96L, 7L, 89L, 45L, 95L, 68L, 2L, 56L, 48L, 54L, 65L, 29L, 68L, 31L, 7L, 14L, 92L, 59L, 42L, 9L, 48L, 33L, 82L, 62L, 88L, 70L, 34L, 55L, 48L, 46L, 76L, 45L, 62L, 100L, 47L, 2L, 46L, 99L, 28L, 27L, 31L, 64L, 17L, 19L, 82L, 8L, 23L, 7L, 87L, 15L, 83L, 12L, 36L, 36L, 64L, 67L, 48L, 94L, 8L, 43L, 34L, 23L, 64L, 20L, 55L, 21L, 63L, 61L, 1L, 46L, 82L, 33L, 36L, 1L, 82L, 32L, 55L, 58L, 44L, 3L, 3L, 40L, 45L, 78L, 62L, 23L, 87L, 86L, 82L, 28L, 24L, 23L, 91L, 86L, 69L, 98L, 2L, 73L, 64L, 28L, 66L, 61L, 19L, 57L, 70L, 49L, 68L, 62L, 63L, 26L, 5L, 45L, 81L, 17L, 44L, 74L, 97L, 44L, 8L, 34L, 97L, 42L, 58L, 71L, 65L, 68L, 35L, 81L, 87L, 47L, 20L, 89L, 63L, 49L, 23L, 2L, 36L, 100L, 64L, 99L, 92L, 53L, 48L, 66L, 12L, 86L, 27L, 50L, 96L, 68L, 78L, 40L, 82L, 35L, 39L, 22L, 19L, 18L, 37L, 23L, 76L, 100L, 10L, 39L, 87L, 38L, 25L, 82L, 16L, 68L, 18L, 5L, 94L, 19L, 47L, 82L, 31L, 89L, 54L, 52L, 85L, 62L, 34L, 52L, 34L, 71L, 42L, 95L, 86L, 31L, 41L, 74L, 44L, 50L, 63L, 94L, 96L, 28L, 13L, 35L, 85L, 55L, 27L, 32L, 32L, 5L, 1L, 25L, 62L, 51L, 80L, 30L, 52L, 56L, 64L, 50L, 5L, 78L, 98L, 26L, 77L, 62L, 73L, 42L, 18L, 70L, 55L, 98L, 93L, 53L, 38L, 71L, 85L, 50L, 23L, 21L, 64L, 88L, 91L, 70L, 19L, 83L, 25L, 38L, 48L, 65L, 68L, 8L, 43L, 53L, 9L, 69L, 58L, 17L, 80L, 95L, 14L, 75L, 5L, 90L, 71L, 65L, 57L, 42L, 49L, 70L, 65L, 1L, 88L, 62L, 11L, 30L, 3L, 53L, 11L, 17L, 52L, 96L, 44L, 1L, 21L, 69L, 82L, 57L, 23L, 69L, 29L, 85L, 16L, 27L, 32L, 100L, 25L, 62L, 3L, 83L, 62L, 85L, 48L, 57L, 84L, 10L, 54L, 80L, 50L, 89L, 55L, 50L, 20L, 77L, 76L, 90L, 23L, 56L, 72L, 77L, 12L, 38L, 36L, 1L, 30L, 75L, 47L, 97L, 46L, 99L, 78L, 94L, 65L, 60L, 30L, 69L, 94L, 50L, 10L, 30L, 8L, 28L, 37L, 21L, 89L, 80L, 53L, 84L, 64L, 8L, 66L, 39L, 30L, 53L, 29L, 93L, 94L, 12L, 82L, 97L, 40L, 64L, 62L, 29L)
- !! x <- c(...)  # first assign the vector above to x
- !! c(range(x), mean(x), median(x), var(x), sd(x))
- !! [1] 1.00000 100.00000 51.13750 52.00000 794.10385 28.17985
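As a quick sanity check on these functions (a toy sketch, not part of the assignment data): sd() is just the square root of var(), and both use the sample (n - 1) denominator.

```r
# Toy vector to illustrate the summary statistics from problem 1
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

range(x)    # 2 9
mean(x)     # 40 / 8 = 5
median(x)   # average of the two middle values: (4 + 5) / 2 = 4.5
var(x)      # sample variance, n - 1 denominator: 32 / 7
sd(x)       # equals sqrt(var(x))
```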
- 2. Use the quantile() function to find the 0%, 25%, 50%, 75% and 100% quantiles
- !! quantile(x)
- !! 0% 25% 50% 75% 100%
- !! 1.00 28.00 52.00 74.25 100.00
- 3. Use the quantile() function with a probs argument to find what the 40% mark is
- !! > quantile(x, probs=0.4)
- !! 40%
- !! 44
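How quantile() arrives at a value like 44: with the default type = 7 it linearly interpolates between the sorted values. A small sketch on a toy vector (not the assignment data):

```r
# With the default type = 7, quantile() interpolates between sorted values:
# the 40% point of 1..10 sits at position (10 - 1) * 0.4 + 1 = 4.6
x <- 1:10
quantile(x, probs = 0.4)            # 40% -> 4.6
quantile(x, probs = c(0.1, 0.9))    # several probabilities at once
```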
- 4. The following matrix has points from two clusters. How many are in the cluster with smaller x values? How many in the cluster with larger x values?
- structure(c(7.6, 3.4, 10.2, 4.4, 8, 3.7, 3.3, 3.1, 2.1, 8.6,
- 9.1, 3.7, 7.6, 3, 8.8, 10.1, 2.9, 3.2, 9.1, 8.4, 0.2, 4.7, 2.9,
- 1.8, 10.5, 9, 3.6, 1.6, 2.6, 8, 8.3, 8.2, 3.7, 3.4, 9.6, 10.8,
- 4.4, 4.1, 3, 10.1, 3.6, 2.6, 2, 7.4, 8.8, 10.1, 8.7, 1.7, 8,
- 3.8, 4.1, 9.8, 2.4, 9.9, 1.5, 9.9, 10, 9.8, 8.6, 5.1, 4, 9.1,
- 4.3, 7.5, 1.8, 2.8, 7.6, 8.2, 3.3, 2.3, 9.2, 7.6, 7.7, 9.5, 1.6,
- 1.4, 10, 9.2, 8.4, 4.2, 3.5, 4.3, 9.4, 8.9, 4.6, 3.5, 9.7, 9.7,
- 8.8, 3.7, 9.3, 11.6, 9.1, 3, 4.7, 4.5, 3.6, 8.8, 5.1, 8), .Dim = c(50L,
- 2L)) . Try plotting the points with the first column as x values and the second column as y values. Does that help?
- !! plot(x)
- !! > sum(x[,1]>6)
- !! [1] 23
- !! > sum(x[,1]<6)
- !! [1] 27
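The x[,1] indexing above pulls out the first column of the matrix, and summing a logical comparison counts the points on each side of the gap. The same pattern on a toy matrix where the counts are easy to verify by eye:

```r
# Toy 5x2 matrix: matrix() fills column by column
m <- matrix(c(1, 9, 2, 8, 1.5,    # first column: x values
              5, 5, 5, 5, 5),     # second column: y values
            ncol = 2)
sum(m[, 1] > 6)   # larger-x cluster: 2 points (9 and 8)
sum(m[, 1] < 6)   # smaller-x cluster: 3 points (1, 2, 1.5)
```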
- 5. This problem uses the "College" data frame from the ISLR library. You may need to run install.packages("ISLR") first. Then use library(ISLR) to load the library. For example, College["Harvard University",] is the row on Harvard. You can use grep("harvard", rownames(College), ignore.case=TRUE) to find a college. For this problem, return how many universities are Private: compare the College$Private column with the "Yes" string, and sum up the result.
- !! > summary(College$Private)
- !! No Yes
- !! 212 565
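summary() on the factor gives the counts directly, but the comparison-plus-sum approach the question asks for looks like this (sketched on a tiny made-up stand-in for College, since the real data frame needs ISLR installed):

```r
# Toy stand-in for ISLR's College data frame (the real one has 777 rows)
College <- data.frame(Private = factor(c("Yes", "No", "Yes", "Yes", "No")),
                      row.names = c("A", "B", "C", "D", "E"))

# The == comparison yields a logical vector; sum() counts the TRUEs
sum(College$Private == "Yes")   # 3 private schools in the toy data
```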
- 6. For example, with ISLR::College you can check whether private school students spend more money on books than public school students with aggregate(Books~Private,College,mean), which shows $547 for private school students and $554 for public school students. For this problem, compute the graduation rates for private vs public schools.
- !! > aggregate(Grad.Rate ~ Private, College, mean)
- !! Private Grad.Rate
- !! 1 No 56.04245
- !! 2 Yes 68.99823
- 7. Compute a vector for whether a school spends more than the average instructional Expenditure per student. What is the graduation rate for high spenders vs low spenders?
- !! > aggregate(Grad.Rate ~ (Expend > mean(Expend)), College, mean)
- !! Expend > mean(Expend) Grad.Rate
- !! 1 FALSE 61.60971
- !! 2 TRUE 73.03817
- 8. How many schools spend (Expend) more than two standard deviations above the mean?
- !! > summary(College$Expend > mean(College$Expend) + 2*sd(College$Expend))
- !! Mode FALSE TRUE NA's
- !! logical 749 28 0
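The same mean-plus-two-sd pattern on deterministic toy numbers, where the answer can be checked by hand (one large outlier among 99 identical values):

```r
# 99 schools spending 10 and one outlier spending 1000
Expend <- c(rep(10, 99), 1000)
mean(Expend)                                  # 19.9
sd(Expend)                                    # 99
sum(Expend > mean(Expend) + 2 * sd(Expend))   # only the outlier: 1
```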
- 9. Which school spends the MOST per student? You may find which.max a useful function
- !! College[which.max(College$Expend),]
- 10. Use sort(College$Accept / College$Apps, index.return=TRUE)$ix to find the order of College acceptance rates. Which 5 colleges have the HIGHEST acceptance rate? (Not Harvard, which has one of the lowest.)
- !! > tail(College[sort(College$Accept / College$Apps, index.return=TRUE)$ix,1:4])  # tail() shows the last 6 rows by default
- !! Private Apps Accept Enroll
- !! Emporia State University No 1256 1256 853
- !! Mayville State University No 233 233 153
- !! MidAmerica Nazarene College Yes 331 331 225
- !! Southwest Baptist University Yes 1093 1093 642
- !! University of Wisconsin-Superior No 910 910 342
- !! Wayne State College No 1373 1373 724
- 11. What is the acceptance rate (College$Accept/College$Apps) for schools where 90% or more of the students are from the Top 10% of their high school class (College$Top10perc) versus those with 89% or less?
- !! > aggregate(Accept / Apps ~ Top10perc >= 90, College, mean)
- !! Top10perc >= 90 Accept/Apps
- !! 1 FALSE 0.7511380
- !! 2 TRUE 0.2837915
- 12. Take a look at the histogram (hist function) for the College acceptance rate. Notice how it is left skewed (there is a long tail on the left). Is College$Expend left skewed or right skewed?
- !! hist(College$Expend) #right skewed
- 13. We can see here that Colleges with lower acceptance rates have higher graduation rates:
- Acceptance Graduation
- <0.25 99.50000
- >0.25 78.50000
- >0.5 64.29485
- Create a similar chart for how graduation rate is affected by per student spending (0 to $8377, $8377 to $10830, and greater than $10830)
- !! > aggregate(Grad.Rate ~ ifelse(Expend <= 8377,"<= 8377",ifelse(Expend > 10830,"> 10830","<>Middle")), College, mean)
- !! ifelse(Expend <= 8377, "<= 8377", ifelse(Expend > 10830, "> 10830", "<>Middle")) Grad.Rate
- !! 1 <= 8377 59.71722
- !! 2 <>Middle 67.84536
- !! 3 > 10830 74.60309
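Nested ifelse() calls work, but cut() is the idiomatic way to bin a numeric column into ranges. A self-contained sketch on made-up numbers (the breaks mirror the $8377 / $10830 cutoffs above; this is not the real College data):

```r
# cut() turns a numeric column into a factor of bins for aggregate()
d <- data.frame(
  Expend    = c(5000, 9000, 12000, 8000, 20000, 10000),
  Grad.Rate = c(50,   65,   80,    55,   90,    70)
)
d$bin <- cut(d$Expend, breaks = c(0, 8377, 10830, Inf),
             labels = c("<= 8377", "8377-10830", "> 10830"))
aggregate(Grad.Rate ~ bin, d, mean)
```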
- 14. 72% of the schools here are private. For schools where fewer than 10% of alumni donate (perc.alumni) and more than 90% of students graduate (Grad.Rate), what percentage are private?
- !! > summary(College[College$perc.alumni < 10 & College$Grad.Rate > 90,"Private"])
- !! No Yes
- !! 1 3
- 15. How much more on average do private colleges with over 95% graduation rates spend compared to public colleges that also have an over 95% graduation rate?
- !! > aggregate(Expend ~ Private, College[College$Grad.Rate > 95,],mean)
- !! Private Expend
- !! 1 No 4692.00
- !! 2 Yes 15611.91
- --- Open ended project questions ---
- 16. What are the 20 most common character bigrams in the Dutch language? (Nederlands)
- !! > head(bifreq,20)
- !! bigrams
- !! en er de an in te ee nd or he
- !! 0.031121026 0.022308979 0.017512549 0.016062465 0.015504741 0.015393196 0.014054657 0.011489124 0.011489124 0.011154490
- !! el ge et ie ar is ch ng st ri
- !! 0.011042945 0.010819855 0.010039041 0.010039041 0.009927496 0.009927496 0.009704406 0.009481316 0.009146682 0.009035137
- 17. Download the top 5 books by popularity from Project Gutenberg (https://www.gutenberg.org/ebooks/search/%3Fsort_order%3Ddownloads) and compute the top 50 English bigrams for them. Compare the bigram frequencies in a table with those from 10 random Wikipedia articles. Which are common in Wikipedia but not in Project Gutenberg, and vice versa?
- !! library(RCurl)  # provides getURL()
- !! gutenberg <- c("http://www.gutenberg.org/cache/epub/1342/pg1342.txt","http://www.gutenberg.org/cache/epub/11/pg11.txt","http://www.gutenberg.org/cache/epub/1661/pg1661.txt","http://www.gutenberg.org/cache/epub/98/pg98.txt","http://www.gutenberg.org/files/4300/4300-0.txt")
- !! > books <- sapply(gutenberg, getURL)
- !! > allbooktext <- paste(books, collapse="")
- !! > allpages <- allbooktext
- !! > language <- "englishgutenberg"
- !! > allpages <- gsub("[[:space:]]+", "", allpages)
- !! > allpages <- gsub("[[:digit:]]+", "", allpages)
- !! > replacePunctByBlank <- function(x) gsub("[[:punct:]]+", " ", x)
- !! > allpages <- replacePunctByBlank(allpages)
- !! > allpages <- tolower(allpages)
- !! > getBigrams <- function(x) { sapply(seq(from=1, to=nchar(x), by=2), function(i) substr(x, i, i+1)) } # http://stackoverflow.com/questions/26497583/split-a-string-every-5-characters
- !! > bigrams <- c(getBigrams(allpages), getBigrams(substring(allpages, 2)))
- !! > bigrams <- sort(table(bigrams), decreasing=TRUE)
- !! > bifreq <- bigrams / sum(bigrams)
- !! > bifreqByLanguage <- list()  # initialize before the first assignment
- !! > bifreqByLanguage[[language]] <- bifreq
- !! > round(english[1:10], 3)  # "english" holds frequencies computed the same way from Wikipedia text
- !! bigrams
- !! in he es ch th el ll an is hi
- !! 0.034 0.024 0.019 0.019 0.016 0.015 0.015 0.014 0.013 0.012
- !! > round(bifreq[1:10], 3)  # Gutenberg
- !! bigrams
- !! th he er in an re en ha ou on
- !! 0.026 0.026 0.021 0.019 0.015 0.013 0.012 0.012 0.011 0.011
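What the getBigrams trick above is doing, checked on a short string: one pass takes the 2-character slices at odd offsets, a second pass on the string shifted by one character takes the even offsets, and together they cover every overlapping bigram. (With an odd-length string the last slice is a single character, which also ends up in the counts.)

```r
getBigrams <- function(x) {
  sapply(seq(from = 1, to = nchar(x), by = 2), function(i) substr(x, i, i + 1))
}

getBigrams("abcde")                 # "ab" "cd" "e"  (odd offsets)
getBigrams(substring("abcde", 2))   # "bc" "de"      (even offsets)
c(getBigrams("abcde"), getBigrams(substring("abcde", 2)))
```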
- 18. Write a program that computes the most common 100 words in a corpus. Which words are common in Wikipedia but not in Project Gutenberg, and vice versa? You may find the strsplit function useful.
- !! zz <- sort(table(unlist(strsplit(tolower(allbooktext), " "))))  # unlist() flattens the list strsplit returns
- !! > tail(round(sort(zz/sum(zz)),3),20)
- !! at is had as for her you with it was that his he i in a to and of the
- !! 0.006 0.006 0.006 0.007 0.007 0.007 0.008 0.008 0.009 0.010 0.010 0.011 0.011 0.012 0.016 0.021 0.023 0.027 0.028 0.048
- !! library(RCurl); library(XML)  # getURL(), htmlParse(), xpathSApply()
- !! url <- "https://en.wikipedia.org/wiki/Special:Random"
- !! language <- "english"
- !! numPages <- 10
- !! pages <- c() # empty character vector
- !! for(i in 1:numPages) {
- !! message("Downloading page #",i," of ",language," \r",appendLF=FALSE)
- !! flush.console()
- !! html <- getURL(url, followlocation = TRUE)
- !! # parse html
- !! doc <- htmlParse(html, asText=TRUE)
- !! plain.text <- xpathSApply(doc, "//p", xmlValue)
- !! pages <- c(pages, paste(plain.text, collapse = ""))
- !! }
- !!
- !! allpages <- paste(pages,collapse='')
- !! zz <- sort(table(unlist(strsplit(tolower(allpages), " "))))
- !! tail(round(sort(zz/sum(zz)),3),20)
- !! it at be from by lottery as that for kansas with on was is a
- !! 0.005 0.005 0.005 0.006 0.006 0.006 0.007 0.007 0.008 0.009 0.009 0.009 0.010 0.011 0.020
- !! to in and of the
- !! 0.021 0.023 0.031 0.031 0.071
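The word-counting pattern from problem 18, sketched on a toy string so the counts are easy to verify (unlist() flattens the list that strsplit returns):

```r
txt <- "the cat and the dog and the bird"
words <- unlist(strsplit(tolower(txt), " "))
freq <- sort(table(words), decreasing = TRUE)
freq                          # the: 3, and: 2, bird/cat/dog: 1 each
round(freq / sum(freq), 3)    # relative frequencies; "the" is 3/8 = 0.375
```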
- 19. What kind of other websites would it be interesting to do frequency analysis on? Take a look at their data. How difficult would it be to collect data?
- 20. Other than frequency analysis, what other simple descriptive statistics do you think you could generate from the text of a web site? What about the full HTML?