ktbyte

HW1 - Descriptive Variables

Nov 2nd, 2016
1. Find the range, mean, median, variance, and standard deviation of the vector:
x <- c(78L, 67L, 62L, 37L, 86L, 53L, 59L, 7L, 57L, 95L, 52L, 62L, 73L, 93L, 60L, 2L, 8L, 1L, 36L, 48L, 92L, 1L, 80L, 19L, 62L, 6L, 20L, 75L, 19L, 18L, 45L, 97L, 75L, 14L, 24L, 48L, 63L, 90L, 77L, 4L, 96L, 7L, 89L, 45L, 95L, 68L, 2L, 56L, 48L, 54L, 65L, 29L, 68L, 31L, 7L, 14L, 92L, 59L, 42L, 9L, 48L, 33L, 82L, 62L, 88L, 70L, 34L, 55L, 48L, 46L, 76L, 45L, 62L, 100L, 47L, 2L, 46L, 99L, 28L, 27L, 31L, 64L, 17L, 19L, 82L, 8L, 23L, 7L, 87L, 15L, 83L, 12L, 36L, 36L, 64L, 67L, 48L, 94L, 8L, 43L, 34L, 23L, 64L, 20L, 55L, 21L, 63L, 61L, 1L, 46L, 82L, 33L, 36L, 1L, 82L, 32L, 55L, 58L, 44L, 3L, 3L, 40L, 45L, 78L, 62L, 23L, 87L, 86L, 82L, 28L, 24L, 23L, 91L, 86L, 69L, 98L, 2L, 73L, 64L, 28L, 66L, 61L, 19L, 57L, 70L, 49L, 68L, 62L, 63L, 26L, 5L, 45L, 81L, 17L, 44L, 74L, 97L, 44L, 8L, 34L, 97L, 42L, 58L, 71L, 65L, 68L, 35L, 81L, 87L, 47L, 20L, 89L, 63L, 49L, 23L, 2L, 36L, 100L, 64L, 99L, 92L, 53L, 48L, 66L, 12L, 86L, 27L, 50L, 96L, 68L, 78L, 40L, 82L, 35L, 39L, 22L, 19L, 18L, 37L, 23L, 76L, 100L, 10L, 39L, 87L, 38L, 25L, 82L, 16L, 68L, 18L, 5L, 94L, 19L, 47L, 82L, 31L, 89L, 54L, 52L, 85L, 62L, 34L, 52L, 34L, 71L, 42L, 95L, 86L, 31L, 41L, 74L, 44L, 50L, 63L, 94L, 96L, 28L, 13L, 35L, 85L, 55L, 27L, 32L, 32L, 5L, 1L, 25L, 62L, 51L, 80L, 30L, 52L, 56L, 64L, 50L, 5L, 78L, 98L, 26L, 77L, 62L, 73L, 42L, 18L, 70L, 55L, 98L, 93L, 53L, 38L, 71L, 85L, 50L, 23L, 21L, 64L, 88L, 91L, 70L, 19L, 83L, 25L, 38L, 48L, 65L, 68L, 8L, 43L, 53L, 9L, 69L, 58L, 17L, 80L, 95L, 14L, 75L, 5L, 90L, 71L, 65L, 57L, 42L, 49L, 70L, 65L, 1L, 88L, 62L, 11L, 30L, 3L, 53L, 11L, 17L, 52L, 96L, 44L, 1L, 21L, 69L, 82L, 57L, 23L, 69L, 29L, 85L, 16L, 27L, 32L, 100L, 25L, 62L, 3L, 83L, 62L, 85L, 48L, 57L, 84L, 10L, 54L, 80L, 50L, 89L, 55L, 50L, 20L, 77L, 76L, 90L, 23L, 56L, 72L, 77L, 12L, 38L, 36L, 1L, 30L, 75L, 47L, 97L, 46L, 99L, 78L, 94L, 65L, 60L, 30L, 69L, 94L, 50L, 10L, 30L, 8L, 28L, 37L, 21L, 89L, 80L, 53L, 84L, 64L, 8L, 66L, 39L, 30L, 53L, 29L, 93L, 94L, 12L, 82L, 97L, 40L, 64L, 62L, 29L)
!! > c(range(x), mean(x), median(x), var(x), sd(x))
!! [1]   1.00000 100.00000  51.13750  52.00000 794.10385  28.17985

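As a sanity check, the built-in summaries can be reproduced by hand. This is a sketch on a small toy vector (not the homework data), illustrating that var() uses the sample (n-1) denominator and sd() is its square root:

```r
# Toy vector, for illustration only (not the homework data)
y <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(y)
m <- sum(y) / n                # mean by hand
v <- sum((y - m)^2) / (n - 1)  # sample variance, n-1 denominator
stopifnot(all.equal(m, mean(y)),
          all.equal(v, var(y)),
          all.equal(sqrt(v), sd(y)))  # sd() is sqrt of var()
```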
2. Use the quantile() function to find the 0%, 25%, 50%, 75% and 100% quantiles.
!! > quantile(x)
!!     0%    25%    50%    75%   100%
!!   1.00  28.00  52.00  74.25 100.00

3. Use the quantile() function with a probs argument to find what the 40% mark is.
!! > quantile(x, probs=0.4)
!! 40%
!!  44

4. The following matrix has points from two clusters. How many are in the cluster with smaller x values? How many in the cluster with larger x values? Try plotting the points with the first column as x values and the second column as y values. Does that help?
x <- structure(c(7.6, 3.4, 10.2, 4.4, 8, 3.7, 3.3, 3.1, 2.1, 8.6,
9.1, 3.7, 7.6, 3, 8.8, 10.1, 2.9, 3.2, 9.1, 8.4, 0.2, 4.7, 2.9,
1.8, 10.5, 9, 3.6, 1.6, 2.6, 8, 8.3, 8.2, 3.7, 3.4, 9.6, 10.8,
4.4, 4.1, 3, 10.1, 3.6, 2.6, 2, 7.4, 8.8, 10.1, 8.7, 1.7, 8,
3.8, 4.1, 9.8, 2.4, 9.9, 1.5, 9.9, 10, 9.8, 8.6, 5.1, 4, 9.1,
4.3, 7.5, 1.8, 2.8, 7.6, 8.2, 3.3, 2.3, 9.2, 7.6, 7.7, 9.5, 1.6,
1.4, 10, 9.2, 8.4, 4.2, 3.5, 4.3, 9.4, 8.9, 4.6, 3.5, 9.7, 9.7,
8.8, 3.7, 9.3, 11.6, 9.1, 3, 4.7, 4.5, 3.6, 8.8, 5.1, 8), .Dim = c(50L, 2L))
!! plot(x)
!! > sum(x[,1] > 6)
!! [1] 23
!! > sum(x[,1] < 6)
!! [1] 27

5. This problem uses the "College" dataset from the ISLR library. You may need to run install.packages("ISLR") first, then library(ISLR) to load it. For example, College["Harvard University",] is the row for Harvard, and you can use grep("harvard", rownames(College), ignore.case=TRUE) to find a college. For this problem, return how many universities are Private: compare the College$Private column with the "Yes" string, and sum up the result.
!! > summary(College$Private)
!!  No Yes
!! 212 565

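The comparison-and-sum approach the question describes can be sketched on a toy factor standing in for College$Private (mock values, not the ISLR column):

```r
# Mock stand-in for College$Private
Private <- factor(c("Yes", "No", "Yes", "Yes", "No"))
n_private <- sum(Private == "Yes")  # TRUE coerces to 1, FALSE to 0
n_private
```

On the real data, sum(College$Private == "Yes") gives the same 565 that summary() reports above.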
6. With ISLR::College, you can compute whether private-school students spend more money on books than public-school students with aggregate(Books ~ Private, College, mean), which shows $547 for private-school students and $554 for public-school students. For this problem, compute the graduation rates for private vs public schools.
!! > aggregate(Grad.Rate ~ Private, College, mean)
!!   Private Grad.Rate
!! 1      No  56.04245
!! 2     Yes  68.99823

7. Compute a vector for whether a school spends more than the average instructional expenditure per student. What is the graduation rate for high spenders vs low spenders?
!! > aggregate(Grad.Rate ~ (Expend > mean(Expend)), College, mean)
!!   Expend > mean(Expend) Grad.Rate
!! 1                 FALSE  61.60971
!! 2                  TRUE  73.03817

8. How many schools spend (Expend) more than two standard deviations above the mean?
!! > summary(College$Expend > mean(College$Expend) + 2*sd(College$Expend))
!!    Mode   FALSE    TRUE    NA's
!! logical     749      28       0

9. Which school spends the MOST per student? You may find which.max a useful function.
!! College[which.max(College$Expend),]

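which.max() returns the index of the first maximum, so it can be used directly as a row index. A minimal sketch on mock data (the Expend column name matches ISLR::College; the schools and numbers are made up):

```r
# Mock data frame standing in for ISLR::College
spend <- data.frame(school = c("A", "B", "C"),
                    Expend = c(9000, 21000, 7500))
top <- spend[which.max(spend$Expend), ]  # row with the largest Expend
top$school
```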
10. Use sort(College$Accept / College$Apps, index.return=TRUE)$ix to find the order of college acceptance rates. Which colleges have the HIGHEST acceptance rate? (Not Harvard, which has the lowest.)
!! > tail(College[sort(College$Accept / College$Apps, index.return=TRUE)$ix, 1:4]) # tail() shows the last six rows by default
!!                                  Private Apps Accept Enroll
!! Emporia State University              No 1256   1256    853
!! Mayville State University             No  233    233    153
!! MidAmerica Nazarene College          Yes  331    331    225
!! Southwest Baptist University         Yes 1093   1093    642
!! University of Wisconsin-Superior      No  910    910    342
!! Wayne State College                   No 1373   1373    724

11. What is the acceptance rate (College$Accept/College$Apps) for schools where 90% or more of the students are from the top 10% of their high school class (College$Top10perc), versus those with 89% or less?
!! > aggregate(Accept / Apps ~ Top10perc >= 90, College, mean)
!!   Top10perc >= 90 Accept/Apps
!! 1           FALSE   0.7511380
!! 2            TRUE   0.2837915

12. Take a look at the histogram (hist function) of the college acceptance rate. Notice how it is left skewed (there is a long tail on the left). Is College$Expend left skewed or right skewed?
!! hist(College$Expend) # right skewed

13. We can see here that colleges with lower acceptance rates have higher graduation rates:
    Acceptance  Graduation
    <0.25       99.50000
    >0.25       78.50000
    >0.5        64.29485
Create a similar chart for how graduation rate is affected by per-student spending (0 to $8377, $8377 to $10830, and greater than $10830).
!! > aggregate(Grad.Rate ~ ifelse(Expend <= 8377, "<= 8377", ifelse(Expend > 10830, "> 10830", "<>Middle")), College, mean)
!!   ifelse(Expend <= 8377, "<= 8377", ifelse(Expend > 10830, "> 10830", "<>Middle")) Grad.Rate
!! 1                                                                          <= 8377  59.71722
!! 2                                                                         <>Middle  67.84536
!! 3                                                                          > 10830  74.60309

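An alternative to nested ifelse() is cut(), which bins a numeric vector at given breakpoints. A sketch with the same cutoffs on mock data (assuming cut's default right-closed intervals are acceptable here):

```r
# Mock data standing in for ISLR::College
df <- data.frame(Expend    = c(5000, 9000, 12000, 8000, 15000),
                 Grad.Rate = c(50, 65, 80, 60, 90))
# Bin spending at the same breakpoints as the ifelse() answer
df$bin <- cut(df$Expend, breaks = c(0, 8377, 10830, Inf),
              labels = c("<= 8377", "middle", "> 10830"))
aggregate(Grad.Rate ~ bin, df, mean)
```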
14. 72% of the schools here are private. For schools where fewer than 10% of alumni donate (perc.alumni) and more than 90% of students graduate (Grad.Rate), what percentage are private?
!! > summary(College[College$perc.alumni < 10 & College$Grad.Rate > 90, "Private"])
!!  No Yes
!!   1   3

15. How much more, on average, do private colleges with over-95% graduation rates spend compared to public colleges that also have an over-95% graduation rate?
!! > aggregate(Expend ~ Private, College[College$Grad.Rate > 95,], mean)
!!   Private   Expend
!! 1      No  4692.00
!! 2     Yes 15611.91

--- Open ended project questions ---

16. What are the 20 most common character bigrams in the Dutch language (Nederlands)?
!! > head(bifreq, 20)
!! bigrams
!!          en          er          de          an          in          te          ee          nd          or          he
!! 0.031121026 0.022308979 0.017512549 0.016062465 0.015504741 0.015393196 0.014054657 0.011489124 0.011489124 0.011154490
!!          el          ge          et          ie          ar          is          ch          ng          st          ri
!! 0.011042945 0.010819855 0.010039041 0.010039041 0.009927496 0.009927496 0.009704406 0.009481316 0.009146682 0.009035137

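The counting behind this table can be sketched in a few lines on a toy string; question 17 builds the full download-and-clean pipeline, so this shows only the core idea (countBigrams is a hypothetical helper, not from the original answer):

```r
# Count overlapping character bigrams in a single string
countBigrams <- function(x) {
  pairs <- sapply(seq_len(nchar(x) - 1), function(i) substr(x, i, i + 1))
  sort(table(pairs), decreasing = TRUE)
}
b <- countBigrams("bananas")
round(b / sum(b), 2)  # relative frequencies; "an" and "na" each appear twice
```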
17. Download the top 5 books by popularity from Project Gutenberg (https://www.gutenberg.org/ebooks/search/%3Fsort_order%3Ddownloads) and compute the top 50 English bigrams for them. Compare the bigram frequencies in a table with those from 10 random Wikipedia articles. Which bigrams are common in Wikipedia but not in Project Gutenberg, and vice versa?
> library(RCurl) # provides getURL()
> gutenberg <- c("http://www.gutenberg.org/cache/epub/1342/pg1342.txt","http://www.gutenberg.org/cache/epub/11/pg11.txt","http://www.gutenberg.org/cache/epub/1661/pg1661.txt","http://www.gutenberg.org/cache/epub/98/pg98.txt","http://www.gutenberg.org/files/4300/4300-0.txt")
> books <- sapply(gutenberg, getURL)
> allbooktext <- paste(books, collapse="")
> allpages <- allbooktext
> language <- "englishgutenberg"
> allpages <- gsub("[[:space:]]+", "", allpages)
> allpages <- gsub("[[:digit:]]+", "", allpages)
> replacePunctByBlank <- function(x) gsub("[[:punct:]]+", " ", x)
> allpages <- replacePunctByBlank(allpages)
> allpages <- tolower(allpages)
> getBigrams <- function(x) { sapply(seq(from=1, to=nchar(x), by=2), function(i) substr(x, i, i+1)) } # http://stackoverflow.com/questions/26497583/split-a-string-every-5-characters
> bigrams <- c(getBigrams(allpages), getBigrams(substring(allpages, 2))) # odd and even offsets together cover every overlapping bigram
> bigrams <- sort(table(bigrams), decreasing=TRUE)
> bifreq <- bigrams / sum(bigrams)
> bifreqByLanguage <- list() # initialize once; holds one frequency table per corpus
> bifreqByLanguage[[language]] <- bifreq
> round(english[1:10], 3) # `english` is the bifreq table from a separate Wikipedia run
bigrams
   in    he    es    ch    th    el    ll    an    is    hi
0.034 0.024 0.019 0.019 0.016 0.015 0.015 0.014 0.013 0.012
> round(bifreq[1:10], 3) # gutenberg
bigrams
   th    he    er    in    an    re    en    ha    ou    on
0.026 0.026 0.021 0.019 0.015 0.013 0.012 0.012 0.011 0.011

18. Write a program that computes the most common 100 words in a corpus. Which words are common in Wikipedia but not in Project Gutenberg, and vice versa? You may find the strsplit function useful.
!! zz <- sort(table(unlist(strsplit(tolower(allbooktext), " "))))
!! > tail(round(sort(zz/sum(zz)), 3), 20)
!!    at    is   had    as   for   her   you  with    it   was  that   his    he     i    in     a    to   and    of   the
!! 0.006 0.006 0.006 0.007 0.007 0.007 0.008 0.008 0.009 0.010 0.010 0.011 0.011 0.012 0.016 0.021 0.023 0.027 0.028 0.048

!! library(RCurl) # getURL()
!! library(XML)   # htmlParse(), xpathSApply(), xmlValue
!! url <- "https://en.wikipedia.org/wiki/Special:Random"
!! language <- "english"
!! numPages <- 10
!! pages <- c() # empty vector
!! for (i in 1:numPages) {
!!   message("Downloading page #", i, " of ", language, " \r", appendLF=FALSE)
!!   flush.console()
!!   html <- getURL(url, followlocation=TRUE)
!!   # parse the HTML and extract the plain text of every <p> element
!!   doc <- htmlParse(html, asText=TRUE)
!!   plain.text <- xpathSApply(doc, "//p", xmlValue)
!!   pages <- c(pages, paste(plain.text, collapse=""))
!! }
!!
!! allpages <- paste(pages, collapse='')
!! zz <- sort(table(unlist(strsplit(tolower(allpages), " "))))
!! tail(round(sort(zz/sum(zz)), 3), 20)

   it    at    be  from    by lottery    as  that   for kansas  with    on   was    is     a
0.005 0.005 0.005 0.006 0.006   0.006 0.007 0.007 0.008  0.009 0.009 0.009 0.010 0.011 0.020
   to    in   and    of   the
0.021 0.023 0.031 0.031 0.071

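For the which-words-differ part of the question, setdiff() on the two top-word vectors does the comparison; a sketch with short made-up lists standing in for the real top-100 tables:

```r
# Toy stand-ins for the two top-word lists (names(tail(zz, 100)) in practice)
gutenberg_top <- c("the", "of", "and", "to", "a", "he", "his", "her")
wikipedia_top <- c("the", "of", "and", "in", "to", "a", "lottery", "kansas")
setdiff(wikipedia_top, gutenberg_top)  # common on the wiki pages but not in the books
setdiff(gutenberg_top, wikipedia_top)  # and vice versa
```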
19. What other kinds of websites would it be interesting to do frequency analysis on? Take a look at their data. How difficult would it be to collect?

20. Other than frequency analysis, what other simple descriptive statistics do you think you could generate from the text of a web site? What about the full HTML?