ktbyte

Hw2 (R)

Nov 2nd, 2016
---- Probability Distributions ----

# Normal Distribution: plot(seq(-3,3,.1), dnorm(seq(-3,3,.1)))

1. # Central Limit Theorem ( https://github.com/victoirv/CLTgif )
Pick any data set of numbers (preferably more than 100 meaningful numbers) and write a program that samples 10 numbers from it for a "sample mean". Compute 10000 sample means and plot them as a histogram. The distribution should look like a normal distribution.

rands <- runif(10000, 0, 100)   # stand-in data set of 10000 numbers
hist(sapply(1:10000, function(i) { mean(sample(rands, 10)) }))   # 10000 sample means


2. # Z-score to p , http://ncalculators.com/statistics/z-score-calculator.htm , pnorm
Given a normal distribution with mean 5 and standard deviation 3, what is the probability of drawing a value less than 2? (We will do this problem in class.)

> pnorm(2, 5, 3)
[1] 0.1586553
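As a quick sanity check (a sketch, not part of the assignment), the pnorm value can be approximated by simulating draws from the same normal distribution:

```r
# Monte Carlo estimate of P(X < 2) for X ~ Normal(mean = 5, sd = 3)
set.seed(42)
draws <- rnorm(1e6, mean = 5, sd = 3)
estimate <- mean(draws < 2)          # fraction of simulated values below 2
exact <- pnorm(2, mean = 5, sd = 3)  # 0.1586553
abs(estimate - exact) < 0.005        # the two should agree closely
```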

3. # P-score to z , qnorm
Given a normal distribution with mean 5 and standard deviation 3, 99.9% of values are greater than X. What is X? (Do in class.)

> qnorm(0.001, 5, 3)
[1] -4.270697
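qnorm is the inverse of pnorm, so the answer can be verified by mapping it back to its probability (a quick check, not required by the problem):

```r
# qnorm and pnorm are inverses: the 0.1% quantile maps back to probability 0.001
x <- qnorm(0.001, mean = 5, sd = 3)   # about -4.27; 99.9% of values exceed this
p <- pnorm(x, mean = 5, sd = 3)
isTRUE(all.equal(p, 0.001))
```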


4. # 2-sided P-score to z
Given a normal distribution with mean 5 and standard deviation 3, 80% of the values are between A and B. What are A and B?

> qnorm(0.1, 5, 3)
[1] 1.155345
> qnorm(0.9, 5, 3)
[1] 8.844655
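Because the normal distribution is symmetric, the two cutoffs sit at equal distances from the mean, with exactly 80% of the mass between them (a verification sketch):

```r
# A and B are symmetric about the mean, with 80% of the mass between them
a <- qnorm(0.1, 5, 3)   # 10% of values fall below A
b <- qnorm(0.9, 5, 3)   # 10% of values fall above B
isTRUE(all.equal(b - 5, 5 - a))                          # symmetry about mean 5
isTRUE(all.equal(pnorm(b, 5, 3) - pnorm(a, 5, 3), 0.8))  # 80% in between
```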


5. You can use the binomial distribution to model how many of 10 fair coin flips result in 4 or fewer heads (pbinom(4,10,0.5)). Similarly (although not needed for this problem), you can find the number of heads you would see at most in 99% of cases: qbinom(0.99,10,0.5). Finally, you can use dbinom to get the probability of exactly one count of events: dbinom(1,3,0.5) is 3/8, the probability of getting exactly 1 head after flipping a coin 3 times.
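The relationships among dbinom, pbinom, and qbinom described above can be checked directly (a sketch):

```r
# dbinom = point mass, pbinom = cumulative sum, qbinom = inverse of pbinom
isTRUE(all.equal(dbinom(1, 3, 0.5), 3/8))                         # exactly 1 head in 3 flips
isTRUE(all.equal(pbinom(4, 10, 0.5), sum(dbinom(0:4, 10, 0.5))))  # CDF sums point masses
qbinom(0.99, 10, 0.5)   # smallest k with P(heads <= k) >= 0.99
```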

For this problem, there is a 37% probability of rain on any given day. Given 10 random days, what is the probability that exactly 3 of the days have rain? What is the probability that at most 3 of the days have rain? What is the probability that at least 3 of the days have rain?

Exactly 3?
> pbinom(3,10,0.37) - pbinom(2,10,0.37)
[1] 0.2394254

At most 3?
> pbinom(3,10,0.37)
[1] 0.4599962

At least 3?
> 1 - pbinom(2,10,0.37)
[1] 0.7794292
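The "exactly 3" case can also be computed in a single call with dbinom, which matches the CDF difference used above:

```r
# dbinom(3, ...) is the same quantity as pbinom(3, ...) - pbinom(2, ...)
direct <- dbinom(3, 10, 0.37)
viacdf <- pbinom(3, 10, 0.37) - pbinom(2, 10, 0.37)
isTRUE(all.equal(direct, viacdf))   # both are about 0.2394
```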


---- Data Formats ----

6. This API allows you to match IP addresses to countries: https://api.ip2country.info/ip?8.8.8.8 . The website throttles clients to 2 requests / second: https://ip2country.info/humans.txt . Additionally, according to XKCD, the IP address space is mapped like this: https://xkcd.com/195/ . Write a program that queries 0-255.{66,127,191,255}.0.0 (a total of 256*4 queries, which should take just over 30 minutes to run). Compile a histogram of which countries have the most IP address space registered to them. If you do not have enough time, just explore the 0-255.0.0.0 address space.


require(RCurl)     # for getURL
require(jsonlite)  # for fromJSON (any fromJSON implementation works here)

ips <- as.character(sapply(0:255, function(x) { paste(x, c(66,127,191,255), "0.0", sep=".") }))
ipdata <- list()
for (ip in ips) {
  data <- getURL(paste("https://api.ip2country.info/ip?", ip, sep=""))
  message(ip)
  message(data)
  ipdata[[ip]] <- data
  Sys.sleep(2)   # stay well under the 2 requests/second throttle
}

ipdataList <- lapply(ipdata, fromJSON)   # parse the stored JSON responses
countryByIP <- data.frame(cbind(sapply(ipdataList, function(x) { x[[2]] })))

> sort(table(countryByIP))
countryByIP
BEN BGD BGR BHS BIH CHL CIV CRI DOM DZA ECU ETH GHA IRQ KWT LBY LKA LTU MLT MYS PAN PRY ROU SRB SVK TZA UKR VEN AZE BLR EST GRC PAK
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
SAU AUT EGY ESP EU IRL ISR NGA NZL PER PHL HUN MAR CHE DNK FIN IRN NOR TUR ARG BEL COL HKG MEX SWE VNM IDN IND SGP ZAF POL TWN NLD
2 3 3 3 3 3 3 3 3 3 3 4 4 5 5 5 5 5 5 6 6 6 6 6 6 6 7 7 7 7 9 9 11
ITA AUS CAN RUS FRA BRA KOR DEU GBR JPN CHN USA
12 14 15 15 22 23 24 25 26 41 80 158 362

7. Pick a dataset from http://www.stat.ufl.edu/~winner/datasets.html , and do a linear regression on it. Write a sentence on what your finding implies. (Include the link to the dataset and your code in the answer.)

Dataset description: http://www.stat.ufl.edu/~winner/data/birthrate.txt
Data: http://www.stat.ufl.edu/~winner/data/birthrate.dat

require(RCurl)   # for getURL
data <- getURL("http://www.stat.ufl.edu/~winner/data/birthrate.dat")
data <- gsub("[ ]{2,}", "\t", data)   # collapse runs of spaces into tab separators
data <- read.table(textConnection(data), sep="\t")
data <- data.frame(data)
names(data) <- c("nation","birthrate","capincome","poponfarms","infantmort")

> summary(lm(data$birthrate ~ data$infantmort))

Call:
lm(formula = data$birthrate ~ data$infantmort)

Residuals:
    Min      1Q  Median      3Q     Max
-15.583  -4.995  -1.893   5.131  18.190

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     13.43237    2.75145   4.882 3.83e-05 ***
data$infantmort  0.21574    0.04587   4.703 6.25e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.285 on 28 degrees of freedom
Multiple R-squared: 0.4413, Adjusted R-squared: 0.4214
F-statistic: 22.12 on 1 and 28 DF, p-value: 6.245e-05

Birth rates are positively correlated with infant mortality, with high significance.

8. Find a JSON dataset from https://github.com/toddmotto/public-apis that does not require any authentication AND where a linear regression returns a high correlation coefficient.

require(RCurl)     # for getURL
require(jsonlite)  # for fromJSON

pokemonjson <- list()
for (id in 1:721) {
  poke <- getURL(paste("http://pokeapi.co/api/v2/pokemon/", id, "/", sep=""))
  message(poke)
  message(id)
  pokemonjson[[id]] <- poke
  Sys.sleep(1)   # be polite to the API
}

pokemonList <- list()
for (id in 1:length(pokemonjson)) {
  message(id)
  pokemonList[[id]] <- fromJSON(pokemonjson[[id]])
}

pbody <- data.frame(t(sapply(pokemonList, function(x) { c(x$weight, x$height, length(x$moves), x$base_experience) })))
names(pbody) <- c("w","h","moves","expbase")
summary(lm(w ~ h + moves + expbase + I(h*expbase) + I(moves*expbase), data=pbody))


9. Yahoo Finance provides CSV time series data on the "Adjusted Close" price of various financial instruments. For example, Google's holding company, "Alphabet", is here: http://finance.yahoo.com/q/hp?s=GOOG+Historical+Prices . At the bottom of the page, there is a download CSV link. For this problem, find 2 stocks that have a correlation in price > 0.6. Show the code that you used to perform your analysis.

snp <- read.csv("http://chart.finance.yahoo.com/table.csv?s=^GSPC&a=5&b=14&c=2016&d=6&e=14&f=2016&g=d&ignore=.csv")
nasdaq <- read.csv("http://chart.finance.yahoo.com/table.csv?s=^IXIC&a=5&b=14&c=2016&d=6&e=14&f=2016&g=d&ignore=.csv")
> summary(lm(snp$Adj.Close ~ nasdaq$Adj.Close))$r.squared
[1] 0.9691687
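The problem asks for a correlation above 0.6 while the code reports R-squared; for a simple regression the two are linked, since R-squared is the squared Pearson correlation (so 0.969 corresponds to a correlation of about 0.98). A self-contained sketch with synthetic stand-in series, since the Yahoo CSV endpoints above may no longer respond:

```r
# For simple linear regression, r.squared equals cor(x, y)^2
set.seed(1)
x <- cumsum(rnorm(250))            # stand-in for one price series
y <- 0.8 * x + rnorm(250, sd = 2)  # a correlated stand-in series
r2 <- summary(lm(y ~ x))$r.squared
isTRUE(all.equal(r2, cor(x, y)^2))
cor(x, y) > 0.6                    # the criterion the problem actually asks for
```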

10. weather.gov provides an XML interface given longitude and latitude: http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML . After saving the XML file to your workspace (check getwd() to see your workspace location), you can use the following code to parse it:

require(XML)
data <- xmlParse("file.xml")
xml_data <- xmlToList(data)
location <- as.list(xml_data[["data"]][["location"]][["point"]])
start_time <- unlist(xml_data[["data"]][["time-layout"]][
  names(xml_data[["data"]][["time-layout"]]) == "start-valid-time"])
temps <- xml_data[["data"]][["parameters"]]
temps <- temps[names(temps) == "temperature"]
temps <- temps[sapply(temps, function(x) any(unlist(x) == "hourly"))]
temps <- unlist(temps[[1]][sapply(temps, names) == "value"])
out <- data.frame(
  as.list(location),
  "start_valid_time" = start_time,
  "hourly_temperature" = temps)

For this problem, add the "wind-speed", "probability-of-precipitation", and humidity columns to the dataframe.

# answer starts here:
windspeed <- xml_data[["data"]][["parameters"]]
windspeed <- windspeed[names(windspeed) == "wind-speed"]
windspeed <- windspeed[sapply(windspeed, function(x) any(unlist(x) == "sustained"))]
windspeed <- as.numeric(unlist(windspeed[[1]][sapply(windspeed, names) == "value"]))
out$windspeed <- windspeed

precip <- xml_data[["data"]][["parameters"]]
precip <- precip[names(precip) == "probability-of-precipitation"]
precip <- precip[sapply(precip, function(x) any(unlist(x) == "floating"))]
precip <- as.numeric(unlist(precip[[1]][sapply(precip, names) == "value"]))
out$precip <- precip
head(out)

---- ISLR Problems ----

11. From the ISLR textbook page 121 (Section 3.7), do problem #8 (a.i, a.ii, a.iii, a.iv, b, and c).

http://blog.princehonest.com/stat-learning/ch3/applied.html

12. From the ISLR textbook page 122 (Section 3.7), do problem #9 (c.i, c.ii, c.iii).

13. From the ISLR textbook page 123 (Section 3.7), do problem #10 (a, b).

---- Simple Regression in Java ----

14. (Java Exercise) Use this linear regression class and any data to find the slope & intercept, and determine the r^2: http://introcs.cs.princeton.edu/java/97data/LinearRegression.java.html

15. (Java Exercise) The Apache Math library can be used for linear regression. For this problem, we will use their library to perform a simple linear regression.

a. Create a new Maven project in Eclipse: http://www.tech-recipes.com/rx/39279/create-a-new-maven-project-in-eclipse/
b. Add their dependency from https://mvnrepository.com/artifact/org.apache.commons/commons-math3/3.6.1 to your pom.xml:
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
  <version>3.6.1</version>
</dependency>
c. Create a simple project that uses the SimpleRegression class as outlined by http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.4_Simple_regression

Problems on Distributions

1. Given x, y, and the model y_est = P_0 + P_1 * x, compute the Residual Sum of Squares ("RSS"). Recall that RSS equals (e_1)^2 + (e_2)^2 + (e_3)^2 + ... + (e_n)^2, where e_i = y_i - (P_0 + P_1 * x_i).
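A minimal sketch of the computation (the x, y, P_0, and P_1 values here are made-up toy numbers, not data from the assignment):

```r
# RSS for the model y_est = P0 + P1 * x, on toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
P0 <- 0.1; P1 <- 2                 # arbitrary example coefficients
e <- y - (P0 + P1 * x)             # residuals e_i = y_i - (P0 + P1 * x_i)
RSS <- sum(e^2)
RSS                                # 0.14 for these toy values
```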

2. In a simple linear regression with the model y_estimate = P_0 + P_1 * x, the parameters ("coefficients") P_0 and P_1 can be calculated with a formula that minimizes the Residual Sum of Squares. The equation for this "least squares" approach is P_1 (the slope) = sum_over_i( (x_i - mean(x)) * (y_i - mean(y)) ) / sum_over_i( (x_i - mean(x))^2 ). Also, P_0 (the intercept) = mean(y) - P_1 * mean(x). Note that the derivation for these two equations requires calculus, and is beyond the scope of this course. Given arrays xdiff = x_i - mean(x) and ydiff = y_i - mean(y) as well as mean(x) and mean(y), compute P_0 and P_1.
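These closed-form equations can be checked against R's lm() on toy data (a sketch; the x and y values are made up):

```r
# Closed-form least squares, compared with lm()
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4.1, 5.9, 8.2, 9.9)
xdiff <- x - mean(x)
ydiff <- y - mean(y)
P1 <- sum(xdiff * ydiff) / sum(xdiff^2)   # slope
P0 <- mean(y) - P1 * mean(x)              # intercept
fit <- lm(y ~ x)
isTRUE(all.equal(unname(coef(fit)), c(P0, P1)))
```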

3. The Residual Standard Error ("RSE") is computed as sqrt(RSS/(n-2)), and it can be used to calculate the standard errors of parameters P_0 and P_1. Specifically, SE(P_0) = sqrt(RSE^2 * (1/n + (mean(x)^2)/sum_over_i((x_i - mean(x))^2))), and SE(P_1) = sqrt((RSE^2)/sum_over_i((x_i - mean(x))^2)). Please compute SE(P_0) and SE(P_1) for the linear least squares model y_est = P_0 + x*P_1.
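Both standard errors can be verified against the "Std. Error" column of summary(lm) (a sketch on made-up data):

```r
# SE(P0) and SE(P1) from the formulas, vs summary(lm)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 2.1, 2.8, 4.2, 5.1, 5.7)
n <- length(x)
fit <- lm(y ~ x)
RSE <- sqrt(sum(resid(fit)^2) / (n - 2))   # sqrt(RSS/(n-2))
Sxx <- sum((x - mean(x))^2)
se_P0 <- sqrt(RSE^2 * (1/n + mean(x)^2 / Sxx))
se_P1 <- sqrt(RSE^2 / Sxx)
isTRUE(all.equal(unname(coef(summary(fit))[, "Std. Error"]), c(se_P0, se_P1)))
```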

4. One way to test whether a linear model (y = P_0 + P_1 * x) was effective is to look at the "t-statistic" of the slope parameter P_1. For this problem, rather than manually computing t_1 = (P_1 - 0)/SE(P_1), use R to compute the t-statistic. Also use confint to compute the resulting 95% confidence interval.
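A sketch of pulling the t-statistic and confidence interval out of R (synthetic data, not a specific assigned dataset):

```r
# t-statistic of the slope and its 95% confidence interval
set.seed(7)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)
ctab <- coef(summary(fit))
t1 <- ctab["x", "t value"]    # same as (P1 - 0) / SE(P1)
isTRUE(all.equal(t1, ctab["x", "Estimate"] / ctab["x", "Std. Error"]))
confint(fit, level = 0.95)    # 95% CI for intercept and slope
```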