ktbyte

HW3 (R) - Basis Functions

Nov 2nd, 2016

---- Basis Functions ----

1. Find the "King Penguin Breeding Pairs" dataset on http://www.stat.ufl.edu/~winner/datasets.html. Fit a quadratic equation to the data. What's the resulting r^2?

penguins <- read.table("http://www.stat.ufl.edu/~winner/data/penguin_trend_jj.dat")
names(penguins) <- c("year","pairs")
model <- lm(pairs ~ year + I(year^2), data=penguins)
summary(model)$r.squared
[1] 0.5512973
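
Not part of the assignment, but a quick plot is a useful sanity check of the quadratic fit (assumes the penguins data frame and model from above):

plot(penguins$year, penguins$pairs, xlab="year", ylab="breeding pairs")  # raw data
yrs <- seq(min(penguins$year), max(penguins$year))
lines(yrs, predict(model, data.frame(year=yrs)))                         # fitted quadratic curve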

2. Using the same data as #1, in what year would the population go back under 5000? Now fit a cubic function. In what year would the population now go back under 5000?

cbind(1960:2026, predict(lm(pairs ~ year + I(year^2), data=penguins), data.frame(year=1960:2026)))             # looks like around 2019
cbind(1960:2026, predict(lm(pairs ~ year + I(year^2) + I(year^3), data=penguins), data.frame(year=1960:2026))) # looks like around 2007
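
Rather than eyeballing the cbind() output, the crossing year can be pulled out programmatically. A minimal sketch, assuming the quadratic model above (the fitted curve peaks and then declines, so we want the first year after the peak with a prediction below 5000):

yrs  <- 1960:2026
pred <- predict(lm(pairs ~ year + I(year^2), data=penguins), data.frame(year=yrs))
min(yrs[yrs > yrs[which.max(pred)] & pred < 5000])   # first post-peak year predicted under 5000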

3. Find the "Exponential Decay of Beer Foam for 3 Brands after pouring glass" dataset. Fit an exponential equation to each of the three columns. When do you predict beer #1 to have a height of 1?

beer <- read.table("http://www.stat.ufl.edu/~winner/data/beer_foam2.dat")
names(beer) <- c("time","one","two","three")
summary(lm(log(one) ~ time, data=beer))
summary(lm(log(two) ~ time, data=beer))
summary(lm(log(three) ~ time, data=beer))

cbind(seq(0,1000,10), predict(lm(log(one) ~ time, data=beer), data.frame(time=seq(0,1000,10)))) # roughly at 870 seconds, log(one) goes below 0
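
The same answer can be read straight off the fitted coefficients: the model is log(one) = b0 + b1*time, and a height of 1 means log(one) = 0, i.e. time = -b0/b1. A minimal sketch, assuming the fit above:

fit1 <- lm(log(one) ~ time, data=beer)
unname(-coef(fit1)[1] / coef(fit1)[2])   # time (seconds) at which the predicted height crosses 1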

4. The following data was created with y^3 = a * x^2 + b * x + c. Find the coefficients a, b, and c:

structure(list(x = c(665, 2483, 496, 655, 1301, 2829, 1242, 1733,
550, 354, 1182, 1382, 2796, 592, 2325, 1963, 2497, 1950, 2559,
411, 2399, 774, 551, 1972, 2107, 2527, 1265, 1405, 2230, 1957,
2627, 1760, 1389, 1348, 988, 2044, 946, 1787, 1144, 1116, 15,
1749, 2680, 2166, 701, 1983, 198, 542, 2336, 2486, 1780, 2336,
991, 1098, 178, 2528, 1855, 542, 273, 2583, 1522, 2634, 2062,
32, 1521, 2019, 1020, 1836, 1123, 2523, 2390, 1806, 568, 882,
2820, 2436, 791, 46, 613, 220, 563, 2043, 2383, 2589, 1345, 31,
1310, 2199, 437, 272, 429, 1639, 2425, 2568, 211, 2637, 723,
2138, 1425, 2702), y = c(146, 343, 121, 145, 225, 374, 218, 271,
129, 99, 211, 233, 371, 135, 329, 294, 344, 293, 350, 107, 336,
161, 129, 295, 308, 347, 221, 236, 320, 293, 356, 274, 234, 230,
188, 302, 183, 276, 206, 203, 37, 273, 361, 314, 151, 296, 72,
128, 330, 344, 276, 330, 188, 201, 67, 347, 283, 128, 85, 352,
249, 357, 304, 39, 249, 299, 192, 281, 204, 347, 335, 278, 132,
174, 373, 339, 163, 43, 138, 75, 131, 302, 334, 353, 230, 38,
226, 317, 113, 85, 111, 261, 338, 351, 74, 357, 154, 311, 238,
363)), .Names = c("x", "y"), row.names = c(NA, -100L), class = "data.frame") -> dat

summary(lm(y ~ x + I(x^2), data=dat))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.207e+01  8.446e-01   49.82   <2e-16 ***
x            1.614e-01  1.379e-03  117.05   <2e-16 ***
I(x^2)      -1.615e-05  4.643e-07  -34.77   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.73 on 97 degrees of freedom
Multiple R-squared:  0.9992, Adjusted R-squared:  0.9992
F-statistic: 6.355e+04 on 2 and 97 DF,  p-value: < 2.2e-16
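
Note that the summary above models y itself. Since the data were generated with y^3 = a*x^2 + b*x + c, the coefficients a, b, and c can also be read directly from a fit of the cubed response; a minimal sketch, assuming dat from above:

fit <- lm(I(y^3) ~ I(x^2) + x, data=dat)
coef(fit)   # I(x^2) gives a, x gives b, (Intercept) gives c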

5. Build a multivariate linear regression on Grad.Rate in the College data of library(ISLR). Use the independent variables Private, Room.Board, Books, Expend, Top10perc, Top25perc, I(Enroll/Apps) and I(Accept/Apps). Write a short English sentence for each of these variables, such as "PrivateYes has a small P (1.7e-11) which is very significant (***), and the estimate is positive (8.425), so it seems like Private Schools have higher graduation rates."

library(ISLR)
summary(lm(Grad.Rate ~ Private + Room.Board + Books + Expend + Top10perc + Top25perc + I(Enroll/Apps) + I(Accept/Apps), data=College))

Call:
lm(formula = Grad.Rate ~ Private + Room.Board + Books + Expend + 
    Top10perc + Top25perc + I(Enroll/Apps) + I(Accept/Apps), 
    data = College)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.326  -8.689  -0.133   7.937  53.629 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.635e+01  4.935e+00   9.391  < 2e-16 ***
PrivateYes      8.425e+00  1.233e+00   6.833 1.70e-11 ***  # Private schools have higher graduation rates; highly significant (p = 1.7e-11)
Room.Board      2.583e-03  5.841e-04   4.422 1.12e-05 ***  # Higher room-and-board charges go with higher graduation rates; significant (p = 1.12e-05)
Books          -7.353e-03  3.034e-03  -2.424  0.01559 *    # Book spending has a negative coefficient here, even though regressed on its own its coefficient is positive
Expend         -1.244e-04  1.386e-04  -0.898  0.36962      # Expend is not significant once Room.Board and Books are included; by itself, t = 11.8, p < 2e-16
Top10perc       1.878e-01  7.218e-02   2.603  0.00943 **   # A larger share of top-10% high-school students goes with higher graduation, even with Private and the spending variables factored in
Top25perc       1.743e-01  5.634e-02   3.093  0.00205 **   # Same for the top-25% share, with even more significance than Top10perc; independently, t ~= 15, p < 2e-16 for either
I(Enroll/Apps) -1.781e+01  5.502e+00  -3.238  0.00126 **   # A lower enrollment-to-application ratio goes with higher graduation; significant (t = -3.238, p = 0.00126)
I(Accept/Apps) -3.229e+00  4.633e+00  -0.697  0.48613      # Once the enrollment ratio is included, the acceptance rate is no longer significant; by itself, t = -8.34, p = 3.39e-16
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.62 on 768 degrees of freedom
Multiple R-squared:  0.3776, Adjusted R-squared:  0.3711
F-statistic: 58.24 on 8 and 768 DF,  p-value: < 2.2e-16
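
The "by itself" figures quoted in the comments above come from single-predictor regressions, which can be checked directly; a minimal sketch (same ISLR College data):

summary(lm(Grad.Rate ~ Expend, data=College))$coefficients          # Expend on its own
summary(lm(Grad.Rate ~ I(Accept/Apps), data=College))$coefficients  # acceptance rate on its own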


---- Problems with Linear ----
(* Cover residuals in class)

6. Given x and y vectors, ten points are obvious outliers, where the y value is far from the predicted y_est. Return the row IDs of these points:

structure(c(6.38, 5.54, 7.66, 0.97, 7.48, 8.75, 0.08, 9.35, 4.07,
7.89, 9.76, 3.33, 7.87, 0.89, 0.44, 5.15, 0.48, 0.89, 6.88, 0.35,
4.31, 7.59, 8.32, 3.82, 6.16, 3.45, 5.32, 2.6, 5.73, 9.06, 5.66,
0.56, 1.02, 3.28, 9.91, 7.14, 0.79, 6.14, 4.93, 2.69, -14.89,
-13.04, 27.01, 3.78, 24.85, 30.59, -5.01, -11.57, 12.12, -8.88,
-22.88, -24.38, 31.73, 4.16, 2.4, 17.76, 5.31, 5.28, -42.06,
3.75, 15.83, 26.1, 24.89, -16.07, 19.82, 10.46, 16.44, 7.89,
18.39, 28.55, 20.57, 5.67, 3.47, -3.18, 32.44, 26.23, 4.26, 17.03,
17.26, 9.83), .Dim = c(40L, 2L)) -> dat

# which(dat[,2]<0)
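
which(dat[,2] < 0) happens to work here because every outlier has a negative y, but a residual-based check is more general. A minimal sketch, assuming the matrix above is assigned to dat (the cutoff of 10 is an assumption picked by eye from the residual plot):

fit <- lm(dat[,2] ~ dat[,1])
which(abs(resid(fit)) > 10)   # rows whose y is far from the fitted line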

7. Given x and y vectors, one point is a "high leverage point", where the x value is unusual. Remove this point and compute a simple linear regression. Output the coefficients of slope and intercept:

structure(c(6.38, 5.54, 7.66, 0.97, 7.48, 8.75, 0.08, 9.35, 4.07,
7.89, 9.76, 3.33, 7.87, 0.89, 0.44, 5.15, 18.29, 0.89, 6.88,
0.35, 4.31, 7.59, 8.32, 3.82, 6.16, 3.45, 5.32, 2.6, 5.73, 9.06,
5.66, 0.56, 1.02, 3.28, 9.91, 7.14, 0.79, 6.14, 4.93, 2.69, 20.05,
22.24, 27.01, 3.78, 24.85, 30.59, 6.7, 30.65, 12.12, 23.7, 35.3,
7.86, 31.73, 4.16, 2.4, 17.76, 35, 5.28, 18.74, 3.75, 15.83,
26.1, 24.89, 8.73, 19.82, 10.46, 16.44, 7.89, 18.39, 28.55, 20.57,
5.67, 3.47, 11.23, 32.44, 26.23, 4.26, 17.03, 17.26, 9.83), .Dim = c(40L,
2L)) -> dat

dat[17,] <- c(NA,NA)   # row 17 (x = 18.29) is the high leverage point

summary(lm(dat[,2] ~ dat[,1]))

Call:
lm(formula = dat[, 2] ~ dat[, 1])

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5658 -1.4007  0.0562  1.2290  5.7657 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.3467     0.7275   1.851   0.0722 .  
dat[, 1]      3.1280     0.1262  24.794   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.359 on 37 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.9432, Adjusted R-squared:  0.9417
F-statistic: 614.8 on 1 and 37 DF,  p-value: < 2.2e-16
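
The high-leverage row was removed by hand above (row 17, where x = 18.29). It can also be located programmatically from the hat values; a minimal sketch, run on the original matrix before the NA assignment:

which.max(hatvalues(lm(dat[,2] ~ dat[,1])))   # index of the most unusual x value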


8. After plotting the residuals of y ~ x, you should notice that consecutive points tend to have similar errors; that is, the error terms are correlated.

structure(list(x = c(0.04, 0.05, 0.06, 0.13, 0.17, 0.2, 0.25,
0.26, 0.31, 0.39, 0.5, 0.51, 0.78, 0.88, 1.09, 1.12, 1.15, 1.22,
1.33, 1.45, 1.61, 1.71, 1.88, 1.98, 2.2, 2.2, 2.28, 2.37, 2.62,
2.84, 2.88, 2.89, 2.9, 3.16, 3.27, 3.3, 3.31, 3.32, 3.33, 3.36,
3.49, 3.74, 3.83, 3.89, 4.18, 4.22, 4.27, 4.28, 4.3, 4.4, 4.55,
4.68, 4.69, 4.71, 4.84, 5.13, 5.21, 5.35, 5.48, 5.55, 5.57, 5.6,
5.93, 6.11, 6.13, 6.19, 6.27, 6.75, 7.17, 7.19, 7.26, 7.47, 7.48,
7.5, 7.63, 7.7, 7.73, 7.81, 7.9, 8.02, 8.06, 8.06, 8.48, 8.58,
8.59, 8.74, 8.78, 8.82, 8.83, 9.04, 9.16, 9.22, 9.24, 9.43, 9.52,
9.58, 9.75, 9.76, 9.89, 9.97), y = c(2.78, 3.79, 5.35, 8.13,
3.72, 7.34, 9.57, 6.82, 12.81, 15.87, 11.56, 14.39, 12.85, 7.66,
5.87, -0.59, 5.71, 3.66, -0.77, -7, -4.51, 2.76, -0.19, 0.98,
11.77, 14.57, 16.09, 15.93, 19.39, 21.68, 17.78, 19.7, 15.86,
11.63, 8.87, 4.44, 7.51, 7.37, 4.14, 6.04, 3.04, 3.37, 3.92,
7.92, 17.62, 15.94, 16.07, 17.51, 16.92, 23, 21.91, 23.24, 25.09,
25.9, 23.47, 21.5, 20.63, 12.99, 11.01, 10.47, 11.74, 8.01, 16.08,
13.35, 17.58, 17.54, 19.07, 30.31, 29.25, 27.28, 25.77, 23.09,
21.04, 21.38, 15.41, 20.16, 16.9, 13.67, 17.66, 14.46, 14.82,
17.71, 30.28, 29.93, 33.69, 36.34, 34.76, 39.92, 35.33, 33.42,
36.92, 31.73, 37.03, 29.7, 28.86, 26.46, 21.75, 24.08, 17.58,
21.5)), .Names = c("x", "y"), row.names = c(NA, -100L), class = "data.frame") -> sss

For example, if your data is in the data.frame "sss", then run plot(sss$x, lm(y ~ x, data=sss)$residuals, type="l")

Using the following lag function (call it lagpad), compute lags of (y,1) and (y,2) to build a data set like this:

lagpad <- function(x, k) { if(k>=0) { return(c(rep(NA, k), x)[1 : length(x)]) } else { x[-k+1 : (length(x)-k-1)] } }

head(cbind(sss,lagpad(sss$y,1),lagpad(sss$y,2)))
     x    y lagpad(sss$y, 1) lagpad(sss$y, 2)
1 0.04 2.78               NA               NA
2 0.05 3.79             2.78               NA
3 0.06 5.35             3.79             2.78
4 0.13 8.13             5.35             3.79
5 0.17 3.72             8.13             5.35
6 0.20 7.34             3.72             8.13

Regress y on all the variables: y ~ x + lagpad(y,1) + lagpad(y,2), etc.

For this problem, compute the r^2 for each successive linear regression, e.g.:
What is the r^2 for y ~ x?
What is the r^2 for y ~ x + lagpad(y,1)?
What is the r^2 for y ~ x + lagpad(y,1) + lagpad(y,2)?

Make a plot of how r^2 increases as we add successive lags of the series. A series that is correlated with its own past values like this is called an "autocorrelated time series".

# cumulative
rs <- sapply(0:10, function(xlagrange) {
  if (xlagrange == 0) {
    form <- formula("y ~ x")
  } else {
    form <- formula(paste("y ~ x", paste(sapply(1:xlagrange, function(t) { paste("+I(lagpad(sss$y,", t, "))", sep="") }), collapse="")))
  }
  summary(lm(form, data=sss))$r.squared
})
plot(rs)

# non-cumulative (autocorrelation)
rs <- sapply(0:10, function(xlagrange) {
  if (xlagrange == 0) {
    form <- formula("y ~ x")
  } else {
    form <- formula(paste("y ~ x", paste(sapply(xlagrange:xlagrange, function(t) { paste("+I(lagpad(sss$y,", t, "))", sep="") }), collapse="")))
  }
  summary(lm(form, data=sss))$r.squared
})
plot(rs)
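
Base R's acf() gives the same picture directly from the residuals of the plain y ~ x fit; a one-line sketch, assuming sss holds the data frame above:

acf(resid(lm(y ~ x, data=sss)))   # autocorrelation of the residuals at successive lags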

9. Collinearity ** In class: ISLR problem #14, p. 125 **
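
Not the ISLR exercise itself, but a minimal simulated illustration of the collinearity issue (the coefficients and noise levels below are arbitrary assumptions): two nearly collinear predictors inflate each other's standard errors.

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100, sd = 0.01)      # x2 is almost a linear function of x1
y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
summary(lm(y ~ x1 + x2))   # large standard errors on x1 and x2
summary(lm(y ~ x1))        # dropping the collinear term tightens the estimate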

10. Look for some interesting data! Bring it to class on Monday.