- ---- Basis Functions ----
- 1. Find the "King Penguin Breeding Pairs" dataset on http://www.stat.ufl.edu/~winner/datasets.html . Fit a quadratic equation to the data. What's the resulting r^2?
- penguins <- read.table("http://www.stat.ufl.edu/~winner/data/penguin_trend_jj.dat")
- names(penguins) <- c("year","pairs")
- > model <- lm(pairs ~ year + I(year^2), data=penguins)
- > summary(model)$r.squared
- [1] 0.5512973
- 2. Using the same data as #1, in what year would the population go back under 5000? Now fit a cubic function. What year would the population now go back under 5000?
- cbind(1960:2026,predict( lm(pairs ~ year + I(year^2), data=penguins), data.frame(year=1960:2026))) #looks like around 2019
- cbind(1960:2026,predict( lm(pairs ~ year + I(year^2) + I(year^3), data=penguins), data.frame(year=1960:2026))) # looks like around 2007
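Rather than eyeballing the printed tables, the crossing year can be pulled out directly. A minimal sketch, assuming penguins has been read in as above; the filter to years after the observed data is my addition, to skip any early years the fitted curve also puts under 5000:

```r
# Assumes `penguins` (year, pairs) is already loaded as above.
years <- 1960:2026
quad <- lm(pairs ~ year + I(year^2), data = penguins)
cube <- lm(pairs ~ year + I(year^2) + I(year^3), data = penguins)
# First year after the observed data whose prediction falls below 5000:
future <- years > max(penguins$year)
years[future][which(predict(quad, data.frame(year = years[future])) < 5000)[1]]
years[future][which(predict(cube, data.frame(year = years[future])) < 5000)[1]]
```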
- 3. Find the "Exponential Decay of Beer Foam for 3 Brands after pouring glass" dataset. Fit an exponential equation to each of the three columns. When do you predict beer #1 to have a height of 1?
- beer <- read.table("http://www.stat.ufl.edu/~winner/data/beer_foam2.dat")
- names(beer) <- c("time","one","two","three")
- summary(lm(log(one) ~ time, data=beer))
- summary(lm(log(two) ~ time, data=beer))
- summary(lm(log(three) ~ time, data=beer))
- cbind(seq(0,1000,10),predict(lm(log(one) ~ time, data=beer), data.frame(time=seq(0,1000,10)))) # log(one) drops below 0 (i.e. height below 1) at roughly 870 seconds
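For the exact crossing time, note the fitted model is log(one) = b0 + b1*time, and a height of 1 means log(one) = 0, so the predicted time is -b0/b1. A sketch, assuming beer is loaded as above:

```r
# Assumes `beer` (time, one, two, three) is already loaded as above.
fit <- lm(log(one) ~ time, data = beer)
b <- coef(fit)
# height == 1  <=>  log(height) == 0  <=>  time == -intercept/slope
unname(-b[1] / b[2])
```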
- 4. The following data was created with y^3 = a * x^2 + b * x + c. Find the coefficients a, b, and c:
- structure(list(x = c(665, 2483, 496, 655, 1301, 2829, 1242, 1733,
- 550, 354, 1182, 1382, 2796, 592, 2325, 1963, 2497, 1950, 2559,
- 411, 2399, 774, 551, 1972, 2107, 2527, 1265, 1405, 2230, 1957,
- 2627, 1760, 1389, 1348, 988, 2044, 946, 1787, 1144, 1116, 15,
- 1749, 2680, 2166, 701, 1983, 198, 542, 2336, 2486, 1780, 2336,
- 991, 1098, 178, 2528, 1855, 542, 273, 2583, 1522, 2634, 2062,
- 32, 1521, 2019, 1020, 1836, 1123, 2523, 2390, 1806, 568, 882,
- 2820, 2436, 791, 46, 613, 220, 563, 2043, 2383, 2589, 1345, 31,
- 1310, 2199, 437, 272, 429, 1639, 2425, 2568, 211, 2637, 723,
- 2138, 1425, 2702), y = c(146, 343, 121, 145, 225, 374, 218, 271,
- 129, 99, 211, 233, 371, 135, 329, 294, 344, 293, 350, 107, 336,
- 161, 129, 295, 308, 347, 221, 236, 320, 293, 356, 274, 234, 230,
- 188, 302, 183, 276, 206, 203, 37, 273, 361, 314, 151, 296, 72,
- 128, 330, 344, 276, 330, 188, 201, 67, 347, 283, 128, 85, 352,
- 249, 357, 304, 39, 249, 299, 192, 281, 204, 347, 335, 278, 132,
- 174, 373, 339, 163, 43, 138, 75, 131, 302, 334, 353, 230, 38,
- 226, 317, 113, 85, 111, 261, 338, 351, 74, 357, 154, 311, 238,
- 363)), .Names = c("x", "y"), row.names = c(NA, -100L), class = "data.frame") -> dat
- # Note: since the data follow y^3 = a*x^2 + b*x + c, regress y^3 to recover a, b, and c:
- # summary(lm(I(y^3) ~ x + I(x^2), dat))
- summary(lm(y ~ x + I(x^2),dat))
- Coefficients:
- Estimate Std. Error t value Pr(>|t|)
- (Intercept) 4.207e+01 8.446e-01 49.82 <2e-16 ***
- x 1.614e-01 1.379e-03 117.05 <2e-16 ***
- I(x^2) -1.615e-05 4.643e-07 -34.77 <2e-16 ***
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Residual standard error: 2.73 on 97 degrees of freedom
- Multiple R-squared: 0.9992, Adjusted R-squared: 0.9992
- F-statistic: 6.355e+04 on 2 and 97 DF, p-value: < 2.2e-16
- 5. Build a multivariate linear regression on Grad.Rate in the College data of library(ISLR). Use the independent variables Private, Room.Board, Books, Expend, Top10perc, Top25perc, I(Enroll/Apps), and I(Accept/Apps). Write a short English sentence for each of these variables, such as "PrivateYes has a small P (1.7e-11) which is very significant (***), and the estimate is positive (8.425), so it seems like Private Schools have higher graduation rates."
- > summary(lm(Grad.Rate ~ Private + Room.Board + Books + Expend + Top10perc + Top25perc + I(Enroll/Apps) + I(Accept/Apps), data=College))
- Call:
- lm(formula = Grad.Rate ~ Private + Room.Board + Books + Expend +
- Top10perc + Top25perc + I(Enroll/Apps) + I(Accept/Apps),
- data = College)
- Residuals:
- Min 1Q Median 3Q Max
- -46.326 -8.689 -0.133 7.937 53.629
- Coefficients:
- Estimate Std. Error t value Pr(>|t|)
- (Intercept) 4.635e+01 4.935e+00 9.391 < 2e-16 ***
- PrivateYes 8.425e+00 1.233e+00 6.833 1.70e-11 *** # Private schools have higher graduation rates, and the effect is highly significant (p = 1.70e-11)
- Room.Board 2.583e-03 5.841e-04 4.422 1.12e-05 *** # Higher room-and-board spending is associated with higher graduation (p = 1.12e-05)
- Books -7.353e-03 3.034e-03 -2.424 0.01559 * # Higher spending on books is negatively associated here, although regressed on its own its coefficient is positive
- Expend -1.244e-04 1.386e-04 -0.898 0.36962 # Expenditure is not significant once Room.Board and Books are factored in; by itself, t = 11.8, p < 2e-16
- Top10perc 1.878e-01 7.218e-02 2.603 0.00943 ** # A larger share of Top 10% high-school students is associated with higher graduation, even with the other variables factored in
- Top25perc 1.743e-01 5.634e-02 3.093 0.00205 ** # Same for the Top 25% share, with even more significance than Top 10%; independently, t ~= 15, p < 2e-16 for either
- I(Enroll/Apps) -1.781e+01 5.502e+00 -3.238 0.00126 ** # A lower enrollment-to-application ratio is associated with higher graduation, with some significance (t = -3.238)
- I(Accept/Apps) -3.229e+00 4.633e+00 -0.697 0.48613 # With enrollment factored in, the acceptance rate is no longer significant; by itself t = -8.34, p = 3.39e-16
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Residual standard error: 13.62 on 768 degrees of freedom
- Multiple R-squared: 0.3776, Adjusted R-squared: 0.3711
- F-statistic: 58.24 on 8 and 768 DF, p-value: < 2.2e-16
- ---- Problems with Linear ----
- (* Cover Residual in class)
- 6. Given x and y vectors, ten points are obvious outliers, where the y is far from the predicted y_est. Return the row ids of these points:
- dat <- structure(c(6.38, 5.54, 7.66, 0.97, 7.48, 8.75, 0.08, 9.35, 4.07,
- 7.89, 9.76, 3.33, 7.87, 0.89, 0.44, 5.15, 0.48, 0.89, 6.88, 0.35,
- 4.31, 7.59, 8.32, 3.82, 6.16, 3.45, 5.32, 2.6, 5.73, 9.06, 5.66,
- 0.56, 1.02, 3.28, 9.91, 7.14, 0.79, 6.14, 4.93, 2.69, -14.89,
- -13.04, 27.01, 3.78, 24.85, 30.59, -5.01, -11.57, 12.12, -8.88,
- -22.88, -24.38, 31.73, 4.16, 2.4, 17.76, 5.31, 5.28, -42.06,
- 3.75, 15.83, 26.1, 24.89, -16.07, 19.82, 10.46, 16.44, 7.89,
- 18.39, 28.55, 20.57, 5.67, 3.47, -3.18, 32.44, 26.23, 4.26, 17.03,
- 17.26, 9.83), .Dim = c(40L, 2L))
- # which(dat[,2]<0) works here only because all ten outliers happen to have negative y
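A more general approach ranks points by absolute residual from the fitted line. A sketch, assuming the structure above is bound to dat; with 10 gross outliers among 40 points, the ordinary fit should still rank them on top:

```r
# Assumes the 40x2 matrix above is bound to `dat` (column 1 = x, column 2 = y).
fit <- lm(dat[, 2] ~ dat[, 1])
# Row ids of the ten points farthest (vertically) from the fitted line:
sort(order(abs(residuals(fit)), decreasing = TRUE)[1:10])
```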
- 7. Given x and y vectors, one point is a "high leverage point", where the x value is unusual. Remove this point and compute a simple linear regression. Output the slope and intercept coefficients:
- dat <- structure(c(6.38, 5.54, 7.66, 0.97, 7.48, 8.75, 0.08, 9.35, 4.07,
- 7.89, 9.76, 3.33, 7.87, 0.89, 0.44, 5.15, 18.29, 0.89, 6.88,
- 0.35, 4.31, 7.59, 8.32, 3.82, 6.16, 3.45, 5.32, 2.6, 5.73, 9.06,
- 5.66, 0.56, 1.02, 3.28, 9.91, 7.14, 0.79, 6.14, 4.93, 2.69, 20.05,
- 22.24, 27.01, 3.78, 24.85, 30.59, 6.7, 30.65, 12.12, 23.7, 35.3,
- 7.86, 31.73, 4.16, 2.4, 17.76, 35, 5.28, 18.74, 3.75, 15.83,
- 26.1, 24.89, 8.73, 19.82, 10.46, 16.44, 7.89, 18.39, 28.55, 20.57,
- 5.67, 3.47, 11.23, 32.44, 26.23, 4.26, 17.03, 17.26, 9.83), .Dim = c(40L,
- 2L))
- dat[17,] <- c(NA,NA)
- > summary(lm(dat[,2] ~dat[,1]))
- Call:
- lm(formula = dat[, 2] ~ dat[, 1])
- Residuals:
- Min 1Q Median 3Q Max
- -4.5658 -1.4007 0.0562 1.2290 5.7657
- Coefficients:
- Estimate Std. Error t value Pr(>|t|)
- (Intercept) 1.3467 0.7275 1.851 0.0722 .
- dat[, 1] 3.1280 0.1262 24.794 <2e-16 ***
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Residual standard error: 2.359 on 37 degrees of freedom
- (1 observation deleted due to missingness)
- Multiple R-squared: 0.9432, Adjusted R-squared: 0.9417
- F-statistic: 614.8 on 1 and 37 DF, p-value: < 2.2e-16
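The unusual x (18.29, in row 17) can also be found programmatically via hat values rather than by inspection. A sketch, assuming the structure above is bound to dat:

```r
# Assumes the 40x2 matrix above is bound to `dat`.
fit <- lm(dat[, 2] ~ dat[, 1])
hi <- which.max(hatvalues(fit))       # the highest-leverage row (row 17 here)
coef(lm(dat[-hi, 2] ~ dat[-hi, 1]))   # intercept and slope without it
```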
- 8. After plotting the residual of y~x, you should notice that consecutive points tend to have the same error, that is -- the error terms are correlated.
- structure(list(x = c(0.04, 0.05, 0.06, 0.13, 0.17, 0.2, 0.25,
- 0.26, 0.31, 0.39, 0.5, 0.51, 0.78, 0.88, 1.09, 1.12, 1.15, 1.22,
- 1.33, 1.45, 1.61, 1.71, 1.88, 1.98, 2.2, 2.2, 2.28, 2.37, 2.62,
- 2.84, 2.88, 2.89, 2.9, 3.16, 3.27, 3.3, 3.31, 3.32, 3.33, 3.36,
- 3.49, 3.74, 3.83, 3.89, 4.18, 4.22, 4.27, 4.28, 4.3, 4.4, 4.55,
- 4.68, 4.69, 4.71, 4.84, 5.13, 5.21, 5.35, 5.48, 5.55, 5.57, 5.6,
- 5.93, 6.11, 6.13, 6.19, 6.27, 6.75, 7.17, 7.19, 7.26, 7.47, 7.48,
- 7.5, 7.63, 7.7, 7.73, 7.81, 7.9, 8.02, 8.06, 8.06, 8.48, 8.58,
- 8.59, 8.74, 8.78, 8.82, 8.83, 9.04, 9.16, 9.22, 9.24, 9.43, 9.52,
- 9.58, 9.75, 9.76, 9.89, 9.97), y = c(2.78, 3.79, 5.35, 8.13,
- 3.72, 7.34, 9.57, 6.82, 12.81, 15.87, 11.56, 14.39, 12.85, 7.66,
- 5.87, -0.59, 5.71, 3.66, -0.77, -7, -4.51, 2.76, -0.19, 0.98,
- 11.77, 14.57, 16.09, 15.93, 19.39, 21.68, 17.78, 19.7, 15.86,
- 11.63, 8.87, 4.44, 7.51, 7.37, 4.14, 6.04, 3.04, 3.37, 3.92,
- 7.92, 17.62, 15.94, 16.07, 17.51, 16.92, 23, 21.91, 23.24, 25.09,
- 25.9, 23.47, 21.5, 20.63, 12.99, 11.01, 10.47, 11.74, 8.01, 16.08,
- 13.35, 17.58, 17.54, 19.07, 30.31, 29.25, 27.28, 25.77, 23.09,
- 21.04, 21.38, 15.41, 20.16, 16.9, 13.67, 17.66, 14.46, 14.82,
- 17.71, 30.28, 29.93, 33.69, 36.34, 34.76, 39.92, 35.33, 33.42,
- 36.92, 31.73, 37.03, 29.7, 28.86, 26.46, 21.75, 24.08, 17.58,
- 21.5)), .Names = c("x", "y"), row.names = c(NA, -100L), class = "data.frame")
- For example, if your data is in the data.frame "sss", then run plot(sss$x,lm(y~x,data=sss)$residuals, type="l")
- Using this helper function (call it lagpad), compute lags of (y,1) and (y,2) to build a data set like the one below:
- lagpad <- function(x, k) { if (k >= 0) { return(c(rep(NA, k), x)[1:length(x)]) } else { x[-k + 1:(length(x) - k - 1)] } }
- > head(cbind(sss,lagpad(sss$y,1),lagpad(sss$y,2)))
- x y lagpad(sss$y, 1) lagpad(sss$y, 2)
- 1 0.04 2.78 NA NA
- 2 0.05 3.79 2.78 NA
- 3 0.06 5.35 3.79 2.78
- 4 0.13 8.13 5.35 3.79
- 5 0.17 3.72 8.13 5.35
- 6 0.20 7.34 3.72 8.13
- Regress y on all the variables: y ~ x + lagpad(y,1) + lagpad(y,2), etc.
- For this problem, compute the r^2 for each successive linear regression, e.g.:
- What is the r^2 for y~x?
- What is the r^2 for y~x+lagpad(y,1)?
- What is the r^2 for y~x+lagpad(y,1)+lagpad(y,2)?
- Make a plot of how r^2 increases as we correlate with successively lagged copies of the series. This is called an "autocorrelated time series".
- # cumulative
- rs <- sapply(0:10, function(xlagrange) {
- if(xlagrange==0) {
- form <- formula("y ~ x")
- } else {
- form <- formula(paste("y ~ x",paste(sapply(1:xlagrange , function(t) { paste("+I(lagpad(sss$y,",t,"))",sep="") }),collapse="")))
- }
- summary(lm(form,data=sss))$r.squared
- })
- plot(rs)
- # non cumulative (autocorrelation)
- rs <- sapply(0:10, function(xlagrange) {
- if(xlagrange==0) {
- form <- formula("y ~ x")
- } else {
- form <- formula(paste("y ~ x",paste(sapply(xlagrange:xlagrange , function(t) { paste("+I(lagpad(sss$y,",t,"))",sep="") }),collapse="")))
- }
- summary(lm(form,data=sss))$r.squared
- })
- plot(rs)
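The same autocorrelation structure can be seen directly with base R's acf() on the residuals of the plain fit, without building the lagged regressions by hand. A sketch, assuming sss is the data.frame from the problem statement:

```r
# Assumes `sss` is the data.frame given in the problem.
# Plots the residual autocorrelation at lags 0..10:
acf(residuals(lm(y ~ x, data = sss)), lag.max = 10)
```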
- 9. Collinearity (in class: ISLR problem #14, p. 125)
- 10. Look for some interesting data! Bring it to class Monday.