- ---- Basis Functions ----
- 1. Find the "King Penguin Breeding Pairs" dataset on http://www.stat.ufl.edu/~winner/datasets.html . Fit a quadratic equation to the data. What's the resulting r^2?
- penguins <- read.table("http://www.stat.ufl.edu/~winner/data/penguin_trend_jj.dat")
- names(penguins) <- c("year","pairs")
- > model <- lm(pairs ~ year + I(year^2), data=penguins)
- > summary(model)$r.squared
- [1] 0.5512973
- 2. Using the same data as #1, in what year would the population go back under 5000? Now fit a cubic function. What year would the population now go back under 5000?
- cbind(1960:2026,predict( lm(pairs ~ year + I(year^2), data=penguins), data.frame(year=1960:2026))) #looks like around 2019
- cbind(1960:2026,predict( lm(pairs ~ year + I(year^2) + I(year^3), data=penguins), data.frame(year=1960:2026))) # looks like around 2007
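Rather than eyeballing the printed tables, the crossing year can be pulled out directly. A minimal sketch, assuming penguins has been read in as above; the filter to years after the observed data is my addition, to skip any early years the fitted curve also puts under 5000:

```r
# Assumes `penguins` (year, pairs) is already loaded as above.
years <- 1960:2026
quad <- lm(pairs ~ year + I(year^2), data = penguins)
cube <- lm(pairs ~ year + I(year^2) + I(year^3), data = penguins)
# First year after the observed data whose prediction falls below 5000:
future <- years > max(penguins$year)
years[future][which(predict(quad, data.frame(year = years[future])) < 5000)[1]]
years[future][which(predict(cube, data.frame(year = years[future])) < 5000)[1]]
```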
- 3. Find the "Exponential Decay of Beer Foam for 3 Brands after pouring glass" dataset. Fit an exponential equation to each of the three columns. When do you predict beer #1 to have a height of 1?
- beer <- read.table("http://www.stat.ufl.edu/~winner/data/beer_foam2.dat")
- names(beer) <- c("time","one","two","three")
- summary(lm(log(one) ~ time, data=beer))
- summary(lm(log(two) ~ time, data=beer))
- summary(lm(log(three) ~ time, data=beer))
- cbind(seq(0,1000,10),predict(lm(log(one) ~ time, data=beer), data.frame(time=seq(0,1000,10)))) # log(one) drops below 0 (i.e. height below 1) at roughly 870 seconds
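For the exact crossing time, note the fitted model is log(one) = b0 + b1*time, and a height of 1 means log(one) = 0, so the predicted time is -b0/b1. A sketch, assuming beer is loaded as above:

```r
# Assumes `beer` (time, one, two, three) is already loaded as above.
fit <- lm(log(one) ~ time, data = beer)
b <- coef(fit)
# height == 1  <=>  log(height) == 0  <=>  time == -intercept/slope
unname(-b[1] / b[2])
```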
- 4. The following data was created with y^3 = a * x^2 + b * x + c. Find the coefficients a, b, and c:
- structure(list(x = c(665, 2483, 496, 655, 1301, 2829, 1242, 1733,
- 550, 354, 1182, 1382, 2796, 592, 2325, 1963, 2497, 1950, 2559,
- 411, 2399, 774, 551, 1972, 2107, 2527, 1265, 1405, 2230, 1957,
- 2627, 1760, 1389, 1348, 988, 2044, 946, 1787, 1144, 1116, 15,
- 1749, 2680, 2166, 701, 1983, 198, 542, 2336, 2486, 1780, 2336,
- 991, 1098, 178, 2528, 1855, 542, 273, 2583, 1522, 2634, 2062,
- 32, 1521, 2019, 1020, 1836, 1123, 2523, 2390, 1806, 568, 882,
- 2820, 2436, 791, 46, 613, 220, 563, 2043, 2383, 2589, 1345, 31,
- 1310, 2199, 437, 272, 429, 1639, 2425, 2568, 211, 2637, 723,
- 2138, 1425, 2702), y = c(146, 343, 121, 145, 225, 374, 218, 271,
- 129, 99, 211, 233, 371, 135, 329, 294, 344, 293, 350, 107, 336,
- 161, 129, 295, 308, 347, 221, 236, 320, 293, 356, 274, 234, 230,
- 188, 302, 183, 276, 206, 203, 37, 273, 361, 314, 151, 296, 72,
- 128, 330, 344, 276, 330, 188, 201, 67, 347, 283, 128, 85, 352,
- 249, 357, 304, 39, 249, 299, 192, 281, 204, 347, 335, 278, 132,
- 174, 373, 339, 163, 43, 138, 75, 131, 302, 334, 353, 230, 38,
- 226, 317, 113, 85, 111, 261, 338, 351, 74, 357, 154, 311, 238,
- 363)), .Names = c("x", "y"), row.names = c(NA, -100L), class = "data.frame") -> dat
- # Note: since the data follow y^3 = a*x^2 + b*x + c, regress y^3 to recover a, b, and c:
- # summary(lm(I(y^3) ~ x + I(x^2), dat))
- summary(lm(y ~ x + I(x^2),dat))
- Coefficients:
- Estimate Std. Error t value Pr(>|t|)
- (Intercept) 4.207e+01 8.446e-01 49.82 <2e-16 ***
- x 1.614e-01 1.379e-03 117.05 <2e-16 ***
- I(x^2) -1.615e-05 4.643e-07 -34.77 <2e-16 ***
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Residual standard error: 2.73 on 97 degrees of freedom
- Multiple R-squared: 0.9992, Adjusted R-squared: 0.9992
- F-statistic: 6.355e+04 on 2 and 97 DF, p-value: < 2.2e-16
- 5. Build a multivariate linear regression on Grad.Rate in the College data of library(ISLR). Use the independent variables Private, Room.Board, Books, Expend, Top10perc, Top25perc, I(Enroll/Apps), and I(Accept/Apps). Write a short English sentence for each of these variables, such as "PrivateYes has a small P (1.7e-11) which is very significant (***), and the estimate is positive (8.425), so it seems like Private Schools have higher graduation rates."
- > summary(lm(Grad.Rate ~ Private + Room.Board + Books + Expend + Top10perc + Top25perc + I(Enroll/Apps) + I(Accept/Apps), data=College))
- Call:
- lm(formula = Grad.Rate ~ Private + Room.Board + Books + Expend +
- Top10perc + Top25perc + I(Enroll/Apps) + I(Accept/Apps),
- data = College)
- Residuals:
- Min 1Q Median 3Q Max
- -46.326 -8.689 -0.133 7.937 53.629
- Coefficients:
- Estimate Std. Error t value Pr(>|t|)
- (Intercept) 4.635e+01 4.935e+00 9.391 < 2e-16 ***
- PrivateYes 8.425e+00 1.233e+00 6.833 1.70e-11 *** # Private schools have higher graduation rates, and the effect is highly significant (p = 1.70e-11)
- Room.Board 2.583e-03 5.841e-04 4.422 1.12e-05 *** # Higher room-and-board spending is associated with higher graduation (p = 1.12e-05)
- Books -7.353e-03 3.034e-03 -2.424 0.01559 * # Higher spending on books is negatively associated here, although regressed on its own its coefficient is positive
- Expend -1.244e-04 1.386e-04 -0.898 0.36962 # Expenditure is not significant once Room.Board and Books are factored in; by itself, t = 11.8, p < 2e-16
- Top10perc 1.878e-01 7.218e-02 2.603 0.00943 ** # A larger share of Top 10% high-school students is associated with higher graduation, even with the other variables factored in
- Top25perc 1.743e-01 5.634e-02 3.093 0.00205 ** # Same for the Top 25% share, with even more significance than Top 10%; independently, t ~= 15, p < 2e-16 for either
- I(Enroll/Apps) -1.781e+01 5.502e+00 -3.238 0.00126 ** # A lower enrollment-to-application ratio is associated with higher graduation, with some significance (t = -3.238)
- I(Accept/Apps) -3.229e+00 4.633e+00 -0.697 0.48613 # With enrollment factored in, the acceptance rate is no longer significant; by itself t = -8.34, p = 3.39e-16
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Residual standard error: 13.62 on 768 degrees of freedom
- Multiple R-squared: 0.3776, Adjusted R-squared: 0.3711
- F-statistic: 58.24 on 8 and 768 DF, p-value: < 2.2e-16
- ---- Problems with Linear ----
- (* Cover Residual in class)
- 6. Given x and y vectors, ten points are obvious outliers, where the y is far from the predicted y_est. Return the row ids of these points:
- dat <- structure(c(6.38, 5.54, 7.66, 0.97, 7.48, 8.75, 0.08, 9.35, 4.07,
- 7.89, 9.76, 3.33, 7.87, 0.89, 0.44, 5.15, 0.48, 0.89, 6.88, 0.35,
- 4.31, 7.59, 8.32, 3.82, 6.16, 3.45, 5.32, 2.6, 5.73, 9.06, 5.66,
- 0.56, 1.02, 3.28, 9.91, 7.14, 0.79, 6.14, 4.93, 2.69, -14.89,
- -13.04, 27.01, 3.78, 24.85, 30.59, -5.01, -11.57, 12.12, -8.88,
- -22.88, -24.38, 31.73, 4.16, 2.4, 17.76, 5.31, 5.28, -42.06,
- 3.75, 15.83, 26.1, 24.89, -16.07, 19.82, 10.46, 16.44, 7.89,
- 18.39, 28.55, 20.57, 5.67, 3.47, -3.18, 32.44, 26.23, 4.26, 17.03,
- 17.26, 9.83), .Dim = c(40L, 2L))
- # which(dat[,2]<0) works here only because all ten outliers happen to have negative y
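A more general approach ranks points by absolute residual from the fitted line. A sketch, assuming the structure above is bound to dat; with 10 gross outliers among 40 points, the ordinary fit should still rank them on top:

```r
# Assumes the 40x2 matrix above is bound to `dat` (column 1 = x, column 2 = y).
fit <- lm(dat[, 2] ~ dat[, 1])
# Row ids of the ten points farthest (vertically) from the fitted line:
sort(order(abs(residuals(fit)), decreasing = TRUE)[1:10])
```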
- 7. Given x and y vectors, one point is a "high leverage point", where the x value is unusual. Remove this point and compute a simple linear regression. Output the slope and intercept coefficients:
- dat <- structure(c(6.38, 5.54, 7.66, 0.97, 7.48, 8.75, 0.08, 9.35, 4.07,
- 7.89, 9.76, 3.33, 7.87, 0.89, 0.44, 5.15, 18.29, 0.89, 6.88,
- 0.35, 4.31, 7.59, 8.32, 3.82, 6.16, 3.45, 5.32, 2.6, 5.73, 9.06,
- 5.66, 0.56, 1.02, 3.28, 9.91, 7.14, 0.79, 6.14, 4.93, 2.69, 20.05,
- 22.24, 27.01, 3.78, 24.85, 30.59, 6.7, 30.65, 12.12, 23.7, 35.3,
- 7.86, 31.73, 4.16, 2.4, 17.76, 35, 5.28, 18.74, 3.75, 15.83,
- 26.1, 24.89, 8.73, 19.82, 10.46, 16.44, 7.89, 18.39, 28.55, 20.57,
- 5.67, 3.47, 11.23, 32.44, 26.23, 4.26, 17.03, 17.26, 9.83), .Dim = c(40L,
- 2L))
- dat[17,] <- c(NA,NA)
- > summary(lm(dat[,2] ~dat[,1]))
- Call:
- lm(formula = dat[, 2] ~ dat[, 1])
- Residuals:
- Min 1Q Median 3Q Max
- -4.5658 -1.4007 0.0562 1.2290 5.7657
- Coefficients:
- Estimate Std. Error t value Pr(>|t|)
- (Intercept) 1.3467 0.7275 1.851 0.0722 .
- dat[, 1] 3.1280 0.1262 24.794 <2e-16 ***
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Residual standard error: 2.359 on 37 degrees of freedom
- (1 observation deleted due to missingness)
- Multiple R-squared: 0.9432, Adjusted R-squared: 0.9417
- F-statistic: 614.8 on 1 and 37 DF, p-value: < 2.2e-16
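The unusual x (18.29, in row 17) can also be found programmatically via hat values rather than by inspection. A sketch, assuming the structure above is bound to dat:

```r
# Assumes the 40x2 matrix above is bound to `dat`.
fit <- lm(dat[, 2] ~ dat[, 1])
hi <- which.max(hatvalues(fit))       # the highest-leverage row (row 17 here)
coef(lm(dat[-hi, 2] ~ dat[-hi, 1]))   # intercept and slope without it
```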
- 8. After plotting the residual of y~x, you should notice that consecutive points tend to have the same error, that is -- the error terms are correlated.
- structure(list(x = c(0.04, 0.05, 0.06, 0.13, 0.17, 0.2, 0.25,
- 0.26, 0.31, 0.39, 0.5, 0.51, 0.78, 0.88, 1.09, 1.12, 1.15, 1.22,
- 1.33, 1.45, 1.61, 1.71, 1.88, 1.98, 2.2, 2.2, 2.28, 2.37, 2.62,
- 2.84, 2.88, 2.89, 2.9, 3.16, 3.27, 3.3, 3.31, 3.32, 3.33, 3.36,
- 3.49, 3.74, 3.83, 3.89, 4.18, 4.22, 4.27, 4.28, 4.3, 4.4, 4.55,
- 4.68, 4.69, 4.71, 4.84, 5.13, 5.21, 5.35, 5.48, 5.55, 5.57, 5.6,
- 5.93, 6.11, 6.13, 6.19, 6.27, 6.75, 7.17, 7.19, 7.26, 7.47, 7.48,
- 7.5, 7.63, 7.7, 7.73, 7.81, 7.9, 8.02, 8.06, 8.06, 8.48, 8.58,
- 8.59, 8.74, 8.78, 8.82, 8.83, 9.04, 9.16, 9.22, 9.24, 9.43, 9.52,
- 9.58, 9.75, 9.76, 9.89, 9.97), y = c(2.78, 3.79, 5.35, 8.13,
- 3.72, 7.34, 9.57, 6.82, 12.81, 15.87, 11.56, 14.39, 12.85, 7.66,
- 5.87, -0.59, 5.71, 3.66, -0.77, -7, -4.51, 2.76, -0.19, 0.98,
- 11.77, 14.57, 16.09, 15.93, 19.39, 21.68, 17.78, 19.7, 15.86,
- 11.63, 8.87, 4.44, 7.51, 7.37, 4.14, 6.04, 3.04, 3.37, 3.92,
- 7.92, 17.62, 15.94, 16.07, 17.51, 16.92, 23, 21.91, 23.24, 25.09,
- 25.9, 23.47, 21.5, 20.63, 12.99, 11.01, 10.47, 11.74, 8.01, 16.08,
- 13.35, 17.58, 17.54, 19.07, 30.31, 29.25, 27.28, 25.77, 23.09,
- 21.04, 21.38, 15.41, 20.16, 16.9, 13.67, 17.66, 14.46, 14.82,
- 17.71, 30.28, 29.93, 33.69, 36.34, 34.76, 39.92, 35.33, 33.42,
- 36.92, 31.73, 37.03, 29.7, 28.86, 26.46, 21.75, 24.08, 17.58,
- 21.5)), .Names = c("x", "y"), row.names = c(NA, -100L), class = "data.frame")
- For example, if your data is in the data.frame "sss", then run plot(sss$x,lm(y~x,data=sss)$residuals, type="l")
- Using this helper function (call it lagpad), compute lags of (y,1) and (y,2) to build a data set like the one below:
- lagpad <- function(x, k) { if (k >= 0) { return(c(rep(NA, k), x)[1:length(x)]) } else { x[-k + 1:(length(x) - k - 1)] } }
- > head(cbind(sss,lagpad(sss$y,1),lagpad(sss$y,2)))
- x y lagpad(sss$y, 1) lagpad(sss$y, 2)
- 1 0.04 2.78 NA NA
- 2 0.05 3.79 2.78 NA
- 3 0.06 5.35 3.79 2.78
- 4 0.13 8.13 5.35 3.79
- 5 0.17 3.72 8.13 5.35
- 6 0.20 7.34 3.72 8.13
- Regress y on all the variables: y ~ x + lagpad(y,1) + lagpad(y,2), etc.
- For this problem, compute the r^2 for each successive linear regression, e.g.:
- What is the r^2 for y~x?
- What is the r^2 for y~x+lagpad(y,1)?
- What is the r^2 for y~x+lagpad(y,1)+lagpad(y,2)?
- Make a plot of how r^2 increases as we correlate with successively lagged copies of the series. This is called an "autocorrelated time series".
- # cumulative
- rs <- sapply(0:10, function(xlagrange) {
- if(xlagrange==0) {
- form <- formula("y ~ x")
- } else {
- form <- formula(paste("y ~ x",paste(sapply(1:xlagrange , function(t) { paste("+I(lagpad(sss$y,",t,"))",sep="") }),collapse="")))
- }
- summary(lm(form,data=sss))$r.squared
- })
- plot(rs)
- # non cumulative (autocorrelation)
- rs <- sapply(0:10, function(xlagrange) {
- if(xlagrange==0) {
- form <- formula("y ~ x")
- } else {
- form <- formula(paste("y ~ x",paste(sapply(xlagrange:xlagrange , function(t) { paste("+I(lagpad(sss$y,",t,"))",sep="") }),collapse="")))
- }
- summary(lm(form,data=sss))$r.squared
- })
- plot(rs)
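The same autocorrelation structure can be seen directly with base R's acf() on the residuals of the plain fit, without building the lagged regressions by hand. A sketch, assuming sss is the data.frame from the problem statement:

```r
# Assumes `sss` is the data.frame given in the problem.
# Plots the residual autocorrelation at lags 0..10:
acf(residuals(lm(y ~ x, data = sss)), lag.max = 10)
```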
- 9. Collinearity (in class: ISLR problem #14, p. 125)
- 10. Look for some interesting data! Bring it to class Monday.