Advertisement
Guest User

Untitled

a guest
Feb 25th, 2020
88
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 16.10 KB | None | 0 0
  1. ---
  2. title: "Statistical Computing Assignment 1"
  3. author: "Ruth Turpie s1643103"
  4. date: "22/02/2020"
  5. output: pdf_document
  6. ---
  7.  
  8. ```{r setup, include=FALSE}
  9. knitr::opts_chunk$set(echo = TRUE)
  10. # importing needed libraries
  11. library(ggplot2)
  12. library(readr)
  13. suppressPackageStartupMessages(library(knitr))
  14. filament <- read_csv("filament.csv")
  15.  
  16. ```
  17. Question 1
  18.  
  19. In Figure 1.1 the CAD vs Actual Weight Data has been plotted with matching colours for each material. The black line represents the line y = x, which we would expect to pass through all the points if the CAD weight was equal to the Actual weight. However in the plot the data points lie largely above y = x, hence it appears that the actual weight will tend to be greater than the CAD weight.
  20.  
  21. ```{r}
  22. # attaching the data base to allow the objects inside to be accessed easily
  23. attach(filament)
  24.  
  25. # also seperating filament data base by class for easier access later
  26. filament_obs <- subset(filament, Class == "obs")
  27. filament_test <- subset(filament, Class == "test")
  28.  
  29. # a set of colours that will match up with the colours in the material data base column
  30. # used throughout this report in later plots
  31. # note magenta is quite light to help differentiate it from the neon pink
  32. material_colour <- c("black", "#8bb565", "#f693bf", "#01aae1", "#e03fd8", "red")
  33.  
  34. # using gg plot to plot CAD vs Actual Weight
  35. ggplot(Data = filament) +
  36. geom_point(aes(x = CAD_Weight, y = Actual_Weight, col = Material)) +
  37. geom_abline(slope = 1, size = 0.25) + # adds line y = x
  38. scale_color_manual(values = material_colour) +
  39. ggtitle("Figure 1.1. CAD Weight vs Actual Weight") +
  40. labs(x = "CAD Weight, grams", y = "Actual Weight, grams") +
  41. theme(plot.title = element_text(hjust = 0.5))
  42. ```
  43.  
  44. Question 2
  45.  
  46. Defining a function model estimate that takes the parameters formulas, data, and
  47. response, where formulas and data have the same interpretation as the
  48. input arguments to model Z, and response is a character/string variable
  49. naming the response (measured outcome) variable column of the data. The function estimates the model by numerical optimisation, and returns a list with three named elements: theta, formulas, and Sigma theta,
  50. containing the information needed to call the model predict function.
  51.  
  52.  
  53.  
  54. ```{r}
  55.  
  56. # model_Z is just from CWA2020code.r
  57. model_Z <- function(formulas, data) {
  58. ZE <- model.matrix(formulas$E, data = data)
  59. ZV <- model.matrix(formulas$V, data = data)
  60. list(ZE = cbind(ZE, ZV * 0), ZV = cbind(ZE * 0, ZV))
  61. }
  62.  
  63. # neg_log_lik is just from CWA2020code.r
  64. neg_log_lik <- function(theta, Z, y) {
  65. -sum(dnorm(y, mean = Z$ZE %*% theta, sd = exp(Z$ZV %*% theta)^0.5, log = TRUE))
  66. }
  67.  
  68. # model_estimate() estimates a model using numerical optimisation using BFGS method
  69.  
  70. # Input:
  71. # response: the string variable naming the response variable column of the data
  72. # formulas: a list of formulas, of the kind needed by model_Z()
  73. # data: covariate information, in whatever format the model_Z() wants it.
  74. # Output:
  75. # A list with entries 'formulas', 'theta_hat', 'sigma_theta',
  76. # as required by the model predict function
  77. model_estimate <- function(formulas, data, response) {
  78. Z <- model_Z(formulas, data)
  79. opt <- optim(par = rep(0,ncol(Z$ZE)),
  80. fn = neg_log_lik, Z = Z, y = data[[response]],
  81. method = "BFGS",
  82. control = list(maxit = 5000),
  83. hessian = TRUE)
  84. theta_hat <- opt$par
  85. Sigma_theta <- solve(opt$hessian)
  86. return(list("formulas" = formulas, "theta_hat" = theta_hat, "Sigma_theta" = Sigma_theta))
  87. }
  88.  
  89. ```
  90.  
  91. Question 3
  92.  
  93. Estimating a model for Actual Weight, with an intercept and covariate
  94. CAD Weight for the model expectations, and an intercept for the
  95. model (log-)variances using the observed data. This model will from this point be known as "Model 1".
  96.  
  97. ```{r}
  98.  
  99. # E is the formula for the expectation with an intercept and covariate
  100. # V is the formula for the varience with an intercept only
  101. formulas_model1 <- c("E" = ~ 1 + CAD_Weight,"V" = ~ 1)
  102.  
  103. # estimating the model for actual weight using the observed data
  104. estimate_model1 <-model_estimate(formulas_model1, subset(filament, Class == 'obs'), "Actual_Weight")
  105. ```
  106.  
  107. Question 4
  108.  
  109. Figure 4.1 shows the test data together with the prediction intervals for the estimated Model 1. This is to analyse how good the estimated model ,which was generated using observed data, is at predicting correctly the test data. A value of alpha = 0.1 has been used to make fairer comparisons in later models when alpha = 0.1.
  110.  
  111. This figure appears to show that the prediction interval in general seems to be fairly accurate (it at least follows path slightly above y = x implying that most actual weights are slightly bigger than their CAD weights as desired). However the test data points at high CAD weights, or at least materials with high CAD weights, lie significantly far outwith the prediction interval, suggesting that this estimated model is not good at predicting test data for high CAD weights.
  112.  
  113. ```{r}
  114. # model_predict is just from CWA2020code.r
  115. model_predict <- function(theta, formulas, Sigma_theta = NULL,
  116. newdata,
  117. type = c("expectation", "log-variance", "observation"),
  118. alpha = 0.05, df = Inf,
  119. nonlinear.correction = TRUE) {
  120. type <- match.arg(type)
  121. Z <- model_Z(formulas, data = newdata)
  122. fit_E <- Z$ZE %*% theta
  123. fit_V <- Z$ZV %*% theta
  124. if (is.null(Sigma_theta)) {
  125. ZE_var <- 0
  126. ZV_var <- 0
  127. } else {
  128. ZE_var <- rowSums(Z$ZE * (Z$ZE %*% Sigma_theta))
  129. ZV_var <- rowSums(Z$ZV * (Z$ZV %*% Sigma_theta))
  130. }
  131. if (type == "expectation") {
  132. fit <- fit_E
  133. sigma <- ZE_var^0.5
  134. } else if (type == "log-variance") {
  135. fit <- fit_V
  136. sigma <- ZV_var^0.5
  137. } else if (type == "observation") { ## observation predictions
  138. fit <- fit_E
  139. sigma <- (exp(fit_V + ZV_var / 2 * nonlinear.correction) + ZE_var)^0.5
  140. }
  141. q <- qt(1 - alpha / 2, df = df)
  142. lwr <- fit - q * sigma
  143. upr <- fit + q * sigma
  144. data.frame(mu = fit, sigma, lwr, upr)
  145. }
  146.  
  147. predict_model1 <- model_predict(estimate_model1$theta_hat, estimate_model1$formulas, Sigma_theta = estimate_model1$Sigma_theta, subset(filament, Class == 'test'), alpha = 0.1, type = "observation")
  148.  
  149. ggplot(cbind(filament_test,predict_model1)) +
  150. geom_ribbon(
  151. aes(CAD_Weight, ymin = lwr, ymax = upr),
  152. alpha = 0.5, fill = "#72edd0") +
  153. geom_point(aes(y = Actual_Weight, x = CAD_Weight,col = Material)) +
  154. ggtitle("Figure 4.1: Prediction Intervals for Model 1 with Test Data Points") +
  155. scale_color_manual(values = material_colour) +
  156. labs(x = "CAD Weight, grams", y = "Actual Weight, grams") +
  157. geom_abline(slope = 1, size = 0.25) + # adds line y = x
  158. theme(plot.title = element_text(hjust = 0.5))
  159.  
  160.  
  161. ```
  162.  
  163.  
  164. Question 5
  165.  
  166. Estimating a model for Actual Weight, with an intercept and covariate
  167. CAD Weight for the model expectations, and an also an intercept and covariate CAD Weight for the
  168. model (log-)variances using the observed data. This model will from this point be known as "Model 2".
  169.  
  170. ```{r}
  171.  
  172. # E is the formula for the expectation with an intercept and covariate
  173. # V is the formula for the varience with an intercept and covariate
  174. formulas_model2 <- c("E" = ~ 1 + CAD_Weight,"V" = ~ 1 + CAD_Weight)
  175.  
  176. # estimating the model for actual weight using the observed data
  177. estimate_model2 <-model_estimate(formulas_model2, filament, "Actual_Weight")
  178.  
  179. ```
  180. Question 6
  181.  
  182. Figure 6.1 shows the test data together with the prediction intervals for the estimated Model 2, and will be analysed in a similiar fashion to that of Model 1 in Question 4. A value of alpha = 0.1 has once again been used to make fairer comparisons in later models when alpha = 0.1.
  183.  
  184. This model takes into account the covariate CAD Weight for the model (log-)variances, which appears to widen the prediction interval for higher CAD weights and slim it for low CAD weights. This seems to predict the test data quite well, as the points lie far more within the ribbon representing the prediction interval. At lower CAD weights, the test points do not seem to deviate far from y = x and are "easier" to predict, meaning that the prediction interval can be narrow, whilst the opposite is true for the higher CAD weights. This analysis alone appears to show Model 2 may be a better model than Model 1.
  185.  
  186. ```{r}
  187.  
  188. predict_model2 <- model_predict(estimate_model2$theta_hat, estimate_model2$formulas, Sigma_theta = estimate_model2$Sigma_theta, subset(filament, Class == 'test'), alpha = 0.1, type = "observation")
  189.  
  190. ggplot(cbind(filament_test,predict_model2)) +
  191. geom_ribbon(
  192. aes(CAD_Weight, ymin = lwr, ymax = upr),
  193. alpha = 0.5, fill = "#72edd0") +
  194. geom_point(aes(y = Actual_Weight, x = CAD_Weight,col = Material)) +
  195. ggtitle("Figure 6.1: Prediction Intervals for Model 2 with Test Data Points") +
  196. scale_color_manual(values = material_colour) +
  197. labs(x = "CAD Weight, grams", y = "Actual Weight, grams") +
  198. geom_abline(slope = 1, size = 0.25) + # adds line y = x
  199. theme(plot.title = element_text(hjust = 0.5))
  200.  
  201. ```
  202.  
  203.  
  204. Question 7
  205.  
  206. ```{r}
  207.  
  208. # The following 3 functions are just the score functions from CWA2020
  209. # They compute the Squared Error, Dawid-Sebastiani, and Interval scores respectively
  210.  
  211. score_se <- function(pred, y) {
  212. (y - pred$mu)^2
  213. }
  214.  
  215. score_ds <- function(pred, y) {
  216. ((y - pred$mu) / pred$sigma)^2 + 2 * log(pred$sigma)
  217. }
  218.  
  219. score_interval <- function(pred, y, alpha = 0.1) {
  220. L <- pred$lwr
  221. U <- pred$upr
  222. U - L + 2 / alpha * (pmax(0, L - y) + pmax(0, y - U))
  223. }
  224.  
  225. # model_scores returns a list of score value vectors for a given model prediction
  226. # Input:
  227. # pred: data.frame with (at least) columns 'mu' and 'sigma'
  228. # y: data vector
  229. # Output:
  230. # A list of score value vectors
  231. model_scores <- function(pred, y){
  232. return(list("score_se" = score_se(pred,y), "score_ds" = score_ds(pred,y), "score_interval" = score_interval(pred,y)))
  233. }
  234.  
  235. # finding the scores for models 1 and 2
  236. scores_model1 <-model_scores(predict_model1, filament_test$Actual_Weight)
  237. scores_model2 <-model_scores(predict_model2, filament_test$Actual_Weight)
  238. print(scores_model1[[1]])
  239. print(scores_model2[[1]])
  240.  
  241. # the pairwise differences between the scores for Model 2 - Model 1
  242. scores_diff2_1 <- mapply('-',scores_model2,scores_model1,SIMPLIFY=FALSE)
  243. print(scores_diff2_1[[1]])
  244.  
  245. df <- data.frame(t(data.frame(matrix(unlist(scores_diff2_1), nrow=length(scores_diff2_1), byrow=T),stringsAsFactors=FALSE)))
  246. names(df)[1] <- "score_se"
  247. names(df)[2] <- "score_ds"
  248. names(df)[3] <- "score_interval"
  249.  
  250. # kable(df, caption = "Hello")
  251.  
  252. ggplot(df) +
  253. stat_ecdf(geom = "step", aes(df$score_se, colour = "Squared Error score")) +
  254. stat_ecdf(geom = "step", aes(df$score_ds, colour = "Dawid-Sebastiani Scores")) +
  255. stat_ecdf(geom = "step", aes(df$score_interval, colour = "Interval Score")) +
  256. scale_colour_manual("",
  257. breaks = c("Squared Error score", "Dawid-Sebastiani Scores", "Interval Score"),
  258. values = c("#32a852", "#326eab", "#b0337e"))
  259.  
  260. ```
  261. Question 8
  262.  
  263. ```{r}
  264.  
  265. formulas_model3 <- c("E" = ~ 1 + Material:CAD_Weight,"V" = ~ 1 + CAD_Weight)
  266.  
  267. estimate_model3 <-model_estimate(formulas_model3, subset(filament, Class == 'obs'), "Actual_Weight")
  268.  
  269. predict_model3 <- model_predict(estimate_model3$theta_hat, estimate_model3$formulas, Sigma_theta = estimate_model3$Sigma_theta, subset(filament, Class == 'test'), alpha = 0.1, type = "observation")
  270.  
  271. scores_model3 <-model_scores(predict_model3, filament_test$Actual_Weight)
  272.  
  273. scores_diff3_1 <- mapply('-',scores_model3,scores_model1,SIMPLIFY=FALSE)
  274.  
  275. scores_diff3_2 <- mapply('-',scores_model3,scores_model2,SIMPLIFY=FALSE)
  276.  
  277. df <- data.frame(t(data.frame(matrix(unlist(scores_diff3_1), nrow=length(scores_diff3_1), byrow=T),stringsAsFactors=FALSE)))
  278. names(df)[1] <- "score_se"
  279. names(df)[2] <- "score_ds"
  280. names(df)[3] <- "score_interval"
  281.  
  282. # kable(df, caption = "Hello")
  283.  
  284. ggplot(df) +
  285. stat_ecdf(geom = "step", aes(df$score_se, colour = "Squared Error score")) +
  286. stat_ecdf(geom = "step", aes(df$score_ds, colour = "Dawid-Sebastiani Scores")) +
  287. stat_ecdf(geom = "step", aes(df$score_interval, colour = "Interval Score")) +
  288. scale_colour_manual("",
  289. breaks = c("Squared Error score", "Dawid-Sebastiani Scores", "Interval Score"),
  290. values = c("#32a852", "#326eab", "#b0337e"))
  291.  
  292. df <- data.frame(t(data.frame(matrix(unlist(scores_diff3_2), nrow=length(scores_diff3_1), byrow=T),stringsAsFactors=FALSE)))
  293. names(df)[1] <- "score_se"
  294. names(df)[2] <- "score_ds"
  295. names(df)[3] <- "score_interval"
  296.  
  297. # kable(df, caption = "Hello")
  298.  
  299. ggplot(df) +
  300. stat_ecdf(geom = "step", aes(df$score_se, colour = "Squared Error score")) +
  301. stat_ecdf(geom = "step", aes(df$score_ds, colour = "Dawid-Sebastiani Scores")) +
  302. stat_ecdf(geom = "step", aes(df$score_interval, colour = "Interval Score")) +
  303. scale_colour_manual("",
  304. breaks = c("Squared Error score", "Dawid-Sebastiani Scores", "Interval Score"),
  305. values = c("#32a852", "#326eab", "#b0337e"))
  306.  
  307.  
  308. ```
  309.  
  310. Question 9
  311.  
  312. First, the code computes the probabilities from each of the models that Actual Weight >= 1.1*CAD Weight for each test data point. Figure 9.1 shows a graph of the CAD Weight against this probability for both Model 1 and Model 2. Figure 9.2 shows the same except for Model 3, and shows the impact of the material colours on these trends. With blue, green and black, the larger the CAD Weight the less likely for the Actual Weight to be 1.1 times bigger, while the opposite appears to be true for red. There is not very much data to show trends in colours for the other data points. Alternatively, it can be looked at without the lense of material, and the trend appears to be vaguely parabolic with a minimum at around CAD Weight = 37.5. This is suggesting that CAD Weight increasing until
  313.  
  314. ```{r}
  315.  
  316. # using Model 1 compute probabilities for brier score function
  317.  
  318. prob_model1<- pnorm(1.1*filament_test$CAD_Weight , mean = predict_model1$mu, sd = predict_model1$sigma, lower.tail = FALSE)
  319.  
  320. # using Model 2 compute probabilities for brier score function
  321.  
  322. prob_model2 <- pnorm(1.1*filament_test$CAD_Weight , mean = predict_model2$mu, sd = predict_model2$sigma, lower.tail = FALSE)
  323.  
  324. # using Model 3 compute probabilities for brier score function
  325.  
  326. prob_model3 <- pnorm(1.1*filament_test$CAD_Weight , mean = predict_model3$mu, sd = predict_model3$sigma, lower.tail = FALSE)
  327.  
  328. # creating a list of all the probabilities for different models
  329. probs <- data.frame(list("prob_model1" = prob_model1, "prob_model2" = prob_model2, "prob_model2" = prob_model2))
  330.  
  331. # plotting the probabilities for Model 1 and 2
  332. ggplot(cbind(probs,filament_test)) +
  333. geom_line(aes(x = filament_test$CAD_Weight, y =probs[[1]])) +
  334. geom_line(aes(x = filament_test$CAD_Weight, y =probs[[2]]))
  335.  
  336. # plotting the probabilities for Model 3
  337. ggplot(cbind(filament_test, probs)) +
  338. geom_line(aes(x = filament_test$CAD_Weight, y =probs[[3]], col = Material)) +
  339. geom_point(aes(x = filament_test$CAD_Weight, y =probs[[3]], col = Material)) +
  340. scale_color_manual(values = material_colour)
  341.  
  342. binary <- rep(0, length(subset(CAD_Weight, Class == 'test')))
  343. for (i in 1:length(binary)) {
  344. if (subset(Actual_Weight, Class == 'test')[[i]] >= 1.1*subset(CAD_Weight, Class == 'test')[[i]]){
  345. binary[[i]] <- 1
  346. }
  347. }
  348.  
  349. binary <- (subset(Actual_Weight, Class == 'test') >= 1.1*subset(CAD_Weight, Class == 'test'))
  350.  
  351. print(binary)
  352. print(df)
  353.  
  354. brier <- function(p, z){
  355. return((z - p)^2)
  356. }
  357.  
  358. brierq3 <- brier(probs$q3prob, binary)
  359. brierq5 <- brier(probs$q5prob, binary)
  360. brierq8 <- brier(probs$q8prob, binary)
  361.  
  362. ```
  363.  
  364.  
  365. Question 10
  366.  
  367. ```{r}
  368. nLL <- function(theta, y) {
  369. -sum(dcauchy(y, location = theta[[1]], scale = exp(theta[[2]]), log = TRUE))
  370. }
  371.  
  372. test_data <- rcauchy(10000, location = 2, scale =5)
  373.  
  374. binary <- test_data <= 0
  375.  
  376. optq10 <- function(N){
  377. opt<- optim(par = c(0,0),
  378. fn = nLL, y = rcauchy(N, location = 2, scale = 5),
  379. method = "BFGS",
  380. control = list(maxit = 5000))
  381. return(list("location" = opt$par[[1]], "scale" = exp(opt$par[[2]])))
  382. }
  383.  
  384. ```
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement