Advertisement
Guest User

demonstration of various ways to deal with measurement error

a guest
Aug 11th, 2015
333
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
  1. ## Create a simulated dataset with known parameters, and then run a ML multilevel model, a ML SEM,
  2. ## and a Bayesian multilevel model; with the last, calculate Expected Value of Sample Information (EVSI):
  3.  
  4. ## SIMULATE
  5. set.seed(2015-08-11)
  6. ## "There is a variable X, x belongs to [0, 100]."
  7. toplevel <- rnorm(n=1, 50, 25)
  8. ## "There are n ways of measuring it, among them A and B are widely used."
  9. ## "For any given measurer, the difference between x(A) and x(B) can be up to 20 points."
  10. A <- toplevel + runif(1, min=-10, max=10)
  11. B <- toplevel + runif(1, min=-10, max=10)
  12. c(toplevel, A, B)
  13. # [1] 63.85938385 55.43608379 59.42333264
  14. ### the true level of X we wish to recover is '63.85'
  15.  
  16. ## "Between two any measurers, x(A)1 and x(A)2 can differ on average 10 points, likewise with B."
  17. ### let's imagine 10 hypothetical points are sample using method A and method B
  18. ### assume 'differ on average 10 points' here means something like 'the standard deviation is 10'
  19. A_1 <- rnorm(n=10, mean=A, sd=10)
  20. B_1 <- rnorm(n=10, mean=B, sd=10)
  21.  
  22. data <- rbind(data.frame(Measurement="A", Y=A_1), data.frame(Measurement="B", Y=B_1)); data
  23. #    Measurement           Y
  24. # 1            A 56.33870025
  25. # 2            A 69.07267213
  26. # 3            A 40.36889573
  27. # 4            A 48.67289213
  28. # 5            A 79.92622603
  29. # 6            A 62.86919410
  30. # 7            A 53.12953462
  31. # 8            A 66.58894990
  32. # 9            A 47.86296948
  33. # 10           A 60.72416003
  34. # 11           B 68.60203507
  35. # 12           B 58.24702007
  36. # 13           B 45.47895879
  37. # 14           B 63.45308935
  38. # 15           B 52.27724328
  39. # 16           B 56.89783535
  40. # 17           B 55.93598486
  41. # 18           B 59.28162022
  42. # 19           B 70.92341777
  43. # 20           B 49.51360373
  44.  
  45. ## MLM
  46.  
  47. ## multi-level model approach:
  48. library(lme4)
  49. mlm <- lmer(Y ~ (1|Measurement), data=data); summary(mlm)
  50. # Random effects:
  51. #  Groups      Name        Variance Std.Dev.
  52. #  Measurement (Intercept)  0.0000  0.000000
  53. #  Residual                95.3333  9.763877
  54. # Number of obs: 20, groups:  Measurement, 2
  55. #
  56. # Fixed effects:
  57. #              Estimate Std. Error  t value
  58. # (Intercept) 58.308250   2.183269 26.70685
  59. confint(mlm)
  60. #                    2.5 %       97.5 %
  61. # .sig01       0.000000000  7.446867736
  62. # .sigma       7.185811525 13.444112087
  63. # (Intercept) 53.402531768 63.213970887
  64.  
  65. ## So we estimate X at 58.3 but it's not inside our confidence interval with such little data. Bad luck?
  66.  
  67. ## SEM
  68.  
  69. library(lavaan)
  70. X.model <- '        X =~ B + A
  71.                    A =~ a
  72.                    B =~ b'
  73. X.fit <- sem(model = X.model, meanstructure = TRUE, data = data2)
  74. summary(X.fit)
  75. # ...                   Estimate  Std.err  Z-value  P(>|z|)
  76. # Latent variables:
  77. #   X =~
  78. #     B                 1.000
  79. #     A              7619.504
  80. #   A =~
  81. #     a                 1.000
  82. #   B =~
  83. #     b                 1.000
  84. #
  85. # Intercepts:
  86. #     a                58.555
  87. #     b                58.061
  88. #     X                 0.000
  89. #     A                 0.000
  90. #     B                 0.000
  91. ## Well, that didn't work well - explodes, unfortunately. Probably still not enough data.
  92.  
  93. ## MLM (Bayesian)
  94.  
  95. library(R2jags)
  96. ## rough attempt at writing down an explicit multilevel model which
  97. ## respects the mentioned priors about errors being reasonably small:
  98. model <- function() {
  99.   grand.mean ~ dunif(0,100)
  100.  
  101.   delta.between.group ~ dunif(0, 10)
  102.  
  103.   sigma.between.group ~ dunif(0, 100)
  104.   tau.between.group <- pow(sigma.between.group, -2)
  105.  
  106.   for(j in 1:K){
  107.    # let's say the group-level differences are also normally-distributed:
  108.    group.delta[j] ~ dnorm(delta.between.group, tau.between.group)
  109.    # and each group also has its own standard-deviation, potentially different from the others':
  110.    group.within.sigma[j] ~ dunif(0, 20)
  111.    group.within.tau[j] <- pow(group.within.sigma[j], -2)
  112.  
  113.    # save the net combo for convenience & interpretability:
  114.    group.mean[j] <- grand.mean + group.delta[j]
  115.   }
  116.  
  117.   for (i in 1:N) {
  118.    # each individual observation is from the grand-mean + group-offset, then normally distributed:
  119.    Y[i] ~ dnorm(grand.mean + group.delta[Group[i]], group.within.tau[Group[i]])
  120.   }
  121.  
  122.   }
  123. jagsData <- list(N=nrow(data), Y=data$Y, K=length(levels(data$Measurement)),
  124.              Group=data$Measurement)
  125. params <- c("grand.mean","delta.between.group", "sigma.between.group", "group.delta", "group.mean",
  126.             "group.within.sigma")
  127. k1 <- jags(data=jagsData, parameters.to.save=params, inits=NULL, model.file=model); k1
  128. # ...                      mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
  129. # delta.between.group     4.971   2.945   0.221   2.353   4.967   7.594   9.791 1.008   260
  130. # grand.mean             52.477  11.321  23.453  47.914  53.280  58.246  74.080 1.220    20
  131. # group.delta[1]          6.017  11.391 -16.095   0.448   5.316  10.059  34.792 1.152    21
  132. # group.delta[2]          5.662  11.318 -15.836   0.054   5.009  10.107  33.548 1.139    21
  133. # group.mean[1]          58.494   3.765  50.973  56.188  58.459  60.838  66.072 1.001  3000
  134. # group.mean[2]          58.139   2.857  52.687  56.366  58.098  59.851  63.999 1.003   920
  135. # group.within.sigma[1]  12.801   2.766   8.241  10.700  12.446  14.641  18.707 1.002  1100
  136. # group.within.sigma[2]   9.274   2.500   5.688   7.475   8.834  10.539  15.700 1.002  1600
  137. # sigma.between.group    18.031  21.159   0.553   3.793   9.359  23.972  82.604 1.006  1700
  138. # deviance              149.684   2.877 145.953 147.527 149.081 151.213 156.933 1.001  3000
  139.  
  140. ## VOI
  141.  
  142. posteriorXs <- k1$BUGSoutput$sims.list[["grand.mean"]]
  143. MSE <- function(x1, x2) { (x2 - x1)^2 }
  144. lossFunction <- function(x, predictions) { mean(sapply(predictions, function(x2) { MSE(x, x2)}))}
  145. ## our hypothetical mean-squared loss if we predicted, say, X=60:
  146. lossFunction(60, posteriorXs)
  147. # [1] 184.7087612
  148. ## of the possible values for X, 1-100, what value of X minimizes our squared error loss?
  149. losses <- sapply(c(1:100), function (n) { lossFunction(n, posteriorXs);})
  150. which.min(losses)
  151. # [1] 52
  152. ## 52 also equals the mean estimate of X, which is good since it's well known that the mean is what minimizes
  153. ## the loss when the loss is squared-error so it suggests that I have not screwed up the definitions
  154. losses[52]
  155. [1] 128.3478462
  156.  
  157. ## to calculate EVSI, we repeatedly simulate a few hundred times the existence of a hypothetical 'C' measurement
  158. ## and draw n samples from it;
  159. ## then we add the C data to our existing A & B data; run our Bayesian multilevel model again on the bigger dataset;,
  160. ## calculate what the new loss is, and compare it to the old loss to see how much the new data
  161. ## reduced the loss/mean-squared-error.
  162. ## Done for each possible n (here, 1-30) and averaged out, this tells us how much 1 additional datapoint is worth,
  163. ## 2 additional datapoints, 3 additional datapoints, etc.
  164. sampleValues <- NULL
  165. for (i in seq(from=1, to=30)) {
  166.  
  167.     evsis <- replicate(500, {
  168.         n <- i
  169.    
  170.         C <- toplevel + runif(1, min=-10, max=10)
  171.         C_1 <- rnorm(n=n, mean=C, sd=10)
  172.         ## all as before, more or less:
  173.         newData <- rbind(data, data.frame(Measurement="C", Y=C_1))
  174.    
  175.         jagsData <- list(N=nrow(newData), Y=newData$Y, K=length(levels(newData$Measurement)),
  176.                          Group=newData$Measurement)
  177.         params <- c("grand.mean","delta.between.group", "sigma.between.group", "group.delta", "group.mean",
  178.                     "group.within.sigma")
  179.         jEVSI <- jags(data=jagsData, parameters.to.save=params, inits=NULL, model.file=model)
  180.    
  181.         posteriorTimesEVSI <- jEVSI$BUGSoutput$sims.list[["grand.mean"]]
  182.         lossesEVSI <- sapply(c(1:100), function (n) { lossFunction(n, posteriorTimesEVSI);})
  183.  
  184.         oldOptimum <- 128.3478462 # losses[52]
  185.         newOptimum <- losses[which.min(lossesEVSI)]
  186.         EVSI <- newOptimum - oldOptimum
  187.         return(EVSI)
  188.         }
  189.         )
  190.    
  191.     print(i)
  192.    
  193.     print(mean(evsis))
  194.     sampleValues[i] <- mean(evsis)
  195. }
  196. sampleValues
  197. #  [1] 13.86568780 11.07101087 14.15645538 13.05296681 11.98902668 13.86866619 13.65059093 14.05991443
  198. #  [9] 14.80018511 16.36944874 15.47624541 15.64710237 15.74060632 14.79901214 13.36776390 15.35179426
  199. # [17] 14.31603459 13.70914727 17.20433606 15.89925289 16.35350861 15.09886204 16.30680175 16.27032067
  200. # [25] 16.30418553 18.84776433 17.86881713 16.65973397 17.04451609 19.17173439
  201.  
  202. ## As expected, the gain in reducing MSE continues increasing as data comes in but with diminishing returns;
  203. ## this is probably because in a multilevel model like this, you aren't using the _n_ datapoints to estimate X
  204. ## directly so much as you are using them to estimate a much smaller number of latent variables, which are then
  205. ## the _n_ used to estimate X. So instead of getting hyperprecise estimates of A/B/C, you need to sample from additional
  206. ## groups D/E/F/... Trying to improve your estimate of X by measuring A/B/C many times is like trying to estimate
  207. ## IQ precisely by administering a WM test a hundred times.
  208.  
  209. ## If we wanted to compare with alternatives like instead sampling n data points from C and a D, it's easy to modify
  210. ## the EVSI loop to do so: generate `D <- toplevel + runif(1, min=-10, max=10); D_1 <- rnorm(n=n, mean=D, sd=10)`
  211. ## and now `rbind` D_1 in as well. At a guess, after 5-10 samples from the current group, estimates of X will be improved more
  212. ## by then sampling from a new group.
  213.  
  214. ## Or the loss function could be made more realistic. It's unlikely one is paid by MSE, and if one adds in how much
  215. ## money each sample costs, with a realistic loss function, one could decide exactly how much data is optimal to collect.
  216.  
  217. ## To very precisely estimate X, when our measurements are needed to measure at least 3 latent variables,
  218. ## requires much more data than usual.
  219.  
  220. ## In general, we can see the drawbacks and benefits of each approach. A canned MLM
  221. ## is very fast to write but doesn't let us include prior information or easily run
  222. ## additional analyses like how much additional samples are worth. SEM works poorly
  223. ## on small samples but is still easy to write in if we have more complicated
  224. ## models of measurement error. A full-blown modeling language like JAGS is quite
  225. ## difficult to write in and MCMC is slower than other approaches but handles small
  226. ## samples without any errors or problems and offers maximal flexibility in using
  227. ## the known prior information and then doing decision-theoretic stuff. Overall for
  228. ## this problem, I think JAGS worked out best, but possibly I wasn't using LAVAAN
  229. ## right and that's why SEM didn't seem to work well.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement