## Create a simulated dataset with known parameters, and then run a ML multilevel model, a ML SEM,
## and a Bayesian multilevel model; with the last, calculate Expected Value of Sample Information (EVSI):

## SIMULATE
set.seed(2015-08-11)
## "There is a variable X, x belongs to [0, 100]."
toplevel <- rnorm(n=1, 50, 25)
## "There are n ways of measuring it, among them A and B are widely used."
## "For any given measurer, the difference between x(A) and x(B) can be up to 20 points."
A <- toplevel + runif(1, min=-10, max=10)
B <- toplevel + runif(1, min=-10, max=10)
c(toplevel, A, B)
# [1] 63.85938385 55.43608379 59.42333264
### the true level of X we wish to recover is '63.85'

## "Between two any measurers, x(A)1 and x(A)2 can differ on average 10 points, likewise with B."
17
### let's imagine 10 hypothetical points are sample using method A and method B
18
### assume 'differ on average 10 points' here means something like 'the standard deviation is 10'
19
A_1 <- rnorm(n=10, mean=A, sd=10)
20
B_1 <- rnorm(n=10, mean=B, sd=10)
21
22
data <- rbind(data.frame(Measurement="A", Y=A_1), data.frame(Measurement="B", Y=B_1)); data
#    Measurement           Y
# 1            A 56.33870025
# 2            A 69.07267213
# 3            A 40.36889573
# 4            A 48.67289213
# 5            A 79.92622603
# 6            A 62.86919410
# 7            A 53.12953462
# 8            A 66.58894990
# 9            A 47.86296948
# 10           A 60.72416003
# 11           B 68.60203507
# 12           B 58.24702007
# 13           B 45.47895879
# 14           B 63.45308935
# 15           B 52.27724328
# 16           B 56.89783535
# 17           B 55.93598486
# 18           B 59.28162022
# 19           B 70.92341777
# 20           B 49.51360373

## MLM

## multi-level model approach:
library(lme4)
mlm <- lmer(Y ~ (1|Measurement), data=data); summary(mlm)
# Random effects:
#  Groups      Name        Variance Std.Dev.
#  Measurement (Intercept)  0.0000  0.000000
#  Residual                95.3333  9.763877
# Number of obs: 20, groups:  Measurement, 2
#
# Fixed effects:
#              Estimate Std. Error  t value
# (Intercept) 58.308250   2.183269 26.70685
confint(mlm)
#                    2.5 %       97.5 %
# .sig01       0.000000000  7.446867736
# .sigma       7.185811525 13.444112087
# (Intercept) 53.402531768 63.213970887

## So we estimate X at 58.3, but the true value (63.85) is not inside our confidence interval with so little data. Bad luck?

## SEM

library(lavaan)
## the original jumps straight to a `data2`; presumably the same data reshaped to wide format,
## one column per indicator (reconstructed here):
data2 <- data.frame(a=A_1, b=B_1)
X.model <- '        X =~ B + A
                    A =~ a
                    B =~ b'
X.fit <- sem(model = X.model, meanstructure = TRUE, data = data2)
summary(X.fit)
# ...                   Estimate  Std.err  Z-value  P(>|z|)
# Latent variables:
#   X =~
#     B                 1.000
#     A              7619.504
#   A =~
#     a                 1.000
#   B =~
#     b                 1.000
#
# Intercepts:
#     a                58.555
#     b                58.061
#     X                 0.000
#     A                 0.000
#     B                 0.000
## Well, that didn't work well - the loading estimate explodes, unfortunately. Probably still not enough data.
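
## (a guess at a fix, not in the original: with single-indicator latents and only 2 indicators for X,
## the model above is under-identified, so a simpler tau-equivalent version which fixes both loadings
## to 1 may behave better:)
X.model2 <- ' X =~ 1*a + 1*b '
X.fit2 <- sem(model = X.model2, meanstructure = TRUE, data = data2)
summary(X.fit2)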

## MLM (Bayesian)

library(R2jags)
## rough attempt at writing down an explicit multilevel model which
## respects the mentioned priors about errors being reasonably small:
model <- function() {
  grand.mean ~ dunif(0,100)

  delta.between.group ~ dunif(0, 10)

  sigma.between.group ~ dunif(0, 100)
  tau.between.group <- pow(sigma.between.group, -2)

  for(j in 1:K){
   # let's say the group-level differences are also normally-distributed:
   group.delta[j] ~ dnorm(delta.between.group, tau.between.group)
   # and each group also has its own standard-deviation, potentially different from the others':
   group.within.sigma[j] ~ dunif(0, 20)
   group.within.tau[j] <- pow(group.within.sigma[j], -2)

   # save the net combo for convenience & interpretability:
   group.mean[j] <- grand.mean + group.delta[j]
  }

  for (i in 1:N) {
   # each individual observation is from the grand-mean + group-offset, then normally distributed:
   Y[i] ~ dnorm(grand.mean + group.delta[Group[i]], group.within.tau[Group[i]])
  }
}
jagsData <- list(N=nrow(data), Y=data$Y, K=nlevels(factor(data$Measurement)),
             Group=as.numeric(factor(data$Measurement))) # JAGS needs numeric group indices, not strings
params <- c("grand.mean","delta.between.group", "sigma.between.group", "group.delta", "group.mean",
            "group.within.sigma")
k1 <- jags(data=jagsData, parameters.to.save=params, inits=NULL, model.file=model); k1
# ...                      mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
# delta.between.group     4.971   2.945   0.221   2.353   4.967   7.594   9.791 1.008   260
# grand.mean             52.477  11.321  23.453  47.914  53.280  58.246  74.080 1.220    20
# group.delta[1]          6.017  11.391 -16.095   0.448   5.316  10.059  34.792 1.152    21
# group.delta[2]          5.662  11.318 -15.836   0.054   5.009  10.107  33.548 1.139    21
# group.mean[1]          58.494   3.765  50.973  56.188  58.459  60.838  66.072 1.001  3000
# group.mean[2]          58.139   2.857  52.687  56.366  58.098  59.851  63.999 1.003   920
# group.within.sigma[1]  12.801   2.766   8.241  10.700  12.446  14.641  18.707 1.002  1100
# group.within.sigma[2]   9.274   2.500   5.688   7.475   8.834  10.539  15.700 1.002  1600
# sigma.between.group    18.031  21.159   0.553   3.793   9.359  23.972  82.604 1.006  1700
# deviance              149.684   2.877 145.953 147.527 149.081 151.213 156.933 1.001  3000

## VOI
141
142
posteriorXs <- k1$BUGSoutput$sims.list[["grand.mean"]]
143
MSE <- function(x1, x2) { (x2 - x1)^2 }
144
lossFunction <- function(x, predictions) { mean(sapply(predictions, function(x2) { MSE(x, x2)}))}
145
## our hypothetical mean-squared loss if we predicted, say, X=60:
146
lossFunction(60, posteriorXs)
147
# [1] 184.7087612
148
## of the possible values for X, 1-100, what value of X minimizes our squared error loss?
149
losses <- sapply(c(1:100), function (n) { lossFunction(n, posteriorXs);})
150
which.min(losses)
151-
## 52 also equals the mean estimate of X, which is good since it's well known that the mean is what minimizes the loss when the loss is squared-error
151+
152-
## so it suggests that I have not screwed up the definitions
152+
## 52 also equals the mean estimate of X, which is good since it's well known that the mean is what minimizes
153
## the loss when the loss is squared-error so it suggests that I have not screwed up the definitions
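## (quick sanity check, not in the original: the posterior mean should agree)
mean(posteriorXs)
# should be ~52.5, matching the grand.mean row in the JAGS output above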
losses[52]
# [1] 128.3478462

## to calculate EVSI, we repeatedly (a few hundred times) simulate the existence of a hypothetical 'C' measurement
## and draw n samples from it;
## then we add the C data to our existing A & B data, run our Bayesian multilevel model again on the bigger dataset,
## calculate what the new loss is, and compare it to the old loss to see how much the new data
## reduced the loss/mean-squared-error.
## Done for each possible n (here, 1-30) and averaged out, this tells us how much 1 additional datapoint is worth,
## 2 additional datapoints, 3 additional datapoints, etc.
sampleValues <- NULL
for (i in seq(from=1, to=30)) {

    evsis <- replicate(500, {
        n <- i

        C <- toplevel + runif(1, min=-10, max=10)
        C_1 <- rnorm(n=n, mean=C, sd=10)
        ## all as before, more or less:
        newData <- rbind(data, data.frame(Measurement="C", Y=C_1))

        jagsData <- list(N=nrow(newData), Y=newData$Y, K=nlevels(factor(newData$Measurement)),
                         Group=as.numeric(factor(newData$Measurement)))
        params <- c("grand.mean","delta.between.group", "sigma.between.group", "group.delta", "group.mean",
                    "group.within.sigma")
        jEVSI <- jags(data=jagsData, parameters.to.save=params, inits=NULL, model.file=model)

        posteriorTimesEVSI <- jEVSI$BUGSoutput$sims.list[["grand.mean"]]
        lossesEVSI <- sapply(1:100, function (x) { lossFunction(x, posteriorTimesEVSI) })

        oldOptimum <- 128.3478462 # losses[52]
        newOptimum <- losses[which.min(lossesEVSI)]
        EVSI <- newOptimum - oldOptimum
        return(EVSI)
    })

    print(i)
    print(mean(evsis))
    sampleValues[i] <- mean(evsis)
}
sampleValues
#  [1] 13.86568780 11.07101087 14.15645538 13.05296681 11.98902668 13.86866619 13.65059093 14.05991443
#  [9] 14.80018511 16.36944874 15.47624541 15.64710237 15.74060632 14.79901214 13.36776390 15.35179426
# [17] 14.31603459 13.70914727 17.20433606 15.89925289 16.35350861 15.09886204 16.30680175 16.27032067
# [25] 16.30418553 18.84776433 17.86881713 16.65973397 17.04451609 19.17173439

## As expected, the gain in reducing MSE continues increasing as data comes in but with diminishing returns;
## this is probably because in a multilevel model like this, you aren't using the _n_ datapoints to estimate X
## directly so much as you are using them to estimate a much smaller number of latent variables, which then
## serve as the effective sample for estimating X. So instead of getting hyperprecise estimates of A/B/C, you need
## to sample from additional groups D/E/F/... Trying to improve your estimate of X by measuring A/B/C many times is
## like trying to estimate IQ precisely by administering a working-memory test a hundred times.

## If we wanted to compare with alternatives, like instead splitting the n new samples between C and a new group D,
## it's easy to modify the EVSI loop to do so: generate `D <- toplevel + runif(1, min=-10, max=10);
## D_1 <- rnorm(n=n, mean=D, sd=10)` and now `rbind` D_1 in as well (a sketch follows). At a guess, after 5-10
## samples from the current group, estimates of X will be improved more by then sampling from a new group.
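
## (a hypothetical sketch of that modification, not in the original: inside the replicate() call,
## split the n new samples between the current group C and a new group D)
# C <- toplevel + runif(1, min=-10, max=10)
# D <- toplevel + runif(1, min=-10, max=10)
# C_1 <- rnorm(n=ceiling(n/2), mean=C, sd=10)
# D_1 <- rnorm(n=floor(n/2),   mean=D, sd=10)
# newData <- rbind(data, data.frame(Measurement="C", Y=C_1), data.frame(Measurement="D", Y=D_1))
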
## Or the loss function could be made more realistic. It's unlikely one is paid by MSE, and if one adds in how much
## money each sample costs, with a realistic loss function, one could decide exactly how much data it is optimal to collect.
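
## (another hypothetical sketch, not in the original: if each extra sample cost, say, 0.5 units on the same
## scale as the MSE reduction recorded in `sampleValues`, the optimal number of new samples maximizes net gain)
# costPerSample <- 0.5 # assumed price, purely for illustration
# netGain <- sampleValues - costPerSample * (1:30)
# which.max(netGain)
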
## To estimate X very precisely, when our measurements must first estimate at least 3 latent variables,
## requires much more data than usual.

## In general, we can see the drawbacks and benefits of each approach. A canned MLM
## is very fast to write but doesn't let us include prior information or easily run
## additional analyses like estimating how much additional samples are worth. SEM works
## poorly on small samples but is still easy to write in if we have more complicated
## models of measurement error. A full-blown modeling language like JAGS is quite
## difficult to write in, and MCMC is slower than other approaches, but it handles small
## samples without any errors or problems and offers maximal flexibility in using
## the known prior information and then doing decision-theoretic stuff. Overall, for
## this problem, I think JAGS worked out best, but possibly I wasn't using lavaan
## right and that's why SEM didn't seem to work well.