Untitled

a guest
Nov 15th, 2017
The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models; hence, AIC provides a means for model selection.

AIC is founded on information theory: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. In doing so, it deals with the trade-off between the goodness of fit of the model and the complexity of the model.

AIC does not provide a test of a model in the sense of testing a null hypothesis; i.e. AIC can tell nothing about the quality of a model in an absolute sense. If all the candidate models fit poorly, AIC will not give any warning of that.

Definition

Suppose that
  23. we have a statistical model of some data
  24. let L be the maximum value of the
  25. likelihood function for the model let K
  26. bE the number of estimated parameters in
  27. the model then the AIC value of the
  28. model is the following given a set of
  29. candidate models for the data the
  30. preferred model is the one with the
  31. minimum AIC value hence AIC rewards
  32. goodness of fit but it also includes a
  33. penalty that is an increasing function
  34. of the number of estimated parameters
  35. the penalty discourages overfitting AIC
  36. is founded in information theory
Suppose that the data is generated by some unknown process f. We consider two candidate models to represent f: g1 and g2. If we knew f, then we could find the information lost from using g1 to represent f by calculating the Kullback–Leibler divergence, D_KL(f ‖ g1); similarly, the information lost from using g2 to represent f could be found by calculating D_KL(f ‖ g2). We would then choose the candidate model that minimized the information loss. We cannot choose with certainty, because we do not know f. Akaike showed, however, that we can estimate, via AIC, how much more (or less) information is lost by g1 than by g2. The estimate, though, is only valid asymptotically; if the number of data points is small, then some correction is often necessary.
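The idea can be sketched numerically. A minimal illustration (not from the source; the distributions f, g1, and g2 are made up) of choosing between two candidates by Kullback–Leibler divergence, if the true process f were known:

```python
import math

def kl_divergence(f, g):
    """Kullback-Leibler divergence D_KL(f || g) for discrete
    distributions given as lists of probabilities."""
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)

# Made-up "true" process f and two candidate models g1, g2.
f  = [0.5, 0.3, 0.2]
g1 = [0.4, 0.4, 0.2]
g2 = [0.2, 0.2, 0.6]

loss1 = kl_divergence(f, g1)  # information lost using g1 for f
loss2 = kl_divergence(f, g2)  # information lost using g2 for f

# Knowing f, we would keep the candidate with the smaller loss.
best = "g1" if loss1 < loss2 else "g2"
```

In practice f is unknown, which is exactly why AIC's estimate of relative information loss is needed.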
How to apply AIC in practice

To apply AIC in practice, we start with a set of candidate models, and then find the models' corresponding AIC values. There will almost always be information lost due to using a candidate model to represent the true model. We wish to select, from among the candidate models, the model that minimizes the information loss. We cannot choose with certainty, but we can minimize the estimated information loss.
Suppose that there are R candidate models. Denote the AIC values of those models by AIC1, AIC2, ..., AICR. Let AICmin be the minimum of those values. Then exp((AICmin − AICi)/2) can be interpreted as the relative probability that the ith model minimizes the information loss.

As an example, suppose that there are three candidate models, whose AIC values are 100, 102, and 110. Then the second model is exp((100 − 102)/2) = 0.368 times as probable as the first model to minimize the information loss. Similarly, the third model is exp((100 − 110)/2) = 0.007 times as probable as the first model to minimize the information loss. In this example, we would omit the third model from further consideration. We then have three options: (1) gather more data, in the hope that this will allow clearly distinguishing between the first two models; (2) simply conclude that the data is insufficient to support selecting one model from among the first two; (3) take a weighted average of the first two models, with weights 1 and 0.368, respectively, and then do statistical inference based on the weighted multimodel.
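The computation in the example can be sketched as follows; a minimal Python illustration, assuming the definition AIC = 2k − 2 ln(L) and the three AIC values from the text:

```python
import math

def aic(max_loglik, k):
    """AIC = 2k - 2 ln(L), with L the maximized likelihood."""
    return 2 * k - 2 * max_loglik

# The three AIC values from the example in the text.
aic_values = [100.0, 102.0, 110.0]
aic_min = min(aic_values)

# exp((AIC_min - AIC_i)/2): relative probability that model i
# minimizes the (estimated) information loss.
rel_likelihoods = [math.exp((aic_min - a) / 2) for a in aic_values]
```

The first model always gets relative likelihood 1; the others are scaled against it.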
The quantity exp((AICmin − AICi)/2) is known as the relative likelihood of model i.

If all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the likelihood-ratio test. There are, however, important distinctions. In particular, the likelihood-ratio test is valid only for nested models, whereas AIC has no such restriction.

AICc

AICc is AIC with a
correction for finite sample sizes. The formula for AICc depends upon the statistical model. Assuming that the model is univariate, linear, and has normally distributed residuals, the formula for AICc is as follows:

    AICc = AIC + 2k(k + 1)/(n − k − 1)

where n denotes the sample size and k denotes the number of parameters. If the assumption of a univariate linear model with normal residuals does not hold, then the formula for AICc will generally change. Further discussion of the formula, with examples of other assumptions, is given by Burnham & Anderson and by Konishi & Kitagawa; in particular, with other assumptions, bootstrap estimation of the formula is often feasible.

AICc is essentially AIC with a greater penalty for extra parameters. Using AIC, instead of AICc, when n is not many times larger than k, increases the probability of selecting models that have too many parameters, i.e.
of overfitting. The probability of AIC overfitting can be substantial in some cases. Brockwell & Davis advise using AICc as the primary criterion in selecting the orders of an ARMA model for time series. McQuarrie & Tsai ground their high opinion of AICc on extensive simulation work with regression and time series. Burnham & Anderson note that, since AICc converges to AIC as n gets large, AICc rather than AIC should generally be employed.

Note that if all the candidate models have the same k, then AICc and AIC will give identical valuations; hence, there will be no disadvantage in using AIC instead of AICc. Furthermore, if n is many times larger than k, then the correction will be negligible; hence, there will be negligible disadvantage in using AIC instead of AICc.
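The finite-sample correction can be sketched in code; a minimal helper assuming the univariate-linear-with-normal-residuals formula, showing the correction vanishing as n grows:

```python
def aicc(aic_value, n, k):
    """AICc = AIC + 2k(k + 1)/(n - k - 1); assumes a univariate
    linear model with normally distributed residuals."""
    return aic_value + (2 * k * (k + 1)) / (n - k - 1)

small_n = aicc(100.0, n=20, k=3)    # correction = 24/16 = 1.5
large_n = aicc(100.0, n=2000, k=3)  # correction ~ 0.012, negligible
```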
History

The Akaike information criterion was developed by Hirotugu Akaike, originally under the name "an information criterion". It was first announced by Akaike at a 1971 symposium, the proceedings of which were published in 1973. The 1973 publication, though, was only an informal presentation of the concepts; the first formal publication was in a 1974 paper by Akaike. As of October 2014, the 1974 paper had received more than 14,000 citations in the Web of Science, making it the 73rd most-cited research paper of all time.

The initial derivation of AIC relied upon some strong assumptions. Takeuchi (1976) showed that the assumptions could be made much weaker. Takeuchi's work, however, was in Japanese and was not widely known outside Japan for many years.

AICc was originally proposed for linear regression by Sugiura (1978). That instigated the work of Hurvich & Tsai (1989), and several further papers by the same authors, which extended the situations in which AICc could be applied. The work of Hurvich & Tsai contributed to the decision to publish a second edition of the volume by Brockwell & Davis, which is the standard reference for linear time series; the second edition states, "our prime criterion for model selection [among ARMA models] will be the AICc".

The first general exposition of the information-theoretic approach was the volume by Burnham & Anderson. It includes an English presentation of the work of Takeuchi. The volume led to far greater use of AIC, and it now has more than 48,000 citations on Google Scholar.

Akaike
originally called his approach an "entropy maximization principle", because the approach is founded on the concept of entropy in information theory. Indeed, minimizing AIC in a statistical model is effectively equivalent to maximizing entropy in a thermodynamic system; in other words, the information-theoretic approach in statistics is essentially applying the second law of thermodynamics. As such, AIC has roots in the work of Ludwig Boltzmann on entropy. For more on these issues, see Akaike and Burnham & Anderson.

Usage tips

Counting parameters
A statistical model must fit all the data points. Thus, a straight line, on its own, is not a model of the data, unless all the data points lie exactly on the line. We can, however, choose a model that is "a straight line plus noise"; such a model might be formally described thus:

    y_i = b0 + b1·x_i + ε_i

Here, the ε_i are the residuals from the straight-line fit. If the ε_i are assumed to be i.i.d. Gaussian, then the model has three parameters: b0, b1, and the variance of the Gaussian distributions. Thus, when calculating the AIC value of this model, we should use k = 3. More generally, for any least-squares model with i.i.d. Gaussian residuals, the variance of the residuals' distributions should be counted as one of the parameters.

As another example, consider a first-order autoregressive model, defined by

    x_i = c + φ·x_{i−1} + ε_i

with the ε_i being i.i.d. Gaussian. For this model, there are three parameters: c, φ, and the variance of the ε_i. More generally, a pth-order autoregressive model has p + 2 parameters.
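A short sketch of the parameter count for the straight-line-plus-noise model (the synthetic data and coefficient values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
# Made-up data from the line y = 2 + 0.5x with unit-variance noise.
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Least-squares fit of the "straight line plus noise" model
# y_i = b0 + b1*x_i + eps_i, with eps_i assumed i.i.d. Gaussian.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
sigma2_hat = np.mean(residuals ** 2)  # ML estimate of the noise variance

# b0, b1, and the residual variance are all estimated parameters,
# so the AIC computation should use k = 3.
k = 3
```

Forgetting to count the variance (using k = 2) is a common mistake when computing AIC for least-squares fits.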
Transforming data

The AIC values of the candidate models must all be computed with the same data set. Sometimes, though, we might want to compare a model of the data with a model of the logarithm of the data; more generally, we might want to compare a model of the data with a model of transformed data. Here is an illustration of how to deal with data transforms.

Suppose that we want to compare two models: a normal distribution of the data and a normal distribution of the logarithm of the data. We should not directly compare the AIC values of the two models. Instead, we should transform the normal cumulative distribution function to first take the logarithm of the data. To do that, we need to perform the relevant integration by substitution: thus, we need to multiply by the derivative of the (natural) logarithm function, which is 1/x. Hence the transformed distribution has the following probability density function:

    f(x) = (1/x) · (1/(σ√(2π))) · exp(−(ln x − μ)²/(2σ²))

which is the probability density function for the log-normal distribution. We then compare the AIC value of the normal model against the AIC value of the log-normal model.
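This comparison can be sketched in code. A minimal illustration (synthetic data; numpy only), assuming AIC = 2k − 2 ln(L) with k = 2 for each model; the key step is subtracting Σ ln(x_i), the log of the 1/x Jacobian, when scoring the log-data model against the original data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=1.0, sigma=0.5, size=200)  # made-up positive data

def normal_max_loglik(data):
    """Maximized log-likelihood of an i.i.d. normal model."""
    n, s2 = data.size, data.var()  # data.var() is the MLE of the variance
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1)

k = 2  # mu and sigma^2 in each model

# Model 1: normal distribution fitted to the data itself.
aic_normal = 2 * k - 2 * normal_max_loglik(x)

# Model 2: normal distribution fitted to log(data); subtracting
# sum(log x) applies the 1/x Jacobian, giving the log-normal
# likelihood on the original scale.
aic_lognormal = 2 * k - 2 * (normal_max_loglik(np.log(x)) - np.log(x).sum())
```

Without the Jacobian term the two AIC values would be computed against different data sets and the comparison would be meaningless.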
Software unreliability

Some statistical software will report the value of AIC or the maximum value of the log-likelihood function, but the reported values are not always correct. Typically, any incorrectness is due to a constant in the log-likelihood function being omitted.
For example, the log-likelihood function for n independent identical normal distributions is

    ln L(μ, σ) = −(n/2)·ln(2πσ²) − (1/(2σ²))·Σ (x_i − μ)²

This is the function that is maximized when obtaining the value of AIC. Some software, however, omits the constant term (n/2)·ln(2π), and so reports erroneous values for the log-likelihood maximum, and thus for AIC. Such errors do not matter for AIC-based comparisons if all the models have their residuals normally distributed, because then the errors cancel out. In general, however, the constant term needs to be included in the log-likelihood function. Hence, before using software to calculate AIC, it is generally good practice to run some simple tests on the software, to ensure that the function values are correct.
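One such simple test can be sketched as follows: compute the full log-likelihood by hand for a tiny data set and compare it against what the software reports; the omitted-constant failure mode is simulated here (the data values are made up):

```python
import numpy as np

x = np.array([1.2, 0.8, 1.5, 1.1, 0.9])  # made-up sample
n, s2 = x.size, x.var()

# Full maximized log-likelihood for n i.i.d. normal observations,
# including the -(n/2)*ln(2*pi) constant.
full = -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(s2) - 0.5 * n

# What a package that drops the constant would report instead.
truncated = full + 0.5 * n * np.log(2 * np.pi)

# The discrepancy is exactly (n/2)*ln(2*pi), independent of the data.
gap = truncated - full
```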
Comparisons with other model selection methods

Comparison with BIC

AIC penalizes the number of parameters less strongly than does the Bayesian information criterion (BIC). A comparison of AIC/AICc and BIC is given by Burnham & Anderson. The authors show that AIC/AICc can be derived in the same Bayesian framework as BIC, just by using a different prior. The authors also argue that AIC/AICc has theoretical advantages over BIC: first, because AIC/AICc is derived from principles of information, whereas BIC is not, despite its name; second, because the derivation of BIC has a prior of 1/R (where R is the number of candidate models), which is not sensible, since the prior should be a decreasing function of k. Additionally, they present a few simulation studies that suggest AICc tends to have practical performance advantages over BIC; see Burnham & Anderson.

Further comparison of AIC and BIC, in the context of regression, is given by Yang. In particular, AIC is asymptotically optimal in selecting the model with the least mean squared error, under the assumption that the exact true model is not in the candidate set; BIC is not asymptotically optimal under that assumption. Yang additionally shows that the rate at which AIC converges to the optimum is, in a certain sense, the best possible. For a more detailed comparison of AIC and BIC, see Vrieze and Aho et al.
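The differing penalties can be sketched directly; minimal helper functions assuming the standard definitions AIC = 2k − 2 ln(L) and BIC = k ln(n) − 2 ln(L):

```python
import math

def aic(max_loglik, k):
    """AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * max_loglik

def bic(max_loglik, k, n):
    """BIC = k ln(n) - 2 ln(L): the per-parameter penalty is ln(n)
    rather than AIC's constant 2."""
    return k * math.log(n) - 2 * max_loglik

# For n above e^2 (about 7.4), ln(n) > 2, so BIC penalizes each
# extra parameter more strongly than AIC does.
```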
Comparison with least squares

Sometimes, each candidate model assumes that the residuals are distributed according to independent identical normal distributions. That gives rise to least-squares model fitting. In this case, the maximum-likelihood estimate for the variance of a model's residuals' distributions is σ̂² = RSS/n, where RSS is the residual sum of squares. Then, the maximum value of the model's log-likelihood function is

    −(n/2)·ln(RSS/n) + C

where C is a constant independent of the model, and dependent only on the particular data points, i.e. it does not change if the data does not change. That gives

    AIC = 2k + n·ln(RSS/n) − 2C

Because only differences in AIC are meaningful, the constant C can be ignored, which conveniently allows us to take AIC = 2k + n·ln(RSS/n) for model comparisons.
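A sketch of the RSS-based formula in use (synthetic data; the degree-5 comparison model is made up for illustration); k counts the fitted coefficients plus one for the residual variance:

```python
import numpy as np

def aic_ls(y, fitted, k):
    """AIC = 2k + n*ln(RSS/n) for a least-squares fit with i.i.d.
    Gaussian residuals (model-independent constant dropped)."""
    n = y.size
    rss = np.sum((y - fitted) ** 2)
    return 2 * k + n * np.log(rss / n)

# Made-up data from a straight line, plus two competing fits.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 60)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

fit1 = np.polyval(np.polyfit(x, y, 1), x)  # degree-1 polynomial
fit5 = np.polyval(np.polyfit(x, y, 5), x)  # degree-5 polynomial

# k = number of polynomial coefficients plus one for the variance.
aic1 = aic_ls(y, fit1, k=2 + 1)
aic5 = aic_ls(y, fit5, k=6 + 1)
```

The degree-5 fit will have a smaller RSS, but its larger k must outweigh that reduction for it to be preferred.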
Note that, if all the models have the same k, then selecting the model with minimum AIC is equivalent to selecting the model with minimum RSS, which is a common objective of least-squares fitting.

Comparison with cross-validation

Leave-one-out cross-validation is asymptotically equivalent to AIC, for ordinary linear regression models. Such asymptotic equivalence also holds for mixed-effects models.

Comparison with Mallows's Cp

Mallows's Cp is equivalent to AIC in the case of linear regression.