The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.

AIC is founded on information theory: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. In doing so, it deals with the trade-off between the goodness of fit of the model and the complexity of the model.

AIC does not provide a test of a model in the sense of testing a null hypothesis; i.e. AIC can tell nothing about the quality of a model in an absolute sense. If all the candidate models fit poorly, AIC will not give any warning of that.

Definition

Suppose that
we have a statistical model of some data. Let L be the maximum value of the likelihood function for the model, and let k be the number of estimated parameters in the model. Then the AIC value of the model is the following:

AIC = 2k - 2 ln(L)

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Hence AIC rewards goodness of fit, but it also includes a penalty that is an increasing function of the number of estimated parameters; the penalty discourages overfitting. AIC is founded on information theory.
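Before turning to the information-theoretic view, the definition above can be sketched in code. This is a minimal illustration: the function name and the example log-likelihood value are my own, not part of any particular library.

```python
import math

def aic(log_likelihood_max, k):
    """AIC = 2k - 2*ln(L), where ln(L) is the maximized
    log-likelihood and k the number of estimated parameters."""
    return 2 * k - 2 * log_likelihood_max

# Hypothetical example: a model whose maximized log-likelihood
# is -47.5 and which has 3 estimated parameters.
print(aic(-47.5, 3))  # 2*3 - 2*(-47.5) = 101.0
```

Note that the absolute value returned is meaningless on its own; only differences between AIC values of models fitted to the same data carry information.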
Suppose that the data is generated by some unknown process f. We consider two candidate models to represent f: g1 and g2. If we knew f, then we could find the information lost from using g1 to represent f by calculating the Kullback-Leibler divergence, D_KL(f || g1); similarly, the information lost from using g2 to represent f could be found by calculating D_KL(f || g2). We would then choose the candidate model that minimized the information loss. We cannot choose with certainty, because we do not know f. Akaike showed, however, that we can estimate, via AIC, how much more (or less) information is lost by g1 than by g2. The estimate, though, is only valid asymptotically; if the number of data points is small, then some correction is often necessary.
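To make the notion of "information loss" concrete, here is a sketch of the Kullback-Leibler divergence for discrete distributions. The distributions f, g1, and g2 below are made up for illustration; in practice f is unknown, which is exactly why AIC only estimates relative loss.

```python
import math

def kl_divergence(f, g):
    """Kullback-Leibler divergence D_KL(f || g) for discrete
    distributions given as lists of probabilities."""
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)

# Hypothetical true process f and two candidate models g1, g2.
f  = [0.5, 0.3, 0.2]
g1 = [0.4, 0.4, 0.2]
g2 = [0.8, 0.1, 0.1]

# The candidate that loses less information would be preferred.
print(kl_divergence(f, g1), kl_divergence(f, g2))
```

Here g1 loses less information than g2, so it would be the preferred representation of f if f were known.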
How to apply AIC in practice

To apply AIC in practice, we start with a set of candidate models, and then find the models' corresponding AIC values. There will almost always be information lost due to using a candidate model to represent the "true" model. We wish to select, from among the candidate models, the model that minimizes the information loss. We cannot choose with certainty, but we can minimize the estimated information loss.

Suppose that there are R candidate models. Denote the AIC values of those models by AIC1, AIC2, ..., AICR, and let AICmin be the minimum of those values. Then exp((AICmin - AICi)/2) can be interpreted as the relative probability that the ith model minimizes the information loss.

As an example, suppose that there are three candidate models, whose AIC values are 100, 102, and 110. Then the second model is exp((100 - 102)/2) = 0.368 times as probable as the first model to minimize the information loss. Similarly, the third model is exp((100 - 110)/2) = 0.007 times as probable as the first model to minimize the information loss. In this example, we would omit the third model from further consideration. We then have three options: (1) gather more data, in the hope that this will allow clearly distinguishing between the first two models; (2) simply conclude that the data is insufficient to support selecting one model from among the first two; (3) take a weighted average of the first two models, with weights proportional to 1 and 0.368 respectively, and then do statistical inference based on the weighted multimodel.
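The relative-probability computation can be sketched as follows. The three AIC values are illustrative, and normalizing the relative likelihoods so they sum to 1 (so-called Akaike weights) is a common convention for the model-averaging option:

```python
import math

def akaike_weights(aic_values):
    """Relative likelihoods exp((AIC_min - AIC_i)/2),
    normalized to sum to 1 (Akaike weights)."""
    aic_min = min(aic_values)
    rel = [math.exp((aic_min - a) / 2) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

# Three candidate models with illustrative AIC values.
print(akaike_weights([100.0, 102.0, 110.0]))
```

The third weight comes out tiny, which is why a model that far behind would typically be dropped from consideration.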
The quantity exp((AICmin - AICi)/2) is known as the relative likelihood of model i.

If all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the likelihood-ratio test. There are, however, important distinctions. In particular, the likelihood-ratio test is valid only for nested models, whereas AIC has no such restriction.

AICc

AICc is AIC with a correction for finite sample sizes. The
formula for AICc depends upon the statistical model. Assuming that the model is univariate, linear, and has normally distributed residuals, the formula for AICc is as follows:

AICc = AIC + (2k(k + 1)) / (n - k - 1)

where n denotes the sample size and k denotes the number of parameters. If the assumption of a univariate linear model with normal residuals does not hold, then the formula for AICc will generally change. Further discussion of the formula, with examples of other assumptions, is given by Burnham & Anderson and by Konishi & Kitagawa. In particular, with other assumptions, bootstrap estimation of the formula is often feasible.

AICc is essentially AIC with a greater penalty for extra parameters. Using AIC, instead of AICc, when n is not many times larger than k, increases the probability of selecting models that have too many parameters, i.e.,
of overfitting. The probability of AIC overfitting can be substantial in some cases.

Brockwell & Davis advise using AICc as the primary criterion in selecting the orders of an ARMA model for time series. McQuarrie & Tsai ground their high opinion of AICc on extensive simulation work with regression and time series. Burnham & Anderson note that, since AICc converges to AIC as n gets large, AICc rather than AIC should generally be employed.

Note that if all the candidate models have the same k, then AICc and AIC will give identical relative valuations; hence, there will be no disadvantage in using AIC instead of AICc.
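The finite-sample correction can be sketched directly. This assumes the univariate-linear, normal-residuals formula given above; the function name and example numbers are mine:

```python
def aicc(aic, n, k):
    """AICc = AIC + 2k(k + 1)/(n - k - 1), valid under the
    univariate linear model with normal residuals; needs n > k + 1."""
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# With a small sample the correction is substantial...
print(aicc(100.0, n=20, k=5))   # 100 + 60/14
# ...and it shrinks as n grows, so AICc converges to AIC.
print(aicc(100.0, n=2000, k=5))
```

This also makes visible why the correction matters only when n is not many times larger than k.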
Furthermore, if n is many times larger than k, then the correction will be negligible; hence, there will be negligible disadvantage in using AIC instead of AICc.

History

The Akaike
information criterion was developed by Hirotugu Akaike, originally under the name "an information criterion". It was first announced by Akaike at a symposium, the proceedings of which were subsequently published; that publication, though, was only an informal presentation of the concepts. The first formal publication was in a research paper by Akaike. The paper has since received many thousands of citations in the Web of Science, making it one of the most-cited research papers of all time.

The initial derivation of AIC relied upon some strong assumptions. Takeuchi showed that the assumptions could be made much weaker. Takeuchi's work, however, was in Japanese and was not widely known outside Japan for many years.

AICc was originally proposed for linear regression by Sugiura. That instigated the work of Hurvich & Tsai, and several further papers by the same authors, which extended the situations in which AICc could be applied. The work of Hurvich & Tsai contributed to the decision to publish a second edition of the volume by Brockwell & Davis, which is the standard reference for linear time series; the second edition states, "our prime criterion for model selection among ARMA models will be the AICc".

The first general exposition of the information-theoretic approach was the volume by Burnham & Anderson; it includes an English presentation of the work of Takeuchi. The volume led to far greater use of AIC, and it now has many thousands of citations on Google Scholar.

Akaike
originally called his approach an "entropy maximization principle", because the approach is founded on the concept of entropy in information theory. Indeed, minimizing AIC in a statistical model is effectively equivalent to maximizing entropy in a thermodynamic system; in other words, the information-theoretic approach in statistics is essentially applying the second law of thermodynamics. As such, AIC has roots in the work of Ludwig Boltzmann on entropy. For more on these issues, see Akaike and Burnham & Anderson.

Usage tips

Counting parameters

A statistical model must fit
all the data points. Thus, a straight line, on its own, is not a model of the data unless all the data points lie exactly on the line. We can, however, choose a model that is "a straight line plus noise"; such a model might be formally described thus: y_i = b0 + b1*x_i + e_i. Here, the e_i are the residuals from the straight-line fit. If the e_i are assumed to be i.i.d. Gaussian (with zero mean), then the model has three parameters: b0, b1, and the variance of the Gaussian distributions.
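The "straight line plus noise" model can be sketched end to end: fit b0 and b1 by least squares, treat the residual variance as the third estimated parameter, and compute AIC from the maximized Gaussian log-likelihood. The data below are simulated for illustration.

```python
import math
import random

# Hypothetical data from y_i = b0 + b1*x_i + e_i with i.i.d. Gaussian e_i.
random.seed(0)
xs = [float(i) for i in range(30)]
ys = [2.0 + 0.5 * x + random.gauss(0.0, 1.0) for x in xs]

# Ordinary least-squares estimates of b0 and b1.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

# The residual variance is the third estimated parameter, so k = 3.
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma2 = rss / n  # maximum-likelihood estimate of the variance
log_l = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
aic = 2 * 3 - 2 * log_l
print(aic)
```

The key point is the k = 3 in the last step: forgetting to count the variance is a common mistake when applying AIC to least-squares fits.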
Thus, when calculating the AIC value of this model, we should use k = 3. More generally, for any least-squares model with i.i.d. Gaussian residuals, the variance of the residuals' distributions should be counted as one of the parameters.

As another example, consider a first-order autoregressive model, defined by x_i = c + phi*x_{i-1} + e_i, with the e_i being i.i.d. Gaussian. For this model, there are three parameters: c, phi, and the variance of the e_i. More generally, a pth-order autoregressive model has p + 2 parameters.

Transforming data

The AIC
values of the candidate models must all be computed with the same data set. Sometimes, though, we might want to compare a model of the data with a model of the logarithm of the data; more generally, we might want to compare a model of the data with a model of transformed data. Here is an illustration of how to deal with data transforms.
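The upshot of the illustration that follows can be sketched numerically: a normal model fitted to log(data) must include the 1/x change-of-variables factor in its likelihood, so that both candidates are scored against the original data. The helper name and data values below are my own illustrative choices.

```python
import math

def normal_max_loglik(values):
    """Maximized log-likelihood of an i.i.d. normal model (k = 2)."""
    n = len(values)
    mu = sum(values) / n
    sigma2 = sum((v - mu) ** 2 for v in values) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

data = [1.2, 0.8, 3.5, 2.1, 0.9, 4.2, 1.7, 2.8]  # hypothetical positive data

k = 2  # mean and variance, in both candidate models
aic_normal = 2 * k - 2 * normal_max_loglik(data)

# Normal model of log(data): add the change-of-variables term
# sum(ln(1/x)), i.e. subtract sum(ln x), so the density is
# evaluated on the original scale (this is the log-normal density).
loglik_lognormal = (normal_max_loglik([math.log(v) for v in data])
                    - sum(math.log(v) for v in data))
aic_lognormal = 2 * k - 2 * loglik_lognormal

print(aic_normal, aic_lognormal)
```

Comparing aic_normal against the naive AIC of the log-scale fit, without the Jacobian term, would be meaningless, since the two likelihoods would refer to different data sets.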
Suppose that we want to compare two models: a normal distribution of the data, and a normal distribution of the logarithm of the data. We should not directly compare the AIC values of the two models. Instead, we should transform the normal cumulative distribution function to first take the logarithm of the data. To do that, we need to perform the relevant integration by substitution: thus, we need to multiply by the derivative of the (natural) logarithm function, which is 1/x. Hence, the transformed distribution has the following probability density function:

f(x) = (1 / (x * sigma * sqrt(2*pi))) * exp(-(ln(x) - mu)^2 / (2*sigma^2))

which is the probability density function for the log-normal distribution. We then compare the AIC value of the normal model against the AIC value of the log-normal model.

Software unreliability

Some
statistical software will report the value of AIC or the maximum value of the log-likelihood function, but the reported values are not always correct. Typically, any incorrectness is due to a constant in the log-likelihood function being omitted.
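The omission can be demonstrated directly. The sample below is made up; the dropped term is the usual (n/2)*ln(2*pi) constant of the normal log-likelihood:

```python
import math

data = [2.3, 1.9, 3.1, 2.7, 2.0, 2.6]  # hypothetical sample
n = len(data)
mu = sum(data) / n
sigma2 = sum((v - mu) ** 2 for v in data) / n

# Full maximized log-likelihood of an i.i.d. normal model.
full = -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2) + 1)
# The same quantity with the (n/2)*ln(2*pi) constant omitted,
# as some software erroneously reports.
truncated = -0.5 * n * (math.log(sigma2) + 1)

# The discrepancy is a model-independent constant.
print(full - truncated)  # equals -(n/2)*ln(2*pi)
```

A check of this kind, against a hand computation on a tiny data set, is the sort of simple test worth running before trusting a package's reported AIC.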
For example, the log-likelihood function for n independent identically distributed normal observations is

ln L(mu, sigma) = -(n/2) ln(2*pi) - (n/2) ln(sigma^2) - (1/(2*sigma^2)) * sum_i (x_i - mu)^2

This is the function that is maximized when obtaining the value of AIC. Some software, however, omits the constant term (n/2) ln(2*pi), and so reports erroneous values for the log-likelihood maximum, and thus for AIC. Such errors do not matter for AIC-based comparisons if all the models have their residuals normally distributed, because then the errors cancel out. In general, however, the constant term needs to be included in the log-likelihood function. Hence, before using software to calculate AIC, it is generally good practice to run some simple tests on the software, to ensure that the function values are correct.

Comparisons with other model selection methods
Comparison with BIC

The AIC penalizes the number of parameters less strongly than does the Bayesian information criterion (BIC). A comparison of AIC/AICc and BIC is given by Burnham & Anderson. The authors show that AIC/AICc can be derived in the same Bayesian framework as BIC, just by using a different prior. The authors also argue that AIC/AICc has theoretical advantages over BIC: first, because AIC/AICc is derived from principles of information, whereas BIC is not, despite its name; second, because the derivation of BIC has a prior of 1/R (where R is the number of candidate models), which is not sensible, since the prior should be a decreasing function of k. Additionally, they present a few simulation studies that suggest AICc tends to have practical performance advantages over BIC (see Burnham & Anderson).

Further comparison of AIC and BIC, in the context of regression, is given by Yang. In particular, AIC is asymptotically optimal in selecting the model with the least mean squared error, under the assumption that the exact "true" model is not in the candidate set; BIC is not asymptotically optimal under that assumption. Yang additionally shows that the rate at which AIC converges to the optimum is, in a certain sense, the best possible. For a more detailed comparison of AIC and BIC, see Vrieze and Aho et al.

Comparison with least squares

Sometimes, each candidate model assumes that
the residuals are distributed according to independent identical normal distributions; that gives rise to least-squares model fitting. In this case, the maximum-likelihood estimate for the variance of a model's residuals' distributions is sigma^2 = RSS/n, where RSS is the residual sum of squares. Then, the maximum value of the model's log-likelihood function is -(n/2) ln(RSS/n) + C, where C is a constant independent of the model, and dependent only on the particular data points, i.e. it does not change if the data does not change. That gives

AIC = 2k + n ln(RSS/n) - 2C

Because only differences in AIC are meaningful, the constant C can be ignored, which conveniently allows us to take AIC = 2k + n ln(RSS/n) for model comparisons.
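The least-squares shortcut above can be sketched directly. The two (RSS, k) pairs are hypothetical fits to the same data:

```python
import math

def aic_ls(rss, n, k):
    """AIC, up to a data-dependent constant, for a least-squares
    model with i.i.d. normal residuals: 2k + n*ln(RSS/n).
    k counts all estimated parameters, including the variance."""
    return 2 * k + n * math.log(rss / n)

# Hypothetical fits on the same n = 50 points: a 2-parameter model,
# and a 5-parameter model with only a slightly smaller RSS.
print(aic_ls(rss=120.0, n=50, k=2))
print(aic_ls(rss=115.0, n=50, k=5))
```

Here the small reduction in RSS does not pay for the three extra parameters, so the simpler model has the lower AIC.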
Note that if all the models have the same k, then selecting the model with minimum AIC is equivalent to selecting the model with minimum RSS, which is a common objective of least-squares model fitting.

Comparison with cross-validation

Leave-one-out cross-validation is asymptotically equivalent to AIC, for ordinary linear regression models. Such asymptotic equivalence also holds for mixed-effects models.

Comparison with Mallows's Cp

Mallows's Cp is equivalent to AIC in the case of linear regression.