The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.

AIC is founded on information theory: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. In doing so, it deals with the trade-off between the goodness of fit of the model and the complexity of the model.

AIC does not provide a test of a model in the sense of testing a null hypothesis; i.e. AIC can tell nothing about the quality of a model in an absolute sense. If all the candidate models fit poorly, AIC will not give any warning of that.

Definition

Suppose that
we have a statistical model of some data. Let L be the maximum value of the likelihood function for the model, and let k be the number of estimated parameters in the model. Then the AIC value of the model is the following:

AIC = 2k - 2 ln(L)

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Hence AIC rewards goodness of fit, but it also includes a penalty that is an increasing function of the number of estimated parameters; the penalty discourages overfitting. AIC is founded on information theory.
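Before turning to the information-theoretic view, the definition above can be sketched in code. This is a minimal illustration: the function name and the example log-likelihood value are my own, not part of any particular library.

```python
import math

def aic(log_likelihood_max, k):
    """AIC = 2k - 2*ln(L), where ln(L) is the maximized
    log-likelihood and k the number of estimated parameters."""
    return 2 * k - 2 * log_likelihood_max

# Hypothetical example: a model whose maximized log-likelihood
# is -47.5 and which has 3 estimated parameters.
print(aic(-47.5, 3))  # 2*3 - 2*(-47.5) = 101.0
```

Note that the absolute value returned is meaningless on its own; only differences between AIC values of models fitted to the same data carry information.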
Suppose that the data is generated by some unknown process f. We consider two candidate models to represent f: g1 and g2. If we knew f, then we could find the information lost from using g1 to represent f by calculating the Kullback-Leibler divergence, D_KL(f || g1); similarly, the information lost from using g2 to represent f could be found by calculating D_KL(f || g2). We would then choose the candidate model that minimized the information loss. We cannot choose with certainty, because we do not know f. Akaike showed, however, that we can estimate, via AIC, how much more (or less) information is lost by g1 than by g2. The estimate, though, is only valid asymptotically; if the number of data points is small, then some correction is often necessary.
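To make the notion of "information loss" concrete, here is a sketch of the Kullback-Leibler divergence for discrete distributions. The distributions f, g1, and g2 below are made up for illustration; in practice f is unknown, which is exactly why AIC only estimates relative loss.

```python
import math

def kl_divergence(f, g):
    """Kullback-Leibler divergence D_KL(f || g) for discrete
    distributions given as lists of probabilities."""
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)

# Hypothetical true process f and two candidate models g1, g2.
f  = [0.5, 0.3, 0.2]
g1 = [0.4, 0.4, 0.2]
g2 = [0.8, 0.1, 0.1]

# The candidate that loses less information would be preferred.
print(kl_divergence(f, g1), kl_divergence(f, g2))
```

Here g1 loses less information than g2, so it would be the preferred representation of f if f were known.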
How to apply AIC in practice

To apply AIC in practice, we start with a set of candidate models, and then find the models' corresponding AIC values. There will almost always be information lost due to using a candidate model to represent the "true" model. We wish to select, from among the candidate models, the model that minimizes the information loss. We cannot choose with certainty, but we can minimize the estimated information loss.

Suppose that there are R candidate models. Denote the AIC values of those models by AIC1, AIC2, ..., AICR, and let AICmin be the minimum of those values. Then exp((AICmin - AICi)/2) can be interpreted as the relative probability that the ith model minimizes the information loss.

As an example, suppose that there are three candidate models, whose AIC values are 100, 102, and 110. Then the second model is exp((100 - 102)/2) = 0.368 times as probable as the first model to minimize the information loss. Similarly, the third model is exp((100 - 110)/2) = 0.007 times as probable as the first model to minimize the information loss. In this example, we would omit the third model from further consideration. We then have three options: (1) gather more data, in the hope that this will allow clearly distinguishing between the first two models; (2) simply conclude that the data is insufficient to support selecting one model from among the first two; (3) take a weighted average of the first two models, with weights proportional to 1 and 0.368 respectively, and then do statistical inference based on the weighted multimodel.
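The relative-probability computation can be sketched as follows. The three AIC values are illustrative, and normalizing the relative likelihoods so they sum to 1 (so-called Akaike weights) is a common convention for the model-averaging option:

```python
import math

def akaike_weights(aic_values):
    """Relative likelihoods exp((AIC_min - AIC_i)/2),
    normalized to sum to 1 (Akaike weights)."""
    aic_min = min(aic_values)
    rel = [math.exp((aic_min - a) / 2) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

# Three candidate models with illustrative AIC values.
print(akaike_weights([100.0, 102.0, 110.0]))
```

The third weight comes out tiny, which is why a model that far behind would typically be dropped from consideration.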
The quantity exp((AICmin - AICi)/2) is known as the relative likelihood of model i.

If all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the likelihood-ratio test. There are, however, important distinctions. In particular, the likelihood-ratio test is valid only for nested models, whereas AIC has no such restriction.

AICc

AICc is AIC with a correction for finite sample sizes. The
formula for AICc depends upon the statistical model. Assuming that the model is univariate, linear, and has normally distributed residuals, the formula for AICc is as follows:

AICc = AIC + (2k(k + 1)) / (n - k - 1)

where n denotes the sample size and k denotes the number of parameters. If the assumption of a univariate linear model with normal residuals does not hold, then the formula for AICc will generally change. Further discussion of the formula, with examples of other assumptions, is given by Burnham & Anderson and by Konishi & Kitagawa. In particular, with other assumptions, bootstrap estimation of the formula is often feasible.

AICc is essentially AIC with a greater penalty for extra parameters. Using AIC, instead of AICc, when n is not many times larger than k, increases the probability of selecting models that have too many parameters, i.e.,
of overfitting. The probability of AIC overfitting can be substantial in some cases.

Brockwell & Davis advise using AICc as the primary criterion in selecting the orders of an ARMA model for time series. McQuarrie & Tsai ground their high opinion of AICc on extensive simulation work with regression and time series. Burnham & Anderson note that, since AICc converges to AIC as n gets large, AICc rather than AIC should generally be employed.

Note that if all the candidate models have the same k, then AICc and AIC will give identical relative valuations; hence, there will be no disadvantage in using AIC instead of AICc.
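The finite-sample correction can be sketched directly. This assumes the univariate-linear, normal-residuals formula given above; the function name and example numbers are mine:

```python
def aicc(aic, n, k):
    """AICc = AIC + 2k(k + 1)/(n - k - 1), valid under the
    univariate linear model with normal residuals; needs n > k + 1."""
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# With a small sample the correction is substantial...
print(aicc(100.0, n=20, k=5))   # 100 + 60/14
# ...and it shrinks as n grows, so AICc converges to AIC.
print(aicc(100.0, n=2000, k=5))
```

This also makes visible why the correction matters only when n is not many times larger than k.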
Furthermore, if n is many times larger than k, then the correction will be negligible; hence, there will be negligible disadvantage in using AIC instead of AICc.

History

The Akaike
information criterion was developed by Hirotugu Akaike, originally under the name "an information criterion". It was first announced by Akaike at a symposium, the proceedings of which were subsequently published; that publication, though, was only an informal presentation of the concepts. The first formal publication was in a research paper by Akaike. The paper has since received many thousands of citations in the Web of Science, making it one of the most-cited research papers of all time.

The initial derivation of AIC relied upon some strong assumptions. Takeuchi showed that the assumptions could be made much weaker. Takeuchi's work, however, was in Japanese and was not widely known outside Japan for many years.

AICc was originally proposed for linear regression by Sugiura. That instigated the work of Hurvich & Tsai, and several further papers by the same authors, which extended the situations in which AICc could be applied. The work of Hurvich & Tsai contributed to the decision to publish a second edition of the volume by Brockwell & Davis, which is the standard reference for linear time series; the second edition states, "our prime criterion for model selection among ARMA models will be the AICc".

The first general exposition of the information-theoretic approach was the volume by Burnham & Anderson; it includes an English presentation of the work of Takeuchi. The volume led to far greater use of AIC, and it now has many thousands of citations on Google Scholar.

Akaike
originally called his approach an "entropy maximization principle", because the approach is founded on the concept of entropy in information theory. Indeed, minimizing AIC in a statistical model is effectively equivalent to maximizing entropy in a thermodynamic system; in other words, the information-theoretic approach in statistics is essentially applying the second law of thermodynamics. As such, AIC has roots in the work of Ludwig Boltzmann on entropy. For more on these issues, see Akaike and Burnham & Anderson.

Usage tips

Counting parameters

A statistical model must fit
all the data points. Thus, a straight line, on its own, is not a model of the data unless all the data points lie exactly on the line. We can, however, choose a model that is "a straight line plus noise"; such a model might be formally described thus: y_i = b0 + b1*x_i + e_i. Here, the e_i are the residuals from the straight-line fit. If the e_i are assumed to be i.i.d. Gaussian (with zero mean), then the model has three parameters: b0, b1, and the variance of the Gaussian distributions.
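The "straight line plus noise" model can be sketched end to end: fit b0 and b1 by least squares, treat the residual variance as the third estimated parameter, and compute AIC from the maximized Gaussian log-likelihood. The data below are simulated for illustration.

```python
import math
import random

# Hypothetical data from y_i = b0 + b1*x_i + e_i with i.i.d. Gaussian e_i.
random.seed(0)
xs = [float(i) for i in range(30)]
ys = [2.0 + 0.5 * x + random.gauss(0.0, 1.0) for x in xs]

# Ordinary least-squares estimates of b0 and b1.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

# The residual variance is the third estimated parameter, so k = 3.
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma2 = rss / n  # maximum-likelihood estimate of the variance
log_l = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
aic = 2 * 3 - 2 * log_l
print(aic)
```

The key point is the k = 3 in the last step: forgetting to count the variance is a common mistake when applying AIC to least-squares fits.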
Thus, when calculating the AIC value of this model, we should use k = 3. More generally, for any least-squares model with i.i.d. Gaussian residuals, the variance of the residuals' distributions should be counted as one of the parameters.

As another example, consider a first-order autoregressive model, defined by x_i = c + phi*x_{i-1} + e_i, with the e_i being i.i.d. Gaussian. For this model, there are three parameters: c, phi, and the variance of the e_i. More generally, a pth-order autoregressive model has p + 2 parameters.

Transforming data

The AIC
values of the candidate models must all be computed with the same data set. Sometimes, though, we might want to compare a model of the data with a model of the logarithm of the data; more generally, we might want to compare a model of the data with a model of transformed data. Here is an illustration of how to deal with data transforms.
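The upshot of the illustration that follows can be sketched numerically: a normal model fitted to log(data) must include the 1/x change-of-variables factor in its likelihood, so that both candidates are scored against the original data. The helper name and data values below are my own illustrative choices.

```python
import math

def normal_max_loglik(values):
    """Maximized log-likelihood of an i.i.d. normal model (k = 2)."""
    n = len(values)
    mu = sum(values) / n
    sigma2 = sum((v - mu) ** 2 for v in values) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

data = [1.2, 0.8, 3.5, 2.1, 0.9, 4.2, 1.7, 2.8]  # hypothetical positive data

k = 2  # mean and variance, in both candidate models
aic_normal = 2 * k - 2 * normal_max_loglik(data)

# Normal model of log(data): add the change-of-variables term
# sum(ln(1/x)), i.e. subtract sum(ln x), so the density is
# evaluated on the original scale (this is the log-normal density).
loglik_lognormal = (normal_max_loglik([math.log(v) for v in data])
                    - sum(math.log(v) for v in data))
aic_lognormal = 2 * k - 2 * loglik_lognormal

print(aic_normal, aic_lognormal)
```

Comparing aic_normal against the naive AIC of the log-scale fit, without the Jacobian term, would be meaningless, since the two likelihoods would refer to different data sets.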
Suppose that we want to compare two models: a normal distribution of the data, and a normal distribution of the logarithm of the data. We should not directly compare the AIC values of the two models. Instead, we should transform the normal cumulative distribution function to first take the logarithm of the data. To do that, we need to perform the relevant integration by substitution: thus, we need to multiply by the derivative of the (natural) logarithm function, which is 1/x. Hence, the transformed distribution has the following probability density function:

f(x) = (1 / (x * sigma * sqrt(2*pi))) * exp(-(ln(x) - mu)^2 / (2*sigma^2))

which is the probability density function for the log-normal distribution. We then compare the AIC value of the normal model against the AIC value of the log-normal model.

Software unreliability

Some
statistical software will report the value of AIC or the maximum value of the log-likelihood function, but the reported values are not always correct. Typically, any incorrectness is due to a constant in the log-likelihood function being omitted.
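The omission can be demonstrated directly. The sample below is made up; the dropped term is the usual (n/2)*ln(2*pi) constant of the normal log-likelihood:

```python
import math

data = [2.3, 1.9, 3.1, 2.7, 2.0, 2.6]  # hypothetical sample
n = len(data)
mu = sum(data) / n
sigma2 = sum((v - mu) ** 2 for v in data) / n

# Full maximized log-likelihood of an i.i.d. normal model.
full = -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2) + 1)
# The same quantity with the (n/2)*ln(2*pi) constant omitted,
# as some software erroneously reports.
truncated = -0.5 * n * (math.log(sigma2) + 1)

# The discrepancy is a model-independent constant.
print(full - truncated)  # equals -(n/2)*ln(2*pi)
```

A check of this kind, against a hand computation on a tiny data set, is the sort of simple test worth running before trusting a package's reported AIC.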
For example, the log-likelihood function for n independent identically distributed normal observations is

ln L(mu, sigma) = -(n/2) ln(2*pi) - (n/2) ln(sigma^2) - (1/(2*sigma^2)) * sum_i (x_i - mu)^2

This is the function that is maximized when obtaining the value of AIC. Some software, however, omits the constant term (n/2) ln(2*pi), and so reports erroneous values for the log-likelihood maximum, and thus for AIC. Such errors do not matter for AIC-based comparisons if all the models have their residuals normally distributed, because then the errors cancel out. In general, however, the constant term needs to be included in the log-likelihood function. Hence, before using software to calculate AIC, it is generally good practice to run some simple tests on the software, to ensure that the function values are correct.

Comparisons with other model selection methods
Comparison with BIC

The AIC penalizes the number of parameters less strongly than does the Bayesian information criterion (BIC). A comparison of AIC/AICc and BIC is given by Burnham & Anderson. The authors show that AIC/AICc can be derived in the same Bayesian framework as BIC, just by using a different prior. The authors also argue that AIC/AICc has theoretical advantages over BIC: first, because AIC/AICc is derived from principles of information, whereas BIC is not, despite its name; second, because the derivation of BIC has a prior of 1/R (where R is the number of candidate models), which is not sensible, since the prior should be a decreasing function of k. Additionally, they present a few simulation studies that suggest AICc tends to have practical performance advantages over BIC (see Burnham & Anderson).

Further comparison of AIC and BIC, in the context of regression, is given by Yang. In particular, AIC is asymptotically optimal in selecting the model with the least mean squared error, under the assumption that the exact "true" model is not in the candidate set; BIC is not asymptotically optimal under that assumption. Yang additionally shows that the rate at which AIC converges to the optimum is, in a certain sense, the best possible. For a more detailed comparison of AIC and BIC, see Vrieze and Aho et al.

Comparison with least squares

Sometimes, each candidate model assumes that
the residuals are distributed according to independent identical normal distributions; that gives rise to least-squares model fitting. In this case, the maximum-likelihood estimate for the variance of a model's residuals' distributions is sigma^2 = RSS/n, where RSS is the residual sum of squares. Then, the maximum value of the model's log-likelihood function is -(n/2) ln(RSS/n) + C, where C is a constant independent of the model, and dependent only on the particular data points, i.e. it does not change if the data does not change. That gives

AIC = 2k + n ln(RSS/n) - 2C

Because only differences in AIC are meaningful, the constant C can be ignored, which conveniently allows us to take AIC = 2k + n ln(RSS/n) for model comparisons.
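The least-squares shortcut above can be sketched directly. The two (RSS, k) pairs are hypothetical fits to the same data:

```python
import math

def aic_ls(rss, n, k):
    """AIC, up to a data-dependent constant, for a least-squares
    model with i.i.d. normal residuals: 2k + n*ln(RSS/n).
    k counts all estimated parameters, including the variance."""
    return 2 * k + n * math.log(rss / n)

# Hypothetical fits on the same n = 50 points: a 2-parameter model,
# and a 5-parameter model with only a slightly smaller RSS.
print(aic_ls(rss=120.0, n=50, k=2))
print(aic_ls(rss=115.0, n=50, k=5))
```

Here the small reduction in RSS does not pay for the three extra parameters, so the simpler model has the lower AIC.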
Note that if all the models have the same k, then selecting the model with minimum AIC is equivalent to selecting the model with minimum RSS, which is a common objective of least-squares model fitting.

Comparison with cross-validation

Leave-one-out cross-validation is asymptotically equivalent to AIC, for ordinary linear regression models. Such asymptotic equivalence also holds for mixed-effects models.

Comparison with Mallows's Cp

Mallows's Cp is equivalent to AIC in the case of linear regression.