Suppose we have a model that describes the data generation process behind a dataset. We denote the distribution given by the model as $\hat f$, and the distribution of the actual data generation process as $f$.

How good is the approximation using $\hat f$?

To be more precise, how much information is lost if we substitute our model distribution $\hat f$ for the actual data generation distribution $f$?

AIC estimates this information loss, relative to other candidate models, as

$$\mathrm{AIC} = - 2 \ln p(y|\hat\theta) + 2k$$

• $y$: the data set
• $\hat\theta$: the model parameters, estimated by maximum likelihood
• $\ln p(y|\hat\theta)$: the maximized log-likelihood (the goodness of fit)
• $k$: the number of adjustable model parameters; $+2k$ then acts as a penalty.

The first term represents the goodness of fit and the second term is a penalty for the complexity.

The smaller the AIC, the better the model is according to this criterion.
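As a minimal sketch of how the formula is used, the snippet below compares a constant model and a straight-line model on a small made-up data set. It makes several assumptions that are not part of the text above: the noise is Gaussian, the noise variance is counted as one of the $k$ parameters, and the data values and the helper names `aic`, `gaussian_max_log_likelihood` and `fit_line` are purely illustrative.

```python
import math


def aic(log_likelihood, k):
    """AIC = -2 ln p(y | theta_hat) + 2k."""
    return -2.0 * log_likelihood + 2.0 * k


def gaussian_max_log_likelihood(residuals):
    """Maximized log-likelihood under an assumed Gaussian noise model.

    The variance is set to its maximum-likelihood estimate,
    i.e. the mean of the squared residuals.
    """
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)


def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Illustrative data, roughly following y = 2x (made up for this sketch).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]

# Model A: constant mean. Parameters: mean and variance, so k = 2.
mean_y = sum(y) / len(y)
res_const = [yi - mean_y for yi in y]
aic_const = aic(gaussian_max_log_likelihood(res_const), k=2)

# Model B: straight line. Parameters: intercept, slope, variance, so k = 3.
a, b = fit_line(x, y)
res_linear = [yi - (a + b * xi) for xi, yi in zip(x, y)]
aic_linear = aic(gaussian_max_log_likelihood(res_linear), k=3)

print(f"AIC (constant): {aic_const:.2f}")
print(f"AIC (linear):   {aic_linear:.2f}")
```

For this data the straight line has the lower AIC: the extra parameter costs 2, but the improvement in the goodness-of-fit term is much larger.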

Limiting behaviors:

• $k\to0$: $\mathrm{AIC}\to- 2 \ln p(y|\hat\theta)$. The penalty vanishes and only the goodness-of-fit term remains, which makes sense since we estimated the parameters using maximum likelihood.
• $k\to\infty$: $\mathrm{AIC}\to\infty$. There is a problem with this: if we have a huge number of adjustable parameters, the penalty term dominates and the data set is no longer relevant for choosing a model.


L Ma (2020). 'Akaike Information Criterion', Datumorphism, 11 April. Available at: https://datumorphism.leima.is/cards/statistics/aic/.