Minimum Description Length

#MDL #Model Selection

The minimum description length, aka, MDL, is based on the relations between regularity and data compression. (See [[Kolmogorov complexity]] Kolmogorov Complexity Description of Data The measurement of complexity is based on the observation that the compressibility of data doesn’t depend on the “language” used to describe the compression process that much. This makes it possible for us to find a universal language, such as a universal computer language, to quantify the compressibility of the data. One intuitive idea is to use a programming language to describe the data. If we have a sequence of data, … for more about data descriptions.).

In statistical inferences, given a dataset $\mathcal D$, the compressions of the data is our model $\mathcal M$, or hypothesis $\mathcal H$ in the language of hypothesis testing.

MDL looks for the model that compresses the data well but with a reasonable cost of complexity. The complexity of the model is described by a length $L(\mathcal M)$, the goodness of the model is $G$. MDL looks into the balance of $L(\mathcal M)$ and $G$. For example, one could calculate the length of the model using the number of parameters $k$ in the model, and the goodness of the model using likelihood $p(\mathcal D \mid \mathcal M)$.

There are many versions of MDL: ¹

crude two-part code,
Fisher information approximation ( [[FIA]] Fisher Information Approximation FIA is a method to describe the minimum description length ( [[MDL]] Minimum Description Length MDL is a measure of how well a model compresses data by minimizing the combined cost of the description of the model and the misfit. ) of models, $$ \mathrm{FIA} = -\ln p(y | \hat\theta) + \frac{k}{2} \ln \frac{n}{2\pi} + \ln \int_\Theta \sqrt{ \operatorname{det}[I(\theta)] d\theta } $$ $I(\theta)$: Fisher information matrix of sample size 1. $$I_{i,j}(\theta) = E\left( \frac{\partial \ln p(y| … ),
Normalized Maximum likelihood ( [[NML]] Normalized Maximum Likelihood $$ \mathrm{NML} = \frac{ p(y| \hat \theta(y)) }{ \int_X p( x| \hat \theta (x) ) dx } $$ ).