MDL and Neural Networks
Minimum Description Length ([[MDL]]) can be used to construct a concise network. MDL measures how well a model compresses data by minimizing the combined cost of describing the model and describing the misfit. A fully connected network has great expressive power, but it overfits easily.
One strategy is to apply constraints to the networks:
- Limit the connections;
- Share weights within subgroups of the network;
- Constrain the weights using probability distributions.
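To make the first two constraints concrete, here is a minimal sketch that counts the free parameters of a single layer under each choice. The function names, the one-dimensional layout, and the sizes (`n_in = 100`, `kernel = 5`) are assumptions chosen for illustration only; they do not come from the referenced paper.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Free parameters of a fully connected layer (weights plus biases)."""
    return n_in * n_out + n_out


def local_params(n_in: int, kernel: int) -> int:
    """Limited connections: each output sees only `kernel` adjacent inputs."""
    n_out = n_in - kernel + 1
    return n_out * kernel + n_out


def shared_params(kernel: int) -> int:
    """Shared weights: every output reuses the same kernel and one bias."""
    return kernel + 1


n_in, kernel = 100, 5
n_out = n_in - kernel + 1  # 96 outputs, so the three layers are comparable

print(dense_params(n_in, n_out))   # 9696
print(local_params(n_in, kernel))  # 576
print(shared_params(kernel))       # 6
```

Fewer free parameters mean fewer numbers to describe, which is one way a constrained network can end up with a shorter description than a fully connected one.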
By minimizing the MDL of the network and the misfits on the data, we can build a concise network. Based on the [[Source Coding Theorem]], we can encode the misfit and the model using Shannon information content (Hinton1993).

The relevant coding theory concepts are as follows. A code function produces code words, and the expected length of the code words is bounded from below by the entropy of the source probability $p$. The Shannon information content, also known as self-information, of the outcome $x=a$ is
$$ - \log_2 p(x=a). $$
The Shannon entropy is the expected information content over the probability distribution $p(x)$,
$$ \mathcal H = - \sum_{x\in X} p(x) \log_2 p(x). $$

The description length for the misfit and the model corresponds to the Shannon information content. Thus we can define an expected description length and minimize it over the model, so that we balance the complexity of the model against the goodness of fit.
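The trade-off between model complexity and goodness of fit can be sketched numerically. The code below is a rough illustration, not the procedure from Hinton1993: it assumes that both weights and misfits are quantized to bins of width `width` and coded under zero-mean Gaussians, so that the probability of each bin is approximately `width` times the density. The function names, the default values of `sigma_w`, `sigma_d`, and `width`, and the two toy models are all hypothetical.

```python
import math


def information_content(p: float) -> float:
    """Shannon information content -log2(p), in bits."""
    return -math.log2(p)


def entropy(probs) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * information_content(p) for p in probs if p > 0)


def gaussian_code_length(values, sigma: float, width: float) -> float:
    """Approximate bits to code `values` under a zero-mean Gaussian.

    Assumes each value is quantized to a bin of size `width` whose
    probability is roughly width * density(value), so the cost per value is
    -log2(width) + log2(sigma * sqrt(2 pi)) + v^2 / (2 sigma^2 ln 2).
    """
    constant = -math.log2(width) + math.log2(sigma * math.sqrt(2 * math.pi))
    return sum(constant + v * v / (2 * sigma**2 * math.log(2)) for v in values)


def description_length(weights, misfits, sigma_w=1.0, sigma_d=1.0, width=0.01):
    """Two-part code length: bits for the model plus bits for the misfits."""
    return (gaussian_code_length(weights, sigma_w, width)
            + gaussian_code_length(misfits, sigma_d, width))


# Hypothetical models fitted to the same four data points.
big = description_length([0.8, -1.2, 0.5, 2.0, -0.7, 1.5], [0.0, 0.0, 0.0, 0.0])
small = description_length([0.9, -1.1], [0.1, -0.2, 0.1, 0.0])

print(f"big model:   {big:.1f} bits")
print(f"small model: {small:.1f} bits")
```

Under these assumptions the smaller model has the shorter total description even though its misfits are not zero, because each extra weight costs bits to describe.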
- Hinton1993 Hinton, G. E., & van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. Proceedings of the Sixth Annual Conference on Computational Learning Theory - COLT 93, 5–13.
- Shannon’s Source Coding Theorem (Foundations of information theory: Part 3)
L Ma (2021). 'MDL and Neural Networks', Datumorphism, 02 April. Available at: https://datumorphism.leima.is/wiki/model-selection/mdl-and-neural-networks/.