# Model Selection

## A good model selection process selects a good model for us. What is a good model? How do we quantify it?

^{5} MDL and Neural Networks

Published: 2021-02-14

Category: { Model Selection }

Tags:

References:
- Hinton, G. E., & van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. Proceedings of the Sixth Annual Conference on Computational Learning Theory - COLT 93, 5–13.
- Shannon’s Source Coding Theorem (Foundations of information theory: Part 3)

Summary: Minimum Description Length (MDL) is a measure of how well a model compresses data: it minimizes the combined cost of describing the model and describing the misfit. MDL can be used to construct a concise network. A fully connected network has great expressive power, but it overfits easily.
One strategy is to apply constraints to the network:
- Limit the connections.
- Share weights within subgroups of the network.
- Constrain the weights using some probability distributions.
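The "model cost plus misfit cost" idea can be sketched on a toy problem. The example below is not the neural network scheme of Hinton and van Camp; it is a minimal illustration on polynomial regression, assuming Gaussian residuals and a cost of $\frac{1}{2}\log_2 n$ bits per parameter (a common approximation). The function name `description_length` and the synthetic data are hypothetical.

```python
import numpy as np


def description_length(x, y, degree):
    """Approximate two-part code length, in bits, of a polynomial fit.

    Assumptions (not from the original post): Gaussian residuals and a
    cost of 0.5 * log2(n) bits for each fitted coefficient. The misfit
    term is defined up to an additive constant, so it can be negative.
    """
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rss = float(np.sum(residuals ** 2))
    n_params = degree + 1
    model_bits = 0.5 * n_params * np.log2(n)   # cost of describing the model
    misfit_bits = 0.5 * n * np.log2(rss / n)   # cost of describing the misfit
    return float(model_bits + misfit_bits)


# Synthetic data from a quadratic generating process with Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(scale=0.1, size=x.size)

lengths = {d: description_length(x, y, d) for d in range(9)}
best_degree = min(lengths, key=lengths.get)
print(best_degree, lengths[best_degree])
```

A model that is too simple pays through the misfit term, while each extra coefficient pays through the model term, so the shortest total description balances the two.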

Pages: 5

^{4} Parsimony of Models

Published: 2020-11-08

References:
- Vandekerckhove, J., & Matzke, D. (2015). Model comparison and the principle of parsimony. Oxford Library of Psychology.

Summary: For models with a lot of parameters, the goodness-of-fit is very likely to be high. However, such models are also likely to generalize badly, so we need a measure of generalizability.
Here parsimony gives us a few advantages:
- easy to perceive
- better generalizations


^{3} Measures of Generalizability

Published: 2020-11-08

Category: { Model Selection }

Tags:

References:
- Vandekerckhove, J., & Matzke, D. (2015). Model comparison and the principle of parsimony. Oxford Library of Psychology.
- Roelofs, R. (2019). Measuring Generalization and Overfitting in Machine Learning. Doctoral Dissertation, UC Berkeley, 1–171.

Summary: To measure the generalization, we define a generalization error,
$$ \begin{align} \mathcal G = \mathcal L_{P}(\hat f) - \mathcal L_E(\hat f), \end{align} $$
where $\mathcal L_{P}$ is the population loss, $\mathcal L_E$ is the empirical loss, and $\hat f$ is the model obtained by minimizing the empirical loss.
However, we do not know the actual joint probability $p(x, y)$ of our dataset $\{x_i, y_i\}$, so the population loss is not known. In machine learning, we usually use cross validation, a method to estimate the risk …
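Since the population loss cannot be computed, one way to approximate $\mathcal G$ is to replace $\mathcal L_P$ with the loss on held-out folds. The following is a minimal sketch of that idea, assuming a polynomial model and mean squared error as the loss; the names `mse` and `kfold_gap` and the synthetic data are hypothetical.

```python
import numpy as np


def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


def kfold_gap(x, y, degree, k=5, seed=0):
    """Estimate G = L_P - L_E, using held-out folds as a stand-in for L_P."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    gaps = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit by minimizing the empirical (training) loss.
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        empirical = mse(coeffs, x[train_idx], y[train_idx])
        held_out = mse(coeffs, x[val_idx], y[val_idx])
        gaps.append(held_out - empirical)
    return float(np.mean(gaps))


# Synthetic data: a noisy sine curve (an assumption for illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=x.size)

for degree in (1, 3, 9):
    print(degree, kfold_gap(x, y, degree))
```

The estimate is itself noisy, especially for small samples, so it is best read as a rough indicator of the gap rather than its exact value.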


^{2} Goodness-of-fit

Published: 2020-11-08

Category: { Model Selection }

Tags:

References:
- Vandekerckhove, J., & Matzke, D. (2015). Model comparison and the principle of parsimony. Oxford Library of Psychology.

Summary: Does the data agree with the model?
- Calculate the distance between the data and the model predictions.
- Apply likelihood-based methods such as maximum likelihood estimation: compute the likelihood of observing the data if we assume the model; the result is a set of fitted parameters.
- …

Why don’t we always use goodness-of-fit as a measure of the goodness of a model?
We may experience overfitting, and the model may not be intuitive. This is why we would like to balance goodness-of-fit with parsimony using some measures of generalizability.
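Both views of goodness-of-fit can be written down in a few lines. The sketch below assumes a sum of squared errors as the distance and independent Gaussian noise with a known standard deviation `sigma` for the likelihood; the function names and the toy numbers are hypothetical.

```python
import numpy as np


def sum_squared_error(y, y_pred):
    """Distance between the data and the model predictions."""
    return float(np.sum((np.asarray(y) - np.asarray(y_pred)) ** 2))


def gaussian_log_likelihood(y, y_pred, sigma):
    """Log-likelihood of the data, assuming i.i.d. Gaussian noise of scale sigma."""
    n = len(y)
    sse = sum_squared_error(y, y_pred)
    return float(-0.5 * n * np.log(2.0 * np.pi * sigma ** 2) - sse / (2.0 * sigma ** 2))


y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])

sse = sum_squared_error(y, y_pred)
log_lik = gaussian_log_likelihood(y, y_pred, sigma=1.0)
print(sse, log_lik)
```

Under this Gaussian assumption the two measures agree on the ranking of models: a smaller squared distance always gives a larger likelihood, which is why least squares and maximum likelihood yield the same fitted parameters in that setting.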


^{1} Model Selection

Published: 2020-11-08

Category: { Model Selection }

Tags:

References:
- Collinearity and Parsimony from Linear Regression and Modeling on Coursera
- Vandekerckhove, J., & Matzke, D. (2015). Model comparison and the principle of parsimony. Oxford Library of Psychology.

Summary: Suppose we have a generating process that generates numbers according to some distribution. Based on a data sample, we can construct theoretical models to represent the actual generating process.
Figure: Which is a Good Model? The black curve represents the generating process. The red rectangle is a very simple model that captures some of the major samples. The blue step-wise model captures more of the sample data but needs more parameters.
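The rectangle and step-wise models can be mimicked with histograms of different resolution. This is a minimal sketch, assuming a Beta(2, 5) distribution as the generating process and add-one smoothing for empty bins; neither choice comes from the original post, and the helper names are hypothetical.

```python
import numpy as np


def fit_histogram(sample, n_bins, low, high):
    """Fit a step-wise density with n_bins parameters on [low, high]."""
    counts, edges = np.histogram(sample, bins=n_bins, range=(low, high))
    # Add-one smoothing so that empty bins keep a small nonzero density.
    probs = (counts + 1) / (counts.sum() + n_bins)
    density = probs / np.diff(edges)
    return density, edges


def mean_log_likelihood(sample, density, edges):
    """Average log-density that the step-wise model assigns to a sample."""
    idx = np.searchsorted(edges, sample, side="right") - 1
    idx = np.clip(idx, 0, len(density) - 1)
    return float(np.mean(np.log(density[idx])))


rng = np.random.default_rng(0)
train = rng.beta(2.0, 5.0, size=500)      # sample used to build the models
held_out = rng.beta(2.0, 5.0, size=1000)  # fresh sample from the same process

scores = {}
for n_bins in (1, 8, 64):
    density, edges = fit_histogram(train, n_bins, 0.0, 1.0)
    scores[n_bins] = (
        mean_log_likelihood(train, density, edges),
        mean_log_likelihood(held_out, density, edges),
    )
    print(n_bins, scores[n_bins])
```

The one-bin model plays the role of the rectangle and the multi-bin models play the role of the step-wise curve. Comparing the two scores of each model shows how well it fits the sample it was built from versus how well it describes new data from the same process.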
