Model Selection
A good model selection process selects a good model for us. What is a good model? How do we quantify it?
5 MDL and Neural Networks
Published:
Category: { Model Selection }
Tags:
References:
- Hinton, G. E., & van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. Proceedings of the Sixth Annual Conference on Computational Learning Theory - COLT 93, 5–13.
- Shannon’s Source Coding Theorem (Foundations of information theory: Part 3)
Summary: Minimum Description Length ( [[MDL]] Minimum Description Length MDL is a measure of how well a model compresses data by minimizing the combined cost of the description of the model and the misfit. ) can be used to construct a concise network. A fully connected network has great expressing power but it is easily overfitting.
One strategy is to apply constraints to the networks:
Limit the connections; Shared weights in subgroups of the network; Constrain the weights using some probability distributions. By minimizing the MDL of the network and the misfits on the data, we can build a concise network. Based on the [[Source Coding Theorem]] Coding Theory Concepts The code function produces code words.
Pages: 5
4 Parsimony of Models
Published:
References:
- Vandekerckhove, J., & Matzke, D. (2015). Model comparison and the principle of parsimony. Oxford Library of Psychology.
Summary: For models with a lot of parameters, the goodness-of-fit is very likely to be very high. However, it is also likely to generalize bad. So we need measure of generalizability
Here parsinomy gives us a few advantages.
easy to perceive better generalizations
Pages: 5
3 Measures of Generalizability
Published:
Category: { Model Selection }
Tags:
References:
- Vandekerckhove, J., & Matzke, D. (2015). Model comparison and the principle of parsimony. Oxford Library of Psychology.
- Roelofs, R. (2019). Measuring Generalization and Overfitting in Machine Learning. Doctoral Dissertation, UC Berkeley, 1–171.
Summary: To measure the generalization, we define a generalization error,
$$ \begin{align} \mathcal G = \mathcal L_{P}(\hat f) - \mathcal L_E(\hat f), \end{align} $$
where $\mathcal L_{P}$ is the population loss, $\mathcal L_E$ is the empirical loss, and $\hat f$ is our model by minimizing the empirical loss.
However, we do not know the actual joint probability $p(x, y)$ of our dataset $\{x_i, y_i\}$. Thus the population loss is not known. In machine learning, we usually use [[cross validation]] Cross Validation Cross validation is a method to estimate the [[risk]] The Learning Problem The learning problem posed by Vapnik:1 Given a sample: $\{z_i\}$ in the probability space $Z$; Assuming a probability measure on the probability space $Z$; Assuming a set of functions $Q(z, \alpha)$ (e.
Pages: 5
2 Goodness-of-fit
Published:
Category: { Model Selection }
Tags:
References:
- Vandekerckhove, J., & Matzke, D. (2015). Model comparison and the principle of parsimony. Oxford Library of Psychology.
Summary: Does the data agree with the model?
Calculate the distance between data and model predictions. Apply Bayesian methods such as likelihood estimation: likelihood of observing the data if we assume the model; the results will be a set of fitting parameters. … Why don’t we always use goodness-of-fit as a measure of the goodness of a model?
We may experience overfitting. The model may not be intuitive. This is why we would like to balance it with parsimony using some measures of generalizability.
K-means and overfitting
The overfitting problem is easily demonstrated using the K-means model.
Suppose we use $k=1$, i.
Pages: 5
1 Model Selection
Published:
Category: { Model Selection }
Tags:
References:
- Collinearity and Parsimony from Linear Regression and Modeling on Coursear
- Vandekerckhove, J., & Matzke, D. (2015). Model comparison and the principle of parsimony. Oxford Library of Psychology.
Summary: Suppose we have a generating process that generates some numbers based on a distribution. Based on a data sample, we could reconstruct some sort of theoretical models to represent the actual generating process.
Which is a Good Model? (1)The black curve represent the generating process. The red rectangle is a very simple model that captures some major samples. The blue step-wise model is capturing more sample data but with more parameters.
In the above example, the red model on the left is not that good in most cases while the blue model seems to be better. In reality, the choice depends on the usage of the model.
Pages: 5