# Learning Theory

## Root of learning

^{5} Information Bottleneck

Published: 2022-04-30

Category: { Learning Theory }

Tags:

Summary: Information Bottleneck In a [[induction-deduction framework]] Induction, Deduction, and Transduction , for a given training dataset
$$ \{X, Y\}, $$
a prediction Markov chain1
$$ X \to \hat X \to Y, $$
where $\hat X$ is supposed to be the minimal sufficient statistics of $X$. $\hat X$ is the minimal data that can still represent the relation between $X$ and $Y$, i.e., $I(X;Y)$, the [[mutual information]] Mutual Information Mutual information is defined as $$ I(X;Y) = \mathbb E_{p_{XY}} \ln \frac{P_{XY}}{P_X P_Y}. $$ In the case that $X$ and $Y$ are independent variables, we have $P_{XY} = P_X P_Y$, thus $I(X;Y) = 0$.

Pages: 3

^{3} VC Dimension

Published: 2021-02-21

Category: { Learning Theory }

Tags:

References:
- Vapnik, V. N. (2000). The Nature of Statistical Learning Theory. Springer New York.
- Deckert D-A. Advanced Topics in Machine Learning. In: Advanced Topics in Machine Learning [Internet]. Apr 2017 [cited 17 Oct 2021]. Available: https://www.mathematik.uni-muenchen.de/~deckert/teaching/SS17/ATML/
- Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. arXiv [cs.LG]. 2016. Available: http://arxiv.org/abs/1611.03530
- Bernstein J. Machine learning is just statistics + quantifier reversal. In: jeremybernste [Internet]. [cited 1 Nov 2021]. Available: https://jeremybernste.in/writing/ml-is-just-statistics
- Guedj B. A Primer on PAC-Bayesian Learning. arXiv [stat.ML]. 2019. Available: http://arxiv.org/abs/1901.05353
- Abu-Mostafa, Yaser S and Magdon-Ismail, Malik and Lin, Hsuan-Tien. Learning from Data. AMLBook; 2012. Available: https://www.semanticscholar.org/paper/Learning-From-Data-Abu-Mostafa-Magdon-Ismail/1c0ed9ed3201ef381cc392fc3ca91cae6ecfc698

Summary: Two of the key elements in a learning problem are:
a set of hypothesis $\mathcal H$, and a set of data samples $\mathcal S$. $\mathcal H$
Inside $\mathcal H$, we have a lot of hypotheses, for example, $\mathcal h$. Given some input, e.g., $x_1$ and $x_2$, we can produce some outputs, e.g., $h(x_1)$ and $h(x_2)$. $\mathcal S$
A sample $\mathcal S$ is a fair sample drawn from all the possible inputs $\mathcal X$, where $\mathcal X$ is called the input space. A dataset $\mathcal S$ can be used as a probe of the hypothesis set $\mathcal H$. $\mathcal S$ can be used to probe the learning capacity of $\mathcal H$.

Pages: 3

^{1} Feasibility of Learning

Published: 2021-10-17

Category: { Learning Theory }

Tags:

References:
- Abu-Mostafa, Yaser S and Magdon-Ismail, Malik and Lin, Hsuan-Tien. Learning from Data. AMLBook; 2012. Available: https://www.semanticscholar.org/paper/Learning-From-Data-Abu-Mostafa-Magdon-Ismail/1c0ed9ed3201ef381cc392fc3ca91cae6ecfc698
- Bernstein J. Machine learning is just statistics + quantifier reversal. In: jeremybernste [Internet]. [cited 1 Nov 2021]. Available: https://jeremybernste.in/writing/ml-is-just-statistics
- Guedj B. A Primer on PAC-Bayesian Learning. arXiv [stat.ML]. 2019. Available: http://arxiv.org/abs/1901.05353
- Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. arXiv [cs.LG]. 2016. Available: http://arxiv.org/abs/1611.03530

Summary: Why is learning from data even possible? To discuss this problem, we need a framework for learning. Operationally, we can think of learning as the following framework1.
Abu-Mostafa2012
Naive View Naively speaking, a model should have two key properties,
enough capacity to hold the necessary information embedded in the data, and a method to find the combination of parameters so that the model can generate/complete new data. Most neural networks have enough capacity to hold the necessary information in the data2. The problem is, the capacity is so large. Why does backprop even work? How did backprop find a suitable set of parameters that can generalize?

Pages: 3