# 5Information Bottleneck

Published:
Category: { Learning Theory }
Summary: Information Bottleneck In a [[induction-deduction framework]] Induction, Deduction, and Transduction , for a given training dataset $$\{X, Y\},$$ a prediction Markov chain1 $$X \to \hat X \to Y,$$ where $\hat X$ is supposed to be the minimal sufficient statistics of $X$. $\hat X$ is the minimal data that can still represent the relation between $X$ and $Y$, i.e., $I(X;Y)$, the [[mutual information]] Mutual Information Mutual information is defined as $$I(X;Y) = \mathbb E_{p_{XY}} \ln \frac{P_{XY}}{P_X P_Y}.$$ In the case that $X$ and $Y$ are independent variables, we have $P_{XY} = P_X P_Y$, thus $I(X;Y) = 0$.
Pages: 3

# 3VC Dimension

Published:
Category: { Learning Theory }
Tags:
Summary: Two of the key elements in a learning problem are: a set of hypothesis $\mathcal H$, and a set of data samples $\mathcal S$. $\mathcal H$ Inside $\mathcal H$, we have a lot of hypotheses, for example, $\mathcal h$. Given some input, e.g., $x_1$ and $x_2$, we can produce some outputs, e.g., $h(x_1)$ and $h(x_2)$. $\mathcal S$ A sample $\mathcal S$ is a fair sample drawn from all the possible inputs $\mathcal X$, where $\mathcal X$ is called the input space. A dataset $\mathcal S$ can be used as a probe of the hypothesis set $\mathcal H$. $\mathcal S$ can be used to probe the learning capacity of $\mathcal H$.
Pages: 3

# 1Feasibility of Learning

Published:
Category: { Learning Theory }
Tags:
Summary: Why is learning from data even possible? To discuss this problem, we need a framework for learning. Operationally, we can think of learning as the following framework1. Abu-Mostafa2012 Naive View Naively speaking, a model should have two key properties, enough capacity to hold the necessary information embedded in the data, and a method to find the combination of parameters so that the model can generate/complete new data. Most neural networks have enough capacity to hold the necessary information in the data2. The problem is, the capacity is so large. Why does backprop even work? How did backprop find a suitable set of parameters that can generalize?
Pages: 3