Information Bottleneck

#Learning Theory #Basics #Information Bottleneck

In the [[induction-deduction framework]], for a given training dataset

$$ \{X, Y\}, $$

we consider a prediction Markov chain¹

$$ X \to \hat X \to Y, $$

where $\hat X$ is supposed to be the minimal sufficient statistic of $X$. $\hat X$ is the minimal data that can still represent the relation between $X$ and $Y$, i.e., $I(X;Y)$, the [[mutual information]] between $X$ and $Y$.

Mutual information is defined as

$$ I(X;Y) = \mathbb E_{P_{XY}} \ln \frac{P_{XY}}{P_X P_Y}. $$

If $X$ and $Y$ are independent, then $P_{XY} = P_X P_Y$ and thus $I(X;Y) = 0$: there is no “mutual” information between two independent variables. Mutual information is also closely related to entropy through the decomposition

$$ I(X;Y) = H(X) - H(X\mid Y). $$

There are competing effects in this framework:

On one hand, as an induction process, the less complex the representation, the better, i.e., we want a smaller $R\equiv I(X;\hat X)$.
On the other hand, if we go to the extreme and come up with a $\hat X$ that is too simple, we reach a very small $R$ but lose effectiveness in the deduction process: we cannot make good predictions. The deduction process requires the “preserved relevant information”, $I_Y\equiv I(\hat X;Y)$, to be large.
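Both $R$ and $I_Y$ are mutual informations, so it may help to see how such a quantity is computed for discrete variables. The following is a minimal sketch, assuming the joint distribution is given as a NumPy probability table; the function name `mutual_information` and the two toy tables are illustrative and not part of the original note.

```python
import numpy as np


def mutual_information(joint):
    """Return I(A;B) in nats for a joint probability table joint[a, b]."""
    joint = np.asarray(joint, dtype=float)
    p_a = joint.sum(axis=1, keepdims=True)  # marginal of A, shape (n_a, 1)
    p_b = joint.sum(axis=0, keepdims=True)  # marginal of B, shape (1, n_b)
    product = p_a @ p_b                     # table of P_A * P_B
    mask = joint > 0                        # terms with zero probability contribute 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))


# Toy case 1 (assumed): two independent binary variables.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
# Toy case 2 (assumed): B is an exact copy of a fair coin A.
copy = np.diag([0.5, 0.5])

mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
print(mi_independent, mi_copy)
```

For the independent table the result is zero up to floating-point error, and for the exact copy it equals the entropy of a fair coin, $\ln 2$.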

An optimal representation $\hat X$ should minimize the following Lagrangian¹

$$ \begin{align} \mathcal L &= R - \beta I_Y \\ &= I(X;\hat X) - \beta I(\hat X;Y), \end{align} $$

where $\beta$ is the Lagrange multiplier.
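To make the trade-off concrete, the Lagrangian can be evaluated for a few candidate representations on a toy discrete problem. This is only a sketch under stated assumptions: the joint table `p_xy`, the three encoders $p(\hat x\mid x)$, the value `beta = 5.0`, and the helper names `mutual_information` and `ib_lagrangian` are all made up for illustration and do not come from the reference.

```python
import numpy as np


def mutual_information(joint):
    """Return I(A;B) in nats for a joint probability table joint[a, b]."""
    joint = np.asarray(joint, dtype=float)
    p_a = joint.sum(axis=1, keepdims=True)
    p_b = joint.sum(axis=0, keepdims=True)
    product = p_a @ p_b
    mask = joint > 0  # skip zero-probability cells
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))


def ib_lagrangian(p_xy, encoder, beta):
    """Return (L, R, I_Y) for an encoder given as a table encoder[x, x_hat] = p(x_hat | x)."""
    p_xy = np.asarray(p_xy, dtype=float)
    encoder = np.asarray(encoder, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_x_xhat = p_x[:, None] * encoder  # joint of (X, X_hat)
    p_xhat_y = encoder.T @ p_xy        # joint of (X_hat, Y); X_hat sees Y only through X
    rate = mutual_information(p_x_xhat)       # R = I(X; X_hat)
    relevance = mutual_information(p_xhat_y)  # I_Y = I(X_hat; Y)
    return rate - beta * relevance, rate, relevance


# Assumed toy data: X takes 4 values, Y is a noisy function of X // 2.
p_xy = np.array([
    [0.225, 0.025],
    [0.225, 0.025],
    [0.025, 0.225],
    [0.025, 0.225],
])

encoders = {
    "identity": np.eye(4),                                           # X_hat = X
    "merged": np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float),     # X_hat = X // 2
    "constant": np.ones((4, 1)),                                     # X_hat ignores X
}

beta = 5.0
results = {name: ib_lagrangian(p_xy, enc, beta) for name, enc in encoders.items()}
for name, (lagrangian, rate, relevance) in results.items():
    print(f"{name:8s}  R={rate:.3f}  I_Y={relevance:.3f}  L={lagrangian:.3f}")
```

In this toy setup the merged encoder keeps the same $I_Y$ as the identity encoder while having a smaller $R$, so it has the lowest Lagrangian of the three; the constant encoder reaches $R=0$ but also $I_Y=0$.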

To see that this is a Lagrangian, note that it is equivalent, up to a term that does not depend on $\hat X$, to¹

$$ \tilde{\mathcal L} = I(X;\hat X) + \beta I(X;Y\vert \hat X) $$

That is, we are looking for a $\hat X$ that minimizes the mutual information between $X$ and $\hat X$, $I(X;\hat X)$, under the constraint $I(X;Y\vert \hat X)=0$, where $I(X;Y\vert\hat X)$ is the mutual information between $X$ and $Y$ conditioned on $\hat X$. The constraint says that, once $\hat X$ is known, $X$ carries no further information about $Y$. $\beta$ is then the Lagrange multiplier of this constraint (see this chart).
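The equivalence of the two forms can be sketched with the chain rule for mutual information. The sketch assumes that $\hat X$ depends on $Y$ only through $X$, i.e., $I(\hat X;Y\vert X)=0$; this assumption is not spelled out in the note itself. Expanding $I(X,\hat X;Y)$ in two ways gives

$$ \begin{align} I(X,\hat X;Y) &= I(X;Y) + I(\hat X;Y\vert X) = I(X;Y), \\ I(X,\hat X;Y) &= I(\hat X;Y) + I(X;Y\vert \hat X), \end{align} $$

so that $I(\hat X;Y) = I(X;Y) - I(X;Y\vert\hat X)$. Substituting this into the Lagrangian,

$$ \begin{align} \mathcal L &= I(X;\hat X) - \beta I(\hat X;Y) \\ &= I(X;\hat X) + \beta I(X;Y\vert \hat X) - \beta I(X;Y). \end{align} $$

The last term is fixed by the data and does not depend on $\hat X$, so the two forms differ only by a constant and have the same minimizers.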

¹ Tishby N, Zaslavsky N. Deep Learning and the Information Bottleneck Principle. arXiv [cs.LG]. 2015. Available: http://arxiv.org/abs/1503.02406

Planted: 2022-04-30 by L Ma;
