## Information Bottleneck

In a , for a given training dataset

$$\{X, Y\},$$

a prediction Markov chain1

$$X \to \hat X \to Y,$$

where $\hat X$ is supposed to be the minimal sufficient statistics of $X$. $\hat X$ is the minimal data that can still represent the relation between $X$ and $Y$, i.e., $I(X;Y)$, the between $X$ and $Y$. There are competing effects in this framework:

1. On one hand, as an induction process, the less complexity of the representation the better, i.e., smaller $R\equiv I(X;\hat X)$.
2. However, if we are too extreme and come up with a $\hat X$ that is too simple, we reach a very small $R$ but we lose the effectiveness in the deduction process. We can not make good predictions. The deduction process requires the “preserved relevant information”, $I_Y\equiv hat X;Y$ to be large.

An optimal representation $\hat X$ should minimize the following Lagrangian1

\begin{align} \mathcal L &= R - \beta I_Y \\ &= I(X;\hat X) - \beta I(\hat X;Y), \end{align}

where $\beta$ is Lagrange multiplier.

To see that this is an Lagrangian, this Lagrangian is equivalent to 1

$$\tilde{\mathcal L} = I(X;\hat X) + \beta I(X;Y\vert \hat X)$$

That is we are looking for a $\hat X$ that minimizes the mutual information between $X$ and $\hat X$, $I(X;\hat X)$, but under the constraint $I(X;Y\vert \hat X)=0$, where $I(X;Y\vert\hat X)$ is the mutual information between $X$ and $Y$ but conditioned on $\hat X$. Then $\beta$ is our Lagrange multiplier (see this chart).

Planted: by ;