Latent variables introduce extra correlations between the nodes in a network. Introducing hidden units also lets us remove the direct connections between the visible nodes of a Boltzmann machine, yielding a restricted Boltzmann machine (RBM). An RBM requires less computation while retaining considerable expressive power.

## The Ising Model

Given an Ising-like energy function (cf. )

$$E(x) = - \sum_i h_i x_i - \frac{1}{2} \sum_{ij} J_{ij} x_i x_j,$$

The term $\sum_{ij} J_{ij} x_i x_j$ can be decomposed using the Cholesky decomposition $J = W W^T$ (assuming $J$ is positive definite),

$$\sum_{ij} J_{ij} x_i x_j = \sum_{ijk} W_{ik} (W^T)_{kj} x_i x_j = \sum_{ijk} W_{ik} W_{jk} x_i x_j.$$
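As a sanity check, the decomposition can be carried out numerically; the coupling matrix below is an arbitrary positive-definite toy example:

```python
import numpy as np

# Hypothetical 3-node coupling matrix J, chosen symmetric positive definite.
# (Cholesky requires positive definiteness; for a general symmetric J one can
# shift the diagonal, which for binary units only shifts biases and constants.)
J = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.5, 0.4],
              [0.3, 0.4, 1.0]])

W = np.linalg.cholesky(J)            # lower-triangular W with J = W @ W.T
assert np.allclose(W @ W.T, J)

# The quadratic term is unchanged by the decomposition:
# sum_ij J_ij x_i x_j == sum_k (sum_i W_ik x_i)^2.
x = np.array([1.0, 0.0, 1.0])
direct = x @ J @ x
decomposed = np.sum((W.T @ x) ** 2)
assert np.allclose(direct, decomposed)
```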

The energy function can be rewritten as

$$E(x) = - \sum_i h_i x_i - \frac{1}{2} \sum_{ijk} W_{ik} W_{jk} x_i x_j.$$

The distribution (derived using ) is

$$p(x) \propto \exp{ \left( \sum_i h_i x_i + \frac{1}{2} \sum_{ijk} W_{ik} W_{jk} x_i x_j \right)} \label{eqn-boltzmann-dist-hopfield}$$

## Latent Variables: From Energy Function to Distribution

Using the Gaussian integral identity, we transform equation $\eqref{eqn-boltzmann-dist-hopfield}$ into (note that the auxiliary variables $h_k$ introduced below are distinct from the biases $h_i$, despite sharing the symbol $h$)

\begin{align} p(x) \propto & \exp{\left( \sum_i h_i x_i \right)} \exp{ \left(\frac{1}{2} \sum_{ijk} W_{ik} W_{jk} x_i x_j \right)} \\ \propto & \exp{\left( \sum_i h_i x_i \right)} \int \prod_k \mathrm dh_k\, \exp{ \left( \sum_k \left( -\frac{1}{2} h_k^2 + \sum_i x_i W_{ik} h_k \right) \right)} \\ \propto & \int \prod_k \mathrm dh_k\, \exp{ \left[ \sum_i h_i x_i + \sum_k \left( -\frac{1}{2} h_k^2 + \sum_i x_i W_{ik} h_k \right) \right]}, \end{align}

where the constant prefactors $(2\pi)^{-1/2}$ have been absorbed into the proportionality.
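The step above relies on the one-dimensional Gaussian identity $e^{t^2/2} = (2\pi)^{-1/2} \int \mathrm dh\, e^{-h^2/2 + th}$, applied once per $k$ with $t_k = \sum_i x_i W_{ik}$. It can be checked numerically (the value of $t$ below is arbitrary):

```python
import numpy as np

# Numerical check of exp(t^2/2) = (2*pi)^(-1/2) * integral dh exp(-h^2/2 + t*h).
t = 0.7
h = np.linspace(-12.0, 12.0, 200001)   # wide grid; the integrand decays fast
dh = h[1] - h[0]
integral = np.sum(np.exp(-0.5 * h**2 + t * h)) * dh / np.sqrt(2 * np.pi)
assert abs(integral - np.exp(0.5 * t**2)) < 1e-5
```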

Define a new energy function

$$\tilde E(x,h) = - \sum_i h_i x_i - \sum_k \left( -\frac{1}{2} h_k^2 + \sum_i x_i W_{ik} h_k \right) .$$

The distribution becomes the marginalization of the hidden variable

$$p(x) \propto \int \mathrm dh\, \exp{ \left( -\tilde E(x,h) \right) }. \label{eqn-marginalization-new-energy-function}$$

Using the result $\eqref{eqn-marginalization-new-energy-function}$, we can define a model with two groups of nodes, $x$ and $h$. The group $x$ has no internal connections, i.e., no intra-group interaction terms such as $x_i x_j$. The nodes $x$ form the visible layer, while the nodes $h$ form the hidden layer. In general, we require an energy of the form

$$\tilde E(x, h) = -\sum_i a_i(x_i) - \sum_k b_k(h_k) - \sum_{ik} x_i W_{ik} h_k.$$

Usually, we use the form

$$\tilde E(x, h) = -\sum_i a_i x_i - \sum_k b_k h_k - \sum_{ik} x_i W_{ik} h_k, \label{eqn-energy-function-with-latent-variables}$$

where the nodes can only take the values 0 or 1. Such visible and hidden layers are called Bernoulli layers since their states are binary.
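This Bernoulli energy can be written down directly; a minimal sketch with arbitrary toy parameters (the names $a$, $b$, $W$ follow the text):

```python
import numpy as np

# Bernoulli RBM energy: E(x, h) = - a.x - b.h - x.W.h.
# The parameter values are arbitrary toy numbers for illustration.
def energy(x, h, a, b, W):
    return -(a @ x) - (b @ h) - (x @ W @ h)

rng = np.random.default_rng(0)
a, b = rng.normal(size=3), rng.normal(size=2)
W = rng.normal(size=(3, 2))
x = np.array([1.0, 0.0, 1.0])   # binary visible state
h = np.array([0.0, 1.0])        # binary hidden state

E = energy(x, h, a, b, W)
# Lower energy corresponds to a larger unnormalized probability exp(-E).
```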

## Latent Variables: From Distribution to Energy Function

We can also go in the other direction and derive the marginalized energy function for a model whose joint energy has the form $\eqref{eqn-energy-function-with-latent-variables}$. For easier understanding, we define a joint distribution $p(x, h)$ by

$$p(x, h) = \frac{\exp{\left( -\tilde E(x, h) \right)}}{Z},$$

so that

$$p(x) = \int \mathrm d h\, p(x, h) = \frac{ \exp{\left( -E(x) \right)} }{Z}.$$

Using the above two definitions, we have

$$\exp{\left(-E(x)\right)} = \int \mathrm d h \, \exp{\left( -\tilde E(x, h) \right)},$$

from which we derive a formalism for the marginalized energy function

\begin{align} E(x) =& - \ln \int \mathrm d h \, \exp{\left( -\tilde E(x, h) \right)} \\ =& -\sum_i a_i(x_i) - \sum_k \ln \int \mathrm d h_k \, \exp{\left( b_k(h_k) \right)} \exp{\left( \sum_j x_j W_{jk}h_k \right)}. \label{eqn-e-x} \end{align}
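For Bernoulli hidden units, $b_k(h_k) = b_k h_k$ with $h_k \in \{0, 1\}$, the integral over each $h_k$ reduces to a two-term sum, giving the closed form $E(x) = -\sum_i a_i x_i - \sum_k \ln\left(1 + e^{b_k + \sum_i x_i W_{ik}}\right)$. A minimal numerical sketch (with arbitrary toy parameters) verifying this against brute-force marginalization:

```python
import numpy as np
from itertools import product

# Closed-form marginal energy for Bernoulli hidden units.
def marginal_energy(x, a, b, W):
    return -(a @ x) - np.sum(np.log1p(np.exp(b + x @ W)))

# Brute force: E(x) = -ln sum_h exp(-E(x, h)) over all binary h.
def brute_force_energy(x, a, b, W):
    total = 0.0
    for h in product([0.0, 1.0], repeat=len(b)):
        h = np.array(h)
        total += np.exp((a @ x) + (b @ h) + (x @ W @ h))
    return -np.log(total)

rng = np.random.default_rng(1)
a, b, W = rng.normal(size=4), rng.normal(size=3), rng.normal(size=(4, 3))
x = np.array([1.0, 0.0, 1.0, 1.0])
assert np.allclose(marginal_energy(x, a, b, W), brute_force_energy(x, a, b, W))
```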

This energy function can be expanded in cumulants. Decomposing $E(x)$, equation $\eqref{eqn-e-x}$, gives

$$E(x) = -\sum_i a_i(x_i) - \sum_i \sum_k \kappa^{(1)}_k W_{ik} x_i - \frac{1}{2}\sum_{ij} \left( \sum_{k} \kappa^{(2)}_k W_{ik} W_{jk} \right) x_i x_j + \cdots$$

where the coefficients $\kappa^{(n)}_k$ are cumulants of the hidden-unit distributions. Correlations of every order between the visible nodes thus appear, even though there are no direct in-group connections between them.

### Expansion of the Energy Function

The integral in our energy function,

$$\sum_k \ln \int \mathrm d h_k \, \exp{\left( b_k(h_k) \right)} \exp{\left( \sum_j x_j W_{jk}h_k \right)}$$

is similar to the cumulant generating function

$$\ln \int \mathrm d h\, q(h) e^{ht} = K(t) = \sum_{n=1}^\infty \kappa^{(n)} \frac{t^n}{n!},$$

with $\kappa^{(n)} = \partial^n_t K(t) \big\rvert_{t=0}$; this is a Taylor expansion of $K$ around $t=0$.

To perform the expansion, we define normalized distributions $q_k(h_k) = e^{b_k(h_k)}/Z_k$ with $Z_k = \int \mathrm d h_k\, e^{b_k(h_k)}$, so that

$$\sum_k \ln \int \mathrm d h_k\, \exp{\left( b_k(h_k) \right)} \exp{\left( \sum_j x_j W_{jk}h_k \right)} = \sum_k \ln \int \mathrm d h_k\, Z_k q_k(h_k) e^{t_k h_k },$$

with $t_k=\sum_j x_j W_{jk}$.
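As a concrete sketch, take a binary unit with $q(h) = e^{b h}/Z$ and $h \in \{0,1\}$. Then $K(t) = \ln\left[(1+e^{b+t})/(1+e^{b})\right]$, whose first two cumulants are the mean $p$ and the variance $p(1-p)$ with $p$ the logistic function of $b$; the value of $b$ below is an arbitrary toy choice:

```python
import numpy as np

# Cumulant generating function of a binary unit q(h) = exp(b*h)/Z, h in {0, 1}:
#   K(t) = ln E_q[exp(t*h)] = ln((1 + exp(b + t)) / (1 + exp(b))).
b = 0.4
K = lambda t: np.log((1 + np.exp(b + t)) / (1 + np.exp(b)))
p = 1 / (1 + np.exp(-b))     # mean of q, i.e. sigmoid(b)

# First two cumulants via central finite differences of K at t = 0.
eps = 1e-4
kappa1 = (K(eps) - K(-eps)) / (2 * eps)              # ~ mean p
kappa2 = (K(eps) - 2 * K(0.0) + K(-eps)) / eps**2    # ~ variance p(1 - p)
assert abs(kappa1 - p) < 1e-6
assert abs(kappa2 - p * (1 - p)) < 1e-5
```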

In this section, we focus on Bernoulli layers, i.e., $a_i(x_i) = a^0_i x_i$ and $b_k(h_k) = b^0_k h_k$.

Using the results from , we can find the gradient of the log-likelihood $\mathcal L(W_{ik}, a_i, b_k)$,

\begin{align} \partial_{W_{ik}} \mathcal L =& \langle x_i h_k \rangle_{\text{data}} - \langle x_i h_k \rangle_{\text{model}} \\ \partial_{a^0_i} \mathcal L =& \langle x_i \rangle_{\text{data}} - \langle x_i \rangle_{\text{model}} \\ \partial_{b^0_k} \mathcal L =& \langle h_k \rangle_{\text{data}} - \langle h_k \rangle_{\text{model}}. \end{align}

The calculation of $\langle \cdot \rangle_{\text{model}}$ is easier in an RBM.

In an RBM, the visible units do not depend on each other, as there are no direct connections between them; the same holds for the hidden units. As a result, the conditional distributions $p(x\mid h)$ and $p(h\mid x)$ factorize over their units.
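Concretely, for the Bernoulli energy $\eqref{eqn-energy-function-with-latent-variables}$ the factorized conditional is $p(h_k = 1 \mid x) = \sigma\!\left(b_k + \sum_i x_i W_{ik}\right)$, with $\sigma$ the logistic function. A minimal sketch with arbitrary toy parameters, checking this against brute-force conditioning on the joint:

```python
import numpy as np
from itertools import product

sigmoid = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(2)
a, b, W = rng.normal(size=3), rng.normal(size=2), rng.normal(size=(3, 2))
x = np.array([1.0, 1.0, 0.0])

# Factorized conditional p(h_k = 1 | x) for each hidden unit.
p_h_factorized = sigmoid(b + x @ W)

# Brute force: unnormalized joint weights exp(-E(x, h)) for every binary h,
# then the conditional marginal of each h_k = 1.
hs = np.array(list(product([0.0, 1.0], repeat=2)))
weights = np.exp((a @ x) + hs @ b + hs @ (x @ W))
p_h_brute = (hs * weights[:, None]).sum(axis=0) / weights.sum()
assert np.allclose(p_h_factorized, p_h_brute)
```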

In this simple situation, the distributions can be sampled using Gibbs sampling. We clamp the visible units to the data values and sample $h$ from $p(h\mid x)$; the sampled $h$ is then used to sample $x$ from $p(x\mid h)$. Iterating this process, we can sample from the model, i.e.,

\begin{align*} & p(h\mid x_0) \to h_0 \\ \Rightarrow & p(x\mid h_0) \to x_1 \\ \Rightarrow & p(h\mid x_1) \to h_1 \\ \Rightarrow & \cdots \end{align*}
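The iteration above can be sketched as a block Gibbs sampler for a Bernoulli RBM; the parameters below are arbitrary toy values standing in for trained ones:

```python
import numpy as np

# Block Gibbs sampling for a Bernoulli RBM: alternate h ~ p(h|x) and
# x ~ p(x|h), starting from a data vector x0. Toy parameters, not trained.
sigmoid = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(3)
a, b, W = rng.normal(size=4), rng.normal(size=3), rng.normal(size=(4, 3))

x = np.array([1.0, 0.0, 1.0, 0.0])   # x0: visible units clamped to a data value
for step in range(100):
    h = (rng.random(3) < sigmoid(b + x @ W)).astype(float)   # h_t ~ p(h | x_t)
    x = (rng.random(4) < sigmoid(a + W @ h)).astype(float)   # x_{t+1} ~ p(x | h_t)

# After enough iterations, (x, h) approximates a sample from the model,
# which is how the model averages in the gradients can be estimated.
```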
