Latent Variable Models

#latent variable model #variational autoencoder #normalizing flow

From a statistical point of view, we know everything about a physical system if we know the probability $p(\mathbf s)$ of every possible state $\mathbf s$ of the system. Time can also be part of the state specification.

As an example, we will classify fruits into oranges and non-oranges. The state vector is $\mathbf s = (\text{is orange}, \text{texture } x)$. Our goal is to find the joint probability $p(\text{is orange}, x)$.

In reality, we only have sample data. The sample usually cannot cover all possible states of the system, so a direct calculation of the joint probability $p(\mathbf s)$ is not feasible. We have neither enough information nor enough computing power.

A formal description is to set the generating process on a measure space $(Y, \Sigma_Y)$ with measure $\phi$. We will need to define this space using a sample drawn from it.

Luckily, our world follows some simple rules. The data is very likely to be generated by a simple generating process. So we can estimate the generating process behind the sample data.

Introduce a Gaussian mixture model to describe the observed data points,
$$ g(y) = \pi \phi(\mu_1, \sigma_1) + (1 - \pi) \phi(\mu_2, \sigma_2), $$
with $\phi$ being the normal distribution density. Given the data and $g$ as the model, we can write down the likelihood of the parameters, $L(\pi, \mu_1, \sigma_1, \mu_2, \sigma_2\mid y) = \prod_i g(y_i)$.
To find the parameters, we could maximize the log-likelihood, in principle. However, this is not easy because each term is the logarithm of a sum,
$$ \sum_{\text{all data points}}\log\left(\pi \phi(\mu_1, \sigma_1) + (1 - \pi) \phi(\mu_2, \sigma_2) \right). $$
The trick is to introduce a latent variable that is not part of the model but helps us solve the problem. One latent variable we can use is a discrete variable that tells us which Gaussian component a data point is associated with, denoted $\Delta_i\in \{0,1\}$ for data point $y_i$. It is not known from the data. With the latent variable, the log-likelihood simplifies and we can apply the Expectation-Maximization (EM) method. EM tackles the problem iteratively: it alternates between estimating the expected value of $\Delta_i$ under the current parameters and updating the parameters given those expectations.
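The iteration described above can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function names (`normal_pdf`, `em_two_gaussians`), the quartile-based initialization, the fixed number of iterations, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def log_likelihood(y, pi, mu1, sigma1, mu2, sigma2):
    """Sum over all data points of log g(y_i)."""
    g = pi * normal_pdf(y, mu1, sigma1) + (1.0 - pi) * normal_pdf(y, mu2, sigma2)
    return float(np.sum(np.log(g)))

def em_two_gaussians(y, n_iter=200):
    """EM for g(y) = pi * phi(mu1, sigma1) + (1 - pi) * phi(mu2, sigma2)."""
    # Crude initialization (an assumption of this sketch).
    pi = 0.5
    mu1, mu2 = np.quantile(y, [0.25, 0.75])
    sigma1 = sigma2 = np.std(y)
    history = []
    for _ in range(n_iter):
        # E-step: gamma_i = E[Delta_i | y_i], with Delta_i = 1 meaning
        # that y_i was generated by component 1.
        w1 = pi * normal_pdf(y, mu1, sigma1)
        w2 = (1.0 - pi) * normal_pdf(y, mu2, sigma2)
        gamma = w1 / (w1 + w2)
        # M-step: responsibility-weighted maximum likelihood updates.
        pi = gamma.mean()
        mu1 = np.sum(gamma * y) / np.sum(gamma)
        mu2 = np.sum((1.0 - gamma) * y) / np.sum(1.0 - gamma)
        sigma1 = np.sqrt(np.sum(gamma * (y - mu1) ** 2) / np.sum(gamma))
        sigma2 = np.sqrt(np.sum((1.0 - gamma) * (y - mu2) ** 2) / np.sum(1.0 - gamma))
        history.append(log_likelihood(y, pi, mu1, sigma1, mu2, sigma2))
    return (pi, mu1, sigma1, mu2, sigma2), history

# Synthetic data: 30% from N(-2, 0.5^2), 70% from N(3, 1^2).
rng = np.random.default_rng(0)
delta = rng.random(2000) < 0.3
y = np.where(delta, rng.normal(-2.0, 0.5, 2000), rng.normal(3.0, 1.0, 2000))

params, history = em_two_gaussians(y)
print("pi, mu1, sigma1, mu2, sigma2 =", params)
```

The recorded `history` of log-likelihood values should not decrease from one iteration to the next, which is the property that makes EM attractive here even though the log-likelihood itself is awkward to maximize directly.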

With this latent variable $\mathbf z$, we have the marginalized probability

$$ p(\mathbf s) = \int p( \mathbf s \mid \mathbf z ) p(\mathbf z) d\mathbf z. $$

This is a mixture model with infinitely many components (in Hilbert space).
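To make the marginalization concrete, here is a small numerical sketch. It assumes a toy model that is not part of the text above: a scalar latent variable with $p(z) = \mathcal N(0, 1)$ and $p(s\mid z) = \mathcal N(z, \sigma^2)$, for which the marginal is known in closed form, $p(s) = \mathcal N(0, 1 + \sigma^2)$. The integral is approximated with a Riemann sum over a grid of $z$ values, so each grid point plays the role of one mixture component.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def marginal_density(s, sigma=0.5, n_grid=4001, z_max=10.0):
    """Approximate p(s) = int p(s | z) p(z) dz on a uniform grid in z."""
    z = np.linspace(-z_max, z_max, n_grid)
    dz = z[1] - z[0]
    # Each grid point z contributes a component p(s | z) with weight p(z) dz.
    integrand = normal_pdf(s, z, sigma) * normal_pdf(z, 0.0, 1.0)
    return float(np.sum(integrand) * dz)

s = 0.7
numeric = marginal_density(s)
exact = normal_pdf(s, 0.0, np.sqrt(1.0 + 0.5 ** 2))
print(numeric, exact)
```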

Instead of inferring $p(\mathbf s)$ directly, we infer $p( \mathbf s \mid \mathbf z )$ and $p(\mathbf z)$, as well as $p(\mathbf z \mid \mathbf s)$, to find the parameters.
Between $\mathbf s$ and $\mathbf z$, we have

- the prior $p(\mathbf z)$,
- the likelihood $p(\mathbf s \mid \mathbf z)$,
- the posterior $p(\mathbf z \mid \mathbf s)$.

In the Bayesian world, we will separate the state vector $\mathbf s$ into the observables $y$ and the model $\theta$ and use Maximum Likelihood Estimation (MLE) or Maximum A Posteriori (MAP) to find our parameters.

The model based on MLE is
$$ \begin{align} \theta_{\mathrm{MLE}} &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \log p(y \mid \theta) \\ &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \log \prod_{y_i\in Y} p(y_i \mid \theta) \\ &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \sum_{y_i\in Y} \log p(y_i \mid \theta) \end{align} $$

The model based on MAP is

$$ \begin{align} \theta_{\mathrm{MAP}} &= \mathop{\mathrm{arg\,max}}\limits_{\theta} p(y \mid \theta) p(\theta) \\ &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \log p(y \mid \theta) + \log p(\theta) \\ &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \log \prod_{y_i\in Y} p(y_i \mid \theta) + \log p(\theta) \\ &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \sum_{y_i\in Y} \log p(y_i \mid \theta) + \log p(\theta) \end{align} $$
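The difference between the two estimators is easy to see on a toy problem. The sketch below assumes a model that is not in the text above: $y_i \sim \mathcal N(\theta, \sigma^2)$ with known $\sigma$ and a Gaussian prior $\theta \sim \mathcal N(\mu_0, \tau^2)$. It maximizes both objectives on a grid of $\theta$ values and compares the result with the closed-form solutions for this conjugate case. All names and numbers are illustrative.

```python
import numpy as np

def log_likelihood(theta, y, sigma):
    """sum_i log p(y_i | theta) for y_i ~ N(theta, sigma^2), up to a constant."""
    theta = np.atleast_1d(theta)
    return -0.5 * np.sum((y[None, :] - theta[:, None]) ** 2, axis=1) / sigma ** 2

def log_prior(theta, mu0, tau):
    """log p(theta) for theta ~ N(mu0, tau^2), up to a constant."""
    return -0.5 * (theta - mu0) ** 2 / tau ** 2

rng = np.random.default_rng(1)
sigma, mu0, tau = 1.0, 0.0, 0.5
y = rng.normal(2.0, sigma, size=10)

# Brute-force arg max over a grid of candidate theta values.
grid = np.linspace(-1.0, 4.0, 50001)
ll = log_likelihood(grid, y, sigma)
theta_mle = grid[np.argmax(ll)]
theta_map = grid[np.argmax(ll + log_prior(grid, mu0, tau))]

# Closed forms for this Gaussian-Gaussian model.
mle_closed = y.mean()
map_closed = (y.sum() / sigma ** 2 + mu0 / tau ** 2) / (len(y) / sigma ** 2 + 1.0 / tau ** 2)
print(theta_mle, mle_closed)
print(theta_map, map_closed)
```

With few data points and a narrow prior, the MAP estimate is pulled from the sample mean toward the prior mean $\mu_0$; the extra $\log p(\theta)$ term acts as a regularizer.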

For example, the likelihood is the marginal probability

$$ p(y\mid \theta) = \int p( y \mid \theta, \mathbf z ) p(\mathbf z) d\mathbf z. $$

To work out the integral $p(y\mid \theta)$, numerical methods such as Monte Carlo can be used. Monte Carlo takes a discrete point of view and finds a fair [[multiset]] sample $\{\cdots, \mathbf z^i, \cdots\}$ of the latent space $\mathbf z$. However, the sample space of $\mathbf z$ is usually quite large. To solve this problem, importance sampling comes to the rescue. Instead of evaluating

$$ p(y\mid \theta) = \int p( y \mid \theta, \mathbf z ) p(\mathbf z) d\mathbf z, $$

we rewrite it as

$$ p(y\mid \theta) = \int p(y \mid \theta, \mathbf z ) p(\mathbf z) d\mathbf z = \int p(y \mid \theta, \mathbf z ) \frac{f(\mathbf z\mid y)}{f(\mathbf z\mid y)} (p(\mathbf z) d\mathbf z) = \int \frac{p( y \mid \theta, \mathbf z ) p(\mathbf z)}{f(\mathbf z\mid y)} ( f(\mathbf z\mid y)d\mathbf z), $$

where $f(\mathbf z\mid y)$ is our proposed sampling weight (the proposal distribution) in the $\mathbf z$ space. The last integral is an expectation under $f$, so we can draw samples from $f$ and average the ratio. By carefully choosing $f(\mathbf z\mid y)$, one can simplify the numerical integration.
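The following sketch compares naive Monte Carlo with importance sampling on a toy model where the answer is known. The model is an assumption made for illustration: $p(z) = \mathcal N(0, 1)$, $p(y\mid\theta, z) = \mathcal N(z, \sigma^2)$ with $\theta$ taken to be the noise scale $\sigma$, so that $p(y\mid\theta) = \mathcal N(0, 1 + \sigma^2)$ exactly. The proposal $f(z\mid y) = \mathcal N(y, 0.3^2)$ is also a hand-picked assumption; it simply concentrates samples where the integrand is large.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(42)
y, sigma, n = 3.0, 0.1, 20000

# Naive Monte Carlo: draw z^i from the prior p(z), average p(y | theta, z^i).
# Most prior samples land far from y and contribute almost nothing.
z_prior = rng.normal(0.0, 1.0, n)
naive_estimate = normal_pdf(y, z_prior, sigma).mean()

# Importance sampling: draw z^i from f(z | y), average
# p(y | theta, z^i) p(z^i) / f(z^i | y).
f_mu, f_sigma = y, 0.3
z_prop = rng.normal(f_mu, f_sigma, n)
weights = (
    normal_pdf(y, z_prop, sigma)
    * normal_pdf(z_prop, 0.0, 1.0)
    / normal_pdf(z_prop, f_mu, f_sigma)
)
is_estimate = weights.mean()

exact = normal_pdf(y, 0.0, np.sqrt(1.0 + sigma ** 2))
print("naive:", naive_estimate)
print("importance sampling:", is_estimate)
print("exact:", exact)
```

Both estimators are unbiased, but in this setup only a small fraction of the prior samples fall in the region where $p(y\mid\theta, z)$ is non-negligible, so the naive estimate is typically much noisier for the same number of samples.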

Our journey doesn’t end here. Using Bayes’ theorem,

$$ p(y\mid \theta) = \int \frac{p( y \mid \theta, \mathbf z ) p(\mathbf z)}{f(\mathbf z\mid y)} ( f(\mathbf z\mid y)d\mathbf z) = \int \frac{p( \mathbf z\mid \theta, y ) p(y\mid \theta)}{f(\mathbf z\mid y)} ( f(\mathbf z\mid y)d\mathbf z) . $$

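The KL divergence between the proposal $f(\mathbf z\mid y)$ and the true posterior $p(\mathbf z\mid \theta, y)$ is

$$ \begin{align} KL\left( f(\mathbf{z} \mid y) \| p(\mathbf{z} \mid \theta, y) \right) = -\mathcal{L} (y; \theta, f) + \log p(y\mid \theta), \end{align} $$

where $\mathcal{L} (y; \theta, f) = \int f(\mathbf z\mid y) \log \frac{p( y \mid \theta, \mathbf z ) p(\mathbf z)}{f(\mathbf z\mid y)} d\mathbf z$.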

Since the KL divergence is nonnegative and $\log p(y\mid \theta)$ is independent of $f$, $\mathcal{L} (y; \theta, f)$ serves as a lower bound of $\log p(y\mid \theta)$, aka the evidence lower bound (ELBO). Instead of dealing with the likelihood directly, we can maximize the ELBO. This is the key step: because $\log p(y\mid \theta)$ does not depend on $f$, maximizing the ELBO over $f$ minimizes the KL divergence, which in turn means that $f$ and $p(\mathbf z\mid\theta,y)$ are similar. This is the magic behind many methods such as the EM algorithm and variational autoencoders.
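The relation between the KL divergence, the ELBO and the log evidence can be checked numerically on a model where every term has a closed form. The sketch assumes a toy conjugate model that is not in the text above: $p(z) = \mathcal N(0, 1)$, $p(y\mid\theta, z) = \mathcal N(z, \sigma^2)$, and a Gaussian $f(z\mid y) = \mathcal N(\mu_f, s_f^2)$. The exact posterior is then $\mathcal N\left(y/(1+\sigma^2), \sigma^2/(1+\sigma^2)\right)$. Function names are illustrative.

```python
import numpy as np

def elbo(y, sigma, mu_f, s_f):
    """E_f[log p(y | z)] + E_f[log p(z)] + entropy of f, all in closed form."""
    expected_log_lik = (
        -0.5 * np.log(2 * np.pi * sigma ** 2)
        - ((y - mu_f) ** 2 + s_f ** 2) / (2 * sigma ** 2)
    )
    expected_log_prior = -0.5 * np.log(2 * np.pi) - (mu_f ** 2 + s_f ** 2) / 2
    entropy = 0.5 * np.log(2 * np.pi * np.e * s_f ** 2)
    return expected_log_lik + expected_log_prior + entropy

def kl_to_posterior(y, sigma, mu_f, s_f):
    """KL( N(mu_f, s_f^2) || exact posterior ) for the toy model."""
    m = y / (1 + sigma ** 2)
    a = np.sqrt(sigma ** 2 / (1 + sigma ** 2))
    return np.log(a / s_f) + (s_f ** 2 + (mu_f - m) ** 2) / (2 * a ** 2) - 0.5

def log_evidence(y, sigma):
    """log p(y | theta) = log N(y; 0, 1 + sigma^2)."""
    return -0.5 * np.log(2 * np.pi * (1 + sigma ** 2)) - y ** 2 / (2 * (1 + sigma ** 2))

y, sigma = 1.5, 0.8
log_ev = log_evidence(y, sigma)
for mu_f, s_f in [(0.0, 1.0), (1.0, 0.5), (2.0, 0.3)]:
    print(mu_f, s_f, elbo(y, sigma, mu_f, s_f), kl_to_posterior(y, sigma, mu_f, s_f), log_ev)

# When f equals the exact posterior, the KL term vanishes and the bound is tight.
post_mu = y / (1 + sigma ** 2)
post_s = np.sqrt(sigma ** 2 / (1 + sigma ** 2))
tight_elbo = elbo(y, sigma, post_mu, post_s)
```

For every choice of $(\mu_f, s_f)$ the printed ELBO stays below the log evidence, and the gap is the KL divergence to the posterior.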

Planted: 2021-01-27 by L Ma;


LM (2021). 'Latent Variable Models', Datumorphism, 01 April. Available at: https://datumorphism.leima.is/wiki/machine-learning/bayesian/latent-variable-models/.