From a statistical point of view, we know everything about a physical system if we know the probability $p(\mathbf s)$ of every possible state $\mathbf s$ of the system. Time can also be part of the state specification.

As an example, we will classify fruits into oranges and non-oranges. The state vector is $\mathbf s = (\text{is orange}, \text{texture } x)$, and our goal is to find the joint probability $p(\text{is orange}, x)$.

In reality, we only have sample data, which usually cannot cover all possible states of the system. A direct calculation of the joint probability $p(\mathbf s)$ is therefore not feasible: we have neither enough information nor enough computing power.

A formal description is to set the generating process on a measure space $\{Y, \Sigma_Y\}$ with measure $\phi$. We will need to define this space using a sample drawn from it.

Luckily, our world follows some simple rules. The data is very likely to be generated by a simple generating process. So we can estimate the generating process behind the sample data.

Introduce a Gaussian mixture model to describe the observed data points,

$$g(y) = \pi \phi(\mu_1, \sigma_1) + (1 - \pi) \phi(\mu_2, \sigma_2),$$

with $\phi$ being the normal distribution density. Given the data and $g$ as the model, we can write down the likelihood $p(y \mid \pi, \mu_1, \sigma_1, \mu_2, \sigma_2)$.

To find the parameters, we could in principle maximize the log-likelihood. However, this is not easy because the logarithm acts on a sum,

$$\sum_{\text{all data points}}\log\left(\pi \phi(\mu_1, \sigma_1) + (1 - \pi) \phi(\mu_2, \sigma_2) \right).$$
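To make the expression concrete, here is a minimal sketch in plain Python that evaluates this log-likelihood for a two-component mixture. The function names `normal_pdf` and `gmm_log_likelihood` are illustrative choices, not part of any library.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gmm_log_likelihood(data, pi, mu1, sigma1, mu2, sigma2):
    """Sum over data points of log(pi * phi_1(y) + (1 - pi) * phi_2(y))."""
    total = 0.0
    for y in data:
        mixture = pi * normal_pdf(y, mu1, sigma1) + (1.0 - pi) * normal_pdf(y, mu2, sigma2)
        # The log acts on a sum, so the two components do not separate.
        total += math.log(mixture)
    return total
```

Because the logarithm wraps the whole mixture, setting the derivatives with respect to the five parameters to zero does not give closed-form solutions.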

The trick is to introduce a latent variable that is not part of the model but helps us solve the problem. One latent variable we can use is a discrete variable that tells us which Gaussian component a data point is associated with, denoted $\Delta_i\in \{0,1\}$ for data point $y_i$. It is not known from the data. With the latent variable, the log-likelihood simplifies and we can apply the Expectation-Maximization (EM) method, which tackles the problem with an iterative process.
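The iterative process can be sketched as follows for the two-component mixture. This is a minimal illustration in plain Python, not a production implementation. It assumes the convention that $\Delta_i = 1$ means "data point $y_i$ came from the first component", and it uses a crude initialization (smallest and largest data points as initial means). Both are choices made here for illustration; the function name `em_two_gaussians` is likewise hypothetical.

```python
import math
import random


def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, n_iter=100):
    """Fit g(y) = pi*phi(mu1, sigma1) + (1 - pi)*phi(mu2, sigma2) with EM."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((y - mean) ** 2 for y in data) / n)
    # Crude initial guess (an assumption): extremes as means, pooled spread.
    pi, mu1, mu2, sigma1, sigma2 = 0.5, min(data), max(data), std, std
    for _ in range(n_iter):
        # E-step: gamma_i = P(Delta_i = 1 | y_i), the responsibility of component 1.
        gamma = []
        for y in data:
            a = pi * normal_pdf(y, mu1, sigma1)
            b = (1.0 - pi) * normal_pdf(y, mu2, sigma2)
            gamma.append(a / (a + b))
        # M-step: responsibility-weighted maximum-likelihood updates.
        w1 = sum(gamma)
        w2 = n - w1
        mu1 = sum(g * y for g, y in zip(gamma, data)) / w1
        mu2 = sum((1.0 - g) * y for g, y in zip(gamma, data)) / w2
        var1 = sum(g * (y - mu1) ** 2 for g, y in zip(gamma, data)) / w1
        var2 = sum((1.0 - g) * (y - mu2) ** 2 for g, y in zip(gamma, data)) / w2
        # Small floor keeps a component from collapsing onto a single point.
        sigma1 = max(math.sqrt(var1), 1e-6)
        sigma2 = max(math.sqrt(var2), 1e-6)
        pi = w1 / n
    return pi, mu1, sigma1, mu2, sigma2


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data: one third from N(-2, 0.5), two thirds from N(3, 0.5).
    sample = [random.gauss(-2.0, 0.5) for _ in range(200)]
    sample += [random.gauss(3.0, 0.5) for _ in range(400)]
    print(em_two_gaussians(sample))
```

Each E-step replaces the unknown $\Delta_i$ by its expected value under the current parameters, and each M-step then solves a weighted version of the single-Gaussian fitting problem, which has closed-form answers.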

More generally, with a latent variable $\mathbf z$, we have the marginalized probability

$$p(\mathbf s) = \int p( \mathbf s \mid \mathbf z ) p(\mathbf z) d\mathbf z.$$

This is a mixture model with infinitely many components (in Hilbert space).

Instead of inferring $p(\mathbf s)$, we infer $p( \mathbf s \mid \mathbf z )$ and $p(\mathbf z)$ as well as $p(\mathbf z \mid \mathbf s)$ for the parameters.

Between $\mathbf s$ and $\mathbf z$, we have

• prior $p(\mathbf z)$,
• likelihood $p(\mathbf s \mid \mathbf z)$,
• posterior $p(\mathbf z \mid \mathbf s)$.
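For a discrete latent variable, the relation between these three quantities can be sketched in a few lines of Python. The numbers below are made up purely for illustration, and the helper name `posterior` is hypothetical.

```python
def posterior(prior, likelihood):
    """Return (p(z | s), p(s)) for a discrete latent z.

    prior[z] is p(z); likelihood[z] is p(s | z) for the observed s.
    """
    joint = {z: likelihood[z] * prior[z] for z in prior}
    evidence = sum(joint.values())  # p(s) = sum_z p(s | z) p(z)
    return {z: joint[z] / evidence for z in joint}, evidence


# Made-up numbers for a latent variable with two values, 0 and 1.
prior = {0: 0.3, 1: 0.7}
likelihood = {0: 0.8, 1: 0.1}
post, evidence = posterior(prior, likelihood)
```

The normalizing constant `evidence` is the discrete analogue of the integral $\int p(\mathbf s \mid \mathbf z) p(\mathbf z) d\mathbf z$.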

In the Bayesian world, we separate the state vector $\mathbf s$ into the observables $y$ and the model parameters $\theta$, and use Maximum Likelihood Estimation (MLE) or Maximum A Posteriori (MAP) estimation to find the parameters.

The model based on MLE is

\begin{align} \theta_{\mathrm{MLE}} &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \log p(y \mid \theta) \\ &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \log \prod_{y_i\in Y} p(y_i \mid \theta) \\ &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \sum_{y_i\in Y} \log p(y_i \mid \theta) \end{align}

The model based on MAP is

\begin{align} \theta_{\mathrm{MAP}} &= \mathop{\mathrm{arg\,max}}\limits_{\theta} p(y \mid \theta) p(\theta) \\ &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \left[ \log p(y \mid \theta) + \log p(\theta) \right] \\ &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \left[ \log \prod_{y_i\in Y} p(y_i \mid \theta) + \log p(\theta) \right] \\ &= \mathop{\mathrm{arg\,max}}\limits_{\theta} \left[ \sum_{y_i\in Y} \log p(y_i \mid \theta) + \log p(\theta) \right] \end{align}
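The difference between the two estimators is easiest to see in a case with closed-form answers. The sketch below assumes a toy setting that is not part of the text above: $y_i \sim \mathcal N(\theta, \sigma^2)$ with known $\sigma$, and a Gaussian prior $\theta \sim \mathcal N(\mu_0, \sigma_0^2)$. All function names are hypothetical.

```python
def mle_mean(data):
    """MLE of a Gaussian mean with known sigma: the sample mean."""
    return sum(data) / len(data)


def map_mean(data, sigma, mu0, sigma0):
    """MAP estimate under the assumed prior theta ~ N(mu0, sigma0^2)."""
    precision_data = len(data) / sigma ** 2
    precision_prior = 1.0 / sigma0 ** 2
    # Precision-weighted average of the sample mean and the prior mean.
    weighted = precision_data * mle_mean(data) + precision_prior * mu0
    return weighted / (precision_data + precision_prior)


def log_posterior(theta, data, sigma, mu0, sigma0):
    """sum_i log p(y_i | theta) + log p(theta), dropping theta-independent constants."""
    log_lik = -sum((y - theta) ** 2 for y in data) / (2.0 * sigma ** 2)
    log_prior = -((theta - mu0) ** 2) / (2.0 * sigma0 ** 2)
    return log_lik + log_prior
```

For the data `[1.0, 2.0, 3.0]` with $\sigma = 1$, $\mu_0 = 0$, and $\sigma_0 = 1$, the MLE is 2 while the MAP estimate is 1.5: the $\log p(\theta)$ term pulls the estimate toward the prior mean, and the pull weakens as more data arrive or as the prior gets wider.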

With a latent variable $\mathbf z$, for example, the likelihood is the marginal probability

$$p(y\mid \theta) = \int p( y \mid \theta, \mathbf z ) p(\mathbf z) d\mathbf z.$$

To work out the integral for $p(y\mid \theta)$, numerical methods such as Monte Carlo can be used. Monte Carlo takes a discrete point of view and finds a fair sample $\{\cdots, \mathbf z^i, \cdots\}$ of the latent space of $\mathbf z$. However, the sample space of $\mathbf z$ is usually quite large. To solve this problem, importance sampling comes to the rescue. Instead of evaluating

$$p(y\mid \theta) = \int p( y \mid \theta, \mathbf z ) p(\mathbf z) d\mathbf z,$$

we rewrite it as

$$p(y\mid \theta) = \int p(y \mid \theta, \mathbf z ) p(\mathbf z) d\mathbf z = \int p(y \mid \theta, \mathbf z ) \frac{f(\mathbf z\mid y)}{f(\mathbf z\mid y)} (p(\mathbf z) d\mathbf z) = \int \frac{p( y \mid \theta, \mathbf z ) p(\mathbf z)}{f(\mathbf z\mid y)} ( f(\mathbf z\mid y)d\mathbf z),$$ where $f(\mathbf z\mid y)$ is our proposed sampling weight in the $\mathbf z$ space. By carefully choosing $f(\mathbf z\mid y)$, one could simplify the numerical integration.
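The rewriting above can be tried on a toy model where the exact answer is known. The sketch below assumes, purely for illustration, $\mathbf z \sim \mathcal N(0, 1)$ and $y \mid \theta, \mathbf z \sim \mathcal N(\theta + \mathbf z, 1)$, so that $p(y \mid \theta)$ is a normal density with mean $\theta$ and variance 2. The proposal $f(\mathbf z \mid y)$ is taken to be a unit-variance normal centred at $(y - \theta)/2$, where the integrand is largest. None of these choices come from the text; they are assumptions of the example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))


def naive_estimate(y, theta, n, rng):
    """Average p(y | theta, z) over z drawn from the prior p(z) = N(0, 1)."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(y, theta + z, 1.0)
    return total / n


def importance_estimate(y, theta, n, rng):
    """Average p(y | theta, z) p(z) / f(z | y) over z drawn from f(z | y)."""
    center = 0.5 * (y - theta)  # assumed proposal: N(center, 1)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(center, 1.0)
        numerator = normal_pdf(y, theta + z, 1.0) * normal_pdf(z, 0.0, 1.0)
        total += numerator / normal_pdf(z, center, 1.0)
    return total / n


def exact_evidence(y, theta):
    """Closed form for this toy model: y | theta ~ N(theta, variance 2)."""
    return normal_pdf(y, theta, math.sqrt(2.0))


if __name__ == "__main__":
    rng = random.Random(0)
    print(exact_evidence(3.0, 0.0))
    print(naive_estimate(3.0, 0.0, 20000, rng))
    print(importance_estimate(3.0, 0.0, 20000, rng))
```

When $y$ lies far from where the prior puts its mass, most prior samples contribute almost nothing to the naive average, whereas samples from a well-placed proposal all carry comparable weights. The closer $f(\mathbf z \mid y)$ is to the true posterior, the smaller the variance of the estimate.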

Our journey doesn’t end here. Using Bayes' theorem,

$$p(y\mid \theta) = \int \frac{p( y \mid \theta, \mathbf z ) p(\mathbf z)}{f(\mathbf z\mid y)} ( f(\mathbf z\mid y)d\mathbf z) = \int \frac{p( \mathbf z\mid \theta, y ) p(y\mid \theta)}{f(\mathbf z\mid y)} ( f(\mathbf z\mid y)d\mathbf z) .$$

The KL divergence between $f$ and the posterior is

\begin{align} KL\left( f(\mathbf{z} \mid y) \parallel p(\mathbf{z} \mid \theta, y) \right) = -\mathcal{L} (y; \theta, f) + \log p(y\mid \theta) \end{align}

Since the KL divergence is nonnegative and $\log p(y\mid \theta)$ is independent of $f$, $\mathcal{L} (y; \theta, f)$ serves as a lower bound of $\log p(y\mid \theta)$, known as the evidence lower bound (ELBO). Instead of dealing with the likelihood directly, we can maximize the ELBO. This is the tricky part: maximizing the ELBO over the choice of $f$ minimizes the KL divergence, which in turn means that $f$ and $p(\mathbf z\mid\theta,y)$ are similar. This is the magic behind many methods such as the EM algorithm and variational autoencoders.
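The decomposition can be checked numerically for a discrete latent variable. The sketch below assumes the usual definition $\mathcal{L}(y; \theta, f) = \sum_{\mathbf z} f(\mathbf z \mid y) \log \frac{p(y \mid \theta, \mathbf z) p(\mathbf z)}{f(\mathbf z \mid y)}$, which the text does not spell out, and uses made-up probabilities. The function name `elbo_and_kl` is hypothetical.

```python
import math


def elbo_and_kl(prior, likelihood, f):
    """Return (ELBO, KL(f || posterior), log p(y | theta)) for a discrete latent z.

    prior[k] = p(z=k), likelihood[k] = p(y | theta, z=k), f[k] = f(z=k | y).
    """
    joint = [lik * p for lik, p in zip(likelihood, prior)]
    evidence = sum(joint)                 # p(y | theta)
    post = [j / evidence for j in joint]  # p(z | theta, y)
    elbo = sum(q * math.log(j / q) for q, j in zip(f, joint) if q > 0)
    kl = sum(q * math.log(q / p) for q, p in zip(f, post) if q > 0)
    return elbo, kl, math.log(evidence)


# Made-up numbers for a latent variable with two values.
prior = [0.3, 0.7]
likelihood = [0.8, 0.1]
elbo, kl, log_evidence = elbo_and_kl(prior, likelihood, [0.5, 0.5])
```

For any valid $f$, `kl` equals `log_evidence - elbo` up to rounding, and `elbo` never exceeds `log_evidence`. Choosing $f$ equal to the posterior drives the KL divergence to zero and makes the bound tight.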
