Diffusion Models for Forecasting

Objective

In a denoising diffusion model, given

  • an input $\mathbf x^0$ drawn from a complicated and unknown distribution $q(\mathbf x^0)$,

we find

  • a latent space with a simple and manageable distribution, e.g., a normal distribution, and
  • the transformations from $\mathbf x^0$ to $\mathbf x^N$ (the latent variable after $N$ diffusion steps), as well as
  • the transformations from $\mathbf x^N$ back to $\mathbf x^0$.

An Example

For example, with $N=5$, the forward process is

$$ \mathbf x^0 \to \mathbf x^1 \to \mathbf x^2 \to \mathbf x^3 \to \mathbf x^4 \to \mathbf x^5 $$

and the reverse process is

$$ \mathbf x^5 \to \mathbf x^4 \to \mathbf x^3 \to \mathbf x^2 \to \mathbf x^1 \to \mathbf x^0 $$

The joint distribution we are searching for is

$$ q(\mathbf x^1, \mathbf x^2, \mathbf x^3, \mathbf x^4, \mathbf x^5 \vert \mathbf x^0) = q(\mathbf x^5\vert \mathbf x^4) q(\mathbf x^4\vert \mathbf x^3) q(\mathbf x^3\vert \mathbf x^2)q(\mathbf x^2\vert \mathbf x^1)q(\mathbf x^1\vert \mathbf x^0). $$

A diffusion model assumes a simple diffusion process, e.g.,

$$ \begin{equation} q(\mathbf x^n \vert \mathbf x^{n-1}) \equiv \mathcal N (\mathbf x^n ; \sqrt{ 1 - \beta_n} \mathbf x ^{n -1}, \beta_n\mathbf I). \label{eq-guassian-noise} \end{equation} $$

This simulates an information diffusion process. The information in the original data is gradually smeared.
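As a minimal sketch of this forward process, one step of Eq \ref{eq-guassian-noise} can be sampled as $\mathbf x^n = \sqrt{1-\beta_n}\,\mathbf x^{n-1} + \sqrt{\beta_n}\,\epsilon$ with $\epsilon \sim \mathcal N(0, \mathbf I)$. The function names, the linear $\beta_n$ schedule, and the toy data below are illustrative assumptions, not part of the original text.

```python
import numpy as np


def forward_step(x_prev, beta_n, rng):
    """Sample x^n ~ N(sqrt(1 - beta_n) * x^{n-1}, beta_n * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_n) * x_prev + np.sqrt(beta_n) * noise


def forward_chain(x0, betas, rng):
    """Run the full forward chain and return [x^0, x^1, ..., x^N]."""
    xs = [np.asarray(x0, dtype=float)]
    for beta_n in betas:
        xs.append(forward_step(xs[-1], beta_n, rng))
    return xs


rng = np.random.default_rng(0)
betas = np.linspace(0.1, 0.5, 5)  # assumed noise schedule for N = 5
x0 = np.full(10_000, 3.0)         # toy "data": many copies of the value 3
xs = forward_chain(x0, betas, rng)

# The mean shrinks toward 0 and the variance grows toward 1 along the chain.
for n, x in enumerate(xs):
    print(n, round(float(x.mean()), 3), round(float(x.var()), 3))
```

With every step, the sample mean moves away from the original value and the variance increases, which is the "smearing" of information described above.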

If the chosen diffusion process is reversible, its reverse process can be modeled by a similar Markov process

$$ p_\theta (\mathbf x^0, \mathbf x^1, \mathbf x^2, \mathbf x^3, \mathbf x^4, \mathbf x^5) = p_\theta (\mathbf x^0 \vert \mathbf x^1) p_\theta (\mathbf x^1 \vert \mathbf x^2) p_\theta (\mathbf x^2 \vert \mathbf x^3) p_\theta (\mathbf x^3 \vert \mathbf x^4) p_\theta (\mathbf x^4 \vert \mathbf x^5) p(\mathbf x^5). $$

This reverse process is the denoising process.

As long as our model estimates $p_\theta (\mathbf x^{n-1} \vert \mathbf x^n)$ well, we can go both ways: $\mathbf x^0 \to \mathbf x^N$ with the predefined forward process and $\mathbf x^N \to \mathbf x^0$ with the learned reverse process.

The Reverse Process: A Gaussian Example

With Eq \ref{eq-guassian-noise}, the reverse process is

$$ \begin{equation} p_\theta (\mathbf x^{n-1} \vert \mathbf x^n) = \mathcal N ( \mathbf x^{n-1} ; \mu_\theta(\mathbf x^n, n), \Sigma_\theta(\mathbf x^n, n)\mathbf I). \label{eqn-guassian-reverse-process} \end{equation} $$
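The reverse chain in Eq \ref{eqn-guassian-reverse-process} can be sketched as ancestral sampling: draw $\mathbf x^N$ from the prior, then repeatedly sample $\mathbf x^{n-1}$ from the Gaussian with mean $\mu_\theta$ and variance $\Sigma_\theta$. In practice $\mu_\theta$ and $\Sigma_\theta$ come from a trained network; the `toy_mu` and `toy_sigma2` functions below are hypothetical stand-ins. They are the exact reverse kernel only for the special case where the data itself is standard normal, which is assumed here so the sketch can run without training.

```python
import numpy as np


def reverse_step(x_n, n, mu_theta, sigma2_theta, rng):
    """Sample x^{n-1} ~ N(mu_theta(x^n, n), sigma2_theta(x^n, n) * I)."""
    mean = mu_theta(x_n, n)
    var = sigma2_theta(x_n, n)
    return mean + np.sqrt(var) * rng.standard_normal(x_n.shape)


def sample(shape, N, mu_theta, sigma2_theta, rng):
    """Start from the prior p(x^N) = N(0, I) and walk the chain down to x^0."""
    x = rng.standard_normal(shape)
    for n in range(N, 0, -1):
        x = reverse_step(x, n, mu_theta, sigma2_theta, rng)
    return x


betas = np.linspace(0.1, 0.5, 5)  # assumed noise schedule, N = 5


def toy_mu(x_n, n):
    # Assumption: data ~ N(0, I), so the true posterior mean is sqrt(1 - beta_n) * x^n.
    return np.sqrt(1.0 - betas[n - 1]) * x_n


def toy_sigma2(x_n, n):
    # Assumption: data ~ N(0, I), so the true posterior variance is beta_n.
    return betas[n - 1]


rng = np.random.default_rng(0)
x0_samples = sample((10_000,), len(betas), toy_mu, toy_sigma2, rng)
print(x0_samples.mean(), x0_samples.var())
```

Replacing the two toy functions with the outputs of a learned model gives the denoising sampler for real data; the loop structure stays the same.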

Summary

  • Forward: perturbs data to noise, step by step;
  • Reverse: converts noise to data, step by step.

In diagram form: data → (forward Markov chain) → noise → (reverse Markov chain) → data, where the noise that starts the reverse chain is sampled from the prior distribution.

Optimization

The forward chain is predefined. To close the loop, we have to find $p_\theta$. A natural choice for our loss function is the negative log-likelihood,

$$ \mathbb E_{q(\mathbf x^0)} \left( - \log ( p_\theta (\mathbf x^0) ) \right). $$

Ho et al. (2020) showed that the above loss has an upper bound related to the diffusion process defined in Eq \ref{eq-guassian-noise}:

$$ \begin{align} &\operatorname{min}_\theta \mathbb E_{q(\mathbf x^0)} \left[ - \log p_\theta (\mathbf x^0) \right] \\ \leq & \operatorname{min}_\theta \mathbb E_{q(\mathbf x^{0:N})} \left[ -\log p(\mathbf x^N) - \sum_{n=1}^{N} \log \frac{p_\theta (\mathbf x^{n-1}\vert \mathbf x^n)}{q(\mathbf x^n \vert \mathbf x^{n-1})} \right] \\ =& \operatorname{min}_\theta \mathbb E_{\mathbf x^0, \epsilon} \left[ \frac{\beta_n^2}{2\Sigma_\theta \alpha_n (1 - \bar \alpha_n)} \lVert \epsilon - \epsilon_\theta ( \sqrt{ \bar \alpha_n} \mathbf x^0 + \sqrt{1-\bar \alpha_n} \epsilon , n ) \rVert^2 \right] \end{align} $$

where $\epsilon$ is a sample from $\mathcal N(0, \mathbf I)$. The second step assumes the Gaussian noise in Eq \ref{eq-guassian-noise}, which implies the closed-form marginal

$$ q(\mathbf x^n \vert \mathbf x^0) = \mathcal N (\mathbf x^n ; \sqrt{\bar \alpha_n} \mathbf x^0, (1 - \bar \alpha_n)\mathbf I), $$

with $\alpha_n = 1 - \beta_n$, $\bar \alpha_n = \prod_{i=1}^n \alpha_i$, and $\Sigma_\theta$ as defined in Eq \ref{eqn-guassian-reverse-process}.
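The closed-form marginal means $\mathbf x^n$ can be drawn directly from $\mathbf x^0$ without iterating through the chain, which is what makes the noise-prediction loss cheap to evaluate. The sketch below computes $\bar\alpha_n$, samples $\mathbf x^n$ in one shot, and evaluates a one-sample estimate of the weighted loss for a single step $n$, using the squared norm as in Ho et al. (2020). The function names, the schedule, the choice $\Sigma_\theta = \beta_n$, and the `zero_predictor` placeholder for $\epsilon_\theta$ are assumptions for illustration only.

```python
import numpy as np


def alpha_bars(betas):
    """Cumulative products: abar_n = prod_{i<=n} (1 - beta_i)."""
    return np.cumprod(1.0 - np.asarray(betas))


def q_sample(x0, n, abar, eps):
    """Draw x^n from q(x^n | x^0): sqrt(abar_n) x^0 + sqrt(1 - abar_n) eps, n in 1..N."""
    a = abar[n - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps


def noise_loss(eps_theta, x0, n, betas, sigma2, rng):
    """One-sample estimate of the weighted noise-prediction loss at step n."""
    abar = alpha_bars(betas)
    eps = rng.standard_normal(x0.shape)
    x_n = q_sample(x0, n, abar, eps)
    alpha_n = 1.0 - betas[n - 1]
    weight = betas[n - 1] ** 2 / (2.0 * sigma2 * alpha_n * (1.0 - abar[n - 1]))
    return weight * np.sum((eps - eps_theta(x_n, n)) ** 2)


def zero_predictor(x_n, n):
    # Hypothetical placeholder for a trained epsilon_theta network.
    return np.zeros_like(x_n)


betas = np.linspace(0.1, 0.5, 5)  # assumed noise schedule, N = 5
rng = np.random.default_rng(0)
x0 = np.ones(4)                   # toy data point
n = 3
loss = noise_loss(zero_predictor, x0, n, betas, sigma2=betas[n - 1], rng=rng)
print(loss)
```

During training, $n$ would typically be drawn at random for each data point and the loss minimized over the parameters of $\epsilon_\theta$.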


LM (2023). 'Diffusion Models for Forecasting', Datumorphism, 02 April. Available at: https://datumorphism.leima.is/wiki/machine-learning/energy-based-model/diffusion-model/.