# NMF: Nonnegative Matrix Factorizatioin

## #Machine Learning #Factorization

## Decomposition

To make it easier to understand, we start with a data point $\mathbf P$ in a $k$-dimensional space spanned by $k$ basis vectors $\mathbf V^k$. Naturally, we could write down the component decomposition of the point using the basis vectors $\mathbf V^k$,

$$ \mathbf P = P_k \mathbf V^k. $$

This is immediately obvious to us since we have been dealing with rank 2 $(k, 1)$ basis vectors and we are talking about the $k$ coordinates for a point.

This point is represented by a matrix of rank 2 $(k, 1)$ given this basis.

$$ \mathbf P \to \begin{pmatrix} P_1, P_2, \cdots, P_k \end{pmatrix} $$

If we assume a cartesian coordinate system, the basis are normal vectors.

$$ \mathbf V^1 \to \begin{pmatrix} 1\\ 0\\ \vdots\\ 0 \end{pmatrix}, \mathbf V^2 \to \begin{pmatrix} 0\\ 1\\ \vdots\\ 0 \end{pmatrix}, \cdots $$These representations of the basis vectors can be joined together

$$ \mathbf V \to \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 1 \end{pmatrix} $$Using these representations of the abstract vectors, we could represent the point $\mathbf P$ using

$$ \mathbf P \to \begin{pmatrix} P_1, P_2, \cdots, P_k \end{pmatrix} \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 1 \end{pmatrix}, $$*which is insanely trivial since the representation of basis in this case is an identity matrix.*

NMF is talking about the same idea: coordinate transformations,

$$ X_{n}^{\phantom{n}k} = H_n^{\phantom{n}r} W_r^{\phantom{r}k}. $$

We will view $X_{n}^{\phantom{n}k}$ as a vertical stack of $n$ points:

$$ \begin{pmatrix} \mathbf P_1 \\ \mathbf P_2 \\ \vdots \\ \mathbf P_n \end{pmatrix} \to \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1k} \\ P_{21} & P_{22} & \cdots & P_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nk} \end{pmatrix} = \begin{pmatrix} H_{11} & H_{12} & \cdots & H_{1r} \\ H_{21} & H_{22} & \cdots & H_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ H_{n1} & H_{n2} & \cdots & H_{nr} \end{pmatrix} \begin{pmatrix} W_{11} & W_{12} & \cdots & W_{1k} \\ W_{21} & W_{22} & \cdots & W_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ W_{r1} & W_{r2} & \cdots & W_{rk} \end{pmatrix} \leftarrow \begin{pmatrix} \mathbf H_1 \\ \mathbf H_2 \\ \vdots \\ \mathbf H_n \end{pmatrix} \begin{pmatrix} \mathbf W^1 & \mathbf W^2 & \cdots & \mathbf W^k \end{pmatrix} $$For a point $i$, we have

$$ \mathbf P_i = \mathbf H_i \begin{pmatrix}\mathbf W^1 & \mathbf W^2 & \cdots & \mathbf W^k \end{pmatrix} $$

Compare this with the decomposition of a point in the cartesian coordinate system, this is a decomposition of each point on a basis spanned by $\mathbf W^i$.

## Nonnegative Matrix Factorization

NMF is a dimension reduction method that decomposes $X_{n}^{\phantom{n}k}$ using

$$ \begin{equation} X_{n}^{\phantom{n}k} \sim H_n^{\phantom{n}r} W_r^{\phantom{r}k}. \label{eq-nmf} \end{equation} $$while requiring the elements of the decomposition to be nonnegative. **But there are many possible decompositions!** Then we require this approximation to be as accurate as possible.

How do we measure the approximations?

We use the Frobenius distance between the matrix $X_{n}^{\phantom{n}k}$ and $H_n^{\phantom{n}r} W_r^{\phantom{r}k}$,

$$ \lVert X_{n}^{\phantom{n}k} - H_n^{\phantom{n}r} W_r^{\phantom{r}k} \rVert^2 \equiv \sum_{n,k} (X_{n}^{\phantom{n}k} - H_n^{\phantom{n}r} W_r^{\phantom{r}k})^2. $$

So NMF will require this Frobenius distance to be minimal.

## Why Does It Even Work?

Well, it doesn’t always work. **We might have many different NMFs for one single matrix.**

How do we find a proper decomposition using NMF? We put restrictions on it, such as penalties. Apart from this, we also need to choose the rank of the decomposition $r$ in Eq. ($\ref{eq-nmf}$). Even if we fixed all these problems, NMF is computation expensive.

## Table of Contents

**Current Ref:**

- wiki/machine-learning/factorization/nmf.md

**Links to:**

###### SVD: Singular Value Decomposition

Given a matrix $\mathbf X \to X_{m}^{\phantom{m}n}$, we can decompose it into three matrices $$ …

###### Eigenvalues and Eigenvectors

To find the eigenvectors $\mathbf x$ of a matrix $\mathbf A$, we construct the eigen equation $$ …

###### Unsupervised Learning: PCA

Principal component analysis is a method to remove redundancies of the features by looking into the …

###### Unsupervised Learning: SVM

unsupervised learning: support vector machine