The weights are better initialized if they

• are zero centered, and
• have similar variance across layers.

## Why

If the variances differ greatly across layers, we would need a different learning rate for each layer during optimization. With the variances on the same scale, a single global learning rate works for the whole network.

Suppose we use a simple linear activation, $\sigma(x) = \alpha x$. For an input vector with components $x_j$, the outputs $y_i$ are

$$y_i = \sigma\left( \sum_{j} w_{ij} x_j \right) = \alpha \sum_{j} w_{ij} x_j.$$

The variance of $y_i$ is

\begin{align}
\operatorname{Var}\left[ y \right] &= \alpha^2 \operatorname{Var}\left[\sum_{j} w_{ij} x_j \right] \\
&= \alpha^2 \sum_{j}\operatorname{Var}\left[ w_{ij} x_j \right] \\
&= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right] - \mathbb E^2 \left[ w_{ij} x_j \right] \right) \label{eq-var-expand-var}\\
&= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right] - {\color{red}\mathbb E^2 \left[ w_{ij} \right]} \mathbb E^2 \left[ x_j \right] \right) \label{eq-var-expand-expectation-sq} \\
&= \alpha^2 \sum_j \mathbb E\left[ (w_{ij}x_j)^2 \right] \label{eq-var-drop-zero-exp} \\
&= \alpha^2 \sum_j \mathbb E\left[ w_{ij}^2 x_j^2 \right] \label{eq-var-expand-sq-expectation}\\
&= \alpha^2 \sum_j \mathbb E\left[ w_{ij}^2\right]\mathbb E\left[x_j^2 \right] \label{eq-var-propagate-exp}\\
&= \alpha^2 \sum_j \left(\mathbb E\left[ w_{ij}^2\right] - {\color{red}\mathbb E^2\left[ w_{ij} \right]}\right)\left(\mathbb E\left[x_j^2 \right] - {\color{red}\mathbb E^2\left[ x_j \right]} \right) \label{eq-var-add-zero-exp-w-x}\\
&= \alpha^2 \sum_j \operatorname{Var}\left[ w_{ij} \right]\operatorname{Var}\left[x_j \right] \label{eq-var-form-var-w-x}\\
&= D \alpha^2 \sigma_w^2\sigma_x^2 \label{eq-var-calculate-sum}.
\end{align}

In the derivation,

• the weights and inputs are assumed to be mutually independent, which lets us split the variance of the sum and factor the expectations,
• the red term in $\eqref{eq-var-expand-expectation-sq}$, ${\color{red}\mathbb E^2 \left[ w_{ij} \right]}$, is zero because the weights are zero centered,
• we assume the input has zero expectation in $\eqref{eq-var-add-zero-exp-w-x}$, i.e., ${\color{red}\mathbb E^2\left[ x_j \right]}=0$,
• in the last step $\eqref{eq-var-calculate-sum}$, we assume all the variances are the same, $\operatorname{Var}\left[ w_{ij} \right]=\sigma_w^2$ and $\operatorname{Var}\left[x_j \right]=\sigma_x^2$,
• $D$ is the dimension of the input data $x$.
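The result $\operatorname{Var}[y] = D \alpha^2 \sigma_w^2 \sigma_x^2$ can be checked numerically. A minimal sketch with NumPy, using $\alpha = 1$ and illustrative values for $D$, $\sigma_w$, and $\sigma_x$ (all chosen here, not from the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (assumptions for this sketch)
D = 500            # input dimension
sigma_w = 0.05     # weight standard deviation
sigma_x = 2.0      # input standard deviation
n_samples = 10_000

# Zero-mean, mutually independent weights and inputs
w = rng.normal(0.0, sigma_w, size=(n_samples, D))
x = rng.normal(0.0, sigma_x, size=(n_samples, D))

# y = sum_j w_j x_j for each sample (alpha = 1)
y = np.sum(w * x, axis=1)

empirical = np.var(y)
predicted = D * sigma_w**2 * sigma_x**2  # D * alpha^2 * sigma_w^2 * sigma_x^2
print(empirical, predicted)  # the two values should agree closely
```

With these numbers the predicted variance is $500 \times 0.05^2 \times 2^2 = 5$, and the empirical estimate lands near it.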

If we require the output variance to match the input variance, $\operatorname{Var}\left[ y \right] = \sigma_x^2$, the variance of the weights should be

$$\sigma_w^2 = \frac{1}{D \alpha^2}.$$

For a linear activation with $\alpha=1$, this reduces to

$$\sigma_w^2 = \frac{1}{D}.$$
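With $\sigma_w^2 = 1/D$ (where $D$ is the fan-in), the activation variance stays roughly constant as it propagates through a stack of linear layers. A small sketch, with arbitrary layer sizes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(fan_in, fan_out):
    # sigma_w^2 = 1 / D, where D is the input (fan-in) dimension
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

dims = [512, 512, 512, 512, 512]  # arbitrary layer widths
x = rng.normal(0.0, 1.0, size=(10_000, dims[0]))  # unit-variance input

h = x
for fan_in, fan_out in zip(dims[:-1], dims[1:]):
    W = init_layer(fan_in, fan_out)
    h = h @ W  # linear activation, alpha = 1
    print(round(np.var(h), 3))  # stays close to Var[x] = 1 at every layer
```

Setting $\sigma_w$ much larger or smaller than $\sqrt{1/D}$ instead makes the printed variances grow or shrink geometrically with depth.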

The more features the input has, the smaller the initial variance of the weights should be.

Assuming $\hat D$ is the output dimension of the layer, it can similarly be shown that we need

$$\sigma_w^2 = \frac{1}{\hat D},$$

to make sure the back-propagation is also stable, i.e., the gradients have similar variance across layers.
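The intuition is that back-propagation through a linear layer multiplies the incoming gradient by $W^\top$, which is a sum over $\hat D$ (fan-out) terms, so the roles of $D$ and $\hat D$ swap. A quick numerical sketch (dimensions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

fan_in, fan_out = 256, 1024  # illustrative dimensions

# Back-prop through a linear layer multiplies the gradient by W^T,
# a sum over fan_out terms, so sigma_w^2 = 1 / fan_out keeps it stable.
W = rng.normal(0.0, np.sqrt(1.0 / fan_out), size=(fan_in, fan_out))

grad_out = rng.normal(0.0, 1.0, size=(10_000, fan_out))  # unit-variance gradient
grad_in = grad_out @ W.T

print(round(np.var(grad_in), 3))  # close to Var[grad_out] = 1
```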

## Xavier Initialization

The Xavier initialization is a compromise between the above two requirements, taking the harmonic mean of the two variances:

$$\sigma_w^2 = \frac{2}{D + \hat D}.$$

In Glorot2010, the authors also proposed a uniform distribution alternative to the normal distribution, which they call the normalized initialization. In this proposal, we uniformly sample from the range

$$\left[ -\sqrt{\frac{6}{D + \hat D}}, \sqrt{\frac{6}{D + \hat D}} \right].$$

Since a uniform distribution on $[-a, a]$ has variance $a^2/3$, choosing $a = \sqrt{6/(D + \hat D)}$ reproduces the variance $2/(D + \hat D)$, so this proposal is essentially the same idea as the Xavier initialization.
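Both variants can be sketched side by side; the helper names and dimensions below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(fan_in, fan_out):
    # sigma_w^2 = 2 / (D + D_hat)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def xavier_uniform(fan_in, fan_out):
    # U[-a, a] with a = sqrt(6 / (D + D_hat)) has variance
    # a^2 / 3 = 2 / (D + D_hat), matching the normal variant
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

Wn = xavier_normal(300, 500)
Wu = xavier_uniform(300, 500)
print(round(np.var(Wn), 4), round(np.var(Wu), 4))  # both near 2 / 800 = 0.0025
```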

## What are the differences

deeplearning.ai has an interactive session on the effects of initializations for classification tasks on MNIST.


L Ma (2021). 'Initialize Artificial Neural Networks', Datumorphism, 09 April. Available at: https://datumorphism.leima.is/cards/machine-learning/neural-networks/neural-networks-initialization/.