Initialize Artificial Neural Networks

The weights are better if they

  • are zero centered, and
  • have similar variance across layers.


If the variances differ greatly across layers, we would need a different learning rate for each layer during optimization. By setting the variances to the same scale, we can use a single global learning rate for the whole network.

Suppose we are using a simple linear activation, $\sigma(x) = \alpha x$. For a series of inputs $x_j$, the outputs $y_i$ are

$$ y_i = \sigma\left(\sum_{j} w_{ij} x_j\right) = \alpha \sum_{j} w_{ij} x_j. $$

The variance of $y_i$ is

$$ \begin{align} \operatorname{Var}\left[ y \right] &= \alpha^2 \operatorname{Var}\left[\sum_{j} w_{ij} x_j \right] \\ & = \alpha^2 \sum_{j}\operatorname{Var}\left[ w_{ij} x_j \right] \\ &= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right] - \mathbb E^2 \left[ w_{ij} x_j \right] \right) \label{eq-var-expand-var}\\ &= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right] - {\color{red}\mathbb E^2 \left[ w_{ij} \right]} \mathbb E^2 \left[ x_j \right] \right) \label{eq-var-expand-expectation-sq} \\ &= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right]\right) \label{eq-var-drop-zero-exp} \\ &= \alpha^2 \sum_j \left( \mathbb E\left[ w_{ij}^2x_j^2 \right]\right) \label{eq-var-expand-sq-expectation}\\ &= \alpha^2 \sum_j \left( \mathbb E\left[ w_{ij}^2\right]\mathbb E\left[x_j^2 \right]\right) \label{eq-var-propagate-exp}\\ &= \alpha^2 \sum_j \left( \left(\mathbb E\left[ w_{ij}^2\right] - {\color{red}\mathbb E^2\left[ w_{ij} \right]}\right)\left(\mathbb E\left[x_j^2 \right] - {\color{red}\mathbb E^2\left[ x_j \right]} \right) \right) \label{eq-var-add-zero-exp-w-x}\\ &= \alpha^2 \sum_j \left( \operatorname{Var}\left[ w_{ij} \right]\operatorname{Var}\left[x_j \right]\right) \label{eq-var-form-var-w-x}\\ &= D \alpha^2 \left( \sigma_w^2\sigma_x^2\right) \label{eq-var-calculate-sum}. \end{align} $$

In the derivation,

  • the red term ${\color{red}\mathbb E^2 \left[ w_{ij} \right]}$ in $\eqref{eq-var-expand-expectation-sq}$ is zero because the weights are zero centered,
  • in $\eqref{eq-var-add-zero-exp-w-x}$ we assume the input also has zero expectation, i.e., ${\color{red}\mathbb E^2\left[ x_j \right]}=0$,
  • in the last step $\eqref{eq-var-calculate-sum}$, we assume all the variances are the same, $\operatorname{Var}\left[ w_{ij} \right]=\sigma_w^2$ and $\operatorname{Var}\left[x_j \right]=\sigma_x^2$,
  • $D$ is the dimension of the input data $x$.
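As a quick sanity check, the final identity $\operatorname{Var}[y] = D \alpha^2 \sigma_w^2 \sigma_x^2$ can be verified numerically. The sketch below uses NumPy with arbitrary example values for $D$, $\alpha$, $\sigma_w$, and $\sigma_x$:

```python
import numpy as np

rng = np.random.default_rng(42)

D = 512          # input dimension
alpha = 1.5      # slope of the linear activation sigma(x) = alpha * x
sigma_w = 0.05   # std of the weights
sigma_x = 2.0    # std of the inputs
n_samples = 20_000

# Zero-centered weights and inputs, as the derivation assumes.
w = rng.normal(0.0, sigma_w, size=(n_samples, D))
x = rng.normal(0.0, sigma_x, size=(n_samples, D))

# y = alpha * sum_j w_j x_j for each sample
y = alpha * np.einsum("ij,ij->i", w, x)

empirical = y.var()
predicted = D * alpha**2 * sigma_w**2 * sigma_x**2
print(empirical, predicted)  # the two should agree within sampling noise
```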

If we require $\operatorname{Var}\left[ y \right] = \sigma_x^2$, the variance of the weights should be

$$ \sigma_w^2 = \frac{1}{D \alpha^2}. $$

For a linear activation with $\alpha=1$ (the identity), this reduces to

$$ \sigma_w^2 = \frac{1}{D}. $$
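A minimal sketch of why this $1/D$ scaling matters: push a batch through a stack of linear layers and compare fan-in-scaled weights with a fixed weight std. The layer widths, depth, and the fixed std of $0.1$ are arbitrary example choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_variance(layer_dims, weight_std):
    """Push a standard-normal batch through a deep linear network
    and return the activation variance at the final layer."""
    x = rng.normal(0.0, 1.0, size=(2_000, layer_dims[0]))
    for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:]):
        W = rng.normal(0.0, weight_std(d_in), size=(d_in, d_out))
        x = x @ W  # linear activation, alpha = 1
    return x.var()

dims = [256] * 11  # 10 linear layers, all of width 256

# Fan-in scaling sigma_w^2 = 1/D: the variance stays near 1.
stable = forward_variance(dims, weight_std=lambda d: (1.0 / d) ** 0.5)

# A fixed std ignores D; here D * 0.1^2 > 1, so the variance explodes.
unstable = forward_variance(dims, weight_std=lambda d: 0.1)

print(stable, unstable)
```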

The more features the input has, the smaller the initial variance of the weights should be.

Assuming $\hat D$ is the output dimension of the layer, it can also be shown that we need

$$ \sigma_w^2 = \frac{1}{\hat D}, $$

to make sure back-propagation is also stable, i.e., that the gradient variance is similar across layers.
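The fan-out condition can be checked the same way: backpropagating through a linear layer computes $\partial L/\partial x = (\partial L/\partial y)\, W^\top$, which multiplies the gradient variance by $\hat D \sigma_w^2$, so $\sigma_w^2 = 1/\hat D$ keeps it constant. A small NumPy sketch with example dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)

D, D_hat = 256, 1024  # fan-in and fan-out (example values)

# Backprop through a linear layer: grad_in = grad_out @ W.T,
# so Var[grad_in] = D_hat * sigma_w^2 * Var[grad_out].
grad_out = rng.normal(0.0, 1.0, size=(10_000, D_hat))
W = rng.normal(0.0, (1.0 / D_hat) ** 0.5, size=(D, D_hat))
grad_in = grad_out @ W.T

print(grad_in.var())  # close to 1 when sigma_w^2 = 1 / D_hat
```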

Xavier Initialization

The Xavier initialization is a compromise between the two requirements above,

$$ \sigma_w^2 = \frac{2}{D + \hat D}. $$

In Glorot2010, the authors also propose a uniform-distribution alternative to the normal distribution, which they call the normalized initialization. In this proposal, we sample uniformly from the range

$$ \left[ -\frac{\sqrt{6}}{\sqrt{D + \hat D}}, \frac{\sqrt{6}}{\sqrt{D + \hat D}} \right]. $$

This proposal is essentially the same idea as the Xavier initialization: a uniform distribution on $[-a, a]$ has variance $a^2/3$, which here gives exactly $\sigma_w^2 = 2/(D + \hat D)$.
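This equivalence can be confirmed in a few lines of NumPy (the dimensions are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(1)

D, D_hat = 300, 100  # fan-in and fan-out (example values)

# Normalized initialization: uniform on [-a, a] with a = sqrt(6 / (D + D_hat)).
a = np.sqrt(6.0 / (D + D_hat))
W = rng.uniform(-a, a, size=(D, D_hat))

# Var of Uniform[-a, a] is a^2 / 3 = 2 / (D + D_hat): the Xavier variance.
empirical = W.var()
xavier = 2.0 / (D + D_hat)
print(empirical, xavier)
```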

The article *What are the differences* has an interactive session on the effects of initializations for classification tasks on MNIST.

L Ma (2021). 'Initialize Artificial Neural Networks', Datumorphism, 09 April. Available at: