Initialize Artificial Neural Networks

#Artificial Neuron #Neural Network #Basics

Initial weights work better if they [1]

  • are zero centered, and
  • have similar variance across layers.


If the variances differ widely across layers, the optimizer would need a different learning rate for each layer. Keeping the variances on the same scale lets us use a single global learning rate for the whole network.

Suppose we use a simple linear activation, $\sigma(x) = \alpha x$. For inputs $x_j$, $j = 1, \dots, D$, the outputs $y_i$ are

$$ y_i = \sigma\left( \sum_{j} w_{ij} x_j \right) = \alpha \sum_{j} w_{ij} x_j. $$

The variance of $y_i$ is

$$
\begin{align}
\operatorname{Var}\left[ y_i \right] &= \alpha^2 \operatorname{Var}\left[\sum_{j} w_{ij} x_j \right] \\
&= \alpha^2 \sum_{j}\operatorname{Var}\left[ w_{ij} x_j \right] \\
&= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right] - \mathbb E^2 \left[ w_{ij} x_j \right] \right) \label{eq-var-expand-var}\\
&= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right] - {\color{red}\mathbb E^2 \left[ w_{ij} \right]} \mathbb E^2 \left[ x_j \right] \right) \label{eq-var-expand-expectation-sq} \\
&= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right]\right) \label{eq-var-drop-zero-exp} \\
&= \alpha^2 \sum_j \left( \mathbb E\left[ w_{ij}^2x_j^2 \right]\right) \label{eq-var-expand-sq-expectation}\\
&= \alpha^2 \sum_j \left( \mathbb E\left[ w_{ij}^2\right]\mathbb E\left[x_j^2 \right]\right) \label{eq-var-propagate-exp}\\
&= \alpha^2 \sum_j \left( \left(\mathbb E\left[ w_{ij}^2\right] - {\color{red}\mathbb E^2\left[ w_{ij} \right]}\right)\left(\mathbb E\left[x_j^2 \right] - {\color{red}\mathbb E^2\left[ x_j \right]} \right) \right) \label{eq-var-add-zero-exp-w-x}\\
&= \alpha^2 \sum_j \left( \operatorname{Var}\left[ w_{ij} \right]\operatorname{Var}\left[x_j \right]\right) \label{eq-var-form-var-w-x}\\
&= D \alpha^2 \left( \sigma_w^2\sigma_x^2\right) \label{eq-var-calculate-sum}.
\end{align}
$$

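In the derivation,

  • the second equality splits the variance of the sum into a sum of variances, which assumes the terms $w_{ij} x_j$ are uncorrelated with each other,
  • $\eqref{eq-var-expand-expectation-sq}$ and $\eqref{eq-var-propagate-exp}$ factorize the expectations, which assumes the weights $w_{ij}$ are independent of the inputs $x_j$,
  • the red term in $\eqref{eq-var-expand-expectation-sq}$, ${\color{red}\mathbb E^2 \left[ w_{ij} \right]}$, is zero because the weights are zero centered,
  • in $\eqref{eq-var-add-zero-exp-w-x}$ we assume the input also has zero expectation, i.e., ${\color{red}\mathbb E^2\left[ x_j \right]}=0$, so subtracting the red terms changes nothing,
  • in the last step $\eqref{eq-var-calculate-sum}$, we assume all the variances are the same, $\operatorname{Var}\left[ w_{ij} \right]=\sigma_w^2$ and $\operatorname{Var}\left[x_j \right]=\sigma_x^2$,
  • $D$ is the dimension of the input data $x$.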
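The result $\operatorname{Var}[y_i] = D \alpha^2 \sigma_w^2 \sigma_x^2$ can be checked numerically. The snippet below is a minimal sketch, assuming NumPy is available and that both weights and inputs are drawn from zero-mean normal distributions; the function name `output_variance` and all parameter values are illustrative choices, not part of the derivation.

```python
import numpy as np

def output_variance(D, alpha, sigma_w, sigma_x, n_samples=20_000, seed=0):
    """Empirical variance of y = alpha * sum_j w_j x_j for zero-mean w and x."""
    rng = np.random.default_rng(seed)
    # Each row is one independent draw of D weights and D inputs (assumed Gaussian).
    w = rng.normal(0.0, sigma_w, size=(n_samples, D))
    x = rng.normal(0.0, sigma_x, size=(n_samples, D))
    y = alpha * np.sum(w * x, axis=1)
    return y.var()

# Illustrative values only.
D, alpha, sigma_w, sigma_x = 100, 0.5, 0.3, 2.0
empirical = output_variance(D, alpha, sigma_w, sigma_x)
predicted = D * alpha**2 * sigma_w**2 * sigma_x**2
print(f"empirical={empirical:.3f}  predicted={predicted:.3f}")
```

The two numbers should agree up to sampling noise, which shrinks as `n_samples` grows.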

If we require the output variance to match the input variance, $\operatorname{Var}\left[ y_i \right] = \sigma_x^2$, the variance of the weights should be

$$ \sigma_w^2 = \frac{1}{D \alpha^2}. $$

For the identity activation, $\alpha=1$, thus

$$ \sigma_w^2 = \frac{1}{D}. $$

The more features the input has, the smaller the initial variance of the weights should be.
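To see why the scale $1/D$ matters, we can push random data through a stack of linear layers and track the variance. This is a rough sketch, assuming NumPy, square layers of width $D$, the identity activation ($\alpha = 1$), and Gaussian weights; `variance_through_layers` is a hypothetical helper written for this illustration.

```python
import numpy as np

def variance_through_layers(D, n_layers, sigma_w2, n_samples=2_000, seed=0):
    """Variance of the activations after each of n_layers linear layers of width D."""
    rng = np.random.default_rng(seed)
    h = rng.normal(0.0, 1.0, size=(n_samples, D))  # zero-mean, unit-variance input
    variances = [h.var()]
    for _ in range(n_layers):
        # Zero-centered weights with variance sigma_w2; identity activation.
        W = rng.normal(0.0, np.sqrt(sigma_w2), size=(D, D))
        h = h @ W.T
        variances.append(h.var())
    return variances

D, n_layers = 256, 10
stable = variance_through_layers(D, n_layers, 1.0 / D)     # sigma_w^2 = 1/D
exploding = variance_through_layers(D, n_layers, 2.0 / D)  # variance too large
vanishing = variance_through_layers(D, n_layers, 0.5 / D)  # variance too small
print(stable[-1], exploding[-1], vanishing[-1])
```

With $\sigma_w^2 = 1/D$ the variance stays close to its input value, while a factor of two in either direction compounds with depth, growing or shrinking roughly like $2^{\pm 10}$ over ten layers.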

Let $\hat D$ be the output dimension of the layer. The same argument applied to the backward pass shows that we need

$$ \sigma_w^2 = \frac{1}{\hat D} $$

to keep back-propagation stable, i.e., to have similar gradient variance across layers.
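The backward condition can be illustrated in the same way. For a linear layer $y = W x$ with $W$ of shape $\hat D \times D$, the gradient with respect to the input is $W^\top$ applied to the gradient with respect to the output, so each component sums over $\hat D$ terms. The sketch below assumes NumPy, Gaussian weights, and a unit-variance upstream gradient; the names `grad_y` and `grad_x_variance` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_hat = 300, 50      # input and output dimensions (illustrative values)
n_samples = 5_000

# Assumed upstream gradient dL/dy with zero mean and unit variance.
grad_y = rng.normal(0.0, 1.0, size=(n_samples, D_hat))

def grad_x_variance(sigma_w2):
    """Variance of dL/dx = W^T dL/dy for y = W x, with W of shape (D_hat, D)."""
    W = rng.normal(0.0, np.sqrt(sigma_w2), size=(D_hat, D))
    grad_x = grad_y @ W  # row-wise version of W^T grad_y, shape (n_samples, D)
    return grad_x.var()

var_backward_rule = grad_x_variance(1.0 / D_hat)  # keeps gradient variance near 1
var_forward_rule = grad_x_variance(1.0 / D)       # scales it by about D_hat / D
print(var_backward_rule, var_forward_rule)
```

When $D \neq \hat D$, the forward rule $1/D$ rescales the gradient variance by roughly $\hat D / D$ per layer, which is why the two conditions conflict.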

Xavier Initialization

The two conditions above cannot both hold unless $D = \hat D$. The Xavier initialization is a compromise between them,

$$ \sigma_w^2 = \frac{2}{D + \hat D}. $$

In Glorot2010, the authors also proposed a uniform alternative to the normal distribution, which they call the normalized initialization [2]. In this proposal, we sample uniformly from the range

$$ \left[ -\sqrt{\frac{6}{D + \hat D}}, \sqrt{\frac{6}{D + \hat D}} \right]. $$

A uniform distribution on $[-a, a]$ has variance $a^2/3$, so this range gives the same variance $2/(D + \hat D)$ as the Xavier initialization.
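Both variants can be sketched in a few lines. The code below assumes NumPy and uses the uniform limit $a = \sqrt{6/(D + \hat D)}$ from Glorot and Bengio's paper, for which the variance $a^2/3$ equals $2/(D + \hat D)$. The function names `xavier_normal` and `xavier_uniform` are chosen here for illustration and do not refer to any particular library's API.

```python
import numpy as np

def xavier_normal(D, D_hat, rng):
    """Sample a (D_hat, D) weight matrix from N(0, 2 / (D + D_hat))."""
    std = np.sqrt(2.0 / (D + D_hat))
    return rng.normal(0.0, std, size=(D_hat, D))

def xavier_uniform(D, D_hat, rng):
    """Sample a (D_hat, D) weight matrix from U[-a, a] with a = sqrt(6 / (D + D_hat))."""
    limit = np.sqrt(6.0 / (D + D_hat))
    return rng.uniform(-limit, limit, size=(D_hat, D))

rng = np.random.default_rng(0)
D, D_hat = 400, 200                 # illustrative layer sizes
target = 2.0 / (D + D_hat)          # variance both samplers aim for
W_normal = xavier_normal(D, D_hat, rng)
W_uniform = xavier_uniform(D, D_hat, rng)
print(target, W_normal.var(), W_uniform.var())
```

The two empirical variances should both land close to `target`; the uniform version additionally bounds every weight by the limit $a$.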

"What are the differences" has an interactive session on the effects of different initializations for classification tasks on MNIST.

  1. Lippe2020 Lippe P. Tutorial 3: Activation Functions — UvA DL Notebooks v1.1 documentation. In: UvA Deep Learning Tutorials [Internet]. [cited 23 Sep 2021].

  2. Glorot2010 Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. Teh YW, Titterington M, editors. 2010;9: 249–256.


L Ma (2021). 'Initialize Artificial Neural Networks', Datumorphism, 09 April. Available at:
