Initialize Artificial Neural Networks

The initial weights work better if they

  • are zero centered, and
  • have similar variance across layers.

Why

If the variances are very different across layers, we would need a different learning rate for each layer during optimization. By setting the variances to the same scale, we can use a single global learning rate for the whole network.

Suppose we are using a simple linear activation, $\sigma(x) = \alpha x$. For inputs $x_j$, the outputs $y_i$ of a layer are

$$ y_i = \sigma\left( \sum_{j} w_{ij} x_j \right) = \alpha \sum_{j} w_{ij} x_j. $$

The variance of $y_i$ is

$$
\begin{align}
\operatorname{Var}\left[ y \right] &= \alpha^2 \operatorname{Var}\left[\sum_{j} w_{ij} x_j \right] \\
&= \alpha^2 \sum_{j}\operatorname{Var}\left[ w_{ij} x_j \right] \\
&= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right] - \mathbb E^2 \left[ w_{ij} x_j \right] \right) \label{eq-var-expand-var}\\
&= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right] - {\color{red}\mathbb E^2 \left[ w_{ij} \right]} \mathbb E^2 \left[ x_j \right] \right) \label{eq-var-expand-expectation-sq} \\
&= \alpha^2 \sum_j \mathbb E\left[ (w_{ij}x_j)^2 \right] \label{eq-var-drop-zero-exp} \\
&= \alpha^2 \sum_j \mathbb E\left[ w_{ij}^2 x_j^2 \right] \label{eq-var-expand-sq-expectation}\\
&= \alpha^2 \sum_j \mathbb E\left[ w_{ij}^2\right]\mathbb E\left[x_j^2 \right] \label{eq-var-propagate-exp}\\
&= \alpha^2 \sum_j \left(\mathbb E\left[ w_{ij}^2\right] - {\color{red}\mathbb E^2\left[ w_{ij} \right]}\right)\left(\mathbb E\left[x_j^2 \right] - {\color{red}\mathbb E^2\left[ x_j \right]} \right) \label{eq-var-add-zero-exp-w-x}\\
&= \alpha^2 \sum_j \operatorname{Var}\left[ w_{ij} \right]\operatorname{Var}\left[x_j \right] \label{eq-var-form-var-w-x}\\
&= D \alpha^2 \sigma_w^2\sigma_x^2 \label{eq-var-calculate-sum}.
\end{align}
$$

In the derivation,

  • the red term in $\eqref{eq-var-expand-expectation-sq}$, ${\color{red}\mathbb E^2 \left[ w_{ij} \right]}$, is zero because the weights are zero centered,
  • we assume the input has zero expectation in $\eqref{eq-var-add-zero-exp-w-x}$, i.e., ${\color{red}\mathbb E^2\left[ x_j \right]}=0$,
  • the weights and inputs are assumed to be mutually independent, which lets us split the variance of the sum and factor the expectations in $\eqref{eq-var-expand-expectation-sq}$ and $\eqref{eq-var-propagate-exp}$,
  • in the last step $\eqref{eq-var-calculate-sum}$, we assume all the variances are the same, $\operatorname{Var}\left[ w_{ij} \right]=\sigma_w^2$ and $\operatorname{Var}\left[x_j \right]=\sigma_x^2$,
  • $D$ is the dimension of the input data $x$.
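The final relation $\operatorname{Var}\left[ y \right] = D \alpha^2 \sigma_w^2 \sigma_x^2$ can be checked numerically. Below is a minimal Monte Carlo sketch in numpy; the dimension, slope, and variances are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

D = 512           # input dimension
alpha = 1.5       # slope of the linear activation sigma(x) = alpha * x
sigma_w2 = 0.01   # Var[w_ij]
sigma_x2 = 4.0    # Var[x_j]
n_samples = 100_000

# Zero-centered, mutually independent weights and inputs.
w = rng.normal(0.0, np.sqrt(sigma_w2), size=(n_samples, D))
x = rng.normal(0.0, np.sqrt(sigma_x2), size=(n_samples, D))

# One sample of y per row: y = alpha * sum_j w_j x_j.
y = alpha * np.sum(w * x, axis=1)

print("empirical Var[y]:", y.var())
print("predicted D * alpha^2 * sigma_w^2 * sigma_x^2:",
      D * alpha**2 * sigma_w2 * sigma_x2)
```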

If we require the output variance to match the input variance, $\operatorname{Var}\left[ y \right] = \sigma_x^2$, the variance of the weights should be

$$ \sigma_w^2 = \frac{1}{D \alpha^2}. $$

For the identity activation ($\alpha=1$), this becomes

$$ \sigma_w^2 = \frac{1}{D}. $$

The more features the input has, the smaller the initial variance of the weights should be.
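As a sanity check, the sketch below (numpy, with an arbitrary width and depth) propagates unit-variance activations through a stack of linear layers: with $\sigma_w^2 = 1/D$ the activation variance stays roughly constant, while slightly smaller or larger choices make it vanish or explode.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 256       # width of every layer (fan-in of each weight matrix)
depth = 50    # number of stacked linear layers
batch = 1024

x = rng.normal(0.0, 1.0, size=(batch, D))   # inputs with unit variance

for scale in [1.0 / D, 0.5 / D, 2.0 / D]:   # candidate values of sigma_w^2
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(scale), size=(D, D))
        h = h @ W.T                          # identity activation (alpha = 1)
    print(f"sigma_w^2 = {scale:.5f} -> Var[h] after {depth} layers: {h.var():.4g}")
```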

Denoting the output dimension of the layer by $\hat D$, a similar argument applied to the backward pass shows that we need

$$ \sigma_w^2 = \frac{1}{\hat D}, $$

to keep backpropagation stable as well, i.e., so that the gradients have similar variance across layers.
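To see why the backward pass picks out $\hat D$: for a linear layer with $\alpha = 1$, $\partial L/\partial x_j = \sum_i w_{ij}\, \partial L / \partial y_i$, so the gradient is propagated through $W^\top$ and the same variance argument applies with $\hat D$ playing the role of the fan-in. A minimal numpy sketch with manual backprop and arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

D, D_hat = 300, 100        # input and output dimensions of the layer
batch = 4096

# Weights with Var[w] = 1 / D_hat (fan-out scaling).
W = rng.normal(0.0, np.sqrt(1.0 / D_hat), size=(D_hat, D))

grad_y = rng.normal(0.0, 1.0, size=(batch, D_hat))  # upstream gradient, unit variance
grad_x = grad_y @ W                                  # dL/dx = W^T dL/dy, row-wise

print("Var[dL/dy]:", grad_y.var())
print("Var[dL/dx]:", grad_x.var())   # ~ D_hat * sigma_w^2 = 1
```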

Xavier Initialization

The Xavier initialization is a compromise between the two requirements above,

$$ \sigma_w^2 = \frac{2}{D + \hat D}. $$
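For instance, to initialize the weight matrix of a layer with fan-in $D$ and fan-out $\hat D$, we can draw from a zero-mean normal with this variance. A minimal numpy sketch (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

D, D_hat = 784, 256              # fan-in and fan-out of the layer
sigma_w2 = 2.0 / (D + D_hat)     # Xavier / Glorot variance

W = rng.normal(0.0, np.sqrt(sigma_w2), size=(D_hat, D))
print("target Var[w]:", sigma_w2, " empirical:", W.var())
```

Deep learning frameworks expose this as a built-in initializer, e.g. `torch.nn.init.xavier_normal_` in PyTorch.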

In Glorot2010, the authors also proposed a uniform-distribution alternative to the normal distribution, which they call the normalized initialization. In this proposal, the weights are sampled uniformly from the range

$$ \left[ -\frac{\sqrt{6}}{\sqrt{D + \hat D}}, \frac{\sqrt{6}}{\sqrt{D + \hat D}} \right]. $$

This proposal is essentially the same idea as the Xavier initialization: a uniform distribution on $[-a, a]$ has variance $a^2/3$, so choosing $a = \sqrt{6/(D + \hat D)}$ gives exactly the weight variance $2/(D + \hat D)$.
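A quick numerical check that the uniform range indeed reproduces the Xavier variance (numpy sketch, arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)

D, D_hat = 784, 256
a = np.sqrt(6.0 / (D + D_hat))   # half-width of the uniform range

W = rng.uniform(-a, a, size=(D_hat, D))
print("empirical Var[w]:", W.var())
print("Xavier target 2/(D + D_hat):", 2.0 / (D + D_hat))
```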

What are the differences

deeplearning.ai has an interactive session on the effects of different initializations on an MNIST classification task.

