Initialize Artificial Neural Networks

#Artificial Neuron #Neural Network #Basics

Initial weights work better if they¹ are zero centered and have similar variance across layers.

Why

If the variances differ widely across layers, the optimizer needs a different learning rate for each layer. Keeping the variances on the same scale lets us use one global learning rate for the whole network.

Suppose we are using a simple linear activation, $\sigma(x) = \alpha x$. For inputs $x_j$, the outputs $y_i$ are

$$ y_i = \sigma\left( \sum_{j} w_{ij} x_j \right) = \alpha \sum_{j} w_{ij} x_j. $$

The variance of $y_i$ is

$$ \begin{align} \operatorname{Var}\left[ y \right] &= \alpha^2 \operatorname{Var}\left[\sum_{j} w_{ij} x_j \right] \\ & = \alpha^2 \sum_{j}\operatorname{Var}\left[ w_{ij} x_j \right] \\ &= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right] - \mathbb E^2 \left[ w_{ij} x_j \right] \right) \label{eq-var-expand-var}\\ &= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right] - {\color{red}\mathbb E^2 \left[ w_{ij} \right]} \mathbb E^2 \left[ x_j \right] \right) \label{eq-var-expand-expectation-sq} \\ &= \alpha^2 \sum_j \left( \mathbb E\left[ (w_{ij}x_j)^2 \right]\right) \label{eq-var-drop-zero-exp} \\ &= \alpha^2 \sum_j \left( \mathbb E\left[ w_{ij}^2x_j^2 \right]\right) \label{eq-var-expand-sq-expectation}\\ &= \alpha^2 \sum_j \left( \mathbb E\left[ w_{ij}^2\right]\mathbb E\left[x_j^2 \right]\right) \label{eq-var-propagate-exp}\\ &= \alpha^2 \sum_j \left( \left(\mathbb E\left[ w_{ij}^2\right] - {\color{red}\mathbb E^2\left[ w_{ij} \right]}\right)\left(\mathbb E\left[x_j^2 \right] - {\color{red}\mathbb E^2\left[ x_j \right]} \right) \right) \label{eq-var-add-zero-exp-w-x}\\ &= \alpha^2 \sum_j \left( \operatorname{Var}\left[ w_{ij} \right]\operatorname{Var}\left[x_j \right]\right) \label{eq-var-form-var-w-x}\\ &= D \alpha^2 \left( \sigma_w^2\sigma_x^2\right) \label{eq-var-calculate-sum}. \end{align} $$

In the derivation,

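we assume the weights and the inputs are independent of each other and across $j$, so that the variance of the sum is the sum of the variances and the expectation of a product factorizes,
the red term in $\eqref{eq-var-expand-expectation-sq}$, ${\color{red}\mathbb E^2 \left[ w_{ij} \right]}$, is zero because the weights are zero centered,
we assume the input has zero expectation in $\eqref{eq-var-add-zero-exp-w-x}$, i.e., ${\color{red}\mathbb E^2\left[ x_j \right]}=0$,
in the last step $\eqref{eq-var-calculate-sum}$, we assume all the variances are the same, $\operatorname{Var}\left[ w_{ij} \right]=\sigma_w^2$ and $\operatorname{Var}\left[x_j \right]=\sigma_x^2$,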
$D$ is the dimension of the input data $x$.
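
The result $\operatorname{Var}\left[ y \right] = D \alpha^2 \sigma_w^2 \sigma_x^2$ can be checked numerically. The following is a minimal NumPy sketch, assuming independent, zero-mean Gaussian weights and inputs; the values of `D`, `alpha`, `sigma_w`, and `sigma_x` are arbitrary choices for illustration and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 128            # input dimension (illustrative value)
alpha = 0.5        # slope of the linear activation (illustrative value)
sigma_w = 0.1      # standard deviation of the weights (illustrative value)
sigma_x = 2.0      # standard deviation of the inputs (illustrative value)
n_samples = 10_000

# Draw fresh zero-mean weights and inputs for every sample,
# so the empirical variance averages over both w and x.
w = rng.normal(0.0, sigma_w, size=(n_samples, D))
x = rng.normal(0.0, sigma_x, size=(n_samples, D))

# One output unit per sample: y = alpha * sum_j w_j x_j
y = alpha * np.sum(w * x, axis=1)

empirical_var = y.var()
predicted_var = D * alpha**2 * sigma_w**2 * sigma_x**2

print(f"empirical: {empirical_var:.3f}, predicted: {predicted_var:.3f}")
```

The two numbers should agree up to sampling noise, which shrinks as `n_samples` grows.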

If we require $\operatorname{Var}\left[ y \right] = \sigma_x^2$, the variance of the weights should be

$$ \sigma_w^2 = \frac{1}{D \alpha^2}. $$

For the identity activation, $\alpha=1$, thus

$$ \sigma_w^2 = \frac{1}{D}. $$

The more features the input has, the smaller the initial variance of the weights should be.
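
To see why this scaling matters, we can push random inputs through a stack of linear layers (with $\alpha = 1$) and track the activation variance. This is an illustrative sketch, not a training setup: the helper `forward_variances`, the width, and the depth are hypothetical choices, and square layers are assumed so that every layer has the same $D$.

```python
import numpy as np


def forward_variances(D, n_layers, sigma_w, rng, n_samples=2000):
    """Return the activation variance after each layer of a deep linear network."""
    # Unit-variance, zero-mean inputs
    x = rng.normal(0.0, 1.0, size=(n_samples, D))
    variances = []
    for _ in range(n_layers):
        # Zero-centered weights with standard deviation sigma_w
        W = rng.normal(0.0, sigma_w, size=(D, D))
        x = x @ W.T  # identity activation, alpha = 1
        variances.append(float(x.var()))
    return variances


rng = np.random.default_rng(0)
D = 256

scaled = forward_variances(D, n_layers=10, sigma_w=np.sqrt(1.0 / D), rng=rng)
unscaled = forward_variances(D, n_layers=10, sigma_w=1.0, rng=rng)

print("sigma_w^2 = 1/D:", [f"{v:.2f}" for v in scaled])
print("sigma_w^2 = 1:  ", [f"{v:.2e}" for v in unscaled])
```

With $\sigma_w^2 = 1/D$ the variance should stay close to 1 in every layer, while with $\sigma_w^2 = 1$ it should grow by roughly a factor of $D$ per layer.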

Assuming $\hat D$ is the output dimension of the layer, it can also be shown that we need

$$ \sigma_w^2 = \frac{1}{\hat D}, $$

to keep back-propagation stable as well, i.e., to have similar gradient variance across layers.
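
A sketch of where this condition comes from, under the same independence and zero-mean assumptions as in the derivation above, now applied to the gradients. Here $L$ denotes the loss and $g_i = \partial L / \partial y_i$ the gradient arriving at output $i$, with variance $\sigma_g^2$; these symbols are notation introduced for this sketch. For $y_i = \alpha \sum_j w_{ij} x_j$, the chain rule gives

$$ \frac{\partial L}{\partial x_j} = \alpha \sum_{i} w_{ij} g_i, \qquad \operatorname{Var}\left[ \frac{\partial L}{\partial x_j} \right] = \hat D \alpha^2 \sigma_w^2 \sigma_g^2. $$

The sum now runs over the $\hat D$ outputs instead of the $D$ inputs, so requiring the gradient variance to stay at $\sigma_g^2$ with $\alpha = 1$ gives $\sigma_w^2 = 1/\hat D$.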

Xavier Initialization

The Xavier initialization is a compromise between the two conditions above,

$$ \sigma_w^2 = \frac{2}{D + \hat D}. $$
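
As a minimal sketch of sampling weights with this variance, assuming NumPy and a zero-mean normal distribution; the function name `xavier_normal` and the argument names `fan_in` (for $D$) and `fan_out` (for $\hat D$) are illustrative choices, not taken from the text.

```python
import numpy as np


def xavier_normal(fan_in, fan_out, rng):
    """Sample a (fan_out, fan_in) weight matrix with variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))


rng = np.random.default_rng(0)
W = xavier_normal(fan_in=300, fan_out=100, rng=rng)

print(W.shape)             # (100, 300)
print(W.var(), 2.0 / 400)  # empirical variance vs. target variance
```

When $D = \hat D$, this reduces to $\sigma_w^2 = 1/D$, so both the forward and the backward condition hold at once.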

In Glorot2010, the authors also proposed a uniform distribution as an alternative to the normal distribution, which they call the normalized initialization². In this proposal, we sample uniformly from the range

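$$ \left[ -\frac{\sqrt{6}}{\sqrt{D + \hat D}}, \frac{\sqrt{6}}{\sqrt{D + \hat D}} \right]. $$

A uniform distribution on $[-a, a]$ has variance $a^2/3$, so this range gives $\sigma_w^2 = 2/(D + \hat D)$. This proposal is therefore essentially the same idea as the Xavier initialization.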
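
The equivalence can be checked numerically. The sketch below assumes the bound $a = \sqrt{6/(D + \hat D)}$ of the normalized initialization in Glorot2010, for which a uniform distribution on $[-a, a]$ has variance $a^2/3 = 2/(D + \hat D)$; `xavier_uniform`, `fan_in`, and `fan_out` are hypothetical names used only for illustration.

```python
import numpy as np


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a (fan_out, fan_in) weight matrix uniformly from [-a, a], a = sqrt(6 / (fan_in + fan_out))."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))


rng = np.random.default_rng(0)
fan_in, fan_out = 300, 100

W_uniform = xavier_uniform(fan_in, fan_out, rng)
target_var = 2.0 / (fan_in + fan_out)

print(W_uniform.var(), target_var)  # empirical variance vs. Xavier target
```

The empirical variance should match the Xavier target up to sampling noise, even though the samples are bounded rather than Gaussian.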

What are the differences

deeplearning.ai has an interactive session on the effects of initializations for classification tasks on MNIST.

Planted: 2021-09-23 by L Ma;

References:

L Ma (2021). 'Initialize Artificial Neural Networks', Datumorphism, 09 April. Available at: https://datumorphism.leima.is/cards/machine-learning/neural-networks/neural-networks-initialization/.
