Artificial Neural Networks

#Machine Learning #Artificial Neural Networks #Basics

Artificial neural networks work pretty well for solving some differential equations.

Universal Approximators

Kurt Hornik, Maxwell Stinchcombe, and Halbert White proved that there are no theoretical constraints preventing feedforward networks from approximating any measurable function. In principle, one can use feedforward networks to approximate measurable functions to any desired accuracy, given enough hidden units.

However, convergence slows down as the number of hidden units grows, so there is a trade-off between accuracy and convergence rate: more hidden units give more accuracy but slower convergence.

Here is a quick review of the history of this topic.

Kolmogorov’s Theorem

Kolmogorov’s theorem shows that a continuous multivariable function on a compact set can be built from a finite number of carefully chosen continuous one-variable functions, combined through sums and compositions.

Here is the exact math.
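
For reference, the statement usually quoted as the Kolmogorov (or Kolmogorov-Arnold) superposition theorem is sketched here. The symbols $\Phi_q$ and $\phi_{q,p}$ are notation introduced for this sketch and do not come from the original note. For any continuous function $f$ on $[0,1]^n$ there exist continuous one-variable functions $\Phi_q$ and $\phi_{q,p}$ such that

$$ f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right). $$

Only sums and compositions of one-variable functions appear on the right-hand side.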

Cybenko 1989
Cybenko proved that finite sums of the form
$$ \sum_k v_k \sigma(w_k x + u_k) $$
can approximate continuous functions on the unit hypercube arbitrarily well, because such sums are dense in the space of continuous functions. In this result, $\sigma$ is a continuous sigmoidal function and the parameters $v_k$, $w_k$, $u_k$ are real.
Hornik 1989
“Single hidden layer feedforward networks can approximate any measurable functions arbitrarily well regardless of the activation function, the dimension of the input and the input space environment.”
Reference: http://deeplearning.cs.cmu.edu/notes/Sonia_Hornik.pdf
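
To make the density statement a bit more concrete, here is a minimal numerical sketch of fitting a Cybenko-style sum to $\sin(2\pi x)$ on $[0, 1]$. It rests on several assumptions that are not in the original note: the target function, the 50 hidden units, and the shortcut of drawing $w_k$ and $u_k$ at random and solving only for $v_k$ by least squares. It illustrates the form of the sum and is not a proof of the theorem.

```python
import numpy as np

def sigmoid(x):
    """Continuous sigmoidal function used in the Cybenko-style sum."""
    return 1.0 / (1.0 + np.exp(-x))

def fit_sigmoid_sum(x, target, n_units, rng):
    """Fit sum_k v_k * sigmoid(w_k * x + u_k) to the target values.

    Assumption for brevity: w_k and u_k are drawn at random and kept fixed,
    and only the outer weights v_k are solved for by least squares.
    """
    w = rng.normal(scale=5.0, size=n_units)
    u = rng.uniform(-5.0, 5.0, size=n_units)
    features = sigmoid(np.outer(x, w) + u)  # shape (len(x), n_units)
    v, *_ = np.linalg.lstsq(features, target, rcond=None)
    return w, u, v

def sigmoid_sum(x, w, u, v):
    """Evaluate sum_k v_k * sigmoid(w_k * x + u_k)."""
    return sigmoid(np.outer(x, w) + u) @ v

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 200)
w, u, v = fit_sigmoid_sum(x_train, np.sin(2 * np.pi * x_train), n_units=50, rng=rng)

# Check the approximation on a finer grid than the one used for fitting.
x_check = np.linspace(0.0, 1.0, 1000)
max_error = np.max(np.abs(sigmoid_sum(x_check, w, u, v) - np.sin(2 * np.pi * x_check)))
print(f"max abs error with 50 units: {max_error:.2e}")
```

Changing `n_units` gives a rough feeling for how the approximation improves as more sigmoids are added.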

Dense

A set A being dense in a set X means that we can use elements of A to approximate any element of X arbitrarily well. Mathematically, for any given element x in X, every neighbourhood of x has a non-empty intersection with A.

Measurable Function

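Roughly speaking, a function is measurable if the preimage of every measurable set is itself measurable. Every continuous function is measurable, but a measurable function does not have to be continuous, so this is a much broader class of functions.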

Activation Functions

Neural networks usually consist of some affine transformations and some activation functions¹

$$ \mathbf Y = H(\mathbf X, \mathbf W) $$

For example, $H$ can be a combination of a linear transformation $\hat L = \mathbf W \cdot $ and a nonlinear activation function $\sigma(\cdot)$, i.e., $H(\mathbf X, \mathbf W) = \sigma( \mathbf W \mathbf X)$. This is a super simple example; the transformation can be much more complicated.
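
A minimal sketch of this single transformation in NumPy is given here. The shapes, the random values, and the choice of `tanh` as the default activation are assumptions made for illustration only.

```python
import numpy as np

def layer(X, W, activation=np.tanh):
    """Apply H(X, W) = activation(W X): a linear map followed by a nonlinearity."""
    return activation(W @ X)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))  # 3 input features, 5 samples stored as columns
W = rng.normal(size=(4, 3))  # maps 3 inputs to 4 hidden units
Y = layer(X, W)
print(Y.shape)  # (4, 5)
```

Stacking several such calls, each with its own weight matrix, gives the more complicated transformations mentioned above.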

There are many activation functions.

[[Uni-Polar Sigmoid Function]]: Uni-polar sigmoid function and its properties
[[Bipolar Sigmoid Function]]: BiPolar sigmoid function and its properties
[[Hyperbolic Tangent]]: Tanh function and its properties
[[Radial Basis Function]]: Radial Basis Function and its properties
[[Conic Section Function]]: Conic Section Function and its properties
[[ReLu]]: Rectified Linear Unit, aka ReLu, and its properties
[[Leaky ReLu]]: Leaky ReLu and its properties
[[ELU]]: ELU and its properties
[[Swish]]: Swish and its properties
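
For orientation, a few of the functions listed above can be written down in a handful of lines. This is a sketch using the commonly quoted definitions. The default values of `alpha` and `beta` are conventional choices assumed here, not values taken from the linked notes.

```python
import numpy as np

def unipolar_sigmoid(x):
    """Maps the real line to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def bipolar_sigmoid(x):
    """Maps the real line to (-1, 1); equal to tanh(x / 2)."""
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but with a small slope alpha for negative inputs."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """Exponential linear unit: alpha * (exp(x) - 1) for non-positive inputs."""
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def swish(x, beta=1.0):
    """Swish: x times a sigmoid of beta * x."""
    return x * unipolar_sigmoid(beta * x)

x = np.linspace(-3.0, 3.0, 7)
for fn in (unipolar_sigmoid, bipolar_sigmoid, np.tanh, relu, leaky_relu, elu, swish):
    print(fn.__name__, np.round(fn(x), 3))
```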

Lippe P. Tutorial 3: Activation Functions — UvA DL Notebooks v1.1 documentation. In: UvA Deep Learning Tutorials [Internet].

Solving Differential Equations

The problem to solve here is

$$ \frac{d}{dt}y(t)= - y(t), $$

with initial condition $y(0)=1$.

To construct a single-layer neural network, the function is decomposed as

$$ \begin{align} y(t_i) & = y(t_0) + t_i v_k f(t_i w_k+u_k) \\ &= 1+t_i v_k f(t_i w_k+u_k) , \end{align} $$

where $y(t_0)$ is the initial condition and $k$ is summed over.

The function $f$ presumably acts as the gate that controls whether a neuron is triggered or not. The following expit function serves this purpose well,

$$ f(x) = \frac{1}{1+\exp(-x)}. $$

One important reason for choosing this is that a lot of expressions can be calculated analytically and easily.
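
As one illustration of this convenience (a sketch that is not part of the original note), the derivative of the expit function can be written in terms of the function itself,

$$ f'(x) = f(x)\left(1 - f(x)\right), $$

so the time derivative of the decomposed $y(t_i)$ is also available in closed form,

$$ \frac{d}{dt} y(t_i) = v_k f(t_i w_k + u_k) + t_i v_k w_k f(t_i w_k + u_k)\left(1 - f(t_i w_k + u_k)\right), $$

with $k$ summed over as before.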

Fermi-Dirac Distribution

Aha, this is the same functional form as the Fermi-Dirac distribution, up to the sign of the argument.

Given the form of the equation to be solved, we can define a cost

$$ I=\sum_i\left( \frac{dy}{dt}(t_i)+y(t_i) \right)^2, $$

which should be minimized towards 0; it approaches 0 when the network parameters are optimized for this problem.

Now the task becomes clear:

Write down the cost analytically;
Minimize the cost to find the network parameters;
Substitute the parameters back into the function and we are done.
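
The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the number of hidden units, the grid of time points, the random initialization, and the use of `scipy.optimize.minimize` with BFGS are all choices made here for the example.

```python
import numpy as np
from scipy.optimize import minimize

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def trial_solution(params, t):
    """y(t) = 1 + t * sum_k v_k f(t w_k + u_k); y(0) = 1 holds by construction."""
    v, w, u = np.split(params, 3)
    return 1.0 + t * (expit(np.outer(t, w) + u) @ v)

def trial_derivative(params, t):
    """Analytic dy/dt of the trial solution, using f' = f (1 - f)."""
    v, w, u = np.split(params, 3)
    f = expit(np.outer(t, w) + u)
    return f @ v + t * ((f * (1.0 - f)) @ (v * w))

def cost(params, t):
    """I = sum_i (dy/dt(t_i) + y(t_i))^2 for the equation dy/dt = -y."""
    residual = trial_derivative(params, t) + trial_solution(params, t)
    return np.sum(residual ** 2)

n_hidden = 5                        # hypothetical network size
t_grid = np.linspace(0.0, 1.0, 20)  # hypothetical grid of time points t_i
rng = np.random.default_rng(0)
initial_params = rng.normal(scale=0.5, size=3 * n_hidden)

# Step 2: minimize the analytic cost over all parameters (v, w, u).
result = minimize(cost, initial_params, args=(t_grid,), method="BFGS")

# Step 3: substitute the parameters back and compare with the exact exp(-t).
initial_cost = cost(initial_params, t_grid)
final_cost = result.fun
max_error = np.max(np.abs(trial_solution(result.x, t_grid) - np.exp(-t_grid)))
print(f"cost: {initial_cost:.3e} -> {final_cost:.3e}, max error: {max_error:.3e}")
```

The comparison with $e^{-t}$ is only possible because this toy equation has a known solution; for a real problem the cost itself is the only available diagnostic.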

Overfitting

It is possible to overfit a network so that it works only for the training data. To avoid that, people use several strategies.

Split the data into two parts, one for training and one for testing. A YouTube video
Throw more data in: at least 10 times as many examples as the degrees of freedom (DoFs) of the model. A YouTube video
Regularize by plugging an artificial term into the cost function; as an example we could add the . A YouTube video
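
Two of these strategies, the train/test split and an added penalty term, can be sketched on a toy polynomial fit. Everything in this example is an assumption made for illustration: the synthetic data, the degree-9 polynomial, and in particular the L2 penalty on the coefficients, which is one common choice and not necessarily the term the text above had in mind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): noisy samples of a smooth function.
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Strategy 1: hold out part of the data for testing.
indices = rng.permutation(x.size)
train_idx, test_idx = indices[:30], indices[30:]

def design_matrix(x, degree=9):
    """Polynomial features 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

X_train, X_test = design_matrix(x[train_idx]), design_matrix(x[test_idx])
y_train, y_test = y[train_idx], y[test_idx]

def fit(X, y, penalty=0.0):
    """Minimize |X c - y|^2 + penalty * |c|^2 (an L2 penalty, assumed here)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

def mse(X, y, coeffs):
    return np.mean((X @ coeffs - y) ** 2)

# Strategy 3: compare the plain fit with the penalized one.
plain = fit(X_train, y_train)
regularized = fit(X_train, y_train, penalty=1e-2)

for name, coeffs in [("plain", plain), ("regularized", regularized)]:
    print(name, "train:", mse(X_train, y_train, coeffs), "test:", mse(X_test, y_test, coeffs))
```

The penalty always raises the training error slightly and shrinks the coefficients; whether it lowers the test error depends on the data and on the penalty strength, which is why the held-out set is needed to judge it.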

Neural Network and Finite Element Method

We consider the solution to a differential equation

$$ \mathcal L \psi - f = 0. $$

Neural networks are quite similar to the finite element method. In the spirit of the finite element method, we can write down a neural-network-structured form of a function ²

$$ \psi(x_i) = A(x_i) + F(x_i, \mathcal N_i), $$

where $\mathcal N$ is the neural network structure. Specifically,

$$ \mathcal N_i = \sigma( w_{ij} x_j + u_i ). $$

The function is parameterized using the network. Such a parameterization is similar to the collocation method in the finite element method, where multiple basis functions are used at each location.

One of the choices of the function $F$ is a linear combination,

$$ F(x_i, \mathcal N_i) = x_i \mathcal N_i, $$

and $A(x_i)$ should take care of the boundary condition.
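
A small sketch can show why this form handles the boundary condition automatically. It assumes a one-dimensional problem with the condition imposed at $x = 0$, a constant $A(x)$ equal to the boundary value, and hypothetical output weights `v` that combine the hidden units; none of these specifics come from the original note.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def psi(x, boundary_value, v, w, u):
    """psi(x) = A(x) + x * sum_i v_i * sigma(w_i x + u_i), with constant A(x)."""
    hidden = sigmoid(np.outer(x, w) + u)  # N_i evaluated at every x
    return boundary_value + x * (hidden @ v)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)
for _ in range(3):
    v, w, u = rng.normal(size=(3, 4))     # three random parameter sets
    values = psi(x, boundary_value=2.0, v=v, w=w, u=u)
    print(values[0])                      # always 2.0: psi(0) = A(0)
```

Because the factor $x$ vanishes at the boundary, the optimizer is free to change the network parameters without ever violating the boundary condition.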

Relation to finite element method

This function is similar to the basis-function approximation in the finite element method. The goal in the finite element method is to find the coefficient of each basis function to achieve a good approximation. In the ANN method, each sigmoid is the analogue of a basis function, but we are looking for both the coefficients of the sigmoids and their internal parameters. In this sense, the sigmoid functions are a kind of adaptive basis functions.

With such a parameterization, the differential equation itself is parameterized,

$$ \mathcal L \psi - f = 0, $$

so the minimization target should be

$$ \lvert \mathcal L \psi - f \rvert^2 \to 0 $$

at each point.

Planted: 2018-11-19 by L Ma;
