Fisher Information

#Information Theory

Given a probability density model $f(X; \theta)$ for a observable $X$, the amount of information that $X$ carriers regarding the model is called Fisher information.

Given ${\theta}$, the probability of observing the value $X$, i.e., the likelihood is

$$ f(X\mid\theta). $$

To describe the suitability of a model and the observables, we can use a the likelihood $f(X\mid \theta)$. One particular interesting property is the sensitivity of the likelihood in terms of the parameter $\theta$ change. For example, the case on the left is less compatible as we have a large variance in the parameters. The model is not very sensitive to the parameter change.

To describe this sensitivity, we grab the derivative of the log likelihood and define a score function

$$ S(\theta) = \partial_\theta \ln f(X\mid \theta) = \frac{ \partial_\theta f(X\mid \theta) }{\ln f(X\mid\theta)}. $$

The expectation of the squared score function,

$$ I(\theta) = \mathbb E_f [\partial_\theta \ln f(X\mid \theta) ] = \int \left(\partial_\theta \ln f(X\mid\theta)) \right)^2 f(X\mid\theta) ,\mathrm dX. $$

is the Fisher Information.

Under some conditions, we can prove that it is the same as

$$ I(\theta) = \mathbb E_f [\partial^2_\theta \ln f(X\mid \theta) ] = \int f(X\mid\theta) \partial^2_\theta \ln f(X\mid\theta)) ,\mathrm dX. $$

For Bernoulli probability, we have the likelihood

$$ f(X\mid \theta) = \theta^X (1-\theta)^X, $$

where $X$ indicates side of the coin in a coin flip and $\theta$ is the probability of the coin showing head $X=1$. The Fisher information of the Bernulli model is

$$ \begin{align} I_X(\theta) =& \mathbb E _f \left[ \partial^2_\theta \theta^X (1-\theta)^X \right] \\ =& \mathbb E _f \left[ \frac{X}{\theta^2} + \frac{1-X}{(1-\theta)^2} \right] \\ =& \frac{1}{\theta(1-\theta)}. \end{align} $$