Bias-Variance
Bias and Variance
Suppose $f(X)$ is a perfect model that captures the true relationship in the dataset $(X,Y)$ up to some irreducible error $\epsilon$,
$$ \begin{equation} Y = f(X) + \epsilon. \label{dataset-using-true-model} \end{equation} $$
On the other hand, we build another model using a specific method such as k-nearest neighbors, which is denoted as $k(X)$.
The bias measures the deficit between $k(X)$ and the perfect model $f(X)$,
$$ \operatorname{Bias}[k(X)] = E[k(X)] - f(X) $$
Zero bias means we are matching the perfect model.
The variance of the model measures the consistency of its predictions around their expectation,
$$ \operatorname{Variance} ( k(X) ) = \operatorname{E} \left( ( k(X) - \operatorname{E}( k(X) ) )^2 \right) $$
The larger the variance, the more wiggly the model is.
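As a concrete check, both quantities can be estimated by refitting the model on many fresh training sets and comparing the averaged predictions with the truth. The following is a minimal sketch, not a prescription: it assumes a toy truth $f(x)=\sin x$, Gaussian noise, and scikit-learn's `KNeighborsRegressor` as $k(X)$; the truth function, noise level, and sample sizes are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)

def f(x):
    """The 'perfect' model (assumed known for this toy example)."""
    return np.sin(x)

x_test = np.linspace(0, 2 * np.pi, 50)          # fixed evaluation points X
n_repeats, n_train, noise_sd, k = 200, 80, 0.3, 5

predictions = np.empty((n_repeats, x_test.size))
for i in range(n_repeats):
    # draw a fresh training set (X, Y) with Y = f(X) + epsilon
    x_train = rng.uniform(0, 2 * np.pi, n_train)
    y_train = f(x_train) + rng.normal(0, noise_sd, n_train)
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(x_train.reshape(-1, 1), y_train)
    predictions[i] = model.predict(x_test.reshape(-1, 1))

# Bias[k(X)] = E[k(X)] - f(X),  Variance[k(X)] = E[(k(X) - E[k(X)])^2]
bias = predictions.mean(axis=0) - f(x_test)
variance = predictions.var(axis=0)
print("mean |bias|:   ", np.abs(bias).mean())
print("mean variance: ", variance.mean())
```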
Mean Squared Error
Bias measures the deficit between the specific model and the perfect model. To measure the deficit between the specific model and the actual data point, we need the Mean Squared Error (MSE).
The Mean Squared Error (MSE) is defined as
$$ \begin{equation} \operatorname{MSE}(X) = \operatorname{E} \left( ( Y - k(X) )^2 \right). \end{equation} $$
This expected error can also be used to evaluate models, i.e., the expectation can be taken over models trained on different samples of the data. A straightforward decomposition using equation ($\ref{dataset-using-true-model}$) shows that the expected error has three components. There are several ways to derive this decomposition; we show only one here.
$$ \begin{align} \operatorname{Expected Error}(X) &= \operatorname{E} \left( ( Y - k )^2 \right) \nonumber \\ &= \operatorname{E} \left( ( f + \epsilon - k )^2 \right) \\ &= \operatorname{E} \left( \left( (f - \operatorname{E} k) - (k - \operatorname{E}k) + \epsilon \right)^2 \right) \\ &= \operatorname{E} \left( (f - \operatorname{E} k)^2 + (k - \operatorname{E}k)^2 + \epsilon^2 - 2(f - \operatorname{E} k)(k - \operatorname{E}k) + 2(f - \operatorname{E} k) \epsilon - 2(k - \operatorname{E}k)\epsilon \right) \\ &= \operatorname{E} \left( (f - \operatorname{E} k)^2\right) + \operatorname{E}\left((k - \operatorname{E}k)^2\right) + \operatorname{E} \left( \epsilon^2 \right) {\color{red}- 2\operatorname{E} \left( (f - \operatorname{E} k)(k - \operatorname{E}k) \right) + 2\operatorname{E} \left( (f - \operatorname{E} k) \epsilon \right) - 2\operatorname{E} \left( (k - \operatorname{E}k)\epsilon \right)} \\ &= (f - \operatorname{E} k)^2 + \operatorname{E}\left((k - \operatorname{E}k)^2\right) + \operatorname{E} \left( \epsilon^2 \right)\\ &= \operatorname{Bias} ( k )^2 + \operatorname{Variance} (k) + \text{Irreducible Error} \end{align} $$
In this derivation, we’ve used several relations.
- We used $\operatorname{E} \left( (f - \operatorname{E} k)^2\right) = (f - \operatorname{E} k)^2$ because the term $(f - \operatorname{E} k)^2$ is a constant, so its expectation is the term itself.
- We have dropped the terms in red because each of them has zero expectation, as shown below. For the terms involving $\epsilon$, we use the fact that the irreducible error has zero mean, $\operatorname{E}(\epsilon)=0$; if its mean were not zero, the model $f(X)$ would not be perfect.
- $\operatorname{E} \left( (f - \operatorname{E} k)(k - \operatorname{E}k) \right)= (f - \operatorname{E} k)\operatorname{E} \left( (k - \operatorname{E}k) \right)= 0.$
- $\operatorname{E} \left( (f - \operatorname{E} k) \epsilon \right) = (f - \operatorname{E} k) \operatorname{E} \left( \epsilon \right) = 0.$
- $\operatorname{E} \left( (k - \operatorname{E}k)\epsilon \right) = 0$, assuming the noise $\epsilon$ is independent of the model prediction $k(X)$.
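The decomposition can also be verified numerically. The minimal sketch below reuses the same toy assumptions as above (an assumed truth $f(x)=\sin x$, Gaussian noise with known standard deviation, and a k-nearest-neighbor regressor): it estimates the left-hand side $\operatorname{E}\left((Y - k)^2\right)$ directly and compares it with $\operatorname{Bias}^2 + \operatorname{Variance} + \operatorname{E}(\epsilon^2)$.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
f = np.sin                                      # assumed "perfect" model f(X)
x_test = np.linspace(0, 2 * np.pi, 50)          # fixed evaluation points X
noise_sd, n_repeats, n_train = 0.3, 500, 80

preds = np.empty((n_repeats, x_test.size))
sq_err = np.empty((n_repeats, x_test.size))
for i in range(n_repeats):
    # train k(X) on a fresh training set with Y = f(X) + epsilon
    x_tr = rng.uniform(0, 2 * np.pi, n_train)
    y_tr = f(x_tr) + rng.normal(0, noise_sd, n_train)
    model = KNeighborsRegressor(n_neighbors=5).fit(x_tr.reshape(-1, 1), y_tr)
    preds[i] = model.predict(x_test.reshape(-1, 1))
    # squared error against fresh noisy targets Y = f(X) + epsilon
    y_test = f(x_test) + rng.normal(0, noise_sd, x_test.size)
    sq_err[i] = (y_test - preds[i]) ** 2

expected_error = sq_err.mean()                              # E[(Y - k)^2]
bias_sq = ((preds.mean(axis=0) - f(x_test)) ** 2).mean()    # Bias(k)^2
variance = preds.var(axis=0).mean()                         # Variance(k)
irreducible = noise_sd ** 2                                 # E[epsilon^2]
print(f"E[(Y - k)^2]           = {expected_error:.4f}")
print(f"Bias^2 + Var + sigma^2 = {bias_sq + variance + irreducible:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, since the cross terms vanish in expectation exactly as in the derivation above.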
Bias-Variance Tradeoff
The more parameters we introduce into the model, the more likely we are to reduce the bias. However, at some point the added complexity makes the model more wiggly, and the variance grows.
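For k-nearest neighbors, the number of neighbors acts as an inverse complexity knob: small $k$ gives a flexible, wiggly fit, while large $k$ gives a smooth, rigid one. The sketch below, using the same illustrative toy setup as above, sweeps $k$ and prints the estimated $\operatorname{Bias}^2$ and variance, which move in opposite directions as the complexity changes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(7)
f = np.sin                                      # assumed "perfect" model f(X)
x_test = np.linspace(0, 2 * np.pi, 50)
noise_sd, n_repeats, n_train = 0.3, 300, 80

for k in (1, 5, 20, 60):                        # fewer neighbors = more flexible model
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        x_tr = rng.uniform(0, 2 * np.pi, n_train)
        y_tr = f(x_tr) + rng.normal(0, noise_sd, n_train)
        model = KNeighborsRegressor(n_neighbors=k).fit(x_tr.reshape(-1, 1), y_tr)
        preds[i] = model.predict(x_test.reshape(-1, 1))
    bias_sq = ((preds.mean(axis=0) - f(x_test)) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print(f"k={k:>2}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```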