Cross Validation

Cross validation is a method to estimate the [[risk]].

In the learning problem posed by Vapnik, we are given a sample $\{z_i\}$ from a space $Z$ with an assumed probability measure $F(z)$, together with a set of functions $Q(z, \alpha)$ (e.g. loss functions), where $\alpha$ is a set of parameters. The risk functional is

$$ R(\alpha) = \int Q(z, \alpha) \,\mathrm d F(z), $$

and it is minimized by tuning “the handles” $\alpha$. A learning problem is the minimization of this risk (Vapnik, 2000).

To perform cross validation, we split the training dataset $\mathcal D$ into $K$ folds, with the $k$th fold denoted as $\mathcal D_k$, $k = 1, \dots, K$.

Given a model $\mathcal M(x, \theta)$ with parameter $\theta$, there are two steps in the modelling procedure:

  • Fitting
    • where the estimator estimates the parameters $\hat \theta$;
    • The fitting step can be denoted as $\hat\theta = \mathcal F(\mathcal D, \mathcal M)$
  • Prediction
    • where the estimated parameters are fed into the model to get the predictions $\mathcal M(\hat\theta)$;
    • The prediction step can be denoted as $\hat y = \mathcal M (x, \hat\theta)$.
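
The two steps above can be sketched in code. This is a minimal illustration, assuming a one-dimensional linear model fitted by least squares; the names `fit`, `predict`, `Sample` and `Theta` are hypothetical and stand in for $\mathcal F$, $\mathcal M$, a data point and $\hat\theta$ respectively.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # one (x, y) pair
Theta = Tuple[float, float]   # hypothetical parameters: (slope, intercept)


def fit(data: List[Sample]) -> Theta:
    """Fitting step: estimate theta_hat from the dataset by least squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx if sxx > 0 else 0.0  # fall back to a constant model
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x: float, theta_hat: Theta) -> float:
    """Prediction step: feed the estimated parameters into the model."""
    slope, intercept = theta_hat
    return slope * x + intercept


# Toy data lying exactly on y = 2x + 1 (made up for illustration).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
theta_hat = fit(data)            # fitting
y_hat = predict(3.0, theta_hat)  # prediction
```

Keeping fitting and prediction as separate functions mirrors the notation: the fitting step only sees the data it is given, which is what allows a fold to be held out later.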

For the $k$th fold, we perform fitting on $\mathcal D_{\sim k}$, where the subscript ${}_{\sim k}$ denotes all folds except the $k$th, and then perform prediction on the held-out fold $\mathcal D_k$. The risk on the $k$th fold can be estimated as

$$ \begin{align} R_k &= \frac{1}{\lvert \mathcal D_k \rvert}\sum_{i\in \mathcal D_k} L (y_i, \hat y_i ) \\ &= \frac{1}{\lvert \mathcal D_k \rvert}\sum_{i\in \mathcal D_k} L (y_i,\mathcal M (x_i, \hat\theta_{\sim k}) ) \\ &= \frac{1}{\lvert \mathcal D_k \rvert} \sum_{i\in \mathcal D_k} L (y_i,\mathcal M (x_i, \mathcal F(\mathcal D_{\sim k}, \mathcal M) ) ). \end{align} $$
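
The per-fold estimate above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function `k_fold_risks` and its arguments are hypothetical names, the folds are built by simple interleaving (in practice the data would usually be shuffled first), and the example model is a constant predictor with squared loss.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]  # one (x, y) pair


def k_fold_risks(
    data: Sequence[Sample],
    K: int,
    fit: Callable[[List[Sample]], float],
    predict: Callable[[float, float], float],
    loss: Callable[[float, float], float],
) -> List[float]:
    """Return [R_1, ..., R_K], the risk estimated on each held-out fold."""
    # Assumption: interleaved split, fold k takes samples k, k+K, k+2K, ...
    folds = [list(data[k::K]) for k in range(K)]
    risks = []
    for k in range(K):
        held_out = folds[k]                                       # D_k
        rest = [s for j in range(K) if j != k for s in folds[j]]  # D_{~k}
        theta_hat = fit(rest)                                     # F(D_{~k}, M)
        total = sum(loss(y, predict(x, theta_hat)) for x, y in held_out)
        risks.append(total / len(held_out))                       # R_k
    return risks


# Hypothetical model for illustration: predict the mean of the training targets.
def fit_mean(train: List[Sample]) -> float:
    return sum(y for _, y in train) / len(train)


def predict_mean(x: float, theta_hat: float) -> float:
    return theta_hat


def squared_loss(y: float, y_hat: float) -> float:
    return (y - y_hat) ** 2


data = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0), (5.0, 6.0)]
fold_risks = k_fold_risks(data, 3, fit_mean, predict_mean, squared_loss)
```

Note that the parameters are re-estimated for every fold, so no sample is ever used both to fit $\hat\theta_{\sim k}$ and to evaluate it.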

The overall $K$-fold cross validation risk $R$ is the average of the fold risks $R_k$, so that it stays on the same scale as each $R_k$,

$$ \begin{align} R = \frac{1}{K} \sum_{k=1}^K R_k. \end{align} $$

If we have $K = \lvert \mathcal D \rvert$, so that every fold has $\lvert \mathcal D_k \rvert = 1$, there is only one sample in each prediction step. This is called leave-one-out cross validation (LOOCV).
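
As a small illustration of the leave-one-out case, the sketch below holds out one sample at a time. It assumes a constant model (the mean of the remaining targets) and squared loss; `loocv_risks` is a hypothetical name, and the data are made up.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # one (x, y) pair


def loocv_risks(data: List[Sample]) -> List[float]:
    """Return one risk value per held-out sample (each fold has size 1)."""
    risks = []
    for i, (_, y_i) in enumerate(data):
        rest = data[:i] + data[i + 1:]                  # all samples except the i-th
        theta_hat = sum(y for _, y in rest) / len(rest)  # fit: mean of the rest
        y_hat = theta_hat                               # predict: constant model
        risks.append((y_i - y_hat) ** 2)                # squared loss on one sample
    return risks


data = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 6.0)]
risks = loocv_risks(data)
```

The model is fitted once per sample, so the cost grows with the size of the dataset; this is the price for using almost all of the data in every fitting step.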


L Ma (2021). 'Cross Validation', Datumorphism, 05 April. Available at: