VC Dimension
Two of the key elements in a learning problem are:
- a set of hypotheses $\mathcal H$, and
- a set of data samples $\mathcal S$.
A dataset $\mathcal S$ can be used as a probe of the hypothesis set $\mathcal H$: by checking how the hypotheses in $\mathcal H$ behave on $\mathcal S$, we can probe the learning capacity of $\mathcal H$.
Following this idea, the VC dimension of $\mathcal H$ is defined as the size of the largest dataset $\mathcal S$ that $\mathcal H$ can [[shatter]].

Shatter: given a set $\mathcal S$ and a class (collection of sets) $\mathcal H$, we say that $\mathcal H$ shatters $\mathcal S$ if for every subset $s \subseteq \mathcal S$ there is an element $h \in \mathcal H$ such that
$$ h \cap \mathcal S = s. $$
Since the power set $P(\mathcal S)$ contains all the possible subsets of $\mathcal S$, the concept can be rephrased using the power set: $\mathcal H$ shatters $\mathcal S$ if the collection of intersections $\{ h \cap \mathcal S : h \in \mathcal H \}$ is the whole power set $P(\mathcal S)$. There are some simple examples on Wikipedia: Vapnik–Chervonenkis dimension#Examples.
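For a binary hypothesis class, realizing every possible labeling of $\mathcal S$ is equivalent to the intersection condition above. Below is a minimal sketch, assuming a hypothetical class of 1D threshold classifiers $h_w(x) = \mathbb 1[x \geq w]$ (not part of the text above), that checks shattering by brute force.

```python
from itertools import product

def threshold_labelings(points, thresholds):
    """Labelings produced on `points` by the hypothetical 1D threshold class h_w(x) = [x >= w]."""
    return {tuple(1 if x >= w else 0 for x in points) for w in thresholds}

def shatters(points, thresholds):
    """True if every possible labeling of `points` is realized by some threshold."""
    return set(product([0, 1], repeat=len(points))) <= threshold_labelings(points, thresholds)

# Thresholds placed below, between, and above the points cover all realizable labelings.
print(shatters([0.0], [-1.0, 1.0]))            # True: a single point can be shattered
print(shatters([0.0, 1.0], [-1.0, 0.5, 2.0]))  # False: the labeling (1, 0) is unreachable
```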
Growth Function of $\mathcal H$
Given an input space, there exist good hypothesis sets $\mathcal H_G$ and bad ones $\mathcal H_B$.
- $\mathcal H_G$ is able to distinguish the inputs using its outputs.
- $\mathcal H_B$ produces very similar outputs for different inputs, which makes it hard to distinguish the inputs using the outputs.
In a binary classification problem, it is easier to understand the goodness of $\mathcal H$. The combinations of output labels obtained when feeding the data samples $x_1, x_2, \cdots, x_m$ to the hypotheses in $\mathcal H$ are called dichotomies, since each hypothesis splits the data samples into two parts according to the binary classes. The set of dichotomies is denoted $\mathcal H(x_1, x_2, \cdots, x_m)$.
In such binary classification problems, dichotomies can be used to understand how good the hypothesis set $\mathcal H$ is on an input space $\mathcal X$.
If we see more dichotomies, i.e., a larger $\lvert\mathcal H(x_1, x_2, \cdots, x_m)\rvert$, over the input space $\mathcal X$, we can think of $\mathcal H$ as being more powerful over $\mathcal X$.
For example, if we have $2^m$ dichotomies, it means $\mathcal H$ can realize every possible labeling of the $m$ data records, since the maximum possible number of label combinations for $m$ data records is $2^m$.
The growth function is defined as the maximum number of dichotomies $\lvert \mathcal H(x_1, x_2, \cdots, x_m) \rvert$, where we search through all possible $x_1, x_2, \cdots, x_m$ in the input space $\mathcal X$,
$$ \Pi_{\mathcal H}(m) = \max_{x_1, x_2, \cdots, x_m \in \mathcal X} \lvert \mathcal H(x_1, x_2, \cdots, x_m) \rvert. $$
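As an illustration of the definition (not part of the original derivation), the sketch below estimates $\Pi_{\mathcal H}(m)$ for the same hypothetical 1D threshold class by sampling point configurations and counting distinct dichotomies; for this class one expects $\Pi_{\mathcal H}(m) = m + 1$.

```python
import random

def dichotomies(points):
    """Dichotomies produced on `points` by the hypothetical threshold class h_w(x) = [x >= w]."""
    pts = sorted(points)
    # Thresholds below, between, and above the points realize every dichotomy this class can produce.
    thresholds = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    return {tuple(1 if x >= w else 0 for x in pts) for w in thresholds}

def growth_function_estimate(m, n_trials=200):
    """Monte-Carlo estimate of Pi_H(m): maximize the dichotomy count over sampled configurations."""
    return max(
        len(dichotomies([random.uniform(-1, 1) for _ in range(m)]))
        for _ in range(n_trials)
    )

for m in range(1, 6):
    print(m, growth_function_estimate(m), 2 ** m)  # Pi_H(m) = m + 1, which is < 2^m for m >= 2
```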
VC Dimension
For binary classification problems, $\Pi_{\mathcal H}(m) \leq 2^m$. For small $m$, it is easy to have $\Pi_{\mathcal H}(m) = 2^m$; for instance, many hypothesis sets can realize all four label combinations on $m = 2$ points. As $m$ becomes larger, not every $\mathcal H$ can keep $\Pi_{\mathcal H}(m) = 2^m$. At some point $m = m_t$, called the break point, we start to have $\Pi_{\mathcal H}(m_t) < 2^{m_t}$. The VC dimension $d_{\text{VC}}(\mathcal H)$ is the largest $m$ that can still be fully shattered, i.e., one less than the break point,
$$ d_{\text{VC}}(\mathcal H) = m_t - 1. $$
For some $\mathcal H$, we have $\Pi_{\mathcal H}(m) = 2^m$ for all $m$. These hypothesis sets have $d_{\text{VC}}(\mathcal H) = \infty$.
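To connect the break point and the VC dimension, here is a minimal sketch using another hypothetical class, intervals on the real line $h_{a,b}(x) = \mathbb 1[a \leq x \leq b]$: every 2-point set is shattered but no 3-point set is, so $m_t = 3$ and $d_{\text{VC}} = 2$.

```python
import random
from itertools import product

def interval_dichotomies(points):
    """Dichotomies on `points` produced by the hypothetical interval class h_{a,b}(x) = [a <= x <= b]."""
    pts = sorted(points)
    # Candidate endpoints below, between, and above the points are enough to realize every dichotomy.
    cuts = [pts[0] - 1] + [(u + v) / 2 for u, v in zip(pts, pts[1:])] + [pts[-1] + 1]
    labelings = {tuple(0 for _ in pts)}  # the empty interval labels every point 0
    for i, a in enumerate(cuts):
        for b in cuts[i + 1:]:
            labelings.add(tuple(1 if a <= x <= b else 0 for x in pts))
    return labelings

def is_shattered(points):
    """True if the interval class realizes all 2^m labelings of `points`."""
    return interval_dichotomies(points) == set(product([0, 1], repeat=len(points)))

for m in (2, 3):
    pts = [random.uniform(-1, 1) for _ in range(m)]
    print(m, is_shattered(pts))  # True for m = 2, False for m = 3, hence d_VC = 3 - 1 = 2
```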
Why Do We Care about the VC Dimension
Why is learning from data feasible? A learning process is the selection of good hypotheses from $\mathcal H$ using a data sample $\mathcal S$.
Given a training data sample $\mathcal S$, we train the model and calculate the in-sample error of the model, $E_{s}$. However, the in-sample error is not necessarily the same as the actual error, also known as the generalization error or out-of-sample error, $E_{g}$.
Using Hoeffding's inequality, one can derive the so-called learning bound,
$$ P\left( E_g(h) - E_s(h) \leq \sqrt{ \frac{\log\lvert \mathcal H\rvert + \log(2/\delta)}{2m} } \right) \geq 1 - \delta, \quad \forall h\in \mathcal H, $$
where $\mathcal H$ is a finite hypothesis set.
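To get a feel for the numbers, here is a small sketch evaluating this bound; the values of $\lvert\mathcal H\rvert$, $m$, and $\delta$ are hypothetical. The gap grows only logarithmically with $\lvert\mathcal H\rvert$, yet it still diverges as $\lvert\mathcal H\rvert \to \infty$.

```python
from math import log, sqrt

def hoeffding_gap(h_size, m, delta=0.05):
    """Width of the finite-H learning bound: sqrt((log|H| + log(2/delta)) / (2m))."""
    return sqrt((log(h_size) + log(2 / delta)) / (2 * m))

# Hypothetical numbers: the gap grows slowly with |H| but is useless for |H| -> infinity.
print(hoeffding_gap(h_size=1_000, m=10_000))   # ~0.023
print(hoeffding_gap(h_size=10**9, m=10_000))   # ~0.035
```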
We immediately realize that if $\lvert\mathcal H\rvert \to \infty$, the bound becomes infinite too, and the above theorem becomes useless.
Since the VC dimension captures the number of effectively different hypotheses, we might want to replace the hypothesis set size $\lvert \mathcal H\rvert$ with a quantity based on $d_{\text{VC}}(\mathcal H)$. The substitution is not literally that simple, but we do have a bound of a similar form,
$$ P\left( E_g(h) - E_s(h) \leq \sqrt{ \frac{8d\log(m/d) + 8\log(4/\delta)}{m} } \right) \geq 1 - \delta, $$
for a binary classification problem, where $d = d_{\text{VC}}(\mathcal H)$.
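As a rough numeric sketch (hypothetical values of $d$ and $\delta$, not taken from the references), the VC-based gap shrinks as $m$ grows for any finite VC dimension, which is why an infinite $\lvert\mathcal H\rvert$ is no longer an obstacle as long as $d_{\text{VC}}(\mathcal H)$ is finite.

```python
from math import log, sqrt

def vc_gap(d, m, delta=0.05):
    """Width of the VC generalization bound: sqrt((8 d log(m/d) + 8 log(4/delta)) / m)."""
    return sqrt((8 * d * log(m / d) + 8 * log(4 / delta)) / m)

# Hypothetical numbers: d_VC = 10 and growing sample sizes; the gap shrinks roughly like sqrt(d log(m) / m).
for m in (10**3, 10**4, 10**5, 10**6):
    print(m, round(vc_gap(d=10, m=m), 3))
```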
- Vapnik2000 Vapnik, V. N. (2000). The Nature of Statistical Learning Theory. Springer New York.
- Deckert2017 Deckert D-A. Advanced Topics in Machine Learning. In: Advanced Topics in Machine Learning [Internet]. Apr 2017 [cited 17 Oct 2021]. Available: https://www.mathematik.uni-muenchen.de/~deckert/teaching/SS17/ATML/
- Zhang2016 Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. arXiv [cs.LG]. 2016. Available: http://arxiv.org/abs/1611.03530
- Bernstein2021 Bernstein J. Machine learning is just statistics + quantifier reversal. In: jeremybernste [Internet]. [cited 1 Nov 2021]. Available: https://jeremybernste.in/writing/ml-is-just-statistics
- Guedj2019 Guedj B. A Primer on PAC-Bayesian Learning. arXiv [stat.ML]. 2019. Available: http://arxiv.org/abs/1901.05353
- Abu-Mostafa2012 Abu-Mostafa, Yaser S and Magdon-Ismail, Malik and Lin, Hsuan-Tien. Learning from Data. AMLBook; 2012. Available: https://www.semanticscholar.org/paper/Learning-From-Data-Abu-Mostafa-Magdon-Ismail/1c0ed9ed3201ef381cc392fc3ca91cae6ecfc698