Gini Impurity

The code used in this article can be found in this repo.

Suppose we have a dataset $\{0,1\}^{10}$, which has 10 records and 2 possible classes of objects $\{0,1\}$ in each record.

The first example we investigate is a pure 0 dataset.

object
0
0
0
0
0
0
0
0
0
0
0
0

For such an all-0 dataset, we would like to define its impurity as 0. Same with an all-1 dataset. For a dataset with 50% of 1 and 50% of 0, we would define its impurity as max due to the symmetries between 0 and 1.

Definition

Given a dataset $\{0,1,…,d\}^n$, the Gini impurity is calculated as

$$ G = \sum_{i \in \{0,1,...,d\} } p(i)(1-p(i)), $$

where $p(i)$ is the probability of a random picked record being class $i$.

In the above example, we have two classes, $\{0,1\}$. The probabilities are

$$ \begin{align} p(0) =& 1\\ p(1) =& 0 \end{align}. $$

The Gini impurity is

$$ G = p(0)(1-p(0)) + p(1)(1-p(1)) = 0+0 = 0. $$

Examples

Suppose we have another dataset with 50% of the values being 50%.

object
0
0
1
0
0
1
1
1
0
0
0
1

The Gini impurity is

$$ G = p(0)(1-p(0)) + p(1)(1-p(1)) = 0.5 * 0.5+ 0.5*0.5 = 0.5. $$

For data with two possible values $\{0,1\}$, the maximum Gini impurity is 0.25. The following chart shows all the possible values of the Gini impurity for two-value dataset.

For data with three possible values, the Gini impurity is also visualized using the same chart given the condition that $p_3 = 1 - p_1 - p_2$.