Gini Impurity
Suppose we have a dataset in $\{0,1\}^{10}$, that is, 10 records, each of which belongs to one of 2 possible classes $\{0,1\}$.
The first example we investigate is a pure dataset that contains only class 0.
object |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
For such an all-0 dataset, we would like to define its impurity as 0, and likewise for an all-1 dataset. For a dataset with 50% 1s and 50% 0s, the impurity should reach its maximum, because of the symmetry between 0 and 1.
Given a dataset in $\{0,1,\dots,d\}^n$, i.e., $n$ records with $d+1$ possible classes, the Gini impurity is calculated as
$$ G = \sum_{i \in \{0,1,\dots,d\} } p(i)(1-p(i)), $$
where $p(i)$ is the probability that a randomly picked record belongs to class $i$.
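The formula translates directly into code. The following is a minimal sketch rather than a reference implementation: the function name `gini_impurity` is hypothetical, and it assumes that $p(i)$ is estimated as the fraction of records carrying label $i$.

```python
from collections import Counter


def gini_impurity(labels):
    """Return the Gini impurity of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    # Assumption: p(i) is estimated as the fraction of records with class i.
    probabilities = [count / n for count in Counter(labels).values()]
    # Sum p(i) * (1 - p(i)) over the classes that occur in the data.
    return sum(p * (1 - p) for p in probabilities)


print(gini_impurity([0] * 10))    # every record is class 0 -> 0.0
print(gini_impurity([0, 1] * 5))  # half class 0, half class 1 -> 0.5
```

Classes that never occur have $p(i) = 0$ and contribute nothing to the sum, so it is enough to iterate over the labels that are actually present.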
In the above example, we have two classes, $\{0,1\}$. The probabilities are
$$ \begin{align} p(0) &= 1, \\ p(1) &= 0. \end{align} $$
The Gini impurity is
$$ G = p(0)(1-p(0)) + p(1)(1-p(1)) = 0+0 = 0. $$
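Because the probabilities sum to 1, the definition can also be rewritten in an equivalent form that is often quicker to evaluate by hand:

$$ G = \sum_{i} p(i) - \sum_{i} p(i)^2 = 1 - \sum_{i} p(i)^2. $$

For the all-0 dataset this gives $G = 1 - (1^2 + 0^2) = 0$, in agreement with the term-by-term calculation.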
Suppose we have another dataset in which 50% of the values are 0 and 50% are 1.
object |
0 |
0 |
1 |
0 |
0 |
1 |
1 |
1 |
0 |
0 |
0 |
1 |
The Gini impurity is
$$ G = p(0)(1-p(0)) + p(1)(1-p(1)) = 0.5 \times 0.5 + 0.5 \times 0.5 = 0.5. $$
For data with two possible values $\{0,1\}$, the maximum Gini impurity is 0.5, reached when $p(0) = p(1) = 0.5$. The following chart shows all the possible values of the Gini impurity for a two-valued dataset.
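The shape of that curve can also be checked numerically. The sketch below is only an illustration, not code from the original article: `binary_gini`, `steps`, and the grid resolution are hypothetical names and choices. It evaluates the definition of $G$ on a grid of $p(1)$ values and picks out the largest result.

```python
def binary_gini(p1):
    """Gini impurity of a two-class dataset where class 1 has probability p1."""
    p0 = 1 - p1
    return p0 * (1 - p0) + p1 * (1 - p1)


# Evaluate G on an evenly spaced grid of p(1) values from 0 to 1.
steps = 10  # hypothetical resolution, chosen only for illustration
grid = [k / steps for k in range(steps + 1)]
curve = [(p1, binary_gini(p1)) for p1 in grid]

for p1, g in curve:
    print(f"p(1) = {p1:.1f}  G = {g:.2f}")

# Locate the grid point with the largest impurity.
best_p1, best_g = max(curve, key=lambda point: point[1])
```

On this grid the largest value is $G = 0.5$ at $p(1) = 0.5$, and the curve falls to 0 at both ends, which matches the two worked examples above.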
For data with three possible values, the Gini impurity can be visualized in the same kind of chart, given the constraint $p_3 = 1 - p_1 - p_2$.
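The constraint means that only $p_1$ and $p_2$ are free, so the impurity is a surface over a triangle of $(p_1, p_2)$ pairs. As a rough sketch under that assumption (the names `gini_three`, `surface`, and `steps` are hypothetical and not taken from the original article), the surface can be tabulated with exact fractions:

```python
from fractions import Fraction


def gini_three(p1, p2):
    """Gini impurity for three classes, with p3 fixed by p3 = 1 - p1 - p2."""
    p3 = 1 - p1 - p2
    if min(p1, p2, p3) < 0:
        raise ValueError("probabilities must be non-negative and sum to 1")
    return sum(p * (1 - p) for p in (p1, p2, p3))


# Scan a triangular grid of (p1, p2) pairs using exact rational arithmetic.
steps = 30  # hypothetical resolution; a multiple of 3 so that 1/3 is on the grid
surface = {}
for a in range(steps + 1):
    for b in range(steps + 1 - a):
        p1, p2 = Fraction(a, steps), Fraction(b, steps)
        surface[(p1, p2)] = gini_three(p1, p2)

# The grid point with the largest impurity.
best_point = max(surface, key=surface.get)
print(best_point, surface[best_point])
```

The largest value on this grid is $G = 2/3$ at $p_1 = p_2 = p_3 = 1/3$, and the three corners of the triangle, where a single class has probability 1, give $G = 0$.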
L Ma (2020). 'Gini Impurity', Datumorphism, 01 April. Available at: