Gini Impurity

The code used in this article can be found in this repo.

Suppose we have a dataset $\{0,1\}^{10}$, which has 10 records and 2 possible classes of objects $\{0,1\}$ in each record.

The first example we investigate is a pure 0 dataset.

object
0
0
0
0
0
0
0
0
0
0
0
0

For such an all-0 dataset, we would like to define its impurity as 0. The same holds for an all-1 dataset. For a dataset with 50% of 1 and 50% of 0, we would define its impurity as the maximum, due to the symmetry between 0 and 1.

Definition

Given a dataset $\{0,1,\dots,d\}^n$, the Gini impurity is calculated as

$$G = \sum_{i \in \{0,1,\dots,d\}} p(i)\,(1 - p(i)),$$

where $p(i)$ is the probability of a randomly picked record being class $i$.
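The definition can be sketched in a few lines of Python. This is a minimal illustration and not necessarily the code from the repo: the function name `gini_impurity` is hypothetical, and p(i) is assumed to be estimated as the fraction of records with class i.

```python
from collections import Counter


def gini_impurity(labels):
    """Return sum_i p(i) * (1 - p(i)) over the classes present in `labels`.

    Hypothetical helper: p(i) is estimated as the fraction of records
    whose class is i. Classes that never occur contribute 0 to the sum.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())


print(gini_impurity([0] * 10))    # all records are class 0 -> 0.0
print(gini_impurity([0, 1] * 5))  # half 0, half 1 -> 0.5
```

The two calls correspond to the pure and the evenly mixed cases discussed above.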

In the above example, we have two classes, {0,1}. The probabilities are

$$
\begin{align}
p(0) &= 1 \\
p(1) &= 0.
\end{align}
$$

The Gini impurity is

$$G = p(0)(1 - p(0)) + p(1)(1 - p(1)) = 0 + 0 = 0.$$

Examples

Suppose we have another dataset in which 50% of the values are 0 and 50% are 1.

object
0
0
1
0
0
1
1
1
0
0
0
1

The Gini impurity is

$$G = p(0)(1 - p(0)) + p(1)(1 - p(1)) = 0.5 \times 0.5 + 0.5 \times 0.5 = 0.5.$$

For data with two possible values $\{0,1\}$, the maximum Gini impurity is 0.5, reached at $p(0) = p(1) = 0.5$. The following chart shows all the possible values of the Gini impurity for a two-value dataset.
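As a quick numerical check of the two-class case, the sketch below scans the probability p of class 1 on a grid and reports where the impurity peaks. The name `binary_gini` and the grid spacing of 0.01 are assumptions made for illustration. Each of the two terms equals p(1 - p), which is at most 0.25, so their sum is at most 0.5.

```python
def binary_gini(p):
    """Gini impurity of a two-class dataset where class 1 has probability p."""
    # p(1) = p and p(0) = 1 - p, so both terms of the sum equal p * (1 - p).
    return 2 * p * (1 - p)


# Evaluate on the grid p = 0.00, 0.01, ..., 1.00 and pick the maximizer.
grid = [k / 100 for k in range(101)]
best_p = max(grid, key=binary_gini)
print(best_p, binary_gini(best_p))  # 0.5 0.5
```

The curve is symmetric around p = 0.5 and drops to 0 at both ends, where the dataset is pure.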

For data with three possible values, the Gini impurity can also be visualized in the same kind of chart, given the condition that $p_3 = 1 - p_1 - p_2$.
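The three-class case can be explored the same way. The sketch below is an illustration under stated assumptions: `gini_three` is a hypothetical name, and the grid of step 1/30 is chosen only so that exact fractions can be used. It takes the two free probabilities, derives the third from the constraint, and searches the grid for the largest impurity.

```python
from fractions import Fraction


def gini_three(p1, p2):
    """Gini impurity for three classes; p3 follows from p1 + p2 + p3 = 1."""
    p3 = 1 - p1 - p2
    return sum(p * (1 - p) for p in (p1, p2, p3))


# All grid points (p1, p2) with p1 + p2 <= 1, in steps of 1/30.
steps = 30
points = [
    (Fraction(i, steps), Fraction(j, steps))
    for i in range(steps + 1)
    for j in range(steps + 1 - i)
]
best = max(points, key=lambda pt: gini_three(*pt))
print(best[0], best[1], gini_three(*best))  # 1/3 1/3 2/3
```

On this grid the impurity is largest when all three classes are equally likely, mirroring the two-class result.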

Supplementary:

L Ma (2020). 'Gini Impurity', Datumorphism, 01 April. Available at: https://datumorphism.leima.is/cards/machine-learning/measurement/gini-impurity/.