Information Gain

#Data

Information gain is a frequently used metric for evaluating a candidate split in tree-based methods: it measures how much the split reduces the entropy.

First of all, the entropy of a dataset is defined as

$$ S = - \sum_i p_i \log p_i, $$

where $p_i$ is the probability of class $i$. For two classes with probabilities $p$ and $1-p$, this reduces to $-p \log p - (1-p)\log (1-p)$.
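As a minimal sketch of the entropy, assuming the standard Shannon form $-\sum_i p_i \log p_i$ with the natural logarithm, the function below takes a list of class probabilities. The name `entropy` and the example distributions are illustrative and not part of the original card.

```python
import math

def entropy(probabilities):
    """Shannon entropy (natural log) of a discrete distribution."""
    # Classes with p = 0 are skipped, since p * log(p) -> 0 as p -> 0.
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# Illustrative distributions (assumed, not from the original text).
uniform = entropy([0.5, 0.5])  # most mixed two-class case, equals log(2)
pure = entropy([1.0, 0.0])     # a pure set carries no uncertainty
```

The entropy is largest when the classes are evenly mixed and zero when a single class has probability one.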

The information gain is the difference between the entropy before a split and the entropy after it.

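For example, in a decision tree algorithm, we would split a node. Before splitting, let $p_m$ be the fraction of samples in the node that carry label $m$. Treating the node as a two-class problem (label $m$ versus everything else), its entropy is

$$ S_m = - p_m \log p_m - (1-p_m)\log (1-p_m). $$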

After the splitting, we have two groups that contribute to the entropy, group $L$ and group $R$,

$$ S'_m = p_L \left( - p_{m|L} \log p_{m|L} - (1-p_{m|L})\log (1-p_{m|L}) \right) + p_R \left( - p_{m|R} \log p_{m|R} - (1-p_{m|R})\log (1-p_{m|R}) \right), $$

where $p_L$ and $p_R$ are the probabilities of the two groups, and $p_{m|L}$ and $p_{m|R}$ are the fractions of samples with label $m$ inside group $L$ and group $R$, respectively. Suppose we have 100 samples before splitting, with 29 samples in the left group and 71 samples in the right group; then $p_L = 29/100$ and $p_R = 71/100$.
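The weighted entropy after the split can be sketched as follows, using the 29/71 group sizes from the text. The counts of label $m$ in each group (`m_left`, `m_right`) are hypothetical values chosen only for illustration, and the sketch assumes each group has its own fraction of label $m$.

```python
import math

def binary_entropy(p):
    """Entropy of a two-outcome variable: label m with probability p, anything else with 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure group has zero entropy
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

n_left, n_right = 29, 71   # group sizes from the text
m_left, m_right = 25, 15   # hypothetical counts of label m in each group

n_total = n_left + n_right
p_left = n_left / n_total    # 29/100
p_right = n_right / n_total  # 71/100

# Each group's entropy is weighted by the share of samples it receives.
entropy_after = (
    p_left * binary_entropy(m_left / n_left)
    + p_right * binary_entropy(m_right / n_right)
)
```

Weighting by group size keeps a tiny but pure group from dominating the score of a split.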

The information gain is thus

$$ \mathrm{Gain} = S_m - S'_m. $$
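Putting the pieces together, the sketch below computes the gain directly from lists of labels: the entropy of the parent node minus the size-weighted entropy of the two children. The helper names and the toy labels are assumptions made for illustration; the entropy is taken over all labels present, which matches the two-class form when there are only two labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of the label distribution in a list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - after

# Hypothetical toy node with 100 samples, split 29 / 71.
left = ["m"] * 25 + ["other"] * 4     # 29 samples
right = ["m"] * 15 + ["other"] * 56   # 71 samples
parent = left + right

gain = information_gain(parent, left, right)
```

A split that leaves both children as mixed as the parent yields a gain of zero, while a split that perfectly separates the labels yields a gain equal to the parent's entropy.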

Planted: 2020-01-16 by L Ma;

References:

Shalev-Shwartz, S., & Ben-David, S. (2013). Understanding Machine Learning: From Theory to Algorithms.

L Ma (2020). 'Information Gain', Datumorphism, 01 April. Available at: https://datumorphism.leima.is/cards/machine-learning/measurement/information-gain/.