Decision Tree

In this article, we will explain how decision trees work and build a tree by hand.

The code used in this article can be found in this repo.

Definition of the problem

We will decide whether one should go to work today. In this demo project, we consider the following features.

| feature | possible values |
| --- | --- |
| health | 0: feeling bad, 1: feeling good |
| weather | 0: bad weather, 1: good weather |
| holiday | 1: holiday, 0: not holiday |
For a more compact notation, we write $\{0,1\}^3$ for the set of three binary features, each taking the values 0 or 1. In general, $\{0,1\}^d$ denotes $d$ binary features.

Our prediction will be a binary result, 0 or 1, with 0 indicating staying at home and 1 indicating going to work.

In the same compact notation, the prediction can be denoted as $\{0,1\}^1$.

How to Describe a Decision Tree

In theory, we would expect a decision tree like the following.

```mermaid
graph TD
    A[health] --> |feeling bad| E[stay home]
    A[health] --> |feeling good| B[weather]
    B --> |bad weather| E
    B --> |good weather| C[holiday]
    C --> |holiday| E
    C --> |not holiday| G[go to the office]
```
It is straightforward to show that the maximum required depth and the maximum number of leaves of a model that maps $\{0,1\}^d$ to $\{0,1\}^1$ are $d+1$ and $2^d$, respectively: each binary feature needs to be queried at most once along a path, and each of the $2^d$ feature combinations may need its own leaf. In our simple example, some of the branches are truncated based on our understanding of the problem. In principle, the branch “feeling bad” could also go on to the next level.
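Before training anything, the expected tree can be written down directly as a rule. The sketch below is only an illustration of that hand-built tree, not the trained model from the article; the function name `should_go_to_office` is my own choice.

```python
def should_go_to_office(health: int, weather: int, holiday: int) -> int:
    """Hand-written decision rule mirroring the expected tree.

    Each argument is a binary feature from {0, 1}; the return value is
    1 (go to the office) or 0 (stay home).
    """
    if health == 0:   # feeling bad -> stay home
        return 0
    if weather == 0:  # bad weather -> stay home
        return 0
    if holiday == 1:  # holiday -> stay home
        return 0
    return 1          # feeling good, good weather, not a holiday


# Quick check of all 2^3 = 8 possible inputs.
for h in (0, 1):
    for w in (0, 1):
        for hol in (0, 1):
            print(h, w, hol, "->", should_go_to_office(h, w, hol))
```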

Data

In practice, however, we do not build trees by hand from experience; we build them from data.

The following is a sample of the full dataset.

| | health | weather | holiday | go_to_office |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 |
Full Data
| | health | weather | holiday | go_to_office |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 0 | 1 | 1 | 0 |
| 7 | 1 | 1 | 0 | 1 |
| 8 | 0 | 1 | 0 | 0 |
| 9 | 0 | 0 | 1 | 0 |
| 10 | 0 | 0 | 1 | 0 |
| 11 | 0 | 0 | 1 | 0 |
| 12 | 1 | 0 | 0 | 0 |
| 13 | 1 | 0 | 0 | 0 |
| 14 | 0 | 0 | 1 | 0 |
| 15 | 0 | 0 | 1 | 0 |
| 16 | 0 | 1 | 0 | 0 |
| 17 | 0 | 1 | 0 | 0 |
| 18 | 0 | 1 | 1 | 0 |
| 19 | 0 | 1 | 1 | 0 |
| 20 | 1 | 1 | 1 | 0 |
| 21 | 0 | 0 | 1 | 0 |
| 22 | 1 | 0 | 1 | 0 |
| 23 | 1 | 1 | 1 | 0 |
| 24 | 0 | 0 | 0 | 0 |
| 25 | 0 | 0 | 1 | 0 |
| 26 | 1 | 0 | 0 | 0 |
| 27 | 0 | 1 | 1 | 0 |
| 28 | 1 | 0 | 0 | 0 |
| 29 | 1 | 0 | 1 | 0 |
| 30 | 1 | 0 | 1 | 0 |
| 31 | 0 | 0 | 0 | 0 |
| 32 | 0 | 0 | 0 | 0 |
| 33 | 1 | 1 | 1 | 0 |
| 34 | 1 | 1 | 1 | 0 |
| 35 | 1 | 0 | 0 | 0 |
| 36 | 1 | 1 | 1 | 0 |
| 37 | 0 | 1 | 1 | 0 |
| 38 | 0 | 1 | 1 | 0 |
| 39 | 1 | 1 | 1 | 0 |
| 40 | 0 | 0 | 1 | 0 |
| 41 | 0 | 1 | 1 | 0 |
| 42 | 1 | 0 | 0 | 0 |
| 43 | 1 | 0 | 0 | 0 |
| 44 | 1 | 1 | 1 | 0 |
| 45 | 0 | 0 | 1 | 0 |
| 46 | 0 | 0 | 1 | 0 |
| 47 | 1 | 0 | 1 | 0 |
| 48 | 0 | 1 | 0 | 0 |
| 49 | 1 | 0 | 1 | 0 |
| 50 | 1 | 1 | 0 | 1 |
| 51 | 0 | 0 | 1 | 0 |
| 52 | 1 | 0 | 0 | 0 |
| 53 | 1 | 0 | 1 | 0 |
| 54 | 1 | 0 | 1 | 0 |
| 55 | 0 | 0 | 1 | 0 |
| 56 | 0 | 0 | 0 | 0 |
| 57 | 0 | 1 | 0 | 0 |
| 58 | 1 | 1 | 1 | 0 |
| 59 | 1 | 0 | 1 | 0 |
| 60 | 1 | 1 | 1 | 0 |
| 61 | 0 | 1 | 0 | 0 |
| 62 | 0 | 1 | 1 | 0 |
| 63 | 0 | 0 | 0 | 0 |
| 64 | 0 | 1 | 0 | 0 |
| 65 | 1 | 0 | 1 | 0 |
| 66 | 0 | 1 | 1 | 0 |
| 67 | 1 | 1 | 0 | 1 |
| 68 | 0 | 0 | 0 | 0 |
| 69 | 1 | 0 | 0 | 0 |
| 70 | 0 | 0 | 0 | 0 |
| 71 | 0 | 0 | 1 | 0 |
| 72 | 0 | 0 | 1 | 0 |
| 73 | 0 | 0 | 1 | 0 |
| 74 | 0 | 0 | 0 | 0 |
| 75 | 0 | 1 | 1 | 0 |
| 76 | 1 | 1 | 1 | 0 |
| 77 | 1 | 1 | 0 | 1 |
| 78 | 0 | 1 | 0 | 0 |
| 79 | 0 | 0 | 0 | 0 |
| 80 | 0 | 0 | 0 | 0 |
| 81 | 1 | 0 | 0 | 0 |
| 82 | 0 | 1 | 0 | 0 |
| 83 | 0 | 1 | 0 | 0 |
| 84 | 0 | 1 | 0 | 0 |
| 85 | 0 | 0 | 0 | 0 |
| 86 | 1 | 1 | 1 | 0 |
| 87 | 1 | 0 | 0 | 0 |
| 88 | 0 | 1 | 0 | 0 |
| 89 | 1 | 1 | 1 | 0 |
| 90 | 1 | 1 | 0 | 1 |
| 91 | 1 | 0 | 0 | 0 |
| 92 | 0 | 1 | 1 | 0 |
| 93 | 0 | 0 | 0 | 0 |
| 94 | 1 | 1 | 0 | 1 |
| 95 | 1 | 0 | 1 | 0 |
| 96 | 0 | 0 | 1 | 0 |
| 97 | 0 | 1 | 1 | 0 |
| 98 | 1 | 0 | 0 | 0 |
| 99 | 1 | 1 | 0 | 1 |

Build a Tree

A decision tree trained with the dataset.
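As a rough sketch of how a tree like the one in the chart could be produced (assuming scikit-learn, pandas, and a CSV export of the table above; the file name `go_to_office.csv` is hypothetical and the article's repo may do this differently), one could write:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Hypothetical file containing the full dataset with the column names used above.
df = pd.read_csv("go_to_office.csv")

X = df[["health", "weather", "holiday"]]
y = df["go_to_office"]

# Gini impurity is the default splitting criterion; random_state only fixes tie-breaking.
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X, y)

plot_tree(
    clf,
    feature_names=list(X.columns),
    class_names=["stay home", "go to office"],
    filled=True,
)
plt.show()
```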

Reading the Decision Tree Chart

Each node of the tree carries quite a bit of information.

We take the root node as an example. The feature name and threshold are shown on the first row, i.e., weather <= 0.5, which means the decision at this node is based on whether the value of the weather feature is less than or equal to 0.5. If it is, we follow the left branch; otherwise, we follow the right branch. The remaining rows in the node describe the samples that reach this node, before the split is applied.

On the second row, we read the Gini impurity, a measure of how mixed the class labels are among the samples in the node.

On the third row, the number of samples that reach this node is given.

Finally, we read the class distribution of those samples. In this example, value = [93, 7], i.e., 93 of the samples have target value 0 and 7 of the samples have target value 1.
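For a node with class proportions $p_k$, the Gini impurity is $G = 1 - \sum_k p_k^2$. As a quick sanity check, here is a minimal sketch recomputing the root node's impurity from value = [93, 7]:

```python
def gini_impurity(class_counts):
    """Gini impurity G = 1 - sum_k p_k^2, where p_k are the class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Root node: 93 samples labelled 0 and 7 samples labelled 1.
print(gini_impurity([93, 7]))  # 1 - 0.93**2 - 0.07**2 ≈ 0.1302
```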

This is a very good result: the trained tree matches our theoretical expectation.

Of course it does. We forged the dataset based on that theoretical expectation.


Here is an example of using data that doesn’t always fit into our theoretical model.

A decision tree trained with a fake “impure dataset” that doesn’t always fit into our theoretical model.
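One way to produce such an “impure” dataset is to start from the clean rule and randomly flip a fraction of the labels. This is only a sketch of the idea, not necessarily how the dataset behind the chart above was generated; the noise level of 0.1 is an arbitrary choice.

```python
import random

random.seed(0)


def noisy_sample(flip_probability=0.1):
    """Draw random binary features, apply the clean rule, then occasionally flip the label."""
    health, weather, holiday = (random.randint(0, 1) for _ in range(3))
    label = int(health == 1 and weather == 1 and holiday == 0)
    if random.random() < flip_probability:
        label = 1 - label  # inject label noise
    return health, weather, holiday, label


impure_dataset = [noisy_sample() for _ in range(100)]
```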

Overfitting

Fully grown trees are likely to overfit the data, since they always try to grow pure leaves. Besides, the size of a fully grown tree grows exponentially with the number of features, which requires a lot of computational resources.

Applying Occam’s razor, we prefer smaller trees as long as the trees can explain the data well.

To achieve this, we either limit how the tree grows during training, or prune the tree after it is built.

Pruning is achieved by replacing the subtree at a node with a leaf whenever certain conditions, based on cost estimates, are met.
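With scikit-learn (assuming that is the library behind the charts above), both strategies map to constructor arguments: growth can be limited with parameters such as `max_depth` or `min_samples_leaf`, and post-hoc pruning is available as minimal cost-complexity pruning via `ccp_alpha`. A minimal sketch:

```python
from sklearn.tree import DecisionTreeClassifier

# Limit growth during training: cap the depth and the minimum leaf size.
shallow_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Prune after growing: minimal cost-complexity pruning, controlled by ccp_alpha.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01)

# Both are then fit on the features X and labels y as before, e.g. shallow_tree.fit(X, y).
```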

Remarks

The Iterative Dichotomizer 3 algorithm, aka the ID3 algorithm, is one of the most famous decision tree algorithms. The following is the “flowchart” of the algorithm.

```mermaid
graph TD
    Leaf("Prepare samples in node")
    MajorityVote["Calculate majority vote"]
    Assign[Assign label to node]
    Leaf --> MajorityVote --> Assign
    Assign --> Split1[Split on feature 1]
    Assign --> Splitdots["..."]
    Assign --> Splitd[Split on feature d]
    subgraph "split on a subset of features"
        Split1 --> |"Split on feature 1"| B1["Calculate gain of split"]
        Splitdots --> |"..."| Bdots["..."]
        Splitd --> |"Split on feature d"| Bd["Calculate gain of split"]
    end
    B1 --> C["Use the split with the largest gain"]
    Bdots --> C
    Bd --> C
    C --> Left["Prepare samples in left node"]
    C --> Right["Prepare samples in right node"]
    subgraph "left node"
        MajorityVoteL["Calculate majority vote"]
        AssignL(Assign label to left node)
        Left --> MajorityVoteL --> AssignL
    end
    subgraph "right node"
        MajorityVoteR["Calculate majority vote"]
        Right --> MajorityVoteR
        AssignR(Assign label to right node)
        MajorityVoteR --> AssignR
    end
```

To “calculate the gain of the split”, we use either the information gain (based on entropy) or the decrease in Gini impurity.
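To make “gain of the split” concrete, here is a minimal sketch of the entropy-based information gain used by ID3. The names `rows` (a list of feature tuples) and `labels` (the matching list of targets) are my own choices for this illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H = -sum_k p_k * log2(p_k) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, feature_index):
    """Parent entropy minus the weighted average entropy of the children after the split."""
    parent = entropy(labels)
    weighted_children = 0.0
    for value in {row[feature_index] for row in rows}:
        subset = [label for row, label in zip(rows, labels) if row[feature_index] == value]
        weighted_children += len(subset) / len(labels) * entropy(subset)
    return parent - weighted_children
```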
