This is Part 2 of my decision tree series. Here we will see how we can build a decision tree algorithmically using the CART algorithm of Leo Breiman *(one of the big, **big** names in decision trees)*.

## Maths (yay!)

Ok, so there’s going to be a little maths in this post, so for those who are not sure what all the notation means, let’s go over it. (if you already know all this feel free to skip ahead 😄)

### Set notation

Anything in between curly brackets is a set of objects (generally numbers).

- $\{a, b, c\}$ is the set that is composed of $a$, $b$ and $c$.
- $\{1, \dots, n\}$ is the set of all integers between $1$ and $n$.
- $x \in \{a, b, c\}$ means $x$ is equal to either $a$, $b$ or $c$ *(it is read **in**)*.
- $x \notin \{a, b, c\}$ means $x$ is any number that is not $a$, $b$ or $c$ *(it is read **not in**)*.
- $\{x \mid x \le 3, x \in \mathbb{N}\}$ means the set of $x$ less than or equal to $3$ and in $\mathbb{N}$ *(here this means $\{0, 1, 2, 3\}$)*.
- $A \subset B$ means that the set $A$ is a subset of the set $B$ and it cannot be equal to $B$. For example $\{1, 2\} \subset \{1, \dots, 10\}$ is **true** since $1$ and $2$ are included in the set of integers between $1$ and $10$, but $\{1, 12\} \subset \{1, \dots, 10\}$ is **not true** since $12$ is not included in the set of integers between $1$ and $10$ *(we could say $\{1, 12\} \not\subset \{1, \dots, 10\}$)*.
- If we have a set $A$ then $\bar{A}$ is its opposite (complement) set. So if $A = \{1, 2\}$ inside $\{1, 2, 3\}$ then $\bar{A} = \{3\}$.
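To make this notation concrete, here is how these set operations look with Python’s built-in `set` type *(a small illustration on made-up numbers, not part of the CART algorithm itself)*:

```python
A = {1, 2}
universe = set(range(1, 11))  # the integers from 1 to 10

print(3 in A)       # membership ("in") -> False
print(3 not in A)   # "not in" -> True
print(A < universe) # proper subset, A ⊂ universe -> True
print(universe - A) # the complement of A within the universe
```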

### Sums

Sums are represented by the symbol $\sum$ *(read as “sigma”)*. The value under the sigma is the start point, and the value above is the end point of the sum. So:

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$$

If you have any more questions about mathematical notation I would suggest you consult the Wikipedia page on mathematical symbols, which is quite comprehensive.

### Notation in this article

- $y$ denotes the target variable *(in our iris dataset it would be the species)*.
- $x_1, \dots, x_p$ are the explanatory variables *(in the iris dataset: petal length and width, and sepal length and width, so $p = 4$)*.

For a classification tree $y \in \{1, \dots, K\}$, where $K$ is the number of possible classes.

On the other hand, for a regression tree $y \in \mathbb{R}$ *(with $\mathbb{R}$ the set of real numbers)*.

*(NB. in any post I might use “variable” and “feature” interchangeably because they essentially mean the same thing)*

In decision trees, nodes represent a split in the data. In this post I will usually call these splits $S$ and represent them as the 2 opposite sets $S_L$ and $S_R$, where $S_L$ is the set that goes to the left branch of the split and $S_R$ the set that goes to the right branch of the split. For example, the first split of part 1’s simple tree, which has a condition of the form $x_j \le t$ (a threshold on one feature), would be noted as:

$$S_L = \{i \mid x_{ij} \le t\} \qquad S_R = \{i \mid x_{ij} > t\}$$


## What is CART?

CART stands for **C**lassification **A**nd **R**egression **T**rees. It is an algorithm developed by Leo Breiman *et al.* in 1984, designed for inferring decision trees. It relies on evaluating the possible splits at each node in our tree and choosing the best one.

## Splitting the data

To infer a tree we need to choose the best possible split at each node, and to do that we need to know what those splits are. As we saw in part 1, a split is only ever done on one feature.

There are two types of features, hence two types of splits:

- **numerical features**: these have number values (usually continuous) with no fixed set of possible values. These values intrinsically have an order.
- **categorical features**: these are features that can take one value from a fixed set. For example a “blood type” feature could only have values from the set $\{A, B, AB, O\}$, or a “group number” feature could have values only in the set $\{1, 2, 3, 4\}$. It is important to know that categorical features can be numbers, they just have to be limited to a specific finite set of possible values. These values are unordered.

*(NB. there can be ordered categorical values, like grades for example, but they can be assimilated to numerical values, which is why I didn’t mention them in the list above)*

#### A. Numerical splits

For a given node there are $n$ data points. Let us consider the numerical explanatory feature $x_j$, and note $v_1 < v_2 < \dots < v_n$ its sorted values. In this case we have a maximum of $n - 1$ possible splits, corresponding to all the splits:

$$S_L = \{i \mid x_{ij} \le v_k\} \qquad S_R = \{i \mid x_{ij} > v_k\} \qquad k \in \{1, \dots, n - 1\}$$

We stop at $v_{n-1}$ because, for the maximum value $v_n$, all points are smaller than or equal to it, meaning we would send all of our data points to one side of the split and leave the other side empty. This of course is not a split, so we don’t count that possibility.
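As a sketch, enumerating these candidate numerical splits might look like the following *(the function name `numerical_splits` is mine, and for simplicity it splits a single list of feature values rather than full data points)*:

```python
def numerical_splits(values):
    """Enumerate candidate splits of a numerical feature.

    Each split sends the points with value <= threshold to the left
    branch and the rest to the right branch; the maximum value is
    skipped since it would leave the right branch empty.
    """
    thresholds = sorted(set(values))[:-1]  # drop the maximum value
    return [
        ([v for v in values if v <= t], [v for v in values if v > t])
        for t in thresholds
    ]

# 4 distinct values -> 4 - 1 = 3 candidate splits
print(len(numerical_splits([1.4, 4.7, 1.3, 5.1])))  # 3
```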

#### B. Categorical splits

Let us now consider the categorical feature $x_j$, which has $L$ possible levels. The possible number of splits is $2^{L-1} - 1$. A given split can be defined as:

$$S_L = \{i \mid x_{ij} \in A\} \qquad S_R = \{i \mid x_{ij} \in \bar{A}\} \qquad \text{with } A \subset \{\ell_1, \dots, \ell_L\}$$

*How many subsets are there?*

Let’s consider the case of a feature called $x_j$ with 3 levels: $\{a, b, c\}$.

To create a subset we must choose which values are in the subset or not. For example with the subset $A = \{a, c\}$ what we are saying is $a \in A$, $b \notin A$ and $c \in A$. So each subset is defined by a set of 3 values indicating presence or absence of a given level of $x_j$ in the subset. With this we can easily calculate the number of possible subsets.

For the first value (presence or absence of $a$ in the subset) there are $2$ possible options, for the second value we also have $2$ potential values, and the same for the third value.

So our total number of possible subsets is:

$$2 \times 2 \times 2 = 2^3 = 8$$

And if we have $L$ possible levels, our number of possible subsets becomes:

$$2^L$$

*but further up it was $2^{L-1} - 1$, why?*

Well this comes from the fact that we are looking for splits, not just for subsets. And splits are symmetrical. For our example above, if we have:

$$A = \{a\} \qquad \bar{A} = \{b, c\}$$

It is easy to see that the split $(A, \bar{A})$ is the same as the split $(\bar{A}, A)$; that symmetry is why our number of possible splits has $2^{L-1}$ instead of simply $2^L$. The explanation for why it is $2^{L-1} - 1$ and not $2^{L-1}$ is the same as for numerical features: if $A = \{a, b, c\}$ then $\bar{A} = \emptyset$, so our split has all the points on one side and an empty set on the other. It is not a split, so we remove that possibility, and that’s how we end up with $2^{L-1} - 1$ *(that’s also why we used $\subset$ instead of $\subseteq$ when we defined our categorical splits)*.
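A sketch of enumerating these categorical splits *(the helper name `categorical_splits` is mine; it fixes the first level on the left side so that each symmetric split is only generated once)*:

```python
from itertools import combinations

def categorical_splits(levels):
    """Enumerate the 2**(L-1) - 1 distinct splits of a set of L levels."""
    levels = sorted(levels)
    first, rest_levels = levels[0], levels[1:]
    splits = []
    # keeping the first level on the left side means each symmetric
    # pair of subsets (A, A-bar) is generated exactly once
    for size in range(len(levels)):
        for rest in combinations(rest_levels, size):
            left = {first, *rest}
            right = set(levels) - left
            if right:  # skip the "everything on one side" non-split
                splits.append((left, right))
    return splits

print(len(categorical_splits(["a", "b", "c"])))  # 2**(3-1) - 1 = 3
```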

So now we have all of the possible splits in our data, but how do we choose the best one?

## Choosing the best split

Since we want to use the tree to predict either a class or a value, we want the leaves of the tree to be as “pure” as possible, meaning we want the examples in each leaf to be similar. To get that we need a way to measure the “purity” of a node, that is, how similar all the data points in that node are.

Therefore, to choose the best split, we choose the one that maximizes this “purity” measure in the child nodes. In practice we don’t measure “purity” but rather “impurity”. There are several of these measures. In this whole section let’s consider a node $N$, and for an impurity measure $I$ we can define $I(N)$ as the impurity of this node.

### The Gini index

The Gini index is a way to measure impurity in classification trees. Let $p_k$ be the probability of having class $k$ in our node, and $K$ the number of classes in the node. The Gini index is:

$$G = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

Since we don’t know the real probability $p_k$ for a given class, we just approximate it by the frequency of said class in the node:

$$p_k \approx \frac{n_k}{n}$$

with $n_k$ the number of examples of class $k$ in the considered node and $n$ the total number of examples in said node.
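Using these class frequencies, a minimal Gini index function could look like this *(`gini` is an illustrative helper, not a reference implementation)*:

```python
from collections import Counter

def gini(labels):
    """Gini index of a node, approximating p_k by n_k / n."""
    n = len(labels)
    return 1.0 - sum((n_k / n) ** 2 for n_k in Counter(labels).values())

print(gini(["setosa"] * 10))                      # pure node -> 0.0
print(gini(["setosa"] * 5 + ["virginica"] * 5))   # 1 - 2 * 0.5**2 = 0.5
```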

### Entropy

Entropy is also an impurity measure that is commonly used in CART with classification trees. If we use the same notation as for the Gini index, we can define the entropy of a node, $H$, as:

$$H = -\sum_{k=1}^{K} p_k \log(p_k)$$

Usually the base $2$ logarithm is used.
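A matching sketch for entropy with the base 2 logarithm *(again, `entropy` is just an illustrative helper)*:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a node, using the base 2 logarithm."""
    n = len(labels)
    return -sum(
        (n_k / n) * math.log2(n_k / n) for n_k in Counter(labels).values()
    )

print(entropy(["a", "b"] * 4))  # a 50/50 node -> 1.0
```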

### RSS

The Residual Sum of Squares (RSS) is used in regression trees, where examples have an outcome value instead of an outcome class. For a given node, let $n$ be the number of examples at that node, $y_i$ the outcome value of the $i^{th}$ example, and $\bar{y}$ the mean outcome value of all examples of the node. We can define:

$$RSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
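As a quick sketch *(`rss` is an illustrative name)*:

```python
def rss(values):
    """Residual sum of squares of a node around its mean outcome."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)

print(rss([3.0, 3.0, 3.0]))  # identical outcomes -> 0.0
print(rss([1.0, 2.0, 3.0]))  # mean is 2.0, so 1 + 0 + 1 = 2.0
```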

To then choose the optimal split we choose the one that leads to the maximal decrease in impurity *(which can be either the Gini index or the entropy for classification trees, or the RSS for regression trees)*. So if we are in a node $N$ and a given split defines the nodes $N_L$ and $N_R$, the **left** and **right** child nodes of $N$ respectively, we can define the decrease in impurity as:

$$\Delta I = I(N) - p_L I(N_L) - p_R I(N_R)$$

with $I(N_L)$ and $I(N_R)$ the impurities of the child nodes, and $p_L$ and $p_R$ the probabilities that a sample from node $N$ will go to node $N_L$ or node $N_R$. Again we equate these to the proportions of cases that go to nodes $N_L$ and $N_R$, so $p_L = \frac{n_L}{n}$ and $p_R = \frac{n_R}{n}$ *(with $n_L$ and $n_R$ the number of cases in nodes $N_L$ and $N_R$)*.
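Putting the pieces together, the decrease in impurity can be sketched like this *(using the Gini index as the impurity measure; both function names are mine)*:

```python
from collections import Counter

def gini(labels):
    # Gini impurity, approximating p_k by the class frequency n_k / n
    n = len(labels)
    return 1.0 - sum((n_k / n) ** 2 for n_k in Counter(labels).values())

def impurity_decrease(parent, left, right, impurity=gini):
    """Delta I = I(N) - p_L * I(N_L) - p_R * I(N_R)."""
    n = len(parent)
    return (
        impurity(parent)
        - (len(left) / n) * impurity(left)
        - (len(right) / n) * impurity(right)
    )

parent = ["a"] * 5 + ["b"] * 5
# a perfect split removes all impurity: 0.5 - 0 - 0 = 0.5
print(impurity_decrease(parent, ["a"] * 5, ["b"] * 5))  # 0.5
```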

## The algorithm

The basic algorithm is actually quite simple:

- Find all possible splits in the dataset
- Calculate the decrease in impurity for each of these splits
- Choose the split for which the decrease in impurity is maximal
- On each half of the split, start this process again

We can implement it in a recursive manner as follows:

```
function infer_tree(dataset) {
    # stopping condition
    if dataset_is_pure(dataset) {
        stop
    }
    splits = find_all_possible_splits(dataset)
    impurities = calculate_impurity_decrease(splits)
    # choose the split with the maximum decrease in impurity
    best_split = splits[argmax(impurities)]
    left_dataset = best_split[0]
    right_dataset = best_split[1]
    # recursive part
    infer_tree(left_dataset)
    infer_tree(right_dataset)
}
```

This is only a part of the algorithm: as written, it results in a tree that grows until all leaves are pure (meaning that they contain examples of only one class). The CART algorithm restricts that so we can have a more generalizable tree (no overfitting):

- you can define a minimum number of examples in each leaf: if the dataset at a node is smaller than or equal to that minimum, then the node is not split and stays a leaf. (This is not typically used in CART)
- you can prune the tree and collapse leaves together, which we will see in a later part.
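As a sketch, the minimum-examples rule from the first bullet could be folded into the stopping condition like this *(`should_stop` and `min_samples` are hypothetical names, and as noted above this rule is not typically part of CART)*:

```python
def should_stop(labels, min_samples=5):
    """Stop splitting when a node is pure or has too few examples."""
    return len(set(labels)) == 1 or len(labels) <= min_samples

print(should_stop(["setosa"] * 3))  # True: the node is pure
```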

## Conclusion

I hope you learned something on how to build decision trees, and go check out part 3 to see a `Python` implementation of this algorithm.