
The CART Algorithm

February 27, 2019

This is Part 2 of my decision tree series. Here we will see how to build a decision tree algorithmically, using the CART algorithm of Leo Breiman (one of the big, big names in decision trees).

Maths (yay!)

Ok, so there is going to be a little maths in this post, so for those who are not sure what all the notation means, let's go over it. (If you already know all this, feel free to skip ahead 😄)

Set notation

Anything in between curly brackets $\{\}$ is a set of objects (generally numbers).

  • $\{1,3,4\}$ is the set composed of $1$, $3$ and $4$.
  • $\{1,\cdots,7\}$ is the set of all integers between $1$ and $7$.
  • $i \in \{1,3,4\}$ means $i$ is equal to either $1$, $3$ or $4$ (it is read "$i$ in $\{1,3,4\}$").
  • $i \notin \{1,3,4\}$ means $i$ is any number that is not $1$, $3$ or $4$ (it is read "$i$ not in $\{1,3,4\}$").
  • $\{x \leq 4,\ x\in\{1,\cdots,6\}\}$ means the set of $x$ that are less than or equal to $4$, with $x$ in $\{1,\cdots,6\}$ (here this means $x\in\{1,2,3,4\}$).
  • $A\subset B$ means that the set $A$ is a subset of the set $B$ and cannot be equal to $B$. For example $\{1,2\} \subset \{1,\cdots,8\}$ is true since $1$ and $2$ are included in the set of integers between $1$ and $8$, but $\{1,2\} \subset \{2,\cdots,8\}$ is not true since $1$ is not included in the set of integers between $2$ and $8$ (we could write $\{1,2\}\not\subset \{2,\cdots,8\}$).
  • If we have a set $\mathcal{A}$ then $\overline{\mathcal{A}}$ is its complement, the set of everything that is not in $\mathcal{A}$. So if $\mathcal{A}=\{1,2,3\}$ then $\overline{\mathcal{A}} = \{x \notin \{1,2,3\}\}$.

Sums

Sums are represented by the $\Sigma$ symbol (read as "sigma"). The value under the sigma is the start point, and the value above is the end point of the sum. So:

$$\sum^3_{i=1} i^2 = 1^2 + 2^2 + 3^2 = 14$$

If you have any more questions about mathematical notation I would suggest you consult this Wikipedia page which is quite comprehensive.

In this article

  • $Y$ denotes the target variable (in our iris dataset it would be the species)
  • $X_{\{1,\cdots,p\}}$ are the $p$ explanatory variables (in the iris dataset: petal length and width, and sepal length and width, so $p=4$)

For a classification tree $Y_i \in \{1,2,\cdots,k\}$, where $k$ is the number of possible classes.
On the other hand, for a regression tree $Y_i \in \mathbb{R}$ (with $\mathbb{R}$ the set of real numbers).
(NB. in my posts I might use "variable" and "feature" interchangeably because they essentially mean the same thing)

In decision trees, nodes represent a split in the data. In this post I will usually call splits $S$ and represent them as the two complementary sets $\mathcal{A}$ and $\overline{\mathcal{A}}$, where $\mathcal{A}$ is the set that goes to the left branch of the split and $\overline{\mathcal{A}}$ the set that goes to the right branch. For example, the first split of part 1's simple tree, which has the condition $x_3 \leq 1.9$, will be noted as:

$$S = \{x_3 \leq 1.9\},\ \{x_3 > 1.9\}$$
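
To give a rough idea of what applying such a split looks like in code, here is a tiny, hypothetical Python illustration on a toy sample of $x_3$ values (the values are made up):

# the split S = {x_3 <= 1.9}, {x_3 > 1.9} applied to a toy sample
x3_values = [1.4, 1.9, 4.7, 5.1]

left_branch = [x for x in x3_values if x <= 1.9]    # the set A, goes down the left branch
right_branch = [x for x in x3_values if x > 1.9]    # the set A-bar, goes down the right branch

print(left_branch, right_branch)  # [1.4, 1.9] [4.7, 5.1]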



Steps in the CART decision tree inference algorithm

What is CART?

CART stands for Classification And Regression Trees. It is an algorithm developed by Leo Breiman et al. in 1984, designed for inferring decision trees. It relies on evaluating the possible splits at each node of the tree and choosing the best one.

Splitting the data

To infer a tree we need to choose the best possible split at each node, and to do that we need to know what those splits are. As we saw in part 1, a split is only ever done on one feature.
There are two types of features, hence two types of splits:

  • numerical features: these have number values (usually continuous) with no fixed set of possible values. These values intrinsically have an order.
  • categorical features: these can take one value from a fixed set. For example a "blood type" feature could only have values from the set $\{A,\ B,\ AB,\ O\}$, and a "group number" feature could have values only in the set $\{1,\cdots,10\}$. It is important to know that categorical features can be numbers, they just have to be limited to a specific finite set of possible values. These values are unordered.

(NB. there can be ordered categorical values, like grades for example $A > B > C > D$, but they can be assimilated to numerical values, which is why I didn't mention them in the list above)

A. numerical splits

For a given node there are $n$ data points. Let us consider the numerical explanatory feature $X_{num}$. In this case we have a maximum of $n-1$ possible splits, corresponding to all the splits of the form:

$$S = \{ X_{num} \leq x_i\},\ \{X_{num} > x_i\} \quad\quad i \in \{1,\cdots,n-1\}$$

We stop at $n-1$ because, for the maximum value of $X_{num}$, all points are smaller than or equal to it, meaning we would send all of our data points to one side of the split and leave the other side empty. This of course is not a split, so we don't count that possibility.
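
As a small illustration, here is a hypothetical Python helper (the name and toy values are mine) that lists the candidate thresholds for a numerical feature:

def numerical_split_thresholds(values):
    # Candidate thresholds for splits of the form {X_num <= x_i}, {X_num > x_i}.
    # The maximum value is dropped: splitting on it would leave the right branch empty.
    return sorted(set(values))[:-1]

petal_lengths = [1.4, 5.1, 1.9, 4.7]                 # toy sample, n = 4
print(numerical_split_thresholds(petal_lengths))     # [1.4, 1.9, 4.7]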

B. categorical splits

Let us now consider the categorical feature $X_{cat}$, which has $k$ possible levels. The possible number of splits is $2^{k-1} - 1$. A given split can be defined as:

$$S = \{X_{cat} \in \mathcal{A}\},\ \{X_{cat} \in \overline{\mathcal{A}}\} \quad\quad \mathcal{A}\subset\{1,\cdots,k\}$$

How many subsets $\mathcal{A}$ are there?
Let's consider the case of a feature called $C$ with 3 levels:

$$C \in \{red,\ blue,\ green\}$$

To create a subset we must choose which values are in the subset and which are not. For example, with the subset $\mathcal{A} = \{red,\ green\}$ we are saying that $red\in \mathcal{A}$, $green\in \mathcal{A}$ and $blue\notin \mathcal{A}$. So each subset is described by 3 values indicating the presence or absence of each level of $C$ in the subset. With this we can easily calculate the number of possible subsets.
For the first value (presence or absence of $red$ in the subset) there are $2$ possible options, for the second value we also have $2$ options, and the same goes for the third value.
So our total number of possible subsets is:

$$N_{sets}=2\times2\times2 = 2^3$$

And if we have $k$ possible levels, our number of possible subsets becomes:

$$N_{sets}=2^k$$

But further up I said the number of possible splits was $2^{k-1} - 1$, why?
Well, this comes from the fact that we are looking for splits, not just for subsets, and splits are symmetrical. For our example above, if we have:

$$\begin{aligned}
\mathcal{A} &= \{red,\ green\} \Leftrightarrow \overline{\mathcal{A}} = \{blue\} \\
S_1 &= \{C\in\mathcal{A}\},\ \{C\in\overline{\mathcal{A}}\} \\
S_2 &= \{C\in\overline{\mathcal{A}}\},\ \{C\in\mathcal{A}\}
\end{aligned}$$

It is easy to see that $S_1 = S_2$; that symmetry is why our number of possible splits contains $2^{k-1}$ instead of simply $2^k$. The explanation for why it is $2^{k-1}-1$ and not $2^{k-1}$ is the same as for numerical features: if $\mathcal{A}=\{red,\ blue,\ green\}$ then our split has all the points on one side and an empty set on the other, so it is not a split, and we remove that possibility. That's how we end up with $2^{k-1}-1$ (and that's also why we used $\mathcal{A} \subset \{1,\cdots,k\}$ instead of $\mathcal{A} \subseteq \{1,\cdots,k\}$ when we defined our categorical splits).
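
For illustration, here is a hypothetical Python sketch (the helper name and toy levels are mine) that enumerates these $2^{k-1}-1$ distinct splits, by only keeping the subsets $\mathcal{A}$ that contain the first level so that each split is counted once:

from itertools import combinations

def categorical_splits(levels):
    # Each split is counted once: we only keep subsets A that contain the first
    # level, and we never take all the levels (that would leave the right branch empty).
    first, rest = levels[0], levels[1:]
    splits = []
    for r in range(len(rest)):                 # r = how many other levels join `first` in A
        for combo in combinations(rest, r):
            left = {first, *combo}             # the subset A
            right = set(levels) - left         # its complement, A-bar
            splits.append((left, right))
    return splits

for left, right in categorical_splits(["red", "blue", "green"]):
    print(sorted(left), "|", sorted(right))
# ['red'] | ['blue', 'green']
# ['blue', 'red'] | ['green']
# ['green', 'red'] | ['blue']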

So now we have all of the possible splits in our data, but how do we choose the best one?

Choosing the best split

Since we want to use the tree to predict either a class or a value, we want the leaves of the tree to be as "pure" as possible, meaning we want the examples in each leaf to be similar. To get that we need a way to measure the "purity" of a node, i.e. how similar all the data points in that node are.
Therefore, to choose the best split, we choose the one that maximizes this "purity" measure in the child nodes. In practice we don't measure "purity" but rather "impurity". There are several such measures. In this whole section, let us consider a node $t$; for an impurity measure $i$ we can define $i(t)$ as the impurity of this node.

The Gini index

The Gini index is a way to measure impurity in classification trees. Let $p_i$ be the probability of having class $i$ in our node, and $k$ the number of classes in the node. The Gini index $G$ is:

$$G(t) = 1 - \sum^k_{i=1} p_i^2$$

Since we don't know the real probability $p_i$ for a given class, we just approximate it by the frequency of said class in the node:

$$p_i = \frac{n_i}{n}$$

with $n_i$ the number of examples of class $i$ in the considered node and $n$ the total number of examples in said node.
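
As a quick sanity check, here is a small hypothetical Python helper computing the Gini index from the class labels of a node (the function name and toy labels are mine):

from collections import Counter

def gini(labels):
    # Gini index of a node, given the class label of each example it contains
    n = len(labels)
    return 1 - sum((n_i / n) ** 2 for n_i in Counter(labels).values())

# a node with 4 setosa and 2 versicolor examples:
print(gini(["setosa"] * 4 + ["versicolor"] * 2))  # 1 - (4/6)^2 - (2/6)^2 ≈ 0.444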

Entropy

Entropy is also an impurity measure that is commonly used in CART with classification trees. If we use the same notation as for the Gini index, we can define the entropy of a node, $E$, as:

$$E(t) = -\sum^k_{i=1} p_i \log(p_i)$$

Usually the base $2$ logarithm is used.
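
Again as an illustrative sketch (the names and toy labels are mine), the entropy of a node could be computed like this:

import math
from collections import Counter

def entropy(labels):
    # Entropy of a node, using the base-2 logarithm
    n = len(labels)
    return -sum((n_i / n) * math.log2(n_i / n) for n_i in Counter(labels).values())

# a perfectly mixed node has maximal entropy, a pure node has zero entropy:
print(entropy(["setosa"] * 3 + ["versicolor"] * 3))  # 1.0
print(entropy(["setosa"] * 6))                       # -0.0 (i.e. zero)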

RSS

The Residual Sum of Squares (RSS) is used in regression trees, where examples have an outcome value instead of an outcome class. For a given node, let $n$ be the number of examples at that node, $y_i$ the outcome value of the $i^{th}$ example, and $\mu$ the mean outcome value of all examples of the node. We can define:

$$RSS(t) = \sum^n_{i=1} (y_i - \mu)^2$$
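
A minimal hypothetical Python version (names and toy values are mine):

def rss(values):
    # residual sum of squares of the outcome values at a node
    mu = sum(values) / len(values)            # mean outcome value of the node
    return sum((y - mu) ** 2 for y in values)

print(rss([1.0, 2.0, 3.0]))  # mean is 2.0, so RSS = 1 + 0 + 1 = 2.0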



To choose the optimal split we then pick the one that leads to the maximal decrease in impurity (where the impurity can be either the Gini index or the entropy for classification trees, or the RSS for regression trees). So say we are in a node $t$ and a given split defines the nodes $L$ and $R$, the left and right child nodes of $t$ respectively. We can define the decrease in impurity $\Delta i$ as:

$$\Delta i = i(t) - p_L\cdot i(L) - p_R\cdot i(R)$$

with $i(L)$ and $i(R)$ the impurities of the child nodes, and $p_L$ and $p_R$ the probabilities that a sample from node $t$ goes to node $L$ or node $R$. Again we equate these to the proportions of cases that go to nodes $L$ and $R$, so $p_L = \frac{n_L}{n_t}$ (with $n_L$ and $n_t$ the number of cases in nodes $L$ and $t$).
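
Here is a hypothetical sketch of this computation in Python (all names are illustrative; the small Gini helper from above is repeated so the snippet runs on its own):

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right, impurity=gini):
    # Delta_i = i(t) - p_L * i(L) - p_R * i(R), for any node-level impurity measure
    p_left, p_right = len(left) / len(parent), len(right) / len(parent)
    return impurity(parent) - p_left * impurity(left) - p_right * impurity(right)

# splitting a perfectly mixed node into two pure children removes all impurity:
parent = ["setosa"] * 3 + ["versicolor"] * 3
print(impurity_decrease(parent, ["setosa"] * 3, ["versicolor"] * 3))  # 0.5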

The algorithm

The basic algorithm is actually quite simple:

  1. Find all possible splits in the dataset
  2. Calculate the decrease in impurity for each of these splits
  3. Choose the split for which the decrease in impurity is maximal
  4. On each half of the split, start this process again

We can implement it in a recursive manner as follows:

function infer_tree(dataset) {

    # stopping condition: all examples in the dataset belong to the same class
    if dataset_is_pure(dataset) {
        stop
    }

    splits = find_all_possible_splits(dataset)
    impurities = calculate_impurity_decrease(splits)

    # choose the split with the maximum decrease in impurity
    best_split = splits[argmax(impurities)]

    left_dataset = best_split.left
    right_dataset = best_split.right

    # recursive part: grow the left and right subtrees
    infer_tree(left_dataset)
    infer_tree(right_dataset)

}

This is only a part of the algorithm: it results in a tree that grows until all leaves are pure (meaning that they contain examples of only one class). The CART algorithm restricts that so we can get a more generalizable tree (one that does not overfit):

  • you can define a minimum number of examples per leaf: if the dataset at a node is smaller than or equal to that minimum, the node is not split and stays a leaf (this is not typically used in CART; see the sketch after this list for how such a rule fits in)
  • you can prune the tree and collapse leaves together, which we will see in a later part.
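
To tie everything together, here is a deliberately naive, hypothetical Python sketch of the growing phase for a classification tree on numerical features, with the Gini criterion and an optional minimum-leaf-size stopping rule. All of the names (infer_tree, best_split, the nested-dict tree) are my own illustrations, not the reference CART implementation; part 3 goes through a proper Python implementation.

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    # Return (feature index, threshold, impurity decrease) of the best numerical split,
    # or None if no valid split exists (e.g. all feature values are identical).
    best, parent_impurity = None, gini(labels)
    for j in range(len(rows[0])):                                  # every feature...
        for threshold in sorted({row[j] for row in rows})[:-1]:    # ...and every candidate threshold
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            decrease = (parent_impurity
                        - len(left) / len(labels) * gini(left)
                        - len(right) / len(labels) * gini(right))
            if best is None or decrease > best[2]:
                best = (j, threshold, decrease)
    return best

def infer_tree(rows, labels, min_samples=1):
    # stop if the node is pure or too small to split, and predict the majority class
    if len(set(labels)) == 1 or len(labels) <= min_samples:
        return {"prediction": Counter(labels).most_common(1)[0][0]}
    split = best_split(rows, labels)
    if split is None:
        return {"prediction": Counter(labels).most_common(1)[0][0]}
    j, threshold, _ = split
    left = [i for i, row in enumerate(rows) if row[j] <= threshold]
    right = [i for i, row in enumerate(rows) if row[j] > threshold]
    return {"feature": j, "threshold": threshold,
            "left": infer_tree([rows[i] for i in left], [labels[i] for i in left], min_samples),
            "right": infer_tree([rows[i] for i in right], [labels[i] for i in right], min_samples)}

# a toy, one-feature dataset (petal lengths) with two classes:
rows = [[1.4], [1.9], [4.7], [5.1]]
labels = ["setosa", "setosa", "versicolor", "versicolor"]
print(infer_tree(rows, labels))
# {'feature': 0, 'threshold': 1.9, 'left': {'prediction': 'setosa'}, 'right': {'prediction': 'versicolor'}}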

Conclusion

I hope you learned something about how to build decision trees. Go check out part 3 to see a Python implementation of this algorithm.
