Decision Trees
Decision Trees (CART)
Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model (e.g. a constant) in each one. We focus on CART (Classification and Regression Trees) in this post.
Suppose we have a dataset \(D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}\) with \(\mathbf{x}_i \in \mathbb{R}^d\). The algorithm needs to decide automatically on the splitting variables, the splitting points, and the shape of the tree.
Regression Trees
In this scenario, our response variable \(Y\) is continuous. Suppose first that we have a partition into \(M\) regions \(R_1, \dots, R_M\), and define the model prediction as:
\[\hat{y}_i = \sum^{M}_{m=1} c_m I[\mathbf{x}_i \in R_m]\]
Minimizing the mean squared loss \(L = \frac{1}{2N} \sum^{N}_{i=1} (y_i - \hat{y}_i)^2\) by setting its derivative with respect to \(c_m\) to zero, we have:
\[\begin{aligned} \frac{\partial L}{\partial c_m} &= -\frac{1}{N}\sum^{N}_{i=1} \Big(y_i - \sum^{M}_{m^{\prime}=1} c_{m^{\prime}} I[\mathbf{x}_i \in R_{m^{\prime}}]\Big) I[\mathbf{x}_i \in R_m]\\ &= -\frac{1}{N}\sum^{N}_{i=1} y_i I[\mathbf{x}_i \in R_m] + \frac{N_m}{N} c_m = 0\\ \implies \hat{c}_m &= \frac{1}{N_m}\sum^{N}_{i=1} y_i I[\mathbf{x}_i \in R_m] \end{aligned}\]Thus, with respect to the mean squared error, the best estimate \(\hat{c}_m\) in each region is the average of the training responses in that region:
\[\hat{c}_m = \frac{1}{N_m} \sum^{N}_{i=1} y_i I[\mathbf{x}_i \in R_m]\]
Where \(N_m = \sum^{N}_{i=1} I[\mathbf{x}_i \in R_m]\) is the number of training examples in region \(R_m\).
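As a concrete illustration, here is a minimal NumPy sketch that computes the per-region constants \(\hat{c}_m\) when the region assignments are already known; the helper name `region_means` and the `region_ids` array are illustrative, not from any library.

```python
import numpy as np

def region_means(y, region_ids):
    """Estimate c_m as the average response of the training points in each region.

    y          : (N,) array of responses
    region_ids : (N,) array with region_ids[i] = m when x_i falls in region R_m
    """
    c_hat = {}
    for m in np.unique(region_ids):
        mask = region_ids == m                 # I[x_i in R_m]
        c_hat[int(m)] = float(y[mask].mean())  # average response in R_m
    return c_hat

# Example: three points in region 0, two in region 1
y = np.array([1.0, 2.0, 3.0, 10.0, 12.0])
region_ids = np.array([0, 0, 0, 1, 1])
print(region_means(y, region_ids))             # {0: 2.0, 1: 11.0}
```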
Best Splitting Point
Now, finding the binary partition that minimizes the sum of squares is generally computationally infeasible. However, we can use a greedy algorithm: starting with all of the data, consider every splitting variable \(j\) and splitting point \(s\), and find the pair that minimizes the loss below.
Every pair of splitting variable \(j\) and splitting point \(s\) defines a pair of half-planes:
- \(R_1(j, s) = \{\mathbf{x} | x_j \leq s\}\) (all samples that have \(j\)th feature less than or equal to \(s\))
- \(R_2(j, s) = \{\mathbf{x} | x_j > s\}\) (all samples that have \(j\)th feature greater than \(s\))
For each half-plane, we find the best estimate that will minimize the mean square loss in that region: \[\hat{c}_1 = \underset{c_1}{\arg\min} \frac{1}{N_{1}}\sum_{x_i \in R_1 (j, s)} (y_i - c_1)^2\] \[\hat{c}_2 = \underset{c_2}{\arg\min} \frac{1}{N_{2}}\sum_{x_i \in R_2 (j, s)} (y_i - c_2)^2\]
From previous results, we know that the minimizers are the average training responses in these regions, that is:
\[\hat{c}_1 = \frac{1}{N_1} \sum_{\mathbf{x}_i \in R_1 (j, s)} y_i \quad \quad \hat{c}_2 = \frac{1}{N_2} \sum_{\mathbf{x}_i \in R_2 (j, s)} y_i\]
For any choice of \((j, s)\), we seek to minimize the overall objective:
\[\min_{j, s} \Big[\sum_{\mathbf{x}_i \in R_1 (j, s)} (y_i - \hat{c}_1)^2 + \sum_{\mathbf{x}_i \in R_2 (j, s)} (y_i - \hat{c}_2)^2\Big]\]
This optimization problem can be solved very quickly by scanning through all possible pairs \((j, s)\). Having found the best split, we partition the data into the two resulting regions and repeat the splitting procedure on each region until a stopping criterion is met.
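Here is a sketch of this greedy search, assuming dense numeric features and using the observed feature values as candidate split points; `best_split` is an illustrative name, and the criterion is the summed squared error from the objective above.

```python
import numpy as np

def best_split(X, y):
    """Scan every (feature j, threshold s) pair and return the one that minimizes
    the summed squared error of the two resulting half-planes.

    X : (N, d) feature matrix, y : (N,) responses.
    """
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):          # candidate thresholds for feature j
            left = X[:, j] <= s               # R_1(j, s)
            right = ~left                     # R_2(j, s)
            if left.all() or right.all():     # skip degenerate splits
                continue
            c1, c2 = y[left].mean(), y[right].mean()
            loss = ((y[left] - c1) ** 2).sum() + ((y[right] - c2) ** 2).sum()
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s, best_loss
```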
Cost Complexity Pruning
How large should the tree grow? Clearly, a large tree will overfit the training set while a small tree might not capture the important structure. Tree size is a tuning parameter governing the model's complexity, and the optimal tree size should be adaptively chosen from the data.
The preferred strategy is to grow a large tree \(T_0\), stopping the splitting process only when some stopping criterion (e.g. a minimum node size) is reached, and then prune the large tree using cost-complexity pruning.
We define a subtree \(T \subseteq T_0\) to be any tree that can be obtained by pruning \(T_0\), that is, by collapsing any number of its internal (non-leaf) nodes into leaves. The idea is to find, for each \(\alpha\), the subtree \(T_\alpha \subseteq T_0\) that minimizes the objective:
\[C_{\alpha} (T) = \sum^{|T|}_{m=1} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2 + \alpha |T|\]
Where \(|T|\) is the number of leaves (terminal nodes) of \(T\), and \(\alpha |T|\) is the penalty term that trades off tree size and goodness of fit.
Hyperparameters:
- Tree size: \(\;\) Governs the tree's complexity.
- Minimum decrease in loss from a split: \(\;\) Split a node only if the decrease in the sum of squares due to the split exceeds some threshold. However, this approach is short-sighted, because a seemingly worthless split might lead to a very good split below it.
- Minimum leaf size / maximum number of leaves: \(\;\) A lower bound on the number of training samples in each leaf, or an upper bound on the total number of leaves.
- \(\alpha\): \(\;\) Controls the strength of the penalty in cost-complexity pruning (see the sketch after this list).
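In practice, libraries implement this pruning directly. For example, scikit-learn (version 0.22 or later) exposes cost-complexity pruning through the `ccp_alpha` parameter and the `cost_complexity_pruning_path` method; the sketch below uses synthetic data and chooses \(\alpha\) by cross-validation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=200)

# Grow a large tree T_0 and compute the sequence of effective alphas on its pruning path
full_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)

# Choose the alpha with the best cross-validated score, then refit the pruned tree
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```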
Classification Trees
If \(Y\) is a classification outcome taking values \(1, 2, 3, \dots, K\), the only changes needed in the tree algorithm pertain to the criteria for splitting nodes and pruning the tree. For regression, we used the squared error as the splitting criterion and the average response \(\hat{c}_m\) at each leaf as our prediction. In the classification case, for each region \(m\), we use class proportions to make predictions:
\[\hat{p}_{mk} = \frac{1}{N_m} \sum_{\mathbf{x}_i \in R_m} I[y_i = k]\]
\[\hat{y}_m = \underset{k}{\arg\max} \; \hat{p}_{mk} \]
Thus, the prediction at each region is the majority class in that region.
Impurity Measure
In the classification case, the splitting criterion is called an impurity measure. We have several choices for the impurity measure:
Misclassification Error: \[\frac{1}{N_m} \sum_{\mathbf{x}_i \in R_m} I[y_i \neq \hat{y}_m] = 1 - \hat{p}_{m \hat{y}_m}\]
Gini Index: \[\sum_{k \neq k^{\prime}} \hat{p}_{mk} \hat{p}_{mk^{\prime}} = \sum^{K}_{k=1} \hat{p}_{mk} (1 - \hat{p}_{mk})\]
Notice that if there is only one class in the region, the Gini index is \(0\), its minimum. That is, the Gini index prefers purer nodes.
Cross-entropy: \[-\sum^{K}_{k=1} \hat{p}_{mk} \log \hat{p}_{mk}\]
Similarly, if there is only one class in the region, the cross-entropy is \(0\), which is its minimum. Thus, cross-entropy also prefers purer nodes.
In general, we should prefer cross-entropy or the Gini index over the misclassification rate when growing the tree, because the misclassification rate does not reward extra purity: it is less sensitive to changes in the node class probabilities. When comparing candidate splits, the impurity of each child node should also be weighted by the number of instances in that node.
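A small NumPy sketch that computes \(\hat{p}_{mk}\), the majority-class prediction, and all three impurity measures for a single region; it assumes the labels are encoded as \(0, \dots, K-1\), and `node_summary` is an illustrative helper, not a library function.

```python
import numpy as np

def node_summary(y_node, n_classes):
    """Class proportions p_mk, majority-class prediction, and the three impurity
    measures for one region R_m (y_node holds the labels of the points in R_m)."""
    p = np.bincount(y_node, minlength=n_classes) / len(y_node)  # p_mk
    majority = int(p.argmax())                                  # y_hat_m = argmax_k p_mk
    misclassification = float(1.0 - p[majority])                # 1 - p_{m, y_hat_m}
    gini = float(np.sum(p * (1.0 - p)))                         # sum_k p_mk (1 - p_mk)
    p_pos = p[p > 0]                                            # drop empty classes (0 log 0 := 0)
    cross_entropy = float(-np.sum(p_pos * np.log(p_pos)))       # -sum_k p_mk log p_mk
    return majority, misclassification, gini, cross_entropy

# A pure region (one class only) has zero misclassification, Gini index, and cross-entropy
print(node_summary(np.array([1, 1, 1, 1]), n_classes=3))
print(node_summary(np.array([0, 0, 1, 2]), n_classes=3))
```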
Other Issues
Instability of Trees
One major problem with trees is their high variance: often a small change in the data results in a very different sequence of splits. The major reason for this instability is the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it.
Difficulty in Capturing Additive Structure
Another problem with trees is their difficulty in modeling additive structure. For a regression \(Y = c_1 I[X_1 < t_1] + c_2 I[X_2 < t_2] + \epsilon\), a tree must first split on \(X_1\) at \(t_1\) and then split both resulting nodes on \(X_2\) at \(t_2\) to recover the structure. This might happen with sufficient data, but the model is given no special encouragement to find such structure.
Linear Combination Splits
To solve the additive structure problem, rather than restricting splits to be of the form \(X_j \leq s\), one can allow splits along linear combinations of features:
\[\sum_{j} a_j X_j \leq s\]
While this can improve the predictive power of the tree, it can hurt interpretability.
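A tiny illustration of such an oblique split, with hypothetical weights \(a_j\) and split point \(s\) (in practice these would be learned, e.g. by a search or a linear model at the node):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 0.5], [0.2, 0.1]])
a = np.array([0.7, -0.3])      # split weights a_j (hypothetical values)
s = 0.5                        # split point
left_mask = X @ a <= s         # sum_j a_j X_j <= s defines one side of the split
print(left_mask)               # [ True False  True]
```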
Implementation
```python
import numpy as np
```
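Below is a minimal, self-contained sketch of the regression tree described above, combining the greedy split search with recursive partitioning. The `Node` class and the `build_tree` / `predict_one` names are illustrative choices, and the stopping rules (maximum depth, minimum leaf size) are simple defaults rather than tuned settings.

```python
import numpy as np

class Node:
    """A binary tree node: either an internal split (j, s) or a leaf with constant c."""
    def __init__(self, c=None, j=None, s=None, left=None, right=None):
        self.c, self.j, self.s, self.left, self.right = c, j, s, left, right

def build_tree(X, y, min_leaf=5, max_depth=5, depth=0):
    """Greedy CART regression tree: recursively choose the split (j, s) that
    minimizes the summed squared error of the two child regions."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return Node(c=y.mean())                      # leaf: predict the region average
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue                             # enforce the minimum leaf size
            loss = ((y[left] - y[left].mean()) ** 2).sum() + \
                   ((y[~left] - y[~left].mean()) ** 2).sum()
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    if best_j is None:                               # no admissible split found
        return Node(c=y.mean())
    left = X[:, best_j] <= best_s
    return Node(j=best_j, s=best_s,
                left=build_tree(X[left], y[left], min_leaf, max_depth, depth + 1),
                right=build_tree(X[~left], y[~left], min_leaf, max_depth, depth + 1))

def predict_one(node, x):
    """Route a single point down the tree and return the leaf constant."""
    while node.c is None:
        node = node.left if x[node.j] <= node.s else node.right
    return node.c

# Example on synthetic data with a single step in feature 0
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] > 0.5, 2.0, -1.0) + 0.1 * rng.normal(size=200)
tree = build_tree(X, y)
print(predict_one(tree, np.array([0.9, 0.3])))       # close to 2.0
```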
Ref
ESLII Chapter 9