Decision Trees
1. Basics
A decision tree is an acyclic directed graph used to make decisions. At each branching node, a specific check is applied to one of the features:
- based on the binary outcome of that check, either the left or the right child is chosen, and the procedure repeats until a leaf is reached
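As a concrete sketch of that structure (the struct and accessor names below are illustrative assumptions, not taken from these notes), a node is either a leaf holding a prediction or an internal node holding the checked feature, a split point, and two children; prediction is a walk down to a leaf:
(defstruct node
  feature      ; index of the feature checked at this node (nil for a leaf)
  split-point  ; threshold for the binary check
  left right   ; child subtrees
  prediction)  ; value stored at a leaf

(defun predict (tree input)
  "Walk TREE until a leaf is reached and return its stored prediction.
INPUT is assumed to be a list of feature values."
  (if (null (node-feature tree))
      (node-prediction tree)
      (if (< (nth (node-feature tree) input) (node-split-point tree))
          (predict (node-left tree) input)
          (predict (node-right tree) input))))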
1.1. Iterative Dichotomiser 3
Out of the several optimization criteria applicable to decision trees, these notes highlight that of the ID3 (Iterative Dichotomiser 3) learning algorithm; other options include CART and C4.5.
The optimization criterion for ID3 is the average log-likelihood.
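Written out for binary labels $y_i \in \{0, 1\}$ and a tree output $f(x_i) \in (0, 1)$ (this notation is assumed here, not taken from the notes), the criterion is

\[
\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \ln f(x_i) + (1 - y_i) \ln\big(1 - f(x_i)\big) \right]
\]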
(defun predict-id3 (input) ...) ; the decision tree prediction, in (0, 1)

(defun log-likelihood (input label)
  (let ((prediction (predict-id3 input)))
    (+ (* label (log prediction))
       (* (- 1 label) (log (- 1 prediction))))))

(defun log-overall-likelihood (inputs labels)
  (reduce #'+ (mapcar #'log-likelihood inputs labels) :initial-value 0))
Contrary to logistic regression, where one builds a parametric model, ID3 is a non-parametric approach.
An important aspect of a decision tree algorithm is deciding the splitting criterion at each node, which is based on the notion of entropy.
Concisely, the split at a node is chosen by picking the feature and decision threshold that maximise information gain (i.e. reduce uncertainty the most), which is calculated using entropy or Gini impurity.
- Entropy measures the impurity or randomness in a dataset
- Information gain measures the reduction in entropy achieved by splitting the dataset based on a particular attribute.
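For a binary classification node, with $p$ the fraction of positive labels in a set $S$ (symbols assumed here for illustration), these quantities are

\[
H(S) = -\,p \ln p - (1 - p) \ln(1 - p), \qquad \mathrm{Gini}(S) = 2p(1 - p)
\]
\[
\mathrm{IG}(S) = H(S) - \frac{|S_{\mathrm{left}}|}{|S|}\, H(S_{\mathrm{left}}) - \frac{|S_{\mathrm{right}}|}{|S|}\, H(S_{\mathrm{right}})
\]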
Stopping criteria for a split decision are as follows (a sketch of a corresponding predicate follows the list):
- all examples in the parent belong to a single class
- an attribute to split upon can't be found
- entropy decrease is less than a threshold epsilon (set experimentally)
- max depth d has been achieved (set experimentally)
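A rough sketch of such a predicate (the names, thresholds, and argument list below are assumptions for illustration only; the later snippets fold these quantities into a single parent-node argument passed to stop-check):
;; Hypothetical thresholds, both set experimentally.
(defvar *epsilon* 0.01)
(defvar *max-depth* 10)

(defun stop-check-p (labels attributes entropy-decrease depth)
  "True when no further split should be attempted.
LABELS: labels in the parent node; ATTRIBUTES: remaining candidate features;
ENTROPY-DECREASE: best achievable decrease; DEPTH: current depth."
  (or (every (lambda (l) (eql l (first labels))) labels) ; single class left
      (null attributes)                                  ; no attribute to split on
      (< entropy-decrease *epsilon*)                     ; decrease below threshold
      (>= depth *max-depth*)))                           ; max depth reached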
Symbolically summarizing the splitting procedure:
(defun build-split-criterion (feature split-point)
  (...)
  (pack-criterion splitting-attribute specific-split-point))

(defun build-splits (parent-node splitting-criterion)
  (...)
  (list left-split right-split))

(defun set-positive-probability (set) ; for binary classification
  "chance of the label in a set being positive (1)"
  (mean (fetch-labels set))) ; labels as 0s and 1s

(defun set-entropy (set) ; for binary classification, assumes 0 < p < 1
  (let* ((p (set-positive-probability set))
         (q (- 1 p))) ; negative probability
    (- (+ (* p (log p)) (* q (log q))))))

(defun split-entropy (parent-node feature split-point)
  (flet ((normalize-entropy (split)
           (* (/ (cardinality split) (cardinality parent-node))
              (set-entropy split))))
    (let* ((splits (build-splits parent-node (build-split-criterion feature split-point)))
           (left-split (car splits))
           (right-split (cadr splits)))
      (+ (normalize-entropy left-split)
         (normalize-entropy right-split))))) ; basically a weighted average

;; the objective is then to choose a split such that the above is minimized

(defmacro minimize (parameters objective)
  "internally has access to a searching mechanism over parameters"
  `(let ...
     ...
     (list optimal-split-feature optimal-split-point))) ; macro as parameters need to be literally captured in objective

(defun optimal-split-parent (parent)
  (let ((solution (minimize '(feature split-point)
                            (split-entropy parent feature split-point))))
    solution)) ; solution is a list of feature and corresponding split point

(defun stop-check (parent) ...) ; predicate built from the stopping criteria listed above
Now that all the helpers have been expressed, the principal recursor looks as follows:
(defvar *decision-tree* nil)

(defun build-subtree (parent)
  (when (stop-check parent)
    (return-from build-subtree (generate-leaf (adapt-set parent))))
  (let* ((optimal-split-info (optimal-split-parent parent))
         (optimal-feature (car optimal-split-info))
         (optimal-split-point (cadr optimal-split-info)))
    (generate-split parent
                    (mapcar #'build-subtree
                            (build-splits parent
                                          (build-split-criterion optimal-feature
                                                                 optimal-split-point))))))

(setf *decision-tree* (build-subtree dataset))
Note that the split decision at each point is local in ID3 (a greedy approach). This doesn't guarantee optimality. Backtracking may be employed during the search at the cost of the procedure running longer.
Do note that, as a result, this algorithm only approximately maximises the average log-likelihood.
2. Systematic Issues
2.1. Extrapolation Errors
- when queried on points far outside the range of the data the tree was trained on, the tree isn't able to extrapolate well: because of the nature of the prediction strategy (averaging the training labels in a leaf, or ensembling such averages), every prediction stays bounded by the values seen during training.
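A tiny hypothetical illustration (the leaf values below are made up for a tree fitted to y = 2x on x in [0, 10]; nothing here comes from the notes): the tree only stores leaf averages, so a query far outside the training range is clamped to an edge leaf.
(defun toy-tree-predict (x)
  ;; each leaf stores the mean label of the training points it covers
  (if (< x 5.0)
      5.0     ; mean of y = 2x over x in [0, 5)
      15.0))  ; mean of y = 2x over x in [5, 10]

(toy-tree-predict 100.0) ; => 15.0, far from the linearly extrapolated 200.0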