build predictive models using the most informative features
- interior nodes → query on some descriptive feature of the dataset
- leaf nodes → decision/predicted classification/predicted value
shallow trees are preferred
- prevent overfitting
 
informative features split the dataset into more homogeneous or pure sets.
Measures of purity
- entropy & information gain
- information gain ratio
- gini index
- variance
 
Entropy
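The embedded drawing isn't readable here; as a stand-in, the standard Shannon entropy of the target feature $t$ over a dataset $\mathcal{D}$ (notation mine):

$$
H(t, \mathcal{D}) = -\sum_{l \in \text{levels}(t)} P(t = l)\,\log_2 \big(P(t = l)\big)
$$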
Information Gain
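Stand-in for the drawing: the usual definition, i.e. the drop in entropy after partitioning $\mathcal{D}$ on a descriptive feature $d$ (notation mine):

$$
IG(d, \mathcal{D}) = H(t, \mathcal{D}) - \sum_{l \in \text{levels}(d)} \frac{|\mathcal{D}_{d=l}|}{|\mathcal{D}|}\, H(t, \mathcal{D}_{d=l})
$$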
Information Gain Ratio
- Information Gain has a preference for (is biased toward) features with many values
- Information Gain Ratio divides information gain by the amount of information needed to determine the value of the feature (the entropy of the feature itself)
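Stand-in for the drawing, following the description above (notation mine):

$$
GR(d, \mathcal{D}) = \frac{IG(d, \mathcal{D})}{-\sum_{l \in \text{levels}(d)} P(d = l)\,\log_2 \big(P(d = l)\big)}
$$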
Gini Index
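Stand-in for the drawing: the standard Gini index, i.e. how often a randomly drawn instance would be misclassified if labels were assigned according to the class distribution (notation mine):

$$
Gini(t, \mathcal{D}) = 1 - \sum_{l \in \text{levels}(t)} P(t = l)^2
$$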
Variance
- Used for regression trees
 
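Stand-in for the two variance drawings: the usual sample variance of the continuous target within a partition, which a regression tree tries to minimise with each split (notation mine):

$$
\text{var}(t, \mathcal{D}) = \frac{\sum_{i=1}^{n} \left(t_i - \bar{t}\right)^2}{n - 1}
$$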
Continuous Descriptive Features
- preprocessing like binning
- turn into Boolean features using some threshold value
	- < threshold value and >= threshold value
- sort the dataset according to the continuous feature
	- adjacent instances with different target values are possible threshold values
	- threshold value → lies midway between the continuous feature values of the two instances: (x1 + x2) / 2
- optimal threshold → compute information gain (or another measure) for each candidate split and select the split with the highest information gain (see the sketch after this list)
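A minimal sketch of the procedure above (function names and toy data are mine, not from the notes): sort by the continuous feature, take midpoints between adjacent instances whose labels differ as candidate thresholds, and keep the one with the highest information gain.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) maximising information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2:               # only boundaries between differing labels matter
            continue
        threshold = (x1 + x2) / 2  # midpoint between the two instances
        left = [y for x, y in pairs if x < threshold]
        right = [y for x, y in pairs if x >= threshold]
        rem = (len(left) / len(pairs)) * entropy(left) + \
              (len(right) / len(pairs)) * entropy(right)
        gain = base - rem
        if gain > best[1]:
            best = (threshold, gain)
    return best


# toy example: a single continuous feature and a binary target
print(best_threshold([1.2, 2.8, 3.1, 4.5, 5.0], ["no", "no", "yes", "yes", "yes"]))
```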
 
 
Algorithms
- ID3
- CART

Both are greedy algorithms: they don't check whether the best possible split at a higher level leads to the lowest possible impurity at lower levels (see the sketch below).
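A minimal sketch (my own illustration, not ID3 or CART verbatim) of the greedy strategy both share: at every node pick the feature with the highest information gain right now, partition, and recurse, with no lookahead.

```python
from collections import Counter
from math import log2


def entropy(rows, target):
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def information_gain(rows, feature, target):
    rem = 0.0
    for level in {r[feature] for r in rows}:
        part = [r for r in rows if r[feature] == level]
        rem += len(part) / len(rows) * entropy(part, target)
    return entropy(rows, target) - rem


def build_tree(rows, features, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not features:          # pure node or no features left
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    best = max(features, key=lambda f: information_gain(rows, f, target))  # greedy choice
    children = {}
    for level in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == level]
        children[level] = build_tree(subset, [f for f in features if f != best], target)
    return (best, children)                            # interior node: (feature, branches)


# usage with a toy dataset of dicts
toy = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]
print(build_tree(toy, ["outlook", "windy"], "play"))
```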
 
Overfitting
The likelihood of overfitting increases as a tree gets deeper, because each feature test along a path partitions the data further, so the classifications at deeper nodes are based on smaller and smaller subsets of the dataset.
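A minimal sketch (assuming scikit-learn is available) of capping depth to keep the tree shallow; the dataset and the depth value are illustrative, not recommendations from the notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)            # unrestricted depth
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("deep   :", deep.score(X_test, y_test))
print("shallow:", shallow.score(X_test, y_test))
```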