A decision tree (DT) solves classification problems with discrete features (if some features are continuous, we discretize them first). To decide which feature to split on at each node, we calculate the entropy or the Gini impurity. From the impurity we can compute the information gain, which tells us both which feature to pick and when to stop splitting.
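For instance, a continuous feature can be binned before building the tree. A minimal sketch with pandas (the column name and bin edges here are purely hypothetical):

```python
import pandas as pd

# Hypothetical continuous feature "age", discretized into labeled bins
df = pd.DataFrame({"age": [12, 25, 37, 48, 63, 71]})
df["age_bin"] = pd.cut(df["age"],
                       bins=[0, 18, 40, 65, 100],
                       labels=["child", "young", "middle", "senior"])
print(df["age_bin"].tolist())  # each value now belongs to a discrete category
```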
Entropy
For entropy, suppose the samples at a node fall into $K$ classes with proportions $p_1, \dots, p_K$. We define

$$H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k.$$

Since $0 \le p_k \le 1$ (with the convention $0 \log_2 0 = 0$), every term is non-negative, so $H(D) \ge 0$: it is $0$ for a pure node and reaches its maximum $\log_2 K$ when the classes are evenly mixed.
In information theory, entropy measures the impurity or uncertainty of a distribution: the lower the entropy, the purer the node. This is what we use to decide which feature to split on.
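As a minimal sketch (assuming the class labels are given as a 1-D numpy array), the entropy of a node can be computed as:

```python
import numpy as np

def entropy(labels):
    """Entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # classes with zero count never appear here, so 0*log2(0) is never evaluated
    return -np.sum(p * np.log2(p))

print(entropy(np.array([0, 0, 1, 1])))  # 1.0: evenly mixed node
print(entropy(np.array([0, 0, 0, 0])))  # -0.0 (i.e. zero): pure node
```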
Gini Impurity
For the Gini impurity, with the same class proportions, we define

$$\mathrm{Gini}(D) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2.$$
Intuitively, it measures how mixed the classes in a node are: it is $0$ for a pure node and grows as the classes become more evenly mixed.
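A matching sketch for the Gini impurity, under the same assumptions as the entropy snippet above:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([0, 0, 1, 1])))  # 0.5: evenly mixed binary node
print(gini(np.array([0, 0, 0, 0])))  # 0.0: pure node
```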
Information Gain (IG)
We define the information gain of splitting a dataset $D$ on a feature $A$ as

$$IG(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} H(D_v),$$

where $D_v$ is the subset of $D$ whose value of $A$ is $v$. At each node we choose the feature with the largest information gain; replacing $H$ with the Gini impurity gives the analogous criterion.
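A sketch of the information-gain computation for a single discrete feature (the feature and label arrays here are toy assumptions; the entropy helper is repeated so the snippet runs on its own):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Gain of splitting `labels` by the discrete values of `feature`."""
    weighted = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted

# A feature that perfectly separates two balanced classes gains 1 bit
feature = np.array(["a", "a", "b", "b"])
labels = np.array([0, 0, 1, 1])
print(information_gain(feature, labels))  # 1.0
```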
Pruning
Pruning is an important technique for avoiding overfitting. It comes in two forms: pre-pruning and post-pruning.
Pre-Pruning
Normally, there are two conditions under which we stop growing the tree early; a scikit-learn sketch of both follows the list.
- Set a maximum depth (number of layers) for the tree.
- Set a minimum information gain threshold: if the best split's gain falls below it, we stop splitting that node.
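As a sketch of both conditions, scikit-learn's DecisionTreeClassifier exposes max_depth (a cap on the number of layers) and min_impurity_decrease (a threshold on the weighted impurity decrease, which plays the role of a minimum-gain cutoff); the exact threshold values below are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: never grow past depth 3, and skip splits whose
# weighted impurity decrease falls below 0.01
clf = DecisionTreeClassifier(criterion="entropy",
                             max_depth=3,
                             min_impurity_decrease=0.01)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```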
Post-Pruning
Normally, we use a validation set to guide post-pruning: we greedily remove a node (collapsing its subtree into a leaf) whenever doing so improves performance on the validation set, as sketched below.
Alternatively, we can use rule post-pruning, as C4.5 does: the tree is first converted into a set of rules, and each rule is then pruned independently.
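A minimal sketch of the validation-set (reduced-error) pruning idea described above; the Node class, the toy tree, and the validation data are all illustrative assumptions:

```python
class Node:
    """Tiny decision-tree node: a leaf with a label, or an internal node
    that routes on one feature index to children keyed by feature value."""
    def __init__(self, label=None, feature=None, children=None, majority=None):
        self.label = label            # set on leaves only
        self.feature = feature        # feature index tested at internal nodes
        self.children = children or {}
        self.majority = majority      # majority training class at this node

    def is_leaf(self):
        return self.label is not None

    def predict(self, x):
        if self.is_leaf():
            return self.label
        child = self.children.get(x[self.feature])
        return child.predict(x) if child is not None else self.majority

def accuracy(tree, X, y):
    return sum(tree.predict(x) == t for x, t in zip(X, y)) / len(y)

def reduced_error_prune(tree, node, X_val, y_val):
    """Greedily collapse subtrees into leaves while validation accuracy does not drop."""
    if node.is_leaf():
        return
    for child in node.children.values():
        reduced_error_prune(tree, child, X_val, y_val)
    before = accuracy(tree, X_val, y_val)
    saved_feature, saved_children = node.feature, node.children
    node.feature, node.children, node.label = None, {}, node.majority   # tentative prune
    if accuracy(tree, X_val, y_val) < before:
        node.label = None                                               # pruning hurt: undo
        node.feature, node.children = saved_feature, saved_children

# Toy tree: the subtree under feature-0 value 1 has overfit the training noise
noisy = Node(feature=1, majority=1,
             children={0: Node(label=1, majority=1), 1: Node(label=0, majority=0)})
root = Node(feature=0, majority=0,
            children={0: Node(label=0, majority=0), 1: noisy})

X_val = [(0, 0), (0, 1), (1, 0), (1, 1)]
y_val = [0, 0, 1, 1]
print(accuracy(root, X_val, y_val))           # 0.75 before pruning
reduced_error_prune(root, root, X_val, y_val)
print(accuracy(root, X_val, y_val))           # 1.0 after the noisy subtree is collapsed
```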