Decision Trees Ensembles

Introduction to Decision Trees

Decision Trees are binary tree models that enumerates a number of records (past historical data also known as training data) with large number of predictors  to build the tree and make predictions on the future unseen data (known as validation or evaluation dataset). The child nodes or leaves contain a sub-sample of the historical or training records compared to its parents. The record could either belong to the left child node or the right child node based on a evaluation metric. This evaluation metric leads to a plethora of decision trees that scientists have discovered.


An ensemble is a collection of weak models to make a better decision as we approach the final outcome. An ensemble could be based on

  1. Bagging – A technique where you sample the records without replacement from the training dataset and build a large number models in parallel. Once all the weak models are built, a voting mechanism is used to determine the final outcome of the model. e.g.: Random Forest Trees
  2. Boosting – A sequence of weak models are built where each model makes corrections on the wrongly classified sample and thus improves the accuracy. e.g.: GBM, XGBoost
  3.  Stacking – An approach to learning intermediate features by training N models from N sampled buckets from the training dataset. The scores or probabilities from these models form the training features for the next layer of training models. You stack the output of one set of models as input to the next set of models as in neural networks. Hence, the name is referred to as stacking. e.g.: Neural networks

Before we get to the ensembles, it would be useful to understand the history of different types of decision trees. Decision trees has been used by statisticians and more recently machine learning experts in computer science field for a number of applications.

Decision trees is one of most important techniques in machine learning since it has the properties to segment non-linear high-dimensional data without overfitting or under-ftting the model.

A number of types of decision trees exist in practice:

  1. Classification and Regression Trees (CART)
  2. C4,5
  3. ID3
  4. CHAID etc.

The decision trees have some characteristics that is needed to build a model

  1. Ability to handle Predictors that have both numeric and categorical values (ordered or unordered)
  2. Target Variable can be binomial, multinoulli, unordered categorical values or a regression score
  3. Trees decide the data partitioning based on a impurity measure
  4. Trees can be grown and pruned back


Classification and Regression Tree (CART)

CART was first published by Breiman et. al. in 1984. The CART recursively splits data in two partitions based on minimizing the “impurity” of each node over all predictors and full training dataset.  It exhaustively searches over each predictor if it can split the node impurity than than the rest. If so, the data is partitioned into two groups and further recursively partitioned.

Some examples of node impurity are

  1. GINI Index
  2. Entropy
  3. Variance




(to be continued..)



  1. Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. “Classification and regression trees. Wadsworth & Brooks.” Monterey, CA (1984).
  2. Loh, Wei‐Yin. “Classification and regression trees.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1, no. 1 (2011): 14-23.
  3. Quinlan, J. Ross. “Induction of decision trees.” Machine learning 1, no. 1 (1986): 81-106.
  4. Quinlan, J. Ross. C4. 5: programs for machine learning. Elsevier, 2014.
  5. Wu, Xindong, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan et al. “Top 10 algorithms in data mining.” Knowledge and information systems 14, no. 1 (2008): 1-37
  6. Wolpert, David H. “Stacked generalization.” Neural networks 5, no. 2 (1992): 241-259.