Decision Tree Ensembles

Introduction to Decision Trees

A decision tree is a binary tree model that recursively partitions a set of records (historical data, also known as training data) described by a large number of predictors, and then uses the resulting tree to make predictions on unseen data (known as the validation or evaluation dataset). Each child node or leaf contains a subset of the training records held by its parent. A record goes to either the left or the right child node according to a splitting criterion. The choice of this criterion is what distinguishes the many decision tree variants that researchers have developed.


An ensemble is a collection of weak models whose combined output gives a better decision than any single model. An ensemble can be based on

  1. Bagging – A technique where you sample records with replacement (bootstrap sampling) from the training dataset and build a large number of models in parallel. Once all the weak models are built, a voting mechanism determines the final outcome. e.g.: Random Forest
  2. Boosting – A sequence of weak models is built, where each model corrects the samples misclassified by its predecessors and thus improves accuracy. e.g.: GBM, XGBoost
  3. Stacking – An approach to learning intermediate features by training N models on N sampled buckets of the training dataset. The scores or probabilities from these models form the training features for the next layer of models. You stack the output of one set of models as input to the next set, much as layers feed each other in neural networks; hence the name stacking. e.g.: stacked generalization (Wolpert, 1992)
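
To make the bagging idea concrete, here is a minimal sketch in plain Python. It is not how Random Forest is implemented: the weak model is a hypothetical one-split "stump" on a single numeric predictor, and the names train_stump, bagging and predict as well as the toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-split weak model: pick the threshold with the fewest errors."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum(
                (left_label if x <= threshold else right_label) != y
                for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def bagging(data, n_models=25, seed=0):
    """Train n_models stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # bootstrap sample: same size as the data, drawn WITH replacement
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def predict(models, x):
    """Majority vote over all weak models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# toy (predictor, label) records, made up for this example
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagging(data)
print(predict(models, 0), predict(models, 10))
```

Each stump sees a slightly different sample, so individual stumps can disagree; the vote smooths out those differences.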

Before we get to ensembles, it is useful to understand the history of the different types of decision trees. Decision trees have been used by statisticians, and more recently by machine learning practitioners in computer science, for a number of applications.

Decision trees are among the most important techniques in machine learning because they can segment non-linear, high-dimensional data and, when grown and pruned appropriately, avoid both overfitting and underfitting.

A number of types of decision trees exist in practice:

  1. Classification and Regression Trees (CART)
  2. C4.5
  3. ID3
  4. CHAID etc.

Decision trees have some characteristics that are needed to build a model:

  1. Ability to handle Predictors that have both numeric and categorical values (ordered or unordered)
  2. Target Variable can be binomial, multinoulli, unordered categorical values or a regression score
  3. Trees decide the data partitioning based on an impurity measure
  4. Trees can be grown and pruned back


Classification and Regression Tree (CART)

CART was first published by Breiman et al. in 1984. CART recursively splits the data into two partitions by minimizing the “impurity” of each node over all predictors and the full training dataset. At each node it exhaustively searches every predictor for the split that reduces node impurity more than any other. The data is then partitioned into two groups, each of which is recursively partitioned in the same way.

Some examples of node impurity are

  1. GINI Index
  2. Entropy
  3. Variance
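
The three impurity measures, and the exhaustive split search CART performs on one predictor, can be sketched as follows. This is a minimal illustration using the textbook definitions, not the reference CART implementation; the function names and the single numeric predictor are assumptions.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def variance(values):
    """Population variance of a numeric target (used for regression trees)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def best_split(xs, ys, impurity=gini):
    """Try every threshold on one numeric predictor and return
    (threshold, weighted child impurity) for the best binary split."""
    best = None
    n = len(xs)
    # skip the largest value: it would leave the right child empty
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

print(gini(["a", "a", "b", "b"]), entropy(["a", "a", "b", "b"]))
print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))
```

A full tree builder would repeat best_split over every predictor at every node and recurse on the two resulting partitions.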




(to be continued..)



  1. Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks, 1984.
  2. Loh, Wei-Yin. “Classification and regression trees.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1, no. 1 (2011): 14-23.
  3. Quinlan, J. Ross. “Induction of decision trees.” Machine Learning 1, no. 1 (1986): 81-106.
  4. Quinlan, J. Ross. C4.5: Programs for Machine Learning. Elsevier, 2014.
  5. Wu, Xindong, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan et al. “Top 10 algorithms in data mining.” Knowledge and Information Systems 14, no. 1 (2008): 1-37.
  6. Wolpert, David H. “Stacked generalization.” Neural Networks 5, no. 2 (1992): 241-259.

Installing MySQL on Mac OS-X

SQL is the present and the future language of analytics. It is well suited to analytics because it is declarative. That means “You declare the goal and the machine figures out the process and executes it”.

MySQL is one of the most widely used open source relational database management systems and is managed by Oracle Corporation. After installation, you can practice SQL queries in MySQL Workbench as follows:

Installing MySQL on Mac OS X – El Capitan 10.11.2

Download the latest files from the Oracle website after registering yourself on the website.

  • mysql-5.7.10-osx10.10-x86_64.dmg
  • mysql-workbench-community-6.3.6-osx-x86_64.dmg


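If the MySQL server is not running, the client fails to connect:

$ mysql restart

ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)

MySQL Workbench may also report:

Command to start the server is not configured. Please set the command that must be used to start the server in the remote management section of this connections settings.

To fix this, give the _mysql user ownership of the data directory and start the server:

$ sudo chown -R _mysql:_mysql /usr/local/var/mysql
$ sudo mysql.server start
Starting MySQL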

Spark on Mac

Introduction to Apache Spark

Spark is getting its due attention as a lightning-fast distributed computing engine. It is an improvement over Hadoop MapReduce. Several significant improvements make Spark superior for performing analytics on big data:

  1. In-memory storage and computation for iterative algorithms
  2. Support for object-oriented and functional programming paradigms with Scala, Python and Java
  3. Interactive shell for quick tests, and deployment as binaries
  4. Works with resource managers such as YARN (next-generation Hadoop) and Mesos


Install Java

$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

Install brew

ruby -e "$(curl -fsSL" 

Install Hadoop

$ brew install hadoop

Brew installs Hadoop in the folder below. Add this folder to your ~/.bash_profile:

$ vi ~/.bash_profile
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.6.0
$ source ~/.bash_profile
  • Edit Configuration files
/usr/local/Cellar/hadoop/2.6.0$ cd libexec/etc/hadoop/
vi core-site.xml















vi yarn-site.xml







vi mapred-site.xml







vi hdfs-site.xml







Install Apache Maven

$ brew install maven

$ vi ~/.bash_profile
export MAVEN_HOME=/usr/local/Cellar/maven/3.2.5

Install Apache Spark

Download the latest tarball for Spark and extract it into /usr/local

$ tar xvf ~/Downloads/spark-1.3.1.tar.gz -C /usr/local
/usr/local/spark-1.3.1 $ mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

Once the spark installation is complete and successful, add the following

$ vi ~/.bash_profile
export SPARK_HOME=/usr/local/spark-1.3.1
$ source ~/.bash_profile

Resolving Errors

Error 1: Detected Maven Version: 3.2.5 is not in the allowed range 3.3.3.
Solution: Upgrade to a higher version of Maven

$ brew update
$ brew upgrade maven

Error 2: [error] missing or invalid dependency detected while loading class file 'WebUI.class'.
Solution: Switch to Scala version 2.11 in the Spark directory

$ cd /usr/local/spark-1.5.2
$ ./dev/ 2.11

Data Exploration

  • Start hadoop
$ hadoop dfsadmin -safemode leave
$ hadoop fs -mkdir -p /user/<username>/datasets/

Download the restaurant_ratings

Open the file in Excel and save the data as restaurant_ratings.csv

Copy the data that you want to parse into HDFS.

$ hadoop fs -copyFromLocal ./restaurant_ratings.csv /user/<username>/datasets/
  • Start yarn
  • Start spark interactive shell
$ spark-shell --master yarn-client

scala >

// load data
 val ratings = sc.textFile("/user/<username>/datasets/restaurant_ratings.csv")
// sample 10 records
 ratings.take(10).foreach(println)
// remove header: keep only the lines that are not the header row
 def notHeader(line: String): Boolean = !line.contains("userID")
// apply the remove header filter
 val rhRatings = ratings.filter(notHeader)
// define a template for each record
 case class restRating (userId: String, placeId: String, rating: Int)
// parse the data
 def parsed (line: String): restRating = {
   val words = line.split(",")
   val userId = words(0)
   val placeId = words(1)
   val rating = words(2).toInt
   restRating(userId, placeId, rating)
 }
// apply the parsed function to the RDD[String] to convert into RDD[restRating]
 val parsedRecords = rhRatings.map(parsed)
// extract just the ratings
 val ratingValues = parsedRecords.map(rr => rr.rating)
// cache into memory
 parsedRecords.cache()
// sample data
 val samples = parsedRecords.take(20)
// pick two columns (take returns a local Array, so the next steps run on the driver)
 val parsedPairs = samples.map(rr => (rr.userId, rr.rating))
import org.apache.spark.SparkContext._
// group by user Id; _1 indicates column 1
 val grouped = parsedPairs.groupBy(rr => rr._1)
// output results
 grouped.mapValues(x => x.size).foreach(println)
// Summary Statistics
// Count by value
 val matchedCounts = parsedRecords.map(md => md.rating).countByValue()
// scala.collection.Map[Int,Long] = Map(0 -> 254, 2 -> 486, 1 -> 421)
// Convert them to sequence so we can sort them by column 1,2...N
 val matchedCountSeq = matchedCounts.toSeq
// Sort by column 1
 matchedCountSeq.sortBy(_._1).foreach(println)
// sort by column 2
 matchedCountSeq.sortBy(_._2).foreach(println)
// descending order sorting
 matchedCountSeq.sortBy(_._2).reverse.foreach(println)
// statistics
 parsedRecords.map(md => md.rating).stats()
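
To check the parsing and counting logic locally before running it on the cluster, the same steps can be mirrored in plain Python. This is only a sketch: it uses no Spark API, and the sample rows and variable names below are made up for illustration rather than taken from the restaurant ratings file.

```python
from collections import Counter
from statistics import mean, pstdev

# hypothetical rows in the same userID,placeID,rating layout
lines = [
    "userID,placeID,rating",
    "U1,P10,2",
    "U1,P11,2",
    "U2,P10,0",
    "U2,P12,1",
    "U3,P11,2",
]

def parse(line):
    """Split one CSV row into (user_id, place_id, rating)."""
    user_id, place_id, rating = line.split(",")[:3]
    return user_id, place_id, int(rating)

# remove the header, then parse every remaining row
records = [parse(line) for line in lines if "userID" not in line]

# count by value, like countByValue() on the ratings
counts = Counter(rating for _, _, rating in records)
# sort by column 1 (the rating)
by_rating = sorted(counts.items())
# sort by column 2 (the count), descending
by_count_desc = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
# number of ratings per user, like the groupBy on userId
per_user = Counter(user_id for user_id, _, _ in records)

ratings_only = [rating for _, _, rating in records]
print(by_rating, by_count_desc, dict(per_user))
print(mean(ratings_only), pstdev(ratings_only), min(ratings_only), max(ratings_only))
```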

Lead Function using the Custom Reducer Script in Hive: Start time and End time

Sometimes it is necessary to access the next row within a query in order to perform computations between rows.

Hive 0.11.x includes a Windowing and Analytics module with Java map-reduce UDFs. However, there is another possible approach: replicating the lead function in Hive using a custom reduce script.

The approach is simple: we force all the mapped rows for each key into a single reducer using the following Hive script.

use package;


drop table lead_table;

set mapred.reduce.tasks=1;

create table lead_table as 
select transform (a.column1, a.column2) using
as a_column1, a_column2, lead_column3
from (select 
column1, column2
from source_table 
cluster by column1 ) a;

There are two tricks here:

  • We skip the map phase altogether by using SELECT TRANSFORM and not MAP USING
  • cluster by column1 – distributes and sorts the rows by column1, so all rows for a key go to one single reducer in sorted order.


Caveat: More than one key may end up in the same reducer, so the program below has to be modified to account for a switch in keys.

Below is the reducer Python script.


import sys

count = 0
prev_key = None
prev_time = None
for line_out in sys.stdin:
    line_split = line_out.strip().split('\t')
    if count == 0:
        # Ignore the first line: it has no previous row to pair with
        count += 1
    else:
        # print the key, the start time and the end time, tab-separated
        fields = [prev_key, prev_time, line_split[1]]
        print("\t".join(fields))
    # retain the current key and timestamp as the next start timestamp
    prev_key = line_split[0]
    prev_time = line_split[1]

# print the last line; it has no next row, so the end time is NULL (\N in Hive)
if count > 0:
    print("\t".join([prev_key, prev_time, "\\N"]))
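
The caveat about several keys landing in the same reducer can be handled by comparing each row's key with the previous one. The following is a minimal sketch of that modification, not the original author's script; the function name lead_rows and the sample rows are assumptions, and it relies on the input arriving sorted by key, as cluster by provides.

```python
def lead_rows(lines, null="\\N"):
    """Yield (key, start_time, end_time) rows. end_time is the next row's
    timestamp for the same key, or `null` when the key changes or input ends."""
    prev_key = prev_time = None
    for line in lines:
        key, timestamp = line.rstrip("\n").split("\t")[:2]
        if prev_key is not None:
            # only pair rows that belong to the same key
            end_time = timestamp if key == prev_key else null
            yield prev_key, prev_time, end_time
        prev_key, prev_time = key, timestamp
    if prev_key is not None:
        # the final row never has a successor
        yield prev_key, prev_time, null

# hypothetical tab-separated reducer input, already sorted by key
sample = ["a\t09:00\n", "a\t09:05\n", "b\t10:00\n"]
rows = list(lead_rows(sample))
for row in rows:
    print("\t".join(row))
```

In the actual reducer, sys.stdin would be passed to lead_rows in place of the sample list.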


Installing Hive in Mac

Apache Hive Installation

By default, Apache Hive comes with a Derby database that supports only one user at a time and resets itself every time the Hive server is restarted. To avoid this, set up a MySQL database server and connect the Hive server to it. MySQL then becomes the de facto store for the metadata of all databases and tables that Hive creates. We need to ensure that the Hive user has read/write access to the metastore database in the MySQL server so that Hive can make changes over time.

The following software must be installed on your Mac OS X (Lion/Yosemite):

  • Xcode (latest Version)
  • Hadoop 2.3+

If not, follow the instructions in Xcode link

Hadoop and Brew Installation

For instructions to setup hadoop, please refer to this link

For instructions to install brew, please refer to this link

Hive Installation

$ brew install hive
$ vi ~/.bash_profile
export HIVE_HOME=/usr/local/Cellar/hive/1.1.0
$ source ~/.bash_profile

Install MySQL Server

$ brew install mysql
$ mysql -u root

Error 1: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)

$ brew uninstall mysql 
$ brew install mysql
$ mysql -u root

Setup MySQL server

Ensure that the username is the same as that of the user for which Hadoop and Hive were installed.

mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> CREATE USER 'username'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,ALTER,CREATE, INDEX ON metastore.* TO 'username'@'localhost';
mysql> CREATE DATABASE tempstatsstore;
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,ALTER,CREATE, INDEX ON tempstatsstore.* TO 'username'@'localhost';

Error 2: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH.

Download the latest version of the mysql-connector-java using curl

curl -L '' | tar xz

5.1.35 was the latest version of the MySQL JDBC connector at the time of writing. We can check by visiting the MySQL download website.

$ cp mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar /usr/local/Cellar/hive/1.1.0/lib/

In case we forget the MySQL root password, we can reset it by following the instructions here.

Setup Hive Configuration

/usr/local/Cellar/hive/1.1.0/conf$ cp hive-default.xml.template hive-site.xml
 <description>password to use against metastore database</description>
 <description>JDBC connect string for a JDBC metastore</description>
 <description>Driver class name for a JDBC metastore</description>
 <description>Username to use against metastore database</description>
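
The four descriptions above belong to the metastore connection properties in hive-site.xml. As a hedged illustration, the sketch below generates the matching property blocks. The property names are the standard javax.jdo.option.* keys that carry these descriptions in hive-default.xml.template; the JDBC URL is an assumption based on the metastore database created earlier, and username and password are the same placeholders used in the CREATE USER statement.

```python
import xml.etree.ElementTree as ET

def metastore_properties(user, password, url="jdbc:mysql://localhost/metastore"):
    """Build the <property> blocks for a MySQL-backed Hive metastore.
    The default url is an assumed value; adjust host and database as needed."""
    settings = {
        "javax.jdo.option.ConnectionURL": url,
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# placeholders, matching the CREATE USER example
xml_text = metastore_properties("username", "password")
print(xml_text)
```

The generated blocks are meant to be compared against, or pasted over, the corresponding entries in hive-site.xml rather than to replace the whole file.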

Other useful hive settings

 <description>Whether to include the current database in the Hive prompt.</description>
 <description>JDBC driver for the database that stores temporary Hive statistics.</description>
 <description>The default connection string for the database that stores temporary Hive statistics.</description>
 <description>Whether to print the names of the columns in query output.</description>

Complex Networks on Hadoop Map Reduce

Brief Introduction

JUNG is a Java API for social network analysis that can compute many centrality and related measures to study networks. Some of them include

  • Betweenness Centrality
  • Degree Centrality
  • Clustering Coefficient / Transitivity
  • Closeness Centrality
  • Graph Density
  • Geodesic Distance
  • Weighted / Unweighted Shortest Path

If there is a large number of networks, with hundreds of vertices in each network, it is best to process the networks in parallel. I have placed the JUNG processing code in the reducer. The mapper passes different networks to separate reducers, which process them in parallel.
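
The division of work between mapper and reducer can be illustrated with a small plain-Python simulation. It uses neither Hadoop nor JUNG: the input format (network id followed by the two endpoints of an edge), the function names, and the choice of degree and graph density as the measures are all assumptions made for this sketch.

```python
from collections import defaultdict

def map_edges(lines):
    """Mapper: emit (network_id, edge) so each network reaches one reducer."""
    for line in lines:
        network_id, u, v = line.split()
        yield network_id, (u, v)

def reduce_network(edges):
    """Reducer: degree per vertex and density of one undirected network."""
    # drop self-loops and duplicate edges
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = defaultdict(int)
    for edge in unique_edges:
        for vertex in edge:
            degree[vertex] += 1
    n = len(degree)
    density = 2 * len(unique_edges) / (n * (n - 1)) if n > 1 else 0.0
    return dict(degree), density

def run(lines):
    """Group edges by network id (standing in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for network_id, edge in map_edges(lines):
        grouped[network_id].append(edge)
    return {nid: reduce_network(edges) for nid, edges in grouped.items()}

# two hypothetical networks: a triangle (n1) and a path (n2)
result = run(["n1 a b", "n1 b c", "n1 a c", "n2 x y", "n2 y z"])
print(result)
```

Because each network is reduced independently, the reducers never need to share state, which is what makes this layout easy to parallelize.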



Install python module without admin rights

On most remote servers, only system administrators can install software system-wide. Python modules are abundantly available but cannot be installed in the default location without admin rights.

An alternative approach is to install the modules in the local home directory and point the python interpreter to access the modules placed in the local home directory.


Change directory into the source directory of the Python module. Make sure setup.py exists in the directory and execute the following command.

$ python setup.py install --home=/home/user/package_dir

At the shell command prompt, execute the following command

$ export PYTHONPATH="${PYTHONPATH}:/home/user/package_dir/lib/python"

At the Python interactive interpreter, run the following to confirm that the module has been installed.

$ python

>>> import module_name
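
If the import fails, it helps to check whether the local directory is actually on the interpreter's search path. The sketch below is one way to do that in Python 3; the helper name local_package_status is hypothetical, and json is used only as a stand-in for module_name because it is always available.

```python
import importlib.util
import os
import sys

def local_package_status(module_name, package_dir):
    """Report whether package_dir/lib/python is on sys.path and whether
    the (top-level) module can be found by the import system."""
    lib_dir = os.path.join(package_dir, "lib", "python")
    spec = importlib.util.find_spec(module_name)
    return {
        "on_sys_path": lib_dir in sys.path,
        "importable": spec is not None,
        "location": getattr(spec, "origin", None),
    }

# replace "json" with the module you installed
status = local_package_status("json", "/home/user/package_dir")
print(status)
```

If on_sys_path is False, the PYTHONPATH export was probably not applied in the current shell session.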