Lead Function Using a Custom Reducer Script in Hive: Start Time and End Time

Sometimes it is necessary to access the next row's value within a query in order to perform computations between rows.

Hive 0.11 introduced a windowing and analytics module with built-in map-reduce UDFs (including LEAD). However, there is another way to replicate the lead function in Hive: a custom reducer script.

The approach is simple: force all of the mapped data for each key into a single reducer, using the following Hive script.


use package;

ADD FILE ./reducer.py;

drop table lead_table;

set mapred.reduce.tasks=1;

create table lead_table as 
select transform (a.column1, a.column2) using
'python reducer.py'
as a_column1, a_column2, lead_column3
from (select 
column1, column2
from source_table 
cluster by column1 ) a;

There are two tricks here:

  • We skip the map phase altogether by using SELECT TRANSFORM rather than MAP ... USING.
  • cluster by column1 sends all rows for a given key to the same reducer; combined with mapred.reduce.tasks=1, every row ends up in one single reducer.
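To visualize the stream the reducer script consumes, here is a small sketch with hypothetical sample data. With a single reducer, cluster by delivers all rows for a key contiguously and sorted by key; sorting the tuples mimics that stream:

```python
# Hypothetical sample rows as (column1, column2) pairs, in arbitrary order
rows = [
    ("user_b", "2015-05-02 10:00:00"),
    ("user_a", "2015-05-01 09:00:00"),
    ("user_a", "2015-05-01 09:05:00"),
    ("user_b", "2015-05-02 10:30:00"),
]

# With one reducer, cluster by column1 groups all rows for a key together;
# sorting the tuples reproduces that ordering
clustered = sorted(rows)

# The tab-separated lines the reducer script reads from stdin
stream = "\n".join("\t".join(r) for r in clustered)
print(stream)
```

Each key's rows arrive back to back, which is what lets the reducer pair consecutive timestamps.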

 

Caveat: more than one key may end up in the same reducer. The program below must therefore be modified to detect when the key changes between rows.
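A minimal sketch of such a modification (assuming the same two-column, tab-separated input; the final-row emission is omitted for brevity) resets the saved timestamp whenever the key changes, so an interval is never computed across two different keys:

```python
import sys

def emit_intervals(lines):
    """Yield key<TAB>start<TAB>end rows, resetting whenever the key changes."""
    prev_key = None
    prev_time = None
    for line in lines:
        key, ts = line.rstrip('\n').split('\t')
        if key != prev_key:
            # New key: never pair this row with the previous key's timestamp
            prev_key, prev_time = key, ts
            continue
        # Same key: emit start time (previous row) and end time (current row)
        yield "\t".join([key, prev_time, ts])
        prev_time = ts

if __name__ == '__main__':
    for row in emit_intervals(sys.stdin):
        print(row)
```

Because cluster by groups each key's rows contiguously, a single `prev_key` comparison is enough to detect the switch.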

Below is the Python reducer script.

 


#!/usr/bin/python
import sys

prev_time = None
line_split = None
for line in sys.stdin:
	line_split = line.strip().split('\t')
	if prev_time is None:
		# First row: nothing to emit yet; retain its timestamp
		prev_time = line_split[1]
	else:
		# Emit the key, the start time (previous row's timestamp)
		# and the end time (current row's timestamp)
		print("\t".join([line_split[0], prev_time, line_split[1]]))
		# The current timestamp becomes the next interval's start
		prev_time = line_split[1]

# Emit the last row with an empty end time
if line_split is not None:
	print("\t".join([line_split[0], line_split[1], '']))
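Once start and end timestamps are emitted, interval durations can be computed downstream. A minimal sketch, assuming timestamps in the '%Y-%m-%d %H:%M:%S' format:

```python
from datetime import datetime

def duration_seconds(start, end, fmt='%Y-%m-%d %H:%M:%S'):
    """Seconds elapsed between two timestamp strings."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()

print(duration_seconds('2015-05-01 09:00:00', '2015-05-01 09:05:30'))  # 330.0
```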

 

Installing Hive on Mac

Apache Hive Installation

The default Apache Hive distribution ships with a Derby database that supports only one user at a time and resets itself every time the Hive server is restarted. To avoid this, set up a MySQL database server and connect the Hive server to it. MySQL then becomes the de facto store for the metadata of all databases and tables that Hive creates. Ensure that the metastore database in MySQL grants read/write access to the Hive user, so that Hive can make changes over time.

Ensure that the following software is installed on your Mac OS X (Lion/Yosemite):

  • Xcode (latest Version)
  • Hadoop 2.3+

If not, follow the instructions in the Xcode link.

Hadoop and Brew Installation

For instructions to setup hadoop, please refer to this link

For instructions to install brew, please refer to this link

Hive Installation

$ brew install hive
$ vi ~/.bash_profile
export HIVE_HOME=/usr/local/Cellar/hive/1.1.0
$ source ~/.bash_profile

Install mySQL Server

$ brew install mysql
$ mysql -u root

Error 1: Can’t connect to local MySQL server through socket ‘/tmp/mysql.sock’ (2)

$ brew uninstall mysql 
$ brew install mysql
$ mysql -u root

Setup mySQL server

Ensure that the username is the same as the user under which Hadoop and Hive were installed.

mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> CREATE USER 'username'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,ALTER,CREATE,INDEX ON metastore.* TO 'username'@'localhost';
mysql> CREATE DATABASE tempstatsstore;
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,ALTER,CREATE,INDEX ON tempstatsstore.* TO 'username'@'localhost';

Error 2: The specified datastore driver (“com.mysql.jdbc.Driver”) was not found in the CLASSPATH.

Download the latest version of mysql-connector-java using curl:

curl -L 'http://www.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.35.tar.gz/from/http://mysql.he.net/' | tar xz

At the time of writing, 5.1.35 was the latest version of the MySQL JDBC connector; check the MySQL download site for the current version.

$ cp mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar /usr/local/Cellar/hive/1.1.0/lib/

In case you forget the MySQL root password, reset it by following the instructions here.

Setup Hive Configuration

$ cd /usr/local/Cellar/hive/1.1.0/conf
$ cp hive-default.xml.template hive-site.xml

Then edit hive-site.xml and set the following properties:
 
 <property>
 <name>javax.jdo.option.ConnectionPassword</name>
 <value>password</value>
 <description>password to use against metastore database</description>
 </property>
 <property>
 <name>javax.jdo.option.ConnectionURL</name>
 <value>jdbc:mysql://localhost/metastore</value>
 <description>JDBC connect string for a JDBC metastore</description>
 </property>
 <property>
 <name>javax.jdo.option.ConnectionDriverName</name>
 <value>com.mysql.jdbc.Driver</value>
 <description>Driver class name for a JDBC metastore</description>
 </property>
 <property>
 <name>javax.jdo.option.ConnectionUserName</name>
 <value>username</value>
 <description>Username to use against metastore database</description>
 </property>

Other useful Hive settings

 
 <property>
 <name>hive.cli.print.current.db</name>
 <value>true</value>
 <description>Whether to include the current database in the Hive prompt.</description>
 </property>
 <property>
 <name>hive.stats.jdbcdriver</name>
 <value>com.mysql.jdbc.Driver</value>
 <description>JDBC driver for the database that stores temporary Hive statistics.</description>
 </property>
 <property>
 <name>hive.stats.dbconnectionstring</name>
 <value>jdbc:mysql://localhost/tempstatsstore</value>
 <description>The default connection string for the database that stores temporary Hive statistics.</description>
 </property>
 <property>
 <name>hive.cli.print.header</name>
 <value>true</value>
 <description>Whether to print the names of the columns in query output.</description>
 </property>