Lead Function using the Custom Reducer Script in Hive: Start time and End time

Sometimes, it is necessary to extract the next row in the query function to perform computation between rows.

Hive 0.11.x version has included a Windowing and Analytics module with Java map-reduce UDFs.  However, there is another possible approach to replicate the lead function in Hive using the custom reduce script.

The approach is simple in the way that we could force the entire mapped data to forced into a single reducer for each key, value pair using the following hive script.


use package;

ADD FILE ./reducer.py;

drop table lead_table;

set mapred.reduce.tasks=1;

create table lead_table as 
select transform (a.column1, a.column2) using
'python reducer.py'
as a_column1, a_column2, lead_column3
from (select 
column1, column2
from source_table 
cluster by column1 ) a;

There are two tricks here:

  • We skip the map phase altogether by using SELECT TRANSFORM and not MAP USING
  • cluster by column1 – enforces the key-value pair to go into one single reducer.

 

Caveat: More than one key may end up in the same reducer. So, the below program has to be modified to account for switch in the keys.

Below is the reducer python script.

 


#!/usr/bin/python
import sys
import ast
from datetime import datetime
count = 0
for line_out in sys.stdin:
	str1 = []
	str = []
	line_split = line_out.strip().split('\t')
	if count == 0:
		# prev_time = datetime.strptime(line_split[1].split(".")[1],'%Y-%m-%d %H:%M:%S')
		# print time.strftime("%b %d %Y %H:%M:%S", time.gmtime(prev_time))
		# Ignore the first line
		count +=1
		# retain the first timestamp
		prev_time = line_split[1]
	elif count >= 1:
		# print the key
		str.append(line_split[0])
		str.append('\t')
		# Print the start time
		str.append(prev_time)
		str.append('\t')
		# Print the end time
		str.append(line_split[1])
		print "".join(str)
		# retain the current timestamp as the start timestamp
		prev_time = line_split[1]


# print the last line
str = []
str.append(line_split[0])
str.append('\t')
str.append(line_split[1])
str.append('\t')
str.append('')
print "".join(str)

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s