Datum Engineering !

An engineered artwork to make decisions.

Agility in Hive — Map & Array score for Hive

Posted by datumengineering on September 27, 2012

There is an ongoing debate and comparison between Pig and Hive. There is a good post from @Larsgeorge that discusses Pig vs. Hive.

I am not expert enough to go into the details of that comparison, but here I want to explore some of the Hive features that give Hive an edge.

These features are MAP (associative array) and ARRAY. MAP gives you an alternative way to organize your data around keys and values. So, suppose you have data like this:

clientid='xxxx234xx', category='electronics', timetaken='20/01/2000 10:20:20'

Then you can break it down into keys and values, where clientid, category and timetaken are the keys and the values are xxxx234xx, electronics and 20/01/2000 10:20:20. And you can do more than convert them into keys and values: you can store and retrieve them in a column as well. When you define a MAP, Hive stores the complete map in a single column, like this:

COL_1

{"clientid"="xxxx234xx", "category"="electronics", "timetaken"="20/01/2000 10:20:20"}

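To store data like this, define the table as follows:

CREATE TABLE table1
(
  COL_1 MAP<STRING,STRING>
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '=';

Here COLLECTION ITEMS TERMINATED BY separates one key/value pair from the next, and MAP KEYS TERMINATED BY separates a key from its value. A FIELDS TERMINATED BY ',' clause would instead split the line into separate columns at every comma, so only the first pair would land in COL_1. Quotes and spaces in the raw file are not stripped, so the stored line should read clientid=xxxx234xx,category=electronics,timetaken=20/01/2000 10:20:20.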

Now retrieval is pretty easy. You just need to say in your HiveQL: SELECT COL_1["category"] FROM table1; and you'll get electronics. Had MAP not been there, I would have ended up writing a complete parsing program to store such a custom format in a table.
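To make the lookup a little more concrete, here is a minimal sketch of a few MAP queries. It assumes table1 exists with COL_1 declared as MAP<STRING,STRING> and already holds the sample row; the column aliases are only for illustration.

```sql
-- Assumes table1 with COL_1 MAP<STRING,STRING>, loaded with the sample row.
SELECT COL_1['category'] AS category,      -- value stored under one key
       map_keys(COL_1)   AS all_keys,      -- every key, returned as an array
       size(COL_1)       AS entry_count    -- number of key/value pairs
FROM table1
WHERE COL_1['clientid'] = 'xxxx234xx';     -- a map lookup also works as a filter
```

A key that is not present in the map comes back as NULL, so rows with missing attributes do not need special handling in the query.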

Similarly, ARRAY can be used to store a collection in a column. So you can have data like:

'xxxx1234yz';'/electronics/music-player/ipad/shuffle/';

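Now you want to parse every level in the second column. In Hive it is easy to store it as an ARRAY. The definition would be:

CREATE TABLE table1
(
  CUSTOMERID STRING,
  COL_1 ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
COLLECTION ITEMS TERMINATED BY '/';

Some versions of the Hive CLI treat a quoted semicolon as the end of the statement; if that happens, writing the field delimiter as '\073' should avoid it.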

Retrieving the data is now straightforward: query the table with the collection index, or level, that you want. Array indexes start at 0.

SELECT COL_1[1] FROM table1;
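Beyond picking a single index, the same column can be measured or flattened. The queries below are a sketch, assuming table1 is defined with CUSTOMERID STRING and COL_1 ARRAY<STRING> as above; the aliases depth, first_element, lvl and category_level are made up for illustration.

```sql
-- Assumes table1 with CUSTOMERID STRING and COL_1 ARRAY<STRING>.
SELECT CUSTOMERID,
       size(COL_1) AS depth,           -- number of elements in the array
       COL_1[0]    AS first_element    -- index 0 is the first element
FROM table1;

-- One output row per array element, via LATERAL VIEW and explode().
SELECT t.CUSTOMERID, lvl.category_level
FROM table1 t
LATERAL VIEW explode(t.COL_1) lvl AS category_level;
```

The second form is handy when you want to count or group by individual levels rather than address them by position.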

You may also have scenarios where you have a collection of MAPs. There you need to use both MAP and ARRAY together in the same table definition, along with the required delimiters for ARRAY and MAP.

So your table's delimiter definition should look like this:

FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '/'
MAP KEYS TERMINATED BY '='
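Put together, a table with both kinds of column might look like the sketch below. The table name client_events, its column names and the sample line are all hypothetical, made up only to show the three delimiters working side by side.

```sql
-- Hypothetical table: one STRING, one ARRAY and one MAP column.
CREATE TABLE client_events (
  clientid STRING,
  path     ARRAY<STRING>,
  attrs    MAP<STRING,STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '/'
  MAP KEYS TERMINATED BY '=';

-- A matching input line (invented for illustration):
-- xxxx234xx,electronics/music-player/ipad,category=electronics/brand=apple

SELECT clientid,
       path[0]           AS top_level,   -- first array element
       attrs['category'] AS category     -- map lookup by key
FROM client_events;
```

Note that the collection delimiter separates array elements and map entries alike, so a value that itself contains '/' (such as the date 20/01/2000) would be split unless you pick a different delimiter.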
