Datum Engineering !

An engineered artwork to make decisions..


Data flow: Web log analysis on a Hive-way

Posted by datumengineering on February 8, 2013

A data flow design to gain insight into user behavior on a web site. The flow explains how to flatten all the elements of a web log so they can support detailed analysis of users and their behavior.

Technology & Skills: Hadoop, Hive, HiveQL (plus the rich set of UDFs in HiveQL).

Infrastructure: Amazon Web Services (AWS).

Process -1 Move data from the web server to Amazon Simple Storage Service (S3), and from S3 to HDFS.


Process -2 Start an EC2 instance (type: small) to run the MapReduce job that parses the log file.

To run jobs on AWS, both the EBS volume and the EC2 instance should be running.

Process -3 Prepare Elastic MapReduce (EMR) to run the jobs from the command line.

To run EMR from the command line we use an Amazon EMR credentials file, which simplifies job flow creation and the authentication of requests. The credentials file supplies the information required by many commands and is a convenient place to store command parameters so you don't have to enter them repeatedly. The Amazon EMR CLI automatically looks for these credentials in the file credentials.json.

To install the Elastic MapReduce CLI:

1. Navigate to your elastic-mapreduce-cli directory.
2. Unzip the compressed file. Linux and UNIX users, from the command-line prompt, enter the following:

$ unzip elastic-mapreduce-ruby.zip

Configuring Credentials

Your credentials are used to calculate the signature value for every request you make, and Elastic MapReduce automatically looks for them in the file credentials.json, so it is convenient to edit credentials.json and include your AWS credentials there. An AWS key pair is a security credential, similar to a password, which you use to securely connect to your instance while it is running.

To create your credentials file:

1. Create a file named credentials.json in the elastic-mapreduce-cli/elastic-mapreduce-ruby directory.
2. Add the following lines to your credentials file:
{
  "access_id": "[Your AWS Access Key ID]",
  "private_key": "[Your AWS Secret Access Key]",
  "keypair": "[Your key pair name]",
  "key-pair-file": "[The path and name of your PEM file]",
  "log_uri": "[A path to a bucket you own on Amazon S3, such as s3n://myloguri/]",
  "region": "[The Region of your job flow, either us-east-1, us-west-2, us-west-1, eu-west-1, ap-northeast-1, ap-southeast-1, or sa-east-1]"
}

Note the name of the Region. You will use this Region to create your Amazon EC2 key pair and your Amazon S3 bucket.

Process -4 Prepare Hive tables for data analysis. Create a landing table to load the log data.

We create the schema so that each log line is tokenized: the MAP type and the collection delimiters are used to build a key-value array.

CREATE TABLE logdata (
  C_2 STRING,
  C_3 MAP<STRING, STRING>,
  C_4 STRING,
  C_21 STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ' '
  COLLECTION ITEMS TERMINATED BY '\073'
  MAP KEYS TERMINATED BY '='
STORED AS TEXTFILE;

Process -6 Load the Hive landing table with the log file data from HDFS.


LOAD DATA INPATH 'hdfs://10.130.86.181:9000/input/log.txt' OVERWRITE INTO TABLE `logdata`;
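
Once the data is loaded, the key-value parsing can be spot-checked. A small hedged example; the map key name cs-uri-query is only illustrative, as the real keys depend on the log format:

-- Inspect a few rows and pull one value out of the MAP column
select C_2, C_3['cs-uri-query'] from logdata limit 10;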

Process -7 Load the Hive staging table from the landing table.

The staging table holds the data from the landing table; it is used to load cleansed data without any junk characters (the log has some # characters, which we remove when loading into staging).


create table logdata_stg
comment 'log data' stored as sequencefile as
select * from logdata where C_0 not like '%#%';

Process -8 Load the Hive final table from the staging table.


This process creates a flattened structure of the complete log file in the final table, which is then used throughout the analysis. The final table is created with the actual column names identified in the log file, and the load uses UDFs to parse the query string, the host name, and the category tree in the browse data.

create table logdata_fnl
comment 'log data' stored as sequencefile as
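
The SELECT itself is not reproduced here (see the previous post referenced below for the UDF details). Purely as a hedged sketch of what the flattening might look like, using Hive's built-in parse_url and split functions, and assuming, for illustration only, that C_3 holds the key-value MAP and C_4 holds the request URL:

select
  C_3['clientid']                     as client_id,     -- value pulled from the key-value MAP column
  parse_url(C_4, 'HOST')              as host_name,     -- built-in UDF: host portion of the URL
  parse_url(C_4, 'QUERY', 'category') as category,      -- built-in UDF: one query-string parameter
  split(parse_url(C_4, 'PATH'), '/')  as category_tree  -- built-in UDF: path broken into an ARRAY
from logdata_stg;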

Read my previous post on Hive, Agility in Hive, to go into the detail of how Hive UDFs helped run this analysis efficiently using MAP and associative ARRAY types.


Posted in Big Data, Hive

Agility in Hive — Map & Array score for Hive

Posted by datumengineering on September 27, 2012

There is an ongoing debate comparing Pig and Hive, and there is a good post from @Larsgeorge that talks about Pig vs. Hive.

I am not expert enough to go into the details of that comparison, but here I want to explore some of the Hive features that give Hive an edge.

These features are MAP (an associative array) and ARRAY. MAP gives you an alternative way to segregate your data into keys and values. So, if you have data something like this:

clientid='xxxx234xx', category='electronics', timetaken='20/01/2000 10:20:20'

then you can break it down into keys and values, where clientid, category, and timetaken are the keys and the values are xxxx234xx, electronics, and 20/01/2000 10:20:20. How about not only converting them into keys and values, but also storing and retrieving them as a single column? When you define a MAP, Hive stores the complete map in one column, like this:

COL_1

{"clientid"="xxxx234xx", "category"="electronics", "timetaken"="20/01/2000 10:20:20"}

To store data like this, you need to define the table like this:

Create table table1
(
COL_1 MAP<STRING,STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
MAP KEYS TERMINATED BY '=';

Now, retrieval is pretty easy: in your HiveQL you just need to write Select COL_1['category'] from table1 and you'll get electronics. Had MAP not been there, I would have ended up writing a complete parsing program to store such a custom format in a table.
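
For example, against the table above (assuming the sample row has already been loaded), the lookup and Hive's built-in map_keys()/map_values() inspection functions would look like this:

-- Look up one value by key, and inspect the whole MAP column
Select COL_1['category'] as category,   -- 'electronics' for the sample row
       map_keys(COL_1)   as all_keys,   -- built-in: ARRAY of the map's keys
       map_values(COL_1) as all_values  -- built-in: ARRAY of the map's values
from table1;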

Similarly, ARRAY can be used to store a collection in a column. So you can have data like:

'xxxx1234yz';'/electronics/music-player/ipad/shuffle/';

Now, say you want to parse the complete hierarchy in the second column. In Hive it is easy to store it as an ARRAY. The definition would be:

Create table table1
(
CUSTOMERID STRING,
COL_1 ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
COLLECTION ITEMS TERMINATED BY '/';

Now retrieving the data is straightforward: query the table with the index of the collection element, i.e. the level you want to go to:

Select Col_1[1] from table1;
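
If you need every level of the path at once rather than a single index, Hive's built-in explode() together with LATERAL VIEW flattens the ARRAY into rows. A brief sketch against the table above:

-- One output row per level of the category path
Select CUSTOMERID, category_level
from table1
LATERAL VIEW explode(COL_1) path_levels AS category_level;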

You may also have scenarios where you have a collection of MAPs. There you need to use both MAP and ARRAY together in the same table definition, along with the required delimiters for the ARRAY and the MAP.

So your table's delimiter definition should look like this:

FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '/'
MAP KEYS TERMINATED BY '='
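
Putting it together, one way to read this is a table that carries both an ARRAY column and a MAP column sharing those delimiters. A hedged, illustrative sketch; the table and column names here are hypothetical:

Create table browse_log
(
CUSTOMERID STRING,
BROWSE_PATH ARRAY<STRING>,       -- e.g. /electronics/music-player/ipad/shuffle/
ATTRIBUTES MAP<STRING,STRING>    -- e.g. clientid=xxxx234xx
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '/'
MAP KEYS TERMINATED BY '=';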

Posted in Big Data, Hive