Datum Engineering !

An engineered artwork to make decisions.

Data flow: Web log analysis on a Hive-way

Posted by datumengineering on February 8, 2013

A data flow design to get insight into user behavior on a web site. The flow explains how to flatten all the elements in the web log to support detailed analysis of user behavior.

Technology & Skills: Hadoop-Hive, HiveQL (plus a rich set of UDFs in HiveQL).

Infrastructure: Amazon Web Services (AWS).

Process -1 Move data from the web server to Amazon Simple Storage Service (S3), then to HDFS.


Process -2 Start an EC2 instance (type: small) to run the Map Reduce job that parses the log file.

To run jobs on AWS, both the EBS volume and the EC2 instance must be running.

Process -3 Prepare Elastic MapReduce (EMR) to run the jobs from the command line.

To run EMR from the command line we use an Amazon EMR credentials file, which simplifies job flow creation and authentication of requests. The credentials file is a convenient place to store command parameters so you don't have to enter them repeatedly; the Amazon EMR CLI automatically looks for them in the file credentials.json.

To install the Elastic MapReduce CLI:

1. Navigate to your elastic-mapreduce-cli directory.
2. Unzip the compressed file. Linux and UNIX users, from the command-line prompt, enter the following:

$ unzip elastic-mapreduce-ruby.zip

Configuring Credentials

Your credentials are used to calculate the signature value for every request you make. An AWS key pair is a security credential, similar to a password, which you use to securely connect to your instance when it is running.

To create your credentials file:

1. Create a file named credentials.json in the elastic-mapreduce-cli/elastic-mapreduce-ruby directory.
2. Add the following lines to your credentials file:
{
  "access_id": "[Your AWS Access Key ID]",
  "private_key": "[Your AWS Secret Access Key]",
  "keypair": "[Your key pair name]",
  "key-pair-file": "[The path and name of your PEM file]",
  "log_uri": "[A path to a bucket you own on Amazon S3, such as s3n://myloguri/]",
  "region": "[The Region of your job flow: us-east-1, us-west-2, us-west-1, eu-west-1, ap-northeast-1, ap-southeast-1, or sa-east-1]"
}

Note the name of the Region. You will use this Region to create your Amazon EC2 key pair and your
Amazon S3 bucket.

Process -4 Prepare Hive tables for data analysis. Create a landing table to load the log data.

We create a schema that tokenizes the string, using MAP and COLLECTION ITEMS delimiters to build a key-value array.

-- columns and delimiters below are illustrative; the staging query later refers to C_0
create table logdata (
  C_0 string,
  request_map map<string, string>
)
row format delimited
fields terminated by ' '
collection items terminated by '&'
map keys terminated by '='
stored as textfile;

Process -6 Load Hive landing table with log file data from HDFS, for example (the path is a placeholder):

load data inpath '/path/to/logfile' into table logdata;

Process -7 Load Hive stage table from landing table.

This table holds the data from landing, cleansed of junk characters (the log contains some # comment lines, which we remove while loading into staging).


create table logdata_stg

comment 'log data' stored as sequencefile as

select * from logdata where C_0 not like '%#%';

Process -8 Load Hive final table from staging table.


This process creates a flattened structure of the complete log file in the final table, which is used throughout the analysis. The final table is created with the actual column names identified in the log file, and the load uses UDFs to parse the query string, host name, and category tree in the browse data.

create table logdata_fnl

comment 'log data' stored as sequencefile as

Read my previous post, "Agility in Hive", which goes into detail on how Hive UDFs helped run this analysis efficiently using MAP and ASSOCIATIVE ARRAY.


Posted in Big Data, Hive | Tagged: , , | Leave a Comment »

MR => Techniques Of Map To Reduce Efficiently

Posted by datumengineering on October 27, 2012

The concept of Map Reduce over HDFS is fairly simple, with the aim of reducing calculation complexity across the network. The reducer has to run calculations over the data, so as an MR programmer it is your responsibility to structure the calculation so that the reducer has very little work to do across the nodes, and hence fewer keys to reduce over the network.

class Mapper
  method Map(key a; val v)
    for all term t IN val v do
      Emit(term t; count 1)

class Reducer
  method Reduce(term t; counts [c1; c2; ...])
    sum <- 0
    for all count c IN counts [c1; c2; ...] do
      sum <- sum + c
    Emit(term t; count sum)
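As a rough sketch, the same word-count algorithm can be simulated in plain Python (no Hadoop API; the shuffle between map and reduce is modeled with a dict):

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (term, 1) for every term, like the Mapper above."""
    for doc in documents:
        for term in doc.split():
            yield term, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts per term, like the Reducer above."""
    return {term: sum(counts) for term, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
# counts == {"a": 2, "b": 2, "c": 1}
```

The shuffle is exactly the step we want to shrink: every (term, 1) pair it groups is a pair that crossed the network.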

Let's look at some practices for performing part of the reducer's work locally:

Combiner: The mapper acts more like an assignee that simply breaks the data locally and assigns a key and value. This splitting is the first step of MR programming, and it happens locally on every node. Once the data is split, there are still opportunities to aggregate it locally before aggregating over the network. If these splits were handed to the reducer directly, the reducer would have a lot of work to do over the network. To overcome this, create reducer-like functionality at each individual node. The combiner provides exactly this: it acts locally as a reducer.

class Mapper
  method Map(string t; integer r)
    Emit(string t; pair (r; 1))

class Combiner
  method Combine(string t; pairs [(s1; c1); (s2; c2); ...])
    sum <- 0
    cnt <- 0
    for all pair (s; c) IN pairs [(s1; c1); (s2; c2); ...] do
      sum <- sum + s
      cnt <- cnt + c
    Emit(string t; pair (sum; cnt))

class Reducer
  method Reduce(string t; pairs [(s1; c1); (s2; c2); ...])
    sum <- 0
    cnt <- 0
    for all pair (s; c) IN pairs [(s1; c1); (s2; c2); ...] do
      sum <- sum + s
      cnt <- cnt + c
    r(avg) <- sum / cnt
    Emit(string t; integer r(avg))
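A small Python sketch of why the combiner emits (sum, count) pairs rather than local averages: averages of averages are wrong when the groups have different sizes, while partial sums and counts stay correct.

```python
def combine(values):
    """Local pre-aggregation: emit one (sum, count) pair per node."""
    return (sum(values), len(values))

def reduce_avg(pairs):
    """Final average computed from the partial (sum, count) pairs."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count

node_a = [1, 2, 3]        # values seen locally on one node
node_b = [10]             # values seen locally on another node
pairs = [combine(node_a), combine(node_b)]
avg = reduce_avg(pairs)   # (6 + 10) / (3 + 1) = 4.0
# Naively averaging the local averages would give (2.0 + 10.0) / 2 = 6.0, which is wrong.
```

This is the "semantics-preserving" property the framework demands: running the combiner zero, one, or many times must not change the reducer's final answer.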

In-Mapper Combiner: In the typical scenario, key-value pairs are emitted into local memory as shown above. With the associative-array approach, however, the mapper itself can combine all the keys in an associative array and build the pairs, so the MAP can emit pairs rather than individual splits.

class Mapper
  method Initialize
    S <- new AssociativeArray
    C <- new AssociativeArray
  method Map(string t; integer r)
    S{t} <- S{t} + r
    C{t} <- C{t} + 1
  method Close
    for all term t IN S do
      Emit(term t; pair (S{t}; C{t}))

Before deciding on either method for aggregating data locally, mind the points below:

  1. The Hadoop framework considers the combiner on a case-by-case basis, so the combiner will not always run. I don't know the exact reason why Hadoop behaves like this. A famous textbook says: "The combiner is provided as a semantics-preserving optimization to the execution framework, which has the option of using it, perhaps multiple times, or not at all (or even in the reduce phase)."
  2. Whatever your mapper or combiner EMITs, whether individual splits or pairs, must match the reducer's input definition. That is, if the mapper or combiner emits its result as an associative array (pair), then your reducer must take that associative array (pair) as the input it processes.
  3. Just as the combiner carries the risk of not being used by Hadoop, the in-mapper combiner creates a memory bottleneck. There is additional overhead in the in-mapper combiner for memory management, and a counter is required to keep track of memory. So there is a fundamental scalability limit associated with the in-mapper combining pattern: it critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split.
  4. One common solution for limiting memory usage with the in-mapper combining technique is to "block" input key-value pairs and "flush" the in-memory data structures periodically. With this, memory management becomes a manageable challenge for the in-mapper combiner.
  5. The in-mapper combiner creates more complex keys and values, termed pairs and stripes.
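Point 4's "block and flush" idea can be sketched in Python: cap the in-memory associative array and emit partial aggregates whenever it grows past the cap (the threshold and data here are illustrative).

```python
def in_mapper_combine(records, max_keys=2):
    """In-mapper combining with bounded memory: accumulate (sum, count)
    per key in a dict, and flush partial aggregates whenever the dict
    exceeds max_keys. The final flush plays the role of Close()."""
    buffer = {}
    for key, value in records:
        s, c = buffer.get(key, (0, 0))
        buffer[key] = (s + value, c + 1)
        if len(buffer) > max_keys:
            yield from buffer.items()   # flush partial aggregates
            buffer.clear()
    yield from buffer.items()           # final flush at close

out = list(in_mapper_combine([("a", 1), ("b", 2), ("a", 3), ("c", 4)]))
# Fewer pairs than the four raw emits; a downstream reducer that sums
# the (sum, count) components still gets the exact totals.
```

The trade-off is visible: a smaller max_keys bounds memory but flushes more often, sending more pairs over the network.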

Some further techniques, such as key partitioning, also play a major role in achieving performance.

Posted in Hadoop, Map Reduce | Tagged: | Leave a Comment »

Agility in Hive — Map & Array score for Hive

Posted by datumengineering on September 27, 2012

There is ongoing debate comparing PIG and Hive; there is a good post from @Larsgeorge which talks about PIG v/s Hive.

I am not expert enough to go into the details of that comparison, but here I want to explore some of the Hive features which give Hive an edge.

These features are MAP (associative array) and ARRAY. MAP gives you an alternative way to segregate your data into keys and values. So, if you have data like this:

clientid='xxxx234xx', category='electronics', timetaken='20/01/2000 10:20:20'

then you can break it down into keys and values: clientid, category, and timetaken are the keys, and the values are xxxx234xx, electronics, and 20/01/2000 10:20:20. How about not only converting them into key and value, but also storing and retrieving them that way from a column? When you define a MAP column, Hive stores the complete MAP in a single column, like:


{"clientid"="xxxx234xx", "category"="electronics", "timetaken"="20/01/2000 10:20:20"}

To store data like this, you need to define the table like this (the delimiters here are illustrative):

create table table1 (
  col1 map<string, string>
)
row format delimited
collection items terminated by ','
map keys terminated by '=';

Now, retrieval is pretty easy: you just write in your HiveQL, select col1["category"] from table1, and you'll get electronics. Had MAP not been there, I would have ended up writing a complete parsing program to store such a custom format in a table.
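As an aside, here is a rough Python sketch of the sort of parsing program the MAP column saves you from writing (the field names are taken from the example above; the quoting rules are assumptions):

```python
def parse_record(line):
    """Split "k='v', k='v'" fields into a dict, stripping the quotes:
    roughly what Hive's MAP column gives you for free at read time."""
    pairs = {}
    for field in line.split(","):
        key, _, value = field.partition("=")
        pairs[key.strip()] = value.strip().strip("'\"")
    return pairs

record = parse_record("clientid='xxxx234xx', category='electronics'")
# record["category"] == "electronics"
```

And this toy version already breaks on values containing commas; the real program keeps growing, which is exactly the point.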

Similarly, ARRAY can be used to store a collection in a column. So you can have data like:


Now, you want to parse the complete level in the second column. It is easy in Hive to store it as an ARRAY. The definition would be (the delimiters here are illustrative):

create table table1 (
  col_1 array<string>
)
row format delimited
collection items terminated by ',';

Now data retrieval is obvious: query the table with the collection index for the level you want to reach:

Select Col_1[1] from table1;

You may also have scenarios with a COLLECTION of MAPs. There you need to use both MAP and ARRAY together in the same table definition, along with the required delimiters for ARRAY and MAP.

So your table's delimiter definition should look like this (the delimiters here are illustrative):

row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by '=';
Posted in Big Data, Hive | Tagged: , , | 1 Comment »

PIG, a generation's language: Simple Hadoop-PIG configuration, Register UDF.

Posted by datumengineering on June 26, 2012

I would consider PIG a step toward a 4th-generation language. PIG emerged as an ideal language for programmers: a data flow language in the Hadoop ecosystem. It has become a gap filler in the Big Data analytics world between two audiences, the ETL developer and the Java/Python programmer. PIG has some very powerful features which give it an edge for generations:

  • Bringing a schema-less approach to unstructured data.
  • Bringing programming support to data flow.

PIG brings ETL capabilities to big data without requiring a schema to be defined in advance. That is an indispensable quality. All these features together give the power of analytics on the cloud, backed by HDFS processing capabilities and the MR programming model. Here we'll see, in simple steps, how we can use PIG as a data flow for analysis. Obviously, you should have PIG installed on your cluster.

PIG uses the Hadoop configuration for data flow processing. Below are the Hadoop configuration steps; I prefer to do this in /etc/bash.bashrc:

  • Point PIG to JAVA_HOME.
  • Set PIG_HOME to the core PIG script.
  • Set PIG_DIR to the bin directory.
  • Set PIG_CONF_DIR to the Hadoop configuration directory.
  • Finally, set PIG_CLASSPATH and add it to the CLASSPATH.

Here is code for the five steps above (all paths are placeholders; adjust them to your installation):

export JAVA_HOME=/path/to/java
export HADOOP_DIR=/path/to/hadoop
export PIG_HOME=/path/to/pig
export PIG_DIR=$PIG_HOME/bin
export PATH=$PATH:$JAVA_HOME/bin:$PIG_DIR

export CLASSPATH=$CLASSPATH:$HADOOP_DIR/hadoop-core.jar:$HADOOP_DIR/hadoop-tools.jar:$HADOOP_DIR/hadoop-ant.jar:$HADOOP_DIR/lib/commons-logging-1.0.4.jar:$PIG_DIR/pig-core.jar

export PIG_CONF_DIR=/usr/lib/hadoop/conf

Now, if you have written a UDF, first register it and then define the function.

REGISTER /path/to/input/<jarfile>.jar;

define <function> org.apache.pig.piggybank.storage.<functionname>();

Now you have the UDF available to use throughout your script.

A = load '/path/to/inputfile' using org.apache.pig.piggybank.storage.<functionname>() as (variable:<datatype>);

Life becomes easy once we have UDFs available to use. You just need a basic understanding of SQL functionality to perform data flow operations in the PIG scripting language.

The next write-up in the PIG series will be on ANALYTICS: how PIG creates touch points for data flow as well as analytics.

Posted in Big Data, Hadoop, PIG | Tagged: , , | Leave a Comment »

BiG DaTa & Vectorization

Posted by datumengineering on May 14, 2012

It has been a while since Big Data entered the market and created a buzz in the analytics world. Nowadays all analytics leaders are chanting about Big Data applications. Since I started with Hadoop technologies and machine learning, one question has been bugging me:

Which is the greater innovation: Big Data, or Machine Learning & Vectorization?

When it comes to analytics, vectorization and machine learning are the more innovative. Wait a minute: I don't want to be biased, and I am not drawing a conclusion here. But I would like to look more closely at what happens when we take data out into the analytics world. We have structured data, we have enterprise data, we have data which is still measurable and suffices for analytical and advanced analytical needs. But how many business analytics teams use it smartly to make predictions? How many have applied different statistical algorithms to benefit from this data? How many times has the available data been utilized to its potential? I would guess in only 20% of cases.

When we are still not fully utilizing structured, measurable data, why are we chasing the unstructured, monster data? In fact, this big data needs more work than enterprise data. I don't advocate reaching saturation first and only then thinking of innovation or out-of-the-box work, no. My emphasis is on the best utilization of existing enterprise data, while keeping innovation alive by experimenting with the possible options for exploring data that is unexplored or unfeasible through conservative technologies. Innovation doesn't mean just thinking up and doing new things. Innovation is more meaningful when you do something meaningful for the world which other people acknowledge but call "not feasible".

I am not in favor of anyone here. I come from a world where I see data processing challenges, data storage challenges, data aggregation challenges, and plenty of challenges in sorting and searching. That is where I would look at Hadoop-related technologies. The query processing power Hive provides, and the data storage and manipulation power HBase provides, are indeed beyond the other RDBMSs. Their power of MAP REDUCE is exemplary.

But all these Big Data technologies should enter an enterprise only once it is mature in the analytics world, having fully utilized its enterprise data at length. If Hadoop itself claims "I am not a replacement for your current enterprise data warehouse", then why shouldn't you first fully grind the existing EDW data, and only then look at Hadoop opportunities to give your enterprise competency an edge?

Posted in Big Data | Tagged: , | 2 Comments »

HDFS & Cloudera Hadoop – What, How and Where?

Posted by datumengineering on April 3, 2012

It is always good to understand what is going on in the background. My first practice for understanding any database is to go into the details of its architecture, because every database has its own unique flavor of benefits. With Hadoop things become interesting, and challenging, when the detailed architecture is focused around the file system. Hadoop made me do this exercise…

What, How, Where ?

I have been trying my best to study this giant in detail, to understand how things move underneath. A lot of questions about the internal architecture and flow had been blinking in my mind. To conclude this study, I set up Cloudera Distribution Hadoop on my Ubuntu machine in a pseudo-distributed environment on a single node. During this configuration and installation I started analyzing and sketching the diagram.

My HDFS directory is hdfs://localhost:8020/tmp and the related local file system is /app/hadoop/tmp. I preferred the Cloudera distribution over Apache's directly. This is a first step; eventually I am trying to expand the diagram toward Hive, HBase, and ZooKeeper.

I look at a complete Hadoop job as a function of three main components: configuration, services, and metastore.

In reverse order of the list above, the first and foremost player is the metastore: the NameNode and Secondary NameNode.

The second players are the JobTracker and TaskTrackers, who actually do the work.

And finally, the configuration, which tells the first two players how and where to run.

Disclaimer: the diagram is completely my own understanding. Your catches and corrections are welcome.

Posted in Hadoop | Tagged: , , | Leave a Comment »

Provision of small file processing in HDFS

Posted by datumengineering on March 7, 2012

Hadoop is meant primarily for mammoth file processing. Though that is the ideal condition, Hadoop does have a provision for processing small files: it introduced a big container to hold small files for further processing. These big containers are intended for processing small-file data in the Map Reduce model. In HDFS such a container is termed a sequence file.

A sequence file holds each small file as a whole record. As the Map Reduce model expects, it stores data in {key, value} pairs: the file name of the smaller file can be the key, and the content of the file becomes the value. Once stored in a sequence file, the files can be read and written back to HDFS. Writing data to a sequence file is a matter of writing key and value pairs, and depends on the kind of serialization you use. Reading is similar to processing a collection: you call a next() method which accepts a key and a value, and reads the next key and value from the stream into those variables. It proceeds until it reaches EOF, when next() returns false. Again, you need to look into the kind of serialization you are using here. This unique feature of HDFS gives an opportunity to process millions of small files together as one sequence file.
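The read pattern described above can be sketched in Python; this is a toy reader over (filename, content) records, not the real Hadoop SequenceFile.Reader API:

```python
class SequenceReader:
    """Toy reader mirroring the next(key, value) loop described above:
    next() fills key/value and returns False at EOF. Not the Hadoop API."""
    def __init__(self, records):
        self._records = iter(records)
        self.key = None
        self.value = None

    def next(self):
        try:
            self.key, self.value = next(self._records)
            return True
        except StopIteration:
            return False

reader = SequenceReader([("a.log", "GET /"), ("b.log", "POST /")])
names = []
while reader.next():          # loop until EOF, as in the text
    names.append(reader.key)
# names == ["a.log", "b.log"]
```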

The structure of a sequence file is pretty simple. It has a header, which holds metadata and compression details for the stored files, and records. A record contains a whole file along with the key length, key name, and value (i.e. the file content). The internal format of the records depends on record vs. block compression: record compression just compresses the file content (the value), whereas block compression compresses a number of records together. Hence block compression is more effective and preferred.

Another form of sequence file is the Map file. A Map file is a sorted sequence file: sorted on the key, with an index to perform lookups on the key. This helps the Map Reduce model improve on the performance of a plain sequence file.
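The Map file idea, sorted keys plus an index for lookup, can be sketched with the standard bisect module (illustrative, not the Hadoop MapFile format):

```python
import bisect

class MapFile:
    """Toy Map file: records kept sorted by key, binary-searched on get."""
    def __init__(self, records):
        self._records = sorted(records)           # Map files are key-sorted
        self._keys = [k for k, _ in self._records]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)   # index lookup, O(log n)
        if i < len(self._keys) and self._keys[i] == key:
            return self._records[i][1]
        return None

mf = MapFile([("b.log", 2), ("a.log", 1), ("c.log", 3)])
# mf.get("b.log") == 2
```

A plain sequence file would have to be scanned linearly for the same lookup; the sorted order plus index is the whole performance win.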

With this framework of sequence files and Map files, Hadoop has opened the feasibility of processing millions of small files together. So should we say that HDFS is not just about handling Big Data files, but is also capable of processing small files, efficiently, within the Map Reduce model?

Any thought or use case you can suggest here?

Posted in Hadoop, Map Reduce | Tagged: , , | Leave a Comment »

Technology Shift: In Multi-Channel and Agile Commerce

Posted by datumengineering on February 21, 2012

It needs a technology shift:

When you move to Multi-Channel and eventually progress towards Agile Commerce

Today marketing and advertising have opened many channels beyond the traditional ones, including telemarketing, print catalogues, kiosks, and e-commerce. However, the contribution of e-commerce is huge in comparison to the other channels. It opens numerous feeds to the organization, like:

Campaign data (email, A/B test analysis)

Browse behavior on e-commerce site (Web logs)

Mobile BI

Call Centre voice data converted in textual

Data from search engines.

Social Media like Facebook, Twitter.

Customer experience, recommendation.

Multichannel purchase history etc.

Most importantly, customer sentiment analysis will give an edge to marketing strategy.

So, effective and efficient utilization of multi-channel requires that we tear down the walls we have been building between the different channels over the years.

Data from Multi-channel: When the channels are cosmic, the data produced from these channels is utterly unstructured and gigantic. Web logs, consumer web interactions, and social media messages are a few examples of highly unstructured data.

Business still needs to analyze this data: though it is unstructured, it has proven to be more meaningful for trend analysis, consumer preference, and sentiment than direct store data.

Its titanic and unstructured nature makes it unfeasible for analysis: due to the variety, volume, and velocity of the data, the task of transforming it into a relational database becomes vulnerable and analysis nearly impossible. So we need to add a flavor of Big Data analytics to the traditional DW.


Strategy to Big Data Analytics

We neither need to cleanse this data nor bring it into a relational database. We do not even need to wait until it has been processed through a series of transformations, because by then the information would lose its real-time flavor. We can still store it quickly as it arrives, and access it easily, using:

  • HDFS (Hadoop Distributed File System) & the Map Reduce framework.
  • Real-time, frequently updated unstructured data storage in HBASE.
  • Quicker access using the HIVE data warehouse on Hadoop.

And all this processing can be established and distributed on the company's private CLOUD. Predominantly, this unstructured data analysis would be an excellent support to the existing data warehouse.

HDFS (Hadoop Distributed File System):

Hadoop is designed for distributed parallel processing. In the Hadoop programming pattern, data is record-oriented and the files are spread across the distributed file system in chunks; each compute process running on a node works on a subset of the data. So instead of moving the whole data across the network for processing, Hadoop actually moves the computation to the data. Hadoop uses the Map Reduce programming model to process and generate large datasets. A program written in this functional style automatically runs in parallel on a large cluster of commodity machines. The Map and Reduce functions take care of the data processing; the power of parallelism and fault tolerance is under the covers, in libraries that need to be included when writing Map and Reduce programs. When running large Map Reduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.


The Hadoop file system with Map Reduce functionality forms an underlying foundation for processing huge amounts of data. But data management and analytics need more than an underlying file storage mechanism: they require data to be organized in tables so it can be accessed easily, without writing complex lines of code. Moreover, we are all more comfortable with database operations than file operations. So an abstraction layer is required to simplify scattered Big Data. HBase and Hive are the answer.

HBASE: HBASE is aimed at holding billions of rows and millions of columns in a big-table-like structure. HBASE is column-oriented distributed storage built on top of Hadoop. It is a NoSQL database where data is stored in the form of a hash table, so the data is sparse, distributed, and kept in a sorted map. Tables are sorted by row, and are made of rows and columns. All columns in HBase belong to a particular column family. The data model for HBase is a little different from a conventional RDBMS.

So, HBASE is:

1. Column-oriented distributed storage built on top of Hadoop, layered over HDFS.
2. A NoSQL database where data is stored in the form of a hash table.
3. Distributed across many servers and tolerant of machine failure.
4. A store where all columns belong to a particular column family.

It is not:
  • A relational database.
  • A database with support for joins.

HIVE: A system for managing and querying structured data, built on top of Hadoop. The aim of HIVE is to give a familiar SQL-like interface to data stored in the Hadoop framework. Underneath, it still uses the Map Reduce programming model for extracting data from HDFS. It also gives ETL flexibility on the data stored in Hive tables, and all the metadata can be stored in an RDBMS.

JDBC/ODBC drivers allow third-party applications to pull Hive data for reporting.

HBase and Hive each have their own flavor of benefits.

Hive is more interactive in terms of SQL queries and metadata in an RDBMS, but its use is limited to read-intensive data: updating data in Hive is a costly operation, because here an update means creating another copy of the existing data.

HBase can step in here and provide high-rate row-level updates. It sidesteps Hadoop's append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on changes in the data distribution.
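That mechanism, recent updates held in memory and periodically rewritten to new immutable files, can be shown in miniature (an illustrative sketch of the pattern, not HBase internals; the flush threshold is an assumption):

```python
class TinyStore:
    """Keep fresh updates in a memtable; flush them to immutable sorted
    'files' when it grows; reads check memory first, then the newest
    file. This is the pattern that sidesteps append-only storage."""
    def __init__(self, flush_at=2):
        self.memtable = {}
        self.files = []              # newest flushed segment first
        self.flush_at = flush_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self.files.insert(0, dict(self.memtable))  # immutable segment
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:     # freshest data wins
            return self.memtable[key]
        for segment in self.files:   # then newest flushed segment
            if key in segment:
                return segment[key]
        return None

store = TinyStore()
store.put("row1", "v1")
store.put("row2", "v1")   # triggers a flush to an immutable segment
store.put("row1", "v2")   # in-memory update shadows the flushed value
# store.get("row1") == "v2"
```

Nothing is ever overwritten in place: an "update" is just a newer copy that shadows the old one, and a background merge can later discard the stale copies.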

Marrying HBase with Hive can spell out a near-real-time data warehouse on the Hadoop ecosystem, with the simplicity of a SQL-like interface on Hive tables and a near-time replica kept in HBase tables.


Important Note:

  • All structured data can continue to be analyzed in the Enterprise Data Warehouse; Hadoop will play on unstructured data.
  • Hadoop can be used as a "Data Bag" for the EDW.
  • Push-down aggregation: all intensive, voluminous aggregation can be pushed to Hadoop.
  • Push-down ETL: all ETL complexity on Big Data can be implemented in Hadoop (Hive).

Value Add and ROI from Hadoop Data Analytics


HDFS, Hive, and HBase are open source; the investment is required more in skills than in tools and technologies. Nonetheless, this has been the biggest challenge in Hadoop-related development. A company has to define a strategy to invest in skills and in continuous investigation of the Hadoop platform.

$: Indeed, x% of budget allocation for Hadoop development, and some additional investment may be required for plug-ins from Hive/HBase to Teradata for third parties.

Posted in Big Data | Tagged: | Leave a Comment »

Hadoop Configuration Simplified : Master Slave Architecture

Posted by datumengineering on February 11, 2012

Master-Slave: The Hadoop configuration has simple components which can be divided into master and slave components. Master: NameNode, Secondary NameNode, and JobTracker. Slave: DataNode and TaskTracker.

Following are the key points of configuration:

1. The NameNode is always local during Hadoop configuration in a cluster environment.

2. Entries for master and slaves go in the masters and slaves files respectively. Master translates to the NameNode host and JobTracker.

3. All configuration can be categorized into 4 aspects:

– Environment variable configuration
– Hadoop-HDFS configuration.
– Master-Slave configuration.
– Map Reduce configuration.

I) Environment configuration: set up all the environment variables needed to run the scripts in the HDFS environment.

{Files: hadoop-env.sh}

II) Hadoop-HDFS configuration: this includes host entries for the NameNode, Secondary NameNode, JobTracker, and TaskTrackers.

{Files: core-site.xml, hdfs-site.xml, mapred-site.xml}

III) Text file configuration: there are 2 text entries in the Hadoop configuration for master and slaves. One is for the master node, i.e. the host entry for the master node; the other is for the slaves, i.e. host entries for all slave machines where the DataNode and TaskTracker will run.

{Files: masters, slaves}

IV) Hadoop metrics and job log configuration: the Map Reduce configuration settings capture all the metrics and logs related to the Map Reduce program.

{Files: hadoop-metrics.properties, log4j.properties}

This, briefly, is how the different components function together:

The NameNode and JobTracker start on the local machine; a Secondary NameNode starts on each machine listed in the masters file; and a TaskTracker and DataNode eventually start on each machine listed in the slaves file.
In a cluster environment the NameNode, Secondary NameNode, and JobTracker run on a single machine as the master node. However, in a large cluster they can be separated. When the NameNode and JobTracker are on separate nodes, their slaves files should be kept in sync.

Posted in Hadoop, HDFS | Tagged: , | Leave a Comment »

Hadoop Recommendation

Posted by datumengineering on February 9, 2012

  • All structured data can continue to be analyzed in the Enterprise Data Warehouse; Hadoop will play on unstructured data.
  • Hadoop can be used as a "Data Bag" for the EDW.
  • Push-down aggregation: all intensive, voluminous aggregation can be pushed to Hadoop.
  • Push-down ETL: all ETL complexity on Big Data can be implemented in Hadoop (Hive).

Posted in Big Data | Tagged: | Leave a Comment »