Datum Engineering !

An engineered artwork to make decisions..

Posts Tagged ‘Hadoop’

Big Data? How do you run capacity planning?

Posted by datumengineering on February 15, 2013

Most of Datawarehouse folks are very much accustomed with the term “Capacity Planning”, Read Inmon. This is widely used process for DBA’s and Datawarehouse Architects. In an typical project of data management and warehouse wide variety of audience is involved to drive the capacity planning. It involves everyone from Business Analyst to Architect to Developer to DBA and finally Data Modeler.

This practice which has had wide audience in typical Datawarehouse world,  how this has been driven in Big Data? I have hardly heard noise around this in any Hadoop driven project which had started with an intention to handle growing data.  I have met pain bearers DBA/Architects who have been facing challenges at all stages of data management when data outgrows. They are the main players who advocates bringing Hadoop ASAP.  Crux of their problem is not growing data. But the problem is, they didn’t have mathematical calculation which advocate the growth rate. All we talk about is: How much percentage it is going? Most of the time that percentage also come from experience 🙂

Capacity planning should be explore more than just calculating the percentage and experience.

  1. It should be more mathematical calculation of every byte of the data sources coming into the system.
  2. How about designing a predictive model which will confirm my data growth with an accuracy until 10 years?
  3. How about involving business to confirm the data growth drivers and feasibility of future born data sources ?
  4. Why don’t consider compression factor and purging into the calculation to reclaim the space for data grow.
  5. Why we consider only disk utilization and why there is no consideration about other hardware resources like memory, processor, cache? After all, it is all about data processing environment.
  • I think this list of consideration can still grow….

I know building robust Capacity planning is not a task of day or month. One to two year of time frame data is good enough to understand this trend and develop a algorithm around it.  Consider 1-2 years as a learning data set and take some months of data as training data set and start analyzing the trend, start building the model which can predict the growth after 3rd or 4th year. Because as per Datawarehouse gurus bleeding starts after 5th year age.

I’ll leave up to you to design the solution and process for capacity capacity to claim your DATA as BIG DATA.

Remember, disk space is cheap but not the disk seek.

Posted in Big Data, Hadoop | Tagged: , | 2 Comments »

Data flow: Web log analysis on a Hive-way

Posted by datumengineering on February 8, 2013

 Data flow design to get an insight of user behavior on web site. Data flow explains the method of flattening up all elements in web log which can support detail user analysis and behavior.

Technology & Skills: Hadoop-Hive, HiveQL (+ Rich set of UDF in HiveQL) .

Infrastructure: Amazon Web Services (AWS).

Process -1 Moving data from Web Server to Amazon Simple Storage Services (S3) to HDFS.


Process -2 Start EC2 instance type : small to run Map Reduce job to parse log file.

To run jobs on AWS we should have EBS and EC2 both instance running.

Process -3 Prepare for Elastic Map Reduce to run the jobs from command line.

To run the EMR from command line we use an Amazon EMR credentials file to simplify job flow creation and authentication of requests. The credentials file provides information required for many commands. The credentials file is a convenient place to store command parameters so you don’t have to repeatedly enter the information. The Amazon EMR CLI automatically looks for these credentials in the file credentials.json.

To install the Elastic MapReduce CLI1. Navigate to your elastic-mapreduce-cli directory.
2. Unzip the compressed file: Linux and UNIX users, from the command-line prompt, enter the following:$ unzip elastic-mapreduce-ruby.zipConfiguring Credentials
The Elastic MapReduce credentials file can provide information required for many commands. It is
convenient to store command parameters in the file to save you from the trouble of repeatedly entering the information. Your credentials are used to calculate the signature value for every request you make. Elastic MapReduce automatically looks for your credentials in the file credentials.json. It is convenient to edit the credentials.json file and include your AWS credentials. An AWS key pair is a security credential
similar to a password, which you use to securely connect to your instance when it is running.To create your credentials file:1. Create a file named credentials.json in the elastic-mapreduce-cli/elastic-mapreduce-ruby directory.2. Add the following lines to your credentials file:
“access_id”: “[Your AWS Access Key ID]”,
“private_key”: “[Your AWS Secret Access Key]”,
“keypair”: “[Your key pair name]”,
“key-pair-file”: “[The path and name of your PEM file]”,
“log_uri”: “[A path to a bucket you own on Amazon S3, such as, s3n://myloguri/]”,
“region”: “[The Region of your job flow, either us-east-1, us-west-2, uswest-1, eu-west-1, ap-northeast-1, ap-southeast-1, or sa-east-1]”
}Note the name of the Region. You will use this Region to create your Amazon EC2 key pair and your
Amazon S3 bucket.

Process -4 Prepare Hive table for data analysis. Create landing table to load log data.

We create schema for tokenizing the string. So MAP and COLLECTION is used to build key-value array.

CREATE TABLE logdata (






Process -6 Load Hive landing table with log file data from HDFS.



Process -7 Load Hive stage table from landing table.

This stage table will have the data from landing. Stage table is used to load cleansed data without any junk character (Log has some # characters which we remove when load into staging).


create table logdata_stg

comment ‘log data’ stored as sequencefile as

select * from logdata where C_0 not like ‘%#%’;

Process -8 Load Hive final table from staging table.


This process will create flatten structure of complete log file into final table. This table will be used in all over the analysis. This table is created with actual column names identified in the log file. Final table load happen using UDF to parse query string, host name and category tree in browse data.

create table logdata_fnl

comment ‘log data’ stored as sequencefile as

Read my previous post on Hive – Agility to go in detail of how Hive UDF’s helped to run this analysis efficiently using MAP and ASSOCIATIVE ARRAY

Posted in Big Data, Hive | Tagged: , , | Leave a Comment »

Agility in Hive — Map & Array score for Hive

Posted by datumengineering on September 27, 2012

There are debate and comparison between PIG and Hive. There are good post from @Larsgeorge which talks about PIG v/s Hive.

I am not an expert to go in details of comparison but here I want to explore some of the Hive features which gives Hive an edge.

These feature are MAP (Associative Array) and ARRAY. MAP can give you an alternative way to segregate your data  around KEY and VALUE way.  So, if you have data something like this

clientid=’xxxx234xx’, category=’electronics’,timetaken=’20/01/2000 10:20:20′.

Then, you can really break it down in to key and value. Where, clientid, category and timetaken are keys and values are: xxxx234xx,electronics,20/01/2000 10:20:20.  How about not only converting them into key and value.  But storing and retrieving them as well  into a column. So, When you define the MAP it does store the complete MAP into a single column, like;


{“clientid”=”xxxx234xx”, “category”=”electronics”,”timetaken”=”20/01/2000 10:20:20″}

To store like this you need to define the table like this:

Create table table1





Now, retrieval is pretty easy : you just need to say in your HiveQL: Select COL1.[“category”] from table. You’ll get electronics. Had it been MAP is not there i would have end up writing a complete parsing program for storing such custom format in table.

Similarly, Array can be use to store collection into a column. So you can have data like:


Now, you want to parse the complete level in the second column. It would be easy in Hive to store it as in ARRAY. Definition would be:

Create table table1






Now data retrieving is obvious, query the table with collection index or level you want to go

Select Col_1[1] from table1;

You may also have scenarios when you have COLLECTION of MAPS. There you need to use both MAP and ARRAY together in same table definition along with required delimiter for ARRAY and MAP.

So your table’s delimiter definition should look like this:


Posted in Big Data, Hive | Tagged: , , | 1 Comment »

PIG, generation’s langauge: Simple Hadoop-PIG configuration, Register UDF.

Posted by datumengineering on June 26, 2012

I would consider PIG a step further to 4th generation. PIG emerged as an ideal language for programmers. PIG is a data flow language in Hadoop echo system. Now, It became a gap filler in BIG data analytics world between 2 audiences ETL developer & Java/Python Programmer. PIG has some very powerful feature which gives it an edge for generations:

  • Bringing schema less approach to unstructured data.
  • Bringing programming support to data flow.

PIG brings ETL capabilities to big data without having schema to be defined in advance. That’s an indispensable qualities. All these features together gives a power of analytics on Cloud with back of HDFS processing capabilities and MR programming model. Here, we’ll see in simple steps how can we use PIG as a data flow for analysis. Obviously, you should have PIG installed on your cluster.

PIG uses Hadoop configuration for data flow processing. Below are the steps of Hadoop configuration. I would prefer to do it in /etc/bash.bashrc

  • Point PIG to the JAVA_HOME.
  • set PIG_HOME to the core PIG script.
  • set PIG_DIR to the bin- dir.
  • set PIG_CONF_DIR to hadoop configuration directory.
  • finally set the PIG_CLASSPATH add it to the CLASSPATH.

Here is the exact code for above 5 steps:

export JAVA_HOME
export PATH

export PATH

export PIG_HOME
export PATH

export PIG_DIR
export PATH

export PATH

export PATH

export CLASSPATH=$CLASSPATH:$HADOOP_DIR/hadoop-core.jar:$HADOOP_DIR/hadoop-tools.jar:$HADOOP_DIR/hadoop-ant.jar:$HADOOP_DIR/lib/commons-logging-1.0.4.jar:$PIG_DIR/pig-core.jar

export PIG_CONF_DIR=/usr/lib/hadoop/conf

Now if you have written UDF then first register it and then define the function.

REGISTER /path/to/input/<jarfile>.jar;

define <function> org.apache.pig.piggybank.storage.<functionname>();

Now you’ve UDF available to use throughout your script.

A = load ‘/path/to/inputfile’ using org.apache.pig.piggybank.storage.<functionname>() as (variable:<datatype>);

Life becomes easy once we have UDF available to use. You just need to have basic understanding on SQL functionality to perform data flow operations in PIG scripting language.

Next write up in continue with PIG will be on: ANALYTICS: HOW PIG creates touch points for data flow as well as analytics.

Posted in Big Data, Hadoop, PIG | Tagged: , , | Leave a Comment »

HDFS & Cloudera Hadoop – What, How and Where?

Posted by datumengineering on April 3, 2012

It is always good to understand “Whats going on in background?”. My first practice to understand any database is to go in detail of DB architecture. Because every database has its own flavor of benefits which is unique. Now it is challenging with Hadoop when things become interesting when I need to go to in detail architecture focussed around the ” file system “. Hadoop made me to do this exercise …..

What, How, Where ?

I have been trying my best to go in detail of this giant to understand “How” the things are moving underneath. There had been lot of questions blinked in mind about internal architecture and flow. Finally to conclude this study, I have set up Cloudera Distribution Hadoop on my Ubuntu in Pseudo-distributed environment on single node. During this configuration and installation on single node i have started analyzing and sketching the diagram.

My HDFS directory is on hdfs://localhostand:8020/tmp and related local file system is on /app/hadoop/tmp. I have preferred cloudera distribution compare to the apache directly. This is a first step, eventually i am trying to expand it towards Hive, HBase, Zookeeper on top of this diagram.

I look at complete Hadoop job as a function of three main component: configuration,  services and metastore.

In reverse order of above list, First and foremost player is metastore: Namenode,  Secondary Namenode.

Second player is: Job tracker and Task trackers, who actually are the players.

And finally, a configuration which tell first 2 players how and where to run.

Disclaimer :- Diagram is completely my own understanding. Your catch and corrections are welcome.

Posted in Hadoop | Tagged: , , | Leave a Comment »

Provision of small file processing in HDFS

Posted by datumengineering on March 7, 2012

Hadoop meant ONLY for mammoth file processing. Though this has been ideal condition but Hadoop do have provision to process small files. Hadoop introduced a Big container to hold small files for further processing. These big containers intended for processing small files data in Map Reduce model. In HDFS these containers are termed as a Sequence file.

These sequence files hold small files as a whole record. However as Map Reduce model expected, it stores data in {Key,Value} pair. File name of the smaller file can be key and the content of the file becomes value. Once the files stored in Sequence file it can be read and write back to HDFS.Writing the data for Sequence file is matter of writing Key and value pair. It depend of the kind of serialization you use. Read process is similar to the collection processing where you define next() method which accept a key and value pair, and reads the next key and value in the stream in the variable. It process until it reaches to EOF and next() method returns false. Again you need to go in detail of kind of serialization you are using here. This unique feature of HDFS given an opportunity to process million small files together as a Sequence file.

Structure of Sequence file is pretty simple. It has header, which hold metadata and compression details for the files stored and the record. Record contains the whole file in it along with the key length, key name and value (i.e. file content/data). The internal format of the records depends on record/block compression. Record compression is just compress the file content (i.e. value), however block compression method compresses number of records. Hence block compression is more meaningful and preferred.

Another form of Sequence file is Map file. Map file is sorted sequence file which is sort on the key with an index to perform lookup on the key. This helps map reduce model to improve the performance of the sequence file.

With this kind of framework of Sequence file & Map file Hadoop has opened feasibility to process millions of small files together. So should we say that HDFS is not just a matter of handling Big data files but it does have capability to process small files too, that also efficiently within Map Reduce model?

Any thought or use case you can suggest here?

Posted in Hadoop, Map Reduce | Tagged: , , | Leave a Comment »

Hadoop Configuration Simplified : Master Slave Architecture

Posted by datumengineering on February 11, 2012

Master-Slave: Hadoop configuration has simple components which can be divided as Master and Slave components. Master: NameNode, Secondary namenode and Job tracker. Slave: data node and task tracker.

Following are the key points of configuration:

1. Namenode always be localize during hadoop configuration in cluster environment.

2. Entries for master and slave are in master and slaves files repectively. Master translate to namenode host and job tracker.

3. All configuration can be categorize in 4 aspects:

– Environment variable configuration
– Hadoop-HDFS configuration.
– Master-Slave configuration.
– Map Reduce configuration.

I) Environment configuration: Set up all environment variables to run the scripts in HDFS environment.

{Files: hdfs-env.sh}

II) Hadoop-HDFS configuration: This includes entries for hosts for namenode, secondary namenode, job trackers and task trackers.

{Files: core-site.xml, hdfs-site.xml, mapred-site.xml}

III) Text file configuration: There are 2 text entries in hadoop configuration for master and slaves. One for master node i.e. Host entry for master node and other for slaves i.e. Host entry for all slaves machin where data node and task tracker will run.

{Files: masters, slaves}

IV) Hadoop metrics and job log configuration: in map reduce configuration setting it captures all the metrics and logs related to map reduce program.

{Files: hadoop-metrics.properties, log4j.properties}

This is how brief function of different components together:

Namenode and job tracker starts at local machine, starts secondary node on each machine listed in the masters file. Eventually start tasktracker and datanode on each machine listed in the masters file.
In a cluster environment namenode, secondary namenode and job tracker run on single machine as a master node. However in a large cluster it can be sparated. When namenode and jobtracker are on separate node their slaves files should be in synch.

Posted in Hadoop, HDFS | Tagged: , | Leave a Comment »