Datum Engineering !

An engineered artwork to make decisions..

Archive for the ‘Hadoop’ Category

Where & Why Do You Keep Big Data & Hadoop?

Posted by datumengineering on December 13, 2015

I am Back ! Yes, I am back (on the track) on my learning track. Sometime, it is really necessary to take a break and introspect why do we learn, before learning.  Ah ! it was 9 months safe refuge to learn how Big Data & Analytics can contribute to Data Product.


Data strategy has always been expected to be revenue generation. As Big data and Hadoop entering into the enterprise data strategy it is also expected from big data infrastructure to be revenue addition. This is really a tough expectation from new entrant (Hadoop) when the established candidate (DataWarehouse & BI) itself struggle mostly for its existence. So, it is very pertinent for solution architects to raise a question WHERE and WHY to bring the Big data (Obviously Hadoop) in the Data Strategy. And, the safe option for this new entrant should the place where it supports and strengthen the EXISTING data analysis strategy. Yeah! That’s the DATA LAKE.

Hope, you would have already understood by now the 3 Ws (What: Data Lake, Who: Solution Architect, Where: Enterprise Data strategy) of Five Ws questions for information gathering. Now look at the diagram to depict WHERE and WHY.

Precisely, 3 major areas of opportunity for new entrant (Hadoop):

  1. Semi-structured and/or unstructured data ingestion.
  2. Push down bleeding data integration problems to Hadoop Engine.
  3. Business need to build comprehensive analytical data stores.

Absence of any one of these 3 needs above would make Hadoop case weak to enter into the existing enterprise strategy. And, this data lake approach believes to be aligning to the business analysis outcomes without much disruption, hence it will also create comfortable path in the enterprise. We can further dig into Data Lake Architecture and implementation strategy in detail.

Moreover, there lot of other supporting systems which are brewing in parallel with Hadoop eco-system and Apache Kylin ….opportunities are immense on datalake 


Posted in Big Data, DataLake, Hadoop | 1 Comment »

ETL, ELT and Data Hub: Where Hadoop is the right fit ?

Posted by datumengineering on November 17, 2013

Few days back i have attended a good webinar conducted by Metascale on topic “Are You Still Moving Data? Is ETL Still Relevant in the Era of Hadoop?” This post is targeting this webinar.

In summary, this webinar had nicely explained about how enterprise can use Hadoop as a data hub along with the existing Datawarehouse set up. “Hadoop as a Data Hub” this line itself raised lot of questions in my mind:

  1. When we project Hadoop as a Data-hub and same time maintain the datawarehouse as an another data (conventional) repository for the enterprise then won’t it be creating another platform in silos? Presenter in the webcast repeatedly telling about keeping existing datawarehouse intact when developing Hadoop as a Data Hub. Difficult to digest 😦
  2. Next question that would arise is: challenges in Hadoop environment as Master Data Management and Data governance platform. I don’t think Hadoop ecosystem is mature enough to swiftly handle the MDM complexity. As far as data governance is concerned Hadoop ecosystem lacks in applications which are required on top of Hadoop for robust data governance.
  3. Why to put lot of energy to build compatibility of ETL tools like Informatica with HDFS to connect existing ETL infrastructure with Big Data? I feel this is a crazy idea. Because you are selling cost effective solution with some under the cover cost. Obviously, Informatica will not give you Hadoop connector as “Free”. There are many other questions other than cost like performance, business logic stage etc.
  4. Also there is a big bet on Hadoop to replace existing ETL/ELT framework to push transformation to Hadoop considering its Map Reduce framework. I partially get this idea long back. But, still not convinced when:
  • Your use case doesn’t support Map Reduce framework during ETL.

  • You process relatively small amount of data using Hadoop. Hadoop is not meant for this and takes longer than it supposed to be.

  • You try to join some information with existing datawarehouse and unnecessary duplicate the information at HDFS as well as at conventional RDBMS.

Now, having these questions in place doesn’t mean Hadoop can’t be projected as a replacement/amendment of existing datawarehouse strategy. On contrary, I could see some different possibilities and ways for Hadoop to sneak-in into the existing enterprise data architecture. Here are my few cents:

  1. Identify the areas where data come with once write and many reads. Most importantly, identify the nature of the read. Ask yourself that the read is straightforward or joined or aggregated? All such situations in case of BIG DATA can be efficiently handled on HDFS. If you don’t know in advance then data profiling and capacity planning will be decision maker here to identify whether this data should go to your RDBMS or HDFS. However remember, if your queries is more ad-hoc and you are planning to move it to HDFS then you need a skill more than Hive & PIG.
  2. Use Hadoop’s HDFS feature more than the Map Reduce. I mean distributed storage to minimize effort to back up and data replication. This will be cost effective in comparison to DBA costs. For example, archive data on HDFS than tap drives. So your data never retire for analysis. Entertain MR intelligently whenever you could see the opportunity to break down your calculations into different parts i.e. MAPs.
  3. Identify the data which is small and can be fit into distributed cache in HDFS. Only this can have an entry into HDFS. However, rest of small (not BIG) data can stay on RDBMS. Again, Capacity planning is major role player.
  4. Now it comes to ETL: I am really happy to see Hadoop & HDFS here. But not with Informatica, Data Stage or any other ETL tools (i don’t know much about Pentaho). I must appreciate and support Metascale webinar . They have given a right approach to take Hadoop as an Extract, Load and Transform framework. Yes, this is the only way to do right transformation on Hadoop. Let’s rename it to DATA INTEGRATION. The moment you start thinking about ETL tools it means you are taking your data out of Hadoop and the moment you take data out, you are going against all the purpose of using Hadoop as a data processing platform. Isn’t it killing the idea of doing transformation on Hadoop using MR and also the idea of brining your overall cost effective down? However, I’ll be open to learn the right logic to use informatica or any other ETL tool on top of Hadoop.

I think, effort to bring Hadoop to enterprise require diligent changes in datawarehouse reference architecture. We are going to change a lot in our Reference Architecture when we bring Hadoop into the enterprise.

So, it is not important to learn WHERE HADOOP IS THE RIGHT FIT? but it is very important to understand HOW HADOOP IS THE RIGHT FIT?

Posted in Big Data, Hadoop | Leave a Comment »

Big Data? How do you run capacity planning?

Posted by datumengineering on February 15, 2013

Most of Datawarehouse folks are very much accustomed with the term “Capacity Planning”, Read Inmon. This is widely used process for DBA’s and Datawarehouse Architects. In an typical project of data management and warehouse wide variety of audience is involved to drive the capacity planning. It involves everyone from Business Analyst to Architect to Developer to DBA and finally Data Modeler.

This practice which has had wide audience in typical Datawarehouse world,  how this has been driven in Big Data? I have hardly heard noise around this in any Hadoop driven project which had started with an intention to handle growing data.  I have met pain bearers DBA/Architects who have been facing challenges at all stages of data management when data outgrows. They are the main players who advocates bringing Hadoop ASAP.  Crux of their problem is not growing data. But the problem is, they didn’t have mathematical calculation which advocate the growth rate. All we talk about is: How much percentage it is going? Most of the time that percentage also come from experience 🙂

Capacity planning should be explore more than just calculating the percentage and experience.

  1. It should be more mathematical calculation of every byte of the data sources coming into the system.
  2. How about designing a predictive model which will confirm my data growth with an accuracy until 10 years?
  3. How about involving business to confirm the data growth drivers and feasibility of future born data sources ?
  4. Why don’t consider compression factor and purging into the calculation to reclaim the space for data grow.
  5. Why we consider only disk utilization and why there is no consideration about other hardware resources like memory, processor, cache? After all, it is all about data processing environment.
  • I think this list of consideration can still grow….

I know building robust Capacity planning is not a task of day or month. One to two year of time frame data is good enough to understand this trend and develop a algorithm around it.  Consider 1-2 years as a learning data set and take some months of data as training data set and start analyzing the trend, start building the model which can predict the growth after 3rd or 4th year. Because as per Datawarehouse gurus bleeding starts after 5th year age.

I’ll leave up to you to design the solution and process for capacity capacity to claim your DATA as BIG DATA.

Remember, disk space is cheap but not the disk seek.

Posted in Big Data, Hadoop | Tagged: , | 2 Comments »

Data flow: Web log analysis on a Hive-way

Posted by datumengineering on February 8, 2013

 Data flow design to get an insight of user behavior on web site. Data flow explains the method of flattening up all elements in web log which can support detail user analysis and behavior.

Technology & Skills: Hadoop-Hive, HiveQL (+ Rich set of UDF in HiveQL) .

Infrastructure: Amazon Web Services (AWS).

Process -1 Moving data from Web Server to Amazon Simple Storage Services (S3) to HDFS.


Process -2 Start EC2 instance type : small to run Map Reduce job to parse log file.

To run jobs on AWS we should have EBS and EC2 both instance running.

Process -3 Prepare for Elastic Map Reduce to run the jobs from command line.

To run the EMR from command line we use an Amazon EMR credentials file to simplify job flow creation and authentication of requests. The credentials file provides information required for many commands. The credentials file is a convenient place to store command parameters so you don’t have to repeatedly enter the information. The Amazon EMR CLI automatically looks for these credentials in the file credentials.json.

To install the Elastic MapReduce CLI1. Navigate to your elastic-mapreduce-cli directory.
2. Unzip the compressed file: Linux and UNIX users, from the command-line prompt, enter the following:$ unzip elastic-mapreduce-ruby.zipConfiguring Credentials
The Elastic MapReduce credentials file can provide information required for many commands. It is
convenient to store command parameters in the file to save you from the trouble of repeatedly entering the information. Your credentials are used to calculate the signature value for every request you make. Elastic MapReduce automatically looks for your credentials in the file credentials.json. It is convenient to edit the credentials.json file and include your AWS credentials. An AWS key pair is a security credential
similar to a password, which you use to securely connect to your instance when it is running.To create your credentials file:1. Create a file named credentials.json in the elastic-mapreduce-cli/elastic-mapreduce-ruby directory.2. Add the following lines to your credentials file:
“access_id”: “[Your AWS Access Key ID]”,
“private_key”: “[Your AWS Secret Access Key]”,
“keypair”: “[Your key pair name]”,
“key-pair-file”: “[The path and name of your PEM file]”,
“log_uri”: “[A path to a bucket you own on Amazon S3, such as, s3n://myloguri/]”,
“region”: “[The Region of your job flow, either us-east-1, us-west-2, uswest-1, eu-west-1, ap-northeast-1, ap-southeast-1, or sa-east-1]”
}Note the name of the Region. You will use this Region to create your Amazon EC2 key pair and your
Amazon S3 bucket.

Process -4 Prepare Hive table for data analysis. Create landing table to load log data.

We create schema for tokenizing the string. So MAP and COLLECTION is used to build key-value array.

CREATE TABLE logdata (






Process -6 Load Hive landing table with log file data from HDFS.



Process -7 Load Hive stage table from landing table.

This stage table will have the data from landing. Stage table is used to load cleansed data without any junk character (Log has some # characters which we remove when load into staging).


create table logdata_stg

comment ‘log data’ stored as sequencefile as

select * from logdata where C_0 not like ‘%#%’;

Process -8 Load Hive final table from staging table.


This process will create flatten structure of complete log file into final table. This table will be used in all over the analysis. This table is created with actual column names identified in the log file. Final table load happen using UDF to parse query string, host name and category tree in browse data.

create table logdata_fnl

comment ‘log data’ stored as sequencefile as

Read my previous post on Hive – Agility to go in detail of how Hive UDF’s helped to run this analysis efficiently using MAP and ASSOCIATIVE ARRAY

Posted in Big Data, Hive | Tagged: , , | Leave a Comment »

MR => Techniques Of Map To Reduce Efficiently

Posted by datumengineering on October 27, 2012

The concept of Map Reduce over a HDFS is fairly simple with an aim of reducing calculation complexity on network. Reducer has to run calculation over the data. So as a MR programmer it is programmer’s responsibility to give a calculation in such a manner that reducer should have very less work to across the nodes and hence less keys to reduce over the network.

class Mapper
method Map(key a; val v)
for all term t IN val v do
    Emit(term t; count 1)
class Reducer
method Reduce(term t; counts [c1; c2; …..])
    sum <–  0
for all count c IN counts [c1; c2;……] do
    sum <–  sum + c
Emit(term t; count sum)

Let’s find out some of the practices to perform some reducer’s work locally:

Combiner: Mapper acts more like a assignee which can just break the data locally and assign Key and Value. This splitting is the first step of MR programming and it happens locally at all nodes. Once data is split there are still opportunities to aggregate it locally before aggregating over the network. These splits, if given to reducer directly then there will be lot of work for reducer to do on the network. To overcome this create a reducer kind of functionality at individual node. Combiner give this functionality which act locally as a reducer.

class Mapper
 method Map(string t; integer r)
Emit(string t; pair (r; 1))
class Combiner
 method Combine(string t; pairs [(s1; c1); (s2; c2)…..])
     sum <–  0
     cnt <–  0
     for all pair (s; c) 2 pairs [(s1; c1); (s2; c2)…..] do
     sum <–  sum + s
     cnt <–  cnt + c
Emit(string t; pair (sum; cnt))
class Reducer
 method Reduce(string t; pairs [(s1; c1); (s2; c2)…..])
     sum <–  0
     cnt <–  0
     for all pair (s; c) 2 pairs [(s1; c1); (s2; c2)…..] do
     sum <–  sum + s
     cnt <–  cnt + c
     r(avg) <–  sum/cnt
Emit(string t; integer r(avg))

In-Mapper Combiner: In a typical scenario, Key-Value pair emit on local memory as shown above. However, associative array approach can combine all the key in associate array and create a pair. So, MAP can emit the pair than individual splits.

class Mapper
method Initialize
S <–  new AssociativeArray
C <–  new AssociativeArray
method Map(string t; integer r)
S{t} <–  S{t} + r
C{t} <–  C{t} + 1
method Close
for all term t IN S do
Emit(term t; pair (S{t};C{t}))

Before deciding over either of method to aggregate data locally mind the below points first:

  1. HDFS cluster consider combiner on case by case basis. So it is not always when combiner will run. I don’t know exact reason WHY Hadoop behaves like this. Some of the famous text book says: The combiner is provided as a semantics-preserving optimization to the execution framework, which has the option of using it, perhaps multiple times, or not at all (or even in the reduce phase).
  2. When your mapper or combiner EMIT the Key value either in split or in pair. It should match with reducer IN TAKE definition. What i mean here is that if mapper or combiner emit result in associate array (pair) then your reducer should take this associate array (pair) as input to process under reducer.
  3. As combiner can have risk of proper utilization from Hadoop, In-mapper combiner create memory bottleneck. So there is an additional overhead at in-mapper combiner for memory management. There is a counter required to keep track of memory. So, there is a fundamental scalability bottleneck associated with the in-mapper combining pattern. It critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split.
  4. One common solution to limiting memory usage when using the in-mapper combining technique is to “block” input key-value pairs and “flush” in-memory data structures periodically. However, memory management is a manageable challenge for In-mapper combiner.
  5. In-mapper combiner creates more complex keys and values termed as pair and strips.

There are some more techniques like key partitioning also play major role to achieve performance.

Posted in Hadoop, Map Reduce | Tagged: | Leave a Comment »

Agility in Hive — Map & Array score for Hive

Posted by datumengineering on September 27, 2012

There are debate and comparison between PIG and Hive. There are good post from @Larsgeorge which talks about PIG v/s Hive.

I am not an expert to go in details of comparison but here I want to explore some of the Hive features which gives Hive an edge.

These feature are MAP (Associative Array) and ARRAY. MAP can give you an alternative way to segregate your data  around KEY and VALUE way.  So, if you have data something like this

clientid=’xxxx234xx’, category=’electronics’,timetaken=’20/01/2000 10:20:20′.

Then, you can really break it down in to key and value. Where, clientid, category and timetaken are keys and values are: xxxx234xx,electronics,20/01/2000 10:20:20.  How about not only converting them into key and value.  But storing and retrieving them as well  into a column. So, When you define the MAP it does store the complete MAP into a single column, like;


{“clientid”=”xxxx234xx”, “category”=”electronics”,”timetaken”=”20/01/2000 10:20:20″}

To store like this you need to define the table like this:

Create table table1





Now, retrieval is pretty easy : you just need to say in your HiveQL: Select COL1.[“category”] from table. You’ll get electronics. Had it been MAP is not there i would have end up writing a complete parsing program for storing such custom format in table.

Similarly, Array can be use to store collection into a column. So you can have data like:


Now, you want to parse the complete level in the second column. It would be easy in Hive to store it as in ARRAY. Definition would be:

Create table table1






Now data retrieving is obvious, query the table with collection index or level you want to go

Select Col_1[1] from table1;

You may also have scenarios when you have COLLECTION of MAPS. There you need to use both MAP and ARRAY together in same table definition along with required delimiter for ARRAY and MAP.

So your table’s delimiter definition should look like this:


Posted in Big Data, Hive | Tagged: , , | 1 Comment »

PIG, generation’s langauge: Simple Hadoop-PIG configuration, Register UDF.

Posted by datumengineering on June 26, 2012

I would consider PIG a step further to 4th generation. PIG emerged as an ideal language for programmers. PIG is a data flow language in Hadoop echo system. Now, It became a gap filler in BIG data analytics world between 2 audiences ETL developer & Java/Python Programmer. PIG has some very powerful feature which gives it an edge for generations:

  • Bringing schema less approach to unstructured data.
  • Bringing programming support to data flow.

PIG brings ETL capabilities to big data without having schema to be defined in advance. That’s an indispensable qualities. All these features together gives a power of analytics on Cloud with back of HDFS processing capabilities and MR programming model. Here, we’ll see in simple steps how can we use PIG as a data flow for analysis. Obviously, you should have PIG installed on your cluster.

PIG uses Hadoop configuration for data flow processing. Below are the steps of Hadoop configuration. I would prefer to do it in /etc/bash.bashrc

  • Point PIG to the JAVA_HOME.
  • set PIG_HOME to the core PIG script.
  • set PIG_DIR to the bin- dir.
  • set PIG_CONF_DIR to hadoop configuration directory.
  • finally set the PIG_CLASSPATH add it to the CLASSPATH.

Here is the exact code for above 5 steps:

export JAVA_HOME
export PATH

export PATH

export PIG_HOME
export PATH

export PIG_DIR
export PATH

export PATH

export PATH

export CLASSPATH=$CLASSPATH:$HADOOP_DIR/hadoop-core.jar:$HADOOP_DIR/hadoop-tools.jar:$HADOOP_DIR/hadoop-ant.jar:$HADOOP_DIR/lib/commons-logging-1.0.4.jar:$PIG_DIR/pig-core.jar

export PIG_CONF_DIR=/usr/lib/hadoop/conf

Now if you have written UDF then first register it and then define the function.

REGISTER /path/to/input/<jarfile>.jar;

define <function> org.apache.pig.piggybank.storage.<functionname>();

Now you’ve UDF available to use throughout your script.

A = load ‘/path/to/inputfile’ using org.apache.pig.piggybank.storage.<functionname>() as (variable:<datatype>);

Life becomes easy once we have UDF available to use. You just need to have basic understanding on SQL functionality to perform data flow operations in PIG scripting language.

Next write up in continue with PIG will be on: ANALYTICS: HOW PIG creates touch points for data flow as well as analytics.

Posted in Big Data, Hadoop, PIG | Tagged: , , | Leave a Comment »

HDFS & Cloudera Hadoop – What, How and Where?

Posted by datumengineering on April 3, 2012

It is always good to understand “Whats going on in background?”. My first practice to understand any database is to go in detail of DB architecture. Because every database has its own flavor of benefits which is unique. Now it is challenging with Hadoop when things become interesting when I need to go to in detail architecture focussed around the ” file system “. Hadoop made me to do this exercise …..

What, How, Where ?

I have been trying my best to go in detail of this giant to understand “How” the things are moving underneath. There had been lot of questions blinked in mind about internal architecture and flow. Finally to conclude this study, I have set up Cloudera Distribution Hadoop on my Ubuntu in Pseudo-distributed environment on single node. During this configuration and installation on single node i have started analyzing and sketching the diagram.

My HDFS directory is on hdfs://localhostand:8020/tmp and related local file system is on /app/hadoop/tmp. I have preferred cloudera distribution compare to the apache directly. This is a first step, eventually i am trying to expand it towards Hive, HBase, Zookeeper on top of this diagram.

I look at complete Hadoop job as a function of three main component: configuration,  services and metastore.

In reverse order of above list, First and foremost player is metastore: Namenode,  Secondary Namenode.

Second player is: Job tracker and Task trackers, who actually are the players.

And finally, a configuration which tell first 2 players how and where to run.

Disclaimer :- Diagram is completely my own understanding. Your catch and corrections are welcome.

Posted in Hadoop | Tagged: , , | Leave a Comment »

Provision of small file processing in HDFS

Posted by datumengineering on March 7, 2012

Hadoop meant ONLY for mammoth file processing. Though this has been ideal condition but Hadoop do have provision to process small files. Hadoop introduced a Big container to hold small files for further processing. These big containers intended for processing small files data in Map Reduce model. In HDFS these containers are termed as a Sequence file.

These sequence files hold small files as a whole record. However as Map Reduce model expected, it stores data in {Key,Value} pair. File name of the smaller file can be key and the content of the file becomes value. Once the files stored in Sequence file it can be read and write back to HDFS.Writing the data for Sequence file is matter of writing Key and value pair. It depend of the kind of serialization you use. Read process is similar to the collection processing where you define next() method which accept a key and value pair, and reads the next key and value in the stream in the variable. It process until it reaches to EOF and next() method returns false. Again you need to go in detail of kind of serialization you are using here. This unique feature of HDFS given an opportunity to process million small files together as a Sequence file.

Structure of Sequence file is pretty simple. It has header, which hold metadata and compression details for the files stored and the record. Record contains the whole file in it along with the key length, key name and value (i.e. file content/data). The internal format of the records depends on record/block compression. Record compression is just compress the file content (i.e. value), however block compression method compresses number of records. Hence block compression is more meaningful and preferred.

Another form of Sequence file is Map file. Map file is sorted sequence file which is sort on the key with an index to perform lookup on the key. This helps map reduce model to improve the performance of the sequence file.

With this kind of framework of Sequence file & Map file Hadoop has opened feasibility to process millions of small files together. So should we say that HDFS is not just a matter of handling Big data files but it does have capability to process small files too, that also efficiently within Map Reduce model?

Any thought or use case you can suggest here?

Posted in Hadoop, Map Reduce | Tagged: , , | Leave a Comment »

Hadoop Configuration Simplified : Master Slave Architecture

Posted by datumengineering on February 11, 2012

Master-Slave: Hadoop configuration has simple components which can be divided as Master and Slave components. Master: NameNode, Secondary namenode and Job tracker. Slave: data node and task tracker.

Following are the key points of configuration:

1. Namenode always be localize during hadoop configuration in cluster environment.

2. Entries for master and slave are in master and slaves files repectively. Master translate to namenode host and job tracker.

3. All configuration can be categorize in 4 aspects:

– Environment variable configuration
– Hadoop-HDFS configuration.
– Master-Slave configuration.
– Map Reduce configuration.

I) Environment configuration: Set up all environment variables to run the scripts in HDFS environment.

{Files: hdfs-env.sh}

II) Hadoop-HDFS configuration: This includes entries for hosts for namenode, secondary namenode, job trackers and task trackers.

{Files: core-site.xml, hdfs-site.xml, mapred-site.xml}

III) Text file configuration: There are 2 text entries in hadoop configuration for master and slaves. One for master node i.e. Host entry for master node and other for slaves i.e. Host entry for all slaves machin where data node and task tracker will run.

{Files: masters, slaves}

IV) Hadoop metrics and job log configuration: in map reduce configuration setting it captures all the metrics and logs related to map reduce program.

{Files: hadoop-metrics.properties, log4j.properties}

This is how brief function of different components together:

Namenode and job tracker starts at local machine, starts secondary node on each machine listed in the masters file. Eventually start tasktracker and datanode on each machine listed in the masters file.
In a cluster environment namenode, secondary namenode and job tracker run on single machine as a master node. However in a large cluster it can be sparated. When namenode and jobtracker are on separate node their slaves files should be in synch.

Posted in Hadoop, HDFS | Tagged: , | Leave a Comment »