Datum Engineering !

An engineered artwork to make decisions..

Technology Shift: In Multi-Channel and Agile Commerce

Posted by datumengineering on February 21, 2012

It needs a technology shift:

When you move to Multi-Channel and eventually progress towards Agile Commerce

Today marketing and advertisement have opened many channels other than traditional channels. It includes: telemarketing, print catalogue, Kiosk and ecommerce.  However contribution of e-commerce is huge in comparison to other channels. It opens numerous feeds to the organization, like:

Campaign data (email, A/B test analysis)

Browse behavior on e-commerce site (Web logs)

Mobile BI

Call Centre voice data converted in textual

Data from search engines.

Social Media like Facebook, Twitter.

Customer experience, recommendation.

Multichannel purchase history etc.

Most importantly customer sentiment analysis will give an edge to marketing strategy.

So, effective and efficient utilization of multi-channel requires that we must tear down the walls what we have been building between different channels over the years.

Data from Multi-channel: When the channels are cosmic, data produced from these channels are utterly unstructured and gigantic. Web logs, Consumer Web interaction, Social media messages are few examples of highly unstructured data.  

Business still needs to analyze this data: Though it is unstructured but it had proven to be more meaningful in terms of trend analysis and consumer preference and sentiment, compared to the direct store data. 

Titanic and unstructured nature makes it unfeasible for analysis: Due to variety, volume and velocity of the data the task of transformation into relational database becomes vulnerable and analysis nearly impossible. So we need to add flavor of Big Data Analytics in traditional DW.


Strategy to Big Data Analytics

We neither need to cleanse this data nor need to bring it into relational database. We do not even need to wait until it gets processed through series of transformations. Because until that time information will lose its flavor of real time.  However we can still store it quickly as it comes and can be easily accessed using:

  • Ø HDFS (Hadoop Distributed File System) & Map Reduce Framework.
  • Ø Real time frequently updated unstructured data storage in HBASE
  • Ø Quicker access using HIVE datawarehouse on Hadoop.

And, all these processing can be establish and distributed on company Private CLOUD.  Predominantly, this unstructured data analysis would be an excellent support to existing datawarehouse.

HDFS (Hadoop Distributed File System):

Hadoop is designed for distributed parallel processing. In Hadoop programming patterns data is record oriented and the files are spread across distributed file system as a chunk, each compute process running on nodeworks on subset of data. So instead of moving whole data across the network for processing, Hadoop actually moves the computation to the data. Hadoop uses Map Reduce programming model to process and generate large dataset. Program written in this functional style will automatically run in parallel on large cluster of commodity machine.  Map-Reduce function takes care of functionality of data processing. However power of parallelism and fault tolerances are under cover in libraries (C++ libraries). These libraries need to include when writing Map & Reduce program.  When running large Map Reduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.


Hadoop file system with Map Reduce functionality forms an underlying foundation to process huge amount ofdata. Data management and analytics need more than just a building an underlying file storage mechanism.It requires data to be organized in tables so it can be easily accessible without writing complex lines of code.  Moreover we all are more comfortable with database operation than file operation. So, abstract layer is required to simplify scattered Big Data.  HBase and Hive is the answer for this.

HBASE: HBASE is aimed to hold billions of rows and millions of columns in a big table like structure. HBASE is column oriented distributed storage build on top of Hadoop.  It is a NoSQL database where data is stored in form of Hash Table. So the data is sparse, distributed and in sorted Map.  Tables are sorted by rows. Tables are made of rows and columns. All columns in HBase belong to a particular column family. Data model for HBase is little different than conventional RDBMS.

So, HBASE is :

1.Column oriented distributed storage build on top of Hadoop layered over HDFS.
2.NoSQL database where data is stored in form of Hash Table.
3.Distributed on many servers and tolerant of machine failure. 
4.All columns in HBase belong to a particular column family.
It is not a :
  • Relational database.
  • Support to data join.

HIVE:  It is a system for managing and querying structured data build on top of Hadoop.  Aim of HIVE is to give familiar SQL like interface to data stored in Hadoop framework. Underline it still uses Map Reduce programming model for extracting data from HDFS.  It also gives ETL flexibility on the data stored in Hive tables.  All the metadata can be stored in RDBMS. 

1.Aim of HIVE is to give familiar SQL like interface to data stored in Hadoop framework.
2.Underline it still uses Map Reduce programming model for extracting data from HDFS. 
3.It also gives ETL flexibility on the data stored in Hive tables. 
4.All the metadata can be stored in RDBMS. 

JDBC/ODBC drivers allow 3rd party applications to pull Hive data for reporting.

HBase and Hive has its own flavor of benefits.

Hive is more interactive in terms of SQL queries and metadata in RDBMS. But it has limited use as read-intensive data; updating data in Hive is a costly operation. Because here update means create another copy of existing data.

 HBase can step in here and can give a functionality of high row level updates. It sidesteps Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on data distribution changes.

Marrying HBase with Hive can spell out a near real time datawarehouse on Hadoop echo system with simplicity of SQL like interface from Hive tables and keeping near time replica in HBase tables.


Important Note:

  • All structured data can continue to be analyzed with Enterprise Data Warehouse. Hadoop will play on unstructured data.
  • Hadoop can used as “Data Bag” for EDW. 
  • Push Down Aggregation: All the  intensive, voluminous aggregation can be pushed to Hadoop.
  • Push Down ETL: All ETL complexity on Big Data can be implemented in Hadoop (Hive).

Value Add and ROI from Hadoop Data Analytics


HDFS, Hive and HBase is an open source. Investment merely require on skills than on tools and technologies. Nonetheless, this has been a biggest challenge in Hadoop related development.  Company has to define the strategy to invest in skills and continuous investigation on Hadoop platform.

$: – Indeed, x% of budget allocation for Hadoop development and some additional investment may require for plug-in from Hive/HBase to Teradata for third party. 


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: