Datum Engineering !

An engineered artwork to make decisions.

Archive for February, 2012

Technology Shift: In Multi-Channel and Agile Commerce

Posted by datumengineering on February 21, 2012

It needs a technology shift

when you move to multi-channel and eventually progress towards agile commerce.

Today, marketing and advertising have opened many channels beyond the traditional ones, including telemarketing, print catalogues, kiosks and e-commerce. However, the contribution of e-commerce is huge in comparison to the other channels. It opens numerous feeds to the organization, like:

Campaign data (email, A/B test analysis)

Browse behavior on e-commerce site (Web logs)

Mobile BI

Call centre voice data converted to text

Data from search engines.

Social Media like Facebook, Twitter.

Customer experience and recommendation data.

Multi-channel purchase history, etc.

Most importantly, customer sentiment analysis will give marketing strategy an edge.

So, effective and efficient use of multi-channel requires that we tear down the walls we have been building between the different channels over the years.

Data from multi-channel: When the channels are this numerous, the data they produce is utterly unstructured and gigantic. Web logs, consumer web interactions and social media messages are a few examples of highly unstructured data.

Business still needs to analyze this data: Though unstructured, it has proven to be more meaningful for trend analysis, consumer preference and sentiment than direct store data.

Its titanic and unstructured nature makes it unfeasible for analysis: due to the variety, volume and velocity of the data, transforming it into a relational database becomes fragile and analysis nearly impossible. So we need to add a flavor of Big Data analytics to the traditional DW.

 

Strategy for Big Data Analytics

We neither need to cleanse this data nor bring it into a relational database. We do not even need to wait until it has been processed through a series of transformations, because by that time the information will have lost its real-time flavor. Instead, we can store it quickly as it arrives and access it easily using:

  • HDFS (Hadoop Distributed File System) & the MapReduce framework.
  • Real-time, frequently updated unstructured data storage in HBase.
  • Quicker access using the Hive data warehouse on Hadoop.

And all this processing can be established and distributed on the company's private cloud. Most importantly, this unstructured data analysis would be an excellent complement to the existing data warehouse.

HDFS (Hadoop Distributed File System):

Hadoop is designed for distributed parallel processing. In the Hadoop programming pattern, data is record oriented and files are spread across the distributed file system as chunks; each compute process running on a node works on a subset of the data. So instead of moving the whole data set across the network for processing, Hadoop actually moves the computation to the data. Hadoop uses the MapReduce programming model to process and generate large data sets. A program written in this functional style automatically runs in parallel on a large cluster of commodity machines. The Map and Reduce functions express the data processing, while the power of parallelism and fault tolerance stays under the covers in the framework libraries (Java, in Hadoop's case) that the Map and Reduce program is written against. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
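To make the model concrete, here is a minimal word-count job written against the standard Hadoop MapReduce Java API. It is only a sketch: the class names are mine, and the input and output HDFS paths passed on the command line are assumptions (for example, a directory of raw web logs).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs where the data block lives and emits (word, 1) for each word
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation saves network traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of web logs
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mapper runs on the nodes that already hold the data blocks, and only the much smaller (word, count) pairs travel across the network to the reducers.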

 

The Hadoop file system with MapReduce functionality forms the underlying foundation for processing huge amounts of data. Data management and analytics, however, need more than an underlying file storage mechanism. They require data to be organized in tables so it can be accessed easily without writing complex code. Moreover, we are all more comfortable with database operations than with file operations. So an abstraction layer is required to simplify scattered Big Data. HBase and Hive are the answer to this.

HBASE: HBase is aimed at holding billions of rows and millions of columns in a big-table-like structure. HBase is column-oriented distributed storage built on top of Hadoop. It is a NoSQL database where data is stored as a sparse, distributed, sorted map. Tables are sorted by row key and made of rows and columns, and every column belongs to a particular column family. The HBase data model is therefore a little different from a conventional RDBMS (see the client sketch after the lists below).

So, HBase is:

1. Column-oriented distributed storage built on top of Hadoop, layered over HDFS.
2. A NoSQL database where data is stored in the form of a hash table.
3. Distributed across many servers and tolerant of machine failure.
4. A store where all columns belong to a particular column family.

It is not:
  • A relational database.
  • A system that supports joins.
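As a small illustration of that row-oriented data model, here is a sketch of a row-level write and read through the HBase Java client API (the current Connection/Table API rather than the 2012-era HTable). The table customer_events and the column family clicks are made-up names, and the table is assumed to already exist.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerEventStore {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("customer_events"))) {

      // Row key: customer id; column family "clicks" holds per-visit attributes
      Put put = new Put(Bytes.toBytes("customer#42"));
      put.addColumn(Bytes.toBytes("clicks"), Bytes.toBytes("last_page"), Bytes.toBytes("/checkout"));
      table.put(put);  // row-level update, applied in place

      // Read the same row back
      Get get = new Get(Bytes.toBytes("customer#42"));
      Result result = table.get(get);
      byte[] lastPage = result.getValue(Bytes.toBytes("clicks"), Bytes.toBytes("last_page"));
      System.out.println("last_page = " + Bytes.toString(lastPage));
    }
  }
}
```

Note how the row key (the customer id here) drives everything: reads and updates are addressed by row, column family and qualifier rather than by SQL predicates or joins.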

HIVE: Hive is a system for managing and querying structured data built on top of Hadoop. In brief:

1. The aim of Hive is to give a familiar SQL-like interface to data stored in the Hadoop framework.
2. Under the hood it still uses the MapReduce programming model to extract data from HDFS.
3. It also gives ETL flexibility over the data stored in Hive tables.
4. All the metadata is stored in an RDBMS (the Hive metastore).

JDBC/ODBC drivers allow third-party applications to pull Hive data for reporting.
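For example, a reporting tool (or a few lines of Java) can pull an aggregate straight from Hive over JDBC. This is only a sketch: the HiveServer2 host, the credentials and the web_logs table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveReportPull {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; host, port, user and table name below are assumptions
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://hadoop-gateway:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "analyst", "");
         Statement stmt = conn.createStatement();
         // Hive compiles this aggregation into MapReduce jobs over the data in HDFS
         ResultSet rs = stmt.executeQuery(
             "SELECT channel, COUNT(*) AS visits FROM web_logs GROUP BY channel")) {
      while (rs.next()) {
        System.out.println(rs.getString("channel") + " : " + rs.getLong("visits"));
      }
    }
  }
}
```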

HBase and Hive each have their own flavor of benefits.

Hive is more interactive, with SQL-like queries and metadata held in an RDBMS, but it is limited to read-intensive use: updating data in Hive is a costly operation, because an update here means creating another copy of the existing data.

HBase can step in here and provide high-rate row-level updates. It sidesteps Hadoop's append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on changes in data distribution.

Marrying HBase with Hive can spell out a near-real-time data warehouse on the Hadoop ecosystem, with the simplicity of a SQL-like interface from Hive tables and a near-time replica kept in HBase tables.
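One common way to wire the two together is Hive's HBase storage handler, which maps an HBase table into Hive so the near-time data can be queried with the same SQL-like interface. The sketch below issues that DDL over JDBC; the table and column names follow the earlier illustrative example and are not from the original post.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveOverHBase {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://hadoop-gateway:10000/default";  // assumed HiveServer2 endpoint

    try (Connection conn = DriverManager.getConnection(url, "analyst", "");
         Statement stmt = conn.createStatement()) {
      // Expose an existing HBase table to Hive so it can be queried with SQL.
      // Table and column names are illustrative; the storage handler ships with Hive.
      stmt.execute(
          "CREATE EXTERNAL TABLE customer_events_hive (rowkey STRING, last_page STRING) "
          + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
          + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,clicks:last_page') "
          + "TBLPROPERTIES ('hbase.table.name' = 'customer_events')");
    }
  }
}
```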

 

Important Note:

  • All structured data can continue to be analyzed in the Enterprise Data Warehouse; Hadoop will play on the unstructured data.
  • Hadoop can be used as a “Data Bag” for the EDW.
  • Push-down aggregation: all the intensive, voluminous aggregation can be pushed down to Hadoop.
  • Push-down ETL: all ETL complexity on Big Data can be implemented in Hadoop (Hive), as the sketch after this list shows.
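A sketch of what push-down aggregation/ETL can look like in practice: the heavy GROUP BY runs as MapReduce inside Hadoop via Hive, and only the small summary table would later be fed to the EDW. The table names daily_channel_summary and web_logs are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PushDownEtl {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://hadoop-gateway:10000/default";  // assumed HiveServer2 endpoint

    try (Connection conn = DriverManager.getConnection(url, "analyst", "");
         Statement stmt = conn.createStatement()) {
      // The voluminous aggregation runs as MapReduce inside Hadoop; only the
      // summarized result would later be loaded into the EDW for reporting.
      stmt.execute(
          "INSERT OVERWRITE TABLE daily_channel_summary "
          + "SELECT channel, to_date(event_time), COUNT(*) "
          + "FROM web_logs GROUP BY channel, to_date(event_time)");
    }
  }
}
```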

Value Add and ROI from Hadoop Data Analytics

Investment:

HDFS, Hive and HBase are open source, so the investment is required more in skills than in tools and technologies. Nonetheless, this has been the biggest challenge in Hadoop-related development. The company has to define a strategy to invest in skills and in continuous investigation of the Hadoop platform.

$: Indeed, x% of the budget should be allocated for Hadoop development, and some additional investment may be required for third-party plug-ins from Hive/HBase to Teradata.


Posted in Big Data | Tagged: | Leave a Comment »

Hadoop Configuration Simplified: Master-Slave Architecture

Posted by datumengineering on February 11, 2012

Master-Slave: The Hadoop configuration has simple components which can be divided into master and slave components. Master: NameNode, Secondary NameNode and JobTracker. Slave: DataNode and TaskTracker.

Following are the key points of configuration:

1. The NameNode always runs locally, on the machine where the Hadoop start scripts are executed, in a cluster configuration.

2. Entries for master and slave nodes go in the masters and slaves files respectively. The master entry translates to the NameNode and JobTracker host.

3. All configuration can be categorized into 4 aspects:

– Environment variable configuration
– Hadoop-HDFS configuration
– Master-Slave configuration
– MapReduce configuration

I) Environment configuration: Set up all environment variables needed to run the scripts in the HDFS environment.

{Files: hadoop-env.sh}

II) Hadoop-HDFS configuration: This includes entries for the hosts of the NameNode, secondary NameNode, JobTracker and TaskTrackers (a small sketch after item IV shows which property lives in which file).

{Files: core-site.xml, hdfs-site.xml, mapred-site.xml}

III) Text file configuration: There are 2 plain-text entries in the Hadoop configuration, for masters and slaves. One is for the master node, i.e. the host entry for the master node, and the other is for the slaves, i.e. the host entries for all the slave machines where the DataNode and TaskTracker will run.

{Files: masters, slaves}

IV) Hadoop metrics and job log configuration: these settings capture all the metrics and logs related to the MapReduce programs.

{Files: hadoop-metrics.properties, log4j.properties}
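As a small illustration of where the key properties live, here is a sketch that reads them back through Hadoop's Configuration class in Java. The property names are the standard Hadoop 1.x keys (NameNode URI in core-site.xml, JobTracker address in mapred-site.xml, replication factor in hdfs-site.xml); the printed values depend entirely on the local site files.

```java
import org.apache.hadoop.conf.Configuration;

public class ShowClusterConfig {
  public static void main(String[] args) {
    // Configuration picks up core-site.xml from the classpath by default;
    // the other site files are added explicitly here.
    Configuration conf = new Configuration();
    conf.addResource("hdfs-site.xml");
    conf.addResource("mapred-site.xml");

    // Standard Hadoop 1.x property keys and where they are normally defined
    System.out.println("fs.default.name    = " + conf.get("fs.default.name"));      // core-site.xml
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));   // mapred-site.xml
    System.out.println("dfs.replication    = " + conf.get("dfs.replication", "3")); // hdfs-site.xml
  }
}
```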

Briefly, this is how the different components function together:

The NameNode and JobTracker start on the local machine; a secondary NameNode is started on each machine listed in the masters file, and a TaskTracker and DataNode are started on each machine listed in the slaves file.
In a cluster environment the NameNode, secondary NameNode and JobTracker can run on a single machine as the master node. However, in a large cluster they can be separated. When the NameNode and JobTracker are on separate nodes, their slaves files should be kept in sync.

Posted in Hadoop, HDFS | Tagged: , | Leave a Comment »

Hadoop Recommendation

Posted by datumengineering on February 9, 2012

  • All structured data can continue to be analyzed in the Enterprise Data Warehouse; Hadoop will play on the unstructured data.
  • Hadoop can be used as a “Data Bag” for the EDW.
  • Push-down aggregation: all the intensive, voluminous aggregation can be pushed down to Hadoop.
  • Push-down ETL: all ETL complexity on Big Data can be implemented in Hadoop (Hive).

Posted in Big Data | Tagged: | Leave a Comment »