Moving to Multi-Channel, and eventually progressing towards Agile Commerce, needs a technology shift.
Today, marketing and advertising have opened many channels beyond the traditional ones, including telemarketing, print catalogues, kiosks, and e-commerce. However, the contribution of e-commerce is huge compared to the other channels. It opens numerous data feeds to the organization, such as:
Campaign data (email, A/B test analysis)
Browse behavior on the e-commerce site (web logs)
Call-centre voice data converted to text
Data from search engines
Social media such as Facebook and Twitter
Customer experience and recommendations
Multi-channel purchase history, etc.
Most importantly, customer sentiment analysis can give an edge to the marketing strategy.
So, effective and efficient use of multiple channels requires tearing down the walls we have been building between channels over the years.
Data from multiple channels: When the channels are this numerous, the data they produce is utterly unstructured and gigantic. Web logs, consumer web interactions, and social media messages are a few examples of highly unstructured data.
Business still needs to analyze this data: Though unstructured, it has proven more meaningful for trend analysis and for understanding consumer preference and sentiment than direct store data.
Its titanic, unstructured nature makes conventional analysis unfeasible: the variety, volume, and velocity of the data make transformation into a relational database fragile and analysis nearly impossible. So we need to add a flavor of Big Data analytics to the traditional data warehouse.
Strategy for Big Data Analytics
We neither need to cleanse this data nor bring it into a relational database. We do not even need to wait until it has been processed through a series of transformations, because by that time the information would have lost its real-time value. We can instead store it quickly as it arrives, and access it easily, using:
All of this processing can be established and distributed on the company's private cloud. Predominantly, this unstructured-data analysis would be an excellent complement to the existing data warehouse.
HDFS (Hadoop Distributed File System):
Hadoop is designed for distributed parallel processing. In Hadoop's programming pattern, data is record-oriented and files are spread across the distributed file system in chunks; each compute process runs on a node and works on a subset of the data. So, instead of moving the whole dataset across the network for processing, Hadoop moves the computation to the data. Hadoop uses the MapReduce programming model to process and generate large datasets. A program written in this functional style automatically runs in parallel on a large cluster of commodity machines. The map and reduce functions express the data-processing logic, while the power of parallelism and fault tolerance stays under cover in the framework libraries that a Map and Reduce program links against. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
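The record-oriented map/reduce flow described above can be sketched in plain Python. This is a toy, single-process illustration of the programming model only, not Hadoop's actual API; the names `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical:

```python
from collections import defaultdict

def map_fn(record):
    # Map phase: emit an intermediate (key, value) pair per word in a record.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Reduce phase: collapse all values emitted for one key into a result.
    return (key, sum(values))

def map_reduce(records):
    # Shuffle: group intermediate values by key. In Hadoop this grouping
    # happens across the cluster, and each group is reduced on a worker node.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

logs = ["checkout error checkout", "search checkout"]
print(map_reduce(logs))  # {'checkout': 3, 'error': 1, 'search': 1}
```

Because `map_fn` and `reduce_fn` are pure functions over records and groups, the framework is free to run many copies of each in parallel, which is exactly the property Hadoop exploits.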
The Hadoop file system with MapReduce functionality forms the underlying foundation for processing huge amounts of data. Data management and analytics, however, need more than an underlying file-storage mechanism: the data must be organized in tables so it can be accessed without writing complex lines of code. Moreover, most of us are more comfortable with database operations than with file operations. So an abstraction layer is required to simplify scattered Big Data. HBase and Hive are the answer.
HBASE: HBase is designed to hold billions of rows and millions of columns in a big-table-like structure. It is a column-oriented distributed store built on top of Hadoop: a NoSQL database where data is stored as a sparse, distributed, sorted map. Tables are made of rows and columns, sorted by row key, and every column in HBase belongs to a particular column family. The HBase data model is therefore a little different from a conventional RDBMS.
So, HBase is not:
- a relational database, and
- it does not support data joins.
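The "sparse, distributed, sorted map" model above can be sketched as a nested dictionary keyed by row key, then `family:qualifier` column, then timestamp. This is a toy in-memory illustration of the data model, not the HBase client API; `ToyBigTable` and the sample row keys are hypothetical:

```python
import time

class ToyBigTable:
    """Row key -> {"family:qualifier" -> {timestamp -> value}}.
    Rows scan in sorted key order; absent columns simply do not exist (sparse)."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts=None):
        # Writes never update in place; they add a new timestamped version.
        ts = ts if ts is not None else time.time_ns()
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, column):
        # Read the newest version of the cell (HBase's default behavior).
        versions = self.rows.get(row, {}).get(column, {})
        return versions[max(versions)] if versions else None

    def scan(self):
        # Scans walk rows in sorted row-key order.
        for row in sorted(self.rows):
            yield row, self.rows[row]

t = ToyBigTable()
t.put("user#42", "profile:name", "Asha", ts=1)
t.put("user#42", "profile:name", "Asha K.", ts=2)  # new version, not an in-place update
print(t.get("user#42", "profile:name"))            # Asha K.
```

Note there is no join anywhere in this model: everything you want to read together has to live under the same row key, which is why schema design in HBase revolves around the row key.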
HIVE: Hive is a system for managing and querying structured data, built on top of Hadoop. Its aim is to give a familiar SQL-like interface to data stored in the Hadoop framework. Underneath, it still uses the MapReduce programming model to extract data from HDFS. It also gives ETL flexibility over the data stored in Hive tables, and all the metadata can be kept in an RDBMS.
JDBC/ODBC drivers allow third-party applications to pull Hive data for reporting.
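How a SQL-like query becomes map and reduce phases underneath can be sketched roughly as follows. This is a toy single-process analogy for a query in the spirit of `SELECT channel, SUM(amount) ... GROUP BY channel`; the table, column names, and values are hypothetical, and real Hive plans the phases itself:

```python
from collections import defaultdict

# Rows as they might sit in a table over HDFS files (hypothetical schema).
orders = [
    {"channel": "web",   "amount": 120.0},
    {"channel": "kiosk", "amount": 40.0},
    {"channel": "web",   "amount": 60.0},
]

# Map phase: project each row to (group-by key, aggregated column).
mapped = [(row["channel"], row["amount"]) for row in orders]

# Shuffle: group by key, as the framework does between map and reduce.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: apply the aggregate function (SUM) per group.
totals = {channel: sum(amounts) for channel, amounts in groups.items()}
print(totals)  # {'web': 180.0, 'kiosk': 40.0}
```

The value of Hive is that the analyst writes only the one-line query; the projection, shuffle, and aggregation shown here are generated for them.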
HBase and Hive each have their own flavor of benefits.
Hive is more interactive in terms of SQL queries, with its metadata in an RDBMS, but it is limited to read-intensive data: updating data in Hive is a costly operation, because here an update means creating another copy of the existing data.
HBase can step in here and provide fast row-level updates. It sidesteps Hadoop's append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently as the data distribution changes.
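The keep-recent-updates-in-memory, rewrite-to-new-files approach described above is essentially a log-structured design. A minimal sketch, assuming a single node and ignoring write-ahead logging and compaction policy (`ToyLSMStore` and its tiny flush threshold are hypothetical):

```python
class ToyLSMStore:
    """Recent writes live in a mutable memtable; flushes produce immutable
    sorted 'files' (segments), sidestepping any append-only constraint."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}            # recent updates, mutable, in memory
        self.segments = []            # flushed immutable segments, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # a row-level update is just a dict write
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Rewrite recent data to a new sorted, never-modified segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # newest data wins
            return self.memtable[key]
        for segment in reversed(self.segments):   # then newest segment first
            for k, v in segment:
                if k == key:
                    return v
        return None

s = ToyLSMStore()
s.put("row1", "v1")
s.put("row2", "v1")   # reaching the limit triggers a flush to a segment
s.put("row1", "v2")   # an "update" is simply a newer version in the memtable
print(s.get("row1"))  # v2
```

Reads consult memory before the older on-disk segments, which is how recently updated rows stay visible even though the flushed files are never modified in place.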
Marrying HBase with Hive can spell out a near-real-time data warehouse on the Hadoop ecosystem, with the simplicity of a SQL-like interface over Hive tables and a near-time replica kept in HBase tables.
- All structured data can continue to be analyzed in the enterprise data warehouse (EDW); Hadoop will work on the unstructured data.
- Hadoop can be used as a "data bag" for the EDW.
- Push-down aggregation: all the intensive, voluminous aggregation can be pushed down to Hadoop.
- Push-down ETL: all ETL complexity on Big Data can be implemented in Hadoop (Hive).
Value Add and ROI from Hadoop Data Analytics
HDFS, Hive, and HBase are open source, so investment is required mainly in skills rather than in tools and technologies. Nonetheless, this has been the biggest challenge in Hadoop-related development: the company has to define a strategy for investing in skills and for continuous investigation of the Hadoop platform.
$: – Indeed, x% of the budget should be allocated to Hadoop development, and some additional investment may be required for plug-ins from Hive/HBase to Teradata for third-party tools.