Datum Engineering !

An engineered artwork to make decisions..

Posts Tagged ‘Big Data’

An indispensable Python : Data sourcing to Data science.

Posted by datumengineering on August 27, 2013

Data analysis echo system has grown all the way from SQL’s to NoSQL and from Excel analysis to Visualization. Today, we are in scarceness of the resources to process ALL¬†(You better understand what i mean by ALL) kind of data that is coming to enterprise. Data goes through profiling, formatting, munging or cleansing, pruning, transformation steps to analytics and predictive modeling. Interestingly, there is no one tool proved to be an effective solution to run all these operations { Don’t forget the cost factor here ūüôā }. ¬†Things become challenging when we mature from aggregated/summarized analysis to Data mining, mathematical modeling, statistical modeling and predictive modeling. Pinch of complication will be added by¬†Agile implementation.

Enterprises have to work out the solution: “Which help to build the data analysis (rather analytics) to go in Agile way to all complex data structure in either of the way of SQL or NoSQL, and in support of data mining activities” .

So, let’s look at the Python & its eco system (I would prefer to call Python libraries as echo system) and how it can cover up enterprise’s a*s for data analysis.

Python: functional object orientated programming language, most importantly super easy to learn. Any home grown programmer with less or minor knowledge on programming fundamentals can start anytime on python programming.  Python has rich library framework. Even the old guy can dare to start programming in python. Following data structure and functions can be explored for implementing various mathematical algorithms like recommendation engine, collobrative filtering, K-means, Clustering and Support Vector Machine.

  • Dictionary.
  • Lists.
  • String.
  • Sets.
  • Map(), Reduce().

Python Echo System for Data Science:

Let’s begin with sourcing data, bringing into dataset format and shaping mechanism.

{ Pandas: Data loading, Cleansing, Summarization, Joining, Time Series Analysis }

Pandas: Data analysis covered up in python libraries. It has most of the things which you look out to run quick analysis. Data Frames, Join, Merge, Group By are the in-builds which are available to run SQL like analysis on the data coming in CSV files (read CSV function). To install Pandas you need to have NumPy installed first.

{ NumPy: Data Array, Vectorization, matrix and Linear algebra operations i.e. mathematical modeling }

NumPy: Rich set of functions for array, matrix and Vector operations. Indexing, Slicing and Stacking are prominent functionality of NumPy.

{ Scipy:  Mean, variance, skewness, kurtosis }

Scipy: SciPy to run scientific analysis on the data. However, statistics functions can be located in the sub-package scipy.stats

{ Matplotlib: Graph, histograms, power spectra, bar charts, errorcharts, scatterplots }

Matplotlib: 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

Moreover, how can we second python support to Big data Analytics and Machine Learning.  Below resources can be utilize for various big data applications:

  • Lightweight Map-Reduce implementation written in Python: Octopy
  • Hbase¬†interaction¬†using python: happybase
  • Machine learning algorithm implementation in Python: Scikit. It has built on NumPy, SciPy, and matplotlib.

Having said that, Python is capable enough to give a way out to implement data analysis algorithms and hence to build your own data analysis framework.

Watch out this space for implementations of various algorithms in Python under one umbrella i.e.  Python data analysis tools.


Posted in Big Data, Data Analysis, Predictive Model, Python, Statistical Model | Tagged: , , , | Leave a Comment »

Big Data? How do you run capacity planning?

Posted by datumengineering on February 15, 2013

Most of Datawarehouse folks are very much¬†accustomed with the term¬†“Capacity Planning”,¬†Read Inmon. This is widely used process for DBA’s and Datawarehouse Architects. In an typical project of data management and warehouse wide variety of audience is involved to drive the capacity planning. It involves everyone from Business Analyst to Architect to Developer to DBA and finally Data Modeler.

This practice which has had wide audience in typical¬†Datawarehouse world,¬† how this has been driven in Big Data? I have hardly heard noise around this in any Hadoop driven project which had started with an intention to handle growing data. ¬†I have met pain bearers DBA/Architects who have been facing challenges at all stages of data management when data outgrows. They are the main players who advocates bringing Hadoop¬†ASAP. ¬†Crux of their problem is not growing data. But the problem is, they didn’t have mathematical calculation which advocate the growth rate. All we talk about is: How much percentage it is going? Most of the time that percentage also come from experience ūüôā

Capacity planning should be explore more than just calculating the percentage and experience.

  1. It should be more mathematical calculation of every byte of the data sources coming into the system.
  2. How about designing a predictive model which will confirm my data growth with an accuracy until 10 years?
  3. How about involving business to confirm the data growth drivers and feasibility of future born data sources ?
  4. Why don’t consider compression factor and purging into the calculation to reclaim the space for data grow.
  5. Why we consider only disk utilization and why there is no consideration about other hardware resources like memory, processor, cache? After all, it is all about data processing environment.
  • I think this list of consideration can still grow….

I know building robust Capacity planning is not a task of day or month. One to two year of time frame data is good enough to understand this trend and develop a algorithm around it.  Consider 1-2 years as a learning data set and take some months of data as training data set and start analyzing the trend, start building the model which can predict the growth after 3rd or 4th year. Because as per Datawarehouse gurus bleeding starts after 5th year age.

I’ll leave up to you to design the solution and process for capacity capacity to claim your DATA as BIG DATA.

Remember, disk space is cheap but not the disk seek.

Posted in Big Data, Hadoop | Tagged: , | 2 Comments »

Data analysis drivers

Posted by datumengineering on February 11, 2013

I have been exploring data analysis and modeling techniques since months. There are lots of topics floating around in the space of data analysis like statistical modeling, predictive modeling. There have always been questions in mind which technique to choose? which is preferred way for data analysis? Some articles and lecture highlight machine learning or mathematical model over statistics modeling limitations. They mention mathematical modeling as a next step of accuracy and prediction. This kind of articles create more questions in mind of naive user.

Finally, i would thank to coursera.org for zero down this confusion and stating a clear picture of Data Analysis drivers. Now, things are pretty clear in terms of How to proceed on data analysis? Rather, defining “DATA ANALYSIS DRIVERS”. In one liner the answer is simple “Define a question or problem“. So, all depend upon how you define the problem.

To start with data analysis drivers here are steps in a data analysis

  1. Define the question
  2. Define the ideal data set
  3. Determine what data you can access
  4. Obtain the data
  5. Clean the data
  6. Exploratory data analysis
  7. Statistical prediction/modeling
  8. Interpret results
  9. Challenge results
  10. Synthesize/write up results
  11. Create reproducible code
  • Defining the question means how the business problem has stated and how you proceed on story telling on this ¬†problem. Story telling on the problem will take you to the structuring the solution. So you should be good in story telling on the problem statement.
  • Defining the solution will help you to prepare the data (data set) for the solution.
  • Profile the source to identify what data you can access.
  • Next step is cleansing¬†the data.
  • Now, once the data is cleansed it is either in one of the following standard: txt, csv, xml/html, json and database.
  • Based on the solution need we start building the model. Precisely, the solution will have requirement of Descriptive analysis,¬†Inferential analysis or predictive analysis.

Henceforth, The data set and model may depend on your goal:

  1. Descriptive – a whole population.
  2. Exploratory – a random sample with many variables measured.
  3. Inferential – the right population, randomly sampled.
  4. Predictive – a training and test data set from the same population.
  5. Causal – data from a randomized study.
  6. Mechanistic – data about all components of the system.

From here knowledge on statistics, machine learning and mathematical algorithm works ūüôā

Posted in Big Data, Data Analysis | Tagged: , | 6 Comments »

Agility in Hive — Map & Array score for Hive

Posted by datumengineering on September 27, 2012

There are debate and comparison between PIG and Hive. There are good post from @Larsgeorge which talks about PIG v/s Hive.

I am not an expert to go in details of comparison but here I want to explore some of the Hive features which gives Hive an edge.

These feature are MAP (Associative Array) and ARRAY. MAP can give you an alternative way to segregate your data  around KEY and VALUE way.  So, if you have data something like this

clientid=’xxxx234xx’, category=’electronics’,timetaken=’20/01/2000 10:20:20′.

Then, you can really break it down in to key and value. Where, clientid, category and timetaken are keys and values are: xxxx234xx,electronics,20/01/2000 10:20:20.  How about not only converting them into key and value.  But storing and retrieving them as well  into a column. So, When you define the MAP it does store the complete MAP into a single column, like;


{“clientid”=”xxxx234xx”, “category”=”electronics”,”timetaken”=”20/01/2000 10:20:20″}

To store like this you need to define the table like this:

Create table table1





Now, retrieval is pretty easy : you just need to say in your HiveQL: Select COL1.[“category”] from table. You’ll get electronics. Had it been MAP is not there i would have end up writing a complete parsing program for storing such custom format in table.

Similarly, Array can be use to store collection into a column. So you can have data like:


Now, you want to parse the complete level in the second column. It would be easy in Hive to store it as in ARRAY. Definition would be:

Create table table1






Now data retrieving is obvious, query the table with collection index or level you want to go

Select Col_1[1] from table1;

You may also have scenarios when you have COLLECTION of MAPS. There you need to use both MAP and ARRAY together in same table definition along with required delimiter for ARRAY and MAP.

So your table’s delimiter definition should look like this:


Posted in Big Data, Hive | Tagged: , , | 1 Comment »

PIG, generation’s langauge: Simple Hadoop-PIG configuration, Register UDF.

Posted by datumengineering on June 26, 2012

I would consider PIG a step further to 4th generation. PIG emerged as an ideal language for programmers. PIG is a data flow language in Hadoop echo system. Now, It became a gap filler in BIG data analytics world between 2 audiences ETL developer & Java/Python Programmer. PIG has some very powerful feature which gives it an edge for generations:

  • Bringing schema less approach to unstructured data.
  • Bringing programming support to data flow.

PIG brings ETL capabilities to big data without having schema to be defined in advance. That’s an indispensable qualities. All these features together gives a power of analytics on Cloud with back of HDFS processing capabilities and MR programming model. Here, we’ll see in simple steps how can we use PIG as a data flow for analysis. Obviously, you should have PIG installed on your cluster.

PIG uses Hadoop configuration for data flow processing. Below are the steps of Hadoop configuration. I would prefer to do it in /etc/bash.bashrc

  • Point PIG to the JAVA_HOME.
  • set PIG_HOME to the core PIG script.
  • set PIG_DIR to the bin- dir.
  • set PIG_CONF_DIR to hadoop configuration directory.
  • finally set the PIG_CLASSPATH add it to the CLASSPATH.

Here is the exact code for above 5 steps:

export JAVA_HOME
export PATH

export PATH

export PIG_HOME
export PATH

export PIG_DIR
export PATH

export PATH

export PATH

export CLASSPATH=$CLASSPATH:$HADOOP_DIR/hadoop-core.jar:$HADOOP_DIR/hadoop-tools.jar:$HADOOP_DIR/hadoop-ant.jar:$HADOOP_DIR/lib/commons-logging-1.0.4.jar:$PIG_DIR/pig-core.jar

export PIG_CONF_DIR=/usr/lib/hadoop/conf

Now if you have written UDF then first register it and then define the function.

REGISTER /path/to/input/<jarfile>.jar;

define <function> org.apache.pig.piggybank.storage.<functionname>();

Now you’ve UDF available to use throughout your script.

A = load ‘/path/to/inputfile’ using org.apache.pig.piggybank.storage.<functionname>() as (variable:<datatype>);

Life becomes easy once we have UDF available to use. You just need to have basic understanding on SQL functionality to perform data flow operations in PIG scripting language.

Next write up in continue with PIG will be on: ANALYTICS: HOW PIG creates touch points for data flow as well as analytics.

Posted in Big Data, Hadoop, PIG | Tagged: , , | Leave a Comment »

BiG DaTa & Vectorization

Posted by datumengineering on May 14, 2012

It has been while when Big data entered into the market and buzz the analytics world. Now a day all analytics leaders are chanting about Big data applications. Since I have started with Hadoop technologies and with Machine learning one question has been bugging in mind:

Which is a greater innovation Big Data Or Machine Learning & Vectorization?

When it comes to analytics Vectorization and machine learning more innovative. Wait a minute, I don’t want to be biased and I am not concluding here. But, i would like to showcase more on the direction when we take out data for the analytics world. We have structured data, we have enterprise data, we have data which is still measurable and suffice analytical and advanced analytical need. But how many of Business analytics use it smartly to do predictions, How many have applied different statistical algorithm to be benefited from this data ? How many times available data has been utilized to its potential ? I guess, only 20% cases. When we are still not up to the utilization of structured, measurable data then why we are so much behind the unstructured and monster data. In fact this big data need more work than enterprise data. I don’t advocate to go to saturation first and then think of innovation or out of the box, NO. My emphasis more on the best utilization of existing enterprise data and keep the innovation alive by experimenting the possible options to explore the data which is unexplored or unfeasible through conservative technologies. Innovation doesn‚Äôt mean keep thinking and just doing new things. Innovation is more meaningful when you do something meaningful to the world which other people acknowledges but they says “Not feasible”. I am not in favor of anyone here. I am coming from the world where I see data processing challenges, when I see data storage challenges, when I see data aggregation challenges, when I see lot of challenges during sorting and searching. There I would look at Hadoop related technologies. The way Hive provides query processing power, HBase provides data storage and manipulation power is indeed way beyond the other RDBMS. Their power of MAP REDUCE is exemplary. But all these Big data technologies should enter into the enterprise which is already mature enough in the analytics world by fully utilization of its enterprise data at length. If Hadoop itself claim that I am not a replacement of your current enterprise datawarehouse then why you shouldn’t first fully grind the existing EDW data and then look at Hadoop opportunities to give an edge to your enterprise competency.

Posted in Big Data | Tagged: , | 2 Comments »

Technology Shift: In Multi-Channel and Agile Commerce

Posted by datumengineering on February 21, 2012

It needs a technology shift:

When you move to Multi-Channel and eventually progress towards Agile Commerce

Today marketing and advertisement have opened many channels other than traditional channels. It includes: telemarketing, print catalogue, Kiosk and ecommerce.  However contribution of e-commerce is huge in comparison to other channels. It opens numerous feeds to the organization, like:

Campaign data (email, A/B test analysis)

Browse behavior on e-commerce site (Web logs)

Mobile BI

Call Centre voice data converted in textual

Data from search engines.

Social Media like Facebook, Twitter.

Customer experience, recommendation.

Multichannel purchase history etc.

Most importantly customer sentiment analysis will give an edge to marketing strategy.

So, effective and efficient utilization of multi-channel requires that we must tear down the walls what we have been building between different channels over the years.

Data from Multi-channel: When the channels are cosmic, data produced from these channels are utterly unstructured and gigantic. Web logs, Consumer Web interaction, Social media messages are few examples of highly unstructured data.  

Business still needs to analyze this data: Though it is unstructured but it had proven to be more meaningful in terms of trend analysis and consumer preference and sentiment, compared to the direct store data. 

Titanic and unstructured nature makes it unfeasible for analysis: Due to variety, volume and velocity of the data the task of transformation into relational database becomes vulnerable and analysis nearly impossible. So we need to add flavor of Big Data Analytics in traditional DW.


Strategy to Big Data Analytics

We neither need to cleanse this data nor need to bring it into relational database. We do not even need to wait until it gets processed through series of transformations. Because until that time information will lose its flavor of real time.  However we can still store it quickly as it comes and can be easily accessed using:

  • √ė HDFS (Hadoop Distributed File System) & Map Reduce Framework.
  • √ė Real time frequently updated unstructured data storage in HBASE
  • √ė Quicker access using HIVE datawarehouse on Hadoop.

And, all these processing can be establish and distributed on company Private CLOUD.  Predominantly, this unstructured data analysis would be an excellent support to existing datawarehouse.

HDFS (Hadoop Distributed File System):

Hadoop is designed for distributed parallel processing. In Hadoop programming patterns data is record oriented and the files are spread across distributed file system as a chunk, each compute process running on nodeworks on subset of data. So instead of moving whole data across the network for processing, Hadoop actually moves the computation to the data. Hadoop uses Map Reduce programming model to process and generate large dataset. Program written in this functional style will automatically run in parallel on large cluster of commodity machine.  Map-Reduce function takes care of functionality of data processing. However power of parallelism and fault tolerances are under cover in libraries (C++ libraries). These libraries need to include when writing Map & Reduce program.  When running large Map Reduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.


Hadoop file system with Map Reduce functionality forms an underlying foundation to process huge amount ofdata. Data management and analytics need more than just a building an underlying file storage mechanism.It requires data to be organized in tables so it can be easily accessible without writing complex lines of code.  Moreover we all are more comfortable with database operation than file operation. So, abstract layer is required to simplify scattered Big Data.  HBase and Hive is the answer for this.

HBASE: HBASE is aimed to hold billions of rows and millions of columns in a big table like structure. HBASE is column oriented distributed storage build on top of Hadoop.  It is a NoSQL database where data is stored in form of Hash Table. So the data is sparse, distributed and in sorted Map.  Tables are sorted by rows. Tables are made of rows and columns. All columns in HBase belong to a particular column family. Data model for HBase is little different than conventional RDBMS.

So, HBASE is :

1.Column oriented distributed storage build on top of Hadoop layered over HDFS.
2.NoSQL database where data is stored in form of Hash Table.
3.Distributed on many servers and tolerant of machine failure. 
4.All columns in HBase belong to a particular column family.
It is not a :
  • Relational database.
  • Support to data join.

HIVE:  It is a system for managing and querying structured data build on top of Hadoop.  Aim of HIVE is to give familiar SQL like interface to data stored in Hadoop framework. Underline it still uses Map Reduce programming model for extracting data from HDFS.  It also gives ETL flexibility on the data stored in Hive tables.  All the metadata can be stored in RDBMS. 

1.Aim of HIVE is to give familiar SQL like interface to data stored in Hadoop framework.
2.Underline it still uses Map Reduce programming model for extracting data from HDFS. 
3.It also gives ETL flexibility on the data stored in Hive tables. 
4.All the metadata can be stored in RDBMS. 

JDBC/ODBC drivers allow 3rd party applications to pull Hive data for reporting.

HBase and Hive has its own flavor of benefits.

Hive is more interactive in terms of SQL queries and metadata in RDBMS. But it has limited use as read-intensive data; updating data in Hive is a costly operation. Because here update means create another copy of existing data.

 HBase can step in here and can give a functionality of high row level updates. It sidesteps Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on data distribution changes.

Marrying HBase with Hive can spell out a near real time datawarehouse on Hadoop echo system with simplicity of SQL like interface from Hive tables and keeping near time replica in HBase tables.


Important Note:

  • All structured data can continue to be analyzed with Enterprise Data Warehouse. Hadoop will play on unstructured data.
  • Hadoop can used as ‚ÄúData Bag‚ÄĚ for EDW.¬†
  • Push Down Aggregation: All the¬† intensive, voluminous aggregation can be pushed to Hadoop.
  • Push Down ETL: All ETL complexity on Big Data can be implemented in Hadoop (Hive).

Value Add and ROI from Hadoop Data Analytics


HDFS, Hive and HBase is an open source. Investment merely require on skills than on tools and technologies. Nonetheless, this has been a biggest challenge in Hadoop related development.  Company has to define the strategy to invest in skills and continuous investigation on Hadoop platform.

$: РIndeed, x% of budget allocation for Hadoop development and some additional investment may require for plug-in from Hive/HBase to Teradata for third party. 

Posted in Big Data | Tagged: | Leave a Comment »

Hadoop Recommendation

Posted by datumengineering on February 9, 2012

  • All structured data can continue to be analyzed with Enterprise Data Warehouse. Hadoop will play on unstructured data.
  • Hadoop can used as ‚ÄúData Bag‚ÄĚ for EDW.
  • Push Down Aggregation: All the¬† intensive, voluminous aggregation can be pushed to Hadoop.
  • Push Down ETL: All ETL complexity on Big Data can be implemented in Hadoop (Hive).

Posted in Big Data | Tagged: | Leave a Comment »