Datum Engineering !

An engineered artwork to make decisions.

Archive for the ‘Big Data’ Category

Big Data Analytics: From Ugly Duckling to Beautiful Swan

Posted by datumengineering on January 31, 2016

Recently, I came across an interesting book on statistics that retells the Ugly Duckling story and relates it to today's DATA, or rather BIG DATA ANALYTICS, world. The story originally comes from the famous storyteller Hans Christian Andersen.

The story goes like this…

The duckling was a big ugly grey bird, so ugly that even a dog would not bite him. The poor duckling was ridiculed, ostracized and pecked by the other ducks. Eventually, it became too much for him and he flew to the swans, the royal birds, hoping that they would end his misery by killing him because he was so ugly. As he stared into the water, though, he saw not an ugly grey bird but a beautiful swan.

Data are much the same. Sometimes they’re just big, grey and ugly and don’t do any of the things that they’re supposed to do. When we get data like these, we swear at them, curse them, peck them and hope that they’ll fly away and be killed by the swans.

Alternatively, we can try to force our data into becoming beautiful swans.

Let me relate the above narration to the data analysis solution in two ways:

1. Build a process that exposes the data's potential to become a beautiful swan.

2. Every dataset needs a set of assumptions and hypotheses to be tested before it dies as an ugly duckling.

The process of exposing the potential of the data is vast, from data sourcing, wrangling and cleansing to Exploratory Data Analysis (EDA) and further detailed analysis. These steps should be an integral part of any data product. Though these processes have existed for years in most data analysis systems and projects, in recent years they have been extended and integrated with external datasets. This external data builds an ecosystem (a support system) around your data to prove its value. For example, you may want to take your customer data to a level where it not only shows a 360-degree view but also starts revealing customer patterns and responses in relation to external systems. Location plays an important role in this process: spatial mapping, where customers can be joined with their surroundings. There are various tools that can help you achieve this spatial mapping, from Java GIS libraries to R spatial libraries; read the Spatial Analysis in R post on the DominoData Lab blog.

Once you set the mapping right with external datasets, there are various tools available for wrangling. Eventually you cleanse the data and do the EDA on this broader dataset, and it becomes a customer view with a much wider spectrum of external data: geo-location, economy, GPS sensors and so on. With this, you can start analyzing customers by segments that you have never captured within your own systems.
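To make the spatial-mapping idea concrete, here is a minimal sketch in Python; the post itself points at Java GIS and R spatial libraries, so this is just an equivalent route, assuming the geopandas package and made-up file and column names.

# A minimal sketch of spatial mapping in Python (geopandas assumed installed);
# file names and column names here are hypothetical placeholders.
import geopandas as gpd

# Customer records with point geometries, e.g. exported from a CRM
customers = gpd.read_file("customers.geojson")            # hypothetical file
# External dataset: region polygons carrying economic indicators
regions = gpd.read_file("regions_with_economy.geojson")   # hypothetical file

# Spatially join each customer point to the region it falls in
enriched = gpd.sjoin(customers, regions, how="left")

# The customer data now carries the surrounding region's attributes,
# ready for segment-level EDA (e.g. average spend by income band)
print(enriched.groupby("income_band")["annual_spend"].mean())  # hypothetical columns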


Spatial mapping and analysis are not the only options; there are many more external data elements that can extend your data-building process to a much broader range of variables for analysis. With effective (rather, smart) use of these data linkages, you can start converting any ugly duckling into a meaningful swan.

Let us look at the second part of the solution: building assumptions and hypotheses. Given any data duckling, you should start by assessing how much of an ugly duckling of a dataset you have, and discovering how to turn it into a swan. This is more a statistical solution of proving and probing for the duckling than the engineering solution explained above. When assumptions are broken, we stop being able to draw accurate conclusions about reality. Different statistical models assume different things, and if these models are going to reflect reality accurately then these assumptions need to be true. This is a step-by-step process built around parametric tests, i.e. tests that require data from one of the large catalogue of distributions that statisticians have described. The assumptions that can be tested are:

Normally distributed data: The rationale behind hypothesis testing relies on having something that is normally distributed (in some cases it’s the sampling distribution, in others the errors in the model).
Homogeneity of variance: This assumption means that the variances should be the same throughout the data. In designs in which you test several groups of participants this assumption means that each of these samples comes from populations with the same variance. In correlational designs, this assumption means that the variance of one variable should be stable at all levels of the other variable.
Interval data: Data should be measured at least at the interval level. This assumption is tested by common sense.
Independence: This assumption, like that of normality, is different depending on the test you’re using. In some cases it means that data from different participants are independent, which means that the behavior of one participant does not influence the behavior of another.

Just as there is broad tool support for data collection, there are various tools that can help you test these assumptions not only numerically but visually too, e.g. ggplot2, pastecs and psych.
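If you prefer Python over R, a quick sketch of checking two of these assumptions with scipy.stats might look like the following; the data here is simulated, and the post itself uses R packages.

# Testing normality and homogeneity of variance with scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=10, size=200)   # simulated sample
group_b = rng.normal(loc=55, scale=10, size=200)

# Normality: Shapiro-Wilk (null hypothesis = data are normally distributed)
w_stat, p_norm = stats.shapiro(group_a)
print(f"Shapiro-Wilk p-value: {p_norm:.3f}")

# Homogeneity of variance: Levene's test across the two groups
l_stat, p_var = stats.levene(group_a, group_b)
print(f"Levene p-value: {p_var:.3f}")

# Large p-values mean there is no evidence the assumption is broken.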

So, jump straight into the data with either of these approaches (or both), and for sure you can take any duckling on a journey to becoming a beautiful swan. That's actually the start of science, eventually developing a process of learning. And then build a process that learns by itself, so that whenever a new bird comes along it predicts whether it will become a swan or remain a duckling forever 🙂

 

 

 


Posted in Big Data, Data Analysis, Data Science, R, Statistical Model | Leave a Comment »

Where & Why Do You Keep Big Data & Hadoop?

Posted by datumengineering on December 13, 2015

I am back! Yes, I am back on my learning track. Sometimes it is really necessary to take a break and introspect on why we learn, before learning. Ah! It was a nine-month safe refuge to learn how Big Data & Analytics can contribute to a Data Product.

DataLake

Data strategy has always been expected to generate revenue. As Big Data and Hadoop enter the enterprise data strategy, the big data infrastructure is also expected to add revenue. This is a tough expectation for a new entrant (Hadoop) when the established candidate (data warehouse & BI) itself mostly struggles for its existence. So it is very pertinent for solution architects to ask WHERE and WHY to bring big data (obviously Hadoop) into the data strategy. And the safe option for this new entrant is the place where it supports and strengthens the EXISTING data analysis strategy. Yeah! That's the DATA LAKE.

Hopefully you have already worked out three of the Five Ws of information gathering by now (What: Data Lake, Who: Solution Architect, Where: Enterprise Data Strategy). Now look at the diagram depicting WHERE and WHY.

Precisely, there are 3 major areas of opportunity for the new entrant (Hadoop):

  1. Semi-structured and/or unstructured data ingestion.
  2. Push down bleeding data integration problems to Hadoop Engine.
  3. Business need to build comprehensive analytical data stores.

The absence of any one of these 3 needs would weaken the case for Hadoop to enter the existing enterprise strategy. And this data lake approach is believed to align with business analysis outcomes without much disruption, so it also creates a comfortable path into the enterprise. We can further dig into data lake architecture and implementation strategy in detail.

Moreover, there are lots of other supporting systems brewing in parallel with the Hadoop ecosystem, such as Apache Kylin… the opportunities on the data lake are immense.

Posted in Big Data, DataLake, Hadoop | 1 Comment »

ETL, ELT and Data Hub: Where is Hadoop the right fit?

Posted by datumengineering on November 17, 2013

A few days back I attended a good webinar conducted by Metascale on the topic “Are You Still Moving Data? Is ETL Still Relevant in the Era of Hadoop?” This post is about that webinar.

In summary, the webinar explained nicely how an enterprise can use Hadoop as a data hub alongside the existing data warehouse setup. The phrase “Hadoop as a Data Hub” itself raised a lot of questions in my mind:

  1. When we project Hadoop as a data hub and at the same time maintain the data warehouse as another (conventional) data repository for the enterprise, won't that create another platform in silos? The presenter in the webcast repeatedly talked about keeping the existing data warehouse intact while developing Hadoop as a data hub. Difficult to digest 😦
  2. The next question that arises: the challenges of the Hadoop environment as a Master Data Management and data governance platform. I don't think the Hadoop ecosystem is mature enough to handle MDM complexity smoothly. As far as data governance is concerned, the Hadoop ecosystem lacks the applications that are required on top of Hadoop for robust governance.
  3. Why put so much energy into building compatibility between ETL tools like Informatica and HDFS, just to connect the existing ETL infrastructure with big data? I feel this is a crazy idea, because you are selling a cost-effective solution with some cost hidden under the cover. Obviously, Informatica will not give you a Hadoop connector for free. And there are many questions beyond cost, such as performance and where the business logic is staged.
  4. There is also a big bet on Hadoop replacing the existing ETL/ELT framework by pushing transformation to Hadoop, given its MapReduce framework. I partially got this idea long back, but I am still not convinced when:
  • Your use case doesn't support the MapReduce framework during ETL.

  • You process a relatively small amount of data using Hadoop; Hadoop is not meant for this and takes longer than it should.

  • You join some information with the existing data warehouse and unnecessarily duplicate the information in HDFS as well as in the conventional RDBMS.

Now, having these questions in place doesn't mean Hadoop can't be projected as a replacement for, or amendment to, the existing data warehouse strategy. On the contrary, I can see some different possibilities and ways for Hadoop to sneak into the existing enterprise data architecture. Here are my two cents:

  1. Identify the areas where data is written once and read many times. Most importantly, identify the nature of the reads. Ask yourself whether the reads are straightforward, joined or aggregated. All such situations, in the case of BIG DATA, can be handled efficiently on HDFS. If you don't know in advance, data profiling and capacity planning will be the decision makers for whether the data should go to your RDBMS or to HDFS. Remember, however, that if your queries are more ad hoc and you are planning to move them to HDFS, you need skills beyond Hive and Pig.
  2. Use Hadoop's HDFS features more than MapReduce; I mean distributed storage, to minimize the effort of backup and data replication. This will be cost-effective in comparison to DBA costs. For example, archive data on HDFS rather than tape drives, so your data never retires from analysis. Use MapReduce intelligently, whenever you can see the opportunity to break your calculations down into different parts, i.e. MAPs.
  3. Identify the data that is small enough to fit into the distributed cache in HDFS; only that should enter HDFS, while the rest of the small (not BIG) data can stay in the RDBMS. Again, capacity planning is the major player here.
  4. Now it comes to ETL: I am really happy to see Hadoop and HDFS here, but not with Informatica, DataStage or any other ETL tool (I don't know much about Pentaho). I must appreciate and support the Metascale webinar: they have given the right approach, taking Hadoop as an Extract, Load and Transform framework. Yes, this is the only way to do transformation right on Hadoop; let's rename it DATA INTEGRATION. The moment you start thinking about ETL tools, you are taking your data out of Hadoop, and the moment you take data out, you are going against the whole purpose of using Hadoop as a data processing platform. Isn't that killing the idea of doing transformation on Hadoop using MapReduce, and also the idea of bringing your overall cost down? However, I'll be open to learning the right logic for using Informatica or any other ETL tool on top of Hadoop.

I think the effort to bring Hadoop into the enterprise requires diligent changes to the data warehouse reference architecture. We are going to change a lot in our reference architecture when we bring Hadoop into the enterprise.

So, it is not enough to learn WHERE HADOOP IS THE RIGHT FIT; it is far more important to understand HOW HADOOP IS THE RIGHT FIT.

Posted in Big Data, Hadoop | Leave a Comment »

An indispensable Python: Data sourcing to Data science.

Posted by datumengineering on August 27, 2013

The data analysis ecosystem has grown all the way from SQL to NoSQL, and from Excel analysis to visualization. Today we are short of resources to process ALL (you know what I mean by ALL) kinds of data coming into the enterprise. Data goes through profiling, formatting, munging or cleansing, pruning and transformation steps on its way to analytics and predictive modeling. Interestingly, no one tool has proved to be an effective solution for running all of these operations { don't forget the cost factor here 🙂 }. Things become challenging when we mature from aggregated/summarized analysis to data mining, mathematical modeling, statistical modeling and predictive modeling. A pinch of complication is added by Agile implementation.

Enterprises have to work out a solution that helps data analysis (rather, analytics) proceed in an Agile way over all the complex data structures, whether SQL or NoSQL, while supporting data mining activities.

So, let's look at Python and its ecosystem (I would prefer to call the Python libraries an echo system) and how it can cover the enterprise's a*s for data analysis.

Python: a functional, object-oriented programming language, and most importantly super easy to learn. Any home-grown programmer with minor knowledge of programming fundamentals can start on Python at any time. Python has a rich library framework; even the old guard can dare to start programming in Python. The following data structures and functions can be explored for implementing various mathematical algorithms like recommendation engines, collaborative filtering, K-means clustering and Support Vector Machines (a small sketch follows the list below).

  • Dictionary.
  • Lists.
  • String.
  • Sets.
  • Map(), Reduce().
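As a toy illustration of how far these plain structures go, here is a small, hypothetical sketch that uses dict, set, list, map() and reduce() to score user similarity for a naive recommender; all the data is made up.

# A toy sketch using the structures above (dict, set, list, map, reduce) to
# score item-overlap similarity for a naive recommender.
from functools import reduce

# Each user's purchase history as a dictionary of sets
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob":   {"laptop", "monitor"},
    "carol": {"mouse", "keyboard", "headset"},
}

def jaccard(a, b):
    """Similarity of two users = overlap of their item sets."""
    return len(a & b) / len(a | b)

# map() builds (user, similarity-to-alice) pairs; reduce() picks the best match
others = [u for u in purchases if u != "alice"]
scores = list(map(lambda u: (u, jaccard(purchases["alice"], purchases[u])), others))
best = reduce(lambda x, y: x if x[1] >= y[1] else y, scores)
print(best)  # ('carol', 0.5) -- the most similar user to alice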

Python Echo System for Data Science:

Let's begin with sourcing data, bringing it into a dataset format, and shaping it.

{ Pandas: Data loading, Cleansing, Summarization, Joining, Time Series Analysis }

Pandas: data analysis wrapped up in a Python library. It has most of the things you look for to run quick analysis. DataFrames, join, merge and group-by are the built-ins available to run SQL-like analysis on data coming in CSV files (the read_csv function). To install pandas you need to have NumPy installed first.
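A minimal sketch of that SQL-like workflow in pandas might look like this; the file names and columns are hypothetical.

# Load, join, group and resample with pandas (hypothetical CSV files).
import pandas as pd

orders = pd.read_csv("orders.csv")        # columns: order_id, customer_id, amount, order_date
customers = pd.read_csv("customers.csv")  # columns: customer_id, segment

# Join (merge), then group by, as you would in SQL
joined = orders.merge(customers, on="customer_id", how="inner")
summary = joined.groupby("segment")["amount"].agg(["count", "sum", "mean"])
print(summary)

# Quick time-series view: monthly order totals
orders["order_date"] = pd.to_datetime(orders["order_date"])
monthly = orders.set_index("order_date")["amount"].resample("M").sum()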

{ NumPy: Data Array, Vectorization, matrix and Linear algebra operations i.e. mathematical modeling }

NumPy: a rich set of functions for array, matrix and vector operations. Indexing, slicing and stacking are prominent NumPy functionality.
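Here is a short illustration of the indexing, slicing, stacking and linear algebra operations mentioned above.

# A few basic NumPy array and linear algebra operations.
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2x2 matrix
v = np.array([10.0, 20.0])               # vector

print(x[:, 1])             # slicing: second column -> [2. 4.]
print(x @ v)               # matrix-vector product -> [50. 110.]
print(np.vstack([x, v]))   # stacking a row onto the matrix
print(np.linalg.inv(x))    # linear algebra: matrix inverse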

{ Scipy:  Mean, variance, skewness, kurtosis }

SciPy: used to run scientific analysis on the data. The statistics functions are located in the sub-package scipy.stats.
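For example, scipy.stats.describe returns the mean, variance, skewness and kurtosis in one call (the data below is synthetic).

# Descriptive statistics with scipy.stats on a synthetic sample.
import numpy as np
from scipy import stats

data = np.random.default_rng(0).exponential(scale=2.0, size=1000)

desc = stats.describe(data)
print("mean:    ", desc.mean)
print("variance:", desc.variance)
print("skewness:", desc.skewness)
print("kurtosis:", desc.kurtosis)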

{ Matplotlib: Graph, histograms, power spectra, bar charts, errorcharts, scatterplots }

Matplotlib: a 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
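A small sketch of an EDA figure, a histogram next to a scatter plot, saved to a hypothetical output file.

# Histogram and scatter plot with matplotlib on synthetic data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=30)
ax1.set_title("Histogram")
ax2.scatter(x, y, s=5)
ax2.set_title("Scatter plot")
fig.savefig("eda_plots.png", dpi=150)   # hypothetical output file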

Moreover, Python's support extends to big data analytics and machine learning. The resources below can be utilized for various big data applications:

  • Lightweight Map-Reduce implementation written in Python: Octopy
  • Hbase interaction using python: happybase
  • Machine learning algorithm implementations in Python: scikit-learn, which is built on NumPy, SciPy and matplotlib (see the sketch after this list).
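As a taste of scikit-learn, here is a tiny K-means sketch on synthetic data, K-means being one of the algorithms mentioned earlier in the post.

# K-means clustering with scikit-learn on two synthetic blobs of points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
points = np.vstack([rng.normal(0, 1, size=(100, 2)),
                    rng.normal(5, 1, size=(100, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.cluster_centers_)   # roughly [0, 0] and [5, 5]
print(model.labels_[:5])        # cluster assignment per point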

Having said that, Python is capable enough to provide a way to implement data analysis algorithms and hence to build your own data analysis framework.

Watch this space for implementations of various algorithms in Python under one umbrella, i.e. Python data analysis tools.

Posted in Big Data, Data Analysis, Predictive Model, Python, Statistical Model | Tagged: , , , | Leave a Comment »

Big Data? How do you run capacity planning?

Posted by datumengineering on February 15, 2013

Most data warehouse folks are well accustomed to the term "Capacity Planning" (read Inmon). It is a widely used process for DBAs and data warehouse architects. In a typical data management and warehouse project, a wide variety of people are involved in driving capacity planning: everyone from the business analyst to the architect, the developer, the DBA and, finally, the data modeler.

This practice has had a wide audience in the typical data warehouse world, but how is it driven in Big Data? I have hardly heard any noise around it in Hadoop-driven projects that started with the intention of handling growing data. I have met the pain bearers, DBAs and architects, who face challenges at all stages of data management when data outgrows their platforms. They are the main players who advocate bringing in Hadoop ASAP. The crux of their problem is not the growing data; the problem is that they don't have a mathematical calculation that describes the growth rate. All we talk about is: by what percentage is it growing? Most of the time that percentage also comes from experience 🙂

Capacity planning should be explored as more than just calculating a percentage from experience.

  1. It should be a more mathematical calculation of every byte of the data sources coming into the system.
  2. How about designing a predictive model which will forecast my data growth with reasonable accuracy over the next 10 years?
  3. How about involving the business to confirm the data growth drivers and the feasibility of future data sources?
  4. Why not consider compression and purging in the calculation, to reclaim space for data to grow?
  5. Why do we consider only disk utilization, with no consideration of other hardware resources like memory, processors and cache? After all, it is a data processing environment.
  • I think this list of considerations can still grow…

I know that building robust capacity planning is not the task of a day or a month. One to two years of data is good enough to understand the trend and develop an algorithm around it. Treat those one to two years as the learning dataset, take some months of it as the training dataset, and start analyzing the trend and building the model that can predict growth in the third or fourth year, because as per the data warehouse gurus the bleeding starts after the fifth year.
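As a rough illustration of that idea, here is a back-of-the-envelope Python sketch that fits a growth trend to two years of made-up monthly volumes and extrapolates it out to year five; a real model would fold in compression ratios, purge and retention policies, and business-confirmed growth drivers.

# Fit a simple growth trend to 24 months of observed volume and extrapolate.
# The volumes below are made-up numbers, not real measurements.
import numpy as np

months = np.arange(24)                      # 2 years of history
volume_tb = 5.0 * np.exp(0.03 * months)     # observed TB per month (synthetic)

# Fit log(volume) = a*month + b, i.e. assume roughly exponential growth
a, b = np.polyfit(months, np.log(volume_tb), deg=1)

future = np.arange(24, 60)                  # months out to year 5
forecast_tb = np.exp(a * future + b)
print(f"Projected monthly volume at year 5: {forecast_tb[-1]:.1f} TB")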

I'll leave it up to you to design the solution and process for capacity planning to claim your DATA as BIG DATA.

Remember, disk space is cheap but not the disk seek.

Posted in Big Data, Hadoop | Tagged: , | 2 Comments »

Data analysis drivers

Posted by datumengineering on February 11, 2013

I have been exploring data analysis and modeling techniques for months. There are lots of topics floating around in the data analysis space, like statistical modeling and predictive modeling. There have always been questions in my mind: which technique to choose? Which is the preferred way to do data analysis? Some articles and lectures highlight machine learning or mathematical models over the limitations of statistical modeling. They present mathematical modeling as the next step in accuracy and prediction. Articles like this create more questions in the mind of a naive user.

Finally, I would like to thank coursera.org for helping zero in on this confusion and stating a clear picture of the data analysis drivers. Now things are pretty clear in terms of how to proceed on data analysis, or rather, how to define the "DATA ANALYSIS DRIVERS". The one-line answer is simple: "Define a question or problem". So everything depends upon how you define the problem.

To start with the data analysis drivers, here are the steps in a data analysis:

  1. Define the question
  2. Define the ideal data set
  3. Determine what data you can access
  4. Obtain the data
  5. Clean the data
  6. Exploratory data analysis
  7. Statistical prediction/modeling
  8. Interpret results
  9. Challenge results
  10. Synthesize/write up results
  11. Create reproducible code
  • Defining the question means understanding how the business problem is stated and how you proceed to tell the story of this problem. Storytelling about the problem will take you to structuring the solution, so you should be good at telling the story of the problem statement.
  • Defining the solution will help you prepare the data (the dataset) for the solution.
  • Profile the source to identify what data you can access.
  • The next step is cleansing the data.
  • Once the data is cleansed, it is in one of the following standards: txt, csv, xml/html, json or a database.
  • Based on the solution's needs we start building the model. Precisely, the solution will require descriptive analysis, inferential analysis or predictive analysis.

Hence, the dataset and model may depend on your goal:

  1. Descriptive – a whole population.
  2. Exploratory – a random sample with many variables measured.
  3. Inferential – the right population, randomly sampled.
  4. Predictive – a training and test data set from the same population.
  5. Causal – data from a randomized study.
  6. Mechanistic – data about all components of the system.

From here, knowledge of statistics, machine learning and mathematical algorithms comes into play 🙂

Posted in Big Data, Data Analysis | Tagged: , | 6 Comments »

Data flow: Web log analysis on a Hive-way

Posted by datumengineering on February 8, 2013

A data flow design to get insight into user behavior on a web site. The data flow explains the method of flattening all the elements in the web log that can support detailed user analysis and behavior.

Technology & Skills: Hadoop-Hive, HiveQL (+ Rich set of UDF in HiveQL) .

Infrastructure: Amazon Web Services (AWS).

Process -1 Moving data from the web server to Amazon Simple Storage Service (S3) to HDFS.


Process -2 Start an EC2 instance of type small to run the MapReduce job that parses the log file.

To run jobs on AWS we should have both EBS and EC2 instances running.

Process -3 Prepare Elastic MapReduce to run the jobs from the command line.

To run EMR from the command line we use an Amazon EMR credentials file, which simplifies job flow creation and authentication of requests. The credentials file provides information required by many commands and is a convenient place to store command parameters, so you don't have to repeatedly enter the information. The Amazon EMR CLI automatically looks for these credentials in the file credentials.json.

To install the Elastic MapReduce CLI:

1. Navigate to your elastic-mapreduce-cli directory.
2. Unzip the compressed file. Linux and UNIX users, from the command-line prompt, enter: $ unzip elastic-mapreduce-ruby.zip

Configuring credentials

The Elastic MapReduce credentials file can provide information required by many commands. It is convenient to store command parameters in the file to save you the trouble of repeatedly entering the information. Your credentials are used to calculate the signature value for every request you make. Elastic MapReduce automatically looks for your credentials in the file credentials.json, so it is convenient to edit credentials.json and include your AWS credentials there. An AWS key pair is a security credential, similar to a password, which you use to securely connect to your instance when it is running.

To create your credentials file:

1. Create a file named credentials.json in the elastic-mapreduce-cli/elastic-mapreduce-ruby directory.
2. Add the following lines to your credentials file:

{
"access_id": "[Your AWS Access Key ID]",
"private_key": "[Your AWS Secret Access Key]",
"keypair": "[Your key pair name]",
"key-pair-file": "[The path and name of your PEM file]",
"log_uri": "[A path to a bucket you own on Amazon S3, such as s3n://myloguri/]",
"region": "[The Region of your job flow, either us-east-1, us-west-2, us-west-1, eu-west-1, ap-northeast-1, ap-southeast-1, or sa-east-1]"
}

Note the name of the Region. You will use this Region to create your Amazon EC2 key pair and your Amazon S3 bucket.

Process -4 Prepare the Hive tables for data analysis. Create a landing table to load the log data.

We create a schema for tokenizing the string, so MAP and COLLECTION are used to build a key-value array.

CREATE TABLE logdata (
  C_2 STRING,
  C_3 MAP<STRING, STRING>,
  C_4 STRING,
  C_21 STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ' '
  COLLECTION ITEMS TERMINATED BY '73'
  MAP KEYS TERMINATED BY '='
STORED AS TEXTFILE;

Process -6 Load Hive landing table with log file data from HDFS.


LOAD DATA INPATH 'hdfs://10.130.86.181:9000/input/log.txt' OVERWRITE INTO TABLE `logdata`;

Process -7 Load Hive stage table from landing table.

This stage table will hold the data from landing. The stage table is used to load cleansed data without any junk characters (the log has some # comment lines, which we remove when loading into staging).


create table logdata_stg
comment 'log data' stored as sequencefile as
select * from logdata where C_0 not like '%#%';

Process -8 Load Hive final table from staging table.


This process creates a flattened structure of the complete log file in the final table, which is used throughout the analysis. This table is created with the actual column names identified in the log file. The final table load uses UDFs to parse the query string, host name and category tree in the browse data.

create table logdata_fnl
comment 'log data' stored as sequencefile as
-- the SELECT here uses the UDFs described above to parse the query string, host name and category tree

Read my previous post on Hive agility to go into detail on how Hive UDFs helped run this analysis efficiently using MAP and ASSOCIATIVE ARRAY.

Posted in Big Data, Hive | Tagged: , , | Leave a Comment »

Agility in Hive — Map & Array score for Hive

Posted by datumengineering on September 27, 2012

There is debate and comparison between Pig and Hive; there is a good post from @Larsgeorge which talks about Pig vs. Hive.

I am not expert enough to go into the details of that comparison, but here I want to explore some of the Hive features which give Hive an edge.

These features are MAP (associative array) and ARRAY. MAP gives you an alternative way to segregate your data into keys and values. So, if you have data like this:

clientid='xxxx234xx', category='electronics', timetaken='20/01/2000 10:20:20'.

then you can break it down into keys and values, where clientid, category and timetaken are the keys and the values are xxxx234xx, electronics and 20/01/2000 10:20:20. And how about not only converting them into keys and values, but also storing and retrieving them from a single column? When you define a MAP, Hive stores the complete map in a single column, like this:

COL_1

{"clientid"="xxxx234xx", "category"="electronics", "timetaken"="20/01/2000 10:20:20"}

To store like this you need to define the table like this:

Create table table1
(
  COL_1 MAP<STRING,STRING>
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '=';

Now retrieval is pretty easy: in your HiveQL you just say SELECT COL_1['category'] FROM table1, and you'll get electronics. Had MAP not been available, I would have ended up writing a complete parsing program just to store such a custom format in a table.

Similarly, ARRAY can be used to store a collection in a column. So you can have data like:

'xxxx1234yz';'/electronics/music-player/ipad/shuffle/';

Now you want to parse the complete hierarchy in the second column. It is easy in Hive to store it as an ARRAY. The definition would be:

Create table table1
(
  CUSTOMERID STRING,
  COL_1 ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
COLLECTION ITEMS TERMINATED BY '/';

Now retrieving the data is obvious: query the table with the collection index of the level you want to reach:

Select Col_1[1] from table1;

You may also have scenarios where you have a COLLECTION of MAPs. There you need to use both MAP and ARRAY together in the same table definition, along with the required delimiters for ARRAY and MAP.

So your table’s delimiter definition should look like this:

FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '/'
MAP KEYS TERMINATED BY '='

Posted in Big Data, Hive | Tagged: , , | 1 Comment »

PIG, generation's language: Simple Hadoop-PIG configuration, Register UDF.

Posted by datumengineering on June 26, 2012

I would consider Pig a step further, towards a 4th-generation language. Pig emerged as an ideal language for programmers: a data flow language in the Hadoop ecosystem. It has become a gap filler in the BIG data analytics world between two audiences, the ETL developer and the Java/Python programmer. Pig has some very powerful features which give it an edge for generations:

  • Bringing a schema-less approach to unstructured data.
  • Bringing programming support to data flow.

Pig brings ETL capabilities to big data without having the schema defined in advance; that is an indispensable quality. All these features together give the power of analytics in the cloud, backed by HDFS processing capabilities and the MapReduce programming model. Here we'll see, in simple steps, how we can use Pig as a data flow for analysis. Obviously, you should have Pig installed on your cluster.

Pig uses the Hadoop configuration for data flow processing. Below are the configuration steps; I would prefer to do this in /etc/bash.bashrc.

  • Point PIG to JAVA_HOME.
  • Set PIG_HOME to the core Pig script.
  • Set PIG_DIR to the bin directory.
  • Set PIG_CONF_DIR to the Hadoop configuration directory.
  • Finally, set PIG_CLASSPATH and add it to the CLASSPATH.

Here is the exact code for the above 5 steps:

JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export JAVA_HOME
PATH=$PATH:$JAVA_HOME
export PATH

HADOOP_HOME=/usr/lib/hadoop
export HADOOP_HOME
PATH=$PATH:$HADOOP_HOME
export PATH

PIG_HOME=/usr/lib/pig/bin
export PIG_HOME
PATH=$PATH:$PIG_HOME
export PATH

PIG_DIR=/usr/lib/pig
export PIG_DIR
PATH=$PATH:$PIG_DIR
export PATH

HADOOP_DIR=/usr/lib/hadoop
export HADOOP_DIR
PATH=$PATH:$HADOOP_DIR
export PATH

PIG_CLASSPATH=/home/manish/input
export PIG_CLASSPATH
PATH=$PATH:$PIG_CLASSPATH
export PATH

export CLASSPATH=$CLASSPATH:$HADOOP_DIR/hadoop-core.jar:$HADOOP_DIR/hadoop-tools.jar:$HADOOP_DIR/hadoop-ant.jar:$HADOOP_DIR/lib/commons-logging-1.0.4.jar:$PIG_DIR/pig-core.jar

export PIG_CONF_DIR=/usr/lib/hadoop/conf

Now, if you have written a UDF, first register it and then define the function.

REGISTER /path/to/input/<jarfile>.jar;

define <function> org.apache.pig.piggybank.storage.<functionname>();

Now you have the UDF available to use throughout your script.

A = load '/path/to/inputfile' using org.apache.pig.piggybank.storage.<functionname>() as (variable:<datatype>);

Life becomes easy once we have UDFs available to use. You just need a basic understanding of SQL functionality to perform data flow operations in the Pig scripting language.

The next write-up in this Pig series will be on ANALYTICS: how Pig creates touch points for data flow as well as analytics.

Posted in Big Data, Hadoop, PIG | Tagged: , , | Leave a Comment »

BiG DaTa & Vectorization

Posted by datumengineering on May 14, 2012

It has been a while since big data entered the market and created a buzz in the analytics world. Nowadays all the analytics leaders are chanting about big data applications. Since I started with Hadoop technologies and with machine learning, one question has been bugging my mind:

Which is the greater innovation: Big Data, or Machine Learning & Vectorization?

When it comes to analytics, vectorization and machine learning are the more innovative. Wait a minute: I don't want to be biased, and I am not concluding anything here. But I would like to focus on the direction we take when we bring data into the analytics world. We have structured data, we have enterprise data, we have data which is still measurable and suffices for analytical and advanced analytical needs. But how many business analytics teams use it smartly to make predictions? How many have applied different statistical algorithms to benefit from this data? How many times has the available data been utilized to its potential? I guess only in 20% of cases.

When we are still not fully utilizing structured, measurable data, why are we chasing so hard after unstructured, monster data? In fact, this big data needs more work than enterprise data. I don't advocate reaching saturation first and only then thinking about innovation or going out of the box, NO. My emphasis is on the best utilization of the existing enterprise data, while keeping innovation alive by experimenting with the possible options to explore the data that is unexplored or unfeasible through conservative technologies. Innovation doesn't mean just thinking up and doing new things. Innovation is more meaningful when you do something meaningful for the world which other people acknowledge but say is "not feasible".

I am not in favor of anyone here. I come from a world where I see data processing challenges, data storage challenges, data aggregation challenges, and a lot of challenges during sorting and searching. That is where I would look at Hadoop-related technologies. The query processing power that Hive provides and the data storage and manipulation power that HBase provides are indeed way beyond the other RDBMSs, and the power of MapReduce is exemplary. But all these big data technologies should enter an enterprise that is already mature in the analytics world through full utilization of its enterprise data. If Hadoop itself claims that it is not a replacement for your current enterprise data warehouse, why shouldn't you first fully grind the existing EDW data and then look at Hadoop opportunities to give your enterprise competency an edge?

Posted in Big Data | Tagged: , | 2 Comments »