Datum Engineering !

An engineered artwork to make decisions..

Archive for the ‘Data Science’ Category

Big Data Analytics: From Ugly Duckling to Beautiful Swan

Posted by datumengineering on January 31, 2016

Recently, I came across an interesting statistics book that narrates the Ugly Duckling story and correlates it with today’s data, or rather Big Data analytics, world. The story is originally by the famous storyteller Hans Christian Andersen.

The story goes like this…

The duckling was a big ugly grey bird, so ugly that even a dog would not bite him. The poor duckling was ridiculed, ostracized and pecked by the other ducks. Eventually, it became too much for him and he flew to the swans, the royal birds, hoping that they would end his misery by killing him because he was so ugly. As he stared into the water, though, he saw not an ugly grey bird but a beautiful swan.

Data are much the same. Sometimes they’re just big, grey and ugly and don’t do any of the things that they’re supposed to do. When we get data like these, we swear at them, curse them, peck them and hope that they’ll fly away and be killed by the swans.

Alternatively, we can try to force our data into becoming beautiful swans.

Let me correlate the above narration with the data analysis solution, in 2 ways:

1. Build the process to expose the potential of the data to become a beautiful swan.

2. Every data set needs a set of assumptions and hypotheses to be tested before it dies as an ugly duckling.

The process of exposing the potential of the data is vast, running from data sourcing, wrangling and cleansing to Exploratory Data Analysis (EDA) and further detailed analysis. These steps should be an integral part of any data product. Although these processes have been part of most data analysis systems and projects for years, in recent years they have been extended and integrated with external datasets. This external data builds an ecosystem (a support system) around your data to prove its value. For example, you may want to expose your customer data to a level where it not only shows a 360-degree view but also starts revealing customer patterns and responses in relation to external systems. Location plays an important role in this whole process: through spatial mapping, customers can be joined with their surroundings. Various tools can help you achieve this spatial mapping, from Java GIS libraries to R spatial libraries. See “Spatial Analysis in R” on the Domino Data Lab blog.

Once the mapping with external datasets is set up correctly, various tools are available for wrangling. You then cleanse the data and do the EDA on this broader dataset, so that the customer view covers a much broader spectrum of external data: geolocation, economy, GPS sensors, etc. With this, you can start analyzing customers by segments that you have never captured within your own systems. In short, something like this..


This is not limited to spatial mapping and analysis: many more external data elements can help extend your data building process to a much broader range of variables for analysis. With an effective (or rather, smart) use of these data linkages you can start converting any ugly duckling into a meaningful swan.
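To make the idea of joining customers with their surroundings concrete, here is a minimal Python sketch of a nearest-neighbour spatial join. It assumes both the customer data and the external dataset carry latitude/longitude; the names `customers`, `regions` and `median_income`, and all the values, are made up purely for illustration. A real project would more likely use a GIS library of the kind mentioned above.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical internal data: customers with a known location.
customers = [
    {"id": 1, "lat": 40.71, "lon": -74.00},
    {"id": 2, "lat": 34.05, "lon": -118.24},
]

# Hypothetical external data: reference points carrying an economic attribute.
regions = [
    {"region": "East", "lat": 40.73, "lon": -73.99, "median_income": 70000},
    {"region": "West", "lat": 34.00, "lon": -118.20, "median_income": 65000},
]

def enrich(customers, regions):
    """Attach the attributes of the nearest external reference point to each customer."""
    enriched = []
    for c in customers:
        nearest = min(
            regions,
            key=lambda r: haversine_km(c["lat"], c["lon"], r["lat"], r["lon"]),
        )
        enriched.append(
            {**c, "region": nearest["region"], "median_income": nearest["median_income"]}
        )
    return enriched

enriched = enrich(customers, regions)
for row in enriched:
    print(row)
```

After the join, each customer row carries variables (here a region label and an income figure) that never existed in the internal system, which is what makes new segmentations possible.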

Let us look at the second part of the solution: building assumptions and hypotheses. Given any data duckling, you should start by assessing how much of an ugly duckling of a data set you have, and discovering how to turn it into a swan. This is more of a statistical solution for converting (proving and probing) the duckling than the engineering solution explained earlier. When assumptions are broken, we stop being able to draw accurate conclusions about reality. Different statistical models assume different things, and if these models are going to reflect reality accurately then these assumptions need to be true. This is a step-by-step process built around parametric tests, i.e. tests that require data from one of the large catalogue of distributions that statisticians have described. The assumptions that can be tested are:

Normally distributed data: The rationale behind hypothesis testing relies on having something that is normally distributed (in some cases it’s the sampling distribution, in others the errors in the model).
Homogeneity of variance: This assumption means that the variances should be the same throughout the data. In designs in which you test several groups of participants this assumption means that each of these samples comes from populations with the same variance. In correlational designs, this assumption means that the variance of one variable should be stable at all levels of the other variable.
Interval data: Data should be measured at least at the interval level. This assumption is tested by common sense.
Independence: This assumption, like that of normality, is different depending on the test you’re using. In some cases it means that data from different participants are independent, which means that the behavior of one participant does not influence the behavior of another.

Just as there is vast tool support for data collection, there are various tools that can help you test hypotheses not only numerically but visually too, e.g. the R packages ggplot2, pastecs and psych.
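As a rough numeric illustration of the first two assumptions, the following Python sketch computes skewness and excess kurtosis (both are 0 for a perfectly normal distribution) and a simple largest-to-smallest variance ratio across groups. It uses only the standard library, and the sample values are invented for the example. These are descriptive checks only, not formal significance tests, so treat large departures from 0 (or a ratio far above 1) as a prompt to look more closely rather than as a verdict.

```python
import statistics

def skewness(xs):
    """Skewness from population moments: 0 for symmetric data, > 0 for a long right tail."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def excess_kurtosis(xs):
    """Kurtosis from population moments minus 3, so a normal distribution scores 0."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4) - 3.0

def variance_ratio(*groups):
    """Largest sample variance divided by the smallest; 1 means identical variances."""
    variances = [statistics.variance(g) for g in groups]
    return max(variances) / min(variances)

# Invented example data: one symmetric sample, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print("skew (symmetric):   ", skewness(symmetric))
print("skew (right-skewed):", skewness(right_skewed))
print("excess kurtosis:    ", excess_kurtosis(symmetric))
print("variance ratio:     ", variance_ratio([1, 2, 3], [2, 4, 6]))
```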

So, jump straight into the data with either of these approaches (or both), and for sure you can take any duckling on a journey to becoming a beautiful swan. That is actually the start of the science: eventually developing a process of learning. Then build a process that learns by itself, so that whenever a new bird comes along it predicts whether it will become a swan or remain a duckling forever 🙂





Posted in Big Data, Data Analysis, Data Science, R, Statistical Model

Operational Data Science: excerpt from 2 great articles

Posted by datumengineering on December 12, 2013

The term “Data Science” has been evolving not only as a niche skill but as a niche process as well. It is interesting to study how Big Data analytics/Data Science/Analytics can be implemented efficiently in the enterprise. So, along with my usual study of analytics, viz. Big Data analytics, I have also been exploring methodologies for bringing “Data Science” into the mainstream of existing enterprise data analysis, which we conventionally know as “Datawarehouse & BI”. This excerpt is just a study of the Data Science workflow with respect to the enterprise, and it opens a forum for discussion on “Operational Data Science” (I am just tossing out this term; it could be named better!). Meanwhile, I must mention the articles that I followed throughout my learning on the operational side of Data Science. Both articles mentioned below are superb write-ups by data scientists, written during their research work, and they can prove to be a valuable gift for enterprises.

  1. Data Science Workflow: Overview & Challenges.
  2. Enterprise Data Analysis and Visualization: An Interview Study

While article #1 gives a fair idea of the complete data science workflow, it is better understood alongside article #2, which explains the workflow nicely and describes its challenges. The idea is to understand them together.

All studies of analytics implementation revolve around 5 basic areas: 1) Data Discovery 2) Data Preparation & Cleansing 3) Data Profiling 4) Modeling 5) Visualization.

Operationally, these 5 areas can be covered efficiently if the data scientist collaborates well with Data Engineers, Datawarehouse Architects and Data Analysts. It is the responsibility of a Data Scientist to run the show, from data discovery to communicating predictions to the business. I certainly don’t intend to define the role of a data scientist here (in fact, I am not even eligible to do so). My aim is to sum up the skill sets and identify their operational aspect. One of the important points to discuss here is ‘diversity of skills’.

Diversity is pretty important. A generalist is more valuable than a specialist. A specialist isn’t fluid enough. We look
for pretty broad skills and data passion. If you are passionate about it you’ll jump into whatever tool you need to. If
it’s in X, I’ll go jump in X.

It needs diversity of skill, from “business understanding of the data (data discovery)” to “hands-on programming in scripting or SQL” to “communicating effectively through the right visualization tool”. It is difficult for one person to diversify across all these areas and at the same time specialize in one. In such a complex environment we should look at the opportunity to bring Datawarehouse + Unstructured Data analysis + Predictive Analytics together. This opportunity is well detailed in article #2.

Most organisations work on their data in silos, and in the absence of an effective communication channel between the Datawarehouse and Analytics teams the whole effort of effective analysis goes awry. 80% of organisations divide the work of Datawarehouse, Advanced Analytics and Statistical Analysis among different teams, and these teams not only address different business problems but are also aligned with different architects. In my opinion this could be the main reason that kills the flavor of Data Science. Interestingly, during one of my assignments in retail data analysis, I observed that the datawarehouse team had been developed only to the maturity level of summarization and aggregation. I realized that this datawarehouse or data store world would end after delivering a bunch of reports and some dashboards. After that, they would be more interested in archiving the data. That’s the bottleneck!

To overcome this bottleneck we need either to bring analytics into the mainstream of the data processing layer or to develop a parallel workflow for it. Article #1 articulates the same and proposes the flow shown below. If you accept this figure (from article #1) as a data science workflow, then, unlike conventional data analysis, you need engineers with diverse skills working towards a common goal to deliver it. Observe it closely and figure out the business-oriented engineering team.

A team with diverse skills can work together for an enterprise to deliver this workflow:

  • Person(s) with strong business acumen with Visualization skills.
  • Data Integration Engineer(s) with ETL skills on structured and unstructured data
  • Statistician with R skills
  • Smart programmer with Predictive modeling skills.

Interestingly, there is no single right order of delivery among these people. That said, the person with a strong business background can work at both ends, i.e. in data discovery as well as in communicating the final result (whether as a prediction, a pattern or a summarization). Programmers, however, can work fairly independently in all areas of data preparation, data analysis and scripting to build datasets for modeling (in fact, this is the hardest area; read article #2). In the same way, the statistician can very well communicate the business results and what they reflect. After all these efforts, what is left is just a game of effective collaboration. That is easily visible in the figure below.

Responsibility Matrix


Moreover, along with the right collaboration channel there should be one or more data scientists who can watch over and architect the whole workflow, and who are always ready to design, code and test a prototype of the end product. So this whole Operational Data Science needs a collaborative team and one or more architects with diverse skills who are ready to make the statement below.

I’m not a DBA, but I’m good at SQL. I’m not a programmer but am good at programming. I’m not a statistician but I am good at applying statistical techniques!

Posted in Data Analysis, Data Science

Warm-up exercise before data science.

Posted by datumengineering on October 18, 2013

Practicing data science is indeed a long-term effort rather than a matter of learning a handful of skills. We ought to be academically good enough to take up this challenge. However, if you have come a long way since your academic days but still have the zeal and passion to extract the oil from the data and fill the data science skill gap, then here are some warm-up tips. The points below must be exercised before jumping into any data science or data mining problem:

  • Come out of “table-row-column” mode and start looking at a data set as a MATRIX and VECTOR.

Not all datasets are in the form of a data matrix. For instance, more complex datasets can be in the form of sequences, text, time-series, images, audio, video, and so on, which may need special techniques for analysis. However, in many cases even if the raw data is not a data matrix it can usually be transformed into that form via feature extraction. A practical example of feature extraction is explained in my last post on the scikit-learn library.
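As a small illustration of feature extraction, the sketch below turns a few short texts into a data matrix of word counts (a "bag of words"). It is a deliberately simple, standard-library-only version with invented sentences and naive whitespace tokenization; libraries such as scikit-learn provide fuller implementations of the same idea.

```python
from collections import Counter

# Invented raw data: free text, which is not a data matrix as it stands.
docs = [
    "the duckling was ugly",
    "the swan was beautiful",
    "ugly data becomes beautiful data",
]

def bag_of_words(docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix

vocab, X = bag_of_words(docs)
print(vocab)
for row in X:
    print(row)
```

Each row of `X` is now a vector of the same length, so the text can be handled with the same matrix and vector tools as any numeric dataset.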

  • The number of attributes defines the dimensionality of the data matrix. Keep the dimensionality in mind when you think of any matrix operation.
  • Each row may be considered as a d-dimensional column vector (all vectors are assumed to be column vectors by default). You must also understand the terms row space and column space.
  • Treating data instances and attributes as vectors, and the entire dataset as a matrix, enables one to apply both geometric and algebraic methods to aid in the data mining and analysis tasks. At the least, you must be aware of the unit vector, the identity matrix, etc.
  • Dust off your school learning about matrix manipulation, i.e. matrix addition, multiplication, transpose, inverse, etc. The same applies to some algebraic formulas, such as the distance between two points and the Pythagorean theorem (Pythagoras’ theorem).
  • A thorough understanding of matrix manipulation will help you implement multiplication and summation of elements.
  • Skipping probability is probably not a good idea. Run through some short probability problems and exercises before you go into the details of any supervised learning model.
  • You may need to practice the topics that you might have skipped in school, such as orthogonal projection of a vector (projecting one vector onto another), the probabilistic view of the data, and the probability density function. (I admit to avoiding these topics during graduation 🙂 )
  • Refresh yourself on all the formulas of descriptive statistical analysis, from mean, median and mode to the normal distribution and skewness, and most importantly don’t forget to cover variance and standard deviation. You should be ready with basic statistical analysis of univariate and multivariate numeric data. Believe me, the right distance measure changes with the nature of the data. (For example, a Euclidean distance score is sensitive to the scale and offset of the values, whereas a Pearson correlation score is not, so the latter is often preferred when the data are not well normalized.)
  • Generalization, correlation and regression concepts are widely used across statistics and mathematical modeling, so they must be broadly rehearsed before you go into modeling techniques.
  • You must do some exercises on how to normalize a vector. Vector normalization is a must-know concept in prediction algorithms.
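Several of the warm-up items above (unit vectors, orthogonal projection, distance between two points, correlation) fit in a few lines of code. The following is a minimal Python sketch using plain lists rather than a numerical library; the function names are my own choice for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Euclidean length of a vector (Pythagorean theorem in d dimensions)."""
    return math.sqrt(dot(v, v))

def normalize(v):
    """Scale a vector to unit length, keeping its direction."""
    n = norm(v)
    if n == 0:
        raise ValueError("cannot normalize the zero vector")
    return [a / n for a in v]

def project(u, v):
    """Orthogonal projection of u onto v: (u.v / v.v) * v."""
    scale = dot(u, v) / dot(v, v)
    return [scale * a for a in v]

def euclidean(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson(u, v):
    """Pearson correlation: the cosine of the angle between the mean-centred vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    return dot(du, dv) / (norm(du) * norm(dv))

print(normalize([3, 4]))
print(project([2, 3], [1, 0]))
print(euclidean([1, 2, 3], [11, 12, 13]), pearson([1, 2, 3], [11, 12, 13]))
```

The last line shows two vectors that are far apart in Euclidean terms yet perfectly correlated, because Pearson correlation ignores differences in offset and scale.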

“In fact, data mining is part of a larger knowledge discovery process, which includes pre-processing tasks like data extraction, data cleaning, data fusion, data reduction and feature construction, as well as post-processing steps like pattern and model interpretation, hypothesis confirmation and generation, and so on. This knowledge discovery and data mining process tends to be highly iterative and interactive.”

CRUX: The algebraic, geometric and probabilistic viewpoints of data play a key role in data mining. You should exercise them beforehand, so that you can easily sail your boat through Data Science!

Posted in Data Analysis, Data Science, Machine Learning, Predictive Model, Statistical Model