Datum Engineering !

An engineered artwork to make decisions.

Big Data Analytics: From Ugly Duckling to Beautiful Swan

Posted by datumengineering on January 31, 2016

Recently, I came across an interesting book on statistics that narrates the Ugly Duckling story and relates it to today's DATA, or rather BIG DATA ANALYTICS, world. The story originally comes from the famous storyteller Hans Christian Andersen.

The story goes like this…

The duckling was a big ugly grey bird, so ugly that even a dog would not bite him. The poor duckling was ridiculed, ostracized and pecked by the other ducks. Eventually, it became too much for him and he flew to the swans, the royal birds, hoping that they would end his misery by killing him because he was so ugly. As he stared into the water, though, he saw not an ugly grey bird but a beautiful swan.

Data are much the same. Sometimes they’re just big, grey and ugly and don’t do any of the things that they’re supposed to do. When we get data like these, we swear at them, curse them, peck them and hope that they’ll fly away and be killed by the swans.

Alternatively, we can try to force our data into becoming beautiful swans.

Let me relate the above narration to the data analysis solution in two ways:

1. Build a process that exposes the data's potential to become a beautiful swan.

2. Test a set of assumptions and hypotheses on every dataset before it dies as an ugly duckling.

The process of exposing the potential of the data is vast: from data sourcing, wrangling and cleansing to Exploratory Data Analysis (EDA) and further detailed analysis. These steps should be an integral part of any data product. Though these processes have existed for years in most data analysis systems and projects, in recent years they have been considerably extended and integrated with external datasets. This external data builds an ecosystem (a support system) around your data to prove its value. For example, if you want to expose your customer data to a level where it not only shows a 360-degree view but also starts revealing customer patterns and responses with respect to external systems, location plays an important role in this whole process: spatial mapping, where customers can be joined with their surroundings. There are various tools that can help you achieve this spatial mapping, from Java GIS libraries to R spatial libraries; read Spatial Analysis in R at the original post on the Domino Data Lab blog.

Once you set the mapping right with external datasets, there are various tools available for wrangling. Eventually you cleanse the data and do EDA on this broader dataset, and the customer view gains a much broader spectrum of external data: geo-location, economy, GPS sensors and so on. With this, you can start analyzing customers by segments which you have never captured within your own systems. In short, something like the sketch below.
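Here is a minimal sketch of that idea in Python with pandas; the customers.csv and regions.csv files, their columns and the segment thresholds are all hypothetical, purely for illustration:

<code snippet>

import pandas as pd

# Hypothetical inputs: internal customer data and an external geo/economy dataset,
# both keyed by a shared region code.
customers = pd.read_csv("customers.csv")   # columns: customer_id, region_code, annual_spend
regions = pd.read_csv("regions.csv")       # columns: region_code, avg_income, population_density

# Link internal data with the external "support system"
enriched = customers.merge(regions, on="region_code", how="left")

# A segment we could never build from internal data alone:
# spend relative to the surrounding region's average income (made-up thresholds).
enriched["spend_to_income"] = enriched["annual_spend"] / enriched["avg_income"]
segments = pd.cut(enriched["spend_to_income"], bins=[0, 0.05, 0.15, 1.0],
                  labels=["low", "medium", "high"])
print(enriched.groupby(segments)["customer_id"].count())

</code snippet>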


Beyond spatial mapping and analysis, there are many more external data elements that can extend your data-building process to a much broader range of variables for analysis. With effective (rather, smart) use of these data linkages you can start converting any ugly duckling into a meaningful swan.

Let us look at the second part of the solution: building assumptions and hypotheses. Given any data duckling, you should start by assessing how much of an ugly duckling of a dataset you have, and then discover how to turn it into a swan. This is more a statistical solution of proving and probing the duckling than the engineering solution explained previously. When assumptions are broken we stop being able to draw accurate conclusions about reality. Different statistical models assume different things, and if these models are going to reflect reality accurately then these assumptions need to be true. This is a step-by-step process, developed for parametric tests, i.e. tests that require data from one of the large catalogue of distributions that statisticians have described. The assumptions that can be tested are:

Normally distributed data: The rationale behind hypothesis testing relies on having something that is normally distributed (in some cases it’s the sampling distribution, in others the errors in the model).
Homogeneity of variance: This assumption means that the variances should be the same throughout the data. In designs in which you test several groups of participants this assumption means that each of these samples comes from populations with the same variance. In correlational designs, this assumption means that the variance of one variable should be stable at all levels of the other variable.
Interval data: Data should be measured at least at the interval level. This assumption is tested by common sense.
Independence: This assumption, like that of normality, is different depending on the test you’re using. In some cases it means that data from different participants are independent, which means that the behavior of one participant does not influence the behavior of another.

Just as there is vast tool support for data collection, there are various tools that can help you test these assumptions not only numerically but visually too, e.g. ggplot2, pastecs and psych in R. A rough Python equivalent is sketched below.
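For readers working in Python rather than R, here is a minimal sketch of testing two of these assumptions (normality and homogeneity of variance) with scipy.stats; the sample data is made up:

<code snippet>

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=10, size=200)   # toy sample 1
group_b = rng.normal(loc=55, scale=10, size=200)   # toy sample 2

# Normality: Shapiro-Wilk test (null hypothesis: the data are normally distributed)
w, p_norm = stats.shapiro(group_a)
print("Shapiro-Wilk p-value:", p_norm)

# Homogeneity of variance: Levene's test across the two groups
stat, p_var = stats.levene(group_a, group_b)
print("Levene p-value:", p_var)

# Large p-values mean the assumption is not rejected; small ones warn us that
# a parametric test's conclusions may not reflect reality.

</code snippet>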

So, jump straight into the data with either of these approaches (or both), and for sure you can take any duckling on a journey to becoming a beautiful swan. That is actually the start of science: eventually developing a process of learning, and then building a process that learns by itself, so that whenever a new bird comes along it can predict whether it will become a swan or remain a duckling forever 🙂


Posted in Big Data, Data Analysis, Data Science, R, Statistical Model | Leave a Comment »

Where & Why Do You Keep Big Data & Hadoop?

Posted by datumengineering on December 13, 2015

I am back! Yes, I am back on my learning track. Sometimes it is really necessary to take a break and introspect on why we learn, before learning. Ah! It was a nine-month safe refuge to learn how Big Data & Analytics can contribute to a Data Product.

DataLake

Data strategy has always been expected to generate revenue. As Big Data and Hadoop enter the enterprise data strategy, the big data infrastructure is also expected to add revenue. This is a tough expectation of a new entrant (Hadoop) when the established candidate (Data Warehouse & BI) itself mostly struggles for its existence. So it is very pertinent for solution architects to ask WHERE and WHY to bring big data (obviously Hadoop) into the data strategy. And the safe option for this new entrant is the place where it supports and strengthens the EXISTING data analysis strategy. Yeah! That's the DATA LAKE.

Hopefully you have already spotted the three Ws (What: Data Lake, Who: Solution Architect, Where: Enterprise Data Strategy) of the Five Ws of information gathering. Now look at the diagram depicting WHERE and WHY.

Precisely, there are three major areas of opportunity for the new entrant (Hadoop):

  1. Semi-structured and/or unstructured data ingestion.
  2. Push down bleeding data integration problems to Hadoop Engine.
  3. Business need to build comprehensive analytical data stores.

The absence of any one of these three needs would weaken the case for Hadoop entering the existing enterprise strategy. And this data lake approach is believed to align with business analysis outcomes without much disruption, hence it also creates a comfortable path into the enterprise. We can further dig into data lake architecture and implementation strategy in detail.

Moreover, there are a lot of other supporting systems brewing in parallel with the Hadoop ecosystem, such as Apache Kylin. The opportunities on a data lake are immense.

Posted in Big Data, DataLake, Hadoop | 1 Comment »

(R + Python)

Posted by datumengineering on February 8, 2014

Both R and Python should be measured by their effectiveness in advanced analytics and data science. Initially, as newcomers to the data science field, we spend a good amount of time understanding the pros and cons of the two. I too carried out this study, solely for myself, to decide which tool I should pick to get into the depths of data science. Eventually I started realizing that both R and Python have their own space of mastery along with their broad support for data science. Here is some understanding of when to use what.

  • R is very rich when you get into descriptive statistics, inference and statistical modeling, and when you start plotting your data as bar charts, pie charts and histograms. It shines when your data is already well shaped and easily consumable for statistical modeling using vectors, matrices etc.
  • First-time learners who have some knowledge of statistics can quickly get into graphs and visualization of their data using R, spotting trends, identifying correlations etc. I observed that you don't need to practice R as a separate programming language; you can very well scull your boat in the depths of statistics while keeping R in the other hand.
  • R plays a vital role for analysts who like to see data distributions before drawing conclusions. It also helps analysts visualize outliers and the data density of a given dataset.
  • As you get more into probabilistic problems and probability distributions, R eases data manipulation using vectors and matrices. The same applies to linear regression problems.
  • With R's support for statistics-rich problems, you don't need to get into the complexities of Python, OOP and the nitty-gritty of data types.

Now, when you start getting into the space of predictive modeling, machine learning and mathematical modeling, Python lends an easy hand. Mathematical functions and algorithmic problems find good support in Python libraries for k-means and hierarchical clustering, multivariate regression, SVM etc. Not limited to this, it also has good support from data processing and data munging libraries like Pandas and NumPy. Here are some cents for Python:

  • We know! Python is a full-fledged "scripting language", and that statement tells you everything. Most importantly, over the years Python has developed an ecosystem for end-to-end analytics.
  • Now you are not confined to data processing and formalization; you can easily play around with data sourcing and data parsing too using its programming model. This opens opportunities to analyze semi-structured data (JSON, XML) easily.
  • With Python you have the liberty to start consuming data from unstructured sources too. Hadoop Streaming extends the possibilities of using Python on unstructured data stored on HDFS, and on data from HBase for graph and network data processing.
  • With rich libraries like scikit-learn you can do text mining, vectorize the text data and identify similarities between posts and texts.
  • With an object-oriented language in hand, your program will be far more structured and modular for complex mathematical calculations in comparison to R. I would rather call it easier to read.
  • There is a lot of ready-to-serve material in support of machine learning and predictive modeling using Python. Read these two in combination: Machine Learning with Python + Building ML with Python.

So, in summary, we can bet on R when we start getting into statistical analysis and eventually turn towards Python to take the problem to a predictive end. A minimal sketch of that Python side follows.
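As a small, hedged illustration of that Python side, here is a minimal k-means sketch with NumPy and scikit-learn on toy data (not a real analysis):

<code snippet>

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two blobs, the kind of data a statistician might first explore in R
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

# Fit k-means with two clusters and inspect the result
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", model.cluster_centers_)
print("first ten labels:", model.labels_[:10])

</code snippet>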

This write-up is not meant to highlight R's or Python's limitations. R has evolved good support for ML and does combine with Hadoop (e.g. Radoop). Python also has good support for statistics and a rich visualization library (matplotlib). But, as I mentioned earlier, the points above are based solely on ease of use while you learn data science. I suppose that once matured, we can develop expertise in either of them as the job role demands.

Posted in Data Analysis, Python, R, Statistical Model | Tagged: , , , | 2 Comments »

Operational Data Science: excerpt from 2 great articles

Posted by datumengineering on December 12, 2013

The term "Data Science" has been evolving not only as a niche skill but as a niche process as well. It is interesting to study how big data analytics/data science can be efficiently implemented in the enterprise. So, along with my typical study of analytics, viz. big data analytics, I have also been exploring methodologies to bring the term "Data Science" into the mainstream of existing enterprise data analysis, which we conventionally know as "Data Warehouse & BI". This excerpt is a study of the data science workflow with respect to the enterprise, and it opens the forum for discussion on "Operational Data Science" (I am just tossing out this term; it can be named better!). Meanwhile, I must mention the articles I followed during my whole course of learning on the operational side of data science. Both articles mentioned below are superb write-ups by data scientists during their research work, and they can prove to be a valuable gift for enterprises.

  1. Data Science Workflow: Overview & Challenges.
  2. Enterprise Data Analysis and Visualization: An Interview Study

While article #1 gives a fair idea of the complete data science workflow, it is best understood alongside article #2, which nicely explains the challenges. The idea is to read them together.

All studies around implementing analytics revolve around five basic areas: 1) Data Discovery, 2) Data Preparation & Cleansing, 3) Data Profiling, 4) Modeling, 5) Visualization.

Operationally, these five areas can be covered efficiently if the data scientist collaborates well with data engineers, data warehouse architects and data analysts. It is the responsibility of a data scientist to run the show from data discovery to communicating predictions to the business. I certainly don't intend to define the role of a data scientist here (in fact I am not even eligible to). My aim is to sum up the skill sets and identify the operational aspects of them. One of the important points to be discussed here is 'diversity of skills'.

Diversity is pretty important. A generalist is more valuable than a specialist. A specialist isn't fluid enough. We look for pretty broad skills and data passion. If you are passionate about it you'll jump into whatever tool you need to. If it's in X, I'll go jump in X.

It needs diversity of skill, i.e. from "business understanding of the data (data discovery)" to "programming hands-on with scripting or SQL" to "communicating effectively through the right visualization tool". It is difficult for one person to diversify across all these areas and at the same time specialize in one. In such a complex environment we should look at the opportunity to bring data warehousing, unstructured data analysis and predictive analytics together. This opportunity is well detailed in article #2.

Most organisations work in silos on their data, and in the absence of an effective communication channel between the data warehouse and analytics teams the whole effort of effective analysis goes awry. 80% of organisations divide the data warehouse, advanced analytics and statistical analysis effort into different teams; these teams not only address different business problems but are also aligned with different architects. In my opinion this could be the main reason that kills the flavor of data science. Interestingly, during one of my assignments in retail data analysis, I observed that the data warehouse team had matured only to the level of summarization and aggregation. I realized that this data warehouse world would end after delivering a bunch of reports and some dashboards; eventually they would be more interested in archiving the data thereafter. That's the bottleneck!

To overcome this bottleneck we need to bring analytics into the mainstream data processing layer, or develop a parallel workflow for it; article #1 articulates the same and proposes the flow mentioned below. If you take this figure (from article #1) as the data science workflow, then you need diversely skilled engineers working toward a common goal to deliver it, unlike conventional data analysis. Observe it closely and figure out the business-oriented engineering team.

A team with this diversity can work together in an enterprise to deliver the workflow:

  • Person(s) with strong business acumen and visualization skills.
  • Data Integration Engineer(s) with ETL skills on structured and unstructured data
  • Statistician with R skills
  • Smart programmer with Predictive modeling skills.

Interestingly, there is no strict order of delivery among these people. Having said that, the person with a strong business background can work at both ends of the shore, i.e. in data discovery as well as in communicating the final result (whether a prediction, a pattern or a summarization). Programmers can work fairly independently in all areas of data preparation, data analysis and scripting to build datasets for modeling (in fact this is the hardest area; read article #2). In the same way, the statistician can very well communicate the business result and its implications. After all these efforts, what is left is just a game of effective collaboration. That is easily visible in the figure below.

Responsibility Matrix

Moreover, along with the right collaboration channel there should be data scientist(s) who can watch over and architect the whole workflow, and who are always ready to design, code and test the prototype of the end product. So this whole Operational Data Science effort needs a collaborative team and architect(s) with diverse skills who are ready to phrase the statement below.

I’m not a DBA, but I’m good at SQL. I’m not a programmer but am good at programming. I’m not a statistician but I am good at applying statistical techniques!

Posted in Data Analysis, Data Science | Leave a Comment »

ETL, ELT and Data Hub: Where is Hadoop the right fit?

Posted by datumengineering on November 17, 2013

A few days back I attended a good webinar conducted by Metascale on the topic "Are You Still Moving Data? Is ETL Still Relevant in the Era of Hadoop?" This post is a response to that webinar.

In summary, this webinar nicely explained how an enterprise can use Hadoop as a data hub alongside the existing data warehouse setup. The phrase "Hadoop as a Data Hub" itself raised a lot of questions in my mind:

  1. When we project Hadoop as a data hub and at the same time maintain the data warehouse as another (conventional) data repository for the enterprise, won't we be creating another platform in silos? The presenter in the webcast repeatedly talked about keeping the existing data warehouse intact while developing Hadoop as a data hub. Difficult to digest 😦
  2. The next question is the challenge of using the Hadoop environment as a Master Data Management and data governance platform. I don't think the Hadoop ecosystem is mature enough to handle MDM complexity swiftly. As far as data governance is concerned, the Hadoop ecosystem lacks the applications required on top of Hadoop for robust governance.
  3. Why put a lot of energy into building compatibility between ETL tools like Informatica and HDFS just to connect existing ETL infrastructure with big data? I feel this is a crazy idea, because you are selling a cost-effective solution with some under-the-cover costs; obviously, Informatica will not give you a Hadoop connector for "free". There are many questions other than cost, like performance, where business logic is staged, etc.
  4. There is also a big bet on Hadoop replacing the existing ETL/ELT framework by pushing transformations down to its MapReduce framework. I partially bought this idea long back, but I am still not convinced when:
  • Your use case doesn't suit the MapReduce framework during ETL.

  • You process a relatively small amount of data using Hadoop. Hadoop is not meant for this and takes longer than it should.

  • You try to join information with the existing data warehouse and unnecessarily duplicate it on HDFS as well as in the conventional RDBMS.

Having these questions doesn't mean Hadoop can't be projected as a replacement for, or amendment to, the existing data warehouse strategy. On the contrary, I see some different possibilities and ways for Hadoop to sneak into the existing enterprise data architecture. Here are my few cents:

  1. Identify the areas where data is written once and read many times. Most importantly, identify the nature of the reads: are they straightforward, joined or aggregated? All such situations, in the case of BIG DATA, can be handled efficiently on HDFS. If you don't know in advance, data profiling and capacity planning will be the decision makers for whether the data should go to your RDBMS or to HDFS. Remember, however, that if your queries are more ad hoc and you plan to move them to HDFS, you need skills beyond Hive and Pig.
  2. Use Hadoop's HDFS features more than MapReduce. I mean distributed storage, to minimize backup effort and handle data replication. This will be cost effective in comparison to DBA costs. For example, archive data on HDFS rather than on tape drives, so your data never retires from analysis. Use MapReduce intelligently whenever you see the opportunity to break your calculations down into independent parts, i.e. maps.
  3. Identify the data that is small and can fit into the distributed cache in Hadoop; only that needs an entry into HDFS, while the rest of the small (not BIG) data can stay in the RDBMS. Again, capacity planning plays a major role here.
  4. Now we come to ETL: I am really happy to see Hadoop and HDFS here, but not with Informatica, DataStage or any other ETL tool (I don't know much about Pentaho). I must appreciate and support the Metascale webinar: they have given the right approach by taking Hadoop as an Extract, Load and Transform (ELT) framework. Yes, this is the only way to do transformation right on Hadoop; let's rename it DATA INTEGRATION. The moment you start thinking about ETL tools, you are taking your data out of Hadoop, and the moment you take data out you are going against the whole purpose of using Hadoop as a data processing platform. Doesn't that kill the idea of doing transformation on Hadoop using MapReduce, and also the idea of bringing your overall cost down? However, I'll be open to learning the right logic for using Informatica or any other ETL tool on top of Hadoop.

I think the effort to bring Hadoop to the enterprise requires diligent changes to the data warehouse reference architecture. We are going to change a lot in our reference architecture when we bring Hadoop into the enterprise.

So, it is not so important to ask WHERE HADOOP IS THE RIGHT FIT; it is very important to understand HOW HADOOP IS THE RIGHT FIT.

Posted in Big Data, Hadoop | Leave a Comment »

Warm-up exercise before data science.

Posted by datumengineering on October 18, 2013

Practicing data science is indeed a long-term effort rather than learning a handful of skills. We ought to be academically good enough to take up this challenge. However, if you think you have come a long way from your academic grounding but still have the zeal and passion to extract the oil from the data and fill the data science skill gap, here are the warm-up tips. The points below should be exercised before jumping into any data science or data mining problem:

  • Come out of "table-row-column" mode and start looking at a dataset more as a MATRIX and VECTORS.

Not all datasets are in the form of a data matrix. For instance, more complex datasets can be in the form of sequences, text, time series, images, audio, video, and so on, which may need special techniques for analysis. However, in many cases even if the raw data is not a data matrix it can usually be transformed into that form via feature extraction. A practical example of feature extraction is explained in my last post on the scikit-learn library.

  • The number of attributes defines the dimensionality of the data matrix. Keep the dimensionality in mind when you think about any matrix operation.
  • Each row may be considered a d-dimensional column vector (all vectors are assumed to be column vectors by default). You must also understand the terms row space and column space.
  • Treating data instances and attributes as vectors, and the entire dataset as a matrix, enables one to apply both geometric and algebraic methods to aid in the data mining and analysis tasks. At the very least you must be aware of unit vectors, the identity matrix etc.
  • Clear the dust from your school learning of matrix manipulation, i.e. matrix addition, multiplication, transpose, inverse etc. The same applies to some algebraic staples like the distance between two points and the Pythagorean (Pythagoras') theorem.
  • A thorough understanding of matrix manipulation will help you implement multiplication and summation of elements.
  • Leaving out probability is probably not a good idea. Run through some short probability problems and exercises before you go into the detail of any supervised learning model.
  • You may need to practice the topics you happily skipped during school, like orthogonal projection of a vector (projecting one vector onto another), the probabilistic view of data, and the probability density function (I admit to avoiding these topics during graduation 🙂).
  • Refresh yourself on all the formulas of descriptive statistics, from mean, median and mode to the normal distribution, skewness and, most importantly, variance and standard deviation. You should be ready to do basic statistical analysis of univariate and multivariate numeric data. Believe me, distance-finding methodologies change with the distribution of the data (use the Euclidean distance score when the data is normally distributed, otherwise the Pearson correlation score).
  • Generalization, correlation and regression concepts are widely used across statistics and mathematical modeling, so these must be broadly rehearsed before you get into modeling techniques.
  • You must do some exercises on how to normalize a vector. Vector normalization is a must-know concept in prediction algorithms. A small warm-up sketch follows this list.
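Here is a minimal NumPy warm-up covering a few of these points (descriptive statistics, vector normalization, Euclidean distance and Pearson correlation) on made-up vectors:

<code snippet>

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 5.0, 4.0, 6.0, 5.0, 8.0, 8.0])

# Descriptive statistics refresher
print("mean:", x.mean(), "variance:", x.var(), "std dev:", x.std())

# Vector normalization: scale x to unit length
unit_x = x / np.linalg.norm(x)
print("unit vector length:", np.linalg.norm(unit_x))   # ~1.0

# Euclidean distance between the two vectors
print("euclidean distance:", np.linalg.norm(x - y))

# Pearson correlation coefficient between x and y
print("pearson r:", np.corrcoef(x, y)[0, 1])

</code snippet>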

” In fact, data mining is part of a larger knowledge discovery process, which includes pre-processing tasks like data extraction, data cleaning, data fusion, data reduction and feature construction. As well as post-processing steps like pattern and model interpretation, hypothesis confirmation and generation, and so on. This knowledge discovery and data mining process tends to be highly iterative and interactive. “

CRUX: The algebraic, geometric and probabilistic viewpoints of data play a key role in data mining. Exercise them beforehand, so you can easily sail your boat through data science!

Posted in Data Analysis, Data Science, Machine Learning, Predictive Model, Statistical Model | Leave a Comment »

Python Scikit-learn to simplify Machine learning : { Bag of words } To [ TF-IDF ]

Posted by datumengineering on September 26, 2013

Text (word) analysis and tokenized text modeling always send a chill around the ears, especially when you are new to machine learning. Thanks to Python and its extended libraries for their warm support for text analytics and machine learning. Scikit-learn is a savior and an excellent support for text processing once you also understand concepts like "bag of words", "clustering" and "vectorization". Vectorization is a must-know technique for all machine learning learners, text miners and algorithm implementors; I personally consider it a revolution in analytical calculation. Read one of my earlier posts about vectorization. Let's look at the implementors of vectorization and try to zero in on the process of text analysis.

Fundamentally, before we start any text analysis we first need to tokenize every word in a given text so we can apply a mathematical model to these words. When we tokenize the text, it can be transformed into the {bag of words} model of document classification. This {bag of words} model is used as a feature to train classifiers. We'll observe in code how the feature and classifier terms can be explored and implemented using scikit-learn. But before that, let us explore how to tokenize and bring the text into vector shape. The {bag of words} representation follows a three-step process: tokenizing, counting and finally normalizing the vector.

  • Tokenizing: tokenize strings, giving an integer id to each possible token.
  • Counting: once tokenized, count the occurrences of tokens in each document.
  • Normalizing: normalize and weight the tokens, with diminishing importance for tokens that occur in the majority of samples/documents.

* The code below needs Python 2.7 or above, NumPy 1.3 or above and scikit-learn 0.14. Obviously, all of this was run on Ubuntu 12.04 LTS.

Scikit’s functions and classes are imported via the sklearn package as follows:

<code snippet>

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)

</code snippet>

Here we do not have to write custom code for counting words and representing those counts as a vector; scikit-learn's CountVectorizer does the job very efficiently and has a very convenient interface. The parameter min_df determines how CountVectorizer treats words that are not used frequently (minimum document frequency): if it is set to an integer, all words occurring less often than that value are dropped; if it is a fraction, all words that occur in less than that fraction of the overall dataset are dropped. The parameter max_df works in a similar manner. Once we vectorize the posts using this feature-vector functionality we'll have two simple vectors. We can then calculate the Euclidean distance between the two vectors and find the nearest one to identify similarities. This is nothing but a step towards clustering/classification of similar posts; a small sketch of that distance calculation follows.
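Here is a small, hedged sketch of that distance calculation (not taken verbatim from the book) using two tiny toy posts:

<code snippet>

>>> import numpy as np
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
>>> X = vectorizer.fit_transform(["big data hype is everywhere",
...                               "big data needs real solutions"])
>>> v1, v2 = X.toarray()                 # one count vector per toy post
>>> print(np.linalg.norm(v1 - v2))       # smaller distance = more similar posts

</code snippet>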

Hold on, we haven't reached the phase of implementing clustering algorithms. We need to move cautiously through the steps below to bring our raw text into a more meaningful {bag of words}. We also try to relate some technical terms (in blue) to each step:

  1. Tokenizing the text. — vectorization and tokenizing
  2. Throw away some less important words. — stop words
  3. Throwing away words that occur way too often to be of any help in detecting relevant posts. — frequency cut-off (max_df)
  4. Throwing away words that occur so seldom that there is only a small chance that they occur in future posts.
  5. Counting the remaining words.
  6. Calculating TF-IDF values from the counts, considering the whole text corpus. — calculate TF-IDF

With this process, we'll be able to convert a bunch of noisy text into a concise representation of feature values. Hopefully you're familiar with the term TF-IDF; if not, the explanation below will help build understanding of it:

When we use feature extraction and vectorize the text, the feature values simply count occurrences of terms in a post. We silently assumed that higher values for a term also mean that the term is of greater importance to the given post. But what about, for instance, the word "subject", which naturally occurs in each and every single post? Alright, we could tell CountVectorizer to remove it as well by means of its max_df parameter. We could, for instance, set it to 0.9 so that all words that occur in more than 90 percent of all posts would always be ignored. But what about words that appear in 89 percent of all posts? How low would we be willing to set max_df? The problem is that however we set it, some terms will always be more discriminative than others. This can only be solved by counting term frequencies for every post and, in addition, discounting those that appear in many posts. In other words, we want a high value for a given term in a given post if that term occurs often in that particular post and very rarely anywhere else. This is exactly what term frequency – inverse document frequency (TF-IDF) does.

So, continuing from the previous code where we imported the CountVectorizer class to vectorize and tokenize text: in the example below we are going to compare the term "Big Data Hype" against two different posts published about the "hype" of "Big Data". To do this we first vectorize the posts in question and then vectorize the new post using the same scikit-learn method. Once we have the vectors we can calculate the distance of the new post. The code snippet below ONLY covers vectorizing and tokenizing the text.

<code snippet>

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> content = ["Bursting the Big Data bubble starts with appreciating certain nuances about its products and patterns", "the real solutions that are useful in dealing with Big Data will be needed and in demand even if the notion of Big Data falls from the height of its hype into the trough of disappointment"]

>>> vectorizer = CountVectorizer(min_df=1)
>>> X = vectorizer.fit_transform(content)

>>> print(vectorizer)
CountVectorizer(analyzer=word, binary=False, charset=None, charset_error=None,
        decode_error=strict, dtype=<type 'numpy.int64'>, encoding=utf-8,
        input=content, lowercase=True, max_df=1.0, max_features=None,
        min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=(?u)\b\w\w+\b, tokenizer=None,
        vocabulary=None)

>>> vectorizer.get_feature_names()
[u'about', u'and', u'appreciating', u'are', u'be', u'big', u'bubble', u'bursting', u'certain', u'data', u'dealing', u'demand', u'disappointment', u'even', u'falls', u'from', u'height', u'hype', u'if', u'in', u'into', u'its', u'needed', u'notion', u'nuances', u'of', u'patterns', u'products', u'real', u'solutions', u'starts', u'that', u'the', u'trough', u'useful', u'will', u'with']

>>> X_train = vectorizer.fit_transform(content)
>>> num_samples, num_features = X_train.shape
>>> print("#samples: %d, #features: %d" % (num_samples, num_features))
#samples: 2, #features: 37

>>> vectorizer = CountVectorizer(min_df=1, stop_words='english')
…….

…….

</code snippet>
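To take the same posts from raw counts to TF-IDF, here is a minimal sketch using scikit-learn's TfidfVectorizer; it reuses the content list defined above and is only an illustration of the idea:

<code snippet>

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(min_df=1, stop_words='english')
>>> X_tfidf = tfidf.fit_transform(content)   # reuses the 'content' list from the snippet above
>>> print(X_tfidf.shape)                     # (number_of_posts, number_of_terms)
>>> # Terms shared by many posts get discounted; rare, discriminative terms get boosted.

</code snippet>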

I would highly recommend the book "Building Machine Learning Systems with Python", available on Packtpub or on Amazon.

Posted in Machine Learning, Predictive Model, Python | Leave a Comment »

An indispensable Python: Data sourcing to Data science.

Posted by datumengineering on August 27, 2013

The data analysis ecosystem has grown all the way from SQL to NoSQL and from Excel analysis to visualization. Today we are short of resources to process ALL (you know what I mean by ALL) kinds of data coming into the enterprise. Data goes through profiling, formatting, munging or cleansing, pruning and transformation steps on the way to analytics and predictive modeling. Interestingly, no one tool has proved to be an effective solution for all these operations { don't forget the cost factor here 🙂 }. Things become challenging when we mature from aggregated/summarized analysis to data mining, mathematical modeling, statistical modeling and predictive modeling. A pinch of complication is added by Agile implementation.

Enterprises have to work out a solution that helps data analysis (rather, analytics) proceed in an Agile way across all complex data structures, whether SQL or NoSQL, while supporting data mining activities.

So, let's look at Python and its ecosystem of libraries and how it can cover the enterprise's back for data analysis.

Python: a functional, object-oriented programming language, and most importantly super easy to learn. Any home-grown programmer with minor knowledge of programming fundamentals can start on Python at any time. Python has a rich library framework; even an old-timer can dare to start programming in it. The following data structures and functions can be explored for implementing various algorithms like recommendation engines, collaborative filtering, k-means clustering and Support Vector Machines (a tiny warm-up follows the list):

  • Dictionary.
  • Lists.
  • String.
  • Sets.
  • map(), reduce().
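As a tiny warm-up with these built-ins, here is a hedged sketch that uses map() and reduce() to compute a dot product of two made-up rating lists, a building block for recommendation engines:

<code snippet>

from functools import reduce

ratings_a = [4, 0, 5, 3]          # toy user-item ratings
ratings_b = [5, 1, 4, 0]

# map(): pairwise products; reduce(): sum them into a dot product
products = map(lambda pair: pair[0] * pair[1], zip(ratings_a, ratings_b))
dot = reduce(lambda acc, v: acc + v, products, 0)
print("dot product:", dot)

</code snippet>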

The Python Ecosystem for Data Science:

Let's begin with sourcing data, bringing it into a dataset format and shaping it.

{ Pandas: Data loading, Cleansing, Summarization, Joining, Time Series Analysis }

Pandas: data analysis wrapped up in a Python library. It has most of the things you look for to run quick analysis. DataFrames, join, merge and group-by are the built-ins available to run SQL-like analysis on data coming from CSV files (the read_csv function). To install Pandas you need NumPy installed first.

{ NumPy: Data arrays, Vectorization, Matrix and Linear algebra operations, i.e. mathematical modeling }

NumPy: a rich set of functions for array, matrix and vector operations. Indexing, slicing and stacking are prominent NumPy functionality.

{ Scipy:  Mean, variance, skewness, kurtosis }

SciPy: used to run scientific analysis on the data. The statistics functions are located in the sub-package scipy.stats.

{ Matplotlib: Graph, histograms, power spectra, bar charts, errorcharts, scatterplots }

Matplotlib: a 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. A minimal end-to-end sketch of these libraries working together follows.
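Here is a minimal sketch stitching these four libraries together; the sales.csv file and its columns are hypothetical, purely for illustration:

<code snippet>

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Pandas: load and summarize (sales.csv with columns region, revenue is made up)
df = pd.read_csv("sales.csv")
summary = df.groupby("region")["revenue"].sum()

# NumPy / SciPy: array math and descriptive statistics on one column
revenue = df["revenue"].to_numpy()
print("mean:", np.mean(revenue), "skewness:", stats.skew(revenue))

# Matplotlib: quick visual check
summary.plot(kind="bar", title="Revenue by region")
plt.tight_layout()
plt.show()

</code snippet>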

Moreover, Python also supports big data analytics and machine learning. The resources below can be utilized for various big data applications:

  • Lightweight Map-Reduce implementation written in Python: Octopy
  • Hbase interaction using python: happybase
  • Machine learning algorithm implementations in Python: scikit-learn. It is built on NumPy, SciPy and matplotlib.

Having said that, Python is capable enough to provide a way to implement data analysis algorithms and hence to build your own data analysis framework.

Watch this space for implementations of various algorithms in Python under one umbrella, i.e. Python data analysis tools.

Posted in Big Data, Data Analysis, Predictive Model, Python, Statistical Model | Tagged: , , , | Leave a Comment »

Big Data? How do you run capacity planning?

Posted by datumengineering on February 15, 2013

Most data warehouse folks are very much accustomed to the term "Capacity Planning" (read Inmon). It is a widely used process for DBAs and data warehouse architects. In a typical data management and warehouse project, a wide variety of audiences is involved in driving capacity planning: everyone from the business analyst to the architect, the developer, the DBA and finally the data modeler.

This practice has had a wide audience in the typical data warehouse world, so how is it driven for big data? I have hardly heard any noise around it in Hadoop-driven projects that started with the intention of handling growing data. I have met the pain bearers, the DBAs and architects, who face challenges at all stages of data management when data outgrows its systems. They are the main players who advocate bringing in Hadoop ASAP. The crux of their problem is not the growing data; the problem is that they don't have a mathematical calculation that explains the growth rate. All we talk about is: by what percentage is it growing? Most of the time even that percentage comes from experience 🙂

Capacity planning should explore more than just a percentage and experience:

  1. It should be a mathematical calculation over every byte of the data sources coming into the system.
  2. How about designing a predictive model which will forecast data growth with reasonable accuracy up to 10 years out? (A minimal sketch follows this list.)
  3. How about involving the business to confirm the data growth drivers and the feasibility of future data sources?
  4. Why not factor compression and purging into the calculation to reclaim space for data growth?
  5. Why do we consider only disk utilization, with no consideration of other hardware resources like memory, processor and cache? After all, it is a data processing environment.
  • I think this list of considerations can still grow…
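As a minimal sketch of point 2 above, here is a hedged example that fits a simple exponential growth trend with NumPy on made-up monthly volumes and extrapolates it; a real model would also need the business drivers, compression and purging mentioned above:

<code snippet>

import numpy as np

# Toy history: 24 months of data volume in TB (made-up numbers)
months = np.arange(24)
volume_tb = 10 * np.exp(0.05 * months) + np.random.default_rng(1).normal(0, 0.3, 24)

# Fit an exponential trend by regressing log(volume) on time
slope, intercept = np.polyfit(months, np.log(volume_tb), 1)

# Extrapolate to year 5 (month 60)
future_month = 60
forecast_tb = np.exp(intercept + slope * future_month)
print("forecast volume at month %d: %.1f TB" % (future_month, forecast_tb))

</code snippet>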

I know building robust capacity planning is not a task of a day or a month. One to two years of data is good enough to understand the trend and develop an algorithm around it. Consider one to two years as the learning dataset, take some months of data as the training dataset, start analyzing the trend and build a model which can predict the growth in the third or fourth year. Because, as per the data warehouse gurus, the bleeding starts after the fifth year.

I'll leave it up to you to design the solution and process for capacity planning to claim your DATA as BIG DATA.

Remember, disk space is cheap, but disk seeks are not.

Posted in Big Data, Hadoop | Tagged: , | 2 Comments »

Data analysis drivers

Posted by datumengineering on February 11, 2013

I have been exploring data analysis and modeling techniques for months. There are lots of topics floating around in the space of data analysis, like statistical modeling and predictive modeling. There have always been questions in my mind: which technique should I choose? Which is the preferred way to do data analysis? Some articles and lectures highlight machine learning or mathematical models over the limitations of statistical modeling; they present mathematical modeling as the next step in accuracy and prediction. This kind of article creates more questions in the mind of a naive user.

Finally, I would like to thank coursera.org for zeroing in on this confusion and stating a clear picture of data analysis drivers. Now things are pretty clear in terms of how to proceed with data analysis, or rather, how to define the "DATA ANALYSIS DRIVERS". The one-liner answer is simple: "Define a question or problem". So everything depends upon how you define the problem.

To start with the data analysis drivers, here are the steps in a data analysis:

  1. Define the question
  2. Define the ideal data set
  3. Determine what data you can access
  4. Obtain the data
  5. Clean the data
  6. Exploratory data analysis
  7. Statistical prediction/modeling
  8. Interpret results
  9. Challenge results
  10. Synthesize/write up results
  11. Create reproducible code
  • Defining the question means understanding how the business problem is stated and how you proceed with storytelling about it. Storytelling about the problem will take you to structuring the solution, so you should be good at telling the story of the problem statement.
  • Defining the solution will help you prepare the data (dataset) for it.
  • Profile the source to identify what data you can access.
  • The next step is cleansing the data.
  • Once the data is cleansed, it is in one of the following standard forms: txt, csv, xml/html, json or a database.
  • Based on the solution's needs we start building the model. Precisely, the solution will require descriptive analysis, inferential analysis or predictive analysis.

Hence, the dataset and model may depend on your goal:

  1. Descriptive – a whole population.
  2. Exploratory – a random sample with many variables measured.
  3. Inferential – the right population, randomly sampled.
  4. Predictive – a training and test data set from the same population.
  5. Causal – data from a randomized study.
  6. Mechanistic – data about all components of the system.

From here, knowledge of statistics, machine learning and mathematical algorithms does the work 🙂

Posted in Big Data, Data Analysis | Tagged: , | 6 Comments »

 