Datum Engineering !

An engineered artwork to make decisions.

Archive for the ‘Python’ Category

(R + Python)

Posted by datumengineering on February 8, 2014

Both R and Python should be measured by their effectiveness in advanced analytics and data science. As newcomers to the data science field, we spend a good amount of time understanding the pros and cons of these two. I too carried out this study purely for myself, to decide which tool I should pick to get into the depths of data science. Eventually I started realizing that both R and Python have their own areas of mastery along with broad support for data science. Here is some understanding on "when to use what":

  • R is very rich when you get into descriptive statistics, inference and statistical modeling, and when you start plotting your data as bar charts, pie charts and histograms. It shines when your data is already well shaped and easily consumable for statistical modeling using vectors, matrices etc.
  • A first-time learner with some knowledge of statistics can start exploring graphs and visualization of their data using R, spotting trends, identifying correlations etc. I observed that you don't need to practice R as a separate programming language; you can very well keep sculling your boat into the depths of statistics with R in the other hand.
  • R plays a vital role for analysts who love to see data distributions before drawing conclusions. It also helps analysts visualize outliers and the data density of a given data set.
  • As you get more into probabilistic problems and probability distributions, R eases data manipulation using vectors and matrices. The same applies to linear regression problems.
  • With R's support for statistics-rich problems, you don't need to get into the complexities of Python, OOP and the nitty-gritty of data types.

Now, when you start getting into the space of predictive modeling, machine learning and mathematical modeling, Python can lend an easy hand. Mathematical functions and algorithmic problems find good support in Python libraries for k-means and hierarchical clustering, multivariate regression, SVM etc. (a tiny k-means sketch follows the list below). Not limited to this, it also has good support from data processing and data munging libraries like Pandas and NumPy. Here are a few cents for Python:

  • We know! Python is a full-fledged "scripting language", and that statement says it all. Most importantly, over the years Python has developed an ecosystem for end-to-end analytics.
  • You are no longer confined to data processing and formalization; you can easily play around with data sourcing and data parsing too, using a programming model. This opens up the opportunity to analyze semi-structured data (JSON, XML) easily.
  • With Python you have full liberty to consume data from unstructured sources as well. Hadoop's streaming support extends the possibilities of using Python on unstructured data stored in HDFS, and on HBase for graph and network data processing.
  • With rich libraries like scikit-learn you can do text mining, vectorize text data and identify similarities between posts and texts.
  • With an object-oriented language in hand, your program will be far more structured and modular for complex mathematical calculations compared to R. I would rather call it easier to read.
  • There is a lot of ready-to-serve material in support of machine learning and predictive modeling with Python. Read these two in combination: Machine Learning with Python + Building Machine Learning Systems with Python.
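For instance, here is a minimal, hedged sketch of k-means clustering with scikit-learn; the toy data and the choice of two clusters are invented purely for illustration, not taken from any real analysis:

<code snippet>

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> # toy 2-D points, invented only to illustrate the API
>>> data = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 9.0]])
>>> model = KMeans(n_clusters=2)   # assume the data falls into two groups
>>> model.fit(data)
>>> print(model.labels_)           # cluster id assigned to each point
>>> print(model.cluster_centers_)  # learned cluster centres

</code snippet>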

So, in summary, we can bet on R when we start getting into statistical analysis, and eventually turn to Python to take the problem to a predictive end.

This write-up isn't meant to highlight R's or Python's limitations. R has evolved good support for ML and does have a combination with Hadoop as Radoop. Likewise, Python has good support for statistics and a rich visualization library (matplotlib). But, as I mentioned earlier, the points above are based solely on ease of use while you learn data science. I suppose that once matured, we can develop expertise in either of them as the job role demands.


Posted in Data Analysis, Python, R, Statistical Model | 2 Comments »

Python Scikit-learn to simplify Machine Learning: { Bag of words } to [ TF-IDF ]

Posted by datumengineering on September 26, 2013

Text (word) analysis and tokenized text modeling always send a chill around the ears, especially when you are new to machine learning. Thanks to Python and its extended libraries for their warm support around text analytics and machine learning. Scikit-learn is a savior and an excellent help in text processing, once you also understand concepts like "bag of words", "clustering" and "vectorization". Vectorization is a must-know technique for all machine learning learners, text miners and algorithm implementors. I personally consider it a revolution in analytical calculation. Read one of my earlier posts about vectorization. Let's look at the implementors of vectorization and try to zero in on the process of text analysis.

Fundamentally, before we start any text analysis we need to tokenize every word in a given text, so that we can apply a mathematical model to these words. Once we tokenize the text, it can be transformed into the {bag of words} model of document classification. This {bag of words} model is used as a feature to train classifiers. We'll observe in code how the feature and classifier terms can be explored and implemented using Scikit-learn. But before that, let us explore how to tokenize the text and bring it into vector shape. The {bag of words} representation follows a 3-step process: tokenizing, counting and finally normalizing the vectors.

  • Tokenizing: tokenize strings and give an integer id to each possible token.
  • Counting: once tokenized, count the occurrences of tokens in each document.
  • Normalizing: normalize and weight the counts, with diminishing importance for tokens that occur in the majority of samples/documents.

* The code below needs Python 2.7 or above, NumPy 1.3 or above and scikit-learn 0.14. Obviously, all of this happens on Ubuntu 12.04 LTS.

Scikit’s functions and classes are imported via the sklearn package as follows:

<code snippet>

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)

</code snippet>

Here we do not have to write custom code for counting words and representing those counts as a vector; Scikit's CountVectorizer does the job very efficiently, and it has a very convenient interface. The parameter min_df determines how CountVectorizer treats words that are not used frequently (minimum document frequency). If it is set to an integer, all words occurring less than that many times will be dropped; if it is a fraction, all words that occur in less than that fraction of the overall dataset will be dropped. The parameter max_df works in a similar manner. Once we vectorize the posts using this feature-vector functionality, we have two simple vectors. We can then calculate the Euclidean distance between those two vectors and pick the nearest one to identify similarities. This is nothing but a step towards clustering/classification of similar posts.
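As a rough sketch of that idea (the two short posts below are made up only for illustration), the count vectors can be pulled out of the sparse matrix and compared with a plain Euclidean distance:

<code snippet>

>>> import scipy as sp
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> # two invented posts, only to demonstrate the distance calculation
>>> posts = ["machine learning with python", "python makes machine learning simple"]
>>> vectorizer = CountVectorizer(min_df=1)
>>> X = vectorizer.fit_transform(posts)
>>> v1 = X.getrow(0).toarray()
>>> v2 = X.getrow(1).toarray()
>>> print(sp.linalg.norm(v1 - v2))   # Euclidean distance between the two count vectors

</code snippet>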

Hold on, we haven't reached the phase of implementing clustering algorithms yet. We need to move cautiously through the steps below to bring our raw text into a more meaningful {bag of words}. We also correlate some of the technical terms (in blue) with each step:

  1. Tokenizing the text. — vectorization and tokenizing
  2. Throwing away less important words. — stop words
  3. Throwing away words that occur way too often to be of any help in detecting relevant posts. — max_df
  4. Throwing away words that occur so seldom that there is only a small chance they occur in future posts. — min_df
  5. Counting the remaining words.
  6. Calculating TF-IDF values from the counts, considering the whole text corpus. — calculate TF-IDF

With this process we'll be able to convert a bunch of noisy text into a concise representation of feature values. Hopefully you're familiar with the term TF-IDF; if not, the explanation below should help build an understanding of it:

When we use feature extraction and vectorize the text, the feature values simply count occurrences of terms in a post. We silently assumed that a higher value for a term also means the term is of greater importance to the given post. But what about, for instance, the word "subject", which naturally occurs in each and every single post? Alright, we could tell CountVectorizer to remove it as well by means of its max_df parameter. We could, for instance, set it to 0.9 so that all words occurring in more than 90 percent of all posts would always be ignored. But what about words that appear in 89 percent of all posts? How low would we be willing to set max_df? The problem is that, however we set it, some terms will always be more discriminative than others. This can only be solved by counting term frequencies for every post and, in addition, discounting those terms that appear in many posts. In other words, we want a high value for a given term in a given post if that term occurs often in that particular post and very rarely anywhere else. This is exactly what term frequency – inverse document frequency (TF-IDF) does.
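If you just want the TF-IDF weights without wiring up the counting and normalization yourself, scikit-learn also ships a TfidfVectorizer. Here is a minimal, self-contained sketch; the two tiny documents are invented for the example:

<code snippet>

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> # two tiny example documents, invented for illustration
>>> docs = ["big data hype", "big data solutions in real demand"]
>>> tfidf = TfidfVectorizer(min_df=1)
>>> X = tfidf.fit_transform(docs)
>>> print(tfidf.get_feature_names())
>>> print(X.toarray())   # rows now hold TF-IDF weights instead of raw counts

</code snippet>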

So, continuing from the previous code where we imported the CountVectorizer library to vectorize and tokenize text: in the example below we are going to compare a "Big Data hype" term against two different posts published about the "hype" of "Big Data". To do this we first need to vectorize the two posts in question, and then vectorize the new post using the same Scikit method. Once we have the vectors, we can calculate the distance to the new post. This code snippet ONLY covers vectorizing and tokenizing the text.

<code snippet>

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> content = ["Bursting the Big Data bubble starts with appreciating certain nuances about its products and patterns","the real solutions that are useful in dealing with Big Data will be needed and in demand even if the notion of Big Data falls from the height of its hype into the trough of disappointment"]

>>> vectorizer = CountVectorizer(min_df=1)
>>> X = vectorizer.fit_transform(content)

>>> print(vectorizer)
CountVectorizer(analyzer=word, binary=False, charset=None, charset_error=None,
        decode_error=strict, dtype=<type 'numpy.int64'>, encoding=utf-8,
        input=content, lowercase=True, max_df=1.0, max_features=None,
        min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=(?u)\b\w\w+\b, tokenizer=None,
        vocabulary=None)

>>> vectorizer.get_feature_names()
[u'about', u'and', u'appreciating', u'are', u'be', u'big', u'bubble', u'bursting', u'certain', u'data', u'dealing', u'demand', u'disappointment', u'even', u'falls', u'from', u'height', u'hype', u'if', u'in', u'into', u'its', u'needed', u'notion', u'nuances', u'of', u'patterns', u'products', u'real', u'solutions', u'starts', u'that', u'the', u'trough', u'useful', u'will', u'with']

>>> X_train = vectorizer.fit_transform(content)
>>> num_samples, num_features = X_train.shape
>>> print("#samples: %d, #features: %d" % (num_samples, num_features))
#samples: 2, #features: 37

>>> vectorizer = CountVectorizer(min_df=1, stop_words='english')
…….

…….

</code snippet>
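The snippet stops before the actual comparison, so here is a rough, hedged sketch of how it could continue: refit the vectorizer on the two `content` posts, vectorize a hypothetical new post with the same vocabulary, and measure the Euclidean distance to each training post (the new post text is made up for the example):

<code snippet>

>>> import scipy as sp
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1, stop_words='english')
>>> X_train = vectorizer.fit_transform(content)          # the two "Big Data hype" posts above
>>> new_post = "big data is mostly hype"                 # hypothetical post to compare
>>> new_vec = vectorizer.transform([new_post])           # reuse the fitted vocabulary, do not refit
>>> for i in range(X_train.shape[0]):
...     dist = sp.linalg.norm((new_vec - X_train.getrow(i)).toarray())
...     print("Post %d, distance = %.2f" % (i, dist))    # smaller distance = more similar post
...

</code snippet>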

I would highly recommend the book "Building Machine Learning Systems with Python", available on Packt Publishing or on Amazon.

Posted in Machine Learning, Predictive Model, Python | Leave a Comment »

An indispensable Python: Data sourcing to Data science.

Posted by datumengineering on August 27, 2013

The data analysis ecosystem has grown all the way from SQL to NoSQL, and from Excel analysis to visualization. Today, we are short of resources to process ALL (you know what I mean by ALL) kinds of data coming into the enterprise. Data goes through profiling, formatting, munging or cleansing, pruning and transformation steps on its way to analytics and predictive modeling. Interestingly, no single tool has proved to be an effective solution for running all these operations { don't forget the cost factor here 🙂 }. Things become challenging when we mature from aggregated/summarized analysis to data mining, mathematical modeling, statistical modeling and predictive modeling. A pinch of complication is added by Agile implementation.

Enterprises have to work out a solution: "one which helps build data analysis (rather, analytics) the Agile way over all complex data structures, whether SQL or NoSQL, and in support of data mining activities".

So, let's look at Python and its ecosystem (I would prefer to call the Python libraries an ecosystem in their own right) and how it can cover an enterprise's a*s for data analysis.

Python: a functional, object-oriented programming language and, most importantly, super easy to learn. Any home-grown programmer with little or minor knowledge of programming fundamentals can start on Python programming at any time. Python has a rich library framework; even the old guard can dare to start programming in Python. The following data structures and functions can be explored for implementing various mathematical algorithms like recommendation engines, collaborative filtering, k-means, clustering and Support Vector Machines (see the sketch after this list):

  • Dictionary.
  • Lists.
  • String.
  • Sets.
  • map(), reduce().
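As a hedged illustration of how far those plain built-ins can take you, here is a tiny collaborative-filtering-style sketch using a dictionary, sets and map()/reduce(); the users and ratings are invented for the example:

<code snippet>

>>> from functools import reduce   # on Python 3, reduce() lives in functools
>>> # hypothetical user -> item ratings, purely illustrative
>>> ratings = {
...     "alice": {"book_a": 5, "book_b": 3, "book_c": 1},
...     "bob":   {"book_a": 4, "book_c": 2},
... }
>>> common = set(ratings["alice"]) & set(ratings["bob"])   # items both users rated
>>> # squared rating differences over the common items, via map()
>>> diffs = map(lambda item: (ratings["alice"][item] - ratings["bob"][item]) ** 2, common)
>>> distance = reduce(lambda a, b: a + b, diffs, 0)        # summed with reduce()
>>> print(distance)   # smaller value = more similar taste

</code snippet>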

Python Ecosystem for Data Science:

Let's begin with sourcing data, bringing it into a dataset format, and shaping it.

{ Pandas: Data loading, Cleansing, Summarization, Joining, Time Series Analysis }

Pandas: data analysis wrapped up in a Python library. It has most of the things you look for to run quick analysis. DataFrames, join, merge and group-by are the built-ins available to run SQL-like analysis on data coming from CSV files (the read_csv function). To install Pandas you need to have NumPy installed first.
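A minimal sketch of that SQL-like workflow; the file names and column names (sales.csv, regions.csv, region, amount) are assumptions made up for the example:

<code snippet>

>>> import pandas as pd
>>> sales = pd.read_csv("sales.csv")                      # hypothetical transactions file
>>> regions = pd.read_csv("regions.csv")                  # hypothetical lookup table
>>> joined = pd.merge(sales, regions, on="region")        # SQL-style join on a shared column
>>> summary = joined.groupby("region")["amount"].sum()    # SQL-style GROUP BY ... SUM()
>>> print(summary.head())

</code snippet>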

{ NumPy: Data Array, Vectorization, matrix and Linear algebra operations i.e. mathematical modeling }

NumPy: a rich set of functions for array, matrix and vector operations. Indexing, slicing and stacking are prominent NumPy functionality.
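A small sketch of the indexing, slicing and stacking NumPy is known for; the arrays are arbitrary examples:

<code snippet>

>>> import numpy as np
>>> a = np.array([[1, 2, 3], [4, 5, 6]])
>>> print(a[:, 1])               # slicing: the second column of every row
>>> b = np.arange(6).reshape(2, 3)
>>> print(np.vstack((a, b)))     # stacking two arrays vertically
>>> print(a.dot(b.T))            # matrix product, i.e. basic linear algebra

</code snippet>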

{ SciPy: Mean, variance, skewness, kurtosis }

SciPy: used to run scientific analysis on the data. Statistics functions are located in the sub-package scipy.stats.
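A minimal sketch with scipy.stats on a synthetic normal sample; the sample itself is generated only for illustration:

<code snippet>

>>> import numpy as np
>>> from scipy import stats
>>> data = np.random.normal(loc=10, scale=2, size=1000)   # synthetic sample
>>> print(data.mean(), data.var())                        # mean and variance
>>> print(stats.skew(data), stats.kurtosis(data))         # skewness and kurtosis

</code snippet>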

{ Matplotlib: Graph, histograms, power spectra, bar charts, errorcharts, scatterplots }

Matplotlib: a 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
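A quick sketch of a histogram with matplotlib, again on invented data; the output file name is an assumption:

<code snippet>

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> data = np.random.normal(size=1000)   # synthetic data for the plot
>>> plt.hist(data, bins=30)              # histogram of the sample
>>> plt.xlabel("value")
>>> plt.ylabel("frequency")
>>> plt.savefig("histogram.png")         # or plt.show() for an interactive window

</code snippet>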

Moreover, how can we second Python's support for Big Data analytics and machine learning? The resources below can be utilized for various big data applications:

  • A lightweight Map-Reduce implementation written in Python: Octopy
  • HBase interaction using Python: happybase (a small sketch follows this list)
  • Machine learning algorithm implementations in Python: Scikit-learn. It is built on NumPy, SciPy and matplotlib.
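A hedged sketch of the happybase idea, assuming an HBase Thrift server is reachable on localhost and a table called "posts" already exists; both assumptions are made up for the example:

<code snippet>

>>> import happybase
>>> connection = happybase.Connection("localhost")       # assumes a running HBase Thrift server
>>> table = connection.table("posts")                    # hypothetical existing table
>>> table.put("row1", {"cf:title": "Big Data hype"})     # write one cell into column family "cf"
>>> print(table.row("row1"))                             # read the row back

</code snippet>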

Having said that, Python is capable enough to provide a way to implement data analysis algorithms, and hence to build your own data analysis framework.

Watch this space for implementations of various algorithms in Python under one umbrella, i.e. Python data analysis tools.

Posted in Big Data, Data Analysis, Predictive Model, Python, Statistical Model | Leave a Comment »