Datum Engineering !

An engineered artwork to make decisions..

Archive for the ‘Machine Learning’ Category

Warm-up exercise before data science.

Posted by datumengineering on October 18, 2013

Practicing data science is a long-term effort rather than learning a handful of skills. We ought to be academically prepared to take up this challenge. However, if you feel you have come a long way from your academic days but still have the zeal and passion to extract the oil from the data and close the data science skill gap, here are some warm-up tips. The points below should be exercised before jumping into any data science or data mining problem:

  • Come out of “table-row-column” mode and start looking at a data set more as a MATRIX and VECTORS.

Not all datasets are in the form of a data matrix. For instance, more complex datasets can be in the form of sequences, text, time series, images, audio, video, and so on, which may need special techniques for analysis. However, in many cases even if the raw data is not a data matrix, it can usually be transformed into that form via feature extraction. A practical example of feature extraction is explained in my last post on the scikit-learn library.

  • The number of attributes defines the dimensionality of the data matrix. Keep the dimensionality in mind when you think of any matrix operation.
  • Each row may be considered as a d-dimensional column vector (all vectors are assumed to be column vectors by default). You must also understand the terms row space and column space.
  • Treating data instances and attributes as vectors, and the entire dataset as a matrix, enables one to apply both geometric and algebraic methods to aid in the data mining and analysis tasks. At the very least, you should be aware of the unit vector, the identity matrix, and so on.
  • Clear the dust from your school learning of matrix manipulation, i.e. matrix addition, multiplication, transpose, inverse, etc. The same applies to some of the basic algebraic results, like the distance between two points and the Pythagorean (Pythagoras’) theorem.
  • A thorough understanding of matrix manipulation will help you implement the multiplications and summations of elements.
  • Leaving out probability is probably not a good idea. Run through some short probability problems and exercises before you go into the details of any supervised learning model.
  • You may need to practice the topics you conveniently skipped during school, like the orthogonal projection of a vector (projecting one vector onto another), the probabilistic view of data, and probability density functions. (I admit to avoiding these topics during graduation 🙂 )
  • Refresh yourself on the formulas of descriptive statistics: mean, median and mode through the normal distribution and skewness, and most importantly don’t forget to cover variance and standard deviation. You should be ready for basic statistical analysis of univariate and multivariate numeric data. Believe me, the choice of distance measure changes with the distribution of the data (use the Euclidean distance when the data is normally distributed, otherwise a Pearson correlation score).
  • Generalization, correlation and regression concepts are widely used across statistics and mathematical modeling, so these must be broadly rehearsed before you go into modeling techniques.
  • You must do some exercises on how to normalize a vector. Vector normalization is a must-know concept in prediction algorithms (see the short NumPy sketch after this list).
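To make the warm-up concrete, here is a minimal NumPy sketch. The tiny matrix `D` and every value in it are made up purely for illustration; it only touches the transpose, multiplication, inverse, vector normalization, and the Euclidean versus Pearson scores mentioned above.

<code snippet>

import numpy as np

# A tiny 3 x 2 data matrix: 3 instances (rows), 2 attributes (columns). Values are made up.
D = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Matrix manipulation warm-up: transpose, matrix product, inverse of a square matrix.
Dt = D.T                           # 2 x 3 transpose
gram = np.dot(Dt, D)               # 2 x 2 product D^T D
gram_inv = np.linalg.inv(gram)     # inverse exists here because D^T D is non-singular

# Vector normalization: scale an instance (row vector) to unit length.
x = D[0]
x_unit = x / np.linalg.norm(x)

# Two similarity/distance scores on a pair of instances.
a, b = D[0], D[1]
euclidean = np.linalg.norm(a - b)          # Euclidean distance
pearson = np.corrcoef(a, b)[0, 1]          # Pearson correlation coefficient

print(x_unit, euclidean, pearson)

</code snippet>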

” In fact, data mining is part of a larger knowledge discovery process, which includes pre-processing tasks like data extraction, data cleaning, data fusion, data reduction and feature construction. As well as post-processing steps like pattern and model interpretation, hypothesis confirmation and generation, and so on. This knowledge discovery and data mining process tends to be highly iterative and interactive. “

CRUX:  The algebraic, geometric and probabilistic viewpoints of data play a key role in data mining. Exercise them beforehand, so you can easily sail your boat through Data Science!

Posted in Data Analysis, Data Science, Machine Learning, Predictive Model, Statistical Model | Leave a Comment »

Python Scikit-learn to simplify Machine learning : { Bag of words } To [ TF-IDF ]

Posted by datumengineering on September 26, 2013

Text (word) analysis and tokenized text modeling always send a chill down the spine, especially when you are new to machine learning. Thanks to Python and its extended libraries for their warm support around text analytics and machine learning. Scikit-learn is a savior and an excellent support for text processing, provided you also understand concepts like “bag of words”, “clustering” and “vectorization”. Vectorization is a must-know technique for every machine learning learner, text miner and algorithm implementor. I personally consider it a revolution in analytical computation. Read one of my earlier posts about vectorization. Let’s look at the implementors of vectorization and try to zero in on the process of text analysis.

Fundamentally, before we start any text analysis we first need to tokenize every word in a given text, so that we can apply a mathematical model to these words. Once we tokenize the text, it can be transformed into a {bag of words} model for document classification. This {bag of words} model is used as a feature to train classifiers. We’ll observe in code how the feature and classifier terms can be explored and implemented using Scikit-learn. But before that, let us explore how to tokenize the text and bring it into vector shape. The {bag of words} representation follows a 3-step process: tokenizing, counting and finally normalizing the vector.

  • Tokenizing: tokenize strings and give an integer id to each possible token.
  • Counting: once tokenized, count the occurrences of tokens in each document.
  • Normalizing: normalize and weight the tokens, with diminishing importance for tokens that occur in the majority of samples/documents.

* The code below needs Python 2.7 or above, NumPy 1.3 or above and scikit-learn 0.14.  Obviously, all of this happens on Ubuntu 12.04 LTS.

Scikit’s functions and classes are imported via the sklearn package as follows:

<code snippet>

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)

</code snippet>

Here we do not have to write custom code for counting words and representing those counts as a vector; Scikit’s CountVectorizer does the job very efficiently, and it also has a very convenient interface. The parameter min_df determines how CountVectorizer treats words that are not used frequently (minimum document frequency). If it is set to an integer, all words occurring in fewer documents than that value will be dropped. If it is a fraction, all words that occur in less than that fraction of the overall dataset will be dropped. The parameter max_df works in a similar manner (see the short sketch below). Once we vectorize the posts using this feature-vector functionality, we’ll have two simple vectors. We can then calculate the Euclidean distance between these two vectors and pick the nearest one to identify similarities. This is nothing but a step towards clustering/classification of similar posts.
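As a quick illustration of how min_df and max_df prune the vocabulary, here is a small sketch. The toy documents are invented just to show the effect of the two parameters, so only the parameter behaviour matters here:

<code snippet>

from sklearn.feature_extraction.text import CountVectorizer

# Invented toy documents: 'big' and 'data' appear in every one, the rest appear once each.
docs = ["big data hype", "big data tools", "big data skills", "big data rare term"]

# min_df=2 as an integer: drop words that appear in fewer than 2 documents.
cv_min = CountVectorizer(min_df=2)
cv_min.fit(docs)
print(cv_min.get_feature_names())   # ['big', 'data']

# max_df=0.9 as a fraction: drop words that appear in more than 90% of the documents.
cv_max = CountVectorizer(max_df=0.9, min_df=1)
cv_max.fit(docs)
print(cv_max.get_feature_names())   # ['hype', 'rare', 'skills', 'term', 'tools']

</code snippet>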

Hold on, we haven’t reached the phase of implementing clustering algorithms yet. We need to move cautiously through the steps below to bring our raw text into a more meaningful {bag of words}. We also try to correlate some of the technical terms (in blue) with each step; a short scikit-learn sketch follows this list:

  1. Tokenizing the text. — vectorization and tokenizing
  2. Throwing away some less important words. — stop words
  3. Throwing away words that occur way too often to be of any help in detecting relevant posts. — max_df
  4. Throwing away words that occur so seldom that there is only a small chance that they occur in future posts.
  5. Counting the remaining words.
  6. Calculating TF-IDF values from the counts, considering the whole text corpus. — calculate TF-IDF
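Scikit-learn’s TfidfVectorizer rolls most of these steps into one object. Below is a small sketch under that assumption; the three-document corpus is made up just to exercise stop words, min_df/max_df and the TF-IDF weighting:

<code snippet>

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: 'big' and 'data' occur in every document, English stop words are sprinkled in.
corpus = ["big data hype keeps growing",
          "real big data solutions are in demand",
          "the hype around big data may fade"]

# One vectorizer covers the list above:
#   tokenizing                  -> built-in token pattern
#   less important words        -> stop_words='english'
#   words occurring too often   -> max_df
#   words occurring too seldom  -> min_df
#   counting + TF-IDF weighting -> done internally instead of returning raw counts
tfidf = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=1)
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names())
print(X.toarray())                 # one row of TF-IDF weights per document

</code snippet>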

With this process, we’ll be able to convert a bunch of noisy text into a concise representation of feature values. Hopefully you’re already familiar with the term TF-IDF. If not, the explanation below will help build an understanding of it:

When we use feature extraction and vectorize the text, the resulting feature values simply count occurrences of terms in a post. We silently assumed that higher values for a term also mean that the term is of greater importance to the given post. But what about, for instance, the word “subject”, which naturally occurs in each and every post? Alright, we could tell CountVectorizer to remove it by means of its max_df parameter. We could, for instance, set it to 0.9 so that all words occurring in more than 90 percent of all posts would always be ignored. But what about words that appear in 89 percent of all posts? How low would we be willing to set max_df? The problem is that however we set it, some terms will always be more discriminative than others. This can only be solved by counting term frequencies for every post and, in addition, discounting those terms that appear in many posts. In other words, we want a high value for a given term in a given post if that term occurs often in that particular post and very rarely anywhere else. This is exactly what term frequency – inverse document frequency (TF-IDF) does.
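To pin down the idea, here is a hand-rolled sketch of one common (simplified) TF-IDF definition. The tiny documents are made up, and scikit-learn’s own implementation uses a smoothed variant, so treat this only as an illustration of the intuition above:

<code snippet>

import math

def tfidf(term, doc, corpus):
    # Term frequency: how often the term occurs in this particular document.
    tf = float(doc.count(term)) / len(doc)
    # Inverse document frequency: discount terms that appear in many documents.
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(float(len(corpus)) / docs_with_term)
    return tf * idf

doc_a, doc_b, doc_c = ["a"], ["a", "b", "b"], ["a", "b", "c"]
corpus = [doc_a, doc_b, doc_c]

print(tfidf("a", doc_a, corpus))   # 0.0  -> 'a' occurs in every document, so it is not discriminative
print(tfidf("b", doc_b, corpus))   # ~0.27 -> frequent in doc_b, absent from doc_a
print(tfidf("c", doc_c, corpus))   # ~0.37 -> 'c' appears in only one document, highest IDF

</code snippet>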

So, continuing from the previous code where we imported the CountVectorizer library to vectorize and tokenize the text, in the example below we are going to compare a new “Big Data Hype” post with 2 different posts published about the “hype” of “Big Data”. To do this we first need to vectorize the two existing posts and then get the new post vectorized using the same scikit method. Once we have the vectors, we can calculate the distance of the new post (a sketch of that step follows the snippet). This code snippet ONLY covers vectorizing and tokenizing the text.

<code snippet>

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> content = ["Bursting the Big Data bubble starts with appreciating certain nuances about its products and patterns","the real solutions that are useful in dealing with Big Data will be needed and in demand even if the notion of Big Data falls from the height of its hype into the trough of disappointment"]

>>> vectorizer = CountVectorizer(min_df=1)
>>> X = vectorizer.fit_transform(content)

>>> print(vectorizer)
CountVectorizer(analyzer=word, binary=False, charset=None, charset_error=None,
        decode_error=strict, dtype=<type 'numpy.int64'>, encoding=utf-8,
        input=content, lowercase=True, max_df=1.0, max_features=None,
        min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=(?u)\b\w\w+\b, tokenizer=None,
        vocabulary=None)

>>> vectorizer.get_feature_names()
[u'about', u'and', u'appreciating', u'are', u'be', u'big', u'bubble', u'bursting', u'certain', u'data', u'dealing', u'demand', u'disappointment', u'even', u'falls', u'from', u'height', u'hype', u'if', u'in', u'into', u'its', u'needed', u'notion', u'nuances', u'of', u'patterns', u'products', u'real', u'solutions', u'starts', u'that', u'the', u'trough', u'useful', u'will', u'with']

>>> X_train = vectorizer.fit_transform(content)
>>> num_samples, num_features = X_train.shape
>>> print("#samples: %d, #features: %d" % (num_samples, num_features))
#samples: 2, #features: 37

>>> vectorizer = CountVectorizer(min_df=1, stop_words='english')
…….

…….

</code snippet>
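To complete the picture, here is a hedged continuation of the snippet above: it reuses the `content` list, refits the stop-word-aware vectorizer, transforms a made-up `new_post` about Big Data hype, and measures the normalized Euclidean distance to each of the two published posts. The helper name `dist_norm` is my own, not part of scikit-learn.

<code snippet>

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# `content` is the list of two Big Data posts defined in the snippet above; `new_post` is made up.
new_post = "Big Data hype"

vectorizer = CountVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(content)       # vectorize the two existing posts
new_post_vec = vectorizer.transform([new_post])   # vectorize the new post with the same vocabulary

def dist_norm(v1, v2):
    # Normalize both count vectors to unit length so that post length does not dominate.
    v1_n = v1.toarray() / np.linalg.norm(v1.toarray())
    v2_n = v2.toarray() / np.linalg.norm(v2.toarray())
    return np.linalg.norm(v1_n - v2_n)

for i in range(X_train.shape[0]):
    print("Post %d distance from the new post: %.2f" % (i, dist_norm(X_train.getrow(i), new_post_vec)))

</code snippet>

The post with the smallest distance is the most similar one, which is exactly the stepping stone towards the clustering/classification of similar posts mentioned earlier.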

I would highly recommend the book “Building Machine Learning Systems with Python”, available on Packtpub or on Amazon.

Posted in Machine Learning, Predictive Model, Python | Leave a Comment »