Datum Engineering !

An engineered artwork to make decisions..

Archive for October, 2013

Warm-up exercise before data science.

Posted by datumengineering on October 18, 2013

Practicing data science is a long-term effort rather than a matter of learning a handful of skills. We ought to be academically strong enough to take up this challenge. However, if you have come a long way from your academic days but still have the zeal and passion to extract the oil from the data and fill the data science skill gap, then here are some warm-up tips. The points below should be exercised before jumping into any data science or data mining problem:

  • Come out of “table-row-column” mode and start looking at a data set more as a MATRIX and VECTORS.

Not all datasets are in the form of a data matrix. For instance, more complex datasets can be in the form of sequences, text, time-series, images, audio, video, and so on, which may need special techniques for analysis. However, in many cases, even if the raw data is not a data matrix, it can usually be transformed into that form via feature extraction. A practical example of feature extraction is explained in my last post on the scikit-learn library.
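To make the idea of feature extraction concrete, here is a minimal sketch that turns a few raw text strings into a data matrix of word counts. It uses only the Python standard library; the function name `build_data_matrix`, the sample sentences, and the naive whitespace tokenization are assumptions made for illustration, not the approach from the scikit-learn post.

```python
from collections import Counter


def build_data_matrix(documents):
    """Turn raw text documents into an n x d matrix of word counts."""
    # Naive tokenization (an assumption for brevity): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The sorted vocabulary defines the d attributes, i.e. the columns.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document: how often each vocabulary word occurs in it.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical sample data.
docs = ["data is the new oil", "oil and data", "data data data"]
vocabulary, X = build_data_matrix(docs)

print(vocabulary)
for row in X:
    print(row)
```

Each document becomes one row and each distinct word becomes one column, so three pieces of text end up as a 3 x 6 matrix that ordinary matrix operations can work on.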

  • The number of attributes defines the dimensionality of the data matrix. Keep the dimensionality in mind whenever you think of any matrix operation.
  • Each row may be considered a d-dimensional column vector (all vectors are assumed to be column vectors by default). You must also understand the terms row space and column space.
  • Treating data instances and attributes as vectors, and the entire dataset as a matrix, enables one to apply both geometric and algebraic methods to aid in the data mining and analysis tasks. At the very least, you must be aware of concepts such as the unit vector and the identity matrix.
  • Clear the dust off your school learning about matrix manipulation, i.e. matrix addition, multiplication, transpose, inverse, etc. The same applies to basic formulas such as the distance between two points and the Pythagorean theorem.
  • A thorough understanding of matrix manipulation will help you implement the multiplication and summation of elements.
  • Skipping probability is probably not a good idea. Run through some short probability problems and exercises before you go into the details of any supervised learning model.
  • You may need to practice the topics that you might have skipped in school, such as: orthogonal projection of a vector (projecting one vector onto another), the probabilistic view of the data, and the probability density function. (I admit to having avoided these topics during my graduation 🙂 )
  • Get comfortable with all the formulas of descriptive statistical analysis, from mean, median, and mode to the normal distribution and skewness, and most importantly don’t forget to cover variance and standard deviation. You should be ready with basic statistical analysis of univariate and multivariate numeric data. Believe me, the choice of distance measure changes with the distribution of the data. (As a rule of thumb, use a Euclidean distance score when the data is normally distributed, otherwise a Pearson coefficient score.)
  • Generalization, correlation, and regression concepts are widely used across statistics and mathematical modeling, so they must be thoroughly rehearsed before you go into modeling techniques.
  • You must do some exercises on how to normalize a vector. Vector normalization is a must-know concept in prediction algorithms.
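As a quick warm-up for several of the points above, the sketch below works through vector length, normalization to a unit vector, orthogonal projection, Euclidean distance, and the Pearson correlation score, using only the Python standard library. The helper names and the sample vectors are assumptions chosen for illustration; in real work a numerical library would normally do this for you.

```python
import math
import statistics


def dot(u, v):
    """Dot product: multiply element by element, then sum."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Euclidean length of a vector (Pythagorean theorem in d dimensions)."""
    return math.sqrt(dot(v, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = norm(v)
    return [a / length for a in v]


def project(u, onto):
    """Orthogonal projection of u onto the non-zero vector `onto`."""
    scale = dot(u, onto) / dot(onto, onto)
    return [scale * a for a in onto]


def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(u, v):
    """Pearson correlation: cosine of the angle between the mean-centered vectors."""
    mean_u, mean_v = statistics.mean(u), statistics.mean(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return dot(centered_u, centered_v) / (norm(centered_u) * norm(centered_v))


# Hypothetical sample vectors.
x = [3.0, 4.0]
y = [4.0, 0.0]

print(norm(x))                   # 5.0
print(normalize(x))              # unit vector in the direction of x
print(project(x, y))             # [3.0, 0.0]
print(euclidean_distance(x, y))  # square root of 17

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
print(statistics.mean(a), statistics.pstdev(a))  # mean and population standard deviation
print(pearson(a, b))             # close to 1: b is a scaled copy of a
```

Note that `a` and `b` are far apart in Euclidean terms yet perfectly correlated, which is a small reminder that the choice of distance or similarity score changes what "similar" means.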

“In fact, data mining is part of a larger knowledge discovery process, which includes pre-processing tasks like data extraction, data cleaning, data fusion, data reduction and feature construction, as well as post-processing steps like pattern and model interpretation, hypothesis confirmation and generation, and so on. This knowledge discovery and data mining process tends to be highly iterative and interactive.”

CRUX: The algebraic, geometric, and probabilistic viewpoints of data play a key role in data mining. Exercise them beforehand so you can easily sail your boat through data science!


Posted in Data Analysis, Data Science, Machine Learning, Predictive Model, Statistical Model