I have been exploring data analysis and modeling techniques for months. Many topics float around in this space, such as statistical modeling and predictive modeling. I always had questions in mind: which technique should I choose? Which is the preferred way to analyze data? Some articles and lectures promote machine learning or mathematical modeling by pointing to the limitations of statistical modeling. They present mathematical modeling as the next step in accuracy and prediction. Articles of this kind raise more questions in the mind of a new user.
Finally, I would like to thank coursera.org for eliminating this confusion and giving a clear picture of what drives a data analysis. Now things are pretty clear in terms of how to proceed with a data analysis, or rather, how to define the “DATA ANALYSIS DRIVERS”. In one line, the answer is simple: “Define the question or problem correctly”. So it all depends on how you define the problem.
To start with the data analysis drivers, here are the steps in a data analysis:
- Define the question
- Define the ideal data set
- Determine what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction/modeling
- Interpret results
- Challenge and optimize the results
- Synthesize/write up results
- Create reproducible code
- Defining the question means understanding how the business problem is stated and how you tell the story of that problem. Telling the story of the problem leads you to the structure of the solution, so you should be good at storytelling around the problem statement.
- Defining the solution helps you prepare the data (the data set) that the solution needs.
- Profile the sources to identify what data you can access.
- The next step is cleansing the data.
- Once the data is cleansed, it is typically in one of the following standard formats: txt, csv, xml/html, json, or a database.
- Based on what the solution needs, we start building the model. Specifically, the solution will require descriptive, inferential, or predictive analysis.
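As a rough illustration of the obtain-and-clean steps, here is a minimal Python sketch that reads records from two of the formats listed above (csv and json) and applies a simple cleaning pass. The sample data, the field names `name` and `age`, and the helper functions are all made up for this example; a real project would read from its own files and apply its own cleaning rules.

```python
import csv
import io
import json

# Hypothetical sample data; in practice this would come from files or an API.
CSV_TEXT = "name,age\n Alice ,34\nBob,\nCarol,29\n"
JSON_TEXT = '[{"name": "Dave", "age": 41}, {"name": "Eve", "age": null}]'


def load_csv(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def clean(records):
    """Trim names, drop rows with a missing name or age, and cast age to int."""
    cleaned = []
    for row in records:
        name = str(row.get("name") or "").strip()
        age = row.get("age")
        if not name or age is None or str(age).strip() == "":
            continue  # assumption: incomplete rows are dropped, not imputed
        cleaned.append({"name": name, "age": int(age)})
    return cleaned


records = clean(load_csv(CSV_TEXT) + load_json(JSON_TEXT))
print(records)
```

Dropping incomplete rows is only one possible cleaning choice; depending on the question you defined, filling in or flagging missing values may be more appropriate.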
Hence, the data set and model depend on your goal:
- Descriptive – a whole population.
- Exploratory – a random sample with many variables measured.
- Inferential – the right population, randomly sampled.
- Predictive – a training and test data set from the same population.
- Causal – data from a randomized study.
- Mechanistic – data about all components of the system.
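For the predictive case, the training and test sets should come from the same population. One simple way to get that is to shuffle the data once and split it. The sketch below uses only Python's standard library; the function name `train_test_split`, the 25% test fraction, and the fixed seed are assumptions chosen for illustration, not something prescribed by the course.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical population of 100 observations.
population = list(range(100))
train, test = train_test_split(population)
print(len(train), len(test))
```

Because both parts are drawn at random from the same list, neither set is systematically different from the other, and the fixed seed supports the "create reproducible code" step.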
From here on, knowledge of statistics, machine learning, and mathematical algorithms comes into play 🙂