Most data warehouse folks are well acquainted with the term “Capacity Planning” (read Inmon). It is a widely used process among DBAs and data warehouse architects. In a typical data management and warehousing project, a wide variety of people are involved in driving capacity planning: everyone from the Business Analyst to the Architect, the Developer, the DBA, and finally the Data Modeler.
This practice has a wide audience in the typical data warehouse world, so how is it being driven in Big Data? I have hardly heard any noise about it in Hadoop-driven projects that started with the intention of handling growing data. I have met the pain bearers, DBAs and architects who face challenges at every stage of data management once the data outgrows the system. They are the main players who advocate bringing in Hadoop ASAP. The crux of their problem is not the growing data. The problem is that they never had a mathematical calculation to back up the growth rate. All we talk about is: by what percentage is it growing? Most of the time that percentage also comes from experience 🙂
Capacity planning should explore more than just a percentage and experience.
- It should be a mathematical calculation of every byte from the data sources coming into the system.
- How about designing a predictive model that forecasts my data growth, with a stated accuracy, up to 10 years out?
- How about involving the business to confirm the data growth drivers and the feasibility of data sources yet to be born?
- Why not factor compression and purging into the calculation, to reclaim space for data growth?
- Why do we consider only disk utilization, with no consideration of other hardware resources like memory, processor, and cache? After all, this is a data processing environment.
- I think this list of considerations can still grow….
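To make the first and fourth points concrete, here is a minimal sketch of a per-source storage calculation that accounts for growth, compression, and purging. Everything in it is my own assumption for illustration: the `DataSource` fields, the idea of a constant monthly growth rate, and a single compression ratio per source. Real feeds rarely behave this neatly, so treat it as a starting point rather than a finished model.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """One feed into the system. All fields are hypothetical, for illustration."""
    name: str
    bytes_per_month: float    # raw ingest during the first month
    monthly_growth: float     # 0.02 means ingest grows 2% month over month
    compression_ratio: float  # stored size / raw size, e.g. 0.25
    retention_months: int     # data older than this is purged


def stored_bytes(source: DataSource, month: int) -> float:
    """Bytes on disk for one source at the end of `month` (1-based)."""
    # Purging: only the most recent `retention_months` months are still kept.
    first_kept = max(1, month - source.retention_months + 1)
    raw = sum(
        source.bytes_per_month * (1 + source.monthly_growth) ** (m - 1)
        for m in range(first_kept, month + 1)
    )
    # Compression shrinks what actually lands on disk.
    return raw * source.compression_ratio


def total_stored_bytes(sources, month: int) -> float:
    """Sum of stored bytes across every source at the end of `month`."""
    return sum(stored_bytes(s, month) for s in sources)
```

Calling `total_stored_bytes(sources, 60)` with your own list of sources gives a five-year disk estimate. The same per-source breakdown could be extended to memory or processor demand, which this sketch deliberately leaves out.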
I know building robust capacity planning is not a task of a day or a month. One to two years of data is good enough to understand the trend and develop an algorithm around it. Treat those 1-2 years as the learning data set, hold back a few months of it to check the model against, and start analyzing the trend and building a model that can predict growth beyond the 3rd or 4th year. That matters because, according to the data warehouse gurus, the bleeding starts after the 5th year.
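As a rough illustration of that idea, the sketch below fits a simple exponential trend to monthly sizes, checks it against the last few months it did not see, and then extrapolates. The exponential form and the function names are my assumptions, not a recommendation; a real model would likely need seasonality and business-driven step changes.

```python
import math


def fit_exponential(sizes):
    """Least-squares fit of log(size) = a + b * month_index; returns (a, b)."""
    n = len(sizes)
    xs = range(n)
    ys = [math.log(s) for s in sizes]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, month_index):
    """Predicted size at a 0-based month index under the fitted trend."""
    return math.exp(a + b * month_index)


def holdout_error(sizes, holdout):
    """Fit on all but the last `holdout` months and return the mean
    absolute percentage error on those held-back months."""
    train, test = sizes[:-holdout], sizes[-holdout:]
    a, b = fit_exponential(train)
    start = len(train)
    errors = [
        abs(predict(a, b, start + i) - actual) / actual
        for i, actual in enumerate(test)
    ]
    return sum(errors) / len(errors)


# Synthetic, made-up history: 24 months starting at 10 TB, growing 3% a month.
history_tb = [10.0 * 1.03 ** m for m in range(24)]
a, b = fit_exponential(history_tb)
print("hold-out error on last 6 months:", holdout_error(history_tb, 6))
print("projected size at end of year 5 (TB):", predict(a, b, 59))
```

If the hold-out error is large, that is a sign the trend is not a plain exponential and the growth drivers need another look before trusting any five-year number.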
I’ll leave it up to you to design the solution and process for capacity planning, so you can claim your DATA as BIG DATA.
Remember, disk space is cheap, but disk seeks are not.