It is always good to understand “Whats going on in background?”. My first practice to understand any database is to go in detail of DB architecture. Because every database has its own flavor of benefits which is unique. Now it is challenging with Hadoop when things become interesting when I need to go to in detail architecture focussed around the ” file system “. Hadoop made me to do this exercise …..
I have been trying my best to go in detail of this giant to understand “How” the things are moving underneath. There had been lot of questions blinked in mind about internal architecture and flow. Finally to conclude this study, I have set up Cloudera Distribution Hadoop on my Ubuntu in Pseudo-distributed environment on single node. During this configuration and installation on single node i have started analyzing and sketching the diagram.
My HDFS directory is on hdfs://localhostand:8020/tmp and related local file system is on /app/hadoop/tmp. I have preferred cloudera distribution compare to the apache directly. This is a first step, eventually i am trying to expand it towards Hive, HBase, Zookeeper on top of this diagram.
I look at complete Hadoop job as a function of three main component: configuration, services and metastore.
In reverse order of above list, First and foremost player is metastore: Namenode, Secondary Namenode.
Second player is: Job tracker and Task trackers, who actually are the players.
And finally, a configuration which tell first 2 players how and where to run.
Disclaimer :- Diagram is completely my own understanding. Your catch and corrections are welcome.