Datum Engineering !

An engineered artwork to make decisions.

Posts Tagged ‘HDFS’

HDFS & Cloudera Hadoop – What, How and Where?

Posted by datumengineering on April 3, 2012

It is always good to understand what is going on in the background. My first step with any database is to go through its architecture in detail, because every database has its own unique set of strengths. Things become interesting, and challenging, with Hadoop when that detailed architecture is focused around the file system. Hadoop made me do this exercise...

What, How, Where?

I have been trying my best to dig into the details of this giant and understand how things move underneath. A lot of questions about the internal architecture and flow kept coming to mind. To conclude this study, I finally set up the Cloudera Distribution of Hadoop (CDH) on my Ubuntu machine in a pseudo-distributed, single-node environment. During this configuration and installation I started analyzing and sketching the diagram.

My HDFS directory is hdfs://localhost:8020/tmp and the corresponding local file system directory is /app/hadoop/tmp. I preferred the Cloudera distribution over stock Apache Hadoop. This is a first step; eventually I plan to expand the diagram to cover Hive, HBase and ZooKeeper on top of it.
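As a quick sanity check of this setup, the sketch below connects to that NameNode address and lists /tmp. This is only a minimal sketch, assuming the Hadoop/CDH client jars of that era are on the classpath; the class name is illustrative and the keys set here simply mirror the values mentioned above.

// Minimal sketch: connect to the pseudo-distributed HDFS described above and list /tmp.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode address from the single-node setup above.
        conf.set("fs.default.name", "hdfs://localhost:8020");
        // Local directory backing the pseudo-distributed file system.
        conf.set("hadoop.tmp.dir", "/app/hadoop/tmp");

        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), conf);
        for (FileStatus status : fs.listStatus(new Path("/tmp"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}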

I look at a complete Hadoop job as a function of three main components: configuration, services and the metadata store.

Taking that list in reverse order, the first and foremost player is the metadata store: the NameNode and Secondary NameNode.

The second players are the JobTracker and TaskTrackers, the services that actually do the work.

And finally, the configuration, which tells the first two players how and where to run.

Disclaimer: the diagram reflects my own understanding. Catches and corrections are welcome.

Posted in Hadoop

Provision for small file processing in HDFS

Posted by datumengineering on March 7, 2012

Hadoop is meant for mammoth file processing. Although that is the ideal case, Hadoop does have a provision for processing small files: it introduces a big container that holds small files for further processing. These big containers are intended for processing small-file data within the MapReduce model, and in HDFS such a container is called a sequence file.

A sequence file holds each small file as a whole record. As the MapReduce model expects, it stores data as {key, value} pairs: the name of the small file can be the key and the content of the file becomes the value. Once the files are stored in a sequence file they can be read from, and written back to, HDFS. Writing data to a sequence file is a matter of writing key and value pairs; the details depend on the serialization you use. Reading is similar to iterating over a collection: the next() method accepts a key and a value object, reads the next key and value in the stream into those variables, and returns false once it reaches EOF. Again, the details depend on the serialization in use. This feature of HDFS gives us an opportunity to process millions of small files together as a sequence file.
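To make the read/write flow concrete, here is a minimal sketch using the classic (pre-2.x) SequenceFile API, where file names are the keys and file contents are the values. The paths, class name and sample records are purely illustrative.

// Sketch: pack small files into a sequence file and read them back.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path seqPath = new Path("/tmp/smallfiles.seq");

        // Write: one record per small file, {file name, file content}.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, seqPath, Text.class, Text.class);
        try {
            writer.append(new Text("file1.txt"), new Text("contents of file1"));
            writer.append(new Text("file2.txt"), new Text("contents of file2"));
        } finally {
            IOUtils.closeStream(writer);
        }

        // Read: next() fills key and value until it returns false at EOF.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqPath, conf);
        try {
            Text key = new Text();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}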

The structure of a sequence file is pretty simple. It has a header, which holds metadata and compression details for the stored records, followed by the records themselves. Each record contains a whole small file along with the record length, key length, key (the file name) and value (the file content). The internal format of the records depends on whether record or block compression is used: record compression compresses only the value (the file content), whereas block compression compresses a number of records together. Block compression is therefore more effective and usually preferred.
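For illustration, the same classic API lets you ask for block compression when the writer is created. This fragment reuses fs, conf and the imports from the sketch above; the path is again a placeholder.

// Sketch: create a block-compressed sequence file (compresses many records together).
SequenceFile.Writer blockWriter = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/smallfiles-block.seq"),
        Text.class, Text.class, SequenceFile.CompressionType.BLOCK);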

Another form of sequence file is the map file. A map file is a sorted sequence file, sorted on the key, with an index that allows lookups by key. This helps the MapReduce model improve performance over a plain sequence file.
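A sketch of the map file variant follows: keys must be appended in sorted order, and MapFile maintains an index that supports direct lookups by key. The directory, class name and sample entries are illustrative.

// Sketch: write a map file and look up a single small file by name.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SmallFileLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "/tmp/smallfiles.map";   // a map file is a directory holding data + index

        // Write entries in key-sorted order (a MapFile requirement).
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
        try {
            writer.append(new Text("a.txt"), new Text("contents of a"));
            writer.append(new Text("b.txt"), new Text("contents of b"));
        } finally {
            writer.close();
        }

        // Use the index to look up one file by name.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        try {
            Text value = new Text();
            if (reader.get(new Text("b.txt"), value) != null) {
                System.out.println("b.txt -> " + value);
            }
        } finally {
            reader.close();
        }
    }
}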

With the sequence file and map file framework, Hadoop has made it feasible to process millions of small files together. So should we say that HDFS is not just about handling big data files, but is also capable of processing small files, and doing so efficiently within the MapReduce model?

Any thoughts or use cases you can suggest here?

Posted in Hadoop, Map Reduce

Hadoop Configuration Simplified: Master-Slave Architecture

Posted by datumengineering on February 11, 2012

Master-Slave: a Hadoop cluster has simple components that can be divided into master and slave roles. Master: NameNode, Secondary NameNode and JobTracker. Slave: DataNode and TaskTracker.

Following are the key points of configuration:

1. The NameNode always runs on the local (master) machine from which the Hadoop cluster is configured and started.

2. Entries for the master and slave machines live in the masters and slaves files respectively. The masters file resolves to the host(s) for the secondary NameNode, while the NameNode and JobTracker hosts come from the configuration files.

3. All configuration can be categorized into four aspects:

– Environment variable configuration
– Hadoop-HDFS configuration
– Master-Slave configuration
– Map Reduce configuration

I) Environment configuration: sets up all the environment variables needed to run the Hadoop scripts.

{Files: hadoop-env.sh}

II) Hadoop-HDFS configuration: this includes the host entries for the NameNode and JobTracker, which the secondary NameNode, DataNodes and TaskTrackers use to find their masters, along with other HDFS and MapReduce settings (a small sketch of reading these values follows the file list below).

{Files: core-site.xml, hdfs-site.xml, mapred-site.xml}
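As a small sketch of how these files are consumed, the values they define can be read back through the Configuration class. The keys printed here are the classic Hadoop 1.x / CDH3-era names; everything else is illustrative.

// Sketch: load the site files and print a few classic configuration keys.
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        conf.addResource("hdfs-site.xml");
        conf.addResource("mapred-site.xml");

        System.out.println("NameNode:    " + conf.get("fs.default.name"));
        System.out.println("JobTracker:  " + conf.get("mapred.job.tracker"));
        System.out.println("Replication: " + conf.get("dfs.replication", "3"));
    }
}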

III) Text file configuration: there are two plain-text files in the Hadoop configuration, one for the master and one for the slaves. The masters file holds the host entry for the master node, and the slaves file holds the host entries for all the slave machines where the DataNode and TaskTracker will run (a hypothetical example follows the file list below).

{Files: masters, slaves}
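For example (host names here are purely hypothetical), the masters file might contain a single line such as master01, while the slaves file lists one host per line, e.g. slave01, slave02 and slave03.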

IV) Hadoop metrics and job log configuration: these settings capture all the metrics and logs related to MapReduce jobs.

{Files: hadoop-metrics.properties, log4j.properties}

In brief, this is how the different components function together:

The NameNode and JobTracker start on the local machine; the secondary NameNode starts on each machine listed in the masters file; and, eventually, a TaskTracker and DataNode start on each machine listed in the slaves file.
In a cluster environment the NameNode, secondary NameNode and JobTracker run on a single machine as the master node; in a large cluster they can be separated. When the NameNode and JobTracker run on separate nodes, their slaves files should be kept in sync.

Posted in Hadoop, HDFS