Datum Engineering !

An engineered artwork to make decisions..

Provision of small file processing in HDFS

Posted by datumengineering on March 7, 2012

Hadoop meant ONLY for mammoth file processing. Though this has been ideal condition but Hadoop do have provision to process small files. Hadoop introduced a Big container to hold small files for further processing. These big containers intended for processing small files data in Map Reduce model. In HDFS these containers are termed as a Sequence file.

These sequence files hold small files as a whole record. However as Map Reduce model expected, it stores data in {Key,Value} pair. File name of the smaller file can be key and the content of the file becomes value. Once the files stored in Sequence file it can be read and write back to HDFS.Writing the data for Sequence file is matter of writing Key and value pair. It depend of the kind of serialization you use. Read process is similar to the collection processing where you define next() method which accept a key and value pair, and reads the next key and value in the stream in the variable. It process until it reaches to EOF and next() method returns false. Again you need to go in detail of kind of serialization you are using here. This unique feature of HDFS given an opportunity to process million small files together as a Sequence file.

Structure of Sequence file is pretty simple. It has header, which hold metadata and compression details for the files stored and the record. Record contains the whole file in it along with the key length, key name and value (i.e. file content/data). The internal format of the records depends on record/block compression. Record compression is just compress the file content (i.e. value), however block compression method compresses number of records. Hence block compression is more meaningful and preferred.

Another form of Sequence file is Map file. Map file is sorted sequence file which is sort on the key with an index to perform lookup on the key. This helps map reduce model to improve the performance of the sequence file.

With this kind of framework of Sequence file & Map file Hadoop has opened feasibility to process millions of small files together. So should we say that HDFS is not just a matter of handling Big data files but it does have capability to process small files too, that also efficiently within Map Reduce model?

Any thought or use case you can suggest here?

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: