Datum Engineering !

An engineered artwork to make decisions.

Archive for the ‘Map Reduce’ Category

MR => Techniques Of Map To Reduce Efficiently

Posted by datumengineering on October 27, 2012

The concept of MapReduce over HDFS is fairly simple, with the aim of reducing the calculation burden on the network. The reducer has to run the calculation over the data, so as an MR programmer it is your responsibility to structure the computation in such a way that the reducer has very little work to do across the nodes, and hence fewer keys to reduce over the network.

class Mapper
    method Map(key a; val v)
        for all term t in val v do
            Emit(term t; count 1)

class Reducer
    method Reduce(term t; counts [c1; c2; ...])
        sum <- 0
        for all count c in counts [c1; c2; ...] do
            sum <- sum + c
        Emit(term t; count sum)
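
For reference, the word-count pseudocode above translates fairly directly to Hadoop's Java MapReduce API. This is a minimal sketch, with class names of my own choosing (following the stock Hadoop example):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: emits (term, 1) for every term in the input value.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each term.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}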

Let’s look at some practices for doing part of the reducer’s work locally:

Combiner: The mapper acts more like an assignee that just breaks the data apart locally and assigns a key and a value. This splitting is the first step of MR programming and happens locally on every node. Once the data is split, there is still an opportunity to aggregate it locally before it is aggregated over the network. If these splits are handed to the reducer directly, the reducer has a lot of work to do over the network. To overcome this, create reducer-like functionality on each individual node. The combiner provides exactly that: it acts locally as a reducer.

class Mapper
    method Map(string t; integer r)
        Emit(string t; pair (r; 1))

class Combiner
    method Combine(string t; pairs [(s1; c1); (s2; c2); ...])
        sum <- 0
        cnt <- 0
        for all pair (s; c) in pairs [(s1; c1); (s2; c2); ...] do
            sum <- sum + s
            cnt <- cnt + c
        Emit(string t; pair (sum; cnt))

class Reducer
    method Reduce(string t; pairs [(s1; c1); (s2; c2); ...])
        sum <- 0
        cnt <- 0
        for all pair (s; c) in pairs [(s1; c1); (s2; c2); ...] do
            sum <- sum + s
            cnt <- cnt + c
        r_avg <- sum / cnt
        Emit(string t; integer r_avg)
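
Registering a combiner is done in the job driver. Note that for the mean computation above the combiner must be a separate class, because its intermediate value is a (sum, count) pair and the reducer cannot simply be reused; for a plain sum such as word count it can. A minimal driver sketch, assuming the hypothetical WordCount classes from the earlier block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    // The combiner runs per map task, collapsing (term, 1) records before the shuffle.
    // Summation is associative and commutative, so the reducer doubles as the combiner here.
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}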

In-Mapper Combiner: In a typical scenario, key-value pairs are emitted from local memory as shown above. With the associative-array approach, however, the mapper can combine all occurrences of a key in an associative array and build a pair, so Map can emit one aggregated pair per key rather than the individual splits.

class Mapper
    method Initialize
        S <- new AssociativeArray
        C <- new AssociativeArray
    method Map(string t; integer r)
        S{t} <- S{t} + r
        C{t} <- C{t} + 1
    method Close
        for all term t in S do
            Emit(term t; pair (S{t}; C{t}))
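
In Hadoop's Java API this pattern is usually implemented with a HashMap held inside the Mapper: map() accumulates into it and cleanup() (the equivalent of Close above) emits one pair per distinct key. A minimal sketch for the word-count case; the class and field names are my own:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  // In-memory associative array (the S/C of the pseudocode, collapsed into a single map).
  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // Aggregate locally; nothing is emitted per record.
    for (String token : value.toString().split("\\s+")) {
      if (token.length() > 0) {
        Integer current = counts.get(token);
        counts.put(token, current == null ? 1 : current + 1);
      }
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit one aggregated (term, count) pair per distinct term seen by this map task.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}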

Before deciding on either method of aggregating data locally, keep the following points in mind:

  1. Hadoop considers the combiner on a case-by-case basis, so the combiner will not always run. I don’t know the exact reason WHY Hadoop behaves like this, but a famous textbook puts it this way: “The combiner is provided as a semantics-preserving optimization to the execution framework, which has the option of using it, perhaps multiple times, or not at all (or even in the reduce phase).”
  2. Whatever your mapper or combiner EMITs, whether individual records or pairs, must match the reducer’s input definition. What I mean here is that if the mapper or combiner emits its result as an associative array (a pair), then your reducer should take that associative array (pair) as its input.
  3. Just as the combiner carries the risk of not being used by Hadoop, the in-mapper combiner creates a memory bottleneck. There is additional overhead in the in-mapper combiner for memory management: a counter is required to keep track of memory use. So there is a fundamental scalability bottleneck associated with the in-mapper combining pattern: it critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split.
  4. One common solution to limiting memory usage with the in-mapper combining technique is to “block” input key-value pairs and “flush” the in-memory data structures periodically (see the sketch after this list). With that, memory management becomes a manageable challenge for the in-mapper combiner.
  5. The in-mapper combiner works with more complex keys and values, termed pairs and stripes.
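
A sketch of the “block and flush” idea from point 4, as a variant of the in-mapper combining mapper above. FLUSH_THRESHOLD is an assumed tuning knob, not a Hadoop setting; flushing early trades a little extra re-aggregation in the reducer for a hard cap on mapper memory:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BlockAndFlushMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int FLUSH_THRESHOLD = 100000;   // assumed limit on distinct in-memory keys
  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (token.length() > 0) {
        Integer current = counts.get(token);
        counts.put(token, current == null ? 1 : current + 1);
      }
    }
    if (counts.size() >= FLUSH_THRESHOLD) {
      flush(context);   // partial emit keeps memory bounded
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    flush(context);     // emit whatever remains at the end of the split
  }

  private void flush(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
    counts.clear();
  }
}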

There are further techniques, such as key partitioning, that also play a major role in achieving good performance.


Posted in Hadoop, Map Reduce

Provision of small file processing in HDFS

Posted by datumengineering on March 7, 2012

Hadoop is meant ONLY for mammoth file processing. Though that is the ideal condition, Hadoop does have a provision to process small files: it introduces a big container to hold small files for further processing. These big containers are intended for processing small-file data in the MapReduce model. In HDFS such a container is termed a sequence file.

A sequence file holds each small file as a whole record. As the MapReduce model expects, it stores data as {key, value} pairs: the file name of the small file can be the key, and the content of the file becomes the value. Once the files are stored in a sequence file, they can be read and written back to HDFS. Writing data to a sequence file is a matter of writing key-value pairs; the details depend on the kind of serialization you use. Reading is similar to processing a collection: you call a next() method that accepts a key and a value and reads the next key and value from the stream into those variables, returning false once it reaches EOF. Again, you need to look into the details of the serialization you are using. This feature of HDFS gives us the opportunity to process millions of small files together as a single sequence file.
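
As a concrete illustration, here is a minimal write/read sketch against the classic SequenceFile API. The path, the key/value choices (file name as Text, raw bytes as BytesWritable) and the sample record are assumptions for illustration only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path seqPath = new Path("/data/small-files.seq");   // assumed HDFS path

    // Write: one record per small file, key = file name, value = file contents.
    // Block compression compresses batches of records together.
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, seqPath,
        Text.class, BytesWritable.class, SequenceFile.CompressionType.BLOCK);
    try {
      byte[] contents = "example file body".getBytes("UTF-8");  // normally read from the small file
      writer.append(new Text("part-0001.txt"), new BytesWritable(contents));
    } finally {
      writer.close();
    }

    // Read: next() fills key and value and returns false at EOF.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqPath, conf);
    try {
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      while (reader.next(key, value)) {
        System.out.println(key + " -> " + value.getLength() + " bytes");
      }
    } finally {
      reader.close();
    }
  }
}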

The structure of a sequence file is pretty simple. It has a header, which holds metadata and compression details for the stored files, followed by the records. A record contains a whole small file along with the key length, the key name, and the value (i.e. the file content). The internal format of the records depends on whether record or block compression is used: record compression compresses only the file content (the value), whereas block compression compresses a number of records together. Hence block compression is more effective and is generally preferred.

Another form of the sequence file is the map file. A map file is a sequence file sorted on the key, with an index to perform lookups by key. This helps the MapReduce model improve on the performance of a plain sequence file.
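
A minimal sketch of writing a map file and doing an indexed get() lookup, using the classic MapFile API; the path and sample records are illustrative. Keys must be appended in sorted order, which is what makes the index lookup possible:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "/data/small-files.map";   // a map file is a directory holding a data file plus an index file

    // Write: keys must be appended in sorted order, because the index is built as records are added.
    MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, BytesWritable.class);
    try {
      writer.append(new Text("a.txt"), new BytesWritable("contents of a".getBytes("UTF-8")));
      writer.append(new Text("b.txt"), new BytesWritable("contents of b".getBytes("UTF-8")));
    } finally {
      writer.close();
    }

    // Lookup: get() uses the index to seek near the key instead of scanning the whole file.
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    try {
      BytesWritable value = new BytesWritable();
      if (reader.get(new Text("b.txt"), value) != null) {
        System.out.println("b.txt -> " + value.getLength() + " bytes");
      }
    } finally {
      reader.close();
    }
  }
}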

With this framework of sequence files and map files, Hadoop has opened up the feasibility of processing millions of small files together. So should we say that HDFS is not just about handling big data files, but also has the capability to process small files, and to do so efficiently within the MapReduce model?

Any thoughts or use cases you can suggest here?

Posted in Hadoop, Map Reduce