Datum Engineering !

An engineered artwork to make decisions..

MR => Techniques Of Map To Reduce Efficiently

Posted by datumengineering on October 27, 2012

The concept of Map Reduce over a HDFS is fairly simple with an aim of reducing calculation complexity on network. Reducer has to run calculation over the data. So as a MR programmer it is programmer’s responsibility to give a calculation in such a manner that reducer should have very less work to across the nodes and hence less keys to reduce over the network.

class Mapper
method Map(key a; val v)
for all term t IN val v do
    Emit(term t; count 1)
class Reducer
method Reduce(term t; counts [c1; c2; …..])
    sum <–  0
for all count c IN counts [c1; c2;……] do
    sum <–  sum + c
Emit(term t; count sum)

Let’s find out some of the practices to perform some reducer’s work locally:

Combiner: Mapper acts more like a assignee which can just break the data locally and assign Key and Value. This splitting is the first step of MR programming and it happens locally at all nodes. Once data is split there are still opportunities to aggregate it locally before aggregating over the network. These splits, if given to reducer directly then there will be lot of work for reducer to do on the network. To overcome this create a reducer kind of functionality at individual node. Combiner give this functionality which act locally as a reducer.

class Mapper
 method Map(string t; integer r)
Emit(string t; pair (r; 1))
class Combiner
 method Combine(string t; pairs [(s1; c1); (s2; c2)…..])
     sum <–  0
     cnt <–  0
     for all pair (s; c) 2 pairs [(s1; c1); (s2; c2)…..] do
     sum <–  sum + s
     cnt <–  cnt + c
Emit(string t; pair (sum; cnt))
class Reducer
 method Reduce(string t; pairs [(s1; c1); (s2; c2)…..])
     sum <–  0
     cnt <–  0
     for all pair (s; c) 2 pairs [(s1; c1); (s2; c2)…..] do
     sum <–  sum + s
     cnt <–  cnt + c
     r(avg) <–  sum/cnt
Emit(string t; integer r(avg))

In-Mapper Combiner: In a typical scenario, Key-Value pair emit on local memory as shown above. However, associative array approach can combine all the key in associate array and create a pair. So, MAP can emit the pair than individual splits.

class Mapper
method Initialize
S <–  new AssociativeArray
C <–  new AssociativeArray
method Map(string t; integer r)
S{t} <–  S{t} + r
C{t} <–  C{t} + 1
method Close
for all term t IN S do
Emit(term t; pair (S{t};C{t}))

Before deciding over either of method to aggregate data locally mind the below points first:

  1. HDFS cluster consider combiner on case by case basis. So it is not always when combiner will run. I don’t know exact reason WHY Hadoop behaves like this. Some of the famous text book says: The combiner is provided as a semantics-preserving optimization to the execution framework, which has the option of using it, perhaps multiple times, or not at all (or even in the reduce phase).
  2. When your mapper or combiner EMIT the Key value either in split or in pair. It should match with reducer IN TAKE definition. What i mean here is that if mapper or combiner emit result in associate array (pair) then your reducer should take this associate array (pair) as input to process under reducer.
  3. As combiner can have risk of proper utilization from Hadoop, In-mapper combiner create memory bottleneck. So there is an additional overhead at in-mapper combiner for memory management. There is a counter required to keep track of memory. So, there is a fundamental scalability bottleneck associated with the in-mapper combining pattern. It critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split.
  4. One common solution to limiting memory usage when using the in-mapper combining technique is to “block” input key-value pairs and “flush” in-memory data structures periodically. However, memory management is a manageable challenge for In-mapper combiner.
  5. In-mapper combiner creates more complex keys and values termed as pair and strips.

There are some more techniques like key partitioning also play major role to achieve performance.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: