Data Visualization 1
Game Time, Baby! 1
I am a software engineer in Silicon Valley.
My interests include distributed computing, data mining, programming, data visualization, and the social web.
Sorting Reducer Input Values in Hadoop
I recently found the need to sort by value (intead of key) in Hadoop. I’ve seen some comments that call this a “secondary sort”. Essentially, I wanted the reducer’s values iterator to be sorted. There seem to be almost no docs, tutorials, or examples (that I could find) on the net for this.
I HIGHLY recommend that you read the email thread by Owen O’Malley that describes this technique in brief. I should also note that this example is using the 0.18 Hadoop API.
Suppose we have a file with a bunch of comma/line separated letters:
l,f,a,e,a,a,l f,g,b,c,b,d,f x,i,t,u,f,e,h ...etc
We want our reducer to receive bigrams (lf, fa, ae, ea, aa, al, etc), but partitioned by the first letter, and sorted (ascending) by the second. For example, for the letter a, the reducer should receive:
This is actually somewhat difficult to do, since we want to partition by key, but sort the reducer’s values iterator. The trick is to have the mapper output the bigram in the key, and only the second letter in the value. For the example above, the mapper would emit:
<ae, e> <aa, a> <al, l> ...
We can then use a custom partitioner/sorter to partition and sort according to our needs.
SORTING BY VALUE
To sort Hadoop’s mapper output by value, you need to set three settings in your JobConf:
There are many threads that say that you can’t sort by value in Hadoop. This is true. What you can do, instead, is have your mapper output all data in the key, rather than the value. Then you can use a specialized Partitioner classes and two RawComparator classes to sort and partition your map output properly.
The first class that you need to set is a class that extends org.apache.hadoop.mapred.Partitioner. This class has a single function that determines which partition your map output should go to. This means that you can’t go below 0, or above numPartitions – 1. Mostly, you’ll want to hashCode() some portion of your key and mod it by numPartitions.
In our example, the partitioner will partition by the first letter of the key.
OUTPUT VALUE GROUPING COMPARATOR
The OutputValueGroupingComparator JobConf setting takes in a org.apache.hadoop.io.RawComparator. This RawComparator is used to determine which reducer the mapper output row should go to. This RawComparator does not sort a reducer’s value iterator. Instead, it’s used to sort reducer input, so that the reducer knows when a new grouping starts.
In our example, the value grouping comparator will sort by the first letter of the key.
OUTPUT KEY COMPARATOR
The OutputKeyComparatorClass JobConf setting also takes in a org.apache.hadoop.io.RawComparator. This RawComparator is used to sort the values iterator that the reducer gets, which is what we want. It should be noted, that although the RawComparator is used to sort the values iterator, the data that gets passed into the comparator is the mapper key output. This is the reason that we must put all data in the key as well as the value.
A very important thing to note is that they key compartor must also enforce the value grouping comparator’s rules. In our example, this means that it must first check if the first letter is equal. If it’s not equal, it should return the same ruls as the value comparator. Only if the first letter of the key is equal should we apply our value-level sorting (comparing the second letter). If you do not do this, you will break your grouping.
In our example, the key comparator will sort by the second letter of the key.
RUNNING THE JOB
Now, all we need to do is run the job.
As you can see, the reducer input is grouped by the first letter (our logical key), and the values are sorted ascending.
AY: YWUTSSSRRQPPPPOMMKJIIIFB BZ: ZYYXXXWVUUURRRRQPPPPPPOONMMLLKJHGEEDDBB CZ: ZZZZZYYXXWVUUUTSSSRQQOOMKKKHHHGGFFDDCCCBB DY: YXWWSSQQPPPONMMKIHGEDDCCBB EW: WVUTRRRQPOOOOONMLLKKKJJHFEEDDCCBA FY: YXXXWVVVUUTSSRPNNNLLKJHGFFECBBBBA GZ: ZZYXVTSQQPOONLJIHHHFFCCCBBA HZ: ZYYYYXWVUUTTTRQQPOOMKJJIIIGFEDAAAA IY: YYYXWWVVVUTTSRRRQMKJJJIIIHGGFFEEEECBBBA JZ: ZZYYXXWRRRQPPOOOMLJJIIHHHHCCCBBA KZ: ZZYXWWVSSSSRQONMJIIHFEDB LZ: ZZZYYXXWUTRQQQPLKJIIIHHGGFDDCCCBBBAA MZ: ZZYYYWTTTSSQQQQOJIIIHGGFCCCBBAA NZ: ZZYYXVVUTSSSSRQPNMKIHGFFFECAA OZ: ZZZYYXWWUSRRPPOONNNMMLJIIHHHGGFEEEDDCCBA PZ: ZXXWWTSSSSSRRRQQQMMLLLKJIIIHEEDCBA QZ: ZZYXWWWWWWVVVUTTSSSSRRRRQQQPOOONMLKJIHHFDD RY: YYYXXXWVUURRQQPPOOOLLLLLLKJJJJHHHHGFFEEEDDCCCBAA SY: YXWWWVVVUUUTTSRRRQQQNNNMLJHHHGGGGFEDDCCCBB TZ: ZZYYYYXXWVTTTTSRRQPPONNNNMMLLJIIIICBB UZ: ZZZYXWWVVVSRRRQPPONMHHGEDCBBA VZ: ZYWVVUTTTQPPPOOMKIIGFEEDDCCCBB WX: XWWWVUTTSSSRRPNNNMMLLKKKKKJJIGEDAAAA XZ: ZZYXXXXSSRRQQOOMLLKKJIIIIHGFFEDDBA YZ: ZZZYXWWVUUUTTTSSRQPPPOONNNMLJIIFFFFEDCCCAA ZZ: ZYYWVVUUTSSSSRRQQQPOONMMLLLJJIIGFDBBBA