I used to hear a lot about Hadoop and MapReduce during my discussions with Yahoo! Search Engineers. Before I refer to MarkCC's explanation here is how Wikipedia goes "MapReduce is a software framework implemented by Google to support parallel computations over large (greater than 100 terabyte) data sets on unreliable clusters of computers." Quite comprehensible actually, but nothing close to MarkCC's writeup thats below.
Btw: On Hadoop here is Wikipedia's take "Hadoop is a Free Java software framework that supports distributed applications running on large clusters of commodity computers that process huge amounts of data...Hadoop consists of an open source implementation of Google's published computing infrastructure, specifically MapReduce and the Google File System (GFS)."
Here is MarkCC's explanation:
Suppose you're at work, and you need to do something that's going to take a long time to run on your computer. You don't want to wait. But you don't want to go out and spend a couple of million dollars buying a supercomputer. How do you make it run faster? One way is buy a whole bunch of cheap machines, and make it run on all of them at once. Another is to notice that your office has lots of computers - pretty much every office has a computer on the desk of every employee. And at any given moment, most of those computers aren't doing much. So why not take advantage of that? When your machine isn't doing much, you let you coworkers borrow the capability you're not using; when you need to do something, you can borrow their machines. So when you need to run something big, you can easily find a pool of a dozen machines.
The problem with that approach is that most programs aren't written to run on a dozen machines. They're written to run on one machine. To split a hard task among a lot of computers is hard.
MapReduce is a library that lets you adopt a particular, stylized way of programming that's easy to split among a bunch of machines. The basic idea is that you divide the job into two parts: a Map, and a Reduce. Map basically takes the problem, splits it into sub-parts, and sends the sub-parts to different machines - so all the pieces run at the same time. Reduce takes the results from the sub-parts and combines them back together to get a single answer.
The key to how MapReduce does things is to take input as, conceptually, a list of records. The records are split among the different machines by the map. The result of the map computation is a list of key/value pairs. Reduce takes each set of values that has the same key, and combines them into a single value. So Map takes a set of data chunks, and produces key/value pairs; reduce merges things, so that instead of a set of key/value pair sets, you get one result. You can't tell whether the job was split into 100 pieces or 2 pieces; the end result looks pretty much like the result of a single map....The beauty of MapReduce is that it's easy to write. M/R programs are really as easy as parallel programming ever gets.
Given that Mark CC's PhD work was in this field, his clarity shines through even without using terms like frameworks and computing infrastructure! For Product Managers such clarity and simplicity can actually mean the difference between success and failure of their products.
Joel btw takes the example of MapReduce to mention how badly Microsoft is trailing Google:
The very fact that Google invented MapReduce, and Microsoft didn't, says something about why Microsoft is still playing catch up trying to get basic search features to work, while Google has moved on to the next problem: building Skynet^H^H^H^H^H^H the world's largest massively parallel supercomputer. I don't think Microsoft completely understands just how far behind they are on that wave.
Just for comparisons, I am tempted to refer to a better written paragraph on Hadoop (from Wikipedia):
Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
1 comment:
hmm...nice blog..interesting...
hey r u shreya from infosys...
if so...then u knew me...this is jitendra....Chennai
Post a Comment