Map / Reduce – A visual explanation
Map/Reduce is a term commonly thrown about these days; in essence, it is just a way to take a big task and divide it into discrete tasks that can be done in parallel. A common use case for Map/Reduce is in document databases, which is why I found myself thinking deeply about it.
Let us say that we have a set of documents with the following form:
    {
        "type": "post",
        "name": "Raven's Map/Reduce functionality",
        "blog_id": 1342,
        "post_id": 29293921,
        "tags": ["raven", "nosql"],
        "post_content": "<p>...</p>",
        "comments": [
            { "source_ip": "124.2.21.2", "author": "martin", "text": "..." }
        ]
    }

And we want to answer a question over more than a single document. That sort of operation requires us to use aggregation, and over large amounts of data, that is best done using Map/Reduce, to split the work.
Map / Reduce is just a pair of functions, operating over a list of data. In C#, LINQ actually gives us a great way to express this that makes it very easy to understand and work with. Let us say that we want to be able to get a count of comments per blog. We can do that using the following Map / Reduce queries:
    from post in docs.posts
    select new { post.blog_id, comments_length = post.comments.Length };

    from agg in results
    group agg by agg.blog_id into g
    select new { blog_id = g.Key, comments_length = g.Sum(x => x.comments_length) };

There are a couple of things to note here:
- The first query is the map query; it maps the input document into the final format.
- The second query is the reduce query; it operates over a set of results and produces an answer.
- Note that the reduce query must return its result in the same format that it received it; the reason will be explained shortly.
- The first value in the result is the key, which is what we are aggregating on (think of the group by clause in SQL).
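To make the shape of these two queries concrete, here is a minimal, self-contained C# sketch that runs the same map and reduce over an in-memory list; the Post and Comment types are invented stand-ins for the JSON documents above:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical in-memory stand-ins for the JSON documents above.
    record Comment(string SourceIp, string Author, string Text);
    record Post(int BlogId, int PostId, Comment[] Comments);

    class MapReduceDemo
    {
        static void Main()
        {
            var docs = new List<Post>
            {
                new(1342, 29293921, new[] { new Comment("124.2.21.2", "martin", "...") }),
                new(1342, 29293922, Array.Empty<Comment>()),
            };

            // Map: project each document to (blog_id, comments_length).
            var mapped = from post in docs
                         select new { blog_id = post.BlogId, comments_length = post.Comments.Length };

            // Reduce: sum the comment counts per blog_id. The output rows
            // have exactly the same shape as the input rows.
            var reduced = from agg in mapped
                          group agg by agg.blog_id into g
                          select new { blog_id = g.Key, comments_length = g.Sum(x => x.comments_length) };

            foreach (var row in reduced)
                Console.WriteLine($"blog {row.blog_id}: {row.comments_length} comments");
        }
    }

Because the mapped rows and the reduced rows are the same anonymous type, the reduce query can be applied to its own output, which is exactly the property the next steps rely on.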
The next step is to start reducing the results. In real Map/Reduce algorithms, we partition the original input and work toward the final result. In this case, imagine that the output of the first step was divided into groups of 3 (so 4 groups overall), and that the reduce query was then applied to each group.
You can see why it is called reduce: for every batch, we apply a sum by blog_id to get a new total comments value. We started with 11 rows and ended up with just 10. That is where it gets interesting, because we are still not done; we can still reduce the data further.
This is what we do in the third step, reducing the data further still. This is why the input and output formats of the reduce query must match: we feed the output of several reduce queries in as the input of a new one. You can also see that we have now moved from having 10 rows to having just 7.
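Here is a minimal, self-contained C# sketch of steps 2 through 4, reusing the shapes from the sketch above; the 11 blog_id values are invented so that the row counts match this walkthrough (11 to 10 to 7), and .Chunk requires .NET 6 or later:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class ReReduceDemo
    {
        // Reduce input and output share this shape, so output can be re-reduced.
        record Row(int BlogId, int CommentsLength);

        // The reduce query from above, wrapped as a reusable function.
        static List<Row> Reduce(IEnumerable<Row> rows) =>
            (from agg in rows
             group agg by agg.BlogId into g
             select new Row(g.Key, g.Sum(x => x.CommentsLength))).ToList();

        static void Main()
        {
            // Pretend this is the output of the map step: 11 rows.
            int[] blogIds = { 1, 1, 2, 1, 3, 4, 3, 5, 6, 5, 7 };
            var mapped = blogIds.Select(id => new Row(id, 1)).ToList();

            // Step 2: partition into batches of 3 and reduce each batch.
            var step2 = mapped.Chunk(3).SelectMany(b => Reduce(b)).ToList();
            // Step 3: batch the step-2 output again and reduce each batch.
            var step3 = step2.Chunk(3).SelectMany(b => Reduce(b)).ToList();
            // Step 4: one final reduce over whatever is left.
            var final = Reduce(step3);

            Console.WriteLine($"{mapped.Count} -> {step2.Count} -> {step3.Count} -> {final.Count}");
            // Prints: 11 -> 10 -> 7 -> 7; all keys are now unique, so we stop.
        }
    }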
And the final step applies the reduce query one last time. Now we are done; we cannot reduce the data any further, because all the keys are unique.
There is another interesting property of Map / Reduce. Let us say that I have just added a comment to a post; that would obviously invalidate the results of the query, right?
Well, yes, but not all of them. Assuming that I added a comment to the post whose id is 10, what would I need to do to recalculate the right result?
- Map Doc #10 again
- Reduce Step 2, Batch #3 again
- Reduce Step 3, Batch #1 again
- Reduce Step 4
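A minimal sketch of that bookkeeping, reusing the invented data from the sketch above (0-based indices, batch size 3): we cache the step-2 output per batch, and when doc #10 (index 9) changes, only batch 9 / 3 = 3 has to be re-reduced; the other batches are served from cache, and the same trick applies at each later reduce level.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class IncrementalDemo
    {
        record Row(int BlogId, int CommentsLength);

        static List<Row> Reduce(IEnumerable<Row> rows) =>
            (from agg in rows
             group agg by agg.BlogId into g
             select new Row(g.Key, g.Sum(x => x.CommentsLength))).ToList();

        static void Main()
        {
            // The mapped rows again; one Row per document for simplicity.
            int[] blogIds = { 1, 1, 2, 1, 3, 4, 3, 5, 6, 5, 7 };
            var docs = blogIds.Select(id => new Row(id, 1)).ToList();

            // Cache the step-2 reduce output of every batch of 3.
            var step2Cache = docs.Chunk(3).Select(b => Reduce(b)).ToList();

            // A comment lands on doc #10 (index 9): its count goes up by one.
            docs[9] = docs[9] with { CommentsLength = docs[9].CommentsLength + 1 };

            // Only batch 9 / 3 = 3 is dirty; the other three stay cached.
            int dirty = 9 / 3;
            step2Cache[dirty] = Reduce(docs.Chunk(3).ElementAt(dirty));

            // The later steps then re-reduce only the batches fed by the
            // dirty rows; here we just finish with one final reduce.
            var final = Reduce(step2Cache.SelectMany(x => x));
            Console.WriteLine(string.Join(", ",
                final.Select(r => $"{r.BlogId}: {r.CommentsLength}")));
        }
    }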
And that is (more or less) the notion of Map / Reduce.
http://ayende.com/blog/4435/map-reduce-a-visual-explanation