Map-Reduce and Sharded Collections¶
For map-reduce operations that require custom functionality, MongoDB
aggregation operators starting in version 4.4. Use these operators to
Map-reduce supports operations on sharded collections, both as an input
and as an output. This section describes the behaviors of
mapReduce specific to sharded collections.
However, starting in version 4.2, MongoDB deprecates the map-reduce
option to create a new sharded collection as well as the use of the
sharded option for map-reduce. To output to a sharded collection,
create the sharded collection first. MongoDB 4.2 also deprecates the
replacement of an existing sharded collection.
Sharded Collection as Input¶
When using sharded collection as the input for a map-reduce operation,
mongos will automatically dispatch the map-reduce job to
each shard in parallel. There is no special option
mongos will wait for jobs on all shards to
Sharded Collection as Output¶
out field for
mapReduce has the
value, MongoDB shards the output collection using the
_id field as
the shard key.
To output to a sharded collection:
If the output collection does not exist, create the sharded collection first.
Starting in version 4.2, MongoDB deprecates the map-reduce option to create a new sharded collection and the use of the
shardedoption for map-reduce. As such, to output to a sharded collection, create the sharded collection first.
If you did not create the sharded collection first, MongoDB creates and shards the collection on the
_idfield. However, it is recommended that you create the sharded collection first.
- Starting in version 4.2, MongoDB deprecates the replacement of an existing sharded collection.
- Starting in version 4.0, if the output collection already exists but is not sharded, map-reduce fails.
- For a new or an empty sharded collection, MongoDB uses the results of the first stage of the map-reduce operation to create the initial chunks distributed among the shards.
mongosdispatches, in parallel, a map-reduce post-processing job to every shard that owns a chunk. During the post-processing, each shard will pull the results for its own chunks from the other shards, run the final reduce/finalize, and write locally to the output collection.
- During later map-reduce jobs, MongoDB splits chunks as needed.
- Balancing of chunks for the output collection is automatically prevented during post-processing to avoid concurrency issues.