Troubleshoot Sharded Clusters¶
This page describes common strategies for troubleshooting sharded cluster deployments.
If each application server has its own
mongos instance, other
application servers can continue to access the database. Furthermore,
mongos instances do not maintain persistent state, and they
can restart and become unavailable without losing any state or data.
mongos instance starts, it retrieves a copy of the
config database and can begin routing queries.
A Single Member Becomes Unavailable in a Shard Replica Set¶
Replica sets provide high availability for
shards. If the unavailable
mongod is a primary,
then the replica set will elect a new
primary. If the unavailable
mongod is a
secondary, and it disconnects the primary and secondary will
continue to hold all data. In a three member replica set, even if a
single member of the set experiences catastrophic failure, two other
members have full copies of the data. 
Always investigate availability interruptions and failures. If a system is unrecoverable, replace it and create a new member of the replica set as soon as possible to replace the lost redundancy.
|||If an unavailable secondary becomes available while it still has current oplog entries, it can catch up to the latest state of the set using the normal replication process; otherwise, it must perform an initial sync.|
All Members of a Shard Become Unavailable¶
If all members of a replica set shard are unavailable, all data held in that shard is unavailable. However, the data on all other shards will remain available, and it is possible to read and write data to the other shards. However, your application must be able to deal with partial results, and you should investigate the cause of the interruption and attempt to recover the shard as soon as possible.
A Config Server Replica Set Member Become Unavailable¶
If the replica set config server loses its primary and cannot elect a primary, the cluster's metadata becomes read only. You can still read and write data from the shards, but no chunk migration or chunk splits will occur until a primary is available.
Distributing replica set members across two data centers provides benefit over a single data center. In a two data center distribution,
- If one of the data centers goes down, the data is still available for reads unlike a single data center distribution.
- If the data center with a minority of the members goes down, the replica set can still serve write operations as well as read operations.
- However, if the data center with the majority of the members goes down, the replica set becomes read-only.
If possible, distribute members across at least three data centers. For config server replica sets (CSRS), the best practice is to distribute across three (or more depending on the number of members) centers. If the cost of the third data center is prohibitive, one distribution possibility is to evenly distribute the data bearing members across the two data centers and store the remaining member in the cloud if your company policy allows.
All config servers must be running and available when you first initiate a sharded cluster.
||| Starting in MongoDB 3.4, the use of three mirrored
Cursor Fails Because of Stale Config Data¶
could not initialize cursor across all shards because : stale config detected
This warning should not propagate back to your application. The
warning will repeat until all the
mongos instances refresh
their caches. To force an instance to refresh its cache, run the
Shard Keys and Cluster Availability¶
The most important consideration when choosing a shard key are:
- to ensure that MongoDB will be able to distribute data evenly among shards, and
- to scale writes across the cluster, and
- to ensure that
mongoscan isolate most queries to a specific
- Each shard should be a replica set, if a specific
mongodinstance fails, the replica set members will elect another to be primary and continue operation. However, if an entire shard is unreachable or fails for some reason, that data will be unavailable.
- If the shard key allows the
mongosto isolate most operations to a single shard, then the failure of a single shard will only render some data unavailable.
- If your shard key distributes data required for every operation throughout the cluster, then the failure of the entire shard will render the entire cluster unavailable.
In essence, this concern for reliability simply underscores the importance of choosing a shard key that isolates query operations to a single shard.
Config Database String Error¶
Changed in version 3.2.
Starting in MongoDB 3.2, config servers can be deployed as replica
mongos instances for the sharded cluster must
specify the same config server replica set name but can specify
hostname and port of different members of the replica set.
Starting in 3.4, the use of the deprecated mirrored
instances as config servers (SCCC) is no longer supported. Before you
can upgrade your sharded clusters to 3.4, you must convert your config
servers from SCCC to CSRS.
To convert your config servers from SCCC to CSRS, see the MongoDB 3.4 manual Upgrade Config Servers to Replica Set.
With earlier versions of MongoDB sharded clusters that use the topology
of three mirrored
mongod instances for config servers,
mongos instances in a sharded cluster must specify identical
Avoid Downtime when Moving Config Servers¶
Use CNAMEs to identify your config servers to the cluster so that you can rename and renumber your config servers without downtime.
moveChunk commit failed Error¶
At the end of a chunk migration, the shard must connect to the config database to update the chunk's record in the cluster metadata. If the shard fails to connect to the config database, MongoDB reports the following error:
ERROR: moveChunk commit failed: version is at <n>|<nn> instead of <N>|<NN>" and "ERROR: TERMINATING"
When this happens, the primary member of the shard's replica set then terminates to protect data consistency. If a secondary member can access the config database, data on the shard becomes accessible again after an election.