Navigation
This version of the documentation is archived and no longer supported.

Troubleshoot Sharded Clusters

This page describes common strategies for troubleshooting sharded cluster deployments.

Application Servers or mongos Instances Become Unavailable

If each application server has its own mongos instance, other application servers can continue to access the database. Furthermore, mongos instances do not maintain persistent state, and they can restart and become unavailable without losing any state or data. When a mongos instance starts, it retrieves a copy of the config database and can begin routing queries.

A Single mongod Becomes Unavailable in a Shard

Replica sets provide high availability for shards. If the unavailable mongod is a primary, then the replica set will elect a new primary. If the unavailable mongod is a secondary, and it disconnects the primary and secondary will continue to hold all data. In a three member replica set, even if a single member of the set experiences catastrophic failure, two other members have full copies of the data. [1]

Always investigate availability interruptions and failures. If a system is unrecoverable, replace it and create a new member of the replica set as soon as possible to replace the lost redundancy.

[1]If an unavailable secondary becomes available while it still has current oplog entries, it can catch up to the latest state of the set using the normal replication process; otherwise, it must perform an initial sync.

All Members of a Shard Become Unavailable

In a sharded cluster, mongod and mongos instances monitor the replica sets in the sharded cluster (e.g. shard replica sets, config server replica set).

If all members of a replica set shard are unavailable, all data held in that shard is unavailable. However, the data on all other shards will remain available, and it is possible to read and write data to the other shards. However, your application must be able to deal with partial results, and you should investigate the cause of the interruption and attempt to recover the shard as soon as possible.

A Config Server Replica Set Member Become Unavailable

Changed in version 3.2: Starting in MongoDB 3.2, config servers for sharded clusters can be deployed as a replica set. The replica set config servers must run the WiredTiger storage engine. MongoDB 3.2 deprecates the use of three mirrored mongod instances for config servers.

Replica sets provide high availability for the config servers. If an unavailable config server is a primary, then the replica set will elect a new primary.

If the replica set config server loses its primary and cannot elect a primary, the cluster’s metadata becomes read only. You can still read and write data from the shards, but no chunk migration or chunk splits will occur until a primary is available.

Note

All config servers must be running and available when you first initiate a sharded cluster.

In a sharded cluster, mongod and mongos instances monitor the replica sets in the sharded cluster (e.g. shard replica sets, config server replica set).

If no member of the config server replica set is reachable by a mongos instance or a shard member:

For MongoDB 3.2.0-3.2.9,

If the number of consecutive unsuccessful attempts exceeds replMonitorMaxFailedChecks parameter value, the mongod or mongos instance denotes the replica set as unavailable and the monitoring mongos or mongod instance becomes unusable until you restart the instance.

However, you can avoid having to restart the mongos or the shard replica set member mongod in this situation by setting replMonitorMaxFailedChecks to value 2147483647 when you start up these instances:

Important

The parameter setting is not persisted upon restart.

db.adminCommand( { setParameter: 1, 'replMonitorMaxFailedChecks': 2147483647 } )
For MongoDB 3.2.10 or later 3.2-series,
By default, you do not need to restart unless timeOutMonitoringReplicaSets is set to true.

Renaming Mirrored Config Servers and Cluster Availability

If the sharded cluster is using mirrored config servers instead of a replica set and the name or address that a sharded cluster uses to connect to a config server changes, you must restart every mongod and mongos instance in the sharded cluster. Avoid downtime by using CNAMEs to identify config servers within the MongoDB deployment.

To avoid downtime when renaming config servers, use DNS names unrelated to physical or virtual hostnames to refer to your config servers.

Generally, refer to each config server using the DNS alias (e.g. a CNAME record). When specifying the config server connection string to mongos, use these names. These records make it possible to change the IP address or rename config servers without changing the connection string and without having to restart the entire cluster.

Cursor Fails Because of Stale Config Data

A query returns the following warning when one or more of the mongos instances has not yet updated its cache of the cluster’s metadata from the config database:

could not initialize cursor across all shards because : stale config detected

This warning should not propagate back to your application. The warning will repeat until all the mongos instances refresh their caches. To force an instance to refresh its cache, run the flushRouterConfig command.

Shard Keys and Cluster Availability

The most important consideration when choosing a shard key are:

  • to ensure that MongoDB will be able to distribute data evenly among shards, and
  • to scale writes across the cluster, and
  • to ensure that mongos can isolate most queries to a specific mongod.

Furthermore:

  • Each shard should be a replica set, if a specific mongod instance fails, the replica set members will elect another to be primary and continue operation. However, if an entire shard is unreachable or fails for some reason, that data will be unavailable.
  • If the shard key allows the mongos to isolate most operations to a single shard, then the failure of a single shard will only render some data unavailable.
  • If your shard key distributes data required for every operation throughout the cluster, then the failure of the entire shard will render the entire cluster unavailable.

In essence, this concern for reliability simply underscores the importance of choosing a shard key that isolates query operations to a single shard.

Config Database String Error

Changed in version 3.2: Starting in MongoDB 3.2, config servers can be deployed as replica sets by default. The mongos instances for the sharded cluster must specify the same config server replica set name but can specify hostname and port of different members of the replica set.

If using the deprecated topology of three mirrored mongod instances for config servers, mongos instances in a sharded cluster must specify identical configDB string.

Avoid Downtime when Moving Config Servers

Use CNAMEs to identify your config servers to the cluster so that you can rename and renumber your config servers without downtime.

moveChunk commit failed Error

At the end of a chunk migration, the shard must connect to the config database to update the chunk’s record in the cluster metadata. If the shard fails to connect to the config database, MongoDB reports the following error:

ERROR: moveChunk commit failed: version is at <n>|<nn> instead of
<N>|<NN>" and "ERROR: TERMINATING"

When this happens, the primary member of the shard’s replica set then terminates to protect data consistency. If a secondary member can access the config database, data on the shard becomes accessible again after an election.

The user will need to resolve the chunk migration failure independently. If you encounter this issue, contact the MongoDB User Group or MongoDB Support to address this issue.