Problem Statement:

Whenever the mongos balancer is running, the cluster becomes extremely slow.
Latencies return to normal as soon as the balancer is stopped.

Mongos balancing process for reference:

  1. The balancer takes a lock.
  2. The balancer identifies a chunk to migrate, based on several criteria.
  3. The balancer sends a command to the "Source" shard.
  4. The Source does a sanity check.
  5. The Source sends a command to the Destination shard.
  6. The transfer starts. Essentially, the Destination reads the chunk's documents from the Source and writes them to itself.
  7. Once the initial copy is done, the Destination catches up on subsequent ops. Clients are not blocked during the previous step: inserts, deletes, and updates on the chunk continue, and the Source handles all of them. The Destination now replays those ops to catch up.
  8. Once the catch-up is finished, the transfer is marked complete.
  9. The shard now goes from Steady state to the Critical section. The Source commits to the Config servers, saying the chunk move is complete and the config must be updated.
  10. Cleanup on the Source. Now that the Source knows the chunk has moved away, it needs to delete those records from itself, so it initiates a cleanup.
  11. A few subsequent requests to the Source return a Stale Config exception to the mongos router. This forces mongos to refresh its state by querying the Config servers. The client is unaware of the Stale Config exception.
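
To make the sequence above concrete, here is a rough toy model of steps 6 through 11 in Python. It is only a sketch under simplifying assumptions, not how MongoDB implements migration: shards are in-memory dicts, writes during the copy are upserts only, and every name (Shard, ConfigServer, Router, migrate_chunk, StaleConfigError) is made up for illustration.

```python
from dataclasses import dataclass, field


class StaleConfigError(Exception):
    """Raised by a (toy) shard that no longer owns the requested chunk."""


@dataclass
class Shard:
    name: str
    docs: dict = field(default_factory=dict)  # chunk_id -> {key: value}

    def read(self, chunk_id, key):
        if chunk_id not in self.docs:
            raise StaleConfigError(f"{self.name} does not own chunk {chunk_id}")
        return self.docs[chunk_id][key]


@dataclass
class ConfigServer:
    owner: dict = field(default_factory=dict)  # chunk_id -> shard name


def migrate_chunk(config, source, dest, chunk_id, writes_during_copy=()):
    # Step 6: the destination copies a snapshot of the chunk from the source.
    dest_copy = dict(source.docs[chunk_id])
    # Step 7: the source keeps serving client writes during the copy;
    # they are recorded so the destination can replay them afterwards.
    pending = []
    for key, value in writes_during_copy:
        source.docs[chunk_id][key] = value
        pending.append((key, value))
    for key, value in pending:
        dest_copy[key] = value
    # Step 8: catch-up finished, the destination now holds the full chunk.
    dest.docs[chunk_id] = dest_copy
    # Step 9: critical section, ownership is committed to the config server.
    config.owner[chunk_id] = dest.name
    # Step 10: the source cleans up the documents it no longer owns.
    del source.docs[chunk_id]


class Router:
    """Toy mongos: routes by a cached copy of the config server's chunk map."""

    def __init__(self, config, shards):
        self.config = config
        self.shards = shards  # shard name -> Shard
        self.cache = dict(config.owner)
        self.refreshes = 0

    def read(self, chunk_id, key):
        try:
            return self.shards[self.cache[chunk_id]].read(chunk_id, key)
        except StaleConfigError:
            # Step 11: refresh from the config server and retry once;
            # the caller never sees the exception.
            self.cache = dict(self.config.owner)
            self.refreshes += 1
            return self.shards[self.cache[chunk_id]].read(chunk_id, key)


config = ConfigServer(owner={"c1": "A"})
shard_a = Shard("A", {"c1": {"x": 1}})
shard_b = Shard("B")
router = Router(config, {"A": shard_a, "B": shard_b})
migrate_chunk(config, shard_a, shard_b, "c1", writes_during_copy=[("y", 2)])
print(router.read("c1", "y"), router.refreshes)  # prints "2 1"
```

The router still believes shard A owns the chunk after the migration, so its first read hits the stale shard, triggers one refresh, and is retried against shard B without the caller noticing.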