Table of Contents
It is important to understand Replication before moving to Sharding. Replication is a process of creating multiple copies of the same data across multiple servers. Yes, we are maintaining multiple copies of the same data.
But, Why?
Let’s look at a Replication architecture and understand more about it. Replication mainly offers:
- High Availability
- Read scaling
- Automatic failover
In the above architecture, we have 3 MongoDB servers. One acting as the Primary and other two as secondary servers. MongoDB uses an Oplog asynchronous mechanism to replicate data into secondary servers. We will have 3 copies of the same data. We call these servers a cluster. By default, the Read/Write requests are handled by the Primary nodes. In a cluster, only one node acts as the Primary server.It is important to always be ready for the worst possible failure, that way you are better prepared to handle it smoothly. Any software can fail at any given time caused by multiple reasons. It could be a network issue, manual issue or also due to a natural disaster. Which is the main objective behind keeping multiple copies of the same data – as a backup. These above mentioned servers continuously share Heartbeat information and get to know about the other servers’ health information. By default, every 2 seconds a heartbeat information will be shared by a server.
Let us understand this better with the help of the following examples. For instance, due to some issues, the Primary server stops sharing heartbeat information with the other servers. Immediately after 10 seconds, the secondary servers decide and call for election within the available nodes. Any one of the secondary servers can become a Primary server and take the Read/Write workload. In most cases, the server which has the latest up-to-date data becomes the Primary server. This happens without any manual intervention. No downtime or manual work required. Only downside is, all application requests will be paused seconds time during this election process. And, this way we achieve High Availability and Automatic failover.
Did I miss talking about scaling? Yes? Let’s discuss that. It is an even more interesting topic. By default, Primary nodes are responsible for taking the Read/Write workload. What can we do to improve the read performance? Just tweak the configuration to send read traffic to Secondary servers. Is that a good idea? Yes, it is.
Generally, 0 to 2 seconds replication delay might be there to sync data from Primary to Secondary servers. Let’s take some examples where it will be useful. What if I want to run Reporting queries and it will be continuously read to show information about data? What if I am using data to run AI/ML models? A continuous read is required and the latest data is required. If we can route read traffic or maintain a dedicated Secondary node it helps to boost the application’s performance by avoiding load to the Primary server.
Why Sharding?
Replication addresses many problems. But, why Sharding? Remember we are talking about big data here. NoSQL database is very good at handling big data. How long can you scale up a single machine? It can easily reach the limits of vertical scaling like CPU, RAM and Hard disk. What if we can distribute this data across multiple machines. Instead, we go with a horizontal scale. Good idea, right? But think wisely when you need it.
Below is a screenshot of Vertical scaling vs Horizontal scaling
Sharding Architecture Diagrams
It is a good idea to go with Sharding when it is reaching vertical scale limits, or the data set is growing too large (1-2 TB). You should start thinking about sharding strategies if the data grows to Terabytes. Are there any downsides with sharding? Yes, the cost will increase because of the number of servers and licences associated with it. We need to keep an eye on all the servers, which results in the complexity of maintenance. Don’t think about sharding too early. In a vast majority of the cases, Replication solves our problems. Each shard is maintained as a replica set for high availability.
Sharded cluster contains the following components:
- ShardEach shard contains a subset of the sharded data, and each shard is maintained as a replica set.
- MongosMongos is a query router. It acts as an interface between client application and the sharded cluster. All applications’ traffic reaches the first mongos.
- Config serversConfig servers’ stores the metadata and the configuration settings. Mongos reads this data and understands what data is present in which shard.
Now, you must be thinking that how does the cluster knows where to store the data. Correct?
MongoDB supports two Sharding strategies:
- Ranged Sharding
- Hashed Sharding
You need to decide what strategy is best for you. Choose carefully and think through before choosing a sharding strategy. The wrong strategy will end up increasing your budget. A Shard key is a vital factor to distribute the data across shards. Changing the shard key is not possible. We can’t change it completely, but we can keep adding more fields to existing shard keys.
Advantages of Sharding
- Better read/write throughputRead/writes will be distributed to multiple servers. So, we get better throughput.
- Scalable storage capacityStorage capacity increases by adding new nodes to the cluster. We can increase shards based on our requirements.
- High AvailabilityEach shard is configured as a replica set to maintain high availability.
It is as simple as dividing huge tasks into smaller tasks, distributing each task to different people and then finally combining all the tasks including work and delivery. This strategy will give better results than having all the tasks completed by a single person. The most interesting part when dealing with sharding is that how the data will fetch from multiple shards and aggregate the results given by each shard. When an application reads data from the sharded collection, that data should have been gathered from multiple shards, combined, and then return the final output to the application. Aggregation will happen based on some strategy.
Different Merge Strategies
- Mongos MergeThis is the default merge. Data gathered multiple shards and final merge happens at mongos
- Primary Shard MergeSome scenarios, we want data from sharded and unshared collections using lookup. Aggregation runtime will place the aggregation portion on the designated primary shard, which holds the unsharded collection.
- Targeted Shard ExecutionIf all the requested data are present in a single shard, then there is no need to gather information from other shards. All the work here is done by a single shard and returns the outcome.
- Any Shard MergeThis is a more configurable option with allowDiskUse: Ture. Some cases are like we know that our query is very expensive, and it requires more memory. In this case we can explicitly inform the system to aggregate any one of the Shards. It can randomly pick one of the shards based on the workload.
Hope this article helped you get a better understanding on MongoDB Shards.
If you are looking for consultation, subscription, technical or managed services for MongoDB inquire with us. Stay updated on more open source database technologies like Postgres, Redis, Grafana, ELK etc., right here.