MongoDB is a non-relational document database that provides support for JSON-like storage. Its flexible data model allows you to easily store unstructured data. First released in 2009, it is the most commonly used NoSQL database. It has been downloaded more than 325 million times.
MongoDB is popular with developers because it is easy to get started with. Over the years, MongoDB has introduced many features that have turned the database into a robust solution able to store terabytes of data for applications.
As with any database, developers and DBAs working with MongoDB should look at how to optimize the performance of their database, especially nowadays with cloud services, where each byte processed, transmitted, and stored costs money. The ability to get started so quickly with MongoDB means that it is easy to overlook potential problems or miss out on simple performance improvements.
In this article, we’ll look at 10 essential techniques you can apply to make the most of MongoDB for your applications.
MongoDB best practice #1: Enable authorization and authentication on your database right from the start
The bigger the database, the bigger the damage from a leak. There have been numerous data leaks due to the simple fact that authorization and authentication are disabled by default when deploying MongoDB for the first time. While it is not a performance tip, it is essential to enable authorization and authentication right from the start as it will save you any potential pain over time due to unauthorized access or data leakage.
When you deploy a new instance of MongoDB, the instance has no user, password, or access control by default. In recent MongoDB versions, the default IP binding changed to 127.0.0.1 and a localhost exception was added, which reduced the potential for database exposure when installing the database.
However, this is still not ideal from a security perspective. The first piece of advice is to create the admin user and then restart the instance with the authorization option enabled. This prevents any unauthorized access to the instance.
To create the admin user:
> use admin
switched to db admin
> db.createUser({
... user: "zelmar",
... pwd: "password",
... roles : [ "root" ]
... })
Successfully added user: { "user" : "zelmar", "roles" : [ "root" ] }
Then, you need to enable authorization and restart the instance. If you are deploying MongoDB from the command line:
mongod --port 27017 --dbpath /data/db --auth
Or if you are deploying MongoDB using a config file, you need to include:
security:
  authorization: "enabled"
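Once the instance has restarted, it can be worth confirming that the setting actually took effect. The following is a minimal sketch, assuming you pass it the document returned by db.adminCommand({ getCmdLineOpts: 1 }); the function name auditSecurityOptions and the wording of the warnings are illustrative, not part of MongoDB.

```javascript
// Inspect the result of db.adminCommand({ getCmdLineOpts: 1 }) and return
// a list of warnings. This is a rough check, not a full security audit.
function auditSecurityOptions(cmdLineOpts) {
  const parsed = (cmdLineOpts && cmdLineOpts.parsed) || {};
  const security = parsed.security || {};
  const net = parsed.net || {};
  const warnings = [];

  // A keyFile implies authorization, so accept either setting.
  const authEnabled =
    security.authorization === "enabled" || Boolean(security.keyFile);
  if (!authEnabled) {
    warnings.push("authorization is not enabled");
  }

  // Flag instances that listen on every IPv4 interface.
  const boundAddresses = String(net.bindIp || "")
    .split(",")
    .map((address) => address.trim());
  if (net.bindIpAll === true || boundAddresses.includes("0.0.0.0")) {
    warnings.push("instance listens on all IPv4 interfaces");
  }

  return warnings;
}
```

In mongosh, this could be called as auditSecurityOptions(db.adminCommand({ getCmdLineOpts: 1 })); an empty array means neither condition was detected.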
MongoDB best practice #2: Don’t use ‘not recommended versions’ or ‘end-of-life versions’ in production instances and stay updated
It may seem obvious, but one of the most common issues we see with production instances is due to developers running a MongoDB version that is not suitable for production in the first place. This might be because the version is out of date, such as a retired version that should be updated to a newer iteration that contains all the necessary bug fixes.
Or it might be due to the version being too early and not yet tested enough for production use. As developers, we are normally keen to use our tools’ latest and greatest versions. We also want to be consistent over all the stages of development, from initial build and test through to production, as this decreases the number of variables we have to support, the potential for issues, and the cost to manage all of our instances.
For some, this could mean using versions that are not signed off for production deployment yet. For others, it could mean sticking with a specific version that is tried and trusted. This is a problem from a troubleshooting perspective when an issue is fixed in a later version of MongoDB that is approved for production but has not been deployed yet. Alternatively, you might forget about that database instance that is “just working” in the background, and miss when you need to implement a patch.
In response to this, you should regularly check if your version is suitable for production using the release notes of each version. For example, MongoDB 5.0 provides the following guidance in its release notes: https://www.mongodb.com/docs/upcoming/release-notes/5.0/
The guidance here would be to use MongoDB 5.0.11 as this version has the required updates in place. If you don’t update to this version, you will run the risk of losing data.
While it might be tempting to stick with one version, keeping up with upgrades is essential to preventing problems in production. You may want to take advantage of newly added features, but you should put these features through your test process first. You want to see if they pose any problems that might affect your overall performance before moving them into production.
Lastly, you should check the MongoDB Software Lifecycle Schedules and anticipate the upgrades of your clusters before the end of life of each version: https://www.mongodb.com/support-policy/lifecycles
End-of-life versions do not receive patches, bug fixes, or any kind of improvements. This could leave your database instances exposed and vulnerable.
From a performance perspective, getting the right version of MongoDB for your production applications involves being “just right” — not so near the bleeding edge that you will encounter bugs or other problems, but also not so far behind that you will miss out on vital updates.
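A version check like this is easy to script. The sketch below compares a version string against a minimum patch release, using the 5.0.11 figure mentioned above as an example; the helper names parseVersion and isAtLeast are illustrative. It compares the numeric parts only, so any pre-release suffix (such as "-rc0") is ignored.

```javascript
// Split "major.minor.patch" into numbers; anything after the patch
// number (for example a "-rc0" suffix) is ignored by this sketch.
function parseVersion(version) {
  const match = /^(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) {
    throw new Error(`Unrecognized version string: ${version}`);
  }
  return match.slice(1).map(Number);
}

// Return true when `current` is the same as or newer than `minimum`.
// Parts are compared numerically, so "5.0.9" is older than "5.0.11".
function isAtLeast(current, minimum) {
  const currentParts = parseVersion(current);
  const minimumParts = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    if (currentParts[i] !== minimumParts[i]) {
      return currentParts[i] > minimumParts[i];
    }
  }
  return true;
}
```

In mongosh, isAtLeast(db.version(), "5.0.11") would report whether the connected server meets that minimum.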
MongoDB best practice #3: Use MongoDB replication to ensure HA and check the status of your replica often
A replica set is a group of MongoDB processes that maintains the same data on all of the nodes used for an application. It provides redundancy and data availability for your data. When you have multiple copies of your data on different database servers—or even better, in different data centers around the world—replication provides a high level of fault tolerance in case of a disaster.
MongoDB replica sets work with one writer node (also called the primary server). The best practice recommendation is to always have an odd number of members. Traditionally, replica sets have at least three instances:
- Primary (writer node)
- Secondary (reader node)
- Secondary (reader node)
All of the nodes of the replica set will work together, as the primary node will receive the writes from the app server, and then the data will be copied to the secondaries. If something happens to the primary node, the replica set will elect a secondary as the new primary. To make this process work more efficiently and ensure a smooth failover, it is important for all the nodes of the replica set to have the same hardware configuration. Another advantage of the replica set is that it is possible to send read operations to the secondary servers, increasing the read scalability of the database.
After you deploy a replica set to production, it is important to check the health of the replica and the nodes. MongoDB has two important commands for this purpose:
- rs.status() provides information on the current status of the replica set, using data derived from the heartbeat packets sent by the other members of the replica set. It’s a very useful tool for checking the status of all the nodes in a replica set.
- rs.printSecondaryReplicationInfo() provides a formatted report of the status of the replica set. It’s very useful to check if any of the secondaries are behind the primary on data replication, as this would affect your ability to recover all your data in the event of something going wrong. If secondaries are too far behind the primary, then you could end up losing a lot more data than you are comfortable with.
However, note that these commands provide point-in-time information rather than continuous monitoring for the health of your replica set. In a real production environment, or if you have many clusters to check, running these commands could become time-consuming and annoying. Therefore we recommend using a monitoring system like Percona PMM to keep an eye on your clusters.
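For an ad hoc check between monitoring runs, the lag of each secondary can be derived from the document that rs.status() returns. This is a minimal sketch, assuming the members array carries the name, health, stateStr, and optimeDate fields; the function name secondaryLagSeconds is illustrative.

```javascript
// Given the document returned by rs.status(), report how many seconds
// each secondary's last applied operation trails the primary's.
function secondaryLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("replica set has no primary");
  }
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      healthy: m.health === 1,
      // Subtracting two Date objects yields milliseconds.
      lagSeconds: Math.max(0, (primary.optimeDate - m.optimeDate) / 1000),
    }));
}
```

In mongosh, secondaryLagSeconds(rs.status()) returns one entry per secondary. Like the commands it builds on, this is still a point-in-time reading.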
MongoDB best practice #4: Use $regex queries only when necessary and choose text search instead where you can
Sometimes the simplest way to search for something in a database is to use a regular expression or $regex operation. Many developers choose this option, but in fact using regular expressions can harm your search operations at scale. You should avoid the use of $regex queries, especially when your database is big.
A $regex query consumes a lot of CPU time and it will normally be extremely slow and inefficient. Creating an index doesn’t help much and sometimes the performance is worse with indexes than without them.
For example, let’s run a $regex query on a collection of 10 million documents and use .explain(true) to view how many milliseconds the query takes.
Without an index:
> db.people.find({"name":{$regex: "Zelmar"}}).explain(true)
- - Output omitted - -
"executionStats" : {
    "nReturned" : 19851,
    "executionTimeMillis" : 4171,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 10000000,
- - Output omitted - -
And if we created an index on “name”:
> db.people.find({"name":{$regex: "Zelmar"}}).explain(true)
- - Output omitted - -
"executionStats" : {
    "nReturned" : 19851,
    "executionTimeMillis" : 4283,
    "totalKeysExamined" : 10000000,
    "totalDocsExamined" : 19851,
- - Output omitted - -
We can see in this example that the index didn’t help to improve the $regex performance.
It’s common to see a new application using $regex operations for search requests. This is because neither the developers nor the DBAs notice any performance issues in the beginning when the size of the collections is small and the users of the application are very few.
However, when the collections become bigger and the application gathers more users, the $regex operations start to slow down the cluster and become a nightmare for the team. Over time, as your application scales and more users want to carry out search requests, the level of performance can drop significantly.
Rather than using $regex queries, use text indexes to support your text search. Text search is more efficient than $regex but requires you to add text indexes to your data sets in advance. The indexes can include any field whose value is a string or an array of string elements. A collection can have only one text search index, but that index can cover multiple fields.
Using the same collection as the example above, we can test the execution time of the same query using text search:
> db.people.find({$text:{$search: "Zelmar"}}).explain(true)
- - Output omitted - -
"executionStages" : {
    "nReturned" : 19851,
    "executionTimeMillisEstimate" : 445,
    "works" : 19852,
    "advanced" : 19851,
- - Output omitted - -
In practice, the same query took about 3.7 seconds less using text search than using $regex (445 milliseconds versus 4,171 milliseconds). Nearly four seconds in “database time,” let alone online application time, is an eternity.
To conclude, if you can solve the query using text search, do so. Restrict $regex queries to those use cases where they are really necessary.
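The text search setup used in this comparison can be sketched as two small helpers. This is a minimal sketch rather than a definitive implementation: it assumes the people collection and name field from the example above, and the function names ensureTextIndex and searchPeople are illustrative.

```javascript
// Create the text index that $text queries require. A collection can
// have only one text index, though it may cover several string fields.
function ensureTextIndex(db) {
  return db.people.createIndex({ name: "text" });
}

// Run a text search instead of an unanchored $regex match.
function searchPeople(db, term) {
  return db.people.find({ $text: { $search: term } });
}
```

In mongosh, call ensureTextIndex(db) once and then searchPeople(db, "Zelmar"). Keep in mind that text search matches whole (tokenized and stemmed) words rather than arbitrary substrings, so its results can differ from those of a $regex query on the same term.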
MongoDB best practice #5: Think wisely about your index strategy
Putting some thought into your queries at the start can have a massive impact on performance over time. First, you need to understand your application and the kinds of queries that you expect to process as part of your service. Based on this, you can create an index that supports them.
Indexes can speed up read queries, but they consume extra storage and slow down write operations. Consequently, you will need to think about which fields should be indexed so that you avoid creating too many indexes.
For example, if you are creating a compound index, following the ESR (Equality, Sort, Range) rule is a must, and using an index to sort the results improves the speed of the query.
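To make the ESR ordering concrete, the sketch below assembles a compound index key document with equality fields first, sort fields second, and range fields last. The helper name buildEsrIndexKeys and the orders collection with its status, createdAt, and total fields are hypothetical, chosen only for illustration.

```javascript
// Build a compound index key document in ESR order:
// Equality fields, then Sort fields, then Range fields.
// `sort` is a list of [field, direction] pairs (1 or -1).
function buildEsrIndexKeys({ equality = [], sort = [], range = [] }) {
  const keys = {};
  for (const field of equality) {
    keys[field] = 1;
  }
  for (const [field, direction] of sort) {
    keys[field] = direction;
  }
  for (const field of range) {
    // Skip a field that already appears earlier in the index.
    if (!(field in keys)) {
      keys[field] = 1;
    }
  }
  return keys;
}

// Hypothetical query this index is meant to support:
//   db.orders.find({ status: "shipped", total: { $gt: 100 } })
//            .sort({ createdAt: -1 })
const orderIndexKeys = buildEsrIndexKeys({
  equality: ["status"],
  sort: [["createdAt", -1]],
  range: ["total"],
});
```

In mongosh, the result could be passed straight to db.orders.createIndex(orderIndexKeys). The key order matters because JavaScript objects keep string keys in insertion order, which is the order MongoDB uses for the compound index.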
Similarly, you can always check if your queries are really using the indexes that you have created with .explain(). Sometimes we see a collection with indexes created, but the queries either don’t use the indexes or instead use the wrong index entirely. It’s important to create only the indexes that will actually be used for the read queries. Having indexes that will never be used is a waste of storage and will slow down write operations.
When you look at the .explain() output, there are three main fields that are important to observe. For example:
keysExamined:0 docsExamined:207254 nreturned:0
In this example, no indexes are being used. We know this because the number of keys examined is 0 while the number of documents examined is 207254. Ideally, the query should have the ratio nreturned/keysExamined=1. For example:
keysExamined:5 docsExamined: 0 nreturned:5
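The same reading of these three counters can be automated. The following is a minimal sketch, assuming it receives the output of .explain("executionStats"), where the counters appear as nReturned, totalKeysExamined, and totalDocsExamined; the function name summarizeExecutionStats and the returned field names are illustrative.

```javascript
// Summarize the three counters discussed above from the output of
// cursor.explain("executionStats").
function summarizeExecutionStats(explainOutput) {
  const { nReturned, totalKeysExamined, totalDocsExamined } =
    explainOutput.executionStats;
  return {
    // No keys examined but documents read: the query scanned the collection.
    likelyCollectionScan: totalKeysExamined === 0 && totalDocsExamined > 0,
    // Ideal value is 1: every index key examined produced a result.
    returnedPerKeyExamined:
      totalKeysExamined > 0 ? nReturned / totalKeysExamined : null,
    docsExamined: totalDocsExamined,
  };
}
```

Applied to the indexed $regex example from best practice #4 (19,851 documents returned for 10,000,000 keys examined), the ratio comes out at roughly 0.002, far from the ideal of 1.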
Finally, if .explain() shows you that a particular query is using an index that is wrong, you can force the query to use a particular index with .hint(). Calling the .hint() method on a query overrides MongoDB’s default index selection and query optimization process, allowing you to specify the index that is used, or to carry out a forward collection or reverse collection scan.
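Both uses of .hint() can be sketched as small wrappers. This is a minimal sketch under stated assumptions: the helper names findWithIndex and findWithCollectionScan are illustrative, and the collection argument is expected to be a mongosh collection such as db.people.

```javascript
// Force a query to use a specific index, given as its key pattern
// (for example { name: 1 }) or as the index name.
function findWithIndex(collection, filter, indexSpec) {
  return collection.find(filter).hint(indexSpec);
}

// Force a collection scan: $natural 1 scans forward,
// $natural -1 scans in reverse.
function findWithCollectionScan(collection, filter, reverse = false) {
  return collection.find(filter).hint({ $natural: reverse ? -1 : 1 });
}
```

In mongosh, findWithIndex(db.people, { name: "Zelmar" }, { name: 1 }).explain("executionStats") would show how the query behaves with the forced index. A hint bypasses the query planner, so it is worth re-checking the result whenever the data or the indexes change.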