From security musts and indexing gotchas to replication and sharding tips, follow these essential dos and don'ts to make the most of your MongoDB database systems.

MongoDB is a non-relational document database that provides support for JSON-like storage. Its flexible data model allows you to easily store unstructured data. First released in 2009, it is the most commonly used NoSQL database, downloaded more than 325 million times.

MongoDB is popular with developers because it is easy to get started with. Over the years, MongoDB has introduced many features that have turned the database into a robust solution able to store terabytes of data for applications.

As with any database, developers and DBAs working with MongoDB should look at how to optimize the performance of their database, especially nowadays with cloud services, where each byte processed, transmitted, and stored costs money. The very ease of getting started with MongoDB also makes it easy to overlook potential problems or miss out on simple performance improvements. In this article, we'll look at 10 essential techniques you can apply to make the most of MongoDB for your applications.

MongoDB best practice #1: Enable authorization and authentication on your database right from the start

The bigger the database, the bigger the damage from a leak. There have been numerous data leaks due to the simple fact that authorization and authentication are disabled by default when deploying MongoDB for the first time. While it is not a performance tip, enabling authorization and authentication right from the start will save you pain over time from unauthorized access or data leakage.

When you deploy a new instance of MongoDB, the instance has no user, password, or access control by default. In recent MongoDB versions, the default IP binding changed to 127.0.0.1 and a localhost exception was added, which reduced the potential for database exposure when installing the database. However, this is still not ideal from a security perspective.

The first piece of advice is to create the admin user and restart the instance with the authorization option enabled. This prevents any unauthorized access to the instance. To create the admin user:

> use admin
switched to db admin
> db.createUser({
...     user: "zelmar",
...     pwd: "password",
...     roles: [ "root" ]
... })
Successfully added user: { "user" : "zelmar", "roles" : [ "root" ] }

Then you need to enable authorization and restart the instance. If you are deploying MongoDB from the command line:

mongod --port 27017 --dbpath /data/db --auth

Or if you are deploying MongoDB using a config file, you need to include:

security:
    authorization: "enabled"
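Once authorization is enabled, every client must authenticate. As a quick check that the setup works, you can connect with the admin user created above. This is a minimal sketch, assuming the mongosh shell and the default port; passing -p with no value prompts for the password rather than exposing it on the command line:

mongosh --port 27017 -u zelmar -p --authenticationDatabase admin

If this connection succeeds but anonymous connections are rejected, access control is in place.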
MongoDB best practice #2: Don't use 'not recommended' versions or end-of-life versions in production instances, and stay updated

It should seem obvious, but one of the most common issues we see with production instances is developers running a MongoDB version that is not suitable for production in the first place. This might be because the version is out of date, such as a retired version that should be updated to a newer iteration containing all the necessary bug fixes. Or it might be because the version is too new and has not yet been tested enough for production use.

As developers, we are normally keen to use the latest and greatest versions of our tools. We also want to be consistent across all the stages of development, from initial build and test through to production, as this decreases the number of variables we have to support, the potential for issues, and the cost of managing all of our instances. For some, this could mean using versions that are not yet signed off for production deployment. For others, it could mean sticking with a specific version that is tried and trusted.

This becomes a problem from a troubleshooting perspective when an issue is fixed in a later version of MongoDB that is approved for production but has not been deployed yet. Alternatively, you might forget about that database instance that is "just working" in the background, and miss when you need to implement a patch.

In response to this, you should regularly check whether your version is suitable for production using the release notes of each version. For example, the MongoDB 5.0 release notes (https://www.mongodb.com/docs/upcoming/release-notes/5.0/) advise using MongoDB 5.0.11, as this version has the required updates in place. If you don't update to this version, you run the risk of losing data.

While it might be tempting to stick with one version, keeping up with upgrades is essential to preventing problems in production. You may want to take advantage of newly added features, but you should put these features through your test process first, to see if they pose any problems that might affect your overall performance before moving them into production.

Lastly, you should check the MongoDB Software Lifecycle Schedules (https://www.mongodb.com/support-policy/lifecycles) and anticipate the upgrades of your clusters before the end of life of each version. End-of-life versions do not receive patches, bug fixes, or any kind of improvements, which could leave your database instances exposed and vulnerable.

From a performance perspective, getting the right version of MongoDB for your production applications involves being "just right": not so near the bleeding edge that you will encounter bugs or other problems, but also not so far behind that you will miss out on vital updates.

MongoDB best practice #3: Use MongoDB replication to ensure HA and check the status of your replica often

A replica set is a group of MongoDB processes that maintain the same data on all of the nodes used for an application. It provides redundancy and data availability for your data. When you have multiple copies of your data on different database servers, or even better, in different data centers around the world, replication provides a high level of fault tolerance in case of a disaster.

MongoDB replica sets work with one writer node, also called the primary server. The best practice recommendation is to always have an odd number of members. Traditionally, replica sets have at least three instances:

- Primary (writer node)
- Secondary (reader node)
- Secondary (reader node)

All of the nodes of the replica set work together: the primary node receives the writes from the app server, and the data is then copied to the secondaries. If something happens to the primary node, the replica set will elect a secondary as the new primary. To make this process work more efficiently and ensure a smooth failover, it is important for all the nodes of the replica set to have the same hardware configuration.
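For illustration, here is a minimal sketch of initiating a three-member replica set from the shell. The replica set name and hostnames are placeholders, and each mongod must already be running with the same --replSet name:

rs.initiate({
    _id: "rs0",
    members: [
        { _id: 0, host: "mongodb0.example.net:27017" },
        { _id: 1, host: "mongodb1.example.net:27017" },
        { _id: 2, host: "mongodb2.example.net:27017" }
    ]
})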
Another advantage of the replica set is that it is possible to send read operations to the secondary servers, increasing the read scalability of the database. After you deploy a replica set to production, it is important to check the health of the replica and the nodes. MongoDB has two important commands for this purpose:

rs.status() provides information on the current status of the replica set, using data derived from the heartbeat packets sent by the other members of the replica set. It's a very useful tool for checking the status of all the nodes in a replica set.

rs.printSecondaryReplicationInfo() provides a formatted report of the status of the replica set. It's very useful for checking whether any of the secondaries are behind the primary on data replication, as this would affect your ability to recover all your data in the event of something going wrong. If secondaries are too far behind the primary, then you could end up losing a lot more data than you are comfortable with.

Note, however, that these commands provide point-in-time information rather than continuous monitoring of the health of your replica set. In a real production environment, or if you have many clusters to check, running these commands could become time-consuming and annoying, so we recommend using a monitoring system like Percona PMM to keep an eye on your clusters.

MongoDB best practice #4: Use $regex queries only when necessary and choose text search instead where you can

Sometimes the simplest way to search for something in a database is to use a regular expression, or $regex operation. Many developers choose this option, but in fact using regular expressions can harm your search operations at scale. You should avoid $regex queries, especially when your database is big. A $regex query consumes a lot of CPU time and will normally be extremely slow and inefficient. Creating an index doesn't help much, and sometimes performance is worse with indexes than without them.

For example, let's run a $regex query on a collection of 10 million documents and use .explain(true) to view how many milliseconds the query takes. Without an index:

> db.people.find({"name":{$regex: "Zelmar"}}).explain(true)
- - Output omitted - -
"executionStats" : {
    "nReturned" : 19851,
    "executionTimeMillis" : 4171,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 10000000,
- - Output omitted - -

And if we create an index on "name":

> db.people.find({"name":{$regex: "Zelmar"}}).explain(true)
- - Output omitted - -
"executionStats" : {
    "nReturned" : 19851,
    "executionTimeMillis" : 4283,
    "totalKeysExamined" : 10000000,
    "totalDocsExamined" : 19851,
- - Output omitted - -

We can see in this example that the index didn't help to improve the $regex performance. It's common to see a new application using $regex operations for search requests, because neither the developers nor the DBAs notice any performance issues in the beginning, when the collections are small and the users of the application are few. However, when the collections grow and the application gathers more users, the $regex operations start to slow down the cluster and become a nightmare for the team. Over time, as your application scales and more users want to carry out search requests, performance can drop significantly.

Rather than using $regex queries, use text indexes to support your text search. Text search is more efficient than $regex but requires you to add text indexes to your data sets in advance. The indexes can include any field whose value is a string or an array of string elements. A collection can have only one text search index, but that index can cover multiple fields.
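Creating one is straightforward. For the people collection used in the examples above, a minimal sketch would be:

db.people.createIndex({ name: "text" })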
Using the same collection as above, we can test the execution time of the same query using text search:

> db.people.find({$text:{$search: "Zelmar"}}).explain(true)
- - Output omitted - -
"executionStages" : {
    "nReturned" : 19851,
    "executionTimeMillisEstimate" : 445,
    "works" : 19852,
    "advanced" : 19851,
- - Output omitted - -

In practice, the same query took almost four seconds less using text search than using $regex. Four seconds in "database time," let alone online application time, is an eternity. To conclude: if you can solve the query using text search, do so, and restrict $regex queries to those use cases where they are really necessary.

MongoDB best practice #5: Think wisely about your index strategy

Putting some thought into your queries at the start can have a massive impact on performance over time. First, you need to understand your application and the kinds of queries that you expect to process as part of your service. Based on this, you can create indexes that support them.

Indexes can help to speed up read queries, but they come with an extra storage cost and they slow down write operations. Consequently, you will need to think about which fields should be indexed, so you can avoid creating too many indexes. For example, if you are creating a compound index, following the ESR (Equality, Sort, Range) rule is a must (see the sketch at the end of this section), and using an index to sort the results improves the speed of the query.

Similarly, you can always check whether your queries are really using the indexes that you have created with .explain(). Sometimes we see a collection with indexes created, but the queries either don't use the indexes or use the wrong index entirely. It's important to create only the indexes that will actually be used for read queries. Having indexes that will never be used is a waste of storage and will slow down write operations.

When you look at the .explain() output, there are three main fields that are important to observe. For example:

keysExamined: 0
docsExamined: 207254
nreturned: 0

In this example, no indexes are being used. We know this because the number of keys examined is 0 while the number of documents examined is 207254. Ideally, the query should have a nreturned/keysExamined ratio of 1. For example:

keysExamined: 5
docsExamined: 0
nreturned: 5

Finally, if .explain() shows you that a particular query is using the wrong index, you can force the query to use a particular index with .hint(). Calling the .hint() method on a query overrides MongoDB's default index selection and query optimization process, allowing you to specify the index that is used, or to carry out a forward collection or reverse collection scan.
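As an illustration of the ESR rule mentioned above, here is a minimal sketch using a hypothetical orders collection (the collection and field names are placeholders): the equality field comes first in the compound index, then the sort field, then the range field.

// Hypothetical collection and fields, for illustration only.
// ESR order: equality (status), sort (orderDate), range (amount).
db.orders.createIndex({ status: 1, orderDate: -1, amount: 1 })

// Equality on status, sort on orderDate, range on amount:
// this query can be fully supported by the index above.
db.orders.find({ status: "shipped", amount: { $gt: 100 } }).sort({ orderDate: -1 })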
MongoDB best practice #6: Check your queries and indexes frequently

Every database is unique and particular to its application, and so is the way it grows and changes over time. Nobody knows how an application will grow over months and years or how the queries will change. Whatever assumptions you make, your prediction will inevitably be wrong, so it is essential to check your database and indexes regularly. For example, you might plan a specific query optimization approach and a particular index, only to realize after a year that few queries are using that index and it's no longer necessary. Continuing with this approach will cost you more in storage while not providing any improvements in application performance.

For this reason, it's necessary to carry out query optimizations and look at the indexes of each collection frequently. MongoDB has some tools for query optimization, such as the database profiler and the .explain() method. We recommend using them to find which queries are slow, how the indexes are being used by the queries, and where you may need to improve your optimizations.

In addition to removing indexes that are not used efficiently, look out for duplicate indexes that you don't need. At Percona, we use scripts to check whether there are duplicate indexes or indexes that are not being used. You can find them in our repository on GitHub: https://github.com/percona/support-snippets/tree/master/mongodb/scripts

Similarly, you should consider how many results you want to get from a query, as returning too many results can impact performance. Sometimes you only need the first five results of a query, rather than tens or hundreds of responses. In those cases, you can limit the number of query results with .limit(). Another useful approach is to use projections to get only the data you need: if you need only one field of a document, use a projection instead of retrieving the entire document and filtering on the app side. Lastly, if you need to order the results of a query, be sure that you are using an index and taking advantage of it to improve your efficiency.
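As a quick sketch of these last two tips, using the people collection from earlier (the email field is hypothetical):

// Return only the name and email fields of the first five matches,
// instead of shipping whole documents to the application.
db.people.find(
    { name: "Zelmar" },
    { _id: 0, name: 1, email: 1 }
).limit(5)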
MongoDB best practice #7: Don't run multiple mongod or mongos instances on the same server

Even though it's possible to run multiple mongod or mongos instances on the same server, using different processes and ports, we strongly recommend not doing this. When you run multiple mongod or mongos processes on the same server, it becomes very difficult to monitor them and the resources they are consuming (CPU, RAM, network, etc.). Consequently, when there is a problem, it becomes extremely difficult to find out what is going on and get to the root cause of the issue. We see a lot of cases where customers have experienced a resource problem on the server, but because they are running multiple instances of mongod or mongos, even discovering which specific process has the problem is difficult, making troubleshooting extremely challenging.

Similarly, in some cases where developers have implemented a sharded cluster to scale up their application data, we have seen multiple shards running on the same server. In these circumstances, the router will send a lot of queries to the same node, overloading the node and leading to poor performance, the exact opposite of what the sharding strategy aims to achieve.

The worst-case scenario involves replica sets. Imagine running a replica set for resiliency and availability, and then discovering that two or more members of the replica set are running on the same server. This is a recipe for disaster and data loss. Rather than architecting your application for resiliency, you will have made the whole deployment more likely to fail.

MongoDB best practice #8: Back up frequently

So, you have a cluster with replication, but do you want to sleep better? Run backups of your data frequently. Frequent backups allow you to restore the data from an earlier moment if you need to recover from an unplanned event. There are a number of different options for backing up your MongoDB data:

Mongodump / Mongorestore

Mongodump reads data from MongoDB and creates a BSON file that Mongorestore can use to populate a MongoDB database. These are efficient tools for backing up small MongoDB deployments. On the plus side, you can select a specific database or collection to back up, and this approach doesn't require stopping writes on the node. However, this approach doesn't back up the indexes you have created, so when restoring you would need to re-create those indexes. Logical backups are, in general, very slow and time-consuming, so you would have to factor that time into your restore process. Lastly, this approach is not recommended for sharded clusters, which are more complex deployments.
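For reference, a minimal sketch of a dump and restore cycle; the database name and backup path are placeholders, and if you followed best practice #1 you will also need to pass credentials (e.g., -u, -p, and --authenticationDatabase):

mongodump --port 27017 --db myapp --out /backups/myapp-dump
mongorestore --port 27017 --db myapp /backups/myapp-dump/myapp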
Percona Backup for MongoDB

Percona Backup for MongoDB is an open source, distributed, and low-impact solution for consistent backups of MongoDB sharded clusters and replica sets. It enables backups for MongoDB servers, replica sets, and sharded clusters. It supports logical, physical, and point-in-time recovery backups, and can back up to a range of storage types, including AWS S3, Azure, and the filesystem. However, it does require initial setup and configuration on all the nodes that you want to protect.

Physical / file system backups

You can create a backup of a MongoDB deployment by making a copy of MongoDB's underlying data files. You can use different methods for this type of backup, from manually copying the data files, to Logical Volume Management (LVM) snapshots, to cloud-based snapshots. These are usually faster than logical backups, and they can be copied or shared to remote servers. This approach is especially recommended for large data sets, and it is convenient when building a new node on the same cluster. On the downside, you cannot select a specific database or collection when restoring, and you cannot do incremental backups. Further, running a dedicated node for taking the backup is recommended, as the backup requires halting writes, which impacts application performance.

MongoDB best practice #9: Know when to shard your replica set and choose a shard key carefully

Sharding is the most complex architecture you can deploy with MongoDB. As your database grows, you will need to add more capacity to your server. This can involve adding more RAM, more I/O capacity, or even more powerful CPUs to handle processing. This is called vertical scaling. However, if your database grows so much that it outstrips the capacity of a single machine, then you may have to split the workload up. There may not be a physical server large enough to handle the workload, or the server instance may cost so much that it would be unaffordable to run. In these circumstances, you need to start thinking about horizontal scaling, which involves dividing the database over multiple servers and adding servers to increase capacity as required.

For MongoDB, this process is called sharding, and it relies on a shard key to manage how workloads are split up across machines. Choosing a shard key may be the most difficult task you will face when managing MongoDB. It's necessary to study the data sets and queries and plan ahead before choosing the key, because it's very difficult to revert the sharding once it has been carried out. For MongoDB 4.2 and earlier versions, assigning a shard key is a one-way process that cannot be undone. For MongoDB 4.4 and later, it is possible to refine a shard key, while MongoDB 5.0 and above allow you to change the shard key with the reshardCollection command.

If you choose a bad shard key, a large percentage of documents may go to one of the shards and only a few to another. This will make the sharded cluster unbalanced, which will affect performance over time. An unbalanced cluster typically results when a monotonically growing key is chosen to shard a collection, as all the documents above a given value would go to one shard rather than being distributed evenly.

Alongside the value used to shard the data, you will also need to think about the queries that will run across the shards. The queries must use the shard key so that the mongos process can distribute them across the sharded cluster. If a query doesn't use the shard key, mongos will send the query to every shard of the cluster, affecting performance and making the sharding strategy inefficient.
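To make this concrete, here is a minimal sketch of sharding a hypothetical collection, and of the difference between a targeted query and a scatter-gather query; the database, collection, and field names are placeholders:

// Enable sharding and shard the collection on a hashed customerId,
// which avoids the monotonic-key hotspot described above.
sh.enableSharding("myapp")
sh.shardCollection("myapp.orders", { customerId: "hashed" })

// Targeted: includes the shard key, so mongos routes it to one shard.
db.orders.find({ customerId: 12345, status: "shipped" })

// Scatter-gather: no shard key, so mongos must query every shard.
db.orders.find({ status: "shipped" })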
MongoDB best practice #10: Don't throw money at the problem

Last but not least, it's common to see teams throwing money at the problems they have with their databases. However, instead of immediately reaching for the credit card, first try to think laterally and look for a better solution. Adding more RAM or CPU, or moving to a larger instance or a bigger machine, can overcome a performance problem, but doing so without first analyzing the underlying problem or the bottleneck can lead to more of the same kinds of problems in the future. In most cases, the answer is not spending more money on resources but looking at how to optimize your implementation for better performance at the same level.

Although cloud services make it easy to scale up instances, the costs of inefficiency can quickly mount up. Worse, this is an ongoing expense that will carry on over time. By looking at areas like query optimization and performance first, it's possible to avoid additional spend. For some of the customers we have worked with, the ability to downgrade their EC2 instances saved their companies a lot of money in monthly charges. As a general recommendation, adopt a cost-saving mindset, and before adding hardware or beefing up cloud instances, take your time to analyze the problem and think of a better solution for the long term.

Zelmar Michelini is a database engineer at Percona, a provider of open source database software, support, and services for MongoDB, MySQL, and PostgreSQL databases.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.