From performance to programmability, the right database makes all the difference. Here are 13 key questions to guide your selection.
Picking the “right” database can often be critical to the success of an application. Rather than taking the advice of vendors or using a database because you already happen to have it, it’s useful to consider the fundamental purpose and requirements of the data store.
The most important questions to ask when you are picking a database are:
- How much data do you expect to store when the application is mature?
- How many users do you expect to handle simultaneously at peak load?
- What availability, scalability, latency, throughput, and data consistency does your application need?
- How often will your database schemas change?
- What is the geographic distribution of your user population?
- What is the natural “shape” of your data?
- Does your application need online transaction processing (OLTP), analytic queries (OLAP), or both?
- What ratio of reads to writes do you expect in production?
- Do you need geographic queries and/or full-text queries?
- What are your preferred programming languages?
- Do you have a budget? If so, will it cover licenses and support contracts?
- Are there legal restrictions on your data storage?
- Should you use a SQL or NoSQL database?
Let’s expand on those questions and their implications.
How much data do you expect to store?
If your estimate is in gigabytes or less, then almost any database will handle your data, and in-memory databases are completely feasible. There are still many database options to handle data in the terabyte (thousands of gigabytes) range.
If your answer is in petabytes (millions of gigabytes) or more, then far fewer databases will serve you well, and you need to be prepared for significant data storage costs, either in capital expenditures for on-premises storage or in operating expenditures for cloud storage. At that scale you may want tiered storage so that queries on “live” data can run in-memory or against local SSDs for speed, while the full data set resides on spinning disks for economy.
For databases that need to hold and query hundreds of petabytes to exabytes of data, there are only a few viable choices. Be very careful about evaluating and planning exabyte-scale databases, as they are huge commitments and hard to change once the data is loaded.
Some people combine a data lake for all of their data with data warehouses for subsets of the data; others use a data lakehouse architecture, such as Databricks and Dremio. Still others use one of the databases derived from the original Google Spanner paper, such as Google Cloud Spanner (SQL and NoSQL), HBase (NoSQL), Cassandra (NoSQL), DataStax (NoSQL), ScyllaDB (NoSQL), and CockroachDB (SQL); or sharded extensions of SQL databases, such as Vitess and PlanetScale.
How many users do you expect to handle simultaneously?
Estimating the load from many simultaneous users is often treated as a server sizing exercise to be done just before installing your production database. Unfortunately, many databases just can’t handle thousands of users querying terabytes or petabytes of data, because of scaling issues.
Estimating simultaneous users is much easier for a database used by employees than for a public database. For the latter, you may need to have the option of scaling out to multiple servers for unexpected or seasonal loads. Unfortunately, not all databases support horizontal scaling without time-consuming manual sharding of the large tables.
Database indexes in read/write transactional SQL databases can limit the number of simultaneous queries because of contention for the indexes. Sometimes people combine a transactional database with an analysis database to mitigate this problem. (See the “OLTP, OLAP, or HTAP?” section below.)
What are your ‘-ility’ requirements?
In this category I include availability, scalability, latency, throughput, and data consistency, even though not all terms end with “-ility.”
Availability is often a key criterion for transactional databases. While not every application needs to run 24/7 with 99.999% availability, some do. A few cloud databases do offer that five-nines availability, as long as you run them in multiple availability zones. On-premises databases can usually be configured for high availability outside of scheduled maintenance periods, especially if you can afford to set up an active-active pair of servers.Scalability, especially horizontal scalability, has historically been better for NoSQL databases than SQL databases, but several SQL databases are catching up. Dynamic scalability is much easier to accomplish in the cloud. Databases with good scalability can handle many simultaneous users by scaling up or out until the throughput is sufficient for the load.
Latency refers both to the response time of the database and to the end-to-end response time of the application. Ideally every user action will have a sub-second response time; that often translates to needing the database to respond in under 100 ms for each simple transaction. Analytic queries can often take seconds or minutes. Applications can preserve response time by running complicated queries in the background.
Throughput for an OLTP database is usually measured in transactions per second. Databases with high throughput can support many simultaneous users.
Data consistency is usually “strong” for SQL databases, meaning that all reads return the latest data. Data consistency may be anything from “eventual” to “strong” for NoSQL databases. Eventual consistency offers lower latency, at the risk of reading stale data.
Consistency is the “C” in the ACID properties required for validity in the event of errors, network partitions, and power failures. The four ACID properties are Atomicity, Consistency, Isolation, and Durability.
Are your database schemas stable?
If your database schemas are unlikely to change significantly over time, and you want most fields to have consistent types from record to record, then SQL databases would be a good choice for you. Otherwise, NoSQL databases, some of which don’t even support schemas, might be better for your application. There are exceptions, however. For example, Rockset allows for SQL queries without imposing a fixed schema or consistent types on the data it imports.
Geographic distribution of users
When your database users are all over the world, the speed of light imposes a lower limit on database latency for the remote users unless you provide additional servers in their regions. Some databases allow for distributed read-write servers; others offer distributed read-only servers, with all writes forced to go through a single master server. Geographic distribution makes the trade-off between consistency and latency even harder.
Most of the databases that support globally distributed nodes and strong consistency use consensus groups to speed up writes without seriously degrading consistency, typically using the Paxos (Lamport, 1990) or Raft (Ongaro and Ousterhout, 2013) algorithms. Distributed NoSQL databases that are eventually consistent typically use non-consensus, peer-to-peer replication, which can lead to conflicts when two replicas receive concurrent writes to the same record, conflicts which are usually resolved heuristically.
Data shape
SQL databases classically store strongly-typed data in rectangular tables with rows and columns. They rely on defined relations between tables, use indexes to speed up selected queries, and use JOINS to query multiple tables at once. Document databases typically store weakly-typed JSON that may include arrays and nested documents. Graph databases either store vertexes and edges, or triples, or quads. Other NoSQL database categories include key-value and columnar stores.
Sometimes the data is generated in a shape that will also work for analysis; sometimes it isn’t, and a transformation will be necessary. Sometimes one kind of database is built on another. For example, key-value stores can underlie almost any kind of database.
OLTP, OLAP, or HTAP?
To unscramble the acronyms above, the question is whether your application needs a database for transactions, analysis, or both. Needing fast transactions implies fast write speed and minimal indexes. Needing analysis implies fast read speed and lots of indexes. Hybrid systems use various tricks to support both requirements, including having a primary transactional store feeding a secondary analysis store through replication.
Read/write ratio
Some databases are faster at reads and queries, and others are faster at writes. The mix of reads and writes you expect from your application is a useful number to include in your database selection criteria, and can guide your benchmarking efforts. The optimum choice of index type differs between read-heavy applications (usually a B-tree) and write-heavy applications (often a log-structured merge-tree, aka LSM tree).
Geospatial indexes and queries
If you have geographic or geometric data and you want to perform efficient queries to find objects inside a boundary or objects within a given distance of a location, you need different indexes than you would for typical relational data. An R-tree is often the preferred choice for geospatial indexes, but there are more than a dozen other possible geospatial index data structures. There are a couple of dozen databases that support spatial data; most support some or all of the Open Geospatial Consortium standard.
Full-text indexes and queries
Similarly, efficient full-text search of text fields requires different indexes than relational or geospatial data. Typically, you build an inverted list index of tokenized words and search that to avoid doing a costly table scan.
Preferred programming languages
While most databases support APIs for many programming languages, the preferred programming language in your application can sometimes influence your choice of database. For example, JSON is the natural data format for JavaScript, so you may want to choose a database that supports the JSON data type for a JavaScript application. When you use a strongly-typed programming language, you may want to choose a strongly-typed database.
Budgetary constraints
Databases range in price from free to very expensive. Many databases have both free and paid versions, and sometimes have more than one level of paid offering, for example offering an Enterprise version and different service response times. In addition, some databases are available in the cloud on pay-as-you-go terms.
If you choose a free, open source database, you may have to forego vendor support. As long as you have expertise in-house, that may be fine. On the other hand, it may be more productive for your people to concentrate on the application and leave database administration and maintenance to vendors or cloud providers.
Legal restrictions
There are many laws about data security and privacy. In the EU, GDPR has wide-ranging implications for privacy, data protection, and the location of data. In the US, HIPAA regulates medical information, and GLBA regulates the way financial institutions handle customers’ private information. In California, the new CCPA enhances privacy rights and consumer protection.
Some databases are capable of handling data in a way that complies with some or all of these regulations, as long as you follow best practices. Other databases have flaws that make it very difficult to use them for personally identifiable information, no matter how careful you are.
Should you use a SQL or NoSQL database?
Many of the previous questions factor into the answer to this one. Let’s consider two clear cases.
- Extreme case 1: You need very low latency and high throughput, you don’t care about short-term consistency as long as the data looks consistent after a few minutes, your data is adequately represented by key-value records, and you want massive horizontal scalability. Unquestionably, you should use a key-value NoSQL database. There are many of these available, both open-source and proprietary, and both on-premises and hosted in the cloud.
- Extreme case 2: You are tracking financial transactions and you have to guarantee atomicity, consistency, isolation, and durability (ACID) properties for all transactions, even if they involve multiple tables in geographically distributed regions. All other considerations, such as low latency and high throughput, are nice to have but secondary to reliable, accurate transactions. Here you should use a SQL database, preferably a distributed SQL database such as Google Cloud Spanner or CockroachDB.
There are many, many intermediate cases, too many to discuss individually. In some cases, either a SQL database or a document database will serve you well. Which you choose may depend on your preference for one query language or another. In other cases, you may have a very large data set that you could handle in a data lake (NoSQL) plus a data warehouse (SQL), a data lakehouse (SQL), or a high-capacity columnar (NoSQL) database such as Cassandra or ScyllaDB. In almost all of these cases, you can guide your choice by asking yourself the questions discussed above.
Which database should you use?
Honestly, that was a long list of factors to consider when choosing a database, probably more than you would prefer to ponder. Nevertheless, it’s worth trying to answer all of the questions to the best of your team’s ability before you risk committing your project to what turns out to be an inadequate or excessively expensive database.