Serdar Yegulalp
Senior Writer

NoSQL standouts: The best document databases

Jan 24, 2018

Which document-oriented database is right for your app? Follow this guide to the most developer-friendly NoSQL databases

“The right tool for the right job.” If such wisdom holds true anywhere, it certainly holds true with the choice of database a developer picks for a given application. Document databases, one of the family of data products collectively referred to as “NoSQL,” are for developers who want to focus on their application rather than the database technology.

With a document database, data is not stored in tables with distinct column types. Instead, it is stored in freeform “documents” with any number of fields and any number of nested structures. Such documents are typically represented as JSON, and updated either by way of APIs or by sending JSON to a REST endpoint. Most every modern programming language supports JSON and REST, so working with a document database feels more like working natively with those data structures than working with a traditional database.
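
As a rough illustration of the document model, the sketch below uses only Python's standard json module. The collection, field names, and values are made up for the example, and no particular database is implied.

```python
import json

# Two "documents" in the same hypothetical collection. They share no fixed
# schema: the second has a nested address and a list that the first lacks.
users = [
    {"_id": "u1", "name": "Ada", "email": "ada@example.com"},
    {
        "_id": "u2",
        "name": "Grace",
        "address": {"city": "Arlington", "country": "US"},
        "languages": ["COBOL", "FORTRAN"],
    },
]

# Documents travel as JSON text and map straight back to native structures.
wire_format = json.dumps(users[1])
round_tripped = json.loads(wire_format)

print(round_tripped["address"]["city"])
```

The application works with ordinary dictionaries and lists throughout; there is no separate row-to-object mapping step.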

This schemaless design, as it is called, has its limitations. A developer must do more work to ensure that inserted data is consistent, because such consistency isn’t always guaranteed by the database itself. SQL, the standard-issue and widely understood language for database work, isn’t supported by most document databases, so those with existing database expertise must start from scratch. But the convenience, speed, scalability, and versatility of a document database are hard to beat when you’re writing an application that needs a protean, free-form data structure.
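
The extra consistency work usually ends up in application code. Here is a minimal sketch of what that can look like, assuming a hypothetical "user" document with made-up rules about required fields; real projects often use a schema library instead.

```python
def validate_user(doc):
    """Return a list of problems found in a user document (empty if none)."""
    problems = []
    # Required fields and their expected types: hypothetical rules.
    required = {"name": str, "email": str}
    for field, expected_type in required.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            problems.append(f"wrong type for {field}")
    # Optional fields are checked only when present.
    if "age" in doc and not isinstance(doc["age"], int):
        problems.append("wrong type for age")
    return problems


good = {"name": "Ada", "email": "ada@example.com", "age": 36}
bad = {"name": "Grace", "age": "unknown"}

print(validate_user(good))
print(validate_user(bad))
```

A check like this would run before every insert or update, since the database itself would accept either document.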

Here we’ve profiled nine of the best known and most widely used document databases. Four of them (CouchDB, Couchbase Server, MongoDB, and RethinkDB) are open source projects with few or no practical barriers to getting started; Couchbase and MongoDB are also available in supported enterprise editions under commercial licenses. Another four (Amazon DynamoDB, Microsoft Azure Cosmos DB, Google Firebase, and IBM Cloudant) are hosted services from major cloud vendors, where close integration with other services in those clouds is a big draw. The ninth, MarkLogic, is a proprietary commercial product.

See the table below to compare features. Read on for brief discussions of each database.

Amazon DynamoDB

Amazon’s DynamoDB document store began life in 2012 as an extension of Amazon’s SimpleDB. Under the hood it is powered by a key-value store, Dynamo. One of Dynamo’s co-developers went on to draw on many of the same ideas to create Apache Cassandra.

DynamoDB features

Like most of Amazon’s other cloud offerings, DynamoDB is a pay-as-you-go-for-what-you-need managed service. Developers set how much storage capacity to provide for keeping either unstructured documents or key-value pairs, and choose a flat hourly rate limit for read and write requests to the database. No need to provision servers or configure replication—Amazon handles all of that under the covers, and recently added autoscaling to the mix.
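
The provisioned-capacity idea shows up directly in the parameters for creating a table. The sketch below only builds request parameters in the shape accepted by the `create_table` call in the AWS SDK for Python (boto3); it does not contact AWS. The table name, key name, and capacity numbers are placeholders.

```python
# Parameters in the shape expected by boto3's DynamoDB create_table call.
# Table name, key name, and capacity figures are illustrative placeholders.
create_table_params = {
    "TableName": "Users",
    "AttributeDefinitions": [
        {"AttributeName": "user_id", "AttributeType": "S"},  # S = string
    ],
    "KeySchema": [
        {"AttributeName": "user_id", "KeyType": "HASH"},  # partition key
    ],
    # The read and write rate limits the developer chooses up front.
    "ProvisionedThroughput": {
        "ReadCapacityUnits": 5,
        "WriteCapacityUnits": 5,
    },
}

# With boto3 installed and credentials configured, this would be sent as:
#   boto3.client("dynamodb").create_table(**create_table_params)
print(create_table_params["ProvisionedThroughput"])
```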

Naturally, DynamoDB offers developers useful integrations with other services in the Amazon cloud. Triggers, for instance, can be set up by way of AWS Lambda functions. Amazon’s BI and analysis tools are also nearby. The proximity to these services is convenient, but it also means Amazon can upsell functionality any number of ways. Caching and acceleration a la Redis, for instance, are available by way of the DynamoDB Accelerator, a cost-plus add-on.

DynamoDB Local

You won’t find DynamoDB in an open source incarnation. It’s available exclusively as a hosted offering on the Amazon cloud.

That said, unlike many other cloud-native databases, DynamoDB is also available in a version that can be downloaded and run locally. DynamoDB Local is not intended for production use, but rather as a way to stage an application in a test environment without requiring connectivity or running up an Amazon bill.

Microsoft Azure Cosmos DB

Cosmos DB is an ambitious project, a database system that encompasses multiple models for storing and retrieving data. Cosmos DB can serve as a document database, a columnar database, a graph database, or a key-value store, allowing the user to pick the paradigm that suits them and draw on various APIs for working with those paradigms. 

Cosmos DB features

Rather than invent an entirely new API for a document database system, Cosmos DB provides an API compatible with the popular MongoDB (discussed below). Among the benefits is that existing code that uses MongoDB interface libraries or MongoDB’s binary wire protocol can work as-is. It amounts to Cosmos DB being able to provide MongoDB as a service. Likewise, Cosmos DB supports the API of Cassandra, the popular column-family database. 

Microsoft touts several advantages to Cosmos DB that aren’t necessarily exclusive to its document database functionality, but are intended to appeal to those building document database applications. One such offering is tunable consistency levels. If you have some classes of document transactions that require stronger consistency across Azure regions than others, you can manually specify them on a per-transaction basis.

Other features are more specific to document databases. For instance, MongoDB users have to set up indexes on document collections to optimize searches. Cosmos DB users working with the MongoDB APIs don’t have to set up indexing for documents, as every property in an inserted document is automatically indexed.
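
To make the idea of indexing every property concrete, here is a toy sketch in plain Python. It is not how Cosmos DB is implemented; the function names and the slash-separated path notation are invented for illustration, and only nested objects with scalar values are handled.

```python
from collections import defaultdict


def property_paths(doc, prefix=""):
    """Yield (path, value) pairs for every scalar property in a document."""
    for key, value in doc.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            yield from property_paths(value, path)
        else:
            yield path, value  # lists are not handled in this sketch


def build_index(docs):
    """Map each (path, value) pair to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for path, value in property_paths(doc):
            index[(path, value)].add(doc_id)
    return index


docs = {
    "d1": {"name": "Ada", "address": {"city": "London"}},
    "d2": {"name": "Alan", "address": {"city": "London"}},
}
index = build_index(docs)

# Any property can be looked up without declaring an index first.
print(sorted(index[("/address/city", "London")]))
```

The trade-off in a scheme like this is that every write touches the index, which is why other systems leave the choice of indexed fields to the developer.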

Using Cosmos DB on Microsoft Azure

There’s no locally hosted version of Cosmos DB. It’s only available as a service in the Microsoft Azure cloud. That said, development APIs for Cosmos DB are available for most every popular enterprise language—Java, Node.js, .NET, and Python.

Couchbase Server

Couchbase is not so much sibling to CouchDB as successor. Couchbase was built on work done in CouchDB and Membase, but is not related to either of those projects. It’s a document database and distributed key-value store rolled into one, with advanced features like automated failover and cross-datacenter replication, intended for enterprise use cases.

Couchbase features

One feature that sets Couchbase apart, not just from other NoSQL competition but from its predecessor CouchDB, is its SQL-like query language called N1QL (pronounced “nickel”). N1QL doesn’t offer the full range of commands you would expect from an ANSI SQL implementation, but it provides enough useful functions, such as JOIN operations, for someone with SQL experience to get workable results.

The Couchbase query system is not just for developers, but for the DBAs and business analysts who normally deal with conventional databases. Features like the EXPLAIN keyword seem to have been put in specifically to appeal to that crowd.

As a combination document database and key-value store, Couchbase stores documents by using their unique identifiers as the key. Documents can also be assigned time-to-live values, to function like a key-value cache. That said, a true key-value caching system like Redis will be far faster for basic key-value storage, but Couchbase is more flexible, and Redis and Couchbase can be combined effectively to speed things up. On that note, Couchbase has native support for the Memcached protocol, so existing applications that use Memcached can plug into Couchbase as a substitute.
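
The combination of keyed documents and time-to-live values can be sketched with a small in-memory class. This is a conceptual toy, not Couchbase's API; the class name, method names, and keys are all hypothetical.

```python
import time


class TTLStore:
    """Toy in-memory store: documents keyed by id, with optional expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (document, expires_at or None)

    def set(self, key, doc, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (doc, expires_at)

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        doc, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # lazily evict expired entries
            return None
        return doc


# A fake clock keeps the example deterministic.
now = [0.0]
store = TTLStore(clock=lambda: now[0])
store.set("session:42", {"user": "ada"}, ttl=30)
store.set("user:ada", {"name": "Ada"})  # no TTL: kept until overwritten

now[0] = 31.0  # advance the fake clock past the TTL
print(store.get("session:42"), store.get("user:ada"))
```

The same store serves both roles: the session entry behaves like a cache item, while the user document persists.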

Couchbase Community vs. Enterprise

Couchbase Server comes in a full-blown for-pay enterprise edition, a free-to-use community edition, and an open source edition, which is the foundation for the others. Binary downloads for the enterprise and community edition are available from Couchbase’s site, and the source code is available from Couchbase’s developer site. (There is no one GitHub repository for the Couchbase open source project as it is an aggregation of several projects.)

The community edition can be deployed in production, but lacks the more advanced features of the enterprise edition as well as support, so non-buyer beware. Some features in Couchbase, such as its horizontal scaling functionality, have found their way into the CouchDB project, but that is more the exception than the rule.

Couchbase Lite

Another edition of Couchbase worthy of note for app developers is Couchbase Lite, an embeddable version of Couchbase that can synchronize with instances of the full-blown edition. Couchbase Lite is the key component in Couchbase Mobile, an application stack for mobile apps that need a data store that synchronizes automatically with a back end. Couchbase Mobile is available for iOS, Android, Java, .NET, macOS, and tvOS.

CouchDB

The CouchDB project was begun in 2005 by a former IBM developer and moved to the Apache Software Foundation in 2008. It is sometimes assumed that CouchDB is the basis for Couchbase, but CouchDB and Couchbase are parallel projects with different aims.

CouchDB vs. Couchbase

Whereas Couchbase is both a document database and a key-value store, CouchDB is strictly a document database. And while Couchbase has long focused on enterprise features such as fault tolerance and a SQL-like query language, such niceties are only beginning to arrive in CouchDB.

CouchDB features

CouchDB emphasizes simplicity of deployment and ease of use. Retrieving data from the database is as simple as sending JSON-formatted queries to a REST HTTPS endpoint, with the results returned in JSON. Most every modern programming language can do these things, and also perform the mapping and reducing needed to create the views behind CouchDB queries and reports. There is no need for an ODBC driver or a data connector.
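
Here is a sketch of what those REST calls look like from Python's standard library. It assumes a CouchDB server at the default local address (port 5984) and a database named "inventory", both placeholders, and it only builds the requests without sending them. Authentication, which a real server would normally require, is left out.

```python
import json
import urllib.request

# Assumed local CouchDB server and database name.
BASE_URL = "http://localhost:5984"
DB = "inventory"


def put_document_request(doc_id, doc):
    """Build (but do not send) the HTTP request that stores a document."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{DB}/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def get_document_request(doc_id):
    """Build (but do not send) the HTTP request that fetches a document."""
    return urllib.request.Request(url=f"{BASE_URL}/{DB}/{doc_id}", method="GET")


req = put_document_request("widget-1", {"name": "Widget", "qty": 12})
print(req.get_method(), req.full_url)

# Against a running server, urllib.request.urlopen(req) would send the
# request, and the response body would be JSON.
```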

One of CouchDB’s special sauces is its data reconciliation technology. Changes made to one CouchDB peer are automatically reconciled with others, in a manner akin to a version control system. Any conflicts between document versions are retained as if they were previous revisions to that document.
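
The version-control analogy can be illustrated with a simplified sketch. The revision ids below imitate CouchDB's "generation-hash" style, but the winner-picking rule and the function names are a simplification written for this example, not CouchDB's actual algorithm.

```python
def rev_key(rev):
    """Split a revision id like '2-abc' into (generation, hash) for sorting."""
    generation, _, digest = rev.partition("-")
    return int(generation), digest


def reconcile(local, remote):
    """Merge two versions of one document that were edited independently.

    Both peers apply the same deterministic rule, so they agree on a winner
    without coordinating. The losing version is kept, not discarded.
    """
    if local["_rev"] == remote["_rev"]:
        return {"winner": local, "conflicts": []}
    winner, loser = sorted(
        [local, remote], key=lambda d: rev_key(d["_rev"]), reverse=True
    )
    return {"winner": winner, "conflicts": [loser]}


phone = {"_id": "note-1", "_rev": "2-a1f", "text": "Buy milk and eggs"}
laptop = {"_id": "note-1", "_rev": "2-c9d", "text": "Buy oat milk"}

result = reconcile(phone, laptop)
print(result["winner"]["text"], "| kept:", [d["_rev"] for d in result["conflicts"]])
```

Because the losing revision is retained, the application can inspect it later and merge the two edits by hand if the automatic choice was wrong.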

This eventually consistent model is useful for databases that aren’t always or consistently connected (such as for intermittently connected mobile applications), or in cases where you don’t need the latest-and-greatest version of data in a particular node. But eventual consistency is also one of CouchDB’s biggest caveats. If you do need immediate consistency, CouchDB is not the place to find it.

Scalability has long been a weak spot for CouchDB, but it has recently been addressed. Version 2.0 stirred in a new clustering technology, courtesy of bits open sourced by Cloudant/IBM and merged into the project. Finally, for those who are familiar with MongoDB and want to use a similar declarative query syntax, the Mango project, also from Cloudant/IBM, provides that as an external add-on.
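
For a feel of the Mango syntax, the sketch below builds a query body of the kind sent to a database's `_find` endpoint in CouchDB 2.0 and later. The field names are illustrative, and the `matches` helper is a toy evaluator written for this example that covers only equality and `$gt`; the real selector language has many more operators.

```python
import json

# A Mango query body, as it would be POSTed to a database's _find endpoint.
# The field names ("year", "title") are illustrative.
query = {
    "selector": {"year": {"$gt": 2010}},
    "fields": ["_id", "title", "year"],
    "limit": 10,
}
body = json.dumps(query)


def matches(doc, selector):
    """Toy evaluator covering only equality and $gt, for illustration."""
    for field, condition in selector.items():
        if isinstance(condition, dict):
            if "$gt" in condition:
                if field not in doc or not doc[field] > condition["$gt"]:
                    return False
        elif doc.get(field) != condition:
            return False
    return True


docs = [
    {"_id": "a", "title": "Old Film", "year": 2008},
    {"_id": "b", "title": "New Film", "year": 2015},
]
print([d["_id"] for d in docs if matches(d, query["selector"])])
```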

CouchDB download

CouchDB binaries for all major platforms, and source code, can be downloaded from the official CouchDB site. Source for the project is available on GitHub as well.

Google Firebase Realtime Database

You might think of Google Firebase as Google’s answer to DynamoDB—a way to provide fast-syncing data storage between a cloud back-end and local apps on multiple platforms.

The Firebase Realtime Database is just one component in the Firebase stack, intended for building apps heavy on audience engagement and insight. The whole stack includes functions like authentication, performance monitoring, user analytics, and many others, but here we focus on the Realtime Database itself.

Google Firebase features

Google acquired Firebase in 2014. In the years since, it has wired up Firebase to take advantage of many Google Cloud features. Google Cloud Functions for Firebase, for instance, allows you to trigger JavaScript functions in the cloud in response to Firebase events. Google Analytics for Firebase lets you pull mobile app data into BigQuery for deeper analysis.

As gaming is one of Firebase’s target applications, the SDKs provided for Firebase include the Unity cross-platform game development framework. Developers working on more conventional enterprise-focused or consumer-facing projects have plenty of other choices: native iOS and Android, C++, generic web/JavaScript, and any other language that supports REST (Java, Python, you name it).

Firebase is designed to work in scenarios where connectivity isn’t guaranteed. Like CouchDB, it caches changes locally when offline, and automatically synchronizes with the back end when connectivity returns. Note that Firebase isn’t designed to be used as a standalone, entirely offline solution; on Android, for instance, local databases are limited to 10 MB in storage.
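
The cache-then-synchronize pattern can be sketched in a few lines. This is a conceptual toy, not the Firebase SDK; the class, its methods, and the path strings are hypothetical, and a plain dictionary stands in for the cloud back end.

```python
class SyncingStore:
    """Toy client-side store that queues writes while offline."""

    def __init__(self, backend):
        self.backend = backend  # stands in for the cloud database
        self.local = {}         # local cache, always readable
        self.pending = []       # writes not yet sent to the backend
        self.online = False

    def write(self, path, value):
        self.local[path] = value  # local reads see the change immediately
        if self.online:
            self.backend[path] = value
        else:
            self.pending.append((path, value))

    def go_online(self):
        self.online = True
        for path, value in self.pending:  # replay queued writes in order
            self.backend[path] = value
        self.pending.clear()


cloud = {}
store = SyncingStore(cloud)
store.write("scores/ada", 10)  # both writes are made while offline
store.write("scores/ada", 25)
print(store.local, cloud)
store.go_online()
print(cloud)
```

Replaying writes in their original order means the back end ends up with the most recent local value once connectivity returns.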

Firebase on Google Cloud and GitHub

Firebase isn’t available as a standalone product, but only as part of Google’s cloud product offerings. The Firebase GitHub repository has source code for the SDKs and for various platform-specific tools.

IBM Cloudant

Cloudant is essentially IBM’s hosted edition of CouchDB. Originally, Cloudant was an independent company, offering an edition of CouchDB called “BigCouch” that was hosted on IBM’s SoftLayer cloud. In 2014, IBM acquired Cloudant outright as part of IBM’s overall push towards analytics and big data.

Cloudant vs. CouchDB

Cloudant is meant to be more than a hosted version of CouchDB. Cloudant provides features not readily available in CouchDB itself, such as natively integrated full-text search. Full-text search in CouchDB typically requires integration with external projects. Data can be replicated in both directions between Cloudant and an instance of CouchDB, so it’s relatively easy to move between the two as needed.

Some of Cloudant’s improvements to CouchDB have found their way back into the underlying CouchDB project, including CouchDB 2.0’s horizontal scaling functionality and the Mango query language interface. But don’t take that as proof that Cloudant features will automatically trickle down to CouchDB.

Cloudant on IBM Cloud

Cloudant is primarily a cloud offering on IBM Cloud, where it can be used in conjunction with other IBM Cloud data products such as dashDB, DataWorks, and Watson Analytics.

Cloudant Local

A behind-the-firewall edition of Cloudant, called Cloudant Local, offers all of the same functionality as the cloud-hosted offering. Cloudant Local is available on the Ubuntu and Red Hat flavors of x86 Linux, as well as IBM’s own System z running Red Hat or SUSE. Developers can download a free, test-and-dev-only version in a Docker image.

MarkLogic

MarkLogic might be described as the Oracle, DB2, or Microsoft SQL Server of NoSQL. MarkLogic is a proprietary NoSQL document database system, with an emphasis on performance and professional-grade features, and with a few concessions to allow developers to get quickly on board without going broke.

MarkLogic indexing

Like other document databases, MarkLogic can ingest and perform queries on semi-structured data, mainly in JSON and XML format. All document types can be indexed with a wide variety of indexing and search strategies. A “Content Processing Framework” provides automated tools for transforming documents outside the system—Microsoft Office, Adobe PDF, HTML, and so on—into MarkLogic data.

By default MarkLogic builds “universal indexes,” or document indexes that support the most common types of search or aggregation: individual words and phrases, or values of XML elements or JSON properties.

MarkLogic also lets developers specify indexes for more advanced searches. One example is range indexes. Range indexes allow searches for values, like prices or dates, that exist in a range between other values or that can be found in proximity to other values. These in turn allow functionality like performing joins between documents, or running SQL queries on views constructed from document data.
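
The core idea of a range index, keeping values sorted so that a range lookup does not have to scan every document, can be sketched with Python's bisect module. This is a conceptual toy rather than MarkLogic's implementation; the class and method names are invented for the example.

```python
import bisect


class RangeIndex:
    """Toy range index: sorted (value, doc_id) pairs searched with bisect."""

    def __init__(self):
        self._entries = []

    def add(self, value, doc_id):
        bisect.insort(self._entries, (value, doc_id))

    def between(self, low, high):
        """Return ids of documents whose value v satisfies low <= v <= high."""
        # (low,) sorts before any (low, doc_id), so this finds the first
        # entry whose value is at least low.
        start = bisect.bisect_left(self._entries, (low,))
        ids = []
        for value, doc_id in self._entries[start:]:
            if value > high:
                break
            ids.append(doc_id)
        return ids


prices = RangeIndex()
for price, doc_id in [(5, "d1"), (20, "d2"), (12, "d3"), (30, "d4")]:
    prices.add(price, doc_id)

print(prices.between(10, 20))
```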

MarkLogic storage

Documents are stored directly on disk in MarkLogic, but an admin can use the tiered storage feature to organize where documents live based on their access patterns. Constantly changing documents can be kept on an SSD, while archival data can be kept in a NAS or an Amazon S3 bucket.

Storage tiers are decoupled from MarkLogic’s other organizational schemes, so you can create a single database with components that span multiple storage tiers.

MarkLogic XML schemas and XQuery

For XML documents in MarkLogic you can specify a formal schema if you need one, although not for JSON documents. Schemas can be used to perform queries on XML data by way of the W3C standard XQuery language. MarkLogic provides a console environment, the Query Console, to allow interactive execution of XQuery commands on a MarkLogic database.

MarkLogic download

MarkLogic is commercial software that requires a license for production use. A free developer license allows users to work with up to 1TB of data, in both cloud and local instances of the software, for development and testing only.

MongoDB

MongoDB is easily the most widely deployed document database, and the best-known among the developer community. It embodies most of the key concepts found in document databases and NoSQL systems generally: schemaless storage, a scale-out architecture, and a shared-nothing design.

MongoDB Enterprise vs. Community

The open source edition of MongoDB already includes the vast majority of the features needed to gin up a basic production deployment. Commercial licenses add key enterprise features including backup, automation extensions, monitoring, data exploration tools, a BI connector with SQL support, and an in-memory storage engine.

The enterprise features in MongoDB have been tilting towards drawing enterprise developers away from the likes of Oracle, as seen in the recent addition of in-memory processing, a SQL-like interface through third-party data exploration and BI tools like Tableau, and the ability to perform recursive graph queries on document data. Graph queries are useful for exploring open-ended chains of relationships, such as family trees and social networks.
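
In MongoDB's aggregation framework, recursive graph queries are expressed with the `$graphLookup` stage. The sketch below shows a pipeline in that shape next to a plain-Python traversal that computes the same kind of result on an in-memory list. The collection name "people" and the fields "name" and "parent" are hypothetical, and nothing here connects to a database.

```python
# An aggregation pipeline that would walk a family tree upward from "Ada".
pipeline = [
    {"$match": {"name": "Ada"}},
    {
        "$graphLookup": {
            "from": "people",              # collection to search
            "startWith": "$parent",        # first value to look up
            "connectFromField": "parent",  # field followed on each hop
            "connectToField": "name",      # field it is matched against
            "as": "ancestors",             # output array field
        }
    },
]

people = [
    {"name": "Ada", "parent": "Byron"},
    {"name": "Byron", "parent": "John"},
    {"name": "John", "parent": None},
]


def ancestors(start_name, docs):
    """Follow 'parent' links until the chain ends, guarding against cycles."""
    by_name = {d["name"]: d for d in docs}
    found, seen = [], set()
    current = by_name[start_name]["parent"]
    while current is not None and current in by_name and current not in seen:
        seen.add(current)
        found.append(current)
        current = by_name[current]["parent"]
    return found


print(ancestors("Ada", people))
```

With a driver such as PyMongo, the pipeline would be passed to a collection's `aggregate` method; the Python function is only there to show what the traversal does.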

MongoDB data loss

MongoDB has also been the target of much criticism. Some of it has stemmed from developers not properly understanding the product’s aims and methodologies. But some of it has been due to real problems such as dirty reads and stale reads, lost updates, and the inability to sort by collations, a serious limitation when dealing with Unicode documents. Note that all of these issues were addressed in MongoDB 3.4.

MongoDB security

Another major issue has been security, with misconfigured and publicly exposed MongoDB instances being attacked and held for ransom, but that can be remedied by paying due attention to MongoDB’s security settings before putting it into production.

MongoDB downloads

MongoDB’s Community Server version, the free and open source edition of MongoDB, is available for download directly from MongoDB’s official site. Source code for MongoDB is available on GitHub. The Enterprise Server edition is also available for download directly from MongoDB, and can be used as-is for evaluation and testing free of charge.

RethinkDB

The story behind RethinkDB is as interesting as the project itself. RethinkDB was originally conceived as a commercial product with an open source (AGPL) licensed version, but the company behind the database failed. The Cloud Native Computing Foundation came to the rescue, purchasing the intellectual property for the project and donating it to the Linux Foundation. Now RethinkDB has a second life under a more liberal open source license and the sponsorship of a major player in the open source space.

RethinkDB features

The big innovation behind RethinkDB is a built-in change notification system that streams live updates to applications. In the words of its introductory documentation, “instead of polling for changes, the developer can tell the database to continuously push updated query results to applications in realtime.” Thus RethinkDB simplifies the development of real-time applications like multiplayer games, albeit at the cost of overall ACID compliance. Individual documents in the database are handled in a transactional way, but the state of the database as a whole is only eventually consistent.

RethinkDB lacks native support for SQL, but includes a querying system called ReQL, which is implemented by way of the native syntax in Python, JavaScript, Ruby, and Java. ReQL uses chained dot commands to allow developers to construct complex, lazily evaluated expressions in the language of their choice. For example: 

r.table('users').pluck('last_name').distinct().run(conn)

When changes to documents land in RethinkDB, they’re made available through a “changefeed,” a log that can be parsed by applications to derive details about the changes—for instance, when the data in question is new or an altered version of existing data. ReQL expressions are used to create what amount to callback functions for handling changefeed events (essentially, triggers). They can also define relationships between data entities by way of mechanisms that emulate table joins.
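
Changefeed events arrive as documents with `old_val` and `new_val` fields, so telling new data from altered data is a matter of checking which of the two is present. The helper below is a sketch that runs on hand-written sample events; the function name and the sample data are made up, and no RethinkDB connection is involved.

```python
def classify_change(change):
    """Label a changefeed event using its old_val and new_val fields."""
    old, new = change.get("old_val"), change.get("new_val")
    if old is None and new is not None:
        return "insert"
    if old is not None and new is None:
        return "delete"
    return "update"


# Hand-written events in the shape a changefeed delivers.
sample_events = [
    {"old_val": None, "new_val": {"id": 1, "last_name": "Lovelace"}},
    {"old_val": {"id": 1, "last_name": "Lovelace"},
     "new_val": {"id": 1, "last_name": "King"}},
    {"old_val": {"id": 1, "last_name": "King"}, "new_val": None},
]

print([classify_change(event) for event in sample_events])

# With the RethinkDB Python driver and an open connection, the events would
# come from a loop such as:
#   for change in r.table('users').changes().run(conn):
#       handle(classify_change(change), change)  # handle() is hypothetical
```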

RethinkDB download

Binaries for RethinkDB are available from the official RethinkDB project page. Official Docker images can be pulled from Docker Hub. The project’s source is available on GitHub.

Serdar Yegulalp is a senior writer at InfoWorld, covering software development and operations tools, machine learning, containerization, and reviews of products in those categories. Before joining InfoWorld, Serdar wrote for the original Windows Magazine, InformationWeek, the briefly resurrected Byte, and a slew of other publications. When he's not covering IT, he's writing SF and fantasy published under his own personal imprint, Infinimata Press.
