Martin Heller
Contributor

Amazon Neptune review: A scalable graph database for OLTP

Reviews | May 13, 2019 | 11 mins
Amazon Web Services, Cloud Computing, Databases

Amazon’s graph database service offers ACID properties, immediate consistency, and auto-scaling storage for billions of relationships

Graph databases, such as Neo4j, Apache Spark GraphX, DataStax Enterprise Graph, IBM Graph, JanusGraph, TigerGraph, AnzoGraph, the graph portion of Azure Cosmos DB, and the subject of this review, Amazon Neptune, are good for several kinds of applications involving highly connected data sets, such as providing recommendations based on social graphs, performing fraud detection, providing real-time product recommendations, and detecting incursions in network and IT operations. These are areas where traditional, relational databases tend to become inefficient and slow because of the need for complex SQL joins operating on large data sets.

Neptune is a fully managed graph database service with ACID properties and immediate consistency. At its core is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. Neptune supports two of the most popular open source graph query languages, Apache TinkerPop Gremlin and W3C SPARQL. The popular Cypher Query Language (CQL) used in Neo4j started off proprietary, but later became open source.

Gremlin and SPARQL address different kinds of graph database. Gremlin, like CQL, is for property graph databases; SPARQL is for Resource Description Framework (RDF) triples, designed for the web. Gremlin is a graph traversal language; SPARQL is a query language with SELECT and WHERE clauses.

The Amazon Neptune implementation allows both the Gremlin and SPARQL languages in a single database instance, but they can’t see each other’s data. The reason for allowing both is to let new users figure out which works better for their needs.

The Neptune documentation has samples using the Gremlin-Groovy, Gremlin-Java, and Gremlin-Python variants of Gremlin. Neptune allows Gremlin in the console, HTTP REST calls, Java, Python, .Net, and Node.js programs. On the SPARQL side, Neptune supports the Eclipse RDF4J console and workbench, HTTP REST calls, and Java programs.
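As a minimal sketch of the HTTP REST path, here is how a Gremlin query can be packaged as a JSON POST from Python's standard library. The endpoint name is a placeholder (substitute your own cluster endpoint); 8182 is Neptune's default port, and the request must be sent from inside the cluster's VPC.

```python
import json
import urllib.request

# Placeholder cluster endpoint; replace with your Neptune writer or reader endpoint.
NEPTUNE_ENDPOINT = "https://your-neptune-endpoint:8182/gremlin"

def build_gremlin_request(query: str) -> urllib.request.Request:
    """Package a Gremlin query string as a JSON POST to the /gremlin endpoint."""
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return urllib.request.Request(
        NEPTUNE_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_gremlin_request("g.V().limit(1)")
# urllib.request.urlopen(req) would send it from an EC2 instance in the same VPC.
```

The same JSON body shape (`{"gremlin": "..."}`) is what a curl command against the REST endpoint would send.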

Amazon Neptune features and benefits

As a fully managed transactional graph database as a service, Amazon Neptune not only frees you from maintaining the database hardware and software, it also offers easy scaling, greater than 99.99% availability, and multiple levels of security. Neptune database clusters can have up to 64 TB of auto-scaling storage in six replicas of your data across three availability zones, and more if you enable high availability by using read replicas in additional zones.

The Neptune data storage layer is backed by SSDs, fault-tolerant, and self-healing. Disk failures are repaired in the background without loss of database availability. Neptune automatically detects database crashes and restarts—typically in 30 seconds or less—without needing to perform crash recovery or to rebuild the database cache, since the cache is isolated from the database processes and can survive a restart. If an entire primary instance fails, Neptune will automatically fail over to one of up to 15 read replicas. Backups are continuously streamed to S3.

You can scale Amazon Neptune clusters up and down either by modifying instances, or (to avoid downtime) by adding an instance of the desired size and shutting down the old instance once a copy of the data has migrated and you have promoted the new instance to primary. Neptune VM instance sizes range from db.r4.large (two vCPUs and 15 GiB of RAM) to db.r4.16xlarge (64 vCPUs and 488 GiB of RAM).

Amazon Neptune implements security by running the engine in a VPC (virtual private cloud) network and optionally encrypting the data at rest using the AWS Key Management Service. In addition to encrypting the underlying storage, Neptune also encrypts the backups, snapshots, and replicas. Neptune is eligible to be used in HIPAA applications. Neptune does not require you to create specific indices to achieve good query performance, which is a welcome change from balancing query and write performance by carefully tuning the indices.

[Diagram: Amazon Neptune architecture (IDG)]

The diagram above gives a high-level view of Amazon Neptune’s architecture. Note that the Neptune graph engine sits between the cloud-native storage service layer and the Gremlin and SPARQL interfaces.

Amazon Neptune does not support analytic query algorithms, such as PageRank, which are featured in some other graph databases, such as Neo4j, TigerGraph, and AnzoGraph. Neptune is intended to be a low-latency transactional (OLTP) graph database for huge data sets, not an analytic (OLAP) database, and is simply not optimized for analytic use cases or queries involving more than three hops—and PageRank touches every item in the database.

Amazon Neptune does support aggregates, so it can do a little analysis, but not a lot. Like Neptune, Neo4j was also originally intended to be used for OLTP, but added an analytic query library in 2017. The fact that analysis capabilities are limited in Neptune is not necessarily a reason to rule it out, as a low-latency OLTP graph database with globally distributed read replicas and the ability to handle 64 TB of data is nothing to sneeze at.

Get started with Amazon Neptune

You can launch an Amazon Neptune cluster two ways: directly from the Amazon Neptune console, or by using an AWS CloudFormation template to create a Neptune stack. Note that the CloudFormation template provided is not suitable for production, as it is not very secure—it is intended as the basis for a tutorial.

You can start small and add capacity—larger VMs, or more read replicas—whenever your application needs it. The storage grows automatically, and you only pay for the storage you use.

In the following screenshots, I show some of the lifecycle of a Neptune image created from the Neptune console. I start with the creation of a cluster.

[Screenshot: Amazon Neptune instance classes (IDG)]

When you create a Neptune cluster or instance, you need to specify the instance class to use. These range from two to 64 vCPUs, with 15 GiB to 488 GiB of RAM.

Do you want high availability for your cluster? It’s not hard at all.

[Screenshot: Amazon Neptune high availability setting (IDG)]

After picking an instance size, you decide whether you want to enable high availability for your cluster. It really is just a single click.

In the advanced settings, there are multiple panels. Fortunately, most of the defaults should serve your needs.

[Screenshot: Amazon Neptune advanced settings (IDG)]

The first two sections of the advanced settings define the Amazon Neptune cluster’s network security and port. In addition, you can enable database authentication via IAM users and roles.

[Screenshot: Amazon Neptune advanced settings, continued (IDG)]

You can enable data encryption and either use an AWS-supplied key or one of your own. You have fine control over the priority of read replicas to become primary in the case of a fail-over, and you can specify the retention period for backups.

Finally, you’ll see a warning before you press the button to launch the database. If you really need to see the explanation, right-click on the link and display it in another tab. (I’m hoping this bug will be fixed.)

[Screenshot: Amazon Neptune permission warning (IDG)]

After the advanced screens you’ll see a warning about permissions being granted. They’re probably OK, but don’t left-click on the “Learn more” link — when I did so, I lost all my settings and had to start the cluster creation again from scratch.

Once you have a working cluster, you can perform several actions on the instances.

[Screenshot: Amazon Neptune instance actions (IDG)]

Our primary Amazon Neptune instance and read replica are running and available. You can modify, reboot, delete, or fail-over each one. You can also create a read replica, take a snapshot, or restore the primary to a point in time.

At the cluster level, you can bring up a summary.

[Screenshot: Amazon Neptune cluster overview (IDG)]

Amazon Neptune cluster overview. This is where you can find the cluster writer and reader endpoints.

At the instance level you can see performance graphs.

[Screenshot: Amazon Neptune instance graphs (IDG)]

By default, you can see CPU, memory, and network graphs for an Amazon Neptune instance.

Loading data into Amazon Neptune

To load data into Amazon Neptune, you first load files into Amazon S3 in one of the correct formats: CSV for Gremlin, and N-Triples, N-Quads, RDF/XML, or Turtle for RDF. The loader supports gzip compression of single files.
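As a minimal sketch of the Gremlin bulk-load CSV format, here are a toy vertex file and edge file using hypothetical air-route data. Column headers use Neptune's reserved `~id`, `~label`, `~from`, and `~to` fields, and typed property columns take the form `name:Type`:

```python
import csv
import io

# Minimal vertex and edge files in Neptune's Gremlin bulk-load CSV format.
# ~id, ~label, ~from, and ~to are reserved column headers; property columns
# are declared as name:Type. The data values here are illustrative only.
vertices_csv = """~id,~label,code:String
v1,airport,AUS
v2,airport,AGR
"""

edges_csv = """~id,~from,~to,~label
e1,v1,v2,route
"""

vertex_rows = list(csv.DictReader(io.StringIO(vertices_csv)))
edge_rows = list(csv.DictReader(io.StringIO(edges_csv)))
```

Each edge row connects two vertex IDs from the vertex file, which is how the loader reconstructs the property graph.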

You’ll need to create an IAM role and an S3 VPC endpoint to give Neptune permission to access your S3 bucket, unless they were already created, for example by a CloudFormation template. There’s a Neptune loader API callable through a REST endpoint (e.g. from a curl command) that can take on the IAM role and bulk load the data into your cluster. There’s also a converter for GraphML to CSV on GitHub. The data loading walkthrough applies to any supported data format.
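Here is a sketch of a loader request built with Python's standard library instead of curl. The endpoint, bucket name, role ARN, and region are all placeholders to substitute with your own values:

```python
import json
import urllib.request

# Placeholder values; substitute your own endpoint, bucket, role ARN, and region.
LOADER_ENDPOINT = "https://your-neptune-endpoint:8182/loader"

payload = {
    "source": "s3://your-bucket/air-routes/",  # S3 prefix holding the data files
    "format": "csv",                           # or ntriples, nquads, rdfxml, turtle
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "us-east-1",
    "failOnError": "TRUE",
}

request = urllib.request.Request(
    LOADER_ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) returns a load id that you can poll
# (GET /loader/<load id>) to check the status of the bulk load.
```

The `format` value must match the files under the S3 prefix, and the IAM role named in `iamRoleArn` must be attached to the cluster.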

Query Amazon Neptune with Gremlin

You can use the Gremlin console and a REST endpoint to connect to and query your Amazon Neptune instances in Gremlin from an EC2 VM in the same VPC as the database. More usefully for applications, you can query Neptune using Gremlin in Java, Python, .Net, and Node.js.

There’s an entire book on Gremlin, Practical Gremlin: An Apache TinkerPop Tutorial, by Kelvin Lawrence. The book uses TinkerGraph for its database, but the same Gremlin syntax works for Neptune with small exceptions that are documented by Amazon.

Gremlin queries describe how to navigate graph vertices and edges. An example for the air routes database discussed in the book finds all the ways to fly from Austin, Texas (airport code AUS) to Agra, India (airport code AGR) with two stops:

g.V().has('code','AUS').repeat(out()).times(3).has('code','AGR').path().by('code')

If you want to try the examples in the book on Amazon Neptune, you’ll first need to copy the edges and nodes CSV files to an S3 bucket using the AWS CLI cp command, and load the data from there into Neptune.

Query Amazon Neptune with SPARQL

You can use the RDF4J console, RDF4J workbench, and a REST endpoint to connect to and query your Amazon Neptune instances in SPARQL from an EC2 VM in the same VPC as the database. More usefully for applications, you can query Neptune using SPARQL in Java. The SPARQL 1.1 Query Language specification defines how to construct queries. If you search the web for “sparql tutorial” you’ll find a number of free written and video tutorials on the subject. These should all work with Neptune once you have loaded the data.

SPARQL looks less like functional code than Gremlin does, and more like SQL. For example, this query finds each book and its creator, using the Dublin Core dc: prefix:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?book ?who
WHERE { ?book dc:creator ?who }
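The same query can be sent to Neptune's SPARQL endpoint over HTTP. This sketch uses Python's standard library and a placeholder endpoint name; Neptune serves SPARQL at /sparql on port 8182, and per the SPARQL 1.1 Protocol the query travels as a form-encoded POST body:

```python
import urllib.parse
import urllib.request

# Placeholder endpoint; Neptune serves SPARQL at /sparql on port 8182.
SPARQL_ENDPOINT = "https://your-neptune-endpoint:8182/sparql"

query = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?book ?who
WHERE { ?book dc:creator ?who }
"""

# SPARQL over HTTP: the query string goes in a form-encoded POST body.
body = urllib.parse.urlencode({"query": query}).encode("utf-8")
sparql_request = urllib.request.Request(SPARQL_ENDPOINT, data=body, method="POST")
# urllib.request.urlopen(sparql_request) would return the query results,
# when sent from an EC2 instance in the same VPC as the cluster.
```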

Amazon Neptune performance and scaling

Amazon Neptune was designed for low-latency graph queries of up to three hops on huge (up to 64 TB) databases. It supports up to 15 low-latency read replicas across three availability zones to scale read capacity and, according to Amazon, can execute more than 100,000 graph queries per second.

Its instance sizes go from two to 64 vCPUs with 15 GiB to 488 GiB of RAM, roughly doubling both at each step. At most, Neptune can use 64 vCPUs times 16 instances for a total of 1024 vCPUs, and 488 GiB RAM times 16 instances for a total of 7808 GiB of RAM. That’s an overall possible scaling growth of 512x, considering both instance size and numbers of read replicas. The costs track the resource scaling pretty much exactly.
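The arithmetic behind those totals is a straightforward back-of-the-envelope calculation from the instance sizes cited above:

```python
# Back-of-the-envelope check of Neptune's scaling range, using the
# db.r4.large through db.r4.16xlarge figures cited in the text.
MIN_VCPUS, MAX_VCPUS = 2, 64
MAX_RAM_GIB = 488
MAX_INSTANCES = 16  # one primary plus up to 15 read replicas

total_vcpus = MAX_VCPUS * MAX_INSTANCES      # 64 * 16 = 1024 vCPUs
total_ram_gib = MAX_RAM_GIB * MAX_INSTANCES  # 488 * 16 = 7808 GiB
scaling_factor = total_vcpus // MIN_VCPUS    # 512x over the smallest single instance
```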

If you’re looking for a transactional graph database as a service, Amazon Neptune is a viable choice. Given that Neptune supports both the Gremlin and SPARQL query languages, you should be able to find developers to build applications against it. On the other hand, Neptune’s lack of Cypher support may discourage existing Neo4j users, and its lack of OLAP and graph algorithm support will discourage people who want a single graph database for transactions and analysis.

Amazon Neptune’s wide range of capacities (up to 16x size and 16 instances), high maximum transaction rate (100,000 queries per second), and pay-as-you-go pricing should fit most business requirements. However, its lack of on-premises options may preclude its use at companies with policies against putting their sensitive data in the cloud and companies that prefer incurring and writing down capital expenses to ongoing operating expenses.

Cost: $0.348 to $5.568 per Instance-Hour depending on instance size, $0.10 per GB-month for storage, $0.20 per million I/O requests, $0.023 per GB-month for backup storage, $0.12 per GB for outbound data transfer; various price breaks apply. 

Platform: Amazon Web Services; supports Apache TinkerPop Gremlin or SPARQL queries; access through AWS console, HTTP REST, and Java, as well as RDF4J Workbench (SPARQL) and Python, .Net, and Node.js (Gremlin). 


Martin Heller is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from his office in Andover, Massachusetts, from 1986 to 2010. More recently, he has served as VP of technology and education at Alpha Software and chairman and CEO at Tubifi.
