andrew_oliver
Columnist

Untangle your relationships with graph analysis

analysis
Sep 15, 2016 | 4 mins
Analytics | Databases | NoSQL Databases

Sure, there are graph databases like Neo4j, but graph analysis or graph search may be more useful, depending on the sorts of data relationships you need to explore.

Graph processing is hot right now in anomaly and fraud detection, recommenders, social network analysis, graph search, and various forms of access control.

Why those specific applications? Because some data lends itself better to graph processing — which treats relationships between data elements as objects in their own right — than other data. Also, different graph processing implementations offer different features.

Graph databases

One way to do graph analysis is to use a graph database. Neo4j, considered the quintessential graph database, is written in Java and offers features found in both analytical and operational databases. But unlike most NoSQL databases, Neo4j offers strong transactional support.

But most graph databases have limited scalability. Sharding and other common scalability measures are difficult because it’s hard to slice a graph in half without a performance penalty. In other words, if I’m traversing nodes to execute what’s called a “friend of a friend” query, I can’t easily hop between nodes that live on different machines without adding a fairly large layer of computational complexity (that is, something like MapReduce). And the graph structure, as databases like Neo4j know it, doesn’t really support true distributed computing.
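To make the sharding problem concrete, here is a minimal sketch in plain Python. The graph, the names, and the shard assignment are all hypothetical, and the code only counts how many hops of a friend-of-a-friend query would have to cross between machines; it does not model any real database.

```python
# Hypothetical toy social graph: person -> set of friends (undirected).
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "eve"},
    "dan": {"bob"},
    "eve": {"cat"},
}

# Hypothetical shard assignment: which machine stores each person.
SHARD = {"ann": 0, "bob": 0, "cat": 1, "dan": 1, "eve": 0}


def friends_of_friends(person):
    """Return (people exactly two hops away, number of hops that crossed shards)."""
    direct = FRIENDS[person]
    result = set()
    cross_shard_hops = 0
    for friend in direct:
        # First hop: person -> friend.
        if SHARD[friend] != SHARD[person]:
            cross_shard_hops += 1
        for fof in FRIENDS[friend]:
            # Second hop: friend -> friend of friend.
            if SHARD[fof] != SHARD[friend]:
                cross_shard_hops += 1
            # Skip the starting person and anyone who is already a direct friend.
            if fof != person and fof not in direct:
                result.add(fof)
    return result, cross_shard_hops


if __name__ == "__main__":
    people, hops = friends_of_friends("ann")
    print(sorted(people), "cross-shard hops:", hops)
```

On a single machine every hop is a cheap in-memory lookup. Once the graph is split, each cross-shard hop stands in for a network round trip, and no way of cutting a well-connected graph avoids them entirely.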

Titan, a more modern graph database, essentially uses a distributed column family store as a back end and draws a graph on top of it. This supports both transactions and splitting the data up for distributed computing (while keeping related data local for efficiency).

Use a graph database when you need to maintain strong integrity, including the actual relationship.

By the way, relational databases don’t do this efficiently for this type of data; they require either table or range locks. You can use a relational database for graph processing, just as you can use an OLTP relational database as a data warehouse, but it won’t hold up as well. You’re better off using a graph database to maintain the integrity of the graph and performing your graph analysis elsewhere, unless the analysis is fairly lightweight.
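For a sense of what graph processing in a relational database looks like, here is a small sketch using Python’s built-in sqlite3 module. The friendship table and its contents are hypothetical. The point is only that a two-hop query already needs a self-join, and each additional hop would need another one.

```python
import sqlite3

# Hypothetical schema: one row per direction of each friendship.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friendship (person TEXT, friend TEXT)")

edges = [("ann", "bob"), ("ann", "cat"), ("bob", "dan"), ("cat", "eve")]
both_directions = edges + [(b, a) for a, b in edges]
conn.executemany("INSERT INTO friendship VALUES (?, ?)", both_directions)


def friends_of_friends_sql(person):
    """Two-hop query: join the edge table to itself once per hop."""
    rows = conn.execute(
        """
        SELECT DISTINCT f2.friend
        FROM friendship AS f1
        JOIN friendship AS f2 ON f2.person = f1.friend
        WHERE f1.person = ?
          AND f2.friend <> ?
          AND f2.friend NOT IN (
              SELECT friend FROM friendship WHERE person = ?
          )
        ORDER BY f2.friend
        """,
        (person, person, person),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(friends_of_friends_sql("ann"))
```

This works fine for a toy table, which is the “lightweight” case. Deeper or variable-length traversals mean more joins or recursive queries, which is where a purpose-built graph store starts to earn its keep.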

Graph analysis

There are a few frameworks for graph analysis, the most famous of which is Spark’s GraphX, an OLAP framework. When you’re trying to traverse all the nodes in a graph to implement a PageRank algorithm, run a social graph analysis, or look at a power line network, you’ll want a graph analysis framework like GraphX.
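As an illustration of the “touch every node” style of computation, here is a minimal single-machine PageRank sketch in plain Python. It is not GraphX code and does not use any GraphX API; the link data, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of outgoing links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly across its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page web.
    toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(toy_links).items()):
        print(page, round(score, 4))
```

Every iteration visits every node and every edge, which is exactly the workload an OLAP-style framework distributes across a cluster and an operational graph database is not built for.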

You may end up combining a graph database with a graph analysis framework. In fact, you probably will at some point. In that case, you might be interested in Apache TinkerPop, which supports multiple graph back ends with both OLAP and OLTP computing models. There are discussions on integrating GraphX, but for now, you write to one or the other. TinkerPop’s advantage is that it has an actual graph query language implementation, Gremlin.

If you’re doing a full-text search or some sort of a search application and find there are graph characteristics to it, then you might want to look more toward a search engine that supports graph queries/traversals. With Solr 6.x-6.2 you have not only graph queries but graph queries that can return streams of results. (Disclosure: I work for Lucidworks, which is the main corporate backer of Apache Solr.)

If you’re, say, looking for instances of fraud in a corpus of emails, you might find this more useful because you’ll have the power of full-text search and graph traversals together. A full-on graph database or graph analytics framework might be too much graph and not enough search, whereas graph functions in a search engine might hit the spot.
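The combination of text matching and traversal can be sketched in a few lines of plain Python. This is not Solr syntax and does not call Solr; the email records, the field names, and the helper functions are hypothetical stand-ins for a search step followed by a graph walk over the matching documents.

```python
from collections import deque

# Hypothetical email corpus.
EMAILS = [
    {"sender": "ann", "recipient": "bob", "body": "please wire the funds offshore today"},
    {"sender": "bob", "recipient": "cat", "body": "offshore account is ready"},
    {"sender": "cat", "recipient": "dan", "body": "lunch on friday?"},
    {"sender": "eve", "recipient": "ann", "body": "quarterly report attached"},
]


def search(term):
    """Naive full-text step: emails whose body contains the term as a word."""
    return [mail for mail in EMAILS if term in mail["body"].lower().split()]


def reachable_via(term, start, max_hops=3):
    """People reachable from `start` by following only emails that match `term`."""
    # Build sender -> recipients edges from the matching emails only.
    edges = {}
    for mail in search(term):
        edges.setdefault(mail["sender"], set()).add(mail["recipient"])

    # Breadth-first traversal, bounded by max_hops.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in edges.get(person, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    seen.discard(start)
    return seen


if __name__ == "__main__":
    print(sorted(reachable_via("offshore", "ann")))
```

The text filter decides which edges exist at all, and the traversal then follows only those. That is the shape of the problem where a search engine with graph features is a more natural fit than a dedicated graph store.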

Put graphs to work

If you’re focused on relationships — from social graphs to figuring out Trump’s international web of financiers — graph processing is for you.

A graph database (at this point I’m going to say use Titan as opposed to Neo4j unless you know you’re going to stay small) helps you keep things consistent or start to shape a graph for later analysis, but it might not be sufficient in itself for large-scale graph analysis, which might call for GraphX. If you’re doing something that tastes more like search with graph characteristics or a lot of text-matching is involved, a search engine with graph features might be a better fit.

In any case, get in on this open source distributed graph computing goodness, or I don’t even want to be a friend of your friend (this is graph humor, you’ll get used to it).