by Andrew Mills

Don’t make Apache Kafka your database

feature
Dec 05, 2023 | 6 mins
Analytics | Data Management | Databases

Event-driven architectures are wonderful. But Kafka was never intended to be a database, and using it as a database won’t solve your problem.

Credit: freedomnaruk / Shutterstock

It’s a tale as old as time. An enterprise is struggling against the performance and scalability limitations of its incumbent relational database. Teams tasked with finding a newer solution land on an event-driven architecture, take one look at Apache Kafka, and say, “Aha! Here’s our new database solution.” It’s fast. It’s scalable. It’s highly available. It’s the superhero they hoped for!

Those teams set up Kafka as their database and expect it to serve as their single source of truth, storing and fetching all the data they could ever need. Except, that’s when the problems begin. The core issue is that Kafka isn’t actually a database, and using it as a database won’t solve the scalability and performance issues they’re experiencing.

What is and isn’t a database?

When developers conceptualize a database, they generally think of a data store with tables and secondary indexes, like most SQL and NoSQL solutions. Another traditional requirement is ACID compliance: atomicity, consistency, isolation, and durability. However, the traditional thinking around what is or isn’t a database is challenged regularly. For example, Redis has no tables, RocksDB has no secondary indexes, and neither is ACID compliant, yet both are commonly referred to as databases. Similarly, Apache Cassandra is known as a NoSQL database, but it is not ACID compliant.

I draw the line at Kafka, which I will argue is not a database and, for the most part, should not be used as one. I’d venture to say the open-source Kafka community at large holds the same perspective.

Kafka doesn’t have a query language. You can access specific records for a specific time frame, but you’re accessing a write-ahead log. Kafka does have offsets and topics, but they aren’t a substitute for indexes and tables. Crucially, Kafka isn’t ACID compliant. Although it’s possible to use Kafka as a data store or to create your own version of a database, Kafka isn’t a database in and of itself.
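To make that concrete, here is a minimal sketch in Java (the topic name, partition, and broker address are placeholders) of what “querying” Kafka actually looks like: you can translate a timestamp into an offset and scan the log forward from there, but there is no predicate, no index, and no query planner.

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

public class TimeRangeScan {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "time-range-scan");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        long oneHourAgo = System.currentTimeMillis() - 3_600_000L;

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0); // hypothetical topic
            consumer.assign(List.of(partition));

            // Translate a timestamp into the nearest offset, then seek to it.
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Map.of(partition, oneHourAgo));
            OffsetAndTimestamp start = offsets.get(partition);
            if (start != null) {
                consumer.seek(partition, start.offset());
                // From here on it is a sequential scan of the log, not a query.
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofSeconds(5))) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```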

That raises the question: does it ever make sense to use Kafka as a database anyway? Does your use case demand it? Do you have the expertise to absorb the mounting technical debt of forcing Kafka to act like a database over the long term? For most users and use cases, my answer is a firm no.

Kafka is best as a team player

Selecting the right technology for, well, any use case comes down to matching a solution to the problem you’re trying to solve. Kafka is intended to function as a distributed event streaming platform, full stop. While it can technically be used as a long-term data store, doing so means major tradeoffs when it comes to accessing that data. Tools in Kafka’s ecosystem like ksqlDB can make Kafka feel more like a database, but that approach holds up only for small to medium-scale use cases. Most enterprises that choose to implement Apache Kafka have high-velocity data, and ksqlDB doesn’t keep up with their needs.

The right strategy is to let Kafka do what it does best, namely ingest and distribute your events in a fast and reliable way. For example, consider an ecommerce website with an API that would traditionally save all data directly to a relational database with massive tables, resulting in poor performance, scalability, and availability. By introducing Kafka, we can design a superior event-driven ecosystem in which the API instead pushes that data to Kafka as events.
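As a sketch of that API-side change (the topic, broker address, and field names are hypothetical), the endpoint publishes each order to Kafka instead of inserting it into a table, keyed by customer ID so related events stay in order on the same partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.*;

public class OrderEventPublisher {
    private final KafkaProducer<String, String> producer;

    public OrderEventPublisher() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Called by the API handler instead of a direct database INSERT.
    public void publishOrder(String customerId, String orderJson) {
        producer.send(new ProducerRecord<>("orders", customerId, orderJson),
                (metadata, exception) -> {
                    if (exception != null) {
                        // In a real system: retry, dead-letter, or alert.
                        exception.printStackTrace();
                    }
                });
    }
}
```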

This event-driven approach decouples processing into independent components. One event might carry customer data, another order data, and so on, enabling multiple jobs to process events simultaneously and independently. This approach is the next evolution in enterprise architecture: we’ve gone from monoliths to microservices and now to event-driven architecture, which reaps many of the same benefits as microservices with higher availability and greater speed.
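A minimal illustration of that independence, assuming the same hypothetical orders topic: two jobs consume the full stream under different consumer group IDs, so each keeps its own position in the log and neither blocks the other.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IndependentJob {
    public static void main(String[] args) {
        // Run once as "billing-job" and once as "shipping-job":
        // each group gets its own cursor into the same log.
        String groupId = args.length > 0 ? args[0] : "billing-job";

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("[%s] processing %s%n", groupId, record.value());
                }
            }
        }
    }
}
```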

Once events are sitting in Kafka, you have tremendous flexibility in what you do with them. If it makes sense for the raw events to be stored in a relational database, use an ecosystem tool like Kafka Connect to make that easy. Relational databases are still a critical tool in modern enterprise architecture, especially when you consider the advantages of working with familiar tools and a mature ecosystem. Kafka isn’t a replacement for the tools we know and love. It simply enables us to handle the massive influx of data we’re seeing.
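For instance, a sink connector configuration along these lines (this sketch uses the Confluent JDBC sink connector; the connector name, topic, database URL, and credentials are placeholders) is enough to stream raw events from a topic into a relational table with no custom code:

```json
{
  "name": "orders-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "orders",
    "connection.url": "jdbc:postgresql://localhost:5432/ecommerce",
    "connection.user": "kafka_connect",
    "connection.password": "change-me",
    "insert.mode": "insert",
    "auto.create": "true"
  }
}
```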

Pluggable and versatile, but not a database

Kafka provides its greatest value in enabling use cases such as data aggregation and real-time metrics. Using Kafka and Apache ecosystem tools like Spark, Flink, or KStreams, developers can perform aggregations and transformations of streaming data and then push that data to the desired database. Some of these tools can also aggregate data in a time-series or windowed fashion and push it to a reporting engine for real-time metrics.
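As one sketch of that pattern using Kafka Streams (the topic names are hypothetical), the code below counts orders per customer in one-minute tumbling windows and emits the results to a topic that a reporting engine could read for real-time metrics:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class OrderMetrics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-metrics");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String())) // key = customer ID
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1))) // tumbling 1-minute windows
               .count()
               .toStream()
               // Flatten the windowed key into a plain string for the output topic.
               .map((windowedKey, count) -> KeyValue.pair(
                       windowedKey.key() + "@" + windowedKey.window().startTime(),
                       count.toString()))
               .to("order-counts-per-minute", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```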

If developers wish to save certain data to a cache, perhaps to support a website or a CRM system, it’s simple to tap into the Kafka data stream and push data to Redis or a compacted Kafka topic. Data streaming from Kafka allows teams to add various components as they see fit without worrying about any degradation in service, because Kafka is so gosh-darn scalable, reliable, and available. That includes feeding data into any data store, whether that’s Apache Cassandra, big data platforms, data lakes, or almost any other option.
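A minimal sketch of that caching pattern, assuming the Jedis client for Redis and hypothetical topic and key names: a small consumer keeps Redis up to date with the latest event per customer, so the website or CRM reads from the cache rather than from Kafka itself.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import redis.clients.jedis.Jedis;

public class CustomerCacheLoader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "customer-cache-loader");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Jedis redis = new Jedis("localhost", 6379)) { // assumption: local Redis
            consumer.subscribe(List.of("customers")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofSeconds(1))) {
                    // Latest event per customer wins, mirroring a compacted topic.
                    redis.set("customer:" + record.key(), record.value());
                }
            }
        }
    }
}
```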

If data is the lifeblood of a modern enterprise, Kafka should be the heart of your data ecosystem. With Kafka, users can pipe data wherever it needs to go. In this way, Kafka is complementary to your database, but should not be your database. The right prescription for Kafka should include the direction “use as intended,” meaning as a powerful message broker and the central data pipeline of your organization.

Andrew Mills is a senior solutions architect at Instaclustr, part of Spot by NetApp, which provides a managed platform and support around open-source technologies. In 2016 Andrew began his data streaming journey, developing deep, specialized knowledge of Apache Kafka and the surrounding ecosystem. He has designed and implemented several big data pipelines with Kafka at the core.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.