by Andrew Mills

How to size and scale Apache Kafka, without tears

feature
Oct 17, 2023 | 7 mins
Analytics | Data Management | Databases

The first step to sizing or scaling Kafka for optimal cost and performance is understanding how the data streaming platform uses resources. Here’s a primer.

Teams implementing Apache Kafka, or expanding their use of the powerful open source distributed event streaming platform, often need help understanding how to correctly size and scale Kafka resources for their needs. It can be tricky.

Whether you are considering cloud or on-prem hardware resources, understanding how your Kafka cluster will utilize CPU, RAM, and storage (and knowing what best practices to follow) will put you in a much better position to get sizing correct right out of the gate. The result will be an optimized balance between cost and performance.

Let’s take a look at how Kafka uses resources, walk through an instructive use case, and review best practices for optimizing Kafka deployments.

How Kafka uses CPU

Generally speaking, Apache Kafka is light on CPU utilization. When choosing infrastructure, I lean towards having more cores over faster ones to increase the level of parallelization. A number of factors contribute to how much CPU is used; chief among them are SSL authentication and log compression. Other considerations are the number of partitions each broker owns, how much data is going to disk, the number of Kafka consumers, and how close to real time those consumers are. If your data consumers are fetching old data, it’s going to cost CPU time to grab the data from disk. We’ll dive more into that in the next section.
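To make these drivers concrete, here is a minimal sketch of a producer configured with the two settings that tend to dominate CPU cost, TLS and compression, using the confluent-kafka Python client. The broker address, certificate paths, and topic name are placeholders.

```python
# Minimal sketch: a producer configured with the two settings that most
# affect client and broker CPU -- TLS and compression. The broker address,
# certificate paths, and topic name are placeholders.
from confluent_kafka import Producer

conf = {
    "bootstrap.servers": "broker-1:9093",              # placeholder broker
    "security.protocol": "SSL",                        # TLS adds CPU cost on both ends
    "ssl.ca.location": "/etc/kafka/ca.pem",            # placeholder paths
    "ssl.certificate.location": "/etc/kafka/client.pem",
    "ssl.key.location": "/etc/kafka/client.key",
    "compression.type": "lz4",                         # compression trades CPU for bandwidth and storage
}

producer = Producer(conf)
producer.produce("events", value=b"hello")             # placeholder topic
producer.flush()
```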

Understanding these fundamental drivers behind CPU usage is essential to helping teams size their available CPU power correctly.

How Kafka uses RAM

RAM requirements are mostly driven by how much “hot” data needs to be kept in memory and available for rapid access. Once a message is received, Kafka hands the data off to the underlying OS’s page cache, which handles saving it to disk.

From a sizing and scalability perspective, the right amount of RAM depends on the data access patterns for your use case. If your team deploys Kafka as a real-time data stream (using transformations and exposing data that consumers will pull within a few seconds), RAM requirements are generally low because only a few seconds of data need to be stored in memory. Alternatively, if your Kafka consumers need to pull minutes or hours of data, then you will need to consider how much data you want available in RAM.
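A rough way to think about it: the “hot set” you want resident in the page cache is approximately your ingress rate multiplied by how far behind your slowest consumer is allowed to fall. The numbers below are illustrative assumptions, not figures from a real deployment.

```python
# Rough hot-set estimate: how much data should stay in the page cache so that
# consumers never have to read from disk. Both inputs are illustrative assumptions.
ingress_gb_per_sec = 1.0        # assumed average ingress across the cluster
consumer_lag_seconds = 300      # assumed worst-case consumer lag (5 minutes)

hot_set_gb = ingress_gb_per_sec * consumer_lag_seconds
print(f"Hot data to keep in RAM: ~{hot_set_gb:.0f} GB")   # ~300 GB across the cluster
```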

The relationship between CPU and RAM utilization is important. If Kafka can access data sitting in RAM, it doesn’t have to spend CPU resources to fetch that data from disk. If the data isn’t available in RAM, the brokers will pull it from disk, spending CPU resources and adding latency to data delivery. Teams implementing Kafka should account for that relationship when sizing CPU and RAM resources.

How Kafka uses storage

Several factors impact Kafka storage needs, like retention times, data transformations, and the replication factor in place. Consider this example: Several terabytes of data land on a Kafka topic each day, six transformations are performed on that data using Kafka to keep the intermediary data, each topic keeps data for three days, and the replication factor is set to 3. It’s easy to see that teams could quickly double, triple, or quadruple stored data needs based on how they use Kafka. You need a good understanding of those factors to size storage correctly.
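A back-of-the-envelope calculation makes the multiplication obvious. The raw daily volume and the relative size of each intermediate topic below are assumptions chosen only to illustrate the effect.

```python
# Back-of-the-envelope storage estimate for the scenario above.
# The raw daily volume and the size of each intermediate topic are assumptions;
# the text only says "several terabytes" and six transformations.
raw_tb_per_day = 3                  # assumed raw ingress
intermediate_topics = 6             # transformations that keep intermediary data
intermediate_ratio = 1.0            # assume each intermediate topic is roughly raw-sized
retention_days = 3
replication_factor = 3

logical_tb = raw_tb_per_day * (1 + intermediate_topics * intermediate_ratio) * retention_days
physical_tb = logical_tb * replication_factor
print(f"Logical data retained: {logical_tb:.0f} TB")    # 63 TB
print(f"Physical disk needed:  {physical_tb:.0f} TB")   # 189 TB
```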

Kafka sizing example

Here’s a real example from our work helping a services provider in the media entertainment industry to correctly size an on-prem Kafka deployment. This business’s peak throughput ingress is 10GB per second. The organization needs to store 10% of its data (amounting to 9TB per day) and retain that data for 30 days. Looking at replication, the business will store three copies of that data, for a total storage requirement of 810TB. To account for potential spikes, it’s wise to add 30-40% headroom on top of that expected requirement—meaning that the organization should have 1.2PB storage available. They don’t use SSL and most of their consumers require real-time data, so CPU and RAM requirements are not as important as storage. They do have a few batch processes that run, but latency isn’t a concern so it’s safe for the data to come from disk.
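The arithmetic behind those figures is straightforward to reproduce. The headroom percentage below is set at the top of the suggested 30-40% range.

```python
# Reproducing the sizing math from the example above.
stored_tb_per_day = 9       # the 10% of ingress that must be retained
retention_days = 30
replication_factor = 3
headroom = 0.40             # 30-40% spike headroom; 40% used here

base_tb = stored_tb_per_day * retention_days * replication_factor    # 810 TB
with_headroom_tb = base_tb * (1 + headroom)                          # ~1,134 TB, in line with ~1.2 PB
print(f"Base requirement: {base_tb} TB")
print(f"With headroom:    {with_headroom_tb:.0f} TB")
```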

While this particular use case is still being built out, the example demonstrates the process of calculating minimum effective sizing for a given Kafka implementation using basic data, and then exploring the potential needs of scaled-up scenarios from there.

Kafka sizing best practices

Knowing the specific architecture of a given use case—topic design, message size, message volume, data access patterns, consumer counts, etc.—increases the accuracy of sizing projections. When considering an appropriate storage density per broker, think about the time it would take to re-stream data during partition reassignment due to a hot spot or broker loss. If you attach 100TB to a Kafka broker and it fails, you’re re-streaming massive quantities of data. This could lead to network saturation, which would impede ingress or egress traffic and cause your producers to fail. There are ways to throttle the re-stream, but they significantly increase your mean time to recovery.
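To get a feel for the exposure, here is a rough estimate of how long re-streaming 100TB could take. The link speed and the fraction of bandwidth left for recovery are assumptions; real recovery time also depends on throttles, partition counts, and disk throughput.

```python
# Rough re-stream time if a broker holding 100 TB is lost.
# Link speed and the share of bandwidth available for recovery are assumptions.
broker_data_tb = 100
link_gbits_per_sec = 10          # assumed NIC speed
usable_fraction = 0.5            # assume half the link is left for normal traffic

data_gbits = broker_data_tb * 1000 * 8
seconds = data_gbits / (link_gbits_per_sec * usable_fraction)
print(f"Re-stream time: ~{seconds / 3600:.1f} hours")   # ~44 hours under these assumptions
```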

A common misconception

More vendors are now offering proprietary tiered storage for Kafka and pushing Kafka as a database or data lake. Kafka is not a database. While you can use Kafka for long-term storage, you must understand the tradeoffs (which I’ll discuss in a future post). The evolution from Kafka as a real-time data streaming engine to serving as a database or data lake falls into a familiar pattern. Purpose-built technologies, designed for specific use cases, sometimes become a hammer for certain users and then every problem looks like a nail. These users will try to modify the purpose-built tool to fit their use case instead of looking at other technologies that solve the problem already.

This reminds me of when the Apache Cassandra project realized that users coming from the relational world were struggling to understand how important data modeling is when data lives in flat rows. Users were not used to working out their access patterns before they started storing data; they would just slap another index on an existing table. In Cassandra 3.0, the project exposed materialized views, similar in spirit to indexing relational tables but implemented differently. Since then, the feature has been riddled with issues and marked as experimental. I feel the idea of Kafka as a database or data lake is doomed to a similar fate.

Find the right size for optimal cost and Kafka performance

Teams that rush into Kafka implementations without first understanding Kafka resource utilization often encounter issues and roadblocks that teach them the hard way. By taking the time to understand Kafka’s resource needs, teams will realize more efficient costs and performance, and they will be well-positioned to support their applications far more effectively.

Andrew Mills is a senior solutions architect at Instaclustr, part of Spot by NetApp, which provides a managed platform around open source technologies. In 2016 Andrew began his data streaming journey, developing deep, specialized knowledge of Apache Kafka and the surrounding ecosystem. He has architected and implemented several big data pipelines with Kafka at the core.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.