Implementing a high-availability Kafka cluster
High availability is a crucial factor in providing a good customer experience. The challenge is to build a robust, highly available infrastructure without multiplying complexity or dramatically increasing costs.
Apache Kafka is a horizontally scalable, high-availability, fault-tolerant, low-latency, high-throughput event streaming platform. It handles data streams from multiple sources and delivers them wherever the data is needed.
In this article we will design a high-availability Kafka cluster and look at how it behaves during maintenance and during the total failure of a node.
At its simplest, Kafka’s architecture consists of a single broker with its producers and consumers.
While such a cluster can support basic use cases, it is too simple for most production scenarios.
Kafka typically runs on a cluster of three or more brokers that can span multiple data centers or cloud regions.
This architecture supports the needs of scalability, consistency, availability, partition tolerance, and performance.
Like any engineering endeavor, there are trade-offs to be made between these qualities.
According to the CAP theorem, a distributed system can deliver only two of the three desired characteristics: consistency (‘C’), availability (‘A’), and partition tolerance (‘P’).
In this article, we will implement a Kafka cluster that:
- Prefers availability over consistency, a trade-off you might want to make for a use case like collecting real-time metrics, where in the event of a failure, being able to keep writing new data matters more than losing some historical data points.
- Chooses simplicity over other non-functional requirements (e.g., security, performance, efficiency, etc.) to focus on high availability.
- Assumes that maintenance windows and unplanned broker outages are more likely than a complete infrastructure failure.
With these goals in mind, let’s first discuss a typical high-availability Kafka cluster.
1 — Kafka partitions and replication factor
In Kafka, messages are categorized into topics, and each topic has a unique name across the cluster.
Topics are divided into partitions, each of which can reside on a separate node in the Kafka cluster.
In other words, the messages of a single topic may be spread across different brokers, but all messages of a single partition reside on the same node.
This design allows for topic parallelization, scalability, and high message throughput.
But there is more.
Topics are configured with a replication factor, which determines the number of copies for each partition.
If a cluster has a single topic with one partition, a replication factor of three means that there are three copies (replicas) of that partition.
All replicas of a partition exist on separate brokers, so you can’t have more partition copies than nodes in the cluster.
In the previous example, with a replication factor of three, you should expect at least three nodes in your Kafka cluster.
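To make this concrete, here is a minimal sketch that creates such a topic with the Java AdminClient; the topic name, partition count, and bootstrap address are illustrative assumptions, not values from this article.

```java
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        // Placeholder bootstrap address; point it at your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // One partition with three copies (replication factor 3):
            // the cluster therefore needs at least three brokers.
            NewTopic topic = new NewTopic("events", 1, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```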
But how does Kafka keep these copies synchronized?
The replicas of a partition are organized into a leader and followers: the partition leader handles all writes and reads, while followers exist purely for failover.
A follower can be either in sync with the leader (containing all messages from the partition leader except messages within a small buffer window) or out of sync.
The set of all replicas in sync is called in-sync replicas (ISR).
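To see which replica currently leads a partition and which replicas are in sync, a small sketch using the AdminClient’s describeTopics call (reusing the assumed topic name from the previous example) might look like this:

```java
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class ShowPartitionState {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description =
                    admin.describeTopics(List.of("events")).all().get().get("events");

            // For each partition, print the current leader and the in-sync replicas (ISR).
            description.partitions().forEach(p ->
                    System.out.printf("partition %d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}
```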
These are the foundations of Kafka and replication.
Let’s see what happens when it breaks.
2 — Understanding broker interruptions
Let’s imagine that the Kafka cluster has three brokers and a replication factor of 1.
There is a single topic in the cluster with a single partition.
When the broker hosting that partition becomes unavailable, the partition is unavailable too, and the cluster cannot serve its consumers or producers.
Let’s change this by setting the replication factor to 3.
In this scenario, each broker holds a copy of the partition.
What happens when a broker becomes unavailable?
If the partition has additional in-sync replicas, one of them becomes the partition leader.
The cluster can operate normally and there is no downtime for consumers or producers.
What about when there are partition copies but they are not synchronized?
In this case, there are two options:
- Wait for the partition leader to come back online, sacrificing availability.
- Allow an out-of-sync replica to become the partition leader, sacrificing consistency (Kafka controls this with the unclean.leader.election.enable setting, sketched below).
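Since the cluster in this article prefers availability, the following sketch enables unclean leader election at the topic level; the topic name and bootstrap address are again assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class PreferAvailability {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");

            // Let an out-of-sync replica become leader when no in-sync replica is left:
            // the partition stays writable, but messages that never reached the new
            // leader are lost (availability is preferred over consistency).
            AlterConfigOp enableUnclean = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "true"),
                    AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(topic, List.of(enableUnclean))).all().get();
        }
    }
}
```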
Now that we’ve discussed some failure scenarios, let’s look at how to mitigate them.
3 — Requirements to mitigate common failures
You’ve probably noticed that a partition must have an extra in-sync replica (ISR) available to survive the loss of the partition leader.
So, can the cluster get away with only two brokers and a minimum in-sync replicas (min.insync.replicas) setting of 2?
No.
If you have only two replicas and you lose a broker, the in-sync replica set shrinks to 1, which is below the required minimum of 2, and producers requiring full acknowledgements (acks=all) can no longer write.
Therefore, the number of brokers must be greater than the minimum in-sync replicas setting (i.e., at least 3).
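The minimum in-sync replicas setting only affects producers that ask for the strongest acknowledgement. Below is a minimal producer sketch with acks=all; the broker addresses, topic, key, and value are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class QuorumProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker addresses; one entry per broker is enough to bootstrap.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker-1:9092,broker-2:9092,broker-3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: a write is acknowledged only after every in-sync replica has it.
        // Combined with min.insync.replicas=2, the send fails fast when the ISR
        // shrinks below two instead of silently weakening durability.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "sensor-1", "42"));
            producer.flush();
        }
    }
}
```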
But where should you place these brokers?
Wherever you end up hosting the Kafka cluster, it is good practice to spread the brokers across fault domains such as regions, availability zones, and nodes.
Therefore, if you want to design a Kafka cluster that can tolerate failure, consider the following requirements:
- A minimum in-sync replicas (min.insync.replicas) setting of 2.
- A replication factor of 3 for topics.
- At least 3 Kafka brokers, each running on a different node.
- Brokers spread across three availability zones (a verification sketch follows below).
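As a quick sanity check on the last two requirements, assuming each broker advertises its availability zone via the broker.rack setting, you can list the brokers and their racks with describeCluster:

```java
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class ShowBrokerPlacement {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // node.rack() reflects each broker's broker.rack setting, which is
            // conventionally set to the availability zone the broker runs in.
            admin.describeCluster().nodes().get().forEach(node ->
                    System.out.printf("broker %d host=%s rack=%s%n",
                            node.id(), node.host(), node.rack()));
        }
    }
}
```

When broker.rack is set, Kafka’s rack-aware replica assignment also tries to place the copies of each partition in different zones, which is exactly what the requirements above ask for.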
In part 2 of the article, we will create a Kafka cluster and then stop brokers to validate these assumptions.
Thank you.