Seamless Transition: Migrating Kafka Cluster to Kubernetes

Frankie
Zendesk Engineering
13 min read · Feb 13, 2024


“Seamless Transition: Migrating Kafka Cluster to Kubernetes” Digital Illustration created by OpenAI’s DALL-E, 2023.

Background

At Zendesk, we manage our own Kafka infrastructure. When we started, no sophisticated managed Kafka service was available on the market. Therefore, we built the Kafka infrastructure via Chef and deployed it on AWS EC2 Instances.

Kafka Infrastructure on VM

A Kafka cluster consists of two parts: a set of Kafka brokers, and Zookeeper nodes. The broker is a server in the Kafka cluster that stores and manages data, handling the publishing and subscribing of messages. The Zookeeper cluster is a quorum that coordinates and manages Kafka brokers, handling tasks like leader election, cluster membership, and maintaining metadata.

We have been moving from the VM-based infrastructure to a Kubernetes (K8s)-based one. With support for the Chef stack winding down, we needed to migrate the Kafka infrastructure as well.

Seamless Migration Definition

Seamless migration for a Kafka cluster involves a transparent, zero-downtime process that maintains data integrity and consistency, requiring minimal changes from clients and minimal manual intervention.

It includes replicating configuration and data from the old cluster, transitioning clients to the new cluster, and ensuring operational continuity and performance equivalence.

The complexity of a seamless migration directly correlates with the scale of the cluster and its clients. Migrating a cluster that is infrequently used or predominantly handles offline workloads can be relatively straightforward. However, migrating a cluster that handles extensive real-time workloads and a significant volume of data is a much more challenging project.

At Zendesk, our Kafka clusters fall into the second category. The Kafka cluster manages a substantial volume of data and is utilized by over 100 services.

Kafka Cluster state in detail:

  • 12 Kafka clusters
  • Just below 100 brokers across all the environments
  • 300 TB data in storage
  • 30 Billion messages per day
  • more than 1000 topics
  • more than 100 services connected to it

Major Challenges

Migrating such a large and critical cluster requires a thoughtfully crafted design. Of the numerous details that require attention, the major challenges to address are:

  • ensuring data integrity and cluster availability during the replication phase, as well as
  • facilitating a seamless transition for clients during the switch-over phase.

It is important to understand that there is no distinct cutover moment when all the data is completely replicated, followed by a mass transition of clients to the new cluster. Due to the seamless migration requirement, the switch-over phase has to accommodate continuous data replication and client transition simultaneously.

Migration Design

Unidirectional Cluster-to-Cluster Migration

The first version of the design is a unidirectional cluster-to-cluster migration. This method uses MirrorMaker2 to replicate data and configurations from the old cluster to the new one. After MirrorMaker2 transfers the initial load of data between the clusters and continues to keep them in sync, each client executes its cutover process, which involves disconnecting from the old cluster and connecting to the new one.

unidirectional configuration and data replication via Mirror Maker 2
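
For context, a minimal sketch of what such a one-way MirrorMaker2 setup might look like; the cluster aliases, bootstrap addresses, and topic filter below are illustrative, not our production configuration:

# Write a MirrorMaker2 config replicating everything from "old" to "new"
cat > mm2.properties <<'EOF'
clusters = old, new
old.bootstrap.servers = kafka-old.example.internal:9092
new.bootstrap.servers = kafka-new.example.internal:9092
old->new.enabled = true
old->new.topics = .*
sync.topic.configs.enabled = true
EOF

# Run the dedicated MirrorMaker2 driver with that configuration
connect-mirror-maker.sh mm2.properties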

Problem: it shifts the migration complexity to the clients

While this approach initially seems straightforward and robust, it shifts the migration's complexity onto each team managing a Kafka client. Every client has to develop and execute its own cutover strategy, potentially involving multiple deployments depending on its implementation.

For example, in the simplest scenario where a single team controls both the producer and consumer of a topic, the clients could be migrated through the following steps, each requiring code changes and deployments (a rough sketch follows the list).

  • move all consumers to the new cluster
  • stop producing to the old cluster
  • wait until the data is fully synced in the new cluster
  • start the producer pointing to the new cluster
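
To make that overhead concrete, here is a rough sketch of one team's consumer-side cutover; the hostnames, group name, and environment variable are illustrative, not our real endpoints:

# Check that the consumer group has caught up on the old cluster
kafka-consumer-groups.sh \
  --bootstrap-server kafka-old.example.internal:9092 \
  --describe --group my-consumer-group

# Repointing a client means changing its bootstrap servers and redeploying,
# e.g. via configuration or an environment variable the app reads at startup
export KAFKA_BOOTSTRAP_SERVERS="kafka-new.example.internal:9092"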

In a more complex scenario where various teams control different consumer groups of a topic, a coordinated migration involving all teams is necessary. Additionally, if a consumer also acts as a producer for another topic, the complexity increases, potentially leading to an intricate, multi-layered migration process.

Secondary overheads

There are a few secondary overheads for this particular migration design:

  • Single point of failure: The MirrorMaker2 connect cluster is the single point of failure. Imagine a scenario where all consumers of a topic have successfully migrated to the new cluster, but the final in-flight data fails to sync from the old cluster due to a MirrorMaker2 failure.
  • Indirection in failure detection and failover operation: While a MirrorMaker2 failure is beyond the control of each team, the client failover process is entirely dependent on them, as it requires code changes or redeployment of the previous version.
  • Long tail problem: The full completion of the migration depends on the speed of the slowest client. Due to the complexity mentioned earlier, there might be a few clients that take a substantially long period to complete the migration, slowing down the overall project delivery.
  • Cost: In this phase, we face not only doubled compute and storage costs for brokers but also tripled data transfer costs, i.e. moving data from old brokers to the connect cluster and then to the new brokers.

Cluster-to-cluster migration is achievable, yet we aim for better

This approach is neither a seamless migration from the client’s perspective nor a practical implementation at this scale in terms of operations and cost-effectiveness.

Broker-Level migration within a consolidated Kafka cluster

The major challenge in cluster-to-cluster migration is that clients must point to a different cluster at some point. To ensure data integrity, the timing of the cut-over differs for consumers and producers, respectively.

To address the problem, it might be worthwhile to ask a slightly more aggressive question: could clients be agnostic to the cluster switch-over?

As mentioned earlier in this article, cluster migration involves replicating configurations and data from the old cluster to the new one, as well as transitioning clients.

However, what if we eliminate the client transition from the equation? Instead of building a new cluster, we can construct a new set of brokers on K8s and register them with the existing cluster. By doing so, it becomes unnecessary for clients to change clusters. The only remaining task is to transfer data from the old brokers to the new brokers within this consolidated Kafka cluster.

Indeed! Since the new brokers will be joining the existing cluster, there’s no need for configuration migration. The necessary configurations are already in place within the cluster, simplifying the process.

Both brokers on EC2 and K8s are registered with the same cluster, with data being migrated from the brokers on EC2 to those on K8s.

Why does transferring data across brokers achieve a client switch-over?

This process is driven by the Kafka client interaction mechanism. A Kafka topic is composed of a set of partitions, and each partition can have multiple replicas. In most cases, clients interact with the leader replicas.

Moving topic data from one broker to another essentially involves transferring its replicas between brokers. Once a replica has been fully migrated and elected as the leader replica, the client will detect this change and start interacting with the new leader replica. As a result, it will begin interacting with the new broker.

Leader replica transferred to the new broker so that the client interacts with the new broker.

Following the process described in the graph above, once all replicas are held by the new brokers, it means that all clients are interacting with the new brokers, and the old brokers no longer undertake any important tasks.
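
As a concrete illustration, leadership moving to the new brokers can be observed with a plain topic describe; the topic name and bootstrap address are placeholders:

kafka-topics.sh \
  --bootstrap-server kafka.example.internal:9092 \
  --describe --topic my-topic
# The output lists Leader, Replicas and Isr per partition; once the leader id
# for a partition belongs to a K8s broker, clients of that partition start
# talking to the new broker on their next metadata refresh.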

Let’s specify how this approach works

  • Step 1: Deploy the same number of brokers to K8s and register them with the same Kafka cluster under ZooKeeper. At this stage, both the brokers on EC2 and K8s are members of the same cluster, but the EC2 brokers still hold all the data and handle all client traffic. The K8s brokers are inactive except for maintaining their heartbeat with ZooKeeper.
  • Step 2: Topic partition replicas begin migrating from the brokers on EC2 to those on K8s. Once a replica has fully moved, the K8s brokers may start serving the actual client topics, depending on whether they hold a leader replica.
  • Step 3: All topic partition replicas have been transferred from the brokers on EC2 to the brokers on K8s. The brokers on EC2 no longer hold any data and are not handling client requests.
  • Step 4: Scale down the brokers on EC2, leaving the cluster composed entirely of K8s brokers. The migration is now complete.

Broker-Level migration within a consolidated Kafka cluster
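
For Step 1, a minimal sketch of the configuration a new broker pod needs in order to join the existing cluster; the broker id, hostnames, rack, and listener below are illustrative values rather than our production config:

cat > /etc/kafka/server.properties <<'EOF'
broker.id=31
zookeeper.connect=zk-1.example.internal:2181,zk-2.example.internal:2181,zk-3.example.internal:2181
broker.rack=us-east-1a
advertised.listeners=PLAINTEXT://kafka-31.kafka.svc.cluster.local:9092
EOF
# With the same zookeeper.connect string as the EC2 brokers, the pod registers
# itself as just another member of the existing cluster.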

By following the migration steps mentioned above, we successfully eliminate the need for client transition. Since clients are consistently interacting with one cluster throughout the entire migration, the migration becomes fully seamless from the client’s perspective.

Deep Dive

Now that we have the overall concept, it’s time to tackle the specific setup:

  • Network setup: It is crucial that the two sets of brokers can communicate with each other, and that clients can discover brokers in this hybrid Kafka cluster.
  • Migration implementation and orchestration: The implementation of moving replicas to new brokers is important, but orchestration is even more critical. Effective orchestration should also take disaster recovery scenarios into account, like rollbacks.
  • Metrics and Monitoring: Effectively monitoring the hybrid cluster — when brokers on both EC2 and K8s are handling traffic — is crucial for problem detection, debugging, and performance fine-tuning.
  • Broker disruption management: For high availability, we need to roll brokers safely and prevent brokers in different Availability Zones (AZs) from going offline at the same time, even during migration.

Network Setup for inter-broker and client-broker communication

The critical prerequisite for broker-level migration is that both sets of brokers and the ZooKeeper cluster should be able to communicate with each other with relatively low latency. Additionally, clients should be able to discover both sets of brokers and communicate with them with low latency.

The first requirement is satisfied by our network architecture. Although the brokers on EC2, the ZooKeeper cluster, and the K8s cluster are deployed in different subnets, they are within the same AWS account and the subnets are interconnected.

Allowing clients to discover brokers on both EC2 and K8s requires a bit more work. To understand what needs to be done, we first need to unpack the mechanism of how clients discover brokers.

Client-Broker discovery mechanism

A Kafka client needs to be able to reach almost every broker in order to work correctly, depending on how many partitions the target topic has and how those partition replicas are distributed. A typical flow:

  1. The client picks one of the brokers as the bootstrap broker to fetch metadata about which broker serves what partitions.
  2. The client then establishes connections with those brokers as needed.

Client-Broker discovery model
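
This handshake is easy to observe with a tool like kcat; the bootstrap address is illustrative:

# -L asks the bootstrap broker for full cluster metadata: every broker,
# every topic/partition, and which broker currently leads each partition
kcat -L -b kafka.example.internal:9092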

Therefore, in addition to the reverse proxy defined for the brokers on EC2, a reverse proxy for the brokers on K8s should be made available to clients. Since all brokers are within the same cluster, any broker can provide the full metadata to the client, regardless of whether it is on EC2 or K8s.

register reverse proxy service IP into consul service

HashiCorp Consul is used for service discovery. By registering the new reverse proxy service's IP address into the existing Kafka Consul service, clients can discover both sets of brokers. In K8s, both the advertised listener for each broker and the reverse proxy in front of those advertised listeners can be implemented as K8s Services.
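
As an illustration, a per-broker advertised listener can be backed by a plain K8s Service along the lines of the sketch below; the names, labels, and ports are assumptions for the example:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: kafka-broker-31
  namespace: kafka
spec:
  selector:
    app: kafka
    broker-id: "31"
  ports:
    - name: kafka
      port: 9092
      targetPort: 9092
EOF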

Migration implementation and orchestration

So far, so good: it is feasible to set up a Kafka cluster with interconnected brokers from either EC2 or K8s. Clients can discover and connect to any broker. The next question to address is how replicas can be moved from one broker to another, and in what manner.

The implementation of replica movement can be powered by Cruise Control, which provides rich APIs to manage Kafka clusters at scale. One of the endpoints we can use for moving replicas is remove_broker. This endpoint decommissions a broker by moving all of its replicas to the destination brokers.

curl -X POST "$CRUISE_CONTROL_SERVICE/remove_broker?brokerid=11&concurrent_partition_movements_per_broker=25&destination_broker_ids=31&replication_throttle=100000000"

This is exactly what we need: the request above asks Cruise Control to move all the replicas from broker-11 to broker-31. It also specifies that at most 25 concurrent replica movements are allowed per broker, and that the replication throttle rate is about 100 MB/s.
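
Cruise Control executes the reassignment asynchronously, so progress can be followed through its task and state endpoints, assuming the same $CRUISE_CONTROL_SERVICE base URL as in the example above:

# List recently submitted tasks and their current status
curl "$CRUISE_CONTROL_SERVICE/user_tasks"

# Inspect the overall state, including ongoing inter-broker replica movements
curl "$CRUISE_CONTROL_SERVICE/state"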

Orchestration

It is not ideal to call the Cruise Control endpoint manually: it is neither safe nor efficient. Therefore, we built an orchestration system on top called Kafka-Migration. It’s responsible for:

  • Check Migration Conditions: For example, before executing any replica movement calls, it checks whether the cluster is in a healthy state, and whether the source and destination brokers are in the same Availability Zone (AZ). A simplified sketch of such a pre-flight check is shown after this list.
  • Rich Migration Modes: The system has different modes of migration. The Specific Brokers mode allows for moving data from a specific set of brokers to another. The AZ mode facilitates moving data from existing brokers to destination brokers within the same AZ. The Full-Migration mode manages the complete data movement to the new set of brokers, ensuring correct replica assignment and orchestrating the move on a per-AZ basis. By setting these modes and rules, it guarantees that the replica movement is contained within a safe and manageable scope. Impact is minimised as a result, and response to any incidents during the move can be quick.
  • Fine-Grained Migration Control: It supports not only one-directional migration but also stop, resume, and revert operations. For instance, during an unexpected spike in traffic, we might want to halt the migration to provide more bandwidth for the operational traffic and then resume the migration when the traffic spike subsides.
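
A simplified sketch of the kind of pre-flight check mentioned above, refusing to start a move while the cluster has under-replicated partitions; the bootstrap address is illustrative, and the real system performs this through its own health checks:

# Count partitions that are currently under-replicated anywhere in the cluster
URP_COUNT=$(kafka-topics.sh \
  --bootstrap-server kafka.example.internal:9092 \
  --describe --under-replicated-partitions | wc -l)

if [ "$URP_COUNT" -gt 0 ]; then
  echo "Cluster has under-replicated partitions; refusing to start replica movement."
  exit 1
fi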

Metrics and Monitoring

Collecting metrics and building monitors on top is crucial to problem detection and getting insight into how the cluster performs during migration. There are three different sources of metrics we need to collect for Kafka: Kafka broker JMX metrics, broker system-level metrics, and AWS EBS metrics.

JMX metrics and EBS metrics are largely consistent across EC2 and K8s, and therefore remain substantially unchanged during the migration. The only difference is in the tags for EC2 versus K8s brokers’ metrics. Remapping those tags to the existing monitor system is straightforward.

Broker system-level metrics are a bit interesting since EC2 is a VM, while a K8s pod is a group of containers. The VM-based metrics sometimes cannot be found in container-based metrics. For example, the load average metric, which measures the workload of the host’s CPU over time and indicates how busy the system is, has no direct equivalent at the container level. Arguably, the CPU throttled rate metric could be used as the closest proxy, but it’s not quite the same.

Relying on higher-level metrics for problem detection is better than delving into low-level metrics every time. In fact, those high-level Kafka broker JMX metrics can provide good indications of a resource exhaustion problem: producer latency, network pool capacity, request handler capacity, and replication latency.
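
For example, the request handler and network processor idle-percent gauges exposed over JMX are useful saturation signals; they can be sampled with the JmxTool that ships with Kafka (the JMX port and sampling interval here are illustrative):

kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name 'kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent' \
  --reporting-interval 10000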

Broker disruption management

Finally, we need to manage broker disruption carefully during migration. Broker disruption refers to any issue or event that causes a broker to function improperly or become unavailable. It may be involuntary (e.g. due to hardware failure) or voluntary (e.g. due to a rolling restart of brokers).

Poor disruption management leads to low availability

For a Kafka cluster consisting of brokers across 3 AZs — where most topics are configured with 3 replicas and a minimum of 2 in-sync replicas — losing some brokers in one AZ is manageable. However, losing two brokers from two different AZs is problematic, as it will cause “offline partitions” issues (given that the minimum in-sync replicas are set to 2):

offline partitions when losing brokers from two different AZs

We need a system to prevent this scenario as much as possible. Involuntary disruptions are unavoidable; what we can control is preventing voluntary disruptions from occurring while an involuntary disruption is already underway.

Pod Disruption Budget with Kafka Monitoring Service

A Pod Disruption Budget (PDB) in K8s is established to monitor the broker pods deployed on K8s. Once the PDB detects any offline broker, the disruption budget is exhausted, preventing any other broker on K8s from being evicted.

However, there is no native way for a PDB to detect the state of brokers on EC2 instances, so we set up an additional service called the Kafka Monitoring Service to bring the EC2 brokers into scope. This service watches the overall health of the Kafka cluster by checking for Under Replicated Partitions (URPs); if URPs are detected, it enters an unready state.

The PDB is configured to monitor both the Kafka Monitoring Service and the brokers deployed on K8s: it can detect not only the offline brokers on K8s by directly watching the broker pods, but also the offline brokers on EC2 via the Kafka Monitoring Service.

PDB watches Kafka monitoring service and brokers on K8s to prevent unexpected disruptions
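
A minimal sketch of such a PDB (the labels and budget value are illustrative): the selector matches both the broker pods and the Kafka Monitoring Service pod, so an unready monitoring pod (i.e. URPs anywhere in the cluster, including from an EC2 broker) consumes the budget and blocks further voluntary evictions.

kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-disruption-guard
  namespace: kafka
spec:
  # Allow at most one matched pod to be unavailable; any additional
  # voluntary eviction is blocked while the budget is consumed
  maxUnavailable: 1
  selector:
    matchLabels:
      kafka-disruption-guard: "true"
EOF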

Zookeeper Migration

We’ve successfully migrated Kafka brokers from EC2 instances to K8s. However, the Zookeeper cluster is still running on the EC2 instances.

Kafka on K8s; Zookeeper on EC2

Of course, we could migrate the ZooKeeper cluster in a similar manner to the Kafka brokers. However, there is an even better idea! With Kafka version 3.6, KRaft finally reached General Availability, meaning we can use brokers with the controller role, powered by KRaft, to replace ZooKeeper. Since the brokers are already running on K8s, it is easy to set up a new set of brokers with the controller role.

Migrate zookeeper to Kraft controller
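
As a rough sketch, a controller-role broker taking part in the ZooKeeper-to-KRaft migration carries configuration along these lines; the node id, quorum voters, hostnames, and ports are illustrative:

cat > /etc/kafka/controller.properties <<'EOF'
process.roles=controller
node.id=3000
controller.quorum.voters=3000@kafka-controller-0.kafka.svc.cluster.local:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://:9093
# Migration-specific settings: the KRaft controller keeps talking to ZooKeeper
# while cluster metadata is being migrated
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk-1.example.internal:2181
EOF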

Summary

By applying the strategy of broker-level migration within a consolidated Kafka cluster, we successfully migrated all Kafka clusters to K8s. The migration process was fully seamless from the clients' perspective, with no incidents caused by the migration.

Managing Kafka on K8s is much easier compared to EC2, thanks to the robust, scalable, and flexible platform provided by K8s, which automates many operational tasks: from deployment to monitoring, scaling, and self-healing. The well-defined K8s objects and Custom Resource Definitions (CRDs) allow us to build operators around Kafka to further automate operational tasks.

The improved automation and reduced operational overhead not only enable us to iterate on our Kafka ecosystem at a much faster pace, but also free up time for the team to focus on new development work.
