Kafka: Automating Root CA rotation with Vault

Tim Cuthbertson
Zendesk Engineering
12 min read · Dec 19, 2023


Background

Nearly three years ago, we first implemented mTLS (mutual TLS) for Apache Kafka at Zendesk. Our previous blog post goes into great detail about how this works.

Recently, we embarked on a process to replace our Kafka-specific authentication tooling with Zendesk’s internal “Temp Auth” tooling, which we already use for authenticating to other datastores like MySQL and Redis.

Temp Auth includes an init container, responsible for delivering credentials to Kubernetes pods based on custom resources associated with the pod’s project. We have a number of Kubernetes operators which are responsible for:

  • provisioning datastores (MySQL, DynamoDB, Kafka topics, etc)
  • configuring access
  • populating credentials in Vault and configuring Vault roles

The init container’s job is to read secrets from Vault based on a project’s declared datastores, and deliver them as files on disk for the application container to read. It also deals with either evicting a pod or refreshing its credentials when any of these short-lived credentials are close to expiry.
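As a rough sketch, the init container’s core responsibilities might look like this in Python. The function names and the refresh margin are illustrative, not Temp Auth’s actual code:

```python
import tempfile
import time
from pathlib import Path

REFRESH_MARGIN = 300  # illustrative: act when under 5 minutes of validity remain

def deliver_credentials(secret: dict, dest: Path) -> None:
    """Write each credential field as a file for the application container."""
    dest.mkdir(parents=True, exist_ok=True)
    for name, value in secret["data"].items():
        (dest / name).write_text(value)

def needs_refresh(secret: dict, now: float) -> bool:
    """True when the credential is close enough to expiry to refresh or evict."""
    return secret["expires_at"] - now < REFRESH_MARGIN

secret = {"data": {"username": "app-1", "password": "s3cret"},
          "expires_at": time.time() + 60}
dest = Path(tempfile.mkdtemp())
deliver_credentials(secret, dest)
print(sorted(p.name for p in dest.iterdir()))  # → ['password', 'username']
print(needs_refresh(secret, time.time()))      # → True
```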

mTLS Refresher

When I’m not actively working on mTLS, I usually forget how it works. So here’s a refresher of the important parts of how mTLS works for Kafka:

In typical browser TLS, only the server has a certificate. The client connects to example.com, and the server presents its certificate for example.com. This certificate will be signed by some Certificate Authority (CA). Typically this is an intermediate CA, which is itself signed by another CA. At some point up the chain of certificates, your browser encounters a certificate that it already trusts, because it matches one of the hundreds of well-known CAs already included in your browser.

If a trust path is found, then your browser trusts the server. But the server has no idea who you are, so you’ll typically use a username / password to authenticate yourself.

With mutual TLS, both sides verify each other. The broker provides a certificate which the client verifies, just like the above scenario. But the client also provides its own certificate, which the broker verifies. Now both sides know exactly who they’re talking to, so we can do away with username / passwords entirely.
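The difference between the two scenarios is visible in how you configure a TLS context. Here is a minimal sketch using Python’s stdlib ssl module; the file paths are placeholders, not our production setup:

```python
import ssl

def broker_context(cert: str, key: str, ca: str) -> ssl.SSLContext:
    """Server-side mTLS: present our certificate AND demand one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(cert, key)        # the broker's own identity
    ctx.load_verify_locations(ca)         # CAs used to verify *client* certificates
    ctx.verify_mode = ssl.CERT_REQUIRED   # requiring a client cert makes it mutual
    return ctx

def client_context(cert: str, key: str, ca: str) -> ssl.SSLContext:
    """Client-side mTLS: verify the broker, and present our own certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # verifies the server by default
    ctx.load_cert_chain(cert, key)        # the client's own identity
    ctx.load_verify_locations(ca)         # CAs used to verify the *broker's* certificate
    return ctx
```

In browser-style TLS only the client calls `load_verify_locations`; in mTLS both sides do, and the server additionally sets `CERT_REQUIRED`.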

mTLS setup using a single Root CA

The simplest version of this is when both certificates are directly issued by a single CA. We use Vault’s PKI engine to easily create an internal self-signed CA. We don’t need to bother with chains of intermediate CAs, as we can quickly distribute the root CA’s certificate to every broker and client.

CA Rotation

So far, so good. Clients trust Brokers, and Brokers trust clients. We don’t need a separate username / password system, a single Vault CA can issue certificates for both use cases.

But the big problem is how to rotate the Root CA. It’s extremely unlikely that an attacker could steal the private key belonging to the Root CA: there are multiple layers of protection before an attacker could even reach Vault, and Vault itself provides no way to read these private keys.

However unlikely it is, if it did happen we’d have to take Kafka offline, and that’s unacceptable — Kafka’s kind of important! So we need to have a process to rotate the Root CA. This is done by creating a new Root CA, and throwing the old one away. But again, we’d like Kafka to keep working while this happens, and it can’t be done instantly. So we use a gradual process to change root CAs. If we’re retiring root-a and replacing it with root-b, the required steps are:

  • Step 1: distribute both CA certificates to all clients.
    At this stage, everyone is still using certificates issued by root-a. But they’ll also start trusting certificates issued by root-b.
  • Step 2: switch the primary issuer to root-b.
    Now that everyone trusts root-b certificates, we can start using them.
  • Step 3: remove the old root-a certificate from all clients.
    Now that nobody has certificates issued by this CA, we can stop trusting it.

Each of these steps takes some time to propagate to all clients and brokers, which is why we can’t just swap the CAs instantly.
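The invariant behind these steps is that at every stage, every certificate still in circulation was issued by a root that every peer trusts. A toy Python model makes this concrete; issuer names stand in for real cryptographic validation:

```python
def handshake_ok(cert_issuer: str, trusted: set[str]) -> bool:
    """Toy stand-in for certificate validation: is the issuer trusted?"""
    return cert_issuer in trusted

# Before rotation: everyone holds root-a certificates and trusts root-a.
issuer, trusted = "root-a", {"root-a"}
assert handshake_ok(issuer, trusted)

# Step 1: distribute both CA certificates. Still issuing from root-a.
trusted = {"root-a", "root-b"}
assert handshake_ok("root-a", trusted)

# Step 2: switch the primary issuer to root-b. Certificates from both
# roots are in circulation, so both must keep validating.
issuer = "root-b"
assert handshake_ok("root-a", trusted) and handshake_ok("root-b", trusted)

# Step 3: once no root-a certificates remain, stop trusting root-a.
trusted = {"root-b"}
assert handshake_ok("root-b", trusted)
print("every stage validated")  # → every stage validated
```

Skipping step 1 or 3 is exactly where an instant swap would break: a peer would present a certificate from a root the other side doesn’t (or no longer) trusts.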

Our original multi-root setup

Managing this rotation process requires some shared state so that clients do the right thing at all times. In particular, everyone needs to know:

  • what is the primary CA, used for issuing certificates
  • what is the secondary CA (if any), which should be trusted alongside the primary root CA

As described in our earlier blog post, we use a Consul key/value entry for this, where the value is a JSON structure containing the Vault paths for the current primary and secondary issuers. The process to initialize an authenticated Kafka client goes like this:

Firstly, we need to know what Root CAs exist, and which is the primary one. We query the well-known kafka-pki/root Consul key, which returns a JSON value with the primary and secondary issuer path(s).

Once we know those paths, we:

  • Ask the primary issuer to issue a certificate. The issued certificate contains the app’s identity, which is used in Kafka for controlling access and quotas.
  • Ask all secondary issuers for their certificate.

With that information we can construct a Kafka Keystore and Truststore. The Keystore contains our client certificate (and private key). The Truststore contains all the issuers (both primary and secondary), which means that we’ll be able to securely communicate with a broker who has a certificate issued by either CA.
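Sketched in Python, the bootstrap sequence might look like this. The Consul payload shape and the two helper functions are illustrative stand-ins, not the exact production payloads or code:

```python
import json

# Stand-in for the value read from the kafka-pki/root Consul key.
consul_value = json.loads("""
{
  "primary": "pki-root-a",
  "secondary": ["pki-root-b"]
}
""")

def issue_certificate(issuer_path: str, identity: str) -> dict:
    # Stand-in for an issue request against the primary Vault issuer.
    return {"certificate": f"CERT({identity})@{issuer_path}",
            "private_key": f"KEY({identity})"}

def ca_certificate(issuer_path: str) -> str:
    # Stand-in for fetching an issuer's CA certificate.
    return f"CA({issuer_path})"

# Keystore: our identity, issued by the primary.
issued = issue_certificate(consul_value["primary"], "my-app")
keystore = {"cert": issued["certificate"], "key": issued["private_key"]}

# Truststore: the CA certificates of the primary and all secondaries.
truststore = [ca_certificate(p)
              for p in [consul_value["primary"], *consul_value["secondary"]]]
print(truststore)  # → ['CA(pki-root-a)', 'CA(pki-root-b)']
```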

This process is shown below, with primary and secondary CAs in use:

That’s pretty bespoke!

It is, isn’t it? Three requests across two services. The interactions aren’t that novel, but they’re also not standard.

Honestly, throughout the initial design we were convinced there ought to be a simpler, more standard way. Surely other people rotate their root CAs, why did we have to invent our own glue for all of this? It really seemed like there should be some standard layer of indirection which we could use to swap out the primary issuer, without clients needing to be told about the change.

Well at least we’re not alone — we dug into how folks are doing Istio root CA rotation and it’s a pretty similar workflow. This uses Kubernetes resources for indirection instead of Consul and also involves intermediate CAs. But the core process behind rotating the root CA is the same.

Enter Vault multi-issuer support

Vault 1.11 (June 2022) added support for multiple issuers in one PKI engine, which turns out to be exactly the layer of indirection we needed. Using this, we can have a single PKI engine mounted at /kafka-pki, but it can contain multiple issuers (root CAs in our case). Vault provides an API to set the default issuer, which will be used to issue certificates. That way clients can always make the same /kafka-pki/issue request, but we can swap the issuer behind the scenes!

So that solves half the problem — knowing about the primary issuer. But what about the secondary issuer? One way we could handle this is for clients to make a Vault API call to list all the issuers within /kafka-pki, and trust all their certificates. This would work fine, but the Temp Auth system we’re integrating with doesn’t have any special multi-issuer logic, and ideally it shouldn’t care.

When Vault issues you a certificate, it gives you back the certificate, the private key associated with that certificate, and a ca_chain. The idea is that you’ll present this ca_chain when making a connection, and your peer will validate that it can find a path from your certificate up to some root it already trusts, using the chain you provide.

Initially we planned to use this property to enable two-path validation, using intermediate CAs. That is, we’d have a setup where the same intermediate CA was signed by two parent CAs. Whichever of the two root CAs you trusted, you would trust the intermediate because it had multiple trust paths, one back to each of the root CAs.

We tried this out, and there’s probably some way to get it to work. But when bashing our heads against Terraform and various openssl commands we don’t use everyday, we realised there’s a simpler, dumber way: just shove both roots in the ca_chain.

(Ab)using the ca_chain

We don’t have any use for the ca_chain currently, because we don’t use intermediate CAs. We can distribute root CAs to all our clients quickly, so there’s no need for the added complexity of intermediate CAs.

While we were busy putting both root CAs into the ca_chain of our intermediate, we realised that this is a pretty convenient way to give the client a list of CAs, which is exactly what we need to inform clients about multiple issuers. In our setup we know that both sides are using the same issuer, so the CAs used by a client are also the CAs it needs to trust to validate the broker’s certificate.

So we actually put all CA certificates into the ca_chain of each issuer. And when configuring our Kafka clients, we use the returned ca_chain as a truststore rather than as a chain to present. That is, we trust every certificate that Vault returns in the ca_chain of our newly issued certificate.

This is, it should be noted, not what the ca_chain is for, nor how it’s typically used. In a regular internet-style setup, you distribute root CAs out of band (e.g. bundling them in system packages), and the client presents a ca_chain which links the certificate they have back to a CA you trust. It would be a terrible mistake to trust the entries in a CA chain that a client gives you. But we’re not doing that — we’re configuring Vault to return a ca_chain containing all the root CAs, and then trusting those (because we trust Vault). And this works for both the clients and brokers, because they’re all using the same Vault issuer / CAs.

The good news is that this unusual setup is restricted to how we use the ca_chain (and configure Vault). As far as Temp Auth cares this is just a standard TLS certificate it’s generating; it has no understanding of multiple CAs.
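Sketched in Python, with placeholder PEM contents standing in for real certificates, the client side of this looks roughly like:

```python
# Illustrative shape of a Vault PKI issue response; the PEM bodies are
# placeholders, and in our setup the ca_chain carries all three root CAs.
issue_response = {
    "certificate": "-----BEGIN CERTIFICATE-----\n<client cert>\n-----END CERTIFICATE-----",
    "private_key": "-----BEGIN PRIVATE KEY-----\n<key>\n-----END PRIVATE KEY-----",
    "ca_chain": [
        "-----BEGIN CERTIFICATE-----\n<previous root>\n-----END CERTIFICATE-----",
        "-----BEGIN CERTIFICATE-----\n<current root>\n-----END CERTIFICATE-----",
        "-----BEGIN CERTIFICATE-----\n<next root>\n-----END CERTIFICATE-----",
    ],
}

# Keystore: our identity (certificate plus private key).
keystore_pem = issue_response["certificate"] + "\n" + issue_response["private_key"]

# Truststore: every root Vault handed back, used instead of a presented chain.
truststore_pem = "\n".join(issue_response["ca_chain"])
print(truststore_pem.count("BEGIN CERTIFICATE"))  # → 3
```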

Shown here is the overall process: just a single request to Vault and no special code to deal with multiple issuers:

Keep on rolling

At this point we had an interface that didn’t require any bespoke logic to work with Temp Auth, our standard datastore authentication system. But that’s only half of the picture. Our team owns the Vault PKI issuers and the process for doing the CA rotation.

In the initial mTLS implementation, we used Terraform. CA rotation was done semi-automatically, where an engineer followed a runbook, creating simple Terraform changes (updating a few JSON / HCL values) for Atlantis to apply for each step of the rotation process. This was a conscious tradeoff: fully automating the rotation would be difficult, and we didn’t rotate CAs all that often (as I mentioned, it’d be pretty hard to get these keys out of Vault!).

But thanks to some pretty glaring bugs (example) in the Terraform Vault provider around multiple issuers, we decided it was time to automate the CA rotation.

To be honest, it was way easier than we expected. We periodically run a reconcile job which performs the following pseudo-logic:

1) list all the issuers in /kafka-pki, sort them oldest-first and return the first 3

2) if there are fewer than 3 issuers, create a new issuer and go back to step 1

3) assign these three issuers to variables previous, current and next

4) if next is older than ROTATION_TIME: create a new issuer, and shuffle the existing issuers (i.e. the new issuer becomes next, the old next becomes current, the old current becomes previous and the old previous is forgotten)

5) set the ca_chain in all three issuers to [previous, current, next]

6) mark current as the primary issuer in Vault (if it isn’t already)

7) delete any issuers not in [previous, current, next]

This is a relatively straightforward process that is robust to interruptions — if the process crashes at any point, it’ll end up with the 3 desired issuers after the next run regardless of which point it got up to, and there’s no point where a crash would leave Vault in an unwanted state (e.g. deleting an issuer before all references to it have been removed).
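The pseudo-logic above can be made concrete against an in-memory stand-in for the Vault PKI engine. FakePKI, the issuer names, and the ROTATION_TIME value here are all illustrative:

```python
from datetime import datetime, timedelta

ROTATION_TIME = timedelta(days=7)  # illustrative value

class FakePKI:
    """In-memory stand-in for a Vault PKI engine with multiple issuers."""
    def __init__(self):
        self.issuers: dict[str, datetime] = {}    # name -> created_at
        self.ca_chains: dict[str, list[str]] = {}
        self.default: str | None = None
        self._n = 0

    def create_issuer(self, now: datetime) -> str:
        self._n += 1
        name = f"root-{self._n}"
        self.issuers[name] = now
        return name

def reconcile(pki: FakePKI, now: datetime) -> None:
    # 1-2) ensure at least three issuers exist, then take the oldest three
    while len(pki.issuers) < 3:
        pki.create_issuer(now)
    previous, current, nxt = sorted(pki.issuers, key=pki.issuers.get)[:3]
    # 4) rotate when `next` has aged past ROTATION_TIME: everything shuffles up
    if now - pki.issuers[nxt] > ROTATION_TIME:
        previous, current, nxt = current, nxt, pki.create_issuer(now)
    # 5) publish all three roots in every issuer's ca_chain
    for name in (previous, current, nxt):
        pki.ca_chains[name] = [previous, current, nxt]
    # 6) `current` is the one that issues certificates
    pki.default = current
    # 7) drop anything no longer referenced
    for name in list(pki.issuers):
        if name not in (previous, current, nxt):
            del pki.issuers[name]
            pki.ca_chains.pop(name, None)

pki = FakePKI()
t0 = datetime(2023, 1, 1)
reconcile(pki, t0)
print(pki.default, sorted(pki.issuers))   # → root-2 ['root-1', 'root-2', 'root-3']
reconcile(pki, t0 + timedelta(days=8))    # next is now 8 days old, so we rotate
print(pki.default, sorted(pki.issuers))   # → root-3 ['root-2', 'root-3', 'root-4']
```

Running `reconcile` twice back to back with the same clock is a no-op, which is what makes the job safe to rerun after a crash at any step.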

Three little pi̶g̶g̶i̶e̶s̶ CAs

One thing that surprised me was that we ended up with three CAs at all times. We used to have one most of the time, and two during a rotation. We originally wrote an algorithm to replicate this, but it needed a mode concept which represented whether we were bringing in a new CA or phasing out an old one.

But this new logic is always doing both at the same time. We never considered that with the old runbook-based approach, as it would have been more work. But when automated, it led to less code, which was a funny realisation. Right after a rotation happens, these are the CA states:

  • (new) > next: The next issuer is brand new. Nobody trusts it yet, so we can’t issue certificates with it. Clients will start trusting it when they next provision certificates.
  • next > current (rotate in): Every client now trusts this CA, so we promote it to be the primary issuer of certificates.
  • current > previous (rotate out): All existing certificates were issued by this CA, so we need to keep trusting it until they expire.
  • previous > (deleted): It’s been long enough that nobody still holds a certificate issued by this CA, so we can delete it.

The three-CA rotation process, illustrated

The second 90% of the work: migration

Zendesk has been running Kafka clusters for around a decade now, and we have hundreds of services relying on it. Our most recent migration was from Chef-managed VMs to Kubernetes, which we did very gradually over the course of many months, with zero downtime for clients. For any migration we do, sudden breaking changes are simply out of the question.

For this migration, we paved the way by introducing our new multi-issuer endpoint as a secondary_issuer in the existing Consul-based system. We also tweaked the old init container code to trust all ca_chain entries, not just the issuing certificate. This way, old clients would trust all four issuers (the old one and our three new issuers). And once we promoted the new multi-issuer endpoint to the primary_issuer (and waited for that to propagate), all our clients trusted the same three root CAs regardless of which init container they were using.

When changing authentication, there’s a real risk of breaking things for existing clients. There are a whole bunch of ways you can break the trust between two processes, not to mention the deployment surprises in switching to a different init container.

Thinking through all the states and interactions between clients and brokers in various states of the migration is important, but unfortunately not sufficient. We did a lot of testing in our staging environment — rolling forward, rolling back, and keeping an eye out for any issues that might be authentication-related. For the most part they weren’t that hard to spot 😅.

When changing something this large and previously stable, you always end up flushing out some other associated problems. During our test rollouts we uncovered a number of related issues:

  • A bug in our deployment tooling which caused applications to be pinned to an old version of our init-container, instead of receiving the latest version at deployment time.
  • One VM hostgroup where we’d overlooked the necessary upgrade of the init container, as this process differs from Kubernetes workloads.
  • Some interactions between different Kubernetes controllers, where under certain conditions the Temp Auth init container would try to generate Kafka certificates, but the destination volume didn’t exist in the pod’s manifest.
  • Subtly misconfigured clients, which would only trust the first issuer in the truststore we had generated — many tools will quietly read “a certificate” from a file containing multiple, which is hard to notice when you usually only need the first one.

This all slowed down our rollout, but for a good cause. Having flushed out those issues in staging, we could proceed to roll out to production with no customer impact, and a peaceful night’s sleep for our on-call engineers.

Conclusion

We’re really happy with this new setup. The first time we implemented mTLS for Kafka it seemed like there must be a better way, and now I feel like we’ve found it. The whole setup is easier to understand and observe, and it’s now fully automated.

Thanks to that automation, we can have an aggressive rotation schedule measured in days, not years. The only requirement is that ROTATION_TIME needs to be longer than the TTL of our issued certificates, to ensure that everyone has seen the latest state before we perform the next rotation.

I’ll be honest, doing this work wasn’t a smooth process. Our team aren’t PKI experts, and we took a while to meander our way through the problem space (and Terraform!) before finding something that works. But we’re really happy with the results. We get to retire our Kafka-only auth injection system and reuse the common Temp Auth tooling used by other datastores at Zendesk, and we ended up with a much simpler, fully automated CA rotation system.
