No Access Denied: Our Transition to Kafka ACLs

Peter Nham
Zendesk Engineering
10 min read · Feb 22, 2024


Background

Since migrating our Kafka setup from Chef-managed infrastructure to Kubernetes, our Kafka modernization journey has gained momentum. The next challenge was to enhance security within Kafka. Kafka ACLs (Access Control Lists) had been on our radar for a while, but they were put off due to the complexities of working with Chef. With Kafka now deployed on Kubernetes, both development and deployment have become significantly easier, paving the way for us to enable ACLs.

At Zendesk, our Kafka setup already had mTLS authentication enabled. Even so, TLS only addresses the authentication aspect of security, verifying the identities of clients. Once a client is authenticated, it has full access to any resource in Kafka. This unrestricted access goes against the principle of least privilege, making the implementation of ACLs a logical next step for us.

Decoding Kafka ACLs

Before delving further into our journey, let’s briefly unpack Kafka ACLs.

ACLs are a security feature that allows us to control access to Kafka resources like topics and consumer groups. Their purpose is not only to ensure that only authorized users can access Kafka, but also that users can only access their intended resources. This reduces the potential risk of a malicious actor damaging the platform, for instance, deleting topics or altering cluster configurations, or accidental data corruption, such as a producer inadvertently writing to the wrong topic.

Kafka’s ACLs are granular and defined as “Principal X [allow/deny] access to do action Y against resource Z”. ACLs also have limited pattern matching support: they only allow wildcard or prefix patterns.

ACL acceptable rules
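To make the rule shape concrete, here’s a minimal Python sketch (hypothetical, not our implementation) of how such rules can be evaluated, covering literal, prefixed, and wildcard patterns, with Kafka’s deny-over-allow precedence:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AclRule:
    principal: str            # e.g. "User:my-kafka-consumer"
    permission: str           # "allow" or "deny"
    operation: str            # e.g. "READ", "WRITE", "DESCRIBE"
    resource: str             # resource name, name prefix, or "*"
    pattern: str = "literal"  # "literal", "prefixed", or "wildcard"

def rule_matches(rule: AclRule, principal: str, operation: str, resource: str) -> bool:
    if rule.principal != principal or rule.operation != operation:
        return False
    if rule.pattern == "wildcard":
        return rule.resource == "*"
    if rule.pattern == "prefixed":
        return resource.startswith(rule.resource)
    return resource == rule.resource

def is_authorized(rules: list, principal: str, operation: str, resource: str) -> bool:
    matched = [r for r in rules if rule_matches(r, principal, operation, resource)]
    # As in Kafka, a matching DENY rule wins over any matching ALLOW
    if any(r.permission == "deny" for r in matched):
        return False
    return any(r.permission == "allow" for r in matched)
```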

Whilst this granularity is great for flexibility, most of our users aren’t Kafka experts, and don’t particularly care about that detail. What they want is: “This is my consumer, just give it the permissions it needs to talk to Kafka”. To simplify things, Kafka’s CLI tools have a handy flag that bundles together a set of common permissions for producers and consumers, which we used as our starting point in designing the interface.
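For instance, assuming a broker at localhost:9092 and a principal named User:my-service (illustrative names), the convenience flags look like this:

```shell
# Bundle the common producer ACLs for a topic
kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:my-service \
  --producer --topic ticket-metadata

# Bundle the common consumer ACLs for a topic and its consumer group
kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:my-service \
  --consumer --topic tickets --group my-group-1
```

These commands require a running broker; they’re shown only to illustrate the bundled flags.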

Our Self Service Interface

Self Service is an interface at Zendesk that lets teams specify the infrastructure their services need, such as Aurora, S3 buckets, or Kafka access, in YAML. Our infrastructure team abstracts away the complexities of the underlying infrastructure and provides simple, smart defaults. Under the hood, the YAML is transformed into Kubernetes Custom Resources, and a range of operators (like our ACL operator) provision the actual resources.

Self service ACLs setup

At a high level this is how the entire flow looks:

  • my-kafka-consumer configures their Kafka access through a yaml file
  • This gets translated to Kubernetes Custom Resources when they deploy
  • The ACL operator then reconciles that object and creates the required ACLs for my-kafka-consumer to access topic-1
  • When my-kafka-consumer tries to access topic-1, it’s business as usual
  • When my-kafka-consumer tries to access topic-2, it gets rejected by Kafka

Here’s a sample of what our Self Service schema for ACLs looks like:

infrastructure:
  - name: my-kafka-consumer
    consume:
      - tickets
      - my-topic-2
    produce:
      - ticket-metadata
    admin:
      - foo-kstreams*
    consumerGroups:
      - my-group-1

The ACLs that get created are:

  • consume
    - Topics: READ, DESCRIBE, DESCRIBE_CONFIGS
  • produce
    - Topics: WRITE and everything from consume
    - Cluster: IDEMPOTENT_WRITE (not required in Kafka v3+)
    - Transaction id: WRITE, DESCRIBE
  • admin
    - Topics: CREATE, DELETE and everything from produce and consume
  • consumerGroups
    - Group: ALL

Check out https://docs.confluent.io/platform/current/kafka/authorization.html#operations for an exhaustive list of the available ACL operations and what they do.
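Putting the list above into code, here’s a hypothetical Python sketch of how a Self Service entry could expand into concrete ACL bindings (our real operator reconciles Kubernetes Custom Resources; the cluster-level IDEMPOTENT_WRITE and transaction-id ACLs are omitted for brevity):

```python
# Hypothetical sketch, not our actual operator.
CONSUME_OPS = ["READ", "DESCRIBE", "DESCRIBE_CONFIGS"]
PRODUCE_OPS = CONSUME_OPS + ["WRITE"]
ADMIN_OPS = PRODUCE_OPS + ["CREATE", "DELETE"]

def expand_acls(spec: dict) -> list:
    """Expand a Self Service entry into (principal, resource_type, resource, operation) tuples."""
    principal = f"User:{spec['name']}"
    acls = set()
    for topic in spec.get("consume", []):
        acls.update((principal, "topic", topic, op) for op in CONSUME_OPS)
    for topic in spec.get("produce", []):
        acls.update((principal, "topic", topic, op) for op in PRODUCE_OPS)
    for topic in spec.get("admin", []):
        acls.update((principal, "topic", topic, op) for op in ADMIN_OPS)
    for group in spec.get("consumerGroups", []):
        acls.add((principal, "group", group, "ALL"))
    return sorted(acls)
```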

Making security easy

Security projects aren’t usually flashy, and they don’t usually have new features to tempt users into migrating. For our ACL migration we focused on making it as easy as possible to adopt. Our two primary goals were:

  • Zero downtime ACL migration for all Kafka clients
  • Enable all producers and consumers to provision Kafka access via Self Service

Zero downtime ACL migration

You can’t enable ACLs without enabling the Authorizer, but you also can’t have the Authorizer without ACLs: otherwise all clients would lose access. We needed a way to support both modes. To ensure zero downtime during the migration, there were three types of clients we had to support:

  • Plaintext clients without ACLs
  • TLS clients without ACLs
  • TLS clients with ACLs

We called this stage of the migration “hybrid mode”. During this phase, all clients needed to continue working seamlessly until we permanently disabled non-ACL access. We explored a couple of options to support hybrid mode.

  1. allow.everyone.if.no.acl.found
  2. Pre-create ACLs for all resources
  3. Implement a custom Authorizer

allow.everyone.if.no.acl.found is a built-in Kafka config that allows any client to access a resource without restriction. However, access starts being enforced as soon as an ACL is created for that resource. The issue here is that producers and consumers would need to create their ACLs at the same time, otherwise one of them would lose access.
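For reference, option 1 is a single broker setting:

```
# server.properties
allow.everyone.if.no.acl.found=true
```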

Option 2 entails creating wildcard ACLs for all topics, groups, and other resources ahead of time. Once we confirmed that all clients of a resource had set up their own access, we could delete the wildcard ACLs. In practice, though, this would be error prone, and it would be hard to know for sure that every client had correctly set up access.

This left a custom Authorizer as the best and most flexible solution. We could configure the Authorizer to not enforce any ACLs, but instead emit metrics whenever access was allowed or denied. This gave us and our end users visibility into whether their clients’ access had been set up correctly. A custom Authorizer also gave us the flexibility to move individual clients onto the “live” ACL mode, which helped us validate our solution early by migrating a couple of our own team’s services first.

More information about our Authorizer in the deep dive below 👇

Self service — ctrl+c, ctrl+v

The data we collected from our custom Authorizer fed into our next goal of “access via self service”: all producers and consumers are able to provision Kafka access via Self Service.

Every time a client tried to access a resource, our Authorizer would emit a data point with the client name, resource, and request type. Using this information, we could generate the configs we thought each team needed for their service. For most teams this was a simple copy-paste; for others, it was a starting point they could adjust as required. This was a nice benefit for us: we knew imposing ACLs would mean extra work for other teams, and we wanted to reduce friction by making the migration as smooth as possible.
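As a hypothetical sketch of that generation step (the datapoint shape and the operation-to-section mapping are illustrative, not our actual pipeline):

```python
from collections import defaultdict

# Hypothetical sketch: turn authorizer access datapoints into a suggested
# Self Service config per client.
def suggest_access(datapoints: list) -> dict:
    """datapoints: (client, resource_type, resource, operation) tuples."""
    suggestions = defaultdict(lambda: defaultdict(set))
    for client, rtype, resource, op in datapoints:
        cfg = suggestions[client]
        if rtype == "group":
            cfg["consumerGroups"].add(resource)
        elif op == "WRITE":
            cfg["produce"].add(resource)
        else:  # READ, DESCRIBE, ... on a topic
            cfg["consume"].add(resource)
    return {client: {section: sorted(names) for section, names in cfg.items()}
            for client, cfg in suggestions.items()}
```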

The Migration

One of the most challenging parts of any project is getting teams to adopt the new thing, especially when it requires changes to existing services. In our case, enabling ACLs required active involvement of teams. We took this opportunity to bundle some other security enhancements and quality of life features to make it worthwhile for teams to migrate.

One enhancement was the rollout of Kafka Temp Auth, which automated root CA rotations in Kafka. Bundling it in removed the need for a second migration, saving the time and resources we would have spent had we delivered them separately.

We also eased the configuration process by delivering configs straight to services: whenever a service specified Kafka access via Self Service, we would mount a config file with useful shared configs such as bootstrap brokers, TLS configs, and certificate paths. This streamlined the onboarding process for new clients. Existing clients also benefited: previously, they had to run and update an init container to mount these files. Now the files are delivered automatically and are always up to date, reducing their maintenance burden.

What’s in a (Topic) name? Quite a bit actually!

Whilst reviewing the data from the Authorizer’s accesses, we observed a mixed bag of naming conventions that had accumulated over the years. These conventions included a mix of underscores, hyphens, and periods. Some had service names prepended to resources, some had them appended, and some didn’t have them at all, which made it hard to identify owners. Having this data was invaluable, because prior to this we rarely had a complete view of the usage in Kafka.

We saw an opportunity here to improve the naming standard for new resources. Platform teams often use a “stick” to get teams to adopt new standards. To encourage adoption, we opted for the “carrot” approach instead, offering ACL “freebies”.

As mentioned earlier, our ACL operator is built on top of Kubernetes, so anytime a service provisioned access we could also get its project’s namespace. We provided a free namespace-prefixed consumer group for every service that defined a consume topic in its Kafka access. For instance, if a team wanted to get up and running quickly, they could configure their consumer to use the consumer group service-namespace or service-namespace.foo, and wouldn’t need to add any other explicit group access.
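One plausible reading of that freebie check, as a hypothetical sketch (not our operator’s code): the group must be exactly the namespace, or the namespace plus a dot-separated suffix.

```python
# Hypothetical sketch of the consumer-group freebie check.
def is_free_consumer_group(namespace: str, group: str) -> bool:
    return group == namespace or group.startswith(namespace + ".")
```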

KStreams

Our ACL freebies also extended to Kafka Streams’ (KStreams’) use of application.id. By default, KStreams uses application.id in the following places:

  • Internal topics use the format <application.id>-<operatorName>-<suffix>
  • As the consumer group name
  • As the client id prefix

If teams use the namespace as their application.id or prefix it with the namespace, they automatically get the consumer ACL for free.

However, KStreams also requires CREATE and DELETE topic ACLs for its internal topics. The usual way to handle this would be to add the topic prefix to the admin field of the service’s Kafka access.

For security reasons, we don’t provide default permissions for admin ACLs, as most services don’t require them. This restriction helps prevent teams from unintentionally creating or deleting topics that belong to another team. We also restrict teams to only specifying admin topics in the format service-namespace-* (prefixed with their namespace). Anything that doesn’t start with their namespace requires approval from us to be added to our exemption list. When teams follow the blessed convention and use their namespace in the application.id, the setup is seamless and doesn’t require intervention from us.
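A hypothetical sketch of that validation (the exemption-list handling here is illustrative, not our operator’s code):

```python
# Hypothetical sketch of the admin-topic restriction.
def admin_topic_allowed(namespace: str, pattern: str, exemptions: set) -> bool:
    """Admin topic patterns must be prefixed with the service's namespace
    (e.g. service-namespace-*); anything else needs an explicit exemption."""
    return pattern.startswith(namespace) or pattern in exemptions
```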

To minimise the impact on teams, all existing KStreams topics were grandfathered into the exemption list already.

What’s next

Our journey to enabling ACLs was not just about enhancing security. It was an opportunity to improve our standards, streamline our processes, and continue modernising our Kafka infrastructure.

Often, security projects aren’t glamorous and are sometimes seen as just a compliance exercise. Fortunately, in delivering this project we encountered some unexpected perks alongside the security wins:

  • Visibility into data sources.
    A common question for teams producing shared topics would be: “who’s consuming from my topics?”
    This was difficult to answer before, but now we have the data to answer those questions.
  • Manual rate limiting.
    Now that we know how teams access resources, we’re able to apply rate limiting when clients are misconfigured and overload our Kafka cluster.
  • Kafka temp-auth.
    Specifying Kafka access is the signal we used to determine when to mount TLS certs and other configs into service containers.
  • Audit logging of who and what accesses resources — for both services and people
  • A way to encourage and guide teams on standardising their resource names

Now that we’ve got some building blocks in place, we can look forward to some other upgrades. We’re looking at upgrading to Kafka 3.6 and subsequently migrating to KRaft. Once this is done, we can finally look forward to decommissioning ZooKeeper and removing our last Chef host. Kafka quotas (rate limiting) are also on our radar: we intend to extend the Self Service interface to support applying Kafka quotas to clients.

Appendix


Kafka Authorizer deep dive

This is a small add-on to delve deeper into our custom authorizer implementation.

Scala is our team’s language of choice, and we leveraged it to build our custom Authorizer. Most of the logic is still handled by Kafka’s provided authorizer: for Kafka with ZooKeeper the required base is AclAuthorizer, while for KRaft you’d want to use StandardAuthorizer. Our Authorizer just adds a thin layer to collect access request data and log human interactions (not shown in this snippet). It also allowed us to progressively enable ACLs for certain principals via the AclEnforcementList.

package com.zendesk.kafka.authorizer

import java.util

import kafka.security.authorizer.AclAuthorizer
import org.apache.kafka.common.acl.AclOperation
import org.apache.kafka.common.resource.ResourceType
import org.apache.kafka.common.security.auth.KafkaPrincipal
import org.apache.kafka.common.utils.SecurityUtils
import org.apache.kafka.server.authorizer.{Action, AuthorizableRequestContext, AuthorizationResult}

import scala.jdk.CollectionConverters._

// statsD and AclEnforcementList are internal helpers, elided here
class CustomAuthorizer extends AclAuthorizer {
  override def authorize(requestContext: AuthorizableRequestContext, actions: util.List[Action]): util.List[AuthorizationResult] = {
    val authorizeResults = super.authorize(requestContext, actions).asScala
    actions.asScala.zip(authorizeResults)
      .map { case (action, result) =>
        publishMetrics(requestContext, action, result)
        authorizePrincipal(result, requestContext.principal())
      }
      .asJava
  }

  override def authorizeByResourceType(requestContext: AuthorizableRequestContext, op: AclOperation, resourceType: ResourceType): AuthorizationResult = {
    // Similar to above
  }

  private def authorizePrincipal(authorizationResult: AuthorizationResult, principal: KafkaPrincipal) = {
    // Only enforce the real result for principals that have opted in to "live" mode
    if (AclEnforcementList.principals.contains(principal.getName)) {
      authorizationResult
    } else {
      AuthorizationResult.ALLOWED
    }
  }

  private def publishMetrics(requestContext: AuthorizableRequestContext, action: Action, result: AuthorizationResult): Unit = {
    val principal = requestContext.principal
    val resourceType = SecurityUtils.resourceTypeName(action.resourcePattern.resourceType)
    val resourceName = action.resourcePattern.name
    val operation = SecurityUtils.operationName(action.operation)
    val authorized = result match {
      case AuthorizationResult.ALLOWED => true
      case AuthorizationResult.DENIED => false
    }

    statsD.increment(
      "authorizer.access",
      s"authorized:$authorized",
      s"principal:$principal",
      s"resourceType:$resourceType",
      s"resourceName:$resourceName",
      s"operation:$operation"
    )
  }
}

We packaged the authorizer as an uber jar with sbt-assembly. From there it was a simple exercise to update the config to call our authorizer.

# Dockerfile
COPY /app/target/custom-kafka-authorizer.jar /opt/kafka/libs

# server.properties
authorizer.class.name=com.zendesk.kafka.authorizer.CustomAuthorizer
# Configure brokers as super users for ACL permissions
super.users=User:broker
