Jason Smale
Zendesk Engineering
13 min read · Mar 8, 2022


ZEN and the art of Reliability

Zendesk’s reliability principles and the real-world stories behind them, as we transitioned from humble IT help desk software to providing mission-critical systems for enterprises.

Zendesk handles approximately 250,000 requests per second at daily peak into our infrastructure, with over ½ of those requests needing to read or write to a database. At our core we’re a humble Ruby on Rails application that is partitioned and heavily sharded. Our infrastructure was simple 10 years ago — Nginx with a Ruby backend and a single MySQL database.

After being on the journey for a decade, it feels like we’ve been tested in every way possible. We’ve seen consistent misuse, both intentional and unintentional, from external folks and ourselves. We’ve reached the scaling limits of our core technologies. We’ve found short-term partitioning strategies just to make it through. We’ve architected our way through scaling bottlenecks.

During those 10 years, we’ve scaled our engineering organization from 30 engineers to 1400 who deploy changes to Production over 80,000 times a year. We’ve had our fair share of reliability problems throughout the years. Most of them we’ve isolated well and mitigated quickly. Other times our customers were the first to raise the alarm, and that is completely unacceptable.

We’ve focused more on reliability through the last couple of years than ever before. What’s guided this journey is eight principles we believe to be true for building resilient systems. Initially, these principles were written in preparation for a meeting with a customer who was rightly upset. In the moment, we realized they were actually quite good and that we could be much more effective if we used them internally. A lot of them have surfaced because of mistakes we’ve made, like setting our reliability goals too high or only measuring the reliability of the systems we could control. Other principles were inspired by companies like AWS and Netflix who were ahead of us in the Cloud and have much larger system scale and teams.

Let’s start with the first, most important, aspect of reliability: if you’re not measuring from the customer perspective, you’re not going to get the change you want to see.

The worst situation to be in.

1. Measure true customer impact

For a long time at Zendesk we’ve actively tracked the availability of our systems but our monitoring hasn’t always accurately mapped to what our customers were experiencing. One moment on our journey paradoxically took us further from measuring true customer impact.

Many years ago while covering the topic of poor reliability during our board meeting, a board member commented that we had to do “four-nines of reliability if we wanted to be a serious enterprise SaaS company”. That mandate cascaded through the management ranks and we said “99.99 it is”. That’s less than 5 minutes of trouble a month! At that point in time, our on-call team was getting 30,000 pages a month. That’s equivalent to being paged roughly every 90 seconds, 24 hours a day. In that environment 99.99 was unachievable.

The only chance we had of hitting that target was to exclude vendor-triggered incidents. In other words, we decided to measure only the systems we controlled. In retrospect that was naive. Our customers don’t care whether we run the service that fails, or a vendor we use runs the service that fails. They care that they can’t use Zendesk to do their job. That’s why today, our internal Trouble Free Availability[1] measurements are based on customer impact, no matter whether it’s us, Twilio, Cloudflare, or AWS that triggers the issue. After all, we can’t directly control the reliability of third-party vendors, but we can choose which vendors we partner with and control how we depend on them.

We ended up adjusting the goal for Trouble Free Availability down to 99.95 (including third-parties). We probably needed to go lower in retrospect, but at 99.95 it felt like we could get lucky and hit our target. We also needed to make sure we were measuring true customer impact. We needed an accurate way to measure the errors from the customer’s perspective rather than our internal system perspective. To do this we needed to improve our observability.

Improving our observability has dramatically changed our ability to detect problems ourselves with alerts, rather than relying on humans to notice and trigger incidents manually. A big win has been transitioning internally from uptime monitoring to Rate-Error-Duration monitoring as our standard, plus leveraging a multitude of tools to get customer experience metrics from our customers’ browsers, our CDN, our proxy tier, and our third-party dependencies. We measure from many points because valuable insights often surface when measurements across service boundaries don’t align.
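For readers unfamiliar with the pattern, Rate-Error-Duration monitoring boils down to three signals per service. Here is a minimal sketch over a window of request records; the field names are hypothetical, and in practice this lives in metrics tooling rather than application code:

```ruby
# Summarise a window of requests into the three RED signals:
# rate (throughput), errors (failure ratio), duration (latency percentile).
Request = Struct.new(:status, :duration_ms)

def red_summary(requests, window_seconds:)
  durations = requests.map(&:duration_ms).sort
  errors    = requests.count { |r| r.status >= 500 }
  {
    rate_rps:   requests.size.to_f / window_seconds,
    error_rate: errors.to_f / requests.size,
    p99_ms:     durations[(durations.size * 0.99).ceil - 1]
  }
end

window = [Request.new(200, 45), Request.new(200, 120), Request.new(503, 30)]
p red_summary(window, window_seconds: 60)
```

Alerting on these three signals for each service, from multiple vantage points, is what lets mismatches across service boundaries stand out.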

Sometimes it’s hard to measure true customer impact and even harder for management to swallow how few nines of reliability a system might have. But, the right answer is always to measure true customer impact rather than only what you control.
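For the record, here is the arithmetic behind the two targets discussed above, assuming an average month of 365.25 / 12 ≈ 30.44 days (a quick sketch, not our exact accounting):

```ruby
# Convert an availability target into an allowed "trouble" budget per month.
SECONDS_PER_MONTH = (365.25 / 12) * 24 * 60 * 60 # average month, ~2,629,800 s

def monthly_downtime_budget(availability)
  budget = (1.0 - availability) * SECONDS_PER_MONTH
  format("%dm %02ds", budget / 60, budget % 60)
end

puts monthly_downtime_budget(0.9999) # => "4m 22s"  (the board's four nines)
puts monthly_downtime_budget(0.9995) # => "21m 54s" (our current TFA goal)
```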

2. Focus efforts on core features

When we went through periods of instability, we often fell into the stereotypical management trap of introducing more processes, or stopping the whole production line and having everyone try and fix the problems. It did help in some ways, but also resulted in wasted effort building resilience into features our customers didn’t depend on. We needed to focus on what mattered most to our customers to gain the biggest wins. We needed to come up with a construct, and that construct was core features.

We defined core features as “features so critical that when broken, the impact to customers is basically the same as the whole product is down”. When a customer can’t create a Ticket, it’s basically the same as their entire Support product being down. The first step was to decide what the core features were and document ownership. This enabled the organization to focus on implementing SLIs and SLOs, then smoke tests, then rate and error monitors. The SLIs and SLOs now also feed our Product-level Error Budgets. About six months after we were happy with our core features, we built up the courage to include them as part of our SEV1 definition.
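The post doesn’t show how core features are tracked internally, but conceptually each one pairs a customer-facing capability with an owner, an SLI, and an SLO that feeds the error budget. A hypothetical sketch, where the names and numbers are illustrative only:

```ruby
# A hypothetical core-feature registry entry plus the error-budget math it
# feeds; none of these names or numbers are Zendesk's actual definitions.
CORE_FEATURES = [
  {
    name:  "Create ticket",
    owner: "team-ticketing",
    sli:   "successful ticket-creation responses / total ticket-creation requests",
    slo:   0.9995,
    sev1_if_breached: true      # core features are part of the SEV1 definition
  }
]

# Fraction of the monthly error budget still remaining, given a measured SLI.
def remaining_error_budget(slo, measured)
  allowed = 1.0 - slo
  used    = 1.0 - measured
  1.0 - (used / allowed)
end

puts remaining_error_budget(0.9995, 0.9998)   # ≈ 0.6, i.e. 60% of the budget left
```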

Core features weren’t the first method of focusing the organization that we tried. We initially introduced Critical Tiering of services (0–4) with Tier 0 being the most critical. A Tier 0 system being down would cause multiple products’ core features and functionality to go down. Some examples are our Nginx proxy layer, Kubernetes clusters, MySQL databases, and Authentication Service that every request passes through. While Critical Tiering was helpful and continues to be something we use today, it was difficult to map to the customer experience.

Core Features and Service Tiering allowed us to highlight the areas where we needed to invest more. We couldn’t make everything equally reliable at the same time, yet these two constructs helped us identify where to prioritize reliability work so we could get the largest improvements for our customers.

3. Assume failure & simulate failure

In our earliest days, we ran our own hardware in co-located data centers around the world. We over-provisioned nearly everything and most servers ran for multiple years without failing (or even being restarted). Moving to AWS forced us to think differently and modernize our infrastructure. We had to stop assuming our servers would see long uptimes, and start expecting EC2 instances to restart more frequently, without notice. We had to build every system assuming it could be restarted or disappear at any point in time. It became unacceptable to have boot times of 30 minutes while Chef provisioned an instance. We needed start times measured in seconds. Human intervention after a restart was unacceptable. The early days of running on AWS were not fun.

We didn’t have to assume failure and simulate failure because it was happening every day in Production! We played whack-a-mole while transitioning our infrastructure to run all services on Kubernetes. Doing so forced our engineering teams to build systems resilient to a Kubernetes pod disappearing at any moment. Migrating everything to Kubernetes got us a long way, but it wasn’t enough.
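In practice, “resilient to a pod disappearing at any moment” means fast, unattended startup plus clean handling of termination signals. A minimal sketch of the shutdown half, using a generic in-memory job queue as a stand-in rather than one of our actual services:

```ruby
# A worker that drains cleanly when Kubernetes sends SIGTERM before removing
# the pod. The in-memory queue is a stand-in for a real job source.
queue = Queue.new
25.times { |i| queue << "job-#{i}" }

shutting_down = false
Signal.trap("TERM") { shutting_down = true }  # Kubernetes sends TERM, then KILL after the grace period

until shutting_down || queue.empty?
  job = queue.pop
  sleep 0.1                                   # stand-in for real work
  puts "processed #{job}"
end

puts "draining complete, exiting before SIGKILL"
```

The other half, which the sketch doesn’t show, is making startup fast and idempotent so a replacement pod is serving again in seconds.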

It wasn’t enough because our systems were still so tightly coupled that when one piece failed, everything failed. A great example is service discovery. We heavily used Consul for service discovery, even throughout our Kubernetes migration. Then one day Consul failed and consequently… everything failed. There was no resilience baked in and we had never simulated losing service discovery. At that time, we were so busy responding to incidents and obsessing about migrating that we didn’t have time to think about future incidents. But we needed to get ahead. We needed to simulate failure in our pre-production environments, rather than just responding to failure in Production when it occurred.

Simulating failures in our pre-production environment and more recently our production environment has been a game-changer. We now have a chaos engineering team that runs monthly engineering-wide game days and weekly service-specific game days. We’ve even had situations where the day after we practiced an Availability Zone failure, we had the same team members needing to evacuate an Availability Zone in Tokyo due to cooling system failures! Game days have allowed us to practice responding to incidents, test increasing datastore latency, discover new boot dependencies, ensure services can be deployed during different failure scenarios, and more.
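A typical “increase datastore latency” exercise can be scripted with a fault-injection proxy. This sketch uses Shopify’s Toxiproxy Ruby client as an example, which is not necessarily the tooling our chaos team uses, and assumes a hypothetical mysql_primary proxy is already configured:

```ruby
require "toxiproxy"

# Hypothetical stand-in for the checks run against core features.
def run_smoke_tests
  puts "running core-feature smoke tests with ~1s of extra DB latency"
end

# Add ~1s of latency (with jitter) to responses flowing back from the
# database proxy for the duration of the block; the toxic is removed after.
Toxiproxy[:mysql_primary].downstream(:latency, latency: 1000, jitter: 250).apply do
  run_smoke_tests
end
```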

It feels like we’re just getting started, but we materially improve the chances of running a more reliable system with each game day. We’re looking forward to a day when failure simulation is ‘always on’ in our pre-production environments and integrated into our deployment pipeline.

We know computers fail. To build a resilient system we have to assume it will fail and practice responding, rather than being disappointed when it happens.

4. Fail small, reducing blast radius

Failing small is incredibly difficult to do, but critical to our overall reliability strategy. We’ve been incrementally improving by leveraging common practices such as partitioning, circuit breaking, limits, fast failure, and improving how we introduce changes.
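Of those practices, circuit breaking is the easiest to show in a few lines. Here is a self-contained sketch of the idea; real implementations, ours included, live in shared libraries or the service mesh rather than hand-rolled code like this:

```ruby
require "net/http"
require "uri"

# A minimal circuit breaker: after `threshold` consecutive failures the
# circuit opens and calls fail fast until `cool_off` seconds have passed,
# protecting both the caller and the struggling dependency.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 5, cool_off: 30)
    @threshold, @cool_off = threshold, cool_off
    @failures, @opened_at = 0, nil
  end

  def call
    raise OpenError, "circuit open, failing fast" if open?
    result = yield
    @failures = 0                     # a success closes the circuit again
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise
  end

  private

  def open?
    @opened_at && (Time.now - @opened_at) < @cool_off
  end
end

breaker = CircuitBreaker.new(threshold: 3, cool_off: 10)
begin
  breaker.call { Net::HTTP.get(URI("https://example.com/")) }
rescue StandardError
  # fall back to a cached value or a degraded response
end
```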

In the early days, we ran Zendesk in a single environment, in a single partition. I clearly remember one of our worst incidents, a denial of service attack while we were hosted at Rackspace. An attack on a single customer filled up the network pipe and Rackspace just turned us off. All customers were down. Every single customer. To make it worse, we couldn’t even SSH into our own infrastructure because our access was over the public internet. It hurt, and we learned a thing or two. Mostly, we learned how important it was to fail small.

As we evolved from that wild Rackspace DDoS-related outage, we’ve seen the value of partitioning time and time again. We’ve been inspired by the way AWS builds their infrastructure with zones, the magic that is shuffle sharding in Route 53, and many of the recommended practices for maintaining a highly reliable multi-tenant system[2]. We’ve also been inspired by the way changes go out into the AWS environment, starting with a single instance in a single zone, and then increasing breadth as confidence increases. AWS has spent a lot of time automating that process; if you’re interested, it’s detailed in a wonderful Builders’ Library article by Clare Liguori, Automating safe, hands-off deployments.
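Shuffle sharding deserves a quick illustration: instead of pinning each customer to one fixed shard, each gets a small pseudo-random subset of nodes, so two customers rarely share their entire subset and one noisy tenant can only degrade the handful of nodes it lands on. A rough sketch of the assignment step, not Route 53’s actual algorithm:

```ruby
require "digest"

NODES = (1..16).to_a          # pool of worker nodes
SHARD_SIZE = 4                # each customer is served by 4 of the 16

# Deterministically pick a customer's shuffle shard from its account id.
def shuffle_shard(account_id)
  seed = Digest::SHA256.hexdigest(account_id.to_s).to_i(16)
  NODES.shuffle(random: Random.new(seed)).first(SHARD_SIZE)
end

p shuffle_shard("acme")       # => a stable 4-node subset for this account
p shuffle_shard("initech")    # => a different subset; sharing all 4 nodes is rare
```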

Today at Zendesk we obsess about partitioning our infrastructure at every level. We aim to have complete isolation between partitions, so if we have a failure in one, it doesn’t leak into another. It’s been an incredibly effective strategy for us.

Partitioning allows for failure to be contained

Consistency in deployment has also been very effective in reducing the blast radius of new changes going out to production. This one has been a huge lift for the organization, yet it is delivering significant results. Today, once a new change has passed pre-production checks and is ready to go to production it goes to our Canary environment hosting all of our own Zendesk instances (including our customer support portal, https://support.zendesk.com). Then after adequate soak time and successful smoke tests, changes continue to roll out to our partitioned production environment in a staged approach based on load and revenue.
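A simplified sketch of that rollout flow as a script follows; the stage names, soak times, and helpers are hypothetical, and the real pipeline is fully automated and far more careful:

```ruby
# Roll a release out in stages: canary first, then production partitions
# ordered so the blast radius of a bad change stays as small as possible.
STAGES = [
  { name: "canary",            soak_minutes: 60 },  # hosts our own Zendesk instances
  { name: "pods-low-traffic",  soak_minutes: 30 },
  { name: "pods-mid-traffic",  soak_minutes: 30 },
  { name: "pods-high-traffic", soak_minutes: 30 },
]

def roll_out(release)
  STAGES.each do |stage|
    deploy(release, to: stage[:name])           # hypothetical deploy helper
    sleep(stage[:soak_minutes] * 60)            # soak while monitors watch error rates
    unless smoke_tests_pass?(stage[:name])      # hypothetical smoke-test helper
      rollback(release, from: stage[:name])
      abort "halted rollout of #{release} at #{stage[:name]}"
    end
  end
end
```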

5. Protect our infrastructure from misuse & slowly ramp up capacity

Mikkel said to me the other day, “It’s mind-blowing the amount of malicious activity on a pretty boring business. Who wants to attack a support company?” Every week Zendesk is attacked in new and wonderful ways. A lot of the time it is malicious actors. Though sometimes it’s a support admin who decides to spin up a few API calls and makes a mistake in their script — causing thousands of updates to the same ticket simultaneously.

But the most surprising misuse is when we discover our own systems are triggering it. A couple of years ago we deployed a change to our mobile SDK to improve performance with pre-loading. Our recommendation to developers was to have the SDK call home every single time the app launched. It had super snappy performance in testing! Then customers started to roll out the new version. Some of our customers include the biggest mobile gaming companies in the world and huge social networks. One very high growth company diligently implemented the SDK based on our recommendations. Suddenly, our Nginx layer was flooded with non-important traffic from millions of mobile phones distributed all around the world. The incident was a nice wake-up call for us.

Misuse, purposeful or not, self-imposed or not, comes in many forms and frequencies. Over the last decade, it’s been evident that to build a reliable Zendesk, we need multiple layers of defense. In addition to traditional DDoS and WAF style protection, limiting usage has been one of our most effective strategies. Those limits need to be in place for everything.

Not many people outside of engineering have enjoyed the introduction of limits. We’ve had to work through a methodical change of culture internally and even with some of our customers. Many new limits required instrumentation to understand current usage, debate about what the limit should be, and end-of-life communications to customers. Sometimes we even needed to help individual customers migrate to different API endpoints. It’s been a heavy, heavy lift.

The wins haven’t all been equal for the amount of time we’ve invested. One of the best wins for us was a 5-line change in Nginx implementing misuse rate limits.
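That Nginx snippet isn’t reproduced here, but as an illustration of the same idea one layer up, here is a tiny Rack middleware enforcing a per-account request limit. The numbers and header name are entirely hypothetical, and this is far cruder than what runs in production:

```ruby
# Reject requests from an account once it exceeds `limit` requests within a
# sliding one-minute window. State is in-process only, so this is a sketch
# of the idea rather than a production limiter.
class MisuseRateLimit
  def initialize(app, limit: 600)
    @app, @limit = app, limit
    @windows = Hash.new { |h, k| h[k] = [] }
  end

  def call(env)
    account = env["HTTP_X_ACCOUNT_SUBDOMAIN"] || "anonymous"
    now     = Time.now.to_f
    window  = @windows[account]
    window.reject! { |t| t < now - 60 }
    window << now

    if window.size > @limit
      [429, { "Content-Type" => "text/plain" }, ["Rate limit exceeded\n"]]
    else
      @app.call(env)
    end
  end
end
```

In a Rails app this would be added to the middleware stack; in practice, enforcing the limit at the edge, as the Nginx change did, is both cheaper and harder to bypass.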

By implementing limits and having always-on defense mechanisms, we’ve given our engineers clear scaling boundaries. We are also clearer with our customers on how the platform scales and successfully work with them to increase limits when needed.

6. Automate most things and have them self-heal

Our goal of 99.95% TFA on core features allows for 21m 54s of degradation per month. There are few situations where humans can be alerted, assemble, diagnose, and mitigate a problem within 20 minutes. To meet our reliability requirements, we needed to automate the way we mitigate failure and self-heal where possible. Good examples are the auto-remediation systems we’ve built to cordon a Kubernetes node if network errors exceed a certain threshold, and our configuration of Istio with OutlierDetection to automatically route around containers experiencing problems.
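A bare-bones sketch of that node-cordoning remediation loop is below. The metric source and threshold are hypothetical, it assumes kubectl access from wherever it runs, and the real system does far more validation before acting:

```ruby
# Cordon any Kubernetes node whose network error rate exceeds a threshold,
# so no new pods are scheduled there while it is investigated.
NETWORK_ERRORS_PER_MIN_THRESHOLD = 50

def network_error_rate(node)
  # Hypothetical helper: in reality this would query the metrics provider.
  rand(0..100)
end

def node_names
  `kubectl get nodes -o name`.split.map { |n| n.sub("node/", "") }
end

node_names.each do |node|
  rate = network_error_rate(node)
  next if rate < NETWORK_ERRORS_PER_MIN_THRESHOLD

  system("kubectl", "cordon", node)  # mark unschedulable; a follow-up job or human drains it
  puts "cordoned #{node} (network errors: #{rate}/min)"
end
```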

We also leverage automation to assist with speeding up diagnosis. One common failure mode we’ve experienced is high Unicorn utilization due to a pileup of slow requests [3]. Today when an alert fires, our Automated Remediation Systems query Datadog APM, provide the on-call responders with data on the accounts and endpoints involved, and propose short-term limits to mitigate the issue.

It still feels like we’re in the first phase of our automation journey. Especially when we talk with AWS Networking and EC2 teams about the automated health checking systems they’ve built to enable the impressive reliability they deliver.

7. Improve our customer communication

When we fail, our customers need to be swiftly informed, and we need to be transparent about what occurred so they can respond accordingly and can trust the failure won’t occur again. That’s great in principle, but it’s really hard to do in practice. At Zendesk, when an incident is triggered it’s nearly never a repeat incident and diagnosis is often required to understand what’s going on. It then takes time to understand impact and communicate it to affected customers. Many businesses struggle to meet customer expectations during incidents. We aim to be one of the best.

We’ve improved, but still have a long way to go to reach the level of communication our customers expect from us. Proactively, we want to educate our customers and partners about our architecture, failure domains, and how we manage incidents. This empowers them to build resilient integrations and leverage workarounds when failure does occur. Reactively, we are continuing to improve on many fronts. We are focused on faster detection of impacted customers and improved notification capabilities via our Status Page. One of the best changes we’ve made has been allocating a trained senior Engineering leader to assist the incident manager during an incident, including with internal and customer communications.

8. Ramp up our reliability goals over time

Goals need to be reachable for engineers to be motivated, so setting achievable goals is critical to success. At the same time, we aren’t happy with 99.95. Our board was right, 99.99 is what our customers need from us. Our customers see Zendesk as an extension of their own business. We upped our target last year by including Core Features in our SEV1 definition. Once we’ve had 6 months of predictably hitting our current goals, we’ll go higher again.

Each additional nine of availability is exponentially harder to achieve and maintain. One of the continued challenges is deciphering whether we have hit our targets due to luck or genuine resilience.

What’s next

At Zendesk we don’t feel we can ever really “solve” reliability. We’re realists. It’s an infinite game where the bar is constantly rising and gets harder. We’ve found that maintaining these 8 principles has helped us to communicate our high-level strategy internally and externally to our customers. These principles then translate into a roadmap and real work impacting every team. And we have a lot of exciting work on our reliability roadmap! Things like Availability Zone affinity routing, automated deployments, migrating critical systems to DynamoDB, and other significant architectural improvements. If you align with these principles and like the challenge of making fast growing and high scale systems more reliable, then please get in touch. We’re hiring over 100 engineers this year who will focus on Reliability.

[1] Trouble Free Availability is defined as the period of time in which our Generally Available products are available for customers without disruption of service. Internally we weight TFA based on our revenue from each of our partitions.

[2] SaaS Storage Strategies in Multitenant Environments is a great write up from back in 2016 of some of the common partitioning strategies in SaaS.

[3] Unicorn is an HTTP server for Rack applications. We’ll probably move to Puma in the future, yet Unicorn has been good to us for a long time.
