dbt at Zendesk — Part I: Setting foundations for scalability

Mahdi Karabiben
Zendesk Engineering
10 min read · May 23, 2023


At Zendesk, data is an essential part of our decision-making process. Many Zendeskians interact with our data platform in one way or another throughout their day, via various types of data products and for a wide range of use cases. For this reason, our data platform needs to be scalable and versatile to serve different usage patterns.

That’s where Enterprise Data & Analytics (ED&A) comes in. As the owners of multiple key platform components, we aim to enable everyone within Zendesk to make data-driven decisions by providing them with the necessary tools and data products.

Due to Zendesk’s exponential growth, our data platform underwent multiple changes throughout the past decade, both in terms of the technologies it’s built on and the philosophy behind its design.

This three-article series will focus on a major change in one platform area: adopting a dbt-based data transformation framework.

In this article, we’ll discuss our pre-dbt architecture, why we chose dbt, and our work to define scalable foundations for our dbt usage. In the upcoming two articles, we’ll talk about our lift-and-shift playbook to move legacy pipelines to dbt and the more advanced dbt-related features we worked on.

But first, let’s look at the bigger picture

Before discussing dbt, let's start with a high-level overview of Zendesk's data architecture. There are multiple data engineering teams at Zendesk, but two central ones maintain the core data platform:

  • Foundation engineering: owns the data lake, which contains all of our product data.
  • ED&A: owns the data warehouse, which contains a subset of product data and all of our SaaS applications’ data.

This distribution is not merely operational. The two teams support two very different data stacks, with our data lake being on AWS (S3 + Apache Hudi) while our data warehouse is on GCP (BigQuery).

A (very) high-level overview of our data architecture, with a focus on the data warehouse

In this post, we’ll focus on data transformation within the scope of our data warehouse, but a separate article from the AWS Architecture Blog provides a detailed overview of our product data architecture.

Data transformation: Chapter I

A few years ago, we made the decision to implement a self-serve pattern for our data warehouse. ED&A’s data engineering team would ingest the different source datasets into the warehouse and maintain the business-critical pipelines, while other teams would build their own data pipelines to support various use cases.

To achieve this pattern, we had to make two decisions that defined a significant portion of our data platform:

  • Heavily customizing our Airflow instance: Our Airflow setup adds a layer of abstraction on top of vanilla Airflow. This additional layer allowed us to hide most of the complexity of the operations triggered and managed via Airflow DAGs, but it also meant maintaining more than 100 custom Airflow operators.
  • Building data pipeline abstractions: The ED&A platform team built several frameworks throughout the past few years to streamline turning SQL queries into data pipelines or Airflow DAGs. With these abstractions, users only need to provide a YAML file containing a SQL query and certain config parameters to generate an end-to-end data pipeline.

The above decisions allowed us to quickly scale our platform and enable teams to generate enriched data products at scale, but they also meant that we quickly hit the limits of our approach. By the end of 2021, we had nearly 700 Airflow DAGs and 25,000 BigQuery tables in our production environment, and the data engineering team slowly became the informal owner of all these assets. The lack of standardization and governance made it harder for consumers to trust the data, and it was common to find different versions of the same table built by different teams.

By this point, we knew we had to start from scratch with a radical shift in philosophy: a push towards governed, trusted, and high-quality curated data assets that form the foundation for our different data consumption patterns. We also needed to adopt an industry standard for our data transformation pipelines and move away from our heavily customized approach. Finally, we needed to let Airflow do what it does best (orchestration) and move the burden of managing transformations to a tool built for that task.

dbt: one standard to rule them all

When we started exploring the data transformation space to find a tool that meets the above requirements, picking dbt was a straightforward decision for quite a few reasons:

  • We know (and like) SQL: As mentioned above, most of our existing data transformation codebase was written in SQL (with Airflow-related components written in Python), so SQL support was a mandatory requirement for the new solution. dbt takes things further by making SQL the cornerstone of the framework.
  • Built-in data quality tests: We were big fans of dbt tests since they’d simplify our data quality testing process and minimize the effort to enforce data quality standards for all foundational data assets.
  • Built-in capabilities to apply Don’t-Repeat-Yourself (DRY): One of the hard lessons we learned with our legacy pattern was the importance of defining logic once instead of hard-coding the same bits of business logic and transformations in different queries. dbt comes with the concept of macros, which allows us to apply the DRY principle across all of our data assets (see the sketch after this list).
  • Rich ecosystem: The ever-growing list of high-quality dbt packages meant that we’d be able to build upon prior community work instead of reinventing the wheel ourselves.
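
To illustrate the DRY point, here’s a minimal sketch of how a shared piece of business logic can be captured in a macro and reused across models. The macro name, the status-mapping logic, and the stg_accounts model are purely illustrative, not taken from our actual codebase.

```sql
-- macros/normalize_account_status.sql (illustrative macro)
{% macro normalize_account_status(status_column) %}
    case
        when lower({{ status_column }}) in ('active', 'enabled') then 'ACTIVE'
        when lower({{ status_column }}) in ('churned', 'cancelled') then 'CHURNED'
        else 'UNKNOWN'
    end
{% endmacro %}

-- models/accounts.sql -- the same logic is reused instead of copy-pasted into every query
select
    account_id,
    {{ normalize_account_status('raw_status') }} as account_status
from {{ ref('stg_accounts') }}
```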

With the above points in mind, we knew that betting on dbt was the best path forward. That said, deciding to use dbt is only a (tiny) first step in defining the standards of our new data transformation approach.

Foundations for scalability

To ensure that our new dbt-based pattern doesn’t lead us to (familiar) scalability limits, we had to define and build the necessary foundations to scale our dbt projects while maintaining the pillars of our new philosophy. And so before writing dbt models, we first defined and implemented our dbt standards.

Answering questions before they’re asked

dbt is a feature-rich tool, but that’s a double-edged sword. The breadth of functionality dbt provides means there are always at least two ways to implement something. Whether you’re working on a new model or extending an existing one, it’s easy to find yourself wondering, “What’s the right approach here?” (More concretely, questions like “Should I make this a macro?” or “Which materialization should I use for this model?” came up constantly when we started experimenting with dbt.)

Even though dbt itself is an opinionated tool, we believe it shouldn’t have to answer such questions. dbt, at its core, is a tool, and it’s up to us as users to define how we want to use it. Recommendations from dbt Labs and the dbt community are a very good place to start, but ultimately answering “How do we want to use dbt?” is an exercise every team planning to use dbt should go through, and it should sit at the heart of the implementation process.

In our case, our dbt standards needed to reflect a broader initiative within Zendesk: adopting the notion of data domains, starting with six domains and assigning teams to each one. So, for example, our product data domain (which contains data generated by Zendesk products) is one component with its own design choices and commitments.

For the above reasons, it was clear that we needed to set our own exhaustive, opinionated standards for our dbt projects to prevent inconsistencies. We relied heavily on GitHub discussions to explore how different dbt concepts and features could be used at Zendesk, which allowed us to make informed decisions that were then translated into standards. Some of the main decisions we made early on were the following:

  • Layering strategy: We defined the different layers of our dbt projects, the contents of every layer, its recommended materialization, and other relevant metadata (including naming conventions at the layer level, development standards, mandatory audit columns, domain-level tags, etc.).
  • dbt project structure: We defined the structure of our dbt projects and where every component (models, macros, tests) should live. Here again, dbt Labs provides very practical recommendations, but we had to adapt them to our own usage patterns and data-domain-driven architecture.
  • Usage of dbt features (macros, tests, seeds, snapshots, etc.): For every main dbt feature we planned to use, we defined the scenarios in which we wanted to use it and recommendations related to its usage.
  • dbt execution: We defined how we want dbt to interact with Airflow and how granular our dbt tasks should be. We opted to stay consistent with our data domains work and rely on tags to run a specific set of models in a given Airflow task (see the sketch after this list). In the third article of the series, we’ll discuss the custom Kubernetes operator we built to run our dbt jobs.
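
As a concrete illustration of the layering and execution decisions above, here’s a minimal, hypothetical model showing how a layer-level materialization and a domain-level tag might be declared, along with the kind of tag-based selection an Airflow task can run. The model, column, and tag names are illustrative rather than our actual conventions.

```sql
-- models/marts/product_domain/fct_ticket_events.sql (hypothetical path and model)
{{
    config(
        materialized='incremental',
        unique_key='ticket_event_id',
        tags=['product_domain', 'mart']
    )
}}

select
    ticket_event_id,
    ticket_id,
    event_type,
    created_at
from {{ ref('stg_ticket_events') }}
{% if is_incremental() %}
-- Only process rows that arrived since the last run of this model
where created_at > (select max(created_at) from {{ this }})
{% endif %}
```

A given Airflow task can then execute only the models belonging to a domain via tag-based selection, e.g. dbt run --select tag:product_domain.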

Start with small steps, until you know where you’re headed

With dbt, it’s easy to get overly ambitious. For our project setup, we initially discussed having a dbt project template that multiple teams could clone and use for their own dbt projects, with every project living in a separate GitHub repository. Even though this is the most scalable approach, it also comes with the most complex governance structure: the different repositories and dbt projects can quickly diverge, making it difficult to enforce standards across all of them.

We ultimately opted for one central repository containing a single dbt project as our starting point, with the aim of adopting a more distributed approach once our standards and processes are mature enough. This decision allowed us to iterate at a very fast pace during the first months of our dbt migration, since we were maintaining a single (but increasingly large) dbt project.

Different corners, one big room

One big problem with our legacy pattern was that our dev BigQuery environment resembled a crowded town square. Engineers had their own Airflow sandboxes, but all Airflow instances read from and wrote to the same BigQuery datasets, making local development a tricky process.

With dbt, we leveraged the notion of targets to automate the creation of developer-specific sandbox schemas (a schema represents a BigQuery dataset). For this, we “overloaded” the generate_schema_name macro and added a set of custom scenarios to its implementation (sketched after the list below):

  • If the execution occurs in a main environment (like prod or dev), the custom schema names defined in our dbt_project.yml file remain unchanged.
  • If the execution is triggered from a CI environment, a ci_ prefix is added to all custom schema names (with ci being the default schema for our CI target).
  • If the execution is triggered from a local setup (local as target), a sandbox_username_ prefix is added to all custom schema names (with sandbox being the default schema for local targets, and the username retrieved from an environment variable).
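
Here’s a simplified sketch of what such a generate_schema_name override can look like. The target names, prefixes, and environment variable below are illustrative rather than our exact implementation.

```sql
-- macros/generate_schema_name.sql (illustrative override)
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {# No custom schema: fall back to the target's default schema
           (e.g. ci for the CI target, sandbox for local targets) #}
        {{ target.schema }}
    {%- elif target.name in ['prod', 'dev'] -%}
        {# Main environments: keep the custom schema names from dbt_project.yml unchanged #}
        {{ custom_schema_name | trim }}
    {%- elif target.name == 'ci' -%}
        {# CI runs: prefix every custom schema name with ci_ #}
        ci_{{ custom_schema_name | trim }}
    {%- elif target.name == 'local' -%}
        {# Local runs: prefix with sandbox_<username>_, with the username read from an env var #}
        sandbox_{{ env_var('DBT_SANDBOX_USER') }}_{{ custom_schema_name | trim }}
    {%- else -%}
        {{ target.schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```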

So, if we take the example of a dataset named product_domain, we’d have the following variations in our dev environment:

  • product_domain: the main dataset that’s used by our dev Airflow instance
  • ci_product_domain: the dataset that’s used by our CI jobs
  • Multiple versions of sandbox_username_product_domain: these are sandbox datasets that are used during local execution, with each developer having their own dataset

By applying the above, we automated the generation of sandbox datasets, ensuring that we never run into the scenario of two users writing to the same sandbox.

Enforcing rules: every time, everywhere

As we got closer to defining the core principles of our dbt usage, it was critical to put automation in place to ensure we consistently enforced the rules we were setting for this new pattern. For this, we leveraged the following components:

  • Linting via SQLFluff: SQLFluff has been part of our dbt setup since day one. We started by defining the rules we wanted to apply and enforced them via a pre-commit hook and a dedicated GitHub Actions workflow. This ensures that our dbt codebase remains consistent with our style guide.
  • Creating our own dbt utils package: Even though dbt has a very rich ecosystem, there were quite a few areas in which we wanted to implement Zendesk-specific helper macros and generic tests, both to ensure consistency and to simplify things for Zendesk’s dbt users. For this, we built an internal dbt utils package that’s imported into our dbt projects (see the sketch below).
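
To give an idea of what such a package can contain, below is a hypothetical generic test of the kind an internal utils package might ship; the test name and logic are illustrative, not taken from our actual package.

```sql
-- macros/generic_tests/not_empty_string.sql (hypothetical generic test)
{% test not_empty_string(model, column_name) %}
    -- The test fails if any row contains an empty or whitespace-only string
    select *
    from {{ model }}
    where trim({{ column_name }}) = ''
{% endtest %}
```

Once the package is installed, models can reference such tests from their schema files just like dbt’s built-in generic tests.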

Overall, this is an area in which we’re still exploring different paths and potential enhancements (we’re currently experimenting with packages like dbt-checkpoint and dbt-coverage). Still, the golden rule here is to automate all that can be automated, and then some.

dbt, six months later — and beyond

We started our dbt journey more than a year ago and have been running dbt in production for over six months. Thanks to it, we now have a dedicated, governed data transformation pattern that relies on dbt’s core features, resulting in scalable data transformations that remain consistent with our newly defined standards. Additionally, Airflow is back to being just an orchestrator, and we’re able to deliver new data products at a much faster pace than with our legacy pattern.

The diagram below details the steps of our current data transformation development and execution process, showing how the principles discussed above (like the separation between orchestration and transformation, developer-specific sandboxes, and automated CI checks) are implemented in practice.

Our current dbt architecture

In the upcoming two articles, we’ll talk about the more practical points regarding how we moved a large number of our legacy pipelines to dbt, and the technical capabilities we worked on to supercharge dbt (like generating staging models dynamically via a Python package, running dbt commands as CI actions, and implementing our own auditing framework).

The second article of the series, Supercharging dbt with Dynamic Stage, is now available.

Acknowledgments

The work discussed in this article was only possible thanks to the collaborative effort of ED&A’s data curation team: Akshay Agrawal, Bhavin Patel, Chad Isenberg, Jim Dow, Mahdi Karabiben, Niral Patel, Rudrakanta Ghosh, Ryan Piaskowy, Samantha Kirlin, Suhas Jangoan, Sumeet Gaglani, and Vasilii Surov.

We’re hiring!

If you’re interested in working on similar projects, check out our open roles and join us as we build Zendesk’s new data platform.
