Semantic Search at Zendesk

Armin Oliya
Zendesk Engineering
7 min read · Mar 20, 2024

Co-authored by Miguel Aroca-Ouellette, with thanks to the many amazing Zendeskians who supported this work.

Zendesk Guide enables businesses to build a searchable knowledge base for their products and services. At Zendesk, we aim to streamline the information-seeking process for both support agents and end users. Our analysis suggests that even minor improvements in search relevance noticeably reduce operational costs for businesses. For agents, searching Guide is crucial to efficiently answering support tickets — improved search translates into increased productivity and quicker ticket resolutions. For end users, self-sufficiency in finding answers leads to greater satisfaction, increases ticket deflection, and allows agents to focus on more complex issues.

Our search infrastructure is powered by Elasticsearch and has proven to be fast, robust, and scalable. However, the default keyword search algorithm fails to capture relevance when there is little word overlap between the query and the documents. Semantic search addresses this by modeling the “meaning” of text, covering synonyms and more complex patterns that keyword matching would otherwise miss.

In 2023, we integrated semantic search capabilities into Guide search using off-the-shelf embedding models. This led to a significant improvement in search quality for our users. In this post, we delve into how we took this a step further, training custom models based on how users interact with Guide’s search functionality.

Solution Outline

There are a number of ways to offer semantic search; our focus here is on embedding models that map queries and documents into a shared vector space (bi-encoders). The mapping ensures that relevant query-document pairs are positioned closer together in vector distance than pairs that are less relevant. This approach was popularized by Dense Passage Retrieval (DPR) and is the blueprint for most recent work in the industry.
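
To make this concrete, here is a minimal sketch of bi-encoder scoring using the open-source sentence-transformers library. The model name and example texts are illustrative, not the ones powering Guide.

```python
# Minimal bi-encoder sketch; the model name is illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "how do I reset my password"
documents = [
    "Resetting your account password",
    "Updating your billing details",
]

# Queries and documents are mapped into the same vector space, so relevant
# pairs end up with a higher cosine similarity than less relevant ones.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(documents, convert_to_tensor=True)

print(util.cos_sim(query_emb, doc_embs))  # shape: (1, len(documents))
```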

Our solution consists of two steps. First, a set of candidate documents is extracted from Elasticsearch using the standard keyword matching algorithm. These documents are then re-ranked based on their proximity to the query in the embedding space.

Semantic Search re-scoring on top of keyword search results with Elasticsearch (ES)

The final score for ranking is a weighted average of the keyword match score and the cosine similarity score. This blending allows us to take both lexical and semantic similarity into account; our experiments showed that it leads to better results than either approach alone.
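
As an illustration, the sketch below re-ranks keyword-search candidates with a blended score. The blending weight and the min-max normalization of Elasticsearch scores are assumptions of this sketch; the production scheme and weights are tuned empirically.

```python
import numpy as np

def rerank(candidates, query_emb, alpha=0.5):
    """Re-rank keyword-search candidates with a blend of lexical and semantic scores.

    candidates: list of (doc_id, es_score, doc_emb) from the keyword stage.
    alpha: illustrative blending weight, not the tuned production value.
    """
    es_scores = np.array([score for _, score, _ in candidates], dtype=float)
    # Normalize ES scores to [0, 1] so both signals are on a comparable scale
    # (the normalization choice is an assumption of this sketch).
    span = es_scores.max() - es_scores.min()
    es_scores = (es_scores - es_scores.min()) / (span + 1e-9)

    cosine = np.array([
        np.dot(query_emb, emb) / (np.linalg.norm(query_emb) * np.linalg.norm(emb))
        for _, _, emb in candidates
    ])

    final = alpha * es_scores + (1 - alpha) * cosine
    return [candidates[i][0] for i in np.argsort(-final)]
```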

Training Custom Models

There are a number of public pretrained models available for semantic re-scoring. However, embedding models may not generalize well to new domains, and we benefit from fine-tuning them on in-domain data. This is especially true at Zendesk, where our clients come from different industries with a very diverse set of search queries and articles. For example, consider a new digital currency and the custom terminology around it, or a medical device and the issues that end users might search for. Furthermore, public models are usually trained on well-formed questions, but in practice most queries we receive are short, informal, or poorly written.

To fine-tune an embedding model, we need pairs of search queries and relevant documents. Manually annotating data to create these training samples is costly and not scalable. At Zendesk, we receive millions of search queries from our end users each day, so naturally we asked ourselves how we could use these queries and their clicks as a training signal.

Preparing the Data

The implicit feedback from a user clicking on a search result is known to be noisy and suffers from a number of biases, such as position bias: the tendency of users to click more on results that appear near the top of the search results page. We address this through a careful choice of aggregation, data transformations, and training loss.

To generate training data, we aggregate the clicked search results for each individual search query. These are then organized into triplets of the form (search query, clicked result, log of frequency); see the sketch after this list. This method of data aggregation offers several advantages:

  • The frequency of clicks provides an indication of relative relevancy of documents for a given query.
  • The use of the logarithmic function helps to reduce the variance in the data and makes the model more resilient against extreme values.
  • This approach is flexible and allows additional preprocessing of raw click counts before aggregation. For example, position bias can be mitigated by techniques like Inverse Propensity Scoring (IPS) before aggregation.
  • Training with aggregate data is more computationally efficient than processing individual click samples.
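
The sketch below shows this aggregation under illustrative assumptions: a click log of (query, document) events and a log(1 + n) transform of the raw counts.

```python
from collections import Counter
from math import log

def build_triplets(click_log):
    """Aggregate raw clicks into (query, clicked_doc, log_frequency) triplets.

    click_log: an iterable of (query, doc_id) click events (illustrative schema).
    Raw counts could first be debiased, e.g. with Inverse Propensity Scoring.
    """
    counts = Counter(click_log)  # (query, doc_id) -> click count
    return [
        (query, doc_id, log(1 + n))  # log damps extreme click counts
        for (query, doc_id), n in counts.items()
    ]
```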

Data Sampling

One challenge with click data is that it’s biased towards popular documents and their corresponding queries. We’ve found that the set of clicked documents associated with a given query is typically quite small. This is to be expected: there are usually only a few relevant documents for each query, and users tend to make their selections from the first page of results. However, a single document can attract thousands of distinct queries, since an article can answer a wide range of questions and the same question can be phrased in many different ways.

Naive data sampling would lead to a poor-quality training set, so we adopt a sampling approach that promotes document diversity: we sample one query at a time for each unique document, cycling through documents until each has a balanced set of queries.
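
A round-robin scheme like the sketch below captures the idea; the per-document query cap is an illustrative parameter, not our production setting.

```python
import random
from collections import defaultdict

def diverse_sample(triplets, queries_per_doc=5, seed=0):
    """Sample training triplets so that no single document dominates.

    One query is drawn per document per round, until each document has up
    to queries_per_doc queries (an illustrative cap).
    """
    rng = random.Random(seed)
    by_doc = defaultdict(list)
    for query, doc_id, weight in triplets:
        by_doc[doc_id].append((query, doc_id, weight))

    sampled = []
    for _ in range(queries_per_doc):
        for pool in by_doc.values():
            if pool:
                sampled.append(pool.pop(rng.randrange(len(pool))))
    return sampled
```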

Loss Function

There are many loss functions available for training embedding models, but the common goal is to reduce the distance between a query and its relevant documents while increasing its distance to less relevant documents. We experimented with several loss functions and found that Kullback-Leibler (KL) divergence loss performs best for our needs. KL divergence measures how one probability distribution differs from another and has a few attractive properties. First, it supports multiple relevant documents per query during training, eliminating the need for extra preprocessing. Second, it uses frequency information effectively by jointly modeling the relative popularity of all clicked documents for a query. In our case, we treat the normalized frequency distribution of documents for a given query as the target distribution. In-batch documents that aren’t associated with a given query are treated as irrelevant (negatives).
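
A minimal PyTorch sketch of this loss for a single query is shown below. The temperature scaling is an assumption of the sketch; documents with zero click frequency (in-batch negatives) contribute nothing to the target distribution.

```python
import torch
import torch.nn.functional as F

def kl_rank_loss(query_emb, doc_embs, click_freqs, temperature=0.05):
    """KL-divergence loss for one query against an in-batch document set.

    query_emb:   (d,) query embedding
    doc_embs:    (n, d) embeddings of all in-batch documents
    click_freqs: (n,) aggregated click frequencies; zero for documents not
                 clicked for this query (in-batch negatives)
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs) / temperature
    log_pred = F.log_softmax(sims, dim=-1)    # model's distribution (log space)
    target = click_freqs / click_freqs.sum()  # normalized click distribution
    # PyTorch's kl_div expects log-probabilities as input and probabilities
    # as target; zero-probability targets contribute zero loss.
    return F.kl_div(log_pred, target, reduction="sum")
```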

Handling Multiple Accounts

Ideally a single universal embedding model would perform well across all our customers’ help centers. The challenge we face is that the training data is highly imbalanced across accounts, with some accounts having orders of magnitude more search traffic than others. Addressing this requires careful data sampling across accounts.

An alternative is to train separate models for each help center using only their respective search traffic data. This method not only addresses the issue of data imbalance but also enhances transparency, because we can directly trace the model’s performance or any potential errors back to the specific training data used. It also simplifies data governance by using data from each account exclusively for its own purposes.

To facilitate training customer-specific models, we rely on adapters, which offer a lightweight approach to fine-tuning pretrained models. Instead of adjusting all of the model’s weights, adapters alter only a small portion of the parameters by adding compact modules between existing layers. During training, we train individual adapters using data from one customer at a time. When serving, the appropriate adapter for the given account is loaded and used for inference. If an account lacks sufficient click data to train an adapter, we fall back to the baseline embedding model.
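
Conceptually, a bottleneck adapter looks like the sketch below (a Houlsby-style module; the bottleneck size is illustrative). Only these small modules are trained per account, while the pretrained weights stay frozen.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A minimal bottleneck adapter, inserted after an existing sub-layer."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the module close to identity when the
        # projections are initialized with small weights, preserving the
        # pretrained model's behavior at the start of training.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```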

Results

To allow rapid experimentation, we created a hand-annotated test set. We carefully selected a number of accounts from diverse industries, including e-commerce, hardware manufacturing, and financial services. To ensure diversity, we curated a selection of queries for each account. We then relied on pooling results from a set of diverse retrieval models to come up with candidate documents for each query. Each query-document pair was then hand-annotated on a scale of 0 (irrelevant) to 4 (perfectly relevant).

Having this test set allowed us to verify the effectiveness of different search systems in terms of well-known metrics like Normalized Discounted Cumulative Gain (NDCG), which rewards a ranking system if it places relevant documents higher in the final ranking.
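
For reference, a common formulation of NDCG with graded relevance labels looks like the sketch below; whether this exponential-gain variant matches our exact setup is an assumption here.

```python
import math

def ndcg(ranked_relevances, k=10):
    """NDCG@k for one query; labels are graded 0 (irrelevant) to 4 (perfect),
    listed in the order the system ranked the documents."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```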

The improvements achieved with the fine-tuned models depend on the availability of search data for a given account. The NDCG gains typically range from 2 to 6 percentage points compared to the baseline model. Qualitatively, the improvements are very noticeable with long queries, as well as those containing specific product names.

In our online A/B tests, we observe that accounts experience up to a 9% improvement in Click-Through Rate (CTR) and up to a 14% improvement in Mean Reciprocal Rank (MRR). These quality gains come on top of the already strong baseline models for English. We will share results on multilingual search improvements in a future post.

Future Work

Our semantic search solution has resulted in significant improvements in the quality of search results. This has enabled end users to find answers to their queries faster, and empowered customer service agents to focus on tasks that require their expertise. We are excited about the results and are in the process of developing several improvements. These range from scaling our training and evaluation data with Large Language Models (LLMs), to devising more robust methods to integrate keyword and semantic search results.

This post is part of a series detailing how we implement semantic search at Zendesk. Stay tuned for more!
