Code as Craft

Photo by OhSewBootiful

Faster ML Experimentation at Etsy with Interleaving

At Etsy, our product and machine learning (ML) teams are constantly working to make improvements to the experience of our buyers and sellers. The innovations they produce have to be tested, to validate that they do what we hope they'll do. When introducing a product or algorithm change, a team runs an experiment with online traffic to assess whether it produces an improvement over the current user experience, as measured by key indicators. Often these experiments are “gold standard” A/B tests. But in certain cases we’ve also worked with a lesser-known experimental design called interleaving. For ML models that produce ordered results, interleaving can surface user preferences on 10% or less of the traffic needed to run an equivalent A/B test, allowing our teams to experiment, learn, and iterate faster. In this post, we’ll explain how interleaving works, share how we implemented it, and walk through our process for validating its performance.

Measuring the impact of new algorithms

Consider how we might test a new algorithm (or “model” or “ranker”) built to produce search results (a set of listings in a certain order). In an A/B test, visitors are randomly bucketed into one of two groups. Group A (control) is served the results produced by the old algorithm and group B (variant) is served the results produced by the new algorithm. We calculate average behaviors of the control and variant groups separately, and run a statistical comparison. If the impact of the new ranker is significantly positive, we often roll it out to all visitors. In an interleaving test, however, we aim to detect preferences for a ranker at the level of an individual visitor rather than comparing average behaviors of two groups seeing distinct experiences.

To measure which of the two algorithms a visitor prefers, interleaving runs each search query through both algorithms and presents both sets of ordered results at once, measuring which ranker draws more engagement. To present both results at once, we weave them together like two halves of a deck of cards being riffle-shuffled. To the visitor, this is nothing but a standard search result list. But we keep track of which result comes from which model, and if the visitor makes a purchase, that gives us a bit of data about their preferences. After surfacing these interleaved search results to a large number of visitors, we learn whether these signs of preference show up more often in favor of one model than the other.

A/B test and interleaving experimental design comparison
In an A/B test, each user only sees the result set from one of two algorithms. In an interleaving test, every user sees a constructed result set combining the two algorithms’ result sets. Listing images from Etsy shops terrafirma79, copperest, KirovNixie, and MDGdesignNL.

In an interleaving experiment, we do not randomize visitors to different experiences as we do in an A/B test. We do need to include some randomness, however, to ensure that listings from one ranker don’t get preferential placement, which might skew our data. In particular, we use a variation of team-draft interleaving, in which each ranker provides one listing for every pair of listings in the combined ranking, and we randomize which ranker’s listing is placed first in each pair. If a listing from one ranker has already appeared in the interleaved list (attributed to the other ranker), it is skipped and the next available new listing is included.
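
To make the procedure concrete, here is a minimal sketch of a team-draft-style interleave in Python. The function name and details are illustrative rather than our production implementation: it weaves two ranked lists together, randomizing the order within each pair and skipping listings already attributed to the other ranker.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length):
    """Weave two ranked lists into one, recording which ranker
    contributed each position (simplified team-draft sketch)."""
    rankings = {"A": ranking_a, "B": ranking_b}
    next_idx = {"A": 0, "B": 0}        # next candidate position in each ranking
    interleaved, credit, used = [], [], set()

    while len(interleaved) < length:
        # Randomize which ranker's listing is placed first in each pair.
        pair_order = ["A", "B"]
        random.shuffle(pair_order)
        placed_any = False
        for ranker in pair_order:
            if len(interleaved) >= length:
                break
            ranking, i = rankings[ranker], next_idx[ranker]
            # Skip listings already contributed by (and attributed to) the other ranker.
            while i < len(ranking) and ranking[i] in used:
                i += 1
            if i < len(ranking):
                interleaved.append(ranking[i])
                credit.append(ranker)
                used.add(ranking[i])
                next_idx[ranker] = i + 1
                placed_any = True
        if not placed_any:
            break  # both rankings exhausted
    return interleaved, credit

# e.g. team_draft_interleave(["x", "y", "z", "w"], ["y", "q", "x", "r"], 4)
# might return (["x", "y", "q", "z"], ["A", "B", "B", "A"]).
```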

Interleaving is fast

The primary benefit of interleaving over A/B testing is its speed. Those familiar with the difference between paired and unpaired t-tests may see a conceptual parallel here. A/B tests, like unpaired t-tests, require a large sample size to overcome unit-to-unit variance before comparisons between groups become meaningful. Interleaving, like a paired t-test, largely removes the impact of that variance by measuring differences within each unit, efficiently isolating the effect of interest.
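
A toy simulation (synthetic numbers, not Etsy data) illustrates the parallel: with large visitor-to-visitor variance and a small true difference between rankers, a paired comparison detects the effect at a sample size where an unpaired comparison does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500

# Synthetic per-visitor engagement: large visitor-to-visitor variance,
# small true improvement from ranker B.
baseline = rng.normal(10.0, 5.0, n)            # unit-to-unit variance
a = baseline + rng.normal(0.0, 1.0, n)         # engagement under ranker A
b = baseline + 0.3 + rng.normal(0.0, 1.0, n)   # ranker B is slightly better

# Unpaired (A/B-style) comparison: the baseline variance swamps the small effect.
print(stats.ttest_ind(a, b))

# Paired (interleaving-style) comparison: the baseline cancels within each
# visitor, so the same effect is far easier to detect at this sample size.
print(stats.ttest_rel(a, b))
```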

In comparisons between A/B and interleaving experiments, we found that interleaving reaches statistical power with 10X to 100X less traffic, and thus can answer the same question 10X to 100X faster. That improvement translates into dozens of experiment-days saved per year, making room for teams to experiment on models that would otherwise not have access to testing traffic.

Interleaving is not a simple replacement for A/B testing, however; there are important things it cannot do. We can only interleave ordered result sets, such as those produced by ranking algorithms. We can’t use interleaving to test the impact of a different-colored background, for example, or whether a notification drives a certain behavior. Nor can we test differences in latency between models as we do with A/B tests, because interleaving has to wait for the slower algorithm to generate results before weaving the two ranked lists together.

What does it take to implement interleaving?

The infrastructure we built to support interleaving experiments is composed of two parts: a thin redirection layer we call the interleaver and an offline result calculator.

Interleaving at Etsy architecture diagram
The interleaver splits a user’s search request into two requests, which go through one or the other of two configurations of the search pipeline. It takes the listings returned from each ranker and weaves them together. Which listing came from which ranker is recorded for internal use. When the user makes a purchase from one of the listings, the attribution job credits the ranker responsible for it, and those credits are aggregated across the experiment and analyzed to assess overall user preferences.

The interleaver acts as a branching component in front of our existing search pipelines. It automatically splits a visitor’s search request into two identical requests, sending them over two search pipelines configured according to the two variants our test is meant to compare. The two requests are run through their pipelines in parallel. The interleaver collects the results from the two pipelines and weaves them into the single, unified list returned to the client. It also tracks which variant is responsible for which listing in the final result and logs that extra information.
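
As a rough sketch (the pipeline callables and result-list length below are placeholders, not our actual service interfaces), the branching step might look like this:

```python
from concurrent.futures import ThreadPoolExecutor

def interleaved_search(query, control_pipeline, variant_pipeline,
                       interleave_fn, length=48):
    """Fan one search request out to two pipeline configurations in parallel,
    then weave the two ranked lists into a single result set."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        control_future = pool.submit(control_pipeline, query)
        variant_future = pool.submit(variant_pipeline, query)
        control_results = control_future.result()
        variant_results = variant_future.result()

    # Weave the two lists (e.g. with team_draft_interleave above) and return
    # the per-listing attribution so it can be logged for offline analysis.
    return interleave_fn(control_results, variant_results, length)
```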

The offline result calculator combines search event data and user action data to establish user preferences. Just as we do for A/B tests, we schedule a daily batch attribution job, written in Spark, on a Kafka cluster to associate user interactions (for example, a purchase) with the queries that caused those actions. Unlike A/B tests, we also need to access the listing-level interleaving data logged by the interleaver. Once we tie a purchase to a query, we use that additional data to establish which variant gets credit for the purchase. An in-house statistical package aggregates these individual attributions across queries and users, and turns them into experiment-level estimates of visitor preferences. These are the results shown to the analysts and engineers running the experiments, alongside bootstrap-driven measures like confidence intervals to help determine their statistical significance.
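
The aggregation step can be pictured with a simplified stand-in for that statistical package (plain NumPy here, rather than the actual Spark job and in-house tooling): given per-purchase credits from the attribution job, estimate the preference for the variant ranker along with a bootstrap confidence interval.

```python
import numpy as np

def interleaving_preference(credits, n_boot=10_000, seed=0):
    """Estimate the preference for ranker 'B' from per-purchase credits,
    with a percentile bootstrap confidence interval."""
    wins_b = np.array([1.0 if c == "B" else 0.0 for c in credits])
    preference = wins_b.mean() - 0.5   # 0 means no detected preference

    rng = np.random.default_rng(seed)
    resamples = rng.choice(wins_b, size=(n_boot, len(wins_b)), replace=True)
    boot = resamples.mean(axis=1) - 0.5
    lower, upper = np.percentile(boot, [2.5, 97.5])
    return preference, (lower, upper)

# e.g. 120 purchases credited to B and 100 to A:
pref, ci = interleaving_preference(["B"] * 120 + ["A"] * 100)
```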

Validation

When introducing interleaving at Etsy, we focused on testing the system’s output and ensuring alignment with teams conducting experiments.

We began by validating our newly built system in production. We started with A-vs-A tests, in which two identical copies of our control algorithm’s results were interleaved together, producing an exact copy of the original results. The system still tracked which result set each listing was attributed to, but because the sets were identical and the attributions random, we expected to find a similar number of ‘wins’ accruing to either side of the test. The neutral results we saw confirmed that the interleaver was not introducing any systematic biases. We also ran a longer-term A-vs-A test to ensure that the system’s false positive rate aligned with our expectations, and to check for seasonality effects.
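
One way to formalize that A-vs-A check (a sketch, not our exact procedure) is a simple binomial test on the win counts: with identical rankers, credits should split roughly 50/50, and a significant imbalance would point to a positional or attribution bias.

```python
from scipy import stats

def a_vs_a_is_neutral(wins_a, wins_b, alpha=0.05):
    """Return True if the A-vs-A win split is consistent with no preference."""
    result = stats.binomtest(wins_a, wins_a + wins_b, p=0.5)
    return result.pvalue >= alpha

# e.g. a_vs_a_is_neutral(1023, 998) -> True for an unbiased interleaver.
```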

A central concern when introducing any new system is that it could slow down the site, so we needed a way to detect performance hits caused by interleaving. We did this by embedding each interleaving test inside an A/B test: the A arm featured results from a standard control algorithm, while the B arm showed visitors an interleaved result set. If the process of interleaving took too long, we would expect the A/B test to trigger built-in alerts for increased latency and possibly show degradation in key outcomes as well. Our initial A-vs-A interleaving tests used this A/B design (in which the B group was simply A interleaved with itself), and we saw no negative impact. Even as we moved into using interleaving for real tests, we retained this ‘A/interleaving’ design, assigning 5% of the test’s traffic to the control and 95% to the interleaved results, as a safeguard against unanticipated problems driven by the interleaving system.

Traffic allocation for interleaving tests at Etsy
Some fraction of search users are allocated to an interleaving experiment. Of those users, a small portion are shown results from the control experience, while the rest are shown the interleaved control and variant results instead. A latency comparison between those groups surfaces any slowdowns caused by the interleaving process. Separately, logged interactions with the interleaved results, like purchases and listing views, determine user preferences between the control and variant.

After completing the A-vs-A tests, we conducted a carefully controlled set of what might be called A-vs-worse-A tests. Since our algorithm is designed to optimize the ordering of results, we expected that adding artificial noise, such as flipping pairs of listings, would produce degraded outcomes: the more severely the preferred ordering was disrupted, the more those outcomes should suffer. That is just what we found when interleaving “real” results against a range of artificially lower-quality ones: the more noise we added in the worse-A branch, the stronger the preference for the original ranker the interleaving system detected from our visitors. Not only were these tests encouraging in themselves, they also helped us calibrate the method’s sensitivity, and thus establish the sample sizes needed to power an experiment.
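
The “worse-A” rankings can be thought of as the output of a simple noise procedure like the one sketched below (the swap probability and pairing scheme here are illustrative, not the exact perturbation we used):

```python
import random

def degrade_ranking(ranking, swap_prob, seed=None):
    """Produce an artificially worse ranking by randomly swapping adjacent
    pairs of listings; a higher swap_prob disrupts the ordering more."""
    rng = random.Random(seed)
    degraded = list(ranking)
    for i in range(0, len(degraded) - 1, 2):
        if rng.random() < swap_prob:
            degraded[i], degraded[i + 1] = degraded[i + 1], degraded[i]
    return degraded

# Interleaving the original ranking against degrade_ranking(original, 0.75)
# should surface a stronger preference than against degrade_ranking(original, 0.25).
```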

At that point we were ready for a partner team, one with relevant applications and the leadership buy-in for our first product-based test. After all, it is not enough to tout the statistical and scientific backing of a new method; a good solution to central problems must play well with business needs, operational procedures, and the people who will use it. We found such a partner team in the search ranking group, whose reranking models were similar to the artificial applications we had already trialed. We helped them understand the system and what it offered, and they provided feedback that informed changes we made to it. Initial tests on their products showed agreement with A/B results (while requiring far less traffic and runtime) and demonstrated the system’s positive return on investment, motivating continued work to improve the product and expand its capabilities, use cases, and customer base.

Conclusion

Online experimentation plays a central role in product development, and we’ve seen that we need not limit ourselves to just one method. Interleaving is an important new tool at Etsy, one that we’re continuing to extend and refine. The opportunity to do the work we describe here–evaluating new ideas and new experimental means, building systems and validating their usefulness–is the sort of thing that excites us as data scientists and ML engineers. If this is the kind of work that excites you too, consider joining us! We’re always looking for people who can bring fresh experience, perspective, and ideas to our teams. We encourage you to consider a career at Etsy, helping us craft the innovations that keep our marketplace vital and growing.