Increasing experimentation accuracy and speed by using control variates
At Etsy, we strive to nurture a culture of continuous learning and rapid innovation. To ensure that new products and functionalities built by teams — from polishing the look and feel of our app and website, to improving our search and recommendation algorithms — have a positive impact on Etsy’s business objectives and success metrics, virtually all product launch decisions are vetted based on data collected via carefully crafted experiments, also known as A/B tests.
With hundreds of experiments running every day on limited online traffic, our desire for fast development naturally calls for ways to gain insights as early as possible in the lifetime of each experiment, without sacrificing the scientific validity of our conclusions. Among other motivations, this need drove the formation of our new Online Experimentation Science team: a team of engineers and statisticians whose key focus areas include building more advanced and scalable statistical tools for online experiments.
In this article, we share details about our team’s journey to bring the statistical method known as CUPED to Etsy, and how it is now helping other teams make more informed product decisions, as well as shorten the duration of their experiments by up to 20%. We offer some perspectives on what makes such a method possible, what it took us to implement it at scale, and what lessons we have learned along the way.
Is my signal just noise?
In order to fully appreciate the value of a method like CUPED, it helps to understand the key statistical challenges that pertain to A/B testing. Imagine that we have just developed a new algorithm to help users search for items on Etsy, and we would like to assess whether deploying it will increase the fraction of users who end up making a purchase, a metric known as conversion rate.
A/B testing consists in randomly forming 2 groups — A and B — of users, such that users in group A are treated (exposed to the new algorithm, regarded as a treatment) while users in group B are untreated (exposed to the current algorithm). After measuring the conversion rates Y_{A} and Y_{B} from group A and group B, we can use their difference Y_{A} – Y_{B} to estimate the effect of the treatment.
There are essentially two facets to our endeavour — detection and attribution. In other words, we are asking ourselves two questions:
 Does the observed difference reflect the existence of a real effect?
 If there is a real effect, is it caused by the treatment?
Since our estimated difference is based only on a random sample of observations, we have to deal with at least two sources of uncertainty.
The first layer of randomness is introduced by the sampling mechanism. Since we are only using a relatively small subset of the entire population of users, attempting to answer the first question requires the observed difference to be a sufficiently accurate estimator of the unobserved population-wide difference, so that we can distinguish a real effect from a fluke.
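To make the sampling noise concrete, here is a minimal simulation sketch of repeated "A/A" comparisons, in which both groups share the same true conversion rate. The function name, group size, and base rate are illustrative assumptions, not figures from our experiments.

```python
import random
import statistics

def simulate_aa_differences(n_per_group=2000, base_rate=0.05, n_repeats=500, seed=0):
    """Simulate A/A tests: both groups have the same true conversion rate,
    so any observed difference is pure sampling noise."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_repeats):
        conversions_a = sum(rng.random() < base_rate for _ in range(n_per_group))
        conversions_b = sum(rng.random() < base_rate for _ in range(n_per_group))
        diffs.append(conversions_a / n_per_group - conversions_b / n_per_group)
    return diffs

diffs = simulate_aa_differences()
print(f"average observed difference: {statistics.mean(diffs):+.5f}")
print(f"spread (std) of differences: {statistics.stdev(diffs):.5f}")
```

Even with no real effect, individual runs show non-zero differences. A real effect can only be detected reliably if it is large relative to this spread.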
The other important layer of randomness comes from the assignment mechanism. Claiming that the effect is caused by the treatment requires groups A and B to be similar in all respects, except for the treatment that each group receives. As an illustrative thought experiment: pretend that we could artificially label each user as either “frequent” or “infrequent” based on how many times they have visited Etsy in the previous month. If, by chance, or rather mischance, a disproportionately large number of “frequent” users were assigned to group A (Figure 1), then it would call into question whether the observed difference in conversion rate is indeed due to an effect from the treatment, or whether it is simply due to the fact that the groups are dissimilar.
One solution to the attribution question is to exploit the randomization of the assignments, which guarantees that — except for the treatments received — groups A and B will become more and more similar in every way, on average, as their sample sizes increase. Going one step further, if we somehow understood how the type of a user (e.g. “frequent” or “infrequent”) informs their buying habit, then we could attempt to proactively adjust for group dissimilarities, and correct our naive difference Y_{A} – Y_{B} by removing the explainable contribution coming from apparent imbalances between groups A and B. This is where CUPED comes into play.
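As a toy illustration of this adjustment, the sketch below uses made-up counts in which the treatment has no effect at all, but group A happens to contain more "frequent" users. All numbers and helper names are hypothetical.

```python
# Hypothetical counts per user type: (number of users, number of purchasers).
# Within each user type, both groups convert at the same rate (no real effect).
group_a = {"frequent": (600, 60), "infrequent": (400, 8)}
group_b = {"frequent": (500, 50), "infrequent": (500, 10)}

def conversion_rate(group):
    users = sum(n for n, _ in group.values())
    buyers = sum(k for _, k in group.values())
    return buyers / users

def frequent_share(group):
    users = sum(n for n, _ in group.values())
    return group["frequent"][0] / users

def pooled_rate(user_type):
    """Conversion rate of one user type, pooling both groups together."""
    n = group_a[user_type][0] + group_b[user_type][0]
    k = group_a[user_type][1] + group_b[user_type][1]
    return k / n

naive_diff = conversion_rate(group_a) - conversion_rate(group_b)

# How much more "frequent" users convert, and how imbalanced the groups are.
rate_gap = pooled_rate("frequent") - pooled_rate("infrequent")
imbalance = frequent_share(group_a) - frequent_share(group_b)

# Remove the part of the difference that the imbalance explains.
adjusted_diff = naive_diff - imbalance * rate_gap

print(f"naive difference:    {naive_diff:.4f}")
print(f"adjusted difference: {adjusted_diff:.4f}")
```

In this contrived example the naive difference is entirely explained by the imbalance, so the adjusted difference vanishes up to floating-point rounding.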
What is CUPED?
CUPED is an acronym for Controlled experiments Using Pre-Experiment Data [1]. It is a method that aims to estimate treatment effects in A/B tests more accurately than simple differences in means. As reviewed in the previous section, we traditionally use the observed difference between the sample means
Y_{A} – Y_{B}
of two randomly-formed groups A and B to estimate the effect of a treatment on some metric Y of interest (e.g. conversion rate). As hinted earlier, one of the challenges lies in disentangling and quantifying how much of this observed difference is due to a real treatment effect, as opposed to misleading fluctuations due to comparing two subpopulations made of non-identical users. One way to render these latter fluctuations negligible is to increase the number of users in each group. However, the required sample sizes grow proportionally to the variance of the metric underlying the estimator Y_{A} – Y_{B}, which may be undesirably large in some cases and lead to prohibitively long experiments.
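The link between variance and sample size can be sketched with the textbook formula for a two-sided, two-sample z-test with equal group sizes. This is a standard approximation, not necessarily the exact calculation used on our platform. The function name, the 5% base rate, and the minimum detectable effect are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(variance, min_effect, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect `min_effect`
    with a two-sided z-test at level `alpha` and the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * variance / min_effect ** 2)

base_rate = 0.05                         # hypothetical conversion rate
variance = base_rate * (1 - base_rate)   # per-user variance of a binary outcome

n_raw = required_sample_size(variance, min_effect=0.005)
n_reduced = required_sample_size(0.8 * variance, min_effect=0.005)

print(f"users per group, original variance: {n_raw}")
print(f"users per group, 20% less variance: {n_reduced}")
```

Because the variance enters the formula linearly, any reduction in variance translates into the same relative reduction in required sample size.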
The key idea behind CUPED is not only to play with sample sizes, but also to explain parts of these fluctuations away with the help of other measurable discrepancies between the groups. The CUPED estimator can be written as
Y_{A} – Y_{B} – (X_{A} – X_{B}) β
which corrects the traditional estimator with an additional denoising term. This correction involves the respective sample means (X_{A} and X_{B}) of a well-thought-out vector of user attributes (so-called covariates and symbolized by X), and a vector β of coefficients to be specified. Our earlier example (Figure 1) involved a single binary covariate, but CUPED generalizes the reasoning to multidimensional and continuous variables. Intuitively, the correction term aims to account for how much of the difference in Y is not due to any effect of the treatment, but rather due to differences in other observable attributes (X) of the groups.
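The estimator above can be written out directly. The following sketch takes β as given and uses tiny made-up data; the helper names `column_means` and `cuped_estimate` are ours, for illustration only.

```python
from statistics import fmean

def column_means(rows):
    """Mean of each covariate, given one tuple of covariates per user."""
    return [fmean(column) for column in zip(*rows)]

def cuped_estimate(y_a, y_b, x_a, x_b, beta):
    """Compute Y_A - Y_B - (X_A - X_B) beta from per-user data."""
    naive_diff = fmean(y_a) - fmean(y_b)
    mean_x_a, mean_x_b = column_means(x_a), column_means(x_b)
    correction = sum(
        (a - b) * coef for a, b, coef in zip(mean_x_a, mean_x_b, beta)
    )
    return naive_diff - correction

# Made-up data: outcome = purchased or not; covariates = (past visits, past purchaser).
y_a = [1, 0, 1, 1]
y_b = [0, 0, 1, 0]
x_a = [(3, 1), (0, 0), (2, 1), (3, 1)]
x_b = [(1, 0), (0, 0), (2, 1), (1, 0)]
beta = (0.1, 0.2)  # assumed here; in practice beta is estimated from data

estimate = cuped_estimate(y_a, y_b, x_a, x_b, beta)
print(f"CUPED estimate: {estimate:.3f}")
```

Here the naive difference of 0.5 is reduced by a correction of 0.2, because group A has visibly "stronger" covariates than group B.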
By choosing X as a vector of pre-experiment variables (collected prior to each user’s entry into the experiment), we can ensure that the correction term added by CUPED does not introduce any bias. Additionally, the coefficient β can be judiciously optimized so that the variance of the CUPED estimator becomes smaller than the variance of the original estimator, by a reduction factor that relates to the correlation between Y and X. In simple terms, the more information on Y we can obtain from X, the more variance reduction we can achieve with CUPED. In the context of A/B testing, smaller variances imply smaller sample size requirements, hence shorter experiments. Correspondingly, if sample sizes are kept fixed, smaller variances enable larger statistical power for detecting effects (Figure 2).
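For readers who want the algebra, here is a short derivation in the simplest setting of a single covariate and a fixed coefficient β. The shorthand Δ_Y, Δ_X, β* and ρ is introduced here for convenience. The first line relies on X being measured before the experiment and on assignments being randomized, so that the two groups have the same expected covariate mean.

```latex
\begin{aligned}
\Delta_Y &= Y_A - Y_B, \qquad \Delta_X = X_A - X_B, \qquad \mathbb{E}[\Delta_X] = 0 \\
\mathbb{E}[\Delta_Y - \beta\,\Delta_X] &= \mathbb{E}[\Delta_Y] \\
\operatorname{Var}(\Delta_Y - \beta\,\Delta_X) &= \operatorname{Var}(\Delta_Y) - 2\beta\,\operatorname{Cov}(\Delta_Y, \Delta_X) + \beta^2\,\operatorname{Var}(\Delta_X) \\
\beta^{*} &= \frac{\operatorname{Cov}(\Delta_Y, \Delta_X)}{\operatorname{Var}(\Delta_X)} \\
\operatorname{Var}(\Delta_Y - \beta^{*}\,\Delta_X) &= (1 - \rho^2)\,\operatorname{Var}(\Delta_Y), \qquad \rho = \operatorname{Corr}(\Delta_Y, \Delta_X)
\end{aligned}
```

The second line shows that the correction leaves the expectation unchanged, and the last line shows that the variance shrinks by a factor of 1 − ρ², so a covariate that is uncorrelated with the outcome yields no gain.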
The benefits and accessibility of CUPED (especially its quantifiable improvement over the traditional estimator, its interpretability, and the simplicity of its mathematical derivation) explain its popularity and widespread adoption by other online experimentation platforms [2, 3, 4, 5].
Implementing CUPED at Etsy
The implementation of CUPED at scale required us to construct a brand new pipeline (Figure 3) for data processing and statistical computation. Our pipeline consists of 3 main steps:
 Retrieving (or imputing) pre-experiment data for all users.
 Computing CUPED estimators for each group.
 Performing statistical tests (t-tests) using CUPED estimators.
When a user enters an experiment for the first time, we attempt to fetch the user’s most recent historical data from the preceding few weeks. The window length was chosen to strike a balance: far enough back in time for historical data to exist, but not so far back as to render such pre-experiment data unpredictive of the in-experiment outcomes. This step involves some careful engineering in order to retrieve (and possibly reconstruct) the historical data from the pre-experiment period at the level of each individual user.
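This article does not detail how missing pre-experiment data is imputed, so the following is only one plausible sketch, not a description of our pipeline. It fills missing values with the mean of the observed ones and adds a "was missing" indicator as an extra covariate. The function name is hypothetical.

```python
from statistics import fmean

def impute_pre_experiment(values):
    """Replace missing covariate values (None) with the mean of observed values,
    and pair each value with an indicator flagging that it was imputed.

    `values` should cover users from BOTH groups together, so that the fill
    value does not depend on which treatment a user received."""
    observed = [v for v in values if v is not None]
    fill = fmean(observed) if observed else 0.0
    return [
        (fill if v is None else v, 1.0 if v is None else 0.0)
        for v in values
    ]

# Past-month visit counts; None marks a user with no history (e.g. a new user).
past_visits = [2.0, None, 4.0]
print(impute_pre_experiment(past_visits))
```

Whatever rule is chosen, it must only use information available before the experiment, otherwise the correction term could pick up part of the treatment effect.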
Once the pre-experiment variables are retrieved and formatted, we may proceed with the CUPED adjustments. As it turns out, the optimal choice of coefficient β coincides with the ordinary least squares coefficient of a linear regression of Y on X. This relationship enables the efficient computation of CUPED estimators as residuals from linear regressions, which we implemented at scale using Apache Spark’s MLlib [6]. The well-established properties of linear regressions also allowed us to design nontrivial and interpretable simulations for unit testing.
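Our production code runs on Spark, but the idea fits in a few lines of NumPy. The sketch below is a small-scale illustration under stated assumptions: the data is simulated, β is estimated by regressing Y on X (with an intercept) over both groups pooled together, and the function names are hypothetical.

```python
import numpy as np

def ols_beta(y, x):
    """Slope coefficients from an ordinary least squares fit of y on x (with intercept)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    design = np.column_stack([np.ones(len(y)), x])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients[1:]  # drop the intercept

def cuped_adjust(y, x, beta):
    """Adjusted outcomes: y minus the part explained by the centered covariates."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    return y - (x - x.mean(axis=0)) @ beta

# Simulated experiment: two covariates, and a true treatment effect of 0.3.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=(n, 2))
treated = rng.random(n) < 0.5
y = 1.0 + x @ np.array([0.8, -0.5]) + 0.3 * treated + rng.normal(size=n)

beta = ols_beta(y, x)
y_adj = cuped_adjust(y, x, beta)

naive = y[treated].mean() - y[~treated].mean()
cuped = y_adj[treated].mean() - y_adj[~treated].mean()
formula = naive - (x[treated].mean(axis=0) - x[~treated].mean(axis=0)) @ beta

print(f"naive estimate: {naive:.3f}")
print(f"CUPED estimate: {cuped:.3f}")
print(f"variance ratio (adjusted / raw): {y_adj.var() / y.var():.2f}")
```

The difference in means of the adjusted outcomes matches the closed-form expression Y_{A} – Y_{B} – (X_{A} – X_{B}) β, which is what makes a simulation like this usable as a unit test.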
Since CUPED estimators can be expressed as simple differences in means (albeit using adjusted outcomes instead of raw outcomes), we were able to leverage our existing t-testing framework to compute corresponding p-values and confidence intervals. Besides the adjusted difference between the two groups, our pipeline also outputs the group-level estimates Y_{A} – (X_{A} – X_{B}) β and Y_{B} for further reporting and diagnosis. Note the intentional asymmetry of the expressions, as Y_{A} – X_{A} β and Y_{B} – X_{B} β would generally be biased estimators of the group-level means, unless the covariates were properly centered.
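To illustrate how adjusted outcomes slot into an ordinary two-sample test, here is a self-contained sketch on simulated data with a single covariate. It uses a large-sample normal approximation instead of an exact t-distribution, and every name and parameter in it is an assumption made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance

def two_sample_test(a, b, confidence=0.95):
    """Unequal-variance two-sample test (large-sample normal approximation).
    Returns the difference in means, its confidence interval, and a p-value."""
    diff = fmean(a) - fmean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value

# Simulated data: one pre-experiment covariate, true treatment effect of 0.1.
rng = random.Random(1)
n = 5000
x_a = [rng.gauss(0, 1) for _ in range(n)]
x_b = [rng.gauss(0, 1) for _ in range(n)]
y_a = [0.1 + 0.8 * x + rng.gauss(0, 1) for x in x_a]
y_b = [0.8 * x + rng.gauss(0, 1) for x in x_b]

# With a single covariate, the OLS slope is cov(x, y) / var(x), pooled over groups.
x_all, y_all = x_a + x_b, y_a + y_b
mean_x, mean_y = fmean(x_all), fmean(y_all)
beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_all, y_all)) / sum(
    (x - mean_x) ** 2 for x in x_all
)

# Adjusted outcomes, centered on the pooled covariate mean.
adj_a = [y - beta * (x - mean_x) for x, y in zip(x_a, y_a)]
adj_b = [y - beta * (x - mean_x) for x, y in zip(x_b, y_b)]

raw = two_sample_test(y_a, y_b)
cuped = two_sample_test(adj_a, adj_b)

# Asymmetric group-level estimates, as described above.
group_a_estimate = fmean(y_a) - (fmean(x_a) - fmean(x_b)) * beta
group_b_estimate = fmean(y_b)

print(f"raw:   diff={raw[0]:.3f}, CI width={raw[1][1] - raw[1][0]:.3f}")
print(f"CUPED: diff={cuped[0]:.3f}, CI width={cuped[1][1] - cuped[1][0]:.3f}")
```

The difference between the two group-level estimates equals the CUPED difference in adjusted means, while the confidence interval is narrower than the one built from raw outcomes.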
Impact and food for thought
Overall, CUPED leads to meaningful improvements across our experiments. However, we observe varying degrees of success, e.g. when comparing different types of pages (Figure 4). This can be explained by the fact that different pages may have different amounts of available pre-experiment data, with different degrees of informativeness (e.g. some pages may be more prone to be seen by newer users, on whom we may not have much historical information).
In favorable cases, our out-of-the-box CUPED implementation can reduce variances by up to 20%, thus leading to narrower confidence intervals and shorter experiment durations (Figure 5). In more challenging cases where pre-experiment data is largely missing or uninformative, the correction term from CUPED becomes virtually 0, making CUPED estimators revert to their non-CUPED counterparts and hence yield no reduction in variance, but no substantial increase either.
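As a back-of-the-envelope check of what a given variance reduction buys, the sketch below assumes that confidence interval width scales with the square root of the variance and that required sample size scales linearly with it. The function name is illustrative.

```python
from math import sqrt

def cuped_gains(variance_reduction):
    """Translate a relative variance reduction into relative CI width
    and relative required sample size (both as ratios to the baseline)."""
    ci_width_ratio = sqrt(1 - variance_reduction)
    sample_size_ratio = 1 - variance_reduction
    return ci_width_ratio, sample_size_ratio

ci_ratio, n_ratio = cuped_gains(0.20)
print(f"confidence intervals: {1 - ci_ratio:.1%} narrower")
print(f"required sample size: {1 - n_ratio:.1%} smaller")
```

A 20% variance reduction therefore narrows confidence intervals by roughly 10.6%, and, at a constant rate of incoming traffic, shortens the time needed to reach a given sample size by about 20%.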
On the engineering side of things, one of the lessons we learned from implementing CUPED is the importance of producing and storing experiment data at the appropriate granularity level, so that the retrieval of pre-experiment data can be done efficiently and in a replicable fashion. Scalability also becomes a key desideratum as we expand the application of CUPED to more and more metrics.
Another challenge we overcame was ensuring a smooth delivery of CUPED, both in terms of user experience and communication. To this end, we conducted several user research interviews at different stages of the project, in order to inform our implementation choices and make certain that they aligned with the needs of our partners from the Analytics teams. Integrating new CUPED estimators into Etsy’s existing experimentation platform (and thus discontinuing their long-established non-CUPED counterparts) was done after careful UX and UI considerations, by putting thought into the design and following a meticulous schedule of incremental releases. Our team also invested a lot of effort into creating extra documentation and resources to anticipate possible concerns or misconceptions, as well as to help users better understand what to expect (and, equally importantly, what not to expect) from CUPED.
Finally, from a methodological standpoint, an interesting reflection comes from noticing that CUPED estimators can achieve smaller variances than their non-CUPED counterparts, at essentially no cost in bias. The absence of any bias-variance trade-off may make one feel skeptical of the seemingly one-sided superiority of CUPED, as one may often hear that there is no such thing as a free lunch. However, it is insightful to realize that … this lunch is not free.
In fact, the conceptual dues that we are paying to reap CUPED’s benefits are at least twofold. First, we are borrowing information from additional data. Although the in-experiment sample size required by CUPED is smaller compared to its non-CUPED rival, the total amount of data effectively used by CUPED (when combining both in-experiment and pre-experiment data) may very well be larger. That cost is somewhat hidden by the fact that we are organically collecting pre-experiment data as a by-product of our natural experimentation cycle, but it is an important cost to acknowledge nonetheless. Second, CUPED estimators are computationally more expensive than their non-CUPED analogues, since their linear regressions induce additional costs in terms of execution time and algorithmic complexity.
All this to say: the increased accuracy of CUPED is the fruit of sensible efforts that warrant thoughtful considerations (e.g. in the choice of covariates) and realistic expectations (i.e. not every experiment is bound to magically benefit).
We hope that our work on CUPED can serve as an inspiring illustration of the valuable synergy between engineering and statistics.
Acknowledgements
We would like to give our warmest thanks to Alexandra Pappas, MaryKate Guidry, Ercan Yildiz, and Lushi Li for their help and guidance throughout this project.
References
[1] Deng A., Xu Y., Kohavi R., Walker T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data.
[2] Xie H., Aurisset J. (2016). Improving the sensitivity of online controlled experiments: case studies at Netflix.
[3] Jackson S. (2018). How Booking.com increases the power of online experiments with CUPED.
[4] Kohavi R., Tang D., Xu Y. (2020). Trustworthy online controlled experiments: a practical guide to A/B testing.
[5] Li J., Tang Y., Bauman J. (2020). Improving experimental power through control using predictions as covariate.
[6] Apache Spark. MLlib [spark.apache.org/mllib].
This is a great post. The one thing that’s worth pointing out is that the authors of the original CUPED article present this method as if it were novel. It is not. This is just a special case of ANCOVA, which is a classic method in the statistical analysis of experiments that researchers in the social, behavioral and medical sciences have been using long before the internet existed and the term RCT was replaced with ‘A/B’ testing.