Building a Foundation for More Flexible A/B Testing: Applications of Interim Monitoring to Large Scale Data

Zhou, Wenru; Kroehl, Miranda; Meier, Maxene; Kaizer, Alexander

doi:10.6339/23-JDS1099

Journal of Data Science

Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 412–427

Type: Statistical Data Science

Open Access

Received: 15 December 2022
Accepted: 17 April 2023
Published online: 21 April 2023

Abstract

Error spending functions and stopping rules have become powerful tools for conducting interim analyses. Interim analysis is broadly desired not only in traditional clinical trials but also in A/B tests. Although many papers have summarized error spending approaches, limited work has been done in the context of large-scale data to assist in finding the “optimal” boundary. In this paper, we summarize fifteen boundaries, built from five error spending functions that each allow early termination for futility, for difference, or for both, as well as a fixed sample size design without interim monitoring. The simulation is based on a practical A/B testing problem comparing two independent proportions. We examine sample sizes ranging from 500 to 250,000 per arm to reflect the different settings where A/B testing may be used. The choice of optimal boundary is summarized using a proposed loss function that applies different weights to three quantities: the expected sample size under a null experiment with no difference between variants, the expected sample size under an experiment with a difference between the variants, and the maximum sample size needed if the A/B test does not stop early at an interim analysis. Results are presented for simulation settings based on adequately powered, under-powered, and over-powered designs, with recommendations for selecting the “optimal” design in each setting.
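As an illustration of the error spending idea described in the abstract, the sketch below computes the cumulative two-sided type I error spent at each interim look under two widely used Lan-DeMets spending functions, an O'Brien-Fleming-type and a Pocock-type function. The abstract does not say which five spending functions the paper studies, so these two are an assumption chosen for illustration, as are the function names, the four equally spaced looks, and the overall level of 0.05. The `weighted_loss` helper assumes a simple weighted sum of the three sample size quantities named in the abstract; the paper's actual loss function may take a different form.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def obf_type_spend(t: float, alpha: float = 0.05) -> float:
    """Cumulative two-sided alpha spent at information fraction t
    under the Lan-DeMets O'Brien-Fleming-type spending function."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction t must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 - 2.0 * _STD_NORMAL.cdf(z / math.sqrt(t))


def pocock_type_spend(t: float, alpha: float = 0.05) -> float:
    """Cumulative alpha spent at information fraction t
    under the Lan-DeMets Pocock-type spending function."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction t must be in (0, 1]")
    return alpha * math.log(1.0 + (math.e - 1.0) * t)


def weighted_loss(n_null: float, n_alt: float, n_max: float,
                  weights: tuple[float, float, float]) -> float:
    """ASSUMED form (not taken from the paper): a weighted sum of the
    expected sample size under the null, the expected sample size under
    the alternative, and the maximum sample size."""
    w_null, w_alt, w_max = weights
    return w_null * n_null + w_alt * n_alt + w_max * n_max


# Hypothetical design: four equally spaced looks at overall alpha = 0.05.
for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t={t:.2f}  OBF-type={obf_type_spend(t):.5f}  "
          f"Pocock-type={pocock_type_spend(t):.5f}")
```

Both functions spend the full alpha by the final look (t = 1). The O'Brien-Fleming-type function spends very little early on, so stopping at an early look requires strong evidence, while the Pocock-type function spends alpha more evenly across looks.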

Supplementary material

All tables and figures are uploaded as Supplementary Materials.

© 2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

A/B testing; error spending function; interim monitoring; stopping rule

Funding

AMK and WZ were supported by NHLBI K01 HL151754.
