Building a Foundation for More Flexible A/B Testing: Applications of Interim Monitoring to Large Scale Data
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 412–427
Pub. online: 21 April 2023
Type: Statistical Data Science
Open Access
Received
15 December 2022
Accepted
17 April 2023
Published
21 April 2023
Abstract
Error spending functions and stopping rules have become powerful tools for conducting interim analyses. Interim analysis is broadly desired not only in traditional clinical trials but also in A/B tests. Although many papers have summarized error spending approaches, limited work in the context of large-scale data helps identify the “optimal” boundary. In this paper, we summarize fifteen boundaries, formed from five error spending functions that each allow early termination for futility, for a difference, or for both, alongside a fixed sample size design without interim monitoring. The simulation is based on a practical A/B testing problem comparing two independent proportions. We examine sample sizes from 500 to 250,000 per arm to reflect the different settings in which A/B testing may be used. The choice of optimal boundary is summarized using a proposed loss function that assigns different weights to three quantities: the expected sample size under a null experiment with no difference between variants, the expected sample size under an experiment with a difference between the variants, and the maximum sample size needed if the A/B test does not stop early at an interim analysis. Results are presented for simulation settings based on adequately powered, under-powered, and over-powered designs, with recommendations for selecting the “optimal” design in each setting.
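As an illustration of what an error spending function does, the sketch below evaluates two widely used Lan-DeMets forms: an O'Brien-Fleming-type function and a Pocock-type function. The abstract does not name the five functions studied, so these two are assumed examples rather than the paper's own set, and the function names, the two-sided alpha of 0.05, and the four equally spaced looks are all hypothetical choices made for this sketch.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def obf_type_spend(t, alpha=0.05):
    """Cumulative two-sided alpha spent at information fraction t
    under the Lan-DeMets O'Brien-Fleming-type spending function."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction t must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / math.sqrt(t)))


def pocock_type_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t
    under the Lan-DeMets Pocock-type spending function."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction t must be in (0, 1]")
    return alpha * math.log(1.0 + (math.e - 1.0) * t)


def incremental_spend(spend_fn, fractions, alpha=0.05):
    """Alpha newly spent at each look, given increasing information fractions."""
    cumulative = [spend_fn(t, alpha) for t in fractions]
    previous = [0.0] + cumulative[:-1]
    return [c - p for c, p in zip(cumulative, previous)]


if __name__ == "__main__":
    looks = [0.25, 0.50, 0.75, 1.00]  # assumed: four equally spaced analyses
    for name, fn in [("OBF-type", obf_type_spend), ("Pocock-type", pocock_type_spend)]:
        spent = incremental_spend(fn, looks)
        print(name, [round(s, 5) for s in spent])
```

Both functions spend the full alpha by the final analysis. The O'Brien-Fleming-type form spends very little at early looks, while the Pocock-type form spends it more evenly. That difference in early spending is the kind of trade-off between early stopping and maximum sample size that a weighted loss function is meant to capture.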
Supplementary material
All tables and figures are uploaded as Supplementary Materials.