Supplementary Material

JDS

Journal of Data Science

ISSN: 1683-8602, 1680-743X

School of Statistics, Renmin University of China

JDS1099

10.6339/23-JDS1099

Statistical Data Science

Building a Foundation for More Flexible A/B Testing: Applications of Interim Monitoring to Large Scale Data

https://orcid.org/0000-0003-0381-6489

Wenru Zhou¹,∗, Miranda Kroehl², Maxene Meier², Alexander Kaizer¹

¹ 13001 E 17th Pl, Aurora, CO 80045, Department of Biostatistics and Informatics, University of Colorado, USA

² 6380 S Fiddlers Green Cir, Greenwood Village, CO 80111, Charter Communication, USA

∗Corresponding author. Email: wenru.zhou@cuanschutz.edu.

2023, Volume 21, Issue 2, Pages 412–427

Published 21 April 2023

Supplementary Material

All tables and Figures are uploaded as Supplementary Materials.

Received 15 December 2022; accepted 17 April 2023.

© 2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article under the CC BY license.

Error spending functions and stopping rules have become powerful tools for conducting interim analyses. Interim analysis is broadly desired not only in traditional clinical trials but also in A/B tests. Although many papers have summarized error spending approaches, little work has been done in the context of large-scale data to help find the “optimal” boundary. In this paper, we summarize fifteen boundaries, built from five error spending functions that allow early termination for futility, for a difference, or for both, together with a fixed sample size design without interim monitoring. The simulation is based on a practical A/B testing problem comparing two independent proportions. We examine sample sizes ranging from 500 to 250,000 per arm to reflect the different settings in which A/B testing may be used. The choice of optimal boundary is summarized using a proposed loss function that assigns different weights to three quantities: the expected sample size under a null experiment with no difference between variants, the expected sample size under an experiment with a difference between the variants, and the maximum sample size needed if the A/B test does not stop early at an interim analysis. Results are presented for simulation settings based on adequately powered, under-powered, and over-powered designs, with recommendations for selecting the “optimal” design in each setting.
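As an illustration of what an error spending function does, the sketch below implements two widely used Lan-DeMets type I error spending functions: the O'Brien-Fleming-type function α(t) = 2 − 2Φ(z_{α/2}/√t) and the Pocock-type function α(t) = α ln(1 + (e − 1)t), where t is the information fraction. The abstract does not name the five functions studied, so these two are assumed examples, not necessarily the ones used in the paper, and the function names, the default α = 0.05, and the four equally spaced looks are hypothetical choices. Futility (type II error) spending is not shown.

```python
from math import e, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse


def obf_type_spending(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: 2 - 2 * Phi(z_{alpha/2} / sqrt(t))."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction t must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 - 2.0 * _STD_NORMAL.cdf(z / sqrt(t))


def pocock_type_spending(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: alpha * ln(1 + (e - 1) * t)."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction t must be in (0, 1]")
    return alpha * log(1.0 + (e - 1.0) * t)


def incremental_alpha(spend, looks, alpha: float = 0.05):
    """Type I error newly spent at each look (differences of cumulative spend)."""
    cumulative = [spend(t, alpha) for t in looks]
    previous = [0.0] + cumulative[:-1]
    return [c - p for c, p in zip(cumulative, previous)]


if __name__ == "__main__":
    looks = [0.25, 0.50, 0.75, 1.00]  # hypothetical equally spaced interim looks
    for name, fn in [("OBF-type", obf_type_spending),
                     ("Pocock-type", pocock_type_spending)]:
        spent = incremental_alpha(fn, looks)
        print(name, [f"{a:.5f}" for a in spent])
```

Both functions spend the full α by t = 1; the O'Brien-Fleming-type function spends very little at early looks, whereas the Pocock-type function spends it more evenly, which is why the two lead to different early stopping behavior.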

Keywords: A/B testing; error spending function; interim monitoring; stopping rule

AMK and WZ were supported by NHLBI K01 HL151754.
