Building a Foundation for More Flexible A/B Testing: Applications of Interim Monitoring to Large Scale Data

The use of error spending functions and stopping rules has become a powerful tool for conducting interim analyses. The implementation of an interim analysis is broadly desired not only in traditional clinical trials but also in A/B tests. Although many papers have summarized error spending approaches, limited work has been done in the context of large-scale data that assists in finding the “optimal” boundary. In this paper, we summarize fifteen boundaries that consist of five error spending functions that allow early termination for futility, difference, or both, as well as a fixed sample size design without interim monitoring. The simulation is based on a practical A/B testing problem comparing two independent proportions. We examine sample sizes across a range of values from 500 to 250,000 per arm to reflect different settings where A/B testing may be utilized. The choices of optimal boundaries are summarized using a proposed loss function that incorporates different weights for the expected sample size under a null experiment with no difference between variants, the expected sample size under an experiment with a difference in the variants, and the maximum sample size needed if the A/B test did not stop early at an interim analysis. The results are presented for simulation settings based on adequately powered, under-powered, and over-powered designs with recommendations for selecting the “optimal” design in each setting.


Introduction
When presented with accumulating data over the course of an experiment, it is recognized that multiple testing during the experiment, for instance through interim monitoring, will lead to inflated type I error rates (Armitage et al., 1969). However, methodology for controlling type I error rates has been developed so that an experiment can be stopped early if there is strong evidence of some difference and/or futility during an interim analysis, and it is commonly applied to biomedical clinical trials. Ad hoc rules attempt to ensure that study operating characteristics (e.g., power and type I error rates) are maintained through the implementation of interim analyses (Friedman et al., 2015). Group sequential tests proposed by Pocock (1977), O'Brien and Fleming (1979), Wang and Tsiatis (1987), Demets and Lan (1994), and Jennison and Turnbull (1999) have all been incorporated into clinical research and maintain the desired study operating characteristics while incorporating interim evaluations of the data to determine if a study should stop early for futility (i.e., not detecting any effect), for a difference (i.e., finding superiority or inferiority), or both.
Ongoing evaluation of accumulating data is not uncommon in an industry setting for A/B testing, where the ability to make rapid decisions is paramount to company success. Kohavi et al. (2013) comprehensively discussed online A/B testing at large scale. In addition, Miller (2010, 2015), Koning et al. (2022), and Azevedo et al. (2020) discussed novel methods in A/B testing to preserve power, control the overall type I error rate, or accommodate special data distributions. However, the application of interim monitoring methods for controlling type I error rates is still not widespread in A/B testing in many companies, and the use of standard inference tools that do not account for repeated looks at the accumulating data can lead to incorrect conclusions.
One factor that might limit the utility of more sophisticated monitoring methods is that, within a single company, several dozen or even hundreds of experiments may be running at any given time, and human bandwidth inhibits the ability to apply customized design and analysis practices to each of these experiments. Another factor is that, though many novel statistical methods for A/B testing have been developed, they may be too complicated to be implemented on a large scale by non-statisticians. To overcome this hurdle and make recommendations for a scalable A/B testing framework with desired statistical properties, it is important to have a more complete understanding of the performance of standard methods on commonly encountered scenarios.
In this paper we first review frequently used sequential monitoring boundaries, statistical approaches to analyzing A/B tests, and general study design considerations in Section 2. Section 3 then presents the simulation set-up and a novel loss function to use in selecting the "optimal" A/B test design by considering 16 possible combinations of group sequential methods and stopping criteria. The results of the simulations with general recommendations are summarized in Section 4. We conclude with a brief discussion in Section 5.

Background
Sequential monitoring designs have been developed and applied in the context of clinical research studies where regulatory agencies require strict control of the type I error rate α (i.e., concluding an effect when there is none) while trying to achieve acceptable statistical power (i.e., the ability to detect an effect if one exists). In the following subsections, we discuss approaches developed for interim monitoring that we will further examine in simulation studies for optimal A/B test designs.

Reasons to Stop Early
There are many reasons one may wish to terminate a study early, including for safety and efficacy. In general, for studies that compare groups and wish to detect a difference (e.g., an A/B test) we consider three potential types of stopping rules to use in an interim analysis:
1. Only stop for some detectable difference: In this situation, at each interim analysis, we determine if we should stop the study because there is evidence of a difference between our two variants in the A/B test. This may be more descriptively presented as stopping either for superiority/benefit or inferiority/harm caused by one variant with respect to the other.
2. Only stop for futility: In this situation, at each interim analysis we determine if we should stop the study because there is evidence that we are unlikely to detect a difference between our two variants in the A/B test were the experiment to continue enrolling to its planned maximum sample size.
3. Stop for either a detectable difference or futility: In this situation, at each interim analysis, we could stop for either detecting some difference between variants or for futility to detect a difference based on the accumulating data within the experiment.

Methods for Interim Monitoring
Once one has considered "why" one wishes to stop an experiment early, one must select stopping boundaries that identify "how" this decision is made. The different approaches to boundaries described below represent various trade-offs among study flexibility, the expected trial sample size, and the overall maximum trial sample size.

Ad Hoc Rules
Ad hoc rules attempt to ensure the conservative interpretation of interim results. For example, over a total of K analyses, Haybittle (1971) uses a large critical value for all interim tests (such as a standard normal test statistic of $Z_i = 3.0$ for any ith interim analysis) and uses the conventional critical value at the final Kth test. This specific method is ad hoc, so no precise type I error rate is guaranteed. It is a precursor to methods developed to explicitly control the overall type I error rate.

Group Sequential Boundaries
One such family of methods designed to control the overall type I error rate is known as group sequential tests, which have predetermined stages for evaluating the data for each desired interim analysis. For example, Pocock (1977) sets a constant and conservative critical value $Z_{PO}$ for every interim analysis so that the overall significance level for the experiment will be α. Similarly, O'Brien and Fleming (1979) use the critical value $Z_{OF}(\alpha, K)\sqrt{K/i}$ at the ith analysis, where $Z_{OF}(\alpha, K)$ is determined to control the overall type I error. Wang and Tsiatis (1987) demonstrated that Pocock and O'Brien and Fleming are both special cases of a unified test where the critical value is defined as $Z_{WT}(\alpha, K, \delta)(i/K)^{\delta - 0.5}$, where $Z_{WT}(\alpha, K, \delta)$ is determined to control the overall type I error. When δ = 0, the O'Brien-Fleming boundary is produced. When δ = 0.5, the Pocock boundary is produced. δ may also be set between 0 and 0.5, producing intermediate shapes between the O'Brien-Fleming and Pocock boundaries.
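To make the shape of the unified family concrete, the following minimal Python sketch evaluates only the shape factor $(i/K)^{\delta - 0.5}$ of the Wang and Tsiatis critical value. The function name `wang_tsiatis_shape` is hypothetical, and the multiplying constant $Z_{WT}(\alpha, K, \delta)$ is deliberately not computed, since it requires numerical calibration that is outside this illustration.

```python
def wang_tsiatis_shape(K: int, delta: float) -> list[float]:
    """Relative boundary shape (i/K)**(delta - 0.5) for looks i = 1..K.

    Hypothetical helper for illustration only: the calibrating constant
    Z_WT(alpha, K, delta) that controls the type I error is NOT included.
    """
    if K < 1:
        raise ValueError("K must be a positive number of analyses")
    return [(i / K) ** (delta - 0.5) for i in range(1, K + 1)]


K = 4
obf_shape = wang_tsiatis_shape(K, delta=0.0)     # sqrt(K/i): strict early, relaxed late
pocock_shape = wang_tsiatis_shape(K, delta=0.5)  # constant across all looks
mid_shape = wang_tsiatis_shape(K, delta=0.25)    # intermediate shape

for name, shape in [("delta=0", obf_shape), ("delta=0.25", mid_shape), ("delta=0.5", pocock_shape)]:
    print(name, [round(s, 3) for s in shape])
```

The δ = 0 shape decreases across looks, so early rejection requires a much larger test statistic than at the final analysis, while the δ = 0.5 shape is flat.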

Error Spending Functions
One major limitation of predetermined group sequential boundaries is that the number of interim analyses must be fixed in advance. If an additional interim analysis is requested, or an analysis does not follow the predetermined analysis plan, the trial operating characteristics may not be maintained. To address this limitation, error spending functions were proposed by Demets and Lan (1994). In this approach, the type I error rate can be allocated flexibly across interim analyses throughout the study, so that at the end of the study the overall type I error is still controlled at the desired rate, α. While it is still ideal to predetermine the expected number of interim analyses, error spending functions can accommodate unexpected interim looks at the data and unequal accrual throughout a study.
The error spending function $\alpha(t^*)$ is a function of $t^*$, the information fraction observed at the time of the interim analysis. $t^*$ is generally defined as the ratio of the inverse of the variance of the test statistic at a particular interim analysis to that at the final analysis (Gordon Lan et al., 1994). Practically, it is estimated by the number of participants enrolled at calendar time t divided by the maximum number of participants planned for at the end of the study. For example, when calendar time t = 0, the information fraction $t^* = 0$ and the error spending function $\alpha(t^* = 0) = 0$. When the study ends, the information fraction $t^* = 1$ and the error spending function $\alpha(t^* = 1) = \alpha$.
In the context of error spending functions, Pocock boundaries can be approximated by the function $\alpha \ln[1 + (e - 1)t^*]$, and for O'Brien-Fleming boundaries the approximate function is $2 - 2\Phi(Z_{\alpha/2}/\sqrt{t^*})$, where Φ is the standard normal cumulative distribution function and $Z_{\alpha/2}$ is its upper α/2 quantile. The power family of functions is another approach for interim monitoring proposed by Jennison and Turnbull (1999) that is defined as $\alpha (t^*)^{\rho}$, where ρ > 0. Each of these error spending functions equals zero when $t^* = 0$ (i.e., no data have been observed) and equals α when $t^* = 1$ (i.e., all data have been observed).
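The spending functions can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, α = 0.05 is an assumed default, and the O'Brien-Fleming-type function uses the widely cited Lan-DeMets form $2 - 2\Phi(Z_{\alpha/2}/\sqrt{t^*})$, which is an assumption about the intended approximation. It returns the cumulative type I error spent by information fraction $t^*$, not the critical values themselves.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def pocock_type(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: alpha * ln(1 + (e - 1) * t)."""
    return alpha * math.log(1.0 + (math.e - 1.0) * t)


def obrien_fleming_type(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending (assumed Lan-DeMets form)."""
    if t <= 0.0:
        return 0.0  # nothing is spent before any data are observed
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / math.sqrt(t)))


def power_family(t: float, rho: float, alpha: float = 0.05) -> float:
    """Power-family spending: alpha * t**rho with rho > 0."""
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return alpha * t ** rho


# Cumulative error spent at four equally spaced looks.
for t in (0.25, 0.5, 0.75, 1.0):
    print(t, pocock_type(t), obrien_fleming_type(t), power_family(t, rho=2))
```

All three functions start at 0 and reach α at $t^* = 1$; they differ in how much of α is released at early looks.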
Examples of the boundaries of the different error spending functions discussed are presented in Figure 1 for a study that considers stopping for either futility or detecting some difference based on four total analyses with equal sample sizes enrolled in each stage. The test statistic presented on the y-axis is on the standardized Z-scale (i.e., a normal distribution with mean 0 and standard deviation 1). To illustrate how these boundaries would be used in practice, assume we are comparing a binary outcome between two variants A and B through the difference $p_A - p_B$, where a positive difference indicates variant A performs better than variant B. At each interim stage of the A/B experiment, we may reach one of four outcomes for a two-sided hypothesis test:
• If the Z-score falls in area 1 in Figure 1, the null hypothesis of no difference between variants is rejected, and we stop for the superiority of variant A.
• If the Z-score falls in area 5 in Figure 1, the null hypothesis of no difference between variants is also rejected, but this time we stop for the inferiority of variant A, concluding that variant B is better.
• If the Z-score falls in area 3 (the "inner wedge"), we fail to reject the null hypothesis and cannot conclude that variants A and B are different, and we stop the study for futility.
• If the Z-score falls in area 2 or 4, we do not draw any conclusion and continue the study to the next stage.
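The four outcomes listed above can be sketched as a small decision function. This is a hedged illustration: `classify_interim` is a hypothetical name, the boundary values in the usage lines are invented for demonstration rather than read from Figure 1, and treating a Z-score exactly on a boundary as crossing it is an assumption.

```python
def classify_interim(z: float, futility_bound: float, difference_bound: float) -> str:
    """Map an interim Z-score to one of the four outcomes of a two-sided design.

    futility_bound: half-width of the inner wedge (area 3), assumed symmetric.
    difference_bound: magnitude of the outer rejection boundary (areas 1 and 5).
    """
    if not 0.0 <= futility_bound <= difference_bound:
        raise ValueError("need 0 <= futility_bound <= difference_bound")
    if z >= difference_bound:
        return "stop: variant A superior"   # area 1
    if z <= -difference_bound:
        return "stop: variant A inferior"   # area 5
    if abs(z) <= futility_bound:
        return "stop: futility"             # area 3, the inner wedge
    return "continue"                        # areas 2 and 4


# Hypothetical boundaries for a single interim look.
for z in (3.1, -3.2, 0.2, 1.5):
    print(z, classify_interim(z, futility_bound=0.6, difference_bound=3.0))
```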

Statistical Tests to Evaluate Outcomes
In many A/B tests, the outcomes may be represented as a binary variable (e.g., yes/no). When analyzing dichotomous outcomes between two variants, the chi-squared test without Yates' continuity correction ($\chi^2$) or the chi-squared test with Yates' continuity correction ($\chi^2_c$) would be natural choices. For large sample sizes, the chi-squared test becomes asymptotically equivalent to a two-sample Z-test. Based on this asymptotic equivalence with larger sample sizes, many A/B tests with dichotomous outcomes may instead apply the two-sample t-test to compare the two variants, since the t-distribution becomes increasingly normal as the sample size increases. Others have previously discussed the different behaviors of statistical methods at various significance levels. D'Agostino et al. (1988) concluded that for significance levels of 0.02 and 0.01, the $\chi^2$ test performs better than the t-test, whereas for significance levels of 0.1 and 0.05, the t-test performs better than the $\chi^2$ test.
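The relationship between the Z-test and the uncorrected chi-squared test can be checked numerically. The sketch below uses only the Python standard library; `two_proportion_tests` is a hypothetical helper, and the counts in the usage line are invented. It relies on the known identity that, for the pooled-variance form of the two-sample Z statistic, the squared statistic equals the Pearson chi-squared statistic of the 2x2 table without continuity correction.

```python
import math
from statistics import NormalDist


def two_proportion_tests(x_a: int, n_a: int, x_b: int, n_b: int):
    """Return (z, chi2, two-sided p-value) for responders x out of n per variant."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se  # pooled two-sample Z statistic

    # Pearson chi-squared statistic (no Yates correction) from the 2x2 table.
    observed = [[x_a, n_a - x_a], [x_b, n_b - x_b]]
    rows = [n_a, n_b]
    cols = [x_a + x_b, (n_a + n_b) - (x_a + x_b)]
    total = n_a + n_b
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = rows[r] * cols[c] / total
            chi2 += (observed[r][c] - expected) ** 2 / expected

    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, chi2, p_value


z, chi2, p_value = two_proportion_tests(270, 500, 250, 500)  # hypothetical counts
print(z, chi2, z * z, p_value)
```

The t-test used elsewhere in the paper differs from this Z-test only through its variance estimate and reference distribution, a difference that shrinks as the per-arm sample size grows.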

A Proposed Loss Function to Identify "Optimal" Designs
To select among multiple candidate designs that facilitate various combinations of stopping boundaries (e.g., Pocock and power family) and rules (e.g., stopping for futility only, difference only, or both), in this subsection we propose a loss function to identify the "optimal" design. The loss function for each boundary is a weighted linear combination of ratios relative to the fixed sample design: the expected sample size under the null hypothesis ($ESS_{null,boundary}$), the expected sample size under the alternative hypothesis ($ESS_{alt,boundary}$), and the maximum sample size ($MSS_{boundary}$) if the study does not stop early:

$$L_1 = w_1 \frac{ESS_{null,boundary}}{SS_{fixed}} + w_2 \frac{ESS_{alt,boundary}}{SS_{fixed}} + w_3 \frac{MSS_{boundary}}{SS_{fixed}},$$

where $w_1 + w_2 + w_3 = 1$, and $SS_{fixed}$ is the sample size of the fixed sample size design. The "boundary" in the loss function refers to any of the fifteen stopping boundaries rather than the fixed design.
The optimal design is one that minimizes $L_1$. With the fixed sample design as the comparator in the denominator, any strategy with $L_1 < 1$ indicates an improvement over no interim monitoring. An advantage of this loss function is that the weights $w_i$ can be customized for a given study, so that the "optimal" design for the A/B test reflects the emphasis placed on the expected and maximum sample sizes. This makes the loss function adaptable to company goals. Departments may adjust the weights according to their objectives, such as increasing $w_3$ to reduce the maximum sample size and remain within budget constraints, or increasing $w_2$ to decrease the expected sample size under the alternative hypothesis when testing a new variant. If a department does not have a specific preference for minimizing a particular type of sample size, it may select the stopping boundary with the minimum loss function value for most weight combinations. Companies can thus experiment with various weight combinations and determine the stopping boundary that minimizes the loss function value for their desired outcome.
To illustrate the use and interpretation of this loss function more clearly we provide two examples. When designing an A/B test, if one wants to minimize the maximum sample size, one can set $w_3 = 1$ and $w_1 = w_2 = 0$. In this context, the loss function becomes:

$$L_1 = \frac{MSS_{boundary}}{SS_{fixed}}.$$

This means that, under this weighting, the design with the smallest maximum sample size among all designs will minimize the loss function and be selected as "optimal".
Another example arises when one is agnostic on how to split the weights. In this situation, the weights can be equally assigned to each component and the loss function becomes:

$$L_1 = \frac{1}{3} \cdot \frac{ESS_{null,boundary} + ESS_{alt,boundary} + MSS_{boundary}}{SS_{fixed}}.$$

In this situation, the design with the smallest sum of the expected sample sizes under the null and alternative hypotheses and the maximum sample size will minimize the loss function, and thus be selected as the "optimal" design. From the two examples, it can be seen that the optimal design may change when different weights are specified in the loss function, depending on what the study team believes is important.
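As a minimal sketch, the loss can be computed in a few lines of Python, assuming it is the weighted sum of the three sample sizes divided by the fixed-design sample size (this form matches the worked arithmetic in the note to Table 2). The function name `loss` is hypothetical; the inputs are the O'Brien-Fleming stop-for-both values and the fixed sample size reported in that note.

```python
def loss(ess_null: float, ess_alt: float, mss: float, ss_fixed: float,
         w1: float, w2: float, w3: float) -> float:
    """Weighted loss L_1 relative to the fixed sample design.

    Assumed form: (w1*ESS_null + w2*ESS_alt + w3*MSS) / SS_fixed with w1+w2+w3 = 1.
    """
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if ss_fixed <= 0:
        raise ValueError("ss_fixed must be positive")
    return (w1 * ess_null + w2 * ess_alt + w3 * mss) / ss_fixed


# Values from the Table 2 note (small effect size scenario).
table2_example = loss(36702, 40789, 53661, 50042, w1=0.33, w2=0.33, w3=0.34)
print(round(table2_example, 3))

# Placing all weight on the maximum sample size penalizes the sequential design.
max_only = loss(36702, 40789, 53661, 50042, w1=0.0, w2=0.0, w3=1.0)
print(round(max_only, 3))
```

A value below 1 indicates an improvement over the fixed design for the chosen weights, while the `max_only` case exceeds 1 because this sequential design has a larger maximum sample size than the fixed design.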

Simulation Design
A common A/B scenario is simulated to compare the proportion responding in the "A" variant ($\theta_A$) and the "B" variant ($\theta_B$) under the null hypothesis of no difference versus the alternative hypothesis that there is some difference between variants. In our motivating industry context, t-tests are most frequently used for A/B testing experiments, regardless of whether the outcome is continuous or binary. In addition, Zhou et al. (2023) demonstrate that when the sample size per arm is at or above 500, the t-test and the chi-squared test for comparing two proportions have nearly identical power, type I error rates, and expected sample sizes, even when interim analyses are incorporated. Therefore, a two-sample two-sided t-test is used as our primary benchmark. We also considered the chi-squared test $\chi^2$ and Yates's chi-squared test $\chi^2_c$; however, the results were nearly identical to those of the t-test and are not presented here.
Five different stopping boundaries (O'Brien-Fleming, Pocock, and power families with ρ = 1, 2, and 3) are evaluated under three stopping strategies (futility only, difference only, or both for some difference or futility), for a total of fifteen combinations. A sixteenth approach is considered with no interim monitoring to reflect that some contexts may not optimally benefit from stopping early. The effects of increasing the number of interim looks at the data are examined across simulations with 1, 3, or 19 interim analyses for a maximum of 2, 4, or 20 looks at the data, respectively.
Assuming a constant response in variant "A" of $\theta_A = 0.5$, five different effect sizes are simulated for $\theta_B$: 0.589 (large effect), 0.528 (moderate effect), 0.509 (small effect), 0.504 (tiny effect), and 0.500 (no effect). The four non-null effects were chosen to reflect A/B tests that would enroll approximately 500, 5,000, 50,000, or 250,000 per variant to detect the decreasing effect sizes, respectively, in a fixed sample design without interim monitoring. However, in practice, stakeholders may either request a larger sample size than deemed necessary by a statistical power calculation or alternatively be limited by external factors and be unable to enroll the necessary sample size. To address these potential settings, we also examine the choice of optimal interim monitoring strategy when a study is under- or over-powered. The simulation design and evaluation are shown in Table 1 and Figure S7 in the supplementary material.
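The link between the effect sizes and the approximate per-variant sample sizes can be illustrated with a standard normal-approximation formula for comparing two proportions. This is a rough sketch under loud assumptions: the two-sided α = 0.05 and 80% power used below are not stated in this section, the unpooled-variance formula is one of several common choices, and `n_per_arm` is a hypothetical helper, so the results only approximate the sample sizes produced by the design software used in the paper.

```python
import math
from statistics import NormalDist


def n_per_arm(p_a: float, p_b: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided test of two proportions.

    Assumed formula: n = (z_{1-alpha/2} + z_{power})^2 * (p_a*q_a + p_b*q_b) / (p_a - p_b)^2.
    """
    if p_a == p_b:
        raise ValueError("no finite sample size detects a zero difference")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_a * (1.0 - p_a) + p_b * (1.0 - p_b)
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / (p_a - p_b) ** 2)


for theta_b in (0.589, 0.528, 0.509, 0.504):
    print(theta_b, n_per_arm(0.5, theta_b))
```

Under these assumed settings the four effect sizes give per-arm sizes of the same order as the 500, 5,000, 50,000, and 250,000 quoted above, which is the intended reading of "large" through "tiny" effects.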
We conducted a total of 10,000 simulated studies in R v4.2.0 (Vienna, Austria) for each combination of effect size and stopping boundary, assuming equal accrual between each interim analysis. We determined the stopping boundaries and sample size required to detect a given effect size for sequential designs using PROC SEQDESIGN in SAS (Cary, North Carolina). Subsequently, we calculated key statistics, including the expected sample size under the null hypothesis ($ESS_{null,boundary}$), the expected sample size under the alternative hypothesis ($ESS_{alt,boundary}$), the maximum sample size ($MSS_{boundary}$), power, and type I error.

Approaches to Determine Early Stop
For example, with 2 total analyses and approximately 500 participants per arm, the O'Brien-Fleming design with early stopping for both has the following stopping boundaries at the first analysis (260 per arm), expressed as one-sided p-values: 0.00154, 0.28149, 0.71851, and 0.99846. This means that, if the one-sided p-values from simulated studies fall below 0.00154 or above 0.99846, those studies will stop early and claim a difference between B and A. If the one-sided p-values fall between 0.28149 and 0.71851, those studies will stop early and claim that there is a lack of evidence to show that B and A are different. For other p-values, studies will continue to the final analysis.
The boundaries at the final analysis (519 per arm) are 0.02651 and 0.97349. If the p-values from simulated studies fall below 0.02651 or above 0.97349, those studies will claim a difference between B and A. If the p-values fall between 0.02651 and 0.97349, those studies will claim that there is a lack of evidence to show that B and A are different.
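The decision rule in this example can be written out as a short Python sketch. The boundary values are the ones quoted above for the 2-analysis O'Brien-Fleming design that stops for both; the function names and the handling of p-values that fall exactly on a boundary are assumptions made for illustration.

```python
# One-sided p-value boundaries quoted in the text.
INTERIM_BOUNDS = (0.00154, 0.28149, 0.71851, 0.99846)  # first analysis, 260 per arm
FINAL_BOUNDS = (0.02651, 0.97349)                       # final analysis, 519 per arm


def interim_decision(p_one_sided: float) -> str:
    """Decision at the first analysis from a one-sided p-value."""
    low_reject, low_futility, high_futility, high_reject = INTERIM_BOUNDS
    if p_one_sided < low_reject or p_one_sided > high_reject:
        return "stop: difference"
    if low_futility < p_one_sided < high_futility:
        return "stop: futility"
    return "continue"


def final_decision(p_one_sided: float) -> str:
    """Decision at the final analysis from a one-sided p-value."""
    low, high = FINAL_BOUNDS
    if p_one_sided < low or p_one_sided > high:
        return "difference"
    return "no evidence of a difference"


for p in (0.001, 0.10, 0.50, 0.999):
    print(p, interim_decision(p))
```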

Approaches to Calculate Key Statistics
To calculate $ESS_{null,boundary}$, we extracted scenarios with $\theta_B = \theta_A = 0.5$ for the 16 stopping boundaries and computed the average sample size for each of the 16 stopping boundaries among the 10,000 simulated datasets. Similarly, to calculate $ESS_{alt,boundary}$, we extracted scenarios with $\theta_B$ = 0.589, 0.528, 0.509, and 0.504 for the 16 stopping boundaries and computed the average sample size for each of the 16 stopping boundaries among the 10,000 simulated datasets for each effect size.
$MSS_{boundary}$ was determined as the maximum sample size that could be attained if no early stop occurred during a study with interim analyses. We calculated the power for each scenario by extracting scenarios with $\theta_B$ = 0.589, 0.528, 0.509, and 0.504 for the 16 stopping boundaries and computing the proportion of studies that successfully claimed B was different from A (either superior or inferior, since we used a two-sided t-test) among the 10,000 simulated datasets for each of the 16 stopping boundaries. Similarly, we determined the type I error for each scenario by extracting scenarios with $\theta_B = \theta_A = 0.5$ for the 16 stopping boundaries and computing the proportion of studies that claimed B was different from A (either superior or inferior) among the 10,000 simulated datasets for each of the 16 stopping boundaries.
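The way these statistics are tallied can be sketched with a simplified simulation. Everything here is an illustrative stand-in rather than the paper's procedure: instead of generating binary outcomes and applying a t-test in R, the sketch draws the interim and final Z-statistics from their assumed joint normal distribution (independent increments), uses the 2-analysis O'Brien-Fleming stop-for-both boundaries quoted earlier (260 and 519 per arm), and introduces the hypothetical function `simulate_two_look` with a `drift` parameter equal to the mean of the final Z-statistic.

```python
import math
import random
from statistics import NormalDist

N_INTERIM, N_FINAL = 260, 519  # per-arm sample sizes quoted in the text
P_INTERIM = (0.00154, 0.28149, 0.71851, 0.99846)
P_FINAL = (0.02651, 0.97349)


def simulate_two_look(drift: float = 0.0, n_sims: int = 20000, seed: int = 1):
    """Return (proportion claiming a difference, expected per-arm sample size).

    drift = 0 mimics the null scenario, so the first value estimates the type I
    error; a positive drift mimics an alternative, so it estimates the power.
    """
    rng = random.Random(seed)
    cdf = NormalDist().cdf
    t = N_INTERIM / N_FINAL  # information fraction at the first look
    rejections = 0
    total_n = 0
    for _ in range(n_sims):
        z1 = rng.gauss(drift * math.sqrt(t), 1.0)
        p1 = 1.0 - cdf(z1)  # one-sided p-value at the first look
        if p1 < P_INTERIM[0] or p1 > P_INTERIM[3]:
            rejections += 1          # early stop for a difference
            total_n += N_INTERIM
            continue
        if P_INTERIM[1] < p1 < P_INTERIM[2]:
            total_n += N_INTERIM     # early stop for futility
            continue
        # Final statistic: rescaled interim statistic plus an independent increment.
        z2 = math.sqrt(t) * z1 + rng.gauss(drift * (1.0 - t), math.sqrt(1.0 - t))
        p2 = 1.0 - cdf(z2)
        if p2 < P_FINAL[0] or p2 > P_FINAL[1]:
            rejections += 1
        total_n += N_FINAL
    return rejections / n_sims, total_n / n_sims


type1_error, ess_null = simulate_two_look(drift=0.0)
power, ess_alt = simulate_two_look(drift=2.8)  # hypothetical alternative
print(type1_error, ess_null, power, ess_alt)
```

The maximum sample size for this design is simply `N_FINAL`, since it is reached whenever neither early stopping rule is triggered.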
For each simulation scenario, we identify what would be chosen as the "optimal" design for 5151 unique combinations of weights that satisfy our restriction $w_1 + w_2 + w_3 = 1$, with weights defined across a grid from 0 to 1 in increments of 0.01. Although there are three weight components, the restriction means that any two determine the third, so we present the results graphically in a 2-D plot that is colored by the design considered optimal for each weight combination. To further generalize the optimal stopping rules, we also present a 2-D plot where we ignore the specific stopping boundary type and present whether the optimal design recommends no interim stopping, stopping for futility only, stopping for difference only, or stopping for either futility or difference. The step-by-step process of how the plots are generated is presented in the supplementary materials: An example to illustrate how loss functions are calculated and plotted.
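The weight grid and the selection of an optimal design can be sketched as follows. The names `weight_grid` and `optimal_design` are hypothetical, and only two candidate designs are included because only their sample sizes appear in the text: the fixed design (50042 per variant) and the O'Brien-Fleming stop-for-both design from the Table 2 note. Integer steps are used to build the grid so that every triple sums to 1 without floating-point drift in the enumeration.

```python
def weight_grid(steps: int = 100):
    """All (w1, w2, w3) on a grid of spacing 1/steps with w1 + w2 + w3 = 1."""
    grid = []
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j  # the third weight is determined by the first two
            grid.append((i / steps, j / steps, k / steps))
    return grid


def optimal_design(designs: dict, ss_fixed: float, weights: tuple) -> str:
    """Name of the design with the smallest loss for the given weights.

    designs maps a name to (ESS_null, ESS_alt, MSS).
    """
    w1, w2, w3 = weights
    return min(
        designs,
        key=lambda name: (w1 * designs[name][0] + w2 * designs[name][1]
                          + w3 * designs[name][2]) / ss_fixed,
    )


designs = {
    "fixed": (50042, 50042, 50042),
    "OBF, stop for both": (36702, 40789, 53661),  # from the Table 2 note
}
grid = weight_grid()
winners = [optimal_design(designs, 50042, w) for w in grid]
share = {name: winners.count(name) / len(grid) for name in designs}
print(len(grid), share)
```

With all sixteen designs in `designs`, coloring each grid point by its winner would reproduce the kind of 2-D plot described above; the shares computed here concern only the two-design comparison and are not the percentages reported in the paper.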

Results
In this section, we present the results for what is selected as the "optimal" design across our different simulation scenarios. Given that the conclusions are similar across scenarios (adequately, under-, or over-powered) and numbers of total analyses (i.e., 2, 4, and 20 total looks), we present a subset of scenarios with 4 total looks in this section, with complete results in the supplementary materials. The $ESS_{null,boundary}$, $ESS_{alt,boundary}$, and $MSS_{boundary}$ of each stopping boundary are summarized in Tables S1-S12 in the supplementary materials.

Optimal Boundaries
The stopping boundaries that minimized the loss function for each set of loss function weights $w_1$, $w_2$, and $w_3$ were plotted for each of the four adequately powered effect size scenarios, and the percentage of the 5151 weight combinations for which each boundary was selected as optimal is presented in Figure 2. To illustrate how an "optimal" design is chosen for each combination of weights, Table 2 provides the estimated loss function value if we set $w_1 = w_2 = 0.33$ and $w_3 = 0.34$ for the scenario with a small effect size (n = 50,000 per variant in the fixed sample design). In this example, the O'Brien-Fleming design that allows stopping for both futility or a difference had the smallest loss function value ($L_1 = 0.876$); therefore, it was selected as optimal based on this weight combination (i.e., see the ⊕ in Figure 2). Figure 2 also shows that near the area $w_1 = w_2 = 0.33$ there is a large black-colored region, which means that O'Brien-Fleming was also selected as optimal for other combinations of $w_1$ and $w_2$ near 0.33. Specifically, the O'Brien-Fleming boundary that allows stopping for both is selected as optimal in 24.75% of all 5151 weight combinations.

Table 2 note: $w_1 = w_2 = 0.33$, $w_3 = 0.34$. Taking O'Brien-Fleming with stopping for both as an example: $(0.33 \times 36702 + 0.33 \times 40789 + 0.34 \times 53661)/50042 = 0.876$.

More generally, from Figure 2 the fixed sample size is only the "optimal" design if most of the weight is placed on the maximum sample size (i.e., a large $w_3$ value) across all effect sizes. If the weight of $ESS_{alt,boundary}$ was set near 0 (e.g., $w_2 < 0.05$), the optimal designs for various weights on $ESS_{null,boundary}$ ($w_1$) favor designs that only stop for futility. In contrast, if $w_1 < 0.05$ and $w_2 < 0.4$, many optimal designs favor stopping for the difference. As $w_2$ increases, many optimal designs start to favor stopping for both futility and detecting any difference. If similar values were given to all three weights, all optimal designs favor stopping for both futility and difference, with the O'Brien-Fleming boundary being optimal and the power boundaries with ρ = 2 and ρ = 3 also optimal near this weight combination. As shown in Table 2, with $w_1 = w_2 = 0.33$ and $w_3 = 0.34$, the loss function values of the O'Brien-Fleming boundary and the power boundaries with ρ = 2 and ρ = 3 were all between 0.87 and 0.89. While there are subtle differences across the scenarios in Figure 2, the general trends are largely the same for each adequately powered study design.
It is worth noting that some designs were never or rarely chosen as optimal in some scenarios. For example, the Pocock boundary and the power boundary (ρ = 1) stopping for only a difference were never selected as optimal across all scenarios. Results, as noted previously, were similar with 2 or 20 total looks at the data. One exception is that the power boundary (ρ = 2) that allows stopping only for a difference was selected as optimal for a very small range of $w_1$, $w_2$, and $w_3$ when there are 2 total analyses, but not for any adequately powered design with 4 or 20 total analyses.

Optimal Stopping Rules
While the specific design boundaries are important in selecting the truly "optimal" design based on the chosen loss function weights, it is also helpful to generalize the results in Figure 2 to summarize the broad stopping rules (i.e., no early stopping, stopping for only futility, stopping for only difference, or stopping early for both) to understand the potential reasons to stop a study early. In Figure 3, the optimal stopping rules for the four different adequately powered effect size scenarios are presented.
Among the four considered stopping rules, early stopping for both was selected as optimal in over 80% of weight combinations. Early stopping for futility is favored only when $w_2 < 0.1$. Stopping for a difference only was least often optimal, except for small $w_1$ and $w_2$ around 0.25. A fixed sample size design with no early stopping was favored only when $w_3$, the weight on the maximum study sample size, was given the most weight. These findings suggest that allowing stopping for both futility or a difference would be optimal for most adequately powered studies, except for A/B studies with fairly imbalanced loss function weights.

Over-powered Simulation Scenarios
In some contexts, a stakeholder may wish to implement an intentionally over-powered design if sufficient resources are available. Figures 4a and 4c present the results for an over-powered study to detect our moderate effect, where we enroll 50,000 per variant instead of the 5,000 needed for the fixed sample design to be adequately powered.
The patterns for over-powered scenarios were very similar to the patterns in adequately powered scenarios. The fixed sample size would still only be the best option when most of the weight was put on the maximum sample size ($w_3$). If the weight of $ESS_{alt,boundary}$ was set near 0 (e.g., $w_2 < 0.05$), the optimal designs for various weights on $ESS_{null,boundary}$ ($w_1$) still favored designs that only stop for futility. Further, if similar values were given to all three weights, all optimal designs favored stopping for both futility and difference, with the O'Brien-Fleming boundary (20.38%) and the power boundaries with ρ = 2 (grey, 26.17%) and ρ = 3 (light pink, 32.91%) selected as optimal. For over-powered scenarios, designs that stop only for detecting a difference were selected if $w_1 < 0.05$, which is less often than in adequately powered designs.

Under-powered Simulation Scenarios
In other contexts, a stakeholder may not be able to enroll the necessary sample size for an A/B test but still desires to implement an under-powered experiment. Figures 4b and 4d present the results for the tiny effect size if only 50,000 per variant are enrolled instead of the needed 250,000.
The pattern of stopping boundaries in Figures 4b and 4d was very different from the pattern in the adequately powered design. Approximately 60% of the weight combinations identify stopping for futility only. This intuitively makes sense, because we are implementing an intentionally under-powered study that is unlikely to detect the desired difference. However, as $w_2$ increased above 0.5, weight combinations began favoring optimal designs that stop for both futility and a difference. In practice, the choice of "optimal" designs for under-powered studies may require additional considerations about not stopping for futility, since the design is expected to be futile.

Discussion
The intention of this article was to evaluate the feasibility of applying the existing sequential monitoring methodology developed primarily for use in clinical trials to the setting of A/B experimentation in large-scale environments. After first reviewing some of these methods for interim monitoring, we implemented a rigorous simulation study to determine if general guidance could be given for A/B experimentation. The effect of decreasing or increasing the number of interim analyses of the data was examined as well. Given the large number of potential designs one could consider, we also proposed a novel loss function that evaluated the sample size demands under expected and maximum values.
In terms of general approaches to "optimal" A/B test designs, we recommend, based on our simulation results for adequately powered studies, that designs with sequential monitoring that allow stopping for both some detectable difference between variants or for futility could be used across most combinations of weights for the loss function. When no strong preference exists for minimizing either the expected sample size or the maximum sample size, we recommend O'Brien-Fleming boundaries that allow stopping for both as likely the most efficient choice. Since the power boundaries with ρ = 2 or ρ = 3 have loss function values similar to those of the O'Brien-Fleming boundary, they are also good choices, but in practice we have seen more familiarity with O'Brien-Fleming boundaries when presented to stakeholders. While not strictly "optimal" in all scenarios, recommending a single boundary choice could facilitate easier implementation and scalability in A/B testing environments compared with choosing a design for each experiment based on the proposed loss function.
When considering designs that were implemented while being intentionally over-powered, the conclusion is similar to adequately-powered scenarios.However, for under-powered designs, it is more challenging to provide a general conclusion based on the proposed loss function.It is not unexpected that our simulation results suggest stopping early for futility is optimal in approximately 60% of the presented scenario's weight combinations, given that an under-powered design is naturally a "futile" study that is unlikely to detect the desired difference.In practice, it may be ideal to choose a fixed sample design to ensure other data, such as safety signals, may be collected in the presence of an underpowered primary outcome.
While we proposed a single loss function based on sample size parameters, others may wish to develop alternative loss functions. For example, loss functions could be proposed that include the type I error rate or power. If these terms were added to our existing loss function, there would be more than three weights and the results could not be easily plotted in a static 2-D figure. Further, as long as each stopping boundary accounts for the corrections for multiple interim looks, the power and type I error rates should already be similar across approaches, with minimal differences.
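As an illustration of how a sample-size-based loss can rank designs, the sketch below assumes a simple weighted sum of the expected sample size under the null, the expected sample size under the alternative, and the maximum sample size, with weights that sum to one. The function name, the weighted-sum form, and the numeric inputs are hypothetical placeholders for illustration only; they are not our simulation results and may differ from the loss function defined earlier in the paper.

```python
def weighted_loss(en_null, en_alt, n_max, w_null, w_alt, w_max):
    """Weighted sum of three sample size summaries (assumed form).

    en_null: expected sample size when there is no difference between variants
    en_alt:  expected sample size when a difference exists
    n_max:   maximum sample size if the test never stops early
    """
    if abs(w_null + w_alt + w_max - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_null * en_null + w_alt * en_alt + w_max * n_max


# Placeholder per-arm sample sizes (NOT simulation output): a fixed sample
# design and a hypothetical sequential design with a slightly larger maximum.
designs = {
    "fixed_sample": (50_000, 50_000, 50_000),
    "sequential_example": (30_000, 35_000, 52_000),
}

# Equal weights correspond to the centre point of the weight space.
weights = (1 / 3, 1 / 3, 1 / 3)

losses = {name: weighted_loss(*sizes, *weights) for name, sizes in designs.items()}
best = min(losses, key=losses.get)
print(losses, best)
```

Because the three weights sum to one, any two of them determine the third, which is why results can be displayed in a 2-D figure. Adding terms for type I error or power would add weights and break that property.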
This research has limitations worth discussing, with room for further research. The simulations included only binary outcomes. While binary outcomes are commonly used in A/B testing, other types of outcomes would be worth considering. However, given the large sample sizes simulated, it is likely that continuous outcomes would have similar results since a t-test was used in our simulation studies. A second limitation is that we considered only one outcome in an A/B test, but many experiments have multiple metrics that may be of interest. This would represent an additional layer of multiple testing that is not examined in our simulations. A third limitation is that large-scale data environments may have multiple, competing experiments running simultaneously that may not be independent. Our methods and simulations do not consider the case of potentially correlated experiments that occur during overlapping time periods.
It is worth noting that many novel methods have already been developed for sequential monitoring, and many of them are specifically designed for A/B testing. Johari et al. (2015, 2017, 2022) developed always valid p-values and confidence intervals that are robust to the type I error inflation caused by continuous monitoring, which lets users take advantage of data as fast as it becomes available. Balsubramani and Ramdas (2015) proposed a novel algorithmic framework for sequential hypothesis testing. The sample size can even be increased at the penultimate stage of sequential monitoring to achieve a specified power against an alternative hypothesis (Gao et al., 2008). Tamburrelli and Margara (2014) investigated a novel approach to automating A/B tests on a large scale. Those methods, however, may still not be easily scaled or implemented for hundreds of ongoing A/B tests, since they involve what may be perceived as an intensive mathematical background and complicated algorithms. Conversely, the designs and stopping rules recommended in this paper are simpler, easy to implement on a large scale, and build on a rich history in biomedical clinical trials research.
This article provided an overview of fundamental concepts and a reference for choosing optimal study designs with interim monitoring for A/B testing. Future work will extend the proposed design and loss function to non-inferiority and equivalence studies, as well as experiments with multiple outcomes. Additional consideration will be given to the design of flexible platform trials that have emerged in biomedical research, to see if adaptations can facilitate the design of an optimal sequence of studies to arrive at an optimal product via sequential and potentially simultaneous A/B experimentation.

Figure 1: Example of the stopping boundary shapes for different error spending functions with three interim analyses. The red line represents the O'Brien-Fleming boundary; the purple line represents the power function (ρ = 3); the yellow line represents the power function (ρ = 2); the blue line represents the power function (ρ = 1); the green line represents the Pocock boundary. The solid lines represent the boundaries of stopping for difference, and the dashed lines represent the boundaries of stopping for futility.

Figure 2: Optimal boundaries for four effect sizes from the 4-total analysis. Upper left: large effect size (n = 500 per variant in a fixed sample design); Upper right: moderate effect size (n = 5,000 per variant in a fixed sample design); Lower left: small effect size (n = 50,000 per variant in a fixed sample design); Lower right: tiny effect size (n = 250,000 per variant in a fixed sample design). The ⊕ represents the point when the three loss function weights are equally allocated; OBF is O'Brien-Fleming.

Figure 3: Optimal stopping rules for four effect sizes from the 4-total analysis. Upper left: large effect size (n = 500 per variant in a fixed sample design); Upper right: moderate effect size (n = 5,000 per variant in a fixed sample design); Lower left: small effect size (n = 50,000 per variant in a fixed sample design); Lower right: tiny effect size (n = 250,000 per variant in a fixed sample design). The ⊕ represents the point when the three loss function weights are equally allocated.

Figure 4: Optimal stopping boundaries and rules for the 4-total analysis. (a), (c) Over-powered design: 50,000 per variant in the fixed sample design for a moderate effect size, which needs only 5,000 per variant in the fixed sample design. (b), (d) Under-powered design: 50,000 per variant in the fixed sample design for a tiny effect size, which needs 250,000 per variant in the fixed sample design. The ⊕ represents the point when the three loss function weights are equally allocated; OBF is O'Brien-Fleming.

Table 2: Equal weights for the adequately powered scenario with small effect size (50,000 per variant), 4-total analysis.