a

Trials for comparing interventions where cluster of subjects, rather than individuals, are randomized, are commonly called cluster randomized trials (CRTs). For comparison of binary outcomes in a CRT, although there are a few published formulations for sample size computation, the most commonly used is the one developed by Donner, Birkett, and Buck (Am J Epidemiol, 1981) probably due to its incorporation in the text book by Fleiss, Levin, and Paik (Wiley, 2003). In this paper, we derive a new 2 approximation formula with a general continuity correction factor (c) and show that specially for the scenarios of small event rates (< 0:01), the new formulation recommends lower number of clusters than the Donner et al. formulation thereby providing better eciency. All known formulations can be shown to be special cases at specic value of the general correction factor (e.g., Donner formulation is equivalent to the new formulation for c = 1). Statistical simulation is presented with data on comparative ecacy of the available methods identifying correction factors that are optimal for rare event rates. Table of sample size recommendation for variety of rare event rates along with code in\R" language for easy computation of sample size in other settings is also provided. Sample size calculations for a published CRT (\Pathways to Health study" that evaluates the value of intervention for smoking cessation) are computed for various correction factors to illustrate that with an optimal choice of the correction factor, the study could have maintained the same power with a 20% less sample size.


Introduction
Cluster randomized trials (CRTs) are necessary in the evaluation of health care interventions because of practical and ethical reasons.The units for randomization ('clusters') for the evaluation of intervention in CRT are typically communities, clinics, hospital wards, or medical practices.This design has been used by investigators in the field of drug treated device for assessment of disease control [1] [2], in the evaluation of interventions intended to improve the delivery of health services and quality of care [3], in appraisal of health education activities [4] and prevention models for sexually transmitted diseases [5], to name a few.Donner and Klar (1994) provide an excellent review of the reasons for conducting CRTs, related design issues, and statistical methods for data analysis [6].CRTs are usually less efficient than individually randomized trials (IRTs) because the responses of individuals within a cluster tend to be more similar than responses of individuals in different clusters [7].This phenomenon is quantified by a statistics called intraclass correlation coefficient (ICC).Proper design with adequate sample size calculation is of utmost importance in trials utilizing CRT as they are logistically very demanding.For comparison of binary outcomes in a CRT, although there are few published formulations for sample size computation, the most commonly used is the one developed by Donner, Birkett, and Buck [8], probably due to its incorporation in the text book by Fleiss, Levin, and Paik [9].However, there is room for improvement, especially for trials with binary endpoint with rare occurrence.
We begin with a motivating example from the field of surgery in section 2. Although leaving objects behind in the body cavity after surgery is rare (1 in 5,500 surgeries) [10], it has the potential to be lethal for patients and detrimental to hospitals and insurance companies.A new medical device for detecting these objects by scanning has been marketed but a CRT for evaluation of this device compared to manual sponge count is needed.In our attempt to provide power calculation for this study, we briefly reviewed the existing methods of sample size computation for binary endpoint for IRT (Section 3.1) and their extensions to CRT (Section 3.2).We were unsure about the applicability of the existing formulas for such low proportion of events.Therefore we derived a new sample size formula using a general correction factor (Section 3.3).All known formulations are shown to be special cases at specific value of the general correction factor.In section 4, we present a simulation study for comparison of all methods identifying correction factors that are optimal for rare event rates.
Sample size calculations for a published CRT ("Pathways to Health study" that evaluates the value of intervention for smoking cessation [11]) are computed for various correction factors to illustrate that with an optimal choice of the correction factor, the study could have maintained the same power with a 20% less sample size (Section 5).This points out the usefulness of our new formulation which allows choosing a different correction factor depending on the setting.
We present table of sample size recommendation for variety of rare event rates and also provide related code in "R" language for easy computation of sample size in other settings (Section 6).We return to our motivating example in this section and through computation of sample size for this trial, provide a guideline on how to approach designing of a CRT, specially in a setting where ICCs are not available.In section 7, we offer some discussion on issues related to varying cluster size, generation of correlated binary data, and the need for exact formulation of sample size.

Motivating Example: Planning of CRT for Evaluation of a Newly Developed Medical Device for Discovering Retained Surgical Sponges
Inadvertently leaving sponges inside patients who undergo surgery continues to occur despite manual counting of sponges by operating room personnel.Retained sponges may cause no adverse effects in patients and may remain undiscovered for decades.Alternatively, retained sponges may lead to serious sequelae, including sepsis, intestinal obstruction, fistulization, and death.Cima et al (2008) reviewed the incidence and characteristics of surgical retained foreign objects (RFOs) at a tertiary care institution during 4 years and found the incidence rate to be quite small (approximately 1 in 5, 500 operations = 0.0002) [10].
A new device recently approved by FDA (RF Surgical detection system) is currently poised for marketing and evaluation (http://www.rfsurg.com/productoverview.htm).This system consists of two features: 1) sponges, gauzes, and towels with a small unobtrusive embedded chip (measuring at 3.5 mm by 11 mm) and 2) a Blair-Port wand with a 9 foot connection cord that has the capability of scanning a patient weighing up to 500 pounds.A CRT, in support of a comparative trial (Arm 1: Use of the new device for sponge detection versus Arm 2: Use of standard practice of manual sponge counting) with the binary endpoint of detection of a foreign object in body cavity after surgery, is being planned.Accuracy of the existing sample size formulations for extremely low incidence like the one in this study has been questioned in the field of IRT and adjustment to the sample size formulation has been suggested [21].Therefore we reviewed the sample size formulations in CRT and concluded that some adjustment to the sample size formulation is needed for the special case of rare events in the context of CRT as well.

Review of Sample Size Formulations in IRT
Before discussing further the sample size formulas for CRTs, we offer a brief review of the most commonly used formulation of sample sizes for the IRTs.Sahai and Khurshid (1996) provides an excellent detailed review [8].Consider the setting of a two armed clinical trial with dichotomous outcome (ie, event versus non-event).The null hypothesis under test is H 0 : π 1 = π 2 versus the alternative hypothesis of H 1 : π 1 > π 2 where π i is the proportion of events in the ith population.The overall context is to estimate the sample size so that if in fact there is no difference between the two underlying proportions, then the chance is approximately α of falsely declaring the two proportions to differ, and if in fact the proportions are unequal, then the chance is approximately 1 − β of correctly declaring the two populations to differ, for α > 0, β < 1.Throughout this manuscript, we assume equal sample sizes for the two groups.
"Exact" formulation for sample size, considered 'gold standard' in this setting is derived [11] but the approximate methods are most commonly utilized due to their simplicity in computation [9][10][12][13][14].One of the commonly employed method is the "arcsine formula" [9], where Φ(z γ ) = γ and Φ is the cumulative normal distribution function.
Another commonly used approximation is the "uncorrected χ 2 formula" (UC), where [10].It has been shown that the above two formulas, ( 1) and ( 2), give similar output but are considered to be serious underestimation of the sample size recommended by the "exact" method [11].
To rectify this, Kramer and Greenhouse (KG) developed a "corrected χ 2 method" given by where . Later, Casagrande, Pike and Smith (CPS) developed a χ 2 approximation sample size formula with a general correction factor, c , given by The KG and UC formulation were shown to be special cases of the CP formulation by setting c = −0.5 and c = 0.5 respectively.Casagrade and Pike also demonstrated that the sample size recommended by the UC and KG formulation are an underestimate and an overestimate of the sample size obtained through the exact formulation [13].They proposed a formula based on c = 0 and established that the sample size obtained via this new formula was satisfactorily close to that obtained by the exact method.

Review of Sample Size Formulations in CRT
In the two-armed clinical trial with CRT design, the most widely used sample size formula is the one given in the text book by Fleiss, Levin, and Paik (FLP) [14] which was originally developed by Donner, Birkett, and Buck (DBB) [15].Assuming there are K clusters in each group, with clusters of equal size n, the formulation is given by where f 1 and f 2 denote the variance inflation factors (VIFs) of the two sets of is the VIF under the null hypothesis of equality of proportions; and ρ being the intra-cluster co-efficient assumed to be the same within each groups [14][15].
Since in the IRTs, f 1 = f 2 = 1, it is clear that ( 5) is an extension of the uncorrected χ 2 formula to the CRT setting.To be exact, the FLP formula is actually an approximation of the DBB formula, by the fact that Several other formulations of sample size computation are available for related but somewhat different settings.For example, Liu and Liang [16] proposed sample size formulation in the context of generalized linear models which used unified tools for correlated continuous and discrete responses.For the special case of the 'two-sample problem with binary responses', their general formula reduces to the DBB formula under the assumption of equal sample size in the two groups and an exchangeable correlation structure for the working correlation matrix.Fleiss et al. [14] extends FLP/DBB formula to the case where the 'exposure' varies across clusters.Hayes [18] present a formula which takes into account the between cluster variability, but this formula doesn't take into account the intra-cluster correlation.Manatunga et al. [19] and van Breukelen et al. [20] extends these sample size estimation methods to account for variability in cluster size.

Derivation of a new sample size formula for CRTs
In this section, we derive our χ 2 approximation formula for sample sizes with a general continuity correction factor.Let X ij and Y ij denote the outcomes of the i th individual in the j th cluster in the intervention and control groups respectively; i = 1, . . .n; j = 1, . . .K. We assume that X ij ∼ Bernoulli(π 1 ) and Y ij ∼ Bernoulli(π 2 ).Let us denote the intra-cluster correlations as ρ 1 and ρ 2 and the corresponding variance inflation factors (VIFs) in each group as f 1 and f 2 .
Z is a test statistic (-χ 2 test with Yates' correction-) commonly used to test the null hypothesis H 0 : is the variance inflation factor calculated under the null hypothesis.Replacing X+ Ȳ 2 by its expectation, we get Also where Φ is the cumulative normal distribution function and c is a correction factor to allow for the discreteness in the distribution of d.If this d * is to lead to a test with power 1 − β then it should satisfy .
That is, Equating ( 6) and ( 7) and solving for K, we get, where Note that when c = 1, the formula ( 8) is the same as the FLP formulation reproduced in equation ( 5) of this document.c = 0 and c = −1 are extensions of CGS and KG formulas respectively to the CRT setting.Also note that the sample size formula for the alternate hypothesis, H 1 : π 1 < π 2 , may be obtained by switching π 1 and π 2 in formula (8)

Statistical Simulations for comparison of χ 2 approximations with different correction factors
We compare the performance of the sample size formula given by ( 8) with different choices of the continuity correction factor, c, via statistical simulations.The primary endpoint for comparison is the number of clusters recommended by different formulas maintaining the power closest to 80% to the one corresponding to exactly 80%.

Data Generation and parameter estimation Methods
We generated 500 correlated binary data within each cluster for the above parameters, using a Monte Carlo simulation method proposed by Lunn and Davies [17].For each j, X ij s were generated using the formula (given in [17]), where V ij , Z j are independent with Bernoulli(π 1 ) distribution, and U ij are independent with Bernoulli( √ ρ 1 ) distribution.Y ij 's were obtained similarly with π 1 and ρ 1 replaced by π 2 and ρ 2 .ρ's were estimated using the FLP formula.Empirical power was calculated by the proportion of times the χ 2 test statistic with Yates' correction exceeds the critical value.

Format of Tables Presenting Simulation Results
Simulation results are presented in Tables 1 with 8 columns.For each (π 1 , π 2 )-pair listed in column 1 and for each choice of c(= 1, 0, −1) listed in column 2, the required number of clusters via the formula ( 8) is computed at 80% power and 0.05 significance level with specification of ρ 1 =ρ 2 = 0.25.These number of clusters are shown at the top of each cell in column 3 (-the numbers without parenthesis-).Using the generated data, the mean estimated ρ 1 , ρ 2 , π 1 , π 2 along with their standard deviations are presented in columns 4-7.Using the estimated ρ's, the number of clusters are estimated again and are presented in column 3 (-the numbers with parenthesis-).Empirical power is listed column 8.
Note the row labeled "*" corresponding to each (π 1 , π 2 )-pair.The top number in column n 3 corresponding to this row is obtained by trial and error and is the number of clusters required to achieve as close to the power of 80% as possible.The corresponding estimated ρ's and π's are presented in the same row for the next four columns.Using these estimates, the choice of c that provides the number of clusters closest to that corresponding to "*" is listed Below "*".This is the most "optimal" choice of c for each setting of (π 1 , π 2 )-pair.

Simulation Results and Explanation
Under the null hypothesis of equality of the rate of events, the nominal significance values were found, on average, to be around 0.05 (data not shown).The numbers in Table 1 may be better explained using a particular case of (π 1 , π 2 )-pair, say (0.01, 0.005), as an example (see the bolded numbers in table 1).
The required number of clusters obtained via ( 8) for c = 1, 0 and -1, with simulation parameters specified above are 1013, 1026 and 1036 respectively.Respectively for c = 1, 0 and -1, the mean estimated ρ 1 are 0.244, 0.243, and 0.244 and the mean estimated ρ 2 's are 0.244, 0.243 and 0.243.The corresponding empirical power is 83.8%, 85.6% and 85.6%, respectively.Since the ρ's are somewhat underestimated which could be an artifact of the data generation process, for true comparison of the number of clusters needed by different values of c, we need to compute these numbers based on the mean estimated ρ's as opposed to the true value of ρ = 0.25.Using the estimates, the corresponding number of clusters are 977, 990, and 1003 respectively for c = 1, 0, and -1.The last row ("*") gives the minimum number of clusters required to achieve 80% power which in this particular case is estimated to be 942.
The mean estimated ρ's in this case are comparable to those for c = 1, 0 and -1, and hence the performance of the formula (8) with correction factors c = 1, 0 and -1 may compared by assessing the deviances 35(= 977 − 942), 48(= 990 − 942), and 61(= 1003 − 942).The best choice of the correction factor is c = 1 corresponding to the least deviance compared to c = 0 and c = −1.Lastly, the number below "*" provides the choice of c that corresponds to the required number of clusters closest to that given by "*".In this particular case, a choice of c = 3 gives the number of clusters, 948, closest to that given by "*", 942.Therefore c = 3 is considered the most optimal choice of the correction factor for detecting a halving in proportion when π 1 = 0.01 and π 2 = 0.005.

Conclusions from Simulation Results
Similarly assessing the results from all 10 scenarios presented in Table 1, the following conclusions can be drawn: • For (π 1 , π 2 ) pairs with max(π 1 , π 2 ) < 0.01, all three choices of c = 0, −1, 1 recommend higher sample size than optimal.Recommended value for number of clusters using c = 3 is more optimal.
• For all (π 1 , π 2 ) pairs with 0.4 < max(π 1 , π 2 ) ≤ 0.6, all three correction factors specified (namely, c = 1, 0 and −1), provides underestimate for the required number of cluster sizes.More extreme values of c is needed to produce optimal sample size but the value varies for particular pairs.
An important cautionary note is that above conclusions are based on power simulations which used a large ρ(= 0.25.The optimal choices for the ranges of (π 1 , π 2 ) may be different as seen in the illustrative example below.

CRT for evaluation of nicotine gum and motivational interviewing for smoking cessation: An Illustrative Example
Although there has been significant decline in smoking prevalence among adults in the United States in the past few decades, it has not been the case in all the subpopulation of smokers.This is particularly true in smokers below the poverty level.An intervention study for smoking cessation in this subpopulation is of significance, especially since studies have shown high prevalence and motivation to quit among residents of low-income housing.Okuyemi et al.(2007) reported the results from a CRT that tested nicotine gum plus motivational interviewing (MI) for smoking cessation in 20 lowincome housing developments (HDs), in which intervention participants (10 HDs) received educational materials addressing fruit and vegetable consumption, 8 weeks of 4 mg nicotine gum, and 5 MI sessions on quitting smoking, and comparison participants (10 HDs) received 5 MI sessions and educational materials only [11].The sample size (-no. of clusters-) calculation was based on the assumptions that there would be 20 participants in each of the 20 HDs, a moderate intra-cluster correlation of 0.02, a 6-month quit rate of 6% (π 1 ) in the comparison arm, and a 18% (π 2 ) quit rate in the cessation arm.Power analysis based on the DBB formula showed that there would be 89% power to detect a significant difference between the two arms.
We suspect that the DBB formula overestimated the required number of clusters, and that formula (8) with a continuity correction factor of c = 3 will provide a better estimate of the required number of clusters.Indeed, a power simulation described in Table 2 provides support to our thinking.The number of clusters required in each arm to detect a difference of 12% at 89% power (with all other assumptions kept same) is respectively, 11, 9 and 8 cluster per arm, for c = 1, 3, 4 in (8).We generated 100,000 Monte Carlo two-samples of clusters, with 20 per cluster, for 4 different scenarios: 11, 10, 9, and 8 clusters in each arm (-note that, 10 clusters in each arm corresponds to the DBB formula that was used for the study design in [22]).Since the estimates of ρ 1 and ρ 2 are lower than the actual ρ 1 and ρ 2 used in Lunn and Davis [17] method, we used ρ 1 = ρ 2 = 0.04 (which is more conservative than 0.02 used in [17]) for our power simulations.
From Table 2, we observe that even with 9 clusters per arm (corresponding to c = 3), the empirical power is greater than 89% and with 8 clusters per arm (corresponding to c = 4), the empirical power is 88.1%.But the estimated ρ 2 is substantially higher than 0.02.So, with ρ 1 = ρ 2 = 0.02, there is reason to believe that even 8 clusters per arm would have been sufficient for 89% power.In other words, this example illustrates the point that formula (8) with c = 3 provides a more accurate estimate of the required number of clusters for this choice of (π 1 , π 2 ).The reduction (that is, the improvement) in number of clusters is 10% (20 − 18/20 = 0.1) which amounts to a substantial gain in terms of the cost of conducting the study.Incidentally, in this example, the sample size based on c = 4 provides the best estimate, with a 20% improvement.We present Table 3 with some sample size tabulations for assistance with future trial design and show empirically the settings where c = 3 recommend lower sample size.This table provide the required number of clusters obtained via (8) using c = 1, 0, and −1 for various small values π's (< 0.01), ρ = 0.01, and average cluster size of 30, 5% level of significance, and 80% power.We also provide the number of clusters required based on c = 3. 'R' code for the sample size computation is provided below for easy implementation in any other setting.

Connecting to the Motivating Example
Getting back to our motivating example, where we needed sample size computation for the (π 1 , π 2 )-pair of (0.0002, 0.0001) the number of clusters needed are 7975, 8629, 9260, 6574 respectively for the value of the correction factors taking different values, with the last one being most optimal.This will mean that 6574 hospitals with 30 surgery each will be needed to detect a halving of the incidence rate of leaving foreign objects inside the patient during surgery.Since for practical purpose, it will be easier if we can recruit higher number of patients in each hospital (say 100 patients from each hospitals) and reduce our need for convincing a big number of hospitals to join the trial, we compute the number of clusters in that setting and find that xxx hospital will suffice.

Discussions
The effect of an intervention (therapeutic device or drug administration, lifestyle change, or health care delivery system change etc.) is often evaluated by a cluster randomized trial.This implies that organizational units such as hospitals, communities, or clinics are randomly allocated to treatment conditions and all persons sampled from such cluster receive the same treatment assigned to the cluster.Instead of randomizing clusters, one may randomize persons within each cluster which is statistically more efficient but not always more convenient.We provide a formulation for assessing number of clusters needed for trials like this and show few existing formulations to be special cases.
Assumption of equal cluster size has been made in our formulation.This is not a practical assumption.It has been shown that sampling 25% more clusters when the sample sizes within cluster are extremely variable compensates for all weaknesses arising from this increase in variability [24].We expect this results to be applicable for our setting.
There has been various approaches for generating correlated binary data.Park (1996) provides a comprehensive review of related issues [25].Lunn's procedure, although straight and simple to implement, gave slightly biased estimates of ρ for our simulation.Another algorithm presented in [25], although not as simple as Lunn and Davis's method to implement, but that still requires no complicated procedures might have yielded unbiased estimates of ρ's.
Relying on approximate formulas is not ideal in the setting of CRT unlike the IRT setting where there is a particular choice of the optimal correction factor for all situation.This fact emphasizes the need for deriving a formula based on exact methods for CRTs with dichotomous outcome.

Table 1 :
See Attached document

Table 2 :
Sample size comparison for the illustrative example: Pathway to Health Study Sample Size Table, Codes, and Guidelines for Designing CRT