Exact Robust Tests for Detecting Candidate-Gene Association in Case-Parents Trio Design

In the case-parents trio design for testing candidate-gene association, the distribution of the data under the null hypothesis of no association is completely known. Therefore, the exact null distribution of any test statistic can be simulated by the Monte-Carlo method. In the literature, several robust tests have been proposed for testing association in the case-parents trio design when the genetic model is unknown, but all of these tests are based on the asymptotic null distributions of the test statistics. In this article, we promote exact robust tests using Monte-Carlo simulation, for two reasons: (i) the asymptotic tests are not accurate in terms of the probability of type I error when the sample size is small or moderate; (ii) asymptotic theory is not available for certain good candidate test statistics. We examined the validity of the asymptotic distributions of some of the test statistics studied in the literature and found that in certain cases the probability of type I error is greatly inflated in the asymptotic tests. We also propose new robust test statistics which are statistically more reasonable but for which no asymptotic theory is available. The powers of these robust statistics are compared with those of the existing statistics in the literature through a simulation study. It is found that these robust statistics are preferable to the others in terms of their efficiency and robustness.


Introduction
In testing the association between a candidate gene and a disease, the case-parents trio design proposed by Schaid and Sommer (1993) has been extensively studied in recent years. In the case-parents trio design, disease-affected children (cases) and their parents are ascertained, and the genotypes of the cases and the parents are then obtained. Statistical models have been developed based on the genotype relative risks, defined as ratios of penetrances of the candidate-gene genotypes, and on the conditional probabilities of case genotypes given parental mating types. Various test procedures have been proposed and studied for testing candidate-gene association under different assumptions on the underlying genetic mechanism. Schaid and Sommer (1993) proposed score tests for the situation where the underlying genetic mechanism is determined by one of four genetic models, i.e., the additive, recessive, multiplicative and dominant models. Note that the score test is asymptotically optimal when the underlying genetic model is correctly specified. In practice, however, the underlying mode of inheritance is unknown for complex diseases. In this situation, a score test may lose substantial power when the genetic model is mis-specified. Hence, it is necessary to consider robust tests that do not depend on model assumptions. Schaid (1999) considered an unconstrained likelihood ratio test (LRT) procedure, Zheng, Freidlin and Gastwirth (2002) studied a test procedure which they named MAX3, and Zheng, Chen and Li (2003) proposed a restricted version of the likelihood ratio test (RLRT).
The robust test procedures (MAX3 and RLRT) mentioned above are based on the asymptotic distributions of the test statistics. However, under the null hypothesis of no association, the distribution of the data in the case-parents design is completely known. Consequently, the exact null distribution of any test statistic can be simulated by Monte-Carlo methods. With the computing capacity now available, Monte-Carlo methods are feasible and computing power can be substituted for asymptotic theory. In this article, we consider exact robust tests using Monte-Carlo methods with case-parents design data. There are three persuasive reasons for this. (i) An asymptotic theory only provides an approximation to the null distribution of a test statistic. The accuracy of the approximation depends on the sample size, and how large the sample size must be for the approximation to be satisfactorily accurate is unknown in most cases. The Monte-Carlo method, by contrast, simulates the exact null distribution and can be made arbitrarily accurate by simply increasing the simulation size. (ii) Asymptotic theory is not available for every test statistic, especially in cases where the regularity conditions of the classical theory are not satisfied. The Monte-Carlo method is not subject to such regularity conditions. (iii) Though, in principle, the exact null distribution can be determined without resorting to Monte-Carlo simulation, as in the case of Fisher's exact test, this is only feasible when the sample size is small. In the trio design, the data consist of three informative parental mating types, and the expected sample sizes of the informative mating types are unbalanced. Even when the total sample size of the three informative mating types is relatively large, e.g., 1,000, the expected sample sizes for some informative mating types can still be small, which adversely affects the validity of the asymptotic distributions of some test statistics. On the other hand, with a total sample size of 1,000, determining the exact null distribution without resorting to Monte-Carlo simulation is practically impossible.
It should be remarked here that, by an exact test, we mean that the critical value or the p-value of the test is determined by the true null distribution rather than the asymptotic distribution of the test statistic. This differs from what other authors call exact tests. For example, Cleves, Olson and Jacobs (1997) considered exact transmission-disequilibrium tests, and Schaid (1999) considered exact tests for the case-parents design. However, what they referred to as an exact test is a special test whose critical region consists of points of small probability, that is, any point in the critical region has a smaller probability than any point outside the critical region. Simulation methods have also been considered by other authors. For instance, Lazzeroni and Lange (1998) considered a simulation method for the TDT, but their method is based on permutation instead of the exact null distribution. In this article, we propose exact robust tests based on test statistics for which asymptotic distributions are not available. We conduct extensive Monte-Carlo simulations to compare the power and robustness of various tests.
The article is arranged as follows. In section 2, the background of the case-parents design is given and various test statistics are discussed, including the new ones we propose. In section 3, the method of exact testing using Monte-Carlo simulation is described, and the validity of the asymptotic approximations is addressed. In section 4, the results of the power comparison of the various test statistics are presented. Conclusions are given in section 5.

Background
Suppose the candidate gene of concern has two different alleles denoted by A and a, which can form three possible genotypes aa, Aa and AA. One of the alleles has higher risk than the other. Denote the penetrances by $f_0 = \Pr(D \mid aa)$, $f_1 = \Pr(D \mid Aa)$ and $f_2 = \Pr(D \mid AA)$, which are the probabilities of developing the disease conditional on the genotype. Taking aa as the base genotype, the relative risks of genotypes Aa and AA are defined as $r_1 = f_1/f_0$ and $r_2 = f_2/f_0$, respectively. Genetic models can be specified using the genotype relative risks. A genetic model is called recessive (rec), additive (add), multiplicative (mul), or dominant (dom) if $r_1 = 1$, $r_1 = (1 + r_2)/2$, $r_2 = r_1^2$, or $r_1 = r_2$, respectively. The null hypothesis of no association states that the disease status is independent of the genotype. In terms of the relative risks, the null hypothesis can be stated as $H_0: r_1 = r_2 = 1$, that is, individuals with different genotypes have the same risk of developing the disease. Given parental mating types, the conditional probabilities of case genotypes can be derived in terms of $r_1$ and $r_2$, which provides the foundation for testing association between candidate gene and disease in the case-parents design. For a gene with two alleles, there are six possible combinations of parental genotypes, referred to as mating types: 1) AA × AA, 2) AA × Aa, 3) AA × aa, 4) Aa × Aa, 5) Aa × aa, and 6) aa × aa. The possible case (offspring) genotypes of each mating type are given in Table 1 (second column). For example, the case genotype can only be AA for mating type 1), since both parents can only transmit an A allele to the offspring. For mating type 2), there are two possible case genotypes: Aa (one parent transmits A and the other transmits a) and AA (both parents transmit A). Note that mating types 1), 3) and 6) can each produce only one possible offspring genotype. Since the presence or absence of association between the disease and the candidate gene is reflected by whether or not the probabilities of the different case genotypes depart from their Mendelian values given a parental mating type, mating types 1), 3) and 6) are non-informative. The third column of Table 1 consists of counts for the different combinations of genotypes of trios. The conditional probabilities of case genotypes given parental mating types are given in columns 4 to 8 of Table 1 in terms of $r_1$ and $r_2$: column 4 for the general genetic model (no assumption on the relationship between the genotype relative risks) and columns 5 to 8 for the four particular genetic models mentioned earlier.
The conditional case genotype probabilities can be obtained as follows. Let $g_0$, $g_1$ and $g_2$ stand for case genotypes aa, Aa and AA, respectively. Let $D$ and $MT$ stand for disease and mating type, respectively. Denote by $\Pr(g_i \mid MT, D)$ the conditional probability of case genotype $g_i$ given mating type $MT$. Then by the Bayes formula we have
$$\Pr(g_i \mid MT, D) = \frac{\Pr(D \mid MT, g_i)\,\Pr(g_i \mid MT)}{\sum_{j=0}^{2} \Pr(D \mid MT, g_j)\,\Pr(g_j \mid MT)},$$
where $\Pr(D \mid MT, g_i) = \Pr(D \mid g_i)$ because the disease status depends only on the genotype $g_i$, and $\Pr(g_j \mid MT)$ can be obtained by the Mendelian rule: a parent transmits each of his or her two alleles to an offspring with equal probability. For example, Pr(g
In the case-parents trio design, the cases (affected children) and their parents are ascertained and their genotypes are obtained. The data from the design consist of the numbers of case genotypes for each parental mating type. Let $n_k$, $k = 2, 4, 5$, be the total number of cases with the $k$th parental mating type.
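The Bayes step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `case_genotype_probs` and the dictionary `MENDEL` are hypothetical, and the Mendelian probabilities are entered directly for the three informative mating types. Because only ratios of penetrances matter, $f_0$ cancels and the relative risks $(1, r_1, r_2)$ are used as weights.

```python
from fractions import Fraction

# Mendelian probabilities Pr(g_j | MT) of case genotypes (aa, Aa, AA)
# for the informative mating types 2, 4 and 5 (hypothetical lookup table).
MENDEL = {
    2: (Fraction(0), Fraction(1, 2), Fraction(1, 2)),     # AA x Aa
    4: (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)),  # Aa x Aa
    5: (Fraction(1, 2), Fraction(1, 2), Fraction(0)),     # Aa x aa
}


def case_genotype_probs(mating_type, r1, r2):
    """Return Pr(g_i | MT, D) for g = (aa, Aa, AA) via the Bayes formula."""
    risks = (1, r1, r2)  # relative risks; the baseline penetrance f_0 cancels
    weights = [risk * prior for risk, prior in zip(risks, MENDEL[mating_type])]
    total = sum(weights)
    return tuple(w / total for w in weights)
```

Under the null hypothesis ($r_1 = r_2 = 1$) the function returns the Mendelian probabilities unchanged, which is the property the exact tests rely on.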
Table 1: Conditional probabilities of case genotypes given parental mating types for various genetic models
Conditional on $(n_2, n_4, n_5)$, the numbers of cases with the three informative parental mating types, the genotype counts $(n_{21}, n_{22})$, $(n_{40}, n_{41}, n_{42})$ and $(n_{51}, n_{52})$ are conditionally independent and follow binomial or trinomial distributions with cell probabilities given in Table 1. The likelihood function of $(r_1, r_2)$, referred to below as (2.1), is therefore the product of these binomial and trinomial probabilities. In the remainder of this section, we discuss various test statistics for testing the null hypothesis of no association based on this likelihood function.
When the alternative hypothesis is specified as one of the four genetic models mentioned above, Schaid (1999) studied the likelihood ratio test (LRT) and Schaid and Sommer (1993) considered the score test for each of the four models. They also showed that the score statistics for the additive model and the multiplicative model are the same. The resulting score statistics are denoted by $Z_{rec}$, $Z_{add}$ and $Z_{dom}$. It is noted that the score statistic $Z_{add}$ for the additive model is also the statistic of the transmission disequilibrium test (Spielman, McGinnis and Ewens, 1993). Under the null hypothesis, the score statistics asymptotically follow standard normal distributions. The critical values of the score tests can therefore be determined from the standard normal distribution. When the high-risk allele (either A or a) is specified, a one-sided test is carried out. If the allele risk status is unspecified, a two-sided test is carried out.
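As a rough illustration of the three score statistics, the sketch below standardizes the relevant genotype counts by their null means and variances. The explicit formulas are not reproduced from the source; they are derived here from the null binomial and trinomial distributions (cell probabilities 1/2, and 1/4, 1/2, 1/4), with A taken as the candidate high-risk allele. To avoid guessing the paper's subscript convention, counts are passed by genotype: `mt2 = (Aa, AA)`, `mt4 = (aa, Aa, AA)`, `mt5 = (aa, Aa)`. The function name is hypothetical.

```python
import math


def score_statistics(mt2, mt4, mt5):
    """Return (z_rec, z_add, z_dom) from case genotype counts.

    mt2 = (Aa, AA) counts for AA x Aa, mt4 = (aa, Aa, AA) for Aa x Aa,
    mt5 = (aa, Aa) for Aa x aa. Assumes the normalizers are positive.
    """
    n2, n4, n5 = sum(mt2), sum(mt4), sum(mt5)
    c_dom = math.sqrt(3 * n4 + 4 * n5)
    c_rec = math.sqrt(4 * n2 + 3 * n4)
    c_add = math.sqrt(n2 + 2 * n4 + n5)
    # Recessive: excess of AA cases in mating types 2 and 4.
    z_rec = (4 * (mt2[1] + mt4[2]) - 2 * n2 - n4) / c_rec
    # Dominant: excess of Aa or AA cases in mating types 4 and 5.
    z_dom = (4 * (mt4[1] + mt4[2] + mt5[1]) - 3 * n4 - 2 * n5) / c_dom
    # Additive (TDT): excess of A alleles transmitted by heterozygous parents.
    a_transmitted = mt2[1] + mt4[1] + 2 * mt4[2] + mt5[1]
    z_add = (2 * a_transmitted - (n2 + 2 * n4 + n5)) / c_add
    return z_rec, z_add, z_dom
```

With counts exactly at their Mendelian expectations all three statistics are zero, and each has unit variance under the null by construction.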
As mentioned in Section 1, when the underlying genetic model is unknown, the use of any of the above score tests $Z_{rec}$, $Z_{add}$ and $Z_{dom}$ may lose substantial power when the model is mis-specified. This is because these three tests correspond to three extreme cases. Robust tests are therefore desirable. A test is robust in the sense that its power is not affected much by the underlying genetic model. The use of robust tests was introduced in Gastwirth (1985) and Freidlin, Podgor and Gastwirth (1999) in a general context. In the context of the case-parents trio design, when the genetic model is unspecified under the alternative hypothesis but the allele risk status is specified, say, A is specified as the high-risk allele, several test statistics which share certain robustness properties have been proposed in the literature. Zheng et al. (2002) considered the MAX3 statistic given by
$$\mathrm{MAX3}_1 = \max\{Z_{rec}, Z_{add}, Z_{dom}\}. \qquad (2.5)$$
The asymptotic distribution of $\mathrm{MAX3}_1$ is determined by the joint asymptotic distribution of $Z_{rec}$, $Z_{add}$ and $Z_{dom}$. Note that
$$Z_{add} = \frac{c_1 Z_{dom} + c_2 Z_{rec}}{2 c_3},$$
where $c_1 = (3n_4 + 4n_5)^{1/2}$, $c_2 = (4n_2 + 3n_4)^{1/2}$ and $c_3 = (n_2 + 2n_4 + n_5)^{1/2}$. Therefore the asymptotic distribution of $\mathrm{MAX3}_1$ is in fact determined by the joint asymptotic distribution of $Z_{dom}$ and $Z_{rec}$. They derived the exact correlation between $Z_{rec}$ and $Z_{dom}$ under the null hypothesis, which is given by $\rho = n_4 / \{(3n_4 + 4n_5)(3n_4 + 4n_2)\}^{1/2}$. Though there is no closed form for the asymptotic distribution of $\mathrm{MAX3}_1$, the distribution can be simulated easily. Zheng et al. (2003) studied a restricted likelihood ratio test (RLRT). Note that when A is specified as the high-risk allele, we have $r_2 \ge r_1 \ge 1$. It is then more reasonable to consider the likelihood ratio test under this restriction. The test statistic for the RLRT is given by
$$\mathrm{RLRT}_1 = 2\Big\{\max_{(r_1, r_2) \in T_1} \log L(r_1, r_2) - \log L(1, 1)\Big\},$$
where $L(r_1, r_2)$ is the likelihood function given in (2.1), and $T_1$ is the cone in the $r_1$-$r_2$ plane determined by the restriction $r_2 \ge r_1 \ge 1$. By a result of Self and Liang (1987), the asymptotic distribution of $\mathrm{RLRT}_1$ is a mixture of a degenerate distribution at zero and two chi-square distributions with 1 and 2 degrees of freedom.
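The remark that the asymptotic distribution of MAX3 "can be simulated easily" can be made concrete with a short sketch. It draws $(Z_{rec}, Z_{dom})$ from a bivariate standard normal distribution with the correlation $\rho$ given above and forms $Z_{add}$ as a linear combination of the two. The combination $Z_{add} = (c_1 Z_{dom} + c_2 Z_{rec})/(2c_3)$ is an assumption derived here from the null variances rather than quoted from the source, and the function name, seed, and default simulation size are hypothetical.

```python
import math
import random


def max3_asymptotic_critical_value(n2, n4, n5, alpha=0.05, m=20000, seed=1):
    """Simulate the upper-alpha critical value of max(Z_rec, Z_add, Z_dom)
    under the asymptotic (bivariate normal) null distribution."""
    rng = random.Random(seed)
    c1 = math.sqrt(3 * n4 + 4 * n5)
    c2 = math.sqrt(4 * n2 + 3 * n4)
    c3 = math.sqrt(n2 + 2 * n4 + n5)
    rho = n4 / (c1 * c2)  # null correlation between Z_rec and Z_dom
    draws = []
    for _ in range(m):
        z_rec = rng.gauss(0, 1)
        # Construct Z_dom with correlation rho to Z_rec.
        z_dom = rho * z_rec + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # Assumed linear relation linking Z_add to the other two statistics.
        z_add = (c1 * z_dom + c2 * z_rec) / (2 * c3)
        draws.append(max(z_rec, z_add, z_dom))
    draws.sort()
    return draws[math.ceil((1 - alpha) * m) - 1]
```

The returned value necessarily lies above the single-statistic value 1.645 (up to simulation noise), since the maximum of the three statistics is never smaller than any one of them.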
In practice, it is more realistic that neither the genetic model under the alternative hypothesis nor the risk status of the alleles can be specified. In this case, the $\mathrm{MAX3}_1$ statistic can be extended straightforwardly to the case of unspecified allele risk status as follows:
$$\mathrm{MAX3}_2 = \max\{|Z_{rec}|, |Z_{add}|, |Z_{dom}|\}. \qquad (2.7)$$
Moreover, the range of $(r_1, r_2)$ is the union of the cone $T_1$ and the triangle $T_2$ determined by the restriction $0 \le r_2 \le r_1 \le 1$. Let $T = T_1 \cup T_2$. Note that $T$ is not a cone and hence differs from the parameter space considered by Self and Liang (1987). It is then natural to consider the restricted likelihood ratio test with the alternative that $(r_1, r_2) \in T \setminus \{(1, 1)\}$. The test statistic for this RLRT is given below:
$$\mathrm{RLRT}_2 = 2\Big\{\max_{(r_1, r_2) \in T} \log L(r_1, r_2) - \log L(1, 1)\Big\}.$$
We can also consider, as a test statistic, the maximum of the four LRT statistics obtained by restricting to each of the four genetic models. We refer to this statistic as $\mathrm{MLRT}_1$ when the range of $(r_1, r_2)$ is confined to $T_1$, and as $\mathrm{MLRT}_2$ otherwise. All these robust statistics ($\mathrm{MAX3}_2$, $\mathrm{RLRT}_2$, $\mathrm{MLRT}_1$ and $\mathrm{MLRT}_2$), though arising naturally, have not yet been considered in the literature. A difficulty with these statistics is that we cannot find appropriate approximations to their null distributions. Moreover, the exact null distributions of these robust tests have not been examined.
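To make the restricted likelihood ratio statistic concrete, the following sketch evaluates it by a coarse grid search over the cone $r_2 \ge r_1 \ge 1$. Everything here is an assumption for illustration: the log-likelihood is written from the general-model cell probabilities implied by the Bayes formula (up to constants that cancel in the ratio), counts are passed by genotype as `mt2 = (Aa, AA)`, `mt4 = (aa, Aa, AA)`, `mt5 = (aa, Aa)`, and the grid bound `r_max` and resolution `steps` are arbitrary. A proper implementation would use a constrained optimizer instead of a grid.

```python
import math


def log_likelihood(r1, r2, mt2, mt4, mt5):
    """Log-likelihood of (r1, r2), dropping multinomial coefficients."""
    d2, d4, d5 = r1 + r2, 1 + 2 * r1 + r2, 1 + r1
    return (mt2[0] * math.log(r1 / d2) + mt2[1] * math.log(r2 / d2)
            + mt4[0] * math.log(1 / d4) + mt4[1] * math.log(2 * r1 / d4)
            + mt4[2] * math.log(r2 / d4)
            + mt5[0] * math.log(1 / d5) + mt5[1] * math.log(r1 / d5))


def rlrt1_grid(mt2, mt4, mt5, r_max=5.0, steps=80):
    """Approximate RLRT_1 by maximizing over a grid on r2 >= r1 >= 1."""
    grid = [1 + (r_max - 1) * i / steps for i in range(steps + 1)]
    null = log_likelihood(1.0, 1.0, mt2, mt4, mt5)
    best = max(log_likelihood(r1, r2, mt2, mt4, mt5)
               for r1 in grid for r2 in grid if r2 >= r1)
    return 2 * (best - null)
```

Because the grid contains the null point $(1, 1)$, the returned value is never negative. Data showing a deficit of A alleles give a value of zero, which reflects the point mass at zero in the mixture distribution mentioned above.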

Exact Tests and Validity of Asymptotic Distributions
As discussed in the last section, the data from a case-parents design, $(n_{21}, n_{22})$, $(n_{40}, n_{41}, n_{42})$ and $(n_{51}, n_{52})$, follow binomial or trinomial distributions when conditioning on the numbers of cases with the three informative parental mating types. The cell probabilities of these distributions, given in Table 1, are determined by $r_1$ and $r_2$. Under the null hypothesis of no association, $r_1 = r_2 = 1$. Therefore, the cell probabilities of the binomial and trinomial distributions are completely determined under the null hypothesis. Specifically, $(n_{21}, n_{22})$ follows the binomial distribution $B(n_2, 1/2)$, $(n_{40}, n_{41}, n_{42})$ follows the trinomial distribution $\mathrm{Mul}(n_4; 1/4, 1/2, 1/4)$ and $(n_{51}, n_{52})$ follows the binomial distribution $B(n_5, 1/2)$. As a consequence, the null distribution of any test statistic can be simulated and an exact test can be carried out. Let $N = (n_{21}, n_{22}, n_{40}, n_{41}, n_{42}, n_{51}, n_{52})$. For any test statistic $S(N)$, its null distribution can be simulated as follows. Given $n_2$, $n_4$ and $n_5$, we generate $N$ from the above binomial and trinomial distributions $m$ times. Here $m$ is the simulation size and is determined by the accuracy we desire for the simulated null distribution to approximate the theoretical null distribution. Denote these simulated values by $N_i$, $i = 1, \ldots, m$. Then the theoretical null distribution of $S(N)$ is approximated by the empirical distribution of the simulated statistics,
$$\hat F_m(s) = \frac{1}{m} \sum_{i=1}^{m} I\{S(N_i) \le s\}.$$
The accuracy of this approximation does not depend on the sample sizes $n_2$, $n_4$ and $n_5$. In principle, the approximation can be made arbitrarily accurate by simply increasing the simulation size $m$. Although the exact test described above can be carried out whether or not an asymptotic null distribution of the test statistic is available and whether or not the sample sizes are large, one might still prefer an asymptotic test, because of its simplicity, when the asymptotic null distribution of the test statistic is available. However, caution must be taken regarding the validity of the asymptotic distributions, especially when the sample sizes are small or moderate. In the case-parents design, the validity of an asymptotic distribution is affected by the effective sample sizes $n_2$, $n_4$ and $n_5$. These effective sample sizes are in fact random variables and their expected values are greatly affected by the allele frequencies ($\Pr(A) = p$ and $\Pr(a) = q = 1 - p$) of the population. Under the assumption of Hardy-Weinberg equilibrium, the genotype frequencies are determined by the allele frequencies as $\Pr(aa) = q^2$, $\Pr(Aa) = 2pq$ and $\Pr(AA) = p^2$. Given the total number $n$ of cases with the three informative parental mating types ($n = n_2 + n_4 + n_5$), the expected values are given by $E(n_k) = n p_k/(p_2 + p_4 + p_5)$, $k = 2, 4, 5$, where $p_2 = 2p^3 q (r_1 + r_2)/R$, $p_4 = p^2 q^2 (r_2 + 2r_1 + 1)/R$, $p_5 = 2pq^3 (r_1 + 1)/R$ and $R = p^2 r_2 + 2pq r_1 + q^2$. The conditional expected sample sizes of each informative parental mating type given $n = 100$ for various genetic models and allele frequencies are given in Table 2. It can be seen from Table 2 that when $p$ is small the expected effective sample sizes are very unbalanced. For example, when $p = 0.3$, the expected sample sizes are 11, 27 and 62. When $n$ increases, the expected proportions $E(n_k)/n$, $k = 2, 4, 5$, stay the same as in Table 2. This imbalance can adversely affect the validity of the asymptotic distribution, as will be seen in the results reported later.
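The formula for the expected effective sample sizes is simple to evaluate directly. The sketch below, with a hypothetical function name, implements $E(n_k) = n p_k/(p_2 + p_4 + p_5)$ exactly as stated in the text; the defaults $r_1 = r_2 = 1$ correspond to the null hypothesis.

```python
def expected_effective_sizes(n, p, r1=1.0, r2=1.0):
    """Return (E(n_2), E(n_4), E(n_5)) given total n and allele frequency p."""
    q = 1 - p
    big_r = p ** 2 * r2 + 2 * p * q * r1 + q ** 2
    p2 = 2 * p ** 3 * q * (r1 + r2) / big_r           # AA x Aa
    p4 = p ** 2 * q ** 2 * (r2 + 2 * r1 + 1) / big_r  # Aa x Aa
    p5 = 2 * p * q ** 3 * (r1 + 1) / big_r            # Aa x aa
    total = p2 + p4 + p5
    return tuple(n * pk / total for pk in (p2, p4, p5))
```

For $n = 100$ and $p = 0.3$ under the null, the values round to 11, 27 and 62, in agreement with the figures quoted in the text. At $p = 0.5$ the three expected sizes are equal.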
For the test statistics whose asymptotic distributions are available, the validity of the asymptotic distributions for given sample sizes can be verified by comparison with the simulated exact null distributions. In the remainder of this section, we report some results on the comparison between the exact and the asymptotic null distributions for the test statistics $Z_{rec}$, $Z_{add}$, $Z_{dom}$ and $\mathrm{MAX3}_1$. For total sample sizes $n = 100$, 500 and 1,000, the effective sample sizes $n_2$, $n_4$ and $n_5$ are calculated using the A allele frequencies $p = 0.01$, 0.1, 0.3 and 0.5 under the null hypothesis $r_1 = r_2 = 1$. For each set of effective sample sizes so obtained, the exact critical values for a one-sided test of size $\alpha = 0.05$ are simulated for the four statistics. The simulation size is $m = 10{,}000$. The asymptotic critical values for $\mathrm{MAX3}_1$ are also obtained by simulation. These critical values are given in Table 3. For the three score statistics, the critical values are to be compared with 1.645, which is the critical value of the asymptotic test. In Table 3, those exact critical values that differ from the corresponding asymptotic critical values by at least 0.05 are marked with an asterisk. While, generally speaking, the asymptotic critical values provide reasonably accurate approximations to the exact critical values, there are a few cases where the asymptotic critical values differ substantially from the exact critical values. The problem is most prominent for the statistic $Z_{rec}$, especially when the allele frequency $p$ is small.
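The Monte-Carlo determination of an exact critical value can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' program: the helper names are hypothetical, counts are passed by genotype (`mt2 = (Aa, AA)`, `mt4 = (aa, Aa, AA)`, `mt5 = (aa, Aa)`), the form of `z_rec` is derived from the null variances rather than copied from the source, and the critical value is taken as the empirical $(1 - \alpha)$ quantile so that rejecting when the statistic strictly exceeds it keeps the simulated size at or below $\alpha$.

```python
import math
import random


def simulate_null_counts(n2, n4, n5, rng):
    """Draw one set of genotype counts under r1 = r2 = 1."""
    aa2 = sum(rng.random() < 0.5 for _ in range(n2))  # AA cases, AA x Aa
    mt2 = (n2 - aa2, aa2)
    mt4 = [0, 0, 0]
    for _ in range(n4):
        # Each Aa parent transmits A with probability 1/2.
        mt4[(rng.random() < 0.5) + (rng.random() < 0.5)] += 1
    het5 = sum(rng.random() < 0.5 for _ in range(n5))  # Aa cases, Aa x aa
    mt5 = (n5 - het5, het5)
    return mt2, tuple(mt4), mt5


def z_rec(mt2, mt4, mt5):
    """Recessive score statistic (assumed form; mt5 carries no information)."""
    n2, n4 = sum(mt2), sum(mt4)
    return (4 * (mt2[1] + mt4[2]) - 2 * n2 - n4) / math.sqrt(4 * n2 + 3 * n4)


def exact_critical_value(stat, n2, n4, n5, alpha=0.05, m=10000, seed=0):
    """Empirical (1 - alpha) quantile of stat under the simulated null."""
    rng = random.Random(seed)
    values = sorted(stat(*simulate_null_counts(n2, n4, n5, rng))
                    for _ in range(m))
    return values[math.ceil((1 - alpha) * m) - 1]
```

A call such as `exact_critical_value(z_rec, 11, 27, 62)` returns a value that can be compared with the asymptotic value 1.645. Any other statistic with the same argument signature can be passed as `stat`.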
To further investigate the properties of the exact null distribution of $Z_{rec}$ when the allele frequency $p$ is small, we simulated the exact null distribution of $Z_{rec}$ with $p = 0.01$ and $n = 100$, 500, 1,000 and 10,000, using a simulation of size $m = 10{,}000$. The simulated results reveal that the probability mass of the exact null distribution concentrates on only a few points. For $n = 100$, the probability mass concentrates on three points. For $n = 500$, it concentrates on five points. Even for $n = 1{,}000$, it concentrates on only eight points. This suggests that the null distribution is so discrete that a continuous approximation such as the standard normal distribution might not be appropriate. The simulated null distributions are presented in Table 4. For each $n$, the left column gives the points at which there is positive mass, and the right column gives the corresponding probabilities. For $n = 10{,}000$, only the upper tail of the distribution is presented. Also shown in Table 4 are the type I errors when the asymptotic critical value 1.645 is used. It can be seen that the true type I error is greatly inflated when the asymptotic test is carried out.
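The discreteness of $Z_{rec}$ can also be seen by direct enumeration when $n_2$ and $n_4$ are small. The sketch below is a hypothetical illustration: it assumes the recessive statistic depends on the data only through the number of AA cases in mating types 2 and 4 (binomial with success probabilities 1/2 and 1/4 under the null), and the sample sizes used in the usage note are illustrative values, not those behind Table 4.

```python
import math
from collections import defaultdict


def binom_pmf(k, n, p):
    """Binomial probability Pr(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def z_rec_null_pmf(n2, n4):
    """Exact null probability mass function of Z_rec given n2 and n4."""
    pmf = defaultdict(float)
    scale = math.sqrt(4 * n2 + 3 * n4)
    for k2 in range(n2 + 1):        # AA cases from AA x Aa
        for k4 in range(n4 + 1):    # AA cases from Aa x Aa
            z = (4 * (k2 + k4) - 2 * n2 - n4) / scale
            pmf[z] += binom_pmf(k2, n2, 0.5) * binom_pmf(k4, n4, 0.25)
    return dict(sorted(pmf.items()))


def type_one_error(pmf, critical_value=1.645):
    """True size of the test that rejects when Z_rec > critical_value."""
    return sum(prob for z, prob in pmf.items() if z > critical_value)
```

With the illustrative sizes $n_2 = 0$ and $n_4 = 1$, the statistic takes only two values and `type_one_error(z_rec_null_pmf(0, 1))` equals 0.25, five times the nominal level of 0.05.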

Comparison of Test Statistics
With the feasibility of exact tests, we have a much richer repertoire of test statistics for testing gene-disease association in a case-parents design. In this section, we compare, under some specified alternative hypotheses, the powers of the following test statistics: $Z_{rec}$, $Z_{add}$, $Z_{dom}$, $\mathrm{MAX3}_1$, $\mathrm{MAX3}_2$, $\mathrm{MLRT}_1$, $\mathrm{MLRT}_2$, $\mathrm{RLRT}_1$ and $\mathrm{RLRT}_2$. We consider two situations: (a) the allele risk status is specified and (b) the allele risk status is unspecified. In the first situation, $Z_{rec}$, $Z_{add}$, $Z_{dom}$, $\mathrm{MAX3}_1$, $\mathrm{MLRT}_1$ and $\mathrm{RLRT}_1$ are compared. To fix ideas, allele A is assumed to be the high-risk allele. In the second situation, $Z_{rec}$, $Z_{add}$, $Z_{dom}$, $\mathrm{MAX3}_2$, $\mathrm{MLRT}_2$ and $\mathrm{RLRT}_2$ are compared. In the comparison, the total sample size is taken as $n = 200$, the allele frequency $p$ is taken as $p = 0.01$, 0.1, 0.3 and 0.5, and the following five alternative hypotheses are considered: (i) $(r_1, r_2) = (1, 2)$, a recessive model, (ii) $(r_1, r_2) = (1.5, 1.5)$, a dominant model, (iii) $(r_1, r_2) = (1.5, 2.25)$, a multiplicative model, (iv) $(r_1, r_2) = (1.5, 2.0)$, an additive model, and (v) $(r_1, r_2) = (1.3, 2.3)$, a model which does not fall into any of the four categories of genetic models. The effective sample sizes $n_2$, $n_4$ and $n_5$ are taken as their conditional expectations under each of the five models. In all cases, the size of the test is fixed at $\alpha = 0.05$ and the simulated exact critical value is used, so that the type I errors of all the tests are controlled at the same level. The power of each test statistic at a given alternative hypothesis is simulated with a simulation of size $m = 10{,}000$. The simulated powers of the test statistics for situations (a) and (b) are given in Tables 5 and 6, respectively. By comparing the powers in these two tables, we observe the following features. (i) If the genetic model is correctly specified under the alternative hypothesis, the score statistic derived by assuming the specified genetic model is generally more powerful than the other test statistics. (ii) The performances of the MLRT and RLRT statistics in both situations are comparable. They are robust in the sense that their powers are only smaller than that of the score statistic optimal for the specified genetic model but are larger than or comparable with those of all the other test statistics. (iii) The MLRT and RLRT statistics are generally more powerful than the MAX3 statistics. We deliberately used the word "generally" because there are a few discrepancies from the above statements in the two tables. These discrepancies might be caused by simulation error. It should be noted that the performance of MAX3 is only slightly worse than that of MLRT and RLRT, but MAX3 is easier to compute. If one is willing to sacrifice a little power for ease of computation, MAX3 is also a good choice for testing gene-disease association in the case-parents design.
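The power simulation described in this section can be sketched along the following lines. The code is an illustration under assumptions, not the authors' program: genotype counts under an alternative $(r_1, r_2)$ are drawn with cell probabilities proportional to $(r_1, r_2)$, $(1, 2r_1, r_2)$ and $(1, r_1)$ for mating types 2, 4 and 5, the additive score statistic is written in its TDT form derived from the null variances, and all function names, the seed, and the example sample sizes are hypothetical.

```python
import math
import random


def draw_counts(n2, n4, n5, r1, r2, rng):
    """Draw genotype counts for the three informative mating types."""
    def multinomial(n, weights):
        counts = [0] * len(weights)
        for idx in rng.choices(range(len(weights)), weights=weights, k=n):
            counts[idx] += 1
        return tuple(counts)
    mt2 = multinomial(n2, (r1, r2))          # (Aa, AA) from AA x Aa
    mt4 = multinomial(n4, (1, 2 * r1, r2))   # (aa, Aa, AA) from Aa x Aa
    mt5 = multinomial(n5, (1, r1))           # (aa, Aa) from Aa x aa
    return mt2, mt4, mt5


def z_add(mt2, mt4, mt5):
    """Additive score (TDT) statistic in the assumed standardized form."""
    het = sum(mt2) + 2 * sum(mt4) + sum(mt5)  # heterozygous parents
    a_transmitted = mt2[1] + mt4[1] + 2 * mt4[2] + mt5[1]
    return (2 * a_transmitted - het) / math.sqrt(het)


def simulated_power(stat, sizes, r1, r2, alpha=0.05, m=10000, seed=0):
    """Power of the one-sided exact test based on stat at (r1, r2)."""
    rng = random.Random(seed)
    # Exact critical value from the simulated null distribution.
    null = sorted(stat(*draw_counts(*sizes, 1, 1, rng)) for _ in range(m))
    critical = null[math.ceil((1 - alpha) * m) - 1]
    # Rejection frequency under the alternative.
    hits = sum(stat(*draw_counts(*sizes, r1, r2, rng)) > critical
               for _ in range(m))
    return hits / m
```

For example, `simulated_power(z_add, (23, 53, 124), 1.5, 2.0)` estimates the power of the additive score test at the additive alternative (iv) for an illustrative set of effective sample sizes. Passing $r_1 = r_2 = 1$ instead gives an estimate of the attained size.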

Conclusions
We briefly draw our conclusions in this section. For testing gene-disease association in case-parents trio designs, exact tests using any test statistic are feasible because of the computational capacity now available and the fact that the distribution of the data under the null hypothesis of no association is completely known. In general, exact tests should be preferred to asymptotic tests, especially when the effective sample sizes are quite unbalanced, which can result from a small allele frequency. When the allele frequency is small, some of the asymptotic tests differ substantially from their corresponding exact tests, as demonstrated in our simulation study. The robust test statistics MLRT or RLRT should be preferred to all the other test statistics because of their efficiency and robustness, unless one is particularly interested in testing a specific genetic model or is quite certain which genetic model holds under the alternative hypothesis.

Table 2: The expected sample sizes $n_2$, $n_4$ and $n_5$, conditional on the total sample size $n = 100$, for various genetic models and allele frequencies

Table 3: The simulated exact and asymptotic critical values of score statistics and $\mathrm{MAX3}_1$ for tests of size $\alpha = 0.05$ (simulation size $m = 10{,}000$)

Table 4: The simulated null distribution of $Z_{rec}$ when $p = 0.01$, based on 10,000 replications
Notes: Only the upper tail of the null distribution is reported. Type I error is calculated when 1.645 is used as the critical value.

Table 5: The powers at five specified alternatives of one-sided tests using six test statistics (total sample size $n = 200$ and simulation size $m = 10{,}000$)
Alternatives: 1 Recessive; 2 Dominant; 3 Multiplicative; 4 Additive; 5 Arbitrary