Assessment of Eﬀects of Age and Gender on the Incubation Period of COVID-19 with a Mixture Regression Model

Following the outbreak of COVID-19, various containment measures have been taken, including the use of quarantine. At present, the quarantine period is the same for everyone, since it is implicitly assumed that the incubation period distribution of COVID-19 is the same regardless of age or gender. For testing the eﬀects of age and gender on the incubation period of COVID-19, a novel two-component mixture regression model is proposed. An expectation-maximization (EM) algorithm is adopted to obtain estimates of the parameters of interest, and the simulation results show that the proposed method outperforms the simple regression method and has robustness. The proposed method is applied to a Zhejiang COVID-19 dataset, and it is found that age and gender statistically have no eﬀect on the incubation period of COVID-19, which indicates that the quarantine measure currently in operation is reasonable.


Introduction
On March 11, 2020, the World Health Organization (WHO) declared a pandemic of COVID-19, which is also called SARS-CoV-2. The outbreak of COVID-19 poses a serious challenge to global public health and economy. By April 11, 2020, the pandemic had caused more than 1, 700, 000 confirmed cases of infection and over 100, 000 fatalities. To cope with this crisis, several containment measures, including isolation of infected individuals, travel restrictions, quarantine, etc, have been implemented in many countries to suppress virus transmission via human-to-human contact. It is worth mentioning that in addition to taking these measures, China has also established an efficient close-contact tracing system. Through these efforts, the outbreak of COVID-19 in China is now well under control, which shows that such measures can effectively block the transmission chain of COVID-19. One effective way of finding potentially infected individuals is to keep those who may have been exposed to infectious pathogens in quarantine for some time. As is well known, one of the key factors determining the optimal quarantine time for suspected cases is a good understanding of the incubation period.
To date, there has been some excellent work on the incubation period of COVID-19. Based on the first 425 laboratory-confirmed cases reported on January 22, 2020 in China, but with only 10 of them having exactly recalled dates of getting infected, Li et al. (2020) fitted a lognormal distribution and found a mean incubation period of 5.2 days. Similarly, by analyzing 291 patients who recalled their dates of exposure to infectious pathogens,  found a median incubation period of 4.0 days. However, these studies involved individuals' recall bias. To fix this problem, Lauer et al. (2020) collected time data of four events, including possible exposure to COVID-19 and symptom onset. For example, the exact exposure date was obtained if available; otherwise, upper and lower bounds were obtained to form a possible interval of exposure. A parametric accelerated failure time model was adopted and gave an estimate of the median incubation period of COVID-19 that was also 5.2 days. Analogously, Backer et al. (2020) estimated the distribution of the incubation period using the censored intervals for the incubation periods of some confirmed cases, with these intervals having been obtained from the relevant dates of travel history and symptom onset. However, both of these works suffered from two sampling bias problems. One is that the short follow-up time meant that shorter incubation periods would be observed more frequently. The other was that as the observations were of time lags between two specific timings (e.g., between the date of departure from an epidemic focus and the date of symptom onset), patients with longer incubation periods were more easily observed. Linton et al. (2020) adopted a similar approach to Backer et al. (2020), but corrected the shorter incubation period bias. To handle the longer incubation period bias, Qin et al. (2020) used renewal process theory and proposed a length-biased Weibull distribution to fit the specific time lag data of 1211 confirmed cases leaving Wuhan between January 19 and 23, 2020. They estimated the median of the incubation period to be 8.13 days, much longer than the results mentioned above.
The first motivation for proposing a mixture regression model is that although the assumptions in Qin et al. (2020) are quite reasonable, some cases may become infected on the day of departure, and thus their observed time lags between departure from Wuhan and symptom onset are also complete incubation periods, which are the time lags between infection and symptom onset. This was also noted by Qin et al. (2020), but they only considered it in the context of sensitivity analysis, whereas our focus here is clearly different from theirs. Therefore, a mixture model would be more appropriate, since it provides a flexible tool to model data arising from a heterogeneous population. Traditional mixture models involve no regression, and much excellent work has been done on these models. For example, based on the Hessian of the multivariate normal mixture model, Boldea and Magnus (2009) gave estimates of all parameters that appeared to be superior to previous estimates. Later, Qin and Priebe (2013) obtained robust estimates by maximizing a novel L q likelihood through the expectation-maximization (EM) algorithm. Chen (2017) also presented detailed and correct consistency results from a maximum likelihood estimator (MLE) with traditional mixture models and streamlined some previously obtained results. More detailed and significant reviews can be found in the book by McLachlan and Peel (2000).
As for the second motivation for our proposed model, we note that, except for the mixture problem, all of the incubation period studies mentioned above considered only the incubation period in the whole population. However, for physiological reasons, the incubation period distributions of infected individuals may differ depending on age and gender. These effects may have a negative influence on current quarantine measures. Consequently, it is meaningful to assess the effect of age and gender on the incubation period of COVID-19 through a regression model. The mixture model approaches mentioned above are clearly not appropriate for a regression problem. For the regression model, Jiang and Tanner (1999) considered a hierarchical mixtures-of-experts model in which exponential family regression was mixed, and they obtained their estimates by a maximum likelihood method. Khalili and Chen (2007) defined a family of parametric conditional density functions and created a finite mixture of regression models (FRM). Through the use of a weighted penalized log-likelihood function, they implemented variable selection procedure for the FRM. However, these two models are not appropriate for our regression problem.
In this paper, we propose a novel two-component mixture regression model (1). Parameter estimates are obtained through the maximum likelihood method and the EM algorithm is adopted. For a more detailed literature review of the EM algorithm, readers can refer to the book by Liang et al. (2010). As is well known, one drawback of the EM algorithm is the dependence of its solution on the initial values that are used, and this is often a consequence of a local maxima problem with the objective functions. Dimitris and Evdokia (2003) studied the effect of initial values on the EM algorithm for a finite normal mixture model with each normal component having common variance and for a finite Poisson mixture model. They compared several methods for choosing initial values for the EM algorithm in these models, including, for example, a random starting point and starting at some moment estimates. Although their new initial value estimator produced better results, it cannot be easily extended to our setting. In this paper, owing to its simplicity, we adopt the classic random initialization method to partially handle the initial value dependence of the EM algorithm, and we conduct a sensitivity analysis for our settings with the aim of showing that this initialization method is a reasonable one.
The rest of the paper is organized as follows. In Section 2, a novel mixture regression model (1) is proposed and the reason for formulating such a model is discussed. The estimator of the parameters of interest is the MLE based on the conditional likelihood of observed data, and the estimates are calculated using the EM algorithm. In Section 3, several simulation studies are conducted to test the performance of the proposed method. Furthermore, under various possible fixed ranges of uninteresting parameters, a sensitivity analysis is implemented. The results show that the estimates of the parameters of interest are robust to these various settings. Some simulations are also conducted to test whether likelihood ratio tests (l.r.t.'s) work for our proposed model. The results reveal that the application of l.r.t.'s is appropriate. In Section 4, the proposed model and method are applied to a Zhejiang COVID-19 dataset. The sensitivity analysis shows the estimates of the parameters of interest and maximum likelihood are robust and makes the results more reliable. Finally, based on l.r.t.'s for the regression coefficients of age and gender in the model (1), we find that it is statistically not rejected that age and gender have no effect on the incubation period of COVID-19.
The motivation for formulating (1) is as follows. Assume that an infected case with covariates X has incubation period T satisfying The error term f has density function f (·). Thus, the conditional density of T given X = x is f (t exp(x β)) exp(x β)(β = −γ ). When γ = 0, i.e., X has no effect on the incubation period T , the density of T will be f (t) and typically is assumed to be the Weibull distribution density. This is the reason for our choice of f λ,α for f in the density (1).
For a confirmed infected case, let V denote his/her time lag between departure from Wuhan and onset of symptoms, which can be considered as the forward time in a renewal process, and let A be the time lag between infection and departure from Wuhan, which can be considered as the backward time and unobservable. As pointed out in Qin et al. (2020), V is a length-biased version of the incubation period T , since it is easier to observe V if T = A + V is longer. Now, for the above infected case who has covariates X = x and incubation period density function f (t exp(x β)) exp(x β), given X = x, by renewal process theory, the joint density of (A, V ) is Marginally, given X = x, A and V have the same density, i.e., This is equivalent to log(V ) = X γ + g , and the error term g has density function g(·).
In short, given X = x, if A = 0 and V actually is the complete incubation period T , then it has density f (v exp(x β)) exp(x β). Otherwise, its density is g(v exp(x β)) exp(x β). Combined with the mix proportion π(x, θ), which is typically assumed to have logistic regression model form, and our choice for f , we obtain the proposed density (1).
Unlike the classic two-component mixture model, the two density components in (1) share the same parameters λ, α, β. Thus, estimation of λ, α, β is always possible. As is well known, the unidentifiability problem of parameters in a Gaussian mixture model is caused by the unconstrained mix proportion (Boldea and Magnus, 2009), but this does not happen with our model. The identification of the parameter of interest β in our mixture regression model can be summarized as the following theorem. In practice, we obtain the MLE (λ ML ,α ML ,β ML ,θ ML ) of (λ, α, β, θ) through maximizing the conditional likelihood of the observed data: Under some regularity conditions, when the underlying model is (1), the consistency of the MLE is obvious. Interestingly, in some cases, the MLE of β is robust to misspecified model setting. (1) is known to be independent of x, which means θ 1 ≡ 0 and the mix proportion is constant, then, even when the underlying incubation period distribution F λ,α actually is not the Weibull c.d.f.,β ML obtained by our model still is a consistent estimate of β 0 .
Proof: See the Supplementary Material. Next, we discuss how to calculate the MLE. Owing to π(x, θ), direct maximization of the conditional likelihood of the observed data is hard to implement. Thus, the EM algorithm is adopted in this paper. We give the detailed computation procedure in the Supplementary Material.

Simulation Studies
In this section, we conduct several simulation studies to test the performance of the proposed method, its sensitivity, and the inference method used in the following section. Monte Carlo samples of size n are independently generated B times and the estimate Est is averaged over estimates of all replications, and SE is the standard error of the B estimates.

Estimation Performance
Here, we first test the estimation performance using two examples.
Example 1. The data generating model is (1) Since the conditional observed likelihood (2) may have various local maxima, a random initialization method is adopted here. We set the initial values as λ I , α I , . For a test of this method, we take two settings: 1. Fixed initial value setting: λ I = 0.1, α I = 3.5, β I i = 0, i = 1, 2, θ I k = 0, k =, 0, 1, 2. This setting is denoted by F S. 2. Random initial value setting: λ I = U (0, 1), α I = U (1, 10), β I i = U (−5, 5), i = 1, 2, θ I k = U (−5, 5), k =, 0, 1, 2. In this random setting, the estimate is chosen as the one with the maximum likelihood among 10 and 100 estimates, which are obtained by starting at 10 and 100 random initial points, respectively. The setting starting at 10 random initial points is denoted by Rnd10 and that starting at 100 random initial points by Rnd100.   Table 1.
From Table 1, we see that when we implement the proposed EM estimation method, the regression coefficients β, θ are estimated very well, no matter the initial setting with which the EM computation procedure starts. However, compared with the results of F S, the random initialization method produces results with smaller bias and SE. Comparing the results of Rnd10 and Rnd100, we note that the proposed EM computation procedure is robust to a random choice of initial values and thus is reasonable. From comparison with the results of Rnd10 and Rnd100, it is found that estimates obtained by SR will produce large bias, and this may lead to unreliable statistical inference. This shows the greater practical power of our model and method.
Example 2. The data generating model is the same as that in Example 1, but with λ = α = 1. This actually means h (v|x, λ, α, θ, β We again take two initialization settings: 1. Fixed initial value setting: λ I = 0.1, α I = 2.5, β I i = 0, i = 1, 2, θ I k = 0, k =, 0, 1, 2. This setting is denoted by F S. 2. Random initial value setting: λ I = U (0, 1), α I = U (1, 10), In this setting, the estimate is chosen as the one with the maximum likelihood among 10 estimates, which are obtained by starting at 10 random initial points. This setting is denoted by Rnd10. The competing model is the same as in Example 1. The simulation results with n = 1500, B = 200 are summarized in Table 2.
In Example 2, since V |x ∼ w(v exp(x β)) exp(x β), it is obvious that θ has no effect on the generated data, and Theorem 1 shows that θ is unidentifiable. However, β is still identifiable, and so we can estimate it. We see that the EM implementation with fixed initial value setting and the 10 random initial values setting produces unbiased estimates of β. However, interestingly, SR also produces an unbiased result. The reason behind this is as follows.
In this special case, log V = x γ + , γ = −β, and e ∼ exp(1). This can be rewritten as log V = c + x γ + , = − E , and c = E . Thus, the SR method can produce unbiased results. However, compared with the results obtained by our method, the SE produced by the SR method is much larger.

Sensitivity Analysis
In practice, θ is not usually the parameter of interest, and too large a value of θ i may lead to a computational barrier. In the absence of a constraint, convergence may then be slow. To avoid this problem, we can constrain each component of θ to a given range when implementing optimization. Thus, it is necessary to test the sensitivity of the estimates of λ, α, β to such constrained optimization. For this purpose, we first explore the case with no covariate. Let f λ,α (y) and g λ,α (y) be defined as before. The mixture density function m(v; λ, α, p) = pf λ,α (v) + (1 − p)g λ,α (v) for different mix proportions p (0 p 1) is shown in Figures 1 and 2 for λ = 0.2, α = 2 and λ = 0.1, α = 0.6, respectively. From these figures, it can be seen that in contrast to the multimodality of the traditional mixture model, the mixture of the Weibull distribution and its length-biased version is unimodal when α > 1 and has no peak when α 1. This makes it impossible to determine the distribution from which the sample arises. Fortunately, this is not our goal, which is simply estimation of the parameters of interest λ, α. Take λ = 0.2, α = 2, p = 0.5 as an example. Through fixing the specific value and range of p, we get the MLE of the relevant unknown parameters. The simulation results with n = 1500, B = 200 are summarized in Table 3. From Table 3, for the case of fixed p, it can be seen that a departure from the true value p 0 of p will bring some bias. It should be noted that the greater the departure, the larger will be the bias. This is also reflected in Figure 1. For the case of a fixed range, when this range contains p 0 , the MLE produces unbiased estimates. When the range does not contain p 0 , the estimates will have some bias. Furthermore, the greater the departure of the range from p 0 is, the larger will be the bias.
For the case with covariates, we still take Example 1 with the 10 random initial value setting: λ I = U (0, 1), α I = U (1, 10), β I i = U (−5, 5) (i = 1, 2), θ I k = U (a, b) (k = 0, 1, 2). The maximization step is also constrained in {θ i ∈ [a, b], i = 1, 2, 3}. The results are summarized in Table 4. From this table, it can be seen that compared with the result with no constraint, no matter how large the fixed range of θ is, when it contains θ 0 , the estimates of λ, α, β are very stable and are still unbiased. These results indicate the greater reliability of the proposed method. However, when the true value of θ i is not contained in the fixed range, such as (−2, −0.5), (−1, 0), the greater the departure of the range from θ 0 , the larger will be the bias of the MLE of β. For example, the departure of (−2, −0.5) from θ 0 is greater than that of (−1, 0.5), and the results in the (−2, −0.5) case have larger bias than those in the (−1, 0.5) case. The reason behind this is that β is actually combined with the scale parameter λ in the mixture model. Thus, the sensitivity of the estimate of β in the mixture regression model is just a reflection of the sensitivity of the estimate of the scale parameter in the no-covariate case. Therefore, in practice, a larger fixed range is preferred when permitted. This is also what we have done in the real data analysis.

Hypothesis Testing
In Section 4, l.r.t.'s will be used to make inferences for the parameters of interest. However, in Wolfe (1971) and the book by Everitt and Hand (1981), it is noted that standard l.r.t.'s may fail for the mixture model since the asymptotic distribution is no longer a χ 2 distribution. Aitkin and Rubin (1985) also pointed out this problem and placed a prior distribution on the mix proportion to make the asymptotic distribution of l.r.t.'s approximate the classical standard   asymptotic distribution. Thus, it is necessary to test whether the l.r.t.'s and the corresponding classical standard asymptotic distribution work for our model. In this subsection, we adopt the model from Example 1 with different β and test three null hypotheses: We still use the EM algorithm to obtain the estimates {λ i0 ,α i0 ,β i0 ,θ i0 } of all parameters under the constraint H 0i (i = 1, 2, 3) and the estimates {λ 0 ,α 0 ,β 0 ,θ 0 } of all parameters without constraints. The same random initialization setting as that in Example 1 is adopted, but we use 20 random initial values here. The maximization step also is constrained in {θ i ∈ [−5, 5], i = 1, 2, 3}.
For H i0 (i = 1, 2, 3), the l.r.t. statistic l H i0 is Here, L(λ, α, β, θ) is given by (2), and the p-value is calculated as p H i0 = P {l H i0 χ 2 dim i }, where χ 2 dim i is the chi-square random variable with degree dim i equal to the number of constrained parameters in H i0 . In our simulations, as illustrations, we reject the null hypothesis H i0 if    Table 5, where NH denotes the null hypothesis. From this table, we can see that the l.r.t.'s control the type I error well, since when the null hypothesis is true, the false rejection rate is less than 5%. Even better, in such a case, the type II error is also very small. The reason for this is that the regression coefficients β 1 , β 2 are relatively large and thus the signal is strong.
Consequently, as an another test, we set B = 200, β = [0.1, 0] , which means that X 2 has no effect and X 1 has a very weak effect. Since generally test power depends on sample size, we simulate with various n to show the effect of sample size. The simulation results are summarized in Table 6, where NH again denotes the null hypothesis. From this table, it can be noted that the type-I error is still controlled well and is reasonable for the classical standard l.r.t.'s. Compared with the results for β = [0.8, 0] , we see that small but nonzero regression coefficients lead to a sharp increase in the type II error, but increasing the sample size can increase the power of the proposed test and reduce the type II error. All of these results show that the l.r.t.'s and the corresponding classical standard asymptotic distributions work well for our model. It is therefore appropriate to adopt these l.r.t.'s and asymptotic distributions.

Real Data Analysis
In this section, the proposed model and method are applied to a real dataset. Our real data analysis is based on 143 observations of time lag between departure from Wuhan and onset of symptoms, which is of count type but is treated as continuous for our method. These 143 cases left Wuhan between January 19 and January 23, 2020 for Zhejiang. The observed risk factors are Age and Gender (Female is set as 1). There are 66 female cases and 77 male cases. Figure 3 presents a description of our data, where Time Length is the time lag between the departure of a case from Wuhan and their onset of symptoms. Some summary statistics for the real data are also presented in Table 7.

Sensitivity Analysis
Actually, without any other constraints, the estimate of θ is very unstable. However, θ is not a parameter in which we are interested. For sensitivity analysis, we take a 20 random initial values computational procedure and choose the final estimate as we have done in the simulations.  Specifically, the random initialization setting is λ I = U (0, 1), α I = U (1, 10), β I i = U (−1, 1), i = 1, 2, θ I k = U (−l, l), k =, 0, 1, 2, where l is a pre-assumed positive constant. The maximization step is also constrained in {θ i ∈ [−l, l], i = 0, 1, 2}. The results of the sensitivity test are summarized in Table 8. From this table, we can see that with different constrained ranges of θ when randomly initializing and implementing the maximization step, the estimate of θ is unstable, but λ, α, β and the corresponding maximum likelihood are robust to these settings. This makes the estimates of interest λ, α, β more reliable.

Hypothesis Test for the Effect of Age and Gender
Finally, as the simulation results in Section 3.3 have shown, l.r.t.'s are allowed for β gender and β age . We gives the corresponding test results under three null hypotheses: H 1 : β gender = β age = 0, H 2 : β gender = 0, and H 3 : β age = 0. The computation procedure is implemented with the setting l = 5 in Section 4.1. The results are summarized in Table 9. All these results show that it is statistically not rejected that age and gender have no effect on the incubation period. Based on this on-hand evidence, a common quarantine period for the whole population is reasonable and there is no need to specify different quarantine measures for different groups.
As an illustration of this result, we can estimate the incubation periods for groups with various ages and genders. Suppose that, given X = x, the conditional density of V is where x = (x Gender , x Age ) , x new = (x Gender , I {x Age > 45}) , and x Gender = 1 when the case is female. The other notation is as before. We still adopt the 20 random initial values setting with l = 10. The results are summarized in Table 10. From these, we can see that the estimates of the Weibull distribution parameters and the mean and quantiles of the incubation periods  CI is the 95% confidence interval obtained from 1000 times bootstraps. Q p is the quantile corresponding to probability p of Weibull distribution with the estimated parameter.
of different groups are similar. There are some differences in the quantile estimates, but these are due to the limited sample size. The confidence intervals of all the quantiles contain the corresponding quantile estimates of the different groups. Thus, these estimates should make only a little difference, and this implicitly confirms the hypothesis test result.

Concluding Remarks
In this paper, we have proposed a novel mixture regression model to analyze the effects of age and gender on the incubation period of COVID-19. An EM method is used to obtain estimates of the parameters of interest, and the simulation results show that the proposed method outperforms the simple regression method and has robustness. The hypothesis test simulations also show that the application of l.r.t.'s and the corresponding classical standard asymptotic distribution is a reasonable approach. It should be noted, however, that we use a simple random initialization method for the EM algorithm. To the best of our knowledge, the way in which a good starting value should be chosen for the EM algorithm remains a difficult and unsolved problem deserving of further work in the future. By applying the proposed method to a Zhejiang COVID-19 dataset, it has been found that age and gender statistically have no effect on the incubation period of COVID-19. Thus, the quarantine period currently in operation, which is the same for everybody, is reasonable. This result has direct significance for COVID-19 prevention work and future economic recovery. However, our sample size still seems relatively limited, and so in the future it will be worthwhile collecting more data to further confirm our results.

Supplementary Material
The Supplementary Material including the detailed proofs of Theorems 1 and 2, can be found on the Journal of Data Science website. The data/code used in the analyses can be found at https://github.com/SimonsZheng/Assessment-of-Effects-of-Age-and-Gender-on-COVID-19.