An epidemiological forecast model and software assessing interventions on COVID-19 epidemic in China

,


Introduction
The outbreak of the coronavirus disease or COVID-19, originated in Wuhan, the capital city of Hubei province, spreads out quickly in Hubei and then in China, South Korea, Japan, Iran and Italy, as well as many other cities across the world.Being one of the great epidemics in the 21st century, till February 25, 2020, in China it has caused a total of 78,195 confirmed infections, 2,718 deaths and 30,078 recovered cases, and 2,491 suspected cases still remained to be tested.Since the outbreak of the epidemic, many clinical papers [15,3,37,38,12,6,20,5,11,8,7,28,10,42,34] have been published to unveil limited but important knowledge of COVID-19, including that (i) COVID-19 is an infectious disease caused by SARS-CoV-2, a virus closely related to the SARS 1 coronavirus (SARS-CoV) [20,5,32]; (ii) it can surely spread from person to person [11,8]; (iii) it has a high person-to-person transmission rate, especially in the case with close contact, resulting in a large number of infected cases in Hubei province; (iv) the median incubation time is 3 days, which can be as long as 24 days [7]; and (v) asymptomatic person carrying SARS-CoV-2 is contagious [28].This epidemic has been concerning not only in China but also in the rest of the world given the currently fast growing number of infected cases in South Korea, Japan, Iran, and Italy.
In addition to vaccination, quarantine or medical isolation is the human wisdom that has been proved to be one of the most effective ways to stop the spreading of infectious diseases such as SARS [35,29,22] and plague [4] in the human history.The basic idea of quarantine is to separate infected cases from susceptible population and vice versa.Since mid-January 2020, the Chinese government has implemented all kinds of very strict in-home isolation protocols nationwide, which have been elevated day by day through various government enforced quarantine and societally organized inspections to control the spread of COVID-19 in Hubei and other regions in China.In the meantime, the Chinese government has quickly increased the capacity of hospitals or as such that took symptomatic patients to be quarantined and treated by medical doctors and nurses, most of whom were dispatched from hospitals outside of Hubei province.
The question of the most importance, which draws most attention, concerns when the spread of COVID-19 will end.This question has to be answered by a prediction model using the daily most-updated data from the China CDC.Unfortunately, it is extremely difficult to make right and precise prediction due to the limited amount of available data, which are thought to be inaccurate due to the issue of under-reporting.Many prediction models [33,18,9,27,24,40,19] have already been proposed to provide good fitting results for the publicly available data that may be potentially under-reported.Each of these models may result in different predictions of turning points, such as the dates when the daily increased or the total number of infections begin to decrease.Since such forecasting needs to extrapolate a fitted model to a relatively distance future time after the last date with observed data, whichever the chosen model is used, the model itself will dictate prediction results.Many tuning methods have been developed by data scientists to tune a prediction model, producing only point predictions with no quantification of prediction uncertainty.In addition, data accuracy, in particular the issue of under-reporting, may cause bias in prediction, and ignoring this issue would lead to an optimistic prediction of early turning points.The issue of under-reporting may be attributed to the unsatisfactory sensitivity of the RNA test for SARS-CoV-2 or to the lack of enough kits for testing at the beginning of the outbreak among other logistic and political reasons.The Chinese government tried to correct these issues by using a new diagnostic protocol based on clinical symptoms starting at the first week of February.However, it undermines the quality of data collected in the early phase of the epidemic.
All the above points illustrate the complexity of human interventions on the spread of COVID-19, including but not limited to in-home quarantine, hospitalization, community enforcement of wearing masks, minimizing outdoor activities, and changed diagnostic criteria by the government.The prediction model must take such features into account in order to provide meaningful analyses and hopefully reasonable predictions.However, most existing prediction models do not have the capacity to incorporate changing interventions in reality, and with no such critical component of time-varying intervention in the model, predicted turning points would be untrustworthy.Our new model and analytic toolbox aim to fill in this significant gap.
We develop a health informatics toolbox, with an R package eSIR [1,26], that helps accomplish the following specific aims: AIM 3 : Provide an R software package to health workers who can conveniently perform their own analyses using their substantive knowledge and perhaps better quality data from provinces in China or from other countries.
By no means the toolbox developed by us would give an accurate prediction of turning points, but rather we hope to provide a data analytic toolbox to people who may have better domain-specific knowledge and experience as well as high quality data to carry out independent predictions.Our informatics toolbox is built upon a state-space model [41,13,31,14] shown in Figure 1 with an extended Markov SIR model [16], which has the following key features: (i) Our model is specified with the temporally varying prevalence of susceptible, infected and removed (recovered and death) compartments, not directly on time series of respective counts; (ii) estimation and inference are carried out and implemented by the Markov Chain Monte Carlo (MCMC) algorithm; (iii) it outputs various sample draws from the posteriors of the model parameters (e.g.transmission and removal rates) and the underlying prevalence of susceptible, infected and removed compartments, as well as their credible intervals.The latter is of extreme importance to quantify prediction uncertainty.In addition, this toolbox provides predicted turning points, including (i) the date when daily increased number of infections begins to decrease or the time at which the second order derivative of the prevalence of infected compartment is zero (i.e. the turning point of infection acceleration to deceleration); and (ii) the date when daily number of removed cases is larger than that of infected cases, or the time at which the first derivative of the prevalence of infected compartment is zero (i.e. the turning point of zero infection speed).As a byproduct, the method also provides a predicted time when the COVID-19 epidemic ends.
The R package eSIR is available at https://github.com/lilywang1988/eSIR.This paper is organized as follows.Section 2 presents our new epidemiological forecast model incorporating timevarying quarantine protocols.Section 3 concerns the algorithmic implementation via Markov Chain Monte Carlo and a deliverable R software.Section 4 is devoted to the analysis of COVID-19 data within and outside Hubei, where a calibration of under-reporting is proposed.Section 5 gives some concluding remarks, and some technical details are included in the appendices.

Basic model of coronavirus infection
We begin with a basic epidemiological model in the framework of state-space SIR models with no consideration of quarantine protocols.This framework was proposed by [23] with only onedimensional time series of infected proportions.Refer to Chapter 9-12 of [30] for an introduction to state-space models.Here we consider two time series of proportions of infected and removed cases, denoted by Y I t and Y R t at time t, respectively, where the compartment of removed R is a sum of the proportions of recovered cases and deaths at time t.We assume that the 2-dimensional time series of pY I t , Y R t q J follows a state-space model with the beta distributions at time t: where θ I t and θ R t are the respective prevalence of infection and removal at time t, and λ I and λ R are the parameters controlling the respective variances of the observed proportions.As shown in Figure 1, these observed time series are emitted from the underlying latent dynamics of COVID-19 infection characterized by the latent Markov process θ t .It is easy to see that the expected proportions in both models (1) and ( 2) are equal to the prevalence of infection and the probability of removal at time t, namely EpY I t |θ t q " θ I t and EpY R t |θ t q " θ R t .See Appendix B.Moreover, the latent population prevalence θ t " pθ S t , θ I t , θ R t q J is a three-dimensional Markov process, in which θ S t is the probability of a person being susceptible or at risk at time t, θ I t is the probability of a person being infected at time t, and θ R t is the probability of a person being removed from the infectious system (either recovered or dead) at time t.Obviously, θ S t `θI t `θR t " 1.We assume that this 3-dimensional prevalence process θ t is governed by the following Markov model: where parameter κ scales the variance of the Dirichlet distribution and function f p¨q is a 3dimensional vector that determines the mean of the Dirichlet distribution.The function f is the engine of the infection dynamics, which operates according to the classical infectious disease SIR model of the form: Note that the explicit solution to the above system (4) of ordinary differential equations is unavailable.Following [23], we invoke the fourth-order Runge-Kutta(RK4) approximation, resulting in an approximate of f pθ t´1 , β, γq as follows: where all these k t terms are given in the appendix A. Denote the set of model parameters by τ " pβ, γ, θ 0 , λ, κq J , which will be estimated using the MCMC method [2].

Epidemiological model with time-varying transmission rate
The basic epidemiological model with both constant transmission and removal rates in SIR model (4) does not reflect the reality in China, where all levels of quarantines have been enforced.Many forms of human interventions that are altering the transmission rate over time include (i) individuallevel protective measures such as wearing masks and safety glasses, using hygiene, and taking inhome isolation, and (ii) community-level quarantines such as hospitalization for infected cases, city blockade, traffic control and restricted social activities, and so on.In addition, the virus itself may mutate to evolve, so to increase the potential rate of false negative in the disease diagnosis.As a result, some individual virus carriers are not contained.Thus, the transmission rate β indeed varies over time, which should be accounted in the modeling.
Figure 2: An extended SIR model with time-varying transmission rates.
One extension to the above basic epidemiological model is to allow a time-varying probability that a susceptible person meets an infected person or vice versa.Suppose at a time t, q S ptq P r0, 1s is the chance of an at-risk person being in-home isolation, and q I ptq P r0, 1s is the chance of an infected person being in-hospital quarantine.Thus, the chance of disease transmission when an at-risk person meets an infected person is modified as: βt1 ´qS ptquθ S t t1 ´qI ptquθ I t :" βπptqθ S t θ I t , with πptq :" t1 ´qS ptqut1 ´qI ptqu P r0, 1s.In effect, this πptq modifies the chance of a susceptible person meeting with an infected person or vice versa, which is termed as a transmission modifier due to quarantine in this paper.Obviously, with no quarantine in place, πptq " 1 for all time.See Figure 2 where the product term βπptq defines a realistic transmission rate reflective to the currently enforced quarantine measures of all levels in China.Note that the above extended SIR model assumes primarily that both population-level chance of being susceptible and population-level chance of being infected remain the same, but the chance of a susceptible person meeting with an infected person is reduced by πptq.
The transmission rate modifier πptq needs to be specified according to actual quarantine protocols in a given region.A possible choice of πptq may be a step function that reflects governmentinitiated macro isolation measures in Wuhan, Hubei province: When π 0 " pπ 01 , π 02 , π 03 , π 04 q are chosen with different values, as shown in Figure 3, we obtain different types of transmission rate modifiers aligned with different quarantine protocols.
Alternatively, the modifier πptq may be specified as a continuous function that reflects steadily increased community-level awareness and responsibility of voluntary quarantine and preventive measures, which may be regarded as a kind of micro isolation measure initiated by individuals or local self-organized inspections.For example, as shown in Figure 4, we may choose the following exponential functions: πptq " expp´λ 0 tq or πptq " expt´pλ 0 tq ν u, λ 0 ą 0, ν ą 0. Technically, the RK's approximate of f function in Appendix A may be easily obtained by 6 replacing β by βπptq in the specification of the latent prevalence model (3), and moreover in all quantities and steps in the MCMC implementation.See the detailed in Section 3.

Epidemiological model with quarantine compartment
An alternative way to incorporate quarantine measures into the basic epidemiological model ( 4) is to add a new quarantine compartment that collects quarantined individuals who would have no chance of meeting any infected individuals in the infection system, as shown in Figure 5.This model allows to characterize time-varying proportions of susceptible cases due largely to the governmentenforced stringent in-home isolation outside of Hubei province.The basic SIR model in equation ( 4) is then extended by adding a quarantine compartment with a time-varying rate of quarantine φptq, which is the chance of a susceptible person being willing to take in-home isolation at time t.The extended SIR takes the following 4-dimensional latent process where We suppose that the quarantine rate φptq is a Dirac delta function with jumps at times when major macro quarantine measures are enforced.For example, we may specify the φptq function as follows: otherwise.
Appendix A. To deal with the Dirac delta function φptq, we develop a two-step approximation for model (6).In brief, we first solve a continuous function without change points via the differential equations in (5), and then we directly move φptqθ S t of the susceptible compartment to the quarantine compartment.From our experience, this approach largely improves the approximation accuracy in the presence of discontinuities.
3 Implementation: Markov Chain Monte Carlo Algorithm

MCMC Algorithm
We implemented the MCMC algorithm to collect draws from the posterior distributions, and further obtain posterior estimates and credible intervals of the unknown parameters in the above models specified in Section 2. Because of the hierarchical structure in the state-space model considered in this paper, the posterior distributions can be obtained straightforwardly.The R package rjags [25] can be directly applied to draw samples from the posterior distributions, so we omit the technical details.The latent Markov θ t processes are sampled and forecasted by the MCMC sampler, particularly for the prevalence of infection and the probability of removal, θ I t and θ R t , which enables us to determine turning points of interest and the reproduction number R 0 .
The first turning point of interest is the time when the daily number of new infected cases stops increasing.Mathematically, this corresponds to the time t at which : θ I t " 0 or the gradient of 9 θ I t is zero.As illustrated by Panel A in Figure 7, the peak of 9 θ I t , denoted by the vertical green line, is the first turning point of interest.The second turning point is the time when the cumulative infected cases reaches its maximum, meaning 9 θ I t " 0. In principle, the second turning point occurs only after the first one, as shown in Panel B in Figure 7.
The reproduction number R 0 of an infectious disease is estimated by the ratio R 0 " β{γ, where β and γ are both estimated from their posterior distributions.Because our models consider the quarantine compartment, R 0 might change according to the forms of quarantine protocols.We adopt a standard MCMC algorithm to draw the latent process.Let t 0 be the current time up to which we have observed data pY I 0:t 0 , Y R 0:t 0 q.To perform M draws of Y I t , Y R t for t P rt 0 `1, T s, we q from rY I t |θ pmq t , τ pmq s and rY R t |θ pmq t , τ pmq s according to the observed process, at t " t 0 `1, . . ., T , respectively; The prior distributions are specified with some of the hyper-parameters being set according to the SARS data from Hong Kong [21].They are, θ I 0 " Betap1, pY I 1 q ´1q, θ S 0 " Betap1, pY R 1 q ´1q, θ R 0 " 1 ´θS 0 ´θI 0 ; R 0 " LogNp1.099,0.096q with EpR 0 q " 3.15, SDpR 0 q " 1; γ " LogNp´2.955,0.910q with Epγq " 0.0117, SDpγq " 0.1, β " R 0 γ; κ " Gammap2, 0.0001q, λ I " Gammap2, 0.0001q, λ R " Gammap2, 0.0001q.
In the default setting the variances of the above prior distributions are set at relatively large values to reflect the fact that limited prior knowledge of these epidemiological model parameters is available.When more information becomes accessible during the course of the epidemic, smaller prior variance values may be used, leading to tighter credible intervals for the model parameters and turning points.

R software package
We deliver an R software package that is able to output the MCMC estimation, inference and prediction under the epidemiological model with two proposed extended SIR models that incorporate time-varying quarantine protocols.These new models have been discussed in detail in Sections 2.2 and 2.3.Our R package, named eSIR, uses daily-updated time series of infected and removed proportions as input data.The R package is available at GitHub lilywang1988/eSIR, and its user manual is appended as the supplementary material of this paper.The quarantine functions are predefined and will be treated as known functions of protocols/policies in the estimation and prediction steps.We let the transmission rate modifier πptq be either a step function or an exponential function, and let the quarantine rate φptq follow a Dirac delta function with pre-specified points of jump and sizes of jumps.The package provides various plots for users to visualize the MCMC results, including the estimated prevalence of infection and the estimated probability of removal, and predicted turning points of interest.Various summary statistics are listed in the output, including posterior mean estimates of the transmission and removal rates, estimate of the reproduction number, and forecasts of turning points and their 95% credible intervals.Moreover, the package gives multiple options to users who can save the entire MCMC results, including the output tables and summary plots, traceplots for MCMC quality control, and full MCMC draws for user's own summary analyses.Some illustrations on the use of this software package is given in Section 4. In addition, we developed an online R Shiny App that can automatically update the results whenever the China CDC updates the daily COVID-19 data.
4 Analysis of the COVID-19 Data Within and Outside Hubei

Calibration of under-reported infection data
Under-reporting of infections is a common issue in the data collection of infectious disease, especially at the beginning of an outbreak.When medical diagnostic tools become more accurate and reliable, as well the transparency of preventive measures gets improved for an exchange of voluntary in-home quarantine, certain adjustments in data typically occur.It is shown in Figure 8 that on Feb 12 the cumulative and daily added number of infected cases in Hubei had clear jumps with significantly large sizes.Such sizable jumps cannot happen within one day, rather they represent an accumulation of cases that have not been reported in previous dates prior to Feb 12.To fix this under-reporting issue, we develop a calibration procedure with the detail in Appendix C. Below we briefly describe our approach for the calibration of the infected cases.
We assume an exponential growth curve for the cumulative number of infected cases in Hubei before Feb 12 of the form yptq " ae λt `b, where parameters λ, a, b are to be estimated.Under the boundary conditions ypt " Jan 12q " 0 and ypt " Feb 12q " a expp31λq `b, we would like to minimize the one-step ahead extrapolation error on Feb 13.The constrained optimal solution can be obtained by the means of Lagrange Multipliers; the estimates are λ " 0.06605, â " 7142.80,b " ´7142.80.The resulting calibration curves for the cumulative and daily added number of infected cases are shown as the solid red curves in Figure 8.For example, on Jan 31, the reported cumulative number of infected cases is 7,153, but the calibrated number of infected cases is 17,911, with an increment of 10,758 cases.As shown later, this data quality control (QC) step helps improve the performance of MCMC.

Evaluation and prediction under time-varying quarantine
We applied our proposed models, algorithms and R package eSIR to analyze the COVID-19 data collected from the public website DXY.cn.The earliest public records for the provincial data are available since Jan 20, 2020.According to an existing R package on GitHub GuangchuangYu/nCov2019 [39], the total counts of confirmed infections and deaths are dated back on Jan 13, 2020.We assumed that before Jan 17 all the reported cases and deaths were from Hubei.We imputed by the linear interpolation the missing cases on Jan 18-19.Therefore, the data used in our analyses starts from Jan 13.The data used in analyses for the other provinces starts on Jan 23, which is the earliest time with non-zero values in the removed compartment.Note that there exist some minor discrepancies between different data sources, and the under-reporting issue is addressed in Appendix C by a calibration procedure.This section aims to provide a demonstration of our software to analyze the current public COVID-19 data, through which users may understand the proposed methods.We focus on illustrating ways to export and interpret the MCMC results.The R package may be applied to analyze infectious data from other countries.
First, we show the analysis of the Hubei COVID-19 data.Note that option dic=T enables to calculate the deviance information criterion (DIC) for model selection, while options, save_files=T and save_mcmc, allow the storage of MCMC output tables, plots, summary statistics and even full MCMC draws, which may be saved via the path of file_add, or otherwise via the current working directory.The major results returned from the package include predicted cumulative proportions, predicted turning points of interest, two ggplot2 [36] objects of the summary plots related to both infection and removed compartments, a summary output table containing all the posterior means, median and credible intervals of the model parameters, and DIC if opted.The traceplots and other useful diagnostic plots are also provided and automatically saved if save_files=T is opted.In the package, function tvt.eSIR() works on the epidemiological model with timevarying transmission rate in Section 2.2, and qh.eSIR() for the other epidemiological model with a quarantine compartment in Section 2.3.Note that in function tvt.eSIR(), with a choice of exponential=F, a step function is run in the MCMC when both change_time and pi0 are specified.To fit the model with a continuous transmission rate modifier function, user may set exponential=T and specify a value of lambda0.The default is to run the basic epidemiological model with no quarantine or πptq " 1 in Section 2.1.death_in_R is usually set to be the average ratio of death and removed proportions at each observation time point, which is used to estimate the death curve in the forecast plot of the removed compartment.Below are the R scripts used in the analysis.

### Example 1:
Step function pi(t) ### Y and R are observed proportions of infected and removed compartments change_time <-c("01/23/2020","02/04/2020","02/08/2020") pi0<-c(1.0,0.9,0.5,0.1)res.step <-tvt.eSIR(Y,R,begin_str="01/13/2020",death_in_R= 0.4, T_fin=200,pi0=pi0,change_time=change_time,dic=T, casename="Hubei_step",save_files = T, save_mcmc=F,M=5e2,nburnin = 2e2) res.step$plot_infection res.step$plot_removed res.step$dic_val ### Example 2: continuous exponential function pi(t) res.exp <-tvt.eSIR(Y,R,begin_str="01/13/2020",death_in_R= 0.4, T_fin=200,exponential=TRUE,dic=F,lambda0=0.05, casename="Hubei_exp",save_files = F,save_mcmc=F, M=5e2,nburnin = 2e2) res.exp$plot_infection #res.exp$plot_removed### Example 3: the basic state-space SIR model, pi(t)=1 res.nopi <-tvt.eSIR(Y,R,begin_str="01/13/2020",death_in_R= 0.4, T_fin=200,casename="Hubei_nopi",save_files = F, M=5e2,nburnin = 2e2) res.nopi$plot_infection #res.nopi$plot_removed In the above R coding scripts, only very short MCMC chains are specified for the consideration of running time.In practice, it is recommended to set M=5e5 and nburnin=2e5 to achieve desirable burn-ins and yield stable MCMC draws.We tried the step function given by Panel C in Figure 3 with pi0=c(1,0.9,0.5,0.1), an exponential function given by Panel B in Figure 4 with rate lambda0=0.05,both of which were compared with the basic model with πptq " 1.The results of the three modifier functions obtained from the tvt.eSIR function are summarized in Figures 9-11.In these forecasting plots of the infected and removed compartments (Panel A and Panel C), the black dots left to the blue vertical line denote the observed proportions of the infected and removed compartments on the last date of available observations or before.That is, the blue vertical marks time t 0 as defined in Section 3. The green and purple vertical lines denote the first and second turning points, respectively.The salmon color area denotes the 95% credible interval of the predicted proportions Y I t and Y R t after t 0 , respectively, while the cyan color area represents either the 95% credible intervals of the prevalence (θ I t ) and or that of the probability of removal (θ R t ) prior to time t 0 .The gray and red curves are the posterior mean and median curves.The black curve in the plot (Panel C) is the proportion of deaths estimated from a pre-specified ratio death_in_R.The middle Panel B displays important dynamic features of the infection via a spaghetti plot, in which 20 randomly selected MCMC draws of the first-order derivative of the posterior prevalence of infection, namely 9 θ I t .The black curve is the posterior mean of the derivative, and the vertical lines mark times of turning points corresponding respectively to those shown in Panel A and Panel C.Moreover, the 95% credible intervals of these turning points are also highlighted by semi-transparent rectangles in Panel B. As shown in these figures, the transmission rate modifier πptq played an important roles in shortening the key turning points of the epidemic, and its effect on both estimation and prediction of the COVID-19 infection dynamics has been clearly demonstrated .
Next, we analyzed the data from the rest of the Chinese population (i.e. the provinces outside Hubei) starting on Jan 23.We set begin_str="01/23/2020"in tvt.eSIR.To address the delayed starting time, we included two change points for the step function πptq at [Feb 4, Feb 8] with π 0 " p0.8, 0.1q.The exponential function remained the same.It is noted that the spread of   COVID-19 outside Hubei has been so far much less severe.Possible reasons for such low proportions of infection and deaths include (i) discontinuing the traffic connections between Hubei and the provinces, (ii) more timely caution and preventative measures taken, and (iii) a comparatively large population that dilutes the exposed group.When Panel A in Figure 12 is zoomed in, some of the observed proportions (black dots) are deviated from the posterior mean or median of the fitted prevalence albeit they all fall in the 95% credible intervals, as shown by Panel A in Figures 13 and  14.Since the latent process follows the SIR differential equations, there may be a lack of fit for the SIR model to accommodate a very large and complex population of 1.3 billion people, in which most of the subjects are not at risk.The proposed models should work much better for individual provinces, but we did not perform such analyses.

noq$plot_infection
We applied the R function qh.eSIR in analyses of the data within and outside Hubei.Results are summarized in Figures 15-16.Our analyses once again clearly indicated that stringent quarantine protocols can largely reduce the spread of COVID-19 both within Hubei and outside Hubei.Yet, it is known that too strict quarantine can cause backfire; people may lose their trust and patience in their committed system, and consequently may try to reduce compliance or even avoid quarantine.We also present the posterior mean probability of staying quarantine compartment in Figure 17 within Hubei and outside Hubei.Note that Jan 23 was not set as a change point for the case of outside Hubei, leading only to two jumps.It is evident that by Feb 8, more than 90% of the Chinese population have taken in-home isolation or as such, reflective to a very strict quarantine protocol enforced in the entire country.
As described in Subsection 4.1, we corrected the under-reported proportion of infections in Hubei province prior to Feb 12, when a big jump occurred on one day.We repeated the same analyses using the calibrated data of infections, and the corresponding results are shown in Figures 18-21.It is interesting to see that the abrupt rise in the infection proportion on Feb 12 disappeared in all Panel A in these new analyses, and the observed data (i.e. the black dots) align better with the    credible intervals of all the latent processes.We also notice that after the correction, the estimated reproduction numbers R 0 became smaller.The reproduction numbers estimated from different models for within and outside Hubei, with and without the data calibration, together with their 95% credible intervals are summarized in Table 1.
It is worth pointing out that the estimates of the reproduction numbers obtained from the epidemiological models with time-varying transmission or quarantine rates appear larger than those obtained from the basic model with no quarantine.This is no surprising as our new models explicitly incorporate interventions, so that the estimated R 0 is an adjusted number with the influence of interventions be removed.In contrast, the basic model with no use of the quarantine modifier implicitly integrates the effect of interventions into the transmission rate β, and consequently the estimated R 0 is reduced due to the contribution from interventions.Our analyses suggest that reproduction numbers R 0 of COVID-19 without public health interventions would be around 4-5 Hubei and around 3-3.5 outside Hubei with relatively big credible intervals.These findings are in agreement with findings from [17].As pointed out above, the size of credible interval may be reduced with more accessible data of COVID-19, which permit users to specify smaller variances in the prior distributions given in section 3.1.

Concluding Remarks
We develop an epidemiological forecast model with an R software package to assess effects of interventions on the COVID-19 epidemic within Hubei and outside Hubei in China.Since our    proposed model utilizes the strength of the SIR's dynamic system to capture the primary mechanism of the COVID-19 infectious disease, we are able to predict future episodes of the disease spread patterns over a window of 200 days from the last date of data availability.Some turning points of interest are obtained from these forecasting curves as part of the deliverable information, including the predicted time when daily proportion of infected cases becomes smaller than the previous ones and the predicted time when daily proportion of removed cases (i.e. both recovered and dead) becomes larger than that of infected cases, as well as the time when the epidemic ends.Our informatics toolbox provides quantification of uncertainty on the prediction, rather than only point prediction values, which are valuable to see the best versus the worst.The key novel contribution is the incorporation of time-varying quarantine protocols to expand the basic epidemiological model to accommodate changing transmission rates over time in the population.The toolbox can be used by practitioners who have better knowledge of quarantine and better quality data to perform their own analyses.Practitioners can use the toolbox to evaluate different types of quarantine strategies in practice.All summary statistics obtained from the toolbox are of great importance for public health workers and government policy makers to take proper actions on stop spreading of COVID-19.
We chose the MCMC algorithm to implement the statistical estimation and prediction because of the consideration on the prediction uncertainty.Given the considerable complexity in the COVID-19 virus spread dynamics and potentially inaccurate information of quarantine measures as well as likely under-reported proportions of infected and recovered cases and deaths, it is of critical importance to quantify and report uncertainty in the forecast.Note that the publicly reported data of recovery and death of COVID-19 are mostly collected from hospitals where accessibility to such information is warrant.In contrast, it is very difficulty, if not impossible, to collect the data of infected individuals with light symptoms who had in-home isolation and recovered, in spite of serious efforts made by the government for a door-to-door inspection to identify suspected cases.
This toolbox is indeed so general that it may be applicable to analyze and evaluate the COVID-19 epidemic in other countries, as well as the future outbreak of other types of infectious diseases.As noted in the paper, our proposed method does need prior data of similar infectious disease to set up initial conditions of the infection dynamics.For this, we analyzed the complete SARS data from Hong Kong given some similarity of COVID-19 to SARS.From this perspective, what we learned from this COVID-19 epidemic in this paper is extremely valuable to form initial conditions in the analysis of any future outbreak of similar infectious disease.In addition, understanding forms and strengths of quarantines for the controlling of disease spread is an inevitable path to making effective preventive policies, which is the key analytic capacity that our toolbox offers.
The proposed epidemiological models can be further extended to accommodate more data reported by the China CDC, which are worth future exploration.Two types of data that may be used in the future extension are the daily number of suspected cases and the daily number of hospitalized cases.We did not use such data due to the concern of data accuracy.For example, the number of suspected cases is largely dependent on the diagnostic protocols, which have been revised a few times since the outbreak of the disease, and the sensitivity of the RNA test.Given such concerns, our strategy in the proposed model was to only use necessary data for analysis, and over the course of improved data quality in the near future, our toolbox may be extended to enjoy greater statistical power and more accurate predictions.
2. Due to the quarantine, we deduct the susceptible by α 1pt´1q " α 1pt´1q ´φptqθ S t´1 , and let Let α t´1 " rα 1pt´1q , α 2pt´1q , α 3pt´1q s T , and it is easy to show that the sum ř 3 k"1 α kpt´1q " 1 ´θQ t .Thus we can regenerate the next day's θ t following a Dirichlet distribution adjusted by the prevalence of the quarantine compartment α t " Dirichletpκα t´1 {p1 ´θQ t qq.The estimated prevalence values become θ t " p1 ´θQ t qα t .We follow above two steps and finish the complete prevalence processes.Note that the deduction of susceptible compartments might cause θ S t ď 0, we will bound such prevalence value to be consistently 0, which is equivalent to terminating transmission among susceptible subjects.

B Moment properties of Beta and Dirichlet distributions
For the sake of being self-contained, we list the moments of both Beta and Dirichlet distributions.The mean and variance of Beta distribution Betapα, βq are respectively: Mean " α α `β , Var " αβ pα `βq 2 pα `β `1q .

C Under-reporting Calibration
As is mentioned in the Introduction, the issue of under-reporting may cause bias in prediction.In order to adjust the under-reported number of infected cases, we apply the following algorithm to calibrate the number of infections before Feb 12, during which time the Chinese government only relies on the RNA test for diagnosis, which was realized later with low sensitivity leading to many false negatives.We assume that the cumulative number of infected cases between Jan 13 and Feb 12 when a sudden big jump occurs follows an exponential function, yptq " ae λt `b, where t P t1, 2, . . .u and a, b, λ are parameters to be estimated.Here, t " 1 stands for Jan 13 and t " 31 stands for Feb 12.Under the condition of yp0q " 0, we can easily get that yptq " ae λt ´a.
To estimate parameter λ and a, we want to minimize the least square error of the estimated number ŷptq of infected cases at t " 32 (Feb 13), which is one day after the Chinese government changed the diagnosis protocol.It is assumed that the difference between the predicted and observed number of infections on Feb 13 would not be big if the model were established well, although the long term difference might be large due to other interventions.Therefore, the optimization problem we want to solve is, min a,λ typ32q ´ae 32λ `au 2 s.t.ae 31λ ´a " yp31q.
The constraint ae 31λ ´a " yp31q is used to ensure that the cumulative number of infected cases till Feb 12 equals to the observed value yp31q.The optimization problem can be solved using the method of Lagrange Multipliers.Obtained λ " 0.06605, â " 7142.80.The calibrated number of infected cased between Jan 13 and Feb 12 is shown in Figure 8.The proposed calibration method corrected the under-reporting issue, at least partially.

Package 'eSIR'
February 29, 2020  R the time series of daily observed removed compartment proportions, including death and recovered.phi0 a vector of values of the dirac delta function φ t .Each entry denotes the proportion that will be qurantined at each change time point.Note that all the entries lie between 0 and 1, its default is NULL.gamma0 the hyperparameter of average removed rate, the default is the one estimated from the SARS first-month outbreak (0.0821).

R0
the hyperparameter of the mean reproduction number R0.The default is thus the ratio of beta0/gamma0, which can be specified directly.
gamma0_sd the standard deviation for the prior distrbution of the removed rate γ, the default is 0.1.

R0_sd
the standard deviation for the prior disbution of R0, the default is 1.
file_add the string to denote the location of saving output files and tables.save_mcmc logical, whether save (TRUE) all the MCMC outputs or not (FALSE).The output file will be an .RData file named by the casename.We include arrays of prevalence values of the three compartments with their matrices of posterior draws up to the last date of the collected data as theta_p[,,1] and afterwards as theta_pp[,,1] for θ S t , theta_p[,,2] and theta_pp[,,2] for θ I t , and theta_p[,,3] and theta_pp[,,3] for θ R t .The posterior draws of the prevalence process of the quarantine compartment can be obtained via thetaQ_p and thetaQ_pp.Moreover, the input and predicted proportions Y, Y_pp, R and R_pp can also be retrieved.The prevalence and prediceted proportion matrices have rows for MCMC replicates, and columns for days.The MCMC posterior draws of other parameters including beta, gamma, R0, and variance controllers k_p, lambdaY_p, lambdaR_p are also available.save_plot_data logical, whether save the plotting data or not.

R0
the hyperparameter of the mean reproduction number R0.The default is thus the ratio of beta0/gamma0, which can be specified directly.gamma0_sd the standard deviation for the prior distrbution of the removed rate γ, the default is 0.1.

R0_sd
the standard deviation for the prior disbution of R0, the default is 1. casename the string of the job's name.The default is "tvt.eSIR".file_add the string to denote the location of saving output files and tables.save_files logical, whether save (TRUE) results or not (FALSE).This enables to save summary tables, trace plots, and plots of the posterior means of the first-order derivatives of the infection prevalence process θ I t .save_mcmc logical, whether save (TRUE) all the MCMC outputs or not (FALSE).The output file will be an .RData file named by the casename.We include arrays of prevalence values of the three compartments with their matrices of posterior draws up to the last date of the collected data as theta_p[,,1] and afterwards as theta_pp[,,1] for θ S t , theta_p[,,2] and theta_pp[,,2] for θ I t , and theta_p[,,3] and theta_pp[,,3] for θ R t .Moreover, the input and predicted proportions Y, Y_pp, R and R_pp can also be retrieved.The prevalence and prediceted proportion matrices have rows for MCMC replicates, and columns for days.The MCMC posterior draws of other parameters including beta_p, gamma_p, R0_p, and variance controllers k_p, lambdaY_p, lambdaR_p are also available.save_plot_data logical, whether save the plotting data or not.

AIM 1 :
Utilize and calibrate publicly available data to understand the epidemiological trend of COVID-19 spread in Hubei province and the other regions of China.AIM 2 : Incorporate time-varying quarantine protocols in the model of COVID-19 infection dynamics via an extension of the classical epidemiological SIR model.This dynamic infection system necessitates the forecast of the future trend of COVID-19 spread.

Figure 1 :
Figure 1: A conceptual framework of the proposed epidemiological state-space SIR model.

Figure 4 :
Figure 4: Transmission rate modifier takes continuous functions under difference micro quarantine measures over time.Subfigures from A to C are exponential survival functions with λ 0 " 0.01,λ 0 " 0.05 and λ 0 " 0.1 respectively.

Figure 5 :
Figure 5: An extended SIR model with time-varying transmission rate.

Figure 7 :
Figure 7: The first turning point in Panel A is marked by a green line, denoting the time when the estimated first-order derivative of the prevalence of infection reaches the maximum.The second turning point in Panel B is marked by a purple line, which is the time when the estimated firstorder derivative of the prevalence of infection equals to zero.The vertical blue line labels the first observation day.

Figure 8 :
Figure 8: Under-reporting calibration of the infected cases in Hubei.(A) The cumulative number of infected cases.(B) The daily added number of infected cases.Data calibration is performed from Jan 13 to Feb 12 as is shown by the red curves.

Figure 9 :
Figure 9: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei without intervention, πptq " 1.

Figure 10 :
Figure 10: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei with an exponential transmission rate modifier πptq " expp´0.05tq.

Figure 11 :
Figure 11: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei with a step-function transmission rate modifier specified by π 0 " p1, 0.9, 0.5, 0.1q at change points [Jan 23, Feb 4, Feb 8].

Figure 12 :
Figure 12: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for the other provinces outside Hubei with πptq " 1.

Figure 13 :
Figure 13: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for the other provinces outside Hubei with with an exponential transmission rate modifier πptq " expp´0.05tq.

Figure 14 :
Figure 14: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for the other provinces outside Hubei with a step-function transmission rate modifier specified by π 0 " p1, 0.9, 0.5, 0.1q at change points [Jan 23, Feb 4, Feb 8].

Figure 16 :
Figure 16: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for other provinces outside Hubei with time-varying quarantine specified by φ 0 " r0.9, 0, 5s at change points [Feb 4, Feb 8].

Figure 17 :
Figure 17: The posterior mean probability of staying in quarantine compartment within and outside Hubei.

Figure 18 :
Figure 18: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (PanelC) for Hubei province with πptq " 1 after calibration.Note that the second turning point is beyond the time-axis limit in the plots.

Figure 19 :
Figure 19: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei with an exponential transmission rate modifier πptq " expp´0.05tqafter data calibration.

Figure 20 :
Figure 20: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei with a step-function transmission rate modifier specified by π 0 " p1, 0.9, 0.5, 0.1q at change points [Jan 23, Feb 4, Feb 8] after data calibration.

Figure 21 :
Figure 21: Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei with time-varying quarantine specified by φ 0 " r0.1, 0.9, 0, 5s at change points [Jan 23, Feb 4, Feb 8] after data calibration.

tvt.eSIR 7 dic_val
the output of dic.sample() in dic.sample, computing deviance information criterion for model comparison.

Table 1 :
The posterior mean and credible intervals of the reproduction number R 0 obtained from different quarantine models and datasets.
The latter indicate stationary t at which the first-order derivative of the averaged posterior of θ I t equals zero.second_tp_ci with second_tp_mean, it reports the corresponding credible interval and median.