Monitoring the SARS Epidemic in China : A Time Series Analysis

In this article, we studied three types of time series analysis methods in modeling and forecasting the severe acute respiratory syndrome (SARS) epidemic in mainland China. The first model was a Box-Jenkins model, autoregressive model with order 1 (AR(1)). The second model was a random walk (ARIMA(0,1,0)) model on the log transformed daily reported SARS cases and the third one was a combination of growth curve fitting and autoregressive moving average model, ARMA(1,1). We applied all these three methods to monitor the dynamic of SARS in China based on the daily probable new cases reported by the Ministry of Health of China.


Introduction
Since the sudden epidemic of severe acute respiratory syndrome (SARS) in Guangdong, China in November 2002, this infectious disease has infected more than 8000 people worldwide by the end of May 2003 and about 2/3 of these cases were in mainland China (WHO 2003).SARS was called atypical pneumonia in China due to its unknown etiology initially.The name, a typical pneumonia, has still been used in China in most publications in Chinese because of the convenience and consistency although it was officially named as SARS by the World Health Organization (WHO) in March 2003.
SARS is a highly infectious disease with relatively high mortality (about 10%) and possible severe damage to organs in the survivors of the disease.After a few months in Guangdong, the disease found its way to the capital, Beijing, and other places in China.Later, it crossed the boarder of mainland China to Guangdong's neighbor, Hong Kong, in March 2003.Hong Kong, a special administration region of China, has a very high population density and it is also a highly internationalized city.Following the outbreak of the disease in Hong Kong, SARS has infected more than 30 countries/regions worldwide according to WHO's report (WHO 2003) since March 2003.The virus that causes SARS was identified as a novel coronavirus in March 2003 and confirmed by WHO in April by the intensive and rapid work of many scientists around the world (Drosten et al. 2003;Ksiazek et al. 2003;Peiris et al. 2003).The papers on the genome sequence of the SARS virus were published in Science (Rota et al. 2003;Marra et al. 2003).The discovery of the main proteinase structure of the novel coronavirus may be useful for developing anti-SARS drugs (Anand et al. 2003).
At present, there is no effective drug to cure the disease and the vaccine against the disease has yet to be developed.Hence, SARS poses a great threat to the global public health of human beings and may lead to a possible devastating impact to world's economic and development.Although the seriousness of the epidemic of SARS was not initially recognized by some Chinese officials, the World Health Organization pressured Chinese government for more information and issued travel warning to many places in China since March 2003(WHO 2003).In the middle of April 2003, the Chinese government removed a few high ranking officials who failed to actively lead the efforts for curbing the spread of SARS.Since then, various agencies in China have engaged in extraordinary efforts to stop the spread of the disease.In so doing, large number of people were quarantined due to possible contacts with probable or suspected SARS patients.Quarantining a large number of people is an enormous task for any government.It requires a large amount of resource and results a great loss of productivity in many business sectors in addition to the abnormal life of the people under quarantine.One important issue for the policy makers in planning the efforts in controlling SARS would be the possible future number of SARS cases.
Using data from Hong Kong, Singapore and Canada, the epidemic and the transmission dynamics of SARS were studied (Riley et al. 2003;Lipsitch et al. 2003;Chau and Yip 2003;Choi and Pak 2003).Mathematical dynamical models were also used for simulating the SARS outbreak in Beijing (Wang and Ruan 2004).However, there seems a lack of statistical analysis on the data from China in the literature.In our study, we used the data from China and compared three popular statistical models for forecasting.The first model was a Box-Jenkins model, autoregressive model with order 1 (AR(1)).The second model was a random walk (ARIMA(0,1,0)) model on the log transformed daily new reported SARS cases and the third one was a combination of growth curve fitting and autoregressive moving average model, ARMA(1,1).Due to limited number and nonstationarity of the data, a direct use of ARMA seems not appropriated.These validity issues of the models are discussed in the last section of this article.
We tried all three models on the officially reported SARS cases in mainland China since day 22 (May 12) after April 21.For an illustration, we presented the results up to day 30 (May 20) in this article.For a comparison, we also tabulated the one day ahead forecasts, their 95% confidence intervals and the actual observations in Tables 1, 2 and 3 by these three models, respectively.Figures 1, 3 and 5 showed the forecasting values.

Data Source
It seemed that the first SARS patient was able to be traced back in November 2002 in Guangdong, China.Due to a lack of an effective public health surveillance system in China for new diseases and a lack of understanding of the potential threat to the general public at that time, the disease was not acknowledged as serious as it should be by Chinese authorities.Nevertheless, the disease spread rapidly.According to various news reports, the number of SARS cases in Guangdong peaked around the Spring Festival in February 2003.The government tried to calm down the general public by not recognizing the seriousness of SARS.Later, the disease spread to the capital, Beijing, and the neighboring region of Hong Kong, then to other countries by air travelers.In March 2003, SARS infected and killed a WHO doctor who treated a SARS patient in Vietnam.On March 12, WHO issued an emergency travel advisory to several affected areas.The travel advisory guidelines have been issued and modified accordingly since then.By May 27, many places were removed from the list of considering postponing all but essential travel.These places included Guangdong and Hong Kong.A great progress of controlling SARS have been achieved.Around that time, the number of newly infected cases increased dramatically in the capital, Beijing, while new cases in Guangdong province decreased.We followed the daily reports closely.The time series of the SARS cases modeled in this article were extracted from the official reports of the Ministry of Health of China up to May 31 , 2003(MOHC 2003).The number of daily reported new probable SARS cases in China is presented in Figure 1.From Figure 1, one can clearly see that in the first 19 days following April 21, 2003 the daily reported new SARS cases were all greater than 100.Most of these cases must have been infected before April 21 because the strict quarantine policy, the close of schools and the ban of large public gatherings after April 21 could effectively stop the rapid spread of SARS.Due to these strong interventions in Beijing and across China starting in the middle of April, the daily new SARS cases decreased dramatically from May 10 to May 15.From May 16 to May 25, the daily new probable SARS cases remained relatively low.Since May 26, the number of new daily SARS cases has been less than 10.In June, the number of daily SARS case has been mostly below 5. To statistically quantify the dynamic of the infections of SARS and forecast the new cases of infections, we used and compared three time series models since May 12, 2003 (day 22) almost daily up to the end of May.In this article we report our findings of our modeling on the SARS time series.

Box-Jenkins Modeling
Box-Jenkins methods (Box and Jenkins 1976) are well established techniques for analyzing stationary linear time series.From Figure 1, the dynamic of the In Section 5, we discuss such approach via growth curve fitting.Alternatively, one may assume that the trend was due to stochasticity and use difference to remove the trend.We discuss this approach in Section 4 on log scale.In current section, we report the results from the plain ARMA(p, q) model: where Y t is the number of daily new SARS cases at day t, µ is the mean and Z t is white noise with mean zero and variance σ 2 .Using the first 30 observations, we computed the sample autocorrelation coefficients (ACF) and the sample partial autocorrelation coefficients (PACF).The ACF and PACF showed the process had a strong autocorrelation of lag one and a significant partial autocorrelation of lag one.These features in ACF and PACF indicated an ARMA(1,0) (AR(1)) may be applicable.Similar results were also true for the number of daily new cases at different lengths.These plots are not shown in the paper and they are available up on request from the author.
After fitting equation (3.1) by maximal likelihood method via S-plus based on the first 30 observations, we got α1 =0.9584 with standard deviation 0.0028, after removing the mean from the time series.The estimate of coefficient α 1 was significantly different from zero with p-value < 0.0001.A more meaningful test would be Dickey-Fuller test on unit root (Fuller 1976), which led to models in next section since the test showed that the estimated coefficients were not significant different from 1 (Fuller 1976).Similar autocorrelation structure was found for the log transformed SARS time series.Using the AR(1) model on the SARS time series, we can forecast the number of SARS cases at k days ahead.For example, for a one-day ahead forecasting, we obtained the mean value 21 with standard deviation 22. Hence, the 95% confidence interval would be (0, 64).The lower bound would be negative from the computation, which was replaced by 0. Results from models on observations with different lengths are tabulated in Table 1.The one day ahead forecasting since day 22 are shown in Figure 1.

Random Walk
In Section 3, we assumed that the observed daily new SARS cases were stationary and we fitted AR(1) model to the demeaned time series.The mean value of the one day ahead forecasting seemed reasonable.However, the standard deviation estimated from the process was relatively large and resulted nonpositive lower bound for the 95% confidence intervals for many one day ahead projections (Table 1).The large values of the estimates of the standard deviations were mostly due to the great variation at the beginning of the daily time series of the SARS cases.One popular way to stabilize the variation is to take log transformation of the time series.The log tranformed time series of the first 30 observations are shown in Figure 2(a).
From Figure 2(a), one can see that the trend persists.In this section, we assumed the trend was resulted from nonstationary process.Differencing the time series is an effective way to remove a stochastic trend.That is, we let X t =log Y tlog Y t−1 .This first order difference could remove the linear trend.Figure 2(b) is the plot of X t .The sample autocorrelation and the sample partial autocorrelation coefficients are plotted in Figure 2(c) and Figure 2(d), respectively.These two plots of ACF and PACF showed that X t was not serially correlated.Hence, we may use a simple ARIMA(0,1,0) (random walk) model for log Y t : where Z t is a white noise with mean zero and variance σ 2 and X t =µ+Z t .This random walk model has been widely used in many diversified fields such as in modeling financial markets (Tsay 2002) and monitoring clinical trials (Lan and Wittes 1988).Using model (4.1), we can construct a k day ahead prediction of log Y t and its 95% confidence intervals.Equivalently, the prediction of Y t and its 95% confidence intervals.For example, for a one day ahead forecasting for log Y t+1 at day t + 1, the 95% confidences interval would be where σ t is the sample standard deviation of Z t up to day t.From our calculation, at day 30, μ30 =-0.0840, σ30 =0.2374 and the one day ahead prediction was 16 with the 95% confidence interval (10,25).The 95% confidence intervals of the one day ahead forecasting are shown in Figure 3 for days after May 12 (day 22).The numerical results are tabulated in Table 2.

Growth Curve
In sections 3 and 4, the dynamic of the daily SARS cases in China was assumed to be stochastic.As it was mentioned in section 3, one would fit a deterministic trend to the time series such as Y t = f (t) + Z t , where f (t) might be a linear or nonlinear function.Taking into account the changing rate (daily new cases) of infection, a growth model on the cumulative case would be a good choice.Growth models have been used widely in many fields.(Wiorkowski 1981;Sandland and McGilchrist 1979).When the trend was assumed to follow a parametric function, we can use regression techniques to estimate the parameters in the function and establish the model.For the number of SARS cases in China since April 21, we cumulated the daily new cases up to June 3.The cumulative cases are plotted in Figure 5.The data prior to April 21 and after May 31 were not included in our analysis.
Table 3: The estimates of the parameters in the growth curve and the forecasting for the cumulative SARS cases: α, β and κ the estimates of the parameters in the growth curve, α1 and the β1 the estimates of the coefficients in ARMA(1,1) of the residuals, σ the estimate of the standard deviation of the innovation in ARMA(1,1) of the residuals, (a) one-day ahead forecast, (b) the 95% confidence interval of the one-day ahead forecast, (c) actual one-day ahead observation.The daily rate of increasing of new SARS cases was high from April 21 to May 10 before it decreased, which suggested an S-shaped parametric function.In our analysis, we selected the autocatalytic (logistic) model (Seber and Wild 1976).

Day
The growth function f satisfied the following condition: where k > o and 0 < f < α.The growth function f can be solved as (t−γ) . (5.2) Or, (5.3) where β = e κγ .Using the data of the first 30 days after April 21 and non-linear least squares method (nls function in S-plus), we got α=3691, β=10.89 and κ=0.2162 with the standard deviations 26.62, 0.51 and 0.005, respectively.All the estimates indicated that the parameters in the model were statistically different from zero with p-values all less than 0.0001.The fitted curve and the actual observations are shown in Figure 4(a).The residuals from the model are shown in Figure 4(b).As it is expected, the residuals from the model would be correlated.From the residual plot, we can see that there were two outliers from the initial observations.After removing these two initial observations, we computed the sample autocorrelation coefficients and the sample partial autocorrelation coefficients, which are presented in Figure 4 where Z t is white noise with mean zero and variance σ 2 .The estimate of parameters in equation (5.4) were α1 =0.7716, β1 =0.4832 and the σ30 =15.37.The diagnostic checking of equation (5.4) (not shown here) indicated a reasonable fit of the model to the data.From the fitted model, we can forecast the residuals of the cumulative model and then construct the 95% confidence intervals.Combining equation ( 5.3) and equation (5.4) (ignoring the variability of the estimates of the parameters in the growth curve model), we got the one day ahead forecast of the cumulative cases at day 30 being 3667 with the 95% confidence interval (3637,3697).The actual observation was 3669.We repeated this forecast starting from day 22, the results are tabulated in Table 3.Because the large scale in Figure 5, the 95% confidence intervals could not be shown properly in the plot.Hence, we did not include the confidence intervals in Figure 5.The fitted growth curve was based on the data up to day 41 (May 31).

Concluding Remarks
Monitoring the health of population is an important task for all countries.In many developed countries, the use of quantitative methods have been a routine practices for governmental agencies and academic institute.However, in many developing countries, there is much more to be done in this field.In this article, we applied three different methods for modeling the dynamic of the SARS cases in mainland China.All these methods involved components of time series analysis.The well developed ARIMA techniques were named after Box-Jenkins' seminar work in the late 1960 (Box and Jenkins 1976).These techniques have been used on modeling influenza in the US (Scuffham 2003) and other infectious diseases (Schnell, Zaidi and Reynolds 1989;Zaidi, Schnell and Reynolds 1989).From Tables 1 to 3, we can compare the forecasting results derived from these three methods with the actual observations.The Box-Jenkins' AR(1) model assumes stationarity of the daily new SARS cases.This was probably not true because of the extraordinary interventions implemented by the Chinese government after April 21.Hence the forecasts from the AR(1) model were usually greater than the actual observed numbers.The random walk model on the log transformed daily SARS cases followed the dynamic of the daily new cases reasonably well and it used mostly the present information to predict the future.The log transformation reduced the variability significantly.For the growth model, it fitted a parametric function globally to the cumulative SARS cases.Although it appeared that the model gave an extremely good fit from the plot, it was mostly due to scale effect.From the results in Table 3, one can see that the variability remained.
The three models studied in this article may be used for different purposes.From the AR(1) model, the differences between the observed and the forecasted could be used as a measure of the effectiveness of the interventions.For a mainly short term forecasting task, the random walk model seemed providing satisfactory results.However, if one wants to assess the relative long-term effect of SARS from a cumulative point of view, the growth curve may be more useful.
Mathematical and statistical models can not only provide description and understanding of the dynamics of diseases, they can also be used in forecasting the future values.To quantify the possible number of cases in the future is fundamental to decision making in public health, especially for highly infectious diseases with a relatively high mortality rate such as SARS.
Prior to systematic time series analysis techniques, there were other mathematical methods used in forecasting the trends of various diseases.Extrapolation of polynomial trends and exponential smoothing were two popular methods (21).Extrapolation a polynomial trend is very unstable because of the nature of the polynomials.Exponential smoothing can be viewed as a special case of ARIMA modeling (Diggle 1990).Recently, nonlinear time series models including chaos have been applied to study the measle dynamics (Bjornstad, Finkenstadt and Grenfell 2003;Finkenstadt and Grenfell 2000).The applicability of chaos theory in monitoring the SARS dynamics is under investigation.
Although the rapid increasing of SARS cases in China in April and early part of May was replaced with a much lower rate and the urgency of forecasting the new cases seemed diminishing, it is generally agreeable that the SARS virus may be able to survive the coming winter and may come back next spring.Hence constant vigilance by governmental officials and academic researchers as well as the general public is essential to keep the virus under control.
SARS is a new disease and there is much more to be learned about its cause and its transmission among the population.The time series models used in this article were mostly based on the history of the observations.When more risk factors are identified from epidemiological study, one can combine these factors in constructing better models to describe the dynamic of the disease and forecast the future.These efforts would contribute to the ultimate goal of eradicating SARS.

Figure 1 :
Figure 1: The daily new SARS cases in mainland China from April 21 to June 3, 2003 and the one-day ahead forecast and its 95% confidence interval based on AR(1) model for the daily new SARS cases in mainland China from May 12 to May 31, 2003.

Figure 2 :
Figure 2: (a) The log-transformed daily new SARS cases in mainland China from April 21 to May 20, 2003, (b) the first order differences of the logtransformed time series, (c) the sample autocorrelation coefficients, (d) the Sample partial autocorrelation coefficients.

Figure 3 :
Figure 3: The One-day Ahead Forecast and Its 95% Confidence Interval Based on the Random Walk Model for the Daily New SARS Cases in Mainland China from May 12 to May 31, 2003.

Figure 4 :
Figure 4: (a)The Fitted Growth Curve to the Cumulative SARS Cases in Mainland China from April 21 to May 20, 2003, (b) the Residuals of the Observed Values to the Fitted Model, (c) the Sample Autocorrelation Coefficients of the Residuals, (d) the Sample Partial Autocorrelation Coefficients of the Residuals.

Figure 5 :
Figure 5: The fitted growth curve to the observed values of the cumulative SARS cases in Minland China from April 21 to May 31, 2003.

published on the official website of the Ministry of Health of China 1 .
There have been very few new SARS cases in China since June 1, 2003.By August 16, 2003, the last two SARS patients in the 2003 epidemic were released from the hospital in Beijing.Since the end of the 2003 epidemic of SARS, there were a few SARS cases occurred in Guangzhou in December 2003.In late March and mid April of 2004, two researchers at the National Institute of Virology developed SARS.Because of the effective control, the isolated SARS cases in December 2003 in Guangzhou and in April 2004 in Beijing did not start another SARS epidemic in China.

Table 1 :
The Estimates of the parameters in the Box-Jenkins model and the forecasting of the daily new SARS cases: α1 the estimate of the coefficient in AR(1), μ the estimate of the mean, σ the estimate of the standard deviation of the innovation in AR(1), (a) nne-day ahead forecast, (b) the 95% confidence interval of the one-day ahead forecast, (c) actual one-day ahead observation daily new SARS cases in China seemed not stationary.To recognize the nonstationary features of the daily observations of the SARS cases in China, one could create an indicator variable (deterministic trend) and fit it to the observations.

Table 2 :
The estimates of the parameters in the log-transformed random walk model and the forecasting of the daily new SARS cases: μ the estimate of the mean in the log scale, σ the estimate of the standard deviation of the innovation in log scale, (a) one-day ahead forecast, (b) the 95% confidence interval of the one-day ahead forecast, (c) actual one-day ahead observation.