Time Series Regression Models for COVID-19 Deaths

This article develops nonlinear functional forms for modeling count time series of daily deaths due to the COVID-19 virus. Our models explain the mean levels of the time series while accounting for the time-varying variances. A Bayesian approach using Markov chain Monte Carlo (MCMC) is adopted for analysis, inference and forecasting of the time series under the proposed models. Applications are shown for time series of death counts from several countries aﬀected by the pandemic


Introduction
Coronavirus  is a new pandemic viral infection that has been spreading worldwide in the year 2020, and is caused by a newly discovered coronavirus.The virus emerged in the city of Wuhan, China, at the end of 2019, and as of August 20, 2020, it has accounted for 22,263,347 globally confirmed cases and 782,471 deaths according to the World Health Organization (World Health Organization, 2020a).The disease incidence rate grows on an exponential scale with global geographical expansion to almost all countries of the world.As the epidemic started on December 31, 2019, WHO (World Health Organization) was informed of a cluster of cases of pneumonia of unknown cause detected in Wuhan City, Hubei Province of China.The COVID-19 was identified as the causative virus by Chinese authorities on January 7, 2020 (World Health Organization, 2020b).On January 30, 2020, following the recommendations of the Emergency Committee, WHO declared that the outbreak constitutes a public health emergency of international concern.
The COVID-19 virus affects different people in different ways.Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness, and will recover without requiring special treatment (World Health Organization, 2020d).Common symptoms include fever, tiredness and dry cough.Other symptoms include: shortness of breath, aches, and pains, sore throat, very few people will report diarrhea, nausea, or a runny nose (World Health Organization, 2020b).Older people and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, or cancer are more likely to develop severe symptoms.It is believed that the COVID-19 virus spreads primarily through droplets of saliva or discharge from the nose when an infected person coughs or sneezes.While there are many ongoing clinical trials aimed at possible treatments, there are no specific vaccines or treatments for COVID-19 at this time.
In the Situation Report World Health Organization (2020c) of WHO (January 21, 2020), there were only 282 confirmed cases in the region covering China, Japan, Republic of Korea, and Thailand.Since December 2019, the pandemic has shown dynamic patterns of varying extents of increase, leveling off, or decrease in different areas.Figure 1 in Section 2 presents time series plots of the daily notified cases of COVID-19 in China, USA, Spain, Italy, Brazil and India in the period ranging from January 1, 2020 to May 30, 2020.It can be seen that during this period, USA has the majority of daily notifications compared to other countries.Smaller numbers of COVID-19 notification cases are observed in China, possibly indicating that the pandemic is under control due to containment strategies taken by the authorities and/or the use of diagnostic testing on the entire population.The situations in other countries vary and are discussed in Section 2. In order to track the virus, the World Health Organization has updated the Laboratory Testing Strategy on March 21, 2020(World Health Organization, 2020d) according to different transmission scenarios: countries with no cases; countries with one or more occurrences (sporadic cases); countries experiencing clusters of cases across periods, geographic locations, or common exposure (clustered cases); or countries experiencing more significant outbreaks or sustained and pervasive local transmission (community transmission).
There is considerable recent literature related to COVID-19.The research emerges from many different fields of study including viral origin and structure (e.g., Lan et al., 2020;Shang et al., 2020;Lam et al., 2020), epidemiology (e.g., Ferretti et al., 2020), preclinical research (e.g., Kim et al., 2020), diagnostic and serology (e.g., Ju et al., 2020;Wölfel et al., 2020) and therapy and clinical trials (e.g., Shen et al., 2020).In these studies, the main focus has been identification of the symptoms of COVID-19 in order to develop an antigen (clinical trials and diagnostics), to track the presence of the virus using an app that builds a memory of proximity contacts (epidemiology) and the identification of the viral origin and the COVID-19 structure.According to Lan et al. (2020) and Andersen et al. (2020), the virus appears to be optimized for binding to the human ACE2 receptor; the spike protein of SARS-CoV-2 has a functional polybasic cleavage site at the S1-S2 boundary through the insertion of 12 nucleotides.Moreover, the receptorbinding domain (RBD) in the spike protein is the most variable part of the coronavirus genome, that is, six RBD amino acids have been shown to be critical for binding to ACE2 receptors and for determining the host range of SARS-CoV-like viruses (Andersen et al., 2020;Wan et al., 2020).A great research effort is seen worldwide in the last three months related to different ways of understanding the pandemic with hundreds of papers being published in a very short time period, (e.g., Kandel et al., 2020;Pung et al., 2020;Chan et al., 2020;Huang et al., 2020;Wu et al., 2020;Lu et al., 2020;Chen et al., 2020b;Li et al., 2020;Lai et al., 2020;Lupia et al., 2020;Shereen et al., 2020;Chen et al., 2020a;Sohrabi et al., 2020;Han et al., 2020;Chen, 2020;Wu et al., 2020;Zhao et al., 2020).Most papers are related to the transmission mechanisms of coronavirus among different persons, genomic characterization and epidemiology of COVID-19, new treatments to COVID-19, clinical features of patients infected with the virus, developments of vaccines, effects of the confinement of entire populations to minimize the spread of the disease and not overburden the health systems and to decrease the mortality rate, especially for the elderly and people with comorbidities.Another point of great interest is related to predictions of the quarantine time adopted by almost all countries in the world following WHO directions to decrease and delay new infections, but that could be catastrophic in economic terms for many countries in the world.
Mathematical models of epidemic dynamics have been used to understand patterns and predict the evolution of the pandemic.The input for these models is the time series of the number of susceptible, infected, and recovered (SIR) persons (Kermack and McKendrick, 1927;Lili Wang et al., 2020).Notification by health centers of the number of infected people depends on the efficiency with which tests and diagnostics are performed on people with suspected infection.One well-known problem is that when underreporting of the number of infected people is prevalent, the reported data can provide very optimistic predictions for the pandemic state.Another problem is caused by delays in reporting the death counts on some days, resulting in delays and discrepancies in reporting.A model that tries to solve the problem with inconsistent data and underreporting in the number of infected people was proposed by Lili Wang et al. (2020).However, detecting discrepant data and eliminating the effect of these data on model fit is easier than estimating unreported data on the number of people infected because it is unobserved data.
An analysis that can contribute to assessing and predicting the future state of a pandemic can be done by considering statistical models for the mean number of deaths in a regions.It is also crucial to understand that some factors not considered in such a model curve can lead to errors in forecasting of the number of deaths for a given day.Among these, one of the main factors is data consolidation.Consolidation is done by medical reports based on clinical and laboratory tests that prove the cause of death.Due to an excess of cases, some laboratory tests to certify the confirmation of the cause of death by COVID-19 delay the confirmation, causing an accumulation of late notifications that are reported on certain days together with the consolidated cases on the same day.This issue can cause significant peaks in the number of deaths on certain days, especially when the pandemic reaches its peak, together with other factors, such as the capacity of the health system to be exhausted, difficulty in medical care, etc., lead to the heteroscedasticity observed in the daily death counts data.
The use of nonlinear regression models has been well known in several areas such as, modeling of so called dose-response relation, tumor growth, pharmacology and in the analysis of epidemic data.Some papers have proposed nonlinear regression models to model the number of cases or number of deaths by COVID-19.Luo (2020) uses the logistic model to predict the end of the pandemic.Zhang et al. (2020) uses Gamma models to predict turning points, durations and attack rates of COVID-19.Girardi et al. (2020) presents a robust Bayesian approach on five parameter log-logistic curve models to model the number of deaths curve in some cities in Italy.Tsallis and Tirnakli (2020) present a unified nonlinear functional form for predicting COVID-19 peaks for several countries.The pattern of the COVID-19 death curve is highly variable depending on many factors, restrictive measures, and the capacity of the health services in different countries.For this reason, it is not possible to use a single function to model the epidemic in different locations in the world.In this paper, a nonlinear regression statistical model is proposed for the daily counts of COVID-19 deaths in a few different regions of the world, using a few nonlinear functions.It is essential that the model considered for the mean of the time series reproduces the asymmetric pattern of these series.Another critical aspect to be included in the model is the conditional heteroscedasticity of the time series.A good model for the variance is fundamental to calculate probabilities and more accurate credible intervals for the forecasts.
In this paper, we have described seven nonlinear functions for modeling the time series of the number of daily deaths caused by the COVID-19 pandemic.Applications of the proposed models are shown using time series of daily deaths from six countries with different characteristics and at different stages of the pandemic.The countries we selected for this analysis fall into three groups.Group G1 consists of a country (China) where the epidemic originated but is currently reported to be well under control.Group G2 consists of European countries (Italy and Spain) which were the first countries in Europe to detect cases of contamination.But at the moment, they have already passed the peak of the pandemic.Group G3 consists of large countries (USA and Brazil) of continental dimensions located in the American continent.Although their population sizes are nearly the same, they have vastly different development conditions.In these two countries, the peak was only recently reached.Group G4 consists of an extremely densely populated country (India) where the epidemic seems to have spread slowly and where the peak still to be reached in the coming months.A result of a July serological survey of 6,936 people across three suburbs of Mumbai found that more than 50% of people across parts of India's financial hub of Mumbai have coronavirus antibodies.It is indicating that the population may have inadvertently achieved the controversial "herd immunity" protection from the coronavirus.It explains why a steep drop in infections is being seen among the closelypacked population, despite new cases increase overall in the hard-hit country.(Read more at: https://www.bloombergquint.com/coronavirus-outbreak/herd-immunity-seems-to-be-developing-in-mumbai-s-poorest-areas).The purpose of our analysis is to match the best nonlinear functional form for the pattern of the expected number of daily deaths curve for each of the four groups, G1-G4, particularly to identify the timing of the peak and the rate and nature of leveling off in the curve.Such an understanding is crucial to understanding how COVID-19 deaths differentially progressed in these countries, bringing different approaches for its management and control.Further, such an analysis is useful to anticipate the peak timing and the life-cycle in a future second wave of this epidemic, or similar epidemics in the future -although we all hope this will not be the case.We also note that although we have shown our analysis for time series collected at the national level, a similar analysis is possible at different smaller zonal levels, for example, for many US, Brazilian, or Indian states.
The paper is organized as follows.In Section 2, we present a description of the data sets used and discuss the main objective of the paper.Section 3 develops the functions for modeling the mean and variance of the time series and the Bayesian approach used to do model fitting and make inference about the model parameters.Section 4 shows results from time series from the different countries, while Section 3.4 presents a Bayesian forecasting procedure for future daily death counts.Section 5 presents a summary and discussion.

COVID-19 Data Description
We have considered data from six countries, i.e., China, USA, Spain, Italy, Brazil and India.A summary of pandemic numbers by country is presented in Table 1.
For each country, we model and predict the count time series of daily deaths by COVID-19, starting from the day when the first infected case was confirmed in the country.An exception is the time series of the number of deaths in China, where the first record was made on December 31, 2019, with 26 infected cases and 0 deaths.Therefore, for different countries, the time series start on different days.For all series, we consider observations until July 31, 2020 for model fitting.We analyze the series of daily deaths from these six countries individually.We restrict ourselves to modeling time series of deaths and not time series of new infected cases because the series of infected cases depends on the number of tests that each state has performed and the data may not accurately reflect the size of the problem.The time series used in this paper were taken from the website https://countrymeters.info/ptand the European Centre for Disease Prevention and Control (ECDC) website https://www.ecdc.europa.eu/en/publications-data.Figures 1a  and 1b show the number of new daily infected cases and the logarithm of the accumulated daily number of infected cases.Figures 1c and 1d show the number of daily deaths and the logarithms of the accumulated daily number of deaths.As we mentioned earlier, we only model the time series of daily deaths in this paper.
In Section 3, we describe a Bayesian approach for fitting a nonlinear statistical model to represent the mean pattern for the daily series of new deaths in each country (Ratkowsky, 1983;Bates and Watts, 1988).The detected pattern is then used to forecast future values of the time series.

The Model
Let {Y (t), t = 1, 2, . . ., T } be a time series defined over nonnegative integers, representing the number of daily deaths by COVID-19 in a single location.We consider the following model for where η(t) is a deterministic nonlinear function of time whose forms are defined in Table 2, and Z(t) is an autoregressive conditionally heteroscedastic stochastic process given by where is the pth order autoregressive polynomial with all roots outside the unit circle, B is the backshift operator, V ar(W (t)) = σ 2 (t) which follows an autoregressive conditionally heteroscedasticity (ARCH) model of order 1 (Engle, 1982(Engle, , 1983) ) with α 0 > 0 and 0 α 1 < 1 and ε(t) is an i.i.d.normally distributed process with This model enables us to model the temporal dependence in the first two moments of the residuals from the nonlinear regression fit.Although a variance stabilizing transformation such as the logarithmic transformation of the Y (t) may be considered, we instead use a conditionally heteroscedastic model.
Plots of the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the regression residuals and the squared residuals, together with the Lagrange Multiplier (LM) test (Engle, 1982) enables us to verify the presence of autoregressive or ARCH effects in the residuals.We use as candidates for the η(t) model the functions provided in Table 2.Although these functions are widely known, we make some observations about them here.(i) The generalized exponential function (Gupta and Kundu, 2001) has excellent shape flexibility depending on the parameter θ 2 .If θ 2 > 1, this function has a unimodal asymmetric form, whose mode is given by θ −1 1 log(θ 2 ).

Function η(t)
The Rational function has been widely used in the context of statistical modeling (especially process modeling), as an empirical technique to fit curves (Kulikov, 2001;Press et al., 2007).This function also has an asymmetric shape, which is an important characteristic of our data.
A brief description of the class of Rational functions is presented in in the supplementary material.(iii) The alpha function has been used in the context of Product Life-Cycles (Petrescu, 2009) that are widely used in business applications.This curve has two inflection points, and it allows us to model four phases of the pandemic cycle.The first phase when growth accelerates, the second when the speed of the growth diminishes before reaching the peak (in general this can be a reflection of containment measures), the third phase occurring after the maximum when the curve starts to decline, and the fourth, after starting the relaxation of containment.Given η(t), we construct the residual process Z(t) given as To simplify the notation we denote F t = {Y (t − 1), ..., Y (1), η(t − 1), ..., η(1)} all information available until time t, we define the conditional mean E(Y (t)|F t ) = μ(t) given by (5) One of the biggest challenges in modeling the pandemic is to find the time, denoted by τ peak , at which the highest number of deaths (or cases) occurs.Finding τ peak allows us to assess the expected value of the stochastic process E(Y (τ peak )|F τ peak ) = μ(τ peak ).To find the value of τ peak , consider a positive integer t peak such that μ(t peak ) μ(t), ∀ t ∈ N.Then, τ peak is estimated by t peak .
Another challenge is to evaluate the probabilities P (Y (t) U ) for fixed values of U that allows us to estimate the end of the pandemic.However, evaluating these probabilities is only possible by making assumptions about the probability distribution of the white noise W (t) and finding the appropriate models for the variances σ 2 (t).As we see above, we have assumed that {W (t), t 0} are i.i.d.N(0, σ 2 (t)) variables and σ 2 (t) follows a stationary ARCH(1) process.
The probability P (Y (t) < U) can be easily calculated as where (z) is the standard normal N(0,1) cumulative distribution function (cdf).The estimate of t peak , as well the estimates of the probabilities P (Y (t) < U) (∀ t > 0), can provide extremely useful information for decision-making and resource management to cope with a pandemic.When there is interest in analyzing the accumulated number of deaths (or cases), we can consider the accumulated processes given by in which case, we have For accumulated processes, the main objective is to look for sigmoidal functions that best fit these curves.A function S(t) in sigmoidal form ideally has its origin in S(0), inflection point (peak) occurring at the beginning of the first third of the stage before approaching the maximum value, with an asymptote to be reached when the process enters a stage of decay.In a pandemic, this change in curvature occurs closer to the peak point.Some sigmoidal functions, used to model growth curves, are presented in Desta et al. (1999) and Sonnino (2020).

Model Fitting
The Bayesian approach provides an attractive method for analyzing the time series of daily death counts and generally is more reliable in small-sample inference for nonlinear regression than the least squares approach (Katz et al., 1981;Dorndorf et al., 2019).Let {Y (t), t 0} be the stochastic process and {y t , t = 1, 2, . . .T } denote the realization, where Y (t) = y t for all t ∈ N.
The likelihood function of the parameters given the data is given by where the conditional probability density function (pdf) of y t given its history is given by (Tsay (2010)) The prior specification for the parameter vector θ is as follows.We assume for the parameter vector θ * = (θ 0 , θ 1 , . . .θ m , φ 1 , . . ., φ p ) a normal prior distribution N(a, b 2 I ), whose prior density can be written as where I is an identity matrix of dimension (m+p)×(m+p).Values for the a and b hyperparameters may be assigned, as we do, to have approximately non-informative prior (we considered, a = 0 and b = 1).For parameter α 0 > 0 we assume a Log-Normal prior LN (a 0 , b 0 ), and for parameter α 1 ∈ (0, 1) we assume a Normal(0, 1) prior for parameter ξ = logit (α 1 ).Assuming priori independence for the parameters, we denote by π(θ ) the joint prior density function for the parameter vector θ .From Bayes' rule, the posterior density function for the vector of parameters θ is then given by The nonlinearity of the mean and the conditional heteroscedasticity of the process precludes a closed form for the posterior density of θ .Bayesian estimation of θ must be implemented numerically using Markov chain Monte Carlo (MCMC) methods.Although MCMC-based algorithms are time-consuming and slow to converge in complex modeling situations, the modeling considered in this paper were computationally feasible and convergence as monitored by the Geweke criterion (Geweke, 1992) was reached in less than 1 minute for each model using a core i7.However, the computational effort grows rapidly when the order p of the autoregressive model increases.
We ran the analysis by modeling μ(t) by each function in Table 2.We used the Gibbs sampler with the Metropolis-Hastings algorithm (Chib and Greenberg, 1995).Specifically, we generated a chain of size 50,000, deleting a initial burn-in sample of 30% of the generated chain and resampled a value once every 10 generated values (thinning).This resulted in a final sample of size G = 3500, which is considered a random sample from the posterior distribution of θ .We obtained such results corresponding to each function for μ(t).To select the best fitting AR(p) model, we used the BIC criterion (Schwarz, 1978), and to select the best fitting function μ(t), we used the conditional predictive ordinate (CPO) (Gelfand et al., 1992), which is briefly described below.

The Conditional Predictive Ordinate
Use of the CPO provides a useful cross-validation approach and is a computationally efficient measure of model fit.This measure is calculated using the predictive probability density function of observing y t in the future after having observed D t−1 given by where is the parametric space.Considering the equation ( 6) and the generated chain from the MCMC procedure, we can estimate the CPO t by In many situations, it is more convenient for numerical reasons, to calculate log(CPO t ), rather than CPO t .In this case, log( CPO) is given by the sum of the estimates log( CPO t ).The fitted model with the smallest − log( CPO) is selected as the model that best fits the data.

Influential Points
Let y (−t) = (y 1 , y 2 , . . ., y t−1 , y t+1 , . . ., y T ) the observation vector after removal of the t − th observation of D T .We denote the joint posterior densities of the parameter vector θ , obtained from the original data set by π(θ |D T ) and from the data set after removal of the t −th observation by π(θ |y (−t) ).The influence of the observation y t can be evaluated by calculating the Kullback-Leibler (K-L) divergence between these two posterior densities.Specifically, It can be shown that ( 8) can be expressed as a posterior expectation (Conceição et al., 2013) KL(π, where CPO t is the conditional predictive ordinate (CPO) density of the observation y t .Given a sample {θ (1) , . . ., θ (G) }, generated from posterior density π(θ |D T ), we can estimate the effect of observation y t by where CPO t is evaluated by Equation (7).
A measure of calibration for the KL(π, π (−t) ) divergence proposed by McCulloch (1989), denoted by ρ t , is derived from the solution of the equation KL(π, π (−t) ) = KL(B(0.5),B(ρ t )) = − log 4ρ t (1 − ρ t )/2, where B(ρ t ) denotes the Bernoulli distribution with a success probability ρ t .This implies describing results using the full posterior density, π(θ |D T ), instead of the posterior density by removing the t-th observation, π(θ |y (−t) ), is equivalent to describing an event not seen as having probability ρ t , when the correct probability is 0.5.Solving the equation for ρ t , we have that ] .This implies that 0.5 ρ t 1.For ρ t 0.5 it can be considered that the t-th observation is an influential point.

Forecasting
Forecasts of future values of the process Y (T +h) for some lead time h > 0 made from origin T are obtained as the conditional expectations E(Y (T + h)|D T ) of the predictive density f (y T +h |D T ), given by where is the parametric space.Thus, the forecast for Y (T + h), denoted here by Replacing the equation ( 6) in equation ( 11) and changing the order of integrals, we find an appropriate equation to calculate the predictions given by Monte Carlo estimates for Y (T + h) can be obtained considering the generated samples from the posterior density of the parameter vector θ .Let us start by considering the chains provided by the MCMC algorithm for the parameter vector θ (g) , g = 1, . . ., G. Thus, we have the values μ (g) (T +h), g = 1, . . ., G evaluated when replacing each element of the vector θ (g) in the function μ(T + h).A Monte Carlo estimate for Y (T + h) can be calculated by

Forecasts in the Presence of Heteroscedasticity
The presence of conditional heteroscedasticity does not affect the calculation of the forecasts, since E(σ (t)W (t)|D t ) = 0 for all t 0. However, when interest lies in calculating probabilities associated with Y (T + h), the presence of heteroscedasticity requires that σ 2 (T + h) be predicted as well.Considering the proposed model given in (4) for σ 2 (t), we have that σ where Z(T ) = Y (T ) − μ(T ), α 0 > 0 and 0 α 1 < 1.However, since 0 α 1 < 1, when h → ∞, the equation ( 12) results in σ 2 (T + h) → α 0 /(1 − α 1 ).Monte Carlo estimates for σ 2 (T + h) can be calculated by considering the values σ 2(g) (T + h), g = 1, . . ., G, evaluated by replacing each element of the vector α (g) 1 ) in equation ( 12).Thus, the Monte Carlo estimate σ 2 (T + h) is given by The credible intervals (CI) for the forecasts y T +h with probability 1 − α can be calculated by y T +h ± z α/2 σ T +h , where z α/2 is the 100(1 − α/2) percentile of a normal N(0, 1) distribution.When the process is homoscedastic, the variance is constant for all T + h.However, when there is heteroscedasticity, it is necessary to express σ 2 (T + h) in terms of the parameters α 0 and α 1 .
The probabilities associated with Y (T + h), in the presence of heteroscedasticity, can be calculated by In the next section, we present the results of the best model fitted and the forecast for 100 days ahead given the observed data.We mention that 100 days is a long forecasting horizon for these inherently short memory time series models, and forecasts closer to the forecast origin will be more reliable.All results were obtained using the Software (R Core Team, 2020).

COVID-19 Results for the Six Countries
In Section 2, we described count time series of daily deaths by COVID-19 for each of the six countries summarized in Table 1.The observations available until July 31 (sample size) were used for model fitting.We scale the population of each country in order to better handle the data numerically and graphically.For China and India, we denote the data in multiples of 100 million.Corresponding to the total population of each of these countries (of about 1.411 billion and 1.395 billion respectively), we get a scale factor of 14.11 for China and 13.95 for India.For the remaining four countries, we denote the data as multiples of one million, resulting in scale factors of 333.76 for USA, 45.68 for Spain, 60.02 for Italy, and 217.26 for Brazil, based on each of their total populations.We fit each of the models presented in Table 2 under the Bayesian approach described in Section 3.1.Table 3 shows the CPO(p) for each model, where p refers to the order of the autoregressive model.In this table, it is possible to see that the Alpha model was selected for two of the six data sets, the Rational function model was selected for two of the data sets, and the Logistic Model was selected for data set from China.For India, we have three models as the same criterion value, and we can use any of these models to analyze the data set from India.This is due to the fact that India is still in the first part of the pandemic and has not yet reached its peak.As we discussed in the introduction, the main objectives in fitting these models to the data are (a) to understand patterns in the mean curves, and (b) to predict future values of the time series.Therefore, our aim is to use the fitted models for the average number of daily deaths to identify the state of the pandemic and to use the fitted curve to evaluate whether the peak days of the epidemic are approaching, are occurring, or have passed.Before analyzing each country's data, it is essential to understand that some factors that can increase the forecast error are not considered in the forecasting model for the number of deaths for any given day.One of the main factors in this list is data consolidation.As mentioned in Section 1, these factors can lead to outliers and heteroscedasticity, which causes high prediction errors.However, the trajectory of the average curve and the identification of peak days is more critical in our analysis than accurate forecasting of the number of deaths on any given day.Aiming to achieve this objective with proposed models, we used a K-L divergence calibration measure to identify outliers, which we then smoothed suitably.
The fitted models were used to forecast the pattern of the mean number of deaths curve together with its 95% credible interval, for the next 100 days beyond the observed data.We present these results for the six countries.The last 20 days with observations from August 1 to August 20 were held out from the model fitting and were used to evaluate the accuracy of forecasts for these days.We present these results in Subsection 4.7.
Tables in the supplementary material present the posterior summaries and the corresponding 95% credible intervals for the parameters of the selected models fitted after applying a smoothing to account for outliers identified by the K-L divergence calibration measure.The residual analysis, including the Lagrange Multiplier test for the ARCH model for Z(t), is also presented in the supplementary material.

Results for China
We considered the time series of new daily deaths notified between December 31, 2019 and July 31, 2020 in China.According to the CPO criterion shown in Table 3, the Logistic with AR(1) model was selected as providing the best fit to the data.It is shown in Figure 2a that on February 12 (t = 45) the daily added number of deaths had clear jumps with significantly large sizes.Such sizable jumps cannot happen within one day, rather they represent an accumulation of deaths that have not been reported on previous dates prior to February 12 (Lili Wang et al., 2020).In our analysis, we considered the K-L divergence (see Subsection 3.3) to identify outliers.We chose to do two analyses, i.e., (a) to keep these outliers in the data set, or (b) to replace these outliers by the central moving average of the seven values observed with the outlier values at the center.
Proceeding with the analysis (a), Figure 2a presents the time series of daily death counts, the fits under the Logistic model for μ(t), and the 95% credible interval.Considering the K-L divergence as a measure of calibration, we accept as an influential point the values whose ρ i > 0.8.With this criterion, values that occurred on February 12 and February 23 (t = 45 and t = 56) respectively, were considered to be outliers.Therefore, for these days, it is not expected that the model will be able to predict the observations y(45) = 254 and y(56) = 150 well.Proceeding with the analysis (b), we considered the time series of new daily deaths, replacing the outliers by the central moving average.According to the CPO criterion, the Alpha with AR(3) model was selected as providing the best fit to the data (−LogCPO(3) = 183).Figure 2b presents the time series of daily deaths, the fits under the Alpha with AR(3) model for μ(t), and the 95% credible interval.In Figure 2c, we present the accumulated number of deaths and the 95% credible interval.We observe from the plots in Figure 2 that the Alpha with AR(3) model gives a good fit for this time series.In addition, Figure 2 also shows the forecast for the next 100 days (from July 31) and the credible intervals around the forecasts.
Using the fitted Alpha with AR(3) model, we can estimate τ peak , the day with the maximum number of deaths as t peak = 47, and the expected number of deaths μ(47) = 11.4 (or 161 people, by multiplying by the scale factor 14.11).The 95% credible interval for this day was (7.9, 14.9) (or, (111, 210) people).Specifically for this data set, these estimated values cannot be compared with the values that were observed, i.e., t peak = 45 and y(45) = 17.9 (or 253 people) because February 12 was an outlier.Proceeding with the analysis (b), the observed values were t peak = 47 and y(47) = 10.1 (or 143 people); the predicted values using this fitted model were μ(47) = 6.7 (or 95 people) and the 95% credible interval for this day was (6.3, 13.9) (or, (89,196) people).We note that the number of deaths observed on this peak day is included in the estimated credible interval.

Results for USA
We considered the time series of daily deaths notified by COVID-19 between January 21, 2020 until July 31, 2020 in USA.Table 3 shows the CPO for each fitted model, and the Rational function with AR(1) model was selected as the best fitting model.Figure 3a presents the time series of new daily deaths, the fitted mean μ(t) with the Rational function model, and the 95% credible interval.We can observe from Figure 3a that there are some outliers.Based on a K-L measure of ρ i > 0.8, we identified observations at t = 87 and t = 89 as outliers, with y(87) = 14.7 (4906 people) and y(89) = 11.3 (3771 people).Replacing the outliers by the central moving average and refitting all the models, the newly fitted Rational function with AR(1) model was selected as providing the best fit to the data according to the CPO criterion (−LogCPO(1) = 261).Figure 3b presents the time series of daily deaths, the fits under the newly fitted model for μ(t), and the 95% credible interval.In Figure 3c, we present the accumulated number of deaths and the 95% credible interval.Figure 3 shows the goodness of fit of the Rational function model for this data set.In addition, Figure 3 shows the forecasts for the next 100 days (from July 31) for the number of new daily deaths and the cumulative number of deaths.
For the US, the observed peak day was t peak = 95 and y(95) = 9.5 (or 3171 people, multiplying by 333.76).For this peak day, the forecast provided by the fitted model was μ(95) = 5.9 (or 1969 people) with 95% credible interval (4.5, 10.6) (or (1502, 3538) people).We note that the number of deaths observed on the peak day is included in the estimated credible interval.

Results for Spain
For Spain, we consider the time series of new daily deaths notified between February 1, 2020 until July 31, 2020.Based on the CPO criterion shown in Table 3, the Rational function with AR(3) model was selected as the best model for this data.Figure 4a presents the time series of new daily deaths, the fitted mean by the Rational function with AR(3) model μ(t), and the 95% credible interval.The measure of calibration for the K-L divergence identified as outliers (with ρ i > 0.8) the observations at t = 112, and y(112) = 15.1 (690 people).Replacing the outliers by the central moving average and fitting all models again, according to the CPO criterion, the  Alpha with AR(1) model was selected as providing the best fit to the data (−LogCPO(1) = 254).Figure 4b presents the time series of daily death counts, the fits under the new fitted model for μ(t), and the 95% credible interval.In Figure 4c, we present the accumulated number of deaths and the 95% credible interval.We can observe from Figure 4 the goodness of fit of the Alpha with AR(1) model for the asymmetric data set.These plots also show the forecasts for the next 100 days (from July 31 as origin) for the number of new daily deaths and the cumulative number of deaths.
The day with the maximum number of deaths observed was t peak = 63 and y(63) = 20.8(or 950 people, multiplying by the scale factor of 45.68).The estimate of the number of deaths given by the fitted model with their credible interval on this day is given by μ(63) = 18.3 (or 836 people), with 95% credible interval (15.9, 21.8) (or (726, 996) people).We note that the number of deaths observed on that peak day is included in the estimated credible interval.

Results for Italy
For Italy, we model the time series of new daily number of notified deaths of COVID-19, beginning on January 31 and ending on July 31, 2020.Table 3 presents the LogCPO for each assumed model, from which we see that the Alpha without autoregression model (p = 0) was initially selected as the best fit to the data.Figure 5a presents the time series of new daily death counts, the fitted mean μ(t), and the 95% credible interval.The discrepant point analysis indicated points y(52) = 13.2 (or 792 people) and y(58) = 16.2 (or 972 people) as outliers.By replacing these points with their respective seven-point centered moving averages, the best fitting new model was the Alpha with AR(1) model (−LogCPO(1) = 193).Figure 5b presents the time series of new daily number of deaths, the fitted mean by the Alpha with AR(1) model μ(t), and the 95% credible interval.In Figure 5c, we show the accumulated number of deaths and the 95% credible interval.We can observe from Figure 5 that the Alpha with AR(1) model provides a good fit to these data.Figure 5 also shows the forecasts for the next 100 days (from July 31 as origin) for the number of new daily deaths and the cumulative number of deaths.
For Italy, the day with the maximum number of deaths was t peak = 58 and y(58) = 16.2 (or 972 people, multiplying by the scale factor of 60.02), and this point was indicated as an outlier and replaced by y(58) = 688.The new peak day was t peak = 59 with y(59) = 14.8 (or 888 people).The estimated of the number of deaths from the Alpha with AR(1) model was μ(59) = 12.4 (or 744 people), with a 95% credible interval of (10.6, 14.8) (or (636, 888) people).

Results for Brazil
In Brazil, the pandemic only started after the other countries that we analyzed, so we consider the series of mortality beginning on February 26, when the first case was notified until July 31, 2020.According to the CPO criterion, shown in Table 3, the Alpha with AR(6) model was selected as providing the best fit to the data and the K-L divergence did not not identify any outliers.A peak cannot be identified for Brazil because, as of June 10, the curve in the number of new deaths per day oscillates at a level with an average of 1000 deaths per day, showing a flat line instead of a negative slope.This pattern of the curve is possibly due to political conflict in decision making regarding restrictive measures to contain the spread of the virus (see additional details in the supplementary material).The maximum number of deaths observed in the Brazilian data set was y(161) = 7.3 (or 1586 people, multiplying by the scale factor of 217.26); for that day the Alpha with AR(6) model provided a prediction of μ(161) = 5.7 (or 1238 people) with a 95% credible interval of (5.2, 8.1) (or (1130, 1760) people).
Figure 6a presents the time series of new daily deaths, the fitted mean μ(t) and the 95% credible interval.Figure 6b shows the accumulated number of deaths and the 95% credible interval.We can observe in Figure 6a that the fitted Alpha unimodal model shows a smooth decay.Figure 6 also shows the forecasts for the next 100 days (from July 31) for the number of deaths and the cumulative number of deaths in Brazil.

Results for India
For India, we consider the time series of new daily deaths notified between January 3, 2020, and July 3, 2020.However, although the first notification occurred on January 31, it was on March 5 that India's pandemic started, with 22 cases reported.The first death was reported on March 13.
According to the CPO criterion, shown in Table 3, three models can be chosen.However, the discrepant point analysis indicated points y(175) = 80.9 (or 1128 people, multiplying by the scale factor of 13.95) as outlier.After replacing this outlier value by the value calculated using its moving average centered, and fit all the models again, the Log-Normal without autoregressive model (p = 0) presented the highest value of the criterion (−LogCPO(0) = 403) and it was selected as providing the best fit to the data.The fact that the peak has not yet occurred implies that a possible asymmetry is not yet evident.Hence while the data support the Log- Normal model now, it may not be suitable for the entire series.However, the Log-Normal model currently meets the objectives of the analysis, which is to provide a forecast of the short-term trend for the pandemic data.
Figure 7a presents the time series of daily new deaths, the fitted mean with Log-Normal model μ(t), and the 95% credible interval.Figure 7b shows the accumulated number of deaths and the 95% credible interval.We can observe at the fitted mean model shown in Figure 7 the curve with an increasing curvature indicating that observation are before the peak.These plots also show the cumulative number of deaths.

Forecast results
Any pandemic trend assessment, reduction, leveling off, or growth, must be persistent for at least 14 days (since 14 days is the virus incubation time).Therefore, if our models are used to forecast this trend for 20 days, they can shed light on whether the country is leaving the peak or whether it has not even reached the peak yet.In this subsection, we present the use of fitted models to obtain 20-day ahead forecasts.
The accuracy of the predictions made with the fitted models can be measured considering the 20 days observed from August 1 to August 20, held out from model fitting in order to do forecast evaluation.The forecast errors for these days were computed as the difference between the observed counts and the corresponding forecasts.The forecast evaluation criteria used were the square root of the Mean Squared Error (RMSE) and the Mean Absolute Error (MAE).The Mean Absolute Percent Error (MAPE) and the mean arctangent absolute percentage error (MAAPE) (Kim and Kim, 2016) were calculated for countries where there is no zero observation between August 1 and August 20 (USA, Brazil, India) due to MAPE's deficiency in evaluating errors when the observation is small.For these 20 days of forecast, we also present the minimum,  average and maximum values of the number of observed deaths predicted by the model selected for each country.These values and evaluation criteria are shown in Table 4. Figure 8 presents the observed daily new deaths for 20 days along with the forecasts and 95% credible intervals for USA, Brazil and India.We can see in these graphs the trends in the number of daily deaths in each country.
Two other points that we want to draw attention to are: On May 22, the Spanish Ministry of Health reported an additional 56 deaths in the last 24 hours.However, the Catalan authorities reported an additional 632 deaths that had previously occurred and were not assigned to dates, meaning that the cumulative total increased by 688 deaths.In this work, we considered on May 22, just 56 deaths.On May 25, the government decreased the number of total cases by −372 and the number of deaths to −1918.The discrepancy is the result of the validation of the same data by autonomous communities and the transition to a new surveillance strategy.We calculated our forecasting disregarding these negative numbers.
Another interesting forecasting result that we can obtain from the fitted models is the "end date" of the pandemic.Estimating the "end date" is not trivial and warrants a few different considerations.Theoretically, we can define the "end date" as the day with a high percentage of the total deaths predicted in the pandemic life cycle curve.Our forecasts consider two alternatives for estimating "end dates": i.The date when 99% of expected of the total deaths have already occurred; ii.The date when 97% of expected of the total deaths have already occurred.
Table 5 reports the two alternative estimations of COVID-19 "end dates" for countries studied in this paper.
It is worth mentioning that China announced a relaxation of the quarantine and a partial reopening on March 30.For Brazil, these "end dates" are inaccurate because the country may be at a peak on June 10, or may not yet have reached its peak.However, Brazil released a reopening plan for June 1, at the height of the pandemic.

Final Comments
The COVID-19 pandemic is a huge public health problem.Three elements have been shown to be fundamental to differentiate the response to the crisis caused by the coronavirus: i) public attitudes of the authorities to the uncertainties brought about by the new virus; ii) the ability of the government to implement preventive measures; and iii) the structure and competence of public health systems to serve patients.In the paper in the supplementary material, we briefly report information on these attitudes for each of the countries we consider.
In this paper, we have described some nonlinear functions for modeling the time series of the number of daily deaths caused by the COVID-19 pandemic.Applications of the proposed models are shown using time series of daily deaths from six countries.
We note that it is essential that the model considered for the mean of the time series reproduces the asymmetric pattern of these series.Another critical aspect to be included in the model is the conditional heteroscedasticity of the time series.A good model for the variance is fundamental to calculate probabilities and more accurate credible intervals for the forecasts, and the ARCH(1) model that we used seems to do well.
A Bayesian approach has several advantages for estimation and forecasting in such situations, and can be an excellent tool.We use Markov Chain Monte Carlo procedures for carrying out Bayesian inference.The need for estimating model parameters with high precision led to some difficulties in choosing a specific nucleus to generate candidates for each parameter.A random walk with a Gaussian kernel whose variance was controlled for each model based on the rejection rate of the generated samples, which is always between 35% and 60%, seemed adequate.
To speak of a single peak for very large countries such as the USA, India, and Brazil is similar to speaking of a single peak for Europe.The states or regions in these countries have different peaks.But a global peak can be assessed and used to predict trends.An essential and useful take-away from our analysis is that the fitted models allow the prediction of the expected number of deaths on the day of the peak of the pandemic.However, we noticed that the variability of the series grows very close to this peak, and, usually, the model predicts a credible interval which includes the observed number of deaths.
These models can be used to forecast just 20 days ahead in order to inform what the trend of the pandemic pattern is likely to be.Further, as expected, forecasts will also change with the addition of new data.Also, the selected models will need to be fitted again in the future (perhaps once every 10 days).Nevertheless, our models come quite close to the actual counts in most cases and give reasonable estimates that can be of assistance to health authorities in decision making regarding understanding the progress of the pandemic.

Figure 1 :
Figure 1: (a) Number of new daily infected cases and (b) the logarithm of the accumulated daily number of infected cases.(c) Number of new daily deaths and (d) the logarithm of the accumulated daily number of deaths.

Figure 2 :
Figure 2: China model: (a) mean of the Logistic with AR(1) model; (b) mean of the Alpha with AR(3) model; (c) accumulated number of deaths by COVID-19 in China from December 31, 2019 to July 31, 2020, and forecasts for the next 100 days.

Figure 3 :
Figure 3: USA model: (a) mean of the Rational function with AR(1) model; (b) mean Rational function with AR(1) model without outliers; (c) accumulated number of deaths by COVID-19 in USA from January 21, 2020 to July 31, 2020 and forecasts for the next 100 days.

Figure 4 :
Figure 4: Spain model: (a) mean of Rational function with AR(3) model; (b) mean of the Alpha with AR(1) model; (c) accumulated number of deaths by COVID-19 in Spain from February 1, 2020 to July 31, 2020 and forecasts for the next 100 days.

Figure 5 :
Figure 5: Italy model: (a) mean of the Alpha with AR(0) model; (b) mean of the Alpha with AR(1) model; (c) accumulated number of deaths by COVID-19 in Italy from January 31, 2020 until July 31, 2020 and forecasts for the next 100 days.

Figure 6 :
Figure 6: Brazil model: (a) mean of the Alpha with AR(6) model; (b) accumulated number of deaths by COVID-19 in Brazil from February 26, 2020 until July 31, 2020 and forecasts for the next 100 days.

Figure 7 :
Figure 7: India model: (a) mean of the Log-Normal model; (b) accumulated number of deaths by COVID-19 in India from January 31, 2020 until July 31, 2020.

Figure 8 :
Figure 8: Number of new daily deaths observed and predicted from August 1 to August 20: (a) USA; (b) Brazil; (c) India.

Table 2 :
Alternative functions for η(t) to represent expected values of Y (t).

Table 3 :
Natural logarithm of the conditional predictive ordinate (CPO(p)) for the functions μ(t) for the daily number of new COVID-19 deaths in six countries

Table 4 :
Forecasting evaluation criteria for six countries

Table 5 :
Forecast the "end dates" of the pandemic