A Bayesian Adjustment of the HP Law via a Switching Nonlinear Regression Model

For many years actuaries and demographers have been doing curve fitting of age-specific mortality data. We use the eight-parameter HeligmanPollard (HP) empirical law to fit the mortality curve. It consists of three nonlinear curves, child mortality, mid-life mortality and adult mortality. It is now well-known that the eight unknown parameters in the HP law are difficult to estimate because numerical algorithms generally do not converge when model fitting is done. We consider a novel idea to fit the three curves (nonlinear splines) separately, and then connect them smoothly at the two knots. To connect the curves smoothly, we express uncertainty about the knots because these curves do not have turning points. We have important prior information about the location of the knots, and this helps in the estimation convergence problem. Thus, the Bayesian paradigm is particularly attractive. We show the theory, method and application of our approach. We discuss estimation of the curve for English and Welsh mortality data. We also make comparisons with the recent Bayesian method.


Introduction
Smoothing mortality curves (age-specific mortality rates) is useful to actuaries and demographers for many demographic problems.Based on the data, the observed rates are jagged or irregular along the life span or sometimes sparse in small populations.In addition, the rates for older individuals are not reliable due to age misreporting and death sparseness.So using a parametric equation to smooth the data is significant for estimating the mortality pattern and predicting the rates.We use the eight-parameter Heligman-Pollard (HP) empirical law to model mortality data across all ages.The law has been tested as a good model for fitting the pattern to many mortality data across all ages.But the convergence is still a very challenging problem in practice.A smooth mortality curve is desirable, and larger number of deaths does not help with the estimation problem.
For more than a century actuaries and demographers have developed several mathematical models (Gompertz, 1825;Makenham, 1860;Thiele, 1872;Siler, 1979;Heligman and Pollard, 1980;Mode and Jacobson, 1984).Heligman and Pollard (1980) proposed the most successful parametric model that is applicable across the entire age range.Heligman and Pollard (1980), Hartmann (1983), Fofar and Smith (1987), Kostaki (1992;2000), Mode and Busby (1982), Rogers (1986) and Congdon (1993) used classical non-linear least squares procedure to fit the empirical law.However Rogers (1986) and Congdon (1993) noticed that over parameterization is a concern with the HP law and this leads to numerical instabilities.They suggested that the problem of over parameterization can be resolved by fixing the values of one or two parameters to a feasible constant.Dellaportas, Smith and Stavropoulos (2001) adopted a Bayesian approach to overcome the problems associated with the method of non-linear least squares.They reported that with the use of informative priors the issue of over-parameterization is usually resolved; see also Sharrow and Clark (2010).There are many benefits of the Bayesian method for fitting the HP law as mentioned by Dellaportas, Smith and Stavropoulos (2001); these benefits include ease in dealing with over parameterization, non-normality, and computation.Moreover, the estimation of many quantities (useful for actuaries and demographers), such as survival probability, first to die and median lifetime, is automatic within the Bayesian approach.But there are difficulties in running their Metropolis-Hastings sampler as they needed considerable thinning in the selected chain to remove the autocorrelation.Wei, Nandram and Bhatta (2012) develop a Bayesian analysis for small areas (race-sex domains by states) to address the issues of over parameterization and sparse data in mortality curve fitting.
We make a novel adjustment to the Bayesian method.Our procedure fits the three curves (nonlinear splines) separately, and then connect them smoothly at the two knots using a switching non-linear regression model.There are two approaches to spline regression modeling (see Boor, 2001); both approaches use pieces of polynomials.Either the polynomials lie in their own domains (switching) or each polynomial starting in a domain remain in the following domains (join point regression models).Both models are quite popular in many fields such as health sciences and econometrics.Some interesting papers in join point regression models are Tiwari, Cronin, Davis, Feuer, Yu and Chib (2005), Kim, Fay, Feuer and Midthune (2000), and Ghosh, Huang, Yu and Tiwari (2009).For switching regression models, see Quandt (1958;1960;1972) and Chen (2007).A Bayesian approach was given by Ferreira (1975).Our review of the literature on switching regression model has shown nothing in nonlinear regression.There are some polynomial switching regression models but zero derivatives are used at the knots.Our splines do not have turning points at the knots.
The basic eight-parameter empirical law of Heligman and Pollard (1980), known as HP first law, is given by where π x is the probability that an individual, who has reached age x, will die before reaching age x + 1.The law expresses an odds of dying as a function of age x using three terms, representing distinct components of mortality.The first term in ( 1) is a rapidly declining exponential function that describes the child mortality (high in the first year of life and then decline until 10-13).The middle term, similar to a log normal function, represents the mid-life mortality.It describes the accident mortality for both males and females and maternal mortality as well.This additional mortality forms an "accident hump" on the natural curve.
The third term is a geometrically increasing Gompertz exponential function that describes adult mortality.It generally represents the ageing or deterioration of the body (i.e., senescent mortality).It is typically assumed that the first term has no effect on the other two terms.These three terms correspond to the three curves under study.The first curve almost dies out as it enters the domains of second and third curves, thereby providing negligible contribution on these curves.This holds also for the second and third curves.The eight parameters 0 < a, b, c, d, g < 1, e, h > 0 and f 0 ≤ f ≤ f 1 (f 0 and f 1 are to be specified) have their own demographic interpretations.Here, a is approximately equal to π 1 (age 1 mortality); b is the location index of π 0 (age 0 mortality) in the interval (π 1 , 0.5); c measures the rate of mortality decline in childhood; d reflects the intensity of the hump, f indicates its location and e measures its spread; and lastly, g and h denote respectively the initial level and rate of increase of old-age mortality.Heligman and Pollard (1980) also suggested the following eight-parameter empirical law known as HP second law which we develop further here.Note that on the left-hand side of (1) we have π x /(1 − π x ) but on the left-hand side of (2) we have π x .For the third term of (1) we have gh x and the corresponding term in (2) we have gh x /(1 + gh x ).Thus, it is easy to show that each of the three terms in (2) is in (0, 1).However, the sum of these three terms may not be in (0, 1) but this is very unlikely.
It is worth noting that larger death counts, and therefore larger population sizes, do not help with the difficulty in estimation (i.e., nonidentifiability).We need information about the eight parameters in the HP law.Our method replaces two parameters in the HP law with two new parameters for which we have some information.These parameters are the knots where the three curves are connected; a priori we have information about the location of these knots.This motivates our Bayesian approach using the switching nonlinear regression model.
This paper develops a Bayesian methodology for fitting of the HP mortality curve.In Section 2, we present the general theory on switching regression.In Section 3, we discuss the Bayesian methodology and computation needed for estimation of the parameters.In Section 4, we apply the methodology to English and Welsh data, and we also show how to obtain median lifetimes of individuals at different ages.We also compare our model with an existing Bayesian model.Finally, Section 5 has concluding remarks.

Theory for Switching Regression
We want to model the three parts of HP law separately and connect them smoothly.Thus, for the switching non-linear regression model, we modify (2) as where κ 1 ≤ f ≤ κ 2 and I(•) denotes the indicator function, for example, I(x ≤ κ 1 ) = 1 if x = 0, • • • , κ 1 .In (3) π x is between zero and one because each component of the function is between zero and one with distinct domain (a feature of switching regression).In case of (2), as we discussed earlier, each part is extended to the entire age range, and by adding these together, π x may assume values greater than unity.The switching regression model in (3) corrects this problem.
There are continuity constraints that must be incorporated to connect the three curves smoothly.At the two knots, κ 1 and κ 2 , we have and (Note that κ 1 and κ 2 are positive.)It is worth noting that we have enforced an important condition, h > 1, otherwise the adult mortality curve in (3) will not be an increasing function.It appears that this restriction is not necessary in the original HP law.Simplifying (4) and ( 5), we get In (3) we also incorporate the constraint, It is sensible to ensure that the mortality at κ 1 is at most the mortality at κ 2 .
The mortality inequality ( 7) is an important biological condition we use in our model.
One would equate the derivative at the knots for smoothing the curve.However, in our case, there are no turning points at the knots, and, therefore, differentiation is not appropriate.We propose an interesting idea to smoothly connect the three curves at the knots by expressing uncertainty about the knots at κ 1 and κ 2 .A priori we have some information about the locations of the knots.A priori ranges of values of these parameters, the continuity constraints ( 4) and ( 5) and the constraint in ( 7) provide a new parameter space.For example, the constraint κ 1 ≤ f ≤ κ 2 is updated (see Theorem 1 below).Let S denote the set of constraints on a, b, c, d, f, g, κ 1 and κ 2 .In Theorem 1, important in our method, we construct S.
Theorem 1 The set of constraints, S, is given by Proof of Theorem 1: From (6) we have Simplifying and noting that e > 0 and h > 1, we have Substituting e from ( 6) into ( 8) and simplifying we get where A = ln(g/d(1 + g))/ ln(a (b+κ 1 ) c /d) > 0. The inequality (9) gives two cases, The second inequality cannot be true because A > 0. Thus, using the first inequality we get Again, the second inequality is false because Because f ≤ κ 2 , using ( 10) and ( 11), we get max(

It follows that max (κ
Because A = ln(g/d(1 + g))/ ln(a (b+κ 1 ) c /d) > 0 and ln(a (b+κ 1 ) c /d) < 0, ln(g/d(1 + g)) < 0. Thus, we have Therefore, by ( 12) and ( 13) Our model will lead to a likelihood function, a function of a, b, c, d, f, g, κ 1 and κ 2 which are in S; e and h have simple formulas involving a, b, c, d, f, g, κ 1 , κ 2 .Of course, we still have eight parameters to estimate.However, we know that κ 1 and κ 2 are specified in fixed intervals of finite lengths, thereby adding an important piece of information into the model.We do not need to estimate e and h directly because they are functions of the other parameters.
Finally, we note the behavior of A in determining boundaries for f .For this purpose we use ( 12).
, the same as (ii).

Bayesian Model and Computation
In Section 3.1 we describe the model and in Section 3.2 we discuss the computation.

Bayesian Model
We study our proposed switching non-linear regression (SNR) model here and we compare it later with one existing model, the model of Dellaportas, Smith and Stavropoulos (DSS, 2001).
Let n x be the population at risk having age in the interval [x, x + 1) and y x be the number of individuals who have reached age x and died before reaching age x + 1.The death counts follow binomial distribution, where π x is given in (3).Let y ˜be the vector with elements y x , x = 0, • • • , t, n ˜be the vector with n x , x = 0, • • • , t and θ ˜= (a, b, c, d, f, g).Throughout we assume that n x , x = 0, • • • , t, are fixed known constants, and therefore conditioning on n ˜is not required a posteriori.
Then, the probability mass function of y ˜is where A priori we assume the following distributions on the parameters and the knots where (a 1 , b 1 ) and (a 2 , b 2 ) are to be specified.These specifications will vary with the applications.Here a, b, c, d, g and f are continuous and κ 1 and κ 2 are discrete.Note again that prior specifications are not needed for e and h because these are functions of other parameters as shown in (6).Thus, the joint prior density is ), which clearly is proper.A priori the ranges of the parameters will change when the prior is incorporated in the likelihood; see Theorem 1.
Using Bayes' Theorem and Theorem 1, we get the joint posterior distribution, It is worth noting that because π(θ ˜, κ 1 , κ 2 ) is proper, π(θ ˜, κ 1 , κ 2 |y ˜) is also proper.The joint posterior distribution here is complex and so it is difficult to sample from it directly.We apply the Gibbs sampler (Gelfand and Smith, 1990) which needs the conditional distribution of each parameter given the others and y ˜.Note that the terms in the density are not independent, rather they are linked to each other because e and h are functions of θ ˜, κ 1 and κ 2 ; see (6).

Computation
As discussed above, we need to keep all terms each time we evaluate the function to sample the parameter from its conditional posterior density.
Letting θ ˜(i) denote the vector of θ ˜excluding the i th one (e.g., θ ˜(1) = (b, c, d, f, g) and θ 1 = a), the conditional posterior density of the i th parameter is where the θ i must be restricted according to S. The marginal conditional posterior distribution of the parameters are constrained to take values in the following intervals (i) 0 < a < d 1/(b+κ 1 ) c ; (ii) max{0, (ln d/ ln a) We discuss how to obtain these intervals in Appendix A. The conditional posterior densities cannot be written in standard forms, so we use the griddy Gibbs sampler (Ritter and Tanner, 1992) for θ ˜.The idea of the griddy Gibbs sampler is to approximate a conditional posterior density by a discrete distribution and draw samples from the discrete distribution instead.
For sampling κ 1 and κ 2 , the joint conditional posterior distribution is where We have implemented the griddy Gibbs sampler in ( 16).However, the chain got stuck at (κ 1 , κ 2 ).This is a common problem in Bayesian computation.One may think that the reversible jump sampler (Green, 1995) might be appropriate.This is not true because we have a single model with parameter θ ˜.We do not have different parameters for different (κ 1 , κ 2 ).Thus, we obtain the joint posterior probability mass function of (κ 1 , κ 2 ) using an independent procedure.Accordingly we draw sample from joint posterior distribution as where The analytic integration in (18) is not feasible which makes it difficult to find the joint density of κ 1 and κ 2 .In this situation, we approximate the definite integral by using Monte Carlo integration.This requires randomly chosen points at which the integrand is evaluated.
To sample these random points, we need an importance function.For this, we consider a simpler and less constrained set, It is important to note that S κ 1 ,κ 2 is a subset of S * κ 1 ,κ 2 .First, note that if any of the inequality constraints in S κ 1 ,κ 2 is dropped, a larger set is obtained.Second, consider (18).If max((κ Then, the importance function is Note that this is a uniform distribution on S * κ 1 ,κ 2 .Now , where M is large (e.g., M ≈ 1000).Then, the integral π(κ 1 , κ 2 |y ˜) in ( 18) is approximately . Thus, we have obtained an estimate of joint posterior probability mass function of κ 1 and κ 2 .
So, our procedure is implemented by first drawing samples from π(κ 1 , κ 2 |y ˜).Then, with each (κ 1 , κ 2 ) we run a griddy Gibbs sampler to get θ ˜.This latter task is similar to what is described already.We repeat this procedure a large number of times.
We compare our method with an existing model, the beta-binomial model with extra variation (Dellaportas, Smith and Stavropoulos, 2001) which we denote by DSS.The death counts follow the binomial distribution, Dellaportas, Smith and Stavropoulos (2001) assume that the first HP law determines age-dependent quantities m x , with and the probabilities of death follow a beta distribution, Here, and ζ is an unknown positive quantity determining the variance of beta distribution.If ζ → ∞, π x will be given by ( 1).We also assume a priori that a shrinkage prior distribution for ζ.Finally, we have taken a, b, c, d, e, f, g, h to be independent with where f 0 = 10 and f 1 = 115 as in DSS.
Denoting θ ˜ * = (a, b, c, d, e, f, g, h), the resulting posterior distribution is Again we perform the computations using the griddy Gibbs sampler.A discussion of the computation is given in Appendix B. The computation for our method takes more time than DSS method.

A Numerical Example
To illustrate the methodology we use English and Welsh mortality data for females (1988)(1989)(1990)(1991)(1992).The data are presented in Table 1.To give a better illustration of the benefit of our method, we use a second example on US mortality data, 1999-2001, obtained from National Center for Health Statistics (NCHS).For confidentiality reasons we are prohibited from presenting the data.In Section 4.1 we discuss the results of fitting both SNR and DSS model and in Section 4.2 we provide median lifetimes for individuals at different ages.

Mortality Curve
For SNR model, we obtained the joint posterior probability mass function of (κ 1 , κ 2 ) using the procedure as described in Section 3. We have specified κ 1 and κ 2 to be 8 ≤ κ 1 ≤ 13 and 20 ≤ κ 2 ≤ 26.The empirical posterior distribution of (κ 1 , κ 2 ) are presented in Table 2.The data do have some influence over κ 1 and κ 2 because the joint posterior probability mass function is not uniform.
We draw a sample values of (κ 1 , κ 2 ) from the joint posterior distribution.Then, we run the griddy Gibbs sampler to draw a sample of a, b, c, d, f and g from their corresponding conditional posterior distributions 101 times and pick the last one.We repeat this process 1000 times and finally, we have the 1000 sample values for each of the parameters.We found convergence by using the trace plots.Also non-significant auto-correlation coefficient among iterates was found.Similarly, for DSS we execute the griddy Gibbs sampler by drawing θ ˜ * from (B.1) and ζ from (B.2).We have transformed e, h, f and ζ to (0, 1) to perform the griddy Gibbs sampler.This shows good performance (i.e., after a small "burn in" there was negligible autocorrelation even at lag one).
In Table 3 we summarize the posterior distributions of a, b, c, d, e, f, g and h using posterior means (PM), posterior standard deviations (PSD) and 95% credible intervals (CI) for both models.There are some differences between the two methods (e.g., the PM for e in DSS is 10.77 versus 3.998 in SNR).The 95% credible interval for e is very narrow under DSS.We get the HP curve using M = 1000 iterates of the parameters for both DSS and SNR models.Figure 1 shows the result of fitting the HP law.The plots show curves of observed, estimated and 95% credible bands for the logarithm of the probabilities of death across ages.The observed mortalities at different ages are within the 95% credible bands.It is clear that under both models the estimated curve provides a good fit to the observed data but the DSS model underestimates the variability, and they specify f 0 ≤ f ≤ f 1 , where f 0 and f 1 are really arbitrary.We use f 0 = 15 and f 1 = 110 as in DSS.We expect the confidence bands for the DSS model to be narrower than SNR model.Using box plots the empirical posterior distributions of the parameters for the two models are displayed in Figure 2. Again, larger variability for these parameters occurs in the SNR model.
In Figure 3 we have compared the fits of the DSS and SNR methods to the US mortality data.We have observed that when the DSS method is used, many of the observed mortalities are outside the 95% credible bands or very close to the bands.This is not true for the SNR method though.For age 0 to 30 the DSS bands are narrower and for age 30 to 84 wider than the SNR bands.Even when the SNR bands are narrower, they still contain the observed mortality rates.Therefore, it is reasonable to conclude that the SNR method can perform better than the DSS method.

Median Lifetime
There are many useful quantities (e.g., survival probability, first to die, joint lifetime and median time to death) that are of interest to actuaries and demographers.Here we consider inference about the median lifetime, m, of an individual of age, g.It can be obtained automatically under the Bayesian method as described by DSS.The median lifetime is obtained using the survival probability from g to m.That is, m is determined by solving the equation, where π x is the probabilities of death which can be obtained after fitting the HP law using the sampled parameter values of each iteration.
Because the observed deaths for the English and Welsh data are from age 0 to 74, at each iterate using the spline regression model we need to predict the probabilities beyond 74 up to 100 to allow computation of m.Then, using each sampled iterate of π x , x = 0, • • • , 100, we solve the equation for m.This provides the posterior distribution of the median lifetime.The posterior inference of the median life times for adults who are at ages 55, 60, 65, 70, 75, 80 and 85 are presented in Table 4.The posterior means are similar but the intervals are narrower under the DSS method.In addition, we have obtained the posterior estimates of the median life for adults who are at ages in the range from 10 to 75.The plot of the median life is shown in Figure 4. Again, we see that the shape of the curve is same but the confidence band is narrower in DSS.

Concluding Remarks
The Heligman-Pollard empirical law is very useful to fit mortality data for all ages.However, it is well known that the eight unknown parameters of the model are very difficult to estimate because there is over parameterization.While Dellaportas, Smith and Stavropoulos (2001) has pioneered Bayesian analysis for fitting the HP empirical law, we have found that it is beneficial to incorporate additional information about the eight parameters.Here, our Bayesian nonlinear spline regression model reduces the difficulty in estimating the parameters, thereby safe guarding from over parameterization.There are two difficulties which we have addressed.First, we have constructed the constraint parameter space via Theorem 1.Second, we have constructed a useful algorithm for fitting the joint posterior density.We fit the three curves separately and then connect them smoothly using switching regression.Like Dellaportas, Smith and Stavropoulos (2001), we have incorporated the HP law directly in the model.The results show that the estimated curves provide reasonable fits to the observed mortalities.We have also compared the results obtained from our model with Dellaportas, Smith and Stavropoulos (2001).Our switching non-linear regression model fits are similar to Dellaportas, Smith and Stavropoulos (2001) for the English and Welsh data but the confidence bands are wider.However, fitting both the model to US mortality data, 1999-2001, we found that our model fits better than one of the models of Dellaportas, Smith and Stavropoulos (2001).In addition, for both models we have obtained median lifetimes for individuals at different ages for the English

Figure 1 :
Figure 1: Observed and fitted mortality curves with 95% credible bands for the DSS and SNR models

Figure 2 :Figure 3 :
Figure 2: Box plots of the posterior simulations of parameters a, b, c, d, e, f, g and h for the DSS and SNR models

Table 3 :
Summaries of the posterior densities of the HP parameters for the DSS and SNR models

Table 4 :
Summaries of the posterior density of median lifetime by age