Nonparametric Modelling of Quarterly Unemployment Rates

A seasonal additive nonlinear vector autoregression (SANVAR) model is proposed for multivariate seasonal time series to explore the possible interaction among the various univariate series. Significant lagged variables are selected, and additive autoregression functions are estimated from the selected variables using the spline smoothing method. Conservative confidence bands are constructed for the additive autoregression function. The model is fitted to two sets of bivariate quarterly unemployment rate data, with comparisons made to the linear periodic vector autoregression model. It is found that when the data do not significantly deviate from linearity, the periodic model is preferred. In cases of strong nonlinearity, however, the additive model is more parsimonious and has much higher out-of-sample prediction power. In addition, interactions among the various univariate series are automatically detected.


Introduction
It is well known that nonlinearity exists widely in macroeconomic time series; see, for example, Huang and Yang (2004) for empirical evidence of nonlinearity in the US unemployment rates. Generally speaking, when the deviation from a linear time series model is significant, nonparametric autoregression is more appropriate for the identification and forecasting of time series, unless there is convincing evidence of a simpler parametric nonlinear structure that generates the data series. Hence, nonparametric smoothing of nonlinear autoregressive time series can be extremely useful for time series analysis, not only for exploratory study, but also for robust model selection and prediction.
Non- and semi-parametric smoothing estimation of unknown functions has found many applications in the last two decades. In the time series literature, Robinson (1983) first applied the kernel (Nadaraya-Watson) method to estimate an autoregression function of unknown form. Other significant contributions include Györfi, Härdle, Sarda and Vieu (1989), Auestad and Tjøstheim (1990), Tjøstheim and Auestad (1994), Yao and Tong (1994), Fan and Yao (1998), Härdle, Tsybakov and Yang (1998), and Yang and Tschernig (2002), to name a few from the great number of published articles in this area. One common feature of these cited works is that they are all based on the local least squares method of kernel/local polynomial regression, and thus are all computationally intensive. Also unaddressed is the issue of the "curse of dimensionality", which refers to the lack of accuracy in estimating multivariate functions of general nonparametric form. In the context of time series modelling, high dimensionality can result from the time series being multivariate and/or from potentially too many lagged variables being significant for forecasting. For the two unemployment rates studied in this paper, 24 potentially significant lagged variables are examined.
The issue of the "curse of dimensionality" can be dealt with via additive modelling, as first proposed in Hastie and Tibshirani (1990). Other works on the subject of additive models include Chen and Tsay (1993), Tjøstheim and Auestad (1994), Linton and Nielsen (1995), Sperlich, Tjøstheim and Yang (2002), and Huang and Yang (2004). In particular, Huang and Yang (2004) took a different approach to the estimation of the nonparametric additive regression function, using polynomial spline smoothing instead of kernel smoothing. The advantage of polynomial splines is that only one least squares problem needs to be solved to obtain estimates of all function values, rather than one least squares problem for each function value. Typically, this means that the spline estimation of an additive model can be thousands of times faster than standard kernel based methods, such as those in Linton and Nielsen (1995) or Sperlich, Tjøstheim and Yang (2002). Spline smoothing has also been used in other dimension reduction models, for example, the varying coefficient model as in Huang, Wu and Zhou (2002).
In this paper we extend the additive autoregression model of Huang and Yang (2004) to multivariate seasonal time series. One example of such series is the bivariate series consisting of quarterly unemployment rates of men and women in the US. What makes such multivariate series different from univariate series is that there may exist significant interaction among the various univariate series. This turns out to be the case for the men's and women's unemployment rates, while it turns out differently for another bivariate series. In both cases, however, the same data-driven inference procedures are applied without prior assumptions of interaction or lack thereof. Hence the issue of interaction is decided by "letting the data speak for themselves".
In section 2 we formulate a seasonal additive nonlinear vector autoregression (SANVAR) model, and discuss its use in identifying the functional structure in seasonal vector time series via spline smoothing. In section 3, we briefly describe an asymptotically conservative confidence band for the nonparametric autoregression function, also based on the polynomial spline method. Section 4 discusses the findings on two bivariate quarterly unemployment rate data sets, and draws some general conclusions about the benefits and precautions when fitting the SANVAR model.

The SANVAR Model
Our aim in this section is to develop a general modelling framework, called the seasonal additive nonlinear vector autoregression (SANVAR) model, for a multivariate seasonal process $\{Y_{t,\gamma}\}_{t=1,\gamma=1}^{n,d}$. There are $d$ different series, the $\gamma$-th of which is $\{Y_{t,\gamma}\}_{t=1}^{n}$. Each of the series is seasonal with $S$ seasons. The approach we take to incorporate seasonality is distinct from those in Lütkepohl (1993) and Wolters (1992). We model the $d$ series and $S$ seasons in the form (2.1), where $s \in \{1, ..., S\}$ indicates the season, $\Lambda_\gamma$ is a subset of $\{1, ..., M\} \times \{1, ..., d\}$ of significant lagged variables, $\{\xi_{\tau S+s,\gamma}\}_{1\le\gamma\le d,\ \tau\ge M/S}$ are martingale differences with respect to the σ-field generated by the variables $\{Y_{\tau S+s-j,\beta}\}_{j>0,\ 1\le\beta\le d}$, and the multivariate autoregression function $m_{s,\gamma}$ is an additive function of the variables $Y_{\tau S+s-j,\beta}$, $(j,\beta)\in\Lambda_\gamma$. The largest lag index allowed, $M$, is typically taken to be a multiple of $S$, and the size of the set $\Lambda_\gamma$ is limited to be no more than a fixed integer $\lambda_{\max}$. Each component function $m_{s,\gamma,j,\beta}$ satisfies the identifiability condition $E\,m_{s,\gamma,j,\beta}(Y_{t-j,\beta}) \equiv 0$, as is common in additive modelling. If all the component functions $m_{s,\gamma,j,\beta}$ are restricted to be linear, the model is a periodic vector autoregressive (PVAR) model as in Lütkepohl (1993).
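The model form (2.1) can be written out from the definitions in this section. The display below is a sketch consistent with those definitions rather than a quotation of the original equation; the purely additive error term and the constant $m_{0,s,\gamma}$ (a symbol that appears in section 3) are assumptions.

```latex
\begin{align*}
Y_{\tau S+s,\gamma}
  &= m_{s,\gamma}\bigl(Y_{\tau S+s-j,\beta},\ (j,\beta)\in\Lambda_\gamma\bigr)
     + \xi_{\tau S+s,\gamma},\\
m_{s,\gamma}\bigl(y_{j,\beta},\ (j,\beta)\in\Lambda_\gamma\bigr)
  &= m_{0,s,\gamma}
     + \sum_{(j,\beta)\in\Lambda_\gamma} m_{s,\gamma,j,\beta}(y_{j,\beta}),
  \qquad 1\le s\le S,\ 1\le\gamma\le d.
\end{align*}
```

Under this reading, restricting every $m_{s,\gamma,j,\beta}$ to be a linear function of its argument gives a season-specific linear autoregression, which matches the PVAR special case described above.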
When one fits the SANVAR model (2.1) to the $\gamma$-th series $\{Y_{t,\gamma}\}_{t=1}^{n}$, the lag set $\Lambda_\gamma$ is unknown a priori and has to be selected. Thus, every index pair $(j,\beta) \in \{1, ..., M\} \times \{1, ..., d\}$ could potentially be in $\Lambda_\gamma$. For many real time series, however, most of the functions $m_{s,\gamma,j,\beta}$ turn out to be insignificant, as one will see in section 4.
For the fitting of SANVAR, we use the adaptive spline approach, which is described here in detail. For all seasons $s$ and every index pair $(j,\beta) \in \{1, ..., M\} \times \{1, ..., d\}$, denote by $[a_{j,\beta}, b_{j,\beta}]$ the range of the variable $\{Y_{\tau S+s-j,\beta}\}_{\tau\ge M/S}$, which is divided into $N+1$ equally spaced subintervals.
Here $N = N_n = k\,(n/S)^{1/(2p+3)}$, in which $k$ is a tuning constant (default 1) and $p$ is an integer no greater than the degree of smoothness of the component functions (default 1). The $N$ interior endpoints of these subintervals are labelled $a^{(1)}_{j,\beta}, ..., a^{(N)}_{j,\beta}$, and form the knot sequence for the explanatory variable $Y_{t-j,\beta}$. Next we define the spline basis functions $B^{(l)}_{j,\beta}$ in terms of truncated powers, where $x_+ = x$ if $x > 0$ and $0$ otherwise. Linear combinations of these basis functions, called spline functions, are piecewise smooth up to order $p$. Since $N_n \to \infty$ as $n \to \infty$, all functions of smoothness order $p$ can be approximated on the interval $[a_{j,\beta}, b_{j,\beta}]$ by such linear combinations, and so for every pair of indices $(s,\gamma)$ and $(j,\beta)$ the function $m_{s,\gamma,j,\beta}(y)$ is approximated by spline functions.
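The knot construction and the basis can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: rounding $N$ down to an integer, the function names, and the exact form of the basis (powers $y, ..., y^p$ followed by truncated powers $(y - a^{(l)})_+^p$, for $p \ge 1$) are assumptions chosen to agree with the description above.

```python
import numpy as np

def num_knots(n, S, p=1, k=1.0):
    """N = k * (n/S)^(1/(2p+3)); rounding down to an integer is an assumption."""
    return max(1, int(k * (n / S) ** (1.0 / (2 * p + 3))))

def interior_knots(a, b, N):
    """The N interior endpoints of N + 1 equally spaced subintervals of [a, b]."""
    return np.linspace(a, b, N + 2)[1:-1]

def truncated_power_basis(y, knots, p=1):
    """Assumed basis (p >= 1): columns y, ..., y^p, then (y - knot)_+^p per knot."""
    y = np.asarray(y, dtype=float)
    poly = [y ** q for q in range(1, p + 1)]
    trunc = [np.maximum(y - a, 0.0) ** p for a in knots]
    return np.column_stack(poly + trunc)

# Example: 220 quarters, S = 4 seasons, default k = p = 1.
N = num_knots(220, 4)
knots = interior_knots(0.0, 3.0, N)
B = truncated_power_basis([0.5, 1.5, 2.5], knots, p=1)
```

With $p = 1$ the first column is the linear term, so dropping all remaining columns leaves a purely linear fit.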
To estimate the component functions, we have to solve an ordinary least squares problem of the form (2.2), and the solution $\hat c_{s,\gamma}$, $\hat c^{(l)}_{s,\gamma,j,\beta}$ then provides the estimators $\hat m_{s,\gamma,j,\beta}(y)$ in (2.3), where $A = \sum_{M<\tau S+s\le n} 1$. Fortunately, one typically needs only a small number of these estimators for the SANVAR modelling. To identify which of these are significant, a BIC criterion is defined for each subset $\Lambda = \{(j_1,\beta_1), ..., (j_\lambda,\beta_\lambda)\} \subset \{1, ..., M\} \times \{1, ..., d\}$. Let $\hat c^{(l)}_{s,\gamma,j,\beta,-\tau}$ be the solution of the least squares problem (2.2), but with the sum taken over $(j,\beta)\in\Lambda$ and with the squared error term at time $\tau$ removed, for every integer $\tau$ that satisfies $j_\lambda < \tau S+s \le n$. Then for every $1\le\gamma\le d$, the BIC criterion for the $\gamma$-th series is defined as in (2.4), where $n_S = n/S$ and $n_{s,j_\lambda,S}$ is the number of integers $\tau$ that satisfy $j_\lambda < \tau S+s \le n$.
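The single least squares solve can be illustrated as follows. This is a hedged sketch for one season and one response series, assuming piecewise linear splines ($p = 1$) and a truncated power basis; the function names are hypothetical, and the centering step implied by the identifiability condition is omitted for brevity.

```python
import numpy as np

def basis(y, knots):
    """Piecewise linear truncated power basis (p = 1): y and (y - knot)_+."""
    y = np.asarray(y, dtype=float)
    return np.column_stack([y] + [np.maximum(y - a, 0.0) for a in knots])

def fit_additive_spline(X, y, n_knots=2):
    """Fit y ~ c + sum_j g_j(X[:, j]) with a single OLS solve.

    X: (T, q) array of the selected lagged variables for a fixed season.
    Returns the coefficients, the knots per column, and the residuals.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Equally spaced interior knots on the observed range of each variable.
    knots = [np.linspace(c.min(), c.max(), n_knots + 2)[1:-1] for c in X.T]
    design = np.column_stack(
        [np.ones(len(y))] + [basis(c, kn) for c, kn in zip(X.T, knots)]
    )
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, knots, y - design @ coef

# Example on simulated data (not the unemployment series).
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1]
coef, knots, resid = fit_additive_spline(X, y)
```

All component functions are obtained from the one coefficient vector `coef`, which is the computational advantage over kernel methods noted in the introduction.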
The set of significant variables selected by the BIC is then defined as the minimizer $\hat\Lambda_\gamma$ of the criterion (2.4) over subsets $\Lambda \subset \{1, ..., M\} \times \{1, ..., d\}$, and under reasonable assumptions (Huang and Yang 2004), it can be shown that this selection is consistent; see (2.5). Notice that in the above steps, if all basis functions $B^{(l)}_{j,\beta}$, $l \ge 2$, are removed, the result would be the PVAR model. Once this set $\hat\Lambda_\gamma$ is obtained, the ordinary least squares problem (2.2) is solved only once based on this set, and the resulting function estimates $\hat m_{s,\gamma,j,\beta}$ are used to build the estimated SANVAR model. This data-driven identification of a parsimonious model can be used for improving forecasting. In the next section, we discuss confidence bands based on the SANVAR model.
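The subset search can be sketched as an exhaustive loop over candidate lag sets. The score used here is a generic BIC, $\log(\mathrm{RSS}/T) + (\text{number of columns})\cdot\log(T)/T$, which is an assumption standing in for the criterion (2.4); in particular it does not use the leave-one-out residuals described above. The names `select_lags` and `expand` are hypothetical.

```python
import itertools
import numpy as np

def bic_score(design, y):
    """Generic BIC (assumed form, not the paper's (2.4))."""
    T = len(y)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return np.log(rss / T) + design.shape[1] * np.log(T) / T

def select_lags(candidates, y, expand, lam_max=2):
    """Pick the lag subset with the smallest score.

    candidates: dict mapping (j, beta) -> 1-D array of that lagged variable.
    expand: function mapping a 1-D array to a 2-D array of basis columns.
    lam_max: largest subset size considered.
    """
    y = np.asarray(y, dtype=float)
    best_set, best_score = (), np.inf
    for size in range(1, lam_max + 1):
        for subset in itertools.combinations(sorted(candidates), size):
            cols = [np.ones((len(y), 1))] + [expand(candidates[k]) for k in subset]
            score = bic_score(np.hstack(cols), y)
            if score < best_score:
                best_set, best_score = subset, score
    return best_set, best_score

# Example: the response depends only on the variable indexed (1, 1).
x1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
x2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
x3 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
y = 2.0 * x1 + 0.1 * x1 * x2 * x3
best, score = select_lags({(1, 1): x1, (1, 2): x2, (2, 1): x3}, y,
                          expand=lambda v: v[:, None])
```

Passing a linear `expand`, as in the example, corresponds to the PVAR-type search; passing a spline basis expansion gives the additive search.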

Confidence Bands
Suppose that by using the BIC criterion (2.4), a set $\hat\Lambda_\gamma$ of lags has been determined for series $\gamma$ of the multivariate time series, $1\le\gamma\le d$. The consistency property in (2.5) allows one to take the estimated set $\hat\Lambda_\gamma$ for the true set $\Lambda_\gamma$, for the sake of simpler notation. The estimated SANVAR model is of the form (3.1), with univariate functions $\hat m_{s,\gamma,j,\beta}$ and constants $\hat m_{0,s,\gamma}$ as defined in (2.3). In this section, a procedure is described for the construction of simultaneous confidence intervals, or confidence bands, for the functions $m_{s,\gamma}$ based on the estimated SANVAR model (3.1).
Recently, confidence bands for univariate regression functions have been developed by Xia (1998) and Claeskens and Van Keilegom (2003). The basic idea of constructing asymptotic confidence bands from polynomial spline estimation was proposed in Wang and Yang (2005), which is limited to univariate regression (that is, $d = M = 1$) with piecewise constant ($p = 0$) and piecewise linear ($p = 1$) splines. Yang (2004) extended the procedure of Wang and Yang (2005) to the additive model, using piecewise linear splines ($p = 1$) and the wild bootstrap. The method is adapted to the SANVAR model, and the steps are described here. The confidence level is taken to be $1 - \alpha$, where $\alpha$ has a default value of 0.05.
The use of the wild bootstrap sample (3.3) is justified by the same reasoning as in Sperlich, Tjøstheim and Yang (2002): in terms of conditional moments up to order two, for any $1 \le b \le 400$, the $b$-th bootstrap sample $\{(Y_{\tau S+s-j,\beta})_{(j,\beta)\in\Lambda_\gamma},\ \delta_{\tau,b}\,\hat\xi_{\tau S+s,\gamma}\}_{1\le\gamma\le d,\ \tau\ge M/S}$ always mimics the original sample $\{(Y_{\tau S+s-j,\beta})_{(j,\beta)\in\Lambda_\gamma},\ \xi_{\tau S+s,\gamma}\}_{1\le\gamma\le d,\ \tau\ge M/S}$, due to the fact that $E(\delta_{\tau,b}) \equiv 0$ and $\mathrm{var}(\delta_{\tau,b}) \equiv 1$. The performance of the above wild bootstrap procedure has also been examined via a Monte Carlo study in Yang (2004). In particular, simulation experiments have shown that the procedure is extremely robust with regard to the number of bootstrap samples as long as it is higher than 400.
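The resampling step can be sketched as follows. This is an illustration under stated assumptions, not the procedure (3.3) itself: the multipliers $\delta$ are drawn as Rademacher variables (one distribution with mean 0 and variance 1; the original may use a different one), and bootstrap responses are formed as fitted value plus $\delta$ times residual. The function name is hypothetical.

```python
import numpy as np

def wild_bootstrap_responses(fitted, residuals, n_boot=400, seed=0):
    """Return an (n_boot, T) array of wild bootstrap responses.

    Each row is fitted + delta * residuals, with delta i.i.d. +1/-1
    (assumed Rademacher multipliers, so E(delta) = 0 and var(delta) = 1).
    """
    rng = np.random.default_rng(seed)
    fitted = np.asarray(fitted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    delta = rng.choice([-1.0, 1.0], size=(n_boot, len(residuals)))
    return fitted + delta * residuals

# Example with toy fitted values and residuals.
Y_star = wild_bootstrap_responses([1.0, 2.0, 3.0], [0.5, -0.2, 0.1])
```

Because each residual is only multiplied by its own $\delta$, the lagged explanatory variables are kept fixed and any heteroscedasticity in the residuals is preserved in every bootstrap sample.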
In addition, Yang (2004) also provided some Monte Carlo evidence that the confidence band narrows at the rate of $n^{-2/5}\log^{1/2}(n)$ as $n \to \infty$.
In the next section, we will apply the BIC criterion and the wild bootstrap confidence band to some unemployment series and discover some nontrivial dependence structures in these series.

Unemployment Rates
In this section we closely examine four quarterly unemployment rate series collected from the Current Population Survey (CPS) at the US Bureau of Labor Statistics. The first two series are the quarterly unemployment rates of all men 20 years and over, and all women 20 years and over, regardless of ethnic origin, family status, occupation, profession and race, from 1948 to 2002. The other two series consist of the quarterly unemployment rates of all whites 16 years and over, and all African Americans 16 years and over, regardless of ethnic origin, family status, occupation, profession and sex, from 1972 to 2002.
The approach is to model the first two series jointly and the last two jointly as bivariate time series with $S = 4$ seasons. For the first data set, since there are a total of 220 quarters, the combined time series is $\{R_{t,\gamma}\}_{t=1,\gamma=1}^{220,2}$, where $R_{t,1}$ is the unemployment rate of men 20 years and over in quarter $t$ and $R_{t,2}$ is the unemployment rate of women 20 years and over in quarter $t$. For the second, there are a total of 124 quarters, and the combined series is $\{R_{t,\gamma}\}_{t=1,\gamma=1}^{124,2}$, where $R_{t,1}$ is the unemployment rate of whites 16 years and over in quarter $t$ and $R_{t,2}$ is the unemployment rate of African Americans 16 years and over in quarter $t$.
For the two bivariate data sets, we use the first 90% of the data to estimate the model and then calculate the out-of-sample prediction error for the last 10% of the data. Both SANVAR and PVAR models are used for comparison. The definition of $Y_{\tau S+s,\gamma}$ as $Y_{\tau S+s,\gamma} = R_{\tau S+s,\gamma} - R_{\tau S+s-S,\gamma}$ leads one to define the forecasts of $R_{\tau S+s,\gamma}$ in terms of the forecasts of $Y_{\tau S+s,\gamma}$, i.e., $\hat R_{\tau S+s,\gamma} = \hat Y_{\tau S+s,\gamma} + R_{\tau S+s-S,\gamma}$.
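The seasonal differencing and its inversion for forecasting amount to two array operations. A minimal sketch, with hypothetical function names and 0-based indexing, is:

```python
import numpy as np

def seasonal_difference(R, S=4):
    """Y_t = R_t - R_{t-S}; the result has length len(R) - S."""
    R = np.asarray(R, dtype=float)
    return R[S:] - R[:-S]

def undo_seasonal_difference(Y_hat, R, S=4):
    """R_hat_t = Y_hat_t + R_{t-S}, where Y_hat[i] is the forecast of Y at t = i + S."""
    R = np.asarray(R, dtype=float)
    return np.asarray(Y_hat, dtype=float) + R[:-S]

# Toy quarterly series (not the unemployment data).
R = np.array([5.0, 6.0, 7.0, 8.0, 5.5, 6.5, 7.5, 8.5, 6.0, 7.0])
Y = seasonal_difference(R)
R_back = undo_seasonal_difference(Y, R)   # equals R[4:] when Y is exact
```

Each forecast of $R$ uses the observed rate from the same quarter one year earlier, so an error in $\hat Y$ carries over one-for-one into $\hat R$.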
For the men/women data, the fitted PVAR models give the forecasting equations (4.1) and (4.2), whereas the SANVAR forecasting equations are (4.3) and (4.4). In Figures 1 and 2, the forecasts $\hat R_{t,1}$, $\hat R_{t,2}$, $t = 201, ..., 220$, are plotted according to computation from the SANVAR equations (4.3), (4.4) and the PVAR equations (4.1), (4.2). The Mean Squared Prediction Error (MSPE) is evaluated for each model as $\frac{1}{20}\sum_{t=201}^{220}\bigl(\hat R_{t,\gamma} - R_{t,\gamma}\bigr)^2$, $\gamma = 1, 2$. The SANVAR forecasts come with confidence bands as given in (3.4) and (3.5) of section 3. From these plots, one can see clearly that the SANVAR model is superior to the PVAR model. For the series of men, the SANVAR model is only slightly better in prediction power, while for the series of women, the SANVAR model is twice as powerful as the PVAR model in prediction. For both men's and women's series, the confidence bands appear rather narrow and follow the trends well. Notice that these confidence bands are simultaneous confidence intervals, not simultaneous prediction intervals (which need to account for extra noise), hence the excellent coverage of the actual data path by these bands is all the more remarkable. This is consistent with the conservativeness of the confidence bands as in (3.6). A similar phenomenon will be observed again for the forecasting of whites' and African Americans' unemployment rates, in Figures 3 and 4.
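The MSPE used for the comparison is the average squared forecast error over the holdout quarters. A minimal sketch (the function name is hypothetical, and the numbers in the example are made up for illustration) is:

```python
import numpy as np

def mspe(forecast, actual):
    """Mean squared prediction error over the holdout period."""
    forecast = np.asarray(forecast, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((forecast - actual) ** 2))

# Toy example with three holdout points.
err = mspe([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

For the men/women data the holdout would be the 20 quarters $t = 201, ..., 220$, i.e. the slice `[200:220]` of 0-based arrays of forecasts and observed rates.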
The PVAR model for the men's series is more parsimonious than the SANVAR model, as both contain two variables, yet the PVAR equations are three-term equations while the SANVAR equations contain seven terms. On the other hand, the SANVAR model for the women's series is more parsimonious than the PVAR model, as each SANVAR equation contains only two variables, versus the four variables of the PVAR equations. It is for this reason that SANVAR is preferred to PVAR for the women's series, but not for the men's series. In addition, the PVAR and SANVAR equations for the men's series are actually quite similar. Another interesting phenomenon is that the SANVAR equations for the men's and women's series are the same in form: both are expressed in terms of the men's series of one and three previous quarters. One explanation is that the women's job condition has been strongly affected by the men's, possibly due to family related factors.
For the whites/African Americans data, the PVAR equations are (4.5) and
$$\hat Y_{4\tau+4,2} = 0.011 + 0.834\,Y_{4\tau+3,2} - 1.242\,Y_{4\tau+2,1} + 1.411\,Y_{4\tau+3,1} \qquad (4.6)$$
whereas the SANVAR equations are (4.7) and (4.8), respectively. The PVAR model (4.5) for the unemployment rates of whites actually predicts better than the SANVAR model (4.7). Again, this is due to the fact that the fitted PVAR model for whites has only two explanatory variables and is very similar to the SANVAR model. Therefore, one should always use the linear periodic VAR model for better prediction when the two models produce similar results. For the series of African Americans, the opposite is true: the SANVAR model predicts much better than the PVAR model. It is also worth noticing that for both whites and African Americans, the preferred forecasting model is always a univariate series prediction model. To be precise, the PVAR model (4.5) for whites and the SANVAR model (4.8) for African Americans both suggest that prediction for different races is best done separately. This strongly suggests that whites and African Americans have been living in parallel economies and that there is little interaction between their unemployment rates.
Overall, the SANVAR model is a more robust option than the PVAR model. It nearly always predicts better, except when the series is extremely close to linearity, which is always indicated by the lack of parsimony of the fitted SANVAR model. If the fitted SANVAR model is less parsimonious than the fitted PVAR model (i.e., having more or the same number of variables), one should use the simpler PVAR model for forecasting and inference. In addition, the model is able to detect from the data whether there is any significant interaction among the individual series.
Suggestions from Stuart Scott at the U.S. Bureau of Labor Statistics and from an anonymous referee are gratefully acknowledged.

Figure 1: Forecasting the men's quarterly unemployment rates of 1998-2002, based on men's and women's unemployment rates of 1948-1997. The solid thick line represents the actual unemployment rates during 1998-2002; the thin dashed line represents the forecasts. Both the parametric PVAR and the nonparametric SANVAR models are used. In the plot for the SANVAR model, nonparametric confidence bands for the predicted means are also plotted. The MSPE is calculated as the mean squared prediction error between the predicted and true unemployment rates.

Figure 2: Forecasting the women's quarterly unemployment rates of 1998-2002, based on men's and women's unemployment rates of 1948-1997. The solid thick line represents the actual unemployment rates during 1998-2002; the thin dashed line represents the forecasts. Both the parametric PVAR and the nonparametric SANVAR models are used. In the plot for the SANVAR model, nonparametric confidence bands for the predicted means are also plotted. The MSPE is calculated as the mean squared prediction error between the predicted and true unemployment rates.

Figure 3: Forecasting the whites' quarterly unemployment rates of 1999-2002, based on whites' and African Americans' unemployment rates of 1972-1998. The solid thick line represents the actual unemployment rates during 1999-2002; the thin dashed line represents the forecasts. Both the parametric PVAR and the nonparametric SANVAR models are used. In the plot for the SANVAR model, nonparametric confidence bands for the predicted means are also plotted. The MSPE is calculated as the mean squared prediction error between the predicted and true unemployment rates.

Figure 4: Forecasting the African Americans' quarterly unemployment rates of 1999-2002, based on whites' and African Americans' unemployment rates of 1972-1998. The solid thick line represents the actual unemployment rates during 1999-2002; the thin dashed line represents the forecasts. Both the parametric PVAR and the nonparametric SANVAR models are used. In the plot for the SANVAR model, nonparametric confidence bands for the predicted means are also plotted. The MSPE is calculated as the mean squared prediction error between the predicted and true unemployment rates.