Spline Pattern-Mixture Models for Missing Data

We consider a continuous outcome subject to nonresponse and a fully observed covariate. We propose a spline proxy pattern-mixture model (S-PPMA), an extension of the proxy pattern-mixture model (PPMA) (Andridge and Little, 2011), to estimate the mean of the outcome under varying assumptions about nonresponse. S-PPMA improves the robustness of PPMA, which assumes bivariate normality between the outcome and the covariate, by modeling the relationship via a spline. Simulations indicate that S-PPMA outperforms PPMA when the data deviate from normality and are missing not at random, with minor losses of eﬃciency when the data are normal.


Introduction
Missing data are a common problem in many data sets. In this article we consider data where our goal is to estimate the mean of a variable Y with n 0 observed values ({Y i }, i = 1, . . . , n 0 ) and n 1 missing values ({Y i }, i = n 0 + 1, . . . , n 0 + n 1 ), when there is a set of p auxiliary variables Z 1 , . . . , Z p that are fully observed ({Z i1 , . . . , Z ip }, i = 1, . . . , n, n = n 0 + n 1 ). Define the response indicator R taking values 1 if Y is observed and 0 if Y is missing. It is common to use methods that assume Y is missing at random (MAR) in the sense that R is independent of Y given the observed covariates Z 1 , . . . , Z p (Rubin, 1976). Such methods include weighting class adjustments and imputation. Our methods build on a robust MAR imputation method called penalized spline of propensity prediction (PSPP) (Little and An, 2004;Zhang and Little, 2009;Yang and Little, 2015). This method (a) estimates the propensity that R = 1 given Z 1 , . . . , Z p based on a logistic regression of R on Z 1 , . . . , Z p , using all the data, and (b) imputes Y based on the regression of Y on a penalized spline of the estimated propensity, with other covariates being included parametrically if they improve the predictions.
MAR-based methods are generally biased in cases where the missingness is missing not at random (MNAR), meaning that missingness of Y depends not only on covariates Z 1 , . . . , Z p but also on the value of Y itself. Schouten (2007) proposes a selection strategy for weighting variables that relaxes the MAR assumption. The method uses a generalized regression estimator to estimate the mean with auxiliary variables selected to minimize the maximal absolute bias under MNAR. The selection strategy, however, is based on parameters estimated under the MAR assumption and thus may be invalid if the missingness mechanism deviates markedly from MAR. Pfeffermann and Sikov (2011) propose a method for estimating the mean under MNAR by specifying models for the outcome and propensity, which is allowed to depend on both the outcome and auxiliary variables. The method assumes known population totals for some or all of the auxiliary variables in the two models and estimates the model parameters in a way that takes into account the known population totals.
The bivariate normal pattern-mixture model (BNPM) (Little, 1993(Little, , 1994) assumes a bivariate normal distribution for a single observed covariate X and an outcome Y within strata defined by respondents and nonrespondents, with a different mean and covariance matrix in each stratum. Parameters of BNPM are identified by assumptions about the missingness mechanism. For instance, under MAR, where missingness is assumed to depend on X but not Y , the parameters of the regression of Y on X are the same for respondents and nonrespondents; as a result, the maximum likelihood (ML) estimate for the mean of Y is the regression estimate, μ Y =Ȳ (1) + s XY s XX (X −X (1) ), whereX is the sample mean of X,X (1) is the respondent mean of X, Y (1) is the respondent mean of Y , s XY is the respondent covariance of X and Y , and s XX is the respondent variance of X. When missingness is MNAR and is assumed to depend on Y but not X, the parameters of the regression of X on Y are the same for respondents and nonrespondents; Little (1994) shows that the resulting ML estimate of the mean of Y isμ Y =Ȳ (1) + s Y Y s XY (X −X (1) ), where s Y Y is the respondent variance of Y . The approach is easily extended to allow missingness of Y to depend on Y * = X + λY for some known λ, a parameter that can then be varied in a sensitivity analysis. ML, Bayesian and multiple imputation (MI) approaches to inference for this BNPM model are described in Little (1994).
An advantage of the BNPM model is that it does not need to specify an explicit functional form for the missingness mechanism, the mechanism entering in the form of restrictions on the model parameters. The modification of MAR regression estimation to MNAR models is straightforward, as seen in the estimate of the mean of Y above. However, validity of the estimates depends on bivariate normality of X and Y , which is a strong assumption. For example, if X is normal and Y given X is normal with conditional mean a quadratic function of X, then the regression of X on Y is no longer linear, and ML estimates under the BNPM model are biased. In this article we study the impact of such forms of misspecification on inferences for the mean of Y .
We also propose a modification of the BNPM model, spline-BPNM (S-BPNM), which replaces a parametric linear regression by a penalized spline, extending the PSPP method (which assumes MAR) to MNAR situations; in the case where missingness depends on Y , we model the regression of X on Y using a flexible penalized spline, rather than assuming a linear relationship. The resulting estimate of the mean of Y is shown in simulations to be more robust than BNPM to the distributional relationship between X and Y . The approach can also be generalized to the case where missingness depends on Y * = X + λY for some known value of λ.
We also consider cases with more than one covariate. In that context, proxy pattern-mixture model analysis (Andridge and Little, 2011) extends the BNPM model to data with an outcome Y and a set of p observed covariates Z 1 , . . . , Z p . The PPMA method replaces the set of covariates by a proxy X, the single best predictor of Y given the covariates, estimated by regressing Y on Z 1 , . . . , Z p for the respondents. The method then fits the pattern-mixture model in Little (1994) to Y and X. Bayesian forms of PPMA take into account the estimation of the coefficients of Z in the proxy variable X. This analysis relies on the bivariate normality assumption between the proxy X and Y , which is violated when some or all of the covariates Z 1 , . . . , Z p used to estimate X are not normally distributed. We propose a more flexible version of PPMA, which we call spline-PPMA (S-PPMA), which relaxes the bivariate normality assumption between the proxy and Y by replacing the linear regression of X on Y * implied by the bivariate normality with a penalized spline, allowing for a non-linear relationship between the variables.
We conduct simulations to examine the performance of the new S-PPMA model, and in particular to address the following questions: 1. How do inferences under S-BNPM and S-PPMA models compare with the original BNPM and PPMA methods in terms of bias, root mean squared error (RMSE) and coverage, for data sets generated under a variety of distributional assumptions? 2. How sensitive are S-BPNM and S-PPMA models to alternative assumptions about the missingness mechanism?
In the next section, we present the S-BNPM and S-PPMA models in detail. We then assess their performance in simulation studies under a variety of distributional assumptions for the auxiliary variables and missingness mechanisms.

Pattern-Mixture Model Analysis
We consider first bivariate data on X and Y , with X observed for the entire sample and Y subject to missing data, and let R = 1 if Y is observed and R = 0 if Y is missing. Little (1994) assumes the BNPM model: where N 2 (μ, ) denotes the bivariate normal distribution with mean μ and covariance matrix .
Since we have no data on Y for the nonrespondents (R = 0), we cannot estimate all of the parameters in (1) for R = 0 without further assumptions. If assume that the missingness of Y depends only on X, we can factor the joint distribution of X, Y , and R into: where p() is the probability density function. Under the bivariate normality assumption and the property that the distribution of Y given X is independent of R, the parameters of the regression of Y on X are the same for R = 1 and R = 0, leading to a just-identified model. Little (1994) derives the ML estimates; in particular the ML estimate forμ Y , the mean of Y averaging over R, is:μ Suppose now that the missingness of Y depends on Y but not X. This implies that the parameters of the regression of X on Y are the same for R = 1 and R = 0, again leading to a just-identified model. The resulting ML forμ Y averaging over R is: More generally, suppose that the missingness of Y depends on the value of Y * = X + λY for a given λ. Little (1994) shows that the ML estimate forμ Y averaging over R is then: It is easy to see that (4) reduces to (2) when the data is MAR (λ = 0), and to (3) when missingness depends only on Y (λ = ∞). In practice, the data often provide no information about the value of λ. Little (1994) suggests a sensitivity analysis to capture the uncertainty about λ by estimatingμ Y over a range of λ. Large differences inμ Y over λ suggest that inferences onμ Y are sensitive to assumptions about the missingness mechanism. Alternatively, we can specify a prior distribution that reflects the uncertainty about the choice of λ.

Spline Pattern-Mixture Model
The BNPM model estimates rely heavily on the bivariate normality assumption between X and Y. For example, (X, Y ) is not bivariate normal if (a) the conditional distribution of Y |X is normal with E(Y |X) = 10 + X and the marginal distribution of X is gamma, or (b) X is normal but the regression of Y on X is quadratic in X; in such cases the estimates from the BNPM model are potentially biased even under the correct value of λ. We propose a penalized spline regression (S-BNPM) model for X and Y that relaxes the bivariate normality assumption. Suppose that missingness depends on the value of Y * = X + λY for some known λ > 0. The conditional distribution of X|Y * is then the same for respondents and nonrespondents. The S-BNPM method creates multiple imputations of the missing values of Y * (and hence Y = (Y * − X)/λ) so that the regression of X on Y * for respondents (where Y * is observed) and nonrespondents (where Y * is imputed) follows the same spline regression model: where a + = a if a > 0 and a + = 0 otherwise, and κ 1 < . . . < κ K are K knots. The model may be fitted to the respondent data using a linear mixed model, treating the splines as random effects and adding a penalization term, α K k=1 γ 2 k , to the log-likelihood to penalize the roughness of the regression function, whereα 2 =σ 2 /τ 2 and the model parameters are estimated via restricted maximum likelihood (REML). Here, we adopt a Bayesian approach by assigning a uniform prior for β and inverse gamma (υ = 10 −5 , ω = 10 −5 ) priors for σ 2 and τ 2 , which have a mean of ω/(υ − 1) when υ > 1 and a variance of ω 2 /[(υ − 1) 2 (υ − 2)] when υ > 2, and obtain draws from their posterior distributions using a Gibbs sampler (see Supplementary Material for details of the algorithm). We give γ 1 , . . . , γ K normal N(0, τ 2 ) priors that result in an equivalent penalization to minimize over-fitting. While estimates are generally insensitive to the choice of the inverse gamma hyperparameters, small values (e.g. 10 −2 or less) should be chosen to yield relatively noninformative but finite priors. In practice, the number of knots, and potentially a polynomial basis for the splines, may be chosen based on the sample size and the observed degree of nonlinearity. We recommend having at least 10 observations for each spline. For simplicity, we consider splines with a linear basis.
We then adopt a hot deck procedure (Andridge and Little, 2010) to impute the missing values of Y * , where the missing value of Y * is imputed with the observed value of a matched donor with X and Y * observed. The method involves the following steps: 1. Draw B values of Y * for each nonrespondent from the distribution of Y * |X, R = 0, estimated under the BNPM model. This results in a pool of n 1 B values of Y * ({Y * p }, p = 1, . . . , n 1 B). In the simulations in Section 3 a value of B = 100 is sufficient.

Given each Y *
p in the pool, draw a value X p from the posterior predictive distribution of X|Y * in (5), with parameters estimated from respondents. This results in a set of pairs of 4. Repeat steps 2-3 above for 2000 iterations, deleting the first 1000 as burn-in and using every other 10 iterations to create D = 100 multiply-imputed data sets with values of Y imputed. Using multiple imputation combining rules (Little and Rubin, 2020) we obtainμ Y and its variance:μ whereμ d and W d are the estimated marginal mean and variance in the d th imputed data set, respectively. For the MAR assumption of λ = 0, we apply a Bayesian form of the PSPP method (Zhang and Little, 2009;Yang and Little, 2015). Specifically, we regress Y on a spline of X using the complete cases and impute Y by drawing directly from its predictive posterior distribution in (5) given the observed X (1) for each iteration of the Gibbs algorithm. The underlying rationale of the procedures is as follows. Since the unobserved Y * is a covariate in our spline model (5), we cannot impute Y * by drawing directly from a model. Thus we first create a donor pool of values ({Y * ib }, b = 1, . . . , B, i = n 0 + 1, . . . , n 0 + n 1 ) as draws from the BNPM model. For each donor in the pool, we create a corresponding value of X as a prediction from the spline model (5). We then match each incomplete case to a member of the donor pool with a similar value of X, and impute for that case the corresponding value of Y * from the donor. When the data are normal, the "hot-deck" matching step has little effect on the final imputations of Y * . However, when data deviate from normality, the pairs (X, Y * ) resulting from the hot-deck respect the spline model (5) and hence should improve on the imputations from the BNPM model, which incorrectly assume a linear relationship between X and Y * . In practice, we create multiple initial draws of Y * for each nonrespondent, as a large value of B allows flexibility in the nonlinearity adjustment by S-BNPM and ensures a close match with the donors for every observed X. In the following examples we find a value of B = 100 to be sufficient to ensure a near-identical match in X.
As in the original BNPM model, the S-BNPM model utilizes the fact that, conditional on the variables contributing to missingness, the regression model parameters are the same for both respondents and nonrespondents. However, the penalized spline improves robustness of the pattern-mixture model by allowing us to model nonlinearity in the relationship between X and Y . As suggested in Little (1994), inferences forμ Y should be displayed for a range of potential values of λ to account for uncertainty about the true value of λ and to assess sensitivity of inferences to the choice of λ.

Extensions of Proxy Pattern-Mixture Model Analysis
There may be multiple observed covariates Z 1 , . . . , Z p that are predictive ofμ Y . Andridge and Little (2011) proposed an extension of the pattern-mixture model analysis by taking X as a proxy obtained by regressing Y on the set of Z 1 , . . . , Z p and replacing the set of covariates by X, the estimated best predictor of Y given Z 1 , . . . , Z p . Proxy pattern-mixture model analysis (PPMA) then estimatesμ Y by applying the pattern-mixture model in Little (1994) to X and Y . The advantage of reducing Z 1 , . . . , Z p to X is simplicity: modelling departures from MAR under one sensitivity parameter λ is much simpler than specifying a model with p sensitivity parameters for each of Z 1 , . . . , Z p . Moreover, should missingness depend on some other combination of Z 1 , . . . , Z p (e.g. W = c 1 Z 1 + . . . + c p Z p , where c 1 , . . . , c p are constants), estimates for the mean of Y are still approximately unbiased since Y is independent of W given X. Andridge and Little (2011) showed that the uncertainty of the estimates ofμ Y depends largely on the degree of correlation between the proxy X and Y as well as the degree of similarity between respondents and nonrespondents with respect to the value of X. When X and Y are highly correlated and the values of X are similar for respondents and nonrespondents, information on missing values of Y and evidence on the lack of response bias are both strong, resulting in estimates ofμ Y with high precision. However, if X and Y are weakly correlated and the values of X are much different for respondents and nonrespondents, we have strong evidence for response bias with little information on the missing values of Y , resulting in estimates ofμ Y with high uncertainty.

Spline Proxy Pattern-Mixture Model
As in the bivariate case, validity of the proxy pattern-mixture model proposed by Andridge and Little (2011) when data are MNAR relies on the assumption of bivariate normality between the proxy X and Y , which is violated when some or all of the Z 1 , . . . , Z p used to obtain X are not normally distributed. Suppose, for example, Z is a fully observed standard normal variable and Y given Z is normal with mean Z + Z 2 . Let X be a proxy from the regression of Y on Z and Z 2 . When the data is MAR, X is an unbiased predictor of Y , hence estimates from the pattern-mixture model under λ = 0 are unbiased. However, when missingness depends on Y , the resulting proxy X is no longer an unbiased predictor of Y since the regression coefficients in the regression of Y on Z and Z 2 based on the respondents are biased for the nonrespondents. Since X is some function of Z and Z 2 which is not normally distributed, the assumption of bivariate normality, hence linearity, with Y fails, resulting in biased estimates for all values of λ.
We propose a modification of the proxy pattern-mixture model that relaxes the assumption of bivariate normality between X and Y . Suppose, as before, X is the predicted value of Y based on regression of Y on Z 1 , . . . , Z p for the complete cases, and that missingness depends on the value of Y * . The conditional distribution of X given Y * is independent of R and the regression coefficients of X on Y * are the same for both respondents and nonrespondents. The model proposed in Andridge and Little (2011) assumes linearity between X and Y , and hence Y * , which as discussed may not be appropriate when X and Y are not bivariate normal. Thus, we propose a spline proxy pattern-mixture model analysis (S-PPMA) to describe the relationship between X and Y . Under S-PPMA, we first estimate the proxy based on a complete-case regression of Y on Z 1 , . . . , Z p as in Andridge and Little (2011), and set X as the predicted value of Y from this regression. Then, we apply a penalized spline model to X and Y and estimateμ Y as discussed in Section 2.1. As in the bivariate model, we believe S-PPMA will further enhance the robustness of PPMA by relaxing the bivariate normality assumption.
In the next section, we describe simulation studies to assess the performance of S-PPMA under various distributions of Z 1 , . . . , Z p , Y , and missingness mechanisms. For comparison we include estimates from the proxy pattern-mixture model proposed in Andridge and Little (2011).

Simulation Studies
We assess the performance of S-PPMA for inferences about the mean of Y with respect to average bias, root mean square error, 95% confidence interval width, and rate of confidence interval non-coverage over 1000 replications and six scenarios. For each replication, we construct 95% confidence intervals and estimate the non-coverage rate as the proportion of the 1000 confidence intervals that do not cover the true value, where 95% CI = (μ Y − t n−1,0.975 Var(μ Y ),μ Y + t n−1,0.975 Var(μ Y )), t n−1,0.975 is the 97.5 th percentile of the t-distribution with n − 1 degrees of freedom, and Var(μ Y ) is the estimated variance of the mean in (7). Confidence interval widths (CIW) are computed as CIW = 2t n−1,0.975 Var(μ Y ), or for credibility intervals, the difference between the 97.5 th and 2.5 th percentiles of the posterior distribution ofμ Y . For all simulations, we set sample sizes of n = 100 and n = 400 over 1000 replications and apply the penalized spline models using K = 2 and K = 5 knots, respectively.
For the first scenario, we assume bivariate normal data of X and Y and compare estimates of the mean of Y under the BNPM and S-BNPM models. For scenarios 2-5, we assume a set of fully observed covariates Z 1 , . . . , Z p . Here, we first obtain the proxy X from a correctly specified regression of Y on Z 1 , . . . , Z p using the respondent sample. Then, we estimate the mean of Y using three methods: 1. We apply the S-PPMA model to X and Y using a penalized spline in (5). (S-PPMA) 2. We assume bivariate normality between X and Y and estimateμ Y via maximum likelihood in (4) as originally proposed in Andridge and Little (2011). Variance is estimated using 200 bootstrap samples. (PPMA-ML) 3. We assume bivariate normality between X and Y and drawμ Y from its posterior distribution as described in Little (1994). The 95% credibility intervals and coverage are based on draws from the posterior distribution. (PPMA-BAYES) Let λ T be the true, unobservable value of λ generating missing data, and let λ A be the assumed value of λ in our models. For each scenario, we simulate nonresponse using λ T = 0, 1 and ∞. To assess sensitivity of inferences to λ A , we produce estimates under λ A = 0, 1 and ∞ for each value of λ T , one of which corresponds to the true underlying value of λ T . While inferences under additional values of λ A may be explored, we chose these three values to capture a range of potential missingness mechanisms. In the following section, only results for which λ A = λ T are shown (for rest, see Supplementary Material).

Scenario 1: Bivariate Normal Data
We assume a fully observed covariate X and a Y that is bivariate normal with X and subject to missingness. The data is generated under the following pattern-mixture model with sample sizes of n = 100 and n = 400: N(1, 1).
In this and all subsequent scenarios, nonresponse rates are approximately 50%. For simplicity we only display results at n = 400, as results for n = 100 are generally similar (see Supplementary Material). Figure 1 displays the performances of each estimator in terms of av- Figure 1: Results for scenario 1 where λ A = λ T . erage bias, root mean squared error (RMSE), 95% CIW, and its corresponding non-coverage rate out of 1000 replications when λ A = λ T . In the figure, the true missingness of Y depends on X + λ T Y for λ T = 0, 1, and ∞. Results show little differences between the methods in bias, RMSE, and CIW regardless of λ A in all values of λ T (results for λ A = λ T in Supplementary Material). As expected, when λ A = λ T , all estimates are approximately unbiased and non-coverages are near the nominal 5%, as BNPM is the correct model for the data. Moreover, CIW increases as λ T increases, reflecting a rise in uncertainty as a result of nonresponse due to Y . We notice that the CIW for S-BNPM at λ A = ∞ is narrower than that for BNPM under both ML and Bayes for all values of λ T . This may be due to a small correlation of 0.5 between X and Y , which may lead to a large value of s Y Y s XY in (3) and consequently an extremeμ Y . In S-BNPM, the process of generating multiple initial draws of the missing Y and matching on the donor pool based on predictions from the spline model helps to alleviate this problem as draws of X from extreme values of Y are less likely to be matched to observed values of X, leading to less extreme imputations in this particular scenario.

Scenario 2: Bivariate Non-normal Data
Suppose X is a fully observed, gamma-distributed covariate and Y is normal conditional on X and is subject to missingness. We generate the data under a selection model with sample sizes of n = 100 and n = 400: Y |X ∼ N(10 + X, 1).
We generate missing values of Y under the following models to reflect both MAR and MNAR scenarios, assuming an unobserved latent variable U : where Y is missing if U > 0 and observed otherwise.
In this scenario we include estimates from the true model, which models Y * on U and X for λ T > 0, since Y * and U are bivariate normal conditional on X. Since U is unobserved, we estimate U by introducing a latent variable U * , and produce posterior draws of the missing Y * iteratively by the following steps: 1. Initialize values of Y * (0) and U * by setting Y * (0) as predictions from the regression of Y * on X under the complete cases, and draw U * from a normal distribution with variance 1 and mean Zπ −Ȳ + Y , whereπ is the nonresponse rate, Z α is the α th percentile of the standard normal distribution, andȲ is the estimated mean combining the observed Y * (1) and the initialized Y * (0) . For respondents, positive values of U * are discarded and redrawn until all values are negative. Likewise for nonrespondents, we discard and redraw negative values of U * . 2. At the ith iteration, obtain posterior predictive draws of Y * (0) (i) |U * (0) (i−1) , X (0) under a linear regression model with parameters estimated from Y * |U * (i−1) , X under the entire imputed sample, using values of Y * (0) (i−1) and U * (r) (i−1) drawn from the previous iteration. 3. Obtain posterior predictive draws of U * (i) |Y * (1) , Y * (0) (i) , X under a linear regression model for the entire sample, where Y * (0) (i) are predictive draws for the missing Y * (0) at the current iteration. We again discard and redraw all positive values of U * (i) for respondents and negative values of U * (i) for nonrespondents. 4. Repeat steps 2 and 3 over 1000 iterations, discarding the first 100 as burn-in. We then apply (6) and (7) over the 900 sets of drawn values of Y * (0) to estimate the mean and variance. For λ T = 0, we impute the missing Y based on posterior predictive draws from the regression of Y on X on the complete cases. Figure 2 displays results under λ A = λ T for λ T = 0, 1, and ∞. As in Scenario 1, we only display results at n = 400, as results for n = 100 are generally similar (see Supplementary Material). When λ T = 0, all methods are unbiased, with S-BNPM having slightly higher RMSE and more conservative 95% confidence intervals. Since data is MAR and Y |X is normal with a mean that is linear on X, the BNPM model is correctly specified and thus it is not surprising that its estimates are unbiased and have better precision than S-BNPM. However, when λ T = 1, linearity assumptions for X|Y * are violated, and consequently we see bias and under-coverage by BNPM. Here, S-BNPM shows reductions in bias and to a lesser extent RMSE, and achieves near nominal 5% non-coverage with a minor penalty in RMSE and precision compared to the true model. The more the data deviate from MAR, the higher the gains in bias and RMSE from S-BNPM, as evident in the results under λ T = ∞. S-BNPM shows a noticeable improvement in RMSE over BNPM and still yields close to nominal non-coverage. Robustness to normality, however, comes at the price of precision, as S-BNPM tends to yield wider intervals than both BNPM and the true model.

Scenario 3: Set of Normal Z's
In this scenario, we assume a set of covariates that are normally distributed. Let Z 1 , Z 2 , Z 3 be fully observed covariates with distributions: Let Y be missing under the following logistic models: Logit[Pr(R = 0)] = −7.5 + 0.5(2Z 2 + Y ).
For each missingness mechanism, we obtain the proxy X by regressing Y on Z 1 , Z 2 , and Z 3 apply the estimators to X and Y . Figure 3 shows results for λ A = λ T , with λ T = 0, 1, and ∞ under n = 400 (see Supplementary Material for rest of results). In addition there are two nonresponse mechanisms, D and E, that do not correspond to any λ T . When λ T = 0, Y is MAR, λ A = 0 is the correct assumption about nonresponse and as a result all estimators are approximately unbiased and yield similar RMSE, confidence interval widths, and nearnominal non-coverage of 5%. For values of λ A = 1 and ∞ when λ T = 0, all three methods exhibit bias, with negligible differences in RMSE, CIW, and non-coverage (not shown). Similarly when λ T = 1 and ∞, values of λ A such that λ A = λ T result in negligible bias and near nominal non-coverage for all estimators. For values of λ A such that λ A = λ T , all methods are biased with higher than nominal non-coverage, as expected given that the assumptions about nonresponse are wrong. Results for mechanism D (not shown) are generally similar to those of A, where λ T = 0. Here, all methods have negligible bias and nominal non-coverage at λ A = 0 and yield similar RMSE and CIW at all values of λ A . In mechanism E, all methods have minor bias at λ A = 1 and cover the true mean at a rate close to 95%, with minor differences in RMSE and CIW regardless of λ A . In this scenario, nonresponse mechanisms D and E do not deviate much from mechanisms A and B, which explains the similarity of results.
This scenario assumes that all auxiliary variables are normally distributed, resulting in a proxy X that is normal and linear with Y regardless of the nonresponse mechanism. As such, the methods in Andridge and Little (2011) produce valid estimates under the correct value of λ A . We again notice that S-PPMA tends to yield slightly more conservative confidence intervals than PPMA, which suggests there is some penalty in precision from fitting a more robust model when normality assumptions are met.

Scenario 4: Varying Distributions of Z
Let Z 1 , Z 2 , Z 3 be fully observed covariates with the following distributions: Let Y be missing under the following logistic models: We obtain the proxy by regressing Y on Z 1 , Z 2 , and Z 3 using respondent data and apply the estimators under λ A = 0, 1 and ∞. Results for which λ A = λ T under n = 400 are shown in Figure 4 (see Supplementary Material for rest of results). Mechanisms D and E do not correspond to any value of λ T . In this scenario we vary the distributions of the auxiliary variables and the conditional mean of Y given Z 1 , Z 2 , and Z 3 is dominated by a gamma distributed Z 2 . For λ T = 0 where Y is MAR, all three methods yield approximately unbiased means with close to nominal non-coverage when the correct value of λ A = 0 is used. Under the incorrect values of λ A = 1 and ∞, however, the S-PPMA has lower bias, lower RMSE, and lower non-coverage rate than the linear models albeit with more conservative confidence intervals (not shown). For λ T = 1 and ∞, the PPMA estimates exhibit small bias even when λ A = λ T , most likely as a result of lack of linearity between X and Y due to MNAR and some of the auxiliary variables being non-normal. The S-PPMA estimates at the correct λ A show low bias and non-coverages close to 5%, which may be explained by the spline's ability to model nonlinearity between X and Y . It is worth noting, however, that despite the bias PPMA still achieves good coverage at λ A = λ T = 1. In terms of RMSE, S-PPMA has no noticeable gains over PPMA under λ A = λ T = 1, and larger gains when λ A = λ T = ∞. This suggests that as dependence of nonresponse on Y increases, the degree of nonlinearity adjustment by the penalized spline increases. Robustness to λ T comes at the expense of precision, as the penalized spline yields wider intervals under all values of λ A for any λ T . However, it is important to note that values of Y tend to be much lower for respondents than nonrespondents as a result of the nonresponse mechanism, which leads to sparse data and extrapolation at higher values of Y . Thus, wider interval widths by the spline may be a reflection of uncertainty in imputing the missing values by extrapolating a nonlinear model. For mechanism D (not shown), there are no significant differences in RMSE and CIW regardless of λ A , with negligible bias at λ A = 0 and close to nominal coverage at both λ A = 0 and 1 for all methods. In mechanism E, both S-BNPM and BNPM yield similar estimates with nominal non-coverage at λ A = 1.

Scenario 5: Quadratic Term in Mean of Y
Let Z 1 and Z 2 be fully observed covariates with the following distributions: . Let Y be missing under the following mechanisms: We estimate the proxy X by regressing Y on Z 1 , Z 2 , and Z 2 2 using the complete cases and apply the estimators under the different values of λ A . Here we introduce a quadratic term in the conditional mean of Y. For λ T = 0, when data is MAR, the estimated proxies are unbiased estimates of Y since they are based on a correctly specified regression model. As a result all methods are unbiased with close to nominal 5% non-coverage when we assume the correct value of λ A = 0, with the spline having slightly wider interval widths ( Figure 5). For other values of λ A , the S-PPMA shows smaller bias, lower RMSE, and much higher coverage rate than their linear counterparts, and still achieves near nominal non-coverage under the incorrect assumption of λ A = 1 (not shown).
For λ A = λ T = 1, where missingness depends equally on both Y and the auxiliary variables, estimates under λ A = 0 (see Supplementary Material) are similarly biased and intervals undercover the true value for all methods, which is not surprising since the assumption of λ T is incorrect. However, S-PPMA has minor bias under λ A = 1, which is the correct assumption in this case shown in Figure 5, and near nominal non-coverage rates under both assumptions of λ A = 1 and λ A = ∞, where the PPMA estimates are biased and undercover the true value. With respect to RMSE, S-PPMA shows increasing gains over PPMA as λ A increases.
When λ T = ∞, where missingness depends only on Y , the penalized spline is again approximately unbiased with nominal non-coverage under the correct assumption of λ A = ∞, while the linear models are heavily biased. This is due to nonlinearity between X and Y caused by the quadratic Z 2 2 term in the mean of Y , violating the bivariate normality assumption required in PPMA. Although the spline yields more conservative intervals, possibly from extrapolating nonlinearity, its ability to model nonlinearity results in estimates that are unbiased and have good coverage rates. This is especially important when the missingness mechanism is MNAR, where the proxy X is no longer unbiased and has a nonlinear relationship with Y . It is interesting to note, however, that in this and the previous scenario, PPMA shows slightly lower RMSE at the wrong assumption of λ A = 1 when the true value is λ T = ∞ (see Supplementary Material).
In mechanism D (not shown), the methods show low bias and similar RMSE at all values of λ A , with the ML estimate of BNPM having significantly wider intervals than S-BNPM and the Bayesian estimate of BNPM, resulting in better coverage. In mechanism E, all methods are generally biased and fail to achieve nominal non-coverage regardless of λ A , with small differences in RMSE. Again the ML estimate of BNPM tends to yield much wider intervals that result in better coverage.

Scenario 6: Interaction Term in Mean of Y
Let Z 1 and Z 2 be fully observed covariates with the following distributions: Let Y be missing under the following mechanisms: Logit[Pr(R = 0)] = −10 + 0.5(5Z 2 + Y ).
In this last scenario, we let the conditional mean of Y be a function of two normally distributed variables and their interaction. We then model Y using a correctly specified regression on Z 1 , Z 2 , and Z 1 Z 2 for the respondents, and obtain the predicted values of Y as our proxy X. As in all scenarios, Figure 6 shows that when missingness is at random, all methods are unbiased, yield similar RMSE, and achieve nominal non-coverage under λ A = 0 since the proxy X itself is unbiased for Y . However, under the incorrect values of λ A = 1 and λ A = ∞, the S-PPMA shows significantly larger bias, RMSE, and CIW than PPMA (see Supplementary Material).
When λ T = 1, all methods have negligible bias under the correct value of λ A = 1 as shown in Figure 6, and achieve close to 5% non-coverage. There are generally minor differences in RMSE between the methods regardless of the assumption in λ A , though S-PPMA tends to be slightly more conservative in terms of interval widths. For λ T = ∞, all methods yield low bias with similar RMSE at λ A = ∞ and nominal non-coverage. All methods have similar bias, RMSE, CIW, and coverage at all other values of λ A (not shown). Although the mean of Y in this scenario depends on the interaction of Z 1 and Z 2 , which is not normally distributed, the model assuming linearity between X and Y still yields good estimates of the mean under MNAR when λ A = λ T . This may be because the distribution of Z 1 Z 2 does not result in a drastic departure from normality in the proxy X, so the bivariate normality assumption between X and Y still approximately holds.
In the results for mechanism D (not shown), which does not correspond to any value of λ T , estimates at λ A = 0 are generally unbiased with minor differences in RMSE, and achieve close to nominal non-coverage with the exception of the Bayesian BNPM. In mechanism E, all methods show some bias at all values of λ A with S-BNPM yielding higher RMSE than BNPM at λ A = ∞.

Example: Child Asthma Study
We apply S-PPMA and PPMA to an asthma study conducted by the University of Michigan Schools of Public Health and Medicine. The study consists of children with asthma from Detroit elementary and middle schools, whose aim is to evaluate the effectiveness of an educational intervention in reducing asthma symptoms. The main outcome of interest is the average number of nights the child experiences asthma symptoms per month, collected at baseline and one-year follow-up. Our goal is to estimate the mean change in nights of symptoms per month from baseline to follow-up, which is subject to dropout. However, since it is well documented that asthma severity naturally declines as the child ages, we restrict our attention to only those in the control group with symptoms at baseline.
Out of 133 children ages 6 to 14 with asthma symptoms at baseline in the control group, 41 (31%) dropped out before follow-up information was obtained. Since dropout may be attributed to asthma severity, we apply the S-PPMA and PPMA models to estimate the mean change in nights of symptoms per month. Only age and measurement at baseline are significantly associated with the outcome, with baseline age also being significantly associated with response. We first obtain our proxy by regressing change in nights per month on its baseline value and age using the respondent sample. We then apply the S-PPMA and PPMA models to estimate the mean change in nights of symptoms per month. Figure 7 shows the distributions of baseline age and nights per month in our data. Both variables show deviations from normality, particularly nights of symptoms per month. Figure 8 displays scatter plots for the relationship between X and Y along with the average regression lines for PPMA and S-PPMA. For the regression of Y on X under the assumption of λ = 0, both PPMA and S-PPMA yield near identical regression lines. However, differences can be seen for the regression of X on Y under the assumption of λ = ∞, where S-PPMA seems to provide a minor improvement in fit. As such, we expect some differences between estimates from S-PPMA and PPMA, particularly at λ = ∞. Figure 9 shows estimates of the mean change under each method. Each line represents the mean and its 95% confidence interval for S-PPMA (PS) and PPMA, which is estimated using both maximum likelihood bootstrap (ML) and posterior draws (PD). To assess sensitivity to our assumption about λ, we display estimates under λ = 0, 0.5, 1, 4, and ∞. Results show that the mean change in symptoms per month generally decreases as we place more weight on our outcome to response, which suggests that children with higher decrease in symptoms may be less likely to participate in the follow-up survey. As expected from Figure 8, estimates for PPMA and S-PPMA at λ = 0 are similar, with differences between the methods being most pronounced at λ = ∞. There are minor differences between the PPMA estimates, with the posterior draws generally producing more conservative intervals than maximum likelihood. As in the simulations, interval lengths tend to widen slightly as λ increases due to increasing uncertainty when missingness depends on the outcome. S-PPMA is to a small degree less sensitive to assumptions about λ than PPMA, as estimates of mean change are within 0.1 nights of each other for values of λ > 0, whereas estimates from PPMA are generally within 0.4 nights as λ varies from 1/2 to ∞. In terms of precision, S-PPMA tends to be more conservative than ML but has slightly narrower interval widths than PD.
In practice, one might choose some intermediate value of λ (e.g. λ = 1) since it represents a more conservative assumption about the missingness mechanism. However, lack of sensitivity to λ allows for more robustness of estimates to the assumptions about missingness, which is important since any belief regarding λ cannot be tested.

Discussion
Most nonresponse adjustment methods assume MAR, which can be a strong and untestable assumption. An advantage of the PPMA model is it allows us to make inferences about the mean of an outcome variable without assuming MAR. Moreover, the model does not require us to specify a propensity model, since it assumes that missingness depends only on the value of X + λY . The method simplifies nonresponse adjustment by combining a set of auxiliary variables into a single measure X and models departures from MAR using a single sensitivity parameter λ. In our proposed extension to the PPMA model, we model the relationship between X and Y through a spline. An advantage of this approach is that it does not require X and Y to be bivariate normal, which is assumed in PPMA, since splines allow us to model nonlinearity between the variables. As a result, we do not require the auxiliary variables to be normally distributed, as the model is robust to non-normal distributions of the auxiliary variables. It is important to note, however, that we do not specify a joint distribution between X and Y . Thus S-PPMA is approximately viewed as a method.
While S-PPMA utilizes initial values of Y generated from the potentially incorrect PPMA model, the additional steps of spline modelling and hot deck imputation helps to adjust for this nonlinearity. Our simulations show that the proposed S-PPMA model with penalized spline consistently yields approximately unbiased estimates with near nominal non-coverage regardless of the distributions of the auxiliary variables when the correct value of λ is used. Compared to the original PPMA proposed in Andridge and Little (2011), S-PPMA has shown to yield estimates that are more robust to covariate distributions, though with a slight penalty in precision when the PPMA model is correct. The gains in bias and RMSE are particularly noticeable the more the auxiliary variables deviate from normality. Results for a smaller sample size of n = 100 (see Supplementary Material) show similar trend, where S-PPMA provide some gains in bias and RMSE when covariates are not normal and missingness is not at random, though differences in bias and RMSE tend to be less pronounced than in larger sample sizes. Moreover, the bootstrap variance estimates of PPMA tend to be more conservative than their Bayesian counterpart, leading to better coverages.
It may be tempting to estimate the value of λ by specifying a prior distribution. However, any inference about λ would be driven entirely by the prior since the data contains no information about λ. Thus we recommend conducting a sensitivity analysis by applying the S-PPMA Figure 9: Estimates for mean change in nights of symptoms per month. model over a range of λ. The sensitivity analysis reflects our uncertainty about the nonresponse mechanism by displaying estimates of the mean over different values of λ, ranging from MAR (λ = 0) to the more extreme MNAR that assumes missingness depends only on the outcome itself (λ = ∞). Comparing estimates over a range of λ helps to provide us an idea of how sensitive our inferences are to the missingness mechanism.
Our examples assume that the variables used to predict the outcome are fully observed, which may not be the case since often both outcome and covariates are missing at the same time, as is the case in unit nonresponse. Extension to the S-PPMA model incorporating additional assumptions about missingness of the covariates may be explored. In our simulations, S-PPMA tends to yield wider confidence intervals than the bivariate normal model particularly for λ > 0. This may be attributed to the fact that when the data is MNAR, values of the outcome for the nonrespondents may be drastically different than the respondents, leading to extrapolation. Estimation becomes particularly tricky when the relationship between Y and X is nonlinear. Thus, the precision of the penalized spline at high values of λ may be a reflection of our uncertainty in extrapolating a nonlinear model. The S-PPMA and PPMA models assume that missingness depends only on the value of X + λY , where X is a function of the covariates Z 1 , . . . , Z p . In reality, there are infinite ways in which data is missing. For example, missingness of Y may depend only on some subset of