Direct and Unbiased Multiple Imputation Methods for Missing Values of Categorical Variables

: Missing data is a common problem in statistical analyses. To make use of information in data with incomplete observation, missing values can be imputed so that standard statistical methods can be used to analyze the data. Variables with missing values are often categorical and the missing pattern may not be monotone. Currently, commonly used imputation methods for data with a non-monotone missing pattern do not allow direct inclusion of categorical variables. Categorical variables are converted to numerical variables before imputation. For many applications, the imputed numerical values for those categorical variables must then be converted back to categorical values. However, this conversion introduces bias which can seriously aﬀect subsequent analyses. In this paper, we propose two direct imputation methods for categorical variables with a non-monotone missing pattern: the direct imputation approach incorporated with the expectation-maximization algorithm and the direct imputation approach incorporated with a new algorithm: the imputation-maximization algorithm. Simulation studies show that both methods perform better than the method using variable conversion. An application to real data is provided to compare the direct imputation method and the method using variable conversion.


Introduction
In many survey and observational data systems, some critical variables may have missing values.These variables can be either numerical or categorical.For example, some demographic variables are categorical and they are important to identify population groups of interest.Although methods for handling or imputing missing values for numerical variables has been extensively discussed in the literature (e.g., Rubin, 1978;1987;1996), research on how to impute missing values in categorical variables is limited, particularly when the missing pattern is not monotone.A dataset with multiple variables is said to have a monotone missing pattern when variables in the dataset can be sorted in an order such that when a variable is missing for an observation, then all subsequent variables are missing for the observation.
There are many methods to handle data with a monotone missing pattern, for example, the propensity score method (Rosenbaum and Rubin, 1983) and the discriminant function method (Brand 1999, pp. 95-96).However, for data with a non-monotone missing pattern, the only method that has been widely used is the Markov Chain Monte Carlo (MCMC) method (Schafer, 1997) although other methods are available, for example, the multiple imputation using chained equations (van Buuren and Oudshoorn, 1999).When the variables with missing values are categorical, imputation of missing values has not been adequately addressed.Schafer (1997) introduced a saturated multinomial model from a Bayesian perspective.Allison (2001) proposed a method using MCMC with each categorical variable expressed by a group of dummy variables.During the imputation process, the dummy variables are treated as numerical variables.After imputation, the imputed numerical values for the dummy variables are converted back to categorical values for the original categorical variables.Although the MCMC method is unbiased under appropriate assumptions, the variable conversion introduces bias (Horton et al., 2003, Ake, 2005, Song et al., 2010).In this paper, we use the saturated multinomial model to impute data with a non-monotone missing pattern of categorical variables and compare the performance of this approach with the one based on the MCMC method.
Our work was motivated by a missing data problem in the US national HIV case surveillance system.The system collects many variables and most of them are categorical.Among the categorical variables is the case's HIV transmission category, which summarizes the multiple risks that the individual may have taken by selecting the one through which HIV was most likely to have been acquired.Transmission category is an important variable that identifies the populations at high risk of HIV infection.However, for a substantial proportion of the HIV cases reported to the Centers for Disease Control and Prevention (CDC) risk factor information is missing so their transmission category is unknown.The proportion of HIV cases with unknown transmission category has been increasing in recent years.In 1994, less than 20% of HIV cases were reported to the CDC without a transmission category, while in 2007, the proportion increased to approximately 40% (Harrison et al., 2008).In addition, there are other numerical and categorical variables with missing values in the HIV surveillance database and the missing pattern of these variables is not monotone.
The data collected by the HIV surveillance system is the cornerstone for monitoring and characterizing the epidemic and for planning and evaluating HIVrelated prevention and care programs at the local, state and national level in the United States (CDC, 2010).Handling the missing data in this database is essential to reducing or controlling the biases in all subsequent analyses.In this paper, we introduce two imputation methods for categorical variables with a non-monotone missing pattern.The two methods use multinomial distributions to model categorical variables and impute missing values directly from multinomial distributions.Therefore, they reduce the bias introduced by the method that converts variables from categorical to numerical before imputation and then from numerical back to categorical after imputation.The proposed methods are described in the next section.To compare the performance of the proposed methods and the method for numerical variables with variable conversions (the MCMC method with variable conversion or simply the MCMC method), we conducted a simulation study.The simulation study design and results are presented in Section 3. We also applied one of the proposed imputation methods to the HIV surveillance database, which has plenty of missing data, and compared results with those derived from the MCMC method.

Imputation Methods
) denote the set of all the variables of interest, where r is the total number of variables and all variables are categorical.For an observation of X (called an observed case), say x, there are two possible outcomes: no value is missing on any of the r variables (called a complete case or observation) or at least one of the variables has a missing value (called an incomplete case or observation).Let S be the set of all observed cases with or without missing components, S k the set of all observed complete cases, and S m the set of all observed incomplete cases.For an incomplete observation x in S m , we denote the set of covariates with known values by X k , and the set of covariates with missing values by X m .If Y is a subset of (X 1 , X 2 , • • • , X r ), we denote by x(Y ) the set of values taken by the variables in Y .Also, we denote x(X k ) by x k and x(X m ) by x m .
Without loss of generality, we use positive integers to represent the levels of each variable and 0 for missing values.Thus, x(Y ) = 0 means that the observation has missing values for all of the variables in Y .Our goal is to substitute missing values in all incomplete observations with plausible values so that we have data sets with complete information on every observation for further statistical analyses.
Suppose that the joint probability distribution function P X (•) is known.For an observation x in S m , X = X k ∪ X m .The probability distribution of X m conditional on X k = x k is given by Our imputation method replaces the missing values of x(X m ) with random values generated from the above conditional probability distribution, as follows.
(1) Estimate the joint probability distribution function P X (•) using all observed cases with or without missing values.
(2) Draw a random sample from the conditional distribution for each observation with missing values.(3) Repeat ( 1) and ( 2) to obtain multiple imputed values for each missing value.Note that the joint probability distribution function from (1) is only an estimate.Multiple samples from the same estimate will be correlated.Thus, one should not draw multiple samples at step (2) to obtain multiple imputed values for each missing value.
The joint distribution P X of X is unknown.It has to be estimated from the observed data.The distribution of X can be considered as multinomial.The possible values of X are combinations of all possible values of all variables in X.If the dataset has no missing values, then the maximum likelihood estimator (MLE) for P X is the frequency distribution of the data.If the dataset contains missing values and these values are missing not completely at random, then the joint distribution P X of X cannot be estimated by only using data with complete information on all variables.However, if missing values are missing not completely at random but conditionally at random based on the observed data, then we can use all observed data (including observations with missing values) to estimate the joint distribution.
We consider data with a non-monotone missing pattern and missing values are missing conditionally at random.We describe two iterative methods to estimate the joint distribution.The first method is based on the expectation-maximization (EM) algorithm (Dempster et al., 1977, Little andRubin, 1987), and the second method is based on an imputation-maximization (IM) algorithm described later in this section.The EM algorithm requires that the data are organized in the form of Table 1, where x i , i = 1, 2, • • • , n, are all possible values of X, and f (x i ) is the frequency of x i or the total number of cases with the value x i .The EM algorithm is as follows: 1. Posit an initial estimate for the joint distribution P X (•).This can simply be a uniform distribution, or estimated from the complete cases.
2. E-step: For an incomplete case with x = (x k , x m ), estimate the probability distribution of X m conditional on X k = x k by substituting the current estimate of P X (•) into (1).Make a table in the form of Table 1, where the X-column contains all combinations of levels of the variables of X satisfying y(X k ) = x k and the frequencies are the respective estimated conditional probabilities.This is equivalent to splitting the case into several "fractional" cases.(Each row in the table represents a "fractional" case since the frequency is a probability.) 3. M-step: Combine all fractional cases from the allocation of incomplete cases in Step 2 with the original complete cases to form a new data set.
The new dataset has no missing values and is used to update the current MLE of the joint distribution P X (•).
4. Repeat the E-step and M-step until there is no significant change in the estimate of the joint distribution P X (•).
The EM algorithm allocates fractions of incomplete (or partially classified) cases to form new data sets without missing values according to the estimated conditional distribution.However, we can also randomly allocate the partially classified cases to create a dataset for the next step.This is equivalent to filling in the missing values of an incomplete case by plausible values.So, instead of distributing a case in S m into multiple fractional cases in Step 3 of the EM algorithm, we distribute the case as a whole into only one of the possible values of X m based on a draw from the conditional distribution determined by (1).This idea is similar to the one used in imputation so we call it the Imputation-Maximization (IM) algorithm.
Both algorithms will converge if the estimate of the joint distribution of X computed from complete cases does not severely deviate from the true distribution.This is usually the case when there are a reasonable number of complete cases.Further, it is easy to verify that the IM algorithm creates a Markov chain (Schafer, 1997) ( where θ is the parameter vector of the joint distribution of X.Hence, the IM algorithm is essentially a MCMC type of method. An extreme case is that, for an incomplete observation x there is no complete observation y with y(X k ) = x k , which implies that the set {y : y(X k ) = x k and PX (y) = 0} is empty, where PX refers to any estimate of P X (•) that appears in the computing process of the EM or IM algorithm.Consequently, the conditional probability distribution of X m given X k = x k cannot be estimated from (1) in the iterative process of the EM or IM algorithm.Therefore, this type of incomplete observation must be excluded from the computing process of the EM or IM algorithm for estimating the joint probability distribution of X.Fortunately, in practice such observations are rare when the number of covariates is small but the total number of observations is not small.In such cases, we suggest using the following approximation method to estimate the conditional probability distribution of Let PX be an estimate of the joint probability distribution of X obtained by the EM or IM algorithm.Let Y be the largest subset of X k such that the set is an approximation to the conditional probability of X m = x m i given X k = x k and can be used for imputing missing values.If the largest subset Y is not unique, then we estimate the conditional probability distribution of X m given X k = x k by the average of the probability distributions PY (•|X k = x k ) and use this distribution to fill in the missing values.

Simulation Studies
We have proposed two direct imputation methods, one based on the EM algorithm and the other based on the IM algorithm.To evaluate the performance of the two methods, we conducted two simulation studies.The general design of the simulation studies is as follows.Given the joint distribution of a set of categorical variables, we simulate 1,000 random samples, each with 1,000 cases from the given joint distribution.For each simulated random sample, a number of cases were selected.For each selected case, some variables' values were removed and 10 plausible values were imputed for each missing value using each of the three imputation methods: the MCMC method with variable conversion and the two proposed direct imputation methods.Ten datasets with complete information were generated.Each data set was analyzed to estimate the joint and marginal distributions of the given categorical variables.Finally, estimates for each probability (p) from the 10 imputed data sets were combined by using the standard multiple imputation procedure.The following statistics were computed from the 1,000 simulated random samples for each imputation method.
• the average of the estimates for p: p, • the bias (average of (p − p)), • the relative bias (r-bias, average of 100 × (p − p)/p), • the average length (AveLen) and the coverage rate (CR) of the 95% confidence intervals.
The details of the two simulation studies are described in the next two subsections.

Simulation Study I: Single Variable with Missing Value
We first consider a simple missing data problem where there are only two variables in the data set, both categorical, and only one can have missing values.Call these two variables X and Y .See Table 2 for the joint and marginal probability distributions of X and Y .Suppose that the probability of missing Y is 0.35 if X = 1, 0.25 if X = 2, and 0.20 if X = 3.The missing value probabilities and distribution of X and Y are chosen to match those of AIDS cases among males with age ≥ 13 at AIDS diagnosis: X is the variable for race/ethnicity groups: non-Hispanic Blacks, Hispanics, and non-Hispanic Whites, distributed as 45%, 20% and 35%, respectively.Y represents the variable for transmission categories: men who have sex with men (MSM) only, injection drug use (IDU) only, MSM and IDU, and high-risk heterosexual contact.The percentages for each of the four groups are 65%, 10%, 5% and 20% for non-Hispanic Blacks, 65%, 15%, 5% and 15% for Hispanics, and 85%, 5%, 5% and 5% for non-Hispanic Whites, respectively.Now, we consider a more complex missing data problem where several variables in the data could have missing values and the missing pattern is nonmonotone.Suppose that there are three categorical variables: X, Y, and Z with the joint probability distribution specified in Table 3.We note that the variables Y and Z result from splitting the variable Y in the previous simulation study.We assume that Y and Z could have missing values, but X does not have any missing values.Similar to simulation study I, the probability of missing either Y or Z depends on X with the probability 0.35 when X = 1, 0.25 when X = 2, and 0.20 when X = 3.Among those with either Y or Z missing, the probability of missing Y only is 0.50, missing Z only is 0.30, and missing both Y and Z is 0.20.P (Y = 1) = 0.8125 P (Y = 2) = 0.1875 P (Z = 1) = 0.7700 P (Z = 2) = 0.2300

Simulation Results
In the analysis of the simulated data, we first ignored the incomplete or partially classified cases and used only the complete cases to estimate the joint and marginal distributions under each simulation setting as this approach is common practice.We then applied the MCMC method and the two direct imputation methods to impute missing values and used them to estimate the joint and marginal distributions.Results for missing values in a single variable are presented in Tables 4 (joint distribution) and 5 (marginal distribution).Results for missing values in two variables with a non-monotone missing pattern are presented in Tables 6 (joint distribution) and 7 (marginal distribution).
In both simulation settings, the estimates based on complete cases only are seriously biased.This is because missing value in both settings is not completely at random.This bias was removed or reduced by imputing values for data with incomplete information.
The biases associated with the complete case analysis are significantly reduced, but not completely eliminated by the MCMC method (Tables 4-7).The remaining bias is caused by the conversion from the imputed numerical values back to the categorical values.In Simulation Study I, the largest relative bias is 18.6%, which occurred when X = 3 and Y = 4 while in Simulation Study II, the largest relative bias is 15.8%, which occurred when X = 1, Y = 2 and Z = 1.The method for estimating a proportion by the corresponding sample proportion is less efficient when the proportion is small so the bias introduced by variable conversion is in general more severe in this case.However, in both simulation studies, all the relative biases are less than one percent by the direct imputation methods.In fact, the simulation results show that the direct methods are almost unbiased.This is expected since the direct imputation methods bypass variable conversion, an important source for biases.
From Simulation Study I, we see that the confidence intervals perform comparably for the three methods (Tables 4-5).However, from Simulation Study II, we see that the coverage probabilities of the 95% confidence intervals based on the MCMC method are higher when the probability p is low (Tables 6-7).The computation of the confidence intervals is based on a normal approximation, which is poor when the probability of interest is close to zero or one.Thus, we computed the average lengths and coverage rates from 1,000 simulated random samples, each with 1,000 cases, without any missing values.Results from the two direct methods are similar to those observed for simulated random samples without missing values.So, it seems that the direct methods would create imputed data sets which are closer to samples from the correct distribution (results not shown).
Finally, as noted, the variable Y in Study I is split into two variables in Study II (Y and Z).When the variable Y in Study I is missing, then one of the two variables Y or Z in Study II must be missing, but they need not both be missing.Therefore, more information is lost in study I than in study II.This explains why the MCMC approach is not as biased in Study II.

Application
In this section, we apply both the direct IM-imputation method and the MCMC method to imputing missing values of categorical variables in the US national HIV case surveillance data.This dataset contains all AIDS cases diagnosed since the early 1980s and reported to CDC by June of 2009.We examined only AIDS cases diagnosed between 2000 through 2007.There were about 300,000 adults and adolescents (aged 13 years or older) diagnosed with AIDS during this time period.Among them 21% had unknown transmission category (19% among males and 29% among females).
Having a missing value for transmission category was correlated with many variables.They include, but are not limited to, sex, race/ethnicity, birth country, age at diagnosis, geographic region of residence at diagnosis, population size of metropolitan statistical area (MSA) of residence, year of diagnosis, and the type of facility where the person was diagnosed.In addition to missing transmission category, other variables could be missing.For example, 14% of AIDS cases had missing birth country and 18% had missing diagnosis facility type.Both variables are correlated with the transmission category variable.Also, the missing pattern of these variables is not monotone.
Note that except for age, year of diagnosis, and population size of MSA of residence, all the above variables are categorical.Population size is reported as one of a small number of categories in the surveillance database, and for presentation and analysis purposes, we have likewise used age groups and time intervals to categorize these two variables into a small number of categories.With all other variables being categorical, the distribution of AIDS cases stratified by these variables can be viewed as multinomial.Therefore, missing values of these variables can be imputed using the methods proposed in this paper.
Based on the proportion of data with missing values, we choose to impute 10 plausible values for each missing value to achieve a 95% relative efficiency (Rubin 1987, p. 114).Because the numbers of transmission categories for males and females are different, we performed the imputation separately for males and females.The relative differences between the two methods in estimating the distribution probabilities ranged from −5.3% to 2.9%.The difference is significant for both female transmission category proportions and for the male heterosexual proportion.The direct method estimate is higher for the heterosexual proportion in both male and female population groups (Table 8).

Summary
In this paper, we evaluated two direct approaches for filling in missing values of categorical variables, which were motivated by the problem of imputing missing HIV transmission category in the national HIV case surveillance database.The methods described here are applicable to any situation where only categorical data are involved.The MCMC method often assumes that the joint distribution of the variables under consideration is multivariate normal, which may not be true in practice.In contrast, direct imputation approaches impose no assumption on the joint distribution of a set of categorical variables.However, the natural distribution for a discrete variable is multinomial.Hence, one may consider that the joint distribution is multinomial.This is another attractive feature of the proposed methods.
Note that both IM and EM algorithms perform equally well and their differences are hardly distinguishable.In addition, as they are almost equally efficient in terms of computing time and both algorithms are fast, using either one of them should be practical.

Table 2 :
Joint and marginal distributions of X and Y in Simulation Study I

Table 3 :
Joint and marginal distributions of X, Y , and Z in Simulation Study II

Table 4 :
Simulation Study I: Estimated joint probability distribution of (X, Y )

Table 8 :
Estimated distributions of transmission categories among AIDS cases diagnosed in the United States, 2000-2007, based on the MCMC and the direct IM-imputation methods who have sex with men; IDU, injection drug use; HRH, High-risk heterosexual contact.