Missing Information as a Diagnostic Tool for Latent Class Analysis

Abstract: Latent class analysis (LCA) is a popular method for analyzing multiple categorical outcomes. Given the potential for LCA model assumptions to influence inference, model diagnostics are a particularly important part of LCA. We suggest using the rate of missing information as an additional diagnostic tool. The rate of missing information gives an indication of the amount of information missing as a result of observing multiple surrogates in place of the underlying latent variable of interest and provides a measure of how confident one can be in the model results. Simulation studies and real data examples are presented to explore the usefulness of the proposed measure.


Introduction
Latent class analysis (LCA) has a long history in the social and behavioral sciences (Lazarsfeld, 1950; Lazarsfeld and Henry, 1968; McCutcheon, 1987; Clogg, 1995) and has gained considerable attention in biostatistics over the past two decades (Garrett, Eaton and Zeger, 2002; Garrett and Zeger, 2000; Bandeen-Roche et al., 1997; Formann, 1996). In general, LCA is used to explain relationships among multiple categorical variables. Specifically, LCA may be used to describe the prevalence and symptomatology of a mental disorder or health status that is measured via multiple indicators, or to explore subgroups of the disorder or disease (Storr et al., 2004; Moran et al., 2004; Nestadt et al., 2003; Fergusson et al., 1995; Eaton and Bohrnstedt, 1989). In medical diagnostics, LCA may be used to measure the sensitivity and specificity of diagnostic tests in the absence of a gold standard (Garrett et al., 2002; Formann, 1996; Butler et al., 2003) or to develop or evaluate diagnostic criteria (Fossati et al., 2001; Young et al., 1983; Young, 1982). More recently, latent class models have been extended to regression settings. Latent class and latent transition regression have been proposed for quantifying the association between risk factors and latent health status when multiple surrogates are collected in lieu of a single adequate measure of health status (Miglioretti, 2003; Humphreys and Janson, 2000; Bandeen-Roche et al., 1997; Dayton and Macready, 1988). Growth mixture modeling has been proposed for identifying and describing subgroups of individuals with different longitudinal trajectories (Muthen et al., 2002; Muthen, 2004). Latent class survival models have been proposed for modeling time-to-event data (Rosen and Tanner, 1999; Lin et al., 2002).
In this paper, we propose a complementary model diagnostic measure, the "rate of missing information," which provides insight into the value of surrogates in measuring the latent variable of interest and the usefulness of the fitted latent class model. This measure may also be used to guide the design of future studies. The concept of information was introduced to statistics by Fisher in the 1920s.
In the statistical sense, information refers to the amount of information in the sample about the population parameters of interest. For incomplete data sets, the amount of missing information can be estimated by the difference between the hypothetical information given complete data and the observed information in the incomplete data. The rate of missing information (Rubin, 1987) is the proportion of the missing information relative to the complete-data information and provides a measure of how not observing the missing data contributes to uncertainty about the population parameters of interest. For LCA, this corresponds to the rate of information missing due to measurement error associated with the observed outcomes being surrogates of the underlying latent variable.
In LCA, the latent class memberships can be considered missing data, and the rate of missing information can be easily obtained when multiple imputation (Rubin, 1987; Schafer, 1997) with data augmentation (Tanner and Wong, 1987) is used for model estimation. In the LCA setting, the rate of missing information provides a measure of how observing surrogates in place of the latent variable of interest contributes to uncertainty about the parameters. We use simulations to explore how this measure depends on sample size, number of observed items, class size, class-specific item prevalences, and number of classes. In addition, we examine how it relates to bias and variability of the parameter estimates and the ability to accurately estimate true class membership.
We begin our article with an LCA review and an introduction of the rate of missing information in this context. Next, we provide simulation results. We then apply these methods to real data examples. We end with a discussion of the proposed measure.

Latent class analysis
Let Y_i = (Y_i1, . . ., Y_iJ) denote a vector of binary indicators for the ith individual, i = 1, . . ., N, with J observed measures (e.g., diagnostic tests, health indicators, or some other surrogates of the underlying latent variable of interest). For simplicity, we consider the case of binary observed measures. Extension to categorical outcomes is straightforward. The basic idea of LCA is that association among Y_i arises because the study population is comprised of a mixture of subpopulations or classes (e.g., diseased and not diseased individuals in the medical diagnostics context). Let S_i ∈ {1, . . ., K} indicate latent class membership for the ith individual and γ_k = P(S_i = k) represent the prevalence of class k. There are two basic assumptions in LCA. First, individuals have common response probabilities within a class k: ρ_jk = P(Y_ij = 1 | S_i = k). Second, observed responses y_i are independent given class membership S_i: P(Y_i = y_i | S_i = k) = ∏_{j=1}^{J} ρ_jk^{y_ij} (1 − ρ_jk)^{1 − y_ij}. Given these two assumptions, the observed data likelihood may be expressed as P(Y_i = y_i) = ∑_{k=1}^{K} γ_k ∏_{j=1}^{J} ρ_jk^{y_ij} (1 − ρ_jk)^{1 − y_ij}. Latent class regression extends the traditional latent class model to allow the probability of class membership γ_k to depend on a 1 × P vector of covariates x_i (Dayton and Macready, 1988; Bandeen-Roche et al., 1997) via polytomous regression. Because the latent classes are not necessarily ordered, this relationship is typically modeled using a generalized logit link: log(γ_ik / γ_iK) = x_i β_k, k = 1, . . ., K − 1, where β_K = 0 for identifiability. Each latent class has a unique set of regression parameters, with the parameters for the reference class (here, the last class) set equal to zero for identifiability.
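To make the model concrete, the observed-data likelihood can be evaluated directly from the class prevalences and class-specific item probabilities. The following NumPy sketch is ours, not from the original analysis; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def lca_loglik(Y, gamma, rho):
    """Observed-data log-likelihood of a latent class model.

    Y     : (N, J) array of binary indicators
    gamma : (K,) class prevalences, summing to one
    rho   : (K, J) class-specific item probabilities P(Y_ij = 1 | S_i = k)
    """
    # P(Y_i | S_i = k) under local independence: product over the J items.
    # Broadcasting Y[:, None, :] against rho[None, :, :] gives shape (N, K, J).
    p_item = rho[None, :, :] ** Y[:, None, :] * (1 - rho[None, :, :]) ** (1 - Y[:, None, :])
    p_class = p_item.prod(axis=2)                  # (N, K): P(Y_i | S_i = k)
    mix = (gamma[None, :] * p_class).sum(axis=1)   # (N,): mixture probability of Y_i
    return np.log(mix).sum()
```

Maximizing this function over (γ, ρ), e.g. with the EM algorithm discussed in the next section, yields the maximum likelihood estimates.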

Rate of Missing Information
Latent classes can be viewed as variables that are missing with probability one; therefore, missing data methodology may be used to fit LCA models. For example, the EM algorithm has long been used to estimate LCA parameters (Goodman, 1978). When data are missing with probability one, the data are missing completely at random (MCAR) (Rubin, 1987), i.e., the missingness (the missing data indicators or the process that causes the missing values) does not depend on any variable in the study. The common missing at random (MAR) assumption (Rubin, 1987), for which the missingness may depend on observed data but not on missing data, is a more general assumption that is implied by the MCAR assumption. In this case one can assume ignorability and refrain from modeling the missingness (Schafer and Graham, 2002; Harel and Zhou, 2007).
Multiple imputation (MI) (Rubin, 1987, 1996; Schafer, 1997; Schafer and Graham, 2002; Harel and Zhou, 2007) is a simulation-based technique for dealing with missing values. Generally speaking, each missing value is replaced with a set of m > 1 plausible values, resulting in m complete data sets that differ only in the imputed values. Each complete data set is analyzed using a complete-data methodology, and the estimates and standard errors are saved, yielding m sets of estimates and standard errors. Combining the results using Rubin's simple arithmetic rules produces a final result that takes into account both the uncertainty in the data and the uncertainty due to the missing values. When using MI, the rate of missing information due to the missing values (here, the unobserved class memberships) may be easily estimated (Rubin, 1987).
We assume a joint model for the complete data Y_com = (Y_obs, Y_mis) and the missingness M, where Y_obs are the observed binary indicators, Y_mis are the missing latent class memberships, and M is the set of missingness indicators that separate the complete data into the observed and missing parts. To apply MI in the LCA context, m independent versions of the latent class memberships, Y_mis^(1), . . ., Y_mis^(m), are imputed from P(Y_mis | Y_obs, M). Because the latent class memberships are MCAR, we can ignore the missingness model and impute from P(Y_mis | Y_obs). Next, the m sets of original data with imputed class assignments are separately analyzed. Finally, the resulting m sets of point estimates and standard errors are combined using Rubin's (1987) rules, described below.
Let Q represent the set of LCA parameters, where Q = (γ, ρ) is a JK + K − 1 dimensional vector for standard LCA and Q = (β, ρ) is a JK + (K − 1)P dimensional vector in the latent class regression setting. Let Q̂ = Q̂(Y_obs, Y_mis) denote the estimate for Q if the complete data were available and U = U(Y_obs, Y_mis) denote its variance estimate. We assume that with complete data, each parameter estimate Q̂_q follows a normal distribution, (Q̂_q − Q_q)/√U_q ∼ Normal(0, 1). (2.1) In the absence of Y_mis, the MI point estimate is the average of the m imputed-data estimates, Q̄_q = m^{−1} ∑_{l=1}^{m} Q̂_q^{(l)}, with total variance T_q = Ū_q + (1 + 1/m)B_q, where Ū_q = m^{−1} ∑_{l=1}^{m} U_q^{(l)} is the estimated complete-data variance and B_q = (m − 1)^{−1} ∑_{l=1}^{m} (Q̂_q^{(l)} − Q̄_q)² is the between-imputation variance. Tests and confidence intervals may be based on a Student's t approximation, (Q̄_q − Q_q)/√T_q ∼ t_ν with ν = (m − 1)(1 + 1/r_q)² degrees of freedom. If Y_mis carries no information about Q_q, the imputed-data estimates Q̂_q^{(l)} would be identical and T_q would reduce to Ū_q. Therefore, an estimate of the rate of missing information due to not observing Y_mis, i.e., the rate of missing information, is λ̂_q = (r_q + 2/(ν + 3))/(r_q + 1), where r_q = (1 + 1/m)B_q/Ū_q. In LCA, this measure represents the rate of information missing due to the lack of knowledge about the individual class memberships, which can be translated to the model measurement error due to using the observed surrogates in lieu of the latent class memberships. If class memberships were observed, there would be no missing information, and hence the measurement error would be eliminated.
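Rubin's combining rules and the resulting rate of missing information are straightforward to compute once the m imputed-data estimates are in hand. The helper below is our illustrative sketch for a single scalar parameter, not code from the paper; it applies the degrees-of-freedom adjustment from Rubin (1987).

```python
import numpy as np

def combine_mi(estimates, variances):
    """Rubin's (1987) rules for m imputed-data estimates and variances of one
    scalar parameter; returns the pooled estimate, total variance, and the
    estimated rate of missing information."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    m = len(est)
    qbar = est.mean()                      # pooled point estimate
    ubar = var.mean()                      # within-imputation variance
    b = est.var(ddof=1)                    # between-imputation variance
    t = ubar + (1 + 1 / m) * b             # total variance
    r = (1 + 1 / m) * b / ubar             # relative increase in variance
    nu = (m - 1) * (1 + 1 / r) ** 2        # Student's t degrees of freedom
    lam = (r + 2 / (nu + 3)) / (r + 1)     # rate of missing information
    return qbar, t, lam
```

Note that when the m estimates are identical (B_q = 0), r_q is zero and the sketch would divide by zero; in practice the imputed-data estimates always differ at least slightly.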
For LCA, the unobserved class memberships Y_mis can easily be imputed based on the posterior probabilities of class membership given the observed data and parameter estimates. To fully incorporate the variability of the estimated parameters, the posterior probabilities of class membership may be calculated for each imputation by drawing parameter values from their posterior distribution as in data augmentation (Lanza et al., 2005) or by drawing values from a multivariate normal distribution with mean and covariance equal to the maximum likelihood estimates. Given the imputed class memberships, the binomial distribution may be used to obtain the complete-data parameter estimates Q̂^(1), . . ., Q̂^(m) and the corresponding complete-data variances Û^(1), . . ., Û^(m). For example, γ_k = P(S_i = k) may be estimated as the proportion of subjects in each latent class, n_k/N, with variance γ̂_k(1 − γ̂_k)/N. The class-specific item prevalence ρ_jk may be estimated as the proportion of subjects in class k with item j equal to 1, with variance ρ̂_jk(1 − ρ̂_jk)/n_k. For latent class regression (LCR), the regression coefficients and their variances given the imputed class memberships Y_mis^(1), . . ., Y_mis^(m) may be estimated using standard complete-data polytomous regression.
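The imputation step for the class memberships can be sketched as follows. This is our illustrative implementation of drawing each S_i from its posterior given fixed parameter values; the additional step of first drawing the parameters themselves (from their posterior or a normal approximation) is omitted for brevity.

```python
import numpy as np

def impute_classes(Y, gamma, rho, rng):
    """Draw one imputation of latent class memberships from the posterior
    P(S_i = k | Y_i) implied by (gamma, rho).

    Y     : (N, J) array of binary indicators
    gamma : (K,) class prevalences
    rho   : (K, J) class-specific item probabilities
    rng   : numpy random Generator
    """
    # Unnormalized posterior: gamma_k * P(Y_i | S_i = k), shape (N, K).
    p_item = rho[None, :, :] ** Y[:, None, :] * (1 - rho[None, :, :]) ** (1 - Y[:, None, :])
    post = gamma[None, :] * p_item.prod(axis=2)
    post /= post.sum(axis=1, keepdims=True)
    # Inverse-CDF sampling of one class per subject from its posterior row.
    cum = post.cumsum(axis=1)
    u = rng.random((len(Y), 1))
    return (u > cum).sum(axis=1)   # class index in {0, ..., K-1}
```

Repeating this draw m times (with fresh parameter draws each time) yields the m imputed data sets combined by Rubin's rules.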

Simulations
To explore the properties of the rate of missing information in the latent class setting, we conducted a simulation study. We focus on two class models because it is easier to manipulate the parameters in a systematic way and study the resulting behavior. We first explore the traditional LCA case and then look at the LCR setting.

Latent class analysis
For our first simulation in the LCA setting, we generated 100 simulated data sets from each of 32 models with two latent classes as follows. The prevalence of class 1, γ_1, was set to 0.6 or 0.8, and the response probabilities given class membership, ρ, were set to 0.10, 0.15, 0.20, or 0.25 for class 1 and 0.90, 0.85, 0.80, or 0.75 for class 2. Data were generated from models with four and five items and with sample sizes of 100 and 1000. We imputed 100 sets of class memberships from the posterior probabilities of class membership given the observed data, after sampling 100 sets of parameter values from a multivariate normal distribution with mean and variance estimated from the LCA model, and calculated the rates of missing information as described in the methods section. The mean rates of missing information across the 100 simulated data sets are summarized in Table 1. There was very little variation in the rates of missing information across ρ values within the same class, so the mean value across the four or five ρ values is presented for simplicity.
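One replicate of this simulation design can be generated as in the sketch below. The function and variable names are our own, and the condition shown (γ_1 = 0.6, ρ = 0.10 versus 0.90, four items, N = 1000) is one of the 32 cells described above.

```python
import numpy as np

def simulate_lca(N, gamma, rho, rng):
    """Generate one data set from a latent class model: draw class
    memberships from gamma, then items independently given class."""
    K, J = rho.shape
    S = rng.choice(K, size=N, p=gamma)             # true class memberships
    Y = (rng.random((N, J)) < rho[S]).astype(int)  # items given class
    return S, Y

rng = np.random.default_rng(1)
gamma = np.array([0.6, 0.4])                # prevalence of classes 1 and 2
rho = np.array([[0.10] * 4, [0.90] * 4])    # four items, low measurement error
S, Y = simulate_lca(1000, gamma, rho, rng)
```

In the study itself, each such data set would then be fit by LCA and the true memberships S discarded before imputation.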
The most notable changes in the rate of missing information occur with changes in the response probabilities. Response probabilities near 0.5 indicate a higher degree of measurement error, which is reflected in the dramatic increase in the rate of missing information as ρ moves from 0.10 and 0.90 towards 0.25 and 0.75 for classes 1 and 2, respectively. The rate of missing information is also a function of the number of items, with lower values for models with 5 items compared to those with only 4 items. The rate of missing information is lower
Table 2 displays the results from the second simulation. As expected given the equal class sizes, the rate of missing information was similar for class 1 and class 2 parameters; therefore, we only present results for class 1. Increasing the number of items with low measurement error reduces the rates of missing information for the class prevalence γ and for response probabilities ρ of the same value; however, somewhat surprisingly, within a model the rate of missing information is larger for items with lower measurement error, i.e., values closer to zero or one, compared to values closer to 0.5. As in the first simulation, there are no clear patterns with increasing sample size.
To better understand how the rate of missing information may provide insight into the value of surrogates in measuring the latent variable of interest and the usefulness of the fitted latent class model, we examined the finite sample bias and variability of the estimated parameter values for the simulated data sets (Table 3). Percent bias was defined as the difference between the mean of the estimated parameter values across the 100 simulated data sets and the true parameter value, divided by the absolute value of the true parameter value. The variance across the 100 imputed data sets was also calculated. For all cases, the bias is very small. In general, both the bias and variance decrease as the rate of missing information decreases and as the sample size increases. This might suggest a connection between the rate of missing information and the sample size required for asymptotic results to hold: as the rate of missing information increases, a larger sample size is needed to obtain unbiased estimates.
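As a small illustration (our code, not the authors'), the percent bias criterion just defined can be computed as:

```python
import numpy as np

def percent_bias(estimates, truth):
    """Percent bias across simulated data sets:
    100 * (mean estimate - true value) / |true value|."""
    return 100 * (np.mean(estimates) - truth) / abs(truth)
```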
We also estimated the percent agreement between the predicted and the true latent class memberships (Table 3). Class memberships were imputed from the posterior probabilities of class membership given the observed data and the maximum likelihood parameter estimates. The percent agreement follows the same pattern as the rates of missing information; the percent agreement is higher for models with lower rates of missing information. Roughly, for rates of missing information above 50%, the percent agreement is less than 90%. Thus, the rate of missing information sheds light on the usefulness of the surrogates for classifying individuals.

Latent class regression
We also conducted a simulation study for LCR to understand the behavior of the rate of missing information in the regression setting. We modeled the probabilities of class membership as a function of two covariates, where x_1 is a binary variable with prevalence 0.6 and x_2 is a continuous variable sampled from a Normal(0, 1) distribution. We set β_0 = 0.4, β_1 = 1, and β_2 = −1. The response probabilities, sample size, and number of items were varied as in the first LCA simulation. We imputed 100 sets of class memberships from the posterior probabilities and estimated the rates of missing information for the class-specific item prevalences using the binomial distribution, as described above. Regression coefficients were estimated from the 100 sets of complete data using logistic regression, and results were combined to estimate the rates of missing information using PROC MIANALYZE, available in SAS version 8.2 or higher (SAS Institute, Inc., Cary, NC).
The rates of missing information for the LCR models are summarized in Table 4. As before, the most notable changes in the rates of missing information occur with the change of response probabilities. Increasing the number of items decreases the rates of missing information, while the sample size does not have much of an effect. The bias and variance of the LCR parameter estimates from the simulated data are shown in Table 5. We only report the values for the β vector, because the information for the other parameters was discussed in the previous section. The percent bias, while generally small, is larger for the regression coefficients compared to the class and item-specific prevalences described in the previous section. As in the previous simulation, the variance decreases with decreasing rates of missing information and increasing sample size; however, the pattern is less consistent for the percent bias. A noteworthy result is that the biases for β_1 and β_2 are away from the null, i.e., the absolute size of the regression coefficients tends to be overestimated. The bias may be large enough in the models with 100 subjects to be of potential concern: a 16% bias on the log-odds scale results in an odds ratio of 2.7 being overestimated as 3.2. This is in contrast with the influence of nondifferential misclassification, which is known to bias risk estimates towards the null (Copeland et al., 1977; Flegal et al., 1986). In the latent class regression setting, which attempts to correct for misclassification, it seems plausible that any potential finite-sample bias may be away from the null, because covariates are also being used to help identify an individual's true class membership. This may tend to maximize the estimated relationship between the covariates and class membership in small samples.

Examples
To illustrate the proposed measure, we reanalyze data from two previously published studies. The first is a latent class analysis presented by Garrett and Zeger (2000) on depression from the Epidemiologic Catchment Area Program. The second is a latent class regression analysis presented by Bandeen-Roche et al. (1997) on mobility disability from the Women's Health and Aging Study.

Latent class analysis
The National Institute of Mental Health (NIMH) Epidemiologic Catchment Area Program (ECA) is a five-site epidemiologic study focusing on mental health (Eaton, Reiger and Locke, 1981). Garrett and Zeger (2000) analyzed data from 2,938 individuals interviewed at the Baltimore site in 1981. The goal was to use 17 questions from the NIMH Diagnostic Interview Schedule to measure 6-month prevalence of depression. These questions were grouped into 9 items (see Table 6) and analyzed using latent class analysis fitted using a Bayesian approach. Garrett and Zeger (2000) concluded that the three class model is "statistically the most appropriate." The four class model was not well identified, and the three class model was judged to fit better than the two class model.
We reanalyzed the data using the freeware WinLTA (Collins et al., 1999; Collins et al., 2001). WinLTA uses the EM algorithm to find the maximum likelihood estimate and data augmentation (DA) for Bayesian estimation, variance estimation, and multiple imputation. When using the DA tab in WinLTA for multiple imputation, the rates of missing information are given by default. The maximum likelihood estimates for two and three class models are summarized in Table 6 and are similar to the results of Garrett and Zeger (2000), though some small differences exist due to the different fitting approaches. When fitting a two class LCA model, the classes may be defined as a not depressed group and a depressed group. The depressed group, comprising 13% of the population, has moderate to high probabilities of all symptoms (20-63%), with an average of about four total reported symptoms per person. The majority of the population (87%) are not depressed and thus have very low probability of reporting any symptoms (< 6%). The three class model has a not depressed group and two depression groups, which we labeled minor and major depression. The minor depression group has a prevalence of 16% and has low to moderate probabilities for all symptoms (12-49%), with an average of about two symptoms per person. The major depression group, comprising only a small fraction of the population (3%), has high probability of all symptoms (34-81%), with an average of six symptoms per person.
The rates of missing information are reasonable for the two class model (20% − 30%), but moderate to high for the three class model (40% − 70%). This suggests there is sufficient information in these nine symptoms to reliably classify individuals into depressed versus not depressed groups; however, there may not be enough information to reliably distinguish people with minor depression. This is consistent with the findings of Garrett, Eaton and Zeger (2002), who used a latent class approach to evaluate diagnostic criteria for depression. Based on the positive and negative predictive values estimated from the model, they concluded these nine symptoms provide essentially no information about minor depression. This is supported by the psychiatric literature, which has not yet developed consistently used criteria for diagnosing minor depression (Pincus, Davis and McQueen, 1999). The Diagnostic and Statistical Manual of Mental Disorders 4th edition (DSM-IV) defines widely accepted criteria for diagnosing major depressive disorder; however, criteria for minor depression are described in Appendix B with mental disorders that are considered to have "insufficient information to include as official categories" (APA, 1994).

Latent class regression
For the LCR analysis example we used the data previously analyzed by Bandeen-Roche et al. (1997) from the Women's Health and Aging Study (WHAS), a study of the course of disability among moderately and severely disabled elderly women in Baltimore, Maryland. The WHAS study followed 1,002 disabled women aged 65 and older from November 1992 to February 1995. For this study we use population-based data from 3,543 women who were interviewed as part of the baseline screener.
The WHAS instrument included self-reported measures of disability, disease, and demographics. Following the analysis in Bandeen-Roche et al. (1997), we analyzed data from the following items that characterize mobility disability ("Without help, do you have any difficulty [doing a specific task]?"): walking 1/4 mile, climbing 10 steps, lifting up to 10 pounds, and getting in and out of bed or a chair. We regressed latent mobility disability status on age and arthritis status. We fit the LCR model using a SAS (SAS Institute, Inc., Cary, NC) macro written by the second author. This macro uses the EM algorithm with a Newton-Raphson step to find the maximum likelihood estimates for the model parameters (Bandeen-Roche et al., 1997). The results for the two and three class models are summarized in Table 7. Bandeen-Roche et al. (1997) concluded the three class model provided a reasonable fit to the data.
To estimate the rates of missing information, we imputed 100 sets of class memberships from the posterior probabilities of class membership after sampling 100 sets of parameter values from a multivariate normal distribution with mean and variance estimated from the LCR model. Regression coefficients were estimated from the imputed data sets using polytomous regression, and results were combined to estimate the rates of missing information using SAS's PROC MIANALYZE (SAS Institute, Inc., Cary, NC). The two class model shows 64% of women with no disability, having a low probability of reporting difficulty with any task (2-12%). The remaining 36% of women may be considered to have mobility disability, with high prevalence of task difficulties (39-89%) and, on average, reported difficulty with 3.4 of the 5 tasks. The odds of being in the disabled group is 4.3 times greater for women with arthritis (95% CI = 3.6 to 5.2) and 2.2 times greater for every 10 year increase in age (95% CI = 2.0 to 2.5).
The three class model shows 52% of women in the no disability group with very low probability (< 6%) of difficulty with any task.Twenty-nine percent of women may be considered to have mild disability, with low to moderate probability of difficulty with each task (14-66%) and an average of 2 reported task difficulties.
The remaining 19% of women fall in the severe disability group, with a high probability of task difficulties (57-95%) and reported difficulty with an average of 4 out of 5 tasks. The odds of being in the severe versus the no disability group is 6.8 times higher for women with arthritis (95% CI = 5.2 to 8.9) and 2.9 times higher for every 10 year increase in age (95% CI = 2.5 to 3.3). The odds of being in the severe versus the mild disability group is 2.2 times higher for women with arthritis (95% CI = 1.6 to 3.0) and 1.3 times higher for every 10 year increase in age (95% CI = 1.3 to 1.8).
The rates of missing information are high for the three class model (50% − 80%) but reasonable for the two class model (20% − 40%). This suggests that these items can be reliably used to classify patients into a healthy group of women who rarely report difficulty with any task and a disabled group with high probability of difficulty with three or more tasks; however, there may not be enough information in these items to reliably distinguish between women with mild versus severe disability. Despite this, given the large sample size, we may still be able to correctly quantify the influence of arthritis and age on the probability of being mildly versus severely disabled based on the latent class regression model; however, uncertainty will be larger than for the two class model. Because Bandeen-Roche et al. (1997) found that the three class model fit the data better than the two class model, the three class model is preferable for making inference about the association between risk factors and disability.

Discussion
In this paper we introduce the rate of missing information in the context of LCA and explore the use of this measure as a diagnostic tool for latent class analysis and regression.The rate of missing information gives an indication of the amount of information missing as a result of observing multiple surrogates in place of the underlying latent variable of interest, and provides a measure of how confident one can be in the model results.If inference is based on high levels of missing information, one might be skeptical about the accuracy and usefulness of the LCA results, especially in small samples.
As demonstrated in the simulation studies and examples, the rates of missing information can be used to assess the potential of symptoms or other surrogates to be used as diagnostic criteria in the absence of a gold standard. Models with high rates of missing information (rates above approximately 50%) do not predict true class membership well and indicate the need for additional symptoms or surrogates with less measurement error for accurate classification. The rate of missing information may also be valuable for the design of future studies. By knowing if items have a strong effect on the rate of missing information, one can plan to add items, change items, or put emphasis on item quality in future studies.
In addition, high rates of missing information indicate that larger samples sizes are needed to obtain precise and unbiased estimates of the latent class model parameters.
The rate of missing information may also be useful when estimating diagnostic accuracy in the absence of a gold standard. The rate of missing information provides a measure of whether it is appropriate to use LCA for measuring sensitivity, specificity, and prevalence. Based on our simulation results, if all tests have low sensitivity and specificity, there is likely to be a large amount of information missing by not directly observing a gold standard, and therefore less faith in the latent class results. However, the addition of one or two tests with high sensitivity and specificity, say around 90%, is likely to increase the amount of information about all model parameters, including the sensitivities and specificities of the other tests and the disease prevalence. If tests with high sensitivity and specificity are unavailable, larger sample sizes will be required to obtain precise and unbiased estimates.
When the observed data is incomplete, the missing data can be separated into two types, the missing latent class memberships and the missing surrogate values.Using two-stage MI (Harel, 2003), it would be interesting to separate the effect of the missing class memberships and the effect of the missing values on the overall uncertainty of the model.This is a topic for future study.
An unintended result of our simulation study was the finding that latent class regression may bias risk estimates away from the null in small samples. In the case of 100 subjects, an odds ratio of 2.7 was overestimated as 3.2 on average. Future work should examine this issue in more detail. Until this potential bias is better understood, care should be taken when fitting latent class regression models to small samples.

Table 1 :
Mean rates of missing information from 100 simulated LCA data sets under 32 conditions.

Table 3 :
Percent bias and variance of parameter estimates and percent agreement of predicted and true class membership from 100 simulated LCA data sets under 16 conditions.

Table 6 :
Two and three class model estimates and rates of missing information for the ECA depression data

Table 7 :
LCR two and three class parameter estimates and rates of missing information for the WHAS mobility disability example.