Bayesian Information Criterion and Selection of the Number of Factors in Factor Analysis Models

In maximum likelihood exploratory factor analysis, the estimates of unique variances can often turn out to be zero or negative, which makes no sense from a statistical point of view. In order to overcome this difficulty, we employ a Bayesian approach by specifying a prior distribution for the variances of unique factors. The factor analysis model is estimated by EM algorithm, for which we provide the expectation and maximization steps within a general framework of EM algorithms. Crucial issues in Bayesian factor analysis model are the choice of adjusted parameters including the number of factors and also the hyper-parameters for the prior distribution. The choice of these parameters can be viewed as a model selection and evaluation problem. We derive a model selection criterion for evaluating a Bayesian factor analysis model. Monte Carlo simulations are conducted to investigate the effectiveness of the proposed procedure. A real data example is also given to illustrate our procedure. We observe that our modeling procedure prevents the occurrence of improper solutions and also chooses the appropriate number of factors objectively.


Introduction
Factor analysis provides a useful tool to draw information from data by exploring the covariance structure among observed variables in terms of a smaller number of unobserved variables.Successful applications have been reported in various fields of research including the social and behavioral sciences.
The factor analysis model is usually estimated by maximum likelihood methods under the assumption that the observations are normally distributed.In practice, however, the maximum likelihood estimates of unique variances can often turn out to be zero or negative.Such estimates are known as improper solutions, and many authors have studied these inappropriate estimates both from a theoretical point of view and also by means of numerical examples (see, e.g., Jöreskog, 1967;van Driel, 1978;Sato, 1987;Kano and Ihara, 1994;Kano, 1998;Krijnen et al., 1998).Various causes of improper solutions in structural equation models, including confirmatory factor analysis model, have been also explored (see, e.g., Anderson and Gerbing, 1984;Boomsma, 1985;Gerbing and Anderson, 1987;Chen et al., 2001;Flora and Curran, 2004).
In order to prevent the occurrence of improper solutions in factor analysis model, we take a Bayesian approach by specifying a prior distribution for the variances of unique factors.In Bayesian factor analysis, the choice of a prior distribution is a fundamental issue.Martin and McDonald (1975) used a prior distribution for the elements of unique variances.Press (1982) used a natural conjugate prior distribution for factor loadings and unique variances.Akaike (1987) introduced a prior distribution using the information extracted from the knowledge of the likelihood function.Recently, a Bayesian approach based on the Markov chain Monte Carlo (MCMC) algorithms has received an amount of attention in the Bayesian factor analysis (see, e.g., Lee and Song, 2002;Lopes and West, 2004).Basically, a conjugate prior distribution is used for MCMC-based algorithm and estimates can be obtained via an algorithm based on the Gibbs sampler.
Another important point in Bayesian factor analysis model is the choice of adjusted parameters including the number of factors and hyper-parameters in the prior distribution.Regarding selection of the number of factors, the AIC (Akaike, 1973) and BIC (Schwarz, 1978) have been widely used, and some other selection procedures have also been developed by several researchers (see, e.g., Bozdogan, 1987;Press and Shigemasu, 1999).However, these procedures cannot provide suitable values of hyper-parameters included in the prior distribution.A selection procedure via the MCMC algorithms has also been widely used in the Bayesian factor analysis (see, e.g., Lee and Song, 2002;Lopes and West, 2004;Dunson, 2006;Fokouè, 2009).Although the MCMC-based selection method is certainly attractive, we take a different approach that selects both the number of factors and the values of hyper-parameters in the prior distribution since the MCMC-based procedure sometimes requires much computational load.
In this paper, we introduce a proper prior distribution for the variances of unique factors by extending a basic idea given in Akaike (1987).The Bayesian factor analysis model is estimated by EM algorithm, for which we provide the expectation and maximization steps within a general framework of EM algorithms.We treat a selection of parameters, which include the number of factors and the hyper-parameters for the prior distribution, as a model selection and evaluation problem, and derive a model selection criterion from a Bayesian viewpoint for evaluating a Bayesian factor analysis model.The proposed modeling procedure enables us to choose the number of factors and the values of hyper-parameters in the prior distribution simultaneously.
The remainder of this paper is organized as follows: Section 2 describes the maximum likelihood factor analysis and its related problems.In Section 3, we introduce a prior distribution according to the basic idea given by Akaike (1987), and provide a model estimation procedure using EM algorithm.Section 4 derives a model selection criterion for evaluating a Bayesian factor analysis model.Section 5 presents numerical results for both artificial and real datasets.Some concluding remarks are given in Section 6.

Maximum Likelihood Factor Analysis and Its Related Problems
Let X= (X 1 , • • • , X p ) be a p-dimensional observable random vector with mean vector µ and variance-covariance matrix Σ.The factor analysis model is are unobservable random vectors.The elements of F and ε are called common factors and unique factors, respectively.It is assumed that where I k is the identity matrix of order k and Ψ is a p × p diagonal matrix with i-th diagonal element ψ i which is called unique variance.Under these assumptions, the variance-covariance matrix of X can be expressed as The i-th diagonal element of ΛΛ is called communality, which measures the percent of variance in x i explained by all the factors.It is well-known that factor loadings have a rotational indeterminacy since both Λ and ΛT generate the same covariance matrix Σ, where T is an arbitrary orthogonal matrix.
Assume that the common factors F and the unique factors ε are, respectively, distributed according to multivariate normal distributions where is the likelihood function and S = (s ij ) is the sample variance-covariance matrix with x being the sample mean vector.For convenience, let us consider the discrepancy function given by The maximum likelihood estimates of Λ and Ψ are given as the solutions of ∂q(Λ, Ψ)/∂Λ = 0 and ∂q(Λ, Ψ)/∂Ψ = 0. Since the solutions cannot be expressed in a closed form, some iterative procedures are required to obtain the maximum likelihood estimates ΛML and ΨML .In maximum likelihood factor analysis, numerical algorithms have been proposed by several authors (see, e.g., Jöreskog, 1967;Jennrich and Robinson, 1969;Clarke, 1970).
In practice, the maximum likelihood estimates of unique variances can often turn out to be zero or negative, which have been called improper solutions.van Driel (1978) categorized the causes of improper solutions into the following three types: (i) sampling fluctuation, (ii) there exist no appropriate factor analysis models for extraction of beneficial information from the data, (iii) indefiniteness of the model.van Driel (1978) distinguished among the causes of improper solutions by using the standard errors of diagonal elements of ΨML , whereas the theoretical method for distinguishing the causes of improper solutions remains to be established.
What should be noted is how to prevent the occurrence of improper solutions.In order to handle this problem, several attempts have been made for parameter estimation.For example, the parameters are estimated (a) under the condition that ψ i ≥ 0.005 for i = 1, • • • , p (see Jöreskog, 1967), (b) after eliminating variables for which the estimates are improper and (c) by utilizing a Bayesian procedure.Some problems still remain, however, in approaches (a) and (b).In approach (a), the variances of unique factors are provided subjectively, whereas those should be estimated.The approach (b) often yields inappropriate estimates even when variables that cause the improper solutions are eliminated.We, therefore, focus our attention on the approach (c) that estimates the parameters included in the factor analysis model with the help of Bayesian procedure.

Bayesian Factor Analysis Model
Akaike (1987) introduced a prior distribution using the information extracted from the knowledge of the likelihood function.In this section we first give a brief review of his prior distribution and its related problems, and introduce a proper prior distribution according to the basic idea given by Akaike (1987).We then develop an estimation procedure for Bayesian factor analysis model along with the technique of EM algorithm.

Prior Distributions
First, we give a brief review of a prior distribution proposed by Akaike (1987).Akaike (1987) showed that for a given Ψ, the minimum value of the discrepancy function q(Λ, Ψ) in ( 2) with respect to Λ is given by where Note that the number of factors k in (3) will be selected by using a model selection criterion given in Section 4.
It can be seen from Equation (3) that q k (Ψ) is minimized when the values of θ k+1 , • • • , θ p are chosen as close to one as possible since a function x − log x (x > 0) has a minimum at x = 1, and that the values of θ 1 , • • • , θ k do not directly affect the function (3).Therefore, there is a possibility that the larger eigenvalues of Ψ−1/2 ML S Ψ−1/2 ML turn out to be extremely large.This implies that some diagonal elements of ΨML can become zero.In order to prevent the occurrence of improper solutions, the parameter estimation should be done under the restriction that the values of θ 1 , • • • , θ k are not too large.Akaike (1987) thus added a penalty term ρ k i=1 θ i with ρ > 0 to the discrepancy function q(Λ, Ψ) given in (2) and minimized the following function with respect to Λ and Ψ: The additional term prevents the occurrence of improper solutions because it does not allow the values of θ 1 , • • • , θ k to be infinite.Under the constraint that Λ Ψ −1 Λ is a diagonal matrix which removes the rotational indeterminacy, k i=1 θ i is equal to tr(Λ Ψ −1 Λ + I k ) (see, e.g., Lawley and Maxwell, 1971), which leads to a prior distribution proposed by Akaike (1987) in the following: where K denotes the normalizing constant and ρ can be considered as a hyperparameter.Akaike (1987) considered this distribution as a standard spherical prior distribution of the factor loadings and did not adopt the prior for Ψ.
This prior distribution has an advantage that it prevents the occurrence of improper solutions if a value of the hyper-parameter ρ is suitably chosen.We have, however, no prior convictions about factor loadings in exploratory factor analysis, because Λ has a rotational indeterminacy.Hence it is natural to define a prior distribution for the diagonal elements of Ψ rather than Λ.
We, therefore, propose adding a penalty term ρ p i=1 θ i to q(Λ, Ψ) in ( 2) and then minimize the function given by (5) It is reasonable to add the term ρ p i=1 θ i instead of ρ k i=1 θ i to q(Λ, Ψ) in ( 2) since the values of θ k+1 , • • • , θ p are close to one and could be ignored relative to the very large values of θ 1 , • • • , θ k .From Equation ( 5), the prior distribution for Ψ is thus given by The inverses of the diagonal elements of Ψ have exponential distributions that yield the normalizing constant K = p i=1 N ρs ii /2.In contrast, the normalizing constant for the prior distribution in (4) is infinite.Note that it is difficult to derive a model selection criterion, which will be described in the Section 4, with the prior distribution given in (4) since the model selection criterion depends on K.
Our proposed prior distribution is closely related to that of Martin and Mc-Donald (1975) given by where α 1 , • • • , α p are hyper-parameters of the prior distribution.However, it is difficult to specify these hyper-parameters when p is large.They recommended restricting these hyper-parameters by requiring that α i = s ii α, in which case their prior distribution coincides with the proposed prior distribution in (6).From these descriptions, our proposed prior distribution seems to be a minor modification of Martin and McDonald (1975).It is noted, however, that the proposed distribution in (6) has the theoretical justification for preventing the occurrence of improper solutions because the prior distribution is introduced according to the theoretical scheme given by Akaike (1987), whereas the prior distribution in ( 7) is heuristically provided.In addition, Martin and McDonald (1975) subjectively selected a hyper-parameter α which controls the trade-off between the log-likelihood and the penalty term, while a model selection criterion presented in Section 4 enables us to choose the hyper-parameter objectively.

Estimation
For the prior distribution π(Ψ|ρ) defined by ( 6), the posterior distribution is given by We estimate the parameters Λ and Ψ by a posterior mode.Taking the logarithm of the equation gives the penalized log-likelihood function where the hyper-parameter ρ can be considered as a regularization parameter.We estimate the parameters Λ and Ψ in the Bayesian factor analysis model by maximizing the penalized log-likelihood function given in (8).
One of the beneficial methods to obtain the maximum penalized likelihood estimates is an EM algorithm.Rubin and Thayer (1982) suggested using an EM algorithm in maximum likelihood factor analysis.The advantage of the EM algorithms is that even if the likelihood function is not concave with respect to the parameters, the algorithm leads to a (local) maximization of the function.Bentler and Tanaka (1983) pointed out the problems in the EM algorithm for factor analysis, whereas Rubin and Thayer (1983) addressed the problem of Bentler and Tanaka's (1983) discussion.
We employ an EM algorithm to obtain the maximum penalized likelihood estimates.We provide the expectation and maximization steps for the Bayesian factor analysis model within a general framework of EM algorithms.We regard the common factors as missing variables, and maximize the complete-data loglikelihood using a posterior distribution for the missing variables.The iterative procedure is given by where For detailed procedure for estimation of factor analysis models via EM algorithms, we refer to Rubin and Thayer (1982) and Tipping Bishop (1999).
In order to eliminate the rotational indeterminacy from Λ, we impose restrictions that λ ij = 0 (i > j) (see, for example, Anderson and Rubin;1956).

Model Selection Criterion
In the Bayesian factor analysis model, we still have crucial issues to be solved: the choice of a hyper-parameter ρ for the prior distribution and the number of factors k.In this section we derive a model selection criterion for evaluating a Bayesian factor analysis model.
The generalized Bayesian information criterion (GBIC), proposed by Konishi et al. (2004), enables us to choose adjusted parameters including the hyperparameter ρ and the number of factors k simultaneously by extending the Bayesian information criterion (BIC) proposed by Schwarz (1978).The basic idea of BIC is to select a model from a set of candidate models by maximizing the posterior probability.The BIC only deals with models estimated by the maximum likelihood method, whereas the model selection criterion GBIC can be applied to models estimated by the maximum penalized likelihood method.For model selection criteria we refer to Konishi and Kitagawa (2008) and references given therein.
Suppose that θ is a parameter vector given by θ where ) .We used the definition of λ .iwhich consists of only the lower elements of Λ because it eliminates the rotational indeterminacy as described in the previous section.Let f (X N | θ) be the estimated model by maximum penalized likelihood methods.Then we have a statistical model where Σ = Λ Λ + Ψ.The model selection criterion GBIC for Bayesian factor analysis is given by where p * is the number of parameters given by p(k + 1) − k(k − 1)/2 and J ρ ( θ) is a second order differential of the penalized log-likelihood function given by We choose optimum values of the hyper-parameter ρ and the number of factors k which simultaneously minimize the value of the model selection criterion in ( 11).
The derivation of the GBIC is given by Hirose et al. (2008).
Other traditional model selection criteria include AIC (Akaike, 1973) and BIC (Schwarz, 1978).It should be noted that the AIC and BIC often select a model which causes improper solutions because these model selection criteria only evaluate models estimated by the maximum likelihood method.These model selection criteria are given by

Numerical Examples
Monte Carlo simulations and a real data example are used to examine the efficiency of the proposed procedure.We investigate how well the Bayesian modeling strategy with GBIC performs well in the sense that it prevents the occurrence of improper solutions and can select the true number of factors.

Numerical Comparison
Monte Carlo simulations were conducted to investigate the performance of our proposed procedure in various covariance structures and samples.In this simulation study we focus on the choice of the number of factors and compare the performance of GBIC with that of AIC and BIC.
We consider various datasets which are likely to produce improper solutions due to sampling fluctuations.van Driel (1978) showed that improper solutions sometimes occur when one of the diagonal elements of Ψ is close to zero.In addition, improper solutions often arise when the number of samples is small or communalities are large.Taking these natures of improper solutions into account, we considered four models, which are given in Table 1, and three variants for the number of observations, N = 30, N = 50 and N = 100.
When each dataset was generated 1000 times, we often obtained improper solutions.The frequencies of improper solutions were We chose the adjusted parameters including a hyper-parameter of prior distribution and the number of factors using the model selection criterion GBIC given by (11).The minimum GBIC was selected for varying values of k and ρ.We also selected the number of factors using AIC and BIC, which only deal with the models estimated by the maximum likelihood method, to compare the performance of AIC and BIC with that of GBIC.The maximum likelihood estimates were obtained under the condition that ψ i ≥ 0.005 for i = 1, • • • , p (see Jöreskog, 1967).
Table 2 shows that how many times the model selection criteria selected each number of factors out of 1000 datasets.For example, the AIC selected the one factor model 9 times out of 1000 datasets in model (a1) when N = 30.When N = 30, the AIC often selected the large number of factors, whereas the BIC sometimes chose the small number of factors, which means the performance of the AIC and BIC are poor.However, the GBIC performed well compared with AIC and BIC since the GBIC most often selected the true number of factors for all models.The BIC selected 1 factor 170 times and the AIC selected 5 factors 37 times for the model (b1).However, the GBIC rarely selected 1 factor or 5 factors for this model.As a result, even if the GBIC does not select the true number of factors, the outcome may not be significantly worse such as the result of AIC and BIC.
When N = 50, the performance of the AIC is still poor.The performance of BIC is not poor, but the GBIC brings better result compared with the BIC for model (b1).
When N = 100, the performance of the GBIC and BIC is quite well, but the AIC still often selects the large number of factors.The performance of the BIC is slightly better than that of GBIC.It seems that the modeling procedure with BIC is preferable to our proposed procedure based on the GBIC when N is large.However, the maximum likelihood procedure with BIC is not able to prevent the occurrence of improper solutions whereas our proposed procedure with GBIC can prevent the occurrence of improper solutions.
We compared the performance of AIC, BIC with that of GBIC.The AIC and BIC selected 7 factor model and 4 factor model, respectively, each of which resulted in improper solutions since we obtained improper solutions when k ≥ 4. The model selection criterion GBIC also selected 4 factor model.Hereafter we focus on the 4 factor model.
Before we illustrate our procedure, we show how the choice of a hyperparameter is an important point.The maximum likelihood estimate of ψ 14 was −0.000, which is apparently inappropriate.To overcome this problem, we employed our modeling method.We obtained a maximum penalized likelihood estimate of ψ 14 , when ρ = 0.00001, 0.01 and 1, which is given by 0.005, 0.130 and 1.608, respectively.When ρ = 0.00001 the estimate of ψ 14 was too close to zero.This shows that we were not able to prevent the occurrence of improper solutions.In comparison, the estimate of ψ 14 was too large when ρ = 1.However, when ρ = 0.01, we obtained an appropriate estimate of ψ 14 compared with that obtained when ρ = 0.00001 and ρ = 1.
It is important to identify the cause of the improper solutions.The maximum likelihood estimates of Ψ and the standard deviation σψ i of N 1/2 ψ i /s ii (see (5.50) in Lawley and Maxwell, 1971) for k = 2 to 4 are shown in Table 3.For k = 4, the maximum likelihood estimates ψ1 , ψ3 , ψ7 , ψ13 , ψ14 were less than the corresponding estimates for k = 3.These results for the estimates of unique variances suggest that we have identified some new common factors.Moreover, van Driel (1978) found that the value of the standard deviation σψ i may be large if the cause of improper solutions is the indefiniteness, whereas it is not especially large for each i when k = 4.These results indicate that the improper solutions are probably due to sampling fluctuations rather than indefiniteness of the model.The estimates of Λ and Ψ obtained by using the proposed method are given in Table 4.The estimates of factor loadings Λ are rotated by varimax method (Kaiser, 1958).It can be seen from Table 4 that the proposed procedure prevents the occurrence of improper solutions and we can obtain the interpretable common factors in the following: Career and Adequacy, Motivation and Ability, Academic Capability and Character.For this reason, the proposed procedure performs well in that case.

Concluding Remarks
In maximum likelihood factor analysis, there arise situations in which the estimates of unique variances go to zero or become negative.To prevent the occurrence of such improper solutions, we used a Bayesian approach by specifying a proper prior distribution for unique variances.The proposed prior distribution is based on the prior distribution given by Akaike (1987).In practice, an optimal choice of the number of factors is also of importance for exploring the covariance structure.We derived the model selection and evaluation criterion GBIC from a Bayesian point of view, and used it to choose adjusted parameters that include the hyper-parameter for the proposed prior distribution and the number of factors.Monte Carlo simulations and a real data example were used to investigate the efficiency of the proposed procedure.We observed that our modeling strategy with GBIC prevents the occurrence of improper solutions and also selects the As a future research topic, it is interesting to construct a modeling procedure for preventing the occurrence of improper solutions in structural equation models including confirmatory factor analysis.

Table 1 :
Four models for simulated datasets

Table 2 :
Comparisons of model selection criteria for simulated datasets generated by (a1), (a2), (b1) and (b2).The bold text in the left column represents the true number of factors.

Table 3 :
Maximum likelihood estimates of unique variances and the standard deviations of N 1/2 ψ i /s ii for k = 2 to 4 in the job application data.

Table 4 :
The estimates of factor loading Λ and unique variances Ψ obtained by the proposed method in the job application data.