Multiple Taxicab Correspondence Analysis of a Survey Related to Health Services

We present an analysis of health survey data by multiple correspondence analysis (MCA) and multiple taxicab correspondence analysis (MTCA), MTCA being a robust $L_1$ variant of MCA. The survey has one passive item, gender, and 22 active substantive items representing health services offered by municipal authorities; each active item has four answer categories: this service is used, never tried, tried with no access, and nonresponse. We show that the first principal MTCA factor is perfectly characterized by the sum score of the category this service is used over all service items. Further, we prove that such a sum score characterization always exists for any survey data.


Introduction
The data discussed in this paper come from a survey of 3530 individuals residing in downtown eastside Vancouver, an area with a high incidence of AIDS/HIV related diseases. Table 1 displays the marginal distribution of the 22 active or substantive response variables (items) filled in by the 3530 respondents, where each item describes a health related service offered by municipal authorities; for instance, the first question asks whether the needle exchange service, coded NXCHG, was used or not. Each item is a polytomous qualitative variable with four categories: (1) = used this service, (2) = never tried, (3) = tried with no access, (N) = nonresponse or missing.

In Europe, particularly in France, multiple correspondence analysis (MCA) is a popular method to describe and visually explore complex relationships among items in such a questionnaire survey. MCA is the application of correspondence analysis (CA) to the super indicator 0/1 matrix $Z$ of size 3530 × 88. The number of columns, 88 = 4 × 22, is the total number of categories of the 22 items. To see how the matrix $Z$ is constructed, refer to Section 3. An advantage of coding the data as in $Z$ is that missing values are incorporated in the analysis naturally, without imputation, just like any other category value. Imputation for missing categorical survey data is discussed in detail by Finch (2010).

The aim of this paper is to compare the MCA results with the multiple taxicab correspondence analysis (MTCA) results, MTCA being a robust $L_1$ version of MCA developed by Choulakian (2006, 2008a, 2008b). Because of its robustness, MTCA will reveal that there is a clear structure in this data set based on a simple sum score statistic. Further, we show that such a sum score characterization always exists for any survey questionnaire data; this helps the researcher to see whether the active items are broadly similar in objective and point in the same direction.

The paper is organized as follows: in Sections 2 and 3 we present the theoretical results; in Sections 4 and 5 we present the analysis of the survey data by MCA and MTCA, respectively; and we conclude in Section 6. We suppose that the theory of multiple correspondence analysis (MCA) is known; it can be found, among others, in Benzécri (1973, 1992), Greenacre (1993), Gifi (1990), Nishisato (1994), and Le Roux and Rouanet (2004). Note that MCA is also known as homogeneity analysis, reciprocal averaging, dual scaling, or the third method of quantification.

Introduction
In a series of papers, Choulakian (2003, 2005, 2006a, 2006b) developed principal component analysis (PCA) based on matrix norms, thus generalizing classical PCA, or equivalently the well known singular value decomposition (SVD). This led to the development of taxicab principal component analysis (TPCA), based on the most robust matrix norm, the taxicab matrix norm, on which taxicab correspondence analysis (TCA) is also based.
TPCA is similar to, and shares the mathematical framework of, classical PCA. Classical PCA can be described in many ways; see Jolliffe (2002) for a comprehensive account. TPCA parallels only one of these descriptions, which we present in the next subsection to make the paper self contained and reader friendly.

Classical Principal Component Analysis
Let $T$ be a centered or standardized data set of dimension $I \times J$, where $I$ observations are described by $J$ variables; that is, $T'T/I$ is the covariance or the correlation matrix. For a vector $u \in \mathbb{R}^J$, we define its Euclidean or $L_2$ norm to be $||u||_2 = (u'u)^{1/2}$. Let $k = rank(T)$.

Classical principal component analysis (PCA) consists of the successive maximization of the variance, the square of the $L_2$ norm of a linear combination of the variables of the matrix $T$, subject to a quadratic constraint; that is, it is based on the following optimization problem

$$\max ||Tu||_2 \text{ subject to } ||u||_2 = 1; \tag{1}$$

or equivalently, PCA can also be described as the maximization of the square of the $L_2$ norm of a linear combination of the observations (rows) of the matrix,

$$\max ||T'v||_2 \text{ subject to } ||v||_2 = 1. \tag{2}$$

Equation (1) is the dual of (2), and they can be reexpressed as the matrix norm

$$\lambda_1 = \max_{||u||_2 = 1} ||Tu||_2 = \max_{||v||_2 = 1} ||T'v||_2. \tag{3}$$

The solution to (3), $\lambda_1$, is the square root of the greatest eigenvalue of the matrix $T'T$ or $TT'$. The first principal axes, $u_1$ and $v_1$, are defined as

$$u_1 = \arg\max_{||u||_2 = 1} ||Tu||_2 \tag{4}$$

and

$$v_1 = \arg\max_{||v||_2 = 1} ||T'v||_2, \tag{5}$$

where $u_1$ is the eigenvector of the matrix $T'T$ associated with the greatest eigenvalue $\lambda_1^2$, and similarly $v_1$ is the eigenvector of $TT'$. Let $f_1$ be the vector of the first principal component (pc) scores, and $g_1$ the vector of the first pc loadings, defined as

$$f_1 = Tu_1 \tag{6}$$

and

$$g_1 = T'v_1. \tag{7}$$

Equations (6) and (7) are named transitional formulas, because $v_1$ and $f_1$, and $u_1$ and $g_1$, are related by

$$f_1 = \lambda_1 v_1 \quad \text{and} \quad g_1 = \lambda_1 u_1. \tag{8}$$

To obtain the second pc scores $f_2$, loadings $g_2$, and axes $u_2$ and $v_2$, we repeat the above procedure on the residual data set

$$T_2 = T_1 - f_1 g_1' / \lambda_1, \tag{9}$$

where $T_1 = T$. We note that $rank(T_2) = rank(T_1) - 1$, because by (6) and (7)

$$T_2 u_1 = 0 \quad \text{and} \quad T_2' v_1 = 0. \tag{10}$$

Classical PCA can be described as the sequential repetition of the above procedure $k = rank(T)$ times, until the residual matrix becomes 0; thus, using $\alpha = 1, \cdots, k$ as indices, the matrix $T$ can be written as

$$T = \sum_{\alpha=1}^{k} f_\alpha g_\alpha' / \lambda_\alpha, \tag{11}$$

which, by (8), can be rewritten in the form known as the singular value decomposition (SVD)

$$T = \sum_{\alpha=1}^{k} \lambda_\alpha v_\alpha u_\alpha'. \tag{12}$$

Further, we have

$$\lambda_\alpha = ||f_\alpha||_2 = ||g_\alpha||_2 \tag{13}$$

and

$$||T||_F^2 = \sum_{\alpha=1}^{k} \lambda_\alpha^2, \tag{14}$$

which represents, by the Pythagorean theorem, $I$ times the sum of the variances of the $J$ variables, or the sum of the squared Euclidean distances of the $I$ rows from the origin, because we assumed that $T$ is centered or standardized. Also, the relative cumulative explained variability of the first $\alpha$ axes is

$$\sum_{\beta=1}^{\alpha} \lambda_\beta^2 \Big/ \sum_{\beta=1}^{k} \lambda_\beta^2. \tag{15}$$
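As a concrete illustration, here is a minimal numpy sketch of this SVD-based description of PCA; the function and variable names are ours, not from the paper.

```python
import numpy as np

def classical_pca(T):
    """Classical PCA of a centered matrix T via the SVD, T = sum_a lam_a v_a u_a'.

    Returns dispersions lam, axes U (columns u_a) and V (columns v_a),
    pc scores F (columns f_a = T u_a) and loadings G (columns g_a = T' v_a)."""
    V, lam, Ut = np.linalg.svd(T, full_matrices=False)
    U = Ut.T
    F = T @ U          # transitional formula (6): f_a = T u_a = lam_a v_a
    G = T.T @ V        # transitional formula (7): g_a = T' v_a = lam_a u_a
    return lam, U, V, F, G

# toy usage on column-centered data; explained variability as in (15)
rng = np.random.default_rng(0)
T = rng.normal(size=(10, 4))
T -= T.mean(axis=0)
lam, U, V, F, G = classical_pca(T)
explained = np.cumsum(lam**2) / np.sum(lam**2)
```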

Taxicab Principal Component Analysis (TPCA)
The $L_1$ norm of a vector $u \in \mathbb{R}^J$ is $||u||_1 = \sum_{j=1}^{J} |u_j|$. TPCA consists of maximizing the $L_1$ norm of a linear combination of the variables of the matrix subject to an $L_\infty$ norm constraint; more precisely, it is based on the following optimization problem

$$\max ||Tu||_1 \text{ subject to } ||u||_\infty = 1; \tag{16}$$

or equivalently, TPCA can also be described as the maximization of the $L_1$ norm of a linear combination of the rows of the matrix,

$$\max ||T'v||_1 \text{ subject to } ||v||_\infty = 1. \tag{17}$$

Equation (17) is the dual of (16), and they can be reexpressed as the matrix norm

$$\lambda_1 = \max_{||u||_\infty = 1} ||Tu||_1 = \max_{||v||_\infty = 1} ||T'v||_1, \tag{18}$$

which is a well known and much discussed matrix norm related to the Grothendieck problem; see, for instance, Alon and Naor (2006). The solution to (18), $\lambda_1$, is given by the combinatorial optimization problem

$$\lambda_1 = \max_{u \in \{-1,+1\}^J} ||Tu||_1 = \max_{v \in \{-1,+1\}^I} ||T'v||_1. \tag{19}$$

Equation (19) characterizes the robustness of the method, in the sense that the weights affected to the variables (and, by duality, to the individuals) are uniform $\pm 1$. The first principal axes, $u_1$ and $v_1$, are defined as

$$u_1 = \arg\max_{u \in \{-1,+1\}^J} ||Tu||_1 \tag{20}$$

and

$$v_1 = \arg\max_{v \in \{-1,+1\}^I} ||T'v||_1. \tag{21}$$

Let $f_1$ be the vector of the first principal component (pc) scores, and $g_1$ the vector of the first pc loadings. These are defined as

$$f_1 = Tu_1 \tag{22}$$

and

$$g_1 = T'v_1. \tag{23}$$

Equations (22) and (23) are named transitional formulas, because $v_1$ and $f_1$, and $u_1$ and $g_1$, are related by

$$v_1 = sgn(f_1) \quad \text{and} \quad u_1 = sgn(g_1), \tag{24}$$

where $sgn(g_1) = (sgn(g_1(1)), \cdots, sgn(g_1(J)))'$, and $sgn(g_1(j)) = 1$ if $g_1(j) > 0$, $sgn(g_1(j)) = -1$ otherwise. Note that (24) is completely different from (8).
To obtain the second pc scores $f_2$, loadings $g_2$, and axes $u_2$ and $v_2$, we repeat the above procedure on the residual data set

$$T_2 = T_1 - f_1 g_1' / \lambda_1, \tag{25}$$

where $T_1 = T$. We note that $rank(T_2) = rank(T_1) - 1$, because by (22), (23) and (24)

$$T_2 u_1 = 0 \tag{26}$$

and

$$T_2' v_1 = 0, \tag{27}$$

which implies that TPCA is the sequential repetition of the above procedure $k = rank(T)$ times, until the residual matrix becomes 0; thus the matrix $T$ can be written as

$$T = \sum_{\alpha=1}^{k} f_\alpha g_\alpha' / \lambda_\alpha. \tag{28}$$

It is important to note that (28) has the same form as (11), but it cannot be rewritten as (12), because (24) is completely different from (8). Further, similar to (13), we have

$$\lambda_\alpha = ||f_\alpha||_1 = ||g_\alpha||_1. \tag{29}$$

But the dispersion measures $\lambda_\alpha$ in (29) will not satisfy (14), because the Pythagorean theorem is not satisfied in $L_1$. Given that (14) is used for classical PCA, for both methods we define the total variability to be

$$\sum_{\alpha=1}^{k} \lambda_\alpha^2, \tag{30}$$

and the relative cumulative explained variability of the first $\alpha$ axes to be

$$\sum_{\beta=1}^{\alpha} \lambda_\beta^2 \Big/ \sum_{\beta=1}^{k} \lambda_\beta^2. \tag{31}$$

In TPCA, the optimization problem (16), (17) or (18) can be solved by two algorithms. The first is based on the complete enumeration of (19); with the present state of desktop computing power, this can be applied if, say, $\min(I, J) \le 25$. The second is based on iterating the transitional formulas (22), (23) and (24), similar to Wold's (1966) NIPALS algorithm, also named criss-cross regression by Gabriel and Zamir (1979). The criss-cross algorithm can be summarized in the following way, where $g$ is a starting value:

Step 1: $u = sgn(g)$, $f = Tu$ and $\lambda(u) = ||Tu||_1$;
Step 2: $v = sgn(f)$, $g = T'v$ and $\lambda(v) = ||T'v||_1$;
Step 3: if $\lambda(v) > \lambda(u)$, return to Step 1; otherwise stop.

It is easy to show that this is an ascent algorithm; that is, it increases the value of the objective function $\lambda$ at each iteration. The convergence of the algorithm is superlinear (very fast, at most two iterations); however, it may converge to a local maximum, so we restart the algorithm $I$ times, using each row of $T$ as a starting value. The iterative algorithm is statistically consistent in the sense that, as the sample size increases, there will be some observations in the direction of the principal axes, so the algorithm will find the optimal solution.
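A minimal numpy sketch of the criss-cross algorithm with row restarts, under the sign convention of (24) ($sgn(x) = 1$ if $x > 0$, $-1$ otherwise); the function names and bookkeeping are ours.

```python
import numpy as np

def taxicab_pc(T):
    """One taxicab principal dimension of T by the criss-cross algorithm.

    Maximizes ||T u||_1 over u in {-1, +1}^J, restarting from every row of T
    as suggested in the text; returns (lam, u, v, f, g)."""
    def sgn(x):
        return np.where(x > 0, 1.0, -1.0)    # sgn(0) = -1, as in (24)

    I, J = T.shape
    best = (-np.inf, None, None, None, None)
    for i in range(I):
        g = T[i, :].copy()
        lam_prev = -np.inf
        while True:
            u = sgn(g)                       # Step 1: u = sgn(g)
            f = T @ u                        # f = T u
            v = sgn(f)                       # Step 2: v = sgn(f)
            g = T.T @ v                      # g = T' v
            lam = np.abs(f).sum()            # lam(u) = ||T u||_1
            if lam <= lam_prev:              # ascent has stalled: converged
                break
            lam_prev = lam
        if lam_prev > best[0]:
            best = (lam_prev, u, v, f, g)
    return best

def tpca(T, k):
    """Sequential TPCA: peel off k dimensions via the residual step (25)."""
    T = np.array(T, dtype=float)
    dims = []
    for _ in range(k):
        lam, u, v, f, g = taxicab_pc(T)
        dims.append((lam, u, v, f, g))
        T = T - np.outer(f, g) / lam         # T_{a+1} = T_a - f_a g_a'/lam_a
    return dims
```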
For the survey data set, the computations are done by the iterative algorithm.

Taxicab Correspondence Analysis of a Contingency Table
Often correspondence analysis (CA) is identified as categorical PCA; that is, it is considered an adaptation of PCA to contingency tables. Similarly, we consider TCA an adaptation of TPCA to contingency tables. Here we introduce TCA of a contingency table $N = (n_{ij})$ of two nominal variables with $I$ rows and $J$ columns. Let $P = N/n$ be the associated correspondence matrix with elements $p_{ij}$, where $n = \sum_{j=1}^{J} \sum_{i=1}^{I} n_{ij}$ is the sample size. We define the marginals $p_{i\bullet} = \sum_{j} p_{ij}$ and $p_{\bullet j} = \sum_{i} p_{ij}$, the vectors of row and column masses $r = (p_{i\bullet})$ and $c = (p_{\bullet j})$, and $D_r = Diag(r)$, a diagonal matrix having diagonal elements $p_{i\bullet}$, and similarly $D_c = Diag(c)$. The application to $P$ of the TPCA algorithm described in the previous subsection is named TCA of the contingency table $N$. We put $P_0 = P$ and denote by $P_\alpha$ the residual correspondence matrix at the $\alpha$-th iteration. That is, in the calculations described in the previous subsection, we replace $T$ by $P$, and the numbering of the iterations $\alpha$ varies from 0 to $k$, where $k = rank(P) - 1$.
For $\alpha = 0$, $P_0 = P$. Row and column profiles with their masses play an important role in both CA and TCA. Let $R_0 = D_r^{-1} P_0$ and $C_0 = D_c^{-1} P_0'$. The cloud of row profiles with their masses is the set $\{(r_{0i}, p_{i\bullet}) \mid i = 1, \cdots, I\}$, where $r_{0i}$ is the $i$th row of $R_0$; and the cloud of column profiles with their masses is the set $\{(c_{0j}, p_{\bullet j}) \mid j = 1, \cdots, J\}$, where $c_{0j}$ is the $j$th row of $C_0$. We shall interpret the steps of TCA using the row profiles; however, we remind the reader that a similar interpretation can be done using the column profiles.
For $\alpha = 0$, the optimization problem (16) is

$$\max_{u \in \{-1,+1\}^J} ||P_0 u||_1 = \max_{u} \sum_{i=1}^{I} p_{i\bullet} |r_{0i} u|. \tag{32}$$

The objective function in (32) is the weighted $L_1$ dispersion of the projections of the row profiles $r_{0i}$ on the axis $u$. The 0-th principal axes are, see (20) and (21),

$$u_0 = \arg\max_{u \in \{-1,+1\}^J} ||P_0 u||_1 \quad \text{and} \quad v_0 = \arg\max_{v \in \{-1,+1\}^I} ||P_0' v||_1, \tag{33}$$

which can be seen to be trivially $u_0 = 1_J$, the $J$-component vector with coordinates of 1's, and $v_0 = 1_I$. The 0-th principal factor scores are

$$f_0 = D_r^{-1} P_0 u_0 \quad \text{and} \quad g_0 = D_c^{-1} P_0' v_0, \tag{34}$$

which can be seen to be trivially $f_0 = 1_I$ and $g_0 = 1_J$; these are related to the corresponding principal axes by (24), that is,

$$u_0 = sgn(g_0) \quad \text{and} \quad v_0 = sgn(f_0). \tag{35}$$

And the 0-th taxicab dispersion measure can be represented in many different ways as

$$\lambda_0 = ||P_0 u_0||_1 = ||P_0' v_0||_1 = v_0' P_0 u_0 = 1. \tag{36}$$

The first residual correspondence matrix is, by (25),

$$P_1 = P_0 - rc'. \tag{37}$$

Note that $rc'$ represents the correspondence matrix under the assumption that the row and column variables are independent. This solution is considered trivial both in CA and in TCA.
For $\alpha = 1$, we define the residual row and column profiles to be $R_1 = D_r^{-1} P_1$ and $C_1 = D_c^{-1} P_1'$. The cloud of the residual row profiles with their masses is the set $\{(r_{1i}, p_{i\bullet}) \mid i = 1, \cdots, I\}$, where $r_{1i}$ is the $i$th row of $R_1$; and the cloud of residual column profiles with their masses is the set $\{(c_{1j}, p_{\bullet j}) \mid j = 1, \cdots, J\}$, where $c_{1j}$ is the $j$th row of $C_1$. We repeat steps (20) through (25), or (32) through (37), where $P_0$ is replaced by $P_1$. Note that the maximization problem is NP hard and not trivial. So, in general, the $\alpha$-th taxicab dispersion measure can be represented in many different ways:

$$\lambda_\alpha = ||P_\alpha u_\alpha||_1 = ||P_\alpha' v_\alpha||_1 = v_\alpha' P_\alpha u_\alpha = \sum_{i=1}^{I} p_{i\bullet} |f_\alpha(i)| = \sum_{j=1}^{J} p_{\bullet j} |g_\alpha(j)|. \tag{38}$$

And the $(\alpha+1)$-th residual correspondence matrix is

$$P_{\alpha+1} = P_\alpha - D_r f_\alpha g_\alpha' D_c / \lambda_\alpha, \tag{39}$$

from which one gets the data reconstitution formula, valid both in TCA and CA,

$$p_{ij} = p_{i\bullet}\, p_{\bullet j} \left( 1 + \sum_{\alpha=1}^{k} f_\alpha(i)\, g_\alpha(j) / \lambda_\alpha \right). \tag{40}$$

Similar to classical CA, the total dispersion is defined to be $\sum_{\alpha=1}^{k} \lambda_\alpha^2$, the proportion of the explained variation of the $\alpha$-th principal axis is $\lambda_\alpha^2 / \sum_{\beta=1}^{k} \lambda_\beta^2$, and the cumulative explained variation is $\sum_{\beta=1}^{\alpha} \lambda_\beta^2 / \sum_{\beta=1}^{k} \lambda_\beta^2$. The visual maps are obtained by plotting the points $(f_\alpha(i), f_\beta(i))$ or $(g_\alpha(j), g_\beta(j))$ for a pair of axes $(\alpha, \beta)$.

An important property of TCA and CA is that columns (or rows) with identical profiles (conditional probabilities) receive identical factor scores. One important advantage of TCA over CA is that it stays as close as possible to the original data: it acts directly on the correspondence matrix $P$, without calculating a dissimilarity (or similarity) measure between the rows or columns. TCA does not admit a distance interpretation between profiles; there is no chi-square like distance in TCA. Fichet (2009) described it as a scoring method.
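Continuing the sketch above, TCA of a contingency table amounts to applying the tpca routine to the centered correspondence matrix $P - rc'$ of (37) and rescaling the factor scores to the profile scale, as in (34) and (38); again, a hedged sketch with our own names.

```python
import numpy as np

def tca(N, k):
    """Taxicab CA of a contingency table N: TPCA of P - r c' (eq. (37)),
    with factor scores on the profile scale, f_a = D_r^{-1} P_a u_a."""
    P = np.array(N, dtype=float) / np.sum(N)
    r = P.sum(axis=1)                  # row masses p_i.
    c = P.sum(axis=0)                  # column masses p_.j
    P1 = P - np.outer(r, c)            # remove the trivial dimension
    factors = []
    for lam, u, v, f, g in tpca(P1, k):
        factors.append((lam, f / r, g / c))   # (lam_a, f_a, g_a) as in (38)
    return factors
```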
More technical details about TCA and a deeper comparison between TCA and CA is done in Choulakian (2006a).Further results can be found in Choulakian et al. (2006), Choulakian (2008a), and Choulakian and de Tibeiro (2012).

Multiple Taxicab Correspondence Analysis
Let $n$ individuals fill out a questionnaire survey consisting of $Q$ items, where item $q$ has $J_q$ answer categories. Let $j_q$ denote the $j$th category of the $q$th item, for $q = 1, \cdots, Q$ and $j_q = 0, \cdots, J_q - 1$. Let $Z$ be the super indicator 0/1 matrix of order $n \times \sum_{q=1}^{Q} J_q$: each respondent's answer to item $q$ is coded by a 1 in the column of the chosen category and 0's in the remaining columns of that item block, so that each row of $Z$ sums to $Q$. An example of the construction of $Z$ from a data matrix $Y$, with $n = 4$ and $Q = 3$, is sketched in code form below. MTCA of $Y$ is TCA applied to $Z$.

Theorem 1. (Choulakian, 2008b): Along the first principal axis, the projected response patterns in MTCA of $Y$ will be clustered, and the number of cluster points is less than or equal to $1 + Q$.
This theorem shows that MTCA automatically clusters the response patterns, that is, the individuals, into at most $1 + Q$ clusters. This is an important feature of the method and an important help to the researcher. Note that some clusters can be empty.
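To fix ideas, here is a minimal sketch of the construction of $Z$; the 4 × 3 data matrix Y below is an illustrative stand-in (the paper's original worked example is not reproduced here), and the helper name indicator_matrix is ours.

```python
import numpy as np

def indicator_matrix(Y, n_cats):
    """Build the super indicator 0/1 matrix Z from a categorical matrix Y.

    Y is n x Q with Y[i, q] in {0, ..., n_cats[q] - 1};
    Z is n x sum(n_cats), with one item block of n_cats[q] columns per item."""
    n, Q = Y.shape
    blocks = []
    for q in range(Q):
        B = np.zeros((n, n_cats[q]), dtype=int)
        B[np.arange(n), Y[:, q]] = 1      # one 1 per row in each item block
        blocks.append(B)
    return np.hstack(blocks)

# illustrative data: n = 4 respondents, Q = 3 items with 2, 3, 3 categories
Y = np.array([[0, 1, 2],
              [1, 0, 0],
              [0, 2, 1],
              [1, 1, 2]])
Z = indicator_matrix(Y, n_cats=[2, 3, 3])  # Z has 4 rows and 8 columns
```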

Characterization of the First MTCA Principal Factor as Sum Score Statistic
The next theorem, which is new, characterizes the $1 + Q$ clusters completely by a sum score statistic, more precisely by the total number of "first factor successes" over all the items. The crucial point is therefore how to define the first factor success of an item and its complement, the "first factor failure". It is important to note that a sum score statistic over items makes sense only when the items are similar in nature, which, as we will see in Section 5, is the case for the data set considered in this paper.
First we consider the case of dichotomous items, when $J_q = 2$ for $q = 1, \cdots, Q$; then we generalize the result to polytomous items.
Theorem 2. (The first MTCA factor property for dichotomous items): Let $Y \in \mathbb{R}^{n \times Q}$, where $Y_{ij} = 0$ if the response of the $i$th individual on the $j$th dichotomous item is a failure, and $Y_{ij} = 1$ if it is a success, and consider MTCA of $Y$. Then the first principal factor scores $f_1(i)$ and the subject sum scores $Y_{i\bullet} = \sum_{j=1}^{Q} Y_{ij}$, for $i = 1, \cdots, n$, are linearly related (i.e., $corr(f_1(i), Y_{i\bullet}) = \pm 1$) if and only if the first principal factor weights of the success categories satisfy $u_{11} = 1_Q$.

Proof: Write $Z = [\,1_n 1_Q' - Y \mid Y\,]$ in block form, so that

$$P = Z/(nQ), \tag{41}$$

and, by (37), the first residual correspondence matrix is

$$P_1 = \frac{1}{nQ} \left[\, 1_n p' - Y \mid Y - 1_n p' \,\right], \tag{42}$$

where $p = (p_1, \cdots, p_Q)'$ with $p_j = \sum_{i} Y_{ij}/n$; the second matrix block in (42) corresponds to the success categories. Since the two blocks of $P_1$ are negatives of each other, an optimal first axis can be taken with $u_{10} = -u_{11}$, where $u_{10}$ and $u_{11}$ denote the weight subvectors of the failure and success categories. By (34), we get

$$f_1(i) = \frac{2}{Q} \sum_{j=1}^{Q} (Y_{ij} - p_j)\, u_{11j}. \tag{43}$$

It is evident that (43) equals

$$f_1(i) = \frac{2}{Q} \left( Y_{i\bullet} - \sum_{j=1}^{Q} p_j \right) \tag{44}$$

if and only if $u_{11j} = 1$ for $j = 1, \cdots, Q$, which is the required result. $\Box$

Since the orientation of $f_1$ is arbitrary, if the condition of Theorem 2 holds, we will choose $f_1$ so that $corr(f_1(i), Y_{i\bullet}) = 1$. In this case, the points $(f_1(i), Y_{i\bullet})$ will lie on a straight line by (44).

To see what happens if some $u_{11j} = -1$, we consider the case when only one weight is negative, say $u_{11Q} = -1$. Then, by (43), we have

$$f_1(i) = \frac{2}{Q} \left( Y_{i\bullet} - \sum_{j=1}^{Q} p_j - 2(Y_{iQ} - p_Q) \right), \tag{45}$$

that is,

$$f_1(i) = \frac{2}{Q} \left( Y_{i\bullet} - \sum_{j=1}^{Q} p_j + 2 p_Q \right) - \frac{4}{Q}\, Y_{iQ}. \tag{46}$$

Equations (45) and (46) show that the points $(f_1(i), Y_{i\bullet})$ will lie on two parallel lines, defined by the success or failure of the $i$th respondent on item $Q$.
Definition: a) For a dichotomous item $q$, $q = 1, \cdots, Q$, we define the first factor success of item $q$ to be the category of item $q$ with first MTCA factor score $g_1(j_q) > 0$, for $j_q = 0, 1$. b) For a polytomous item $q$, $q = 1, \cdots, Q$, we define the first factor success of item $q$ to be the category set $\{j_q \mid g_1(j_q) > 0 \text{ for } j_q = 0, \cdots, J_q - 1\}$.

Now we can interpret Theorem 2 in the following way: a) all the success categories (coded as 1 in $Y$) of the $Q$ items, $u_{11} = 1_Q$, oppose all the failure categories (coded as 0 in $Y$) of the $Q$ items; b) a success on item $q$ is identical to the first factor success of item $q$; that is, for each item, success and first factor success coincide. If, for some item, success and first factor success differ then, depending on the subject matter, we either delete this item from the analysis or swap success with first factor success (and likewise failure).
If the condition of Theorem 2 holds, then the above two points imply that the $Q$ items are broadly similar in objective and point in the same direction, towards one general latent variable; further, principal dimensions of order higher than one will reveal specific local factors conditioned on the first general latent variable, the sum score, as will be seen in the analysis of the health survey data set.
The case of polytomous data follows easily from Theorem 2 if we define the success of a polytomous item to be identical to the first factor success given in the above definition; thus, by Theorem 2, each cluster will be perfectly characterized by the raw sum score of the first factor successes in the response patterns belonging to that cluster.
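Theorem 2 can be checked numerically by combining the sketches above: simulate dichotomous data with a common person-level propensity, run MTCA (TCA of the indicator matrix Z), and correlate the first factor scores with the sum scores. This is an illustrative check, not the paper's computation; the absolute correlation is 1 (up to rounding) exactly when the fitted first-axis success weights share a common sign, i.e., when the condition of the theorem holds for the simulated data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, Q = 200, 5
p_i = 0.3 + 0.5 * rng.random((n, 1))        # person-level success propensity
Y = (rng.random((n, Q)) < p_i).astype(int)  # dichotomous items, common factor

Z = indicator_matrix(Y, n_cats=[2] * Q)     # failure column first, then success
(lam1, f1, g1), = tca(Z, 1)                 # first nontrivial MTCA dimension

S = Y.sum(axis=1)                           # subject sum scores Y_i.
r = np.corrcoef(f1, S)[0, 1]
print(abs(r))                               # approx. 1.0 under Theorem 2's condition
```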
For some theoretical and empirical comparisons of the sum score statistic for binary data pointing to one underlying latent variable with parametric and nonparametric models, see in particular Cox and Wermuth (2002).

Multiple Correspondence Analysis of the Health Survey Data
The second column of Table 2 displays the dispersion measures (standard deviations) of the first five important factors resulting from CA of $Z$; in CA terminology, $\lambda_\alpha^2$ represents the inertia (variance) of the $\alpha$th factor. We see that the first three values are clearly singled out: $\lambda_1 = 0.8974$, being close to 1, implies that the data set $Z$ has a quasi 2-block structure; and $\lambda_2 \approx \lambda_3$ implies that the principal plane 2-3 should be looked at. We did not present the percentage of the variance explained by each principal factor, because these percentages are misleading; many adjusted values have been proposed in the literature, see for instance Greenacre (1993).

Figures 1 and 2 show the MCA maps of the principal planes 1-2 and 2-3, respectively. In Figure 1, we clearly see that the missing (N) and tried with no access (3) categories dominate the map by forming two distinct clusters far from the center, while the remaining column points, representing used this service (1) and never tried (2), are clustered around the origin; further, the second dimension separates the missing (N) categories from the tried with no access (3) categories. Figure 2 shows the complete separation of the four category values (1), (2), (3) and (N). Table 1 shows that the two categories (N) and (3) have small weights for each of the 22 items, and it is a well known fact that rare categories often disturb the graphical displays in CA or MCA. Another way of interpreting Figure 1 is that the categories (3) and (N) can be considered outliers whose harmful influence should be eliminated. Different approaches to handle missing or outlier categories have been proposed by Michailidis and de Leeuw (1998), Le Roux and Rouanet (2004, Chapter 5), Greenacre (2006) and Greenacre (2009). Figures 3 and 4, which display the projections of the individuals on the principal planes 1-2 and 2-3, have the same form as Figures 1 and 2, and they admit the same interpretation.

The third column of Table 2 displays the dispersion measures (mean deviations) of the first five factors resulting from TCA of $Z$. We see that the first dimension is very important, $\lambda_1 = 0.3014$, and probably the second, $\lambda_2 = 0.1910$. The remaining dimensions were not interpretable.
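For reference, the CA dispersions in the second column of Table 2 come from the standard CA computation applied to the indicator matrix Z, namely the singular values of the standardized residual matrix; a minimal sketch, with our own function name.

```python
import numpy as np

def ca_dispersions(N, k):
    """First k CA singular values (dispersions) of a table N."""
    P = np.array(N, dtype=float) / np.sum(N)
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    return np.linalg.svd(S, compute_uv=False)[:k]        # lam_1 >= lam_2 >= ...
```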
Figure 5 shows the MTCA map of the principal plane 1-2, where we see that the four groups of categories are clearly separated; the image they form looks like a curved horseshoe or a parabola, which implies that there is one underlying latent variable. For recent interesting discussions of horseshoes in multivariate analysis, see Diaconis, Goel and Holmes (2008) and De Leeuw (2007). We also note that the first principal axis clearly separates the categories used this service (1) from the rest, (2), (3) and (N). Comparing Figure 5 with Figure 1, we see that in Figure 5 there is no evidence for characterizing the categories (3) and (N) of all the questions as outliers: in fact, all 22 (N) categories are clustered at one point at the extreme corner of the third quadrant of Figure 5.

Figure 6, which should be compared with Figures 3 and 4, shows the projection of the respondents on the first principal plane. We see a very clear pattern: the 3530 individuals are clustered, and on the first axis there are 22 clusters. Theorem 1 states that the maximum number of clusters of respondents on the first principal axis is 23 = 22 + 1 = Q + 1, where Q is the number of questions. What is the interpretation of the 22 clusters? Theorem 2 states that the clusters of respondents can be completely characterized by a discrete variable $S$, the simple sum score statistic of used this service (1) over all items, because the 22 categories used this service (1) have positive first principal factor scores. We name the category used this service (1) the first factor success category for each item; its complement, the "first factor failure", is the set {(2), (3), (N)}.

Table 3 provides some summary statistics of the clusters, which we describe in steps. c) We introduce some notation to formulate mathematically the calculations done in columns 4 to 7. Let $Q = 22$ be the number of items or questions, $C = 22$ the number of clusters, and $n_c$ the frequency of individuals in cluster $c$; for example, $n_1 = 143$. We can express the 0/1 matrix $Z$ as a three-way array $z_{iqv}$ of size 3530 × 22 × 4, where $z_{iqv} = 1$ if respondent $i$ chose category value $v$ on item $q$, and 0 otherwise. Consider the matrix $W = (w_{iv})$ of size 3530 × 4, where $w_{iv} = \sum_{q=1}^{Q} z_{iqv}$ represents the number of times that respondent $i$ chose the category value $v$ across all items. Let $W_c$, of size $n_c \times 4$, be the subset of the rows of $W$ whose individuals belong to cluster $c$; for instance, $W_4$ is of size 7 × 4, and its elements are given in Table 4. In Table 4, the row identified by min = (4 17 0 0) provides the minimum values of the four columns of the matrix $W_4$, and the row identified by max provides the corresponding maximum values.

So we see that the first principal factor of MTCA revealed that the data set has a very clear structure based on the simple sum score statistic of the first factor success categories over all items. Further, the 22 health items are broadly similar in objective and point in the same direction.
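The quantities $W$ and $S$ and the per-cluster (min, max) rows of Tables 3 and 4 can be computed directly from Z. A sketch on synthetic stand-in data (the survey data are not reproduced here; the category order within each item block is assumed to be (1), (2), (3), (N), and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, Q, V = 3530, 22, 4                     # respondents, items, categories
Y = rng.choice(V, size=(n, Q), p=[0.55, 0.35, 0.05, 0.05])   # placeholder data
Z = indicator_matrix(Y, n_cats=[V] * Q)   # reusing the earlier sketch

W = Z.reshape(n, Q, V).sum(axis=1)        # w_iv = sum_q z_iqv
S = W[:, 0]                               # sum score of category (1) over items
for cl in range(Q + 1):                   # at most 1 + Q clusters (Theorem 1)
    Wc = W[S == cl]                       # rows of W belonging to cluster cl
    if len(Wc):
        print(cl, len(Wc), Wc.min(axis=0), Wc.max(axis=0))
```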
The second principal factor has a simple interpretation: for a fixed sum score $S$ (the sum score of (1)'s), it shows the intra-cluster variability of the response patterns with that sum score. Which clusters have the most variability, and what is the nature of that variability? Looking at Table 3, we check the (min, max) values for each cluster: it is evident that the first cluster, characterized by $S = 0$, is the most heterogeneous, followed by the clusters $S = 13$ and $S = 14$.
The cluster $(S = 0)$ has (min, max) = (0, 14) for the categories never tried (2) and tried with no access (3), and (min, max) = (0, 22) for the category missing (N). We also note that all the missing nonresponse values, save one, are found in this cluster. Figure 6 confirms this fact: we see 8 points aligned vertically in the third quadrant. We also note that the relative frequency of this group is very small, 143/3530 = 0.04051. In fact, we recall that the units in this group were designated as outliers in MCA.
The cluster $(S = 13)$ has (min, max) = (4, 9) for the category never tried (2), (min, max) = (0, 5) for the category tried with no access (3), and (min, max) = (0, 0) for the category missing (N). This is natural variability, because the sum score statistic, being a sum of successes, has the most variability around its central values. A similar interpretation applies to the cluster $(S = 14)$, which has a relative frequency of 228/3530 = 0.0646.

Concerning the passive item gender, Table 5 displays its distribution in each cluster: the clusters $(S \le 13)$ are positively associated with females (the LOR values are negative).

Conclusion
MCA has been a popular and well established method since the 1970s for analyzing questionnaire surveys of qualitative variables; but it is sensitive to the presence of outliers, which usually form a small fraction of the data. MTCA is a robust $L_1$ variant of MCA.
MCA and MTCA can produce different results, because the geometries underlying the two methods are different. We suggest analyzing a data set by both methods: each method sees the data from its own point of view, and sometimes the views are similar and other times not. Thus MCA and MTCA complement and enrich each other. Cox (2006) titled his talk "In praise of simple sum score". We showed that the first MTCA factor scores can always be interpreted as a simple sum score of first factor successes.

Figure 1: MCA map of the 88 categories

Figure 5: MTCA map of the 88 categories

Figure 6: MTCA map of the 3530 respondents

Table 1: The marginal distribution of frequencies of the categories of the 22 health related service items, with the symbols used for their representation

Table 2: The first five dispersion measures, from CA of Z (column 2) and from TCA of Z (column 3)

Table 4: The W_4 matrix, where the seven respondents have first principal factor score of −1.1032

Table 5: The distribution of gender in each cluster