A-Kappa: A Measure of Agreement among Multiple Raters

Abstract: Medical data and biomedical studies are often imbalanced, with a majority of observations coming from healthy or normal subjects. In the presence of such imbalances, agreement among multiple raters based on Fleiss' Kappa (FK) produces counterintuitive results. Simulations suggest that the degree of FK's misrepresentation of the observed agreement may be directly related to the degree of imbalance in the data. We propose a new method for evaluating agreement among multiple raters, A-Kappa (AK), that is not affected by imbalances. We compare the performance of AK and FK by simulating various degrees of imbalance and illustrate the use of the proposed method with real data. The proposed index of agreement may provide some insight by relating its magnitude to a probability scale, whereas existing indices are interpreted arbitrarily. The new method provides not only a measure of overall agreement but also an agreement index for an individual item. Computing both AK and FK may shed further light on the data and be useful in interpreting and presenting the results.


Introduction

1.1 The problem
Biomedical, social, behavioral and other studies routinely include statistical evaluations of agreement among multiple raters or conditions (Feinstein et al, 1985), and Fleiss' Kappa (FK) is widely used to evaluate agreement (Fleiss, 1981).
This work stems from a real problem that arose when examining agreement among radiologists who were evaluating mammographic breast images. Table 1 shows the results of an experiment in which 10 radiologists at Brigham and Women's Hospital, Boston, Massachusetts, independently reviewed 102 breast MRIs obtained between September 2004 and April 2008. Increased breast density may be associated with increased breast cancer risk, and therefore classification of mammography reports is important both scientifically and clinically. The BI-RADS algorithm developed by the American College of Radiology was used to classify breast composition into four categories. One of the categories indicates whether or not an image exhibits a 'Fatty' (< 25% glandular) pattern. Table 1 collapses the three non-Fatty categories into one and denotes them as '1'; a Fatty image is denoted as '0'. Each line in the table displays a sequence of 1s and 0s that shows the scoring pattern across the 10 raters.

All 10 raters classified 85 (83.33%) of the 102 images into the non-Fatty category, indicating complete agreement among the raters on these images. An additional 10 images were classified as non-Fatty by 9 (90%) of the raters, so that more than 93% of the images were classified into the non-Fatty category by at least 90% of the raters. Despite such a large degree of consensus among the raters, Fleiss Kappa (FK) for these data is only 0.119 (95% CI: 0.090, 0.148), indicating poor or no agreement. An alternative agreement index developed in this paper, called A-Kappa (AK), yields a value of 0.906 (95% CI: 0.889, 0.923) for the same data, indicating high agreement among raters. Thus, the widely used agreement index FK yields a counterintuitive result in this instance.
It will be shown later that a data set consisting of a large number of positive (or negative) events may yield a poor FK despite high observed agreement. In the context of Table 1, all the raters classify 83.33% of the images into the positive (score 1) category. There is not a single image that all the raters classified into the negative (score 0) category. Similarly, there are 5 images with eight 1s, but there is not a single image with eight 0s. This type of data set is referred to as an unbalanced or asymmetrical data set in this paper. This issue is addressed further in Section 2.3, where it is also shown that the proposed measure AK is not influenced by the imbalance in a data set.
Medical data are prone to imbalance due to high or low prevalence of a given characteristic or disease (Li, Liu and Hu, 2010). For example, in a screening setting there are likely to be more healthy individuals, while in a specialty care center there may be a larger number of subjects with a disease. Hence, for many biomedical data sets Fleiss Kappa (FK) is likely to misrepresent or completely miss the agreement present in the data. A-Kappa (AK), developed in this paper, could be an alternative tool to evaluate agreement in such data sets, as it is not influenced by the asymmetry or imbalance.
In the two-rater case, the phenomenon of high observed agreement with low Cohen's Kappa, often found with asymmetrical or unbalanced data, has been studied and some remedies have been suggested (Feinstein et al, 1990; Cicchetti et al, 1990; Lantz et al, 1990). These remedies primarily suggest reporting some alternative indices along with Cohen's Kappa. Lantz and Nebenzahl (1990) maintain that Kappa alone has little interpretive value and recommend reporting alternative indices along with Kappa.
In this article we show that Fleiss Kappa, the most widely used agreement index among multiple raters, shows the same high-agreement, low-Kappa behavior as Cohen's Kappa. A new method for evaluating agreement among multiple raters, A-Kappa (AK), is proposed. This method is not affected by the type of imbalance described above and is able to capture the observed agreement. Furthermore, in the case of a balanced data set it reduces to FK. It is proposed that both FK and AK be reported with the results of data analysis.

Proposed agreement index: A-Kappa
Suppose that each of the $r\ (\ge 2)$ independent raters classifies the $i$th image $(i = 1, 2, \ldots, N)$ into one of two categories by assigning a score of 1 or 0 to indicate presence or absence of a disease, respectively. When all the raters agree on a given image, we will have either all 1s or all 0s. Similarly, when the sequence consists of equal numbers of 1s and 0s (50% of each), it is considered a situation of complete absence of agreement. Let $a_i$ denote the number of 1s in the sequence for the $i$th image. In other words, $a_i$ raters out of $r$ total raters classify the $i$th image into the disease category and the remaining $r - a_i$ into the non-disease category. We assume that each image is read by all of the $r$ raters.
The proposed measure of agreement A-Kappa (AK) is defined as
$$AK = \frac{1}{N}\sum_{i=1}^{N}\frac{(2a_i - r)^2 - r}{r(r-1)}. \qquad (2.1)$$
The derivational argument behind the definition of AK in equation (2.1) is provided in Section 3.1, where more than two categories are addressed. More specifically, when $k = 2$, equation (3.5) reduces to equation (2.1).
The following results follow from the definition of AK given by equation (2.1). Proofs are given in the Appendix.

Proposition 1. When there are two raters, AK reduces to Maxwell's Random Error (RE) coefficient (Maxwell, 1977):
$$RE = 2P_0 - 1, \qquad (2.2)$$
where $P_0$ is the proportion of images on which the two raters agree. Note that RE was originally proposed as an alternative measure of agreement between two raters to address the issue of high agreement and low Cohen's Kappa.

Proposition 2. If $w_i$ is the proportion of pairs of raters who agree and $v_i$ is the proportion of pairs who disagree on the $i$th image, then
$$AK = \frac{1}{N}\sum_{i=1}^{N}(w_i - v_i).$$

Proposition 3. AK for multiple raters $r$ is the average of AK for all possible pairs of raters. Let $AK_{ij}$ denote AK obtained from the $i$th and $j$th raters. Similarly, let $P_{0,ij}$ denote the proportion of images on which both of these raters agree. Then
$$AK = \binom{r}{2}^{-1}\sum_{i<j} AK_{ij} = \binom{r}{2}^{-1}\sum_{i<j}\left(2P_{0,ij} - 1\right).$$
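The three propositions can be checked numerically on a small example. The sketch below is illustrative only: it assumes the two-category form of AK implied by the derivation in Section 3.1, namely the mean over images of $[(2a_i - r)^2 - r]/[r(r-1)]$, and the function names and the toy ratings are hypothetical.

```python
from itertools import combinations


def a_kappa_binary(ratings):
    """A-Kappa for binary scores; `ratings` holds one row of 0/1 scores per image."""
    n_images = len(ratings)
    r = len(ratings[0])  # number of raters
    total = 0.0
    for row in ratings:
        a = sum(row)  # number of raters who scored this image as 1
        total += ((2 * a - r) ** 2 - r) / (r * (r - 1))
    return total / n_images


def pairwise_average(ratings):
    """Average of 2*P0 - 1 (Maxwell's RE) over all pairs of raters."""
    n_images = len(ratings)
    r = len(ratings[0])
    values = []
    for first, second in combinations(range(r), 2):
        p0 = sum(row[first] == row[second] for row in ratings) / n_images
        values.append(2 * p0 - 1)
    return sum(values) / len(values)


# Hypothetical toy data: 4 images scored by 4 raters.
ratings = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
ak = a_kappa_binary(ratings)
print(ak, pairwise_average(ratings))

# Two raters only: AK should equal 2*P0 - 1 with P0 = 3/4.
two_raters = [[1, 1], [1, 0], [0, 0], [1, 1]]
print(a_kappa_binary(two_raters))
```

On this toy data both routes give the same value, which is what Proposition 3 states, and the two-rater case gives $2(0.75) - 1 = 0.5$ as in Proposition 1.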

Relationship between A-Kappa (AK) and Fleiss Kappa (FK)
In this section, the relationship between AK and FK is developed, and it is shown that for balanced or symmetrical data these indices are identical. The concept of balanced or symmetrical data was introduced at the beginning of this paper. It will be revisited here to establish equality between AK and FK.
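Let $p = \frac{1}{Nr}\sum_{i=1}^{N} a_i$ denote the overall proportion of ratings equal to 1, and let $q = 1 - p$. Then the Fleiss Kappa (FK) is defined as (Fleiss, 1981)
$$FK = 1 - \frac{\sum_{i=1}^{N} a_i (r - a_i)}{N r (r-1)\, p q}.$$
Multiplying both sides of it by $4pq$, and noting from equation (2.1) that $1 - AK = \frac{4}{N r (r-1)}\sum_{i=1}^{N} a_i (r - a_i)$, we obtain $1 - AK = 4pq\,(1 - FK)$.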

Symmetrical or balanced data
In the present context, symmetry may be defined in several ways. At the basic level, if 50% of the observations (e.g. images) in a data set come from one population (say, healthy) and the remaining 50% come from a second population (e.g. diseased), then such a data set could be considered a balanced data set. Even with experienced raters, it is likely that there will be instances of misclassification, and therefore it is unlikely that every image will be assigned either 1 or 0 by all raters. But with large data sets one can expect that images will be evenly misclassified. In other words, one would expect that for every misclassification into the positive category there will be a misclassification into the negative category. Imbalance in a data set may be due to design (e.g. more positive images in the collected data) or prevalence (e.g. rare disease). For example, if a data set contains considerably more positive (or negative) images, then the data set will be imbalanced. Hence, a data set in which there is an image with a given number of 1s for each image with the same number of 0s will be considered symmetrical or balanced. Lack of this gives rise to an unbalanced data set. According to this definition, the data set presented in Table 1 is unbalanced.
Proposition 5. AK ≥ FK, with AK = FK when the data set is balanced or symmetrical.

Proof: In the case of balanced data (see definition above), the entire data set can be presented in terms of $N/2$ pairs of images such that, for a pair consisting of the $i$th and $i'$th images, we have $a_{i'} = r - a_i$, so that 1s and 0s are equally frequent in the data set and AK = FK. In an unbalanced or asymmetrical data set, Fleiss Kappa (FK) yields a smaller value than AK.
It is worth noting that if the number of 1s is equal to the number of 0s in the data set, then AK = FK as well, whether or not the data set meets the definition of symmetry. If 1s and 0s are assigned completely randomly, then both AK and FK will be equal to zero, indicating lack of agreement.
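The contrast between balanced and unbalanced data can be illustrated with a short computation. This is a sketch under stated assumptions: FK is computed from the standard Fleiss definition (mean pairwise agreement corrected by the chance term $p^2 + q^2$), AK is computed as twice the mean pairwise agreement minus one (equivalent to the two-category form of AK), and both toy data sets are hypothetical.

```python
def agreement_indices(counts, r):
    """Return (AK, FK) for binary data; counts[i] is the number of raters scoring image i as 1."""
    n = len(counts)
    # Mean proportion of agreeing rater pairs per image.
    p_bar = sum(a * (a - 1) + (r - a) * (r - a - 1) for a in counts) / (n * r * (r - 1))
    p = sum(counts) / (n * r)        # overall proportion of 1s
    p_e = p ** 2 + (1 - p) ** 2      # chance agreement used by Fleiss' Kappa
    fk = (p_bar - p_e) / (1 - p_e)
    ak = 2 * p_bar - 1               # agreeing pairs minus disagreeing pairs
    return ak, fk


r = 10
# Hypothetical unbalanced data: every image is scored 1 by 9 or 10 raters.
unbalanced = [10] * 8 + [9] * 2
# Hypothetical balanced data: each pattern of 1s has a mirror-image pattern of 0s.
balanced = [10] * 4 + [9] + [0] * 4 + [1]

ak_u, fk_u = agreement_indices(unbalanced, r)
ak_b, fk_b = agreement_indices(balanced, r)
print("unbalanced:", ak_u, fk_u)
print("balanced:  ", ak_b, fk_b)
```

Both data sets have the same pairwise agreement, so AK is identical (0.92) in the two cases, while FK falls below zero for the unbalanced set and equals AK for the balanced set. The two definitions used here also imply $1 - FK = (1 - AK)/(4pq)$, which makes the dependence of FK on the marginal proportion $p$ explicit.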

Simulations
We conducted several simulations to gain insight into A-Kappa's performance, to compare it with Fleiss Kappa, and to aid in its interpretation.
We based our simulations on 10 raters, 2 categories and 10,000 images. Let $\theta$ denote the proportion of diseased images. Let $\pi$ denote the probability of correctly classifying an image by a rater; it is assumed to be the same for each rater. Each rater is assumed to assign a score of 1 to a diseased image and 0 to a healthy image according to a binomial probability $\pi$ = P[X = 1 | diseased image] = P[X = 0 | healthy image]. It is not necessarily true that $\pi$ = P[X = 1 | diseased image] = P[X = 0 | healthy image] for each rater, but even this simple assumption can be used to generate imbalanced data. Our goal here is simply to demonstrate the vulnerability of FK and the robustness of AK in certain types of data. All simulations and subsequent computations were carried out using SAS 9.2.
Note that this is not the only way to generate columns (or rows) of 0s and 1s in order to show that FK may fail to reflect the observed agreement in 'unbalanced' data.
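The simulation design described above can be sketched in a few lines. The original computations were carried out in SAS 9.2; the Python version below is only an illustrative stand-in that uses the standard library, a smaller number of images, and an arbitrary seed. The symbols `theta` (proportion of diseased images) and `pi` (probability of a correct classification) and all function names are assumptions made for this sketch.

```python
import random


def simulate_counts(n_images, r, theta, pi, rng):
    """Number of raters scoring each image as 1 under the simple binomial model."""
    counts = []
    for _ in range(n_images):
        diseased = rng.random() < theta
        # P(score = 1) is pi for a diseased image and 1 - pi for a healthy image.
        p_one = pi if diseased else 1 - pi
        counts.append(sum(rng.random() < p_one for _ in range(r)))
    return counts


def ak_fk(counts, r):
    """Return (AK, FK) from per-image counts of 1s."""
    n = len(counts)
    p_bar = sum(a * (a - 1) + (r - a) * (r - a - 1) for a in counts) / (n * r * (r - 1))
    p = sum(counts) / (n * r)
    p_e = p ** 2 + (1 - p) ** 2
    return 2 * p_bar - 1, (p_bar - p_e) / (1 - p_e)


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
results = {}
for theta in (0.5, 0.95):
    counts = simulate_counts(5000, 10, theta, 0.9, rng)
    results[theta] = ak_fk(counts, 10)
    print(theta, results[theta])
```

With `pi` = 0.9, AK should stay near 0.64 for both values of `theta`, while FK should be close to AK only when `theta` = 0.5.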

The first column of Table 2 shows values of $\pi$; the second column shows the percentage of images on which 9 (90%) or more raters agree. This is taken as a measure of crude or observed agreement. A quick glance at Table 2 shows that when $\pi$ = 0.5 both AK and FK indicate absence of agreement. This is a situation in which raters classify images randomly. However, when $\pi$ is different from 0.5, FK varies with the proportion of positive images while AK remains the same for a given $\pi$. When the proportion of diseased images is equal to that of healthy images, both AK and FK yield the same value for all $\pi$. On the other hand, when a majority of images are from diseased patients (or from healthy patients), FK is further from the observed (or crude) agreement than AK. In such situations AK captures the observed agreement better than FK. For example, when $\pi$ = 0.90, the observed agreement is 0.7325 (at least 90% of raters agree on 73.25% of images) and A-Kappa = 0.646, but Fleiss Kappa could be as low as 0.03 when almost all images are either positive or negative and as high as 0.646 (the value of AK) when 50% of images are positive.
The above results showing difference between AK and FK are also depicted in Figure 1.
When the underlying proportion of positive images is roughly 0.5 ($\theta$ = 0.5), the lines for AK and FK are the same for all values of $\pi$. That is why the lines for AK and FK50 (i.e. FK values when 50% of images are positive) are not distinguishable in Fig. 1. Also, AK and FK are the same (and near zero) irrespective of the proportion of diseased images when raters assign scores of 0 or 1 to an image randomly (i.e. when $\pi$ = 0.50).

An Interpretation of A-Kappa (AK)
Landis and Koch (1977) provided an interpretation of the Kappa statistic. However, those interpretations are considered arbitrary. Except for Kappa = 1 and 0, implying perfect and chance agreement, respectively, other values fail to convey the degree of agreement in terms of an interpretable scale. The following discussion may help shed some light on the interpretation of AK.
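Assume that the same image is evaluated by a set of $r$ raters by assigning 1 or 0 to the image for the presence or absence of a disease. Also, assume that some time has elapsed between two evaluations. If each rater classifies an image correctly with probability $\pi$, then any two raters agree on an image with probability $\pi^2 + (1-\pi)^2$, so that
$$E(AK) = 2\left[\pi^2 + (1-\pi)^2\right] - 1 = (2\pi - 1)^2. \qquad (2.8)$$
Solving for $\pi \ge 0.5$ gives
$$\pi = \frac{1 + \sqrt{AK}}{2}. \qquad (2.9)$$
Thus, for a given AK value, we can say that one would have obtained the same AK if each rater classified a diseased image correctly with the probability given by equation (2.9). However, it is not necessarily true that the data in hand were generated with this probability.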
Equation (2.8) indicates that when 90% of the raters assign 1 or 90% of the raters assign 0 (say, to all the images), then AK is about 0.64. In the case of two raters, the proportion of images on which the two raters agree is expected to be 0.9×0.9 + 0.1×0.1 = 0.81 + 0.01 = 0.82, so that AK = 2(0.82) - 1 = 0.64 (see equation 2.2). On the other hand, if the data yield AK = 0.64, then one would obtain the same AK from a data set in which the probability of correctly classifying an image by each rater is 0.90. No such interpretation exists for FK, where the interpretation is completely arbitrary (Landis et al, 1977).
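The probability-scale reading of AK can be reproduced with a two-line calculation. The sketch below assumes the relationship implied by the two-rater arithmetic above, $AK = 2[\pi^2 + (1-\pi)^2] - 1 = (2\pi - 1)^2$, and its inverse for $\pi \ge 0.5$; whether these match the paper's equations (2.8) and (2.9) term for term is an assumption, and the function names are hypothetical.

```python
import math


def expected_ak(pi):
    """AK expected when every rater classifies correctly with probability pi."""
    return (2 * pi - 1) ** 2


def implied_probability(ak):
    """Per-rater probability of a correct classification that would yield this AK (pi >= 0.5)."""
    if not 0 <= ak <= 1:
        raise ValueError("this interpretation needs an AK value between 0 and 1")
    return (1 + math.sqrt(ak)) / 2


print(expected_ak(0.9))            # the worked example in the text: about 0.64
print(implied_probability(0.64))   # about 0.90
print(implied_probability(0.906))  # the AK reported for the Table 1 data
```

Under this reading, the AK of 0.906 reported for the mammography data corresponds to raters who each classify an image correctly with probability of roughly 0.98, although, as noted in the text, the data need not have been generated this way.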

Multiple Raters and Multiple Categories
Although the main focus of the paper is the evaluation of agreement among multiple raters (or situations) on two possible classifications (or categories), the results presented in the previous sections are briefly extended to the multiple-category situation in the following sections. Some additional insights on AK are also presented.

Derivation of A-Kappa (AK) for multiple categories and multiple raters
Assume that each rater classifies an image into one of the $k$ categories. Let $a_{ij}$ denote the number of raters who classified the $i$th image into the $j$th category, so that $\sum_{j=1}^{k} a_{ij} = r$ for each image. In case of complete agreement on the $i$th image, all raters will classify the image into the same category. In the case of complete lack of agreement, the $i$th image will be categorized into each category by an equal number of raters. Therefore, in this case, $r/k$ raters are expected to classify such an image into each of the $k$ categories. One can think of agreement as the discrepancy or distance from complete disagreement. This discrepancy may then be expressed as
$$\sum_{j=1}^{k}\left(a_{ij} - \frac{r}{k}\right)^2. \qquad (3.1)$$
This quantity can be rescaled by dividing it by its maximum possible value so that the distance between the observed data and the state of complete disagreement lies between 0 and 1. The maximum value of the expression given in (3.1) is attained when all $r$ raters choose the same category and is given by
$$\left(r - \frac{r}{k}\right)^2 + (k-1)\left(\frac{r}{k}\right)^2 = \frac{r^2(k-1)}{k}.$$
Therefore, the rescaled distance for the $i$th image is
$$G_i = \frac{k}{r^2(k-1)}\sum_{j=1}^{k}\left(a_{ij} - \frac{r}{k}\right)^2. \qquad (3.2)$$
Hence, the mean of $G_i$ across $N$ images can be considered as the 'crude' agreement among raters. This is given by
$$G = \frac{1}{N}\sum_{i=1}^{N} G_i. \qquad (3.3)$$
However, some of this agreement may be due to chance. Opinions differ regarding the definition of chance-induced agreement. For example, Maxwell uses 0.5 as the chance-induced agreement and Cohen uses the marginal probabilities of the 2×2 table under consideration. Another way to quantify the agreement due to chance might be to estimate the agreement expected in a sample that comes from a population lacking agreement among raters. In terms of the notation used above, error due to chance may be given by the expected value of $G$ given that there is an absence of agreement in the population.
Proposition 7. The expected value of $G$ given that there is an absence of agreement in the population is given by
$$E(G) = \frac{1}{r} \qquad (3.4)$$
(see the Appendix for a proof).

A-Kappa (AK) is the measure $G$ adjusted for agreement due to chance and rescaled to yield the maximum possible value of 1. It is given by
$$AK = \frac{G - 1/r}{1 - 1/r} = \frac{rG - 1}{r - 1}. \qquad (3.5)$$
The functional forms of AK for two categories and for multiple categories seem different, but in fact they are the same, as shown below. A-Kappa for two categories was developed first and was then shown to be an extension of Maxwell's Random Error (RE). Since the error due to chance was already embedded in Maxwell's RE, no error adjustment was discussed. When $k = 2$, we have $a_{i1} = a_i$ and $a_{i2} = r - a_i$, and equation (3.5) becomes
$$AK = \frac{1}{N}\sum_{i=1}^{N}\frac{(2a_i - r)^2 - r}{r(r-1)},$$
which is the same as equation 2.1 (a proof is given in the Appendix).
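The multi-category construction can be written out directly from its verbal description. The sketch below assumes that the discrepancy is the sum of squared deviations of the category counts from $r/k$, that its maximum is $r^2(k-1)/k$, and that the chance term is $1/r$; the function names and the toy count tables are hypothetical.

```python
def rescaled_distance(row):
    """G_i: squared distance of one image's category counts from r/k, divided by its maximum."""
    r = sum(row)   # number of raters
    k = len(row)   # number of categories
    distance = sum((a - r / k) ** 2 for a in row)
    return distance * k / (r ** 2 * (k - 1))


def a_kappa(count_rows):
    """AK = (r*G - 1) / (r - 1); count_rows[i][j] = raters placing image i in category j."""
    r = sum(count_rows[0])
    g = sum(rescaled_distance(row) for row in count_rows) / len(count_rows)
    return (r * g - 1) / (r - 1)


# Complete agreement on every image (10 raters, 3 categories).
perfect = [[10, 0, 0], [0, 10, 0]]
# Raters spread evenly over the categories (9 raters, 3 categories).
even_split = [[3, 3, 3]]
# Two categories: counts of (1s, 0s) for 4 images and 4 raters.
binary = [[4, 0], [3, 1], [2, 2], [0, 4]]

print(a_kappa(perfect))     # close to 1
print(a_kappa(even_split))  # -1/(r - 1), the lowest value AK can take
print(a_kappa(binary))      # matches the two-category form, 5/12
```

The last case gives the same value as the two-category expression $[(2a_i - r)^2 - r]/[r(r-1)]$ averaged over images, consistent with the statement that the two functional forms coincide when $k = 2$.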

Agreement on individual item (image)
Note that A-Kappa proposed in this article is the average of $(rG_i - 1)/(r - 1)$ across the observations (images), where $G_i$ is defined by equation (3.2). Therefore, this quantity could be considered as a measure of agreement among raters on the $i$th observation (image). Let
$$AK_i = \frac{rG_i - 1}{r - 1}, \quad \text{so that} \quad AK = \frac{1}{N}\sum_{i=1}^{N} AK_i. \qquad (3.6)$$

This characteristic of AK is similar to the agreement index proposed by O'Connell and Dobson (1984), but AK is much simpler to compute and uses a different strategy to estimate chance-induced agreement. One advantage of obtaining agreement on an individual image (observation) is that investigators can identify and investigate images with high disagreement. This could be especially useful when training novice raters. Equation (3.6) also shows that AK from two or more data sets can easily be combined to yield the overall AK for the combined data set. Let $N_1$, $N_2$ and $N_1 + N_2$ denote the sample sizes, and $AK_1$, $AK_2$ and $AK$ the A-Kappa values, of the first set, second set and combined set of data, respectively. Then $AK = N_1 AK_1/(N_1 + N_2) + N_2 AK_2/(N_1 + N_2)$. Thus, AK from a combined data set is the weighted average of the AKs from the component data sets. This is not necessarily true for FK.
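Both uses of the per-image index, flagging images with low agreement and pooling AK across data sets, can be sketched as follows. The per-image quantity is taken to be $(rG_i - 1)/(r - 1)$ as described above; the two small data sets and the flagging threshold of 0.2 are hypothetical choices made only for illustration.

```python
def item_ak(row):
    """Per-image agreement (r*G_i - 1) / (r - 1) from one row of category counts."""
    r = sum(row)
    k = len(row)
    g = k * sum((a - r / k) ** 2 for a in row) / (r ** 2 * (k - 1))
    return (r * g - 1) / (r - 1)


def overall_ak(rows):
    """Overall AK as the plain average of the per-image values."""
    return sum(item_ak(row) for row in rows) / len(rows)


# Hypothetical data: 10 raters, 2 categories, two separately collected sets.
set1 = [[10, 0], [9, 1], [10, 0]]
set2 = [[5, 5], [8, 2]]

ak1, ak2 = overall_ak(set1), overall_ak(set2)
combined = overall_ak(set1 + set2)
weighted = (len(set1) * ak1 + len(set2) * ak2) / (len(set1) + len(set2))
print(combined, weighted)

# Flag images whose per-image agreement falls below an arbitrary threshold.
flagged = [i for i, row in enumerate(set1 + set2) if item_ak(row) < 0.2]
print(flagged)
```

The combined AK equals the sample-size-weighted average of the two component values, and the evenly split image (index 3) is the only one flagged for review.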

Asymptotic distribution of A-Kappa (AK)
Following the notation of the previous sections, suppose $r$ raters classify each of the $N$ images into one of the $k$ categories. Suppose that the number of ratings (A proof is given in the appendix).

Simulations for Multiple Categories
Simulation results from the earlier section have shown that, in the case of two categories, FK and AK are equivalent when there is absence of agreement or when the data are symmetrical. Otherwise, FK may fail to reflect the high degree of observed agreement.
Here, we present a few simulations using multiple raters and multiple categories. These simulations show that, as with two categories, FK may fail to reflect a high observed agreement in the case of multiple categories. We generated 10,000 items (images) and assumed that each of the 10 raters classified each image into one of five categories. Let $p_i$ denote the probability with which a rater assigns an image into the $i$th category. When each image is randomly assigned into one of these categories, i.e., when $p_i$ = 0.2 for all $i$, then AK = FK = 0.0004. In this case, both indices truly reflect the absence of agreement among the raters. Next, suppose that 10,000 images of category 4 are evaluated by 10 raters, and each rater can correctly classify the images with probability 0.9. For a simple example, let $p_1 = p_2 = p_3 = p_5 = 0.025$ and $p_4 = 0.90$; then FK = 0.013 and AK = 0.770. Under this simulated scenario, at least 8 raters (80% or more) are found to classify 9,433 (94.33%) images into the 4th category. Therefore, FK fails to reflect the high degree of agreement among the raters.
Next, we simulated data in which 50% of the images were of category 4 and 50% were of category 2, and the raters can correctly classify an image into these two categories with probability 0.9 (with the remaining probability equally distributed over the remaining categories). In this case, the simulated data showed that at least 8 raters (80% or more of the raters) classified 4,486 images into category 2 and 4,699 images into category 4, so that at least 80% of the raters agreed on 9,184 (91.84%) images. FK for these data turned out to be 0.674 while AK = 0.770. Thus, this balancing of the data brought the FK value almost to the level of AK, while AK remained the same.
Hence, in the case of more than two categories and multiple raters, FK may fail to reflect the high degree of observed agreement in asymmetric data, while AK may not be influenced by such asymmetry.
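The two five-category scenarios can be imitated with a small standard-library simulation. This is a sketch, not the original SAS code: it uses fewer images and an arbitrary seed, so the printed values will only be close to those reported above. FK is computed from the standard Fleiss definition, AK from the rescaled distance $G$ with chance term $1/r$, and all function names are hypothetical.

```python
import random


def simulate_rows(true_categories, n_cat, r, p_correct, rng):
    """Category counts per image; each rater picks the true category with probability p_correct."""
    rows = []
    for true_cat in true_categories:
        weights = [(1 - p_correct) / (n_cat - 1)] * n_cat
        weights[true_cat] = p_correct
        picks = rng.choices(range(n_cat), weights=weights, k=r)
        rows.append([picks.count(j) for j in range(n_cat)])
    return rows


def ak_and_fk(rows):
    """Return (AK, FK) from a table of category counts per image."""
    n, n_cat, r = len(rows), len(rows[0]), sum(rows[0])
    # AK from the mean rescaled distance G, adjusted by the chance term 1/r.
    g = sum(n_cat * sum((a - r / n_cat) ** 2 for a in row) / (r ** 2 * (n_cat - 1))
            for row in rows) / n
    ak = (r * g - 1) / (r - 1)
    # FK from mean pairwise agreement and the pooled category shares.
    p_bar = sum(a * (a - 1) for row in rows for a in row) / (n * r * (r - 1))
    shares = [sum(row[j] for row in rows) / (n * r) for j in range(n_cat)]
    p_e = sum(s * s for s in shares)
    return ak, (p_bar - p_e) / (1 - p_e)


rng = random.Random(7)  # arbitrary seed, for reproducibility only
n_images = 4000
one_category = [3] * n_images                                        # all images of category 4
two_categories = [1] * (n_images // 2) + [3] * (n_images // 2)       # half category 2, half category 4

ak1, fk1 = ak_and_fk(simulate_rows(one_category, 5, 10, 0.9, rng))
ak2, fk2 = ak_and_fk(simulate_rows(two_categories, 5, 10, 0.9, rng))
print("single true category:", ak1, fk1)
print("two true categories: ", ak2, fk2)
```

As in the text, AK should be about the same in both scenarios (near 0.77), while FK should be near zero when all images come from one category and much closer to AK when the images are split evenly between two categories.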

Real Example (multiple categories)
Consider the study described in the introduction section. Ten raters were asked to classify the breast composition of 102 images into four categories: Fatty: the breast is almost entirely fat (< 25% glandular); SFD: scattered fibroglandular densities (approximately 25-50% glandular); HD: the breast tissue is heterogeneously dense (approximately 51-75% glandular); ED: the breast tissue is extremely dense (> 76% glandular). Table 3 shows AK and FK among multiple raters and multiple categories.
Note that FK indicates that the raters have poor agreement on whether the images are fatty or not. AK, on the other hand, shows that there is excellent agreement among the raters. Most of the raters classify images into the non-Fatty categories. In Table 3, both AK and FK are 0.403 when classifying an image as heterogeneously dense (HD). This suggests that, if the data were re-arranged into HD versus non-HD categories, the numbers of HD and non-HD images would perhaps be similar. In all other situations we have AK > FK, indicating lack of such symmetry. However, the asymmetry is not substantial except for Category 1 (Fatty vs non-Fatty). In conclusion, if only FK had been used, we might have been misinformed about the agreement among the raters.

Cohen's Kappa (CK) is used routinely to evaluate agreement between two raters or two conditions, but has been criticized by several investigators for being simply a function of prevalence and for being counterintuitive. Using simulations, it is shown in this article that Fleiss Kappa (FK), a measure of agreement among multiple raters, inherits some of these shortcomings of CK.

Classification
A new and simple method for evaluating agreement among multiple raters, A-Kappa (AK), is proposed. This method reduces to Maxwell's Random Error (RE), which was proposed to address the high-agreement, low-kappa paradox in the case of two raters. In this article it is shown, by simulations, that Fleiss' Kappa (FK) may also yield a low kappa even though there is a high degree of agreement among the raters. This is especially true in the case of imbalanced data, where one class of items is much less frequent than the other.
AK, proposed in this paper, may be used as an alternative or an additional index in the case of multiple raters. The proposed measure does not have the seemingly paradoxical characteristic of FK. Computing both AK and FK may provide additional insight. The difference between the two values may indicate whether the data are dominated by one kind of classification of images. As indicated by simulations, FK coincides with AK when the proportions of positive and negative images are the same. A small FK may not really be an indication of low agreement, while a small AK is an indication of low agreement. Also, AKs from two data sets can easily be combined to yield the AK for the combined data set. We recommend calculating both AK and FK.
In the case of two categories, A-Kappa may have a meaningful interpretation on the more familiar scale of probability, as discussed in this paper. Existing interpretations of Kappa values are considered somewhat arbitrary. Computing both AK and FK may shed further light on the data and be useful in interpreting and presenting the results. This is similar to the recommendation of computing both maximum and minimum Kappa in the two-rater, two-category situation.
Though not the focus of the paper, there exists a body of literature on model-based approaches to evaluating agreement (Agresti, 1992 and 2002; Tanner et al, 1985). Similar to the kappa-like indices, most of the model-based methods have also dealt with the situation of two raters, and the number of parameters to be estimated increases exponentially with the number of raters, creating computational challenges. Estimating-equation approaches have also been proposed to model agreement in data with multiple raters having binary and multiple categories (Williamson et al, 2000; Klar et al, 2000). AK will also be examined from a repeated-measures viewpoint in a future study. However, investigators, especially in biomedical studies, still routinely use CK and FK to evaluate agreement. This paper highlights some situations where these methods may fail to capture the agreement and proposes an alternative method that reduces to existing methods proposed for the two-rater situation.
Current limitations of AK include its inability to incorporate raters' characteristics. However, this is also the case with FK and CK. Evaluation of AK with a repeated-measures (or hierarchical) approach is being explored. This will allow us to adjust for confounding factors. Despite such limitations, AK is simple and intuitive, and it provides some insight into the nature of agreement and the data set itself. It is very simple to calculate.

Appendix A
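Proposition 1. In the case of two raters, $AK = 2P_0 - 1$.

Proof: From equation 2.1, in the case of two categories and $r$ raters,
$$AK = \frac{1}{N}\sum_{i=1}^{N}\frac{(2a_i - r)^2 - r}{r(r-1)}.$$
Also, when there are only two raters, $a_i$ (the number of raters who assign a score of 1 to the $i$th image) takes values 0, 1 or 2. So, the term $(2a_i - 2)^2$ equals 4 when the two raters agree ($a_i = 0$ or 2) and 0 when the two raters disagree ($a_i = 1$). Therefore, when $r = 2$,
$$AK = \frac{1}{N}\sum_{i=1}^{N}\frac{(2a_i - 2)^2 - 2}{2} = \frac{4NP_0 - 2N}{2N} = 2P_0 - 1,$$
where $P_0$ is the proportion of images on which both raters agree.

Proposition 2. If $w_i$ is the proportion of pairs of raters who agree and $v_i$ is the proportion of pairs who disagree on the $i$th image, then
$$AK = \frac{1}{N}\sum_{i=1}^{N}(w_i - v_i).$$

Proof: Assume that $a_i$ raters out of $r$ classify the $i$th image into the disease category (by assigning a score of 1) and the remaining $r - a_i$ into the non-disease category (assigning a score of 0). Let $\binom{r}{2}$ denote the total number of pairs of raters. Both members of $\binom{a_i}{2}$ pairs of raters will assign 1, and both members of $\binom{r - a_i}{2}$ pairs will assign 0, to the $i$th image. Similarly, one member of $a_i(r - a_i)$ pairs will assign 1 while the other member will assign 0 to the $i$th image. Therefore,
$$w_i = \frac{\binom{a_i}{2} + \binom{r - a_i}{2}}{\binom{r}{2}}, \qquad v_i = \frac{a_i(r - a_i)}{\binom{r}{2}}.$$
Hence the difference between the proportions of pairs showing agreement and disagreement on the $i$th image is given by
$$w_i - v_i = \frac{(2a_i - r)^2 - r}{r(r-1)}, \quad i = 1, 2, \ldots, N,$$
and averaging over the $N$ images gives equation 2.1.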
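We need to show that $E(AK) = (2\pi - 1)^2$. Under the assumption that each rater will assign a score of 1 with probability $\pi$, $a_i \sim \text{Bin}(r, \pi)$.
Using this information, we have
$$E\left[(2a_i - r)^2\right] = 4r\pi(1-\pi) + r^2(2\pi - 1)^2,$$
so that
$$E(AK) = \frac{4r\pi(1-\pi) + r^2(2\pi - 1)^2 - r}{r(r-1)} = (2\pi - 1)^2.$$

Note that, under the assumption of random classification of images by the raters, the $i$th image is expected to be classified into each category by $r/k$ raters. Therefore, $\sum_{j=1}^{k}(a_{ij} - r/k)^2/(r/k)$ follows approximately a chi-square distribution with $(k - 1)$ degrees of freedom, and its expected value will be $(k - 1)$. Hence
$$E(G_i) = \frac{k}{r^2(k-1)}\cdot\frac{r}{k}\,(k-1) = \frac{1}{r}, \quad \text{so that} \quad E(G) = \frac{1}{r}.$$

Under the assumption that images are independent, the variance-covariance matrix for the entire sample will be a block diagonal matrix with each block being of the form expressed as in (3.6). The following results can be used to estimate the asymptotic variance of AK.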
From equation (3.2), for the ith image,

 
Assuming independent images and noting that the overall A-Kappa for the given data set is the average of AK across the images (observations), we have

Table 2: Comparison of A-Kappa and Fleiss' Kappa using 10,000 images (samples). $\pi$ = probability of classifying an image correctly; $\theta$ = proportion of images from normal subjects.

Note: Each rater is assumed to assign a score of 1 to a diseased image and 0 to a healthy image (from normal subjects) according to a binomial probability $\pi$ = P[X = 1 | diseased image] = P[X = 0 | healthy image]. One can visualize the entire data set as composed of two subsets, one consisting of healthy images only and the other of diseased images only. Simulated data sets were generated with $\pi$ = 0.50, 0.60, 0.70, 0.80, 0.90, 0.95 and 0.99, and $\theta$ = 0.50, 0.70, 0.90 and 0.95, where $\theta$ = proportion of images from the normal (healthy) subjects. It is not necessarily true that $\pi$ = P[X = 1 | diseased image] = P[X = 0 | healthy image] for each rater.

Figure 1

Proposition 3. AK for multiple raters $r$ is the average of RE for all possible pairs of raters. Let $RE_{ij}$ denote Maxwell's Random Error from the $i$th and $j$th raters. Similarly, let $P_{0,ij}$ denote the proportion of images on which both raters agree. Then
$$AK = \binom{r}{2}^{-1}\sum_{i<j} RE_{ij} = \binom{r}{2}^{-1}\sum_{i<j}\left(2P_{0,ij} - 1\right).$$

Proof: Suppose that readings of $N$ images by $r$ raters are presented in $N$ rows and $r$ columns. Let the $r$ columns be denoted by

Proposition 7. The expected value of $G$ given that there is an absence of agreement in the population is given by $1/r$.


are generally unknown, they are replaced by their sample estimates