Evaluation of Agreement between Measurement Methods from Data with Matched Repeated Measurements via the Coefficient of Individual Agreement

We propose a simple method for evaluating agreement between methods of measurement when the measured variable is continuous and the data consist of matched repeated observations made with the same method under different conditions. The conditions may represent different time points, raters, laboratories, treatments, etc. Our approach allows the values of the measured variable and the magnitude of disagreement to vary across the conditions. The coefficient of individual agreement (CIA), which is based on the comparison of the between-methods and within-methods mean squared deviation (MSD), is used to quantify the magnitude of agreement between measurement methods. The new approach is illustrated via two examples from studies designed to compare (a) methods of evaluating carotid stenosis and (b) methods of measuring percent body fat.


Introduction
In studies designed to assess the agreement between methods of measurement, multiple observations are often made with each method on the same subject. These observations can be considered as replicated measurements if the observations with the same method on the same subject are conditionally independent and identically distributed. In this case it is assumed that the subject's true value of the measured quantity remains unchanged across the measurements made by the same method. On the other hand, agreement studies may be designed such that multiple matched observations with two (or more) methods are made on each subject under specific 'conditions', where the subject's true value may change across conditions. The observations are then considered as matched repeated measurements. The 'conditions' may correspond to different time points, raters, laboratories, devices, treatments, etc. For example, in a study designed to compare imaging methods for assessing carotid stenosis (Barnhart and Williamson, 2007) the same three raters used each of the imaging methods to determine the carotid stenosis of each patient. Here the three raters correspond to three 'conditions' under which measurements have been made. Chinchilli et al. (1996), Choudhary (2008), and King et al. (2007a,b) analyzed data from a study in which percentage body fat was estimated on adolescent girls using two methods: (1) skinfold calipers, and (2) dual energy x-ray absorptiometry (DEXA). Measurements were taken in an initial visit at age 12 years and in subsequent visits which occurred every six months. In this case the 'condition' is the girl's age.
The focus of this article is on evaluation of agreement between methods of measurement from matched repeated observations. We assume that all the measurements are made on the same interval scale, hence we can evaluate the extent of agreement between methods via the differences between measurements made on the same subject with different methods. In addition, we assume that a subject's true value may change across the levels of the variable corresponding to the conditions, and that the magnitude of agreement between methods may vary across conditions. We are interested in (a) assessing condition-specific agreement between measurement methods, (b) investigating the effect of the condition on the magnitude of agreement between methods, and (c) obtaining an overall measure of the extent of agreement when we conclude that agreement between methods remains unchanged across conditions. We are not interested in the agreement between measurements taken under different conditions, as the true value of the measured variable on a subject may vary across the conditions. In the carotid stenosis example, the main interest is in comparing the imaging methods when used by the same rater. We do not investigate the agreement between the raters in this example. In the body fat example, one is mainly interested in the agreement between the skinfold calipers and DEXA measured on the same girl in the same visit.
As stated in a recent review paper by Barnhart et al. (2007a), future research is needed on assessing agreement with repeated measurements because previous works on this topic have been limited to scaled agreement indices using the concordance correlation coefficient (CCC) (Chinchilli et al. (1996) and King et al. (2007a,b)), unscaled agreement indices using the total deviation index (TDI) (Choudhary (2008)), and limits of agreement (LOA) (Bland and Altman (2007)). In this work we focus on an alternative scaled index for assessing agreement, the coefficient of individual agreement (CIA), that may be preferable to the CCC because it does not depend on the between-subject variability, as elaborated by Barnhart et al. (2007a,b). The CIA was introduced by Barnhart et al. (2007c) and Haber and Barnhart (2008), and has so far been applied only to data with replicated measurements. In this work we show how to estimate the CIA from data with matched repeated measurements across conditions when there are no replications at each condition. If there are replications at each condition, we can accomplish goals (a)-(c) by applying the methods described in Barnhart et al. (2007c) and Haber and Barnhart (2008). However, in this work we assume that there is a single observation for each method-condition combination, so that our previous methods (Barnhart et al. (2007c) and Haber and Barnhart (2008)) cannot be used. In general, the CIA compares the disagreement between methods to the disagreement between replicated measurements made by the same method on the same study subject. The agreement between methods is considered acceptable if the variability between observations made with different methods on the same subject is not much larger than the variability between observations with the same method on this subject.
Hence, good individual agreement implies that replacing one method by another or using the methods interchangeably does not substantially increase the within-subject variability. The reciprocal of the CIA can be interpreted as the factor by which the variability of the measurements made on the same subject would be inflated if the methods were used interchangeably. In our previous papers (Barnhart et al. (2007c) and Haber and Barnhart (2008)) we suggested that the CIA should be at least 0.8 in order to claim 'good' agreement. Since 1/0.8 = 1.25, this means that using the measurement methods interchangeably does not increase the variability of measurements made on the same subject by more than 25%.
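The arithmetic behind the 0.8 threshold can be written as a short Python sketch. The function name `variability_inflation` is ours and purely illustrative; it simply converts a CIA value into the implied relative increase in within-subject variability.

```python
def variability_inflation(cia: float) -> float:
    """Relative increase in within-subject variability implied by a CIA value.

    Uses the interpretation in the text: 1 / CIA is the factor by which
    variability is inflated when the methods are used interchangeably.
    """
    if not 0.0 < cia <= 1.0:
        raise ValueError("CIA is expected to lie in (0, 1]")
    return 1.0 / cia - 1.0


# A CIA of 0.8 corresponds to an increase of about 25%.
inflation_at_threshold = variability_inflation(0.8)
print(f"{inflation_at_threshold:.0%}")
```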
The CCC and CIA are scaled agreement indices attaining values in the intervals [−1, 1] and [0, 1], respectively. The CCC is based on the comparison of the between-methods and the between-subjects variability. Hence it depends on the heterogeneity of the population with respect to the measured variable (Atkinson and Nevill (1997), Barnhart et al. (2007b)), and therefore comparisons of CCCs from different studies may not be valid. The CIA, on the other hand, uses the within-methods variability, $\sigma_e^2$, as a benchmark to which between-methods variability is compared. In our opinion, the latter is a more appropriate comparison, as the within-methods disagreement is related to the performance of the measurement methods, while the between-subjects variability does not reflect any aspect of the measurement process and may vary between populations or samples. A detailed comparison of the two types of scaled agreement coefficients can be found in Barnhart et al. (2007b). Alternatively, one may use an unscaled measure of agreement, such as the total deviation index (Choudhary (2008), Lin et al. (2002)). Using an unscaled agreement index requires setting an acceptable bound, which may not be easy in practice. A thorough review of different approaches to evaluation of agreement between observers or measurement methods, including CCC, CIA and TDI, can be found in Barnhart et al. (2007a).
The key concept in the CIA is the use of the variability between readings of the same method on the same subject as a reference for assessing the disagreement between different methods. First, one must make sure that this within-method (error) variability, $\sigma_e^2$, is 'reasonably small'. Barnhart et al. (2007b) suggested computing the repeatability coefficient (Bland and Altman (1999)), $1.96\sqrt{2\sigma_e^2}$, and checking whether it is less than or equal to an acceptable value within which the difference between two readings by the same method should lie for 95% of the subjects. Second, as illustrated in our previous papers (Barnhart et al. (2007c) and Haber and Barnhart (2008)), the within-method variability can be estimated if there are true replications. Those papers did not address the issue of estimating it when there are no replications. The main purpose of this paper is to use the repeated measurements in order to estimate $\sigma_e^2$, and thus to estimate the CIA, by fitting a reasonable model using matched repeated measures in the absence of replications.
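The repeatability check described above is a one-line formula; a minimal Python sketch follows. The function names and the acceptable bound of 5.0 are hypothetical choices for illustration, while the error variance 3.0566 is the estimate reported for the body fat data in Example 2.

```python
import math


def repeatability_coefficient(sigma2_e: float, z: float = 1.96) -> float:
    """Return z * sqrt(2 * sigma2_e) for a within-method error variance."""
    if sigma2_e < 0:
        raise ValueError("a variance cannot be negative")
    return z * math.sqrt(2.0 * sigma2_e)


def is_repeatable(sigma2_e: float, acceptable_difference: float) -> bool:
    """True if the repeatability coefficient does not exceed the chosen bound."""
    return repeatability_coefficient(sigma2_e) <= acceptable_difference


# Error variance estimate from the body fat example; the bound is hypothetical.
rc_body_fat = repeatability_coefficient(3.0566)
print(round(rc_body_fat, 1), is_repeatable(3.0566, acceptable_difference=5.0))
```

The bound passed as `acceptable_difference` has to come from subject-matter knowledge; the sketch only performs the comparison.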
In our previous papers (Barnhart et al. (2007c) and Haber and Barnhart (2008)) we considered two situations: (1) one of the methods of measurement is considered a reference, or gold standard, to which the other method is compared, and (2) none of the methods is considered as a reference. In this work we focus on the second situation. We assume that the magnitude of agreement is measured by the mean squared deviation (MSD), defined as the mean of the squared difference between two readings made on the same subject under the same condition. For the sake of simplicity, we first present the new statistical techniques in the context of assessing the agreement between two measurement methods and later show how this approach can be extended to the case of multiple methods. The models and methods for the case where the 'conditions' correspond to the levels of a categorical factor, such as raters or laboratories, are described and illustrated in Section 2. In Section 3 we consider the case where the factor representing the 'conditions' is continuous, such as time, age or temperature. Section 4 presents generalizations to the case of more than two measurement methods.

Conditions Correspond to the Levels of a Categorical Factor
In this section we consider the case where each of $N$ subjects is evaluated by two measurement methods under the same $K$ ($K \ge 2$) conditions. As stated in the Introduction, the 'conditions' may correspond to different time points, laboratories, raters, treatments, etc. We assume that the observed variable is continuous and that the true value of this variable on a given subject may change from one condition to another. We denote the measurements with the two methods by $Y_1$ and $Y_2$. The disagreement between the methods is quantified by the mean squared deviation (MSD), defined as:

$$\mathrm{MSD}(Y_1, Y_2) = E(Y_1 - Y_2)^2,$$

where the expectation is over all the study subjects. The coefficients of individual agreement (see Barnhart et al. (2007c) and Haber and Barnhart (2008)) compare $\mathrm{MSD}(Y_1, Y_2)$ to the MSD of two replicated observations made with the same method under the same conditions. Therefore we denote by $\mathrm{MSD}(Y_j, Y_j')$ the mean squared deviation between two (hypothetical) replicated observations made with method $j$ ($j = 1, 2$) under the same condition. For the case where none of the methods is considered as a reference, the coefficient of individual agreement is defined as:

$$\psi = \frac{[\mathrm{MSD}(Y_1, Y_1') + \mathrm{MSD}(Y_2, Y_2')]/2}{\mathrm{MSD}(Y_1, Y_2)}. \tag{2.1}$$

In our previous papers (Barnhart et al. (2007c) and Haber and Barnhart (2008)) this coefficient was denoted by $\psi^N$.
Since the data considered here do not include replicated observations, $Y_j$ and $Y_j'$, made with the same method on the same subject under the same condition, we cannot apply the approach of Barnhart et al. (2007c) and Haber and Barnhart (2008), who used the replication variances for estimation of $\mathrm{MSD}(Y_j, Y_j')$, $j = 1, 2$. Instead, we propose to estimate $\mathrm{MSD}(Y_j, Y_j')$ from a simple linear model. Denote by $Y_{ijk}$ the observation with the $j$-th method on the $i$-th subject under the $k$-th condition. In order to estimate these MSD's, we use the following mixed ANOVA model:

$$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + e_{ijk}.$$

The $\alpha$'s are the subjects' random effects while the $\beta$'s and $\gamma$'s are the fixed effects of the methods and the conditions, respectively. We assume that the random main effects, interactions and errors are independent and normally distributed with mean 0 and $Var(\alpha_i) = \sigma_\alpha^2$, $Var((\alpha\beta)_{ij}) = \sigma_{\alpha\beta}^2$, $Var((\alpha\gamma)_{ik}) = \sigma_{\alpha\gamma}^2$, $Var(e_{ijk}) = \sigma_e^2$. Regarding the fixed effects, we make the common assumption that the sum of the coefficients over every index is zero, i.e.,

$$\sum_j \beta_j = \sum_k \gamma_k = \sum_j (\beta\gamma)_{jk} = \sum_k (\beta\gamma)_{jk} = 0.$$

It is important to note that this model allows the measurements $Y_{ijk}$ for the same subject-method combination $(i, j)$ to vary across the $K$ conditions. If we consider two (hypothetical) replicated observations, $Y_j$ and $Y_j'$, that could be made by method $j$ on the same subject under the same condition, then:

$$\mathrm{MSD}(Y_j, Y_j') = 2\sigma_e^2, \quad j = 1, 2.$$

From the above model it is evident that the disagreement between the two methods may depend on the condition. The $\mathrm{MSD}(Y_1, Y_2)$ for the $k$-th condition can be obtained from the parameters of our model as follows:

$$\mathrm{MSD}_k(Y_1, Y_2) = [\beta_1 - \beta_2 + (\beta\gamma)_{1k} - (\beta\gamma)_{2k}]^2 + 2\sigma_{\alpha\beta}^2 + 2\sigma_e^2.$$

Using the definition (2.1) we now can obtain the coefficient of individual agreement under the $k$-th condition as:

$$\psi_k = \frac{2\sigma_e^2}{[\beta_1 - \beta_2 + (\beta\gamma)_{1k} - (\beta\gamma)_{2k}]^2 + 2\sigma_{\alpha\beta}^2 + 2\sigma_e^2}.$$
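As a rough illustration of how the condition-specific coefficient follows from the model parameters, the Python sketch below assumes that, under the mixed model with subject-by-method, subject-by-condition and method-by-condition terms, the within-method MSD equals twice the error variance and the between-method MSD for condition k equals the squared mean difference plus twice the subject-by-method variance plus twice the error variance. These expressions are our reading of the model, and all function names and parameter values are hypothetical.

```python
def msd_within(sigma2_e):
    """MSD between two hypothetical replicates of the same method."""
    return 2.0 * sigma2_e


def msd_between(beta_diff, bg_diff_k, sigma2_ab, sigma2_e):
    """MSD between the two methods under one condition.

    beta_diff  : beta_1 - beta_2 (difference of method main effects)
    bg_diff_k  : (beta*gamma)_1k - (beta*gamma)_2k for condition k
    sigma2_ab  : subject-by-method interaction variance
    sigma2_e   : error variance
    """
    return (beta_diff + bg_diff_k) ** 2 + 2.0 * sigma2_ab + 2.0 * sigma2_e


def cia_by_condition(beta_diff, bg_diffs, sigma2_ab, sigma2_e):
    """Return one CIA value per condition."""
    return [
        msd_within(sigma2_e) / msd_between(beta_diff, d, sigma2_ab, sigma2_e)
        for d in bg_diffs
    ]


# Hypothetical parameter values for three conditions.
psi_by_condition = cia_by_condition(
    beta_diff=2.0, bg_diffs=[-1.0, 0.0, 1.0], sigma2_ab=1.0, sigma2_e=4.0
)
print([round(p, 3) for p in psi_by_condition])
```

Note that the subject main effect and the subject-by-condition interaction do not enter the calculation, because they cancel in the difference between two methods measured on the same subject under the same condition.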

Estimation and testing
Fitting the mixed model that we use to estimate the coefficients of individual agreement can be done via standard statistical software packages. We used SAS Proc MIXED for this purpose. It may also be of interest to test the hypothesis of homogeneous agreement, $\psi_1 = \cdots = \psi_K$, which is equivalent to $(\beta\gamma)_{j1} = \cdots = (\beta\gamma)_{jK}$ for $j = 1, 2$. If this hypothesis is supported by the data, then the common value of all the condition-specific $\psi$'s can be estimated by fitting the simpler form of the mixed model, which does not include the method-by-condition interaction terms $(\beta\gamma)$. Confidence intervals for the estimated coefficients can be computed using the delta method or the nonparametric bootstrap.
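The nonparametric bootstrap mentioned above can be sketched in Python as follows. This is not the Proc MIXED analysis: as a stand-in estimator it uses simple moments of the paired differences, under our assumption that the residual mean square of the subject-by-condition layout of the differences estimates twice the error variance, and that the mean squared difference per condition estimates the between-method MSD. The function names, the simulated data and all parameter values are hypothetical.

```python
import numpy as np


def cia_moment_estimates(y1, y2):
    """Condition-specific CIA estimates from two (subjects x conditions) arrays."""
    d = np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)
    n, k = d.shape
    # Residuals of the additive subject-by-condition layout of the differences;
    # their mean square estimates Var(e_i1k - e_i2k) = 2 * sigma_e^2.
    resid = d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()
    within = (resid ** 2).sum() / ((n - 1) * (k - 1))
    # Mean squared difference per condition estimates MSD_k(Y1, Y2).
    between = (d ** 2).mean(axis=0)
    return within / between


def bootstrap_ci(y1, y2, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap intervals, resampling whole subjects."""
    rng = np.random.default_rng(seed)
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    n = y1.shape[0]
    reps = np.empty((n_boot, y1.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # subjects drawn with replacement
        reps[b] = cia_moment_estimates(y1[idx], y2[idx])
    lo, hi = np.quantile(reps, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, np.minimum(hi, 1.0)  # upper limits above 1 are set to 1


# Simulated data: 500 subjects, 3 conditions, true CIA = 8 / 14 for every condition.
rng = np.random.default_rng(1)
N, K = 500, 3
alpha = rng.normal(0, 3, size=(N, 1))   # subject effects
ab = rng.normal(0, 1, size=(N, 2))      # subject-by-method effects
ag = rng.normal(0, 1, size=(N, K))      # subject-by-condition effects
e = rng.normal(0, 2, size=(N, 2, K))    # errors, sigma_e^2 = 4
y1 = 50 + alpha + 1.0 + ab[:, [0]] + ag + e[:, 0, :]
y2 = 50 + alpha - 1.0 + ab[:, [1]] + ag + e[:, 1, :]

est = cia_moment_estimates(y1, y2)
lo, hi = bootstrap_ci(y1, y2)
print(np.round(est, 3), np.round(lo, 3), np.round(hi, 3))
```

Resampling whole subjects keeps each subject's readings across methods and conditions together, which preserves the within-subject dependence in every bootstrap sample.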

Example 1
We now illustrate the method using data from a carotid stenosis screening study. The goal of the study was to compare magnetic resonance angiography (MRA) for noninvasive screening of carotid artery stenosis with invasive intra-arterial angiogram (IA). Two MRA methods were considered: two-dimensional time of flight (MRA-2D) and three-dimensional time of flight (MRA-3D). Each of three raters determined the percent of carotid stenosis using each of the three imaging methods. Thus, a total of nine observations were made on each study subject. Our analysis is based on the 55 study subjects for whom all nine readings were available. Percent stenosis was measured in both the left and right carotid artery of each subject. We will use here only the data from the left arteries. For more information on the study, including graphical displays of agreement between methods and between raters, the reader is referred to Barnhart and Williamson (2001). The stenosis data can be copied from www.sph.emory.edu/observeragreement/. Barnhart et al. (2007c) used these data to estimate the coefficients of individual agreement between the three methods, where the raters were considered as independent replications. Here we re-estimate the coefficients under the more realistic assumption that each rater has her/his own effect on the observed measurements. Thus, we consider the raters as 'conditions'. Table 1 presents rater-specific estimates of the CIA's for the left artery data, along with their delta-method-based 95% confidence intervals, for all three pairs of methods. The table also presents the overall estimate of $\psi$ under the assumption that the coefficients for the three raters are equal. The overall estimates can be interpreted as pooled (or summary) estimates of the coefficients across the three raters under the assumption that the disagreement between methods is homogeneous.
These pooled estimates are not very meaningful unless the differences between methods are indeed homogeneous across raters. In Table 1, whenever the upper limit of a CI exceeded 1, it was set to 1.000. Comparison 1: $Y_1$ = IA, $Y_2$ = MRA-2D, assuming no differences among raters: p-value for $\psi_1 = \psi_2 = \psi_3$ is 0.09. Comparison 2: $Y_1$ = IA, $Y_2$ = MRA-3D, assuming no differences among raters: p-value [...], assuming no differences among raters: p-value for $\psi_1 = \psi_2 = \psi_3$ is 0.46.
As we stated in the Introduction, it is important to check the repeatability coefficient $1.96\sqrt{2\sigma_e^2}$ for each of the methods. In the context of the present example, this coefficient is a 95% upper bound for the absolute difference of two readings made by the same rater with the same imaging method. The coefficient should be relatively small, so that we feel comfortable when using the intra-method variability as a reference to which we compare the inter-method variability. The repeatability coefficients corresponding to the three comparisons in Table 1 are 51.5, 49.0 and 63.0 percent, respectively, which are likely to be higher than acceptable values for the absolute difference of two measurements of carotid stenosis performed with the same method on the same patient. Hence, from a practical point of view the estimates in Table 1 are likely to overestimate the actual magnitude of individual agreement.
From Table 1 we can learn that the agreement between the IA method and each of the MRA methods, which was the focus of the original study, is very poor. The comparison of the two MRA methods produces higher estimates of CIA's, in the range 0.81-0.86. However, since we saw in the previous paragraph that these estimates are likely to be inflated due to an unacceptable repeatability coefficient, one may doubt whether the agreement between the two MRA methods is indeed reasonably good.

Conditions Correspond to a Continuous Factor
In this Section we assume that matched repeated measurements are performed under conditions that correspond to the values of a continuous variable. The most common situation involves measurements made at different time points, hence we will refer to the variable defining the repeated measurement as 'time' and assume that the subjects' true values are a linear function of time.
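Suppose that pairs of observations $(Y_{i1}(t), Y_{i2}(t))$ were made with two methods of measurement on subject $i$ at each of $m_i \ge 2$ different time points, $t$. (These time points do not have to be the same for all subjects.) As we did in Section 2, we begin by fitting a linear mixed model to the observed measurements:

$$Y_{ij}(t) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \gamma t + \delta_i t + \eta_j t + e_{ij}(t) \quad (i = 1, \ldots, n;\ j = 1, 2).$$

As before, the random effects $\{\alpha_i\}$, $\{(\alpha\beta)_{ij}\}$, $\{\delta_i\}$, $\{e_{ij}(t)\}$ are independent with zero means and:

$$Var(\alpha_i) = \sigma_\alpha^2, \quad Var((\alpha\beta)_{ij}) = \sigma_{\alpha\beta}^2, \quad Var(\delta_i) = \sigma_\delta^2, \quad Var(e_{ij}(t)) = \sigma_e^2.$$

For the fixed effects we set $\beta_1 + \beta_2 = \eta_1 + \eta_2 = 0$. The mean squared differences are as follows:

$$\mathrm{MSD}(Y_j, Y_j') = 2\sigma_e^2, \qquad \mathrm{MSD}(Y_1(t), Y_2(t)) = [\beta_1 - \beta_2 + (\eta_1 - \eta_2)t]^2 + 2\sigma_{\alpha\beta}^2 + 2\sigma_e^2.$$

We now can obtain the CIA as a function of time as follows:

$$\psi(t) = \frac{2\sigma_e^2}{[\beta_1 - \beta_2 + (\eta_1 - \eta_2)t]^2 + 2\sigma_{\alpha\beta}^2 + 2\sigma_e^2}.$$

Proc MIXED in SAS can again be used to estimate the parameters in the mixed model and provide an estimate of the function $\psi(t)$. The hypothesis $\eta_1 = \eta_2$ can be tested in order to check whether the CIA changes significantly over time.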

Example 2
In the Young Women Health Study (Lloyd et al. (1993)) percentage body fat was estimated using skinfold calipers and dual energy x-ray absorptiometry (DEXA) on a cohort of adolescent girls. Skinfold caliper and DEXA measurements were made in an initial visit, at age 12 years, and in eight subsequent visits, which occurred every six months. Agreement between the two methods of measurement has been evaluated via the concordance correlation coefficient (CCC) (Chinchilli et al. (1996), King et al. (2007a,b)) and via the total deviation index (TDI) (Choudhary (2008)). Here we estimate the coefficients of individual agreement, using observations from 651 visits of 91 girls. We will use a girl's actual age as the 'condition' ($t$) since the visits did not occur exactly at ages 12.0, 12.5, 13.0, etc. Fitting the model to these data yields the following estimates: $\hat\sigma_\alpha^2 = 6.8553$, $\hat\sigma_{\alpha\beta}^2 = 2.4709$, $\hat\sigma_\delta^2 = 0.01987$, $\hat\sigma_e^2 = 3.0566$, $\hat\beta_1 = -9.3808$, $\hat\gamma = -0.2546$, $\hat\eta_1 = 0.6075$. The t statistic for the hypothesis $\eta_1 = 0$ is 14.9, hence the data do not support the hypothesis of a time-independent CIA. The repeatability coefficient is 4.8, which can be considered an acceptable 95% bound for the within-methods error.
Using the above estimates we can write the estimated function $\hat\psi(t)$:

$$\hat\psi(t) = \frac{6.1132}{(-18.7616 + 1.2149t)^2 + 11.0550}.$$

Figure 1 displays the estimated coefficients along with their delta-method-based CI's for girls aged 12-16 years, which is the range of ages in the data. We see that agreement between the two methods improves with age up to 15.5 years. As stated in the Introduction, we suggested that agreement be considered 'acceptable' only if the relevant coefficient of individual agreement exceeds 0.8 (Barnhart et al. (2007c), Haber and Barnhart (2008)). Since the estimates of the CIA remain below 0.6 and their upper confidence limits remain below 0.8, we conclude that the agreement between the DEXA and the skinfold calipers is not acceptable for girls aged 12-16 years. For comparison, Chinchilli et al. (1996) reported an estimated CCC of 0.42 for these data (their method does not assume that agreement may change with age). King et al. (2007a,b) used only the data from the first three visits of each girl and reported values in the range 0.48-0.67 for their weighted repeated measurements CCC. Choudhary (2008), who analyzed the full dataset using a tolerance interval approach, concluded that 'the agreement between the methods appears best around age 15-17', and that 'on the whole, the agreement between the skinfold and DEXA methods does not seem good enough to justify their interchangeable use'. These conclusions are similar to ours.
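The estimated curve can be evaluated directly from the reported parameter estimates. The Python sketch below plugs them into the ratio of twice the error variance to the between-method MSD, using the constraints that the two method effects and the two method-specific slopes each sum to zero. The variable names and the half-year age grid are our own choices, and small discrepancies from the printed coefficients are due to rounding.

```python
import numpy as np

# Estimates reported in the text for the body fat data.
SIGMA2_AB = 2.4709   # subject-by-method variance
SIGMA2_E = 3.0566    # error variance
BETA_1 = -9.3808     # method effect (beta_2 = -beta_1)
ETA_1 = 0.6075       # method-specific slope (eta_2 = -eta_1)


def psi_hat(age):
    """Estimated CIA at a given age (scalar or array)."""
    age = np.asarray(age, dtype=float)
    mean_diff = 2.0 * BETA_1 + 2.0 * ETA_1 * age  # beta_1 - beta_2 + (eta_1 - eta_2) * t
    return 2.0 * SIGMA2_E / (mean_diff ** 2 + 2.0 * SIGMA2_AB + 2.0 * SIGMA2_E)


ages = np.arange(12.0, 16.01, 0.5)   # half-year grid over the observed age range
values = psi_hat(ages)

# The curve peaks where the mean difference between the methods is zero.
peak_age = -BETA_1 / ETA_1
peak_value = float(psi_hat(peak_age))

for a, v in zip(ages, values):
    print(f"age {a:4.1f}: psi = {v:.3f}")
print(f"peak near age {peak_age:.2f}, psi = {peak_value:.3f}")
```

Because the peak value is bounded by the ratio of twice the error variance to the sum of twice the subject-by-method variance and twice the error variance, the estimated CIA cannot reach the 0.8 threshold at any age under these estimates.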

The Case of More Than Two Methods of Measurement
When there are more than two measurement methods, the overall coefficients of individual agreement can be obtained from the pairwise MSD's as shown in Barnhart et al. (2007c). Denote the observations made with $J \ge 3$ methods by $Y_1, Y_2, \ldots, Y_J$. When the conditions correspond to the levels ($k$) of a categorical factor, an overall coefficient of individual agreement for the $k$-th condition is:

$$\psi_k = \frac{\frac{1}{J}\sum_{j=1}^{J} \mathrm{MSD}(Y_j, Y_j')}{\frac{2}{J(J-1)}\sum_{j<j'} \mathrm{MSD}_k(Y_j, Y_{j'})},$$

where $\mathrm{MSD}(Y_j, Y_j')$ is the mean squared deviation between two replicated observations made by method $j$ under the same condition and $\mathrm{MSD}_k(Y_j, Y_{j'})$ is the mean squared deviation between measurements by methods $j$ and $j'$ ($j \ne j'$) under the $k$-th condition.
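A minimal Python sketch of combining pairwise MSD's into an overall coefficient is given below. It assumes that the overall coefficient is the ratio of the average within-method MSD to the average between-method MSD over all pairs of methods, a form that reduces to definition (2.1) when there are two methods. The function name, the dictionary layout and the numerical values are hypothetical.

```python
from itertools import combinations


def overall_cia(within_msd, between_msd):
    """Overall CIA for J methods under one condition.

    within_msd  : list of J within-method MSD's, one per method
    between_msd : dict mapping a pair (j, j2) with j < j2 to the
                  between-method MSD of methods j and j2
    """
    n_methods = len(within_msd)
    if n_methods < 2:
        raise ValueError("at least two methods are required")
    pairs = list(combinations(range(n_methods), 2))
    mean_within = sum(within_msd) / n_methods
    mean_between = sum(between_msd[p] for p in pairs) / len(pairs)
    return mean_within / mean_between


# Hypothetical MSD's for three methods under a single condition.
psi_three = overall_cia(
    within_msd=[8.0, 8.0, 8.0],
    between_msd={(0, 1): 14.0, (0, 2): 16.0, (1, 2): 10.0},
)
# With two methods the same function gives the two-method coefficient.
psi_two = overall_cia(within_msd=[6.0, 10.0], between_msd={(0, 1): 16.0})
print(round(psi_three, 3), round(psi_two, 3))
```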

Discussion
We presented a simple method for assessing agreement between two or more methods of measurement based on repeated measurements matched on a factor whose levels are considered as conditions. We advocate the use of the coefficient of individual agreement rather than the concordance correlation coefficient, as the latter depends on the between-subjects heterogeneity (Atkinson and Nevill (1997), Barnhart et al. (2007b), Haber and Barnhart (2008)). Our approach allows the true values of the measured variable and the magnitude of disagreement to vary across conditions or over time.
We use the terms 'methods' and 'conditions' broadly here. For example, in the carotid stenosis study (Example 1) we considered the imaging methods as 'methods' and the human raters as 'conditions' because we were interested in the agreement between the imaging methods based on readings by the same rater. Alternatively, we could treat the raters as 'methods' and the imaging methods as 'conditions' and assess the agreement between raters when they are using the same imaging method.
We used SAS Proc MIXED, which assumes that all the measurements are normally distributed, for the analyses of the data in Examples 1 and 2. The SAS code is available from the first author. It is important to note that the CIA's can be estimated using the method of moments from the various ANOVA mean squares without making the normality assumption. We also wrote R programs for the analysis of the carotid stenosis and the body fat data. These programs are available at XXX and can be used by readers who do not have SAS.
The coefficients of individual agreement can also be defined and estimated when the observations are binary (Haber et al. (2007)). The methods introduced in this work can also be applied to repeated binary data, for example by using generalized linear mixed models.