Journal of Data Science


A Bayesian Approach to Pre-Post Comparison of Inter-Rater Agreement in Ordinal Ratings
Aiden Berry   Jennifer Cao   Song Zhang  

https://doi.org/10.6339/25-JDS1213
Pub. online: 16 December 2025      Type: Statistical Data Science      Open Access

Received
2 September 2025
Accepted
8 December 2025
Published
16 December 2025

Abstract

Inter-rater agreement is fundamental to decision making in medicine, psychology, and the social sciences, as it reflects the quality and reliability of rating systems. The intraclass correlation coefficient (ICC) has been widely used as a measure of inter-rater agreement. To date, no methodology has been developed that properly assesses improvement in ICC for pre–post studies with ordinal ratings. It remains uninvestigated whether and how correlations between pre- and post-intervention scores affect the estimation and comparison of ICCs. We present a Bayesian hierarchical probit framework for evaluating changes in ICCs in such settings. The model incorporates rater- and item-level correlations and compares two prior parameterizations: an “individual components” prior that separately models variances and correlations, and an inverse Wishart prior. Simulation studies show that accounting for pre–post correlation substantially improves estimation accuracy and power to detect changes in agreement, while ignoring it reduces efficiency. Application to a multicenter study on conjunctival inflammation demonstrates that a novel grading scale markedly increased inter-rater agreement. This framework underscores the importance of modeling ordinal outcomes appropriately and provides a flexible Bayesian tool for evaluating the effectiveness of interventions on inter-rater agreement in pre–post studies.
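
To make the setting concrete, the following is a minimal simulation sketch of a generic ordinal probit random-effects model with correlated pre and post effects. It is not the authors' model, whose exact specification is not given in the abstract. The function names, the cutpoints, the variance values, the unit residual variance, and the latent-scale ICC definition (item variance divided by total latent variance) are all assumptions chosen for illustration.

```python
import numpy as np

def latent_icc(var_item, var_rater, var_resid=1.0):
    """Latent-scale ICC (assumed definition): share of variance due to items."""
    return var_item / (var_item + var_rater + var_resid)

def correlated_effects(n, var_pre, var_post, rho, rng):
    """Draw n pairs of (pre, post) random effects with correlation rho."""
    cov_pp = rho * np.sqrt(var_pre * var_post)
    cov = np.array([[var_pre, cov_pp], [cov_pp, var_post]])
    return rng.multivariate_normal(np.zeros(2), cov, size=n)  # shape (n, 2)

def simulate_pre_post(n_items, n_raters, var_item, var_rater_pre,
                      var_rater_post, rho_item, rho_rater, cutpoints, rng):
    """Simulate ordinal ratings before and after a hypothetical intervention."""
    item = correlated_effects(n_items, var_item, var_item, rho_item, rng)
    rater = correlated_effects(n_raters, var_rater_pre, var_rater_post,
                               rho_rater, rng)
    ratings = []
    for t in (0, 1):  # 0 = pre, 1 = post
        noise = rng.normal(0.0, 1.0, size=(n_items, n_raters))
        latent = item[:, [t]] + rater[:, t][None, :] + noise
        # Thresholding the latent score gives categories 0..len(cutpoints)
        ratings.append(np.digitize(latent, cutpoints))
    return ratings[0], ratings[1]

rng = np.random.default_rng(0)
cutpoints = np.array([-1.0, 0.0, 1.0])  # hypothetical 4-category scale
pre, post = simulate_pre_post(
    n_items=50, n_raters=8, var_item=1.0,
    var_rater_pre=1.0, var_rater_post=0.25,  # intervention shrinks rater variance
    rho_item=0.6, rho_rater=0.5, cutpoints=cutpoints, rng=rng,
)
icc_pre = latent_icc(1.0, 1.0)
icc_post = latent_icc(1.0, 0.25)
print(f"latent ICC pre = {icc_pre:.3f}, post = {icc_post:.3f}")
```

In this sketch the intervention is represented only as a reduction in rater variance, which raises the latent-scale ICC. The correlation parameters `rho_item` and `rho_rater` link the pre and post effects for the same items and raters, which is the kind of dependence the abstract says should be modeled rather than ignored.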

Supplementary material

The supplementary material includes supplementary tables and R code.



Copyright
2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
Bayesian; inter-rater agreement; intraclass correlation; ordinal; pre-post design


Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
