Comparing Estimators of Discriminative Performance of Time-to-Event Models
Pub. online: 18 February 2025
Type: Statistical Data Science
Open Access
Received: 5 June 2024
Accepted: 1 January 2025
Published: 18 February 2025
Abstract
Predicting the timing and occurrence of events is a major focus of data science applications, especially in the context of biomedical research. The performance of models for these outcomes, often referred to as time-to-event or survival outcomes, is frequently summarized using measures of discrimination, in particular time-dependent AUC and concordance. Many estimators of these quantities have been proposed, and they can be broadly categorized as either semi-parametric or non-parametric. In this paper, we review the mathematical construction of the two classes of estimators and compare their behavior. Importantly, we identify a previously unknown feature of the class of semi-parametric estimators that can result in vastly overoptimistic out-of-sample estimation of discriminative performance in common applied tasks. Although these semi-parametric estimators are popular in practice, the phenomenon we identify here suggests that this class of estimators may be inappropriate for model assessment and selection based on out-of-sample evaluation criteria (e.g., cross-validation), because under such criteria the semi-parametric estimators are biased in favor of overfit models. Non-parametric estimators do not exhibit this behavior, but they are highly variable for local discrimination. We propose to address the high variability problem through penalized regression spline smoothing. The behavior of various estimators of time-dependent AUC and concordance is illustrated via a simulation study with two different mechanisms that produce overoptimistic out-of-sample estimates under semi-parametric estimators. The estimators are further compared in a case study using data from the National Health and Nutrition Examination Survey (NHANES) 2011–2014.
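To make the notion of a non-parametric concordance estimator concrete, the following is a minimal sketch of a Harrell-type concordance index for right-censored data. It is an illustration only and is not taken from the paper's supplementary code: the function name `harrell_c`, its arguments, and the toy data are all hypothetical, and the sketch applies no censoring weights and no smoothing.

```python
def harrell_c(times, events, risk_scores):
    """Harrell-type concordance for right-censored data (illustrative sketch).

    times:       observed follow-up times
    events:      1 if the event was observed, 0 if censored
    risk_scores: model output, where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's observed time.
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event has the higher risk score
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # ties in the score count as one half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: five subjects, two of them censored.
toy_times = [2, 4, 6, 8, 10]
toy_events = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.7, 0.2, 0.1]
toy_c = harrell_c(toy_times, toy_events, toy_scores)  # 7 of 8 comparable pairs are concordant
```

In an out-of-sample evaluation, `risk_scores` would be predictions for held-out subjects from a model fit on separate training data, so the estimate depends only on the ranking of the held-out predictions against the held-out outcomes.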
Supplementary material
The supplementary material includes additional information that is relevant but not included in the manuscript, including figures, mathematical derivations, and the data file used for the data application section. It also includes a zipped file containing code scripts to reproduce the results presented in the manuscript. Here is a brief summary of its content:
• outlier_exp.R: to generate data and produce Figure 1 in the Introduction.
• Simulation: code scripts used to implement the simulation study.
  – Sim_overfit.R: for the first scenario of model overfit in Section 3.2.1.
  – Sim_contamination.R: for the second scenario of covariate misalignment in Section 3.2.2.
  – helpers.R: functions to calculate discussed estimators.
  – trueAUC.R: calculate the true values of incident/dynamic AUC.
  – SimFigs.R: produce Figures 2 and 3.
• DataAppl: scripts to reproduce the data application section.
  – data_appl.R: scripts to reproduce the data application results.
  – helpers_appl.R: functions to calculate discussed estimators.
  – DataApplFigs.R: produce Figure 4.
• SuppFigs.R: to produce figures included in the supplement.