Pub. online:23 Nov 2022Type:Data Science In ActionOpen Access
Journal:Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 177–192
Abstract
Clinical risk prediction models are commonly developed in a post-hoc and passive fashion, capitalizing on convenient data from completed clinical trials or retrospective cohorts. Impacts of the models often end at their publication rather than with the patients. The field of clinical risk prediction is rapidly improving in a progressively more transparent data science era. Based on collective experience over the past decade by the Prostate Biopsy Collaborative Group (PBCG), this paper proposes the following four data science-driven strategies for improving clinical risk prediction to the benefit of clinical practice and research. The first proposed strategy is to actively design prospective data collection, monitoring, analysis and validation of risk tools following the same standards as for clinical trials in order to elevate the quality of training data. The second suggestion is to make risk tools and model formulas available online. User-friendly risk tools will bring quantitative information to patients and their clinicians for improved knowledge-based decision-making. As past experience testifies, online tools expedite independent validation, providing helpful information as to whether the tools are generalizable to new populations. The third proposal is to dynamically update and localize risk tools to adapt to changing demographic and clinical landscapes. The fourth strategy is to accommodate systematic missing data patterns across cohorts in order to maximize the statistical power in model training, as well as to accommodate missing information on the end-user side too, in order to maximize utility for the public.
Abstract: PSA measurements are used to assess the risk for prostate cancer. PSA range and PSA kinetics such as PSA velocity have been correlated with in creased cancer detection and assist the clinician in deciding when prostate biopsy should be performed. Our aim is to evaluate the use of a novel, maxi mum likelihood estimation - prostate specific antigen (MLE-PSA) model for predicting the probability of prostate cancer using serial PSA measurements combined with PSA velocity in order to assess whether this reduces the need for prostate biopsy. A total of 1976 Caucasian patients were included. All these patients had at least 6 PSA serial measurements; all underwent trans-rectal biopsy with minimum 12 cores within the past 10 years. A multivariate logistic re gression model was developed using maximum likelihood estimation (MLE) based on the following parameters (age, at least 6 PSA serial measurements, baseline median natural logarithm of the PSA (ln(PSA)) and PSA velocity (ln(PSAV)), baseline process capability standard deviation of ln(PSA) and ln(PSAV), significant special causes of variation in ln(PSA) and ln(PSAV) detected using control chart logic, and the volatility of the ln(PSAV). We then compared prostate cancer probability using MLE-PSA to the results of prostate needle biopsy. The MLE-PSA model with a 50% cut-off probability has a sensitivity of 87%, specificity of 85%, positive predictive value (PPV) of 89%, and negative predictive value (NPV) of 82%. By contrast, a single PSA value with a 4ng/ml threshold has a sensitivity of 59%, specificity of 33%, PPV of 56%, and NPV of 36% using the same population of patients used to generate the MLE-PSA model. Based on serial PSA measurements, the use of the MLE-PSA model significantly (p-value < 0.0001) improves prostate cancer detection and reduces the need for prostate biopsy.
Abstract: This paper aims to propose a suitable statistical model for the age distribution of prostate cancer detection. Descriptive studies suggest the onset of prostate cancer after 37 years of age with maximum diagnosis age at around 70 years. The major deficiency of descriptive studies is that the results cannot be generalized for all types of populations usually having non-identical environmental conditions. The proposition follows by checking the suitability of the model through different statistical tools like Akaike Information Criterion, Kolmogorov Smirnov distance, Bayesian Information Criterion and χ2 statistic. The Maximum likelihood estimate of the parameters of the proposed model along with their asymptotic confidence intervals have been obtained for the considered real data set.