Pub. online: 23 Nov 2022 | Type: Data Science In Action | Open Access
Journal:Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 177–192
Abstract
Clinical risk prediction models are commonly developed in a post-hoc and passive fashion, capitalizing on convenient data from completed clinical trials or retrospective cohorts. Impacts of the models often end at their publication rather than with the patients. The field of clinical risk prediction is rapidly improving in a progressively more transparent data science era. Based on collective experience over the past decade by the Prostate Biopsy Collaborative Group (PBCG), this paper proposes four data science-driven strategies for improving clinical risk prediction to the benefit of clinical practice and research. The first proposed strategy is to actively design prospective data collection, monitoring, analysis and validation of risk tools following the same standards as for clinical trials in order to elevate the quality of training data. The second suggestion is to make risk tools and model formulas available online. User-friendly risk tools will bring quantitative information to patients and their clinicians for improved knowledge-based decision-making. As past experience testifies, online tools expedite independent validation, providing helpful information as to whether the tools are generalizable to new populations. The third proposal is to dynamically update and localize risk tools to adapt to changing demographic and clinical landscapes. The fourth strategy is to accommodate systematic missing data patterns across cohorts in order to maximize the statistical power in model training, and to accommodate missing information on the end-user side in order to maximize utility for the public.
Abstract: The application of linear mixed models or generalized linear mixed models to large databases in which the level 2 units (hospitals) have a wide variety of characteristics is a problem frequently encountered in studies of medical quality. Accurate estimation of model parameters and standard errors requires accounting for the grouping of outcomes within hospitals. Including the hospitals as a random effect in the model is a common method of doing so. However, in a large, diverse population, the required assumptions are not satisfied, which can lead to inconsistent and biased parameter estimates. One solution is to use cluster analysis with clustering variables distinct from the model covariates to group the hospitals into smaller, more homogeneous groups. The analysis can then be carried out within these groups. We illustrate this analysis using an example of a study of hemoglobin A1c control among diabetic patients in a national database of United States Department of Veterans’ Affairs (VA) hospitals.
Abstract: PSA measurements are used to assess the risk for prostate cancer. PSA range and PSA kinetics such as PSA velocity have been correlated with increased cancer detection and assist the clinician in deciding when prostate biopsy should be performed. Our aim is to evaluate the use of a novel maximum likelihood estimation - prostate specific antigen (MLE-PSA) model for predicting the probability of prostate cancer using serial PSA measurements combined with PSA velocity, in order to assess whether this reduces the need for prostate biopsy. A total of 1976 Caucasian patients were included. All these patients had at least 6 serial PSA measurements; all underwent trans-rectal biopsy with a minimum of 12 cores within the past 10 years. A multivariate logistic regression model was developed using maximum likelihood estimation (MLE) based on the following parameters: age, at least 6 serial PSA measurements, baseline median natural logarithm of the PSA (ln(PSA)) and PSA velocity (ln(PSAV)), baseline process capability standard deviation of ln(PSA) and ln(PSAV), significant special causes of variation in ln(PSA) and ln(PSAV) detected using control chart logic, and the volatility of ln(PSAV). We then compared prostate cancer probability using MLE-PSA to the results of prostate needle biopsy. The MLE-PSA model with a 50% cut-off probability has a sensitivity of 87%, specificity of 85%, positive predictive value (PPV) of 89%, and negative predictive value (NPV) of 82%. By contrast, a single PSA value with a 4 ng/ml threshold has a sensitivity of 59%, specificity of 33%, PPV of 56%, and NPV of 36% in the same population of patients used to generate the MLE-PSA model. Based on serial PSA measurements, the use of the MLE-PSA model significantly (p-value < 0.0001) improves prostate cancer detection and reduces the need for prostate biopsy.
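For readers less familiar with the four accuracy measures quoted in this abstract, the following is a minimal sketch of how they are computed from a 2x2 table of test result versus biopsy result. The function name `diagnostic_metrics` and the counts used are hypothetical illustrations; they are not taken from the study, which does not report its confusion-matrix counts here.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion-matrix counts.

    tp: test positive, biopsy positive    fp: test positive, biopsy negative
    tn: test negative, biopsy negative    fn: test negative, biopsy positive
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of cancers the test flags
        "specificity": tn / (tn + fp),  # share of non-cancers the test clears
        "ppv": tp / (tp + fp),          # share of positive tests that are cancers
        "npv": tn / (tn + fn),          # share of negative tests that are non-cancers
    }


# Hypothetical counts for illustration only (NOT data from the abstract).
metrics = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity depend only on the test, whereas PPV and NPV also depend on how common the disease is in the tested population, which is why all four are usually reported together.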
Abstract: This study analyzes data obtained by nucleic acid amplification techniques (polymerase chain reaction) comprising 23 transcript variables, collected to investigate the genetic mechanisms regulating chlamydial infection disease by measuring two outcomes of murine C. pneumoniae lung infection (disease expressed as lung weight increase, and C. pneumoniae load in the lung). A model with a reduced set of transcript variables of interest at the early infection stage has been obtained by using some traditional variable selection methods (stepwise regression, partial least squares regression (PLS)) and modern ones (least absolute shrinkage and selection operator (LASSO), forward stagewise regression and least angle regression (LARS)). Through these variable selection methods, the variables of interest are selected to investigate the genetic mechanisms that determine the outcomes of chlamydial lung infection. The transcript variables Tim3, GATA3, Lacf, Arg2 (X4, X5, X8 and X13) are identified as the main variables of interest for studying the C. pneumoniae disease (lung weight increase) or C. pneumoniae lung load outcomes. Models including these key variables may provide possible answers to the problem of molecular mechanisms of chlamydial pathogenesis.
Abstract: When comparing the performance of health care providers, it is important to rule out the effect of factors that have an unwanted influence on the performance indicator (e.g. mortality). In register-based studies, randomization is out of the question. We develop a risk adjustment model for hip fracture mortality in Finland by using logistic regression. The model is used to study the impact of the length of the register follow-up period on adjusting the performance indicator for a set of comorbidities. The comorbidities are congestive heart failure, cancer and diabetes. We also introduce an implementation of the minimum description length (MDL) principle for model selection in logistic regression. This is done by using the normalized maximum likelihood (NML) technique. The computational burden of the usual NML criterion is too heavy, and therefore a technique based on the idea of sequentially normalized maximum likelihood (sNML) is introduced. The sNML criterion can be evaluated efficiently even for large models with large amounts of data. The results given by sNML are then compared to the corresponding results given by the traditional AIC and BIC model selection criteria. All three comorbidities clearly have an effect on hip fracture mortality. The results indicate that for congestive heart failure all available medical history should be used, while for cancer it is enough to use only records from half a year before the fracture. For diabetes the choice of time period is not as clear, but using records from three years before the fracture seems to be a reasonable choice.
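As a point of reference for the traditional criteria mentioned in this abstract, the sketch below evaluates the standard AIC and BIC formulas from a fitted model's maximized log-likelihood. It does not implement the sNML criterion, which is specific to the paper. The function names and the numeric inputs are hypothetical and chosen only for illustration.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fit: maximized log-likelihood -100 with 3 parameters on 50 observations.
aic_value = aic(log_likelihood=-100.0, n_params=3)
bic_value = bic(log_likelihood=-100.0, n_params=3, n_obs=50)
print(f"AIC = {aic_value:.3f}, BIC = {bic_value:.3f}")
```

For both criteria a lower value indicates a preferred model; BIC penalizes each additional parameter more heavily than AIC once the sample has more than about seven observations (ln(n) > 2).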