Pub. online: 23 Nov 2022 | Type: Data Science In Action | Open Access
Journal:Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 177–192
Abstract
Clinical risk prediction models are commonly developed in a post-hoc and passive fashion, capitalizing on convenient data from completed clinical trials or retrospective cohorts. Impacts of the models often end at their publication rather than with the patients. The field of clinical risk prediction is rapidly improving in a progressively more transparent data science era. Based on the collective experience of the Prostate Biopsy Collaborative Group (PBCG) over the past decade, this paper proposes four data science-driven strategies for improving clinical risk prediction to the benefit of clinical practice and research. The first strategy is to actively design prospective data collection, monitoring, analysis, and validation of risk tools following the same standards as for clinical trials, in order to elevate the quality of training data. The second is to make risk tools and model formulas available online. User-friendly risk tools bring quantitative information to patients and their clinicians for improved knowledge-based decision-making. As past experience shows, online tools expedite independent validation, providing helpful information as to whether the tools generalize to new populations. The third is to dynamically update and localize risk tools to adapt to changing demographic and clinical landscapes. The fourth is to accommodate systematic missing-data patterns across cohorts, in order to maximize the statistical power in model training, and to accommodate missing information on the end-user side, in order to maximize utility for the public.
Abstract: The application of linear mixed models or generalized linear mixed models to large databases in which the level-2 units (hospitals) have a wide variety of characteristics is a problem frequently encountered in studies of medical quality. Accurate estimation of model parameters and standard errors requires accounting for the grouping of outcomes within hospitals. Including the hospitals as a random effect in the model is a common way of doing so. However, in a large, diverse population, the required assumptions are not satisfied, which can lead to inconsistent and biased parameter estimates. One solution is to use cluster analysis, with clustering variables distinct from the model covariates, to group the hospitals into smaller, more homogeneous groups. The analysis can then be carried out within these groups. We illustrate this analysis with a study of hemoglobin A1c control among diabetic patients in a national database of United States Department of Veterans Affairs (VA) hospitals.
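The cluster-then-analyze strategy described above can be sketched in a few lines. This is a minimal illustration with simulated data, not the authors' VA analysis: hospitals are clustered on hospital-level traits (distinct from the patient-level covariates), and a separate model is then fit within each cluster. For brevity the within-cluster model here is a plain logistic regression rather than a mixed model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical hospital-level characteristics (e.g. size, rurality,
# case mix) -- deliberately distinct from the model covariates.
n_hospitals = 60
hospital_traits = rng.normal(size=(n_hospitals, 3))

# Step 1: group the hospitals into more homogeneous clusters.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(hospital_traits)

# Step 2: carry out the analysis within each cluster; a mixed model
# could be fit per cluster instead of this plain logistic regression.
models = {}
for c in range(4):
    members = np.where(km.labels_ == c)[0]
    # simulate patient-level data pooled over the cluster's hospitals
    n = 50 * len(members)
    X = rng.normal(size=(n, 2))                       # patient covariates
    p = 1 / (1 + np.exp(-X[:, 0]))                    # true outcome prob.
    y = (rng.random(n) < p).astype(int)               # binary outcome
    models[c] = LogisticRegression().fit(X, y)

print(len(models))  # one fitted model per hospital cluster
```

The key design point is that the clustering variables never enter the within-cluster models, so the grouping does not mechanically absorb covariate effects.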
Abstract: PSA measurements are used to assess the risk of prostate cancer. PSA range and PSA kinetics such as PSA velocity have been correlated with increased cancer detection and assist the clinician in deciding when prostate biopsy should be performed. Our aim is to evaluate a novel maximum likelihood estimation - prostate specific antigen (MLE-PSA) model for predicting the probability of prostate cancer from serial PSA measurements combined with PSA velocity, in order to assess whether this reduces the need for prostate biopsy. A total of 1976 Caucasian patients were included. All these patients had at least 6 serial PSA measurements and underwent trans-rectal biopsy with a minimum of 12 cores within the past 10 years. A multivariate logistic regression model was developed using maximum likelihood estimation (MLE) based on the following parameters: age; at least 6 serial PSA measurements; baseline median natural logarithm of the PSA (ln(PSA)) and PSA velocity (ln(PSAV)); baseline process capability standard deviation of ln(PSA) and ln(PSAV); significant special causes of variation in ln(PSA) and ln(PSAV) detected using control chart logic; and the volatility of ln(PSAV). We then compared prostate cancer probability under MLE-PSA to the results of prostate needle biopsy. The MLE-PSA model with a 50% cut-off probability has a sensitivity of 87%, specificity of 85%, positive predictive value (PPV) of 89%, and negative predictive value (NPV) of 82%. By contrast, a single PSA value with a 4 ng/ml threshold has a sensitivity of 59%, specificity of 33%, PPV of 56%, and NPV of 36% in the same population used to generate the MLE-PSA model. Based on serial PSA measurements, the MLE-PSA model significantly (p-value < 0.0001) improves prostate cancer detection and reduces the need for prostate biopsy.
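The four operating characteristics quoted above all derive from the same confusion-matrix counts. A minimal helper makes the definitions explicit; the counts below are illustrative only, not the study's data:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Screening metrics from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # P(test+ | disease)
        "specificity": tn / (tn + fp),   # P(test- | no disease)
        "ppv":         tp / (tp + fp),   # P(disease | test+)
        "npv":         tn / (tn + fn),   # P(no disease | test-)
    }

# Illustrative counts only -- not the MLE-PSA study data.
m = diagnostic_metrics(tp=87, fp=11, fn=13, tn=85)
print(m["sensitivity"])  # 0.87
```

Note that PPV and NPV, unlike sensitivity and specificity, depend on disease prevalence in the evaluated population, which is why the paper reports both pairs for the same cohort.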
Abstract: When comparing the performance of health care providers, it is important to rule out factors that have an unwanted effect on the performance indicator (e.g., mortality). In register-based studies randomization is out of the question. We develop a risk adjustment model for hip fracture mortality in Finland using logistic regression. The model is used to study the impact of the length of the register follow-up period on adjusting the performance indicator for a set of comorbidities: congestive heart failure, cancer, and diabetes. We also introduce an implementation of the minimum description length (MDL) principle for model selection in logistic regression, using the normalized maximum likelihood (NML) technique. Because the computational burden of the usual NML criterion is too heavy, a technique based on the idea of sequentially normalized maximum likelihood (sNML) is introduced. The sNML criterion can be evaluated efficiently even for large models with large amounts of data. The results given by sNML are then compared to the corresponding results given by the traditional AIC and BIC model selection criteria. All three comorbidities clearly have an effect on hip fracture mortality. The results indicate that for congestive heart failure all available medical history should be used, while for cancer it is enough to use only records from half a year before the fracture. For diabetes the choice of time period is not as clear, but using records from three years before the fracture seems reasonable.
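The AIC/BIC side of the comparison is straightforward to sketch (the sNML criterion itself is beyond a short example). Below is a minimal, self-contained illustration on simulated data, assuming the hypothetical setup that a longer follow-up window adds comorbidity flags to the logistic model; both criteria equal a fit term minus a complexity penalty, and BIC penalizes extra parameters more heavily when n is large.

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Logistic-regression MLE via Newton-Raphson; returns (beta, loglik)."""
    Z = np.column_stack([np.ones(len(y)), X])      # add intercept column
    b = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ b))
        H = Z.T @ (Z * (p * (1 - p))[:, None])     # Hessian of -loglik
        b = b + np.linalg.solve(H, Z.T @ (y - p))  # Newton step
    p = 1 / (1 + np.exp(-Z @ b))
    return b, np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def aic_bic(loglik, k, n):
    """AIC = 2k - 2*loglik, BIC = k*ln(n) - 2*loglik."""
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

# Simulated stand-in for register data: mortality driven by one
# comorbidity flag; two extra noise flags mimic a longer window.
rng = np.random.default_rng(0)
n = 1000
X_short = rng.integers(0, 2, size=(n, 1)).astype(float)
X_long = np.column_stack([X_short, rng.integers(0, 2, size=(n, 2))])
p_true = 1 / (1 + np.exp(-(-1.5 + 1.2 * X_short[:, 0])))
y = (rng.random(n) < p_true).astype(float)

_, ll_s = fit_logit(X_short, y)
_, ll_l = fit_logit(X_long, y)
aic_s, bic_s = aic_bic(ll_s, 2, n)
aic_l, bic_l = aic_bic(ll_l, 4, n)
```

The model whose window length minimizes the chosen criterion would be selected; the paper's point is that sNML can play the same role as AIC/BIC here while resting on the MDL principle.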
In this paper, subsampling is used to learn about the influence of individual data points on inference for the parameters of a fitted logistic regression model. The alternative, alternative regularized, alternative regularized lasso, and alternative regularized ridge estimators are proposed for parameter estimation in logistic regression models and are compared with the maximum likelihood estimators. The proposed alternative regularized estimators are obtained using a tuning parameter, whereas the proposed alternative estimators are not regularized. The alternative regularized lasso estimators are averaged standard lasso estimators, and the alternative regularized ridge estimators are averaged standard ridge estimators, over subsets of groups, where the number of subsets can be smaller than the number of parameters. The values of the tuning parameters are chosen to make the alternative regularized estimators very close to the maximum likelihood estimators, and the process is illustrated with two real data sets as well as a simulation study. The alternative and alternative regularized estimators always have closed-form expressions in terms of the observations, which the maximum likelihood estimators do not. When the maximum likelihood estimators lack closed-form expressions, the alternative regularized estimators thus provide approximate closed-form expressions for them.
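The averaging idea can be made concrete with linear ridge regression, which has an explicit closed form; this is a simplified analogue, not the paper's logistic-regression estimators. The closed-form ridge estimator is computed on each subset of the data, and the subset estimators are then averaged:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.0, 2.0, 0.0]) + rng.normal(size=n)

def ridge_closed_form(X, y, lam):
    """Closed-form ridge estimator (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Average the closed-form subset estimators -- a simplified stand-in
# for the paper's averaging over subsets of groups.
n_subsets, lam = 10, 1.0
idx = rng.permutation(n).reshape(n_subsets, -1)     # disjoint subsets
beta_avg = np.mean(
    [ridge_closed_form(X[i], y[i], lam) for i in idx], axis=0
)
beta_full = ridge_closed_form(X, y, lam)            # full-data reference
```

Because each subset estimator is an explicit function of its observations, the averaged estimator inherits a closed form; choosing the tuning parameter then controls how closely it tracks the full-data estimator.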
Multi-classification is commonly encountered in data science practice, with broad applications in areas such as biology, medicine, and engineering. Variable selection in multiclass problems is much more challenging than in binary classification or regression problems. In addition to estimating multiple discriminant functions for separating different classes, we need to decide which variables are important for each individual discriminant function as well as for the whole set of functions. In this paper, we address the multi-classification variable selection problem by proposing a new form of penalty, supSCAD, which first groups all the coefficients of the same variable across all the discriminant functions and then imposes the SCAD penalty on the sup-norm of each group. We apply the new penalty to both soft and hard classification, developing two new procedures: the supSCAD multinomial logistic regression and the supSCAD multi-category support vector machine. Our theoretical results show that, with a proper choice of the tuning parameter, the supSCAD multinomial logistic regression can identify the underlying sparse model consistently and enjoys oracle properties even when the dimension of the predictors goes to infinity. Based on local linear and quadratic approximations to the non-concave SCAD penalty and the nonlinear multinomial log-likelihood function, we show that the new procedures can be implemented efficiently by solving a series of linear or quadratic programming problems. The performance of the new methods is illustrated by simulation studies and by real data analysis of the Small Round Blue Cell Tumors and Semeion Handwritten Digit data sets.
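The penalty construction above can be sketched directly. The SCAD function below is the standard Fan-Li (2001) form; the `sup_scad` helper is an illustrative reading of the supSCAD grouping, in which variable j's coefficients across all K discriminant functions are collapsed to their sup-norm before the penalty is applied:

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty (Fan and Li, 2001) evaluated at |t|, elementwise."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,                                          # linear near zero
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),  # quadratic
            lam**2 * (a + 1) / 2,                         # flat: no bias
        ),
    )

def sup_scad(B, lam, a=3.7):
    """Group penalty on the sup-norm of each variable's coefficients.

    B has shape (p, K): row j collects variable j's coefficients
    across the K discriminant functions.
    """
    return scad_penalty(np.max(np.abs(B), axis=1), lam, a).sum()
```

Because the penalty flattens beyond a*lam, large coefficients are left unshrunk, which is the source of the oracle behavior; and penalizing the sup-norm zeroes out a variable in all K functions at once, giving the all-or-nothing variable selection the abstract describes.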