A recent trend in medical research is to develop prediction models that aim to improve patient care and health outcomes. While statisticians and data scientists are well trained in the methods and process of developing a prediction model, their role after model development is less clear. This paper covers the critical scientific reasoning steps in the prediction pipeline that follow model development. Working collaboratively with domain experts, statisticians and data scientists should critically evaluate models, carefully implement them in practice, and assess their impact in real-world settings. Constructs from implementation science are discussed in the context of prediction modeling. The paper focuses on clinical prediction models, but these ideas apply to other domains as well.
Predictor envelopes model the response variable using a subspace of dimension $d$ extracted from the full space of all $p$ input variables. Predictor envelopes have a close connection to partial least squares and, in theory, enjoy improved estimation efficiency. As such, they have become increasingly popular in chemometrics. Often $d$ is much smaller than $p$, which seemingly enhances the interpretability of the envelope model. However, the process of estimating the envelope subspace adds complexity to the final fitted model. To better understand the complexity of predictor envelopes, we study their effective degrees of freedom (EDF) in a variety of settings. We find that in many cases a $d$-dimensional predictor envelope model can have far more than $d+1$ EDF, often close to $p+1$. However, the EDF of a predictor envelope depend heavily on the structure of the underlying data-generating model, and there are settings in which predictor envelopes can have substantially reduced model complexity.
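The notion of EDF used here can be made concrete via the standard covariance definition, $\mathrm{EDF} = \sigma^{-2}\sum_i \mathrm{Cov}(\hat{y}_i, y_i)$, which can be estimated by Monte Carlo over repeated noise draws. The sketch below is not from the paper: it does not implement an envelope estimator, and instead uses ordinary least squares, whose EDF is known to equal $p+1$ (intercept plus $p$ slopes), to illustrate the estimator; `monte_carlo_edf` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 10, 1.0
X = rng.normal(size=(n, p))            # fixed design matrix
mu = X @ rng.normal(size=p)            # fixed true mean function

def ols_fit_predict(X, y):
    """Fit OLS with intercept and return fitted values."""
    Xi = np.column_stack([np.ones(len(X)), X])
    beta = np.linalg.lstsq(Xi, y, rcond=None)[0]
    return Xi @ beta

def monte_carlo_edf(fit_predict, n_rep=2000):
    """Estimate EDF = (1/sigma^2) * sum_i Cov(yhat_i, y_i)
    by simulating fresh noise around the fixed mean mu."""
    ys, yhats = [], []
    for _ in range(n_rep):
        y = mu + sigma * rng.normal(size=n)
        ys.append(y)
        yhats.append(fit_predict(X, y))
    ys, yhats = np.array(ys), np.array(yhats)
    # per-observation sample covariances, summed over i
    cov = ((ys - ys.mean(0)) * (yhats - yhats.mean(0))).sum(0) / (n_rep - 1)
    return cov.sum() / sigma**2

edf = monte_carlo_edf(ols_fit_predict)
# In expectation, edf is close to p + 1 = 11 for OLS with intercept.
```

Substituting an envelope (or PLS) fit for `ols_fit_predict` gives the EDF estimates studied in the abstract; the point there is that such fits need not have EDF near $d+1$.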
There has been increasing interest in modeling survival data using deep learning methods in medical research. In this paper, we propose a Bayesian hierarchical deep neural network model for the modeling and prediction of survival data. Compared with previously studied methods, the new proposal provides not only a point estimate of the survival probability but also a quantification of the corresponding uncertainty, which can be of crucial importance in predictive modeling and subsequent decision making. The favorable statistical properties of the point and uncertainty estimates are demonstrated through simulation studies and real data analysis. Python code implementing the proposed approach is provided.
Large-scale genomics studies provide researchers with extensive datasets of unprecedented detail and scope, encompassing not only genes but also other functional units, including non-coding microRNAs (miRNAs). To analyze these high-fidelity data while remaining faithful to the underlying biology, statistical methods are needed that reflect the full range of understanding in contemporary molecular biology while remaining flexible enough to handle a wide range of data and complex phenomena. Leveraging multiple omics datasets, miRNA-gene targets, and signaling pathway topology, we present an integrative linear model for analyzing signaling pathways. Specifically, we use a mixed linear model to characterize tumor and healthy tissue and perform statistical significance testing to identify pathway disturbances. In this paper, a pan-cancer analysis is performed for a wide range of signaling pathways. We discuss specific findings from this analysis, as well as a publicly available interactive data visualization that contains the full range of our analytic findings.
There are many methods for scoring the importance of variables in the prediction of a response, but little is known about their accuracy. This paper partially fills the gap by introducing a new method based on the GUIDE algorithm and comparing it with 11 existing methods. For data without missing values, eight methods are shown to give biased scores that are too high or too low, depending on the type of variable (ordinal, binary, or nominal) and whether or not it is dependent on other variables, even when all of them are independent of the response. Among the remaining four methods, only GUIDE continues to give unbiased scores when there are missing data values. It does this with a self-calibrating bias-correction step that is applicable to data with and without missing values. GUIDE also provides threshold scores for differentiating important from unimportant variables with 95 and 99 percent confidence. Correlations of the scores with the predictive power of the methods are studied in three real datasets. For many methods, correlations with marginal predictive power are much higher than with conditional predictive power.
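The kind of bias described here, where scores depend on the variable type even under the null, is easy to reproduce with split-based importance measures: a variable with many candidate split points tends to score higher than a binary variable even when both are independent of the response. The sketch below is an illustration of that phenomenon, not an implementation of GUIDE or of any of the 11 compared methods; `best_split_gain` is a hypothetical helper that scores one regression-tree split.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.normal(size=n)            # response independent of both predictors
x_cont = rng.normal(size=n)       # ordinal variable: ~200 candidate splits
x_bin = rng.integers(0, 2, n)     # binary variable: a single candidate split

def best_split_gain(x, y):
    """Largest reduction in SSE achievable by a single split on x,
    the quantity greedy tree algorithms use to rank variables."""
    total = ((y - y.mean()) ** 2).sum()
    gains = []
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        gains.append(total - sse)
    return max(gains)

g_cont = best_split_gain(x_cont, y)
g_bin = best_split_gain(x_bin, y)
# Under the null, g_cont almost always exceeds g_bin: the maximum over
# many candidate thresholds is stochastically larger than a single one.
```

Unbiased methods such as the GUIDE approach described above correct for exactly this selection effect.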
Ensemble techniques have gained prominence among machine learning models for supervised tasks because of their strong predictive capacity compared with more traditional approaches. The random forest is considered one of the best off-the-shelf algorithms owing to its flexibility and robust performance on both regression and classification tasks. In this paper, the random machines method is applied to simulated and benchmark datasets and compared with established random forest models. The results on simulated data show that random machines have better predictive performance than random forests on most of the investigated datasets. Three real data applications demonstrate that random machines can be used to solve real-world problems with competitive performance.
Regression methods, including the proportional rates model and the additive rates model, have been proposed to evaluate the effect of covariates on the risk of recurrent events. These two models make different assumptions about the form of the covariate effects. A more flexible model, the additive-multiplicative rates model, allows covariates to have both additive and multiplicative effects on the marginal rate of the recurrent event process. However, its use is limited to cases where time-dependent covariates are monitored continuously throughout follow-up. In practice, time-dependent covariates are often measured only intermittently, which renders the current estimation method for the additive-multiplicative rates model inapplicable. In this paper, we propose a semiparametric estimator of the regression coefficients of the additive-multiplicative rates model that accommodates intermittently observed time-dependent covariates. We present simulation results comparing the proposed method with simple approaches, including last covariate carried forward and linear interpolation, and apply the proposed method to an epidemiologic study evaluating the effect of time-varying streptococcal infections on the risk of pharyngitis among school children. The R package implementing the proposed method is available at www.github.com/TianmengL/rectime.
Machine learning methods are increasingly applied to medical data analysis to reduce human effort and improve our understanding of disease propagation. When the data are complex and unstructured, shallow learning methods may not be suitable or feasible. Deep learning neural networks, such as the multilayer perceptron (MLP) and the convolutional neural network (CNN), have been incorporated into medical diagnosis and prognosis for better health care practice. For a binary outcome, these learning methods directly output predicted probabilities of a patient's health condition. Investigators still need to choose an appropriate decision threshold to split the predicted probabilities into positive and negative regions. We review methods for selecting the cut-off value, including the relatively automatic methods based on optimizing ROC curve criteria as well as utility-based methods using a net benefit curve. In particular, decision curve analysis (DCA) is now acknowledged in medical studies as a good complement to ROC analysis for the purpose of decision making. In this paper, we provide R code illustrating how to run these statistical learning methods, select a decision threshold to yield a binary prediction, and evaluate the accuracy of the resulting classification. This article will help medical decision makers understand different classification methods and use them in real-world scenarios.
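The two threshold-selection strategies contrasted above can be sketched briefly. This is a minimal numpy illustration, not the R code provided with the paper: `youden_threshold` implements the common ROC-based rule of maximizing Youden's J, and `net_benefit` computes the quantity plotted in a decision curve; both function names are ours.

```python
import numpy as np

def youden_threshold(y, p):
    """Cut-off maximizing Youden's J = sensitivity + specificity - 1,
    searched over the observed predicted probabilities."""
    best_t, best_j = 0.5, -np.inf
    for t in np.unique(p):
        pred = p >= t
        sens = np.mean(pred[y == 1])   # true positive rate at cut-off t
        spec = np.mean(~pred[y == 0])  # true negative rate at cut-off t
        if sens + spec - 1 > best_j:
            best_j, best_t = sens + spec - 1, t
    return best_t

def net_benefit(y, p, pt):
    """Net benefit at threshold probability pt, the y-axis of a
    decision curve: TP/n - (FP/n) * pt/(1-pt)."""
    pred = p >= pt
    n = len(y)
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    return tp / n - (fp / n) * pt / (1 - pt)

# Toy example with a well-separated classifier
y = np.array([0] * 5 + [1] * 5)
p = np.array([0.1] * 5 + [0.9] * 5)
cut = youden_threshold(y, p)  # 0.9: perfect sensitivity and specificity
nb = net_benefit(y, p, 0.5)   # 0.5: all 5 cases classified positive, no false positives
```

In DCA, `net_benefit` would be evaluated over a range of threshold probabilities and compared against the treat-all and treat-none strategies, whereas the Youden rule returns a single cut-off from the ROC curve alone.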