Pub. online: 4 Aug 2022 | Type: Research Article | Open Access
Journal:Journal of Data Science
Volume 18, Issue 3 (2020): Special issue: Data Science in Action in Response to the Outbreak of COVID-19, pp. 483–495
Abstract
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2, and it was declared a global pandemic by the World Health Organization on March 11, 2020. In this work, we conduct a cross-sectional study to investigate how the infection fatality rate (IFR) of COVID-19 may be associated with geographic or demographic features of the infected population. We employ a multiple index model in combination with sliced inverse regression to model the relationship between the IFR and possible risk factors. To select features associated with the IFR, we use an adaptive Lasso penalized sliced inverse regression method, which performs variable selection and sufficient dimension reduction simultaneously and removes unimportant features automatically. We apply the proposed method in a cross-sectional study of COVID-19 data obtained at two time points of the outbreak.
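As a rough illustration of the dimension-reduction step mentioned in this abstract, the following is a minimal sketch of plain (unpenalized) sliced inverse regression in Python with NumPy. It is not the authors' adaptive Lasso penalized estimator and performs no variable selection. The function name `sliced_inverse_regression`, the default of 10 slices, and the assumption that the sample covariance is invertible are all choices made here for illustration.

```python
import numpy as np

def sliced_inverse_regression(X, y, n_slices=10, n_directions=1):
    """Plain SIR sketch (no Lasso penalty): estimate index directions.

    Hypothetical helper for illustration; assumes the sample covariance
    of X is invertible and that n is comfortably larger than p.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Standardize the predictors: Z = (X - mean) @ Sigma^{-1/2}.
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    cov_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ cov_inv_sqrt

    # Slice the observations by the sorted response.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)

    # Weighted covariance of the within-slice means of Z.
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Leading eigenvectors of M, mapped back to the original X scale.
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_directions]
    B = cov_inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)  # unit-length columns
```

Each returned column is determined only up to sign. In a penalized variant such as the one the abstract describes, a sparsity penalty would additionally shrink the coefficients of irrelevant features to zero; that step is omitted here.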
There is a great deal of prior knowledge about gene function and regulation in the form of annotations or prior results that, if directly integrated into individual prognostic or diagnostic studies, could improve predictive performance. For example, in a study to develop a predictive model for cancer survival based on gene expression, effect sizes from previous studies or the grouping of genes based on pathways constitute such prior knowledge. However, this external information is typically only used post-analysis to aid in the interpretation of any findings. We propose a new hierarchical two-level ridge regression model that can integrate external information in the form of “meta features” to predict an outcome. We show that the model can be fit efficiently using cyclic coordinate descent by recasting the problem as a single-level regression model. In a simulation-based evaluation we show that the proposed method outperforms standard ridge regression and competing methods that integrate prior information, in terms of prediction performance when the meta features are informative on the mean of the features, and that there is no loss in performance when the meta features are uninformative. We demonstrate our approach with applications to the prediction of chronological age based on methylation features and breast cancer mortality based on gene expression features.
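The recasting idea in this abstract can be sketched as follows, under assumptions that go beyond the abstract. Suppose the two-level model is y = Xβ + ε with β = Zα + γ, where Z holds the meta features, and suppose both levels carry ridge penalties (λ1 on γ, λ2 on α). Substituting gives y = Xγ + (XZ)α + ε, a single-level ridge problem on the augmented design [X, XZ]. The sketch below solves it in closed form rather than by cyclic coordinate descent, omits the intercept, and uses hypothetical names (`fit_two_level_ridge`, `lam1`, `lam2`).

```python
import numpy as np

def fit_two_level_ridge(X, y, Z, lam1=1.0, lam2=1.0):
    """Two-level ridge sketch via a single augmented ridge regression.

    Assumed objective (not confirmed by the source):
        ||y - X beta||^2 + lam1 * ||beta - Z alpha||^2 + lam2 * ||alpha||^2
    Solved in closed form here, not by coordinate descent.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    p, q = X.shape[1], Z.shape[1]

    # Augmented design: columns for gamma (X) and for alpha (X @ Z).
    D = np.hstack([X, X @ Z])

    # Separate ridge penalties for the two coefficient blocks.
    penalty = np.diag(np.concatenate([np.full(p, lam1), np.full(q, lam2)]))

    # Ridge normal equations: (D'D + P) theta = D'y.
    theta = np.linalg.solve(D.T @ D + penalty, D.T @ y)
    gamma, alpha = theta[:p], theta[p:]

    # Recover the feature-level coefficients.
    beta = Z @ alpha + gamma
    return beta, alpha
```

When the meta features are uninformative, α is shrunk toward zero and β reduces to an ordinary ridge estimate with penalty `lam1`, which is consistent with the abstract's remark that performance is not lost in that case.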
There are many methods of scoring the importance of variables in prediction of a response but not much is known about their accuracy. This paper partially fills the gap by introducing a new method based on the GUIDE algorithm and comparing it with 11 existing methods. For data without missing values, eight methods are shown to give biased scores that are too high or too low, depending on the type of variables (ordinal, binary or nominal) and whether or not they are dependent on other variables, even when all of them are independent of the response. Among the remaining four methods, only GUIDE continues to give unbiased scores if there are missing data values. It does this with a self-calibrating bias-correction step that is applicable to data with and without missing values. GUIDE also provides threshold scores for differentiating important from unimportant variables with 95 and 99 percent confidence. Correlations of the scores to the predictive power of the methods are studied in three real data sets. For many methods, correlations with marginal predictive power are much higher than with conditional predictive power.
Climate change is widely recognized as one of the most challenging, urgent and complex problems facing humanity. There is rising interest in understanding and quantifying climate change. We analyze the climate trend in Canada using Canadian monthly surface air temperature data, which are longitudinal in nature and cover a long time span. Analysis of such data is challenging due to the complexity of modeling and the associated computational burden. In this paper, we divide this type of longitudinal data into time blocks, conduct multivariate regression and use a vine copula model to account for the dependence among the multivariate error terms. This vine copula model allows separate specification of the within-block and between-block dependence structures and offers great flexibility in modeling complex association structures. To relieve the computational burden and concentrate on the structure of interest, we construct composite likelihood functions, which leave the connecting structure between time blocks unspecified. We discuss different estimation procedures and issues regarding model selection and prediction. We explore the prediction performance of our vine copula model through extensive simulation studies. An analysis of the Canadian climate dataset is provided.