Pub. online: 14 Feb 2023 | Type: Data Science In Action | Open Access
Journal:Journal of Data Science
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 205–224
Abstract
Malignant mesotheliomas are aggressive cancers that arise in the thin layer of tissue that lines, most commonly, the chest or abdomen. Although the cancer is rare and deadly, early diagnosis helps with treatment and improves outcomes. Mesothelioma is usually diagnosed in the later stages, and its symptoms are similar to those of other, more common conditions. Predicting and diagnosing mesothelioma early is therefore essential to starting treatment for a cancer that is often diagnosed too late. The goal of this comprehensive empirical comparison is to determine the best-performing model based on recall (sensitivity). We particularly wish to avoid false negatives, as it is costly to diagnose a patient as healthy when they actually have cancer. Model training is conducted using k-fold cross-validation. Random forest is chosen as the optimal model. According to this model, age and duration of asbestos exposure are ranked as the most important features affecting the diagnosis of mesothelioma.
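To make the selection criterion concrete, the following is a minimal sketch of recall computed under k-fold cross-validation, written in plain Python. It is not the authors' pipeline: the function names (`recall`, `k_fold_indices`, `cv_recall`) and the `fit` callback interface are assumptions introduced here for illustration, and positive cases are assumed to be labeled 1.

```python
import random

def recall(y_true, y_pred):
    """Recall (sensitivity) = TP / (TP + FN), with 1 as the positive label.

    Returns None when the sample contains no positive cases.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    return tp / (tp + fn) if (tp + fn) else None

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal, shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

def cv_recall(fit, X, y, k=5, seed=0):
    """Mean recall over the folds that contain at least one positive case.

    `fit(X_train, y_train)` is a hypothetical callback that returns a
    function mapping one feature row to a predicted label (0 or 1).
    """
    scores = []
    for train, test in k_fold_indices(len(y), k, seed):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        score = recall([y[i] for i in test], [predict(X[i]) for i in test])
        if score is not None:  # skip folds without positive cases
            scores.append(score)
    return sum(scores) / len(scores)
```

Candidate models would each be passed as a `fit` callback, and the one with the highest `cv_recall` would be preferred. Because a false negative lowers recall while a false positive does not, this criterion penalizes exactly the error the abstract singles out as costly.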
Abstract: Cancer is a complex disease where various types of molecular aberrations drive the development and progression of malignancies. Among the diverse molecular aberrations, inherited and somatic mutations on DNA sequences are considered major drivers of oncogenesis. The complexity of somatic alterations is revealed by large-scale investigations of cancer genomes and robust methods for inferring the function of genes. In this review, we will describe sequence mutations of several cancer-related genes and discuss their functional implications in cancer. In addition, we will introduce the on-line resources for accessing and analyzing sequence mutations in cancer. We will also provide an overview of the statistical and computational approaches and future prospects for conducting comprehensive analyses of the somatic alterations in cancer genomes.
Semi-parametric Cox regression and parametric methods have been used to analyze survival data of cancer; however, no study has focused on the comparison of survival models in genetic association analysis of age at onset (AAO) of cancer. The hepatocyte nuclear factor-1-beta (HNF1B) gene has been associated with risk of endometrial and prostate cancers; however, no study has focused on the effect of the HNF1B gene on the AAO of cancer. This study examined 23 single nucleotide polymorphisms (SNPs) within the HNF1B gene in the Marshfield sample with 716 cancer cases and 2,848 non-cancer controls. Cox proportional hazards models in PROC PHREG and parametric survival models (including exponential, Weibull, log-normal, log-logistic, and gamma models) in PROC LIFEREG in SAS 9.4 were used to detect the genetic association of the HNF1B gene with the AAO. The Akaike information criterion (AIC) and Bayesian information criterion (BIC) were used to compare the Cox models and parametric survival models. Both AIC and BIC values showed that the Weibull distribution is the best model for all 23 SNPs and the gamma distribution is the second best. Based on the Weibull model, the top two SNPs are rs4239217 and rs7501939, with time ratio (TR) = 1.08 (p < 0.0001 for the AA and AG genotypes, respectively) and TR = 1.07 (p = 0.0004 and 0.0002 for the CC and CT genotypes, respectively). This study shows that the parametric Weibull distribution is the best model for the genetic association of AAO of cancer and provides the first evidence of several genetic variants within the HNF1B gene associated with AAO of cancer.
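The model comparison described above can be sketched in a few lines of Python using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of estimated parameters, L the maximized likelihood, and n the sample size. This is an illustrative sketch, not the SAS analysis: the log-likelihoods and parameter counts in `fits` are made-up placeholders, not values from the study, and taking n as the total number of subjects (716 + 2,848) is an assumption, since some survival analyses use the number of events instead. The last function uses the accelerated failure time convention that a time ratio is the exponential of a regression coefficient.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik

def time_ratio(beta):
    """Time ratio of an accelerated failure time model: TR = exp(beta)."""
    return math.exp(beta)

# PLACEHOLDER fits, invented for illustration only (NOT results from the study):
# model name -> (maximized log-likelihood, number of estimated parameters)
fits = {
    "exponential": (-1530.0, 2),
    "weibull": (-1490.0, 3),
    "log-normal": (-1500.0, 3),
    "log-logistic": (-1497.0, 3),
    "gamma": (-1489.8, 4),
}

n_obs = 716 + 2848  # cases + controls from the abstract; using all subjects is an assumption

# Rank models from best (lowest criterion) to worst under each criterion.
by_aic = sorted(fits, key=lambda m: aic(*fits[m]))
by_bic = sorted(fits, key=lambda m: bic(*fits[m], n_obs))

# A time ratio of 1.08 corresponds to a coefficient of ln(1.08) on the log-time scale.
beta = math.log(1.08)
```

With real output, the maximized log-likelihood and parameter count of each fitted model would replace the placeholders. A TR above 1 means the genotype is associated with a later expected age at onset than the reference genotype; TR = 1.08 corresponds to roughly an 8% longer time to onset.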