Variable Selection with FDR Control for Noisy Data – An Application to Screening Metabolites that Are Associated with Breast Cancer and Colorectal Cancer
Pub. online: 11 June 2025
Type: Statistical Data Science
Open Access
Received: 29 June 2024
Accepted: 7 January 2025
Published: 11 June 2025
Abstract
The rapidly expanding field of metabolomics is an invaluable resource for understanding the associations between metabolites and various diseases. However, metabolomics data are high-dimensional, contain missing values, and are subject to measurement error, all of which make it difficult to conduct reliable and reproducible disease association studies. Robust statistical methods that can handle these complexities are therefore needed. In this paper, we construct algorithms that perform variable selection for noisy data and control the false discovery rate (FDR) when selecting mutual metabolomic predictors for multiple disease outcomes. We illustrate the versatility and performance of this procedure in a variety of scenarios involving missing data and measurement errors. As a specific application of this methodology, we target two of the most prevalent cancers among US women: breast cancer and colorectal cancer. By applying our method to the Women’s Health Initiative data, we identify metabolites that are associated with either or both of these cancers, demonstrating the practical utility of the method for identifying consistent risk factors and understanding shared mechanisms between diseases.
Supplementary material
We provide a PDF file with additional simulation results and real data analysis. The R code for the analyses in this paper is available at https://github.com/RunqiuWang22/Variable_Selection_FDR_noisy.