Journal of Data Science

Variable Selection with FDR Control for Noisy Data – An Application to Screening Metabolites that Are Associated with Breast Cancer and Colorectal Cancer
Runqiu Wang, Ran Dai, Ying Huang, et al. (8 authors)

https://doi.org/10.6339/25-JDS1166
Pub. online: 11 June 2025; Type: Statistical Data Science; Open Access

Received: 29 June 2024
Accepted: 7 January 2025
Published: 11 June 2025

Abstract

The rapidly expanding field of metabolomics presents an invaluable resource for understanding the associations between metabolites and various diseases. However, the high dimensionality, missing values, and measurement errors of metabolomics data make it challenging to develop reliable and reproducible approaches for disease association studies, so there is a compelling need for robust statistical analyses that can navigate these complexities. In this paper, we construct algorithms that perform variable selection for noisy data and control the false discovery rate (FDR) when selecting mutual metabolomic predictors for multiple disease outcomes. We illustrate the versatility and performance of this procedure in a variety of scenarios involving missing data and measurement errors. As a specific application of this methodology, we target two of the most prevalent cancers among US women: breast cancer and colorectal cancer. By applying our method to the Women’s Health Initiative data, we identify metabolites that are associated with either or both of these cancers, demonstrating the practical utility of our method in identifying consistent risk factors and understanding shared mechanisms between diseases.
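
The abstract does not spell out the selection algorithm, so the following is only a generic illustration of what controlling the false discovery rate at a target level q means in a variable selection setting. It sketches the classical Benjamini-Hochberg step-up rule applied to per-variable p-values; it is not the authors' procedure, and the function name `benjamini_hochberg`, the level `q=0.1`, and the demo p-values are all hypothetical choices made for this example.

```python
import numpy as np


def benjamini_hochberg(p_values, q=0.1):
    """Return the (sorted) indices selected by the BH step-up rule at level q.

    Illustrative only: a textbook FDR-controlling rule, not the method of the paper.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    if m == 0:
        return np.array([], dtype=int)

    # Sort p-values and compare the i-th smallest to the threshold q * i / m.
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = np.nonzero(p[order] <= thresholds)[0]
    if passed.size == 0:
        return np.array([], dtype=int)

    # Step-up: select everything up to the largest rank that meets its threshold.
    k = passed[-1]
    return np.sort(order[: k + 1])


# Hypothetical p-values for six candidate metabolites (made up for illustration).
p_demo = [0.001, 0.008, 0.039, 0.041, 0.27, 0.60]
selected = benjamini_hochberg(p_demo, q=0.1)
print(selected.tolist())
```

With noisy or partially missing covariates, per-variable p-values of this kind may not be valid, which is part of the difficulty the abstract describes.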

Supplementary material

We provide an additional PDF file with further simulation results and real data analysis. The R code for the analyses in this paper is available at https://github.com/RunqiuWang22/Variable_Selection_FDR_noisy.



Copyright
2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
cancer; FDR control; measurement error; metabolomics data; missing data; variable selection

Funding
This research is partly supported by the National Cancer Institute under grants R01 CA119171, CA277133, and P30 CA015704 and by the National Institute of General Medical Sciences under grant U54 GM115458. The WHI programs are funded by the National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services through contracts HHSN268201600018C, HHSN268201600001C, HHSN268201600002C, HHSN268201600003C, and HHSN268201600004C.


Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X