The R Package geeVerse for Ultra-High-Dimensional Heterogeneous Data Analysis with Generalized Estimating Equations
Pub. online: 18 November 2025
Type: Computing In Data Science
Open Access
Received
11 September 2025
Accepted
2 November 2025
Published
18 November 2025
Abstract
High- and ultra-high-dimensional data are becoming increasingly common in many fields. They often display diverse characteristics, including heterogeneity, longitudinal responses, and imbalanced measurements. These complexities make it challenging to integrate different modeling options and their combinations in order to fully leverage this rich data source. This paper presents an easy-to-use, stand-alone R package, geeVerse, that can implement any combination of 1) simultaneous variable selection and estimation, 2) quantile regression or mean regression for heterogeneous data, 3) longitudinal or cross-sectional data analysis, 4) balanced or imbalanced data, and 5) moderate, high, or even ultra-high-dimensional data. To accomplish this, we propose computationally efficient implementations of penalized generalized estimating equations (GEE) for quantile and mean regression. We present multiple applications with ultra-high-dimensional data, including analysis of a resampled genetic dataset, quantile and mean regressions, analysis of cross-sectional and longitudinal data, differing correlation structures, and differing numbers of repeated measurements per subject. We also demonstrate our approach on two real data applications.
Supplementary material
We provide supplementary material with a detailed model and an explanation of the computational algorithms used in geeVerse. We also provide the replication script for this paper at a public repository: GitHub.