The R Package geeVerse for Ultra-High-Dimensional Heterogeneous Data Analysis with Generalized Estimating Equations

Zu, Tianhai; Green, Brittany; Yu, Yan

doi:10.6339/25-JDS1207

Journal of Data Science

Pub. online: 18 November 2025

Type: Computing in Data Science

Open Access

Received: 11 September 2025

Accepted: 2 November 2025

Published: 18 November 2025

Abstract

High- and ultra-high-dimensional data are becoming increasingly common in many fields. Such data often display diverse characteristics, including heterogeneity, longitudinal responses, and imbalanced measurements. These complexities make it challenging to integrate different modeling options and their combinations so as to fully leverage this rich data source. This paper presents geeVerse, an easy-to-use, stand-alone R package that can implement any combination of 1) simultaneous variable selection and estimation, 2) quantile regression or mean regression for heterogeneous data, 3) longitudinal or cross-sectional data analysis, 4) balanced or imbalanced data, and 5) moderate, high, or even ultra-high-dimensional data. To accomplish this, we propose computationally efficient implementations of penalized generalized estimating equations (GEE) for quantile and mean regression. We present multiple applications with ultra-high-dimensional data, including analysis of a resampled genetic dataset, quantile and mean regressions, analysis of cross-sectional and longitudinal data, differing correlation structures, and differing numbers of repeated measurements per subject. We also demonstrate our approach on two real-data applications.
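To make the distinction between quantile and mean regression concrete, the following minimal Python sketch illustrates the two ingredients named in the abstract: a quantile (check) loss and a sparsity-inducing penalty. It is not the geeVerse implementation and does not use the package's API. The function names (`check_loss`, `total_check_loss`, `scad_derivative`) and the toy data are hypothetical, and the SCAD penalty is used here only as an assumed example of a nonconvex penalty; the abstract does not state which penalty the package applies.

```python
def check_loss(u, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_check_loss(theta, ys, tau):
    """Sum of check losses of the residuals y - theta for a constant fit theta."""
    return sum(check_loss(y - theta, tau) for y in ys)


def scad_derivative(beta, lam, a=3.7):
    """Derivative of the SCAD penalty at |beta| (assumed example penalty).

    Equals lam for small coefficients (lasso-like shrinkage) and decays
    to zero for large ones, so large effects are left unpenalized.
    """
    b = abs(beta)
    if b <= lam:
        return lam
    return max(a * lam - b, 0.0) / (a - 1.0)


# Toy response with one extreme value (hypothetical data).
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

# A minimizer of the total check loss is attained at an observed value,
# so searching over the data points is enough for this illustration.
median_fit = min(ys, key=lambda t: total_check_loss(t, ys, 0.5))
upper_fit = min(ys, key=lambda t: total_check_loss(t, ys, 0.9))

# The mean (least-squares) fit is pulled toward the extreme value.
mean_fit = sum(ys) / len(ys)

print(median_fit, upper_fit, mean_fit)
```

In this toy example the median fit stays near the bulk of the data while the mean fit is pulled toward the extreme observation, and the fit at the upper quantile level tracks the upper tail. This is the sense in which quantile regression can describe heterogeneous responses that a single mean model would blur.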

Supplementary material

The supplementary material describes the model in detail and explains the computational algorithms used in geeVerse. The replication script for this paper is available in a public GitHub repository.

© 2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

GEE, longitudinal data, quantile, variable selection
