Journal of Data Science


The R Package geeVerse for Ultra-High-Dimensional Heterogeneous Data Analysis with Generalized Estimating Equations
Tianhai Zu, Brittany Green, Yan Yu

https://doi.org/10.6339/25-JDS1207
Pub. online: 18 November 2025 | Type: Computing in Data Science | Open Access

Received: 11 September 2025
Accepted: 2 November 2025
Published: 18 November 2025

Abstract

High- and ultra-high-dimensional data are becoming increasingly common in various fields. They often display diverse characteristics, including heterogeneity, longitudinal responses, and imbalanced measurements. These complexities make it challenging to integrate different modeling options and their combinations in order to fully leverage this rich data source. This paper provides an easy-to-use, stand-alone R package, geeVerse, that can implement any combination of 1) simultaneous variable selection and estimation, 2) quantile regression or mean regression for heterogeneous data, 3) longitudinal or cross-sectional data analysis, 4) balanced or imbalanced data, and 5) moderate, high, or even ultra-high-dimensional data. To accomplish this, we propose computationally efficient implementations of penalized generalized estimating equations (GEE) for quantile and mean regression. We present multiple applications with ultra-high-dimensional data, including analysis of a resampled genetic dataset, quantile and mean regressions, analysis of cross-sectional and longitudinal data, differing correlation structures, and differing numbers of repeated measurements per subject. We also demonstrate our approach on two real data applications.

Supplementary material

We provide supplementary material with a detailed description of the model and an explanation of the computational algorithms used in geeVerse. The replication script for this paper is available in a public repository on GitHub.



Copyright
2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
GEE, longitudinal data, quantile, variable selection

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
