Supplementary Material

JDS

Journal of Data Science

1683-86021680-743X

1680-743X

School of Statistics, Renmin University of China

JDS1051

10.6339/22-JDS1051

Data Science Reviews

Accelerating Fixed-Point Algorithms in Statistics and Data Science: A State-of-Art Review

Tang

Bohao

1 Henderson

Nicholas C.

2 Varadhan

Ravi

ravi.varadhan@jhu.edu13∗ 1Department of Biostatistics, Johns Hopkins University, Maryland, USA 2Department of Biostatistics, University of Michigan, Michigan, USA 3Quantitative Sciences Division, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Maryland, USA

∗Corresponding author. Email: ravi.varadhan@jhu.edu.

2023

1972022

211126

Supplementary Material

We provide an R package AccelBenchmark, available from Github that can be used to reproduce the experiments in this paper. Please check the vignette and the demo.R file under vignette folder for reference.

We provide an additional pdf file that includes a) some basic properties of fixed point iterations; b) visualization of the convergence for all the experiments and c) additional analysis for the Sinkhorn scaling problem.

7320222552022

2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

2023

Open access article under the CC BY license.

Fixed-point algorithms are popular in statistics and data science due to their simplicity, guaranteed convergence, and applicability to high-dimensional problems. Well-known examples include the expectation-maximization (EM) algorithm, majorization-minimization (MM), and gradient-based algorithms like gradient descent (GD) and proximal gradient descent. A characteristic weakness of these algorithms is their slow convergence. We discuss several state-of-art techniques for accelerating their convergence. We demonstrate and evaluate these techniques in terms of their efficiency and robustness in six distinct applications. Among the acceleration schemes, SQUAREM shows robust acceleration with a mean 18-fold speedup. DAAREM and restarted-Nesterov schemes also demonstrate consistently impressive accelerations. Thus, it is possible to accelerate the original fixed-point algorithm by using one of SQUAREM, DAAREM, or restarted-Nesterov acceleration schemes. We describe implementation details and software packages to facilitate the application of the acceleration schemes. We also discuss strategies for selecting a particular acceleration scheme for a given problem.

Keywords convergence acceleration EM high dimensional models MM proximal gradient

References

Altschuler

, Weed

, Rigollet

(2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. arXiv preprint: https://arxiv.org/abs/1705.09634.

Anderson

(1965). Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM), 12(4): 547–560.

Atkinson

(1976). A Survey of Numerical Methods for the Solution of Fredholm Integral Equations of the Second Kind, volume 16. Society for Industrial and Applied Mathematics, Philadelphia.

Berlinet

, Roland

(2009). Parabolic acceleration of the EM algorithm. Statistics and Computing, 19(1): 35–47.

Bobb

, Varadhan

(2021). turboEM: A Suite of Convergence Acceleration Schemes for EM, MM and Other Fixed-Point Algorithms. R package version 2021.1.

Bottolo

, Richardson

, et al. (2010). Evolutionary stochastic search for Bayesian model exploration. Bayesian Analysis, 5(3): 583–618.

Boyd

, Boyd

, Vandenberghe

(2004). Convex Optimization. Cambridge University Press.

Carbonetto

, Stephens

, et al. (2012). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis, 7(1): 73–108.

Chipman

, George

, McCulloch

, Clyde

, Foster

, Stine

(2001). The practical implementation of Bayesian model selection. Lecture Notes-Monograph Series, 65–134.

Clyde

, Ghosh

, Littman

(2011). Bayesian adaptive sampling for variable selection and model averaging. Journal of Computational and Graphical Statistics, 20(1): 80–101.

Cook

, Sutskever

, Mnih

, Hinton

(2007). Visualizing similarity data with a mixture of maps. In: Artificial Intelligence and Statistics, 67–74. PMLR.

Dempster

, Laird

, Rubin

(1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1): 1–22.

, Varadhan

(2020). SQUAREM: An R package for off-the-shelf acceleration of EM, MM and other EM-like monotone algorithms. Journal of Statistical Software, 92: 1–41.

Evans

, Pollock

, Rebholz

, Xiao

(2020). A proof that Anderson acceleration improves the convergence rate in linearly converging fixed-point methods (but not in those converging quadratically). SIAM Journal on Numerical Analysis, 58(1): 788–810.

Fang

, Saad

(2009). Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications, 16(3): 197–221.

Friedman

(2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1189–1232.

Geist

, Scherrer

(2018). Anderson acceleration for reinforcement learning. arXiv preprint: https://arxiv.org/abs/1809.09501.

George

, McCulloch

(1997). Approaches for Bayesian variable selection. Statistica Sinica, 339–373.

Guyon

, Gunn

, Ben-Hur

, Dror

(2004). Result analysis of the NIPS 2003 feature selection challenge. In: NIPS, volume 4, 545–552.

Hasannasab

, Hertrich

, Laus

, Steidl

(2021). Alternatives to the EM algorithm for ML estimation of location, scatter matrix, and degree of freedom of the Student t distribution. Numerical Algorithms, 87(1): 77–118.

Hasselblad

(1966). Estimation of parameters for a mixture of normal distributions. Technometrics, 8(3): 431–444.

Henderson

, Varadhan

(2020). daarem: Damped Anderson Acceleration with Epsilon Monotonicity for Accelerating EM-Like Monotone Algorithms. R package version 0.5.

Henderson

, Varadhan

(2019). Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms. Journal of Computational and Graphical Statistics, 28(4): 834–846.

Higham

, Strabić

(2016). Anderson acceleration of the alternating projections method for computing the nearest correlation matrix. Numerical Algorithms, 72(4): 1021–1042.

Hobolth

, Guo

, Kousholt

, Jensen

(2020). A unifying framework and comparison of algorithms for non-negative matrix factorisation. International Statistical Review, 88(1): 29–53.

Hunter

, Lange

(2004). A tutorial on MM algorithms. The American Statistician, 58(1): 30–37.

Jin

, Tam

, Paniagua

, Hammell

(2015). TEtranscripts: A package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics, 31(22): 3593–3599.

Junkins

, Bani Younes

, Woollands

, Bai

(2013). Picard iteration, chebyshev polynomials and chebyshev-picard methods: Application in astrodynamics. The Journal of the Astronautical Sciences, 60(3): 623–653.

Knight

(2008). The Sinkhorn–Knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1): 261–275.

Liu

, Rubin

(1995). ML estimation of the multivariate t distribution with unknown degrees of freedom. Statistica Sinica, 5: 19–39.

Mena

, Belanger

, Linderman

, Snoek

(2018). Learning latent permutations with gumbel-sinkhorn networks. arXiv preprint: https://arxiv.org/abs/1802.08665.

Nesterov

(2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1): 125–161.

O’Donoghue

, Candes

(2015). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3): 715–732.

Parlett

, Landis

(1982). Methods for scaling to doubly stochastic form. Linear Algebra and its Applications, 48: 53–79.

Raket

, Grimme

, Schöner

, Igel

, Markussen

(2016). Separating timing, movement conditions and individual differences in the analysis of human movement. PLoS Computational Biology, 12(9): e1005092.

Raydan

, Svaiter

(2002). Relaxed steepest descent and cauchy-barzilai-borwein method. Computational Optimization and Applications, 155–167.

Shiraishi

, Tremmel

, Miyano

, Stephens

(2015). A simple model-based approach to inferring and visualizing cancer mutation signatures. PLoS genetics, 11(12): e1005657.

Sinkhorn

, Knopp

(1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2): 343–348.

Song

, Babu

, Palomar

(2016). Sequence set design with good correlation properties via majorization-minimization. IEEE Transactions on Signal Processing, 64(11): 2866–2879.

Tibshirani

(1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1): 267–288.

Toth

, Kelley

(2015). Convergence analysis for Anderson acceleration. SIAM Journal on Numerical Analysis, 53(2): 805–819.

Tseng

(2009). On accelerated proximal gradient methods for convex-concave optimization. 2008, http://www.math.washington.edu/~tseng/papers/apgm.pdf.

Van der Maaten

, Hinton

(2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).

Varadhan

, Roland

(2004). Squared extrapolation methods (SQUAREM): A new class of simple and efficient numerical schemes for accelerating the convergence of the EM algorithm. Department of Biostatistics Working Paper 63, Johns Hopkins University. https://biostats.bepress.com/jhubiostat/paper63/.

Varadhan

, Roland

(2008). Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scandinavian Journal of Statistics, 35(2): 335–353.

Walker

, Ni

(2011). Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis, 49(4): 1715–1735.

Yang

, Peltonen

, Kaski

(2015). Majorization-minimization for manifold embedding. In: Artificial Intelligence and Statistics, 1088–1097. PMLR.

Zhang

, O’Donoghue

, Boyd

(2020). Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4): 3170–3197.

Zhou

, Alexander

, Lange

(2011). A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing, 21(2): 261–273.

Zhu

, Stephens

(2018). Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nature Communications, 9(1): 1–14.