Accelerating Fixed-Point Algorithms in Statistics and Data Science: A State-of-the-Art Review

Fixed-point algorithms are popular in statistics and data science due to their simplicity, guaranteed convergence, and applicability to high-dimensional problems. Well-known examples include the expectation-maximization (EM) algorithm, majorization-minimization (MM), and gradient-based algorithms like gradient descent (GD) and proximal gradient descent. A characteristic weakness of these algorithms is their slow convergence. We discuss several state-of-the-art techniques for accelerating their convergence. We demonstrate and evaluate these techniques in terms of their efficiency and robustness in six distinct applications. Among the acceleration schemes, SQUAREM shows robust acceleration with a mean 18-fold speedup. DAAREM and restarted-Nesterov schemes also demonstrate consistently impressive accelerations. Thus, it is possible to accelerate the original fixed-point algorithm by using one of the SQUAREM, DAAREM, or restarted-Nesterov acceleration schemes. We describe implementation details and software packages to facilitate the application of the acceleration schemes. We also discuss strategies for selecting a particular acceleration scheme for a given problem.


Introduction
Computational problems in science and mathematics are often solved using iterative algorithms, which produce a sequence of real-valued vectors converging to the solution of interest. Examples include solving systems of linear and nonlinear equations, numerical solutions of differential equations, approximation of integrals, and minimization of multivariate functions. Parameter estimation in many practical problems in statistics and data science can be ultimately reduced to a specific optimization problem, often involving parameter constraints. To solve such optimization problems, various iterative algorithms have been developed, including the expectation-maximization (EM) algorithm (Dempster et al., 1977), the majorization-minimization (MM) algorithm (Hunter and Lange, 2004), and gradient-based methods like gradient descent (GD) and proximal gradient descent (Boyd et al., 2004). These methods are general and easy to use, and they can all be regarded as fixed-point iteration algorithms. A major appeal of these algorithms is their stability and their ability to readily handle high-dimensional problems, which is a main reason for their popularity in modern applications. However, a characteristic weakness of these algorithms is their potentially slow convergence, i.e., the vector sequence produced by the fixed-point iterative algorithm, x_{n+1} = F(x_n), may converge very slowly (if it converges) to the solution x*, severely limiting their effective use in solving real problems. Hence, it is desirable to have tools available that can accelerate the convergence of the sequence {x_n}. Please refer to the Supplementary Material for a general, theoretical discussion of the rate of convergence of fixed-point iterations.
As highlighted in Varadhan and Roland (2008), an acceleration scheme should possess certain key properties in order to be an effective and practical tool for high-dimensional optimization problems. It should accelerate the convergence of the original iterative algorithm (fast local convergence); it should converge to the solution from any reasonable starting value (robust global convergence), provided, of course, that the base algorithm itself is convergent; it should have minimal storage/memory requirements (applicability to high-dimensional problems); and it should require minimal problem-specific tuning (off-the-shelf usability). The minimal storage requirement eliminates Newton-type algorithms which make use of full second-order information. In this paper, we examine several recently developed acceleration schemes that satisfy these listed requirements. The acceleration schemes discussed include SQUAREM (Varadhan and Roland, 2008), Anderson acceleration and DAAREM (Henderson and Varadhan, 2019), Quasi-Newton (Zhou et al., 2011), Nesterov acceleration with restarts (O'Donoghue and Candes, 2015), and Parabolic-EM (Berlinet and Roland, 2009).
The paper is organized as follows. First, in the next section, we describe several well-known fixed-point iterations and discuss their theoretical convergence properties. In Section 3, we introduce and describe the acceleration methods to be studied. In Section 4, implementation details and available R packages are described. In Section 5, we test the performance of each of these acceleration methods in a range of practical problems, and in the last section, we discuss the results and give strategies for choosing an acceleration scheme for a given problem at hand.

MM Algorithm
The MM in the MM algorithm stands for "Majorization-Minimization" or "Minorization-Maximization", depending on whether the particular optimization problem is a minimization or maximization problem. The MM algorithm actually describes a family of algorithms that are implemented by creating a surrogate function that majorizes (minorizes) the objective function of interest and optimizing this surrogate function in each iteration. A key feature of MM algorithms is that the objective function will decrease (increase) in every iteration. See Hunter and Lange (2004) for a general description of MM algorithms.
A function g(x|x_k) is called a minorized version of the objective function f at x_k if it satisfies the following two conditions for all x: g(x|x_k) ≤ f(x) and g(x_k|x_k) = f(x_k).
Similarly, g(x|x_k) will be called a majorized version of f at x_k if −g(x|x_k) is a minorized version of −f. An MM maximization algorithm updates the current iterate x_k by maximizing the minorizing function g(x|x_k). If we define F to be the argmax operator for g(x|x_k), then we can express the MM iteration as x_{k+1} = argmax_x g(x|x_k) =: F(x_k). (1) The fixed-point iteration (1) generates a sequence which is monotone with respect to the objective function f; that is, we are guaranteed to have f(x_0) ≤ f(x_1) ≤ f(x_2) ≤ · · ·. This is due to the fact that f(x_{k+1}) ≥ g(x_{k+1}|x_k) ≥ g(x_k|x_k) = f(x_k), and hence, one will get a strict increase in the objective function whenever g(x_{k+1}|x_k) > g(x_k|x_k).
If x* denotes an optimal point of f, then for x_k close to x*, we have the local approximation x_{k+1} − x* ≈ dF(x*)(x_k − x*), where dF(x*) is the Jacobian of F at x*. It can be shown that dF(x*) is given by dF(x*) = I − [d²g(x*|x*)]^{−1} d²f(x*), (2) where d²f(x*) and d²g(x*|x*) denote the Hessian matrices of f(x) and g(x|x) respectively (with the derivatives in d²g(x*|x*) being taken with respect to the first argument of g(x|x)). Therefore, an MM algorithm has linear convergence with a rate related to the largest eigenvalue of the Jacobian in (2), and the value of this Jacobian depends on both the objective function and the choice of surrogate function. Globally, if the objective function f is strictly convex or concave, an MM algorithm will converge to the unique optimal point, assuming it exists. Otherwise, the MM algorithm will converge to one of the stationary points.
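To make the MM recipe concrete, consider minimizing the non-smooth objective f(x) = Σ_i |x − a_i|, whose minimizer is the median of the a_i. Majorizing each |x − a_i| at x_k by the quadratic (x − a_i)²/(2|x_k − a_i|) + |x_k − a_i|/2 (which touches |x − a_i| at x = x_k) and minimizing the surrogate gives a closed-form weighted-average update. This toy example and its data are our own illustration, not taken from the paper; a minimal sketch:

```python
# MM sketch: minimize f(x) = sum_i |x - a_i| via quadratic majorization.
# At x_k, |x - a| <= (x - a)^2 / (2|x_k - a|) + |x_k - a|/2, with equality
# at x = x_k; minimizing the surrogate yields a weighted-average update.
# The data below are made up for illustration.

def mm_median(data, x0, n_iter=100):
    x = x0
    for _ in range(n_iter):
        # majorizer weights, guarded against an iterate hitting a data point
        w = [1.0 / max(abs(x - a), 1e-15) for a in data]
        x_new = sum(wi * a for wi, a in zip(w, data)) / sum(w)
        if abs(x_new - x) < 1e-12:
            return x_new
        x = x_new
    return x

est = mm_median([1.0, 2.0, 4.0, 7.0, 9.0], x0=5.0)  # median of the data is 4
```

Each update provably decreases f, matching the monotonicity property (in the minimization direction) discussed above.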

The EM Algorithm as a Special Case of MM
EM algorithms are used to find the value of a parameter vector x which maximizes a log-likelihood function ℓ(x) = log p(Y|x) of interest, where Y denotes the observed data vector and p(·|x) is the probability distribution for the observed data that is parameterized by x. To develop an EM algorithm for maximizing ℓ(x), one introduces a vector of unobserved latent data U and a probability distribution p(Y, U|x) for (Y, U) which is also parameterized by x.
Because ℓ(x) can be decomposed as log p(Y|x) = log p(Y, U|x) − log p(U|Y, x) and log p(Y|x) does not depend on U, if we take the expectation of both sides with respect to the conditional distribution [U|Y, x_k], where x_k is the current iterate of the EM algorithm, we obtain log p(Y|x) = E[log p(Y, U|x) | Y, x_k] − E[log p(U|Y, x) | Y, x_k] =: Q(x|x_k) + H(x|x_k). (3) In (3), Q(x|x_k) is often referred to as the "Q-function", and computing it is referred to as the "E-step" of the EM algorithm. The term H(x|x_k) = −E[log p(U|Y, x) | Y, x_k] is the cross entropy of the conditional distribution [U|Y, x] relative to the conditional distribution [U|Y, x_k].
After completing the "E-step", x k+1 is found by maximizing the Q-function Q(x|x k ) with respect to x, namely, Computing x k+1 by maximizing Q(x|x k ) is usually referred to as the "M-step" of an EM algorithm.
To see why the EM algorithm is a special case of the MM algorithm, note first that it directly follows from Jensen's inequality that Q(x|x_k) + H(x_k|x_k) is a minorized version of the log-likelihood log p(Y|x). Since H(x_k|x_k) is a positive constant that does not depend on x, maximizing Q(x|x_k) + H(x_k|x_k) is equivalent to maximizing the Q-function. Hence, we can regard the EM algorithm as an MM algorithm with minorization function Q(x|x_k) + H(x_k|x_k).

Gradient Descent
Consider the following optimization problem: min_x f(x), (4) for a smooth function f that has all first-order derivatives, with ∇f(x) = (∂f(x)/∂x_1, . . . , ∂f(x)/∂x_p) denoting the gradient of f at x. Gradient descent is an iterative algorithm that updates the current iterate x_k linearly in the direction in which f decreases the fastest, namely, the negative gradient −∇f(x_k). In particular, for a given choice of step size or learning rate t_k, the gradient descent update of x_k is given by x_{k+1} = x_k − t_k ∇f(x_k). (5) Gradient descent may also be interpreted in the following way. At each step, we do not directly minimize the original function f, but instead, we minimize its first-order approximation f_k(x) around x_k, which is given by f_k(x) = f(x_k) + ∇f(x_k)^T (x − x_k) + (1/(2t_k)) ||x − x_k||², (6) where || · || is the Euclidean norm. One can directly check that the minimizer of the function f_k is equal to x_{k+1} in (5).
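As a minimal illustration of update (5), the following sketch runs fixed-step gradient descent on a simple quadratic. The objective, starting point, and step size are made-up illustrative choices, not taken from the paper.

```python
import math

# Gradient descent sketch for f(x) = (x1 - 3)^2 + 10*(x2 + 1)^2,
# whose minimizer is (3, -1). Objective and step size are illustrative.

def grad(x):
    return [2.0 * (x[0] - 3.0), 20.0 * (x[1] + 1.0)]  # gradient of f at x

def gradient_descent(x0, t=0.04, tol=1e-9, max_iter=10000):
    x = list(x0)
    for k in range(max_iter):
        # x_{k+1} = x_k - t * grad f(x_k)
        x_new = [xi - t * gi for xi, gi in zip(x, grad(x))]
        if math.dist(x, x_new) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

x_star, iters = gradient_descent([0.0, 0.0])
```

The fixed step size must be small enough relative to the curvature; here t = 0.04 keeps both coordinates contracting.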

Proximal Gradient Descent
Optimization problem (4) is not general enough to handle optimization problems that have non-smooth terms. In such cases, one might consider the following generalization of (4): min_x f(x) + h(x), (7) where, again, f is assumed to be smooth up to first order, but h is instead a non-smooth function.
As an example of (7), the objective function used in LASSO regression (Tibshirani, 1996) can be expressed as a sum of a smooth function and the non-smooth L 1 norm term.
Using the same reasoning used to obtain approximation (6), we can approximate the target f(x) + h(x) at each step by (f + h)_k(x) = (1/(2t_k)) ||x − (x_k − t_k ∇f(x_k))||² + h(x) + Const, (8) where Const is a constant term that does not depend on x. In step k, the proximal gradient descent update x_{k+1} is defined as the minimizer of the approximation (f + h)_k(x) shown in (8).
The minimizer x_{k+1} of (8) is typically expressed in terms of the proximal operator prox_h(·) of a function h, which is defined as prox_h(y) = argmin_x {(1/2) ||x − y||² + h(x)}. (9) It follows from (8) that the proximal gradient descent update can be expressed in terms of the proximal operator of the function t_k h as x_{k+1} = prox_{t_k h}(x_k − t_k ∇f(x_k)). (10) The proximal gradient descent algorithm is most useful when the proximal operator has a closed form or is, at least, very easy to compute. For example, consider the case of LASSO regression where the non-smooth component h(x) of the objective function is equal to the L_1 norm multiplied by a tuning parameter λ, i.e., h(x) = λ||x||_1. The proximal mapping of h can be expressed as prox_h(x) = S_λ(x), where the j-th component of the soft-thresholding operator S_λ(x) is given by (S_λ(x))_j = sign(x_j) max(|x_j| − λ, 0), where x_j is the j-th component of x.
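A minimal sketch of update (10) for a tiny LASSO problem, combining a gradient step on f(x) = ½||Ax − b||² with the soft-thresholding proximal operator. The data A, b and the value of λ are made up for illustration; they do not come from the paper.

```python
# Proximal gradient (ISTA) sketch for a tiny LASSO problem:
# minimize 0.5*||A x - b||^2 + lam*||x||_1. Data are illustrative.

def soft_threshold(z, thr):
    # (S_thr(z))_j = sign(z_j) * max(|z_j| - thr, 0)
    return [(1.0 if zj > 0 else -1.0) * max(abs(zj) - thr, 0.0) for zj in z]

def ista(A, b, lam, t=0.2, n_iter=2000):
    m, p = len(A), len(A[0])
    x = [0.0] * p
    for _ in range(n_iter):
        # residual A x - b and gradient A^T (A x - b)
        r = [sum(A[i][j] * x[j] for j in range(p)) - b[i] for i in range(m)]
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(p)]
        # proximal step on the gradient update
        x = soft_threshold([x[j] - t * g[j] for j in range(p)], t * lam)
    return x

x_hat = ista([[1.0, 0.0], [0.0, 2.0]], [1.0, 2.0], lam=0.1)
```

With this diagonal A the solution can be checked by hand coordinate-wise, which makes the sketch easy to verify.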
As shown in Boyd et al. (2004), both gradient descent and proximal gradient descent are globally convergent with the same convergence rate in convex problems. Assume f and h are both convex and ∇f is Lipschitz continuous with Lipschitz constant L_f > 0 (i.e., for all x, y: ||∇f(x) − ∇f(y)|| ≤ L_f ||x − y||). Then, for some constant C, the following inequality holds when a constant step size t_k = t_f < 1/L_f is used: f(x_k) + h(x_k) − [f(x_∞) + h(x_∞)] ≤ C ||x_0 − x_∞||² / k, (11) where x_∞ is the optimal point and x_0 the initial point. Therefore, to obtain an objective value which is within at least ε of the optimal value, we will need O(1/ε) proximal gradient or gradient descent iterations.

Anderson Acceleration and DAAREM
Anderson acceleration (AA), also known as Anderson mixing, was originally introduced by D.G. Anderson in 1965 to accelerate the rate of convergence of fixed-point iterations in the context of integral equations (Anderson, 1965), and this acceleration technique has turned out to be useful in a range of other applications. Recent examples include computing the nearest correlation matrix (Higham and Strabić, 2016), reinforcement learning (Geist and Scherrer, 2018), EM acceleration (Henderson and Varadhan, 2019), and electronic structure computations (Fang and Saad, 2009). The Anderson acceleration algorithm with order m applied to solving the fixed-point problem f (x) = x is shown in Algorithm 1.
Algorithm 1: Anderson acceleration and DAAREM. In the description of the algorithm, x_{k+1} = f(x_k) is the base fixed-point iteration, and m is the order of the acceleration scheme. At each iteration, the coefficients {α_j^{k+1}} are found by solving the constrained minimization problem (12), the new iterate x_{k+1} is formed from the resulting weighted combination, and m_k is restarted from 1 whenever the restart criterion is met.

In Algorithm 1, α^{k+1}_{−k} denotes the vector of length m_k containing the values α_j^{k+1} for j = k − m_k, . . . , k − 1, and β_k is the relaxation factor used in Walker and Ni (2011) and Evans et al. (2020). The non-negative scalar λ_k ≥ 0 is an optional damping factor used in Henderson and Varadhan (2019), and if λ_k = 0 for all k, then the update x_{k+1} in Algorithm 1 reduces to the more typical formulation of Anderson acceleration shown in, for example, Walker and Ni (2011). Often, the minimization problem (12) in Algorithm 1 is stated as an equivalent unconstrained minimization problem with respect to m_k unconstrained parameters rather than the m_k + 1 constrained parameters {α_j^{k+1}}. This formulation is used, for example, in Higham and Strabić (2016) and in Henderson and Varadhan (2019), where the unconstrained version of the minimization problem allows a more direct comparison with so-called multisecant quasi-Newton methods (Fang and Saad, 2009).
The convergence of Anderson acceleration for a general, nonlinear fixed-point iteration has been shown in Toth and Kelley (2015). A recent work (Evans et al., 2020) proved that Anderson acceleration can improve the convergence rate in a scenario with linear convergence but is not guaranteed to improve the convergence rate in cases of quadratic convergence.
As shown in Algorithm 1, Anderson acceleration can be modified to include restarts, where the order m_k is periodically reset to 1 and all previous memory is dropped. Different restart schemes have been proposed for Anderson acceleration. Henderson and Varadhan (2019) implemented a direct, periodic restart scheme where the algorithm restarts whenever m_k reaches the value m. Zhang et al. (2020) proposed an adaptive restart where the algorithm restarts only when it shows signs of stagnation. In many practical examples, using restarts markedly improves the performance of Anderson acceleration and can reduce the occurrence of algorithm stagnation.
In all of the numerical experiments shown in Section 5, we use the version of Anderson acceleration described in Henderson and Varadhan (2019), which they refer to as the damped Anderson acceleration with restarts and epsilon monotonicity (DAAREM) algorithm. The first component which distinguishes DAAREM from many other implementations of Anderson acceleration is the addition of the damping terms λ_k ≥ 0. This L_2 regularization term generates an update which is a compromise between a pure fixed-point update and a pure Anderson acceleration update. Having large values of λ_k in early iterations and allowing λ_k to decrease in later iterations allows the procedure to bridge the robustness of the original fixed-point iteration with the fast local convergence of Anderson acceleration. Another key component of DAAREM is the use of systematic restarts rather than adaptive restarts, which, as mentioned before, is implemented by restarting whenever the value of m_k reaches m. Finally, DAAREM includes some degree of monotonicity control where the fixed-point iteration update is used if the proposed Anderson acceleration update increases the objective function (in a minimization problem) by more than a small, pre-specified amount.
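An order-1 version of Anderson acceleration has a closed-form solution to the least-squares problem, which makes the mechanics easy to see. The sketch below omits damping, restarts, and monotonicity control, fixes β_k = 1, and uses the scalar fixed point x = cos(x) as a toy example of our own choosing.

```python
import math

# Order-1 Anderson acceleration sketch (no damping, no restarts, beta_k = 1).
# With two iterates, the constrained least-squares weight has a closed form.
# The map x = cos(x) is a toy example, not from the paper.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def anderson1(f, x0, tol=1e-10, max_iter=100):
    x_prev = list(x0)
    x = f(x_prev)
    for k in range(max_iter):
        r_prev = [a - b for a, b in zip(f(x_prev), x_prev)]  # residual at x_{k-1}
        r = [a - b for a, b in zip(f(x), x)]                 # residual at x_k
        d = [a - b for a, b in zip(r, r_prev)]
        denom = dot(d, d)
        alpha = dot(r, d) / denom if denom > 0 else 0.0      # weight on older iterate
        fx, fx_prev = f(x), f(x_prev)
        x_new = [alpha * a + (1 - alpha) * b for a, b in zip(fx_prev, fx)]
        if math.dist(x, x_new) < tol:
            return x_new, k + 1
        x_prev, x = x, x_new
    return x, max_iter

sol, iters = anderson1(lambda u: [math.cos(u[0])], [1.0])
```

On this example the scheme behaves like a secant method and converges in a handful of iterations, versus dozens for the plain iteration x ← cos(x).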

SQUAREM
SQUAREM (Varadhan and Roland, 2008) is a technique originally designed to accelerate EM algorithms, but it has also been shown to be useful in accelerating a range of other fixed-point iterations. SQUAREM has been acknowledged as a useful, general-purpose acceleration scheme by Lange and others (Zhou et al., 2011):

"Unfortunately, most acceleration techniques are ill-suited to complicated models involving large numbers of parameters. The squared iterative methods (SQUAREM), recently proposed by Varadhan and Roland, constitute a notable exception."

SQUAREM was motivated by an interesting and highly original modification of the Barzilai-Borwein type spectral gradient algorithm for optimization (Raydan and Svaiter, 2002). SQUAREM readily scales to high-dimensional settings and is very simple to implement. Hence, it has been used in numerous applications to accelerate the convergence of an underlying iterative algorithm. Examples include large-scale genome-wide enrichment analysis (Zhu and Stephens, 2018); analysis of human movement (Raket et al., 2016); non-negative matrix factorization across multiple applications (Hobolth et al., 2020); analysis of differential expression in RNA-seq data (Jin et al., 2015); inferring and visualizing cancer mutation signatures (Shiraishi et al., 2015); and signal processing techniques using MM algorithms (Song et al., 2016). Convergence of SQUAREM was proved in Varadhan and Roland (2004) under certain restrictive assumptions. In Varadhan and Roland (2008), global convergence was shown for the monotonic version of SQUAREM using the notion of a Lyapunov function, which we describe in more detail in the supplementary material. To date, there is no proof that provides insight into the improved convergence rate obtained from using SQUAREM.
Algorithm 2 describes the SQUAREM acceleration technique for finding a solution of the fixed-point problem f (x) = x.
Algorithm 2: SQUAREM. In the description of the algorithm, x_{k+1} = f(x_k) is the base fixed-point iteration, r = f(x_k) − x_k, and v = f(f(x_k)) − 2f(x_k) + x_k.

There are different versions of SQUAREM which differ only in how the step length α in step 5 of Algorithm 2 is computed. The three main choices of the step length are α = (r^T v)/(v^T v) (SqS1), α = (r^T r)/(r^T v) (SqS2), and α = −||r||/||v|| (SqS3). One can also relate SQUAREM to an order-1 Anderson acceleration update when the previous iterate has the form x_k = f(x_{k−1}). To see why this is the case, note that when λ_k = 0 and x_k = f(x_{k−1}), the solution of the minimization problem (12) in Algorithm 1 (taking β_k = 1) yields the update x_{k+1} = (1 − γ_k) f(x_k) + γ_k f(x_{k−1}), with γ_k = r_k^T (r_k − r_{k−1}) / ||r_k − r_{k−1}||² and r_j = f(x_j) − x_j. Therefore, a single SqS1-SQUAREM update of an iterate x_{k−1} is equivalent to the following procedure: define x̃_k = f(x_{k−1}) and find x_k by applying an order-1 Anderson acceleration update with β_k = α, where x_{k−1} and x̃_k = f(x_{k−1}) are considered to be the previous two iterates. Notice that α in SQUAREM does not necessarily belong to (0, 1], and typically its magnitude can be much larger than 1, indicating that SQUAREM can be viewed as an over-relaxed version of order-1 Anderson acceleration in which β_k is not restricted to the interval (0, 1].
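The core SQUAREM cycle (two base evaluations, an extrapolated point, and a stabilizing base step) can be sketched compactly. The version below uses the SqS3 step length and omits the step-length and monotonicity safeguards of the full algorithm; the fixed-point map x = cos(x) is a toy example of our own, not from the paper.

```python
import math

# SQUAREM sketch with SqS3 step length alpha = -||r|| / ||v||,
# without the safeguards of the full algorithm. Toy map: x = cos(x).

def squarem(f, x0, tol=1e-10, max_iter=200):
    x = list(x0)
    for k in range(max_iter):
        x1 = f(x)
        x2 = f(x1)
        r = [a - b for a, b in zip(x1, x)]                  # r = f(x) - x
        v = [c - 2 * a + b for c, a, b in zip(x2, x1, x)]   # v = f(f(x)) - 2 f(x) + x
        nr = math.sqrt(sum(ri * ri for ri in r))
        nv = math.sqrt(sum(vi * vi for vi in v))
        if nr < tol:
            return x, k
        alpha = -nr / nv if nv > 0 else -1.0                # SqS3 step length
        xp = [xi - 2 * alpha * ri + alpha * alpha * vi      # extrapolated point
              for xi, ri, vi in zip(x, r, v)]
        x = f(xp)                                           # stabilizing base step
    return x, max_iter

sol, iters = squarem(lambda u: [math.cos(u[0])], [1.0])
```

On this toy problem the scheme reaches the fixed point in a few cycles, each cycle costing three base-map evaluations.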

Parabolic-EM
Parabolic EM (Berlinet and Roland, 2009) is another extrapolation scheme designed to accelerate the EM algorithm. At each step, parabolic EM finds a new iterate by extrapolating along a Bézier curve M(t) controlled by the three most recent iterates x_{k−2}, x_{k−1}, x_k. Specifically, M(t) is given by M(t) = (1 − t)² x_{k−2} + 2t(1 − t) x_{k−1} + t² x_k. A direct calculation shows that M(1) = x_k, so that when the most recent iterates are obtained from the base EM iteration, t = 1 recovers the base algorithm's update. Parabolic EM applies a line search to find t by increasing t from 1 and stopping once the likelihood decreases. If no value of t in the line search is found to increase the likelihood, the algorithm restarts using the original fixed-point iteration. Parabolic EM has two subtypes, the arithmetic search and geometric search versions, which differ only in the way they perform the line search across values of t. Given a step size h > 0, arithmetic search evaluates the likelihood at M(t) for t = 1 + h, 1 + 2h, . . . until the likelihood function decreases, at which point the line search stops. Similarly, given both a step size h > 0 and an exponent a > 1, geometric search evaluates the likelihood at M(t) for t = 1 + ah, 1 + a²h, . . . and stops whenever the likelihood function decreases.
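Assuming the quadratic Bézier parameterization M(t) = (1−t)² x_{k−2} + 2t(1−t) x_{k−1} + t² x_k, the extrapolation point for a candidate t can be evaluated as below. This sketches only the curve evaluation; the likelihood-based line search is omitted, and the numeric iterates are made up.

```python
# Quadratic Bezier curve evaluation used by the parabolic extrapolation:
# M(t) = (1-t)^2 x_{k-2} + 2 t (1-t) x_{k-1} + t^2 x_k (assumed form).

def bezier_point(x_km2, x_km1, x_k, t):
    return [(1.0 - t) ** 2 * a + 2.0 * t * (1.0 - t) * b + t ** 2 * c
            for a, b, c in zip(x_km2, x_km1, x_k)]

# t = 1 recovers the newest iterate; t > 1 extrapolates beyond it.
m1 = bezier_point([0.0], [1.0], [1.5], 1.0)
m2 = bezier_point([0.0], [1.0], [1.5], 2.0)
```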
Algorithm 3: Parabolic EM. In the description of the algorithm, f is the base fixed-point (EM) iteration, h > 0 is the line-search step size, and a > 1 is the exponent used by the geometric search.

Note that parabolic EM can also be applied to a general fixed-point iteration as long as the fixed-point iteration has an associated loss function to minimize. In that case, one could directly implement Algorithm 3 by replacing the likelihood evaluations in Algorithm 3 with evaluations of the negative of the loss function of interest.

Quasi-Newton

Zhou et al. (2011) proposed a Quasi-Newton method that can be applied to accelerating fixed-point iterations. Consider a map f : X ⊂ R^d → R^d for which we want to find a fixed point x such that f(x) = x. This is equivalent to finding the root of the function g(x) = x − f(x). If f is assumed to be differentiable with Jacobian df, then Newton's method for finding the root of g(x) yields the following iteration: x_{k+1} = x_k − [I − df(x_k)]^{−1} (x_k − f(x_k)). (13)

The goal in a quasi-Newton approach is to use an approximation of df(x_k) in iteration (13) rather than the true df(x_k). The secant method is a well-known root-finding algorithm for a function with scalar inputs that can also be thought of as a quasi-Newton algorithm. The Quasi-Newton method proposed by Zhou et al. (2011) is based on the linear approximation f(x_k) ≈ f(x_{k−1}) + df(x_{k−1})(x_k − x_{k−1}). Writing u_k = x_k − x_{k−1} and v_k = f(x_k) − f(x_{k−1}), the secant requirement for a proposed approximate Jacobian M_k at iteration k is that M_k u_k = v_k. For an improved approximation M_k, Zhou et al. (2011) further require that the q secant conditions M_k u_{k−j} = v_{k−j}, j = 0, 1, . . . , q − 1, hold. Collecting these differences into matrices U = (u_k, . . . , u_{k−q+1}) and V = (v_k, . . . , v_{k−q+1}), the matrix M_k with the smallest Frobenius norm among all matrices satisfying these q secant conditions is given by M_k = V (U^T U)^{−1} U^T. Using this approximate Jacobian in iteration (13) leads to the order-q Quasi-Newton scheme described in Algorithm 4.
Algorithm 4: Quasi-Newton acceleration. In the description of the algorithm, x_k is a vector of length p and q is the order of the acceleration scheme. The algorithm maintains matrices U and V of the q most recent secant differences, dropping their leftmost (oldest) columns and appending new difference columns (e.g., β_2 − β_1) at each iteration before checking for convergence.
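In one dimension with q = 1, the scheme reduces to the classical secant method for the root of g(x) = x − f(x). A minimal sketch of that scalar special case, with the fixed-point map x = cos(x) as a toy example of our own:

```python
import math

# Secant iteration for g(x) = x - f(x): the scalar (q = 1) special case
# of the quasi-Newton fixed-point scheme. f is the base fixed-point map.

def secant_fixed_point(f, x0, x1, tol=1e-12, max_iter=100):
    def g(x):
        return x - f(x)
    for k in range(max_iter):
        g0, g1 = g(x0), g(x1)
        if abs(g1) < tol or g1 == g0:
            return x1, k
        # secant update: approximate g' by the latest difference quotient
        x0, x1 = x1, x1 - g1 * (x1 - x0) / (g1 - g0)
    return x1, max_iter

root, iters = secant_fixed_point(math.cos, 0.5, 1.0)
```

The higher-order scheme plays the same game in R^p, replacing the scalar difference quotient with the minimum-Frobenius-norm multisecant Jacobian approximation.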

Restarted Nesterov
Nesterov accelerated gradient descent (Nesterov, 2013; Tseng, 2009) is a popular technique for accelerating first-order optimization methods. Using the same notation as in Section 2.2.2, Algorithm 5 outlines Nesterov acceleration applied to the composite optimization problem (7).
Algorithm 5: Nesterov accelerated proximal gradient descent. In the description of the algorithm, the objective function to be minimized is the composite objective f(x) + h(x) of problem (7). It can be proved that Algorithm 5 has an error rate of O(1/k²), where k is the iteration number. This is a substantial improvement over the O(1/k) error rate shown in Equation (11) for the fixed-step-length proximal gradient descent algorithm. Readers can consult Tseng (2009) for a proof of this result.
Unlike most implementations of gradient descent, Algorithm 5 does not guarantee or check for monotonicity of the objective function. In practice, tracing the objective value while running Algorithm 5 often reveals ripples or bumps in the objective function across iterations, which reduces the efficiency of the algorithm. To address this, O'Donoghue and Candes (2015) introduced a heuristic adaptive restart technique for Nesterov acceleration that can dramatically improve the convergence rate. The basic idea is to reset θ_k to 1 whenever the objective function increases, i.e., whenever f(x_k) + h(x_k) > f(x_{k−1}) + h(x_{k−1}). Setting θ_k = 1 reduces the momentum term α_{k+1}(x_{k+1} − x_k) to 0, and the acceleration algorithm degenerates to ordinary proximal gradient descent in the following step.
Notice that in Algorithm 5, the momentum coefficients α_k do not depend on the proximal gradient descent updates in any way. Therefore, it is possible to simply replace the step x_{k+1} = prox_{t h}(y_k − t∇f(y_k)) with any fixed-point iteration x_{k+1} = F(y_k) and obtain a Nesterov-like acceleration method for a general fixed-point iteration problem. In Section 5, we show that this strategy can accelerate convergence in problems where gradient descent is not the base fixed-point iteration. To the best of our knowledge, this is the first work to examine Nesterov acceleration for general fixed-point iteration problems.
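The following sketch wraps a generic fixed-point map F in Nesterov momentum with the adaptive restart rule just described. The quadratic objective, its gradient-descent map, and the simple objective-based stopping rule are made-up illustrative choices, not from the paper.

```python
import math

# Nesterov momentum around a generic fixed-point map F, with adaptive
# restart (momentum reset whenever the objective increases).
# Toy problem: minimize 0.5*(x1^2 + 100*x2^2) via its gradient-descent map.

def nesterov_restart(F, obj, x0, tol=1e-16, max_iter=20000):
    x = list(x0)
    y = list(x0)
    theta = 1.0
    f_prev = obj(x)
    for k in range(max_iter):
        x_new = F(y)                                      # base fixed-point step
        theta_new = (1.0 + math.sqrt(1.0 + 4.0 * theta * theta)) / 2.0
        beta = (theta - 1.0) / theta_new                  # momentum coefficient
        f_new = obj(x_new)
        if f_new > f_prev:                                # restart: kill the momentum
            theta_new, beta = 1.0, 0.0
        y = [a + beta * (a - b) for a, b in zip(x_new, x)]  # extrapolated point
        if f_new < tol:                                   # illustrative stopping rule
            return x_new, k + 1
        x, theta, f_prev = x_new, theta_new, f_new
    return x, max_iter

obj = lambda x: 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)
F = lambda y: [y[0] - 0.009 * y[0], y[1] - 0.9 * y[1]]    # gradient step, t = 0.009
sol, iters = nesterov_restart(F, obj, [1.0, 1.0])
```

Because the momentum coefficients never inspect F, the same wrapper applies to EM or MM maps, provided a loss is available for the restart test.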

Implementations
R packages are available to facilitate application of the acceleration schemes described above. For example, the squarem package (Du and Varadhan, 2020) implements the SQUAREM algorithm, and the daarem package (Henderson and Varadhan, 2020) implements the DAAREM algorithm described in Section 3.1. The turboEM package (Bobb and Varadhan, 2021) provides a unified API for SQUAREM, Parabolic-EM, and Quasi-Newton acceleration. Currently, turboEM does not implement DAAREM, although this should be available in the near future. In turboEM, when a proposed update increases the objective value (in a minimization problem) by more than 0.1, the update is replaced with one iteration of the base fixed-point iteration. This monotonicity control can dramatically increase the stability of the Quasi-Newton method and, in many cases, can improve the ultimate convergence speed. Monotonicity control is also a default in the implementations of DAAREM, SQUAREM, and Parabolic-EM. The use of restarts in Nesterov acceleration plays a similar role to monotonicity control, as the algorithm is restarted whenever a monotonicity violation occurs. It is worth mentioning that the implementation of Quasi-Newton in turboEM includes the option of monotonicity control even though monotonicity control was not part of the original proposal in Zhou et al. (2011).
Our package AccelBenchmark, available on GitHub, can be used to easily benchmark all of the methods described in this paper; we used it for all of the experiments reported here.
There is an R package called FixedPoint which contains various acceleration methods for fixed-point problems, including Anderson acceleration and several vector extrapolation algorithms. A fundamental difference between that package and the packages above is that FixedPoint does not contain safeguards such as step-length and monotonicity controls, damping, and restarts. Consequently, its algorithms are less reliable for general-purpose use.

Settings for the Experiments
In all of the experiments in this paper, we have used the default control parameters implemented in each of the acceleration packages (turboEM and daarem). We did not optimize the algorithmic settings for each problem. This is an important point because we would like to explore the performance of these methods when they are used directly off-the-shelf. Also, unless otherwise stated, convergence is defined as the first iteration where the norm of the parameter difference ||x_{k+1} − x_k|| is less than 10^{−7}. We evaluated the performance of each acceleration algorithm in terms of the number of fixed-point iterations (fpevals) and the elapsed time in seconds (elapsed). We report the mean ± standard deviation of fpevals and elapsed across a certain number of simulated experiments. Some of the performance metrics have heavy-tailed distributions, making the standard deviation larger than the mean, which is a sign of instability of that algorithm. We also report the number of failures (# failures), where failure is defined as not achieving convergence within an allotted number of iterations, which varied for each problem. We also plotted the convergence trajectories of the algorithms in terms of the objective function (e.g., log-likelihood). The panels of Figure 1 in the Supplementary Material display the objective function values versus fixed-point iteration for each of the five main acceleration methods. The loss values in the figures are normalized by subtracting the minimum value of the objective function. For every experiment, we only plot results for the most difficult setting (e.g., ν = 25 in the multivariate t distribution).

Multivariate t-Distribution
A d-dimensional Student-t distribution T_ν(μ, Σ) with ν > 0 degrees of freedom, location parameter μ ∈ R^d, and positive definite scatter matrix Σ has the density function

f(x) = Γ((d + ν)/2) / [Γ(ν/2) (νπ)^{d/2} |Σ|^{1/2}] · [1 + (x − μ)^T Σ^{−1} (x − μ)/ν]^{−(d+ν)/2},

where Γ(s) denotes the Gamma function Γ(s) = ∫_0^∞ t^{s−1} e^{−t} dt. For observations x_1, . . . , x_n arising from a d-dimensional Student-t distribution, setting the derivatives of the associated weighted log-likelihood function, with user-specified weights ω_1, . . . , ω_n, to zero results in a system of equations for μ, Σ, and ν, labeled (14)-(16), where the weights ω_i are assumed to satisfy ω_i ≥ 0 and Σ_{i=1}^n ω_i = 1.

Algorithm 6: EM for estimating multivariate t-distribution parameters.
Here, we are interested in maximizing the weighted log-likelihood function under the assumption that all parameters, including the degrees of freedom ν, are unknown. Hasannasab et al. (2021) show that, under certain conditions, either a minimizer of the negative weighted log-likelihood exists, or the maximum likelihood estimator corresponds to the case ν → ∞, for which the Student-t distribution approaches the Gaussian distribution.
Because one can represent a T_ν(μ, Σ) random variable as X = μ + Σ^{1/2} Z / √Y, where Z ∼ N(0, I) and Y ∼ Γ(ν/2, ν/2), one can develop an EM algorithm for parameter estimation by augmenting the observed data x_1, . . . , x_n with latent data y_1, . . . , y_n, where it is assumed that y_i ∼ Γ(ν/2, ν/2) independently, and using the fact that the conditional distribution of x_i given y_i is multivariate normal with mean vector μ and covariance matrix Σ/y_i. This data augmentation leads to the conventional EM algorithm (Liu and Rubin, 1995) for solving equations (14)-(16). The conventional EM algorithm for the multivariate t-distribution is described in Algorithm 6. Recently, Hasannasab et al. (2021) suggested a number of alternative algorithms for this problem that accelerate the naive EM algorithm. Among them, the Multivariate Myriad Filter (MMF) algorithm shows excellent overall performance. The MMF algorithm is the same as the EM algorithm (Algorithm 6) except that the updates for Σ_k and ν_k are modified (see Hasannasab et al., 2021, for the exact forms). We conducted a simulation study to determine whether or not our black-box acceleration schemes can further accelerate the fast MMF algorithm. In this simulation study, we set μ = (0, 0) and Σ = diag{0.1, 1}, and we considered three choices for the degrees of freedom: ν ∈ {3, 10, 25}. For each choice of ν, we simulated 1000 observations from the corresponding t-distribution and ran the MMF algorithm together with all acceleration methods to estimate the parameters μ, Σ, ν, and this procedure was repeated across 200 simulation runs. In each run, we initialized μ_0 to be the sample mean and Σ_0 to be the sample covariance matrix, and we initialized the degrees of freedom ν_0 by sampling with equal probability from the set {2, 3, 4}. The hyperparameters of the acceleration algorithms and the convergence criteria were set to their defaults; these default settings are discussed in detail in Section 4.
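To illustrate the E-step weights and M-step updates behind Algorithm 6, here is a one-dimensional sketch with the degrees of freedom ν held fixed (the full algorithm also updates ν). The data are made up, and the updates are the standard EM updates for the t location-scale model, not the MMF variant.

```python
import math

# One-dimensional EM sketch for the t location-scale model with nu fixed.
# E-step: w_i = (nu + 1) / (nu + (x_i - mu)^2 / s2)   (latent precision weights)
# M-step: mu = sum(w_i x_i) / sum(w_i),  s2 = (1/n) sum(w_i (x_i - mu)^2)

def t_loglik(data, mu, s2, nu):
    c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
         - 0.5 * math.log(nu * math.pi * s2))
    return sum(c - (nu + 1) / 2 * math.log(1 + (x - mu) ** 2 / (nu * s2))
               for x in data)

def em_t(data, nu=3.0, n_iter=50):
    n = len(data)
    mu = sum(data) / n
    s2 = sum((x - mu) ** 2 for x in data) / n
    lls = []
    for _ in range(n_iter):
        w = [(nu + 1) / (nu + (x - mu) ** 2 / s2) for x in data]  # E-step
        mu = sum(wi * x for wi, x in zip(w, data)) / sum(w)       # M-step
        s2 = sum(wi * (x - mu) ** 2 for wi, x in zip(w, data)) / n
        lls.append(t_loglik(data, mu, s2, nu))
    return mu, s2, lls

mu_hat, s2_hat, lls = em_t([-2.0, -1.0, 0.0, 0.5, 1.0, 8.0])
```

The log-likelihood trace is non-decreasing, the defining EM property, and the outlier at 8.0 is downweighted so the location estimate stays near the bulk of the data.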
Results from this simulation study can be found in Table 1. From this table, we observe that SQUAREM, DAAREM and parabolic-EM (pEM) provide consistent acceleration when compared to the original MMF algorithm, and the factor of speedup from these methods increases as ν increases. Quasi-Newton and Nesterov acceleration also accelerated the MMF algorithm in some simulation settings, but the improvement over MMF was not as consistent as SQUAREM, DAAREM, and pEM.
It is interesting to note that the MMF algorithm already gives a huge speed advantage when compared to the original EM algorithm. For example, in the ν = 25 case, the EM algorithm required, on average, 1235 fixed-point iterations before convergence, which is 9 times more than that of the MMF algorithm. Despite the fact that MMF is much faster than EM, we can still further accelerate MMF using the methods described in this manuscript.

Poisson Mixtures
A finite mixture of Poisson distributions with C components has the discrete probability distribution

f(y) = Σ_{c=1}^C p_c f_P(y | λ_c), (18)

where f_P(y | λ_c) = e^{−λ_c} λ_c^y / y! denotes the probability distribution of y conditional on belonging to the c-th cluster of the mixture. For observations y_1, ..., y_n, we can develop an EM algorithm for estimating the parameters in (18) by introducing latent variables z_1, ..., z_n defined as z_i = c if y_i belongs to the c-th cluster. With responsibilities w_ic = p_c f_P(y_i | λ_c) / Σ_{c'} p_{c'} f_P(y_i | λ_{c'}), this data augmentation scheme generates the following EM updates for the parameters of interest (p_1, ..., p_C) and (λ_1, ..., λ_C): p_c ← (1/n) Σ_i w_ic and λ_c ← Σ_i w_ic y_i / Σ_i w_ic.

We explored our acceleration methods using the real count data from Hasselblad (1966), which contains 1096 observations, each representing a day of survival. In this dataset, the observations range from 0 to 9, with frequencies 162, 267, 271, 185, 111, 61, 27, 8, 3, and 1, respectively; no censoring is present. Using this dataset, we fit a finite mixture of Poisson distributions with 2 components, so the parameters of interest are the mixture probability p_1 and the cluster-specific Poisson rates λ_1 and λ_2.
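The E- and M-steps for the Poisson mixture can be sketched in a few lines of Python (a minimal sketch computed on the log scale for stability; the function name is ours, not the paper's implementation):

```python
import numpy as np
from scipy.special import gammaln

def poisson_mixture_em_step(y, p, lam):
    """One EM iteration for a C-component Poisson mixture.

    E-step: responsibilities w[i, c] proportional to p_c * f_P(y_i | lam_c),
    computed on the log scale for numerical stability.
    M-step: p_c is the mean responsibility; lam_c is the
    responsibility-weighted mean of y.
    """
    log_pmf = (y[:, None] * np.log(lam)[None, :]
               - lam[None, :] - gammaln(y + 1)[:, None])
    log_w = np.log(p)[None, :] + log_pmf
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)      # normalize responsibilities
    p_new = w.mean(axis=0)
    lam_new = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)
    return p_new, lam_new
```

Stacking the parameters (p, λ) into a single vector turns this update into the fixed-point mapping F accelerated in the experiment below.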
To study the performance of each acceleration procedure, we ran the original EM algorithm and all acceleration schemes 500 times. In each run, the mixture probability p_1 was drawn from a uniform distribution over (0, 1), and the Poisson rates λ_c were independently drawn from a uniform distribution over (0, 4). In the Quasi-Newton algorithm, the order was set to 2, since there are only 3 parameters in this problem. The maximum number of fixed-point iteration evaluations was set to 3000. Other hyperparameters of the acceleration algorithms and the convergence criteria were set to their defaults, which are discussed in more detail in Section 4. Results can be found in Table 2. In this experiment, all of the methods listed in Table 2 dramatically accelerated the original EM algorithm, with up to an 11-fold reduction in execution time.

LASSO
The least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) is a widely used technique for high-dimensional inference due to its ability to perform simultaneous feature selection and coefficient estimation. However, the L1 penalty in the LASSO objective function makes it impossible to obtain a general, closed-form solution for the regression coefficient estimates, so iterative algorithms are needed for optimization. A simple but effective iterative algorithm is proximal gradient descent, which uses the iteration (9) described in Section 2.2.2.
In this experiment, we use the Madelon data (Guyon et al., 2004) to study the performance of different acceleration methods applied to the proximal gradient descent algorithm. The Madelon data are artificially constructed to illustrate a particular difficulty for feature selection. They contain n = 2600 binary outcomes y_1, ..., y_n, and for each y_i we have a predictor vector x_i of length p = 500. We use the logistic regression version of the LASSO, where the objective function to be minimized is

ℓ(β) = −Σ_{i=1}^n [y_i x_i^T β − log{1 + exp(x_i^T β)}] + h(β),

where h(β) = λ Σ_{j=1}^p |β_j| and λ is a parameter that controls the regularization level. Following (9)-(10) yields the proximal gradient descent iteration

β_{k+1} = S_{tλ}(β_k − t X^T {1_n − y − μ(Xβ_k)}), (19)

where S_{tλ} denotes element-wise soft-thresholding at level tλ, t is the step size, X is the n × p matrix whose i-th row is x_i^T, and μ(Xβ_k) is the n-dimensional vector whose i-th element is 1/{1 + exp(x_i^T β_k)}. In each simulation run, we initialized each coefficient β_j independently and uniformly within (−1, 1), and we then applied the proximal gradient descent algorithm (19) and each of its accelerated versions. The tuning parameter λ was set to one of three values, λ ∈ {0.1, 1, 10}. The maximum number of fixed-point iterations for each method was set to 20,000. Other hyperparameters of the acceleration methods and the convergence criteria were set to their defaults. For each choice of λ, we evaluated all of the methods using 200 independent simulation runs.
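The proximal gradient iteration for L1-penalized logistic regression — a gradient step on the smooth logistic loss followed by soft-thresholding — can be sketched as follows. This is a minimal illustration under the y ∈ {0, 1} coding with the usual sigmoid convention; the function names, the averaged gradient, and the fixed step size are ours (a practical implementation would add a line search):

```python
import numpy as np

def soft_threshold(z, t):
    """Prox operator of the L1 penalty: shrink each coordinate toward 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_grad_logistic_lasso(X, y, lam, step, n_iter=500):
    """Proximal gradient descent for L1-penalized logistic regression.

    Each iteration takes a gradient step on the (averaged) logistic loss
    and then applies soft-thresholding at level step * lam.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ beta))   # P(y_i = 1 | x_i, beta)
        grad = X.T @ (prob - y) / len(y)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```

The map from β_k to β_{k+1} inside the loop is the fixed-point function that the acceleration schemes wrap in this experiment.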
Results from this simulation study can be found in Table 3. On average, all of the methods provided consistent acceleration of the original proximal gradient descent algorithm. Among the different acceleration methods, DAAREM consistently gave the greatest acceleration, with a more than 20-fold improvement in execution time across the different choices of λ. The failures of proximal gradient descent for smaller values of λ are due to slow convergence, i.e., it did not converge within the maximum number of iterations.

Table 2: Simulation results for estimating Poisson mixture parameters from 500 independent runs. Elapsed times are reported in milliseconds. If an algorithm failed to converge, or if it converged to a negative log-likelihood more than 1% larger than that of the original EM algorithm, we called that run a failure. As a measure of robustness, we also recorded the number of failures for each acceleration method.

Table 3: Simulation results for estimating regression coefficients using LASSO logistic regression with 200 independent runs. pGD represents the original proximal gradient descent algorithm, and the other columns represent different acceleration methods. If an algorithm failed to converge, or if it converged to a loss more than 1% larger than the optimal loss, we considered the run a failure. As a measure of robustness, we also recorded the number of failures for each acceleration method.

Variational Inference in Bayesian Variable Selection
Bayesian variable selection methods for a regression model with binary outcomes (e.g., George and McCulloch (1997); Chipman et al. (2001)) often consider a spike-and-slab model in which each inclusion indicator satisfies γ_j ∼ Bern(π), β_j | γ_j = 1 ∼ N(0, σ_β²), and β_j = 0 whenever γ_j = 0, where Bern(p) denotes the Bernoulli distribution with success probability p and p_θ(·) denotes the prior distribution for the vector of hyperparameters θ = (π, σ_β²). The values of the variables γ_j thus drive the model selection. For simplicity, in this section we set the intercept β_0 = 1 as a known constant.
In Bayesian variable selection, calculation of the marginal posterior inclusion probabilities is a primary interest. The marginal posterior inclusion probability for variable j, defined as PIP(j) = p(γ_j = 1 | X, y), can be expressed as

PIP(j) = ∫ p(γ_j = 1 | X, y, θ) p(θ | X, y) dθ, (20)

where X is the n × p design matrix whose (i, j) element is X_ij and y = (y_1, ..., y_n). The quantity p(γ_j = 1 | X, y, θ) does not have a closed form and is often calculated using Markov chain Monte Carlo (MCMC) methods; see, for example, Bottolo et al. (2010) and Clyde et al. (2011). However, in high-dimensional applications, MCMC can be very computationally inefficient and often requires days or even weeks to run. To address this, Carbonetto et al. (2012) propose using variational inference to approximate p(γ_j = 1 | X, y, θ) and then approximating the integral in (20) with importance sampling.
To approximate p(γ_j = 1 | X, y, θ), one first approximates the posterior density p(β, γ | X, y, θ) with a density function of the form q(β, γ | α, μ, s²) = ∏_{j=1}^p q(β_j, γ_j | α_j, μ_j, s_j²), where each component of this product is a mixture of a normal density and a point mass at 0. Specifically,

q(β_j, γ_j | α_j, μ_j, s_j²) = α_j N(β_j | μ_j, s_j²) if γ_j = 1, and (1 − α_j) δ_0(β_j) if γ_j = 0, (21)

where N(· | μ, σ²) stands for a normal density with mean μ and variance σ², and δ_0(·) for a point mass centered at 0. To find the best set of variational parameters, Carbonetto et al. (2012) derive an EM-type algorithm that maximizes the evidence lower bound (ELBO), whose coordinate updates involve (X^T X)_{hj}, the (h, j) element of the matrix X^T X, and (X^T y)_j, the j-th element of the vector X^T y. Also, Û and ŷ are defined as Û = diag{u} − u u^T / ū and ŷ = y − 1/2 − u, where u = (u_1, ..., u_n) and ū = Σ_{i=1}^n u_i. The terms u_i = {ψ(η_i) − 1/2}/η_i are updated from the current variational parameters, where ψ(x) = 1/(1 + e^{−x}). From the variational approximation (21), we have E[β_j] = α_j μ_j and Var[β_j] = α_j (s_j² + μ_j²) − (α_j μ_j)².

In this experiment, we examined whether the acceleration schemes described above can effectively accelerate this coordinate ascent algorithm for Bayesian variable selection in the case of logistic regression. For this simulation study, we use a setting similar to that described in Carbonetto et al. (2012). Specifically, we assume that logit{p(y_i = 1 | β, X)} = −1 − Z_i1 + Z_i2 + Σ_{j=1}^p X_ij β_j, where Z_i1 and Z_i2 are independent standard normal random variables. In these simulations, β = (β_1, ..., β_p) has length 2000, and only the first m components of β are nonzero. The nonzero components of β were sampled independently from a N(0, 0.25) distribution, and the remaining components were set to 0. The elements X_ij of X were drawn independently from a Binomial(2, p_ij) distribution, where p_ij is drawn from a uniform distribution over (0.05, 0.5). Values of m = 100 and m = 200 were considered, and we performed 200 simulation runs for each setting of m.
In each run, we set the sample size and the number of covariates to n = 200 and p = 2000, respectively. We initialized the α_j by drawing independently from a uniform distribution over (0, 1); the μ_j were initialized by drawing from a normal distribution with mean 0 and variance 0.1; and the s_j do not need to be initialized, as they are updated first using equation (22). The model hyperparameters (π, σ_β²) were set to π = m/2000 and σ_β = 0.5. The maximum number of fixed-point iterations was set to 1000. The hyperparameters of the acceleration algorithms and the convergence criteria were set to their default values (see Section 4). Results from this simulation study are shown in Table 4. We do not observe a huge acceleration in this experiment, but SQUAREM still provides a consistent speedup, with a roughly 4-fold improvement for both settings of m. DAAREM, on the other hand, failed frequently, partly due to its convergence to solutions with a loss slightly higher than the optimal value.

Sinkhorn Scaling
Given a matrix A, the problem of re-scaling its rows and columns to form a doubly stochastic matrix P = DAE, where D and E are diagonal matrices, is called a matrix balancing problem. A more constrained version of the matrix balancing problem is to find diagonal scaling matrices D and E such that P = DAE has specified row and column sums, that is, P1 = a and P^T 1 = b, where 1 is a vector whose entries are all equal to 1. This problem has a vast array of applications including, for example, ranking web pages (Knight, 2008), learning permutation matrices from data (Mena et al., 2018), and solving optimal transport problems (Altschuler et al., 2017).
A naive algorithm for the constrained matrix balancing problem is the Sinkhorn-Knopp algorithm (Sinkhorn and Knopp, 1967), which simply scales the matrix iteratively by rows and columns. Given an initialization u_0, v_0, the algorithm finds the next updates via

u_{k+1} = a / (A v_k); v_{k+1} = b / (A^T u_{k+1}), (23)

where the division of vectors in (23) is done element-wise and A is the matrix with entries A_ij. In this numerical experiment, we tested the performance of the acceleration methods on the Sinkhorn-Knopp algorithm for a constrained matrix balancing problem using certain ill-conditioned matrices known as Marshall-Olkin and Hessenberg matrices (Parlett and Landis, 1982). The Marshall-Olkin matrix M_3 is the 3 × 3 matrix with columns (100, 100, 0)^T, (100, 10000, 1)^T, and (0, 1, 100)^T. We choose the order-n Hessenberg matrix H_n to be the n × n matrix such that H_n(i, i) = 100 for all i and H_n(i, j) = 1 for all (i, j) such that j > i − 1 and j ≠ i. All other elements of H_n are equal to 0. For simplicity, a and b in iteration (23) were set to vectors of all 1s. We ran this experiment 200 independent times. In each run, we used Hessenberg matrices of order 10 and 50. We initialized the parameter v with i.i.d. draws from a uniform distribution over [0.5, 2]. Notice that u can be calculated from iteration (23), so we do not treat it as part of the parameter vector in the acceleration algorithms. The maximum number of fixed-point iterations was set to 50,000. We measured the performance of the scaling algorithm by calculating the mean absolute differences (MAD) between the row/column sums and 1. If the two MAD values were, at any time, both smaller than 10^{−10}, we terminated the algorithm and regarded it as converged. Other hyperparameters of the acceleration algorithms were set to their defaults (see Section 4).
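Iteration (23) is only a few lines of code. The sketch below (function name ours) alternates the two element-wise divisions and stops when the mean absolute deviations of both the row and column sums fall below a tolerance, mirroring the convergence criterion used in this experiment:

```python
import numpy as np

def sinkhorn_knopp(A, a, b, tol=1e-10, max_iter=50000):
    """Alternately rescale rows and columns of A so that
    P = diag(u) A diag(v) has row sums a and column sums b."""
    u = np.ones(A.shape[0])
    v = np.ones(A.shape[1])
    P = A
    for _ in range(max_iter):
        u = a / (A @ v)            # fix the row sums
        v = b / (A.T @ u)          # then fix the column sums
        P = u[:, None] * A * v[None, :]
        if (np.abs(P.sum(axis=1) - a).mean() < tol and
                np.abs(P.sum(axis=0) - b).mean() < tol):
            break
    return u, v, P
```

For strictly positive A this converges quickly; the ill-conditioned Marshall-Olkin and Hessenberg matrices used in the experiment are precisely the cases where the plain iteration slows down and acceleration pays off.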
Results from this numerical experiment can be found in Table 5. The results in Table 5 show that we gain substantial and consistent acceleration by using either Nesterov with restarts or SQUAREM.
It is also possible to use a different approach to matrix scaling based on a different fixed-point iteration (see Supplementary Section 3), which uses the intermediate scaled matrix diag{u_k} A diag{v_k} as the parameter vector rather than (u, v). The acceleration schemes perform much better with this approach. Although the final output from this method is not guaranteed to be feasible, we confirmed that the relative difference is small. For example, in the H_50 case, the DAAREM algorithm, which originally failed, converged within a few hundred iterations. It also gave a reasonably accurate answer, with the discrepancy from the true result being smaller than 10^{−5}.

Manifold Embedding
Manifold learning is a useful approach for performing both dimension reduction and data visualization of high-dimensional data. However, many manifold learning approaches can be affected by exploding distances in high-dimensional space, which can make points with moderate distances in the original space become too crowded in the embedded low-dimensional space. This phenomenon is often called the "crowding problem" (Cook et al., 2007). To address this problem, Van der Maaten and Hinton (2008) extended the Stochastic Neighborhood Embedding (SNE) method by using a t-distributed neighborhood probability for points in the embedded space. The corresponding method, called t-SNE, achieved great success in visualizing handwritten digit data and a variety of other high-dimensional image and text data.
The main input into the t-SNE procedure is a collection of vectors x_1, ..., x_n from which one can compute nonnegative similarity scores P_ij between pairs x_i and x_j, normalized so that Σ_ij P_ij = 1. Given the matrix P containing the values P_ij, t-SNE seeks an embedding matrix Y = (y_1, y_2, ..., y_n)^T, with each y_k in a lower-dimensional space, by minimizing the following KL divergence-based loss function with respect to Y:

L(Y) = Σ_{i ≠ j} P_ij log(P_ij / Q_ij), (24)

where the Q_ij, defined in (25), are the normalized t-distributed similarities between y_i and y_j in the embedded space. Yang et al. (2015) derived an alternative algorithm for solving a range of manifold embedding problems and claimed it can be more efficient than many existing algorithms. The algorithm proposed by Yang et al. (2015) is essentially an MM algorithm that iteratively majorizes the complex loss function of interest by a specially designed quadratic form and then minimizes it in closed form. In the context of t-SNE, where we want to minimize (24), this algorithm results in an updating scheme (26) for a current embedding matrix Y_k, where q is the matrix with elements q_ij = (1 + ||y_i − y_j||²)^{−1}, the operation ∘ denotes the Hadamard product, Q is the matrix whose elements are defined in (25), and ρ is a positive tuning parameter.

In this experiment, we used the COIL20 data to study the MM algorithm (26) and its acceleration using the methods described in Section 3. The COIL20 dataset has 1440 images of 20 objects at a resolution of 128 × 128. Each object has 72 images, taken by capturing an image every 5 degrees along a 360-degree viewing circle. To evaluate the different acceleration methods, we ran each method 50 times using the MM algorithm (26) as the base iteration. In each of the 50 runs, we initialized the elements of the embedding matrix Y by sampling from a normal distribution with mean zero and standard deviation 10^{−3}. The value of ρ in iteration (26) was found by using an initial value of 10^{−5} in each iteration and applying a backtracking search to maintain monotonicity.
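For reference, the KL divergence-based t-SNE loss can be evaluated directly from P and a candidate embedding Y. Below is a minimal sketch (function name ours) using the t-kernel q_ij = (1 + ||y_i − y_j||²)^{−1} and normalizing over the off-diagonal entries; monitoring this quantity is how convergence is assessed in the experiment below:

```python
import numpy as np

def tsne_loss(P, Y):
    """KL divergence loss for embedding Y, given the high-dimensional
    similarity matrix P (nonnegative off-diagonal entries summing to 1)."""
    # Pairwise squared distances in the embedded space
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + d2)              # unnormalized t-kernel q_ij
    np.fill_diagonal(q, 0.0)          # self-similarities are excluded
    Q = q / q.sum()                   # normalized low-dimensional similarities
    mask = P > 0                      # 0 * log 0 terms contribute nothing
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

Because the embedding coordinates themselves are not identified (any rotation of Y gives the same loss), this scalar loss, rather than the parameter values, is the natural convergence monitor.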
Since the parameter values are not identified in embedding problems, we monitored convergence using values of the objective function. Specifically, an algorithm was terminated once 1000 fixed-point iterations had been reached or when the relative change in the objective function was less than 10^{−4}. All other hyperparameters were set to their default values (see Section 4).
Results from this numerical experiment can be found in Table 6. We can see that SQUAREM, DAAREM, parabolic-EM, and Quasi-Newton all achieved a substantially better objective value upon convergence when compared to the original MM algorithm. Moreover, DAAREM achieved this improved objective value with an even shorter computation time than the MM algorithm. Figure 1 visualizes and compares the embedding results between MM and SQUAREM, indicating that better values of the objective function achieved by SQUAREM do result in a better quality embedding.

Discussion
In Section 5, we tested all the acceleration schemes described in Section 3 on six different practical problems. Although no single method is guaranteed to always work, at least one of SQUAREM and DAAREM effectively accelerated the original algorithm in every setting of the six examples. Moreover, SQUAREM was particularly robust, providing improved performance in every scenario of our numerical experiments. The Nesterov algorithm is the most popular method for accelerating proximal gradient descent. Here we have shown that SQUAREM and DAAREM are competitive with Nesterov acceleration and hence deserve greater attention.
We summarize the results from the six main simulation studies. SQUAREM effectively accelerated the original algorithm in 5 of the 6 problems, giving up to a 78-fold improvement in elapsed time with a mean reduction of roughly 18-fold. In the t-SNE problem, SQUAREM did not accelerate the convergence of the original algorithm, but it did consistently converge to a better solution than the original MM algorithm. Nesterov with restarts accelerated convergence in 4 of the 6 problems, with a speedup of up to 68-fold and a mean improvement of 16-fold. DAAREM also accelerated 4 of the 6 tasks, gaining up to a 48-fold improvement with a mean 13-fold reduction in computation time. Quasi-Newton likewise accelerated 4 of the 6 problems, with a mean 13-fold improvement in convergence time. Lastly, parabolic-EM accelerated convergence in 4 of the 6 tasks with a mean 4-fold improvement. Since the relative performance of the acceleration methods can vary across specific problems, we suggest trying SQUAREM, DAAREM, and Nesterov with restarts when acceleration is needed for a specific problem.
We refer the reader to our AccelBenchmark GitHub package, which allows the user to easily identify the best acceleration scheme for their specific problem. To use AccelBenchmark, the user only needs to supply the data, the fixed-point mapping function, and a loss function if one is available; the package will then automatically benchmark the performance of the original algorithm and all acceleration methods. For details, one can consult the package vignette.
We would like to draw attention to the importance of the monotonicity control implemented in all our acceleration algorithms. The base algorithms, such as EM and MM, are intrinsically monotone and hence have guaranteed global convergence. Acceleration schemes, by contrast, are based on extrapolation: they are fast in the neighborhood of the solution, but they are non-monotone and not globally convergent. Therefore, we implement safeguards, such as controlling the steplength and enforcing monotonicity, to ensure reliable convergence from any starting value (note that in the default settings monotonicity is relaxed as the solution is approached). Performance is not guaranteed without such safeguards. More importantly, adding the safeguards seldom degrades the speed of convergence, while providing a better guarantee of convergence that is essential for a general-purpose implementation.
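To make the safeguard idea concrete, here is a minimal sketch of one extrapolation step in the style of SQUAREM's squared extrapolation, with a simple monotonicity fallback. The function names and the fallback rule are ours; production implementations (e.g., the SQUAREM R package) add further steplength controls:

```python
import numpy as np

def squarem_step(F, loss, x0):
    """One SQUAREM-style extrapolation step with a monotonicity safeguard.

    Takes two base fixed-point steps, extrapolates with steplength
    alpha = -||r|| / ||v||, applies a stabilizing base step, and falls
    back to the plain update x2 whenever the extrapolated point
    increases the loss.
    """
    x1 = F(x0)
    x2 = F(x1)
    r = x1 - x0                       # first difference
    v = (x2 - x1) - r                 # second difference
    nv = np.linalg.norm(v)
    if nv == 0:                       # already at a fixed point
        return x2
    alpha = -np.linalg.norm(r) / nv
    x_acc = x0 - 2.0 * alpha * r + alpha ** 2 * v
    x_acc = F(x_acc)                  # stabilizing base step
    if loss(x_acc) > loss(x2):        # safeguard: keep monotone progress
        return x2
    return x_acc
```

Because the fallback returns the plain two-step update whenever extrapolation degrades the loss, each safeguarded step makes at least as much progress as the base algorithm, which is the essence of the global-convergence guarantee discussed above.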
It is worth noting that the fixed-point iterations used in some of the examples discussed here are already faster versions of the original fixed-point schemes. Specifically, the MMF procedure for the multivariate t-distribution is a faster version of the original EM algorithm; proximal gradient descent is a faster version of the subgradient method for the LASSO problem; and the MM algorithm for t-SNE is faster than gradient descent. Nevertheless, as we have demonstrated in our simulation studies, we can further accelerate these faster schemes using the described acceleration schemes and achieve significant additional speedups in many cases. Therefore, the acceleration schemes listed here are worth trying even in problems where a relatively fast fixed-point iterative algorithm is already available.
We have shown that acceleration methods can be very effective in a wide range of applications where fixed-point iteration algorithms are used. However, the use of these or similar acceleration methods in other contexts is still underdeveloped. For example, adapting these methods to handle nondeterministic iterations, such as stochastic gradient descent or Markov chain Monte Carlo procedures, would be an interesting topic for future research. Adapting these acceleration techniques to infinite-dimensional parameter settings (e.g., solutions of integral equations (Atkinson, 1976)) would also be an interesting direction for future study. The main issue here is that the parameter dimension can change across iterations, which may require the introduction of operators like an inner product in a suitable Hilbert space to compute the acceleration updates. For example, the Picard iteration (Junkins et al., 2013) and gradient tree boosting (Friedman, 2001) involve update functions where the number of parameters increases across iterations.