Sampling-based Gaussian Mixture Regression for Big Data

This paper proposes a nonuniform subsampling method for finite mixtures of regression models to reduce the computational burden of fitting these models to large data sets. A general estimator based on a subsample is investigated, and its asymptotic normality is established. We assign optimal subsampling probabilities to data points so as to minimize the asymptotic mean squared errors of the general estimator and of linearly transformed estimators. Since the proposed probabilities depend on unknown parameters, an implementable algorithm is developed. We first approximate the optimal subsampling probabilities using a pilot sample. After that, we select a subsample using the approximated subsampling probabilities and compute estimates from that subsample. We evaluate the proposed method in a simulation study and present a real data example using appliance energy data.


Introduction
A finite mixture of regression (FMR) model is a statistical tool widely used in fields such as econometrics, psychology, genetics, marketing, and engineering (e.g., McLachlan and Peel, 2004). FMR models can identify heterogeneous groups in a population and examine the linear relationships between a response and a set of covariates within each group. For estimation, the maximum likelihood estimator (MLE) is most commonly used for the unknown parameters. Since the MLE has no closed-form solution, iterative numerical algorithms are typically employed. The expectation-maximization (EM) algorithm (Dempster et al., 1977) is a classic approach for conducting maximum likelihood estimation in FMR models.
The EM algorithm, however, can impose an excessive computing burden when fitting FMR models to massive data. Subsampling is a technique that reduces the computational task by using a subsample extracted from the full data. In the linear regression framework, Drineas et al. (2006) and Ma et al. (2014) developed sampling algorithms that use statistical leverage scores of the covariate matrix to specify nonuniform subsampling probabilities. Wang et al. (2019) proposed a deterministic sub-data selection method that does not involve random sampling. Wang et al. (2018) proposed optimal subsampling strategies for logistic regression: they defined optimal subsampling probabilities by minimizing the asymptotic mean squared error (MSE) of the subsample-based estimator, and extracted sub-data from the full data using approximated optimal subsampling probabilities to obtain estimates. Recently, the optimal subsampling approach of Wang et al. (2018) has been extended to other model settings, including multinomial logistic regression (e.g., Yao and Wang, 2019), generalized linear models (GLMs) (e.g., Ai et al., 2021b; Lee et al., 2021), quantile regression (e.g., Wang and Ma, 2021; Ai et al., 2021a), quasi-likelihood models (e.g., Yu et al., 2022), and additive hazards models (e.g., Zuo et al., 2021).
In this paper, we develop optimal subsampling strategies for Gaussian FMR models. We first propose a general estimator based on a subsample and establish its asymptotic normality. To select informative data points, we specify optimal subsampling probabilities by minimizing the asymptotic MSE of the resulting estimators. However, since the specified probabilities depend on unknown parameters, they cannot be calculated directly. To handle this challenge, we propose a two-step algorithm. In the first step, we use a pilot sample to approximate the optimal subsampling probabilities. In the second step, we select a subsample based on the approximated subsampling probabilities and run the EM algorithm on it to obtain estimates.
The rest of the paper is organized as follows. In Section 2, we briefly review Gaussian mixture regression and then present subsample-based estimation under the Gaussian FMR model. Asymptotic results for the resulting estimators are investigated, and a two-step algorithm implementing the optimal subsampling strategy is presented. Section 3 assesses the performance of the proposed methods using simulated and real data sets. Section 4 concludes the paper. Proofs of the theoretical results are collected in the Supplementary Material.

Mixture Regression Models and Optimal Subsampling Strategy

Finite Mixture of Gaussian Linear Regressions
In this section, we review the finite mixture of Gaussian linear regressions. Suppose that $y$ is a response and $x$ is a $d$-dimensional covariate whose first entry is one. The conditional density function of $y$ given $x$ is
$$f(y \mid x; \theta) = \sum_{j=1}^{J} p_j\, f_j(y \mid x; \beta_j, \sigma_j), \qquad (1)$$
where $J$ is a given number of components, the $p_j$'s are component weights satisfying $p_j > 0$ for each $j$ and $\sum_{j=1}^{J} p_j = 1$, $f_j(y \mid x; \beta_j, \sigma_j)$ is the density of a normal distribution with mean $x^\top \beta_j$ and variance $\sigma_j^2$, $\beta_j$ is a $d \times 1$ vector of unknown regression coefficients including an intercept, and $\theta = (\beta_1, \ldots, \beta_J, \sigma_1, \ldots, \sigma_J, p_1, \ldots, p_{J-1})$. For identifiability of the finite mixture of Gaussian linear regressions, we assume that the density function $f(y \mid x; \theta)$ has common support in $(x, y)$ and is identifiable in $\theta$ up to permutation of the mixture components. The maximum likelihood estimator (MLE) $\hat{\theta}$ of the unknown parameter $\theta$ maximizes the log-likelihood
$$\ell(\theta) = \sum_{i=1}^{n} \log\left\{ \sum_{j=1}^{J} p_j\, f_j(y_i \mid x_i; \beta_j, \sigma_j) \right\}. \qquad (2)$$
The EM algorithm is an iterative procedure that can be used to optimize (2). Define $z_{ij}$ to be one if $y_i$ belongs to the $j$th component and zero otherwise, for $i = 1, \ldots, n$, and write $z_i = (z_{i1}, \ldots, z_{iJ})$. The EM algorithm finds the MLE by working with the complete-data log-likelihood of $\{(x_i, z_i, y_i) : i = 1, \ldots, n\}$,
$$\ell_c(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{J} z_{ij} \left\{ \log p_j + \log f_j(y_i \mid x_i; \beta_j, \sigma_j) \right\}.$$
In the expectation step (E-step), we calculate the conditional expectation of the complete-data log-likelihood given the current parameter estimate $\theta^{(s)}$ and the observed data,
$$Q(\theta \mid \theta^{(s)}) = \sum_{i=1}^{n} \sum_{j=1}^{J} \gamma_{ij}^{(s)} \left\{ \log p_j + \log f_j(y_i \mid x_i; \beta_j, \sigma_j) \right\}, \qquad \gamma_{ij}^{(s)} = \frac{p_j^{(s)} f_j(y_i \mid x_i; \beta_j^{(s)}, \sigma_j^{(s)})}{\sum_{k=1}^{J} p_k^{(s)} f_k(y_i \mid x_i; \beta_k^{(s)}, \sigma_k^{(s)})}.$$
In the maximization step (M-step), we update the estimates by maximizing $Q(\theta \mid \theta^{(s)})$. For $j = 1, \ldots, J$, the updated estimates are
$$p_j^{(s+1)} = \frac{1}{n}\sum_{i=1}^{n} \gamma_{ij}^{(s)}, \qquad \beta_j^{(s+1)} = \left(\sum_{i=1}^{n} \gamma_{ij}^{(s)} x_i x_i^\top\right)^{-1} \sum_{i=1}^{n} \gamma_{ij}^{(s)} x_i y_i, \qquad \big(\sigma_j^{(s+1)}\big)^2 = \frac{\sum_{i=1}^{n} \gamma_{ij}^{(s)} \big(y_i - x_i^\top \beta_j^{(s+1)}\big)^2}{\sum_{i=1}^{n} \gamma_{ij}^{(s)}}.$$
The computing time required until convergence is $O(\xi J n d^2)$, where $\xi$ is the number of iterations.
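To make the E-step and M-step above concrete, the following is a minimal R sketch of the EM iteration for a Gaussian FMR model. It is only an illustration under the definitions in this section, not the authors' released implementation, and the names fit_fmr_em, beta, sigma, and p are ours.

```r
# Minimal EM sketch for a J-component Gaussian mixture of linear regressions.
# X: n x d design matrix (first column is 1), y: response vector.
# beta: d x J matrix, sigma: length-J vector, p: length-J mixing weights.
fit_fmr_em <- function(X, y, beta, sigma, p, max_iter = 200, tol = 1e-6) {
  n <- nrow(X); J <- length(p)
  loglik_old <- -Inf
  for (s in seq_len(max_iter)) {
    # E-step: posterior component probabilities (responsibilities)
    dens <- sapply(seq_len(J), function(j)
      p[j] * dnorm(y, mean = drop(X %*% beta[, j]), sd = sigma[j]))
    resp <- dens / rowSums(dens)
    # M-step: weighted updates for p_j, beta_j, sigma_j
    for (j in seq_len(J)) {
      w <- resp[, j]
      p[j] <- mean(w)
      fit <- lm.wfit(X, y, w)                      # weighted least squares
      beta[, j] <- fit$coefficients
      sigma[j] <- sqrt(sum(w * fit$residuals^2) / sum(w))
    }
    # Convergence check on the observed-data log-likelihood at theta^(s)
    loglik <- sum(log(rowSums(dens)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(beta = beta, sigma = sigma, p = p, loglik = loglik, iterations = s)
}
```

In the simulation study below, the initial values for beta and sigma come from k-means clustering followed by per-group least squares; any such initialization can be passed to this sketch.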

Subsample-based Estimation
Since iterative computation on enormous data sets causes an excessive computational burden, in this section we consider subsample-based estimation of the parameters. Denote the full data by $\mathcal{D}_n = \{(x_i, y_i) : i = 1, \ldots, n\}$. Let $\{\pi_i\}_{i=1}^{n}$ be the subsampling probabilities assigned to the observations, satisfying $\sum_{i=1}^{n} \pi_i = 1$. We consider a random subsample of size $r$ selected from the full data $\mathcal{D}_n$ according to these subsampling probabilities. The subsampling estimator $\breve{\theta}$ is obtained by maximizing the target function
$$\ell^*(\theta) = \frac{1}{r} \sum_{i=1}^{r} \frac{1}{n \pi_i^*} \log\left\{ \sum_{j=1}^{J} p_j\, f_j(y_i^* \mid x_i^*; \beta_j, \sigma_j) \right\}, \qquad (4)$$
where the $x_i^*$'s, $y_i^*$'s, and $\pi_i^*$'s are the covariates, responses, and subsampling probabilities in the subsample, respectively. The EM algorithm can be applied to optimize the target function in (4); the details are given in Algorithm 1. We now derive the asymptotic distribution of $\breve{\theta}$. The following assumptions are needed to establish it.

Algorithm 1 EM Algorithm for target function (4).
Estimates can be obtained by maximizing the sampled complete-data target function
$$\ell_c^*(\theta) = \frac{1}{r} \sum_{i=1}^{r} \frac{1}{n\pi_i^*} \sum_{j=1}^{J} z_{ij}^* \left\{ \log p_j + \log f_j(y_i^* \mid x_i^*; \beta_j, \sigma_j) \right\},$$
where $z_{ij}^*$ equals one if $y_i^*$ belongs to the $j$th component and zero otherwise.
E-step: Given the current estimate $\theta^{(s)}$, compute, for $i = 1, \ldots, r$ and $j = 1, \ldots, J$,
$$\gamma_{ij}^{*(s)} = \frac{p_j^{(s)} f_j(y_i^* \mid x_i^*; \beta_j^{(s)}, \sigma_j^{(s)})}{\sum_{k=1}^{J} p_k^{(s)} f_k(y_i^* \mid x_i^*; \beta_k^{(s)}, \sigma_k^{(s)})}.$$
M-step: Update the estimate $\theta^{(s+1)}$ by maximizing $Q(\theta \mid \theta^{(s)})$. For $j = 1, \ldots, J$, with weights $w_i^* = 1/(n\pi_i^*)$,
$$p_j^{(s+1)} = \frac{\sum_{i=1}^{r} w_i^* \gamma_{ij}^{*(s)}}{\sum_{i=1}^{r} w_i^*}, \qquad \beta_j^{(s+1)} = \left(\sum_{i=1}^{r} w_i^* \gamma_{ij}^{*(s)} x_i^* x_i^{*\top}\right)^{-1} \sum_{i=1}^{r} w_i^* \gamma_{ij}^{*(s)} x_i^* y_i^*, \qquad \big(\sigma_j^{(s+1)}\big)^2 = \frac{\sum_{i=1}^{r} w_i^* \gamma_{ij}^{*(s)} \big(y_i^* - x_i^{*\top} \beta_j^{(s+1)}\big)^2}{\sum_{i=1}^{r} w_i^* \gamma_{ij}^{*(s)}}.$$
Repeat until convergence.
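As a companion to Algorithm 1, the sketch below shows how the inverse-probability weights $w_i^* = 1/(n\pi_i^*)$ enter the M-step in R. It assumes the weighted updates written above and uses our own function name fit_fmr_em_weighted; it is not the authors' code.

```r
# Weighted EM sketch for Algorithm 1: the E-step is unchanged, while the
# M-step multiplies each responsibility by the subsampling weight 1/(n * pi).
fit_fmr_em_weighted <- function(X, y, pi_star, n_full, beta, sigma, p,
                                max_iter = 200, tol = 1e-6) {
  r <- nrow(X); J <- length(p)
  w_star <- 1 / (n_full * pi_star)              # inverse-probability weights
  loglik_old <- -Inf
  for (s in seq_len(max_iter)) {
    dens <- sapply(seq_len(J), function(j)
      p[j] * dnorm(y, mean = drop(X %*% beta[, j]), sd = sigma[j]))
    resp <- dens / rowSums(dens)                 # E-step responsibilities
    for (j in seq_len(J)) {
      w <- w_star * resp[, j]                    # weight = IPW x responsibility
      p[j] <- sum(w) / sum(w_star)
      fit <- lm.wfit(X, y, w)
      beta[, j] <- fit$coefficients
      sigma[j] <- sqrt(sum(w * fit$residuals^2) / sum(w))
    }
    loglik <- sum(w_star * log(rowSums(dens)))   # weighted target function
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(beta = beta, sigma = sigma, p = p)
}
```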
Assumption 1 is a condition to guarantee that the log-likelihood function is convex for large n, and Assumptions 2 and 3 are conditions on the subsampling probabilities. Assumption 4 is needed for the Lindeberg-Feller Central Limit Theorem.

Optimal Subsampling Probability and Two-step Algorithm
In this section, we specify the optimal subsampling probabilities based on the result in Theorem 1. First, we consider minimizing the trace of $V$, $\mathrm{tr}(V)$, which can be viewed as minimizing the asymptotic MSE of $\breve{\theta}$.
Theorem 2. The optimal subsampling probabilities that minimize $\mathrm{tr}(V)$ are given in (6).
Calculating the optimal subsampling probabilities in Theorem 2 takes $O\{[J(d + 2) - 1]^2 n\}$ time.
We further consider minimizing the asymptotic MSE of linearly transformed subsample estimators to reduce the computation time. We assign optimal subsampling probabilities by minimizing the trace of $V_\pi$, which is equivalent to minimizing the asymptotic MSE of $M\breve{\theta}$. In addition, we focus on the asymptotic MSE of the coefficient estimator $\breve{\beta} = (\breve{\beta}_1^\top, \ldots, \breve{\beta}_J^\top)^\top$; from the result of Theorem 1, its asymptotic variance-covariance matrix $V_\beta$ can be derived, and we minimize $\mathrm{tr}(V_\beta)$ to determine the subsampling probabilities.

Theorem 3. The optimal subsampling probabilities that minimize $\mathrm{tr}(V_\pi)$ are given in (7), and the optimal subsampling probabilities that minimize $\mathrm{tr}(V_\beta)$ are given in (8).

The calculations of the subsampling probabilities $\pi_i^{V_\pi}$ and $\pi_i^{V_\beta}$ take $O\{[J(d + 2) - 1] n\}$ and $O\{[(Jd)^2 + (2J - 1)^2] n\}$ time, respectively; compared with $\pi_i^{V}$, they require less computing time. Since the subsampling probabilities in (6), (7), and (8) depend on the unknown parameters, we propose an implementable two-step algorithm. In the first step, we approximate the proposed subsampling probabilities by replacing $\theta$ with an estimate $\breve{\theta}_0$ obtained from a pilot sample. In the second step, we draw a subsample with replacement based on the approximated subsampling probabilities and run Algorithm 1 with the subsample. The practical procedure is presented in Algorithm 2.

Algorithm 2 Two-step Algorithm.
1. Draw a pilot sample of size $r_0$ from the full data, obtain the pilot estimate $\breve{\theta}_0$ with Algorithm 1, and approximate the optimal subsampling probabilities in (6), (7), or (8) by plugging in $\breve{\theta}_0$.
2. Draw a subsample of size $r$ with replacement based on the approximated optimal subsampling probabilities from the previous step. Combine the pilot sample from the previous step with this subsample and run Algorithm 1 with the data of size $r_0 + r$ to obtain the estimate $\breve{\theta}$.

Remark 1. Following Wang et al. (2018), we suggest estimating the variance-covariance matrix of the resulting estimator by $\breve{V} = \breve{M}^{-1} \breve{V}_\pi \breve{M}^{-1}$ for statistical inference, where $\breve{M}$ and $\breve{V}_\pi$ are the plug-in counterparts of $M$ and $V_\pi$ computed from the subsample.
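The two-step procedure can be organized in R roughly as follows. This is a minimal sketch under our own naming: prob_fun stands for whichever of the optimal probability formulas in (6), (7), or (8) is chosen (evaluated at the pilot estimate), fit_fmr_em_weighted refers to the weighted EM sketch above, and the way the pilot weights are combined with the second-step weights is a simplifying assumption rather than a detail taken from Algorithm 2.

```r
# Two-step subsampling sketch: pilot step, then optimal subsampling step.
# init is a list with starting values beta (d x J), sigma, and p.
two_step_subsample <- function(X, y, r0, r, init, prob_fun) {
  n <- nrow(X)
  # Step 1: uniform pilot sample of size r0 and pilot estimate.
  pilot_idx <- sample.int(n, r0, replace = TRUE)
  pilot_pi  <- rep(1 / n, r0)
  theta0 <- fit_fmr_em_weighted(X[pilot_idx, , drop = FALSE], y[pilot_idx],
                                pilot_pi, n, init$beta, init$sigma, init$p)
  # Approximate the optimal subsampling probabilities at the pilot estimate;
  # prob_fun implements (6), (7), or (8) and returns a length-n vector summing to 1.
  pi_opt <- prob_fun(X, y, theta0)
  # Step 2: draw r points with replacement using the approximated probabilities.
  idx <- sample.int(n, r, replace = TRUE, prob = pi_opt)
  # Combine the pilot sample with the new subsample (each point keeps its own
  # sampling probability as its weight in the combined fit).
  all_idx <- c(pilot_idx, idx)
  all_pi  <- c(pilot_pi, pi_opt[idx])
  fit_fmr_em_weighted(X[all_idx, , drop = FALSE], y[all_idx], all_pi, n,
                      theta0$beta, theta0$sigma, theta0$p)
}
```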

Simulation
In this section, we conduct a simulation study to assess the performance of the proposed method. We consider three different models.
We calculate empirical MSEs as $\sum_{k=1}^{K} \|\breve{\theta}^{(k)} - \theta\|^2 / K$, where $K$ is the number of replications and $\breve{\theta}^{(k)}$ is the estimate of $\theta$ obtained from the $k$-th subsample. For comparison, we consider the proposed subsampling probabilities $\pi_i^{V}$, $\pi_i^{V_\pi}$, and $\pi_i^{V_\beta}$ (denoted OPT-V, OPT-V$_\pi$, and OPT-V$_\beta$), and uniform subsampling probabilities (UNI). For UNI, subsamples of size $r_0 + r$ are used. We set $K = 1000$, $r_0 = 500$, and $r = 500, 1000, 1500, 2000$. For the initial values, k-means clustering is first conducted to form $J$ groups; we then fit ordinary least squares regressions within each group to obtain the initial values $\beta_j^{(0)}$ and $\sigma_j^{(0)}$ for $j = 1, 2, \ldots, J$. We set $J = 2$ for Models 1 and 2, and $J = 3$ for Model 3.

Figures 1, 2, and 3 present the simulation results for MSE. We observe that the proposed methods perform better than UNI, with smaller MSEs in all cases. OPT-V and OPT-V$_\beta$ show smaller MSEs than OPT-V$_\pi$ since they minimize the asymptotic MSEs of the parameter estimator $\breve{\theta}$ and the coefficient estimator $\breve{\beta}$, respectively. The MSEs of the proposed methods decrease as the subsample size $r$ increases, which agrees with the asymptotic results for the resulting estimator.

Figure 3: MSEs of Model 3 for varied $(\sigma_1^2, \sigma_2^2, \sigma_3^2)$, $(p_1, p_2, p_3)$, and $r$. OPT-V, OPT-V$_\pi$, and OPT-V$_\beta$ use $\pi_i^{V}$, $\pi_i^{V_\pi}$, and $\pi_i^{V_\beta}$, respectively. UNI uses uniform subsampling probabilities.

We investigate the proposed standard errors using the diagonal elements of the variance-covariance matrix in Remark 1. Using this formula, we estimate $\mathrm{tr}(V)$, i.e., the MSE of $\breve{\theta}$. We also calculate the empirical variance of each estimator, sum these variances (EmpVar), and compare the sum with the average estimated MSE (AveMSE). Table 1 presents AveMSE and EmpVar for Model 1; AveMSE gives results similar to EmpVar in all cases.
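For concreteness, the empirical MSE criterion used above can be computed with a few lines of R; empirical_mse, theta_hat_list, and theta_true are our own illustrative names.

```r
# Empirical MSE over K replications: average squared Euclidean distance
# between each subsample estimate and the reference parameter vector.
empirical_mse <- function(theta_hat_list, theta_true) {
  mean(sapply(theta_hat_list, function(th) sum((th - theta_true)^2)))
}
```

In the real data example, the full-data estimate plays the role of theta_true.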
We also examine components with different variances using data generated from Model 1. The variances $(\sigma_1^2, \sigma_2^2) \in \{(1/2, 1), (1, 2)\}$ are considered, with the other settings the same as those of Model 1. The MSEs of the proposed methods are smaller than those of UNI in Figure 4, and the average estimated MSEs are quite close to the empirical MSEs in Table 2.
To evaluate computational efficiency, we record the average CPU time for Model 1 on a MacBook Pro with a 2.5 GHz Intel Core i7 processor and 16 GB of memory. The simulation is conducted in the R programming language. We also report the computing time for the full data. As shown in Table 3, the sampling strategies based on the optimal subsampling probabilities save computing time compared with the full data approach (Full). As expected, the computing times for OPT-V$_\pi$ and OPT-V$_\beta$ are less than that of OPT-V. Since UNI requires no additional computation for the subsampling probabilities, it is the fastest method.

Real Data Example
In this section, we apply the proposed methods to appliance energy data. This dataset includes appliance energy consumption and humidity readings collected with an internet-connected energy monitoring system and a ZigBee wireless sensor network, respectively (Candanedo et al., 2017). Appliance energy consumption is used as the response, and humidity readings from three areas are used as covariates: the kitchen area (H-Kit), the living room area (H-Liv), and the laundry area (H-Lau). We use the log-transformed response and covariates. The full data size is $n = 19{,}735$. Figure 5 presents the distribution of the log-transformed energy consumption, which shows two peaks (around 4.2 and 5.8), and the relationships between energy consumption and the humidity readings from the different areas. Table 5 provides the parameter estimates from the full data and the averages of the parameter estimates for the subsampling methods based on 1000 subsamples with $r_0 = 500$ and $r = 1500$. We observe that the estimates from the optimal subsampling probabilities are close to those from the full data. In terms of average CPU time, the OPT-V, OPT-V$_\pi$, and OPT-V$_\beta$ methods took 0.065, 0.060, and 0.064 seconds, respectively, to calculate the estimates, while obtaining estimates from the full data took 0.343 seconds.
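A minimal R sketch of the data preparation described above is given below. The file name and the column names (Appliances for energy use and RH_1, RH_2, RH_3 for the kitchen, living room, and laundry humidities) are assumptions about the public data set rather than details stated in the paper.

```r
# Sketch: build the log-transformed response and covariates for the
# appliance energy example (file and column names assumed).
energy <- read.csv("energydata_complete.csv")
y <- log(energy$Appliances)              # log appliance energy consumption
X <- cbind(1,                            # intercept
           H_Kit = log(energy$RH_1),     # kitchen area humidity
           H_Liv = log(energy$RH_2),     # living room area humidity
           H_Lau = log(energy$RH_3))     # laundry area humidity
# X and y could then be passed to a two-step subsampling routine such as the
# sketch in Section 2, e.g. two_step_subsample(X, y, r0 = 500, r = 1500, ...)
```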
We also calculate empirical MSEs as $\sum_{k=1}^{K} \|\breve{\theta}^{(k)} - \hat{\theta}\|^2 / K$, where $\breve{\theta}^{(k)}$ is the estimate of $\theta$ obtained from the $k$-th subsample and $\hat{\theta}$ is the estimate calculated from the full data. Figure 6 shows the MSE results based on 1000 subsamples of size $r_0 + r$ drawn from the full data. OPT-V, OPT-V$_\pi$, and OPT-V$_\beta$ provide smaller MSEs than UNI.

Conclusion
In the big data era, statistical modeling with a large amount of data causes a heavy computational burden. In this article, we proposed an optimal subsampling method for mixtures of linear regression models to improve computational efficiency. We derived asymptotic results for the subsample-based estimator and proposed optimal subsampling probabilities. Since the subsampling probabilities cannot be calculated directly, a practical algorithm was also proposed: we first approximate the subsampling probabilities using a pilot sample and then draw a subsample with the approximated probabilities to obtain the parameter estimates.

There are important questions for future research. In this paper, we assume that the number of components $J$ is given, whereas in real applications the number of components often needs to be selected. One possible approach is to use model-selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Lumley and Scott (2015) proposed design-based AIC and BIC for survey data under a sampling design; building on these criteria, modified AIC and BIC could be developed for choosing the number of components in the subsampling framework. In addition, we considered a weighted objective function that assigns larger weights to less informative data points. For more efficient estimation, it would be interesting to develop unweighted subsample-based estimators that avoid inverse probability weighting in the target function (e.g., Wang, 2019; Wang and Kim, 2020). These are areas of future research.

Table 4: Average computing time (in seconds) using data from Model 1 with varied full data size $n \in \{5 \times 10^5, 10^6, 5 \times 10^6, 10^7\}$ at fixed $r = 2000$, $r_0 = 500$, $(\sigma_1^2, \sigma_2^2) = (2, 2)$, and $(p_1, p_2) = (1/2, 1/2)$. The number of coefficients for each component is 10, and the number of repetitions is 100. The computing time (in seconds) for the full data is also provided.

Table 5: Average estimates of the proposed methods and UNI for the appliance energy data, based on 1000 subsamples with $r_0 = 500$ and $r = 1500$. The estimates from the full data are also provided (Full).

Supplementary Material
• Software: R code used for the proposed methods and algorithms is available on GitHub at https://github.com/pedigree07/OPTMixture.
• Supplementary document: The supplementary document provides the proofs of the theorems.