Integrative Clustering Analysis with Application in Multi-Source Gene Expression Data

In omics studies, different sources of information about the same set of genes are often available. When the group structure (e.g., gene pathways) within the genes is of interest, we combine the normal hierarchical model with the stochastic block model, through an integrative clustering framework, to model gene expression and gene networks jointly. In extensive simulation studies, the integrative framework achieves higher accuracy when one or both data sources contain noise or when different data sources provide complementary information. An empirical guideline for the choice between integrative and separate clustering models is proposed. The integrative clustering method is illustrated on mouse embryo single cell RNAseq and bulk cell microarray data, where it identified not only the gene sets shared by both data sources but also the gene sets unique to one data source.


Introduction
Network analysis is the study of networks representing relationships (i.e., links or edges) between objects (i.e., vertices or nodes). Examples of networks include social networks, the World Wide Web, protein-protein interactions, and logistics supply chains. A large number of probabilistic and statistical models for networks have been proposed (Goldenberg et al., 2010). In this paper, we focus on community structure in networks, where a community is a set of nodes that is densely connected internally and sparsely connected to the rest of the network.
Community structures are often reflected in gene expression levels when genes in the same cluster are expressed synergistically. The normal hierarchical model (NHM) is a special case of Bayesian hierarchical models in which individual observations follow normal distributions whose mean parameters share a common prior (Morris and Lysy, 2012); it is often employed in analyzing microarray data (Thompson et al., 2020). We modify the classical NHM by assuming one NHM in each gene cluster; that is, mean expression levels of genes in different clusters follow normal prior distributions with different parameters. Single cell RNAseq data, however, cannot be easily modeled by standard probability distributions due to high levels of noise, dropouts, outliers and over-dispersion. Therefore, we dichotomize connectivity measures in single cell RNAseq data and apply the stochastic block model (SBM) to the resulting gene network data. The SBM is a popular community detection approach for binary network data, which assumes that the link probability of each pair of nodes is determined by their community labels (Holland et al., 1983). Much progress has been made on the theoretical justification and computational methods of the SBM (Bickel and Chen, 2009; Karrer and Newman, 2011; Zhao et al., 2012; Amini et al., 2013; Abbe, 2017; Zhao, 2017). In gene network data, a pair of genes are considered connected if their expressions are tightly related. The similarity of two gene expression profiles may be measured by the Euclidean distance (Priness et al., 2007) or correlation coefficients (Zhang and Horvath, 2005). The main goal is to partition the gene network into cohesive groups, which may coincide with functional gene sets or pathways. Binary links in a gene network are defined by setting thresholds on the distances or correlation values.
The estimated group structure is usually sensitive to the choice of the threshold and would be less reliable if an inappropriate threshold was chosen (Perkins and Langston, 2009).
With more and more microarray and single cell RNA (scRNA) data in public data consortia, integrative analyses combining these two data sources have achieved higher accuracy and power in differential expression analysis (Forcato et al., 2021). Different sources of gene expression data on the same set of genes collected under similar conditions often provide complementary information on the underlying structure of gene sets or pathways, but there is no integrative clustering method explicitly designed for joint analysis of microarray and scRNA data. In multi-omics studies, integrative clustering methods can be roughly classified into five categories (Rappoport and Shamir, 2018): concatenating matrices (Wu et al., 2015; Wang et al., 2013), clustering each omics data set separately followed by integration of clusters (Hoadley et al., 2014; Nguyen et al., 2017), integrating similarities (Wang et al., 2014), dimension reduction followed by clustering (Zhang et al., 2012), and probability clustering (Mo et al., 2013; Lock and Dunson, 2013). The proposed method falls into the category of probability clustering models, since it combines the log likelihoods of the NHM and the SBM on two sets of data from independent data sources. Specifically, the NHM describes the clustering structure on the mean values of gene expression levels, while the SBM extracts groups using mutual distances between genes. The integrative log-likelihood equals the sum of the log-likelihood from a hierarchical model on the microarray data and the log-likelihood from a stochastic block model on the binary network from scRNA data.
The rest of the paper is organized as follows. In Section 2, a novel EM algorithm is proposed, which combines the pseudo-likelihood method (Amini et al., 2013) for the SBM with the NHM. Extensive simulation studies are carried out in Section 3. We examine the performance of integrative clustering versus separate clustering in the presence of contamination as well as orthogonal community structures in different data sources. In most scenarios the proposed integrative method has higher accuracy in identifying the latent community structure than the methods using a single data source. Furthermore, we provide empirical guidelines for the choice between integrative analysis and separate analysis. In Section 4, our integrative clustering model is applied to a microarray data set and an scRNA data set, both measuring gene expression levels in mouse embryo cells under the same experimental conditions, independently generated in two different labs. Section 5 discusses limitations and future research directions.

Normal Hierarchical Model for Gene Expression Data
We adopt the NHM to model the normalized counts of gene expression. Specifically, the means of the gene-specific expression levels share group-level prior distributions. We first set up the notation.
Suppose there are p genes and n subjects. The expression level of gene i (i = 1, . . . , p) in subject j (j = 1, . . . , n) (after proper transformations) is denoted by x_ij, and let X = [x_ij]. Let c = (c_1, . . . , c_p), where c_i is the community label of gene i and there are K non-overlapping communities in total. Let P(c_i = k) = π_k, k = 1, . . . , K, and π = (π_1, . . . , π_K)^T. We further assume x_ij follows N(μ_ik, 1/γ_i) given c_i = k, where the gene-specific precision γ_i is treated as a fixed parameter, estimated from the observations on gene i. And μ_ik follows a prior distribution N(μ_k, 1/η_k), where μ_k and η_k are the group-level mean and precision, respectively. Let θ_k = (μ_k, η_k). The likelihood of the NHM is

L(X; \pi, \{\theta_k\}) = \prod_{i=1}^{p} \sum_{k=1}^{K} \pi_k f_X(x_i \mid \theta_k, c_i = k).

Furthermore, let x_i = (x_{i1}, . . . , x_{in}). Since the prior on μ_ik is a conjugate prior, the posterior distribution of μ_ik given x_i and c_i = k is still normal. That is,

\mu_{ik} \mid x_i, c_i = k \sim N\left( \frac{\eta_k \mu_k + \gamma_i \sum_{j=1}^{n} x_{ij}}{\eta_k + n\gamma_i}, \ \frac{1}{\eta_k + n\gamma_i} \right).

Finally, the probability density function of x_i conditional on c_i = k (with μ_ik integrated out), denoted as f_X(x_i | θ_k, c_i = k), k = 1, . . . , K, is

f_X(x_i \mid \theta_k, c_i = k) = \left( \frac{\gamma_i}{2\pi} \right)^{n/2} \left( \frac{\eta_k}{\eta_k + n\gamma_i} \right)^{1/2} \exp\left\{ -\frac{\gamma_i}{2} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2 - \frac{n\gamma_i \eta_k}{2(\eta_k + n\gamma_i)} (\bar{x}_i - \mu_k)^2 \right\},  (1)

where \bar{x}_i = n^{-1} \sum_{j=1}^{n} x_{ij}. We use the sample variance to estimate γ_i, that is, \hat{\gamma}_i = [\frac{1}{n-1} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2]^{-1}, and replace γ_i by this estimator in the formulas hereafter. Next we apply the EM algorithm to obtain the MLE of {π_k} and {θ_k}. Let

p_{ik}^{(t)} = \frac{\pi_k^{(t)} f_X(x_i \mid \theta_k^{(t)}, c_i = k)}{\sum_{l=1}^{K} \pi_l^{(t)} f_X(x_i \mid \theta_l^{(t)}, c_i = l)}.

Then the expectation of the log-likelihood function in the t-th iteration of the EM algorithm, with respect to the conditional distribution of {c_i} given X, {π_k} and {θ_k}, is

Q^{(t)} = \sum_{i=1}^{p} \sum_{k=1}^{K} p_{ik}^{(t)} \left[ \log \pi_k + \log f_X(x_i \mid \theta_k, c_i = k) \right].

In the M-step, maximizing Q^{(t)} over π subject to \sum_k \pi_k = 1 implies

\pi_k^{(t+1)} = \frac{1}{p} \sum_{i=1}^{p} p_{ik}^{(t)}.

Because the optimization over θ_k does not have a closed-form solution, the BFGS quasi-Newton method (Fletcher, 2013) is employed using the R function optim.
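As a concrete illustration, the marginal density f_X(x_i | θ_k, c_i = k) and the E-step responsibilities can be sketched in NumPy as follows. This is a minimal sketch under the model above; the function names are ours, and the paper's actual implementation uses R with optim.

```python
import numpy as np

def nhm_log_marginal(x, gamma_i, mu_k, eta_k):
    """Log of f_X(x_i | theta_k, c_i = k): the density of one gene's
    expression vector with the gene-level mean mu_ik integrated out."""
    n = len(x)
    xbar = x.mean()
    ss = np.sum((x - xbar) ** 2)
    return (0.5 * n * np.log(gamma_i / (2 * np.pi))
            + 0.5 * np.log(eta_k / (eta_k + n * gamma_i))
            - 0.5 * gamma_i * ss
            - 0.5 * n * gamma_i * eta_k / (eta_k + n * gamma_i)
              * (xbar - mu_k) ** 2)

def e_step(X, gammas, pis, mus, etas):
    """Posterior probabilities P(c_i = k | x_i) for every gene (E-step)."""
    p, K = X.shape[0], len(pis)
    logp = np.zeros((p, K))
    for i in range(p):
        for k in range(K):
            logp[i, k] = (np.log(pis[k])
                          + nhm_log_marginal(X[i], gammas[i], mus[k], etas[k]))
    logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(logp)
    return w / w.sum(axis=1, keepdims=True)
```

For n = 1 the marginal reduces to N(μ_k, 1/γ_i + 1/η_k), which provides a quick sanity check on the closed form.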

Pseudo-Likelihood Method for the Stochastic Block Model
We adopt the stochastic block model (SBM) to model the community structure within a network with binary edges, that is, an adjacency matrix A = [A_{ii'}] with A_{ii'} = 1 if nodes i and i' are connected and A_{ii'} = 0 otherwise. The SBM assumes that the linkage probability for a pair of nodes only depends on the community labels c_i and c_{i'} of the two endpoints. It is well known that the E-step of a classical EM algorithm is intractable when applied to the SBM. Amini et al. (2013) proposed a scalable pseudo-likelihood method that can fit a large network under the SBM efficiently. The key idea is to approximate the original adjacency matrix by a sample from a mixture of multivariate Poisson distributions and fit the model by a standard EM algorithm. We summarize this method below. Let e = (e_1, . . . , e_p) be an initial community assignment, and let b_i = (b_{i1}, . . . , b_{iK}), where b_{il} = \sum_{i'} A_{ii'} 1\{e_{i'} = l\} is the number of edges between node i and the nodes currently assigned to block l.
As observed by Amini et al. (2013), each b_i approximately follows a mixture of multivariate Poisson distributions: given c_i = k, b_{il} is approximately Poisson with mean denoted by λ_{kl}, l = 1, . . . , K. Moreover, {b_i} are approximately independent across i because b_{ik} and b_{i'l} (i ≠ i') share at most one common link, and are approximately independent when p is large. Letting λ_k = (λ_{k1}, . . . , λ_{kK}), the conditional probability of b_i given c_i = k is (up to a constant)

f_A(b_i \mid \lambda_k, c_i = k) = \prod_{l=1}^{K} \lambda_{kl}^{b_{il}} e^{-\lambda_{kl}},  (2)

and the joint pseudo log-likelihood is (up to a constant)

\ell_{PL} = \sum_{i=1}^{p} \log \left\{ \sum_{k=1}^{K} \pi_k \prod_{l=1}^{K} \lambda_{kl}^{b_{il}} e^{-\lambda_{kl}} \right\}.

A maximum pseudo-likelihood estimate of {π_k} and {λ_k} can then be obtained via the standard EM algorithm for mixture models. Once the EM converges, the initial community assignment e is updated to the most recent estimate of the node labels. This process is repeated for a fixed number of iterations. In practice, Amini et al. (2013) observed that it usually takes a small number of updates to stabilize e.
Finally, we specify the choice of initial values following Amini et al. (2013). Given the initial community assignment e obtained by spectral clustering, the initial values are taken as

\pi_k^{(0)} = n_k / p,  (3)

\lambda_{kl}^{(0)} = \frac{1}{n_k} \sum_{i: e_i = k} b_{il},  (4)

where n_k = \#\{i : e_i = k\} is the size of block k under e.
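The block-count construction and the Poisson-mixture EM can be sketched as follows. This is a simplified NumPy sketch: for brevity it initializes the rates from random rows of the block-count matrix and runs a fixed number of iterations, rather than the spectral-clustering-based initialization described above.

```python
import numpy as np

def block_counts(A, e, K):
    """b_{il}: number of edges from node i into block l under labeling e."""
    p = A.shape[0]
    B = np.zeros((p, K))
    for l in range(K):
        B[:, l] = A[:, e == l].sum(axis=1)
    return B

def poisson_mixture_em(B, K, n_iter=50):
    """EM for the multivariate Poisson mixture approximation of the SBM
    pseudo-likelihood (Amini et al., 2013), on the block counts B."""
    p = B.shape[0]
    pis = np.full(K, 1.0 / K)
    # crude initialization: random rows of B as starting Poisson rates
    lam = B[np.random.choice(p, K, replace=False)] + 1e-3
    for _ in range(n_iter):
        # E-step: log P(c_i = k | b_i), up to a constant
        logw = np.log(pis) + B @ np.log(lam).T - lam.sum(axis=1)
        logw -= logw.max(axis=1, keepdims=True)
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for the mixture weights and rates
        pis = w.mean(axis=0)
        lam = (w.T @ B) / w.sum(axis=0)[:, None] + 1e-10
    return pis, lam, w
```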

Integrative Method
In the presence of both microarray data and the binary adjacency matrix, we combine the log-likelihood from the NHM and the log-pseudo-likelihood from the SBM to infer the underlying community structure. To specify the joint likelihood, we make the following two additional assumptions: 1. X and A share the same community labels c; 2. Given c, X and A are independent.
Recall that x_i is the row of X corresponding to gene i and b_i is a K-dimensional vector that measures the connections between node i and each block according to the initial labeling e. Based on the assumptions above, the integrative pseudo-likelihood contribution of gene i given c_i = k is

IG(x_i, b_i; \theta_k, \lambda_k) = f_X(x_i \mid \theta_k, c_i = k) \, f_A(b_i \mid \lambda_k, c_i = k),

where f_X and f_A are defined in (1) and (2), respectively. As in Section 2.2, a maximum likelihood estimate of ({π_k}, {θ_k}, {λ_k}) can be obtained via the EM algorithm. After the EM loop converges, we use the estimated posterior probabilities to update the labeling e and re-compute {b_i}.
We now give the detailed algorithm. Start with an initial labeling e and initial parameters (π^{(0)}, {θ_k^{(0)}}, {λ_k^{(0)}}). Then repeat the process below T times:
(1) Compute b_i, i = 1, . . . , p, from A based on the current labeling e.
(2) Use the current parameter estimates π^{(t)}, {θ_k^{(t)}}, {λ_k^{(t)}} to compute the conditional probabilities for the node labels as

p_{ik}^{(t)} \propto \pi_k^{(t)} \, IG(x_i, b_i; \theta_k^{(t)}, \lambda_k^{(t)}).

(3) Update π, {θ_k} and {λ_k} by maximizing the expected log-likelihood, as in the M-steps of Sections 2.1 and 2.2.
(4) Repeat steps (2) and (3) until the EM algorithm converges; then update e by assigning each node to its highest-probability label.
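The loop above can be sketched end-to-end in NumPy. This is a self-contained, simplified sketch and not the paper's implementation: the group precisions η_k are held fixed at 1, the M-step for μ_k uses a weighted-mean update instead of the BFGS step, and it assumes every cluster stays non-empty across iterations.

```python
import numpy as np

def integrative_em(X, A, K, e0, T=5, em_iter=30):
    """Sketch of the integrative EM: the E-step multiplies the NHM density of
    x_i and the Poisson pseudo-likelihood of b_i, which share the label c_i."""
    p, n = X.shape
    e = e0.copy()
    gam = 1.0 / X.var(axis=1, ddof=1)        # plug-in gene-specific precisions
    xbar = X.mean(axis=1)
    ss = ((X - xbar[:, None]) ** 2).sum(axis=1)
    w = None
    for _ in range(T):                        # outer loop: refresh e and b_i
        B = np.stack([A[:, e == l].sum(axis=1) for l in range(K)], axis=1)
        pis = (np.bincount(e, minlength=K) + 1.0) / (p + K)
        mus = np.array([xbar[e == k].mean() for k in range(K)])
        etas = np.ones(K)                     # simplification: eta_k fixed at 1
        lam = np.array([B[e == k].mean(axis=0) for k in range(K)]) + 1e-6
        for _ in range(em_iter):              # inner EM, steps (2)-(3)
            s = etas[None, :] + n * gam[:, None]
            logfx = (0.5 * n * np.log(gam / (2 * np.pi))[:, None]
                     + 0.5 * np.log(etas[None, :] / s)
                     - 0.5 * (gam * ss)[:, None]
                     - 0.5 * n * gam[:, None] * etas[None, :] / s
                       * (xbar[:, None] - mus[None, :]) ** 2)
            logfa = B @ np.log(lam).T - lam.sum(axis=1)[None, :]
            logw = np.log(pis)[None, :] + logfx + logfa
            logw -= logw.max(axis=1, keepdims=True)
            w = np.exp(logw)
            w /= w.sum(axis=1, keepdims=True)
            pis = w.mean(axis=0)
            lam = (w.T @ B) / w.sum(axis=0)[:, None] + 1e-10
            mus = (w * xbar[:, None]).sum(axis=0) / w.sum(axis=0)
        e = w.argmax(axis=1)                  # step (4): update the labeling
    return e, w
```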

Simulation
In this section, we evaluated the performance of the proposed integrative method by simulation studies. We fixed p = 1000, n = 40, K = 3, π = (0.3, 0.3, 0.4) and ran the outer loop of the integrative EM algorithm T = 12 times, which is sufficient for the estimated labels to stabilize in most cases. Conditional on the labels, the gene expression data x_ij (i = 1, . . . , p, j = 1, . . . , n) were generated from the NHM with group-level parameters (μ_k, η_k), and the γ_i were independently generated as 1/U_i^2 with U_i ~ Unif(0.2, 1.5). Conditional on the labels, the edges of A were generated as independent Bernoulli variables with probability 0.12 if a pair of nodes are from the same community and probability 0.06 otherwise. We further introduced contamination in X and A. Specifically, for X, each x_i = (x_{i1}, . . . , x_{in}) was replaced, independently with probability p.noise.X, by an n × 1 vector whose components were sampled independently from N(2.5, 1). For A, we randomly chose pairs of nodes with probability p.noise.A and set their linkage probability to p_0 = 0.0001. Both p.noise.X and p.noise.A varied from 0.1 to 0.9 in increments of 0.1.
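This data-generating process can be sketched as follows. The group means `mus` and precision `eta` in the sketch are placeholders, since the exact values used in the paper are not restated here; the contamination mechanisms follow the description above.

```python
import numpy as np

def simulate(p=1000, n=40, K=3, pi=(0.3, 0.3, 0.4),
             mus=(2.0, 4.0, 6.0), eta=1 / 0.36,
             p_in=0.12, p_out=0.06,
             p_noise_X=0.1, p_noise_A=0.1, seed=0):
    """Generate (c, X, A): labels, NHM expression matrix, and SBM adjacency,
    with row-wise contamination of X and pairwise contamination of A."""
    rng = np.random.default_rng(seed)
    c = rng.choice(K, size=p, p=pi)
    gam = 1.0 / rng.uniform(0.2, 1.5, size=p) ** 2       # gene precisions
    mu_ik = rng.normal(np.array(mus)[c], np.sqrt(1 / eta))  # gene-level means
    X = rng.normal(mu_ik[:, None], np.sqrt(1 / gam)[:, None], size=(p, n))
    noisy = rng.random(p) < p_noise_X
    X[noisy] = rng.normal(2.5, 1.0, size=(noisy.sum(), n))  # contaminated rows
    P = np.where(c[:, None] == c[None, :], p_in, p_out)
    P[rng.random((p, p)) < p_noise_A] = 1e-4             # contaminated pairs
    U = rng.random((p, p))
    A = (np.triu(U, 1) < np.triu(P, 1)).astype(int)      # upper triangle only
    return c, X, A + A.T                                 # symmetrize
```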
For the NHM, we first applied k-means to find the initial labels; given the initial labels, we computed the maximum likelihood estimates and used them as {θ_k^{(0)}}. For the SBM, the initial labels were produced by spectral clustering and the initial parameters by (3) and (4). For the integrative model, the initial values for the two components were obtained by the respective methods above. The EM algorithm starts from an E-step using the initial parameter estimates.
To compare the estimated labels with the true labels, we use the adjusted Rand index (ARI), a measure of similarity between two partitions; a higher ARI indicates a higher degree of agreement between two sets of labels (Rand, 1971). Each plot shows three ARIs, representing the performance of three methods: the NHM using X alone, the SBM using A alone, and the integrative method (IG) using both data types. Additionally, we plot the ARI measuring the similarity between the labels estimated from the NHM and the SBM (denoted as "NHM vs SBM" in the legend). Figure 1 shows the four sets of ARI values with p.noise.X and p.noise.A each varying from 0.1 to 0.9. As expected, the performance of all three methods worsens as the contamination levels in the two data sources increase.

Figure 1: Comparison of the performance of the integrative method vs the SBM alone vs the NHM alone when X and A share a common community structure: the ARI between true (or estimated) and estimated labels as a function of p.noise.X and p.noise.A.

The integrative method performs better than both the SBM and the NHM when the contamination levels in X and A are low to moderate, say both less than 0.4. On the other hand, when one data source contains a high level of contamination while the other has a low contamination level, for example, p.noise.X = 0.1 and p.noise.A = 0.9, the integrative model is no longer better than the method using the single source with little contamination. Without knowledge of which data source contains more contamination, the integrative method is a safe choice because it is always better than the worse of the two single-source methods. However, if we know one data source is less contaminated than the other, we need to choose either the integrative method or the better data source alone, depending on the contamination levels. We therefore propose an empirical guideline for the choice between the integrative method and separate clustering analyses on individual data sources in the next section. Figure 1 indicates an interesting pattern across the four sets of ARIs. The ARI between the community estimates from A (SBM) and X (NHM) is small when the contamination level is heavy in either data source, which coincides with the performance pattern of the integrative model: when the ARI between the NHM and the SBM is small, the integrative model is no longer the best. Therefore, we searched for a threshold in the ARI between the NHM and the SBM, below which the performance of our integrative model is inferior to at least one of the single-source methods. The first section of Table 1 lists the four thresholds from the first four subplots in Figure 1 when K = 3. As shown in the plots, for p.noise.X = 0.1, 0.2, 0.3, the integrative model performs the best over almost the entire range of p.noise.A, except for p.noise.A = 0.9. Thus the thresholds are set to the second smallest ARI values between the NHM and the SBM, attained at p.noise.A = 0.9.
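For completeness, the Hubert-Arabie adjusted Rand index used throughout can be computed directly from the contingency table of two label vectors; a minimal implementation is sketched below (equivalent in intent to, e.g., scikit-learn's `adjusted_rand_score`).

```python
import numpy as np
from math import comb

def adjusted_rand_index(a, b):
    """ARI between two label vectors: the Rand index corrected for chance."""
    a, b = np.asarray(a), np.asarray(b)
    ua, ub = np.unique(a), np.unique(b)
    # contingency table of cluster co-occurrences
    M = np.array([[np.sum((a == x) & (b == y)) for y in ub] for x in ua])
    sum_comb = sum(comb(int(nij), 2) for nij in M.ravel())
    sum_a = sum(comb(int(ni), 2) for ni in M.sum(axis=1))
    sum_b = sum(comb(int(nj), 2) for nj in M.sum(axis=0))
    total = comb(len(a), 2)
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:        # two trivial partitions
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

Note that the ARI is invariant to permutations of the label values, which is why it is suitable for comparing clusterings whose label names carry no meaning.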
For p.noise.X = 0.4, the integrative model is the best when the ARI between the NHM and the SBM (the green line) is above 0.18 at p.noise.A = 0.1, after which the integrative method is slightly worse than the SBM; we therefore use the ARI between X and A at p.noise.A = 0.1 as the corresponding threshold. We do not show thresholds for p.noise.X ≥ 0.5 because our simulation (Figure 1) shows the integrative model is not the best even when p.noise.A = 0.1, and the ARIs between X and A are always lower than 0.12, the first ARI value of the green curve in Figure 1 at p.noise.X = 0.5. In summary, when K = 3 and the ARI between the NHM and the SBM is above 0.18, the integrative method using both data sources outperforms the clustering methods using a single source under our setup.

Empirical Guidelines Choosing Between Integrative and Separate Analyses
We carried out additional simulations to investigate how the threshold changes with the number of clusters K. We repeated the experiments shown in Figure 1 under different K values, from 4 to 8. For each K, the true labels were evenly allocated, i.e., π = (1/K, . . . , 1/K). The NHM parameters (μ_k, η_k) for k = 1, 2, 3 were the same as in the previous simulation; for k from 4 to 8, we further set μ_4 = 1, μ_5 = 3, μ_6 = 5, μ_7 = 6, μ_8 = 7 and η_k = 1/0.36. Moreover, the edges of A were generated as independent Bernoulli variables with probability 0.12 for within-community pairs and probability 0.06 for between-community pairs, and the contamination in A and X was generated in the same way as in Figure 1. The resulting thresholds versus the number of clusters are listed in blocks 2-6 of Table 1. The column of overall thresholds lists, for each K, the threshold above which integrative analysis performs better. Table 1 shows that as K becomes larger, one can in general expect a smaller threshold value. This may arise from the increasing number of possible cluster memberships for each node when K is large, which makes it harder for two sets of estimated clustering labels to agree even when their data sources share a similar underlying clustering structure.
Lastly, the thresholds may depend on the parameter settings in the data generation process. Therefore, we recommend that researchers run simulations based on the parameter values estimated from their data to choose appropriate thresholds. We ran two sets of simulations, 3* and 4* in Table 1, to examine the robustness of thresholds from simulations using estimated parameters. We first generated X with p.noise.X = 0.1 and A with p.noise.A = 0.1 under the same setting as Section 3.1. We then fitted the NHM on X and the SBM on A, respectively, and used the estimated parameters to generate data and find the thresholds for K = 3 and 4. The estimated parameters can be viewed as a perturbation of the original parameters. Although the thresholds in 3* and 4* are not identical to those in the two sets of simulations with K = 3 and K = 4, they are numerically close and show the same decreasing trend.

Performance Under Unequal Community Structures
The previous simulation studies demonstrate that the integrative method pools different information sources and achieves higher accuracy when the ground-truth labels c_X from X and c_A from A are identical, i.e., c_{Xi} = c_{Ai} for i = 1, . . . , p. Most existing integrative methods, such as Wang et al. (2014), Yan and Sarkar (2021), Xu et al. (2012), and Newman and Clauset (2016), also assume the same underlying community structure among different data sources. However, in real-life scenarios, whether different data sources share the same underlying community structure is unknown. Furthermore, it is possible that two different data sources contain different yet complementary community structures. In those cases, the overall community structure is the overlap of the two sets of clusters: a pair of genes belongs to the same group in the overall community structure if and only if they belong to the same group in the clusters from X and also in the clusters from A. Below we demonstrate through simulations that the integrative method still provides higher accuracy in identifying the overall structure from two different sets of communities. Specifically, we investigate the performance of the integrative method given independent community labels c_X and c_A.
We first generated c_X and c_A independently with K = 2, both by p random draws from Multinomial(1, (0.5, 0.5)). We then constructed the overall labels c with K = 4, where each overall label corresponds to a distinct pair (c_{Xi}, c_{Ai}). Given c_X, X was generated from the NHM with (μ_1, η_1) = (2, 1/0.25), (μ_2, η_2) = (4, 1/0.25), and the γ_i (i = 1, 2, . . . , p) independently generated as 1/U_i^2 with U_i ~ Unif(0.2, 1.5). Given c_A, A was generated with within-community connection probability 0.1 and between-community connection probability 0.06. For this scenario, we also introduced contamination into both X and A in the same way as in Section 3.1. In Figure 2, the ARIs for the integrative model, the NHM and the SBM were calculated by comparing their estimated labels with the overall cluster structure c. When estimating under the integrative model, we set K = 4; under the NHM and the SBM, we set K = 2.

Figure 2: Comparison of the performance of the integrative method vs the SBM alone vs the NHM alone when X and A have independent community structures: the ARI between true (or estimated) and estimated labels as a function of p.noise.X and p.noise.A.

Similar to what we observe in Figure 1, the integrative model is better at capturing the overall clustering structure than the separate models unless the contamination level is very high in one source and low in the other, in which case the source with less contamination alone is the best choice. When the contamination levels in both X and A are high, the performance of the integrative method tends to overlap with that of the SBM on data source A, possibly because IG(x_i, b_i; θ_k, λ_k) is dominated by f_A(b_i | λ_k, c_i = k) in the calculation of the posterior probabilities.
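The construction of the overall labels from the two labelings amounts to assigning one label per distinct pair (c_{Xi}, c_{Ai}); a small helper makes this explicit (the function name is ours):

```python
import numpy as np

def overall_labels(cX, cA):
    """Map each pair (c_Xi, c_Ai) to one overall label: two genes share an
    overall cluster iff they agree in both labelings."""
    pair = [(int(x), int(a)) for x, a in zip(cX, cA)]
    uniq = sorted(set(pair))                    # distinct label pairs
    index = {q: k for k, q in enumerate(uniq)}  # pair -> overall label
    return np.array([index[q] for q in pair])
```

With two binary labelings this yields at most K = 2 × 2 = 4 overall clusters, matching the setup above.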

BIC-Type Criteria to Choose K
BIC-type model selection criteria have been proposed to select the number of communities in networks, e.g., BIC (Saldana et al., 2017) and CBIC (Hu et al., 2020). Additional simulations were run to examine whether BIC-type criteria are appropriate for integrative network analysis. We generated X and A with true community number K = 4 and p.noise.A = 0.2, p.noise.X = 0.2; the rest of the parameter setting was identical to Section 3.1. We fitted the integrative model with K = 2, . . . , 10, and used three BIC-type criteria for model selection: BIC, CBIC with λ = 0.5, and CBIC with λ = 2. Here we followed the definitions of BIC and CBIC in Hu et al. (2020); that is, the three criteria take the form 2(log likelihood) − penalty. The simulation was repeated 100 times. The three lines in Figure 3 show the average values of BIC, CBIC with λ = 0.5, and CBIC with λ = 2. Both BIC and CBIC with λ = 0.5 reached their maximum at K = 4 in all 100 replicates, while CBIC with λ = 2 picked K = 2 in all 100 replicates. CBIC with λ = 0.5 gives the clearest upside-down U-shape.

Robustness to Correlated Data Sources
Two correlated data sets X^(1) and X^(2) were generated under the NHM following the setup with K = 3, where the correlation coefficient between x_ij^(1) and x_ij^(2) varied from 0 to 0.8 in increments of 0.2. We added contamination to X^(1) with p.noise.X = 0.2. Furthermore, we generated an adjacency matrix A with each entry A_ij equal to 1 with probability given by the inverse distance in X^(2), capped at 1. Finally, contamination was added to A with p.noise.A = 0.4. We then fitted the NHM on X^(1), the SBM on A, and the integrative model on both. The simulation was repeated 200 times and the results are reported in Figure 4. There is no visible deterioration in performance as the correlation increases. Although correlated with X^(1), A still provides extra information about the underlying community structure, and the integrative analysis is always more accurate than either the SBM on A alone or the NHM alone, especially in the presence of contamination in both data sources.
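One standard way to generate two expression matrices with entry-wise correlation ρ, as in this experiment, is to mix independent Gaussian noise; a sketch under that assumption (the exact mechanism used in the paper is not restated here):

```python
import numpy as np

def correlated_expressions(mu_ik, gam, rho, n, rng):
    """Generate X1, X2 sharing gene-level means mu_ik and precisions gam,
    with corr(X1_ij, X2_ij) = rho via a Gaussian mixing construction."""
    p = len(mu_ik)
    z1 = rng.normal(size=(p, n))
    z2 = rng.normal(size=(p, n))
    sd = np.sqrt(1.0 / np.asarray(gam))[:, None]
    X1 = mu_ik[:, None] + sd * z1
    X2 = mu_ik[:, None] + sd * (rho * z1 + np.sqrt(1 - rho ** 2) * z2)
    return X1, X2
```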

Clustering Analysis for Mouse Embryo Data
The integrative clustering method is applied to two independent data sets, both of which measure gene expression in mouse embryonic cells. One is a microarray data set from Moliner et al. (2008) that measured gene expression in bulk mouse embryonic stem (ES) cells. The other is the scRNA sequencing data on individual mouse ES cells from Islam et al. (2011). Cell samples in these two independent studies were cultivated following the same protocol, and the two data sets have been used as benchmarks sharing the same biomarker information in many bioinformatics papers (Wang et al., 2019; Di et al., 2011). The original bulk and single cell data were merged by gene names, and genes that dropped out in more than 40 of the 92 cells in the single cell data were deleted. After this pre-processing, 1015 genes are shared by the microarray data and the scRNA data. Raw data counts from both data sets were transformed using log counts per million (logCPM).
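These pre-processing steps can be sketched as below. The exact logCPM variant (pseudo-count, log base) used in the paper is not specified, so the sketch uses one common convention, log2(CPM + 1); the function names are ours.

```python
import numpy as np

def log_cpm(counts):
    """log2 counts-per-million with a pseudo-count of 1 (one common logCPM
    variant); `counts` is genes x cells."""
    lib = counts.sum(axis=0, keepdims=True).astype(float)  # per-cell library size
    return np.log2(counts / lib * 1e6 + 1.0)

def filter_dropouts(counts, max_zero_cells=40):
    """Keep genes whose dropout (zero count) occurs in at most
    `max_zero_cells` cells, mirroring the 40-of-92-cells filter above."""
    keep = (counts == 0).sum(axis=1) <= max_zero_cells
    return counts[keep], keep
```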
Both microarray and scRNA sequencing are high-throughput techniques that measure thousands to millions of genes simultaneously. Single cell expression data measure gene-specific transcript counts in individual cells, one cell at a time. Microarray data are gene expression levels from different samples, where each sample is a collection of cells. If all cells in a microarray sample had the same expression pattern, the two data sources would have the same underlying community structure/gene pathways. In that case, including both data sets in an integrative clustering analysis would increase the sample size and accuracy, especially given the extra-noisy scRNA data. Usually, however, cells are heterogeneous with different expression patterns, and scRNA data provide information on individual-cell-level correlations between genes beyond their mean levels. At the same time, scRNA data have low quality due to the limitations of sequencing the tiny amount of transcripts in a single cell. In contrast, microarray data measure the average expression with high quality and therefore provide information on the clustering structure of the mean expression levels. The clustering structure on means (similar mean levels) and that on correlations (simultaneous expression) may not always agree; they can be viewed as complementary information on gene pathways, where genes interact and collaborate to complete specific biological processes. Microarray data are often modeled as normally distributed, but scRNA data are far from normal due to the extra noise in sequencing single cells (dropout, over-dispersion, batch effects, outliers, heterogeneity across cells, etc.). We ran additional simulations to examine the performance of the SBM on binary adjacency matrices converted from data generated under the NHM, which is robust and achieves similar ARI as the NHM on the original data.
Therefore, we treated the bulk cell microarray data as the multivariate normal X and summarized the scRNA data into a binary adjacency matrix A. We constructed an affinity matrix from the scRNA data, where the affinity value was calculated as the inverse of the Euclidean distance between two genes, and rescaled the affinity values into [0, 1]. The affinity matrix was then converted into a binary adjacency matrix, where links with affinity values higher than 0.024 were set to one and all others to zero. This affinity threshold was chosen based on joint consideration of the scale-free topology fitting index signed R² (Zhang and Horvath, 2005), the average link density, and the ARI with the microarray data X. Our choice of threshold 0.024 corresponds to an adjacency matrix with density approximately 0.05. We also tried three other thresholds, 0.022, 0.021, and 0.020, corresponding to graph densities of approximately 0.10, 0.15, and 0.20. The threshold 0.024 gave the adjacency matrix with the highest scale-free topology fitting index signed R² and the highest ARI with the microarray data X (Table 2).
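The affinity-to-adjacency conversion described above can be sketched as follows (a minimal NumPy sketch; the function name is ours, and the threshold is whatever value the density/scale-free criteria select):

```python
import numpy as np

def affinity_adjacency(Z, threshold):
    """Inverse-Euclidean affinity between gene profiles (rows of Z),
    rescaled to [0, 1], then thresholded into a binary adjacency matrix."""
    d = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)                 # exclude self-affinity
    aff = 1.0 / d
    aff = (aff - aff.min()) / (aff.max() - aff.min())  # rescale to [0, 1]
    A = (aff > threshold).astype(int)
    np.fill_diagonal(A, 0)                      # no self-links
    return A, aff
```

The resulting graph density (the off-diagonal mean of A) can then be inspected across candidate thresholds, as in Table 2.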
We fitted the SBM on the scRNA adjacency matrix (A) and the NHM on the normalized bulk cell microarray data (X). The clusters from spectral clustering on A served as the initial labels in the pseudo-likelihood method for the SBM, and the parameter estimates based on clusters from k-means on X were used as initial values for the EM algorithm of the NHM. The two sets of estimated labels from the SBM on A alone and the NHM on X alone are compared using the ARI in Figure 5A. The figure shows that the ARI between the NHM on X and the SBM on A decreases as K increases. The ARI between the estimated clusters from the two data sources is larger than the reference threshold values in Table 1 under the corresponding K, suggesting potential benefits of conducting the integrative clustering analysis pooling information from both X and A.
We found a set of initial labels for the integrative model by combining the clustering structures from both data sources. An affinity matrix, denoted aff(X), was produced from X using the same inverse Euclidean transformation as for the scRNA data. Spectral clustering was conducted on aff(X) + A, and the resulting labels were used as the initial labels in the integrative model. BIC and CBIC with λ = 0.5 and 2 were calculated as criteria to choose the number of clusters. The BIC and CBIC values of the integrative model versus the number of clusters K are plotted in Figure 5B. In the figure, BIC and the CBICs generally increase as K gets larger; BIC (blue line) is always the highest, followed by CBIC with λ = 0.5, due to the different penalty terms associated with K. We chose K = 14 because all lines have their first drop after this point. The clusters identified by the proposed integrative method, as well as the results from the single data sources, were compared with known gene sets in MSigDB (https://www.gsea-msigdb.org/gsea/msigdb) to screen for gene sets with significant overlap. The three sets of results (known gene sets with significant overlap) using a p-value threshold of 10^{-20} are shown in Table 3, Table 4 and Table 5 for the integrative model, the NHM and the SBM, respectively.
There are 32 gene sets in Table 3, 31 in Table 4, and 35 in Table 5; the last column of each table gives the cluster id that overlaps with the corresponding gene set. From the integrative clustering model, cluster 10 overlaps with 26 gene sets (p-values < 10^{-20}). Similarly, cluster 5 identified by the NHM on the bulk cell microarray data and cluster 3 identified by the SBM on the scRNA data overlap with 26 and 28 gene sets, respectively. These three clusters estimated from the three models include 65, 60 and 65 genes, respectively, and share 31 genes in common. Moreover, 24 of the 26 gene sets common to Table 4 and Table 5, most of which involve crucial functions in the early development of an organism, were also identified by the integrative method in Table 3. Four gene sets discovered by the SBM in Table 5 but missed by the NHM in Table 4 are reported by the integrative model in Table 3. Two gene sets uniquely discovered by the integrative method did not reach the p-value threshold of 10^{-20} in the methods using one data set alone (p-values = 3.65 × 10^{-7} and 8.43 × 10^{-10} in X; 4.97 × 10^{-11} and 2.03 × 10^{-19} in A).

Discussion and Future Research Directions
We proposed an integrative clustering method that combines the NHM for microarray data and the SBM for a binary adjacency matrix derived from scRNA data. An EM algorithm was developed in the spirit of the pseudo-likelihood method (Amini et al., 2013). In real applications, researchers often do not know whether the underlying community structures in different data sources agree, so a challenging question is whether the integrative method provides higher accuracy than separate clustering analyses on the individual data sources. We proposed an empirical guideline for this question. Furthermore, we investigated the performance of the integrative model on data sources with different community structures. In that case, we considered the intersection of the community structures in the different data sources to be the overall community structure, so that the different data sources provide complementary information on it. The integrative method is therefore also helpful in disclosing overlapping community structures.
We plan to conduct future research in the following directions. We provided an empirical guideline for the choice between the integrative method and separate analyses; the thresholds, determined by simulation studies, are the ARIs between the estimated clusters from different data sources at which the integrative method starts to outperform the single-source methods. Future research may derive the asymptotic distribution of the ARI under the null hypothesis that the community structures from different data sources are identical. Furthermore, if there is biological or clinical evidence that the two related data sources describe the same disease or biophysical process, it is reasonable to carry out the integrative clustering analysis without checking the ARI; the integrative method can also be used as an exploratory data analysis tool to discover new clusters. The proposed integrative clustering method falls into the probability clustering category, where the log-likelihoods for the individual data sets are summed and maximized jointly. Therefore, models other than the NHM or the SBM can be used as long as a likelihood, quasi-likelihood or pseudo-likelihood can be written down. Moreover, when more than two data sets are available, their log-likelihoods can still be summed, but a common clustering structure can no longer be verified by a pairwise ARI; testing procedures or empirical guidelines need to be developed to examine whether the clustering structures from more than two sources agree. Finally, although both the NHM and the SBM are based on expression data, they extract information from different aspects (i.e., means and Euclidean distances), which may lead to different effective sample sizes and different contributions to the joint likelihood. The magnitudes of the two likelihood functions may not be comparable in the E-step when calculating the posterior probabilities, and the estimated group structure may be dominated by either the NHM or the SBM because of this imbalance. Therefore, when one data source is more important or reliable than the other, one may weight the data sets accordingly in the joint likelihood.