A Mixed-Membership Model for Social Network Clustering

We propose a simple mixed membership model for social network clustering. A flexible function is adopted to measure affinities among the entities in a social network. The model not only allows each entity in the network to possess more than one membership, but also provides accurate statistical inference about network structure. We estimate the membership parameters using an MCMC algorithm. We evaluate the performance of the proposed algorithm by applying our model to two empirical social network data sets, the Zachary karate club data and the bottlenose dolphin network data. We also conduct numerical studies on synthetic networks to further assess the effectiveness of our algorithm. Finally, we briefly give some concluding remarks and discuss future work.


Introduction
Social network analysis is a part of social science, the academic discipline that studies a society and the behavior of the entities therein. A social network consists of a set of entities (called actors) with certain interactions (represented by ties) among them. Statistical modeling has been a popular and powerful tool for studying social networks thanks to its solid theoretical foundation. A plethora of statistical models have been established and exploited to uncover the relational structure of social networks and the dyadic ties among actors. Friendship among Facebook users, business relationships across companies on Wall Street, and collaborations among researchers in a scientific field are all examples of social networks that have been extensively studied in the past. Social network analysis has a long history in sociology, where classical works trace back to the 1940s and 1950s (Rapoport, 1949a,b, 1950; Harary, 1953; Cartwright and Harary, 1956).
Modern research on social network analysis within mathematics, physics and other scientific disciplines focuses mainly on the following three distinctive network features. The first feature is to explore how local mechanisms of network formation produce global network structure. Two representative models are the network evolution model (Newman, 2001) and the nodal attribute model (Boguñá et al., 2004). We refer the readers to Toivonen et al. (2009) for a comparison of these two models, and to the survey paper (Snijders, 2001) for a complete review of related statistical models. The second feature is to investigate topological properties of social networks and develop methods of modeling, either analytically or numerically. Two of the most popular properties of social networks are and such "tie" was parameterized by a measure called the clustering coefficient. This natural phenomenon in social networks was also discussed extensively by Newman (2001) and Newman et al. (2001). The formation of a cluster requires that the connections among actors within the cluster be significantly denser than those between actors from different clusters. It was posited in some of the literature, e.g., Holland et al. (1983), that a high probability of ties occurring between actors within a cluster is due to some kind of homology (also called "internal homogeneity") of the actors. For instance, students from the same department of a college tend to form a community in which almost everybody is a friend of everybody else (i.e., the students in the same community are more likely to be connected), while students with different educational backgrounds are much less likely to be connected. Such internal homogeneity is mostly reflected in a background parameter (e.g., same department) and a location parameter (e.g., same college).
In this paper, we propose a simple but effective method for accurately clustering the entities in a social network into mutually exclusive communities. The proposed model is inspired by, and extends, the classical stochastic blockmodel (SBM, Nowicki and Snijders, 2001). Recently, a variety of models extending the SBM have appeared in the literature. For instance, Sengupta and Chen (2018) introduced an SBM adjusted by node popularity, Huang et al. (2020) established an SBM for heterogeneous networks accounting for node attributes, and Noroozi and Pensky (2022) suggested a nested SBM integrating the standard SBM and the LSM. Different from the existing literature, we specifically consider a flexible function to measure the similarities between actors in a network. Mixed membership is allowed for each actor in our model. The model is fitted in a Bayesian framework. The advantages of our model over the classical SBM are detailed and discussed in the subsequent section. This paper not only introduces a flexible and extensible model allowing mixed memberships for network actors, but also gives interested researchers, especially those relatively new to the field, insights into a standard approach to conducting statistical inference for social network clustering problems.
The rest of this paper is organized as follows. We review some representative model-based methods for social network clustering, with an additional focus on the SBM, in Section 2. We propose a mixed membership model based on a simple similarity function in Section 3. Theoretical parameter estimation and an associated MCMC algorithm are presented in Section 4. Two empirical social network examples, the Zachary karate club data and the bottlenose dolphin network data, are used to evaluate the performance of our model in Sections 5 and 6, respectively. We then conduct a simulation study on synthetic data in Section 7. Finally, we give some concluding remarks and propose some future work in Section 8.

Notations for Stochastic Blockmodels
In general, methods for social network clustering can be summarized into two categories: metric-based methods and model-based methods. A metric-based method specifies an objective function which evaluates the quality of each network clustering strategy, followed by an algorithm optimizing the objective function (e.g., Ng et al., 2001; Shi and Malik, 2000; Newman et al., 2002; Ouyang et al., 2020). A model-based method, in contrast, proposes a (parametric) graphical generative model that characterizes the community structure of a social network, followed by an algorithm estimating the membership parameters conditional on the observed data, mostly in a Bayesian framework. To date, a variety of graphical models for social network clustering have emerged, including but not limited to stochastic blockmodels (SBM, Nowicki and Snijders, 2001; Airoldi et al., 2008; Abbe, 2018; Gao et al., 2018), latent space models (LSM, Hoff et al., 2002; Handcock and Raftery, 2007; Sewell and Chen, 2017), random dot product graphs (RDPG, Young and Scheinerman, 2007; Marchette and Priebe, 2008; Lyzinski et al., 2017; Athreya et al., 2018), and exponential random graph models (ERGM, Snijders et al., 2006; Hunter et al., 2008; Fronczak et al., 2013), among others. The core idea of model-based methods is to theoretically uncover the probabilistic and statistical properties of the proposed models.
To begin with, we introduce some notations that will be used throughout the paper. In general, a network is modeled by a mathematical undirected (or directed) graph consisting of a set of nodes which represent actors (e.g., Facebook users) in the network, and a set of undirected (or directed) edges which represent the relational ties between pairs of nodes (e.g., friendship connections between Facebook users). Let $n$ be the number of nodes in an undirected social network.
The observation of the network can be mathematically represented by an $n \times n$ dyadic adjacency matrix $A = (A_{ij})_{n \times n}$, where $A_{ij} = 1$ if nodes $i$ and $j$ are connected, and $A_{ij} = 0$ otherwise. For undirected networks, adjacency matrices are symmetric. If a network is directed, $A_{ij} = 1$ refers to a directed relation from $i$ (initiator) to $j$ (receiver), and the associated adjacency matrix $A$ may be asymmetric.
More specifically, we consider the stochastic blockmodel, first proposed by Snijders and Nowicki (1997). Although directed networks were considered in Snijders and Nowicki (1997), we simplify the problem to undirected networks for the sake of explanation. The model and the related methods can be extended to directed networks effortlessly. Our goal is to cluster a network of order $n$ into $h$ distinct communities. For each node $i = 1, 2, \ldots, n$, let $c(i)$ denote the community membership function for $i$. Assuming that $c(1), c(2), \ldots, c(n)$ are independently and identically distributed (i.i.d.) multinomial random variables with a hyperparameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_h)$, one defines $B$ as an $h \times h$ symmetric probability matrix indicating linkages across different communities.
Conditioning on $c(1), c(2), \ldots, c(n)$, the distribution of $A_{ij}$ for each pair of nodes $i$ and $j$ is Bernoulli with probability $B_{c(i)c(j)}$.
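To make the generative process concrete, the following is a minimal Python sketch of drawing one undirected network from the SBM described above. The function name `simulate_sbm`, the 0-based community labels, and the parameter values in the usage lines are illustrative assumptions, not part of the original model specification.

```python
import random

def simulate_sbm(n, theta, B, seed=0):
    """Draw labels c(i) ~ Multinomial(theta), then A_ij ~ Bernoulli(B[c(i)][c(j)])."""
    rng = random.Random(seed)
    h = len(theta)
    # i.i.d. community labels, coded 0, ..., h-1 (an indexing assumption)
    c = rng.choices(range(h), weights=theta, k=n)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < B[c[i]][c[j]]:
                A[i][j] = A[j][i] = 1  # undirected network: keep A symmetric
    return c, A

# Hypothetical parameter values, chosen only for illustration.
theta = [0.5, 0.5]
B = [[0.8, 0.05],
     [0.05, 0.8]]
c, A = simulate_sbm(n=30, theta=theta, B=B)
print(sum(map(sum, A)) // 2, "edges among", len(c), "nodes")
```

Only the upper triangle is sampled and then mirrored, which matches the symmetric adjacency matrix assumed for undirected networks.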
Since $A$ is observed, our goal becomes to estimate the hyperparameters in $\theta$ and the probability matrix $B$, and ultimately to uncover the network structure by inferring $c = (c(1), c(2), \ldots, c(n))$. The estimation can be performed in a Bayesian framework:
$$p(c, \theta, B \mid A) \propto p(A \mid c, B)\, p(c \mid \theta)\, \pi(\theta, B),$$
where $\pi(\theta, B)$ denotes a joint prior distribution of $\theta$ and $B$.
The community prediction of node i is the index of the membership with the largest posterior probability.
The classical SBM has several shortcomings. One is that each actor in the network can be assigned to only one community, which may not be the case for many real social networks. A mixed membership SBM, inspired by the latent Dirichlet allocation (LDA, Blei et al., 2003), was proposed by Airoldi et al. (2008) to overcome this limitation. The model in Airoldi et al. (2008) allows each actor in the network to possess multiple community memberships. In addition, it seems natural to define a function that quantitatively measures the similarities (or dissimilarities) between the actors in a social network space. Such functions are viewed as an indispensable part of clustering analysis, but are not considered in the classical SBM. In Section 3, we propose a mixed membership probabilistic model based on a simple and well-defined similarity function.

A Mixed Membership Model
In this section, we propose a simple generative model which admits multiple memberships (of actors) for social network clustering. The development of the model is based on a probabilistic relationship between the observed adjacency matrix $A$ and a similarity function, more specifically, the cosine similarity.
We start by introducing some additional notations and preliminaries. Consider a social network consisting of $n$ nodes to be clustered into $h$ distinct communities with $h \le n$. For each node $i = 1, 2, \ldots, n$, let $Z_i = (Z_{i1}, Z_{i2}, \ldots, Z_{ih})^\top$ be an $h \times 1$ vector that represents the mixed membership of node $i$ across the $h$ communities. To be specific, for $1 \le k \le h$, $Z_{ik}$ refers to the probability that node $i$ belongs to community $k$. The very special case $Z_{ik} = 1$ for some $k$ indicates that node $i$ is assigned to community $k$ with probability 1, without uncertainty, though this is very rare in practice.
For any two $s$-dimensional vectors $x$ and $y$, the cosine similarity between $x$ and $y$ is
$$\cos(x, y) = \frac{x^\top y}{\|x\|_2 \, \|y\|_2},$$
where $\|\cdot\|_2$ refers to the standard $\ell_2$ norm. Thus, the corresponding dissimilarity function is $1 - \cos(x, y)$.
We choose the cosine similarity as the measure of similarity in our model for three major reasons:
1. Cosine similarity is a simple measure, and it can be easily applied to high-dimensional data.
2. Cosine similarity has a standard statistical interpretation, as it is equivalent to the Pearson correlation coefficient for data that are centered by the mean.
3. For vectors with nonnegative entries, such as the membership vectors considered here, cosine similarity takes values in [0, 1], so it is ready for modeling link density.
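As a small illustration of the similarity measure, the sketch below computes the cosine similarity for a few membership vectors in plain Python. The function name `cosine_similarity` and the example vectors are assumptions made for illustration only.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity x'y / (||x||_2 ||y||_2) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same = cosine_similarity([1.0, 0.0], [1.0, 0.0])      # identical memberships
disjoint = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # disjoint memberships
mixed = cosine_similarity([0.7, 0.3], [0.4, 0.6])     # partially overlapping
print(same, disjoint, round(mixed, 2))  # 1.0 0.0 0.84
```

Because membership vectors have nonnegative entries, the value stays between 0 and 1 and can be read directly as a link probability.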
Recall the adjacency matrix $A = (A_{ij})$. Assuming that the $A_{ij}$'s are mutually independent, we incorporate a Bernoulli model into the link distribution of $A_{ij}$ for nodes $i$ and $j$, given their mixed community memberships $Z_i$ and $Z_j$; that is,
$$A_{ij} \mid Z_i, Z_j \sim \mathrm{Bernoulli}\bigl(p(Z_i, Z_j)\bigr), \qquad p(Z_i, Z_j) = \cos(Z_i, Z_j) = \frac{Z_i^\top Z_j}{\|Z_i\|_2 \, \|Z_j\|_2}.$$
By the assumption of conditional independence, we obtain the likelihood function of the adjacency matrix $A$,
$$L(A \mid Z) = \prod_{1 \le i < j \le n} p(Z_i, Z_j)^{A_{ij}} \bigl[1 - p(Z_i, Z_j)\bigr]^{1 - A_{ij}},$$
where $Z = (Z_1, Z_2, \ldots, Z_n)$ is an $h \times n$ matrix which represents the community memberships of all nodes. It is worth mentioning that $Z$ directly reflects node memberships, so it should not be interpreted as latent positions as in the LSM (Hoff et al., 2002). Membership parameters and latent positions are conceptually nonequivalent, though the latter usually have an impact on network connectivity and are implicitly related to node membership. Our goal is to predict $Z$ given the observation of $A$, which can be done via the algorithm shown in Section 4.
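The likelihood above can be evaluated on the log scale as in the following minimal sketch. It assumes the link probability is the cosine similarity of the two membership vectors, stores the membership vectors as a list of $n$ rows (the transpose of the $h \times n$ matrix $Z$), and clips probabilities away from 0 and 1 with a small `eps` to avoid taking the logarithm of zero. The clipping and all names are illustrative assumptions.

```python
import math

def link_prob(zi, zj):
    """Cosine similarity of two membership vectors, used as the link probability."""
    dot = sum(a * b for a, b in zip(zi, zj))
    return dot / math.sqrt(sum(a * a for a in zi) * sum(b * b for b in zj))

def log_likelihood(A, Z, eps=1e-12):
    """Bernoulli log-likelihood of a symmetric 0/1 matrix A over all pairs i < j."""
    n = len(A)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Clip to (eps, 1 - eps); this is a numerical safeguard, not part of the model.
            p = min(max(link_prob(Z[i], Z[j]), eps), 1.0 - eps)
            total += math.log(p) if A[i][j] == 1 else math.log(1.0 - p)
    return total

# Toy network: nodes 0-1 are linked and nodes 2-3 are linked.
A = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
Z_good = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]  # agrees with the links
Z_bad = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]   # contradicts the links
print(log_likelihood(A, Z_good) > log_likelihood(A, Z_bad))  # True
```

Memberships that place linked nodes in the same community receive a higher log-likelihood than memberships that separate them.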

Parameter Estimation
In this section, we estimate the parameters in our mixed membership model via a standard Bayesian method. First, we posit a prior distribution for $Z$. Notice that each component $Z_i$ of $Z$ consists of $h$ elements representing probabilities that add up to 1. The Dirichlet distribution appears to be a reasonable and widely accepted choice for its prior. For $i = 1, 2, \ldots, n$, we assume $Z_i \sim \mathrm{Dirichlet}(\alpha)$ with density
$$\pi(Z_i \mid \alpha) = \frac{\Gamma\bigl(\sum_{k=1}^{h} \alpha_k\bigr)}{\prod_{k=1}^{h} \Gamma(\alpha_k)} \prod_{k=1}^{h} Z_{ik}^{\alpha_k - 1},$$
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_h)$ is an $h$-dimensional hyperparameter vector. The initial selection of $\alpha$ is flexible unless related information is available. In practice, one may choose each element in $\alpha$ to be equal to $1/h$. For each $Z_i$ in $Z$, our goal is to approximate the posterior distribution of $Z_i$ given $A$. We exploit the Gibbs sampling algorithm proposed by Gelfand and Smith (1990).
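For readers who wish to experiment with the prior, the sketch below draws a membership vector from a Dirichlet distribution by normalizing independent gamma draws, a standard construction, and evaluates the Dirichlet log-density. The function names and the choice $h = 2$ are illustrative assumptions.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def log_dirichlet_density(z, alpha):
    """Log-density of Dirichlet(alpha) at a point z in the interior of the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, z))

h = 2
alpha = [1.0 / h] * h          # each element equal to 1/h, as suggested in the text
rng = random.Random(1)
z = sample_dirichlet(alpha, rng)
print(z, sum(z))
print(log_dirichlet_density(z, alpha))
```

With every element of $\alpha$ below 1, the prior places more mass near the corners of the simplex, that is, on nearly pure memberships.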
Gibbs sampling is a well-developed MCMC algorithm, which is popular for its simplicity and versatility. It first appeared in Geman and Geman (1984), and the theoretical properties of the algorithm were discussed extensively by Casella and George (1992) and Gelfand and Smith (1990). It was proven in Geman and Geman (1984) that the distribution of the simulated samples converges to the posterior distribution of the parameters given the observations, regardless of the starting state (i.e., the initial values of the chain). The key of Gibbs sampling is to simulate the next generation of unknown parameters based on the estimates at the current state. Let $Z^{(m)} = (Z_1^{(m)}, Z_2^{(m)}, \ldots, Z_n^{(m)})$ be the estimate in the current iteration. We simulate $Z^{(m+1)}$ in the following way: for $i = 1, 2, \ldots, n$, simulate $Z_i^{(m+1)}$ from the conditional distribution of $Z_i$ given $Z_1^{(m+1)}, \ldots, Z_{i-1}^{(m+1)}, Z_{i+1}^{(m)}, \ldots, Z_n^{(m)}$, $\alpha$ and $A$. In order to implement the algorithm, we derive the posterior distribution of $Z_i$, given $Z_{-i} = (Z_1, \ldots, Z_{i-1}, Z_{i+1}, \ldots, Z_n)$, $\alpha$ and $A$.
According to the definition of conditional probability, we have
$$\pi(Z_i \mid Z_{-i}, \alpha, A) \propto \prod_{k=1}^{h} Z_{ik}^{\alpha_k - 1} \prod_{j \ne i} p(Z_i, Z_j)^{A_{ij}} \bigl[1 - p(Z_i, Z_j)\bigr]^{1 - A_{ij}}, \tag{3}$$
where $p(Z_i, Z_j) = \cos(Z_i, Z_j)$ is the link probability between nodes $i$ and $j$. Since the density function expressed in Equation (3) is not from any well-known distribution, we use another well-studied MCMC algorithm, the Metropolis-Hastings sampling, to simulate from the density function at each Gibbs iteration. We present the Gibbs sampling procedures in Algorithm 1. Notice that burninNum in the input of Algorithm 1 refers to a burn-in number, i.e., a threshold of the Gibbs iterations after which the distribution of our simulated samples is regarded as having converged to the posterior distribution of the target parameters.
We thus only keep the simulated estimates after the burn-in number (as reflected in Line 11 in Algorithm 1). We usually choose a large burn-in number such that with a high probability, the MCMC iterations have converged to the true posterior distribution.
Algorithm 1: The Gibbs sampling algorithm for the proposed mixed membership model.
Input: burninNum = 5000, size = 10000, empty set posteriorSample
Initialization: $Z_{ik} \leftarrow 1/h$ for all $i = 1, \ldots, n$ and $k = 1, \ldots, h$;
Initialization: iterNum ← 1;
repeat
    for i = 1 to n do
        simulate $Z_i$ from the density in Equation (3) by Metropolis-Hastings sampling;
    end
    if iterNum > burninNum then
        add $Z$ to posteriorSample;
    end
    iterNum ← iterNum + 1;
until iterNum > size + burninNum;
Output: posteriorSample
After obtaining the posteriorSample of $Z$, we compute the sample mean $\bar{Z}_i$ as the Bayes estimate for $Z_i$, for each $i = 1, 2, \ldots, n$. For hard clustering, in which each node in the network belongs to only one community, we assign every node to the community whose probability dominates the estimated membership parameter, i.e., $\arg\max_k \bar{Z}_{ik}$.
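The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the source does not specify the Metropolis-Hastings proposal, so the sketch uses a single independence proposal drawn from the Dirichlet prior at each update (which makes the prior cancel in the acceptance ratio), clips link probabilities with a small `eps`, and uses a tiny toy network with short chains. All function names are hypothetical.

```python
import math
import random

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def log_lik_node(i, zi, Z, A, eps=1e-9):
    """Log-likelihood terms of A that involve node i, with Z_i set to zi."""
    total = 0.0
    for j in range(len(Z)):
        if j != i:
            p = min(max(cosine(zi, Z[j]), eps), 1.0 - eps)  # avoid log(0)
            total += math.log(p) if A[i][j] else math.log(1.0 - p)
    return total

def sample_dirichlet(alpha, rng):
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def gibbs_sampler(A, h, burnin_num, size, seed=0):
    rng = random.Random(seed)
    n = len(A)
    alpha = [1.0 / h] * h
    Z = [[1.0 / h] * h for _ in range(n)]        # Z_ik <- 1/h
    posterior_sample = []
    for iter_num in range(1, burnin_num + size + 1):
        for i in range(n):
            # One Metropolis-Hastings step targeting Z_i | Z_{-i}, alpha, A.
            # Proposal = prior (an assumption), so the ratio is a likelihood ratio.
            proposal = sample_dirichlet(alpha, rng)
            log_ratio = log_lik_node(i, proposal, Z, A) - log_lik_node(i, Z[i], Z, A)
            if rng.random() < math.exp(min(0.0, log_ratio)):
                Z[i] = proposal
        if iter_num > burnin_num:                # keep draws only after burn-in
            posterior_sample.append([row[:] for row in Z])
    return posterior_sample

def posterior_mean(posterior_sample):
    m, n, h = len(posterior_sample), len(posterior_sample[0]), len(posterior_sample[0][0])
    return [[sum(s[i][k] for s in posterior_sample) / m for k in range(h)]
            for i in range(n)]

def hard_clusters(Z_bar):
    return [max(range(len(z)), key=lambda k: z[k]) for z in Z_bar]

# Toy network: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for a, b in edges:
    A[a][b] = A[b][a] = 1

sample = gibbs_sampler(A, h=2, burnin_num=200, size=300)
Z_bar = posterior_mean(sample)
labels = hard_clusters(Z_bar)
print(Z_bar)
print(labels)
```

Since the cosine similarity is unchanged when the community labels are permuted, the labels returned by such a sampler are meaningful only up to relabeling, and a comparison with a reference partition should account for this.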

Example: Zachary Karate Club Data
In this section and the next, we evaluate the performance of our mixed membership model by applying it to two empirical social network data sets. The first is the Zachary karate club data, which was collected and used to study conflict and fission in Zachary (1977). The data was from a university-based karate club of 34 members, who were tentatively divided into two groups due to an incipient conflict between the president of the club and the opposing faction. Consider the club as a social network consisting of 34 nodes that represent club members. An edge is added between a pair of nodes if the two members are observed to interact outside normal activities, interpreted as "extra" friendship in Zachary (1977). A total of 78 (undirected) edges are observed; see Zachary (1977, Figure 1). The corresponding adjacency matrix was presented in Zachary (1977, Figure 2).
We apply the mixed membership model proposed in Section 3 to split the karate club members into two factions, and compare our clustering result with the ground truth released by Zachary (1977). Based on the features of the karate club network data and the background story, we set the number of communities $h = 2$. We implement Algorithm 1, for which burninNum and size take values 5,000 and 10,000, respectively. The posterior mean $\bar{Z}_{ik}$, for $i = 1, 2, \ldots, n$ and $k = 1, 2$, is used as the Bayes estimate for the mixed membership parameter $Z_{ik}$. The result is presented in Table 1. Under a hard clustering framework, a graphic summary is presented in Figure 1a, where nodes are marked by their assigned community. For the purpose of comparison, the ground truth corresponding to Zachary (1977, Figure 1) and Zachary (1977, Table 1) is portrayed in Figure 1b. We observe that the entity labeled 10 is the only node misclassified by our model. We cluster node 10 into community 1, but in reality node 10 joins community 2. The misclassification of node 10 probably occurs because the node is connected with one node (node 3) from community 1, and is also connected with one node (node A) from community 2.
However, node A is the center of community 2, hence more influential in the network.
Additionally, Table 1 shows that the membership parameter estimate for node 10 is 0.5001 for community 1 versus 0.4999 for community 2, so the difference is minimal.

Example: Bottlenose Dolphin Network Data
In this section, we analyze the bottlenose dolphin network data from Lusseau et al. (2003).
A study identifying the roles that bottlenose dolphins played in their social network was conducted by Lusseau and Newman (2004). The network data was collected for 62 bottlenose dolphins living in Doubtful Sound, New Zealand, over a period of seven years from 1994 to 2001. The bottlenose dolphins are represented by nodes in the network, and ties between nodes are interpreted as associations between dolphin pairs occurring more often (due to some sort of homophily) than expected by chance. There is a total of 318 edges observed in the network.
A natural division of the bottlenose dolphin network was discussed in Lusseau and Newman (2004), and it was obtained via an accurate and sensitive clustering algorithm proposed by Girvan and Newman (2002). The algorithm therein was based on a newly defined "betweenness" measure generalized from the one defined in Freeman (1977). Two communities were detected for the bottlenose dolphin network, shown in Lusseau and Newman (2004, Figure 1(a)), as well as in Figure 2b, for the purpose of comparison.
We set $h = 2$ in our mixed membership model based on the conclusion from Lusseau and Newman (2004). Both burninNum and size take the value 50,000. Executing Algorithm 1 with the new burninNum and size, we obtain the mixed membership probabilities for the bottlenose dolphins, organized in Table 2. We also depict the hard clustering result in Figure ?? for a better visualization. The nodes in community 1 are colored orange, while the nodes in community 2 are colored blue.
Comparing the clustering result of our mixed membership model with that of the betweenness-based model in Lusseau and Newman (2004), we find that the community classification matches for most of the nodes in the network, except for "Beak", "Bumper", "Fish", "Oscar", "PL", "SN89", "SN96" and "TR77" on the boundary. Lusseau and Newman (2004), in fact, assigned these dolphins to a sub-community of Community 1 using their algorithm.

Simulations
We have shown the identifiability and reliability of our model as well as the proposed algorithm through two empirical social network examples in Sections 5 and 6. However, both of those networks contain only a relatively small number of nodes, which are divided into only two communities. In this section, we run a few more simulations to further evaluate the performance of our algorithm. We simulate several SBMs with different predetermined community structures. Each block in the simulated SBMs is generated by implementing an algorithm for the Erdős-Rényi graph (Gilbert, 1959). There are three key parameters for the simulated SBMs: class size, within-cluster link density and cross-cluster link density.
Noticing that the community structure of the simulated networks is known, we can use this information as the ground truth for assessment.
Two well-defined metrics, the Normalized Mutual Information (NMI) (Meilǎ, 2007) and the Adjusted Rand Index (ARI) (Rand, 1971), are adopted to examine the closeness between the clustering results of our algorithm and the ground truths. In addition, we apply another commonly used method for social network clustering, the modularity maximization algorithm (Newman, 2006), to the simulated SBMs for further comparison.
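For reference, both metrics can be computed from the contingency table of two label vectors, as in the following self-contained sketch. The NMI here is normalized by the arithmetic mean of the two entropies, which is one common convention and an assumption on our part, since the text does not state which normalization is used. The function names are hypothetical.

```python
import math
from collections import Counter

def _contingency(u, v):
    """Joint and marginal counts of two label vectors of equal length."""
    return Counter(zip(u, v)), Counter(u), Counter(v)

def adjusted_rand_index(u, v):
    n = len(u)
    joint, cu, cv = _contingency(u, v)
    index = sum(math.comb(m, 2) for m in joint.values())
    sum_u = sum(math.comb(m, 2) for m in cu.values())
    sum_v = sum(math.comb(m, 2) for m in cv.values())
    expected = sum_u * sum_v / math.comb(n, 2)   # expected index under random labeling
    max_index = 0.5 * (sum_u + sum_v)
    if max_index == expected:                    # degenerate case, e.g. trivial partitions
        return 1.0
    return (index - expected) / (max_index - expected)

def normalized_mutual_information(u, v):
    n = len(u)
    joint, cu, cv = _contingency(u, v)
    mi = sum((m / n) * math.log(m * n / (cu[a] * cv[b]))
             for (a, b), m in joint.items())
    h_u = -sum((m / n) * math.log(m / n) for m in cu.values())
    h_v = -sum((m / n) * math.log(m / n) for m in cv.values())
    if h_u + h_v == 0.0:                         # both partitions have a single cluster
        return 1.0
    return 2.0 * mi / (h_u + h_v)                # arithmetic-mean normalization (assumed)

truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]   # same partition, labels swapped
crossed = [0, 1, 0, 1]     # splits every true community in half
print(adjusted_rand_index(truth, relabeled), normalized_mutual_information(truth, relabeled))
print(adjusted_rand_index(truth, crossed), normalized_mutual_information(truth, crossed))
```

Both metrics depend only on the partition and not on the label names, so a clustering that recovers the truth up to relabeling attains the maximum value of 1.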
We simulate a total of five SBMs with known community structure as summarized in Table 3. SBM1 and SBM2 are both of moderate size (i.e., of order 100), containing two communities (of sizes 80 and 20, respectively). The within-cluster link densities are  For each SBM, we set burninNum at 1000 and size at 2000, respectively. The proposed algorithm is run 30 times, and for each result, both ARI and NMI are computed. The averages of all 30 ARI's (i.e., $\overline{\mathrm{ARI}}$) and NMI's (i.e., $\overline{\mathrm{NMI}}$) are used as estimates for evaluating the performance of the algorithm. In addition, we apply the modularity maximization algorithm to all five simulated SBMs, and compute the corresponding ARI and NMI. These results are presented in Table 4.
We observe that the proposed algorithm performs well in general for all simulated SBMs. On the other hand, the modularity maximization algorithm appears to suffer from several severe clustering problems. The first problem that we notice is overclustering. In theory, there is no cluster structure in the Erdős-Rényi graph. However, the modularity maximization algorithm divides predetermined communities (i.e., the Erdős-Rényi graphs) to reach a higher modularity index for small networks, reflected in the In SBM4, the modularity index reaches the global maximum when the two smaller communities merge together. The inconsistency of the modularity maximization algorithm was discussed extensively by Bickel and Chen (2009). However, it seems that the modularity maximization algorithm performs better when all the communities are large in size, for instance, in SBM3. Besides, our algorithm and the modularity maximization algorithm both perform perfectly well when the sizes of the communities in the network are similar.
Nevertheless, we conclude that the proposed algorithm is more robust for social network clustering.

Concluding Remarks
In this paper, we develop a simple but novel model-based method for social network clustering. We adopt the cosine function to measure similarities between nodes. In addition, we propose an algorithm based on Gibbs sampling to simulate posterior samples of the mixed community memberships for the entities in the network. Our model is not only flexible for fuzzy clustering, but also amenable to hard clustering. We would like to point out that our model is reliable owing to the solid theoretical foundation of the Bayesian approach and MCMC algorithms. We evaluate the performance of our model through two empirical social network data sets and simulations. Based on comparisons with ground truth, we conclude that our model provides accurate clustering for social network data.
Finally, we discuss several limitations of our model, and propose some future studies.
First, it is known that MCMC algorithms are slow to reach the stationary distribution. The complexity of the proposed algorithm in this paper is $O(n^2)$ for each Gibbs iteration. In addition, a large burninNum is usually needed to ensure convergence. Admittedly, the algorithm is not efficient, especially when the number of parameters or the size of the network data or both are large. There is a pressing need to develop faster algorithms for our mixed membership model. One alternative is the Hamiltonian Monte Carlo (HMC) algorithm, which can accelerate convergence to the target distribution by simulating Hamiltonian dynamics. We refer the interested readers to Neal (2011) for a detailed explanation of HMC, and to Betancourt (2017) for an exposition of the intuition behind HMC. Another possible approach is to use variational Bayesian methods to convert simulation procedures to optimization problems, and then implement some appropriate approximation algorithms.
Second, our current model itself can be improved.
(1) The proposed model measures node relationships based on cosine similarity, which is analogous to Pearson correlation, so it may fail to preserve membership homophily for the nodes close to or on the cluster boundary. These nodes usually have similar entries in their corresponding membership variables, but the proposed model favors connections among these nodes regardless of their actual membership information. For instance, suppose $Z_i = (0.51, 0.49)$ and $Z_j = (0.49, 0.51)$; we then have $p(Z_i, Z_j) = 0.999$, even though $i$ and $j$ belong to different communities in our setting.
(2) The proposed model does not particularly focus on sparse networks. One may consider tweaking the cosine similarity as $p(Z_i, Z_j) = \rho_n \cos(Z_i, Z_j)$, with a scaling factor $\rho_n \to 0$ as $n$ increases, to incorporate network sparsity.
(3) The proposed model does not yet account for the information possibly contained in the nodes. We would like to consider a more complete model which utilizes those auxiliary variables so as to further improve clustering accuracy.
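The boundary example and the sparsity adjustment mentioned above can be checked numerically with the short sketch below. The value of the scaling factor `rho` is a hypothetical choice for illustration, since the text only requires that it tend to 0 as $n$ grows.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

z_i = [0.51, 0.49]
z_j = [0.49, 0.51]

p = cosine(z_i, z_j)
label_i = max(range(2), key=lambda k: z_i[k])  # hard label of node i
label_j = max(range(2), key=lambda k: z_j[k])  # hard label of node j
print(round(p, 3), label_i, label_j)  # 0.999 0 1

rho = 0.1            # hypothetical scaling factor for a sparse network
p_sparse = rho * p   # scaled link probability rho * cos(Z_i, Z_j)
print(round(p_sparse, 4))
```

The two nodes receive different hard labels, yet their link probability is almost 1, which is the loss of membership homophily near the boundary described in item (1).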
Third, the communities considered in our model are distinct. One direction of our future work is to explore the possibility of extending our model to overlapping communities, as in Xie et al. (2013).
Lastly, our model, like most other graphical generative models, requires prior knowledge of the number of communities to which nodes are assigned. However, this number is usually unavailable. Estimating the number of communities and the membership parameters simultaneously could be a challenging task. A recent research paper (Geng et al., 2019) provides some guidance for future study in this direction.