Leveraging disjoint communities for detecting overlapping ...

19
arXiv:1504.06608v1 [cs.SI] 14 Apr 2015 Leveraging disjoint communities for detecting overlapping community structure Tanmoy Chakraborty Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur, India - 721302 E-mail: its [email protected] Accepted at Journal of Statistical Mechanics: Theory and Experiment (JSTAT) April 2015 Abstract. Network communities represent mesoscopic structure for understanding the organization of real-world networks, where nodes often belong to multiple communities and form overlapping community structure in the network. Due to non-triviality in finding the exact boundary of such overlapping communities, this problem has become challenging, and therefore huge effort has been devoted to detect overlapping communities from the network. In this paper, we present PVOC (Permanence based Vertex-replication algorithm for Overlapping Community detection), a two-stage framework to detect overlapping community structure. We build on a novel observation that non-overlapping community structure detected by a standard disjoint community detection algorithm from a network has high resemblance with its actual overlapping community structure, except the overlapping part. Based on this observation, we posit that there is perhaps no need of building yet another overlapping community finding algorithm; but one can efficiently manipulate the output of any existing disjoint community finding algorithm to obtain the required overlapping structure. We propose a new post-processing technique that by combining with any existing disjoint community detection algorithm, can suitably process each vertex using a new vertex-based metric, called permanence, and thereby finds out overlapping candidates with their community memberships. Experimental results on both synthetic and large real-world networks show that PVOC significantly outperforms six state-of-the-art overlapping community detection algorithms in terms of high similarity of the output with the ground-truth structure. Thus our framework not only finds meaningful overlapping communities from the network, but also allows us to put an end to the constant effort of building yet another overlapping community detection algorithm. 1. Introduction One of the most used aspects of social network analysis is to discover and display clusters and communities in networks – the dense sub-networks, where there are more links internally, than externally. It is easy for the common person to spot dense clusters of connection in a small network visualization. However, this is extremely difficult problem

Transcript of Leveraging disjoint communities for detecting overlapping ...

arX

iv:1

504.

0660

8v1

[cs.

SI]

14

Apr

201

5

Leveraging disjoint communities for detecting

overlapping community structure

Tanmoy Chakraborty

Department of Computer Science & Engineering

Indian Institute of Technology, Kharagpur, India - 721302

E-mail: its [email protected]

Accepted at Journal of Statistical Mechanics: Theory and Experiment (JSTAT)

April 2015

Abstract. Network communities represent mesoscopic structure for understanding

the organization of real-world networks, where nodes often belong to multiple

communities and form overlapping community structure in the network. Due to

non-triviality in finding the exact boundary of such overlapping communities, this

problem has become challenging, and therefore huge effort has been devoted to detect

overlapping communities from the network.

In this paper, we present PVOC (Permanence based Vertex-replication algorithm

for Overlapping Community detection), a two-stage framework to detect overlapping

community structure. We build on a novel observation that non-overlapping

community structure detected by a standard disjoint community detection algorithm

from a network has high resemblance with its actual overlapping community structure,

except the overlapping part. Based on this observation, we posit that there is

perhaps no need of building yet another overlapping community finding algorithm;

but one can efficiently manipulate the output of any existing disjoint community

finding algorithm to obtain the required overlapping structure. We propose a new

post-processing technique that by combining with any existing disjoint community

detection algorithm, can suitably process each vertex using a new vertex-based metric,

called permanence, and thereby finds out overlapping candidates with their community

memberships. Experimental results on both synthetic and large real-world networks

show that PVOC significantly outperforms six state-of-the-art overlapping community

detection algorithms in terms of high similarity of the output with the ground-truth

structure. Thus our framework not only finds meaningful overlapping communities

from the network, but also allows us to put an end to the constant effort of building

yet another overlapping community detection algorithm.

1. Introduction

One of the most used aspects of social network analysis is to discover and display clusters

and communities in networks – the dense sub-networks, where there are more links

internally, than externally. It is easy for the common person to spot dense clusters of

connection in a small network visualization. However, this is extremely difficult problem

Leveraging disjoint communities for detecting overlapping community structure 2

to detect such groups from large scale networks. There has been a constant effort since

last one decade from the researchers of both computer science and physics domains to

explore such community structure from networks after the pioneering effort of Girvan

and Newman [1]. Today there are dozens of community detection algorithms that

can detect the disjoint/non-overlapping community structure from the network using

different heuristics and frameworks (see [2, 3] for the survey). However, in real-world

scenario, it has been observed that a node can be a part of multiple communities, which

has eventually led to the idea of overlapping/soft communities [4, 5, 6, 7]. This problem

is even more harder because of the exponential number of possible solutions. Therefore, a

new direction of research has been started to detect the overlapping community structure

from the network (see [8] for the survey).

The dichotomy between “disjoint” and “overlapping” community detection

algorithms is unfortunate because it limits the application of each algorithm. If

a network has overlapping communities, a “disjoint” algorithm cannot find them;

conversely, if communities are known to be disjoint, a “disjoint” algorithm will generally

perform better than an “overlapping” algorithm. Therefore, to obtain the actual

community structure, it is important to choose the right kind of algorithm. Note that

the question of how to choose the right kind of algorithm is outside the scope of the

present paper.

However, we hypothesize that there is perhaps no need to develop yet another

overlapping community finding algorithm given the assumption that we have diverse

and efficient disjoint community detection algorithms in hand. In this paper, we present

a method to allow any “disjoint” community detection algorithm to be used to detect

overlapping community structure instead for finding another overlapping community

detection algorithm. This means that a user wishing to find overlapping communities

need no longer be forced to use one of the overlapping algorithms that exist, but can also

choose from the many disjoint community finding algorithms. The proposed framework

is called as PVOC (Permanence based Vertex-replication algorithm for Overlapping

Community detection) which is a two-phase framework – in the first step, an efficient

disjoint community detection algorithm is used to detect the non-overlapping community

structure from the network; in the second step, each node in the disjoint communities

is processed appropriately using a new vertex-based metric, called permanence [9], in

order to measure the extent of belongingness of a vertex in its own community and

its attached neighboring communities. If the membership of the vertex in its assigned

community is similar to that in the neighboring community, we assign the vertex into the

neighboring community, keeping its original community intact. Thus the post-processing

step is the fundamental component in PVOC to find out overlapping vertices from the

non-overlapping structure.

We compare our framework with six state-of-the-art overlapping community

detection algorithms on both synthetic and large real-world networks (whose ground-

truth community structure is available). We observe that PVOC significantly

outperforms other baseline algorithms in terms of high resemblance of the output with

Leveraging disjoint communities for detecting overlapping community structure 3

the ground-truth structure. Moreover, we show that even if it is scalable, it does not

compromise the correctness of the output.

Our paper makes several unique contributions to the state-of-the-art in community

detection. These include (i) analyzing the real-world community structure and observing

that the disjoint communities are enough to be processed for discovering overlapping

community structure, (ii) proposing a new framework by combining existing disjoint

community detection algorithm along with the post-processing step, (iii) showing the

accuracy of PVOC in terms of accurately discovering the ground-truth structure.

The organization of the paper is as follows. In the next section, we provide a brief

overview of state-of-the-art approaches in overlapping community detection. Section 3

provides a brief description of the synthetic and real-world datasets. Following this,

in Section 4, we present a detailed results of our empirical observation followed by

the description of our proposed framework. Section 5 describes the results of the

experiments to detect overlapping communities and a comparative analysis with the

baseline algorithms. The experiments in this paper use a combination of PVOC with

two existing disjoint community detection algorithms, Louvain [10] and Infomap [11].

Finally, we conclude the paper in Section 6 with some immediate future directions.

2. Related work

There has been a class of algorithms for network clustering, which allow nodes belonging

to more than one community. Palla proposed “CFinder” [12], the seminal and most

popular method based on clique-percolation technique. However, due to the clique

requirement and the sparseness of real networks, the communities discovered by CFinder

are usually of low quality [13]. The idea of partitioning links instead of nodes to discover

community structure has also been explored [14, 15, 16, 17].

On the other hand, a set of algorithms utilized local expansion and optimization to

detect overlapping communities. For instance, Baumes et al. [18] proposed a two-

step algorithm “RankRemoval” using a local density function. LFM [19] expands

communities from a random seed node to form a natural community until a fitness

function is locally maximal. MONC [20] uses the modified fitness function of LFM

which allows a single node to be considered a community by itself. OSLOM [5] tests

the statistical significance of a cluster with respect to a global null model (i.e., the

random graph generated by the configuration model) during community expansion.

Chen et al. [21] proposed selecting a node with maximal node strength based on two

quantities – belonging degree and the modified modularity. EAGLE [6] and GCE [22]

use the agglomerative framework to produce overlapping communities. COCD [23] first

identifies cores and then remaining nodes are attached to cores with which they have

maximum connections.

Few fuzzy community detection algorithms have been proposed that quantify the

strength of association between all pairs of nodes and communities [24]. Nepusz et

al. [25] modeled the overlapping community detection as a nonlinear constrained

Leveraging disjoint communities for detecting overlapping community structure 4

optimization problem which can be solved by simulated annealing methods. Zhang et

al. [26] proposed an algorithm based on the spectral clustering framework. Due to the

probabilistic nature, mixture models provide an appropriate framework for overlapping

community detection [27, 28, 29, 30]. MOSES [31] uses a local optimization scheme in

which the fitness function is defined based on the observed condition distribution. Zhang

et al. used Nonnegative Matrix Factorization (NMF) to detect overlapping communities

when the number of communities and the feature vectors are provided [32, 33]. Ding et

al. [34] employed the affinity propagation clustering algorithm for overlapping detection.

Recently, BIGCLAM [35] algorithm is also built on NMF framework.

The label propagation algorithm has been extended to overlapping community

detection by allowing a node to have multiple labels. In COPRA [36], each node updates

its belonging coefficients by averaging the coefficients from all its neighbors at each time

step in a synchronous fashion. SLPA [37, 38] spreads labels between nodes according to

pairwise interaction rules. A game-theoretic framework is proposed in Chen et al. [39]

in which a community is associated with a Nash local equilibrium.

Beside these, CONGA [7] extends GN algorithm [40] by allowing a node to split

into multiple copies. Zhang et al. [41] proposed an iterative process that reinforces the

network topology and propinquity that is interpreted as the probability of a pair of nodes

belonging to the same community. Istvan et al. [42] proposed an approach focusing

on centrality-based influence functions. Recently, Gopalan and Blei [43] proposed an

algorithm that naturally interleaves subsampling from the network and updating an

estimate of its communities. The reader can get more details in a nice survey paper by

Xie et al. [8].

3. Test suite of networks

3.1. Synthetic networks

It is necessary to have good benchmarks to both study the behavior of a proposed

community detection algorithm and to compare the performance across various

algorithms. In light of this requirement, Lancichinetti et al. [44] introduced LFR‡

benchmark networks that take into account heterogeneity into degree and community

size distributions of a network. These distributions are governed by power laws with

exponents τ1 and τ2 respectively. To generate overlapping communities On, the fraction

of overlapping nodes is specified and each node is assigned to Om (≥ 1) communities.

LFR also provides a rich set of parameters to control the network topology, including

the number of nodes n, the mixing parameter µ, the average degree k, the maximum

degree kmax, the maximum community size cmax, and the minimum community size

cmin. We vary these parameters depending on the experimental needs. Unless otherwise

stated, LFR graph is generated with the following configuration: µ = 0.2, N=10,000,

Om=4, On=5%; other parameters being set to their default values. Results shown are

‡ http://sites.google.com/site/andrealancichinetti/files

Leveraging disjoint communities for detecting overlapping community structure 5

the average of 100 runs.

100

101

102

10310

−6

10−4

10−2

100

Number of community memberships

Fra

ctio

n of

ver

tices

DBLPAmazonYoutubeOrkut

Figure 1. (Color online) Distribution of the number of community memberships of

vertices. X-axis shows the number of community memberships of vertices and y-axis

shows the fraction of vertices with certain number of community memberships.

Table 1. Properties of the real-world networks used in this experiments. N : number

of nodes, C: number of communities, S: average size of a community, Om: average

number of community memberships per node.

Networks N E C S Om

DBLP 317,080 1,049,866 13,477 429.79 2.57

Amazon 334,863 925,872 151,037 99.86 14.83

Youtube 1,134,890 2,987,624 8,385 9.75 10.26

Orkut 3,072,441 117,185,083 6,288,363 34.86 95.93

3.2. Real-world networks with ground-truth communities

We use four real-world networks§ proposed by Yang and Leskovec [35, 45] whose

underlying ground-truth community structures are known a priori and whose properties

are summarized in Table 1. Figure 1 shows the distribution of the number of

communities memberships of vertices for the real-world networks.

DBLP: It is a co-authorship network where nodes represent authors and edges

connect nodes whose corresponding authors have co-authored in at least one paper.

Since research communities stem around conferences or journals, the publication venues

are used as ground-truth communities in DBLP.

Amazon: It is a Amazon product co-purchasing network where nodes represent

products and edges connect commonly co-purchased products. Each product (i.e., node)

belongs to one or more product categories. Each product category is used to define a

ground-truth community.

§ http://snap.stanford.edu

Leveraging disjoint communities for detecting overlapping community structure 6

Figure 2. (Color online) An illustrative example to show the procedure followed in

our empirical study (NCDA: Non-overlapping Community Detection Algorithm).

Youtube: In the Youtube social network, users form friendship with each other

and users can create groups where other users can join. Here, such user-defined groups

are considered as ground-truth communities.

Orkut: Orkut is a free on-line social network where users form friendship with

each other. Orkut also allows users to form a group where other members can then join.

Here also such user-defined groups are considered as ground-truth communities.

4. Vertex-replication algorithm

Our proposed algorithm is motivated from an empirical study on the ground-truth

community structure of both synthetic and real-world networks. In this section, we

first describe the empirical observation and then illustrate a new algorithm that can

detect overlapping communities from a network with the help of any standard disjoint

community detection algorithm.

4.1. Empirical observation

We empirically study the structure of the ground-truth communities. We speculated

that if we remove the vertices that are part of multiple communities from the ground-

truth structure, the rest of the portion, i.e., the community structure composed of only

non-overlapping vertices can be efficiently captured by the standard disjoint community

detection algorithm. To verify this intuition, we take all the networks with their ground-

truth communities and two standard disjoint community detection algorithms, namely

Louvain[10] and Infomap[11, 46]. Then for each network, we run the following steps:

I We run each of these algorithms to obtain the disjoint community structure from

the network.

Leveraging disjoint communities for detecting overlapping community structure 7

Table 2. Number of communities in the ground-truth structure and that obtained

from Louvain and Infomap for LFR and real-world networks. Here for the LFR

network, we consider the following configuration: N=10,000, µ=0.2, Om=4, On=5%.

The result of LFR is averaged over 100 runs.Networks Ground-truth Algorithms

Louvain Infomap

LFR 582 468 501

DBLP 8,493 7,987 8,145

Amazon 151,037 142,098 149,876

Youtube 8,385 7,967 7,132

Orkut 288,363 284,980 286,791

II Since we know the ground-truth community structure of the network, we remove

from the ground-truth those vertices (refer to set Vo) which belong to multiple

communities.

III Similarly, we remove the constituent vertices of Vo from the disjoint community

structure obtained from Step I. This step makes sure that the filtered ground-truth

community structure and the filtered disjoint community structure obtained from

the algorithm contain same set of vertices.

IV Then we compare two community structures obtained from Step II and Step III.

LFR DBLP Amazon Orkut Youtube0

0.2

0.4

0.6

0.8

NM

I

LouvainInfomap

Figure 3. (Color online) Similarity (in terms of NMI) of the community structure

obtained from two disjoint community detection algorithms (Louvain and Infomap)

with the ground-truth structures after excluding the overlapping vertices. Here for

the LFR network, we consider the following configuration: N=10,000, µ=0.2, Om=4,

On=5%.

A schematic example of the above procedure is shown in Figure 2. Figure 1 shows

that in this process, we discard nearly 40% of the vertices (on an average) which belong to

multiple communities for each network. In Table 2, we also report the number of disjoint

communities obtained from Louvain and Infomap algorithms for both synthetic and

real-world networks and that present in the ground-truth structure. We use a standard

validation metric, namely Normalized Mutual Information (NMI) [47] to compare these

two community structures. Figure 3 shows that the similarity is quite high for all the

networks; this observation indeed corroborates our earlier speculation. Therefore, we

hypothesize that a standard disjoint community detection algorithm might be able to

Leveraging disjoint communities for detecting overlapping community structure 8

find the overlapping communities with a suitable post-processing step. This means

that a user wishing to find overlapping communities need no longer be forced to use

any overlapping community finding algorithm, rather a disjoint community structure

followed by a post-processing step might produce the expected overlapping community

structure. In the rest of this section, we shall use this observation to design a suitable

post-processing technique.

4.2. Permanence based vertex-replication algorithm

Through careful inspection mentioned above, we have found that a standard disjoint

community detection algorithms are quite efficient to detect the non-overlapping part

of the community structure. However there exist few vertices in the network, which are

part of multiple communities. We intend to design an efficient algorithm that would

be able to identify such overlapping vertices with their community memberships. For

that, we use a vertex-based metric, called permanence, which by virtue of its underlying

formulation measures how intensely a vertex belongs to its community [9]. Below, we

present a brief overview of the formulation of permanence.

Figure 4. Toy example depicting permanence of two vertices u and v. Even if

vertex v has a large number of external connections than u, all these six external

connections are distributed equally into three neighboring communities, resulting in

the external pull proportional to 2; whereas u is attached with 4 external neighbors,

three of them constitute in one community and the rest is attached with another

community, resulting in the external pull proportional to 3. On the other hand, v is

connected to 3 internal neighbors which are further completely connected among each

other; whereas the neighbors of u are partially connected. This results in high internal

pull of v as compared to u. These two notions of connectively are considered in the

formulation of permanence.

4.2.1. Formulation of permanence In an earlier paper [9], we showed that the extent

of membership of a vertex to a community depends on the following two factors. (i)

The first factor is the distribution of external connections of the vertex to individual

communities. A vertex that has equal number of connections to all its external

communities (e.g., a vertex with total 6 external connections with 2 to each of 3

neighboring communities) has equal “pull” from each community whereas a vertex

Leveraging disjoint communities for detecting overlapping community structure 9

with more external connections to one particular community (e.g., a vertex with total

6 external connections with 1 connection each to two neighboring communities and 4

connections to the third neighboring community), will experience more “pull” from that

community due to large number of external connections to it. (ii) The second factor

is the density of its internal connections. The internal connections of a community are

generally considered together as a whole. However, how strongly a vertex is connected

to its internal neighbors can differ. To measure this internal connectedness of a vertex,

one can compute the clustering coefficient of the vertex with respect to its internal

neighbors. The higher this internal clustering coefficient, the more tightly the vertex is

connected to its community.

Combining these two factors together, we formulated permanence Perm(.) of a

vertex v as follows:

Perm(v) =I(v)

Emax(v)×

1

D(v)− (1− cin(v)) (1)

where I(v) is the number of internal connections of v, D(v) is the degree of v,

Emax(v) is the maximum connections of v to a single external community and cin(v) is

the clustering coefficient among the internal neighbors of v. An illustrative example is

shown in Figure 4.

For vertices that do not have any external connections, Perm(v) is considered to be

equal to the internal clustering coefficient (i.e., Perm(v) = cin(v)). The maximum value

of Perm(v) is 1 and is obtained when vertex v is an internal node and part of a clique.

The lower bound of Perm(v) is close to -1. This is obtained when I(v) ≪ D(v), such

that I(v)D(v)Emax(v)

≈ 0 and cin(v) = 0. Therefore for every vertex v, −1 < Perm(v) ≤ 1.

4.2.2. The PVOC algorithm Since permanence can assign a score to each of

the vertices, we can use it in our post-processing step to identify overlapping

vertices from the detected disjoint community structure. Subsequently, we develop

a new algorithm, called PVOC (Permanence based Vertex-replication algorithm for

Overlapping Community detection) that can combine any existing disjoint community

detection algorithm with the permanence based vertex-replication for detecting

overlapping community structure of a network. Algorithm 1 presents the pseudo-code

of PVOC.

Given undirected networkG(V,E) and a threshold θ, the algorithm works as follows:

I A standard disjoint community detection algorithm Ad is used to detect non-

overlapping community structure NC from G.

II A set of vertices Ve are identified from NC such that each constituent vertex in Ve

has at least one connection to any external community.

III For each vertex v in Ve, we do the following steps:

(a) We calculate the sum of permanence of v and its neighbors in their assigned

communities.

Leveraging disjoint communities for detecting overlapping community structure 10

Algorithm 1 PVOC: Permanence based vertex-replication algorithm for overlapping

community detection

Input: A graph G = (V,E); Ad= disjoint community detection algorithm; θ =threshold

Output: Detected overlapping communities

procedure Vertex replication(G, NC, θ)

Ve = set of vertices having at least one external neighbor

for all do v ∈ Ve

Cv = current community of v

Measure current permanence of v, Op(v)

Measure current sum of permanences of all neighbor’s of v, On(v)

Sum Op = Op(v) +On(v)

for all do n ∈ N ⊲ N is the set of external neighbors of v

Cn =current community of n

Remove v from Cv and assign it in Cn

Measure new permanence of v in community Cn, Np(v)

Measure new sum of permanences of all neighbor’s of v, Nn(v)

Sum Np = Np(v) +Nn(v)

if (|Sum Np − Sum Op| <= θ) then

Assign a replica of v in Cn along with its original presence in Cv

else

Remove vertex v from Cn and place it back in Cv

return The updated overlapping community structure

procedure PVOC

Run Ad on G to obtain disjoint community structure NC

Call VERTEX REPLICATION(G,NC,θ)

(b) We remove v from its own community and place it to each of its external

communities separately. This assignment affects the permanence value of v

and its immediate neighbors.

(c) For each external community Cn, we measure the current sum of permanence

of v (in its new community) and its neighbors.

(d) If the absolute value of the difference of the permanence values obtained from

Step III(a) and Step III(c) is less than θ, a replica of v is placed into the new

community Cn, keeping v in its original community as well; otherwise v is

assigned back to its original community. This step identifies overlapping nodes

along with their memberships in different communities.

(e) The algorithm finally returns all the vertices with new community membership.

The threshold θ controls the extent to which one can relax the condition of

replicating a vertex into multiple communities. We vary the threshold from 0 to 0.2

and observe that it produces maximum accuracy at 0.05 (see Figure 6). Therefore,

Leveraging disjoint communities for detecting overlapping community structure 11

for the rest of the experiment, we keep the value of θ as 0.05. Note that in the

permanence-based post-processing step, we only consider those vertices having at least

one external connection. The rationale behind this assumption is that vertices in the core

of each community are often considered to be correctly placed by the disjoint community

detection algorithm, whereas vertices which are placed in the peripheral region of the

community and are loosely connected to the core of the community have high chance

to be part of multiple communities. Figure 5 shows an empirical observation where we

plot the relation between the number of external connections of a vertex in the detected

disjoint community to the number of overlapping memberships of the vertex in ground-

truth community. We observe that the correlation is increasing in nature, which indeed

strengthens our hypothesis.

The time complexity of measuring the permanence of a vertex takes O(d2), where

d is the average degree of vertices in the network. In real-world networks, the value of

d is much lower than log(n), where n is the number of nodes in the network. Therefore,

the PVOC algorithm mostly depends on the underlying disjoint community detection

algorithm.

1 2 3 4 5 6 7 8 90

1

2

3

4

# of external neighbors

Num

ber

of c

omm

unity

mem

bers

hips

LFR networks

10 20 30 40 500

20

40

60

80Real−world networks

µ=0.1

µ=0.3

µ=0.6

DBLPAmazonYoutubeOrkut

(b)(a)

Figure 5. (Color online) The relation between the average number of external

connections of vertices (with the standard deviation) obtained from the output of

the disjoint community detection algorithm (here we use Louvian) and the number of

communities a vertex is a part of.

5. Experiments

We combine PVOC with two popular disjoint community detection algorithms, namely

Louvain‖ [10] and Infomap¶ [11, 46]. These are chosen because they are reasonably

accurate algorithms with the potential to handle large networks, and implementation of

them, by their authors, are publicly available.

‖ https://sites.google.com/site/findcommunities/

¶ http://www.tp.umu.se/~rosvall/code.html

Leveraging disjoint communities for detecting overlapping community structure 12

5.1. Baseline algorithms

We compare the performance of PVOC with the following state-of-the-art overlapping

community detection algorithms, whose codes are also available:

• Order statistics local optimization method (OSLOM): It is based on the local

optimization of a fitness function expressing the statistical significance of clusters

with respect to random fluctuations, which is estimated with tools of Extreme and

Order Statistics [5]. The code is available at http://www.oslom.org.

• Community overlap propagation algorithm (COPRA): This algorithm is based

on the label propagation technique of Raghavan et al [4], but is able to detect

communities that overlap. Like the original algorithm, vertices have labels that

propagate between neighboring vertices so that members of a community reach

a consensus on their community membership [36]. The code is available at

http://www.cs.bris.ac.uk/~steve/networks/software/copra.html.

• Speaker listener propagation algorithm (SLPA): The algorithm is an extension

of the Label Propagation Algorithm (LPA) [4]. In SLPA, each node can be

a listener or a speaker. The roles are switched depending on whether a node

serves as an information provider or information consumer. Typically, a node

can hold as many labels as it likes, depending on what it has experienced

in the stochastic processes driven by the underlying network structure. A

node accumulates knowledge of repeatedly observed labels instead of erasing

all but one of them. Moreover, the more a node observes a label, the more

likely it will spread this label to other nodes [37]. The code is available at

https://sites.google.com/site/communitydetectionslpa.

• Agglomerative hierarchical clustering based on maximal clique (EAGLE): It uses

the agglomerative framework to produce a dendrogram. First, all maximal

cliques are found and made to be the initial communities. Then, the pair

of communities with maximum similarity is merged. The optimal cut on the

dendrogram is determined by the extended modularity with a weight based

on the number of overlapping memberships [6]. The code is available at

http://code.google.com/p/eaglepp/.

• Cluster-overlap Newman Griven algorithm (CONGA): The idea of this algorithm

is similar to our idea of finding overlapping communities from disjoint community

structure. CONGA is based on Griven Newman’s “GN” algorithm [1] but

extended to detect overlapping communities. CONGA adds to the GN algorithm

the ability to split vertices between communities, based on the new concept of

split betweenness. At first, edge betweenness of edges and split betweenness

of vertices are calculated. Then an edge with maximum edge betweenness is

removed or a vertex with maximum split betweenness is split. After this step,

edge betweenness and split betweenness are recalculated. The above steps are

repeated until no edges remain [7]. However, the calculation of edge betweenness

Leveraging disjoint communities for detecting overlapping community structure 13

and split betweenness is expensive on large networks. The code is available at

http://www.cs.bris.ac.uk/~steve/networks/congapaper/.

• Cluster affiliation model for big networks (BIGCLAM): In this algorithm,

communities arise due to shared community affiliations of nodes. Here the affiliation

strength is explicitly modeled for each node to each community. Then each node-

community pair is assigned a nonnegative latent factor which represents the degree

of membership of a node to the community. The probability of an edge between a

pair of nodes is then modeled in the network as a function of the shared community

affiliations [35]. The code is available at http://snap.stanford.edu.

Note that each algorithm is simply used with its default parameters.

Figure 6. (Color online) Accuracy of PVOC (in terms of three validation metrics) with

the increase of θ for LFR (µ = 0.3) and one real-world network (DBLP). Maximum

accuracy is obtained at θ = 0.05, which we use in rest of the experiment. Each point

in the plot is an average of the accuracies obtained from Louvain and Infomap.

5.2. Validation metrics

A stronger test of the correctness of the community detection algorithm, however, is by

comparing the obtained community with a given ground-truth structure. For evaluation,

we use three metrics that quantify the level of correspondence between the detected and

the ground- truth communities [35].

• Overlapping Normalized Mutual Information+ [48]

• Omega Index [24]

• Average F score [49]

Note that all the metrics are bounded between 0 (no matching) and 1 (perfect

matching).

+ https://github.com/aaronmcdaid/Overlapping-NMI

Leveraging disjoint communities for detecting overlapping community structure 14

0.1 0.2 0.3 0.4 0.50

0.5

1

µ

ONMI

0.1 0.2 0.3 0.4 0.50

0.5

1

µ

Omega

0.1 0.2 0.3 0.4 0.50

0.5

1

µ

F−Score

2 4 6 8

0.5

1

Om

Val

ues

2 4 6 8

0.5

1

Om

2 4 6 8

0.5

1

Om

10 20 30 400

0.5

1

On (in%)

10 20 30 400

0.5

1

On (in %)

10 20 30 40

0.5

1

On (in %)

PVOC+LVN

PVOC+INFO

CONGA

OSLOM

COPRA

SLPA

EAGLE

BIGCLAM

Figure 7. (Color online) Accuracy of all the competing algorithms for LFR by varying

µ (top panel, where N = 10, 000, Om = 4, On = 5%), Om (middle point, where

N = 10, 000, µ = 0.1, On = 5%) and On (bottom panel, where N = 10, 000, µ = 0.1,

Om = 4)). Note that the value of On is expressed in % of n (LVN: Louvain, INFO:

Infomap).

5.3. Experimental results

In this experiment, we use PVOC combined with Louvain and Informap separately, and

compare the results with six baseline algorithms. First, we check the dependency of

PVOC with the value of θ. Figure 6 shows that at θ = 0.05, PVOC achieves maximum

accuracy for LFR and one representative real-world network; however the result is almost

same of other networks. Therefore, we use θ = 0.05 in the rest of the experiments. One

can tune θ appropriately to control the extent of overlapping membership of vertices in

the network.

In Figure 7, we compare the outputs obtained from different competing algorithms

with the ground-truth communities for LFR networks with different parameter settings.

Figure 7 (top panel) shows the results for different values of µ ranging from 0.1 to 0.5.

As µ increases, the community structure becomes less evident and it becomes difficult

for all the algorithms to discover the actual community structure. OSLOM performs

worst compared to the other algorithms. However, for all the cases, PVOC+LVN is least

affected and outperforms other algorithms. This is followed by PVOC+INFO, CONGO

and SLPA.

We then vary the average number of community memberships per vertex, Om from 2

to 8 keeping the other parameters same, and plot the performance of different algorithms

in Figure 7 (middle panel). The effect is reasonably less on the accuracy of the competing

algorithms. Here we observe that the pattern is almost similar for PVOC+LVN and

PVOC+INFO, and are much superior than others.

Finally, in Figure 7 (lower panel) we plot the accuracy of the algorithms with

the increasing value of On, percentage of overlapping vertices. Surprisingly, OSLOM

Leveraging disjoint communities for detecting overlapping community structure 15

shows an unexpected behavior with the increasing accuracy after a certain value of On.

However, on an average the change in accuracy is almost consistent for all the algorithms

in all possibilities of On.

To understand the utility of including PVOC step with the disjoint community

finding algorithms in more details, we further measure the performance of Louvain

and Infomap in isolation without PVOC step. We observe that excluding PVOC step

significantly deteriorates the performance of Louvain algorithm: for LFR network (N

= 10,000, Om = 4, On = 5% , µ=0.2) ONMI (0.569), Omega Index (0.512), F-Score

(0.523); for DBLP network ONMI (0.495), Omega Index (0.521), F-Score (0.487); for

Amazon network ONMI (0.458), Omega Index (0.498), F-Score (0.447); for Youtube

network ONMI (0.512), Omega Index (0.522), F-Score (0.564); and for Orkut network

ONMI (0.526), Omega Index (0.556), F-Score (0.544). Similar trend is observed for

Infomap algorithm. This observation therefore strengthens the need of PVOC as a

post-processing step with the disjoint community detection algorithms.

Now, we run the competing algorithms on the real-world networks. As noted in [35],

most of the baseline community detection algorithms do not scale for networks of large

size. Therefore, we use the following technique proposed by Yan and Leskovec [35] to

obtain several small subnetworks with overlapping community structure from the large

real networks. We pick a random node u in the given graphG that belongs to at least two

communities. We then take the subnetwork to be the induced subgraph of G consisting

of all the nodes that share at least one ground-truth community membership with u.

In our experiments, we created 500 different subnetworks for each of the six real-world

datasets and the results are averaged over these 500 samples. For each validation metric

(ONMI, Ω Index, F-Score), we separately scale the scores of the methods so that the

best performing community detection method has the score of 1. Finally, we compute

the composite performance by summing up three normalized scores. If a method

outperforms all the other methods in all the scores, then its composite performance

is 3.

Figure 8 displays the composite performance of the methods for different networks.

On an average, the composite performance of PVOC+INFO (2.88) and PVOC+LVN

(2.74) significantly outperform other competing algorithms: 6.27% higher than that

of BIGCLAM (2.71), 18.03% higher than that of SLPA (2.44), 101.3% higher than

that of OSLOM (1.43), 36.4% higher than that of COPRA (2.11), 48.4% higher than

that of CONGA (1.94), and 77.8% higher than that of EAGLE (1.62). The absolute

average ONMI of PVOC+INFO (PVOC+LVN) for one LFR and six real networks taken

together is 0.85 (0.83), which is 4.93% (2.46%) and 26.8% (20.8%) higher than the two

most competing algorithms, i.e., BIGCLAM (0.81), and SLPA (0.67) respectively. In

terms of absolute values of scores, PVOC+INFO (PVOC+LVN) achieves the average

F-Score of 0.84 (0.79) and average Ω Index of 0.83 (0.82). Overall, PVOC combined with

Louvain and Infomap gives the best results, followed by BIGCLAM, SLPA, COPRA,

CONGO, EAGLE and OSLOM.

As most of the baseline algorithms except BIGCLAM do not scale for large real

Leveraging disjoint communities for detecting overlapping community structure 16

Figure 8. (Color online) Performance of various competing algorithms to detect the

ground-truth communities. For each evaluation metric separately we scale the score of

the methods so that the best performing community detection algorithm achieves the

score of 1. Thus if an algorithm outperforms all the methods in all the scores, then its

composite score would become 3.

networks [35], we separately compare PVOC with BIGCLAM (which is scalable and also

the most competing algorithm) on actual large real datasets. Table 3 shows performance

of PVOC and BIGCLAM for different real networks. On average, PVOC+INFO

(PVOC+LVN) achieves 4.28% (5.63%) higher ONMI, 1.48% (2.85%) higher Ω Index,

and 6.94% (5.63%) higher F-Score. Overall, PVOC outperforms BIGCLAM in every

measure and for every network. The absolute values of the scores of PVOC+INFO and

PVOC+LVN averaged over all the networks are 0.70 and 0.71 (ONMI), 0.69 and 0.70

(Ω Index), and 0.72 and 0.71 (F-Score) respectively.

Table 3. The performance of BIGCLAM and PVOC on large real-world networks.

NetworksBIGCLAM PVOC+LVN PVOC+INFO

ONMI Omega F Score ONMI Omega F Score ONMI Omega F Score

DBLP 0.61 0.59 0.54 0.65 0.61 0.60 0.65 0.62 0.59

Amazon 0.73 0.69 0.74 0.72 0.71 0.75 0.73 0.74 0.76

Orkut 0.65 0.68 0.64 0.72 0.70 0.76 0.73 0.72 0.77

Youtube 0.68 0.76 0.78 0.77 0.78 0.72 0.71 0.68 0.78

Many optimization algorithms have the tendency to underestimate smaller size

communities [50] and sometimes tend to produce very large size communities. In our

test suite, we observe the similar tendency in BIGCLAM whereas the communities

obtained by PVOC based algorithms are comparable in size with respect to the ground-

truth. Earlier in Table 2, we have mentioned the number of communities detected

by PVOC based algorithms (the number of communities does not change due to the

inclusion of PVOC step with Louvain and Infomap). In Table 4, we show for both LFR

and real-world networks that the size of the largest and smallest communities detected

by BIGCLAM is much larger than that present in the ground-truth structure. We also

measure the similarity (using Jaccard coefficient) between the largest and smallest-size

communities detected by BIGCLAM and PVOC based algorithms with the communities

Leveraging disjoint communities for detecting overlapping community structure 17

in ground-truth structure and notice that PVOC based algorithms are able to detect

both largest and smallest-size communities which are most similar to the ground-truth

structure. Therefore, we hypothesize that our algorithm has the potentiality to produce

meaningful communities which have high resemblance with the ground-truth structure.

Table 4. Size of the largest and smallest communities present in the ground-truth and

that obtained from BIGCLAM and PVOC based algorithms (the Jaccard similarities

between the results obtained from the algorithms with the ground-truth structure are

reported within parenthesis) for both LFR and real-world networks.Networks Ground-truth BIGCLAM PVOC+LVN PVOC+INFO

Max Size Min size Max Size Min size Max Size Min size Max Size Min size

DBLP 3,458 124 9,876 (0.56) 877 (0.48) 4,098 (0.71) 243 (0.82) 4,143 (0.76) 204 (0.81)

Amazon 5,987 245 10,109 (0.45) 765 (0.57) 6,876 (0.69) 398 (0.75) 6,367 (0.72) 323 (0.83)

Orkut 10,687 1,876 13,768 (0.72) 2,985 (0.69) 11,976 (0.74) 1,908 (0.79) 11,345 (0.75) 1,976 (0.79)

Youtube 8,987 765 9,976 (0.65) 1,098 (0.62) 8,876 (0.76) 987 (0.71) 9,018 (0.74) 865 (0.82)

6. Conclusions

In this paper, we presented a study to show that there is perhaps less need of developing

yet another algorithm for finding overlapping communities from the network. We

demonstrated how the output of an efficient disjoint community detection algorithm can

be leveraged to discover the overlapping community structure. For that, we proposed

a novel, two-phase framework, called PVOC that can be combined with any efficient

disjoint community detection algorithm. PVOC uses a new metric, called permanence

in the post-processing step on each vertex and detects the overlapping vertices from

the non-overlapping structure. We combined PVOC with two efficient and scalable

algorithms, Louvain and Informap. Experimental results showed that our approach is

viable in producing meaningful overlapping communities quite efficiently even from the

large real world networks in terms of high resemblance with the ground-truth community

structure. PVOC is controlled by only one parameter θ, which can be efficiently tuned

to increase the extent of overlapping memberships per vertex in a network.

However, a major drawback of PVOC is that it produces exactly the same number

of overlapping communities that the disjoint community detection algorithm produces.

However, it might be possible that due to the overlapping nature of a community, new

community might emerge from the disjoint community structure. As an immediate step,

we would like to include a new module in the post-processing step that would consider

the emergence of new communities. Moreover, we would try to evaluate PVOC in

conjunction with even more disjoint community detection algorithms. To conclude,

we would like to emphasize on the fact that considering such a massive literature

particularly on community detection, it is perhaps the good time to put an end to

such consistent effort of proposing yet another algorithm, and to revisit some of the

existing algorithms that are efficient enough to fulfill both the purpose of discovering

disjoint and overlapping communities from the network.

Leveraging disjoint communities for detecting overlapping community structure 18

7. Reference

[1] Newman M E J and Girvan M 2004 Physical Review E 69 026113+

[2] Fortunato S 2010 Physics Reports 486 75 – 174

[3] Papadopoulos S, Kompatsiaris Y, Vakali A and Spyridonos P 2012 Data Min. Knowl. Discov. 24

515–554

[4] Raghavan U N, Albert R and Kumara S 2007 Physical Review E 76 036106+

[5] Lancichinetti A, Radicchi F, Ramasco J J and Fortunato S 2011 PLoS ONE 6 e18961

[6] H Shen X Cheng K C and Hu M B 2009 Physica A 388 1706–1712

[7] Gregory S 2007 An algorithm to find overlapping community structure in networks Proceedings of

the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases

PKDD 2007 (Berlin, Heidelberg: Springer-Verlag) pp 91–102 ISBN 978-3-540-74975-2 URL

http://dx.doi.org/10.1007/978-3-540-74976-9_12

[8] Xie J, Kelley S and Szymanski B K 2013 ACM Comput. Surv. 45 43:1–43:35

[9] Chakraborty T, Srinivasan S, Ganguly N, Mukherjee A and Bhowmick S 2014 On the permanence

of vertices in network communities The 20th ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA - August 24 - 27,

2014 pp 1396–1405

[10] Blondel V D, Guillaume J L, Lambiotte R and Lefebvre E 2008 J. Stat. Mech. 2008 P10008

[11] Rosvall M and Bergstrom C 2007 PNAS 104 7327

[12] Palla G, Dernyi I, Farkas I and Vicsek T 2005 Nature 435 814–818

[13] Fortunato S and Lancichinetti A 2009 Community detection algorithms: A comparative analysis:

Invited presentation, extended abstract Proceedings of the Fourth International ICST Conference

on Performance Evaluation Methodologies and Tools (ICST, Brussels, Belgium, Belgium: ICST

(Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering)) pp

27:1–27:2

[14] Ahn Y Y, Bagrow J P and Lehmann S 2010 Nature 466 761–764

[15] Evans T S and Lambiotte R The European Physical Journal B 77 265–272

[16] Evans T S and Lambiotte R 2009 Physical Review E 80 016105

[17] Chen Y, Wang X L, Yuan B and Tang B Z 2014 Journal of Statistical Mechanics: Theory and

Experiment 2014 P03021 URL http://stacks.iop.org/1742-5468/2014/i=3/a=P03021

[18] Baumes J, Goldberg M K, Krishnamoorthy M S, Magdon-Ismail M and Preston N 2005 Finding

communities by clustering a graph into overlapping subgraphs. IADIS AC (IADIS) pp 97–104

[19] Lancichinetti A, Fortunato S and Kertesz J 2009 New Journal of Physics 11 033015 URL

http://stacks.iop.org/1367-2630/11/i=3/a=033015

[20] Havemann F, 0003 M H, Struck A and Glser J 2010 CoRR abs/1012.1269

[21] Chen D, Shang M, Lv Z and Fu Y 2010 Physica A: Statistical Mechanics and its Applications 389

4177 – 4187

[22] Lee C, Reid F, McDaid A and Hurley N 2010 Detecting highly-overlapping community structure

by greedy clique expansion Workshop - ACM KDD-SNA pp 33–42

[23] Du N, Wang B and WU B 2008 Overlapping community structure detection in networks 17th

ACM Conference on Information and Knowledge Management (CIKM’08) pp 1371–1372

[24] Gregory S 2011 J. of Stat. Mech.

[25] Nepusz T, Petroczi A, Negyessy L and Bazso F 2008 Phys. Rev. E 1

[26] Zhang S, Wang R S and Zhang X S 2007 Physica A: Statistical Mechanics and its Applications

374 483–490

[27] Newman M E J and Leicht E A 2007 Proceedings of the National Academy of Sciences 104 9564–

9569

[28] Ren W, Yan G, Liao X and Xiao L 2009 Physical Review E (Statistical, Nonlinear, and Soft Matter

Physics) 79 036111 URL http://dx.doi.org/10.1103/physreve.79.036111

[29] Nowicki K and Snijders T A B 2001 Journal of the American Statistical Association 96 1077–1087

Leveraging disjoint communities for detecting overlapping community structure 19

[30] Zarei M, Izadi D and Samani K A 2009 Journal of Statistical Mechanics: Theory and Experiment

2009 P11013 URL http://stacks.iop.org/1742-5468/2009/i=11/a=P11013

[31] McDaid A and Hurley N 2010 Detecting highly overlapping communities with model-based

overlapping seed expansion Proceedings of the 2010 International Conference on Advances in

Social Networks Analysis and Mining ASONAM ’10 (Washington, DC, USA) pp 112–119

[32] Zhang S, Wang R S and Zhang X S 2007 Phys. Rev. E 76(4) 046103 URL

http://link.aps.org/doi/10.1103/PhysRevE.76.046103

[33] Zhao K, Zhang S W and Pan Q 2010 Fuzzy analysis for overlapping community structure of

complex network Control and Decision Conference (CCDC), 2010 Chinese pp 3976–3981

[34] Ding F, Luo Z, Shi J and Fang X 2010 Overlapping community detection by kernel-based fuzzy

affinity propagation International Workshop on Indoor Spatial Awareness (ISA10) pp 1–4

[35] Yang J and Leskovec J 2013 Overlapping community detection at scale: A nonnegative matrix

factorization approach WSDM (New York, NY, USA: ACM) pp 587–596

[36] Gregory S 2010 New Journal of Physics 12 103018

[37] Xie J and Szymanski B K 2012 Towards linear time overlapping community detection in social

networks PAKDD pp 25–36

[38] Xie J and Szymanski B K 2011 CoRR abs/1105.3264

[39] Chen W, Liu Z, Sun X and Wang Y 2010 Data Min. Knowl. Discov. 21 224–240

[40] Girvan M and Newman M E J 2002 Proceedings of the National Academy of Sciences 99 7821–7826

[41] Zhang Y, Wang J, Wang Y and Zhou L 2009 Parallel community detection on

large networks with propinquity dynamics. KDD ed IV J F E, Fogelman-Souli

F, Flach P A and Zaki M (ACM) pp 997–1006 ISBN 978-1-60558-495-9 URL

http://dblp.uni-trier.de/db/conf/kdd/kdd2009.html#ZhangWWZ09

[42] Istvan A, Palotai R, Szalay M S and Csermely P K 2010 PLoS ONE 5 e12528 URL

http://dx.doi.org/10.1371%2Fjournal.pone.0012528

[43] Gopalan P K and Blei D M 2013 Proceedings of the National Academy of Sciences 110 14534–14539

[44] Lancichinetti A, Fortunato S and Radicchi F 2008 Phys. Rev. E 78 046110

[45] Yang J and Leskovec J 2012 Defining and evaluating network communities based on ground-truth

Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics (New York, NY, USA:

ACM) pp 3:1–3:8

[46] Rosvall M and Bergstrom C T 2008 PNAS 105 1118–1123

[47] Danon L, Diaz-Guilera A, Duch J and Arenas A 2005 J. Stat. Mech. 9 P09008

[48] McDaid A F, Greene D and Hurley N J 2011 CoRR abs/1110.2515

[49] Manning C D, Raghavan P and Schutze H 2008 Introduction to Information Retrieval (New York,

NY, USA: Cambridge University Press)

[50] Fortunato S and Barthelemy M 2007 PNAS