Latency based group discovery algorithm for network aware cloud scheduling


Future Generation Computer Systems 31 (2014) 28–39


Sheheryar Malik a,c,∗, Fabrice Huet a, Denis Caromel b

a Research Team OASIS, INRIA Sophia Antipolis - Mediterranean, 2004 Routes des Lucioles, 06902 Sophia Antipolis, France
b Polytech Nice Sophia, University of Nice - Sophia Antipolis, 930 Route des Colles - BP 145, 06903 Sophia Antipolis Cedex, France
c Center for Research in Distributed & Supercomputing, Faculty of Computing, Riphah International University, I-14, Islamabad, Pakistan

Highlights

• Proposed a model for the grouping of nodes with respect to network latency.
• It helps the cloud scheduler in scheduling decisions on the basis of latency.
• It improves the throughput of distributed applications.
• It has no additional network traffic overheads for algorithm computation.
• It works well with incomplete latency information and performs intelligent grouping.

Article info

Article history:
Received 30 October 2011
Received in revised form 21 March 2013
Accepted 5 September 2013
Available online 21 September 2013

Keywords:
Cloud scheduling
Latency grouping
Group partitioning
Network measurement

Abstract

Cloud computing is a major paradigm shift in computing. It provides high scalability and elasticity with a range of on-demand services. We can execute a variety of distributed applications on a cloud's virtual machines (computing nodes). In a distributed application, virtual machine nodes need to communicate and coordinate with each other. This coordination requires that the inter-node latency be minimal to improve performance. But when nodes belong to different clusters of the same cloud, or to a multi-cloud environment, network latency can be high. So it becomes difficult to decide which node(s) to choose for the distributed application execution so as to keep inter-node latency at a minimum. In this paper, we propose a solution for this problem: a model for the grouping of nodes with respect to network latency, so that application scheduling can be done on the basis of network latency. This model is part of our proposed Cloud Scheduler module, which helps the scheduler make scheduling decisions on the basis of different criteria. Network latency, and the resulting node grouping, is one of those criteria. The main contribution of the paper is that our proposed latency grouping algorithm not only adds no network traffic overhead for its computation but also works well with incomplete latency information and performs intelligent grouping on the basis of latency. This paper addresses an important problem in cloud computing: locating communicating virtual machines for minimum latency between them and grouping them with respect to inter-node latency.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

With the advent of cloud computing technologies, the use of cloud computing infrastructure is increasing day by day, and many enterprises are shifting their computing from in-house infrastructure (clusters) to cloud infrastructure (public/private/hybrid cloud).

∗ Corresponding author at: Research Team OASIS, INRIA Sophia Antipolis - Mediterranean, 2004 Routes des Lucioles, 06902 Sophia Antipolis, France. Tel.: +33 658805891.

E-mail addresses: [email protected], [email protected] (S. Malik).

0167-739X/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.future.2013.09.004

Enterprises can take advantage of the intensive computing capabilities and the scalable virtualized environment of cloud computing to execute their tasks. By using cloud infrastructure, processing is done on cloud computing nodes (i.e. virtual machines). These virtual machine nodes can be geographically co-located in the same cluster, or far enough apart to be in different countries; they can even belong to different cloud vendors. So there can be higher inter-node latencies as compared to a typical cluster. If an application requires coordination among different cloud nodes, performance can be degraded due to high inter-node latency. Thus there is a strong demand to keep the inter-node latency as low as possible. To handle this situation, we must have a group of nodes which has minimum inter-node latency and also has the minimum number of nodes required for the application. We need a mechanism which is especially useful for distributed applications that want to run on multiple nodes with the minimum possible inter-node latency.

As a solution to the above problem, we propose a model to group together the nodes with minimum inter-node latency. The algorithm fulfills all the requirements mentioned in the problem statement. The proposed algorithm is quite capable of grouping the nodes with the minimum available latency information, and it produces mutually exclusive groups with minimal communication overhead. There are some existing algorithms which fulfill one or two of the requirements mentioned in the problem statement, but none of them fulfills all of the requirements.

In this paper, we focus only on the grouping of nodes (group discovery and creation) with respect to their inter-node latency, to help the scheduler efficiently schedule distributed applications on a cloud/cluster. We do not address issues related to latency updates; as the latency update mechanism is not within the scope of our work, we use the already existing force based embedding techniques [1–3] for latency updates.

The model is part of our Resource Aware Cloud Computing framework (RAC2) [4], and the implementation of its algorithm is part of the Resource Awareness Cloud Scheduling (RACS) module. The RACS module assists the scheduler in making more efficient scheduling decisions on the basis of different resource characteristics/criteria; network latency is one of those criteria. With the help of the cloud resource manager, RACS has information about the number of nodes available from different cloud operators and clusters. Due to the rapid increase in the number of cloud service providers (cloud operators), an enterprise has a variety of cloud operators to choose from. But it becomes really difficult to select one cloud operator, as each operator is better than the others in some cloud services and lags behind in other services and attributes. The RACS module assists the scheduler in this critical decision of node selection. For experimentation purposes, we have plugged our scheduler module into the ProActive scheduler [5]. ProActive is an open source cloud/grid middleware, which enables users to execute their tasks on a cluster or cloud infrastructure.

The rest of the paper is structured as follows. Section 2 gives the background and existing related work in the area. In Section 3, we present our proposed model. In Section 4, we describe the results of the experimental evaluation of our proposed algorithm. We conclude in Section 5 and discuss future research directions.

2. State of the art and positioning

2.1. Problem statement

After analyzing the current situation and the existing work, we have come up with a problem statement. It has a number of requirements to fulfill. None of the existing models fulfills all the requirements of our problem statement, so we need a model that fulfills all of them. The requirements are:

1. Group the nodes on the basis of node connection value (i.e. latency) instead of the number of connections to the nodes. The nodes in a group must have minimal inter-node latency.

2. The algorithm should be able to work on incomplete latency information, i.e. when we do not have N × N communication instances (connections).

3. Groups should be pre-computed, so that user applications do not need to ask the algorithm to compute the groups every time they need to run.

4. Produce mutually exclusive groups, so that a node in one group does not appear in another group. If the groups are not mutually exclusive, then a node which has already been assigned to an application can be a candidate to be assigned to another application by the scheduler. Mutually exclusive groups are also genuinely required when a process wants to execute its sub-tasks on multiple groups.

5. Nodes in a group should have minimal inter-node latency.

2.2. Related work

Grouping of nodes with respect to certain characteristics has been studied under various names; most commonly it is referred to as community detection. Community, group, or cluster detection has been done for various disciplines and fields, including the internet [6], the world wide web [7], biological networks [8], citation networks [9], social networks [10], complex systems, and graph theory [11]. Different algorithms have been produced, tailored to the needs of each discipline. Most of these algorithms are based on grouping nodes by the connection count of a particular node. However, in our problem scenario, we group the nodes on the basis of the connection value between nodes; the connection value is determined by the latency between the nodes. In general, a network community is a group of nodes with more interactions among its members than with the rest of the network [12]. A node in a group or community normally has more interactions inside than outside the group. A community detection algorithm is considered better if its communities' ratio of external node connections to internal node connections is lower. This property is called conductance: a lower external to internal node connection value results in high conductance. Our algorithm produces groups with high conductance. Most of the algorithms in the related work belong to graph partitioning algorithms and hierarchical agglomerative clustering.

A K-CLIQUE and MODULARITY based model for distributed community detection in delay tolerant networks has been introduced by Hui et al. [13,14]. In their work, they present three variations of their algorithm, i.e. the SIMPLE, K-CLIQUE and MODULARITY methods. The algorithm detects both static and temporal groups, and performs the community detection in a distributed fashion on mobile devices. They evaluated the algorithm on three human mobility data sets. K-CLIQUE and MODULARITY have better performance than SIMPLE, but SIMPLE is less complex, i.e. O(n), whereas MODULARITY has a worst-case complexity of O(n⁴) and K-CLIQUE has a complexity of O(n²). All of these algorithms group the nodes on the basis of connection count instead of connection value. They create groups which are not mutually exclusive, and there is no node regrouping; a node can appear in multiple groups. Thus the algorithm has to run each time a process demands a group.

A local spectral partitioning algorithm is proposed by Anderson et al. [15]. It is a graph partitioning algorithm which uses a PageRank vector to perform its partitioning operation. They aim to reduce the conductance in their algorithm. It can group the nodes on the basis of connection value and produces mutually exclusive groups. It also does not require the algorithm to run each time a process demands a group.

Flow-based Metis+MQI is proposed by Lang et al. [16]. It is a graph partitioning algorithm aimed at improving the conductance of graph cuts. They combine MQI with Metis to obtain a heuristic graph partitioner. It can group the nodes on the basis of connection value and produces mutually exclusive groups. Like Local Spectral, it also does not require the algorithm to run each time a process demands a group.

Both the Local Spectral partitioning algorithm [15,12] and flow-based Metis+MQI [16] are widely used. Metis+MQI is better than Local Spectral in the sense that its communities' internal connections are stronger than their external ones, so Metis+MQI has better conductance than Local Spectral.


Table 1
Comparison of existing methods/algorithms with the proposed solution in accordance with the problem statement.

Method | Can group on the basis of connection value | Can work on incomplete latency information | Groups can be pre-computed | Produce mutually exclusive groups | Time complexity | Algorithm needs to run each time
Simple^a | No | Yes | No | No | O(n) | Yes
K-clique^a | No | Yes | No | No | O(n²) | Yes
Modularity^a | No | Yes | No | No | O(n⁴) | Yes
Local spectral^b | Yes | No | Yes | Yes | O(n³) | No
Metis+MQI^c | Yes | No | Yes | Yes | O(n²) | No
Clauset^d | No | Yes | Yes | Yes | O(n²) | No
HACOC^e | Yes | No | No | No | O(n³) | Yes
HCA & HSA^f | No | Yes | No | No | O(n²) | Yes
Newman^g | No | Yes | No | No | O(n³) | Yes
Proposed algo.^h | Yes | Yes | Yes | Yes | O(n²) | No

a Distributed community detection in delay tolerant networks by Hui et al. [13].
b Local spectral partitioning algorithm by Anderson et al. [15].
c Flow based Metis+MQI by Lang et al. [16].
d Agglomerative algorithm by Clauset [17].
e HACOC agglomerative algorithm by Zhao et al. [18].
f HCA and HSA agglomerative algorithms by Garcia et al. [19].
g Community structure detection algorithm by Newman [20,10].
h Group discovery algorithm by Malik et al.

On the other hand, in the worst case Metis+MQI can produce communities with no inter-community connection, whereas in Local Spectral the communities are always interconnected, which is a good property; in this respect, Local Spectral is better. Local Spectral is also more compact than Metis+MQI. The time complexity of Metis+MQI is O(n²), which is better than the O(n³) of Local Spectral. Both of these algorithms come quite close to fulfilling the requirements of our problem statement.

Clauset’s agglomerative algorithm is proposed by Clauset [17]that greedily maximizes themodularity. Their algorithm infers thehierarchy of the communities that enclose a given vertex by ex-ploring the graph one vertex at a time. The average complexity forthe algorithm is O(k2d) for general graphs, where d is the meandegree and k is the number of vertices to be explored. They usethe algorithm to show the existence ofmesoscopic structure by ex-tracting the local clustering information in the large recommendernetwork of an online retailer. It generates non-mutually exclusivegroups. It does not work on the connection value of the nodes. Butit can work on incomplete latency information.

HACOC is a mechanism for hierarchical agglomerative clustering with respect to ordering constraints, devised by Zhao et al. [18]. They use ordering constraints to capture hierarchical information, which allows the user to encode hierarchical knowledge such as ontologies into agglomerative algorithms. They experimented on labeled newsgroup data with the ordering constraints. The algorithm employs prior knowledge in agglomerative clustering to prevent inaccurate merging operations. The complexity of the algorithm is O(n³). It generates overlapping clusters with nodes common to different clusters. It can work on incomplete latency information, and it performs the grouping on the connection value of the nodes.

The hierarchical compact algorithm and hierarchical star algorithm are proposed by Garcia et al. [19]. These are hierarchical clustering algorithms and are part of their general framework for agglomerative hierarchical clustering algorithms [19], in which they tested and compared them. The framework is based on graphs; different hierarchical agglomerative clustering algorithms are obtained from this framework by using different measures. They evaluated the algorithms using standard document collections. The framework does not work on node connection value and also does not generate mutually exclusive groups. Its time complexity is O(n²).

An algorithm for detecting community structure in networks is proposed by Newman [20,10]. It is based on the edge removal mechanism and the idea of modularity. They tested the algorithm on both computer-generated and real-world networks. In the worst case, it has a complexity of O(n³). It neither produces mutually exclusive groups nor groups the nodes by connection value, and it requires the algorithm to run each time a group is needed for a process.

2.3. Comparison

We have examined some of the existing algorithms against our problem statement; none of them fulfills all of its requirements. As stated in our problem statement, every node connection has a value, and the proximity of a node to its neighbor is determined by that value. In most of the existing community detection algorithms, the nodes are grouped with respect to the number of nodes connected to a particular node, and nodes which are far apart in terms of connection count (hop count) normally fall into different groups. Generally these algorithms produce non-mutually exclusive groups. The algorithms listed above generally produce overlapping groups or communities. Our algorithm produces mutually exclusive groups according to our problem statement and performs well on incomplete latency information, with a complexity of O(n²).

Table 1 shows the comparison between the existing algorithms and our proposed algorithm according to the problem statement. It compares them on the following characteristics:

• Can group the nodes on the basis of connection value.
• Can work on incomplete latency information.
• Groups can be pre-computed.
• Can produce mutually exclusive groups.
• Time complexity.
• Whether the algorithm has to run each time a process wants a group with a certain number of nodes.

3. Proposed model—group discovery algorithm

The basis of our proposed solution is an algorithm that groups the nodes with respect to their inter-node latency. The cloud scheduler performs the scheduling decision on the basis of the grouping information obtained from the algorithm. In this solution, the algorithm is quite capable of grouping the nodes with the minimum available latency information. Latency information is available only for those nodes which have done some communication with other nodes. Thus we have come up with a prudent solution, which heuristically performs the grouping decision on incomplete latency information. It does not rely on broadcasting to calculate the inter-node latency, as broadcasting consumes a lot of bandwidth and puts a burden on network traffic. The proposed algorithm does not send special messages to other nodes to calculate the latency; it uses a piggyback technique, determining the latency for nodes which have already done some communication. Thus it has a minimal effect on the network traffic load. The algorithm produces multiple mutually exclusive groups with a variable number of nodes, so a node can appear in only one group. This is ideally suitable when a process wants to execute its sub-tasks on multiple groups. The scheduler finds the group or groups with the most suitable size for the demanding process: for a process which is not going to demand more resources during its execution, it should find a group close to the minimum number of nodes required; for a process which can demand scalable resources, the scheduler chooses a group which can support execution at the peak scalability demanded. The algorithm has different configurations, which differ according to the situation, and the result is normally different for every configuration. The algorithm runs periodically, and also when there is some change in the cloud, like process migration to another virtual machine, a change of IP address, etc.

The proposed algorithm discovers the groups by checking the inter-node latency between the nodes. It groups the nodes which are within the threshold and the group threshold and which also fulfill some other requirements. A node can be regrouped by checking whether it has a lower latency and group latency to another group or not. The algorithm runs for every node and decides on node grouping by finding the latency to its neighbor nodes. This latency is compared to a threshold and a group threshold value, which act as the main consideration in the node grouping decision. The algorithm is iterated a number of times equal to the number of nodes.

The algorithm has three phases: initialization, configuration, and reconfiguration. The initialization phase is responsible for setting up the environment for the algorithm execution. The configuration phase is the most important and core phase of the algorithm; it is responsible for discovering the groups, i.e. it is the actual group discovery phase. The reconfiguration phase is there to group the nodes which could not be grouped with other nodes during the configuration phase. Before going further into the details of these phases, we introduce the concepts used in the algorithm.

A visit node is a node which is selected during each iteration to check its latency with its neighbor nodes and the possible grouping of those neighbors. A neighbor node is a node which has done some communication with a particular visit node. A root node is the core node in a group: all the nodes in a particular group are grouped by their direct or indirect latency to this node. The threshold is the maximum latency allowed between two communicating nodes that are candidates to be grouped together; its value depends on the threshold policy. The group threshold is the maximum latency allowed for grouping: it is the distance latency from the visit node to the root node of its neighbor node. The group threshold policy also depends upon the threshold policy. The latency between a visit node and a neighbor node is compared with the threshold and the group threshold.
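As an illustration, these concepts can be captured in a small record type. The following Python sketch uses field names of our own choosing (the paper defines the concepts, not a data layout); the two constants mirror the values used in Experiment 1 below.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Hypothetical node record for the group discovery algorithm."""
    node_id: int
    rank: int = -1                # -1 means no rank assigned yet (spiral policy)
    group: Optional[int] = None   # group id, None while ungrouped
    root: Optional[int] = None    # root node through which this node was grouped
    neighbors: List[int] = field(default_factory=list)  # nodes it has communicated with

# Example settings, as in Experiment 1: group threshold is twice the threshold.
THRESHOLD_US = 1000        # max latency between two groupable nodes
GROUP_THRESHOLD_US = 2000  # max latency from a node to its neighbor's root
```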

3.1. Initialization phase

This is the pre-processing phase of the algorithm. In this phase, the initialization of the environment variables that help the algorithm execute is performed. The grouping decision in the next phase is done on the basis of these initial settings, which include the threshold, the group threshold, the regrouping index value, and some others. Some of these settings are directly stored, some are calculated according to the policies, and some are set by the administrator. These settings are determined by the different policies; for example, mean and standard deviation values can be calculated from the latencies and used to compute the threshold and group threshold.

Policies: There are different policies which impact the output of the configuration algorithm. They include the node ranking policy, the visit node selection policy, and the threshold policy.

3.1.1. Node ranking policy

This policy helps the visit node selection policy to choose a visit node. It assigns a rank to each node. A lower rank value in natural numbers (i.e. 0, 1, 2, 3, . . .) reflects a higher rank, so the value 0 is the highest rank. There are a few sub-policies, listed below.

The Spiral Policy is a dynamic policy: it assigns rankings to the nodes during algorithm execution. The only exception is that rank value 0 is assigned to one of the nodes before the algorithm executes; normally, this is the node that arrived first or the node with the maximum number of neighbor nodes. All the other nodes have rank value −1 (which reflects no assigned rank). The assignment starts with one node, which assigns rankings to its neighbor nodes; these neighbors then assign rankings to their neighbors, and so on. So the procedure is that the node with rank 0 is selected first by the visit node selection policy as the visit node in the first iteration. The visit node then assigns the rank visitNodeRank + 1, i.e. rank 1, to its neighbor nodes. In the next iteration, one of the neighbor nodes is selected as the visit node, and it assigns the rank visitNodeRank + 1 to those of its neighbor nodes that still have rank value −1.

The Neighbor Count based Policy calculates the number of neighbors of each node. It then creates a list of nodes sorted in descending order of neighbor count and assigns ranks to the nodes in this list: the node(s) with the maximum number of neighbor nodes get rank 0, the node with the next lower neighbor count gets value 1, and so on.

The Arrival Time Policy assigns ranks with respect to the time of arrival. It creates a list of nodes sorted in ascending order of arrival time, then assigns rank value 0 to the node that arrived first, value 1 to the node that arrived next, and so on. It is based on first arrival because the node that arrived first is more likely to have done more communication than the node that arrived last.

The Random Policy assigns the rank value randomly within a range from 0 to the number of nodes. This policy is not recommended for normal use, as it may produce unusual groups.

After evaluating all the policies with our algorithm (using different other configuration settings), we recommend the spiral policy as the first-choice node ranking policy. In most of our experiments, it produced the better grouping results, so we have set the spiral policy as the default node ranking policy in the RACS module. The results of the neighbor count based policy depend on when it is used. If it is used when the processes have already done a good amount of communication inside the groups created by our algorithm, its behavior is very good, sometimes even better than the spiral policy. But if the processes have not used the groups created by our algorithm much, its behavior is undetermined: sometimes its results are good and sometimes not. The two main policies are sketched below.
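A minimal Python sketch of the spiral and neighbor-count ranking policies, assuming a `neighbors` map from every node id to the list of nodes it has communicated with (the breadth-first pass approximates the rank assignment that happens gradually during algorithm execution; function names are ours):

```python
from collections import deque

def spiral_ranks(neighbors, seed):
    """Spiral policy sketch: the seed gets rank 0 and each visit node gives
    its still-unranked neighbors rank visitNodeRank + 1."""
    ranks = {n: -1 for n in neighbors}
    ranks[seed] = 0
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nb in neighbors[node]:
            if ranks[nb] == -1:
                ranks[nb] = ranks[node] + 1
                queue.append(nb)
    return ranks

def neighbor_count_ranks(neighbors):
    """Neighbor-count policy sketch: the node(s) with the most neighbors get
    rank 0, the next-highest count gets rank 1, and ties share a rank."""
    counts = sorted({len(v) for v in neighbors.values()}, reverse=True)
    rank_of = {c: r for r, c in enumerate(counts)}
    return {n: rank_of[len(v)] for n, v in neighbors.items()}
```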

3.1.2. Visit node selection policy

This policy has a strong influence on the result of the algorithm. It is based on the ranking policy and decides the candidate visit node in each iteration. It selects the node with the highest rank; if two nodes share the highest rank, the decision is made on the basis of a sub-policy. The sub-policies are:

32 S. Malik et al. / Future Generation Computer Systems 31 (2014) 28–39

The First Node Arrived Policy selects the node which arrived first among the highest ranked nodes. In the beginning, all the nodes have the same rank.

The Maximum Neighbor Policy selects the node which has the maximum number of neighbor nodes among the highest ranked nodes.

The Random Node Policy selects a node randomly among the highest ranked nodes.
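A combined sketch of the selection policy with its three tie-breaking sub-policies; argument names are ours, and `arrival` is assumed to map node ids to arrival times:

```python
import random

def select_visit_node(unvisited, ranks, neighbors, arrival, sub_policy):
    """Pick the unvisited node with the best rank (lowest rank value);
    resolve ties with the chosen sub-policy."""
    def rank_key(n):
        r = ranks.get(n, -1)
        return r if r >= 0 else float("inf")  # unranked nodes come last
    best = min(rank_key(n) for n in unvisited)
    tied = [n for n in unvisited if rank_key(n) == best]
    if sub_policy == "first_arrived":
        return min(tied, key=lambda n: arrival[n])       # earliest arrival
    if sub_policy == "max_neighbors":
        return max(tied, key=lambda n: len(neighbors[n]))
    return random.choice(tied)                            # random sub-policy
```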

3.1.3. Threshold and group threshold policy

This is the most important policy to decide. The threshold and group threshold play the most significant role in the output of the algorithm, so deciding on them is genuinely difficult.

The threshold and group threshold can be decided:

• by the scheduler administrator, according to the maximum communication latency required by potential applications. The administrator can also tune performance by setting the threshold value on the basis of past experience. It can also be set by the application that needs a group of nodes to execute the process. A cloud user or application can submit the threshold value along with the process through our RACS module's open interface. In this case, the algorithm runs on demand for this particular application, and its result is available only to the demanding process, since the threshold requirements were set only for that process.

• on the basis of statistical information about inter-node latency. It can be set on the basis of either the median or the mean latency; it is preferable to decide on the basis of the mean latency and standard deviation. In the normal case, the threshold equals the mean or median latency of all the nodes. The group threshold can be either a combination of mean and standard deviation, or the median plus some fraction of the median.

In the first case, the administrator needs to be very experienced in dealing with this type of situation, and can then achieve highly favorable results. In the second (statistical) case, the mean latency can be affected by outliers; the median tends to work better than the mean, as it is not affected by outliers. The next difficult decision is the choice of a value for the group threshold; in the normal case, we select it as twice the threshold. The statistical variants are sketched below.
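A small sketch of the statistical threshold policy, under the combinations described above (median with group threshold = 2 × threshold, as in Experiment 7, or mean with group threshold = mean + standard deviation, as in Experiment 4); the function name is ours:

```python
import statistics

def statistical_thresholds(latency_samples_us, policy="median"):
    """Compute (threshold, group_threshold) from observed latency samples."""
    if policy == "median":
        threshold = statistics.median(latency_samples_us)
        group_threshold = 2 * threshold
    else:  # mean-based variant
        threshold = statistics.mean(latency_samples_us)
        group_threshold = threshold + statistics.stdev(latency_samples_us)
    return threshold, group_threshold
```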

3.2. Configuration phase

This is the core phase of the group discovery algorithm. In this phase, we apply the main part of the algorithm: it groups the nodes with respect to their inter-node latency. The algorithm runs for each node and decides on its grouping by finding the latency to its neighbor nodes. The input to the algorithm is the latency and configuration information. The configuration phase algorithm is shown in Algorithm 1.

3.2.1. Variables and concepts used in the algorithm

Here is a brief explanation of the concepts and variables used in the group discovery algorithm.

numberOfNeighborOfVisitNode gives the total number of neighbor nodes of a particular node.
previousOfVisitNode is the node against which the visit node's grouping relationship (based on latency) was assessed last time, i.e. the two nodes were last checked for whether one can be grouped with the other. By default, the value of this variable is null.
previousOfNeighborNode is the node against which a neighbor node's grouping relationship (based on latency) was assessed last time.
checkForGroupableToVisitNodeRoot is a procedure that checks whether a neighbor node can be grouped with the group of the visit node.
rootOfVisitNode is the root node due to which a visit node was grouped into that root's group. If the visit node has not been grouped before, or is itself the root node of other nodes, its root value is null.
groupOfRootNode is the group to which a root node belongs.
thresholdCounter counts how many of the neighbor nodes have latency less than the threshold; whenever a neighbor node fulfills this requirement, the counter is incremented.
moreThanGroupThresholdCounter counts how many of the neighbor nodes have latency less than the threshold to the visit node but a distance greater than the group threshold to the root of the visit node; whenever a neighbor node fulfills this requirement, the counter is incremented.
regroupingIndex is a value set by the administrator. It is compared against the ratio of moreThanGroupThresholdCounter to thresholdCounter. If this ratio is less than the regroupingIndex, all elements of the temporary group (i.e. neighbor nodes with latency below the threshold to the visit node) are grouped with the group of the visit node; if the ratio is more than the regroupingIndex, a new group is created for the elements of the temporary group. The default value of the regroupingIndex is 0.5, meaning that at least half of the neighbor nodes with latency below the threshold to the visit node must also have latency below the group threshold to the root of the visit node.
isVisitNodeRemovable checks whether a visit node can be removed from its original group. It applies only when the ratio of moreThanGroupThresholdCounter to thresholdCounter is more than the regroupingIndex. In this case, it checks whether any latency between a neighbor node and the visit node is lower than the latency between the visit node and its previous node; if so, it signals that the visit node should be removed from the old group and added to the new group along with the other members of the temporary group.
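The regroupingIndex test reduces to a simple ratio check; a sketch under the definitions above (function and argument names are ours):

```python
def tempgroup_merges_into_visit_group(threshold_counter,
                                      more_than_group_threshold_counter,
                                      regrouping_index=0.5):
    """Merge the temporary group into the visit node's group when at most
    `regrouping_index` of the neighbors within the threshold also exceed the
    group threshold to the root; otherwise a new group is split off."""
    if threshold_counter == 0:
        return True  # no qualifying neighbors, nothing to split off
    ratio = more_than_group_threshold_counter / threshold_counter
    return ratio <= regrouping_index
```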

3.2.2. Algorithm description

The algorithm creates the groups by checking the latency between the nodes. It first selects a starting node using the visit node selection policy. Then it checks the latency to its neighbor nodes and compares it with the threshold. If the inter-node latency is less than the threshold, the neighbor can be grouped with the visit node.

The algorithm iterates once per node in the system. In each iteration, it selects a visit node according to the visit node selection policy, which picks the node with the highest rank among all non-visited nodes. It then checks the latency between the visit node and each of its neighbor nodes. If the latency between the visit node and the neighbor node is less than or equal to the threshold, it checks whether the neighbor node is already grouped. If the neighbor node is not yet grouped, it adds the neighbor node to a temporary group. After this, it checks whether the sum of the latencies from the neighbor of the visit node to the root of the visit node is less than the group threshold; if it is not, it raises a flag that the neighbor node cannot be grouped with the root of the visit node. It then compares the latency between the neighbor node and the visit node with the latency between the visit node and its previous node; if any of the neighbor nodes has a lower latency to the visit node than the latency from the visit node to its previous node, the visit node is removed from its old group and added to the temporary group. In the other case, if the neighbor node is already grouped, it further checks whether it has also been visited. If it has not been visited, it adds the node to the temporary group. But if it has already been visited, it checks whether the latency between the neighbor node and the visit node is lower than the latency between the neighbor node and its previous node. If so, it checks whether the sum of latencies from the neighbor of the visit node to the root of the visit node exceeds the group threshold; if it does, it raises a flag that the neighbor node cannot be grouped with the root of the visit node. It also checks whether any of the neighbor nodes has a lower latency to the visit node than the latency from the visit node to the previous node; if so, it removes the visit node from the old group and adds it to the temporary group.

On the other hand, if the latency between the visit node and the neighbor node is greater than the threshold, it checks whether the neighbor node is already grouped. If it is, nothing is done; if the neighbor node is not grouped, a new group is created and the neighbor node is added to it. The algorithm then checks whether the temporary group can be merged into the group of the visit node's root. This is checked by comparing the ratio of moreThanGroupThresholdCounter to thresholdCounter with the regrouping index (normally half of the number of neighbor nodes within the threshold). If the ratio is less than or equal to the regroupingIndex, all elements of the temporary group are copied to the group of the visit node; otherwise a new group is created. It then checks whether the visit node can be removed from its old group (it is removable if the latency between any neighbor node and the visit node is less than the latency between the visit node and its previous node); if so, the visit node is removed from the old group and added to the new group. All nodes of the temporary group are then copied to the new group. Finally, the visit node's rank is set to 0 to indicate that it has been visited. These steps are repeated, starting from the selection of a visit node.


Algorithm 1: Group Discovery Algorithm - Configuration Phase

Data: nodes, latency, threshold, groupThreshold
Result: groups of nodes with minimal inter-node latency

initialization;
for each node in registry do
    select visitNode
    if numberOfNeighborOfVisitNode = 0 OR (previousOfVisitNode = null AND visitNode ≠ grouped) then
        create newGroup
        add visitNode to newGroup
    end
    for each neighborNode of visitNode do
        input latency(visitNode, neighborNode)
        if latency(visitNode, neighborNode) ≤ threshold then
            if neighborNode ≠ grouped then
                add neighborNode to tempGroup
                call checkForGroupableToVisitNodeRoot(visitNode, neighborNode)
            else if neighborNode = grouped then
                if neighborNode ≠ visited then
                    add neighborNode to tempGroup
                else if neighborNode = visited AND latency(visitNode, neighborNode) < latency(neighborNode, previousOfNeighborNode) then
                    call checkForGroupableToVisitNodeRoot(visitNode, neighborNode)
                    add neighborNode to tempGroup
                end
            end
        else if latency(visitNode, neighborNode) > threshold then
            if neighborNode ≠ grouped then
                create newGroup
                add neighborNode to newGroup
            end
        end
    end
    check whether tempGroup is group-able to groupOfRootNode:
    if moreThanGroupThresholdCounter / thresholdCounter ≤ regroupingIndex then
        copy tempGroup to groupOfVisitNode
    else
        create newGroup
        if isVisitNodeRemovable = true then
            remove visitNode from existingGroup
            add visitNode to newGroup
        end
        copy tempGroup to newGroup
    end
    set rank of visitNode to 0
end

Procedure checkForGroupableToVisitNodeRoot(visitNode, neighborNode):
if sumOfLatency(neighborOfVisitNode, rootOfVisitNode) > groupThreshold then
    increase moreThanGroupThresholdCounter
end
if latency(visitNode, neighborNode) < latency(visitNode, previousOfVisitNode) then
    isVisitNodeRemovable ← true
    add visitNode to tempGroup
end



Algorithm 2: Group Discovery Algorithm - Reconfiguration Phase

Data: nodes, groups, latency, threshold, pairLatency, prevMinLatency
Result: groups of nodes with minimal inter-node latency after finishing

initialization;
for each group in registry do
    prevMinLatency ← 0
    if groupSize = 1 then
        for each potentialGroup in registry do
            if potentialGroupSize > 1 then
                pairLatency ← findPairLatency(currentNode, potentialGroupLeader)
                if pairLatency ≤ threshold AND (pairLatency < prevMinLatency OR prevMinLatency = 0) then
                    prevMinLatency ← pairLatency
                    move candidateNode to newGroup
                    newGroupNumber ← potentialGroup
                end
            end
        end
    end
end
update latency in registry


3.3. Reconfiguration phase

This is the regrouping and finishing phase of the algorithm. It tries to regroup the nodes which belong to a single-node group with some other group (one which has more than one node). It checks for those groups which have only one node; these nodes are candidates to be grouped with some other nodes. The group creation is already complete without this phase, but it gives a better finish to the algorithm. The reconfiguration phase algorithm is shown in Algorithm 2.

In this phase, every candidate node probes the latency to the leaders of all the groups. It sends empty messages (if no communication has been done previously) to a leader to find the latency. It tries to find the leader node with the minimum latency among all, and within the threshold. If it finds that the minimum latency is less than the threshold, it groups the candidate node with the group of the leader node having the minimum inter-node latency. If it fails to find a leader node with latency less than the threshold, it does not group the node with anyone, and the node remains an independent group with one node. Finally, it removes all the empty groups.

At the end of this phase, the job of the group discovery algorithm is finished and we obtain the final groups. The nodes in each group are expected to be closer to the other nodes in the same group than to the nodes in other groups.

This algorithm phase starts by checking each group with group size 1, which identifies the groups with only one element; these are the candidate groups. The node in such a group then calculates the pair latency (pairLatency) with the leader of every group having more than one node (potentialGroup). It then compares the latency of the node in the candidate group with prevMinLatency and the threshold: pairLatency must be less than or equal to the threshold for the node to be groupable into the potential group. prevMinLatency is the minimum pair latency found in the previous cycles of the loop for a particular candidate node. If the pairLatency is less than the prevMinLatency, the grouping information is updated. At the end, the new inter-node latency information is written to the registry.
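A minimal Python sketch of this phase (names are ours; `groups` is assumed to map group ids to member lists, `leaders` to map group ids to their leader node, and `latency` to hold probed pair latencies):

```python
def reconfigure(groups, leaders, latency, threshold):
    """Every singleton group probes the leaders of all multi-node groups and
    joins the closest leader within the threshold; otherwise it stays an
    independent single-node group."""
    for gid, members in list(groups.items()):
        if len(members) != 1:
            continue
        candidate = members[0]
        best_gid, best_lat = None, None
        for other, other_members in groups.items():
            if other == gid or len(other_members) <= 1:
                continue
            lat = latency.get((candidate, leaders[other]))
            if lat is not None and lat <= threshold and (best_lat is None or lat < best_lat):
                best_gid, best_lat = other, lat
        if best_gid is not None:
            groups[best_gid].append(candidate)
            del groups[gid]  # drop the now-empty singleton group
    return groups
```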

4. Experiments and results

We have implemented our algorithm and evaluated it in different environments with different configurations. We have run these experiments on different clusters, grids, and a cloud, using different alternative configurations for the algorithm and trying to find the better configuration for each situation. The results obtained from these experiments reflect the algorithm's efficiency with different configurations under different conditions.

In this paper, we include only the experiments performed on Grid5000 [21] and our laboratory clusters. Grid5000 is a scientific experiment test-bed designed to support experiment-driven research in the areas of parallel, large-scale distributed computing and networking. There are multiple sites in Grid5000, geographically located far from each other at variable distances, as shown in Fig. 1. In this type of network, the algorithm works better with the configuration value set by the administrator, whereas configurations based on the mean and median normally do not behave well. All of the following experiments were performed with the spiral ranking policy.

4.1. Group discovery experiments of GDA

We have conducted the group discovery experiments on two types of platforms, i.e. Grid5000 and local clusters. The two platforms are quite different in terms of inter-node latency, so they give somewhat different results.

4.1.1. Group discovery experiments on Grid5000

Fig. 1. Sites of Grid5000 across France [21].

We have selected multiple sites for our experiments; here we list a few of them. All time values shown in the experiments are in microseconds. Experiments 1 to 3 were conducted on 6 sites of Grid5000 with 8 nodes on each site. The node distribution was as follows:

Node 00–07: Grenoble
Node 08–15: Lyon
Node 16–23: Bordeaux
Node 24–31: Toulouse
Node 32–39: Sophia
Node 40–47: Nancy

Experiment # 1: In the first experiment, we set the threshold value at 1000 µs. The group threshold value is set to twice that, i.e. 2000 µs. In this experiment, the nodes communicated randomly with each other; some nodes communicated more and some less. The maximum number of nodes that can be contacted by a particular node is fifteen. We set a limit on the maximum communication instances for a node, because if communication is done with more nodes, the algorithm results are more appropriate, as more latency information is available; so in this experiment we restricted the maximum node connection count to 15. During the experiment, we had 415 inter-node communication instances, so the ratio of actual communication to broadcast communication is 415/(48 × 47) ≈ 0.184. The median latency is 11 112 µs. The mean latency is 10 978 µs with a standard deviation of 1608 µs.

The result of the configuration phase is shown in Table 2(a). Here we can see that the configuration phase of the algorithm generated 9 groups. Most of the groups are complete. Nodes in group 1 and group 2 belong to the same site (i.e. Lyon), but they are not grouped together. There can be many reasons for this type of situation: either the nodes in the two groups have not communicated at all, or some of them have latency higher than the threshold. In this experiment, none of the nodes in group 1 communicated with the nodes in group 2, and vice versa, so the complete lack of information between the two resulted in the formation of two distinct groups from the same site. Node 42 in group 7 did not communicate with any node in the system, so it cannot be grouped with any other node in this phase. Node 20 communicated with 4 other nodes, but the latency between them was more than the threshold, so it also cannot be grouped with any other node. In this experiment, we can see that despite having much less latency information, the algorithm performed well and came up with a good group formation.

Table 2
Groups formation on Grid5000.

(a) Groups after configuration phase

Group # | Leader | Node number
0 | 6 | 0, 2, 4, 6, 1, 3, 7, 5
1 | 9 | 14, 9, 11, 12, 15
2 | 8 | 8, 13, 10
3 | 19 | 21, 18, 19, 22, 23, 16, 17
4 | 28 | 29, 24, 25, 26, 27, 31, 30, 28
5 | 37 | 36, 33, 35, 37, 38, 32, 39, 34
6 | 43 | 43, 40, 46, 47, 41, 45, 44
7 | 42 | 42
8 | 20 | 20

(b) Groups after reconfiguration phase

Group # | Leader | Node number
0 | 6 | 0, 2, 4, 6, 1, 3, 7, 5
1 | 9 | 14, 9, 11, 12, 15
2 | 8 | 8, 13, 10
3 | 19 | 21, 18, 19, 22, 23, 16, 17, 20
4 | 28 | 29, 24, 25, 26, 27, 31, 30, 28
5 | 37 | 36, 33, 35, 37, 38, 32, 39, 34
6 | 43 | 43, 40, 46, 47, 41, 45, 44, 42

Table 3
Groups in broadcast communication.

(a) Groups after configuration phase

Group # | Leader | Node number
0 | 7 | 0, 1, 2, 3, 4, 5, 6, 7
1 | 15 | 8, 9, 10, 11, 12, 13, 14, 15
2 | 23 | 16, 17, 18, 19, 20, 21, 22, 23
3 | 31 | 24, 25, 26, 27, 28, 29, 30, 31
4 | 39 | 32, 33, 34, 35, 36, 37, 38, 39
5 | 47 | 40, 41, 42, 43, 44, 45, 46, 47

(b) Groups after reconfiguration phase

Group # | Leader | Node number
0 | 7 | 0, 1, 2, 3, 4, 5, 6, 7
1 | 15 | 8, 9, 10, 11, 12, 13, 14, 15
2 | 23 | 16, 17, 18, 19, 20, 21, 22, 23
3 | 31 | 24, 25, 26, 27, 28, 29, 30, 31
4 | 39 | 32, 33, 34, 35, 36, 37, 38, 39
5 | 47 | 40, 41, 42, 43, 44, 45, 46, 47

Table 2(b) shows the result of the reconfiguration phase. In this phase, both single-node groups (i.e. 7 and 8) are processed. Node 42 and node 20 both find a leader node within the latency threshold and are grouped with it: node 20 is grouped into group 3 and node 42 into group 6.

Experiment # 2: We also tested the algorithm for the case when we do have complete latency information, i.e. N × (N − 1) communication instances are done, where N is the number of nodes. In this case we set the threshold value at 1000 µs and the group threshold value at twice that. During this experiment, we have N × (N − 1) (i.e. 2256) inter-node communication instances. The median latency is 11 099 µs. The mean latency is 11 124 µs with a standard deviation of 975 µs.

The algorithm produced a 100% correct result after the configuration phase, as shown in Table 3(a). In the reconfiguration phase, the algorithm has nothing to do except remove any empty groups. The result of reconfiguration is shown in Table 3(b), which is identical to Table 3(a).

Experiment # 3: In the next experiment we tested the broadcast communication scenario while setting the threshold value on a statistical basis instead of an administrator heuristic. The groups that were formed are given below; they are identical for both mean and median based thresholds.

Table 4(a) shows that the algorithm grouped all the nodes into one group. This is due to the fact that a statistical threshold gives poor results in this type of situation, because there is much more communication outside the clusters than inside.

36 S. Malik et al. / Future Generation Computer Systems 31 (2014) 28–39

Table 4
Groups in broadcast communication using statistical based threshold.

(a) Groups after configuration phase

Group # | Leader | Node number
0 | 26 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 32, 33, 34, 35, 36, 37, 38, 39, 24, 30, 31, 40, 41, 42, 43, 44, 45, 46, 47, 25, 27, 28, 29, 26

(b) Groups after reconfiguration phase

Group # | Leader | Node number
0 | 26 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 32, 33, 34, 35, 36, 37, 38, 39, 24, 30, 31, 40, 41, 42, 43, 44, 45, 46, 47, 25, 27, 28, 29, 26

Table 5
Groups using statistical (mean) based threshold.

(a) Groups after configuration phase

Group # | Leader | Nodes number
0 | 1 | 0, 7, 9, 14, 3, 8, 13, 19, 15, 17, 1, 2, 5, 12, 11, 18, 4, 10, 16
1 | 20 | 25, 20, 21, 24, 26, 29
2 | 28 | 23, 27, 22, 28
3 | 37 | 37, 34, 39
4 | 30 | 30
5 | 6 | 6
6 | 32 | 31, 33, 38, 35, 36, 32

(b) Groups after reconfiguration phase

Group # | Leader | Nodes number
0 | 1 | 0, 7, 9, 14, 3, 8, 13, 19, 15, 17, 1, 2, 5, 12, 11, 18, 4, 10, 16, 6
1 | 20 | 25, 20, 21, 24, 26, 29
2 | 28 | 23, 27, 22, 28
3 | 37 | 37, 34, 39
4 | 32 | 31, 33, 38, 35, 36, 32, 30

Every node communicates with 40 nodes outside its cluster which are geographically far away, whereas intra-cluster communication is done with only 7 nodes. Latency to the nodes of other clusters is much higher, thus resulting in a higher value of the mean and median threshold. This mechanism is well suited to situations where there is more communication within a cluster. Therefore, we recommend using the administrator heuristic to determine the threshold value, except in cases where automated threshold calculation has to be performed.

Table 4(b) shows that there is no change in the reconfiguration phase from the configuration phase, as all the nodes were grouped into one single group and there was no group with a single node.

Experiment # 4: This experiment was conducted on 4 sites of Grid5000, with 10 nodes on each of 3 sites and 8 nodes on the last site. The node distribution was as follows:

Node 00–09: Grenoble
Node 10–19: Lyon
Node 20–29: Bordeaux
Node 30–37: Toulouse

In this experiment, we set the threshold value equal to the mean latency, i.e. 7524 µs. The group threshold value is set to the mean plus the standard deviation. In this experiment the nodes communicated randomly with each other; some communicated more and some less. During the experiment, we had 324 inter-node communication instances, so the ratio of actual communication to broadcast communication is 0.115. The median latency is 8843 µs. The mean latency is 7524 µs with a standard deviation of 2574 µs.

Initially, the configuration phase created 7 groups, as shown in Table 5(a). This is quite close to the real clustering. In this setting, the two sites Grenoble and Lyon have lower inter-site latencies as compared to the others, so we can see that the nodes from both of these sites are grouped together, as the threshold value is considerably higher than their average inter-site latency. Nodes 34, 37 and 39 on the Toulouse site communicated among themselves, and also with nodes to which their latency is higher than the threshold; thus they are grouped together. After the reconfiguration phase we ended up with 5 groups, as shown in Table 5(b). In this phase, node 30 found group 4 to be grouped with and node 6 found group 0 to be grouped with.

Table 6
Groups using statistical (mean) based threshold on local clusters.

(a) Groups after configuration phase

Group # | Leader | Nodes number
0 | 4 | 0, 1, 2, 11, 4, 5, 10, 6, 17, 7, 8, 18, 3, 9, 16, 14, 13
1 | 15 | 19, 12, 15
2 | 29 | 22, 20, 21, 24, 29, 33, 26, 23, 32, 28, 30, 27, 31, 25
3 | 38 | 34, 35, 38, 41, 36, 51, 42, 37, 47, 43, 44, 45, 50, 48, 49
4 | 39 | 39
5 | 46 | 46, 40

(b) Groups after reconfiguration phase

Group # | Leader | Nodes number
0 | 4 | 0, 1, 2, 11, 4, 5, 10, 6, 17, 7, 8, 18, 3, 9, 16, 14, 13
1 | 15 | 19, 12, 15
2 | 29 | 22, 20, 21, 24, 29, 33, 26, 23, 32, 28, 30, 27, 31, 25
3 | 38 | 34, 35, 38, 41, 36, 51, 42, 37, 47, 43, 44, 45, 50, 48, 49, 39
4 | 46 | 46, 40


4.1.2. Group discovery experiments on local clusters

These experiments were conducted on the three clusters of our laboratory. These clusters are geographically co-located in the same laboratory but in different buildings, so the latency between their nodes is not very high. On average, the latency between nodes belonging to the same cluster is not much different from the latency between nodes belonging to different clusters. This network model is quite different from Grid5000, where nodes belonging to the same site have much lower inter-node latency than nodes belonging to different sites. The node distribution on the clusters was as follows:

Cluster-1: Node 00–19
Cluster-2: Node 20–33
Cluster-3: Node 34–51

These experiments were run three times, with different threshold and group-threshold values each time. In each run, we used the same data set for group creation. In the experiments, the nodes communicated randomly with each other; some communicated more and some less. During the experiments, we had 195 inter-node communication instances, so the ratio of actual communication to broadcast communication is 195/(52 × 51) ≈ 0.0735. The median latency is 193 µs. The mean latency is 196 µs with a standard deviation of 39 µs.

Experiment # 5: In this experiment, the threshold value was set to the mean and the group-threshold to the mean plus standard deviation. Table 6(a) shows the results after the configuration phase using the mean based threshold. In this case, the threshold value is 196 µs and the group threshold value is 235 µs (mean + standard deviation). There are 6 groups created. Groups 0, 2, and 3 represent groups close to the actual clustering. Nodes 19, 12, and 15 formed a separate group, as they were not able to merge with group 0 due to the small group-threshold value. Nodes 46 and 40 formed a separate group: node 40 only communicated with node 46, while node 46 communicated with nodes 21 and 40. Node 46 has a latency of 217 µs with node 21, so it cannot be in the same group as node 21; having a latency below the threshold with node 40, it forms a separate group with it.
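The grouping decision for nodes 46 and 40 follows directly from the threshold test, as the following small illustration shows (the helper function is illustrative only, not part of the algorithm's code):

```python
THRESHOLD_US = 196        # mean latency (Experiment 5)
GROUP_THRESHOLD_US = 235  # mean + standard deviation

def may_share_group(latency_us):
    # Two communicating nodes can only be grouped directly if their
    # measured latency stays below the threshold.
    return latency_us < THRESHOLD_US

print(may_share_group(217))  # False: node 46 cannot join node 21's group
```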

Experiment # 6: In this experiment, the threshold and group-threshold values are set by the administrator. Table 7 shows the results after the configuration phase using the administrator-set threshold. In this case, the threshold value is


Table 7
Groups after configuration phase using threshold set by the administrator on local clusters.

Group #   Leader   Node numbers

0    4    0, 1, 2, 11, 4, 5, 10, 6, 17, 7, 8, 18, 3, 9, 19, 16, 12, 15, 14, 13
1   29    22, 20, 21, 24, 29, 33, 26, 23, 32, 28, 30, 27, 31, 25
2   38    34, 35, 38, 41, 36, 51, 42, 37, 47, 39, 43, 44, 45, 50, 48, 49
3   46    46, 40

Table 8
Groups after configuration phase using statistical (median) based threshold on local clusters.

Group #   Leader   Node numbers

0    4    0, 1, 2, 11, 4, 5, 10, 6, 17, 7, 8, 18, 3, 9, 19, 16, 14, 13
1   15    12, 15
2   29    22, 20, 21, 24, 29, 33, 26, 23, 32, 28, 30, 27, 31, 25
3   38    34, 35, 38, 41, 36, 51, 42, 37, 47, 39, 43, 44, 45, 50, 48, 49
4   46    46, 40

200 µs and the group threshold value is 400 µs (threshold × 2), both set by the administrator. There are 4 groups created, which is quite close to the actual physical clustering. Nodes 46 and 40 formed a separate group, as they have a latency below the threshold between them and above the group threshold with every other node they have communicated with.

Experiment # 7: In this experiment, the threshold value is set to the median and the group-threshold to twice the threshold. Table 8 shows the results after the configuration phase using the median based threshold. In this case, the threshold value is 193 µs (equal to the median) and the group threshold value is 386 µs (median × 2). There are 5 groups created. Nodes 12 and 15 formed a separate group, as they were not able to merge with group 0 due to the small group-threshold value. Nodes 46 and 40 formed a separate group, as they have a latency below the threshold between them and above the group threshold with every other node they have communicated with.

After conducting all these experiments, we conclude that our algorithm works well on a variety of platforms and topologies. Threshold and group threshold values are the keys to the grouping results in our algorithm, so they should be chosen with great care. The results show that if the network nodes have a narrow range of inter-node latencies, then any latency policy can be used; for simplicity and automation, the statistical approach can be used. But if the inter-node latencies vary widely, then it is better that the cloud scheduler administrator sets the values for the threshold and group threshold, because a statistical threshold policy (mean or median) can be affected by outlier values. The threshold can also be set through the user application according to its maximum allowable inter-node latency.

4.2. Performance evaluation experiments of GDA

We have evaluated our group discovery algorithm (GDA) for its performance. We have conducted multiple experiments to evaluate how much our group discovery algorithm can impact the performance of a distributed application.

To evaluate, we built two applications, one for the group discovery and the other for the algorithm evaluation. We call the first application the Preprocessing App and the second the Test App. Both of these applications have different sub-processes, which are supposed to run on different virtual machines. These sub-processes require coordination with each other to accomplish their tasks. In each experiment, we first run the Preprocessing App with different inter-node communication instance requirements (i.e. the number of nodes a node can communicate with). Meanwhile, our group discovery algorithm creates groups on the basis of inter-node latencies acquired through this communication process. Then we run the Test App for the performance evaluation of our GDA. First we run the Test App without the help of GDA (i.e. with no knowledge of the groups); after this, we run the Test App on the basis of our grouping algorithm.

We evaluated the algorithm on two types of infrastructure. The first experiment is conducted on Grid5000, where the nodes are spread over different sites that are far from each other. The second experiment is conducted on local clusters, where the nodes belonging to different clusters are not far from each other.

4.2.1. Performance evaluation on Grid5000

In these experiments, inter-node latencies between nodes belonging to the same physical group (cluster) are quite low compared to the inter-node latencies between nodes belonging to different physical groups. In the experiment, we first acquired 50 nodes on 5 sites of Grid5000. Then we ran the Preprocessing App three times, making three variations in the inter-node communication instances. The average communication instances were 15, 30, and 49 for each test run (variation). Our group discovery algorithm formed the groups during these variations. The results of these experiments are shown in Table 9.

Then we ran our Test App in two experiment sets. In the first set, we ran the Test App on 5 nodes with and without the use of our group discovery algorithm. The application (Test App) requires that each node communicates 100 times with every other node. The results are shown in Table 9(a). Without the use of our GDA (group discovery algorithm), we assigned the processing nodes randomly to the Test App. In this case, the Test App finished its execution in 21472 ms. Then we assigned the nodes to the Test App with the help of our GDA. Here we have three different test runs. In experiment Test-G5K-1.1, we assigned the nodes to the Test App on the basis of grouping information acquired by running the Preprocessing App with an average inter-node communication of 15 nodes. In this test run, the Test App completed its execution in 15169 ms, which is a performance gain of 29.35%. In experiment Test-G5K-1.2, we assigned the nodes to the Test App on the basis of grouping information acquired by running the Preprocessing App with an average inter-node communication of 30 nodes. In this test run, the Test App completed its execution in 13340 ms, which is a performance gain of 37.87%. In experiment Test-G5K-1.3, we assigned the nodes

Table 9
Performance evaluation on Grid5000.

Exp # Preproc. app. comm. inst. Test app. node req. Exec. time w/o GDA (ms) Exec. time with GDA (ms) % Gain

(a) Evaluation experiment: Test-G5K-1

Test-G5K-1.1    15    5    21472    15169    29.35%
Test-G5K-1.2    30    5    21472    13340    37.87%
Test-G5K-1.3    49    5    21472    13214    38.46%

(b) Evaluation experiment: Test-G5K-2

Test-G5K-2.1    15    10    49685    40137    19.22%
Test-G5K-2.2    30    10    49685    35713    28.12%
Test-G5K-2.3    49    10    49685    34704    30.15%


Table 10
Performance evaluation on local clusters.

Exp # Preproc. app. comm. inst. Test app. node req. Exec. time w/o GDA (ms) Exec. time with GDA (ms) % Gain

(a) Evaluation experiment: Test-LC-1

Test-LC-1.1    15    5    3948    3781    4.23%
Test-LC-1.2    30    5    3948    3665    7.16%
Test-LC-1.3    39    5    3948    3649    7.57%

(b) Evaluation experiment: Test-LC-2

Test-LC-2.1    15    10    5721    5494    3.97%
Test-LC-2.2    30    10    5721    5386    5.86%
Test-LC-2.3    39    10    5721    5370    6.14%

to the Test App on the basis of grouping information acquired by running the Preprocessing App with an average inter-node communication of 49 nodes. In this test run, the Test App completed its execution in 13214 ms, which is a performance gain of 38.46%. Here we can notice that the performance of GDA is better when there is more inter-node communication. We can see a significant difference in the performance of Test-G5K-1.1 and Test-G5K-1.2, as the required number of nodes in the Test App is 5, which is quite close to 15 inter-node communication instances as compared to 30.
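The percentage gains reported in Tables 9 and 10 follow the standard relative speed-up formula, illustrated by the following small sketch (the function name is ours, for exposition):

```python
def percent_gain(t_without_ms, t_with_ms):
    # Relative reduction in execution time when scheduling with GDA.
    return 100.0 * (t_without_ms - t_with_ms) / t_without_ms

print(round(percent_gain(21472, 15169), 2))  # 29.35 (Test-G5K-1.1)
print(round(percent_gain(49685, 34704), 2))  # 30.15 (Test-G5K-2.3)
```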

In the second experiment set, we ran the Test App on 10 nodes with and without the use of our group discovery algorithm. The application requires that each node communicates 100 times with every other node. First, we assigned the processing nodes randomly to the Test App, i.e. without the use of our GDA. In this case, the Test App finished its execution in 49685 ms. Then we assigned the nodes for Test App execution on the basis of grouping information acquired with the help of our GDA. Here we have three different test runs; the results are shown in Table 9(b). We can see that the use of GDA gives a quite significant gain in application performance over not using GDA. If we compare this experiment set with the previous one, we can see that the performance gain is greater for the Test App with 5 required nodes than for the Test App with 10 required nodes.

4.2.2. Performance evaluation on local clusters

This experiment is conducted on local clusters, where the nodes belonging to different clusters are not far from each other. Inter-node latencies between nodes belonging to the same physical group (cluster) are not much different from those between nodes belonging to different physical groups. In this experiment, we first acquired 40 nodes on 3 local clusters. Then we ran the Preprocessing App three times, making three variations in the inter-node communication instances. The average communication instances were 15, 30, and 39 for each test run (variation). Our group discovery algorithm formed the groups during these variations. The results of these experiments are shown in Table 10.

To evaluate, we ran our Test App in two experiment sets, as we did in the experiments on Grid5000. In the first set, we ran the Test App on 5 nodes with and without the use of our group discovery algorithm. The application requires that each node communicates 100 times with every other node. The results are shown in Table 10(a). Without the use of our GDA (group discovery algorithm), we assigned the processing nodes randomly to the Test App. In this case, the Test App finished its execution in 3948 ms. Then we assigned the nodes to the Test App using grouping information acquired with the help of our GDA. Here we have three different test runs, based on grouping information acquired by running the Preprocessing App with an average inter-node communication of 15, 30, and 39 nodes respectively. We have performance gains of 4.23%, 7.16%, and 7.57% respectively for each test run. The behavior here is the same as on Grid5000: we can see a significant difference in the performance of the first test run (Test-LC-1.1) and the second test run (Test-LC-1.2). Here the required number of nodes in the Test App is 5, which is quite close to 15 inter-node communication instances as compared to 30.

In the second set, we ran the Test App on 10 nodes with and without the use of our group discovery algorithm. Again, the application requires that each node communicates 100 times with every other node. Without the use of our GDA, we assigned the processing nodes randomly to the Test App; in this case, the Test App finished its execution in 5721 ms. Then we assigned the nodes for Test App execution with the help of our GDA. Here we have three different test runs; the results are shown in Table 10(b). We can see a good performance gain when using GDA over not using it. If we compare this experiment set with the previous one, we again see that the performance gain is greater for the Test App with 5 required nodes than for the Test App with 10 required nodes.

After conducting these experiments, we have found that the group discovery algorithm gives a significant performance gain to an application. However, this performance gain is larger in an environment like Grid5000 than on local clusters. Our GDA is ideally suited to a cloud computing environment, as a cloud infrastructure typically consists of multiple data centers/clusters or even a multi-cloud environment. A cloud user is generally unaware of the location of the processing nodes, so our algorithm can really help in finding processing nodes (virtual machines) with lower inter-node latency.

5. Conclusions

Our proposed algorithm provides a viable solution for the grouping of nodes with respect to their inter-node latency. It specifically targets the situation where we do not have complete information about all the inter-node latencies: it works well on the latency information for the nodes that have communicated. It does not introduce any additional network traffic for the algorithm, and it is very simple to understand and use. Its algorithmic complexity of O(n²) is quite low compared to its counterparts. During the execution of the algorithm a node can be regrouped. Importantly, it produces mutually exclusive groups according to the problem statement, whereas most of the existing grouping algorithms produce overlapping groups or communities. It can also produce overlapping groups with small changes to the algorithm; this can be done by removing the regrouping steps in the configuration phase. It also does not require rerunning the algorithm each time a process wants a group of nodes; it can use the previously calculated groups. There are no false positives, but there exist a few chances of false negatives. In the case of a false negative, a node will fail to join a group with which it should be grouped.

Our proposed algorithm is independent of any application and works as part of the RACS module. RACS is a part of the cloud scheduler, and a user application running on the cloud can benefit from this algorithm. User applications can also use this algorithm


on demand to get node grouping information only for themselves.

In the future, we aim to integrate SmartSockets [22] with the ProActive cloud/grid and ultimately with our RACS module through the ProActive integration. SmartSockets is a communication library which enables communication through firewalls, NAT, non-routed networks, and multi-homed nodes by automatically discovering connectivity problems and solving them with as little support from the user as possible. The proposed algorithm will help SmartSockets find the hubs by providing group nodes and group leader information. As part of RACS, it will also assist the reliability-based scheduling module of RACS, to assign group-based reliability to the groups produced by our algorithm.

References

[1] F. Dabek, R. Cox, F. Kaashoek, R. Morris, Vivaldi: a decentralized network coordinate system, ACM SIGCOMM Computer Communication Review 34 (2004) 15–26. http://dx.doi.org/10.1145/1030194.1015471.

[2] P. Sharma, Z. Xu, S. Banerjee, S.-J. Lee, Estimating network proximity and latency, ACM SIGCOMM Computer Communication Review 36 (2006) 39–50. http://dx.doi.org/10.1145/1140086.1140092.

[3] P. Francis, S. Jamin, C. Jin, Y. Jin, D. Raz, Y. Shavitt, L. Zhang, IDMaps: a global Internet host distance estimation service, IEEE/ACM Transactions on Networking 9 (2001) 525–540. http://dx.doi.org/10.1109/90.958323.

[4] S. Malik, F. Huet, D. Caromel, RACS: a framework for resource aware cloud computing, in: Proceedings of the 7th IEEE International Conference for Internet Technology and Secured Transactions, ICITST 2012, 2012.

[5] ProActive Parallel Suite. http://proactive.inria.fr/.

[6] B. Karrer, E. Levina, M.E.J. Newman, Robustness of community structure in networks, Physical Review E 77 (2008) 046119. http://dx.doi.org/10.1103/PhysRevE.77.046119.

[7] G.W. Flake, S. Lawrence, C.L. Giles, F.M. Coetzee, Self-organization and identification of web communities, IEEE Computer 35 (2002) 66–71. http://dx.doi.org/10.1109/2.989932.

[8] L.H. Hartwell, J.J. Hopfield, S. Leibler, A.W. Murray, From molecular to modular cell biology, Nature 402 (1999) C47–C52. http://dx.doi.org/10.1038/35011540.

[9] S. Redner, Citation statistics from more than a century of Physical Review (2004).

[10] M.E.J. Newman, Fast algorithm for detecting community structure in networks, Physical Review E 69 (2004) 066133. http://dx.doi.org/10.1103/PhysRevE.69.066133.

[11] C.T. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Transactions on Computers C-20 (1971) 68–86. http://dx.doi.org/10.1109/T-C.1971.223083.

[12] J. Leskovec, K.J. Lang, M.W. Mahoney, Empirical comparison of algorithms for network community detection, in: Proceedings of the 19th International Conference on World Wide Web, WWW 2010, 2010, pp. 631–640. http://dx.doi.org/10.1145/1772690.1772755.

[13] P. Hui, E. Yoneki, S.-Y. Chan, J. Crowcroft, Distributed community detection in delay tolerant networks, in: Proceedings of the 2nd ACM/IEEE International Workshop on Mobility in the Evolving Internet Architecture, MobiArch'07, 2007. http://dx.doi.org/10.1145/1366919.1366929.

[14] E. Yoneki, P. Hui, J. Crowcroft, Visualizing community detection in opportunistic networks, in: Proceedings of the 2nd ACM Workshop on Challenged Networks, CHANTS'07, 2007. http://dx.doi.org/10.1145/1287791.1287810.

[15] R. Andersen, F. Chung, K. Lang, Local graph partitioning using PageRank vectors, in: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, FOCS'06, 2006, pp. 475–486. http://dx.doi.org/10.1109/FOCS.2006.44.

[16] K. Lang, S. Rao, A flow-based method for improving the expansion or conductance of graph cuts, in: D. Bienstock, G. Nemhauser (Eds.), Proceedings of the 10th International IPCO Conference on Integer Programming and Combinatorial Optimization, in: Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2004, pp. 383–400.

[17] A. Clauset, Finding local community structure in networks, Physical Review E 72 (2005) 026132. http://dx.doi.org/10.1103/PhysRevE.72.026132.

[18] H. Zhao, Z. Qi, Hierarchical agglomerative clustering with ordering constraint, in: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, WKDD'10, 2010, pp. 195–199. http://dx.doi.org/10.1109/WKDD.2010.123.

[19] R.J. Gil-Garcia, J.M. Badia-Contelles, A. Pons-Porrata, A general framework for agglomerative hierarchical clustering algorithms, in: Proceedings of the 18th International Conference on Pattern Recognition, ICPR 2006, 2006, pp. 569–572. http://dx.doi.org/10.1109/ICPR.2006.69.

[20] M.E.J. Newman, Detecting community structure in networks, The European Physical Journal B 38 (2004) 321–330.

[21] Grid 5000. https://www.grid5000.fr/.

[22] J. Maassen, H.E. Bal, SmartSockets: solving the connectivity problems in grid computing, in: Proceedings of the 16th International Symposium on High Performance Distributed Computing, HPDC'07, 2007, pp. 1–10. http://dx.doi.org/10.1145/1272366.1272368.

Sheheryar Malik is an Assistant Professor at Riphah International University and Principal Scientist at the Center for Research in Distributed & Supercomputing. He did his Ph.D. at INRIA France and the University of Nice Sophia Antipolis, France. He received his M.S. in Computer Science and M.Sc. in Computer Science from Mohammad Ali Jinnah University, Pakistan. He worked as a Lecturer and Assistant Professor at the University of Central Punjab, Pakistan and Air University Islamabad, Pakistan from 2003 to 2009. His research interests center on cloud architecture, scheduling and management. Currently he is working on various research and development projects in the domain of distributed and cloud computing.

Fabrice Huet is an Associate Professor in the Department of Computer Science at the University of Nice Sophia Antipolis, France, and a Research Scientist at INRIA Sophia Antipolis, France. He received his Masters in Computer Networking and Distributed Systems and his Ph.D. in Computer Science from the University of Nice Sophia Antipolis, France. Since 2004 he has been working as a Research Scientist in the OASIS research group at INRIA Sophia Antipolis, France. Currently he is working on different aspects related to cloud computing.

Denis Caromel is a Professor at the University of Nice Sophia Antipolis, France. He received his Masters and Ph.D. in Computer Science from the University of Nice Sophia Antipolis, France. Since 1998 he has been working at the University of Nice Sophia Antipolis, France. Currently he is working on scheduling and management issues related to cloud computing.