
J. Parallel Distrib. Comput. 66 (2006) 1062–1075
www.elsevier.com/locate/jpdc

Parallel CBIR implementations with load balancing algorithms

José L. Bosque a,∗, Oscar D. Robles a, Luis Pastor a, Angel Rodríguez b

a Dpto. de Informática, Estadística y Telemática, U. Rey Juan Carlos, C. Tulipán, s/n, 28933 Móstoles, Madrid, Spain
b Dept. de Tecnología Fotónica, UPM, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Spain

Received 22 December 2003; received in revised form 23 February 2005; accepted 7 April 2006
Available online 13 June 2006

Abstract

The purpose of content-based information retrieval (CBIR) systems is to retrieve, from real data stored in a database, information that is relevant to a query. When large volumes of data are considered, as is very often the case with databases dealing with multimedia data, it may become necessary to look for parallel solutions in order to store and access the available items efficiently.

Among the range of parallel options available nowadays, clusters stand out as flexible and cost-effective solutions, although the fact that they are composed of a number of independent machines makes it easy for them to become heterogeneous. This paper describes a heterogeneous cluster-oriented CBIR implementation. First, the cluster solution is analyzed without load balancing, and then a new load balancing algorithm for this version of the CBIR system is presented.

The load balancing algorithm described here is dynamic, distributed, global and highly scalable. Nodes are monitored through a load index which allows the estimation of their total amount of workload, as well as of the global system state. Load balancing operations between pairs of nodes take place whenever a node finishes its job, resulting in a receiver-triggered scheme which minimizes the system's communication overhead. Globally, the CBIR cluster implementation together with the load balancing algorithm can cope effectively with varying degrees of heterogeneity within the cluster; the experiments presented in the paper show the validity of the overall strategy.

Together, the CBIR implementation and the load balancing algorithm described in this paper open a new path for high-performance, cost-effective CBIR systems which has not been explored before in the technical literature.
© 2006 Elsevier Inc. All rights reserved.

Keywords: Parallel implementations; CBIR systems; Load balancing algorithms

1. Introduction

The tremendous improvements experienced by computers in aspects such as price, processing power and mass storage capabilities have resulted in an explosion of the amount of information available to people. But this same wealth makes finding the "best" information a very hard task. CBIR¹ systems try to solve this problem by offering mechanisms for selecting the data items which most closely resemble a specific query among all the available information [12,34], although the complexity

∗ Corresponding author.
E-mail addresses: [email protected] (J.L. Bosque), [email protected] (O.D. Robles), [email protected] (L. Pastor), [email protected] (A. Rodríguez).

¹ Content-based information retrieval.

0743-7315/$ - see front matter © 2006 Elsevier Inc. All rights reserved.
doi:10.1016/j.jpdc.2006.04.014

of this task depends heavily on the volume of data stored in the system. As usual, parallel solutions can be used to alleviate this problem, given the fact that the search operations present a large degree of data parallelism.

Distributed solutions on clusters offer a good cost/performance ratio for this problem, given their excellent scalability, fault tolerance and flexibility [37,6,4]. This architecture also allows concurrent access to disks, considered the main bottleneck in CBIR systems. Although homogeneous clusters could also be considered for this application, it is difficult to preserve the homogeneity of this type of system during its whole life-cycle. Among the factors that affect their configuration stability we can mention the addition of new nodes or the substitution of faulty ones, technological evolution, and even exploitation aspects such as disk fragmentation. As a consequence, clusters present additional challenges, since they


can easily become heterogeneous, requiring load distributions that take into consideration each node's computational features [33]. Thus, one of the critical parameters to be set in order to keep efficiency high on these architectures is the workload assigned to each of the cluster nodes. Even though load balancing has received a considerable amount of interest, it is still not definitively solved, particularly for heterogeneous systems [10,18,41,45]. Nevertheless, this problem is central for minimizing the applications' response time and optimizing the exploitation of resources, avoiding overloading some processors while others are idling.

This paper describes the architecture, implementation and performance achieved by a parallel CBIR system implemented on a heterogeneous cluster that includes load balancing. The flexibility of the architecture presented here allows the dynamic addition or removal of nodes from the cluster between two user queries, achieving reconfigurability, scalability and an appreciable degree of fault tolerance. This approach allows dynamic management of specific databases that can be incorporated into or removed from the CBIR system depending on the user query. The heterogeneity of the system is managed by a new dynamic and distributed load balancing algorithm, which introduces a new load index that takes into account the nodes' computational capabilities and a more accurate measure of their workload. The proposed method introduces a very small system overhead when departing from a reasonably balanced starting point.

As mentioned before, the amount of data to be managed in CBIR systems is so huge nowadays that it is almost mandatory to use parallelism in order to achieve reasonable user response times. Two alternatives were tested in a previous work: a shared-memory multiprocessor and a cluster [6]. Since the cluster implementation gave better results, it seems advisable to introduce load balancing strategies to improve the efficiency on heterogeneous clusters. The selected approach is based on a dynamic, distributed, global and highly scalable load balancing algorithm. A heterogeneous load index based on the number of running tasks and the computational power of each node is defined to determine the state of the nodes. The algorithm automatically turns itself off in global overloading or under-loading situations.

Together, the CBIR implementation and the load balancing algorithm described in this paper open a new path for high-performance, cost-effective CBIR systems which has not been explored before in the technical literature.

The rest of this article is organized as follows: Section 2 presents an overview of parallel CBIR systems and load balancing algorithms. Section 3 presents an analysis of a sequential version of the CBIR algorithm and a brief description of its parallel implementation on a cluster (without load balancing). Section 4 describes the distributed load balancing algorithm applied to the parallel CBIR system, and Section 5 details its implementation on a heterogeneous cluster. Section 6 presents the tests performed in order to measure the improvement achieved by the heterogeneous cluster version with load balancing, and the results obtained. Finally, Section 7 presents the conclusions and ongoing work.

2. Previous work

The technological development of the last 20 years has led to a spectacular increase in the volume of data managed by information systems. This fact has prompted the search for methods to automate the process of extracting structured information from these systems [12,31]. The potential importance of CBIR systems is reflected in the variety of approaches taken while dealing with their different aspects. The multidisciplinary nature of this problem has often resulted in partial advances that have later been integrated into new prototypes and commercial systems. For example, it is possible to find research work that takes into consideration man–machine interaction issues [32]; the users' behavior from a psychological modeling standpoint [27]; multidimensional indexing techniques [5]; multimedia database management system issues [19]; pattern recognition algorithms [17]; multimedia signal processing [39]; object representation and modeling techniques [21]; benchmarks for testing the performance of CBIR systems [16,24]; etc.

In any case, most of the research effort in CBIR systems has been focused on the search for powerful representation techniques for discriminating elements within the global database. Although the nature of the data is a crucial factor to be taken into consideration, most often the final representation is a feature vector² extracted from the raw data which somehow reflects its content. When dealing with 2D images, it is possible to find techniques using color, shape, or texture-based primitives. Other techniques use spatial relationships among the image components or a combination of the above-mentioned approaches. For higher-dimensionality input data, it is possible to find proposals dealing with 3D images or video sequences. Nowadays, one of the most promising research lines is to increase the abstraction level of the semantics associated with the primitives managed, representing high-level concepts derived from the images or the multimedia data.

From the computational complexity point of view, CBIR systems are potentially expensive and have user response times growing with the ever-increasing sizes of the databases associated with them. One of the most common approaches followed to reach acceptable price/performance ratios has been to exploit the algorithms' inherent parallelism at implementation time. However, the novelty of CBIR systems makes it hard to find references dealing with this aspect. Some contributions that can be cited are Zaki's compilation [43] and the contributions of Srakaew et al. [37] and Bosque et al. [6]. Another reason that has hindered widespread parallel CBIR system development is that prototype analysis demands a manual image classification stage that limits in practice the number of images used in the tests. Nevertheless, the volume of data managed by current DBs, and obviously by those with multimedia information, will demand parallel optimizations for commercial implementations of CBIR systems. In those cases, load balancing operations preventing the coexistence of idling and overloaded

² Named signature or primitive.


processors will be practically mandatory, since total response times are usually considerably improved by the introduction of even simple load balancing approaches.

Load balancing techniques can be classified according to different criteria [8]. First, algorithms can be labeled as static or dynamic. Static methods perform workload distribution at compilation time, without taking into consideration system state variations. Dynamic methods are able to redistribute workload among nodes at run time, depending on changes in the system state. The works of Rajagopalan et al. [28] and Obeloer et al. [25] are agent-based techniques. These are flexible and configurable approaches, but the amount of resources needed for agent implementation is considerably large. Grosu et al. [15] present a very different, cooperative approach to the load balancing problem, considering it as a game in which each cluster node is a player that must minimize its job execution time. Banicescu et al. propose a load balancing library for scientific applications on distributed memory architectures. The library integrates dynamic loop scheduling as an object migration policy with the object migration mechanism provided by the data movement and control substrate, which is extended with a mobile object layer [2].

Load balancing algorithms can also be classified as centralized or distributed. In the first case, there is a single central node in charge of keeping the system's information updated, making decisions and actually performing the load balancing operations. In distributed methods, every node takes part in the load balancing operations; Zaki et al. [44] show that distributed algorithms yield better results than their centralized counterparts. Last, load balancing algorithms can be classified as global or local. In the first case, a global view of the system state is kept [10]. In the second case, nodes are arranged in sets or domains, and distribution decisions are made only within each domain [9,40]. Other approaches mix this taxonomy by combining several features that could be considered mutually exclusive, like the work of Ahmad and Ghafoor [1], where a semidistributed algorithm with a two-level hierarchy is presented; their work focuses on static networks where communication latency is very important and depends on node placement. In this type of network, distributed algorithms may produce instability, scalability and bottleneck problems. The improvement of dynamic network technologies solves these problems with broadcast solutions and very low latencies. The technique proposed by Ahmad and Ghafoor [1], although interesting, is not easily applicable to general, unrestricted distributed systems: it was developed for static network environments, where latency depends on node location and where broadcast operations are very costly in terms of system performance. Clusters, which in the present work appear as a very attractive option for CBIR systems in terms of cost/performance ratio, present very different communication features and therefore call for a different approach.

Although a set of projects have been developed to implement CBIR systems on clusters, like the IRMA project for medical images [14] and the DISCOVIR project (distributed content-based visual information retrieval system on a peer-to-peer network) [13], none of them includes a load balancing algorithm to distribute the workload among the cluster nodes, and therefore they cannot manage system heterogeneity.

3. CBIR system description

The experimental work presented in this paper has been performed on a test CBIR system containing information from 29.5 million color pictures. The system provides the user with a data set containing the p images considered most similar to the query. If the result does not satisfy the user, he/she can choose one of the selected images or enter a new one that presents some kind of similarity to the desired image.

The following sections describe the heart of the CBIR system, where the signature (a feature vector describing the image content) is extracted from each image, as well as the processes involved in serving a user's query. More detailed analyses of the retrieval techniques involved in the CBIR system and of the method's stages from the standpoint of parallel optimization can be found elsewhere [30,29,6], respectively.

3.1. Signature computation

Many different approaches can be used for computing the images' signatures, as mentioned in Section 2. In the work presented here, a primitive that represents the color information of the original image at different resolution levels has been selected. To achieve a multiresolution representation, a wavelet transform is first applied to the image [22,11].
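As an illustration of the multiresolution idea (not the paper's actual primitive), the sketch below builds a signature by repeatedly halving an image with 2×2 block averaging, which corresponds to the low-pass part of a Haar wavelet step; the function names and the number of levels are assumptions made here for clarity.

```python
def halve(img):
    """Average each 2x2 block of a square image (list of rows of floats)."""
    n = len(img)
    return [[(img[2*r][2*c] + img[2*r][2*c+1] +
              img[2*r+1][2*c] + img[2*r+1][2*c+1]) / 4.0
             for c in range(n // 2)]
            for r in range(n // 2)]

def signature(img, levels=3):
    """Concatenate the pixel values of successively coarser image versions."""
    sig = []
    for _ in range(levels):
        img = halve(img)
        sig.extend(v for row in img for v in row)
    return sig

# An 8x8 gradient image yields a 16 + 4 + 1 = 21-element signature.
img = [[float(r + c) for c in range(8)] for r in range(8)]
print(len(signature(img)))  # -> 21
```

The real system applies a full wavelet transform to color images; this sketch only conveys why the resulting signature encodes the image at several resolution levels at once.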

3.2. Analysis of the sequential CBIR algorithm

The search for images contained in a CBIR system can be broken down into the following stages:

(1) Input/query image introduction: The user first selects a 128×128 pixel bidimensional image to be used as a search reference. Then the system computes its signature as described above. The whole process can be efficiently implemented using an O(i_s) algorithm, i_s being the image's size [38]. This stage does not require high computational resources, since the system deals with just one image.

(2) Query and DB image signature comparison and sorting: The signature obtained in the previous stage is compared with all of the DB images' signatures using a Euclidean distance-based metric. After this process, the identifiers of the p images most similar to the input image are extracted, ranked by their similarity. Even though this process of signature comparison, selection and ranking is not very demanding from the computational point of view, it has to be performed against all of the images within the DB.

(3) Results display: The next step is to assemble a mosaic made up of the selected p images, which is presented to the user as the search result (see Fig. 1).

(4) Query image update: If the user considers the search result to be unsatisfactory, he may select one of the displayed images as a new input and then return to the first stage.
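Stage (2) above can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function names and the toy two-element signatures are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two signatures of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_p(query_sig, db_sigs, p):
    """Return the p (image_id, distance) pairs most similar to the query,
    ranked by ascending distance (i.e. decreasing similarity)."""
    ranked = sorted(((i, euclidean(query_sig, s)) for i, s in enumerate(db_sigs)),
                    key=lambda t: t[1])
    return ranked[:p]

db = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.5, 0.0]]
print(top_p([0.0, 0.0], db, 2))  # -> [(0, 0.0), (3, 0.5)]
```

The cost of this stage is linear in the number of stored signatures, which is precisely why it dominates the total response time and is the natural target for data-parallel distribution.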


Fig. 1. Visual result of a query.

Upon observing the operations involved, it is possible to notice that the comparison and sorting stage involves a much larger computational load than the others. Fortunately, data parallelism can be exploited simply by dividing the workload among n independent nodes, since there are no dependencies. This can be accomplished by distributing the CBIR image signatures off-line across the processing nodes. Then, each node can compare the query image's signature with every locally available signature. In order to ease the storage requirements as well, it is possible to distribute images, signatures and computation over all of the n available nodes.

3.3. Parallel implementations without load balancing

3.3.1. Global strategy

A remarkable feature of the signature comparison and sorting stage is the problem's fine granularity: it is possible to perform an efficient data-oriented parallelization by combining the signature comparison and sorting stages, and distributing among the different nodes only the data needed to perform this stage, namely the signatures of the DB images assigned to each node, as well as a scalar defining the total number of signatures to be returned, p. It has to be noted that the amount of communication among the corresponding processes is very small, since only the input image's signature and the p identifiers of the most similar images found at each node, together with their corresponding similarity measures, have to be exchanged among the processes involved, as we will see below.

The programmed optimization strategy is based on a farm model, in which a master process distributes the data to a set of slave processes, which analyze the data and return the partial results to the master once they have finished their computations. Since this approach makes it possible to maintain a large degree of data locality, it is well suited to distributed memory multiprocessors with message passing communication. Further advantages of this solution are its good price/performance ratio and its high level of scalability as the number of images stored in the database increases. In our case, the following solution has been adopted:

(1) The master process computes the signature of the input image and broadcasts it to the n slave processes.

(2) The slave processes then proceed to compare the signature of the input image with the signatures of the images assigned to their corresponding process node. Once each comparison has been performed, a check is carried out to ascertain whether the result obtained is among the best p images; should that be the case, it is then incorporated into the set, which is repeatedly sorted using a bubble sorting algorithm.

(3) The slave processes forward the p image identifiers and similarity measurements to the master process after comparing and selecting the p most similar images within each process node.

(4) The master process collects the similarity results obtained from each of the n process nodes and sorts the n · p similarity results, truncating the sort so as to include only the best p.


[Fig. 2 diagram: the master broadcasts the query image signature to slaves 1…n; each slave returns its list of the p most similar images; the master then requests the images to show, and the slaves forward the selected images and their identifiers.]

Fig. 2. Process communication in the cluster implementation without load balancing.

(5) Finally, the master process requests the process nodes that contain the previously selected images to forward them so that they may be presented to the user and, once available, proceeds to compose a mosaic that is then displayed to the user.

Fig. 2 represents a schematic diagram of the communication between the processes involved in the unbalanced system. It must be noticed that each node of the heterogeneous cluster runs two processes: a master to attend to user queries and a slave to provide the local results achieved by each process node to the master process of the cluster node where the query was generated. This situation is very similar to that found on a grid.
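The master's merge in step (4) of the scheme above can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function name and the toy result lists are assumptions.

```python
def merge_partial_results(partial_lists, p):
    """Merge n partial top-p lists of (image_id, distance) pairs, each sorted
    by ascending distance, and truncate to the global best p."""
    merged = sorted((item for lst in partial_lists for item in lst),
                    key=lambda t: t[1])
    return merged[:p]

# Two slaves, each reporting its local top-2; the master keeps the global top-2.
slave1 = [(10, 0.1), (11, 0.7)]
slave2 = [(20, 0.3), (21, 0.4)]
print(merge_partial_results([slave1, slave2], 2))  # -> [(10, 0.1), (20, 0.3)]
```

Since each slave sends only p pairs, the master handles n · p items per query regardless of the database size, which is what keeps the communication cost small.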

3.3.2. MPI cluster implementation

The application has been programmed using the MPI libraries as communication primitives between the master and slave processes. MPI has been selected because it currently constitutes a standard for message passing communications on parallel architectures, offering a good degree of portability among parallel platforms [23]. The MPI version used is LAM 6.5.6, from the Laboratory for Scientific Computing of Notre Dame University, a free distribution of MPI [36].

The pseudo-code corresponding to the implementation of the master and slave processes is shown below.

Master
loop
    Request an image from the user
    Compute its signature
    Forward the signature to each of the n slave processes using the MPI_BCAST (broadcast) primitive
    Receive the results of the n slave processes using the MPI_RECV (receive) primitive
    Sort the partial n · p comparisons, selecting the top p
    Request the p most similar images from the slave processes where the corresponding images are stored
    Receive the p most similar images from the nodes containing them using the MPI_RECV primitive
    Compose the mosaic to be presented to the user
end loop

Slave_j (1 ≤ j ≤ n), M being the number of images stored in process node j
loop
    Receive the signature of the query image forwarded from the master using the MPI_BCAST primitive (reception from a previous broadcast)
    Initialize the set P_j which shall contain the p best comparison results
    for k = 1 to M do
        Find the signature of image k
        Compare the query signature with that of the current image, obtaining the similarity measurement ms_jk
        if ms_jk qualifies for inclusion in P_j then
            Eliminate the worst result of P_j
            Incorporate the result corresponding to image k into P_j
            Sort P_j using a bubble sorting algorithm
        end if
    end for
    Forward P_j to the master
    if the master requests images to compose the mosaic then
        Forward the requested images
    end if
end loop


The size of the data corresponding to each one of the p best results transferred from every slave to the master is around 336 bytes. Therefore, each slave process transfers 336 · p bytes per query. For example, for p = 20 and n = 25, the traffic involved in the response will be less than 165 kB.³
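The figure quoted above can be checked with a one-line computation (a sanity check added here, not part of the original text):

```python
# Total reply traffic per query: 336 bytes per result, p results per slave,
# n slaves replying.
bytes_per_result = 336
p, n = 20, 25
total_bytes = bytes_per_result * p * n
print(total_bytes)                    # -> 168000
print(round(total_bytes / 1024, 1))  # -> 164.1
```

That is, 168 000 bytes, about 164 kB, consistent with the "less than 165 kB" bound stated above.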

4. Description of the load balancing algorithm

A dynamic, distributed, global and highly scalable load balancing algorithm has been developed and tested with the CBIR parallel application previously described. A more detailed description of the load balancing algorithm can be found in [7]. A load index based on the number of running tasks and the computational power of each node is used to determine the nodes' state, which is exchanged among all of the cluster nodes at regular time intervals. The initiation rule is receiver-triggered and based on workload thresholds. Finally, the distribution rule takes into account the heterogeneous nature of the cluster nodes as well as the communication time needed for the workload transmission in order to divide the amount of workload between a pair of nodes in every load balancing operation. These ideas are detailed in the following sections.

4.1. State rule

The load balancing algorithm is based on a load index which estimates how loaded a node is in comparison to the rest of the nodes that compose the cluster. Many approaches can be taken to compute the load index. As in any estimation process, it is necessary to find a trade-off between accuracy and cost, since keeping node rankings frequently updated according to their workload can be costly.

The index is based on the number of tasks in the run-queue of each CPU [20]. These data are exchanged among all of the nodes in the cluster to update the global state information. Moreover, each node takes into account the following information about the rest of the cluster nodes:

• Cluster heterogeneity: each node can have a different computational power P_i, so this factor is an important parameter to take into account when computing the load index. It is defined as the inverse of the time taken by node i to process a single signature.

• Total amount of workload for each node: it is evaluated when the application begins its execution, and it is updated if there are any changes in a node.

• Percentage of the workload performed by each node, W_i: it is defined as a function of the total workload, the computational power and the number of tasks in the node.

• Period of time since the last update, D, and total execution time, T.

³ These figures do not take into consideration either the data corresponding to the images presented to the user or the overheads originated by the communication primitives, although the latter can be considered negligible.

Therefore, the average number of tasks is updated as

    N_ave = (N_last · T + N_cur · D) / (T + D),    (1)

where N_cur is the number of tasks currently running in the node, N_last is the average number of tasks running since the last update, T is the total execution time of the N_last tasks considered, and D is the interval of time since the last update. This expression gives the average number of tasks of the node during the execution time of the application. Thus, the percentage of workload processed in each node, W_i, is evaluated as

    W_i = (P_i · T) / (W · N_ave) × 100.    (2)

4.2. Information rule

Given that the load balancing approach described here is dynamic, distributed and global, every node in the system needs updated information about how loaded the remaining system nodes are [42]. The selected information rule is periodic: each node broadcasts its own load index to the rest of the nodes at specific time instants. A periodic rule is necessary because each node has to compute the amount of workload processed by the rest of the cluster nodes, based on the average number of tasks per node. Evaluating the average number of tasks requires periodically updated information, which makes other information rules, such as event-driven or on-demand, unsuitable.

4.3. Initiation rule

The initiation rule determines when load balancing operations are performed. It is a receiver-initiated rule, where load balancing operations involve pairs of idle and heavily loaded nodes: whenever a processor finishes its assigned workload, it looks for a busy node and asks it to share part of its remaining workload. Since each node keeps information about the amount of pending work of the remaining nodes, the selection of busy nodes is simple.

The initiation rule described above minimizes the number of load balancing operations, reducing the algorithm overhead. Moreover, every operation improves the system performance, because the total response times of the nodes involved in the load balancing operation are equalized, provided that there are no additional changes in their state and they are not involved in other load balancing operations.

4.4. Load balancing operation

The load balancing operation is broken down into three phases: first, it is necessary to find an adequate node which will provide part of its workload (localization rule). Then, the amount of workload to be transferred has to be computed (distribution rule). Finally, the workload has to be actually transferred.


4.4.1. Localization rule

Whenever a node finishes its workload, it looks for a sender node to start a load balancing operation. The receiver node checks the state of the rest of the cluster nodes and computes a node list, ordered by the amount of pending work. To select the sender node, the receiver checks its own position in the list and selects the node which is in the symmetric position; for example, if nodes are ranked according to their workload, the least loaded node will look for the most loaded one, the second least loaded node will look for the second most loaded, and so on. In consequence, each sender–receiver pair will have, between the two of them, a similar amount of workload.

Apart from being very simple to implement, this approach gives good results: whenever a node finishes its work it is placed at one end of the list, so it selects a heavily loaded node at the other end. This way, the selection of the sender node is very coherent: the underloaded nodes take workload from the overloaded nodes, while the nodes in the middle positions of the list do not receive load balancing requests (since it is very unlikely that a node placed in an intermediate position starts a load balancing operation).

Additionally, if several nodes are looking for a sender at the same time, it is unlikely that they address their requests to the same sender. This avoids situations where one loaded node receives several load balancing petitions while the rest of the loaded nodes do not receive any. Finally, this approach is not time consuming, because the nodes always have up-to-date state information to build their own lists.

Whenever a node receives a load balancing request, it can accept or reject it. In order to accept it, the sender node should have a minimum amount of work left; otherwise, the sender node is close to completing its workload and the cost of the load balancing operation can be higher than finishing the remaining workload locally. In that case, the receiver node will select another node from the list using the same procedure, until an adequate node is found or the end of the list is reached.
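The localization rule can be sketched as follows, under our own assumptions about the data structures: the receiver's local view is a node-to-pending-work map, MIN_PENDING stands for the acceptance threshold (whose actual value the paper does not give), and rejected candidates are probed walking from the symmetric position toward the middle of the list.

```python
# Illustrative sketch of the localization rule; names and the fallback
# search order are our assumptions, not the paper's implementation.

MIN_PENDING = 50  # assumed acceptance threshold, in workload units

def select_sender(receiver, pending):
    """pending: dict node -> unprocessed workload (receiver's local view)."""
    ranking = sorted(pending, key=lambda n: pending[n])  # least loaded first
    i = ranking.index(receiver)
    # Start at the symmetric position and walk down the list until a
    # candidate with enough pending work accepts the request.
    for j in range(len(ranking) - 1 - i, -1, -1):
        cand = ranking[j]
        if cand != receiver and pending[cand] >= MIN_PENDING:
            return cand
    return None  # no node has enough work left to justify a transfer

pending = {"n0": 0, "n1": 120, "n2": 400, "n3": 900}
# n0 just finished (rank 0 of 4), so it asks the most loaded node, n3.
sender = select_sender("n0", pending)  # -> "n3"
```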

4.4.2. Distribution rule

The distribution rule computes the amount of work that has to be moved from the sender to the receiver node. An appropriate rule should take into consideration the relative nodes' capabilities and availabilities, so that they finish processing their jobs at the same time (provided that no additional operations change their processing conditions). The communication time needed to transfer the workload between the nodes is also taken into consideration, because the receiver node cannot run the newly assigned task until it receives the corresponding load, which introduces an additional delay. The global equilibrium is obtained through successive operations between pairs of nodes.

The proposed distribution rule is based on two parameters: the number of running tasks NT_i and the computational power P_i of the nodes which take part in the operation. This reflects the fact that the contribution of a powerful node might be hampered by a large amount of external workload. Both parameters are combined into the node's actual computational power, Pact_i, which is obtained as

    Pact_i = P_i / NT_i.   (3)

This is a multi-phase application with two different phases: comparison and sorting. Once the load balancing operation is finished, the sender node has to finish the comparison phase with the remaining workload; then, it must sort all the processed workload. The receiver has to compare and sort the new workload. Additionally, the communication time has to be taken into account, because the receiver cannot continue processing until it receives the new workload. The distribution rule is then determined by the following expressions:

    T_s = W_s / Pact_s + (W − W_r) / Pact_s,
    T_r = W_r / Pact_r + W_r / Pact_r + W_r / P_c,
    W = W_s + W_r,   (4)

where T_s and T_r are the response times of the sender and receiver processors once the load balancing operation is finished. W is the total workload of the sender which has not been processed yet, W_s is the workload remaining in the sender node after the load balancing operation and W_r is the workload sent to the receiver. Pact_s and Pact_r are the sender's and receiver's current computational powers. Finally, P_c is the communication power, expressed in units of workload per second. The communication power is obtained by computing offline the number of signatures that can be exchanged per second between two of the cluster nodes.

This model relies on two assumptions:

• The computational power of a node is the same in both the comparison and sorting phases.

• The response time of both phases and the communication time are linear with respect to the workload.

Solving these expressions, the sender and receiver workloads can be computed as

    W_r = 2W · Pact_r · P_c / (2P_c · Pact_s + 2Pact_r · P_c + Pact_s · Pact_r),
    W_s = W − W_r.   (5)

The values of both workloads W_r and W_s take into account the heterogeneity of the nodes, their current state, the communication times and the two different phases of the application.
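The rule can be checked numerically; the following sketch (with made-up power values) verifies that the split of Eq. (5) equalizes the response times of Eq. (4):

```python
# Sketch of the distribution rule, Eqs. (3)-(5). Symbol names follow the
# paper; the code and the sample values are ours.

def actual_power(P, n_tasks):
    """Eq. (3): nominal power derated by the number of running tasks."""
    return P / n_tasks

def split_workload(W, Pact_s, Pact_r, Pc):
    """Eq. (5): returns (Ws kept by the sender, Wr shipped to the receiver)."""
    Wr = (2 * W * Pact_r * Pc) / (2 * Pc * Pact_s + 2 * Pact_r * Pc + Pact_s * Pact_r)
    return W - Wr, Wr

# Equal-finish check against Eq. (4), with illustrative values:
W, Pact_s, Pact_r, Pc = 10_000.0, 400.0, 250.0, 2_000.0
Ws, Wr = split_workload(W, Pact_s, Pact_r, Pc)
Ts = Ws / Pact_s + (W - Wr) / Pact_s          # sender: compare Ws, sort W - Wr
Tr = Wr / Pact_r + Wr / Pact_r + Wr / Pc      # receiver: compare, sort, transfer
assert abs(Ts - Tr) < 1e-9
```

The assertion holds by construction, since Eq. (5) is exactly the solution of T_s = T_r under W = W_s + W_r.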

In consequence, the load balancing algorithm described is dynamic, being able to redistribute workload among nodes at run time, depending on how the system state changes. It is also distributed, because every node takes part in the load balancing operations. And finally, it is global, because a global view of the system state is always kept. The following section describes the implementation of this algorithm on a CBIR system running in a heterogeneous cluster.


5. Distributed load balancing implementation on a heterogeneous cluster

5.1. Process structure of the load balance implementation

Two replicated processes are deployed on each one of the cluster nodes:

(1) Load daemon: this process implements both the state and the information rules.

(2) Distribution daemon: it collects requests from slave nodes demanding workload and carries out the transfers.

Fig. 3 shows a decomposition of all the actions that must be carried out when a slave node finishes its local workload and triggers the initiation rule. First, it demands new load from the distribution process, which obtains the demanded load and sends it to the slave node so that the computations can continue. The following section describes the structure of the groups of processes and their functions.

5.2. Groups of processes

As mentioned in Section 3.3.2, communication and synchronization between processes are based on MPI. A structure of groups of processes based on communicators [23,26,35] has been implemented, where the groups make it possible to establish communication structures between processes and to use global communication functions over subsets of processes. This way, each type of process belongs to its own group:

• MPI_COMM_MS: this group is composed of the master process and all of the slave processes.

• MPI_COMM_DIST: this group is formed by the distribution processes.

• MPI_COMM_LOAD: this group is composed of the load daemons of each of the nodes.

The group concept is the most natural way to implement this process scheme, because most often the messages transmitted involve processes that belong to the same group. Fig. 4 presents this communication hierarchy.

5.3. Load daemon

The main functions of this process are to compute the local load index, to send this information to the load daemons of the other nodes, and to transmit all of the information available to the local distribution process whenever it is required to do so. It is also in charge of initializing and managing a table that stores the state of the other nodes. For each node, the table stores the following information: computational power, average number of active tasks while the application is running, percentage of completed work, time of the last update, total execution time with some load level, and number of signatures to be processed.
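The table entries described above can be pictured as one record per node; this layout and its field names are only an illustrative sketch, not the paper's actual data structure:

```python
# Illustrative per-node entry of the state table kept by each load
# daemon (fields taken from the description above; names are ours).
from dataclasses import dataclass

@dataclass
class NodeState:
    power: float           # computational power (workload units/s)
    avg_tasks: float       # average number of active tasks, Eq. (1)
    work_done_pct: float   # percentage of completed work, Eq. (2)
    last_update: float     # time of the last state update
    exec_time: float       # total execution time at some load level
    signatures_left: int   # number of signatures still to be processed
```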

Fig. 3. General overview of the whole load balancing algorithm.

At predetermined fixed intervals, the process evaluates the load index of the node where it is running and sends the state information to the other nodes. The rest of the time it remains blocked, waiting for messages from other processes; its functionality depends on the received messages. Table 1 summarizes the messages involved with the load daemon and the tasks associated with each one.

Fig. 4. Group scheme: MPI_COMM_WORLD contains MPI_COMM_MS (master and slaves), MPI_COMM_DIST (the distribution processes) and MPI_COMM_LOAD (the load daemons).

Table 1
Messages and associated functions of the load daemon

Message identifier  Associated tasks
0                   Task information
1                   The distribution process has finished and demands the identifier of a transmitter node
2                   The distribution process informs about the number of signatures delivered to another node
3                   The distribution process notifies that there are no available nodes to transfer load
4                   The distribution process shows the number of signatures obtained from another node
5                   Another load daemon informs about its new number of signatures after their transference to another node
6                   Another load daemon reports about the new number of signatures assigned to it
7                   Another load daemon tells that there are no nodes to transfer load

5.4. Distribution process

The main function of the load distribution process is to implement the initiation rule and the load balancing operation. Whenever a particular slave finishes its local work, the distribution process is alerted; it then evaluates the initiation rule, finds a candidate node, establishes the negotiation and delivers the load to the slave. On the other hand, if the node receives a load balancing request, the distribution rule is triggered and the appropriate workload is sent to the remote node.

6. Analysis of the CBIR implementation with load balancing in a heterogeneous cluster

A set of experiments has been performed to test the behavior of the parallel CBIR system implemented on the heterogeneous cluster using the above distributed load balancing algorithm. To compare the results achieved by the parallel CBIR system with and without the distributed load balancing algorithm, the total response time of the CBIR system has been measured in both configurations. Additionally, two classical load balancing algorithms have been implemented as references: the random algorithm [3] and the Probin algorithm [9]. The random algorithm is one of the simplest distributed load balancing algorithms, because each node makes decisions based only on local information: a node is considered a sender if its CPU queue length exceeds a predetermined, constant threshold, and the receiver is selected randomly, because the nodes do not share any information about the system status. The Probin algorithm is a diffusion-based algorithm, where information is exchanged locally, defining communication domains between neighbor nodes; several levels of coordination can be established by varying the domains' size.

Fig. 5. Computational power of the cluster nodes, measured in workload units/second.

The experiments have been executed on a heterogeneous cluster composed of 25 nodes, linked through a 100 MB/s Ethernet. Each of the processing nodes features 4 GB of storage capacity in an IDE hard disk linked through DMA, with a 16.6 MB/s transfer speed. The PCs' operating system is Linux v. 2.2.12. The heterogeneity is determined by the hard disk features. It has to be noted that this component determines each node's response, as shown in Fig. 5, since in this CBIR system (as in many others) I/O operations predominate over CPU operations.

Two different tests have been performed to measure the improvement achieved by the heterogeneous cluster implementation using the distributed load balancing algorithm. The first one analyzes search operations within a 30 million image database using an underloaded system. Since none of the nodes is overloaded, this test studies how heterogeneity affects the system performance, and how this performance is improved using the load balancing algorithm. The second experiment adds some artificial external tasks to a node in order to test how well the load balancing algorithm copes with a situation of strong load unbalance. In this case the underloaded nodes would have to wait for the overloaded one to finish the application; the load balancing algorithm must remove the underloaded nodes' idle time.


Fig. 6. Results without external tasks (speedup with respect to the algorithm without load balancing): (a) response time and (b) speedup.

Table 2
Response time without external workload, measured in seconds

No. nodes  Without alg.  Random alg.  Probin alg.  Proposed alg.
5          14 362        14 070       13 771       12 628
10         7285          7044         6784         6362
15         5110          4970         4863         4432
20         4639          4320         4295         3799
25         4168          4114         4128         3593

6.1. Tests considering cluster heterogeneity and load balancing overhead

The main purposes of these tests were to detect the amount of overhead introduced by the load balancing algorithm, and to determine how the algorithm can manage the system heterogeneity. The tests were performed on clusters with 5, 10, 15, 20 and 25 slave nodes plus a master node, in order to evaluate the algorithm's scalability. The results are presented in Table 2 and in Fig. 6.

Table 2 shows that the response times are always shorter with some load balancing algorithm, which means that the overhead introduced by the algorithm is smaller than the improvements achieved by using any of the implemented load balancing algorithms. From these results, two main considerations can be pointed out:

• The tested load balancing algorithms always improved the response times, by between 10% and 15%. The best results were achieved by the proposed algorithm.

• The proposed approach proved to be more stable, while the results obtained with the other algorithms were less consistent.

Fig. 6(b) and Table 3 present the speedup of these algorithms, where the speedup refers to the improvement with respect to the execution without load balancing.

An interesting parameter for estimating the methods' behavior is the standard deviation of the response times of the different cluster nodes, shown in Table 4 and in Fig. 7. The standard deviation of the nodes' response times is a measurement directly related to the idle time of nodes waiting for other nodes to finish their assignments.

Table 3
Speedup without external tasks

No. nodes  Random alg.  Probin alg.  Proposed alg.
5          1.021        1.047        1.137
10         1.034        1.073        1.145
15         1.028        1.051        1.153
20         1.019        1.029        1.158
25         1.013        1.01         1.16

Table 4
Standard deviation of the cluster nodes without external tasks

No. nodes  Without alg.  Random alg.  Probin alg.  Proposed alg.
5          2620          1127         743.53       355
10         1198          787          482          145
15         763           571.94       85.73        73
20         620           430          317          46
25         575           333.25       198.08       37

Fig. 7. Standard deviation without external tasks.


Fig. 8. Results with external tasks: (a) response time and (b) speedup.

Table 5
Response time with external tasks on a node, measured in seconds

No. nodes  Without alg.  Random alg.  Probin alg.  Proposed alg.
5          10 548        8864         8305         5732
10         5535          4562         4225         3054
15         3645          3063         2836         2099
20         2893          2483         2352         1739
25         2589          2311         2231         1601

The load balancing algorithm presented here decreases the standard deviation, equilibrating the response times. The random algorithm achieves only a slight reduction, while with the Probin algorithm this value is erratic, depending highly on the probed nodes. The proposed algorithm achieves the best values of all the load balancing algorithms tested, reducing the standard deviation with respect to the execution without load balancing by 86.45% with 5 nodes and up to 93.56% with 25 nodes.
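Those percentages follow directly from the figures in Table 4; a quick recomputation:

```python
# Recomputing the quoted reductions from Table 4 (standard deviation of
# the node response times without external tasks).
without = {5: 2620, 25: 575}   # no load balancing
proposed = {5: 355, 25: 37}    # proposed algorithm

reduction = {n: (without[n] - proposed[n]) / without[n] * 100 for n in without}
# reduction[5] is about 86.45 and reduction[25] about 93.57 (93.56 in the
# text; the small difference is rounding).
```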

6.2. Results with system overload

For these experiments the system is slightly overloaded, with one of the nodes heavily loaded. The goal of this test is to measure the algorithm's ability to distribute the work of the loaded node among the remaining cluster nodes without affecting the system performance. The tests were performed on a heterogeneous cluster with 5, 10, 15, 20, and 25 slave nodes and a master node, using a database of 12.5 million images. Table 5 and Fig. 8 present the results achieved in this experiment.

For these tests, the differences obtained between executions with and without load balancing were very large. The reductions in response times range from 45% with 5 nodes to 38% with 25 nodes; as the number of nodes increases, the differences in response times decrease. Again, the best results are achieved with the proposed algorithm. Table 6 and Fig. 8(b) show the speedup achieved in these tests.
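The speedups in Table 6 are simply the ratios of the response times in Table 5; for instance, for the proposed algorithm on 5 nodes:

```python
# Speedup with external tasks, from the 5-node row of Table 5.
t_without, t_proposed = 10548, 5732        # seconds
speedup = t_without / t_proposed           # ~1.84, first row of Table 6
reduction_pct = (1 - t_proposed / t_without) * 100  # the ~45% quoted above
```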

Finally, Table 7 and Fig. 9 present the standard deviation results.

In these tests, the reduction of the standard deviation ranged from 90% to 95%. An interesting point to remark is the lack of consistency of the results provided by the random algorithm: this method provides only marginal improvements with respect to the execution without load balancing for more than 10–15 nodes.

Table 6
Speedup with external tasks

No. nodes  Random alg.  Probin alg.  Proposed alg.
5          1.19         1.27         1.84
10         1.213        1.31         1.81
15         1.19         1.285        1.74
20         1.165        1.25         1.67
25         1.12         1.16         1.62

Table 7
Standard deviation with external tasks

No. nodes  Without alg.  Random alg.  Probin alg.  Proposed alg.
5          2809          1232         305          127
10         1117          821          296          83
15         565           490          452          54
20         375           345          321          32
25         308           291          217          16

Fig. 9. Standard deviation with external tasks.

The Probin algorithm behaves better for fewer than 10 nodes, although its relative improvements drop dramatically above 15 nodes. Finally, the proposed algorithm yields very stable values, achieving better results as the number of nodes increases: the method takes advantage of the availability of additional nodes, having them all finish within a short time interval.

Fig. 10. Response time considering a loaded node for a 25 node cluster.

Table 8
Response time increasing the number of external tasks for a 25 node cluster, measured in seconds

No. tasks  Without alg.  Proposed alg.
0          1667          1438
1          2589          1601
2          3293          1601

6.3. Results increasing the load of a heavily loaded node

The last experiment presented in this paper has been performed by increasing the number of external tasks on a heavily loaded node. This way, the unbalance among the different nodes is higher, and the algorithms' behavior when confronted with highly overloaded nodes can be checked. Table 8 and Fig. 10 present the results achieved.

From the results above it can be seen that the response times without any load balancing algorithm increase linearly with the external load. When the proposed load balancing algorithm is introduced, depending on the system's external workload status, a small increment in the system's response time may be observed. But globally, the response times remain almost invariant when the amount of external workload is increased, as can be seen in Table 8 and Fig. 10. This behavior proves that if there are some underloaded nodes, the extra workload can be split among them and the application response time can be kept constant.

7. Conclusions and future work

This paper begins with an analysis of the operations involved in a typical CBIR system. From the analysis of the sequential version, a lack of data or algorithmic dependencies can be observed. This allows efficient cluster implementations of CBIR systems, since the cluster is a parallel architecture that meets the application needs very well [6].

Improvements on the cluster implementation have been made by introducing a dynamic, distributed, global and scalable load balancing algorithm which has been designed specifically for the parallel CBIR application implemented on a heterogeneous cluster. An additional important feature is that the load balancing algorithm takes into account the system heterogeneity originated both by the different node computational attributes and by external factors such as the presence of external tasks.

The experiments presented here show that the amount of overhead introduced by this method is very small. In fact, this overhead is hidden by the improvements achieved whenever any degree of system heterogeneity shows up, a common situation in grid systems. All these experiments have also shown that using the load balancing algorithm results in large execution time reductions and in a more uniform distribution of the nodes' response times, which can be detected through strong reductions in the response times' standard deviation.

As shown in the experiments presented here, another important aspect that should be stressed is the algorithm's scalability: increasing the number of system nodes does not significantly change the execution time increments originated by the introduction of the load balancing algorithm. At this moment, considering a network with a much higher number of nodes is not possible with the available resources. In any case, it is feasible and even simple to extend the current implementation to define a hierarchical algorithm using MPI communicators. The cluster version of the CBIR system that includes the load balancing algorithm is nowadays fully operative.

Finally, further work will be devoted to evaluating the effects of using more complex node load indices and initiation rules on the method's performance. New efforts will be made to refine the primitives used in the CBIR system, and to introduce fault tolerance mechanisms in order to increase the system robustness. The system response will also be analyzed after distributing the database of the CBIR system between different clusters. A future migration of the implemented CBIR system to a grid will also be performed.

Acknowledgments

This work has been partially funded by the Spanish Ministry of Education and Science (Grant TIC2003-08933-C02) and the Government of the Community of Madrid (Grant GR/SAL/0940/2004).

References

[1] I. Ahmad, A. Ghafoor, Semi-distributed load balancing for massivelyparallel multicomputer systems, IEEE Trans. Software Engrg. 17 (10)(1991) 987–1004.

[2] I. Banicescu, R. Carino, J. Pabico, M. Balasubramaniam, Design andimplementation of a novel dynamic load balancing library for clustercomputing, Parallel Comput. 31 (7) (2005) 736–756.

[3] K.M. Baumgartner, B.W. Wah, Computer scheduling algorithms—past,present and future, Inform. Sci. 57–58 (1991) 319–345.

1074 J.L. Bosque et al. / J. Parallel Distrib. Comput. 66 (2006) 1062–1075

[4] G. Bell, J. Gray, What’s next in high-performance computing?, Commun.ACM 45 (2) (2002) 91–95.

[5] A.P. Berman, L.G. Shapiro, A flexible image database system for content-based retrieval, Comput. Vision Image Understanding 75 (1/2) (1999)175–195.

[6] J.L. Bosque, O.D. Robles, A. Rodríguez, L. Pastor, Study of a parallelCBIR implementation using MPI, in: V. Cantoni, C. Guerra (Eds.),Proceedings on International Workshop on Computer Architecturesfor Machine Perception, IEEE CAMP 2000, Padova, Italy, 2000, pp.195–204, ISBN 0-7695-0740-9.

[7] J.L. Bosque, O.D. Robles, L. Pastor, Load balancing algorithms for CBIRenvironments, in: Proceedings on International Workshop on ComputerArchitectures for Machine Perception, IEEE CAMP 2003, The Centerfor Advanced Computer Studies, University of Louisiana at Lafayette,IEEE, New Orleans, USA, 2003, ISBN 0-7803-7971-3.

[8] T.L. Casavant, J.G. Kuhl, A taxonomy of scheduling in general-purposedistributed computing systems, in: T.L. Casavant, M. Singhal (Eds.),Readings in Distributed Computing Systems, IEEE Computer SocietyPress, Los Alamitos, CA, 1994, pp. 31–51.

[9] A. Corradi, L. Leonardi, F. Zambonelli, Diffusive load-balancing policiesfor dynamic applications, IEEE Concurrency 7 (1) (1999) 22–31.

[10] S.K. Das, D.J. Harvey, R. Biswas, Parallel processing of adaptive mesheswith load balancing, IEEE Trans. Parallel Distributed Systems 12 (12)(2001) 1269–1280.

[11] I. Daubechies, Ten Lectures on Wavelets, vol. 61 of CBMS-NSF RegionalConference Series in Applied Mathematics, Society for Industrial andApplied Mathematics, Philadelphia, PA, 1992.

[12] A. del Bimbo, Visual Information Retrieval, Morgan KaufmannPublishers, San Francisco, CA, 1999, ISBN 1-55860-624-6.

[13] Department of Computer Science and Engineering, The ChineseUniversity of Hong Kong, DISCOVIR Distributed Content-basedVisual Information Retrieval System on Peer-to-Peer(P2P) Network,Web, 2003. URL 〈http://www.cse.cuhk.edu.hk/∼miplab/discovir/〉.

[14] Department of Diagnostic Radiology and Department of MedicalInformatics and Division of Medical Image Processing and Lehrstuhlfür Informatik VI of the Aachen University of Technology RWTHAchen, IRMA: Image Retrieval in Medical Applications, Web,2003. URL 〈http://libra.imib.rwth-aachen.de/irma/index_en.php〉.

[15] D. Grosu, A. Chronopoulos, M. Leung, Load balancing in distributedsystems: an approach using cooperative games, in: 16th InternationalParallel and Distributed Processing Symposium IPDPS ’02, IEEE, 2002,pp. 52–53.

[16] N.J. Gunther, G. Beretta, A benchmark for image retrieval usingdistributed systems over the internet: birds-i, Technical Report HPL-2000-162, Imaging Systems Laboratory, Hewlett Packard, December2000.

[17] R.M. Haralick, L.G. Shapiro, Computer and Robot Vision, vol. I,Addison-Wesley, Reading, MA, 1992, ISBN: 0-201-10877-1.

[18] C.-C. Hui, S.T. Chanson, Hydrodynamic load balancing, IEEE Trans.Parallel Distributed Systems 10 (11) (1999) 1118–1137.

[19] S. Khoshafian, A.B. Baker, Multimedia and Imaging Databases, MorganKaufmann, San Francisco, CA, 1996.

[20] T. Kunz, The influence of different workload descriptions on a heuristicload balancing scheme, IEEE Trans. Software Engrg. 17 (7) (1991)725–730.

[21] L.J. Latecki, R. Melter, A. Gross, et al., Special issue on shaperepresentation and similarity for image databases, Pattern Recognition35 (1).

[22] S. Mallat, A theory for multiresolution signal decomposition: the waveletrepresentation, IEEE Trans. Pattern Anal. Mach. Intell. 11 (7) (1989)674–693.

[23] MPI Forum, A message-passing interface standard, 2003. URL〈www.mpi-forum.org〉.

[24] H. Müller, W. Müller, D.M. Squire, S. Marchand-Maillet, T. Pun,Performance evaluation in content-based image retrieval: overview andproposals, Pattern Recognition Lett. 22 (2001) 593–601.

[25] W. Obeloer, C. Grewe, H. Pals, Load management with mobile agents,in: 24th Euromicro Conference, vol. 2, IEEE, 1998, pp. 1005–1012.

[26] P.S. Pacheco, Parallel Programming with MPI, Morgan KaufmannPublishers Inc., San Francisco, 1997.

[27] J.S. Payne, L. Hepplewhite, T.J. Stonham, Evaluating content-basedimage retrieval techniques using perceptually based metrics, in:Proceedings of SPIE on Applications of Artificial Neural Networks inImage Processing IV, vol. 3647, SPIE, 1999, pp. 122–133.

[28] A. Rajagopalan, S. Hariri, An agent based dynamic load balancingsystem, in: International Workshop on Autonomous DecentralizedSystems, IEEE, 2000, pp. 164–171.

[29] O.D. Robles, A. Rodríguez, M.L. Córdoba, A study about multiresolutionprimitives for content-based image retrieval using wavelets, in: M.H.Hamza (Ed.), IASTED International Conference On Visualization,Imaging, and Image Processing (VIIP 2001), IASTED, ACTA Press,Marbella, Spain, 2001, pp. 506–511, ISBN 0-88986-309-1.

[30] A. Rodríguez, O.D. Robles, L. Pastor, New features for content-basedimage retrieval using wavelets, in: F. Muge, R.C. Pinto, M. Piedade(Eds.), V Ibero-american Symposium on Pattern Recognition, SIARP2000, Lisbon, Portugal, 2000, pp. 517–528, ISBN 972-97711-1-1.

[31] S. Santini, Exploratory Image Databases: Content-based Retrieval,Communications, Networking, and Multimedia, Academic Press, NewYork, 2001, ISBN 0-12-619261-8.

[32] S. Santini, A. Gupta, R. Jain, Emergent semantics through interactionin image databases, IEEE Trans. Knowledge Data Engrg. 13 (3) (2001)337–351 ISSN: 1041-4347.

[33] B. Schnor, S. Petri, R. Oleyniczak, H. Langendörfer, Scheduling ofparallel applications on heterogeneous workstation clusters, in: K.Yetongnon, S. Hariri (Eds.), Proceedings of the ISCA Ninth InternationalConference on Parallel and Distributed Computing Systems, vol. 1,ISCA, Dijon, 1996, pp. 330–337.

[34] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Trans. PAMI22 (12) (2000) 1349–1380.

[35] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra, MPI:The Complete Reference, The MIT Press, Cambridge, 1996.

[36] J.M. Squyres, K.L. Meyer, M.D. McNally, A. Lumsdaine, LAM/MPIUser Guide, University of Notre Dame, lAM 6.3, 1998. URL〈http://www.mpi.nd.edu/lam/〉.

[37] S. Srakaew, N. Alexandridis, P.P. Nga, G. Blankenship, Content-basedmultimedia data retrieval on cluster system environment, in: P. Sloot,M. Bubak, A. Hoekstra, B. Hertzberger (Eds.), High-PerformanceComputing and Networking. Seventh International Conference, HPCNEurope 1999, Springer, Berlin, 1999, pp. 1235–1241.

[38] E.J. Stollnitz, T.D. DeRose, D.H. Salesin, Wavelets for Computer Graphics, Morgan Kaufmann Publishers, San Francisco, 1996.

[39] Y. Wang, Z. Liu, J.-C. Huang, Multimedia content analysis, IEEE Signal Process. Mag. 16 (6) (2000) 12–36.

[40] M.H. Willebeek-LeMair, A.P. Reeves, Strategies for dynamic load balancing on highly parallel computers, IEEE Trans. Parallel Distributed Systems 4 (9) (1993) 979–993.

[41] L. Xiao, S. Chen, X. Zhang, Dynamic cluster resource allocations for jobs with known and unknown memory demands, IEEE Trans. Parallel Distributed Systems 13 (3) (2002) 223–240.

[42] C. Xu, F. Lau, Load Balancing in Parallel Computers: Theory and Practice, Kluwer Academic Publishers, Dordrecht, 1997.

[43] M.J. Zaki, Parallel and distributed association mining: a survey, IEEE Concurrency 7 (4) (1999) 14–25.

[44] M.J. Zaki, S. Parthasarathy, W. Li, Customized Dynamic Load Balancing, vol. 1, Architectures and Systems, Prentice-Hall PTR, Upper Saddle River, NJ, 1999 (Chapter 24).

[45] A.Y. Zomaya, Y.-H. Teh, Observations on using genetic algorithms for dynamic load-balancing, IEEE Trans. Parallel Distributed Systems 12 (9) (2001) 899–911.

José L. Bosque graduated in Computer Science and Engineering from Universidad Politécnica de Madrid in 1994. He received the Ph.D. degree in Computer Science and Engineering from Universidad Politécnica de Madrid in 2003. His Ph.D. was centered on theoretical models and algorithms for heterogeneous clusters. He has been an associate professor at the Universidad Rey Juan Carlos in Madrid, Spain, since 1998. His research interests are parallel and distributed processing, performance and scalability evaluation, and load balancing.

Oscar D. Robles received his degree in Computer Science and Engineering and the Ph.D. degree from the Universidad Politécnica de Madrid in 1999 and 2004, respectively. His Ph.D. was centered on content-based image and video retrieval techniques on parallel architectures. Currently he is an Associate Professor at the Universidad Rey Juan Carlos and has published works in the fields of multimedia retrieval and parallel computer systems. His research interests include content-based multimedia retrieval, as well as computer vision and computer graphics. He is a Eurographics member.

Luis Pastor received the B.S.EE degree from the Universidad Politécnica de Madrid in 1981, the M.S.EE degree from Drexel University in 1983, and the Ph.D. degree from the Universidad Politécnica de Madrid in 1985. Currently he is a Professor at the Universidad Rey Juan Carlos (Madrid, Spain). His research interests include image processing and synthesis, virtual reality, 3D modeling, and parallel computing.

Angel Rodríguez received his degree in Computer Science and Engineering and the Ph.D. degree from the Universidad Politécnica de Madrid in 1991 and 1999, respectively. His Ph.D. was centered on the tasks of modeling and recognizing 3D objects on parallel architectures. He is an Associate Professor in the Photonics Technology Department, Universidad Politécnica de Madrid (UPM), Spain, and has published works in the fields of parallel computer systems, computer vision, and computer graphics. He is an IEEE and ACM member.