Parallel CBIR implementations with load balancing algorithms
J. Parallel Distrib. Comput. 66 (2006) 1062–1075
www.elsevier.com/locate/jpdc
José L. Bosquea,∗, Oscar D. Roblesa, Luis Pastora, Angel Rodríguezb
a Dpto. de Informática, Estadística y Telemática, U. Rey Juan Carlos, C. Tulipán, s/n, 28933 Móstoles, Madrid, Spain
b Dept. de Tecnología Fotónica, UPM, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Spain
Received 22 December 2003; received in revised form 23 February 2005; accepted 7 April 2006. Available online 13 June 2006.
Abstract
The purpose of content-based information retrieval (CBIR) systems is to retrieve, from real data stored in a database, information that is relevant to a query. When large volumes of data are considered, as is very often the case with databases dealing with multimedia data, it may become necessary to look for parallel solutions in order to store and gain access to the available items in an efficient way.
Among the range of parallel options available nowadays, clusters stand out as flexible and cost-effective solutions, although the fact that they are composed of a number of independent machines makes it easy for them to become heterogeneous. This paper describes a heterogeneous cluster-oriented CBIR implementation. First, the cluster solution is analyzed without load balancing, and then a new load balancing algorithm for this version of the CBIR system is presented.
The load balancing algorithm described here is dynamic, distributed, global and highly scalable. Nodes are monitored through a load index which allows the estimation of their total amount of workload, as well as the global system state. Load balancing operations between pairs of nodes take place whenever a node finishes its job, resulting in a receiver-triggered scheme which minimizes the system's communication overhead. Globally, the CBIR cluster implementation together with the load balancing algorithm can cope effectively with varying degrees of heterogeneity within the cluster; the experiments presented within the paper show the validity of the overall strategy.
Together, the CBIR implementation and the load balancing algorithm described in this paper open a new path for performant, cost-effective CBIR systems which has not been explored before in the technical literature.
© 2006 Elsevier Inc. All rights reserved.
Keywords: Parallel implementations; CBIR systems; Load balancing algorithms
1. Introduction
The tremendous improvements experienced by computers in aspects such as price, processing power and mass storage capabilities have resulted in an explosion of the amount of information available to people. But this same wealth makes finding the "best" information a very hard task. CBIR1 systems try to solve this problem by offering mechanisms for selecting, among all the available information, the data items which most resemble a specific query [12,34], although the complexity
∗ Corresponding author.
E-mail addresses: [email protected] (J.L. Bosque), [email protected] (O.D. Robles), [email protected] (L. Pastor), [email protected] (A. Rodríguez).
1 Content-based information retrieval.
0743-7315/$ - see front matter © 2006 Elsevier Inc. All rights reserved.
doi:10.1016/j.jpdc.2006.04.014
of this task depends heavily on the volume of data stored in the system. As usual, parallel solutions can be used to alleviate this problem, given the fact that the search operations present a large degree of data parallelism.
Distributed solutions on clusters offer a good cost/performance ratio to solve this problem, given their excellent scalability, fault tolerance and flexibility attributes [37,6,4]. Also, this architecture allows concurrent access to disks, considered the main bottleneck in CBIR systems. Although homogeneous clusters could also be considered for these applications, it is difficult to keep this type of system homogeneous over its whole life-cycle. Among the factors that affect their configuration stability we can mention the addition of new nodes or substitution of faulty ones, technological evolution factors, and even exploitation aspects such as disk fragmentation. In consequence, clusters present additional challenges, since they
can easily become heterogeneous, requiring load distributions that take into consideration each node's computational features [33]. Thus, one of the critical parameters to be fixed in order to keep the efficiency high for these architectures is the workload assigned to each of the cluster nodes. Even though load balancing has received a considerable amount of interest, it is still not definitely solved, particularly for heterogeneous systems [10,18,41,45]. Nevertheless, this problem is central for minimizing the applications' response time and optimizing the exploitation of resources, avoiding overloading some processors while others are idling.
This paper describes the architecture, implementation and performance achieved by a parallel CBIR system implemented on a heterogeneous cluster that includes load balancing. The flexibility of the architecture presented herein allows the dynamic addition or removal of nodes from the cluster between two user queries, achieving reconfigurability, scalability and an appreciable degree of fault tolerance. This approach allows dynamic management of specific databases that can be incorporated into or removed from the CBIR system depending on the desired user query. The heterogeneity of the system is managed by a new dynamic and distributed load balancing algorithm, introducing a new load index that takes into account the nodes' computational capabilities and a more accurate measure of their workload. The proposed method introduces a very small system overhead when departing from a reasonably balanced starting point.
As mentioned before, the amount of data to be managed in CBIR systems is so huge nowadays that it is almost mandatory to use parallelism in order to achieve reasonable user response times. Two alternatives were tested in a previous work: a shared-memory multiprocessor and a cluster [6]. Since the cluster implementation gave better results, it seems advisable to introduce load balancing strategies to improve efficiency on heterogeneous clusters. The selected approach is based on a dynamic, distributed, global and highly scalable load balancing algorithm. A heterogeneous load index based on the number of running tasks and the computational power of each node is defined to determine the state of the nodes. The algorithm automatically turns itself off in global overloading or under-loading situations.
Together, the CBIR implementation and the load balancing algorithm described in this paper open a new path for performant, cost-effective CBIR systems which has not been explored before in the technical literature.
The rest of this article is organized as follows: Section 2 presents an overview of parallel CBIR systems and load balancing algorithms. Section 3 presents an analysis of a sequential version of the CBIR algorithm and a brief description of its parallel implementation on a cluster (without load balancing). Section 4 describes the distributed load balancing algorithm applied to the parallel CBIR system and Section 5 details its implementation on a heterogeneous cluster. Section 6 shows the tests performed in order to measure the improvement achieved by the heterogeneous cluster version with load balancing and the results achieved. Finally, Section 7 presents the conclusions and ongoing work.
2. Previous work
The technological development experienced during the last 20 years has led to a spectacular increase in the volume of data managed by information systems. This fact has led to the search for methods to automate the process of extracting structured information from these systems [12,31]. The potential importance of CBIR systems has been reflected in the variety of approaches taken while dealing with different aspects of CBIR systems. The multidisciplinary nature of this problem has often resulted in partial advances that have been integrated later on in new prototypes and commercial systems. For example, it is possible to find research work that takes into consideration man–machine interaction issues [32]; the users' behavior from a psychological modeling standpoint [27]; multidimensional indexing techniques [5]; multimedia database management system issues [19]; pattern recognition algorithms [17]; multimedia signal processing [39]; object representation and modeling techniques [21]; benchmarks for testing the performance of CBIR systems [16,24]; etc.
In any case, most of the research effort for CBIR systems has been focused on the search for powerful representation techniques for discriminating elements within the global database. Although the nature of the data is a crucial factor to be taken into consideration, most often the final representation is a feature vector 2 extracted from the raw data, which somehow reflects its content. While dealing with 2D images, it is possible to find techniques using color, shape, or texture-based primitives. Other techniques use spatial relationships among the image components or a combination of the above-mentioned approaches. For higher-dimensionality input data, it is possible to find proposals dealing with 3D images or video sequences. Nowadays, one of the most promising research lines is to increase the abstraction level of the semantics associated to the primitives managed, representing high-level concepts derived from the images or the multimedia data.
From the computational complexity point of view, CBIR systems are potentially expensive and have user response times growing with the ever-increasing sizes of the databases associated to them. One of the most common approaches followed to reach acceptable price/performance ratios has been to exploit the algorithms' inherent parallelism at implementation time. However, the novelty of CBIR systems hinders finding references dealing with this aspect. Some contributions that can be cited are Zaki's compilation [43], and the contributions of Srakaew et al. [37] and Bosque et al. [6]. Another reason that has made widespread parallel CBIR system development difficult is that prototype analysis demands a manual image classification stage that limits in practice the number of images used in the tests. Nevertheless, the volume of data managed by current DBs, and obviously those with multimedia information, will demand parallel optimizations for commercial implementations of CBIR systems. In those cases, load balancing operations preventing the coexistence of idling and overloaded
2 Named signature or primitive.
processors will be almost mandatory, since total response times are usually considerably improved with the introduction of even simple load balancing approaches.
Load balancing techniques can be classified according to different criteria [8]. First, algorithms can be labeled as static or dynamic. Static methods perform workload distribution at compilation time, not taking into consideration system state variations. Dynamic methods are able to redistribute workload among nodes at run time, depending on changes in the system state. The works of Rajagopalan et al. [28] and Obeloer et al. [25] are agent-based techniques. These are flexible and configurable approaches, but the amount of resources needed for agent implementation is considerably large. Grosu et al. [15] present a very different, cooperative approach to the load balancing problem, considering it as a game in which each cluster node is a player that must minimize its job execution time. Banicescu et al. propose a load balancing library for scientific applications on distributed memory architectures. The library integrates dynamic loop scheduling as an object migration policy with the object migration mechanism provided by the data movement and control substrate, which is extended with a mobile object layer [2].
Load balancing algorithms can also be classified as centralized or distributed. In the first case, there is a single central node in charge of keeping the system's information updated, making decisions and actually performing the load balancing operations. In distributed methods, every node takes part in the load balancing operations; Zaki et al. [44] show that distributed algorithms yield better results than their centralized counterparts. Last, load balancing algorithms can be classified as global or local. In the first case, a global view of the system state is kept [10]. In the second case, nodes are arranged in sets or domains, and distribution decisions are made only within each domain [9,40]. Other approaches mix this taxonomy by combining several features that could be considered mutually exclusive, like the work of Ahmad and Ghafoor [1], where a semi-distributed algorithm with a two-level hierarchy is presented; their work focuses on static networks where communication latency is very important and depends on node placement. In this type of network, distributed algorithms may produce instability, scalability and bottleneck problems. The improvement of dynamic network technologies solves these problems with broadcast solutions and very low latencies. The technique proposed by Ahmad and Ghafoor [1], although interesting, is not easily applicable to general, unrestricted distributed systems: it was developed for static network environments, where latency depends on node location and where broadcast operations are very costly in terms of system performance. Clusters, which in the present work appear as a very attractive option for CBIR systems in terms of cost/performance ratio, present very different communication features and therefore advise using a different approach.
Although a set of projects have been developed to implement CBIR systems on clusters, like the IRMA project for medical images [14] and the DISCOVIR project (distributed content-based visual information retrieval system on a peer-to-peer network) [13], none of them include a load balancing algorithm to distribute the workload among the cluster nodes, and therefore they cannot manage system heterogeneity.
3. CBIR system description
The experimental work presented in this paper has been performed on a test CBIR system containing information from 29.5 million color pictures. The system provides the user with a data set containing the p images considered most similar to the query one. If the result does not satisfy the user, he/she can choose one of the selected images or enter a new one that presents some kind of similarity with the desired image.
The following sections describe the heart of the CBIR system, where the signature (a feature vector describing the image content) is extracted from each image, as well as the processes involved in serving a user's query. More detailed analyses of the retrieval techniques involved in the CBIR system and of the method's stages from the standpoint of parallel optimization can be found elsewhere [30,29,6], respectively.
3.1. Signature computation
Many different approaches can be used for computing the images' signatures, as mentioned in Section 2. In the work presented here, a primitive that represents the color information of the original image at different resolution levels has been selected. To achieve a multiresolution representation, a wavelet transform is first applied to the image [22,11].
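As an illustration of the multiresolution idea, the sketch below keeps only the low-pass (approximation) part of a Haar-style wavelet decomposition at successive resolutions. This is a hypothetical simplification of our own: the paper's actual primitive encodes color information with the transforms described in [22,11], and the function names are illustrative.

```python
def haar_lowpass(img):
    """One level of a 2-D Haar approximation: each output pixel is the
    average of a 2x2 block of the input (low-pass filter + downsampling).
    `img` is a list of equal-length rows of numbers."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]

def signature(img, levels=3):
    """Sketch of a multiresolution signature: concatenate the flattened
    approximation images at `levels` successively coarser resolutions."""
    sig, cur = [], img
    for _ in range(levels):
        cur = haar_lowpass(cur)
        sig.extend(value for row in cur for value in row)
    return sig
```

For an 8×8 input and three levels, the signature concatenates the 4×4, 2×2 and 1×1 approximations (21 coefficients in total).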
3.2. Analysis of the sequential CBIR algorithm
The search for images contained in a CBIR system can be broken down into the following stages:
(1) Input/query image introduction: The user first selects a 128×128 pixel bidimensional image to be used as a search reference. Then the system computes its signature as described above. The whole process can be efficiently implemented using an O(i_s) order algorithm, i_s being the image's size [38]. This stage does not require high computational resources since the system deals with just one image.
(2) Query and DB image signature comparison and sorting: The signature obtained in the previous stage is compared with all of the DB images' signatures using a Euclidean distance-based metric. After this process, the identifiers of the p images most similar to the input image are extracted, ranked by their similarity. Even though this process of signature comparison, selection and ranking is not very demanding from the computational point of view, it has to be performed with all of the images within the DB.
(3) Results display: The following step is to assemble a mosaic made up of the selected p images, which is presented to the user as the search result (see Fig. 1).
(4) Query image update: If the user considers the search result to be unsatisfactory, he may select one of the displayed images as a new input and then return to the first stage.
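Stage (2) can be sketched in a few lines; `top_p` and the dictionary layout are our illustrative choices, not from the paper:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two signatures (feature vectors)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_p(query_sig, db_sigs, p):
    """Compare the query signature against every DB signature and return
    the identifiers of the p most similar images, most similar first."""
    ranked = sorted(db_sigs, key=lambda img_id: euclidean(query_sig, db_sigs[img_id]))
    return ranked[:p]
```

The full-DB scan is exactly what makes this stage the dominant cost, and what the parallel version distributes.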
Fig. 1. Visual result of a query.
Upon observing the operations involved, it is possible to notice that the comparison and sorting stage involves a much larger computational load than the others. Luckily, data parallelism can be exploited just by dividing the workload among n independent nodes, since there are no dependencies. This can be accomplished by distributing off-line the CBIR images' signatures across the processing nodes. Then, each node can compare the query image's signature with every available signature. In order to also ease the storage requirements, it is possible to distribute images, signatures and computation over all of the n available nodes.
3.3. Parallel implementations without load balancing
3.3.1. Global strategy
A remarkable feature of the signature comparison and sorting stage is the problem's fine granularity: it is possible to perform an efficient data-oriented parallelization by combining the signature comparison and sorting stages, and distributing among the different nodes only the data needed to perform this stage, which are the signatures of the DB images assigned to each node as well as a scalar defining the total number of signatures to be returned, p. It has to be noted that the amount of communication among the corresponding processes is very small, since only the input image's signature and the p identifiers of the most similar images found at each node, together with their corresponding similarity measures, have to be exchanged among the processes involved, as we will see below.
The programmed optimization strategy is based on a farm model, in which a master process distributes the data to be processed over a set of slave processes, which analyze the data and return the partial results to the master once they have finished their computations. Since this approach makes it possible to maintain a large degree of data handling locality, it is well suited for distributed memory multiprocessors with message passing communication. Further advantages of this solution are its good price/performance ratio and its high level of scalability whenever the number of images stored in the database is increased. In our case, the following solution has been adopted:
(1) The master process computes the signature of the input image and broadcasts it to the n slave processes.
(2) The slave processes then proceed to compare the signature of the input image with the signatures of the images assigned to their corresponding process node. Once each comparison has been performed, a check is carried out to ascertain whether the result obtained is one of the best p images and, should that be the case, it is incorporated into the set, which is repeatedly sorted using a bubble sorting algorithm.
(3) The slave processes forward the p image identifiers and similarity measurements to the master process after comparing and selecting the p images which are most similar within each process node.
(4) The master process collects the similarity results obtained from each of the n process nodes and sorts the n · p similarity results, truncating the sort so as to include only the best p.
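Step (4) amounts to a p-truncated merge of n partial lists. A sketch, under the assumption (consistent with step (2)) that each slave returns its list already sorted by ascending distance:

```python
import heapq
from itertools import islice

def merge_results(partial_lists, p):
    """Merge n sorted lists of (distance, image_id) pairs coming from
    the slaves and keep only the globally best p, avoiding a full
    re-sort of all n * p entries."""
    return list(islice(heapq.merge(*partial_lists), p))
```

`heapq.merge` streams the n sorted inputs, so only the first p merged elements are ever materialized.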
Fig. 2. Process communication in the cluster implementation without load balancing.
(5) Finally, the master process requests the process nodes that contain the previously selected images to forward them so that they may be presented to the user and, once available, proceeds to compose a mosaic that is then displayed to the user.
Fig. 2 represents a schematic diagram of the communication between the processes involved in the unbalanced system. Note that each node of the heterogeneous cluster runs two processes: a master to handle the user queries and a slave to provide the local results achieved by each process node to the master process of the cluster node where the query has been generated. This situation is very similar to that found on a grid.
3.3.2. MPI cluster implementation
The application has been programmed using the MPI libraries as communication primitives between the master and slave processes. MPI has been selected given that it currently constitutes a standard for message passing communications on parallel architectures, offering a good degree of portability among parallel platforms [23]. The MPI implementation used is LAM 6.5.6, from the Laboratory for Scientific Computing of Notre Dame University, a free distribution of MPI [36].
The pseudo-code corresponding to the implementation of the master and slave processes is shown below.
Master
loop
    Request an image from the user
    Compute its signature
    Forward the signature to each of the n slave processes using the MPI_BCAST (broadcast) primitive
    Receive the results of the n slave processes using the MPI_RECV (receive) primitive
    Sort the partial n · p comparisons, selecting the top p
    Request the p most similar images from the slave processes where the corresponding images are stored
    Receive the p most similar images from the nodes containing them using the MPI_RECV primitive
    Compose the mosaic to be presented to the user
end loop

Slave_j (1 ≤ j ≤ n)
M being the number of images stored in process node j
loop
    Receive the signature of the query image forwarded from the master using the MPI_BCAST (reception from a previous broadcast) primitive
    Initialize the set P_j, which shall contain the p best results of the comparisons
    for k = 1 to M do
        Find the signature of image k
        Compare the query signature with that of the current image, obtaining the similarity measurement ms_jk
        if ms_jk qualifies for P_j then
            Eliminate the worst result of P_j
            Incorporate the result corresponding to image k into P_j
            Sort P_j using a bubble sorting algorithm
        end if
    end for
    Forward P_j to the master
    if the master requests images to compose the mosaic then
        Forward the requested images
    end if
end loop
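The slave's inner update of P_j can be sketched as a bounded best-p buffer; replacing the worst entry and restoring order with a single bubble pass mirrors the repeated bubble sort of the pseudo-code (function and variable names are our own):

```python
def update_best(P_j, candidate, p):
    """Keep P_j as the p best (distance, image_id) results seen so far,
    sorted ascending by distance. `candidate` is one comparison result."""
    if len(P_j) < p:
        P_j.append(candidate)
    elif candidate < P_j[-1]:
        P_j[-1] = candidate          # eliminate the worst result
    else:
        return P_j                   # candidate does not qualify
    i = len(P_j) - 1                 # one bubble pass re-sorts after insertion
    while i > 0 and P_j[i] < P_j[i - 1]:
        P_j[i], P_j[i - 1] = P_j[i - 1], P_j[i]
        i -= 1
    return P_j
```

Since only one element is out of place after each insertion, a single pass of the bubble sort restores the full ordering.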
The size of the data corresponding to each one of the p best results that are transferred from every slave to the master is around 336 bytes. Therefore, each slave process transfers 336 · p bytes per query. For example, for p = 20 and n = 25, the traffic involved in the response will be less than 165 kB.3
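The traffic estimate can be checked directly with the figures given in the text:

```python
record_bytes = 336           # approximate size of one result record
p, n = 20, 25                # top results per slave, number of slaves
total_reply_bytes = record_bytes * p * n
# 336 * 20 * 25 = 168,000 bytes, i.e. about 164 kB -- under the 165 kB bound
```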
4. Description of the load balancing algorithm
A dynamic, distributed, global and highly scalable load balancing algorithm has been developed for the CBIR application and tested with the parallel CBIR application previously described. A more detailed description of the load balancing algorithm can also be found in [7]. A load index based on the number of running tasks and the computational power of each node is used to determine the nodes' state, which is exchanged among all of the cluster nodes at regular time intervals. The initiation rule is receiver-triggered and based on workload thresholds. Finally, the distribution rule takes into account the heterogeneous nature of the cluster nodes, as well as the communication time needed for the workload transmission, in order to divide the amount of workload between a pair of nodes in every load balancing operation. These ideas are detailed in the following sections.
4.1. State rule
The load balancing algorithm is based on a load index which estimates how loaded a node is in comparison to the rest of the nodes that compose the cluster. Many approaches can be taken to compute the load index. As in any estimation process, it is necessary to find a trade-off between accuracy and cost, since keeping frequently updated node rankings according to their workload might be costly.
The index is based on the number of tasks in the run-queue of each CPU [20]. These data are exchanged among all of the nodes in the cluster to update the global state information. Moreover, each node takes into account the following information about the rest of the cluster nodes:
• Cluster heterogeneity: each node can have a different computational power Pi, so this factor is an important parameter to take into account for computing the load index. It is defined as the inverse of the time taken by node i to process a single signature.
• Total amount of workload for each node: it is evaluated when the application begins its execution and it is updated if there are any changes in a node.
• Percentage of the workload performed by each node, Wi: it is defined as a function of the total workload, the computational power and the number of tasks on this node.
• Period of time since the last update, D, and total execution time, T.
3 These figures do not take into consideration either the data corresponding to the images presented to the user or the overheads originated by the communication primitives, although the latter could be considered negligible.
Therefore, the updates of the number of tasks are performed as

Nave = (Nlast · T + Ncur · D) / (T + D),    (1)

where Ncur is the number of currently running tasks in the node, Nlast is the average number of tasks running since the last update, T is the total execution time of the Nlast tasks considered, and D is the interval of time since the last update. This expression gives the average number of tasks of the node during the execution time of the application. Thus, the percentage of workload processed in each node, Wi, is evaluated as

Wi = (Pi · T) / (W · Nave) × 100.    (2)
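Eqs. (1)–(2) translate directly into code; a sketch with our own variable names:

```python
def avg_tasks(n_last, n_cur, T, D):
    """Eq. (1): time-weighted average number of running tasks, combining
    the previous average over time T with the current count over the
    update interval D."""
    return (n_last * T + n_cur * D) / (T + D)

def workload_pct(P_i, T, W, n_ave):
    """Eq. (2): percentage of the total workload W processed by node i,
    given its computational power P_i diluted by its average task count."""
    return (P_i * T) / (W * n_ave) * 100.0
```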
4.2. Information rule
Given that the load balancing approach described here is dynamic, distributed and global, every node in the system needs updated information about how loaded the remaining system nodes are [42]. The selected information rule is periodic: each node broadcasts its own load index to the rest of the nodes at specific time instants. A periodic rule is necessary because each node has to compute the amount of workload processed by the rest of the cluster nodes, based on the average number of tasks per node. To evaluate the average number of tasks, the information must be updated periodically, which makes other information rules, such as event-driven or on-demand, unsuitable.
4.3. Initiation rule
The initiation rule determines the exact times at which load balancing operations are performed. It is a receiver-initiated rule, where load balancing operations involve pairs of idling and heavily loaded nodes: whenever a processor finishes its assigned workload, it looks for a busy node and asks it to share part of its remaining workload. Since each node keeps information about the amount of pending work of the remaining nodes, the selection of busy nodes is simple.
The initiation rule described above minimizes the number of load balancing operations, reducing the algorithm's overhead. Also, all the operations improve the system performance, because the total response times of the nodes involved in the load balancing operation are equalized, provided that there are no additional changes in their state and they are not involved in other load balancing operations.
4.4. Load balancing operation
The load balancing operation is broken down into three phases: first, it is necessary to find an adequate node which will provide part of its workload (localization rule). Then, the amount of workload to be transferred has to be computed (distribution rule). Finally, the workload has to be actually transferred.
4.4.1. Localization rule
Whenever a node finishes its workload, it looks for a sender node to start a load balancing operation. The receiver node checks the state of the rest of the cluster nodes and computes a node list, ordered by the amount of pending work. To select the sender node, the receiver checks its own position in the list and selects the node which is in the symmetric position; for example, if nodes are ranked according to their workload, the least loaded node will look for the most loaded, the second least loaded node will look for the second most loaded, and so on. In consequence, each sender–receiver pair of nodes will have between them a similar amount of workload.
Apart from being very simple to implement, this approach gives good results, since whenever a node finishes its work it is placed at one end of the list, selecting a heavily loaded node (at the other end of the list). This way, the selection of the sender node is very coherent: the underloaded nodes take workload from the overloaded nodes, while the nodes in the middle positions of the list do not receive a load balancing request (since it is very unlikely that a node placed in an intermediate position starts a load balancing operation).
Additionally, if several nodes are looking for a sender at the same time, it is unlikely that they address their requests to the same sender. This way, situations where a loaded node receives several load balancing requests while the rest of the loaded nodes do not receive any are avoided. Finally, this approach is not time consuming, because the nodes always have up-to-date state information to build their own lists.
Whenever a node receives a load balancing request, it can accept or reject it. In order to accept it, the sender node should have a minimum amount of work left. Otherwise, the sender node is close to completing its workload, and the cost of the load balancing operation can be higher than finishing the remaining workload locally. In that case, the receiver node will select another node from the list using the same procedure until an adequate node is found or the end of the list is reached.
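The localization rule can be sketched as follows; `MIN_WORK` is an assumed acceptance threshold (the paper does not give a concrete value), and the fallback simply walks toward lighter nodes:

```python
MIN_WORK = 100  # assumed acceptance threshold, in workload units

def select_sender(pending, receiver):
    """Pick a sender for `receiver` by the symmetric-position rule:
    rank nodes by pending work and mirror the receiver's rank. Nodes
    whose pending work is below MIN_WORK are skipped, walking down the
    list until a suitable sender is found (None if there is none)."""
    ranked = sorted(pending, key=pending.get)         # least loaded first
    mirror = len(ranked) - 1 - ranked.index(receiver)
    for idx in range(mirror, -1, -1):                 # fall back toward lighter nodes
        node = ranked[idx]
        if node != receiver and pending[node] >= MIN_WORK:
            return node
    return None
```

With this mirroring, the idle node at rank 0 asks the most loaded node, the node at rank 1 asks the second most loaded, and so on, so simultaneous requests rarely collide.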
4.4.2. Distribution rule
The distribution rule computes the amount of work that has to be moved from the sender to the receiver node. An appropriate rule should take into consideration the nodes' relative capabilities and availabilities, so that they finish processing their jobs at the same time (provided that no additional operations change their processing conditions). The communication time needed to transfer the workload between the nodes is also taken into consideration, because the receiver node cannot run the newly assigned task until it receives the corresponding load, incurring an additional delay. The global equilibrium is obtained through successive operations between pairs of nodes.
The proposed distribution rule is based on two parameters: the number of running tasks NTi and the computational power Pi of the nodes which take part in the operation. This reflects the fact that the contribution of a powerful node might be hampered by a large amount of external workload. Both parameters are combined in the nodes' actual computational power, Pacti, which is obtained as

Pacti = Pi / NTi.    (3)
This is a multi-phase application with two different phases: comparison and sorting. Once the load balancing operation is finished, the sender node has to finish the comparison phase with the remaining workload. Then, it must sort all the processed workload. The receiver has to compare and sort the new workload. Additionally, the communication time has to be taken into account, because the receiver cannot continue processing until it receives the new workload. Thus, the distribution rule is determined by the following expressions:
Ts = Ws / Pact_s + (W - Wr) / Pact_s,
Tr = Wr / Pact_r + Wr / Pact_r + Wr / Pc,
W = Ws + Wr,   (4)
where Ts and Tr are the response times of the sender and receiver processors once the load balancing operation has finished. W is the total workload of the sender that has not yet been processed, Ws is the workload remaining in the sender node after the load balancing operation and Wr is the workload sent to the receiver. Pact_s and Pact_r are the sender's and receiver's current computational power. Finally, Pc is the communication power, expressed in units of workload per second. The communication power is obtained by computing offline the number of signatures that can be exchanged per second between two of the cluster nodes.
This model relies on two assumptions:

• The computational power of a node is the same in both the comparison and sorting phases.
• The response time in both phases and the communication time are linear with respect to the workload.
Solving these expressions under the condition Ts = Tr, the sender and receiver workloads can be computed as

Wr = 2 W Pact_r Pc / (2 Pc Pact_s + 2 Pact_r Pc + Pact_s Pact_r),
Ws = W - Wr.   (5)
The values of the workloads Wr and Ws take into account the heterogeneity of the nodes, their current state, the communication times and the two different phases of the application.
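As a concrete check, Eqs. (3)-(5) can be coded directly. This is a minimal sketch; the function and variable names are ours, and the example values in the usage note are illustrative only.

```python
def actual_power(p, nt):
    """Actual computational power Pact = P / NT, Eq. (3)."""
    return p / max(nt, 1)


def distribution_rule(w, p_s, nt_s, p_r, nt_r, pc):
    """Split the sender's pending workload W according to Eq. (5).

    w         -- workload not yet processed by the sender
    p_*, nt_* -- computational power and number of running tasks of the
                 sender (s) and receiver (r)
    pc        -- communication power, in workload units per second
    Returns (Ws, Wr): workload kept by the sender, workload transferred.
    """
    pact_s = actual_power(p_s, nt_s)
    pact_r = actual_power(p_r, nt_r)
    wr = (2 * w * pact_r * pc) / (2 * pc * pact_s + 2 * pact_r * pc + pact_s * pact_r)
    return w - wr, wr
```

By construction, the two response times of Eq. (4) coincide: the sender's 2 Ws / Pact_s equals the receiver's 2 Wr / Pact_r + Wr / Pc, so both nodes finish at the same time if their conditions do not change.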
In consequence, the load balancing algorithm described is dynamic, being able to redistribute workload among nodes at run time depending on how the system state changes. It is also distributed, because every node takes part in the load balancing operations. And finally, it is global, because a global view of the system state is always kept. The following section describes the implementation of this algorithm on a CBIR system running on a heterogeneous cluster.
5. Distributed load balancing implementation on a heterogeneous cluster
5.1. Process structure of the load balance implementation
Two replicated processes run on each of the cluster nodes:

(1) Load daemon: this process implements both the state and the information rules.
(2) Distribution daemon: it collects requests from slave nodes demanding workload and carries out the transfers.
Fig. 3 shows a decomposition of all the actions that must be carried out when a slave node finishes its local workload and triggers the initiation rule. First, it demands new load from the distribution process, which obtains the demanded load and sends it to the slave node so that the computations can continue. The following section describes the structure of the groups of processes and their functions.
5.2. Groups of processes
As mentioned in Section 3.3.2, communication and synchronization between processes are based on MPI. A structure of groups of processes based on communicators [23,26,35] has been implemented; the groups make it possible to establish communication structures between processes and to use collective communication functions over subsets of processes. This way, each type of process belongs to its own group:
• MPI_COMM_MS: this group is composed of the master process and all of the slave processes.
• MPI_COMM_DIST: this group is formed by the distribution processes.
• MPI_COMM_LOAD: this group is composed of the load daemon of each of the nodes.
The group concept is the most natural way to implement this process scheme, because most of the messages transmitted involve processes that belong to the same group. Fig. 4 presents this communication hierarchy.
5.3. Load daemon
The main function of this process is to compute the local load index, to send this information to the load daemons of the other nodes, and to transmit all of the information available to the local distribution process whenever it is required to do so. It is also in charge of initializing and managing a table that stores the state of the other nodes. For each node, the table stores the following information: computational power, average number of active tasks while the application is running, percentage of completed work, time of the last update, total execution time under some load level and number of signatures still to be processed.
At predetermined fixed intervals, the process evaluates the load index of the node where it is running and sends the state information to the other nodes. The rest of the time it remains blocked, waiting for messages from other processes; its functionality depends on the messages received. Table 1 summarizes the messages involved with the load daemon and the tasks associated with each one.

Fig. 3. General overview of the whole load balancing algorithm (the distribution process of node i demands a sender node's number from its load daemon, which sorts its table and selects a node; node i then sends a load request to the distribution process of node j, which computes and sends the amount of load; finally, node i stores the load and sends the execution order to the slave).

Fig. 4. Group scheme (within MPI_COMM_WORLD: MPI_COMM_MS groups the master and the slaves, MPI_COMM_DIST the distribution processes, and MPI_COMM_LOAD the load daemons).

Table 1
Messages and associated functions of the load daemon

Message identifier  Associated tasks
0  Task information
1  The distribution process has finished and demands the identifier of a transmitter node
2  The distribution process informs about the number of signatures delivered to another node
3  The distribution process notifies that there are no available nodes to transfer load
4  The distribution process shows the number of signatures obtained from another node
5  Another load daemon informs about its new number of signatures after their transfer to another node
6  Another load daemon reports the new number of signatures assigned to it
7  Another load daemon tells that there are no nodes to transfer load
5.4. Distribution process
The main function of the load distribution process is to implement the initiation rule and the load balancing operation. Whenever a particular slave finishes its local work, the distribution process is alerted; it then evaluates the initiation rule, finds a candidate node, carries out the negotiation and delivers the load to the slave. On the other hand, if the node receives a load balancing request, the distribution rule is triggered and the appropriate workload is sent to the remote node.
6. Analysis of the CBIR implementation with load balancing in a heterogeneous cluster
A set of experiments has been performed to test the behavior of the parallel CBIR system implemented on the heterogeneous cluster using the distributed load balancing algorithm described above. To compare the results achieved by the parallel CBIR system with and without the distributed load balancing algorithm, the total response time of the CBIR system has been measured in both configurations. Additionally, two classical load balancing algorithms have been implemented as references: the random algorithm [3] and the Probin algorithm [9]. The random algorithm is one of the simplest distributed load balancing algorithms, because each node makes decisions based on local information alone. A node is considered a sender if its CPU queue length exceeds a predetermined, constant threshold. The receiver is selected randomly, because the nodes do not share any information about the system status. The Probin algorithm is a diffusion-based algorithm, where information is exchanged locally within communication domains defined between neighboring nodes. Several levels of coordination can be established by varying the domains' size.

Fig. 5. Computational power of the cluster nodes, measured in workload units/second.
The experiments have been executed on a heterogeneous cluster composed of 25 nodes, linked through a 100 Mbit/s Ethernet. Each of the processing nodes features 4 GB of storage capacity in an IDE hard disk connected through DMA with a 16.6 MB/s transfer speed. The nodes' operating system is Linux v. 2.2.12. The heterogeneity is determined by the hard disk features. It has to be noted that this component determines each node's response time, as shown in Fig. 5, since in this CBIR system (as in many others) I/O operations predominate over CPU operations.
Two different tests have been performed to measure the improvement achieved by the heterogeneous cluster implementation using the distributed load balancing algorithm. The first one analyzes search operations within a 30 million image database using an underloaded system. Since none of the nodes is overloaded, this test studies how heterogeneity affects the system performance, and how this performance is improved using the load balancing algorithm. The second experiment adds some artificial external tasks to a node in order to test how well the load balancing algorithm copes with a strong load imbalance. In this case the underloaded nodes would otherwise have to wait for the overloaded one to finish the application; the load balancing algorithm must remove the underloaded nodes' idle time.
Fig. 6. Results without external tasks (speedup with respect to the algorithm without load balancing): (a) response time and (b) speedup.
Table 2
Response time without external workload, measured in seconds

No. nodes  Without alg.  Random alg.  Probin alg.  Proposed alg.
5          14362         14070        13771        12628
10          7285          7044         6784         6362
15          5110          4970         4863         4432
20          4639          4320         4295         3799
25          4168          4114         4128         3593
6.1. Tests considering cluster heterogeneity and load balancing overhead
The main purposes of these tests were to determine the amount of overhead introduced by the load balancing algorithm, and how well the algorithm manages the system heterogeneity. The tests were performed on clusters with 5, 10, 15, 20 and 25 slave nodes plus a master node, in order to evaluate the algorithm's scalability. The results are presented in Table 2 and in Fig. 6.
Table 2 shows that the response times are always shorter with some load balancing algorithm, which means that the overhead introduced by the algorithm is smaller than the improvements achieved by using any of the implemented load balancing algorithms. From these results, two main considerations can be pointed out:
• The tested load balancing algorithms always improved the response times, by between 10% and 15%. The best results were achieved by the proposed algorithm.
• The proposed approach proved to be more stable, while the results obtained with the other algorithms were less consistent.
Fig. 6(b) and Table 3 present the speedup of these algorithms, where the speedup refers to the improvement with respect to the execution without load balancing.
An interesting parameter for estimating the methods' behavior is the standard deviation of the response times of the different cluster nodes, shown in Table 4 and in Fig. 7. The standard deviation of the nodes' response times is a measurement directly related to the idle time of nodes waiting for other nodes to finish their assignments.
Table 3
Speedup without external tasks

No. nodes  Speedup random alg.  Speedup Probin alg.  Speedup proposed alg.
5          1.021                1.047                1.137
10         1.034                1.073                1.145
15         1.028                1.051                1.153
20         1.019                1.029                1.158
25         1.013                1.01                 1.16
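The speedups in Table 3 are ratios of the response times in Table 2. A trivial Python check of this convention, using the 5-node and 25-node rows for the proposed algorithm:

```python
def speedup(t_without, t_with):
    """Speedup of a load-balanced run with respect to the run without
    load balancing (the convention used in Tables 3 and 6)."""
    return t_without / t_with


# 5-node and 25-node rows of Table 2, proposed algorithm:
s5 = speedup(14362, 12628)   # about 1.137, matching Table 3
s25 = speedup(4168, 3593)    # about 1.16, matching Table 3
```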
Table 4
Standard deviation of the cluster nodes without external tasks

No. nodes  Without alg.  Random alg.  Probin alg.  Proposed alg.
5          2620          1127         743.53       355
10         1198           787         482          145
15          763           571.94       85.73        73
20          620           430         317           46
25          575           333.25      198.08        37
Fig. 7. Standard deviation without external tasks.
Fig. 8. Results with external tasks: (a) response time and (b) speedup.
Table 5
Response time with external tasks on a node, measured in seconds

No. nodes  Without alg.  Random alg.  Probin alg.  Proposed alg.
5          10548         8864         8305         5732
10          5535         4562         4225         3054
15          3645         3063         2836         2099
20          2893         2483         2352         1739
25          2589         2311         2231         1601
The load balancing algorithm presented here decreases the standard deviation, equilibrating the response times. The random algorithm achieves only a slight reduction, while with the Probin algorithm this value is erratic, depending highly on the probed nodes. Finally, the proposed algorithm achieves the best values of all the load balancing algorithms tested, reducing the standard deviation by between 86.45% with 5 nodes and 93.56% with 25 nodes with respect to the response times without a load balancing algorithm.
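The quoted reductions follow directly from Table 4, which can be checked with a trivial Python snippet (function name is ours):

```python
def reduction(before, after):
    """Percentage reduction of the response-time standard deviation."""
    return 100.0 * (before - after) / before


# 5-node and 25-node rows of Table 4, without alg. vs. proposed alg.:
r5 = reduction(2620, 355)   # about 86.45% for 5 nodes
r25 = reduction(575, 37)    # about 93.6% for 25 nodes
```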
6.2. Results with system overload
For these experiments the system is slightly overloaded, with one of the nodes heavily loaded. The goal of this test is to measure the algorithm's ability to distribute the work of the loaded node among the remaining cluster nodes without affecting the system performance. The tests were performed on a heterogeneous cluster with 5, 10, 15, 20 and 25 slave nodes plus a master node, using a database of 12.5 million images. Table 5 and Fig. 8 present the results achieved in this experiment.
For these tests, the differences obtained between executions with and without load balancing were very large. The reductions in response times range from 45% with 5 nodes to 38% with 25 nodes. As the number of nodes increases, the differences in response times decrease. Again, the best results are achieved with the proposed algorithm. Table 6 and Fig. 8(b) show the speedup achieved in these tests.
Finally, Table 7 and Fig. 9 present the standard deviation results.
In these tests, the reduction of the standard deviation ranged from 90% to 95%. An interesting point to remark is the lack of consistency of the results provided by the random algorithm: this method provides only marginal improvements with respect to the algorithm without load balancing for more than 10–15 nodes.

Table 6
Speedup with external tasks

No. nodes  Speedup random alg.  Speedup Probin alg.  Speedup proposed alg.
5          1.19                 1.27                 1.84
10         1.213                1.31                 1.81
15         1.19                 1.285                1.74
20         1.165                1.25                 1.67
25         1.12                 1.16                 1.62

Table 7
Standard deviation with external tasks

No. nodes  Without alg.  Random alg.  Probin alg.  Proposed alg.
5          2809          1232         305          127
10         1117           821         296           83
15          565           490         452           54
20          375           345         321           32
25          308           291         217           16

Fig. 9. Standard deviation with external tasks.
The Probin algorithm behaves better with fewer than 10 nodes, although the relative improvements drop dramatically above 15 nodes. Finally, the proposed algorithm has very stable values, achieving better results as the number of nodes increases. The method takes advantage of the availability of additional nodes, having them all finish within a short time interval.

Fig. 10. Response time considering a loaded node for a 25 node cluster.

Table 8
Response time increasing the number of external tasks for a 25 node cluster, measured in seconds

No. tasks  Without alg.  Proposed alg.
0          1667          1438
1          2589          1601
2          3293          1601
6.3. Results increasing the load of a heavily loaded node
The last experiment presented in this paper was performed by increasing the number of external tasks on a heavily loaded node. This way, the imbalance among nodes is higher, and the algorithms' behavior when confronted with highly overloaded nodes can be checked. Table 8 and Fig. 10 present the results achieved.
From the results above it can be seen that the response times without any load balancing algorithm increase linearly with the external load. When the proposed load balancing algorithm is introduced, depending on the system's external workload status, the system's response time may increase slightly. But globally, the response times remain almost invariant when the amount of external workload is increased, as can be seen in Table 8 and Fig. 10. This behavior shows that, as long as there are underloaded nodes, the extra workload can be split among them and the application response time can be kept almost constant.
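The trend can be verified directly against the Table 8 values with a trivial Python check:

```python
# Response times from Table 8 (25 node cluster), indexed by the number of
# external tasks on the heavily loaded node.
without_alg = {0: 1667, 1: 2589, 2: 3293}
proposed_alg = {0: 1438, 1: 1601, 2: 1601}

# Without load balancing the response time grows with the external load.
growth = [without_alg[i + 1] - without_alg[i] for i in range(2)]

# With the proposed algorithm the extra load is absorbed by the other
# nodes: the response time is the same with one and two external tasks.
flat = proposed_alg[1] == proposed_alg[2]
```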
7. Conclusions and future work
This paper begins with an analysis of the operations involved in a typical CBIR system. From the analysis of the sequential version, a lack of data and algorithmic dependencies can be observed. This allows efficient cluster implementations of CBIR systems, since the cluster is a parallel architecture that meets the application needs very well [6].
Improvements on the cluster implementation have been made by introducing a dynamic, distributed, global and scalable load balancing algorithm, designed specifically for the parallel CBIR application implemented on a heterogeneous cluster. An additional important feature is that the load balancing algorithm takes into account the system heterogeneity originated both by the different node computational attributes and by external factors, such as the presence of external tasks.
The experiments presented here show that the amount of overhead introduced by this method is very small. In fact, this overhead is hidden by the improvements achieved whenever any degree of system heterogeneity shows up, a common situation in grid systems. All these experiments have also shown that using the load balancing algorithm results in large execution time reductions and in a more uniform distribution of the nodes' response times, which can be seen in the strong reductions of the response times' standard deviation.
As shown in the experiments presented here, another important aspect that should be stressed is the algorithm's scalability: increasing the number of system nodes does not significantly change the execution time increments originated by the introduction of the load balancing algorithm. At this moment, considering a network with a much higher number of nodes is not possible with the available resources. In any case, it is feasible and even simple to extend the current implementation to a hierarchical algorithm using MPI communicators. The cluster version of the CBIR system that includes the load balancing algorithm is currently fully operational.
Finally, further work will be devoted to evaluating the effects on the method's performance of using more complex node load indices and initiation rules. New efforts will be made to refine the primitives used in the CBIR system, and to introduce fault tolerance mechanisms in order to increase the system's robustness. The system response will also be analyzed after distributing the database of the CBIR system between different clusters. A future migration of the implemented CBIR system to a grid will also be undertaken.
Acknowledgments
This work has been partially funded by the Spanish Ministry of Education and Science (Grant TIC2003-08933-C02) and the Government of the Community of Madrid (Grant GR/SAL/0940/2004).
References
[1] I. Ahmad, A. Ghafoor, Semi-distributed load balancing for massively parallel multicomputer systems, IEEE Trans. Software Engrg. 17 (10) (1991) 987–1004.
[2] I. Banicescu, R. Carino, J. Pabico, M. Balasubramaniam, Design and implementation of a novel dynamic load balancing library for cluster computing, Parallel Comput. 31 (7) (2005) 736–756.
[3] K.M. Baumgartner, B.W. Wah, Computer scheduling algorithms: past, present and future, Inform. Sci. 57–58 (1991) 319–345.
[4] G. Bell, J. Gray, What's next in high-performance computing?, Commun. ACM 45 (2) (2002) 91–95.
[5] A.P. Berman, L.G. Shapiro, A flexible image database system for content-based retrieval, Comput. Vision Image Understanding 75 (1/2) (1999) 175–195.
[6] J.L. Bosque, O.D. Robles, A. Rodríguez, L. Pastor, Study of a parallel CBIR implementation using MPI, in: V. Cantoni, C. Guerra (Eds.), Proceedings of the International Workshop on Computer Architectures for Machine Perception, IEEE CAMP 2000, Padova, Italy, 2000, pp. 195–204, ISBN 0-7695-0740-9.
[7] J.L. Bosque, O.D. Robles, L. Pastor, Load balancing algorithms for CBIR environments, in: Proceedings of the International Workshop on Computer Architectures for Machine Perception, IEEE CAMP 2003, The Center for Advanced Computer Studies, University of Louisiana at Lafayette, IEEE, New Orleans, USA, 2003, ISBN 0-7803-7971-3.
[8] T.L. Casavant, J.G. Kuhl, A taxonomy of scheduling in general-purpose distributed computing systems, in: T.L. Casavant, M. Singhal (Eds.), Readings in Distributed Computing Systems, IEEE Computer Society Press, Los Alamitos, CA, 1994, pp. 31–51.
[9] A. Corradi, L. Leonardi, F. Zambonelli, Diffusive load-balancing policies for dynamic applications, IEEE Concurrency 7 (1) (1999) 22–31.
[10] S.K. Das, D.J. Harvey, R. Biswas, Parallel processing of adaptive meshes with load balancing, IEEE Trans. Parallel Distributed Systems 12 (12) (2001) 1269–1280.
[11] I. Daubechies, Ten Lectures on Wavelets, vol. 61 of CBMS-NSF Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1992.
[12] A. del Bimbo, Visual Information Retrieval, Morgan Kaufmann Publishers, San Francisco, CA, 1999, ISBN 1-55860-624-6.
[13] Department of Computer Science and Engineering, The Chinese University of Hong Kong, DISCOVIR: Distributed Content-based Visual Information Retrieval System on Peer-to-Peer (P2P) Network, Web, 2003. URL 〈http://www.cse.cuhk.edu.hk/∼miplab/discovir/〉.
[14] Department of Diagnostic Radiology, Department of Medical Informatics, Division of Medical Image Processing and Lehrstuhl für Informatik VI of the Aachen University of Technology (RWTH Aachen), IRMA: Image Retrieval in Medical Applications, Web, 2003. URL 〈http://libra.imib.rwth-aachen.de/irma/index_en.php〉.
[15] D. Grosu, A. Chronopoulos, M. Leung, Load balancing in distributed systems: an approach using cooperative games, in: 16th International Parallel and Distributed Processing Symposium, IPDPS '02, IEEE, 2002, pp. 52–53.
[16] N.J. Gunther, G. Beretta, A benchmark for image retrieval using distributed systems over the internet: BIRDS-I, Technical Report HPL-2000-162, Imaging Systems Laboratory, Hewlett Packard, December 2000.
[17] R.M. Haralick, L.G. Shapiro, Computer and Robot Vision, vol. I, Addison-Wesley, Reading, MA, 1992, ISBN 0-201-10877-1.
[18] C.-C. Hui, S.T. Chanson, Hydrodynamic load balancing, IEEE Trans. Parallel Distributed Systems 10 (11) (1999) 1118–1137.
[19] S. Khoshafian, A.B. Baker, Multimedia and Imaging Databases, Morgan Kaufmann, San Francisco, CA, 1996.
[20] T. Kunz, The influence of different workload descriptions on a heuristic load balancing scheme, IEEE Trans. Software Engrg. 17 (7) (1991) 725–730.
[21] L.J. Latecki, R. Melter, A. Gross, et al., Special issue on shape representation and similarity for image databases, Pattern Recognition 35 (1).
[22] S. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Trans. Pattern Anal. Mach. Intell. 11 (7) (1989) 674–693.
[23] MPI Forum, A message-passing interface standard, 2003. URL 〈www.mpi-forum.org〉.
[24] H. Müller, W. Müller, D.M. Squire, S. Marchand-Maillet, T. Pun, Performance evaluation in content-based image retrieval: overview and proposals, Pattern Recognition Lett. 22 (2001) 593–601.
[25] W. Obeloer, C. Grewe, H. Pals, Load management with mobile agents, in: 24th Euromicro Conference, vol. 2, IEEE, 1998, pp. 1005–1012.
[26] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers Inc., San Francisco, 1997.
[27] J.S. Payne, L. Hepplewhite, T.J. Stonham, Evaluating content-based image retrieval techniques using perceptually based metrics, in: Proceedings of SPIE on Applications of Artificial Neural Networks in Image Processing IV, vol. 3647, SPIE, 1999, pp. 122–133.
[28] A. Rajagopalan, S. Hariri, An agent based dynamic load balancing system, in: International Workshop on Autonomous Decentralized Systems, IEEE, 2000, pp. 164–171.
[29] O.D. Robles, A. Rodríguez, M.L. Córdoba, A study about multiresolution primitives for content-based image retrieval using wavelets, in: M.H. Hamza (Ed.), IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 2001), IASTED, ACTA Press, Marbella, Spain, 2001, pp. 506–511, ISBN 0-88986-309-1.
[30] A. Rodríguez, O.D. Robles, L. Pastor, New features for content-based image retrieval using wavelets, in: F. Muge, R.C. Pinto, M. Piedade (Eds.), V Ibero-american Symposium on Pattern Recognition, SIARP 2000, Lisbon, Portugal, 2000, pp. 517–528, ISBN 972-97711-1-1.
[31] S. Santini, Exploratory Image Databases: Content-based Retrieval, Communications, Networking, and Multimedia, Academic Press, New York, 2001, ISBN 0-12-619261-8.
[32] S. Santini, A. Gupta, R. Jain, Emergent semantics through interaction in image databases, IEEE Trans. Knowledge Data Engrg. 13 (3) (2001) 337–351, ISSN 1041-4347.
[33] B. Schnor, S. Petri, R. Oleyniczak, H. Langendörfer, Scheduling of parallel applications on heterogeneous workstation clusters, in: K. Yetongnon, S. Hariri (Eds.), Proceedings of the ISCA Ninth International Conference on Parallel and Distributed Computing Systems, vol. 1, ISCA, Dijon, 1996, pp. 330–337.
[34] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Trans. PAMI 22 (12) (2000) 1349–1380.
[35] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra, MPI: The Complete Reference, The MIT Press, Cambridge, 1996.
[36] J.M. Squyres, K.L. Meyer, M.D. McNally, A. Lumsdaine, LAM/MPI User Guide, University of Notre Dame, LAM 6.3, 1998. URL 〈http://www.mpi.nd.edu/lam/〉.
[37] S. Srakaew, N. Alexandridis, P.P. Nga, G. Blankenship, Content-based multimedia data retrieval on cluster system environment, in: P. Sloot, M. Bubak, A. Hoekstra, B. Hertzberger (Eds.), High-Performance Computing and Networking, Seventh International Conference, HPCN Europe 1999, Springer, Berlin, 1999, pp. 1235–1241.
[38] E.J. Stollnitz, T.D. DeRose, D.H. Salesin, Wavelets for Computer Graphics, Morgan Kaufmann Publishers, San Francisco, 1996.
[39] Y. Wang, Z. Liu, J.-C. Huang, Multimedia content analysis, IEEE Signal Process. Mag. 16 (6) (2000) 12–36.
[40] M.H. Willebeek-LeMair, A.P. Reeves, Strategies for dynamic load balancing on highly parallel computers, IEEE Trans. Parallel Distributed Systems 4 (9) (1993) 979–993.
[41] L. Xiao, S. Chen, X. Zhang, Dynamic cluster resource allocations for jobs with known and unknown memory demands, IEEE Trans. Parallel Distributed Systems 13 (3) (2002) 223–240.
[42] C. Xu, F. Lau, Load Balancing in Parallel Computers: Theory and Practice, Kluwer Academic Publishers, Dordrecht, 1997.
[43] M.J. Zaki, Parallel and distributed association mining: a survey, IEEE Concurrency 7 (4) (1999) 14–25.
[44] M.J. Zaki, S. Pathasarathy, W. Li, Customized Dynamic Load Balancing, vol. 1, Architectures and Systems, Prentice-Hall PTR, Upper Saddle River, NJ, 1999 (Chapter 24).
[45] A.Y. Zomaya, Y.-H. Teh, Observations on using genetic algorithms for dynamic load-balancing, IEEE Trans. Parallel Distributed Systems 12 (9) (2001) 899–911.
Jose L. Bosque graduated in Computer Science and Engineering from Universidad Politécnica de Madrid in 1994. He received the Ph.D. degree in Computer Science and Engineering from Universidad Politécnica de Madrid in 2003. His Ph.D. was centered on theoretical models and algorithms for heterogeneous clusters. He has been an associate professor at the Universidad Rey Juan Carlos in Madrid, Spain, since 1998. His research interests are parallel and distributed processing, performance and scalability evaluation and load balancing.
Oscar D. Robles received his degree in Computer Science and Engineering and the Ph.D. degree from the Universidad Politécnica de Madrid in 1999 and 2004, respectively. His Ph.D. was centered on content-based image and video retrieval techniques on parallel architectures. Currently he is an Associate Professor at the Universidad Rey Juan Carlos and has published works in the fields of multimedia retrieval and parallel computer systems. His research interests include content-based multimedia retrieval, as well as computer vision and computer graphics. He is a Eurographics member.
Luis Pastor received the B.S.EE degree from the Universidad Politécnica de Madrid in 1981, the M.S.EE degree from Drexel University in 1983, and the Ph.D. degree from the Universidad Politécnica de Madrid in 1985. Currently he is a Professor at the Universidad Rey Juan Carlos (Madrid, Spain). His research interests include image processing and synthesis, virtual reality, 3D modeling and parallel computing.
Angel Rodríguez received his degree in Computer Science and Engineeringand the Ph.D. degree from the Universidad Politécnica de Madrid in 1991and 1999, respectively. His Ph.D. was centered on the tasks of modeling andrecognizing 3D objects in parallel architectures. He is an Associate Professorin the Photonics Technology Department, Universidad Politécnica de Madrid(UPM), Spain and has published works in the fields of parallel computersystems, computer vision and computer graphics. He is an IEEE and an ACMmember.