CBIR on Grids


Oscar D. Robles¹, José Luis Bosque¹, Luis Pastor¹, and Ángel Rodríguez²

¹ Dpto. de Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, U. Rey Juan Carlos, C/ Tulipán, s/n. 28933 Móstoles. Madrid. Spain
{oscardavid.robles, joseluis.bosque, luis.pastor}@urjc.es

² Dpto. de Tecnología Fotónica, U. Politécnica de Madrid, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Madrid, Spain

[email protected]

Abstract. From the computational point of view, Content-based Image Retrieval systems are potentially expensive, and their user response times grow with the ever-increasing sizes of the databases associated with them. This paper presents a grid implementation of a Content-based Image Retrieval system that offers a good cost/performance ratio to solve this problem, thanks to the flexibility, scalability and fault tolerance of grids. This approach allows a dynamic management of specific databases, which can be incorporated into or removed from the CBIR system depending on the desired user query. Experimental performance results are collected in order to show the feasibility of this solution.

1 Introduction

Grid computing is nowadays becoming a feasible solution for computer applications with high computational power demands. This is due to the good price/performance ratio offered by the networks that compose this type of system and to the high flexibility and availability offered by this computation paradigm, which makes cooperation and resource sharing easier among institutions, named "Virtual Organizations" (VO) [1]. One application area where the concept of grid computing fits in a natural way is Content-Based Image Retrieval (CBIR). The main purpose of CBIR systems is to help users perform automatic information retrieval over image databases, considering relevant visual features extracted from the image data in a preprocessing stage [2,3]. The complexity of this task depends heavily on the number of items stored in the system. Image and video databases usually involve large volumes of data, and therefore alternative strategies to the conventional centralized server must be sought to manage the storage and processing of data in an efficient and flexible way.

The tremendous improvements experienced by computers in aspects such as price, processing power and mass storage capabilities have resulted in an explosion of the amount of information available to people. But this same wealth makes finding the "best" information a very hard task. CBIR systems try to solve this problem by offering mechanisms for selecting the data items that most resemble a specific query among all the available information.

R. Meersman, Z. Tari et al. (Eds.): OTM 2006, LNCS 4276, pp. 1412–1421, 2006. © Springer-Verlag Berlin Heidelberg 2006

From a computational point of view, CBIR systems are potentially expensive and their user response times grow with the ever-increasing sizes of the databases associated with them. One of the most common approaches followed to reach acceptable price/performance ratios has been to exploit the algorithms' inherent parallelism at implementation time [4,5,6]. However, because CBIR systems are relatively new, few references deal with this aspect. Some contributions that can be cited are Zaki's compilation [7] and the work of Srakaew et al. [8] and Bosque et al. [9].

A cluster-based CBIR implementation was first introduced in [9], comparing its performance with a shared-memory solution. The results showed a definite advantage of the cluster in terms of scalability and price/performance ratio. In this paper, we extend our previous work by sharing the available resources of different VOs. We present a grid implementation of a CBIR system that offers a good cost/performance ratio thanks to the flexibility, scalability and fault tolerance of grids. The flexibility of the system presented here allows the dynamic addition or removal of grid nodes between two user queries, achieving reconfigurability, scalability and an appreciable degree of fault tolerance. This approach allows a dynamic management of specific databases, which can be incorporated into or removed from the CBIR system according to the users' queries. Experimental performance results, taking into account several database setups, are collected in order to show the feasibility of this solution.

The rest of this article is organized as follows. Section 2 presents a brief description of the operations involved in a CBIR system, and the grid implementation developed is discussed in Section 3. Section 4 shows some performance results achieved by this implementation and, finally, Section 5 presents some conclusions and ongoing work.

2 CBIR System Operation

Smeulders et al. [3] reveal the importance of CBIR techniques. An example is the management of the huge quantity of multimedia information that anyone can access on the Internet. This information must be accessed efficiently and in a user-friendly way, and CBIR systems can provide proper solutions. In our case, we are dealing with databases of 128x128 colour two-dimensional images that range from four hundred thousand to thirty-two million images.

The search for images contained in a CBIR system can be broken down into the following stages:

1. Input/query image introduction. The user first selects an image to be used as a search reference and the system computes its signature, a vector of features that represents the content of the image. Details about the retrieval techniques involved in the CBIR system can be found in [10,11]. The whole process can be implemented using an O(k) algorithm, where k is the image size, that performs very efficiently [12]. This stage does not require significant computational resources, since the system only deals with one image.

2. Query and DB image signature comparison and sorting. The signature obtained in the previous stage is compared with all the DB images' signatures using a metric based on the Euclidean distance. The identifiers of the p most similar images are extracted. Although each comparison is not very costly, the volume of computations to be performed is very high, since the signature of the input image must be compared with each of the image signatures stored in the system. Whenever a new signature has to be incorporated into the group of the best p, the one with the worst ranking within the group is discarded and the set is sorted again. A bubble sort algorithm of order O(np log(p)) has been used for this purpose, where n is the number of images.

3. Results display and query image update. The system provides the user with a data set containing the p images considered most similar to the query one. If the result does not satisfy the user, he/she can choose one of the selected images, or enter a new one that presents some kind of similarity with the required image, and return to stage 1.

Observing the operations involved, it can be noticed that the comparison and sorting stage involves a much larger computational load than the others. Fortunately, since there are no data dependencies, data parallelism can be exploited just by dividing the workload among n independent nodes. This can be accomplished by distributing the CBIR image signatures across the processing nodes off-line; each node can then compare the query image's signature with every signature it stores. Storage capacity also becomes a problem whenever the size of the CBIR system grows beyond a certain limit. The most efficient approach to solve this point, which additionally relaxes the per-node storage demands, is to distribute signatures, images and computations over an ensemble of n processing and storage nodes.
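The absence of dependencies in the comparison and sorting stage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the round-robin split and the toy two-component signatures are assumptions made for illustration, and `heapq.nsmallest` replaces the bubble-sort-based maintenance of the best-p group described above. The point is only that merging the local best p of every node gives the same answer as a centralized search.

```python
import heapq
import math

def euclidean(a, b):
    """Euclidean distance between two equally sized signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_p(query, signatures, p):
    """Return the p (distance, image_id) pairs closest to the query.

    `signatures` maps image identifiers to feature vectors.
    """
    scored = ((euclidean(query, sig), image_id)
              for image_id, sig in signatures.items())
    return heapq.nsmallest(p, scored)

def split(signatures, n_nodes):
    """Off-line distribution: deal the signatures round-robin to n_nodes."""
    parts = [{} for _ in range(n_nodes)]
    for i, (image_id, sig) in enumerate(sorted(signatures.items())):
        parts[i % n_nodes][image_id] = sig
    return parts

def grid_best_p(query, partitions, p):
    """Each node computes its local best p; the partial lists are merged."""
    partial = [best_p(query, part, p) for part in partitions]
    return heapq.nsmallest(p, (pair for lst in partial for pair in lst))

# Toy database of 2-D signatures (hypothetical data for illustration only).
db = {f"img{i}": (float(i), float(i % 3)) for i in range(20)}
query = (4.2, 1.0)
central = best_p(query, db, 3)
distributed = grid_best_p(query, split(db, 4), 3)
```

Because every global best-p image is necessarily among the best p of the node that stores it, `central` and `distributed` hold the same list, which is why the nodes need no communication until the final merge.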

3 Grid Implementation

3.1 Software Architecture

The grid implementation of a CBIR system corresponds to a distributed architecture of grid nodes with different complexity levels, ranging from standalone PCs or workstations to parallel systems such as shared-memory multiprocessors or clusters. Therefore, each grid node may run sequential or parallel versions of the software components of the whole application, resulting in a very heterogeneous system.

The CBIR grid application can be decomposed into the following modules: user interface, CBIR grid management and local search per grid node. All these modules have been programmed using the Globus Toolkit version 4 [13]. The next sections describe each of these elements.


3.2 User Interface

From a functional point of view, the main features of a grid system are flexibility and versatility. The software application must let the user specify the whole environment on every single system execution. This makes it possible to update the working configuration, adding or deleting specific nodes or databases depending on the type and contents of the queries at a given moment. To this end, the system presents the user with a simple interface where he/she can define parameters with two clearly different orientations:

– Grid features.
– CBIR features.

Both groups of features are described next in some detail.

Grid Features. The grid parameters cover aspects such as:

– Number of grid nodes where the query will be run.
– Selection of specific nodes where the execution will be performed.
– Computational power of each one of the system nodes. This is an optional parameter. If the user has some a priori knowledge about the performance of the processing nodes that belong to the grid, these data can be provided to the application. In any case, the application collects some statistics about each node's performance in order to make a better assignment of the workload among all the available grid resources in the future.

CBIR Features. Among the programmable CBIR parameters, we can mention:

– Name of the process that will perform first the search and then the sorting of the signatures in each node. This is an optional parameter; if it is not specified by the user, a default option is chosen.

– Search criterion: available criteria include colour, shape and a combination of both features. For each criterion, several possibilities exist: energies, colour histograms, multiresolution colour primitives, Hu and Zernike invariants, etc. [11,14].

– Metric used in the similarity computation stage to compare the computed input image signature with the precomputed signatures stored in the database.

– File name of the query image.

– Name of the selected database where the query is performed. This is an optional parameter; if this entry is left blank, the default value is selected.
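The two groups of parameters could be captured in a small configuration structure such as the Python sketch below. Every class, field and default name here is hypothetical, since the paper does not show the interface's data model; the sketch only illustrates how the optional entries fall back to default values when left blank.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GridFeatures:
    nodes: list[str]                      # specific nodes selected for the query
    node_power: dict[str, float] = field(default_factory=dict)  # optional hints

    @property
    def node_count(self) -> int:
        """Number of grid nodes where the query will be run."""
        return len(self.nodes)

@dataclass
class CbirFeatures:
    query_image: str                      # file name of the query image
    criterion: str = "colour"             # assumed default; colour, shape or both
    metric: str = "euclidean"             # assumed default metric name
    search_process: Optional[str] = None  # None means "use the default process"
    database: Optional[str] = None        # None means "use the default database"

# Hypothetical default names, not taken from the paper.
DEFAULT_PROCESS = "local_retrieval"
DEFAULT_DATABASE = "default_db"

def resolve(cbir: CbirFeatures) -> CbirFeatures:
    """Fill the optional entries left blank by the user with defaults."""
    return CbirFeatures(
        query_image=cbir.query_image,
        criterion=cbir.criterion,
        metric=cbir.metric,
        search_process=cbir.search_process or DEFAULT_PROCESS,
        database=cbir.database or DEFAULT_DATABASE,
    )
```

For instance, `resolve(CbirFeatures("query.png"))` keeps the query image and fills in the default process and database, while an explicitly chosen database is left untouched.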

3.3 CBIR_GRID Process

This process is in charge of collecting all the parameter specifications defined by the users and starting the execution in the distributed environment. It is also in charge of maintaining the fault tolerance of the whole system: if a grid node cannot finish the search in its own local database, this fact is reported to the user, but the rest of the answers are not lost. This process can be decomposed into the following stages:


– Read the system information provided by the users. At this point, the setup information about the query performed by the users is received, and all the local data structures involved in the search process are initialized, as well as those sent to the remote nodes.

– Set up the testbed or grid system. At this point, the number of nodes that compose the testbed is fixed, then their availability is checked and, finally, the permissions for executing remote jobs on them are verified.

– Select the remote jobs that have to be executed in each node of the grid.

– Compute the signature of the input image considering its energies or histograms, as described in [10,11,14].

– Send the previously computed input image signature to all the nodes that compose the grid. The GridFTP service provided by the Globus Toolkit version 4.0 is used for this purpose, specifically the secure command globus-url-copy.

– Once the signature is distributed, launch the execution of the jobs in each one of the grid nodes, each searching over its own local database. A script that receives as input the list of nodes that compose the grid, with the class of each node, has been programmed to perform this task. The remote execution of the jobs is based on the globus-job-submit command available in the Globus Toolkit version 4.0. This command is used instead of globus-job-run because it provides a non-blocking service, whereas globus-job-run performs a blocking execution of the submitted job. globus-job-submit returns an identifier for each job, so the job state can be checked with the command globus-job-status, passing the obtained job identifier as an argument. Monoprocessor nodes receive a sequential job, while cluster nodes receive a distributed job; in the latter case, the search is performed over a distributed database.

– Then, the CBIR_GRID process runs a loop that checks the state of each one of the launched jobs. The polling interval is also an application parameter, with a default value of two seconds. When a job ends, the process collects, through the GridFTP service, the partial results generated by the corresponding grid node. In this way, a set of files with partial results is generated, one for each node of the grid.

– The last step in this process is to gather all the partial results and select the best N. Then, the process fetches the N images from the corresponding local nodes and presents them to the user in a sorted mosaic view inside a user window.
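The submit, poll and gather stages listed above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the authors' script: `submit`, `status` and `fetch` are hypothetical callables that a caller would implement on top of globus-job-submit, globus-job-status and GridFTP transfers, and the state strings are placeholders rather than a guaranteed command output.

```python
import time

def run_query(nodes, submit, status, fetch, interval=2.0, sleep=time.sleep):
    """Submit one job per node, poll until all finish, collect partial results.

    submit(node) -> job_id   non-blocking submission (hypothetical wrapper)
    status(job_id) -> str    "ACTIVE", "DONE" or "FAILED" (placeholder states)
    fetch(node) -> list      (distance, image_id) pairs produced by that node
    """
    jobs = {node: submit(node) for node in nodes}
    results, failed = {}, []
    while jobs:
        for node, job_id in list(jobs.items()):
            state = status(job_id)
            if state == "DONE":
                results[node] = fetch(node)   # one partial result set per node
                del jobs[node]
            elif state == "FAILED":
                failed.append(node)           # reported; other answers are kept
                del jobs[node]
        if jobs:
            sleep(interval)                   # default polling interval: 2 s
    return results, failed

def best_n(results, n):
    """Merge the per-node partial results and keep the n closest images."""
    merged = [pair for partial in results.values() for pair in partial]
    return sorted(merged)[:n]
```

Passing the wrappers in as arguments keeps the loop independent of the grid middleware, so it can be exercised with stand-in functions before being connected to real job submission.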

3.4 Local_Retrieval Process

This process is in charge of performing the local search in each one of the grid nodes. It has two main functions:

– The comparison of the input signature with all the signatures stored in the database, considering the search criteria and distance metric specified by the user. These comparisons produce a set of similarity values that are stored in an output file.

– Once the results of all the comparisons are available, the output file is sorted according to the similarity value.

4 Experimental Results

4.1 Experimental Setup

A set of experiments has been run to test the behaviour of the CBIR grid implementation presented here. The tests have several objectives:

– to verify the feasibility of applying a distributed solution based on a grid,
– to estimate the overhead introduced by the Globus software,
– to analyze the grid response in order to optimize the distribution of the CBIR data among the grid nodes and achieve better performance.

The testbed used in the experiments presented next is composed of the aggregated resources of two VOs: the Department of Arquitectura de Computadores (DAC) of the Rey Juan Carlos University (URJC) and the Department of Arquitectura y Tecnología de Sistemas Informáticos (DATSI) of the Universidad Politécnica de Madrid (UPM). Each one of these VOs contributes the following resources (items not mentioned for a resource have the same setup as the previous one):

– DAC-URJC:

  Africa: CPU Intel(R) Pentium(R) IV, 2.80 GHz
  • cache size L1: 512 KB
  • main memory: 1 GB DDR RAM
  • hard disk: ST380011A PCI, 20 GB
  • operating system: Linux version 2.6.8-2-686, Debian 1:3.3.5-12
  • network card bandwidth: 1 Gbps

  Artico: CPU Pentium III (Katmai) at 450 MHz
  • cache size: 512 KB
  • main memory: 128 MB DDR RAM
  • hard disk: Maxtor 91080D5 PCI, 10 GB

– DATSI-UPM:

  Baobab: One biprocessor node with the following features:
  • 2 CPUs, Intel(R) Xeon(TM) CPU 2.40 GHz
  • cache size: 512 KB (each CPU)
  • main memory: DDR RAM 1 GB
  • hard disk:
    ∗ local: 4.6 GB
    ∗ NFS: NAS Intel Pentium 4 CPU 2.8 GHz. RAID 0 over 4 disks with 160 GB.
  • message passing libraries: LAM/MPI 7.1.1
  • Linux kernel version 2.6.13 (Debian): shared by all the nodes
  • internal network with 2 network interfaces per node with a bandwidth of 1 Gbps per interface

  Brea: One biprocessor node with the following features:
  • 2 CPUs Intel(R) Xeon(TM) CPU 3.00 GHz
  • cache size: 1024 KB (each CPU)
  • hard disk:
    ∗ local: 9.2 GB
    ∗ NFS: Intel(R) Pentium(R) 4 CPU 2.80 GHz. NAS. RAID 0 over 4 disks with 160 GB.

[Figure: two LANs, DAC-URJC (Africa, Artico) and DATSI-UPM (Baobab, Brea), connected through a WAN]

Fig. 1. Scheme of the grid used in the experiments

Figure 1 shows the grid described above. Dashed lines group the computational resources available in each VO.

4.2 Performance Results

Table 1 collects the user response time per grid node for different sizes of the local database stored in each grid node. The values are measured in seconds. In each row of the table, every node of the grid stores the same number of images, and the value gives the total time spent by that node to give an answer to the user. The wide range of values obtained is due to the heterogeneous nature of the available processing nodes.

Table 1. User response time (in seconds) of grid nodes for a query considering different local database sizes

Database size
per grid node     Africa     Artico     Baobab     Brea
       100000       1.63       6.81       0.98     0.87
      1000000      17.30      84.36      11.58     9.96
      2000000      33.89     173.94      23.05    20.55
      4000000      68.45     390.63      56.73    44.39
      8000000     136.75     816.54     125.50    88.48

Table 2. User response time for a query considering different database sizes with fair distribution of the images over the available grid nodes (times in seconds)

 DB size       Grid     Globus overhead     Efficiency
  400000      29.25               22.44           0.23
 4000000     106.00               21.64           0.80
 8000000     202.89               28.95           0.86
16000000     420.21               29.58           0.93
32000000     846.16               29.62           0.96

Table 2 shows the user response time of the grid (in seconds) for different database sizes. This table also includes the overhead introduced by the grid implementation, which remains almost constant in all cases. Finally, the efficiency reached by the system is also included, computed as:

$$E_n = \frac{S_n}{n} \tag{1}$$

where

$$S_n = \frac{T_1}{T_n} \tag{2}$$

and $T_1$ is the execution time of the algorithm running on the slowest node of the grid and $T_n$ is the execution time of the algorithm carried out over $n$ nodes of the grid. As can be noticed, the efficiency of the grid increases as the size of the database grows. This is explained by the almost constant overhead introduced by the grid implementation and by the lack of data dependencies in the most demanding stage of the CBIR application, which allows a full parallelization of the signature comparison and sorting. For small database sizes (400000 images), efficiency values are quite low in comparison with those achieved for larger databases. The reason for this behaviour is that the query response times of the nodes are smaller than the time overhead introduced by the grid implementation, and therefore the efficiency drops.
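As a worked check of equations (1) and (2), the short Python sketch below recomputes the efficiency column of Table 2 from the measured times. The paper does not state how $T_1$ was obtained for the full database, so the sketch makes an explicit assumption: $T_1$ is taken as four times the time Artico (the slowest node) needs for its quarter of the images in Table 1, that is, linear scaling with database size. Under this reading the computed values round to the efficiencies reported in Table 2.

```python
NODES = 4  # Africa, Artico, Baobab and Brea

# Response time of Artico, the slowest node (Table 1), in seconds,
# keyed by the number of images stored per node.
artico = {100_000: 6.81, 1_000_000: 84.36, 2_000_000: 173.94,
          4_000_000: 390.63, 8_000_000: 816.54}

# Grid response time (Table 2), in seconds, keyed by total database size.
grid = {400_000: 29.25, 4_000_000: 106.00, 8_000_000: 202.89,
        16_000_000: 420.21, 32_000_000: 846.16}

def efficiency(total_size):
    # Assumption: T_1 for the whole database is NODES times Artico's time
    # on its 1/NODES share (linear scaling with database size).
    t1 = NODES * artico[total_size // NODES]
    speedup = t1 / grid[total_size]        # S_n = T_1 / T_n
    return speedup / NODES                 # E_n = S_n / n

table = {size: round(efficiency(size), 2) for size in grid}
```

For the smallest database, for example, $T_1 = 4 \cdot 6.81 = 27.24$ s, so $S_4 = 27.24 / 29.25 \approx 0.93$ and $E_4 \approx 0.23$: the grid as a whole is slower than the slowest node alone because the overhead of about 22 s dominates.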

The grid overhead is due to several reasons, such as:

– Management of the auxiliary temporary files needed by the implemented algorithm.
– Globus overhead, which includes:
  • Security control mechanisms.
  • System state management.
– Network traffic.

The grid presented here allows a dynamic management of the CBIR system, since Globus provides a set of service implementations focused on infrastructure management. Specifically, GRAM (Grid Resource Allocation Management) supports the control of the resources available in the grid at a given moment.

The values in Table 1 suggest redistributing the workload among the grid nodes according to the response times achieved in the experiments, instead of using the same image database size for all the grid nodes. Considering a static assignment of the workload, the redistribution should be made as a function of the response time of each node relative to the slowest node of the grid:

$$DBs_i = \frac{T_1 \cdot DBs_1}{T_i} \tag{3}$$

where $T_1$ is the execution time of the algorithm running on the slowest node of the grid, $T_i$ is the execution time of the algorithm on node $i$ of the grid, $DBs_1$ is the reference database size in the slowest grid node and $DBs_i$ is the database size assigned to node $i$.
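Equation (3) can be applied directly to the measured times. The Python sketch below uses the row of Table 1 with 1000000 images per node as reference; the function name and the rounding to whole images are illustrative choices, not part of the paper.

```python
def redistribute(times, reference_size):
    """Apply DBs_i = T_1 * DBs_1 / T_i, with T_1 the slowest node's time."""
    t_slowest = max(times.values())
    return {node: round(t_slowest * reference_size / t)
            for node, t in times.items()}

# Response times in seconds from the 1000000-images-per-node row of Table 1.
times = {"Africa": 17.30, "Artico": 84.36, "Baobab": 11.58, "Brea": 9.96}
sizes = redistribute(times, reference_size=1_000_000)
```

With these figures Artico keeps its one million images, while Africa, Baobab and Brea would receive roughly 4.9, 7.3 and 8.5 million images respectively. If response time scales linearly with the local database size, which Table 1 only approximately supports, all four nodes would then finish in about the same time as Artico.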

5 Conclusions and Future Work

This paper focuses on analysing the feasibility of applying a grid solution to CBIR systems. In this work we have measured the efficiency reached by the grid in a real environment with very heterogeneous nodes. The implementation has been very satisfactory, achieving small user response times. This is due to the lack of data or algorithmic dependencies and to the small communication overhead. Thanks to the heterogeneity of the system, the communication overhead overlaps with the execution time in other grid nodes. These features also result in very good performance figures for the largest databases, where efficiency values higher than 90% have been achieved and the efficiency curve tends towards one. The experiments presented here show that the amount of overhead introduced by this implementation is almost constant, so the system is scalable with respect to the database size. In fact, this overhead is hidden by the improvements achieved by taking grid heterogeneity into account. To conclude, definite advantages of the grid implementation are its good price/performance ratio and system scalability.

Finally, further work will be devoted to analysing the response of the system after distributing the database of the CBIR system over different Virtual Organizations that include more complex grid nodes, such as shared-memory multiprocessors or clusters of PCs or workstations. We also plan to incorporate load balancing mechanisms to dynamically redistribute the workload corresponding to the sorting stage and increase the global performance of the grid.


Acknowledgments

This work has been partially funded by the Spanish Ministry of Education and Science (grant TIC2003-08933-C02) and the Government of the Community of Madrid (grants GR/SAL/0940/2004 and S-0505/DPI/0235).

References

1. Berman, F., Fox, G., Hey, A.J., eds.: Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons (2003) ISBN 0-470-85319-0.

2. del Bimbo, A.: Visual Information Retrieval. Morgan Kaufmann Publishers, San Francisco, California (1999) ISBN 1-55860-624-6.

3. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on PAMI 22(12) (2000) 1349–1380

4. Pitas, I., ed.: Parallel Algorithms for Digital Image Processing, Computer Vision and Neural Networks. John Wiley & Sons (1993)

5. Maresca, M., et al.: Special issue on parallel architecture for image processing. Proceedings of the IEEE 84(7) (1996) 913–1056

6. Muntz, R.R., Golubchik, L.: Parallel data servers and applications. Parallel Computing 24 (1998) 1–4

7. Zaki, M.J.: Parallel and distributed association mining: A survey. IEEE Concurrency 7(4) (1999) 14–25

8. Srakaew, S., Alexandridis, N., Nga, P.P., Blankenship, G.: Content-based multimedia data retrieval on cluster system environment. In Sloot, P., Bubak, M., Hoekstra, A., Hertzberger, B., eds.: High-Performance Computing and Networking. 7th International Conference, HPCN Europe 1999, Springer Verlag (1999) 1235–1241

9. Bosque, J.L., Robles, O.D., Rodríguez, A., Pastor, L.: Study of a parallel CBIR implementation using MPI. In Cantoni, V., Guerra, C., eds.: Proceedings on International Workshop on Computer Architectures for Machine Perception, IEEE CAMP 2000, Padova, Italy (2000) 195–204 ISBN 0-7695-0740-9.

10. Rodríguez, A., Robles, O.D., Pastor, L.: New features for Content-Based Image Retrieval using wavelets. In Muge, F., Pinto, R.C., Piedade, M., eds.: V Ibero-american Simposium on Pattern Recognition, SIARP 2000, Lisbon, Portugal (2000) 517–528 ISBN 972-97711-1-1.

11. Robles, O.D., Rodríguez, A., Córdoba, M.L.: A study about multiresolution primitives for content-based image retrieval using wavelets. In Hamza, M.H., ed.: IASTED International Conference On Visualization, Imaging, and Image Processing (VIIP 2001), Marbella, Spain, IASTED, ACTA Press (2001) 506–511 ISBN 0-88986-309-1.

12. Stollnitz, E.J., DeRose, T.D., Salesin, D.H.: Wavelets for Computer Graphics. Morgan Kaufmann Publishers, San Francisco (1996)

13. Foster, I.: Globus Toolkit Version 4: Software for Service-Oriented Systems. In: IFIP International Conference on Network and Parallel Computing. Volume 3779 of Lecture Notes in Computer Science. Springer Verlag (2005) 2–13

14. Robles, O.D., Toharia, P., Rodríguez, A., Pastor, L.: Towards a content-based video retrieval system using wavelet-based signatures. In Hamza, M.H., ed.: 7th IASTED International Conference on Computer Graphics and Imaging - CGIM 2004, Kauai, Hawaii, USA, IASTED, ACTA Press (2004) 344–349 ISBN: 0-88986-418-7, ISSN: 1482-7905.