
Using concepts of content-based image retrieval to implement graphical testing oracles

Marcio Eduardo Delamaro 1,∗,†, Fátima de Lourdes dos Santos Nunes 2,3

and Rafael Alves Paes de Oliveira 1

1 Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade de São Paulo, Brazil
2 Escola de Artes Ciências e Humanidades (EACH), Universidade de São Paulo, Brazil

3 Interactive Technologies Laboratory, Escola Politécnica, Universidade de São Paulo, Brazil

SUMMARY

Automation of testing is an essential requirement to render it viable for software development. Although there are several testing techniques and criteria in many different domains, developing methods to test programs with complex outputs remains an unsolved challenge. This setting includes programs with graphical output, which produce images or interface windows. One possible approach towards automating the testing activity is the use of automatic oracles in which a reference image, taken as correct, can be used to establish a correctness measure in the tested program execution. A method that uses concepts of content-based image retrieval to facilitate oracle automation in the domain of programs with graphics output is presented. Two case studies, one using a computer-aided diagnostic system and one using a Web application, are presented, including some reflections and discussions that demonstrate the feasibility of the proposed approach. Copyright © 2011 John Wiley & Sons, Ltd.

Received 11 July 2009; Revised 9 February 2011; Accepted 16 March 2011

KEY WORDS: software testing; testing oracle; content-based image retrieval

1. INTRODUCTION

Software testing is one of the most expensive and important activities in software development. Several aspects have been considered and research into this area expands into multiple lines, including techniques and criteria development for testing, development of supporting tools, and application in specific areas. In this context, the automation of testing activities plays an essential role, thus helping to increase productivity and decrease development costs.

An underlying question in testing is deciding whether the behavior of program P under test with a given test datum is correct or not. Determining the set of expected outputs for a given input test set is not a trivial task. Consider, for example, program P that computes the value of the constant π with any number of decimal places. Unless the existence of another program Q, correct and running the same functionality, is assumed, it is not always possible to determine whether the behavior of P is the expected one or not.

The mechanism used to decide whether the output or behavior of an execution of program P is correct is known as an ‘oracle’. In a development and testing environment, the oracle can take on various forms and must be based on the specification of the program being tested. If this specification is, for example, an informal textual definition of requirements, the tester will probably play the role of the oracle and decide on the correctness of P execution. If there is a formal model

∗Correspondence to: Marcio Eduardo Delamaro, Av Trabalhador Sancarlense, 400, São Carlos, SP 13560-970, Cx Postal 668, Brazil.

†E-mail: [email protected]

Copyright © 2011 John Wiley & Sons, Ltd.

SOFTWARE TESTING, VERIFICATION AND RELIABILITY
Softw. Test. Verif. Reliab. 2013; 23:171–198
Published online 1 May 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/stvr.463

M. E. DELAMARO, F. L. S. NUNES AND R. A. P. DE OLIVEIRA

for P from which one can extract its behavior, it is possible to automate the oracle by building up a comparator between the output actually produced and the behavior defined in the model. There are also cases in which the oracle does not exist or is only approximately defined.

The automation of an oracle is hardly a trivial task, even if the set of expected outputs is known. It is often hard to decide which features should be considered to compare the expected and current outputs. A particularly subtle point in the automation of an oracle is when the outputs to be compared are in a non-trivial format, as for instance, the screenshot of a graphical interface or another kind of image produced by the program under consideration. In this case, one can choose to construct oracles defining assertions about the output presumed to be correct. It is possible, for example, to create a database that stores conventional characteristics (text or numbers) computed from the expected output (images). Then the actual output of the tested program is analyzed and the same conventional characteristics are computed and compared with the ones stored and used as references.

A technique known as CBIR (Content-Based Image Retrieval) has been used in recent decades as a tool to query image databases, particularly to facilitate the retrieval of images similar to a given model image [1]. The similarity between images is defined by comparing the features usually extracted by means of image processing techniques. Those features, in general, are related to aspects of color, texture, and shape. This type of information retrieval can be used in various fields of knowledge. For instance, it is widely used in the medical imaging area. The comparison is implemented using functions that measure the distance between sets of characteristics of two images, called similarity functions.

The goal of the work herein is to present and evaluate the use of CBIR concepts in the construction of testing oracles, aiming to facilitate the verification of defects in software that produces graphical output, whether in the form of graphical user interfaces (GUIs) or some kind of image processing. For this, the article is organized as follows: Section 2 presents concepts on test oracles and CBIR, particularly regarding its use within the context of oracle automation. Section 3 proposes a software architecture to promote the efficient use of CBIR concepts to automate the creation of graphical oracles. Section 4 presents a framework built using the Java language to generate testing oracles based on the proposed architecture. Two practical framework application examples are presented in Section 5: the first example uses part of a CAD (Computer-Aided Diagnosis) application and the other a Web application. Finally, in Sections 6 and 7 related work and final considerations are given.

2. TESTING ORACLES AND CBIR

According to Baresi and Young’s definition [2], a testing oracle is a method used to verify whether the system under test behaves correctly on a particular execution. The role of an oracle can be played by another program—in the case of an automated oracle—or a human being, based on a specification, who decides whether the behavior obtained is as expected.

This definition does not provide the exact dimension of the problems related to the application of an oracle, whether human or automated. For example, when implementing a program that displays the calendar of a month or a year, knowing its specification is not enough because it says little about the correctness of each of the program executions. In this case a possible solution would be to use a similar pre-existing program—for example on another platform—to serve as a source for answers to the oracle.

Weyuker [3] lays out a very interesting analysis about testing the so-called ‘non-testable’ programs, i.e. those for which there is no oracle at all or there is no oracle that can be applied in practice, due to its high cost, for instance. The researcher shows that in many cases, although it is not possible to exactly know the expected result for a given execution, it is possible to determine the desired characteristics or unwanted characteristics that give an idea about the accuracy or inaccuracy of an execution of the program under test. Weyuker also points out some alternatives that allow testing such programs for which there are no oracles or for which the oracles are difficult to build.


USING CONCEPTS OF CBIR TO IMPLEMENT TESTING ORACLES

One of the examples provided by the author is the problem of calculating the thousandth decimal place of the constant π. In this case, the problem is not just the inexistence of an oracle but also the fact that one cannot state how plausible the answer provided by the program is. One solution, albeit limited, would be to slightly modify the program so that it computes the 10th decimal place. If the result is displayed correctly, the tester might have some assurance that the program also works well for the original problem.

The literature has plenty of such examples, which shows the difficulty of constructing testing oracles, particularly in some domains. In the work herein, the problem addressed is building oracles for outputs that are not unique or exact. In such cases, the result has to be interpreted considering relevant characteristics that allow deciding on its correctness. This is the case of the example in Section 5.1.

In it, an image processing algorithm is tested to check whether it correctly identifies an area of interest, in this case, the region corresponding to the breast in a mammographic image. The image simply ‘represents’ the object of interest (the breast); it is not the object itself. Thus, it is impossible to say with absolute precision which pixels in the image are part of the breast and which are not. When interpreting the results of the program there are results that are clearly incorrect, but it is necessary to acknowledge a certain degree of tolerance for the results that are considered acceptable.

The automation of oracles is a key item to promote the productivity of the testing activity. The existence of a program that decides—even if not with absolute precision—on the correctness of a given execution of the tested program may represent a reduced time period and greater accuracy, since human intervention may be slow as well as imprecise and subject to variations. Some studies related to the construction of oracles, and in particular the automation of oracles for programs with graphical outputs, exist in the literature and are discussed in Section 6. This paper proposes a method that uses the same CBIR concepts for this purpose. To facilitate identification, it will be referred to as the Graphical Oracle method (GrO-method).

The studies on CBIR address mainly the problem of querying, in a database, images which are similar to a given reference image. Traditionally, image database search uses keywords [4, 5], consisting of textual, numeric, or similar attributes. For this, one has to first register the descriptors that define an image and then make queries using these descriptors. For example, an image can be described with words, such as ‘clear’, ‘black’, ‘low contrast’ to represent global features, or ‘circle with radius 3’, ‘right triangle’, ‘irregular edge’ to represent aspects of its structures. However, this traditional approach makes the query harder when it needs to compare, for example, an image provided by the output of a program with another one already known. Finding textual attributes that describe the characteristics of the two images is a complex task, because these attributes are interpreted as single values, not providing the desired flexibility in the comparison between them. For example, ‘low contrast’ is a single value, but it may be desirable for images with different levels of contrast to be classified as such depending on the structures found, the characteristics of acquisition and the application itself. Thus, in cases such as this, it is necessary to establish metrics that provide a degree of flexibility to the comparison process.

CBIR is defined as any technology that helps to organize digital image files based on their visual content [1]. In general, CBIR systems consist of computer programs that aim to find, in a database, those images most similar to a model image, in accordance with one or more provided criteria. The criteria for similarity are obtained by extracting features of the image, such as color, texture, and shape. Automated CBIR systems involve several areas of computer science, mainly Image Processing and Databases. In a simple way and within the context of this paper, a CBIR system is essentially composed of four parts: extractors, similarity functions, indexation algorithms, and sorting algorithms [6], as shown in Figure 1.

Figure 1 shows that a user can specify a query by providing a model image, and a set of similar images is returned as the system answer. To enable this mechanism, each element of the CBIR system has a specific function. In addition, a database can be used to store values of computed features and index structures. The four elements are presented below.


[Figure 1 components: a model image is fed to an extractor; the computed feature values pass through indexation algorithms into indexation structures; a similarity function compares the indexed values, and sorting algorithms order the resulting distances to return the similar images.]

Figure 1. Simple structure of a CBIR system.

Extractors are computational methods that obtain characteristics of images using algorithms that analyze colors, shapes, textures, or other aspects related to the image as a whole or parts of it. For example, the amount of red pixels is a characteristic related to the color appearance that could be used to compare two images. These algorithms are implemented considering image processing techniques.
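
As an illustration (not part of the paper's framework), a color-based extractor of this kind can reduce an image to a single number, here the count of predominantly red pixels; the thresholds used below are arbitrary choices for the sketch:

```java
import java.awt.image.BufferedImage;

// Sketch of a color-appearance extractor: counts pixels whose red channel
// clearly dominates the green and blue channels. The thresholds (128, 32)
// are illustrative values, not prescribed by the method.
public class RedPixelCount {
    public static double computeValue(BufferedImage img) {
        long count = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r > 128 && r > g + 32 && r > b + 32) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

The returned value would become one entry of the image's feature vector.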

In general, these techniques are divided into three levels: low level, which conducts a preprocessing step, such as enhancing characteristics of interest or smoothing noise to prepare the image for the next level; medium level, which identifies structures of interest in the image; and high level, which accounts for linking the identified structures to a knowledge base [7, 8].

The extracted features are usually turned into a value that can later be compared to the value obtained for the same characteristic of another image [9]. Usually, various extractors are combined into a CBIR system, and each one refers to an aspect of the image. In the process of extracting characteristics, the focus of the application and type of image being processed should be considered, since the characteristics of interest can vary and be closely related to the specific purpose of the application. The set of features extracted from an image constitutes its ‘feature vector’, which is used in its indexing and retrieval. The extracted features represent the image during the search time, because it is by using them that a particular image is recovered from the image database.

The set of characteristics is not sufficient to determine the result of the search. Another factor that influences the search results is the choice of measures of similarity between images. For a search by similarity, it is necessary to use a metric to compare the feature vectors, thus defining similarity functions [10]. A similarity function is a mathematical function to compute the distance between two sets of values, in this case, two feature vectors. There are several distance functions defined for comparing feature vectors. Some examples are the Euclidean distance between histograms [11, 12], the Mahalanobis distance between the mean value of the characteristics [13, 14], or the metrics derived from optimization criteria [15]. After this comparison, the function returns a non-negative value that indicates how similar the two vectors are. The lower the value returned, the greater the similarity between the template image and the retrieved image.
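
The simplest of these, the Euclidean distance applied directly to two feature vectors, can be written in a few lines (a generic sketch, independent of any particular CBIR system):

```java
// Euclidean distance between two feature vectors: returns a non-negative
// value, where a smaller result means the vectors (and hence the images
// they describe) are more similar.
public class EuclideanDistance {
    public static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```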

Feature vectors are usually high-dimensional. Therefore their storage and retrieval may require a longer time. To solve this issue, other important elements in a CBIR system are indexation algorithms, the function of which is to provide ways to optimize the organization of the data structures (the indexation structures in Figure 1) to store feature vectors, thus decreasing the response time during the image retrieval process. In general, such algorithms act in two ways: decreasing the dimensionality of the feature vectors or organizing similar images closer together


on disk, in order to retrieve them faster [16, 17]. Another approach is the definition of a metric space, which considers the distance among the elements of a dataset to store and retrieve images [18, 19]. In both cases, indexation structures, i.e. the associated data structures used by the algorithms, play an important role in the CBIR system. In the literature, there are interesting papers addressing ways to minimize the problem of high dimensionality [16, 17, 20–22]. This paper does not present an in-depth discussion about these structures and algorithms because they are not relevant in the GrO-method. It always compares two images and, therefore, no specific access methods are involved in this process.

Finally, the sorting algorithms are methods to classify the data found during comparison, conducted with the similarity function, in order to adequately show the user the similar images; for instance, putting the most similar one in the first place.

Much like a database search, checking the correct execution of a program is simple if the produced output is textual, especially when a ‘model’ that can be used in comparing character-by-character is available. On the other hand, when the decision is made based on the analysis of an image or a GUI presented to the tester, the automation becomes considerably more difficult. A striking example of this fact is the application of mutation testing [23], in which the result of each mutant must be decided by comparing its execution against the execution of the original program, considered a model for this decision. Mutation supporting tools, such as Proteum [24], restrict their analysis to textual output, due to the difficulty of dealing with other kinds of outputs, thus limiting the type of programs that are prone to be tested in these environments.

When a graphical output is the object to be examined as a result of a program execution, an image can be used as a reference for comparison. There are, however, several aspects that need to be considered. The most remarkable one is that a pixel-by-pixel comparison may not produce the correct result, since the reference image may not uniquely represent the expected output, or different results might be considered correct, even if not exactly equal to the reference image.

In the medical field, the studies by Bellotti et al. [25] and Paquerault et al. [26] use this approach, by employing reference images to determine the accuracy of image processing algorithms. A second scenario of interest is the treatment of images produced by a program the interaction of which takes place via a GUI. Also in this area, some studies use reference images for oracle construction, for instance the articles by Ye et al. [27] and Takahashi [28]. These papers are discussed in Section 6.

As shown in Figure 2, a GUI is composed of several components the behavior of which, known by the users, indicates if a given execution is correct or not. For example, a filled area in a progress bar or a marked ‘checkbox’ determines whether the expected output has been achieved or not. Suppose that the expected output is given by an image, as the result of a previous execution of the program under test or another program, such as in regression testing or the application of mutation testing. There are several factors that may cause the graphical representation of two executions with the same result not to look exactly the same. For example, when both executions of a program—one that produces the reference image and one that produces the image to be analyzed—are conducted in different environments, with different look-and-feel. This becomes increasingly common, given the portability achieved through some programming languages or in Web applications, the appearance of which for the final user also depends on the client (Web browser) configurations. Figure 2 shows the results of two runs with the same behavior in different environments. They should be considered ‘equal’ from the standpoint of a test oracle. In the detail, it is noted that comparing pixel-by-pixel, for the checkboxes on the left side of the figure, would produce an incorrect result in this perspective.

The use of CBIR concepts can help in this context as it allows relevant characteristics to be extracted from the images to use them in the comparison process. For example, the area and perimeter of a region of interest identified by an image processing program can be used. They can be compared to the same characteristics of a reference image in which a medical expert manually marked that region. In the case of the GUI, the number of dark pixels in the area of a checkbox indicates the state of this component, irrespective of the look-and-feel used. In both cases, although the result of the comparison is not exactly the same, similar results should indicate


Figure 2. GUI with two different look-and-feel.

that the execution of the program under test is correct if the image it produced is similar to the reference image. The goal of CBIR is not to indicate when two images are equal or different, but to indicate how close they are, a fact that can be positively exploited in the construction of testing oracles.
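
The checkbox example can be made concrete with a sketch (hypothetical code, not taken from the paper's framework) that measures the fraction of dark pixels inside a rectangular region; under two look-and-feels the individual pixels differ, but the ratio stays close whenever the box is in the same state:

```java
import java.awt.image.BufferedImage;

// Sketch of a look-and-feel tolerant feature: the fraction of dark pixels
// inside a rectangle (e.g., the area occupied by a checkbox). The luminance
// threshold is an illustrative parameter.
public class DarkPixelRatio {
    public static double compute(BufferedImage img, int x0, int y0, int w, int h, int threshold) {
        long dark = 0;
        for (int y = y0; y < y0 + h; y++) {
            for (int x = x0; x < x0 + w; x++) {
                int rgb = img.getRGB(x, y);
                int lum = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                if (lum < threshold) {
                    dark++;
                }
            }
        }
        return (double) dark / ((long) w * h);
    }
}
```

An oracle would then compare the ratios computed on the reference and the actual screenshots, accepting values within some tolerance rather than requiring equality.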

To enable the efficient use of CBIR, this paper proposes an architecture that allows the tester to create graphical oracles in a flexible and simple manner, regardless of the domain it is directed towards. As described in Section 3, the work of the tester is to set the extractors, the similarity function and the parameters they need.

3. AN ARCHITECTURE FOR PRODUCING TESTING ORACLES

As described in the previous section, CBIR is not a new technique. It has been widely used and researched. Contributing to its improvement in the area of image processing or retrieval is not the objective of the present work. On the other hand, no one has explored its use in the construction of test oracles. Therefore, the main contribution of this work is to present an alternative method for the construction of test oracles using the well-known concepts of CBIR. This method is named the GrO-method.

The need for this method comes from the fact that there are currently few alternatives for the construction of graphical oracles. As shown in Section 6, the most widespread options are aimed at applications with GUIs and are based on the idea that it is possible, at execution time,


[Figure 3 components: the tester uses the oracle description generator to produce an oracle description; the parser reads the description; plugins, the parser, and the tester application (oracle) all interact with the core.]

Figure 3. Architecture for creating graphical oracles using CBIR.

to discover the structure of the interface components and capture the sequence of events it has processed. This sequence can be recorded and used as a reference for future executions. While this approach is efficient and widely used in various test automation frameworks, it has its limitations in situations in which the internal structure of the interface cannot be accessed or in which visual aspects are the most important. In Section 5.3, after the presentation of two case studies, this traditional technique is confronted with the approach of constructing testing oracles based on the CBIR concepts.

For the GrO-method to be characterized as a real contribution to the field of software testing, it is necessary to show how the concepts of CBIR can be efficiently used to construct graphical oracles, enabling the tester to create automated oracles with little effort. Thus, this section presents an architecture for building testing oracles that incorporates the concepts of CBIR. In the next section, the implementation of a framework that follows this architecture is presented. The architecture was designed so that the following conditions were met:

• Flexibility: The tester is able to create different oracles to apply in different domains. The most complex aspect she/he has to deal with is to define and implement the feature extractors and similarity functions that are intended to be used in the oracle.

• Simplicity: Simple interactions between the architecture components, especially the ones concerning the final user, i.e. the tester.

• Ease of use: Some components are designed to provide a higher level of abstraction to the tester. In particular, the parser and the wizard ensure such a feature.

Figure 3 shows the proposed architecture. In short, the core provides an API (Application Programming Interface) that allows the tester to install plugins and use them to instantiate a specific oracle. The oracle may also be specified at a higher level of abstraction by a textual description, for example, one that is read by the parser that instantiates the oracle. All of this can also be done in an interactive way by a ‘wizard’, which can install plugins and create oracle descriptions. Next, each component is described.

3.1. Plugins

The plugins depicted in Figure 3 are the major contributions from the tester, in terms of code, to create the oracle. They are added to the framework by invoking the core API and can be of two types: extractor or similarity function.

The first type of plugin is an extractor of characteristics, as discussed in Section 2. In the extractor, the tester shall implement the algorithms that will identify and quantify a characteristic


in an image. The second type of plugin that the tester can add to the framework is, in the CBIR structure, a similarity function. This is necessary because different ways of combining the results produced by the extractors can be used. In order to use a similarity function, one or more extractors have to be added to it.

3.2. The core

The core of the architecture allows the tester to install and remove plugins and provides an API on top of which the tester can build oracles.

The installation of plugins allows them to be reused in the construction of different oracles. The core also provides an API that programmatically allows the creation of an oracle. In Figure 3, such interaction is represented by the tester application accessing the core API methods.

The API of the core must be such that when instantiating an oracle, the tester can choose the plugins to be used, pass arguments to individualize their behavior—for example, the region of the figures in which an extractor should be applied—and possibly configure other global parameters to the oracle.

3.3. The parser

To create an oracle entirely through its API, the tester needs to perform a certain sequence of invocations, such as (i) create (instantiate) the extractors and set their properties; (ii) create a similarity function; (iii) add the extractors to the similarity function; (iv) create an oracle; (v) add the similarity function to the oracle; and (vi) establish global parameters for the oracle to use.
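
This sequence can be sketched in code. The classes below are minimal stand-ins with hypothetical names, not the framework's actual API; they serve only to make the order of the six steps concrete:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in classes illustrating the invocation sequence (i)-(vi).
class Extractor {
    final String name;
    final List<String> props = new ArrayList<>();
    Extractor(String name) { this.name = name; }
    void setProperty(String key, String value) { props.add(key + "=" + value); }
}

class SimilarityFunction {
    final List<Extractor> extractors = new ArrayList<>();
    void addExtractor(Extractor e) { extractors.add(e); }
}

class Oracle {
    SimilarityFunction similarity;
    double threshold;
    void setSimilarityFunction(SimilarityFunction f) { similarity = f; }
    void setParameter(double t) { threshold = t; }
}

public class BuildOracle {
    public static Oracle build() {
        Extractor ex = new Extractor("redPixels");         // (i) create an extractor
        ex.setProperty("region", "0,0,100,100");           //     and set its properties
        SimilarityFunction sim = new SimilarityFunction(); // (ii) create a similarity function
        sim.addExtractor(ex);                              // (iii) add the extractor to it
        Oracle oracle = new Oracle();                      // (iv) create the oracle
        oracle.setSimilarityFunction(sim);                 // (v) attach the similarity function
        oracle.setParameter(0.05);                         // (vi) set global parameters
        return oracle;
    }
}
```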

To make this task easier, the tester can set the structure of an oracle in a simple text description, stored, for instance, in a plain file. In such a file, all the features of the oracle can be specified: the desired extractors and their properties, the similarity function, and the global parameters. The parser, invoked through a core API operation, reads such a description and performs the tasks needed to create an oracle, as discussed in the previous section.

3.4. The oracle generator

Also to facilitate the creation of an oracle, the architecture includes an interactive tool that generates the textual description of the oracle. This is a graphical interface that allows the tester to create an oracle description and also to perform other operations without writing a single line of code.

In the interface, the tester can install plugins and create oracles. The tester can select which extractors she/he wants to use and the values of their properties, if any, and can also select the similarity function and global parameters. All these choices can be saved in a description file which may be used to instantiate an oracle, as described before.

4. O-FIm: Oracle for Images

To try out the concepts of CBIR in the construction of graphical oracles, in accordance with the principles suggested in Section 3, a framework named O-FIm (Oracle for Images)‡ was implemented, following the proposed architecture. The implementation is done in Java, hence its API exposes classes and methods in that language. As a result of using this framework, the tester gets a Java program that is able to compare two images (usually stored in files), reporting whether or not they are similar according to the chosen characteristics.

Two interfaces, IExtractor and ISimilarity, were created to allow the tester to implement the plugins. Classes that implement those interfaces must be provided and installed in the framework through calls to the API made available in the core or by using the wizard tool, described below.

‡Available for downloading at http://ccsl.icmc.usp.br/pt-br/content/o-fim-oracle-images.

Copyright © 2011 John Wiley & Sons, Ltd. DOI: 10.1002/stvr

Softw. Test. Verif. Reliab. 2013; 23:171–198


[Figure 4 is a UML diagram. ISimilarity, implemented by MySimilarity, declares: getName(): String; addExtractor(extractor: IExtractor): void; getExtractors(): IExtractor[]; computeValues(image: Image): double[]; computeSimilarity(d1: double[], d2: double[]): double. IExtractor, implemented by MyExtractor, declares: getName(): String; setProperty(name: String, value: Object): void; getProperties(): String; getProperty(name: String): Object; computeValue(image: Image): double. An ISimilarity aggregates one or more (*) IExtractor objects.]

Figure 4. Structure of the classes for extractors and similarity functions.

Figure 5. Example of a textual oracle description.

Figure 4 summarizes these two interfaces and the relationship between them. The creation of an oracle involves the instantiation of an ISimilarity object which, in turn, gathers one or more extractors. These two interfaces are quite simple yet sufficient to allow their use in O-FIm.

The IExtractor interface allows the programmer to instantiate an extractor, to retrieve the name of such an object, to set and retrieve values for parameters it needs to execute—for example, the region of the image to which the extractor should apply—and to compute the value extracted from a given image as a floating-point value.

The ISimilarity interface defines methods to instantiate a similarity function object, to retrieve the name of such an object, and to associate IExtractor objects with it. In addition, it defines methods to compute a feature vector—the set of values, one for each IExtractor object associated with the ISimilarity—and to compute the value of the similarity function based on the feature vector.
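The two interfaces can be sketched in Java as follows. The method signatures are those shown in Figure 4; the use of java.awt.Image for the image type and the ConstantExtractor demonstration class are assumptions for illustration only:

```java
import java.awt.Image;
import java.util.HashMap;
import java.util.Map;

// Plugin contracts, reconstructed from the signatures in Figure 4.
interface IExtractor {
    String getName();
    void setProperty(String name, Object value);
    String getProperties();
    Object getProperty(String name);
    double computeValue(Image image);   // a feature value in [0, 1]
}

interface ISimilarity {
    String getName();
    void addExtractor(IExtractor extractor);
    IExtractor[] getExtractors();
    double[] computeValues(Image image);                 // feature vector: one entry per extractor
    double computeSimilarity(double[] d1, double[] d2);  // distance between two feature vectors
}

// Hypothetical extractor illustrating the contract: it ignores the image
// and always returns the same value.
class ConstantExtractor implements IExtractor {
    private final Map<String, Object> props = new HashMap<>();
    public String getName() { return "constant"; }
    public void setProperty(String name, Object value) { props.put(name, value); }
    public String getProperties() { return props.keySet().toString(); }
    public Object getProperty(String name) { return props.get(name); }
    public double computeValue(Image image) { return 0.5; }
}
```

A real plugin, such as the Area extractor of Section 5.1.2, follows the same shape but computes its value from the pixels of the image it receives.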

The core provides methods to manage plugins, i.e. to install, remove, and search for installed plugins. It also provides methods for the creation of an oracle that computes the distance between two images and decides whether they are similar or different, considering that distance.

The instantiation of an oracle can be done by creating a similarity function, associating it with the oracle, and assigning a value to the threshold parameter, which indicates the maximum distance accepted for two images to be considered identical. It can also be done by providing the core with a textual description of the oracle. The core invokes the parser and then instantiates the oracle.

The parser accepts a particular language, designed to be very simple and intuitive. Figure 5 shows an example of an oracle description. An oracle is created using the extractors MyExtractor and OurExtractor. The first has a property called 'color', to be defined with the (string) value 'red', and a property 'alpha', the value of which is an integer, 78. The second has a property 'scale', the value of which will be initialized with the double value 1.33. Both have a property called 'rectangle', for which a vector of integers is used.
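Figure 5 is not reproduced here, and the concrete O-FIm syntax may differ, but based on the description above such a file might plausibly look like the following sketch (the rectangle coordinates and the similarity and precision entries are invented for illustration):

```
extractor MyExtractor
    color = "red"
    alpha = 78
    rectangle = [0, 0, 256, 256]
extractor OurExtractor
    scale = 1.33
    rectangle = [0, 0, 256, 256]
similarity Euclidean
precision = 0.05
```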

Also to facilitate the creation of an oracle, the framework O-FIm has a wizard tool that generates the textual description of the oracle. This is a graphical interface that allows the tester to interactively create a description like the one presented in Figure 5. Figure 6 presents the main interface of this tool. In the interface, the tester can install, search for, and remove plugins. She/he can also select which extractors she/he wants to use and the values of their properties, if any, and the similarity


Figure 6. Main interface of the oracle generator.

function and threshold. The image that appears on the left side allows selecting the region to which an extractor should be applied. This causes the property 'rectangle' to be set according to the selected region. On the right, the user sees the textual oracle description.

With the O-FIm framework, the tester has all the tools needed to build testing oracles for different types of programs and domains with graphical outputs. For that, one must define, through the plugins, which features and which similarity function should be used in the comparison of the images produced by the program under test. The plugins can be combined in a textual description which can then be easily processed into an oracle. In the next section, two complete examples that use this framework are presented and discussed.

5. CASE STUDIES

To demonstrate the applicability of the CBIR concepts to build testing oracles, making use of the software structure presented in the previous section, two case studies are presented.

The first is developed in the medical imaging area, specifically focusing on a CAD system. The second focuses on a Web application for which the goal is to verify correctness concerning visual aspects. The next two subsections present these studies and Section 5.3 presents a discussion of their results.

5.1. A CAD system

CAD systems are computer systems, often coupled to specialized medical equipment, aimed at assisting in decision making for a diagnosis. Applications with different purposes have been developed by several research groups to assist in the formulation of diagnoses as a way to help early detection of diseases.

Giger [29] defines a CAD system as one in which the radiologist uses the results of a computerized analysis of medical images as a 'second opinion' in the detection of lesions and in the elaboration of a diagnosis. The importance of such schemes is emphasized by Chan et al. [30], Doi et al. [31], and Giger and MacMahon [32], who present rates of diagnostic errors in tracking programs and show that the use of CAD schemes can improve the performance of radiologists


in medical diagnosis. In general, CAD systems provide opinions based on information extracted from medical images.

The first case study focuses on the area of mammographic images, which constitute the main source of data for CAD systems for early detection of breast cancer; specifically, it targets a CAD procedure that segments the mammary structure in the breast image, highlighting it against the background of the image [33]. Figure 7 shows an example of an input image and the image resulting from this procedure.

To establish whether the segmentation of the breast has been implemented correctly in the CAD system, the tester should run the program with a set of input images and decide, by observing the resulting images, preferably with the aid of a professional radiologist, if the results are those expected. This visual verification procedure can be quite laborious and expensive, especially considering that a physician is not always available during the software development. Moreover, this procedure may have to be repeated several times, for example, at every change performed in the implementation.

To minimize this problem, in this case study the construction of an automated oracle is proposed. It is based on a set of images that are 'marked' by a human and then used as a reference in the oracle. Marking an image, using a touch screen computer and a specially developed program, means that the tester can view the existing mammographic images and, with a pen, indicate the edges of the breast. The program removes the background of the image, turning the region identified as the breast white and the other regions black. Then it saves the modified image in a new file. Figure 8 displays the user interface of this program and an example of what a resulting image looks like after being marked by the user. Thus, the process of deciding on the correctness of the implementation can be performed automatically and as often as necessary.

There is also the problem of choosing a threshold for deciding when an execution should be considered correct or not. This value is given by the 'precision' parameter in the construction of the oracle, as mentioned in the previous section. It indicates that if the distance computed between the reference image and the image produced by the implementation exceeds the threshold, then the images should be considered distinct and, therefore, the implementation incorrect. Otherwise, the execution should be accepted as correct.

Several approaches can be used to compute an adequate threshold. It is important to note that, a priori, there is no general solution for this, and for each situation there may be different ways to approach the problem. Moreover, the choice of a threshold should go through an adjustment process until the appropriate value can be found.

In this case study, the threshold is computed taking into account, as an initial parameter, the distances between the images of two reference sets, both manually marked. Each image in the input test set, from now on called I = {ImI_1, ImI_2, ..., ImI_n}, was marked twice, by two

Figure 7. Examples of images in a CAD system: (a) original image and (b) image processed by the CAD.


Figure 8. Example of an image manually marked: (a) software for marking the images and (b) marked image.

different people, thus obtaining two reference image sets, R1 = {ImR1_1, ImR1_2, ..., ImR1_n} and R2 = {ImR2_1, ImR2_2, ..., ImR2_n}. Applying the same extractors and the same similarity function used in the construction of the oracle to each pair of images ⟨ImR1_i, ImR2_i⟩, the distance Di (Equation (1)) is obtained and can be used to calculate the threshold. The result of executing the program under test against the input set I is denoted as O = {ImO_1, ImO_2, ..., ImO_n}:

    Di = dist(ImR1_i, ImR2_i).  (1)

Then, the threshold can be computed as in Equation (2):

    thr = α × max(Di), i = 1, 2, ..., n  (2)

where α is a constant that allows the tester to adjust the threshold. In this case study, an initial conjecture for the threshold was necessary; the distance between the reference images was thus used. Raising α serves as a factor to collect data for various values of the threshold. It is worth noting that establishing the perfect threshold, or a method to compute it, is not the goal. The point here is to use different threshold values and analyze the behavior of the oracles. Rigorously, the thresholds used are merely guesses. For α = 1.0 and α = 2.0, the behaviors obtained for each image are shown in Tables II and III. These threshold values were used in two oracles with distinct behaviors:

Oracle 1: If the image resulting from the execution of the program under test has a distance above the threshold for both reference images, then the execution is considered incorrect, i.e.

    if dist(ImR1_i, ImO_i) > thr and dist(ImR2_i, ImO_i) > thr
    then the execution with ImI_i failed.

Oracle 2: If the image resulting from the execution of the program under test has a distance above the threshold for any of the two reference images, then the execution is considered incorrect, i.e.

    if dist(ImR1_i, ImO_i) > thr or dist(ImR2_i, ImO_i) > thr
    then the execution with ImI_i failed.
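The threshold computation of Equation (2) and the two decision rules can be sketched as follows. This is a simplified standalone version: the class and method names are hypothetical, and the pairwise distances are assumed to have been computed already by the similarity function.

```java
import java.util.Arrays;

// Sketch of the threshold of Equation (2) and the two oracle decision rules.
class OracleRules {

    // thr = alpha * max(D_i), where D_i = dist(ImR1_i, ImR2_i)
    static double threshold(double alpha, double[] referenceDistances) {
        return alpha * Arrays.stream(referenceDistances).max().orElse(0.0);
    }

    // Oracle 1 fails an execution only if its output image is far from
    // BOTH reference images.
    static boolean oracle1Fails(double distR1, double distR2, double thr) {
        return distR1 > thr && distR2 > thr;
    }

    // Oracle 2 fails an execution if its output image is far from
    // EITHER reference image.
    static boolean oracle2Fails(double distR1, double distR2, double thr) {
        return distR1 > thr || distR2 > thr;
    }
}
```

For an image whose output lies at distances 0.0368 and 0.0284 from the two reference images, with threshold 0.0347, Oracle 1 passes while Oracle 2 fails — the situation in which the two oracles can diverge.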

Based on the definitions given in this section, the case study was conducted in order to evaluate the correctness of the mentioned part of a previously developed CAD system. In the next subsections, the following aspects are discussed: the marking of the reference images, the creation of the test oracles, and the results obtained with the use of both oracles under several different parameterizations.


Table I. Results of the marking process.

        Total time   Avg. time   Avg. dist.   Std. dev.   Max. dist.
R1      16 min       32 s        0.0098       0.0081      0.0347
R2      22 min       44 s

5.1.1. Marking of the reference images. Thirty mammographic images, digitized with 12-bit contrast resolution and 0.075 mm pixel size, were selected for this case study. This is the input set I. The selection was random, from a previously existing set of images used as input data in a CAD system developed in a previous study [33]. For the work herein, a component of this system was used: in the mammographic image, the region which actually belongs to the breast is identified using image processing techniques, more specifically image segmentation. From this point on, this program will be referred to as the 'program under test' or PUT.

To create the two sets of reference images, R1 and R2, a specific program that allows the user to draw a line on the border of the breast image using a pen and a touchscreen computer was developed. After marking the perimeter limit of the breast, the background of the image (the area 'external' to the breast) is removed (the pixels are set to zero) because this action is also performed by the PUT. The end result, which is used as the reference image for the oracle, is shown in Figure 8(b).

The 30 images were marked independently by two of the authors of this article. Table I presents some data on the marking of the images and the results obtained. The rows represent each set of marked images, R1 and R2. The first and second columns show the total time and average time per image spent in the marking process. The last three columns show the mean, standard deviation, and maximum value of the distance between each pair of images of R1 and R2. This last value is used for the calculation of thresholds in the construction of the oracles, according to Equation (2), presented earlier.

As explained before, two values of threshold were adopted:

• α = 1.0: the value of the threshold is 0.0347;
• α = 2.0: the value of the threshold is 0.0694.

One point worth mentioning in this process is that, although the identification of the region of interest is simple, it is not always easy to draw the marks with appropriate accuracy. Lighting conditions of the environment, for example, can interfere with the correct interpretation of the image. Furthermore, operating the pen requires some skill and practice so as not to slip on the surface of the sensitive touch screen.

5.1.2. Creation of the oracles. Once the values of the thresholds were defined, the plugins for the extractors and the similarity function were created and installed. Then the textual description of the oracles was created. It is important to note that there is a myriad of extractors and similarity functions that could be used in this case study, very likely providing different results. It is not the purpose here to review or discuss which would be the most suitable; this issue has been discussed in specific articles in the image retrieval and image processing areas. Thus, the results are not expected to be ideal in this case study, as they only mean to show the feasibility of the proposed approach.

Three extractors were used to obtain the characteristics of interest of the images. The idea is that, in order to establish whether the images are similar, i.e. whether they represent a specific vision of the breast under examination, one can verify if: (i) the volumes of the breast (projected as the area in a two-dimensional image) in both images are similar; (ii) the sizes of their edges (their perimeters) are similar; and (iii) the formats of the borders (smoother, more spiked, etc.) are similar. Thus, the following extractors were used:

Area counts the number of pixels within the region identified as the breast. For that, it uses the fact that the background pixels are zero both in the reference image and in the image created by the PUT. Because the value computed by the extractor should be in the interval [0, 1], the number of pixels is divided by the total number of pixels in the image.


Perimeter counts the number of pixels along the edge of the breast, i.e. the number of pixels in the breast region neighboring pixels in the external region (it excludes the pixels at the edge of the image). The value is divided by the total perimeter of the image, given by the sum of the four sides of the image.

Signature computes a number that identifies the format of the contour area of the breast according to its regularity. To do this, it performs the following actions: it locates the center of the breast in the last column of the image, measures the distance from this point to the edge of the breast at preset intervals in degrees, and calculates the standard deviation over the obtained measurements. The standard deviation indicates how irregular a border is; for a perfect circle, for example, this value is zero. The standard deviation is divided by the largest measure obtained.

Just like the extractors, various similarity functions have been defined and discussed in the literature. The distance function in a CBIR system is a way to measure how close the feature vectors of two images are. Thus, the number of points to which the function is applied is the number of extractors used in the oracle.
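As an illustration, the Area extractor described above might be sketched as follows — a simplified standalone version operating on a gray-level pixel matrix, not the actual O-FIm plugin; the class and method names are hypothetical:

```java
// Simplified sketch of the Area extractor: the fraction of non-background
// pixels, normalized to the interval [0, 1].
class AreaExtractor {
    static double area(int[][] pixels) {
        int breast = 0, total = 0;
        for (int[] row : pixels) {
            for (int p : row) {
                if (p != 0) breast++;   // background pixels are zero
                total++;
            }
        }
        return total == 0 ? 0.0 : (double) breast / total;
    }
}
```

The Perimeter extractor follows the same pattern, counting only the non-zero pixels that have at least one zero neighbor.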

This case study used the Euclidean Distance (ED), which is one of the most popular measures. Considering two feature vectors x and y, of size n, whose elements are called xi and yi, respectively, the ED is defined as

    ED(x, y) = √( Σ_{i=1}^{n} (xi − yi)² ).  (3)
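Equation (3) translates directly into code; a minimal sketch (the class name is hypothetical):

```java
// Euclidean distance (Equation (3)) between two feature vectors of equal size.
class Euclidean {
    static double distance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;               // accumulate (xi - yi)^2
        }
        return Math.sqrt(sum);
    }
}
```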

Having built and installed the plugins, the oracle description file shown in Figure 9 was created. Note that both oracles used in the study have the same definition; they are distinguished only in the way they are used in relation to the two sets of reference images. The only extractor that requires some parameterization is the Signature extractor: the angle between two measures from the center of the breast to the edge of the region of interest has to be set in the oracle description.

The creation and use of oracles, with the support of O-FIm, becomes an extremely simple task. Figure 10 depicts two small pieces of code that show how to create and use the previously proposed oracles. Variables ImR1 and ImR2 contain a reference image from each of the reference sets R1 and R2, respectively. The variable ImO contains the processed image, produced by the PUT. The file /home/guest/Oracles.txt is supposed to contain the description shown in Figure 9.

5.1.3. Results. The oracles of Figure 10 were applied to the 30 images processed by the PUT, using the two sets of marked images as reference and two different values of threshold. The results obtained are described in Tables II and III.

It can be observed that the final results obtained by each of the oracles are quite similar. With the threshold of 0.0347, only image 24 received different verdicts from the two oracles. With the threshold of 0.0694, all images showed the same result for both oracles. This is justified because there is little difference between the two sets of manually marked images and, therefore, the comparison of either of the two reference images against the image processed by the PUT is very homogeneous, i.e. both have high distances or both have low distances.

The fact that there is little difference between the two sets of manually marked images, and the subsequent choice of a low threshold, also explains why, using the value of α = 1, only slightly more than half the images were accepted by the oracles. This low threshold value means that any difference between the processed image and the reference image is enough for the oracle to

Figure 9. Description of the oracles for the case study.


Figure 10. Code for the creation and use of the oracles.

Table II. Results returned by the oracles – threshold = 0.0347.

Image    R1      R2      Orac 1   Orac 2
01       0.0560  0.0606  Fail     Fail
02       0.0082  0.0117  Pass     Pass
03       0.0071  0.0082  Pass     Pass
04       0.0585  0.0623  Fail     Fail
05       0.0429  0.0364  Fail     Fail
06       0.0039  0.0063  Pass     Pass
07       0.0077  0.0077  Pass     Pass
08       0.0060  0.0114  Pass     Pass
09       0.0506  0.0482  Fail     Fail
10       0.0266  0.0101  Pass     Pass
11       0.0374  0.0458  Fail     Fail
12       0.1699  0.1486  Fail     Fail
13       0.0059  0.0099  Pass     Pass
14       0.0205  0.0132  Pass     Pass
15       0.0430  0.0402  Fail     Fail
16       0.0614  0.0631  Fail     Fail
17       0.0172  0.0243  Pass     Pass
18       0.0270  0.0329  Pass     Pass
19       0.0319  0.0283  Pass     Pass
20       0.0351  0.0376  Fail     Fail
21       0.0686  0.0667  Fail     Fail
22       0.0098  0.0155  Pass     Pass
23       0.0303  0.0159  Pass     Pass
24       0.0368  0.0284  Pass     Fail
25       0.0104  0.0074  Pass     Pass
26       0.0181  0.0168  Pass     Pass
27       0.0554  0.0537  Fail     Fail
28       0.0246  0.0344  Pass     Pass
29       0.1047  0.0957  Fail     Fail
30       0.0826  0.0818  Fail     Fail
Avg      0.0386  0.0374  17       16
Std dev  0.0351  0.0320


Table III. Results returned by the oracles – threshold = 0.0694.

Image    R1      R2      Orac 1   Orac 2
01       0.0560  0.0606  Pass     Pass
02       0.0082  0.0117  Pass     Pass
03       0.0071  0.0082  Pass     Pass
04       0.0585  0.0623  Pass     Pass
05       0.0429  0.0364  Pass     Pass
06       0.0039  0.0063  Pass     Pass
07       0.0077  0.0077  Pass     Pass
08       0.0060  0.0114  Pass     Pass
09       0.0506  0.0482  Pass     Pass
10       0.0266  0.0101  Pass     Pass
11       0.0374  0.0458  Pass     Pass
12       0.1699  0.1486  Fail     Fail
13       0.0059  0.0099  Pass     Pass
14       0.0205  0.0132  Pass     Pass
15       0.0430  0.0402  Pass     Pass
16       0.0614  0.0631  Pass     Pass
17       0.0172  0.0243  Pass     Pass
18       0.0270  0.0329  Pass     Pass
19       0.0319  0.0283  Pass     Pass
20       0.0351  0.0376  Pass     Pass
21       0.0686  0.0667  Pass     Pass
22       0.0098  0.0155  Pass     Pass
23       0.0303  0.0159  Pass     Pass
24       0.0368  0.0284  Pass     Pass
25       0.0104  0.0074  Pass     Pass
26       0.0181  0.0168  Pass     Pass
27       0.0554  0.0537  Pass     Pass
28       0.0246  0.0344  Pass     Pass
29       0.1047  0.0957  Fail     Fail
30       0.0826  0.0818  Fail     Fail
Avg      0.0386  0.0374  27       27
Std dev  0.0351  0.0320

consider them dissimilar. With the higher threshold (α = 2), only three images were considered different.

This information should be used by the tester to continue with the testing activity and possibly the correction of the software under test, or an adjustment in the oracles or even in the images used in the test. It is necessary, before anything else, to check why the oracles rejected the results presented.

Although it is not the purpose of this article to make a detailed analysis of the results obtained specifically for this case study, some reasons that may lead the oracles to point out failures in the PUT are:

• the program under test has a fault and the results it produced for those test cases are in fact incorrect;

• the value of the threshold used in the experiment is inadequate. If the two sets of manually marked images are very close, then the computed threshold can be very low, hence making correct processed images be considered incorrect. Increasing the threshold value would make these images be accepted, but could make incorrect images be considered correct. The calculation of the threshold used in this experiment is only one way to carry this out; in real cases, more elaborate strategies should be used, according to the convenience of the tester and the accuracy required in the testing activity;

• there are errors in the reference images. The images of the reference set may be incorrect due to several factors during the marking process, such as: lack of skill of the person in charge of the marking, visualization conditions at the time of marking, or defects in the application that supports the markings. Also in this case, the framework for automated oracles can be useful because it allows improving the quality of the images until an ideal reference set is obtained;


Table IV. Behavior of the oracles with the variation of thresholds.

α           Threshold          # Images accepted   # Images accepted
                               by Oracle 1         by Oracle 2
2.1 to 2.3  0.0728 to 0.0797   27                  27
2.4 to 2.7  0.0832 to 0.0936   28                  28
2.8 to 3.0  0.0971 to 0.1040   29                  28
3.1 to 4.2  0.1075 to 0.1456   29                  29
4.3 to 4.9  0.1491 to 0.1699   30                  29
5.0         0.1734             30                  30

• the oracle is incorrect. Mistakes can occur during the development of the plugins and lead the oracle to wrongly compute the values for the image characteristics, hence generating incorrect results in the comparison. The framework works towards providing a flexible environment in which the plugins can be added, removed, and fixed and, once improved, reused in various applications.

To give an idea of the behavior of the oracles with other threshold values, the same oracles were applied with values of α ranging from 2.1 to 5.0; the latter is the value that allowed all the executions of the PUT to be accepted. The result is shown in Table IV. Each row of the table corresponds to values that make the oracles behave in the same way. For instance, for α in the range 2.1 to 2.3 (i.e. threshold from 0.0728 to 0.0797), both oracles accepted 27 executions as correct. When α was raised to 2.4, both oracles accepted 28 images and did so while α ≤ 2.7. It can be seen that most of the images are accepted by the oracles with a low threshold value. However, for some images, only a very high value allows them to be accepted as correct. This probably indicates an anomaly—in the PUT, in the oracle, or in the reference images—and requires the tester's attention.

5.2. A Web application

In the second case study, the goal is to show the use of the GrO-method on a different type of image, derived from a Web application. In this kind of image, pictures and text components are placed by an application. The scenario reproduced here verifies the behavior of a Web application regarding the correct placement of components, such as text, titles, and pictures. Supposing there is a template to follow, one can use an image as a reference, so that she/he can check in later releases whether this template is being followed, or whether the same Web page is correctly rendered in different Web browser settings.

In this case study, the main page of the Brazilian YAHOO site was used. As can be seen in Figure 11, it is assumed that there is a template on that page that should always be followed. This template is defined by the placement of pictures and text snippets on the page. Thus, in the test, a page is valid if there is text and figure in their respective places. That template, marked in Figure 11 by the numbered rectangles, was defined by analyzing nine international YAHOO sites that follow the same template. Thus, in some cases, the regions marked in the figure may not exactly match the figures or texts on the page (e.g. region 9), yet these differences are not significant.

The same Euclidean similarity function was used, in conjunction with two feature extractors that assess whether a portion of a given page contains text or image. Even though a large number of image processing techniques that could be used for this matter can be found in the literature, the simplest ones were chosen. This facilitates understanding their behavior and keeps the example within the purpose of the article, which is to validate the GrO-method in the context of testing software with graphical output and show how the concepts can be applied. The extractors, applied to the images transformed into gray levels (8 bits, levels 0–255), are:

• color count: the number of gray levels that appear in the image is counted. This value is divided by 256 (the maximum possible number of tones) and provides the value of the extractor;

• ratio of dark pixels to bright pixels: pixels below 128 are considered dark; the others are bright.
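These two extractors might be sketched as follows — a simplified standalone version operating on an 8-bit gray-level matrix; the class and method names are hypothetical, and since the exact normalization of the ratio is not specified in the text, this sketch simply divides the dark count by the bright count:

```java
// Sketch of the two extractors used in the Web case study, operating on an
// 8-bit gray-level pixel matrix (values 0-255).
class WebExtractors {

    // Number of distinct gray levels present in the region, divided by 256.
    static double colorCount(int[][] pixels) {
        boolean[] seen = new boolean[256];
        int count = 0;
        for (int[] row : pixels)
            for (int p : row)
                if (!seen[p]) { seen[p] = true; count++; }
        return count / 256.0;
    }

    // Ratio of dark pixels (< 128) to bright pixels (>= 128).
    static double darkBrightRatio(int[][] pixels) {
        int dark = 0, bright = 0;
        for (int[] row : pixels)
            for (int p : row)
                if (p < 128) dark++; else bright++;
        return bright == 0 ? (double) dark : (double) dark / bright;
    }
}
```

A picture region is expected to yield a high colorCount, while a text region (mostly dark glyphs on a bright background) is characterized by its dark/bright ratio.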



Figure 11. Main page of the Brazilian YAHOO website.

The first extractor is applied to the areas of the page where an image is expected, i.e. regions 2, 4, 6, and 9 of Figure 11. If there is an image, the value of that feature should be high. If there is text, or if the image does not load, the difference in applying this extractor compared with the reference image will be high. The second extractor is applied to the other regions, where text is expected. Its value is expected to be similar for regions with text and different for the others, such as empty regions or regions with images.

In contrast to the simplicity of these extractors, it is easy to see that some situations can be problematic. For example, consider an image region in which a page has a predominant mixture of text with a small picture in it. Due to the high number of colors in the picture, the color count should be high, despite the region being almost entirely composed of text. Although the color count can fail in some cases, in this specific case study it was chosen as a tentative extractor, and in most cases it worked well. A possible solution for such cases would be to consider colors in a different way. For example, it is possible to analyze the image histogram and consider its shape or the number of minimum or maximum points presented in it. However, a detailed discussion about extractors is not within the scope of this paper.

To choose a threshold distance that identifies when two pages differ according to these extractors, the reference image (Figure 11) and a variety of different images were taken. Those images were created by applying the following operations to each region individually: (i) replace the region by a black rectangle; (ii) replace the region by a white rectangle; and (iii) replace the region by a rectangle with randomly generated pixels. The differences can be quite small, for example, when comparing a picture region with a random rectangle, or very high, when comparing an image with a black or


Table V. Comparison of the Brazilian site with others that follow the template.

      de     ca     es     us     ph     fr     ie     it     mx     qc     uk     sg
br    0.184  0.112  0.120  0.114  0.254  0.067  0.112  0.347  0.144  0.113  0.124  0.125
      Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass   Pass

Table VI. Comparison of the Brazilian site with others that do not follow the template.

      ar     au     cl     co     kr     ee     hk     in     id     jp     ma     tw     tr     ve     vn
br    1.286  1.105  1.000  0.943  0.287  1.167  0.895  1.229  0.865  1.383  0.858  1.142  1.408  0.844  0.613
      Fail   Fail   Fail   Fail   Pass   Fail   Fail   Fail   Fail   Fail   Fail   Fail   Fail   Fail   Pass

white rectangle. Thus, the average of the distances between the reference image and the 36 generated images (9 regions times 3 operators) was used as the threshold. This value is 0.6359. Just as in the first case study, this is a first guess that can be adjusted according to the needs identified by the tester.

To use the GrO-method to test a Web application, an oracle that uses these two extractors was created. The image in Figure 11 was used as the reference and the computed value 0.6359 was used as the threshold. The oracle was applied to the other international YAHOO sites that follow the same template as the Brazilian site. The results are shown in Table V§. Each column shows the distance between the two images and the result of the oracle (pass or fail).

Next, the same oracle was used to compare the Brazilian site with the international YAHOO sites from those countries that do not follow the same template but share some similarities, such as colors and font sizes. The results are shown in Table VI¶. Note that using the threshold of 0.6359, only the Korean and Vietnamese sites could be considered similar to the Brazilian site. As shown in Figure 12, they can be considered false positives, since they significantly differ from the Brazilian template yet are accepted by the oracle. This indicates the need to improve the extractors, the similarity function, or the threshold used in the oracle.
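The oracle's decision rule itself is simple: a page passes when its distance to the reference does not exceed the threshold. A minimal sketch, exercised with distances taken from Tables V and VI:

```java
// Minimal sketch of the oracle verdict: PASS when the computed distance
// to the reference image does not exceed the calibrated threshold.
public class GraphicalOracle {
    public enum Verdict { PASS, FAIL }

    private final double threshold;

    public GraphicalOracle(double threshold) { this.threshold = threshold; }

    public Verdict check(double distanceToReference) {
        return distanceToReference <= threshold ? Verdict.PASS : Verdict.FAIL;
    }

    public static void main(String[] args) {
        GraphicalOracle oracle = new GraphicalOracle(0.6359);
        System.out.println(oracle.check(0.184)); // br vs de: PASS
        System.out.println(oracle.check(1.286)); // br vs ar: FAIL
        System.out.println(oracle.check(0.287)); // br vs kr: PASS (a false positive)
    }
}
```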

Also, a series of tests comparing each pair of sites that conform to the aforementioned template was performed. The results are shown in Table VII. Then, each of those sites was compared with the ones that do not follow the template. The results are shown in Table VIII. It can be observed that using 0.447, the largest value in Table VII, as the oracle threshold, only the entries in boldface in Table VIII indicate accepted sites. Again, surprisingly, the Korean site is considered close to most of the other sites.

This case study showed how to apply the GrO-method in testing Web applications. At each new version of a page, the tester of the Brazilian website, for example, may automatically run the oracle that checks whether it meets the established template, using an image of the site as a reference. In the case study, extractors that verify the placement of pictures and text to decide whether the website is correct were used. Depending on what the tester wants to check, other features could be considered, for instance, distribution of colors, textures, existence of certain shapes, etc. The literature offers several image processing techniques to identify such elements.

The same approach used in this case study could also be used in compatibility testing for Web browsers. In that case, the tester should ensure that her/his application produces the same or similar results in different Web browsers or in several different configurations of the same browser. Using an image produced in one browser, the oracle can compare the aspects that the tester considers relevant with the image produced by the other browsers. The process of capturing these pages can be conducted automatically using test automation tools. The comparison between the results can be made using the GrO-method.

§ br: Brazil; de: Germany; ca: Canada; es: Spain; us: U.S.A.; ph: Philippines; fr: France; ie: Ireland; it: Italy; mx: Mexico; qc: Quebec; uk: U.K.; sg: Singapore.
¶ ar: Argentina; au: Australia; cl: Chile; co: Colombia; kr: Korea; ee: En Espanhol; hk: Hong Kong; in: India; id: Indonesia; jp: Japan; ma: Malaysia; tw: Taiwan; tr: Turkey; ve: Venezuela; vn: Vietnam.


Figure 12. False positives compared to the Brazilian site: (a) Korean site and (b) Vietnamese site.

Table VII. Comparison among sites that follow the template.

      de     br     ca     es     us     ph     fr     ie     it     mx     qc     uk     sg
de  0.000  0.184  0.141  0.077  0.145  0.376  0.229  0.142  0.443  0.267  0.142  0.139  0.137
br  0.184  0.000  0.112  0.120  0.114  0.254  0.067  0.112  0.347  0.144  0.113  0.124  0.125
ca  0.141  0.112  0.000  0.109  0.035  0.362  0.178  0.017  0.447  0.240  0.017  0.048  0.047
es  0.077  0.120  0.109  0.000  0.107  0.317  0.162  0.106  0.389  0.197  0.106  0.101  0.099
us  0.145  0.114  0.035  0.107  0.000  0.363  0.178  0.027  0.447  0.237  0.033  0.046  0.048
ph  0.376  0.254  0.362  0.317  0.363  0.000  0.189  0.362  0.147  0.146  0.362  0.363  0.363
fr  0.229  0.067  0.178  0.162  0.178  0.189  0.000  0.178  0.289  0.091  0.178  0.184  0.185
ie  0.142  0.112  0.017  0.106  0.027  0.362  0.178  0.000  0.447  0.238  0.007  0.042  0.042
it  0.443  0.347  0.447  0.389  0.447  0.147  0.289  0.447  0.000  0.240  0.447  0.447  0.446
mx  0.267  0.144  0.240  0.197  0.237  0.146  0.091  0.238  0.240  0.000  0.238  0.233  0.232
qc  0.142  0.113  0.017  0.106  0.033  0.362  0.178  0.007  0.447  0.238  0.000  0.041  0.040
uk  0.139  0.124  0.048  0.101  0.046  0.363  0.184  0.042  0.447  0.233  0.041  0.000  0.013
sg  0.137  0.125  0.047  0.099  0.048  0.363  0.185  0.042  0.446  0.232  0.040  0.013  0.000

5.3. Results discussion

Building testing oracles for programs with graphical output is far from trivial. The execution of these experiments showed that the GrO-method provides a very flexible way to create easy-to-use oracles. It also showed that even if changes are needed in the construction or in the form of application of such oracles, they can be performed easily and quickly. Also, the use of automated oracles can point out flaws in the reference images, which would allow their improvement, thus contributing to the quality of the testing activity.

The use of the GrO-method has been shown to contribute to the productivity and objectivity of this activity, since it avoids visual comparison of the images and, once an ideal reference set is obtained, dispenses with the presence of a specialist for this comparison. In the specific area of the first case study (CAD systems), one of the factors discussed in the literature is noted to be the


Table VIII. Comparison between sites that follow the template with those that do not follow the template.

      ar     au     cl     co     kr     ee     hk     in     id     jp     ma     tw     tr     ve     vn
de  1.372  1.172  1.039  0.982  0.337  1.220  0.906  1.321  0.875  1.433  0.883  1.180  1.417  0.883  0.649
br  1.286  1.105  1.000  0.943  0.287  1.167  0.895  1.229  0.865  1.383  0.858  1.142  1.408  0.844  0.613
ca  1.342  1.120  1.001  0.944  0.278  1.185  0.893  1.309  0.857  1.423  0.858  1.152  1.464  0.850  0.614
es  1.331  1.152  1.021  0.965  0.327  1.194  0.899  1.274  0.863  1.404  0.870  1.173  1.412  0.866  0.633
us  1.345  1.126  1.000  0.943  0.284  1.189  0.894  1.310  0.859  1.417  0.852  1.153  1.462  0.845  0.605
ph  1.208  1.156  1.073  1.019  0.456  1.188  0.950  1.067  0.931  1.323  0.929  1.188  1.311  0.915  0.711
fr  1.262  1.113  1.015  0.958  0.321  1.169  0.904  1.184  0.878  1.363  0.869  1.148  1.381  0.855  0.628
ie  1.339  1.117  0.997  0.941  0.283  1.182  0.893  1.308  0.856  1.417  0.856  1.152  1.464  0.847  0.612
it  1.117  1.181  1.022  0.969  0.501  1.120  0.887  0.941  0.860  1.245  0.867  1.142  1.203  0.860  0.678
mx  1.256  1.162  1.049  0.995  0.399  1.186  0.925  1.150  0.893  1.355  0.900  1.193  1.385  0.894  0.672
qc  1.340  1.117  0.999  0.942  0.283  1.183  0.893  1.308  0.856  1.416  0.858  1.152  1.464  0.849  0.614
uk  1.354  1.144  1.016  0.959  0.310  1.199  0.896  1.311  0.860  1.426  0.866  1.166  1.470  0.862  0.625
sg  1.352  1.145  1.016  0.960  0.313  1.197  0.897  1.310  0.859  1.424  0.868  1.170  1.474  0.865  0.628

Table IX. Plugins’ size and complexity.

Extractor          LOC   SIZE   CC

Area                23    113   13
Perimeter           35    225   24
Signature          163    646   56
Color counting     102    479   30
Dark percentage     46    244   16
Euclidean           20     92    8

LOC, total lines of code of the class; SIZE, total size (number of bytecode instructions) of the class; CC, sum of the methods' cyclomatic complexity.

subjectivity inherent in the visual analysis, given that a diagnosis depends on the experience of the physician and other factors, such as fatigue and visual acuity, therefore making the process highly vulnerable to errors [34]. The use of automated systems helps to reduce this subjectivity, allowing for the composition of more precise diagnoses. The use of oracles in this context contributes to decreasing the subjectivity of the analysis of the results of CAD systems, the evaluation of which is another challenge presented in the literature. For the second case study, it allows the automation of an activity that, although suitable for human analysis, could be tedious and time-consuming if executed repeatedly.

If on the one hand automation is good, on the other it has a cost. The method and the framework presented here aim at reducing this cost and making oracle automation an easier task. To that end, the job of the tester is to define the characteristics that should be used to compare two images and then implement them in the form of extractors. The intellectual effort spent on such an assignment is difficult to measure. However, if it relates to the size and complexity of the code produced, these metrics may be an indication of the automation cost. Table IX shows a few static metrics collected from each of the plugins implemented in both case studies. The size and complexity reported are low and represent minor work, hence worth doing.

Comparing the proposal presented here with other existing ones, some positive points in its use can be highlighted. As shown in Section 6, the approaches currently used in the construction of graphical oracles are mostly directed to GUI applications. Their main feature is to explore the user interface structure to record the events of an execution and then use them as the expected behavior in subsequent executions. Although this is an interesting approach and has been widely used in test automation tools, it is not the only viable alternative. Below, some cases in which the GrO-method can be used with some advantages are listed.


• the most distinctive feature to highlight is that the architecture proposed here and the framework O-FIm can be used to create oracles for graphical programs in any field, not being restricted to GUIs, as shown in the two case studies. The other approaches found in the literature are limited to handling GUI applications;

• even for the case of graphical interfaces, there are cases in which the tester cannot or does not want to extract the structure from the GUI for the oracle automation. For example, suppose the tester wants to verify whether a particular third-party application works in different display resolutions. To this end, each of its windows should be checked at each resolution. This process could be automated using the approach proposed here; moreover, it does not require knowing the internal structure of the application being tested;

• in some cases, the visual aspect of the result can be as important as, or more important than, the contents of a field or the state of an interface component. In the Web application sample, for instance, extracting the visual aspect from the states of the components can be difficult or impossible to accomplish. Other features, such as the distribution of colors in a Web page, might be too abstract to be extracted from its structure or from the HTML tags;

• in order to capture the structure of the GUI components and then to register and later compare an execution, some support is required from the library with which the GUI was implemented. Thus, for each library, it is necessary to implement distinct oracles. The GrO-method for building graphical oracles treats the application as a black box and, therefore, is not subject to this restriction.

A case similar to that exposed in the second case study is the compatibility test of Web pages. In it, the tester wants to check whether the visual aspect that the developer planned is actually rendered by each of the existing Web browsers, under some different configurations. Another mainstream approach uses the HTML content of the page to identify possible structures that are not correctly handled by a specific Web browser [35]. A database has to be incrementally built in order to register the problematic constructions and the Web browser configurations they affect. A testing tool can then analyze the page source code and identify whether it is supposed to be correctly rendered in a given browser. The core of this approach, i.e. understanding the problematic structures, is probably one of its main drawbacks. Assembling such a database is not an easy task, and a new release of a Web browser may invalidate the collected information.

Compared with the GrO-method, this technique uses a completely different approach. The GrO-method is more general in the sense that it can be applied not only to the compatibility test but also to other situations. For instance, it can measure the closeness in appearance of two Web pages, as in the second case study, which Eaton's technique cannot. The structure that can be extracted from the page source code contains no formatting description and, therefore, no visual information to construct an oracle can be captured.

As defined by Memon et al. [36], the two approaches use different 'oracle information', i.e. they use different sources of information to decide whether the obtained result is acceptable or not. Hence, concerning compatibility testing, these two approaches are complementary. In some cases, it is a good idea to use more than one oracle, each one with its particular oracle information, in a testing strategy. If both oracles agree in the verdict, the tester might have great confidence in it. If they disagree, the tester should be notified and should identify the source of the discrepancy. This would finally lead to the improvement of the oracles and/or the testing strategy.
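Such a dual-oracle strategy can be sketched as follows; the verdict names are illustrative:

```java
// Sketch of combining two oracles with different oracle information:
// agreement yields a confident verdict, disagreement is flagged so the
// tester can locate the source of the discrepancy.
public class DualOracleStrategy {
    public enum Verdict { PASS, FAIL, NEEDS_REVIEW }

    public static Verdict combine(boolean cbirOraclePasses, boolean structuralOraclePasses) {
        if (cbirOraclePasses == structuralOraclePasses)
            return cbirOraclePasses ? Verdict.PASS : Verdict.FAIL;
        return Verdict.NEEDS_REVIEW;
    }

    public static void main(String[] args) {
        System.out.println(combine(true, true));   // prints PASS
        System.out.println(combine(false, false)); // prints FAIL
        System.out.println(combine(true, false));  // prints NEEDS_REVIEW
    }
}
```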

The next section summarizes some of the works dealing with graphical application testing found in the literature. They are mostly directed to GUI applications, as already mentioned.

6. RELATED WORK

Some work has been conducted that explores the development and use of testing oracles. Most of it is directed towards studying such mechanisms specifically for GUIs, differently from the method presented in this paper, which can be applied not only to user interfaces but also to any kind of result represented as an image.


Xie and Memon [37] classify testing oracles for GUIs into four classes:

Manual: The tester provides input events to the GUI and observes whether the behavior is as expected.

Visual assertions: The tester uses a 'capture and playback' system which first records the interactions and the result of an execution, and is then able to reproduce the same execution. The tester visually defines assertions which specify that certain parts of the interface should have a visual property.

Assertive: The tester uses specific APIs to programmatically create test cases, with 'assert' statements stating the expected result. In this case, such commands are used to check the state of GUI objects as, for instance, the text that is expected in a given text field. Examples of APIs that support this kind of oracle are JUnit‖, for Java programs in general, and Jemmy∗∗ or FEST-Swing††, specifically for GUI programs.

Hard-coded: The tester hard-codes the operations to be performed by the program and the results expected from such operations. In this case, however, the operations are not performed through the GUI. Instead, test case code directly accesses the business level logic, bypassing the end-user interface. As an example, Marick [38] shows how to create test cases for a well-known text editor accessing its 'scripting' interface.
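The 'assertive' style can be illustrated with a minimal, self-contained sketch; a plain stand-in class replaces a real GUI text field, and a local assertEquals helper plays the role of JUnit's, so no GUI toolkit is required:

```java
// Sketch of the 'assertive' oracle style: test code queries component state
// and asserts on it. TextField is a stand-in for a real component such as
// javax.swing.JTextField; assertEquals stands in for JUnit's Assert.
public class AssertiveOracleExample {
    public static class TextField {
        private String text = "";
        public void setText(String t) { text = t; }
        public String getText() { return text; }
    }

    public static void assertEquals(String expected, String actual) {
        if (!expected.equals(actual))
            throw new AssertionError("expected <" + expected + "> but was <" + actual + ">");
    }

    public static void main(String[] args) {
        TextField total = new TextField();
        total.setText("42.00");                 // action performed by the test driver
        assertEquals("42.00", total.getText()); // the oracle: expected GUI state
        System.out.println("assertion passed"); // prints assertion passed
    }
}
```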

The authors mention that all these approaches require large amounts of manual effort. Thus, the GrO-method may help in the automation aspects of GUI testing by saving time and improving the quality of the oracle. In the above classification, the present work can be seen as automated support for the visual assertions approach. It proposes to use CBIR techniques to state what the expected visual behaviors of the program under test are.

Memon et al. [39] created a formal model to describe a GUI and its behavior. This model is based on GUI objects, properties of the objects, and actions that make properties change. They use this model to create a test oracle. Given a test case and the GUI model, the oracle is able to compute the expected state of the GUI. An execution monitor collects the actual state and a verifier compares the actual against the expected state in order to decide whether the GUI behaves as expected.

Most approaches that define GUI oracles use particular representations to depict the structure and the state of the GUI, as the one just mentioned. In the GrO-method, no such representation is used and the program under test may be seen as a black box. This may be an advantage in certain cases. First, the method is not applicable only to GUI images, but to any kind of graphical output. Second, because manually building such a representation is not practical, its construction might rely on parsing the program to identify the GUI components and structure, which may depend on a language and interface construction toolkit. On the other hand, a representation as the one used by Memon et al. [39] might be more precise, given that a state is a collection of objects and properties, and comparing two states is easier than comparing regions of pixels in two images.

Memon and Soffa [40] present a way to represent a GUI and how it can be used to deal with regression testing when modifications take place in the user interface. In this case, the test cases used in the original run of the program under test may become unusable. They present a technique to analyze existing test cases, to discover the unusable ones and make them usable again. They report a success rate of over 70%. This problem of changing the user interface between executions might also affect the application of the oracle if it uses the visual aspect of the GUI. The present approach is robust to deal with this problem, at least to some extent, by providing flexibility to the tester to create extractors that can be sensitive to visual differences in the interface. In addition, differences in the layout can be addressed by extending the O-FIm framework to allow the oracle to apply the extractors to different regions in the reference and in the evaluated image.

‖ http://www.junit.org.
∗∗ https://jemmy.dev.java.net.
†† http://code.google.com/p/fest/.


Although related, the works cited have a subtle difference in relation to the work herein that should be mentioned. Those works were developed with the goal of testing the GUI, not the application itself. Within that scope, it is necessary to check whether the GUI has the appropriate behavior, not just whether the program computes the correct function. Thus, having the GUI model and expected state at each interaction is essential, and the use of images as oracle references would not work well or not work at all. In the GrO-method, the goal is checking the program results by means of one image or a sequence of images.

The same idea is shared by Ye et al. [27]. They use the images generated by an execution of a program with a GUI to decide on its correctness. They use the Biomimetic Pattern Recognition (BPR) technique which, according to the authors, can simulate a person's behavior in recognizing the images. A multi-weight neural network is trained with samples of acceptable images of the GUI and is then used to decide on the correctness of an actual execution image. As in the GrO-method, a threshold can be used to determine how close an image must be to be acceptable. This is done at the training phase, to decide which images should or should not be used as samples.

Another approach to compare images is provided by Takahashi [28]. Considering that an application does not draw objects to a computer screen directly but instead uses a system API to create such objects, the author proposes to intercept the calls to that API, record them, and use them as a reference for further executions of the application. With this perspective, the technique deals with low-level primitives as its objects, not with high-level components, such as windows, buttons, or menus. This may be a problem when comparing GUI behaviors; however, it may be useful if other kinds of images are compared. Another problem is the fact that the automation of this approach is highly dependent on the operating system and/or on the graphical API used by the tested program.

The combination of random testing and the assessment of some statistical characteristics of the actual results of a program execution may be used to implement an oracle for some programs, including some for image processing, according to Mayer and Guderlei [41]. The authors mention, for instance, the case of a program which computes the area and boundary length of the black region in a binary image. Although it is difficult to determine the expected result for a single image, if a test set is generated according to a distribution with a known mean, this value can be used to evaluate the correctness of the program under test. Of course, this is not a precise approach. One of the main implications is that a verdict cannot be provided for single test cases. This is also a disadvantage if one considers that testing information may be useful for other activities in the software development process as, for instance, debugging.
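The statistical oracle idea can be sketched as follows: random binary images are generated with a known expected black-pixel fraction, and the mean area reported by the program under test is compared with the expected mean. The trial count and tolerance below are illustrative choices:

```java
import java.util.Random;

// Sketch of a statistical oracle for a black-region area computation:
// no verdict is possible for a single image, but over many random images
// the sample mean must approach the known expected mean.
public class StatisticalOracle {
    // The "program under test": counts black pixels in a binary image.
    public static int blackArea(boolean[][] img) {
        int count = 0;
        for (boolean[] row : img)
            for (boolean b : row) if (b) count++;
        return count;
    }

    // Each pixel is black with probability p, so the expected mean area
    // over the test set is p * w * h.
    public static boolean passes(int trials, int w, int h, double p, double tol, long seed) {
        Random rnd = new Random(seed);
        double sum = 0.0;
        for (int t = 0; t < trials; t++) {
            boolean[][] img = new boolean[h][w];
            for (int y = 0; y < h; y++)
                for (int x = 0; x < w; x++)
                    img[y][x] = rnd.nextDouble() < p;
            sum += blackArea(img);
        }
        double mean = sum / trials;
        return Math.abs(mean - p * w * h) <= tol;
    }

    public static void main(String[] args) {
        // 500 random 20x20 images with p = 0.3: expected mean area is 120.
        System.out.println(passes(500, 20, 20, 0.3, 5.0, 42L));
    }
}
```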

Mayer and Guderlei [42] state that statistical testing has limited applicability due to the fact that it only works for applications in which a statistical distribution of the outputs is known. They propose the use of another specific technique for image processing algorithms. According to them, some image operations can be validated by using 'metamorphic testing relations', which relate the input and output data, working as necessary conditions for the implementation to be correct. This approach is also not very general, since for each image processing operator a set of metamorphic relations must be applied. For instance, in that work, the authors use the Euclidean distance transform and a set of seven relations and two properties that may be used to verify the validity of the resulting images. The set of relations is defined for that transformation and may not be useful to serve as an oracle in other applications.
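As an illustration of the metamorphic idea, one necessary condition for a correct Euclidean distance transform is that flipping the input horizontally must flip the output horizontally, since the flip preserves distances. This particular relation and the brute-force transform below are illustrative, not taken from the seven relations of [42]:

```java
// Sketch of a metamorphic relation for the Euclidean distance transform
// (EDT): edt(flip(I)) must equal flip(edt(I)).
public class MetamorphicEDT {
    // Brute-force EDT: distance from each pixel to the nearest foreground pixel.
    public static double[][] edt(boolean[][] img) {
        int h = img.length, w = img[0].length;
        double[][] out = new double[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                double best = Double.POSITIVE_INFINITY;
                for (int v = 0; v < h; v++)
                    for (int u = 0; u < w; u++)
                        if (img[v][u]) best = Math.min(best, Math.hypot(x - u, y - v));
                out[y][x] = best;
            }
        return out;
    }

    public static boolean[][] flip(boolean[][] img) {
        int h = img.length, w = img[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) out[y][x] = img[y][w - 1 - x];
        return out;
    }

    public static double[][] flip(double[][] m) {
        int h = m.length, w = m[0].length;
        double[][] out = new double[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) out[y][x] = m[y][w - 1 - x];
        return out;
    }

    // The metamorphic check: a correct EDT must commute with the flip.
    public static boolean relationHolds(boolean[][] img) {
        double[][] a = edt(flip(img));
        double[][] b = flip(edt(img));
        for (int y = 0; y < a.length; y++)
            for (int x = 0; x < a[0].length; x++)
                if (Math.abs(a[y][x] - b[y][x]) > 1e-9) return false;
        return true;
    }

    public static void main(String[] args) {
        boolean[][] img = { { true,  false, false },
                            { false, false, false },
                            { false, false, true  } };
        System.out.println(relationHolds(img)); // prints true
    }
}
```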

The idea of using existing images as references for deciding the correctness of a program execution has also been used in other works, as cited in Section 2 [25, 26]. Bellotti et al. [25] address the processing of medical images. The authors evaluate a CAD system that locates nodules in mammographic images. They use pictures manually marked by a physician, which are then compared against the result of the CAD system. The mark used by the researchers is a circle whose center indicates where the nodule was identified by the specialist and whose area, determined by the radius, completely contains the nodule. While serving as a parameter for researchers to assess the accuracy degree of the CAD system, this approach is far from precise, in the sense that the form identified by the system can be quite different from the nodule shown in the mammographic image and still be considered correct.


Paquerault et al. [26] compare three segmentation methods, also applied to mammographic images, for the identification of microcalcifications. In this case, the results of the segmentation algorithms that identify areas of interest in the image are placed upon the original image and a 'human oracle' is responsible for assessing which of them produced the most accurate result. These two approaches are not systematic and use ad hoc effort to make the correctness assessment. The use of an automated oracle in this case would be of great importance, adding parameters for comparison purposes and thereby reducing the probability of mistakes.

To address the problem of compatibility testing for Web applications, Eaton and Memon [35] present an inductive model, based on the Web page source code and a specific Web browser configuration. Knowledge of HTML tag behavior on specific Web browser configurations is used to identify possible problems or deviations from the expected results. The authors also identify other possible approaches to compatibility testing. In particular, they mention manual and automated execution-based techniques. The first is a very expensive approach because it requires the tester to launch the application in several different Web browser configurations and manually check its results on each of them. The second reduces this cost by using tools to automatically load the application in a set of Web browser configurations. The problem in this case is that the existing tools are, according to Eaton and Memon, 'non-diagnostic': they only return images of the Web pages but do not indicate possible errors. The use of automated execution-based techniques with CBIR concepts as indicated herein may help to reveal faults, at least those that appear as a badly displayed page.

Well-known image processing concepts were applied to create a flexible and easy-to-use method to develop oracles for programs that produce images as outputs. As observed in this section, the problem is of interest to the scientific community and no single solution may exist for all kinds of applications and domains. Thus, the GrO-method contributes in this field, providing an effective alternative to build systematic testing oracles.

7. FINAL REMARKS

As explained here and in the related literature, the desire for automated software testing is justified by the increase in productivity and in the quality of the software product that comes with it. However, after considering the subjective variables hidden in automating this task, for instance, the experience of the tester and prior knowledge of the proper functioning of the program under test, the difficulty in building systems that effectively contribute to the test is clear. One approach to automating the tester's prior knowledge is developing testing oracles.

The difficulty of automation becomes more evident when the program being tested produces graphical output and this output does not meet a single criterion of correctness, i.e. when more than one answer may be correct or when a certain degree of flexibility is necessary to consider whether the processing is correct. To contribute to the state of the art in this area, the method presented herein proposes the use of concepts of content-based image retrieval for the automation of testing oracles for programs with graphical outputs. An architecture and the implementation of a framework to build graphical oracles were described, as well as two case studies that demonstrate the value of the proposition.

Despite the innovation in the method presented, its use appears to require some effort if its implementation is not also automated. Thus, both the textual definition of the oracles and the interactive tool in the form of a 'wizard' provide significant support in using this approach. In addition, using the GrO-method appears to require the tester to fully know the system under test and to have notions of image processing, so that she/he can properly implement the extractors. These conditions are not obstacles because, in general, developers of systems that produce images as output work with these techniques and often extract features during the construction of their algorithms, which can be adapted to the extractors.

However, this fact may not be completely true when it comes to programs with outputs expressed in other types of images, such as GUIs or Web pages. The developers of such systems are not


always accustomed to concepts of image processing, which could require some adaptation to automate oracles using the GrO-method. On the other hand, once a 'universal' set of extractors is defined that covers the characteristics of the most common graphical interface components, such as buttons, check boxes, and bars, these extractors can be reused in the construction of oracles for any application using a GUI based on those components.

Considering this scenario, a future work deriving from these difficulties is the development of a set of extractors for GUIs, customizable for different platforms. These extractors are pieces of Java code aimed at providing basic code to find and classify components found in graphical interfaces. A related future work is to build a second set of basic extractors (for example, region contrast, region size, object shape, among others) to help users increase their productivity through the use of the O-FIm framework in software testing. Also considering the need to increase testers' productivity, a set of similarity functions is being implemented that allows testers to conduct their activity considering different comparison algorithms.

Another observation from this study that should be emphasized is that the results of applying graphical oracles based on CBIR are significantly dependent on the application under test, the extractors, the similarity function, and the necessary parameterizations. From the case studies conducted, the adequacy of the GrO-method was verified for particular types of systems with graphical output. Several other cases can be analyzed and detailed. However, as aforementioned, building extractors requires reasonable knowledge of image processing techniques, which can mean relatively hard work. Thus, a follow-up of the current study is the construction of a library of extractors that may be applied to specific categories of programs, further contributing to the construction of oracles within this context. Another scenario that will be addressed is the verification of the visual aspect of Websites. Again, the GrO-method is a good alternative to test this kind of application, in which ordinary techniques for GUI testing cannot be applied and visual aspects are extremely relevant.

In summary, the GrO-method might be useful to support the creation of testing oracles in settings that are not completely addressed by others. In particular, in those cases in which visual aspects are relevant, it seems to be the preferable choice. On the other hand, it has its limitations and drawbacks. First, the tester or the person responsible for constructing the oracle should have at least some basic knowledge of image processing. The case studies have shown that minor effort is needed to build the plugins for the oracles. However, the simple feature extractors used are illustrative and limited, even for the case studies. It is not clear how their complexity would scale in a real production environment.

In addition, concerning the case studies, computing threshold values to decide when a given result should be considered correct or incorrect is not a trivial task. This is a characteristic of the image processing techniques involved in the GrO-method and there is no single solution for it. One possible approach to mitigate this problem could be to define ranges of values for the comparison, such as 'not accepted', 'suspicious', and 'accepted'. This may improve the accuracy of the method but reduce the degree of automation, since results in the middle range would require the tester's intervention.
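Such a three-range verdict can be sketched as follows; the threshold values are hypothetical:

```java
// Sketch of a three-range verdict: two thresholds split the distance scale,
// and the middle range defers the decision to the tester instead of
// forcing an automatic pass or fail.
public class ThreeRangeOracle {
    public enum Verdict { ACCEPTED, SUSPICIOUS, NOT_ACCEPTED }

    public static Verdict classify(double distance, double acceptBelow, double rejectAbove) {
        if (distance <= acceptBelow) return Verdict.ACCEPTED;
        if (distance >= rejectAbove) return Verdict.NOT_ACCEPTED;
        return Verdict.SUSPICIOUS;   // requires the tester's intervention
    }

    public static void main(String[] args) {
        // Hypothetical thresholds around the 0.6359 value of the second case study.
        System.out.println(classify(0.28, 0.45, 0.80)); // prints ACCEPTED
        System.out.println(classify(0.63, 0.45, 0.80)); // prints SUSPICIOUS
        System.out.println(classify(1.28, 0.45, 0.80)); // prints NOT_ACCEPTED
    }
}
```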

Finally, the case studies demonstrated the feasibility of the GrO-method, but they are a very small universe of observation. This work will be extended with further experimentation and observation. In particular, its use in real production environments might indicate its strengths and weaknesses.

Also, as a continuation of the work with the GrO-method, its integration with testing automation environments, such as JUnit and Selenium‡‡, is a relevant subject. In particular, integration with Selenium might allow developing techniques that validate visual aspects of Websites, a problem that has recently been widely explored.
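A possible shape for that integration is sketched below: Selenium captures a rendered page to a PNG file, and the screenshot is then handed to a graphical oracle. The oracle here (`naive_oracle`, a byte-for-byte comparison) is only a stand-in for an O-FIm-instantiated, CBIR-based oracle; the function names other than Selenium's own API are hypothetical.

```python
def compare_to_reference(screenshot_path, reference_path, oracle):
    """Apply a graphical oracle to a captured screenshot against a reference image."""
    return oracle(screenshot_path, reference_path)

def naive_oracle(screenshot_path, reference_path):
    """Placeholder oracle: byte-identical files count as a pass.

    A CBIR-based oracle would instead extract features from both images
    and compare them with a similarity function and threshold.
    """
    with open(screenshot_path, "rb") as a, open(reference_path, "rb") as b:
        return a.read() == b.read()

def capture_page(url, screenshot_path):
    """Capture a page screenshot with Selenium (requires a browser driver)."""
    from selenium import webdriver
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        driver.save_screenshot(screenshot_path)
    finally:
        driver.quit()
```

A test would then call `capture_page(...)` followed by `compare_to_reference(...)`, turning the visual check into an ordinary pass/fail assertion.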

‡‡http://seleniumhq.org/.

Copyright © 2011 John Wiley & Sons, Ltd. Softw. Test. Verif. Reliab. 2013; 23:171–198. DOI: 10.1002/stvr

Integrating O-FIm with the Proteum tool [24], which supports mutation testing, is also part of the future plans. One of the tasks of Proteum is to store the inputs and outputs of the test cases and use them to execute and compare the result of each mutant. Currently, the application of the tool is limited to programs that interact with the user only through the standard input, as plain text. For the tool to be used with programs that interact with the user through a GUI, part of the solution would be to allow input events and screenshots (outputs) to be captured and stored. For the second part of the solution, the GrO-method could be used to compare the screenshots generated by the mutants with those of the original program. The tester provides a textual oracle description, as shown in the case studies, and the Proteum tool invokes O-FIm to instantiate that oracle. The oracle may then be used to compare the content of the original screenshots against the screenshots captured for the mutants.
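The comparison step of that workflow might look like the sketch below, which decides, per mutant, whether its screenshot diverges from the original's beyond the oracle's tolerance. Feature vectors, the cosine similarity, and the threshold are all illustrative stand-ins for whatever oracle O-FIm would instantiate from the tester's description.

```python
def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 1.0

def mutant_verdict(original_features, mutant_features, similarity, threshold=0.95):
    """A mutant is 'killed' when its screenshot differs enough from the original's."""
    return "killed" if similarity(original_features, mutant_features) < threshold else "live"

original = [0.5, 0.3, 0.2]     # feature vector of the original program's screenshot
mutants = {
    "m1": [0.5, 0.3, 0.2],     # output indistinguishable from the original
    "m2": [0.9, 0.05, 0.05],   # visibly different output
}
results = {name: mutant_verdict(original, feats, cosine_similarity)
           for name, feats in mutants.items()}
```

Mutants left 'live' under every test case and oracle would then be inspected by the tester, as in ordinary mutation analysis.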

Thus, the work presented herein is a way to contribute to the state of the art in the automation of testing activities, through mechanisms for constructing oracles for graphical applications. This is an innovative and promising approach, as it allows different areas to benefit from its use. The case studies, for example, showed how to build testing oracles for a program in the medical field and for Web applications, domains that have largely applied image processing and design techniques, respectively, but still lack software engineering tools and techniques, in particular software testing tools.

ACKNOWLEDGEMENTS

This work is sponsored by CNPq (Brazilian National Council of Scientific and Technological Development), under grant number 551002/2007-7. Rafael Alves de Oliveira is sponsored by FAPESP (The State of São Paulo Research Foundation) under grant number 2008/07605-7.

The icons used in Figures 1 and 3 are from the ‘crystal clear’ package (http://www.everaldo.com/crystal/)released under the GNU Lesser General Public License (LGPL—http://www.everaldo.com/crystal/?action=license).

REFERENCES

1. Datta R, Joshi D, Li J, Wang JZ. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys 2008; 40(2):1–60.
2. Baresi L, Young M. Test oracles. Technical Report CIS-TR-01-02, University of Oregon, Department of Computer and Information Science, Eugene, OR, U.S.A., 2001.
3. Weyuker EJ. On testing non-testable programs. The Computer Journal 1982; 25(4):465–470.
4. Lieberman H, Rosenzweig E, Singh P. Aria: An agent for annotating and retrieving images. IEEE Computer 2001; 34(7):57–62.
5. Ogle VE, Stonebraker M. Chabot: Retrieval from a relational database of images. IEEE Computer 1995; 28(9):40–48.
6. Smeulders AW, Worring M, Santini S, Gupta A, Jain R. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000; 22(12):1349–1380.
7. Gonzalez RC, Woods RE. Digital Image Processing (3rd edn). Prentice-Hall: Upper Saddle River, NJ, U.S.A., 2007.
8. Ballard DH, Brown CM. Computer Vision. Prentice-Hall: Englewood Cliffs, NJ, 1982.
9. El-Naqa I, Yang Y, Galatsanos N, Nishikawa R, Wernick M. A similarity learning approach to content-based image retrieval: Application to digital mammography. IEEE Transactions on Medical Imaging 2004; 23(10):1233–1244.
10. Vasconcelos N. On the efficient evaluation of probabilistic similarity functions for image retrieval. IEEE Transactions on Information Theory 2004; 50(7):1482–1496.
11. Hafner J, Sawhney H, Equitz W, Flickner M, Niblack W. Efficient color histogram indexing for quadratic form distance functions. IEEE Transactions on Pattern Analysis and Machine Intelligence 1995; 17(7):729–736.
12. Swain MJ, Ballard DH. Color indexing. International Journal of Computer Vision 1991; 7(1):11–32.
13. Manjunath B, Ma W. Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence 1996; 18(8):837–842.
14. Smith J. Integrated spatial and feature image systems: Retrieval, compression and analysis. PhD Dissertation, Columbia University, New York, NY, 1997.
15. Rubner Y, Tomasi C, Guibas LJ. A metric for distributions with applications to image databases. Proceedings of the Sixth International Conference on Computer Vision, Bombay, India, 1998; 59–66.
16. Gaede V, Gunther O. Multidimensional access methods. ACM Computing Surveys 1998; 30(2):170–231.
17. Bohm C, Berchtold S, Keim DA. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys 2001; 33(3):322–373.
18. Traina C Jr, Filho RF, Traina AJ, Vieira MR, Faloutsos C. The Omni-family of all-purpose access methods: A simple and effective way to make similarity search more efficient. The VLDB Journal 2007; 16(4):483–505.


19. Hjaltason GR, Samet H. Index-driven similarity search in metric spaces (Survey Article). ACM Transactions on Database Systems 2003; 28(4):517–580.
20. Lee M, Yoon H, Kim YJ, Lee Yk. SMILE tree: A stream data multi-query indexing technique with level-dimension nodes and extended-range nodes. ICUIMC '08: Proceedings of the Second International Conference on Ubiquitous Information Management and Communication. ACM: New York, NY, U.S.A., 2008; 101–107.
21. Maree R, Geurts P, Wehenkel L. Content-based image retrieval by indexing random subwindows with randomized trees. ACCV '07: Proceedings of the Eighth Asian Conference on Computer Vision. Springer-Verlag: Berlin, Heidelberg, 2007; 611–620.
22. Valle E, Cord M, Philipp-Foliguet S. High-dimensional descriptor indexing for large multimedia databases. CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM: New York, NY, U.S.A., 2008; 739–748.
23. DeMillo RA, Lipton RJ, Sayward FG. Hints on test data selection: Help for the practicing programmer. Computer 1978; 11(4):34–41.
24. Delamaro ME, Maldonado JC. Proteum—A tool for the assessment of test adequacy for C programs. Proceedings of the Conference on Performability in Computing Systems (PCS'96), Brunswick, NJ, U.S.A., 1996; 79–95.
25. Bellotti R, De Carlo F, Tangaro S, Gargano G, Maggipinto G, Castellano M, Massafra R, Cascio D, Fauci F, Magro R, Raso G, Lauria A, Forni G, Bagnasco S, Cerello P, Zanon E, Cheran SC, Lopez Torres E, Bottigli U, Masala GL, Oliva P, Retico A, Fantacci ME, Cataldo R, De Mitri I, De Nunzio G. A completely automated CAD system for mass detection in a large mammographic database. Medical Physics 2006; 33(8):3066–3075.
26. Paquerault S, Yarusso LM, Papaioannou J, Jiang Y, Nishikawa RM. Radial gradient-based segmentation of mammographic microcalcifications: Observer evaluation and effect on CAD performance. Medical Physics 2004; 31(9):2648–2657.
27. Ye M, Feng B, Zhu L. Automated oracle based on multi-weighted neural networks for GUI testing. Information Technology Journal 2007; 6(3):370–375.
28. Takahashi J. An automated oracle for verifying GUI objects. SIGSOFT Software Engineering Notes 2001; 26(4):83–88.
29. Giger M. Computer-aided diagnosis of breast lesions in medical images. Computing in Science & Engineering 2000; 2(5):39–45.
30. Chan HP, Doi K, Vybrony CJ, Schmidit RA, Metz CE, Lam KL, Ogura T, Wu Y, Macmahon H. Improvement in radiologists' detection of clustered microcalcifications on mammograms: The potential of computer-aided diagnosis. Investigative Radiology 1990; 25(10):1102–1110.
31. Doi K, Giger ML, Nishikawa RM, Schmidt R. Computer-aided diagnosis of breast cancer on mammograms. Breast Cancer 1997; 4(3):228–233.
32. Giger M, MacMahon H. Image processing and computer-aided diagnosis. Radiologic Clinics of North America 1996; 34(3):565–596.
33. Nunes FLS, Schiabel H, Goes CE. Contrast enhancement in dense breast images to aid clustered microcalcifications detection. Journal of Digital Imaging 2007; 20(1):53–66.
34. Ellis IO, Galea MH, Locker A, Roebuck EJ, Elston CW, Blamey RW, Wilson ARM. Early experience in breast cancer screening: Emphasis on development of protocols for triple assessment. The Breast 1993; 2(3):148–153.
35. Eaton C, Memon AM. An empirical approach to testing web applications across diverse client platform configurations. International Journal on Web Engineering and Technology (IJWET), Special Issue on Empirical Studies in Web Engineering 2007; 3(3):227–253.
36. Memon A, Banerjee I, Nagarajan A. What test oracle should I use for effective GUI testing? Proceedings of the 18th IEEE International Conference on Automated Software Engineering, Montreal, Quebec, Canada, October 2003; 164–173. DOI: 10.1109/ASE.2003.1240304.
37. Xie Q, Memon AM. Designing and comparing automated test oracles for GUI-based software applications. ACM Transactions on Software Engineering and Methodology 2007; 16(1):4.
38. Marick B. Bypassing the GUI. Software Testing and Quality Engineering Magazine 2002; 41–47.
39. Memon AM, Pollack ME, Soffa ML. Automated test oracles for GUIs. SIGSOFT Software Engineering Notes 2000; 25(6):30–39.
40. Memon AM, Soffa ML. Regression testing of GUIs. ESEC/FSE-11: Proceedings of the 9th European Software Engineering Conference Held Jointly with 11th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Helsinki, Finland, 2003; 118–127.
41. Mayer J, Guderlei R. Test oracles using statistical methods. Proceedings of the First International Workshop on Software Quality (Lecture Notes in Informatics), Erfurt, Germany, 2004; 179–189.
42. Mayer J, Guderlei R. On random testing of image processing applications. Proceedings of the Sixth International Conference on Quality Software (QSIC'06), Washington, DC, U.S.A., 2006; 85–92.
