On the Role of the Nouns in IR-based Traceability Recovery∗

Giovanni Capobianco⋆, Andrea De Lucia⋄, Rocco Oliveto⋄†, Annibale Panichella⋆, Sebastiano Panichella⋆

⋆STAT – University of Molise, Contrada Fonte Lappone, Pesche (IS), Italy
⋄DMI – University of Salerno, Via Ponte don Melillo, Fisciano (SA), Italy

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

The intensive human effort needed to manually manage traceability information has increased the interest in utilising semi-automated traceability recovery techniques. This paper presents a simple way to improve the accuracy of traceability recovery methods based on Information Retrieval techniques. The proposed method acts on the artefact indexing, considering only the nouns contained in the artefact content to define the semantics of an artefact. The rationale behind such a choice is that the language used in software documents can be classified as a sectorial language, where the terms that provide more indication of the semantics of a document are the nouns. The results of a reported case study demonstrate that the proposed artefact indexing significantly improves the accuracy of traceability recovery methods based on probabilistic or vector space based IR models.

1. Introduction

A lot of effort in the software engineering community (both research and commercial) has been devoted to improving the management of the dependences (i.e., traceability links) between software artefacts. Traceability information can provide important insights into system development and evolution, assisting in program comprehension, impact analysis, and reuse of existing software [2]. However, establishing and maintaining traceability links between software artefacts produced or modified in the software life-cycle are costly, tedious, and error-prone activities that are crucial but frequently neglected in practice.

The need to provide software engineers with methods and tools supporting traceability recovery has been widely recognised in recent years. In particular, several researchers have proposed the use of Information Retrieval (IR) [4, 13] techniques for recovering traceability links between artefacts of different types [2, 3, 6, 9, 10, 16, 19, 20, 23]. The idea behind such methods is that most of the software documentation is text based or contains textual descriptions, and that programmers use meaningful domain terms to define source code identifiers [2]. Thus, IR-based methods recover traceability links on the basis of the similarity between the text contained in the software artefacts. Because the number of all possible links can be very high, such tools use a similarity threshold to consider as candidate traceability links only the pairs of artefacts with similarity above that threshold [2, 9]. Unfortunately, due to the limitations of both the humans developing artefacts and the IR techniques, the set of retrieved links does not in general coincide with the set of correct links between the artefacts in the repository. Indeed, any IR method will fail to retrieve some of the correct links, while on the other hand it will also retrieve links that are not correct (false positives).

∗The work described in this paper is supported by the project TOCAI.IT (Knowledge-oriented technologies for enterprise aggregation in Internet), funded by MiUR (Ministero dell'Università e della Ricerca) within the FIRB (Fondo per gli Investimenti della Ricerca di Base) program. †Corresponding author.
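The threshold-based candidate link selection described above can be sketched as follows; the similarity scores and the 0.7 cut-off in the example are illustrative assumptions, not values from the paper.

```python
def candidate_links(similarities, threshold=0.7):
    """Keep only artefact pairs whose textual similarity reaches the threshold.

    `similarities` maps (source, target) pairs to a similarity score in [0, 1].
    Returns the candidate links ranked by decreasing similarity.
    """
    candidates = [(pair, sim) for pair, sim in similarities.items()
                  if sim >= threshold]
    return sorted(candidates, key=lambda item: item[1], reverse=True)
```

Everything below the threshold is discarded, which is exactly where false positives above the threshold survive and false negatives below it are lost.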

In this paper we describe a simple way to reduce the number of false positives retrieved by IR-based traceability recovery methods. In particular, we observe that the language used in software documents can be classified as a sectorial language¹, where the terms that provide more indication of the semantics of a document are the nouns, while the verbs tend to play a connecting role and have a generic semantics [17, 18]. For this reason, we propose to act on the artefact indexing process, taking into account only the nouns contained in the artefact contents to define the semantics of an artefact. The proposed approach was applied to the Jensen-Shannon (JS) method [1, 8] and Latent Semantic Indexing (LSI) [13] to recover traceability links between different types of artefacts. We performed the experimentation on two artefact repositories that describe the same system, but in the former the language of the artefacts is Italian, while in the latter it is English. The achieved results demonstrated that, in general, the proposed approach improves the accuracy of both the JS method and LSI. We also discuss the influence of other factors (such as artefact type and language) on the accuracy of the traceability recovery method.

¹The language used by people who work in a particular area or who have a common interest [17, 18].

978-1-4244-3997-3/09/$25.00 © 2009 IEEE, ICPC 2009

The rest of the paper is organised as follows. Section 2 discusses related work, while Section 3 presents the proposed indexing process and gives some details on the two IR methods used in the experimentation. Section 4 provides details on the design of the case study. Sections 5 and 6 report and discuss the achieved results, respectively. Section 7 gives concluding remarks.

2 Related Work

Antoniol et al. [2] were the first to apply IR methods [4, 13] to the problem of recovering traceability links between software artefacts. They use both the probabilistic and vector space models [4] in order to trace source code onto software documentation. The results of the experimentation show that both models recover all correct links with almost the same number of documents retrieved.

Marcus and Maletic [20] use LSI [13] to recover traceability links between source code and documentation. They perform case studies similar in design to those in [2] and compare the accuracy of LSI with respect to the vector space and probabilistic models. The results show that LSI performs at least as well as the probabilistic and vector space models combined with full parsing of the source code and morphological analysis of the documentation.

Abadi et al. [1] compare several IR techniques to recover traceability links between code and documentation. In particular, they compare dimensionality reduction methods (e.g., LSI [4]), probabilistic and information theoretic approaches (i.e., the Jensen-Shannon method [1]), and the standard Vector Space Model (VSM) [4]. The achieved results show that the techniques that provide the best results are the standard VSM and the JS method.

Recently, several variants of basic IR methods have been proposed to improve the retrieval accuracy of IR-based traceability recovery tools. Antoniol et al. [3] discuss how a traceability recovery tool based on the probabilistic model can improve the retrieval accuracy by learning from user feedback provided as input to the tool in terms of a subset of correct traceability links (training set). The results achieved in a case study demonstrate that, as the training set increases, the method's performance improves.

Settimi et al. [23] propose three different variants of the VSM for tracing requirements to UML artefacts, code, and test cases. Moreover, the results of the reported case study demonstrate that it may be more effective to retrieve UML artefacts as an intermediate step to retrieve code.

Cleland-Huang et al. [6] propose three different strategies for incorporating supporting information into a probabilistic retrieval algorithm, namely hierarchical modelling, logical clustering of artefacts, and semi-automated pruning of the probabilistic network. The results achieved in the reported case study indicate significant overlap between the hierarchical and clustering techniques. In [25] the authors propose an approach to enhance standard IR metrics using query term coverage and phrasing. The results achieved in two case studies show that the best retrieval accuracy can be achieved by using both of the proposed enhancements.

Hayes et al. [16] present a traceability recovery tool, called Requirements Tracing On-target (RETRO), that uses VSM or LSI for computing similarity measures among requirements. The results achieved in two case studies demonstrate that good results are achieved when user feedback is used to change the term weights in the term-by-document matrix.

De Lucia et al. [9] discuss how a high recall cannot be achieved without also recovering too many false positives. In particular, they show (through several case studies where LSI is used to recover traceability links) that what is really useful is an incremental traceability recovery approach that gradually identifies the threshold at which it is more convenient to stop the traceability recovery process, as the effort to discard false positives becomes too high. In [10] the authors also present a critical analysis of using feedback within the incremental traceability recovery process presented in [9]. The results of two reported case studies show that even though the retrieval accuracy generally improves with the use of feedback, IR-based approaches are still far from solving the problem of recovering all correct links with a low classification effort.

Lormans and van Deursen [19] use LSI to reconstruct traceability links among requirements, design documents, and test specifications. They also define a new strategy for selecting traceability links based on the combination of variable and constant thresholds.

In the last decade several IR-based traceability recovery tools have also been proposed to support the software engineer during the traceability recovery process [6, 9, 16, 19, 21]. In some cases, the usefulness of such tools has also been assessed through user studies and controlled experiments. The achieved results revealed that such tools significantly reduce both the time spent by the software engineer to complete the task and the tracing errors [12]. Also, experiments show that the incremental process proposed in [9] reduces the effort to classify proposed links with respect to a “one-shot” approach, where the full ranked list of links is proposed without similarity information and filtering [11].

3 IR-based Traceability Recovery

An IR-based traceability recovery tool uses an IR technique to compare a set of source artefacts (used as a query) against another (even overlapping) set of target artefacts and rank the similarity of all possible pairs of artefacts.

3.1 Artefact Indexing

The artefact indexing process aims at extracting information about the occurrences of terms (words) within the artefacts. Such information is used to characterise the artefact content. The extraction of the terms is preceded by a text normalisation aimed at (i) removing most non-textual tokens (i.e., operators, special symbols, some numbers, etc.) and (ii) splitting source code identifiers composed of two or more words into separate words.

The extracted information is stored in an m × n matrix (called the term-by-document matrix), where m is the number of all terms that occur in all the artefacts, and n is the number of artefacts in the repository. A generic entry a_ij of this matrix denotes a measure of the weight (i.e., relevance) of the i-th term in the j-th document [4].
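The normalisation and matrix construction described above can be sketched as follows; this is a minimal illustration using raw term frequencies as the weights a_ij (weighting schemes such as tf-idf would replace the raw counts), with a simple camelCase/snake_case identifier splitter.

```python
import re
from collections import Counter

def split_identifier(token):
    """Split camelCase and snake_case identifiers into their component words."""
    parts = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token).replace("_", " ")
    return [w.lower() for w in parts.split() if w]

def term_by_document_matrix(artefacts):
    """Build an m x n term-by-document matrix of raw term frequencies.

    `artefacts` is a list of artefact texts; rows are terms, columns artefacts.
    """
    counts = []
    for text in artefacts:
        # Keep only word-like tokens, discarding operators, symbols, numbers.
        tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)
        words = [w for tok in tokens for w in split_identifier(tok)]
        counts.append(Counter(words))
    terms = sorted(set(t for c in counts for t in c))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix
```

For instance, the identifier `getUserName` contributes the terms "get", "user", and "name" to its artefact's column.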

3.1.1 Indexing all the Terms

In a classical indexing process all the terms extracted from the artefact content are used to define the semantics of the artefact. However, a stop word function and/or a stop word list are applied to discard common terms (i.e., articles, adverbs, etc.) that are not useful to capture the semantics of the artefact (term filtering). In particular, the stop word function prunes out all the words having a length less than a fixed threshold, while the stop word list is used to cut off all the words contained in a given word list. A morphological analysis of the extracted terms, like stemming [22], can also be performed, aiming at removing suffixes of words to extract their stems.

3.1.2 Indexing only the Nouns

The term filtering proposed in this paper is not based on a stop word list and/or a stop word function, but on the grammatical nature of the extracted terms. In particular, following [17, 18], only the nouns contained in the artefact contents are used to build the artefact corpus. To achieve such a filtering, the artefact content is pre-processed using a Part-of-speech (POS) tagger [5] that tags each term with its grammatical nature (e.g., verb, noun, adjective)². Analysing such tags, it is possible to filter out all the terms that are not nouns.
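The noun filter itself is then a one-line selection over the tagger's output. In the sketch below a hand-made tag lookup stands in for a real POS tagger (the paper uses TreeTagger), so both the lookup table and its tags are illustrative assumptions.

```python
# Stub standing in for a real POS tagger such as TreeTagger; the entries
# below are invented for illustration.
FAKE_POS_TAGS = {
    "doctor": "NOUN", "updates": "VERB", "patient": "NOUN",
    "record": "NOUN", "quickly": "ADV", "new": "ADJ",
}

def keep_only_nouns(terms, tagger=FAKE_POS_TAGS.get):
    """Filter the term list down to nouns, given a tagger mapping term -> tag."""
    return [t for t in terms if tagger(t) == "NOUN"]
```

Under this indexing, a sentence such as "doctor updates new patient record quickly" contributes only "doctor", "patient", and "record" to the corpus.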

3.2 Artefact Classification

Different IR methods can be used to calculate the textual similarity between two artefacts. In our experimentation, we compare the retrieval accuracy of two IR methods, i.e., a probabilistic model, namely the Jensen-Shannon (JS) method [1, 8], and a vector space based model, namely Latent Semantic Indexing (LSI) [13].

²In our study we used TreeTagger, available for download at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

3.2.1 Jensen-Shannon Method

The Jensen-Shannon (JS) similarity model is an IR technique recently proposed by Abadi et al. [1]. It is driven by a probabilistic approach and hypothesis testing techniques. Like other probabilistic models, it represents each document through a probability distribution. This means that an artefact is represented by a random variable whose state probabilities are given by the empirical distribution of the terms occurring in the artefact (i.e., the columns of the term-by-document matrix). It is worth noting that the empirical distribution of a term is based on the weight assigned to the term for the specific artefact [1].

In the JS method the similarity between two artefacts is given by a “distance” between their probability distributions, measured using the Jensen-Shannon (JS) Divergence [8]. More details on the JS method can be found in [1, 8].
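A minimal sketch of this distance follows; it assumes the two artefact columns have been normalised into probability distributions, and it turns the divergence into a similarity by taking its complement, which is one common choice (Abadi et al. [1] give the precise formulation used in the JS method).

```python
from math import log2

def js_divergence(p, q):
    """Jensen-Shannon divergence between two term probability distributions
    (normalised columns of the term-by-document matrix). Symmetric, and
    bounded in [0, 1] when logarithms are taken in base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):  # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return kl(p, m) / 2 + kl(q, m) / 2

def js_similarity(p, q):
    """Similarity as the complement of the divergence: identical distributions
    score 1, distributions with disjoint support score 0."""
    return 1 - js_divergence(p, q)
```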

3.2.2 Latent Semantic Indexing

Latent Semantic Indexing (LSI) [13] is an extension of the Vector Space Model (VSM) [4]. In the VSM, artefacts are represented as vectors of terms (i.e., columns of the term-by-document matrix) that occur within artefacts in a repository. Thus, the similarity between two artefacts can be measured by the cosine of the angle between the corresponding vectors, which increases as more terms are shared. This means that VSM does not take into account relations between terms. For instance, having “automobile” in one artefact and “car” in another artefact does not contribute to the similarity measure between these two documents.

LSI was developed to overcome the synonymy and polysemy problems, which occur with the VSM model [13]³. In LSI the dependencies between terms and between artefacts, in addition to the associations between terms and artefacts, are explicitly taken into account. For example, both “car” and “automobile” are likely to co-occur in different artefacts with related terms, such as “motor”, “wheel”, etc. In order to exploit information about co-occurrences of terms, LSI applies Singular Value Decomposition (SVD) [13] to project the original term-by-document matrix into a reduced space, in order to diminish the obscuring “noise” in word usage. Also in this case, the cosine of the angle between the reduced artefact vectors can be used to define the similarity between the corresponding artefacts. More details on LSI can be found in [13].

³It is worth noting that the JS method also does not take into account relations between terms. Thus, the synonymy and polysemy problems occur in the JS method as well.
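The SVD projection and cosine comparison can be sketched in a few lines of numpy; this is a minimal illustration assuming raw term counts and a rank-k truncation, and the toy matrix in the usage example (with "car"/"automobile" as synonyms) is invented for this sketch.

```python
import numpy as np

def lsi_similarity(matrix, k):
    """Project a term-by-document matrix into a rank-k LSI space via SVD and
    return the document-by-document cosine similarity matrix."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k, :]).T          # reduced document vectors (n x k)
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    docs = docs / np.where(norms == 0, 1, norms)   # normalise for cosine
    return docs @ docs.T
```

With a matrix where document 0 contains "car, motor, wheel", document 1 contains "automobile, motor, wheel", and document 2 is unrelated, the rank-2 projection rates documents 0 and 1 as nearly identical even though they share no term for the vehicle itself, which is exactly the synonymy effect described above.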


Table 1. Traceability recovery methods used in the experimentation

Acronym     Description
JS-ALL      Jensen-Shannon method where all the terms are considered during the indexing process.
LSI-ALL     LSI method where all the terms are considered during the indexing process.
JS-NOUN     Jensen-Shannon method where only the nouns are considered during the indexing process.
LSI-NOUN    LSI method where only the nouns are considered during the indexing process.

4 Experimental Evaluation

In this section we describe in detail the design of a case study carried out to assess the role played by the nouns contained in the artefacts in the context of IR-based traceability recovery. The case study was conducted following the guidelines given by Wohlin et al. [24].

4.1 Definition and Context

The goal of the experiment was to analyse whether the accuracy of IR-based traceability recovery methods improves when indexing only the nouns contained in the artefact contents. The quality focus was ensuring better recovery accuracy, while the perspective was both (i) that of a researcher, who wants to evaluate how the terms considered during the artefact indexing influence the recovery accuracy; and (ii) that of a project manager, who wants to evaluate the possibility of adopting the proposed approach and tool within his/her own organisation.

The experiment was conducted on EasyClinic, a software system developed by final year students at the University of Salerno (Italy). The application provides support to manage a doctor's office and is composed of 30 use cases, 20 UML interaction diagrams, 63 test cases, and 37 code classes. The language of the artefacts is Italian, and the original traceability matrix provided by the developers was used as an oracle to evaluate the accuracy of the IR-based traceability recovery method. The total number of correct links contained in the traceability matrix is 1,005.

4.2 Planning

In order to evaluate the proposed approach we recovered traceability links between the code and the documentation (i.e., use cases, interaction diagrams, and test cases) of EasyClinic using two different artefact indexing processes to build the term-by-document matrix:

• All: all the terms contained in the artefact contents were considered during the indexing process. This means that the semantics of an artefact is defined on the basis of all the terms contained in the artefacts, e.g., nouns, adjectives, and verbs.

• Noun: only the nouns contained in the artefact contents were considered as keywords. Thus, all the other types of terms (e.g., adjectives, verbs) were discarded, and the semantics of an artefact is defined on the basis of only the nouns contained in the artefact content.

Note that in both artefact indexing processes a stop word function and/or a stop word list are applied, while we did not use stemming, since its effects are variable (sometimes resulting in small improvements, sometimes in small decreases in accuracy) [15].

We were also interested in analysing how the proposed approach interacts with the IR method used to calculate the artefact similarity. For this reason, the proposed approach was experimented with a probabilistic model, namely the Jensen-Shannon (JS) method [1], as well as a vector space based model, namely Latent Semantic Indexing (LSI) [13]. Thus, for each IR method the two different indexing processes, i.e., All and Noun, were performed (see Table 1). The four different approaches were used to perform the following traceability recovery activities on the EasyClinic repository:

A1 recovering traceability links between use cases and code classes (83 is the total number of correct links);

A2 recovering traceability links between UML interaction diagrams and code classes (69 is the total number of correct links);

A3 recovering traceability links between test cases and code classes (200 is the total number of correct links).

Finally, we also wanted to analyse how the proposed approach interacts with the artefact language. For this reason, the original artefacts of the EasyClinic repository were translated into English, and the same experiments performed on the original repository were also performed on the translated repository. Summarising, we executed twenty-four experiments considering all the possible combinations of the experimental variables, i.e., indexing processes (All and Noun), IR methods (JS and LSI), traceability recovery activities (A1, A2, and A3), and artefact languages (Italian and English).

4.3 Hypothesis Formulation

Two null hypotheses were formulated for testing whether the accuracy of an IR-based traceability recovery method improves when taking into account only the nouns contained in the artefacts during the indexing process:


H0JS: indexing the artefacts using only the nouns does not significantly improve the accuracy of the JS method.

H0LSI: indexing the artefacts using only the nouns does not significantly improve the accuracy of the LSI method.

When the null hypothesis can be rejected with relatively high confidence it is possible to formulate an alternative hypothesis, which admits a positive effect of the nouns on the retrieval accuracy:

HaJS: indexing the artefacts using only the nouns significantly improves the accuracy of the JS method.

HaLSI: indexing the artefacts using only the nouns significantly improves the accuracy of the LSI method.

We also formulated three other null hypotheses to investigate how the IR method, the artefact types, and the language of the artefacts interact with the indexing process and affect the retrieval accuracy:

H0M: the IR method does not significantly interact with the type of terms considered during the artefact indexing.

H0T: the artefact type does not significantly interact with the type of terms considered during the artefact indexing.

H0L: the language of the artefact does not significantly interact with the type of terms considered during the artefact indexing.

The related alternative hypotheses can be easily derived.

4.4 Identification of Experimental Factors

The main factor of our study, denoted as Terms, is represented by the terms considered during the indexing process, i.e., all terms (All) or only nouns (Noun). However, in order to better assess the effect of Terms it is necessary to control other factors that may impact the retrieval accuracy and interact with the effect of Terms. In the context of our study, we identify the following factors:

• Method: the IR method used to calculate the textual similarity between artefacts. As described in Section 4.2 we used a probabilistic model, i.e., the Jensen-Shannon (JS) method [1], as well as a vector space based model, i.e., Latent Semantic Indexing (LSI) [13].

• Artefact type: as described in Section 4.2, the experiment involved three traceability recovery activities on the EasyClinic artefact repository. In particular, we recovered traceability links between code classes and use cases, UML interaction diagrams, and test cases, respectively.

• Language: the language of the artefacts. The language of the original artefacts of EasyClinic is Italian. However, the artefacts of the EasyClinic repository were also translated into English.

4.5 Data Collection and Analysis

In order to evaluate the proposed approach, for each traceability recovery activity (see Section 4.2) we collected the number of correct links and false positives retrieved by the method. The number of correct links and false positives were automatically identified by a tool simulating the behaviour of the software engineer during the classification of the proposed links. In particular, the tool takes as input the ranked list of candidate links and classifies each link as a correct link or a false positive until all correct links are recovered. Such a classification is automatically performed by the tool exploiting the original traceability matrix as an oracle.
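The simulation described above can be sketched as follows; this is an illustration of the walk over the ranked list, not the authors' tool, and it records the cumulative number of false positives encountered before each correct link.

```python
def false_positives_per_correct_link(ranked_links, oracle):
    """Walk the ranked list of candidate links, classifying each link against
    the oracle (the set of correct links), and record how many false positives
    were retrieved before each correct link, stopping once all correct links
    have been recovered."""
    fp_counts, fp = [], 0
    remaining = set(oracle)
    for link in ranked_links:
        if link in remaining:
            fp_counts.append(fp)
            remaining.remove(link)
            if not remaining:           # all correct links recovered
                break
        else:
            fp += 1                     # classified as a false positive
    return fp_counts
```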

A preliminary evaluation of the proposed approach can be obtained using two well-known Information Retrieval (IR) metrics, namely recall and precision [4]:

recall = |correct ∩ retrieved| / |correct| %      precision = |correct ∩ retrieved| / |retrieved| %

where correct and retrieved represent the set of correct links and the set of links retrieved by the tool, respectively.
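The two metrics translate directly into code; the sketch below computes both as percentages over sets of links, matching the definitions above.

```python
def recall_precision(correct, retrieved):
    """Recall and precision (as percentages) of a set of retrieved links
    against the set of correct links."""
    hits = len(correct & retrieved)
    recall = 100 * hits / len(correct)
    precision = 100 * hits / len(retrieved)
    return recall, precision
```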

We also performed a further comparison of the proposed retrieval techniques based on the analysis of the statistical significance of the difference in the false positives retrieved. Such an analysis uses a statistical significance test aiming at verifying that the false positives retrieved by one method are significantly lower than the false positives retrieved by another method. Thus, the dependent variable of our study is FP, representing the number of false positives retrieved by the traceability recovery method for each correct link identified. Since the number of correct links is the same for each traceability recovery activity (i.e., the data was paired), we decided to use the Wilcoxon Rank Sum test [7] to test the statistical significance of the difference between the false positives retrieved by two traceability recovery methods. The results were intended as statistically significant at α = 0.05. Moreover, the interaction of Terms with the IR method, the artefact type, and the language of the artefacts was analysed by using the Two-Way Analysis of Variance (ANOVA) [14] on the whole data set. The interaction between factors was also analysed by interaction plots. These are simple line graphs where the means on the dependent variable for each level of one factor are plotted over all the levels of the second factor. The resulting profiles are parallel when there is no interaction and nonparallel when interaction is present [14].
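The per-link FP counts of two methods form paired samples. As an illustration of the ranking step that the Wilcoxon family of tests is built on (the study would rely on a statistics package for the full test and its p-value), the following pure-Python fragment computes the signed-rank statistic W+ for paired data:

```python
def signed_rank_statistic(x, y):
    """Wilcoxon signed-rank statistic W+ for paired samples: rank the non-zero
    absolute differences (average ranks for ties) and sum the ranks of the
    positive differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied group (1-based)
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    return sum(r for d, r in zip(diffs, ranks) if d > 0)
```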


Figure 1. Precision/Recall curves for JS-ALL, LSI-ALL, JS-NOUN, and LSI-NOUN: (a) tracing code classes onto use cases of EasyClinic-ITA; (b) tracing code classes onto use cases of EasyClinic-ENG; (c) tracing code classes onto UML interaction diagrams of EasyClinic-ITA; (d) tracing code classes onto UML interaction diagrams of EasyClinic-ENG; (e) tracing code classes onto test cases of EasyClinic-ITA; (f) tracing code classes onto test cases of EasyClinic-ENG.

5 Experimental Results

This section reports on the results achieved tracing code

classes onto use cases, UML interaction diagrams, and test

cases, respectively, adopting the traceability recovery ap-

proaches described in Section 4.2. Figure 1 shows the pre-

cision/recall curves achieved in all the experiments, while

Table 2 reports the precision values and the number of false

positives at different levels of recall (i.e., 100% and 80%)

grouped by method and traceability recovery activity. The

table also reports, for each IR method, the differences in the precision values, as well as in the number of false positives retrieved, between the two artefact indexing processes, i.e., Noun and All.

A preliminary analysis shows that, in general, the retrieval accuracy improves when only the nouns are considered during the indexing process (see Figure 1). In particular, at 80% recall there are several cases where precision improves by about 20% (see Table 2). A less evident improvement is also achieved at 100% recall, where in some cases precision improves by about 5% (see Table 2). In


Table 2. Precision values and number of false positives at different levels of recall (i.e., 100% and 80%). For each cell, the first line reports the precision (%) and the second line the number of false positives. Source artefacts are code classes.

Data set        Target          Prec(100%) JS          Prec(100%) LSI         Prec(80%) JS           Prec(80%) LSI
                artefacts       Noun   All    Diff     Noun   All    Diff     Noun   All    Diff     Noun   All    Diff
EasyClinic-ITA  Use cases       12.58  8.49   +4.09    13.19  8.18   +5.01    46.30  26.31  +19.99   47.47  26.69  +20.78
                                646    1,003  -357     612    1,043  -431     87     210    -123     83     206    -123
                UML diagrams    8.16   7.80   +0.36    8.34   7.75   +0.59    28.57  40.29  -11.72   26.05  40.28  -14.23
                                778    816    -38      758    821    -63      140    83     +57      159    83     +76
                Test cases      19.78  13.35  +6.43    22.96  13.38  +9.58    45.98  22.53  +23.45   50.31  25.28  +25.03
                                811    1,298  -487     671    1,295  -624     188    550    -362     158    473    -315
EasyClinic-ENG  Use cases       9.38   7.58   +1.80    9.38   7.62   +1.76    27.78  27.27  +0.51    29.07  28.09  +0.98
                                898    1,133  -235     852    1,128  -276     195    200    -5       183    192    -9
                UML diagrams    7.73   7.78   -0.05    7.66   7.74   -0.08    38.89  26.79  +12.10   40.29  24.56  +15.73
                                824    818    +6       832    822    +10      88     153    -65      83     172    -89
                Test cases      18.47  10.89  +7.58    20.10  11.27  +8.83    23.98  28.93  -4.95    31.87  29.14  +2.73
                                883    1,636  -753     795    1,574  -779     507    393    +114     342    389    -47

order to better clarify the accuracy improvement for the end user, it is necessary to analyse the number of false positives retrieved by the method. As we can see, in several cases it is possible to achieve a considerable reduction of false positives, especially when the goal is to recover all correct links (i.e., recall equal to 100%). In particular, when tracing code classes onto test cases the number of false positives retrieved by both JS-ALL and LSI-ALL is about twice the number of false positives retrieved by JS-NOUN and LSI-NOUN.
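The precision and false-positive figures discussed above are obtained by cutting the ranked list of candidate links at a given recall level. A minimal sketch of that computation (the link identifiers below are invented, not the study's data):

```python
import math

def precision_and_fp_at_recall(ranked_links, correct_links, recall_level):
    """Scan the ranked candidate links until the requested fraction of
    correct links has been retrieved; return (precision, false positives)
    at that cut point."""
    needed = math.ceil(recall_level * len(correct_links))
    retrieved = hits = 0
    for link in ranked_links:
        retrieved += 1
        if link in correct_links:
            hits += 1
            if hits >= needed:
                break
    return hits / retrieved, retrieved - hits

# Hypothetical ranked list of (use case, class) candidate links:
ranked = ["uc1-c1", "uc1-c9", "uc2-c2", "uc2-c8", "uc3-c3"]
correct = {"uc1-c1", "uc2-c2", "uc3-c3"}
```

Lowering the recall level cuts the list earlier, which is why precision at 80% recall is systematically higher than at 100% recall.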

Analysing the precision/recall curves and the data in Table 2, we also observe that when tracing code classes onto interaction diagrams of EasyClinic-ITA the noun-based artefact indexing does not provide any improvement at all. Indeed, better results can be achieved by considering all the terms during the indexing process. This contrasting result is probably due to the POS tagger used during the indexing process, which tagged several abstract nouns in the description of the UML interaction diagrams as verbs in the past participle form. Such terms were therefore discarded during the indexing process, reducing the similarity between UML interaction diagrams and code classes.
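This failure mode can be made concrete. The sketch below filters an artefact's tokens by POS tag, keeping only nouns; the toy lexicon is a stand-in for a real tagger such as TreeTagger, and every token and tag in it is invented for illustration. Note how a nominal use of "updated" is lost once the tagger labels it as a past participle:

```python
# Toy POS lexicon standing in for a real tagger such as TreeTagger.
# Tags follow the Penn Treebank convention (NN* = noun, VB = verb,
# VBN = past participle). All entries are invented for illustration.
TOY_TAGS = {
    "patient": "NN", "record": "NN", "doctor": "NN",
    "visualization": "NN", "display": "VB",
    "updated": "VBN",  # mistag: meant as a state label in the diagram
}

def index_nouns(tokens, tags=TOY_TAGS):
    """Keep only the tokens tagged as nouns (NN, NNS, NNP, ...)."""
    return [t for t in tokens if tags.get(t.lower(), "").startswith("NN")]

tokens = ["Patient", "record", "updated", "doctor"]
# "updated" is dropped because the tagger sees a past participle,
# which weakens the similarity with code classes mentioning it.
```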

The next sections report and discuss the results of the statistical analysis performed to assess the effect of the main factor (i.e., Terms) and of other factors on the dependent variable (i.e., FP). Such an analysis is necessary to better assess the effect of Terms and to analyse whether or not other factors affect the retrieval accuracy and/or interact with the effect of Terms.

5.1 Influence of Terms

Table 3 reports the results (i.e., p-values) of the Wilcoxon test used to test the null hypotheses H0JS and H0LSI. The results of the tests confirmed our initial findings. In particular, the null hypotheses can always be rejected, except when tracing UML interaction diagrams onto code classes

Table 3. Results of the Wilcoxon tests

Data set        Target artefacts   H0JS     H0LSI
EasyClinic-ITA  Use cases          < 0.01   < 0.01
                UML diagrams       0.99     0.99
                Test cases         < 0.01   < 0.01
EasyClinic-ENG  Use cases          < 0.01   < 0.01
                UML diagrams       < 0.01   < 0.01
                Test cases         < 0.01   < 0.01

Source artefacts are code classes.

of EasyClinic-ITA. Indeed, in this case significantly better results can be achieved by adopting a traditional approach (p-values are less than 0.01 for both IR methods).

5.2 Influence of other Factors

Regarding the influence of the IR method (i.e., Method), ANOVA revealed that it did not influence the achieved results (p-value = 0.39). This means that the JS method and LSI provided almost the same accuracy. Moreover, ANOVA did not reveal any interaction between Terms and Method (p-value = 0.81). Thus, we cannot reject H0M.

ANOVA also highlighted the influence of the artefact type (p-value < 0.01), as well as a statistically significant interaction between these two factors (p-value < 0.01). The latter allows us to reject the null hypothesis H0T. Analysing the interaction plot between Terms and Artefact type (see Figure 2-a), we observe that the artefact indexing Noun does not significantly influence the retrieval accuracy when tracing code classes onto interaction diagrams. Indeed, only on the repository EasyClinic-ENG does the artefact indexing Noun significantly influence the retrieval accuracy, while on EasyClinic-ITA the indexing process does not provide any improvement at all (see also Table 3).

Concerning the factor Language, ANOVA revealed no statistically significant interaction between this factor and the main factor, i.e., Terms (p-value = 0.06). Thus, we cannot reject the null hypothesis H0L. The test also revealed a statistically significant influence of Language (p-value < 0.01). Analysing the interaction plot between Terms and Language (see Figure 2-b), we observe that the effect of the artefact indexing Noun is more evident on the repository EasyClinic-ITA. However, the artefact indexing Noun provides statistically better results on both artefact repositories (see also Table 3).

Figure 2. Interaction between Terms and Artefact (a) and between Terms and Language (b)
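The interaction plots in Figure 2 are built from the mean of the dependent variable in each factor-level cell. A minimal sketch of that computation (the false-positive counts below are invented; roughly parallel profiles across the Terms levels would indicate no interaction):

```python
def interaction_profiles(observations):
    """observations: iterable of (terms_level, other_level, fp_count).

    Returns {other_level: {terms_level: mean fp}} -- one line of the
    interaction plot per level of the second factor."""
    sums, counts = {}, {}
    for terms, other, fp in observations:
        key = (terms, other)
        sums[key] = sums.get(key, 0) + fp
        counts[key] = counts.get(key, 0) + 1
    profiles = {}
    for (terms, other), total in sums.items():
        profiles.setdefault(other, {})[terms] = total / counts[(terms, other)]
    return profiles

# Invented false-positive counts, grouped by Terms level and language:
data = [
    ("Noun", "ITA", 100), ("Noun", "ITA", 120),
    ("All", "ITA", 200),
    ("Noun", "ENG", 150), ("All", "ENG", 180),
]
prof = interaction_profiles(data)
```

Plotting the two lines of `prof` against the Terms levels reproduces the shape of one panel of Figure 2.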

5.3 Threats to Validity

This section discusses the threats to validity that can affect our results, focusing on external, construct, and conclusion validity threats.

External validity concerns the generalisation of the findings. An important threat is related to the repository used in the experimentation, i.e., EasyClinic. It is not comparable to industrial projects, but the repositories used by other authors [2, 20, 16] to compare different IR methods have a comparable size. Moreover, EasyClinic has also been used to evaluate IR methods to recover traceability links [9]. However, replicating the experiment using industrial repositories is needed to generalise the achieved results. Moreover, since our approach is based on a linguistic heuristic, the language of the artefacts may play an important role and affect the achieved results. In order to mitigate such a threat, we performed the experimentation on two artefact repositories, one in Italian and the other in English. However, the two artefact repositories describe the same software project, i.e., EasyClinic. The system domain probably also plays an important role in the definition of the linguistic heuristic. Thus, also in this case we plan to replicate the experiment using other artefact repositories to further generalise the proposed approach.

Construct validity threats concern the relationship between theory and observation. In particular, recall and precision are widely used metrics for assessing an IR technique. Moreover, the number of false positives retrieved by a traceability recovery tool for each correct link retrieved well reflects its retrieval accuracy. To avoid biasing the experiment, the translation of the artefacts of EasyClinic was performed by a person who did not know the goal of our study. In order to mitigate the threat to validity represented by the quality of the English translation, we selected a person with a good knowledge of that language. Another threat to validity is related to the quality of the POS tagger used in our experimentation. We used a freely available tool with a high level of accuracy. In particular, as reported in its documentation, TreeTagger has an accuracy of about 95% for the English language, while its accuracy for Italian is about 70%. Finally, the accuracy of the oracle we used to evaluate the tracing accuracy could also affect the achieved results. To mitigate such a threat, we used the original traceability matrix provided by the original developers and validated the links during review meetings held by the original development team together with PhD students and academic researchers.

Conclusion validity concerns the relationship between the treatment and the outcome. Attention was paid not to violate the assumptions made by statistical tests. Whenever the conditions necessary to use parametric statistics did not hold (e.g., in the analysis of each experiment's data), we used non-parametric tests, in particular the Wilcoxon test for paired analyses. It is worth noting that we also used a parametric test, i.e., ANOVA, to analyse the effect of different factors even if the distribution was not normal. According to [24] this can be done since (i) the ANOVA test is a very robust test; and (ii) the distribution of false positives is very close to being normally distributed.

6 Discussion and Lessons Learned

The results achieved in the reported case study provided

us with a number of lessons learned:


Table 4. Number of indexed terms and average time (ms) required for the calculation of the SVD and the similarity between artefacts

Data set        Noun                     All
                Terms  SVD    Sim        Terms  SVD    Sim
EasyClinic-ITA  401    1,021  864        895    2,241  1,867
EasyClinic-ENG  588    1,641  1,333      901    2,437  2,073

• the nouns give the maximal information on the seman-

tics of the artefacts: the achieved results demonstrate

that the use of only the nouns to build the artefact

corpus significantly improves the retrieval accuracy of

both JS and LSI methods. This means that the findings

described in [17, 18] are valid and applicable also for

software documentation. In particular, the language

used in software artefacts can be classified as sectorial

language, where the terms that provide more indication on the semantics of a document are the nouns. It is important to note that using only the nouns during the indexing process also reduces the size of the term-by-document matrix, i.e., the number of terms. This drastic reduction in the number of terms (see Table 4) represents another important result, since it speeds up the computation (i) of the document similarity – for the JS method – and (ii) of the singular value decomposition – for LSI (see Table 4);

• the IR method does not significantly influence the re-

trieval accuracy: the results achieved in our case study

demonstrate that the IR method does not significantly

influence the retrieval accuracy of the traceability recovery method. Table 5 shows the comparison of the JS method and LSI based on the statistical significance of the difference between the false positives retrieved by the two methods. As we can see, in 6 cases out of 12 there is no significant difference between the false positives retrieved by JS and LSI. In the remaining 6 cases, JS performed significantly better than LSI in 3, while LSI provided the best results in the other 3. We also observed that LSI performed better than JS only when recovering traceability links between artefacts written in Italian. This suggests that LSI is probably more suitable than JS for languages, such as Italian, that present a complex grammar, verbs with many conjugated variants, words with different meanings in different contexts, and irregular forms for plurals, adverbs, and adjectives. Finally, with the JS method the advantages provided by the artefact indexing Noun are more evident than with LSI: indeed, the artefact indexing Noun is used in 2 (i.e., tracing use cases onto code classes) of the 3 cases where the JS method overcomes the accuracy of LSI;

Table 5. Comparison of JS and LSI methods: results of the Wilcoxon test

Data set        Target artefacts   Best method
                                   Noun          All
EasyClinic-ITA  Use cases          JS (< 0.01)   No diff.
                UML diagrams       No diff.      LSI (< 0.01)
                Test cases         LSI (< 0.01)  LSI (< 0.01)
EasyClinic-ENG  Use cases          JS (< 0.01)   No diff.
                UML diagrams       No diff.      No diff.
                Test cases         No diff.      JS (< 0.01)

Source artefacts are code classes.

• the influence of the artefact type on the retrieval accu-

racy is still an open issue: the achieved results showed

that the artefact indexing Noun does not provide any

improvement at all when tracing UML interaction di-

agrams onto code classes of EasyClinic-ITA. We ob-

served that several abstract nouns in the description

of the UML interaction diagrams were considered as

verbs in the past participle form. Thus, discarding such

terms during the indexing process reduced the similarity between code classes and UML interaction diagrams. Moreover, in general, interaction diagrams include many more verbs (action and method names are usually verbs) than nouns. This suggests that a different linguistic heuristic (i.e., a different choice of which terms should be indexed), taking into account also the verbs, should probably be used for UML interaction diagrams. However, such a problem was found only for the Italian language, since the artefact indexing Noun provides significantly better results when tracing UML interaction diagrams onto code classes of EasyClinic-ENG. All these considerations suggest that the role played by the artefact type still represents an open issue in the selection of the linguistic heuristic to adopt during the indexing process and has to be further analysed;

• the language of the artefacts influences the retrieval

accuracy: the results achieved in our case study

showed that the language of the artefacts is an influencing factor. The analysis of the interaction plot revealed that the improvement provided by the artefact indexing Noun was more evident when tracing software artefacts written in Italian, except when tracing code classes onto

UML interaction diagrams. However, the factor Lan-

guage did not interact with Terms.
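For reference, the similarity computation behind the JS method compares the term distributions of two artefacts via the Jensen-Shannon divergence. The sketch below assumes each artefact is represented as a bag of (noun) term frequencies and turns the divergence into a similarity as 1 - JSD, a common choice (the study's exact normalisation may differ; the example documents are invented):

```python
import math

def js_similarity(doc_a, doc_b):
    """1 - Jensen-Shannon divergence (log base 2, so the result lies in
    [0, 1]) between the term-frequency distributions of two artefacts."""
    def normalise(doc):
        total = sum(doc.values())
        return {t: c / total for t, c in doc.items()}
    p, q = normalise(doc_a), normalise(doc_b)
    # Mixture distribution over the union of the two vocabularies.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}
    def kl(dist):  # KL divergence of dist from the mixture m
        return sum(pr * math.log2(pr / m[t]) for t, pr in dist.items() if pr > 0)
    return 1.0 - (0.5 * kl(p) + 0.5 * kl(q))

# Hypothetical noun-only term counts for a use case and a code class:
use_case = {"patient": 2, "record": 1}
code_class = {"patient": 2, "record": 1}
```

Identical term distributions yield similarity 1, while artefacts sharing no terms yield 0, which is what makes the measure usable for ranking candidate links.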

7 Conclusion and Future Work

We described how to improve the accuracy of an IR-

based traceability recovery tool by using a linguistic heuristic during the artefact indexing. Such a heuristic is used to identify the extracted terms that have to be included in the artefact corpus and considered when defining the textual similarity between two artefacts. In particular, we decided to

index only the nouns extracted from the artefact content,

since in a sectorial language (the language of the software

documentation) the terms that provide more indication on

the semantics of a document are the nouns [17, 18].

The results achieved in a reported case study demon-

strated that, in general, the proposed approach improves

the retrieval accuracy of an IR-based traceability recovery

method based on the probabilistic or vector space based

models. It is worth noting that replication in different con-

texts and with different objects is the only way to corrobo-

rate our findings. Replicating the experiment using different

repositories, IR methods, and traceability recovery activi-

ties is part of the agenda of our future work. We also plan to

experiment with the proposed artefact indexing combined with

other enhancing strategies, such as stemming [22] and user

feedback analysis [4].

References

[1] A. Abadi, M. Nisenson, and Y. Simionovici. A traceabil-

ity technique for specifications. In Proceedings of 16th

IEEE International Conference on Program Comprehen-

sion, pages 103–112. IEEE CS Press, 2008.

[2] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and

E. Merlo. Recovering traceability links between code and

documentation. IEEE Transactions on Software Engineer-

ing, 28(10):970–983, 2002.

[3] G. Antoniol, G. Casazza, and A. Cimitile. Traceability re-

covery by modelling programmer behaviour. In Proceedings

of 7th Working Conference on Reverse Engineering, pages 240–247, Brisbane, Queensland, Australia, 2000. IEEE CS

Press.

[4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information

Retrieval. Addison-Wesley, 1999.

[5] E. Charniak. Statistical techniques for natural language pars-

ing. AI Magazine, 18(4):33–44, 1997.

[6] J. Cleland-Huang, R. Settimi, C. Duan, and X. Zou. Utiliz-

ing supporting evidence to improve dynamic requirements

traceability. In Proceedings of 13th IEEE International Re-

quirements Engineering Conference, pages 135–144, Paris,

France, 2005. IEEE CS Press.

[7] W. J. Conover. Practical Nonparametric Statistics. Wiley,

3rd edition, 1998.

[8] T. M. Cover and J. A. Thomas. Elements of Information

Theory. Wiley-Interscience, 1991.

[9] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Re-

covering traceability links in software artefact management

systems using information retrieval methods. ACM Trans-

actions on Software Engineering and Methodology, 16(4),

2007.

[10] A. De Lucia, R. Oliveto, and P. Sgueglia. Incremental ap-

proach and user feedbacks: a silver bullet for traceability

recovery. In Proceedings of 22nd IEEE International Con-

ference on Software Maintenance, pages 299–309, Shera-

ton Society Hill, Philadelphia, Pennsylvania, 2006. IEEE CS

Press.

[11] A. De Lucia, R. Oliveto, and G. Tortora. IR-based traceabil-

ity recovery processes: an empirical comparison of “one-

shot” and incremental processes. In Proceedings of 23rd

International Conference Automated Software Engineering,

pages 39–48, L’Aquila, Italy, 2008. ACM Press.

[12] A. De Lucia, R. Oliveto, and G. Tortora. Assessing IR-based

traceability recovery tools through controlled experiments.

Empirical Software Engineering, 14(1):57–93, 2009.

[13] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer,

and R. Harshman. Indexing by latent semantic analysis.

Journal of the American Society for Information Science,

41(6):391–407, 1990.

[14] J. L. Devore and N. Farnum. Applied Statistics for Engineers

and Scientists. Duxbury, 1999.

[15] S. T. Dumais. Improving the retrieval of information from

external sources. Behavior Research Methods, Instruments

and Computers, 23:229–236, 1991.

[16] J. H. Hayes, A. Dekhtyar, and S. K. Sundaram. Advanc-

ing candidate link generation for requirements tracing: The

study of methods. IEEE Transactions on Software Engineer-

ing, 32(1):4–19, 2006.

[17] D. Jurafsky and J. Martin. Speech and Language Processing.

Prentice Hall, 2000.

[18] E. L. Keenan. Formal Semantics of Natural Language. Cam-

bridge University Press, 1975.

[19] M. Lormans and A. van Deursen. Can LSI help reconstruct-

ing requirements traceability in design and test? In Proceed-

ings of 10th European Conference on Software Maintenance

and Reengineering, pages 45–54, Bari, Italy, 2006. IEEE CS

Press.

[20] A. Marcus and J. I. Maletic. Recovering documentation-

to-source-code traceability links using latent semantic in-

dexing. In Proceedings of 25th International Conference

on Software Engineering, pages 125–135, Portland, Oregon,

USA, 2003. IEEE CS Press.

[21] A. Marcus, X. Xie, and D. Poshyvanyk. When and how

to visualize traceability links? In Proceedings of 3rd In-

ternational Workshop on Traceability in Emerging Forms of

Software Engineering, pages 56–61, Long Beach California,

USA, 2005. ACM Press.

[22] M. F. Porter. An algorithm for suffix stripping. Program,

14(3):130–137, 1980.

[23] R. Settimi, J. Cleland-Huang, O. Ben Khadra, J. Mody,

W. Lukasik, and C. De Palma. Supporting software evo-

lution through dynamically retrieving traces to UML arti-

facts. In Proceedings of 7th IEEE International Workshop

on Principles of Software Evolution, pages 49–54, Kyoto,

Japan, 2004. IEEE CS Press.

[24] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell,

and A. Wesslen. Experimentation in Software Engineering -

An Introduction. Kluwer, 2000.

[25] X. Zou, R. Settimi, and J. Cleland-Huang. Term-based

enhancement factors for improving automated requirement

trace retrieval. In Proceedings of International Symposium

on Grand Challenges in Traceability, pages 40–45, Lexing-

ton, Kentuky, USA, 2007. ACM Press.
