On the Role of the Nouns in IR-based Traceability Recovery∗
Giovanni Capobianco⋆, Andrea De Lucia⋄, Rocco Oliveto⋄†, Annibale Panichella⋆, Sebastiano Panichella⋆
⋆STAT – University of Molise, Contrada Fonte Lappone, Pesche (IS), Italy
⋄DMI – University of Salerno, Via Ponte don Melillo, Fisciano (SA), Italy
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
The intensive human effort needed to manually man-
age traceability information has increased the interest in
utilising semi-automated traceability recovery techniques.
This paper presents a simple way to improve the accuracy
of traceability recovery methods based on Information Re-
trieval techniques. The proposed method acts on the arte-
fact indexing considering only the nouns contained in the
artefact content to define the semantics of an artefact. The
rationale behind such a choice is that the language used
in software documents can be classified as a sectorial lan-
guage, where the terms that provide more indication on the
semantics of a document are the nouns. The results of a
reported case study demonstrate that the proposed artefact
indexing significantly improves the accuracy of traceability
recovery methods based on the probabilistic or vector space
based IR models.
1. Introduction
A lot of effort in the software engineering community
(both research and commercial) has been devoted to im-
prove the management of the dependences (i.e., traceability
links) between software artefacts. Traceability information
can provide important insights into system development
and evolution assisting in program comprehension, impact
analysis, and reuse of existing software [2]. However, estab-
lishing and maintaining traceability links between software
artefacts produced or modified in the software life-cycle are
costly, tedious, and error prone activities that are crucial but
frequently neglected in practice.
The need to provide software engineers with methods
and tools supporting traceability recovery has been widely
∗The work described in this paper is supported by the project TOCAI.IT
(Knowledge-oriented technologies for enterprise aggregation in Internet),
funded by MiUR (Ministero dell'Università e della Ricerca) within the
FIRB (Fondo per gli Investimenti della Ricerca di Base) program.
†Corresponding author.
recognised in recent years. In particular, several re-
searchers have proposed the use of Information Retrieval
(IR) [4, 13] techniques for recovering traceability links be-
tween artefacts of different types [2, 3, 6, 9, 10, 16, 19,
20, 23]. The idea behind such methods is that most of
the software documentation is text based or contains textual
descriptions and that programmers use meaningful domain
terms to define source code identifiers [2]. Thus, IR-based
methods recover traceability links on the basis of the sim-
ilarity between the text contained in the software artefacts.
Because the number of all possible links can be very high,
such tools use a similarity threshold to consider as candidate
traceability links only the pairs of artefacts with similarity
above such a threshold [2, 9]. Unfortunately, due to the lim-
itations of both the humans developing artefacts and the IR
techniques the set of retrieved links does not in general co-
incide with the set of correct links between the artefacts in
the repository. Indeed, any IR method will fail to retrieve
some of the correct links, while on the other hand it will
also retrieve links that are not correct (false positives).
In this paper we describe a simple way to reduce the
number of false positives retrieved by IR-based traceabil-
ity recovery methods. In particular, we observe that the
language used in software documents can be classified as a
sectorial language¹, where the terms that provide more indi-
cation on the semantics of a document are the nouns, while
the verbs tend to play a connection role and have a generic
semantics [17, 18]. For this reason, we propose to act on
the artefact indexing process taking into account only the
nouns contained in the artefact contents to define the se-
mantics of an artefact. The proposed approach was applied
to the Jensen-Shannon (JS) method [1, 8] and Latent Se-
mantic Indexing (LSI) [13] to recover traceability links be-
tween different types of artefacts. We performed the ex-
perimentation on two artefact repositories that describe the
same system: in the former the language of the artefacts
is Italian, while in the latter it is English. The achieved re-
sults demonstrated that, in general, the proposed approach
¹The language used by people who work in a particular area or who
have a common interest [17, 18].
978-1-4244-3997-3/09/$25.00 © 2009 IEEE, ICPC 2009
improves the accuracy of both the JS method and LSI. We
also discuss the influence of other factors (such as artefact
type and language) on the accuracy of the traceability
recovery method.
The rest of the paper is organised as follows. Section
2 discusses related work, while Section 3 presents the pro-
posed indexing process and gives some details on the two
IR methods used in the experimentation. Section 4 provides
details on the design of the case study. Sections 5 and
6 report and discuss the achieved results, respectively.
Section 7 gives concluding remarks.
2 Related Work
Antoniol et al. [2] were the first to apply IR methods
[4, 13] to the problem of recovering traceability links be-
tween software artefacts. They use both the probabilistic
and vector space models [4] in order to trace source code
onto software documentation. The results of the experimen-
tation show that both models recover all correct links with
almost the same number of documents retrieved.
Marcus and Maletic [20] use LSI [13] to recover trace-
ability links between source code and documentation. They
perform case studies similar in design to those in [2] and
compare the accuracy of LSI with respect to the vector
space and probabilistic models. The results show that LSI
performs at least as well as the probabilistic and vector
space models combined with full parsing of the source code
and morphological analysis of the documentation.
Abadi et al. [1] compare several IR techniques to re-
cover traceability links between code and documentation.
In particular, they compare dimensionality reduction meth-
ods (e.g., LSI [4]), probabilistic and information theoretic
approaches (i.e., the Jenson-Shannon method [1]), and the
standard Vector Space Model (VSM) [4]. The achieved re-
sults show that the techniques that provide the best results
are the standard VSM and the JS method.
Recently, several variants of basic IR methods have been
proposed to improve the retrieval accuracy of IR-based
traceability recovery tools. Antoniol et al. [3] discuss how a
traceability recovery tool based on the probabilistic model
can improve the retrieval accuracy by learning from user
feedback provided as input to the tool in terms of a subset of
correct traceability links (training set). The results achieved
in a case study demonstrate that, as the training set increases,
the method's performance improves.
Settimi et al. [23] propose three different variants of the
VSM for tracing requirements to UML artefacts, code, and
test cases. Moreover, the results of the reported case study
demonstrate that it may be more effective to retrieve UML
artefacts as an intermediate step to retrieve code.
Cleland-Huang et al. [6] propose three different strate-
gies for incorporating supporting information into a proba-
bilistic retrieval algorithm, namely hierarchical modelling,
logical clustering of artefacts and semi-automated pruning
of the probabilistic network. The results achieved in the re-
ported case study indicate significant overlap between the
hierarchical and clustering techniques. In [25] the authors
propose an approach to enhance standard IR metrics using
query term coverage and phrasing. The results achieved in
two case studies show that the best retrieval accuracy can be
achieved by using both the proposed enhancements.
Hayes et al. [16] present a traceability recovery tool,
called Requirements Tracing On-target (RETRO), that uses
VSM or LSI for computing similarity measures among
requirements. The results achieved in two case studies
demonstrate that good results are achieved when user feed-
back is used to change the term weights in the term-by-
document matrix.
De Lucia et al. [9] discuss how a high recall cannot
be achieved without also recovering too many false posi-
tives. In particular, they show (through several case studies
where LSI is used to recover traceability links) that what
is really useful is to use an incremental traceability recov-
ery approach to gradually identify the threshold where it is
more convenient to stop the traceability recovery process,
as the effort to discard false positives is too high. In [10]
the authors also present a critical analysis of using feed-
back within the incremental traceability recovery process
presented in [9]. The results of two reported case studies
show that even though the retrieval accuracy generally improves
with the use of feedback, IR-based approaches are
still far from solving the problem of recovering all correct
links with a low classification effort.
Lormans and van Deursen [19] use LSI to reconstruct
traceability links among requirements, design documents
and test specifications. They also define a new strategy
for selecting traceability links based on the combination of
variable and constant thresholds.
In the last decade several IR-based traceability recovery
tools have also been proposed to support the software engi-
neer during the traceability recovery process [6, 9, 16, 19,
21]. In some cases, the usefulness of such tools has also
been assessed through user studies and controlled experi-
ments. The achieved results revealed that the tool signif-
icantly reduces the time spent by the software engineer to
complete the task and the tracing errors [12]. Also, exper-
iments show that the incremental process proposed in [9]
reduces the effort to classify proposed links with respect to
a “one-shot” approach, where the full ranked list of links is
proposed without similarity information and filtering [11].
3 IR-based Traceability Recovery
An IR-based traceability recovery tool uses an IR tech-
nique to compare a set of source artefacts (used as a query)
against another (even overlapping) set of target artefacts and
rank the similarity of all possible pairs of artefacts.
3.1 Artefact Indexing
The artefact indexing process aims at extracting infor-
mation about the occurrences of terms (words) within the
artefacts. Such information is used to characterise the arte-
fact content. The extraction of the terms is preceded by a
text normalisation aiming at (i) removing most non-textual
tokens (i.e., operators, special symbols, some numbers, etc.)
and (ii) splitting into separate words source code identifiers
composed of two or more words.
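The identifier-splitting step can be sketched as follows; this is an illustrative implementation (the paper does not specify its exact splitting algorithm), handling both underscore-separated and camel-case identifiers:

```python
import re

def split_identifier(identifier):
    """Split a source code identifier into its component words,
    e.g. "getPatientRecord" -> ["get", "patient", "record"]."""
    # First break on underscores, then on lower-to-upper transitions.
    parts = identifier.replace("_", " ")
    parts = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", parts)
    return [w.lower() for w in parts.split()]

print(split_identifier("getPatientRecord"))  # ['get', 'patient', 'record']
print(split_identifier("use_case_id"))       # ['use', 'case', 'id']
```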
The extracted information is stored in a m × n matrix
(called term-by-document matrix), where m is the number
of all terms that occur in all the artefacts, and n is the num-
ber of artefacts in the repository. A generic entry ai,j of this
matrix denotes a measure of the weight (i.e., relevance) of
the ith term in the jth document [4].
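As an illustration, a term-by-document matrix with raw term-frequency weights can be built as below; the artefact texts are hypothetical, and the actual weighting scheme used may differ (e.g., tf-idf):

```python
# Hypothetical artefacts: use case identifiers mapped to their text.
artefacts = {
    "UC1": "patient visits the doctor office",
    "UC2": "doctor updates the patient record",
}

# m x n matrix: rows = terms, columns = artefacts in the repository.
terms = sorted({t for text in artefacts.values() for t in text.split()})
matrix = [[text.split().count(t) for text in artefacts.values()]
          for t in terms]

for t, row in zip(terms, matrix):
    print(f"{t:8s} {row}")
```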
3.1.1 Indexing all the Terms
In a classical indexing process all the terms extracted from
the artefact content are used to define the semantics of the
artefact. However, a stop word function and/or a stop word
list are applied to discard common terms (i.e., articles, ad-
verbs, etc.) that are not useful to capture the semantics of the
artefact (term filtering). In particular, the stop word func-
tion prunes out all the words having a length less than a
fixed threshold, while the stop word list is used to cut-off
all the words contained in a given word list. A morphologi-
cal analysis, like stemming [22], of the extracted terms can
also be performed aiming at removing suffixes of words to
extract their stems.
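A minimal sketch of this term-filtering step, using an illustrative length threshold and stop word list (not the ones used in the paper):

```python
# Illustrative stop word list and length threshold, not the paper's.
STOP_WORDS = {"the", "and", "for", "with", "this"}
MIN_LENGTH = 3

def filter_terms(terms):
    """Apply the stop word function (length threshold) and the
    stop word list to a sequence of extracted terms."""
    return [t for t in terms
            if len(t) >= MIN_LENGTH              # stop word function
            and t.lower() not in STOP_WORDS]     # stop word list

print(filter_terms(["the", "patient", "is", "admitted", "for", "surgery"]))
```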
3.1.2 Indexing only the Nouns
The term filtering proposed in this paper is not based on a
stop word list and/or a stop word function, but it is based on
the grammatical nature of the extracted term. In particular,
following [17, 18] only the nouns contained in the artefact
contents are used to build the artefact corpus. In order to
achieve such a filtering, the artefact content is pre-processed
by using a Part-of-Speech (POS) tagger [5] that tags all the
terms specifying their grammatical nature (e.g., verb, noun,
adjective)². By analysing such tags it is possible to filter out
all the terms that are not nouns.
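Since the paper relies on TreeTagger for the tagging itself, the sketch below assumes tagging has already produced (term, tag) pairs with Penn-style tags (where noun tags start with "N") and shows only the filtering step:

```python
def keep_nouns(tagged_terms):
    """Keep only the terms whose POS tag marks them as nouns
    (Penn-style tags: NN, NNS, NNP, ... all start with "N")."""
    return [term for term, tag in tagged_terms if tag.startswith("N")]

# Hypothetical tagger output for a use case sentence.
tagged = [("patient", "NN"), ("visits", "VBZ"),
          ("doctor", "NN"), ("office", "NN"), ("new", "JJ")]
print(keep_nouns(tagged))  # ['patient', 'doctor', 'office']
```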
3.2 Artefact Classification
Different IR methods can be used to calculate the tex-
tual similarity between two artefacts. In our experimenta-
²In our study we used TreeTagger, available for download at
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
tion, we compare the retrieval accuracy of two IR methods,
i.e., a probabilistic model, namely the Jensen-Shannon (JS)
method [1, 8], and a vector space based model, namely La-
tent Semantic Indexing (LSI) [13].
3.2.1 Jensen-Shannon Method
The Jensen-Shannon (JS) similarity model is an IR tech-
nique recently proposed by Abadi et al. [1]. It is driven by
a probabilistic approach and hypothesis testing techniques.
Like other probabilistic models, it represents each
document through a probability distribution. This means
that an artefact is represented by a random variable where
the probability of its states is given by the empirical distri-
bution of the terms occurring in the artefact (i.e., columns
of the term-by-document matrix). It is worth noting that
the empirical distribution of a term is based on the weight
assigned to the term for the specific artefact [1].
In the JS method the similarity between two artefacts
is given by a “distance” of their probability distributions
measured by using the Jensen-Shannon (JS) Divergence [8].
More details on the JS method can be found in [1, 8].
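A small sketch of the Jensen-Shannon divergence between two term distributions (the distributions here are hypothetical; with log base 2 the divergence lies in [0, 1], so 1 - JSD can serve as a similarity score):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability
    distributions (normalised term-by-document columns)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):  # Kullback-Leibler divergence, log base 2
        return sum(ai * math.log2(ai / bi)
                   for ai, bi in zip(a, b) if ai > 0)
    return (kl(p, m) + kl(q, m)) / 2

# Hypothetical term distributions of two artefacts.
p = [0.5, 0.5, 0.0]
q = [0.5, 0.25, 0.25]
print(round(1 - js_divergence(p, q), 3))  # similarity as 1 - JSD
```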
3.2.2 Latent Semantic Indexing
Latent Semantic Indexing (LSI) [13] is an extension of the
Vector Space Model (VSM) [4]. In the VSM, artefacts are
represented as vectors of terms (i.e., columns of the term-
by-document matrix) that occur within artefacts in a reposi-
tory. Thus, the similarity between two artefacts can be mea-
sured by the cosine of the angle between the corresponding
vectors, which increases as more terms are shared. This
means that VSM does not take into account relations be-
tween terms. For instance, having "automobile" in one arte-
fact and "car" in another artefact does not contribute to the
similarity measure between these two documents.
LSI was developed to overcome the synonymy and poly-
semy problems, which occur with the VSM [13]³. In
LSI the dependencies between terms and between artefacts,
in addition to the associations between terms and artefacts,
are explicitly taken into account. For example, both “car”
and “automobile” are likely to co-occur in different arte-
facts with related terms, such as “motor”, “wheel”, etc. In
order to exploit information about co-occurrences of terms,
LSI applies Singular Value Decomposition (SVD) [13] to
project the original term-by-document matrix into a reduced
space in order to diminish the obscuring “noise” in word us-
age. Also in this case, the cosine of the angle between the
reduced artefact vectors can be used to define the similarity
between the corresponding artefacts. More details on LSI
can be found in [13].
³It is worth noting that the JS method also does not take into account
relations between terms. Thus, the synonymy and polysemy problems
occur in the JS method as well.
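The LSI pipeline described above (SVD, rank-k truncation, cosine similarity in the reduced space) can be sketched with an illustrative term-by-document matrix:

```python
import numpy as np

# Illustrative m x n term-by-document matrix (not the paper's data):
# rows = terms, columns = artefacts.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# SVD and rank-k truncation project artefacts into the LSI subspace,
# diminishing the "noise" in word usage.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # artefact vectors, k dimensions

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity between artefacts 0 and 2 in the reduced space.
print(round(cosine(docs_k[0], docs_k[2]), 3))
```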
Table 1. Traceability recovery methods used
in the experimentation
Acronym Description
JS-ALL Jensen-Shannon method where all the terms are
considered during the indexing process.
LSI-ALL LSI method where all the terms are considered
during the indexing process.
JS-NOUN Jensen-Shannon method where only the nouns are
considered during the indexing process.
LSI-NOUN LSI method where only the nouns are considered
during the indexing process.
4 Experimental Evaluation
In this section we describe in detail the design of a case
study carried out to assess the role played by the nouns con-
tained in the artefacts in the context of IR-based traceabil-
ity recovery. The case study was conducted following the
guidelines given by Wohlin et al. [24].
4.1 Definition and Context
The goal of the experiment was to analyse whether the
accuracy of IR-based traceability recovery methods im-
proves or not when indexing only the nouns contained in
the artefact contents. The quality focus was ensuring bet-
ter recovery accuracy, while the perspective was both (i) of
a researcher, who wants to evaluate how the terms consid-
ered during the artefact indexing influence the recovery ac-
curacy; and (ii) of a project manager, who wants to evaluate
the possibility of adopting the proposed approach and tool
within his/her own organisation.
The experiment was conducted on EasyClinic, a soft-
ware system developed by final year students at the Univer-
sity of Salerno (Italy). The application provides support to
manage a doctor’s office and is composed of 30 use cases,
20 UML interaction diagrams, 63 test cases, and 37 code
classes. The language of the artefacts is Italian and the orig-
inal traceability matrix provided by the developers was used
as oracle in order to evaluate the accuracy of the IR-based
traceability recovery method. The total number of correct
links contained in the traceability matrix is 1,005.
4.2 Planning
In order to evaluate the proposed approach we recov-
ered traceability links between the code and the documenta-
tion (i.e., use cases, interaction diagrams, and test cases) of
EasyClinic using two different artefact indexing processes
to build the term-by-document matrix:
• All: all the terms contained in the artefact contents
were considered during the indexing process. This
means that the semantics of an artefact is defined on
the basis of all the terms contained in the artefacts, e.g.,
nouns, adjectives, and verbs.
• Noun: only the nouns contained in the artefact contents
were considered as keywords. Thus, all the other types
of terms (e.g., adjectives, verbs) were discarded and
the semantics of an artefact is defined on the basis of
only the nouns contained in the artefact content.
Note that in both the artefact indexing processes a stop
word function and/or a stop word list are applied, while we
did not use stemming, since its effects are variable (some-
times resulting in small improvements, sometimes in small
decreases in accuracy) [15].
We were also interested in analysing how the pro-
posed approach interacts with the IR method used to cal-
culate the artefact similarity. For this reason, the proposed
approach was experimented with a probabilistic model,
namely the Jensen-Shannon (JS) method [1], as well as a vector
space based model, namely Latent Semantic Indexing (LSI)
[13]. Thus, for each IR method the two different index-
ing processes, i.e., All and Noun, were performed (see Ta-
ble 1). The four different approaches were used to perform
the following traceability recovery activities on the Easy-
Clinic repository:
A1 recovering traceability links between use cases and
code classes (83 is the total number of correct links);
A2 recovering traceability links between UML interaction
diagrams and code classes (69 is the total number of
correct links);
A3 recovering traceability links between test cases and
code classes (200 is the total number of correct links).
Finally, we also wanted to analyse how the proposed ap-
proach interacts with the artefact language. For this rea-
son, the original artefacts of the EasyClinic repository were
translated into English and the same experiments performed
on the original repository were also performed on the trans-
lated repository. Summarising, we executed twenty-four
experiments considering all the possible combinations of
the experimental variables, i.e., indexing processes (All and
Noun), IR methods (JS and LSI), traceability recovery activ-
ities (A1, A2, and A3), and artefact languages (Italian and
English).
4.3 Hypothesis Formulation
Two null-hypotheses were formulated for testing
whether the accuracy of an IR-based traceability recovery
method improves or not when taking into account only the
nouns contained in the artefacts during the indexing pro-
cess:
H0JS: indexing the artefacts using only the nouns does not
significantly improve the accuracy of the JS method.
H0LSI: indexing the artefacts using only the nouns does not
significantly improve the accuracy of the LSI method.
When the null hypothesis can be rejected with relatively
high confidence it is possible to formulate an alternative hy-
pothesis, which admits a positive effect of the nouns on the
retrieval accuracy:
HaJS: indexing the artefacts using only the nouns signifi-
cantly improves the accuracy of the JS method.
HaLSI: indexing the artefacts using only the nouns signifi-
cantly improves the accuracy of the LSI method.
We also formulated three other null hypotheses to inves-
tigate how the IR method, the artefact types, and the lan-
guage of the artefacts interact with the indexing process and
affect the retrieval accuracy:
H0M: the IR method does not significantly interact with the
type of terms considered during the artefact indexing.
H0T: the artefact type does not significantly interact with the
type of terms considered during the artefact indexing.
H0L: the language of the artefact does not significantly inter-
act with the type of terms considered during the arte-
fact indexing.
The related alternative hypotheses can be easily derived.
4.4 Identification of Experimental Factors
The main factor of our study, denoted as Terms, is repre-
sented by the terms considered during the indexing process,
i.e., all terms (All) or only nouns (Noun). However, in order
to better assess the effect of Terms it is necessary to control
other factors that may impact the retrieval accuracy and in-
teract with the effect of Terms. In the context of our study,
we identify the following factors:
• Method: the IR method used to calculate the tex-
tual similarity between artefacts. As described in Sec-
tion 4.2 we used a probabilistic model, i.e., the Jensen-
Shannon (JS) method [1], as well as a vector space
based model, i.e., Latent Semantic Indexing (LSI) [13].
• Artefact type: as described in Section 4.2, the exper-
iment involved three traceability recovery activities on
the EasyClinic artefact repository. In particular, we re-
covered traceability links between code classes and use
cases, UML interaction diagrams, and test cases, re-
spectively.
• Language: the language of the artefacts. The language
of the original artefacts of EasyClinic is Italian. How-
ever, the artefacts of the EasyClinic repository were
also translated into English.
4.5 Data Collection and Analysis
In order to evaluate the proposed approach for each
traceability recovery activity (see Section 4.2) we collected
the number of correct links and false positives retrieved by
the method. The number of correct links and false posi-
tives were automatically identified by a tool simulating the
behaviour of the software engineer during the classification
of the proposed links. In particular, the tool takes as input
the ranked list of candidate links and classifies each link as
correct link or false positive until all correct links are recov-
ered. Such a classification is automatically performed by
the tool exploiting the original traceability matrix as oracle.
A preliminary evaluation of the proposed approach can
be obtained using two well-known Information Retrieval
(IR) metrics, namely recall and precision [4]:
recall = |correct ∩ retrieved| / |correct| %        precision = |correct ∩ retrieved| / |retrieved| %
where correct and retrieved represent the set of correct
links and the set of links retrieved by the tool, respectively.
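These two metrics can be computed directly over sets of candidate links; the link identifiers below are hypothetical:

```python
# Recall and precision over sets of (source, target) link pairs.
def recall(correct, retrieved):
    return len(correct & retrieved) / len(correct) * 100

def precision(correct, retrieved):
    return len(correct & retrieved) / len(retrieved) * 100

correct = {("UC1", "C1"), ("UC2", "C2"), ("UC3", "C3")}
retrieved = {("UC1", "C1"), ("UC2", "C2"), ("UC4", "C1")}

# 2 of 3 correct links retrieved; 2 of 3 retrieved links correct.
print(f"recall = {recall(correct, retrieved):.1f}%")
print(f"precision = {precision(correct, retrieved):.1f}%")
```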
We also performed a further comparison of the proposed
retrieval techniques based on the analysis of the statistical
significance difference of the false positives retrieved. Such
an analysis uses a statistical significance test aiming at ver-
ifying that the false positives retrieved by one method are
significantly lower than the false positives retrieved by an-
other method. Thus, the dependent variable of our study is
FP, representing the number of false positives retrieved by
the traceability recovery method for each correct link iden-
tified. Since the number of correct links is the same for each
traceability recovery activity (i.e., the data was paired), we
decided to use the Wilcoxon Rank Sum test [7] to test the
statistical significance difference between the false positives
retrieved by two traceability recovery methods. The results
were considered statistically significant at α = 0.05. More-
over, the interaction of Terms with the IR method, the arte-
fact type, and the language of the artefacts was analysed by
using the Two-Way Analysis of Variance (ANOVA) [14] on
the whole data set. The interaction between factors was also
analysed by interaction plots. They are simple line graphs
where the means on the dependent variable for each level
of one factor are plotted over all the levels of the second
factor. The resulting profiles are parallel when there is no
interaction and nonparallel when interaction is present [14].
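As an illustration of the significance testing, the sketch below runs a paired Wilcoxon test on hypothetical per-link false positive counts using scipy (the paper names the Wilcoxon Rank Sum test; scipy's `wilcoxon` implements the paired signed-rank variant, which matches the paired data described above):

```python
from scipy.stats import wilcoxon

# Hypothetical FP counts per correct link for the two indexing processes.
fp_all  = [12, 30, 7, 45, 22, 19, 33, 28]   # All indexing
fp_noun = [8, 21, 7, 30, 15, 14, 25, 20]    # Noun indexing

# One-sided test: does All retrieve significantly more FP than Noun?
stat, p_value = wilcoxon(fp_all, fp_noun, alternative="greater")
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("reject H0: Noun indexing retrieves significantly fewer FP")
```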
[Figure: six precision/recall plots, both axes ranging from 0 to 100%:
(a) tracing code classes onto use cases of EasyClinic-ITA; (b) use
cases of EasyClinic-ENG; (c) UML interaction diagrams of
EasyClinic-ITA; (d) UML interaction diagrams of EasyClinic-ENG;
(e) test cases of EasyClinic-ITA; (f) test cases of EasyClinic-ENG.
Each plot compares JS-ALL, LSI-ALL, JS-NOUN, and LSI-NOUN.]
Figure 1. Precision/Recall curves
5 Experimental Results
This section reports on the results achieved tracing code
classes onto use cases, UML interaction diagrams, and test
cases, respectively, adopting the traceability recovery ap-
proaches described in Section 4.2. Figure 1 shows the pre-
cision/recall curves achieved in all the experiments, while
Table 2 reports the precision values and the number of false
positives at different levels of recall (i.e., 100% and 80%)
grouped by method and traceability recovery activity. The
table also reports for each IR method the differences be-
tween the precision values, as well as the number of false
positives retrieved, achieved with the two artefact indexing
processes, i.e., Noun and All.
A preliminary analysis shows that, in general, the re-
trieval accuracy improves considering only the nouns dur-
ing the indexing process (see Figure 1). In particular, when
recall is 80% there are several cases where it is possible to
achieve an improvement of precision of about 20% (see Ta-
ble 2). A less evident improvement is also achieved when
recall is 100%. In particular, in some cases it is possible
to achieve an improvement of about 5% (see Table 2). In
Table 2. Precision values (%) and number of false positives (FP) at different levels of recall (100% and 80%)

Data set        Target     |     Prec(100%) JS     |    Prec(100%) LSI     |     Prec(80%) JS      |     Prec(80%) LSI
                artefacts  |  Noun    All    Diff  |  Noun    All    Diff  |  Noun    All    Diff  |  Noun    All    Diff
EasyClinic-ITA  Use cases  | 12.58   8.49   +4.09  | 13.19   8.18   +5.01  | 46.30  26.31  +19.99  | 47.47  26.69  +20.78
                  (FP)     |   646  1,003    -357  |   612  1,043    -431  |    87    210    -123  |    83    206    -123
                UML diagr. |  8.16   7.80   +0.36  |  8.34   7.75   +0.59  | 28.57  40.29  -11.72  | 26.05  40.28  -14.23
                  (FP)     |   778    816     -38  |   758    821     -63  |   140     83     +57  |   159     83     +76
                Test cases | 19.78  13.35   +6.43  | 22.96  13.38   +9.58  | 45.98  22.53  +23.45  | 50.31  25.28  +25.03
                  (FP)     |   811  1,298    -487  |   671  1,295    -624  |   188    550    -362  |   158    473    -315
EasyClinic-ENG  Use cases  |  9.38   7.58   +1.80  |  9.38   7.62   +1.76  | 27.78  27.27   +0.51  | 29.07  28.09   +0.98
                  (FP)     |   898  1,133    -235  |   852  1,128    -276  |   195    200      -5  |   183    192      -9
                UML diagr. |  7.73   7.78   -0.05  |  7.66   7.74   -0.08  | 38.89  26.79  +12.10  | 40.29  24.56  +15.73
                  (FP)     |   824    818      +6  |   832    822     +10  |    88    153     -65  |    83    172     -89
                Test cases | 18.47  10.89   +7.58  | 20.10  11.27   +8.83  | 23.98  28.93   -4.95  | 31.87  29.14   +2.73
                  (FP)     |   883  1,636    -753  |   795  1,574    -779  |   507    393    +114  |   342    389     -47
Source artefacts are code classes. For each cell, the first row gives precision (%) and the second the number of false positives.
order to better clarify the accuracy improvement for the end
user, it is necessary to analyse the number of false positives
retrieved by the method. As we can see, in several cases
it is possible to achieved a considerable reduction of false
positives especially when the goal is to recover all correct
links (i.e., recall equals to 100%). In particular, when trac-
ing code classes onto test cases the number of false posi-
tives retrieved by using both JS-ALL and LSI-ALL is about
twice the number of false positives retrieved by JS-NOUN
and LSI-NOUN.
Analysing the precision/recall curves and the data in Ta-
ble 2 we also observe that when tracing code classes onto
interaction diagrams of EasyClinic-ITA the different arte-
fact indexing processes do not provide any improvement at
all. Indeed, better results can be achieved considering all
the terms during the indexing process. This contrasting re-
sult is probably due to the POS tagger used during the in-
dexing process that considered several abstract nouns in the
description of the UML interaction diagrams as verbs in the
past participle form. Thus, during the indexing process such
terms were discarded thus reducing the similarity between
UML interaction diagrams and code classes.
The next sections report and discuss the results of the sta-
tistical analysis performed to analyse the effect on the de-
pendent variable (i.e., FP) of the main factor (i.e., Terms)
and of other factors. Such an analysis is necessary in order
to better assess the effect of Terms and analyse whether or
not other factors affect the retrieval accuracy and/or interact
with the effect of Terms.
5.1 Influence of Terms
Table 3 reports the results (i.e., p-values) of the Wilcoxon
test used to test the null hypotheses H0JS and H0LSI. The
results of the tests confirmed our initial findings. In par-
ticular, the null hypotheses can be always rejected, except
when tracing UML interaction diagrams onto code classes
Table 3. Results of the Wilcoxon tests
Data set Target artefacts H0JS H0LSI
EasyClinic-ITA Use cases < 0.01 < 0.01
UML diagrams 0.99 0.99
Test cases < 0.01 < 0.01
EasyClinic-ENG Use cases < 0.01 < 0.01
UML diagrams < 0.01 < 0.01
Test cases < 0.01 < 0.01
Source artefacts are code classes.
of EasyClinic-ITA. Indeed, in this case significantly better
results can be achieved adopting a traditional approach (p-
values are less than 0.01 for both the IR methods).
5.2 Influence of other Factors
Regarding the influence of the IR method (i.e., Method)
ANOVA revealed that it did not influence the achieved re-
sults (p-value = 0.39). This means that the JS method and
LSI provided almost the same accuracy. Moreover, ANOVA
did not reveal any interaction between Terms and Method
(p-value = 0.81). Thus, we cannot reject H0M.
ANOVA also highlighted the influence of the artefact
type (p-value < 0.01) as well as a statistically significant in-
teraction between these two factors (p-value < 0.01). The
latter allows us to reject the null hypothesis H0T. Analysing
the interaction plot between Terms and Artefact type (see
Figure 2-a), we observe that the artefact indexing Noun does
not significantly influence the retrieval accuracy when trac-
ing code classes onto interaction diagrams. Indeed, only
on the repository EasyClinic-ENG the artefact indexing
Noun significantly influences the retrieval accuracy, while
on EasyClinic-ITA the indexing process does not provide
any improvement at all (see also Table 3).
Concerning the factor Language, ANOVA revealed no
statistically significant interaction between this factor and
the main factor, i.e., Terms (p-value = 0.06). Thus, we
cannot reject the null hypothesis H0_L. The test also revealed
a statistically significant influence of Language
(p-value < 0.01). Analysing the interaction plot between
Terms and Language (see Figure 2-b), we observe that the
effect of the artefact indexing Noun is more evident on the
repository EasyClinic-ITA. However, the artefact indexing
Noun provides statistically better results on both artefact
repositories (see also Table 3).

Figure 2. Interaction between Terms and Artefact (a) and between Terms and Language (b)
5.3 Threats to Validity
This section discusses the threats to validity that can
affect our results, focusing on external, construct, and
conclusion validity.
External validity concerns the generalisation of the findings.
An important threat is related to the repository used in
the experimentation, i.e., EasyClinic. It is not comparable
to industrial projects, but the repositories used by other authors
[2, 20, 16] to compare different IR methods have a comparable
size. Moreover, EasyClinic has already been used to evaluate
IR methods for recovering traceability links [9]. However, replicating
the experiment using industrial repositories is needed
to generalise the achieved results. Moreover, since our approach
is based on a linguistic heuristic, the language of the
artefacts may play an important role and affect the achieved
results. In order to mitigate such a threat we performed the
experimentation on two artefact repositories, one in Italian
and the other one in English. However, the two artefact
repositories describe the same software project, i.e., EasyClinic.
The system domain probably also plays an
important role in the definition of the linguistic heuristic. Thus,
also in this case we plan to replicate the experiment using
other artefact repositories to further generalise the proposed
approach.
Construct validity threats concern the relationship be-
tween theory and observation. In particular, recall and pre-
cision are widely used metrics for assessing an IR tech-
nique. Moreover, the number of false positives retrieved by
a traceability recovery tool for each correct link retrieved
well reflects its retrieval accuracy. To avoid biasing the
experiment, the translation of the artefacts of EasyClinic was
performed by a person who did not know the goal of our
study. In order to mitigate the threat to validity represented
by the quality of the English translation, we selected a person
with a good knowledge of that language. Another threat
to validity is related to the quality of the POS tagger used
in our experimentation. We used a freely available tool with a
high level of accuracy. In particular, as reported in its documentation,
TreeTagger has an accuracy of about 95% for the
English language and of about 70% for Italian. Finally,
the accuracy of the oracle we used to evaluate
the tracing accuracy could also affect the achieved results.
To mitigate such a threat we used the original traceability
matrix provided by the original developers and validated
the links during review meetings held by the original
development team together with PhD students and academic
researchers.
Conclusion validity concerns the relationship between
the treatment and the outcome. Attention was paid not to
violate assumptions made by the statistical tests. Whenever
the conditions necessary to use parametric statistics did not
hold (e.g., in the analysis of each experiment's data), we used
non-parametric tests, in particular the Wilcoxon test for paired
analyses. It is worth noting that we also used a parametric test,
i.e., ANOVA, to analyse the effect of different factors even
though the distribution was not normal. According to [24] this
can be done since (i) the ANOVA test is very robust and
(ii) the distribution of false positives is very close to being
normally distributed.
6 Discussion and Lessons Learned
The results achieved in the reported case study provided
us with a number of lessons learned:
Table 4. Number of indexed terms and average time (ms) required for the calculation of the SVD and the similarity between artefacts

                   Noun                        All
Data set           Terms   SVD     Sim         Terms   SVD     Sim
EasyClinic-ITA     401     1,021   864         895     2,241   1,867
EasyClinic-ENG     588     1,641   1,333       901     2,437   2,073
• the nouns give the maximal information on the semantics
of the artefacts: the achieved results demonstrate
that using only the nouns to build the artefact
corpus significantly improves the retrieval accuracy of
both the JS and LSI methods. This means that the findings
described in [17, 18] are valid and applicable also to
software documentation. In particular, the language
used in software artefacts can be classified as a sectorial
language, where the terms that provide more indication
on the semantics of a document are the nouns. It
is important to note that using only the nouns during
the indexing process also influences the size of the
term-by-document matrix, i.e., it reduces the number
of terms. The drastic reduction of the number of terms
(see Table 4) represents another important result, since
it speeds up the calculation of (i) the document similarity
– for the JS method – and (ii) the singular value
decomposition – for LSI (see Table 4);
• the IR method does not significantly influence the retrieval
accuracy: the results achieved in our case study
demonstrate that the IR method does not significantly
influence the retrieval accuracy of the traceability recovery
method. Table 5 shows the comparison of the
JS method and LSI based on the statistical significance
of the difference between the false positives retrieved by the
two methods. As we can see, in 6 cases out of 12 there
is no significant difference between the false positives
retrieved by JS and LSI. In the other 6 cases, JS performed
significantly better than LSI in 3 cases, while
in the other 3 cases LSI provided the best results. We also
observed that LSI performed better than JS only when
recovering traceability links between artefacts written in
Italian. This suggests that LSI is probably more suitable
than JS for languages, such as Italian, that have a complex
grammar, verbs with many conjugated variants, words with
different meanings in different contexts, and irregular forms
for plurals, adverbs, and adjectives. Finally, with the JS
method the advantages provided by the artefact indexing
Noun are more evident than with LSI. In particular,
the artefact indexing Noun is used in 2 of the 3 cases (i.e.,
tracing use cases onto code classes) where the JS
method outperforms LSI;
Table 5. Comparison of JS and LSI methods: results of the Wilcoxon test

                                      Best method
Data set          Target artefacts   Noun            All
EasyClinic-ITA    Use cases          JS (< 0.01)     No diff.
                  UML diagrams       No diff.        LSI (< 0.01)
                  Test cases         LSI (< 0.01)    LSI (< 0.01)
EasyClinic-ENG    Use cases          JS (< 0.01)     No diff.
                  UML diagrams       No diff.        No diff.
                  Test cases         No diff.        JS (< 0.01)

Source artefacts are code classes.
• the influence of the artefact type on the retrieval accu-
racy is still an open issue: the achieved results showed
that the artefact indexing Noun does not provide any
improvement at all when tracing UML interaction di-
agrams onto code classes of EasyClinic-ITA. We ob-
served that several abstract nouns in the description
of the UML interaction diagrams were considered as
verbs in the past participle form. Thus, discarding such
terms during the indexing process reduced the simi-
larity between code classes and UML interaction dia-
grams. Moreover, in general, interaction diagrams in-
clude many more verbs (action and method names are
usually verbs) than nouns. This suggests that probably
another linguistic heuristic (i.e., which terms should be
indexed) should be used for UML interaction diagrams
taking into account also the verbs. However, such a
problem was found only for the Italian language, since
the artefact indexing Noun provides significantly better
results when tracing UML interaction diagrams onto
code classes of EasyClinic-ENG. All these considera-
tions suggest that the role played by the artefact type
still represents an open issue in the selection of the lin-
guistic heuristic to adopt during the indexing process
and has to be further analysed;
• the language of the artefacts influences the retrieval
accuracy: the results achieved in our case study
showed that the language of the artefact is an influenc-
ing factor. The analysis of the interaction plot revealed
that the improvement of the artefact indexing Noun
was more evident when tracing software artefacts writ-
ten in Italian, except when tracing code classes onto
UML interaction diagrams. However, the factor Lan-
guage did not interact with Terms.
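The JS method discussed above scores a candidate link by the Jensen-Shannon divergence between the term distributions of the two artefacts. The sketch below is a minimal illustration assuming unit-normalised term frequencies as the distributions; the vocabulary and counts are invented, and the paper's actual implementation may differ in its weighting scheme.

```python
import math

# Sketch of a Jensen-Shannon (JS) similarity between two artefacts,
# each represented as a probability distribution over the indexed terms.
# Vocabulary and counts are illustrative only.

def to_distribution(term_counts, vocabulary):
    """Turn raw term counts into a probability distribution over vocabulary."""
    total = sum(term_counts.get(t, 0) for t in vocabulary)
    return [term_counts.get(t, 0) / total for t in vocabulary]

def entropy(p):
    """Shannon entropy in bits (zero-probability terms contribute nothing)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_similarity(p, q):
    """1 - JS divergence: 1 for identical distributions, 0 for disjoint ones."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    jsd = entropy(m) - (entropy(p) + entropy(q)) / 2
    return 1 - jsd

vocab = ["patient", "visit", "report", "booking"]
use_case = {"patient": 3, "visit": 2, "report": 1}     # hypothetical artefact
code_class = {"patient": 2, "visit": 2, "booking": 2}  # hypothetical artefact
p = to_distribution(use_case, vocab)
q = to_distribution(code_class, vocab)
print(round(js_similarity(p, q), 3))
```

With base-2 logarithms the JS divergence is bounded in [0, 1], so the similarity is as well, which makes it directly usable for ranking candidate traceability links.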
7 Conclusion and Future Work
We described how to improve the accuracy of an IR-
based traceability recovery tool by using a linguistic heuris-
tic during the artefact indexing. Such a heuristic is used
to identify the extracted terms that have to be included in
the artefact corpus and considered to define the textual sim-
ilarity between two artefacts. In particular, we decided to
index only the nouns extracted from the artefact content,
since in a sectorial language (the language of the software
documentation) the terms that provide more indication on
the semantics of a document are the nouns [17, 18].
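The noun-only indexing step can be sketched as a simple filter over POS-tagged tokens. The Penn-style "NN*" noun tags and the sample sentence below are assumptions for illustration; the study relies on TreeTagger's output instead.

```python
# Minimal sketch of noun-only artefact indexing: given POS-tagged tokens
# (as a tagger such as TreeTagger would produce), keep only the nouns.
# The tag set (Penn-style, nouns tagged "NN*") and the sample sentence
# are assumptions for illustration.

def index_nouns(tagged_tokens):
    """Keep the lowercased tokens whose POS tag marks a noun."""
    return [token.lower() for token, tag in tagged_tokens if tag.startswith("NN")]

tagged = [
    ("The", "DT"), ("patient", "NN"), ("books", "VBZ"),
    ("a", "DT"), ("visit", "NN"), ("with", "IN"),
    ("the", "DT"), ("doctor", "NN"),
]
print(index_nouns(tagged))  # → ['patient', 'visit', 'doctor']
```

Note how the verb "books" is dropped even though it is spelled like a noun: the decision is driven by the tagger's part-of-speech label, which is exactly why tagger accuracy appears among the threats to validity.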
The results achieved in a reported case study demon-
strated that, in general, the proposed approach improves
the retrieval accuracy of an IR-based traceability recovery
method based on the probabilistic or vector space based
models. It is worth noting that replication in different con-
texts and with different objects is the only way to corrobo-
rate our findings. Replicating the experiment using different
repositories, IR methods, and traceability recovery activi-
ties is part of the agenda of our future work. We also plan to
experiment with the proposed artefact indexing combined with
other enhancing strategies, such as stemming [22] and user
feedback analysis [4].
References
[1] A. Abadi, M. Nisenson, and Y. Simionovici. A traceabil-
ity technique for specifications. In Proceedings of 16th
IEEE International Conference on Program Comprehen-
sion, pages 103–112. IEEE CS Press, 2008.
[2] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and
E. Merlo. Recovering traceability links between code and
documentation. IEEE Transactions on Software Engineer-
ing, 28(10):970–983, 2002.
[3] G. Antoniol, G. Casazza, and A. Cimitile. Traceability re-
covery by modelling programmer behaviour. In Proceedings
of 7th Working Conference on Reverse Engineering, pages
240-247, Brisbane, Queensland, Australia, 2000. IEEE CS
Press.
[4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information
Retrieval. Addison-Wesley, 1999.
[5] E. Charniak. Statistical techniques for natural language pars-
ing. AI Magazine, 18(4):33–44, 1997.
[6] J. Cleland-Huang, R. Settimi, C. Duan, and X. Zou. Utiliz-
ing supporting evidence to improve dynamic requirements
traceability. In Proceedings of 13th IEEE International Re-
quirements Engineering Conference, pages 135–144, Paris,
France, 2005. IEEE CS Press.
[7] W. J. Conover. Practical Nonparametric Statistics. Wiley,
3rd edition, 1998.
[8] T. M. Cover and J. A. Thomas. Elements of Information
Theory. Wiley-Interscience, 1991.
[9] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Re-
covering traceability links in software artefact management
systems using information retrieval methods. ACM Trans-
actions on Software Engineering and Methodology, 16(4),
2007.
[10] A. De Lucia, R. Oliveto, and P. Sgueglia. Incremental ap-
proach and user feedbacks: a silver bullet for traceability
recovery. In Proceedings of 22nd IEEE International Con-
ference on Software Maintenance, pages 299–309, Shera-
ton Society Hill, Philadelphia, Pennsylvania, 2006. IEEE CS
Press.
[11] A. De Lucia, R. Oliveto, and G. Tortora. IR-based traceabil-
ity recovery processes: an empirical comparison of “one-
shot” and incremental processes. In Proceedings of 23rd
International Conference Automated Software Engineering,
pages 39–48, L’Aquila, Italy, 2008. ACM Press.
[12] A. De Lucia, R. Oliveto, and G. Tortora. Assessing IR-based
traceability recovery tools through controlled experiments.
Empirical Software Engineering, 14(1):57–93, 2009.
[13] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer,
and R. Harshman. Indexing by latent semantic analysis.
Journal of the American Society for Information Science,
41(6):391–407, 1990.
[14] J. L. Devore and N. Farnum. Applied Statistics for Engineers
and Scientists. Duxbury, 1999.
[15] S. T. Dumais. Improving the retrieval of information from
external sources. Behavior Research Methods, Instruments
and Computers, 23:229–236, 1991.
[16] J. H. Hayes, A. Dekhtyar, and S. K. Sundaram. Advanc-
ing candidate link generation for requirements tracing: The
study of methods. IEEE Transactions on Software Engineer-
ing, 32(1):4–19, 2006.
[17] D. Jurafsky and J. Martin. Speech and Language Processing.
Prentice Hall, 2000.
[18] E. L. Keenan. Formal Semantics of Natural Language. Cam-
bridge University Press, 1975.
[19] M. Lormans and A. van Deursen. Can LSI help reconstruct-
ing requirements traceability in design and test? In Proceed-
ings of 10th European Conference on Software Maintenance
and Reengineering, pages 45–54, Bari, Italy, 2006. IEEE CS
Press.
[20] A. Marcus and J. I. Maletic. Recovering documentation-
to-source-code traceability links using latent semantic in-
dexing. In Proceedings of 25th International Conference
on Software Engineering, pages 125–135, Portland, Oregon,
USA, 2003. IEEE CS Press.
[21] A. Marcus, X. Xie, and D. Poshyvanyk. When and how
to visualize traceability links? In Proceedings of 3rd In-
ternational Workshop on Traceability in Emerging Forms of
Software Engineering, pages 56–61, Long Beach, California,
USA, 2005. ACM Press.
[22] M. F. Porter. An algorithm for suffix stripping. Program,
14(3):130–137, 1980.
[23] R. Settimi, J. Cleland-Huang, O. Ben Khadra, J. Mody,
W. Lukasik, and C. De Palma. Supporting software evo-
lution through dynamically retrieving traces to UML arti-
facts. In Proceedings of 7th IEEE International Workshop
on Principles of Software Evolution, pages 49–54, Kyoto,
Japan, 2004. IEEE CS Press.
[24] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell,
and A. Wesslen. Experimentation in Software Engineering -
An Introduction. Kluwer, 2000.
[25] X. Zou, R. Settimi, and J. Cleland-Huang. Term-based
enhancement factors for improving automated requirement
trace retrieval. In Proceedings of International Symposium
on Grand Challenges in Traceability, pages 40–45, Lexington,
Kentucky, USA, 2007. ACM Press.