Using Acceptance Tests as a Support for

Clarifying Requirements: a Series of

Experiments

Filippo Ricca

Unità CINI at DISI, Genova, Italy

Marco Torchiano

Politecnico di Torino, Italy

Massimiliano Di Penta

RCOST - University of Sannio, Benevento, Italy

Mariano Ceccato, Paolo Tonella

Fondazione Bruno Kessler—IRST, Trento, Italy

Abstract

One of the main reasons for the failure of many software projects is the late discovery of a mismatch between the customers' expectations and the pieces of functionality implemented in the delivered system. At the root of such a mismatch is often a set of poorly defined, incomplete, under-specified, and inconsistent requirements. Test-driven development has recently been proposed as a way to clarify requirements during the initial elicitation phase, by means of acceptance tests that specify the desired behavior of the system.

The goal of the work reported in this paper is to empirically characterize the contribution of acceptance tests to the clarification of the requirements coming from the customer. We focused on Fit tables, a way to express acceptance tests, which can be automatically translated into executable test cases. We ran two experiments with students from the University of Trento and Politecnico di Torino, to assess the impact of Fit tables on the clarity of requirements. We considered whether Fit tables actually improve requirement understanding and whether this requires any additional comprehension effort. Experimental results show that Fit helps in the understanding of requirements without requiring a significant additional effort.

Key words: Empirical studies, acceptance testing, requirements, Fit tables.

Preprint submitted to Elsevier 13 December 2007

1 Introduction

Requirement elicitation is a critical phase in any software project. Late discovery of errors or missing information in the elicited requirements may be hard or impossible to fix in the successive development phases. Requirements that are ambiguous, incomplete, based on wishful thinking, inconsistent, silent (missing), over-sized, over-specific, or too generic may lead to delivering out-of-scope products and eventually be the reason for the failure of a project [18].

The effectiveness of the requirement elicitation phase is strongly dependent on the way requirements are expressed. Natural language and free formats are subject to inaccuracies and often prone to errors. However, this way of expressing the requirements represents the actual state of the practice. Moreover, a lot of information is often assumed as implicit or is agreed verbally. Available data indicates that on average 85% of the defects are estimated to come from inadequate requirements [31], thus supporting the intuition that there is room for improvement of the current practice.

In test-driven development [3], test case construction anticipates the actual development of the source code. Tools like JUnit 1, which are often integrated into the software development environment, support early construction and automated execution of test cases by developers. The same approach has recently been adopted for requirement elicitation, in the form of acceptance test cases, specifying the expected behavior of the system from the point of view of the user, similarly to the developer's point of view expressed in unit test cases. In agile programming, test cases represent an incarnation of the requirements, expressing them more precisely than natural language [17].

In this paper we report two experiments assessing the effectiveness of Fit [19], one of the most popular methodologies supporting the creation of acceptance tests to express the user's requirements. We evaluated the capability of Fit tables to make requirements easier to understand, and we investigated the overhead, in terms of the extra effort (if any) devoted to understanding Fit tables. We designed an experiment, replicated at two sites (University of Trento and Politecnico di Torino), involving students who were provided with either text-only requirements or textual requirements plus Fit tables. The study was performed considering the requirements for a library management system, which can be considered representative of information systems such as e-stores, video rental systems, electronic catalogues, etc. Requirement

Email addresses: [email protected] (Filippo Ricca), [email protected] (Marco Torchiano), [email protected] (Massimiliano Di Penta), [email protected], [email protected] (Mariano Ceccato, Paolo Tonella).
1 http://www.junit.org


understanding was assessed by means of a questionnaire on the pieces of functionality to be implemented in the system under development. Moreover, we measured the time needed to answer the questionnaire, in order to evaluate the understanding effort, so as to get a complete picture of the potential costs and benefits of Fit.

The paper is organized as follows: Section 2 provides some background knowledge on Fit and can be skipped by readers already familiar with it. Section 3 gives details on the experimental design. Results are reported in Section 4. In Section 5 we relate similar works reported in the literature to the present work. Discussion of results, conclusions, and future work are summarized in Section 6.

2 Acceptance testing using Fit

As defined in the IEEE Standard 1012-1986 [11], acceptance testing is "formal testing conducted to determine whether or not a system satisfies its acceptance criteria and to enable the customer to determine whether or not to accept the system". Thus, acceptance testing is a validation activity, performed by the customer just before the system is delivered and aimed at judging if the product is acceptable. Usually, acceptance testing is performed on the entire system, or a large part of it.

Very often, acceptance testing is performed in a rather informal fashion, and it is no more than a software demonstration. However, it would be highly desirable to have acceptance test cases precisely defined, and to make the acceptance testing phase as automated as possible. This implies that acceptance test cases need to be specified — cooperatively by the customer and the analyst — during the requirement elicitation and specification phase. During this activity, scenarios/user stories are developed, and acceptance test cases consist of sequences of inputs/outputs related to main paths, alternative paths, as well as to exceptional situations contemplated in the scenarios.

Clearly, we cannot require the customer — usually lacking software development expertise — to create test drivers to allow test case execution. Customers should be allowed to specify test cases in an easy way, without having to deal with source code development (e.g., test drivers, scripts, etc.). To this aim, frameworks such as Fit [19] have been conceived. Fit (Framework for Integrated Test) is an open source framework used to express acceptance test cases and a tool for improving the communication between analysts and developers. Fit lets analysts write acceptance tests in the form of tables (Fit tables) using simple HTML or even spreadsheets. A famous implementation of


Fig. 1. Example of a Column Fit table. Column names without parentheses represent inputs; names in parentheses indicate outputs.

Fit is Fitnesse 2, a Wiki where analysts/customers can upload requirements and Fit tables.

A Fit table specifies the inputs and expected outputs for the test. Figure 1 shows an example of Column Fit tables, a particular kind of table where each row represents a test case. The first five columns are input values (Name, Surname, Address, Date of birth and Credit/Debit) and the last column represents the corresponding expected output value (Member number()). Other than Column Fit tables, it is possible to specify Action Fit tables, to test user interfaces or workflows, Row Fit tables, to validate a collection of objects produced as the result of a query, TimedAction Fit tables, to deal with temporal, non-functional requirements, and others (see [19]). Overall, Fit tables are suited to model requirements whenever it is possible to express the relationship between inputs and outputs, assertions on the workflow, assertions on the result of a query, or even assertions on temporal aspects. Nevertheless, there can be requirements where such an approach is difficult to apply (or not applicable at all) [19], e.g., some particular kinds of non-functional requirements related to usability or maintainability.

2 http://www.fitnesse.org


Fig. 2. Using Fit during software development.

Fit tables cannot however be directly executed against the system. To get executable test cases, developers have to specify drivers, called Fixtures, that link the test cases to the system under test. A component in the framework, the Test Runner, executes the test cases by relying on the Fixtures, and compares Fit table data with the actual values obtained from the system. The test runner highlights the results with colors (green = correct, red = wrong). Figure 2 shows how Fit tables are used during the development phase. Fit tables are first created during requirement elicitation, in a process that involves the Customer and the Analyst for the definition of user stories. After having developed the Fixtures, the developer can run them against the system under development, to check whether the actual output o′ corresponding to a given input i matches the expected output o.
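To make the fixture mechanism more concrete, the following sketch mimics, in Python, what a Column fixture and the Test Runner do for the member-registration table of Figure 1: each row supplies the input values, the fixture forwards them to the system under test, and the runner marks the result green or red depending on whether the returned member number matches the expected one. This is only an illustrative sketch, not the actual Fit/Fitnesse API; the LibrarySystem class, its insert_member method and the sample row are hypothetical stand-ins.

```python
# Illustrative sketch of the Column fixture mechanism (NOT the real Fit API).
# LibrarySystem and insert_member are hypothetical stand-ins for the system
# under test and its entry point.

class LibrarySystem:
    def insert_member(self, name, surname, address, date_of_birth, credit_debit):
        return 0  # placeholder: the real system would compute the member number


class MemberColumnFixture:
    """Maps one table row (input columns + expected output column) to a call."""

    def __init__(self, system):
        self.system = system

    def check_row(self, row):
        expected = row["Member number()"]
        actual = self.system.insert_member(
            row["Name"], row["Surname"], row["Address"],
            row["Date of birth"], row["Credit/Debit"])
        # The Test Runner colors the cell green on a match, red otherwise.
        return "green" if actual == expected else "red"


# One hypothetical row; a runner would iterate over every row of the table.
row = {"Name": "Gino", "Surname": "Rossi", "Address": "Road b",
       "Date of birth": "1 July 1972", "Credit/Debit": 0, "Member number()": 71}
print(MemberColumnFixture(LibrarySystem()).check_row(row))  # "red" with the stub above
```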

When specifying requirements, analysts have to avoid several "sins" [18]. Some of them are noise, i.e., information not relevant to the problem or a repetition in the requirements, silence, when important information is missing, and over-specification, when portions of the solution are mentioned in the requirements. Other "sins" are forward references, ambiguity and over-sized documents. Another particularly unpleasant "sin" is wishful thinking. It happens when requirements ask developers to implement a feature that is very difficult or impossible to realize.

As highlighted in the introduction, other than supporting the acceptance test phase, Fit tables represent a way to better specify requirements and a way to avoid the above-mentioned "sins", thus supporting developers and maintainers. Differently from textual requirements, Fit tables are conceived to avoid


Table 1. Overview of the experimental design.

Goal: Evaluating the effect produced by Fit tables on the comprehension level and effort.
Context: Classroom exercise; requirements are provided on paper.
Null hypotheses: No effect on comprehension level. No effect on comprehension effort.
Main factor: Type of requirements used: text only vs. text plus Fit tables.
Dependent variables: Comprehension level and time.

the risk of subjective and misleading interpretation. By using Fit tables, interpretation mismatches between analysts and developers should be greatly reduced. After having read a requirement, the developer can use Fit tables to confirm or contradict her/his understanding, since Fit tables provide input-output "examples" of how the implementation of the requirement should work. For example, a Fit table describes the output the system should produce for a particular set of inputs (Column Fit table), or the sequence of interactions between the system and the user (Action Fit table), or asserts properties of the results of a query performed by that functionality (Row Fit table). However, Fit tables are different from other formal specifications that can complement requirements, in that they do not explain an algorithm, nor do they express requirement pre- and post-conditions.

3 Experiment definition

This section describes the experiment definition, design and planning following the guidelines of well-known books on experimental software engineering (Wohlin et al. [30], Juristo and Moreno [13]). Table 1 provides a concise overview of the main elements of the experimentation. The experiment was replicated at two different locations (Trento and Torino, Italy) using the same design, procedure, and material. In the remainder, the two experiments will be indicated as Exp I and Exp II respectively. The complete replication package, training slides and experiment raw data are available for downloading 3.

The goal of the study is to analyze the use of Fit tables as a way to better specify requirements, with the purpose of evaluating their usefulness to improve requirement comprehension. The quality focus is ensuring high comprehensibility of requirements, while the perspective is both that of Researchers, evaluating how effective Fit tables are during comprehension activities, and that of Project managers, evaluating the possibility of adopting the Fit approach in their organization to clarify requirements. The experiment is performed in

3 http://www.rcost.unisannio.it/mdipenta/ist-talkingtest-package.zip


a context where the subjects are students from master courses in software engineering and E-business. The experiment material consists of a set of six requirements related to a library management system.

3.1 Experiment Subjects

For Exp I the subjects were 15 students from the course of Laboratory of Software Analysis, in their last year of the master degree in computer science at the University of Trento. The students had previously attended courses on programming and software engineering. For Exp II the subjects were 15 students from the Information Systems for E-Business course, in their second year of the Master in E-business and Information and Communication Technology for Management at Politecnico di Torino. The students had a mixed background; in general they had a limited, although similar, level of knowledge and expertise in programming and software engineering topics. In both cases the sample of subjects participating in the experiment was composed of all students belonging to the course who, on a voluntary basis, agreed to participate in the experiment.

Great care was taken to ensure that ethical requirements imposed by the Italian regulations were met. A written, informed consent was obtained from each subject. Students signed a sheet containing the wording: "Data collected will be used only for research purposes and they will be revealed only in aggregated form". Moreover, (i) the experiment was not mandatory, (ii) the experiment had a clear pedagogical purpose aimed at introducing the use of Fit tables, (iii) results were made anonymous for privacy purposes and (iv) students were not evaluated on their performance.

3.2 Experiment Material

The experiment was performed on a set of requirements for a Library System, targeting library employees who manage the loan of books and tapes. Members can borrow, reserve or renew the loan of books and tapes. The description of the system (originally presented by Callan in the book [5], pages 169–174, and slightly modified for our purposes) follows:

A library issues loan items to customers. Each customer must be known as a member and as such is issued a membership card with a unique member number. Along with the membership number, other details on a customer must be kept, such as name, address, and date of birth. The library is made up of a number of subject sections. Each section is denoted by a classification mark. A loan item is uniquely identified by a bar code number. There are two


Table 2. Brief description of all requirements.

Requirement number Requirement title

R1 Inserting/deleting/updating a member

R2 Inserting/deleting/updating an item in the library

R3 Borrow an item

R4 Renew a loan

R5 Search for members

R6 Search for items

types of loan items: language tapes and books. A language tape has a title, language (e.g., French), level (e.g., beginner) and authors. A book has a title and authors. An author has two fields: name and surname. A customer may borrow up to a maximum of 8 items. An item can be borrowed, reserved or renewed to extend a current loan. Each of these activities has a cost in Euro (borrowing a book costs 10 Euros while a tape only 5; if the member performs at least 3 operations — i.e., borrow, renew and/or reserve — in the same day, she/he receives a discount of 7 Euros). When an item is issued, the borrowing customer's membership number is scanned via a bar code reader or entered manually. If the membership is still valid and the number of items on loan is less than 8, the procedure can proceed and the book bar code is read, either via the bar code reader or entered manually. If the item can be issued (e.g., it is not reserved) a receipt of the item is printed and then the item is issued. The library must support the facility for an item to be searched and for an update of items and members.

Such a high-level description is then detailed into six functional requirements, for example:

R1. The library employee can insert, delete or update a member. For each member the following fields need to be specified: unique member ID, name, surname, address, date of birth and credit/debit. The member ID is automatically computed by summing day, month and year of his/her birth date and subtracting from the result the number of letters of name and surname. If the value obtained is not unique — i.e., it is an already existing ID — then the software subtracts 1 from it.
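Since R1 defines the ID computation only in prose, a small sketch can make the rule concrete. The Python function below follows the wording of R1 under assumptions that the requirement leaves open (and that are therefore ours, not fixed by the specification): the year is summed exactly as written (e.g., 69 for '69), only alphabetic characters of name and surname are counted, and a single decrement is applied when the candidate ID already exists.

```python
# Sketch of the member-ID rule in R1. Assumptions not fixed by the requirement:
# the year is summed as written (e.g. 69 for '69), only alphabetic characters
# of name and surname are counted, and one single decrement is applied when
# the candidate ID is already taken.

def member_id(day, month, year, name, surname, existing_ids):
    letters = sum(c.isalpha() for c in name + surname)
    candidate = day + month + year - letters
    if candidate in existing_ids:   # already existing ID: subtract 1 (per R1)
        candidate -= 1
    return candidate

# Hypothetical usage: first member inserted, born on 26 September '69,
# named Ricca Filippo; no IDs have been assigned yet.
print(member_id(26, 9, 69, "Filippo", "Ricca", existing_ids=set()))  # 26+9+69-12 = 92
```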

Each requirement was complemented with a set of Fit tables, created and carefully scrutinized by one of the authors, to ensure correctness and consistency with respect to textual requirements.

The set of requirements used as experiment material constitutes the complete set of requirements for such an application. Although simple, the requirements are realistic enough and can be considered as representative of requirements for a small management application. Table 2 shows the complete list of requirements; further information can be found in the replication package.


3.3 Hypothesis definition

The research questions this experimentation aims at addressing are:

RQ1: Does the presence of Fit tables 4 help programmers to understand the requirements?

RQ2: Does the presence of Fit tables affect the effort required in the comprehension of textual requirements?

Once the research questions are formulated, it is possible to turn them into null hypotheses to be tested in the experimentation:

H0c the use of Fit tables as a complement to requirements does not significantly affect the comprehension level.

H0t the use of Fit tables as a complement to requirements does not significantly affect the comprehension effort.

while the alternative hypotheses are:

Hac the use of Fit tables as a complement to requirements significantly affects the comprehension level.

Hat the use of Fit tables as a complement to requirements significantly affects the comprehension effort.

As shown, both null hypotheses H0c and H0t are two-tailed. Ideally, Fit tables should always improve the comprehension level, since they complement, not substitute, the textual requirement. Nevertheless, they could also confuse the subjects, making comprehension harder. For such a reason, H0c is two-tailed. Nothing can be said on the comprehension effort, which can be either reduced or increased when Fit tables are used.

3.4 Variable selection

The dependent variables of our study are the requirement comprehension level and the time needed to understand the requirements. The comprehension level has been measured by asking subjects to answer a question for each requirement. For example, question Q1, related to requirement R1, has been formulated as follows:

4 Only Fit tables, not Fixtures. We are interested in understanding the possible benefits of adding Fit tables to requirements without considering additional benefits deriving from test case execution.


Q1. Suppose that the employee has already inserted 3 members (Ricca Filippo, Road a, 26 September '69, 0; Rossi Gino, Road b, 1 July 1972, 0; D'orth Mario, Road c, 5/5/74, 0): what is the ID of member D'amico Tino, birth date 5 April '93, as computed by the System?

Questions were created by a person different from the one who developed the Fit tables, to avoid introducing a bias, e.g., creating questions that are obviously easier to understand in the presence of Fit tables. Questions were conceived so as to clearly have a unique possible correct answer. As described below, the time needed to answer each question has been measured by asking subjects to fill in a proper time sheet (Time metric). To allow for comparisons between different requirements — each one having a different complexity and thus requiring a different time — we introduced a derived metric, the Normalized Time, which is computed by normalizing the absolute time by the average time spent by each subject and the average time required by each requirement. We defined the new metric in such a way that the following two properties hold: (i) dimensionally it is a time and (ii) the overall mean is equal to the Time mean. The formula for the i-th requirement and subject S is the following:

\[
\mathit{NormTime}^{S}_{i} \;=\;
\frac{\mathit{Time}^{S}_{i}}{\sum_{j=1}^{NR}\mathit{Time}^{S}_{j}\;\cdot\;\sum_{J\in\{\mathrm{subjects}\}}\mathit{Time}^{J}_{i}}
\;\cdot\;
\frac{\left(\sum_{j=1}^{NR}\sum_{J\in\{\mathrm{subjects}\}}\mathit{Time}^{J}_{j}\right)^{2}}{NR\cdot\left|\{\mathrm{subjects}\}\right|}
\qquad (1)
\]

where NR is the number of requirements for the system object of our experimentation (i.e., 6 for the Library System) and {subjects} is the set of subjects taking part in the experiment.
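The normalization can also be computed directly from the raw data. The sketch below implements equation (1) in the equivalent form based on subject, requirement and grand means, for a matrix of times with one row per subject and one column per requirement; the input matrix is made-up data, not values collected in the experiment.

```python
import numpy as np

def normalized_time(time):
    """Equation (1), written in terms of subject, requirement and grand means."""
    time = np.asarray(time, dtype=float)           # subjects x requirements
    subj_mean = time.mean(axis=1, keepdims=True)   # average time per subject
    req_mean = time.mean(axis=0, keepdims=True)    # average time per requirement
    grand_mean = time.mean()
    # Rescaling by both means and by the squared grand mean keeps the result
    # dimensionally a time and keeps its overall mean close to the raw mean.
    return time * grand_mean ** 2 / (subj_mean * req_mean)

# Made-up 3-subject x 6-requirement matrix of times (minutes).
example = [[7, 9, 7, 8, 7, 6],
           [10, 11, 6, 9, 8, 8],
           [6, 10, 7, 6, 8, 5]]
norm = normalized_time(example)
print(norm.round(2))
print(round(float(np.mean(example)), 2), round(float(norm.mean()), 2))  # similar means
```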

Overall, for each requirement i and subject S we considered the following metrics:

(1) Correct^S_i, whether the answer was correct or not;
(2) Time^S_i, the time required to read and understand the requirement and to answer the question;
(3) NormTime^S_i, the normalized time.

The main factor in our study is the notation used to express the requirement: either a simple textual description (Text) or a textual description enhanced with Fit tables (Fit).

3.5 Experiment Design

We adopt a very simple experiment design intended to fit a single 2-hour lab session. We have six objects and two treatments. The objects are requirements for a single application (the Library Management System), and the treatments are the following:


Table 3. Experiment design.

R1 R2 R3 R4 R5 R6

Red + - + - + -

Yellow - + - + - +

+ textual requirement enhanced with Fit tables
− textual requirement only

The subjects are provided with the six requirements of a software system (the Library System). Then, they are asked to answer one question per requirement.

The subjects are randomly split into two groups (Red and Yellow), receiving the combination of treatments shown in Table 3. Since, as mentioned in Section 3.1, subjects' skills were comparable, no particular strategy was necessary to ensure that skills were homogeneously distributed across groups. Each group is provided with some text-only requirements (e.g., the Red group with R2, R4 and R6) and with some textual requirements plus Fit tables (e.g., the Red group with R1, R3 and R5). The order of the requirements/questions is the same for the two groups, e.g., Q1 contains the same requirement and the same question for both groups. However, in one case the Fit tables are available, in the other case they are not. The experiment has been designed such that the assignments received by the subjects in each group are of comparable difficulty.

The design used is not a standard design [30,13,15] but a variation of the paired comparison design, a particular kind of one-factor-with-two-treatments design [30] (the same design is called randomized paired comparison design — two alternatives on one experimental unit — in [13]). We adopted this design because:

• it can be applied when a limited time slot (i.e., only one laboratory) is available for the experiment. Of course more complex experimental designs (e.g., the one used in [24]) could have been adopted, had more than one laboratory been available;

• each subject experiences both treatments of the main factor (i.e., Text and Fit), thus it is ethically and pedagogically acceptable as a course assignment.

3.6 Training

Subjects have been trained in the meaning and usage of Fit tables and Fitnesse — a tool that implements the Fit table approach — with two theoretical lessons and two practical lessons (two hours each). Slides from the seminar are part of the replication package.


3.7 Experiment Procedure

The experimentation was carried out following a well-defined procedure. First, the subjects were given a short introduction to the experiment; then they were randomly assigned to either of the two groups and sat down in the classroom according to a typical checkerboard pattern. They could work on a single question at a time, and they were required to deliver the previous answer before they could work on the next one.

Then, the experiment went through the following steps:

(1) Subjects had 10 minutes to read the system description.
(2) For each requirement in the sequence (I = 1 . . . 6):
    (a) We delivered the requirement and question sheet number I. Depending on the group the subject belongs to, the requirement can be complemented or not with a Fit table (see Table 3).
    (b) Subjects had to write their name and the start time.
    (c) Subjects had to read and understand the requirement and the related question.
    (d) Subjects had to answer the question using requirement statements and Fit tables (when provided).
    (e) When finished, subjects had to mark the stop time and call the educators.
(3) Subjects were asked to fill in a post-experiment questionnaire.

The post-experiment questionnaire aimed at gaining insights about the subjects' behavior during the experiment and at better explaining the quantitative results. The questionnaire (see Table 4) is composed of nine questions. A first group of questions (Q1 through Q6) served the purpose of validating the instrumentation source of internal validity threats. They address the availability of sufficient time to complete the tasks, the clarity of the requirements, and the ability of subjects to understand them. Questions Q7 and Q8 aim at measuring how much time is devoted to textual requirements and to Fit tables. Eventually, the last question is devoted to measuring the perceived usefulness of Fit tables. The questions are expressed on a Likert scale [21].

4 Experimental results

This section reports the results of the experimentation described previously. Overall, 30 subjects (15 for each experiment) took part in the experimentation. In Exp I we had 8 subjects who were assigned to the Red group and 7 to the Yellow group, while in Exp II 6 subjects were assigned to the Red group and


Table 4. Post-experiment questionnaire.

ID Question

Q1 I had enough time to perform the lab tasks (1–5).

Q2 The objectives of the lab were perfectly clear to me (1–5).

Q3 The description of the System was clear (1–5).

Q4 The questions asked were clear to me (1–5).

Q5 I experienced no difficulty in reading/understanding the Requirements (1–5).

Q6 I experienced no difficulty in reading/understanding the Fit Tables (1–5).

Q7 How much time (in terms of percentage) did you spend looking at Requirements?

(A. <20%; B. >=20% and <40%; C. >=40% and <60%; D. >=60% and <80%; E. >=80%)

Q8 How much time (in terms of percentage) did you spend looking at Fit tables?

(A. <20%; B. >=20% and <40%; C. >=40% and <60%; D. >=60% and <80%; E. >=80%)

Q9 Did you find Fit tables (when available) useful to clarify Requirements?

(A. very much; B. enough; C. undecided; D. little; E. definitely not)

1 = Strongly agree, 2 = Agree, 3= Not certain, 4 = Disagree, 5 = Strongly disagree.

9 to the Yellow group. The uneven assignment in Exp II (6 vs. 9) was due to subjects who, initially considered in the sample, did not show up on the day of the experiment. The way the material was prepared made an on-the-fly re-assignment difficult and risky.

The following subsections report results aimed at answering the two research questions RQ1 and RQ2, and at analyzing the effect of context factors. Finally, Section 4.5 discusses the threats to validity. For further analyses, a spreadsheet containing the raw data of both experiments is part of the replication package 5.

4.1 RQ1: Does the presence of Fit tables help programmers understand the requirements?

The average number of correct answers among all subjects is 2.7, with a minimum of 1 and a maximum of 5. To analyze them, we first built contingency tables and then applied Fisher's exact test [7]. We deemed it more accurate than the χ2 test for small sample sizes, which is the case in particular when analyzing each experiment or each requirement in isolation; with larger samples we also applied the χ2 test to double-check the results.

To test the hypothesis H0c, we considered the number of correct and wrong answers given by the subjects belonging to the two experimental samples, as shown in Table 5 (the figures with a dark background refer to cases where the Fit treatment was applied). For each requirement we built a 2×2 contingency

5 http://www.rcost.unisannio.it/mdipenta/ist-talkingtest-package.zip


Table 5. Number of correct answers and results of Fisher's test. The value in parentheses near the group name indicates the number of subjects in that group.

Group R1 R2 R3 R4 R5 R6

Red (8) 6 3 7 1 3 0

Yellow (7) 0 2 2 6 2 3

p-value 0.0069 1.00 0.04 0.01 1 0.077

(a) Exp I

Group R1 R2 R3 R4 R5 R6

Red (6) 6 3 5 5 3 0

Yellow (9) 4 5 4 7 1 3

p-value 0.044 1.00 0.29 1.00 0.23 0.23

(b) Exp II

Group R1 R2 R3 R4 R5 R6

Red (14) 12 6 12 6 6 0

Yellow (16) 4 7 6 13 3 6

p-value 0.001 1.00 0.011 0.057 0.24 0.019

(c) Whole data set

table and applied the Fisher’s test. The tables consider for each requirementand each possible treatment (Fit or Text) the number of correct and wronganswers, thus taking into account the different number of subjects in the Redand Yellow groups. Results are shown in the bottom part of the table for ExpI, Exp II and for the whole data set. As it can be noted, the difference betweencorrect and wrong answers is significant for R1 in both experiments, for R3and R4 in Exp I. No significant difference was found for R2 and R5, whileresults of R6 are significant for the whole data set and marginally significantfor Exp I.

Results seem to indicate that the usefulness of Fit tables depends on the particular requirement. While some requirements could have been understood without major problems even when Fit tables were not available, this was not the case for other requirements. For example, R1 (inserting/deleting/updating a member) required the computation of a customer unique identifier, which was better understood by subjects having Fit tables — containing examples of identifier computations — available. Conversely, this was not the case for R2 (inserting/deleting/updating an item), requiring the computation of a bar code, which was clearly described by means of a sequence of steps in the


Table 6. Contingency table for correct answers in both experiments.

Correct Wrong

Fit (+) 56 34

Text (−) 25 65

requirement itself. Requirement R3 dealt with borrowing an item, and required the subject to understand the rules for borrowing an item. The same happened for R4 (renewing a loan), even though here the overall difference (i.e., referred to the whole data set) was marginal or, as in Exp II, not significant. Searching for members (R5) was easy to understand even without Fit tables, while searching for items (R6) implies the possibility of using more options; thus the requirement was more complex and Fit tables provided a benefit. What was found here highlights the need for analyzing the effect of the particular requirement as a co-factor, which will be shown in Section 4.3.

Other than performing a comparison for each requirement, we can compare results for the two treatments in terms of the overall number of correct answers (related to all requirements). Under the assumption of dealing with independent measures, we obtained the contingency Table 6. Fisher's test indicates that, considering the overall set of requirements and the results from both experiments, there is a significant difference between Fit and Text (p-value = 5.6·10−6), i.e., the number of correct answers is significantly higher for Fit than for Text. Results are also significant, in the same direction, when considering the data separately (Exp I, p-value = 7.5·10−5 and Exp II, p-value = 0.019).
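As an illustration of how this kind of analysis can be reproduced, the sketch below applies Fisher's exact test (and the χ2 test as a double check) to the overall contingency table of Table 6; scipy is used here as one possible toolset, and two-sided p-value conventions may differ slightly across tools.

```python
from scipy.stats import fisher_exact, chi2_contingency

# Overall contingency table from Table 6: rows are treatments (Fit, Text),
# columns are correct and wrong answers.
table = [[56, 34],   # Fit
         [25, 65]]   # Text

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"Fisher's exact test: odds ratio = {odds_ratio:.2f}, p = {p_value:.2g}")

# For larger samples the chi-square test can be used as a double check.
chi2, p_chi2, dof, expected = chi2_contingency(table)
print(f"Chi-square test: chi2 = {chi2:.2f}, p = {p_chi2:.2g}")
```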

4.2 RQ2: Does the presence of Fit tables affect the effort required in the comprehension of textual requirements?

The subjects spent, on average, 8 minutes on each requirement, with a minimum of 2 minutes and a maximum of 21. The Time distribution is not normal (Shapiro-Wilk p = 4.51·10−6); for this reason the hypothesis will be tested using a non-parametric test (Mann-Whitney).
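A minimal sketch of this test sequence, assuming the answer times for the two treatments are available as two lists of minutes (the values below are hypothetical, not the raw data of the experiment):

```python
from scipy.stats import shapiro, mannwhitneyu

# Hypothetical answer times (minutes) under the two treatments.
fit_times = [7, 6, 8, 5, 9, 7, 6, 10]
text_times = [9, 11, 8, 10, 7, 12, 9, 8]

# Shapiro-Wilk checks normality: a small p-value suggests a non-normal
# distribution and motivates a non-parametric comparison.
_, p_normal = shapiro(fit_times + text_times)

# Mann-Whitney compares the two independent samples without assuming normality.
_, p_mw = mannwhitneyu(fit_times, text_times, alternative="two-sided")
print(f"Shapiro-Wilk p = {p_normal:.3f}, Mann-Whitney p = {p_mw:.3f}")
```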

To address H0t, we consider the mean time needed to process each requirement and answer the corresponding question. Table 7 shows the mean time and the p-value of the Mann-Whitney test (cells with a dark background are related to the Fit treatment). Boxplots of the same data are shown in Figure 3.

Results show that subjects using Fit tables spent significantly less time for R1 (both experiments), and significantly more time in the case of R6 (Exp II and whole data set). No significant difference was found for the other requirements.


Table 7. Mean absolute time to answer and results of the Mann-Whitney test.

Group R1 R2 R3 R4 R5 R6

Red 7.63 9.00 7.25 8.38 7.38 6.00

Yellow 9.83 10.86 6.00 9.57 7.86 8.00

p-value 0.008 0.41 0.17 0.68 1.00 0.19

(a) Exp I

Group R1 R2 R3 R4 R5 R6

Red 6.16 10.83 7.33 5.66 8.00 4.60

Yellow 10.33 9.12 7.66 7.88 6.80 8.11

p-value 0.021 0.90 0.81 0.17 0.68 0.033

(b) Exp II

Group R1 R2 R3 R4 R5 R6

Red 7.00 9.79 7.29 7.21 7.64 5.43

Yellow 10.13 9.56 6.94 8.62 7.31 8.07

p-value 0.003 0.75 0.61 0.25 0.72 0.012

(c) Whole data set

Fig. 3. Whole data set – Time for the different requirements (1–6) with and without Fit tables.

To perform an overall analysis of the time spent to answer questions, we compared the mean normalized times — as defined in equation (1) — spent to answer questions with Fit tables (overall mean = 8.14 minutes) with the time


Table 8. Mean normalized times.

Group R1 R2 R3 R4 R5 R6

Red 7.11 8.64 8.75 7.52 8.82 6.68

Yellow 8.89 7.30 7.42 7.94 7.26 9.05

spent with only textual requirements (overall mean = 7.73 minutes). Values are shown in Table 8. The Mann-Whitney test indicates that, overall, the difference between Fit and Text is not significant (p-value = 0.26), meaning that H0t cannot be rejected.

4.3 Effects of context factors

This section aims at investigating:

(1) whether there is a relationship between the time spent in answering and the answer correctness;

(2) whether there are context factors that influence the effect of the main factor treatments;

(3) whether the chosen design provided unbalanced tasks to the two different groups (Yellow and Red);

(4) whether there is a difference between the two replications of the experiment.

Fig. 4. Normalized Time spent when providing correct and wrong answers: (a) Exp I, (b) Exp II.

(1) We checked the presence of a possible correlation between effort and correctness. A low effort is, in fact, a positive indicator only if it corresponds to a high comprehension level. Measuring effort reduction (if any) without considering the correctness of the task performed (in this case the


Table 9. Requirements vs. Correct Answers.

R1 R2 R3 R4 R5 R6

Wrong 9 10 6 8 10 12

Correct 6 5 9 7 5 3

(a) Exp I

R1 R2 R3 R4 R5 R6

Wrong 5 7 6 3 11 12

Correct 10 8 9 12 4 3

(b) Exp II

comprehension task) is not worthwhile, since a low effort could just indicate that subjects traded effort for reduced correctness of the answers. Figure 4 shows, for the two treatments, boxplots of the normalized time for correct and wrong answers. A two-tailed Mann-Whitney test indicated that, for both treatments (Fit or Text), there is no significant difference between the time spent when answers were correct and when they were not (p-value = 0.31 for Fit and 0.93 for Text in Exp I, 0.32 for Fit and 0.82 for Text in Exp II). In summary, this indicates that, for both treatments, a different effort level did not cause an increase or decrease of the comprehension level and that, vice versa, a higher comprehension level did not necessarily require a higher effort.

(2) We investigated the influence of the requirement on the dependent variable comprehension level. To check if the requirement influenced the number of correct answers, we built a contingency table starting from the data contained in Table 9 and applied the Cochran Q test [28] to it. It can be used to perform a within-subject analysis to test whether, for at least two treatments (requirements, in our case), the proportion (of correct answers, in our case) is different. According to the test, the null hypothesis that proportions are equal for all requirements cannot be rejected for either Exp I or Exp II (p-value ∼ 1 in both cases).

(3) Due to the limited number of subjects, we could not use a standard (i.e., full or latin square) design. Therefore we checked whether the combination of treatments and requirements assigned to the two experimental groups could have determined unbalanced tasks. In other words, we checked whether, overall, requirements 1, 3, and 5 differ from requirements 2, 4, and 6 in terms of either correct answer rate or time required. As far as the correct answer rate is concerned, we built a contingency table and applied a χ2 test, obtaining a p-value of 0.549. In terms of time required, we applied the Mann-Whitney test and obtained a p-value of 0.49. These results indicate that the two groups received balanced sets of assignments.

(4) We verified that no major difference was found between the two replications


of the experimentation. As far as correct answers are concerned, we built a contingency table (Table 10) and applied the χ2 test, obtaining a p-value of 0.13. To check the effect on the (normalized) Time we used the Mann-Whitney test. The test indicates that there was no significant difference between the two experiments also for what concerns the time (p-value = 0.93).

Table 10. Experiment vs. Correct Answers.

Exp I Exp II

Wrong 55 44

Correct 35 46

4.4 Post-experiment questionnaire

We encoded the results from the first six questions of the survey questionnaire (see Table 4) on a Likert scale as follows: 1 = strongly agree; 2 = agree; 3 = not certain; 4 = disagree; 5 = strongly disagree. Questions seven and eight were encoded according to the following schema: 1 = <20%; 2 = ≥20% and <40%; 3 = ≥40% and <60%; 4 = ≥60% and <80%; 5 = ≥80%. The last question was encoded in this way: 1 = very much; 2 = enough; 3 = undecided; 4 = little; 5 = definitely not. The data collected through the post-experiment questionnaire of Exp I 6 are summarized in Table 11.

The answers to questions Q1 through Q6 confirmed that the subjects were able to understand the material provided within the time allocated for the experiment. From the following two questions (Q7 and Q8) we found that about 50% of the time was devoted to reading the textual part of the requirements. Subjects spent about 30% of the available time reading the Fit tables when these were available. Finally, we can observe (Q9) that subjects deemed the Fit tables useful to understand requirements better (median = "very much" and mean between "enough" and "very much"). This confirms what was found by Melnik et al. [17].

4.5 Threats to Validity

We discuss the threats to validity that may have affected this experimentation: internal, construct, conclusion and external validity threats.

Internal validity threats concern external factors that may affect the outcome of our study. Proper analyses were performed — as shown in Section 4.3 —

6 In Exp II we could not collect post-experiment questionnaires due to time reasons.


Table 11. Post-questionnaire data (only Exp I). Questions 7 and 8 are on the time spent on Requirements vs. Fit Tables as perceived by subjects.

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9

mean 1.33 1.93 2.00 2.13 2.20 1.60 3.40 2.27 1.53

median 1 2 2 2 2 1 3 2 1

std dev 0.49 0.80 1.00 0.83 0.68 0.83 0.91 0.88 0.83

to analyze the effect of these factors. The only significant effect found was the influence of the particular requirement on the number of correct answers. However, as discussed above, it is common to develop/maintain software systems with requirements having different levels of complexity. Threats to internal validity could also have been due to the different levels of difficulty and complexity of the requirements, which some subjects received with Fit tables and some with just a textual description. Also in this case, the analyses made showed that the tasks of the two groups of subjects were balanced enough (see point 3 in Section 4.3). Subjects were properly trained before the experiment, as described in Section 3.6. Although the instructors were different, the course was prepared together by both instructors to limit differences in the training activity. To avoid social threats due to evaluation apprehension, students were not evaluated on their performance. Finally, subjects were not aware of the experimental hypotheses.

Construct validity threats concern the relationship between theory and observation. A crucial construct validity threat is related to how requirement comprehension and effort were measured. Requirement comprehension was assessed by asking questions related to each requirement, as described in Section 3. Of course, more complex or easier questions could have led to different results. Effort was measured by means of time sheets. Although this may not be very accurate, it is a widely adopted way of measuring performance (monitoring is often not possible for legal reasons), and teaching assistants checked that the forms were correctly filled in. Finally, since different questions required different times to be answered, the time measure was normalized to permit comparisons.

Conclusion validity concerns the relationship between the treatment and the outcome. Proper tests were performed to statistically reject the null hypotheses. The small sample size (overall 30 subjects) may limit the capability of statistical tests to reveal any effect; for contingency tables we used Fisher's exact test, which is particularly suitable for such a context. Where the number of samples was large enough, the χ2 test was used. Time was compared using the non-parametric Mann-Whitney test. Multiple mean comparisons were performed using the Kruskal-Wallis test.


External validity concerns the generalization of the findings. External validity threats are always present when experiments with students are conducted. Our results may be generalized to junior developers, but to draw any conclusions about more experienced developers, empirical studies involving professionals are needed. In any case, this is just a first piece of falsifiable knowledge that further studies with universities and industries could confirm or contradict. Threats to external validity are also related to the system we used as experimental object. While the system itself represents a real world application, requirements and related questions are necessarily simple; thus they may be deemed realistic but not real. On the other hand, their size and complexity were designed to be proportional to the time available for the experiment (a single, 2-hour laboratory session).

5 Related work

The literature reports several empirical studies on software testing, having the purpose of comparing the effectiveness of different software testing strategies. One of the first seminal works in this direction is the experimentation performed by Basili and Selby [2]. The main results of this study, conducted with 42 professionals, are:

• code reading reveals more faults than functional or structural testing;
• functional testing reveals more faults than structural testing.

Since the introduction of Extreme Programming, many empirical studies have been conducted on Test-Driven Development (TDD) [3] to understand its effects. TDD studies have been conducted both with professionals and with students. An industrial study [10] showed that programmers using TDD produced code that passed 18 to 50 percent more test cases than code produced by control groups not using TDD. At the same time, subjects using TDD took 16% longer [10]. Another industrial study [16] showed that TDD reduces the defect density by up to 50%. Experiments with students report contrasting results. Two studies [8,14] reported a significant improvement in software quality and programmer productivity, one [9] only a correlation between the number of tests written and productivity (students using test-first methods wrote more tests), and two others [20,22] reported no improvement in either defect density or productivity. A summary of the experiments conducted on TDD can be found in [12].

The previous experimentations, however, mainly investigated the ability of testing strategies to reveal faults, or the capability of TDD to improve productivity. They do not investigate whether test cases could also constitute a means to complement other software artifacts, with the aim of improving


their understandability. Also, although in the literature there are several papers [1,23] and books [19] describing acceptance testing with Fit tables, only a few works report empirical studies about Fit.

The closest work to our study is the paper by Melnik et al. [17]. It is a study focused on the use of Fit user acceptance tests for specifying functional requirements. This study has been conducted at the University of Calgary with 25 undergraduate students and at the Southern Alberta Institute of Technology with 17 bachelor students. In this experiment, Melnik et al. showed that the use of Fit tables and the possibility of executing them improve the comprehension of requirements and produce some benefit in the implementation phase. While we share with Melnik et al. the perceived usefulness of Fit tables, the main differences from the present study are:

(1) Off-line vs. on-line study. Melnik et al.'s [17] subjects worked on their own, off-line, for two weeks (i.e., the experiment was unsupervised). In our case, subjects completed the tasks in a 2-hour laboratory, i.e., in a controlled environment and without any possibility of exchanging information.

(2) Control group. In Melnik et al.'s study [17] all subjects had the Fit tables as an aid to implement a subset of functional requirements. In our study each subject received six requirements, three with Fit tables and three without them (control group).

(3) Working in teams vs. individuals. In Melnik et al.'s study [17] subjects worked in teams of 4 to 5 members, while in our experiment they worked alone.

(4) Implementation vs. questions. In Melnik et al.'s study [17] subjects had to implement a subset of functional requirements and the evaluation was done considering the number of test cases passed. In our case, subjects had to answer a comprehension question for each requirement. Differently from Melnik et al. [17], the evaluation was made by considering the number of correct and wrong answers.

(5) Executable test cases. In Melnik et al.'s study [17] subjects had the Fit tables and the Fixtures, with the possibility to execute them. In our case, subjects could only use the Fit tables to better grasp the requirements.

(6) Guidelines. We designed the experiment following the guidelines by Wohlin et al. [30] and Juristo & Moreno [13].

In Deng et al.'s study [6] a survey on Acceptance Test Driven Development has been conducted with 33 professionals, who were asked to complete a questionnaire. The study concluded that the time frame between the definition of an acceptance test and its first successful pass is much longer than that of unit testing. The average time frame for acceptance testing is more than 4 hours, i.e., more than half a day. Starting from this result the authors built a new tool, named FitClipse, able to distinguish the two possible "failing cases". The color red


indicates a regression failure (i.e., a failure resulting from a recent change losing previously working functionality), while orange suggests that there is an unimplemented feature (not a failure). An initial self-evaluation shows that the distinction between the failure states and the use of FitClipse is useful. With respect to their study, the series of experiments reported in this paper aims at performing a more quantitative assessment of the comprehension level achieved in the presence of Fit tables.

In a companion paper [25], we reported some partial analyses and some results from the experiment conducted in Trento (Exp I). The second experiment (Exp II), which confirms the results of the first, is presented for the first time here. Moreover, the present paper describes all the details of the experiment, presents a new metric (normalized time) and discusses in detail the effects of the context factors.

The paper [26] reports a controlled experiment with master students aimed at assessing the use of executable Fit acceptance test suites in the context of maintenance and evolution tasks. The obtained results indicate that Fit tables and the possibility to execute them significantly help developers to correctly perform maintenance tasks. Other than looking at requirements, students continuously executed Fit test cases to ensure that the Fit tables related to the change requirements passed. At the same time, they used the application requirements' Fit tables to regression test the existing pieces of functionality.

In another preliminary study [29], some of the authors of the present paper found statistically significant evidence that the availability of Fit tables allows programmers to complete more maintenance tasks. However, differently from [26], they did not measure whether the completed maintenance tasks were indeed correct.

Other than Fit tables, alternative formal or semi-formal notations can also help to improve the comprehension of requirements, analysis or design documents. For example, Briand et al. [4] reported on a series of experiments aimed at assessing the support provided by Object Constraint Language (OCL) constraints in the comprehension of UML documentation. Similarly, Satpathy et al. investigated the use of the B language as a formal complement to UML [27]. In the authors' opinion, however, the use of Fit tables differs for a number of reasons. Among others, Fit tables can be defined during the requirement elicitation phase based on the discussion between customers and analysts; also, they can be directly derived from scenarios. The way they can be specified (tables) is pretty straightforward and does not require the particular skills necessary for languages such as OCL. Other formal specifications can be specified by analysts or architects as a complement to analysis or design documents; as a consequence they may or may not reflect the customer's expectations.


6 Discussion and conclusions

6.1 Discussion

The presence of Fit tables helps programmers to better understand requirements and to solve ambiguities and unclear points that textual descriptions often contain (RQ1). When Fit tables are present, the chances of correctly understanding a requirement are in most cases (95%) at least two times higher than without them, and on average 4 times higher (see Tables 5 and 6).
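A worked computation, under the assumption that the "4 times" figure refers to the odds ratio of the overall contingency table (Table 6):

\[
\frac{\text{odds}(\text{correct}\mid\text{Fit})}{\text{odds}(\text{correct}\mid\text{Text})}
= \frac{56/34}{25/65}
= \frac{56 \cdot 65}{34 \cdot 25}
\approx 4.3
\]

that is, the odds of a correct answer with Fit tables are roughly four times the odds without them.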

When reading a requirement, developers tend to use Fit tables to confirm or contradict what they understood from the requirement text. This of course means that Fit tables can constitute a valid complement — not a replacement — to textual requirements, with the purpose of aiding their comprehension. This is particularly true when a requirement involves a complex computation or complex options, as shown in Section 4.3. In any realistic software development environment, one has to deal with requirements of different complexity; when a requirement is particularly complex or some parts of it tend to be unclear or ambiguous, additional support in the form of test cases helps. When requirements are simpler, such additional help may be less useful. The quantitative results supporting this conclusion were also corroborated by the answers to Question 9 of the post-experiment questionnaire (median = "very much"): subjects felt confident about the support provided by Fit tables in requirement understanding.

No significant difference was found in terms of time spent by subjects having Fit tables and by subjects having only textual requirements; thus, the same effort is on average required (RQ2). The time necessary to read and understand the examples provided in the Fit tables is negligible when compared to the time required to conceive new examples to verify the comprehension of purely textual requirements. In practice, Fit tables require extra time to read and check the examples, but — at least for the requirements used in our experiment — such additional effort is not significant. When comparing the time spent by subjects who answered correctly with that of subjects who answered wrongly, no significant difference was found for either treatment (Fit or Text).

Moreover, no significant difference was found when comparing the results of the two replications of the experiment. The main difference between the two was the background of the subjects: students with a computer science background in the first experiment, and students with a generic scientific background in the second one. We may generalize this by saying that Fit tables have a positive effect both for developers and for non-technical people (e.g., customers) and help all stakeholders to better understand requirements.


6.2 Conclusions

This paper presented the results of a series of controlled experiments aimed at assessing whether the adoption of table-based acceptance tests affects the understanding of requirements. The experiment has been replicated, unchanged, two times in two different Universities (Trento and Torino), involving 15 students each time. The experiments dealt with requirements for a library management system, representative of small/medium sized information systems.

The adoption of Fit tables to augment textual requirement descriptions led to a significantly better understanding of the requirements, with respect to the case where only textual descriptions were available. Such an improvement can be quantified as making the odds of a correct requirement comprehension four times (400%) higher. The effort required is substantially the same. Replication of the experiment in two different locations allowed us to compare the beneficial effects of Fit tables on populations of users having different backgrounds. Results indicate that costs and benefits have similar figures, regardless of the characteristics of the users involved. This means that Fit tables can help a wide range of stakeholders involved in requirement elicitation, including those with limited or no technical background (e.g., the customer), hence providing a substantial benefit to the entire requirement definition phase.

We think that the results obtained in this series of experiments could have several practical implications. Given that Fit helps in the understanding of requirements without requiring additional effort, project managers could consider using it in the software development process to reduce interpretation mismatches between analysts and developers, with the final goal of improving software quality. Another consequence for practitioners concerns training time. Although further empirical studies are needed to verify it, it seems that (i) learning to use Fit tables is not difficult (students needed only eight hours to reach a good level of knowledge) and (ii) the better developers are trained, the lower the additional effort. An important advantage of Fit tables over other formalisms is that they can easily be created by analysts and customers during the requirement elicitation phase, since they do not require the specialized skills demanded in other cases.

In our future work, we intend to replicate this experiment in order to corroborate our findings and investigate more thoroughly the effort involved in the adoption of Fit tables. In fact, although this experiment showed that Fit tables were beneficial, it would be useful to see whether the additional effort needed by analysts to develop and validate Fit tables is paid back by an improved comprehension of requirements. Further experiments should also investigate how subjects use Fit tables in the comprehension task. Finally, we are interested in repeating the present experiment in alternative contexts, including professional contexts involving real requirements and systems.

7 Acknowledgments

The authors would like to thank all the students who participated in the experiments. Without them, this work would not have been possible. The authors also acknowledge the anonymous referees for their detailed and constructive feedback.
