On the Advantages of a Systematic Inspection for Evaluating Hypermedia Usability



A. De Angeli
NCR Self-Service Advanced Technology & Research, Dundee, UK

M. Matera and M. F. Costabile
Dipartimento di Informatica, Università di Bari, Italy

F. Garzotto and P. Paolini
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy

It is indubitable that usability inspection of complex hypermedia is still an “art,” in the sense that a great deal is left to the skills, experience, and ability of the inspectors. Training inspectors is difficult and often quite expensive. The Systematic Usability Evaluation (SUE) inspection technique has been proposed to help usability inspectors share and transfer their evaluation know-how, to simplify the hypermedia inspection process for newcomers, and to achieve more effective and efficient evaluation results. SUE inspection is based on the use of evaluation patterns, called abstract tasks, which precisely describe the activities to be performed by evaluators during inspection. This article highlights the advantages of this inspection technique by presenting its empirical validation through a controlled experiment. Two groups of novice inspectors were asked to evaluate a commercial hypermedia CD-ROM by applying the SUE inspection or traditional heuristic evaluation. The comparison was based on three major dimensions: effectiveness, efficiency, and satisfaction. Results indicate a clear advantage of the SUE inspection over the traditional inspection on all dimensions, demonstrating that abstract tasks are efficient tools to drive the evaluator’s performance.

INTERNATIONAL JOURNAL OF HUMAN–COMPUTER INTERACTION, 15(3), 315–335. Copyright © 2003, Lawrence Erlbaum Associates, Inc.

The authors are immensely grateful to Prof. Rex Hartson, from Virginia Tech, for his valuable suggestions. The authors also thank Francesca Alonzo and Alessandra Di Silvestro, from the Hypermedia Open Center of Polytechnic of Milan, for the help offered during the experiment data coding.

The support of the EC grant FAIRWIS project IST-1999-12641 and of MURST COFIN 2000 is acknowledged.

Requests for reprints should be sent to M. F. Costabile, Dipartimento di Informatica, Università di Bari, Via Orabona, 4–70126 Bari, Italy. E-mail: [email protected]

1. INTRODUCTION

One of the goals of the human–computer interaction (HCI) discipline is to define methods for ensuring usability, which is now universally acknowledged as a significant aspect of the overall quality of interactive systems. ISO Standard 9241-11 (International Standard Organization, 1997) defines usability as “the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.” In this framework, effectiveness is defined as the accuracy and the completeness with which users achieve goals in particular environments. Efficiency refers to the resources expended in relation to the accuracy and completeness of the goals achieved. Satisfaction is defined as the comfort and the acceptability of the system for its users and other people affected by its use.

Much attention to usability is currently paid by industry, which is recognizing the importance of adopting evaluation methods during the development cycle to verify the quality of new products before they are put on the market (Madsen, 1999). One of the main complaints of industry is, however, that cost-effective evaluation tools are still lacking. This prevents most companies from actually performing usability evaluation, with the consequent result that a lot of software is still poorly designed and unusable. Therefore, usability inspection methods are emerging as preferred evaluation procedures, being less costly than traditional user-based evaluation.

It is indubitable that usability inspection of complex applications, such as hypermedia, is still an “art,” in the sense that a great deal is left to the skills, experience, and ability of the inspectors. Moreover, training inspectors is difficult and often quite expensive. As part of an overall methodology called Systematic Usability Evaluation (SUE; Costabile, Garzotto, Matera, & Paolini, 1997), a novel inspection technique has been proposed to help usability inspectors share and transfer their evaluation know-how, make the hypermedia inspection process easier for newcomers, and achieve more effective and efficient evaluations. As described in previous articles (Costabile & Matera, 1999; Garzotto, Matera, & Paolini, 1998, 1999), the inspection proposed by SUE is based on the use of evaluation patterns, called abstract tasks, which precisely describe the activities to be performed by an evaluator during the inspection.

This article presents an empirical validation of the SUE inspection technique. Two groups of novice inspectors were asked to evaluate a commercial hypermedia CD-ROM by applying the SUE inspection or the traditional heuristic technique. The comparison was based on three major dimensions: effectiveness, efficiency, and satisfaction. Results demonstrated a clear advantage of the SUE inspection over the heuristic evaluation on all dimensions, showing that abstract tasks are efficient tools to drive the evaluator’s performance.

This article has the following organization. Section 2 provides the rationale for the SUE inspection by describing the current situation of usability inspection techniques. Section 3 briefly describes the main characteristics of the SUE methodology, whereas Section 4 outlines the inspection technique proposed by SUE. Section 5, the core of the article, describes the experiment that was performed to validate the SUE inspection. Finally, Section 6 presents conclusions.


2. BACKGROUND

Different methods can be used to evaluate usability, among which the most common are user-based methods and inspection methods. User-based evaluation mainly consists of user testing: It assesses usability properties by observing how the system is actually used by some representative sample of real users (Dix, Finlay, Abowd, & Beale, 1993; Preece et al., 1994; Whiteside, Bennet, & Holtzblatt, 1988). Usability inspection refers to a set of methods in which expert evaluators examine usability-related aspects of an application and provide judgments based on their knowledge. Examples of inspection methods are heuristic evaluation, cognitive walk-through, guideline review, and formal usability inspection (Nielsen & Mack, 1994).

User-based evaluations have so far been considered to provide the most reliable results, because they involve samples of real users. Such methods, however, have a number of drawbacks, such as the difficulty of properly selecting a correct sample from the user community and training participants to master the most sophisticated and advanced features of an interactive system. Furthermore, it is difficult and expensive to reproduce actual situations of usage (Lim, Benbasat, & Todd, 1996), and failure in creating real-life situations may lead to artificial findings rather than realistic results. Therefore, the cost and the time to set up reliable empirical testing may be excessive.

In comparison to user-based evaluation, usability inspection methods are more subjective. They are strongly dependent on the inspector’s skills. Therefore, different inspectors may produce noncomparable outcomes. However, usability inspection methods “save users” (Jeffries, Miller, Wharton, & Uyeda, 1991; Nielsen & Mack, 1994) and do not require special equipment or lab facilities. In addition, experts can detect problems and possible future faults of a complex system in a limited amount of time. For all these reasons, inspection methods have been used more widely in recent years, especially in industrial environments (Nielsen, 1994a).

Among usability inspection methods, heuristic evaluation (Nielsen, 1993, 1994b) is the most commonly used. With this method, a small set of experts inspect a system and evaluate its interface against a list of recognized usability principles—the heuristics. Experts in heuristic evaluation can be usability specialists, experts of the specific domain of the application to be evaluated, or (preferably) double experts, with both usability and domain experience. During the evaluation session, each evaluator goes individually through the interface at least twice. The first step is to get a feel of the flow of the interaction and the general scope of the system. The second is to focus on specific objects and functionality, evaluating their design, implementation, and so forth, against a list of well-known heuristics. Typically, such heuristics are general principles, which refer to common properties of usable systems. However, it is desirable to develop and adopt category-specific heuristics that apply to a specific class of products (Garzotto & Matera, 1997; Nielsen & Mack, 1994). The output of a heuristic evaluation session is a list of usability problems with reference to the violated heuristics. Reporting problems in relation to heuristics enables designers to easily revise the design, in accordance with what is prescribed by the guidelines provided by the violated principles. Once the evaluation has been completed, the findings of the different evaluators are compared to generate a report summarizing all the findings.

Heuristic evaluation is a “discount usability” method (Nielsen, 1993, 1994a). In fact, some researchers have shown that it is a very efficient usability engineering method (Jeffries & Desurvire, 1992) with a high benefit–cost ratio (Nielsen, 1994a). It is especially valuable when time and resources are short, because skilled evaluators, without needing the involvement of representative users, can produce high-quality results in a limited amount of time (Kantner & Rosenbaum, 1997). This technique has, however, a number of drawbacks. As highlighted by Jeffries et al. (1991), Doubleday, Ryan, Springett, and Sutcliffe (1997), and Kantner and Rosenbaum (1997), its major disadvantage is the high dependence on the skills and experience of the evaluators. Nielsen (1992) stated that novice evaluators with no usability expertise are poor evaluators and that usability experts are 1.8 times as good as novices. Moreover, application domain and usability experts (the double experts) are 2.7 times as good as novices and 1.5 times as good as usability experts (Nielsen, 1992). This means that experience with the specific category of applications being evaluated really improves the evaluators’ performance. Unfortunately, usability specialists may lack domain expertise, and domain specialists are rarely trained or experienced in usability methodologies. To overcome this problem for hypermedia usability evaluation, the SUE inspection technique has been introduced. It uses evaluation patterns, called abstract tasks, to guide the inspectors’ activity. Abstract tasks precisely describe which hypermedia objects to look for and which actions the evaluators must perform to analyze such objects. In this way, less experienced evaluators, who lack expertise in usability or hypermedia, are able to produce more complete and precise results. The SUE inspection technique also addresses a further drawback of heuristic evaluation, reported by Doubleday et al. (1997): heuristics, as they are generally formulated, are not always able to adequately guide evaluators. To this end, the SUE inspection framework provides evaluators with a list of detailed heuristics that are specific for hypermedia. Abstract tasks provide a detailed description of the activities to be performed to detect possible violations of the hypermedia heuristics.

In SUE, the overall inspection process is driven by the use of an application model, the hypermedia design model (HDM; Garzotto, Paolini, & Schwabe, 1993). The HDM concepts and primitives allow evaluators to identify precisely the hypermedia constituents that are worthy of investigation. Moreover, both the hypermedia heuristics and the abstract tasks focus on such constituents and are formulated through HDM terminology. Such a terminology also is used by evaluators for reporting problems, thus avoiding the generation of incomprehensible and vague inspection reports. In some recent articles (Andre, Hartson, & Williges, 1999; Hartson, Andre, Williges, & Van Rens, 1999), authors have highlighted the need for more focused usability inspection methods and for a classification of usability problems to support the production of inspection reports that are easy to read and compare. These authors have defined the user action framework (UAF), which is a unifying and organizing environment that supports design guidelines, usability inspection, classification, and reporting of usability problems. UAF provides a knowledge base in which different usability problems are organized, taking into account how users are affected by the design during the interaction, at various points where they must accomplish cognitive or physical actions. The classification of design problems and usability concepts is a way to capitalize on past evaluation experiences. It allows evaluators to better understand the design problems they encounter during the inspection and helps them identify precisely which physical or cognitive aspects cause problems. Evaluators are therefore able to propose well-focused redesign solutions. The motivations behind this research are similar to ours. Reusing past evaluation experience and making it available to less experienced people is a basic goal of the authors, which is pursued through the use of abstract tasks. Their formulation is, in fact, the reflection of the experiences of some skilled evaluators. Unlike the UAF, rather than recording problems, abstract tasks offer a way to keep track of the activities to be performed to discover problems.

3. THE SUE METHODOLOGY

SUE is a methodology for evaluating the usability of interactive systems, which prescribes a structured flow of activities (Costabile et al., 1997). SUE has been largely specialized for hypermedia (Costabile, Garzotto, Matera, & Paolini, 1998; Garzotto et al., 1998, 1999), but this methodology can easily be exploited to evaluate the usability of any interactive application (Costabile & Matera, 1999). A core idea of SUE is that the most reliable evaluation can be achieved by systematically combining inspection with user-based evaluation. In fact, several studies have outlined how these two methods are complementary and can be effectively coupled to obtain a reliable evaluation process (Desurvire, 1994; Karat, 1994; Virzi, Sorce, & Herbert, 1993). The inspection proposed by SUE, based on the use of abstract tasks (in the following, SUE inspection), is carried out first. Then, if inspection results are not sufficient to predict the impact of some critical situations, user-based evaluation also is conducted. Because it is driven by the inspection outcome, the user-based evaluation tends to be better focused and more cost-effective.

Another basic assumption of SUE is that, to be reliable, usability evaluation should encompass a variety of dimensions of a system. Some of these dimensions may refer to general layout features common to all interactive systems, whereas others may be more specific for the design of a particular product category or a particular domain of use. For each dimension, the evaluation process consists of a preparatory phase and an execution phase. The preparatory phase is performed only once for each dimension, and its purpose is to create a conceptual framework that will be used to carry out actual evaluations. As better explained in the next section, such a framework includes a design model, a set of usability attributes, and a library of abstract tasks. Because the activities in the preparatory phase may require extensive use of resources, they should be regarded as a long-term investment. The execution phase is performed every time a specific application must be evaluated. It mainly consists of an inspection, performed by expert evaluators. If needed, inspection can be followed by sessions of user testing, involving real users.

4. THE SUE INSPECTION TECHNIQUE FOR HYPERMEDIA

The SUE inspection is based on the use of an application design model for describing the application, a set of usability attributes to be verified during the evaluation, and a set of abstract tasks (ATs) to be applied during the inspection phase. The term model is used in a broad sense, meaning a set of concepts, representation structures, design principles, primitives, and terms that can be used to build a description of an application. The model helps organize concepts, thus identifying and describing, in an unambiguous way, the components of the application that constitute the entities of the evaluation (Fenton, 1991). For the evaluation of hypermedia, the authors have adopted HDM (Garzotto et al., 1993), which focuses on structural and navigation properties as well as on active media features.

Usability attributes are obtained by decomposing general usability principles into finer grained criteria that can be better analyzed. In accordance with the suggestion by Nielsen and Mack (1994) to develop category-specific heuristics, the authors have defined a set of usability attributes able to capture the peculiar features of hypermedia (Garzotto & Matera, 1997; Garzotto et al., 1998). Such hypermedia usability attributes correspond with Nielsen’s 10 heuristics (Nielsen, 1993). The hypermedia usability attributes, in fact, can be considered a specialization for hypermedia of Nielsen’s heuristics, with the only exception of “good error messages” and “help and documentation,” which do not need to be further specialized.

ATs are evaluation patterns that provide a detailed description of the activities to be performed by expert evaluators during inspection (Garzotto et al., 1998, 1999). They are formulated precisely by following a pattern template, which provides a consistent format including the following items:

• AT classification code and title uniquely identify the AT and succinctly convey its essence.

• Focus of action briefly describes the context, or focus, of the AT by listing the application constituents that correspond to the evaluation entities.

• Intent describes the problem addressed by the AT and its rationale, making clear the specific goal to be achieved through the AT application.

• Activity description is a detailed description of the activities to be performed when applying the AT.

• Output describes the output of the fragment of the inspection the AT refers to.

Optionally, a comment is provided, with the aim of indicating further ATs to be applied in combination or highlighting related usability attributes.
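The following minimal sketch shows one possible machine-readable encoding of the pattern template just described. The Python class and field names are illustrative assumptions, not part of the SUE specification, and the example instance paraphrases the AS-1 abstract task reported in Table 1.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative encoding of the AT pattern template; field names mirror the
# template items above and are assumptions of this sketch, not part of SUE.
@dataclass
class AbstractTask:
    code: str                        # AT classification code, e.g., "AS-1"
    title: str                       # succinctly conveys the AT's essence
    focus_of_action: str             # application constituents under evaluation
    intent: str                      # problem addressed by the AT and its rationale
    activity_description: List[str]  # activities the inspector performs
    output: List[str]                # what the inspector must report
    comment: Optional[str] = None    # optional pointers to related ATs or attributes

# Example instance paraphrasing AS-1 ("Control on active slots") from Table 1.
as1 = AbstractTask(
    code="AS-1",
    title="Control on active slots",
    focus_of_action="An active slot",
    intent="Evaluate the control provided over the active slot and the "
           "visibility of its intermediate states",
    activity_description=[
        "Execute commands such as play, suspend, continue, stop, and replay",
        "During activation, verify whether the current state and its "
        "evolution up to the end can be identified",
    ],
    output=[
        "A list and short description of the control commands and of the "
        "state-visibility mechanisms",
        "A statement on the appropriateness and effectiveness of those mechanisms",
    ],
)
```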


A further advantage of the use of a model is that it provides the terminology for formulating the ATs. The 40 ATs defined for hypermedia (Matera, 1999) have been formulated by using the HDM vocabulary. Two examples are reported in Table 1. The two ATs focus on active slots.1 The list of ATs provides systematic guidance on how to inspect a hypermedia application. Most evaluators are very good at analyzing only certain features of interactive applications; often they neglect some other features, strictly dependent on the specific application category. Exploiting a set of ATs ready for use allows evaluators with no experience in hypermedia to come up with good results.

During inspection, evaluators analyze the application and specify a viable HDM schema, when it is not already available, for describing the application. During this activity, different application components (i.e., the objects of the evaluation) are identified. Then, having in mind the usability criteria, evaluators apply the ATs and produce a report that describes the discovered problems. Evaluators use the terminology provided by the model to refer to objects and describe critical situations while reporting problems, thus attaining precision in their final evaluation report.

1 In the HDM terminology, a slot is an atomic piece of information, such as text, picture, video, and sound.

5. THE VALIDATION EXPERIMENT

To validate the SUE inspection technique, a comparison study was conducted involving a group of senior students of an HCI class at the University of Bari, Italy. The aim of the experiment was to compare the performance of evaluators carrying out the SUE inspection (SI), based on the use of ATs, with the performance of evaluators carrying out heuristic inspection (HI), based on the use of heuristics only.

As better explained in Section 5.2, the validation metrics were defined along three major dimensions: effectiveness, efficiency, and user satisfaction. Such dimensions actually correspond to the principal usability factors as defined by the Standard ISO 9241-11 (International Standard Organization, 1997). Therefore, the experiment allowed the authors to assess the usability of the inspection technique (John, 1996). In the defined metrics, effectiveness refers to the completeness and accuracy with which inspectors performed the evaluation. Efficiency refers to the time expended in relation to the effectiveness of the evaluation. Satisfaction refers to a number of subjective parameters, such as perceived usefulness, difficulty, acceptability, and confidence with respect to the evaluation technique. For each dimension, a specific hypothesis was tested.

• Effectiveness hypothesis. As a general hypothesis, SI was predicted to increase evaluation effectiveness compared with HI. The advantage is related to two factors: (a) the systematic nature of the SI technique, deriving from the use of the HDM model to precisely identify the application constituents, and (b) the use of ATs, which suggest the activity to be conducted over such objects. Because the ATs directly address hypermedia applications, this prediction should also be weighted with respect to the nature of problems detected by evaluators. The hypermedia specialization of the SI could constitute both the method’s advantage and its limit. Indeed, although it could be particularly effective with respect to hypermedia-specific problems, it could neglect other flaws related to presentation and content. In other words, the limit of ATs could be that they take evaluators away from defects not specifically addressed by the AT activity.

• Efficiency hypothesis. A limit of SI could be that a rigorous application of several ATs is time consuming. However, SI was not expected to compromise inspection efficiency compared with the less structured HI technique. Indeed, the expected higher effectiveness of the SI technique should compensate for the greater time demand required by its application.

• Satisfaction hypothesis. Although SI was expected to be perceived as a more complex technique than HI, it was hypothesized that SI should enhance the evaluators’ control over the inspection process and their confidence in the obtained results.

Table 1: Two Abstract Tasks (ATs) from the Library of Hypermedia ATs

AS 1: Control on Active Slots

Focus of action: An active slot
Intent: To evaluate the control provided over the active slot, in terms of the following:
  A. Mechanisms for the control of the active slot
  B. Mechanisms supporting the state visibility (i.e., the identification of any intermediate state of the slot activation)
Activity description: Given an active slot:
  A. Execute commands such as play, suspend, continue, stop, replay, get to an intermediate state, and so forth
  B. At a given instant, during the activation of the active slot, verify if it is possible to identify its current state as well as its evolution up to the end
Output:
  A. A list and a short description of the set of control commands and of the mechanisms supporting the state visibility
  B. A statement saying if the following are true:
    • The type and number of commands are appropriate, in accordance with the intrinsic nature of the active slot
    • Besides the available commands, some further commands would make the active slot control more effective
    • The mechanisms supporting the state visibility are evident and effective

AS 6: Navigational Behavior of Active Slots

Focus of action: An active slot + links
Intent: To evaluate the cross effects of navigation on the behavior of active slots
Activity description: Consider an active slot:
  A. Activate it, and then follow one or more links while the slot is still active; return to the “original” node where the slot has been activated and verify the actual slot state
  B. Activate the active slot; suspend it; follow one or more links; return to the original node where the slot has been suspended and verify the actual slot state
  C. Execute activities A and B traversing different types of links, both to leave the original node and to return to it
  D. Execute activities A and B by using only backtracking to return to the original node
Output:
  A. A description of the behavior of the active slot, when its activation is followed by the execution of navigational links and, eventually, backtracking
  B. A list and a short description of possible unpredictable effects or semantic conflicts in the source or in the destination node, together with the indication of the type and nature of the link that has generated the problem

Note. AS = active slot.

5.1. Method

In this section, the experimental method adopted to test the effectiveness, efficiency, and user satisfaction hypotheses is described.

Participants. Twenty-eight senior students from the University of Bari participated in the experiment as part of their credit for an HCI course. Their curriculum comprised training and hands-on experience in Nielsen’s heuristic evaluation method, which they had applied to paper prototypes, computer-based prototypes, and hypermedia CD-ROMs. During lectures they also were exposed to the HDM model.

Design. The inspection technique was manipulated between participants. Randomly, half of the sample was assigned to the HI condition and the other half to the SI condition.

Procedure. A week before the experiment, participants were introduced to the conceptual tools to be used during the inspection. The training session lasted 2 hr and 30 min for the HI group and 3 hr for the SI group. The discrepancy was due to the different conceptual tools used during the evaluation by the two groups, as better explained in the following. A preliminary 2-hr seminar briefly reviewed the HDM and introduced all the participants to hypermedia-specific heuristics, as defined by SUE. Particular attention was devoted to informing students without influencing their expectations and attitudes toward the two different inspection techniques. A couple of days later, all participants were presented with a short demonstration of the application, lasting almost 15 min. A few summary indications about the application content and the main functions were introduced, without providing too many details. In this way, participants, having limited time at their disposal, did not start their usability analysis from scratch but had an idea (although vague) of how to become oriented in the application. Then, participants assigned to the SI group were briefly introduced to the HDM schema of the application and to the key concepts of applying ATs. In the proposed application schema, only the main application components were introduced (i.e., structure of entity types and application links for the hyperbase, collection structure and navigation for the access layer), without revealing any detail that could indicate usability problems.

The experimental session lasted 3 hr. Participants had to inspect the CD-ROM, applying the technique to which they were assigned. All participants were provided with a list of 10 SUE heuristics, summarizing the usability guidelines for hypermedia2 (Garzotto & Matera, 1997; Garzotto et al., 1998). The SI group also was provided with the HDM application schema and with 10 ATs to be applied during the inspection (see Table 2). The limited number of ATs was attributable to the limited amount of time participants had at their disposal. The most basic ATs were selected that could guide SI inspectors in the analysis of the main application constituents. For example, the authors disregarded ATs addressing advanced hypermedia features.

2 By providing both groups with the same heuristic list, the authors have been able to measure the possible added value of the systematic inspection induced by SUE with respect to the subjective application of heuristics.

Working individually, participants had to find the maximum number of usability problems in the application and record them on a report booklet, which differed according to the experimental conditions. In the HI group, the booklet included 10 forms, one for each of the hypermedia heuristics. The forms required information about the application point where that heuristic was violated and a short description of the problem. The SI group was provided with a report booklet including 10 forms, each one corresponding to an AT. Again, the forms required information about the violations detected through that AT and where they occurred. Examples of the forms included in the report booklets provided to the two groups are shown in Figures 1 and 2.

At the end of the evaluation, participants were invited to fill in the evaluator-satisfaction questionnaire, which combined several item formats to measure three main dimensions: user satisfaction with the evaluated application, evaluator satisfaction with the inspection technique, and evaluator satisfaction with the results achieved. The psychometric instrument was organized in two parts. The first was concerned with the application, and the second included the questions about the adopted evaluation technique. Two final questions asked participants to specify how satisfied they felt about their performance as evaluators.

Table 2: The List of Abstract Tasks (ATs) Submitted to Inspectors

AT Classification Code    AT Title
AS-1                      Control on active slots
AS-6                      Navigational behavior of active slots
PS-1                      Control on passive slots
HB-N1                     Complexity of structural navigation patterns
HB-N4                     Complexity of applicative navigation patterns
AL-S1                     Coverage power of access structures
AL-N1                     Complexity of collection navigation patterns
AL-N3                     Bottom-up navigation in index hierarchies
AL-N11                    History collection structure and navigation
AL-N13                    Exit mechanisms availability

FIGURE 1 An example of a table in the report form provided to the HI group.

FIGURE 2 A page from the report booklet given to the SI group, containing one AT and the corresponding form for reporting problems.

The application. The evaluated application was the Italian CD-ROM “Camminare nella pittura,” which means “walking through painting” (Mondadori New Media, 1997). It was composed of two CD-ROMs, each one presenting an analysis of painting and some relevant artworks in one of two periods. The first CD-ROM (CD1 in the following) covered the period from Cimabue to Leonardo, the second one the period from Bosch to Cezanne. The CD-ROMs were identical in structure, and each one could be used independently of the other. Each CD-ROM was a distinct and “complete” application of limited size, particularly suitable for being exhaustively analyzed in a limited amount of time. Therefore, only CD1 was submitted to participants. The limited number of navigation nodes in CD1 simplified the postexperimental analysis of the paths followed by the evaluators during the inspection and the identification of the application points where they highlighted problems.

Data coding. The report booklets were analyzed by three expert hypermedia designers (expert evaluators, in the following) with a strong HCI background, to assess effectiveness and efficiency of the applied evaluation technique. All reported measures had a reliability value of at least .85. Evaluator satisfaction was measured by analyzing the self-administered postexperimental questionnaires.

All the statements written in the report booklets were scored as problems or nonproblems. Problems are actual usability flaws that could affect user performance. Nonproblems include (a) observations reflecting only evaluators’ personal preferences but not real usability bugs; (b) evaluation errors, reflecting evaluators’ misjudgments or system defects due to a particular hardware configuration; and (c) statements that are not understandable (i.e., not clearly reported).

For each statement scored as a problem or a nonproblem of type (a), a severity rating was performed. As suggested by Nielsen (1994b), severity was estimated considering three factors: the frequency of the problem, the impact of the problem on the user, and the persistence of the problem during interaction. The rating was given on a Likert scale, ranging from 1 (I don’t agree that this is a usability problem at all) to 5 (usability catastrophe). Each problem was further classified into one of the following dimensions, according to the nature of the problem itself:

• Navigation, which includes problems related to the task of moving within the hyperspace. It refers to the appropriateness of mechanisms for accessing information and for getting oriented in the hyperspace.

• Active media control, which includes problems related to the interaction with dynamic multimedia objects, such as video, animation, and audio comment. It refers to the appropriateness of mechanisms for controlling the dynamic behavior of media and of mechanisms providing feedback about the current state of the media activation.


• Interaction with widgets, which includes problems related to the interaction with the widgets of the visual interface, such as buttons of various types, icons, and scrollbars. It includes problems related to the appropriateness of mechanisms for manipulating widgets and their self-evidence.

Note that navigation and active media control are dimensions specific to hypermedia systems.
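As an illustration of this coding scheme, the following sketch shows one possible way to represent a coded statement; the Python names are assumptions made for the example and were not part of the study’s instruments.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Illustrative encoding of the coding scheme described above; the class and
# enum names are assumptions of this sketch, not artifacts of the study.

class StatementType(Enum):
    PROBLEM = "problem"                    # actual usability flaw
    PERSONAL_PREFERENCE = "nonproblem_a"   # evaluator preference, not a real bug
    EVALUATION_ERROR = "nonproblem_b"      # misjudgment or hardware-specific defect
    NOT_UNDERSTANDABLE = "nonproblem_c"    # statement not clearly reported

class ProblemCategory(Enum):
    NAVIGATION = "navigation"
    ACTIVE_MEDIA_CONTROL = "active media control"
    INTERACTION_WITH_WIDGETS = "interaction with widgets"

@dataclass
class CodedStatement:
    inspector_id: int
    statement_type: StatementType
    # Severity is rated only for problems and type (a) nonproblems, on the
    # 1 (not a usability problem) to 5 (usability catastrophe) scale.
    severity: Optional[int] = None
    # Category applies to problems; navigation and active media control are
    # the hypermedia-specific dimensions.
    category: Optional[ProblemCategory] = None
```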

5.2. Results

The total number of problems detected in the application was 38. Among these, 29 problems were discovered by the expert evaluators, through an inspection before the experiment. The remaining 9 were identified only by the experimental inspectors.

During the experiment, inspectors reported a total number of 36 different types of problems. They also reported 25 different types of nonproblems of type (a) and (b). Four inspectors reported at least one nonunderstandable statement (i.e., nonproblems of type c).

The results of the psychometric analysis are reported in the following paragraphs with reference to the three experimental hypotheses.

Effectiveness. Effectiveness can be decomposed into the completeness and accuracy with which inspectors performed the evaluation. Completeness corresponds to the percentage of problems detected by a single inspector out of the total number of problems. It is computed by the following formula:

$\mathrm{Completeness}_i = \frac{P_i}{n} \times 100$

where P_i is the number of problems found by the ith inspector, and n is the total number of problems existing in the application (n = 38).

On average, inspectors in the SI group individually found 24% of all the usability defects (SEM = 1.88); inspectors in the HI group found 19% (SEM = 1.99). As shown by a Mann–Whitney U test, the difference is statistically significant (U = 50.5, N = 28, p < .05). It follows that the SI technique enhances evaluation completeness, allowing individual evaluators to discover a greater number of usability problems.

Accuracy can be defined by two indexes: precision and severity. Precision is given by the percentage of problems detected by a single inspector out of the total number of statements. For a given inspector, precision is computed by the following formula:

$\mathrm{Precision}_i = \frac{P_i}{S_i} \times 100$

where P_i is the number of problems found by the ith inspector, and S_i is the total number of statements he or she reported (including nonproblems).
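The two indexes can be implemented directly from the formulas above, as in the following minimal sketch; the inspector counts used in the example are hypothetical and are not taken from the study.

```python
def completeness(problems_found: int, total_problems: int = 38) -> float:
    """Percentage of the application's problems found by a single inspector."""
    return 100.0 * problems_found / total_problems

def precision(problems_found: int, total_statements: int) -> float:
    """Percentage of an inspector's statements that were scored as problems."""
    return 100.0 * problems_found / total_statements

# Hypothetical inspector: 10 statements reported, 9 scored as actual problems.
print(round(completeness(9), 1))   # 23.7
print(round(precision(9, 10), 1))  # 90.0
```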

In general, the distribution of precision is affected by a severe negative skewness, with 50% of participants not committing any errors. The variable ranges from 40 to 100, with a median value of 96. In the SI group, most inspectors were totally accurate (precision value = 100), with the exception of two of them, who were slightly inaccurate (precision value > 80). On the other hand, only two participants of the HI condition were totally accurate. The mean value for the HI group was 77.4 (SEM = 4.23), and the median value was 77.5. Moreover, four evaluators in the HI group reported at least one nonunderstandable statement, whereas all the statements reported by the SI group were clearly expressed and referred to application objects using a comprehensible and consistent terminology.

This general trend reflecting an advantage of SI over HI was supported also by the analysis of the severity index, which refers to the average rating of all scored statements for each participant. A t-test analysis demonstrated that the mean rating of the two groups varied significantly, t(26) = –3.92, p < .001 (two-tailed). Problems detected applying the ATs were scored as more serious than those detected when only the heuristics were available (means and standard errors are reported in Table 3).

The effectiveness hypothesis also states that the SUE inspection technique could be particularly effective for detecting hypermedia-specific problems, whereas it could neglect other bugs related to graphical user interface widgets. To test this aspect, the distribution of problem types was analyzed as a function of experimental conditions. As can be seen in Figure 3, the most common problems detected by all the evaluators were concerned with navigation, followed by defects related to active media control. Only a minority of problems regarded interaction with widgets. In general, it is evident that the SI inspectors found more problems. However, this superiority especially emerges for hypermedia-related defects (navigation and active media control), t(26) = –2.70, p < .05 (two-tailed).

FIGURE 3 Average number of problems as a function of experimental conditions and problem categories.

Table 3: Means and Standard Errors for the Analysis of Severity

                  HI           SI
Severity index    3.66 (0.12)  4.22 (0.08)

Note: HI = Heuristic Inspection; SI = Systematic Usability Evaluation Inspection. Standard errors are given in parentheses.

A slightly higher average number of “interaction with widgets” problems was found by the HI group, compared with the SI group. A Mann–Whitney U test, comparing the percentage of problems in the two experimental conditions, indicated that this difference was not significant (U = 67, N = 28, p = .16). This means that, unlike what was hypothesized, the systematic inspection activity suggested by ATs does not take evaluators away from other problems not covered by the activity description. Because the problems found by the SI group in the “interaction with widgets” category were those having the highest severity, it also can be assumed that the hypermedia ATs do not prevent evaluators from noticing usability catastrophes related to presentation aspects. Also, supplying evaluators with ATs focusing on presentation aspects, such as those presented by Costabile and Matera (1999), may allow one to obtain a deep analysis of the graphical user interface, with the result that SI evaluators would find a larger number of “interaction with widgets” problems.

Efficiency. Efficiency has been considered both at the individual and at the group level. Individual efficiency refers to the number of problems extracted by a single inspector, in relation to the time spent. It is computed by the following formula:

$\mathrm{Ind\ Efficiency}_i = \frac{P_i}{t_i}$

where P_i is the number of problems detected by the ith inspector, and t_i is the time spent for finding the problems.
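The same kind of sketch applies to the individual efficiency index; the figures used in the example are hypothetical.

```python
def individual_efficiency(problems_found: int, hours_spent: float) -> float:
    """Problems found per hour of inspection by a single inspector."""
    return problems_found / hours_spent

# Hypothetical inspector: 9 problems found over a 2-hour session.
print(individual_efficiency(9, 2.0))  # 4.5 problems per hour
```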

On average, SI inspectors found 4.5 problems in 1 hr of inspection, versus the 3.6 problems per hour found by the HI inspectors. A t test on the variable normalized by a square root transformation demonstrated that this difference was not significant, t(26) = –1.44, p = .16 (two-tailed). Such a result further supports the efficiency hypothesis, because the application of the ATs did not compromise efficiency compared with a less structured evaluation technique. Rather, SI showed a positive tendency toward finding a larger number of problems per hour.

Group efficiency refers to the evaluation results achieved by aggregating the performance of several inspectors. Toward this end, Nielsen’s cost–benefit curve, relating the proportion of usability problems to the number of evaluators, has been computed (Nielsen, 1994b). This curve derives from a mathematical model based on the prediction formula for the number of usability problems found in a heuristic evaluation, reported in the following (Nielsen, 1992):

$\mathrm{Found}(i) = n\big(1 - (1 - \lambda)^i\big)$

where Found(i) is the number of problems found by aggregating reports from i independent evaluators, n is the total number of problems in the application, and λ is the probability of finding the average usability problem when using a single average evaluator.

As suggested by Nielsen and Landauer (1993), one possible use of this model is in estimating the number of inspectors needed to identify a given percentage of usability errors. This model therefore was used to determine how many inspectors each technique would require to detect a reasonable percentage of problems in the application. The curves calculated for the two techniques are reported in Figure 4 (n = 38, λ_HI = 0.19, λ_SI = 0.24). As shown in the figure, SI tended to reach better performance with a lower number of evaluators. If Nielsen’s 75% threshold is assumed, SI can reach this level with five evaluators. The HI technique would require seven evaluators.

FIGURE 4 The cost–benefit curve (Nielsen & Landauer, 1993) computed for the two techniques, HI (heuristic inspection) and SI (SUE inspection). Each curve shows the proportion of usability problems found by each technique when different numbers of evaluators were used.
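For reference, the curves in Figure 4 can be reproduced with a short sketch that plugs the λ estimates reported above (λ_HI = 0.19, λ_SI = 0.24) into the prediction formula; the loop bound of 10 evaluators is an arbitrary choice for the example.

```python
def found_proportion(lam: float, evaluators: int) -> float:
    """Found(i)/n = 1 - (1 - lambda)^i (Nielsen & Landauer, 1993)."""
    return 1.0 - (1.0 - lam) ** evaluators

# Lambda estimates from the experiment: HI = 0.19, SI = 0.24.
for i in range(1, 11):
    hi = found_proportion(0.19, i)
    si = found_proportion(0.24, i)
    print(f"{i:2d} evaluators: HI {hi:.2f}  SI {si:.2f}")

# With these estimates the SI curve is at roughly 75% of the problems around
# five evaluators, whereas the HI curve passes that mark only at seven.
```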

Satisfaction. With respect to an evaluation technique, satisfaction refers to many parameters, such as perceived usefulness, difficulty, and acceptability of applying the method. The postexperimental questionnaire addressed three main dimensions: user satisfaction with the application evaluated, evaluator satisfaction with the inspection technique, and evaluator satisfaction with the results achieved. At first sight it may appear that the first dimension, addressing evaluators’ satisfaction with the application, is out of the scope of the experiment, the main intent of which was to compare two inspection techniques. However, the authors wanted to verify whether and how the technique used may have influenced the inspectors’ severity.

User satisfaction with the application evaluated was assessed through a semantic-differential scale that required inspectors to judge the application on 11 pairs of adjectives describing satisfaction with information systems. Inspectors could modulate their evaluation on 7 points (after recoding of reversed items, where 1 = very negative and 7 = positive). The initial reliability of the satisfaction scale is moderately satisfactory (α = .74), with three items (reliable–unreliable, amusing–boring, difficult–simple) presenting a corrected item-total correlation inferior to .30. Therefore, the user-satisfaction index was computed by averaging the scores of the remaining eight items (α = .79). Then, the index was analyzed by a t test. Results showed a significant effect of the inspection group, t(26) = 2.38, p < .05 (two-tailed). On average, the SI inspectors evaluated the application more severely (M = 4.37, SEM = .23) than HI inspectors (M = 5.13, SEM = .22). From this difference, it can be inferred that ATs provide evaluators with a more effective framework to weigh the limits and benefits of the application. The hypothesis is supported by the significant correlation between the number of usability problems found by an evaluator and his or her satisfaction with the application (r = –.42, p < .05). The negative index indicates that the more problems were found, the less positive was the evaluation.
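A minimal sketch of this scale-construction step (recode the reversed 7-point items, drop the low-correlating ones, average the rest) is given below; the item names, ratings, and the choice of which items are reversed are purely illustrative assumptions, not the actual questionnaire.

```python
from statistics import mean
from typing import Dict, Set

def recode_reversed(score: int) -> int:
    """Map a reversed 7-point item so that 7 is always the positive pole."""
    return 8 - score

def satisfaction_index(item_scores: Dict[str, int],
                       reversed_items: Set[str],
                       dropped_items: Set[str]) -> float:
    """Average of the retained items after recoding the reversed ones."""
    retained = []
    for item, score in item_scores.items():
        if item in dropped_items:
            continue  # items with corrected item-total correlation below .30
        retained.append(recode_reversed(score) if item in reversed_items else score)
    return mean(retained)

# Hypothetical ratings for a single inspector (item names are illustrative).
scores = {"useful": 5, "pleasant": 4, "boring": 6, "reliable": 3, "clear": 5,
          "efficient": 4, "flexible": 4, "complete": 5, "fast": 4,
          "consistent": 5, "simple": 3}
index = satisfaction_index(scores,
                           reversed_items={"boring"},
                           dropped_items={"reliable", "boring", "simple"})
print(round(index, 2))  # average of the eight retained items
```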

Evaluator satisfaction with the inspection technique was assessed by 11 pairs of adjectives, modulated on 7 points. The original reliability value was .72, increasing to .75 after deletion of three items (tiring–restful, complex–simple, satisfying–unsatisfying). The evaluator-satisfaction index then was computed by averaging the scores of the remaining eight items. The index is highly correlated with a direct item assessing learnability of the inspection technique (r = .53, p < .001): the easier a technique is perceived to be, the better it is evaluated. A t test showed no significant differences in the satisfaction with the inspection technique, t(26) = 1.19, p = .25 (two-tailed). On average, evaluations were moderately positive for both techniques, with a mean difference of .32 slightly favoring the HI group. To conclude, despite being objectively more demanding, SI was not evaluated worse than HI.

Evaluator satisfaction with the result achieved was assessed directly by a Likert-type item asking participants to express their gratification on a 4-point scale (from not at all to very much) and indirectly by a percentage estimation of the number of problems found. The two variables were highly correlated (r = .57, p < .01). The more problems an inspector thought he or she had found, the more satisfied he or she was with his or her performance. Consequently, the final satisfaction index was computed by multiplying the two scores. A Mann–Whitney U test showed a tendency toward a difference in favor of the HI group (U = 54.5, p = .07). Participants in the HI group felt more satisfied about their performance than those in the SI group.


By considering this finding in the general framework of the experiment, it appears that ATs provide participants with greater critical ability than heuristics. Indeed, despite the greater effectiveness achieved by participants in the SI group, they were still less satisfied with their performance, as if they could better understand the limits of an individual evaluation.

Summary. Table 4 summarizes the experimental results presented in the previous paragraphs. The advantage of the systematic approach adopted by the evaluators assigned to the SI condition is evident. The implications of these findings are discussed in the final section.

Table 4: Summary of the Experimental Results

Hypothesis       Indexes                                                 HI   SI
Effectiveness    Completeness                                            –    +
                 Accuracy: Precision                                     –    +
                 Accuracy: Severity                                      –    +
Efficiency       Individual efficiency                                   =    =
                 Group efficiency                                        –    +
Satisfaction     User satisfaction with the application evaluated        <    >
                 Evaluator satisfaction with the inspection technique    =    =
                 Evaluator satisfaction with the achieved results        <    >

Note: HI = Heuristic Inspection; SI = Systematic Usability Evaluation Inspection; – = worse performance; + = better performance; = = equal performance; < = lesser critical ability; > = greater critical ability.

6. CONCLUSIONS

In the last decade, several techniques for evaluating the usability of software systems have been proposed. Unfortunately, research in HCI has not devoted sufficient efforts toward validating such techniques, and therefore some questions persist (John, 1996). The study reported in this article provides some answers about the effectiveness, efficiency, and satisfaction of the SUE inspection technique. The experiment seems to confirm the general hypothesis of a sharp increase in the overall quality of inspection when ATs are used. More specifically, the following may be concluded:

• The SUE inspection increases evaluation effectiveness. The SI group showed greater completeness and precision in reporting problems and also identified more severe problems.

• Although more rigorous and structured, the SUE inspection does not compromise inspection efficiency. Rather, it enhanced group efficiency, defined as the number of different usability problems found by aggregating the reports of several inspectors, and showed a similar individual efficiency, defined as the number of problems extracted by a single inspector in relation to the time spent.

• The SUE inspection enhances the inspectors’ control over the inspection process and their confidence in the obtained results. SI inspectors evaluated the application more severely than HI inspectors. Although SUE inspection was perceived as a more complex technique, SI inspectors were moderately satisfied with it. Finally, they showed greater critical ability, feeling less satisfied with their performance, as if they could understand the limits of their inspection activity better than the HI inspectors.

The authors are confident in the validity of these results, because the evaluators in this study were by no means influenced by the authors’ association with the SUE inspection method. Actually, the evaluators were more familiar with Nielsen’s heuristic inspection, having been exposed to this method during the HCI course. They learned the SUE inspection only during the training session. Further experiments involving expert evaluators are being planned to evaluate whether ATs provide greater power to experts as well.

REFERENCES

Andre, T. S., Hartson, H. R., & Williges, R. C. (1999). Expert-based usability inspections: Developing a foundational framework and method. In Proceedings of the 2nd Annual Student’s Symposium on Human Factors of Complex Systems.

Costabile, M. F., Garzotto, F., Matera, M., & Paolini, P. (1997). SUE: A systematic usability evaluation (Tech. Rep. 19-97). Milan: Dipartimento di Elettronica e Informazione, Politecnico di Milano.

Costabile, M. F., Garzotto, F., Matera, M., & Paolini, P. (1998). Abstract tasks and concrete tasks for the evaluation of multimedia applications. Proceedings of the ACM CHI ’98 Workshop From Hyped-Media to Hyper-Media: Towards Theoretical Foundations of Design Use and Evaluation, Los Angeles, April 1998. Retrieved December 1, 1999, from http://www.eng.auburn.edu/department/cse/research/vi3rg/ws/papers.html

Costabile, M. F., & Matera, M. (1999). Evaluating WIMP interfaces through the SUE Approach. In B. Werner (Ed.), Proceedings of IEEE ICIAP ’99—International Conference on Image Analysis and Processing (pp. 1192–1197). Los Alamitos, CA: IEEE Computer Society.

Desurvire, H. W. (1994). Faster, cheaper! Are usability inspection methods as effective as empirical testing? In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 173–202). New York: Wiley.

Dix, A., Finlay, J., Abowd, G., & Beale, R. (1998). Human–computer interaction (2nd ed.). London: Prentice Hall Europe.

Doubleday, A., Ryan, M., Springett, M., & Sutcliffe, A. (1997). A comparison of usability techniques for evaluating design. In S. Cole (Ed.), Proceedings of ACM DIS ’97—International Conference on Designing Interactive Systems (pp. 101–110). Berlin: Springer-Verlag.

Fenton, N. E. (1991). Software metrics—A rigorous approach. London: Chapman & Hall.

Garzotto, F., & Matera, M. (1997). A systematic method for hypermedia usability inspection. New Review of Hypermedia and Multimedia, 3, 39–65.

Garzotto, F., Matera, M., & Paolini, P. (1998). Model-based heuristic evaluation of hypermedia usability. In T. Catarci, M. F. Costabile, G. Santucci, & L. Tarantino (Eds.), Proceedings of AVI ’98—International Conference on Advanced Visual Interfaces (pp. 135–145). New York: ACM.

Garzotto, F., Matera, M., & Paolini, P. (1999). Abstract tasks: A tool for the inspection of Web sites and off-line hypermedia. In J. Westbomke, U. K. Will, J. J. Leggett, K. Tochterman, & J. M. Haake (Eds.), Proceedings of ACM Hypertext ’99 (pp. 157–164). New York: ACM.

Garzotto, F., Paolini, P., & Schwabe, D. (1993). HDM—A model-based approach to hypermedia application design. ACM Transactions on Information Systems, 11(1), 1–26.

Hartson, H. R., Andre, T. S., Williges, R. C., & Van Rens, L. (1999). The user action framework: A theory-based foundation for inspection and classification of usability problems. In H.-J. Bullinger & J. Ziegler (Eds.), Proceedings of HCI International ’99 (pp. 1058–1062). Oxford, England: Elsevier.

International Standard Organization. (1997). Ergonomics requirements for office work with visual display terminal (VDT): Parts 1–17. Geneva, Switzerland: International Standard Organization 9241.

Jeffries, R., & Desurvire, H. W. (1992). Usability testing vs. heuristic evaluation: Was there a context? ACM SIGCHI Bulletin, 24(4), 39–41.

Jeffries, R., Miller, J., Wharton, C., & Uyeda, K. M. (1991). User interface evaluation in the real world: A comparison of four techniques. In S. P. Robertson, G. M. Olson, & J. S. Olson (Eds.), Proceedings of ACM CHI ’91—International Conference on Human Factors in Computing Systems (pp. 119–124). New York: ACM.

John, B. E. (1996). Evaluating usability evaluation techniques. ACM Computing Surveys, 28(Elec. Suppl. 4).

Kantner, L., & Rosenbaum, S. (1997). Usability studies of WWW sites: Heuristic evaluation vs. laboratory testing. In Proceedings of ACM SIGDOC ’97—International Conference on Computer Documentation (pp. 153–160). New York: ACM.

Karat, C. M. (1994). A comparison of user interface evaluation methods. In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 203–230). New York: Wiley.

Lim, K. H., Benbasat, I., & Todd, P. A. (1996). An experimental investigation of the interactive effects of interface style, instructions, and task familiarity on user performance. ACM Transactions on Computer–Human Interaction, 3(1), 1–37.

Madsen, K. H. (1999). The diversity of usability practices [Special issue]. Communications of the ACM, 42(5).

Matera, M. (1999). SUE: A systematic methodology for evaluating hypermedia usability. Milan: Dipartimento di Elettronica e Informazione, Politecnico di Milano.

Mondadori New Media. (1997). Camminare nella pittura [CD-ROM]. Milan: Mondadori New Media.

Nielsen, J. (1992). Finding usability problems through heuristic evaluation. In P. Bauersfeld, J. Benett, & G. Lynch (Eds.), Proceedings of ACM CHI ’92—International Conference on Human Factors in Computing Systems (pp. 373–380). New York: ACM.

Nielsen, J. (1993). Usability engineering. Cambridge, MA: Academic.

Nielsen, J. (1994a). Guerrilla HCI: Using discount usability engineering to penetrate the intimidation barrier. In R. G. Bias & D. J. Mayhew (Eds.), Cost-justifying usability. Cambridge, MA: Academic. Retrieved December 1, 1999, from http://www.useit.com/papers/guerrilla_hci.html

Nielsen, J. (1994b). Heuristic evaluation. In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 25–62). New York: Wiley.

Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. In Proceedings of ACM INTERCHI ’93—International Conference on Human Factors in Computing Systems (pp. 206–213). New York: ACM.

Nielsen, J., & Mack, R. L. (1994). Usability inspection methods. New York: Wiley.

Preece, J., Rogers, Y., Sharp, H., Benyon, D., Holland, S., & Carey, T. (1994). Human–computer interaction. New York: Addison Wesley.

Virzi, R. A., Sorce, J. F., & Herbert, L. B. (1993). A comparison of three usability evaluation methods: Heuristic, think-aloud, and performance testing. In Proceedings of Human Factors and Ergonomics Society 37th Annual Meeting (pp. 309–313). Santa Monica, CA: Human Factors and Ergonomics Society.

Whiteside, J., Bennet, J., & Holtzblatt, K. (1988). Usability engineering: Our experience and evolution. In M. Helander (Ed.), Handbook of human–computer interaction (pp. 791–817). Oxford, England: Elsevier Science.