Improving test coverage of LAPACK


AAECC manuscript No. (will be inserted by the editor)

David J. Barnes · Tim R. Hopkins
Computing Laboratory, University of Kent, Canterbury, Kent, CT2 7NF, UK
E-mail: {d.j.barnes, t.r.hopkins}@kent.ac.uk

Improving Test Coverage of Lapack

August 1, 2006

Abstract We discuss the way in which the Lapack suite of software is tested and look at how the application of software testing metrics affects our view of that testing. We analyse the ways in which some errors may be masked by the existing approach and consider how we might use the existing suite to generate an alternative test suite that is easily extensible as well as providing a high degree of confidence that the package has been well tested.

1 Introduction

Good software engineering practice decrees that testing be an integral activity in the design, implementation and maintenance phases of the software life cycle. Software that executes successfully on an extensive, well constructed suite of test cases provides increased confidence in the code and allows changes to and maintenance of the code to be closely monitored for unexpected side effects.

To gauge the quality of a test suite we require quantitative measures of how well it performs. Such metrics are useful to developers in a number of ways; first, they determine when we have achieved a well-defined, minimal level of testing; second, they can reduce the amount (and, thus, the cost) of testing by allowing tests that fail to improve the metric to be discarded and, third, they can provide a starting point for a search for new test cases.

Where software is not static but subject to evolution and enhancement, an important requirement is that the test suite evolves along with the software. A test suite serves to establish a behavioural norm against which changes can be checked for unwanted side-effects; data sets that have exposed errors in earlier versions and highlighted the need for changes to be made to the source code should automatically be added to the suite; and additional tests should be generated to cover newly added features.

In this paper we examine the test strategy that has been used with the Lapack suite of linear algebra routines. With its first public release in 1992, Lapack is unusual amongst numerical software packages in that it has been updated and extended over a substantial period of time and several public releases have been made available in that time. It is thus possible to obtain a change history of the code and identify maintenance fixes that have been applied to the software either to remove errors or to improve the numerical qualities of the implementation.

In section 2 we discuss testing metrics in general and why we would consider using more than one of these measurements in assessing the quality of test coverage. In section 3 we describe the testing strategies used by the Lapack package and identify some of the risks associated with them. We also discuss how the use of the proposed testing metrics alters the way in which we might choose to test some parts of the software. Section 4 quantifies the level of testing achieved with the standard test suite and presents the results of our approach which is based on individual routines. Our conclusions and suggestions for future directions are given in section 5.

2 Software Testing Metrics

Although some work has been done on the application of formal methods to provide the basis for correctness proofs of linear algebra software and formal derivation of source code (see, for example, Gunnels [6]), the vast majority of numerical software in day-to-day use has been produced by more traditional styles of software development techniques that require extensive testing to increase confidence that it is functioning correctly.

One of the most potent and often underused tools available to assist with this process is the compiler. Most modern compilers will perform a variety of static checks on the source code for adherence to the current language standard (for example, ANSI Fortran [10] and ANSI C [9]). Some (for example, Nag [17] and Lahey [13]) go further and provide feedback on such things as undeclared variables, variables declared but never used, variables set but never used, etc. Although these often only require cosmetic changes to the software, in some cases these messages flag genuine errors where, for example, variable names have been mis-spelt.

Furthermore, with judicious use of compilation options, compilers will now produce executable code that performs extensive run-time checks that would be exceedingly difficult to perform in any other way. Almost all will check that array indexes are within their declared bounds but many will go further and check, for example, that variables and array elements have been assigned values before they are used. Such checks are useful even where correctness proofs are available, since simple typographical errors are common wherever human editing is involved in the implementation or maintenance processes.
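
As a small illustration (not taken from Lapack, and with names of our own choosing), the following fragment compiles cleanly yet contains two faults, an out-of-bounds array reference and a use of an unassigned variable, both of which such run-time checks report immediately:

    subroutine rtdemo(n, x, total)
      ! Illustrative only.  Two faults lurk here: x(n+1) lies outside
      ! the declared bounds of x, and total is read before it has been
      ! assigned whenever n = 0.  Array-bound and undefined-variable
      ! checking catch both at run time; neither need produce a
      ! compile-time diagnostic.
      implicit none
      integer, intent(in)  :: n
      real,    intent(in)  :: x(n)
      real,    intent(out) :: total
      integer :: i
      do i = 1, n
        if (i == 1) then
          total = x(i)
        else
          total = total + x(i)
        end if
      end do
      total = total + x(n+1)   ! off by one: x has only n elements
    end subroutine rtdemo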


While it is true that such checking imposes a run-time overhead that would make the executable code produced unusable for many applications where high performance is essential, it should be mandatory that these capabilities are used during the development, testing and maintenance stages of any software project.

Thus our first requirement is that all the software compiles without any warning messages and that no run-time errors are reported when running the test suite with all the possible execution-time checks enabled for the compilation system being used. For such a requirement to have teeth, we must have confidence that the test suite thoroughly exercises the source code.

In order to determine the effectiveness of the testing phase, quantitative measurements are required. Most of the basic testing metrics are white-box and aim to measure how well the test data exercises the code. The simplest is to calculate the percentage of basic blocks that are executed. A basic block is a linear section of code with a single entry point and a single exit point. The metric for basic-block coverage is slightly more concise and a little more representative of structural complexity than the metric for individual statement coverage. While 100% basic-block coverage is considered by many experts [11] to be the weakest of all coverage criteria, its achievement should always be regarded as an initial goal by testers. After all, how can we have confidence that software will perform as required if blocks of statements are never executed during the testing process? This observation is key to much of what we discuss in the remainder of this paper.
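
As a concrete, purely illustrative example, the following fragment contains three basic blocks: the statements up to and including the if test, the body of the if, and the code after the end if. A single call with x greater than zero executes all three, so one data value already gives 100% basic-block coverage, even though the false branch of the test is never taken; that distinction is exactly what the stronger criteria discussed below address.

    subroutine bbdemo(x, y)
      ! Three basic blocks: (1) the assignment and the if test,
      ! (2) the body of the if, (3) the statements after end if.
      ! One call with x > 0 executes every block.
      implicit none
      real, intent(in)  :: x
      real, intent(out) :: y
      y = 0.0
      if (x > 0.0) then
        y = sqrt(x)
      end if
      y = y + 1.0
    end subroutine bbdemo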

Confidence in the correct operation of the software can be further enhanced by extending the testing process to provide more stringent forms of code coverage, for example, branch (or decision) coverage and coverage of linear code sequences and jumps (LCSAJ).

For the most basic branch coverage we attempt to generate data sets that will ensure that all branches within the code are executed. For example, in a simple if-then-endif block we seek data that will cause the condition to be both true and false. We note here that 100% branch coverage implies 100% basic-block coverage. Basic-block coverage would only require the test in an if-then-endif block to be true in order to execute the statements within the if. On the other hand, branch coverage testing would require data to be found that causes the test to be false as well. More rigorous forms of branch or condition coverage can also be considered, such as condition/decision coverage (in which multiple Boolean components of a condition must each be exercised with both true and false values) and modified condition/decision coverage (in which each component of a condition must be shown to affect the overall outcome of the decision independently of the others) [7].
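
To make the stronger criteria concrete, consider a compound condition (again an illustrative fragment rather than Lapack code):

    subroutine cddemo(a, b, y)
      ! Branch coverage: the decision below must evaluate both true
      ! and false.  Condition/decision coverage: in addition, a > 0.0
      ! and b > 0.0 must each take both values.  Modified
      ! condition/decision coverage: the test cases must also show
      ! that each condition on its own can change the outcome, e.g.
      ! the value pairs (T,T), (F,T) and (T,F).
      implicit none
      real, intent(in)  :: a, b
      real, intent(out) :: y
      if (a > 0.0 .and. b > 0.0) then
        y = a + b
      else
        y = 0.0
      end if
    end subroutine cddemo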

The LCSAJ metric measures coverage of combinations of linear sequences of basic blocks, on the grounds that computational effects in a single basic block are likely to have impacts on the execution behaviour of immediately following basic blocks. Errors not evident from the analysis of a single basic block may be revealed when executed in combination with other basic blocks, in the same way that module integration testing often reveals further errors beyond those revealed by individual module testing. However, since some calculated LCSAJ combinations may be infeasible, for instance if they involve contradictory condition values, a requirement of 100% LCSAJ coverage is unreasonable and a more modest figure such as 60% would be more approachable in a typical application.
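
The infeasibility problem is easy to see in a small illustrative fragment: the linear code sequence that enters at the top of the routine and passes through both if bodies without taking any intermediate jump is a candidate LCSAJ, but no data set can ever exercise it because x would have to be both positive and negative at once.

    subroutine ljdemo(x, y)
      ! The LCSAJ running from entry straight through both if bodies
      ! to the final return requires x > 0.0 and x < 0.0 to hold
      ! simultaneously and so is infeasible.  Coverage tools still
      ! enumerate such sequences, which is one reason why 100% LCSAJ
      ! coverage is not a realistic requirement.
      implicit none
      real, intent(in)  :: x
      real, intent(out) :: y
      y = 0.0
      if (x > 0.0) then
        y = x
      end if
      if (x < 0.0) then
        y = -x
      end if
    end subroutine ljdemo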

A completely different style of testing is offered by mutation testing. This attempts to quantify how well the test data sets can discriminate between correct and incorrect versions of the software. We generate many versions of the software under test, each obtained by injecting a single fault into the original; for example, a + in an expression is replaced by a ∗, or a single use of a variable such as m is replaced by n, and so on. We call each of these faulty versions a mutant and deem a mutant to be killed by a data set if that data causes the software to exhibit a failure. For Fortran 77 King and Offutt [12] use 22 mutation operators. For subroutines containing hundreds of lines of code this can result in many thousands of mutants being generated, and variations of mutation testing aimed at reducing this number have been put forward [18]. Once again the metric used is the percentage of mutants killed by the test data; however, there is a real problem here in that it is possible to generate equivalent programs by mutation and these can neither be killed nor automatically detected. For example, the statements N=1; M=1 might be replaced by the equivalent mutant N=1; M=N.
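
The following fragment (our own example) shows both situations: ordinary mutants that a good data set can kill and an equivalent mutant that no data set can kill.

    subroutine mutdem(a, b, c, d, m, n)
      ! Typical single-fault mutants of the first statement are
      !     d = a * b*c    (the + replaced by *)    and
      !     d = a + b*b    (one use of c replaced by b);
      ! a data set kills such a mutant if it makes the mutated routine
      ! produce a result different from this original.
      implicit none
      real,    intent(in)  :: a, b, c
      real,    intent(out) :: d
      integer, intent(out) :: m, n
      d = a + b*c
      ! Replacing the statement m = 1 below by m = n yields an
      ! equivalent mutant: its behaviour is identical for every
      ! possible input, so it can never be killed.
      n = 1
      m = 1
    end subroutine mutdem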

For the testing metrics described above the amount and quality of test data required to obtain a given percentage of coverage will generally grow as we progress from basic-block coverage, through branch coverage, to LCSAJ testing. As should be clear from this discussion, obtaining metrics data requires tool support, and such tools will typically need to employ compiler technology: lexical, syntactic and semantic analysis. For this reason, compilers themselves often provide the ability to profile code execution (for example, Sun Fortran 95 Version 7.1 [19]) although, typically, these only provide basic-block coverage. We note here that the Sun compiler does not do a very good job of distinguishing basic blocks; this is especially the case for older Fortran 77 constructs such as logical and arithmetic if statements. Tools which perform more sophisticated profiling tend to be commercial; we have used the LDRA Fortran Testbed [14] to generate sample coverage and LCSAJ metrics and the NagWare profiler tools [15,16] for determining basic-block coverage. We do not consider condition/decision coverage and mutation testing any further in this paper.

3 Testing Lapack

The Lapack suite [1] consists of well over a thousand routines which solve a wide range of numerical linear algebra problems. In the work presented here we have restricted our attention to the single precision real routines, thus reducing the total amount of code to be inspected to a little over 25% of the total with little or no loss of applicability to the whole suite.

Part of the original intention of the package's designers was to provide a high quality and highly portable library of numerical routines. In this, they have been very successful. Since its early releases – version 1.0 in 1992, version 2.0 in 1994 and version 3.0 in 1999 (all written in Fortran 77) – the software has been modernized with the release of Lapack 95 [2]. Lapack 95 took advantage of features of Fortran 95, such as optional parameters and generic interfaces, to provide wrapper routines to the original Fortran 77 code. This exploited the fact that parameter lists of some existing routines were effectively configured by the types of their actual parameters, and that calls to similar routines with different precisions could be configured using this type information. These wrappers provided the end user with improved and more robust calling sequences while preserving the investment in the underlying, older code.

A package of this size and longevity is potentially a valuable source of data for the evaluation of software quality metrics. Since 1992 there have been eight releases of Lapack and this history has allowed us to track the changes made to the software since its initial release. This has provided us with, among other things, data on the number of faults corrected and this has enabled us to investigate whether software quality metrics can be used to predict which routines are liable to require future maintenance (see [3] for further details). It has also been possible to determine whether the associated test suites have been influenced by the fault history.

Comprehensive test suites are available for both the Fortran 77 and Fortran 95 versions of the package and these have provided an excellent starting point for the focus of our work on the testing of the package. Indeed, a core component of the Lapack package is its testing material. This has been an important resource in assisting with the implementation of the library on a wide variety of platform/compiler combinations. However, little, if any, measurement appears to have been made of how well the testing material actually performs in a software engineering sense.

The routines within the Lapack package split into three distinct levels [1]:

– driver routines for solving a complete problem, such as sspgv;
– computational routines for performing a distinct computational task, such as sggbak;
– auxiliary routines for performing common sub-tasks, such as sorg2l.

Only driver and computational routines are expected to be called directly by the user of the package and therefore auxiliary routines are not described in detail in the user manual. This difference between auxiliary routines and the other two types is also reflected in both their testing and their error-checking functionality.

A major goal of the designers of the Lapack library is that it should be usable on a wide variety of computer hardware and this is clearly reflected in the nature of the test suites. The main purpose of the supplied testing material is to support the porting process. An expectation of this process is that all of the supplied tests should pass without failures.

Following on from the work by Hopkins [8], we were keen to explore how safe this expectation was, firstly by using state-of-the-art compile-time and run-time checks. Working with version 3.0 of the Fortran 77 version of Lapack we began by executing all the supplied test software using a variety of compilers (including NagWare Fortran 95, Lahey Fortran 95 and Sun Fortran 95) that allowed a large number of internal consistency checks to be performed. This process detected a number of faults in both the testing code and the numerical routines of the current release, including accessing array elements outside of their declared bounds, type mismatches between actual arguments and their definitions, and the use of variables before they have been assigned values. Such faults are all non-standard Fortran and could affect the final computed results. In all, some fifty of the single precision real and complex routines were affected.

Having tested basic Fortran conformance via compiler checks, we looked at the code coverage provided by the supplied test rigs [5]. These test rigs are monolithic in that each generates a large number of datasets and routine calls. The number and size of the datasets used are controlled by user-supplied configuration files, with the majority of the data being generated randomly and error exits tested separately (but still inside the monolithic test drivers and outside of user control). Families of routines are grouped together (for instance banded singular value decomposition routines) and tested with identical user configuration data. Results obtained from each call are checked automatically via an oracle rather than against predefined expected results. This approach has a number of disadvantages:

1. there is no quantitative feedback as to how thorough the actual testing process is,

2. running multiple datasets in a single run masks problems that could be detected easily via run-time checking if they were run one at a time,

3. the use of a large number of randomly generated datasets is very inefficient in that, from the tester's point of view, the vast majority are not contributing to improving any of the test metrics.

A particular example of the second item above is found in the routine sgges. The routine's comment and entry in the user manual suggest that the minimum size of the WORK array is 8*N+16, but a different value is used during parameter checking, leading to failure for values of N less than seven. This error is not discovered by the general test rig because a large value is used for the size of the work array, which is more convenient for solving a general set of problems.
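
The kind of inconsistency involved is sketched below. This is not the source of sgges; the routine name and the bound actually tested are invented, but the sketch shows how a checked bound that differs from the documented minimum of 8*N+16 rejects small problems even when the caller follows the documentation.

    subroutine wrkchk(n, lwork, info)
      ! Hypothetical sketch of Lapack-style workspace checking.
      implicit none
      integer, intent(in)  :: n, lwork
      integer, intent(out) :: info
      integer :: minchk
      info = 0
      ! The documented minimum is 8*n + 16; the bound tested here
      ! (invented for illustration) is larger when n is small.
      minchk = max(8*n + 16, 72)
      if (n < 0) then
        info = -1
      else if (lwork < minchk) then
        ! A caller supplying the documented minimum fails here
        ! whenever minchk exceeds 8*n + 16, i.e. for small n.
        info = -3
      end if
    end subroutine wrkchk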

Having removed the run-time problems mentioned above our next goal was to reduce the number of tests being run (i.e., the number of calls being made to Lapack routines) by ensuring that each test made a positive contribution to the metrics being used. In order to preserve the effort already invested in testing the Lapack routines it was decided to extract as much useful data as possible from the distributed package, and so we used the default configuration files in our analysis. To obtain details of the coverage provided by the existing test suite we modified nag_profile [15] so that it generated basic-block coverage data (profiles) for each individual call to a Lapack routine made via the test routines. We discovered that, for almost all of the drivers and computational routines, we could obtain the same basic-block coverage using only a small fraction of the generated test sets; see Section 4 for details. Scripts were constructed to automate the extraction of representative datasets and to generate separate standardized test programs for testing each individual driver and computational routine. In order to keep these separate test programs simple we only extracted datasets intended to test normal operation of the routines, preferring to run the error exit tests separately.

The testing strategy currently employed by the Lapack 77 test suite is influenced by the three-level nature of the package. Auxiliary routines are not directly tested at all by the test suite. All driver routines are tested directly for normal operation, and for error handling via an error check routine, although a few of the driver routines (such as sgegv and sspgv) are missing error handling tests. Many of the computational routines (such as sbdsqr) are similarly tested via separate normal operation and error check routines. However there is far less consistency for these routines. For instance, sggbak is tested for normal operation but not for error handling, while sggsvp and sormbr are only called directly by error check routines. A further variation is provided by the routine sgbbrd, whose testing is piggy-backed onto the data used in the testing of a more general routine. The test data file is one used for testing all routines in a family and the particular data used with sgbbrd is generated from this. This leads to 13 out of the 103 basic blocks in sgbbrd being untested; of these, 11 were concerned with argument checking for which no invalid data were provided: since sgbbrd is not tested as a first-class citizen, such data had already been filtered out by its caller. The other two were in the main body of the code. The data used by the test program was generated from a standard data file that defines the total bandwidth. This routine also requires the number of sub- and super-diagonals to be specified and this pair of values is generated from the total bandwidth value. Unfortunately the method used to calculate these values fails to produce one of the special cases.

Another difference between auxiliary routines and the user-callable routines is in their approach to error checking. All driver and computational routines consistently check the validity of their parameters and return an error diagnostic value where appropriate. Error checking within auxiliary routines is typically less thorough. Of 146 auxiliary routines examined, 68 did not receive an output parameter that would allow them to indicate an error. On the one hand this position is understandable because auxiliary routines will invariably be called by routines that have already checked parameter validity, and they may be called many thousands of times when solving a single problem. slarrb, for instance, contains a comment that its inputs are not checked for errors in the interests of speed. On the other hand, this approach risks making it hard to locate subtle errors that originate in higher-level routines, and a consistent approach throughout, one that would at least allow checking to be turned on for security or off for efficiency, would appear to be preferable.
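
For reference, the user-callable routines follow a common argument-checking pattern along the lines of the sketch below; sdemo is not a real Lapack routine, although xerbla is the standard Lapack/Blas error handler. Auxiliary routines such as slarrb typically omit both the checks and the info argument.

    subroutine sdemo(n, a, lda, info)
      ! Simplified sketch of the checking performed by driver and
      ! computational routines; not the text of any actual routine.
      implicit none
      integer, intent(in)    :: n, lda
      real,    intent(inout) :: a(lda, *)
      integer, intent(out)   :: info
      info = 0
      if (n < 0) then
        info = -1
      else if (lda < max(1, n)) then
        info = -3
      end if
      if (info /= 0) then
        ! Report the position of the first invalid argument and return.
        call xerbla('SDEMO ', -info)
        return
      end if
      ! ... the computation would follow here ...
    end subroutine sdemo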

A particular danger in using a common structure to test multiple routines and common data for parameter values is that there is a greater likelihood that the distinctive features of particular routines will not be thoroughly tested by this approach. It is generally accepted that it is exactly at the boundaries of problem solutions that the flaws are most likely to be found. Over the course of its history, Lapack has undergone a number of revisions, which have included the usual modifications typical of a package of this size: addition of new routines, bug fixes, enhancements of routines, minor renaming, commentary changes, and so on. A test structure based around families of routines makes it much harder to inject routine-specific test data designed to prevent the re-emergence of corrected errors through regression testing. Furthermore, when regression testing is required, the shared structure makes it impossible to run just those tests that need to be run to exercise the affected routines. Experience suggests that where regression testing is complicated and expensive to perform, it is less likely to be undertaken than when it is simple and cheap.

However there is a deeper testing problem revealed by these examples and the fact that testing of lower-level routines is delegated to the process of testing the higher-level routines. When testing we usually generate data which constitutes a well-defined and well-documented, high-level problem (for example, a system of linear equations), we call the relevant solver and then check the computed results. We assume that it is relatively straightforward to perform checking at this level of problem definition. But, having run enough data sets to obtain 100% basic-block coverage of the user-callable routine, we often find that one of the internally called routines has a much lower coverage. The question for the tester is: can they actually find high-level user data that will provide the required coverage in the internal routines? In many cases the answer will be no. For example, if our testing were extended to include the level 1 Blas routines, the only possible stride value may be one; no data supplied to the higher-level routines will then test the non-unit-stride sections of the Blas code.
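
The reference level 1 Blas make the point concretely: routines such as sscal contain separate code paths for unit and non-unit strides, roughly as in the simplified sketch below (the renaming and the omission of loop unrolling are ours). If the Lapack routines exercised by the tests only ever pass a stride of one, the second path can never be reached from above.

    subroutine sscal_sketch(n, sa, sx, incx)
      ! Simplified sketch of the reference Blas routine sscal:
      ! scale the n elements of sx, stored incx apart, by sa.
      implicit none
      integer, intent(in)    :: n, incx
      real,    intent(in)    :: sa
      real,    intent(inout) :: sx(*)
      integer :: i
      if (n <= 0 .or. incx <= 0) return
      if (incx == 1) then
        ! Unit-stride path: the only one reached if every caller
        ! uses incx = 1.
        do i = 1, n
          sx(i) = sa*sx(i)
        end do
      else
        ! Non-unit-stride path.
        do i = 1, n*incx, incx
          sx(i) = sa*sx(i)
        end do
      end if
    end subroutine sscal_sketch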

4 Results

We focussed on analysing the coverage of 304 driver, computational and auxiliary single precision routines (s*.f). We omitted 23 auxiliary routines (such as slaed0) and one computational routine (sstegr) that are never executed by the Lapack test suite, but did include one computational routine (sdisna) that is not tested. We instrumented these with nag_coverage95 [16], which indicated a total of 14,644 basic blocks. When running the Lapack 77 testing software 12,093 basic blocks were executed, representing a coverage metric of 82.58%, which is extremely high for numerical software; many packages tested achieve values closer to 50%. Coverage of the driver routines is 81.57% (of 3,088 blocks), of the computational routines 85.76% (of 6,770 blocks) and of the auxiliary routines 78.73% (of 4,786 blocks). Of the approximately 2,500 unexecuted blocks many are due to the lack of coverage of argument-checking code. However, there are still a substantial number of statements concerned with core functionality that require data sets to be provided at a higher problem-definition level. The lower coverage of the auxiliary routines is an artefact of their only being tested indirectly via the driver and computational routines.

Ideally we would like each test to count; each test should cause basic blocks to be executed that no other test exercises. To obtain some idea of the possible degree of redundancy of the supplied tests we looked at individual execution profiles. We define a profile as a bit string that shows the basic blocks executed by each call to a routine; i.e., the string contains a one if the block has been executed and a zero if it has not. Note that we cannot extract any path information from this.
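
For illustration (our measurements were actually gathered with the NagWare tools, and the names here are our own), a per-call profile can be built from an array of basic-block execution counts as follows:

    subroutine mkprof(nblk, counts, profile)
      ! Convert per-call basic-block counts into the bit-string
      ! profile described in the text: '1' if the block was executed
      ! at least once during the call, '0' otherwise.  Two calls with
      ! identical strings exercised exactly the same set of blocks.
      implicit none
      integer,          intent(in)  :: nblk
      integer,          intent(in)  :: counts(nblk)
      character(len=*), intent(out) :: profile
      integer :: i
      do i = 1, nblk
        if (counts(i) > 0) then
          profile(i:i) = '1'
        else
          profile(i:i) = '0'
        end if
      end do
    end subroutine mkprof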

The Lapack 77 test cases cause almost 24 million subroutine calls to be made (note that this does not include low-level routines like lsame, which was called over 150 million times). The most called routine was slarfg, which accounted for over 1.7 million calls. As an auxiliary routine this is not tested directly by the test rigs but is nevertheless called by four of the test-rig routines and a large number of the routines that are under test. slarfg illustrates the difficulty in obtaining complete coverage as a side effect of calling higher-level routines in that, out of the thirteen basic blocks that make up this routine, one still remained unexecuted. Closer inspection of this routine reveals that the unexecuted block is used to construct a loop to repeatedly scale the input data until it lies within a prescribed range. However, if IEEE single-precision arithmetic is used it is impossible to provide input data that will exercise this block.

More telling was that, out of more than 20 million profiles obtained, fewer than ten thousand were distinct. Note that it is not the case that we can reduce the number of subroutine calls down to 10,000, since calls with repeated profiles in higher-level routines may generate different profiles within lower-level routines and vice versa. However, it is clear that we should be able to reduce the total number of tests being performed substantially without reducing basic-block coverage. In slarfg only 3 distinct profiles were generated.

The driver routine ssbgvd provides a good illustration of the value of the combined compile-time and run-time approach we have been discussing. It has only 2 distinct profiles from the Lapack 77 test cases, and analysis of the basic-block coverage reveals that neither covers cases where n has a value greater than one. However, the Lapack 95 tests do exercise those cases and reveal a work-array dimension error because of the incorrect calculation of lwmin. ssbgvd is also unusual among the driver routines in that the test rig does not explicitly test its error exits. This could be because it was not part of release 1.0 of Lapack.

Our approach to testing the single precision routines involved creating separate test programs for each driver and computational routine under consideration. The comprehensive and consistent documentation format of the routines made it relatively easy to automate the generation of these programs directly from the source code. Each test program is designed to read a single data set and pass it to the routine under test, reporting any resulting error diagnostic. Running batches of tests is simple via typical operating system scripting languages.
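
In outline, each generated program has the shape sketched below; the routine chosen here (the linear-equation driver sgesv) and the input format are illustrative rather than the generated code itself.

    program tdriver
      ! Outline of a single-routine test program: read one data set,
      ! call the routine under test and report the diagnostic.
      ! Checking of the computed results is carried out separately.
      implicit none
      integer, parameter :: nmax = 100
      real    :: a(nmax, nmax), b(nmax, 1)
      integer :: ipiv(nmax)
      integer :: n, i, j, info
      read (*, *) n
      read (*, *) ((a(i, j), j = 1, n), i = 1, n)
      read (*, *) (b(i, 1), i = 1, n)
      call sgesv(n, 1, a, nmax, ipiv, b, nmax, info)
      write (*, *) 'sgesv returned info = ', info
    end program tdriver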

To provide input to these test programs the original 1.3 million data sets were reduced to around 6,500. With a small supplement to these data sets as described below, the resulting coverage over all routines was 79.40%. The data extraction process was not perfect in that we only extracted data which generated unique execution profiles for the user-callable routines. We would thus expect a reduction in the overall coverage due to the fact that two identical profiles for a high-level routine might generate distinct profiles for an auxiliary routine. Indeed, our coverage of the auxiliary routines is nearly ten percentage points down at 69.52%, while that of the computational routines is 85.08% compared to the original 85.75%. Our data collection does not currently allow us to link a profile for an auxiliary routine to the profile of the higher-level routine that led to it. More work is needed to allow us to identify the most appropriate data sets that select the best profiles across combinations of driver, computational and auxiliary routines. This will allow the checking of results to take place at a higher level.

Of the 59 driver routines that we analysed, only eight had 100% basic-block coverage via the original test suite. Our approach using per-routine test drivers makes it relatively straightforward to identify which blocks in a routine are not executed and to test them directly once appropriate data has been identified. For instance, we have been able to raise the coverage of ssbgvd, discussed earlier, from 50% to 100% using a total of sixteen datasets, ssbgvx from 46% to 80% with twenty-two datasets, and the previously completely untested sdisna to 100% using thirteen datasets.

It is still most likely that we will not be able to achieve 100% basic-block coverage without making independent calls to auxiliary routines, although, in this case, there is no way of knowing whether such calls are possible using legal data at the user-callable routine level. Additionally it is possible at all levels that some paths and blocks of code can never be executed; in some instances it may be possible to identify dead code and remove it. However there may be circumstances where unexecuted code is the result of defensive programming and in these cases the statements should certainly not be removed. As a step towards generating more comprehensive coverage, we plan in the future to create an interactive website on which users are invited to submit test data that will exercise basic blocks for which we currently do not have datasets.

As we have already indicated, 100% basic-block coverage really only provides a minimum confidence level in the accuracy of software. For instance, the driver routine sggsvd attains this level in the Lapack test suite, yet we identified an error in the case where m = n = p = 0, which is permitted by the argument checking but leads to an array subscript error.

Therefore, while the majority of our analysis was concerned with basic-block coverage, we also obtained some data on both branch coverage and LCSAJ coverage using LDRA Testbed in order to give a clearer quantitative picture of the coverage of this suite. Our simplified test programs were run using all the extracted datasets. Our experience of using LDRA Testbed was that the initial effort of instrumenting and running the bulk of the reduced data to analyse the execution histories took much longer than for the simpler coverage measurements. However, adding new datasets was then relatively cheap. For comparison purposes, we obtained the following coverage metric values with the reduced data sets:

1. branch coverage: 72%
2. LCSAJ coverage: 32%

The achieved branch coverage is lagging about 7% behind the basic-block coverage and the LCSAJ metric is well below the recommended level of 60%. Further work is needed with individual routines where LCSAJ coverage is particularly low from the reduced datasets in order to determine whether it is significantly better with the full complement of randomly generated datasets.

While most of the work we have reported here has focussed on the Lapack 77 testing strategy, we are also actively looking at the Lapack 95 [2] test routines in order to compare the coverage they provide. It is interesting to note that the addition of compile-time and run-time checks to the test routines and example programs supplied with that version of the package reveals many problems similar to those thrown up with the 77 version: unchecked use of optional arguments (sgelsd_f90); deallocation of unallocated space (sspgvd_f90); use of an incorrect-precision routine (schkge); incorrect array sizes (la_test_sgbsv, serrge and many others); use of literals as inout parameters (serrvx); use of unset out parameters (la_test_sgecon); incorrect intent (la_test_sgelss); incorrect slicing (la_test_sgesdd); floating-point overflow (la_test_sppsv).

5 Conclusions

Particular testing strategies should be chosen to fit specific goals and it is important to appreciate that the existing test strategy of the Lapack suite is strongly influenced by the desire to check portability rather than code coverage. Towards this goal it has been manifestly very successful.

A highly influential feature of the existing testing strategy is to batch up tests in two ways:

– using multiple data sets on a single routine;
– using similar data on multiple routines.

It was discovered that the very act of batching up tests allowed some errors to be masked, typically through uninitialized variables used in later tests having been set by earlier tests. Thus, while it requires far more organization, there are very definite advantages to be gained from running each test separately. An approach using separate tests also supports the incremental approach to improving testing metrics along with the introduction of additional tests whenever a routine is enhanced, or a new error is discovered and then corrected. Such additional tests serve as regression tests for future releases.

The need for dealing with a large number of test cases has led us to develop a more flexible testing environment which allows for the easy documentation and addition of test cases while keeping track of the code coverage obtained. Ideally, we would aim to extend the existing test suite in order to achieve 100% basic-block coverage as well as increasing the metric values obtained for branch and conditional coverage. Our major problem at present is finding data that will exercise the remaining unexecuted statements and we propose doing this in part through a public website. The calling structure of the numerical routines is hierarchical and, in many cases, it is clear that unexecuted code in higher-level routines would only be executed after the failure of some lower-level routine. It is not clear whether such situations are actually known to occur or if the higher-level code is the result of defensive programming. While defensive programming practices are obviously not to be discouraged, they can potentially confuse the picture when considering how thorough a test suite is in practice and should be well documented within the source code.

We have shown how software testing metrics may be used to provide a quantitative measure of how good the testing process actually is and thus to increase confidence in program correctness.

We have set up a framework for testing the Lapack suite of routines in terms of a highly targeted, reduced collection of datasets. Even so, we are still some way from achieving 100% basic-block coverage, which is considered to be the weakest coverage metric; indeed we have identified a bug in a routine that already had 100% coverage. Beizer [4] has argued that even if complete branch coverage is achieved then probably less than 50% of the faults left in the released software will have been found. We will be endeavouring to increase the basic-block coverage as well as improving the condition and branch coverage and LCSAJ metrics.

References

1. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, third edition, 1999.

2. V. A. Barker, L. S. Blackford, J. Dongarra, J. Du Croz, S. Hammarling, M. Marinova, J. Wasniewski, and P. Yalamov. LAPACK95 Users' Guide. SIAM, Philadelphia, 2001.

3. D.J. Barnes and T.R. Hopkins. The evolution and testing of a medium sized numerical package. In H.P. Langtangen, A.M. Bruaset, and E. Quak, editors, Advances in Software Tools for Scientific Computing, volume 10 of Lecture Notes in Computational Science and Engineering, pages 225–238. Springer-Verlag, Berlin, 2000.

4. B. Beizer. Software System Testing and Quality Assurance. Van Nostrand Reinhold, New York, US, 1984.

5. Susan Blackford and Jack Dongarra. LAPACK Working Note 41: Installation guide for LAPACK. Technical report, Department of Computer Science, University of Tennessee, Knoxville, Tennessee 37996-1301, 1999.

6. John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4):422–455, December 2001.

7. Kelly J. Hayhurst, Dan S. Veerhusen, John J. Chilenski, and Leanna K. Rierson. A practical tutorial on modified condition/decision coverage. Technical Report TM-2001-210876, NASA, Langley Research Center, Hampton, Virginia 23681-2199, May 2001.

8. Tim Hopkins. A comment on the presentation and testing of CALGO codes and a remark on Algorithm 639: To integrate some infinite oscillating tails. ACM Transactions on Mathematical Software, 28(3):285–300, September 2002.

9. ISO, Geneva, Switzerland. ISO/IEC 9899:1990 Information Technology – Programming Language C, 1990.

10. ISO/IEC. Information Technology – Programming Languages – Fortran – Part 1: Base Language (ISO/IEC 1539-1:1997). ISO/IEC Copyright Office, Geneva, 1997.

11. C. Kaner, J. Falk, and H.Q. Nguyen. Testing Computer Software. Wiley, Chichester, UK, 1999.

12. K.N. King and A.J. Offutt. A Fortran language system for mutation-based software testing. Software—Practice and Experience, 21(7):685–718, July 1991.


13. Lahey Computer Systems, Inc., Incline Village, NV, USA. Lahey/Fujitsu Fortran 95 User's Guide, Revision C edition, 2000.

14. LDRA Ltd, Liverpool, UK. LDRA Testbed: Technical Description v7.0, 2000.

15. Numerical Algorithms Group Ltd., Oxford, UK. NAGWare Fortran Tools (Release 4.0), September 1999.

16. Numerical Algorithms Group Ltd., Oxford, UK. NAGWare Fortran Tools (Release 4.1), 2001.

17. Numerical Algorithms Group Ltd., Oxford, UK. NAGWare f95 Compiler (Release 5.0), November 2003.

18. A.J. Offutt, A. Lee, G. Rothermel, R.H. Untch, and C. Zapf. An experimental determination of sufficient mutant operators. ACM Transactions on Software Engineering Methodology, 5(2):99–118, April 1996.

19. Sun Microsystems, Inc., Santa Clara, CA. Fortran User's Guide (Forte Developer 7), Revision A edition, May 2002.