A Discipline of Error Handling - USENIX

11
A Discipline of Error Handling Doug Moen ABSTRACT In the UNIX world, exception handlingmechanisms for eror handlingare often discussed, but seldomapplied. This paperdescribes a disciplined approach to error handlingthar was refinedover a !-year perioO during the development of a medium-large (200K line) toolkit written in C underUNIX We describe both a portable exception handing system, written in C, and a methodology for using it which encompasses coding style, doCumentation, and testing issues. Introduction Thereis more to good enor handling than sim- The C language, as definedby Kemighan and O,]I t:.:g:.gt or lihrary support for raisingandcatch' Richeyand by it.-Ñst c standarä, i. t.tiri'*räk i1_g i:eptions. .-In this-paPer'.Iwill explainwhat on eró, handing. The only standará,rror n*äñiË :11"t:^ar-e,.describe the desi8n issues in representing facilities provide"á .rc ttrr global variabfr .ttio ãnã andreportingerrors,and discuss how our error han' the convention thar certain íunctions (such as ..ff"., l$q^]fltotch affects coding style, documentation etc.) returna distinguished valuewheà they fail, pos- ano ¡es¡mg' sibly setting enno. Two Kinds of Errors This is not a very goodbasis for eror handling. It isn't good for application programmers, becauseexplicitly checking the return values of functions that might fail, and propagating the enor, is a lot of work, and clutters up code. Even con- scientious programmers have been known to write programs that fail to checkthe returnvalue of every call to printf. As a result, there are a non-trivial number of C programs in existence which fail to properlyreporterror conditions [Danvin 85]. The standard approach to error handling isn,t good for library implernentors, either. One pioblem is that the existingset of effor numbers is not exten- sible; thus, you can't use the standard functions per- rorQ, strerrorQ, etc.,with locally defined error codes. Another problemis that a singleinteger(ermo) does not really containenough informationto completely describe an eror. Usually there is additional contex- tual information (such as the na¡ireof the file that couldn't be opened, the number of bytesthat were successfully written before an error occurred, the line number on which the error was detected, etc.) that needs to be associated with an error. We faced theseproblems when we set out to build the EMS image processing toolkit in 1989. We were building a large library of fi¡nctions for building image processing applications, and we wanted our library to support the construction of robust applications. We wanted our library to sup- port good eror handling. So we defined a comprehensive approach to good error handling, and wrote a small library of functions to support our eror handling discipline. EMS distinguishes two kinds of errorsthat can be detected by library functions: faults andfailures. A fault condition is the failure of an assertion or sanitycheck. By definition, a fault always indi- cates the presence of a bug in a program. Faults checking is not part of the contract between the function and its caller, and faults are reported by aborting the program. Somekinds of fault checking (e.g., comprehensive data structure integrity checks) are expensiveto perform, but are useful during debugging. Because client code is not allowed to depend on the existence of fault checks, expensive fault checkscan be conditionally compiledbased on a DEBUG option, without affecting the correctness of any progam. A failure is an abnormal but anticipated condi- tion such as resource exhaustion, permission denied, or a syntaxerror, which prevents the function from carrying out its job. Failures differ from faults in that failure reportsare part of the contractbetween the function and its caller. Failures are reported by reporting an effor back to the caller. An out of memory condition that causesmallocQ to return NULL is a simpleexample of a failure. Sometimes it is difficult to decideif a particu- lar condition(e.g.,an illegal argument valuef should be classifiedas a fault or a failure. Mv rule of thumb is that it should be possible, usingthe library, to write programsthat never generate fault condi- tions under any circumstances.Supposethat the exceptional condition is the detection of a syntax error in an input file, or in a character string that might have originated from outsideof the program. In this case,the condition should be classified as a failure, ratherthan as a fault. If it were classified as Summer '92 USENIX - June8-June 12,lgg2 - San Antonlo, TX 123

Transcript of A Discipline of Error Handling - USENIX

A Discipline of Error HandlingDoug Moen

ABSTRACT

In the UNIX world, exception handling mechanisms for eror handling are often discussed,but seldom applied. This paper describes a disciplined approach to error handling thar wasrefined over a !-year perioO during the development of a medium-large (200K line) toolkitwritten in C under UNIX We describe both a portable exception handing system, written inC, and a methodology for using it which encompasses coding style, doCumentation, andtesting issues.

Introduction There is more to good enor handling than sim-The C language, as defined by Kemighan and O,]I t:.:g:.gt or lihrary support for raising and catch'

Richey and by it.-Ñst c standarä, i. t.tiri'*räk i1_g i:eptions. .-In this-paPer'.I will explain whaton eró, handing. The only standará,rror n*äñiË :11"t:^ar-e,.describe

the desi8n issues in representingfacilities provide"á .rc ttrr global variabfr .ttio ãnã andreporting errors, and discuss how our error han'the convention thar certain íunctions (such as ..ff"., l$q^]fltotch

affects coding style, documentationetc.) return a distinguished value wheà they fail, pos- ano ¡es¡mg'

sibly setting enno. Two Kinds of ErrorsThis is not a very good basis for eror handling.It isn't good for application programmers,

because explicitly checking the return values offunctions that might fail, and propagating the enor,is a lot of work, and clutters up code. Even con-scientious programmers have been known to writeprograms that fail to check the return value of everycall to printf. As a result, there are a non-trivialnumber of C programs in existence which fail toproperly report error conditions [Danvin 85].

The standard approach to error handling isn,tgood for library implernentors, either. One pioblemis that the existing set of effor numbers is not exten-sible; thus, you can't use the standard functions per-rorQ, strerrorQ, etc., with locally defined error codes.Another problem is that a single integer (ermo) doesnot really contain enough information to completelydescribe an eror. Usually there is additional contex-tual information (such as the na¡ire of the file thatcouldn't be opened, the number of bytes that weresuccessfully written before an error occurred, theline number on which the error was detected, etc.)that needs to be associated with an error.

We faced these problems when we set out tobuild the EMS image processing toolkit in 1989.We were building a large library of fi¡nctions forbuilding image processing applications, and wewanted our library to support the construction ofrobust applications. We wanted our library to sup-port good eror handling. So we defined acomprehensive approach to good error handling, andwrote a small library of functions to support oureror handling discipline.

EMS distinguishes two kinds of errors that canbe detected by library functions: faults and failures.

A fault condition is the failure of an assertionor sanity check. By definition, a fault always indi-cates the presence of a bug in a program. Faultschecking is not part of the contract between thefunction and its caller, and faults are reported byaborting the program. Some kinds of fault checking(e.g., comprehensive data structure integrity checks)are expensive to perform, but are useful duringdebugging. Because client code is not allowed todepend on the existence of fault checks, expensivefault checks can be conditionally compiled based ona DEBUG option, without affecting the correctnessof any progam.

A failure is an abnormal but anticipated condi-tion such as resource exhaustion, permission denied,or a syntax error, which prevents the function fromcarrying out its job. Failures differ from faults inthat failure reports are part of the contract betweenthe function and its caller. Failures are reported byreporting an effor back to the caller. An out ofmemory condition that causes mallocQ to returnNULL is a simple example of a failure.

Sometimes it is difficult to decide if a particu-lar condition (e.g., an illegal argument valuef shouldbe classified as a fault or a failure. Mv rule ofthumb is that it should be possible, using the library,to write programs that never generate fault condi-tions under any circumstances. Suppose that theexceptional condition is the detection of a syntaxerror in an input file, or in a character string thatmight have originated from outside of the program.In this case, the condition should be classified as afailure, rather than as a fault. If it were classified as

Summer '92 USENIX - June 8-June 12,lgg2 - San Antonlo, TX 123

A Disclpllne of Error Handling

a fault, then a program written to avoid generatingfaults is obliged to scan the input beforehand toensure the absence of syntax errors.

Reporting Faults

In EMS, faults are reported by calling the func-tion fatalQ with a printf-style argument list. fatalQaborts the program by calling a handler functionregistered using at_fatalQ. If no handler function isregistered, or if the registered handler functionreturns to its caller, then fatalQ prints a message tostderr and calls abortQ to obtain a core dump.

The decision to report faults by terminating theprogram was a controversial one. It was made forthe following reasons:

o During a development cycle, the best way fora program to report a fault is to immediatelydump core. The core file can then be used forpost-mortem debugging.

o If a function signals a fault by reporting anerror to its caller, then this error reportbecomes a de facto part of the fi¡nction's con-tract with its caller, and it becomes possibleto write code that depends on interceptingfault reports. We don't want fault reports tobe part of a function's interface, becausechecking for faults can be expensive, and wewould like to have the option of tum off faultchecking using compile-time options, in orderto speed up the code. Tuming off faultchecking should not change the semantics of acorrect program.

¡ Assertions and sanity checks are easier tocode, and are more likely to be used, ifdevelopers don't have to worry about cleaningup after a failed assertion.

An unfortunate consequence of aborting a programwhen a fault is detected is that we dõn't-supportfault tolerant programs of the sort that try to keeprunning after a fault has been detected (e.g., see[Meyer 88] and [Randell 75]). This isn't quite asbad as it sounds, because EMS has two provisionsfor supporting fault tolerance. First, programs canregister a handler to save their state when fatalQ iscalled. Second, EMS has facilities for automaticallyrestarting crashed servers. Thus, an application thatconsists of a network of processes can be made tokeep running even when individual componentscrash.

Reportlng Failures

The EMS failure reporting system is based onthe following principles:

1. Enor propagation, In a typical application,errors are detected at a low level (for exam-ple, in a library function), and handled at ahigher level (for example, in an applicationprog¡am which calls the library). Only thehigher level code knows how to handle the

Moen

error; this might be to print an error messageon the terminal, display an error window, orterminate the program. It is inappropriate forlow level code to take these actions. There-fore, when an error is detected, a descriptionof the eror is propagated from the low levelcode, where it is detected, to the high levelcode (higher in the call stack) where it is han-dled.

2. Erro¡ values. In the traditional C approach toerror handling, enors are described by a singlesmall integer. This is inadequate; if the highlevel error handler needs to display an errormessage to the user, then it usually needsmore than just the type of error (e.g., "syntaxerror"): it usually needs some informationabout the context in which the error occurred(e.g., file name + line number). In our sys-tem, errors are described by an Error structurewhich contains both an error id (which is aninteger) and character string data which canbe used to print an informative message.

3. Exception handling. In the traditional Capproach to error handling, a function returnsa distinguished value (such as NULL or -1)when it gets an enor, and sets the global vari-able erno with an error id. It is the responsi-bility of the caller to check for an eror ietumstatus, and either handle the eror, or pro-pagate it up the call stack. Although thisapproach is simple, it is also enor prone: it isvery easy for a programmer to be lazy, andomit checks for error return codes. If thesechecks are omitted, the program may gowrong in a catastrophic way when an erroroccurs. Our solution to this problem is toraise an exception when an error occurs.When a function raises an exception, itimmediately terminates, its caller terminates,and so forth, until an exception handler isfound. If no exception handler can be foundanywhere on the call stack, the program ter-minates with an error message. Under thisscheme, a programmer must take positiveaction to prevent his function from being ter-minated by an error. The worst that can hap-pen if he forgets to check for errors is that, atthe library level, temporarily allocatedresources may not be freed, and at the appli-cation level, the program may terminate witha message.

Error IDs

The most important component of an errordescription is an error 'id', which identifies the typeof error that occurred. Eror ids are used by errorhandlers to distinguish different types of erors. Ifthe error id has a short string representation, then itcan also be printed out as paft of the error message,

L24 Summer '92 USENIX - June 8-June L2, L992 - San Antonio, TX

Moen

and used as a key by the end-user to look up a ver-bose description of the error in extemal documenta-tion.

A¡y system for defining error ids needs to dealwith nvo issues. First, it should be possible for pro-grammers to define new enor ids without editing amaster table somewhere, and without conflictingwith error ids defined by other libraries. Second, itshould be possible to define sets of related error idsso that eror handlers can check error ids formembership in a group, as an alternative toenumerating a long list of error ids that are to behandled in the same way.

The UNIIVANSI C system of eror ids (asdefined by <errno.h> and sys_errlist) is not extensi-ble, and provides no support for grouping.

Programming languages with builçin exceptionhandling systems provide the ability to define newerror ids as a matter of course, but not every suchlanguage provides a way to define groups of errors.In both ANSI C++ and in Common Lisp, it is possi-ble to organize error types in trees or DAGs usingsingle and multiple inherirance [Ellis 90, Steele 90].

In the exception handling system devised byAllman and Been [Allman 85], error ids are charac-ter strings, and error handlers can use glob-style paftern matching on enor ids.

In EMS, we chose a system similar to that usedby ANSI C and UNIX. Error ids are represented byintegers, which means that error handlers can useswitch statements to distinguish between differenterrors. When an error description is printed, theeror id is used as an index into a table of messagestrings.

In order to support extensibility, we supportmultiple error tables. Each enor id is a 32 bitinteger with a L9 bit table id, and a 13 bit offset,which is an index into the table. The table id iscomputed from a table name: this is a string ofbetween one and four lowercase letters which is con-verted to an integer using a variant of base 26encoding.

Each error table is maintained as a text file thatdefines 4 records for each error:

o An error name, which is a C identifier likeENOENT or E SYNTAX. The error name isused to #define-a constant. In addition. it isprinted by err3rint x for the benefit of bothusers and programmers. Users can use theerror name to look up additional informationabout. the error in extemally provided docu-mentation.

o A severity level, which is one of the constantsERR_FAULT, ERR_FAIL or ERR RETRY.ERR:FAULT denotãs a fault in an- exrernalprogram or subsystem, ERR_FAIL denotes apermanent failure, and ERR RETRY indicatesa temporary condition that-will clear up by

A Discipllne of Error Handling

itself with no intervention.o A short, one line message, analogous to the

messages in sys_errlist. This message isprinted by enjrint xQ.

a A verbose description of the error, typicallyone paragraph in length. This is incorporatedinto external documentation for errors.

This text file is processed to generate a .h file, a .cfile, and a documentation file. The .h file containsone #defined constant for each error id, plus anextemal declaration for a global variable of typeErrTab_t[], representing the eror table. The .c filecontains the definition of the error table, which con-tains, for each error id, the message string, the sever-ity level, and the eror name, represented as a string.

By convention, in an error table named ,,foo',,each error id has the prefix "EFOO_", the errortable variable is called "foo_errs", and the headerfile is called "foo-errs.h".

System error codes, as obtained from theheader file <erno.h> or the global variable errno,can be converted into valid error ids by casting themto type long. The corresponding error table is calledsys_errs. All system errors are arbitrarily assignedthe severity level ERR_FAIL.

The only provision that EMS has for defininggroups of error ids are the fixed set of error severitylevels. Although this has proven adequate for ourneeds, it is not very flexible.

Error Values

When a function fails, it is responsible for con-veying a description of the error to its caller. Thisdescription can be used in several different ways: itmight be interpreted by an error handler which needsto take different actions on different errors, it mightbe used to construct an error message which will bedisplayed to a human user, and it might be transmit-ted to another process on the network (as when aserver notifies a client of an error).

Since the error description might be interpretedby an error handler, it needs to contain an error id,plus any additional contextual information that mightbe needed by the enor handler, such as the numberof bytes that were successfully written before theerror, the line number on which the eror occurred.etc. Since the error description might be used toconstruct an error message, it needs to contain all ofthe information necessary to make this possible.Finally, since the error description might betransmitted to another host on the network, it needsto be represented in such a way that it can betransformed into a machine-independent byte streamfor transmission purposes, then reconstituted by therecipient.

Let's consider the case of constructing an errormessage. Error messages are seen by two classes ofpeople: end users, who don't understand the

Summer '92 USENIX - June 8-June lZ,lgg2 - San Antonio, TX L25

A Discipline of Error Handling

internals of the program that failed, andgurus/implementors who do.

For end users, the message should be phrasedin high level terms, and provide enough informationso that it is possible for the user to determine acorrective course of action. In practice, this means aone-line message containing a phrase describing theproblem, the context in which the error occuned(e.g., file name, line number, etc.), and an errorname which can be used as a key to look up a morecomplete description of the error in external docu-mentation.

For gurus and implementors, the messageshould contain additional information about whatwent wrong at the implementation level of the pro-gram, because this information might be needed fordebugging, or in order to fix a problem somewherein the system. This lower level information can begenerated naturally, as a consequence of the waythat error descriptions are generated. When an erroris detected, a description of the error is passed downthe call stack through one or more levels of functioncalls until it is finally handled. Now, consider thateach function is obligated to present an error inter-face which makes sense in terms of the abstractionthat it implements. When an error description is for-warded through an abstraction boundary, it is some-times necessary to reinterpret the error in terms ofthe curent level of abstraction, which means supply-ing a new error id, and new contextual information.The lower level error description which is being sup-planted need not be thrown away; this low leveldescription of the error might be of use to gurus andimplementors when it is presented in an error mes-sage.

Putting all of these requirements together, wefind that an eror description should consist of astack of one or more error intemretations. Theinterpretation on the top of the stack is a high-leveldescription of the error, while the interyretation atthe bottom of the stack describes the error in termsof the function in which the error was first detected.Each error interpretation contains an error id, plusadditional contextual information. We need opera-tions on error descriptions for printing or displayingthem as human readable messages, and for transmit-ting them across the network to other hosts (whichmay have different byte ordering, different floatingpoint representations, etc.).

Moen

In EMS, error descriptions are representdd byvalues of type Error-t. An Error¡ contains a stackof error interpretations. Each interpretation containsan error id, the error name represented as a string,the short message associated with the error id, theerror severity level, and a chæacter string containingcontextual information which is generated at runtime, when the error is detected.

Note that the contextual information associatedwith each interpretation is represented as a singlecharacter string. This is a good representation forprinting effor messages and for transmitting errordescriptions across the network, but it is not a con-venient representation for error handlers which wishto access and interpret the contextual information.In fact, the cunent implementation of EMS providesno programmatic way to access the contextual infor-mation at all! Although this limitation has notcaused any problems so far, I think it should befixed.

The stack within an Error t has a fixed max-imum size of 6 entries, and the-re is a fixed amountof space for storing character string information. Ifyou try to push more than 6 interpretations onto thestack; then the bottom interpretations are thrownaway. If you try to store an oversize context stringin an interpretation, then it will be truncated. Wechose to use fixed sized arrays for representingErrorJ's because we wanted to avoid the use ofdynamically allocated storage. We did not want torun out of heap space while attempting to reporterrors in a low memory situation, and we did notwant the additional complication of having to expli-citly free the storage associated with Error t's withinan error handler. Instead, we use automatic storagefor all Error t's. In practice, the stack size limita-tion is not a problem, and neither is the occasionaltruncation of context strings.

Operations On Error Values

When an error is first detected, an error valueis initialized with a single error interpretation, con-sisting of an error id, the associated error name,severity level and short message, and an optionalcontext string. If the error were printed at this point,it would look like the tine in Figure L, below.

An error value can be initialized to contain asingle interpretation by calling er_set0 (see Figure2, below).

context str ing: short message (errcir name)

Figure 1: Error message format

vo id e r r_se t ( Er ro r_ t *e r r , Er rTab_t * tb l , long id , char * fmt , . . . )Figure 2: Prototypical err_set call

L26 Summer '92 USENIX - June 8-June 12, L992 - San Antonlo, TX

Moen

The fmt argument is the beginning of a printfargument list, which is used to construct the contextstring. See Figure 3 for an example.

When an error value passes through an abstrac-tion boundary, it may be appropriate to push a newinterpretation onto the stack which describes theerror in higher level terms. This is done by callingen3ushQ. Alternatively, instead of pushing a newinterpretation onto the stacþ you can choose to pushan annotation by calling err_noteQ. An annotatión isa character string which is added to the top levelenor interpretation without changing the error id.Annotations are used to add information to to theenor message seen by a user, without changing theenor description from the point of view of an errorhandler. For example, the sequence of calls in Fig-ure 4 would create an error description which wouldbe printed as shown in Figure 5.

For completeness, we also supply err_clearQ,which initializes an error description to an emptystack, and en3opQ, which pops the top levelinterpretation, so that the error id underneath can beaccessed.

An error handler which needs to distinguishbetween different types of errors can query an errorusing the functions err_idQ and en_lévelQ. Thesefunctions return the 32 bit enor id and the errorseverity level, respectively, of the top level interpre-tation within an error. The severitv level is onè ofthe following three constants:

#define ERR RETRY 1#define ERR FAIT, 2#define ERR-TAULT 3

These constants are ordered by severity so that < and> tests can be used on them.

_ . 4" error can be printed by calling en3rint_xQ.This function prefixes each line of output Uy ttre pró-gram name and a colon. If it is desired to displayan error message within a dialog box, then you can

A Dlscipline of Error Handling

call en_string_xQ to obtain a character stringrepresentation of the error message, with embeddednewlines, but without the program name prefixes.

Enors can be written to and read from a net-work connection in machine independent binary for-mat by calling en_write_x0 and err_read_xQ. Anunresolved problem with these functions is thatUNIX error numbers change their interpretation fromone machine to another.

Internationalization

In multilingual environments, it is necessary toensure that enor messages are displayed in thecorrect natural language. Of course, error messagestrings are not the only strings that need to betranslated, and it is appropriate to regard enor han-dling and intemationalization as orthogonal problemsthat deserve separate and independent solutions.

Different operating environments provide dif-ferent solutions to the problem of internationaliza-tion. In the Macintosh and lffindows environments,character strings that need to be translated are storedin resource files which are shipped with the applica-tion; the buyer must purchase a copy of the aþplica-tion which has been translated to the ãésiredlanguage. In the OSF environment, the NLS pack-age supports a run-time choice of several differentnatural languages by setting the I¿.NG environmentvariable; once again, strings containing naturallanguage a¡e stored separate from the programsthemselves.

In the EMS toolkit, we have a high-level. abstraction for string translation which can map

cleanly onto the facilities provided by all of tnèabove environments. The construct

XSD ( str ing-J. i teraI )is replaced by string-literal on systems that don'tlupport internationalization, and is replaced by afunction call which returns a pointer to the transláted

Error_t errierr_set ( &err, sys_er rs , ( long)ENOMEM, "Cou ldn , t a l loca te td by tes , ' , s ize) ;

Figure 3: Creating an error value

Error_t err;err_set ( &err,err3ush ( &err,err_note ( &err,

u t i l_er rs , E_EOF, "F i le descr ip to r td " ; fd ) ;ut i l_errs, E_CORRUPT, "" ) i"Er ro r wh i le readÍng f i le \ , ' t s \ , , " , f i l ename¡ ;

Figure 4: Creating another error value

Error whi le reading f í Ie , , foo"> corrupted data file (E_CoRRUPT)> Fi le descr iptor 3: Unexpected end of f i le (E_EOF)

Figure 5: A printed error value

Summer '92 USEMX - June 8-June l2r lgg2 - San Antonio, TX t27

A Discipline of Error Handling

string on systems that do. In other words,

x s D ( " H e I I o " )returns a pointer to the string "Bonjour" if compiledin an environment that supports internationalization;and the current language is French. The construct

XSI ( str ing- l i teraI )returns a static initializer for a structure of typeXS_t. The function xs3getsQ takes a pointer to anXS-t, and retums a pointer to the translated string.The XS package is implemented using a modified Cpreprocessor which maps string literals onto theappropriate magic incantations for fetching thetranslated version of that string from the appropriateresource file or database.

Thus, internationalization of error messageswithin the EMS environment becomes a trivial prob-lem. The short error messages within each staticallyinitialized EnTab_t variable have type XS_t, and areinitialized using tñ'e XSI macro; thii'is tran-sparent tothe programmer. Programmers must be careful touse the XSD macro to translate printfQ style formatstrings that contain natural language. That's it.

Throwing and Catching Errors

In the early days of EMS, functions reportedfailures by returning a special error value. Thisapproach was abandoned because of the code cluttercaused by checking nearly every function call. Itwas replaced by a portable exception handlingmechanism using the termination model.

A function signals failure by building an errorvalue, then throwing it to its caller as shown in Fig-ure 6. As a shorthand, we supply the functionthrow_err_xQ which allows the above code to bewritten in a single line (see Figure 7). The effect ofthrow_xQ or throw_err_xQ is to raise an exception:this causes a non-local jump down through the call

Moen

stack to the nearest exception handler. If no excep-tion handlers are active, then the error is printed tostderr, and the program exits with status 255. As aresult of this default behaviour, simple utility pro-grams that exit with an error message when anyenor occurs are very easy to write.

Exceptions may be caught using the controlstructure shown in Figure 8. The "THEN-TRY ..."clause may be repeated zero or more times. Theidentifrer argument to CATCH is used to declare avariable of type Enor_tt which is local to the excep-tion handler. A TRY .. END_TRY block is syntacti-cally a compound statement, and may be writtenanywhere a statement is legal in C. They may evenbe nested.

A TRY .. END_TRY block is evaluated in thefollowing manner. First, each body is executed insequence. If an exception is raised during the exe-cution of a body, then control is immediatelytransferred to the beginning of the next body. Afterall of the bodies have been executed, one of twothings happens. If an exception occurred during theexecution of any body, then the handler is executedwith the Enor_t* variable pointing to the frrst errorthat occurred. If none of the bodies raised an excep-tion, then the handler is skipped.

For example, the following code:

TRYfoo_x( )

CATCH (err)err3r int_x(err, stderr) ;

END-TRY

calls foo_xQ; if it raises an exception, then theexception is caught, and the error is printed to stderr.

If you use break, continue, goto or return tobreak out of the body of a TRY or THEN_TRYclause, then you must call TPOPQ before

Error_t errie r r_se t (&er r , sys_er rs , ( Iong)ENOMEM, "Cou ldn , t aL loca te td by tes" , s ize) ;throw_x(eerr) t

Figure 6: Creating an error value and throwing it to its caller

th row_er r_x(sys_er rs , ( long)ENOMEM, "Cou ldn , t a l loca te td by tes" , s ize) ;Figure 7: Combined set and throw functions

TRYbodyl: statements that

THEN-IRYbody2: more statements

CATCH ( ident i f ier¡. . except ion hand ler . .

END TRY

may raise an exception

that may raise an except ion

Figure 8: Exception catching

Summer '92 USEI\IX - June 8-June L2,1992 - San Antonio, TXtzE

Moen

transferring control. Otherwise, the stack maintainedby the exception handling system will be damaged.TPOPQ can only be used to break out of a singleTRY statement; it cannot be used to break out ofseveral nested TRY statements at once. For exam-ple:

TRYi f (bar -x1 ) == 0) {

r P o P ( ) ;returni

)CATCH (err)

err3r int_x(err, stderr) ;END-TRY

The exception handling system described in this sec-tion is written in portable C, with the help ofsetjmpQ, longimpQ and the C preprocessor.Although its genesis is independent, the implementa-tion is similar to that used by Roberts [Roberts 89].The need for the TPOPQ macro is regrettable, and isan occasional source of bugs, A purpose builtpreprocessor could eliminate the need for TPOPQ,and could also generate slightly better quality Ccode for the TRY statement.

Cleaning Up After Errors

It is the responsibility of every library functionto clean up properly after detecting an error. Theremust be no memory leaks, file descriptor leaks, ordata structures left in an invalid state after an earlvenor exit.

Consider this function (note that mem_alloc_xQis a wrapper for mallocQ that raises exceptions, andmem_freeQ is a wrapper for freeQ):

voidfoo_x ( ){

F O O _ t * f l , t t 2 ì

. f l = mem_aI loc_x(s izeof ( foo_t ) ) tf2 = mem_al loc_x( sizeof (Foo_t) ) ¡. . . b o d y o f f o o _ x . . .m e m _ f r e e ( f 1 ) ;mem_free (f2l ¡

)If the first call to mem_alloc_xQ fails, then foo_xwill be immediately terminated by an exception.This is the desired effect, and no error handling codeis required within foo_x to make this happen. How-ever, if the second call to mem_alloc_xQ fails, thenthere will be a storage leak, because foo_x will exitwithout freeing f1.. Similarly, if the body of foo_x(as represented by the ellipsis) is capable of failing,then neither fl and f2 will be freed. Our solution tothis problem is to use the coding style in Figure 9.

The TRY clause contains resource allocationand the body of foo_x. The THEN_TRY clausefrees resources; it is executed whether or not the

A Discipline of Error Handling

body fails. The reason that we initialize f1 and f2 toNULL is so that the calls to mem_freeQ in theTHEN_TRY clause will work even if an exception israised in one of the calls to mem_alloc_xQ.mem-freeQ is guaranteed to ignore a NÛLL argu-ment, unlike freeQ on some UNIX systems.

voidfoo_x ( ){

Foo_t * f1 . , * f .2 ì

f l = NULL;f2 = NULL;TRY

f l = mem_al loc_x(s izeof (Foo t ) ) ;f2 = mem_al loc_x(sizeof (Foo_t) ) t

b o d y o f f o o x . . .THEN-TRY

m e m _ f r e e ( f 1 ) ;mem_free (f2l ¡

CATCH (err)throw_x(err) t

END-TRY)

Figure 9: Cleaning up after an error

This kind of analysis (for resource leaks) has tobe performed every time a library fr¡nction is writ-ten. In order to make the analysis and coding easierto perform, we strictly enforce two conventions.First, all library functions that are capable of raisingexceptions have names suffixed by _x. Second, alldeallocation routines are required to ignore a NULLargument. The _x convention has proven to be quiteuseful, because it is otherwise very difficult to tellwhether or not a particular stretch of code is capableof raising exceptions, and this is critical for resourceleak analysis. It is not convenient to assume thatevery function call could raise an exception, becausewe make heavy use of access macros to replacedirect access to structure members, and these usuallvdon't raise exceptions.

Documenting Exceptions

One of the rules that we have tried to enforcÊis that the enor interface provided by each functionmust be fully documented. After all, this error inter-face is part of the contract that the function makeswith its callers, just as surely as the result and argu-ment types are.

We have run into several problems trying toachieve this goal.

The first problem arises from the fact that it isdifficult to document the error interface of high-levelfunctions if the low-level functions that they calldon't have well:defined error interfaces. Unfor-tunately, we have this problem with the UNIX sys-tem calls. On many UNIX systems, the set of enors

Summer '92 USEMX - June 8-June 12,lgg2 - San Antonio, TX t29

A Discipline of Error Handling

generated by each system call (and the semantics ofsystem calls when an error is detected) is only par-tially documented, Furthermore, the enor interfacefor system calls is not portable: it changes from onesystem to the next. Consequently, EMS libraryfunctions that perform I/O currently have a poorlydocumented, system dependent enor interface, andwriting error handlers for code that does I/O some-times involves experimentation to find out the namesof the error numbers of interest, combined with#ifdefs to check for different enor numbers on dif-ferent machines. The obvious solution to this prob-lem, which we haven't had time to pursue, is todefine a portable error interface for each system call,then to work out the mapping from system enor idsto portable enor ids for each system call andmachine type.

The second problem arises from the fact thatthe error interface for a function tends to be definedæ the union of the enor interfaces for all of thefunctions that it calls. Not only does this lead tolarge, cluttered enor interfaces, but it also meansthat error interfaces tend to change over time as aresult of maintenance, and thus the documentationfor error interfaces tends to drift out of sync withreality. This can be viewed as a documentation andmaintenance problem, in which case the solutionmight be to attempt to generate the documentationfor error interfaces automatically, using a codeanalysis tool. However, this problem can also beviewed æ a design problem: in some cases, pro-grammers are not making the effort to design ahigh-level, abstract error interface that matches theabstraction provided by the function; instead, theyare letting the enor interface default to whatevertheir code does.

Many strongly typed languages with built-inexception mechanisms either permit or require youto declare the error interface of a function as part ofits type. (C++ and Modula-3 are examples.) Thisdeclaration consists of a fixed list of error ids. Inour experíence, life is not that simple. If you aredoing object oriented programming (as we do), thenwhat you will find is that different subclasses of abase class A will implement different error interfacesfor the virtual function inherited from A. For exam-ple, we have a class called Stream (similar to a stdioFILE). The set of enors that c¿n occur in (eg) thewrite-x virtual function depends on which subclassof Stream you are writing to. As a result, the errorinterface for a function that takes a Stream as anargument depends on which subclass of Stream isactually passed in. I-anguage designers might wishto consider polymorphic exception sets within func-tion signatures to deal with this issue.

Moen

Testing

An important part of our error handling pack-age (and a topic rarely discussed in the literature onexception handling) is testing. Full adherence to theenor handling discipline described here adds anoticeable amount of complexity to our code, mostlyin the form of exception handlers withín libraryfunctions which deallocate resources and restore datastructure invariants before forwarding an exception.These exception handlers are rarely executed in pro-duction code, and are therefore a fertile breedingground for bugs. Fortunately, we have developed asimple, yet powerful mechanism for testing theseexception handlers. This mechanism works bytriggering fake exceptions in lowJevel functionsunder the control of command line arguments. Thesefake exceptions set off a cascade of intermediateexception handlers, thereby exercising them. Whenused in conjunction with a library regression testprogram, this mechanism can be made to exercise allof the exception handling code in a library.

A control point for triggering a fake exceptionis placed by calling the macro xtrap_x0, which takesan error id as an argument. These control points areplaced in low level library functions, at points wherean exception could potentially be raised by naturalmeans. Only a handful of calls to xtrap_x areneeded in EMS; the single call to xtrap_x inmem_alloc_x (our wrapper for malloc0) is sufficientto test 75-80Vo of our exception handling code.

EMS has a mechanism for passing debugoptions into a program for use by library routines.These debug options can either be placed in anenvironment variable, or they can be supplied asarguments to the command line flag -V, which allEMS programs support. The xtrap mechanism istriggered using the command line argument -V xt=i,which causes the i-th dynamic invocation of xtrap xto raise an exception. The command line argument-V xc causes a program to run to completion, thenprint out the number of times the xtrap x macro wasexecuted. To exercise a library, we write a test pro-gram which exercises all of the functions in thelibrary. Then we run the program with -V xc, whichgives us the. number n. Finally, we run the programn times (from a shell script), supplying the i-th invo-cation with the argument -V xt=i. This is usuallysufficient to exercise 99Vo of the exception handlingcode in the library (assuming that the test program iswritten to provide full coverage of the library).rWhen this testing technique is combined with thedebugging version of our storage allocator (whichdetects bad calls to free and storage leaks), we candetect most problems involving improper releasingof resources in exception handlers.

130 Summer '92 USENIX - June t.June 12,1992 - San Antonlo, TX

Moen

Evaluation

The EMS approach to error handling is a suc-cess. Our code is robust, well behaved, and testable,and programmer acceptance of the approach is high.There is an initial cost in learning how to use theerror handling system, especially for junior program-mers, who can be a little uncomfortable with the factthat most functions they call are liable to perform anon-local jump upon encountering an error. How-ever, once programmers get past the initial learninghump, they seem to like the system, since using itleads to cleaner looking code that is less obscured byenor handling than equivalent code which checkseach ñ¡nction call for error return values.

Under the old regime of sígnalling enors byreturn codes, the most comrnon bug is failing tocheck the return code. This can lead to nastybehaviour, like core dumps. Under the new regimêof using exceptions to signal enors, the most com-mon bug is failing to catch exceptions for the pur-pose.of freeing resources, and this leads to resourceleaksr. The new style of bug has fewer harmfuleffects on the overall robustness of the code.

There is still room for improvement. A moreflexible method for organizing eiro. ids into a híerar-chy, so that programmers can test an enor formembership in a group, and a way of associatinginteger and string parameters with an error, would béwelcome improvements. So would a lint-like toolfor- detecting common bugs in exception-handlingcode, and a tool to automate the documentation oienor interfaces by scanning source code.

The EMS enor handling system compares quitefavourably to other C exception handling packages.Its greatest strength is the Enor_t structuie, whichincorporates the novel idea of a stack of errorinterpretations, and provides standard ways to printarbitrary error descriptions, and to transmit themacross an IPC connection. Roberts' system [Roberts89] provides a mechanism for raising and handlíngexceptions very similar to the EMS system, buterors are represented by a pair consisting of an errorid and an integer parameter. Allman's ðystem [All-man 85] provides a flexible way to organize enor idsinto groups, has character string parameters whichare accessible to error handlers, and integrates UNIXsignals with the exception system. On the minusside, his system requires assembly language support,and has a much less convenient syntax for excéptionhandlers. Allman also supports the more generalresumption model of exception handling. My feeling

A Discipline of Error Handling

is that terminate and resume style exceptions shouldbe signalled and handled by different mechanisms;see the appendix.

A Plea For Standardization

The benefits of an exception handling sysremare strongest when it is used everywhere. Accord-ingly, the EMS utilities library contains exception-raisíng replacements or wrappers for many of themost commonly used C library functions. We evenwent so far as to implement a complete replacementfor stdio with bettef error handlin!2 (ana of coursebetter performance). Unfortunately, we don't get thefull benefit of our enor handling system when weuse libraries, such as those for the X Window Sys-tem, that were written by other people. Bec¿use theC enor system (errno, perror, etc.) is not extensible,every library implementor is forced to create theirown enor handling system, and application program-mers are stuck with the job of knitting these dif-ferent enor handling systems together.

We think the world would be a better place ifthe C community had a standard, extensibie, andwell designed system for fault and failure handlingwhich all library implementors used. This probablywon't happen in the C community, but there is stillhope for new languages such as C++. Implementa-tions of C++ that support exception handling are justnow starting to appear. Unfortunately, syntax forraising and handling exceptions is not enough: thereshould also be standard conventions for using it, astandard way to print errors, and exceptions raised inall standard library routines upon failure.

. References

[Allman 85] Eric Allman and David Been. "AnException Handler for C, " Proceedings of theSummer 1985 USENIX Conference. Portland.Oregon, 1985.

[Darwin 85] I. Darwin and G. Collyer. "Can't hap-pen -or- /* NOTREACHED *l -or- Real Pro-grams Dump Core," Proceedings of the Winter1985 USENIX Conference. Dallas, Texas, Janu-ary 1985.

[Ellis 90] Margaret A. Ellis and Bjarne Srroustrup.The Annotated C++ Reference Manual.Addison Wesley, 1990.

[Goodenough 75] John B. Goodenough. "ExceptionHandling: issues and a proposed notation,"Communications of the ACM. Vol. 18, no. 12,December 1975.

[Liskov 79] Barbara A. Liskov and Alan Snyder."Exception Handling in CLU," IEEE Transac-tions on Software Engineering. Vol. SE-5, no.

@t went wrong when an enoroccurs, We discovered the ha¡d way that the contents ofe¡rno ¿ue trot to be trusted afte¡ a stdio error,

/Using the debug version of the EMS memory allocator,resource leaks a¡e not diffrcult to diagnose, since weprovide a facility for listing the memory blocls which arestill allocated at program exit time; this list gives the filename and line numbe¡ of the call to mem alloc x for eachallocated block.

Summer '92 USENIX - June 8-June L2,l99Z - San Antonio, TX 131

A Discipline of Error Handling

6, November 1979.[Meyer 88] Bertrand Meyer. Object-Oriented

Software Construction. Prentice Hall, 1988.[Nelson 91] Greg Nelson. Systems Programming

with Modula-3. Prentice Hall, 1991.[Randell 75] Brian Randell. "system Structure for

Software Fault Tolerance," IEEE Transactionson Software Engineering. Vol. SE-l, no. Z,June 1975.

[Roberts 89] Eric S. Roberts. "Implementing Excep-tions in C," Research Report 40, Digital Sys-tems Research Center, March 21, 1989,

[Steele 90] Guy L. Steele Jr. Common Lisp. DigitalPress, 1990.

[Yemeni 85] Shaula Yemeni and Daniel Beny. ,,4modular verifiable exception-handling mechan-ism," Transactions on Programming l:nguagesand Systems. Vol. 7, no. 2, April 1985.

Author Information

After 7 years of hacking UNIX, C, Macintosh,and graphical user interfaces, Doug Moen receivedhis B.I.S. from the University of Warerloo in 1982.His bachelor's thesis was the design of a program-ming language (Goal) with powerful abstractionmechanisms and a polymorphic type system basedon dependent types. In 1988, after graduating, Dougjoined Interactive Image Technologies wliere hemade major contributions to an object-oriented, plat-form independent graphical user interface libiary,and was the principle designer and architect forHyperCase, a multi-media authoring tool. In 1989,Doug moved to Nixdorf, where he has made sub-stantial contributions to the design and implementa-tion of EMS, a programmers toolkit for buildingdocument image processing systems. Doug is nowavailable for employment. Reach him [email protected], or at (416) 977-490'1, or at 77 CarltonSheet #1504, Toronto, Ontario, MsBZJ7.

Appendix: Notify and Signat Conditions

This paper has dealt with 2 kinds of excep-tional conditions which can be detected bv a librarvfunction: faults and failures. Both of theie kinds ofconditions result in the termination of a function callonce they are detected. But there is a third class ofexceptional conditions: these are conditions whichneed not prevent the function from completing itstask, but which may nevertheless need to be reportedto the caller before the function has completed itstask. Goodenough [Goodenough 75] distinguishestwo types of exceptional conditions which fall intothis category, which he calls notify and signøl condi-tions. A handler for a notify condition is prohibitedfrom terminating the operation. A handler for a sþ-z¿l condition is given the choice of terminating theoperation, or fixing the problem and resuming it.

Moen

For an example of a notify condition, considerpclose(3), which repeatedly calls wait(Z) until rheprocess associated with the argument stream hasexited. The exit statuses of child processes whichare different from the process that pclose0 isattempting to wait for are simply thrown away, andprograms that cåll pclose0 have no way of obtainingthis information. This is a design flaw in pclose0which could be rectified through the use of amechanism for reporting notify conditions.

For an example of a signal condition, considera function which is communicating with a remoteprocess over a network connection. If the remoteprocess has apparently stopped talking, then thefunction has two choices: it can report a failure, or itcan keep trying to communicate with its peer, on theassumption that the peer might resume communica-tion in a few seconds. In practice, timeouts areoften used in this kind of situation. A betterapproach might be to report a signal condition, andlet the caller attempt to fix the problem, or obtainadvice from the user on whether to retry or abort.

A second example of a signal condition is amemory alloc¿tor which gives the caller an oppor-tunity to free cached memory blocks before reportinga failure.

As I have tried to show, signal and notify con-ditions a¡e real. The question is, what combinationof language features and programming conventionsare most appropriate for dealing with them?

One approach is to incorporate the reporting ofsignal and notify conditions into the same exceptionhandling mechanism that is used to report failures.This leads to the retry model of exception handling.In this model, when an exception is raised, the con-text raising the exception is not immediately ter-minated. Instead, the exception handler is given thechoice of terminating the function that raised theexception, or resuming it. The Mesa programminglanguage supported this model of exception han-dling; so does Common Lisp [Steele 90]; and sodoes Allman and Been's exception handler for C[Allman 85].

An evident disadvantage of implementing theretry model of exceptíon handling in C is that it isnecessary to make exception handlers into separatefunctions. Not only is this grossly inconvenient, butthe exception handlers do not have access to thelocal variables associated with the statement blockthat they are associated with.

The retry model of exception handling providesdynamically scoped handlers for signal and notifyconditions that are associated with specific blocks ofcode. It is not clear that this is the best scopingregimen. Alternatives are to associate exceplioñhandlers with modules, or to associate them withobjects. Handlers with module scope can be imple-mented by registering a callback fi¡nction with the

t32 Summer '92 USENIX - June E.June 12,1992 - San Antonio, TX

Moen

appropriate module. Handlers with object scope canbe implemented using object oriented programming,by oveniding a virtual function in the class of theobject.

I have not yet developed a personal philosophyconcerning the proper freatrnent of signal and notífyconditions. However, I would like to show howdynamically scoped handlers for these conditionscould be added to the EMS enor handling system.The technique could probably be adopted to workwith any C-based termination-model exception han-dling system.

The function

int not i fy(Error_t *)

is used by a function to notify its client of an abnor-mal condition; it is used to implement both "signal"and "notify" exceptions as described above. TheEnor-t structure is used to describe the abnormalcondition. The value returned by noti$0 is one ofthe values:N_ABORT The client wishes the function to abort

by raising an exception.N_RETRY The client wishes the function to keep

trying.N_IGNORE The client ignored the notification.A client can catch a notification using the followingcontrol structure:

NOTIFY ( fun)code which may raisea not i f icat ion

END-NOTTFY

The argument to NOTIFY is a function pointer withtype

in t ( * fun) (Er ro r_ t * )

This function should examine the Error structure(and in particular, the error id), and returnN_ABORT, N_RETRY or N_IGNORE. The returnvalue N_IGNORE means that the notificationhandler function did not recognize this particularnotification.

NOTIFY END_NOTIFY blocks can b€nested in the same way that TRY ... END_TRYblocks can. When a notiflcation is raised, notifyQsearches the stack of NOTIFY blocks, calling eachnotification handler function in turn until one ofthem returns a value different from N IGNORE. Ifno handlers are present, or no handler- is willing tohandle the condition, then notifyQ returnsN-IGNORE.

A Discipline of Error Handling

Summer '92 USENIX - June 8-June L2, L992 - San Antonio, TX 133