
OPERATING SYSTEMS

Thomas H. Payne

Department of Computer Science and Engineering

University of California, Riverside

September 27, 2004


Copyright © 1990–2004 by Thomas H. Payne. All Rights Reserved.

Draft 0.9.2

Contents

1 INTRODUCTION 1

1.1 Key Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Hardware Support . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2.1 Kernel Invocations . . . . . . . . . . . . . . . . . . . . . . 5

1.2.2 Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2.3 Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.2.4 Restoring CPU Context . . . . . . . . . . . . . . . . . . . 8

1.3 The Object/Server Paradigm . . . . . . . . . . . . . . . . . . . . 8

1.4 History and Perspective . . . . . . . . . . . . . . . . . . . . . . . 11

2 Binding 13

2.1 Resolution and caching . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2 Compiling, Linking, and Loading . . . . . . . . . . . . . . . . . . 15

2.2.1 Compiling . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.2.2 Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.2.3 Static Linking . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2.4 Dynamic Linking/Loading of Unshared Modules . . . . . 18

2.2.5 Dynamically modifiable bindings . . . . . . . . . . . . . . 19

2.2.6 Shared Modules . . . . . . . . . . . . . . . . . . . . . . . 21

2.2.7 Binding to Remote Services . . . . . . . . . . . . . . . . . 22

2.3 Binding in Other Classes of Servers . . . . . . . . . . . . . . . . . 24

3 MULTIPROGRAMMING 27

3.1 Process Anatomy . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2 Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3 Process Creation and Termination . . . . . . . . . . . . . . . . . 31

3.3.1 Other Approaches . . . . . . . . . . . . . . . . . . . . . . 33

3.4 Process Migration . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.5 Daemons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 MULTITHREADING 39

4.1 Concurrency and Servers . . . . . . . . . . . . . . . . . . . . . . . 43



5 THREAD SAFETY AND COORDINATION 49

5.1 Mutual Exclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.1.1 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.1.2 Preemption Blocking . . . . . . . . . . . . . . . . . . . . . 52

5.2 Monitors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.2.1 Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5.2.2 Structuring via Monitors . . . . . . . . . . . . . . . . . . 60

5.2.3 Broadcast . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.2.4 High-level locks, an example. . . . . . . . . . . . . . . . . 64

5.2.5 Thread Safety via Monitor Encapsulation . . . . . . . . . 66

5.2.6 Thread safety via monitor-based schedulers . . . . . . . . 68

5.2.7 An Example Using Prioritized Waiting . . . . . . . . . . . 71

5.3 Semaphores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.4 Coordination via waiting for message replies . . . . . . . . . . . . 77

5.5 History and Perspective . . . . . . . . . . . . . . . . . . . . . . . 78

6 Bootstrapping High-Level Coordination 81

6.1 Low-Level Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.1.1 Block Locks . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.1.2 Spin Locks . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.1.3 Number Locks . . . . . . . . . . . . . . . . . . . . . . . . 85

6.1.4 Notes on Low-Level Locks . . . . . . . . . . . . . . . . . . 87

6.2 Priority Queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

6.3 Passing a CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.4 Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.4.1 threads.H . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6.4.2 threads.cc . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6.5 Monitors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.5.1 monitors.H . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.5.2 monitors.cc . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6.6 Ready . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

6.6.1 urmonitor.H . . . . . . . . . . . . . . . . . . . . . . . . . . 99

6.6.2 urmonitor.cc . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

7 DEADLOCKS 103

7.1 Deadlocks and Monitors . . . . . . . . . . . . . . . . . . . . . . . 104

7.2 The Resource-Allocation Model . . . . . . . . . . . . . . . . . . . 105

7.3 Deadlock Prevention . . . . . . . . . . . . . . . . . . . . . . . . . 106

7.3.1 Sharing by Multiplexing Preemptable Resources . . . . . 107

7.3.2 Preventing Cycles by Acquiring Resources in Order . . . 107

7.3.3 Not Waiting While Holding Resources . . . . . . . . . . . 109

7.4 Backing off for performance reasons. . . . . . . . . . . . . . . . . 109

7.5 Deadlock Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . 111


8 SCHEDULING 113

8.1 Performance terminology . . . . . . . . . . . . . . . . . . . . . 114

8.2 Time of Selection . . . . . . . . . . . . . . . . . . . . . . . . . 116

8.3 Selection Policies . . . . . . . . . . . . . . . . . . . . . . . . . 117

8.4 Priority Inversion . . . . . . . . . . . . . . . . . . . . . . . . . 122

8.5 Load leveling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

8.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

9 MEMORY MANAGEMENT 125

9.1 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

9.2 Memory-access anomalies under concurrency . . . . . . . . . . . 127

9.3 Indirection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

9.3.1 Dynamic rebinding/relocation . . . . . . . . . . . . . . . . 132

9.3.2 Failure control . . . . . . . . . . . . . . . . . . . . . . . . 132

9.3.3 Copy-on-write . . . . . . . . . . . . . . . . . . . . . . . . 132

9.3.4 Dynamic Memory Allocation . . . . . . . . . . . . . . . . 133

9.3.5 Interleaving . . . . . . . . . . . . . . . . . . . . . . . . . 138

9.4 Data Logistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

9.4.1 Distributed Tables . . . . . . . . . . . . . . . . . . . . . . 140

9.4.2 Multilevel Storage Systems . . . . . . . . . . . . . . . . . 141

9.4.3 Management of Two-Level Storage Systems . . . . . . . . 143

9.4.4 Expected Hit Ratio vs. Cache Capacity . . . . . . . . . . 145

9.5 Snapshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

9.6 Memory-Management Hardware . . . . . . . . . . . . . . . . . . 147

9.6.1 Address-Translation Schemes . . . . . . . . . . . . . . . . 148

9.6.2 Flat vs. per-process address spaces . . . . . . . . . . . . . 151

9.7 Virtual Memory and Disk Caching . . . . . . . . . . . . . . . . . 151

9.8 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

10 DEVICE MANAGEMENT 155

10.1 I/O Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 155

10.1.1 Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

10.1.2 Device Controllers . . . . . . . . . . . . . . . . . . . . . . 156

10.2 I/O Programming . . . . . . . . . . . . . . . . . . . . . . . . . 158

10.2.1 Driver Binding (i.e., Registration) . . . . . . . . . . . . . 159

10.2.2 I/O Coordination . . . . . . . . . . . . . . . . . . . . . . 160

10.2.3 Disk I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

10.3 Pushed Data Transfers . . . . . . . . . . . . . . . . . . . . . . . 168

10.4 Reduced-Copy I/O . . . . . . . . . . . . . . . . . . . . . . . . . 169

11 VOLUMES and DYNAMIC BINDING 171

11.1 Opened files as streams . . . . . . . . . . . . . . . . . . . . . . 171

11.2 Volumes and File Systems . . . . . . . . . . . . . . . . . . . . . 173

11.3 Volume-based Name Management . . . . . . . . . . . . . . . . . 174

11.4 Descriptor Management . . . . . . . . . . . . . . . . . . . . . . 177

11.5 Access Management . . . . . . . . . . . . . . . . . . . . . . . . 179


11.6 Free-Space Manager . . . . . . . . . . . . . . . . . . . . . . . . 180

11.6.1 Fetch Optimization . . . . . . . . . . . . . . . . . . . . . 181

11.6.2 Write-Back Optimizations . . . . . . . . . . . . . . . . . 182

11.6.3 Fault-Tolerance Strategies . . . . . . . . . . . . . . . . . 182

11.7 File Management Under Unix . . . . . . . . . . . . . . . . . . . 183

12 PROTECTION 185

12.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

12.2 The Access-Control Database . . . . . . . . . . . . . . . . . . . 186

12.3 Managing Access Rights . . . . . . . . . . . . . . . . . . . . . . 187

12.4 Dynamic Protection . . . . . . . . . . . . . . . . . . . . . . . . 191

12.5 Changing Domains . . . . . . . . . . . . . . . . . . . . . . . . . 192

12.6 Unix Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

12.7 Windows NT Example . . . . . . . . . . . . . . . . . . . . . . . 195

12.8 NSA’s SELinux . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

12.9 Rejection of Access Requests . . . . . . . . . . . . . . . . . . . 196

12.10 Virtual-Machine Systems . . . . . . . . . . . . . . . . . . . . . 197

12.10.1 Virtualizability of Third-Generation Architectures . . . . 199

12.10.2 Nested Virtual Machines . . . . . . . . . . . . . . . . . . 202

12.10.3 A Virtualizable Architecture . . . . . . . . . . . . . . . . 204

13 SYSTEM ADMINISTRATION 207

A Interthread Communication 211

B Terminology 213

B.1 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

B.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

B.3 Thesaurus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

C TO DO 221

D Fall 2004 225

D.1 Getting information . . . . . . . . . . . . . . . . . . . . . . . . 225

D.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

D.2.1 Official ABET objectives for CS153 . . . . . . . . . . . . 226

D.2.2 Some additional objectives for this offering . . . . . . . . 226

D.2.3 Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

D.2.4 Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . 227

D.3 Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

D.4 Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

D.4.1 Week 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

D.4.2 Week 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

D.4.3 Week 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

D.4.4 Week 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

D.4.5 Week 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230


D.4.6 Week 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

D.4.7 Week 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

D.4.8 Week 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

D.4.9 Week 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

D.4.10 Week 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

D.4.11 Week 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

D.4.12 Finals Week . . . . . . . . . . . . . . . . . . . . . . . . . 233


PREFACE

These notes are intended to provide a concise introduction to the principles underlying the design of operating systems. I have been guided by the beliefs that:

• He who distinguishes well teaches well. Aristotle?

• Inside of every fat text is a thin text screaming to get out.

• To avoid text bloat, one must avoid distinctions that lack differences.

Unfortunately, competition, both industrial and academic, tempts innovators to inflate the novelty of new applications of old ideas by giving them new names. Also, to some extent the computing industry discovers new concepts the way that Alzheimer’s patients make new friends, i.e., through failure to recognize old ones, especially in new contexts.

I have chosen C++ as the programming language for expressing examples in this text because its C subset is the most popular systems programming language at this time. Also, in its high-level form, C++ is more suitable than most pseudo-languages for expressing algorithms and elements of design.

Thomas Payne, 5/1/2000



Preface to Draft 0.9.2

The 0.9 draft of these notes contains nearly all of the essential material that I want to include in undergraduate Operating Systems. The fundamental organization and terminology have been worked out in detail, but the coherence of the writing and the correctness of some assertions remain a problem. As of now, 9/26/04, I’m embarking on revision 0.9.3. On this pass, I hope to fix as many deficiencies as possible.

THP



Chapter 1

INTRODUCTION

Start with server paradigm. OS is just a server program, e.g., UML.

What’s an “operating system”? An operating system is a collection of software — programs and linkable libraries of program modules — whose purpose is to run other programs as efficiently as possible, controlling when and how they access physical and logical resources.1 CPUs, memory, and I/O devices are examples of physical resources. Files, messages, and operating-system tables are examples of logical resources.

In general, there are three reasons that an operating system might withhold access to a system resource:

• protection, e.g., my program should not be able to read your mail file.

• coordination, e.g., my program should not be allowed to read a multiword variable while your program is updating it, lest my program read the first half of the prior value and the second half of the new value.

1It can be helpful to consider the difference between operating systems and other software that might be said to “run programs.”

A loader is a program that quits after it reads a program from a specified file into a segment of main memory, initializes global variables and jump destinations, and starts that program running. A loader, however, does not stay around to control the running of that program; in particular, it does not control the program’s access to resources. Therefore, loaders are not operating systems.

An interpreter runs programs, controlling everything about the running of the program. Some interpreters interpret instruction sets for real or imaginary machines. Those for real machines are called emulators. In either case, a running instruction-set interpreter is called a virtual machine. It is possible to write an emulator for any machine in a high-level language and then run it on any machine having a compiler for that language. In particular, it is possible to run an emulator for a machine on that machine itself. Such an emulator is said to be self resident. An interpreter could be called an operating system, but the overhead involved in retaining control usually involves a factor of three slowdown. (Some recent emulators, e.g., FX32, VirtualPC, VMware, and FREEMware, behave like interpreters but use just-in-time translation to significantly reduce software overhead.)

A debugger runs programs and retains the ability to regain control at any time, either at the whim of the user or at predetermined “breakpoints.” A debugger, however, does not manage the program’s access to other resources.



• scheduling, e.g., my program may have to wait when it is your program’s turn to use the CPU.

It’s sometimes helpful to think of an operating system as a traffic cop.

Of what does an operating system consist? A distribution2 of an operating system usually includes various utilities (e.g., web browsers, compilers, linkers, command interpreters), but the operating system itself consists of:

• The OS’s kernel, which is a special program that contains the routines and data structures for the handling of kernel-invoking events, i.e., occurrences of:

– traps,

– interrupts,

– system calls

each of which diverts the processor (CPU) on which the event occurs from the execution of the program it is currently running to the execution of prespecified code for the handling of that event.3 Occurrences of kernel-invoking events are exactly the times when the operating system must intervene to maintain control of the connection between the affected CPU’s current program and system resources and thereby to facilitate the sharing of those resources. (See Subsection 1.2.1.)

• application-program interfaces (APIs), which are publicly available libraries of functions for connecting to and managing resources (mostly via system calls).

• daemons, which are special application programs, run by the kernel, that assist the kernel in managing and keeping track of resources, e.g., by listening for and responding to special Internet messages.

Not all experts agree on what functionality to include in an operating system and where to put that which is included. In the mid-1980s, proponents of microkernels recommended moving functionality (e.g., file systems) from the kernel to the daemons. More recently, proponents of exokernels have recommended moving functionality from the kernel to the APIs. Yet others put device drivers into a hardware-abstraction layer (HAL) outside the kernel.

To bypass such software-architecture issues, the IEEE POSIX standard for Unix-like operating systems specifies only the semantics (i.e., behavior) of the APIs and leaves to the implementers’ discretion the distribution of functionality among the kernel, the daemons, and APIs.

Discuss standards in general.

2See http://www.over-yonder.net/~fullermd/rants/bsd4linux/bsd4linux1.php for one author’s views on the complexities of distributions, releases, etc.

3These kernel routines are called handlers or service routines for kernel-invoking events.


Initialization (bootstrapping). If operating systems are programs that run other programs, what causes an operating system to run? Whenever a CPU’s power comes on or its reset button is pushed, that CPU begins executing its operating system via a process called bootstrap loading. Usually, a very small preloader program in read-only memory (ROM) copies a larger bootstrap loader or boot manager program from some “bootable” device (e.g., the first sector of the main hard disk) into main memory and transfers control to it, e.g., branches to the bootstrap loader’s entry point, i.e., its first instruction. The bootstrap loader in turn reads in and runs the kernel, which runs until the system is shut down. That kernel, in turn, starts up some daemon processes. The system is then ready to run user programs.

There are, of course, some obvious variations on this process. Embedded systems, for example, have operating-system kernels that are preloaded in ROM, and diskless workstations obtain their bootstrap loader and kernel via a network connection.

Command interpreters. The fundamental function of a command language4 is to allow users to specify what programs to run on what resources (e.g., what files). Some command-language implementations are part of an operating system, e.g., the GUI on MS Windows. Most books on operating systems, however, consider command-language implementations to be utility application programs and not an intrinsic part of the operating system itself.

The implementation of any language involves two stages: translation followed by interpretation. The translator does lexical analysis (a.k.a. scanning) and syntax analysis (a.k.a. parsing) and generates an equivalent program in some interpretable language. The interpreter executes that translated code with the aid of a collection of routines that are invoked by the interpreter to implement (some of) the semantic constructs of the language. This collection of routines is often called the language’s run-time support package.

Command languages usually involve pure interpretation, i.e., no translation, and their interpreters are often called shells. Each command language has its own syntactic peculiarities, but they share some common semantic characteristics that generate a common set of problems to be solved by their run-time support routines. From the standpoint of programming languages, operating systems can be viewed as run-time support packages for command languages (or GUIs that serve as command languages).
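As a concrete illustration (not from the original text), here is a minimal sketch of such a pure-interpretation command loop for a Unix-like system; its run-time support consists of the operating system’s API calls fork, execvp, and waitpid:

#include <sys/types.h>
#include <sys/wait.h>   // waitpid
#include <unistd.h>     // fork, execvp, _exit
#include <cstdio>       // perror
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::string line;
    for (;;) {
        std::cout << "% ";                             // prompt
        if (!std::getline(std::cin, line)) break;      // end of input: quit the shell
        std::istringstream in(line);
        std::vector<std::string> words;
        for (std::string w; in >> w; ) words.push_back(w);
        if (words.empty()) continue;
        std::vector<char*> argv;                       // argument vector for execvp
        for (std::string& w : words) argv.push_back(&w[0]);
        argv.push_back(nullptr);
        pid_t pid = fork();                            // create a child process
        if (pid == 0) {                                // child: run the requested program
            execvp(argv[0], argv.data());
            std::perror("execvp");                     // reached only if the exec failed
            _exit(127);
        }
        int status;
        waitpid(pid, &status, 0);                      // parent: wait for the command to finish
    }
    return 0;
}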

1.1 Key Considerations

Three basic considerations make the design and implementation of operating systems difficult and interesting: concurrency, protection, and dynamic binding. Concurrency implies the need to share resources among various runs of various

4Also called a “user-interface language,” a “job-control language,” or a “shell language.”


programs, and we need dynamic binding and protection to facilitate and control such sharing.5

Concurrency arises from the fact that any reasonable computer system must do more than one thing at a time, e.g., write a record to disk while computing the information for the next record to be written.6 Also, it improves usability when users can run more than one program at a time, e.g., to be able to edit one program while compiling another.

The terms “concurrent,” “parallel,” “simultaneous,” and “overlapping” mean essentially the same thing, but some authors make distinctions based on whether or not multiple distinct activities can compute at exactly the same time. If so, those activities require simultaneous support from distinct hardware devices, e.g., CPUs. Of course, if there is only one CPU,7 two programs can’t run on it at exactly the same time. They can, however, appear to run at the same time by rapidly passing that CPU back and forth. Such programs have overlapping lifetimes but they do not have overlapping periods of CPU usage. Some but not all authors use the term “concurrent” or “pseudo-concurrent” for activities having overlapping lifetimes and “parallel” for activities having overlapping periods of CPU usage. (The best policy is to look carefully at the context whenever this distinction is important.)

Protection is the ability of the operating system, specifically its kernel, to maintain control of all access to shared resources. To maintain such control, programs must be forced to access such resources only by invoking trusted routines provided by the operating system. It is important that programs not be allowed to circumvent those trusted routines and access shared resources in an uncontrolled fashion. For instance, a program should not be allowed to overwrite the operating system, thereby preventing the execution of further programs, or to interfere with the running of other users’ programs, or to write on a printer that is in use by another program.8

Dynamic binding is the act of specifying at run time what actual routine or object a particular name or reference refers to. For instance, a file-sorting program won’t usually know until run time what actual file is to be sorted — that file may not have existed when the program was written, compiled, or linked. Inside the program, the file to be sorted is referred to via a fictitious name such as DATAfile. At run time the program binds that fictitious name to the actual file to be sorted (e.g., employees) by invoking a routine that opens

5Of course, other issues must also be considered in the design of an operating system — e.g., as in the design of any system, one must consider: cost, performance, fault tolerance, and usability.

6The objective is to improve performance through the overlapped use of two resources capable of parallel operation, namely, the disk and the CPU.

7And that CPU is monothreaded.

8There is, however, an interesting and radically different view of the purpose of protection: “The purpose of the protection features of the 80386 is to help detect and identify bugs.” ([IN], page 6-1.)


that actual file.9 Dynamic binding of names to files, in turn, requires dynamic binding of, say, the write operation to the appropriate routine for the kind of file being written — a program must sometimes write to a disk file, sometimes to a terminal, and sometimes to a printer.

We’re talking about more than dynamic binding here, since even the spelling of the name doesn’t exist until run time. Let’s call it “dynamic naming”.
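To make the idea concrete (this example is not from the text), here is a minimal sketch, using the Unix calls open and read, of a program that binds a fictitious name to whatever actual file is named on its command line at run time:

#include <fcntl.h>     // open
#include <unistd.h>    // read, close
#include <cstdio>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    int DATAfile = open(argv[1], O_RDONLY);       // bind the name to the actual file
    if (DATAfile < 0) { std::perror("open"); return 1; }
    char buf[256];
    ssize_t n = read(DATAfile, buf, sizeof buf);  // dispatched to the handler for this kind of file
    std::printf("read %ld bytes from %s\n", (long)n, argv[1]);
    close(DATAfile);
    return 0;
}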

1.2 Hardware Support

To facilitate concurrency, protection, dynamic binding, and the efficient sharing of resources, most architectures include the following features: traps, interrupts, system calls, protection, memory management, I/O, the storing/restoring of registers, and atomic operations (e.g., test-and-set). We will now discuss the need for (some of) these features.

1.2.1 Kernel Invocations

There are exactly three kinds of kernel-invoking events:

• An interrupt occurs in a given CPU in response to an action performed by a device external to that CPU, e.g., I/O completion, a clock/timer alarm, or a reset signal. It is involuntary on the part of the running program, and no parameters are passed (other than those provided by the underlying hardware).

• A trap10 occurs in a given CPU as a direct result of an action by its currently running program that requires intervention (e.g., dividing by zero or violating memory protection). It is not usually deliberate, and parameters are not passed (other than those provided by the underlying hardware). In modern highly pipelined CPUs traps are sometimes imprecise in the sense that (portions of) instructions following the instruction that trapped may have already been executed when the trap occurs.

• A system call occurs in a given CPU when its currently running program executes a system-call instruction,11 typically to request that the kernel grant access to some system resource. System calls differ from traps and interrupts in that:

– They are always deliberate actions of the currently running program.

– Parameters may be passed to (and return values passed back from) the kernel’s handler routine, which often runs on a different stack and in a different address space from those of the current program.

9Note, however, that this routine binds the fictitious name to the actual file only if that request conforms to the system’s protection policies.

10Also known as an exception or fault.

11Also known as a “trap instruction,” “software trap,” “software interrupt,” or “interrupt instruction.”


The code portion of a kernel may be viewed as a dynamically linkable shared library of service routines (handlers) for kernel-invoking events. But note that kernels also involve shared data tables as well as shared code.

Whenever the kernel is invoked, a branch in execution takes place to a specified location, namely the handler routine’s entry point, and execution continues from there. These execution branches are similar to procedure/function calls in that execution eventually returns to and resumes from the point at which the branch occurred, and sufficient context information (i.e., register values) must be saved so that the previously running program can be resumed. But kernel invocations differ from procedure/function calls in important ways:

• Vectored transfer of control. When any kind of kernel-invoking event occurs, the handler for that kind of event is invoked via a level of indirection: the CPU looks up the address of the handler for that kind of event in a vector table,12 which resides at a CPU-known location in a kernel-managed segment of main memory, and then invokes that handler. (A minimal sketch of such a table appears after this list.) The memory-protection system guarantees that only the kernel can install handlers for kernel-invoking events. We will discuss this mechanism in Section 2.2.7 on page 23.

• Suppression of this CPU’s protection level (i.e., increase in the CPU’s privilege level). Following a kernel invocation, more instructions can be executed and more locations can be accessed without causing a protection trap. (Often there is also a change of address space.)

• Suppression (i.e., blockage) of interrupts. Following a kernel invocation, fewer devices are allowed to interrupt this CPU; interruptions from the rest (up to one per device) are postponed until those interrupts become unblocked.

• Restoration of context. As control returns from a kernel routine to the invoking/interrupted code, the protection and interrupt status of the CPU must simultaneously be restored.
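The following sketch (with a hypothetical event numbering) shows a vector table as nothing more than an array of handler entry points, indexed by the kind of kernel-invoking event; only the kernel installs entries, and a kernel invocation amounts to an indirect transfer of control through the table:

enum Event { RESET = 0, CLOCK = 1, DISK_DONE = 2, SYSCALL = 3, NUM_EVENTS = 4 };

typedef void (*Handler)();                // a "vector" is just a handler's entry-point address

static Handler vectorTable[NUM_EVENTS];   // resides in a kernel-managed segment of memory

void installHandler(Event e, Handler h) { // kernel-only: memory protection keeps user code
    vectorTable[e] = h;                   // from reaching this table
}

void dispatch(Event e) {                  // what the hardware does when event e occurs
    vectorTable[e]();                     // vectored (indirect) transfer of control
}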

The following subsections discuss further the mechanisms underlying the last three of these features: protection, interrupts, and the restoration of context (i.e., CPU state) following a kernel invocation.

1.2.2 Protection

To manage system resources, kernels must be able to do things that user programs should not, e.g., write to any block on disk, access (read from or write

12Often, pointers to (i.e., addresses of) functions are called “vectors.” Except when it appears as the operand of sizeof or unary-&, the name of a C/C++ function represents the address of the function’s first instruction (i.e., entry point), just as the name of a C/C++ array represents the address of its first element. Hardware determines which event type is connected to which vector-table entry.


to) all memory locations, block interrupts, etc. Thus, certain instructions must behave differently when executed by a kernel vs. a user program. So, CPUs must operate in different modes, called protection levels or privilege levels, when running an OS kernel vs. a user program. Typically, there are two such modes:

• a low protection level (i.e., high privilege level), called kernel mode or system mode or privileged mode or supervisor mode, where anything goes,

• a high protection level (i.e., low privilege level), called user mode or protected mode or restricted mode, where certain instructions (and memory locations) are “off limits” in the sense that executing or accessing them causes a protection trap to occur.

An instruction is privileged if it traps (thereby invoking a kernel service routine) whenever it is executed on a CPU that is in user mode.13

A privileged instruction that simply puts the CPU into privileged mode would be useless, and a non-privileged instruction that puts the CPU into privileged mode would destroy protection. Instead, the hardware must be designed so that kernel invocations simultaneously increase the CPU’s privilege level and invoke a trusted handler, whose entry point is found in a protected hardware-known vector table.14 That kernel-trusted handler can then access all system resources.

1.2.3 Interrupts

The kernel includes data structures that are shared by many concurrent activities. To ensure the consistency of critical data structures, only one activity at a time can be allowed to access certain tables in certain ways. Each occurrence of an interrupt starts or resumes another activity within the kernel. So, whenever the kernel is invoked, interrupts must automatically become blocked15 until they are explicitly unblocked by the handler of that kernel invocation. Often, interrupts remain blocked until the handler returns control to the interrupted/invoking activity (e.g., user program).

The first occurrence of a blocked interrupt type (channel) is postponed until that interrupt type becomes unblocked — in most architectures, a given CPU can have at most one pending occurrence of a given interrupt type; subsequent occurrences beyond the first are usually ignored.

Some interrupts, such as reset and perhaps the clock, are usually nonmaskable, i.e., cannot be blocked. The maskable interrupts may be individually maskable or maskable in groups. We will assume there is only one such group, which includes all maskable interrupts.

13As the protection level is lowered, the memory mapping is usually set to a special configuration (e.g., disabled) that allows the kernel access to all locations of main memory.

14This policy of covariance of trust and privilege is fundamental to protection and security: only lower protection (“let your guard down”) when invoking trusted code.

15Instead of the term “blocked,” some authors use “masked,” “turned off,” “suppressed,” “disarmed,” or “disabled.” They all mean the same thing.


Good performance requires that we minimize the fraction of time and the duration of the intervals during which interrupts are blocked. But note that, when a handler is invoked, there is a change of context that may involve the saving and restoring of many registers and cache locations. A CPU can become flooded with interrupt-handling chores, a problem that is most critical in real-time systems.

1.2.4 Restoring CPU Context

By the time that the first instruction of an invoked handler is executed, the program counter (PC), protection level, the interrupt-blockage status, and possibly the stack pointer (SP)16 all have new values. The hardware must push the previous values of these registers onto a stack somewhere in main memory or leave them in special link registers where the handler can find and possibly save them.

Whenever a CPU’s interrupts are blocked, there is a significant likelihood that there will be a pending interrupt when they become unblocked. Since an interrupt can occur almost anywhere, one cannot expect an interrupt-invoked kernel routine to return to an instruction that unblocks interrupts. On the other hand, if the interrupts are unblocked before returning, an immediate interrupt can be expected, as can an eventual overflow of the area where previous register settings are stored.

The common solution to this conundrum is to provide a special return-from-kernel instruction that simultaneously returns to the interrupted code (possibly popping a specified stack and restoring certain registers) and restores interrupts to their previous status. An alternative is to have an instruction that turns on the interrupts (or sets the PSW) but to have its effect delayed by a few cycles during which there is time to return from the kernel routine.

1.3 The Object/Server Paradigm

The object paradigm is an abstraction that facilitates discussion of fundamental operating-systems concepts such as protection, concurrency, and dynamic binding. Unfortunately, that paradigm is so natural that aspects of it have been discovered by many groups, each of which employed different metaphors and terminology.

Per the object paradigm, an object is anything that has identity, state, and behavior. Each object is an instance of a particular class, and the object’s behavior in response to a given event (a.k.a. service request) is determined by its class, current state, and accessible environment. Behavior results from events generated by other objects. Behavior involves changes in the object’s state and causes further events, which provoke subsequent behavior in other objects.

16Often the interrupt status and the protection level are controlled by fields of a special register called the processor-status word (PSW).


In software, objects are implemented as data17 (i.e., state) together with a set of routines for manipulating that data. Commonly those routines are the only code having direct access to the object’s data — they implement the object’s behavior. Often software objects are used to represent or simulate real-world objects.

State. For our purposes, an object is a device and/or a portion of memory. The object’s state is the current state of that device and/or the value(s) held in that memory segment. Part of that memory is a descriptor (a.k.a. attribute record, description record, control block, etc.) that contains all pertinent information about that object and its current state. The descriptor’s fields are the object’s attributes (a.k.a. instance variables or member objects). They may include (references to) the routines that implement the object’s behavior.

Identity. In some cases, an object’s identity is taken to be its address or the address of its descriptor. In other cases, each object is given a handle (a.k.a. ID number or token), which is a short bit string (e.g., 32 or 64 bits) by which the system accesses that object. In some systems, handles are generated on a class-by-class basis. In other systems, they are unique over all classes. It would be helpful if each object got a fresh handle, but, unless handles were much longer than current integers, the handle space would eventually overflow and provision would have to be made for recycling handles that are no longer in use.
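A minimal sketch (not from the text) of such a handle allocator: fresh handles are issued from a counter, and handles released by deleted objects are kept on a free list for recycling so that the handle space does not overflow:

#include <cstdint>
#include <vector>

class HandleAllocator {
    std::uint32_t next = 0;             // next never-issued handle
    std::vector<std::uint32_t> freed;   // handles available for recycling
public:
    std::uint32_t allocate() {
        if (!freed.empty()) {           // prefer a recycled handle
            std::uint32_t h = freed.back();
            freed.pop_back();
            return h;
        }
        return next++;                  // otherwise issue a fresh one
    }
    void release(std::uint32_t h) { freed.push_back(h); }  // make h reusable later
};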

Behavior. In ordinary English, we describe interactions between objects via simple declarative sentences of the subject-verb-object form. If X does Y to Z, we simply say that “X Ys Z”. Interactions between objects on a computer system are usually described via some variant of the following terminologies, which are nearly equivalent:

Client X requests service Y from server Z.
Process X performs operation Y on object Z.
Process X invokes member function Y of object Z.
Subject X accesses via mode Y the resource Z.
Subject X uses in manner Y the resource Z.
Client X uses in manner Y the facility Z.
Sender X sends message Y to receiver Z.
Agent X performs action Y on object Z.
Agent X causes event Y at object Z.
Initiator X applies stimulus Y to responder Z.
Client X contacts port Y of server Z.

Also, “X exercises capability (Y,Z)” or “access right (X,Y) is exercised on Z.” To accomplish the above in Java or C++ requires an expression of the form “Z.Y()” in one of X’s service handlers.

17Both instance-specific data and classwide data, which corresponds to the C++ notion of static data members.


Note that the above list is by no means exhaustive. For instance, servers are sometimes called “service providers” or “action generators” and clients are sometimes called “service consumers” or “service recipients”.

Terminology. It must be emphasized that notions like “server” and “client” do not denote different kinds of objects but rather different roles relative to a particular interaction. Referring to an object as a “server” without reference to any particular transaction indicates that this particular object plays the server role in most of its significant transactions. But, for example, a proxy server receives service requests from a client, possibly modifies them, and submits them to the real server. So, a proxy server is a client of its associated real server.

We take a server-centric view of objects, focusing on their roles as servers, though we will sometimes call a service request an “event.” The routines by which an object responds to an event (i.e., service request) are called methods, handlers, or service routines. Here are some examples of how we view objects as servers:

“Process” is undefined at this point.

Kind of entity       services                        attributes
C++ class objects    member functions                member objects
files                read, write, seek, ioctl        data/state
program module       internal functions              internal variables
libraries            those of members                of members
process              signals                         static variables
RPC server           remotely callable funct’s       static variables
kernel               kernel invocations              static variables
networked computer   ports of port-mapped progs
memory segments      read, write                     fields
CPUs                 instructions                    register values

Usually a service request has multiple components, the first of which indicates the requested service18 and the rest of which are parameters.19 A server responds to a request for a given service by invoking the appropriate handler. Usually, all instances of a particular class share a common copy of the handler for any given service.

Mention that we will henceforth use “object” in the C/C++ sense.

Inheritance. To simplify the definition of classes and provide uniform treatment of similar services, object-oriented languages allow a class to be defined by specifying its differences from another class. The new class is said to be derived from the original (base) class and to inherit its attributes and services; declarations are needed only for the new attributes and for redefined and new handlers. A reference or pointer to an object of a derived class is acceptable wherever a reference or pointer to an object of the base class is expected, but

18Also called its “port,” “channel,” “kind,” “mailbox,” or “event.”

19Parameterless service requests are sometimes called signals; e.g., an interrupt is a signal to a kernel from a device.


the class of an object may not be known until run time and may be different each time a given service-request statement is executed (e.g., on each iteration of a loop). In such a case, binding to the handler of the requested service must be dynamic.

Revise this paragraph.
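A minimal sketch (with hypothetical classes, not from the text) of such dynamic binding in C++: a function that accepts a reference to the base class does not know, until each call executes, which derived class’s handler will be invoked:

#include <iostream>

struct Stream {                         // base class
    virtual void put(char c) = 0;       // service whose handler is bound dynamically
    virtual ~Stream() {}
};

struct ConsoleStream : Stream {         // derived class: redefines the put handler
    void put(char c) override { std::cout.put(c); }
};

struct NullStream : Stream {            // another derived class
    void put(char) override {}          // discards the character
};

void copyTo(Stream& s, const char* p) { // accepts any class derived from Stream
    while (*p) s.put(*p++);             // handler chosen by s's actual class at run time
}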

As a word of caution, it must be mentioned that there is an unfortunate tendency to refer to classes derived from a given class, say, Widget as Widgets, and to also refer to instances of those classes as Widgets. Two places where such ambiguous usage will be seen are device drivers (Chapter 11.2, page 173) and file systems (Section 10.2, page 159).

Discuss invariants and the contract metaphor.

1.4 History and Perspective

See http://www.serpentine.com/~bos/os-faq/FAQ-1.html for a lot of history and perspective on operating systems in compact form.

See http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2002/n1359.pdf for the performance aspects of the use of C++ in applications such as operating systems. (There has been a lot of discussion about the use of object-oriented languages in OS implementations.)

Add a subsection on User-Mode Linux, WINE, VirtualPC, VMware, chroot jails, and virtual machine monitors in conjunction with ABIs.


Chapter 2

Binding

Consider merging this chapter with the next one, which concerns processes, or with the one on tables.

There are many situations where it is necessary to attach a “value” to a name or a reference. Doing so is called binding that value to that entity, and the value is called the entity’s (current) value or its referent. Binding is often accomplished by installing the value as an entry in a table. When the referent is a server, it is represented by its identity (a.k.a. its ID-number, handle, or address). Looking up an entity’s value is called resolving that entity (or resolving its binding). An entity that has no binding is said to be unresolvable (or unbound).

2.1 Resolution and caching

Resolution often involves many stages of lookup, i.e., many levels of indirection. Consider the problem of resolving the name of a machine on the Internet. First, we must resolve the name down to an IP address via the Domain Name System (DNS). If our local DNS server cannot resolve the IP address from its cache, it asks neighboring DNS servers, which ask their neighbors, etc. But the IP address is merely a handle. Various routers create a connection to the target machine’s local router, which then resolves the IP address to the machine’s globally unique MAC (Ethernet) address, or its counterpart for some other LAN technology.

Caching (or precaching) the results of some of the lookup stages can facilitate resolving that binding in the future by bypassing some intermediate lookup operations. Such caching is said to tighten the binding. Note, however, that subsequent modifications of bypassed intermediate bindings may be missed, unless care is taken to propagate that change to the cached value. Such missed updates are a cache-coherence issue — see Subsection 9.4.1, page 140. Delayed caching and/or frequent recaching diminish the probability that a cache contains a stale value, which makes cache coherence less of a problem.
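As a small sketch (with an assumed multi-stage resolver, not from the text) of tightening a binding by caching: the first resolution of a name pays for the full lookup, later resolutions reuse the cached referent, and an entry must be invalidated (or eventually re-resolved) if the underlying binding changes:

#include <string>
#include <unordered_map>

std::string slowResolve(const std::string& name);  // assumed: the full multi-stage lookup

class ResolverCache {
    std::unordered_map<std::string, std::string> cache;
public:
    const std::string& resolve(const std::string& name) {
        auto it = cache.find(name);
        if (it == cache.end())                      // miss: do the full lookup once ...
            it = cache.emplace(name, slowResolve(name)).first;
        return it->second;                          // ... and reuse the cached value thereafter
    }
    void invalidate(const std::string& name) { cache.erase(name); }  // drop a possibly stale entry
};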

There are six standard “times” for caching bindings: compile time, link time, load time, pre-access run time, access time (i.e., on demand), and never.1 The

1Things that happen before loading are said to be static. Those that happen at load time or later are said to be dynamic.



first four of these are examples of anticipatory caching (a.k.a. precaching).

Caching constants as immediate operands in the operand fields of instructions is the tightest of all possible bindings and the most costly to modify. Variables introduce a level of indirection between instructions and values, allowing a given instruction to be dynamically bound to a different value each time it is executed. References and pointers introduce more levels of indirection.

When we execute a statement that requests a service from a server, we need to find the appropriate handler.2 Just as different record types can have data fields with the same name, servers from different classes can offer services of the same name via different service routines (i.e., handlers). The per-class overloading of names of services is a form of polymorphism and a major feature/benefit of the object-oriented approach to software development. We are particularly interested in binding occurrences of names to the attributes and service routines of various servers. For instance, a statement that reads from a file, read(InFile,...), might read from a different file each time it is executed. One time that file might be a disk file and the next it might be a file on tape. Each of those files will have a different handler for its read requests. So the name InFile must get bound to the appropriate file, and the name of the requested service, in this case read, must get bound to the correct handler.

Blend these paragraphs.

Which handler to invoke depends on the server’s class and the service being requested, so (conceptually) we need a two-dimensional array of handler addresses (vectors) indexed by pairs consisting of a class and a service. Since servers of most classes offer only a limited set of services, that array will be sparse — so we might store it in a hash table indexed by class/service pairs. Alternatively, we can store the array by row or by column:

• Each server or class might have its own service-indexed vector table.3

• Each service can be given a class-indexed vector table.4

Regardless of how it is implemented, such a vector table can be used, at any time, to bind occurrences of service requests to handler addresses. If the vector table is indexed by symbolic names of services, the bindings it holds remain unresolved until access time. If, on the other hand, the indices are the addresses of the service routines (i.e., if the vector table represents the identity function), then the vector table is superfluous, since the names have been fully resolved as of access time.
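The two storage layouts can be sketched as follows (hypothetical names, not from the text): the dense per-class table is just an array of handler addresses indexed by service, while the sparse alternative is a hash table keyed by class/service pairs:

#include <cstddef>
#include <unordered_map>
#include <utility>

typedef void (*Handler)();                     // a generic handler address ("vector")

enum Service { READ = 0, WRITE = 1, SEEK = 2, NUM_SERVICES = 3 };

struct ClassVectorTable {                      // row storage: one table per class
    Handler handlers[NUM_SERVICES];            // indexed by service
};

struct PairHash {                              // sparse storage: keyed by (class, service)
    std::size_t operator()(const std::pair<int,int>& p) const {
        return std::hash<int>()(p.first) * 31u + std::hash<int>()(p.second);
    }
};
std::unordered_map<std::pair<int,int>, Handler, PairHash> sparseTable;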

Peter Deutsch [Citation?] introduced a trick to avoid the cost of a full lookup for a given occurrence of a service request by always invoking the same handler as the last time this occurrence of the request was executed. The invoked handler

2Chapter 11.2 on page 173 discusses dynamically finding servers themselves.

3In most C++ implementations, an index refers to the same service for all classes derived from a given class having that service but may refer to unrelated and/or undefined services when neither class is derived from the other.

4Often that array gets buried in the implementation of a class-driven switch statement that invokes (or contains) the proper handler for each class.


then checks that the server is of the correct class. If not, the address of the proper handler is looked up and cached as the address operand of the invoking instruction, which is then re-executed.5 The efficiency of this caching mechanism depends on the fact that the cost of the correctness check is negligible. As described so far, the mechanism requires self-modification of code, a technique that has been deprecated for several decades and that is prohibited by many architectures. To avoid code modification, one can, in some architectures, add to each handler invocation a level of indirection through data space, i.e., use a bounce table.
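A minimal sketch (hypothetical types, not from the text) of this trick using a per-call-site cache slot in data space rather than self-modifying code: the cheap check compares the server’s class against the class seen last time, and only on a mismatch is the full lookup repeated and re-cached:

struct Server { int classId; /* ... */ };

typedef void (*Handler)(Server&);

enum { NUM_CLASSES = 16, NUM_KINDS = 8 };
static Handler handlerTable[NUM_CLASSES][NUM_KINDS];      // assumed: filled in elsewhere

Handler fullLookup(int classId, int service) {            // the slow path
    return handlerTable[classId][service];
}

struct CallSiteCache {              // one per occurrence of a service-request statement
    int     cachedClassId;          // class of the server seen on the previous execution
    Handler cachedHandler;          // handler that was invoked last time
};

void invoke(Server& s, int service, CallSiteCache& c) {
    if (s.classId != c.cachedClassId) {                    // cheap correctness check
        c.cachedHandler = fullLookup(s.classId, service);  // miss: re-resolve ...
        c.cachedClassId = s.classId;                       // ... and re-cache
    }
    c.cachedHandler(s);             // fast path: call through the cached vector
}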

2.2 Compiling, Linking, and Loading

Mention John Levine’s book.

Programs and modules. A program consists of possibly hundreds of modules, some of which may be shared with other programs. A module defines a set of variables and functions, called its internal variables and functions.

In C/C++, each module consists of a source file (i.e., .c- or .cc-file in Unix) and a header file (.h-file in Unix). The module’s header file contains descriptions (declarations) of all variables and functions that the module exports to other modules.6 That header file must be included (i.e., compile-time expanded) into the source files of all modules that access those exported variables and functions.7 Each module’s source file should include its own header file to serve as a consistency check, if for no other reason.

A pointer or reference that is internal to a given module can point/refer to a variable or function that is defined in another module. Thus, a given module’s internal functions may invoke functions and read or write variables that are defined in (i.e., internal to) other modules of the program. Such pointers and references to variables and functions internal to other modules are the given module’s external references. A given module’s source file must include the header file of each module that defines one or more of the given module’s external referents.

A module may be viewed as defining a class of servers whose attributes are the variables that the module defines and whose service routines are the functions that the module defines. Each execution of a program that contains that module constructs its own instance of that class. A module may, however,

5This mechanism is an example of the cache-consistency strategy called “detected write-invalidation.”

6The header file must also contain declarations of the types and templates needed to access the exported variables and functions.

7To avoid the need to export details of a complex struct/class type, a module can provide functions for dealing with objects of that type and export only the corresponding pointer type. Keeping such details out of the header file provides a “fire-wall” against changes to those details that would otherwise force recompilation of the module’s clients. C++ programmers achieve the same effect using Abstract Base Classes (ABCs) and/or classes consisting of a pointer to an object of the complex class and stub functions that invoke its member functions. The latter trick is called the Cheshire-cat idiom (nothing’s left but the pointer) or the pointer-to-implementation (pimpl) idiom.


define a class that has an unlimited number of instances within any given program. For example, the module’s header file can define struct types whose instances provide services.
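A minimal sketch (hypothetical module, not from the text) of the header/source convention, with the three files shown together:

// counter.h -- describes what the module exports to other modules
#ifndef COUNTER_H
#define COUNTER_H
extern int counter_value;     // an exported internal variable
void counter_increment();     // an exported internal function
#endif

// counter.cc -- the module's source file; includes its own header as a consistency check
#include "counter.h"
int counter_value = 0;
void counter_increment() { ++counter_value; }

// client.cc -- another module; counter_increment is one of its external references
#include "counter.h"
void useCounter() { counter_increment(); }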

2.2.1 Compiling

A compiler takes as input a module’s source file and produces that module’s object file,8 which includes:

• an initialized-data segment containing the initial values of the module’s internal variables,

• a number indicating the size of the data segment that is needed to contain the module’s internal variables, i.e., the size of the module’s attribute record,9

• a code segment containing the machine code for the functions defined in the module, including code to initialize and finalize the module’s attribute record, e.g., to construct and destruct globals,

• an external-reference table, which lists the variables and functions that this module imports from modules throughout the program,

• a table of contents, which lists the module-relative address (i.e., offset) of each of the module’s internal functions and global variables.10

Once we know the names of the object files for the modules of a program and the run-time offsets for their data and code segments within the program’s overall data and code segments, we can then determine the run-time address for any given function or global variable. That calculation can be performed as early as static linking, delayed until the item is accessed, or performed at any time in between.

2.2.2 Loading

A special kernel routine, known as a loader, creates an execution image from an executable object file (a.k.a. “a program file” or “load module” or simply “an executable”).11 There are several industry-standard formats, e.g., COFF, ELF, A.OUT, etc., for executable files.

In most operating systems, all executable images produced from a given executable share a single copy of the executable’s code segment. So, unless one already exists, the loader first creates a copy of the executable’s code segment

8In Unix, object files are called “.o-files”, “object modules”, or “relocatable binary modules”.

9Alternatively, we could keep track of the size of the uninitialized-data segment.

10During compilation, offsets are assigned to the module’s functions and global variables within their respective memory segments.

11In Unix, the linker is a program called ld and the loader is the handler for the exec system call. For a discussion of loading under Unix, see Section 3.2 on page 30.


in virtual memory. From information in the executable, the loader then configures and initializes a data segment within the execution image. This segment is private to the execution image, i.e., it is not shared with other processes that are running this executable. Finally, the loader transfers control to the executable’s starting point, which finishes initializing global data, e.g., by invoking constructors for global objects, and transfers control to the first function of the program, which is called main in C/C++.

Unless the hardware architecture provides relocation facilities or the compiler emits position-independent code, which incurs a slight performance penalty by using only self-relative pointers and jumps/calls,12 the loader will need to adjust addresses in the code and/or data segments to those of the actual machine location where the code and data are finally placed.

2.2.3 Static Linking

A special program, called a linker,13 combines object files and, in so far as possible, resolves and caches their external references via their combined table of contents. To link a given list of object files, we can begin with an empty object file and, one at a time, incorporate the object files on that list. To incorporate an object file, the linker incorporates its variables and functions, one at a time:

• To incorporate an initialized variable, we append it plus any necessary alignment padding to the end of the new initialized-data segment.

• To incorporate an uninitialized variable, we virtually append it plus any necessary padding to the end of the new (virtual) uninitialized-data segment by incrementing the segment’s size counter by the appropriate amount.

• To incorporate a function, we append its machine code to the end of the new code segment. Each of the function’s external references gets added to the combined external-reference table.

In every case, the new item, with offset properly updated, gets added to the table of contents. Finally, we resolve and cache as many external references as possible.14

Of course, any collision of names in the table of contents constitutes a multiple-definition conflict and aborts the linkage process unless there is some precedence among those definitions. It’s obvious that some of these operations could be combined, e.g., we could add all uninitialized data items at once by simply incrementing the size counter by the size of the (uninitialized-) data segment of the module being included.
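A minimal sketch (simplified segments and code-only offsets, not from the text) of that incorporation loop:

#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

struct Module {
    std::vector<unsigned char> data;          // initialized-data segment
    std::size_t bssSize = 0;                  // size of the uninitialized-data segment
    std::vector<unsigned char> code;          // code segment
    std::map<std::string, std::size_t> toc;   // table of contents: name -> offset in this module
};

Module link(const std::vector<Module>& modules) {
    Module out;
    for (const Module& m : modules) {
        std::size_t codeBase = out.code.size();                       // where m's code will land
        out.data.insert(out.data.end(), m.data.begin(), m.data.end());
        out.bssSize += m.bssSize;                                     // "virtually" append
        out.code.insert(out.code.end(), m.code.begin(), m.code.end());
        for (const auto& e : m.toc) {                                 // merge tables of contents
            if (out.toc.count(e.first))                               // name collision
                throw std::runtime_error("multiple definition: " + e.first);
            out.toc[e.first] = e.second + codeBase;                   // adjust the offset
        }
    }
    // Final step (omitted here): resolve and cache external references via out.toc.
    return out;
}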

12Self-relative jumps increment/decrement the program counter by a specified amount. Similarly, self-relative pointers contain the distance from their own location to that of the thing pointed to.

13In Unix, the linker is called ld.

14Resolving and caching an external reference involves appropriately adjusting the operand fields of any of the function’s machine-code instructions and initialized data that directly involve that reference. Or else we can use bounce tables and indirect jumps and indirect calls to avoid the need to back patch all jump and call instructions.


Libraries. A library is a file that contains a collection of object files, plus a combined table of their contents — no name collisions are allowed. When a library is specified to the linker, its relevant modules are automatically included in the generated executable without their names being mentioned explicitly.

We could, of course, statically link all of the modules in a library into a single module, but, by linking against the library instead of that combined module, the generated executable’s code segment possibly avoids including a lot of unnecessary code.

2.2.4 Dynamic Linking/Loading of Unshared Modules

To what extent can linking be postponed until run time? If an unresolvable reference gets used at run time, the resulting behavior is usually undefined. At any time prior to that use, however, that reference can be resolved and cached by dynamically loading (and linking) the referent’s module.15 Dynamically loading an object file into an execution image could generate an image equivalent to the image that would have existed had that object file been statically linked into the executable that was originally loaded — it makes no significant difference at what time an object file gets linked in, so long as it gets linked in before any attempt to access any of its functions or variables. But we can easily trap any attempt to access via an uncached reference and configure the corresponding trap handler(s) to invoke the appropriate loader.

The data segment for a dynamically loaded module gets dynamically allocated in the program’s data area, and a private copy of that module’s code segment gets copied into the appropriate portion of the execution image — just as in the case of statically linked modules, the code segments of unshared modules can be adjusted to take into account the locations (in the execution image) of the variables and functions that they access.

A process can dynamically unload a specified module from its execution image, provided that there are no subsequent attempts to access any of its functions and/or variables. To unload a module, simply free the local copies of its code and data segments, i.e., make that space available for other uses.16

Also, reinstall any data structures that are required for backpatching the code that refers to the removed module’s functions and variables.
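On POSIX systems, this style of dynamic loading, resolution, and unloading is available to ordinary programs through the dlopen family of calls. The fragment below is a minimal sketch; the library name libwidget.so and the symbol frobnicate are made up.

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        /* Dynamically load (and link) a module; RTLD_NOW resolves all of its
         * external references immediately, RTLD_LAZY would defer them until use. */
        void *handle = dlopen("libwidget.so", RTLD_NOW);
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        /* Resolve a name in the loaded module to an address and cache it. */
        int (*frobnicate)(int) = (int (*)(int)) dlsym(handle, "frobnicate");
        if (!frobnicate) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        printf("%d\n", frobnicate(42));

        /* Unload the module; any later call through frobnicate is undefined. */
        dlclose(handle);
        return 0;
    }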

Static vs. dynamic linking. Static linking avoids the overhead of resolving and caching external references each time an object file is loaded, thereby making the system more responsive. Also, static linking uses address space more efficiently. Like any other cached copies of entities, statically linked copies of object files and/or library functions become obsolete when a new version of the corresponding module or library becomes current. That makes the executable file obsolete until those (copies of) obsolete object or library modules are replaced by relinking the program with the updated object file or library.

15The term “dynamic loading” implies linking as well.

16Of course, doing so can leave gaps of unused address space in the process’s code and data segments.


By contrast, each process that dynamically loads a given module or library will get the current version.17 Dynamic loading has the added benefit that it saves the disk space that would be required to store multiple copies of object files inside executables.

The above advantages of dynamic linking can be achieved simply by moving the linking step to load time. But, for example, many operating systems use dynamically loadable kernel modules, especially for device drivers — see [RU] for details on loadable kernel modules under Linux. Such device drivers are dynamically loaded and unloaded by user command while the kernel is running, which requires techniques that go beyond simply moving the static linking to load time.

2.2.5 Dynamically modifiable bindings

Mention stale-handle detection.

Most implementations of dynamically modifiable binding use a per-server (or per-class) vector table indexed via the handles of services:

• In the client, the names of services are statically bound to their respective handles, i.e., ID numbers. This binding can be accomplished at compile time by including a header file that assigns handles to names, or it can be deferred to link time by linking to a proxy library.18 For each service, the proxy has a stub that knows the handle and calling sequence of that service. Say, for example, that service f takes an integer parameter, returns a float, and has vector-table index (handle) 75. Then the proxy’s stub for f might be defined as follows:

float f( int i ) {
    return ( (float (*)(int)) server.vectorTable.lookup(INDEX_f) )( i );
}

where INDEX_f is defined to be 75 in a header file used to compile the stub.19

17Unfortunately, this feature often produces bugs by linking to the wrong version of a library.

18A proxy is an object that serves as an intermediary between a client and a server. The proxy is essentially a stage of indirection and a mechanism for delaying binding and making it looser. A proxy is a server to the client and a client to the server. It accepts service requests from the client and forwards them to the server and reports back any return value to the client. Often, the main function of a proxy is to modify formats or binding arrangements. It may consist of “stub” handlers with similar names and calling sequences to those of the server — sometimes the calling sequence is slightly different or the names have been resolved to handles or addresses.

19This C/C++ scheme works for the case where different services have different calling sequences, so vectors need to be void* pointers. Or better yet, per the 1989/90 C standard:

Analogous to ’void*’, the generic data pointer type, define type ’void (*)(void)’ to be the preferred ”generic” function pointer type, and that any function pointer type can be safely cast to and from this type with no loss of information.

This explicitly encourages implementors to take the right approach to dealing with function pointers. [C89/90 6.2.5#26, 6.3.2.3#8]


• In the server, the entries of a handle-indexed vector table are statically bound to their respective handlers.

At access time, either directly or via the proxy, the client submits the service’s handle to the server’s service-indexed vector table, which returns the address of the server’s handler for the requested service.20 The client then, directly or via the proxy, submits the request to the server, which returns the result.21 To dynamically change handlers for a given service, the server simply adjusts the corresponding entry in the vector table to point to the new handler.

Refer back to “vectored transfer of control” on page 5.
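A minimal sketch of such a handle-indexed vector table in C follows. The names (Server, INDEX_f, handler_f_v1, handler_f_v2) are hypothetical; handlers are stored as generic function pointers and cast back to their real type at the call site, just as the stub for f above does.

    #include <stdio.h>

    typedef void (*generic_fn)(void);

    enum { INDEX_f = 75, TABLE_SIZE = 128 };

    typedef struct { generic_fn vectorTable[TABLE_SIZE]; } Server;

    static float handler_f_v1(int i) { return i * 1.0f; }
    static float handler_f_v2(int i) { return i * 2.0f; }

    int main(void)
    {
        Server server = { { 0 } };

        /* Bind the handle to a handler ... */
        server.vectorTable[INDEX_f] = (generic_fn) handler_f_v1;

        /* ... access the service via its handle, as the stub above does ... */
        float r1 = ((float (*)(int)) server.vectorTable[INDEX_f])(21);

        /* ... and dynamically re-bind the handle to a new handler. */
        server.vectorTable[INDEX_f] = (generic_fn) handler_f_v2;
        float r2 = ((float (*)(int)) server.vectorTable[INDEX_f])(21);

        printf("%f %f\n", r1, r2);
        return 0;
    }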

The advantages of such dynamically modifiable binding of service requests to their handlers are that, unlike the handler’s address, the service’s handle can:

• be selected before the service’s handler exists22 and remain stable even through dynamic replacement of handlers23 for a given service, i.e., dynamic re-binding of handles to handlers,24

• be the same over multiple classes of servers, thus facilitating dynamic polymorphism,25

• be statically bound to a server-trusted routine, thereby preventing a client from directing execution to bogus instructions having unpredictable results, an important consideration for protection.

However, such indirection precludes the in-line expansion of handlers, an optimization that is useful when the handler is small.

Mention the following possibilities for implementing binding: stubs (proxies), opening (dynamic), headers (static), traps (dynamic).

Alternatively, each service could be variadic or each could have a single parameter that points to a record containing the real parameters. Either of those alternatives allows vectors to be of a uniform pointer type and eliminates the need for a service-specific cast in each stub.

20Being an unshared module, the proxy can be directly linked to the vector table. Alternatively, it can submit its request to a proxy that knows the location of the vector table. In the case of system calls, that proxy is the CPU itself.

21What if, in the interim, the server dynamically replaces the corresponding handler? For instance, does it make a difference whether a particular signal occurrence uses the old handler or the new one? What if at some time during the update the handler vector points to neither the old handler nor the new one?

22If the vector table or a reference to it is always stored at a fixed offset (say zero) in the server’s attribute record, then, even if the server’s data attributes change, clients can request services without any need to update their static name-to-handle bindings.

23Dynamically installed handlers are sometimes called “call-backs.” Often we speak of registering a call-back rather than installing it.

24Normally, all instances of a given class use the same handlers for a given service — vector tables, therefore, are associated with classes rather than servers. But a server that changes its handlers dynamically must have its own vector table. We might view such servers as one-of-a-kind (i.e., singleton) classes, with each such server being a class unto itself. Of course, every handler is free to consult a vector table and simply invoke one of the functions that has been installed there.

25Suppose each execution of a given service-request statement, say in a loop that iterates through a list of servers, requests the same service from servers of different classes with distinct handlers for the requested service. Such dynamic re-binding of this service-request statement to different servers requires dynamic polymorphism, i.e., dynamic (access-time) resolution of the service’s handle to the address of each server’s handler for that service.


2.2.6 Shared Modules

Sharing handlers among local class instances. In most C++ implementations, all instances of a given class (servers) share a single copy of the class’s member functions (handlers). Obviously, we can’t link a handler tightly to a particular instance of that class by modifying addresses within that shared code. So we add a level of indirection by passing to the handler the address of the instance’s attribute record as a special parameter (called this in C++), so that the handler can access those attributes while handling service requests on behalf of that instance.

The above repeats footnote 18 to some extent.

Dynamically linking/loading shared-code modules. Through memory-mapping mechanisms, a module’s code segment can be shared by multiple processes. But each process can store its instance of the module’s attribute record (internal objects) wherever it pleases, and obviously such a shared code segment cannot be modified for any one of its sharers. So we:

• Use position-independent code, since this code can appear at distinct locations in the code segments of distinct execution images.

• Dynamically allocate a copy of the module’s attribute record, which holds the module’s internal objects.

• Statically or dynamically link to a non-shared proxy module whose stubs pass the address of that attribute record as an additional parameter to their shared namesakes.26

• Incorporate the module’s external-reference table as an attribute of the module, so that internal functions can access external objects and functions.

External functions get linked, statically or dynamically, to the module’s internal objects and functions. Similarly, the entries in the module’s external-reference table get bound to their referents. The module’s internal functions may be statically bound to each other. They access internal objects via indexed indirection. They access external objects and functions via indexed indirection followed by another level of indirection.27
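The following fragment sketches that arrangement with made-up names: the shared, position-independent code receives the address of the per-process attribute record as an extra parameter, and a non-shared proxy stub supplies it.

    typedef struct {
        int    counter;      /* an internal object of the module            */
        void **ext_refs;     /* the module's external-reference table       */
    } WidgetAttrs;           /* one instance of this record per process     */

    /* --- shared code (one copy, mapped into every sharer) --------------- */
    int widget_bump(WidgetAttrs *self, int delta)
    {
        self->counter += delta;              /* internal access via 'self'  */
        int *ext_limit = self->ext_refs[0];  /* external access via table   */
        return self->counter < *ext_limit;
    }

    /* --- non-shared proxy (one copy per process) ------------------------- */
    static WidgetAttrs *attrs;   /* set up when the module is loaded        */

    int bump(int delta)          /* same name and signature the client expects */
    {
        return widget_bump(attrs, delta);
    }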

Dynamically linking/loading shared-data modules. It is possible for all processes sharing a module’s code segment to also share any or all of the module’s attributes — simply place them in a supplementary shared-attribute record that gets mapped via memory-mapping tricks into each process’s data segment. We must, however, add a non-shared attribute that points to that record of shared attributes.

26Because the proxy is non-shared, it can be tightly bound to its own attributes, which can include the address of the shared module’s attribute record.

27Remember: every problem in software design can be solved by adding another level of indirection.


2.2.7 Binding to Remote Services

To request a service from a server whose address space is not directly accessible from the client, say on a different machine, a client submits the request to a local proxy-server, which sends a message (say via some low-level message-passing protocol) containing the request’s parameters together with the name or handle of the requested service to a remote proxy-client that can submit requests within the remote address space. That proxy-client receives the message, reconstructs the original service request, and submits it to the intended server, which handles the request and returns the result, which the proxy-client’s handler returns to the proxy-server, which returns the result to the client, as though the service had been performed locally.28 The intent is to make the client think it is being served directly by the remote server. Note, however, that a remotely executed handler usually can have no client-local side effects, i.e., no access to the client’s address space.

Give reference to remote name resolution in the section on name management.
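A rough sketch of such a proxy pair for a single remote service follows. All names here are hypothetical, and the "wire" is simulated with a pair of global records so the sketch runs in one process; in reality the request and reply would travel over some low-level message-passing protocol, and serve_one_request() would run in the remote address space.

    #include <stdio.h>

    enum { HANDLE_f = 75 };

    typedef struct { int handle; int arg; } Request;
    typedef struct { float result; }        Reply;

    static Request wire_req;     /* simulated transport */
    static Reply   wire_rep;

    /* --- the real handler, known only in the server's address space ----- */
    static float real_f(int i) { return i * 2.0f; }

    /* --- proxy-client loop, linked into the server ----------------------- */
    static void serve_one_request(void)
    {
        switch (wire_req.handle) {               /* resolve handle to handler */
        case HANDLE_f: wire_rep.result = real_f(wire_req.arg); break;
        }
    }

    /* --- proxy-server stub, linked into the client ----------------------- */
    float f(int i)
    {
        Request req = { HANDLE_f, i };
        wire_req = req;          /* marshal and "send" the request           */
        serve_one_request();     /* stands in for the remote round trip      */
        return wire_rep.result;  /* unmarshal the reply; looks like a local call */
    }

    int main(void)
    {
        printf("%f\n", f(21));
        return 0;
    }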

There are generally four reasons for requesting a service remotely:

1. Concurrency: The handler must execute in parallel with the client.29

2. Remoteness: The handler must execute in another address space, possibly on another processor with a separate memory system.

3. Protection: The handler must execute in another protection domain or at another urgency level, or the client can’t adequately insure against, say, stack overflow.

4. Security: The client can’t adequately ensure that garbage left behind on its own stack(s) will not compromise security, e.g., passwords.

Normally, the client is statically linked to the proxy-server, and the proxy-client is statically linked to the server. In such a case, the proxy-server is a module or library of “stub” handler functions that encapsulate service requests into messages and send them to the designated remote server. By preprocessing the server’s header file, we can create source files for the proxy client and proxy server with the same interfaces as the client and the server, respectively. Of course, all four modules might include the server’s header file.

Note also that remote binding necessarily involves access-time resolution, which yields the previously mentioned benefits in terms of possible dynamic re-binding of handles to handlers, dynamic re-binding of service-request statements to servers, and protection.

Expand the paragraph below.

In some cases, the proxy-client can accept service-request messages from multiple proxy-servers, thereby making the server sharable, even if it was not designed for shared operation. The attributes of a remotely linked module

28As a form of data compression, the proxies may represent the name of the requested service via a handle, even if a handle is not already being used by the client and server.

29We may think of dynamic installation of an event (service-request) handler as a come-from instruction, in the sense that, when that event (service request) occurs, execution will come from whatever code is executing to that handler.


are shared by all of its clients, whereas the clients of directly linked modules (whether static or dynamic), even shared code modules, usually get their own copy of the module’s attributes. In C++, attributes that are shared by all instances of a class are said to be “static,” but it is more accurate to view a remote server simply as a shared instance of a class.

Remote procedure calls. Often requests to remote servers are called “remote procedure calls (RPCs).” A client process invokes a procedure to be executed by a server process, possibly within a different address space. The client and the server are statically linked to proxies. The proxy-server is called the “RPC-client” and the proxy-client is called the “RPC-server.” This twist in terminology reflects the client/server relationship between the proxies, wherein the proxy-client is server to the proxy-server.

Look this up and see whether the target’s header is included and how. Also, how does the client get the index of the target? Who gives those target routines their indices?

Remote daemons that support socket access. To be filled in.

Kernels. A kernel is a server that, for reasons of protection, resides in an address space inaccessible to its clients (both devices and user-mode programs) and must, therefore, be remotely linked to them. That address space, however, is accessible to the clients’ CPUs, and, since the kernel controls access to all resources (particularly memory), the client’s address space is accessible to the kernel’s handlers, which therefore can have client-local side effects. In other words, a kernel is remote from its clients, but its clients are not remote from the kernel.

Each kernel-invoking event has an index (a.k.a. handle) to distinguish it. Interrupts and traps are automatically resolved to their handles by hardware. System calls are resolved and cached to their handles by statically linking to proxies that know their corresponding handles. A proxy stub simply accepts arguments passed by the user-mode (i.e., application) program, invokes the corresponding system call on those arguments, and returns the result to the client (user-mode program). More specifically, proxies put their arguments into places expected by the kernel’s handler for the corresponding system call and issue the system-call instruction with the service request’s handle as argument, i.e., with the vector-table index of the corresponding kernel routine.

The numbering and semantics of a kernel’s system calls are referred to as its Application Binary Interface (ABI). A given executable file will interact correctly with a given kernel so long as the executable and the kernel agree on the ABI protocol. The names, semantics, and calling sequences (signatures plus return types) of the stub library plus other functions that facilitate its use are referred to as the operating system’s Application Program Interface (API). The API is to the source program what the ABI is to the program’s executable. Essentially, the purpose of the stubs is to translate between the API and the ABI.
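On Linux, for instance, the effect of an API-level stub can be approximated with the C library’s generic syscall() wrapper, which issues the system-call instruction with the given ABI-level call number. This is only a sketch of the idea, not the actual library stub.

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>

    /* A rough stand-in for the API-level stub: it places its arguments where
     * the kernel's handler expects them and issues the system-call instruction
     * with the call's handle (here SYS_write, the ABI-level call number). */
    ssize_t my_write(int fd, const void *buf, size_t count)
    {
        return syscall(SYS_write, fd, buf, count);
    }

    int main(void)
    {
        const char *msg = "hello via the ABI\n";
        my_write(1, msg, strlen(msg));    /* 1 = standard output */
        return 0;
    }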

Some architectures have a single kernel-invocation handler, which behaves exactly like a proxy-client. It consists of a switch statement that is driven by the event’s index. Other architectures handle the vectoring of kernel invocations completely in hardware. Usually, all devices of a given class share a single copy of their common handler, which uses the invoking event’s index to determine which device of that class it is handling. From that index, the handler finds the appropriate buffers, I/O ports, etc.

Operating systems. In addition to its kernel, a remote server that may have dynamically loaded components (e.g., device drivers), an operating system also consists of libraries, each of which is called an API. Since it is inappropriate to require that application programs be relinked whenever the operating system is updated, many OS-provided libraries are dynamically linked/loaded. Often, they share code as well.

The OS’s directly linked modules (including the kernel’s proxy stubs), whether statically or dynamically linked, constitute an implementation of the system’s API. As mentioned earlier, the IEEE POSIX standard specifies the semantics of the POSIX API and makes no mention of its ABI. It is not particularly relevant to client programs which functions are part of the kernel and/or what their handles are. Such matters are implementation details and should be left to the discretion of the OS’s implementer — they do not belong in the standard.

2.3 Binding in Other Classes of Servers

Signals. An occurrence of a kernel-invoking event arrives first at a CPU. If the event is currently blocked in that CPU, the CPU’s pending bit for that event gets set. As soon as a pending event is re-enabled (e.g., by the resetting of its blocking flag or lowering the blocking threshold below that event’s priority), it gets serviced, i.e., its handler gets invoked.30 No matter how many times an event occurs, a given CPU can have pending at most one occurrence of each event that it is currently blocking.

Some language implementations allow a running program to dynamically install its own handlers for certain parameterless services, called “signals.” These handlers run within the program’s address space. When such an event occurs, the kernel’s handler for that event delegates (a portion of) the handling of the event to that event’s signal handler in a user-mode program:

• At compile time, with the aid of a header file, signal names are resolved to their handles (i.e., id numbers), which get cached as constants in the object code.

• These handles are resolved via a vector table to handler addresses at access time, which allows programs to install their own handlers at run time, as sketched below.
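Under POSIX, for example, a program installs its own handler by passing the signal’s handle (e.g., SIGINT) and the handler’s address to sigaction(). A minimal sketch:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    /* The user-mode handler to which the kernel delegates the event. */
    static void on_interrupt(int signo)
    {
        /* Only async-signal-safe calls are allowed here; write() is one. */
        const char msg[] = "caught SIGINT\n";
        write(1, msg, sizeof msg - 1);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_interrupt;     /* re-bind the handle to our handler */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);     /* SIGINT is the signal's handle     */

        pause();                          /* wait; Ctrl-C invokes the handler  */
        return 0;
    }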

Normally processes are clients of kernels. However, when a kernel passes a signal to a process, it treats that process as a server that provides certain parameterless services.

30Of course, many event occurrences may compete for service, but the invocation of the handler for any of them immediately blocks the rest and keeps them pending.


In fact, one can view all direct kernel access to user space (e.g., access resulting when a program calls read or write) as service for the kernel by the program.

Persistent dynamically linkable servers. Some operating systems, especially Unix, attempt to treat nearly all system-managed servers as files.31 To bind a name to a file and acquire certain modes of access to it, a client process invokes the open system call, passing the name of the file and the requested access modes as parameters. The handler for open checks whether the “owner” of the requesting client has permission to access that file in those modes. If that check succeeds, the kernel caches the corresponding bindings (a.k.a. capabilities) in the client’s open-file table.

Make a reference to the corresponding part of the object management section.

To accommodate servers that are not really files (i.e., not streams of data), Unix files have a service, ioctl(), that allows services other than read, write, execute and seek to be requested from servers posing as files. According to the Unix manual, ioctl(int fd, int request, caddr_t arg):

... performs a special function on the object referred to by the open descriptor fd. The set of functions that may be performed depends on the object that fd refers to. For example, many operating characteristics of character special files (for instance, terminals) may be controlled with ioctl() requests. The writeups in section 4 discuss how ioctl() applies to various objects.

The request codes for particular functions are specified in include files specific to objects or to families of objects; the writeups in section 4 indicate which include files specify which requests.

For most ioctl() functions, arg is a pointer to data to be used by the function or to be filled in by the function. Other functions may ignore arg or may treat it directly as a data item; they may, for example, be passed an int value.

Invoking ioctl() requires read or write access. The ioctl handler itself decides on an invocation-by-invocation basis whether or not to service any given request.
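For example, asking a terminal for its window size is a request that read, write, and seek cannot express; on most Unix-like systems it is made with the TIOCGWINSZ request code. A minimal sketch:

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        /* TIOCGWINSZ is the request code; arg points to a record that the
         * ioctl handler fills in on our behalf. */
        struct winsize ws;
        if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == 0)
            printf("%d rows, %d columns\n", ws.ws_row, ws.ws_col);
        else
            perror("ioctl");
        return 0;
    }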

Object-request brokers. To be filled in.

Networked computers (with ports as services). To be filled in. (Talk about conventional port assignments and the workings of the port mapper.)

31There is an old saying that “When your only tool is a hammer, every problem looks like a nail.”


Chapter 3

MULTIPROGRAMMING

Review the chapter on process control in Nemeth’s book.

An operating system that can run several programs at a time is said to support multiprogramming. The benefits of multiprogramming result from the overlapped use of various resources, including the users’ time. For example, one program can be doing I/O while another is computing. The result is higher throughput, i.e., a given set of processes requires less total time to finish than if they were executed sequentially, i.e., one after the other. Suppose that two users edit different files at the same time using the same editor, say emacs. There is only one program running, namely emacs, but there are two distinct simultaneous executions of emacs, i.e., two different processes each running emacs.

In most systems one program can chain to another via a system call that invokes the loader on a specified executable file. For instance, if widget is an executable, the system call

execute(widget);

causes (the program in) widget to be loaded and start running in place of the current program, within the current protection domain and with the current dynamic bindings (i.e., with the same descriptor, the same auxiliary data segments, and the same open-file table). After this call to execute, execution of the invoking program stops, never to resume on this process unless some subsequent program running on this process executes it. The execution of the new program is a new phase of the current process. A process is a sequence of program executions each of which (except the first) results from its predecessor via an invocation of execute. Other terms for “process” are job, task and heavy-weight process. It is sometimes useful to think of a process as a virtual computer. The streams of activity to which CPUs are allocated are called threads. Every process has one or more threads. The activity of a process is exactly the sum of the activities of its threads. It is useful to think of threads as virtual CPUs.

Processes are the fundamental units of dynamic binding to system resources (other than CPUs), as manifested by the process’s tables of open files and data segments. Processes are the fundamental holders of capabilities, i.e., rights to



access various system-managed objects in various ways.1 Inside of a process, there may be multiple threads.

3.1 Process Anatomy

At every point in its lifetime, a process is running a particular executable (i.e., program file),2 which consists of the machine-language instructions for a collection of procedures3 together with initial values for some global variables. In C/C++, execution begins with the main procedure, which calls other procedures, which in turn call others, etc. In addition to the code for the program that it is running, the process’s environment (a.k.a. its execution image) at a given moment consists of the state of its memory segment(s), the status of its open files, its register settings, etc.

Code segments (a.k.a. text segments). Every process has a code segment containing the machine-language instructions for a set of procedures that includes those that the process’s threads are currently running. In most modern architectures, these are read-only segments that are shared with other processes running the same program. In some systems, a process can share several code segments at a time.

Data segments. Each process has its own read/write data segment, which contains its dynamic free-space area (heap), space for global variables, I/O buffers, and the transient portion of the process’s descriptor. A process may have multiple data segments, and they may be shared with other processes. In Unix a process’s main data segment is laid out as follows (starting from lowest addresses):

• initialized data (read from the program file by exec).

• Uninitialized data, a.k.a. bss (initialized to zero by exec).

• heap.

• free area for heap and/or stack growth.

• argv[], a string array containing the nullstring-terminated list of command-line arguments passed to exec — accessed by consulting the global variables argc and argv[].

• envp[], a string array containing the nullstring-terminated list of environment-variable values passed to exec — accessed by invoking getenv(variable), which consults the global variable environ.

So, what’s different about arg[]?

1Such rights are called capabilities.

2This fact leads some authors to define a process to be “a program in execution.”

3Procedures are called “functions” in C — also in Pascal, when they return a value. They are called “subroutines,” “subprograms,” or “methods” in various other languages.


Stacks. A particular invocation of a procedure is sometimes referred to as an activation or instantiation of that procedure. Each time a procedure gets invoked, an activation record for that invocation is pushed onto a control stack within the process’s data space. The space that this activation record occupies on that stack is called its stack frame. An activation record contains fields that hold the information necessary for the execution of and return from this particular procedure invocation: local variables, saved values of registers, passed parameters, return values, etc. Of course, when recursion takes place, the stack will have multiple activation records for the same procedure. Each procedure invocation eventually returns according to a last-invoked-first-returned (LIFO) protocol. Activation records are stored on a downward-growing stack, because that data structure best accommodates their LIFO protocol.

Each thread within a given process has its own stack of activation records for its procedure invocations that have not yet returned. These stacks usually reside in the process’s main data segment, perhaps within the heap, but on some systems each stack is given its own kernel-known memory segment.

Descriptors. The descriptor for a process includes, among other things, tables that hold the identities of its data segment and its open files plus information about its thread(s). Often, a process’s descriptor is divided into two parts: the resident part, which contains priority information and enough information to make the process resident in main memory, and the transient part, which contains all other information needed to resume the process’s threads and to keep track of its status and its connections to resources (e.g., its table of open files).

Processes are not allowed to directly write to their descriptors. Otherwise, they could raise their own priority or open files that they are not authorized to access. Processes can, however, modify their own descriptors via certain system calls, e.g., an invocation of open makes an entry in the process’s open-file table.

In Unix, a process’s descriptor includes the following attributes, which are inherited from the process’s parent and can be changed by the process itself if its UID (see below) is that of the super-user (UID zero):

• The process’s address space map, i.e., references to its code segment(s) and data segment(s).

• Current status: runnable, sleeping, swapped, zombie, or stopped.

• Execution-priority values, including “nice” number.

• Resource consumption.

• PID, process identification number, i.e., the process’s handle.

• PPID, the Parent’s PID.

• Control Terminal.

• Signal mask, which tells which signals are blocked.


• Reference to the open-file table of this process, which is its table of bindings to objects external to its own address space and, in some systems such as Linux, can be shared with other processes.

• UID, the user ID number (handle) of the process’s owner. Same as parent normally, but super-user processes can change their UID. Used to set ownership of files created by the process. Also used by passwd and su in authenticating users.

• GID, same as above but giving the process’s group. Used to set group ownership of files created by this process.

• EUID, the effective user ID — changes when a set-UID program is executed and is used in determining access privileges.

• EGID, the effective group ID — changes when a set-GID program is executed. This is a list in modern implementations. This list is used in determining access privileges.

• UMASK, the protections mask for default protections of files created by this process. (Initialized to zero and changed via umask(); often changed in the shell’s startup file.) A call to open() creates the specified file if it doesn’t already exist. In such a case, the requested permissions get ANDed with the complement of UMASK to become the permission bits of the new file.

• CD, the current directory of this process. (Changed via chdir(), a system call that allows the change only if the process has execute access to the target directory.)

See Bentson 5.2.1. Consider combining this section on loading with the earlier one.

3.2 Loading

An executable contains code and initial values for configuring a process’s execution image to start running that program. Under Unix, when one program executes another, it invokes the loader routine via the exec system call.4 The old process descriptor (including the open-file table, etc.) remains, as do any shared data segments. But the stack and private data segments are usually replaced by fresh ones. Since the process keeps its previous descriptor and identity, no new process is created. The only time that an invocation of exec returns is when there is an error, e.g., the specified file is not executable. According to [TH] pp. 1933-1934:

A process may exec a file. This consists of exchanging the current text [code] and data segments of the process for new text and data segments specified in the file. The old segments are lost. Doing

4Do not be confused by the fact that Unix calls its linker program ld. Its loader is really its handler for the exec system call.


an exec does not change processes; the process that did the exec

persists, but after the exec it is executing a different program. Files that were open before the exec remain open after the exec.

Code segments (a.k.a. text segments in some Unix literature) are shared among processes and are created by exec only when necessary. The current version of exec is execve, whose prototype is int execve(const char* filename, const char* argv[], const char* envp[]) and whose man page states:

execve() executes the program pointed to by filename. filename must be either a binary executable, or a shell script starting with a line of the form “#! interpreter [arg]”. In the latter case, the interpreter must be a valid full executable pathname and the contents of filename is presented to it on standard input.

execve() does not return on success, and the text, data, bss, and stack of the calling process are overwritten by that of the program loaded. The program invoked inherits the calling process’s PID, and any open file descriptors that are not set to close on exec. Signals pending on the parent process are cleared.

If the executable is a dynamically-linked binary executable containing shared-library stubs, the Linux dynamic linker ld.so(1) is called before execution to bring needed shared libraries into core5 and link the executable with them.

Notice that the invoker gets to specify the string array argv of the command-line arguments and the string array envp of environment variable values that get passed to the next phase of the current process. If the invoker is a shell, it is customary to put into argv the words obtained by parsing the command line and for argv[0] to contain filename, i.e., the name by which the program to be executed is being invoked.
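For instance, a program might chain to /bin/ls roughly as follows (a minimal sketch, with error handling reduced to a single perror call):

    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        /* argv and envp are NULL-terminated string arrays supplied by the
         * invoker; argv[0] conventionally repeats the program's name. */
        char *argv[] = { "ls", "-l", "/tmp", NULL };
        char *envp[] = { "PATH=/bin:/usr/bin", NULL };

        execve("/bin/ls", argv, envp);   /* replaces this program on success */

        perror("execve");                /* reached only if the exec failed  */
        return 1;
    }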

The first word of the specified program file indicates which loader format the file is in, e.g., a.out, coff, or elf. If the program file begins with an ascii string of the form “#! pathname arg” then execve will load and execute the interpreter designated by the content of pathname and will insert the strings pathname and arg at the front of the argv string list as argv[0] and argv[1], respectively. This allows each interpreted program script to specify its interpreter (given by pathname) and flags (given by arg). Keep in mind, however, that arg is optional. See Bentson 7.1.2 and 7.1.3.

3.3 Process Creation and Termination

When a system “boots up,” i.e., (re)starts itself, after power is turned on or a reset interrupt occurs, it first creates the root process (called init in Unix) of the process tree. That root process, in turn, creates all other initial processes

5Originally, the term “core memory” referred to main memory, because main memory was then based on tiny iron donuts called “magnetic-cores.” It still refers to main memory because, by way of a pun, main memory is central, i.e., “core.”


(daemons). Many systems respect the genealogy of processes by restricting certain forms of interprocess communication to processes that are related to each other in certain ways, e.g., in Unix two processes can communicate via any pipe established by a common ancestor.

In most systems, a process can create a child process by calling a special system routine. In Unix, the fork system call generates a “child” process that is a clone of the “parent,” i.e., the process that invoked fork. The child shares its parent’s code segment and inherits its parent’s register settings. It gets a duplicate copy of its parent’s data segments, including the parent’s stack(s).6 It gets a copy of its parent’s descriptor, with new values for PID and PPID. That descriptor includes a copy of the parent’s open-file table, so fork() increments the access counts on all files referenced by the open-file table. The parent passes information to the child by leaving it in variables, either global or local.

This needs an address space diagram.

An invocation of fork() returns the child’s PID to the parent,7 and zero to the child. These two processes avoid identity confusion by testing their respective returned value; each determines its own identity and selects its own destiny, e.g., via a call to exec.

Upon termination of a process, all of its open files are closed, its space is deallocated, and its descriptor is returned to the free-space or free-descriptor pool. In Unix there are two special system calls for handling coordination and termination of processes. The first, exit(status), terminates the calling process and communicates the value of the integer parameter status to the caller’s parent, if it still exists. The second, pid = wait(&status), causes the caller to wait until one of its children exits.8 The PID (handle) of the child is returned, and the status reported by the exiting child is placed in the integer object whose address is specified in this call to wait. If a process exits before its parent waits or exits, it enters a “zombie” state, waiting for its parent to accept the status value.

We can combine the fork, wait and exec system calls to invoke a program in a blocking manner, i.e., to invoke that program and wait for it to finish.9

According to [TH] pp. 1933-1934:

If a program, say the first pass of a compiler, wishes to overlay itself with a new program, say the second pass, then it simply execs the second program. This is analogous to a “goto.” If a program wishes to regain control after execing a second program, it should fork a child process, have the child exec the second program, and have the parent wait on the child. This is analogous to a “call.” ...

6We might expect that fork involves a lot of copying overhead, since fork involves the copying of large data segments that are almost always thrown away by an immediate call to execute. Most implementations avoid this penalty, however, by postponing that copying via a copy-on-write scheme. (See 9.3.3 on page 132.)

7It returns minus one to the parent if unsuccessful.

8To avoid some process-switching overhead, there is also a version of wait, called waitpid, that allows the caller to specify which child to wait for.

9Blocking (respectively, non-blocking) invocations are sometimes called synchronous (respectively, asynchronous).


When a shell runs a command in the foreground, the invocation of the command is said to be blocking — the parent process suspends execution until this child exits, just as the caller waits for the callee to return in a normal function invocation. But if the parent doesn’t call wait, the child runs nonblocking, i.e., as a background process. Shells are structured, more or less, as follows:

.
.
// Command-line processing is done here.
.
.
if ( pid_t kidpid = fork() ) {
    // You’re the parent.
    if ( ! backgroundChild() ) wait( &status );
} else {
    // You’re the child.
    // I/O redirection can be done here.
    exec( command );   // returns only on exec failure
    // report inability to exec
    exit( errno );     // return error number as status
}
.
.
.

The fact that the child inherits a copy of the parent’s per-process open-file table facilitates I/O redirection. The child shares its parent’s standard-input and standard-output files, by default, and can change these without affecting the open-file table of the parent.10 For instance, before calling exec, the child can reopen its standard output to another file without disturbing the standard output of its parent, which continues to run the shell program and communicate with the user in the foreground. See Bentson 5.2.2.
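A sketch of such redirection, as it might appear in the child branch above just before the call to exec (the output file name is made up):

    #include <fcntl.h>
    #include <unistd.h>

    void redirect_stdout(void)
    {
        int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0666);
        dup2(fd, STDOUT_FILENO);  /* entry 1 of the child's open-file table now
                                     refers to out.txt; the parent's table is
                                     untouched */
        close(fd);
    }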

3.3.1 Other Approaches

The fork routine is fairly easy to implement when there is sufficient memory-mapping support in the underlying hardware — fork merely copies the address space of the parent, adjusts the child’s memory map appropriately, and increments the access count on the open files. Each process has its own address space, i.e., set of virtual locations. Virtual location zero is the base of the process’s address space. Address one designates the next virtual location. And so on.

If, however, a process has to deal with absolute addresses, then fork is difficult to implement, since local pointer variables in the child process would contain the absolute addresses of objects in the parent’s data segment. One can

10The child should inherit the parent’s list of outstanding (unserviced) traps and signals as well.


implement fork on, say, an embedded processor that lacks relocation hardware by using only position-independent code, which uses self-relative jumps and pointers.11 Some C/C++ compilers, notably GCC/G++, have command-line flags (e.g., -fPIC) that force the generation of position-independent code.

Separate fork and execute routines are semantically more powerful than the combined fork-and-execute in the sense that, while it is easy to implement fork-and-execute using fork and execute, to implement fork using fork-and-execute is nearly impossible — there are problems in reconstructing the sequence of outstanding procedure calls and in setting all variables (especially pointer variables) to the proper value.

3.4 Process Migration

Some operating systems allow processes to migrate from one computer to another. Sprite, for example, migrates a process by suspending it, swapping out all of the dirty (i.e., modified) pages of its data segments to a special backing file, and then resuming the process on the target system, on which it will be demand paged from its executable and backing files.

Check whether this follows the Sprite paper. If so, make it a quote.

A migrated process executes remotely as though all kernel invocations were executed via remote procedure calls on the migrator’s home machine, i.e., the machine on which it was created. For many kernel invocations, however, it makes no difference whether they are executed by the home machine or by the current host — for efficiency, such invocations are handled locally by that host. There are places where the implementation gets complicated, e.g., where parent and child share an offset pointer into an open file and say the child migrates but the parent does not. Obviously, processes that share a data segment must migrate together.

3.5 Daemons

In addition to processes running user programs, there are other processes (daemons) created by the operating system to handle its own chores. For instance, there will be a daemon for controlling the printer and for handling the login at each terminal port. Also, there may be daemons to distribute electronic mail, run the line printer, and perform other system processes. An excellent treatment of Unix daemons is found in Chapter 31 of Nemeth et al.:

A daemon is a background process that performs a system-related task. In keeping with the UNIX philosophy of modularity, daemons are programs rather than parts of the kernel. Many daemons start at boot time and continue to run as long as the system is up. Other daemons are started when needed and run only as long as they are useful.

11Caution: registers might contain the absolute versions of self-relative pointers unless those pointers are treated as being volatile.


The words “daemon” and “demon” both come from the same root, but “daemon” is an older form and its meaning is somewhat different. A daemon is an attendant spirit that influences one’s character or personality. Daemons aren’t minions of good or evil; they are creatures of independent thought and will. As a rule, UNIX systems seem to be infested with both daemons and demons.

... init is the first process to run after the system boots, and in many ways it is the most important daemon. It always has a PID of 1 and is an ancestor of all user processes and all but a few system processes.

... After processing startup files, ... it opens the ports and spawns a getty process on each one. If a port cannot be opened, init periodically issues complaints ...

... inetd is a daemon that manages other daemons. It starts up its client daemons when there is work for them to do and allows them to die gracefully once their processes have been completed.

... inetd only works with daemons that provide services over the network. ...

Under Unix, system-provided executables intended for spawning daemon processes are often given names ending in the letter “d”, e.g., “inetd”.

Linux bootup. When a Linux system powers up or restarts itself due to a reset interrupt, the system invokes the boot manager, a small program residing on the Master Boot Record (MBR) of the system’s default boot disk (or other bootable device). The most common Linux boot manager is GRUB (GRand Unified Bootloader), which presents a menu of files that (should) contain copies of OS kernels that it is willing to bootstrap, i.e., load into memory and start executing. GRUB passes some arguments to the bootstrapped kernel, two of which are init and runlevel.

The init argument contains the name of a normal executable that the kernel runs to create its first daemon, called “init”, which directly or indirectly creates all other user-mode processes, daemons included. If the user specifies init=/bin/sh, the kernel runs the sh shell as that initial daemon, a nonstandard choice that is useful for performing special debugging tasks.

The default value of the init parameter, however, is /sbin/init, the pathname of an executable that contains a copy of the System V init program, which interprets the runlevel parameter (usually 0, 1, 2, 3, 4, 5, or 6) and creates the next generation of daemons per a special configuration file, /etc/inittab. Lines beginning with # are comments. The other lines contain four colon-separated fields:

1. a unique line identifier up to four characters in length

2. a string of run levels, e.g., 235, to which this line applies12

12If this field is blank, the line applies to all run levels.


3. one of a dozen or so keywords that tell when this line’s command should be executed, e.g., once, powerfail, sysinit, wait, etc. — man inittab gives the details.

4. a command to be executed.

There is a special line, e.g., id:5:initdefault:, that specifies the default runlevel, which is 5 in the example case.

For example, every time the system is booted, the line si::sysinit:/etc/rc.d/rc.sysinit runs the script /etc/rc.d/rc.sysinit. And, whenever runlevel 5 is entered, the line l5:5:wait:/etc/rc.d/rc 5 runs the script /etc/rc.d/rc, passes it the parameter 5, and then waits for it to exit. That script is responsible for stopping and starting various services (service daemons) when the runlevel changes. On entry to runlevel 5, for example, the script goes through the directory /etc/rc.d/rc5.d/, which contains symbolic links to executables in /etc/rc.d/init.d/ that are run by daemons that provide services such as networking. It first invokes executables whose symbolic link’s name begins with the letter K, passing them the parameter stop, which kills that service’s current daemon. Then it runs those whose symbolic link’s name begins with S, passing them the parameter start. It’s a good idea to maintain /etc/inittab and those symbolic links via a configuration utility, e.g., redhat-config-services or the Services Configuration Tool:

The Services Configuration Tool is a graphical application developed by Red Hat to configure which SysV services in /etc/rc.d/init.d are started at boot time (for runlevels 3, 4, and 5) and which xinetd services are enabled. It also allows you to start, stop, and restart SysV services as well as restart xinetd.

Note that xinetd is the Linux replacement for inetd.

For each terminal port, inittab contains a line that causes init to repeatedly and synchronously fork a child that invokes a getty program to open a specified terminal port (tty line), e.g.:

# Run gettys in standard runlevels

1:2345:respawn:/sbin/mingetty tty1

2:2345:respawn:/sbin/mingetty tty2

3:2345:respawn:/sbin/mingetty tty3

4:2345:respawn:/sbin/mingetty tty4

After opening the specified terminal port, e.g., /dev/tty2, the child process execs the login program, which issues a challenge on that port for a correct user name and password. When a user name is entered, login consults that user’s entry in the system’s passwd file. If the user passes the authentication challenge by entering the correct password, then per login’s man page:

Random administrative things, such as setting the UID and GID of the tty, are performed. The TERM environment variable is preserved, if it exists (other environment variables are preserved if the -p option is used). Then the HOME, PATH, SHELL, TERM, MAIL, and LOGNAME environment variables are set. PATH defaults to /usr/local/bin:/bin:/usr/bin:. for normal users, and to /sbin:/bin:/usr/sbin:/usr/bin for root. [...] The user’s shell is then started.

At this point, the authenticated user can run whatever additional programs he/she pleases. The child process exits when that shell terminates, whereupon init spawns another getty process on that port.


Chapter 4

MULTITHREADING

It is often desirable to structure a program as a set of cooperating concurrent activities sharing the same data segments and open-file table. Such a program structure allows multiple CPUs to be devoted to a single executing program (i.e., a single process). Also, such a structure is especially useful for programs that get input from many different sensors and that must control several different actuators, since the various streams of activity within such a program access the same global variables and share the same connections to open files. For example, a video telephone requires a video camera, a display, a control panel and a bi-directional communications link. So video telephone software could be divided into the following potentially concurrent activities:

• capturing another video image in time to compress it just before the last pixels of the previous image are sent

• compressing that outbound image

• transmitting outbound images

• receiving incoming images

• decompressing incoming images

• displaying decompressed incoming images

• responding to changes in settings of the control panel.

As we will see later, the best way to structure an operating system is to organize it as a system of cooperating activities.

A stream of activity within a process will be called a thread (a.k.a. active object, coroutine, or light-weight process1). Over the years, there have been several attempts at defining the notion of thread:

1By contrast, normal processes, which have their own address space and open-file table, are said to be heavy-weight processes because it takes more work to create and destroy them.



• that which is manifested by the existence of a thread descriptor

• the “animated spirit” of a procedure

• the “locus of control” of a procedure in execution

• that entity to which processors are assigned

• the “dispatchable” unit

• anything that can wait

• a maximal sequence of procedure invocations, the first of which is a nonblocking invocation from outside the thread and each of the rest of which is a blocking invocation by its most recent still-active predecessor.

Thread creation. One may think of thread creation as a nonblocking (a.k.a. asynchronous) function invocation, wherein the caller does not wait for the callee to return. Since the nonblocking caller might return before the callee, the activation record for a nonblocking invocation must be pushed onto a fresh stack. Thus, at any moment, each thread has exclusive access to its own unique stack for keeping track of its currently unreturned normal (blocking) function invocations.
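With POSIX threads, for example, such a nonblocking invocation might look like the following minimal sketch (compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    /* The "callee" of the nonblocking invocation; it runs on its own stack. */
    static void *worker(void *arg)
    {
        printf("worker running with argument %d\n", *(int *) arg);
        return NULL;
    }

    int main(void)
    {
        int arg = 42;
        pthread_t tid;

        /* Nonblocking invocation: create a thread that runs worker(&arg);
         * the creator does not wait for it to return ... */
        pthread_create(&tid, NULL, worker, &arg);

        printf("creator keeps running concurrently\n");

        /* ... unless it chooses to, by joining with the thread later. */
        pthread_join(tid, NULL);
        return 0;
    }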

The key thing about threads is that they can wait. A thread is said to be waiting from the time it is created or suspended until the next time it is resumed (i.e., gets allocated a host processor), and it is running (hosted) from the time it is resumed until it is suspended or destroyed. Suspending a waiting thread or resuming a running thread is an error that yields undefined behavior.

The descriptor for a thread contains information about when that thread should be resumed (priorities, time-out thresholds, etc.) plus the information needed to resume (dispatch) the thread when it is waiting, i.e., the information needed to restore the values of key processor registers (PC, SP and PSW) to what they were when the thread was suspended. Of course, to suspend a running thread, one stores the registers of the thread’s current host processor into the thread’s descriptor.

When a thread passes its processor to another thread, it resumes that other thread and usually simultaneously suspends itself.2 Sometimes, however, a given thread performs the activities of another thread by resuming the other thread and assuming its register settings without suspending itself. In such a case, the given thread is said to be a virtual processor that is running (hosting) that other thread.3 Virtual processors, like non-virtual ones, can move from stack to stack, while ordinary threads are confined to their own unique stack.

2The code that accomplishes this processor handoff is somewhat tricky. It is often called “trampoline code” or a “coroutine call”. Such resumption is sometimes nonblocking in the sense that the suspended thread is still runnable but will have to wait for a processor. An invocation of a trampoline routine has the weird property that the thread to which it returns is waiting as a result of a different invocation of the trampoline routine.

3Of course, if we are ever again to find the top of the previous stack, then a pointer to it has to be stored somewhere.


Threads vs. processes. In the early days of computing, all processes were monothreaded and no distinction was made between “process” and “thread” — the term “process” was used to describe both.4 The distinction is important. Different processes have different address spaces. Different threads have different stacks but may be in the same address space. Threads are the fundamental units of concurrent activity (i.e., access to CPUs), and processes correspond to data segments that hold global variables for threads.

Each thread belongs to a specific process. Most processes are monothreaded, i.e., they have only one thread. Those with multiple threads are said to be multithreaded. Normally, a thread is known only within its process, and, since the kernel allocates CPUs, only kernel threads directly get CPUs — the rest must run on virtual processors that belong to the kernel. (In the Unix world, kernel-known threads are sometimes called tasks.)

Every process has at least one dedicated kernel thread, which handles the process’s kernel invocations (traps and system calls) on a stack within the kernel. That kernel thread also acts as a virtual processor to run the process’s user-mode thread(s).

In most systems, fork() and exec() can only be called from processes that have a single such dedicated kernel-based virtual processor. In such processes, multiple user threads that are unknown to the kernel5 can achieve pseudo-concurrency by passing a kernel-known virtual processor among themselves. If, however, each of those multiple threads ran on its own kernel-based virtual processor, the kernel could allocate processors, one each, to several of them at the same time, thereby achieving true concurrency among those threads. To achieve true concurrency, we need a system call by which a user thread can acquire a dedicated kernel-based virtual processor on which to run.

Compare and contrast the POSIX and Linux model for threads and processes.

Clone(). Note that the process-vs.-thread dichotomy is “all vs. nothing” in terms of anatomy sharing between child and parent. In Linux, however, clone() is a system call by which one kernel-known thread creates another and specifies what is to be shared. Its prototype is pid_t clone(void* sp, unsigned long flags) and per its man page:

If sp is non-zero, the child process uses sp as its initial stack pointer. [Otherwise, it uses the current setting.] The low byte of flags contains the signal sent to the parent when the child dies. flags may also be bitwise-or’ed with either or both of COPYVM or COPYFD.

If COPYVM is set, child pages are copy-on-write images of the parent pages. [See 9.3.3 on page 132.] If COPYVM is not set, the child process shares the same pages as the parent, and both parent and child may write on the same data. If COPYFD is set, the child's file descriptors are copies of the parent's file descriptors. If COPYFD is not set, the child's file descriptors are shared with the parent.

On success, the PID of the child process is returned in the parent's thread of execution, and a 0 is returned in the child's thread of execution. On failure, a -1 will be returned in the parent's context, no child process will be created, and errno will be set appropriately.

4Some authors use the term "heavy-weight process" for "process" and "light-weight process" for "thread."

5Many programming languages (e.g., Java, Modula-2, and Ada) have built-in provision for some form of threads. For C/C++ there are various libraries, such as Pthreads, that implement threads via assembly-language routines or via tricks involving setjmp and longjmp.

So clone(0,SIGCLD|COPYVM|COPYFD) is identical to ordinary fork(). On the other hand, if neither flag is set, then the child is a kernel-known thread (i.e., task) serving as a virtual processor for a user thread in its parent's process. In either case, the child has separate stacks for its user-mode activity and its kernel invocations and has its own signal mask and its own queue of pending signals. Note that in every case, clone() creates a new user thread and dedicates a kernel-based virtual processor to running it. The conventional approach has the advantage that to cancel all threads running in a given address space, one simply cancels their process. In the clone-based approach, however, simultaneously cancelling all threads associated with a given address space requires kernel support.
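For concreteness, here is a minimal sketch using the modern glibc wrapper for clone(2). Its signature and flag names (CLONE_VM, CLONE_FILES, SIGCHLD) differ from the older prototype and the COPYVM/COPYFD flags of the man page quoted above, so treat the details as illustrative rather than normative.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>        // clone(), CLONE_* flags
#include <csignal>        // SIGCHLD
#include <sys/wait.h>     // waitpid()
#include <cstdio>
#include <cstdlib>

static int childFn(void*) {              // runs on the child's separate stack
    std::printf("child: running in the parent's address space\n");
    return 0;
}

int main() {
    const long STACK_SIZE = 64 * 1024;
    char* stack = static_cast<char*>(std::malloc(STACK_SIZE));
    if (stack == 0) return 1;
    // CLONE_VM | CLONE_FILES: the child is a kernel-known thread (task) that
    // shares the parent's pages and open-file table.  Passing only SIGCHLD,
    // with neither sharing flag, behaves like an ordinary fork().
    pid_t pid = clone(childFn, stack + STACK_SIZE,       // stack grows downward
                      CLONE_VM | CLONE_FILES | SIGCHLD, 0);
    if (pid == -1) return 1;
    waitpid(pid, 0, 0);                                  // reap the child task
    std::free(stack);
    return 0;
}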

[Margin note: Get a direct quote from the Sprite paper for the paragraph below.]

The Sprite operating system [OU] introduced a special clone-like fork operation where the child shares the parent's data segment(s) but gets a separate copy of the parent's stack segment and open-file table. Under this scheme, the threads of a given process do not have access to each others' stacks. When it was suggested in netnews that "a clone flag that shared all of VM [virtual memory] except the stack might be interesting," Linus Torvalds, the creator of Linux, replied:

Linux is so far the only system I know of that avoided that particular braindamage. Both Irix and Plan-9 have "clone-like" system calls (in fact, as far as I know the concept came from Plan-9), but both of them tried to do it the way you allude to above.

And it’s completely broken and totally moronic when you actuallystart looking into what that one small “wouldn’t it be nice if” featuremeans for the implementation.

Why Linux does not do the above:

• It’s impossible to do an efficient process-switch on any currenthardware I know of once you start doing “partially shared”page tables.

• It’s impossible to do an even remotely sane implementationof good sharing, if that sharing can sometimes be incomplete.What happens when the stack grows in one thread but not theother?

• You have "local pointers" and "global pointers" depending on whether they point to inside your stack or not. As such, sometimes you can pass pointers to your stack around, and sometimes you can't depending on whether the thing you pass the pointer to happens to be communicating with another thread.

Trust me, anybody who ever implemented what you were thinking of is now either feeling very sorry about it or is just too stupid to understand what a disaster it is.

And yes, it looks like a deceptively good idea. It just isn’t.

If the stack segment is a separate address space, local (automatic) variables and global (static) variables can have identical offsets within their respective segments, which requires that pointers be accompanied by some indication of which segment they are relative to. That would seem to be a difficulty for compilers more than operating systems.

4.1 Concurrency and Servers

A server can have one or more queues of pending service requests from client threads. A client's request arrives at a server, waits in one of the server's queues for some time, and eventually receives the requested service per some scheduling protocol. We will study the following categorization of services:

• passive

• active

– nonpreemptive

∗ blocking

∗ nonblocking

– preemptive

∗ diverting

∗ nondiverting

· preemptee blocking

· preemptee nonblocking

Active vs. passive services. There are two ways that servers can service requests from concurrent client threads:

• passively (self-service): clients queue up to use the server,6 i.e., to invoke and run the server's handlers.

• actively (full-service): client requests are queued up and eventually get serviced within the server by "agent" threads, who invoke the server's handlers on behalf of the clients.

6In the case of a passive service, the request queue is actually a queue of clients waiting to use the server.


Given a passive service, an equivalent active service can be obtained by creating an active proxy, e.g., by placing the original service in a multithreaded RPC-server process that has, for the given passive service, a handler (stub) that causes an agent thread to request that service from the passive server. The agent thread then requests and self-services the client-requested service on the real server and returns the result to the proxy handler, which returns the result to the real client.
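A minimal sketch of such an active proxy, using the Pthreads library mentioned in footnote 5; the names PassiveServer, ActiveProxy, and Request are illustrative, and a production proxy would keep a pool of agent threads (see footnote 8) rather than the single agent shown here.

#include <pthread.h>
#include <queue>

struct PassiveServer {                       // passive (self-service): the client's own
    int handler(int x) { return x * x; }     // thread would invoke this directly
};

struct Request { int arg; int result; bool done; pthread_cond_t ready; };

class ActiveProxy {                          // active (full-service) front end
    PassiveServer& server;
    std::queue<Request*> requests;           // queue of pending service requests
    pthread_mutex_t m;
    pthread_cond_t nonempty;
    pthread_t agent;

    static void* agentMain(void* self) {     // body of the agent thread
        ActiveProxy* p = static_cast<ActiveProxy*>(self);
        for (;;) {
            pthread_mutex_lock(&p->m);
            while (p->requests.empty()) pthread_cond_wait(&p->nonempty, &p->m);
            Request* r = p->requests.front(); p->requests.pop();
            pthread_mutex_unlock(&p->m);
            r->result = p->server.handler(r->arg);   // agent self-services the request
            pthread_mutex_lock(&p->m);
            r->done = true;                          // hand the result back
            pthread_cond_signal(&r->ready);
            pthread_mutex_unlock(&p->m);
        }
        return 0;
    }
public:
    ActiveProxy(PassiveServer& s) : server(s) {
        pthread_mutex_init(&m, 0);
        pthread_cond_init(&nonempty, 0);
        pthread_create(&agent, 0, agentMain, this);
    }
    int request(int x) {                     // the stub that clients invoke (blocking)
        Request r; r.arg = x; r.done = false;
        pthread_cond_init(&r.ready, 0);
        pthread_mutex_lock(&m);
        requests.push(&r);
        pthread_cond_signal(&nonempty);
        while (!r.done) pthread_cond_wait(&r.ready, &m);
        pthread_mutex_unlock(&m);
        pthread_cond_destroy(&r.ready);
        return r.result;
    }
};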

Blocking. A request is said to be blocking if clients cannot do other computation while their requests are handled. A passive service is necessarily blocking, since the handler's invocation record blocks the client from invoking further procedures until that invocation record is popped upon the handler's return. Requests to void active services, on the other hand, can easily be made nonblocking — the client can simply continue after communicating its request.

Remoteness. A request for a passive service must be handled in an address space that is accessible to the requesting client, i.e., the invocation record for the handler gets allocated on the client's current stack.7

A request for an active service, on the other hand, can be handled in an address space that is inaccessible to the client. Requests for active services are handled on server-local stacks by pre-existing or newly created agent threads within the server.8 Such requests are communicated to the server via some form of interprocess message passing, e.g., copying parameters and return values from one stack to another, which involves a lot of overhead.9 Thus, active services can be remotely accessible (e.g., via remote procedure calling) and work well in distributed environments. Unfortunately, binding to and invoking the handlers of active services involves overhead, both in terms of run time and implementation.

Preemptive vs. nonpreemptive active services. Some requesting clients of active services are not threads but rather devices that need a service such as the handling of an IO completion. Normally, a processor must notice this request and switch from running the current procedure of the current thread (called the preemptee) to running the requested service's handler as a procedure that is involuntarily invoked by the preemptee or by some designated other thread. In either case, the processor is said to be preempted, and the service is said to be preemptive. If the handler is invoked by or on behalf of the preemptee, the preemptive service is said to be diverting. Diverting services are necessarily preemptee blocking, i.e., they block the preemptee's own agenda. By contrast, the handler for a nondiverting preemptive service can, in principle, restore the preemptee to a runnable state, thereby unblocking the preemptee and increasing potential concurrency.10

7Typically, they are implemented as objects shared among the threads of a given process.

8To avoid the overhead of thread creation and destruction, one can maintain a pool of readily available agent threads.

9By contrast, handlers for passive services can be invoked via the same mechanisms as ordinary functions and even inlined.

Polling is a mechanism by which preemptive behavior can be simulated in a nonpreemptive environment. Some running thread must notice the request and perform the appropriate service. (This approach is called polling, because to notice the request this thread must look for it, i.e., poll for it.) Polling has the disadvantage that polling instructions must be inserted so that they occur sufficiently often that the device's request will be handled in a timely manner. In the case of long-running tight loops, such polling code can constitute a significant overhead. Polling has the advantage that it occurs at coherent, predictable points; thus compilers and architectures do not have to make allowances for rude surprises.

For coherence, one must be able to block and unblock preemption for a given preemptive service. One can, in fact, simulate polling for a preemptive service by briefly unblocking that service (e.g., interrupt or signal), which will invoke the service's handler if the service's pending flag has been set, and then immediately blocking that service again.

[Margin note: Give forward reference to event blocking.]
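As a concrete illustration (a sketch only, with a POSIX signal standing in for a preemptive service request), a long-running loop can poll for a normally blocked SIGUSR1 by unblocking it for an instant and then re-blocking it:

#include <signal.h>

static volatile sig_atomic_t workRequested = 0;      // the service's "pending flag"

static void onRequest(int) { workRequested = 1; }    // the preemptive service's handler

static void installRequestService() {                // setup: install handler, block the signal
    signal(SIGUSR1, onRequest);
    sigset_t usr1;
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    sigprocmask(SIG_BLOCK, &usr1, 0);
}

static void pollForRequests() {                      // called at coherent points in a long loop
    sigset_t usr1, saved;
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    sigprocmask(SIG_UNBLOCK, &usr1, &saved);  // any pending SIGUSR1 diverts us to onRequest() here
    sigprocmask(SIG_SETMASK, &saved, 0);      // ...and is immediately blocked again
    if (workRequested) { workRequested = 0; /* perform the requested service */ }
}

(In a multithreaded program, pthread_sigmask would replace sigprocmask.)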

Because their requestors are not generally able to pass parameters and/or receive return values, the handlers for preemptive services are usually void/0-ary functions.11

Interrupts and traps are preemptive requests to a kernel; they divert a real processor. For example, in a Unix kernel invocation, the interrupted/invoking user thread has its kernel-known virtual processor preempted to the handling of the kernel invocation.

Signals are preemptive requests to user processes; they divert kernel-known virtual processors. Usually, signals occur when the kernel propagates an interrupt or trap to a user thread. Some signals are void/0-ary service requests sent by one thread to another, requesting, for instance, that the recipient exit.12

Signals resulting from traps are directed to the thread that caused that processor condition.13 With clone-created threads under Linux, signals resulting from interrupts (e.g., timer interrupts) are directed to the thread that armed (i.e., enabled) that interrupt and thereby requested the event. Under POSIX, however, such signals go arbitrarily to any thread having the corresponding address space (i.e., process) and having that event unblocked. The following was posted to the newsgroup comp.programming.threads, April 5, 2001, by Dave Butenhof, one of the main designers of the POSIX Pthreads standard:

> The POSIX standard mandates that a thread that

10Nondiverting preemptive services are robbers; they snatch the preemptee's processor. Diverting preemptive services are kidnappers; they snatch the preemptee himself and make him do their bidding.

11A "void" function or request returns no value, and an "n-ary" function or request takes n arguments. Thus a "void/0-ary" function or request takes no parameters and returns no value. (But you knew that already.)

12Usually the function by which threads exit (i.e., terminate themselves) is given a name like thread_exit to distinguish it from the system call, exit, which terminates the entire process of its caller.

13FIX: Okay, but how does the kernel do this? Does it rig a fake call to the handler onto the user-mode stack of that thread?


> misbehaves and causes a SIGSEGV to be generated will

> have that signal delivered to the thread in question.

More generally, any signal caused by the direct and immediate action of a thread must be delivered to the thread. These are "thread directed" signals, and include SIGSEGV, SIGBUS, SIGFPE, anything sent by pthread_kill() or raise(). Signals that are directed at the process, though, cannot be restricted to a particular thread. This includes kill(), and the normal "asynchronous" external signals such as SIGCHLD, SIGIO, SIGINT, etc.

> Most other signals will be delivered to a random thread

> that does not have the signal masked.

This is the crux of the problem with the Linux implementation. If the PROCESS receives a signal, and it happens to come into a thread (Linux process) that has the signal blocked, the PROCESS cannot receive the signal until the THREAD unblocks it. This is bad.

POSIX requires that a PROCESS signal go to any thread that does not have the signal blocked; and if all have it blocked, it pends against the PROCESS until some thread unblocks it. (LinuxThreads can, to some extent, fake this by never really blocking signals, and having the handlers echo them to the manager thread for redirection to some child that doesn't have the signal blocked... but that's complicated, error prone, [e.g., the signal will still EINTR a blocking call, whereas a blocked signal wouldn't], and all that echoing is relatively slow.)

Lots of people may disagree with the POSIX model. It was the hardest part of the standard. We fought for years before Nawaf Bitar fearlessly lead (and cattle-prodded) the various camps into "the grand signal compromise" that ended up forming the basis of the standard. Still, the major alternative to what we have would have been a model based on "full per-thread signal state", and that would have made a real mess out of job control because, to stop a process, you'd need to nail each thread with a separate SIGSTOP... problematic if it's dynamically creating threads while you work. (And, sure, there are infinite shadings between the extremes; but that's where people really started knocking each other over the head with chairs, and it just was not a pretty scene.)

The computing public may sleep better not knowing how standards are made.

[Margin note: Make a table of the kinds of threads.]

Diverting vs. nondiverting preemptive active services. Most preemptive requests are handled by the preemptee as though the preemptee made an involuntary request to a passive service. This must be the case if the handler's activation record is allocated on the preemptee's stack. Such services are said to be diverting. In such a case, any waiting by the handler causes the preemptee to wait, so it is possible for the handler to implement various coordination and scheduling mechanisms that involve the preemptee (e.g., timeslicing).

Diverting services involve less overhead than nondiverting ones, since there is less context to switch, i.e., fewer registers to store and restore. In a diverting service one can simulate nondiverting servicing by designing the handlers to simply resume a server-local agent thread. (Doing so can mitigate the impact of preemptee blocking on potential concurrency.)

Kernels and diversion. A kernel is a server, and kernel invocations are service requests.

[Margin note: Note why passive and diverting services are a protection problem in a multiprocessing environment. A user-mode thread can mess with a priv-mode stack!]

Kernel invocations where only the PC changes are normally passive or diverting. In such a case, it's useful to think of requests from devices as being serviced by interrupt-sustained pseudo-threads that get work done by temporarily inhabiting (diverting) real threads and making them do the interrupt handler's bidding.

Kernel invocations where the CPU picks up a new stack pointer SP as well as a new program counter PC from the vectoring hardware are normally active and nondiverting, but diversion can be simulated by immediately restoring the SP to its prior value (which will have been stored somewhere).

Diversion is more common in embedded systems, where inexpensive hardware lacks protection features and all software is usually trusted. In non-embedded systems, the interrupted/invoking client thread does not have access to the kernel's address space; however, the kernel always has access to the address space of local threads, since their address space can be added to that of the kernel through memory-mapping tricks.

Nondiverting kernel invocations are common for multiprogrammed systems. Unix kernels, for example, have an agent thread for each user-mode process (for handling its traps and system calls) and for each device (for handling its interrupts). In monothreaded active kernels, if all interrupt handlers run to completion, they can share a single kernel stack.

In a given kernel, some services may be diverting and others nondiverting. For instance, system calls may be passive and be treated somewhat as procedure invocations, while interrupts are preemptive and nondiverting and get treated as thread resumption or creation. In fact, the handlers for different interrupting devices might be treated differently. One possible scheme is to specify a new SP value of zero in the vectoring tables to indicate that SP is to remain unchanged, just as in calls to clone. As another example, the x86 architecture has four hardware-known stacks per process, one for each protection level — when a process is running at a particular protection level, it uses its corresponding stack for all function invocations, including system calls. That architecture has a bit for each interrupt and trap vector that tells whether the handler is a process with its own four stacks or a procedure to run on one of the four stacks of the current process.

The major argument against making all services diverting is: "Where do you handle stack-overflow traps? Surely not on the current stack, which just overflowed!" Usually, however, a bounds register can be set to trap such overflows with room to spare. But correctness proofs are difficult in such cases, since, for example, the kernel must be compiled before it is known how much room to allow.

In multiuser environments, another argument against making all services diverting is: "How do you prevent an interrupted thread from possibly discovering sensitive information through inspection of discarded activation records that interrupt handlers leave behind in user space?" Careful writing of handlers can prevent system invocations from leaving critical information on user stacks, but such care is difficult to guarantee.

[Margin note: So what's special about kernels as servers? Robustness? Offer loader services? What else?]

Chapter 5

THREAD SAFETY AND COORDINATION

In this chapter we concentrate on passive servers, since we know how to turn them into active servers that behave equivalently. We study thread coordination at a level of abstraction that applies both to kernel threads and to user-mode threads. Note that a user-mode thread runs when and only when a kernel thread hosts it.

[Margin note: Distinguish among: coordination policy, scheduling policy, and access (protection) policy.]

Threads need to coordinate (i.e., synchronize) their access to shared servers.1

For instance, two threads that simultaneously try to obtain the first buffer from a free-buffer list might each get the same one (a bad thing) unless each in turn gets exclusive access to the buffer list — in subsection 6.1.2 on page 83, we will see that even a server consisting of a single bit may require exclusive access. Servers that are coherently sharable among multiple concurrent threads are said to be thread safe.2 This chapter discusses mechanisms for establishing mutual exclusion, as well as some more elaborate coordination protocols that are required for thread safety.

Often service requests involve waiting, e.g., until a certain server/resource becomes available. A server that delays the progress of its client threads by making them wait (per some protocol) is called a scheduler or dispatcher.3 All thread-safe servers except low-level locks have schedulers (or are part of a server that has a scheduler). That scheduler provides acquire and release services for clients.4 The acquire service takes parameters that depend on the nature of the scheduled server and its policies. These parameters might include priorities and/or the amounts of various resources that are being requested. Initially, the handler of a requested service "checks in" with the server's scheduler by invoking the scheduler's acquire service. The handler "checks out" just before returning by invoking release. The handler may also check out and back in at times, e.g., before and after waiting.

1For automobiles, an intersection is a shared server/resource requiring coordinated access. It is sometimes helpful to think of a kernel as a traffic cop for the resources of a computer system.

2See http://www.noble-library.org and [FR] for discussion of lock-free concurrently accessible data structures.

3Dispatchers often enforce access-control (i.e., protection) policies as well as scheduling policies.

4The real service performed by schedulers is that of retarding clients' progress until it is safe to proceed.

Note that acquire is a service that can be requested, and requesting it is often viewed as requesting the server. In the interim between the return from an acquire request and the issuing of the subsequent release request,5 the client is said to have acquired from the scheduler (some amount of) that server/resource and to hold it, to own it, and to owe it to the scheduler. That portion of the server/resource, in turn, is said to be accessed by the client or to be servicing the client's request. (In some contexts, we say that the scheduler itself is held, rather than the scheduled server/resource — context usually resolves that ambiguity.)

We allow ownership to be transferred to another thread, which receives (inherits) ownership. This transfer is usually implicit, i.e., does not involve the execution of actual code. For diagnostic and documentation purposes, however, it is often helpful to add transfer and receive services to schedulers.
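As a sketch of this check-in/check-out (and transfer) protocol; the Scheduler interface, ThreadId, and PrintServer below are illustrative placeholders rather than definitions from these notes:

struct ThreadId { unsigned id; };              // placeholder thread identifier

class Scheduler {                               // hypothetical scheduler interface
public:
    virtual void acquire(int priority) = 0;     // check in: may make the caller wait
    virtual void release() = 0;                 // check out: caller must currently hold the resource
    virtual void transfer(ThreadId to) = 0;     // document an implicit hand-off of ownership
    virtual ~Scheduler() {}
};

class PrintServer {                             // a scheduled server/resource
    Scheduler& sched;
    void send(const char*) { /* device-specific transmission */ }
public:
    PrintServer(Scheduler& s) : sched(s) {}
    void print(const char* job, int priority) {
        sched.acquire(priority);                // check in with the scheduler
        send(job);                              // use the held resource
        sched.release();                        // check out just before returning
    }
};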

Within some servers, more than one service request can be active (not waiting) at a given time, in which case the server is said to be multithreaded. Otherwise, it is said to be monothreaded (or single-threaded), and the server is called a monitor.6 A monitor in which no waiting occurs is called a serial server; the servicing of each of its service requests must complete before servicing of the next can begin.

It would appear that each scheduler, being a server itself, would need a scheduler, which, in turn, would need yet another scheduler, and so on. In fact, high-level schedulers are almost invariably implemented as monitors, which have very simple serial schedulers called locks. Low-level locks are primitive, i.e., they don't need yet another scheduler, but all thread-shared servers other than low-level locks require schedulers in order to be thread safe.

Locks restrict concurrency and, thereby, potentially diminish processor utilization. It is, therefore, a good policy to keep critical sections (i.e., code segments where a lock is held) short and, when transferring a lock to another thread, to pass the host processor (preferably with its interrupts/preemption blocked) along with the lock, so that the receiving thread gets through its critical section quickly.

Waiting on low-level locks entails additional inefficiencies involving running with interrupts off, stopping processors, or wasting processor cycles. Higher-level coordination is built upon low-level locking but involves passive waiting — the waiting thread relinquishes its processor after putting itself onto a priority queue of passive (i.e., not running) threads, where it waits until it is selected and again given a processor. To minimize impact on system performance, low-level waits should be kept very short; long-term waiting should occur on high-level mechanisms where both the processor and the lock are relinquished.

5Eventually we will introduce a version of release that allows a client to designate a specific client to which it will release (a certain amount of) a resource.

6Think, for instance, of the waiting room of a clinic with many doctors but only one receptionist to receive patients, tell them to wait, and summon them when their doctor is available. This monothreaded receptionist is a monitor that schedules the clinic's activities.

5.1 Mutual Exclusion

5.1.1 Locking

The mutual-exclusion problem7 is to implement a class of schedulers, called locks,8 that enforce a one-thread-at-a-time scheduling protocol — more specifically, the problem is to implement a class of servers having acquire and release services that meet the following specification,9 i.e., that fulfill the following contract by keeping its promises so long as its requirements are met:

A correct implementation of Lock fulfills the following contract, whose obligations (a.k.a. promises) are:

• Mutual exclusion. No lock is ever held by more than one client at a time.10

• Progress. Any invocation of acquire on an unheld lock returns promptly.

• Bounded waiting. There is a bound on the number of times any other client can acquire a given lock during a given invocation of acquire.

provided that the following requirements are observed by all clients:

• No client shall invoke release on a lock that it doesn't hold, i.e., no gratuitous releases.

• No client shall invoke acquire on a lock that it already holds, i.e., no recursive acquisitions.

The behavior of a correct implementation of Lock is not defined whenever one of its clients has violated either of the above requirements.
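From the C++ perspective (see footnote 9), the contract above is the specification of an abstract base class; a minimal sketch:

class Lock {
public:
    virtual void acquire() = 0;   // returns only when the caller holds the lock
    virtual void release() = 0;   // must be invoked only by the current holder
    virtual ~Lock() {}
};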

Visibly, an invocation of acquire on a lock held by another thread must not return until that other thread has invoked that lock's release, and then one and only one unreturned invocation of that lock's acquire handler can return.

Recursive acquisitions occur when a service routine of a monitor requests a service from the same monitor — or invokes a function that requests such a service, etc. These situations can usually be found by careful syntactic analysis11 and should be treated as programming errors leading to potential collisions over critical data, since the handler of the second service request to that monitor would likely find that data in an inconsistent state.

7Also known as the critical-section problem.

8Some authors use the nouns "mutex" or "serializer" for "lock," and the verbs "lock" or "request" for "acquire," "unlock" for "release," and "own" for "hold," respectively. The acquire service of a serializer may involve scheduling parameters. A lock is a serializer whose acquire is parameterless.

9From the C++ perspective, Lock is an abstract base class having two parameterless pure virtual void function members: acquire and release.

10In addition to its direct services, acquire and release, a lock performs an indirect service to its holder, that of holding other threads at bay, even while none of its direct services is executing. This is similar to the service of remembering a bit pattern that a memory location performs while no read or write operations are acting on it.

When a thread holding a lock violates the prohibition on recursive acquisitions and recursively acquires that lock without first releasing it, some kinds of locks will release, some kinds will jam (never to release), and others will cause the client to self-deadlock, waiting for itself to release the lock. Those implementations that avoid the above problems will usually release on the first invocation of release after a recursive acquisition (often to the astonishment of the programmer) rather than after the initial acquisition has finally been released.12

5.1.2 Preemption Blocking

[Margin note: This is a three-stage process. Initially blocking is a property of processors. We lift it first to threads, and then from threads to servers.]

To enforce coordination protocols, some handlers of passive services must temporarily protect their clients against preemption by (requests for) certain preemptive services of surrounding servers, thus in effect postponing those requests. The handler simply sets its (possibly virtual) host processor's preemption mask, an array of event-blocking flags,13 to block (i.e., disable, disarm, turn off, suspend, postpone, or delay) preemption by requests for those preemptive services.

Each thread has a preemption-mask attribute that indicates which preemptive services' requests (events) are prohibited from preempting host processors away from that thread. Whenever a thread resumes (i.e., acquires a host processor), its event mask is copied to that of its host. Conversely, when the thread suspends, the host's preemption mask gets copied back to that of the thread. At each processor, requests for each of its blocked services are queued up to some per-service maximum number, to be handled when that service becomes unblocked (i.e., re-enabled) on that processor.14 Postponed requests beyond the maximum number are simply ignored (and lost).

Blocking a given service is somewhat akin to acquiring a lock, but note that:

• Locks usually exclude requests on a server-wide basis, while events are usually blocked on a service-by-service or service-group basis. (Locks can, however, be used in that way.)

11But this problem is equivalent to the halting problem and, therefore, unsolvable in the general case.

12It is possible to construct locks that are recursively acquirable and that only allow another thread to acquire them after they've been released as many times as they've been acquired. Such locks do an excellent job of hiding bugs. I recommend against them.

13The term "flag" means "boolean variable", and the term "event" means "service request".

14Where the handlers for preemptive services are void/0-ary functions, a processor's postponement queues can be represented as a service-indexed array of counters that don't roll over on overflow. In the case of CPUs, they are usually one-bit counters, i.e., flip-flops. Note that unblocking a service on a given client is somewhat analogous to requesting a nonblocking service from the client and doing other things while waiting for the completion of that request to be indicated by some client-generated event.


• Locks are shared among threads, while each thread has its own preemption mask.

• When a thread resumes, its new host's preemption mask gets set to the state that the thread's then-current host had when the thread last suspended. Some locks (specifically those of the class BlockLock) are similarly automatic, i.e., hardware causes threads to release them on suspension and reacquire them when resuming, say when the preemption mask is reset. Usually, however, when a thread suspends, it must explicitly release a lock (to allow other threads access to certain lock-protected data) and must inherit or explicitly reacquire that lock upon resumption.

• An event-blocking flag is never simply released. Rather, it is restored to its prior state, the one it had at its corresponding "acquisition" (i.e., the point where it got set by the current thread), which is not necessarily its most recent "acquisition" when we consider activity by other threads. (Because they are restored rather than directly "released," event-blocking flags are recursively acquirable.)

The fundamental principle of preemption blocking. In general, before acquiring a given lock:

• To avoid undefined behavior from a recursive request to acquire this lock, block all diverting services whose handlers might attempt to acquire this lock.

• To avoid self-deadlock, block all other preemptee-blocking preemptive services whose handlers might attempt to acquire this lock.15

• To avoid performance degradation, block all preemptive services whose handlers might attempt to acquire this lock if acquiring this lock involves busy waits.

That is, block each preemptive service whose handlers might attempt to acquire this lock, unless the lock is of high level and the service doesn't block its preemptees. Because their handling might as well wait for an event as for this lock, one might as well block all services whose handlers will surely attempt to acquire this lock. If the lock is of high level, however, blocking preemptee-nonblocking services that aren't sure to attempt to acquire the lock will needlessly diminish potential concurrency and prolong response times for these services. (But why take chances?)

Warning. Obviously, in a multiprocessor environment, even if all preemptive services are blocked whenever any lock is held, before a handler accesses a lock-protected resource, it must acquire that lock to block threads running on other processors from concurrently accessing that resource.

15Remember that the only preemptive services that don't block their preemptees are the nondiverting ones where the handler takes steps to restore the preemptee to the runnable state, e.g., putting the preemptee onto the ready queue.


Critical sections. Sections of code requiring mutual exclusion (i.e., monothreading) are often called critical. Locks are schedulers specifically designed to provide mutual exclusion by blocking the progress of certain threads. So, the layout of a critical section should be as follows:

   ...
   block appropriate events on the host and record their prior blockage status;
   acquire lock;
   ...
   critical section
   ...
   release lock;
   restore event blockage of host to status recorded above;
   ...

Implementation notes. Distinct threads are obviously distinct activities, but a preemptee and the preempting invocation of a handler are also distinct (and possibly unsynchronized) activities, even though in the case of diverting services the handler is in effect invoked by the preemptee.

If one activity reads the value of a non-atomic variable while another activity is updating that value, say after part but not all of the value's representation has been written, then the value read may be invalid and simply reading it may result in undefined behavior. Also, if one activity reads a variable written by another activity, the updated value may still be only in a register of the processor of the updating activity, and the first activity will read an obsolete value for that variable. Blocking and locking can prevent problems of the first kind, but problems of the second kind require help from the compiler and the underlying architecture. Instructions must be generated that flush register-resident variables and invoke memory-barrier instructions before and after invocations of lock service handlers.16
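A minimal sketch, assuming GCC-style extensions, of the two kinds of instructions just mentioned; a real lock implementation would emit them (or rely on separate compilation, per footnote 16) around its acquire and release handlers:

// Assumes GCC/Clang extensions: an asm statement with a "memory" clobber,
// and the __sync_synchronize() builtin.
inline void compilerBarrier() { asm volatile ("" ::: "memory"); } // force register-cached shared values to memory
inline void memoryBarrier()   { __sync_synchronize(); }           // full hardware fence (also a compiler barrier)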

The C and C++ standards leave undefined the behavior that ensues when a signal handler reads any global variable, writes to any global other than one of type volatile sig_atomic_t, or exits the handler of a potentially imprecise signal, e.g., SIGFPE, via return (rather than longjmp).17 Basically, the standards guarantee that a signal handler can set a flag, and that's all. In writing event handlers for a kernel, we must presume a great deal more of the language implementation than the C and C++ standards guarantee.

16Making shared variables "volatile" is neither necessary nor sufficient to handle these issues. On the other hand, separate compilation of the lock code is almost always sufficient to properly coordinate register-cached values. This situation is a cache-coherence issue if we view registers as a cache.

In some systems, the handler of a preemptive service can terminate the preemptee, but such preemptive thread termination18 is difficult to implement correctly — in addition to the possibility of incoherence in the preemptee's state, the preemptee may hold resources that must be released. It is usually better to have the preemptee simply give up its resources and wait in a pool of available threads that are waiting for new assignments.

5.2 Monitors

As mentioned above, a monitor is a monothreaded server, e.g., a server whose handler bodies are critical regions controlled by a server-specific lock. In such a server, known to its service routines as *this, mutual exclusion is enforced by acquiring the server's lock at each handler's entry and releasing it at each return.19

To facilitate acquisition and release of locks and the blocking and restoration of events, we apply the C++ idiom of acquisition via initialization20 and declare a class, called Sentry, whose constructor handles blocking and locking and whose destructor releases and restores.21 We then implement critical sections as follows:

#define EXCLUSION Sentry x(this);
   ...
   { EXCLUSION          // creates Sentry x, invoking Sentry(this)
      ...
      critical section
      ...
   }                    // Sentry x goes out of scope, invoking ~Sentry()
   ...

17Fortunately, the POSIX standard places sufficient requirements on both the operating system and the compiler that one can count on conforming implementations.

18The term "asynchronous thread termination" is more common for this notion.

19It's a good policy to structure code in such a way that the only locks are such "instance locks" for monitors.

20C++ gurus refer to this idiom via the slogan "Resource Acquisition Is Initialization" and via the acronym "RAII".

21The lock will be released and events restored whenever an unhandled exception unwinds the stack past the service routine where the lock was acquired. That's not necessarily a good thing, since the reason for locking might not go away when an exception occurs.

Here x is a Sentry, local to the critical section. Its declaration invokes Sentry's constructor, which will save the host's current preemption mask, block the events specified by the surrounding monitor's preemption mask, and acquire the surrounding monitor's lock. At the end of x's scope, i.e., the end of the compound statement, the Sentry's destructor, ~Sentry, will be automatically invoked. It will release the surrounding monitor's lock and restore the host's preemption mask to its prior state.

To make that work correctly, we implement Monitor as a base class consisting of a lock and a preemption mask that specifies which events get blocked while that lock is held. Monitors provide two services, lock() and unlock(), which, respectively, acquire and release the monitor's lock. The class Sentry is defined as follows:

class Sentry {                         // An autoreleaser for local lock.
  Monitor& mon;                        // Reference to local monitor.
  const unsigned old;                  // Old preemption-blockage status.
public:
  Sentry( Monitor* m )                 // m's argument is always "this", a
    : mon( *m ),                       // pointer to the surrounding monitor
      old( thisProcessor().events.block( mon.mask ) )
  {
    mon.lock();                        // Acquire the monitor's lock.
  }
  ~Sentry() {
    mon.unlock();                      // Release the monitor's lock.
    thisProcessor().events.set( old );
  }
};

We assume that thisProcessor().events.set(flags) sets the specified blocking flags on the client's host and that thisProcessor().events.block(flags) blocks the specified events in addition to those already blocked, and returns the former status of all flags. When mon.mask is zero, we can bypass the operations on thisProcessor().events.

We make old an attribute of Sentry, so that old contains the current host processor's preemption mask as of the Sentry's construction, i.e., when the lock-acquiring thread invoked the monitor's service routine. If old were an attribute of the monitor's lock, releasing a received lock would restore the preemption mask to its status as of the lock's most recent acquisition, possibly by another thread.
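For concreteness, a minimal sketch of the interfaces that Sentry relies on, matching the assumptions just stated; Lock is the abstract class of Section 5.1.1, and the Processor/EventMask names are placeholders for whatever the underlying system actually provides:

struct EventMask {                      // per-processor preemption-blocking flags
    unsigned flags;
    unsigned block(unsigned more) {     // block 'more' in addition to the current ones;
        unsigned old = flags;           // return the prior setting
        flags |= more;
        return old;
    }
    void set(unsigned newFlags) { flags = newFlags; }   // restore a saved setting
};

struct Processor { EventMask events; /* ... */ };
Processor& thisProcessor();             // the client's current host processor

class Monitor {
    Lock& mutex;                        // the monitor's instance-specific lock
public:
    const unsigned mask;                // events to block while that lock is held
    Monitor( Lock& l, unsigned eventsToBlock = 0 ) : mutex(l), mask(eventsToBlock) {}
    void lock()   { mutex.acquire(); }
    void unlock() { mutex.release(); }
};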

Exercise. Above, why pass this as a parameter? Why not simply initialize mon via mon( *this )?


Effective multiprocessing via multimonitor kernels. Monitors were introduced in 1974 by Hoare [HO] as a structuring mechanism for operating-system kernels and other concurrent programs, primarily to facilitate debugging and verification. The following quote from Ousterhout et al. [OU] makes the scalability-based case for accepting the overhead (in terms of complexity and processor load) of structuring kernels as multiple monitors rather than as a single ("monolithic") monitor:22

Many operating system kernels, including Unix, are single-threaded, which means that a single lock is acquired when a process calls the kernel and released when a process puts itself to sleep or returns to user state. ... The single-threaded approach simplifies kernel implementation by eliminating many potential synchronization problems between processes. Unfortunately, it does not adapt well to a multiprocessor environment. With more than a few processors, contention for the single kernel lock will limit system performance.

In contrast, the Sprite kernel is multithreaded, which means that several processes may execute in the kernel at the same time. The kernel is organized in a monitor-like style with many small locks, instead of a single overall lock, protecting individual modules or data structures. Many processes may execute in the kernel simultaneously as long as they do not attempt to access the same monitored code or data. The multithreaded approach allows Sprite to run more efficiently on multiprocessors, but the multiplicity of locks makes the kernel more complex and slightly less efficient since many locks may have to be acquired and released over the lifetime of each kernel call.

Since most kernels are written in C, which does not directly support object-oriented concepts, it is common for implementors to speak of multiple "locks" rather than multiple "monitors." In my opinion, an object-oriented approach such as the one used in these notes can significantly diminish the complexity cost of the multi-monitor approach.

At the current time, mainframe and commercial Unix systems have followed the approach suggested above. They scale well to, say, 64 or more processors and are very stable. Their performance is especially good for things like large databases and transaction processing. Large multiprocessor systems are not economical for most scientific processing, which can be distributed over a large loosely coupled network of low-cost one- or two-processor systems.

So far as I know, the various versions of BSD are still monolithic (single-threaded) and are very stable, but they don't scale well to multiple processors. To minimize overhead and complexity, Linux and Windows took ad hoc intermediate approaches, with poor results in terms of stability, scalability, and performance. Windows 2003 Server seems to have finally established reasonable stability and performance for Microsoft. Linux 2.6 is due out soon and is aimed at overcoming Linux's deficiencies in all three areas.

22Monothreaded kernels are often called "monolithic," because they consist of a single monitor.

5.2.1 Conditions

[Margin note: Develop conditions per the Pthreads model and then introduce monitors.]

A monitor provides mutually exclusive access to its data and can, with the help of conditions, efficiently delay the handling of service requests per various coordination/scheduling protocols. Monitor locks are usually of low level, so monitor service routines should be kept short — they should simply check and/or update the status of the monitored resource and then either leave the monitor (i.e., return from the service routine) or wait on a condition inside the monitor — in either case, they will quickly release the monitor's lock.

A condition is a scheduler having wait and signal services23 such that wait's handler doesn't return until there has been a subsequent signal request, after which one and only one of the condition's waiting client threads (waitors) resumes.24

Usually, conditions have a third service, called awaited(), that returns true if any threads are waiting on the condition, otherwise false. Alternatively, we might introduce a service, called waiting(), that returns the current number of waitors.25

Conditions should involve passive (high-level) waiting, and all long-term waiting should be on conditions rather than monitor locks. To wait on a particular condition, a thread requests the condition's wait service, whose handler puts the client's descriptor into a queue of threads waiting for this condition, suspends the thread, and resumes the highest-priority thread that is runnable, i.e., waiting on a special global condition, called ready, for a processor to become available. To implement various scheduling policies, wait takes an optional integer parameter (with default INT_MAX) specifying the waitor's resumption priority — the smaller the integer, the higher the priority. Ties are broken on a First-In/First-Out (FIFO) basis.
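Gathering the services described so far, the interface of Condition looks roughly as follows (signatures only; the implementation is deferred):

#include <climits>

class Condition {                        // a private subserver of some Monitor
public:
    void wait( int priority = INT_MAX ); // suspend until a later signal; smaller number = higher priority
    void signal();                       // make the highest-priority waitor runnable
    bool awaited() const;                // is any thread waiting here?
    int  waiting() const;                // alternative: how many threads are waiting
private:
    // priority queue of waiting threads, accessible only while the
    // surrounding monitor's lock is held
};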

Since a condition’s queue is a critical shared data structure, each condition isa private attribute (subserver) of a monitor and can be accessed only by a clientthread that holds the surrounding monitor’s lock, i.e., only from the bodiesof the monitor’s service routines. Of course, wait’s handler must release themonitor’s lock whenever a client waits on a condition; otherwise, the waitingclient would be self-deadlocked — any potential signaller would have to acquirethe lock, which is held by the waitor. Also, to ensure the monothreading of themonitor, wait’s handler must reacquire or implicitly receive that lock, when awaiting client thread resumes. Thus waitors comes out of their waits holding

23This is a new and different use of the term “signal,” in addition to the previous “softwareinterrupt” meaning. Note the previous use was as a common noun, while this second use isas a verb and/or proper noun (as the name of a service). It is not uncommon, however, torefer to an invocation of signal() as a “signal” and an invocation of wait() as a “wait.”

24A lock can be thought of as a condition that can remember one unawaited signal (i.e.,release) to be acquired by the next thread that waits (i.e., requests the acquire service).

25We often speak in terms of “occupying” a monitor, which we envision as a single-occupancyroom. We then view conditions as multiple-occupancy waiting rooms just outside the monitor.

5.2. MONITORS 59

the lock on the surrounding monitor, just as they held it when they invokedwait’s handler.

When a client thread requests a condition's signal service, the condition's highest-priority waitor becomes runnable. In this interaction, the client that requests the signal service is called the signaller and the resumed thread is called the signallee. Because the surrounding monitor is monothreaded, only one of the signaller and the signallee can subsequently hold the monitor's lock. So the handlers for wait and signal must implement a protocol as to which one gets to run and when. The following are the two primary possibilities:

• Hoare semantics: The signaller defers to the signallee by (implicitly) transferring the monitor's lock to the signallee and waiting on a hidden condition, commonly called urgent. The signallee runs, with no possibility of intervening access to the monitor by another thread.26 Whenever a thread waits or exits the monitor (i.e., returns from one of the monitor's service routines), it signals urgent if urgent has any waitors; otherwise, the thread simply releases the monitor's lock.

• Mesa semantics: Upon resumption, the signallee requests the lock, possibly in competition with other threads. The signaller eventually waits or exits the monitor, thereupon releasing the lock but with no guarantee that the signallee will be the next to acquire it. In such a case, there is no need for the signaller to hold the lock when signalling, i.e., "naked signalling" is allowed.

Java uses Mesa semantics and follows Mesa in the practice of using the term "notify" for "signal." It:

Wakes up a single thread that is waiting on this object's monitor. If any threads are waiting on this object, one of them is chosen to be awakened. The choice is arbitrary and occurs at the discretion of the implementation. A thread waits on an object's monitor by calling one of the wait methods.

The awakened thread will not be able to proceed until the current thread relinquishes the lock on this object. The awakened thread will compete in the usual manner with any other threads that might be actively competing to synchronize on this object; for example, the awakened thread enjoys no reliable privilege or disadvantage in being the next thread to lock this object. [http://java.sun.com/j2se/1.3/docs/api/java/lang/Object.html#notify()]

Similarly, the POSIX Pthreads extension to C includes condition variables that employ Mesa semantics. However, since C lacks object-oriented conveniences such as constructors, a Pthreads condition receives its lock (a.k.a. mutex) as a parameter of wait — if concurrent waitors specify different locks, the resulting behavior is undefined. Ada uses a variant of Hoare semantics, but with some clever compiler optimizations.

26Passing the processor along with ownership of the monitor avoids priority inversions where a higher-priority signaller must wait for a lower-priority signallee to release the monitor's lock.
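A minimal sketch of the Pthreads interface just described, using a producer/consumer count guarded by a mutex: the condition receives its mutex as a parameter of wait, the wait releases and reacquires that mutex, and the Mesa-style semantics (plus possible spurious wakeups) make the predicate-recheck form mandatory.

#include <pthread.h>

static pthread_mutex_t m        = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
static int itemCount = 0;                   // the predicate's data, protected by m

void consumeOne() {
    pthread_mutex_lock(&m);
    while (itemCount == 0)                  // recheck form: a signal is only a hint
        pthread_cond_wait(&nonempty, &m);   // releases m while waiting, reacquires before returning
    --itemCount;
    pthread_mutex_unlock(&m);
}

void produceOne() {
    pthread_mutex_lock(&m);
    ++itemCount;
    pthread_cond_signal(&nonempty);         // Mesa-style: the signallee will compete for m
    pthread_mutex_unlock(&m);
}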

5.2.2 Structuring via Monitors

Each publicly available service has explicit or implicit pre-conditions (requirements) and post-conditions (promises).27 The specification for a service is that if the pre-conditions hold before the service is requested, then the post-conditions must hold when the service routine returns.

An invariant for a server is any statement that is a pre- and post-condition for each of its services. While a request is being serviced, however, invariants may be false. The purpose of a monitor's instance-specific lock is to guarantee that the servicing of another request does not begin at such a point. So, every monitor service routine (including the constructors but excluding the destructor) must make sure the monitor's invariant is true whenever that routine releases or transfers the monitor's lock:

• before returning from any service routine, including the constructor but excluding the destructor,

• before each invocation of wait(),

• before each invocation of signal() under Hoare semantics.

Then the monitor’s invariant will hold every time the monitor’s lock is acquiredor received:

• on entering any service routine, including the destructor but excluding the constructors,

• after each return from wait(),

• after each return from signal(), even under Hoare semantics.

It is a good idea to provide a service, okay(), that checks the object's invariants via assert statements. The expression mon.okay() can then be included in the constructor and destructor for Sentry, so that it will be invoked at the beginning and end of each monitor service.

[Margin note: Example, please.]
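As a sketch of that idea: one (hypothetical) arrangement is to give Monitor a virtual "bool okay() const" that defaults to returning true, have each derived monitor override it, and have Sentry's constructor and destructor call mon.okay() while the lock is held. A derived monitor then looks something like the illustrative BoundedBuffer below.

#include <cassert>

class BoundedBuffer /* : public Monitor */ {  // a monitor with a checkable invariant
    enum { limit = 8 };
    int count;                                // number of full slots
public:
    BoundedBuffer() : count(0) {}
    bool okay() const {                       // the invariant: 0 <= count <= limit
        assert( 0 <= count && count <= limit );
        return true;
    }
    // ... service routines, each beginning with EXCLUSION ...
};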

27Unfortunately, there are two distinct but related meanings of the term "condition":

• It designates an instance of the class Condition (a.k.a. a condition variable).

• It designates a statement (a.k.a. predicate) about the requesting client, the server's attributes, and possibly other aspects of the system of threads and servers that may be true or false — preferably expressed as a boolean expression that is programmatically checkable.

Pre-conditions and post-conditions are conditions in the second sense. In general, we rely on context to resolve this ambiguity.


Post-conditions on waits. Waiting is always done for a reason that is expressed as a boolean expression and called the wait's predicate or post-condition. So a request to wait always occurs in one of the following two contexts:

• while ( !predicate ) condition.wait(); // recheck

• if ( !predicate ) condition.wait(); // non-recheck

Predicate-recheck waiting (i.e., the while form) can be viewed as a busy wait for the predicate to become true, but with the invocation of wait slowing down the rechecking. An invocation of signal is then a hint that something significant may have changed, so it is time to recheck the predicate. It doesn't guarantee that the predicate is true.

In many implementations of threads and conditions, a waiting client times out after waiting for a specified period. At that point, the client receives an artificial signal from the system, but that signal doesn't guarantee that the condition's predicate is true, that the lock has been transferred, or that the invariants are true. Also, if a timeout occurs just as a thread is being signalled, another waitor may receive a "spurious signal." To diminish overhead, Pthreads explicitly allows for spurious signals.

Under Mesa semantics, even if the predicate holds when a signal is issued, there is the possibility that some other thread will beat the signallee to the monitor's lock and change the state of the monitor's attributes in such a way that the predicate no longer holds. It is, therefore, the responsibility of the signallee to see to it that the predicate holds before proceeding, i.e., the signallee must recheck the predicate and perhaps wait again. Also, the signaller has no responsibility to establish the predicate — rather, under Mesa semantics, a signal is merely a hint that it might be a good time to recheck the predicate. Similarly, in the presence of possible spurious signals, predicate rechecking must be used.

Under Hoare semantics, each signaller makes sure that the predicate holds before signalling, so that each signallee can assume that the predicate still holds upon returning from an invocation of wait(). In such a case, either form of waiting works fine.

Some experts recommend predicate rechecking whenever possible — the extra check is redundant under Hoare semantics, but it costs very little and is a good defense against careless signallers and spurious signals.28

[Margin note: Mention systems-engineering tradeoffs. Mention difficulty in doing correctness proofs under non-deterministic signalling. Mention that monitors may or may not be a language feature vs. a library feature.]

Notes and comparisons. On monoprocessor systems under Mesa semantics, whenever the lock gets released, priorities must be re-evaluated and the signallee must get higher priority than the signaller; otherwise, the signallee will wait at least until the signaller's time slice ends.

28Under Hoare semantics, some programmers prefer

if ( !predicate ) condition.wait(); assert( predicate );

to discover bugs as soon as possible rather than hide them. The question of whether it is better to find and eradicate bugs or to build safeguards against them is an old debate, and the appropriate answer obviously depends on circumstances.

On multiprocessor systems under Hoare semantics, there is a possibility for signaller/signallee priority inversions, where a signaller waits for a signallee of lower processor-priority29 to signal urgent. The signallee's processor-priority should therefore be boosted while it holds that lock. Also, to avoid the overhead of context switching, urgent can be implemented as a spinlock, and signallees can simply be made runnable (at boosted processor-priority).

Hoare semantics has the advantage of a certain determinacy in that there is no question of which thread runs next when a signal occurs. Under Mesa semantics, the non-determinacy associated with competition for the lock can lead to rude surprises, e.g., the possibility of indefinite postponement. Hoare's benefit, however, is bought at the price of two context switches per signal, compared to none or one for Mesa semantics, depending on the relative processor-priorities of the signaller and signallee.

Mesa signalling can be made deterministic by giving each monitor a signal-in-progress flag and a Condition (or secondary lock) on which entering threads can defer to any signallee that gets beaten to the lock.

Special-case optimization. In cases where the signaller or the signallee immediately departs the monitor after a signal, there is no behavioral difference between Hoare semantics and Mesa semantics, and some optimizations are possible.

• Tail signalling: A tail signal is the case where an invocation of signal is the signaller's final action before exiting (returning from) the monitor.

   ...
   whatever.signal();
   return;
   ...

Such is almost always the case in practice — e.g., in all of our examples. In such cases, implementations of Hoare semantics can be optimized by omitting the wait on urgent, and, when the condition is awaited, implementations of Mesa semantics can implicitly transfer the lock to the signallee, thus eliminating a lock release and re-acquisition.

There is still the issue of who gets the processor. The signaller can either:

– pass its current processor to the signallee and wait on the ready queue — thus prematurely aborting the signaller's timeslice but preventing the lock from being hoarded by a suspended signallee.

29A thread’s processor-priority is the priority that the thread has in competing for a proces-sor. Obviously, the highest possible processor-priority is to hold a processor whose preemptionis blocked.


– boost the signallee’s processor-priority, whereupon:

∗ on multiprocessor systems, the signallee quickly acquires its own processor.

∗ on uniprocessor systems, unless the signaller has boosted its own priority by blocking preemption, the signallee preempts the processor of the signaller, which holds the lock — not a good thing.

• Tail waiting: Often, all data-manipulation code after the final invocation of wait in a monitor's service routine can be moved to just before the corresponding invocation(s) of signal.30 That wait is then the last operation in that service routine and is, therefore, called a tail wait.

...

if ( !<predicate> ) whatever.wait();

return;

...

When a tail-waiting signallee is resumed, it need not re-acquire the monitor's lock since it will immediately exit the monitor anyway. In implementations of Ada, the compiler transforms a wait into a tail wait by executing the signallee's post-wait code on the signaller's stack.

By definition, predicate rechecking cannot occur under tail-waiting. So, tail-waiting should not be used where there is a possibility of spurious signals and violations of the wait's post-condition have serious consequences.

5.2.3 Broadcast

Some conditions provide a broadcast service, which works like the signal service, except that broadcast resumes all threads waiting on that condition. Under Hoare semantics, the equivalent of broadcast can be faked by following each call to wait with a call to signal to resume the Condition's next waitor.

...

if ( !<predicate> ) whatever.wait();

whatever.signal();

...

At best, this daisy-chain implementation is ugly. Under Mesa semantics, however, this daisy-chain approach can lead to indefinite postponement — if a thread waits on a condition while a signal is propagating down the list of waitors, that late-arriving thread will be resumed even though it arrived after the simulated broadcast began, so, in principle, that signal could propagate forever.

30It is tedious, however, to give the signaller access to the parameters of the signallee's monitor call — the Ada programming language enlists the compiler in dealing with this issue.


Note that broadcast's implementation can simply extract the condition's queue of waitors and append it to the ready queue. Normally, the broadcastees become serialized as each in turn re-acquires the monitor's lock. However, if the broadcastees are tail-waiting, such serialization can be optimized away. (Note the performance tradeoff between allowing spurious signals and allowing tail waiting, which seem to be incompatible optimizations.)

Exercise. Some authors and systems take the view that signal is simply a specialized, optimized form of broadcast and that all code should be written so that it would work correctly if every invocation of signal were replaced by an invocation of broadcast. Discuss the implications of that view. Do you agree? Why or why not?

Exercise. The most obvious and efficient implementation of broadcast is to extract the condition's queue of waitors and append it to the ready queue. Is the following a suitable implementation of the broadcast operation for the class Condition for cases where signal() is the only access to that queue?

void broadcast() {

for( int i = 0; i != waitorCt; ++i ) signal();

}

What difference, if any, does it make whether we use Hoare or Mesa semantics here? Assume FIFO scheduling on urgent.

5.2.4 High-level locks, an example.

Generally, one must include accesses to both the data on which a thread bases its decision to wait and the Condition on which waiting takes place in the same critical section. For example, suppose we want to control access to a single resource, such as a printer, via a boolean variable called owned. To request the printer, a client checks owned. If owned is true, the client waits on a Condition

called available. Upon resumption, the client rechecks owned. If, on the other hand, owned is false, the client simply sets owned to true and exits the monitor.

To release the printer, a thread resets owned to false and signals available. But, if that release occurs between the time a requesting client finds owned to be true and the time that requestor waits on available, then there is a lost-signal problem, and the requestor waits in vain. So, checking owned, waiting on the Condition, and setting owned to true must occur in the same critical section, but the lock must be released for the duration of the wait. Similarly, resetting owned to false and signalling the Condition should both occur in another segment of the same critical section, which is protected by that same lock.

The monitor just described serializes access to the printer and involves passive waiting. In other words, it's a high-level lock. Of course, we need some form of mutual exclusion (e.g., low-level locks) before we can implement Conditions, since a Condition's handlers for wait and signal need exclusive access to the Condition's list of waiting threads; otherwise, the integrity of that list is likely to be destroyed. Once we have implemented low-level locks, however, we can implement high-level ones as monitors per the description above and the C++ code below:31

class HighLevelLock : Monitor {

bool owned;

Condition available;

public:

void acquire() {

EXCLUSION

while ( owned ) available.wait(); // recheck the predicate after each resumption

owned = true; // claim the resource

}

void release() {

EXCLUSION

owned = false; // give up the resource

if ( available.awaited() ) available.signal(); // wake one waiting acquirer (a tail signal)

}

HighLevelLock()

: available(this), // conditions need a monitor ref.

owned(false)

{}

};

Notes. This implementation involves only tail signalling, which, under Hoare semantics, eliminates the signaller's need to wait on urgent. It uses predicate rechecking to support Mesa semantics and to tolerate spurious signals. If predicate rechecking were dropped (and the setting of owned were moved into the releaser), the implementation would involve only tail waiting, thus eliminating the need for the signallee to re-acquire the lock, even implicitly.

31HighLevelLocks are also known as binary semaphores.


Under Mesa semantics, this straightforward implementation of HighLevelLock is subject to indefinite postponements — it lacks the bounded-waiting property that is required of properly implemented locks. A signallee might lose the competition for the low-level lock to a late-arriving client and have to wait again on available. Statistically, we can expect each requesting client to eventually acquire the HighLevelLock, but that client might have to retry an unbounded number of times before succeeding.

Exercise. We have mentioned the need to release the monitor's lock during a wait on a condition, but we didn't mention unblocking whatever events are blocked. Why not? Hint: the fact that this situation appears in this exercise is a good indication that "a simple oversight" is not the correct answer.

5.2.5 Thread Safety via Monitor Encapsulation

There are two ways to use monitors to make data structures thread safe. The first of these is to encapsulate the data structure as a member or a base object of a monitor. In the following example, producer threads append items to a thread-safe queue, and consumer threads remove them. In this case, the monitor's lock is the queue's main scheduler, and two conditions, nonempty and nonfull, are auxiliary schedulers. The scheduling protocol is not only mutually exclusive, but involves additional delays, per the conditions.

template< class Item >

class ThreadSafeQueue : Monitor, Queue<Item> {

Condition nonempty;

Condition nonfull;

public:

ThreadSafeQueue( int size )

: Queue<Item>(size),

nonempty(this),

nonfull(this)

{}

void append( Item& x ) {

EXCLUSION

while ( Queue<Item>::full() ) nonfull.wait();

Queue<Item>::append(x);

nonempty.signal();

}

Item& remove() {

EXCLUSION

while ( Queue<Item>::empty() ) nonempty.wait();

Item& x = Queue<Item>::remove();

nonfull.signal();

return x;

}


};

Often, we have many producer-consumer streams that share a pool of free buffers. Each producer repeatedly acquires a free buffer, fills it, and then appends it to a particular stream for the consumers.

template< class Buffer >

class BufferAllocator : Monitor, Queue<Buffer> {

Condition nonempty;

public:

BufferAllocator( int size )

: nonempty(this),

Queue<Buffer>(size)

{ // Put all buffers into this Queue.

}

Buffer& get() {

EXCLUSION

while ( Queue<Buffer>::empty() ) nonempty.wait();

return Queue<Buffer>::remove();

}

void put( Buffer& x ) {

EXCLUSION

Queue<Buffer>::append(x);

nonempty.signal();

}

};

In practice, monitors of the above classes might be used as follows. Suppose that anotherBuffer is a BufferAllocator and that producerConsumer is a ThreadSafeQueue of Buffers. A producer thread would involve a loop of the following form:

while ( ! finished() ) {

Buffer& x = anotherBuffer.get();

... fill buffer x ...

producerConsumer.append(x);

}

A consumer thread would involve a loop of a complementary form:

while ( ! finished() ) {

Buffer& x = producerConsumer.remove();

... process buffer x ...

anotherBuffer.put(x);

}


Note. If one of the consumers is broken or runs slowly relative to its producer(s), then its producerConsumer stream tends to acquire all of the buffers.32 There are two approaches that help:

• Use the number of buffers a stream has already acquired as its producers' priority when they wait for a free buffer.

• For each stream, set a maximum number of buffers. When a stream already holds that many, make its producers wait (a sketch follows this list).
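The second approach can itself be packaged as a small monitor. The following is a minimal sketch (the class and its names are assumptions, not part of the buffer classes above): producers for a stream call noteAcquire before taking a free buffer, and consumers call noteRelease after returning one.

class BoundedStream : Monitor {
  int inUse; // buffers currently held by this stream
  const int limit; // maximum buffers this stream may hold
  Condition belowLimit;
public:
  BoundedStream( int max ) : inUse(0), limit(max), belowLimit(this) {}
  void noteAcquire() { // producer calls this before taking a free buffer
    EXCLUSION
    while ( inUse == limit ) belowLimit.wait();
    ++inUse;
  }
  void noteRelease() { // consumer calls this after returning a buffer to the pool
    EXCLUSION
    --inUse;
    belowLimit.signal();
  }
};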

5.2.6 Thread safety via monitor-based schedulers

The second way to provide thread safety via monitors is to equip a class of servers with per-instance schedulers that are monitors implementing a scheduling protocol other than serial access (i.e., mutual exclusion), preferably one that allows greater concurrency. In such a case, the scheduled servers are not inside the monitors.

Suppose, for example, that we have a server that can serve multiple concurrent requests for services that only read data but whose writing services require exclusive access. Such a server requires a scheduler — call it a sharable lock — that behaves like a lock but has a third state, shared, in addition to the standard states of available and owned. A sharable lock can be simultaneously shared by any number of client threads and cannot become owned while it is shared. Such a scheduler is said to implement the Concurrent-Read/Exclusive-Write (CREW) protocol and to solve the readers/writers problem.

[Margin note: Mention the fallacy in Lampson's unprotected access in the case where values are not atomic; he really needs a sharable lock.]

To prevent indefinite postponement, we alternate access between sharers and excluders (i.e., owners). When the sharable lock is available, any thread can immediately acquire exclusive access (i.e., ownership) or shared access by incrementing the access count and setting the lock's state to owned or shared, accordingly.

To release ownership, a client thread decrements the access count and sets the state to shared and resumes all aspiring sharers, if there are any; otherwise, it sets the state to owned and resumes the first aspiring owner, if there is one; otherwise it sets the state to available and returns.

To release shared access, a client thread decrements the access count. The last sharer to release its shared access (i.e., the one who sets the access count to zero) sets the state to owned and resumes the first aspiring owner, if there is one; otherwise it sets the state to available and returns — there should be no aspiring sharers at this point.

class SharableLock : Monitor {

enum{ available, owned, shared } state;

int accessCount;

32Optimal allocation of pooled buffers among producer/consumer streams is an example of what is called the doctors' waiting-room problem. (Cite Gopinath.)


Condition okToShare;

Condition okToOwn;

public:

SharableLock()

: accessCount(0),

state(available),

okToShare(this),

okToOwn(this)

{}

void acquire( bool exclusive = true ) {

EXCLUSION

if ( exclusive ) {

if ( state != available ) {

okToOwn.wait();

} else {

state = owned;

}

} else {

if ( state == owned || okToOwn.awaited() ) {

okToShare.wait();

} else {

state = shared;

}

}

++accessCount;

}

void release() {

EXCLUSION

--accessCount;

if ( state == owned ) {

if ( okToShare.awaited() ) {

state = shared;

okToShare.broadcast();

} else if ( okToOwn.awaited() ) {

state = owned;

okToOwn.signal();

} else {

state = available;

}

} else {

assert( state == shared );

if ( accessCount == 0 ) {

if ( okToOwn.awaited() ) {


state = owned;

okToOwn.signal();

} else if ( okToShare.awaited() ) {

assert( false ); // should never happen!

} else {

state = available;

}

}

}

}

bool tryToAcquire( bool exclusive = true ) {

EXCLUSION

if ( exclusive ) {

if ( state != available ) {

return false;

} else {

state = owned;

}

} else {

if ( state == owned || okToOwn.awaited() ) {

return false;

} else {

state = shared;

}

}

++accessCount;

return true;

}

} ;

The tryToAcquire service is similar to acquire but returns false rather than waiting. It has been added for later use.

[Margin note: Perhaps we can use two flags for owned and shared and use a simple testAndSet to try to acquire. Use tail waiting, and try to prevent awakened readers from any serial dealing with the SharableLock, except to decrement the count as they leave.]

Exercise. The above implementation of SharableLock is visibly based on Hoare semantics and would not work correctly in the presence of spurious signals. Generate a solution based on Mesa semantics that works correctly in the presence of spurious signals.

Exercise. Automatic locking and unlocking (which we implement via the EXCLUSION macro) give monitors convenience and safety advantages over other scheduled servers. Some authors, however, consider the fact that monitor-based schedulers, such as SharableLocks, do not inherit those advantages to be a major weakness of monitors, e.g.:


A major weakness of monitors is the absence of concurrency if a monitor encapsulates the resource, since only one thread can be active within a monitor at a time. In the [SharableLock] example ..., to allow concurrent access for readers, the resource ... is separated from the monitor (it is not local to the monitor). For proper synchronization, procedures of the monitor must be invoked before and after accessing the shared resource. This arrangement, however, allows the possibility of threads improperly accessing the resources without first invoking the monitor's procedures. ([SS], page 24)

Mitigate this "major weakness" by extending the Sentry/EXCLUSION syntactic sugar to handle acquisition and release of SharableLocks. First, define a base class Mulitor that is to SharableLock as Monitor is to Lock. Then, define the macro SHARED to acquire shared access the way that EXCLUSION acquires exclusive access. Finally, redefine Sentry to work with Mulitors as well as Monitors. Note that Mulitors must have two event masks. One mask blocks all events whose handlers might try to acquire shared or exclusive access; they must be blocked while the lock is owned. The other mask blocks all events whose handlers might try to acquire exclusive access; they must be blocked while the lock is shared or owned.

Exercise. Implement a thread-safe version of the STL container class map by encapsulating map in a Mulitor, in a manner similar to the way we encapsulated Queue in a Monitor. Insertions and deletions require exclusive access, while lookups and the size operation allow shared access.

5.2.7 An Example Using Prioritized Waiting

A standard scheduling problem is that of allowing a thread to suspend itself until a specified time. Many kernels, for example, have special mechanisms for performing delayed actions, called "delayed procedure calls" (DPCs) or "bottom halves", but instead we can simply create a thread that suspends itself until the appropriate time, performs that delayed action, and then exits.

In the following implementation of such timed resumption, sleeping threads wait on a condition, in order of requested wake-up time. When a sleeping thread's wake-up time arrives, it resumes, sets alarm to INT_MAX, and awakens the next sleeper, who resets alarm and goes back to sleep:

class AlarmClock : Monitor {

int now;

int alarm;

Condition wakeUp;

public:


AlarmClock()

: now(0),

alarm(0),

wakeUp(this)

{}

void tick() {

EXCLUSION

if ( ++now >= alarm ) wakeUp.signal();

}

void wakeme(int myTime) {

EXCLUSION

while ( now < myTime ) {

alarm = min(alarm,myTime); // reset the alarm.

wakeUp.wait( myTime ); // prioritized.

}

alarm = INT_MAX;

wakeUp.signal(); // next thread will set own alarm.

}

int getTime(void) {

EXCLUSION // required, unless ints are atomic.

return now;

}

};

Typically, the timer's interrupt handler invokes tick(). Note, however, that there is a potential wrap-around problem with the value of now. Note also that we can achieve slightly more concurrency than the implementation above by deriving AlarmClock from SharableLock and treating getTime as a read (i.e., shared) operation. Finally, if integers are atomic, getTime doesn't need a lock.33

Exercise. The above implementation presumes Hoare semantics. Design a solution that works for Mesa semantics. Pay particular attention to the possibility that an awakened sleeper may not be the next thread to acquire the lock and the possibility of clusters of spurious signals.

5.3 Semaphores

With conditions and locks, the order in which operations are performed makes a difference — if, on a condition, you wait for something after someone else has released it, you might wait forever. Dijkstra introduced the notion of a semaphore, which is another class of scheduler having acquire and release operations.34 Essentially, a semaphore is a generalized lock that allows n-fold acquisition (to allocate from a pool of n resources), where n is the initial value of the semaphore's counter. When there are n unreleased acquisitions, subsequent requests to acquire will wait.

[Margin note: Find reference for Dijkstra.]

33Not really — we still need to worry about register-resident values, memory barriers, etc.

When a semaphore is created, its counter is initialized to a specified integer, say n. It is an invariant of semaphores that, at any time, the number of returns from acquire shall be no greater than n plus the number of calls to release. As in the case of conditions, it is generally understood that semaphores involve only high-level (i.e., passive) waiting. With semaphores, release and acquire are commutative, which eliminates race conditions35 between the requestors and the releasers.

Semaphores as special monitors. Semaphores can be implemented as very simple monitors. Instead of the flag used to implement HighLevelLock, Semaphore uses counters.

class Semaphore : Monitor {

int count;

int releaseCt;

int requestCt;

Condition available;

public:

Semaphore( int n, unsigned mask = 0 )

: Monitor(mask),

count(n), // initialize count

available(this)

{}

void acquire() {

EXCLUSION

while ( count == 0 ) available.wait();

--count; // take one

}

void release() {

EXCLUSION

++count; // return one

available.signal();

}

};

34Dijkstra referred to acquire and release as P (for "proberen" — test) and V (for "verhogen" — increment), respectively.

35A race condition is a situation where operations on an object are significantly noncommutative — different orders lead to significantly different results, some of which violate the invariants of the system.

This implementation involves only tail signalling, which under Hoare semantics eliminates the signaller's wait on urgent. It uses predicate rechecking to support Mesa semantics and to tolerate spurious signals.

Locks as special semaphores. Visibly, a semaphore initialized to one satisfies the specification for a lock and is, therefore, a (high-level) type of lock. But, violations of a lock's requirements on client behavior have predictable consequences for such locks:

• A gratuitous release is remembered and acquired by the next requesting client.

• Recursive requests to acquire lead to self-deadlocking.

Thus, not all locks fulfill the semantics of a semaphore initialized to one.

Semaphore-based implementation of monitors, sentries, and conditions. One can implement monitors, conditions, and sentries by using semaphores initialized to zero to simulate queues.36 Thus, semaphores are equivalent to monitors in their ability to solve thread-coordination problems. Semaphores, though conceptually simple, tend to lead to badly structured solutions, as does the unstructured use of goto statements. The key advantage of using monitors is that it simplifies design, debugging, and verification of applications involving concurrency.

The following implementation of monitors, conditions, and sentries is adapted from [HO] and adheres to Hoare semantics for signalling. It simulates a condition via an integer, count, inside the monitor and a zero-initialized semaphore, q, outside the monitor. We use the condition's count to keep track of the number of waitors, so that we never release an empty condition semaphore.

class Monitor : Semaphore {

unsigned mask; // event mask while lock owned

Semaphore urgent;

int urgentCount;

friend Sentry;

friend Condition;

public:

36This implementation is important, because many operating systems make semaphores available to user threads.


Monitor( unsigned x = 0 )

: Semaphore(1), // the base semaphore is a lock

mask(x),

urgent(0), // urgent behaves like a condition

urgentCount(0)

{}

};

class Sentry { // An autoreleaser for the monitor’s lock.

Monitor& mon; // refers to surrounding monitor

unsigned old; // old event status

public:

Sentry( Monitor* m )

: mon(*m), // init mon via argument

old( thisProcessor().events.block( mon.mask ) )

{ mon.acquire();

}

~Sentry() {

if ( mon.urgentCount > 0 ) {

mon.urgent.release();

// Transfer lock to a deferring signaller

} else {

mon.release(); // Release lock

thisProcessor().events.set( old ); // Restore mask

}

}

};

class Condition {

Monitor& mon; // Reference to surrounding Monitor

int count;

Semaphore q; // Semaphore-simulated queue

public:

Condition( Monitor* m )


: mon(*m),

count(0),

q(0)

{}

int waiting() { return count; }

bool awaited() { return count > 0; }

void wait( int pr = INT_MAX ) {

++count;

if ( mon.urgentCount > 0 ) {

mon.urgent.release();

// Transfer lock to a deferring signaller.

} else {

mon.release(); // Release lock.

}

q.acquire(pr); // No lost signal here! Why?

--count; // Receive lock from signaller.

}

void signal() {

if ( count > 0 ) {

++mon.urgentCount;

q.release(); // Transfer lock to signallee.

mon.urgent.acquire(); // Defer to signallee.

--mon.urgentCount; // Receive lock from signallee.

}

}

};

To wait on a Condition, a thread increments the Condition's count, gives up mon's lock, and waits by acquiring q. The Condition's count keeps track of the total number of waitors for the condition, including those that have given up mon's lock but not yet acquired q. Suppose that, after a waitor releases mon's lock but before q has been acquired, another thread enters mon and signals this semaphore-simulated condition. The semaphore, q, will remember that release, and the waitor will find that unconsumed release waiting when it finally issues a request to acquire q — no lost-signal problem. In fact, when q.count is nonzero, its value is the number of waitors that have such unconsumed releases awaiting them.

Note that this implementation does not reliably enforce priorities; a release-simulated signal may arrive at a condition after the highest-priority waitor has released the monitor's lock but before it has acquired the semaphore. In that case, the highest-priority thread already waiting on the semaphore will be the signallee.

Exercise. Would the following implementation of awaited() behave correctly:

bool awaited() { return q.awaited(); }

where q.awaited() tells whether the queue of the semaphore q is nonempty?

Exercise. Give an example showing what would go wrong if the counts associated with conditions or with urgent were eliminated in the above implementation.

Exercise. Note that, under the above semaphore-based implementation of monitors and conditions and sentries, a thread could miss its alarm in an AlarmClock monitor. How could that happen? How bad is that? Explain the difficulties in implementing prioritized waiting when conditions are implemented via semaphores as described above. Suggest some possible solutions.

Exercise. Given that our implementation of Semaphore tolerates spurious interrupts, would our implementation of SharableLock tolerate spurious interrupts if we used the above semaphore-based implementation of monitors and conditions? Why or why not?

5.4 Coordination via waiting for message replies

The blocking form of message passing, where the sender waits for a reply from the receiver, can be used to schedule threads, i.e., delay them per some scheduling protocol. In Subsection 4.1, on page 44, we showed how to turn passive servers, such as the C++ monitors that we've discussed so far, into active servers whose services are requested via messages. That transformation embedded the passive server into a multithreaded process. But, a thread or a monothreaded process is already a monitor, so it is natural to ask whether an arbitrary passive monitor can be emulated via a single thread and message passing.

Emulating conditions via threads. To wait on a thread-emulated "pseudo-condition," a client sends the pseudo-condition a wait message specifying the client's identity and a priority. The client then waits for a reply. The pseudo-condition copies that message into a local priority queue of messages. To signal this pseudo-condition, a client sends it a signal message, whereupon the pseudo-condition extracts the first message on its priority queue and replies to the message's waiting sender.
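A minimal sketch of such a pseudo-condition's service loop follows; the message-passing primitives (receive, reply) and the Message layout are assumptions for illustration, not the interface of Subsection 4.1.

#include <map>

struct Message { int sender; int priority; bool isWait; }; // assumed layout
extern Message receive(); // assumed: block for the next request
extern void reply( int client ); // assumed: send the postponed reply

void pseudoCondition() {
  std::multimap<int,int> waitors; // priority -> waiting client; FIFO within a priority
  for (;;) {
    Message m = receive();
    if ( m.isWait ) {
      waitors.insert( std::make_pair( m.priority, m.sender ) ); // park the client; no reply yet
    } else if ( !waitors.empty() ) { // a signal message
      reply( waitors.begin()->second ); // resume the highest-priority (smallest-numbered) waitor
      waitors.erase( waitors.begin() );
    } // signalling an empty pseudo-condition is a no-op
  }
}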


Emulating semaphores via threads. To obtain a class of threads that emulates semaphores, we modify pseudo-conditions as follows:

• Change the names of wait and signal to acquire and release, respec-tively.

• Maintain a count of unconsumed release-messages: the number of release-messages received minus the acquire-messages replied to. If an acquire-message arrives when the count is positive, reply immediately and decrement the count. We can initialize that count, just as we initialize the counts in real semaphores.

Emulating passive monitors via threads. A thread that services remote procedure calls is a server. Specifically, it is an active monitor and can serve as a proxy for a passive monitor. But, we must rewrite that passive monitor to use tail waiting, i.e., so that every wait is followed by an immediate return. (See Subsection 5.2.2 on page 63.) The handlers for wait and signal are modified so that:

• When a service routine of the passive server requests wait on a condition, a record with the current remote client's identity gets inserted into a priority queue, the condition's waitor list. Because the passive server uses only tail waiting, the service routine does an immediate return to the proxy, which is programmed to forgo (rather, postpone) the normal RPC reply.

• Signal's handler extracts the first record from the condition's record queue and sends the postponed RPC reply to the remote client identified in that record.

Coordination via System Calls. Kernels can allow users to request the creation of kernel-based semaphores, locks, and/or (naked) conditions whose services can be requested via system calls. One can efficiently implement semaphores via locks and naked conditions in shared-memory implementations. The same is not necessarily true for kernel-based mechanisms, where system calls are expensive, e.g., acquiring the semaphore's lock and waiting on the semaphore's condition would require two distinct kernel calls. Moreover, only kernels can prevent concurrency-diminishing preemptions while that lock is held.

Commonly, kernels make available semaphores for coordination among threads that don't share address space. Locks can be implemented as semaphores initialized to one. A condition can be implemented as a zero-initialized semaphore plus a waitor counter, as in Hoare's implementation of monitors via semaphores.
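For instance, POSIX named semaphores are one widely available kernel-backed mechanism of this kind (an illustration only; the text does not commit to any particular system-call interface):

#include <fcntl.h> // O_CREAT
#include <semaphore.h>

void useKernelSemaphore() {
  sem_t* s = sem_open( "/printer", O_CREAT, 0600, 1 ); // named, kernel-backed; initial count 1 makes it a lock
  sem_wait( s ); // acquire -- may block in the kernel
  /* ... critical section ... */
  sem_post( s ); // release
  sem_close( s );
}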

5.5 History and Perspective

Most of the insights and examples in this chapter come from Hoare's paper [HO]. Several, however, come from [LA2], which introduced Mesa semantics. Finally, a number of insights came from postings on comp.programming.threads, especially those of Dave Butenhof and Alexander Terekov.

Hoare [HO] introduced monitors as an object-oriented language feature for organizing operating systems. The language was Pascal with some object-oriented features from Simula67 grafted on. Lampson and Redell [LA2] applied those ideas to the development of the Pilot operating system in the Mesa language and concluded:

When monitors are used in real systems of any size, however, a number of problems arise which have not been adequately dealt with: the semantics of nested monitor calls; the various ways of defining the meaning of wait; priority scheduling; handling of timeouts, aborts and other exceptional conditions; interactions with process creation and destruction; monitoring large numbers of small objects.

The Pthreads committee successfully took up those challenges but in the context of the C language, which lacks the abstract data types of Mesa and Hoare's extension to Pascal, so they focused on locks (which they call "mutexes") and conditions — in that context, monitors should be viewed as a design pattern. Java, by contrast, incorporates a variant of monitors as a feature of the core language. There have also been various concurrency libraries that take advantage of C++'s features, e.g., http://boost.org/libs/thread/doc/index.html.

The fact that semaphores obviously are special monitors, together with Hoare's not-so-obvious semaphore-based implementation of monitors and conditions, shows that semaphores and monitors have the same logical power. But, as with Turing machines vs. higher-level programming languages, although their logical power is equivalent, their psychological power, i.e., their power to organize human thought and intuition, is vastly different. Semaphores are sometimes compared to goto-statements, in that they are powerful if you use enough of them, but they lead to tangled and difficult-to-maintain solutions to straightforward programming problems.

Six major approaches to threads seem especially worthy of study:

• POSIX’s Pthreads

• LinuxThreads, which is a clone-based variant of Pthreads

• Java threads, which is monitor based

• Ada Tasks, which, if I understand correctly, are based on Hoare's work on communicating sequential processes.

• Solaris threads, which is the most elaborate threads model

• Microsoft threads.

The use of threads in Java is described in Multithreaded Programming with Java Technology by Bil Lewis et al. The use of the Pthreads library is described in Programming with POSIX(R) Threads by David R. Butenhof.

A lot of good information is available on line, especially for Pthreads, e.g.:

• http://www.serpentine.com/ bos/os-faq/FAQ-1.html#The-history-of-threads for a history of threads.

• http://www.opengroup.org/onlinepubs/007904975/functions/pthread_cond_wait.html from The Open Group, which is the keeper of the Pthreads standard.

• http://www.tru64unix.compaq.com/docs/base_doc/DOCUMENTATION/V51A_PDF/ARH9RBTE.PDF, i.e., Tru64 Unix: Guide to the Posix Threads Library. It is a PDF of a few hundred pages at about 1051k.

• http://www-919.ibm.com/developer/threads/uguide/document.htm, IBM's AS/400 Posix Threads documentation, not as detailed as the Tru64 Unix document.

• http://oss.software.ibm.com/developerworks/opensource/pthreads/ to obtain a current free implementation of Pthreads.

• http://www.cs.arizona.edu/computer.help/policy/DIGITAL_unix/Digital_UNIX_Bookshelf.html for a lot of good information on DECthreads, a precursor to Pthreads.

Finally, "all issues about multithreaded programming" are extensively discussed at the Internet newsgroup comp.programming.threads.

Chapter 6

Bootstrapping High-Level Coordination

[Margin note: Make Lock an abstract base class. Also, replace "CPU" with "processor".]

There are three fundamental implementation issues in high-level (passive) waiting:

• implementing and handling primitive (a.k.a. low-level) locks, i.e., locks whose implementation doesn't involve the use of some other lock,

• queue handling,

• passing a processor from one thread to another.

In this chapter, the software rubber meets the hardware road. We will present actual code for a C++ implementation of a concurrency kernel: threads, monitors, sentries, and conditions. This implementation involves:

• Hoare semantics for signalling [HO].

• Support for symmetric multiprocessing.

• Almost portable CPU passing based on setjmp and longjmp.

• Transparent blocking of events.

• Acquisition and release of locks and interrupt blocking using the RAII idiom, i.e., acquisition-via-initialization.

• All waiting on conditions is based on priority queues.

• Each thread does its own queue handling.

• Isolation of all lock handling and queue handling in three routines: bounce(), Condition::hatch(), and Ready::start().1

A lot of diagnostic code has been removed to improve readability.

1The thread termination operation, which is not presented, also handles a queue.


6.1 Low-Level Locks

There are three basic low-level strategies for implementing locks.

6.1.1 Block Locks

The first and most common strategy for implementing low-level locks is hoarding the CPU cycles — if, after a lock has been acquired, no thread other than the lock's holder gets any CPU cycles, then no other thread can even request the lock, much less acquire it, since threads need CPU cycles to do anything. Note, however, that if another thread gets CPU cycles, either because the lock's holder released control of the CPU or allowed the CPU to be pre-empted away, then there is no assurance that other threads will not acquire the BlockLock.

class BlockLock : Lock {

// Use only with a mask that blocks at least all

// preemptive services whose handlers might

// directly or indirectly attempt to acquire

// this lock, say by resuming another thread.

public:

void acquire() { blockOtherCPUs(); }

void release() { resumeOtherCPUs(); }

};

The procedures blockOtherCPUs and resumeOtherCPUs are low-level routines for stopping and resuming other processors.

While holding a BlockLock, a thread may not invoke and/or permit preemptive invocation of any function/handler that might possibly attempt to acquire that lock, either directly or indirectly — not even by passing the CPU to a thread that might attempt to acquire this lock, either directly or indirectly. (Nothing except hoarding the CPU cycles enforces the lock. When another thread gets the CPU, control is usually gone. If no other thread will attempt to acquire this lock, there is no reason to acquire it.)

Unfortunately, the same goes for possible attempts to acquire any other BlockLock, since releasing any BlockLock releases them all. If, however, we appropriate one of the bits of Monitor's mask, say mask[0], for controlling the blocking of other processors, we can revise Sentry and BlockLock as follows:

class Sentry { // An autoreleaser for local lock.

Monitor& mon; // Reference to local monitor.

const unsigned old; // Old preemption-blockage status.

public:

Sentry( Monitor* m ) // m’s argument is always "this", a

: mon( *m ), // pointer to the surrounding monitor

old( thisProcessor().events.block( mon.mask ) )

{

if ( mon.mask[0] ) blockOtherCPUs();


mon.lock(); // Acquire the monitor’s lock.

}

~Sentry() {

mon.unlock(); // Release the monitor’s lock.

if ( ! old[0] ) resumeOtherCPUs();

thisProcessor().events.set( old );

}

};

class BlockLock : Lock {

// Use only with a mask that blocks all other CPUs

// and all preemptive services whose handlers

// might directly or indirectly attempt to acquire

// this lock, say by resuming another thread.

public:

void acquire() {}

void release() {}

};

Note that this revised version of BlockLock even allows recursive acquisition. Note also that:

• Use of a BlockLock diminishes CPU utilization and extends response time for preemptive events (e.g., interrupts) by executing critical sections with only one CPU running and preemption blocked. The severity of these performance degradations is a function of the lengths of critical sections — keeping them short diminishes the problem.

• BlockLocks are trivial on monoprocessor systems, where there are no "other CPUs" to block. On such systems, preemption blocking does all the work and, since preemption blocking is required in any case, BlockLocks degrade performance no more than any other kind of lock. So, on monoprocessor systems, most locks are BlockLocks.

6.1.2 Spin Locks

The naive first attempt at a busy-waiting implementation of locks is to use a flag to indicate that the lock is held:2

class SpinLock : Lock {

volatile bool held; // The flag.

public:

SpinLock() : held(false) {}

void acquire() {

while ( held ) {} // Test the flag.

// THE TROUBLE SPOT !!

held = true; // Set the flag.

}

void release() { held = false; }

};

2For a discussion of "volatility" and other relevant memory-access anomalies, see Section 9.2, page 128.

This implementation doesn't work, however, because one thread could test the flag while another thread was at the trouble spot, i.e., between the time that other thread tested the flag and the time it set the flag — both threads would test the flag; both would find it false; both would set it to true. Then each thread would return from its request to acquire, thereby acquiring the lock! The problem is that the flag itself is shared data requiring exclusive access, i.e., SpinLock should be a monitor, but a monitor already needs a lock, so monitors can't be used to implement primitive locks.

Therefore, most architectures provide a testAndSet instruction that atomically sets a flag to true and returns its former value in a single memory cycle. Access to the flag is exclusive, since the critical section is one instruction long, and only one instruction per cycle can access any given basic memory location. We can use such an atomic instruction to fix the solution above:

class SpinLock : Lock {

volatile bool held;

friend bool testAndSet(bool& x);

public:

SpinLock() : held(false) {}

void acquire() { while ( testAndSet(held) ) {} }

void release() { held = false; }

};
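For concreteness, here is one plausible realization of testAndSet (an assumption; the text leaves it architecture-specific). A GCC-style compiler builtin expands to the processor's atomic exchange instruction, so the test and the set happen in one indivisible step:

bool testAndSet( bool& x ) {
  return __sync_lock_test_and_set( &x, true ); // atomically store true into x and return its old value
}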

On a multiprocessor system, however, each execution of testAndSet(held) writes to held, which invalidates held's cached value in all other processors, thereby congesting the memory bus with those updates. Because C/C++'s || operation involves short-circuit evaluation (i.e., the right-hand operand is not evaluated whenever the left-hand operand is true), the following implementation of acquire merely reads held until release sets it to false:

void acquire() { while ( held || testAndSet(held) ) delay(); }

The nonspecific delay diminishes the probability that, when held becomes false, many threads will immediately congest the memory bus with attempts to acquire the lock via testAndSet.

[Margin note: Cite Anderson's paper.]
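One plausible realization of that delay (an assumption, not the author's code) is exponential backoff. Here the backoff counter is passed in explicitly, so acquire would keep one counter per attempt, e.g., unsigned backoff = 1; while ( held || testAndSet(held) ) delay( backoff );

inline void delay( unsigned& backoff ) {
  for ( volatile unsigned i = 0; i != backoff; ++i ) {} // brief pause; volatile keeps the loop from being optimized away
  if ( backoff < 1024 ) backoff <<= 1; // roughly double the pause after each failure, up to a cap
}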

Obviously, SpinLocks provide mutual exclusion. Avoiding the possibility of indefinite postponement, i.e., the possibility that some thread waits indefinitely while others repeatedly acquire and release the lock, can be achieved at the cost of significant overhead. To implement a bounded-waiting SpinLock, assume that the threads are organized in a circularly linked list:


extern Thread& thisThread(); // returns identity of calling thread.

class SpinLock : Lock {

volatile bool held;

friend bool testAndSet(bool& x);

public:

SpinLock() : held(false) {}

void acquire() {

Thread& me = thisThread();

me.requesting = this;

while ( me.requesting == this && ( held || testAndSet(held) ) ) {

delay();

}

me.requesting = 0;

}

void release() {

Thread& me = thisThread();

for ( Thread* t = me.next; t != &me ; t = t->next ) {

if ( t->requesting == this ) {

t->requesting = 0;

return;

}

}

held = false;

return;

}

};

6.1.3 Number Locks

[Margin note: Cite Dijkstra and Lamport.]

In 19XX, Edsger Dijkstra posed the problem of implementing mutual exclusion without using any special instructions, like testAndSet or blockOtherCPUs. The Dutch mathematician Dekker was the first to solve this problem. The solution below is due to Leslie Lamport and is called the Bakery Algorithm, since it was inspired by the take-a-number system used in some bakeries. The precedence of a thread is based on its number, smallest first, with ties broken on the basis of thread addresses.

[Margin note: Update per real code.]

extern Thread& thisThread(); // returns identity of calling thread.

class NumberLock : Lock {

// Precedence test on threads -- smallest number first.

bool precedes( Thread& m1, Thread& m2 ) {

if ( m2.requesting != this ) return true;

if ( m1.requesting != this ) return false;


if ( m1.number < m2.number ) return true;

if ( m1.number == m2.number && &m1 < &m2 ) return true;

return false;

}

public:

void acquire() {

Thread& me = thisThread();

me.taking = true;

me.requesting = this;

{ // Take a number that exceeds that of any other thread

// that is requesting *this.

int max = 0;

for (

list<Thread>::iterator t = threads.begin();

t != threads.end();

++t

) {

if ( t->requesting == this && t->number > max ) {

max = t->number;

}

}

me.number = max + 1; // got it.

}

me.taking = false;

for (

list<Thread>::iterator t = threads.begin();

t != threads.end();

++t

) {

while ( t->taking ) {}

// Now, 'til I release, if t->number changes, precedes(me, *t).

while ( precedes( *t, me ) ) {}

// Now I have higher or equality priority for this lock

// than *t does, and *t cannot get higher priority than

// mine until I have released this lock.

}

// Now I have all other threads locked out ’til I release.

}

void release() {

Thread& me = thisThread();

me.requesting = 0;

me.number = 0;

}


};

[Margin note: Acquiring a number lock requires scanning through the entire list of threads, which is time consuming. Also, such a scan requires that there be such a list that can be scanned without the protection of a lock. One can, for instance, use an array of thread descriptors having a single bit that indicates whether or not a given descriptor is valid. One can then protect that "list" with a single NumberLock when adding or deleting threads, i.e., modifying any of those validity bits.]

It should be clear that NumberLocks deliver on the progress and bounded-waiting promises required of locks. To see that NumberLocks enforce mutual exclusion, suppose toward a contradiction that two threads, i and j, hold the same NumberLock concurrently — say that i has precedence over j for that lock. Then i.number is no larger than j.number, so j did not have its current number when i most recently raised its taking flag — otherwise, i would have gotten a larger number than that of j. So, subsequently:

1. j got its current number,

2. and then j found i.taking to be false, by which time both threads had their current numbers,

3. and then j waited and continues to wait for i to release the lock, since i

had and continues to have precedence over j.

So j's invocation of acquire has not returned. So, j does not hold the lock. A contradiction!

There is nothing in the Bakery Algorithm that depends on exclusive access to memory locations, in the sense that it works with dual-ported memories. The key is that the test loop

while ( t->taking ) {}

iterates at least once after me.taking becomes false (i.e., after me.number got its current value) and before me.number gets used in any subsequent computation of a number for thread i. This requires only that me.taking always be set to true before me.number is updated, and that the updating of me.number be completed before me.taking becomes false.

6.1.4 Notes on Low-Level Locks

• BlockLocks aren't individually releasable. By contrast, SpinLocks and NumberLocks are individually releasable, which promotes concurrency on multiprocessor systems.

• BlockLocks stop all other CPUs. By contrast, SpinLocks and NumberLocks block only those CPUs that host threads trying to acquire a held lock, which promotes concurrency on multiprocessor systems.

• Recursively requesting a low-level lock will produce different results depending on the type of lock:

– In the case of a SpinLock, the client will self-deadlock — the thread will find the flag set to true and busy-wait on itself forever. In fact, SpinLocks are low-level binary semaphores, i.e., semaphores with a one-bit counter. (Releasing such a semaphore has no effect when its count is one.)


– In the case of a NumberLock, the lock will release (by granting the client a new number larger than those of the currently waiting threads) and cause the thread to wait its turn to reacquire the lock. Meanwhile another thread can acquire this already-held lock — not a good situation if the associated invariant has not been restored.

– The first subsequent invocation of release on a crude BlockLock will resume other CPUs, which partially releases the lock. BlockLocks where blockage of other CPUs is stored in a mask bit are recursively acquirable.

[Margin note: Drop crude BlockLocks from the discussion.]

• Locking the bank of memory containing a sensitive data structure is another way to guarantee exclusive access. This approach must be applied with caution, however, in systems with processors having instruction and/or data caches.

• To improve performance, many modern compilers and CPUs make apparently harmless rearrangements in the order of instructions. Unfortunately, these compilers and CPUs are not equipped to recognize code that acquires and releases locks. In order that rearranged code not leak out the ends of critical sections and get executed outside the protection of a lock, it is important that instructions not get hoisted past (moved ahead of) a lock acquisition or sunk past (moved behind) a lock release. To prevent a compiler from performing such rearrangements, it usually suffices to compile the definitions of acquire() and release() separately from their invocations. To prevent CPUs from such reordering, the compiler must emit architecture-specific "memory-barrier" instructions, which usually require the implementer to resort to assembly language (a sketch follows this list). For a fuller discussion of memory-access anomalies see Section 9.2 on page 127.
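The following sketch shows the two kinds of barrier just mentioned, using GCC-style inline assembly for the x86 (an assumption; the mechanism is inherently compiler- and architecture-specific):

inline void compilerBarrier() { asm volatile( "" ::: "memory" ); } // forbids compile-time reordering across this point
inline void memoryBarrier() { asm volatile( "mfence" ::: "memory" ); } // x86 full fence: forbids CPU reordering as well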

Exercise. The tradeoff between high-level locks and low-level locks is that low-level locks have negligible acquisition and release overhead but high waiting-time overhead. Design a two-stage lock using a spinlock that counts down to, say, one and then puts the waitor into a high-level waiting state, thereby putting an upper bound on wait-time overhead. There are well-known formulas giving average waiting times as a function of arrival rates and the distribution of service times. When loads (arrival rates) get sufficiently high, would it pay to lower the spin-lock count, thereby shifting to high-level waiting sooner?

[Margin note: Consider moving the following paragraph.]

Taking turns. Not all thread-coordination problems are as difficult as locks to implement via primitive instructions. If two threads will take turns, we can use a simple flag-based protocol. For example, a single-producer/single-consumer bounded queue can be coordinated with two simple flags. Initially, the queue is empty, the space-available flag is raised, and the data-available flag is lowered. The producer, who polls the space-available flag before producing, finds it raised and proceeds to produce until there is no more space available or time runs out or whatever. The producer then lowers the space-available flag, raises the data-available flag, and continues with other business. The consumer, who polls the data-available flag before consuming, now finds it raised and proceeds to consume until there is no more data available or time runs out or whatever. The consumer then lowers the data-available flag, raises the space-available flag, and goes about other business. And so on.

template< class Item >

class ThreadSafeQueueII : Monitor, Queue<Item> {

// Restriction: single producer and single consumer

// They take turns.

volatile bool spaceAvailable; // busy-waited flag, hence volatile

volatile bool dataAvailable; // busy-waited flag, hence volatile

public:

ThreadSafeQueueII( int size )

: Queue<Item>(size),

spaceAvailable(true),

dataAvailable(false)

{}

void append( Item& x ) {

while ( ! spaceAvailable ) {}

Queue<Item>::append(x);

if ( Queue<Item>::full() /* or whatever */ ) {

spaceAvailable = false;

dataAvailable = true;

}

}

Item& remove() {

while ( ! dataAvailable ) {}

Item& x = Queue<Item>::remove();

if ( Queue<Item>::empty() /* or whatever */ ) {

dataAvailable = false;

spaceAvailable = true;

}

return x;

}

};

Note that we could get by with a single flag. Note also that the blocking of preemptive events is an example of such producer/consumer coordination — so is interaction with external devices.

[Margin note: Explain and expand the stuff below.]

6.2 Priority Queues

A priority queue is a data structure that offers two services:


• insert takes two arguments, an item and a priority, and places the item into the queue at the specified priority.

• extract takes no arguments — it removes the highest-priority (smallest priority number) item from the queue and returns it to the client. We will assume that ties are broken on a first-come-first-served basis.

In C++, PriorityQueue would be an abstract base class:

template < class Item>

class PriorityQueue {

public:

virtual void insert( Item& x, int priority ) = 0;

virtual Item& extract() = 0;

};

In our case, the items will be thread descriptors. We are interested in priority queues because they allow flexible implementation of scheduling protocols, which we will cover in the chapter on scheduling.

There is a possible pitfall in the implementation of such queues. Many priority-queue implementations require allocating a new "node" each time another member is added to a queue. A recursive lock request occurs whenever a scheduler based on such queues is used to implement a thread-safe memory allocator.
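One way to avoid that pitfall is an intrusive priority queue, sketched below (the class and field names are assumptions, not the kernel's own code): each queued Item carries its own link and key, so insert merely splices the caller's item into a sorted list and never allocates a node.

template< class Item > // Item is assumed to provide two fields: Item* pqNext; int pqKey;
class IntrusivePQ {
  Item* head; // singly linked, sorted by key, smallest (highest priority) first
public:
  IntrusivePQ() : head(0) {}
  void insert( Item& x, int priority ) {
    x.pqKey = priority;
    Item** p = &head;
    while ( *p && (*p)->pqKey <= priority ) p = &(*p)->pqNext; // "<=" keeps ties first-come-first-served
    x.pqNext = *p; // splice the caller's own item in; nothing is allocated
    *p = &x;
  }
  Item& extract() { // remove and return the highest-priority item
    Item& x = *head;
    head = head->pqNext;
    return x;
  }
};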

6.3 Passing a CPU

To pass a CPU from one thread to another, we must store the current processor state into the current thread's descriptor and restore the processor's state from data in the descriptor of the thread receiving the CPU. In practice, switching is more complicated, because when a thread gives up its CPU, it must put itself on some condition's queue and must pass its CPU to the highest-priority thread waiting on some other condition's queue. Each queue will, of course, be protected by a lock that the passing and/or receiving thread(s) may or may not already hold.

[Margin note: Mention sigsetjmp and siglongjmp.]

Trick. The functions setjmp and longjmp are from the Standard C Library:

• int setjmp( jmp_buf env ) saves the current values of the state registers in a specified jmp_buf.3 The normal return value is zero.

• void longjmp( jmp_buf env, int val ) restores the registers' values from the specified jmp_buf, which resumes the corresponding call to setjmp. The second argument, which should not be zero, specifies the non-normal return value for the resumed call to setjmp.

3A structure type defined in the header file setjmp.h.


These routines are designed to allow a C/C++ program, on encountering an error, to restore sanity by getting back to (an approximation of) an earlier state of the call stack without having to unwind it via a sequence of normal return operations. These routines were not intended as a mechanism to implement concurrency, and some implementations of longjmp may modify the current stack in ways that are not compatible with this usage. Most implementations, however, allow setjmp(), longjmp() and jmp_buf to be used to save, restore, and retain register values, respectively. Per the 1989 C Standard [7.6.2.1]:

The longjmp function restores the environment saved by the most recent invocation of the setjmp macro in the same invocation of the program, with the corresponding jmp_buf argument. If there has been no such invocation, or if the function containing the invocation of the setjmp macro has terminated execution in the interim, the behavior is undefined.

All accessible objects have values as of the time longjmp was called, except that the values of objects of automatic storage duration that are local to the function containing the invocation of the corresponding setjmp macro that do not have volatile-qualified type and have been changed between the setjmp invocation and longjmp

call are indeterminate. [Among other things, such a variable's new value may or may not have been written to that variable's memory address (if any).]

As it bypasses the usual function call and return mechanisms, the longjmp function shall execute correctly in contexts of interrupts, signals and any of their associated functions. ...

After longjmp is completed, program execution continues as if the corresponding invocation of the setjmp macro had just returned the value specified by val. The longjmp function cannot cause the setjmp macro to return the value 0; if val is 0, the setjmp macro returns the value 1.

So, a thread can pass the CPU on which it is running to another thread t via the following C-language idiom:

if ( !setjmp(registers) ) longjmp(t.registers, true);

There is no need for a first-try flag, because the return value of setjmp() is zero on the first try and nonzero when and if control is returned to that point via an invocation of longjmp. The only remaining problem is how to initialize a jmp_buf to refer to a new stack — see 6.5.1, on page 94.

[Margin note: Mention GNU threads, including Butenhof's comments on them.]
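Wrapped up as a helper (a sketch; thisThread() and the registers field of Thread are taken from the surrounding text, but the helper itself is illustrative rather than part of the kernel presented below):

#include <setjmp.h>

void passCPUTo( Thread& t ) {
  Thread& me = thisThread(); // identity of the calling thread
  if ( !setjmp( me.registers ) ) // save my registers; setjmp returns 0 on this first pass
    longjmp( t.registers, 1 ); // resume t where it last saved its state; does not return here
  // Control arrives here only when some thread later longjmps to me.registers.
}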

6.4 Threads

A Thread works something like a procedure that, on each call, resumes where it left off the last time it ran. One Thread runs until it explicitly passes its processor to another Thread (or its processor gets pre-empted away). It then waits until some running Thread passes a processor back to it. Then it resumes where it left off and runs until it again hands off its processor.4

[Margin note: Mention coroutines.]

At any point in time, every waiting Thread is waiting on exactly one Condition. There is a special Condition, called ready, that holds those Threads that are waiting only for a CPU.

6.4.1 threads.H

Thread is a base class with a virtual function action(), whose nonblocking invocation begins the thread — for example, a derived thread class that prints out an endless stream of some specified character can be defined as follows:

class CharSpitter : public Thread {

char ch;

public:

CharSpitter( char c ) : ch(c) {}

void action() { while(true) cout << ch; }

};

We can declare as many CharSpitters as we like, each spitting out an endless stream of its specified character; for instance

ready.start( new CharSpitter( ’Z’ ) );

creates a new thread that prints out an endless stream of Z's and places it on ready, waiting for a CPU to complete the execution of its constructor.

A Thread (descriptor) is a Node that can be linked into a priority queue. Specifically, Thread is an abstract base class:

class Thread : public Node {
  int stackSize;
  void* stackLocation;
  virtual void action() {};
  friend class Ready;
public:
  int currentCPU;
  jmp_buf registers; // TRICK !!
  int priority() { return 0; } // stub
  Thread( int size = 10000 );
  virtual ~Thread() { delete[] stackLocation; }
};

Thread’s attributes and operations include:

4Some Threads can be simulated using procedures and static variables, e.g., a getCharacter

routine that returns the next character of a line buffer on each call and reads the next line whenever that buffer becomes empty.


• the size and location of the Thread’s stack,

• a virtual function, action, to be supplied (overridden) by the derived class,

• an integer telling which CPU (if any) is the Thread's current host,

• a jmp_buf for holding the Thread's register settings whenever it gets suspended,

• a virtual function, priority, for computing the Thread's priority when the Thread waits on ready,

• a constructor,

• a virtual destructor that simply deallocates the Thread’s stack.

6.4.2 threads.cc

A Thread's constructor must allocate the Thread's stack and initialize its register settings (for which purpose we supply a semi-portable hack called adjust).5 The constructor's client is responsible for putting the newly constructed Thread onto the ready condition's queue via a call to ready.start(). Upon first receiving a CPU, the newly constructed Thread starts out running code within its constructor, which immediately invokes ready.hatch, which gets that Thread off ready's queue, releases ready's lock, and unblocks preemptive events. The constructor then invokes the Thread's action function. If and when action returns, the constructor requests the services of theTerminator, which cancels the thread and invokes its destructor.

// Architecture-dependent register initialization (jmp_buf hacking).
// To determine appropriate settings for these, look at setjmp.h or run
// the accompanying program archtester.cc, which runs various experiments
// to attempt to determine which registers are which.

#ifdef i386
const int BP = 3;
const int SP = 4;
#endif

#ifdef sparc
const int BP = 3;
const int SP = 1;
#endif

5 In [EN], there is a fuller discussion that includes a more elegant and portable approach to initializing a Thread's registers, which is used in the GNU portable Threads library.


static void
adjust( jmp_buf& regs, void* loc, int size ) {
  long* jb = (long*) &regs;
  // leaving slack for safety.
  jb[SP] = long( loc + size*sizeof(long) - 256 );
  jb[BP] = long( loc + size*sizeof(long) - 128 );
}

Thread::Thread( int size )
  : stackSize(size)
{
  if ( setjmp( registers ) ) {
    // If child ...
    Thread& me = ready.hatch( true, interrupts.on );
    me.action();               // Live!
    theTerminator.request();   // Die!
  }
  // If parent, set up child's descriptor.
  if ( stackSize ) stackLocation = new long[ stackSize ];   // Allocate stack.
  adjust( registers, stackLocation, stackSize );            // Adjust child.
}
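The register-probing experiment mentioned in the comment above can be approximated as follows. This is a hedged sketch, not the accompanying archtester.cc: it merely guesses which jmp_buf slot holds the stack pointer by looking for a saved value near the address of a local variable, and it fails on C libraries that store mangled (encrypted) pointers in jmp_buf.

#include <csetjmp>
#include <cstdio>

int main() {
  jmp_buf env;
  long marker = 0;                               // a local near the top of the stack
  setjmp( env );                                 // capture the current register values
  const long* jb = (const long*) &env;
  const int slots = sizeof(jmp_buf) / sizeof(long);
  for ( int i = 0; i < slots; ++i ) {
    long offset = jb[i] - (long) &marker;
    if ( -4096 < offset && offset < 4096 )       // within a page of the stack?
      std::printf( "slot %d looks stack-related (offset %ld)\n", i, offset );
  }
  return 0;
}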

6.5 Monitors

6.5.1 monitors.H

The following are possible declarations for Sentry, Condition and Monitor. Each instance of Sentry or Condition has a constructor-initialized reference (mon) to the surrounding monitor.

Monitors have lock and unlock services that acquire and release the monitor's lock, respectively.6 There are also transfer and receive services to switch ownership of the Monitor. The handlers for these services include diagnostics that use the Monitor's owner attribute to make sure that proper lock-handling protocols are observed.

class Monitor ;

class Sentry {           // An autoreleaser for monitor's lock.
  Monitor& mon;          // reference to local monitor, *this.
  const unsigned old;    // place to store old interrupt status.
public:
  Sentry( Monitor* );
  ~Sentry();
};

6 I didn't call them "acquire" and "release" in order to avoid collisions with similarly named services of derived classes. It might be better to use namespaces for that purpose.

#define EXCLUSION Sentry exclusion(this);

class Condition : protected Queue {
  friend class Terminator;
  friend class Ready;
  friend void bounce( bool, Condition&, bool, Condition&, int );
  friend class Thread;
  Monitor& mon;                       // reference to local monitor.
  Thread& hatch( bool, unsigned );    // only for bounce and Thread(int).
public:
  void wait( int pr = INT_MAX );
  void signal();
  bool awaited() { return !empty(); }
  Condition( Monitor* m ) : mon( *m ) {}
};

class Monitor {
  friend class Terminator;
  friend class Sentry;
  friend class Condition;
  friend class Ready;
  friend void bounce( bool, Condition&, bool, Condition&, int );
  Condition urgent;
  unsigned mask;
  SpinLock local;
  Monitor* trailer;       // Sleep monitor for relinquishing thread.
  void lock();
  void unlock();
  void transfer( Thread& );
  void receive();
  Thread* owner;          // address of owner, if owned.
public:
  Monitor( unsigned x = 0 );
};
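As a hedged illustration of how a client might use these declarations (the class Mailbox and its operations are invented for this sketch, not part of the text), here is a one-slot mailbox monitor: each operation holds the monitor's lock for its duration via EXCLUSION and coordinates with its counterpart via a Condition.

class Mailbox : Monitor {
  int slot;                 // the one buffered item
  bool full;                // whether slot currently holds an item
  Condition nonFull;        // where put() waits while the slot is occupied
  Condition nonEmpty;       // where get() waits while the slot is vacant
public:
  Mailbox() : full(false), nonFull(this), nonEmpty(this) {}
  void put( int item ) {
    EXCLUSION                          // acquire this monitor's lock
    while ( full ) nonFull.wait();     // wait until the slot is free
    slot = item;
    full = true;
    nonEmpty.signal();                 // wake a waiting get(), if any
  }
  int get() {
    EXCLUSION
    while ( ! full ) nonEmpty.wait();  // wait until the slot is filled
    int item = slot;                   // copy out before a put() can overwrite it
    full = false;
    nonFull.signal();                  // wake a waiting put(), if any
    return item;
  }
};

Under the Hoare-style signalling implemented below (the signaller sleeps on urgent and the signalled waitor receives the monitor directly), the while loops could be plain ifs; they are kept as loops only as a defensive habit.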


6.5.2 monitors.cc

Most of the tedium of this implementation has been concentrated into two routines, bounce and Condition::hatch. The function bounce takes references to two conditions, wake and sleep, plus two flags that tell whether sleep's monitor and wake's monitor, respectively, are owned by the invoking thread. The function inserts the current Thread into sleep's queue at a specified priority and passes that Thread's CPU to wake's highest-priority waitor. Note that sleep's lock must be released by the awakened Thread.

Condition::hatch is best viewed as the lower half of bounce. We separate it out so that it can be called from Thread's constructor.

void
bounce(
  // Passes CPU to the first thread waiting for a "wake" condition.
  // Current thread sleeps on "sleep" condition.
  bool wakeHeld,        // Do I own wake's monitor now?
  Condition& wake,
  bool sleepHeld,       // Do I own sleep's monitor now?
  Condition& sleep,
  int sleepPriority     // At what priority should I sleep?
) {
  // Block interrupts, lock both monitors and get onto sleep queue.
  unsigned oldStat = interrupts.block( interrupts.off );
  bool sameMonitor = ( &wake.mon == &sleep.mon );
  if ( sameMonitor && wakeHeld ) sleepHeld = true;
  if ( ! sleepHeld ) sleep.mon.lock();              // Grab it, unless I own it.

  Thread& me = CPUalloc.thisThread();               // Call current thread "me".
  sleep.insert( me, sleepPriority );

  if ( ! wakeHeld && ! sameMonitor ) wake.mon.lock();

  // Determine waker.
  assert( wake.awaited() );
  Thread& him = (Thread&) wake.first();             // Call waker "him".
  assert( &him != &me );

  // I go to sleep, passing him this CPU and both locks.
  wake.mon.transfer( him );                         // Pass wake.mon, and sleep.mon
  if ( ! sameMonitor ) sleep.mon.transfer( him );   // unless they're the same.
  wake.mon.trailer = &sleep.mon;                    // So that he can unlock it.

  if ( ! setjmp( me.registers ) ) longjmp( him.registers, true );   // Trick!

  // Much later, I resume and finish waking up, having received
  // this CPU from yet another thread, who sees me as his "him".
  sleep.hatch( ! sleepHeld, oldStat );              // Release sleep.mon, if grabbed.
}

Thread&
Condition::hatch( bool toRelease, unsigned stat ) {
  // For use only by Thread::Thread and bounce.
  // I've just awakened from a sleep having received a CPU
  // from a thread who calls me "him."
  assert( awaited() );
  Thread& me = (Thread&) extract();   // Extract self.
  CPUalloc.allocate( me );            // Grab this CPU.
  if ( mon.trailer != &mon ) {        // Deferred release?
    mon.trailer->receive();
    mon.trailer->unlock();
  }
  mon.receive();                      // Receive monitor from other.
  if ( toRelease ) mon.unlock();      // Unlock monitor if requested.
  interrupts.set( stat );             // Restore interrupt status.
  return me;                          // Identify yourself for Thread(int).
}

Sentry::Sentry( Monitor* m )
  : mon( *m ),
    old( interrupts.block( mon.mask ) )
{
  mon.lock();
}

Sentry::~Sentry() {
  if ( mon.urgent.awaited() ) {
    int pr = CPUalloc.thisThread().priority();
    bounce( true, mon.urgent, false, ready, pr );
  } else {
    mon.unlock();
  }
  interrupts.set( old );
}

void
Condition::wait( int pr ) {
  if ( mon.urgent.awaited() ) {
    bounce( true, mon.urgent, true, *this, pr );
  } else {
    bounce( false, ready, true, *this, pr );
  }
}

void
Condition::signal() {
  if ( awaited() ) bounce( true, *this, true, mon.urgent, 0 );
}

Monitor::Monitor( unsigned x )
  : urgent( this ),
    mask( x ),
    trailer( 0 ),
    owner( 0 )
{}

void
Monitor::transfer( Thread& NewOwner ) {
  assert( owner == &CPUalloc.thisThread() );
  assert( owner != &NewOwner );
  owner = &NewOwner;
}

void
Monitor::receive() {
  assert( owner == &CPUalloc.thisThread() );
}

void
Monitor::lock() {
  Thread* current = &CPUalloc.thisThread();
  assert( owner != current );
  local.acquire();
  assert( owner == 0 );
  owner = current;
}

void
Monitor::unlock() {
  assert( owner == &CPUalloc.thisThread() );
  owner = 0;
  local.release();
}

6.6 Ready

6.6.1 urmonitor.H

Through dual inheritance, the monitor ready is both a monitor and a condition, where Threads wait for a CPU. The urmonitor module defines ready and the CPU allocator, which is a database that keeps track of which CPU currently hosts which Thread. Specifically, there is an array of thread pointers, one per CPU: CPU[i]7 points to the thread that is currently running (i.e., hosted) on the i-th CPU.

class Ready : public Monitor, Condition {
  friend class Condition;
  friend class Sentry;
  friend class Thread;
public:
  void start( Thread* );
  Ready();
  void defer();
};

extern Ready& ready;

class CPUs {
  struct CPUdesc {
    Thread* current;          // The thread running on this CPU.
  };
  static const int CPUcount = 1;
  CPUdesc CPU[ CPUcount ];
public:
  Thread& thisThread() { return *(CPU[ thisCPU() ].current); }
  int thisCPU() { return 0; }    // Should be a sys trap.
  void allocate( Thread& t ) {
    int n = thisCPU();
    CPU[n].current = &t;
    t.currentCPU = n;
  }
};

7 There is a thread identity problem: how does a thread find out its own id or, equivalently, on which CPU it is running? One possibility is that each CPU might have its own kernel-invocation vector table with a special trap or system call that reports that CPU's index number.

extern CPUs& CPUalloc;

// Either thisCPU or thisThread has to be primitive. There should
// be a thisCPU trap, but one can make a primitive thisThread by
// checking which stack SP is within — grossly inefficient, however.

6.6.2 urmonitor.cc

[Marginal note: Why can't waitors "pause" when there is no one to pass the CPU to? Hmmmmm.]

We need a dummy Thread per CPU that can run when no other Threads are runnable. Repeatedly, the dummy pauses until the next interrupt (signal) occurs and then defers to the highest-priority Thread waiting on ready.

class Dummy : Thread {
  friend class Ready;
public:
  void action() {
    for (;;) {
      ready.defer();
      pause();
    }
  }
};

void Ready::defer() {
  EXCLUSION
  wait();
}

void
Ready::start( Thread* child ) {
  EXCLUSION
  insert( *child, child->priority() );
}

Ready::Ready()
  : Monitor( interrupts.off ),
    Condition( this )
{
  // Give initial thread a thread descriptor.
  // It'll get register settings when it sleeps.
  // Zapping it may have dire consequences.
  CPUalloc.allocate( *(new Thread(0)) );
  // Need one dummy per CPU.
  ready.start( new Dummy );
}

6.7 Summary

This chapter sketched the implementation of a threads library. It is intended to show the principles involved in the implementation of threads and their coordination mechanisms and to dispel some of the sense of magic involved in the study and use of concurrency. It would be tedious to add such things as pre-emptive thread cancellation, timeouts, and graceful interaction with pre-emptive services, e.g., signals and interrupts.

An embedded computer often runs no more than one program, so the services of a traditional operating system are not of much use. Usually, however, embedded systems involve concurrent streams of information to or from various devices. In such cases, the operating system needs little more than a concurrency package more or less like the one described in this chapter.


Chapter 7

DEADLOCKS

A deadlock is a situation in which, for whatever reason, each member of a given set of threads waits for occurrences of events (e.g., the signalling of conditions and the releasing of locks) that can be generated only by threads in that set.

Deadlocks arise in many situations. Suppose that two brothers each try to make an omelet in the same kitchen and that both need the omelet pan and the spatula. If one grabs the spatula and the other grabs the omelet pan, and each waits for the other to relinquish his utensil, both brothers will go hungry.

As a more complex example, suppose that we have a round table where philosophers are dining. Between any two adjacent plates there is a single fork. A philosopher alternates between eating and philosophizing. To eat, a philosopher requires two forks, the one to her left and the one to her right. While philosophizing, she relinquishes her forks, thus making them available to her left and right colleagues, respectively. Obviously, if each philosopher picks up the fork to her left and waits for her neighbor on the right to finish eating and relinquish her fork, all of the philosophers will starve. The dining-philosophers problem is to design a good protocol by which the philosophers can acquire and release their forks, i.e., one that is:

• free of deadlocks,

• efficient, in the sense that, if a philosopher is waiting to eat and both of her forks are available, she will not be kept waiting,1

• not subject to indefinite postponement.

We will discuss this problem further.

1 Otherwise, we could simply serialize the eating by the philosophers so that only one could eat at a time.



7.1 Deadlocks and Monitors

In a monitor-structured system, wherein the only places threads ever wait are conditions and monitors' locks, there are three kinds of deadlock situations [LA2]:

• All deadlocked threads are waiting on conditions. The simplest deadlock is where all of the deadlocked threads are waiting on conditions in the same monitor. This situation is a straightforward bug that is usually easy to find, provided that monitors are kept small and well-structured. When the deadlocked threads wait on conditions in distinct monitors, finding the bug is considerably more difficult but still reasonably straightforward.

• All deadlocked threads are waiting on locks. In such cases, there must be a circular calling pattern among the monitors involved. For instance, a handler from one monitor might request a service from another monitor, whose handler requests a service from the first monitor, resulting in a recursive request for the first monitor's lock and a possibility of self-deadlocking. More generally, two threads might enter distinct monitors, and each invoke an operation of the other monitor, only to find it locked. Of course, such patterns can go through several threads and monitors before deadlocking.

• Some of the threads wait on locks and others on conditions. Suppose that a service routine of one monitor, perhaps indirectly, requests a service of another monitor, and the second service routine waits on a condition in the second monitor. Note that the lock of the first monitor is held during that wait. A deadlock ensues if the awaited condition can be signalled only by threads that similarly enter the second monitor via the first monitor, which is locked in this case. Finding such bugs is tedious.

One could, of course, release the first monitor's lock before requesting the service of the second monitor. Note, however, that the invariant of the first monitor must be re-established before entering that hollow region, i.e., the region where the lock is not held:

operation(...) {
  { EXCLUSION
    ...
    // code to restore invariant
  }
  // begin hollow region
  secondMonitor.op(...);
  // end hollow region
  { EXCLUSION
    ...
    // code to restore invariant
  }
}

Exercise. It is sometimes suggested that all locks held by waiting threads be automatically released during the wait and reacquired after the wait. Discuss the safety implications of that policy.

7.2 The Resource-Allocation Model

Schedulers that also allocate resources are often called allocators.

Often, we can view each reason for waiting as an allocatable resource (or pool of interchangeable resources) to be requested and released via a scheduler. Locks can easily be viewed as resources, but viewing conditions as resources is not always so simple. For example, in servers protected by SharableLocks, the allocated resource is the privilege of accessing that server. But there are two associated conditions: okToShare and okToOwn. An owner, having waited on okToOwn, may later signal okToShare or okToOwn or neither, depending on whether there are aspiring sharers, aspiring owners, or neither. Moreover, if we use the daisy-chained-signals implementation for broadcast, when an aspiring sharer resumes, it immediately signals okToShare; this signal does not represent the release of an acquired resource but rather a simulated broadcast.

We can view the privilege-of-access resource as being held by the last thread to acquire the SharableLock. Upon releasing a sharable lock, a sharer holding the privilege-of-access resource can pass it to another sharer that is still accessing the server, if there are any. Otherwise, the resource holder releases the resource or passes it implicitly to an aspiring sharer or aspiring owner, per the dictates of the CREW protocol.

In AlarmClock schedulers, we can take the view that the resource for which a thread waits is the occurrence of a particular time, and the timer's interrupt handler releases an unbounded number of those occurrences when that time arrives. Such a resource is said to be consumable. By contrast, reusable resources follow the conservation principle: acquiring and releasing has no effect on the total amount of a reusable resource, i.e., the amount held by the allocator plus all clients.2

Cases where it is not possible to apply the resource-allocation model require a more sophisticated approach to liveness proofs, e.g., Petri nets or Cartesian products of finite-state automata.

2 For this to work, no client may release more of the resource than it holds. Of course, a thread self-deadlocks if it requests more of a resource than the amount originally in the pool minus the amount it has already acquired.


Resource-allocation graphs. A resource-allocation graph is a directed bipartite graph with weighted edges. Its two node sets represent threads and resource pools, respectively. Each thread-to-pool edge represents a request for a corresponding amount from that pool, and each pool-to-thread edge represents an allocation of the corresponding amount from that pool to that thread.3

To represent the granting of a request, we simply reverse the corresponding edge, thereby converting it into an allocation. To represent the release of an allocation, we simply remove that edge or give it zero weight.

For consistency, we require that resource-allocation graphs be:

• realizable in the sense that the sum of the allocations from a given resource pool must not exceed the capacity of that pool.

• realistic in the sense that no request edge exceeds the total capacity of the corresponding resource pool minus the amount of that pool currently held by the corresponding thread — otherwise, the thread would self-deadlock.

In this abstraction, to resume a thread, we simply grant its current requests. The thread is resumable if the resulting allocation is realizable. A subgraph in which no thread is resumable (relative to that subgraph) is called a deadlock; a thread is said to be deadlocked if and only if it is a member of a deadlock. Obviously, in a deadlock, each thread has positive out-degree, and one of its requested resources has positive out-degree, since each request is realistic. So, every finite deadlock contains a cycle, since each of its threads has an out-going path to a thread in that same deadlock.

To determine which threads in a resource graph are deadlocked, remove resumable threads until there are no more.4 Each of the remaining threads is waiting for a resource held by other remaining threads, none of whom are resumable. The original graph was deadlock-free if and only if the remaining graph is empty. Of course, some of the removed threads may eventually also deadlock as they make subsequent resource requests.

7.3 Deadlock Prevention

To prevent deadlocks, we attack one or more of the following conditions, each of which is necessary for a deadlock:

• resources cannot be shared (per the conservation principle and the resource-allocation model),

3 In normal monitor-based systems, threads have out-degree at most one, i.e., they wait on at most one resource at a time. However, the resource-allocation model applies to systems where a thread can make multiple simultaneous requests.

4 Note that removing one resumable thread may make other threads resumable. For a thread requesting a consumable resource to be resumable, that resource must have an outgoing edge to a runnable thread. In determining the deadlocked threads, one does not remove edges from consumable resources.


• threads can wait indefinitely while holding resources (per the resource-allocation model),

• there is a circular pattern in the resource graph (as proved above).

Except in the case of single-unit (i.e., one-of-a-kind) resources, these conditions are not sufficient to guarantee a deadlock. Say, for example, that one thread holds the scanner and waits for a printer. A second thread holds a printer and is waiting for the scanner. We now have a circular waiting pattern. But if there is a second printer, held by a runnable thread, then, when that thread releases that second printer, the rest of the threads might be able to finish — no deadlock even though all three requirements are met. So, although those conditions are necessary, they are insufficient to guarantee a deadlock.

7.3.1 Sharing by Multiplexing Preemptable Resources

Deadlocks occur only with respect to nonsharable resources, but preemptable resources are de facto sharable via time-division multiplexing. So, to minimize possibilities for deadlocks, we can:

1. Make use of preemptable resources, which can be shared by multiplexing them among the active threads. For example, CPUs and disks are preemptable; printers and tape drives are not. In virtual-memory systems, main memory becomes preemptable.

2. Virtualize non-preemptable resources, i.e., represent nonpreemptable resources by preemptable ones. For instance, output destined for a printer can be "spooled" to a preemptable device, such as a disk, rather than to the nonpreemptable printer. Later, a daemon can dump that disk file to the printer when it becomes available.

Excessive preemption leads to performance breakdowns, called "thrashing," where threads spend most of their time taking resources away from each other. Thrashing is particularly noticeable in overly committed virtual-memory systems. Though the resulting problem is serious, it is not so bad as a deadlock and can often be cured by increasing resources and/or decreasing load.

Note also that many logical resources, e.g., monitor locks, cannot be virtualized or shared. Such problems are especially prominent in the design of database systems, where a thread must lock and unlock various records as it reads and updates them.

7.3.2 Preventing Cycles by Acquiring Resources in Order

The most common mechanism for preventing deadlocks is to assign a rank to each resource and design all operations so that a thread will only wait for resources of rank higher than the highest it currently holds. In such a case, deadlock is impossible, because it is impossible to have a cycle in the resource graph when each resource in the cycle must have a rank larger than its predecessor.


A simple but inefficient solution to the dining-philosophers problem is to rank the forks, say by address, and require that each philosopher pick up her forks in rank order.

[Marginal note: Show that this solution could degenerate into one-at-a-time.]

Exercise. What, if anything, would go wrong if we replace the words “higherthan” by “higher than or equal to” in the above policy?

Exercise. Prove or disprove that under this policy each monitor's lock must have higher rank than any of the resources associated with conditions in that monitor. Remember that waitors give up their locks, but not until after their decision to wait.

The dining philosophers revisited. If there are an even number of philosophers, alternate red and green forks around the table — each philosopher will then sit between a red fork and a green fork. If, on the other hand, there are an odd number of philosophers, make one fork yellow and alternate red and green among the rest — one philosopher will then have a red fork and a yellow one, and her neighbor will have a yellow fork and a green one.

The protocol is that when a philosopher wishes to eat, she first requests her red fork, if she has one, else her yellow fork. When she has acquired her first fork, she then requests her green fork, if she has one, else her yellow fork. When she has acquired both forks, she eats. When she finishes eating, she relinquishes both of her forks.

• There is no possibility of deadlock, since any cycle in such a graph must go around the table, in which case all forks are held, and, by definition, anyone holding a green fork has violated the protocol.

• No philosopher will starve, since each requested fork gets acquired as soon as its current holder finishes eating.

• At least half of the philosophers who are hungry will succeed in acquiring their first fork. And at least half of those who acquire their first fork will succeed in acquiring their second fork. So, at any time, at least a quarter of the philosophers who are hungry will be eating.

Note that red forks get held while waiting, but green forks do not. It can be argued, therefore, that the philosopher with no red fork has an unfair advantage, while the philosopher with no green fork is at an unfair disadvantage. To evenly distribute these advantages and disadvantages, whenever a philosopher holding a yellow fork in her left hand relinquishes her forks, she must put the yellow fork to her right and the other fork to her left.

To implement this protocol, create a descriptor for each fork (sketched below). That descriptor is a monitor with two attributes, a color and a "held" flag, and a condition where philosophers wait to acquire the fork.
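A hedged sketch of such a fork descriptor follows, built from the Monitor and Condition machinery of Chapter 6; the names Fork, Color, acquire, and release are illustrative rather than taken from the text.

class Fork : Monitor {
public:
  enum Color { RED, GREEN, YELLOW };
private:
  Color color;                  // this fork's color (never changes)
  bool held;                    // whether some philosopher is holding it
  Condition available;          // where philosophers wait for this fork
public:
  Fork( Color c ) : color(c), held(false), available(this) {}
  Color hue() { return color; }
  void acquire() {
    EXCLUSION
    while ( held ) available.wait();   // wait until the holder releases it
    held = true;
  }
  void release() {
    EXCLUSION
    held = false;
    available.signal();                // hand the fork to the next waitor
  }
};

A philosopher following the protocol would call acquire() on her red (or yellow) fork, then on her green (or yellow) fork, eat, and finally call release() on both.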


7.3.3 Not Waiting While Holding Resources

The third necessary condition for deadlocks to occur is that, while holding resources, threads wait arbitrarily long for resources to become available.

Deadlock resolution. A deadlock can be broken when one or more threads backs off, i.e., restores and releases some of the resources it has acquired. Typically, backing off occurs when a thread is resumed artificially after:

• a time out, i.e., after waiting a length of time specified by an additional parameter to wait.

• deadlock detection, i.e., after allocators have determined, say through analysis of the resource-allocation graph, that the thread is deadlocked.

In either case, the resumed thread must be able to determine that it has been resumed artificially (e.g., by a special return value or an exception return from the call to wait()), so that it will know to back off.5

Alternatively, the system can cancel the thread and reclaim its resources, but we have assumed that resources are non-preemptable (i.e., cannot reliably be taken away by the operating system), because the state of a resource and the steps necessary to restore it cannot generally be known by the system. Unless an application is carefully designed, such preemption can leave shared information (e.g., files) in an incoherent state.

[Marginal note: Mention BEB vs. BLAM.]

7.4 Backing off for performance reasons.

In general, waiting while holding guarantees that each thread will make progress, unless a deadlock ensues. But waiting while holding diminishes the opportunities for concurrency and thereby degrades performance.

Note that a deadlock-free graph remains deadlock-free even if threads that hold no resources make arbitrary requests. For instance, a graph where at most one thread holding resources has a request (out-going edge) is deadlock-free, since all requests are realistic and, therefore, threads can't self-deadlock.6

Often, a particular operation (e.g., a financial transaction) will require exclusive access to a relatively small set of resources. Many threads, each attempting to perform such an operation, may hold or be attempting to acquire (possibly intersecting) sets of individually lockable resources. Simply ranking these resources is not efficient, since allowing a thread to hold some resources while waiting for a full set diminishes the chances for concurrent activity by other threads.

Each thread can try to acquire the resources it needs for its transaction one resource at a time. But, if it can't acquire all of them, for the sake of efficiency,

5 Time-outs can occur for a number of reasons other than deadlocks, e.g., a printer might be out of paper. There is no need to back off (i.e., release resources) in such a case.

6 In such circumstances we can prevent starvation (indefinite postponement) by passing around the privilege of waiting while holding.


it should "back off" by releasing those it has already acquired. It should then wait on the availability of the resource it couldn't acquire, and then start over. To prevent collisions with other threads trying to acquire some of the same resources, it helps if threads attempt to acquire resources in order of increasing rank (e.g., address).

To implement such multiple requests, we need a new scheduler class Lockable and a friend function multiRequest that attempts to acquire each Lockable of a specified set:

void multiRequest( Set<Lockable> resources ) {
  sort resources by address;                       // i.e., by rank.
  while ( true ) {
    // Try each resource in sorted order, stopping at the first failure.
    // (Assuming Set supports iteration; otherwise read this as pseudocode.)
    Set<Lockable>::iterator x = resources.begin();
    for ( ; x != resources.end(); ++x ) {
      if ( ! x->tryToAcquire() ) break;
    }
    if ( x == resources.end() ) return;            // You've locked 'em.
    // Otherwise, back off: release everything acquired so far ...
    for ( Set<Lockable>::iterator y = resources.begin(); y != x; ++y )
      y->release();
    x->await();   // ... wait for availability of x, but don't acquire it ...
    // ... and then try again, from the top.
  }
}

The following points are worth noting:

• Sometimes we don't know in advance exactly which resources need to be acquired — each required resource may contain a reference to the next.

• In the definition of multiRequest, the base class Lockable must have a service tryToAcquire that doesn't wait when the lock is unavailable. Our definition of SharableLock, for example, includes such a service. We also use a service, await(), that waits for availability but doesn't acquire.

• The dining philosophers can apply multiRequest to their respective sets (pairs) of forks. In such a case, no deadlock can ensue because nobody waits while holding. Also, at any time, at least a third of the philosophers who wish to eat can do so, since an eating philosopher can block at most her two adjacent colleagues from eating. By contrast, the ranked-forks solution can degenerate into a situation where at most one philosopher eats at a time.

This protocol allows starvation, since some unfortunate philosopher might be blocked by her left and right neighbors in an alternating pattern of indefinite length. But, as noted earlier, to avoid the possibility of starvation, we can pass around, say clockwise, the privilege of waiting while holding a fork.

• In a back-off protocol, there is a possibility of thrashing, where each thread acquires and releases resources without making much progress — imagine, for instance, that repeatedly and in unison each philosopher picked up her left fork, noticed that her neighbor had done the same, and so put her left fork back on the table. To prevent such a live-lock, one should order the resources and acquire them in rank order, thereby guaranteeing that the thread with the highest-ranking resource will make progress.

7.5 Deadlock Avoidance

Imagine a game played on a given resource-allocation graph pitting the threads against the allocators. A move by the threads is to make some requests and releases. A move by the allocator is to grant some requests. A resource-allocation graph is deadlockable if and only if there is a strategy by which the threads can force a deadlock.

Note that a non-deadlockable graph remains non-deadlockable if the allocators grant some provisional requests, i.e., ones where the requestors will make no further requests (i.e., not wait) while holding those resources. A thread's final request is obviously provisional. But any request that is maximal (i.e., that asks for all that the thread will ever acquire of each resource) is potentially final in the sense that whenever the thread gives up any resources after a maximal request, the allocators can hold those resources in reserve for future requests by that thread. If the current request of each thread is the last it will ever make, then (by induction) each non-deadlocked thread will ultimately resume and finish. It follows that a given resource-allocation graph is deadlockable if and only if there is a deadlock in the corresponding maximal-request graph, i.e., the graph obtained by increasing each thread's requests to whatever it needs to meet its maximum subsequent need for each resource.

Habermann's deadlock-avoidance policy. Habermann's deadlock-avoidance policy is to never resume a given thread unless the resulting graph has a maximal-request graph that is deadlock-free. Otherwise, the requesting thread must request again later, say after waiting on a special condition that receives a broadcast each time a resource is released.7 This policy:

• requires an a priori listing of each thread's maximum needs,

• assumes pessimistically that a maximal request is made at the time of the safety check — that pessimism loses some potential concurrency.

The following example represents a resource graph in terms of matrices.

Capacity vector, A: Suppose a given system has n resource pools; say it has a1 units of the first resource, a2 of the second, · · ·, an of the n-th. For example, say that we have one printer, three frame grabbers, and five CD writers:

7 Habermann's policy is sometimes called the bankers' algorithm — it generalizes the policy that bankers (supposedly) use in allocating money for construction loans: never make a loan if there would not be enough money left to finish some project so that the return of the capital for that project would furnish capital to finish another, etc.


A = ( 1 3 5 )

Allocation matrix, B: Say that there are m threads and that bij denotes the amount of the j-th resource currently allocated to the i-th thread, i.e., the combined weights of the edges from the j-th resource to the i-th thread. For example,

B =
( 0 1 1 )
( 1 1 2 )
( 0 1 0 )

In this example, the current allocation is realizable, since the sum of each column of B is no larger than the corresponding entry of A, i.e., for all j,

b1j + · · · + bmj ≤ aj.

Maximum-needs matrix, M: Let mij denote the maximum amount of the j-th resource that the i-th thread will ever need from now on. For example,

M =
( 1 1 1 )
( 1 2 4 )
( 0 1 2 )

Final-request matrix, C: Let C = M − B, i.e., let cij denote the additional amount of the j-th resource that the i-th thread needs to acquire to have all of that resource it will ever need from now on.

C =
( 1 0 0 )
( 0 1 2 )
( 0 0 2 )

Assume that we boost all current requests to final requests, i.e., make C the current request matrix. In general, we resume thread i by adding its current request (i.e., the i-th row of C) to its current resource allocation (the i-th row of B)8 and setting its current request to zero. But, because the requests in C are final, a resumed thread will finish and release its resources, i.e., the i-th row of B will become zero.

To determine whether the resource graph consisting of allocation B and request C is deadlock-free given capacity vector A, we make a pass and note that the first and second threads cannot resume, but the third thread can resume and finish, so we release its resources, i.e., zero out its row in B. On a second pass, the second thread can now resume and finish, so we release its resources. Now that the resources of the other two threads have been released, the first thread can resume and finish. So, there is no possibility of deadlock here, since all threads can resume and finish.

8In the case at hand, that sum is simply the i-th row of M .
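The pass just described can be expressed directly in terms of the matrices. The following is a hedged, self-contained sketch (variable names invented for this illustration): starting from what is currently unallocated, it repeatedly "finishes" any thread whose final request fits into the available amounts and returns that thread's allocation to the pool; the graph is deadlockable exactly when some thread can never be finished this way.

#include <cstdio>

const int m = 3, n = 3;                        // threads, resource pools
int A[n]    = { 1, 3, 5 };                     // capacities
int B[m][n] = { {0,1,1}, {1,1,2}, {0,1,0} };   // current allocations
int C[m][n] = { {1,0,0}, {0,1,2}, {0,0,2} };   // final (maximal) requests, C = M - B

int main() {
  int avail[n];
  for ( int j = 0; j < n; ++j ) {              // available = capacity - allocated
    avail[j] = A[j];
    for ( int i = 0; i < m; ++i ) avail[j] -= B[i][j];
  }
  bool finished[m] = { false, false, false };
  for ( bool progress = true; progress; ) {
    progress = false;
    for ( int i = 0; i < m; ++i ) {
      if ( finished[i] ) continue;
      bool fits = true;
      for ( int j = 0; j < n; ++j ) if ( C[i][j] > avail[j] ) fits = false;
      if ( ! fits ) continue;
      for ( int j = 0; j < n; ++j ) avail[j] += B[i][j];   // thread i finishes
      finished[i] = true;
      progress = true;
      std::printf( "thread %d can resume and finish\n", i + 1 );
    }
  }
  for ( int i = 0; i < m; ++i )
    if ( ! finished[i] ) std::printf( "thread %d is deadlockable\n", i + 1 );
  return 0;
}

On the example above it reports, in order, that threads 3, 2, and 1 can finish, matching the passes described in the text.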

Chapter 8

SCHEDULING

[Marginal note: Add the simplest version of the theorem that as utilization approaches one, waiting times go to infinity.]

Scheduling is the matter of selecting requests to service:1

• when to select (i.e., when to signal a given condition)?

• which to select (i.e., what value to pass as the priority argument of a request to a condition's wait service)?

A selection algorithm is sometimes called a scheduling algorithm or scheduling protocol, and the set of rules on which a selection algorithm is based is called a scheduling policy, strategy or discipline.

Contexts where scheduling problems arise:

1. Traffic engineering

2. Cafeteria management

3. Industrial engineering or operations research or systems analysis

4. Communications

5. Military (target selection)2

6. Medicine (triage)

7. Operating systems (dispatching).

In some of these contexts, the services performed by servers on behalf of clients and/or their requests are called "jobs." We are primarily interested in serial servers, which service only one request at a time. Many parallel servers

1 The analytical study of scheduling performance is called queuing theory. The presentation in this chapter is based on the treatment of scheduling in [HA] and on conversations with Herb Hellerman.

2 During battle, systems-trained military personnel often speak glibly of "servicing" a target. Whatever it is they have in mind is not likely to be seen as a "service" by the inhabitants of the target.



can be partitioned into serial servers — for instance, a bank that has a common customer queue for multiple tellers. There are, however, servers that are intrinsically parallel, e.g., a movie screen services many movie-goers at a time.

Schedulers that allocate a resource according to a one-at-a-time protocol are called serializers. Locks, for example, are the simplest serializers. Requests for the acquire service of a more complex serializer might involve parameters such as deadlines, estimates of required service time, etc.

8.1 Performance terminology

Often, the choice of how priorities are assigned has a major impact on performance. Usually, it does not affect the correctness of the system — if, however, one used the wrong waiting priority for our AlarmClock monitor (see Subsection 5.2.7, on page 71), threads would not awaken on time.

We begin by reviewing some systems-analysis/operations-research terminology regarding performance. Imagine, for instance, that the server is a disk drive.

• Utilization of a server during an interval: the fraction of that interval that the server spends servicing requests, i.e., is not idle.

• Throughput of a server during an interval: the number of requests serviced during that interval divided by the length of that interval.

• Processing time for a particular request: time required for the server to complete that request at 100% utilization.

• Waiting time for a particular request: time spent waiting in queues.

• Turnaround time3 for a particular request: waiting time plus service time.

• Latency (a.k.a. set-up time) for a particular request: time from start of service to first output.4

• Service time for a particular request: processing time plus latency.

• Response time for a particular request: waiting time plus latency.5

3 a.k.a. "sojourn time" or "time in system."

4 A given request sits in a queue for some waiting period until it is selected. The request's service begins with a latency period during which the disk heads are moved into position and the disk rotates into position. Then there is a period of data transmission that may involve repositioning the heads, say to an adjacent track.

5 Variance in response time is a negative factor in system performance that leads to such things as: hitting backspace, having nothing happen, hitting it again and then seeing two characters disappear.


• A server's transfer rate (a.k.a. bandwidth) during a data-transmission request: quantity of data transmitted divided by transmission time.6

• Backlog of a server (or pool of servers) at a given instant: the set of requests waiting for service from that server or pool of servers at that instant.

• Queue length of a server (or pool of servers) at a given instant: the number of requests in the server's backlog at that instant.

• Length of a schedule on a given server: the total service time required to complete that schedule, i.e., the sum of the service times for all requests in the server's backlog under that schedule, including any remaining service required by those requests currently receiving service.

If, for a given backlog, different schedules have different lengths, the server is said to be schedule sensitive. Usually, schedule sensitivity results from variance in latency. Disk drives, for example, are schedule sensitive, since the amount of head movement required for a given set of requests varies with the order in which those requests are serviced — see 10.2.3 on page 163. By contrast, CPUs and printers are schedule-insensitive.

For schedule-sensitive servers, utilization and throughput improve as more requests enter the backlog. Especially in the case of multiple servers, scheduling policies can encourage overlapped utilization of servers, e.g., a CPU scheduling policy that favors I/O-bound requests over compute-bound requests will often increase disk utilization.

Note. A thread spends its time waiting, running in user mode, or running in privileged mode. The most important computational work gets done in user mode. To get I/O service, however, a thread must wait for access to an I/O device and then wait for I/O completion while the I/O service takes place.

Note. During any given interval, the average queue length is the total waiting time for all requests serviced during that interval divided by the length of the interval. And the total waiting time for all requests serviced during that interval is their average waiting time times the number of requests serviced. And the number of requests serviced is the average throughput (service rate) times the length of the interval. Back-substituting and cancelling out the interval length, we obtain Little's Law: the average queue length is the throughput (service rate) times the average waiting time.
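For instance (with numbers chosen purely for illustration), if a disk drive sustains a throughput of 100 requests per second and the average waiting time is 50 milliseconds, Little's Law gives an average queue length of 100 × 0.05 = 5 requests.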

6 There is a saying in communications and memory systems: "You can always buy bandwidth, but latency is forever." Consider for instance the bandwidth of a 747 flying a full load of CDs from LA to New York. Flying two 747s would double the bandwidth, but to cut the latency in half you'd have to fly the CDs in a plane twice as fast as a 747.


Note. Given a server, let ρ denote its utilization. Then average service time is (by definition) ρ divided by throughput.

Let P(k) denote the probability that there are exactly k requests at the given server, i.e., waiting for or being served by that server. The server is said to be:

• "work conserving" if and only if it is utilized whenever it has requests present, i.e., ρ = 1 − P(0);

• in the "steady state" if and only if, for each integer k, P(k) is not changing.

Suppose that a server's queue is in the steady state. Then the throughput is equal to the rate at which requests arrive, which is commonly called the arrival rate and denoted λ. Also, the probability of the queue shrinking and the probability of it growing are equal, i.e., for all k, λP(k) is equal to P(k+1) divided by average service time, so λP(k) = P(k+1)λ/ρ, so P(k+1) = ρP(k). Also, if we take a sufficiently small interval, the probability of it doing both is vanishingly small. The same holds for the probability of shrinking or growing by more than one. So, by induction, P(k) = ρ^k P(0) = ρ^k(1 − ρ), where P(0) = 1 − ρ follows from the fact that the P(k) must sum to one. So the average number of requests at this server is Σ_k k P(k), i.e., Σ_k k(1 − ρ)ρ^k, which evaluates to ρ/(1 − ρ).

[Marginal note: Revise: replace with a proper derivation of the queue-length formula for M/G/1 queues.]

Exercise. Graph throughput of a monoprocessor system as a function of the degree of multiprogramming (i.e., the number of kernel-known threads that are runnable or running) and explain that graph. Consider three separate cases:

1. All threads are CPU bound.

2. All threads are I/O bound doing disk I/O.

3. There is a mix of CPU-bound and I/O-bound threads.

We will later repeat this exercise for virtual-memory systems.

8.2 Time of Selection

Policies for when to select a request to receive service fall into two general categories, preemptive and nonpreemptive. Of course, a preemptive policy can be used only with a preemptable server, e.g., CPUs are preemptable but printers are not.

• In nonpreemptive scheduling, selection is done at request completion, and also at request arrival when the server is idle.

• In preemptive scheduling, selection occurs at request completion and at request arrival. Also, it may occur at regular intervals called time slices. This last technique is called multiplexing, more specifically time-division multiplexing. The appropriate time-slice interval depends primarily on preemption overhead. At the end of a time slice, the priority of the current thread for this service is recalculated and compared to the priority of the thread at the head of the condition's queue. If the current thread has lower priority, it signals the queue and then waits at its new priority (see the sketch following the note below).

[Marginal note: The above suggests the need for two more operations on conditions: one that returns the priority of the first waitor else INT_MAX, and another that yields, i.e., signals and waits on this condition.]
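A hedged sketch of those two operations and of the end-of-time-slice check that would use them; SchedCondition, firstPriority, and yield are hypothetical names, not part of the Condition class of Chapter 6, and the sketch assumes that the queue operation first() used by bounce is visible to derived classes.

class SchedCondition : public Condition {
public:
  SchedCondition( Monitor* m ) : Condition( m ) {}
  // Priority of the first waitor, else INT_MAX (from <climits>) if nobody waits.
  int firstPriority() {
    return awaited() ? ((Thread&) first()).priority() : INT_MAX;
  }
  // End-of-time-slice check: if the first waitor is more urgent (has a smaller
  // priority number), let it run and requeue the current thread at its new priority.
  void yield( int myNewPriority ) {
    if ( firstPriority() < myNewPriority ) {
      signal();
      wait( myNewPriority );
    }
  }
};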

8.3 Selection Policies

Implementation. According to Dijkstra, the most important consideration in programming is the proper separation of concerns.7 It is especially important to separate policies from implementation mechanisms whenever possible. The appropriate tool for implementing scheduling policies in a way that separates policy from mechanism is the prioritized wait operation on conditions. "When to select" becomes an issue of when to signal, and "which waiting thread to select" becomes an issue of picking the right numerical formula for the priority argument of the wait request.

For standard prioritized waiting to be able to implement a given policy, it is important that, under that policy, a thread's priority not change while it waits. This requirement is sometimes called the implementation principle — see [HA]. Often, policies that violate the implementation principle can easily be modified to yield an equivalent conforming policy. For example, oldest-first is not implementable, since age constantly changes. The equivalent policy, earliest-birthdate-first, conforms to the implementation principle, since birthdates are time invariant.

To implement policies that violate the implementation principle, one must treat priorities as functions rather than numbers and select the highest-priority waitor by scanning the entire queue and recomputing priorities.

Standard selection policies. Note that some of the following policies are based on attributes of the request itself, e.g., first-come-first-served selection is based on time-of-request, while others are based on attributes that the request inherits from the requesting client, e.g., the client's priority and/or service history. Round-robin selection, for example, is based on the time of the most recent service to the requesting client.

1. Random selection:

selection: preemptive or nonpreemptive.

priority: a random number chosen at selection time (not arrival time).

implementation principle: violates.

indefinite postponement: allows, but with probability that diminishes exponentially with waiting time.

comments: If instead we simply assigned each request a random number at arrival time, requests that draw high numbers would wait a very long time.

7Find citation.


2. First-come-first-served (FCFS):

selection: nonpreemptive.

priority: time of request.8

implementation principle: conforms.

indefinite postponement: prevents.

comments: Minimizes variance in waiting times. Waiting time for a given request is the schedule length at its time of arrival (a.k.a. virtual waiting time).

[Marginal note: Ask Mart for a reference on optimality.]

3. Shortest-request-first:

selection: nonpreemptive.

priority: estimated service time.

implementation principle: conforms.

indefinite postponement: allows.

comments: Gives the shortest average turnaround time of any nonpreemptive policy for servers whose throughput is not a function of scheduling. Relative to FCFS, it increases the average response time for big requests and decreases response time for small jobs. Requires a priori estimates of service times. A good policy for printer scheduling, since print files have a known length.

4. Shortest-remaining-time:

selection: preemptive.

priority: remaining service time, i.e., estimated request length minus accrued service.

implementation principle: conforms.

indefinite postponement: allows.

comments: A preemptive version of shortest-request-first. Gives the shortest average turnaround time of all possible policies for schedule-insensitive servers.

5. First-of-optimal-schedule:

selection: preemptive or non-preemptive.

priority: position on optimal schedule for all admitted requests.

implementation principle: violates.

indefinite postponement: prevents.

comments: For schedule-sensitive servers such as disks, finding optimal schedules is usually an NP-complete problem, in which case one must use heuristics when the queue is long. We admit all waiting requests to the schedule whenever the current schedule finishes. But, to prevent indefinite postponement, we admit newly arriving requests only if, after servicing the first request, the remaining schedule has shorter length than the current schedule.

8 Default priority works fine if the implementation breaks ties on a first-come-first-served basis.

6. Round-robin:

selection: preemptive with multiplexing.

priority: time of client’s last service.9

implementation principle: conforms.

indefinite postponement: prevents.

comments: The turnaround time is proportional to queue length (including the current request) times service time.

7. Soonest-deadline-first:

selection: preemptive without multiplexing.

priority: deadline.10

implementation principle: conforms.

indefinite postponement: prevents.

comments: Used most often in real-time systems. On a schedule-insensitive server, for all requests to be on-time completable under soonest-deadline scheduling, it is sufficient (but not necessary) that the following validity criterion be met:

s1/(d1 − t) + · · · + sn/(dn − t) ≤ 1

where si denotes the remaining service time for the i-th request, di denotes its deadline, and t denotes the current time. Roughly, this criterion says that, at any point in time, the total of all the fractions of the service needed by the current requests isn't more than one — if I need three quarters of the CPU cycles for the next three seconds to meet my deadline and you need one third of them, then one of us is going to miss his/her deadline. On the other hand, if you need that much for only a millisecond, there is no problem — so the above criterion is not a necessary condition.

8. Least leeway:

selection: non-preemptive or preemptive with de-facto multiplexing.

priority: leeway,11 i.e., time until deadline minus remaining service time required for completion.

9 Default priority works fine if implementation breaks priority ties on a first-come-first-served basis.

10 Soldiers whose position is under attack have a tendency to "service" the nearest attacker first, and I'm told that some students facing multiple impending assignment deadlines tend to have similar scheduling instincts.

11 In some contexts, "leeway" is called "slack."


implementation principle: conforms.

indefinite postponement: prevents.

comments: Requires a priori estimates of service times. Often performs better than the more popular deadline scheduling. To check whether all requests are on-time completable under least-leeway scheduling, maintain an estimated completion time for each job. When a new request arrives, add its service time to the completion times for all requests having subsequent deadlines.

9. Least-accumulated-service:

selection: preemptive with multiplexing.

implementation principle: violates. (Requires periodic re-scaling of priorities of waiting clients — see below.)

indefinite postponement: prevents.

priority: ∫_{−∞}^{t} s(x) e^{c(x−t0)} dx, where t is the current time, t0 is some fixed base time, and s(x) is 1 if this thread was receiving service at time x, and 0 otherwise. The exponential term is used to weight past service less than recent service. The rate constant c determines how rapidly that weight decays over time.

comments: Gives a burst of service to a recent arrival based on how long that request has been without service from this server. Compared to round-robin, this policy improves average response time and promotes overlapped use of resources.

Least-accumulated-service: "What have you done for me lately?" Least-accumulated-service scheduling is based on the concept that a server should give the most service to the client that recently has been getting the least service. To promote overlapped utilization of resources, it would be best to select the client that will require the least service from this server before moving on to request service from some other server. Doing so promotes utilization of other servers by diminishing the probability that they sit idle due to their input queues being empty. Also, getting that request onto the queue of another server increases that other server's throughput, if its throughput improves with queue length (which, for example, is the case with disk drives). Since often it is impossible to foretell which thread will require the least service before waiting for another server, we select instead the thread requesting/getting the least (recent) service from the scheduled server on the assumption that, per the principle of temporal locality, the near future will be like the recent past, i.e., that the thread which has requested the least service lately will be likely to require the least service before moving on.

The total amount of service given to a request is obtained by integrating its service function s(x) from minus infinity to the present time, ∫_{−∞}^{t} s(x) dx. If, however, we set each request's priority to be its total service time thus far, a new request would receive all of the server's service until that request catches


up with the rest — such a policy would not promote good system response and/or overlapped use of resources. So, we discount past service, i.e., make the value of past service decay over time. To do so, we simply multiply by the appropriate exponential function, obtaining ∫_{−∞}^{t} s(x) e^{c(x−t0)} dx, which we denote by priority_{t0}(t), where t denotes the current time and t0 denotes some fixed base time.

The choice of t0 is irrelevant, except that if t0 is too much smaller or larger than t, we encounter values outside the processor's floating-point range. Therefore, we must occasionally shift the base t0 to another point in time, say t1, to avoid overflows:

priority_{t1}(t) = ∫_{−∞}^{t} s(x) e^{c(x−t1)} dx

               = ∫_{−∞}^{t} s(x) e^{c(x−t0)+c(t0−t1)} dx

               = ∫_{−∞}^{t} s(x) e^{c(x−t0)} e^{c(t0−t1)} dx

               = e^{c(t0−t1)} ∫_{−∞}^{t} s(x) e^{c(x−t0)} dx

               = e^{c(t0−t1)} priority_{t0}(t).

Such a shift of base violates the implementation principle, but does not reorder the relative priorities of the waiting requests. It requires only that each priority value be accompanied by its base time — a thread's base time needs to be updated only when the thread's priority is used in a comparison.

Suppose that the current thread completed its last service at time T, and that we computed its priority at that point. Suppose further that the current service interval started at time b. Then the thread's priority at time t (e.g., the end of the current time slice) can be computed as follows:

priority_{t0}(t) = ∫_{−∞}^{t} s(x) e^{c(x−t0)} dx

               = ∫_{−∞}^{T} s(x) e^{c(x−t0)} dx + ∫_{T}^{b} s(x) e^{c(x−t0)} dx + ∫_{b}^{t} s(x) e^{c(x−t0)} dx

               = ∫_{−∞}^{T} s(x) e^{c(x−t0)} dx + ∫_{T}^{b} 0 · e^{c(x−t0)} dx + ∫_{b}^{t} 1 · e^{c(x−t0)} dx

               = priority_{t0}(T) + ∫_{b}^{t} e^{c(x−t0)} dx

               = priority_{t0}(T) + [e^{c(t−t0)} − e^{c(b−t0)}]/c.

Choice of parameters. The proper choice of decay rate is important: too slow leads to the same problems as no decay; too fast leads to all threads, other than the current one, having priority near zero — at which point, the policy becomes round robin. Suppose we choose a value of c that gives a doubling time of about a second, i.e., c = ln(2)/k where k is the number of time units in a second. A thread that goes unserviced for several seconds, e.g., one waiting for keyboard input, soon acquires the highest priority, namely, zero. A thread that makes a lot of disk requests with little intervening CPU service also tends to get high CPU priority (i.e., priority for ready).

Integer implementation. One can approximate $e^x$ by the first few terms of its Taylor-series expansion: $1 + x/1! + x^2/2! + \cdots$. Better yet, one can store a table of $e^x$ for the appropriate range and linearly interpolate.
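A minimal sketch of the table-plus-interpolation idea follows; it is not from the text. The table span, the step count, and the 16.16 fixed-point scale are assumptions chosen for illustration — crude, but ordering is all the comparisons need.

    /* Sketch: e^x over an assumed range [0, 8), precomputed at 1024 points as
     * 16.16 fixed-point integers, with linear interpolation in between.     */
    #include <math.h>
    #include <stdint.h>

    #define STEPS 1024
    #define RANGE 8.0                    /* assumed span of exponents        */

    static uint32_t exp_table[STEPS + 1];  /* e^x scaled by 2^16             */

    void init_exp_table(void)
    {
        int i;
        for (i = 0; i <= STEPS; i++)
            exp_table[i] = (uint32_t)(exp(i * RANGE / STEPS) * 65536.0);
    }

    uint32_t exp_fixed(double x)           /* e^x in 16.16 fixed point       */
    {
        double   pos = x * STEPS / RANGE;  /* fractional table index         */
        int      i   = (int)pos;
        uint32_t lo, hi;
        if (i < 0)      return exp_table[0];
        if (i >= STEPS) return exp_table[STEPS];
        lo = exp_table[i];
        hi = exp_table[i + 1];
        return lo + (uint32_t)((hi - lo) * (pos - i));
    }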

Crude approximations suffice for preserving priority ordering, which is what we really care about. An occasional error that allows a request to run out of turn might impact performance slightly but should not impact correctness.

Move the following paragraph to a more suitable place.

8.4 Priority Inversion

There is a scheduling anomaly known as “priority inversion,” wherein a low-priority client holds a resource requested by a high-priority client. Although the high-priority client might wait at top priority, the condition’s next signal won’t occur until the low-priority resource holder obtains its required service. A possible solution is to boost the priority of the resource holder to that of the highest-priority request waiting for that resource. Most implementations of Hoare semantics avoid priority inversions by immediately giving a CPU (with appropriate preemptive services blocked) to the signalled thread until it exits the monitor. CPU priority doesn’t get any higher than that.

8.5 Load leveling

There are many situations where there is a system of interchangeable servers. For such systems a common queue is often the best solution (e.g., on a tightly coupled symmetric multiprocessor system), but often the set-up time would be excessive (e.g., on a cluster of PCs), in which case each server must have its own queue. In that case, we try to direct newly arriving jobs to the server with the shortest current schedule, a policy that is called “load leveling.”12 Having separate per-server queues has the advantage that there is no master queue whose lock can become a bottleneck. Depending on the cost of “migration,” a thread may move to a less loaded server, i.e., one having a shorter current schedule length.

12But note that, to determine the queue with the shortest schedule length, either a lock per queue must be acquired or a sharable lock on a heap of schedule lengths must be acquired — that heap must be non-sharably updated each time a thread (re)enters one of the per-server queues.


8.6 Discussion

It is important not to confuse the selection of scheduling policy with the selection of a data structure for queue representation. There are many ways to implement priority queues:

• Unsorted linked lists. Insertion is an O(1) operation, and extraction is O(n).

• Sorted linked lists. Insertion has an expected complexity of n/2 steps, while extraction is O(1).

• Heaps, i.e., trees whose minimum element is at the top and whose subtrees are heaps.13 Balanced heaps and leftist trees have complexity O(log(n)) for both insertion and extraction.

For larger queues, some sort of heap is required. For small queues, I recommend unsorted lists with a minimum priority number of zero and zero as the default priority. Insertion always appends to the tail of the list and has O(1) complexity. Extraction simply takes the first item on the list if its priority is zero; thus extraction is O(1) in the default case, i.e., FCFS. Implementing round-robin by extracting and inserting at default priority gives it O(1) complexity as well. Policies that violate the implementation principle require scanning the entire queue, at which time priorities are recomputed, which again is best accommodated via unsorted lists. (A sketch of such a queue follows.)
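The sketch below illustrates the unsorted-list queue just described; the struct and function names are illustrative, not from the text. The default-priority case (priority zero at the head) extracts in O(1); otherwise the whole list is scanned.

    #include <stddef.h>

    struct qnode {
        int           priority;       /* 0 is the minimum and the default    */
        void         *payload;
        struct qnode *next;
    };

    struct queue { struct qnode *head, *tail; };

    void enqueue(struct queue *q, struct qnode *n)        /* always O(1)     */
    {
        n->next = NULL;
        if (q->tail) q->tail->next = n; else q->head = n;
        q->tail = n;
    }

    struct qnode *extract_min(struct queue *q)
    {
        struct qnode *best, **link, **best_link;
        if (!q->head) return NULL;
        if (q->head->priority == 0) {                     /* O(1) FCFS case  */
            best = q->head;
            q->head = best->next;
            if (!q->head) q->tail = NULL;
            return best;
        }
        best_link = &q->head;                             /* otherwise: O(n) */
        for (link = &q->head; *link; link = &(*link)->next)
            if ((*link)->priority < (*best_link)->priority)
                best_link = link;
        best = *best_link;
        *best_link = best->next;                          /* unlink the min  */
        if (q->tail == best) {                            /* repair the tail */
            struct qnode *p = q->head;
            q->tail = NULL;
            for (; p; p = p->next) q->tail = p;
        }
        return best;
    }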

Exercise (hard). Read http://lwn.net/2002/0110/a/scheduler.php3 (January 2002), and explain the scheduling policies in Ingo Molnar’s “O(1) scheduler” for the Linux 2.5 kernel without referring to his data structures.

13This notion of heap has nothing to do with dynamically allocated free-space areas.


Chapter 9

MEMORY MANAGEMENT

Mention shared address-space vs. separate-address-space models for operating systems.

The goals of memory management are fault tolerance, high reliability, high performance, and high capacity, all at the lowest possible price. In addition, memory management must support sharing, both for economic and for operational reasons, e.g., to accommodate centralized, rapidly changing databases such as those found in air-traffic-control systems. Sharing, in turn, requires protection of both the privacy and the integrity of the sharers’ data.

Somewhere mention checksums (perhaps as an identity/handle).

9.1 Tables

Memory systems (a.k.a. stores or storage systems) are a subclass of an abstract class that we will call tables. Tables are used for storing and retrieving items of data.1 Among the services provided by tables are: store (or write), lookup (or read), and possibly remove. The servicing of a request for any of these services is called an access or reference to the corresponding item in the table.

The lookup service takes a parameter called a key,2 and returns (possibly by reference) the corresponding data item, if any. If tab is a table and k is a key, in some languages that item is designated by “tab[k].” The event where there is no such item is called a miss, and in such cases a special value is returned or an exception occurs.

Tables are characterized by various properties:

• Capacity — how many items can be stored.

• Item size (if the table has items of a fixed size).

1The term data item (a.k.a. entry) is used in the general study of tables, but segments of memory that hold data are commonly called “objects” in C/C++ literature.

2In some contexts a key is called an “address,” a “name,” a “token,” a “handle,” or any of many other synonyms for an identity.



• The key space — the possible keys and how many there are.

• Write limitations, e.g., read/write, write-once, or read-only.

• Cost of services both in terms of time and of physical energy:

– cost of a lookup:

∗ cost of a hit, i.e., a successful lookup.

∗ cost of a miss, i.e., an unsuccessful lookup.

– cost of an insertion,

– cost of a deletion.

• Removability (a.k.a. mountability), e.g., diskettes.

Be sure that we somewhere mention John Kubiatowicz’s code scheme for off-site archival copies.

• Reliability and tolerance of faults such as:3

– loss of power — tables that are vulnerable to loss of power are sometimes said to be “volatile.”

– natural calamities, such as earthquakes,

– not-so natural calamities, such as acts of tampering or terrorism,

– kernel crashes,

– the inadvertent flipping of bits due to high-energy radiation or other physical phenomena.

• Addressing capabilities/methods:

“Addressing” may not be the right term here. Check exactly what the terms “index” and “indexing” mean to database folks.

– index-based addressing: Tables whose address space is a segment (i.e., a subrange) of the integers and whose capacity is exactly the same as the size of their address space are called arrays, e.g.:

∗ read-only hardware arrays, which are called “read-only memories” or ROMs,

∗ write-once hardware arrays, which are called “programmable read-only memories” or PROMs,

∗ read/write hardware arrays, which are called “random-access memories” or RAMs.

After a given item of an array has been read or written, the additional time cost of transferring (i.e., reading or writing) the next item and subsequent items thereafter is often relatively small. In such cases, the reciprocal of that additional time is the transfer rate, and the time required to access the first item is the latency. Some hardware arrays, notably disk drives, directly support read and write services that transfer all items within a specified range, i.e., a subsegment of the array.

3Fault tolerance is sometimes called “robustness,” while intolerance of faults is often called “vulnerability.” Many faults can be handled via error detection and correction algorithms; other faults require, say, the archiving of prior versions of items, e.g., off-site copies of files.


– content-based addressing: A table with a capacity smaller than the size of its address space is called a content-addressable table (or a sparse array) — in such cases, the key must be stored with the data. Examples:

∗ read-only content-addressable hardware tables, which are called “logic arrays,”

∗ write-once content-addressable hardware tables, which are called “programmable logic arrays” or PLAs,

∗ read/write content-addressable hardware tables, which are called “content-addressable RAMs,”

∗ hash tables.

Simple content-addressable tables lack an efficient way to find the next entry, thus making it difficult to extract their items in order.

– associative addressing: Finding the first item whose key is greater than or equal to a given value is easy in a B-tree, an AVL tree, or a red/black tree — but very inefficient in a hash table. Hardware tables providing such a find-next service are called “associative memories.”

9.2 Memory-access anomalies under concurrency

Hardware read/write memory systems are often organized as arrays of bytes, but they usually have an access granularity that is larger than a byte. The unit of access granularity is usually called a word, but the term sector is used with respect to disk drives. The anomalies discussed below arise in many contexts and should be of particular concern to implementors of thread libraries and of software that concurrently accesses disk files, e.g., databases.

The point is that the reading and writing of whole words is atomic, i.e., one access to a word can never get time-interleaved with another access to that same word. To perform a partial-word write, the entire word must be read, the portion being written must be modified, and then the entire word must be written back. Some systems perform partial-word writes atomically in hardware, but at significant overhead. Software always can, and often must, make a given collection of operations on a given collection of data items mutually atomic by putting all occurrences of those operations into a common critical region protected by a lock, but again there is an overhead. (A sketch of the read-modify-write sequence appears below.)
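The following sketch, not from the text, spells out that read-modify-write sequence for a single byte within a 32-bit word; without a surrounding lock, two such writes to different bytes of the same word can interleave and lose one of the updates.

    #include <stdint.h>

    /* Store `value` into byte `which` (0..3) of the word at *word by
     * reading the whole word, splicing in the byte, and writing it back.   */
    void write_byte(uint32_t *word, unsigned which, uint8_t value)
    {
        uint32_t w = *word;                        /* read the whole word   */
        w &= ~((uint32_t)0xff << (8 * which));     /* clear the target byte */
        w |= (uint32_t)value << (8 * which);       /* splice in the value   */
        *word = w;                                 /* write the whole word  */
    }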

A contiguous segment of memory that is dedicated to holding values of a particular type during particular phases of a process is called an object in C/C++ literature. In other literature, it might be called a (possibly nameless) variable or a data item. The state of an object is the tuple of bits found in its segment of memory.

There are four memory-access anomalies that arise in environments where multiple concurrent or pseudo-concurrent streams of computational activity share access to objects. Those streams may involve threads, or the alternative continuations of execution following a pre-emption (i.e., a signal, interrupt, or exception) or an invocation of setjmp.

Tearing. Suppose that a given object spans multiple words and that a write operation on that object is in progress and has modified the state of some of those words but not others. The object is now in an indeterminate state — on some systems, accessing it might generate a trap that gets handled by the operating system. Likewise, if a write to that object were to occur while a read was in progress, i.e., after some but not all of the words of the object had been read, the value read could be similarly incoherent. This phenomenon is often called word tearing, but in reality it is the object that gets torn (along word boundaries). The problem is that a write and another access to that object got interleaved. Access to a multiword object can be made atomic through locking at either the hardware or software level.

False sharing. Suppose that multiple objects intersect a given word. Interleaved partial-word writes that update two of those objects are likely to leave one of the objects in an inappropriate state. An object is said to be isolated if writes to other objects cannot interfere with (i.e., invalidate) writes to that object in this way. In practice, isolation can be achieved either by:

• allocating the object so that it does not share any words with other objects (incurring some possible space overhead), or

• making atomic all partial-word writes to the words it occupies — such atomicity can be provided either by hardware or software, but incurs time overhead in either case.

Note that access to objects that are smaller than a word can be made atomic by allocating the object to a single word and isolating it via either of the above methods.
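A minimal sketch of the first isolation technique follows; it is not from the text, and the 64-byte padding figure is an assumption (chosen to cover a typical cache line as well as a word).

    /* Pad each per-thread counter so that no other object shares its word
     * (or, on most hardware, its cache line); writes to one counter can
     * then never invalidate writes to the other.                           */
    struct padded_counter {
        volatile long count;
        char pad[64 - sizeof(long)];     /* assumed line size: 64 bytes     */
    };

    struct padded_counter counters[2];   /* counters[0] and [1] are isolated */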

Volatility. An object’s value is the value last given to it by the current computational activity. At times an object’s state will differ from its value. On the one hand, the object might be dirty, i.e., the object might have been assigned a new value that is not yet reflected in the object’s state, e.g., the only copy of that new value might still be in a register. On the other hand, the object might be stale — for example, the object’s state might have been externally modified via a debugger. So, at certain points in a computational stream, objects must be reconciled, i.e.:

• Before that point but after the last preceding update of the object’s value by the stream, the object’s state must be updated from the object’s value, i.e., that object’s value must be stored to the object’s home location.

• After that point but before the next use of the object’s value, unless there is an intervening update of the object’s value, the object’s value must be updated (loaded) from the object’s state.


Regardless of when an object’s state is fetched, that state may instantly be modified, say, by a debugger or a signal handler. Since fetched values are never guaranteed to be absolutely fresh, how fresh is fresh enough? Ultimately, in the execution of a program there need to be sequence points, at which each object gets reconciled and after which values fetched from memory are assumed to be fresh enough until the next sequence point. In C/C++ there is a sequence point at each occurrence of a semicolon, a function call, or an operator having a defined order of evaluation.

Although every object must be reconciled at each sequence point, many of those load and store operations can be optimized away under the as-if rule4 and the default assumptions that, except where otherwise specified:

• each object is non-observable, i.e., the object’s state has no effect on other streams of computational activity, in which case the store operation associated with reconciling this object can usually be optimized away.

• each object is non-volatile, i.e., the object’s current value is the value last assigned to it by the current stream, in which case the fetch operation associated with reconciling this object can usually be optimized away.5

Certain objects violate the default assumptions, e.g., objects that are allocated at input or output registers or that are to be modified or observed via a debugger or via non-coordinated streams of computational activity. For behavior to be well defined, the default assumptions must be suspended for such objects, which adds significant reconciliation overhead when accessing them. In C/C++, this suspension is accomplished by giving the object a volatile-qualified type.6 Alternatively, the object could be accessed only via special assembly-language functions that deal directly with the object’s state. Under current practice, but without guarantee by the standards, the functions

int get( int* x ) { return *x; }

int put( int* x, int y ) { return *x = y; }

will work so long as they are not inlined and are compiled separately from the rest of the code.

Critical-region leaks. The stream-shared objects that represent the states of locks (or other stream-coordination mechanisms) must be atomic, isolated, and must be treated as volatile and observable — they can be given a volatile-qualified type or their access can be restricted to special get() and set()

4The as-if rule states that a conforming implementation of a language can generate whatever code it likes as long as the program behaves “as if” it did what the Standard says.

5In general, memory is said to be volatile in cases where it doesn’t remember. Some memory systems are called “volatile” because they fail to remember under loss of power. Similarly, some memory locations are said to be “volatile” because their content might change for reasons other than the assignment of a new value by the current computational activity.

6Note that objects are given a volatile-qualified type because they are volatile or observable, not the other way around.


functions that respect volatility and observability. Other stream-shared objects require the protection of stream-coordination mechanisms and do not require the overhead burden of isolation, atomicity, or volatile-qualified types. However, at the sequence points at the entrances and exits of critical regions (i.e., at the lock-handling operations):

• The default assumptions of non-volatility and non-observability must be suspended at least for lock-protected objects, i.e., they must be treated as volatile, since outside the critical region another thread might acquire the lock and observe and/or modify those objects. (Streams must not carry unresolved stream-local values of those objects into or out of those critical regions, in order that all streams access the current value of the protected objects.)

• The accessing of lock-protected objects must not spill out of critical regions, i.e., critical accesses must not begin until the corresponding lock has been acquired and must be completed before the lock is released. Therefore, code-reordering optimizations must be controlled at three levels:

– Compilers must not hoist critical accesses past lock acquisitions nor sink them past lock releases.7

– Threads that run on other CPUs are affected by, and only by, the order in which data is read from memory into the cache and written from the cache back to memory. At the point of lock acquisition, cached values of live shared objects must agree with the values in memory, and prior to lock release, updates of critical items must be flushed to memory. CPUs intended for multiprocessor application should have appropriate mechanisms for such purposes. Often these take the form of special instructions.

– Threads that share the locking thread’s CPU will be affected by the reordering of instructions that affect the cache content and/or control pre-emption. CPUs must either guarantee certain aspects of instruction ordering or provide instructions that allow the compiler to control instruction ordering as necessary.

Instructions that control hardware reordering of accesses are generally called “fence” or “barrier” instructions.

At the appropriate place, make the point that, on uniprocessor systems, if pre-emption (interrupts and signals) is properly blocked, no stream should ever need to wait when acquiring a lock, making the lock’s state object(s) irrelevant.

Continuation-shared data. Following an invocation of setjmp that initializes a given jmp_buf, there can be multiple continuations, each of which is a distinct stream of computational activity. The initial continuation begins with the normal return from that invocation of setjmp. Subsequent continuations

7Most compilers written for monothreaded languages do not provide directives that are barriers to code motion. But, even in a single-stream environment, a compiler would get into serious trouble with monostreamed code if it moved instructions past calls to separately compiled functions prior to linking. Post-linkage optimization can, however, present problems.


begin when there is a longjmp back to that setjmp via that jmp_buf. In terms of both semantics and implementation, a longjmp is similar to the resumption of a coroutine.8 According to the 1989 C Standard, following an invocation of longjmp:

All accessible objects have values as of the time longjmp was called, except that the values of objects of automatic storage duration that are local to the function containing the invocation of the corresponding setjmp macro that do not have volatile-qualified type and have been changed between the setjmp invocation and longjmp call are indeterminate. [C89 7.6.2.1]

So setjmp must save enough context to be able to restore the values of local objects of the call-stack predecessors of its caller. Also, all static and dynamic objects must get reconciled across calls to setjmp or longjmp. Regarding the non-volatile locals of setjmp’s caller, consider:

    int main() {
        int x;
        volatile int y;

        if ( setjmp(...) ) {
            /* Post-longjmp continuation starts here. */
            /* But what are the values of x and y ?   */
        } else {
            /* Initial continuation, where x and y get new values. */
            f();   /* where f() calls g() calls h() calls longjmp() */
        }
    }

The only copy of the unreconciled value of x could get spilled to a stack-based temporary in g(), where longjmp would have no way to find it. Because y has volatile-qualified type, however, it will be reconciled immediately after its last modification in the initial continuation and again before any use in the post-longjmp continuation.

Data shared with signal handlers. The handling of a signal is a stream of computational activity that is not synchronized with the pre-empted activity but may share access to its static variables. In C and C++ there are no intrinsically atomic data types. All that the standards guarantee is that objects of type volatile sig_atomic_t will not get torn (thereby producing an indeterminate state and undefined behavior) when accessed by a program and written by a concurrent invocation of a signal handler:

... the behavior is undefined if [the handler of a signal] ... refers to any object with static storage duration other than by assigning a value to a static storage duration variable of type volatile sig_atomic_t. [C89 7.7.1]

8In fact, it has been used to implement coroutines and threads — see [EN].


The need for atomicity and the need to suspend the normal guarantees of non-volatility and non-observability are obvious. It’s also clear that objects of type sig_atomic_t must be isolated. However, I don’t see why the behavior should be undefined when a signal handler reads such an object, e.g., one which may be the held flag of a spin-lock.

Add C89 to bibliography.
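For concreteness, here is a minimal sketch (not from the text) of the one pattern the quoted guarantee blesses: a flag of type volatile sig_atomic_t assigned by a handler and polled by the program.

    #include <signal.h>

    static volatile sig_atomic_t done;   /* atomic, volatile, observable     */

    static void handler(int sig)
    {
        (void)sig;
        done = 1;                        /* the only access C89 guarantees   */
    }

    int main(void)
    {
        signal(SIGINT, handler);
        while (!done)                    /* reread the flag on every pass    */
            ;                            /* ... interruptible work here ...  */
        return 0;
    }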

9.3 Indirection

Mention tables of tables and Currying.

Accessing an object by looking it up in a table is called indirect access, because the program stores or computes, and first accesses, the object’s key. Two-level indirection is a two-stage lookup technique where an item’s key is looked up in an intermediate table, and the resulting intermediate key is then looked up in a second table.9 The table lookup(s) involved in indirect access impose an overhead, but there are many situations where indirection is helpful. In fact, there is a saying that “There’s no problem of systems design that cannot be solved by adding another level of indirection.”

9.3.1 Dynamic rebinding/relocation

By changing the binding of the intermediate key in the second table of a tandem system of tables, we can change the binding of all primary keys associated with that intermediate key without needing to identify them. For example, when an indirectly accessed item is moved around in main memory, all references to it can be updated merely by changing a single pointer value.
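A minimal sketch of that single-pointer rebinding follows; the table size and function names are illustrative assumptions, not from the text.

    #include <stdlib.h>
    #include <string.h>

    void *intermediate[1024];               /* second table: key -> address  */

    void *lookup(unsigned key) { return intermediate[key]; }

    /* Relocate the object: every primary key that maps to `key` follows the
     * move automatically, because only this one entry is rebound.           */
    void relocate(unsigned key, size_t size)
    {
        void *new_home = malloc(size);
        memcpy(new_home, intermediate[key], size);
        free(intermediate[key]);
        intermediate[key] = new_home;
    }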

9.3.2 Failure control

If an invalid address is used when, say, calling a function, undefined behavior ensues. To diminish the chances of undefined behavior, we can call functions via a bounds-checked vector table to guarantee that either control will pass to a trusted function or a bounds-violation exception will occur.

Mention reference semantics.

9.3.3 Copy-on-write

There is a policy called “lazy evaluation,” under which operations are delayed as long as possible, in the hope that the operation will not be needed. An instance of this policy is a technique called copy-on-write, where an operation that copies one item to another is delayed until one or the other of those items gets modified — meanwhile, the original item appears at two addresses (keys) and, thus, appears to be two distinct items. Indirect references to the copy-on-write image of the original item are redirected to the original item. But the item must really be replicated (copied) whenever the original or the virtual copy gets modified.

Merge with VM-copy section.

9Note that instead of keeping the two tables separate, we can merge them. In fact, we can have an arbitrary number of levels of indirection within a single table.
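Returning to copy-on-write: the sketch below, not from the text, shows the idea with a reference-counted buffer — “copying” only bumps the count, and the real copy is deferred until one of the sharers writes.

    #include <stdlib.h>
    #include <string.h>

    struct buf { int refs; size_t len; char *data; };

    struct buf *cow_copy(struct buf *b) { b->refs++; return b; }

    /* Obtain a writable pointer; replicate the buffer first if it is still
     * shared, so the other virtual copy keeps the old contents.             */
    char *cow_write(struct buf **bp, size_t offset)
    {
        struct buf *b = *bp;
        if (b->refs > 1) {
            struct buf *fresh = malloc(sizeof *fresh);
            fresh->refs = 1;
            fresh->len  = b->len;
            fresh->data = malloc(b->len);
            memcpy(fresh->data, b->data, b->len);
            b->refs--;
            *bp = b = fresh;
        }
        return b->data + offset;
    }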


9.3.4 Dynamic Memory Allocation

A heap (a.k.a. dynamic memory allocator)10 is a server consisting of an array and two services: allocate and release (a.k.a. free or deallocate).

The allocate service takes, as a parameter, the size of the requested subsegment of that array and, if possible, returns a reference to a freshly allocated segment of at least that size. An overflow is a situation where there is no free segment of sufficient size from which to fill the request — not a good thing.

The release service takes, as a parameter, a reference or pointer to an allocated segment to be put back into the unallocated-memory pool. For coalescing free space, it is helpful to keep a doubly linked list of all segments, ordered by address. As an optimization to save list-management overhead, many implementations round upward requests that would otherwise leave an overly small residual free segment. Each segment’s descriptor can then be allocated as a “header” within the segment itself.
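A minimal sketch of that header-plus-address-ordered-list arrangement follows; it is not from the text, and it assumes that segments are physically contiguous so that absorbing a neighbor just means adding its header and payload sizes.

    #include <stddef.h>

    struct seg {
        size_t      size;        /* payload size in bytes                    */
        int         free;        /* nonzero if on the unallocated pool       */
        struct seg *prev, *next; /* address-ordered list of all segments     */
    };

    /* Release a segment and coalesce it with any adjacent free segment.     */
    void release(struct seg *s)
    {
        s->free = 1;
        if (s->next && s->next->free) {              /* absorb successor     */
            s->size += sizeof(struct seg) + s->next->size;
            s->next = s->next->next;
            if (s->next) s->next->prev = s;
        }
        if (s->prev && s->prev->free) {              /* fold into predecessor */
            s->prev->size += sizeof(struct seg) + s->size;
            s->prev->next = s->next;
            if (s->next) s->next->prev = s->prev;
        }
    }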

Placement policies for dynamic allocation

From which free segment should space be allocated when more than one free segment is of sufficient size to accommodate a given request? The policy on which that decision is based is called the heap’s placement policy. Placement policies are on-line algorithms for the memory-allocation problem. That problem can be viewed as the one-dimensional bin-packing problem generalized via the introduction of release operations. Among the standard placement policies are:

• Paging: A paged heap’s space is divided into page-size segments called page frames. We round requests upward to that size. A request larger than this page size must be rejected, unless it can be broken into smaller requests (which is why paging hardware is often added to CPUs). Usually, there is no reason to prefer one free frame over another for storing a given page, but in disk systems, for example, placement can have a large impact on latency and effective bandwidth. (See Section 11.6 on page 180.) Obviously, a request succeeds if and only if there is at least one free frame. Some lookup mechanisms, e.g., set-associative caches, restrict the placement of fixed-size items.

• Buddy system: The size of each segment is a power of two, and, for each power of two, we keep a list of the free segments of that size. Requests are rounded upward to the nearest power of two and filled by returning the first free segment from the corresponding free-segment list. If that list is empty, we split a block of the next larger size into two “buddies” of the requested size. If the list of next-larger free segments is empty, we proceed recursively. Upon deallocation, we coalesce buddies whenever possible.

10This use of the term “heap” is in no way related to the trees that are used to implement priority queues, e.g., the heaps in the heap-sort algorithm.


• First fit: Free segments have arbitrary size and are kept on an unordered list. Allocate any given request from the first11 free segment of sufficient size, and leave the remainder of that free segment (if any) on the free-segment list. When a segment is deallocated, coalesce it with any adjacent free segment.

• Best fit: As above, but free segments are kept on a list ordered by segment size, and we allocate from the smallest free segment of sufficient size.

• Worst fit: As above, but free segments are kept on a list ordered by segment size, and we allocate from the largest free segment and insert the leftover free segment into the size-sorted list.

• Next fit: As above, but allocate cyclically, starting from the last allocation and searching until a segment of sufficient size is found.

There is a special problem when a request is of indeterminate size. One useful policy is to allocate the second-largest free segment or half of the largest free segment, whichever is larger.

The fixed-size (a.k.a. paged) policy and the buddy policy lead to internal fragmentation, i.e., space that is allocated but unused.12 By contrast, the other policies lead to external fragmentation, i.e., unallocated space that is not usable because it is distributed over many small “fragments.”

Performance of policies that coalesce adjacent free space. Suppose that a policy that coalesces adjacent free space, which includes all but paging and the buddy system, is operating in the steady state, i.e., the number of allocated segments and the amount of free space aren’t changing significantly. Let k denote the number of free segments, n denote the number of allocated segments, and n0, n1, and n2 denote the number of allocated segments with zero, one, and two free neighbors, respectively. The fifty-percent rule states that in such a case the expected value of n is twice the expected value of k. By way of proof, note that allocating a segment almost never changes the number of free segments (since there is almost always a residue), but releasing an allocated segment changes the number of free segments by the number of allocated neighbors minus one. So, if all segments are equally likely to be released and the number of free segments isn’t changing, then n2 must be the same as n0. But, by definition, n is n0 + n1 + n2, which (since n0 = n2) is 2n2 + n1, which (by definition) is the number of borders between free and allocated segments, which is 2k, since we are discussing policies that coalesce adjacent free space.

Obviously, the number and size of the allocated segments are independent of the placement policy — they depend only on the number and size of the blocks that have been requested and not yet released. The same is true of the total amount of free space and, thus, by the fifty-percent rule, the same is also true of the average size of the free segments. Placement policy can, however, affect the variance in free-segment size, which is important since the probability of overflow on any given request is inversely correlated with the expected size of the largest free segment, which is directly correlated with the variance in free-segment sizes — the smaller and more numerous the small free segments, the larger we expect the large segments to be. To minimize the probability of overflow, the best thing that can happen is to have maximum variance, i.e., for all but one of the free segments to be of minimum size.

11“First” in some fixed but arbitrary ordering, e.g., increasing address.

12Under the buddy system, internal fragmentation consumes about a quarter of the allocated space.

Given an allocation request and a list of free segments, the worst-fit allocation yields the resulting free-segment list with the minimum possible variance, while the best-fit allocation produces the list with the greatest possible variance. Next-fit is similar to allocating from a randomly selected, sufficiently large free segment, and doesn’t push variance in either direction. A first-fit allocation cuts down the size of the first segment that can accommodate the request, thereby tending to shrink the segments near the front of the list while preserving the larger segments near the end of the list, i.e., first-fit creates a list that is somewhat sorted by size and yields an approximation to best-fit. Compared to best-fit, one would expect the free-segment lists under first-fit to have slightly lower variance and thus a slightly higher tendency to overflow.

• [AHU] agrees:

The best-fit strategy seems to reduce fragmentation compared with first-fit, in the sense that best-fit tends to produce very small “fragments”, i.e., left-over blocks. While the number of these fragments is about the same as for first-fit, they tend to take up a rather small area. [page 399]

• [KN1] disagrees:

... the best-fit method tends to increase the number of very small blocks, and proliferation of small blocks is usually undesirable ... [page 437]

... In all experiments comparing the best-fit and first-fit methods, the latter always appeared to be superior. When memory size was exhausted, the first-fit method actually stayed in action longer than the best-fit method before memory overflow occurred, in most instances. [page 448]

• [ST] also disagrees:

The first-fit algorithm is not only the simplest but usually the best and fastest as well. [...] The best-fit algorithm, despite its name, is usually the worst performer. Because this algorithm looks for the smallest block that will satisfy the requirement, it guarantees that the fragment left behind is as small as possible. Although each memory request always wastes the smallest amount of memory, the result is that main memory is quickly littered by blocks too small to satisfy memory allocation requests. ... [pp. 311–312]

• [SG] also disagrees, but less emphatically:

Neither first-fit nor best-fit is clearly better in terms of storage utilization, but first-fit is generally faster. [page 254]

Exercise. Run simulations to determine who is correct in the above differenceof opinion.

Exercise. Most textbooks reflect the conventional wisdom that first-fit is faster than best-fit. The obvious way to implement best-fit is via an associative table, e.g., a red/black tree, which has O(log(n)) complexity for insertion, deletion, and associative lookup. Find an O(log(n)) implementation of first-fit. Hint: Keep a tree-structured list of free segments ordered by address, decorating each node with the size of its largest descendant.

Exercise. Show that, given any ordered pair of the above placement policies where the second policy isn’t paging, there is an instance of the memory-allocation problem, i.e., an allocate/release sequence, on which the first policy overflows but the second does not.

Garbage-collection policies

Garbage-collection systems determine which allocated items are no longer accessible/useful and automatically release them. Among the standard policies are:

• Access counting. An item’s access count is incremented each time a client links to it (e.g., via a pointer, name, reference, or occurrence in open-file tables) and decremented each time a link is severed (e.g., when a file is closed). An item is freed when its access count becomes zero.

• Mark-and-sweep. Set the access count of each item to zero. Go through all links, counting the number of links to each item. Then free those whose count is still zero. (Of course, a one-bit boolean counter13 suffices for this purpose.)

• Time-to-live. Give each item a time-to-live and free it after it has been allocated for that amount of time. This policy requires that items be preemptable.
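A minimal sketch of the access-counting policy follows; it is not from the text, and it assumes the items were obtained from malloc so that freeing them is just free().

    #include <stdlib.h>

    struct item { int access_count; /* ... payload ... */ };

    void link_item(struct item *it)   { it->access_count++; }

    void unlink_item(struct item *it)
    {
        if (--it->access_count == 0)  /* no remaining links: reclaim it      */
            free(it);
    }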

Add generational garbage collection.

13I.e., a flag, not mod-2 arithmetic.


Overflow policies

What to do when an overflow occurs?

• Deny the request by returning a special value, setting a special variable, throwing an exception, or aborting the thread.

• Delay until a segment is freed and try again.

• Expand the heap.

• Compact the free space, i.e., coalesce multiple free segments into a single free segment (of sufficient size, we hope) by moving some of the allocated segments to one end of the heap. This procedure is called compaction, defragmentation, or burping the memory. Compaction requires that it be possible to re-bind all references to data in allocated segments, which benefits from a level of indirection in accessing items in those segments. The frequency of overflow increases sharply as the percentage of allocated space approaches 100%. So, although compaction helps a bit, it is easy to reach a point where most of the time gets spent compacting.

Should we consider replacing only items whose deallocation creates a sufficiently large free space to satisfy the current request?

• Replace (i.e., preemptively deallocate) one of the currently allocated items, after making a backup copy if necessary. Future attempts to find the deallocated item in this table will miss (i.e., fail), at some cost — perhaps the item will have to be recomputed or fetched from a possibly slower backup table. The policy for selecting which item to deallocate is called the replacement policy, e.g.:

– Random Selection (RAND), which is surprisingly effective.

– First-In-First-Out (FIFO). When an item is installed in the table, it is appended to a FIFO replacement queue. To replace an item, we always select the first item on the replacement queue. FIFO is not much better than random selection, and occasionally worse.

– Longest-to-Next-Use (LNU, a.k.a. Belady’s algorithm), which is an off-line policy, i.e., it requires knowledge of all requests, including future ones. One can, however, pre-run the program to gain such knowledge, and then run it again under LNU replacement. On-line policies can then be compared to LNU.

– Least-Recently-Used (LRU). The principle of temporal locality holds that the near future is likely to resemble the recent past, which implies that the least-recently-used item is likely to be the longest-to-next-use. We could time-stamp items when they are accessed and search for the oldest time-stamp when overflow occurs. It is more efficient, however, to move items to the tail of a replacement-candidate list each time they are accessed.

– Not-Recently-Used (NRU). Each time an item is accessed, its accessed flag gets set. If the accessed flag of the current top candidate for replacement is set, reset it and move that candidate to the tail of the replacement list. NRU requires less overhead than LRU, but empirical studies show it to have comparable performance.

– Second-Chance. Same as NRU, but if the current top candidate for replacement has not been accessed but is dirty (i.e., modified) and this is not its second chance, then set its second-chance flag and move it to the tail of the replacement list. The point is that it costs more to replace a dirty item than a clean one, since those modifications may have to be recomputed or the item may have to be moved (copied) somewhere else.
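The list-moving bookkeeping behind LRU is small; the sketch below, not from the text, keeps a doubly linked replacement-candidate list whose head is always the least recently used item.

    struct lru_node { struct lru_node *prev, *next; /* plus the cached item */ };

    struct lru_list { struct lru_node *head, *tail; };

    /* Call on every access: move the node to the tail (most recently used). */
    void touch(struct lru_list *l, struct lru_node *n)
    {
        if (l->tail == n) return;                 /* already most recent     */
        if (n->prev) n->prev->next = n->next; else l->head = n->next;
        if (n->next) n->next->prev = n->prev;     /* unlinked from its spot  */
        n->prev = l->tail;
        n->next = 0;
        if (l->tail) l->tail->next = n; else l->head = n;
        l->tail = n;                              /* appended at the tail    */
    }

    /* On overflow, the head of the list is the replacement victim. */
    struct lru_node *victim(struct lru_list *l) { return l->head; }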

9.3.5 Interleaving

In addition to two-level indirection where intermediate keys are looked up, there is two-level indirection where intermediate keys are computed. One can, for example, combine two similar arrays in a couple of different ways to make a larger array. First of all, one can stack (concatenate) the arrays, assigning items having low indexes to one array and items having high indexes to the other — one merely needs to consider the high-order bit. Alternatively, one can interleave the arrays, assigning items having even indexes to one and items having odd indexes to the other — one merely needs to consider the low-order bit. Interleaving is sometimes called striping, and the interleaving of disk drives is sometimes called RAID-0. Of course, we are not restricted to two-way interleaving. Also, we have some discretion regarding the granularity of interleaving, i.e., we can first partition the arrays into fixed-size chunks called blocks, and then interleave the arrays on a block-by-block basis.

This block size has nothing to do with the OS block size.
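The computed intermediate key is just arithmetic on the index; the sketch below, not from the text, maps a global block number onto one of n interleaved arrays (the degree of interleaving is an illustrative assumption).

    #define N_ARRAYS 4      /* assumed degree of interleaving                */

    struct placement { unsigned array, offset; };

    /* Fine-grain (round-robin) interleaving uses the low-order part of the
     * block number; stacking would use the high-order part instead.         */
    struct placement interleave(unsigned block)
    {
        struct placement p = { block % N_ARRAYS, block / N_ARRAYS };
        return p;
    }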

Servicing requests that span multiple blocks via interleaving follows a common paradigm where servers are organized into teams, and each service request requires a certain amount of latency (setup time) that each team member must incur individually, followed by processing time, which can be spread across a team of servers. Suppose, for example, that you hire a team of ten painters to gang up on a request to paint your house and that you pay them for travel time. Compared to hiring a single painter, you’ll pay the same amount for the time spent actually painting your house, and the team will spend only one tenth as long on your property. But you’ll pay ten times as much for time spent traveling. Were there ten houses to paint, dispatching a painter per house would be more efficient.

Rewrite the above.

If n arrays are capable of concurrent operation, the array obtained through n-way interleaving has the same latency as the individual arrays but n times the bandwidth (since multiple items can transfer in parallel). So n-way interleaving decreases the time to read or write multi-block data segments by a factor of up to n, but it causes a corresponding increase in the aggregate latency service, i.e., the number of disk-seconds spent getting the heads to those multiple blocks. More formally, let l denote the average latency of an n-way interleaved array of arrays, and let s denote the average transfer time for reading or writing a block:


• Turnaround time: The average turnaround time (measured in seconds) to read or write n consecutive blocks from an n-way interleaved array of arrays is l + s, while the average time to read or write that same data from a non-interleaved array is l + ns.

• Workload: The average amount of service (measured in server-seconds) required for a non-striped transfer is equal to the non-interleaved turnaround time, l + ns. But an interleaved transfer requires l + s seconds of work from each of n arrays, for a total of n(l + s), i.e., nl + ns, server-seconds of work.

    per n-block transfer    (striped)/(non-striped)    if l/ns ≈ 0    if ns/l ≈ 0
    turnaround time         (l + s)/(l + ns)           ≈ 1/n          ≈ 1
    workload                (nl + ns)/(l + ns)         ≈ 1            ≈ n

You can always buy bandwidth, but latency is forever.

Thus, the ratio of block size to request size affects the performance of an interleaved array:

• Coarse-grain interleaving: When the block size is large relative to request sizes, most requests will fall onto one or possibly two blocks, involving one or possibly two arrays, and the only effect of striping will be to spread the requests somewhat uniformly across the n arrays, thereby leveling the workload.

• Fine-grain interleaving: When the block size is small relative to the size of a data-transfer request, most transfers span multiple blocks and the arrays gang up on servicing the request, as discussed above. When the workload is light, turn-around time for a request is nearly equal to the request’s service time (processing time plus latency). In such a case, fine-grain interleaving decreases turn-around time by up to a factor of n. On the other hand, when the workload is heavy (i.e., all arrays are fully utilized), fine-grain interleaving increases workload (i.e., decreases throughput) by up to a factor of n.14 In such cases, turn-around time is dominated by waiting time, which, for a given queue length, is proportional to the aggregate service requirement of the queued requests. But, for a given arrival rate of requests, queue lengths grow very quickly as throughput decreases.

Rewrite the above.

Exercise. Suppose that the average positional latency of the disks in an array is 5 ms, that the disks rotate at 10,000 rpm, that each disk’s internal (heads-to-buffer) transfer rate is 20 megabytes/sec, and that the average block size is 4 kbytes. What effect would four-way striping have under heavy loads?

Exercise. Which is better for a file server, large blocks or small ones, and why? Which is better for a film-editing workstation, and why?

14In the non-striped case, only one disk services a given request; the other n − 1 disks are available to service other requests. The mistaken notion that fine-grain striping of disks always offers performance advantages usually results from comparisons not with unstriped arrays but with single large disks of equivalent cost.


9.4 Data Logistics

Give an introduction.

9.4.1 Distributed Tables

For reasons of fault tolerance and/or performance, we often allocate multiple copies of items over multiple tables, each of which has its own characteristics and its own access costs, which differ from processor to processor; e.g., items may be in RAM or on disks connected to a particular processor or set of processors. When main memory is organized as such a distributed table, the architecture is called a non-uniform-memory-access (NUMA) multiprocessor — each processor has fast access to its local memory system and slower access to nonlocal memory systems. NUMA systems whose interconnection is based on a local-area network (LAN) are often called distributed-shared-memory (DSM) systems.

Consistency. To improve performance, it helps if processors can have local copies of data items. There are a number of general policies for maintaining consistency among the various copies of an item in a distributed table.

Define “cache” as a verb.

• write invalidate: When an item is updated in one table, all other stored copies are invalidated:

– notified: a notification is sent to the other tables.

– detected: when the invalid item is used, an exception occurs (e.g., a stale file handle).

• write propagate: When an item is updated in one table, that update is propagated to the other tables:

– notified: a notification and the update are sent to the other tables.

– detected: each table (or its underlying hardware) snoops and mimics the write traffic of the rest.

• locking: Allow at most one table at a time to hold a writable copy of a given item.

• centralizing shared writable copies: Sprite [OU], for instance, caches pages from file servers in the main memory of client systems. Whenever a client requests access to a file (i.e., opens it):

– Flush all of the file’s pages from the last client to have had it open for writing, unless that last writer is the requesting client.

– Return to the requesting client the file’s handle and current version number (incremented if the file is being opened for writing).

– If the file will be open on multiple clients, one of which will have the file open for writing, disable client caching for this file and release all of those clients’ local copies of the file’s pages.


– Increment the version numbers on the requesting client’s local copies of the previous version of the file’s pages if the file is being opened for writing.

– Release the requesting client’s local copies of earlier versions of the file’s pages.

• roll-back (a.k.a. time-warp): fill in.

• time-to-live: fill in.

• conflict-resolution: fill in.

• commit protocols: fill in.

• mirroring: Write identical data to each of two tables, i.e., mirror those tables. (Mirroring disks is known as RAID-1.) If the tables are capable of concurrent operation, there is no degradation in write performance, and read requests can be spread over the two tables, thereby improving performance. Of course, mirroring increases cost per capacity by a factor of two.

Fill in the above. Also, explain the other RAID levels.

Mirroring and interleaving. There are two ways to combine mirroring and interleaving:

• mirrored interleaving, a.k.a. RAID 0+1: Organize two interleaved arrays of n tables and then mirror them.

• interleaved mirroring, a.k.a. RAID 1+0: Organize n mirrored pairs of tables and then n-way interleave them.

In either case, the combined table is interleaved and protected against single-table failures.

Exercise. Which of RAID 1+0 and RAID 0+1 is more tolerant of multi-table failures, and why, when the number of tables and the degree of interleaving (i.e., n) is greater than two? Does it make a difference? Why or why not?

Exercise. Is hardware support necessary or helpful for a RAID 1+0 array of disks? Why or why not?

9.4.2 Multilevel Storage Systems

As a special case of distributed tables, we often allocate items over two tables, one of which, called the remote or backing table or the archive, has high access cost but has advantages in terms of lower vulnerability and/or lower price/capacity relative to the other table, which is called the local or working or speedup table or the cache. When accessing an item, we first check the cache.


If the item is there, we look no further — we have a hit (i.e., the item is cache-resident), and the access cost is that of the cache and is called the hit cost. Otherwise, we have a miss and must look for the item in the archive. The miss cost is that of the failed cache lookup plus that of the archive lookup plus any replacement costs incurred — see subsection 9.4.3 on page 144. The hit ratio is the fraction of lookups that result in a hit; the miss ratio is one minus the hit ratio. The effective access cost of a two-level storage system is the hit cost times the hit ratio plus the miss cost times the miss ratio.
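Written out as a one-liner (an illustrative sketch, not from the text):

    /* effective access cost = hit_cost * hit_ratio + miss_cost * miss_ratio */
    double effective_cost(double hit_cost, double miss_cost, double hit_ratio)
    {
        return hit_cost * hit_ratio + miss_cost * (1.0 - hit_ratio);
    }

For example, with an assumed hit cost of 1 unit, a miss cost of 100 units, and a hit ratio of 0.99, the effective access cost is 1 × 0.99 + 100 × 0.01, i.e., about 1.99 units — only slightly more than the hit cost.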

The price of a two-level storage system is roughly the price of its cache plus the price of the archive. Its capacity is that of its archive. And, when the hit ratio is high enough, its effective access cost is only slightly more than its hit cost.

Note that the cost of archiving may not be the same as the cost of caching, since, for example, we might always archive two copies in distinct locations. Also, some caching is done for reasons of buffering, as in a disk drive or a communications system, e.g., a router.

Examples. Two-level storage systems are ubiquitous:

• Memory-cache systems, which treat high-speed RAM as a cache for lower-speed RAM;

• Back-up systems, which archive files onto tape to protect them and to free up disk space;

• Register allocation, which treats main memory as an archive for a register bank;

• Disk caching, which treats an array of block buffers in main memory as a cache for the disk blocks;

• Virtual-memory systems, which treat disk space as an archive for main memory;

• Name caching for distributed name managers;

• Speedup tables for descriptor management.

To create a storage hierarchy of three levels, one merely connects two two-level systems so that the cache of one is the archive of the other. But there is no need to stop at three levels. The physical-storage hierarchy in current desktop computer systems consists of the following levels:

1. CPU registers

2. level-one (L1) cache

3. level-two (L2) cache

4. main memory (RAM)

5. disks

6. removable media such as floppies and tapes.


Speedup vs. enlargement. Notice that some two-level storage systems are speedup (caching) systems, where the cache is subsidiary and the two-level system has the semantics (i.e., behavior) of the archive, except for speed. Others are enlargement (archiving) systems, where the archive is subsidiary and the two-level system has the semantics of the cache, except for capacity. For instance, both virtual memory and disk caches have a portion of main memory as their cache and a portion of disk as their archive. The distinction is that disk caching is a speedup system wherein main memory behaves like disk, while virtual memory is an enlargement system wherein disks appear to behave like main memory.15

9.4.3 Management of Two-Level Storage Systems

Two-level storage systems require fetch and write-back policies.

Fetch policies. Copying an item from archive to cache is often referred to as “fetching it,” “making it resident,” “swapping it in,” or “caching it.” We will use those terms interchangeably. Such archive-to-cache traffic is under the control of the fetch policy and consists of two components:

• demand fetching, which occurs in response to misses,

• prefetching (a.k.a. anticipatory fetching or reading ahead), which attempts to anticipate and avoid some misses by fetching certain items on the basis of educated guesses as to which items will be accessed in the near future.16

The laziest strategy is to do no prefetching — under poor prediction, prefetching hurts performance, since fetching an item that doesn’t get accessed generates unnecessary archive-to-cache traffic and lowers the expected hit ratio by possibly replacing an item that will be missed. Prefetching is particularly helpful, however, with archives, such as disks, where the latency-to-transfer cost ratio is very high.17 In virtual-memory systems, for example, a miss causes an absence trap. The CPU is then free to run another thread while the missed item is being fetched from disk space. But the trapping thread is blocked, which shortens the ready queue, thereby diminishing the expected CPU utilization. Good prefetching can help to avoid such performance degradation.

Exercise. Is prefetching of benefit when the archive has a latency time that is the reciprocal of the archive’s transfer rate? How about when the transfer rate is sufficient to block other access to the cache during a transfer?

15In compiler work, main memory is sometimes viewed as an enlargement of the register set. But registers don’t have addresses, so memory is used to extend the semantics of the registers by providing addresses.

16The guesswork part of prefetching of instructions is called “branch prediction.”

17There is an old saying that “you can always buy bandwidth, but latency is forever.” Caching mitigates the effects of latency for resident items, and prefetching mitigates latency when accessing non-resident items.


Write-back policies. When an item’s cached value differs from its archived value, the item is said to be dirty. Updating the archived value of a dirty item is called “writing the item back (or out),” “backing it up,” “swapping it out,” or simply “archiving it.”

Cache-to-archive traffic is under the control of the system’s write-back policy and consists of two components:

• replacement archiving, which occurs when a dirty item is replaced,

• anticipatory archiving, which cleans up dirty items. There are two reasons for anticipatory archiving:

– clean-up in anticipation of replacement: One can replace an item more quickly when it is not dirty, and it may cost less to clean up a dirty item at lower urgency, e.g., at a time when there is less disk traffic.

– stabilization, i.e., clean-up in anticipation of faults: In cases where the archive is less vulnerable (e.g., less volatile) than the cache, backing up dirty items diminishes the vulnerability of the two-level system, making it more tolerant of certain faults like power failures or system crashes. Per the principle of temporal locality, when an item is modified, it is likely to be modified again soon. So, the percentage of data updates that get archived (and the anticipatory-archiving traffic) is a declining function of the window of vulnerability, i.e., the amount of time that items are allowed to be dirty.18

There are various write-back policies:

• Write-through: Archiving the new value of a cached item as soon as it is modified is the least lazy (i.e., most eager) policy — that way there are never any dirty items, i.e., the window of vulnerability is closed. Write-through generates the most cache-to-archive traffic but minimizes vulnerability, especially on systems with volatile caches.

• Delayed-write: There are several delayed-write (a.k.a. write-behind) policies:

– Replacement-only: The laziest possible write-back strategy is to do no anticipatory archiving, thereby minimizing the cache-to-archive traffic but maximizing the window of vulnerability.

– Flushing: At certain intervals, shorter than some threshold, write back all dirty items.

– Scavenging: A low-priority scavenger thread or process continually backs up items that have been dirty longer than some threshold.

18It is possible to further diminish vulnerability by creating a “mirrored” archiving system, by connecting two two-level systems so that they have the same cache. Write-back operations involve both archives, while fetching involves only one, thereby allowing a certain amount of load leveling.


– On-close: Archive the dirty items of a table as soon as no thread still has that table open.

9.4.4 Expected Hit Ratio vs. Cache Capacity

There are three kinds of misses, only one of which is related to cache capacity:

• Startup misses (also known as compulsory19 misses), i.e., misses that occur on first access to archived items.

• Capacity misses, which result from the fact that a cache must omit some items due to capacity limitations.

• Contention misses, which result from restrictive cache-placement policies (e.g., set-associative policies).

The hit ratio is zero whenever the cache size is zero. When the hit ratio is low, the effective access cost approaches the miss cost. In such a case, the system will incur great expense (e.g., spend most of its time) processing misses, a situation known as thrashing.

Obviously, there is a trade-off: the fewer entries in the cache, the more capacity misses can be expected to occur and the higher the effective access cost. Belady [BE], however, has noted the anomaly that, under the FIFO replacement policy, there are certain atypical access patterns where the hit ratio actually decreases with increased cache size; e.g., accessing items 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 in that order generates only two hits with a cache size of four but three hits when the cache size is three. The table below shows the replacement queues at each stage:

Access string:        1   2   3   4   1   2   5   1   2   3   4   5

Cache size three      1   2   3   4   1   2   5   5   5   3   4   4
(newest entry first)      1   2   3   4   1   2   2   2*  5   3   3
                              1   2   3   4   1   1*  1   2   5   5*

Cache size four       1   2   3   4   4   4   5   1   2   3   4   5
(newest entry first)      1   2   3   3   3   4   5   1   2   3   4
                              1   2   2   2*  3   4   5   1   2   3
                                  1   1*  1   2   3   4   5   1   2

Starred entries designate hits.
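The anomaly is easy to confirm by simulation. The following small program (a sketch, not part of the text) replays the access string above under FIFO replacement and counts hits for cache sizes three and four:

    #include <algorithm>
    #include <deque>
    #include <iostream>

    // Count FIFO hits for the given access string and cache capacity.
    int fifoHits( const int* accesses, int n, unsigned capacity ) {
        std::deque<int> q;   // front = oldest resident item
        int hits = 0;
        for ( int i = 0; i < n; ++i ) {
            int item = accesses[i];
            if ( std::find( q.begin(), q.end(), item ) != q.end() ) {
                ++hits;                                      // already resident: a hit
            } else {
                if ( q.size() == capacity ) q.pop_front();   // evict the oldest item
                q.push_back( item );
            }
        }
        return hits;
    }

    int main() {
        const int s[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
        const int n = sizeof(s) / sizeof(s[0]);
        std::cout << "hits with 3 frames: " << fifoHits( s, n, 3 ) << "\n"   // prints 3
                  << "hits with 4 frames: " << fifoHits( s, n, 4 ) << "\n";  // prints 2
        return 0;
    }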

During any given time interval, an item’s access ratio is the number of times that item is accessed divided by the total number of accesses during the interval. The access ratio for a set of items is the sum of their access ratios, e.g., if we cache the items of highest expected access ratio, then the expected hit ratio for a cache of size n would be the sum of the n highest access ratios, and, at cache size n, the derivative of expected hit ratio vs. cache capacity would be the n-th highest access ratio.

19Visibly, “compulsory” is a misnomer, since start-up misses can often be avoided via prefetching.
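In symbols (a restatement of the claim above, not taken from the text): writing the per-item access ratios in decreasing order as r_(1) >= r_(2) >= ..., the expected hit ratio of a cache holding the n most-accessed items is

\[
  H(n) \;=\; \sum_{i=1}^{n} r_{(i)},
  \qquad\text{so}\qquad
  H(n) - H(n-1) \;=\; r_{(n)},
\]

i.e., the marginal benefit of the n-th cache slot is the n-th highest access ratio.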

The set of items with non-zero access ratios (i.e., the set of items accessed during the interval) is called the working set for that interval. The principle of spatial locality implies that the working set tends to be a small fraction of all items. As the cache becomes larger and the items in the working set get cached, the hit ratio approaches one, so the effective access cost approaches that of the cache, and the traffic associated with misses becomes insignificant. The only required traffic will then be archiving to stabilize dirty items, which is traffic that decreases as the window of vulnerability is expanded.

[Needs figures.]

Imagine that we graph the per-item access ratios in decreasing order. If, for example, there are exactly M items each with identical 1/M access ratios, the graph of expected hit ratio vs. cache capacity would be a straight line20 of slope 1/M from (0, 0) to (M, 1). Suppose, however, that the graph of per-item access ratios is a step function; say N items have the higher access ratio, p, and the remaining M − N items have the lower ratio, q. Then the graph of expected hit ratio vs. cache capacity will be a broken line with an initial segment from (0, 0) to (N, pN) having a steeper slope, namely the higher per-item access ratio, p, while the slope of the final segment from (N, pN) to (M, 1) will be the lower per-item access ratio, q.

The principle of temporal locality implies that an item’s expected near-future access ratio is likely to be its recent-past access ratio. So the expected hit ratio of a two-level storage system at any instant is the sum of the recent access ratios of the cached items. Adding another item to the cache increases the expected hit ratio by the recent access ratio of that item.

In reality, the per-item graph of expected access ratios will be a somewhat smoothed out step function, and its derivative will be a somewhat fattened up negative delta function (i.e., a negative bell-shaped curve) centered at the working-set size. This fact, plus the fact that access ratios and working-set sizes vary over time and the fact that real-world replacement policies don’t always keep the most-likely-to-be-accessed items in the cache, rounds out the corner of this broken-line graph and curves each of the line segments downward. In particular, however, the graph’s always-positive derivative falls off sharply near the working-set size.

Optimality of LNU. Since the cached item that is longest to next use is the only cached item having an access ratio of zero during the interval up to that next use, LNU replaces the item of lowest near-term access ratio, thereby maximizing the expected hit ratio of the resulting set of cached items. LNU is, therefore, optimal for cases where the replacement cost is uniform over all items.

In reality, the cost of a miss varies depending on whether or not the replaced item is dirty and needs to be archived. Replacement policies that favor replacement of clean items avoid some write-back costs, so LNU is not optimal for all access strings.

20Actually, it would be more like stair steps.

Each replacement policy has a slightly different graph for expected hit ratio vs. cache capacity. The better a given on-line policy predicts which pages are least likely to be accessed soon, the closer its graph will be to that of LNU. Because the least-recently-used item is the only cached item with zero access ratio since the LRU item’s last use, temporal locality suggests that the LRU item can be expected to have the lowest near-term access ratio of all cached items and can, therefore, be expected to be the best candidate for the LNU item.

9.5 Snapshots

Frequent snapshots (a.k.a. checkpoints) of the entire content of a table allow one to inspect past states of items and enhance the table’s tolerance of such faults as inadvertent modifications/deletions. Such snapshots take relatively little space — Network Appliance indicates that, under “average” usage patterns, 256 snapshots add about 20% overhead to total disk-space requirements.

Under the following snapshot scheme, each item has at most one current copy and a minimal set of read-only, retired (i.e., noncurrent) copies. A copy reflects the item’s state from the copy’s last modification up to its retirement.

• To create a snapshot, mark the retired-copies list and record the current time.

• When the current copy of an item is about to be removed or replaced, if it was last modified before the most recent snapshot, retire it, i.e., append it to the retired-copies list.

• To delete a snapshot, simply delete, from the portion of the retired-copies list between that snapshot and its oldest successor, all copies that were last modified after the snapshot’s youngest predecessor occurred — they don’t reflect the item’s state as of any remaining checkpoints.21

If the table has a rooted, linked structure, we can simply keep track of the root at the time of a snapshot to be able to recover the entire table as of that point in time. To roll the table back to that point, simply delete all subsequent copies.
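A minimal sketch of the retire-on-replace rule above; the Copy record, the retired-copies list, and the snapshot timestamp are hypothetical stand-ins for whatever the table actually stores:

    #include <ctime>
    #include <list>

    struct Copy {
        time_t lastModified;   // when this copy was last modified
        // ... the item's data ...
    };

    time_t latestSnapshot = 0;       // time recorded by the most recent snapshot
    std::list<Copy> retiredCopies;   // read-only, noncurrent copies

    // Take a snapshot: mark the retired-copies list and record the current time.
    void createSnapshot() {
        latestSnapshot = time(0);
        // a real implementation would also remember retiredCopies.end() as the mark
    }

    // Called when the current copy of an item is about to be removed or replaced:
    // retire it only if it predates the most recent snapshot.
    void maybeRetire( const Copy& current ) {
        if ( current.lastModified < latestSnapshot ) {
            retiredCopies.push_back( current );   // append to the retired-copies list
        }
    }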

9.6 Memory-Management Hardware

We need to review some terminology from architecture.

• location: the key (bit string) that a memory system uses to access its items of data or code. It is also called a physical address.

21For example, in the case of two-level storage systems, when an item is written back to the archive, a copy of the previous version of that item is already present in the archive ready to be retired.


• address: the key (bit string) that a program (or instruction set) uses to access items of data or code. It is also called a virtual location.

• address space: an address-indexed array of items. A portion of a process’s address space is said to be resident if there is a copy of it at a known place in main memory.

Main memory is often called RAM. The term “core memory” is also used and is something of a pun. Originally “core” referred to a memory technology that was based on tiny iron donuts, called “magnetic cores.” Now it refers to the centrality of main memory.

9.6.1 Address-Translation Schemes

Between the CPU and main memory there is a bidirectional data bus. There is also the unidirectional address bus from the CPU to the address translator and a unidirectional location bus from the address translator to main memory. The address translator is a table that takes an address from the address bus as its key and puts the corresponding main-memory location onto the location bus. This address-to-location translation is called memory mapping. It facilitates:

• protection, e.g., controlling the ability of user-mode processes to modify each other and the kernel,22

• relocation, e.g., allowing suspended processes to be swapped out to disk,23

• large address spaces, e.g., allowing the running of programs whose address space is larger than the number of physical locations,

• sharing, e.g., allowing multiple processes to share various data and code segments.

There are many schemes for implementing address translators. These schemes vary in complexity, but most fall into one of the following four categories.

[Add diagrams.]

Unmapped. The simplest address-translation scheme is to have none (i.e., addresses and locations are equal), a situation that is especially prevalent in embedded systems. This scheme offers no protection — user programs can easily modify one another and the kernel.24 Such unmapped memory management uses the loader to properly locate programs and allows no relocation — a program must run where it is loaded, which makes main memory a non-preemptable resource.

[Expand. Also, allow for position-independent code.]

22It enforces protection by remapping and/or faulting (i.e., interrupting) on invalid addresses.

23When such a process is swapped back in and resumed, its original locations may have been allocated to other processes. Running the process at new locations almost always requires some hardware assistance.

24A notable exception, however, was the IBM 360 system, which provided for page-by-page protection but not page-by-page mapping.


Relocation registers. Under the second scheme, on each user-mode memory access, the address translator adds the content of a protected relocation register (a.k.a. base register) to the address to form the location. For protection, the address is also simultaneously compared to the content of a bounds register, and a protection fault (trap) occurs whenever the address exceeds that bound. When a CPU gets passed to another thread, the contents of these registers must be reset to the settings for the address space in which the awakened thread will run.

Under this scheme, a process can be dynamically relocated by copying its address space and changing the relocation-register settings of its threads, which simplifies the implementation of loaders and of the fork system call. Also, disk space can be treated as an archive for main memory, creating a two-level store whose items are the address spaces of processes. Main memory then becomes a preemptable resource that can be shared among processes via time-division multiplexing — e.g., a process can be swapped out when it waits for keyboard input and swapped back in when that input completes. The disadvantages of this scheme are that:

• It allows address space no larger than main memory.

• It requires dynamic allocation of relatively large variable-sized segments of contiguous memory.

In privileged mode, the relocation register and bounds register are (or can be) set so that the data area of the kernel is accessible. This situation leads to complications when system calls receive or return addresses, since the user-mode address of a given location will often differ from its privileged-mode address.
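The translation performed on each user-mode access under this scheme amounts to one addition and one comparison. A sketch, with hypothetical names:

    void protectionFault();   // hypothetical: raises a protection trap

    struct RelocationUnit {
        unsigned base;    // relocation (base) register
        unsigned bound;   // bounds register
    };

    // On each user-mode access: location = base + address, with a bounds check.
    unsigned translate( const RelocationUnit& r, unsigned address ) {
        if ( address >= r.bound ) protectionFault();   // address exceeds the bound
        return r.base + address;
    }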

Paged memory management. To eliminate the need for dynamic allocation of variable-size segments, instead of one relocation register, we can use many of them, thereby allowing a process’s address space to be broken up into smaller segments of uniform size, called pages, which do not have to be contiguously allocated. Main memory is broken up into page frames of that same size, say four kilobytes. On each memory access (i.e., read/write request), the upper bits of the address specify the page number and go through a translator RAM, which emits the corresponding frame number onto the upper bits of the location bus. The lower bits (the lower 12 bits in the case where pages are 4K bytes) specify the offset and go directly onto the location bus, without modification. The offset specifies where the addressed item is located within its page. If the CPU tries to access a missing page (possibly indicated by a special frame number), an absence trap25 occurs.
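For four-kilobyte pages, the splitting of an address into page number and offset can be sketched as follows; frameOf() and absenceTrap() are hypothetical stand-ins for the translator RAM and the trap mechanism:

    const unsigned PAGE_BITS = 12;                  // 4-kilobyte pages
    const unsigned PAGE_SIZE = 1u << PAGE_BITS;
    const unsigned NO_FRAME  = 0xFFFFFFFFu;         // special "missing page" value

    unsigned frameOf( unsigned pageNumber );        // hypothetical translator-RAM lookup
    void absenceTrap();                             // hypothetical trap on a missing page

    // Map an address to a main-memory location under paged translation.
    unsigned locate( unsigned address ) {
        unsigned page   = address >> PAGE_BITS;       // upper bits: page number
        unsigned offset = address & (PAGE_SIZE - 1);  // lower 12 bits: offset
        unsigned frame  = frameOf( page );
        if ( frame == NO_FRAME ) absenceTrap();
        return (frame << PAGE_BITS) | offset;         // frame number plus offset
    }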

Virtual-memory systems. Systems in which an instruction cannot be restarted after causing an absence trap require that the entire address space of a process be resident for any of its threads to run — unfortunately:

25A.k.a. absence fault.


• In such systems, the address space of a process can be no larger than main memory,

• Such systems waste a critical resource, namely main memory, since much of a program’s address space is unused most of the time (e.g., the text for error messages).

Of course, one may programmatically overlay portions of address space, but that requires modifications to the program (i.e., additional programmer effort).

A system in which programs can run without their entire address space resident is called a virtual-memory system. In such systems, only the addressing limitations of the instruction set (e.g., 32 bits) are imposed on the program. Physical-memory size limitations do not directly constrain the behavior of a program, though they may do so indirectly via performance degradation.

Restarting instructions. When an instruction attempts to access an item that is not resident, an absence fault occurs and the fault handler must fetch the corresponding portion of address space. Then comes the problem of completing the interrupted instruction, which may have already modified some (possibly inaccessible) registers — simply re-executing the instruction might produce invalid results. There are three main approaches to this problem, which is called the instruction-restart problem:

• Suspend the faulting CPU while another processor handles the fault. This approach is simple to implement and does not require that the CPU be designed with virtual memory in mind. It has the disadvantage that the faulting CPU will be idle while the missing page is being swapped in. On a multiprocessor system where CPUs are plentiful, there may still be enough CPUs to run all runnable threads.

• Undo the instruction, i.e., reverse all modifications done by the faulting instruction prior to the fault, thereby restoring the system to its prior state. One must, for example, decrement all registers that the instruction has incremented. Most load/store RISC architectures are sufficiently simple that partially completed instructions can be undone by software whenever an absence fault occurs.

• Save the micro-context in main memory and restore it after the faulting page has been swapped in. This approach is required under most complex instruction sets and requires special privileged instructions.

Address translation. Virtual-memory systems treat pieces of a process’s address space as items to be archived on disk. A few early virtual-memory systems were based on the swapping of variable-size segments. Nearly all virtual-memory systems, however, do their swapping in terms of pages. The swapping of pages is often called “paging.”

In systems having 32-bit addresses, one might expect up to a million four-kilobyte pages in a process’s address space. An address translator based on very fast RAM could be inordinately expensive and/or slow, so virtual-memory systems typically use a three-level store for translating page numbers to frame numbers. The cache consists of a small (e.g., 16 entries) content-addressable memory, called a “translation look-aside buffer” (TLB), which holds the frame numbers of the most recently accessed pages. When an address lookup misses in the TLB, hardware or software (depending on the system) consults the resident-page table, which tells which frame (if any) holds each resident page.26 If the page is resident, the appropriate entry is then installed in the TLB, and the instruction is restarted. Otherwise, an absence fault occurs, and the fault handler consults a much less sparse table (parts of which are often not memory resident) that tells where on disk to find each non-resident page. The fault handler fetches the absent page, updates the resident-page table, and resumes the processing of the TLB miss.

With each TLB entry there are protection (mode) bits, namely read, write, and execute, which must be set per the privileges of the current process. A protection trap occurs whenever there is any attempt to access a page whose corresponding mode bit is false. Note that these bits must be revised or reloaded at each switch of process contexts.

Each page frame has a referenced bit (a.k.a. accessed bit) that gets set whenever that page is accessed, and a dirty bit that gets set whenever the page is modified. Whenever a TLB entry is replaced, these bits are written out to the corresponding entry in the resident-page table.
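The three-level lookup described above — TLB, then resident-page table, then the on-disk table via the fault handler — can be sketched as follows; the helper functions are hypothetical stand-ins for mechanisms that are partly hardware and partly kernel software:

    // Hypothetical helpers; on real systems some of these are hardware.
    bool tlbLookup( unsigned page, unsigned& frame );          // small content-addressable memory
    bool residentPageTable( unsigned page, unsigned& frame );  // which frame holds each resident page
    void installInTLB( unsigned page, unsigned frame );
    void fetchFromDisk( unsigned page );                        // the fault handler's job

    // Resolve a page number to a frame number.
    unsigned resolve( unsigned page ) {
        unsigned frame;
        if ( tlbLookup( page, frame ) ) return frame;            // TLB hit
        if ( residentPageTable( page, frame ) ) {                // TLB miss, page resident
            installInTLB( page, frame );
            return frame;                                        // restart the instruction
        }
        fetchFromDisk( page );                                   // absence fault
        residentPageTable( page, frame );                        // table was updated by the handler
        installInTLB( page, frame );
        return frame;
    }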

9.6.2 Flat vs. per-process address spaces

[Add material on segmentation and linking.]

An address is usually made up of three components:

• The high-order bits indicate which segment the page belongs to.27

• The middle bits specify which page within that segment.

• Finally, the low-order bits specify the offset of the address within the page.

Under the so-called “flat” model, all processes share a single segmented address space of, say, 2^32 bytes. The alternative is that each process has one or more 2^32-byte segments to itself, each having its own page table and each including some shared kernel space for communication etc.

9.7 Virtual Memory and Disk Caching

Most Unix systems cache pages of disk space in main memory in two ways:

• the virtual-memory system, which is an enlargement system,

26If the page table omits null entries, i.e., has entries only for resident pages, it is said to be “inverted,” an unfortunate misuse of terminology.

27Each process has a shared code segment and one or more data segments. In general, a system shares segments and swaps pages.


• the disk-caching system, which is a speedup system.28

Earlier Unix systems rigidly set aside a portion of main memory and a portion of disk space for each of these two-level systems and allowed each to work independently of the other. There are, however, some useful variants.

Sharing cache space. Now many Unix-based systems (e.g., Linux) dynamically share main memory between the virtual-memory and the disk-caching subsystems. Each replacement request is filled by swapping out the first page on a global replacement list and allocating the freed frame to the requesting subsystem.

Memory-mapped files. Some operating systems support memory-mapped files, which can be mapped into the address space of any process that opens them. How this is done depends on how the underlying architecture handles its page table(s). In any case, the combined size of a process’s open memory-mapped files must be less than its address space, e.g., not more than four gigabytes on 32-bit systems.

Swapping from files. In most Unix implementations, there are separate partitions on disk for file space (the archive for disk caching) and for virtual-memory swapping areas (the archive for virtual memory). Sprite [OU], however, maps the swapping area into file space by using special backing files as the swapping areas for data segments. Object code is swapped from a process’s current executable file, which remains unmodified and is shared with other processes. To prevent double caching, swapping from backing and executable files bypasses the disk-caching system. When execing a recently compiled file, some of its pages may be resident in the disk cache, in which case they are released to the virtual-memory system. When a process exits, its backing files are deallocated, but its executable file persists and that file’s resident pages are aged normally via the replacement policy. Swapping from files rather than partitions:

• facilitates swapping from the main memory of a file-server system, which is often more efficient than swapping from local disk,29

• eliminates the need to dedicate partitions of disk space as swapping areas,

• facilitates process migration.30

[This might be the place to say more about distributed shared memory (DSM).]

28Disk caching is a two-level storage system that keeps the most recently accessed disk pages in main memory. Whenever there is a request to read or modify data on a particular page, the operating system looks first in this cache. If the page is not there, it is read into the cache from disk. In any case, it is a cached copy of the page that gets read or written.

29This arrangement makes it particularly easy to run client workstations in a diskless configuration.

30To migrate a process, suspend it, archive its dirty pages to its backing files, then resume it on the target system by demand paging it from its backing and executable files via a distributed file system.


9.8 Miscellaneous

Copying under VM. The indirection provided by the page table allows one to move data around in address space by simply changing page-table entries. This trick is called page remapping — data that used to be at one address suddenly appears at another address (with similar page alignment). To keep the original data, simply write-protect those pages and use the usual copy-on-write trickery. Page remapping can be a big win in situations where otherwise a lot of information would be copied and immediately discarded, e.g., a fork followed by an immediate invocation of exec.

[Discuss both linking-by-copying and off-segment calls. DLLs are a special case of PR/COW, since code is never written.]

Performance. Given a reasonable mix of I/O and CPU workload, the higher the degree of multiprogramming on a system, the greater the throughput due to overlapped use of resources. But it is easy to cause thrashing in a virtual-memory system by allowing too many concurrent processes. In a time-sharing system, as more users enter commands, more work piles up and the degree of multiprogramming increases. At some point, thrashing occurs, and the system suddenly runs much slower. For example, if only 20% of the users are normally running, and thrashing starts, then suddenly 50% of the users may be waiting on commands, which leads to more thrashing. Such a system is bi-stable and is likely to stay in that thrashing mode until enough users log off in disgust.

In a virtual-memory system, it is important to avoid thrashing, but detecting its onset can be difficult. A sudden drop in hit ratio does not necessarily indicate thrashing — an invocation of exec can cause a sudden increase in paging traffic due to start-up misses.

Exercise. Draw a qualitatively correct graph of throughput vs. degree of multiprogramming (i.e., the number of processes with threads running or runnable — in the ready queue) for a multiprogrammed virtual-memory system. Consider three separate cases:

1. All processes are CPU bound.

2. All processes are I/O bound.

3. There is a mix of CPU-bound and I/O-bound processes.

Exercise. Discuss the suitability of the following thrashing-avoidance technique for time-shared virtual-memory systems: when there are no free page frames left, swap out all pages for some process (e.g., the one that has been resident the longest). Swap these processes back in at a rate of, say, one every second and on completion of keyboard input. (The question of whether or not to swap out a process waiting on keyboard input depends on how long it would take to resume, and that depends on disk performance and allocation techniques.)


Chapter 10

DEVICE MANAGEMENT

Threads usually communicate with I/O devices via the file system (a.k.a. server-management system), which interacts with the device-management system. In some cases, such as video games, however, user-mode processes may bypass the file system and submit requests directly to the device-management system.

10.1 I/O Architecture

10.1.1 Devices

• reset buttons — an interrupt from this device should be noninterruptible, since it is the last alternative to pulling the plug.

• clocks and timers

• disks

• cameras, speakers, and microphones

• terminals

• tapes

• instruments

• terminal controllers

• network controllers

• control registers for memory mapping

• display controllers

• keyboards

• mice


10.1.2 Device Controllers

Device controllers serve as electronic interfaces between devices and I/O busses. Currently, I/O busses are capable of transferring, say, 64 bits at a time, at rates from a few megabytes per second to a few gigabytes per second.

• There are many popular bus protocols:

VME

PCI

CompactPCI

PCMCIA for notebook computers.

• Also, many popular device protocols:

Terminals: RS232/ASCII

Disks: SCSI, ATA (IDE), SATA

Networks: Ethernet, ATM, Fiber Channel, USB, IEEE 1394 (Firewire)

Instruments: IEEE 488

Monitors: SVGA, XGA, SXGA, DVI.

Programs interact with device controllers by reading from and writing to certain addresses on the I/O bus. Each controller is assigned a small set of I/O addresses to which the controller’s ports are connected:

• Data ports (read and/or write),

• Control ports (write only),

• Status ports (read only).

Usually there are privileged instructions, I/O-WRITE and I/O-READ, that transfer data to and from I/O addresses (and, thereby, to and from the corresponding controller ports), just as load and store instructions transfer data to and from main memory. Some architectures, however, use memory-mapped I/O, where the final, say, 512K memory locations are I/O bus addresses — in such cases, no special I/O instructions are needed and protection can be established via the normal memory-protection mechanisms.
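Under memory-mapped I/O, a driver typically declares the controller’s ports as volatile objects at a fixed, system-specific address so that the compiler actually performs every access. The register layout, the address, and the bit masks below are hypothetical:

    // Hypothetical register layout of a word-at-a-time controller mapped
    // onto the I/O-bus address range; the real layout is system-specific.
    struct ControllerPorts {
        volatile unsigned data;      // data port (read and/or write)
        volatile unsigned control;   // control port (write only)
        volatile unsigned status;    // status port (read only)
    };

    ControllerPorts* const ctrl =
        reinterpret_cast<ControllerPorts*>( 0xFFF80000 );   // hypothetical address

    const unsigned START_TRANSFER = 0x1;   // hypothetical control bit
    const unsigned BUSY           = 0x1;   // hypothetical status bit

    void writeWord( unsigned x ) {
        ctrl->data    = x;                 // write the value to the data port
        ctrl->control = START_TRANSFER;    // tell the controller to transfer it
        while ( ctrl->status & BUSY ) ;    // poll until the transfer completes
    }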

Word-at-a-time controllers. To write data to a device via a word-at-a-time controller, a CPU:

• writes the output data value (byte or word) to the appropriate data port on the controller,

• writes the appropriate control bits to the appropriate control port to cause the data in the controller’s data register to be transferred to the device,

• waits for the controller to signal that it has completed that data transfer (and is, therefore, ready to accept more data), either by the setting of an I/O-completion bit in one of its status registers or by causing an occurrence of the device’s I/O-completion interrupt — see Subsection 10.2.2 on page 160.

To instead read from the device, a CPU performs the second and third steps and then executes an I/O-READ instruction to read the transferred data from the appropriate data port of the controller to a CPU register. Reading or writing successive values simply involves repeating these steps.

Direct-memory-access (DMA) controllers. To perform DMA input or output, a CPU writes the appropriate control bits to the appropriate controller ports, thereby telling a device controller the length and address of a memory buffer where the controller should get or put a segment of data. The CPU then waits for the controller to signal completion of the requested data transfer, either by the setting of an I/O-completion bit in one of its status registers or by causing an occurrence of the device’s I/O-completion interrupt — see Subsection 10.2.2 on page 160. The difference between DMA and word-at-a-time transfers is similar to the difference between call-by-value and call-by-reference; in DMA transfers, the CPU gives the controller an address rather than giving or receiving a value.

Usually a DMA controller can transfer data faster than a CPU can perform word-at-a-time I/O (e.g., it can keep up with a disk) and sometimes allows the CPU to perform other useful work while transfers are in progress. However, DMA controllers compete with CPUs and with each other for access to the memory bus. Bus hardware arbitrates this competition on a word-by-word basis, but programmers must take care to avoid over-committing the memory bus. Note that CPUs can always wait, but a controller of a synchronous device will corrupt a data transfer if the controller is not ready or the bus can’t keep up. So CPUs should have the lowest bus priority.

[The following point is made again in 10.2.3.]

It sometimes happens that, when faster disks are installed on a system, some programs run slower. In such cases, a CPU can respond to an I/O-completion interrupt and rearm the DMA in the interblock-gap time of the slower disks but not that of the faster ones — reading a sequentially allocated file suddenly requires a disk revolution per block, which slows data transfer by about two orders of magnitude. There are, however, DMA controllers that can queue up several commands and respond to each during the corresponding interblock-gap time.1 CPUs give such controllers a sequence of I/O instructions, possibly by leaving them at a known place in memory. By following such a command sequence, the controller can move several pages from disk to frames in memory without interrupting the CPU. Such multi-sector transfers to and from non-contiguous sections of main memory are called scatter reads and gather writes, respectively. They are particularly useful for swapping an entire address space to or from disks. As a side benefit, such controllers also shield the CPU from much of the overhead of servicing interrupts.2

1The response of general-purpose CPUs is hampered by the need to run with interrupts off for periods of indeterminate length. Of course, the pages of a file can be interleaved with other pages to alleviate this problem, but doing so diminishes by a factor of two the effective bandwidth in reading successive pages of the file.

Video controllers work like DMA devices, but to transfer data to a video controller a CPU must specify:

• the starting address of a pixel array in main memory,

• where to display that pixel array on the screen (i.e., where to put it in the memory of the video controller),

• whichever two of the following that controller requires:

– the array’s length,

– the array’s width,

– the array’s aspect ratio,

– the array’s number of pixels (area).

The controller then performs the requested bit-block transfer operation (a.k.a. bitblt, pronounced “bitblit”).

I/O channels. With DMA controllers, the intelligence is in the controller card, which is plugged into the I/O bus. It is possible to put the intelligence into a special processor on the I/O bus and have that processor transfer blocks by issuing word-at-a-time commands to simple word-at-a-time controllers on the I/O bus. Such special-purpose processors are commonly called I/O channels. Some I/O channels handle only single-block transfers and then interrupt a CPU. Most, however, can operate from a queue of commands.

[Mention Shimon Muller’s stuff.]

10.2 I/O Programming

A driver for a certain device is a proxy server, typically implemented as a C module or a C++ object, through which clients (usually kernel threads) interact with that device. The driver implements an abstraction of the device and should reflect all of its important structure as conveniently as possible. Typically, a driver has services for initiating input and output, handling I/O-completion interrupts, handling exceptions, returning device status, initializing the device, resetting or restarting the device, and for various other device-specific duties. Often there is a special service (called ioctl in Unix) that performs many of these control services depending on parameter values and the kind of device involved. (See Section 2.3, on page 25.) Ultimately, the driver should have services corresponding to all of the device’s major operations.

2Modern disk drives avoid the need to respond within the intersector-gap time by chaining commands within the drive (SCSI) or by using large read-ahead/write-behind buffers (ATA and SATA). They do not, however, relieve the CPU of any interrupt-service burden.

The driver for a given device is an instance of its driver class, which is usually written to cover all device/controller pairs of a certain type. Unfortunately, both the class and its instances are referred to as “drivers,”3 so it is a good idea to use the terms “driver instance” and “driver class” whenever there is a potential for ambiguity. Sometimes driver classes are very specific to a particular make and model of controller, while others cover lots of makes and models of devices.

To write a driver class, a programmer should try to have detailed information about the device, the device-controller architecture, the CPU, the busses (their protocols) and chips between the CPU and the controller, and the prototypes (i.e., names and signatures) of the services by which clients expect to request the driver’s services.4 Often the various types of information must be obtained from different manuals written by different organizations, each using slightly different paradigms and terminology. Controller manufacturers are notorious for poor documentation, but one can hope for:

• highly accurate and detailed data sheets for the major chips on the controller card,

• an accurate and up to date schematic for the card itself,

• some (often misleading) paragraphs about the controller’s principles of operation,5

• object code for a sample driver for some architecture.

[Mention volatility and IO-registers. Mention treating volatiles as streams.]

To write a driver for a given controller, it is often necessary to reverse engineer that controller by studying its schematics, disassembling the sample driver, running experiments, phoning the manufacturer, and/or pleading for help on the Internet.6

10.2.1 Driver Binding (i.e., Registration)

[Reference the server-management chapter.]

Drivers usually run in privileged mode as part of the kernel. Adding a new driver class requires relinking the kernel, preferably via some form of dynamic linking/loading. Adding a device requires constructing the device’s instance of the corresponding driver class by invoking the class’s constructor and passing as parameters the controller’s port addresses and subdevice information (e.g., disk and partition numbers for a SCSI controller). The values for these parameters are commonly found in configuration tables, but some systems have the ability to probe I/O ports to determine the devices on the system and their port addresses.7

3This ambiguous use of the term “device driver” is yet another manifestation of the confusing tendency to use the same term in referring to derivatives (subclasses) of a class and to instances of (derivatives of) that class. For example, classes derived from Widget are called “widgets”, but so are instances of those derived classes.

4In C++, these interfacing conventions are best implemented via an abstract base class.

5These paragraphs give important clues to the thinking of the team that developed the controller.

6In writing drivers, I have even had to solder a jumper wire onto the controller to enable it to be reset without powering down.

Note that the vector table for interrupt handlers must be updated so that the appropriate driver service is requested whenever the device interrupts — each occurrence of an interrupt invokes a zero-ary stub function, which inspects various registers to determine which instance of which driver should have its interrupt handler invoked.8

Finally, a reference to the device’s driver (instance) is placed in the device’s descriptor in the server-management system (i.e., file system).

[Mention protection issues, as in Windows NT.]

In Unix a device is identified by its major and minor numbers. The major number specifies the driver’s class, while the minor number specifies which instance of that class is the proxy for (i.e., drives) this device. These numbers are stored in the device’s file descriptor (i-node) and can be viewed as indices into a possibly sparse two-dimensional array of references to device-driver objects.
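A sketch of that two-dimensional array, with hypothetical limits and a hypothetical abstract base class Driver standing for the driver-instance objects:

    class Driver;                 // abstract base class for driver instances (hypothetical)

    const int MAX_MAJOR = 256;    // hypothetical limits
    const int MAX_MINOR = 256;

    // Possibly sparse table of driver instances: null means "no such device."
    Driver* deviceTable[MAX_MAJOR][MAX_MINOR];

    // Look up the driver instance for a device, given the major and minor
    // numbers stored in its i-node.
    Driver* lookupDriver( int major, int minor ) {
        if ( major < 0 || major >= MAX_MAJOR ||
             minor < 0 || minor >= MAX_MINOR ) return 0;
        return deviceTable[major][minor];
    }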

10.2.2 I/O Coordination

One can partition each driver into a scheduler, which serializes access to the device, and a device-abstraction server through which the device is controlled. Most public services of a device must invoke the device’s scheduler services, acquire and release9 — an acquiring client may have to wait on a condition in the scheduler. After acquiring the device, the client leaves the scheduler and holds exclusive access to that device while writing to its controller’s registers and/or waiting on an I/O-completion semaphore in the device-abstraction server.

When a thread reads from a given disk, the handler for the read system call must acquire that disk by issuing an acquire request to the disk’s scheduler, then possibly issue a seek request to the disk’s controller and wait for the completion interrupt, and then issue a read request and wait for another completion interrupt. Then, the handler will release the disk and return some status information. When the thread performs its next disk read — perhaps only a few milliseconds later — the same sequence must be repeated.

Notice that, for disk I/O, there is an acquisition per data transfer, while for printer output there is a single acquisition per file printed, even though the printing of a file involves many write operations.

Polled direct I/O. In the most primitive method of input/output coordination, a thread waits for the completion of I/O operations by polling the busy flag (bit) of the controller’s status register until another item (word of data or device command) is available or can be submitted. The next I/O operation is then performed and waiting resumes.

7This capability is sometimes referred to as “plug-and-play” or possibly as “plug-and-pray.”

8In C++, the stub cannot be a general member function of the driver’s class, but it can be a static member function.

9C++ programs should adhere to the acquisition-via-initialization protocol and acquire devices by constructing a local sentry object, whose constructor acquires the device and whose destructor releases it.

    void output( Item x ) {
        EXCLUSION
        ...
        while ( busy() );        // Poll busy flag on status port.
        assert( ! busy() );
        write x to device ;      // possibly DMA
        ...
    }

Interrupt-driven direct I/O. The most obvious improvement is to give up the CPU while waiting, say by acquiring a zero-initialized semaphore (which we will call “completion”) inside the polling loop. The ensuing wait slows down the loop by releasing the CPU until there is a reason to recheck the controller’s busy flag, specifically, until the device’s interrupt handler releases completion.

    void output( Item x ) {
        EXCLUSION
        ...
        while ( busy() ) completion.acquire();   // Efficient polling.
        assert( ! busy() );
        write x to device ;                      // possibly DMA
        ...
    }

Polled I/O wastes CPU cycles during busy waiting, but interrupt-driven I/O incurs overhead in transferring the CPU from one thread to another. Polled I/O is used (and sometimes required) for low latency (i.e., quick response) or for very fast data streams where the waiting time is not much more than the interrupt-service overhead.10

Buffered I/O. We construct the monitor class OutBuffer by modifying the monitor class ThreadSafeQueue of Subsection 5.2.5, page 66. Producer threads run until the queue is full, and then wait. On each interrupt from the device, the interrupt handler removes an item from the buffer queue and outputs it.

10If a device can queue up at least one command and I/O times are somewhat predictable, it’s not always necessary to have I/O-completion interrupts. A timer can be set to interrupt just before the queued commands have completed.

In this case, the consumer is not a thread but rather an interrupt-sustained pseudo-thread that sleeps (i.e., waits) by requesting the device’s suspend service whenever the buffer becomes empty. Whenever it finds the device sleeping (i.e., the operationInProgress flag is false), the handler for append prods the device back to life by invoking device.fireup() and device.output().11

    template < class Item >
    class OutBuffer : Monitor, Queue<Item> {
        Condition nonfull;
        bool operationInProgress;
        Device& dev;
    public:
        OutBuffer( int size, Device& d, unsigned mask = interrupts.off )
          : Monitor(mask),
            Queue<Item>(size),
            nonfull(this),
            operationInProgress(false),
            dev(d)
        {}

        void append( Item x ) {
            EXCLUSION   // or called from a function that has EXCLUSION.
            while ( Queue<Item>::full() ) nonfull.wait();
            if ( operationInProgress ) {
                Queue<Item>::append(x);
            } else {                       // prod the device (back) to life.
                operationInProgress = true;
                dev.fireup();
                dev.output(x);             // possibly DMA
            }
        }

        void remove() {                    // called by interrupt handler.
            EXCLUSION
            if ( Queue<Item>::empty() ) {
                operationInProgress = false;
                dev.suspend();
            } else {
                Item x = Queue<Item>::remove();
                nonfull.signal();
                dev.output(x);             // possibly DMA
            }
        }
    };

11Alternatively, the interrupt handler could signal a condition on which waits a process consisting of an endless loop that removes a word from a ThreadSafeQueue, outputs it, and then waits again.

Input buffering can be handled by a similar but complementary modification of ThreadSafeQueue.

Often, several threads are concurrently engaged in reading from or writing to various devices. As discussed in Subsection 5.2.5, we can pool their buffer space by using a shared BufferAllocator to allocate buffers to each thread as needed.

Exercise. With the above implementation of OutBuffer, the producer might wait and resume on each item. To diminish this overhead, it is best that producers wait until the buffer is nearly empty before resuming. Modify OutBuffer so that it implements such a policy. How would you modify the corresponding input-buffering system? Under what conditions are similar modifications to ThreadSafeQueue useful?

10.2.3 Disk I/O

Information on a disk is divided into uniform-size chunks, called pages, blocks or sectors, typically .5 to 4 kilobytes. These chunks are the fundamental units of disk I/O transfer. The performance of a disk is characterized by the fact that it takes a long time to get the first byte of a block (e.g., 3 to 15 milliseconds) but successive bytes come fast (every 10 to 100 nanoseconds), e.g., latency takes about the same amount of time as transferring 250K of data.

Positional/rotational distance. The positional latency (a.k.a. seek time) between two blocks is the time that it takes the disk heads to move from the cylinder of the first block to that of the second block, including settling time. The rotational latency (sometimes called simply latency) between two blocks is the time that it takes the disk to rotate from the end of the first block to the beginning of the second. The overall latency (a.k.a. access time) from one block to another, i.e., the time it takes from the completion of an access to the first block until the beginning of an access to the second block, is the rotational latency, plus a full rotation time if positional latency exceeds rotational latency.12

Notice that rotational latencies are not symmetric in the sense that the latency from block A to block B is not necessarily the same as the rotational latency from B to A. So, the same is true of overall latencies.

12In fact, we need to add rotation time * int( positional latency / rotation time ), which is zero unless positional latency exceeds a rotation time.
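One way to combine the rule in the paragraph above with the refinement in footnote 12 is to keep adding whole rotation times to the rotational latency until the seek (positional latency) has been covered. This is a sketch under that reading, not a formula taken from the text:

    // Overall latency between two blocks: start from the rotational latency
    // and add whole rotations for each revolution missed while seeking.
    double overallLatency( double positional, double rotational, double rotationTime ) {
        double overall = rotational;
        while ( overall < positional ) overall += rotationTime;   // missed revolutions
        return overall;
    }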


Disk Scheduling. In contrast to CPU scheduling, disk-scheduling policy has a major effect on the throughput of even a single server (i.e., disk). At any point in time, a given disk’s request queue may have requests for sectors from several different threads, but usually no more than one per thread. There are several possible disk-scheduling policies:

• Most-distant-request-first (dumb)

• Nearest-request-first (leads to indefinite postponement)

• FIFO (inefficient)

• LOOK (a.k.a. the elevator algorithm): Make sweeps back and forth across the disk, always selecting for service the positionally nearest request in the current direction. Switch directions whenever there are no further requests in the current direction.

• C-LOOK (circular LOOK): Similar to LOOK but servicing requests in one direction only. When there are no further requests in that direction, C-LOOK services the positionally most distant request in the other direction. The point is that LOOK gives preference to requests in the middle cylinders of the disk, while C-LOOK does not.

• First-of-optimal-schedule: see item 5 of Subsection 8.3 on page 118 for details.13 In general, the optimal schedule is the one that completes in the shortest time, but we can impose certain constraints and/or secondary goals, for example:

– To maximize CPU utilization by minimizing the average waiting times for read requests,

– To minimize user annoyance by minimizing the variance in waiting times for read requests,

– To minimize the latency involved in anticipatory reads by assigning to each dirty page an optimal location as near after its predecessor or as near before its successor as possible,

– To avoid degradation in hit ratio due to an accumulation of dirty pages by imposing a stabilization deadline on each dirty page,

– To maximize stabilization throughput by writing dirty pages to the nearest free sector (unless optimal placement has very low cost) when their stabilization deadline nears,

– To maximize fault tolerance by writing metadata last.

13If the device controller can queue up commands, we might have to revoke and replace a queued command when a new request arrives.


Implementation. The following serializer class implements the LOOK scheduling policy. Like our previous protocols, rather than a queue of pending I/O commands, this implementation keeps a queue of waiting threads that issue their own I/O commands. Notice, however, that this protocol is much more complex than the FIFO protocol used above for OutBuffer.

    typedef int Cylinder;

    class DiskScheduler : Monitor {
        const Cylinder cylmax;
        Cylinder headpos;
        enum { up, down } direction;
        bool busy;
        Condition upsweep;
        Condition downsweep;
    public:
        void acquire( Cylinder dest ) {
            EXCLUSION
            while ( busy ) {
                if ( dest < headpos ) {
                    downsweep.wait( cylmax - dest );
                } else if ( dest > headpos ) {
                    upsweep.wait( dest );
                } else if ( direction == up ) {
                    upsweep.wait( dest );
                } else {   // direction == down
                    downsweep.wait( cylmax - dest );
                }
            }
            busy = true;
            headpos = dest;
        }

        void release() {
            EXCLUSION
            busy = false;
            if ( direction == up ) {
                if ( upsweep.awaited() ) {
                    upsweep.signal();
                } else {
                    direction = down;
                    downsweep.signal();
                }
            } else {   // direction == down
                if ( downsweep.awaited() ) {
                    downsweep.signal();
                } else {
                    direction = up;
                    upsweep.signal();
                }
            }
        }

        DiskScheduler()
          : cylmax( number_of_cylinders ),
            headpos( 0 ),
            direction( up ),
            busy( false ),
            upsweep( this ),
            downsweep( this )
        {}
    };

After returning from acquire, the invoking thread owns the disk, issues its own I/O commands, waits for their completions, and eventually invokes release to let another client use the disk.

Scheduling plus coordination. A CoordinatedDisk is a Disk together with a DiskScheduler. In addition, it has a semaphore, called completion, where threads wait until that semaphore is released by the handler for the Disk’s I/O-completion interrupt.

[Explain about disk partitioning, that partitions can span multiple devices. (Same for screens.)]

    class CoordinatedDisk : DiskScheduler, Disk {
        Semaphore completion;
    public:
        CoordinatedDisk( ... port addresses, capacity parameters, etc. ... )
          : completion(0)
        {
            Disk::initialize( ... port addresses, capacity parameters, etc. ... );
        }

        int input( int cyl, int track, int sector, void* buf ) {
            acquire(cyl);
            // I now have exclusive access to the disk drive.
            if ( headpos != cyl ) {
                Disk::seek(cyl);
                completion.acquire();
            }
            Disk::read( track, sector, buf );
            completion.acquire();
            int x = Disk::status();   // errors, etc.
            release();
            return x;
        }

        int output( int cyl, int track, int sector, void* buf ) {
            acquire(cyl);
            // I now have exclusive access to the disk drive.
            if ( headpos != cyl ) {
                Disk::seek(cyl);
                completion.acquire();
            }
            Disk::write( track, sector, buf );
            completion.acquire();
            int x = Disk::status();   // errors, etc.
            release();
            return x;
        }

        void completionHandler() {
            completion.release();   // no need for EXCLUSION.
        }
    };

Scheduling smarter disks. Modern disks, both IDE (a.k.a. ATA) and SCSI, have up to eight megabytes of on-board RAM cache and an on-board microprocessor. These disks treat their blocks as a one-dimensional array indexed in a way that minimizes latency between successive blocks that happen to be on adjacent tracks. A typical read or write operation specifies a range of blocks, and then the on-board processor handles the transfer of data to or from the on-board cache. The on-board processor begins transferring data as soon as the heads are within the requested block sequence, even at blocks in the middle of the sequence. It responds to read requests from the on-board cache as soon as possible and reads ahead a few blocks beyond the requested sequence, in case there would be an immediate request for blocks from the continuation of the sequence just read. Also, the on-board processor acknowledges writes as soon as the data is in the buffer, even before it has actually been transferred to the disk.

The driver of a well-scheduled disk must:


• know the head location both rotationally and positionally at any instant,

• know when the current disk operation will complete and what the rotational and cylinder positions of the heads will be and what pages will be in the cache and which will be dirty,

• be able to issue commands to the disk in a way that doesn’t involve a loss of productivity between successive commands.14

[Explain how.]

A driver can empirically determine its disk’s rotation rate and block placement. It can also determine an instantaneous rotation angle of the disk and then extrapolate the rotation angle at any point in time from the rotation rate, but it will have to recalibrate periodically. The driver can usually interrogate the drive regarding which cylinder the heads are on and/or set the heads to a determined position. Then one can use a simple mechanical model to predict head trajectories.

[Draw the diagram. Positional latency is approximately a × |cyl0 − cyl1| + b × min(c, |cyl0 − cyl1|); the second term is for settling time for the read heads. Give formulas from and reference to Wilkes’ work.]

10.3 Pushed Data Transfers

A standard I/O paradigm is that computers announce their readiness to transfer data and then wait for device controllers to perform the transfers. There are, however, devices and situations where failure to accept data on demand has a very high cost, e.g.:

• Streaming data from unbuffered video cameras or wind-tunnel experiments gets “spilled” if the consumer is unready. In general, inadequately buffered producers must transfer data on their own schedule rather than that of their consumers.

• The value of time-varying data often decays with the passage of time.

• Some data must be transferred during infrequent windows-of-opportunity, e.g., a satellite passing overhead briefly every 90 minutes. Similarly, if data from an unbuffered disk is missed, the delay of another disk revolution is incurred.

Consuming such pushed data is a bit like drinking from a fire hose. For communication to be successful, at least one end must be accommodating in the sense that it is ready to produce or consume data on demand. A demanding producer can only communicate with an accommodating consumer, and a demanding consumer requires an accommodating producer. In the standard paradigm, it is the computer (i.e., CPUs and memory) that is accommodating.

A buffer is a server that accommodates in both directions; it receives data whenever data is pushed and provides data when and only when data is requested. For proper operation, there must be buffers between demanding producers and demanding consumers. But every buffer has finite capacity. How much buffering is required often depends on interrupt-response latencies. Any buffer will eventually overflow if the production rate persistently exceeds the consumption rate.

[Explain the downside in each case of mismatch. Review ideas from the Telemetry project.]

14To mitigate the latency involved in interrupt response, systems use I/O channels, disks that support command queuing, or disks that do prefetching in the form of read-ahead.

Even more complex are systems where the data rates vary unpredictably. In such cases, designers must not only guard against data spillage but also against its spoilage, i.e., sitting too long in unprocessed buffers. As we’ve already noted, the value of time-varying data often decays with the passage of time.

10.4 Reduced-Copy I/O

One of the biggest sources of operating-system overhead is the copying and recopying of data during I/O activities. In many systems, a page from disk is copied from a device controller to a buffer in the disk’s driver, then into the disk cache, and then to the user’s buffer area specified in an invocation of the read system call. One obvious speedup is to read directly into the disk cache. Another is to use page-remapping and copy-on-write to eliminate even more copying.

[Give real monitor for the scheme proposed by Kai Li et al. Mention Unix’s “strategy” scheme. Show how to manage a scatter/gather controller.]

Chapter 11

VOLUMES and DYNAMIC BINDING

[Discuss CORBA, pipes, sockets and v-nodes. Move this material to chapter on binding. It is the basis for binding.]

11.1 Opened files as streams

[Resolve “partition” vs. “device.”]

An ordinary file on disk is simply a persistent data container, typically an extensible vector of bytes or a list of records stored, say, as some kind of balanced tree. By contrast, opened-files are servers providing read, write, and seek services, e.g., under Unix:

Once a file is open, the following calls may be used:

n = read(filep, buffer, count)

n = write(filep, buffer, count)

Up to count bytes are transmitted between the file specified by filep and the byte array specified by buffer. The returned value n is the number of bytes actually transmitted. In the write case, n is the same as count except under exceptional conditions, such as I/O errors or end of physical medium on special files; in a read, however, n may without error be less than count. If the read pointer is so near the end of the file that reading count characters would cause reading beyond the end, only sufficient bytes are transmitted to reach the end of the file; also, typewriter-like terminals never return more than one line of input. [RI]

In C++, the stream class provides exactly such services. Since programs see only opened files, from a program’s perspective an opened-file is a stream of data objects, usually bytes.
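For concreteness, here is a small, self-contained use of the Unix calls quoted above (a sketch with minimal error handling): it copies a named file to standard output.

    #include <fcntl.h>     // open
    #include <unistd.h>    // read, write, close

    // Copy a file to standard output using read and write.
    int catFile( const char* path ) {
        int filep = open( path, O_RDONLY );
        if ( filep < 0 ) return -1;
        char buffer[4096];
        int n;
        while ( ( n = read( filep, buffer, sizeof buffer ) ) > 0 )
            write( 1, buffer, n );    // 1 = standard output
        close( filep );
        return 0;
    }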

The connection between opened-files and ordinary files is reflected in the fact that, under C++, streams and iterators into byte containers are equivalent in the sense that each can be implemented in terms of the other in obvious ways.1

Thus, turning an ordinary file into a stream requires the addition of an iterator (i.e., offset counter) that tells where the next read or write will occur. Each time an ordinary file is opened, a different iterator into its container is created, so we speak of an opened version of a file rather than the opened version of a file. When a process forks, the child gets a copy of the parent’s per-process open-file table. The parent and child then share their open-files — if either of them writes to an open-file, that shared iterator (stream) gets incremented. Since a process inherits open-files from its parent, it possibly shares with its extended family (parents, siblings, etc.) access to such opened versions of ordinary files.

[Mention seek, append, dup, and dup2.]

General servers as files. There are several classes of persistent streams that aren’t simply iterators into persistent byte arrays, e.g.: pipes, terminals, sockets, and devices such as printers. Most devices are inherently sequential and their I/O operations have a semantics that resembles read and write. In Unix, these streams are called “character devices.” Disks, by contrast, are randomly accessible storage devices, called “block devices” in Unix literature. They are best treated not as streams but as arrays of blocks of bytes (chars), each block being an array of a fixed length (typically a power of two in the range 512 to 4096). Often block devices are segmented into subdevices (actually, subarrays) called partitions.

[Revise.]

In many operating systems, e.g., Unix, most user-accessible kernel-managed resources are treated as persistent streams.2

• In Unix, I/O devices are called “special files,” even if they do not deal with streams of data. As discussed in Section 2.3 on page 25, the ioctl operation can be used to accommodate this mismatch of abstract behavior — think of ioctl as an all-purpose service-request function.

• Although programs are contained in ordinary files, they provide their execute service and, thus, can naturally be viewed as servers.

• Under some versions of Unix, certain ordinary files can be “memory mapped” into a client’s address space and treated directly as the byte vectors that they really are. (See Section 9.7 on page 152.)

• Linux has what is called the proc file system, which is used to inspect and control running processes, including clone-created threads.

Mention object-request brokers, but note that they are not allocated to partitions.

1See [STR], pp. 637-642 and pp. 558-559. Also, to improve efficiency, when a C++ program opens a persistent stream, the program creates a buffered proxy for that stream — see [STR], Section 21.6, p. 642.

2Even bitblt can be implemented as writing to an open file by issuing a generalized seek that specifies a screen position and aspect ratio (or width or height) followed by a standard write specifying the memory buffer and length of the bit block to be transferred.


11.2 Volumes and File Systems

The kernel establishes and controls each client thread's access to servers other than those inside the client's own address space. Such kernel-managed servers are persistent, in the sense that they exist until they are destroyed via a system call and can, therefore, outlive the process that creates them.3 The kernel determines to which handler of which persistent server a given service request resolves. If the original service request is disallowed by the protection system, the kernel invokes a different service of a possibly different server.

Some persistent, kernel-managed servers (resources), e.g., processes, users, and groups, are managed by submitting their handles (ID numbers) along with other parameters to special system calls. Often their descriptors are kept in special tables that are initialized from configuration files, in which case their protection can be handled via ad hoc mechanisms, e.g., file protection on their configuration files.

Most persistent servers, however, are managed by special servers called volumes, which are instances of manager classes called file systems.4 Volume-managed servers are called files.5 Most file systems are implemented as kernel modules, but some are executables whose instances are daemons (user-mode helper processes) that communicate with the kernel via signals and system calls.

Each file belongs to one and only one volume and has a volume-relative handle. At any point in time, each on-line volume occupies a particular partition on a particular host machine and has a host-relative handle. Each host maintains a mount table mapping the volumes it hosts to their respective partitions. To identify a file networkwide, we must know its volume-relative handle, its volume's host-relative handle (e.g., device handle plus partition number), and its host's identity — together, those components constitute the file's networkwide identity.

A volume manages the creation, destruction, storage, and naming of its files. On behalf of its direct client, the kernel, a volume also manages all binding and access to the services of its files. Most volumes can be moved to any sufficiently large partition on any system having a compatible implementation of the volume's file system.6

Also accounting and the enforcement of quotas.

One can view volumes as caches for huge two-dimensional byte arrays, whose rows are files. Also, one can think of a volume as a dynamic multiserver proxy that offers naming, protection, and memory-allocation services.

A volume has the following major subsystems:

• A name manager (e.g., a directory tree), which associates possibly multiple names with the handles of each of the volume's files.

3By way of contrast, note that C/C++ objects of "dynamic" storage class last until they are deallocated via an invocation of delete or free or the address space of the object's creator goes away.

4Volumes are sometimes also called "file systems." This ambiguous use of the term "file system" is yet another example of the confusing practice of referring both to classes derived from Widget and to their instances as "widgets."

5The kernel running on a given system can have many file systems at once. Among the file systems for Linux are: ext2, xfs, reiserfs, ufs, ntfs, etc. In Unix jargon, the "fs" suffix indicates a file-system class.

6Note, however, that you can’t move a volume that contains devices to another machine.


• A descriptor manager (a.k.a. table of content), which associates each volume-managed server's handle with that file's descriptor.

• An access manager (a.k.a. protection system), which interacts with per-process capability7 caches, i.e., per-process open-file tables.

• A free-space manager (a.k.a. free-space list), which allocates the volume's free space to the volume's files that need space.

Typically, the first or second block on a partition (in Unix, the super block) contains a descriptor of the volume residing on that partition, i.e., where to find the volume's name manager, free-space manager, and descriptor manager. System-wide name managers, descriptor managers, and access managers are built upon their respective volume-local counterparts. Commonly, however, there is no system-wide free-space manager, since every file resides in its home volume's space.
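As a rough illustration, a volume descriptor (superblock) might record where each of those subsystems lives on the partition; the field names and sizes below are purely illustrative, not those of any particular file system.

#include <stdint.h>

struct superblock {
    uint32_t magic;              /* identifies the file-system class */
    uint32_t block_size;         /* bytes per block, typically 512..4096 */
    uint32_t root_dir_handle;    /* where the name manager (directory tree) starts */
    uint32_t table_of_content;   /* first block of the descriptor manager's table */
    uint32_t n_descriptors;      /* number of descriptor frames in that table */
    uint32_t free_list_head;     /* first block of the free-space manager's list */
    uint32_t n_free_blocks;      /* current count of unallocated blocks */
};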

11.3 Volume-based Name Management

There are many name managers in most systems, for example:

• the password file

• NIS

• DNS

• NFS

• non-distributed file system (directories)

• the mount table

• NFS (automounter)

• port mapper

• header files for ioctl()

A volume's name manager is a table that maps character strings, called names, to volume-relative handles of local files or alternative names of non-local files. Typically, files called directories organize names into a directed graph with name-labeled directed edges connecting each directory to the files it names.8

The named file can be any file in the directory's volume, even another directory. Normally, the name graph is required to be a rooted, acyclic, directed graph whose non-leaf nodes form a tree (i.e., directories can't have multiple parents).

7A capability is the right to obtain a particular service from a particular file.

8See the implementation paragraph below for details on how this "connection" can be implemented.


In Unix, there are two special names in each directory, . and .., which refer, respectively, to the directory itself and to its parent directory.

When a client refers symbolically to a file, it does so via a segmented pathname. For example, the segmented pathname goo/foo/bar denotes a file named bar in the directory named foo in the directory named goo in the starting directory. The pathname is looked up (i.e., resolved) iteratively, a segment at a time, left to right.9 Each segment lookup, except the final one, yields the volume-relative identity of a volume-local directory in which to look up the next segment. On each segment lookup, there is a check of permissions to see if the client has lookup access to that directory. Different name managers may use different symbols to separate segments, e.g., slash, back-slash, or dot. Also, some name managers process the segments right to left, while others process them left to right.
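The iterative, left-to-right lookup just described can be sketched in C as follows; dir_lookup and may_search are hypothetical stand-ins for the name manager's segment lookup and the access manager's permission check.

#include <string.h>

typedef int handle_t;                       /* volume-relative file handle */
extern handle_t dir_lookup(handle_t dir, const char *segment);  /* -1 on failure */
extern int may_search(handle_t dir);        /* does the client have lookup access? */

handle_t resolve(handle_t start_dir, const char *path) {
    char copy[1024];
    strncpy(copy, path, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';
    handle_t dir = start_dir;
    for (char *seg = strtok(copy, "/"); seg != NULL; seg = strtok(NULL, "/")) {
        if (!may_search(dir))
            return -1;                      /* permission check on every segment */
        dir = dir_lookup(dir, seg);         /* yields the next directory (or, finally, the file) */
        if (dir < 0)
            return -1;                      /* no such name in this directory */
    }
    return dir;                             /* handle of the named file */
}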

Since it is cumbersome to specify a full (i.e., root-relative) pathname on every access, each client has a current-directory attribute. In Unix, searches start from the client's current directory unless the first segment of the pathname is null (i.e., has length zero), in which case they start from the client's root directory. For instance, /foo1/foo2/foo3/bar is a full pathname. The initial current directory, following login, is the user's home directory.

Some file systems also allow "symbolic links," which are directory entries that resolve to full pathnames. When a lookup reaches such a symbolic link, it starts over at the root using the name obtained by appending the balance of the original pathname to the symbolic link.

Mounting remote subdirectories. Similarly, in some file systems, a remote directory can be "mounted" on top of any local file by making the local file a mount point (i.e., by setting the mount-point bit in the local file's descriptor) and installing in a system-wide mount table an entry that maps the mount point's identity to the mounted directory's networkwide identity.

To facilitate network-wide lookups, each system maintains a cache that maps full pathnames of files to their network-wide identities. This name cache must include the names of the root directories of the volumes for which the system is host. When a system on a network broadcasts10 a lookup request with a full pathname, each system looks in its own name cache. The file's host responds with the file's network-wide identity. Whenever a host rejects or fails to respond to a cached identity, that entry is removed from the cache, which is then dynamically updated via a subsequent broadcast request — an instance of "detected invalidation". To preserve the correctness of cached information, it is important that a file's identity not be reassigned until it has disappeared from all name caches throughout the network; each cache entry may, for instance, be given a limited time-to-live.

9Internet DNS name resolvers process names from right to left.

10Note that even if the network's technology does not directly support broadcasts, messages can go out along a spanning tree in a way that simulates a broadcast.

A successful full-pathname lookup returns the named file's network-wide identity. The pathname is initially put into canonical form by removing occurrences of dot and collapsing occurrences of dot-dot. The lookup begins at the directory corresponding to the pathname's longest-prefix match in the requestor's name cache. The requestor removes that prefix from the pathname and submits the remainder of the pathname plus the starting directory's identity to the directory's volume via a remote procedure call to the volume's host. The volume's name manager processes that suffix, segment by segment, starting at the specified directory, and returns the number of segments processed together with the outcome:

• failure, i.e., a segment lookup reached a non-directory file or a directory that did not contain a name corresponding to that segment,

• success, in which case the file’s identity is also returned,

• symbolic-link-encountered, in which case that link's full pathname is also returned — the requestor then combines the returned pathname with the remaining pathname suffix to form a name on which to start over,

• mount-point-encountered, in which case the volume looks up the mount point in the host's mount table and returns the networkwide identity of the mounted directory — the requestor then continues the search at that directory.
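A hypothetical C record for the result of such a segment-by-segment lookup request might look like the following; all names and sizes here are illustrative.

enum lookup_status {
    LOOKUP_FAILURE,              /* bad segment or a non-directory was reached */
    LOOKUP_SUCCESS,              /* identity below is valid */
    LOOKUP_SYMLINK,              /* link_target below holds the link's pathname */
    LOOKUP_MOUNT_POINT           /* identity below names the mounted directory */
};

struct lookup_result {
    enum lookup_status status;
    int segments_processed;      /* how far the volume got before stopping */
    long networkwide_identity;   /* host + volume + file handle, packed */
    char link_target[256];       /* filled in only for LOOKUP_SYMLINK */
};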

Implementation. Typically, directories are files containing tables that map non-segmented names to volume-relative handles. Directories can be read and written only by special kernel routines. In Unix, for instance, a process needs write access to a directory to add, delete, or change a name in it.11 A process needs execute access to a directory to make it the process's current directory and/or to look up a name in that directory. It is possible to change the effective root directory for a process (via chroot) so that it and its descendants cannot access files outside a certain subtree of directories.

Alternatively, a name manager can be implemented as a table that maps keys consisting of a name (string) plus a directory handle to the volume's files.

Typical operations of a volume’s name manager would be:

• lookup(string) returns the file-handle (or nil, on failure) of the file currently bound to the pathname in the specified string.

• name(file-handle, string) binds the name contained in string to the specified file and increments the file's name count.

• unname(string) decrements the name count of the file whose name is contained in the string and removes that binding for that name.

11Note that having write access to a directory that contains a sensitive file allows a process to change the effective content of that file by giving its name to a suitably modified version of that file.


• rename(string1, string2) binds (or rebinds) the name contained in string1 to the file whose name is contained in string2, and increments its name count.

• chdir(directory) causes the specified directory to become the process's current directory (like the command cd under Unix).

• enum(directory, index) returns the string corresponding to that index (the index-th name) within the directory, in alphabetic order. enum can be used to enumerate all items of a given directory, as in the Unix command ls.

Use iterator terminology and notation.

Each operation returns an indication of success or failure, unless another return value has been specified.
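The operations above might be sketched as a C interface along the following lines; the types, names, and exact signatures are illustrative only.

typedef int handle_t;            /* volume-relative file handle; -1 means nil */
typedef int status_t;            /* 0 on success, nonzero on failure */

handle_t nm_lookup(const char *pathname);
status_t nm_name(handle_t file, const char *name);        /* increments name count */
status_t nm_unname(const char *name);                      /* decrements name count */
status_t nm_rename(const char *new_name, const char *old_name);
status_t nm_chdir(handle_t directory);
status_t nm_enum(handle_t directory, int index, char *name_out, int len);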

11.4 Descriptor Management

Each volume-managed file has a descriptor that contains the information needed to access and manage that file. A file's descriptor is (part of) the file's state. Each file is identified within its volume by a unique handle, typically an unsigned integer, that the descriptor manager maps to the file's descriptor. The fields of a descriptor are called attributes. They typically include:

1. garbage-collection information, e.g., the file’s name count,

2. protection information — in Unix, the handles of the owning user and owning group, plus nine protection bits, which are discussed in Section 12.6 on page 194,

3. times of creation, last access (i.e., use), and last modification,

4. the file's class (in Unix, a code indicating whether the file is a directory, an ordinary file, or a special file, i.e., a device),

5. class-specific attributes, e.g., for a disk-based file one needs a reference to a volume-local list of its pages (often implemented as a B+ tree).

Each volume has its own descriptor manager that stores and manages the descriptors for the files on that volume.
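Gathering those attributes into a C record gives a rough picture of a descriptor; the field names and widths below are illustrative, not those of any particular implementation.

#include <stdint.h>
#include <time.h>

struct descriptor {
    uint32_t name_count;          /* garbage-collection information (hard links) */
    uint32_t owner_uid, owner_gid;
    uint16_t protection_bits;     /* nine mode bits, plus set-UID/set-GID */
    time_t   created, accessed, modified;
    uint8_t  file_class;          /* directory, ordinary file, or special file */
    uint32_t page_list;           /* class-specific: root of the file's page list */
};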

In Unix, a file's handle, called its i-number, is simply the index of its descriptor (i-node) in the volume's table of content (i-list), which is implemented as an array of i-node frames at the beginning of the volume's partition. An array is possible in this case, since i-nodes are of uniform size. In most schemes for generalizing from files to arbitrary servers, we lose that uniformity and the table of content requires more complex accessing mechanisms.


Caching of descriptors. A volume's descriptor manager maps volume-relative file handles to descriptors of volume-local files. Although it is possible to memory-map each local volume's tables of content, most systems individually cache file descriptors in the kernel's heap — such individually cached copies are usually garbage collected via an access count, i.e., the number of times the file has been opened and not yet closed.12 Either way, all references to the file are directed to a unique memory-resident copy of the descriptor, even if the file is opened simultaneously by multiple processes.13

Each descriptor has a name-count attribute that corresponds to the number of names (a.k.a. hard-links in Unix) that file has in its volume's name manager. When one of its names is removed, its name count is decremented. A file is destroyed as soon as its name count becomes zero and its descriptor is no longer cached — the resources allocated to that file are then released, including its disk space and descriptor.

Distributed descriptor caching runs into some obvious cache coherence problems unless the descriptor is singly cached or all instances of it are read-only. Most distributed file systems, for instance, provide read and write services over the network but do not use client-local descriptor caching.14 Instead, the file's descriptor is cached in the active-file table of the file's host.

Note that a stream based on a remote file is local. The remotely provided service is memcopy on read/write.

Operations. A descriptor manager typically provides certain operations:

• create(class) constructs (i.e., allocates and initializes) a new file of the specified class and returns the new file's handle.

• link(handle) simply increments the handle-specified file's name count.

• unlink(handle) simply decrements the handle-specified file's name count, and destroys the file when the count becomes zero.

• destroy(handle) destroys the specified file, whose descriptor is then released by the descriptor manager.

• lookup(handle) returns a reference to the handle-specified file's descriptor.

12Each time a file is opened, the access count associated with its active-file-table entry is incremented. Whenever the file is closed, its access count is decremented. When that count becomes zero, the descriptor is released from the cache, but it is first written back to its table of content if it is disk resident and dirty.

13When an ordinary file is opened, its descriptor is fetched. Also, a descriptor for the corresponding stream is created and cached either in the active-file table or in a separate active-stream table. Although such a stream is associated with a volume, it is not disk resident.

14Note that a remotely accessed file's time-of-last-access attribute is modified on every access but can perhaps be updated only on fetch and/or write-back.


11.5 Access Management

When a client opens a file, a binding to that file is created and stored in the client's open-file table, which is a system-managed array of (references to) bindings that is inaccessible to the client except via system calls. The binding's index within that array is returned to the client to be used as a client-local handle for that binding.15 The binding contains a reference to the file's cached descriptor, and a map telling the access status that the protection system has granted this binding to each of the file's services, e.g.:

• unchecked

• allowed

• suspended16

• disallowed.

An allowed service is viewed as a "cached capability."

The following are typical services provided by an access manager:

• open(string) looks up the name contained in string via the name manager, installs a binding for the named file in the first available entry in the binding table, and returns that entry's index. It also increments the file's binding count.

• close(binding) removes the specified entry from the binding table and decrements the corresponding file's binding count.

• reopen(binding, string) closes the specified binding and reopens it to the file named by the string.

• make(class) requests the descriptor manager to create a file of the specified class and then opens it (without giving it a name).

• name(binding, string) requests the name manager to install an entry for the file that binding refers to in the current directory under the name specified by the string and increments the file's name count.

• unname(string) requests that the name manager remove the corresponding entry from the name table and decrements the file's name count.

• dup(binding) duplicates the specified binding in the first available binding-table entry and returns that binding's index in the client's binding table. It also increments the file's binding count.

15In an unfortunate misuse of terminology, Unix literature refers to such indices into open-file tables as "file descriptors." A "descriptor" should describe something. Integers don't.

16The suspended state permits allowed capabilities to be programmatically suspended, say, for purposes of debugging.


• request(binding, service, parameter) requests, from the file specified in the binding, the specified service with the specified parameter (checking capabilities, of course, and possibly caching them) — essentially the same operation as Unix's ioctl. The correspondence between operation names and service codes for a particular class can be stored in header files or in a run-time table. Of course, the name manager could be expanded to look up service codes.

• suspend(binding, service) suspends a particular cached capability.

• resume(binding, service) restores a suspended capability.

• petition(binding, service) returns an indication of whether this service is allowed and, if so, caches the corresponding capability. For real-time applications, worst-case time must be guaranteed. Avoiding possibly unacceptable delays on first access requires the ability to pre-cache capabilities.

Except where otherwise specified, each operation returns an indication of success or failure.
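A small C sketch of a binding-table entry and of the per-request capability check implied by the services above; the structure layout and the service encoding are illustrative, not taken from any real kernel.

enum access_status { UNCHECKED, ALLOWED, SUSPENDED, DISALLOWED };

struct descriptor;                       /* the file's cached descriptor, declared elsewhere */

struct binding {
    struct descriptor *descr;            /* reference to the cached descriptor */
    enum access_status cap[16];          /* one cached capability per service code */
};

/* Returns 0 if the request may proceed, -1 otherwise. */
int check_request(struct binding *table, int index, int service) {
    struct binding *b = &table[index];   /* index is the client-local handle */
    if (b->descr == 0) return -1;        /* no such binding */
    switch (b->cap[service]) {
    case ALLOWED:   return 0;            /* cached capability */
    case UNCHECKED: return -1;           /* caller must petition() first */
    default:        return -1;           /* suspended or disallowed */
    }
}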

Remote procedure calls allow remote invocation of the services of the access manager and, therefore, remote access to volume-managed files, as long as all parameters are passed and returned by value. But the most important I/O services, read and write, take as parameters the addresses of buffers, which are not directly helpful over the network. Instead we need to add remote read and write services, which really copy segments of data in a remote-DMA fashion.

Exercise. Must a client's open-file table be an array? Why can't open simply return a reference to the binding it has created?

11.6 Free-Space Manager

The free-space managers of volumes implement the placement policies for disk-based archiving. Those policies, in turn, have a major impact on system performance.

We view a disk as consisting of partitions, each consisting of a contiguous group of cylinders. Each partition can be viewed as a one-dimensional array of randomly accessible page frames, also called blocks, each consisting of say 4K bytes.17

It is not unusual for CPUs to have very low utilization while many threads wait for data to be fetched from disks. At such times, traffic between disk and main memory is a performance bottleneck. The main problem in improving the utilization of potential disk bandwidth is to diminish positional and rotational latency. Roughly speaking, we need to store pages that will be accessed together near each other.

17It is not necessary that VM-page sizes and disk-page sizes be the same, but one should be a multiple of the other.


11.6.1 Fetch Optimization

Refer to the device-management section on positional/rotational distance.

It is difficult to guess which ordinary files will be accessed together,18 but there is some correlation between one page of a file being accessed and other pages of that file being accessed. Such correlation is biased in the forward direction; when a page of a file is accessed, the file's next page is more likely to be accessed soon than is the previous page. If a page is likely to be fetched soon after another, we should place it as soon as possible after that other page and prefetch it whenever the other page is fetched, thereby saving rotational latency and possibly positional latency. Taking advantage of such correlation in the fetching of data involves:

• The placement policy for disk pages: when a file needs another page, which block do we give it? To optimize for fetching, we place it as near after the file's previous page or as near before the file's next page as possible.

• The fetch policy for disk caching: if a page of a file is accessed, also fetch some of its successor pages within that file.19

Refer to the paper by Kai Li and others.

The effects of page size. Let us consider, from the point of view of these two policies, the tactic of doubling a system's page size. In effect, this tactic forms the new larger pages by combining two of the smaller pages — the 2n-th page is paired with the (2n+1)-st page. So, in terms of the original pages, the placement policy is effectively that the (2n+1)-st page is always placed immediately after the 2n-th page, and the fetch policy is effectively that both pages are fetched whenever either of them is fetched. The combined effect is that the second page requires no positional or rotational latency in its fetch, only transfer time. In a certain percentage of cases this fetch wastes bandwidth, but the access correlation is often high enough to compensate. There is another penalty for prefetching: prefetched pages that are not accessed tend to fill the cache with useless data, thereby diminishing the overall hit ratio. Obviously there is a point beyond which these penalties outweigh the benefits of increasing the page size.

Does FFS equal UFS?

FFS. Early Unix used a random placement policy. When a file needed another page it was given the first block from the randomly organized free-block list, a policy that led to high latencies. The following is Thompson's description of that allocation policy for disk storage [TH]:

The disk allocation algorithms are very straight forward. Since all allocation is fixed-size blocks and there is a strict accounting of space, there is no need to compact or garbage collect. However, as disk space becomes dispersed, latency gradually increases. Some installations choose to occasionally compact disk space to reduce latency.

18Ahmed Amer [?] has shown that it is a good bet that the file that was accessed next last time will be accessed next this time.

19In cases where the disk's placement policy attempts to keep successive pages of a file near each other on disk, it possibly pays to delay seeks for a short time after completion of a read or write in order to see if the completion-awakened thread wants to fetch or write the next page of that file.

By contrast, the BSD Fast File System (FFS a.k.a. UFS) [MK] subpartitions each disk partition into bands of adjacent cylinders called cylinder groups or buckets. It attempts to spread unrelated pages uniformly among the buckets by use of a hashing function. It allocates the pages of a given file to the same bucket up to the point where continuing to do so would adversely impact the amount of remaining free space in the bucket. It then allocates the pages of that file from another bucket, etc. It tries to keep the i-node of a file in the same bucket as that file's (first set of) pages. It also tries to keep the i-nodes of files from a given directory in the directory's bucket in order to make ls run fast. This cylinder-group scheme uses clustering of references, a bit like radix sorting. It minimizes positional latency only. Within each bucket, however, FFS places the successor of any page of a file as soon after that page as possible to facilitate sequential access to that file.
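The gist of that placement policy can be sketched in C as follows; the hash, the spill-over threshold, and the helper functions are all illustrative stand-ins, not the actual BSD algorithm.

#define N_BUCKETS 64

extern unsigned free_blocks_in(unsigned bucket);                /* hypothetical helpers */
extern unsigned alloc_near(unsigned bucket, unsigned prev_block);

/* Choose a cylinder group (bucket) for the next page of a file and allocate there. */
unsigned place_page(unsigned dir_inumber, unsigned current_bucket, unsigned prev_block) {
    if (prev_block == 0)                          /* first page: start near the directory's bucket */
        current_bucket = dir_inumber % N_BUCKETS;
    if (free_blocks_in(current_bucket) < 32)      /* bucket filling up: spill to another one */
        current_bucket = (current_bucket + 1) % N_BUCKETS;
    return alloc_near(current_bucket, prev_block);  /* place just after the file's previous page */
}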

11.6.2 Write-Back Optimizations

When the cache is vulnerable and sufficiently large and the archive is less vulnerable (e.g., non-volatile), nearly all inter-level traffic is associated with the stabilization of dirty items. These archived copies are unlikely ever to be read, especially when the cache is large.

• To improve CPU utilization, give fetching priority over archiving as long as the window of vulnerability is below some acceptable threshold. As the window of vulnerability becomes large, increase the priority of archiving.

• To improve throughput to disk-based archives, archive to the nearest free block (in the current sweep direction, inward or outward) from wherever the heads happen to be. In cases where fetches are infrequent, such scattered placement will not impact fetch latency. It will, however, slow down prefetching (relative to say FFS) by increasing the distance between successive pages of any given file.

• To better tolerate crashes, archive an item only after all items it references have been archived. In particular, archive metadata last, i.e., archive descriptors after the data they describe.20

This needs further explanation.

11.6.3 Fault-Tolerance Strategies

There have been a number of file systems where, if the directory got lost, so would the data from all files. Other file systems require a lengthy check of file-system consistency following system crashes.

20My thanks to Michael Griffith for pointing out the importance of this protocol to the fault tolerance of file systems.


There have been several systems designed to alleviate this vulnerability. One early, simple, and interesting effort was written by a local undergraduate for the local newspaper's IBM 1130. Following a particularly unfortunate crash that wiped out the system's directory and, with it, a number of stories filed by reporters, this student wrote a directory-less file system. To look up a particular page of a particular file, one simply did a hash lookup on the name and page number, which resolved to a particular block on disk. Collisions were resolved linearly. This system was very robust and no slower than the Unix systems of that time.

More recently, people have turned to log-structured and journaling file systems.

Discuss LFS, JFS, WAFL, placement, etc. See http://www.linuxgazette.com/issue55/florido.html. See also (google) "Seneca: remote mirroring done write" by John Wilkes et al. For WAFL see http://www.usenix.org/publications/library/proceedings/osdi99/full_papers/hutchinson/hutchinson_html/hutchinson.html and http://www.netapp.com/tech_library/3002.html.

11.7 File Management Under Unix.

In Unix, special operations control access to files:

• open() establishes bindings, at the same time checking and caching requested capabilities. The call fails if any requested capability is disallowed.

• execute() involves a protection check but no binding or capability caching, on the assumption that, once an execute capability is exercised, it is unlikely to be exercised again in the near future.

• chmod() changes the protections associated with a specified file.

Devices are represented as special files, whose i-nodes contain the major and minor numbers of the corresponding device. With sequential devices, such as tape drives, there is an intrinsic notion of current byte and next byte, etc. By contrast, the i-nodes for directories and ordinary (non-special) files describe persistent arrays of bytes. In particular, there is no concept of "current byte" or "next byte," etc. To each entry in the active i-node table, one could add an offset field that tells where in that file the client currently is. But, what if more than one client currently has the file open and each client is at a different point in the file? As usual, when faced with a problem, we add another level of indirection; we create a server of type "file" that consists of a pair — a pointer to an active i-node plus an offset counter (iterator) that tells which byte of that file is the current one. These pointers need to be sharable, for if a child process inherits an open disk-resident file from its waiting parent (e.g., a shell running this child as a foreground command), the child's output should be concatenated onto whatever the parent has written so far, just as it would be if the file were a terminal. Accordingly, Unix has a system-wide open-file table that contains files, in the sense defined above, i.e., pairs consisting of a reference to an active i-node plus an offset counter. Of course, if the active i-node is that of a serial device, such as a printer, the offset counter is not needed.

When a process invokes open:


1. The string passed as argument is looked up by the name manager, which finds the i-number of the file and its major and minor device numbers.

2. A copy of its i-node is placed in the active i-node table, if it is not already there; otherwise, that entry's reference count is incremented.

3. The requested services are checked against the protection bits in the i-node and, if they are rejected, this call to open fails.

4. A new entry is created in the system-wide open-file table for this file containing a pointer to that active i-node and the initial offset.

5. The first free entry in the per-process open-file (persistent-server) table (which is part of the process's descriptor) of the calling process is filled in with a pointer to that entry in the system-wide open-file table, and bits are set telling which services have been requested (and approved). These bits are kept with the per-process entry in SystemV-derived systems and with the system-wide entry in BSD-derived systems.

Check this.

6. The index of this entry in the per-client open-file table is returned to the client to use in subsequent requests for read, write, and seek services. Such requests check the entry's access bits to see whether the requested service is allowed.
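The three tables that those steps thread together can be sketched in C as follows; all structure and field names are illustrative, not those of any particular Unix kernel.

struct inode;                            /* cached descriptor (active i-node), step 2 */

struct open_file {                       /* entry in the system-wide open-file table */
    struct inode *ip;                    /* step 4: pointer to the active i-node */
    long offset;                         /* step 4: the stream's current position */
    int  refcount;
};

struct proc_entry {                      /* entry in the per-process open-file table */
    struct open_file *of;                /* step 5: pointer into the system-wide table */
    int access_bits;                     /* step 5: which services were approved */
};

/* Step 6: the index of the proc_entry is what open() returns to the client. */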

When a client invokes close, the steps listed above are reversed. Each time a table entry is no longer needed, its reference count (for that table) gets decremented — when the access count goes to zero, the entry is deleted, i.e., placed on the free-entry list.

Discuss VFS. Mention that volumes are self-contained and can be moved.

Chapter 12

PROTECTION

The protection system is a mechanism to allow the kernel to mediate attempts by certain clients to obtain certain services from certain servers, including all volume-managed servers. We need the protection system for reasons of:

• security, i.e., to guard against the corruption of and unauthorized access to services and information,

• debugging,

• tracing execution,

• monitoring behavior and performance.

The topic of protection involves questions of:

• Which client is allowed to obtain which services (with which parameter values) from which servers? For these purposes, we consider even a memory location to be a server that offers read and write services.

• What happens when a client requests from a server a service that client is not allowed?

• Who decides what is allowed?

• How are those decisions organized and stored?

• How can those decisions be modified?

Generally the clients are threads and the servers are files.

In discussions about protection, the term "access mode" is often used rather than "service", e.g., read and write are modes via which files can be accessed. Sometimes, "servers" are called "objects," and requesters of services are often called "subjects" rather than "clients."

Protection is a two-level system. At the foundation level is the protection system provided by the underlying hardware, which restricts access to certain memory locations, registers, and I/O ports by trapping all user-mode occurrences of certain instructions and all out-of-range accesses by others. The second level, which consists of the protection aspects of the volumes' access managers, is built upon this lower-level system. For instance, it uses the hardware's memory-protection mechanism to prohibit threads from modifying their own open-file tables, thus preventing them from arbitrarily opening protected files.

12.1 Terminology

Given two sets S and T :

• An ordered pair whose first member is in S and whose second member is in T is called an "S/T pair".

• A capability is a server/service pair.

• An access right is a client/service pair.

• A relation from S to T is a set of S/T pairs. A relation from S to T will also be called an "S-per-T relation". An "S-per-T relation" is identified with the boolean matrix (i.e., two-dimensional array) whose rows are indexed by members of S and whose columns are indexed by members of T, and whose true entries are exactly those corresponding to the members of that relation.

• The T-per-S relation identified with the transpose of a given S-per-T relation is called the converse of the given relation.

• The composition of a given S-per-T relation with a given T-per-U relation is an S-per-U relation. It is sometimes called the T-join of the given relations or the boolean product of those relations.

The boolean product of two boolean matrices is obtained by replacing * with & and + with | in the standard algorithm for matrix multiplication.
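In C, that substitution gives the usual triple loop; the flat 0/1 array representation here is just one convenient choice.

/* Boolean matrix product: replace * with & and + with | in the standard
   algorithm.  Matrices are stored row-major as flat arrays of 0/1 chars. */
void bool_product(const char *a, const char *b, char *c,
                  int rows, int inner, int cols) {
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++) {
            char acc = 0;
            for (int k = 0; k < inner; k++)
                acc = acc | (a[i * inner + k] & b[k * cols + j]);
            c[i * cols + j] = acc;
        }
}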

12.2 The Access-Control Database

Protection data can be conceptualized as a sparse three-dimensional boolean array indexed by clients, servers, and services — it will be called the access array. The array entry corresponding to a particular client-server-service triple is true if and only if that client is allowed to obtain that service from that server, e.g., a particular thread is allowed to read a particular file. Otherwise, that particular client is not allowed that service from that server.1 In Unix file systems, for instance, the clients are threads, the servers are files, and the services of interest are read, write, and execute.

1The above is an over-simplification — in some cases, a service request should be honored only when certain of its parameters are within a certain range. For example, a particular thread might be allowed to read or write only the first two-hundred entries of a given array. Often, we put the burden of parameter checking on the handler for the requested service. (In practice, a large fraction of security gaps result from unanticipated parameter values, e.g., input strings that overflow buffers.)

The access array is sparse in the sense that most of its entries are false. Thus, we need only store the indices of the true entries. In general, absent entries are said to be null. In the case of boolean arrays, null entries are presumed to be false.

It is often helpful to conceptually flatten the access array to a two-dimensional array of sets or to a two-dimensional boolean array, one of whose dimensions has indices that are pairs:

1. access rights per server or

2. capabilities per client or

3. (allowed) services per client/server pair.

The first of these three views makes it clear that access arrays can be implemented by associating with each server a list of clients and their respective access rights to that server, i.e., the services that each client is allowed. Such a list is often called an access-control list (ACL). The second view makes it clear that access arrays can also be implemented by associating with each client a capability list. These two methods of organizing the access array correspond somewhat to the standard techniques for storing a sparse matrix: by row or by column, with each row or column stored as a list.2
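The two organizations can be pictured as linked lists hanging off servers and clients respectively; this C sketch is illustrative only.

struct acl_entry {                  /* stored with the server: its ACL (view 1) */
    int client_id;
    unsigned allowed_services;      /* bitmask of services this client may request */
    struct acl_entry *next;
};

struct capability_entry {           /* stored with the client: its capability list (view 2) */
    int server_id;
    unsigned allowed_services;
    struct capability_entry *next;
};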

12.3 Managing Access Rights

Mechanisms for implementing protection must be concerned with

• compression of the data in the access matrix,

• efficiency of accessing that data,

• manageability, i.e., convenience in implementing policies.

Factoring the access array for manageability. Suppose that you own a manufacturing company. The lock shop has keyed the doors so that certain locks can be unlocked by certain kinds of keys, e.g., custodian keys, executive keys, tech-staff keys, production-staff keys, etc. The administrative staff has issued copies of certain kinds of keys to each employee, e.g., the vice president of engineering might have both a tech-staff key and an executive key, while the vice president of manufacturing might have an executive key and a production-staff key.

If an employee quits, you take away his keys. Then (in principle) he can no longer access the company's facilities. Similarly, if you transfer a room from engineering to production, you simply rekey the door to be openable via the production-staff keys but not engineering-staff keys. If, however, every door had a combination lock instead of a key lock, to transfer the room, you must change its combination and tell each member of the production staff the new combination. Similarly, when an employee quits, you would need to change the combinations on all of the locks to which he had access. The point is that "keys" increase manageability.

2One can also simply store the nontrivial entries of a sparse array in a set, but it is then difficult to remove all entries for a given server or client when that server or client is deleted.

Now suppose that you want to create a spreadsheet that tells who can get into which rooms. From the administrative staff, you obtain a spreadsheet telling which employees have which kinds of keys. From the lock shop, you obtain another spreadsheet telling which kinds of keys open which doors. To tell whether a given employee can open a given door, you scan key-by-key across his row in the keys-per-employee spreadsheet and down that door's column in the doors-per-key spreadsheet to see if there is a key that this employee holds and that opens the given door. When you finish with all employees and all doors, you have produced a doors-per-employee spreadsheet by taking the boolean product of the keys-per-employee spreadsheet with the doors-per-key spreadsheet.

To enhance manageability of the access-control matrix, one can factor it into the product of two matrices by introducing, between clients and capabilities, a set of intermediate entities, called keys, groups, clearances, or domains. One of these two factors is a boolean matrix that tells which clients have which keys. The other is an access matrix telling which keys have which capabilities. The product of these matrices is an access matrix in which a client has a capability if and only if one of the client's keys has that capability. Note, however, that there is overhead involved; finding the right key may take some searching. In database terms, the overall access matrix, which tells the capabilities per client, is the key-field join of the capabilities-per-key and the keys-per-client databases.

One might create a group (i.e., virtual key) for a particular university course. That key might have certain capabilities with respect to files in that course's directory, e.g., read access to the posted assignments, write access to certain homework collection files, and execute access to the various pieces of software for the course. All students enrolled in the course could be added to the group. It would then be easy to take those capabilities away from a student who drops the course — simply remove him or her from that group. It would also be easy to add new capabilities to the course group and to remove old ones without having to modify the entries for say a hundred students.

We can factor the capabilities-per-client matrix several times, introducing another set of intermediate indices each time. For example, there could be matrices giving:

capabilities per lock,

locks per key,

keys per domain,

domains per client.


The overall access matrix (in capabilities-per-client form) would then be the product of these four matrices. A client would have a particular capability if and only if one of the client's domains has a key that fits one of the locks having that capability.

In our original example, suppose that instead of doors in an office building, we are talking about access to fields for cattle grazing. The gates to such fields are commonly secured with a length of chain and a series of interlinked locks so that unlocking any one of them will release the gate. Then to find out who has access to which field, one needs to know who holds which key, which keys open which locks, and which locks open which gates, i.e., we need the boolean product of keys-per-employee with locks-per-key with gates-per-lock. (Note that gates-per-lock is simply the transpose of locks-per-gate.)

Special cases. Implementation is simpler, access is more efficient, and less space is required, whenever one of the factors is:

• many-to-one — one entry per column, stored as a set or list attribute of the column index,

• one-to-many — one entry per row, stored as a set or list attribute of the row index,

• one-to-one — one entry per row and column, stored as an attribute of the client and/or server,

• diagonal — same indices for rows and columns, and off-diagonal entries are null,

• identity — on-diagonal elements are true, and the rest are false.

Many-to-one and one-to-many matrices can be represented by storing the value of the corresponding "one" index as an attribute of the individual "many" indices, or they can be represented as maps. In the above example, locks-per-gate is many-to-one.

Relationship-based keys. To compress access data and make it more manageable, in some cases keys are not held by clients but by client/server pairs. Typically, each key corresponds to a significant relationship, e.g., where the client is the owner of the server. In such a decomposition, the access matrix is the (key/server)-join of a keys-per-pair database and a capabilities-per-key database, i.e., a given client gets a server/service capability if and only if the corresponding client/server pair has a key that has that capability.

Sometimes, the keys are ranked and only the client's first key counts. In Unix, for example, there are three possible keys per client/server pair, corresponding to three special relationships between client and server, {U,G,O}, corresponding respectively to ownership, membership in the owning group, and all other clients (i.e., the complement of the union of U with G). For a given user/file pair, the corresponding category entry is U if the user owns the file, else G if the user is a member of the owning group, else O. The following matrices are involved in determining capabilities per user:

users (domains) per thread — many-to-one boolean,
owned files per user — one-to-many boolean,
groups per user — many-to-many boolean,
owned files per group — one-to-many boolean.

Notice that only one of these matrices is many-to-many, namely groups per user. Users-per-thread is represented as a thread attribute. Owned-files-per-user and owned-files-per-group are represented as file attributes.

This approach facilitates data compression and management, but it has some limitations. In Unix, for instance, it is difficult to give a specific user write access to a particular file. It is even more difficult to deny privileges to specific individuals while allowing the same rights to all others. Often, system administrators would like to prohibit certain users from accessing facilities such as the network, but Unix provides no convenient mechanism for doing so.

Check this.

Ownership. One useful policy is to let the owner(s) of a server determine who has which access rights to it. The initial owner of a server is whichever client invoked the server's constructor, which is sometimes considered to be an operation on the server's class. In some systems, the owner(s) of a server can extend ownership of that server to other clients.

If we treat the access-control list for a server as a server in its own right and give only the owners of that server modification rights to the access-control list, then we have a simplistic policy: a client gets to make any possible modifications to the entries of the access matrix that correspond to servers it owns. Under the capability/lock/key/domain/client model, as matters of policy:

• The owner of a server has the right to add capabilities involving that server to a lock, or to revoke them.

• The owner of a lock can allow or disallow a key to access that lock.

• The owner of a key can give it to domains, or revoke it.

• Each client is in a given domain at any time.

Replication. We can view some keys as having access to the constructor service of certain classes, perhaps even their own class, in which case the key can be replicated. For instance, the TA for a course may not be the owner of the files for that course, but he or she may be given a replicatable key for the files for the course. He or she would then use that key to create and distribute nonreplicatable copies to the students in the class. So, the key's class is a protected server supporting certain services such as replication (construction).

There is an implicit infinite hierarchy of subclasses of the class “key”:

• nonreplicatable keys


• keys that replicate to nonreplicatable keys

• keys that replicate to keys that replicate to nonreplicatable keys

. . .

• hereditarily replicatable keys.

In addition to their defining capabilities as discussed above, keys of any of these classes should be able to create instances of any previous class in the hierarchy.

12.4 Dynamic Protection

Domains. It would be difficult to update the access matrix every time a new client thread is created or destroyed. Instead, by factoring the access matrix on intermediate items called protection domains, we decompose the access matrix into a volatile part, domains per client, and a stable part, capabilities per domain. The static part is implemented via the access managers of the various volumes on which the servers reside. The volatile (dynamic) part is updated every time a client is created, destroyed, or changes domains. Such events may occur many times per second. The stable (static) part is updated every time a server or a domain is created, deleted, or has its access rights changed. Such events happen much less frequently.

Capabilities belong to the domain, and any client in a particular domain potentially has the capabilities of that domain — as we will discuss later, the client may have to explicitly request each capability it will attempt to exercise. In some systems there is exactly one domain per client, stored in the domain attribute of the client descriptor — i.e., the domains-per-client matrix is many-to-one and can be represented via a domain attribute in the client's descriptor. Clearly, this restriction is not necessary; we could allow a client to be in multiple domains, just as a person might carry several rings of keys at a time.

The cached access matrix. Usually, a client thread must petition to get the capabilities that it will attempt to exercise. These are cached in protected kernel-accessible tables, and copies (replicas) of these cached capabilities are inherited by the thread's descendants. The cached capabilities of a Unix thread are stored in the mode bit of its descriptor's PSW field and in the open-file table of its process.3

Petitioning is done via the open() system call or may occur automatically at the first attempt to exercise a particular capability after the server has been opened. The server's volume's access manager grants the petition if and only if the thread's current domain warrants the requested capability on the basis of the volume's portion of the static (capabilities-per-domain) access matrix, discussed above.

3Unlike Unix's fork, Linux's clone allows a parent to decide whether or not the child will share the parent's open-file table. Each thread has its own mode bit, but a thread sharing an open-file table can exercise capabilities cached by any of its fellow sharers.

In Unix, the entries in these lists4 are pointers to entries in the system-wide open-file table, together with two bits telling whether the file is open for reading, for writing, or both.5

Cached capabilities are exercised via the read and write system calls, which, on invocation, check the capability cache to see if the requested service is allowed. The execute capability is not cached; rather, it is checked each time it is exercised. Note that there are other services that are not explicitly mentioned and not cached, e.g., owners can successfully invoke the chmod() system call, which allows them to manipulate access rights to the owned file. In Unix, one user ID (called "root" or "super user") is allowed all services from all servers.

Discuss mask-server approach.

12.5 Changing Domains

In terms of the general model, switching from user mode to privileged mode gives the current thread6 an additional or a larger domain, one that has additional capabilities, namely access to kernel resources, i.e.:

• kernel tables,

• privileged instructions, e.g., I/O instructions,

• memory-mapping tables,

• interrupt vectors and masks.

In Linux, a thread running in a root-owned user-mode process can turn off the trapping of I/O instructions on selected ports via a call to ioperm(). Doing so facilitates high-speed video output from processes such as X-servers and video games. The ability to suppress such traps is architecture dependent and revises the notion of "privileged instruction."7 One can think of this ability as the caching of an implicit capability, since a root-owned thread can ultimately do anything.

A thread is generally allowed to switch or add protection domains under either of two circumstances:

• when strictly decreasing (potential) capabilities, e.g., when returning from a system call or when a superuser thread switches to another user ID in Unix.

4I.e., per-process open-file tables.

5Unix literature refers to references to these lists of cached capabilities as "file descriptors."

6Bear in mind, however, that depending on kernel design and possibly on the underlying hardware architecture, there might be a switch of threads whenever there is such a switch of hardware-level protection mode. Specifically, a pseudo-processor kernel thread stops running the current user thread but continues with its own activities within the kernel.

7Specifically, it embeds the boolean algebra of sets of ports into the hierarchy of hardware protection levels.


• when switching to code that is trusted by the new domain, e.g.:

– invoking a kernel routine adds a new domain (system mode) to the invoking thread but simultaneously shifts control to that kernel routine, which is trusted, of course, within the kernel's domain.

– executing a set-UID program under Unix changes the value of the "effective UID" attribute in the requesting thread's process's descriptor to the UID of the owner of the executable file. When a thread changes domains in this way, it retains the capabilities it has cached so far. The capabilities of the new domain determine whether or not subsequent petitions by threads of this process are granted. For instance, the passwd command is a set-UID command owned by root. By executing it, you can modify the password file on your behalf, even though you don't have write access to the password file.

Exercise. The cached access matrix is almost a subset of the boolean product of the dynamic and static access matrices. Change of domains and shared open-file tables require the word "almost" in that statement. Name at least one other situation that makes it necessary.

12.6 Unix Example

The Unix protection system is relationship-based. The clients are processes. The domains are users. And the servers are files. The categories (i.e., keys) are {U,G,O}, corresponding respectively to ownership, membership in the owning group, and all other clients. The capabilities-per-relationship matrix is stored in per-file access lists consisting of nine bits that tell which relationship is allowed which service (read, write, and/or execute) from the file.

Define an else operation on sparse matrices.

A thread's descriptor includes the following attributes, which are inherited from the thread's parent:

• UID, the user ID number of the thread’s owner.

• GID, same as above but giving the thread’s group. (Really a list.)

• EUID, the effective user ID — changes when a set-UID program is executed and is used in determining access privileges. Used to set ownership of files created by the thread. Also used by passwd and su in authenticating users. Can be changed by the thread itself if and only if its UID is that of the super-user, i.e., UID zero. (Also, there is a system call that swaps the EUID with the UID.)

• EGID, the effective group ID — a list of groups used in determining access privileges. Used to set group ownership of files created by this thread. When a set-GID program is executed, its GID is added to this list.


• UMASK, the protection mask for default protections of files created by this thread: the complement of UMASK is ANDed with the requested permissions to initialize the permissions on a new file. (Initialized to zero by login and changed via umask(), usually by the shell's configuration file.)

• CD, the current directory of this thread. (Changed via chdir(), a system call that allows the change only if the thread has execute access to the target directory.)

• ROOT, the directory at which file lookups start. (Changed by chroot(). Can be used to confine, i.e., "sandbox," a given thread's effects.)

Find the paper on breaking out of a chroot prison.

The descriptor (i-node) of a Unix file includes the following protection-related attributes:

• the owning UID, which gets set from the UID of the creating thread and gets changed via chown(),

• the owning GID, which gets set from the GID of the creating thread and gets changed via chgrp(),

• a set of nine bits telling which of the three services (read, write, execute) are allowed to three classes of users: the owner, members of the owning group, and everybody else (initialized to the complement of UMASK ANDed with the services specified by the creating thread and updateable via chmod()),

• a set-UID bit telling whether to change the EUID of threads executing this file to that of the file's owning user (initialized to off and updateable via chmod()),

• a set-GID bit telling whether to add the file's owning group to the thread's EGID (initialized to off and updateable via chmod()).
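A C sketch of how those nine bits and UMASK come into play; the helper in_group() and the exact encodings are illustrative stand-ins, not the actual kernel code.

extern int in_group(int egid, int file_gid);    /* hypothetical group-membership test */

/* Pick the U, G, or O triple of bits and test the requested service,
   where service is 4 (read), 2 (write), or 1 (execute). */
int allowed(int euid, int egid, int file_uid, int file_gid,
            int mode_bits, int service) {
    int shift;
    if (euid == file_uid)              shift = 6;   /* owner (U) triple */
    else if (in_group(egid, file_gid)) shift = 3;   /* owning-group (G) triple */
    else                               shift = 0;   /* others (O) triple */
    return ((mode_bits >> shift) & service) != 0;
}

/* At creation time: requested permissions ANDed with the complement of UMASK,
   e.g., 0666 & ~022 == 0644. */
int initial_mode(int requested, int umask_bits) {
    return requested & ~umask_bits;
}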

There is also a file /etc/group that lists the members of each group, and a password file /etc/passwd whose entries have six relevant fields:

• user name

• encrypted password

• user ID number

• default group ID number

• home directory

• default startup program (shell)


The first two of these fields are used by the root-owned set-UID programs, login and su, in authenticating users. Specifically, login prompts for a user name, which it looks up in the password file, and resets its own EUID accordingly. It then prompts for that user’s password and encrypts the response. If the encrypted response matches the encrypted form of the EUID’s password as stored in the password file, login then forks a child that execs the default startup program, which is usually a shell. The thread’s process’s UMASK is initialized to zero by login but usually gets re-configured by the shell’s startup file via the umask command, which invokes the umask() system call.8
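A hedged sketch of that authentication step, using the traditional getpwnam() and crypt() interfaces, might look as follows. Modern systems keep the hash in /etc/shadow rather than in the pw_passwd field, so this is schematic only.

#include <pwd.h>
#include <unistd.h>
#include <crypt.h>      // on some systems crypt() is declared in <unistd.h> instead
#include <cstring>

bool authenticate(const char* user, const char* password) {
    struct passwd* pw = getpwnam(user);           // look the user up in the password file
    if (pw == nullptr) return false;              // unknown user name
    const char* hashed = crypt(password, pw->pw_passwd);   // the salt is embedded in the stored hash
    return hashed != nullptr && std::strcmp(hashed, pw->pw_passwd) == 0;
}
// On success, login would set the UID/GID from pw->pw_uid and pw->pw_gid,
// chdir to pw->pw_dir, and exec pw->pw_shell.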

Exercise. If I run a set-UID program that you own and that creates a file, who owns that file?

12.7 Windows NT Example

Under Unix, setting up access rights for collaborative efforts requires the involvement of a super-user. In Windows NT, there is a more flexible structure for the administration of protections, one that involves access-control lists. Also, the Exchange system, instead of mailing copies of documents, mails capabilities for accessing them. That way, many users can share various sets of documents.

Expand on this. Also, include material on policies, e.g., least privilege, and material on the take/grant model. (See Singhal’s book.)

12.8 NSA’s SELinux

Apparently NSA’s SELinux compartmentalizes breaches of security, thus alleviating the need to immediately update user and system applications to prevent a full system compromise — patches and updates can be applied when convenient. According to NSA’s abstract:

The security architecture of the system is general enough to support many security policy abstractions. The access controls in the implementation currently support a combination of two, type enforcement and role-based access control. This combination was chosen because together they provide powerful tools to construct useful security policies. The specific policy that is enforced by the kernel is dictated by security policy configuration files which include type enforcement and role-based access control components.

The type enforcement9 component defines an extensible set of domains and types. Each process has an associated domain, and each object has an associated type. The configuration files specify how domains are allowed to access types and to interact with other domains. They specify what types (when applied to programs) can be used to enter each domain and the allowable transitions between domains. They also specify automatic transitions between domains when programs of certain types are executed. Such transitions ensure that system processes and certain programs are placed into their own separate domains automatically when executed.

8To avoid the need to store encrypted forms of passwords and/or transmit them over the server’s communication links, the challenged device/user could instead simply prove that it knows, say, the private key associated with a given publicly known key by decoding a random one-time message. In such a case, the public key, which everyone already knows and which acts like a user name, would be stored on the server and transmitted over the server’s communication links. But not so the private key, which acts like a password.

9Type Enforcement is a registered trademark of Secure Computing Corporation.

The role-based access control component defines an extensible set ofroles. Each process has an associated role. This ensures that systemprocesses and those used for system administration can be separatedfrom those of ordinary users. The configuration files specify the setof domains that may be entered by each role. Each user role has aninitial domain that is associated with the user’s login shell. As usersexecute programs, transitions to other domains may, according tothe policy configuration, automatically occur to support changes inprivilege.

Two papers provide background information for the project:

• “The Inevitability of Failure: The Flawed Assumption of Security in Modern Computing Environments” explains the need for mandatory access controls in operating systems.

• “The Flask Security Architecture: System Support for Diverse Security Policies” describes the operating system security architecture through its prototype implementation in the Fluke research operating system.

For project information see http://www.nsa.gov/selinux/, and for more published papers see http://www.nsa.gov/selinux/docs.html.

12.9 Rejection of Access Requests

We must also consider the question of what to do when there is a violation of protection. Common answers include:

• Let the owner of the server provide a service that gets invoked.

• Let the systems designer decide via a similar mechanism.

• Let the violating thread decide via a call-back (e.g., signal handler).

• Abort the request.

• Print a warning.

• Deny the access and continue by throwing an exception or returning a special value.


All of these answers are reasonable, and they are not mutually exclusive. For example, the first and the third can be combined: the owner may provide a routine, and that routine may signal the violating thread, print a message, and so on. The systems designer may provide some default routines. There is another interesting answer to the question, however.

Definition: Virtualization is the act of substituting, for access to a protected server, similar access to another (virtual) server.

Spooling a printer is an example of virtualization. When one tries to write to the printer, one usually writes instead to some special file. The spooling daemon collects these files and writes them to the real printer in some controlled fashion. In such a case we are substituting access to a sharable device, the disk, for similar access to a nonsharable device, the printer. The daemon eventually accomplishes the original access on behalf of the thread.

Similarly, virtual-memory systems substitute read or write access to a location other than the one mentioned in an instruction’s operands.

12.10 Virtual-Machine Systems

An important example of virtualization is found in virtual-machine monitors (VMMs), introduced by IBM for its 360/67 and later adapted to its 370 series. (See [CR].) A VMM gives a user the illusion that his terminal is the operator’s console of his own private computer. For example, when the user’s virtual computer sends a message to the operator console, it invokes some privileged instructions, producing a privilege fault that invokes the VMM, which intercepts privilege-faulting instructions and virtualizes them to the user’s terminal rather than sending a message to the real system operator.

Such a system is useful for several reasons:

• It gives users almost complete isolation from each other.

• One can develop new (versions of) operating systems and test them without the risk and inconvenience of crashing the underlying system.

• Each user on a shared system can run the operating system of his choice without the need for other users to run the same system.

• Performance studies can be facilitated.

We will cover VMMs in detail because of the insight they offer into the foundations of protection.

The state of a machine is the content of memory and all registers. An instruction is a transformation on machine states.

An emulator for a machine is a program that repeatedly transforms a representation of a state of the machine into the representation of its successor state and performs the appropriate input and output. Most current workstations support emulators for PCs to allow users to run Windows software. Emulators usually consist of a loop containing a large switch statement with a case for each op code. Each time through the loop, the program counter is updated and another instruction is interpreted. Typically, an emulator involves a factor of three to thirty in overhead.
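The shape of such a loop can be sketched as follows; the Machine record, the op codes, and the instruction encoding are invented purely for illustration.

#include <cstdint>
#include <vector>

struct Machine {
    std::vector<uint32_t> mem;            // the emulated machine's memory
    uint32_t pc = 0, acc = 0;             // program counter and accumulator
    bool halted = false;
};

enum Op { LOAD = 0, ADD = 1, STORE = 2, HALT = 3 };

void step(Machine& m) {
    uint32_t insn = m.mem[m.pc++];        // fetch and advance the program counter
    uint32_t op = insn >> 24, addr = insn & 0xFFFFFF;
    switch (op) {                         // one case per op code
        case LOAD:  m.acc = m.mem[addr];  break;
        case ADD:   m.acc += m.mem[addr]; break;
        case STORE: m.mem[addr] = m.acc;  break;
        case HALT:  m.halted = true;      break;
        default:    m.halted = true;      // a real emulator would trap an illegal op code
    }
}

void run(Machine& m) { while (!m.halted) step(m); }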

An emulator can trace the execution of a program, instruction by instruction, and report the behavior of the program and the content of various memory locations and registers. Such tracing emulators are called debuggers.

It is possible to write an emulator for a given machine in almost any language, and then compile and run that emulator on the machine itself. Such an emulator is said to be self-resident. There are a couple of reasons for running a program under a self-resident emulator. First, one may want to run the program under a debugger, for debugging or for reasons of pedagogy. Second, one may want to test programs that require running in privileged mode, such as operating systems, in an environment where they get only a simulated version of privileged mode and cannot crash the system.

Many machines have a debug mode in which there is a trap to special software after each instruction. While this does not appreciably diminish emulation overhead, it facilitates a simpler self-resident emulator, since the underlying hardware updates the program counter and directly executes each instruction (eliminating the large switch statement).

Some emulators are sped up via dynamic translation, wherein each segment of code is translated the first time it is executed, so that subsequent execution of those same instructions incurs almost no overhead. It is natural to ask, why not use static translation? The problem has to do with self-modifying code, indirect jumps, and, as we shall see, sensitive instructions.

Notice that emulators stay in complete control of the running of the underlying program. In fact, those described so far intervene before and after each instruction, simulating the input, output, and memory-management behavior of real hardware.

A virtual machine monitor (VMM) is an efficient, controlling, self-resident emulator — often the emulated machine has somewhat smaller memory than the host. It would, therefore, be an OS, except that it does not control or provide programs with access to logical resources. A machine is virtualizable if and only if it supports a virtual machine monitor.

The word “efficient” is meant to exclude simple software simulators that interpret every instruction. It requires that all but a statistically small fraction of the instructions run without software intervention.

The word “controlling” means that the interpreter stays in control, i.e., any attempt by the simulated machine to increase its capabilities (memory map and protection mode) must be intercepted by the VMM. The simulated machine must think it is in control, but it must not be — an illusion of control must be created for it.10 By contrast, a loader is a program that runs programs intended for the underlying architecture, but does so in a way that totally relinquishes control of the running of the program.

10One might think of a political analogy: Imagine a society whose citizens think they have democracy but which is really a dictatorship; or one where the president thinks he is in control, but the bureaucrats have virtualized his control, giving him false reports and only pretending to carry out his commands.

The key ideas behind the development of a VMM are the following:

• The VMM runs in user mode and gets control (perhaps from an underlying OS) whenever a protection trap occurs (due to either a memory or mode protection violation).

• Programs running on a virtual machine never make system calls to the VMM. Instead, they get services from the VMM by violating privilege or memory protection, i.e., trying to access protected resources, thereby invoking the VMM through the ensuing protection trap.

• Despite the fact that the VMM runs in user mode, the simulated machine can be in system mode, or rather it can think that it is.

• For such complete deception to be possible, it is necessary that every instruction that behaves differently in user mode than in system mode trap in user mode. For example, the instruction that reads the PSW must be a privileged instruction and must trap in user mode. Otherwise, a program running on a virtual machine could try to put the virtual machine into system mode, then read the PSW and find that it was still in user mode. Thus, the behavior of the virtual machine would differ from that of real hardware.

• A VMM is different from a standard emulator in that all nonprivileged instructions are executed directly, and only the privileged instructions are interpreted by software. The VMM does not get control before and after each instruction — just at the privileged ones, and that is enough.

• Each separate virtual machine is a different thread on the underlying hardware or OS.

12.10.1 Virtualizability of Third-Generation Architectures

For the purposes of this discussion, based on that of Popek and Goldberg [PG], a third-generation computer is a random-access computer with a simple protection system consisting of a mode bit, a base (relocation) register, and a bound register. A state for such a machine has the following format:11

struct {
    word mem[maxmem];
    struct {
        enum { privileged, user } mode;
        void* pc;
        void* base;
        void* bound;
    } PSW;
}

11Actually, we are ignoring I/O effects, but these can be represented as contents of memory locations.

An instruction is a mapping from states to states. We assume the existence of two special addresses, old and new. An instruction is said to trap when applied to a state S if and only if, in the resulting state S’, PSW has been stored into mem[old] and received a new value from mem[new], and the rest of memory has been left unchanged, i.e.,

S’.PSW = S.mem[new]

S’.mem[old] = S.PSW

S’.mem[i] = S.mem[i], for all other values of i

For any state S, the sequence of locations mem[base+0], mem[base+1], ..., mem[base+bound] is called the accessible portion of memory. Two states are said to be similar if and only if they have the same values for mode, pc, bound, and accessible memory, i.e., they differ only in the content of base and/or inaccessible memory.
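For concreteness, the similarity relation can be rendered as a predicate over a simplified state record (integer base and bound fields rather than the void* fields above); this sketch is mine, not the book’s.

#include <cstddef>

const std::size_t maxmem = 1 << 16;      // an arbitrary memory size for the sketch
typedef unsigned long word;

struct State {
    word mem[maxmem];
    struct { enum Mode { privileged, user } mode; std::size_t pc, base, bound; } PSW;
};

bool similar(const State& s, const State& t) {
    if (s.PSW.mode != t.PSW.mode || s.PSW.pc != t.PSW.pc || s.PSW.bound != t.PSW.bound)
        return false;
    for (std::size_t i = 0; i <= s.PSW.bound; ++i)           // compare accessible memory
        if (s.mem[s.PSW.base + i] != t.mem[t.PSW.base + i])
            return false;
    return true;   // base and inaccessible memory are allowed to differ
}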

An instruction is said to be:

• location sensitive if it might allow a program to determine where accessible memory is located relative to real memory or to modify inaccessible memory, i.e., there are similar states where it doesn’t trap and it yields nonsimilar results.

• mode sensitive if it might allow a program to determine what mode it is running in, i.e., there are states that differ only in the content of mode where it yields results that differ in some other respect.

• control sensitive if it might allow a program to increase its privileges, i.e., in some state where it doesn’t trap, it increases the privilege level (i.e., changes mode from user to privileged) or it makes some previously inaccessible location accessible (by modifying base and/or bound).

Each kind is said to be sensitive.

A VMM must virtualize all sensitive instructions, which is easy to do if we can get them to trap.

Theorem (Popek and Goldberg): A third-generation architecture is virtualizable if every sensitive instruction is privileged, i.e., traps on all states whose mode has the value user.

Proof: It is assumed that the machine starts out with PSW.base = 0, PSW.bound = maxmem, and PSW.mode = system. The VMM has an entry point called dispatch, and it begins by setting mem[new] to

< system, dispatch, 0, maxmem >


to cause all traps to come to dispatch. To begin direct execution of the instructions of the simulated machine, the VMM then sets the real PSW to

< user, TGM.PSW.pc, TGM.PSW.base, TGM.PSW.bound >

where TGM is a variable of type state that is used to keep track of the state of the virtual machine. Of course, before any of this, we must initialize TGM.mem with the program that is to run on the virtual machine.

That program runs directly on the underlying hardware until it performs a sensitive instruction, at which time a trap occurs, dispatching the VMM interpreter. The interpreter first updates the PSW.pc field of TGM from mem[old].pc. It then checks TGM to determine which virtual instruction caused the trap and sets TGM to the state that would have followed its current value had a real machine executed this instruction. It then resumes direct execution as described above.
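A highly simplified, compilable rendering of this dispatch step might look as follows; the PSW and VMState records and the stubbed-out helpers stand in for the real trap frame, the table-driven interpreter, and the return to direct execution.

#include <cstddef>

struct PSW { bool user_mode; std::size_t pc, base, bound; };
struct VMState { PSW psw; /* plus TGM.mem[] in the full version */ };

static VMState TGM;                      // tracked state of the virtual machine
static PSW mem_old;                      // stands for mem[old], filled in by the trap

static void emulate_sensitive(VMState& vm) {
    // Table-driven interpretation of the one instruction at vm.psw.pc would go
    // here; for the sketch we just step past it.
    vm.psw.pc += 1;
}

static void resume_direct_execution(const VMState& vm) {
    // Real code would load the hardware PSW with <user, vm.psw.pc, vm.psw.base, vm.psw.bound>.
    (void)vm;
}

void dispatch() {                        // every trap vectors here via mem[new]
    TGM.psw.pc = mem_old.pc;             // bring the virtual pc up to date
    emulate_sensitive(TGM);              // virtualize the instruction that trapped
    resume_direct_execution(TGM);        // and drop back into direct execution
}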

Since there are only five fundamental components in a machine state, each of which can take only a finite number of values, one can construct a table that characterizes the behavior of any instruction. Since there are only a finite number of instructions, the entire instruction set can be stored as a large lookup table.

The result of a nonsensitive instruction on the simulated machine is equivalent to what it would be on the real machine, by definition. The only differences between the simulated machine and a real one of similar memory size are the values in PSW.mode and PSW.base. But all instructions that are sensitive to these are privileged. Actually, they may also have different values in PSW.bound, but we have assumed that the simulated machine has smaller memory.

The result of sensitive instructions is appropriate if the correct table is used. So, by induction on the number of instructions executed, such a VMM works properly on all states.

QED

Discussion. One of the most common violations of the hypotheses of the theorem is the existence of an instruction that reads the processor-status word, PSW. Such an instruction is obviously mode sensitive. On the other hand, it appears harmless enough that it is usually not privileged. Unless it is always intercepted in user mode, however, the VMM cannot maintain the illusion of privileged-mode operation by the virtual machine.

A slight generalization that allowed the virtual machine to produce states “essentially equivalent” to those of the real hardware, rather than identical ones, would allow this problem to be eliminated by a change in the architecture manual. The ability to read the PSW is not an important ability; one can keep track of its content by writing the same value into a variable each time the PSW is loaded. So, the architecture manual could be changed to stipulate that all attempts to read the PSW return garbage. Appropriate modifications to the model above would lead to a generalization of the theorem.


The protection state of a machine is the value of PSW.mode, PSW.base, and PSW.bound. Note that in a virtual machine there must be some collapsing of real protection states. It cannot ever really be in the most privileged state. It is confined to the rest, but it thinks it has a full range of states available. So, in reality some of these virtual states are collapsed into a single real state. Any attempt by the virtual machine to shift between collapsed states must be intercepted by the VMM, so that this field of TGM can be updated and the VMM can properly simulate behavior for future states.

Note that there is no need to intercept transitions among noncollapsed states, since these states will be recorded in the PSW of the real machine and stored into mem[old] on a trap. For example, in the above two-mode model, we identify any two protection states that differ only with respect to the value of PSW.mode. It is, therefore, possible for us to allow untrapped transitions among states that are not collapsed. In particular, we can allow untrapped decreases in PSW.bound.12

It is not necessary to trap sensitive instructions in order to virtualize them. Some architectures, like Intel’s 386 and its descendants, have nonprivileged mode-sensitive instructions, and hence do not fulfill the hypotheses of the Popek-Goldberg Theorem.

Locus Computing developed a product called Merge that runs under SCO Unix on an x86. It allows Windows 95 to be run on its own virtual x86 system. To accomplish this feat, Locus uses a slightly modified version of the Windows 95 binaries, wherein all nonprivileged sensitive instructions have been turned into illegal instructions, whose trap handlers virtualize the originals.

VMware apparently uses an on-the-fly translation scheme that translates nonsensitive instructions into themselves but translates sensitive instructions into code that virtualizes them. The scheme is similar to the on-the-fly translation schemes that allow x86 binaries to run on RISC architectures. The DEC segment of Compaq has such a product, called FX!32, and there is a similar product for Apple’s systems.

12.10.2 Nested Virtual Machines

One can run a virtual machine monitor on a virtual machine. In fact, one can nest virtual machines upon virtual machines upon virtual machines. Obviously, there is overhead in doing so.

Problem: Determine the minimum number of traps that occur on the real hardware when a privilege trap occurs on an n-th level virtual machine.

Solution: Assuming that hardware is at level zero, the answer is (k + 1)^(n−1), where k is the number of privileged instructions executed by the privilege-fault handler in the VMM.

12While we have concentrated on simple base-and-bound memory management, everything said in this regard translates in a straightforward way to paged virtual-memory systems.


In practice, k can often be kept to one, as in the case of the theoretical machine developed by Popek and Goldberg. The privileged instructions here are usually just those required to restore the system to user mode and to the appropriate memory protection state when resuming execution of the virtual machine. Most architectures have a special return-from-trap instruction.

Lemma: Any attempt to execute a privileged instruction at level i causes (k + 1)^i attempts at execution of privileged instructions at level zero.

Proof: The lemma is visibly true when i = 0. By way of induction, we assume that the lemma holds at level i − 1, i.e., that each attempt to execute a privileged instruction at level i − 1 causes (k + 1)^(i−1) attempts at execution of privileged instructions on the level-zero machine.

Suppose one attempts to execute a privileged instruction at the i-th level. Regardless of the protection mode of the level-i virtual machine, this will cause a trap on the machine at level i − 1, and handling this trap will cause the execution of k more attempts to execute privileged instructions, usually those that restore the level-(i − 1) machine to a more restricted protection state as it resumes execution of the level-i machine. By the induction hypothesis, each of these k + 1 attempts at execution of privileged instructions on the level-(i − 1) machine will cause (k + 1)^(i−1) privilege traps on the level-zero machine, for a total of (k + 1)^i on the level-zero machine. QED.

Proof of solution: Any privilege fault on the level-n machine involves an attempt to execute a privileged instruction on that machine. By the above, it will cause the attempted execution of (k + 1)^(n−1) privileged instructions on the level-one machine since, as far as the level-one machine is concerned, the level-n machine is at level n − 1. Each of these will involve exactly one privilege trap on the level-zero machine. QED.

Discussion. Perhaps the best way to visualize this proof is to notice that an attempt to execute a privileged instruction at the top level involves k + 1 attempts to execute privileged instructions at the next level, the second from the top — namely, the original attempt that caused the fault plus k more involved in handling that fault.

Each of these, in turn, causes k + 1 attempts at the next level, the third from the top, for a total of (k + 1)^2 attempts at that level. Each of these will generate k + 1 attempts, for a total of (k + 1)^3, at the next level, the fourth from the top.

At level one, the n-th level from the top, there will be (k + 1)^(n−1) attempts at execution of privileged instructions. Each of these will cause exactly one privilege fault on the underlying hardware.
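The counting argument can be checked mechanically. The following small program (mine, not the book’s) computes the lemma’s count recursively; for a privilege fault on a level-n machine it prints attempts(n − 1, k), which matches the closed form (k + 1)^(n−1).

#include <cstdio>

long attempts(int i, int k) {
    if (i == 0) return 1;                    // level zero: the attempt itself
    long total = 0;
    for (int j = 0; j < k + 1; ++j)          // the original attempt plus k more in the handler
        total += attempts(i - 1, k);         // each is an attempt one level down
    return total;
}

int main() {
    int k = 1, n = 4;
    // privilege fault on a level-n VM -> attempts(n-1, k) traps on the real hardware
    std::printf("%ld traps at level zero\n", attempts(n - 1, k));   // (k+1)^(n-1) = 8
    return 0;
}

With k = 1, the count doubles with every added level of nesting.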

This situation can be diagrammed as a (k + 1)-way tree of height n. Each node corresponds to an attempt to execute a privileged instruction on a particular virtual machine. The leaves correspond to attempts on the level-zero machine, the real hardware. The parents of the leaves correspond to attempts on the level-one virtual machine. The root corresponds to the occurrence that started it all on the level-n virtual machine.

Note that any occurrence of a privileged instruction on a virtual machine is an attempt, not only on that machine, but on each lower-level machine. The left child of a node corresponds to the same instruction viewed as an attempt by the machine at the next-lower level. The parent always causes a fault on the lower-level machine, and the k siblings to the right of that left child correspond to privileged instructions occurring in the handling of this fault on the lower-level machine.

To determine the attempts corresponding to a particular leaf, go up its branch until encountering a node that is not a left child. That node and its left descendants all the way down to the leaf (at the hardware level) are the attempts corresponding to a particular occurrence of a privileged instruction.

Leaves that are not left children correspond to level-zero occurrences. They do not cause faults, provided that the trap handler runs in privileged mode.

Each fault transfers control to the real hardware. Service of the faults completes in post order, i.e., bottom-up and left-to-right.

Visiting a left child corresponds to beginning the processing of the trap by the machine at that level, i.e., invoking its trap handler. Visiting a node that is not a left child corresponds to resuming execution of that handler after processing a trap caused by its attempting execution of a privileged instruction.

In a VMM most of the virtualization is done by the software routines that trap a program’s attempts to access protected resources, but this involves a lot of overhead. Of course, memory is virtualized by the relocation hardware — each access to a memory cell is virtualized (mapped) to another location in memory. This involves less overhead. One of the more difficult things to properly virtualize is a memory-mapped display screen, since it requires hardware virtualization to keep down overhead.

Occasionally it is possible to avoid the double handling of certain traps, like divide by zero, by having the underlying hardware vector such traps directly to the trap handler of the virtual machine.

12.10.3 A Virtualizable Architecture

It is possible to extend any CPU architecture to a virtualizable one with the addition of external hardware that we will call a virtualization-management unit, VMU. We assume that the original CPU has in and out instructions for doing input and output, respectively, but the VMU works for machines that use memory-mapped I/O as well.

The address bus goes through the VMU, and the VMU has four ports on the I/O bus, one for each of its four registers:

• mode tells the current protection mode for the VMU (user or privileged) — a privilege violation occurs whenever there is an I/O operation in user mode;

• bound is compared to the address received from the CPU — a bounds violation occurs whenever that address exceeds the value of bound in user mode;

• base contains a value added to each address in user mode;

• address holds the last I/O address received from the CPU in user mode.

The values of these registers can be set and read via I/O instructions. All interrupts and external faults put the VMU into privileged mode. Traps do not.

The VMU itself interrupts the CPU whenever a bounds or privilege violation occurs. These are non-maskable, highest-priority interrupts. Violating I/O instructions have no other effect. Bounds-violating instructions may run to completion (rather than faulting in the middle): violating writes leave memory unchanged, while violating reads get garbage.
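The VMU’s per-access behavior might be sketched as follows; this is not from the text, and the register widths, the Violation codes, and the class layout are invented for illustration.

#include <cstdint>

struct VMU {
    bool user_mode = false;        // the mode register
    uint32_t base = 0, bound = 0;  // relocation and limit registers
    uint32_t address = 0;          // last user-mode I/O address seen

    enum Violation { NONE, BOUNDS, PRIVILEGE };

    // Translate a memory address, reporting a bounds violation in user mode.
    Violation translate(uint32_t addr, uint32_t* out) {
        if (!user_mode) { *out = addr; return NONE; }
        if (addr > bound) return BOUNDS;       // interrupt the CPU; the access has no effect
        *out = base + addr;                    // relocate user-mode addresses
        return NONE;
    }

    // Any in/out instruction in user mode is a privilege violation.
    Violation io(uint32_t addr) {
        if (!user_mode) return NONE;
        address = addr;                        // remembered for the violation handler
        return PRIVILEGE;
    }
};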

For ease of programming, the VMU has three entries in the interrupt/fault vectoring system of the CPU: one for privilege violations resulting from in operations, another for privilege violations resulting from out operations, and a third for bounds violations.

It is not necessary that mode, bound, and base be readable, since software can record their values in local variables each time it sets them.

In the general case, the violation-service routines must find the violating instruction and analyze it for source or destination — how this is done depends on the CPU architecture. Here, we will assume that in and out are restricted to work to and from a special location. The address register is provided so that in such cases there is no need to know instruction formats and to analyze violating instructions. This register is not really needed but can be a programming convenience.

There is a problem with fault returns in the proposed architecture. One must reset the mode before returning, but as soon as one goes into user mode, memory mapping will relativize the program counter to an unanticipated location. This can be handled by letting the setting of the mode register have an effect that is delayed a certain number of clock cycles, or by detecting an occurrence of a return-from-interrupt instruction and using it to time the mode switch.

Note that the combination of the original CPU plus the VMU constitutes an extended CPU architecture satisfying the Popek-Goldberg Virtualizability Theorem criterion. The only mode-sensitive instruction is in, which is privileged. The only control-sensitive instruction is out, which is also privileged. If we assume that memory is of the standard, homogeneous sort, then no instruction is location sensitive in the sense that its effect changes with the value of base (even if some of the CPU’s registers are memory mapped).


Exercise. Show that in the case of memory-mapped I/O this hardware allows virtualization of the CPU even though the extended CPU does not satisfy the Popek-Goldberg criterion.

Exercise. Show that if an operating system has the ability to forward privilege and bounds violations to the violating thread as signals, and has a system call that a thread can use to restrict its address space, then it is possible for a user-mode program running under that operating system to support a virtual machine monitor.

Exercise. How would the use of the VMU differ from that given above if it used paging for memory management rather than the simple base and bounds registers?

Add stuff on security policies from Singhal, e.g., the least-privilege principle.

Chapter 13

SYSTEM ADMINISTRATION

Mention ugu.com for Unix gurus.

System administration involves the creation, naming, protection, and configuration of certain system-managed servers:

• daemons (system threads: mail, talk, finger, networking, automounting, crontab, port mapper, DNS, NIS, NFS),

• users,

• commands,

• device and special data files (configuration files, passwd, group, event logging, etc.).

Although one can put rather good hardware and software onto a desktop for about $500 per year, the administration of that system and its servers costs two to ten times more. Suppose that the cost of a system administrator, with benefits and overhead, is on the order of $100,000 per year. Each administrator can administer between 20 and 200 machines, which gives an annual cost per machine between $500 and $5,000. An International Data Corp. [KO] survey of 3,000 companies in 1996 indicates that it costs on average $5,714 per year to keep one PC running:

• $3,545 for operations staff,

• $1,143 for applications development staff,

• $545 for client hardware and software,

• $210 for system design,

• $139 for the training of end users,

• $66 for installation, and



• $66 for training of IS staff.

There has been a tendency in the Unix community to produce very powerful tools with elaborate configuration options. Whenever anyone wants a feature, the standard response is “that’s just a matter of configuration.” The situation is reminiscent of the situation thirty years ago, when system designers would dismiss software as trivia: “That’s just a matter of software.” We see that the per-desktop cost of configurationware now exceeds that of software and hardware combined. It is important that such administration costs be lowered.

Linux has established a special naming convention for the system-managed files. There is an Internet standard for port assignments.

Discuss init, crontab, sockets, event logging, backups, etc.

BACKGROUND

Where to start

Configuring a kernel

Policies and Politics

PROCESSES

Booting up and shutting down

Controlling processes

Daemons

Periodic processes

Logging (syslog and log files)

USERS

Rootly powers

Adding new users

FILES

The file system

NFS (Network File System)

Sharing system files (including NIS)

DNS (Domain Name System)

DEVICES

Devices and drivers

Adding a disk

Serial devices

Printing and imaging

NETWORKING

Network hardware

TCP/IP and routing

SLIP and PPP

UUCP

The Internet

Electronic mail


Netnews

OVERSIGHT

Security

Backups

Accounting

Disk-space management

Performance analysis

Network management

Hardware maintenance

Troubleshooting


Appendix A

Interthread Communication

The threads of a given process share an address space and can communicate via shared variables that are protected by locks. Threads in different processes normally do not have the luxury of shared memory. To communicate information (e.g., parameters) to actions performed on other stacks, we need a mechanism to pass messages from one thread to another, including threads within other processes and possibly on other machines.1

The following is one possible low-level message protocol; a sketch of its interface as a set of declarations appears after the list of primitives. Its messages consist of a fixed-length block. At any point in time, a thread has a queue of incoming messages kept in an unspecified order. To keep buffering requirements manageable, we can record the number of unconsumed messages sent by each thread in its descriptor and restrict each thread to have a bounded number of unconsumed messages outstanding at any time, but this protocol requires some form of reply message.

send(Thread, Message) sends the message to the specified thread and returns special values for errors, e.g., nonexistent-thread and insufficient-buffering.

pending() tells whether the caller has any unreceived messages in its incoming-message queue.

receive(Buffer) dequeues the caller’s first incoming message and reads it into the specified buffer. Invoking receive when the queue is empty yields undefined behavior.

install(Procedure) installs the specified procedure as the thread’s message-arrival handler, which is invoked whenever a message arrives and the thread’s message-arrival event is enabled. The occurrence of a message-arrival event disables that event until it is re-enabled. There is a pointer to the installed handler in the thread’s descriptor. Initially, each thread has a default handler that merely returns. System messages can specify their own handlers — this takes care of things like aborts, alarms, etc.

1We assume that threads are known to the system. If the system only knows about processes, then a message sent to a process goes to its current thread or its initial thread.

wait() causes its caller to wait (on a special condition in a special monitor in its own thread descriptor) until a message arrives, whereupon the message handler will be invoked and eventually return to the next instruction after the invocation of wait.

status(Status) returns its caller’s current message-arrival event status (enabled or disabled) and sets a new status specified by the Status parameter.
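Rendered as a set of C++ declarations, the interface might look as follows; Message, ThreadId, Handler, and the error conventions are assumptions of this sketch, and the implementation is left to the exercise below.

#include <cstdint>

typedef int ThreadId;
enum Status { DISABLED = 0, ENABLED = 1 };
struct Message { uint8_t body[64]; };        // a fixed-length block, format as described below
typedef void (*Handler)(const Message&);

int    send(ThreadId to, const Message& m);  // 0, or nonexistent-thread / insufficient-buffering
bool   pending();                            // any unreceived incoming messages?
void   receive(Message* buffer);             // dequeue the caller's first incoming message
void   install(Handler h);                   // set the per-thread message-arrival handler
void   wait();                               // block until a message arrives and is handled
Status status(Status new_status);            // read the old event status, set a new one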

The following is a possible message format:

• source: thread id number

• time: time of transmission

• system index: pointer to response routine

• kind: integer index of the kind of message

• body: byte string.

These primitives can be used to implement several different schemes for message passing. In this scheme, threads are the only entities that have an identity. Mailboxes, for instance, can be viewed as special threads that queue up messages internally and divulge them to other threads on receipt of delivery-request messages. Similarly, thread ports can be emulated by special child threads that forward messages to their parent thread. We can implement blocking message passing simply by waiting for a reply message.

If all of a thread’s incoming messages are from one and only one thread, then those messages can be viewed as a bit stream, on which one can layer any higher-level protocol. Doing so, however, requires that queues be maintained in FIFO order. Of course, alarms and aborts override, since they specify their own handlers — their messages should go at the head of the queue.

Exercise. Implement the above scheme for the threads in a given process.

• Messages must be queued in FIFO order.

• Each message is a single word:

– thread handle (sender, filled in by send): 8 bits.

– kind: 8 bits.

– data: 16 bits.

• A call to send has the format send(thread, kind, data).

Discuss protocols, the ISO model, standard port mapping, and sockets as an object interface.
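For the exercise’s one-word message (8-bit thread handle, 8-bit kind, 16-bit data), one possible packing — the particular field layout is my own choice — is:

#include <cstdint>

inline uint32_t pack(uint8_t sender, uint8_t kind, uint16_t data) {
    return (uint32_t(sender) << 24) | (uint32_t(kind) << 16) | data;
}
inline uint8_t  msg_sender(uint32_t m) { return uint8_t(m >> 24); }
inline uint8_t  msg_kind(uint32_t m)   { return uint8_t(m >> 16); }
inline uint16_t msg_data(uint32_t m)   { return uint16_t(m & 0xFFFF); }

// send(thread, kind, data) would fill in the sender field itself, e.g.:
//     enqueue(thread, pack(current_thread_handle(), kind, data));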

Appendix B

Terminology

Add a real glossary.

B.1 Glossary

• inode:

• flag: boolean variable.

• ATA

• IDE

• USB

• IEEE

• I/O

• CPU

• a.k.a.

• e.g.

• i.e.

• DMA

• ATM

• PCI

• VME

• PCMCIA

• SCSI

• RS232

• ASCII



B.2 Terminology

Operating systems are often classified by the kinds of concurrency that they support:

Multiprocessor systems have multiple CPUs, each having access to shared memory. These CPUs can work concurrently on the same process, one processor per thread. Note that multiprocessing refers to running multiple processors, rather than multiple processes.

Multiprogramming systems allow several programs to reside simultaneously in main memory and execute concurrently, or pseudoconcurrently. Such systems are sometimes called multiprocessing systems.

Multithreaded programs are those in which there are multiple independent streams of activity (i.e., threads).

Real-time systems respond to external events within a specified time interval.

Interactive systems allow programs to run without all input data residing in system storage; they can wait for responses from human-operated devices like keyboards and mice.

Multiplexing a resource is the policy of sharing it among a number of threads by giving it briefly to one, then to another, then another, etc. Each thread gets a time slice. (Communications engineers call this “time-division multiplexing.”) For instance, we can implement pseudo-concurrency on a monoprocessor system by multiplexing the CPU among multiple threads.

Timesharing systems support multiple concurrent interactive users by multiplexing the CPUs among the users’ processes. Unix is a timesharing system, as is the Terminal Server Edition of Windows.

B.3 Thesaurus

• vector = one-dimensional array

• Unix use of “file descriptor.”

• Unix use of “i-node.”

• processor = CPU

• segment = block = array

• dynamic binding = registration

• activity = lightweight process = thread = active object = job

• signal = notify (for conditions)


• synchronous = blocking

• asynchronous = non-blocking

• include file = header file = .h file

• heavyweight process = process = job

• function = routine = procedure

• modify = update.

• CPU = processor.

• vector table = bounce table.

• lookup = resolve (for names)

• call = invoke = activate = instantiate.

• block = mask = disable = turn off

• Exception, fault, trap

• system call, service call, software interrupt, software trap, trap instruction,interrupt instruction.

• call-back, handler, remote procedure, method, signal

• install, register (for the above)

• Context, directory, and domain (in name resolution).

• The terms procedure and function.

• “Descriptor” means fundamental description record. Some use the term “control block.” Object-oriented programming tends to abolish the distinction between an object and its descriptor, which is helpful in most cases, but can provoke confusion in some cases.

• service, port, operation, index into a vector table, mailbox, signal number, entry in a virtual function table, entry in a table for binding to a DLL.

• Server, service facility, object, process

• process, heavy-weight process, client, job

• job, client, thread, light-weight process, coroutine

• timeslice = quantum

• manager = dispatcher = scheduler + protection-enforcer


Bibliography

[AHU] Aho, Hopcroft and Ullman, Data Structures and Algorithms,Addison-Wesley, 1983

[BA] Bach, The Design of the Unix Operating System, Prentice-Hall,1986.

[BE] Belady, L.A., and C.J.Kuehner, “Dynamic Space sharing in computersystems,” CACM, May 1969, Vol. 12, No. 5, pp. 282-288.

[BR] Brinch-Hansen, Operating Systems Principles, Prentice-Hall,Englewood Cliffs, New Jersey, (1973).

[CR] Creasy, R.J., “The Origin of the VM/370 Time-Sharing System,”IBM J. Res. Develop., Vol. 25, No. 5, September 1981, pp. 483-490.

[CO] Coffman and Denning, Operating Systems Theory, Prentice-Hall.

[CO2] Comer and Fossum, Operating System Design: The Xinu Approach,Vol. I, Prentice Hall.

[DE] Deitel, An Introduction to Operating Systems, Addison Wesley.

[EN] Engelschall, Ralf S., Portable Multithreading, the Signal Stack Trickfor User-Space Thread Creation, Proceedings of the USENIX AnnualTechnical Conference, June 18-23, 2000, San Diego, California, CA.(http://www.gnu.org/software/pth/rse-pmt.ps)

[FR] Fraser, Keir A., Practical Lock-Freedom, Ph.D. dissertation, King’sCollege, University of Cambridge, September 2003.

[HA] Habermann, Introduction to Operating System Design, Science Re-search Associates. Addison Wesley.

[HO] Hoare, “Monitors: an operating system structuring con-cept,” CACM, October 1974, Vol. 17, No. 10, pp. 549-557.(http://www.acm.org/classics/feb96/introduction.html)

[IN] 80386 Programmer’s Reference Manual, Intel, Santa Clara, CA, 1986.



[JO] Johnson, Demers, Ullman, Garey and Graham, SIAM Jr. On Com-puting, 3(4), pp. 299-325 (1974).

[KN1] Knuth, Donald, Art of Programming Vol. I

[KN3] Knuth, Donald, Art of Programming Vol. III

[KO] Korzeniowski, “Lowering the Cost of PC Maintenance,” Computer,March 1997, Vol. 30, No. 3, pp. 15-16.

[LA2] Lampson and Redell, “Experience with processes and monitors inMesa,” CACM, February 1980, Vol. 23, No. 2, pp. 105-117.

[LA1] Lamport, “A new solution of Dijkstra’s concurrent programmingproblem,” CACM, August 1974, Vol. 17, no 8, pp. 453-455.

[LE] Leffler, McKusick, Karels, and Quarterman, The Design and Imple-mentation of the 4.3 BSD Unix Operating System, Addison-Wesley,1989.

[MA] Madnick and Donovan, Operating Systems, McGraw Hill.

[MK] McKusick, Joy, Leffler, Fabry, “A Fast File System for UNIX,” UNIXSystem Manager’s Manual, 4.3 Berkeley Software Distribution Vir-tual VAX-11 Version, April 1986

[OU] Ousterhout et al., “The Sprite Network Operating System,” Com-puter, February 1988, pp. 23-35.

[RI] Ritchie and Thompson, The Unix Operating System, Communica-tions of the ACM, July 1974, (17, no. 7, pp. 365-375). (Also, “TheUnix time-sharing system,” BSTJ 1978.)

[RI2] Ritchie, A Retrospective, BSTJ 1978.

[RU] Rubini, LINUX Device Drivers, O’Reilly, 1998.

[SG] Silberschatz and Galvin, Operating System Concepts, Fifth Edition,Addison Wesley, New York (1994).

[SI] Richard Sites, “Operating Systems and Computer Architecture,”Introduction to Computer Architecture, Second Edition, Harold S.Stone, editor, SRA, 1980.

[SS] Singhal and Shivarati, Advanced concepts in operating systems: dis-tributed, database, and multiprocessor operating systems, McGrawHill, 1994.

[ST] Stallings, Operating Systems, Fourth Edition, Prentice Hall, 2001.

[STR] Stroustrup, The C++ Programming Language, Third Edition, Addison-Wesley, 1997.


[TA] Tanenbaum, Operating Systems Design and Implementation,Prentice-Hall.

[TH] Thompson, “Unix implementation,” BSTJ 1978

[TS] Tschritzis and Bernstein, Operating Systems, Academic Press.

[WI] Wirth, Programming in MODULA-2, third, corrected edition, NewYork: Springer-Verlag, 1985.

[BE1] The Bell Laboratories Technical Journal; October 1984, Vol. 63, No.8, Part 2.

[BE2] The Bell Laboratories Technical Journal; July/August 1987, Vol. 57,No. 6, Part 2.

[PG] Popek and Goldberg, “Formal Requirements for Virtualizable ThirdGeneration Architectures.” CACM, July 1974, Vol. 17, No. 7, pp.412-421.


Appendix C

TO DO

• Straighten out “sector” vs. “block” terminology.

• Define the notion of “segment” of a one-dimensional array.

• Look at Nemeth et al. on signals (especially the chart) for the chapter on processes or on threads.

• Go down the shelf of OS books and add all to bibliography. Also Brett’sbooks.

• Note that lookup access and cd access are not standard Unix access modes.Well, okay, lookup is reading.

• Get copy of article on “shuttles” from Brett.

• Get copy of the article on Dynamic Binding in the March or April 1997issue of Computer.

• Where do DLLs fit in? (See Computer of March or April 1997.)

• Add a section on global name management and the Linux upper hierarchy.

• In a multithreaded process, how is it decided which thread handles a givensignal? See Butenhof’s book.

• In a multiprocessor arch, how is it decided which CPU gets an interrupt?

• Discuss the creation of mask objects with limited modes vs. caching ca-pabilities.

• Get a paper on log-structured file systems from Brett.

• For tables, there need to be miss policies: abort, garbage, indication offailure, etc.

• Note that prefix matching is an associative form of table lookup.



• Add diagrams, examples and exercises.

• Get further into management of the domain (user) database:

– NIS

– Collaboration under Unix and NT

– Exchange

– Real DB vs the Unix text file approach.

• The destructor for a thread should remove it from its queue.

• Make locks, queues, and regsettings base classes for Monitors, Conditions,and Threads respectively.

• Give alternative structure where interrupts are signals: the resumed pro-cess is responsible for queue handling.

• Revise queue handling to make interrupts appear like signals. Have thewaitor do all of the queue handling.

• Note that one must collapse protection domains in order to virtualize.When the virtual machine moves among identified domains, there mustbe an interrupt so the monitor can keep track of these movements. Thecollapsing is required so that the monitor can have its own domain.

• In the access-control portion of this presentation, we must discuss the concept of parameter checking. Notice that the simple client-mode-object model for access control ignores the notion of parameters. But the ranges of the parameters to an access call are very important. Bounds checking in memory accesses is one example of parameter-value (range) checking. Also, there is extensive discussion of parameter-value checking in the 386 segmentation system.

• Develop the parallel between the segmentation system of the 386 and afile system. What are the similarities and differences from the Unix filesystem.

• One approach to protection is to create a virtual object with fewer accessmodes. For instance, suppose I have read and write permissions for aparticular file, but open it for read-only access. In 4.3 BSD, it is onlythe read capability that gets cached in the per-process open file table. InSystem V, however, the file object created from the backing segment isgiven only read access. This is the virtual object with lesser capabilities.

• Note that the cylinder groups of the fast file system are probably of limited effectiveness. A typical process will have only one read request on the queue at a time. If there is more than one process generating read requests for a given disk, they will alternate, destroying the locality of reference that makes the cylinder groups effective. On the other hand, a process can generate a lot of writing to a given file and these may all be flushed from the disk cache together, allowing a more efficient disk transfer.

• Cylinder groups are probably very effective when there is only one useron the system.

• Restructuring programs for nonblocking reads would help.

• Any restructuring to move a file at a time would help. For instance, theexec routine should suck up an entire executable file at a time. It cantake advantage of nonblocking reads.

• bounce should be extended to a generalized resume instruction that simplytakes a PID as argument and passes the caller’s cpu to him. This shouldbe hidden as soon as possible.

• Look up the write-up that Tanenbaum mentions about the system based on one-word messages.

• Do a comparison of the Unix file system with the 80386 segmentationsystem.

• In a VM system, there is no need for the VMM to intervene on every trap.E.g., the divide by zero trap can be vectored directly to the handler of theOS running on the VM. There is a problem of protection level, however— execution should not arrive there in system mode.

• Change “urgent” to “suspend” in the implementation of strict signalling.

• Discuss the difficulties getting parameters to a system call, especially inthe case where there is a new stack that handles the system call. Thisshould be implemented like a remote procedure call in a message passingsystem.

• Get exact quote for Brinch-Hansen on page 11.

• We must mention that a fork operation (or create) must increment theaccess count for each of the parent’s (i.e., child’s) open files in the system-wide open file table.

• Should the name manager handle the names of operations on objects orshould these be handled via “.h” files?

• From now on we will use the word handle for token or id number. We willuse the term X-handle to mean the handle with respect to the X handleserver.

• The mount operation in Unix establishes the logical location of a file system, while the attach operation in Sprite has to do only with physical location. The volume is a level in the handle hierarchy. We have a three-level hierarchy: machine, volume, object. A volume is a set of files that can be physically moved to a new device (possibly on a different machine) with minimal reassignment of handles, i.e., a unit in the handle hierarchy. The handle is a machine-convenient name. The real name hierarchy should reflect human convenience.

• A file is a volume root if and only if it has no hard links except . and some..’s. Its volume is its closure with respect to hard links.

• We call the two pieces of RPC software: the RPC client and the RPCserver.

• Every volume needs a free-space manager. The backing storage of a systemis broken up into devices. Each device is a segment of blocks. To identifya block one specifies its device and its offset. The driver is supposed to besmart enough to find it from there.

• The naming of operations on objects can either be done by the namemanager or by use of “.h” files.

• Mention exceptions to the need for exclusions on all monitor operationsas in Lampson and Redell.

Appendix D

Fall 2004

D.1 Getting information

• A web-browsable version of the notes is available at file:///home/csprofs/thp/pub/cs153stuff/notes/notes.html or possibly at www.cs.ucr.edu/~thp/notes/notes.html.

• A postscript version of the notes is available for browsing via the gv com-mand at/home/csprofs/thp/pub/cs153stuff/notes.ps. Use this to print pages andonly print through the preliminary readings, since I’m editing the stuffjust ahead.

• The netnews group for the class, ucr.cs.cs153, is available for general discussion. It is gateway’d with the class’s mailing list, which is available from the Department’s homepage, www.cs.ucr.edu.

I ask that members of the class send their questions related to course material via the newsgroup/mailing list:

• Questions posted to the newsgroup/mailinglist are seen by everyone andprovoke further questions in the minds of the rest of the class.

• My answers are seen by all members of the class.

• My answers are therefore a matter of record. If you say to me: “But yousaid ...”, I can’t deny it.

• I get to think about my answers before I make them a matter of record,and can cite or include background that I couldn’t in a casual conversation.

• Other folks, e.g., TAs and/or other members of the class, may have betteranswers than I do.



• People can and will correct me when I’m wrong.

• Technical questions, say relating to a homework problem, can be answeredin a much more timely manner.

Personal questions relating say to the grading of a particular question or problemcan and should be directed to me personally at [email protected].

D.2 Objectives

D.2.1 Official ABET objectives for CS153

1. Study basic principles underlying the design of operating systems with afocus on principles and mechanisms used throughout the design

2. An understanding of CPU scheduling, storage management: memory man-agement, virtual memory and file systems

3. Study of concurrency control and synchronization, classic algorithms forsynchronization and concurrency management

4. Study deadlocks, devices, device management, and I/O systems

5. Study dynamic binding

6. An understanding of protection, access control, and security

7. Improve skills in concurrent programming and introduce kernel program-ming

D.2.2 Some additional objectives for this offering

To convey an understanding of:

• purposes of OSes

• how OSes achieve those purposes

• organizational significance of OS choice.

• societal and economic impact of OSes.

• basic marketing tactics and their countermeasures.

– pricing to value rather than cost

– who gets to set the standard (IBM vs. AT&T)

– lock-in (via featurism and proprietary protocols). The defense isstandards both formal and ad hoc. The counter-ploy is “embraceand extend”.

– FUD


• Specific market penetration by Linux and Windows.

• what’s going on now, e.g., intellectual-property (IP) wars and DRM.

• skill in programming.

• skill in determining semantics of system calls library functions.

• better understanding of caching, protection, concurrency, and binding.

D.2.3 Outcomes

According to the ABET accreditation board, engineering programs must demonstrate that upon graduation their graduates have:

a. an ability to apply knowledge of mathematics, science, and engineering

b. an ability to design and conduct experiments, as well as to analyze andinterpret data

c. an ability to design a system, component, or process to meet desired needs

d. an ability to function on multi-disciplinary teams

e. an ability to identify, formulate, and solve engineering problems

f. an understanding of professional and ethical responsibility

g. an ability to communicate effectively

h. the broad education necessary to understand the impact of engineeringsolutions in a global and societal context

i. a recognition of the need for, and an ability to engage in life-long learning

j. a knowledge of contemporary issues

k. an ability to use the techniques, skills, and modern engineering tools nec-essary for engineering practice

The objectives of this course are intended to contribute to those at-graduateoutcomes of your education. Please be mindful of the connections and feel freeto ask about them.

D.2.4 Paradigms

There are two prevailing paradigms in discussions about the purposes of educa-tion:

• the student as customer,

• the student as work in progress.


In my humble opinion, both are correct and together they convey an accuratepicture of the educational endeavor, especially at the level of “higher education”.In fact, there is a paradigm that I like even more: students as junior colleagues.That paradigm comes into play when students participate in faculty research.

D.3 Assessment

Everyone is responsible for assigned readings before the corresponding lectureor lab.

• Quizzes: in labs and in lectures, unannounced, no makeups; everyone’s two lowest scores are dropped.

• Projects: assigned in labs, see schedule below.

• Final: Monday 12/6 from 11:30 to 2:30.

Respective weights will be somewhere between 20% and 40% and most likelyclose to equal.

D.4 Schedule

• Lectures: Mon, Wed, Fri at 10:10 to 11:00 in Watkins 1101.

• Labs:

– Section 22: Tuesday at 11:10-2:00 in Surge 283, Nelson Perez (nperez)

– Section 23: Tuesday at 2:10-5:00 in Surge 283, Abhishek Mitra (ami-tra)

• Holidays: Nov 11, Nov 25, and Nov 26.

• Final: Mon, 12/6, 11:30-2:30

D.4.1 Week 0

Lecture-1, Th 9/23

Covering through page 7. A few minutes of background on Information Technology and what it’s about, i.e., the three underpinnings.

D.4.2 Week 1

Lab-1, week of 9/26

Observing Linux Behavior (via the /proc File System): Exercise 1, pages 1-65 in Kernel Projects for Linux, a.k.a. “The Kernel Book”. Read also the man pages on strace and ptrace, and experiment with strace. (Note that strace is a command program, while ptrace is a system call.) Do Exercise 1 from the Kernel Book. Should be done mostly in lab. Turn-in by 11:59 pm on Monday 10/4, roughly one week, using the departmental turnin facility, not BlackBoard. (TAs: review the use of turnin, since there will be some transfer students and graduate students who are not familiar with it.)
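
To make the strace/ptrace distinction concrete, the following is a minimal, hypothetical sketch (not part of the exercise) of driving the ptrace system call by hand: the parent forks a child, the child asks to be traced and execs a program, and the parent resumes the child to each system-call boundary, which is essentially what strace does before decoding each call.

    /* Hypothetical sketch: using ptrace(2) directly.  strace(1) is built
     * on this mechanism; it additionally reads the call number and
     * arguments from the child's registers at each stop.
     * The target program "ls" is just an example. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* child: request tracing */
            execlp("ls", "ls", (char *)NULL);       /* stops before ls runs */
            perror("execlp");
            _exit(1);
        }
        int status, stops = 0;
        waitpid(child, &status, 0);                 /* wait for the exec stop */
        while (WIFSTOPPED(status)) {
            stops++;
            ptrace(PTRACE_SYSCALL, child, NULL, NULL);  /* run to next syscall entry/exit */
            waitpid(child, &status, 0);
        }
        printf("child stopped %d times at system-call boundaries\n", stops);
        return 0;
    }

Each system call produces two stops (entry and exit), so the count is roughly twice the number of calls; comparing the output against strace -c ls is a quick sanity check.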

Lecture-2, Mon, 9/27

Preliminary reading: through page 14

Lecture-3, Wed, 9/29

Preliminary reading: through page 21. Also read the interview with Jim Gray: http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=43

Lecture-4, Fri, 10/1

Preliminary reading: through page 28

D.4.3 Week 2

Lab-2, week of 10/3

Kernel Modules: Exercise 4, pages 97-106 in the Kernel Book. Read also the man pages for diff and patch. Learn how to modify, compile, and run the UML version of the Linux kernel. Learn how to make a patch file. Do Exercise 4 and turn in only the corresponding patch file. Turn-in by 11:59 pm on Monday 10/11, roughly one week.
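
For orientation before the exercise: a loadable module at its smallest is just an init routine and an exit routine registered with the kernel. Below is a hypothetical “hello” module in the style of the kernels the Kernel Book targets; the exercise’s own module and its build details will differ.

    /* Hypothetical minimal loadable kernel module (illustration only). */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>

    static int __init hello_init(void)
    {
        printk(KERN_INFO "hello: loaded\n");
        return 0;                        /* a nonzero return makes insmod fail */
    }

    static void __exit hello_exit(void)
    {
        printk(KERN_INFO "hello: unloaded\n");
    }

    module_init(hello_init);             /* called at insmod */
    module_exit(hello_exit);             /* called at rmmod */
    MODULE_LICENSE("GPL");

As for the turn-in: a patch file is just the output of diff run between a pristine source tree and your modified one (for example, diff -Naur linux-orig linux-mine > ex4.patch, with illustrative directory names), and it can be re-applied to a pristine tree with patch -p1 < ex4.patch.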

Lecture-5, Mon, 10/4

Preliminary reading: through page 35

Lecture-6, Wed, 10/6

Preliminary reading: through page 42

Lecture-7, Fri, 10/8

Preliminary reading: through page 49

D.4.4 Week 3

Lab-3, week of 10/10

Adding a System Call: Exercise 5, pages 107-118 in the Kernel Book. Do Exercise 5 and turn in only the corresponding patch file. Turn-in by 11:59 pm on Monday 10/18, roughly one week.
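
As rough orientation (the exercise spells out the real steps): adding a system call amounts to writing a kernel-side function, assigning it a number, and adding an entry to the architecture’s system-call table; user programs can then invoke it by number through the syscall(2) library wrapper. A hypothetical kernel-side sketch, with an assumed name:

    /* Hypothetical sketch: the kernel-side body of a trivial new system
     * call.  It also needs a number and a syscall-table entry -- those
     * edits are exactly what the turned-in patch file captures. */
    #include <linux/kernel.h>
    #include <linux/linkage.h>

    asmlinkage long sys_mygetnum(void)
    {
        printk(KERN_INFO "sys_mygetnum invoked\n");
        return 42;                       /* delivered to user space as the result */
    }

From user space the call can then be exercised with syscall(__NR_mygetnum), where __NR_mygetnum is whatever number the patch assigned, and the return value checked against 42.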

Lecture-8, Mon, 10/11

Preliminary reading: through page 56

Lecture-9, Wed, 10/13

Preliminary reading: through page 63

Lecture-10, Fri, 10/15

Preliminary reading: through page 70

D.4.5 Week 4

Lab-4, week of 10/17

Writing a Shell Program: Exercise 2, pages 67-82 in the Kernel Book. Read the man page on execve. Play around with ptrace. Also scan through the man pages on bash; there is a lot more there than you need. Also see Steve Graham’s writeup. Due on Monday 10/25, roughly one week.
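
The heart of any shell is a read-split-fork-exec-wait loop; everything else in the exercise (redirection, built-ins, and so on) hangs off that skeleton. A hypothetical minimal sketch:

    /* Hypothetical sketch of a shell's core loop: read a line, split it
     * into words, fork, exec the command, and wait for it to finish. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        char line[1024];
        char *argv[64];

        for (;;) {
            printf("mysh> ");
            fflush(stdout);
            if (!fgets(line, sizeof line, stdin))
                break;                          /* EOF (Ctrl-D) ends the shell */

            int argc = 0;                       /* split the line on whitespace */
            for (char *tok = strtok(line, " \t\n");
                 tok != NULL && argc < 63;
                 tok = strtok(NULL, " \t\n"))
                argv[argc++] = tok;
            argv[argc] = NULL;
            if (argc == 0)
                continue;

            pid_t pid = fork();
            if (pid == 0) {
                execvp(argv[0], argv);          /* execvp = execve plus PATH search */
                perror(argv[0]);
                _exit(127);
            }
            waitpid(pid, NULL, 0);              /* wait for the foreground command */
        }
        return 0;
    }

Running a command under such a shell while tracing it with strace -f shows the fork (clone), execve, and wait sequence directly, tying this lab back to Lab-1.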

Lecture-11, Mon, 10/18

Preliminary reading: through page 77

Lecture-12, Wed, 10/20

Preliminary reading: through page 84

Lecture-13, Fri, 10/22

Preliminary reading: through page 91

D.4.6 Week 5

Lab-5, week of 10/24

Synchronization Mechanisms: Exercise 8, pages 145-154 in the Kernel Book. Due on Monday 11/1, roughly one week.

Lecture-14, Mon, 10/25

Preliminary reading: through page 98

Lecture-15, Wed, 10/27

Preliminary reading: through page 105

Lecture-16, Fri, 10/29

Preliminary reading: through page 112

D.4.7 Week 6

Lab-6, week of 10/31

Device Drivers: Exercise 10, pages 167-178 in the Kernel Book. Due on Monday 11/8, roughly one week.

Lecture-17, Mon, 11/1

Preliminary reading: through page 119

Lecture-18, Wed, 11/3

Preliminary reading: through page 126

Lecture-19, Fri, 11/5

Preliminary reading: through page 133

D.4.8 Week 7

Preliminary reading: rest of chapter 9 (to page 145).

Lab-7, week of 11/7

File System: Exercise 11, pages 179-204 in the Kernel Book. Due on Monday 11/22, a bit less than two weeks.

Lecture-20, Mon, 11/8

Preliminary reading: through page 140

Lecture-21, Wed, 11/10

Preliminary reading: through page 147

Lecture-22, Fri, 11/12

Preliminary reading: through page 154

D.4.9 Week 8

Lab-8, week of 11/14

File I/O: Exercise 12, pages 205-218 in the Kernel Book. Due on Monday 11/26, two weeks.

Lecture-23, Mon, 11/15

Preliminary reading: through page 161

Lecture-24, Wed, 11/17

Preliminary reading: through page 168

Lecture-25, Fri, 11/19

Preliminary reading: through page 175

D.4.10 Week 9

Lab-9, week of 11/21

File System and File I/O continued.

Lecture-26, Mon, 11/22

Preliminary reading: through page 182

Lecture-27, Wed, 11/24

Preliminary reading: through page 189

Lecture-XX, Fri, 11/26

Thanksgiving break. Enjoy!

D.4.11 Week 10

Lab-10, week of 11/28

File I/O continued.

Lecture-28, Mon, 11/29

Preliminary reading: through page 196

Lecture-29, Wed, 12/1

Preliminary reading: through page 203

Lecture-30, Fri, 12/3

Preliminary reading: through page 210

D.4.12 Finals Week

Final: Mon, 12/6, 11:30-2:30