Class-based Prioritized Resource Control in Linux

Shailabh Nagar, Hubertus Franke, Jonghyuk Choi, Chandra Seetharaman, Scott Kaplan (on sabbatical from Amherst College, MA), Nivedita Singhvi, Vivek Kashyap, Mike Kravetz

IBM Corp.

{nagar,frankeh,jongchoi,chandra.sekharan}@us.ibm.com

{nivedita,vivk,kravetz}@us.ibm.com

[email protected]

Abstract

In Linux, control of key resources such as memory, CPU time, disk I/O bandwidth and network bandwidth is strongly tied to kernel tasks and address spaces. The kernel offers very limited support for enforcing user-specified priorities during the allocation of these resources.

In this paper, we argue that Linux kernel resource management should be based on classes rather than tasks alone and be guided by class shares rather than system utilization alone. Such class-based kernel resource management (CKRM) allows system administrators to provide differentiated service at a user or job level and prevent denial of service attacks. It also enables accurate metering of resource consumption in user and kernel mode. The paper proposes a framework for CKRM and discusses incremental modifications to kernel schedulers to implement the framework.

1 Introduction

With Linux having made rapid advances in scalability, making it the operating system of choice for enterprise servers, it is useful and timely to examine its support for resource management. Enterprise workloads typically run on two types of servers: clusters of 1-4 way SMPs and large (8-way and upward) SMPs and mainframes. In both cases, system administrators must balance the needs of the workload with the often conflicting goal of maintaining high system utilization. The balancing becomes particularly important for large SMPs as they often run multiple workloads to amortize the higher cost of the hardware.

A key aspect of multiple workloads is that they vary in their business importance to the server owner. To maximize the server's utility, the system administrator needs to ensure that workloads with higher business importance get a larger share of server resources. The kernel's resource schedulers need to allow some form of differentiated service to meet this goal. It is also important that the resource usage by different workloads be accounted accurately so that the customers can be billed according to their true usage rather than an average cost. Kernel support for accurate accounting of resource usage is required, especially for resource consumption in kernel mode.

Differentiated service is also useful to the desktop user. It would allow large file transfers to get reduced priority to the disk compared to disk accesses resulting from interactive commands. A kernel compile could be configured to run in the background with respect to the CPU, memory and disk, allowing a more important activity such as browsing to continue unimpeded.

The current Linux kernel (version 2.5.69 at the time of writing) lacks support for the above-mentioned needs. There is limited and varying support for any kind of performance isolation in each of the major resource schedulers (CPU, network, disk I/O and memory). CPU and inbound network scheduling offer the greatest support by allowing specification of priorities. The deadline I/O scheduler [3] offers some isolation between disk reads and writes but not between users or applications, and the VM subsystem has support for limiting the address space size of a user. More importantly, the granularity of kernel-supported service differentiation is a task (process), or infrequently the userid. It does not allow the user to define the granularity at which resources get apportioned. Finally, there is no common framework for a system administrator to specify priorities for usage of different physical resources.

The work described in this paper addresses these shortcomings. It proposes a class-based framework for prioritized resource management of all the major physical resources managed by the kernel. A class is a user-defined, dynamic grouping of tasks that have associated priorities for each physical resource. The proposed framework permits a better separation of user-specified policies from the enforcement mechanisms implemented by the kernel. Most importantly, it attempts to achieve these goals using incremental modifications of the existing mechanisms.

The paper is organized as follows. Section 2 introduces the proposed framework and the central concepts of classification, monitoring and control. Sections 3, 4, 5 and 6 explore the framework for the CPU, disk I/O, network and memory subsystems and propose the extensions necessary to implement it. Section 7 concludes with directions for future work.

2 Framework

Before describing the proposed framework, we define a few terms.

Tasks are the Linux kernel's common representation for both processes and threads. A class is a group of tasks. The grouping of tasks into classes is decided by the user using rules and policies.

A classification rule, henceforth simply called a rule, is a method by which a task can be classified into a class. Rules are defined by the system administrator, generally as part of a policy (defined later) but also individually, typically as modifications or increments to an existing policy. Attributes of a task, such as real uid, real gid, effective uid, effective gid, path name, command name and task or application tag (defined later) can be used to define rules. A rule consists of two parts: a set of attribute-value tuples (A,V) and a class C. If the rule's tuple values match the corresponding attributes of a task, then the task is considered to belong to the class C.

A policy is a collection of class definitions and classification rules. Only one policy is active in a system at any point of time. Policies are constructed by the system administrator and submitted to a CKRM kernel. The kernel optionally verifies the policy for consistency and activates it. The order of rules in a policy is important. Rules are applied in the order they are defined (one exception is the application tags as described in the notes below).
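To make the rule concept concrete, the sketch below shows one hypothetical way a rule could be represented as a set of attribute-value tuples plus a target class. The type and field names are illustrative only; they are not an actual CKRM interface.

/* Hypothetical sketch of a classification rule: a set of
 * attribute-value tuples (A,V) and the class C it maps to.
 * Names are illustrative, not an actual CKRM interface. */
enum ckrm_attr {                /* task attributes a rule may test */
        ATTR_UID,
        ATTR_GID,
        ATTR_CMD,               /* command name */
        ATTR_PATH,              /* executable path name */
        ATTR_TAG,               /* opaque application/task tag */
};

struct ckrm_tuple {
        enum ckrm_attr attr;
        const char *value;      /* value to match, as a string */
};

struct ckrm_rule {
        struct ckrm_tuple tuples[8];    /* all tuples must match */
        int ntuples;
        const char *class_name;         /* class C the task joins */
};

/* Example: tasks run by uid 504 with command "httpd" belong to a
 * class named "webserver".  Rules are evaluated in policy order. */
static const struct ckrm_rule example_rule = {
        .tuples     = { { ATTR_UID, "504" }, { ATTR_CMD, "httpd" } },
        .ntuples    = 2,
        .class_name = "webserver",
};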

An Application/Task Tag is a user-defined attribute associated with a task. Such an attribute is useful when tasks need to be classified and managed based on application-specific criteria. In such scenarios, an application's tasks can specify their tag to the kernel using a system call, ioctl, /proc entry etc. and trigger their classification using a rule that uses the task tag attribute. Since the task tag is opaque to the kernel, it allows applications and system administrators additional flexibility in grouping tasks.

A Resource Manager is the entity which determines the proportions in which resources should be allocated to classes. This could be either a human system administrator or a resource management middleware application.

With all the new terms of the framework defined, we can now describe how the framework operates to provide class-based resource management. Figure 1 illustrates the lifecycle of tasks in the proposed framework with an emphasis on the three central aspects of classification, monitoring and control.

2.1 Initialization

Sometime after system boot up, the Resource Manager commits a policy to the kernel. The policy defines the classes and it is used to classify all tasks (pre-existing and newly created) and all incoming network packets. A CKRM-enabled Linux kernel also contains a default policy with a single default class to which all tasks belong. The default policy determines classification and control behaviour until the Resource Manager's policy gets loaded. New policies can be loaded at any point and override the existing policy. Such policy loads trigger reclassification and reinitialization of monitoring data and are expected to be very infrequent.

2.2 Classification

Classification refers to the association of tasks to classes and the association of resource requests to a class. The distinction is mostly irrelevant as most resource requests are initiated by a task, except for incoming network packets which need to be classified before it is known which task will consume them. Classification is a continuous operation. It happens on a large scale each time a new policy is committed and all existing tasks get reclassified. At later points, classification occurs whenever (1) a new task is created, e.g. fork(), exec(); (2) the attributes of a task change, e.g. setuid(), setgid(), application tag change (initiated by the application); and (3) explicit reclassification of a specific task is requested by the Resource Manager. Scenarios (1) and (2) are illustrated in Figure 1.

Classification of tasks potentially allows all work initiated by the task to be associated with the class. Thus the CPU, memory and I/O requests generated by this task, or by the kernel on behalf of this task, can be monitored and regulated by the class-aware resource schedulers. Kernel-initiated resource requests which are on behalf of multiple classes, e.g. a shared memory page writeout, need special treatment, as do application-initiated requests which are processed asynchronously. Classification of incoming network connections and/or data (which are seen by the kernel before the task to which they are destined) is discussed separately in Section 5.

2.3 Monitoring

Resource usage is maintained at the class level in addition to the regular system accounting by task, user etc. The system administrator or an external control program with root privileges can obtain that information from the kernel at any time. The information can be used to assess machine utilization for billing purposes or as an input to a future decision on changing the share allotted to a class. The CKRM API provides functions to query the current usage data as shown in Figure 1.


Figure 1: CKRM lifecycle

2.4 Control

The system administrator, as part of the initial policy or at a later point in time, assigns a per-resource share to each class in the system. Each class gets a separate share for CPU time, memory pages, I/O bandwidth and incoming network I/O bandwidth. The resource schedulers try their best to respect these shares while allocating resources to a class, e.g. the CPU scheduler tries to ensure that tasks belonging to Class A with a 50% CPU share collectively get 50% of the CPU time. At the next level of the control hierarchy, the system administrator or a control program can change the shares assigned to a class based on their assessment of application progress, system utilization etc. Collectively, the setting of shares and share-based resource allocation constitute the control part of the resource management lifecycle and are shown in Figure 1. This paper concentrates on the lower level share-based resource allocation since that is done by the kernel.

The next four sections go into the details of the classification, monitoring and control aspects of managing each of the four major physical resources.

3 CPU scheduling

The CPU scheduler is central to the operation of the computing platform. It decides which task to run, when, and for how long. In general, realtime and timeshared jobs are distinguished, each with different objectives. Both are realized through different scheduling disciplines implemented by the scheduler. Before addressing the share-based scheduling schemes, we describe the current Linux scheduler.

3.1 Linux 2.5 Scheduler

To achieve its objectives, Linux assigns a static priority to each task that can be modified by the user through the nice() interface. Linux has a range of [0 … MAX_PRIO] priority classes, of which the first MAX_RT_PRIO (=100) are set aside for realtime jobs and the remaining 40 priority classes are set aside for timesharing (i.e. normal) jobs, representing the [-20 … 19] nice values of UNIX processes. The lower the priority value, the higher the "logical" priority of a task, i.e. its general importance. In this context we always assume the logical priority when we are talking about priority increases and decreases. Realtime tasks always have a higher priority than normal tasks.

The Linux scheduler in 2.5, a.k.a. the O(1) scheduler, is a multi-queue scheduler that assigns to each cpu a run queue, wherein local scheduling takes place. A per-cpu runqueue consists of two arrays of task lists, the active array and the expired array. Each array index represents a list of runnable tasks at their respective priority level. After executing for a period, tasks move from the active list to the expired list to guarantee that all tasks get a chance to execute. When the active array is empty, the expired and active arrays are swapped. More detail is provided further on. The scheduler simply picks the first task of the highest priority list in the active array for execution.

Occasionally a load balancing algorithm rebalances the runqueues to ensure that a similar level of progress is made on each cpu. Realtime issues and load balancing issues are beyond this description, hence we concentrate on the single-cpu case for now. For more details we refer to [12]. It might also be of interest to contrast this with the Linux 2.4 based scheduler, which is described in [10].

As stated earlier, the scheduler needs to decide which task to run next and for how long. Time quantums in the kernel are defined as multiples of a system tick. A tick is defined by the fixed delta (1/HZ) between two consecutive timer interrupts. In Linux 2.5, HZ=1000, i.e. the interrupt routine scheduler_tick() is called once every msec, at which time the currently executing task is charged a tick.

Besides the static priority (static_prio), each task maintains an effective priority (prio). The distinction is made in order to account for certain temporary priority bonuses or penalties based on the recent sleep average sleep_avg of a given task. The sleep average, a number in the range of [0 … MAX_SLEEP_AVG * HZ], accounts for the number of ticks a task was recently descheduled. The time (in ticks) since a task went to sleep (maintained in sleep_timestamp) is added on task wakeup, and for every time tick consumed running, the sleep average is decremented.

The current design provisions a range of PRIO_BONUS_RATIO=25% ([-12.5% … +12.5%]) of the priority range for the sleep average. For instance, a "nice=0" task has a static priority of 120. With a sleep average of 0, this task is penalized by 5, resulting in an effective priority of 125. On the other hand, if the sleep average is MAX_SLEEP_AVG = 10 secs, a bonus of 5 is granted, leading to an effective priority of 115. The effective priority determines the priority list in the active and expired arrays of the run queue. A task is declared interactive when its effective priority exceeds its static priority by a certain level (which can only be due to its accumulating sleep average). High priority tasks reach interactivity with a much smaller sleep average than lower priority tasks.
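The fragment below is a simplified sketch of this bonus computation, written to match the numbers quoted above (MAX_PRIO=140, 40 timesharing levels, a bonus range of plus or minus 5 derived from PRIO_BONUS_RATIO=25%); it is an approximation for illustration, not the actual 2.5 kernel code.

/* Simplified sketch of the 2.5 O(1) effective-priority computation.
 * Constants and arithmetic follow the text, not the kernel source. */
#define MAX_PRIO         140    /* 100 realtime + 40 timesharing levels */
#define MAX_RT_PRIO      100
#define HZ               1000
#define MAX_SLEEP_AVG    10     /* seconds */
#define PRIO_BONUS_RATIO 25     /* percent of the 40-level range */
#define MAX_BONUS        (40 * PRIO_BONUS_RATIO / 100)   /* = 10 */

/* nice in [-20..19] maps linearly onto static priorities [100..139] */
static int nice_to_static_prio(int nice)
{
        return MAX_RT_PRIO + 20 + nice;          /* nice=0 -> 120 */
}

/* sleep_avg is in ticks, range [0 .. MAX_SLEEP_AVG*HZ]; a full sleep
 * average earns a bonus of MAX_BONUS/2 = 5, an empty one a penalty
 * of 5, matching the nice=0 example in the text (125 vs. 115).     */
static int effective_prio(int static_prio, int sleep_avg)
{
        int bonus = sleep_avg * MAX_BONUS / (MAX_SLEEP_AVG * HZ)
                    - MAX_BONUS / 2;
        int prio = static_prio - bonus;

        if (prio < MAX_RT_PRIO)
                prio = MAX_RT_PRIO;
        if (prio > MAX_PRIO - 1)
                prio = MAX_PRIO - 1;
        return prio;
}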

The timeslice, defined as the maximum time a task can run without yielding to another task, is simply a linear function of the static priority of normal tasks projected into the range of [MIN_TIMESLICE … MAX_TIMESLICE]. The defaults are set to [10 … 200] msecs. The higher the priority of a task, the longer its timeslice. A task's initial timeslice is deducted from the parent's remaining timeslice. For every timer tick the running task's timeslice is decremented. If decremented to "0", the scheduler replenishes the timeslice, recomputes the effective priority and reenqueues the task either into the active array if it is interactive or into the expired array if it is non-interactive. It then picks the next best task from the active array. This guarantees that all other tasks will execute first before any expired task will run again. If a task is descheduled, its timeslice will not be replenished at wakeup time; however, its effective priority might have changed due to any sleep time.
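A linear mapping of static priority onto timeslices, consistent with the [10 … 200] ms defaults quoted above, might look like the following sketch. It approximates the behaviour described in the text rather than copying the kernel's own task_timeslice() formula.

/* Sketch: timeslice as a linear function of static priority for
 * normal (non-realtime) tasks, projected onto
 * [MIN_TIMESLICE .. MAX_TIMESLICE].  Approximation only. */
#define HZ            1000
#define MAX_RT_PRIO   100
#define MAX_PRIO      140
#define MIN_TIMESLICE (10 * HZ / 1000)     /*  10 ms in ticks */
#define MAX_TIMESLICE (200 * HZ / 1000)    /* 200 ms in ticks */

static int task_timeslice(int static_prio)
{
        /* static_prio 100 (highest) -> MAX_TIMESLICE,
         * static_prio 139 (lowest)  -> MIN_TIMESLICE   */
        int span = (MAX_PRIO - 1) - MAX_RT_PRIO;       /* 39 levels */
        int pos  = (MAX_PRIO - 1) - static_prio;       /* 0..39     */

        return MIN_TIMESLICE +
               (MAX_TIMESLICE - MIN_TIMESLICE) * pos / span;
}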

If all runnable tasks have expired their timeslices and have been moved to the expired list, the expired array and the active array are swapped. This makes the scheduler O(1) as it does not have to traverse a potentially large list of tasks as was needed in the 2.4 scheduler. Due to interactivity, the situation can arise that the active queue continues to have runnable tasks. As a result, tasks in the expired queue might get starved. To avoid this, if the longest expired task is older than STARVATION_LIMIT (=10 secs), the arrays are switched again.

3.2 CPU Share Extensions to the Linux Scheduler

We now examine the problem of extending the O(1) scheduler to allocate CPU time to classes in proportion to their CPU shares. Proportional share scheduling of CPUs has been studied in depth [7, 21, 13] but not in the context of making minimal modifications to an existing scheduler such as O(1).

Let C_i, i = 1 … N, be the set of N distinct classes with corresponding cpu shares S_i^cpu such that Σ_{i=1..N} S_i^cpu = 1. Let SHARE_TIME_EPOCH (STE) be the time interval, measured in ticks, over which the modified scheduler attempts to maintain the proportions of allocated CPU time. Further, we use Φ(a) and Φ(a, b) to denote functions of parameters a and b.

Weighted Fair Share (WFS): In the first approach considered, henceforth called weighted fair share (WFS), a per-class scheduling resource container is introduced that accounts for timeslices consumed by the tasks currently assigned to the class. Initially, the timeslice TS_i, i = 1 … N, of each class C_i is determined by TS_i = S_i^cpu × STE. The timeslice allocated to a task, ts(p), remains the same as in O(1). Every time a task consumes one of its own timeslice ticks, it also consumes one from the class' timeslice. When a class' ticks are exhausted, the task consuming the last tick is put into the expired array. When the scheduler picks other tasks from the same class to run, they immediately move to the expired array as well. Eventually the expired and active arrays are switched, at which time all resource containers are refilled to TS_i = S_i^cpu × STE. Since the array switch occurs as soon as the active list becomes empty, this approach is work conserving (the CPU is never idle if there are runnable tasks). A variant of this approach was initially implemented by [17] based on a patch from Rik van Riel for allocating equal shares to all users in the system.
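A minimal sketch of the per-class accounting in WFS follows; class_tick() would be driven from scheduler_tick() for the currently running task, and class_refill() from the array switch. The structure and function names are illustrative, not an existing kernel interface.

/* Sketch of WFS per-class timeslice accounting.  Names are
 * illustrative; this is not an existing kernel interface. */
#define SHARE_TIME_EPOCH 1000           /* ticks per epoch (STE) */

struct cpu_class {
        int share_pct;                  /* S_i^cpu, as a percentage      */
        int ticks_left;                 /* remaining class timeslice TS_i */
};

/* Refill TS_i = S_i^cpu * STE at every active/expired array switch. */
static void class_refill(struct cpu_class *cls)
{
        cls->ticks_left = cls->share_pct * SHARE_TIME_EPOCH / 100;
}

/* Charge one tick to the class; returns 1 if the class timeslice is
 * exhausted, in which case the running task (and any further tasks
 * picked from this class) should be moved to the expired array. */
static int class_tick(struct cpu_class *cls)
{
        if (cls->ticks_left > 0)
                cls->ticks_left--;
        return cls->ticks_left == 0;
}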

However, WFS has some problems. If the tasks of a class are CPU bound and Σ_{p∈C_i} ts(p) > TS_i, then a class could exhaust its timeslice before all its tasks have had a chance to run at least once. Therefore the lower priority tasks of the class could perpetually move from the active to expired lists without ever being granted execution time. Starvation occurs because neither the static priority (sp) nor the sleep average (sa) of the tasks is changed at any time. Hence each task's timeslice ts(p) = Φ(sp) and effective priority ep(p) = Φ(sp, sa) remain unchanged. Hence the relative priority of tasks of a class never changes (in a CPU bound situation), nor does the amount of CPU time consumed by the higher priority tasks.

To ensure a fair share for individual tasks within classes, we need to ensure that the rate of progress of a task depends on the share assigned to its class. Three approaches to achieve this are discussed next.

Priority modifications (WFS+P): Let a switching interval be defined as the time interval between consecutive array switches of the modified scheduler, ∆_j be its duration and t_j and t_{j+1} be the timestamps of the switches. In the priority modifications approach to alleviating starvation in WFS, henceforth called WFS+P, we track the number of array switches se at which a task got starved due to its class' timeslice exhaustion and increase the task's effective priority based on se, i.e. ep(p) = Φ(sp, sa, se). This ensures that starving tasks eventually get their ep high enough to get a chance to run, at which point se is reset. The drawback of this approach is that the increased scheduler overhead, of tasks being selected for execution and moving directly to the expired list due to class timeslice exhaustion, remains unchanged.

Timeslice modifications (WFS+T1, WFS+T2): Recognizing that starvation can occur in WFS for class C_i only if Σ_{p∈C_i} ts(p) > TS_i, the timeslice modification approaches attempt to change one or the other side of the inequality to convert it to an equality. In WFS+T1, the left hand side of the inequality is changed by reducing the timeslices of each task of a starved class as follows. Let exh_i = Σ_{p∈C_i} ts(p) - TS_i when class timeslice exhaustion occurs. At array switch time, each ts(p) is multiplied by λ_i = TS_i / (TS_i + exh_i), which results in the desired equality. WFS+T1 is slow to respond to starvation because task timeslices are recomputed in O(1) before they move into the expired array and not at array switch time. Hence any task timeslice changes take effect only one switching interval later, i.e. two intervals beyond the one in which starvation occurred. One way to address this problem is to treat a task as having exhausted its timeslice when ts(p) gets decremented to (1 - λ_i) × ts(p) instead of 0. A bigger problem with WFS+T1 is that smaller timeslices for tasks could lead to increased context switches with potentially negative cache effects.
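The rescaling step in WFS+T1 amounts to the small computation sketched below using integer arithmetic; the function name and parameters are illustrative only.

/* Sketch of the WFS+T1 rescaling: at array-switch time, each task
 * timeslice of a starved class is scaled by
 * lambda_i = TS_i / (TS_i + exh_i).  Illustrative names only. */
static int scale_task_timeslice(int ts_p, int class_ts, int exhausted)
{
        /* exhausted = exh_i = sum of ts(p) minus TS_i at exhaustion */
        if (exhausted <= 0)
                return ts_p;            /* class was not starved */
        return ts_p * class_ts / (class_ts + exhausted);
}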

To avoid reducing the ts(p)'s, WFS+T2 increases TS_i of a starving class to make Σ_{p∈C_i} ts(p) = TS_i, i.e. the class does not exhaust its timeslice until each of its tasks has exhausted its individual timeslice. To preserve the relative proportions between class timeslices, all other class timeslices also need to be changed appropriately. Doing so would disturb the same equality for those classes and hence WFS+T2 is not a workable approach.

Two-level scheduler: Another way to regulate CPU shares in WFS is to take tasks out of the runqueue upon timeslice exhaustion and return them to the runqueue at a rate commensurate with the share of the class. A prototype implementation of this approach was described in [17] in the context of user-based fair sharing. This approach effectively implements a two-level scheduler and is illustrated in Figure 2. A modified O(1) scheduler forms the lower level and a coarse-grain scheduler operates at the upper level, replenishing class timeslice ticks. In the modified O(1), when a task expires, it is moved into a FIFO list associated with its class instead of moving to the expired array. At a coarse granularity determined by the upper level scheduler, the class receives new time ticks and reinserts tasks from the FIFO back into O(1)'s runqueue. Class time tick replenishment can be done for all classes at every array switch point but that violates the O(1) behaviour of the scheduler as a whole. To address this problem, [17] uses a kernel thread to replenish 8 ms worth of ticks to one class (user) every 8 ms and round robin through the classes (users). A variant of this idea is currently being explored.
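The upper-level replenishment loop can be sketched roughly as follows. The 8 ms period and the round-robin over classes come from the text above; the structure, the o1_enqueue() hook and the sleep_ms() helper are hypothetical placeholders, not actual kernel interfaces.

/* Sketch of the upper-level (coarse-grain) scheduler: a kernel
 * thread replenishes one class every REPLENISH_MS and re-inserts
 * its expired tasks into the O(1) runqueue.  Illustrative only. */
#define REPLENISH_MS    8
#define REPLENISH_TICKS 8               /* 8 ms of ticks at HZ=1000 */

struct task;                            /* opaque here */

struct cpu_class {
        int ticks_left;                 /* class timeslice            */
        struct task *expired_fifo[64];  /* tasks waiting for new ticks */
        int nexpired;
};

extern void o1_enqueue(struct task *t); /* hypothetical lower-level insert */
extern void sleep_ms(int ms);           /* hypothetical sleep helper */

static void replenish_thread(struct cpu_class *classes, int nclasses)
{
        int i = 0;

        for (;;) {
                struct cpu_class *cls = &classes[i];

                /* top up the class timeslice... */
                cls->ticks_left += REPLENISH_TICKS;

                /* ...and return its expired tasks to the O(1) runqueue */
                while (cls->nexpired > 0)
                        o1_enqueue(cls->expired_fifo[--cls->nexpired]);

                i = (i + 1) % nclasses; /* next class, round robin */
                sleep_ms(REPLENISH_MS);
        }
}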


Figure 2: Proposed two-level CPU scheduler

4 Disk scheduling

The I/O scheduler in Linux forms the interface between the generic block layer and the low level device drivers. The block layer provides functions which are used by filesystems and the virtual memory manager to submit I/O requests to block devices. These requests are transformed by the I/O scheduler and made available to the low-level device drivers (henceforth simply called device drivers). Device drivers consume the transformed requests and forward them, using device specific protocols, to the device controllers which perform the I/O. Since prioritized resource management seeks to regulate the use of a disk by an application, the I/O scheduler is an important kernel component that is sought to be changed. It is also possible to regulate disk usage in the kernel layers above and below the I/O scheduler. Changing the pattern of I/O load generated by filesystems or the virtual memory manager (VMM) is an important option. A less explored option is to change the way specific device drivers or even device controllers consume the I/O requests submitted to them. The latter approach is outside the scope of general kernel development and this paper.

Class-based resource management requires two fundamental changes to the traditional approach to I/O scheduling. First, I/O requests should be managed based on the priority or weight of the request submitter, with disk utilization being a secondary, albeit important objective. Second, I/O requests should be associated with the class of the request submitter and not a process or task. Hence the weight associated with an I/O request should be derived from the weight of the class generating the request.

The first change is already occurring in the development of the 2.5 Linux kernel with the development of different I/O schedulers such as deadline, anticipatory, stochastic fair queueing and complete fair queueing. Consequently, the additional requirements imposed by the second change (scheduling by class) are relatively minor. This fits in well with our project goal of minimal changes to existing resource schedulers.

We now describe the existing Linux I/O schedulers followed by an overview of the changes being proposed.

4.1 Existing Linux I/O schedulers

The various Linux I/O schedulers can be abstracted into the generic model shown in Figure 3. I/O requests are generated by the block layer on behalf of processes accessing filesystems, processes performing raw I/O, and the virtual memory management (VMM) components of the kernel such as kswapd, pdflush etc. These producers of I/O requests call __make_request(), which invokes various I/O scheduler functions such as elevator_merge_fn. The enqueuing functions generally try to merge the newly submitted block I/O unit (bio in 2.5 kernels, buffer_head in 2.4 kernels) with previously submitted requests and sort it into one or more internal queues. Together, the internal queues form a single logical queue that is associated with each block device. At a later point, the low-level device driver calls the generic kernel function elv_next_request() to get the next request from the logical queue. elv_next_request() interacts with the I/O scheduler's dequeue function elevator_next_req_fn, and the latter has an opportunity to pick the appropriate request from one of the internal queues. The device driver then processes the request, converting it to scatter-gather lists and protocol-specific commands that are then sent to the device controller. As far as the I/O scheduler is concerned, the block layer is the producer of I/O requests and the device drivers are the consumers. Strictly speaking, the block layer includes the I/O scheduler but we distinguish the two for the purposes of our discussion.

Figure 3: Abstraction of Linux I/O scheduler

Default 2.4 Linux I/O scheduler: The 2.4 Linux kernel's default I/O scheduler (elevator_linus) primarily manages disk utilization. It has a single internal queue. For each new bio, the I/O scheduler checks to see if it can be merged with an existing request. If not, a new request is placed in the internal queue, sorted by the starting device block number of the request. This minimizes disk seek times if the disk processes requests in FIFO order from the queue. An aging mechanism limits the number of times an existing request can get bypassed by a newer request, preventing starvation. The dequeue function is simply a removal of requests from the head of the internal queue. elevator_linus also has the welcome property of improving request response times averaged over all processes.

Deadline I/O scheduler: The 2.5 kernel's default I/O scheduler (deadline_iosched) introduces the notion of a per-request deadline which is currently used to give a higher preference to read requests. Internally, it maintains five queues. During enqueueing, each request is assigned a deadline and inserted into queues sorted by starting sector (sort_list) and by deadline (fifo_list). Separate sort and fifo lists are maintained for read and write requests. The fifth internal queue contains requests to be handed off to the driver. During a dequeue operation, if the dispatch queue is empty, requests are moved from one of the four sort or fifo lists in batches. Thereafter, or if the dispatch queue was not empty, the head request on the dispatch queue is passed on to the driver. The logic for moving requests from the sort or fifo lists ensures that each read request is processed by its deadline without starving write requests. Disk seek times are amortized by moving a large batch of requests from the sort_list (which are likely to have few seeks as they are already sector sorted) and balancing it with a controlled number of requests from the fifo list (each of which could cause a seek since they are ordered by deadline and not sector). Thus, deadline_iosched effectively emphasizes average read request response times over disk utilization and total average request response time.


Anticipatory I/O scheduler: The anticipatory I/O scheduler [9, 4] attempts to reduce per-process read response times. It introduces a controlled delay in dispatching any new requests to the device driver, thereby allowing a process whose request just got serviced to submit a new request, potentially requiring a smaller seek. The tradeoff between reduced seeks and decreased disk utilization (due to the additional delays in dispatch) is managed using a cost-benefit calculation. The anticipatory I/O scheduling method is an additional optimization that can potentially be added to any of the I/O schedulers mentioned in this paper.

Complete Fair Queueing I/O scheduler: Two new I/O schedulers recently proposed in the Linux kernel community introduce the concept of fair allocation of I/O bandwidth amongst producers of I/O requests. The Stochastic Fair Queueing (SFQ) scheduler [5] is based on an algorithm originally proposed for network scheduling [11]. It tries to apportion I/O bandwidth equally amongst all processes in a system using 64 internal queues and one output (dispatch) queue. During an enqueue, the process ID of the currently running process (very likely to be the I/O request producer) is used to select one of the internal queues and the request is inserted in FIFO order within it. During dequeue, SFQ round-robins through the non-empty internal queues, picking requests from the head. To avoid too many seeks, one full round of requests is collected, sorted and merged into the dispatch queue. The head request of the dispatch queue is then passed to the device driver. Complete Fair Queueing (CFQ) is an extension of the same approach where no hash function is used. Hence each process in the system has a corresponding internal queue and can get a fair share of the I/O bandwidth (an equal share if all processes generate I/O requests at the same rate). Both CFQ and SFQ manage per-process I/O bandwidth and can provide fairness at a process granularity.
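The enqueue-side queue selection in SFQ reduces to hashing the submitting process' PID onto one of the 64 internal queues, roughly as sketched below. The hash function shown is illustrative, not the one used by the actual patch.

/* Sketch of SFQ's queue selection: the PID of the submitting
 * process picks one of 64 internal FIFO queues.  The hash is
 * illustrative only. */
#include <sys/types.h>

#define SFQ_QUEUES 64

static unsigned int sfq_hash(pid_t pid)
{
        /* simple multiplicative hash onto [0 .. SFQ_QUEUES-1] */
        return ((unsigned int)pid * 2654435761u) % SFQ_QUEUES;
}

CFQ removes this hash entirely and gives every process its own queue, so hash collisions between processes disappear at the cost of maintaining more queues.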

Cello disk scheduler: Cello is a two-level I/O scheduler [16] that distinguishes between classes of I/O requests and allows each class to be serviced by a different policy. A coarse-grain class-independent scheduler decides how many requests to service from each class. The second-level class-dependent scheduler then decides which of the requests from its class should be serviced next. Each class has its own internal queue which is manipulated by the class-specific scheduler. There is one output queue common to all classes. Enqueuing into the output queue is done by the class-specific schedulers in a way that ensures individual request deadlines are met as far as possible while reducing overall seek time. Dequeuing from the output queue occurs in FIFO order as in most of the previous I/O schedulers. Cello has been shown to provide good isolation between classes as well as the ability to meet the needs of streaming media applications that have soft realtime requirements for I/O requests.

4.2 Costa: Proposed I/O scheduler

This paper proposes that a modified version of the class-independent scheduler of the Cello I/O scheduling framework can provide a low-overhead class-based I/O scheduler suitable for CKRM's goals.

The key difference between the proposed scheduler, called Costa, and Cello is the elimination of the class-specific I/O schedulers, which may add unnecessary overhead for CKRM's goal of I/O bandwidth regulation. Fig 4 illustrates the Costa design. When the kernel is configured for CKRM support, a new internal queue is created for each class that gets added to the kernel. Since each process is always associated with a class, the I/O requests that it generates can also be tagged with the class id and used to enqueue the request in the class-specific internal queue.


Figure 4: Proposed Costa I/O scheduler

The request->class association cannot be done through a lookup of the current process' class alone. During dequeue, the Costa I/O scheduler picks up requests from several internal queues and sorts them into the common dispatch queue. The head request of the dispatch queue is then handed to the device driver.
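In outline, Costa's enqueue path tags each request with a class identifier and appends it to that class' internal queue, roughly as in the sketch below. The structure and function names are illustrative placeholders, not the actual prototype.

/* Sketch of Costa's enqueue path: each block I/O request carries the
 * class id of its submitter and is queued on that class' internal
 * queue.  Names are illustrative, not the actual prototype. */
struct request {
        int class_id;                   /* set when the bio is submitted */
        struct request *next;
};

struct class_queue {
        struct request *head, *tail;    /* per-class internal FIFO */
        int weight;                     /* share of I/O bandwidth  */
};

static void costa_enqueue(struct class_queue *queues, struct request *rq)
{
        struct class_queue *q = &queues[rq->class_id];

        rq->next = NULL;
        if (q->tail)
                q->tail->next = rq;
        else
                q->head = rq;
        q->tail = rq;
}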

The mechanism also allows internal queues to be associated with system classes that group I/O requests coming from important producers such as the VMM. By separating these out, Costa can give them preferential treatment for urgent VM writeout or swaps.

In addition to a weight value (which determines the fraction of I/O bandwidth that a class will receive), the internal queues could also have an associated priority value which determines their relative importance. At a given priority level, all queues could receive I/O bandwidth in proportion to their weights, with the set of queues at a higher level always getting serviced first. Some check for preventing starvation of lower priority queues could be used, similar to the ones used in deadline_iosched.

5 QoS Networking in Linux

Many research efforts have been made in networking QoS (Quality of Service) to provide quality assurance of latency, bandwidth, jitter, and loss rate. With the proliferation of multimedia and quality-sensitive business traffic, it becomes essential to provide reserved quality services (IntServ [23]) or differentiated services (DiffServ [2]) for important client traffic.

The Linux kernel has been offering a well established QoS network infrastructure for outbound bandwidth management, policy-based routing, and DiffServ. Hence, Linux is being widely used for routers, gateways, and edge servers, where network bandwidth is the primary resource to differentiate among classes.

When it comes to Linux as an end server OS, on the other hand, networking QoS has not been given as much attention because QoS is primarily governed by system resources such as CPU, memory, and I/O and less by the network bandwidth. However, when we consider end-to-end service quality, we also should require networking QoS in the end servers, as exemplified by the fair share admission control mechanism proposed in this section.

In the rest of the section, we first briefly introduce the existing network QoS infrastructure of Linux. Then, we describe the design of the fair share admission control in Linux and preliminary performance results.


5.1 Linux Traffic Control, Netfilter, DiffServ

The Linux traffic control [8] consists of queueing disciplines (qdiscs) and filters. A qdisc consists of one or more queues and a packet scheduler. It makes traffic conform to a certain profile by shaping or policing. A hierarchy of qdiscs can be constructed jointly with a class hierarchy to make different traffic classes governed by proper traffic profiles. Traffic can be attributed to different classes by the filters that match the packet header fields. The filter matching can be stopped to police traffic above a certain rate limit. A wide range of qdiscs, ranging from a simple FIFO to classful CBQ or HTB, are provided for outbound bandwidth management, while only one ingress qdisc is provided for inbound traffic filtering and policing [8]. The traffic control mechanism can be used in various places where bandwidth is the primary resource to control. For instance, in service providers, it manages bandwidth allocation shared among different traffic flows belonging to different customers and services based on service level agreements. It also can be used in client sites to reduce the interference between upstream and downstream traffic and to enhance the response time of interactive and urgent traffic.

Netfilter provides sophisticated filtering rules and targets. Matched packets can be accepted, denied, marked, or mangled to carry out various edge services such as firewall, dispatcher, proxy, NAT etc. Routing decisions can be made based on the netfilter markings, so packets may take different routes according to their classes. The qdiscs would enable various QoS features in such edge services when used with Netfilter. Netfilter classification can be transferred for use in later qdiscs by markings or mangled packet headers.

The Differentiated Services (DiffServ) architecture [2] provides scalable QoS by applying per-hop behavior (PHB) collectively to aggregate traffic classes that are identified by a 6-bit code point in the IP header. Classification and conditioning are typically done at the edge of a DiffServ domain. The domain is a contiguous set of nodes compliant to a common PHB. The DiffServ PHB is supported in Linux [22]. Classes, drop precedence, code point marking, and conditioning can be implemented by qdiscs and filters. At the end servers, the code point can be marked by setting the IP_TOS socket option.
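For example, an end server application can mark the code point on a connection's outbound traffic with the standard IP_TOS socket option, as shown below. The Expedited Forwarding code point (DSCP 46) is used here purely as an example value.

/* Mark outbound packets on a socket with a DiffServ code point via
 * IP_TOS.  The DSCP occupies the upper six bits of the TOS byte; the
 * EF code point (46) is used here purely as an example. */
#include <netinet/in.h>
#include <netinet/ip.h>
#include <sys/socket.h>
#include <stdio.h>

static int mark_ef(int sockfd)
{
        int tos = 46 << 2;      /* DSCP EF shifted into the TOS byte */

        if (setsockopt(sockfd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0) {
                perror("setsockopt(IP_TOS)");
                return -1;
        }
        return 0;
}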

In policy based networking [18], a policy agent can configure the traffic classification of edge and end servers according to predefined filtering rules that match layer 3/4 or layer 7 information. Netfilter, qdiscs, and application layer protocol engines can classify traffic for differentiated packet processing at later stages. Classifications at prior stages can be overridden by transaction information such as URIs, cookies, and user identities as they become known. It has been shown that a coordination of server and network QoS can reduce the end-to-end response time of important client requests significantly by virtual isolation from low priority traffic [15].

5.2 Prioritized Accept Queues with Proportional Share Scheduling

We present here a simple change to the existing Linux TCP accept mechanism to provide differentiated service across priority classes. Recent work in this area has introduced the concept of prioritized accept queues [19] and accept queue schedulers using adaptive proportional shares for self-managed web servers [14].

Under certain load conditions [14], the TCP accept queue of each socket becomes the bottleneck in network input processing. Normally, listening sockets fork off a child process to handle an incoming connection request.


Figure 5: Proportional Accept Queue Results.

Some optimized applications such as the Apache web server maintain a pool of server processes to perform this task. When the number of incoming requests exceeds the number of static pool servers, additional processes are forked, up to a configurable maximum. When the incoming connection request load is higher than the level that can be handled by the available server processes, requests have to wait in the accept queue until one is available.

In a typical TCP connection, the client initiates a request to connect to a server. This connection request is queued in a single global accept queue belonging to the socket associated with that server's port. Processes that perform an accept() call on that socket pick up the next queued connection request and process it. Thus all incoming connections to a particular TCP socket are serialized and handled in FIFO order.

We replace the existing single accept queue per socket with multiple accept queues, one for each priority class. Incoming traffic is mapped into one of the priority classes and queued on the accept queue for that priority. There are eight priority classes in the current implementation.

The accepting process schedules connection acceptance according to a simple weighted deficit round robin, picking up connection requests from each queue according to its assigned share. The share, or weight, can be assigned via the sysctl interface.
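The scheduling step can be sketched as a deficit round robin over the eight per-priority accept queues, with each queue's quantum proportional to its assigned share. The names and structure below are illustrative, not the actual patch.

/* Sketch of weighted deficit round robin over per-priority accept
 * queues.  Each queue accumulates credit proportional to its share;
 * a connection is accepted from a queue only while it has credit.
 * Illustrative names, not the actual patch. */
#define NUM_PRIO_CLASSES 8

struct accept_queue {
        int share;              /* weight set via sysctl           */
        int deficit;            /* remaining credit in this round  */
        int pending;            /* queued connection requests      */
};

/* Returns the index of the queue to accept from next, or -1 if all
 * accept queues are empty. */
static int pick_accept_queue(struct accept_queue *q)
{
        int i, pass, any_pending = 0;

        for (i = 0; i < NUM_PRIO_CLASSES; i++)
                if (q[i].pending > 0)
                        any_pending = 1;
        if (!any_pending)
                return -1;                      /* nothing queued */

        for (pass = 0; pass < 2; pass++) {
                for (i = 0; i < NUM_PRIO_CLASSES; i++) {
                        if (q[i].pending > 0 && q[i].deficit > 0) {
                                q[i].deficit--;
                                q[i].pending--;
                                return i;
                        }
                }
                /* Every backlogged queue is out of credit: start a new
                 * round, replenishing deficits in proportion to shares. */
                for (i = 0; i < NUM_PRIO_CLASSES; i++)
                        q[i].deficit += q[i].share;
        }
        return -1;      /* unreachable if at least one share is positive */
}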

In the basic priority accept queue design proposed earlier in [6], starvation of certain priority classes was a possibility as the accepting process picked up connection requests in descending priority order. With the proportional share scheme in this paper, it is easier to avoid starvation of particular classes and to give share guarantees to low priority classes.

The efficacy of the proportional accept queue mechanism is demonstrated by an experiment. In the experiment, we used Netfilter with mangle tables and MARK options to characterize traffic into priority classes based on source IP address. Httperf clients on two machines send requests to an Apache web server running on a single server over independent gigabit Ethernet connections. The only Apache parameter changed from the default was the maximum number of httpd threads, which was set to 50 in the experiment.

Figure 5 shows the throughput of Apache for two priority classes sharing inbound connection bandwidth in the ratio 7:3. We can see that the throughput of the priority class 0 requests is slightly higher than that of the priority class 1 requests when the load is low. As load increases, the acceptance rates of priority classes 0 and 1 will be constrained in proportion to their relative share, which in turn determines the processing rate of the Apache web server and the connection request queueing delay. Under a severe load, the priority class 0 requests are processed at a considerably higher throughput.


6 Controlling Memory

While many other system resources can be managed according to priorities or proportions, virtual memory managers (VMMs) currently do not allow such control. The resident set size (RSS) of each process (the number of physical page frames allocated to that process) determines how often that process incurs a page fault. If the RSS of each process is not somehow proportionally shared or prioritized, then paging behavior can overwhelm and undermine the efforts of other resource management policies.

Consider two processes, A and B, where A is assigned a larger share than B with the CPU scheduler. If A is given too small an RSS and begins to page fault frequently, then it will not often be eligible for scheduling on the CPU. Consequently, B will be scheduled more often than A, and the VMM will have become the de facto CPU scheduler, thus violating the requested scheduling proportions.

Furthermore, it is possible for most existing VMM policies to exhibit a specific kind of degenerative behavior. Once process A from the example above begins to page fault, its infrequent CPU scheduling prevents it from referencing its pages at the same rate as other, frequently scheduled processes. Therefore, its pages become more likely to be evicted, thereby reducing its RSS. The smaller RSS will increase the probability of further page faults. This degenerative feedback loop will cease only when some other process either exits or changes its reference behavior in a manner that reduces the competition for main memory space.

Main memory use must be controlled just as any other resource is. The RSS of each address space (the logical space defined either by a file or by the anonymous pages within the virtual address space of a process) must be calculated as a function of the proportions (or priorities) that are used for CPU scheduling. Because this type of memory management has received little applied or academic attention, our work in this area is still nascent. We present here the structural changes to the Linux VMM necessary for proportional/prioritized memory management; we also present previous, applicable research, as well as future research directions that will address this complex problem. While the proportional/prioritized management of memory is currently less well developed than the other resources presented in this paper, it is necessary that it be comparably developed.

6.1 Basic VMM changes

Consider a function that explicitly calculates the desired RSS for each address space (the target RSS) when the footprints of the active address spaces exceed the capacity of main memory. After this function sets these targets, a system could immediately bring the actual RSS into alignment with these targets. However, doing so may require a substantial number of page swapping operations. Since disk operations are so slow, it is inadvisable to use aggressive, pre-emptive page swapping. Instead, a system should seek to move the actual RSS values toward their targets in a lazy fashion, one page fault at a time. Until the actual RSS of an address space matches its target, it can be labeled as being either in excess or in deficit of its target.

As the VMM reclaims pages, it will do so from address spaces with excess RSS values. This approach to page reclamation suggests a change in the structure of the page lists, specifically the active and inactive lists that impose an order on pages. (In the Linux community, these are known as the LRU lists; however, they only approximate an LRU ordering, and so we refer to them simply as page lists.)


The current Linux VMM uses global page lists. If this approach to ordering pages is left unchanged, then the VMM would have to search the page lists for pages that belong to the desired address spaces that have excess RSS values. Alternatively, one pair of page lists (active and inactive) could exist for each address space. The reclamation of pages from a specific address space would therefore require no searching.

By ordering pages separately within each address space, we also enable the VMM to more easily track reference behavior for each address space. While the information to be gathered would depend on the underlying policy that selects target RSS values, we believe that such tracking may play an important role in the development of such policies.

Note that the target RSS values would need to be recalculated periodically. While the period should be directly proportional to the memory pressure (some measure of the current workload's demand for main memory space), it is a topic of future research to determine what that period should be. By coupling the period for target RSS calculations to the memory pressure, we can ensure that this strategy only incurs significant computational overhead when heavy paging is occurring.

6.2 Proportionally sharing space

Waldspurger [20] describes a method of proportionally sharing the space of a system running VMware's ESX Server between a number of virtual machines. Specifically, as with a proportional share CPU scheduler, memory shares can be assigned to each virtual machine, and Waldspurger's policy will calculate target RSS values for each virtual machine.

Proportionally sharing main memory space can result in superfluous allocations to some virtual machines. If the target RSS for some virtual machine is larger than necessary, and some of the main memory reserved for that virtual machine is rarely used (Waldspurger refers to such space as being idle), then its target RSS should be reduced. Waldspurger addresses this problem with a taxation policy. In short, this policy penalizes each virtual machine for unused portions of its main memory share by reducing its target RSS.

Application to the Linux VMM. This approach to proportionally sharing main memory could easily be generalized so that it applies to address spaces within Linux instead of virtual machines on an ESX Server. Specifically, we must describe how shares of main memory space can be assigned to each address space. Given those shares, target RSS values can be calculated in the same manner as for virtual machines in the original research.
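Under that generalization, the target RSS computation itself is straightforward; the sketch below divides the available page frames among address spaces in proportion to their shares. The names are illustrative, and the taxation adjustment for idle memory described next is deliberately omitted.

/* Sketch: compute per-address-space target RSS values in proportion
 * to assigned memory shares.  Illustrative only; the taxation
 * adjustment for idle memory is omitted here. */
struct addr_space {
        unsigned long share;            /* assigned memory share (weight) */
        unsigned long target_rss;       /* output: target RSS in pages    */
};

static void compute_target_rss(struct addr_space *as, int n,
                               unsigned long total_pages)
{
        unsigned long total_share = 0;
        int i;

        for (i = 0; i < n; i++)
                total_share += as[i].share;
        if (total_share == 0)
                return;                 /* nothing to apportion */

        for (i = 0; i < n; i++)
                as[i].target_rss = total_pages * as[i].share / total_share;
}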

The taxation scheme requires that the system be able to measure the active use of pages in each address space. Waldspurger used a sampling strategy where some number of randomly selected pages for each virtual machine were access protected, forcing minor page faults to occur upon the first subsequent reference to those pages, and therefore giving the ESX Server an opportunity to observe those page uses. The same approach could be used within the Linux VMM, where a random sampling of pages in each address space would be access protected. Alternatively, a sampling of the pages' reference bits could be used to monitor idle memory.

Waldspurger observes that, within a normal OS, the use of access protection or reference bits, taken from virtual memory mappings, will not detect references that are a result of DMA transfers.


However, since such DMA transfers are scheduled within the kernel itself, those references could also be explicitly counted with help from the DMA scheduling routines.

6.3 Future directions

The description above does not address a particular problem: shared memory. Space can be shared between threads or processes in a number of ways, and such space presents important problems that we must solve to achieve a complete solution.

The problem of shared spaces. The assignment of shares to an address space can be complicated when that address space is shared by processes in different process groups or service classes. One simple approach is for each address space to belong to a specific service class. In this situation, its share would be defined only by that service class, and not by the processes that share the space. Another approach would be for each shared address space to adopt the highest share value of the tasks or processes sharing it. In this way, an important process will not be penalized because it is sharing its space with a less important process.

Note that this problem does not apply only to memory mapped files and IPC shared memory segments, but to any shared space. For example, the threads of a multi-threaded process share a virtual address space. Similarly, when a process calls fork(), it creates a new virtual address space whose pages are shared with the original virtual address space using the copy-on-write (COW) mechanism. These shared spaces must be assigned proportional shares even though the tasks and processes using them may themselves have differing shares.

Proportionally sharing page faults. The goal of proportional share scheduling is to fairly divide the utility of a system among competing clients (e.g., processes, users, service classes). It is relatively simple to divide the utility of the CPU because that utility is linear and independent. One second of scheduled CPU time yields a nearly fixed number of executed instructions (assuming no delays to access memory, as memory is a separate resource from the CPU). Therefore, each additional second of scheduled CPU time yields a nearly constant increase in the number of executed instructions. Furthermore, the utility of the CPU does not depend on the computation being performed: every process derives equal utility from each second of scheduled CPU time.

Memory, however, is a more complex resource because its utility is neither independent nor linear. Identical RSS values for two processes may yield vastly different numbers of page faults for each process. The number of page faults is dependent on the reference patterns of each process.

To see the non-linearity in the utility of memory, consider two processes, A and B. Assume that for A, an RSS of p pages will yield m misses, where m > 0. If that RSS were increased to p + q pages, the number of misses incurred by A may take any value m' where 0 ≤ m' ≤ m. (We ignore the possibility of Belady's anomaly [1], in which an increase in RSS could imply an increase in page faults; while this anomaly is possible for any real, in-kernel page replacement policy, it is uncommon and inconsequential for real workloads.) Changes in RSS do not imply a constant change in the number of misses suffered by an address space.

The proportional sharing of memory space, therefore, does not necessarily achieve the stated goal of fairly dividing the utility of a system. Consider that A should receive 75% of the system, while B should receive the remaining 25%. Dividing the main memory space by these proportions could yield heavy page faulting for A but not for B.


Note also that none of the 25% assigned to B may be idle, and so Waldspurger's taxation scheme will not reduce its RSS. Nonetheless, it may be the case that a reduction in RSS by 5% for B may increase its misses only modestly, and that an increase in RSS by 5% for A may reduce its misses drastically.

Ultimately, a system should proportionally share the utility of main memory. We consider this topic a matter of significant future work. It is not obvious how to measure online the utility of main memory for each address space, nor how to calculate target RSS values based on these measurements. Balancing the contention between fairness and throughput for virtual memory must be considered carefully, as it will be unacceptable to achieve fairness simply by forcing some address spaces to page fault more frequently. We do, however, believe that this problem can be solved, and that the utility of memory can be proportionally shared just as with other resources.

7 Conclusion and Future Work

In this paper we make a case for providing kernel support for class-based resource management that goes beyond the traditional per-process or per-group resource management. We introduce a framework for classifying tasks and incoming network packets into classes, monitoring their usage of physical resources and controlling the allocation of these resources by the kernel schedulers based on the shares assigned to each class. For each of the four major physical resources (CPU, disk, network and memory), we discuss ways in which proportional sharing could be achieved using incremental modifications to the corresponding existing schedulers.

Much of this work is in its infancy and the ideas proposed here serve only as a starting point for future work and for discussion in the kernel community. Prototypes of some of the schedulers discussed in this paper are under development and will be made available soon.

8 Acknowledgments

We would like to thank team members from the Linux Technology Center, particularly Theodore Ts'o, for their valuable comments on the paper and the work on individual resource schedulers. Thanks are also in order for numerous suggestions from the members of the kernel open-source community.

References

[1] L. A. Belady. A study of replacement algorithms for virtual storage. IBM Systems Journal, 5:78–101, 1966.

[2] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An Architecture for Differentiated Services. RFC 2475, Dec 1998.

[3] Jonathan Corbet. A new deadline I/O scheduler. http://lwn.net/Articles/10874.

[4] Jonathan Corbet. Anticipatory I/O scheduling. http://lwn.net/Articles/21274.

[5] Jonathan Corbet. The Continuing Development of I/O Scheduling. http://lwn.net/Articles/21274.

[6] IBM DeveloperWorks. Inbound connection control home page. http://www-124.ibm.com/pub/qos/paq_index.html.


[7] Pawan Goyal, Xingang Guo, and Harrick M. Vin. A Hierarchical CPU Scheduler for Multimedia Operating Systems. In Usenix Association Second Symposium on Operating Systems Design and Implementation (OSDI), pages 107–121, 1996.

[8] Bert Hubert. Linux Advanced Routing & Traffic Control. http://www.lartc.org.

[9] Sitaram Iyer and Peter Druschel. Anticipatory scheduling: A disk scheduling framework to overcome deceptive idleness in synchronous I/O. In 18th ACM Symposium on Operating Systems Principles, October 2001.

[10] M. Kravetz, H. Franke, S. Nagar, and R. Ravindran. Enhancing Linux Scheduler Scalability. In Proc. 2001 Ottawa Linux Symposium, Ottawa, July 2001. http://lse.sourceforge.net/scheduling/ols2001/elss.ps.

[11] Paul E. McKenney. Stochastic Fairness Queueing. In INFOCOM, pages 733–740, 1990.

[12] Ingo Molnar. Goals, Design and Implementation of the new ultra-scalable O(1) scheduler. In 2.5 kernel source tree documentation (Documentation/sched-design.txt).

[13] Jason Nieh, Chris Vaill, and Hua Zhong. Virtual-time round-robin: An O(1) proportional share scheduler. In 2001 USENIX Annual Technical Conference, June 2001.

[14] P. Pradhan, R. Tewari, S. Sahu, A. Chandra, and P. Shenoy. An Observation-based Approach Towards Self-Managing Web Servers. In IWQoS 2002, 2002.

[15] Quality of Service White Paper. Integrated QoS: IBM WebSphere and Cisco Can Deliver End-to-End Value. http://www-3.ibm.com/software/webservers/edgeserver/doc/v20/QoSwhitepaper.pdf.

[16] Prashant J. Shenoy and Harrick M. Vin. Cello: A disk scheduling framework for next generation operating systems. In ACM SIGMETRICS 1998, pages 44–55, Madison, WI, June 1998. ACM.

[17] Antonio Vargas. fairsched + O(1) process scheduler. http://www.ussg.iu.edu/hypermail/linux/kernel/0304.0/0060.html.

[18] Dinesh Verma, Mandis Beigi, and Raymond Jennings. Policy based SLA Management in Enterprise Networks. In Policy Workshop, 2001.

[19] T. Voigt, R. Tewari, D. Freimuth, and A. Mehra. Kernel Mechanisms for Service Differentiation in Overloaded Web Servers. In 2001 USENIX Annual Technical Conference, Jun 2001.

[20] Carl A. Waldspurger. Memory resource management in VMware ESX Server. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, December 2002.

[21] Carl A. Waldspurger and William E. Weihl. Stride scheduling: Deterministic proportional-share resource management. Technical Report MIT/LCS/TM-528, 1995.

[22] Werner Almesberger, Jamal Hadi Salim, and Alexey Kuznetsov. Differentiated Services on Linux. In Globecom, volume 1, pages 831–836, 1999.


[23] J. Wroclawski. The Use of RSVP with IETF Integrated Services. RFC 2210, Sep 1997.

Trademarks and Disclaimer

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM is a trademark or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

Linux is a trademark of Linus Torvalds.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other trademarks are the property of their respective owners.
