;login: THE MAGAZINE OF USENIX & SAGE
October 2000 • volume 25 • number 6

CONFERENCE REPORT

USENIX 2000 Annual Technical Conference

USENIX, The Advanced Computing Systems Association & SAGE, The System Administrators Guild



Because the Freenix track was run with the same rigor as the general refereed-papers track, they also presented awards:

Best paper: “An Operating System in Java for the Lego Mindstorms RCX Microcontroller” by Pekka Nikander

Best student paper: “Protocol Independence Using the Sockets API” by Craig Metz

Andrew Hume, immediate past president of the USENIX Association (and now vice president), announced the two annual USENIX awards:

Software Tools User Group (STUG) Award: Tatu Ylönen, for the initial design and creation of “ssh”, the Secure Shell.

The Flame: Outstanding Achievement Award, given posthumously to W. Richard Stevens for his work as an innovator, teacher, and active member of the USENIX community (accepted on his behalf by his sister, his widow, and his children)

KEYNOTE ADDRESS

Bill Joy presented the keynote address on his vision of the future of technology.

Based on his 25 years of experience, Joy forecasted the next 25–30 years in computing. He started by looking back at the history: the eventual acceptance of software as research in computer science, the integration of networks and the operating system, and the growth of portability in computing. More and more we’ll see standards defined in English, perhaps passing code or agents instead. He also sees the continued need for maintaining compatibility with open protocols and specifications, noting that protocols often outlive the hardware systems for which they were designed.

Looking forward, Joy believes that Moore’s Law will continue. He expects to see up to a 10^12 improvement over 30 years, based in part on molecular electronics and improved algorithms.

USENIX ANNUAL TECHNICAL CONFERENCE
JUNE 18–23, 2000
SAN DIEGO, CA

Opening Session
Summary by Josh Simon

ANNOUNCEMENTS

Christopher Small announced that there were 1,730 attendees as of the start of the technical sessions. The program committee received 91 refereed-paper submissions (up from 63 in 1999) and accepted 27 (up from 23 in 1999). Next year USENIX will be in Boston.

The best paper awards were:

Best paper: “Scalable Content-aware Request Distribution in Cluster-based Network Servers” by Mohit Aron, Darren Sanders, Peter Druschel, and Willy Zwaenepoel

Best student papers (tie): “Integrating a Command Shell into a Web Browser” by Robert C. Miller and Brad A. Myers, and “Virtual Services: A New Abstraction for Server Consolidation” by John Reumann, Ashish Mehra, Kang G. Shin, and Dilip Kandlur

Kirk McKusick, chair of the Freenix committee, spoke about that track. The committee received 56 refereed paper submissions and accepted 29 of them. Most of the papers were open source–related.

This issue’s report is on the USENIX Annual Technical Conference held in San Diego, California, June 18–23, 2000.

Thanks to the summarizers:

Doug Fales
Rik Farrow
Kevin E. Fu
Matt Grapenthien
Bob Gray
Josh Kelley
Jeff Schouten
Josh Simon
Craig Soules
Gustavo Vegas

And check out Peter Salus’ impressions of the conference on page 34.

A collection of photographs taken at the conference can be found at <http://www.usenix.org/publications/library/proceedings/usenix2000/photos.html>.


[Photo: Christopher Small]

The question of synchronization between different geographies becomes hard when you can store 64 TB in a device the size of your ballpoint pen. We need to improve resilience and autonomy for the administration of these devices to be possible.

Further, he sees six webs of organization of content: near, the Web of today, used from a desktop; far, entertainment, used from your couch; here, the devices on you, like pagers and cell phones and PDAs; and weird, such as voice-based computing for tasks like driving your car and asking for directions. These four would be the user-visible webs; the remaining would be e-business, for business-to-business computing, and pervasive computing, such as Java and XML.

Finally, reliability and scalability will become even more important. Not only will we need hardware fault tolerance but also software fault tolerance. In addition, we need to work toward a distributed-consensus model so there’s no one system in charge of a decision in case that system is damaged. This leads into the concepts of Byzantine fault tolerance and the genetic diversity of modular code. We also need to look into the fault tolerance of the user; for example, have the computer assist the user who has forgotten her password.

Refereed Papers Track

SESSION: INSTRUMENTATION AND VISUALIZATION

Summarized by Josh Simon

MAPPING AND VISUALIZING THE INTERNET

Hal Burch, Carnegie Mellon University; Bill Cheswick and Steve Branigan, Bell Labs Research, Lucent Technologies

We need tools to be able to map networks of arbitrarily large size, for tomography and topography. This work is intended to complement the work of CAIDA. So Cheswick et al. developed tools UNIX-style, using C and shell scripts, to map the Internet as well as the Lucent intranet. The tools scan up to 500 networks at once and are throttled down to 100 packets per second. This generates 100–200MB of text data (which compresses to 5–10MB) per day and covers on the order of 120,000 nodes. See <http://www.cs.bell-labs.com/who/ches/map/> for details and maps.
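The throttling described here, at most 100 probe packets per second across up to 500 concurrent scans, is the classic token-bucket pattern. Below is a minimal Python sketch; the class name, interface, and injectable clock are illustrative and not from the authors' actual tools (which were C and shell scripts).

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: permits at most `rate` sends per second,
    with a burst of at most one second's worth of tokens."""

    def __init__(self, rate, now=time.monotonic):
        self.rate = float(rate)
        self.tokens = float(rate)   # start full: allows an initial burst
        self.now = now              # injectable clock (eases testing)
        self.last = now()

    def try_send(self):
        """Return True if a probe may be sent now, consuming one token."""
        t = self.now()
        # Refill tokens for elapsed time, capped at one second's allowance.
        self.tokens = min(self.rate, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A mapper would call `try_send()` before emitting each probe and sleep briefly whenever it returns False.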

MEASURING AND CHARACTERIZING SYSTEM BEHAVIOR USING KERNEL-LEVEL EVENT LOGGING

Karim Yaghmour and Michel R. Dagenais, Ecole Polytechnique de Montréal

Karim Yaghmour spoke on the problem of visualizing system behavior. ps and top are good, but neither provides truly realtime data. He therefore developed a kernel trace facility with a daemon that logs to a file. He instrumented the Linux kernel to trace the events, and then performs offline analysis of the data. The tools do not add much overhead for server-side operations but do add a lot of overhead to intensive applications such as the Common Desktop Environment (CDE). Data is collected at up to 500kb per second, but it compresses well. Future work includes quality-of-service kernels (throttling the rate of, for example, file opens), security auditing, and even integrating the event facility further into the kernel. Sources are available at <http://www.opersys.com/LTT> and are released under the GNU General Public License.

PANDORA: A FLEXIBLE NETWORK MONITORING PLATFORM

Simon Patarin and Mesaac Makpangou, INRIA SOR Group, Rocquencourt

Patarin and Makpangou’s goal was to produce a flexible network-monitoring platform with online processing, good performance, and no impact on the environment. The privacy of users was also important in the design. They decided to use components for flexibility and a stack model. They developed a small configuration language and a dispatcher that coordinates the creation and destruction of the components. The tool is 15,000 lines of C++, using libpcap. The overhead is about 0.26 microseconds per filter per packet. For example, HTTP requests get over 75Mb/s throughput on traces, which translates into 44–88Mb/s in real-world situations, or 600–2,600 requests per second. Future work includes improving the performance and flexibility. More details are available from <http://www-sor.inria.fr/projects/relais>; the code is released under the GPL.
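The component stack described here can be sketched as a chain of per-packet filters, each of which may transform or drop a packet before handing it on. Everything below (the names, the packet-as-dict representation) is illustrative rather than Pandora's actual C++ API.

```python
def build_stack(stages, sink):
    """Compose monitoring components into one push function.

    `stages` is an ordered list of callables; each takes a packet and
    returns a (possibly transformed) packet, or None to drop it.
    Surviving packets are appended to `sink`.
    """
    def push(packet):
        for stage in stages:
            packet = stage(packet)
            if packet is None:      # a component filtered the packet out
                return
        sink.append(packet)
    return push

# Example components: keep only port-80 traffic, then strip to the payload.
http_only = lambda p: p if p.get("port") == 80 else None
payload = lambda p: p["payload"]
```

A dispatcher in this style would read the configuration language, instantiate the stage list, and tear it down again when the configuration changes.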

INVITED TALKS

COMPUTER SYSTEM SECURITY: IS THERE REALLY A THREAT?

Avi Rubin, AT&T Research

Summarized by Rik Farrow

Avi Rubin began his talk by alluding to the February 2000 distributed denial-of-service (DDoS) attacks, saying that those attacks just about made this talk redundant. But perhaps the theme of his talk was that not enough has been learned about mitigating the threats to security, that lessons have not been learned from the past.

Rubin used the Internet Worm to illustrate this point. In November of 1988, Robert T. Morris released the Worm, a relatively small piece of code that used the Internet to copy itself to over 6,000 systems. At the time, those 6,000 systems comprised a large percentage of hosts on the Internet. Damage estimates fell between $10,000 and $97 million. Recovering from the Worm was difficult, as most Internet-connected sites relied on email for communication, and email couldn’t get through while Worms were executing.

The Worm used three mechanisms to obtain access to networked systems: a buffer overflow in the finger server, a backdoor still in sendmail (debug), and the “r” commands. The Worm would crack passwords, using a list of 432 “common” passwords, so it could assume other user identities for use with the remote shell.
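The Worm's dictionary attack amounts to hashing each word in its list and comparing against stored password hashes. Here is a defensive sketch of the same idea, auditing accounts against a common-password list; the Worm worked against crypt(3) hashes, so SHA-256 here is purely illustrative, as is the four-word stand-in for its 432-word list.

```python
import hashlib

COMMON_PASSWORDS = ["password", "wizard", "guest", "aaa"]  # illustrative list

def weak_accounts(shadow):
    """Return {user: guessed password} for stored hashes that match a
    common word, i.e., the accounts the Worm's strategy would have won."""
    table = {hashlib.sha256(w.encode()).hexdigest(): w
             for w in COMMON_PASSWORDS}
    return {user: table[h] for user, h in shadow.items() if h in table}
```

An administrator running a check like this can force password changes before an attacker tries the same dictionary.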

Today, the finger server has been patched, and sendmail’s debug backdoor has been commented out (except for a short period in 1994 when a Sun engineer accidentally reenabled it). But the “r” commands are still in use today, and SSH, when used as an “r”-command replacement (with trust), is still vulnerable to Worm-style attacks.

But the biggest problem, then and now, was homogeneity. The Worm targeted Sun servers. (A few VMS systems were also affected.) Microsoft systems present the current homogeneous environment.

Rubin went on to talk about several recent virus attacks. Rubin had firsthand experience with HAPPY99, as it infected his mom’s system. HAPPY99 is very polite and very widespread. It keeps a list of infected systems, can be commanded to remove itself, and does nothing else. It will even replace the DLL (Dynamic Link Library) it modifies with a copy of the original.

The Melissa virus caused many millions in damages. Named after a dancer in Florida, Melissa first appeared on alt.sex. Because Melissa is written in Visual Basic, it is easy to modify even if you don’t know VB: you can just change its message or various strings, and it will still work (and have a different antivirus signature as well).

Rubin listed the reasons why Melissa was so successful (besides its promise of a list of hot porn sites):

■ Many people use the same mailer (Outlook)

■ MSWord on Windows on almost all systems

■ Many people click okay to macro warnings

■ No separation between MSWord and OS

■ Virus protection only works against known viruses

“If every time you received a phone call, there was a chance that an appliance, such as your toaster, could explode, you would likely not put up with it. For some reason, people are willing to tolerate the potential loss of all their data when they receive an email,” said Rubin.

There have been other interesting viruses, such as Russian New Year and Bubbleboy. Russian New Year invokes Excel to execute any program on the system. Bubbleboy was the first email virus that did not require an attachment, but it did display a dialog prompting the user to delete update.hta instead of doing it itself via code, which it could have done. Bubbleboy represents the killer transport mechanism, as it could install any software it wanted. Actually, Microsoft makes this easy, as there are system calls that upload files (one call), then execute them (a second call).

In summary, Rubin suggested that there were certain deficiencies in our current security model. Besides the homogeneity of desktops, Rubin said, “This seems to be the theme, security by pop-up windows. Do you want to continue? The default is YES.” He also suggested the use of Digest Authentication in HTTP to prevent replay attacks, and said that SSL was the right place to do encryption, at the top of the IP stack, where you can tell it is working.

Rubin stated, “Trinoo [a DDoS tool] requires a password (people can be very proprietary about their daemons).” He went on to recommend that people remember to make backups, so if something does go awry, you can at least recover.

There were a number of questions. Chris Harrison, Whitehead Institute, asked: “Why haven’t we yet seen someone who has caused some really serious damage on a national scale?” Rubin answered, “I don’t have an answer to that. Could have been much, much worse. We need a sociologist to answer that.”

Someone asked about the patch to Outlook. Rubin said that he would install it, but what surprises him is how fast MS produced a patch (440K).

Someone asked two questions: How much risk do you see from polymorphic viruses? How much risk from cross-platform viruses? Rubin answered, “I think polymorphic viruses are harder to detect, making them more dangerous. As for cross-platform, right now all viruses and worms are on MS platforms, but when cell phones start including command interpreters, this problem will spread as well.”

Dan Geer, now CTO of @stake (which bought the l0pht), asked: What is your opinion about closing security holes by advertising them? Rubin answered by saying that there are “days I wake up feeling we should advertise vulnerabilities, and other days where I think this is a bad idea.” The NIPC (National Infrastructure Protection Center) is planning on creating industry forums for exchanging security information and rumors. Rubin said, “We need a way to disseminate information to all of the good guys and none of the bad guys.” (laughter)

There were several more questions, including comments about Linux or UNIX users having their inboxes crammed full of impotent email viruses, and a suggestion to use litigation to force better adherence to security practices.


FREENIX TRACK

SESSION: STORAGE SYSTEMS

Summarized by Kevin E. Fu

SWARM: A LOG-STRUCTURED STORAGE SYSTEM FOR LINUX

Ian Murdock and John H. Hartman, University of Arizona

Ian Murdock described Swarm, a log-structured interface to a cluster of storage devices in Linux. Filesystem interfaces are often tightly coupled to the interface of the underlying storage system (e.g., requiring a block-oriented storage device). Moving away from this tendency, Swarm provides a configurable and extensible infrastructure that may be used to build high-level storage abstractions and functionality. The intent was to build more than just a research prototype.

Swarm is a storage system, not a filesystem. However, Murdock presented two filesystems that build upon Swarm. Sting is a log-structured filesystem similar to Sprite’s LFS. Ext2fs/Swarm is an unmodified Ext2 filesystem on top of a special logical Swarm device.

Several goals guided the design of Swarm:

■ Scalability. As the network grows, scale to demands.

■ High performance. Take advantage of the network-disk bottleneck.

■ High availability. Handle failures gracefully.

■ Cost-performance. Run on commodity hardware.

■ Flexibility.

Swarm moves the high-level abstractions from the server to the client. The server becomes a simple repository for data while clients implement the bulk of functionality. By clustering devices, Swarm removes centralization. In this way, Swarm avoids bottlenecks and single points of failure.


Clients use a striped, append-only log abstraction to store data on servers. This combines log-structured properties with RAID. Each client maintains its own log to eliminate synchronization. The log allows reconstruction of data from server crashes and client failures.

Logs are divided into fixed-size pieces called fragments, which are striped across storage servers. Parity is computed for redundancy. The log is append-only, conceptually infinite, and consists of an ordered stream of blocks and records.
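Striping fragments with a parity fragment works like single-parity RAID: XOR all data fragments to get parity, and XOR the survivors to rebuild any one lost fragment. A small sketch of that arithmetic, with helper names invented for the example:

```python
from functools import reduce

def xor(a, b):
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(fragments):
    """Append an XOR parity fragment over equal-sized log fragments."""
    assert len({len(f) for f in fragments}) == 1, "fragments must be same size"
    return fragments + [reduce(xor, fragments)]

def reconstruct(stripe, lost):
    """Rebuild the fragment at index `lost` from the surviving ones."""
    return reduce(xor, [f for i, f in enumerate(stripe) if i != lost])
```

Because the log is append-only, parity can be computed once per full stripe rather than read-modify-written as in an update-in-place RAID5.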

Visit <http://www.cs.arizona.edu/swarm/> for more information.

Q: What is the performance-to-availability tradeoff? A: Write performance is good because Swarm can batch small writes. Large-write performance is also good. Read performance is not as good as it should be, but this is an artifact of the implementation. The log structure gives both high availability and high performance.

Q: Is there a mechanism for synchronization of a shared resource? A: We are working on a distributed lock service.

Q: How does the cleaner know what is cleanable? A: The cleaner is not aware of what is in the fragments, but it can know what parts of the fragment are alive. Agents participate in the cleaning process.

Q: You might consider a Hierarchical Storage Management (HSM) design instead of cleaning the tapes. A: Cleaning is a very interesting topic we are exploring. Your suggestion might be useful.

DMFS: A DATA MIGRATION FILE SYSTEM FOR NETBSD

William Studenmund, Veridian MRJ Technology Solutions

Bill Studenmund talked about the Data Migration File System (DMFS), designed at the NASA Ames Research Center.

Implemented for NetBSD, DMFS stores files on both disk and tape. This allows an administrator to transparently archive data to tape without affecting users.

NAStore 3 is a storage system at NASA’s Numerical Aerospace Simulation facility. It consists of three main systems: the volume-management system for tape robotics, the virtual volume-management system to avoid small writes to tape by aggregating data, and the data-migration filesystem. Studenmund focused on DMFS for the remainder of the talk.

DMFS is a layered filesystem that places its migration code between kernel accesses and the underlying filesystem. The system works with the Berkeley Fast File System (FFS), FFS with soft updates, and any other filesystem with a fixed number of inodes. The layering decouples development of DMFS from that of any FFS-related development.

To have control over archiving and restoring, there are a number of new fcntl operations, such as set-archive-in-progress and set-restore-in-progress. Also, a modified ls command denotes whether a file is unarchived, archived and resident, or archived and nonresident.

When a user asks for a file stored on tape, a DMFS daemon blocks the user process until the file is sufficiently restored from tape or the restoration fails. However, a user can still interrupt the process by hitting control-C. There is also a utility to force restoration of files from tape onto the resident disk. Writes also block until restoration finishes. The archive subsystem stores whole files on tape; therefore, it is necessary to read from tape to piece together the unmodified parts of a file. To initiate an archival process, one can either run a user utility or let the daemon schedule an archive.

The performance cost of using DMFS is minimal. There is an overhead of 1–2% to access resident files. However,

Studenmund jokingly admitted that, for obvious reasons, vi takes about five minutes to start when the data is retrieved from tape.

Q: Looking at HSM, it seems that some files are active, while others are accessed only once a month. Is there a way to mark a file as readable but not actually perform the restoration? Say, classify files as active or read-once-a-month? A: This is not supported by DMFS, but it sounds reasonably easy to add to the system.

Q: You added system calls to deal with opening/closing files. Why not use a new filesystem with vnodes instead? A: We added only three system calls. Data passed to open is sent to the name-lookup system before the filesystem gets the request. Using a special filesystem that would take file handles as the “names” of files would not work. open(2), stat(2), and statfs(2) take nul-terminated strings to name files. File handles have an embedded length and can contain nuls. Thus these system calls will not read in the correct amount of data for such a special filesystem to be able to work.
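The objection in that answer can be demonstrated concretely: a nul-terminated name lookup stops at the first NUL byte, so a file handle carrying a length prefix and embedded NULs would be silently truncated. A tiny illustration, with made-up handle bytes:

```python
def c_string(buf):
    """What a nul-terminated interface like open(2)'s path argument would
    actually consume: the bytes up to (not including) the first NUL."""
    i = buf.find(b"\x00")
    return buf if i < 0 else buf[:i]

# A hypothetical file handle: a binary length prefix plus embedded NULs.
handle = b"\x0c\x00fh-data\x00xyz"
```

Passing `handle` through `c_string` keeps only the single byte before the first NUL, which is why the handle-as-name scheme cannot work through these system calls.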

Q: Can you mark files as preferred-online rather than archival? A: Everything is done in response to the migration daemon. You could modify the daemon to control this. It is certainly feasible.

A 3-TIER RAID STORAGE SYSTEM WITH RAID1, RAID5, AND COMPRESSED RAID5 FOR LINUX

K. Gopinath, Nitin Muppalaneni, N. Suresh Kumar, and Pankaj Risbood, Indian Institute of Science, Bangalore

K. Gopinath discussed the design and implementation of a host-based driver for a 3-tier RAID storage system. The tiers include a small RAID1, a larger RAID5, and a compressed RAID5. The system migrates the most frequently accessed data to the RAID1 tier. The project was motivated by the need for fast, reliable, efficient, quickly recoverable, and easily manageable storage.

Gopinath and his students previously developed the system for Solaris but are now developing in the context of Linux. The team ran into difficulty with the Linux device-driver framework. Linux has some support to logically integrate multiple disks as one logical disk, but this breaks its own device-driver framework.

Q: A lot of RAID performance is guided by stripe size. How do you support changing stripe size? A: We do not, but we considered issues of the filesystem informing the lower layer (volume manager) about stripe sizes and, in the reverse direction, about the actual capacities available because of compression.

SESSION: FILE SYSTEMS

Summarized by Kevin E. Fu

A COMPARISON OF FILE SYSTEM WORKLOADS

Drew Roselli and Jacob R. Lorch, University of California at Berkeley; Thomas E. Anderson, University of Washington

Drew Roselli presented an analysis of filesystem traces from a variety of environments under modern workloads. The goal was to understand how modern workloads affect the performance of filesystems.

Traces were collected on four workloads: an HP-UX instructional workload with 20 hosts for eight months (INS), an HP-UX research workload with 14 hosts for one year (RES), an HP-UX Web database and server with one host for six weeks (WEB), and a personal-computing workload with eight NT hosts for four to nine months (NT). To normalize the measurements, only 31 days of each trace were used. None of the workloads reached a steady state because more files were created than deleted. Drawing chuckles from the crowd, Drew noted that “disk sales confirm this result.”

The authors developed a new metric for measuring lifetimes. This involves tracking all files created during the trace and ignoring files created toward the end of the trace window. This measurement reveals that for many workloads, the block lifetime is less than the 30-second write delay used by many systems.
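The metric can be paraphrased as: record each file's creation time, but censor files created within some margin of the end of the trace, since their deletions may simply fall outside the observed window. A sketch with invented event tuples (this is a paraphrase of the idea, not the authors' analysis code):

```python
def block_lifetimes(events, trace_end, margin):
    """Compute lifetimes (delete time minus create time) for files created
    during the trace, ignoring creations within `margin` of the end."""
    created, deleted = {}, {}
    for ts, op, name in events:
        if op == "create":
            created[name] = ts
        elif op == "delete":
            deleted[name] = ts
    return {name: deleted[name] - t
            for name, t in created.items()
            if t <= trace_end - margin and name in deleted}
```

Against a 30-second write-back delay, any lifetime below 30 means the block could have died in the cache without ever needing to reach disk.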

On to the results. Each workload has its own unique shape. Systems with daemons tend to have a knee because of activities such as overwriting Web-browser cache files. Other operating systems tend to either delete immediately or wait until space fills (e.g., emptying the recycle bin). The deletion sweet spot centers around one hour on all but the NT workload. The NT workload reads and writes more than twice the amount of the RES or INS workload. Common to all workloads is that overwrites cause the most significant fraction of deleted blocks, and overwrites show substantial locality. The results also show that a small cache is very effective for decreasing disk reads.

Because many processes memory-map a small set of files, these files are usually cached. The authors also reconfirmed a bimodal distribution where many files are repeatedly written without being read while many are repeatedly read without being written. This buys predictability in post-cache access patterns.

The UNIX traces are publicly available at <http://tracehost.cs.berkeley.edu/traces.html>.

Q: How often should such trace studies be done? Can we subscribe to something that collects and publishes such traces? A: If you’re thinking of collecting traces, don’t do it. Read others’ traces instead. Workloads change over time.

Q: Did you filter out access to shared libraries? Or is this included in your trace? A: There isn’t much left if you take out the shared libraries, so we kept them in the study.

Q: Are there traces of large sequential flows? This is important to manufacturers. I’m willing to pay for the traces. A: Usually such traces do not exist because people are concerned about privacy. [Margo Seltzer from Harvard stood up and said she can help in this area. Contact her offline.]

Q: Are filesystems generally designed with database workloads in mind? A: The Web server has a database in it. IBM compared database workloads with Sprite. There is not a big difference between their results and Sprite’s.

Q: You mentioned that your tested filesystem is not in steady state. How much is it expanding? A: We reported this in the paper submission, but we thought it was too wordy, so we removed it. I can’t remember the growth off the top of my head, but it was somewhere around 30GB.

FIST: A LANGUAGE FOR STACKABLE FILE SYSTEMS

Erez Zadok and Jason Nieh, Columbia University

Erez Zadok discussed FiST, a language for stackable filesystems. Building a filesystem is hard! The kernel is a hostile environment; portability is a pain; development is time-consuming; and maintaining the code is costly. A stackable filesystem makes code extensible and modular. To create a new filesystem, one does not have to remember all the low-level details.

Early research suggested creating new APIs such as stackable vnodes. However, this suffers from several problems, including: modifying each operating system, rewriting existing filesystems, taking a performance hit, using a nonportable API, and having no high-level functionality. A stackable filesystem alone is not enough. It still leaves much kernel work to do.

The FiST approach consists of a language and a set of templates. The language contains simple, high-level primitives while C templates provide kernel-level abstractions. The fistgen program outputs code given a base template (basefs) and a FiST input file. Erez dubbed this system “a YACC for filesystems.”

Without stacking, a user system call is translated into a filesystem call. With stacking, there is a translation between the filesystem and basefs. The translation can modify results and arguments. This allows for feature additions such as file attributes and distributed filesystems.
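The stacking idea, intercepting each operation to transform arguments on the way down and results on the way up, can be shown with a toy in-memory layer. `Rot13FS` below is merely a stand-in for something like Cryptfs; none of these classes reflect FiST's real generated C code.

```python
class BaseFS:
    """Toy lower-level filesystem: a dict of name -> contents."""
    def __init__(self):
        self.files = {}
    def write(self, name, data):
        self.files[name] = data
    def read(self, name):
        return self.files[name]

ROT13 = str.maketrans(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM")

class Rot13FS:
    """Stackable layer: transforms arguments going down (write) and
    results coming up (read), leaving the lower filesystem unaware."""
    def __init__(self, lower):
        self.lower = lower
    def write(self, name, data):
        self.lower.write(name, data.translate(ROT13))
    def read(self, name):
        return self.lower.read(name).translate(ROT13)
```

A language like FiST generates this interception boilerplate for every filesystem operation, so the author writes only the transformation.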

The authors compared the development of FiST-based filesystems with filesystems written from scratch. Four filesystems were implemented under the various models:

■ Snoopfs: warns of possible unauthorized access

■ Cryptfs: transparent encryption filesystem

■ Aclfs: adds ACLs to files

■ Unionfs: joins the content of two directories

The measured systems include filesystems written from scratch in C, written using the basefs templates, written using the wrapfs templates, and written using FiST.

The authors found that the size of newly written code is one to two orders of magnitude smaller when written with FiST. Development time is shortened by a factor of seven compared to writing from scratch. Meanwhile, the performance overhead to enable stacking ranges between 0.8% and 2.1%.

For more information, see <http://www.cs.columbia.edu/~ezk/research/fist/> or email <[email protected]>.

Q: Could you explain the difference in cryptfs results between basefs and wrapfs? Why the strange spike in the graph? A: Wrapfs performs unnecessary memory copies.

Q: How do you handle locking? A: Filesystem templates take care of nasty details like locking and reference counting. We use whatever the VFS provides. Special locking primitives could be added to the FiST language.

Q: What is your intention now that you have built a filesystem framework? Do you intend to write an original filesystem using FiST? A: We have plans to extend the language for low-level filesystems on various storage media.

Q: How do you handle traditional filesystem issues such as consistency and recovery? A: Stacking is completely independent of what happens above and below the stacking. You don’t have to know about details.

JOURNALING VERSUS SOFT UPDATES: ASYNCHRONOUS METADATA PROTECTION IN FILE SYSTEMS

Margo I. Seltzer, Harvard University; Gregory R. Ganger, Carnegie Mellon University; M. Kirk McKusick, Author & Consultant; Keith A. Smith, Harvard University; Craig A. N. Soules, Carnegie Mellon University; Christopher A. Stein, Harvard University

Chris Stein compared two technologies for improving the performance of metadata operations and recovery: journaling and soft updates. Journaling uses an auxiliary log to record metadata operations, while soft updates uses ordered writes to ensure metadata consistency.

Stein discussed three properties for reliability:

■ Integrity. Metadata is always recoverable.

■ Durability. Persistence of metadata.

■ Atomicity. No partial metadata visible after recovery.

The metadata update problem concerns proper ordering of operations such that a filesystem can recover from a crash. There are three general approaches: synchronous writes, soft updates, and journaling. The Berkeley Fast File System (FFS) uses synchronous writes to maintain consistency. That is, an operation blocks until metadata are safely written to disk. Synchronous writes significantly impair the ability of a filesystem to achieve high performance. On the other hand, soft updates and journaling use asynchronous writes.

Soft updates uses delayed metadata writes to improve performance. However, delayed writes alone could change the order of writes. Therefore, soft updates maintains dependency information to order writes. That is, before a delayed write is written to disk, it must satisfy certain dependency constraints. Soft updates is not durable or atomic; it has weak guarantees on when updates flush to disk; and it requires no recovery process after a crash.
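The dependency bookkeeping boils down to: never schedule a delayed write before the writes it depends on (for instance, an inode must reach disk before the directory entry that names it). That ordering can be sketched as a depth-first topological sort; the block names are invented, and real soft updates also handles cyclic dependencies by rolling back entries, which this sketch assumes away.

```python
def flush_order(blocks, deps):
    """Order delayed metadata writes so dependencies reach disk first.

    `deps[b]` lists the blocks that must be written before block `b`.
    Assumes dependency cycles have already been broken.
    """
    order, done = [], set()

    def visit(block):
        if block in done:
            return
        done.add(block)
        for dep in deps.get(block, []):   # flush prerequisites first
            visit(dep)
        order.append(block)

    for block in blocks:
        visit(block)
    return order
```

Any schedule respecting this order leaves the on-disk metadata consistent at every instant, which is why no recovery pass is needed after a crash.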

Journaling writes all metadata updates to a log in such a way as to guarantee recoverability. It can perform asynchronously, with write-ahead logging (WAL) to ensure that the metadata log is written to disk before any associated data. After a crash, the system can scan this log to recover metadata consistency. This provides atomicity. With synchronous journaling, the filesystem will have durability. Asynchronous mode results in higher performance at the cost of durability.
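The write-ahead discipline is: append the log record (and force it to disk) before touching the metadata in place; after a crash, replaying the log restores a consistent state. A minimal in-memory sketch, where the crash flag is purely a test hook and not part of any real journaling interface:

```python
class JournaledStore:
    """Toy write-ahead log: log first, apply second, replay to recover."""

    def __init__(self):
        self.log = []     # stand-in for the on-disk journal (forced first)
        self.state = {}   # stand-in for in-place metadata structures

    def update(self, key, value, crash_before_apply=False):
        self.log.append((key, value))   # 1. record reaches the log first
        if crash_before_apply:
            return                      # simulated crash: in-place write lost
        self.state[key] = value         # 2. in-place metadata update

    def recover(self):
        """Scan the journal and reapply every record (idempotent)."""
        for key, value in self.log:
            self.state[key] = value
```

Synchronous journaling forces each log append before the operation returns (durability); asynchronous journaling batches the appends, keeping atomicity but allowing the last few operations to vanish in a crash.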

The authors performed a microbenchmark consisting of several stages reminiscent of the LFS benchmark. After creating a directory hierarchy, the benchmark writes, reads, and deletes either 128MB of data or 512 files, whichever generates the most files. There is also a test on zero-length files to stress metadata operations.

Macrobenchmarks included four workloads:

■ An SSH build (similar to the Modified Andrew Benchmark)

■ A netnews server (large working set, no locality)

■ The SDET software-development environment (timesharing)

■ The PostMark ISP benchmark (small file transactions to simulate email and e-commerce)

These benchmarks were applied to two journaling filesystems, a soft updates filesystem, and FFS. One of these journaling implementations has the characteristic that the log is a separate filesystem. This makes it easy to run either synchronously or asynchronously, switching the filesystem’s metadata-durability property on and off.

The results show that both journaling and soft updates dramatically improve the performance of metadata operations. However, synchronous journaling alone is insufficient to solve the metadata update problem. Synchronous journaling can penalize performance by up to 90% relative to the upper bound of an asynchronous filesystem with no protection. The delayed writes in soft updates improve performance in the microbenchmarks. However, soft updates did not achieve good performance in the netnews benchmark because of the ordering constraints. On the other hand, the delayed writes let soft updates excel in performance on the PostMark benchmark.

By understanding the solutions in terms of the transactional properties they offer, one can derive the relative costs of the properties. The cost of durability is the performance difference between the best-performing solution that does offer this property and the best-performing one that does not; likewise with the other properties. What one discovers from comparing performance is that the cost of durability is generally large relative to that of integrity. If one sacrifices durability, performance can be very close to that of systems with no metadata protection, while maintaining integrity.

Q: What part of the performance gain is due to the hefty hardware you’re using? Which has a better gain, an expensive disk or an expensive CPU? A: Synchronous systems would perform better with better SCSI drives and rotation. Asynchronous systems would benefit from more memory and a faster CPU.

Q: Did you turn off disk caching? For instance, Cheetah hard drives do caching. A: I can’t recall off the top of my head.

Q: In your microbenchmark, what were the file sizes and how long-lived were the files? A: There were many file sizes. There are four phases in the microbenchmark. All files were first created, the filesystem was unmounted then remounted, and then the files were read and deleted.

INVITED TALK

WATCHING THE WAIST OF THE PROTOCOL HOURGLASS

Steve Deering, Cisco Systems

Summarized by Rik Farrow

Steve Deering elected to use imagery and puns to get across his point: that the waist of IP should remain narrow. The basic image was an hourglass shape, where layers four through seven of IP (application and transport) represent the top half; layer three (IP) is the narrow middle, or waist; and various different transport mechanisms fit into the fat bottom half (ATM, SONET, Ethernet, etc.). Deering pointed out the importance of the IP layer – that it isolates the stuff above it from the stuff below it, and that this is what has allowed the Internet to evolve.

IP allows us to create a “virtualized network” to isolate end-to-end protocols from network details/changes. Having a single IP protocol maximizes interoperability and minimizes the service interface. IP means that we can assume a least common denominator of network functionality to maximize the number of networks.

USENIX ANNUAL TECHNICAL CONFERENCE ●

Then Deering asked rhetorically: why worry about the waist of IP? He gave mostly humorous answers:

■ It provides for navel gazing.
■ It happens at middle age.
■ The IP is the only layer I can get my arms around.
■ I am worried about how the architecture is now being lost: the waste.
■ This theme offers all these puns.

Deering went on to describe some problems with the waist of IP. For example, IP multicast (which he actually had a lot to do with) and quality of service (QoS) have added to the waist. And a mid-life crisis, IPv6, has created a second waist (the hourglass image now has two narrow connections). IPv6 doubles the number of interfaces, and requires a new Ethertype and changes in application software.

The reason for IPv6 was that the IPv4 address space was being rapidly depleted. IPv4 addresses began to be rationed out in the early ’90s, and Network Address Translation (NAT, RFC 1631) and application gateways (AG) have helped to preserve the IPv4 address space, but at some cost to the original goals of IP.

IP was also threatened by youths. The ATM community had this vision of getting rid of the IP layer and having end-to-end ATM. ATM did not supplant IP, and instead we wound up with a complicated hybrid and two address plans. On top of this, we had more fattening temptations: layer-two tunneling protocols (PPP, L2TP, PPP over Ethernet). “This is progress?” quipped Deering.

Deering talked about lost features of the Internet, properties that were true but are not anymore:

■ Transparency – devices that modify addresses (NAT).

12 Vol. 25, No. 6 ;login:

■ Robustness through fate-sharing – design a system so you do not depend on more resources than you absolutely must. TCP connectivity depends on state maintained at either end. Add in NAT, and you lose this.

■ Dynamic routing. There are many places routing is constrained. With NAT, an organization cannot provide multiple outgoing paths. And the failure of your ISP today means that you cannot simply have a separate path to your network (another ISP).

■ Portable addresses. (Early connectors could move their addresses with them, but now addresses are not portable.)

■ Unique addresses (NAT).
■ Stable addresses (NATs and DHCP).
■ Connectionless service. (If you are behind the NAT, NATs do not support transaction-oriented protocols like UDP; protocols do not have an explicit teardown, so NAT boxes just apply their own heuristics.)

■ Always-on service. Most users of the Internet are behind dialup services, so they must connect or disconnect from the Internet. PPPoE makes cable/DSL like dialup.

■ Peer-to-peer communications, not symmetric devices (no servers behind the Net). It becomes the purview of the ISP to set up servers.

■ Application independence is another key feature, as more and more application knowledge has to be built into the NAT. If NAT had been widespread before we got the Web, we might have not gotten the Web (no servers behind NAT).

In case all this didn’t make the network manager’s job hard enough, we renamed bridges to switches. Router people renamed routers as switches (multilayer switches and layerless switches, but this just obscures what is going on). Is this entropy or evolution? Deering believes that this looks like the normal entropy of all large engineered systems over time. “Of course it grows warts and hats over time,” said Deering.

What Deering really wants to do is turn the hourglass into a wineglass – IP over copper, fiber, radio with no intervening protocols. If the lower layers are making IP’s job harder, then let’s get rid of lower layers. And we are seeing signs of IP evolving this way. We saw IP over SONET, but people are asking what value does SONET add to IP? We need IPv6 to get back to the narrow shape. Deering concluded, “This is my dream. Only time will tell.”

Rob Pike of AT&T Research arose quickly to ask the first question. Pike mentioned that there is not much flow through the stem of the wineglass. He also stated that he thinks that Deering was not giving enough credit to things like what the phone system does for you, and that he doesn’t think that IPv6 can replace this. Deering responded by saying that the shape was not supposed to represent a bottleneck, but a reduction in baggage. Pike responded by saying it doesn’t appear that IPv6 addresses many of these issues, such as trunking between sites supported by frame relay. Deering responded that circuits could be better based on routing tables, and that MPLS is the current answer to this problem. At this point, Pike suggested that they continue in private.

Jeff Mogul of Compaq Western Research Labs tried to get Deering to talk about firewalls. (It was obvious he doesn’t like them.) Eventually, Deering said that firewalls should be in the end host, not in a dedicated firewall (not that we can throw firewalls away today).

Evi Nemeth asked when we will see IPv6 really deployed. Deering answered “next year,” and when Nemeth asked what will get it there, Deering said, “IPv6 in billions of cell phones will require it.”

SESSION: OLD DOGS, NEW TRICKS

Summarized by Bob Gray

LEXICAL FILE NAMES IN PLAN 9, OR, GETTING DOT-DOT RIGHT

Rob Pike, Bell Laboratories

Rob Pike pointed out that for 20 years, the UNIX community has been living with the embarrassing problem that chdir("..") doesn’t work properly in the presence of symbolic links. From his paper an example is:

% echo $HOME
/home/rob
% cd $HOME
% pwd
/n/bopp/v7/rob
% cd /home/rob
% cd /home/ken
% cd ../rob
../rob: bad directory

This annoyance, which causes confusion and headaches, should never have lasted so long. The anomaly exists in most versions of UNIX and Plan 9. With a few hours of work, Pike was able to solve the problem for Plan 9 by implementing an “accurate” path name for every active file in the system. It is guaranteed to be the rooted, absolute name the program used to acquire it. A pure lexical definition of chdir("..") is required. That is, take the current working directory and strip off the last component, yielding the new working directory. However, symbolic links introduce an ambiguity in the pathnames. Rob showed that the “correct” name can easily be determined by context.

Rob explained that in Plan 9, “bind” is like symbolic links and a channel is a filesystem handle. All of the necessary information to disambiguate chdir("..") is present in the kernel. A few lines of kernel code implemented the proper solution. Rob challenged the audience to fix this silly bug in contemporary UNIX implementations. He suggested that we may want to look at the Plan 9 source code, which is freely available now at <http://plan9.bell-labs.com>.


GECKO: TRACKING A VERY LARGE BILLING SYSTEM

Andrew Hume, AT&T Labs-Research; Scott Daniels, Electronic Data Systems Corp.; Angus MacLellan, AT&T Labs-Research

Andrew Hume described Gecko – an adjunct system to track the efficiency, performance, and accuracy of the AT&T phone billing system. The fundamental question being asked was: “Is every telephone call being billed exactly once?” Their legacy system comprises dozens of big-iron MVS computers, running hundreds of programs written in Cobol – it was not feasible to interrupt the operations of this 24x7 production system to add tracking software. However, Hume and his team were allowed to tap into the data flows between processes. The sum of these taps amounted to over 250GB/day.

Given the hundreds of millions of transactions per day, Hume’s job was to monitor and verify that billing processing was accurate, timely, and complete. His solution was to create “tags” for each telephone call record at each tap point. Nightly, Gecko would add these tags into a giant (60G tag) database, and produce reports highlighting discrepancies from expected flows.

Hume and his team processed the raw information and analyzed the flows using a 32-processor Sun E10000 and two smaller Sun E5000s. They used a number of basic tools: C, ksh, Awk, and Gre (a special implementation of grep). He compared implementations on the Sun and an SGI Origin 2000, and commented that Gecko relied on a solid SMP implementation and they were stymied by inadequacies in PC environments. He found that the SGI gave a 2.5 price/performance advantage over the Sun. But for huge increases in scalability, they would recommend a cluster implementation.

EXTENDED DATA FORMATTING USING SFIO

Glenn S. Fowler, David G. Korn, Kiem-Phong Vo, AT&T Labs-Research

Kiem-Phong Vo described the Sfio package, which is a faster, more robust, and more flexible replacement for the Stdio package. Stdio has several shortcomings in data formatting. For example, to print an abstract scalar such as off_t, on some systems you would use a printf specification of %d; on other systems you would need a printf specification of %ld. The Stdio routines gets and scanf are unsafe when the length of an input line or string exceeds the size of the given buffer. Stdio has no extension mechanism for printing user-defined types. For example, if you had a spatial coordinate type, Coord_t, you would have to use ad hoc formatting methods.

Sfio fixes the above-mentioned problems. It also contains a generalized mechanism for printing arbitrary bases in the range of 2–64. For example, %..2d will print the decimal value of 123 as 1111011.

Here is an example of how Sfio allows flexibility in printing an integer whose size may vary from architecture to architecture:

sfprintf(sfstdout, "%I*d", sizeof(intval), intval);

Similarly, here is a scanning example for floating point numbers:

sfscanf(sfstdin, "%I*f", sizeof(fltval), &fltval);

In both examples, the sizes of the scalar objects determine their types.

Sfio has hooks for handling user-defined types such as complex numbers. This general mechanism allows users to specify format strings, argument lists, and functions to parse each data type.

Sfio outperforms Stdio on all tested platforms, mostly due to the new data conversion algorithms. The code is available from <http://www.research.att.com/sw>.


INVITED TALK

IMPLEMENTING 3D WORKSTATION GRAPHICS ON PC UNIX HARDWARE

Daryll Strauss, Precision Insight

Summarized by Matt Grapenthien

3D hardware for PCs has improved to the point where it begins to rival that of traditional 3D graphics workstations. However, providing these capabilities on commodity hardware poses a number of difficult and interesting problems.

After explaining some of the basic algorithms implemented by various 3D hardware, Strauss addressed several of the challenges 3D presents, such as security issues related to commodity hardware. Also, the scheduling granularity used by the Linux kernel, though more efficient for most processes, is too large for smooth video.

Next, Strauss presented the shortcomings of indirect rendering techniques, and presented an alternative: the Direct Rendering Infrastructure (DRI). The DRI is designed to be secure, reliable, and high-performance, by utilizing the available hardware as much as possible. In addition, the DRI is highly modular, allowing different implementations to use only the subset of software they wish.

The DRI is included in XFree86 4.0, and is starting to be used by 3DLabs, HP, and IBM, among others, even on high-end hardware. Precision Insight’s work to provide completely open-source solutions to the problems above has shown great promise, even in this somewhat early stage.


FREENIX SESSION: FILE SYSTEMS

Summarized by Kevin E. Fu

PORTING THE SGI XFS FILE SYSTEM TO LINUX

Jim Mostek, Bill Earl, Steven Levine, Steve Lord, Russell Cattelan, Ken McDonell, Ted Kline, Brian Gaffey, and Rajagopal Ananthanarayanan, SGI

Russell Cattelan talked about the work necessary to port SGI’s XFS filesystem to Linux. In particular, he discussed the filesystem interface, the buffer caching, and legal issues. XFS is a highly scalable, journaling filesystem available for free under the GPL.

To maintain atomic updates to the filesystem metadata, XFS keeps a log of its uncommitted actions. Should the machine crash, XFS simply replays the log to recover to a consistent filesystem. Recovery time depends on the size of the uncommitted log. This is in stark contrast to the fsck tool, where recovery time depends on the size of the entire filesystem.

A single XFS filesystem can hold as much as 18 million terabytes of data and as much as 9 million terabytes of data per file. XFS is an extent-based filesystem. That is, data are organized into arbitrarily long extents of disk rather than fixed-sized blocks of disk. This allows XFS to achieve a throughput of 7MB/second when reading from a single file on an SGI Origin 2000 system.

XFS uses the vnode/VFS interface. However, Linux does not use this interface. As a result, SGI created the linvfs interface, which maps the Linux VFS interface to the VFS layer in IRIX. At a small overhead cost in translation, this keeps the core XFS code portable.

Second, SGI added pagebuf to Linux. This is a cache and I/O layer to provide most of the advantages from the cache in IRIX. Pagebuf allows pinning of metadata buffers, delayed writes via a daemon, and placement of metadata in a page cache.

Russell also explained the legal process of encumbrance relief. SGI determined, for all the XFS code, which code was original and which code came from third parties. Most of the original XFS code came from SGI. However, XFS contains some software from third parties. A homebrew tool compared every line of code in XFS against the source code from third parties. After removing the third-party code, SGI released XFS under the GPL.

For more information, see <http://oss.sgi.com/projects/xfs/>.

Q: Is preallocation fast? A: Yes. You’re preallocating an extent of space, not individual blocks. You’re not zeroing out files, but you will receive zeros if you read from preallocated space.

Q: How does preallocation work on an API level? A: There are two ways to preallocate, with a command-line utility or new ioctl calls.

Q: Could you describe what you did at the vnode/VFS layer? A: Our vnode is now part of the inode. The Linux inode would not work with our design. We mostly map linvfs functions to vnode functions.

LINLOGFS – A LOG-STRUCTURED FILE SYSTEM FOR LINUX

Christian Czezatke, xS+S; M. Anton Ertl, TU Wien

Christian Czezatke described the LinLogFS filesystem. Started in 1998, LinLogFS aims to achieve faster crash recovery than that of the Ext2 filesystem. LinLogFS also enables the creation of consistent backups while the filesystem remains writable.

LinLogFS evolved from the Ext2 codebase to a log-structured filesystem with better data consistency and cloning/snapshots. It guarantees in-order write consistency.

If you modify part of an Ext2 filesystem during a backup, the change may or may not be noticed. Worse, a simultaneous backup may only detect parts of the change. LinLogFS does not suffer from this problem because backups can use a snapshot of a filesystem. In this manner, backups are completely atomic while still allowing read-write access to the filesystem.

In the future, Czezatke plans to finish the cleaning program, use less naive data structures and algorithms, and cooperate with the object-based disk project for mirroring.

See <http://www.complang.tuwien.ac.at/czezatke/lfs.html> for the code.

Q: Have you reviewed the LFS code for the *BSD operating systems? Can you comment on differences or code overlap? A: We looked at BSD LFS and Sprite LFS. We had slightly different goals. We wanted an LFS for Linux. And the Sprite and BSD LFS goal was higher speed than FFS. Our goal was better functionality at the same speed as Ext2.

UNIX FILE SYSTEM EXTENSIONS IN THE GNOME ENVIRONMENT

Ettore Perazzoli, Helix Code, Inc.

Ettore Perazzoli spoke about filesystem extensions in the GNOME environment. Perazzoli is an open source software developer and contributor to the GNOME project. GNOME extends the functionality of the UNIX filesystem by using a user-level library called the GNOME Virtual File System.

There are many file formats to juggle: zip, tar, gzip, etc. Each format uses a different tool to view the file contents. GNOME avoids this problem by using a global filesystem namespace. This allows GNOME to identify and view contents of container files such as tar files. Names in the GNOME VFS use an extension of the familiar Web Uniform Resource Identifier scheme.

Ettore discussed how Microsoft Windows has the “My Computer” concept where one can find all system resources such as files, a control panel, printers, and the recycling bin. GNOME VFS addresses the “My Computer” problem by providing filesystem abstractions, and the asynchronous file I/O problem by implementing an asynchronous API. Asynchronous behavior is important because a GUI should be responsive all the time. If the GUI were to block on a file operation, the user cannot stop the operation. GNOME implements asynchronous behavior through an asynchronous virtual filesystem library.

For more information, visit <http://www.gnome.org/> and <http://developer.gnome.org/>.

Q: Why did you choose a pound sign for subschema? A: We inherited this from Midnight Commander. You can always quote the character.

Q: How do you handle symlinks with “..” in the path? A: Symlinks only work within their context. For instance, a symlink inside a tar file will only make sense within the context of the tar file.

Q: Have you considered POSIX AIO, as twisted as it may be? A: Yes, but POSIX AIO only lets you make reads/writes from/to the filesystem asynchronous, so it does not help in complex cases such as the .tar or .zip ones, when you have to do other stuff besides physically reading the file.

SESSION: DISTRIBUTION AND SCALABILITY: PROBLEMS AND SOLUTIONS

Summarized by Josh Kelley

VIRTUAL SERVICES: A NEW ABSTRACTION FOR SERVER CONSOLIDATION

John Reumann, University of Michigan; Ashish Mehra, IBM T.J. Watson Research Center; Kang G. Shin, University of Michigan; Dilip Kandlur, IBM T.J. Watson Research Center

Virtual services (VSes) provide a way to partition resources on a server cluster between competing applications. VSes can be used to set resource limits and to charge resources out to different virtual services. For example, Web servers for two different companies can be run on a single host by treating them as two separate VSes. Such partitioning has traditionally been done by partitioning a physical host into several virtual hosts. Although this approach provides good insulation between services, it does not allow sharing common subservices between co-hosted sites. VSes address this problem by dynamically adjusting resource bindings. Each request’s VS resource context is tracked as the request is handed off across application and machine boundaries, thus allowing VSes to remain insulated from each other while still accessing shared applications and machines.

Implementing VSes involves modifying the operating system’s scheduler to partition the CPU. A VS also hooks into various system calls (such as process creation calls and network communication calls) via gates in order to track the propagation of work. Gates adjust resource bindings during system calls (binding a process to a different virtual service as needed) and enforce resource limits. These gates are implemented as loadable kernel modules.

Evaluation shows that virtual services have a minimal negative impact on performance and successfully insulate competing services from each other, even when these services rely on a common shared service. VS support is currently available for Linux 2.0.36; support for Linux 2.2.14 is in development. Source code is available from <http://www.eecs.umich.edu/~reumann/vs.html>.


LOCATION-AWARE SCHEDULING WITH MINIMAL INFRASTRUCTURE

John Heidemann, USC/ISI; Dhaval Shah, Nokia

With the increase in use of laptop computers, some way to specify context-dependent configurations and activities could provide a great deal of convenience for users. For example, a user might wish to specify “print to the printer in room 232 when I’m at work” or “disable this background process while I’m on battery.” lcron is a context-aware cron that provides a general solution to scheduling such context-dependent activities.

Context information could come from many sources: GPS receivers, wireless base station name, idle status, network addresses, or battery status are all possibilities. lcron currently supports the latter two.

The interface to lcron is similar to that of the standard cron and at utilities. A context-dependent at command, for example, could be invoked as at 7pm @work. lcron allows users to specify mappings between terms meaningful to the user (‘@work’) and information readily available to the computer (latitude, longitude, router IP address).

lcron has been in use for two years. It is available from <http://www.isi.edu/~johnh/SOFTWARE/XCRON>.


DISTRIBUTED COMPUTING: MOVING FROM CGI TO CORBA

James FitzGibbon and Tim Strike, Targetnet.com Inc.

In 1997, TargetNet deployed a banner-ad delivery system using CGI. As the network grew, basic CGI showed itself to be incapable of keeping up with the growth. They wanted a solution that was faster than basic CGI, was freely available, would support distributed computing, and would work with other products. CORBA was chosen as the solution that best met these criteria.

The system that was developed involves two main components. First is an HTTP-to-CORBA proxy. This proxy, called the dispatch server, takes HTTP requests from the Web servers and presents them as CORBA requests to the application servers. This dispatch server allows a three-tier architecture (Web servers, application servers, databases) in which only the Web servers are exposed to the Internet.

The second main component is a method of notifying each dispatch server of the available application servers. A service heartbeat daemon serves this role by maintaining a list of available application servers, providing a client with an available application server when queried, and coordinating with other heartbeat daemons. This design avoids any single point of failure and provides excellent scalability.

CORBA offers an open source, distributed object model, with wide language support. The combination of a dispatch server and a heartbeat daemon allows a three-tier architecture with excellent reliability and near-linear scalability.

INVITED TALK

THE MICROSOFT ANTITRUST CASE: A VIEW FROM AN EXPERT WITNESS

Edward Felten, Princeton University

Summarized by Josh Simon

Disclaimer: Neither the author of this write-up nor the speaker is a lawyer.

Dr. Felten was one of the expert witnesses for the U.S. Department of Justice in the recent antitrust case against Microsoft. In his talk he discussed why he believed the government chose him, and he explained the role of an expert witness in antitrust cases.

In October 1997 Felten received an email message from an attorney in the Department of Justice asking to speak with him. After signing a nondisclosure agreement (which is still binding, so he advised us there were some aspects he simply could not answer questions on), and over the course of several months, he spoke with the DOJ until, in January 1998, he signed a contract to be an advisor to the case.

What was the case about? Unlike media portrayals, the case was not about whether Microsoft is good or evil, or whether or not Bill Gates is good or evil, or whether Microsoft’s behavior was good or bad. The case was specifically about whether or not Microsoft violated U.S. antitrust laws.

A brief discussion of economics may be helpful here. Competition constrains behavior. You cannot, as a company, hike prices and provide bad products or services when there is competition, for the consumer can go to your competitors and you’ll go out of business. Weakly constrained companies, or those companies with little or no competition, have what is called monopoly power. Monopoly power in and of itself is not illegal. What is illegal is using the monopoly power in one market (for example, flour) to weaken competition in another market (for example, sugar).

The U.S. government claimed that (1) Microsoft has monopoly power in the personal-computer operating-system market; and that (2) Microsoft used its monopoly power to (a) force PC manufacturers to shun makers of other (non-Microsoft) applications and operating systems; (b) force AOL and other ISPs to shun Netscape’s browsers, Navigator and Communicator; and (c) force customers to install and use Microsoft Internet Explorer. These issues are mostly nontechnical and specifically economic. Dr. Felten focused on the technical aspects.

Under U.S. antitrust law, tying one product to another is illegal in some cases. For example, if you have a monopoly on flour, you cannot sell flour and sugar together unless you also sell flour alone. You cannot force customers to buy your sugar in order to get your flour. Similarly, the government argued, Microsoft cannot tie Windows 95 (later, Windows 98) together with Internet Explorer unless it offers both the OS and the browser separately. Microsoft claimed technical efficiencies in bundling the products together.

This boils down to two legal issues. First, what was the motive in combining the OS and the browser? The answer to this is provided by documentation and witnesses, subpoenaed by the government, and not technical. Second, does the combination achieve technical efficiencies beyond that of not combining the two? The answer to this is experimental, technical, based in computer science, and was the focus of the rest of the talk.

Specifically, how did Felten go about testing the efficiencies or lack thereof? He started by hiring two assistants who reverse-engineered both Windows and Internet Explorer. (Note that this work, because it was done on behalf of the government for the specific trial, was not illegal. Doing so yourself in your own basement would be illegal.) After nine months, they were able to assemble a program to remove Internet Explorer from Windows.

The next step in the process was to prepare for court. In general, witnesses have to be very paranoid, nail down the technical details, have sound and valid conclusions, and learn how to be cross-examined. Lawyers, no matter your personal opinion of them, are generally very well-schooled in rhetoric, terminology, framing of questions, and hiding assumptions in them. They’re also good at controlling the topic, pacing the examination, and producing sound bites. In his testimony, Felten demonstrated the “remove IE” program. Jim Allchin, Microsoft vice president, provided 19 benefits of tying the products together and claimed the removal program had bugs. In the government’s cross-examination of Allchin, he admitted that all 19 benefits were present if IE was installed on top of Windows 95, and that the video used to show that the demonstration of the removal process had bugs had errors and inconsistencies. Microsoft tried a second demo to show problems with the removal program under controlled circumstances and could not do so. Furthermore, in rebuttal to Microsoft’s assertion that the products had to be strongly tied together to gain benefits, the government pointed out that Microsoft Excel and Microsoft Word were not strongly tied and yet were able to interoperate without being inseparable.

Judge Jackson reported in his findings of fact in November 1999 that the combination had no technical efficiencies above installing them separately, Internet Explorer could be removed from the operating system, and tying the browser and the operating system together was, in fact, illegal.

The next phase of the trial was the remedy phase in May 2000. The goals of the remedy phase are to undo the effects of the illegal acts, prevent recurrence of those acts, and be minimally intrusive to the company, if possible. There are generally two ways to accomplish this: structural changes (reorganization or separation of companies) and conduct changes (imposing rules). The judge could choose either or both. The decision was reached to restructure Microsoft such that the operating system (Windows) would be handled by one company and everything else by another. Furthermore, in conduct changes, Microsoft could not place limits on contracts; could not retaliate against companies for competing in other markets (such as, for example, word processing); must allow PC manufacturers to customize the operating systems on the machines they sell; must document their APIs and protocols; and cannot tie the OS and products together without providing a way to remove them.

Microsoft has appealed the case. At the time of this writing it is not clear whether the appeal will be heard in the U.S. Court of Appeals or by the U.S. Supreme Court. The remedies are stayed, or on hold, until the resolution of the appeals or until a settlement of some kind is reached between Microsoft and the U.S. government. Once the case is truly over, Felten’s slides will be available on the USENIX Web site.

FREENIX SESSION: SOCKETS

Summarized by Bob Gray

PROTOCOL INDEPENDENCE USING THE SOCKETS API

Craig Metz, University of Virginia

Craig Metz convinced the audience that in an all-IP world, our network programming has become inflexible and uni-protocol. Even though the Berkeley Sockets were designed as a protocol-independent API, other parts of the API (like the name-service functions) grew up as protocol-dependent. So, some of the APIs are not protocol-independent, which in turn encourages code not to be either. For now, IPv4 is ubiquitous, but as its 32-bit address space runs out, we will find IPv6 more and more compelling. Therefore, it would be prudent to pay attention to the portability issues and even check and retrofit some of our existing network programs.

Metz pointed out multiple problems:

■ Hard-coded constants in programs,
■ Storage limitations and assumptions,
■ GUIs that assume four three-digit fields as an address.

For example, many programs hard-code the protocol family as AF_INET, which prevents protocols other than IP from being used. We should not assume a network address will fit in a u_long. And using struct sockaddr_in sin limits programs because it doesn’t allocate enough storage for protocols such as IPv6.

Metz recommends using the new POSIX p1003.1g interfaces. For example, getaddrinfo performs the functionality of gethostbyname(3) and getservbyname(3), in a more sophisticated manner.
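Python's socket module exposes the same POSIX interface, so the protocol-independent lookup-and-connect loop the new API enables can be sketched briefly (connect_by_name is an illustrative helper, not code from the paper):

```python
import socket

def connect_by_name(host, port):
    """Connect to host:port without hard-coding a protocol family.

    getaddrinfo returns one candidate per supported family (IPv4,
    IPv6, ...); trying each in turn avoids AF_INET assumptions.
    """
    last_err = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            try:
                sock.connect(sockaddr)
                return sock            # first address that works wins
            except OSError as err:
                sock.close()
                last_err = err
        except OSError as err:
            last_err = err
    raise last_err if last_err else OSError("no addresses found")
```

Because the family, socket type, and address all come from the resolver, the same code serves IPv4 today and IPv6 tomorrow.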

The new interfaces will take time to be universally deployed; however, they are currently available in at least the following environments: AIX, BSD/OS, FreeBSD, Linux, OpenBSD, NetBSD, Solaris, and Tru64 UNIX. They are expected to be available soon with IRIX and HP-UX.

SCALABLE NETWORK I/O IN LINUX

Niels Provos, University of Michigan; Chuck Lever, Sun-Netscape Alliance

Graduate students Niels Provos and Chuck Lever have observed and addressed bottlenecks in Linux where many high-latency, low-bandwidth connections are simultaneously present on a Web-server box. The problem is that the stock kernel data structures and algorithms don't scale well in the presence of thousands of rapidly formed HTTP connections. The file-descriptor selection


code needed work. The problem is event notification. It takes a long time for the kernel to find which connections are ready for I/O.

Niels discussed two solutions to remove the inefficiency: the POSIX RT signals API and an optimized poll() along the lines of Banga's declare_interest() interface. He convinced the audience that both mechanisms provide huge improvements for HTTP response time when 251 or 501 idle or inactive connections are present in the background. However, his optimization using /dev/poll yielded the overall best response times.

To achieve high performance, Niels made the following changes to poll():

■ Maintain state information in the kernel so that every poll system call doesn't have to retransmit it.
■ Allow device drivers to post completion events to poll().
■ Eliminate the result copying when poll returns to the user.

The /dev/poll device allows a process to add, modify, and remove "interests" from an interest set. This streamlines poll() calls because only relevant information is passed at system call time. Further, when poll() returns, the application immediately has access to the ready descriptors.
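The add/modify/remove interest-set model survives in the poll objects most systems expose today; a small sketch using Python's select.poll shows the same pattern of registering interest once and receiving only the ready descriptors (the socket pair is scaffolding for the demo, not part of the interface being described):

```python
import select, socket

# A connected socket pair gives us a descriptor to watch.
a, b = socket.socketpair()

poller = select.poll()
# Register interest once; the poll object keeps this state, so later
# poll() calls need not retransmit the full descriptor list.
poller.register(a, select.POLLIN)

b.send(b"ping")                    # make `a` readable
events = poller.poll(1000)         # [(fd, eventmask), ...] ready fds only
assert dict(events).get(a.fileno(), 0) & select.POLLIN

poller.modify(a, select.POLLIN | select.POLLOUT)   # change an interest
poller.unregister(a)                               # remove an interest
a.close(); b.close()
```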

The software enhancements to poll() are freely available.

ACCEPT() SCALABILITY IN LINUX

Stephen P. Molloy, University of Michigan; Chuck Lever, Sun-Netscape Alliance

Stephen Molloy described the thundering-herd problem associated with Linux implementations of the accept() system call. When multiple threads call accept() on the same TCP socket to wait for incoming TCP connections, they are placed into a wait queue. The problem is that when a TCP connection is accepted, all of the threads are awakened – even though all but one will immediately need to go back to sleep. As the number of threads goes from a few dozen to hundreds, the kernel begins severe thrashing.

One proposed solution is called Task Exclusive. The idea is to add a flag to the thread state variable, change the handling of wait queues, and connect into a standard, newly added wait-queue mechanism.

Another solution is called Wake One, which adds new calls to complement wake_up() and wake_up_interruptible(). The new functions just wake up one thread when a connection becomes ready.
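The wake-one idea is easy to model in user space with a condition variable; the sketch below is an illustration, not the kernel patch, contrasting notify(), which rouses a single waiter, with notify_all(), which reproduces the thundering herd:

```python
import threading, queue

cond = threading.Condition()
pending = 0                 # connections ready to be accepted
served = queue.Queue()

def waiter(i):
    """Model a thread blocked in accept() on a shared socket."""
    global pending
    with cond:
        while pending == 0:
            cond.wait()     # all waiters sleep on a single wait queue
        pending -= 1
    served.put(i)

threads = [threading.Thread(target=waiter, args=(i,)) for i in range(8)]
for t in threads:
    t.start()

with cond:
    pending = 1
    cond.notify()           # Wake One: rouse exactly one waiter.
                            # notify_all() here would model the stock
                            # kernel: all 8 wake, 7 find pending == 0
                            # and immediately go back to sleep.
first = served.get(timeout=5)

with cond:                  # release the remaining waiters and clean up
    pending = 7
    cond.notify_all()
for t in threads:
    t.join()
```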

Molloy presented micro-benchmark data showing huge improvements in "settle time" for both the Task Exclusive and Wake One solutions over the stock kernel.

He also used a macro-benchmark, SpecWeb99, to demonstrate the effectiveness of both solutions – over 50% more connections.

The Task Exclusive solution has been incorporated into the Linux kernel. The code is also available at the Linux Scalability Project's home page: <http://www.citi.umich.edu/projects/linux-scalability/>.

SESSION: TOOLS

Summarized by Doug Fales

OUTWIT: UNIX TOOL-BASED PROGRAMMING MEETS THE WINDOWS WORLD

Diomidis D. Spinellis, University of the Aegean

Windows is fundamentally an environment of mouse clicks and pixel output. Very little of this GUI-based OS is accessible to text-based programs. It is with this shortcoming in mind that Diomidis Spinellis developed Outwit. Outwit provides a number of text-based tools for Windows to work together with UNIX-based tools. Specifically, Spinellis' tools provide mechanisms for interaction with

the clipboard, the registry, the ODBC database interface, document properties, and shell links (shortcuts). The presentation included several convincing examples of the time-saving advantages of such an environment.

One example used the winreg command (the interface to the registry) to change the location of the user's home directory from drive C: to drive D:, with a little help from pipelines and the Win32 version of sed:

winreg HKEY_CURRENT_USER | sed -n 's/C:\\home/D:\\home/gp' | winreg

The winclip tool provides shell-based clipboard access. For example, to copy data from standard input to the Windows clipboard:

ls -l | winclip -c

and to paste Windows clipboard data to standard output:

winclip -p | wc -w

Spinellis would be interested in adding functionality to Outwit for new features of Windows 2000, and in providing support for Unicode. The Outwit tools are available at <http://softlab.icsd.aegean.gr/~dspin/sw/outwit>.

PLUMBING AND OTHER UTILITIES

Rob Pike, Bell Labs

Plumbing is a Plan 9 solution for interprocess communication and message passing between user applications. The general idea is to remove some of the burden on the user of constantly shuffling data from program to program. A common example of this occurs during compilation and debugging, when a compiler generates error messages detailing file location and line number. Plumbing allows the user to access the error in an editor in one click.

Thanks to the pattern-matching language at the core of the plumber's design, it is a far more powerful mechanism than filename-extension associations such as those in Windows. While the goal is for the plumber's default actions to be the most desirable actions, it is highly customizable through a configuration file that defines rules in a pattern-action format.

The actual message passing is relatively trivial because the plumber is a fileserver; messages are written to a file on the server, which then takes appropriate action based on the defined rules. The rules are actually quite flexible and powerful in that they provide a means to interpret the context of messages.
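The pattern-action idea can be mimicked in a few lines; the rules below are hypothetical (the plumber's real rule syntax appears in Pike's paper), but they show how matching a message against an ordered rule set yields both a destination and the extracted context:

```python
import re

# Hypothetical rules in the plumber's pattern-action spirit: each rule
# pairs a pattern over the message data with a destination "port".
RULES = [
    (re.compile(r'^(?P<file>[\w./-]+\.c):(?P<line>\d+)$'), 'edit'),
    (re.compile(r'^https?://\S+$'), 'web'),
]

def plumb(data):
    """Route a message to the first rule whose pattern matches it."""
    for pattern, port in RULES:
        m = pattern.match(data)
        if m:
            return port, m.groupdict()
    return None, {}

# A compiler error location lands in the editor with the file and line
# already extracted, which is what makes one-click jump-to-error work:
port, args = plumb('main.c:42')
assert port == 'edit' and args == {'file': 'main.c', 'line': '42'}
```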

The interface for plumbing is simple and designed to minimize keystrokes and button clicks. The applications themselves do remarkably little work; almost everything is handled within the plumber itself.

One very good application of this sort of automation is in dealing with file formats that might require transformation before viewing, such as an attached Microsoft Word document. With a couple of pattern-matching rules to set up the variables, this rule takes a Word document, converts it to text, and sends it to the editor:

plumb start doc2txt $data | \
    plumb -i -d edit \
    -a action=showdata \
    -a filename=$0

This one-click approach to conversion and viewing is a slick example of the advantages of this system. Details and more examples may be found in Pike's paper.

INTEGRATING A COMMAND SHELL INTO A WEB BROWSER

Robert C. Miller and Brad A. Myers, Carnegie Mellon University

Rob Miller demonstrated enhancements to his existing Web browser (LAPIS, see <http://www.usenix.org/events/usenix01/cfp/miller/miller_html/usenix99.html>). The extension was an integration of browser and command shell. At first glance, this might look like another attempt to unnecessarily GUI-ify a classic typescript tool. However, this project demonstrated some very useful and innovative ways of using a Web browser.

Miller's browser provides a circular redirection of the standard input and output, so that commands are executed as if in one long pipeline. The benefits of such a mechanism in a Web browser are not immediately apparent until the full potential of conventions like the "back" button is realized.

In addition, the browser conveniently separates the standard error and output streams when displaying command output. Some features of the browser are: an embedded pattern-matching language to extract data from a Web page, a command window that can execute traditional shell commands and Web-specific commands, and the ability to automate browsing.

The demonstration showed how LAPIS could be used to visit and print (or save) each page in a document that is strung out over several links. Another interesting application was automating the use of forms. Miller used the pattern-matching language to extract the ISBN of a book from an Amazon.com Web site, and then fed this into a script that had been constructed by LAPIS to consult the form-based Web pages of CMU's library lookup service.

Judging by the demonstration, LAPIS seemed a well-designed, easy-to-use interface. Furthermore, it does something to alleviate the pain of endless clicking and banner-ad watching that is associated with browsing the Web these days. The browser and its Java source are available at <http://www.cs.cmu.edu/~rcm/lapis>.

USENIX ANNUAL TECHNICAL CONFERENCE ●

INVITED TALK

CHALLENGES IN INTEGRATING THE MAC OS AND BSD ENVIRONMENTS

Wilfredo Sanchez, Apple Computer

Summarized by Josh Simon

Fred Sanchez discussed some of the challenges in integrating the MacOS and BSD environments to produce MacOS X. Historically, the Mac was designed to provide an excellent user interface ("the best possible user experience") with tight hardware integration and a single user. In contrast, UNIX was designed to solve engineering problems, using open source (for differing values of "open"), running on shared multi-user computers and with administrative overhead. There are positives and negatives with both approaches. MacOS X is based on the Mach 3.0 kernel and attempts to take the best from both worlds. A picture may help explain how all this hangs together.

Sanchez next talked about four problem areas in the integration: filesystems, files, multiple users, and backwards compatibility. Case sensitivity was not much of an issue; conflicts are rare and most substitutions are trivial. MacOS uses colon as the path separator; UNIX uses the slash. Path names change depending on whether you talk through the Carbon and Classic interfaces (colon, :) or the Cocoa and BSD interfaces (slash, /). Filename translation is also required, since it is possible for a slash to be present in a MacOS file name. File IDs are a persistent file handle that follows a file in MacOS, providing for robust alias management. However, this is not implemented in filesystems other than HFS+, so the Carbon interface provides for nonpersistent file IDs. Hard links are not supported in HFS+, but it fakes it, providing the equivalent behavior to the UFS hard link. Complex files – specifically, the MacOS data and resource forks – are in the Mac filesystems (HFS+, UFS, and NFS v4) but not the UNIX filesystems (UFS and NFS v3). The possible solutions to this problem include using AppleDouble MIME encoding, which would be good for commands like cp and tar but bad for commands using mmap(), or using two distinct files, which makes renaming and creating files tough, overloads the name space, and confuses cp and tar. The solution they chose was to hide both the data and resource forks underneath the filename (for example, filename/data and filename/resource, looking like a directory entry but not a directory) and have the open() system call return the data fork only. This lets editors and most commands (except archiving commands, like cp and tar, and mv across filesystem boundaries) have the expected behavior. Another filesystem problem is permissions (which exist in HFS+ and MacOS X but not in MacOS 9's HFS). The solution here is to base default permissions on the directory modes.

The second problem area is files. Special characters were allowed in MacOS filenames (including space, backslash, and the forward slash). Filename translation works around most of these problems, though users have to understand that "I/O stuff" in MacOS is the same as "I:O_stuff" in UNIX on the same machine. Also, to help reduce problems in directory permissions and handling, they chose to follow NeXT's approach and treat a directory as a bundle, reducing the need for complex files and simplifying software installations, allowing drag-and-drop to install new software.

The third problem area involves multiple users. MacOS historically thought of itself as having only a single user and focused on ease of use. This lets the Mac user perform operations like setting the clock, reading any file, installing software, moving stuff around, and so on. Currently MacOS X provides hooks for UID management (such as integrating with a NetInfo or NIS or LDAP environment) and tracks known (UNIX-like) and unknown disks, disabling commands like chown and chgrp on unknown disks.

The fourth and final problem area Sanchez discussed was compatibility with legacy software and hardware. Legacy software has to "just work," and the API and toolkit cannot change, so previous binaries must continue to work unchanged. The Classic interface provides this compatibility mode. Classic is effectively a MacOS X application that runs MacOS 9 in a sandbox. This causes some disk-access problems, depending on the level (application, filesystem, disk driver, or SCSI driver). The closed architecture of the hardware is very abstracted, which helps move up the stack from low-level to the high-level application without breaking anything.

Questions focused on security, the desire to have a root account, and the terminal window or shell. The X11 windowing system can be run on MacOS X, though Apple will not be providing it. Software ports are available from <http://www.stepwise.com/>. Additional details can be found at <http://www.mit.edu/people/wsanchez/papers/USENIX_2000/>, <http://www.apple.com/macosx/>, and <http://www.apple.com/darwin/>.

[Figure: the MacOS X stack – the Classic (Platinum), Carbon and Cocoa/OpenStep (Aqua), and BSD (Curses) environments sit atop Application Services; Quartz, OpenGL, and QuickTime; Core Services; and Darwin, the BSD layer.]

FREENIX SESSION: NETWORK PUBLISHING

Summarized by Matt Grapenthien

The growth and evolution of the Internet has caused some of its limitations to become more acute. The presenters in this session attempt to address some of these problems with more dynamic and generally more useful systems and protocols than those currently in wide use.

PERMANENT WEB PUBLISHING

David S. H. Rosenthal, Sun Microsystems Laboratories; Vicky Reich, Stanford University Libraries

David Rosenthal presented LOCKSS (Lots of Copies Keep Stuff Safe), a system to preserve access to scientific journals in electronic form. Unlike normal systems, LOCKSS has far more replicas than necessary just to survive the anticipated failures. Exploiting the surplus of replicas, LOCKSS allows much looser coordination among them.

THE GLOBE DISTRIBUTION NETWORK

A. Bakker, E. Amade, and G. Ballintijn, Vrije Universiteit Amsterdam; I. Kuz, Delft University of Technology; P. Verkaik, I. van der Wijk, M. van Steen, and A. S. Tanenbaum, Vrije Universiteit Amsterdam

The Globe Distribution System, presented by Arno Bakker, attempts to address the problem of distribution to a worldwide audience through selective, per-document replication, as opposed to the "all-or-none" replication policies currently in use. Though still in an early stage, the project's goal of a "better" WWW/FTP seems promising.

OPEN INFORMATION POOLS

Johan Pouwelse, Delft University of Technology

Johan Pouwelse presented a method to allow modification and extension of existing Web pages by allowing public write access to collections of WWW-based databases. Open Information Pools further address the problem of quickly evaluating huge amounts of content, through an open rating and moderation system. Experiments on similar, already-existing systems seem to prove these concepts very valuable.

SESSION: KERNEL STRUCTURES

Summarized by Josh Kelley

OPERATING SYSTEM SUPPORT FOR MULTI-USER, REMOTE, GRAPHICAL INTERACTION

Alexander Ya-li Wong and Margo I. Seltzer, Harvard University

An increasing number of services are being provided over the network instead of locally. Examples include filesystems (NFS), storage (Fibre Channel), memory (Distributed Shared Memory), and interfaces (thin clients). Of these, thin clients are often neglected.

The key characteristics of thin-client service are that it is interactive, multi-user, graphical, and remote. One of the most important features of thin-client service is low latency. This presentation compared two operating systems, Windows NT 4.0 Terminal Server Edition and Linux 2.0.36 running the X Window System, and examined how well they provide these characteristics.

The first two questions regarding thin-client service are processor management and memory management. The OS's goal should be to prevent the user from experiencing any perceptible latency. To minimize latency, the OS should ensure that interactive performance is background-load-independent and should swap out pages belonging to interactive processes last. Although NT offers special support for scheduling interactive threads, experiments showed Linux to be much better, both for processor scheduling and for low latencies while swapping pages in from disk. Open questions here include the best time slice to use for interactive processes and how the Linux kernel can identify interactive threads (since interactivity is determined in user space) for special treatment.

A third question regarding thin-client service is network load. In experiments, NT's Remote Display Protocol presented a much lower network load, with a larger message size, than did the X or LBX protocols used by Linux and X Windows. (This may be due partially to poorly coded X applications.) RDP's use of a client-side bitmap cache allowed it to have virtually no load in the specific area of animation (as long as the cache was not overloaded). These experiments point out the importance of a client-side cache.

TECHNIQUES FOR THE DESIGN OF JAVA OPERATING SYSTEMS

Godmar Back, Patrick Tullmann, Leigh Stoller, Wilson C. Hsieh, and Jay Lepreau, University of Utah

A Java OS is an execution environment for Java bytecode that provides standard operating-system functionality: separation and protection, resource management, and interapplication communication. It may run on a traditional OS or it may be embedded in an application. The purpose of a Java OS is to support executing multiple Java applications.

There are several options for arranging the operating system, Java Virtual Machine, and Java applications. One approach is physical separation: one OS per one JVM per one app. Such an approach is expensive, prevents embedding of an OS within an application, and makes communication difficult. A second approach is separate JVM processes running under one OS. This approach has inefficient resource use, requires an underlying OS, and makes for difficult communication and no embedding within outside applications. A third approach is an ad hoc layer that supports running multiple applications within one JVM. Examples of this approach include an applet context or a servlet engine. However, this approach offers insufficient separation between the processes, with no resource control and unsafe termination of one runaway process. The fourth, and best, approach is to provide support for processes within the JVM, turning the JVM into a Java OS. The Java OS can be designed to replace the base OS as well.

There are several design decisions to be made for a Java OS. One example is the area of shared memory management. Issues here include the precision of accounting, the ability to reclaim shared objects, and the need for full reclamation of memory upon process termination. Java's automatic garbage collection complicates these issues. Solutions to the question of shared memory management include copying (simple but slow), indirect sharing via revocable proxies, direct sharing via a dedicated shared heap, and a hybrid approach of both direct and indirect. The J-Kernel, from Cornell University, and Alta and K0 (which later became KaffeOS), both from the University of Utah, all offer different solutions to this question and illustrate the general tradeoffs between separation, resource management, and communication.
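Of these, indirect sharing is perhaps the easiest to picture; a minimal user-space sketch of a revocable proxy (class and error names are illustrative, not taken from any of the cited systems):

```python
class RevokedError(Exception):
    pass

class RevocableProxy:
    """Indirect sharing: clients hold the proxy, never the object.

    Revoking the proxy cuts every client's access at once, so the
    owning process can terminate and its memory be reclaimed safely.
    """
    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        # Forward attribute access to the shared object, if still live.
        if self._target is None:
            raise RevokedError("object reclaimed by owner")
        return getattr(self._target, name)

    def revoke(self):
        self._target = None

shared = RevocableProxy([1, 2, 3])
assert shared.count(2) == 1    # access through the proxy works
shared.revoke()                # owner terminates: access is cut off
try:
    shared.count(2)
except RevokedError:
    pass
```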

SIGNALED RECEIVER PROCESSING

José Brustoloni, Eran Gabber, Abraham Silberschatz, and Amit Singh, Lucent Technologies-Bell Laboratories

This session presented signaled receiver processing, an alternative to BSD's traditional IP packet receiver processing. Since implementations of and derivatives of BSD have appeared on a variety of platforms, BSD receiver processing is a part of many operating systems, although it has several disadvantages. Protocol processing of received packets in BSD is interrupt-driven. This results in scheduling anomalies; CPU time spent processing packets is charged to the currently running process or is not charged at all. Therefore, no quality of service (QoS) guarantees are possible. A second problem is receive livelock, in which the system spends all of its time processing incoming packets, even when no buffer space is available to store these packets.

One alternative to BSD receiver processing is lazy receiver processing (LRP). To prevent receive livelock, LRP uses earlier demultiplexing to detect full receive buffers as soon as possible and drop packets accordingly. UDP packets are processed synchronously, at the receive call. TCP packets are processed asynchronously, via an extra kernel thread per process or via a system-wide process that uses resource containers. (Resource containers are an abstraction used to separate resource principal and protection domain. Resources used by kernel-level code can be charged out to user processes.) This processing of UDP and TCP packets allows LRP to avoid the BSD receiver processing's scheduling anomalies.

There are several disadvantages with LRP. First, it does not work on systems that do not implement kernel threads or resource containers. Second, under LRP, TCP is always asynchronous and shares resources equally with the application. Third, LRP is designed for hosts, not gateways. Its early demultiplexing is too simplistic for gateways, and time-sharing scheduling is inadequate for gateways. There are open questions with LRP regarding how well it would work with realtime schedulers and proportional-share schedulers.

Signaled receiver processing (SRP) is presented as an alternative method that avoids these problems with LRP. When a packet arrives, the OS signals the receiving application. By default, the packet is processed asynchronously, although the application may instead choose to catch, block, or ignore the signal, in order to defer processing until the next receive call. SRP processes incoming packets in several stages; only the actual hardware input is handled at the interrupt level. Processing is handled via a multi-stage early demultiplexer, or MED, and is transferred from one stage to another via the next-stage submit (NSS) function. The NSS function signals the application by sending it a SIGUIQ (unprocessed interrupt queue).

Performance tests show that throughput, CPU utilization, and round-trip times are practically the same for SRP under Eclipse/BSD as they are for BSD receiver processing under FreeBSD. Tests also show that SRP successfully prevents receive livelock. It is easily portable, allows for flexible scheduling and use with gateways, and still allows for QoS guarantees.

INVITED TALK

THE CONVERGENCE OF NETWORKING AND STORAGE: WILL IT BE SAN OR NAS?

Rod Van Meter, Network Alchemy

Summarized by Josh Simon

The goal of this talk was to provide models for thinking about SANs and NASs. Network-attached storage (NAS) is like NFS on the LAN; storage area networks (SANs) are like a bunch of Fibre Channel–attached disks.

There are several patterns of data sharing, such as one-to-many users, one-to-many locations, time slices, and fault tolerance; activities, such as read-only, read-write, multiple simultaneous reads, and multiple simultaneous writes; and multiple ranges, of machines, CPUs, LAN versus WAN, and known versus unknown clients.

When sharing data over the network, how should you think about it? There are 19 principles that Levy and Silberschatz came up with that describe the file system. These include the naming scheme, component unit, user mobility, availability, scalability, networking, performance, and security. There is Garth

Gibson's taxonomy of four cases: server-attached disks, like a Solaris machine; server-integrated disks, like a Network Appliance machine; netSCSI, or SCSI disks shared across many hosts with one "trusted" host to do the writes; and network-attached secure devices (NASD). Over time, devices are evolving to become more and more network-attached, smarter, and programmable.

Van Meter went into several areas in more detail. Access models can be application-specific (like databases or HTTP), file-by-file (like most UNIX file systems), logical blocks (like SCSI or IDE disks), or object-based (like NASD). Connections can be over any sort of transport, including Ethernet, HiPPI, Fibre Channel, ATM, SCSI, and more. Each connection model is at the physical and link layers and assumes there is a transport layer (such as TCP/IP), though other transport protocols are possible (like ST or XTP or UMTP). The issues of concurrency (are locks mandatory or advisory? is management centralized or distributed?), security (authorization and authentication, data integrity, privacy, and nonrepudiation), and network ("it doesn't matter" versus "it's all that matters") all need to be considered.

Given all those issues, there are three major classes of solutions today. The first is a distributed file system (DFS), also known as NAS. This model is a lot of computers and lots of data; examples include NFS v2, AFS, Sprite, CIFS, and XFS. The bottleneck with these systems is the file manager or object store; drawbacks include the nonprogrammability of these devices and the fact that they are OS-specific and have redundant functionality (performing the same steps different times in different layers).

The second class of solution is storage area networks (SAN). These tend to have few computers and lots of data and tend to be performance-critical. These are usually contained in a single server or machine room; the machines tend to have separate data and control networks. These devices' drawbacks are that they are neither programmable nor smart, they're too new to work well, they provide poor support for heterogeneity, and the scalability is questionable. However, there is a very low error rate and the application layer can perform data recovery. Examples of SANs include VAX clusters, NT clusters, CXFS from SGI, GFS, and SANergy.

The third solution class is NASD, developed at CMU. The devices themselves are more intelligent and perform their own file management. Clients have an NFS-like access model; disk drives enforce (but do not define) security policies. The problems with NASD are that it's too new to have reliable details, more invention is necessary, there are some OS dependencies, and some added functionality may be duplicated in different layers. Which solution is right for you? That depends on your organization's needs and priorities.

FREENIX SESSION: X11 AND USER INTERFACES

Summarized by Gustavo Vegas

THE GNOME CANVAS: A GENERIC ENGINE FOR STRUCTURED GRAPHICS

Federico Mena-Quintero, Helix Code, Inc.; Raph Levien, Code Art Studio

The GNOME Canvas is a generic high-level engine for structured graphics. A canvas is a window in which things can be drawn. It contains a collection of graphical items such as lines, polygons, ellipses, smooth curves, and text. Graphics on this canvas are deemed to be structured because you can place these geometric shapes in the canvas and later on access the objects to change their attributes, such as position, color, and size. The canvas is in charge of all redrawing operations.

The GNOME canvas has an open interface that permits applications that use the canvas to create their own custom item types. Thus, the canvas can work as a generic display engine for all kinds of applications. The GNOME canvas items are GTK+ objects derived from an abstract class (GnomeCanvasItem) that gives the methods for objects to be implemented. Using the GTK+ object system provides several advantages, such as the possibility of associating arbitrary data items with canvas items.

The GNOME canvas also uses the Libart library for its external imaging model in antialias mode. Libart is a library that provides a superset of the PostScript imaging model, and it provides support for antialiasing and alpha transparency. The end result is that graphics' contours are smoothed out to eliminate jagged edges.

Several applications that are currently distributed as part of the GNOME environment use the GNOME canvas to render graphics and other types of data. Examples of such applications are Gnumeric (the GNOME spreadsheet), GNOME-PIM (personal information manager), and Evolution (the next-generation mail and groupware program for GNOME).

For more information about the GNOME project and the GNOME canvas, see <http://developer.gnome.org/>.

EFFICIENTLY SCHEDULING X CLIENTS

Keith Packard, SuSE Inc.

Keith Packard presented a new scheduling algorithm for X11. The technical motivation behind this project is that the original scheduling mechanism in X11 is simplistic and can potentially starve interactive applications while a graphics-intensive program runs. This problem is evident when one runs programs like plaid, which generate many rendering requests that can tie up the system for long periods of time, making it unusable by other programs as simple as an xterm.

The X server is typically a single-threaded network server that uses well-known ports to receive connections from clients, for which it processes requests sent over the port. To detect pending input from clients, it uses the select(2) system call. When the set of clients with pending input has been determined, the X server starts executing the requests, starting with the smallest file descriptor. Each client uses a buffer to read some of the data from the network connection. This buffer can be resized, but it is typically 4KB. The requests are executed until either the buffer is exhausted of complete requests, or after ten requests. After this has taken place, the server figures out if there are any clients with pending complete requests. If this is the case, the server skips the select(2) call and goes back to process those pending requests. When all client input is exhausted, the X server calls select(2) again to await more data. The problem with this algorithm is that it gives preference to more active clients. If a client generates complex requests, these requests may take up more time to be satisfied, and this client will end up hogging the server. As a consequence, clients that generate fewer requests are starved in the presence of more active clients. Also, clients that do not generate a complete request during their turn will be ignored. On the positive side, when clients are busy, the server spends most of its time executing requests and wastes little time on system calls.

The design goal of the proposed solution is to provide relatively fine-grained, time-based scheduling with increased priority given to applications that receive user input. Each client would be given an initial priority at connect time. Requests are executed from the client with the highest priority, and various events may change the priority of a given client. This system intends to penalize overactive clients and reward clients with little activity.
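A toy model makes the policy concrete; the sketch below is purely illustrative (the priority values and adjustments are invented, not the XFree86 implementation):

```python
# Toy model of priority-based client scheduling: each tick runs
# requests from the highest-priority client, penalizes it for
# consuming its slice, and boosts clients that receive user input.
class Client:
    def __init__(self, name, priority=0):
        self.name, self.priority = name, priority
        self.executed = 0          # request batches run so far

def schedule_tick(clients):
    best = max(clients, key=lambda c: c.priority)
    best.executed += 1
    best.priority -= 1             # penalize the overactive client
    return best

def user_input(client):
    client.priority += 5           # interactive clients jump the queue

plaid = Client("plaid")            # graphics-intensive client
xterm = Client("xterm")
for _ in range(10):
    schedule_tick([plaid, xterm])
# With equal starting priorities the clients alternate instead of one
# hogging the server:
assert abs(plaid.executed - xterm.executed) <= 1
user_input(xterm)
assert schedule_tick([plaid, xterm]) is xterm
```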

In the performance measurements that were presented, a measurable change was apparent. However, the two schedulers were within only 2% of each other. This shows that the changes in the scheduler have little impact on the tool used for performance measurement. The tool used was X11perf, a widely available tool. This tool runs with very little competition from other clients, and thus most of the benefits of the new scheduler go unnoticed.

In conclusion, simple changes to the scheduler, based on real-life observations of X server behavior, can bring advantages over the original scheduler without negatively impacting performance. This work has been incorporated into the 4.0 release of the X Window System from the XFree86 group; for more information, see <http://www.xfree86.org/>.

THE AT&T AST OPEN SOURCE SOFTWARE COLLECTION

Glenn S. Fowler, David G. Korn, Stephen S. North, and Kiem-Phong Vo, AT&T Laboratories-Research

David Korn presented a suite of tools that AT&T has released to the open source community. The suite includes widely known components that may or may not be directly used in graphics applications, but their release has a profound impact on the open source community.

These tools have been released under an AT&T license agreement. This is neither the GPL nor the LGPL, so it is important to read the license before making use of any of the software components for any purpose.

The components of this suite are deemed highly portable to practically any environment, given the right base. They are not necessarily complete applications, but they are reusable tools for producing other powerful pieces of software. The focus in creating this software suite was reusability. The authors have also focused on creating software libraries that encompass core computing functions such as I/O and memory allocation, as well as new algorithms and data structures for data compression, differencing, and graph drawing. Among the libraries they created:

Libast – Porting base library for their software tools.

Sfio – This I/O library provides a robust interface and implements new buffering and data-formatting algorithms that are more efficient than those in the standard I/O library, Stdio.

Vmalloc – This memory-allocation library allows creation of different memory regions based on application-defined memory types (heap, shared, memory-mapped, etc.) and some library-provided memory-management strategies.

Cdt – This container data-type library provides a comprehensive set of containers under a unified interface: ordered/unordered sets/multisets, lists, stacks, and queues.

Libexpr – This library provides runtime evaluation of simple C-style expressions.

Libgraph – This graph library supports attributed graphs, generalized nested subgraphs, and stream file I/O in a flexible graph data language. It is built on top of the Cdt library and employs disciplines for I/O, memory management, graph-object namespace management, and object-update callbacks. This library is the base of the Graphviz package.

Other complete tools released as part of this collection include reimplementations of programs like the KornShell language and nmake, as well as new applications such as tw, a more powerful find and xargs, and warp, a tool that helped with Y2K testing by running a process through it and simulating a time and clock speed different from the actual system's.

For more information, please see <http://www.research.att.com/sw/tools/>. For the AT&T source code license agreement, please see <http://www.research.att.com/sw/license/ast-open.html>.

SESSION: WORKS IN PROGRESS

Summarized by Kevin Fu

PERL ON THE ITANIUM

Murray Nesbitt <[email protected]>, ActiveState Tool Corp.

Nesbitt discussed his experiences building Perl under Windows 2000, Linux, and Monterey on the 64-bit Intel Itanium. Building Perl on IA-64 Linux was straightforward. Building Perl on Monterey was almost as easy, though it required a standard Perl hints file to compile. Nesbitt's main point was that building Perl on Windows 2000 was more painful, for reasons such as the lack of configure-script support and problems with type sizes and abstractions. In the future he plans to work on optimization. In short, it's fairly easy to port Perl, but it helps to share an office with a Perl guru.

TTF2PT1: A TTF TO ADOBE TYPE 1 FONT CONVERTER

Sergey Babkin <[email protected]>, Santa Cruz Operation

Babkin described his work on a TTF-to-Adobe Type 1 converter. The converter also attempts to clean the outlines of defects and automatically generate Type 1 hints, optimizing to make the converted fonts look good. It comes under the BSD license and is reasonably well modularized. He noted that this work has nothing to do with SCO and is simply a personal hobby. For more information, see <http://ttf2pt1.sourceforge.net/>.


LODD: A PIPELINING AND IN-PIPE DATA MANIPULATION TOOL

Joseph Pingenot <[email protected]>, Kansas State University

Pingenot, an undergraduate at KSU, talked about a tool to combine multiple channels of input and output. The lodd utility aims to perform tasks such as acting as a logical dd and mixing three pipes together. lodd 1.x has stream-pipe support and can mix multiple pipelines together, manipulate data at the block level, perform bitwise logic, and divide pipelines. lodd is dd-compatible. Some of the interesting issues in creating lodd include dealing with blocking I/O, backstreams, block sizes, shell limitations, and deciding what to do if a pipe closes. Visit <http://www.phys.ksu.edu/~trelane/lodd> for more information.

Q: Are functions extensible? A: Not in the preliminary version. Hopefully in the future.

Q: How does one use lodd within a shell? A: I am now looking at ways to implement a shell syntax. Suggestions are welcome.

VSTACK: EASILY CATCH SOME BUFFER OVERRUN ATTACKS

Craig Metz <[email protected]>, University of Virginia

Buffer overruns typically work by overwriting a function's return address with a value of the adversary's choice. In this way, an adversary can change the flow of control to execute, for example, a root shell. Metz described a simple approach that prevents many commonly exploited overruns.

vstack verifies that function return addresses do not change. It does so by keeping a separate virtual stack of return addresses and frame pointers. On return from a function, the program verifies that the return address and frame pointer on the execution stack match those on the virtual stack. If not, the program jumps to a fault handler. This does not break the standard calling convention and requires changes only to the caller convention in a compiler.

A sample Perl implementation exists for the x86. It edits assembly code to insert the checks and the management of the virtual stack. The code will appear soon under a BSD-style license. The performance loss is minimal. In the future, Metz plans to offer more sophisticated choices of what to do after detecting an overrun. vstack does not catch every overrun attack, but it can catch the vast majority.

Q: Instead of verifying that the return address matches, why not simply use the return address on the virtual stack? A: There might be other corrupted data. In special cases it might be OK to use the virtual stack directly, but not in general.

Q: What prevents overwriting the virtual stack itself? A: It is elsewhere in memory. If you can overwrite an extent of 2^32 space, then you can overwrite everything anyway.

Q: Do longjmp() and other functions that unwind the real stack confuse vstack? A: Probably. (Offline, Metz explained that this is mostly solvable by using wrappers around longjmp() and related functions. It won't catch every case, but it will catch most of them.)

Q: Can I overwrite data on the stack? A: Yes, but vstack will detect changes made to return addresses.

KQ: KERNEL QUEUES IN FREEBSD

John-Mark Gurney <[email protected]>, FreeBSD

Kernel queues are a stateful method of event notification. Instead of passing the set of events to monitor on each call, as is done with select(2) and poll(2), the program tells the kernel once which events need notification. kq supports event monitors (filters) for file descriptors, processes, signals, asynchronous I/O, and VNODEs. State is allocated in kernel memory. kq pays attention to which file descriptors are ready for reading and writing.

USENIX ANNUAL TECHNICAL CONFERENCE ●

John-Mark described l0pht's watch program modified to use kq. The watch program looks for temporary files created in /tmp. Before using kq, the watch program consumed a lot of CPU time by polling directory entries in /tmp over and over. With kq, you get a notification when the /tmp directory changes; in this manner, the watch program does not needlessly scan an unmodified directory.

Efforts are underway to use kq in the Squid HTTP proxy cache for asynchronous I/O and in ircd to reduce server load. Visit <http://people.freebsd.org/~jmg/kq.html> for more information.

Q: How does this compare to /dev/poll on Solaris 8? A: I haven't looked at /dev/poll; it's hard to say. However, my system is extremely lightweight.

Q: If a given source sends multiple signals, will it cause a series of kernel memory allocations before the read occurs? A: The memory is actually allocated when you register.

Q: Can you unify with other kernel namespaces? A: Currently you are limited to a filesystem. However, pretty much any kernel object can be associated.

AUTONOMOUS SERVICE COMPOSITION ON THE WEB

Laurence Melloul <[email protected]>, Stanford University

The goal of this service is to allow dynamic specification of composition requests. The advantages include a cost-effective development cycle, better fault tolerance, and high availability. Melloul chose the Web as the medium because it has autonomous services, uses a public infrastructure, is common, is simple, and speaks a language-independent protocol (HTTP). The two main issues are detection of service-interface changes and verification of semantic compatibility between service parameters. The key is to build on the ontology defined by Web users.

THE HUMMINGBIRD FILE SYSTEM

Liddy Shriver <[email protected]>, Bell Labs, Lucent Technologies

The Hummingbird File System caters to the workload of a caching Web proxy. On UNIX machines, server software such as Apache or Squid typically runs on a derivative of the 4.2BSD Fast File System (FFS). FFS was not designed with the workload of a proxy server in mind: a Web-proxy workload has high temporal locality, relaxed persistence, and a read/write ratio different from most workloads. FFS also includes features that a proxy server does not require.

The Hummingbird File System takes advantage of Web-proxy server properties such as whole-file access, small files on average, and repeatable reference-locality sets. It also co-locates the storage of an HTML Web page with its embedded GIFs. Performance measurements show that under a sample Web-proxy workload, Hummingbird supports throughput five to ten times greater than that of XFS and EFS (SGI) and six to ten times greater than that of FFS mounted asynchronously (FreeBSD).

Future work includes persistence of data. At the moment, Hummingbird does not worry about persistence because the data can be regenerated from origin servers. Visit <http://www.bell-labs.com/project/hummingbird/> for more information.

Q: Does Hummingbird support HTTP reads for specific byte ranges? A: Not yet.

Q: When does Hummingbird write to disk? A: During idle time, and when memory is filled and space is needed.

THE SELF-CERTIFYING FILE SYSTEM

Kevin Fu <[email protected]>, MIT Lab for Computer Science

The Self-Certifying File System (SFS) is a secure, global filesystem with completely decentralized control. SFS lets you access your files from anywhere and share them with anyone, anywhere. Anyone can set up an SFS server, and any user can access any server from any client. SFS lets you share files across administrative realms without involving administrators or certificate authorities.

SFS is a secure network filesystem in the sense that it provides confidentiality and integrity for the Remote Procedure Calls (RPCs) going over the wire. SFS runs at user level and uses NFSv3 for portability. Its performance falls between that of TCP and UDP NFS on FreeBSD, and it is in day-to-day use by Fu's research group.

Server public keys are made explicit in pathnames. The pathname to a remote file includes a "HostID" that consists of a cryptographic hash of the server's public key and hostname. At a high level, the HostID is essentially equivalent to a public key. Using this convention, certificate authorities can easily be implemented with symbolic links, and a certificate authority can chain trust through symbolic links. A system of user agents and authenticated lists of trusted HostIDs takes care of most HostIDs.

There is also a read-only dialect suitable for highly replicated, public, read-only data (e.g., software distribution or certificate authorities). In this scheme, an administrator creates offline a signed database of a filesystem to export. Untrusted servers can replicate this database, and clients can then select any of the untrusted servers. Because the database is signed and the self-certifying path denotes the corresponding public key, the client can verify that the data is authentic. Measurements show that a read-only server can handle many times the workload of a read-write SFS server and that server-side authentication is more than an order of magnitude faster than that of SSL.

SFS is free software. The SFS developers have used it on OpenBSD, FreeBSD, Solaris, OSF/1, and a patched Linux kernel. Download the software from <http://www.fs.net/>.

Q: Why not store server public keys in a file? Is the self-certification just a hack? A: We believe the symlink approach is more elegant. And c'mon! It's a cool hack.

Q: Have you thought about committing this to the FreeBSD tree? You should make sure that your software stays maintained by contacting someone in charge of distributions for each operating system. A: You are certainly welcome to include SFS in your operating system. We can try to locate the appropriate contacts for each operating system, but you are welcome to contact the SFS developers too. Email <[email protected]>.

Q: Do you rely on DNS for security? A: No. We only use DNS as a hint to locate a server. If a fake SFS server responds to a request, the client will detect the fake because the fake server's private key will not correspond to the public key described in the self-certifying path. You could receive notification of such failures via an agent. In our system, the worst an adversary can do is deny service.

NFS VERSION 4 OPEN SYSTEMS PROJECT

Andy Adamson <[email protected]>, CITI, University of Michigan

Adamson discussed some of the interesting features of NFS version 4. The CITI group at the University of Michigan received funding from Sun Microsystems to implement NFSv4 on Linux and OpenBSD.

NFS version 4 has compound RPCs that perform multiple operations per RPC. There is no more mountd, and no more lockd: locking is incorporated into the protocol and includes DOS share locks and nonblocking byte-range locks with lease-based recovery. Delegation aids in client file-cache consistency; the server controls who gets a delegation. Security is added to the RPC layer via the GSSAPI. NFSv4 requires Kerberos 5 and LIPKEY PKI implementations. The security mechanism and QOP are negotiated between client and server.

The code has passed all nine basic Connectathon tests under Linux. The CITI folks have just started work on the OpenBSD code. In the near future, Adamson plans to rebase to Linux 2.2.4 and finish the OpenBSD port. Source code will be available by September 1, 2000. Visit <http://www.citi.umich.edu/projects/nfsv4/> for more information.

Q: Is there backwards compatibility with NFS3? A: There is none.

Q: Do you expect future growth in NFS specifications? A: Ohhhh yeah.

Q: What are the terms of the license? A: Under Linux it is GPLed. Under OpenBSD it has the OpenBSD license.

Q: Does it work under IPv6? A: We'd love it to work with IPv6. Would you like to financially support us?

ALFA-1: A SIMULATED COMPUTER WITH EDUCATIONAL PURPOSE

Alejandro Troccoli <[email protected]> and Sergio Zlotnik <[email protected]>, University of Buenos Aires

The Alfa-1 project consists of software tools to simulate a processor, which helps in teaching computer architecture to undergraduate students. The GAD tool was used as a basis for developing a simulated computer, allowing students to experiment with the approach defined by the DEVS formalism. The model is based mainly on the specification of the integer unit of the SPARC processor.

Undergraduates implemented most of the system, which is now completely specified and implemented. In the future, the Alfa-1 staff hopes to add a GUI, perform exhaustive testing, use digital logic gates, add another level of cache memory, and promote educational use of Alfa-1. Visit <http://www.dc.uba.ar/people/proyinv/usenix/> for more information.

Q: What are you able to simulate? A: If we had enough computational power, we could simulate everything.

Q: Are there triggers, tracepoints, GDB for this? A: This is plain C code. You can debug the simulator using any standard tool.

TELLME STUDIO

Jeff Kellem <[email protected]>, Tellme Networks

Tellme Studio offers a free service for developing and testing VoiceXML applications. All you need is knowledge of VoiceXML and, if you choose to use it, JavaScript. You write your VoiceXML code, put it up on a Web server somewhere, log in to Tellme Studio (<http://studio.tellme.com/>), and give the URL pointing to your code. You are then given an 800 number to call to immediately test the application. Tellme Studio includes VoiceXML documentation, code and grammar examples, and a community for sharing ideas.

PASSWORDS FOUND ON A WIRELESS NETWORK

Dug Song <[email protected]>, CITI, UMich

Receiving a standing ovation and giving by far the most entertaining talk at the conference, Dug Song from the CITI group at the University of Michigan gave a "brief report of what he found in the air." He further described tools he created to demonstrate the insecurity of his network. In the process, the audience convinced him to give a live demonstration of how easy it is to collect passwords and shadow a user's Web surfing.


Song's second slide included slightly sanitized sniffer logs. Among the finds were cleartext root logins via Telnet and colorful passwords such as "hello dug song, do I smell." While explaining an ebay.com URL, Dug reasoned, "I don't know if Matt Blaze is here, but it sure looks like he is."

On to more serious stuff: Dug explained the rationale behind his seemingly malicious behavior. He views himself not as a bad guy, but as someone promoting "security through public humiliation." On that note, he introduced the mother of all password sniffers and a few penetration-testing tools. His tools include:

arpredirect – This tool is extremely effective for sniffing on switched Ethernet. It does so by poisoning ARP, and it politely restores the ARP mappings when finished.

macof – A C port of a tool to flood a network with random MAC addresses, which causes some switches to fail open in repeating mode. This effectively turns the switch into a hub for the purposes of sniffing. Dug commented, "Switch becomes hub, sniffing is good."

tcpkill – An evil tool to selectively kill connections. Dug, however, uses it to remotely initialize connection state.

tcpnice – Selectively slows down traffic by using ICMP quenches and shrinking TCP window sizes. This is useful for sniffers that work better on slower network traffic; it's also useful against things like Napster.

dsniff – This sniffer decodes 30 major protocols, from Telnet to Meeting Maker. The HTTP module recognizes password URL schemes for many e-commerce sites (e.g., Web mail sites, eBay, etc.). dsniff uses magic(5)-style automatic protocol detection; for example, running Telnet on port 3000 will not fool dsniff, because it determines the protocol by analyzing the traffic content.

filesnarf – Sucks down cleartext NFS2 and NFS3 traffic. Very useful against files such as .Xauthority and .ssh/identity. This is Song's "motivation" for developing NFSv4.

mailsnarf – A fast and easy way to violate the Electronic Communications Privacy Act of 1986 (18 USC 2701-2711), so be careful. This snarfs cleartext mail and outputs the contents in a convenient format suitable for offline browsing with your favorite mail reader.

urlsnarf – Same idea as mailsnarf, but for URLs.

webspy – Very sinister. It allows you to watch someone's Web surfing in real time. Dug demonstrated the tool by shadowing the Web surfing of an audience member.

Song concluded that many people incorrectly believe wireless and switched networks are immune to sniffing. He thinks that public humiliation can disabuse people of this misconception. Visit <http://www.monkey.org/~dugsong/dsniff> for more information. The slides are at <http://www.monkey.org/~dugsong/talks/usenix00.ps>.

Q: Should we block access to port 23 in USENIX terminal rooms? Then cleartext Telnet sessions would not happen. A: That's a technical solution to a social problem. I like my way better.

Q: The conference network ran out of DHCP leases. Can your tools help me? A: During the conference I wrote a short "dhcpfree" program to forcibly free up IP addresses. [Crowd laughs as he shows the code.]

Q: Once upon a time in a terminal room, I measured the ratio of SSH to Telnet traffic. At one point, I said loudly, "Look at all the interesting passwords!" All of a sudden, the ratio went up. A: Yup.

INVITED TALK

LESSONS LEARNED ABOUT OPEN SOURCE

Jim Gettys, Compaq

Summarized by Matt Grapenthien

In this interesting, informative, and thoroughly entertaining talk, Jim Gettys addressed the history of several projects and presented the most (and least) successful methodologies over a time frame of about two decades.

Beginning with a history of X, Gettys traced its peaks and valleys from "prehistory" (1983) through our present "baroque" period. When the CDE (or "cruddy desktop environment") took over the market, X development became almost nonexistent. Then, starting in about 1996, a combination of factors revitalized X. X is better today than at any point in the past, and the future looks promising.

Next, Gettys talked about several individual projects, tracing Apache's market share through the past decade. Through these case studies, he noted which practices worked best (release continuously; make it easy for developers to contribute) and which didn't work at all.

Gettys achieved a rare balance of keeping the talk both very entertaining and very useful. His wit, sarcasm, and experience (20 years with OSS) made this a valuable and enjoyable session.

[Photo: Jim Gettys]

SESSION: RUN-TIME TOOLS AND TRICKS

Summarized by Josh Kelley

DITOOLS: APPLICATION-LEVEL SUPPORT FOR DYNAMIC EXTENSION AND FLEXIBLE COMPOSITION

Albert Serra, Nacho Navarro, and Toni Cortes, Universitat Politècnica de Catalunya

DITools is a way to modify an application or library without access to the source code. It can be used, for example, to profile a binary without instrumentation, to use a different library with a program without rewriting the program, or to fix a bug in a library without recompiling it.

A process image is traditionally built of three parts: a runtime loader, a main program, and one or more libraries. The OS brings all of the needed modules into the address space, and the runtime loader then resolves references between these modules. DITools augments the loader to insert an extension backend between the main program and the libraries. The main program then calls the extension backend instead of the libraries, and the backend forwards calls to the libraries as appropriate.

DITools loads before entry into the program. It offers dynamic-loading support and uses binding management to interpose modules. DITools can change function bindings on a per-module basis, through rebinding, or globally, through redefinition. It also offers two interposition modes: it can change the linkage table to point directly to a backend's wrapper, or it can change the linkage table to point to DITools's dispatcher, which calls callback functions in the backend before and after calling the original function. To ensure correct operation, DITools transparently checks for and handles events that may affect its behavior, such as dynamic loading, process forking, and multithreading.


Performance tests show that DITools incurs a reasonable overhead on function calls. DITools complements related methods such as static binary rewriting and dynamic instrumentation based on code patching: it operates at a higher abstraction level and is simpler than these methods, but it works at the function level only. It presents an application-level tool to intercept cross-module references and easily make changes. DITools is available from <http://www.ac.upc.es/recerca/CAP/DITools>.

PORTABLE MULTITHREADING: THE SIGNAL STACK TRICK FOR USER-SPACE THREAD CREATION

Ralf S. Engelschall, Technische Universität München (TUM)

Multithreading offers many advantages to a programmer, but finding a portable fallback approach to implement threading on UNIX platforms can be difficult if the standardized Pthreads API is not available. setjmp(3) and longjmp(3) are the traditional methods of transferring execution control in user space, but they do not address the question of how to create a machine context on a particular runtime stack. The ucontext(3) API allows for user-space context creation and switching, but it is still not available across all platforms.

A solution to this problem is to use the UNIX signal-handling facilities in conjunction with setjmp(3) and longjmp(3). The process creates a machine context by setting up a signal stack using sigaltstack(2), then sending itself a signal to transfer control onto that stack. Once on that stack, the process saves the machine context there via setjmp(3), leaves the signal-handler scope, and later restores the saved machine context outside signal-handler scope, finally entering the thread-startup routine while running on this particular stack. A much more detailed description of the algorithm is available in the author's paper or at <http://www.engelschall.com/pw/usenix/2000/>.

Performance tests show that thread creation using this signal-stack trick is about 15 times slower than thread creation using ucontext(3), because of the signaling required. User-space context switching, however, is as fast as with ucontext(3). The signal-stack trick offers an extremely portable method of implementing user-space threads without relying on assembly code or platform-specific facilities. This fallback approach is used in the GNU Portable Threads (Pth) library, available from <http://www.gnu.org/software/pth/>.

TRANSPARENT RUN-TIME DEFENSE AGAINST STACK-SMASHING ATTACKS

Arash Baratloo and Navjot Singh, Bell Labs Research, Lucent Technologies; Timothy Tsai, Reliable Software Technologies

Buffer overflows are one of the most common sources of security vulnerabilities. Crackers can exploit buffer overflows to achieve two dependent goals: injecting attack code and altering control flow to execute that attack code. The basic method of exploiting a buffer overflow is the stack-smashing attack, where the attacker puts the attack code on the stack, then overwrites the current function's return address with the address of the attack code. This presentation offered two complementary defenses against stack-smashing attacks.

The first defense is libsafe, which intercepts calls to unsafe functions and replaces them with safe versions. The majority of buffer overflows result from the misuse of unsafe functions such as strcpy and fscanf. At runtime, libsafe estimates a safe upper bound on the size of a stack buffer; it intercepts calls to the unsafe functions and replaces them with calls using this upper bound, thus containing overflows to a safe region and guaranteeing that stack return addresses are protected. The function-interception technique is similar to that used by zlib.


The second defense, libverify, uses binary rewriting to ensure that return addresses are valid before use. It wraps each function to save the function's return address on the heap upon entry and to check the return address at exit. If the return address has changed, the process writes an error to the screen and syslog and dies. libverify uses runtime instrumentation (it copies the program at runtime and changes it) to insert its stack-checking code. This technique is similar to that used by StackGuard, but without recompilation of the source code.

Both libraries can be loaded for an already-compiled binary using an entry in /etc/ld.so.preload or the LD_PRELOAD environment variable. Testing showed that these libraries successfully prevented known exploits on several programs, with reasonable execution-time overhead. libsafe is available for Linux from <http://www.bell-labs.com/org/11356/libsafe.html>.

INVITED TALK

AN INTRODUCTION TO QUANTUM COMPUTATION AND COMMUNICATION

Rob Pike, Lucent Technologies – Bell Labs

Summarized by Doug Fales

Rob Pike's discussion of quantum computing was a very forward-looking, change-of-pace invited talk. He first reviewed some quantum mechanics to bring the audience to common ground. The always-popular polarized-light experiment helped demonstrate the principles. He also presented the famous two-slit experiment, in which a single photon passed through two very small slits in a barrier still creates an interference pattern on the opposite side of the barrier. Pike used this as a demonstration of the quantum measurement postulate, because when the particle is measured to see which slit it passed through, the pattern disappears.


After the simplified (but challenging) introduction, Pike progressed to the more specific field of quantum computation. He addressed quantum-mechanical phenomena like decoherence and entanglement, as well as the implementation of quantum computational systems (qubits, quantum gates, etc.). The most promising aspects of this infant technology, massive parallelism and zero-energy calculation, were clarified. As examples of the power of quantum computation, Pike went over Shor's algorithm, which factors large numbers in polynomial time, and Grover's algorithm, which searches a list in square-root time.

In addition to the computational side of quantum mechanics, Pike also addressed possibilities for communications, which he saw as not so distant in the future. This included a discussion of EPR pairs (pairs of entangled photons produced by electron-positron annihilation, named after Einstein, Podolsky, and Rosen) and how entanglement is actually an advantage for communications.

Perhaps the most interesting point of the whole talk was the theme that information is known to be a physical quantity, restricted by a law of conservation, much like energy or mass. Furthermore, as Moore's Law continues to shrink classical computers, we will run into a physical barrier of size; as Pike put it, "we're running out of particles."

Although quantum computation is still relatively far from real application, Pike noted in closing that we cannot yet tell whether progress in this field will be linear or exponential. After all, he said, classical computers were as much of an enigma 60 years ago as quantum computation is today. The slides for this presentation are at <http://www.usenix.org/events/usenix2000/invitedtalks/pike_html/index.html>.

INVITED TALK

PROVIDING FUTURE WEB SERVICES

Andy Poggio, Sun Labs

Summarized by Josh Simon

Andy Poggio basically expanded on Bill Joy's keynote talk. The Internet has effectively begun to mimic Main Street and is beginning to provide those services that Main Street cannot, such as anytime, anywhere access. Six webs are of relevance:

Near web – Monitor, keyboard, and mouse attached to a nearby system; personalized news such as multimedia and news-on-demand; and educational aspects like multimedia, interactive simulations, and so on. An example of educational uses of the near web can be found at <http://www.planetit.com/techcenters/docs/>.

Far web – The television or appliance with remote control, providing entertainment on demand; multiple data sources, providing a lower barrier to entry; and possibly targeted advertising, with product placement in on-demand movies: Coke drinkers might see a Coke ad on the side of a taxi cab, while Pepsi drinkers see a Pepsi ad in the same position.

Voice web – For use when the hand and eye are busy, like driving a car.

e-commerce web – Computer-to-computer, such as auctions (both "forward" like eBay and "reverse" like eWanted) and dynamic pricing.

Device web – Device-to-device for non-PC devices like cell phones, pagers, and set-top boxes. These include agents to collect data and remote distributed processing via IP and Java.

Here web – Personal digital assets, like "my CDs" or "my MP3s" or "my DVDs"; providing on-demand access and ownership, and allowing the end user to create her own environments.

So how do we get there? Three aspects need to be worked on. First, the network has to be enhanced. IPv6 provides more address space, better configuration management, authentication, and authorization, but adoption has been slow. Poggio predicts that wired devices will win over wireless devices, that both quality of service and overprovisioning will continue, that optical fiber will replace or supersede electrical (copper) wiring, and that the last mile to the home or the consumer will be fiber instead of ADSL, cable modems, or satellites. Second, the computer-chip architecture will probably remain based on silicon for the next ten or so years. Quantum effects (see the "Quantum Computing" talk for more information) show up around 0.02 microns, so we need new approaches such as optical computing, organic computing, quantum computing, or computational fogs (virtual realities). Third, Poggio believes that the system architecture will connect three components – CPU server, storage devices, and the network – with some form of fast pipe, probably InfiniBand (a high-bandwidth, low-error, low-latency fast interconnect).


FREENIX SESSION: COOL STUFF

Summarized by Jeff Schouten

AN OPERATING SYSTEM IN JAVA FOR THE LEGO MINDSTORMS RCX MICROCONTROLLER

Pekka Nikander, Helsinki University of Technology

The RCX Microcontroller is a device sold as part of a Lego set. It's designed to move a bit of Lego here or there, making a mostly functional robot for a child to play with. As a project, a Java operating system was developed for this microcontroller by Pekka Nikander and his students at Helsinki University of Technology. The RCX consists of a Hitachi H8 microcontroller, 32K of RAM, an LCD panel, an IR transceiver, and several I/O devices.

Using mostly Java, a small bit of C++, and a bit of H8 assembly, there is a mostly functional OS for the RCX. When the RCX agrees to take the download (about one in three times), it runs fairly well, but gets stuck in a loop once in a while. A student is currently debugging this behavior. (That gave us a few laughs.)

In all, it's an amazingly small OS for a very limited task, but it works – it can be done.

LAP: A LITTLE LANGUAGE FOR OS EMULATION

Donn M. Seeley, Berkeley Software Design, Inc.

LAP (Linux Application Platform) is a Linux-emulation package for BSD/OS that allows Linux applications to run under BSD/OS. By loading a shared library on a BSD system to "catch" Linux system-level calls and reroute them to the BSD kernel, a BSD user can run a Linux application.

Emulation is quite successful. A number of interesting Linux applications run via LAP and liblinux under BSD, including Adobe Acrobat Reader v4, Netscape Communicator, and WordPerfect 8. Not all functions of a Linux system are emulated, most notably the Linux clone() system call.

Overall, it seems not to eat a lot of processor, and it provides BSD users a bit of flexibility they didn't previously have.

TRAFFIC DATA REPOSITORY AT THE WIDE PROJECT

Kenjiro Cho, Sony CSL; Koushirou Mitsuya, Keio University; Akira Kato, University of Tokyo

The idea behind this project is to collect statistical data on trans-Pacific backbone links on the Internet. WIDE is a Japanese research consortium that designed this data repository, with the intent of building free tools with which you can build your own repository.

One problem with this type of data is privacy; another is security. How do they protect private information from leaking into the repository and thus generating a possible security breach? Removal of the payload is the first step, leaving only header information to analyze. Address scrambling is the second step, rewriting or stripping IP addresses out of ICMP and TCP packets.

INVITED TALK

THE GNOME PROJECT

Miguel de Icaza

Summarized by Josh Kelley

Although Linux has proved itself on the server, its progress on the desktop lagged until quite recently. Three years ago, UNIX had little innovation; the last significant user-interface change was X Windows. There was little code reuse and no consistent way to build desktop applications. Improvements were incremental enhancements to speed and feature lists rather than major architectural changes, and there was little direction between groups making these enhancements. The GNOME project aims to correct these shortcomings.

The GNOME project is a unified, concerted effort to build a free desktop environment for UNIX. One major part of this effort is GNOME's component platform. GNOME is designed as a collection of components that build on top of one another; dependencies between components are encouraged, and each application or component exports its internals to others via CORBA.

This approach to writing software as a collection of small components works well with free software. Since free-software contributors tend to come and go, a set of components allows them to focus on, and contribute to, a small problem with relatively little ramp-up time.

Traditional approaches to components have several disadvantages. UNIX command-line tools may not be easy to use and can only communicate through unidirectional pipes that transfer primarily streams of textual data. Object-oriented programming has its advantages, but sharing objects among different object-oriented languages is difficult.

Bonobo is the GNOME solution for components. Bonobo is a component architecture based on CORBA and partially inspired by Microsoft's COM/ActiveX/OLE2. Bonobo provides the building blocks and the core infrastructure for writing and using components. Bonobo can be divided into two parts: the CORBA interfaces that are the contract between the providers and the users, and the implementations of those interfaces. Other implementations of these interfaces are possible. For example, KDE could choose to provide these interfaces to allow KDE and GNOME components to work together.

Since Bonobo is based on CORBA, it has all of the standard CORBA features, including language independence and support for automation and scripting. Bonobo's basic interface is Bonobo::Unknown, which provides two basic features: life-cycle management of an object through reference counting, and dynamic discovery of features through a standard query interface. Specific components may support any number of additional interfaces to allow full access to their functionality.

One application of Bonobo would be to provide standard interfaces to system services. Traditional UNIX, for example, stores configuration information in a variety of formats, mostly in files under the /etc directory, and specific instructions on how to make changes and how to put these changes into effect often differ. The goal of Bonobo (and GNOME) is to have CORBA interfaces everywhere, for every service, for the desktop, and for each application. The entire system should be scriptable. Applications should have easy access to one another's functionality rather than having to write desired functionality themselves. This provides better IPC (a more flexible alternative to pipe and fork) and better Internet protocols (applications can communicate via Bonobo rather than using ad hoc protocols).

Bonobo is used in several applications that are either in development or are currently available. These include Gnumeric (a spreadsheet), Sodipodi (a draw application), Evolution (a mail and calendar system), and Nautilus (the GNOME 2.0 file manager). Bonobo is also integrated with the rest of GNOME; the GNOME canvas, for example, allows embedding of objects of any type, and the GNOME printing architecture offers functionality similar to the canvas to provide an alternative to using PostScript for everything. GNOME 1.4, which includes Bonobo 1.0, should be released this October.

FREENIX SESSION: SHORT TOPICS

Summarized by Craig Soules

JEMACS – THE JAVA/SCHEME-BASED EMACS

Per Bothner

The first talk of the "shorts" section of Freenix was a discussion of JEmacs, the Java/Scheme-based Emacs. Described as "a next-generation Emacs based on Java," JEmacs offers all of the standard features of Emacs, as well as a number of useful new features.

The rationale behind JEmacs is that Java implementations have become increasingly fast, making Java suitable for an application such as Emacs. Additionally, it offers a number of useful features, such as built-in Unicode support and multithreading. Through the use of Kawa, JEmacs also has an easy-to-use Scheme interpreter that offers features of Java, such as Swing (its GUI interface), in a simple scripting language.

In order to make the transition to JEmacs as painless as possible, JEmacs also offers full ELisp support. Although there are some slight difficulties with integrating ELisp with the multithreaded nature of Java, the author seems to have them well in hand. For more information on JEmacs, see <http://www.jemacs.net/>. Kawa (JEmacs's Scheme implementation) also has a home page, <http://www.gnu.org/software/kawa/>.

A NEW RENDERING MODEL FOR X

Keith Packard, SuSE, Inc.

Keith Packard spoke on developing a new rendering model for the X Window System. The current X system was developed based upon the PostScript specification of the time, and was targeted at being a simple solution to support simple "business" applications. Today most programs don't even use many of the available features of X, relying on inefficient or unaccelerated external libraries to handle screen drawing, which is done mostly client-side. By looking at the requirements of most X programs today, Packard has developed a new model for X that will offer many of its missing features.

Although many things in the current X system are lacking, there are a few things that are worth keeping, such as testable pixelization and exposed pixel values. It is also important that all of the potential, good and bad, of the old model is left in place for use by legacy programs. The proposed solution offered here is simply to add a number of new features. These include alpha composition, antialiasing support, a finer-grained coordinate system, more rendering primitives, and better text handling.

UBC: AN EFFICIENT UNIFIED I/O AND MEMORY CACHING SUBSYSTEM FOR NETBSD

Chuck Silvers, The NetBSD Project

UBC is a unification of the I/O and memory caching systems of NetBSD, and was presented by Chuck Silvers. Unlike many other current operating systems, NetBSD had separate page and file caches. This led to many complications within the kernel involving management between the two systems to avoid stale data and ensure proper behavior. Additionally, the page cache has a number of useful features not available in the file-buffer cache, such as being dynamically resizable.

The proposed solution to this problem is to have the page cache manage all file access. This offers a number of benefits: only one copy of the data will ever be cached, which also prevents any copying to avoid stale data; the page cache can dynamically resize itself; and cached data no longer needs to be constantly mapped in memory to remain in the cache. This is all managed through several new calls, which are used by the buffer cache to retrieve and manage pages from the virtual-memory system.


Although this new system has the potential for increased performance as well as a reduced overall cache size, many improvements need to be made in order to make this a reality. Tests with the initial system indicate worse sequential-access performance than the current caching system. This is claimed to be due mostly to unaggressive read-ahead and bad pager algorithms. In addition to the performance tuning, several other enhancements are in the future, such as soft-updates support, avoiding the need to map pages in order to do file I/O, and page loan-out. The current implementation of this code will become available in the release following the 1.5 release of NetBSD.

MBUF ISSUES IN 4.4BSD IPV6 SUPPORT – EXPERIENCES FROM THE KAME IPV6/IPSEC IMPLEMENTATION

Jun-ichiro itojun Hagino, Internet Initiative Japan, Inc.

This presentation looked at the difficulties found in offering IPv6 support in a BSD environment. Although IPv6 was designed with the idea that it can easily be layered over current IPv4 implementations, its implementation on 4.4BSD proved to be more complicated than advertised.

These complications arose from a number of IPv4-specific assumptions made within the BSD kernel. The biggest of these problems arose from a change in header size in IPv6, which led to severe packet loss. In order to fix this problem, the author created a new parser for the IP layer which not only offered correct IPv6 support, but also significantly reduced the amount of copying and mbuf allocation. This work has now been integrated into all of the major BSD kernels available. For more information on this work, see <http://www.kame.net/>.

MALLOC() PERFORMANCE IN A MULTITHREADED LINUX ENVIRONMENT

Chuck Lever and David Boreham, Sun-Netscape Alliance

This work, presented by Chuck Lever, is a direct result of the Linux scalability project at the University of Michigan. This project aims to enumerate the problems facing network servers and to ensure that Linux supports network servers at least as well as current production-level operating systems.

Because malloc() is a major concern for network servers, it is important that it be able to offer acceptable performance in situations that require multithreaded asynchronous I/O, low latency and high data throughput with a predictable response time, and a potentially unbounded input set of unpredictable requests. It has been found that the manner in which malloc() performs can have a significant effect on overall system performance. One example showed a 6x performance degradation when the iPlanet LDAP server ran on four-way hardware with a native implementation of malloc().

To study the performance of malloc(), three benchmarks were created. The first compared the elapsed runtime of N threads accessing the same heap concurrently. It was found that Linux outperformed Solaris for N greater than two. The second benchmark, which tested for unbounded memory consumption, allocated an object in one thread and freed the same object in a second thread. It was found that Linux has no pathological heap growth in this situation. The final benchmark tested data placement within the heap by allocating a data object normally with malloc() and then letting several threads write into it many times. Bad data placement causes poor performance, especially on SMP hardware. Linux's version of malloc() aligns data to eight-byte boundaries, which resulted in widely varying application performance on CITI's four-way test machine. In conclusion, the version of malloc() offered with glibc 2.1 showed overall acceptable performance on two- and four-way hardware, but still needs work and special attention to reduce performance and scalability problems caused by sloppy data placement.

CLOSING SESSION

NEW HORIZONS FOR MUSIC ON THE INTERNET

Thomas Dolby Robertson

Summarized by Josh Simon

Thomas Dolby is a musician (you'll probably remember him from "She Blinded Me with Science!") who's been working for at least 20 years on integrating computers into music. (One historical tidbit: the drums in "Science!" were actually generated by a discotheque's light-control board.)

Dolby is one of the founders of Beatnik (<http://www.beatnik.com/>), a tool suite or platform to transfer descriptions of the music, not the music itself, over the Internet. For example, the description would define which voice and attributes to use, and the local client side would be able to translate that into music or effects. This effectively allows a Web page to be scored for sound as well as for sight.

For example, several companies have theme music for their logos that you may have heard on TV or radio ads. These companies can now, when you visit their Web sites, play the jingle theme without needing to download hundreds of kilobytes, merely tens of bytes. Similarly, a Web designer can now add sound effects to her site, such that scrolling over a button not only lights the button but plays a sound effect. Another use for the technology is to mix your own music with your favorite artists', turning on and off tracks (such as drums, guitars, and vocals) as you see fit, allowing for personalized albums at a fraction of the disk space. (In the example provided during the talk, a 20K text file would replace a 5MB MP3 file.) In addition to the "way cool" and "marketing" approaches, there's an additional educational component to Beatnik. For example, you can set up musical regions on a page and allow the user to experiment with mixing different instruments to generate different types of sounds.

The technical information: Beatnik combines the best of the MIDI format's efficiency and the WAV format's fidelity. Using "a proprietary key thingy" for encryption, Beatnik is interactive and cross-platform, providing an easy way to author music. And because the client is free, anyone can play the results. The audio engine is a 64-voice general MIDI synthesizer and mixer, with downloadable samples, audio-file streaming, and a 64:2-channel digital mixer. It uses less than 0.5% of a CPU per voice, and there are 75 callable Java methods at runtime. It supports all the common formats (midi, mp3, wav, aiff, au, and snd), as well as a proprietary rich music format (rmf), which is both compressed and encrypted with the copyright. RMF files can be created with the Beatnik Editor. (Version 2 is free while in beta but may be for-pay software in production.) The editor allows for access to a sound bank, sequencer, envelope settings, filters, oscillations, reverbs, batch conversions (for example, entire libraries), converting loops and samples to MP3, and encryption of your sound. And there is an archive of licensable music so you can pay the royalties and get the license burned into your sample.

Web authoring is easy with the EZ Sonifier tool, which generates JavaScript; middling with tools like NetObjects' Fusion, Adobe GoLive, and Macromedia Dreamweaver; and hard if you write it yourself, though there is a JavaScript-authoring API available for the music object.

Beatnik is partnered with Skywalker Sound, the sound-effects division of Lucasfilm Ltd.

BOF SESSION

WORKPLACE ISSUES FOR LESBIAN, GAY, BISEXUAL, TRANSGENDERED AND FRIENDS

Summarized by Chris Josephes and Tom Limoncelli

Although the first submitted name for this BoF suggested that it was for "sysadmins," it was well attended by managers, engineers, programmers, and anyone else who felt like attending. The LGBT BoF has had a long history at USENIX and LISA conferences. The crowd was an even mix of newcomers and conference veterans. The purpose of the session was to give people the opportunity to talk about their work environments, and to provide the opportunity to network with other attendees whom they might not otherwise have met during the conference.

Everyone introduced themselves and named the companies they work for. Then they talked about how their employers have handled issues facing LGBT employees and related experiences they may have had when dealing with their employers. As it is becoming harder to find qualified people, more companies are adapting their benefits to the LGBT community in hopes of attracting new talent and retaining existing employees.

Of the 34 attendees, most reported that their employers established a nondiscrimination policy that included sexual orientation. On top of that, some employers also offered domestic-partnership benefits for registered partners. Another issue employers are trying to address is maintaining a safe, friendly, nonthreatening environment for employees by implementing peering programs, diversity training, and employee groups.

Andrew Hume, USENIX vice president, attended the BoF to welcome everyone and assure us that USENIX is committed to creating a space that is inviting and supportive to all attendees. His comments were well received.

After the introductions were finished, the BoF became an open floor. We talked more in depth about the issues of employment and some of the recruiting plans some firms were offering. Several attendees, all from one large company, reported that they now have a recruitment effort targeting the LGBT community, since they feel their accepting work environment is one of their competitive advantages. Someone pointed out which BoF attendees were presenting papers at the conference, so that others could attend and lend support. Ideas for improving attendance at future BoFs were brought up, including a couple of suggestions for making sure the male/female ratio was more representative of the conference attendance.

When it was time for the BoF to officially end, everyone agreed to informally get together at one of the hospitality suites scheduled for that night.

Trip Report

USENIX ANNUAL TECHNICAL CONFERENCE

by Peter H. Salus <[email protected]>

I had a great time at the 25th Anniversary USENIX conference in San Diego. Oh, boy! San Diego! Right. I didn't even get off the hotel premises for nearly three days.

After all, there were Dennis Ritchie, Ken Thompson, Bill Joy, Rob Pike, Kirk McKusick, Eric Allman, Tom Christiansen, Elizabeth Zwicky, Margo Seltzer, Sam Leffler, Jim Gettys, Clem Cole, Evi Nemeth, Mike Ubell, Teus Hagen, Rich Miller, Oz Yigit, Miguel de Icaza, and over 2,000 others inside the hotel.


In fact, I only got to about two papers/presentations a day. Take Thursday, the "middle" day of the conference. From 9:00 to 11:00 am, I listened to Ed Felten of Princeton University talk about the Microsoft trial. As he was under DoJ secrecy rules (he was adviser to and expert witness for the DoJ), there was much he couldn't say. What he could say was fascinating. Then I walked around the exhibits, and chatted. Then I went to a publications-committee meeting where I was a guest of the USENIX board. Then I was on the Dr. Dobb's Webcast for over an hour, to be succeeded by Linus Torvalds. So I chatted with Linus, his wife, and their two little girls. Then I sat down with some folks from Sleepycat to learn about embedded databases. And it's now after 5:00 pm. At 6:00, I went to the celebratory reception; at 8:00, I went to hear Linus at the Linux BoF. At nine, I went to the "Old Farts' BoF." At 10:30 pm I met my wife, who wanted to go out for dinner. I was too tired. We stayed in the Marriott.

On Wednesday morning I had the pleasure of witnessing the awarding of the "Flame" (for lifetime achievement) to the late Rich Stevens, with Rich's wife, children, and sister getting a five-minute standing ovation from the attendees. Then Bill Joy shared his thoughts about the future of computing.

Bill traced his beginnings in the field – just about 25 years ago at Berkeley – and traced the origins of "open source" to the bases of university research and of UNIX on the PDP-11. While there was the question of whether doing software is "research," the common answer was "yes," and as researchers publish results, there could be no property rights adhering to that research. Lately, largely because of commercial and industrial contributions, there are more and more "entanglements."

Bill noted that he had written "a really boring Java book." My guess is that he was referring to The Java Language Specification by Joy, Steele, Gosling, and Bracha. As I really liked the first edition (1996) and found the second edition (2000) even better – when was the last time the second edition of a reference book was smaller than the first? – I think I'd disagree.

One of Joy's more interesting comments turned on the fact that UNIX was inherently reliable because of its modularization. The consequence is that "Microsoft is clearly foolish" in attempting to make all of its applications integral and the construct monolithic. "Microsoft is beyond retrograde," Bill said. "All interfaces should be published."

Turning to the future, Bill scorned the notion of the death of Moore's Law. In fact, he thinks that we might see another "ten-to-the-sixth" improvement over the next 30 years, just as machinery has improved a million times over the past 30. Bill thinks that molecular computing will enable us to come to grips with the "grand challenge problems" like those of cell biology. Some of these steps will come about through algorithmic improvements, some through greater emphasis on remote storage and fast transmission.

However, he pointed out, the more that's done remotely, the higher the toll for the round trip, even with high-speed optical connections. "The Internet isn't about packets," he said. "It's about end-to-end."

On the client-server side, Bill said that he saw six Webs in the future:

the "near web," which we currently use;
the "far web," the entertainment for couch potatoes;
the "here web," of the pocket device and the cell phone;
the "weird web," involving smart clothing and voice-activated cars;
and two "invisible" webs: "e-business" and "truly pervasive."


Friday morning I was spellbound as Rob Pike, who is always exciting to hear, spoke about quantum computing. This is the kind of paper that differentiates a real conference from a hawker's sideshow: Heisenberg, Schrödinger, integral signs, real equations. Rob said that we should rid ourselves of the notion that "the elements of information are independent" and get used to concepts like "conservation of information." "A quantum computer is probabilistic," he remarked. "It's not going to happen soon," but it will happen.

Now we can turn it over to Gibson, Sterling, Stephenson, and Vinge . . .

A fine conference. I can't wait till the 30th Anniversary.

There was lots more. It was a fine hour.

Then I spent over an hour on the Dr. Dobb's Webcast.

Wednesday night there was a four-hour BSD BoF (that's right) organized by Kirk McKusick. An hour each of OpenBSD, FreeBSD, NetBSD, and BSDI. I sat through over an hour of it. While there were pockets of enthusiasts cheering and jeering, it was a generally highly intelligent, well-informed group of about 400. And I thought it exciting to see them all together. A doff of my cap to you, Kirk.

Thursday morning, as I mentioned, I went to hear Ed Felten. While most folks know many of the details of the case, I found listening to the recollections of a participant truly fascinating. My personal feeling is that the case is more about economics and business than about technology, but there are (clearly) two technical queries of significance: (1) is there a technical advantage to tying Windows to the browser? and (2) can they be disentangled without injury to either?

The clear answer to (1) is "no" (there is, of course, a business advantage for Microsoft), and Felten himself demonstrated the answer to (2) in court. In fact, advocates of the small kernel (like Linus Torvalds) as well as "old-timers" (like Bill Joy) all shun the interwoven monolithic monstrosities produced by Microsoft.

Went off for another hour of Webcast.

The "Old Farts' BoF" on Thursday night was very well attended. Ken Thompson, Lou Katz, Greg Rose, Dennis Moomaw, Clem Cole, and I were among the recollectors. Toward 10:00 pm, Professor Arun K. Sharma, head of computer science and engineering at the University of New South Wales, announced a "drive" to raise $A2M (= $US1.2M) to establish a John Lions Chair of Operating Systems at UNSW. As I considered John a teacher and a friend, I handed Arun my check for $1,000 on the spot. I found it very moving.


The 25th Anniversary Reception was a good party!