IMPLEMENTING THE COMANDOS ARCHITECTURE

Jose Alves Marques (2), Roland Balter (1), Vinny Cahill (4), Paulo Guedes (2), Neville Harris (4), Chris Horn (4), Sacha Krakowiak (3), Andre Kramer (4), John Slattery (4) and Gerard Vandome (1).

(1) Bull Research Centre, c/o IMAG-Campus, BP 53X, 38041 Grenoble CEDEX, France.
(2) Instituto de Engenharia de Sistemas e Computadores, Rua Alves Redol 9-2, 1000 Lisboa, Portugal.
(3) Laboratoire de Genie Informatique, IMAG-Campus, BP 53X, 38041 Grenoble CEDEX, France.
(4) Distributed Systems Group, Department of Computer Science, Trinity College Dublin, Ireland.

The fundamental goal of the three-year ESPRIT project 834, COMANDOS, is to identify and construct an integrated platform for programming distributed applications which may manipulate persistent - i.e. long-lived - data. The intention is eventually to provide such a platform running on a range of machines from different vendors. The COMANDOS project has already defined the global architecture of such a platform [Horn87]. This architecture is specified as a number of co-operating functional components. A subset of these components constitutes the COMANDOS kernel, which provides the minimum functionality expected of a COMANDOS system. This paper describes the three different implementations of the COMANDOS kernel being undertaken within the current project. The differing goals of each implementation, as well as the particular hardware environment targeted, are summarised. The different approaches being followed in each implementation are reviewed, followed by a preliminary presentation of the results of each implementation to date.

1 INTRODUCTION

The fundamental goal of the three-year ESPRIT project 834, COMANDOS, is to identify and construct an integrated platform for programming distributed applications which may manipulate persistent - i.e. long-lived - data. The intention is eventually to provide such a platform running on a range of machines from different vendors.
Although the main intended application domain is office systems, we expect that the COMANDOS platform may be valuable as a basis for integrated information systems in such application domains as CAD, software factories and manufacturing administration.

COMANDOS itself does not provide end-user applications, but rather a basis for the development of these. The essential features of the COMANDOS platform - the "virtual machine" interface - are:

- support for distributed and concurrent processing in a loosely coupled LAN and WAN environment;
- an extensible and distributed data management system, which can be tailored to specific application areas;
- tools to monitor and administer the distributed environment.

Further aspects of the project include tools to aid in office systems design and maintenance, particularly in the light of operational experience [Horn88], and interworking with existing data management systems constructed independently of the COMANDOS model [COMANDOS88c].

The programming environment provided by COMANDOS is intended to be multi-lingual, and a range of programming languages is expected to be used. To interact fully with, and thus exploit, the COMANDOS environment, a range of primitives will be available via libraries. Existing programs are supported without requiring recoding, although relinking with standard environment libraries may be necessary; such programs may not be able to benefit fully from the COMANDOS platform. Nevertheless, it is obviously crucial that existing applications, for example UNIX applications, can be supported easily.

Consideration of our environment and virtual machine primitives has convinced us that it is useful to provide a language in which the concepts of the COMANDOS virtual machine are faithfully reflected. The language embodies the main features of the COMANDOS virtual machine, i.e. the type model and the computational model. Some of its features may be regarded as "syntactic sugar" which reduces the burden on an application programmer making extensive use of the COMANDOS program libraries: ensuring that parameters are correctly managed across a series of library calls; ensuring that a series of such calls is indeed meaningful (i.e. semantically correct); and providing syntactic constructs which automatically result in a number of calls for frequently used cases.
Moreover, other features of the virtual machine, such as typing and inheritance, can only be expressed in linguistic terms. The resulting language, Oscar, designed within the project and currently being implemented, is described in [COMANDOS88a, COMANDOS88b].

The execution environment for COMANDOS applications is provided by a low-level kernel and a set of additional services running, like applications, above the kernel. Although the kernel supports distributed processing, its functionality is fundamentally richer than that of, for example, UNIX [Bach86], Mach [Jones86], the V-kernel [Cheriton84] or Amoeba [Mullender85], in that it also provides a basis for common data management services required by many applications. Examples of the additional support provided include atomic transactions and recovery; decomposition and reconstruction of complex data entities so as to accelerate associative retrieval; location transparency as a default mechanism, but with the ability to determine the precise (current) location of some entity, and to direct execution to particular sites; and a number of monitoring and control points through which the system can be remotely administered.

Compatibility with international and de facto standards is obviously critical. For interworking, the COMANDOS kernel assumes the availability of both CL and CO (Class 4) ISO Transport, over ISO IP. Above the CL-Transport, an RPC service - the COMANDOS Inter-Kernel Message Service - is used by the kernel. Both CL and CO associations are available to applications.

An emerging de facto system interface of particular relevance is UNIX X/OPEN. In one of the COMANDOS prototypes, the COMANDOS kernel interface is being hosted on top of a UNIX implementation, and the normal UNIX interface is available to applications alongside the COMANDOS kernel primitives. In the other COMANDOS prototypes, the COMANDOS kernel is being implemented in privileged machine mode: in these cases, many of the UNIX services have to be provided above the COMANDOS kernel.

Our prototypes of the COMANDOS kernel are thus exploring a number of implementation strategies in parallel. This paper gives a brief review and status report on these prototypes. In Grenoble, Bull and the Laboratoire de Genie Informatique are jointly implementing the COMANDOS kernel interface as a series of C libraries running in user mode on top of a UNIX V2.2 kernel with certain BSD extensions. Their implementation has been christened Guide. Using UNIX as a host, Guide has been able to develop faster than the other prototypes, but with relatively poor performance.

At Dublin, Trinity College are implementing the COMANDOS kernel in privileged machine mode on both the NS32000-based Trinity Workstation and Digital VAX architectures. Their implementation is known as Oisin.

While Oisin is aimed at relatively sophisticated multi-user hardware, it was felt it would also be interesting to develop a minimal implementation aimed at single-user machines, with the intention that these could interact with larger machines to gain transparent access to the full range of COMANDOS services. At Lisbon, INESC are developing a version of the kernel on i286-based PC/AT compatibles: their implementation is known as IK.

In summary, the COMANDOS consortium is prototyping the COMANDOS kernel interface in a number of operating environments, both as a "guest" layer on top of UNIX and in native mode, on both relatively sophisticated machines and on PCs. The intention is to conduct an appraisal and analysis of the relative success (including performance) of these prototypes before the conclusion of the current three-year project contract.

2 THE COMANDOS VIRTUAL MACHINE

As noted above, the project aims to provide a complete interface allowing the development of large distributed applications. A complete description of the COMANDOS model is provided in [COMANDOS87].
In this section we briefly recall some of its main features.

The COMANDOS Virtual Machine can be considered to be composed of two main entities:

- a Type Model, providing support for the description of the abstract structure and behavioural properties of objects;
- a Computational Model, defining the interface with the operating system entities which provide the functionality needed in a distributed environment: distributed processing, including object invocation, and support for fault tolerance.

2.1 Type Model

The objective of providing efficient support for both programming and database conceptual modelling had a strong influence on the definition of the Type Model. Moreover, the distributed nature of the system introduces the problem of heterogeneous implementations, which is not usually considered in type models for homogeneous environments.

In COMANDOS an object is generally composed of its internal state, or instance data, and a set of operations or methods. Each object is an instance of an implementation, which provides the code for the object's operations. Furthermore, each object has an associated type which describes the visible properties and behaviour of the object. An important initial decision was to exploit static type checking as much as possible, for early detection of programming errors and to improve code efficiency.

A type represents the external interface of an object in the system. Types are specifications of interfaces: each type may then be associated with (possibly) several implementations. The possibility of having multiple implementations allows different algorithms, or different code for heterogeneous machines, to be associated with the same interface.

Subtyping is supported by the model as a way of specialising interfaces, allowing incremental development of software and the use of an instance of a type as if it were of another type. The conformance rules and the manipulation of object references are detailed in [COMANDOS87].

To support database modelling it is important to be able to manage groups of similarly typed objects. The concept of class was introduced to simplify the management of large collections of objects. A class represents a set of objects having the same properties.
On a class, four basic operations may be performed: insertion of new objects, removal, test for membership and inspection of the extension. In addition to these generic operations, the type over which a class is defined may have some user-defined operations which may be used in conjunction with queries.

2.2 Computational Model

In COMANDOS the difference between active and passive objects is that the former may change their internal state independently of any invocation.

Active objects exist in the form of Jobs and Activities. A Job may be considered as a multi-processor, multi-node virtual machine. A job may contain one or many activities executing in parallel. Activities are the fundamental active objects and represent distributed threads of control, similar to processes in a centralised system. Jobs and activities are distributed objects, and may span several nodes. In each node visited by any of the activities of the job, the system maintains a local context for the job, containing the objects used at that node.

Most COMANDOS objects are passive. Passive objects may belong to one of several categories: atomic, synchronised or non-synchronised.

Atomicity is an attribute of an object, fixed at its creation time, which cannot be changed afterwards. The classical properties of atomic transactions are guaranteed for these objects when they are used within a transaction. Within transactions there may be objects for which no specific synchronisation is required; thus, to reduce the overhead introduced by transactional mechanisms, only atomic objects acquire the properties of atomic transactions. The model is intended to support both short and long duration transactions. To improve efficiency, sub-transactions and intermediate checkpoints were incorporated in the model.
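As a concrete illustration, the passive-object categories and the creation-time atomicity attribute might be rendered as follows in C. All names here are hypothetical sketches, since the paper does not define concrete programming interfaces for these concepts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical rendering of the passive-object categories described
 * above; the identifiers are illustrative, not taken from COMANDOS. */
typedef enum { NON_SYNCHRONISED, SYNCHRONISED, ATOMIC } obj_category;

typedef struct {
    obj_category category;   /* fixed at creation time, never changed */
} object_desc;

/* Atomicity is an attribute decided when the object is created. */
void object_create(object_desc *o, obj_category c) {
    o->category = c;
}

/* Only atomic objects acquire the properties of atomic transactions,
 * so only they pay the transactional overhead. */
bool needs_transactional_support(const object_desc *o) {
    return o->category == ATOMIC;
}
```

The point of the sketch is the design decision it encodes: the category is set once at creation and the transactional machinery is engaged only for atomic objects.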

Synchronised objects have internal mechanisms for synchronising accesses. A synchronised object can only be mapped at a single node, to which all accesses are routed.

When invoking an object not located in the current context, the system decides whether the job should diffuse to the node where the object is located, or instead map the object into the local context. This decision should, in principle, take into account all the attributes which may be associated with objects (synchronisation properties, fixed location, association with a given job, size, etc.).

Channels are a special type of object for communication. A channel is defined by a connection between a source and a sink object, allowing the transfer of a particular type of object. The rationale for this concept is to provide an optimised mechanism for transferring large objects between activities, and also to provide a uniform concept for handling input/output.

2.3 Architecture

Uniform access to objects is a key point in the COMANDOS model. Globally known objects have a unique system-wide identifier designated a Low Level Identifier (LLI). The LLIs form an address space in which invocation of objects is executed transparently. Transparency is supported for access to both remote and stored objects. Similarly, the single-level storage model of the COMANDOS architecture hides the distinction between long-lived - stored - objects and short-lived objects from application programmers.

The Architecture is an abstract implementation structure, consisting of a set of distributed functional entities which provide support for the COMANDOS virtual machine interface. For more detail on the COMANDOS architecture than is provided here, refer to [Horn87].

2.3.1 Virtual Object Memory (VOM)

Objects are mapped into the virtual address space of a job when they are referenced. Objects are mapped out of virtual memory by the system when the job terminates or when space is needed for other objects.

All objects are potentially persistent.
Persistence is not a static attribute of an object; instead, an object is considered persistent if it is reachable from an eternal persistent root.

When an object is invoked, the VOM analyses an internal table to determine whether the object is already in the required context and, if not, tries to locate it, as it may already be active in some other context. In each storage node, a table identifies the objects normally stored there but currently mapped in the context of a job at some node. This table is used to detect whether the object is already in some context of the job or, in the case of a synchronised or atomic object, where a live image of the object is currently located.
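The lookup order just described can be sketched as follows. This is an illustration of the decision sequence only; the flags and names are hypothetical, not the actual VOM data structures.

```c
#include <assert.h>

/* Hypothetical sketch of where the VOM finds an invoked object,
 * following the order described in the text. */
typedef enum { IN_CONTEXT, ON_LOCAL_NODE, ON_REMOTE_NODE, IN_STORAGE } obj_location;

typedef struct {
    int in_current_context;  /* already bound in the calling job's context? */
    int in_node_table;       /* present in this node's object table? */
    int mapped_remotely;     /* storage-node table says it is live elsewhere */
} vom_tables;

obj_location vom_locate(const vom_tables *t) {
    if (t->in_current_context) return IN_CONTEXT;     /* fast path: no lookup */
    if (t->in_node_table)      return ON_LOCAL_NODE;  /* bind into the context */
    if (t->mapped_remotely)    return ON_REMOTE_NODE; /* diffuse job, or remote-map */
    return IN_STORAGE;                                /* fetch from the SS */
}
```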

2.3.2 Storage System (SS)

Secondary storage is managed by the SS and may be seen as a set of containers. A container is a logical entity with a unique identifier, which may be implemented by several physical containers. Each container is organised as a set of segments. A segment is a contiguous storage unit. Provision has been made for optimising access to related objects by grouping them in clusters. An object is normally stored entirely within a segment, but exceptionally large objects may be partitioned over a set of segments.

Garbage collection in a distributed object-oriented system is a particularly difficult problem. COMANDOS provides a compromise between efficient resource management and algorithmic complexity by using ageing [COMANDOS87].

2.3.3 Activity Manager (AM)

The Activity Manager is responsible for the low-level functions of the system. It controls the run-time environment, giving support for jobs, activities, semaphores, timers and triggers.

A distinction is made between the local hardware-dependent kernel and the distributed entity which provides the support for the COMANDOS abstractions. The local kernel may be a dedicated implementation for COMANDOS, or any low-level kernel providing process management and synchronisation facilities.

2.3.4 Communication Subsystem (CS)

All the architecture components are distributed and use the Communication Subsystem to communicate with their remote peers. The CS offers two transport services, one dedicated to efficient remote invocations and another optimised for the transfer of large amounts of data.

2.3.5 Remaining Components

The Transaction Manager (TM) is responsible for implementing the support for atomic objects, in particular concurrency control and recovery procedures for aborts and node failures.

The Type Manager (TpM) is the component responsible for maintaining information about types and the relationships between them. The TpM will assist language compilation and the object management system.
Mapping from certain languages to the TpM interface will be provided.

Finally, the Object Data Management System (ODMS) provides the functions related to the management of classes and queries on objects in the object base. The ODMS is responsible for three main functions: support of the query language; management of classes; and implementation of distributed location schemas for class members.

3 KERNEL IMPLEMENTATIONS

In the following sections we describe the three implementations of the COMANDOS kernel introduced in section 1. The kernel includes only five of the components of the full COMANDOS architecture, i.e. the VOM, SS, AM, CS and TM. However, the TM component of the kernel is not yet being implemented in any of the pilot implementations described here.

The first GUIDE prototype has been designed to provide, as quickly as possible, a minimum basis for supporting distributed applications specified in terms of the COMANDOS model, and to identify problems raised by the implementation of an object-oriented architecture above Unix. Three Unix System V features (shared memory, message queues and semaphores) and the Unix BSD communication facilities (sockets) are extensively used in GUIDE. Mechanisms providing optimal support for object orientation will be investigated in the proposed next phase of the project (COMANDOS-2) as extensions to the Unix kernel.

The main goals of the IK implementation of the COMANDOS kernel are:

- to provide an implementation of COMANDOS on commercial, widespread and inexpensive machines, allowing the results of the project to be diffused;
- to evaluate the suitability of a segmented memory management system, as provided by the iAPX286, for efficiently supporting object-oriented systems.

In the IK prototype the main concern is to gain experience in the efficient management of volatile and persistent objects in a distributed environment. Therefore, IK will provide efficient invocation of objects in virtual memory and transparent access to objects located at any node of the system.
Support for local secondary storage is provided; however, this implementation is primarily intended to access information residing on more powerful machines.

The chief goal of the Oisin kernel is to provide an efficient implementation of the COMANDOS model on relatively sophisticated hardware. The chief features which distinguish Oisin from Guide and IK are as follows:

- kernel-mode exploitation of a demand-paged virtual memory environment;
- use of clustering to reduce i/o operations and accelerate object invocations;
- a multi-level i/o subsystem, in which peripherals, i/o controllers and bus couplers have separate drivers: each application-level i/o request is mapped dynamically to a path of devices as necessary to reach the target peripheral;
- absence of a UNIX-style i/o buffer pool, instead exploiting all available physical memory effectively as an i/o cache.

As noted in section 1, it is our intention to compare the three implementations both qualitatively and quantitatively: rather than mirroring strategies across the three implementations, it seemed more prudent to explore techniques in parallel and exchange experiences.

3.1 GUIDE Implementation

3.1.1 Introduction

This section describes the principles of the Bull/LGI implementation of the COMANDOS kernel on top of Unix System V. This implementation is called GUIDE (standing for "Grenoble Universities Integrated Distributed Environment"). The main decision in this implementation was to map COMANDOS activities onto Unix processes. The alternative choice would have been to map COMANDOS jobs onto Unix processes and to implement activities as "lightweight processes" within a Unix process. The availability of shared memory made the first solution more attractive, since it allows easy sharing of objects and system tables between activities, and between jobs, on each node.

The implementation of the main components of the GUIDE kernel is briefly described in the following sections.

3.1.2 Virtual Object Memory

In the COMANDOS model (cf. section 2.2 above), a job is defined as a multi-processor, multi-node virtual machine. A job is defined by its virtual address space (possibly spanning several physical nodes), which is shared by the activities of the job. The management of jobs and activities is described in more detail in section 3.1.4.

On a given node, an activity is represented by a Unix process whose virtual address space is divided into three areas: the private area, the shared area and the stack area.

- The private area is divided into two zones: the first zone is the Unix .text zone, which contains the GUIDE kernel (this zone is shared by all the activities running on the node); the second zone is the Unix .data zone, which contains the kernel data and the binary code of currently used implementations.
- The shared area is divided into three zones: the first zone contains the Context Object Table (COT), which describes objects mapped into the corresponding job on this node (this zone is shared by all the activities of this job).
The second zone contains the Node Object Table (NOT), which describes objects currently loaded on the current node (this zone is shared by all the activities on the node). The COT may be viewed as a window onto the NOT. Each table is implemented by means of a Unix System V shared memory segment. The third zone contains the data of the objects which are currently mapped on this node (this zone is also shared by all the jobs at the node). Each object is loaded within a Unix System V shared memory segment. An object shared by several jobs may be attached at different virtual addresses in each job, but at the same virtual address in all the activities of the same job. Concurrent accesses to shared objects are controlled by means of Unix System V semaphores.

- The stack area is private to each activity.

An object loaded in the VOM is kept resident as long as shared memory segments are available. The NOT contains, for each object, a binding counter (the number of contexts in which the object is currently mapped). When this counter reaches zero, the object can be withdrawn from the NOT and the associated shared memory segment reallocated (NOT entries are reallocated on a LIFO basis). An ordinary or synchronised object is stored back into the SS when it is no longer bound within any context. An atomic object which has been modified within a transaction is stored into the SS when the associated transaction is committed.

Object invocation is performed within the process implementing the calling activity. When an object not yet bound within the job is invoked, a search for the object is carried out within the VOM. If the object is found within the NOT on the local node (i.e. the object has already been mapped on the local node), it is bound within the context of the calling job and the invocation is performed. If the object is not yet mapped on any node, it is mapped on the local node and the invocation proceeds as above. If the object is already mapped on another node, and the current job is not already present on that node, the job and the activity are first diffused to the target node.
The object invocation is then carried out on the remote node.

Run-time support for object invocation is provided by the GUIDE kernel via the ObjectCall kernel primitive:

    ObjectCall (v_object: view; ref_impl: reference; method_index: integer; param_block: address);

where:

- v_object is a view which points to the called object;
- ref_impl is a reference to the implementation which contains the required operation (which may be different from the implementation of the object, because of inheritance);
- method_index is the index of the operation within that implementation;
- param_block is the address of a block which contains the parameters.

The format of the parameter block is as follows: the first entry contains the number of parameters; each subsequent entry contains one parameter, in one of the following forms:

a) the address of a view that points to the parameter (this is the general case); or
b) the value of the parameter (for integers and characters only); or
c) the address of the parameter (for strings only).

Passing addresses for views and strings is an optimisation for the case of a local invocation.
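Since GUIDE is realised as C libraries, the parameter block might be built along the following lines. The concrete GUIDE declarations are not given in the paper, so the types, sizes and helper functions here are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C sketch of the ObjectCall parameter block described
 * above: the first entry holds the parameter count, and each later
 * entry holds a view address, an immediate value, or a string address. */
typedef void *view;

#define MAX_PARAMS 8   /* illustrative bound, not from GUIDE */

typedef struct {
    intptr_t nparams;             /* first entry: number of parameters */
    intptr_t entries[MAX_PARAMS]; /* one entry per parameter */
} param_block;

/* Immediate value: integers and characters only. */
void pb_add_value(param_block *pb, intptr_t value) {
    pb->entries[pb->nparams++] = value;
}

/* General case: the address of a view pointing at the parameter. */
void pb_add_view(param_block *pb, view *v) {
    pb->entries[pb->nparams++] = (intptr_t)v;
}
```

A caller would fill such a block and pass its address as the param_block argument of ObjectCall; storing addresses rather than copies is what makes the local-invocation optimisation mentioned above possible.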

3.1.3 Secondary Storage

The storage system is implemented using Unix file systems (virtual disks). A file system is a disk partition which is accessed by the Unix kernel as an independent logical device. The conversion between logical device addresses and physical addresses is performed by the disk driver.

A COMANDOS physical container is mirrored by a Unix file system. The Unix file systems which correspond to physical containers are accessed in character mode. This means that the corresponding I/O is synchronous, and that the Unix buffer cache is not used. Instead, a cache is managed by the SS in order to optimise disk accesses: there is a single cache on each node where at least one physical container is supported, and this cache is common to all the physical containers located on that node.

The SS is composed of two main parts: the client side, which performs SS primitives, and the server side, which deals with container management. The client side is linked with VOM processes, while the server side is composed of one or several dedicated Unix processes. The server side is composed of three main modules: the cache module, the physical container module and the logical container module. The cache module is very similar to the Unix buffer cache and thus makes disk I/Os more efficient. The physical container module manages disk blocks within a physical container, and the logical container module handles accesses to objects within logical and physical containers. Communication between a client process and an SS server process is performed using BSD sockets in datagram mode. Communication between server processes on the same node uses the object cache.

Physical containers are organised in 512-byte blocks. An object descriptor is stored in one block. Since the size of an object descriptor is less than the block size, the remaining space in the descriptor block may be used to store the corresponding object data.
If the object size is greater than this remaining space, its contents are stored in independent data blocks. The addresses of these blocks are stored in the remaining part of the descriptor block (where a small object's data would otherwise be stored). The first few addresses are addresses of direct data blocks, and the last two addresses refer to indirect data blocks (a single indirection for the first and a double indirection for the second). This mechanism is very similar to the one used in the Unix file system. The resulting maximum object size is 8,287 Kbytes.

Logical containers are organised as a hierarchy. A logical container is a son of the logical container into which its objects that have not been used for a certain time are aged. This mechanism applies also to the father logical container, which is itself the son of another logical container, and so on up to a root logical container. In the GUIDE implementation, there is a single root logical container into which all unused objects are ultimately aged.

Replication on a per-physical-container basis is supported in the GUIDE implementation. An object stored within a logical container is replicated on each physical container of this logical container. An object is available as long as at least one of its host physical containers is still operational. The number of physical containers of a logical container can be increased or decreased dynamically. When a new physical container is added to a logical container, it is first synchronised with the already active physical containers. There is always a master physical container, in which objects are modified, and secondary physical containers, to which the updates are propagated. When the master physical container becomes unavailable, one of the secondary physical containers is elected as the new master, and the users of the system should be unaffected.
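The 8,287 Kbyte maximum quoted earlier in this section is consistent with one plausible reading of the descriptor layout: 4-byte block addresses, with room for 64 of them in the 512-byte descriptor block (62 direct, one single-indirect, one double-indirect). These parameters are assumptions; the paper states only the block size and the final figure.

```c
#include <assert.h>

/* Back-of-envelope check of the 8,287 Kbyte maximum object size.
 * Assumed layout: 512-byte blocks, 4-byte block addresses, 62 direct
 * addresses plus one single-indirect and one double-indirect block. */
enum {
    BLOCK_SIZE      = 512,
    ADDRS_PER_BLOCK = BLOCK_SIZE / 4,   /* 128 addresses per block */
    DIRECT_ADDRS    = 62
};

long max_object_kbytes(void) {
    long blocks = DIRECT_ADDRS                             /* direct          */
                + ADDRS_PER_BLOCK                          /* single indirect */
                + (long)ADDRS_PER_BLOCK * ADDRS_PER_BLOCK; /* double indirect */
    return blocks * BLOCK_SIZE / 1024;  /* 16,574 blocks of 512 bytes */
}
```

Under these assumptions the function yields 8287, matching the figure in the text.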

A limited versioning mechanism is implemented within the GUIDE kernel. This mechanism is provided by the ability to create a new object with another object as its initial value. The new object can then be kept as the old version, and the old one used as the new version. Thus other objects referencing the object will always reference its latest version (by means of the old reference), and old versions can be made available to the user by a higher-level versioning mechanism.

3.1.4 Job and Activity Management

As mentioned in the introduction, an activity is represented by a Unix process on each node where it has invoked an object. The same process represents a given activity on a node, regardless of the number of calls it has made on that node.

A job is created on a node which is called its initial node, and a Unix process is associated with each activity on each node. The system guarantees the consistency of the virtual address spaces of the Unix processes which correspond to the activities belonging to a given job, as described in section 3.1.2. In addition to the processes associated with activities, there is a daemon process on each node which implements job and activity management. This daemon is in charge of forking the processes which represent activities, and of managing filiation relationships between them. It "cleans up" the remaining processes after a job has terminated. In addition, the daemon process performs various monitoring functions.

3.1.5 Communication Subsystem

Communications are performed using sockets in datagram mode (UDP/IP) for communications between nodes, and System V2.2 message queues for communications within a given node. Remote communication is used for peer-to-peer exchanges between the components of the kernel or storage subsystem on different nodes, and as a basis for the remote invocation mechanism.
The first version of the RPC uses the Sun XDR protocol.

3.2 Oisin

The TCD implementation is being primarily targeted at the NS32332-based Trinity Workstation, with a port of that implementation to Digital uVAX-IIs. Coding is primarily in Modula-2, although both C and assembler are occasionally used. The development environment is 4.2BSD (NS Genix and Digital Ultrix).

There follows below a brief overview of the NS32000 and VAX architectures for those readers unfamiliar with them.

The NS32000 architecture is a 32-bit architecture specifically designed to support high-level language (HLL) compilers [National86]. The NS32000 also provides two protection modes: user level and supervisor level. Transitions to supervisor level are made for traps and interrupts.

Demand-paged virtual memory, using two-level page tables, is also supported. The current standard MMU chip is the NS32082, which supports a virtual address space of 16 Mbytes with a 512-byte page size. The position of the current level 1 page table is given by one of a pair of page table base registers, and it can be located arbitrarily in physical memory. A level 1 page table is 1 Kbyte and contains the locations of 256 level 2 tables. Each level 2 table is 512 bytes and contains the locations of 128 physical pages. Each Page Table Entry (PTE) includes a Valid bit, a Referenced bit, a Modified bit and two protection-level bits.

The VAX-11 family is a 32-bit architecture with a 32-bit virtual address space. Demand-paged virtual memory using single-level page tables is provided. The page size is 512 bytes. There are four distinct virtual address spaces: P0, P1, System and reserved. P0 and P1 define the virtual address space of each process, while the system space is a context-switch-independent space used by the kernel. Each of these spaces is 1 Gigabyte. Each region has a base register containing the base address of the page table for that region, and a length register indicating the length of the page table.

3.2.1 Clustering

Consideration of a number of sample programs for the COMANDOS environment, expressed on paper in Oscar, led us to believe that clustering of objects would be an important element of an efficient kernel implementation. A cluster is a set of contiguously "stored" objects. When mapped into the VOM, the objects are contiguous; however, a (large) cluster may be mapped to several contiguous ranges of disk blocks in an SS container, possibly in different disk cylinders. If any object in a cluster is accessed by an application, then the entire cluster is mapped into virtual memory. Depending on currently available physical memory, a range of pages including, but not limited to, the faulted page will be initially retrieved.

Object identifiers in Oisin appear in two categories.
Inter-cluster references are similar to LLIs in Guide and IK: they consist of a logical container number and a unique "generation" number within that container. They also contain a hint field indicating the cluster in that container in which the target object will most likely be found. Objects may also be migrated between logical containers, as explained in the COMANDOS Architecture report [COMANDOS87].

Intra-cluster references are adopted when the target object is inaccessible from outside its cluster. Dereferencing an intra-cluster identifier is more efficient than dereferencing an inter-cluster one: careful administration of precisely which objects are placed in the same cluster will therefore directly affect overall performance.

Several clusters may be mapped simultaneously into the same virtual address space. A cluster may also be simultaneously mapped into different address spaces at the same node (processor). A cluster may contain a single object, in which case it is similar to an object with an LLI in Guide or in IK.

Virtual Memory

The virtual memory subsystem is reasonably conventional. It does, however, contain a number of Oisin-specific mechanisms related to the efficient handling of clusters. Modified pages replaced in physical memory are first placed on a paging device and, if not faulted back after some time, are lazily purged by the kernel back to their home positions in the SS. Maintenance of free paging space (and indeed of free disk blocks in a disk cylinder) is via a buddy algorithm implemented on a bitmap. File-type accesses are automatically supported by direct machine level reads and writes [Daley68].

Devices

In UNIX, a single device driver may need to interact with several hardware units: this is particularly true of disk drivers, which may for example interact with a DMA controller, one or more bus adaptors (for example a SCSI interface), as well as the disk controller itself. In sophisticated I/O systems there may in fact be several alternative routes from the CPU to a particular peripheral, and traffic ought to be distributed to balance load and achieve fault tolerance. In Oisin, we have been influenced by the I/O support of Digital's VMS [Kenah84]: each hardware unit has its own private device driver, and I/O requests follow a designated path from unit to unit as appropriate. In principle, different I/O requests may follow different paths to the same target device. Each I/O operation is defined by an I/O Request Packet and an associated Completion Packet: such packets are exchanged between devices using software interrupts.

Physical Memory and Synchronisation

Rather than reserve a portion of physical memory as an I/O cache as in UNIX [Bach86], we allow potentially any page of physical memory to be used as a buffer for I/O as required.

Each job executing at a node has its own private context, which is implemented as a single virtual address space and which is shared by all activities of the job executing at that node.
Each activity is represented by one or more lightweight processes executing in this context. Processes are provided by the Oisin kernel, which implements two-level scheduling of jobs and activities. Oisin also provides semaphores to synchronise concurrent accesses by processes.

Communication Subsystem

The CS is being implemented as a path of device modules, in effect as an I/O path in the multi-level I/O subsystem. The Inter-Kernel Message protocol is to be treated as a single pseudo-device, as is the combination of the Transport and Internet layers. The target device in the path for the datagram service is the Ethernet device driver.

IK

Introduction

The INESC kernel, named IK, provides a single user COMANDOS environment to run on commercial Personal Computers (PC/ATs based on the Intel iAPX286). Implementation is currently under way using the Olivetti PE28, Olivetti M28 and Sperry IT, both as development and target machines.

Brief Introduction to the iAPX286

The iAPX286 microprocessor [Intel83] is a 16-bit CPU with an integrated MMU and has three main features which influenced our design:

- Segmented memory management, with a maximum segment size of 64 Kbytes. Address translation is performed through one of two tables, a system-wide Global Descriptor Table (GDT) or a per-task Local Descriptor Table (LDT). Instructions that modify the current segment are restartable; an exception is raised if the segment is not present in memory, allowing segments to be loaded into memory on demand.

- Hardware supported processes, described by particular segments. Process switching is automatically performed by special machine instructions.

- Four levels of protection. Privileged levels can only be accessed via interrupts or gates.

Internal Structure of the Kernel

We split the kernel into two parts: the Kernel User and the Kernel Supervisor.

The Kernel User runs in user mode and handles invocation of objects already mapped in virtual memory. A call to the Kernel Supervisor is only issued if the object is not present in virtual memory.

The Kernel Supervisor, which runs in the most privileged mode of the processor, is internally composed of a number of processes which communicate by exchanging messages. Global kernel data is shared via the GDT. Currently, there is one process for the Virtual Object Memory Manager (VOM) and the Activity Manager (AM), one for the Storage Subsystem and one for the Communications Subsystem.
Device drivers are also implemented by independent processes. This organisation of the kernel into different processes simplifies development and debugging, because each component is isolated in a process, preventing it from corrupting the data of other components. Interfaces between different components are better defined, allowing independent testing. Performance is not compromised, because these processes are lightweight and benefit from the hardware support for fast context switching.

Virtual Object Memory (VOM)

Objects in IK are composed of three logical areas: a header, a data area and a reference list. Headers are accessed very frequently and must not be modified by user mode software, so they are stored contiguously in a write-protected segment called the Header Table.

The data area and the reference list are both stored in the same virtual memory segment. This constitutes a major point of our implementation: an object corresponds to a segment. Objects are mapped as a unit in the VOM; thus they are either completely mapped in the VOM or not mapped at all. When an object is mapped in the VOM, it is either present in primary memory or saved in the swap area.

This mapping has several advantages and also some drawbacks. Memory management is simplified, because the logical structure of an object is mapped directly onto the memory concept supported by the hardware. In terms of virtual memory, an object is described only by its virtual address.

When objects are mapped and segments are resident in primary memory, this implementation is very efficient, because all code executes in user mode, without the intervention of the Kernel Supervisor. In this case the code of an object runs in a way similar to a conventional resident process. The invocation time of a mapped object is close to that of a conventional procedure call.

Small objects are efficiently mapped in virtual memory because a reduced amount of I/O is necessary to read them from disk. For large objects this solution has the drawback of requiring the reading of the whole segment, even when only a small part is actually used. Another inconvenience is that object size is limited to the maximum size of a physical segment (64 Kbytes).
Large objects may be decomposed into a hierarchy of smaller objects, at the expense of some performance degradation.

Objects may be mapped anywhere in the virtual address space of an activity, so the code of an object may not assume the value of the code segment. This corresponds to the small model on the iAPX286: an object has a single code segment and a single data segment, and must neither assume nor modify the contents of these segments. This provides an easy way to share objects between different jobs: each job has a private copy of the object's header, and all headers reference the same virtual segment through different virtual addresses. The Kernel Supervisor maintains the consistency of the different copies of the header.

Storage Subsystem (SS)

The Storage Subsystem of IK is simple because its target machines have small disks. Basically, it provides the ability to store and retrieve objects given their LLI. Objects are read and written as a single unit. Whenever possible, objects are stored on disk as a contiguous segment.

Activity Manager (AM)

Activities are supported by processes. Activities of the same job execute in the same address space by sharing a common LDT. As process switching is inexpensive, several activities of a job executing at the same node are similar to concurrent lightweight threads of execution within the same address space. Conventional semaphores are used for synchronisation.

When an activity diffuses, a new process is always created at the remote node to handle the invocation. If the activity diffuses again to the origin node, two processes will coexist, but only one is eligible for execution; the other is awaiting the return of the remote invocation.

At each node an activity stores the node from which it was called; if it diffuses to yet another node, it also stores the node to which it diffused. This calling chain is used to locate the nodes visited by the activity when performing some operation on it (e.g. killing the activity).

Communication Subsystem (CS)

Communication in COMANDOS takes place between the kernels on distinct nodes. The messages exchanged between kernels will be queries, notifications and requests for services. Two special cases are requests for operations to be carried out at remote nodes and requests for bulk data to be transmitted across the network.

The CS offers a Standard Communication Interface which provides both a connection-oriented transport (ISO Transport Class 4 - ISO 8073) and a connectionless transport (Draft ISO DIS 8602). These services may be used directly by any component of the kernel. However, normal use of the CS is via its Inter-Kernel Message service (IKM). The IKM lies above the transport layer and provides an efficient packet protocol for both announcements and RPC-like communication between kernels at different nodes.
Normally, it uses datagrams to transmit the information, but a connection may be used if a large amount of data has to be sent. In remote invocations or queries to a remote kernel, the protocol is optimised to use a single packet in each direction, thus reducing protocol overheads.

Support for heterogeneous machines is provided. Information is transmitted in its raw form and is translated only at the destination node. One advantage of this approach is that for communication between homogeneous machines no translation has to be done. Each component uses a table-driven translator, avoiding the need to send extra information describing the contents of the message.

STATUS OF IMPLEMENTATIONS

GUIDE Implementation

At the present date (May 1988), version 1 of the GUIDE implementation is running on Bull SPS7/300 and Matra-Datasysteme MS-3 machines. Both machines are based on 68020 processors and the supporting system is a version of Unix System V, plus socket communication from BSD 4.2.

Version 1 is essentially a single-node implementation, the aim of which is to test the integration of the basic mechanisms of the kernel, and the integration between the kernel and secondary storage management. The following features are implemented:

- Support for jobs and activities. Currently, a single activity per job is supported, and a node may support several independent jobs.

- Support for the Virtual Object Memory. The basic mechanism for object binding and local operation invocation is implemented, including dynamic linking to the required code of the operation. Shared objects are not yet supported.

- Single-container secondary storage. A container is implemented on a single node. The secondary storage is integrated with the kernel via the object fault mechanism.

In addition, a first version of a C pre-processor for the GUIDE language (a subset of the OSCAR language) has been implemented. This allows us to exercise the basic system mechanisms by executing compiled programs, including object invocations. This version includes neither inheritance nor support for concurrent activities, which are both expected to be provided by October 1988, together with a first multi-node version of the kernel and secondary storage.

This first version has been experimented with since mid-March 1988. Most of the work has been devoted to performance tuning. Since the basic mechanisms of the kernel are implemented on top of Unix, the performance is certainly inferior to that of an implementation on a bare machine. However, our goal is to achieve a performance level which would allow us to run realistic applications with acceptable response time.
Preliminary results so far indicate that this goal is achievable.

Oisin Implementation

At the present date (July 1988), a version of Oisin is running on an NS32332. This version is a single-node implementation. The following features are implemented:

- Support for the Virtual Object Memory. The basic mechanism for object binding and local operation invocation is implemented, including dynamic linking to the required code of the operation. Shared objects are not yet supported.

- Persistent memory, so that when an application program terminates, its results linger in secondary storage without requiring additional effort on behalf of the programmer.

- Multiple jobs, each containing multiple activities. Synchronisation is provided via semaphores.

- Single-container secondary storage. A container is implemented on a single node. The secondary storage is integrated with the kernel via the object fault mechanism.

- Object clustering, within both the Virtual Object Memory and the Storage System.

- The I/O subsystem, for disk accesses. Terminal and Ethernet drivers are being coded.

- A Name Service, providing a UNIX-like directory hierarchy, with each directory being an object. A command interpreter, using the Name Service, is also implemented, allowing objects to be interactively invoked and examined.

In addition, support for Modula-2 application programs has been implemented, allowing us to write COMANDOS implementation objects in Modula-2 and exercise the Oisin kernel. A full syntax analyser for Oscar is also completed, and semantic analysis and code generation for the basic constructs of the language are now being attempted.

The following are some preliminary performance results to date (on a 10 MHz NS32332):

- Internal Procedure Call and Return using the JSR machine instruction: 4 microsecs.

- External Procedure Call and Return using the CXP machine instruction: 8 microsecs. The CXP instruction is normally used by a compiler to aid separate compilation.

- Basic intra-cluster object invocation: 21 microsecs.

- Basic inter-cluster object invocation: 310 microsecs.

- Towers of Hanoi execution time (9 discs) in standard Modula-2 running under UNIX 4.2 BSD: 4.2 secs.

- Ditto using intra-cluster calls running under Oisin, using stub routines for Modula-2 for object invocation: 5.8 secs.

Our immediate plans are to complete and integrate the Communications Subsystem, allowing us to provide support for job diffusion. We also wish to further optimise the kernel, including the times for basic object invocation. Finally, we intend to attempt the port to a micro-VAX 2 in the autumn.

IK Implementation

At the present date (May 1988), a version of IK is running on a PC/AT.
This version is a single-node implementation, which provides the following features:

- Support for the Virtual Object Memory. The basic mechanism for object binding and local operation invocation is implemented, including dynamic linking to the required code of the operation. Shared objects are not yet supported.

- Single activity and single job, although several processes exist inside the kernel.

- Single-container secondary storage. A container is implemented on a single node. The secondary storage is integrated with the kernel via the object fault mechanism.

Application programs are (carefully) coded in C with calls to a small library to interface to the system.

A major part of the work was devoted to the development of the run-time support for the kernel processes. However, this has proven to have been worth the effort, because some kernel processes, such as the SS and the CS, may be developed and partially debugged under MS-DOS, using all its debugging tools, and later integrated into the kernel with minor changes.

Preliminary performance results indicate that the main design choices are appropriate, although much more experience is still needed before drawing conclusions. On a PC/AT running at 8 MHz, the message passing primitive takes 800 microsecs to send a 64-byte message and perform the process switch. Object invocation in virtual memory currently takes 100 microsecs, but no attempt has yet been made to optimise this code (e.g. by writing an assembly language routine).

RELATED WORK AND FUTURE PLANS

In summary, COMANDOS is conceptualising, designing and, most importantly, implementing a vendor-independent platform for distributed processing, including management of long-lived data, programming language support and on-line administration. To date, no such vendor-independent infrastructure exists. OSI may be used as a basis for the interconnection of equipment from various vendors, but not as a common platform for the numerous interacting applications required in the integrated electronic organisation. The availability of such a vendor-independent integrated platform for programming distributed applications, coupled with data management, operating in an environment of heterogeneous machines, would be a significant advance over current exploitations of OSI. It would also advance the state of the art in distributed systems.

UNIX is an accepted vendor-independent system interface.
Recent extensions to UNIX, such as the Sun Network File System [Sandberg86] or PCTE [Bourguignon85], extend aspects of the UNIX interface to distributed environments. However, none of them provides at the same time all of the facilities which characterise the COMANDOS programming interface: the distributed objects considered in the Sun Network File System are basically limited to UNIX files, while the PCTE OMS (Object Management System) provides a simple object store and its type system is rather weak. We consider COMANDOS a significant evolutionary development of UNIX.

A major achievement of the project so far has been the definition of an Architecture of a Virtual Machine for general distributed processing. This architecture emphasises the integration of technology from distributed operating systems, distributed programming languages and distributed databases. This integration objective is achieved through the use of an object-oriented approach. It should be noted that a similar approach has been adopted in the UK Advanced Network Systems Architecture (ANSA) [ANSA87], and is currently appearing in international standardisation activities such as the ECMA DASE proposal and the ISO ODP work item. More specifically, these projects are considering a computational model which is close to that of COMANDOS. Although COMANDOS should not be seen primarily as an architecture definition project, it is expected to draw from the ongoing prototyping phase significant experience of the interaction between the computational model and other aspects of the virtual machine interface,
most notably those for data management. This implementation experience should be valuable input to the standardisation process.

A number of prototypes are expected at the end of the current three-year project (February 1989):

- Three implementations of the COMANDOS kernel, as described in section 3. The comparison between the UNIX-based kernel and the native kernels should provide useful information about the limits of an approach based on the extension of the standard UNIX interface.

- Partial implementation of some basic system services for data management, type management and object naming, running on top of the kernel. These services extend the COMANDOS virtual machine to provide high-level facilities for building distributed applications.

- Two implementations of the COMANDOS language (OSCAR): one provided as an extension of an existing language (preprocessing C), and a full compiler for a subset of the language. The language will also be used for programming some of the system services mentioned above.

Starting from the results of this first three-year experience, a second phase of the COMANDOS project is foreseen in the framework of ESPRIT II.
The overall objective of this second phase is to consolidate and integrate the various work items available at the end of the current phase, in order to provide a full implementation of the COMANDOS programming platform and virtual machine. The main aspects of this project may be summarised as follows:

- Provide a full compiler for the COMANDOS Language.

- Implement the COMANDOS system (with full capabilities) both on bare hardware and on an industrial low-level distributed kernel (CHORUS).

- Extend the X/OPEN UNIX kernel to support the COMANDOS features efficiently, and consequently to enhance UNIX towards the COMANDOS interface.

- Support basic administrative facilities such as, for example, standard distributed directory services.

- Finally, provide a testbed application, both for checking the suitability of the COMANDOS model and language, and for evaluating the internal mechanisms provided by the COMANDOS system.

ACKNOWLEDGEMENTS

The authors wish to acknowledge, in particular, the contribution of those involved in the various kernel implementations: BULL: ?????; INESC: ????; LGI: Dominique Decouchant, Andrzej Duda, Hiep Nguyen Van, Michel Riveill and Xavier Rousset de Pina; TCD: Edward Finn and Gradimir Starovic.

The authors also wish to acknowledge all those who have contributed to the Object-Oriented working group of the COMANDOS project: BULL: ????; IEI: E. Bertino, R. Gagliardi, G. Mainetto, C. Meghini, F. Rabitti and C. Thanos; INESC: ???; LGI: M. Meysembourg, C. Roisin and R. Scioville; Nixdorf: G. Mueller and K. Profrock; Olivetti: A. Billocci, A. Converti, M. Farusi, L. Martino and C. Tani; TCD: A. Donnelly, A. El-Habbash, F. Naji, A. O'Toole, B. Tangney, B. Walsh and I. White.

REFERENCES

[ANSA87] ANSA, "The ANSA Reference Manual: Release 00.03," 1987.

[Bach86] M. Bach, "The Design of the UNIX Operating System," Prentice-Hall, 1986.

[Bourguignon85] Bourguignon, "Overview of PCTE: A Basis for a Portable Common Tool Environment," in Proceedings of ESPRIT Technical Week, September 1985.

[Cheriton84] D. Cheriton, "The V-Kernel: A Software Base for Distributed Systems," IEEE Software, Vol. 1, No. 2, April 1984.

[COMANDOS87] COMANDOS: Object Oriented Architecture, D2-T2.1-870904.

[COMANDOS88a] COMANDOS: OSCAR Preliminary Programming Language Manual, D1-T3.2.3.2-880331.

[COMANDOS88b] COMANDOS: Tutorial Introduction to OSCAR, D1-T3.2.3.2-880331.

[COMANDOS88c] COMANDOS: COMANDOS Integration System, February 1988.

[Daley68] R. Daley and J. Dennis, "Virtual Memory, Processes, and Sharing in MULTICS," Comm. ACM, Vol. 11, No. 5, May 1968, pp. 306-312.

[Decouchant88] D. Decouchant, A. Duda, A. Freyssinet, M. Riveill, X. Rousset de Pina, R. Scioville and G. Vandome, "An Implementation of an Object-Oriented Distributed System Architecture on Unix," Proc. EUUG Conference, Lisbon, October 1988.

[Horn87] C. Horn and S. Krakowiak, "Object Oriented Architecture for Distributed Office Systems," in ESPRIT '87: Achievements and Impact, North-Holland, 1987.

[Horn88] C. Horn, A. Ness and F. Reim, "Construction and Management of Distributed Office Systems," Proceedings of EURINFO '88, Athens, May 1988.

[Intel83] Intel, "iAPX286 Operating Systems Writer's Guide," 1983.

[Jones86] M.B. Jones and R.F. Rashid, "Mach and Matchmaker: Kernel and Language Support for Object-Oriented Distributed Systems," Proc. First ACM Conf. on Object-Oriented Programming Systems, Languages and Applications (OOPSLA), Portland, September 1986, pp. 67-77.

[Kenah84] L. Kenah and S. Bate, "VAX/VMS Internals and Data Structures," Digital Press, 1984.

[Lampson81] B.W. Lampson, "Atomic Transactions," in Distributed Systems - Architecture and Implementation, Springer-Verlag, 1981, pp. 246-264.

[Mullender85] S. Mullender, "Principles of Distributed Operating System Design," Ph.D. Thesis, Vrije Universiteit, Amsterdam, October 1985.

[National86] National Semiconductor Corporation, NS32000 Series Databook, 1986.

[Sandberg86] R. Sandberg, "The Sun Network Filesystem: Design, Implementation and Experience," Spring EUUG Conference, 1986.

[Walker83] B. Walker, G. Popek, R. English, C. Kline and G. Thiel, "The LOCUS Distributed Operating System," Proc. of the 9th ACM Symposium on Operating Systems Principles, 1983.

[Zimmermann84] H. Zimmermann, M. Guillemont, G. Morisset and J.S. Banino, "Chorus: a Communication and Processing Architecture for Distributed Systems," RR 328, INRIA, Rocquencourt, September 1984.