Overview of IBM System/390 Parallel Sysplex - A Commercial Parallel Processing System

Jeffrey M. Nick
IBM System/390 Division
522 South Road
Poughkeepsie, NY 12601, USA
jeff [email protected]

Jen-Yao Chung, Nicholas S. Bowen
IBM Thomas J. Watson Research Center
P. O. Box 704
Yorktown Heights, NY 10598, USA
jychung,[email protected]

Abstract

Scalability has never been more a part of System/390 than with Parallel Sysplex. The Parallel Sysplex environment permits a mainframe or Parallel Enterprise Server to grow from a single system to a configuration of 32 systems (initially), and appear as a single image to the end user and applications. The IBM S/390 Parallel Sysplex provides capacity for today's largest commercial workloads by enabling a workload to be spread transparently across a collection of S/390 systems with shared access to data. By way of its parallel architecture and MVS operating system support, the S/390 Parallel Sysplex offers near-linear scalability and continuous availability for customers' mission-critical applications. S/390 Parallel Sysplex optimizes responsiveness and reliability by distributing workloads across all of the processors in the Sysplex. Should one or more processors fail, the workload is redistributed across the remaining processors. Because all of the processors have access to all of the data, the Parallel Sysplex provides a computing environment with near-continuous availability.

1 Introduction

Parallel and clustered systems are emerging as common architectures for scalable commercial systems. Once popular in numerically intensive environments, the introduction of advanced coupling technology (e.g., S/390's ESCON [4] and Coupling Facility [7], the SP2 switch [11], and Tandem's ServerNet [2]) is driving the acceptance of these systems in commercial markets.

These systems share some common objectives; namely, to harness large amounts of processing power while providing availability improvements over single systems. Their architectures span a broad spectrum, from traditional parallel processors that were initially focused on high performance for numerically intensive workloads [6] to clustered operating systems that focus on availability [1]. This paper describes a new parallel architecture and a set of related products for IBM's S/390 processors and MVS operating system. The system is unique: from an architectural perspective it contains many novel features that enable parallelism, yet at the same time a single system image is preserved. From the end-user's view it appears as a scalable and available system by taking advantage of database managers that have themselves become "parallelized."

This paper describes the architecture and design objectives for the S/390 parallel systems (herein called "Parallel Sysplex"). This architecture contains new and innovative parallel data-sharing technology, allowing direct, concurrent read/write access to shared data from all processing nodes in the parallel configuration, without sacrificing performance or data integrity. This in turn enables work requests associated with a single workload to be dynamically distributed for parallel execution on systems in the sysplex based on available processor capacity rather than data-to-system affinity. Through this state-of-the-art parallel technology, the power of multiple MVS/390 systems can be harnessed to work in concert on common workloads, taking the commercial strengths of the MVS/390 platform to new heights in terms of competitive price/performance, scalable growth and continuous availability. The key design objectives in guiding this system were:

- Reduced total cost of computing.
- Compatibility with existing systems and programs.
- Dynamic workload balancing.
- Scalability and granular growth.
- Continuous availability of information assets.

The purpose of the paper is to review the S/390 Parallel Sysplex architecture and the MVS operating system services built on that architecture, and to provide an overview of the key application environments that exploit the S/390 Parallel Sysplex technology. This paper is organized as follows. Section 2 presents the objectives of building parallel systems. Section 3 discusses the technology enhancements for data sharing. Section 4 discusses the scalability of the S/390 Parallel Sysplex. Section 5 presents the exploitation product details. Section 6 concludes the paper.


2 Design Objectives

This section describes the basic objectives that drove the design of the system.

2.1 Reduced Cost of Computing

The primary business objective was to reduce the total cost of computing. This meant using S/390 CMOS microprocessors to leverage industry-standard CMOS technology to price/performance advantage, both in terms of reduced base manufacturing cost and significant on-going customer savings in reduced power, cooling and floorspace requirements.

Although the use of multiple interconnected microprocessors can provide the aggregation of large amounts of processing power, low cost can only be truly achieved if the processors are efficiently utilized. Therefore, the ability to dynamically and automatically manage system resources is a key objective. A new component, the Workload Manager (WLM), was designed to meet this objective.
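The paper does not detail WLM's algorithm; the sketch below only illustrates the general goal-oriented idea behind such a manager, computing a performance index (achieved response time divided by goal) for hypothetical service classes, the kind of signal a resource manager could use to decide which workload to help next. All names and numbers are invented for illustration, not actual WLM interfaces.

```go
// Illustrative goal-oriented resource-management signal, assuming
// response-time goals per service class. Not the real WLM algorithm.
package main

import "fmt"

type ServiceClass struct {
	Name     string
	GoalMs   float64 // response-time goal
	ActualMs float64 // measured response time
}

// performanceIndex > 1 means the class is missing its goal and is a
// candidate to receive resources from classes that are over-achieving.
func performanceIndex(c ServiceClass) float64 { return c.ActualMs / c.GoalMs }

func main() {
	classes := []ServiceClass{
		{"ONLINE", 200, 340},
		{"BATCH", 5000, 2100},
	}
	for _, c := range classes {
		fmt.Printf("%-6s PI=%.2f\n", c.Name, performanceIndex(c))
	}
	// ONLINE PI=1.70  -> help this class first
	// BATCH  PI=0.42
}
```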

While the S/390 Parallel Sysplex is physically composed of multiple MVS systems, it has been designed to logically present a single system image to end-users, applications, and the network, and to provide a single point of control to the systems operations staff. Systems management costs do not increase linearly as a function of the number of systems in the sysplex. Rather, total-cost-of-computing efficiencies of scale accrue through the centralized control over the integrated multi-system configuration.

2.2 Compatibility

The second key objective was to maintain compatibility with existing systems and programs. Given the huge customer investment in S/390 MVS commercial applications, it was imperative for success that the parallel sysplex technology be introduced in a manner compatible with customers' existing application base. A further design objective was to let customers' existing application investments benefit transparently from the parallel sysplex data-sharing technology. With few exceptions, these objectives have been met. The parallel sysplex technology extensions to the S/390 architecture (introducing new cpu instructions, new channel subsystem technology, etc.) are fully compatible with the base S/390 architecture. The IBM subsystem transaction managers (CICS and IMS) and key subsystem database managers (DB2, IMS/DB) have exploited the data-sharing technology while preserving their existing interfaces. This has protected customer investments in OLTP and decision support applications and also provided improvements in terms of reduced cost of computing, scalable growth, and application availability.

2.3 Dynamic Workload Balancing

The ability to dynamically adjust total system resources to best satisfy workload objectives in real time is a key objective in a commercial parallel processing environment [14]. There are two fundamental approaches to workload distribution and data access in commercial parallel processing systems: data-partitioning (i.e., "shared nothing") system designs and data-sharing system designs. In a data-partitioning system, the database and the workload are divided among the set of parallel processing nodes so that each system has sole responsibility for workload access and update to a defined portion of the database. The data-partitioning is required in order to enable each system to locally cache data in processor memory with coherency and to eliminate the need for cross-system serialization protocols in providing data-access concurrency control. This is the approach most commercial parallel processing systems have taken.

However, limitations are imposed in a commercial processing environment by such a design point [13]. Significant capacity planning skills and cost are required to tune the overall system to match each system node's processing capacity to the projected workload demand for access to data owned by that given system. While it is possible to achieve an optimized match between system capacity and workload demand for a well-tuned benchmark environment based on careful system monitoring and analysis, real commercial workload applications are not so well-behaved. Significant fluctuations in the demand for system processor resources and access to data occur during real-time workload execution, both within a single workload and across multiple workloads in concurrent execution across the parallel processing system nodes. These real-time spikes and troughs in system capacity demand can result in significant over- or under-utilization of system resources across all of the parallel nodes. This problem is further aggravated by the fact that commercial processing applications are becoming more complex in their nature with respect to the diversity of data that such applications access during execution of business transactions. In a data-partitioning system, it becomes increasingly difficult to insulate a particular business transaction to execution on a single system node without incurring the overhead associated with message-passing requests for data owned by other nodes in the parallel configuration.

The S/390 Parallel Sysplex environment employs the "data-sharing" strategy. The new high-performance data-sharing technology provides the means for MVS and its subsystems to support dynamic workload balancing across the collection of systems in the configuration. Functionally, workload balancing can occur at two levels. Initially, during user logon, session binds can be dynamically distributed to balance the load across the set of systems. Subsequently, work requests submitted by a given user can be executed on any system in the configuration based on available processing capacity, instead of being bound to a specific system due to data-to-processor affinity (which is typically the case with alternative data-partitioning parallel systems). Normally, work will execute on the system on which the request is received, but in cases of over-utilization on a given node, work can be directed to other less-utilized system nodes.
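As a concrete reading of that policy, here is a minimal sketch in Go of the routing decision: run the request where it arrived unless the local system is over a utilization threshold, otherwise send it to the least-utilized peer. The types, names, and threshold are assumptions for illustration; MVS makes this decision with far richer WLM input.

```go
// Hypothetical sketch of the over-utilization routing rule described above.
package main

import "fmt"

type System struct {
	Name        string
	Utilization float64 // fraction of capacity in use, 0.0 to 1.0
}

// route returns the system that should run a new work request: the local
// system unless it is busier than the threshold, in which case the
// least-utilized peer is chosen.
func route(local *System, peers []*System, threshold float64) *System {
	if local.Utilization < threshold {
		return local // normal case: run where the request arrived
	}
	best := local
	for _, p := range peers {
		if p.Utilization < best.Utilization {
			best = p
		}
	}
	return best
}

func main() {
	local := &System{"SYSA", 0.95}
	peers := []*System{{"SYSB", 0.60}, {"SYSC", 0.80}}
	fmt.Println("run on:", route(local, peers, 0.90).Name) // run on: SYSB
}
```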

Examples of two types of commercial workloads that lend themselves well to dynamic workload balancing include Online Transaction Processing (OLTP) and decision support. OLTP workloads are composed of many individual work requests, i.e., transactions, each transaction being relatively atomic in its execution with respect to other transactions in the workload. Thus, it is possible to balance the OLTP workload by distributing individual transactions for execution in parallel across the set of systems in the parallel sysplex. Decision support workloads consist predominantly of query requests, wherein a given query can involve scanning multiple relational database tables. Here, parallelism can be attained by breaking up complex queries into smaller sub-queries, and distributing the component queries across multiple processors (cpus) within a single system or across multiple systems in a parallel sysplex. Once all sub-queries have completed, the original query response can be constructed from the aggregate of the sub-query answers and returned to the requester, as sketched below. For both OLTP and decision support workloads, dynamic workload balancing across systems can be made predominantly transparent to the customer applications or users, which remain unchanged.
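A minimal sketch of that fan-out/aggregate pattern, with a trivial sum standing in for a relational sub-query; real sub-query generation and distribution across systems are far more involved than this single-process illustration.

```go
// Split a "query" into sub-queries over data fragments, run them in
// parallel, and build the answer from the sub-query answers.
package main

import (
	"fmt"
	"sync"
)

func subQuery(rows []int) int { // stand-in for scanning one table fragment
	total := 0
	for _, r := range rows {
		total += r
	}
	return total
}

func parallelQuery(fragments [][]int) int {
	results := make([]int, len(fragments))
	var wg sync.WaitGroup
	for i, f := range fragments {
		wg.Add(1)
		go func(i int, f []int) { // one sub-query per fragment, in parallel
			defer wg.Done()
			results[i] = subQuery(f)
		}(i, f)
	}
	wg.Wait()
	sum := 0 // aggregate the sub-query answers into the original response
	for _, r := range results {
		sum += r
	}
	return sum
}

func main() {
	fragments := [][]int{{1, 2, 3}, {4, 5}, {6}}
	fmt.Println(parallelQuery(fragments)) // 21
}
```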

2.4 Scalability and Granular Growth

In the S/390 Parallel Sysplex environment, processing capacity can be added in granular increments, from the addition of a single processor within an existing system to the introduction of one or more data-sharing systems. New systems can be introduced into the parallel sysplex in a non-disruptive manner. That is, the already-running systems continue to execute work concurrent with the activation of the new system. Once the new system is active, it can become a full participant in dynamic workload balancing. New work requests are naturally driven at an increased rate to that system until its utilization has reached steady state with respect to the demand for overall processor resources across all system nodes in the parallel sysplex configuration. This capability eliminates the need (and considerable cost) to re-partition the databases and re-tune each system's workload affinity to distribute work evenly after introduction of the new system into the configuration, as is typically required with a data-partitioned parallel processing system.

Most significantly, the parallel sysplex data-sharing technology enables systems to be added to the configuration with near-linear scalability and nearly unlimited capacity. The first S/390 Parallel Sysplex implementation supports up to 32 systems, where each system can be a tightly-coupled multi-processor system with up to 10 cpus. In a parallel sysplex consisting of 32 S/390 CMOS systems, a total processing capacity of several thousand S/390 MIPS is configurable.

2.5 Continuous Availability

With the advent of the S/390 Parallel Sysplex data-sharing technology and its exploitation by the MVS system and subsystems, it is possible to construct a parallel processing environment with no single points of failure.

Since all systems in the parallel sysplex can have concurrent access to all critical applications and data, the loss of a system due to either hardware or software failure does not necessitate loss of application availability. Peer instances of a failing subsystem executing on the remaining healthy systems can take over recovery responsibility for resources held by the failing instance, or the failing subsystem can be automatically restarted on still-healthy systems by the MVS Automatic Restart Manager (ARM) component to perform recovery for work in progress at the time of the failure. While the failing subsystem instance is unavailable, new work requests can be redirected to other data-sharing instances of the subsystem to provide continuous application availability across the failure and subsequent recovery.

The ARM component is fully integrated with the existing parallel structure and provides significantly more function than a traditional "restart" service. First, it utilizes the shared state support described in Section 3.2, so at any given point in time it is aware of the state of all processes on all processors (i.e., even of processes that "exist" on failed processors). Second, it is tied into the processor heartbeat functions so that it is immediately aware of processor failures. Third, it is integrated with the WLM so that it can provide a target restart system based on the current resource utilization across the available processors. Finally, it contains many features to provide improved restarts, such as affinity of related processes, restart sequencing, and recovery when subsequent failures occur. These services are described more fully in [3].
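The following sketch illustrates just the placement decision described above, under stated assumptions: ARM-like logic learns the failed system's processes from shared state, keeps affinity groups together, and asks a WLM-like function for the least-utilized healthy target. All names and the load increment are hypothetical.

```go
// Toy restart-placement logic in the spirit of ARM; not its actual design.
package main

import "fmt"

type Process struct {
	Name     string
	Affinity string // processes with the same affinity restart together
}

func leastUtilized(util map[string]float64) string {
	best, bestU := "", 2.0
	for sys, u := range util {
		if u < bestU {
			best, bestU = sys, u
		}
	}
	return best
}

// restartAll places every process from the failed system, keeping each
// affinity group on a single target and charging that target for the load.
func restartAll(failed []Process, util map[string]float64) map[string]string {
	placement := map[string]string{} // affinity group -> target system
	out := map[string]string{}       // process -> target system
	for _, p := range failed {
		target, ok := placement[p.Affinity]
		if !ok {
			target = leastUtilized(util)
			placement[p.Affinity] = target
			util[target] += 0.20 // assume each group adds this much load
		}
		out[p.Name] = target
	}
	return out
}

func main() {
	failed := []Process{{"DB2A", "db"}, {"IRLM", "db"}, {"CICS1", "oltp"}}
	util := map[string]float64{"SYSB": 0.55, "SYSC": 0.70}
	fmt.Println(restartAll(failed, util))
	// map[CICS1:SYSC DB2A:SYSB IRLM:SYSB]
}
```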

The same availability characteristics associated with handling unscheduled outages apply to planned outages as well. A system can be removed from the parallel sysplex for planned hardware or software reconfiguration, maintenance or upgrade. New work can be dynamically re-distributed across the remaining set of active systems. Once the system is ready to be brought back online, it is re-introduced into the sysplex in a non-disruptive manner and participates in dynamic workload balancing and re-distribution as described earlier. In this manner, new releases of MVS and key IBM subsystems supporting the parallel sysplex environment also support release-to-release migration coexistence, allowing new software product release levels to be rolled through the parallel sysplex one system at a time and providing continuous application availability across the systematic migration install process.

Further, since all systems in the parallel sysplex can be configured to provide concurrent, direct access to common customer applications and data, each individual system only requires 1/N spare capacity (where N represents the number of fully-configured systems in the sysplex) in order for all remaining systems to continue execution of critical workloads without any observable loss of service in the event of any single system failure. For example, in a ten-system sysplex, holding each system at or below 90% utilization leaves the nine surviving systems enough aggregate headroom to absorb the full workload of one failed peer.

3 S/390 Parallel Systems

This section provides an overview of the technical capabilities of the S/390 Parallel Sysplex. It covers the overall system architecture, the basic operating system support for parallel systems, and the advanced technology introduced to enable efficient coupling of systems.

3.1 System Model

Figure 1 shows the overall structure of the system. It consists of a set of processing nodes (each of which can be a tightly coupled multiprocessor) connected to shared disks. There can be up to 32 processing nodes, where each node can be a tightly coupled multiprocessor containing between 1 and 10 processors. The systems do not have to be homogeneous; that is, mixed configurations supporting both S/390 CMOS processor systems and traditional ES/9000 bipolar systems can be deployed. The basic processor design has a long history of fault-tolerant features [10]. The disks are fully connected to all processors. The I/O architecture has many advanced reliability and performance features (e.g., multiple paths with automatic reconfiguration for availability). The basic I/O architecture is described in [4] and one aspect of the dynamic I/O configuration is described in [5]. The sysplex timer serves as a synchronizing time reference source for systems in the sysplex, so that local processor timestamps can be relied upon for consistency with respect to timestamps obtained on other systems. The Coupling Facility (CF) is a key Parallel Sysplex technology component providing multi-system data-sharing functions and is described in Section 3.3.

[Figure 1: System Model. Up to 32 systems, mixing S/390 CMOS mainframes and ES/9000 systems, attach through ESCON channels to shared data and are coordinated by a Coupling Facility and a Sysplex Timer.]

3.2 Base MVS Multi-system Services

There is a set of operating system services provided as building blocks for multi-system services. These are described in detail in [12]; here we briefly cover three of the most relevant aspects. First, a set of group membership services is provided. These allow processes to join/leave groups, signal other group members, and be notified of events related to the group. Second, the ability to provide efficient, shared access to operating system resource state data is provided. This data is located on shared disks, and many advanced functions are provided, including serialized access to the data (with special time-out logic to handle faulty processors) and duplexing of the disks containing the state data. In addition, there are availability enhancements for planned and unplanned changes to the state repositories (e.g., "hot switching" of the duplexed disks). Third, processor heartbeat monitoring is provided. In addition to standard monitoring of each processor's health, functions are also provided to automatically terminate a failed processor and disconnect the processor from its I/O devices. This enables other multi-system components to be designed with a "fail-stop" strategy, preventing problems from processors that appear failed to the heartbeat function but then resume processing.
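A toy model of that heartbeat and fail-stop behavior: systems stamp heartbeats, and a sweep fences (terminates and disconnects) any system whose stamp is older than the timeout, so an apparently dead system cannot quietly resume and corrupt shared state. The Monitor type and fence callback are illustrative only.

```go
// Minimal heartbeat monitor with fail-stop fencing, as a sketch.
package main

import (
	"fmt"
	"time"
)

type Monitor struct {
	timeout    time.Duration
	heartbeats map[string]time.Time
	fence      func(system string) // terminate and disconnect I/O, abstracted
}

func (m *Monitor) Beat(system string) { m.heartbeats[system] = time.Now() }

// Sweep fences every system whose last heartbeat is older than the timeout,
// guaranteeing fail-stop: a fenced system is never allowed back silently.
func (m *Monitor) Sweep() {
	now := time.Now()
	for sys, last := range m.heartbeats {
		if now.Sub(last) > m.timeout {
			m.fence(sys)
			delete(m.heartbeats, sys)
		}
	}
}

func main() {
	m := &Monitor{
		timeout:    50 * time.Millisecond,
		heartbeats: map[string]time.Time{},
		fence:      func(s string) { fmt.Println("fenced:", s) },
	}
	m.Beat("SYSA")
	m.Beat("SYSB")
	time.Sleep(60 * time.Millisecond)
	m.Beat("SYSB") // SYSB stays healthy; SYSA has gone stale
	m.Sweep()      // fenced: SYSA
}
```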

3.3 Coupling Facility

Given the advantages outlined above for data-sharing in a parallel processing system environment, the obvious question arises as to why the predominant industry structure for parallel processing is a data-partitioning model. The basic answer lies in the fact that, in the past, data-sharing multi-node parallel processing systems have exhibited poor performance and rapidly-diminishing scalability characteristics as the number of nodes in the parallel configuration grows. The data-sharing performance overhead and limited scalability are driven by two fundamental structural attributes. First, significant processing overhead is incurred with respect to the need to provide multi-system concurrency controls for serialized access to shared data. High inter-system communication traffic arises in order to grant/release locks on shared resources. Second, significant processing overhead is incurred in order to provide multi-system buffer coherency controls for shared data cached in each system's processor memory. This function is essential for a data-sharing parallel structure, as it is critical for performance to enable local caching of shared data with full read/write integrity. Here, the overhead is associated with processing to broadcast messages to other nodes to perform buffer invalidation when an update to shared data is made on one system, or to determine whether data cached in local memory is current at the time of use.

The S/390 Parallel Sysplex introduces new architecture, hardware and software technology to address the fundamental performance obstacles that have heretofore precluded implementation of a high-performance, scalable, data-sharing parallel-processing system, as shown in Figure 2.

At the heart of this inter-system "coupling" technology is the Coupling Facility (CF), a new component providing hardware assists for a rich and diverse set of multi-system functions, including:

- High-performance, finely-grained locking and contention detection
- Global buffer coherency mechanisms for distributed local caches
- Shared intermediate memory for global data caching
- Queueing mechanisms for workload distribution and message-passing

[Figure 2: Parallel Sysplex Data-Sharing Architecture. Each S/390 system runs MVS and a database manager holding locks and data buffers; requests flow through MVS Sysplex Services to a Coupling Facility containing lock, cache, and list structures, which provides multi-system serialization and changed-data management over shared DASD.]

Physically, the Coupling Facility consists of hardware and specialized Coupling Facility microcode supporting the S/390 Parallel Sysplex architecture extensions. The hardware for the CF is also based on the S/390 processor, which provides an additional cost advantage. Coupling Facilities are physically attached to S/390 processors via high-speed coupling links. The coupling links support specialized protocols for highly-optimized transport of commands and responses to/from the CF. The coupling links are fiber-optic channels providing data transfer rates of either 50 or 100 megabytes per second. Commands to the CF can be executed synchronously or asynchronously, with cpu-synchronous command completion times measured in microseconds, thereby avoiding the asynchronous execution overheads associated with task switching and processor cache disruptions. Multiple CFs can be connected for availability, performance, and capacity reasons.

Logically, the CF storage resources can be dynamically partitioned and allocated into CF "structures" subscribing to one of three defined behavior models: lock, cache, and list. Specific commands are supported by each model, and while allocated, CF structure resources can only be manipulated by commands for the structure type specified at initial structure allocation. Multiple CF structures of the same or different types can exist concurrently in the same Coupling Facility.

3.3.1 Lock structures

The Coupling Facility lock model supports high-performance, finely-grained lock resource management, maximizing concurrency and minimizing the communication overhead associated with multi-system serialization protocols. The purpose of this model is to enable a specialized lock manager (e.g., a database lock manager) to be easily extended into a multi-system environment. The CF lock structure provides a hardware-assisted global lock contention detection mechanism for use by distributed lock managers, such as the IMS Resource Lock Manager (IRLM). The lock structure supports a program-specifiable number of lock table entries used to record shared or exclusive interest in software locks, which map via software hashing to a given CF lock table entry. Interest in each lock table entry is tracked for all peers connected to the CF structure across the systems in the sysplex. Through use of efficient hashing algorithms and granular serialization scope, false lock resource contention is kept to a minimum. This allows the majority of requests for locks to be granted cpu-synchronously to the requesting system, where synchronous execution times are measured in microseconds. Only in exception cases involving lock contention is lock negotiation required. In such cases, the CF returns the identity of the system or systems currently holding locks in a state incompatible with the current request, to enable selective cross-system communication for lock negotiation. MVS provides cross-system lock management services to coordinate lock contention negotiation, lock request suspension and completion, and recording of persistent lock information in the Coupling Facility to enable fast lock recovery in the event of an MVS system failure while holding lock resources.
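To make the mechanism concrete, here is an illustrative lock table in Go: lock names hash to a fixed number of entries, each entry records shared or exclusive interest per system, and a request is either granted immediately or answered with the identities of the systems holding incompatible interest. Hash collisions produce exactly the "false contention" that a large table and good hashing keep to a minimum. Nothing here is the actual CF command set; it is a sketch of the idea.

```go
// Hashed lock table with per-system interest and contention reporting.
package main

import (
	"fmt"
	"hash/fnv"
)

type entry struct {
	exclusive string          // system holding exclusive interest, if any
	shared    map[string]bool // systems holding shared interest
}

type LockTable struct{ entries []entry }

func NewLockTable(n int) *LockTable {
	t := &LockTable{entries: make([]entry, n)}
	for i := range t.entries {
		t.entries[i].shared = map[string]bool{}
	}
	return t
}

func (t *LockTable) slot(name string) *entry {
	h := fnv.New32a()
	h.Write([]byte(name))
	return &t.entries[h.Sum32()%uint32(len(t.entries))]
}

// Acquire returns (true, nil) when granted, or (false, holders) naming the
// systems with incompatible interest, mirroring the CF's contention report.
func (t *LockTable) Acquire(name, system string, exclusive bool) (bool, []string) {
	e := t.slot(name)
	var holders []string
	if e.exclusive != "" && e.exclusive != system {
		holders = append(holders, e.exclusive)
	}
	if exclusive {
		for s := range e.shared {
			if s != system {
				holders = append(holders, s)
			}
		}
	}
	if len(holders) > 0 {
		return false, holders // contention: negotiate with these systems
	}
	if exclusive {
		e.exclusive = system
	} else {
		e.shared[system] = true
	}
	return true, nil
}

func main() {
	t := NewLockTable(1024)
	fmt.Println(t.Acquire("REC#42", "SYSA", false)) // true []
	fmt.Println(t.Acquire("REC#42", "SYSB", true))  // false [SYSA]
}
```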

3.3.2 Cache structures

The CF cache structure serves as a multi-system shared data cache coherency manager. The purpose of this model is to enable an existing buffer manager (e.g., a database buffer manager) to be easily extended into a multi-system environment. It enables each system to locally cache shared data in processor memory with full data integrity and optimal performance. Additionally, data can be optionally cached globally in the CF cache structure for high-speed local buffer refresh. As a global shared cache, the CF can be viewed as a second-level cache between local processor memory and DASD in the storage hierarchy.

A CF cache structure contains a global buffer directory which tracks multi-system interest in shared data blocks cached in one or more systems' local buffer pools. A separate directory entry is maintained in the CF structure for each uniquely-named data block. When a database manager, such as IBM's DB2, first connects to a CF cache structure via MVS system services, MVS allocates a local bit vector in protected processor storage on behalf of the database manager. The local bit vector is used to locally track the coherency of data cached in the local buffer pool. The database manager associates each buffer in the buffer pool with a unique bit position in the local bit vector. When the database manager brings a copy of a shared data block from DASD into a local buffer in processor memory, it first registers its interest in that data with the CF, passing the program-specified data block name and the local bit vector index associated with the local buffer where the data block is being cached. The CF now tracks that system's interest in the locally cached data.

Later, when another instance of the database manager on a different system updates its copy of the shared data block, it issues a command to the CF directing the buffer invalidation of any locally cached copies of the same data block on other systems. The CF checks its global buffer directory and then sends a cross-invalidate signal via the coupling links, in parallel, to only those systems having a registered interest in that data block. Specialized coupling link hardware provides processing for multi-system buffer invalidation signals sent by the CF to attached systems. The hardware receives the buffer invalidation signal and updates the CF-specified bit in the data manager's local bit vector to indicate the local copy is no longer valid. This process does not involve any processor interrupt or software involvement on the target system. Work continues without any disruption. Once the CF has observed completion of all buffer invalidation signals, it responds to the system which initiated the data update process. Again, this entire process can be performed cpu-instruction-synchronous to the updating system, with completion times measured in microseconds. The issuing database manager is then free to release its serialization on the shared data block.

When another instance of the database manager attempts to subsequently re-use its local copy of the now-down-level data block, it first checks the coherency of its local buffer copy. This check does not involve a CF access, but rather is achieved through execution of new S/390 cpu instructions which interrogate the state of the specified bit in the local bit vector to determine buffer coherency. If the buffer is invalid, the database manager can re-register its interest in the data block with the CF, which might also return a current copy of the data if it had been cached there by the system that earlier performed the update.
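The sketch below models the register/cross-invalidate/validity-check cycle just described: a directory (playing the CF's role) maps block names to each system's bit-vector index, an update flips the bits of all other registered systems, and the validity check is purely local. All types are invented for illustration; the real mechanism runs in hardware and microcode with no software on the target systems.

```go
// Toy model of local bit vectors kept coherent by a global buffer directory.
package main

import "fmt"

type Directory struct {
	interest map[string]map[string]int // block -> (system -> bit index)
	vectors  map[string][]bool         // system -> local bit vector (true = valid)
}

func NewDirectory() *Directory {
	return &Directory{
		interest: map[string]map[string]int{},
		vectors:  map[string][]bool{},
	}
}

// Connect allocates a system's local bit vector, one bit per local buffer.
func (d *Directory) Connect(system string, buffers int) {
	d.vectors[system] = make([]bool, buffers)
}

// Register records that `system` caches `block` in the buffer tied to `bit`.
func (d *Directory) Register(block, system string, bit int) {
	if d.interest[block] == nil {
		d.interest[block] = map[string]int{}
	}
	d.interest[block][system] = bit
	d.vectors[system][bit] = true
}

// Update cross-invalidates every other registered system's copy of `block`;
// in the real design the targets' bits flip with no software involvement.
func (d *Directory) Update(block, updater string) {
	for sys, bit := range d.interest[block] {
		if sys != updater {
			d.vectors[sys][bit] = false
		}
	}
}

// Valid is the purely local coherency check (no directory access needed).
func (d *Directory) Valid(system string, bit int) bool { return d.vectors[system][bit] }

func main() {
	d := NewDirectory()
	d.Connect("SYSA", 8)
	d.Connect("SYSB", 8)
	d.Register("PAGE#7", "SYSA", 3)
	d.Register("PAGE#7", "SYSB", 5)
	d.Update("PAGE#7", "SYSA") // SYSA updates the block
	fmt.Println(d.Valid("SYSB", 5), d.Valid("SYSA", 3)) // false true
}
```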

Through exploitation of the cache coherency and global buffer cache management mechanisms described above, the Coupling Facility and related S/390 parallel sysplex processor technology provide the means for high-performance, scalable read/write data sharing across multiple systems, avoiding the message-passing overheads typically associated with data-sharing parallel systems.

3.3.3 List structures

The CF list structure supports general-purpose multi-system queueing constructs which are broadly applicable for a wide range of uses, including workload distribution, inter-system message passing, and maintaining shared control block state information. A list structure includes a program-specified number of list headers. Individual list entries are dynamically created when first written and queued to a designated list header. List entries can optionally have a corresponding data block attached at the time of creation or subsequent list entry update. Existing entries can be read, updated, deleted, or moved between list headers atomically, without the need for explicit software multi-system serialization in order to insert or remove entries from a list. List structures can support queueing of entries in LIFO/FIFO order or in collating sequence by key under program control. Optionally, the list structure can contain a program-specified number of lock entries. A common exploitation of the serialized list structure is to request conditional execution of mainline CF commands as long as a specified lock is not held. Recovery operations requiring a static view of a list or the entire structure can set the lock, causing mainline operations to be rejected. Such a protocol avoids the necessity for mainline processes to explicitly gain or release the lock for every request, but still allows such requests to be suspended or rejected in the presence of long-running recovery operations.

Programs can register interest in specific list headers used as shared work queues or in-bound message queues. When an entry is added to the specified list, causing it to go from an empty to a non-empty state, the CF can send a list-non-empty transition signal to the registered program's system, providing an indication, observed via local system polling, that there is work to be processed on the specified list. As with the cache buffer invalidation signal handling, there is no processor interruption or cache disruption caused as a result of processing the list transition signal.
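A toy version of a list structure used as a shared work queue, showing the one behavior emphasized above: the list-non-empty signal fires only on the empty-to-non-empty transition, not on every insert. The Go types are illustrative stand-ins, not the CF list-model commands.

```go
// FIFO work queue with empty-to-non-empty transition notification.
package main

import "fmt"

type ListStructure struct {
	lists    map[string][]string       // list header -> FIFO entries
	watchers map[string][]func(string) // header -> transition callbacks
}

func NewListStructure() *ListStructure {
	return &ListStructure{
		lists:    map[string][]string{},
		watchers: map[string][]func(string){},
	}
}

func (l *ListStructure) Register(header string, cb func(string)) {
	l.watchers[header] = append(l.watchers[header], cb)
}

// Push appends an entry; the list-non-empty signal fires only when the
// list goes from empty to non-empty, not on every insert.
func (l *ListStructure) Push(header, entry string) {
	wasEmpty := len(l.lists[header]) == 0
	l.lists[header] = append(l.lists[header], entry)
	if wasEmpty {
		for _, cb := range l.watchers[header] {
			cb(header)
		}
	}
}

func (l *ListStructure) Pop(header string) (string, bool) {
	q := l.lists[header]
	if len(q) == 0 {
		return "", false
	}
	l.lists[header] = q[1:]
	return q[0], true
}

func main() {
	ls := NewListStructure()
	ls.Register("WORKQ", func(h string) { fmt.Println("non-empty:", h) })
	ls.Push("WORKQ", "txn1") // non-empty: WORKQ
	ls.Push("WORKQ", "txn2") // no signal: list was already non-empty
	e, _ := ls.Pop("WORKQ")
	fmt.Println(e) // txn1
}
```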

4 Parallel Sysplex Scalability

Figure 3 depicts effective total system capacity as a function of the number of physically configured cpus in a processing system. The IDEAL line shows a 1:1 correspondence between physical capacity and effective capacity. That is, as each cpu is added to the total processing system, the full capacity of each additional processor would be realized in terms of available capacity. Real configurations of course do not exhibit this ideal behavior.

[Figure 3: Parallel Sysplex Scalability. Effective capacity versus physical capacity for the ideal 1:1 line, for tightly coupled multiprocessors, and for the Parallel Sysplex.]

The Tightly-Coupled Multi-Processing (TCMP) line shows the behavior of a TCMP as additional cpus are added to the same single physical system. TCMP systems provide maximum effective throughput at relatively small numbers of engines, but as more cpus are added to the TCMP system, incremental effective capacity begins to diminish rapidly, limiting ultimate scalability. This is attributable to the overheads associated with inter-processor serialization, memory cross-invalidation, and the communication required in the hardware to support conceptual sequencing of instructions across cpus, cache coherency, and serialized updates to storage performed atomically to cpu instruction execution. These processes are performed in the hardware without the benefit of knowledge of software serialization that may already be held, at a much coarser level, on the storage being manipulated. In addition, TCMP overheads are incurred in the system software due to software serialization and communication to manage common system resources.

The S/390 Parallel Sysplex scalability characteristics are excellent. Physical capacity introduced to the configuration via the addition of more data-sharing systems in the sysplex (where each system can be a TCMP or uni-processor) provides near-linear effective capacity growth as well. Recent performance studies conducted in a parallel sysplex environment consisting of multiple S/390 9672 CMOS systems running a 100% data-sharing CICS/DBCTL workload demonstrated an incremental overhead cost of less than half a percent for each system added to the configuration. In addition, the initial data-sharing cost associated with the transition from a single-system non-data-sharing configuration to a two-system data-sharing configuration was measured at less than 18% [8, 9].
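One simple way to read those two measurements is the back-of-the-envelope model below. The 18% initial cost and the 0.5% per-system increment come from the text; the assumption that they compose linearly into per-system effective capacity is ours, made only to show the shape of the curve.

```go
// Rough effective-capacity model from the two quoted measurements.
package main

import "fmt"

// effectiveCapacity returns effective capacity in "single-system units"
// for n equal-sized data-sharing systems, under the stated assumptions.
func effectiveCapacity(n int) float64 {
	if n == 1 {
		return 1.0 // non-data-sharing baseline
	}
	// 18% initial data-sharing cost, plus 0.5% per system beyond the second.
	perSystem := 1.0 - 0.18 - 0.005*float64(n-2)
	return float64(n) * perSystem
}

func main() {
	for _, n := range []int{1, 2, 8, 32} {
		fmt.Printf("n=%2d  effective=%.2f\n", n, effectiveCapacity(n))
	}
	// n= 1  effective=1.00
	// n= 2  effective=1.64
	// n= 8  effective=6.32
	// n=32  effective=21.44
}
```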

These results testify to the strength of the S/390 MVS Parallel Sysplex technology in providing near-linear scalability, minimizing the overheads that previously precluded implementation of a true data-sharing parallel-processing system.

5 Exploitation and Product Details

Through exploitation and support of the Parallel Sysplex data-sharing technology, MVS and its major subsystems have combined to provide an industry-leading, fully-integrated commercial parallel processing system.

5.1 Operating System Support

At the base of the software structure, the MVS/ESA Version 5 operating system provides extensive support for Coupling Facility resource management and mainline access services enabling subsystem CF exploitation, as shown in Figure 4. MVS has built these services as extensions to its prior sysplex support, which provided multi-system configuration management, system status monitoring and recovery mechanisms, and inter-system communication facilities. Several MVS base system components, including JES2, RACF, and XCF, are exploiting the Coupling Facility to facilitate or enhance their respective functions in a parallel sysplex configuration. In addition, the MVS Workload Manager component provides policy-driven system resource management for customer workloads, and is a key component in sysplex-wide workload balancing mechanisms.

5.2 Database

IBM's hierarchical and relational database managers, IMS and DB2 respectively, provide multi-system data-sharing through exploitation of the CF cache and lock structures. DFSMS support for multi-system data-sharing of VSAM files is currently under development and will similarly exploit the Coupling Facility.

[Figure 4: Parallel Sysplex Software Structure. Unchanged applications sit atop transaction managers (CICS, IMS TM) and data managers (IMS DB, DB2, VSAM), which use MVS/ESA base services and hardware interfaces for data sharing; VTAM presents a single image to the network, with dynamic workload balancing across the stack.]

With these database products enabled for sysplex-wide data-sharing, the IBM CICS and IMS Transaction Management subsystems are providing multi-system dynamic workload balancing for customers' OLTP workloads. In conjunction with the CICSPLEX/Systems Manager (CICSPLEX/SM) product, CICS has already delivered its dynamic transaction routing capabilities. IMS is currently developing its dynamic workload balancing functions through exploitation of the Coupling Facility for workload distribution.

5.3 Network

VTAM provides a single system image to the SNA network for the Parallel Sysplex through its "Generic Resource" support, enabling session binds for user logons to be dynamically distributed for workload balancing across the systems in the sysplex. VTAM provides the Generic Resource facilities through exploitation of the CF list structure. CICS and DB2 currently support VTAM's generic resource facilities in the parallel sysplex environment. CICS users, for example, can simply log on to "CICS" without having to specify, or be cognizant of, the system to which their session will be dynamically bound.

These and other MVS Parallel Sysplex components and subsystems combine to bring the Parallel Sysplex business advantages of reduced cost, scalable growth, dynamic workload balancing, and continuous availability to customers' existing commercial workloads in a transparent manner. End-users and business applications are insulated from the technology infrastructure through subsystem exploitation and parallelization of the underlying application execution environments they provide.

6 Conclusion

The S/390 MVS Parallel Sysplex is a state-of-the-art commercial parallel-processing system. Through the integration of innovative hardware and software technology, the S/390 MVS platform supports high-performance, direct, concurrent multi-system read/write data-sharing, enabling the aggregate capacity of multiple MVS systems to work in parallel on shared workloads. Exploitation in the base MVS system and key subsystem middleware provides a single system image for the multi-system parallel configuration, with transparent value for end-users and customers' MVS business applications. Future enhancements are focused on leveraging the Parallel Sysplex data-sharing technology to support new application environments, including distributed applications in a heterogeneous networking environment, single system image for native TCP/IP networks, MVS servers for the World Wide Web, and applications exploiting object-oriented technology. The S/390 MVS Parallel Sysplex leverages parallel technology to business advantage, offering competitive price/performance and state-of-the-art availability, scalability and investment protection characteristics.

References

[1] A. Azagury, D. Dolev, J. Marberg, and J. Satran. Highly available cluster: A case study. In 24th Symp. on Fault-Tolerant Computing, pages 404-413, June 1994.

[2] W.E. Baker, R.W. Horst, D.P. Sonnier, and W.J. Watson. A flexible ServerNet-based fault-tolerant architecture. In 25th Symp. on Fault-Tolerant Computing, pages 2-11, June 1995.

[3] N.S. Bowen, C.A. Polyzois, and R.D. Regan. Restart services for highly available systems. In 7th IEEE Symposium on Parallel and Distributed Processing, October 1995.

[4] S.A. Calta, J.A. deVeer, E. Loizides, and R.N. Strangwayes. Enterprise Systems Connection (ESCON) architecture: System overview. IBM Journal of Research and Development, 36(4):535-552, 1992.

[5] R. Cwiakala, J.D. Haggar, and H.M. Yudenfriend. MVS dynamic reconfiguration management. IBM Journal of Research and Development, 36(4):633-646, 1992.

[6] R. Duncan. A survey of parallel computer architectures. Computer, 23(2):5-16, 1990.

[7] IBM Corporation. MVS/ESA Programming: Sysplex Services Guide, 1994.

[8] IBM Corporation. S/390 MVS Parallel Sysplex Performance, March 1995.

[9] C.L. Rao and C. Taaffe-Hedglin. Parallel sysplex performance. In Proceedings of CMG, pages 3-7, December 1995.

[10] L. Spainhower, J. Isenberg, R. Chillarege, and J. Berding. Design for fault-tolerance in System ES/9000 Model 900. In 22nd Symp. on Fault-Tolerant Computing, pages 38-47, July 1992.

[11] C.B. Stunkel et al. The SP2 high-performance switch. IBM Systems Journal, 34(2):185-204, 1995.

[12] M.D. Swanson and C.P. Vignola. MVS/ESA coupled systems considerations. IBM Journal of Research and Development, 36(4):667-682, 1992.

[13] P.S. Yu and A. Dan. Performance analysis of affinity clustering on transaction processing coupling architecture. IEEE Transactions on Knowledge and Data Engineering, 6(5):764-786, October 1994.

[14] P.S. Yu and A. Dan. Performance evaluation of transaction processing coupling architectures for handling system dynamics. IEEE Transactions on Parallel and Distributed Systems, 5(2):139-153, February 1994.