Optical-packet-switched interconnect for supercomputer applications [Invited]

Roe Hemenway and Richard R. Grzybowski

Corning Incorporated, Science and Technology, Corning, New York 14831

[email protected]; [email protected]

Cyriel Minkenberg and Ronald Luijten

IBM Research, Zurich Research Laboratory, Säumerstrasse 4, CH-8803 Rüschlikon, Switzerland

[email protected]; [email protected]

RECEIVED 11 AUGUST 2004; REVISED 8 OCTOBER 2004; ACCEPTED 12 OCTOBER 2004; PUBLISHED 24 NOVEMBER 2004

We describe a low-latency, high-throughput scalable optical interconnect switch for high-performance computer systems that features a broadcast-and-select architecture based on wavelength- and space-division multiplexing. Its electronic control architecture is optimized for low latency and high utilization. Our demonstration system will support 64 nodes with a line rate of 40 Gbit/s per node and operate on fixed-length packets with a duration of 51.2 ns using burst-mode receivers. We address the key system-level requirements and challenges for such applications. © 2004 Optical Society of America

OCIS codes: 060.1810, 250.5980.

1. Introduction

High-performance computing systems (HPCS) interconnect hundreds of high-performance microprocessors by relying on high-sustained-bandwidth, low-latency links to provide timely access to distributed memory and the intermediate results of distant CPUs [1]. For the largest systems, this interconnect has become a key contributor to sustained performance, comparable in importance to microprocessor and software design. In a recent report, the Accelerated Strategic Computing Initiative (ASCI) [1] acknowledges an emerging system imbalance as microprocessor performance outstrips interconnect performance, and identifies opportunities for revolutionary changes to HPCS design on the basis of maturing optical technology. Under a contract related to this program, Corning and IBM are jointly developing a 64-node HPCS interconnect demonstrator with the optical data paths operated at 40 Gbit/s, scalable to 2048 nodes. The Optical Shared MemOry Supercomputer Interconnect System (OSMOSIS) project aims at solving the technical challenges and at accelerating the cost reduction of all-optical packet switches for HPCS.

Optically switched interconnects potentially offer greater scalability, higher throughput, lower latency, and reduced power dissipation compared with electrically mediated systems of comparable capacity. The challenges they pose include optimized integration with the end nodes, a control system and an efficient scheduling algorithm that perform well given the switch’s dynamic characteristics, an efficient data-packet structure, a high-performance control channel (CC), and good channel coding for error management.

Our architecture implements a synchronous broadcast-and-select (B&S) data switch using semiconductor optical amplifiers (SOAs), combining eight wavelengths on eight fibers to achieve 64-way distribution (see Fig. 1). For improved performance, multiple burst-mode optical receivers feed a single egress queue (EQ) on each port. The switching is achieved with a fast 8:1 fiber-selection stage followed by a fast 8:1 wavelength-selection stage at each output port. Rather than implement tunable filters, this design features a demux-SOA-select-mux architecture. The receivers feature wide dynamic range, fast clock recovery, and packet-oriented error correction. A programmable centralized arbitration unit reconfigures the optical switch via a separate optical CC synchronously with the arrival of fixed-length optical packets. The arbiter enables high-efficiency packet-level switching without aggregation or prior bandwidth reservation and achieves a high maximum throughput. It features a novel pipelined implementation of bipartite graph matching that avoids the latency penalty associated with traditional pipelining methods.
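
As an illustration of the selection mechanics, consider a straightforward (assumed) assignment of the 64 source ports to (fiber, wavelength) pairs; receiving from source s then amounts to gating one fiber-select SOA and one wavelength-select SOA:

    # Assumed port-numbering sketch: 8 wavelengths on each of 8 fibers.
    def soa_gates_for_source(s):
        fiber = s // 8         # which of the 8 broadcast fibers carries s
        wavelength = s % 8     # which WDM channel on that fiber
        return fiber, wavelength

    # Example: source port 42 -> fiber 5, wavelength 2.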

[Figure: 64 ingress adapters (VOQs, control, Tx) feed eight broadcast units (8×1 combiner, optical amplifier, 1×128 star coupler, λ1–λ8); 128 select units (fast SOA 1×8 fiber-selector gates followed by fast SOA 1×8 wavelength-selector gates and a WDM mux) feed the 64 egress adapters (EQ, control, Rx); control links connect all adapters to a central scheduler running a bipartite graph matching algorithm.]

Fig. 1. B&S switch with combined space- and wavelength-division switching.

This paper is organized as follows: In Section 2 we describe the key requirements and outline the challenges they pose. Section 3 describes the data-path architecture of the switch, including the optical B&S network, and presents salient features of the optical technologies and components used in the OSMOSIS implementation. Section 4 introduces the control-path architecture, including descriptions of the arbitration algorithm and the control protocols, such as for request-grant arbitration, reliable delivery, flow control, and consistency. We present some initial test results in Section 5 and conclude in Section 6.

2. Requirements and Challenges

HPCS interconnects must be designed to handle the arrival, buffering, routing, forwarding, reception, and error management of high-data-rate packet streams in scalable local networks. HPCS switches are required to deliver an extremely low bit-error rate, very low latency, a high data rate, and extreme scalability. To achieve a balanced system design wherein no single resource constrains system performance, HPCS systems require internode bandwidths (between CPU clusters) comparable to intranode bandwidths. A rule of thumb for HPCS is to achieve a ratio of 0.1 Gbit/s of interconnect bandwidth per GFLOPS (10^9 floating point operations per second) of computer performance. Moreover, they must be efficient for bursty traffic, with demands varying from many small data transfers to fewer large dataset transfers. In comparison, conventional telecom-optimized packet switches operate on comparably smooth and predictable flows with much-relieved latency constraints.
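
As a worked example of this rule of thumb, using only numbers quoted in this paper: each OSMOSIS port carries 40 Gbit/s with 75% available for user payload, i.e., 30 Gbit/s of user bandwidth, which balances a node delivering roughly 300 GFLOPS; 64 such nodes correspond to an aggregate of about 19 TFLOPS per switch module.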

The requirements for our system include 64 input and output ports operating at 40 Gbit/s with 75% available for user payload, a 10^−21 bit-error rate using hardware forward error correction (FEC), less than 1-µs latency measured application–application in the scaled system, an optical data path, and a system design capable of scaling to higher line rates and port counts. Scalability to 2048 nodes is achieved by deploying 96 64×64 switches in a two-level (three-stage) fat-tree topology (see Fig. 2). The requirement for 75% user data necessitates a low-overhead packet structure and low guard times for the optical switching overhead. The overhead comprises the packet header, FEC, preamble, switching time, dead time, and other inefficiencies.

Our optical switch offers numerous opportunities for a new HPCS switching paradigm. First, the switch core operates on entire packets and does not perform bit-level operations on the data-channel content; it needs to switch only once in an entire packet duration, not at data-payload-bit speeds. This reduces both switching power and rate, and improves scaling of the switch core. Second, by provisioning a transparent data path, multiple transmissions can use the data path simultaneously, enabling an increase of the internode bit rate, either by operating a single channel at a higher rate or by provisioning multiple closely spaced WDM channels in one “band” without increasing the size of the switch core. Third, by separating the control and data paths, each can be upgraded independently.

[Figure: 64 level-1 OSMOSIS modules (64×64), each serving inputs/outputs 0–31 of its node group, interconnected with 32 level-2 OSMOSIS modules (64×64), covering inputs/outputs 0–2047.]

Fig. 2. Scaling to 2048 nodes using a 64-port OSMOSIS module via a fat-tree topology with full bisectional bandwidth.

Meeting such requirements means overcoming significant technical challenges in the optical data path, in the control path, and in the scheduling and control system. Optical-path challenges start with the need to mutually synchronize data-packet transmissions to produce efficient just-in-time switching among them. This is achieved by a combination of centralized clock distribution implemented over a synchronous CC and known optical path delays to the common resources of the switch core. The optical path lengths must be matched to a fraction of a packet time through all amplification, broadcast, and switch elements. This is achieved via low-skew optical hardware based largely on planar lightwave circuits (PLCs), and discrete components with fixed-length fiber pigtails. The switching of the SOAs coincides with the arrival of the data packets to be received, so that the synchronous switch control system operates at the packet rate to set switch positions. Path- and polarization-dependent loss (PDL), which can induce optical power excursions that vary significantly from packet to packet, is managed by limiting the PDL of the components, equalizing transmit power, and fitting receivers with a wide dynamic range. Although the switch is synchronously operated, that is, all nodes operate at the same nominal bit and packet rate, the end nodes receive contiguously arriving packets exhibiting random relative bit phases. Thus the receivers must acquire the bit phase almost instantly on each successive packet. Because the packets are independent of one another, error-correction techniques that operate efficiently on individual packets must be implemented. Since FEC is more efficient with long data blocks, this must be traded off against the need for short packets with low overhead in an HPCS interconnect. Finally, the actual switching time and guard times needed to account for all timing uncertainties must result in a switch system that still achieves a high utilization for user payload.

Among challenges in the control path is the need to conduct resource requests and manage reliable delivery at a rate compatible with the packet transmission times. The scheduler must not only be fast but also efficient. Ideally, it is able to process packet transmission requests for many packets without the need to buffer (and therefore delay) packets at the ingress, and to optimally resolve transmission requests for the fully loaded 64-port switch operated at its maximum packet rate.

To reduce the arbitration rate, the B&S-based designs presented in Refs. [2–5] adopt aggregation approaches in which multiple packets are first assembled into containers (bursts, superframes) and then switched as an aggregate. They employ frame sizes of 1 kB at 10 Gbit/s, resulting in a time slot duration of over 800 ns. While this aggregation approach minimizes the effect of long switching times and overhead, it is not suitable for HPCS because transmission must wait until containers are full to achieve high efficiency, introducing unacceptably high latency with respect to a 1-µs requirement. Transmitting partially filled containers reduces latency at the expense of user bandwidth and reduces throughput. In contrast, the time slot in our design is only 51.2 ns, 16 times shorter. To cope efficiently with such relatively short packets, we require fast switching, low switching overhead, and a deeply pipelined control architecture. This is especially important at the data rates of 40 Gbit/s and higher as employed in this study. In a totally different approach, such as in Ref. [6], contention is resolved via deflection, which obviates the need for complex scheduling but results in less predictable latency.

Finally, the control architecture must accommodate scaling by means of a fat-tree topology to 2048 ports without significant performance loss.

3. Data-Path Architecture and Technology

The interconnect architecture implements a WDM B&S design that combines the transmissions from the 64 nodes via eight wavelengths onto eight separate fibers, as shown in Fig. 1. The major elements of the switch are shown in Fig. 3. The data path comprises the electronic host-channel adapters (HCAs) and the optical B&S network. An HCA originates and terminates packet transmissions. The eight WDM, amplify, and split modules each feature an integrated 8×1 wavelength combiner followed by a high-power optical amplifier and integrated broadband splitters to implement the broadcast function. To serve redundant receivers, discussed below, our 64-port design implements a 1×128 split function. Each of the 128 optical switch modules (OSMs) selects one of the 64 possible packets in transmission in each time slot. Multiple OSMs can simultaneously select the same packet to achieve a multicast or broadcast functionality at no additional hardware cost.

[Figure: the host channel adapter sits on the computer host memory bus and contains the ONIC (Tx mux with DFB driver + EAM; Rx with limiting amplifier (LA) and CDR + TDDx; HW and flow control; clock) plus control logic; single-mode fibers carry the data channel through the WDM + amplify-and-split module and a perfect-shuffle interconnect to the fiber-select and color-select optical switch modules of the B&S network; the arbiter, fast switch command, and master clock unit connects via the control channel and the switch command channel.]

Fig. 3. Major elements of the optical interconnect switch include the HCA with embedded ONIC, the amplifying multiplexer and broadcast splitters, the optical switch modules, and the centralized arbiter.

HCA. The HCA originates and terminates data-packet transmissions across the switch core. Each HCA comprises an ingress and an egress section. The ingress section receives traffic from the attached computing node (or the previous stage in a multistage network) and temporarily stores every incoming packet in an electronic memory that is organized in a virtual output-queuing (VOQ) fashion until a grant to send has been received (see Section 4 for a description of the control path). The egress section receives traffic from the optical switching network, temporarily stores it if necessary, and forwards it when the downstream receiver (either a computing node or the next stage in a multistage network) is ready. The high-speed digital functions of the HCA, implemented largely in high-performance field-programmable gate array (FPGA) technology, perform error correction, packet framing, queuing, reliable delivery, and handling of the CC protocols. The HCA interfaces to the high-speed host bus and its memory subsystem, distributes the system clock to critical subsystems, and manages optical CC communications. The HCA further comprises an optical network interface card (ONIC) that contains the optical data and CC interfaces.

ONIC. The ONIC performs serialization, deserialization, and optical-to-electrical and electrical-to-optical conversions. A distributed-feedback (DFB) laser is coupled to a 40-Gbit/s electroabsorption modulator (EAM). We selected this combination because of its compact size, low drive requirements, bit-rate scalability, and performance. On each HCA, multiple receivers are provisioned, each of which implements a high-speed PIN photodiode receiver with wide-dynamic-range transimpedance and limiting-amplifier functions. A clock-recovery and decision ASIC, optimized for fast clock acquisition of optical packets exhibiting up to 5-dB packet-to-packet power variation, completes the ONIC functions.

Multiple Burst-Mode Receivers. A distinguishing feature of the OSMOSIS architecture is the presence of multiple receivers per output port. This allows more than one packet to be transferred to the same port simultaneously, which helps resolve contention faster and therefore improves latency. The preferred configuration provides two receivers per output because simulations have shown that the additional cost of having more than two receivers is not justified: more receivers would offer only minor additional performance improvements. The presence of two receivers can be exploited by changing the arbiter to match up to two inputs to one output, instead of just one, which requires modifications to the matching algorithm.

B&S. The multiplex–amplify–split WDM broadcast stage combines the eight WDM channels assigned to all eight independent broadcast networks. A high-power erbium-doped fiber amplifier (EDFA), capable of more than 20-dBm output power and less than 7-dB noise figure, and fitted with an automatic gain control (AGC) function, boosts the signal power to overcome subsequent broadcast-splitting losses. The broadcast 1×128 splitter is realized in two stages, 1×8 followed by 1×16, for equipment modularity using conventional planar lightwave circuit (PLC) technology. To control data-path skew, the entire optical data path is matched to within a small fraction of the packet length.

OSM. The OSM implements a two-stage selection function: first a fiber-select SOA is activated to select the fiber that contains the wavelength-multiplexed packet that is to be received. Regardless of the fiber selected, the WDM multiplex is combined in an 8×1 PLC to a common port. This stream is then demultiplexed and passed through a second SOA stage in which the WDM channel of interest is selected. Regardless of the color selected, the stream is then remultiplexed onto a common output port and delivered to the broadband receiver residing at the egress node ONIC. Systematic wavelength-dependent gain variation of the SOAs is small (0–0.8 dB) and is fully compensated in the link design. Synchronously with packet transmission, the centralized arbiter resolves transmission requests, which are delivered to it via an optical CC, and sets the state of the SOA selector gates just as the packets arrive at each OSM. At the egress node HCA, receiver electronics extract the user data, correct errors or request retransmission of uncorrectable packets, and either forward the packets to the next level of the fat tree (if it is an intermediate node in the network) or reassemble packets and deliver them to the terminal host node (if it is a terminal node in the network). Packet-to-packet power variations of <3 dB arise from residual polarization- and path-dependent loss in the various links.

Port Scalability. The OSMOSIS architecture enables scaling of the size of the basic 64-port module by increasing the number of wavelengths per fiber, the number of fibers, or both. The number of required switch elements and the size of the control words per port increase linearly in proportion.

Multistage Scalability. One of the main objectives of the OSMOSIS project is providing a solution that scales to thousands of nodes. This cannot be achieved economically in a single-stage configuration owing to the quadratic complexity involved in the control path (arbitration) and the physical bulk of the data path (number of SOAs, splitters, couplers, and so on). A more economical way to scale to much larger networks is with a multistage network such as a fat tree. With the 64-port basic OSMOSIS switch, a 2048-node network with full bisectional bandwidth can be constructed with a two-level fat-tree network, as shown in Fig. 2; a sizing sketch follows below. Even larger networks can be achieved in a two-level network by configuring it with less than full bisectional bandwidth or by deploying more levels. Although having an end-to-end all-optical data path throughout the multistage network is very attractive, this is not practical in a packet-switching context, because the arbiter must then arbitrate and configure all switches in the network simultaneously. Therefore, there will be electronic buffers between stages to decouple the arbitration. Obviously, the main disadvantage of this approach is significant additional cost and latency, owing to the required O/E/O conversions and buffering.
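
A minimal sizing sketch for the two-level fat tree of Fig. 2 (the module counts are from the figure; the even split of each level-1 module's ports between nodes and uplinks is our assumption):

    # Two-level fat tree built from 64x64 OSMOSIS modules (Python sketch).
    PORTS = 64
    level1_modules = 64
    nodes_per_module = PORTS // 2                      # 32 ports down to nodes
    uplinks_per_module = PORTS // 2                    # 32 ports up to level 2
    nodes = level1_modules * nodes_per_module          # 64 * 32 = 2048 nodes
    level2_modules = level1_modules * uplinks_per_module // PORTS  # 2048/64 = 32
    total_modules = level1_modules + level2_modules    # 96 switches, as in Section 2
    print(nodes, total_modules)                        # -> 2048 96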

Multistage scaling involves many additional challenges, including link-level flow control, routing, and congestion control and management. Many performance characteristics, such as latency and throughput, are harder to control in a multistage network. These issues are outside the scope of this paper.

4. Control-Path Architecture and Protocols

The OSMOSIS control structure involves the HCAs and the arbiter. Every HCA has a dedicated, bidirectional (full-duplex) control channel (CC) to the arbiter to exchange requests, grants, credits, and acknowledgments. Additionally, there is a switch command channel (SCC, i.e., the configuration channel) between the arbiter and the crossbar to reconfigure the SOAs according to the optimal crossbar setting that is computed by the arbiter. In this section, we describe the control functions of the HCA and the arbiter, the CC protocols, and our approach to reliable delivery.

HCA. Every HCA comprises an ingress (TX) and an egress (RX) interface. The ingress interface is connected to an input port of the switch fabric and contains electronic memory to store packets as they wait to be transmitted across the switch core. The logical queuing structure is arranged in a virtual output queuing (VOQ) fashion, so called because there is one separate queue per output. This arrangement is known to eliminate head-of-line blocking, which would otherwise limit throughput to 58.6% of full utilization.
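
A minimal sketch of this queuing arrangement (names and structure are illustrative, not from the paper):

    # One FIFO per output port: a packet blocked on a busy output never
    # delays packets destined for other outputs (no head-of-line blocking).
    from collections import deque

    class IngressVOQ:
        def __init__(self, num_outputs=64):
            self.queues = [deque() for _ in range(num_outputs)]

        def enqueue(self, packet, output):
            # Arrival also triggers an incremental request to the arbiter.
            self.queues[output].append(packet)

        def dequeue_on_grant(self, output):
            # Called when the arbiter grants this VOQ a time slot.
            return self.queues[output].popleft()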

The HCAs issue requests to the arbiter, which performs bookkeeping for the requests of all HCAs, computes a match between input and output ports in every time slot based on the currently pending requests, and issues corresponding grants to the HCAs. Upon receipt of a grant, the HCA dequeues a packet from the granted VOQ and transmits it to the switch core, which routes it to the correct destination according to the configuration applied by the arbiter via the SCC.

The egress interface is connected to an output port of the switch fabric. The OSMOSIS architecture implements two OSMs per egress interface. This means that multiple packets can be received by the same HCA in one time slot. The egress interface comprises an electronic output buffer in which it stores arriving packets before delivering them to the local node. This buffer temporarily stores packets that cannot immediately be accepted by the local computing node or the next stage in a fat tree.

Arbiter. HPCS requirements impose stringent demands on the control of the system. To ensure that messages are delivered with minimum latency, we must be able to reconfigure the optical routing fabric in every time slot, contrary to many existing switches with optical data paths. Similar optical B&S switches, presented in Refs. [3] and [5], employ burst-switching aggregation techniques to reduce the arbitration rate. However, our stringent latency requirement implies that we cannot apply such techniques, because assembling individual messages into large containers introduces too much latency.

Reconfiguring the switch entails computing a one-to-one matching between the inputs and outputs in every time slot. To achieve good performance in terms of the delay-throughput characteristics, the matching should be as complete as possible, which requires a centralized bipartite graph-matching algorithm. As the optimum solution to this problem is known to be infeasible in fast hardware, many heuristic approaches have been studied in recent years, giving rise to a class of practical solutions based on iterative, round-robin-based algorithms, such as i-SLIP [7] and DRRM [8]. However, these algorithms must perform a number of iterations (depending on the number of ports N) to converge and ensure good performance characteristics. Therefore, arbitrating for a large number of ports N in a short time remains a major challenge. It has been shown that the aforementioned iterative algorithms on average converge in log₂(N) iterations. In the OSMOSIS demonstrator, this implies that the arbiter should complete about six iterations on a 64×64 request matrix in one time slot. The targeted packet size is 256 B at a line rate of 40 Gbit/s, resulting in a time slot of 51.2 ns. Considering the additional requirement of an FPGA implementation, it is very challenging to achieve up to six iterations within one time slot.
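
To make these figures concrete: a 256-B packet is 256 × 8 = 2048 bits, which at 40 Gbit/s occupies 2048 / (40 × 10^9) s = 51.2 ns; with log₂(64) = 6 iterations per slot, this leaves under 9 ns per iteration.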

We have addressed this challenge with a novel pipelined implementation that we refer to as FLPPR: fast low-latency parallel pipelined arbitration [9]. It is related to existing techniques for parallel matching, such as parallel maximal matching (PMM) [10, 11], in that multiple matchings to be issued at subsequent time slots are computed in parallel. However, our approach avoids the latency penalty associated with existing approaches, in which, by design, every request enters at the beginning of the pipeline so that the corresponding grant must traverse all pipeline stages before being issued. The FLPPR scheme, on the other hand, allows requests to enter at any stage in the pipeline, so that the minimum pipeline latency is equivalent to just a single pipeline stage. Moreover, this minimum latency is independent of the number of pipeline stages. Every stage of the pipeline performs one iteration of the DRRM [8] algorithm, which we prefer because it is highly amenable to a distributed implementation.
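
For reference, a single DRRM iteration can be sketched as follows (a simplified rendering of the algorithm of Ref. [8]; the data structures and pointer-update details here are our assumptions):

    # One DRRM iteration: each unmatched input requests one output, chosen
    # round-robin among its nonempty VOQs; each output grants one request,
    # chosen round-robin. Pointers advance only past a successful grant,
    # which desynchronizes the inputs under heavy load.
    def drrm_iteration(N, voq_len, req_ptr, grant_ptr, match_in, match_out):
        requests = {}  # output -> inputs requesting it this iteration
        for i in range(N):
            if i in match_in:
                continue                      # input already matched
            for k in range(N):
                o = (req_ptr[i] + k) % N      # scan from the request pointer
                if voq_len[i][o] > 0 and o not in match_out:
                    requests.setdefault(o, []).append(i)
                    break
        for o, reqs in requests.items():
            for k in range(N):
                i = (grant_ptr[o] + k) % N    # scan from the grant pointer
                if i in reqs:
                    match_in[i] = o
                    match_out[o] = i
                    req_ptr[i] = (o + 1) % N  # advance past the granted output
                    grant_ptr[o] = (i + 1) % N
                    break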

In addition to lower latency, FLPPR simultaneously achieves high maximum throughput and can provide up to 20% higher throughput with nonuniform traffic. Performance results with uniform as well as nonuniform traffic, and for several FLPPR variants ranging from basic to optimized, have been reported in Ref. [9]. Figure 4 compares the mean latency (VOQ waiting time expressed in time slots) as a function of utilization of PMM [11] and FLPPR [9] for a 64×64 switch with K = 6 pipeline stages and uniform uncorrelated arrivals. We observe that even the basic FLPPR implementation exhibits significantly lower delay than PMM throughout the load range and that the optimized FLPPR implementation achieves a further reduction in latency.

[Figure: two log-scale plots of latency (time slots, 1–1000) versus utilization (0–1); the left panel shows curves for K = 1 through K = 6, the right panel shows PMM, FLPPR basic, and FLPPR optimized.]

Fig. 4. Mean latency versus utilization with N = 64 and uniform uncorrelated arrivals. Left, basic FLPPR, different values of K. Right, comparison of PMM with basic and optimized FLPPR with K = 6.

Control-Channel Protocol. The physical implementation and packaging aspects of OSMOSIS, or indeed of large packet switches in general [12, 13], lead to increased physical distribution and thus a large physical distance between the HCAs and the switch core, to increased latency owing to pipelining, and to a shrinking packet cycle. All of this adds up to a significant round trip (RT) between the HCA and the arbiter; i.e., the bandwidth-delay product expressed as a multiple of the minimum packet duration is significantly larger than one. This requires that data packets and control messages be pipelined on the data and control paths, respectively. Moreover, this RT has a significant effect on two aspects of the switch design. First, it implies that the arbiter has delayed knowledge of the current VOQ state. Therefore, the control protocol between HCAs and arbiter must be carefully designed to avoid performance degradation with long RTs [14]. Second, this RT is a direct contributor to the arbitration latency, which is defined as the time that elapses between the issuance of a request and the reception of the corresponding grant. The minimum arbitration latency comprises the RT on the CC and the time needed to compute a matching.

Our CC protocol incorporates incremental VOQ state updates to cope with the physical distribution of the system, incremental credit flow control to prevent overflow of the egress buffers, per-packet acknowledgments to achieve the required error rate, and a census protocol to ensure consistency between the HCAs and the arbiter. The CC protocol comprises a number of subprotocols that share the same physical channel and are encapsulated in the control messages. These subprotocols are as follows:

Arbitration. The long RT affects the design of the request-grant protocol between the VOQs on the HCAs and the arbiter [14]. To cope with a long RT without performance loss, the arbiter maintains pending-request counters to reflect the state of all VOQs. The HCA issues a request for arbitration for every newly arriving packet. The arbiter increments the corresponding counter upon receipt of a request, computes a matching based on the current values of all counters, and issues grants according to the current matching. When the arbiter has matched a specific VOQ, it immediately decrements the corresponding counter to reflect that a request has been granted. In this protocol, requests and grants are incremental, rather than absolute as traditionally used.
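
A minimal sketch of this bookkeeping on the arbiter side (structure assumed for illustration):

    # Pending-request counters mirror the VOQ state with RT-delayed updates.
    class ArbiterBookkeeping:
        def __init__(self, N=64):
            self.pending = [[0] * N for _ in range(N)]  # pending[input][output]

        def on_request(self, i, o):
            self.pending[i][o] += 1        # incremental request from HCA i

        def on_match(self, i, o):
            # Decrement immediately when a grant is issued, so in-flight
            # grants are never double-counted.
            assert self.pending[i][o] > 0
            self.pending[i][o] -= 1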

Consistency. An incremental request-grant protocol as described above is not inherently robust to errors. Therefore, we employ a census protocol [15] to ensure that the actual VOQ state at the line cards and the state maintained by the arbiter are consistent within the bounds given by the RT. From time to time, the HCA triggers a census to check whether the arbiter’s image of the state of its VOQs is consistent with the actual state, taking into account requests and grants that are in flight. To this end, the HCA injects a control message containing a census count. According to specific rules, the arbiter returns a reply comprising an updated census count, which the HCA, upon receipt, checks for consistency. The check result not only indicates whether the state is consistent, but also indicates the magnitude of the inconsistency if it fails. In that case, the HCA initiates corrective action.

Local Flow Control. Because the HCA comprises egress as well as ingress buffers and the egress buffer’s arrival rate can exceed its departure rate, local flow control is necessary to ensure that no packets are lost owing to an egress buffer overflow. To this end, we employ an incremental credit flow control scheme between the egress buffers and the arbiter. The egress buffers release credits to the arbiter via the CC, and the arbiter performs the credit bookkeeping. It will issue grants only for outputs that have credits available. Upon issuing a grant, the arbiter decrements the credit counter of the corresponding output. This incremental credit flow control scheme is also protected against errors by a census mechanism.
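
Sketched in the same style (illustrative names; the buffer size is a placeholder, not from the paper):

    # Per-output credit counters at the arbiter; one credit = one egress slot.
    EGRESS_SLOTS = 16                       # placeholder capacity
    credits = [EGRESS_SLOTS] * 64

    def can_grant(output):
        return credits[output] > 0          # never overrun the egress buffer

    def on_grant(output):
        credits[output] -= 1                # a packet is now in flight to it

    def on_credit_release(output, n=1):
        credits[output] += n                # egress buffer freed n slots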

Reliable Delivery. To ensure correctness, we employ a hop-by-hop reliable delivery scheme. For every correctly received packet, the egress HCA returns an acknowledgment via the CC. The arbiter relays the acknowledgment to the corresponding ingress HCA, which then safely dequeues the acknowledged packet. This protocol is described in more detail below.

Clock Distribution. A very important aspect of the OSMOSIS design is clock distribution and synchronization. We take advantage of the presence of the centralized arbiter to distribute the clock and synchronize all HCAs using the physical layer of the CCs.

The physical implementation of a CC consists of two (one upstream, one downstream) 8-bit/10-bit-coded optical InfiniBand links running at 2.5 Gbit/s. The physical layer of the SCC is identical to that of the CC, which implies that the length of a control message is 128 bits. This is sufficient to ensure that every control message can carry one complete submessage for each of the subprotocols. Each control message also comprises a cyclic redundancy check (CRC) to detect errors. With respect to CC reliability, note that census mechanisms protect both the arbitration and the flow-control subprotocols, and that the reliable delivery and census subprotocols themselves are inherently robust against errors.
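
A purely illustrative view of such a message (the grouping follows the subprotocols above; all field names and widths are assumptions, as the paper does not specify the layout):

    # One 128-bit control message carrying one submessage per subprotocol.
    from dataclasses import dataclass

    @dataclass
    class ControlMessage:
        request: int   # incremental VOQ request (destination output)
        grant: int     # grant issued by the arbiter (downstream direction)
        credit: int    # egress-buffer credit release
        ack: int       # per-packet acknowledgment for reliable delivery
        census: int    # census count for the consistency subprotocol
        crc: int       # cyclic redundancy check over the whole message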

Optical Switch Interface. A maximum of 128 SOAs need to be activated per module in each arbitration cycle, that is, two SOAs per receiver. Concatenated control words for setting those states are aggregated in groups of 16 ports and issued serially over the switch control interface (SCI). At the switch control modules (SCMs) they are parsed and forwarded in parallel to each OSM with the correct timing. Thus the SOA control-word load on the arbiter is minimized and requires only a few pins on its primary FPGA.

Reliable Delivery. To meet the objective of an error rate better than 10^−21, we have adopted a two-tier approach to reliable delivery. As the raw bit-error rate of the optical path is designed to be near 10^−10, we will employ a forward-error-correcting (FEC) code on the header and data part of the packet to reduce the bit-error rate to an estimated 10^−17. To reduce the error rate observed by the application running on the nodes even further, we employ a window-based hop-by-hop retransmission scheme that performs a given number of retries for every corrupted packet. Together, these schemes can meet the extremely low error-rate requirement.

FEC. The link-level transmission coding scheme must correct all single-bit errors and detect all double- and most triple- and higher-order transmission bit errors. The coding scheme, furthermore, has to generate sufficient transitions for the receiving phase-locked loop (PLL) to enable accurate tracking of phase changes by limiting the run lengths. As maintaining low latency is one of the most important requirements for OSMOSIS, the FEC coding scheme is a compromise between low latency, requiring a short code block, and low coding overhead, requiring a long code block. To meet the system-level requirement that 75% of transmitted data must be user information, we designed the total coding overhead to be less than 10%. Because we have not found an appropriate standard code, we have developed a custom code. It is in the class of generalized nonbinary cyclic Hamming codes, with a block length of 256 bits and a coding overhead of 16 bits. The primitive polynomial is chosen to be p(x) = x^8 + x^4 + x^3 + x^2 + 1. This code yields an overhead of approximately 6%; the remaining 4% overhead is reserved for limiting the run length.
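
To illustrate the arithmetic underlying such a nonbinary code, here is the multiplication rule of GF(2^8) as defined by this primitive polynomial (the helper below is our illustration, not taken from the OSMOSIS implementation); p(x) corresponds to the bit pattern 1 0001 1101, i.e., 0x11D:

    # Multiply two elements of GF(2^8) modulo p(x) = x^8 + x^4 + x^3 + x^2 + 1.
    def gf256_mul(a, b):
        P = 0x11D                  # p(x) as a bit mask, including the x^8 term
        result = 0
        while b:
            if b & 1:
                result ^= a        # add (XOR over GF(2)) this multiple of a
            b >>= 1
            a <<= 1
            if a & 0x100:          # degree reached 8: reduce modulo p(x)
                a ^= P
        return result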

Figure 5 shows the transfer efficiency (1 − coding overhead) versus the block code length (user bits plus coding bits), with the number of correctable bits t as a parameter (t = 1 corresponds to a 1-bit correction capability). The yellow area marks the eligible region with less than 10% overhead. We also observe that, given the low coding-overhead requirement, we can choose a code that corrects at most two bit errors, and that the minimum block length is 64 bits.

Fig. 5. Code transfer efficiency versus code length.

Hop-by-Hop Retransmission. Because the error rate after FEC still does not meet the requirement, we implement a hop-by-hop retransmission scheme between the input and the output buffers. Retransmission schemes in general require that the sender store sent packets in a retransmission (RTX) buffer and that the receiver notify the sender about successful (positive acknowledgment, ACK) and/or unsuccessful (negative acknowledgment, NAK) transmissions. These acknowledgments can be transmitted in-band on the data channel, using either dedicated packets or piggybacking on data traffic in the reverse direction, or out-of-band. Because in OSMOSIS we have a dedicated CC from every HCA to the arbiter, we accommodate ACKs in the control messages. A significant advantage of such a fixed-bandwidth out-of-band channel is that the ACKs can be returned immediately, thus reducing the RT and therefore also the RTX buffer requirements. Additionally, there is no need to perform ACK aggregation, so every packet is acknowledged individually.

We exclusively use ACKs. The only advantage of NAKs is that they can expedite the retransmission of corrupted packets, but with immediate per-packet ACKs and no aggregation, it is straightforward to instantly detect and retransmit the missing packets. The arbiter must route the incoming ACKs to the appropriate HCAs. As there is a one-to-one matching between inputs and outputs in every time slot, it follows that the ACKs generated upon reception of these packets also form a one-to-one matching with the inputs and outputs reversed, so there is no contention in routing the ACKs in the arbiter.

We employ a go-back-N (GBN) retransmission policy that retransmits all unacknowledged packets sent since the last positive ACK. Although GBN may lead to unnecessary duplications compared with a selective retry (SR) policy that retransmits only the corrupted packets rather than the entire window, it has two key advantages over SR, namely, that there is no need for a resequencing buffer at the receiver and that the RTX buffer is a simple FIFO. As the error rate after FEC is already very low, we expect that the extra overhead caused by unnecessary retransmissions will be negligible.

As the RT is fixed and known, retransmissions can be triggered autonomously by the sender by attaching a timestamp to every packet. If the head-of-line packet of the RTX buffer is older than RT, the packet is considered lost (either the packet or the ACK has been corrupted or lost), and the contents of the RTX buffer are retransmitted. To avoid endless retransmissions when a link is broken, we impose a fixed upper limit on the number of retransmissions (e.g., three) per packet by adding a time-to-live (TTL) field to every entry in the retransmission buffer that is decremented on every retransmission. When it reaches zero, the sender gives up and drops the packet.
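
The sender-side logic can be sketched as follows (a minimal illustration of the scheme just described; the RT value and the resend callback are placeholders):

    # Go-back-N RTX buffer with per-entry timestamp and TTL.
    from collections import deque

    RT = 40                      # round trip in time slots (placeholder value)
    rtx = deque()                # FIFO of [packet, sent_time, ttl]

    def on_send(packet, now, ttl=3):
        rtx.append([packet, now, ttl])

    def on_ack():
        rtx.popleft()            # immediate per-packet ACKs arrive in order

    def on_tick(now, resend):
        if rtx and now - rtx[0][1] > RT:       # head-of-line entry timed out
            for entry in list(rtx):            # go-back-N: resend whole window
                if entry[2] == 0:
                    rtx.remove(entry)          # TTL exhausted: drop the packet
                else:
                    entry[2] -= 1              # decrement TTL on retransmission
                    entry[1] = now             # restamp
                    resend(entry[0])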

Modeling and simulation indicate that over a wide range of offered load, the maximum VOQ latency is comparable to the time of flight (RT) and thus adds only a small latency penalty. With the low expected BER of the error-corrected links, the probability of retransmissions is also extremely low, with negligible effect on the overall time to solution. Thus the protocols just described are only rarely activated.

5. Initial Testing Results

Initial testing of the assembled optical layer confirms the expected performance of the data transmission path. The transmitted 20 log Q factor is 18.5 dB, with an extinction ratio of greater than 9.5 dB under a 2^31 − 1 pseudorandom binary sequence (PRBS). The fiber-select SOAs deliver more than 15 dB of gain with less than 6.5-dB noise figure and typically less than 0.6-dB polarization-dependent gain. They exhibit output saturation power above +17.5 dBm and are operated at 1 dB or less into gain compression. They achieve less than 0.4-dB Q penalty when operated at their optimum input power, yet still deliver a high Q margin over a wide input dynamic range (see Fig. 6).

The received optical signal-to-noise ratio exceeds 30 dB, as measured in a 0.1-nm resolution bandwidth. This is achieved by maintaining high signal power through the EDFA, thereby ensuring that the injected signal level into the critical fiber-select SOA is near its optimum value of +3 dBm, as shown in Fig. 6. The OSNR reduction of the fiber-select SOAs under these conditions is small, producing a Q penalty of approximately 0.4 dB. Further erosion of the OSNR at the second SOA is minimized by controlling the loss in nearby segments. Low noise figure and high saturation power in the SOAs are critical for this performance. The receiver has a sensitivity of less than −10 dBm measured back-to-back (see Fig. 7, left) and recovers the bit clock phase from the received packet stream in less than 3 ns. Although the design is not expected to push this limit, the receiver achieves error-free decisions when receiving packets with up to 5-dB packet-to-packet power variation (3 dB shown in Fig. 7, right). Other than accommodating packet-to-packet phase and amplitude variations, no particular design requirements are imposed on the inherently broadband WDM receiver.

Fig. 6. Signal-quality performance as a function of SOA total input power for eight 40-Gbit/s channels under NRZ modulation. Below gain compression, signals are OSNR-limited; above gain compression they become limited by cross-gain modulation in the SOA. Back-to-back Q is 18.5 dB.

Fig. 7. Left, burst-mode optical receiver performance at 40 Gbit/s. Right, optical packet train shown with 3-dB power variation received without error.

6. Conclusions

The time-to-solution metric of high-performance computing systems increasingly depends on the latency and throughput of their interconnection networks. As such systems approach and exceed the 1-PetaFLOP/s benchmark, they will require throughput and latency that are potentially unaffordable or unachievable with further scaling of today’s electrically switched paradigm. Power dissipation, chip pinouts, rf interference, and sheer part-count management represent looming bottlenecks for the current technology. Our effort does not focus solely on the data path and its performance but instead addresses all of the key system-level challenges and requirements of future HPCS by introducing a scalable optical packet switch provisioned to be fully functional at the system level. The challenges of efficiency, throughput, latency, scalability, and reliable delivery are met with a novel, pipelined control architecture overlaid on a scalable optical crossbar that leverages space- and wavelength-division multiplexing in a fast-switched technology. Initial testing at 40 Gbit/s per port for a modular 64-port element of the fat tree demonstrates that the optical switching tax can be acceptable and manageable, while still offering a technology with room to scale in terms of port count and bit rate.

Significant engineering research remains, however, to achieve a commercially viable optical-packet-based system solution. While the challenges of burst-mode data recovery, ultralow bit-error rate, switch dynamics, and control-system integration have been met, commercial systems will require substantial further integration of the optoelectronic (OE) subsystems. A factor-of-100 reduction in part count over discrete implementations, to reduce cost, power consumption, and size, can be achieved by producing functional OE devices at high levels of monolithic and hybrid integration. By selecting the optimum level of integration, by simplifying and automating the processes of OE subassembly and testing, and by defining functional optical modules that maximize technology reuse, we expect that our optical packet-switching technology will achieve cost parity with unscaled electronic solutions while delivering substantial performance benefits and opportunities for future growth.

In future studies, we shall report on the detailed performance of the 64-port interconnect, i.e., both its optical transmission and dynamic performance as well as the efficacy of the CC, control system, queuing and scheduling algorithms, and reliable delivery protocols. We shall provide updates on the progress toward higher levels of OE integration, and report on progress in evaluating and introducing more capable modulation formats and improved multiplexing and transport capabilities, and in demonstrating the various dimensions of scaling.

We have chosen to implement fixed-length packets as a baseline. However, the architecture supports not only fixed- and variable-length packets but also semipermanent circuits. The only limitation is the complexity of the control system and of the mechanisms for reliable delivery. Although fixed-length packets require segmentation and reassembly, they are generally more efficient to schedule (especially for multicast traffic). They also make more efficient use of the crossbar and are simpler to acknowledge and to deliver reliably. We also hope to explore these aspects in future studies.

Acknowledgments

This research is supported in part by the University of California under subcontract number B527064. We thank our sponsors at the University of California, and acknowledge technical contributions by Michael Sauer, John Downie, Deepukumar Nair, Martin Hu, Ron Karfelt, Dave Peters, Jason Dickens, François Abel, Urs Bapst, Wolfgang Denzel, Mitch Gusat, Ilias Iliadis, Raj Krishnamurthy, Peter Müller, Venkatesh Ramaswamy, Enrico Schiattarella, Alan Benner, Henry Brandt, Craig Stunkel, Marco Galli, Thomas Kneubühler, Elmar Ottiger, and their extended teams.

References and Links

[1] The Future of Supercomputing: An Interim Report (NRC, National Academies Press, 2003).

[2] F. Masetti, A. Jourdan, M. Vandenhoute, and Y. Xiong, “Optical packet routers for next generation Internet: A network perspective,” in Proceedings of the European Conference on Optical Communication (ECOC), Workshop on the Optical Layer for Datanetworking (VDE, Munich, Germany, 2000).

[3] S. Araki, Y. Suemura, N. Henmi, Y. Maeno, A. Tajima, and S. Takahashi, “Highly scalable optoelectronic packet-switching fabric based on wavelength-division and space-division optical switch architecture for use in the photonic core node,” J. Opt. Netw. 2, 213–228 (2003), http://www.osa-jon.org/abstract.cfm?URI=JON-2-7-213.

[4] F. Masetti, D. Chiaroni, R. Dragnea, R. Robotham, and D. Zriny, “High-speed high-capacity packet-switching fabric: a key system for required flexibility and capacity,” J. Opt. Netw. 2, 255–265 (2003), http://www.osa-jon.org/abstract.cfm?URI=JON-2-7-255.

[5] F. Masetti, D. Zriny, D. Verchère, J. Blanton, T. Kim, J. Talley, D. Chiaroni, A. Jourdan, J.-C. Jacquinot, C. Coeurjolly, P. Poignant, M. Renaud, G. Eilenberger, S. Bunse, W. Latenschlaeger, J. Wolde, and U. Bilgak, “Design and implementation of a multi-terabit optical burst/packet router prototype,” in Optical Fiber Communications Conference (OFC), Vol. 70 of OSA Trends in Optics and Photonics Series (Optical Society of America, Washington, D.C., 2002), postdeadline paper.

[6] Q. Yang, K. Bergman, G. D. Hughes, and F. G. Johnson, “WDM packet routing for high-capacity data networks,” J. Lightwave Technol. 19, 1420–1426 (2001).

[7] N. McKeown, “The iSLIP scheduling algorithm for input-queued switches,” IEEE/ACM Trans. Netw. 7, 188–201 (1999).

[8] H. Chao and J. Park, “Centralized contention resolution schemes for a large-capacity optical ATM switch,” in Proceedings of IEEE ATM Workshop (IEEE, New York, 1998), pp. 11–16.

[9] C. Minkenberg, I. Iliadis, and F. Abel, “Low-latency pipelined crossbar arbitration,” to be presented at IEEE GLOBECOM ’04, Dallas, Tex., 29 Nov.–3 Dec. 2004.

[10] E. Oki, R. Rojas-Cessa, and H. Chao, “A pipeline-based approach for maximal-sized matching scheduling in input-buffered switches,” IEEE Commun. Lett. 5, 263–265 (2001).

[11] E. Oki, R. Rojas-Cessa, and H. Chao, “PMM: A pipelined maximal-sized matching scheduling approach for input-buffered switches,” in Proceedings of IEEE GLOBECOM 2001 (IEEE, New York, 2001), pp. 35–39.

[12] C. Minkenberg, R. Luijten, F. Abel, W. Denzel, and M. Gusat, “Current issues in packet switch design,” ACM Comput. Commun. Rev. 33, 119–124 (2003).

[13] F. Abel, C. Minkenberg, R. Luijten, M. Gusat, and I. Iliadis, “A four-terabit packet switch supporting long round-trip times,” IEEE Micro 23, 10–24 (2003).

[14] C. Minkenberg, “Performance of i-SLIP scheduling with large round-trip latency,” in Proceedings of IEEE Workshop on High-Performance Switching and Routing (HPSR) (IEEE, New York, 2003), pp. 49–54.

[15] C. Minkenberg, F. Abel, and M. Gusat, IBM Zurich Research Laboratory, are preparing a research report to be called “Reliable control protocol for crossbar arbitration.”
