
THE IBM TURBOWAYS 155 PCI ATM ADAPTER: CLASSICAL IP AND LAN EMULATION PERFORMANCE FOR AIX

Document Number TR 29.2239

April 1997

Andrew Rindos, David Cosby

Steve Woolet, Herman Dierks (*)

Metin Aydemir, Gary Delp (**)

Networking Division, Research Triangle Park, NC

(*) RS/6000 Division, Austin, TX

(**) AS/400 Division, Rochester, MN


Abstract

In this paper, we will first present throughput and processor utilization measurements of the Turboways 155 PCI ATM adapter for the AIX 4.2.1 operating system, running between the very latest RS/6000 workstations (models F50), for both Classical IP (CIP) and LAN Emulation (LANE; both Token Ring and Ethernet). As the data will show, for the AIX operating system running CIP, the adapter is capable of delivering data at speeds of 152.5-154.5 Mbps (at the media or adapter level, when ATM and SONET overheads are added; at the Netperf application level, this represents a payload throughput of 133-134.7 Mbps). These measured TCP/IP throughputs represent an amazing 98-99% of media speed. More importantly, because of the extremely efficient device driver/operating system (yielding very low processor utilizations), when multiple adapters are placed in a single machine, the aggregate throughput increases linearly with each new adapter installed (up to four). Four adapters yielded an aggregate throughput of nearly four times the capacity of an OC3 link, i.e., 594 Mbps (with overhead added, or 518 Mbps payload throughput). This represents 96% of the aggregate media capacity. The system still had substantial capacity available with four adapters (processor utilization was 82%). A fifth adapter could therefore have been added to produce a still larger aggregate throughput from a single machine. This highly efficient device driver therefore allows a simultaneously running application to capture a larger portion of the processor capacity. These exceptional performance numbers highlight the advantages of a packet-buffered (or "deep") adapter (i.e., one in which a frame is reconstructed from its constituent ATM cells within the adapter itself, before being transmitted up to the host) over a cell-buffered (or "shallow") adapter (i.e., one in which the adapter passes the cells through up to the host, which then reconstructs the frame).

When transmitting full duplex (for CIP), the adapter was able to sustain 292 Mbps (at the media or adapter level, or 255 Mbps at the application level). This is very close to twice OC3 media speed.

The peak throughput observed for Token Ring LANE over ATM (4K-8K byte MTU) for AIX was also an impressive 152 Mbps (at the adapter or media level, representing an application-level or payload throughput of 132.6 Mbps). Furthermore, in spite of the decrease in MTU size to 1516 bytes with Ethernet LANE over ATM, the peak throughput was 144.5 Mbps (with overhead added, or 126 Mbps payload). These numbers refute recent articles in the trade press suggesting that 155 Mbps ATM technology cannot deliver acceptable Ethernet LANE over ATM performance. Furthermore, we will demonstrate how recently reported Ethernet LANE throughput measurements for a wide variety of 155 Mbps ATM PCI adapters are most likely the result of a lack of proper tuning. With proper tuning, we were able to achieve peak throughputs of up to 120 Mbps (adapter level, or 105 Mbps payload) from some of the non-IBM adapters tested, far in excess of the 60-70 Mbps reported.

Finally, as numerous experiments have shown, the PCI bridging implementation for a given workstation can have a dramatic impact on the final performance observed for an ATM adapter. Likewise, the bus on which the adapters are installed (i.e., the primary or secondary bus), as well as what other adapters are active within the workstation, will affect measured throughputs. Therefore, the customer should be aware of some of the effects described herein, and should take advantage of some of the known methods for optimizing PCI-bus systems.


In summary, the Turboways 155 PCI ATM adapter under AIX yields:

Classical IP:

up to 154.5 Mbps, half duplex

292 Mbps, full duplex

over 594 Mbps from 4 adapters in a single machine (at only 82% processor utilization)

Token Ring LAN Emulation: 152 Mbps

Ethernet LAN Emulation: 144.5 Mbps

(Note: All throughputs listed above are as measured at the adapter or media level, i.e., throughputs include data plus ATM and SONET framing overhead, as explained below.)

KEYWORDS: ATM, Networking, Performance, Classical IP, LAN Emulation


Contents

INTRODUCTION

AIX OPERATING SYSTEM PERFORMANCE: CLASSICAL IP

AIX OPERATING SYSTEM PERFORMANCE: LAN EMULATION

DEBUNKING RECENT CONCERNS REGARDING LAN EMULATION PERFORMANCE
  THE MTU AND WINDOW SIZE EFFECT
  PROPER TUNING TO AVOID THESE EFFECTS

PCI BUS PERFORMANCE CONSIDERATIONS

CONCLUSIONS

REFERENCES

NOTES

Appendix A. AIX IP PERFORMANCE TUNING

Appendix B. NETWARE IPX PERFORMANCE TUNING

INTRODUCTION

For most new personal computer purchasers today, a Pentium(1) processor with a speed in excess of 100 MHz has become something of a minimum requirement. Similar expectations from consumers have led to increases in the delivery capacity of RS/6000(2) and other high-end workstations. With the general availability of 200 MHz Pentium Pro(1) processors, the ability of most personal computers/workstations to deliver enormous amounts of data to a network in a very short period of time has therefore become a reality. These increased performance capabilities and expectations have both arisen in response to the development of new applications that require such power/performance, and have spurred the development of new bandwidth-intensive applications. Some of this application development (especially in the area of video) has been the result of the availability of ATM (Asynchronous Transfer Mode) and high-speed LAN (Local Area Network) technology.

With increased processing capability has come the need for high-performance buses. Therefore, most new workstations today come with a PCI (Peripheral Component Interconnect) bus, with a capacity of 132 MBps (or 1.056 Gbps, for a 32-bit-wide implementation). In response to this market need, IBM(2) has released its first 155 Mbps PCI ATM adapter: the Turboways(2) 155 PCI ATM adapter. The first available device driver supports the AIX(2) (Advanced Interactive Executive) operating system. The adapter utilizes the CHARM(2) hardware chip designed by the AS/400 Division, Rochester MN (providing segmentation/reassembly, direct memory access or DMA, etc.). It is a packet-buffered (or "deep") adapter (i.e., one in which a frame is reconstructed from its constituent ATM cells within the adapter itself, before being transmitted up to the host), rather than a cell-buffered (or "shallow") adapter (i.e., one in which the adapter passes the cells through up to the host, which then reconstructs the frame).

In this paper, we will:

(1) present adapter throughput and processor utilization measurements for the AIX 4.2.1 operating system, running between the very latest RS/6000 workstations (models F50), for:

(a) Classical IP (CIP);

(b) LAN Emulation (LANE; both Token Ring and Ethernet(3)).

(2) discuss recent criticisms that have been appearing in the trade press regarding 155 Mbps LANE over ATM, and demonstrate that:

(a) LANE over ATM is capable right now of achieving OC3 (Optical Carrier 3) media speeds;

(b) currently reported "low" throughputs are primarily the result of improper tuning of the ATM adapters under study.

In demonstrating (2b), we will:

(c) replicate their observed results by using default protocol window size, etc. parameters;

(d) show how some of the adapters may be brought closer to OC3 media speed, even for Ethernet LANE over ATM, with proper tuning.

(3) discuss some characteristics specific to the PCI bus that may have a dramatic effect on the performance observed in a given customer environment.

AIX OPERATING SYSTEM PERFORMANCE: CLASSICAL IP

Figure 1 plots measured CIP throughputs (in Mbps) as a function of file size (in bytes) using the Netperf benchmark application. Four curves are shown, representing the aggregate throughputs generated by one, two, three and four Turboways 155 PCI ATM adapters installed in each of a pair of communicating workstations. The operating system was AIX 4.2.1. The measured data was transmitted between two RS/6000 model F50s. (The F50 is a multiprocessor workstation containing up to four 166 MHz 604E processors.) The throughput presented below is adapter-level throughput, i.e., the actual number of bits per unit time delivered between the ATM adapters across the media, which includes ATM and SONET (Synchronous Optical Network) overhead (see, e.g., [1], [2]). The buffer size, etc. parameter values used in these studies are optimal, as reported in [1], [2]. (Also see APPENDIX A.) The protocol was TCP/IP (Transmission Control Protocol/Internet Protocol), using an MTU (maximum transmission unit) size of 9180 bytes.

Figure 1. TCP/IP Throughput for the Turboways 155 ATM Adapter

As Figure 1 shows, a single CHARM-based adapter is capable of delivering data from the adapter to the media at 152.5 Mbps. (154.5 Mbps has been measured for other tuned systems; not shown here.) These throughputs represent 98-99% of the capacity of an OC3 link. This yields 133-134.7 Mbps, as measured at the Netperf application level, after ATM cell header (5 bytes per 48 byte payload) and SONET framing (1 byte per 26 bytes transmitted) have been removed. (Adapter-level throughput = Netperf throughput x (53/48) x (27/26), where OC3 physical media speed = 155.52 Mbps.)
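As a quick check of this conversion (a sketch using the 133 Mbps application-level figure from above, and the standard UNIX bc calculator):

    # restore ATM cell header (53/48) and SONET framing (27/26) overhead
    echo "133.0 * 53 / 48 * 27 / 26" | bc -l     # prints 152.50..., i.e. 152.5 Mbps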

More importantly, because of the extremely efficient device driver/operating system (yielding very low processor utilizations per unit of throughput, as reflected by the large Netperf scaled throughputs in Table 1 below), when multiple adapters are placed in a single machine, the aggregate throughput increases linearly with each new adapter installed (up to four). For two adapters, the peak aggregate adapter-level throughput was 305 Mbps (representing 266 Mbps at the application level), or 98% of the aggregate media speed of two OC3 links. Likewise, for three adapters, the peak aggregate adapter-level throughput was 457 Mbps (representing 398.4 Mbps at the application level), or, once again, 98% of the aggregate media speed of three OC3 links. Finally, for four adapters, the peak aggregate adapter-level throughput was 594 Mbps (representing 518 Mbps at the application level), or 96% of the aggregate media speed of four OC3 links, or one OC12 link. The system still had substantial capacity available with four adapters (processor utilization was 82%). A fifth adapter could therefore have been added to produce a still larger aggregate throughput from a single machine.

It should be noted that the RS/6000 model F50 contains three PCI bus segments: one primary and two secondary. Therefore, the first two ATM adapters were installed on the primary PCI bus, while the third and fourth had to be installed on one of the secondary PCI buses. The fact that the throughput appeared to scale linearly when a third or fourth adapter was installed suggests that the F50 PCI bridging architecture has been properly designed for optimal bus performance (see PCI BUS PERFORMANCE CONSIDERATIONS below).

Table 1 represents a sample of the raw Netperf output that was used to generate the single-adapter throughputs in Figure 1. The scaled throughputs are a standard output of Netperf, representing throughputs normalized for processor utilization (i.e., throughputs, in K bytes per sec, divided by processor utilization). Superior scaled throughput values indicate better or comparable measured throughputs delivered at lower processor utilizations. As can be seen for both send and receive utilizations, the Turboways 155 ATM PCI adapter is very efficient under AIX.

Table 1. Netperf Performance: IBM Turboways, AIX 4.2.1, RS/6000 F50

    File Size    Raw Throughput    Send Util.    Receive Util.    Send Scaled Thpt    Receive Scaled Thpt
    (bytes)      (Mbps)            (%)           (%)              (KBps/%)            (KBps/%)
    ---------    --------------    ----------    -------------    ----------------    -------------------
    1                 0.48         25.19         14.66               2.34                 4.02
    64               26.52         26.52         15.96             122.05               202.87
    512              80.49         28.51         14.93             344.60               658.16
    1024            130.44         29.96         20.45             531.40               778.76
    4096            131.22         17.26         18.46             927.88               867.85
    8192            132.94         14.88         19.18            1090.97               845.92
    16,384          132.85         15.00         19.18            1080.93               845.37


(Note: Absolute throughputs in Mbps can be obtained from the table above by multiplying the Send or Receive Scaled Throughput by 8 x 1.024 times, respectively, the Send or Receive Utilization, and then dividing by 1,000. Note that Scaled Throughputs are in units of K bytes per sec per unit of utilization, where K bytes = 1024 bytes.)
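For example, taking the 8192-byte row of Table 1 (a sketch of the arithmetic described in the note above):

    # Send Scaled Throughput (1090.97 KBps/%) x 8 bits x 1.024 x Send Utilization (14.88%)
    echo "1090.97 * 8 * 1.024 * 14.88 / 1000" | bc -l    # about 133 Mbps, matching the Raw Throughput column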

What IBM's superior efficiency in processor utilization means (as measured by the large scaled throughput values above) is that, for a workstation with a less powerful processor, the CHARM-based PCI Turboways 155 ATM adapter/device driver will exhibit significantly better throughputs across the entire frame size range (when compared with those obtained for a less efficient ATM adapter). On workstations with more powerful processors (e.g., the RS/6000 model F50), a similar increase in throughputs at small frame sizes will be obtained. At larger frame sizes, where the processor is not the bottleneck for such powerful machines, the workstation is capable of supporting multiple Turboways 155 ATM adapters, generating the extremely large aggregate throughputs described above.

Of equal importance to the customer is the fact that there is more processor capacity available for the execution of applications and other software that must run simultaneously with data transmission. The end result is dramatically higher data throughputs obtained at minimal cost to the execution, and therefore observed performance, of the customer's simultaneously running applications.

Given that the adapter hardware can support it, excellent processor utilization also translates into excellent full duplex (FDX) performance. Figure 2 represents measured TCP/IP throughputs (in Mbps) as a function of file size (in bytes) for the same set-up above, with Netperf running its full duplex TCP stream test (i.e., two simultaneous streams in opposite directions between the two workstations/adapters). As Figure 2 demonstrates, the Turboways 155 PCI ATM adapter is capable of sustaining a peak full duplex throughput of 292 Mbps at the adapter or media level (representing 255 Mbps at the application level). This is 94% of twice OC3 media speed (i.e., full duplex OC3). Furthermore, processor utilizations were no more than 34%, indicating that up to three Turboways 155 PCI ATM adapters could operate in full duplex mode simultaneously in a single appropriate workstation.


Figure 2. TCP/IP FDX throughput for a single Turboways 155 ATM adapter

Note that in Figure 2, FDX throughput appears to decrease slightly as the message size increases. This is not the result of the throughput reductions typically observed for HDX measurements when a new MTU or window occurs. Rather, it is the result of competition for shared resources within the adapter by frames that are being received and frames that must be transmitted. As the frame size increases, the delay experienced by one waiting for the other increases, with a resultant slight observed decline in throughput. Modifications in the design are under way which should minimize this effect.


AIX OPERATING SYSTEM PERFORMANCE: LAN EMULATION

Figure 3 represents measured Token Ring and Ethernet LANE throughputs (in Mbps) as a function of file size (in bytes), once again using the Netperf benchmark application. The measured data was transmitted between two RS/6000 model F50s, each with a single Turboways 155 PCI ATM adapter (as before). The operating system was once again AIX 4.2.1. As before, the throughput presented below represents adapter-level throughput, while buffer size, etc. parameter values were optimal. The protocol was IP (over LANE).

Figure 3. TR/ENET LANE Throughputs for the Turboways 155 ATM Adapter

As can be seen from the two curves in Figure 3, Token Ring LANE over ATM (4500 byte MTU, 128K byte socket size) achieved a peak adapter-level throughput of 152 Mbps (application level: 132.8 Mbps), while Ethernet LANE over ATM (1516 byte MTU, 64K byte socket size) achieved a peak throughput of 144.5 Mbps (application level: 126 Mbps). Both throughputs are close to OC3 media speed. Full duplex throughputs were also quite significant (e.g., full duplex Token Ring LANE over ATM throughput was 256 Mbps at the adapter level, representing 223.2 Mbps payload).


DEBUNKING RECENT CONCERNS REGARDING LAN EMULATION PERFORMANCE

A number of recent articles have criticized the performance of 155 Mbps ATM, particularly for LANE, based on throughput measurements of 155 Mbps ATM adapters from a wide variety of vendors. This is in spite of steady (and substantial) increases in throughput performance over the past year, presumably due to improvements in both the adapter device drivers and operating systems. Most important among these improvements has been the release of Windows(4) NT 4.0, which utilizes NDIS (Network Driver Interface Specification) 4.0. Current Ethernet LANE throughputs for 155 Mbps ATM adapters (for a number of adapter vendors; the Turboways 155 PCI ATM adapter was not tested) running the latest versions of the Novell(5) NetWare(5) and/or Windows NT operating systems (client or server) were reported to peak at around 70 Mbps [3]. However, as the Ethernet LANE (and Token Ring LANE) data above attests, a 155 Mbps ATM PCI adapter running LANE over ATM can easily achieve near media speeds, given a well-written device driver and mature operating system. What is not sufficiently emphasized in these highly critical articles (see, e.g., [3], [4]) is the impact of the small 1516 byte MTU size that Ethernet LANE imposes, along with the impact of the resultant default protocol window size, on measured throughput. With higher media speeds, even small per-window or per-MTU overheads can have a devastating impact on end-to-end throughput. Therefore, it is extremely important to tune the system to maximize performance. As described below, when the cited (non-IBM) 155 Mbps ATM PCI adapters were properly tuned, obtained throughputs were much closer to OC3 media speeds, even for Ethernet LANE over ATM.

THE MTU AND WINDOW SIZE EFFECT

End-to-end network throughput, as measured by such benchmarks as Perform3, Netperf, ttcp, etc., is typically defined by two things: (1) the amount of code executed by the end workstation, in the main data transmission path; and (2) the network transmission delay associated with the receipt acknowledgement accrued with each protocol window's amount of data sent. As a result, there are fixed costs (in terms of processing time and acknowledgement receipt delays) associated with each unit of data transmitted. The smaller the data unit sent, the more data units required to transmit a larger file of data, and the more throughput-robbing fixed overhead that is accrued with each file sent. As a result, regardless of what high-speed improvements one makes within the backbone, the end-to-end throughput will always be hobbled by a small MTU and/or protocol window size. (The small but fixed 53 byte ATM cell size does not affect the end-to-end throughput, because it is transmitted at the very high capacity hardware level, which does not and probably never will represent the bottleneck in the network path. The fixed cell size brings a number of performance benefits discussed in [5].)

Most importantly, a small MTU can result in a smaller window size at the protocol level. For example, for IPX (Internetwork Packet eXchange), the MTU size and window size are equal with packet burst mode off; with packet burst mode on, the window size is typically a multiple of the MTU size. (In many benchmark tests that we have run in the past, the default was usually 16.) It is initially based on round-trip delays measured at set-up, and is then dynamically adjusted (though dynamic changes do not typically occur during controlled performance testing). With each window transmitted, there is an extremely large overhead accrued that includes the network transmission delay in sending back a receipt acknowledgement to the sender, along with any protocol stack, etc. processing delays in the end workstation. The smaller the window size, the more overhead accrued for a given file size transmitted. Preliminary analytical studies indicated that even a small (absolute) overhead time associated with each window can have a devastating effect on performance. This detrimental impact on throughput becomes increasingly worse (nonlinearly) as the media speed increases (from 25 to 155 to 622 Mbps).
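A crude model makes the point. Suppose each window of W bits costs W/s on the media (at media speed s) plus a fixed overhead To for the acknowledgement turnaround and end-station processing; effective throughput is then W/(W/s + To). The sketch below is our illustration, not the analytical study cited above; the 200 microsecond per-window overhead and the one-MTU window are assumed values chosen only to show the shape of the effect:

    awk 'BEGIN {
      n  = split("25 155 622", spd)     # media speeds (Mbps) named in the text
      To = 0.0002                       # assumed fixed per-window overhead: 200 usec
      W  = 1516 * 8                     # assumed window: one Ethernet LANE MTU, in bits
      for (i = 1; i <= n; i++) {
        s   = spd[i] * 1e6
        eff = W / (W / s + To) / 1e6    # effective end-to-end throughput, Mbps
        printf "%4d Mbps media -> %5.1f Mbps effective (%2.0f%%)\n",
               spd[i], eff, 100 * eff / spd[i]
      }
    }'
    # prints roughly 17.7 Mbps (71%), 43.6 Mbps (28%) and 55.3 Mbps (9%):
    # the same per-window overhead wastes nonlinearly more of a faster link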

PROPER TUNING TO AVOID THESE EFFECTS

To demonstrate that the throughputs reported in [3] were unacceptably low due to lack of proper tuning, their throughput measurements were replicated using the Perform3 benchmark. The server operating system was Novell NetWare 4.11, while the client operating system was Windows NT 4.0. The protocol was IPX. More powerful workstations were used (200 MHz Pentium Pros, instead of 166 MHz Pentiums), resulting in untuned throughputs (i.e., using the default IPX window size) that were approximately 10% higher than those reported in [3], but still well below the rated media speed of 155 Mbps. Several of the same non-IBM adapters cited in the original study were examined. Since they are competitors' adapters, they are not named here. The focus of this study was on the technology, and specifically on whether or not near media speeds could be achieved from 155 Mbps ATM PCI adapter technology, for Ethernet LANE running on today's most popular operating systems.

IPX will allow a maximum window size of 64K bytes. This window size can be adjusted by first activating burst mode, and then appropriately setting the PBURST READ and WRITE WINDOW SIZEs (see Appendix B). Depending upon the operating system version, this window size may be either in bytes or MTUs. When the IPX window size was increased to 64K bytes, all of the adapters examined yielded dramatically improved throughputs. The best Ethernet LANE throughput observed for this set of non-IBM adapters, following tuning, was 120 Mbps (adapter level, or 105 Mbps payload), well in excess of the throughputs reported in [3]. (Similar improvements were observed when the socket size was increased for IP measurements.)

It is therefore imperative that ATM adapters be properly tuned for high-speed performance if they are to deliver continuous throughputs approaching their rated media speed. The appendices at the end of this document provide detailed steps for high-performance tuning of IP and IPX for ATM adapters.


PCI BUS PERFORMANCE CONSIDERATIONS

As stated previously, because of its high capacity, most new workstations today come with a PCI bus, and most networking product vendors have developed high-speed ATM and LAN adapters for it. However, other commonly used adapters with smaller bandwidth requirements, such as modems, video adapters or sound cards, are typically available for an ISA or EISA bus, so that many new client workstations provide the PCI bus in a split-bus implementation (e.g., PCI/ISA, PCI/EISA, PCI/Micro Channel(2)). This is sometimes unfortunate since some current implementations of the PCI bus can exhibit poor throughput characteristics under certain conditions.

It has often been observed that when PCI and ISA (or EISA or Micro Channel) adapters operate simultaneously in a split-bus machine, the capacity of both buses is substantially reduced, with the negative effects on the higher-speed bus typically being greater. This phenomenon was most recently reported for single-port saturation tests described in [6]. In that study, it was observed that file transfer throughput to a workstation running Windows NT (via 100 Mbps Ethernet) dropped by more than 40 to 50 Mbps when a 3 Mbps (mean) MPEG1 video was simultaneously transmitted to the same workstation (with no perceivable degradation in the video quality). It was determined not to be the result of prioritization of the video over the file transfer application by the operating system, nor due to the protocol. The large drop was ultimately determined to be due to bridging between the PCI and ISA buses on the split-bus machine. The MPEG1 playback application utilized a video adapter installed on the ISA side of the split bus, while the file transfer application utilized a 100 Mbps Ethernet adapter on the PCI side of the split bus.

These effects are easily quantified. Let Λ_PCI and Λ_ISA be the actual throughputs obtained for the PCI and ISA buses respectively. Let Λ*_PCI and Λ*_ISA be the theoretical maximum throughputs that could be obtained for the PCI and ISA buses respectively (as defined by the bus width and bus cycle time, assuming no overhead per bus access). Actual obtained throughputs for each bus on a split-bus implementation, given that both buses are active, are then given by:

    Λ_PCI = [ D_PCI / (D_PCI + O_PCI + D_ISA + O_ISA) ] x Λ*_PCI     (1)

    Λ_ISA = [ D_ISA / (D_PCI + O_PCI + D_ISA + O_ISA) ] x Λ*_ISA     (2)

where D_PCI and D_ISA are the times allowed for actual data transfer per bus access for the PCI and ISA buses respectively, while O_PCI and O_ISA are the overhead times associated with each bus access for the respective buses. The PCI bus uses a 30 nsec bus cycle, during which 32 bits of data can be transferred. Therefore, Λ*_PCI = 1.067 Gigabits per sec, or slightly more than 132 MBps. The ISA bus, on the other hand, is an 8 MHz 16-bit-wide bus, so that Λ*_ISA = 128 Mbps. However, overhead cycles required per bus access dramatically reduce these maximum theoretical limits, even without a split-bus implementation. Given that the ISA bus requires more than eight times the amount of time needed by the PCI bus to transfer the same amount of data, equations (1) and (2) can easily be used to show that time spent on the ISA bus has a much greater detrimental impact on PCI bus performance than that on ISA bus performance due to time spent on the PCI bus.
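Plugging illustrative numbers into equations (1) and (2) shows the asymmetry. This is a sketch: the D and O values below are assumed relative times, chosen only so that the ISA data phase takes eight times the PCI data phase, as stated above:

    awk 'BEGIN {
      Lpci = 1067; Lisa = 128     # theoretical maxima from the text, in Mbps
      Dpci = 1.0; Opci = 0.5      # assumed PCI data and overhead times per access
      Disa = 8.0; Oisa = 4.0      # assumed ISA times: data phase 8x the PCI one
      tot = Dpci + Opci + Disa + Oisa
      printf "PCI: %5.1f of %4d Mbps (%2.0f%% of its maximum)\n", Dpci/tot*Lpci, Lpci, 100*Dpci/tot
      printf "ISA: %5.1f of %4d Mbps (%2.0f%% of its maximum)\n", Disa/tot*Lisa, Lisa, 100*Disa/tot
    }'
    # the PCI bus keeps only ~7% of its capacity while the ISA bus keeps ~59%:
    # sharing with the slower bus hurts the faster bus far more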

It should be noted that a packet-buffered adapter, such as the Turboways 155 PCI ATM adapter, can better utilize the PCI bus in a split-bus implementation than can an adapter that can burst only one ATM cell at a time. By being able to sustain larger data bursts with each PCI bus access (i.e., a larger D_PCI in equation (1)), effective PCI throughput (Λ_PCI) will increase.

In addition, the physical characteristics of the PCI bus limit the number of attached devices it can support. Therefore, even in a pure PCI bus workstation, multiple PCI bus segments must be bridged together to increase the number of available adapter slots. The bridging implementation can therefore have a significant impact on an adapter's performance, especially in the case of a hierarchical bus structure, in which some bus segments have higher priority than other bus segments. The throughput of an adapter on a secondary PCI bus can degrade when another adapter, installed on the primary PCI bus, is extremely active (dependent upon the primary-secondary bus bridge chip used). Recent simulation studies being performed by the PC Company Server Performance group suggest that secondary bus performance, as well as overall system throughput, may be improved by adjusting the timers associated with how long a device may hold the primary and secondary buses. Improvements are obtained by increasing the time a device on a secondary PCI bus may hold the secondary bus to two or more times the holding time of the primary bus. Confirmation studies of actual systems are currently under way.

In summary, then:

Split-bus implementations (e.g., PCI-ISA) typically experience reduced performance when both buses are active.

Packet-buffered adapters (like the IBM Turboways 155 PCI ATM adapter) are positioned to better utilize split-bus architectures, which would yield higher effective throughputs.

Similarly, multiple-PCI-bus implementations can result in poorer secondary (or lower) bus performance (relative to the primary bus), with the impact varying with PCI-to-PCI bridge chip design.

Proper tuning of timers associated with primary, secondary, etc. bus holding times may significantly improve performance.


CONCLUSIONS

As the data shows, the Turboways 155 PCI ATM adapter is capable of supporting media speed for both CIP and LANE over ATM. In fact, because of the efficiency of its device drivers, four Turboways 155 ATM adapters within the same workstation can generate an aggregate throughput of four times the capacity of an OC3 link (for CIP). It is also capable of true full duplex support, i.e., it can sustain two simultaneous OC3 streams in opposite directions at effectively full media speed.

As stated earlier, these exceptional performance numbers highlight the advantages of a packet-buffered (or "deep") adapter over a cell-buffered (or "shallow") adapter. They also highlight the advantages IBM can bring to any networking environment. By providing a complete network computing solution, IBM can globally tune the performance of its product set, from adapter and workstation hardware, to operating system, device driver and protocol. The result is superior performance to the end user.

As the data further shows, the MTU and the resulting window size have a very powerful effect on the final end-to-end throughput observed for a given system. This impact has been unfairly neglected in recent criticisms of 155 Mbps LANE over ATM, and can easily be corrected with proper tuning. One also needs to be aware of the impact split-bus machines can have on PCI adapters, especially when another adapter is installed on the non-PCI bus.


REFERENCES

(Note: References [1], [2], [5] and [6] below, as well as other papers on IBM networking product performance, are available on our IBM RTP world wide web site at

http://www.networking.ibm.com/per/perprod.html.)

(1) “Factors influencing ATM adapter throughput,” A. Rindos, S. Woolet, D. Cosby, L. Hango and M. Vouk, Multimedia Tools and Applications, Vol. 2, No. 3, pp. 253-271, May 1996. (Also available as IBM Tech. Report 29.2099, and appears as Chapter 11 in the ATM Redbook.)

(2) “The IBM Turboways 155 ATM adapter: Outstanding throughput and processor utilization,” A. Rindos, D. Cosby and S. Woolet, IBM Tech. Report 29.2157, July 1996.

(3) “A nightmare on ATM LANE,” LAN Times, March 3, 1997.

(4) “155M-bps adapters reveal ATM's shortcomings,” S. Buehler, Netweek, April 24, 1995.

(5) “ATM-to-the-Desktop: Performance considerations in positioning it among emerging network technologies,” A. Rindos, S. Woolet, C. O'Rourke, D. Cosby, Z. Ortiz, J. Sents, K. Young, K. Jotwani, J. Glekas and M. Vouk, Proc. IEEE Wescon/96, Anaheim CA, Oct. 22-24 and Proc. IEEE Northcon/96, Seattle WA, Nov. 4-6.

(6) “Performance considerations in positioning emerging networking technologies: ATM, ATM-to-the-Desktop and switched/high-speed Ethernet,” A. Rindos, S. Woolet, C. O'Rourke, D. Cosby, K. Jotwani, K. Young, Z. Ortiz, M. Aydemir, J. Sents, B. Trimmer, D. Laubscher, M. Shelley, R. Spychala and M. Vouk, IBM Tech. Report 29.2173, August 1996.

(7) The Complete Guide to NetWare 4.1, J. E. Gaskin. Alameda CA, Network Press, 1995.

(8) NetWare Unleashed, R. Sant'Angelo. Indianapolis IN, Sams Publishing, 1995.

NOTES

1. Trademarks of Intel Corp.

2. Trademarks or registered trademarks of International Business Machines Corporation.

3. Trademark of Xerox Corp.

4. Trademark of Microsoft Corp.

5. Trademarks of Novell Inc.


Appendix A. AIX IP PERFORMANCE TUNING

The network parameters that have the greatest effect on ATM performance are the TCP/IP network option parameters. Their specific values may be observed by typing "no -a" on the AIX command line. They may be changed by typing "no -o <parameter name>=<new value>". The specific values recommended below were determined to be optimal for communication between two high-end RS/6000 machines (e.g., between two 59Hs). Other combinations of values for parameters found within the rc.net file under AIX may be needed to maximize the throughput of other specific system and adapter configurations, especially for communicating workstations with mismatched processor speeds. It is recommended that the user experiment with various combinations of values for these network parameters when tuning a specific given configuration.

Suggested optimal values are as follows (a consolidated command sequence follows the list):

rfc1323 = 1 (rfc1323 enhancements activated);

tcp_sendspace = tcp_recvspace = 65,536 for an MTU size of 9180 bytes;

tcp_sendspace = tcp_recvspace = 655,360 for an MTU size of 59K bytes;

udp_sendspace = udp_recvspace = 16,384 (when examining UDP/IP throughputs, increases in these values did not appear to yield significantly higher throughputs, though the user may wish to experiment);

sb_max = 600K (for an MTU size of 9180 bytes); sb_max = 6M (for an MTU size of 59K bytes or higher).
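For the default 9180 byte MTU, the settings above can be applied as a short command sequence (a sketch; 614400 is 600K expressed in bytes):

    no -o rfc1323=1
    no -o tcp_sendspace=65536
    no -o tcp_recvspace=65536
    no -o udp_sendspace=16384
    no -o udp_recvspace=16384
    no -o sb_max=614400
    no -a | egrep "rfc1323|sendspace|recvspace|sb_max"    # verify the new values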

The recommended MTU size is the default 9180 bytes. However, one can artificially pump up throughput numbers by increasing the MTU size to as much as 64K bytes. The cost is significant delays through the IP stack when data traffic is not heavy, as well as retransmission of very large MTUs when even a single cell in that MTU is lost. The specific value for the MTU size may be observed by typing "netstat -i" on the AIX command line. The MTU size may be increased using the following procedure:

(1) Type the command "smitty inet".
(2) Select from the menu "Change/Show Characteristics of a Network Interface".
(3) Select from the menu "atX ATM Network Interface". {X is the numeric value corresponding to the appropriate ATM interface.}
(4) Set "Current State" to "detach".
(5) Type the command "smitty atm".
(6) Select from the menu "Adapter".
(7) Set "ATM PDU" to 60,416 (i.e., 59K; or 65,536 for 64K).
(8) Type the command "smitty inet".
(9) Select from the menu "Change/Show Characteristics of a Network Interface".
(10) Select from the menu "atX ATM Network Interface".
(11) Set "Current State" to "up".
(12) Type the command "smitty atm".
(13) Select from the menu "Services".


(14) Set "Maximum IP PacketSize for this Device" to 60,416(or 65,536for64K).

One may need to optimize buffer allocation (Mbufs) for performance testing. After running a test, the existence of data overflows can be detected by typing the command "atmstat -d atmX". If overflows have occurred, then either the Small, Medium, Large, Huge or MTB Mbuf overflow parameters will be nonzero. If this occurs for any of the Mbufs, repeat steps (1) through (6) above. At step (7), for any of the Mbufs listed above that overflowed, increase the corresponding maximum Mbuf value (i.e., Max Small Mbuf, Max Medium Mbuf, Max Large Mbuf, Max Huge Mbuf or Max MTB Mbuf). Then repeat steps (8) through (11).

The ATM statistics can be cleared by typing "atmstat -r atmX".
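A typical check-and-reset cycle between test runs might therefore look like the following (a sketch; atm0 stands in for the appropriate atmX instance, and the exact labels of the overflow counters in the atmstat output may vary by driver level):

    atmstat -d atm0 | grep -i overflow    # nonzero counters mean Mbufs overflowed
    atmstat -r atm0                       # clear the statistics before the next run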


Appendix B. NETWARE IPX PERFORMANCE TUNING

Books are available that describe in great detail the monitoring and tuning of NetWare IPX performance (see [7], [8]). The minimum recommendations for obtaining acceptable throughput for file transfers follow. Obviously, the complexity of the IPX network and the types of applications being run on the server will affect the tuning for optimal overall system performance.

The single most important NetWare feature for obtaining maximum file transfer throughput is Packet Burst mode. This feature allows multiple packets to be transmitted before waiting for an acknowledgement. The default configuration is Packet Burst Mode On. In a DOS environment, this defaults to allowing 16 packets to be transmitted per burst.

The number of packets per burst can be modified in MS Windows, Windows 95 and Windows NT client environments. For Windows, this can be done using the PBURST READ WINDOW SIZE and PBURST WRITE WINDOW SIZE entries in the NET.CFG file. These parameters set the PBURST READ and WRITE WINDOW SIZEs, respectively, and can be varied from 2 to 64 packets (although the maximum burst size is 65536 bytes, no matter how large the packet). To modify these settings in the Windows 95 or Windows NT environments, the Novell NetWare Client must be installed. To view or modify these settings, the ADVANCED, PROPERTIES of the Novell NetWare Client NETWORK COMPONENT must be selected (i.e., in Windows 95, right-click on the Network Neighborhood icon, select PROPERTIES, select the Novell NetWare Client network component, select PROPERTIES, select ADVANCED, and then scroll to the desired parameters; a similar procedure is used for Windows NT).
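For the Windows NET.CFG case, a minimal fragment setting the maximum window would look like the following (a sketch; the section name follows the standard NetWare DOS Requester layout of NET.CFG, and 64 packets is the maximum discussed above; Packet Burst mode itself is on by default):

    NetWare DOS Requester
        PBURST READ WINDOW SIZE = 64
        PBURST WRITE WINDOW SIZE = 64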

The granularity of the READ and WRITE window sizes differs between Windows NT and Windows 95, with Windows NT allowing the sizes to be specified in terms of bytes per window, and Windows 95 in terms of packets. This difference is not significant. To obtain maximum throughput for large file transfers, these parameters should be set to the maximum allowed. There are instances where this may not be desirable, i.e., when the network condition is such that packets may be lost on a frequent basis. In this case, the window settings should be at the maximum size that can be transmitted with a small probability of loss. It may also be desirable to reduce the window sizes in clients that have configurations limiting the NIC capability (i.e., fast adapters in ISA slots of ISA/PCI split-bus machines).
