GRIP: A Reconfigurable Architecture for Host-Based Gigabit-Rate Packet Processing

Peter Bellows    Jaroslav Flidr    Tom Lehman    Brian Schott    Keith D. Underwood

Information Sciences Institute, University of Southern California
3811 N. Fairfax Drive, Suite 200
Arlington, VA 22203
{pbellows, jflidr, tlehman, bschott}@isi.edu, [email protected]

(This work is supported by the DARPA Information Technology Office (ITO) as part of the Next Generation Internet program and is funded via Grant No. F30602-00-1-0541.)

Abstract

One of the fundamental challenges for modern high-performance network interfaces is the processing capability required to process packets at high speeds. Simply transmitting or receiving data at gigabit speeds fully utilizes the CPU on a standard workstation. Any processing that must be done to the data, whether at the application layer or the network layer, decreases the achievable throughput. This paper presents an architecture for offloading a significant portion of the network processing from the host CPU onto the network interface. A prototype, called the GRIP (Gigabit Rate IPSec) card, has been constructed based on an FPGA coupled with a commodity Gigabit Ethernet MAC. Experimental results based on the prototype are presented and analyzed. In addition, a second-generation design is presented in the context of lessons learned from the prototype.

1. Introduction

The bandwidth of high-performance network interfaces has often exceeded the capabilities of workstations to process network data. The advent of gigabit-rate networks and the promise of ten-gigabit networks has once again made this the case. Protocol processing of gigabit-rate data leads to CPU utilizations in excess of 80% on modern microprocessors; hence, there is clearly not enough processing power to handle the 10 Gigabit Ethernet standard now coming to market. To further complicate matters, the applications using that data require some CPU time. CPU time used by applications is not available for protocol processing. This leads to a reduction in the achievable network bandwidth. New technology is needed to enable commodity machines to leverage these new network capabilities. The proposed solution is the augmentation of the network interface with reconfigurable computing to offload processing related to the network data stream.

The network stack is typically formed of several layers of processing, as shown in Figure 1. Each of these layers can have processing requirements that vary by application. The goal of a network card enhanced with reconfigurable computing is to absorb as many layers of the network stack as possible (on a per-application basis) to offer the maximum possible bandwidth to the application. For example, in a cluster computer, the network and transport layers can be extremely light-weight, allowing parts of the application to be absorbed into the network interface [29]. Alternatively, two servers at two distant military sites communicating over the internet require secure communications at the network level, such as those provided by IPSec [17]. Encryption with IPSec is an extremely compute-intensive process [9] that may require most (or all) of the FPGA resources.

Providing reconfigurable computing on the network interface allows a single network adapter to serve a variety of purposes. More importantly, it provides the capability to reconfigure the network adapter to meet the changing needs of the system. This paper presents a prototype reconfigurable network interface to meet this need. A number of potential applications are described, with the focus being on the network-level security protocol, IPSec. The capabilities of the prototype are assessed, and preliminary results for IPSec are presented along with an analysis of the results. Based on experiences with the prototype and a further analysis of the possible application domain, a second-generation architecture is presented for gigabit-rate packet processing.



Figure 1. OSI 7-Layer Model (the seven OSI layers from physical to application, with IPsec shown at the network layer)

The remainder of this paper is organized as follows. Section 2 presents applications under consideration. Section 3 then presents the prototype developed for experimentation with these applications. Preliminary results from the prototype are then presented in Section 4. Based on the preliminary work, a second-generation design is presented in Section 5. Related work is then presented in Section 6, followed by conclusions in Section 7.

2. Applications

A range of application domains can take advantage of a network interface enhanced with reconfigurable computing. These range from intrusion detection at the link layer and encryption at the network layer (IPSec) to protocol processing at the transport layer and parallel computing at the application layer. The goal of the technology is to absorb as much of the network processing stack as possible to maximize application performance. Four example applications are discussed here.

2.1. IPSec

IPSec is a protocol suite which allows hosts or entire networks to send protected (encrypted) and authenticated data across untrusted networks. It consists of three protocols: Encapsulating Security Payload (ESP), Authentication Header (AH), and a key management protocol [17]. ESP provides protection, integrity, authentication, and anti-replay services to the IP layer. AH provides the same services as ESP with the exception of encryption. The key management protocol coordinates negotiations between the peers in order to establish security associations, exchange the keys, and enforce the security policies.

The cryptographic algorithms used in the IPSec protocols are too complex to be scaled to modern network speeds by sequential CPUs. For example, a test machine (266 MHz Pentium II) running an implementation of AES (Advanced Encryption Standard) [1] produced only 26 Mb/s while consuming 100% of the CPU time [9]. Similar results can be found in NIST's AES candidate efficiency testing [8]. Although these results are implementation dependent, even the most optimized implementations cannot compensate for the shortcomings of serial processors. Adding this bottleneck to the existing overhead of protocol processing causes further performance degradation.

Many modern cryptographic algorithms [13] have been designed to be optimized via parallelization. The obvious solution, hardware acceleration of the cryptographic algorithms, is a necessary, but not sufficient, step to increase the performance. IPSec is not a fully independent part of the network layer. It provides security services to the IP layer, but in return requires IP stack services for networking; thus, a resource interdependence is created. As a result, a crypto-only accelerator, while decreasing the CPU load, will create a new bottleneck at the PCI bus by requiring three or more PCI transactions per packet. Measurements have shown that the best throughput achievable in such systems with dedicated crypto-hardware is roughly 12 Mb/s for 1500-byte MTU or 75 Mb/s for 9 KB MTU packets [18]. By integrating cryptographic acceleration with the network interface, the GRIP architecture effectively removes both the CPU and PCI bottlenecks.
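To make the bus bottleneck concrete, a rough calculation (an illustration added here, assuming each packet crosses the bus three times as described above and using the 64/66 PCI figures reported in Section 4.1) shows why a crypto-only accelerator saturates the local bus even before transaction overhead is counted:

```latex
% Payload traffic imposed on the PCI bus by a crypto-only accelerator
% at a 1 Gb/s wire rate, assuming three bus crossings per packet
% (host -> crypto engine, crypto engine -> host, host -> NIC).
\[
B_{\mathrm{PCI}} \;\geq\; 3 \times B_{\mathrm{wire}}
               \;=\; 3 \times 1\,\mathrm{Gb/s}
               \;\approx\; 375\,\mathrm{MB/s},
\]
which already approaches the 64-bit/66\,MHz PCI theoretical maximum of
roughly $520\,\mathrm{MB/s}$ and exceeds the ${\sim}270\,\mathrm{MB/s}$
actually sustained in the measurements of Section~4.1.
```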

2.2. Intrusion Detection

Intrusion Detection (ID) is the process of detecting inappropriate activity on a network or host system. There are generally two types of ID systems: host-based and network-based. Host-based ID systems operate on a host to detect malicious or inappropriate activity. Alternatively, network-based ID systems monitor packets on the network looking for intrusion activity. A network-based ID system is usually placed at a strategic point on the network where it can observe traffic entering and leaving a site. The system can observe traffic through a hub or a switch (with port mirroring enabled), or through passive optical splitters for interfaces with fiber optic connections.

The GRIP system described in this paper can be applied to both the host-based and network-based ID solution. A key component of any ID system is the real-time processing of network traffic. Software-based techniques have trouble keeping up with traffic rates in excess of 100 Mb/s. However, rates of 1 Gb/s are common today, 10 Gb/s are available, and 40 Gb/s will be available soon. For these reasons, a reconfigurable hardware-based system is an ideal platform for intrusion detection. The hardware-based solution will allow capture, filtering, and processing of packets at line rates. The reconfigurable feature is important in ID systems because attacker activity and modes are constantly changing, and the filtering algorithms must be adjusted accordingly.

2.3. Protocol Processing

The key challenge for hosts which receive high-rate network data is not moving the data itself, but the high rate of header processing and buffer chaining that results from the relatively small size of a packet. While the IPSec application discussed above leverages jumbo frames to achieve gigabit-rate throughput to the host, jumbo frames are not practical for all network connections; hence, hosts need a mechanism for improving network performance in the presence of standard Ethernet frames. To highlight the problem, consider a Gigabit Ethernet connection using standard 1500-byte frames. If one interrupt per packet is received, the result is one interrupt approximately every 12 microseconds, and servicing a single interrupt consumes a significant fraction of that interval. While standard Gigabit Ethernet cards use interrupt mitigation schemes, the resulting increase in latency is unacceptable to many applications.
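To quantify the interrupt rate (a simple back-of-the-envelope derivation added for illustration), the per-packet arrival interval at line rate follows directly from the frame size:

```latex
% Inter-arrival time of 1500-byte frames at 1 Gb/s, and the resulting
% interrupt rate if the host takes one interrupt per frame.
\[
t_{\mathrm{pkt}} \;=\; \frac{1500 \times 8\ \mathrm{bits}}{10^{9}\ \mathrm{bits/s}}
                 \;=\; 12\ \mu\mathrm{s},
\qquad
\frac{1}{t_{\mathrm{pkt}}} \;\approx\; 8.3 \times 10^{4}\ \text{interrupts/s}.
\]
```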

The architecture presented here is capable of offloading a significant portion of that protocol processing load. In the simplest case, the network card could generate all acknowledgment packets, significantly reducing the load on the PCI bus and the host processor. In a more advanced system, the majority of the packet handling could be moved onto the network interface. In this case, a stream of received packets would be re-assembled into a data stream on the network interface. This approach would facilitate true zero-copy receives in the TCP/IP socket model. The number of interrupts in this case would be drastically reduced, as the host would no longer need to be informed of the arrival of individual packets.

2.4. Cluster Computing

The field of cluster computing depends on high-bandwidth, low-latency communications between nodes in a cluster of commodity systems. The goal of these clusters is high-performance computing, so any time spent in network processing is strictly overhead. In addition, parallel applications frequently rely on parallel computing primitives that are not available from commodity high-performance network interfaces (e.g., standard Gigabit Ethernet cards). Examples of this include barrier synchronization and parallel reduction.

A single card combining a high-performance computing engine with a high-performance network interface makes an attractive addition to a cluster computer. In a cluster, where lightweight network protocols are used, such a card can absorb all of the network processing and provide a stream-based, rather than a packet-based, interface to the network.

Figure 2. General SLAAC-1V architecture (three user-programmable Xilinx Virtex devices with SRAM, interconnected by a 72-bit ring bus and a 72-bit shared bus, with a standard 64/66 PCI interface, control module, and configuration control)

This drastically reduces the load on the processor for a high-bandwidth connection. The same card can also absorb a significant portion of the application onto the network card along with the network processing (as seen in [28, 29]). As a final benefit, this enhanced network card can provide a hardware implementation of a variety of parallel communication primitives.

3. GRIP Architecture

In this section, the details of the GRIP architecture are described. First, a brief description of the SLAAC-1V card that forms the base of the GRIP architecture is given. Then, the extensions to the normal SLAAC-1V system that provide GRIP functionality, including the Gigabit Ethernet daughter card and the custom PCI interface / packet engine design, are described.

3.1. The SLAAC-1V Card

SLAAC-1V is a high-performance reconfigurable computing platform based on the Xilinx Virtex architecture. It was designed to address a wide variety of general memory-bandwidth- and I/O-intensive applications. Its structure is shown in Figure 2. It has three user-programmable Virtex 1000 FPGAs, called X0, X1, and X2. The three FPGAs are interconnected by a 72-bit systolic ring path, plus an additional shared 72-bit bus. The FPGAs connect to a total of 10 fast ZBT SRAMs, for a total of 11 MB of SRAM memory and about 3 GB/s of memory bandwidth. The memories are individually accessible by the host, allowing for overlapped I/O and computation. The PCI interface is subsumed into the X0 FPGA, allowing for tight coupling to the PCI bus and easily customizable PCI behavior. About 80 percent of the X0 FPGA remains available for user-defined circuitry. The card provides two independent clocks to each FPGA, with programmable frequencies. The SLAAC-1V also has a bitstream cache capable of holding 7 full-sized Virtex 1000 bitstreams (5 in flash memory, 2 in SRAM), and is capable of partial configuration of the FPGAs.

Figure 3. Block diagram of the SLAAC-1V architecture modified for GRIP (the GRIP daughter card with a 1 Gb/s MAC attaches over the external I/O path; X0 serves as the packet I/O engine, X1 and X2 as generic packet processors; the ZBT memory architecture is unchanged and not shown)

3.2. GRIP extensions to SLAAC-1V

The GRIP architecture is motivated by two main factors. First, there is a need to insert hardware-assist functions in the network protocol stack in order for systems to perform useful computation at the high rates provided by modern networks. The primary bottleneck in system throughput is local bus bandwidth. Therefore, the hardware-offloading scheme must include an integrated physical interface, and the kernel driver must define a clear segregation in the protocol stack between host processing and co-processor processing; the host does all protocol processing above that segregation point, and the co-processor does all processing from that point down to the physical layer. If the co-processor cannot handle all packet processing from this "cutoff point" down to the physical layer, each packet must be transferred across the system bus at least three times, crippling performance. The second consideration in defining the GRIP architecture was the wide variety of possible hardware-assist functions for network processing. For this reason, it was important to define a standardized, generic packet-processing platform into which arbitrary packet operations could be inserted, with low overhead and design complexity.

In light of this, the GRIP architecture adds two key features to the basic SLAAC-1V system. The modified architecture is illustrated in Figure 3. First, a Gigabit Ethernet daughter card has been added to the SLAAC-1V base card to provide an integrated physical layer.

Figure 4. Photo of the assembled GRIP hardware

The daughter card interfaces the MAC to the FPGAs on the SLAAC-1V card and provides some packet buffering and filtering. Figure 4 shows a photo of the Gigabit Ethernet daughter card mounted on a SLAAC-1V base card. Second, the normal SLAAC-1V PCI interface in the X0 FPGA has been augmented with custom high-performance DMA capabilities, as well as packet switching and framing functions. X0 was essentially transformed into a packet mover, responsible for routing packets back and forth between the PCI bus and the other FPGAs in the system. It communicates with the other FPGAs using the 72-bit systolic ring and external I/O paths, with a simple packet-centric FIFO interface. The X1 and X2 FPGAs remain available for arbitrary packet-processing designs that implement a matching FIFO interface. The SLAAC-1V ZBT SRAMs are not used by the GRIP infrastructure, leaving them free for the packet processing performed in X1 and X2. To date, a number of different X1 / X2 packet-processing cores have been developed by third parties for the GRIP framework, including AES (Rijndael), 3DES, packet filtering (firewall), and intrusion detection.

3.3. Gigabit Ethernet (GRIP) daughter card

The GRIP card, diagrammed in Figure 5, provides a Gigabit Ethernet I/O extension for the SLAAC-1V. It connects via a 72-pin external I/O connector on the SLAAC-1V base card. Connectivity to the network is provided by a commercial Gigabit Ethernet Media Access Controller (MAC), the Vitesse 8840. The interface to the MAC chip requires more pins than are available on the SLAAC-1V I/O connector, so a Xilinx Virtex 300 is included to provide a less complicated, narrower interface to the SLAAC-1V. Since the FPGA is needed to provide an interface for the MAC, two 512K x 16-bit ZBT SRAMs are included for packet buffering.

Processing on the GRIP card currently consists of an interface to the MAC and some packet filtering capabilities. The packet filtering capabilities are needed because the use of jumbo frames is desired for the ongoing IPSec research effort.

Figure 5. Block diagram of the GRIP daughter card (SLAAC-1V I/O connector, Xilinx Virtex 300, Vitesse VSC8840 XMAC-II connected to a Gigabit Ethernet switch, ZBT-SRAM packet buffers, 18-bit data paths)

When configured for jumbo frames, the MAC used is unable to perform filtering of bad packets due to inadequate internal buffers. The FPGA on GRIP therefore temporarily buffers received packets in the ZBT SRAMs while waiting for the packet status word. The packet status word is checked for indications of an error. Good packets are passed to the X0 chip of the SLAAC-1V for processing, while bad packets are discarded.
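The receive-side filtering decision can be summarized by the following software model. This is a hypothetical C sketch of the behavior just described, not the daughter-card HDL; the status-word layout and error mask are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the GRIP daughter-card receive filter: a frame is held in the
 * ZBT SRAM buffer until the MAC writes its end-of-frame status word, then
 * it is forwarded to X0 only if no error bits are set. */
struct rx_frame {
    const uint8_t *data;    /* frame bytes buffered in ZBT SRAM */
    size_t         length;  /* frame length in bytes            */
    uint32_t       status;  /* status word written by the MAC   */
};

/* Illustrative error mask (e.g. CRC/length/PHY errors); the real bit
 * positions depend on the MAC and are not specified in the paper. */
#define RX_ERROR_MASK 0x00000007u

static bool frame_is_good(const struct rx_frame *f)
{
    return (f->status & RX_ERROR_MASK) == 0;
}

void filter_and_forward(const struct rx_frame *f,
                        void (*forward_to_x0)(const uint8_t *, size_t))
{
    if (frame_is_good(f))
        forward_to_x0(f->data, f->length);  /* good frame: hand to X0 */
    /* bad frame: dropped; its buffer space is simply reclaimed       */
}
```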

3.4. X0 Packet Processing

The X0 FPGA on the SLAAC-1V base card is the bridge between the other three FPGAs and the DMA engine, as shown in Figure 6(a). A sample packet flow is shown in Figure 6(b); X1 could hold an encryptor for outbound packets and X2 a decryptor for inbound packets, for instance. Packets are transferred between FPGAs via a common communications interface. This interface has two 34-bit, dual-clock FIFOs, allowing the core computation clock to be independent of the I/O clock. The 34-bit FIFO paths have 32 data bits and 2 packet framing bits. The framing bits delimit packet boundaries by indicating "start-of-frame" and "end-of-frame". The communications interface also multiplexes a control channel into the chip-to-chip data stream. This control channel is used for infrequent slave accesses, like reading or writing control registers.
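The 34-bit FIFO word described above (32 data bits plus two framing bits) can be modeled in software as shown below. The placement of the start-of-frame and end-of-frame flags in bits 33 and 32 is an assumption for illustration; the paper does not specify the exact bit assignment.

```c
#include <stdbool.h>
#include <stdint.h>

/* One word on the 34-bit inter-FPGA FIFO path: 32 data bits plus the
 * "start-of-frame" and "end-of-frame" delimiters. */
typedef struct {
    uint32_t data;  /* 32 data bits             */
    bool     sof;   /* start-of-frame delimiter */
    bool     eof;   /* end-of-frame delimiter   */
} fifo_word_t;

/* Pack into a 34-bit value carried in the low bits of a 64-bit word
 * (flag positions are illustrative only). */
static inline uint64_t fifo_pack(fifo_word_t w)
{
    return ((uint64_t)w.sof << 33) | ((uint64_t)w.eof << 32) | w.data;
}

static inline fifo_word_t fifo_unpack(uint64_t raw)
{
    fifo_word_t w = {
        .data = (uint32_t)(raw & 0xFFFFFFFFu),
        .sof  = (raw >> 33) & 1u,
        .eof  = (raw >> 32) & 1u,
    };
    return w;
}
```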

As shown in Figure 6(a), a packet switch interconnects the three communications modules and the DMA controller. In the current prototype system, the device driver sets the switch to a static packet-routing configuration. The static switch supports a number of different routing configurations. In future implementations, a dynamic packet-switching network allowing routing decisions to be made inside X0 will be developed.
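For illustration only, the static routing configurations selected by the device driver might be represented as follows. The route names, the register address, and the idea that a single register selects the configuration are hypothetical; they simply mirror the sample flow of Figure 6(b).

```c
/* Hypothetical encoding of static packet-routing configurations for the
 * X0 packet switch (names and values are illustrative only). */
enum grip_switch_route {
    GRIP_ROUTE_PASSTHROUGH = 0, /* PCI <-> GRIP port, no packet processing */
    GRIP_ROUTE_TX_X1_RX_X2 = 1, /* transmit via X1, receive via X2         */
    GRIP_ROUTE_TX_X2_RX_X1 = 2, /* transmit via X2, receive via X1         */
    GRIP_ROUTE_LOOPBACK_X1 = 3, /* PCI -> X1 -> PCI, for testing a core    */
};

#define X0_SWITCH_CFG_REG 0x0010u   /* placeholder control-register offset */

/* The driver would program the switch once at interface bring-up. */
void grip_set_switch_route(void (*write_reg)(unsigned reg, unsigned val),
                           enum grip_switch_route route)
{
    write_reg(X0_SWITCH_CFG_REG, (unsigned)route);
}
```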

The DMA engine is a crucial component for attaining the desired line rates and packet-processing functionality.

Figure 6. (a) Block diagram of the X0 design. (b) Sample packet flow through X0.

Figure 7. The GRIP X0 DMA Engine (dual-clock FIFOs, 255-deep transfer tables, byte-wise barrel rotators, transfer scheduling, buffer/sequencer/controller, and framing logic between the X0 packet switching network and the 64/66 PCI bus)

The original SLAAC-1V design incorporated an efficient 32/33 PCI design, using the Xilinx PCI core and a custom DMA engine which controlled the bus-master capabilities of the Xilinx core. This original DMA engine was optimized in four ways; the resulting design is illustrated in Figure 7. First, it was modified to work with 64/33 or 64/66 PCI, in order to reach the bi-directional 1 Gb/s processing target. Second, deep scatter-gather tables were added using Virtex BlockRAMs in order to allow the DMA engine to process up to 255 independent packets in each direction between host interrupt responses. Third, a 64-bit, 8-way barrel rotator was added to the DMA engine to allow the engine to perform transfers with arbitrary byte alignment. Finally, logic was added to the DMA engine to generate the framing bits used by the communications modules to delimit packet boundaries.
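A software view of one entry in the 255-deep transfer tables, together with the byte-lane rotation the barrel rotator performs, might look like the following. The field layout is hypothetical; it is meant only to show how per-packet scatter-gather entries, arbitrary byte alignment, and framing-bit generation fit together.

```c
#include <stdint.h>

/* Hypothetical entry in the 255-deep scatter-gather tables held in Virtex
 * BlockRAM: one independent packet transfer per entry. */
struct grip_dma_desc {
    uint64_t host_addr;    /* PCI address of the host buffer             */
    uint32_t length;       /* transfer length in bytes                   */
    uint8_t  byte_offset;  /* starting byte lane (0-7); the 8-way barrel */
                           /* rotator realigns data so host buffers need */
                           /* not be 8-byte aligned                      */
    uint8_t  gen_framing;  /* nonzero: emit SOF/EOF framing bits         */
    uint16_t flags;        /* direction, interrupt-on-completion, etc.   */
};

/* Byte-wise rotation performed by the 64-bit, 8-way barrel rotator:
 * rotate a 64-bit word right by 'lanes' byte positions. */
static inline uint64_t barrel_rotate_bytes(uint64_t word, unsigned lanes)
{
    unsigned bits = (lanes % 8u) * 8u;
    return bits ? (word >> bits) | (word << (64u - bits)) : word;
}
```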

3.5. GRIP Software

The software side of the GRIP platform can be divided into two parts: a generalized driver and a modified kernel/application function for which hardware assistance is used. The fundamental role of the driver is to present the GRIP card as a standard Gigabit Ethernet network interface to the operating system. The offloaded function is application-specific and, in this project, it is the IPSec protocol layer with its specified cryptographic transformations.

3.5.1. GRIP Driver

The most crucial component of the software is the driver. Unlike drivers for standard reconfigurable computing cards, which are usually character devices, the GRIP driver registers itself as a network driver. The GRIP driver is partially derived from the driver for the SysKonnect [6] Gigabit Ethernet card, which shares the same Media Access Controller chipset, XMAC II [7]. Currently, the GRIP driver registers itself as a regular Gigabit Ethernet driver. Its design uses the well-known ring-buffer approach with one addition. Because the SLAAC-1V board supports pseudo-scatter-gather DMA transfers (see Subsection 3.1), the driver is capable of scheduling up to 255 buffers for transmit in the DMA control FIFO without waiting for the completion of each particular transmission. Thanks to this feature, performance can be greatly increased because the driver becomes independent of the jitter in the data flow introduced either by the hardware or the software. Furthermore, when the transmit control FIFO gets full, the driver introduces a new pointer which records the last buffer sent to the hardware and disassociates this pointer from the ring-buffer tail. In this fashion, the driver can withstand a temporary overflow without stopping the kernel transmit queue.
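The transmit-side bookkeeping described above can be sketched as follows. This is a simplified model, not the actual driver source; the ring size, field names, and callback are assumptions for illustration.

```c
#include <stddef.h>

#define TX_RING_SIZE  256   /* software ring of transmit buffers           */
#define HW_FIFO_DEPTH 255   /* depth of the card's DMA control FIFO        */

struct grip_tx_ring {
    void    *buf[TX_RING_SIZE];
    unsigned head;         /* next slot the kernel will fill               */
    unsigned tail;         /* oldest buffer not yet completed by hardware  */
    unsigned last_posted;  /* last buffer actually handed to the DMA FIFO  */
    unsigned in_flight;    /* entries currently queued in the DMA FIFO     */
};

/* Post as many pending buffers as the hardware FIFO will accept.  When
 * the FIFO is full, 'last_posted' stops advancing while 'head' keeps
 * accepting packets, so a temporary overflow does not force the kernel
 * transmit queue to stop. */
void grip_tx_kick(struct grip_tx_ring *r, void (*post_to_dma)(void *buf))
{
    while (r->last_posted != r->head && r->in_flight < HW_FIFO_DEPTH) {
        post_to_dma(r->buf[r->last_posted]);
        r->last_posted = (r->last_posted + 1) % TX_RING_SIZE;
        r->in_flight++;
    }
}
```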

3.5.2. IPSec Protocol Stack

The GRIP project will use the FreeS/WAN implementation of IPSec [2] to fully integrate IPSec functionality into the Linux TCP/IP stack. In the demonstration testing, the entire IPSec layer is implemented in hardware, in very rudimentary form. In the final GRIP system, only the cryptographic transforms and checksumming will be delegated to the hardware. Since the IPSec stack registers its services with the kernel, the GRIP driver will have to do the same and present itself as both a Gigabit Ethernet interface and an IPSec interface. The IPSec stack must also be modified to accommodate the presence of the hardware accelerator. The transmit and receive functions will be modified such that packets are sent to the GRIP card without calling the software cryptographic transforms. The management of keys and the security policy database will not be accelerated initially.

4. Experimental Results

Testing of the SLAAC-1V / GRIP cards (hereafter called "GRIP cards") has consisted of three aspects: basic functionality, functionality with packet processing, and bandwidth. Basic functionality has been demonstrated by connecting hosts with GRIP cards to the internet. These hosts are able to perform normal network functions using the GRIP card with reasonable reliability. IPSec is being used as a proof of concept of packet-processing applications. Cores for 128-bit AES (Rijndael) encryption and decryption developed by [10] are placed within the GRIP framework to test IPSec functionality. Two GRIP cards connected together are able to communicate with encrypted data.

Figure 8. SLAAC-1V DMA throughput

The final test is to measure the bandwidth achievable with a GRIP card. The remainder of this section details experiences with bandwidth testing.

4.1. Testing SLAAC-1V DMA performance

First, the DMA performance of the SLAAC-1V DMA engine was tested independently. The results are shown in Figure 8. Tests consisted of performing 512 DMA transfers of 2 MB each, in each direction. Tests were run using the standard SLAAC-1V device drivers on a Dell Dimension 4100 PC (850 MHz P-III) with a Linux 2.4 kernel for 32/33 PCI testing, and a Dell Precision 620 workstation (733 MHz Xeon) with Windows NT 4.0 for 64/66 PCI. The graph indicates that up to 96% PCI bus efficiency is achieved with 32/33 PCI (132 MB/s theoretical maximum). At 64/66 PCI, the bus efficiency decreases to 52% (520 MB/s theoretical maximum). Independent measurements of maximum PCI bandwidth for updated versions of the Dell Precision were similar to our findings (they measured 227 MB/s and 315 MB/s) [3]. This suggests that the lower bus efficiency is primarily due to bandwidth limitations of the test PC.
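In absolute terms, the reported efficiencies correspond to the following sustained rates (simple arithmetic on the figures above):

```latex
% Sustained DMA throughput implied by the measured PCI bus efficiencies.
\[
0.96 \times 132\ \mathrm{MB/s} \;\approx\; 127\ \mathrm{MB/s}
\quad (32\text{-bit}/33\ \mathrm{MHz}),
\qquad
0.52 \times 520\ \mathrm{MB/s} \;\approx\; 270\ \mathrm{MB/s}
\quad (64\text{-bit}/66\ \mathrm{MHz}).
\]
```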

4.2. Testing the GRIP infrastructure bandwidth

The bandwidth limitations of the prototype GRIP card are being tested using Dell Precision 620 workstations with the Linux 2.4 kernel. The GRIP driver presents the GRIP card as a standard network interface. The network utility "iperf" is used to test the GRIP card over a range of packet sizes ("MTU"). The maximum throughput listed is the highest data rate at which the GRIP card completed the iperf test with less than 1% packet loss.

Figure 9. GRIP network throughput

Figure 9 indicates that achievable bandwidths range from 610 Mb/s with "jumbo-frame" packets (8900 bytes) to 300 Mb/s with default packet sizes (1500 bytes).

Initial performance results are promising; however, testing is still on-going. Significant system bottlenecks still exist that are currently being addressed. When these bottlenecks are addressed, the GRIP architecture is expected to achieve the 1 Gb/s performance target. The current limitations include:

1. Clock rate limitations: the maximum clock rate for the GRIP daughter card currently limits the maximum throughput. The source of this limitation is still being investigated, but the potential causes will be addressed in the second-generation card described in Section 5.

2. Asynchronous flow control: the GRIP system requires a number of independent clocks. Communication between clock domains is error-prone and is currently believed to limit the performance seen.

3. GRIP driver: the GRIP driver is in the early stages of development and has not been fully optimized.

These results are significant, but not because of the performance achieved thus far. Indeed, the results are promising, and 1 Gb/s host-to-host network traffic is expected to be achieved in the near future. However, the more important contribution is an architecture that provides an open, programmable, reconfigurable framework to offload arbitrary packet operations. Most other network accelerators are either not openly programmable or are limited to the physical or link layers of the protocol stack. The unique position of the GRIP card in the network protocol stack allows it to offload higher levels of network processing.

4.3. An example application: IPSec

As a proof-of-concept, the X1 and X2 FPGAs have been loaded with AES (Rijndael) encryption cores developed for the SLAAC-1V board at George Mason University. These encryption cores have been demonstrated to run on a SLAAC-1V at system speeds of up to 80 MHz and a bandwidth of 887 Mb/s [10]. Although these measurements were taken outside the GRIP framework, the cores have been modified and integrated with the TCP/IP stack as part of the GRIP framework. The modifications consisted of adding header-processing capabilities and extra flow control to the core. The modifications also include the addition of "encryption rules" about which packets to encrypt (e.g., control packets such as ARP are not encrypted). This system was for demonstration purposes only. The real GRIP IPSec system will implement the IPSec header processing with a Linux kernel patch.

Testing utilized two connected GRIP cards loaded with encryption cores. Performance measured with iperf is currently limited to 50 Mb/s. Two issues are currently being investigated as performance-limiting factors: decryption failures with larger-than-1500-byte packets, and flow-control issues that might have been introduced with the addition of the extra header-processing logic. Header processing will soon be moved back into the kernel. Full gigabit-rate encrypted bandwidth is expected when these issues are addressed.

5. Next-Generation Architecture

The original GRIP card was built as a prototype device to allow experimentation with an FPGA-based network accelerator. The original design has a number of limitations. Key among these are a lack of memory resources, a lack of memory bandwidth, a lack of logic resources, and cost. For the second-generation card, GRIP2, a new requirements analysis has been performed.

5.1. Requirements Analysis

Work with current applications has revealed four requirements for the next-generation GRIP card that are not addressed by the current GRIP card: additional logic resources, additional RAM, additional RAM bandwidth, and low cost. Additional logic resources are needed to support the network security applications. The current GRIP card is expected to deliver gigabit-rate encryption, but additional features such as key management are desired. In addition, achieving full-duplex gigabit speeds may require that further protocol processing be offloaded from the host. Another application that needs additional logic resources is intrusion detection, which cannot be easily implemented with the current design.

Additional RAM resources are needed for network security, parallel computing, and protocol processing applications. For IPSec, in particular, the hardware accelerator needs to be capable of reassembling IP packets that are fragmented by the network.

Figure 10. A block diagram of the second-generation GRIP card (a Xilinx Virtex-II 1000 connecting 64-bit 66 MHz host PCI, 2 MB 200 MHz ZBT-SRAMs, a 133 MHz SDRAM SO-DIMM slot, and a Vitesse VSC8840 XMAC-II attached to a Gigabit Ethernet switch over 36-bit paths)

This requires storage for the packets as they are reassembled. Similarly, protocol processing requires storage for received packets, transmitted packets that have not been acknowledged, and state information about each connection. Some applications discussed for parallel computing [29] could also take advantage of additional RAM resources.

RAM bandwidth is always a concern when processing gigabit-rate data. The current GRIP card has approximately 2 Gb/s of RAM bandwidth, which is only enough to buffer packets in a single direction. The current configuration requires that this be used as a receive FIFO. For a next-generation GRIP card, the desire is to provide enough RAM bandwidth for buffering in both the send and receive directions (4 Gb/s). This is required for protocol processing and for parallel computing applications. In addition, it is important to have additional RAM bandwidth for the storage of connection state and for other application processing.
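The 4 Gb/s figure follows from the fact that buffering a stream in RAM touches every byte twice, once on the write and once on the read:

```latex
% Memory bandwidth required to buffer full-duplex gigabit traffic.
\[
B_{\mathrm{RAM}} \;=\; \underbrace{2}_{\text{write}+\text{read}}
                  \times \underbrace{2}_{\text{directions}}
                  \times 1\ \mathrm{Gb/s}
                 \;=\; 4\ \mathrm{Gb/s}.
\]
```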

Cost is a factor for any application, though it is particularly relevant to parallel computing. Parallel computing applications of this technology require a card in each node of a cluster computer. This is an issue because the current GRIP card requires an expensive SLAAC-1V reconfigurable computing card as a carrier.

5.2. GRIP2

Figure 10 illustrates a design to address the requirements outlined in Subsection 5.1. This design will be implemented as a PMC form-factor card that can be coupled with the upcoming Osiris card (Figure 11).

Figure 11. A block diagram of the Osiris card

The Osiris card provides a Xilinx Virtex-II 6000 with an SO-DIMM interface and ten banks of 250 MHz ZBT-SRAM. This provides significant additional logic resources and RAM bandwidth for use by applications. In addition, the PMC interface is easily coupled to a PMC-PCI converter card, allowing the GRIP2 card to operate stand-alone and lowering the cost. On the GRIP2 card, a Xilinx Virtex-II 1000 is used as the FPGA device to provide adequate resources to implement the PCI interface as well as some application logic while minimizing costs. An SO-DIMM slot was used to provide a large buffer capable of at least 4 gigabits per second of sequential access bandwidth. In addition, two 36-bit wide, 2 MB ZBT-SRAMs are provided to hold protocol-related state and to provide the random access bandwidth needed by the 2D-FFT application [28].
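Assuming a standard 64-bit-wide SO-DIMM at the 133 MHz shown in Figure 10 (the width is an assumption; the paper does not state it), the peak sequential bandwidth comfortably exceeds the 4 Gb/s requirement of Subsection 5.1:

```latex
% Peak bandwidth of a 64-bit SDRAM SO-DIMM clocked at 133 MHz.
\[
64\ \mathrm{bits} \times 133 \times 10^{6}\ \mathrm{Hz}
\;\approx\; 8.5\ \mathrm{Gb/s} \;>\; 4\ \mathrm{Gb/s}.
\]
```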

6. Related Work

Since IPSec software implementations pose a substantial burden for system resources, hardware accelerators have been in development since the dawn of the IPSec standard. In general, the approaches can be divided into three categories: stand-alone, dedicated hardware platforms (VPN gateways); crypto-accelerators; and crypto-accelerators with a built-in network interface. Nearly all vendors of perimeter VPN devices employ the first approach. It does not pose a significant challenge for system resource utilization because it uses proprietary technologies and essentially limitless resources. The dedicated crypto-card solution reduces the CPU overhead by offloading the computationally expensive cryptographic transforms to hardware, but overwhelms the PCI bus at high data rates. The GRIP project has been developing the integrated packet processing / crypto-accelerator solution. While we are not aware of similar research efforts investigating the integration of reconfigurable crypto- and packet processing, a number of people have been pursuing FPGA crypto-core research and development [21, 11, 14, 15, 16, 24, 19]. Similarly, FPGAs were used in an ATM firewall application as presented in [20]. A few commercial products employing proprietary, dedicated hardware such as custom crypto/network processors are available as well, from Hifn [5] and Corrent [4].

A number of other efforts have demonstrated the usefulness of dedicated network processing. These efforts have shown that an embedded processor on the network interface can enhance parallel computing by improving bandwidth or latency or by adding special features. Examples of these efforts include HARP [23], Typhoon [25], RWCP's GigaE PM project [27], and the University of British Columbia's GMS-NP project [12]. Most recently, [26] demonstrated that the two embedded processors on an Alteon Gigabit Ethernet card could provide protocol processing at near-gigabit speeds. Each of these efforts relies on embedded processor(s) and only attempts to achieve peak data rates with unencrypted data in a single direction on the network. The goal of GRIP, by contrast, is to achieve simultaneous bi-directional gigabit throughput with encrypted data. For this, additional processing power is needed on the network interface.

The advantages of an FPGA on a network interface for an application were first presented as Sepia [22], a system using a smart network adapter in a 3D rendering application. The usefulness of this type of system was further illustrated for parallel applications in [28, 29]. The card architecture presented here addresses the shortcomings of the prototype used in that work.

7. Conclusions

Network interface speeds today are such that general-purpose CPU host systems struggle to process network data and still have resources left for application processing. In addition, the amount of processing required on network data is increasing due to the need for security protocols, cryptographic processing, intrusion detection, and IP fragmentation / reassembly. To further complicate the problem, many of these new network data processing requirements do not lend themselves to efficient processing with a general-purpose CPU architecture. However, this type of processing is very well suited to hardware systems, where the packet nature of network data allows easy exploitation of the parallel processing that can be implemented in hardware.

The GRIP architecture presented provides an example of how hardware offload and assist mechanisms can allow host resources to be dedicated to their main purpose, which is application processing. In addition, the combination of hardware-based processing with real (or near real) time reconfiguration provides many opportunities for further assistance to host systems in response to new application or network processing requirements. Future areas of research will include extension of the GRIP architecture to offload additional network processing requirements, including IP, TCP, and SSL.

References

[1] Advanced Encryption Standard development effort. http://www.nist.gov/aes.

[2] FreeS/WAN IPsec implementation. http://www.freeswan.org/.

[3] http://www.conservativecomputer.com/myrinet/perf.html.

[4] http://www.corrent.com.

[5] http://www.hifn.com.

[6] SysKonnect Gigabit Ethernet card. http://www.syskonnect.com/.

[7] XMACII chipset. http://www.vitesse.com/.

[8] NIST's efficiency testing for round 1 AES candidates. Technical report, NIST, 1999.

[9] M. R. et al. Performance of protocols. In Security Protocols - 7th International Workshop, pages 140-146, 2000.

[10] P. Chodowiec, K. Gaj, P. Bellows, and B. Schott. Experimental testing of the gigabit IPSec-compliant implementations of Rijndael and Triple-DES using SLAAC-1V FPGA accelerator board. In Proceedings of the 4th International Information Security Conference, Malaga, Spain, Oct. 2001.

[11] P. Chodowiec, P. Khuon, and K. Gaj. Fast implementations of secret-key block ciphers using mixed inner- and outer-round pipelining. In Proc. ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Feb. 2001.

[12] Y. Coady, J. S. Ong, and M. J. Feeley. Using embedded network processors to implement global memory management in a workstation cluster. In Proceedings of The Eighth IEEE International Symposium on High Performance Distributed Computing, Redondo Beach, California, USA, Aug. 1999.

[13] J. Daemen and V. Rijmen. Rijndael: Algorithm specification. http://csrc.nist.gov/encryption/aes/rijndael/.

[14] A. Dandalis, V. Prasanna, and J. Rolim. A comparative study of performance of AES candidates using FPGAs. In The Third Advanced Encryption Standard (AES3) Candidate Conference, Apr. 2000.

[15] A. Dandalis, V. K. Prasanna, and J. D. P. Rolim. An adaptive cryptographic engine for IPSec architectures. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, pages 132-141, Napa Valley, CA, April 2000.

[16] A. Elbirt, W. Yip, B. Chetwynd, and C. Paar. An FPGA implementation and performance evaluation of the AES block cipher candidate algorithm finalists. In The Third Advanced Encryption Standard (AES3) Candidate Conference, Apr. 2000.

[17] S. Kent and R. Atkinson. RFC 2401: Security architecture for the internet protocol, 1998.

[18] A. Keromytis. Some IPSec performance indications. In 52nd Internet Engineering Task Force meeting, IPsec WG, 2001.

[19] M. Leong, O. Cheung, K. Tsoi, and P. H. W. Leong. A bit-serial implementation of the international data encryption algorithm IDEA. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, pages 122-131, Napa Valley, CA, April 2000.

[20] J. T. McHenry, P. W. Dowd, F. A. Pellegrino, T. M. Carrozzi, and W. B. Cocks. An FPGA-based coprocessor for ATM firewalls. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 30-39, Napa Valley, CA, April 1997.

[21] M. McLoone and J. McCanny. High performance single-chip FPGA Rijndael algorithm implementations. In Third International Cryptographic Hardware and Embedded Systems Workshop - CHES 2001, May 2001.

[22] L. Moll, A. Heirich, and M. Shand. Sepia: scalable 3D compositing using PCI Pamette. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, pages 146-155, Napa Valley, CA, April 1999.

[23] T. Mummert, C. Kosak, P. Steenkiste, and A. Fisher. Fine grain parallel communication on general purpose LANs. In Proceedings of the 1996 International Conference on Supercomputing (ICS96), pages 341-349, Philadelphia, PA, USA, May 1996.

[24] C. Patterson. High performance DES encryption in Virtex FPGAs using JBits. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, pages 113-121, Napa Valley, CA, April 2000.

[25] S. K. Reinhardt, J. R. Larus, and D. A. Wood. Tempest and Typhoon: User-level shared memory. In International Conference on Computer Architecture, pages 260-267, Chicago, Illinois, USA, Apr. 1994.

[26] P. Shivam, P. Wyckoff, and D. Panda. EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet message passing. In Proceedings of the 2001 Conference on Supercomputing, Nov. 2001.

[27] S. Sumimoto, H. Tezuka, A. Hori, H. Harada, T. Takahashi, and Y. Ishikawa. The design and evaluation of high performance communication using a Gigabit Ethernet. In International Conference on Supercomputing, pages 260-267, Rhodes, Greece, June 1999.

[28] K. Underwood, R. Sass, and W. Ligon. Acceleration of a 2D-FFT on an adaptable computing cluster. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, April 2001.

[29] K. D. Underwood, R. R. Sass, and W. B. Ligon. A reconfigurable extension to the network interface of Beowulf clusters. In Proceedings of the 2001 IEEE Conference on Cluster Computing, pages 212-221, Newport Beach, CA, October 2001.