Automatic HDL-based generation of homogeneous hard macros for FPGAs

Sebastian Korf∗, Dario Cozzi∗‡, Markus Koester∗, Jens Hagemeyer∗, Mario Porrmann∗, Ulrich Ruckert†, and Marco D. Santambrogio‡

∗System and Circuit Technology, University of Paderborn, Germany
†Cognitronics and Sensor Systems, Bielefeld University, Germany
‡Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy

{korf, lampa, koester, jenze, porrmann}@hni.uni-paderborn.de, [email protected], {dario.cozzi, marco.santambrogio}@dresd.org

Abstract: The regularity of resources found in FPGAs is a unique feature that can be utilized in a number of applications, e. g., in timing-critical applications or applications with a demand for homogeneous routing. Current synthesis tools do not support the automatic generation of homogeneous FPGA designs, so that a time-consuming hand-crafted design is required. We present a tool flow that automatically generates homogeneous hard macros for Xilinx FPGAs starting from a high-level description, such as VHDL. Key functionalities of the tool flow are a homogeneous placer and a suitable routing algorithm, which aim at maintaining the homogeneity of the resulting hard macro. The place and route tools use a resource library that is automatically generated for the target FPGA family by extracting relevant information from the vendor tools. The tool chain is demonstrated for the design of hard macros for a time-to-digital converter and a tiled partially reconfigurable region. The resulting designs are evaluated with respect to resource requirements and timing constraints.

I. INTRODUCTION

Reconfigurable architectures have become one of the key implementation platforms for digital circuits in many different application areas. Although FPGAs progressively changed from homogeneous architectures containing identical logic cells to heterogeneous architectures comprising different types of cells (CLB, BRAM, DSP, etc.), the structure itself is still regular and piecewise homogeneous.

The properties of this regularity are utilized in timing-critical applications, such as ring oscillators or time-to-digital converters [1], [2]. In the latter, the regularity of the routing structure is used to measure the time difference between two electrical pulses with a precision of picoseconds. Another area that exploits the regularity of the FPGA resources is the field of partially reconfigurable (PR) systems. In [3] a PR system is described, where hardware modules can be placed at positions with the same resource arrangement they are built from. Thus, the regularity of the PR region increases the flexibility for the placement of the modules.

The application scenarios described above utilize the regularity of the FPGA by implementing repeated design patterns using the same types of resources and wires. The necessary regularity of the designs is achieved by manual place and route, which is a tedious and error-prone process. Current commercially available FPGA place and route tools lack an option for generating this type of regular design. This paper presents a novel design flow called Design flow for Homogeneous Hard Macros (DHHarMa), which targets the automatic generation of homogeneous and regular designs starting from a high-level description, such as VHDL or Verilog. This gives the designer the ability to quickly create or modify designs that must be homogeneously placed and routed.

The core components of the design flow are a packer, placer, and router, which generate a homogeneously structured design based on a user-defined FPGA partitioning. The design flow currently supports the Xilinx Virtex-4, -5, -6, and Spartan-6 FPGA families. In this paper the flow is demonstrated for a) a communication infrastructure for partially reconfigurable regions and b) delay lines used in time-to-digital converters. The main contributions of the paper are summarized as follows:

• Introduction of a homogeneous packing and placement algorithm to ensure identical placement of resources within predefined regions.

• Presentation of a homogeneous routing algorithm, which ensures identical routing within predefined regions.

The rest of the paper is structured as follows. Section II presents existing work that is closely related to the presented approach. In Section III, the DHHarMa design flow for homogeneous hard macros is described, focusing on a new FPGA resource database, the packing and placement algorithm, and the homogeneous router. Section IV evaluates the homogeneous hard macros with respect to timing and resource requirements. Finally, a conclusion is given in Section V.

II. RELATED WORK

Regularity of resources is particularly required in the domain of time-to-digital converters, where homogeneous delay chain structures are used to achieve high resolutions.

The TDC implementations presented in [1] and [2] utilize the homogeneity of the FPGA resources and allow for resolutions in the range of picoseconds. In both works the delay chains are generated manually.

In the domain of run-time FPGA reconfiguration, the communication infrastructure, which is typically realized using fixed, pre-routed communication macros, also requires (semi-)automatic placement and routing of FPGA resources. Several works, e. g. [3]–[5], describe such design tools for homogeneous communication macros. These tools allow composing a variety of communication infrastructures from a set of pre-synthesized primitives, but they are not suitable for creating homogeneous hard macros from an HDL description.

Previous work in the area of pack, place, and route was done by Betz et al. (T-VPACK [6] and VPR [7]). Although widely used in the community (e. g. [8]), these tools are based on a generic FPGA model and cannot be used to produce bitstreams for commercially available FPGAs.

RapidSmith [9] is a tool for automatic generation of hard macros starting from a high-level description, such as VHDL. The tool aims at reducing the design time by composing systems from pre-synthesized hard macros, instead of undergoing the complete synthesis process. The tool uses the Xilinx place and route tools, which cannot be constrained for generating homogeneous macros.

III. DESIGN FLOW FOR HOMOGENEOUS HARD MACROS

Hard macros are pre-placed and pre-routed designs, which can be integrated into a system design without requiring an additional place and route process. In order to obtain a homogeneous hard macro, a suitable packer, placer, and router are required. We developed the DHHarMa design flow, which targets the generation of homogeneous hard macros for Xilinx FPGAs. It uses a Xilinx-based frontend for generating an unplaced and unrouted design description for the target FPGA device. The DHHarMa backend uses the output from the frontend and generates a hard macro description by applying packing, placing, and routing in a homogeneous manner. The complete design flow is depicted in Figure 1. The following subsections describe the design flow in detail.

[Figure 1 blocks: High-Level Description (HDL), Xilinx-based Frontend, Mapped Hard Macro (XDL), DHHarMa Backend with FPGA partitioning (CSV) as additional input, Hard Macro Description (NMC)]

Figure 1. DHHarMa design flow for generating homogeneous hard macros

A. Xilinx Design Language (XDL)

The XDL representation of the design is used as an intermediate format to pack, place, and route the hard macro homogeneously. XDL describes either designs or the structure and resources of Xilinx FPGAs. The Xilinx XDL tool can transform a design file (NCD) into an equivalent XDL file or generate a report file that describes the structure and resources of a Xilinx FPGA. Unfortunately, neither the language nor the XDL tool is documented by Xilinx.

# Properties
design "bm_top" xc6vcx75tff484-2 v3.2 ,

# Instances
#All instances (SLICE, IOB, DSP, BRAM ...)
inst "Rec0_a_[0]_(Slice_1/2)" "SLICEL",placed CLBLM_X12Y118 SLICE_X19Y118 ,
  cfg " A5FFINIT::#OFF A5FFMUX::#OFF A5FFSR::#OFF A5LUT::#OFF A6LUT:Rec0_a/inst_g[0].base_element_1/LUT2_4.A6LUT:#LUT:O6=(A5*A6) ACY0::#OFF AFF::#OFF AFFINIT::#OFF AFFMUX::#OFF AFFSR::#OFF AOUTMUX::#OFF AUSED::0
        B5FFINIT::#OFF B5FFMUX::#OFF B5FFSR::#OFF B5LUT::#OFF B6LUT:Rec0_a/inst_g[0].base_element_1/LUT2_1.B6LUT:#LUT:O6=(A5*A6) BCY0::#OFF BFF:Rec0_a/inst_g[0].base_element_1/FD_2.BFF:#FF BFFINIT::INIT1 BFFMUX::O6 BFFSR::SRLOW BOUTMUX::#OFF BUSED::0
        C5FFINIT::#OFF C5FFMUX::#OFF C5FFSR::#OFF C5LUT::#OFF C6LUT:Rec0_a/inst_g[0].base_element_1/LUT2_2.C6LUT:#LUT:O6=(A5*A6) CCY0::#OFF CFF:Rec0_a/inst_g[0].base_element_1/FD_3.CFF:#FF CFFINIT::INIT1 CFFMUX::O6 CFFSR::SRLOW COUTMUX::#OFF CUSED::0
        D5FFINIT::#OFF D5FFMUX::#OFF D5FFSR::#OFF D5LUT::#OFF D6LUT:Rec0_ainst_g[0].base_element_1/LUT2_3.D6LUT:#LUT:O6=(A5+A6) DCY0::#OFF DFF:Rec0_a/inst_g[0].base_element_1/FD_1.DFF:#FF DFFINIT::INIT1 DFFMUX::O6 DFFSR::SRLOW DOUTMUX::#OFF DUSED::#OFF
        CLKINV::CLK COUTUSED::#OFF PRECYINIT::#OFF SYNC_ATTR::ASYNC " ;
inst "Rec1_b_[0]_(Slice_2/4)" "SLICEL",placed CLBLM_X12Y119 SLICE_X19Y119 ,
...

# Nets
#All nets connecting the instances
net "static_right_out<7>" ,
  outpin "Rec0_a_[0]_(Slice_1/2)" BQ ,
  inpin "Base0_[0]_(Slice_1/10)" AX ,
  inpin "Rec1_b_[0]_(Slice_2/4)" B6 ,
  inpin "Rec0_a_[0]_(Slice_1/2)" D5 ,
  pip CLBLM_X12Y118 CLBLM_L_BQ -> CLBLM_LOGIC_OUTS2 ,
  pip INT_X12Y118 LOGIC_OUTS2 -> IMUX_B20 ,
  pip CLBLM_X13Y117 CLBLM_IMUX_B20 -> CLBLM_L_B6 ,
  ... ;
net "static_right_out<8>" , ...

# Ports
#All Ports
port "CLK_int" "Base0_[11]_(Slice_2/3)" "CLK";
...

[Further annotations in the original figure: Packer, Placer, Router]

Figure 2. Excerpt of an XDL-file of a design for a Xilinx Virtex-6 FPGA

[Figure 3 blocks: FPGA partitioning parser (CSV), XDL parser (XDL), homogeneous packer, homogeneous placer, homogeneous router, FPGA resource database, XDL-to-NMC converter (NMC)]

Figure 3. Structure of the DHHarMa backend

In the following we provide a brief description of the language features relevant for the DHHarMa backend. In general, an XDL design file is separated into four sections, which are depicted in Figure 2. The properties section of XDL is used to define general properties for the design. The port section contains all ports, which are the connection points to the hard macro. The instances section lists all used blocks (Slices, IOB, BRAM, ...) of the design. These blocks are called (primitive) instances, abbreviated inst. The example in Figure 2 shows one instance, representing one slice of a Virtex-6 CLB (one CLB hosts two slices). One slice consists of four parts, where each part consists of two Look-Up-Tables (LUTs), two registers, and a carry chain. The first character of each switch within the configuration of a slice identifies the part it belongs to. The nets section of XDL contains the information about the connections between the instances that implement the design. A net consists of a set of pins, where the first pin in the list defines an output pin (outpin) and the others define the input pins (inpin). Each pin is linked to one previously defined instance. If the net is routed, the net section also contains the configuration of the Programmable Interconnect Points (PIPs), which define the connections between wires used to create the net.
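The net structure described above (one outpin, several inpins, optional PIPs) can be illustrated with a small parser. This is a minimal sketch, not the DHHarMa XDL parser: the function name `parse_net`, the returned dictionary layout, and the regular expressions are assumptions made for illustration, and only PIPs written with `->` as in Figure 2 are recognized.

```python
import re

# Hypothetical helper patterns; XDL itself is undocumented, so these only
# cover the statement shapes visible in Figure 2.
NET_RE = re.compile(r'net\s+"([^"]+)"')
PIN_RE = re.compile(r'(outpin|inpin)\s+"([^"]+)"\s+(\w+)')
PIP_RE = re.compile(r'pip\s+(\S+)\s+(\S+)\s+->\s+(\S+)')


def parse_net(statement):
    """Split one XDL net statement into its name, pins, and PIPs."""
    name = NET_RE.search(statement).group(1)
    pins = PIN_RE.findall(statement)
    outpins = [(inst, pin) for kind, inst, pin in pins if kind == "outpin"]
    inpins = [(inst, pin) for kind, inst, pin in pins if kind == "inpin"]
    pips = PIP_RE.findall(statement)  # (tile, source wire, sink wire)
    return {
        "name": name,
        "outpin": outpins[0] if outpins else None,
        "inpins": inpins,
        "pips": pips,
        "routed": bool(pips),  # a net without PIPs is still unrouted
    }


# Shortened version of the net shown in Figure 2.
EXAMPLE_NET = (
    'net "static_right_out<7>" , '
    'outpin "Rec0_a_[0]_(Slice_1/2)" BQ , '
    'inpin "Base0_[0]_(Slice_1/10)" AX , '
    'inpin "Rec1_b_[0]_(Slice_2/4)" B6 , '
    'pip CLBLM_X12Y118 CLBLM_L_BQ -> CLBLM_LOGIC_OUTS2 , '
    'pip INT_X12Y118 LOGIC_OUTS2 -> IMUX_B20 , ;'
)

if __name__ == "__main__":
    net = parse_net(EXAMPLE_NET)
    print(net["name"], net["outpin"], len(net["inpins"]), len(net["pips"]))
```

Keeping the instance name together with the pin name for every pin matches the statement that each pin is linked to one previously defined instance.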

B. Xilinx-based Frontend

The frontend of the DHHarMa flow uses the Xilinx FPGA tool chain to perform the technology mapping of the hard macro design. Starting from a high-level description in an HDL, the frontend generates an XDL representation of the mapped hard macro. The XST synthesis tool transforms the HDL source file into an architecture-specific netlist (native generic circuit - NGC). This file contains the logical design data and additional constraints (e. g., to specify timing or implementation constraints). The NGC file is a direct input for NGDBuild, which reads the netlist and transforms it into a logical design file (NGD). This file describes the design in elements such as logic gates (AND, OR, ...), flip-flops, and RAM. The MAP tool applies the device-specific mapping of the logical design file and translates it into a device-specific description using, e. g., CLBs, IOBs, and DSPs. The XDL tool finally converts the output of the mapping into an XDL description, which serves as an input for the DHHarMa backend.

C. DHHarMa Backend

The DHHarMa backend uses the technology-mapped design description from the frontend to apply homogeneous packing, placing, and routing. An overview of the backend is presented in Figure 3.

[Figure 4 labels: regions Base0, Rec0_a, Rec1_b, Rec2_c, Rec2_d, Rec2_e, Rec1_f, Rec2_g, Rec3_h; region coordinates; switch matrix; CLB]

Figure 4. Partitioning of a Xilinx Virtex-6 FPGA with nine regions of four different types

In addition to the design description, an FPGA partitioning is required that defines the location of the homogeneous regions of the system within the FPGA. The positions and sizes of several different region types can be specified in a CSV-file. A region is defined by two 2D-coordinates. The coordinates are retrieved using the Xilinx FPGA-Editor and mark the top-left and bottom-right position of the region. Consequently, only rectangular regions are supported. An example for a PR system on a Virtex-6 FPGA is shown in Figure 4. This partitioning uses four different types of reconfigurable regions (Rec0 - Rec3).
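A partitioning of this kind could be read as sketched below. The paper does not specify the CSV layout, so the column names (`name`, `type`, `x_left`, `y_top`, `x_right`, `y_bottom`), the coordinate values, and the helper names are purely hypothetical. The size comparison is only a necessary condition for homogeneity; it does not check that the regions offer identical resources.

```python
import csv
import io
from collections import defaultdict


def load_regions(csv_text):
    """Parse rectangular regions given by a top-left and a bottom-right corner."""
    regions = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        x1, y1 = int(row["x_left"]), int(row["y_top"])
        x2, y2 = int(row["x_right"]), int(row["y_bottom"])
        regions.append({
            "name": row["name"],
            "type": row["type"],
            "corners": ((x1, y1), (x2, y2)),
            # abs() keeps the sketch independent of the axis orientation.
            "width": abs(x2 - x1) + 1,
            "height": abs(y1 - y2) + 1,
        })
    return regions


def size_mismatches(regions):
    """Return the region types whose regions do not all share one size."""
    sizes = defaultdict(set)
    for region in regions:
        sizes[region["type"]].add((region["width"], region["height"]))
    return sorted(t for t, s in sizes.items() if len(s) > 1)


# Hypothetical partitioning with two regions of type Rec2 and one of type Rec3.
EXAMPLE_CSV = "\n".join([
    "name,type,x_left,y_top,x_right,y_bottom",
    "Rec2_c,Rec2,10,59,15,40",
    "Rec2_d,Rec2,20,59,25,40",
    "Rec3_h,Rec3,30,19,33,0",
])

if __name__ == "__main__":
    parsed = load_regions(EXAMPLE_CSV)
    print([(r["name"], r["width"], r["height"]) for r in parsed])
    print(size_mismatches(parsed))
```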

The packer, placer, and router require resource and structure information, such as the layouts of CLBs, switch-matrices, DSPs, etc., and their interconnections. We developed a tool to automatically generate this information for the target FPGA device by parsing a report generated by the Xilinx XDL tool. Details on the FPGA resource database generation are provided in Section III-D. Subsequently, the homogeneous packer rearranges the resources to guarantee a homogeneous packing of the slices within each region of the same region type. The functionality of the packer is described in Section III-E.

A part of a Virtex-6 slice comprises a function generator (including two LUTs) and two registers. A simplified structure of a part is depicted in Figure 5. After a disassembly process the parts are reassembled to build a homogeneous slice configuration, which is valid for all regions of the same type. The placer finally places the reassembled packed slices of each region in a way that the homogeneity of placement is maintained for all regions of the same type. Details on the placer are provided in Section III-F. After this step the router computes the paths for each net connecting the slices of the macro. The routing is performed such that each region of the same type uses the same routes. The functionality of the router is described in Section III-G. The last step is the creation of a new XDL file of the homogeneously packed, placed, and routed macro. Finally, the XDL representation is transformed into a Xilinx-conformant hard macro file (NMC) by using the Xilinx XDL tool.

D. FPGA Resource Database

For a selected FPGA, the Xilinx XDL tool can be used to generate a textual report representing the full description of the resources and the connections between them. In the report a resource block is denoted as a tile. Each tile encapsulates the following components: Primitive Site & Pinwire, Wire & Connection, and PIP. Tiles are arranged in a two-dimensional coordinate system. Listing 1 shows an XDL example of a tile for a Virtex-6 CLB.

A Primitive Site represents a subblock (e. g., a Slice of a CLB) and contains all Pinwires, which represent its inpins and outpins. What is called a Primitive Site in the report is called an inst in the XDL-file. A Wire describes a physical wire, and a Connection describes a connection between two different tiles. One PIP specifies one wiring possibility between the previously defined wires.

(tile 1 7 CLBLM_X1Y359 CLBLM 2
  (primitive_site SLICE_X0Y359 SLICEM internal 50
    (pinwire A1 input CLBLM_M_A1)
    ...
  )
  ...
  (wire CLBLM_LOGIC_OUTS0 1
    (conn INT_X1Y359 LOGIC_OUTS0)
  )
  ...
  (pip CLBLM_X1Y359 CLBLM_IMUX_B0 -> CLBLM_L_A5)
  ...
  (tile_summary CLBLM_X1Y359 CLBLM 95 309 149)
)

Listing 1. Excerpt of a Virtex-6 CLB description in an XDL report

The amount of space needed to store one report file can reach up to 24 GB (Virtex-6 LX760) depending on the size of the FPGA. This implies several disadvantages for direct use:

1) The reports for all FPGAs of a family require a large amount of storage (e. g., 260 GB for the Virtex-6 family).

2) The search time to find a specific component within the report is long.

3) The report description does not reflect the inherent homogeneity of the FPGA fabric.

For these reasons, a place and route process using the report itself is not feasible. An efficient data structure such as a database is required to provide fast access to the information of the various FPGA components. Especially the router has to check potentially millions of possible path variants when creating a single path between two pins. We developed a tool to read the report file and to create a corresponding FPGA resource image. The amount of space needed for a full Xilinx FPGA description is reduced to only a few MB on the hard drive. Table I shows information about the reports and the created images of the largest FPGA of each sub-family of all currently supported Xilinx FPGA families. The values of the loaded image describe the amount of memory needed (using a 64-bit Linux system). The execution time was measured for an i7-920 processor with 6 GB of RAM. An example for the efficiency of the data structure can be seen in the time needed to load the image.

Table I
SPACE AND TIMING INFORMATION ABOUT IMAGES OF DIFFERENT XILINX FPGA FAMILIES USED IN THE DATABASE

Family      FPGA     Report size [GB]   Compr. image [MB]   Loaded image [MB]   Time [mm:ss]
Virtex-4    FX140     8.3               5.5                 220                 00:15
Virtex-4    SX55      3.6               1.6                  75                 00:11
Virtex-4    LX200    10.4               2.1                 120                 00:20
Virtex-5    FX100T    5.4               2.8                 130                 00:08
Virtex-5    SX95T     5.6               1.9                  95                 00:08
Virtex-5    TX150T    6.8               2.1                 110                 00:10
Virtex-5    LX330    12.4               2.0                 105                 00:20
Virtex-6    CX240T    9.0               2.5                 130                 00:15
Virtex-6    SX475T   18.8               3.4                 195                 00:29
Virtex-6    LX760    24.2               3.5                 210                 00:39
Spartan-6   LX150T    4.1               5.6                 215                 00:07

E. Homogeneous Packer

Before the packing starts, the packer verifies the partitioning of the FPGA with respect to the homogeneity of the user-defined regions. If all regions of the same type offer identical resources, the packer proceeds with parsing the XDL description of the mapped hard macro. The macro is represented by ports, instances, and nets (cf. Figure 2). Each instance is mapped to the dedicated regions and the nets are assigned to the instances (declared by the outpin of the net).

In the next step a classification of the nets is performed. If a net connects instances within the same region, the net is classified as an intra-net. If the net contains at least one inpin that is located in a different region, the net is classified as an inter-net. It should be noted that a net with more than one inpin can belong to both net types. The classification of the nets is of great significance for the routing algorithm. After this classification, the connectivity between instances is analyzed. Instances connected by intra-nets are grouped into a cluster of instances. Clusters that provide the same functionality within regions of the same type are identified, ordered, and subsequently disassembled into a set of parts. The parts are categorized into three different types: fully-used (LUT and register of a part are used), LUT-used, and register-used.
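The classification and clustering steps can be sketched as follows. This is an illustrative simplification rather than the DHHarMa packer: pins are reduced to instance names, clusters are built with a union-find structure (an assumption, since the paper does not name the grouping method), and all identifiers are hypothetical.

```python
def classify_net(net, region_of):
    """Return the classes a net belongs to; a net can be both intra and inter."""
    source_region = region_of[net["outpin"]]
    return {
        "intra" if region_of[sink] == source_region else "inter"
        for sink in net["inpins"]
    }


def cluster_instances(nets, region_of):
    """Group instances that are connected through intra-region connections."""
    parent = {inst: inst for inst in region_of}

    def find(inst):
        while parent[inst] != inst:
            parent[inst] = parent[parent[inst]]  # path halving
            inst = parent[inst]
        return inst

    for net in nets:
        source = net["outpin"]
        for sink in net["inpins"]:
            if region_of[sink] == region_of[source]:
                parent[find(sink)] = find(source)

    clusters = {}
    for inst in region_of:
        clusters.setdefault(find(inst), set()).add(inst)
    return sorted(sorted(members) for members in clusters.values())


def part_type(lut_used, register_used):
    """Categorize a part as fully-used, LUT-used, or register-used."""
    if lut_used and register_used:
        return "fully-used"
    if lut_used:
        return "LUT-used"
    if register_used:
        return "register-used"
    return None  # unused parts are not categorized


# Hypothetical design: two instances in each of two regions.
REGION_OF = {"a0": "Rec0_a", "a1": "Rec0_a", "b0": "Rec1_b", "b1": "Rec1_b"}
NETS = [
    {"outpin": "a0", "inpins": ["a1", "b0"]},  # intra and inter at once
    {"outpin": "b0", "inpins": ["b1"]},        # purely intra
]

if __name__ == "__main__":
    print([sorted(classify_net(n, REGION_OF)) for n in NETS])
    print(cluster_instances(NETS, REGION_OF))
```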

[Figure 5 labels: part B in region Rec2_c and part C in region Rec2_d, each with a function generator (LUT 6, LUT 5), Reg 1 (latch or flip-flop), Reg 2 (flip-flop), bypass input, and carry logic]

Figure 5. Two parts implementing the same functionality within different regions of the same type

The cluster with the highest number of fully-used parts is declared as a reference for all other clusters within the regions of the same type. All parts of these clusters are then compared and potentially modified in order to obtain exactly the same configuration of the parts. Examples for possible modifications are described in the following. As illustrated in Figure 5, a boolean function can be implemented using two different types of LUTs within homogeneous regions. In region Rec2_d a LUT6 implements a function in part C, while in the homogeneous region Rec2_c a LUT5 implements the same function in part B. This case requires a conversion from a LUT6 to a LUT5. The same problem can also occur when different registers in a Virtex-6 Slice are used, which is also shown in Figure 5. In region Rec2_d the register 1 implements the flip-flop in part C, whereas in region Rec2_c the register 2 in part B is used. A conversion from register 1 to register 2 solves this inhomogeneity. Moreover, it is possible that the boolean function and the register are split into two parts, which requires both parts to be merged into one.

During a conversion, the inpins and outpins of a net are modified as well to guarantee the homogeneity of the macro in all regions of the same type. In the example shown in Figure 5, the LUT5 of part B of region Rec2_c would be converted into a LUT6. The nets with the pins connected to B1, B2, and B4 have to be modified to the pins used in the LUT6 of part C in region Rec2_d (C3, C5, and C6). In a second step the register 2 of part B of region Rec2_c would be converted into a register 1. The net that is connected to the outpin MUX has to be connected to outpin BQ now (which is the same pin used in part C in region Rec2_d).

The collected LUT-used and register-used parts of the clusters are combined into fully-used parts. Finally, all parts are assembled into new instances, which are now packed in a homogeneous way for all regions of the same type. In a last step the ports for the system integration of the hard macro are created.

F. Homogeneous Placer

Since regions of the same type offer the same resources, the placement is only done in one master region. The placement is replicated in all remaining slave regions of the same type as the master region. The structure of available slices of a Virtex-6 is irregular, such that a column contains either so-called LL-CLBs or LM-CLBs. LL-CLBs contain two L-slices, which offer two LUTs, a carry chain, and registers. LM-CLBs contain an L-slice and an M-slice, which provides additional features, e. g., storing data using distributed RAM and shifting data with 32-bit registers. Therefore, the functionality realized with an L-slice can also be realized with an M-slice, but not vice versa. Since a Virtex-6 FPGA contains only a few M-slices, the placer tries to avoid utilizing M-slices whenever possible, in order to preserve the resources for the remaining designs of the system.

Each cluster of instances is represented by a rectangular shape, which is large enough to accommodate all slices of the cluster. Due to the different types of CLBs (LL-CLB and LM-CLB) the size of the shape is defined conservatively, assuming that only one slice of a CLB will be used. All shapes of the clusters of a region are then arranged to fit into the master region. After this step the actual placement is done using a constructive cluster algorithm, which places the clusters of the master region slice by slice. After each placement within the master region, a relative coordinate is calculated. The position of the equivalent slice within each slave region is calculated using this relative coordinate to ensure that all regions of the same type share the same placement. In the XDL-file, the placement is specified by setting the XY-coordinates of the CLB and the corresponding slices in the instance of the XDL description (highlighted in Figure 2).
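The replication via relative coordinates can be written down in a few lines. The sketch below assumes that each region is identified by one origin coordinate and that slice positions are plain integer pairs; the function name, region origins, and slice names are hypothetical.

```python
def replicate_placement(master_origin, master_sites, slave_origins):
    """Copy a master placement into slave regions via relative coordinates."""
    mx, my = master_origin
    # Offset of every placed slice relative to the master region's origin.
    relative = {
        inst: (x - mx, y - my) for inst, (x, y) in master_sites.items()
    }
    # The same offsets applied to each slave origin give the slave placement.
    return {
        slave: {
            inst: (ox + dx, oy + dy) for inst, (dx, dy) in relative.items()
        }
        for slave, (ox, oy) in slave_origins.items()
    }


# Hypothetical master region Rec2_c and two slave regions of the same type.
MASTER_ORIGIN = (12, 118)
MASTER_SITES = {"slice_a": (12, 118), "slice_b": (13, 120)}
SLAVE_ORIGINS = {"Rec2_d": (12, 158), "Rec2_e": (30, 118)}

if __name__ == "__main__":
    print(replicate_placement(MASTER_ORIGIN, MASTER_SITES, SLAVE_ORIGINS))
```

The computation is only meaningful because the packer has already verified that regions of the same type offer identical resources at identical relative positions.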

G. Homogeneous Router

The main objective of the router is to find a homogeneous routing for the mapped and placed hard macro. Similar to the placement, the routing must be performed in such a way that the homogeneously placed resources within each region of the same type use the same routes. Additionally, nets crossing or interconnecting one or more regions must be homogeneously routed as well. However, regions of the same type can have neighboring regions of different types, as shown in Figure 4. In this example the regions Rec2_c and Rec2_g are both of type 2, but Rec2_c has neighboring regions of type 1 and 2, and Rec2_g has neighboring regions of type 1 and 3. Therefore, the routing algorithm especially has to consider the arrangement of the regions defined by the FPGA partitioning for nets crossing or interconnecting one or more regions.

In the following, a new algorithm is presented, which considers the constraints described above. Homogeneity is achieved by using the same PIPs in the same relative position within each region of the same type, as depicted in Figure 7. The routing can be optimized by providing additional constraints, which are set by parameters. These include the preferred search direction and the depth of the search. The flow of the router is depicted in Figure 6 and is described in the following:

[Figure 6 blocks: input XDL, parameters, and database feed the DHHarMa router; identification of homogeneous inter-nets and intra-nets; exploration of path candidates for each set of homogeneous inter-nets; selection of global inter-net solution; exploration of path candidates for each set of homogeneous intra-nets; selection of global intra-net solution; output XDL]

Figure 6. Design flow of the homogeneous router

[Figure 7 panels: inhomogeneous router (left), DHHarMa router (right)]

Figure 7. Comparison between inhomogeneous (left) and homogeneous (right) realization of a hard macro

[Figure 8 labels: regions of types A and B; paths 1 to 5]

Figure 8. Example of homogeneous inter-net with valid routing

1) Identification of homogeneous inter-nets and intra-nets: In a first step, the nets of each region are classified and grouped. All nets are decomposed into point-to-point connections. Nets with n input pins are separated into n independent point-to-point connections. While the router considers the routing for these nets independently, resource sharing will be applied later whenever possible. Afterwards, each net is classified either as an inter-net (connections between two regions) or an intra-net (connections within the same region). In each region of the same type, intra-nets using the same relative inpin and outpin are identified and grouped into a set of homogeneous intra-nets. In an analogous manner a set of homogeneous inter-nets is generated. After all nets are classified, the algorithm continues with the routing of the inter-nets, which have more homogeneity constraints to consider than intra-nets.
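The decomposition and grouping of step 1 can be sketched as follows. The pin representation `(region, x, y, pin)`, the per-region origin used to form relative positions, and the function names are assumptions for illustration only.

```python
from collections import defaultdict


def relative(pin, regions):
    """Express a pin by region type, relative position, and pin name."""
    region, x, y, name = pin
    ox, oy = regions[region]["origin"]
    return (regions[region]["type"], x - ox, y - oy, name)


def group_connections(nets, regions):
    """Decompose nets into point-to-point connections and group them.

    Connections that share the same relative outpin and inpin end up in the
    same set of homogeneous intra-nets or inter-nets.
    """
    groups = defaultdict(list)
    for net in nets:
        source = net["outpin"]
        for sink in net["inpins"]:  # n inpins give n connections
            kind = "intra" if source[0] == sink[0] else "inter"
            key = (kind, relative(source, regions), relative(sink, regions))
            groups[key].append((source, sink))
    return dict(groups)


# Hypothetical partitioning: two regions of the same type Rec2.
REGIONS = {
    "Rec2_c": {"type": "Rec2", "origin": (10, 20)},
    "Rec2_d": {"type": "Rec2", "origin": (10, 40)},
}
NETS = [
    {"outpin": ("Rec2_c", 11, 21, "BQ"),
     "inpins": [("Rec2_c", 12, 22, "A5"), ("Rec2_d", 10, 41, "B6")]},
    {"outpin": ("Rec2_d", 11, 41, "BQ"),
     "inpins": [("Rec2_d", 12, 42, "A5")]},
]

if __name__ == "__main__":
    for key, connections in group_connections(NETS, REGIONS).items():
        print(key[0], len(connections))
```

In this example the two BQ to A5 connections form one set of homogeneous intra-nets with two members, while the connection into Rec2_d forms its own inter-net set.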

2) Exploration of path candidates for each set of homogeneous inter-nets: Since regions of the same type may have different adjacent regions, the router tries to route the inter-nets in such a way that the homogeneity within each region of the same type is preserved. For this a depth-limited search is used, where the limit is given by the number of PIPs that can be used. The router therefore analyzes the shortest paths first and gradually increases the path length if the routing fails. First, every set of homogeneous inter-nets is computed independently from the other sets of homogeneous nets. The algorithm only proceeds if the selected path is valid for all regions of the same type. Thus, the path is created in these regions in parallel. When a possible path for a set of homogeneous inter-nets is found, the path is stored as a homogeneous path. In order to restrict the number of path candidates, the maximum number of PIPs the path may contain can be specified as an upper bound. An example of a homogeneous inter-net with a valid routing is presented in Figure 8. For each set of homogeneous inter-nets, the algorithm selects the outpin and concurrently extends the path in each region of the same type until the path crosses the border of the region (cf. paths 1 and 4 in Figure 8). After reaching the border, the algorithm replicates the last connection of the path in all regions of the same type that contain the selected outpin. Thus, the computation of the last part of the connections is made in parallel for each destination region type (cf. paths 2 and 3 in Figure 8).

If two connected regions are separated by a static region, which does not require homogeneity, the routing inside this region is done without considering the homogeneity constraints (cf. path 5 in Figure 8). As soon as all paths are connected to the relative inpin, the algorithm will create the connections starting from the next type of region and so on. When all inter-nets are routed, the algorithm stores all activated PIPs and continues the search to find other possible solutions.
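The core idea of the exploration, a depth-limited search whose limit grows step by step and which only accepts a PIP if it is available in every region of the same type, can be sketched as below. This is a strongly simplified model and not the DHHarMa router: each region is given as a hypothetical graph over relative wire names, every edge stands for one PIP, and region borders, static regions, and the collection of multiple candidates are left out.

```python
def find_homogeneous_path(graphs, source, target, max_pips):
    """Return the shortest path that is valid in all region graphs, or None.

    graphs: one dict per region of the same type, mapping a relative wire
    name to the set of wires reachable through a single PIP.
    """

    def allowed(a, b):
        # A PIP may only be used if it exists in every region of the type.
        return all(b in graph.get(a, ()) for graph in graphs)

    def search(path, limit):
        node = path[-1]
        if node == target:
            return path
        if limit == 0:
            return None
        for nxt in sorted(graphs[0].get(node, ())):
            if nxt not in path and allowed(node, nxt):
                found = search(path + [nxt], limit - 1)
                if found:
                    return found
        return None

    # Try short paths first and raise the PIP limit only if routing fails.
    for limit in range(1, max_pips + 1):
        path = search([source], limit)
        if path:
            return path
    return None


# Hypothetical routing graphs of two regions of the same type. In REGION_2
# the PIP from w1 to "in" is unavailable, so the short route is not homogeneous.
REGION_1 = {"out": {"w1", "w2"}, "w1": {"in"}, "w2": {"w3"}, "w3": {"in"}}
REGION_2 = {"out": {"w1", "w2"}, "w1": set(), "w2": {"w3"}, "w3": {"in"}}

if __name__ == "__main__":
    print(find_homogeneous_path([REGION_1], "out", "in", 4))
    print(find_homogeneous_path([REGION_1, REGION_2], "out", "in", 4))
```

With a single region the two-PIP route through w1 is found; with both regions the search has to fall back to the three-PIP route through w2 and w3, and it fails if the upper bound is set to two PIPs.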

3) Selection of global inter-net solution: In this part the algorithm selects a path for each set of homogeneous inter-nets, aiming to retrieve a global solution for this type of nets. The paths have to be selected in such a way that they do not overlap, i. e., no two paths use the same routing resources. In order to simplify the search, the algorithm first identifies so-called critical resources. A routing resource is considered critical if it is used by all paths of a set of homogeneous inter-nets. If a physical wire is marked as a critical resource of any two sets of homogeneous inter-nets, any combination of paths will cause an intersection at that physical wire. In that case it is impossible to find a global solution, and at least one set of homogeneous inter-nets has to be recomputed with relaxed constraints to find a path avoiding the critical resource. The check for identical critical resources of different sets allows a fast detection of potential path intersections.
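The critical-resource check amounts to a set intersection per set of homogeneous nets followed by a pairwise comparison. A minimal sketch, assuming that every candidate path is a list of wire names and that all identifiers are hypothetical:

```python
def critical_resources(candidates):
    """Wires that are used by every candidate path of one homogeneous set."""
    wire_sets = [set(path) for path in candidates]
    return set.intersection(*wire_sets) if wire_sets else set()


def conflicting_sets(candidate_sets):
    """List pairs of homogeneous sets that share a critical resource.

    For such a pair every combination of paths intersects, so at least one
    of the two sets needs new path candidates.
    """
    critical = {
        name: critical_resources(paths)
        for name, paths in candidate_sets.items()
    }
    names = sorted(critical)
    conflicts = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = critical[first] & critical[second]
            if shared:
                conflicts.append((first, second, sorted(shared)))
    return conflicts


# Hypothetical path candidates for three sets of homogeneous inter-nets.
CANDIDATES = {
    "net_a": [["w1", "w2", "w5"], ["w1", "w3", "w5"]],
    "net_b": [["w4", "w5"], ["w6", "w5"]],
    "net_c": [["w7"], ["w8"]],
}

if __name__ == "__main__":
    print(sorted(critical_resources(CANDIDATES["net_a"])))
    print(conflicting_sets(CANDIDATES))
```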

The final solution is the one that has the minimum number of activated PIPs for each set of homogeneous inter-nets. The algorithm starts by analyzing the solutions with the fewest activated PIPs and selects the first solution found. If the algorithm fails to find a solution, the exploration of path candidates for each set of homogeneous inter-nets is repeated with an increased maximum number of possible PIPs.
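The selection can be illustrated by enumerating combinations of path candidates in order of their total PIP count and taking the first combination without shared wires. The exhaustive enumeration below is a brute-force stand-in chosen for clarity, not the search strategy of the DHHarMa router, and all names are hypothetical.

```python
from itertools import product


def select_global_solution(candidate_sets):
    """Pick one path per set so that no wire is shared and PIPs are minimal.

    Each path is a list of wires; a path with k wires activates k - 1 PIPs.
    Returns None if every combination overlaps.
    """
    names = sorted(candidate_sets)
    combinations = product(*(candidate_sets[name] for name in names))

    def pip_count(combination):
        return sum(len(path) - 1 for path in combination)

    # sorted() is stable, so ties keep the enumeration order of product().
    for combination in sorted(combinations, key=pip_count):
        used = [wire for path in combination for wire in path]
        if len(used) == len(set(used)):  # no routing resource used twice
            return dict(zip(names, combination))
    return None


# Hypothetical candidates: the two shortest paths collide on wire w1.
CANDIDATES = {
    "net_a": [["a", "w1", "x"], ["a", "w2", "w3", "x"]],
    "net_b": [["b", "w1", "y"], ["b", "w4", "w5", "y"]],
}

if __name__ == "__main__":
    print(select_global_solution(CANDIDATES))
```

A result of None corresponds to the case in which the exploration has to be repeated with an increased maximum number of PIPs.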

4) Exploration and selection of global intra-net solution: After the inter-nets are routed, the algorithm proceeds with the routing of the intra-nets. Basically, the same routing algorithm is applied as for the inter-nets. First, the path candidates for each set of homogeneous intra-nets are determined using a depth-limited search. In contrast to the inter-nets, the routing does not depend on the adjacent regions, but on the resources blocked by the global inter-net solution. The global intra-net solution is computed in the same way as the global inter-net solution, considering critical resources and choosing one solution for every set of homogeneous intra-nets. Finally, the global intra-net and inter-net solutions are merged and the resulting homogeneous hard macro description is generated using XDL (cf. Figure 2).
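The dependency of the intra-net routing on the global inter-net solution can be sketched as a simple filter followed by a merge. The data layout (paths as lists of PIP identifiers) and the helper names filter_blocked and merge_solutions are assumptions for illustration. Choosing the shortest remaining candidate per set, as done in the example, is a simplification that ignores overlaps between different intra-net sets.

```python
from typing import Dict, List

Path = List[str]


def filter_blocked(intra_candidates: Dict[str, List[Path]],
                   inter_solution: Dict[str, Path]) -> Dict[str, List[Path]]:
    """Drop intra-net candidates that touch a PIP already activated
    by the global inter-net solution."""
    blocked = {pip for path in inter_solution.values() for pip in path}
    return {name: [p for p in paths if blocked.isdisjoint(p)]
            for name, paths in intra_candidates.items()}


def merge_solutions(inter_solution: Dict[str, Path],
                    intra_solution: Dict[str, Path]) -> Dict[str, Path]:
    """Combine both solutions into one routing description."""
    merged = dict(inter_solution)
    merged.update(intra_solution)
    return merged


# Illustrative data: PIP p2 is blocked by the inter-net solution.
INTER = {"inter_0": ["p1", "p2"]}
INTRA_CANDIDATES = {"intra_0": [["p2", "p3"], ["p4", "p5"]]}

usable = filter_blocked(INTRA_CANDIDATES, INTER)
intra = {name: min(paths, key=len) for name, paths in usable.items()}
routing = merge_solutions(INTER, intra)
print(routing)
```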

IV. EVALUATION

Currently, there are no comparable tools that are able to automatically generate a homogeneous hard macro, so no comparison to existing approaches is possible. Tools that create an inhomogeneous hard macro can utilize different optimized routing resources in order to maximize the clock frequency. For homogeneous hard macros with many regions, the homogeneity forces the router to use a small set of feasible routing resources. Long distance wires would cause an inhomogeneity in the macro, such that they cannot be used.

Figure 9. Schematic of the delay line example

The question that arises is how large the impact of the homogeneity is with respect to the maximum clock frequency of the hard macro. Therefore, we provide a comparison of homogeneous hard macros with functionally equivalent inhomogeneous ones. However, the inhomogeneous hard macros cannot be used for the target applications.

For the analysis we considered various designs for a) a time-to-digital converter and b) a partially reconfigurable system on the smallest Virtex-6 FPGA, the LX75. This FPGA contains 11640 Slices with 46560 LUTs and 93120 Flip-Flops. These applications typically require homogeneous hard macros. In Figure 9 a schematic of a delay line is shown, which is commonly implemented in FPGA-based time-to-digital converters. The delay line is used to measure the time between two pulses by utilizing the delay of the carry chains within the FPGA fabric. We generated a corresponding homogeneous hard macro with 40 small regions of the same type, each containing only a single CLB. Figure 10 shows a homogeneous hard macro generated by DHHarMa and an inhomogeneous hard macro generated using the place and route tools from the Xilinx tool chain. When comparing the hard macros it can be seen that the homogeneous hard macro uses identical routing resources for each element, while the inhomogeneous one does not. The homogeneous hard macro requires 160 Look-Up-Tables (LUTs), 526 nets and 863 PIPs, where the maximum clock frequency is 750 MHz. The inhomogeneous hard macro requires 240 LUTs, 529 nets and 1731 PIPs at a maximum frequency of 652.74 MHz. The 33.3% lower number of LUTs shows that, for this design, the homogeneous packer of DHHarMa performs even better than the one provided by Xilinx. The 50.1% lower number of PIPs used in the homogeneous macro indicates that fewer physical wires are used, reflecting the result shown in Figure 10. The main reason for this is that the DHHarMa placer places interconnected instances as close together as possible, which is not the main objective of the Xilinx placer. For the delay line example, the hard macro generated by DHHarMa significantly outperforms the inhomogeneous one. Additionally, a second delay line example is generated using only half of the available registers within a part (only the XOR outputs of the carry chains are used) as indicated in Figure 10. The resulting resource requirements and timing information are shown in Table II. In this example both macros require the same number of LUTs, but the routing of the homogeneous hard macro again requires fewer PIPs than that of the inhomogeneous one.

Figure 10. Comparison of a homogeneous and an inhomogeneous hard macro for a delay line circuit with 40 regions (rotated by 90°). Panels: DHHarMa homogeneous placement; inhomogeneous placement.

In the second application example a communication infrastructure for a partially reconfigurable system is generated. The communication infrastructure uses 9 regions of 4 different types. It is a complex example using 32 Bit data, 32 Bit addresses, 4 Byte-enable signals, and a 4 Bit auxiliary line. Additionally, dedicated signals are connected to each region for strobe, master request, master grant, region enable, and region reset. The communication infrastructure also supports bursts using an embedded 8 Bit burst counter. The DHHarMa design flow creates a homogeneous hard macro using 1040 Slices, 4813 nets, and 36382 PIPs as shown in Figure 11. A functionally equivalent inhomogeneous hard macro, which cannot be used for the partially reconfigurable system, uses 2796 Slices, 7919 nets, and 40857 PIPs. The packing and placing used by the DHHarMa design flow requires 62.8% fewer Slices compared to the Xilinx tools.

Figure 11. Example of a homogeneous hard macro for a communication infrastructure with 9 regions and 4 different region types (region labels: Rec 0_a, Rec 1_b, Rec 2_c, Rec 2_d, Rec 2_e, Rec 1_f, Rec 2_g, Rec 3_h, STATIC)
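The relative savings quoted for the resource requirements follow directly from the values in Table II. The short sketch below recomputes some of them; the helper name reduction_percent is an assumption introduced for illustration, and the input values are taken from Table II.

```python
def reduction_percent(homogeneous: float, inhomogeneous: float) -> float:
    """Relative saving of the homogeneous macro in percent,
    rounded to one decimal place."""
    return round(100.0 * (1.0 - homogeneous / inhomogeneous), 1)


# LUTs of the TDC delay line (160 vs. 240)
lut_saving = reduction_percent(160, 240)
# Slices of PR Region Comm. Full Master 32Bit LR (1040 vs. 2796)
slice_saving_lr = reduction_percent(1040, 2796)
# Slices of PR Region Comm. Full Master 32Bit L (565 vs. 1636)
slice_saving_l = reduction_percent(565, 1636)

print(lut_saving, slice_saving_lr, slice_saving_l)  # 33.3 62.8 65.5
```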

Table II. Comparison of homogeneous and inhomogeneous hard macros for different design examples

                                                 | Homogeneous hard macro                               | Inhomogeneous hard macro
Design                                  #Regions | max. Clock [MHz]  #FFs  #LUTs  #Slices  #Nets  #PIPs | max. Clock [MHz]  #FFs  #LUTs  #Slices  #Nets  #PIPs
TDC Delay Line 160Bit                         40 |              750   320    160       80    526    863 |           652.74   320    240       81    529   1731
TDC Delay Line 160Bit (XOR only)              40 |              750   160    160       40    366    223 |              750   160    160       40    243    359
PR Region Comm. (Full Master 32Bit L)          5 |          212.857  1651   1949      565   2464  16337 |          160.979  1651   2057     1636   4548  21629
PR Region Comm. (Full Master 32Bit LR)         9 |           81.044  2957   3668     1040   4813  36382 |           87.681  2957   3848     2796   7919  40857
PR Region Comm. (Full Master 32Bit M)          3 |          201.288  1343   1262      433   1807  15033 |          209.293  1343   1571      617   3183  15583
PR Region Comm. (simple 64Bit L)               5 |          213.766  1088   1216      352   1536   9000 |          228.624  1088   1280      343   2638   9566
PR Region Comm. (simple 64Bit LR)              9 |          104.866  1984   2304      640   3008  16399 |          106.519  1984   2452      934   4988  16913
PR Region Comm. (simple 32Bit LR)              9 |          108.354   992   1152      320   1504   8128 |           113.43   992   1227      500   2501   8464
PR Region Comm. (simple 8Bit LR)               9 |          108.237   248    288       80    376   2050 |          112.905   248    306      149    625   2014

In Table II additional communication macros are shown with different features and resource requirements. The maximum clock frequencies of the homogeneous hard macros are comparable to those of the inhomogeneous hard macros. In the best case (PR Region Comm. Full Master 32 Bit L) the maximum frequency of the homogeneous hard macro is 32.5% higher than that of the inhomogeneous macro, while in the worst case (PR Region Comm. simple 8 Bit L) the maximum frequency of the homogeneous macro is 12.3% lower than that of the inhomogeneous macro. However, the timing strongly depends on the design specification itself. Further timing improvements in the design can potentially be achieved by introducing additional pipeline stages.

V. CONCLUSION AND FUTURE WORK

In this paper the DHHarMa design flow is presented, which can be used to automatically generate homogeneous hard macros based on a high-level HDL specification. Thus, the HDL description can be updated or reused for any FPGA supported by the vendor tools. The FPGA resource database used in the DHHarMa flow is automatically generated from the XDL report, ensuring support for all devices of current FPGA families (Virtex-4, -5, -6 and Spartan-6). Algorithms for packing, placing and routing are presented with a special focus on homogeneity. Although maintaining homogeneity requires additional constraints in the pack, place, and route processes, the resulting hard macros are comparable to equivalent inhomogeneous hard macros in terms of maximum clock frequency and resource requirements. The homogeneous hard macros of some of the example designs even require significantly fewer resources (up to 65.5%) compared to the inhomogeneous macros. Future work will include further improvements in the packing and placing algorithms by considering additional optimization techniques such as simulated annealing.

ACKNOWLEDGMENT

This work was supported by the Collaborative Research Center 614 - Self-Optimizing Concepts and Structures in Mechanical Engineering - University of Paderborn, and was published on its behalf and funded by the Deutsche Forschungsgemeinschaft.

REFERENCES

[1] J. Wu and Z. Shi, “The 10-ps wave union TDC: ImprovingFPGA TDC resolution beyond its cell delay,” 2008 IEEENuclear Science Symposium Conf. Record, vol. 2, no. 630, pp.3440–3446, 2008.

[2] C. Favi and E. Charbon, “A 17ps time-to-digital converterimplemented in 65nm FPGA technology,” in Proceeding ofthe ACM/SIGDA Int. Symposium on Field Programmable GateArrays. New York, USA: ACM, 2009, pp. 113–120.

[3] M. Koester, W. Luk, J. Hagemeyer, M. Porrmann, and U. Ruck-ert, “Design optimizations for tiled partially reconfigurablesystems,” Very Large Scale Integration (VLSI) Systems, IEEETransactions on, 2010.

[4] C. Claus, B. Zhang, M. Hubner, C. Schmutzler, J. Becker,and W. Stechele, “An XDL-based busmacro generator forcustomizable communication interfaces for dynamically andpartially reconfigurable systems,” in RC education - Int. Work-shop on Reconfigurable Computing Education, May 2007.

[5] D. Koch, C. Beckhoff, and J. Teich, “ReCoBus-Builder - Anovel tool and technique to build statically and dynamicallyreconfigurable systems for FPGAS,” in Proc. of the 18th Int.Conf. on Field Programmable Logic and Applications (FPL),2008, pp. 119–124.

[6] V. Betz and J. Rose, “Cluster-based logic blocks for FPGAs:Area-efficiency vs. input sharing and size,” in IEEE CustomIntegrated Circuits Conf. (CICC), 1997.

[7] V. Betz and J. Rose, “VPR: A new packing, placement androuting tool for FPGA research,” in Proc. of the 7th Int. Conf.on Field Programmable Logic and Applications (FPL), 1997.

[8] K. Bruneel, P. Bertels, and D. Stroobandt, “A Method for fastHardware Specialization at Run-Time,” in Proc. of the 17th Int.Conf. on Field Programmable Logic and Applications (FPL),Aug. 2007, pp. 35–40.

[9] C. Lavin, M. Padilla, S. Ghosh, B. Nelson, B. Hutchings,and M. Wirthlin, “Using Hard Macros to Reduce FPGACompilation Time,” in Proc. of the 17th Int. Conf. on FieldProgrammable Logic and Applications (FPL), 2010, pp. 438–441.
