
Self Configuring Binary Multipliers for LUT Addressable FPGAs

Mathew Wojko and Hossam ElGindy
Department of Electrical and Computer Engineering
The University of Newcastle, Callaghan, NSW, Australia
{mwojko, hossam}@ee.newcastle.edu.au

Abstract

In this paper we present a self-configurable multiplication technique allowing variable configuration time for a class of LUT based Field Programmable Gate Arrays (FPGAs) which exist today. We show this technique to be implementable on FPGA architectures allowing internally addressable RAM primitives to be directly mapped to the Logic Elements (LEs) of the logic resource. This provides run-time read/write addressing capabilities to the FPGA logic elements, which in turn is viewed as an FPGA possessing a run-time reconfigurable logic resource. As an emerging field of computing research, reconfigurable computing provides an area-time tradeoff that is actively investigated by many researchers. We base the variable parameter for our multiplier on the reconfiguration time required and present results showing the effective area-time performance for multipliers of varying input bit size. Results indicate the achievable increase in functional density for multiplication on FPGAs implemented utilising reconfiguration.

1 Introduction

FPGAs provide a configurable logic platform on which digital designs can be implemented. Conventionally, designs are implemented by a process of design entry interaction with a CAD tool which performs the required logic synthesis, placement and routing necessary to generate the design configuration the FPGA will take. The process of synthesis takes the input logic function and functionally decomposes it to fit into the computing elements (LEs, distributed throughout the FPGA in a two-dimensional nature) of the FPGA. The result is a set of nodes which must be placed within the FPGA and an interconnection requirement between these nodes that must be realised by the FPGA interconnection resource. The process of placement and routing performs this task and produces the configuration for the FPGA that must be loaded to allow the input logic function to be performed. The configuration information can be viewed to consist of two parts: a logic configuration


(which determines the configuration loaded into the logic resource of the FPGA), and the interconnect resource configuration.

The use of run-time configuration with FPGAs allows logic designs to be swapped in and out of, or reconfigured within, the device whilst operating. Here we see two classes of run-time configurable FPGAs. There exist FPGAs which, when configured, take on an entirely new configuration and lose all previous configuration information. FPGAs of this categorisation include the most commonly known, traditional and commercially popular device series such as the Altera FLEX 8000 and Xilinx 4000 series [3, 6]. We will term these FPGAs as entirely configurable. Partially configurable FPGAs [1, 2] allow parts of the device to be configured at run-time, while other parts of the device continue to operate and do not lose their configuration information.

However, for the FPGA architectures that require entire configuration of the device there is a slight deviation from their definition when certain allowable functionality is provided. We see that for most of these devices, the logic resource is implemented as a set of binary Look-Up Tables (LUTs) implementing the LEs. To implement any combinatorial logic function, all output values for every input combination of values for the function are stored in the LUT. The function inputs address the LUT to provide the function output. For the set of FPGAs that allow read/write RAM primitives to be directly mapped to the LUTs, i.e. the contents of the LUT can now be addressed and changed at run-time, we see an additional classification of configuration becoming evident. We provide these architectures with the additional categorisation of being run-time logic configurable, meaning that we can modify the contents of the logic resource while the device is operating. Current FPGA architectures that support run-time logic reconfiguration include the Xilinx 4000 series, which provides RAM read/write capability at the CLB (Configurable Logic Block) LE level, and the Altera FLEX 10K series at the EAB (Embedded Array Block) level.
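
As a simple illustration of this LUT view of an LE, the following Python sketch (our own illustration with an arbitrary example function, not taken from any vendor documentation) models a 4-input LE as a 16-entry table: configuring the LE fills the table, evaluating it is an indexed read, and run-time logic reconfiguration amounts to rewriting entries in place.

    # A 4-input LE modelled as a 16-entry lookup table.
    def configure_lut(func):
        # Store the output of an arbitrary 4-input boolean function for all
        # sixteen input combinations.
        return [func((a >> 3) & 1, (a >> 2) & 1, (a >> 1) & 1, a & 1)
                for a in range(16)]

    def read_lut(lut, x3, x2, x1, x0):
        # The four function inputs form the address; the stored bit is the output.
        return lut[(x3 << 3) | (x2 << 2) | (x1 << 1) | x0]

    # Example: configure the LE as a 3-input majority function (x3 unused),
    # then overwrite its contents at run time to become a 4-input AND.
    lut = configure_lut(lambda x3, x2, x1, x0: int(x2 + x1 + x0 >= 2))
    assert read_lut(lut, 0, 1, 1, 0) == 1
    lut[:] = configure_lut(lambda x3, x2, x1, x0: x3 & x2 & x1 & x0)
    assert read_lut(lut, 1, 1, 1, 1) == 1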

In this paper, we present a multiplication technique that uses run-time logic reconfiguration, providing an area-time tradeoff against conventional parallel multiplication techniques. It is known that parallel multiplier implementations on FPGA architectures require considerable logic resource [4, 9]. It has been shown that for constant coefficient multiplier implementations (where the constant coefficient is embedded within the logic) using distributed arithmetic (where the LUT addressable bit size is matched with the coefficient distribution), a considerable saving in the required on-chip area (determined by logic and interconnect resource requirements) can be achieved [5]. By applying run-time reconfiguration to this principle we are able to implement a full multiplier by reconfiguring at every change of one designated input, achieving a saving in the required on-chip area. The multiplier is designed to be configuration-time variable such that we present a tradeoff between the area and the reconfiguration time required by the multiplier. From implementation results obtained, an area-time comparison is drawn between a conventional parallel multiplier implementation and the reconfigurable multiplication technique implemented with varying degrees of reconfiguration time.

A brief overview of each section follows. Section 2 presents the multiplication technique. Section 3 provides implementation results for the technique with varying input bit sizes and reconfiguration times and compares the results with parallel-add multiplier implementations. Section 4 provides an area-time analysis and discussion of the results between the reconfigurable and parallel multiplication techniques. Section 5 concludes the paper.


2 Self-Configurable Multiplication Technique

In this section we present a self-configurable multiplication technique which can be applied to any FPGA allowing read/write RAM primitives to be implemented by the LEs of the architecture. The self-configurable technique uses a lookup based distributed arithmetic approach to perform the multiplication of two numbers. One input is internally transformed into hardware LUTs by the technique, reconfiguring the hardware to contain multiples of the value. The other input addresses the LUTs to retrieve parts of the multiplication result, which are then summed together to provide the final result. We extend on the technique by providing variable reconfiguration time implementations. Here the LUT values that would otherwise be calculated and stored serially are instead computed and stored in parallel. The level of parallelism used to calculate the LUT values dictates the reconfiguration time and the distribution of LUT values. Since we assume a single LUT cannot be loaded in parallel, we partially load many LUTs during parallel configuration. During operation, the second input value will select and address the appropriate LUT to obtain the appropriate value. The LUT outputs are then added to provide the final result.

We begin this section by discussing the advantage of transforming input values into hardware for multiplication implemented by LUT based FPGAs, and show the effect this has on the amount of logic resource required. We use this property as the basis for the technique we present.

A common technique implemented on FPGAs, generally providing best results, to calculate the multiplication of two binary numbers in parallel is the parallel adding approach. A functional diagram of this technique is shown below in Figure 1. Utilising the fast carry functionality that is present in most LUT based FPGAs, high pipelined speeds are achievable by registering at every adder within every level of the addition tree. However, a considerable amount of logic is required since the cyclic convolution of inputs first has to be calculated (via the first input level of the tree) before the partial products can be subsequently added. We see that for the 8-bit parallel add multiplier below, eight 1 x 8 AND blocks are required. In general, for an n bit multiplier, n^2 AND gates are required to compute the partial products via cyclic convolution. However, as discovered by Mintzer [5], if one of the input values to the multiplier is fixed, then the convolution of input bit sequences can be replaced by lookups into pre-computed memory tables of addressable depth k containing the sum of k partial products. For FPGAs containing 4-input LEs implemented by LUTs, Mintzer was able to implement an 8-bit constant coefficient multiplier with two 12 x 4-bit addressable LUTs and a 12-bit adder. This can be seen below in Figure 2. This technique presented a significant reduction in the amount of logic required when implemented on FPGAs, since the first two levels of the parallel adder tree were essentially eliminated. The processing involved in these two levels is pre-calculated for a particular constant input value, and the results stored in the LUTs. Mintzer's constant coefficient multiplier has the disadvantage that only one input value may change. Since the LUT contents are fixed, the multiplier function is fixed to Y = cX where c is the value stored.
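
To make the lookup decomposition concrete, the following Python sketch (ours, assuming unsigned 8-bit operands as in Figure 2) builds the sum-of-partial-products table for a fixed coefficient and multiplies by two 4-bit lookups and one shifted addition.

    # Constant coefficient multiplication via distributed arithmetic.
    def build_sopp_table(c):
        # Entry k holds k * c: the sum of partial products for a 4-bit digit k.
        return [k * c for k in range(16)]

    def const_coeff_mul(table, x):
        lo = table[x & 0xF]           # lookup for X[3:0]
        hi = table[(x >> 4) & 0xF]    # lookup for X[7:4]
        return lo + (hi << 4)         # 12-bit add with a 4-bit offset

    table = build_sopp_table(0xB7)    # fixed at configuration time
    assert all(const_coeff_mul(table, x) == 0xB7 * x for x in range(256))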

However, for FPGAs which allow RAM primitives to be mapped to the LUTs, enabling them to be addressed and read from and written to, the constant coefficient Sum Of Partial Products (SOPPs) lookup can be updated at run-time. In order to update the lookup values at run-time, they must be computed when the constant input value to the multiplier changes. The technique presented is able to recalculate and load the LUT SOPPs within


Figure 1: Functional Block Diagram for an 8-bit Parallel Add Multiplier.

Figure 2: Functional Block Diagram for an 8-bit Constant Coefficient Multiplier.

the FPGA, providing a self configuring multiplier which configures its logic content based on its input. Below is an algorithmic description for an n-bit input multiplier using r cycles to reconfigure, with n-bit input values A and B, the lookup values distributed over the arrays L[j,k], and the 2n-bit output result S.

    /* Reconfiguration cycle - invoked on change of input B */
    ACC = 0
    for i = 0 to (r - 1) do
        for j = 0 to (n/4 - 1) pardo
            for k = 0 to (16/r - 1) pardo
                L[j,k][i] = (k * r) * B + ACC
        ACC = ACC + B

    /* Multiplication cycle */
    S = 0
    for j = 0 to (n/4 - 1) pardo
        t = (A >> 4j) mod 16
        S = S + L[j, floor(t/r)][t mod r] * 2^(4j)
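
A behavioural Python sketch of the same procedure (ours; it assumes unsigned operands, 4-bit addressable LEs, r dividing 16 and n a multiple of 4) may help in reading the description above.

    def reconfigure(B, n, r):
        # Return the distributed LUT contents L[j][k][i] after r cycles.
        groups = n // 4                  # one LUT group per 4-bit digit of A
        luts_per_group = 16 // r         # the 16 SOPPs are spread over 16/r LUTs
        L = [[[0] * r for _ in range(luts_per_group)] for _ in range(groups)]
        acc = 0                          # ACC accumulates B once per cycle
        for i in range(r):
            for j in range(groups):              # pardo in hardware
                for k in range(luts_per_group):  # pardo in hardware
                    L[j][k][i] = (k * r) * B + acc   # SOPP (k*r + i) * B
            acc += B
        return L

    def multiply(L, A, n, r):
        # Address the LUTs with the digits of A and sum the shifted outputs.
        S = 0
        for j in range(n // 4):
            t = (A >> (4 * j)) & 0xF     # j-th 4-bit digit of A
            S += L[j][t // r][t % r] << (4 * j)
        return S

    # Usage: an 8-bit multiplier reconfigured in 8 cycles.
    L = reconfigure(B=0xC5, n=8, r=8)
    assert multiply(L, A=0x3E, n=8, r=8) == 0xC5 * 0x3E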


Below, in Figure 3, a sixteen cycle 8-bit self configurable multiplier is presented. Present are two 12 x 4-bit addressable LUTs and a 12-bit adder, which represent the basis of the 8-bit constant coefficient multiplier. The 4-bit counter and 12-bit accumulator represent the address and data generators respectively, used to reconfigure the LUTs with new coefficient SOPPs. The two eight bit inputs A and B represent the multiplier inputs, and the input line CHANGE is used to signify that input B has changed and the multiplier LUTs must be reconfigured. The process of reconfiguration can be initiated inside the FPGA by a comparator circuit, or externally by a single input. Once the signal is raised, the counter cycles through sixteen values and addresses the LUTs to store the values generated by the accumulator, which represent the SOPPs of the 8-bit number B multiplied by any 4-bit value. For an FPGA implementation of the 12 x 4-bit addressable LUT, twelve 16 x 1-bit LEs will be used. Since the LEs are restricted such that only one location can be loaded at a time, it suffices that the sixteen reconfiguration steps will fully load the LEs with the SOPP values. Once the sixteen values are loaded, the carry detect output from the counter disables the reconfiguring circuitry and allows the remaining circuitry to continue to operate until a new change in the B input is signified.
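
The cycle-by-cycle behaviour of this reconfiguration circuitry can be sketched as follows (a Python model of our own, assuming both LUTs share the same data and address buses and that the counter carry simply drops the write enables).

    class SelfConfigMul16:
        def __init__(self):
            self.lut = [0] * 16       # shared contents of the two 12x4 LUTs
            self.counter = 0          # 4-bit address counter
            self.acc = 0              # 12-bit accumulator
            self.configuring = False

        def change(self, B):
            # CHANGE asserted: clear counter/accumulator, start reconfiguring.
            self.B, self.counter, self.acc, self.configuring = B, 0, 0, True

        def clock(self):
            # One reconfiguration cycle: store ACC, then advance counter and ACC.
            if self.configuring:
                self.lut[self.counter] = self.acc      # write SOPP = counter * B
                self.acc += self.B
                self.counter = (self.counter + 1) & 0xF
                if self.counter == 0:                  # carry: disable writes
                    self.configuring = False

        def multiply(self, A):
            return self.lut[A & 0xF] + (self.lut[(A >> 4) & 0xF] << 4)

    m = SelfConfigMul16()
    m.change(0x9D)
    for _ in range(16):
        m.clock()
    assert m.multiply(0x6B) == 0x9D * 0x6B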

Figure 3: Functional Block Diagram for an 8-bit 16-cycle Reconfigurable Multiplier.

For the configurable multiplication technique, we vary the reconfiguration time, i.e. the number of cycles required to load the LUTs, by performing the calculation and storage of SOPPs in parallel. For example, below in Figure 4, an eight reconfiguration cycle 8-bit self configurable multiplier is presented. Here the sixteen SOPPs are found by calculating and storing the first and last eight SOPPs in parallel. We distribute the storage of these SOPPs by placing them in two 12 x 3-bit addressable LUTs. For an LUT based FPGA implementation of this technique we require two 12 x 4-bit addressable LUTs and only use half of the storage locations of each. We make this tradeoff since we use the restriction that each LUT can only be loaded one value at a time. To calculate the sixteen SOPPs in eight steps, we accumulate the input value B with itself to form the first eight multiples and store these values in one 12 x 3-bit addressable LUT. These values represent the first eight SOPPs. The sums of 8 x B with the first eight SOPPs are calculated and stored in a separate 12 x 3-bit


addressable LUT to represent the last eight SOPPs. Note that the calculation of eight times the input constant is performed by shifting the bit values three positions to the left. Once the configuration cycles are complete the configuration circuitry is disabled, and the input bits from the A input are used to address the LUTs and to select which LUT to sum the outputs from. Tri-state buffers are used to enable the output of the selected LUT.
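
A small Python sketch of this two-way loading scheme (ours, unsigned operands assumed) shows the first and last eight SOPPs being produced in the same cycle, with 8 x B formed by a three-position left shift.

    def load_two_way(B):
        # Eight reconfiguration cycles fill two half tables in parallel.
        low, high = [0] * 8, [0] * 8
        acc = 0
        for i in range(8):
            low[i] = acc                # SOPPs 0..7:  i * B
            high[i] = (B << 3) + acc    # SOPPs 8..15: 8*B + i*B
            acc += B
        return low, high

    low, high = load_two_way(0x5A)
    assert low + high == [k * 0x5A for k in range(16)]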

We continue to reduce the reconfiguration time to four cycles as shown in Figure 5 below. Here an example is presented for an 8-bit multiplier. Notice that for all multipliers, the reconfiguration time is not a function of the input bit size, but is rather derived from the size of the LUT used. For the four cycle reconfigurable multiplier, considerable preprocessing logic is required to calculate the LUT values. Here the set of four values calculated by the accumulator are added to zero, four, eight and twelve times the input value to provide the values which are stored in parallel in eight 12 x 2-bit addressable LUTs. For an LUT based FPGA implementation, eight 12 x 4-bit addressable LUTs will be used to implement the 12 x 2-bit addressable tables. Here only one quarter of the storage capacity of each LUT is used. We incur this expense of under-utilisation to allow the sixteen SOPPs to be calculated in parallel in four cycles. Again, once the LUT values are configured, the A input bit values are used both to address the LUTs which contain the SOPPs and to select which LUT to enable the output values from. Once the output values have been enabled they are summed to produce the multiplication result.
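
The corresponding four-way loading can be sketched in the same style (ours, with the multiples of B formed by shifts and one extra addition, which is the preprocessing logic referred to above).

    def load_four_way(B):
        # Four reconfiguration cycles fill four quarter tables in parallel.
        multiples = [0, B << 2, B << 3, (B << 3) + (B << 2)]   # 0, 4B, 8B, 12B
        tables = [[0] * 4 for _ in range(4)]
        acc = 0
        for i in range(4):
            for k in range(4):
                tables[k][i] = multiples[k] + acc   # SOPP (4k + i) * B
            acc += B
        return tables

    tables = load_four_way(0x77)
    assert [v for t in tables for v in t] == [k * 0x77 for k in range(16)]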

Figure 4: Functional Block Diagram for an 8-bit 8-cycle Reconfigurable Multiplier.

It follows that to calculate the multiplication SOPPs for 16- or 32-bit input sizes, longer LUTs are required. Additionally, more LUTs are required. When using 4-bit input LUTs,


for a 16-bit self configurable multiplier, four 20 x 4-bit addressable LUTs are required, and for a 32-bit self configurable implementation, eight 36 x 4-bit addressable LUTs are needed.
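
These requirements can be summarised by a small resource sketch (our own rule of thumb, assuming 4-bit addressable LEs and the SOPP organisation described above).

    def lut_requirements(n, r):
        # n-bit inputs, r reconfiguration cycles, 4-bit addressable LEs assumed.
        groups = n // 4          # one LUT group per 4-bit digit of the A input
        width = n + 4            # each SOPP entry: n-bit B times a 4-bit digit
        per_group = 16 // r      # partial LUTs needed per group
        return {"luts": groups * per_group, "width": width,
                "utilisation": r / 16.0}

    # e.g. lut_requirements(16, 16) gives 4 LUTs of width 20, and
    #      lut_requirements(32, 16) gives 8 LUTs of width 36.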

Figure 5: Functional Block Diagram for an 8-bit 4-cycle Reconfigurable Multiplier.

3 Implementation Results

To implement the self configuring binary multiplication technique, an FPGA architecture was required which allowed internal read/write addressable RAM primitives to be implemented and provided internal output tri-stating. The Xilinx XC4000 series of FPGAs was chosen since it provides this functionality. In all, twelve different multipliers were implemented and results were recorded on FPGA devices from the XC4000 series. The specific devices used for each multiplication size are shown in the caption for each table. For multipliers of input size 8, 16 and 32 bits, each was implemented with four levels of reconfiguration time. The reconfiguration levels required sixteen, eight, four and zero reconfiguration cycles respectively. For the implementation requiring zero reconfiguration cycles, the parallel adder multiplication technique as shown in Figure 1 was used. The results are provided below.

In Table 1, implementation results for 8-bit multipliers are presented. The columns show the number of Configurable Logic Blocks (CLBs) and the maximum Mapped and Routed clocking speeds for each implementation requiring a set number of reconfiguration cycles. The CLB count provides a metric on the amount of logic resource required by the


Reconfig. Steps      No. CLBs    Mapped Speed    Routed Speed
16                   ...         ... MHz         ... MHz
8                    ...         ... MHz         ... MHz
4                    ...         ... MHz         ... MHz
0 (Parallel Add)     ...         ... MHz         ... MHz

Table 1: 8-bit Multiplier implementation results on a Xilinx XC4000E speed grade ... chip.

Reconfig. Steps      No. CLBs    Mapped Speed    Routed Speed
16                   ...         ... MHz         ... MHz
8                    ...         ... MHz         ... MHz
4                    ...         ... MHz         ... MHz
0 (Parallel Add)     ...         ... MHz         ... MHz

Table 2: 16-bit Multiplier implementation results on a Xilinx XC4000E speed grade ... chip.

technique. The mapped speed indicates the result of the delay for the technique mapped to all components within the FPGA excluding the routing resource. This value is recorded since it represents the computation delay of the technique and is fixed for a given XC4000 device. However, the routed speed will vary across different placement and routing instances of the same technique on the same XC4000 device. This is due to the selection of routing interconnects, which are hierarchical and logarithmic in length and possess different routing delays. The results given for the routed speed provide an instance of a speed optimised implementation of the multiplier, and may not necessarily represent the best achievable implementation.

We see in Table 1 that, apart from the zero cycle configurable multiplier, as the number of required reconfiguration cycles increases, so does the maximum clocking frequency of the multiplier. Since the adder used within each of these multipliers is of the same length, the critical paths would be the same and one would expect the clocking frequencies to be the same. However, there are two reasons why the clock frequency decreases as the number of reconfiguration cycles is reduced. The first is the additional delay introduced between successive pipeline stages by the use of tri-stating buffers. Within the XC4000 architecture, the tri-stating buffers have characteristic delays which add to the critical path, hence reducing the maximum clocking frequency. In addition to this, we see for the 16-cycle reconfigurable multiplier in Figure 3 that one set of tri-stating buffers is used to select the address value for the LUTs, which increases the critical path beyond that of the 12-bit adder. However, for the 8 and 4-cycle multipliers, additional tri-stating buffers exist within the 12-bit adder critical delay, where they are used to arbitrate between the LUTs to select which output to sum. The other delay contributing to the reduced maximum clocking frequency is the additional combinatorial logic required in the pre-processing stage of the multiplier. This logic can be seen for the 8 and 4-cycle reconfigurable multipliers, where the sum of four, eight and twelve times the input value B plus its accumulated value is computed. This additional combinatorial delay can be seen to set the clocking frequencies for the 8 and 4-cycle multipliers apart from those of the zero and 16-cycle multipliers.

Table 2 shows implementation results for 16-bit multipliers. Here we see the CLB count to be inversely proportional to the required number of reconfiguration cycles. This trend


Reconfig. Steps      No. CLBs    Mapped Speed    Routed Speed
16                   ...         ... MHz         ... MHz
8                    ...         ... MHz         ... MHz
4                    ...         ... MHz         ... MHz
0 (Parallel Add)     ...         ... MHz         ... MHz

Table 3: 32-bit Multiplier implementation results on a Xilinx XC4000EX speed grade ... chip.

will continue for all self configurable multipliers with larger input bit sizes. In addition, we see the mapped speed of the 16 and zero cycle multipliers to be the same. This is due to the critical clocking path of the 16-cycle multiplier now being present in the final addition stage instead of the address selection stage. This path matches that of the final stage of the zero-cycle configurable multiplier.

The results of 32-bit multiplier implementations are presented in Table 3. Here we see similar trends to the 16-bit multiplier. The mapped clocking rates for the zero and 16-cycle multipliers are the same, and the 8 and 4-cycle multipliers differ in clocking rate from the zero and 16-cycle multipliers by similar amounts, due to the combinatorial logic present in the SOPP calculation stage of the pipeline. From the results of all multiplier implementations we see an area-time tradeoff between the number of reconfiguration steps and the amount of corresponding required logic. We are able to achieve the best area performing multiplier by using sixteen reconfiguration steps. We can then trade off the area performance to provide a multiplier which requires less reconfiguration time, the extreme case being the full parallel multiplier as implemented by the parallel add technique.

4 Analysis and Discussion of Results

In this section we provide an area-time analysis based on the scalability of the self configuring multiplication technique presented. We refer to the functional density metric, adapted to include the effect of reconfiguration time [8], to evaluate the results obtained. The functional density of a design implementation requiring a given circuit area A and possessing a critical delay T is evaluated as 1/(AT). For a reconfigurable design implementation that assumes area A and has a critical delay T, which requires R cycles to reconfigure and is reconfigured every n cycles, we use the following formula to evaluate the functional density:

D = n / (AT(n + R))

Or, given that the circuit has a clocking frequency f, we can evaluate the functional density as below:

D = nf / (A(n + R))
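
A small Python sketch (ours, with placeholder area and delay values rather than the measured figures) shows how this metric and the break-even iteration count against a non-reconfigured design can be evaluated.

    def functional_density(area, delay, n_iters, r_cycles=0):
        # D = n / (A * T * (n + R)); with R = 0 this reduces to 1 / (A * T).
        return n_iters / (area * delay * (n_iters + r_cycles))

    def break_even_iterations(a_cfg, t_cfg, r_cycles, a_par, t_par):
        # Smallest n for which the reconfigurable design's density exceeds
        # that of the parallel (zero reconfiguration cycle) design.
        gain = (a_par * t_par) / (a_cfg * t_cfg)
        if gain <= 1:
            return None              # the parallel design is never overtaken
        return int(r_cycles / (gain - 1)) + 1

    # Hypothetical numbers only: half the area at the same delay and 16
    # reconfiguration cycles breaks even after 17 iterations.
    assert break_even_iterations(0.5, 1.0, 16, 1.0, 1.0) == 17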

For each bit size multiplier, two functional density graphs were obtained displaying the functional density comparison between all reconfiguration level multipliers for both the mapped and routed clocking frequencies. The functional density is graphed against the number of cycles (iterations) before reconfiguration is performed. We provide both the


mapped and routed functional density graphs to show whether the additional delay incurred by the interconnect network makes any substantial difference within the analysis.

Below in Figure 6 we see the mapped and routed functional density for 8-bit input multipliers for all implementations at levels of 16, 8, 4 and zero cycles of reconfiguration. Viewing the graphs we are able to see the point at which the functional density for the 16-cycle self configurable multiplier equals that of the parallel add zero-cycle configuration multiplier. Comparing the two graphs, we see that to achieve better functional density than the zero cycle multiplier, the number of iterations to be performed before reconfiguration is less for the routed functional density than for the mapped. We see that for the 16-cycle self configurable multiplier, once the number of iterations before reconfiguration exceeds the break-even point visible in the graphs, the functional density exceeds that of the parallel add multiplier. The functional density for the 8 and 4-cycle configurable multipliers remains less than that of the zero-cycle configuration for all values of the iteration count n.

Figure 6: Mapped and Routed Functional Density for 8-bit Multipliers.

In Figure 7 we see the functional density for mapped and routed 16-bit input multipliers. Here it can be immediately observed that the functional density of the 8-cycle multiplier exceeds that of the zero cycle multiplier after a certain number of iterations between reconfiguration, which can be read from the graphs. We also see a break-even point between the 16-cycle and zero cycle multipliers. Observing both the mapped and routed functional density graphs, we see little relational difference other than the values of functional density being offset by the routing delay for each multiplier implementation. Again we see the 4-cycle configurable multiplier to provide the poorest functional density result.

We see below in Figure 8 a continuing trend in the relation of functional density to reconfiguration and multiplier input bit size. Below, the functional density results for 32-bit mapped and routed multiplier implementations are presented. We see the functional density for the 16, 8 and 4 cycle self configurable multipliers to be offset by a greater amount from the zero cycle multiplier than in the previous 8- and 16-bit implementations. The break-even points for the 16 and 8 cycle routed multipliers are further reduced, and the functional density of the 4-cycle implementation approaches that of the zero-cycle multiplier. Suffice to say, we can present the observation that the offset in functional density for self configurable multipliers increases for increasing bit-size multipliers.


Figure 7: Mapped and Routed Functional Density for 16-bit Multipliers.

Figure 8: Mapped and Routed Functional Density for 32-bit Multipliers.

Based on the functional density results, we can say that when one of the inputs to a multiplier is known to change at a defined rate slower than the clocking speed of the multiplier, self configurable multipliers present an alternative multiplication technique with increased functional density over conventional parallel multipliers. The rate at which the input changes and the number of cycles used for reconfiguration define the difference in functional density over conventional multiplication techniques. An application of the self configurable multiplier is within adaptive signal processing. In adaptive signal processing, filter coefficients are often changed at a rate slower than the clocking speed of the filter. For a filter whose coefficients change at a rate slower than the break-even point between the functional density of self configurable multipliers and conventional multipliers, an increase in functional density will be obtained.

5 Conclusion

In this paper a self-configurable multiplication technique allowing variable configuration time was presented for LUT based FPGAs which allow internally addressable RAM primitives to be directly mapped to the Logic Elements (LEs) of the logic resource. The Xilinx


XC4000 series of FPGAs was shown to provide an architecture of sufficient functionality, allowing run-time partial configuration of the logic resource, to implement the multiplication technique presented. Implementation results for 8, 16 and 32-bit multipliers with varying reconfiguration times showed the effect on maximum achievable clocking frequency and the amount of logic resource required. It was shown how the reduction in configuration time demanded additional logic to compute the partial product sums in parallel, which were then stored in the LUTs on the FPGA. An area-time tradeoff continuum between the reconfiguration time and required logic was shown for the multiplication technique.

An analysis of functional density including the time to reconfigure was presented, and results were graphed for functional density against the number of iterations between reconfigurations. An increase in the functional density for multiplication on FPGAs implemented utilising reconfiguration was shown for implementations with a varying number of reconfiguration cycles, once the number of iterations between reconfigurations exceeds a defined value. It was also observed how the functional density offset between multiplier implementations increased with increasing input bit size.

In closing, an application suggestion was made where self configurable multipliers would provide an increase in functional density for adaptive filtering without any large impact on the performance of the processing engine.

References

[1] Atmel Incorporated. ATMEL AT6000 series data sheet.

[2] Xilinx Incorporated. XC6200 FPGA data sheet.

[3] Xilinx Incorporated. Xilinx XC4000 data sheet.

[4] M. Jimenez, C. Wey, and M. Shanblatt. Mapping multiplication algorithms into a family of LUT-based FPGAs. In International Symposium on Field Programmable Gate Arrays, Monterey, CA, February. Poster feature.

[5] Les Mintzer. FIR filters with field-programmable gate arrays. Journal of VLSI Signal Processing.

[6] ALTERA Corporation. FLEX 8000 data sheet.

[7] ALTERA Corporation. FLEX 10K data sheet, June.

[8] M. J. Wirthlin and B. L. Hutchings. Improving computational efficiency through run-time constant propagation. In 1997 ACM/SIGDA Fifth International Symposium on Field-Programmable Gate Arrays.

[9] M. Wojko and H. ElGindy. Comparative analysis of multiplication techniques for FPGA architectures. In Tam Shardi, editor, Proceedings of PART, Newcastle, New South Wales, Australia, September. Springer.
