A low-area, high-speed, processor array architecture for field ALU over GF (2m

A Low Area, High Speed, Processor Array Architecture for Field ALU over GF(2m)Mohamed Fayed, M. Watheq El-Kharashi, and Fayez Gebali,

Network LabElectrical & Computer Engineering

University of VictoriaVictoria, BC, Canada

Emails: {mfayed, watheq, fayez}@ece.uvic.ca

18 December 2007NETWORK LAB UNIVERSITY OF

VICTORIA 2

Outline

MotivationField MathematicsProcessor Array Design MethodologyProposed ArchitectureExperimental SetupResults and ComparisonConclusion


VICTORIA 3

Motivation

Cryptosystems (e.g., ECC) rely heavily on GF mathematicsCryptosystems require efficient implementation for:

Addition, Multiplication, Squaring, DivisionThe basic operations are addition and multiplication since:

Squaring could be done by utilizing a multiplierDivision is implemented by repeated multiplication and Squaring

Processor array is attractive because it has:Regular and simple structureScalable

HW implementation of GF mathematics provides:Physical security for cryptographic algorithms (e.g., RSA, ECC)Higher speed for real-time application

Field Mathematics

Addition

Multiplication

Squaring

Inversion


VICTORIA 4

∑ ∑∑∑−

=

−

=

−

=

−

=

+====1

0

1

0

1

0

1

0)(,)(,)(,)(

m

i

m

i

ii

mii

m

i

ii

m

i

ii xrxxRxbxBxaxAxcxC

∑ ∑∑−

=

−

=

−

=

⊕=+=+=1

0

1

0

1

0)()()()(

m

i

m

i

iii

ii

m

i

ii xbaxbxaxBxAxC

)()().()(1

0

1

0

xRModxabxBxAxCm

i

m

j

jiii∑∑

−

=

−

=

+==

)()(1

0

22 xRModxaxAm

i

ii∑

−

=

=

)()( 221 xRModAxAm −− =


VICTORIA 5

Processor Array Design Methodology

Algorithm Transformationto RIA

Obtain Dependency Graph(DG)

Node projection usingdifferent vectors

Obtain the data timingfunction


VICTORIA 6

Transformation to RIA for Multiplication

⎪⎩

⎪⎨

⎧

∧−−⊕−−⊕−∧−−⊕−∧

0>))(1)1,((1)1,())()((

0=1)1,())((0)(=),(

jjrmicjicimbja

jmicimbajic

i,ja(j),r(j)

c(i-1,j-1)b(m-i)

c(i-1,m-1)

c(i,j)


VICTORIA 7

Multiplier Dependency Graph (DG)j

a(3), r(3)

b(4) b(3) b(2) b(1)0 0

0

0

0

0 0

c(0,4) c(1,4) c(2,4)

a(2), r(2)

a(1), r(1)

a(0), r(0)

b(0)

c(3,4)

0

a(4), r(4)0

i


VICTORIA 8

Multiplier Node Projection

j

a(3), r(3)

b(4) b(3) b(2) b(1)0 0

0

0

0

0 0

c(0,4) c(1,4) c(2,4)

a(2), r(2)

a(1), r(1)

a(0), r(0)

b(0)

c(2,4)

0

a(4), r(4)0

i

Projection Direction d=[1,1]t

0 1 2 3 4 5 6 7 8


VICTORIA 9

Multiplier Node Projection (Cont.)

t=0

t=1

t=2

t=3

t=4

b(4)

a(3)r(3)

a(2)r(2)

a(1)r(1)

a(0)r(0)

a(4)r(4)

b(3)

a(3)r(3)

a(2)r(2)

a(1)r(1)

a(0)r(0)

a(4)r(4)

b(2)

a(3)r(3)

a(2)r(2)

a(1)r(1)

a(0)r(0)

a(4)r(4)

b(1)

a(3)r(3)

a(2)r(2)

a(1)r(1)

a(0)r(0)

a(4)r(4)

b(0)

a(3)r(3)

a(2)r(2)

a(1)r(1)

a(0)r(0)

a(4)r(4)

c(4,4) c(4,3) c(4,2) c(4,1) c(4,0)

c(3,4)

c(2,4)

c(1,4)

c(0,4)

0

00000

0

0

0

0

0

0

0

0

1

1

1

1

1

2

2

2

2

2

3

3

3

3

3

4

4

4

4

4

5

5

5

5

5

6

6

6

6

6

7

7

7

7

7

8

8

8

8

8


VICTORIA 10

Multiplier Node Projection (Cont.)

b(m-i)

a(m-1),r(m-1)a(0),r(0)

.

.

.a(m-2),r(m-2)

0 1 2 3 m-1

c(i,m-1)

a(m-2),r(m-2)a(m-1),r(m-1)

.

.

.a(m-3),r(m-3)

a(m-3),r(m-3)a(m-2),r(m-2)

.

.

.a(m-4),r(m-4)

a(m-4),r(m-4)a(m-3),r(m-3)

.

.

.a(m-5),r(m-5)

a(0),r(0)a(1),r(1)

.

.

.a(m-1),r(m-1)

0t=0t=1

.

.

.t=m-1


VICTORIA 11

Processing Element

i,ja(j),r(j)

c(i-1,j-1)b(m-i)

c(i-1,m-1)

c(i,j)

D QDFF

c(i,j)a(j)b(m-i)

c(i-1,j-1)r(i)

c(i-1,m-1)


VICTORIA 12

Multiplier Architecture

en

Control bus

Controller

MultiplierProcessor

Array

RA

RB

Data bus

W

W

m

W

Squaring

Tl is a polynomial of order <mTh (xd+1+..+x) is a polynomial of order >m

Squaring require Multiplication and addition


VICTORIA 13

)]( ))(([=)( 12 xRmodxxxTTxA dhl +++ + K

Squarer Architecture


VICTORIA 14

en

Multiplierprocessor

array

Data Bus

RX

Adder

W

m

Control Bus

Controller

RA

RB

m

W

Inversion

Which is implemented by repeated multiplication and squaring


VICTORIA 15

21221 )1(=2= −− −− mmAAA

According to Itoh [22]

Inversion Architecture


VICTORIA 16

en

Multiplierprocessor

array

Data Bus

RX

Adder

W

m

Control Bus

Controller

RA

RB

m

W

ALU Latency (clock cycles)


VICTORIA 17

Operationm=163 m=283 m=571

W W W

32 64 128 32 64 128 32 64 128

A+B 19 10 7 28 16 10 55 28 16

AB 181 172 169 310 298 292 625 598 586

A2 23 17 15 34 26 22 50 32 24

A-1 5,193 4,140 3,789 12,716 10,328 9,134 36,055 25,444 20,728

AB2 198 186 182 335 319 311 657 621 605

AB+C 194 179 174 329 309 299 662 617 597

AB2+C 211 193 187 354 330 318 694 640 616

AB-1 5,368 4,309 3,956 13,017 10,621 9,423 36,662 26,033 21,309


VICTORIA 18

Design AND gates

XORgates

1-bit Latches

2 ×1 Mux

3×1 Mux

Tri-state Buffer

Total Transistors

Wang et al [4] 3m2 3m2 8.5m2 0 0 0 128 m2

Kim et al. [6] 3m2-2m

3m2-2m 9m2-8m 0 0 0 108 m2-88m

Lee et al.[5] m2 m2+3m 3.5m2+3m 3m m 0 40 m2+71m

Our Design m 2m 3m 2m-2 0 2m 62 m

Complexity Comparison


VICTORIA 19

Experimental Setup

Modelling Language:VHDL Structural model

FPGA Technology:Xilinx Virtex II XC2V4000-6

Tools:Xilinx ISE tools 9.1ModelSim XE III 6.0d

Implementation Results


VICTORIA 20

m(bits)

W(bits)

Area(slice)

fmaxMHz

Operation latency (ns)

A+B AB A2 A-1

16332 72 686 87 19,676

64 1,068 264 38 652 64 15,687

128 27 640 57 14,357

28332 119 1,314 144 53,878

64 1,802 236 68 1,263 110 43,760

128 42 1,237 93 38,701

57132 243 2,760 221 159,219

64 3,606 226 124 2,641 141 112,361

128 71 2,588 106 91,535


VICTORIA 21

Comparison with Other Work

Design PlatformArea

(slice)Frequency

(MHz)m

(bits)Operation latency μ

AB A-1

806 98 128 3.974 -

Ors [28] Xilinx V812E-8 1,548 100 256 7.686 -

2,972 95 512 16.171 -

Crowe [29] Xilinx XC2V2000-6 5,267 44.9 256 5.75 -

Alberto [30] Xilinx XCV300 3,072 200 239 3.1 -

Miguel [27] Xilinx XC2V1000-4 - 166 191 1.48 135

Kitsos [31] Xilinx XC2V1000E-8 - 22.1 163 7.4 -

1,086 264 163 0.686 19.7

Our design Xilinx XC2V4000-6 1,802 236 283 1.3 53.9

3,606 226 571 2.8 159


VICTORIA 22

Conclusion

Propose An efficient ALU architecture over GF(2m) :Processor array-Based design space explorationHigh-speed (e.g., multiplication is done in <<2m clock cycles ) Low area design (e.g., multiplier m requires Pes)

The architecture is implemented for different m sizesm ∈ {163, 283, 571}

Evaluate the performance of the proposed architectureAchieved high frequency of 264 MHzAchieved low latency of 686 ns for multiplication


VICTORIA 23

A low-area, high-speed, processor array architecture for field ALU over GF (2m

Documents

Transcript of A low-area, high-speed, processor array architecture for field ALU over GF (2m