New iterative algorithms for modular multiplication

Signal Processing 84 (2004) 1919–1930
www.elsevier.com/locate/sigpro

Omar Nibouche (a), Mokhtar Nibouche (b), Ahmed Bouridane (c,*)

(a) Faculty of Engineering, Magee College, University of Ulster, Northland Road, Derry BT48 7JL, UK
(b) Faculty of Computing, Engineering and Mathematical Sciences, The University of the West of England, Coldharbour Lane, Bristol BS16 1QY, UK
(c) School of Computer Science, The Queen’s University of Belfast, Bernard Crossland Building, 18 Malone Rd, Belfast BT7 1NN, UK

Received 18 July 2003; received in revised form 25 June 2004

(*) Corresponding author. Tel.: +44-1232-335465.
E-mail addresses: [email protected] (O. Nibouche), [email protected] (M. Nibouche), a.bouridane@qub.ac.uk (A. Bouridane).
0165-1684/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.sigpro.2004.07.001

Abstract

The new modular multiplier structures proposed in this paper are based on a short precision magnitude comparison instead of a full magnitude comparison. Another feature of these structures is that the comparison operations are carried out first, and the reduction operation takes place only once they are complete, whereas in previous work the comparison and reduction operations are interleaved. This reduces the number of stages required to implement the modular reduction operation. Serial implementations show that the new radix-2 algorithm has a better area usage than similar structures available in the literature, while the proposed radix-4 algorithm exhibits better area usage than similar structures with relatively similar speed performance. Parallel implementations of these algorithms also show that the new radix-4 algorithm has the best area usage while its speed performance is similar to that of structures proposed in the literature.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Modular multiplication; Cryptography; Serial-parallel systems; Computer arithmetic

1. Introduction

In recent years, the use of software tools and hardware devices for security functions has increased dramatically [4,9,11,12,17]. Security issues play a crucial role in the widespread use of many computer and communication systems, such as the Internet, which more and more people are using to transmit sensitive information such as credit card numbers. A central tool for achieving system security is cryptography. Privacy and fraud concerns can be addressed through the use of various security primitives such as data encryption, which can be used with the appropriate protocols to construct secure and trusted networks [5,8].

In 1976, Diffie and Hellman introduced the idea of public key cryptography [4], which is now widely used to provide confidentiality, authentication, data integrity and non-repudiation. Since then, numerous public-key cryptosystems have been proposed. All these systems base their security on some mathematical one-way function. RSA [11] is the most widely used public-key cryptosystem. An RSA operation is a modular exponentiation, which requires repeated modular multiplications. For security reasons RSA operand sizes need to be 1024 bits or greater [12]. The implementation of such systems therefore requires efficient architectures to compute the modular product, which has motivated the development of a number of modular multiplication algorithms and architectures.

The modular multiplication can be carried out through division, whereby the product of the two operands is formed and then divided by the modulus to compute the residue, which is the result of the modular multiplication operation. Such a parallel implementation can become enormously complex, as the operands are very large. In addition, the calculation of the quotient, which is not the ultimate result of the operation, increases the area used for the implementation. Therefore, the design of high performance systems calls for efficient iterative algorithms for modular multiplication based on repeated subtraction operations [7,8,12,15,16].

Various iterative techniques exist for the division-based modular multiplication operation. One popular algorithm is Blakeley’s algorithm [1,13]. This algorithm interleaves the well-known shift and add technique with magnitude comparison and subtraction operations. It uses a Most Significant Bit First (MSBF) format and entails two magnitude comparison/subtraction operations. These operations are the drawback of any implementation of the algorithm, as they require a propagation path from the Least Significant Bit (LSB) to the Most Significant Bit (MSB), thus decreasing the frequency of the system.

Some interesting MSBF iterative methods are not based on the division operation [2,9,14,17]. In these algorithms, the modular reduction is computed via Look Up Tables (LUTs) and by discarding a group of Most Significant Bits (MSBs), which are used to select a correction term. As the number of bits used to select the reduction value increases, memory access time becomes the bottleneck of the speed of the whole system. Therefore, reducing the LUT size can be of great practical concern. This can be done using a number of LUTs and multiplexers. The drawback of this approach is that more than one reduction value has to be stored. Consequently, the number of reduction values and the area required to reduce them must be balanced against the size of the LUT and its access time. The implementation of these algorithms is based upon the use of Carry Save Adders (CSAs) and multiplexers. These algorithms also need precalculation and storage of the correction values, which have to be carried out for every different modulus (the calculation program can be written into an EPROM).

This paper addresses the problem of designing scalable modular multipliers that need no full magnitude comparison and that can be used with any modulus. The new modular multiplier structures proposed in this paper are based on a short precision magnitude comparison instead of the full magnitude comparison suggested by Blakeley [1,13]. The short magnitude comparison was first used by [6,7] to derive modular multiplier architectures. With this approach, the partial results are kept in a wider range than in Blakeley’s algorithm. However, a last reduction step, in which no data are fed to the multiplier, reduces the partial results to the required range. The paper is organized as follows: the mathematical background of the modular multiplication operation and previous work are presented in Sections 2 and 3, respectively. The new algorithms are presented in Sections 4–7. An extension of this work to higher radices is addressed in Section 8 and conclusions are drawn in Section 9.

2. Background and previous work

The modular multiplication operation computes the product P, given by

\[ P = \langle AB \rangle_M = AB \bmod M, \quad 0 \le A, B < M, \tag{1} \]

where P is the remainder of the division of the product AB by the modulus M. A, B, and M are positive integers sufficiently large to ensure the security of the system. The modulus M is represented in binary using n bits, where n is given by

\[ n = \lceil \log_2 M \rceil, \tag{2} \]

where $\lceil x \rceil$ is the ceiling of x, i.e. the integer greater than or equal to x but smaller than x + 1.

As shown in [5], composing a new method simply requires developing a valid iteration rule that replicates a certain bound. Consider the problem of computing $\langle AB \rangle_M$, where the product AB of Eq. (1) is decomposed as

\[ AB = \sum_{i=0}^{n-1} a_i 2^i B. \tag{3} \]

Using this decomposition, $\langle AB \rangle_M$ can be rewritten as

\[ \langle AB \rangle_M = \Big\langle \sum_{i=0}^{n-1} a_i 2^i B \Big\rangle_M = \langle\langle\langle\langle\langle a_{n-1}2^{n-1}B \rangle_M + a_{n-2}2^{n-2}B \rangle_M + \cdots \rangle_M + a_1 2^1 B \rangle_M + a_0 2^0 B \rangle_M. \tag{4} \]

Inspection of this expression reveals that an iterative computation technique is possible. Namely, a partial result can be defined as $S_i = \langle S_{i-1}2^{i-1} + a_{n-i-1}2^{n-i-1}B \rangle_M$ with $S_{-1} = 0$, and the computation iterated n times (i.e. i = 0 to i = n − 1). It can be noted from (4) that the computation leads to a class of exact methods, i.e., the partial results are in the range $[0, M[$. Nevertheless, in applications such as cryptography, exact evaluation of the modular reduction is often impractical, since calculation of the corresponding modular correction requires full-word-length information. To avoid long propagation paths that can lower the clock frequency, an approximate evaluation of the partial results is favoured, in which case only a small portion of a data word is analysed to compute corrections. As suggested by [5], the following general case can provide some direction in the development of approximate iterative modular reduction methods. In these methods the partial results are kept in a larger range than the range of Eq. (1).

Assume that a method is available to compute $S_i = \langle S_{i-1}2^{i-1} + a_{n-i-1}2^{n-i-1}B \rangle_M + \varepsilon M$, where $\varepsilon \in \{\varepsilon_{\min}, \varepsilon_{\min}+1, \ldots, \varepsilon_{\max}-1, \varepsilon_{\max}\}$ and $(\varepsilon_{\min}-1)M < S_{i-1} < (\varepsilon_{\max}+1)M$. It follows from this definition that if the method is valid, the same bound is maintained on the new result, i.e., $(\varepsilon_{\min}-1)M < S_i < (\varepsilon_{\max}+1)M$, which can be proven by induction [5].

This establishes that, in order to compose a new method, it is simply required to develop a valid iteration rule that replicates a certain bound. Furthermore, it underlines the fundamental fact that the error due to the approximate nature of the iteration rule does not accumulate with the iteration number but instead remains in a constant range [5].

To reduce the partial results back to the range $[0, M[$, another subtraction step can be used. The number of subtraction operations used in this step depends on the approximation used to derive the reduction rule of the algorithm.
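The exact iteration implied by Eqs. (3) and (4) can be sketched in Python as follows. This is only a software illustration under stated assumptions, not the authors' hardware: the function names are hypothetical, Python's `%` operator stands in for the full-precision reduction, and the second function uses the equivalent MSB-first shift-and-add form, which doubles the running result instead of weighting each term by $2^i$.

```python
def n_bits(M):
    """n = ceil(log2(M)) as in Eq. (2), computed without floating point."""
    return (M - 1).bit_length()


def modmul_termwise(A, B, M):
    """Eq. (4): add the terms a_i * 2^i * B from i = n-1 down to 0,
    reducing modulo M after every addition."""
    S = 0
    for i in reversed(range(n_bits(M))):
        a_i = (A >> i) & 1
        S = (S + a_i * (1 << i) * B) % M
    return S


def modmul_shift_add(A, B, M):
    """Equivalent MSB-first shift-and-add form: the running result is
    doubled at each step, so every partial result stays in [0, M[."""
    S = 0
    for i in reversed(range(n_bits(M))):
        a_i = (A >> i) & 1
        S = (2 * S + a_i * B) % M
    return S
```

Both functions perform a full-word reduction at every step, which is exactly the long propagation path that the approximate methods discussed above try to avoid.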

3. The short magnitude comparison algorithm

An idea to reduce the propagation path while keeping the scalability of the modular multiplier intact was presented in [6,7]. In this work, the use of Carry Propagate Adders (CPAs) was avoided by employing a technique to estimate the sign, so that instead of subtracting M or 2M from the whole partial results, the subtraction operation is only performed on the n − t MSBs of the operands. If t = 0, the subtraction operation is carried out on the whole operand words. On the other hand, if t = n − 1, the subtraction operation is carried out only on the MSB of the two operands. The parameter t therefore controls the accuracy of the estimation and thus the total amount of logic required for the implementation. For an n-bit integer N, the estimator function T was defined by [1,13] as:

\[ T(N) = N - \langle N \rangle_{2^t}, \tag{5} \]

where $0 \le t < n - 1$.

The operator T replaces the first t LSBs with zero, which implies that

\[ T(N) \le N < T(N) + 2^t. \tag{6} \]

The principle of the algorithm is to reduce a pair of carry–sum words, denoted $(C_i, S_i)$, by estimating the sign of $\tilde{P}_i = S_i + C_i - M = \tilde{S}_i + \tilde{C}_i$. If the estimated sign of $\tilde{P}_i$ is positive, then $S_i = \tilde{S}_i$ and $C_i = \tilde{C}_i$. If the estimated sign is negative, the original value was in the correct range and no reduction is required. The algorithm as shown by [6,7] is described in Algorithm 1 below:

Algorithm 1

    $S_0 = 0$; $C_0 = 0$
    For $i = 1 \ldots n$ {
        $S_i + C_i = 2(S_{i-1} + C_{i-1}) + a_{n-i}B$
        $\tilde{S}_i + \tilde{C}_i = S_i + C_i - 2M$
        If $T(\tilde{S}_i) + T(\tilde{C}_i) \ge 0$ then $S_i = \tilde{S}_i$ and $C_i = \tilde{C}_i$
        $\tilde{S}_i + \tilde{C}_i = S_i + C_i - M$
        If $T(\tilde{S}_i) + T(\tilde{C}_i) \ge 0$ then $S_i = \tilde{S}_i$ and $C_i = \tilde{C}_i$
    }

The algorithm entails as many magnitude comparison/subtraction operations as Blakeley’s algorithm [1,13]. The difference is that in the latter algorithm, the modular reduction is carried out after a full-precision magnitude comparison, whereas in the short magnitude comparison it is carried out on a reduced precision. For an n-bit modulus, the partial results of the multiplication fall in the range $[0, 5M[$. They can be represented in binary using n + 3 bits. A sufficient condition for the correctness of the algorithm is $t \le n - 1$. The sign estimation operation is carried out at least on the bits t to n + 3 of the partial result, and thus it checks at least 5 MSBs of the partial results. It can be implemented using a CPA or Carry Look-ahead Adder (CLA), while the modular multiplier is implemented using three rows of CSAs.
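Algorithm 1 can be modelled in software as sketched below. This is a behavioural illustration only: the helper names `csa`, `T` and `short_compare_modmul` are hypothetical, Python's unbounded integers stand in for the two's complement carry and sum words, and the sign test is applied to the tentative (tilde) pair, which is an interpretation of the extracted listing.

```python
def csa(a, b, c):
    """3:2 carry-save adder on Python integers.
    Returns (sum, carry) with sum + carry == a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def T(N, t):
    """Estimator of Eq. (5): clear the t least significant bits of N."""
    return N - (N % (1 << t))


def short_compare_modmul(A, B, M, t):
    """Behavioural model of Algorithm 1; returns the final (S, C) pair."""
    n = (M - 1).bit_length()
    S = C = 0
    for i in range(1, n + 1):
        a = (A >> (n - i)) & 1                 # bit a_{n-i}, MSB first
        S, C = csa(2 * S, 2 * C, a * B)        # shift and add
        for k in (2, 1):                       # try -2M, then -M
            S_t, C_t = csa(S, C, -k * M)
            if T(S_t, t) + T(C_t, t) >= 0:     # estimated sign is positive
                S, C = S_t, C_t
    return S, C
```

The returned pair satisfies S + C ≡ AB (mod M) and S + C ≥ 0, because a subtraction is accepted only when the truncated estimate, which never exceeds the true value, is non-negative. A final full-precision correction is still needed to bring S + C into $[0, M[$.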

4. The new binary iterative algorithm for modular multiplication

Although the short magnitude comparison algorithm solves the problem of long propagation paths, better performance can be achieved. As shown in this section, this is done by estimating the sign prior to the modular reduction, instead of interleaving the sign estimation with the modular reduction operation. Another innovation with respect to the algorithm of [6,7] is the use of signed arithmetic. In this way, only one modular reduction is carried out for the new radix-2 algorithm, as described in Algorithm 2. The new radix-4 algorithm, also presented in this paper and described in Algorithm 3, uses two reduction operations and one operation of accumulation of the partial products generated using Booth’s algorithm.

Booth recoding is a commonly used technique to recode one of the operands in binary multiplication [3]. The radix-2 Booth recoding technique scans the bits of the multiplier one bit at a time, and adds or subtracts the multiplicand to or from the partial product, depending on the value of the current bit and the previous bit, while the radix-4 Booth recoding technique scans the bits of the multiplier two bits at a time. The process of inspecting the multiplier bits required by Booth’s algorithm can be viewed as recoding the multiplier using the three digits 0, 1 and −1 in the case of radix-2 recoding, and 0, 1, 2, −1 and −2 in the case of radix-4 Booth recoding. The resulting multiples of the multiplicand are zero or plus or minus a power of two times the multiplicand, so they can be generated by simple shift operations [3].

The main feature of the new algorithm, when compared to that proposed in [6,7], is that it requires only one modular reduction operation instead of two. This is achieved by reducing the partial results only once the sign estimation operations have been carried out, while in [6,7] the sign estimation operations are interleaved with the modular reduction operations. Taking into account that N lies in the range $]-M, M[$, Eq. (6) can be rewritten as

\[ T(N) - 2^t < N < T(N) + 2^t. \tag{7} \]

Let $C_i$ and $S_i$ be the ith carry and sum partial result words, respectively. They are in the following range:

\[ -xM < C_i + S_i < xM, \tag{8} \]

where x is a positive integer that determines the range within which the partial results fall.

From Eqs. (7) and (8), the estimated sign of $C_i$ and $S_i$ is positive,

\[ T(C_i) + T(S_i) \ge 0 \quad \text{if} \quad -2^{t+1} < C_i + S_i < xM, \tag{9.1} \]

and their estimated sign is negative,

\[ T(C_i) + T(S_i) < 0 \quad \text{if} \quad -xM < C_i + S_i < 2^{t+1}. \tag{9.2} \]

Therefore, depending on the result of the sign estimation given by (9), $T(-xM)$ is added to $T(2C_i) + T(2S_i)$ if the estimated sign is positive, or $T(xM)$ is added to $T(2C_i) + T(2S_i)$ if the estimated sign is negative. The results of this estimation operation are fed to the second stage of the sign estimation for the partial results $\tilde{C}_i + \tilde{S}_i = 2(C_i + S_i) \mp xM$.

Taking into account the estimated sign of $\tilde{C}_i + \tilde{S}_i$ and $C_i + S_i$, there are four cases of modular reduction of the carry–sum pair $C_i + S_i$, given by

\[ C_i + S_i = 2(C_{i-1} + S_{i-1}) + a_{n-i}B - y_k M. \tag{10} \]

The four different values of $y_k$ are defined as follows: $y_0$ is selected if the signs of the two stages are both positive, and $y_1$ is selected if the result of the first stage is positive and the result of the second stage is negative. If the result of the first stage is negative and the result of the second stage is positive, $y_2$ is selected. The last case occurs when the results of both stages are negative, and in this case $y_3$ is selected. These four cases correspond to the different ranges into which the partial results can fall, and are summarized by the following four inequalities:

\[ -2^{t+1} + xM < 2(C_{i-1} + S_{i-1}) + a_{n-i}B < 2xM + M, \tag{11.1} \]

\[ -2^{t+1} < 2(C_{i-1} + S_{i-1}) + a_{n-i}B < xM + M + 2^{t+1}, \tag{11.2} \]

\[ -2^{t+1} - xM < 2(C_{i-1} + S_{i-1}) + a_{n-i}B < M + 2^{t+1}, \tag{11.3} \]

\[ -2xM < 2(C_{i-1} + S_{i-1}) + a_{n-i}B < -xM + M + 2^{t+1}. \tag{11.4} \]

The role of the correction term $y_k M$ is to keep the partial results in the range $]-xM, xM[$, so that the iteration bound is kept intact and another iteration can be carried out. The first solution to these four inequalities (11) is found when x equals three. The different values of $y_k$ define the reduction rule of the new iterative modular multiplier, where the constraint on the parameter t is a necessary condition for its correctness. We therefore obtain the following rules for x = 3 and 4:

\[ (y_0, y_1, y_2, y_3) = (4, 2, -1, -4), \quad x = 3, \quad t \le n - 2 \]

and

\[ y_0 \in \{5, 6, 7\}, \quad y_1 \in \{2, 3\}, \quad y_2 \in \{-2, -1\}, \quad y_3 \in \{-6, -5, -4\}, \quad x = 4, \quad t \le n - 2. \]

The radix-2 iterative modular multiplier for x = 3 is shown in Algorithm 2 below. The two sign estimation stages are $T(2S_{i-1}) + T(2C_{i-1})$ and $T(2S_{i-1}) + T(2C_{i-1}) + T(\mp 3M)$, where the sign of $\mp 3M$ depends on the sign of the previous estimation stage.

The aim is to generate multiples of the multiplicand and the modulus that are powers of 2 (i.e. $\pm 2^l$, where l is a positive integer). This is achieved without using Booth’s recoding. Instead, an unsigned representation of both the multiplier and the multiplicand is preferred, as the extra circuitry that generates the digits (1, 0, −1) in Booth’s recoding is avoided.

Algorithm 2

    $S_0 = 0$; $C_0 = 0$
    For $i = 1 \ldots n$ {
        If $T(2S_{i-1}) + T(2C_{i-1}) \ge 0$ then
            If $T(2S_{i-1}) + T(2C_{i-1}) + T(-3M) \ge 0$
                then $S_i + C_i = 2(S_{i-1} + C_{i-1}) + a_{n-i}B - y_0 M$
                else $S_i + C_i = 2(S_{i-1} + C_{i-1}) + a_{n-i}B - y_1 M$
        If $T(2S_{i-1}) + T(2C_{i-1}) < 0$ then
            If $T(2S_{i-1}) + T(2C_{i-1}) + T(3M) \ge 0$
                then $S_i + C_i = 2(S_{i-1} + C_{i-1}) + a_{n-i}B - y_2 M$
                else $S_i + C_i = 2(S_{i-1} + C_{i-1}) + a_{n-i}B - y_3 M$
    }
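A behavioural Python model of Algorithm 2 for x = 3 is sketched below. The names `csa`, `T`, `Y` and `radix2_modmul` are hypothetical, Python integers stand in for the carry and sum words, and two modelling choices are assumptions rather than statements about the authors' circuit: the estimate of ±3M is formed as T(±2M) + T(±M), and t is set to n − 3, the condition the paper gives for that construction.

```python
def csa(a, b, c):
    """3:2 carry-save adder: sum + carry == a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def T(N, t):
    """Estimator of Eq. (5): clear the t least significant bits of N."""
    return N - (N % (1 << t))


# (first stage positive?, second stage positive?) -> y_k, rule for x = 3
Y = {(True, True): 4, (True, False): 2, (False, True): -1, (False, False): -4}


def radix2_modmul(A, B, M):
    """Model of Algorithm 2; returns (S, C, trace of partial results)."""
    n = (M - 1).bit_length()
    t = max(n - 3, 0)                          # assumed choice of t
    S = C = 0
    trace = []
    for i in range(1, n + 1):
        a = (A >> (n - i)) & 1                 # bit a_{n-i}, MSB first
        e1 = T(2 * S, t) + T(2 * C, t)         # first sign estimation stage
        first = e1 >= 0
        sign = -1 if first else 1              # add -3M if positive, else +3M
        e2 = e1 + T(sign * 2 * M, t) + T(sign * M, t)
        y = Y[first, e2 >= 0]
        S, C = csa(2 * S, 2 * C, a * B)        # first CSA row: partial product
        S, C = csa(S, C, -y * M)               # second CSA row: single reduction
        trace.append(S + C)
    return S, C, trace
```

With the operands of the Fig. 3 example (A = 187, B = 221, M = 239), the final pair satisfies (S + C) mod M = 219. The sketch makes no claim that its intermediate words reproduce the figure bit for bit.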

The Basic Cell (BC) of the multiplier and the multiplication Dependency Graph (DG) are shown in Figs. 1(a) and 2(a), respectively. The BC uses two CSAs (Full Adders) and an AND gate to form the partial products. The bits $r_{i,1}$, $r_{i,2}$ from the sign estimation stages are used to select one of the bits $(m_j, m_{j+1}, m_{j+2})$ for the modular reduction (which is equivalent to multiplying the modulus M by 1, 2, and 4, respectively, as shown in Algorithm 2). The bit $r_{i,1}$ is the sign bit of the first estimation stage and is used to find the two’s complement of the correction value to be added to the partial results. Therefore, the multiples of the modulus M that are used for the reduction operation are 4M, 2M, −M, and −4M. The two stages of the sign estimation are shown in Fig. 2(b). The sign bit of the first stage is used to select either $T(3M)$ or $T(-3M)$, which is then accumulated in the second stage with the result from the first stage to produce the second sign bit.

The partial result $S_{i-1} + C_{i-1}$ falls within the range $]-3M, 3M[$; therefore, $2(S_{i-1} + C_{i-1})$ can be represented using n + 4 bits. The rule of Algorithm 2 states that $t \le n - 2$, thus the sign estimation is carried out on 7 bits. However, to allow only the modulus M to be fed to the multiplier structure, the terms $T(3M)$ and $T(-3M)$ are formed by the addition of $T(2M)$ to $T(M)$, and of $T(-2M)$ to $T(-M)$, respectively. Nevertheless, $T(2M) + T(M)$ is bounded by

\[ T(3M) - 3 \times 2^t < T(2M) + T(M) < T(3M) + 3 \times 2^t. \tag{12} \]

The effect on the reduction rule is that the new necessary condition for the correctness of the algorithm is $t \le n - 3$; therefore the sign estimation is calculated using 8 bits. An illustration of how Algorithm 2 works is shown in Fig. 3.

Fig. 1. The basic cell of the proposed radix-2 modular multiplier (panels (a) and (b)).
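The bound of Eq. (12) can be checked numerically with a few lines of Python. This is a brute-force sanity check over small values, not a proof, and the helper names are hypothetical.

```python
def T(N, t):
    """Estimator of Eq. (5): clear the t least significant bits of N."""
    return N - (N % (1 << t))


def eq12_holds(M, t):
    """True if T(2M) + T(M) lies strictly within 3 * 2^t of T(3M)."""
    approx = T(2 * M, t) + T(M, t)
    exact = T(3 * M, t)
    return exact - 3 * 2**t < approx < exact + 3 * 2**t


# Exhaustive check over a small, arbitrarily chosen range of moduli and t.
all_ok = all(eq12_holds(M, t) for M in range(1, 2000) for t in range(12))
```

Each truncation loses less than $2^t$, so the two-term sum can fall below 3M by nearly $2 \times 2^t$ while $T(3M)$ falls below it by less than $2^t$. The extra slack is what tightens the condition on t from n − 2 to n − 3.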

Fig. 2. The DG and the estimation stages of the proposed modular multiplier (panels (a) and (b)).

Fig. 3. An example of using Algorithm 2, with B = 221 = 011011101, A = 187 = 010111011, M = 239 = 011101111 and R = 219 = 011011011.

5. The new radix-4 iterative algorithm for modular multiplication

The algorithm shown in the previous section can be extended to radix-4, whereby radix-4 Booth’s recoding is used to generate the partial products. Three sign estimation stages are then required to calculate the right modular reduction value. This is because, after each iteration of the algorithm, the partial results are shifted two positions to the left instead of one position as in the radix-2 algorithm. Therefore, eight different modular reduction terms $y_k M$ are required. The selection of one of the values $y_k$, $0 \le k \le 7$ (denoted $y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7$), depends on the sign of each of the three sign estimation stages. Table 1 shows these values and the corresponding signs of the sign estimation stages. Following the same calculation steps used for the radix-2 algorithm, the reduction rule is derived as follows.

Let z be:

\[ z = 4(C_i + S_i) + a_{n/2-i}B \quad \text{with} \quad a_{n/2-i} \in \{-2, -1, 0, 1, 2\}, \tag{13} \]

where $C_i$ and $S_i$ are the ith carry and sum words of the partial results. The modular reduction is carried out by subtracting $y_i M$ from z. The result must fall into the range

\[ -xM < z - y_i M < xM. \tag{14} \]

As in the radix-2 case, the different cases are listed below. For every value of z given by the following set of inequalities, a correction term is selected from Table 1:

\[ 3xM - 2M - 2^{t+1} < z < 4xM + 2M, \quad y_0 \text{ is selected}, \tag{15.1} \]

\[ 2xM - 2M - 2^{t+1} < z < 3xM + 2M + 2^{t+1}, \quad y_1 \text{ is selected}, \tag{15.2} \]

\[ xM - 2M - 2^{t+1} < z < 2xM + 2M + 2^{t+1}, \quad y_2 \text{ is selected}, \tag{15.3} \]

\[ -2M - 2^{t+1} < z < xM + 2M + 2^{t+1}, \quad y_3 \text{ is selected}, \tag{15.4} \]

\[ -xM - 2M - 2^{t+1} < z < 2M + 2^{t+1}, \quad y_4 \text{ is selected}, \tag{15.5} \]

\[ -2xM - 2M - 2^{t+1} < z < -xM + 2M + 2^{t+1}, \quad y_5 \text{ is selected}, \tag{15.6} \]

\[ -3xM - 2M - 2^{t+1} < z < -2xM + 2M + 2^{t+1}, \quad y_6 \text{ is selected}, \tag{15.7} \]

\[ -4xM - 2M < z < -3xM + 2M + 2^{t+1}, \quad y_7 \text{ is selected}. \tag{15.8} \]

After subtracting the appropriate term $y_i M$ from these inequalities, the first set of solutions is found when x equals 6. The reduction rules for x = 6, 7 and 8 are given by

\[ (y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7) = (21, 15, 9, 3, -3, -9, -15, -21), \quad t \le n - 2, \quad x = 6 \]

and

\[ y_0 \in \{23, 24, 25\}, \; y_1 \in \{17, 18\}, \; y_2 \in \{10, 11\}, \; y_3 \in \{3, 4\}, \; y_4 \in \{-4, -3\}, \; y_5 \in \{-11, -10\}, \; y_6 \in \{-18, -17\}, \; y_7 \in \{-25, -24, -23\}, \quad x = 7, \quad t \le n - 2, \]

and

\[ (y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7) = (28, 20, 12, 4, -4, -12, -20, -28), \quad t \le n - 1, \quad x = 8. \]
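The reduction rules above can be cross-checked by solving inequalities (14) and (15) for integer values of y. The following Python sketch does this by direct arithmetic. The function names are hypothetical, and the test modulus M = 239 (n = 8) is borrowed from the Fig. 3 example purely for illustration.

```python
def z_ranges(x, M, t):
    """Open intervals (lo, hi) for z in Eqs. (15.1)-(15.8), for y_0 ... y_7."""
    d = 2 ** (t + 1)
    return [
        (3 * x * M - 2 * M - d, 4 * x * M + 2 * M),
        (2 * x * M - 2 * M - d, 3 * x * M + 2 * M + d),
        (1 * x * M - 2 * M - d, 2 * x * M + 2 * M + d),
        (-2 * M - d, x * M + 2 * M + d),
        (-x * M - 2 * M - d, 2 * M + d),
        (-2 * x * M - 2 * M - d, -x * M + 2 * M + d),
        (-3 * x * M - 2 * M - d, -2 * x * M + 2 * M + d),
        (-4 * x * M - 2 * M, -3 * x * M + 2 * M + d),
    ]


def admissible(x, M, t):
    """For each case k, list the integers y that keep z - y*M inside
    ]-xM, xM[ for every z in the open interval (lo, hi), as Eq. (14) requires."""
    result = []
    for lo, hi in z_ranges(x, M, t):
        y_max = (lo + x * M) // M          # largest y with lo - y*M >= -x*M
        y_min = -((x * M - hi) // M)       # smallest y with hi - y*M <= x*M
        result.append(list(range(y_min, y_max + 1)))
    return result
```

For M = 239 and t = 6, `admissible(7, 239, 6)` yields the sets listed above for x = 7, and `admissible(5, 239, 6)` contains an empty list, which agrees with x = 6 being the first value that admits a solution. For x = 6 the check returns a slightly larger set than the single tuple quoted (for example y_0 may be 20 or 21), so the published tuple is one admissible choice.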

The radix-4 algorithm for x = 8 is summarized below in Algorithm 3. First the partial results are shifted two positions to the left, then the sign estimation operations take place. The reduction term is selected as a function of the sign of each of these operations.

Algorithm 3

    $S_0 = 0$; $C_0 = 0$
    For $i = 1 \ldots n$ {
        If $T(4S_{i-1}) + T(4C_{i-1}) \ge 0$
            If $T(4S_{i-1}) + T(4C_{i-1}) + T(-16M) \ge 0$
                If $T(4S_{i-1}) + T(4C_{i-1}) + T(-16M) + T(-8M) \ge 0$
                    $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_0 M$
                else $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_1 M$
            else
                If $T(4S_{i-1}) + T(4C_{i-1}) + T(-16M) + T(8M) \ge 0$
                    $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_2 M$
                else $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_3 M$
        else
            If $T(4S_{i-1}) + T(4C_{i-1}) + T(16M) \ge 0$
                If $T(4S_{i-1}) + T(4C_{i-1}) + T(16M) + T(-8M) \ge 0$
                    $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_4 M$
                else $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_5 M$
            else
                If $T(4S_{i-1}) + T(4C_{i-1}) + T(16M) + T(8M) \ge 0$
                    $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_6 M$
                else $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_7 M$
    }

The values $y_i$ obtained above are not powers of two, thus they cannot be obtained by simple shift operations. To remedy this problem, each value is split into two powers of two. This means that two rows of CSAs are required, as shown in Fig. 4(a). The values 28, 20, 12 are rewritten as the sum of 32 and −4, 16 and 4, and 8 and 4, respectively. The sign bit of the first sign estimation stage, $r_{i,1}$, is used to determine the sign of the reduction value (see Fig. 4(c)). The sign bits of the second and third stages, $r_{i,2}$ and $r_{i,3}$, are used to determine the magnitude of the correction value. The sign bit of the third stage of the sign estimation, $r_{i,3}$, is used to select either 4M or −4M. The sign estimation operation is performed using three stages of 8-bit CPAs/CLAs. The modular multiplier uses three rows of CSAs. In the first row, the partial products are generated using the radix-4 Booth’s recoding technique (see Fig. 3(a)); the remaining two rows are used to accumulate the modular reduction. The multiplier DG is depicted in Fig. 4(d).

Table 1
The reduction values of the proposed radix-4 modular multiplier

               y0         y1         y2         y3         y4         y5         y6         y7
    1st stage  positive   positive   positive   positive   negative   negative   negative   negative
    2nd stage  positive   positive   negative   negative   positive   positive   negative   negative
    3rd stage  positive   negative   positive   negative   positive   negative   positive   negative
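A behavioural Python model of the radix-4 iteration for x = 8 is sketched below. It is an illustration under stated assumptions, not the authors' circuit. The names are hypothetical. The loop runs over the radix-4 Booth digits of A (about n/2 iterations) rather than over n. The correction $y_k M$ is applied as a single addend instead of being split over two CSA rows. The third-stage test follows the ranges of Eqs. (15.1)–(15.8). The choice t = n − 2 is an assumption made so that the truncation of the constants 16M and 8M is also covered.

```python
def csa(a, b, c):
    """3:2 carry-save adder: sum + carry == a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def T(N, t):
    """Estimator of Eq. (5): clear the t least significant bits of N."""
    return N - (N % (1 << t))


# Table 1: (1st, 2nd, 3rd stage positive?) -> y_k, rule for x = 8
Y8 = {(1, 1, 1): 28, (1, 1, 0): 20, (1, 0, 1): 12, (1, 0, 0): 4,
      (0, 1, 1): -4, (0, 1, 0): -12, (0, 0, 1): -20, (0, 0, 0): -28}


def booth4_digits(A, n):
    """Radix-4 Booth digits of 0 <= A < 2**n, most significant digit first."""
    def bit(k):
        return (A >> k) & 1 if k >= 0 else 0
    m = (n + 2) // 2                       # enough digits for a zero sign bit
    return [bit(2 * j - 1) + bit(2 * j) - 2 * bit(2 * j + 1)
            for j in reversed(range(m))]


def radix4_modmul(A, B, M):
    """Model of the radix-4 iteration; returns (S, C, trace)."""
    n = (M - 1).bit_length()
    t = max(n - 2, 0)                      # assumed choice of t
    S = C = 0
    trace = []
    for d in booth4_digits(A, n):          # d in {-2, -1, 0, 1, 2}
        e1 = T(4 * S, t) + T(4 * C, t)
        r1 = int(e1 >= 0)
        e2 = e1 + T(-16 * M if r1 else 16 * M, t)
        r2 = int(e2 >= 0)
        e3 = e2 + T(-8 * M if r2 else 8 * M, t)
        r3 = int(e3 >= 0)
        y = Y8[r1, r2, r3]
        S, C = csa(4 * S, 4 * C, d * B)    # Booth partial product
        S, C = csa(S, C, -y * M)           # modular correction
        trace.append(S + C)
    return S, C, trace
```

For A = 187 the Booth digits, most significant first, are [1, −1, 0, −1, −1], which sum back to 256 − 64 − 4 − 1 = 187.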

6. The modular reduction: the last step

The structures shown in the previous sections do not compute the exact value of the modular product, since the result of their calculation lies in the range $]-xM, xM[$, while the exact result must fall into the range $[0, M[$. As the sign estimation technique was adopted in these structures to avoid the long propagation paths that decrease the clock frequency of the system, a reduced word length of the data must also be used to reduce the results of the two structures shown above. The sign estimation technique is therefore also used to carry out the task of reducing the results. The sign estimation and subtraction operations are used without feeding any data to the structure, as shown in Algorithm 4 below.

Algorithm 4

    $-xM < S_i + C_i < xM$
    For $i = \log_2(x) \ldots 1$ {
        If $T(S_i) + T(C_i) \ge 0$
            then $S_{i-1} + C_{i-1} = S_i + C_i - 2^{i-1}M$
            else $S_{i-1} + C_{i-1} = S_i + C_i + 2^{i-1}M$
    }

This algorithm works in a binary search fashion, which is only possible because a redundant representation is used for the partial results. The algorithm requires $\lceil \log_2(x) \rceil$ reduction steps, where $2^{n-1} < x \le 2^n$, and the results are reduced to the range $-M - 2^{t+1} < R = S_0 + C_0 < M + 2^{t+1}$. This demonstrates the benefit of using a redundant representation of the operands and partial results.
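Algorithm 4 can be modelled in Python as sketched below. The helper names are hypothetical, the number of steps is taken as $\lceil \log_2 x \rceil$ so that x = 3 is also covered, and the closing `canonical` function, which uses full-precision comparisons, is an added assumption showing one way to reach $[0, M[$ afterwards.

```python
def csa(a, b, c):
    """3:2 carry-save adder: sum + carry == a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def T(N, t):
    """Estimator of Eq. (5): clear the t least significant bits of N."""
    return N - (N % (1 << t))


def final_reduction(S, C, M, x, t):
    """Algorithm 4: reduce S + C from ]-xM, xM[ by adding or subtracting
    2^(i-1) * M according to the estimated sign, for i = ceil(log2 x) ... 1."""
    steps = (x - 1).bit_length()           # ceil(log2(x)) without floats
    for i in range(steps, 0, -1):
        k = 1 << (i - 1)
        if T(S, t) + T(C, t) >= 0:
            S, C = csa(S, C, -k * M)
        else:
            S, C = csa(S, C, k * M)
    return S, C


def canonical(S, C, M):
    """Full-precision clean-up to [0, M[ (not part of Algorithm 4)."""
    R = S + C
    while R < 0:
        R += M
    while R >= M:
        R -= M
    return R
```

Each step halves the bound on the magnitude of S + C, because a non-negative estimate guarantees S + C ≥ 0 and a negative estimate guarantees S + C < $2^{t+1}$.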

7. Comparison of performances

A performance comparison of the newly proposed architectures with similar structures available in the literature [6,7], in terms of both speed and area, is presented in Table 2. To reduce the final result to the range $[0, M[$, repeated addition or subtraction operations are used in the last stage of the algorithms shown above. However, this last stage has not been taken into account in the comparison of the implementation results, whereby the results are kept within the ranges $[0, 3M[$ and $]-xM, xM[$ for the structure in [1,13] and the new structures proposed in this paper, respectively. It is worth mentioning that the terms $T(3M)$ and $T(-3M)$ are replaced by the sums $T(4M) + T(-M)$ and $T(-4M) + T(M)$, respectively, so that there is no need to pre-calculate the terms $T(\pm 3M)$. The comparison is based on the area (area unit, A.U) and the delay (time unit, T.U) of an inverter gate. The area of a FA (or a CSA), an EXOR gate, and a multiplexor is 10, 4, and 4 A.U, respectively. The propagation delay within these three basic elements is 6, 3, and 3 T.U, respectively [9]. The comparison has been made between a serial implementation of the algorithm described in [6,7] and the new radix-2 iterative multiplier, and between the 2-bit serial implementation of [6,7] and the proposed radix-4 iterative multiplier. It was also supposed that only the modulus $M$ is fed to the multiplier; therefore two rows of EXOR gates have been added to the multiplier structure shown in [6,7]. It was also assumed that the pipelining is at the cell level, with the sign estimation stages implemented using CPAs. This means that there is no pipelining inside the cell, but the latches are placed at the outputs of the BCs. The time of the multiplication is the number of cycles (i.e., the number of iterations of the algorithm) multiplied by the clock period, which is the delay of the BC.

Fig. 4. The basic cell, the DG and estimation stages of the proposed modular multiplier.

Table 2
Comparison of performances based on the area and the delay of an inverter gate

| | [8] Bit serial | Radix-2 bit serial | [8] 2-bit serial | Radix-4 bit serial |
|---|---|---|---|---|
| Area | 56n A.U | 51n A.U | 98n A.U | 80n A.U |
| Clock | 95 T.U | 125 T.U | 185 T.U | 200 T.U |
| Time | 95n T.U | 125n T.U | 93n T.U | 100n T.U |

A.U: area unit; T.U: time unit.

As shown by Table 2, the bit-serial and 2-bit serial multipliers derived from [6,7] and the proposed radix-4 multiplier have almost the same speed. In terms of area usage, the bit-serial multiplier of [6,7] uses a slightly larger area than our radix-2 multiplier, while the radix-4 multiplier clearly requires less area than the 2-bit serial multiplier of [6,7]. The radix-4 multiplier also has a speed performance similar to those derived from [6,7], while the radix-2 multiplier has the worst speed performance. This is due in part to the longer CPAs that these multipliers use, which should be implemented using CLAs. Nevertheless, our radix-4 multiplier has a better area performance than the 2-bit serial multiplier of [6,7] while the speed performances are relatively comparable, which makes it a good candidate for digit-serial parallel implementations. Results of parallel implementations of these structures are shown in Table 3. The radix-4 algorithm achieves nearly the same processing time as the algorithm of [6,7]. However, it requires only about 80% of its area, making it the best choice. The radix-2 algorithm is slower: it requires about 130% of the multiplication time of [6,7]. The main benefit of the radix-2 multiplier is that it requires only about 90% of the area used by [6,7].
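The figures quoted in this comparison can be re-derived with a few lines of arithmetic. The sketch below is only a consistency check on the tabulated coefficients, not a new measurement. It assumes that a design consuming two bits per cycle needs $n/2$ cycles, which is how the $93n$ and $100n$ entries of Table 2 appear to be obtained; the dictionary keys and function names are hypothetical labels.

```python
from fractions import Fraction

# Table 2: (clock period in T.U, bits processed per cycle) for each serial design.
SERIAL = {
    "reference bit-serial": (95, 1),
    "radix-2": (125, 1),
    "reference 2-bit serial": (185, 2),
    "radix-4": (200, 2),
}

# Table 3: (time coefficient of n, area coefficient of n^2) for each parallel design.
PARALLEL = {
    "reference": (Fraction(90), Fraction(42)),
    "Algorithm 5": (Fraction(195, 2), Fraction(33)),
    "Algorithm 4": (Fraction(120), Fraction(37)),
}


def serial_time_coefficient(clock: int, bits_per_cycle: int) -> Fraction:
    """Multiplication time divided by n: (n / bits_per_cycle) cycles times the clock."""
    return Fraction(clock, bits_per_cycle)


def percent_of_reference(value: Fraction, reference: Fraction) -> int:
    """Ratio to the reference design, rounded to the nearest percent."""
    return round(100 * value / reference)


if __name__ == "__main__":
    for name, (clock, bits) in SERIAL.items():
        print(f"{name}: {float(serial_time_coefficient(clock, bits))} n T.U")
    ref_time, ref_area = PARALLEL["reference"]
    for name, (time, area) in PARALLEL.items():
        print(name,
              percent_of_reference(time, ref_time),
              percent_of_reference(area, ref_area),
              percent_of_reference(time * area, ref_time * ref_area))
```

The rounded percentages in the text (about 80%, 130% and 90%) correspond to the more precise ratios 79%, 133% and 88% obtained from the Table 3 coefficients.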

Table 3
Comparison of performances for a parallel implementation based on the area and the delay of an inverter gate

| | Algorithm in [8] | Algorithm 5 | Algorithm 4 |
|---|---|---|---|
| Time | $\approx 90n$ ($\approx 100\%$) | $\approx 195n/2$ ($\approx 108\%$) | $\approx 120n$ ($\approx 133\%$) |
| Area | $\approx 42n^2$ ($\approx 100\%$) | $\approx 33n^2$ ($\approx 79\%$) | $\approx 37n^2$ ($\approx 88\%$) |
| Area $\times$ time | $\approx 3780n^3$ ($\approx 100\%$) | $\approx 6435n^3/2$ ($\approx 85\%$) | $\approx 4440n^3$ ($\approx 117\%$) |

8. Towards higher-radix modular multipliers

In this section, the extension of the structures proposed in this paper to higher radices is presented. Because of the large radix size, which is called the word size, it is essential that the multiplier's words are in a redundant representation and that the partial results of the multiplication are accumulated in a redundant representation too, as shown for the case of the radix-4 modular multiplier. A redundant representation is useful for producing multiples of the multiplicand that are powers of 2 (i.e. $\pm 2^l$, where $l$ is a positive integer), in such a way that the partial result at each iteration falls within a symmetric range. Therefore, the partial results are also kept in a redundant representation, and they are converted into a positive value only once all the partial products have been generated and their modular reduction has been carried out. The multiplier words are represented using Booth's recoding. Radix-4 and radix-2 Booth's recoding can be used to avoid generating terms that are not power-of-two multiples of the multiplicand.

Let $2^d$ be the adopted high radix. $C_i$ and $S_i$ are the partial results at the $i$th cycle. Once shifted by $d$ positions to the left:

$$-2^d xM < 2^d (C_i + S_i) < 2^d xM. \qquad (16)$$

The above equation imposes the use of $d$ reduction stages to keep the partial results in the range given by (7). Each reduction stage outputs a sign bit that is used to select the reduction term. Let $s_i$ be the sign bit from the $i$th reduction stage of the radix-$2^d$ modular multiplier ($0 \leq i \leq d$). As in the previous two structures, if the sign estimation operation produces a positive sign bit, i.e. $s_i$ equals 0, then $2^{i-1}M$ is subtracted from the partial results. If the estimated sign is negative, i.e. $s_i$ equals 1, then $2^{i-1}M$ is added to the partial results. Then another sign estimation operation takes place. This can be translated into the following formula: $T(S_i) + T(C_i) + T(2^i M (s_i - \bar{s}_i))$, where $\bar{s}_i$ is the complement of $s_i$.

The range of the partial products is given as a function of the sign bits output by the sign estimation stages. Let $S$ be the word formed by these sign bits. The MSB of this word is $s_d$, which is the output of the first stage, while the LSB is the bit $s_0$ of the last sign estimation stage. The reduction operation only takes place when all the sign estimation operations have been carried out. Depending on the results of all these stages, a reduction value is selected to bring the partial results back to the range $]-xM, xM[$. The partial results fall into the range

$$(s_d - \bar{s}_d)\left(-1 + 2^d - s_d 2^d + \sum_{i=0}^{d-1} s_i 2^i\right) xM - 2^{t+1} < 2^d (C_i + S_i) < (s_d - \bar{s}_d)\left(2^d - s_d 2^d + \sum_{i=0}^{d-1} s_i 2^i\right) xM + 2^{t+1}. \qquad (17)$$
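The idea that all sign estimations are performed first and a single reduction follows can be sketched in software. The following is a toy model under stated assumptions, not the authors' circuit: `truncate` models $T(\cdot)$ by clearing the `t` low bits, the running estimate is updated with truncated multiples only, and the reduction unit `unit` is taken to be $xM$ so that $d$ stages bring a value from $]-2^d xM, 2^d xM[$ back to roughly $]-xM, xM[$. The names `estimate_then_reduce`, `unit` and `net` are hypothetical.

```python
def truncate(v: int, t: int) -> int:
    """Model of T(v): clear the t least significant bits (rounds toward -infinity)."""
    return (v >> t) << t


def estimate_then_reduce(s: int, c: int, unit: int, d: int, t: int):
    """Return (sign_bits, reduced_value) for s + c in ]-2**d * unit, 2**d * unit[.

    sign_bits[0] is the bit of the first stage; 0 means estimated non-negative.
    """
    estimate = truncate(s, t) + truncate(c, t)   # short-precision view of s + c
    net = 0                                      # net multiple of `unit` to add
    sign_bits = []
    for i in range(d, 0, -1):
        bit = 0 if estimate >= 0 else 1
        sign_bits.append(bit)
        step = (1 << (i - 1)) * (1 if bit else -1)
        net += step
        estimate += truncate(step * unit, t)     # only truncated words are updated
    return sign_bits, s + c + net * unit         # one full-width reduction at the end


if __name__ == "__main__":
    bits, reduced = estimate_then_reduce(5000, 731, unit=772, d=3, t=4)
    print(bits, reduced)
```

In this toy model the truncation errors accumulate over the stages, so the reduced value is only guaranteed to lie within about $(d+1)\,2^t$ of the target range; the exact margins of a hardware implementation depend on details that are not modelled here.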

The effect of the sign estimation is apparent when two successive ranges are examined. In Fig. 5, two successive ranges overlap. Had a full-precision magnitude comparison been used, the length of each range would have been $M$. Since the short magnitude comparison is used, the length of each range is $M + 2^{t+2}$, and two successive ranges share a common interval of length $2^{t+2}$.

Fig. 5. The effect of the sign estimation technique on the partial results.

Once the partial products and the reduction terms $y_i M$ have been added to the partial results of the previous cycle, the rule of modular reduction can be applied in order to keep these results in the range $]-xM, xM[$. First, the parameter of the sign estimation, $t$, is made less than or equal to $n - 1$ ($t \leq n - 1$), so that $2^t \leq M$. The second step is to determine the value of $x$ that produces values $y_i$ that are equal to powers of 2. If no such value of $x$ is found, then numbers $y_i$ that are sums of powers of 2 can be considered. Once the values of $x$ and $y_i$ are determined, the multiples of $M$ used by the sign estimation stages can be found. However, these values may not be powers of 2, in which case they cannot be generated by shift operations. To circumvent this problem, the parameter $t$ is changed appropriately, so that any multiple of $M$ can be written as a sum of some powers of 2.
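To illustrate how a multiple of $M$ that is not a power of 2 can still be produced with shifts, the sketch below decomposes an integer into signed powers of 2 and rebuilds the multiple from shifted copies of $M$. The recoding used is the standard non-adjacent form, chosen here for illustration and not taken from the paper; the function names are hypothetical. For $y = 3$ it yields $4 - 1$, which mirrors the replacement of $T(3M)$ by $T(4M) + T(-M)$ in Section 7.

```python
def signed_power_terms(y: int) -> list[tuple[int, int]]:
    """Return [(sign, exponent), ...] such that sum(sign * 2**exponent) == y."""
    terms = []
    exponent = 0
    while y != 0:
        if y & 1:
            digit = 2 - (y & 3)        # +1 if y = 1 (mod 4), -1 if y = 3 (mod 4)
            terms.append((digit, exponent))
            y -= digit                 # y is now divisible by 4
        y >>= 1
        exponent += 1
    return terms


def multiple_by_shifts(m: int, y: int) -> int:
    """Compute y * m using only shifts, additions and subtractions of m."""
    return sum(sign * (m << exponent) for sign, exponent in signed_power_terms(y))


if __name__ == "__main__":
    print(signed_power_terms(3))       # 3 = 4 - 1
    print(multiple_by_shifts(193, 3) == 3 * 193)
```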

9. Conclusion

In this paper, two new iterative algorithms for modular multiplication have been presented. The implementation of these algorithms yields scalable architectures that can be used for any modulus without altering the design. This makes them useful for performing the modular multiplication operation, which is the basis of cryptosystems and authentication schemes. Serial implementations have shown that the radix-2 algorithm has a better area usage than similar structures available in the literature, while its speed performance is rather worse. This drawback has been addressed in the radix-4 algorithm, which exhibits better area usage than similar structures with relatively similar speed performance. The parallel implementation of these algorithms has also shown that the radix-4 algorithm has the best area usage, while its speed performance is similar to that of the structures proposed in the literature. This makes the radix-4 algorithm a good choice for digit-serial parallel implementations. The design of higher-radix multipliers has also been investigated. In this case, only the multiplication operands are fed to these multipliers, while their parameters are left to the designer to choose in such a way that only values that are powers of 2 are used during the modular multiplication, thus making this process simpler.

References

[1] G.R. Blakley, A computer algorithm for the product AB modulo M, IEEE Transactions on Computers 32 (1983) 497–500.

[2] C.D. Chiou, T.C. Yang, Iterative modular multiplication algorithm without magnitude comparison, Electronics Letters 30 (30) (1994).

[3] Chung Nan Lyu, David W. Matula, Redundant Binary Booth Recoding, in: Proceedings of the 12th Symposium on Computer Arithmetic, July 19–21, Bath, England, 1995, pp. 50–58.

[4] W. Diffie, M.E. Hellman, New directions in cryptography, IEEE Transactions on Information Theory 22 (1976) 644–654.

[5] W.L. Freking, K.K. Parhi, A unified method for iterative computation of modular multiplication and reduction operations, in: Proceedings of the 1999 IEEE International Conference on Computer Design, ICCD '99, 1999, pp. 80–87.

[6] C.K. Koc, C.Y. Hung, Carry Save Adders for computing the product AB modulo N, Electronics Letters 26 (1990) 899–900.

[7] C.K. Koc, C.Y. Hung, Bit-level systolic arrays for modular multiplication, Journal of VLSI Signal Processing 3 (1991) 215–223.

[8] P. Kornerup, High-radix modular multiplication for cryptosystems, in: Proceedings of the 11th Symposium on Computer Arithmetic, IEEE Computer Society Press, Windsor, Canada, 1993, pp. 277–285.

[9] M.C. Mekhallalati, Novel algorithms and architectures for multiplication, Ph.D. Thesis, Department of Electrical and Electronic Engineering, University of Nottingham, UK, 1997.

[10] H. Orup, Simplifying quotient determination in high-radix modular multiplication, Proceedings of the 12th Symposium on Computer Arithmetic (ARITH '95).

[11] R.L. Rivest, A. Shamir, L. Adleman, A method for obtaining digital signatures and public-key cryptosystems, Communications of the ACM 21 (2) (1978) 120–126.

[12] Shand, J. Vuillemin, Fast implementation of RSA cryptography, Proceedings of the 11th IEEE Symposium on Computer Arithmetic, 1993.

[13] K.R. Sloan Jr., Comment on: a computer algorithm for the product AB modulo M, IEEE Transactions on Computers 34 (1985) 290–292.

[14] Takagi, Generating a power of an operand by a table look-up and a multiplication, Proceedings of the IEEE 13th Symposium on Computer Arithmetic, July 1997, pp. 126–131.

[15] A. Tenca, M.D. Ercegovac, Design of high-radix digit-slices for on-line computations, SPIE Conference on High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic, November 1996.

[16] C.D. Walter, Space/time trade-offs for higher radix modular multiplication using repeated addition, IEEE Transactions on Computers 46 (2) (1997) 139–141.

[17] M.C.W. Wu, Y.F. Chou, General modular multiplication by block multiplication and table lookup, in: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 1994, pp. 295–298.