ARTICLE IN PRESS
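0165-1684/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.sigpro.2004.07.001
*Corresponding author. Tel.: +44-1232-335465.
E-mail addresses: [email protected] (O. Nibouche), [email protected] (M. Nibouche), a.bouridane@qub.ac.uk (A. Bouridane).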
Signal Processing 84 (2004) 1919–1930
www.elsevier.com/locate/sigpro
New iterative algorithms for modular multiplication
Omar Nibouche^a, Mokhtar Nibouche^b, Ahmed Bouridane^c,*
^a Faculty of Engineering, Magee College, University of Ulster, Northland Road, Derry BT48 7JL, UK
^b Faculty of Computing, Engineering and Mathematical Sciences, The University of the West of England, Coldharbour Lane, Bristol BS16 1QY, UK
^c School of Computer Science, The Queen's University of Belfast, Bernard Crossland Building, 18 Malone Rd, Belfast BT7 1NN, UK
Received 18 July 2003; received in revised form 25 June 2004
Abstract
The new modular multiplier structures proposed in this paper are based on a short precision magnitude comparison instead of the full magnitude comparison operation. Another feature of these structures is that the comparison operations are carried out first, and the reduction operation takes place only once they have completed, whereas in previous work the comparison and the reduction operations are interleaved. This has resulted in a reduction of the number of stages required for the implementation of the modular reduction operation. Serial implementations have shown that the new radix-2 algorithm has a better area usage than similar structures available in the literature, while the proposed radix-4 algorithm exhibits better area usage than similar structures with relatively similar speed performance. The parallel implementation of these algorithms has also shown that the new radix-4 algorithm has the best area usage, while its speed performance is similar to that of structures proposed in the literature.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Modular multiplication; Cryptography; Serial-parallel systems; Computer arithmetic
1. Introduction
In recent years, the use of software tools and hardware devices for security functions has increased dramatically [4,9,11,12,17]. Security issues play a crucial role in the widespread use of many computer and communication systems, such
as the Internet, which more and more people are using to transmit sensitive information such as credit card numbers. A central tool for achieving system security is cryptography. Privacy and fraud concerns can be addressed through the use of various security primitives such as data encryption, which can be used with the appropriate protocols to construct secure and trusted networks [5,8].
In 1976, Diffie and Hellman introduced the idea of public key cryptography [4], which is now widely used to provide confidentiality,
authentication, data integrity and non-repudiation. Since then, numerous public-key cryptosystems have been proposed. All these systems base their security on mathematical one-way functions. RSA [11] is the most widely used public-key cryptosystem. An RSA operation is a modular exponentiation, which requires repeated modular multiplications. For security reasons, RSA operand sizes need to be 1024 bits or greater [12]. Therefore, the implementation of such systems, which requires efficient architectures to compute the modular product, has motivated the development of a number of modular multiplication algorithms and architectures.
The modular multiplication can be carried out through division, whereby the product of the two operands is formed and then divided by the modulus to compute the residue, which is the result of the modular multiplication operation. Such a parallel implementation can become prohibitively complex, as the operands are very large. In addition, the calculation of the quotient, which is not the ultimate result of the operation, increases the area used for the implementation. Therefore, the modular multiplication operation necessitates efficient iterative computation algorithms based on repeated subtraction operations for the design of high performance systems [7,8,12,15,16].
Various iterative techniques exist for the division-based modular multiplication operation. One popular algorithm is Blakeley's algorithm [1,13]. This algorithm interleaves the well-known shift and add technique with magnitude comparison and subtraction operations. It uses a Most Significant Bit First (MSBF) format and entails two magnitude comparison/subtraction operations. These operations are the drawback of any implementation of the algorithm, as they require a propagation path from the Least Significant Bit (LSB) to the Most Significant Bit (MSB), thus decreasing the clock frequency of the system. Some interesting MSBF iterative methods are not based on the division operation [2,9,14,17]. In these algorithms, the modular reduction is computed via Look Up Tables (LUTs) and by discarding a group of Most Significant Bits (MSBs), which are used to select a correction term. As the number of bits used to select the reduction value increases, memory access time becomes the bottleneck of the speed of the whole system. Therefore, reducing the LUT size can be of great practical concern. This can be done using a number of LUTs and multiplexers. The drawback of this approach is that more than one reduction value has to be stored. Consequently, the number of reduction values and the area required to reduce them must be balanced against the size of the LUT and its access time. The implementation of these algorithms is based upon the use of Carry Save Adders (CSAs) and multiplexers. These algorithms also need precalculation and storage of the correction values, which have to be carried out for every different modulus (the calculation program can be written into an EPROM).
In this paper, the problem of designing scalable modular multipliers that need no full magnitude comparison and can be used with any modulus is addressed. The new modular multiplier structures proposed in this paper are based on a short precision magnitude comparison instead of the full magnitude comparison as suggested by Blakeley [1,13]. The short magnitude comparison was first used in [6,7] to derive modular multiplier architectures. By using this approach, the partial results are kept in a wider range than in Blakeley's algorithm. However, the last reduction step, where no data are fed to the multiplier, reduces the partial results to the required range. The paper is organized as follows: the mathematical background of the modular multiplication operation and previous work are presented in Sections 2 and 3, respectively. The new algorithms are presented in Sections 4–7. An extension of this work to higher radices is addressed in Section 8 and the conclusions are drawn in Section 9.
2. Background and previous work
The computation of the modular multiplication operation is the computation of the product $P$, given by
$$P = \langle AB \rangle_M = AB \bmod M \quad \text{with } 0 \le A, B < M, \tag{1}$$
where $P$ is the remainder of the division of the product $AB$ by the modulus $M$. $A$, $B$, and $M$ are
positive integers sufficiently large to ensure the security of the system. The modulus $M$ is represented in binary using $n$ bits, where $n$ is given by
$$n = \lceil \log_2 M \rceil, \tag{2}$$
where $\lceil x \rceil$ is the ceiling of $x$, i.e. the integer greater than or equal to $x$ but smaller than $x + 1$.
As shown in [5], in order to compose a new method, it is simply required to develop a valid iteration rule that replicates a certain bound.
Let us consider the problem of computing $\langle AB \rangle_M$, where the product $AB$ of Eq. (1) is decomposed as
$$AB = \sum_{i=0}^{n-1} a_i 2^i B. \tag{3}$$
Using this decomposition, $\langle AB \rangle_M$ can be rewritten as
$$\langle AB \rangle_M = \left\langle \sum_{i=0}^{n-1} a_i 2^i B \right\rangle_M = \langle \langle \langle \langle \langle a_{n-1} 2^{n-1} B \rangle_M + a_{n-2} 2^{n-2} B \rangle_M + \cdots \rangle_M + a_1 2^1 B \rangle_M + a_0 2^0 B \rangle_M. \tag{4}$$
Inspection of this expression reveals that an iterative computation technique is possible. Namely, a partial result can be defined as $S_i = \langle S_{i-1} 2^{i-1} + a_{n-i-1} 2^{n-i-1} B \rangle_M$ with $S_{-1} = 0$, and the computation iterated $n$ times (i.e. $i = 0$ to $i = n - 1$). It can be noted from (4) that the computation leads to a class of exact methods, i.e., the partial results are in the range $[0, M[$. Nevertheless, in applications such as cryptography, exact evaluation of the modular reduction is often impractical, since calculation of the corresponding modular correction requires full-word-length information. To avoid long propagation paths that can lower the clock frequency, an approximate evaluation of the partial results is favoured, in which case only a small portion of a data word is analysed to compute corrections. As suggested by [5], the following general case can serve to provide some directions in the development of approximate iterative modular reduction methods. In these methods the partial results are kept in a larger range than the range of Eq. (1).
Let us assume that a method is available to compute $S_i = \langle S_{i-1} 2^{i-1} + a_{n-i-1} 2^{n-i-1} B \rangle_M + \varepsilon M$, where $\varepsilon \in \{\varepsilon_{\min}, \varepsilon_{\min} + 1, \ldots, \varepsilon_{\max} - 1, \varepsilon_{\max}\}$ and $(\varepsilon_{\min} - 1)M < S_{i-1} < (\varepsilon_{\max} + 1)M$. It follows from this definition that if the method is valid, the same bound is maintained on the new result, i.e., $(\varepsilon_{\min} - 1)M < S_i < (\varepsilon_{\max} + 1)M$, which can be proven by induction [5].
This establishes that in order to compose a new method, it is simply required to develop a valid iteration rule that replicates a certain bound. Furthermore, this underlines the fundamental fact that the error due to the approximate nature of the iteration rule does not accumulate with the iteration number but instead remains in a constant range [5].
To reduce the partial results back to the range $[0, M[$, another subtraction step can be used. The number of subtraction operations used in this step depends on the approximation used to derive the reduction rule of the algorithm.
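As an illustration of the exact class of methods discussed above, the following minimal Python sketch performs an MSB-first interleaved multiplication in which every partial result is reduced back to $[0, M[$ by two full-precision compare/subtract steps. It assumes the shift-and-add (Horner) form $S \leftarrow 2S + a_i B$ of the iteration; the function name `modmul_msb_first` and the use of `bit_length` for the operand width are choices made for this example only.

```python
def modmul_msb_first(a: int, b: int, m: int) -> int:
    """Exact MSB-first interleaved modular multiplication: (a * b) mod m.

    Assumes 0 <= a, b < m. Each step doubles the partial result, adds
    a_i * b, and reduces fully, so the partial result stays in [0, m).
    """
    n = m.bit_length()            # enough bits to scan all of a, since a < m
    s = 0
    for i in range(n - 1, -1, -1):
        a_i = (a >> i) & 1        # current multiplier bit, MSB first
        s = 2 * s + a_i * b       # shift-and-add; here 0 <= s < 3m
        # Two full-precision magnitude comparisons/subtractions.
        if s >= 2 * m:
            s -= 2 * m
        if s >= m:
            s -= m
    return s


print(modmul_msb_first(187, 221, 239))
```

The two `if` tests are the full-word comparisons whose carry propagation the approximate methods are designed to avoid.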
3. The short magnitude comparison algorithm
An idea to reduce the propagation path while keeping the scalability of the modular multiplier intact was presented in [6,7]. In this work, the use of Carry Propagate Adders (CPAs) was avoided by employing a technique to estimate the sign, so that instead of subtracting $M$ or $2M$ from the whole partial results, the subtraction operation is only performed on the $n - t$ MSBs of the operands. If $t = 0$, the subtraction operation is carried out on the whole operand words. On the other hand, if $t = n - 1$, the subtraction operation is carried out only on the MSB of the two operands. Therefore, the parameter $t$ controls the accuracy of the estimation and thus the total amount of logic required for the implementation. For an $n$-bit integer $N$, the estimator function $T$ was defined by [1,13] as
$$T(N) = N - \langle N \rangle_{2^t}, \tag{5}$$
where $0 \le t < n - 1$.
The operator $T$ replaces the first $t$ LSBs with zero, which implies that
$$T(N) \le N < T(N) + 2^t. \tag{6}$$
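The estimator of Eq. (5) and the bound of Eq. (6) can be checked numerically with a short sketch. The Python function below is a hypothetical helper, not part of the original design; it relies on Python's `%` returning a non-negative remainder, which is assumed here to model the truncation of a two's complement word.

```python
def T(N: int, t: int) -> int:
    """Estimator of Eq. (5): T(N) = N - (N mod 2^t), i.e. clear the t LSBs."""
    return N - (N % (1 << t))


# 91 = 0b1011011; clearing the 3 LSBs gives 0b1011000 = 88.
print(T(91, 3))

# Eq. (6): T(N) <= N < T(N) + 2^t, checked over a small range,
# including negative values.
assert all(T(N, t) <= N < T(N, t) + (1 << t)
           for N in range(-64, 64) for t in range(5))
```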
The principle of the algorithm is to reduce a carry–sum pair, denoted $(C_i, S_i)$, by estimating the sign of $\tilde{P}_i = S_i + C_i - M = \tilde{S}_i + \tilde{C}_i$. If the estimated sign of $\tilde{P}_i$ is positive, then $S_i = \tilde{S}_i$ and $C_i = \tilde{C}_i$. If the estimated sign is negative, the original value was in the correct range, and no reduction is required. The algorithm as shown by [6,7] is described in Algorithm 1 below:
Algorithm 1
$S_0 = 0$; $C_0 = 0$
For $i = 1 \ldots n$
{
  $S_i + C_i = 2(S_{i-1} + C_{i-1}) + a_{n-i}B$
  $\tilde{S}_i + \tilde{C}_i = S_i + C_i - 2M$
  If $T(\tilde{S}_i) + T(\tilde{C}_i) \ge 0$ then $S_i = \tilde{S}_i$ and $C_i = \tilde{C}_i$
  $\tilde{S}_i + \tilde{C}_i = S_i + C_i - M$
  If $T(\tilde{S}_i) + T(\tilde{C}_i) \ge 0$ then $S_i = \tilde{S}_i$ and $C_i = \tilde{C}_i$
}
The algorithm entails as many magnitude comparison/subtraction operations as Blakeley's algorithm [1,13]. The difference is that in the latter algorithm, the modular reduction is carried out after a full-precision magnitude comparison. In the case of the short magnitude comparison, this is carried out upon a reduced-length precision. For an $n$-bit modulus, the partial results of the multiplication equation fall in the range $[0, 5M[$. They can be shown in binary representation using $n + 3$ bits. A sufficient condition for the correctness of the algorithm is $t \le n - 1$. The sign estimation operation is carried out at least on the bits $t$ to $n + 3$ of the partial result, and thus it checks at least 5 MSBs of the partial results. It can be implemented using a CPA or Carry Look-ahead Adder (CLA), while the modular multiplier is implemented using three rows of CSAs.
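The behaviour of Algorithm 1 can be explored with a small software model. The sketch below is only an approximation of the hardware: the carry-save pair is modelled by splitting the integer partial result into two random words whose sum is the true value (an assumption made so that truncating each word separately produces an estimation error), and the names `short_compare_modmul` and `estimated_nonneg` are hypothetical.

```python
import random


def T(x: int, t: int) -> int:
    """Clear the t least significant bits (floor to a multiple of 2^t)."""
    return (x >> t) << t


def estimated_nonneg(p: int, t: int, rng: random.Random, width: int) -> bool:
    """Estimate the sign of p from a truncated carry-save pair (s, c).

    The split p = s + c is random, standing in for whatever pair the
    carry-save adders would actually produce (modelling assumption).
    """
    s = rng.randrange(-(1 << width), 1 << width)
    c = p - s
    return T(s, t) + T(c, t) >= 0


def short_compare_modmul(a: int, b: int, m: int, t: int, seed: int = 0) -> int:
    """Software model of Algorithm 1; result is congruent to a*b mod m."""
    rng = random.Random(seed)
    n = m.bit_length()
    p = 0
    for i in range(n - 1, -1, -1):
        p = 2 * p + ((a >> i) & 1) * b
        for k in (2, 1):                      # try -2M first, then -M
            if estimated_nonneg(p - k * m, t, rng, n + 4):
                p -= k * m                    # accept the reduced value
    return p


r = short_compare_modmul(187, 221, 239, t=5)
print(r, r % 239)
```

Because the estimate never exceeds the true value, a subtraction is only accepted when the reduced value is non-negative; the result may, however, exceed $M$ by a small amount, which is why a final correction step is needed.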
4. The new binary iterative algorithm for modular multiplication
Although the short magnitude comparison algorithm solves the problem of long path propagation delay, better performance can be achieved. As shown in this section, this has been achieved by estimating the sign prior to the modular reduction, instead of interleaving the sign estimation with the modular reduction operation. Another innovation made on the algorithm of [6,7] is the use of signed arithmetic. In this way, only one modular reduction is carried out for the new radix-2 algorithm as described in Algorithm 2. The new radix-4 algorithm also presented in this paper and described in Algorithm 3 uses two reduction operations and one operation of accumulation of the partial products generated using Booth's algorithm. Booth recoding is a commonly used technique to recode one of the operands in binary multiplication [3]. The radix-2 Booth recoding technique scans the bits of the multiplier one bit at a time, and adds or subtracts the multiplicand to or from the partial product, depending on the value of the current bit and the previous bit, while the radix-4 Booth recoding technique scans the bits of the multiplier two bits at a time. The process of inspecting the multiplier bits required by Booth's algorithm can be viewed as recoding the multiplier using three digits 0, 1 and $-1$ in the case of radix-2 recoding, and 0, 1, 2, $-1$, and $-2$ in the case of radix-4 Booth recoding, which leads to the generation of multiples of the multiplicand that are divisible by two. Therefore they can be generated by simple shift operations [3].
The main feature of the new algorithm, when compared to that proposed in [6,7], is that it requires only one modular reduction operation, instead of two operations. This has been achieved by reducing the partial results only once the sign estimation operations have been carried out, while in [6,7] the sign estimation operations are interleaved with the modular reduction operations. Taking into account that $N$ lies in the range $]-M, M[$, Eq. (6) can be rewritten as
$$T(N) - 2^t < N < T(N) + 2^t. \tag{7}$$
Let $C_i$ and $S_i$ be the $i$th carry and sum partial result words, respectively. They are in the following range:
$$-xM < C_i + S_i < xM, \tag{8}$$
where $x$ is a positive integer that determines the range within which the partial results fall.
From Eqs. (7) and (8), the estimated sign of $C_i$ and $S_i$ is positive,
$$T(C_i) + T(S_i) \ge 0 \quad \text{if } -2^{t+1} < C_i + S_i < xM, \tag{9.1}$$
and their estimated sign is negative,
$$T(C_i) + T(S_i) < 0 \quad \text{if } -xM < C_i + S_i < 2^{t+1}. \tag{9.2}$$
Therefore, depending on the result of the sign estimation given by (9), $T(-xM)$ is added to $T(2C_i) + T(2S_i)$ if the estimated sign is positive, or $T(xM)$ is added to $T(2C_i) + T(2S_i)$ if the estimated sign is negative. The results from this estimation operation are fed to the second stage of the sign estimation for the partial results $\tilde{C}_i + \tilde{S}_i = 2(C_i + S_i) - xM$.
Taking into account the estimated sign of $\tilde{C}_i + \tilde{S}_i$ and $C_i + S_i$, there are four cases of modular reduction of the carry–sum pair, $C_i + S_i$, given by
$$C_i + S_i = 2(C_{i-1} + S_{i-1}) + a_{n-i}B - y_k M. \tag{10}$$
The four different values of $y_k$ are defined as follows: $y_0$ is selected if the signs of the two stages are both positive, and $y_1$ is selected if the result of the first stage is positive and the result of the second stage is negative. If the result of the first stage is negative and the result of the second stage is positive, $y_2$ is selected. The last case occurs when the results of both stages are negative, and in this case $y_3$ is selected. These four cases are represented by the different ranges in which the partial results can fall, and are summarized by the following four inequalities:
$$-2^{t+1} + xM < 2(C_{i-1} + S_{i-1}) + a_{n-i}B < 2xM + M, \tag{11.1}$$
$$-2^{t+1} < 2(C_{i-1} + S_{i-1}) + a_{n-i}B < xM + M + 2^{t+1}, \tag{11.2}$$
$$-2^{t+1} - xM < 2(C_{i-1} + S_{i-1}) + a_{n-i}B < M + 2^{t+1}, \tag{11.3}$$
$$-2xM < 2(C_{i-1} + S_{i-1}) + a_{n-i}B < -xM + M + 2^{t+1}. \tag{11.4}$$
The role of the correction term $y_k M$ is to keep the partial results in the range $]-xM, xM[$, so that the iteration bound is kept intact and another iteration can be carried out. The first solution to these four sets of inequalities (11) is found when $x$ equals three. The different values of $y_k$ define the reduction rule of the new iterative modular multiplier, where the constraint made on the parameter $t$ is a necessary condition for its correctness.
Therefore, we obtain the following rules for $x = 3$ and 4:
$$(y_0, y_1, y_2, y_3) = (4, 2, -1, -4), \quad x = 3, \quad t \le n - 2$$
and
$$y_0 \in \{5, 6, 7\}, \quad y_1 \in \{2, 3\}, \quad y_2 \in \{-2, -1\}, \quad y_3 \in \{-6, -5, -4\}, \quad x = 4, \quad t \le n - 2.$$
The radix-2 iterative modular multiplier for $x = 3$ is shown in Algorithm 2 below. The two sign estimation stages are $T(2S_{i-1}) + T(2C_{i-1})$ and $T(2S_{i-1}) + T(2C_{i-1}) + T(\pm 3M)$, where the sign of $\pm 3M$ depends on the sign of the previous estimation stage.
The aim is to generate multiples of the multiplicand and the modulus that are powers of 2 (i.e. $\pm 2^l$, where $l$ is a positive integer). This has been achieved without using Booth's recoding. Instead, an unsigned representation of both the multiplier and the multiplicand is preferred, as the extra circuitry that generates the digits $(1, 0, -1)$ in Booth's recoding is avoided.
Algorithm 2
$S_0 = 0$; $C_0 = 0$
For $i = 1 \ldots n$
{
  If $T(2S_{i-1}) + T(2C_{i-1}) \ge 0$ then
    If $T(2S_{i-1}) + T(2C_{i-1}) + T(-3M) \ge 0$
      then $S_i + C_i = 2(S_{i-1} + C_{i-1}) + a_{n-i}B - y_0 M$
      else $S_i + C_i = 2(S_{i-1} + C_{i-1}) + a_{n-i}B - y_1 M$
  If $T(2S_{i-1}) + T(2C_{i-1}) < 0$ then
    If $T(2S_{i-1}) + T(2C_{i-1}) + T(3M) \ge 0$
      then $S_i + C_i = 2(S_{i-1} + C_{i-1}) + a_{n-i}B - y_2 M$
      else $S_i + C_i = 2(S_{i-1} + C_{i-1}) + a_{n-i}B - y_3 M$
}
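A software model can help to see how the two sign estimation stages of Algorithm 2 select the correction term. The Python sketch below uses the rule $(y_0, y_1, y_2, y_3) = (4, 2, -1, -4)$ for $x = 3$. As a modelling assumption, the carry-save pair is obtained by splitting the integer partial result into two random words; the function name `radix2_modmul` is hypothetical, and $n$ is taken as the bit length of the modulus.

```python
import random


def T(x: int, t: int) -> int:
    """Clear the t least significant bits (floor to a multiple of 2^t)."""
    return (x >> t) << t


def radix2_modmul(a: int, b: int, m: int, t: int, seed: int = 0) -> int:
    """Model of Algorithm 2 with x = 3: returns p with p = a*b (mod m).

    The returned value is not fully reduced; for t <= n - 2 it stays
    inside the open interval (-3m, 3m).
    """
    rng = random.Random(seed)
    n = m.bit_length()
    y = (4, 2, -1, -4)                        # reduction rule for x = 3
    p = 0
    for i in range(n - 1, -1, -1):
        # Random carry-save split of the previous partial result (assumption).
        s = rng.randrange(-(1 << (n + 3)), 1 << (n + 3))
        c = p - s
        e1 = T(2 * s, t) + T(2 * c, t)        # first sign estimation stage
        if e1 >= 0:
            k = 0 if e1 + T(-3 * m, t) >= 0 else 1
        else:
            k = 2 if e1 + T(3 * m, t) >= 0 else 3
        # Single reduction per iteration, performed after both estimations.
        p = 2 * p + ((a >> i) & 1) * b - y[k] * m
    return p


p = radix2_modmul(187, 221, 239, t=6)
print(p, p % 239)
```

With the operands of the worked example in Fig. 3 ($A = 187$, $B = 221$, $M = 239$), the returned value is congruent to 219 modulo 239, although the intermediate correction terms depend on the random split and need not match the figure.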
The Basic Cell (BC) of the multiplier and the multiplication Dependency Graph (DG) are shown in Figs. 1(a) and 2(a), respectively. The
BC uses two CSAs (Full Adders) and an AND gate to form the partial products. The bits $r_{i,1}$, $r_{i,2}$ from the sign estimation stages are used to select one of the bits $(m_j, m_{j+1}, m_{j+2})$ for the modular reduction (which is equivalent to multiplying the modulus $M$ by 1, 2, and 4, respectively, as shown in Algorithm 2). The bit $r_{i,1}$ is the sign bit of the first estimation stage and is used to find the two's complement of the correction value to be added to the partial results. Therefore, the multiples of the modulus $M$ that are used for the reduction operation are $4M$, $2M$, $-M$, and $-4M$. The two stages of the sign estimation are shown in Fig. 2(b). The sign bit of the first stage is used to select either $T(3M)$ or $T(-3M)$, which is then accumulated in the second stage with the result from the first stage to produce the second sign bit.
Fig. 1. The basic cell of the proposed radix-2 modular multiplier.
The partial result $S_{i-1} + C_{i-1}$ falls within the range $]-3M, 3M[$; therefore, $2(S_{i-1} + C_{i-1})$ can be represented using $n + 4$ bits. The rule of Algorithm 2 states that $t \le n - 2$, thus the sign estimation is carried out on 7 bits. However, to allow only the modulus $M$ to be fed to the multiplier structure, the terms $T(3M)$ and $T(-3M)$ are formed by the addition of $T(2M)$ to $T(M)$, and $T(-2M)$ to $T(-M)$, respectively. Nevertheless, $T(2M) + T(M)$ is bounded by
$$T(3M) - 3 \times 2^t < T(2M) + T(M) < T(3M) + 3 \times 2^t. \tag{12}$$
The effect on the reduction rule is that the new necessary condition for the correctness of the algorithm becomes $t \le n - 3$; therefore the sign estimation is calculated using 8 bits. An illustration of how Algorithm 2 works is shown in Fig. 3.
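The bound of Eq. (12) can be verified by brute force over small moduli. The following Python sketch is a numerical check only, with a hypothetical helper `eq12_holds`; the truncation $T$ is modelled as flooring to a multiple of $2^t$.

```python
def T(x: int, t: int) -> int:
    """Clear the t least significant bits (floor to a multiple of 2^t)."""
    return (x >> t) << t


def eq12_holds(m: int, t: int) -> bool:
    """Check T(3M) - 3*2^t < T(2M) + T(M) < T(3M) + 3*2^t for one (M, t)."""
    approx = T(2 * m, t) + T(m, t)   # what the hardware forms from M alone
    exact = T(3 * m, t)              # what the reduction rule assumes
    return exact - 3 * (1 << t) < approx < exact + 3 * (1 << t)


# Exhaustive check for all moduli below 1024 and truncation lengths below 10.
print(all(eq12_holds(m, t) for m in range(1, 1024) for t in range(10)))
```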
5. The new radix-4 iterative algorithm for modular multiplication
The algorithm shown in the previous section can be extended to radix-4, whereby radix-4 Booth's recoding is used to generate the partial products. Three sign estimation stages are then required to calculate the right modular reduction value. This is due to the fact that, after each iteration of the algorithm, the partial results are shifted two positions to the left instead of one position as in the case of a radix-2 algorithm. Therefore, eight different terms of modular reduction, $y_k M$, are required. The selection of one of the values $y_k$, $0 \le k \le 7$ (denoted $y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7$), depends on the sign of each of the three sign estimation stages. Table 1 depicts the choice of these values and the corresponding signs of the sign estimation stages. Following the same calculation steps used for the radix-2 algorithm described previously, the rule of reduction is given in what follows.
Let $z$ be
$$z = 4(C_i + S_i) + a_{n/2-i}B \quad \text{with } a_{n/2-i} \in \{-2, -1, 0, 1, 2\}, \tag{13}$$
where $C_i$ and $S_i$ are the $i$th carry and sum words of partial results. The modular reduction is carried out by subtracting $y_i M$ from $z$. The result must fall into the range
$$-xM < z - y_i M < xM. \tag{14}$$
Fig. 2. The DG and the estimation stages of the proposed modular multiplier.
As in the radix-2 case, the different cases are listed as follows. For every value of $z$ given by the following set of inequalities, a correction term is selected from Table 1.
$$3xM - 2M - 2^{t+1} < z < 4xM + 2M, \quad y_0 \text{ is selected}, \tag{15.1}$$
$$2xM - 2M - 2^{t+1} < z < 3xM + 2M + 2^{t+1}, \quad y_1 \text{ is selected}, \tag{15.2}$$
$$xM - 2M - 2^{t+1} < z < 2xM + 2M + 2^{t+1}, \quad y_2 \text{ is selected}, \tag{15.3}$$
$$-2M - 2^{t+1} < z < xM + 2M + 2^{t+1}, \quad y_3 \text{ is selected}, \tag{15.4}$$
$$-xM - 2M - 2^{t+1} < z < 2M + 2^{t+1}, \quad y_4 \text{ is selected}, \tag{15.5}$$
$$-2xM - 2M - 2^{t+1} < z < -xM + 2M + 2^{t+1}, \quad y_5 \text{ is selected}, \tag{15.6}$$
$$-3xM - 2M - 2^{t+1} < z < -2xM + 2M + 2^{t+1}, \quad y_6 \text{ is selected}, \tag{15.7}$$
$$-4xM - 2M < z < -3xM + 2M + 2^{t+1}, \quad y_7 \text{ is selected}. \tag{15.8}$$
After appropriately subtracting the term $y_i M$ from these inequalities, the first set of solutions is found when $x$ equals 6. The reduction rules for $x = 6$, 7 and 8 are given by
$$(y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7) = (21, 15, 9, 3, -3, -9, -15, -21), \quad t \le n - 2, \quad x = 6$$
Fig. 3. An example of using Algorithm 2 (B = 221 = 011011101, A = 187 = 010111011, M = 239 = 011101111, R = 219 = 011011011).
and
$$y_0 \in \{23, 24, 25\}, \quad y_1 \in \{17, 18\}, \quad y_2 \in \{10, 11\}, \quad y_3 \in \{3, 4\},$$
$$y_4 \in \{-4, -3\}, \quad y_5 \in \{-11, -10\}, \quad y_6 \in \{-18, -17\}, \quad y_7 \in \{-25, -24, -23\}, \quad x = 7, \quad t \le n - 2,$$
and
$$(y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7) = (28, 20, 12, 4, -4, -12, -20, -28), \quad t \le n - 1, \quad x = 8.$$
The radix-4 algorithm for $x = 8$ is summarized below in Algorithm 3. First the partial results are shifted two positions to the left, then the sign estimation operations take place. The reduction term is selected according to the sign of each of these operations.
Algorithm 3
$S_0 = 0$; $C_0 = 0$
For $i = 1 \ldots n$
{
  If $T(4S_{i-1}) + T(4C_{i-1}) \ge 0$
    If $T(4S_{i-1}) + T(4C_{i-1}) + T(-16M) \ge 0$
      If $T(4S_{i-1}) + T(4C_{i-1}) + T(-16M) + T(-8M) \ge 0$
        $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_0 M$
      else $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_1 M$
    else
      If $T(4S_{i-1}) + T(4C_{i-1}) + T(-16M) + T(8M) \ge 0$
        $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_2 M$
      else $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_3 M$
  else
    If $T(4S_{i-1}) + T(4C_{i-1}) + T(16M) \ge 0$
      If $T(4S_{i-1}) + T(4C_{i-1}) + T(16M) + T(-8M) \ge 0$
        $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_4 M$
      else $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_5 M$
    else
      If $T(4S_{i-1}) + T(4C_{i-1}) + T(16M) + T(8M) \ge 0$
        $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_6 M$
      else $S_i + C_i = 4(S_{i-1} + C_{i-1}) + a_{n-i}B - y_7 M$
}
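The three-stage selection of Algorithm 3 can be modelled in software together with a radix-4 Booth recoding of the multiplier. The Python sketch below is illustrative only: the helper names are hypothetical, the carry-save pair is modelled by a random split of the integer partial result, and the third-stage constant is chosen as $\mp 8M$ after a positive or negative second stage respectively, so that the selection follows the sign pattern of Table 1. The model uses $t \le n - 2$, a slightly more conservative setting than the rule quoted for $x = 8$.

```python
import random


def T(x: int, t: int) -> int:
    """Clear the t least significant bits (floor to a multiple of 2^t)."""
    return (x >> t) << t


def booth_radix4_digits(a: int, n: int) -> list:
    """Radix-4 Booth digits (in {-2,...,2}) of an n-bit unsigned a, MSD first."""
    digits = []
    for k in range(n // 2 + 1):               # one extra digit covers the top bit
        lo = (a >> (2 * k - 1)) & 1 if k > 0 else 0
        mid = (a >> (2 * k)) & 1
        hi = (a >> (2 * k + 1)) & 1
        digits.append(lo + mid - 2 * hi)
    return digits[::-1]


def radix4_modmul(a: int, b: int, m: int, t: int, seed: int = 0) -> int:
    """Model of Algorithm 3 with x = 8: returns p with p = a*b (mod m)."""
    rng = random.Random(seed)
    n = m.bit_length()
    y = (28, 20, 12, 4, -4, -12, -20, -28)    # reduction rule for x = 8
    p = 0
    for d in booth_radix4_digits(a, n):
        s = rng.randrange(-(1 << (n + 5)), 1 << (n + 5))  # random split (assumption)
        c = p - s
        e1 = T(4 * s, t) + T(4 * c, t)        # stage 1
        e2 = e1 + T((-16 if e1 >= 0 else 16) * m, t)      # stage 2
        e3 = e2 + T((-8 if e2 >= 0 else 8) * m, t)        # stage 3
        # Index follows Table 1: a non-negative stage contributes a 0 bit.
        k = 4 * (e1 < 0) + 2 * (e2 < 0) + (e3 < 0)
        p = 4 * p + d * b - y[k] * m
    return p


p = radix4_modmul(187, 221, 239, t=6)
print(p, p % 239)
```

Each iteration consumes one Booth digit, so the loop runs roughly half as many times as in the radix-2 model, at the cost of a third estimation stage.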
The values $y_i$ obtained above are not powers of two, thus they cannot be obtained by simple shift operations. To remedy this problem, each value is split into a sum of two signed powers of two. This means that two rows of CSAs are required, as shown in Fig. 4(a). The values 28, 20, 12 are rewritten as the sum of 32 and $-4$, 16 and 4, 8 and 4, respectively. The sign bit of the first sign estimation stage, $r_{i,1}$, is used to determine the sign of the reduction value (see Fig. 4(c)). The sign bits of the second and third stages, $r_{i,2}$ and $r_{i,3}$, are used to determine the
Table 1
The reduction values of the proposed radix-4 modular multiplier

             y0        y1        y2        y3        y4        y5        y6        y7
1st stage    positive  positive  positive  positive  negative  negative  negative  negative
2nd stage    positive  positive  negative  negative  positive  positive  negative  negative
3rd stage    positive  negative  positive  negative  positive  negative  positive  negative
magnitude of the correction value. The sign bit of the third stage of the sign estimation, $r_{i,3}$, is used to select either $4M$ or $-4M$. The sign estimation operation is performed on three stages of eight-bit CPAs/CLAs. The modular multiplier uses three rows of CSAs. In the first row, the partial products are generated using the radix-4 Booth's recoding technique (see Fig. 3(a)); the remaining two rows are used to accumulate the modular reduction. The multiplier DG is depicted in Fig. 4(d).
6. The modular reduction: the last step
The structures shown in the previous sections do not compute the exact value of the modular product, since the result of their calculation lies in the range $]-xM, xM[$, while the exact result must fall into the range $[0, M[$. As the sign estimation technique was adopted in these structures to avoid the long propagation paths that decrease the clock frequency of the system, a reduced word length of the data must be used to reduce the results of the two structures shown above. Therefore the sign estimation technique is also selected to carry out the task of reducing the results. The sign estimation and subtraction operations are used without feeding any data to the structure, as shown in Algorithm 4 below.
Algorithm 4
$-xM < S_i + C_i < xM$
For $i = \log_2(x) \ldots 1$
{
  If $T(S_i) + T(C_i) \ge 0$
    then $S_{i-1} + C_{i-1} = S_i + C_i - 2^{i-1}M$
    else $S_{i-1} + C_{i-1} = S_i + C_i + 2^{i-1}M$
}
This algorithm works in a binary search fashion. This has only been achieved using a redundant representation for the partial results. The algorithm requires $\lceil \log_2(x) \rceil$ reduction steps, where $2^{n-1} < x \le 2^n$, and the results are reduced to the range $-M - 2^{t+1} < R = S_0 + C_0 < M + 2^{t+1}$.
This demonstrates the benefit from using a redundant representation of the operands and partial results.
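The binary-search behaviour of Algorithm 4 can be illustrated with a short software model. The Python sketch below makes the same modelling assumption as a hardware-free simulation must: the carry-save pair is a random split of the integer partial result, so that the truncated sum gives an approximate sign. The function name `final_reduction` is hypothetical.

```python
import random


def T(x: int, t: int) -> int:
    """Clear the t least significant bits (floor to a multiple of 2^t)."""
    return (x >> t) << t


def final_reduction(p: int, m: int, x: int, t: int, seed: int = 0) -> int:
    """Model of Algorithm 4: shrink p from (-x*m, x*m) towards (-m, m).

    Uses ceil(log2(x)) steps; at step i the term 2^(i-1) * m is
    subtracted if the estimated sign is non-negative, added otherwise.
    """
    rng = random.Random(seed)
    steps = (x - 1).bit_length()              # integer form of ceil(log2(x))
    width = m.bit_length() + steps + 2
    for i in range(steps, 0, -1):
        s = rng.randrange(-(1 << width), 1 << width)  # random split (assumption)
        c = p - s
        if T(s, t) + T(c, t) >= 0:
            p -= (1 << (i - 1)) * m
        else:
            p += (1 << (i - 1)) * m
    return p


# Reduce a radix-2 style result (x = 3) for M = 239 with t = 6.
r = final_reduction(-500, 239, x=3, t=6)
print(r, r % 239, (-500) % 239)
```

The output stays congruent to the input modulo $M$; a last conditional addition or subtraction of $M$ is still needed to bring it into $[0, M[$.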
7. Comparison of performances
A performance comparison of the newly proposed architectures with similar structures available in the literature [6,7], in terms of both speed and area, is presented in Table 2. To reduce the final result to the range $[0, M[$, repeated addition or subtraction operations are used in the last stage of the algorithms shown above. However, this last stage has not been taken into account in the comparison of the implementation results, whereby the results are kept within the ranges $[0, 3M[$ and $]-xM, xM[$ for the structure in [1,13] and the new structures proposed in this paper, respectively. It is worth mentioning that the terms $T(3M)$, $T(-3M)$ are replaced by the sums $T(4M) + T(-M)$, $T(-4M) + T(M)$, respectively, so that there is no need to pre-calculate the terms $T(\pm 3M)$. The comparison is based on the area (Area Unit, A.U) and the delay (Time Unit, T.U) of an inverter gate. The area of a FA (or a CSA), an EXOR gate, and a multiplexer is 10, 4, and 4 A.U, respectively. The propagation delay within these 3 basic elements is 6, 3, and 3 T.U, respectively [9]. The comparison has been made between a serial implementation of the algorithm described by [6,7] and the new radix-2 iterative multiplier, and between the 2-bit serial implementation of [6,7] and the proposed radix-4 iterative multiplier. It was also supposed that only the modulus $M$ is fed to the multiplier, therefore two rows of EXOR gates have been added to the multiplier structure
Fig. 4. The basic cell, the DG and estimation stages of the proposed modular multiplier.

Table 2
Comparison of performances based on the area and the delay of an inverter gate

         [8] Bit Serial   Radix-2 Bit Serial   [8] 2-Bit Serial   Radix-4 Bit Serial
Area     56n A.U          51n A.U              98n A.U            80n A.U
Clock    95 T.U           125 T.U              185 T.U            200 T.U
Time     95n T.U          125n T.U             93n T.U            100n T.U

A.U: Area unit, T.U: Time unit.
shown in [6,7]. It was also assumed that the pipelining is at the cell level, with the sign estimation stages implemented using CPAs. This means that there is no pipelining inside the cell, but the latches are placed at the outputs of the BCs. The time of the multiplication is the number of cycles (i.e., the number of iterations of the algorithm) multiplied by the clock period, which is the delay of the BC. As shown by Table 2, the bit-serial and 2-bit serial multipliers derived from [6,7] and the proposed radix-4 multiplier have almost the same speed. In terms of area usage, the bit-serial multiplier of [6,7] uses a slightly larger area than our radix-2 multiplier, while the radix-4 multiplier clearly requires less area than the 2-bit serial multiplier of [6,7]. It also has a similar speed performance when compared to those derived from [6,7], while the radix-2 multiplier has the worst speed performance. This is in part due to the longer CPAs it uses, which should be implemented using CLAs. Nevertheless, our radix-4 multiplier has a better area performance than the 2-bit serial multiplier of [6,7] while the speed performances are relatively comparable, which makes it a good candidate for digit-serial parallel implementations. Results of parallel implementations of these structures are shown in Table 3. The radix-4 algorithm achieves relatively the same processing time as the algorithm of [6,7]. However, it requires only 80% of its area, making it the best choice. The radix-2 algorithm is slower: it requires 130% of the multiplication time of [6,7]. The main benefit that the radix-2 multiplier exhibits is that it requires only 90% of the area usage required by [6,7].
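The relative figures quoted for the parallel implementations follow directly from the area and time coefficients of Table 3. The Python sketch below recomputes them; the dictionary keys are labels chosen for this example, and the mapping of the second and third table columns to the radix-4 and radix-2 multipliers is inferred from the percentages quoted in the text.

```python
from fractions import Fraction

# (time coefficient of n, area coefficient of n^2), read from Table 3.
designs = {
    "reference": (Fraction(90), Fraction(42)),
    "radix-4": (Fraction(195, 2), Fraction(33)),
    "radix-2": (Fraction(120), Fraction(37)),
}

ref_time, ref_area = designs["reference"]
relative = {}
for name, (time, area) in designs.items():
    relative[name] = (
        round(100 * time / ref_time),                    # time, % of reference
        round(100 * area / ref_area),                    # area, % of reference
        round(100 * time * area / (ref_time * ref_area)),  # area x time, %
    )
    print(name, relative[name])
```

The same rule, time equals number of cycles multiplied by the clock period, gives the last row of Table 2: for the 2-bit serial designs the number of cycles is $n/2$, so a clock of 200 T.U yields $100n$ T.U.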
8. Towards higher radices modular multipliers
In this section, the extension of the structures proposed in this paper to higher radices is
Table 3
Comparison of performances for a parallel implementation based on the area and the delay of an inverter gate

               Algorithm in [8]   Algorithm 5      Algorithm 4
Time           ≈ 90n              ≈ 195n/2         ≈ 120n
               ≈ 100%             ≈ 108%           ≈ 133%
Area           ≈ 42n^2            ≈ 33n^2          ≈ 37n^2
               ≈ 100%             ≈ 79%            ≈ 88%
Area × time    ≈ 3780n^3          ≈ (6435/2)n^3    ≈ 4440n^3
               ≈ 100%             ≈ 85%            ≈ 117%
presented. Due to the large radix size, which is calledword size, it is essential that the multiplier’s wordsare in a redundant representation and that themultiplication partial results have to be accumulatedin a redundant representation too, as shown for thecase of the radix-4 modular multiplier. A redundantrepresentation is useful to produce multiples of themultiplicand that are some powers of 2 (i.e. �21;where l is a positive integer) and in such a way thatthe partial result at each iteration fall within asymmetric range. Therefore, the partial results arealso in a redundant representation and they areconverted into a positive value only once all thepartial products have been generated and theirmodular reduction has been carried out. Themultiplier words are represented using Booth’srecoding. Radix-4 and Radix-2 Booth’s recodingcan be used to avoid generating terms that are notpowers of two multiples of the multiplicand.Let d be the adopted high radix. Ci and Si are
the partial results at the ith cycle. Once shifted by d
position to the left:
2dxMo2dðCi þ SiÞo� 2dxM: (16)
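The Booth recoding of the multiplier words mentioned above can be illustrated with the standard radix-4 scheme, in which every recoded digit lies in {-2, -1, 0, 1, 2}, so each partial product is zero or a plus or minus power-of-two multiple of the multiplicand. This is a minimal sketch of the textbook recoding on ordinary integers, not the redundant-representation datapath of the paper; the function names `booth_radix4_digits` and `multiply_with_booth` are hypothetical.

```python
def booth_radix4_digits(y, nbits):
    """Recode a non-negative integer y < 2**nbits into radix-4 Booth digits.

    Digits are returned least significant first and lie in {-2, -1, 0, 1, 2}.
    Digit j is y[2j-1] + y[2j] - 2*y[2j+1], with y[-1] = 0.
    """
    if y < 0 or y >> nbits:
        raise ValueError("y must satisfy 0 <= y < 2**nbits")
    digits = []
    prev = 0  # the implicit bit below the least significant bit
    for k in range(0, nbits + 1, 2):
        low = (y >> k) & 1
        high = (y >> (k + 1)) & 1
        digits.append(low + prev - 2 * high)
        prev = high
    return digits


def multiply_with_booth(x, y, nbits):
    """Form x*y from the recoded digits using only shifts, additions and subtractions."""
    acc = 0
    for j, digit in enumerate(booth_radix4_digits(y, nbits)):
        if digit == 0:
            continue
        multiple = x << (abs(digit) - 1)  # |digit| is 1 or 2, so this is x or 2x
        term = multiple << (2 * j)        # weight 4**j
        acc = acc + term if digit > 0 else acc - term
    return acc


print(booth_radix4_digits(13, 4))   # digits of 13, least significant first
print(multiply_with_booth(11, 13, 4))
```

For example, 13 recodes to the digits 1, -1, 1 (least significant first), since 1 - 4 + 16 = 13, and no multiple such as 3 times the multiplicand is ever needed.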
The above equation imposes the use of $d$ reduction stages to keep the partial results in the range given by (7). Each reduction stage outputs a sign bit that is used to select the reduction term. Let $s_i$ be the sign bit from the $i$th reduction stage of the radix-$2^d$ modular multiplier ($0 \le i \le d$). As shown in the previous two structures, if the sign estimation operation produces a positive sign bit, i.e. $s_i$ equals 0, then $2^{i-1}M$ is subtracted from the partial results. If the estimated sign is negative, i.e. $s_i$ equals 1, then $2^{i-1}M$ is added to the partial results. Then another sign estimation operation takes place. This can be translated into the following formula: $T(S_i) + T(C_i) + T(2^i M (s_i - \bar{s}_i))$, where $\bar{s}_i$ is the complement of $s_i$.

The range of the partial products is given as a
function of the sign-bit output from each of the sign estimation stages. Let $S$ be the word formed by these sign bits. The MSB of this word is $s_d$, which is the output of the first stage, while the LSB is the bit $s_0$ of the last sign estimation stage. The reduction operation only takes place when all the sign estimation operations have been carried out. Depending on the results of all these stages, the partial results are affected, and therefore a reduction value is selected to reduce them back to the range $]-xM, xM[$. The partial results fall into the range

$$(s_d - \bar{s}_d)\left(-1 + 2^d - s_d 2^d + \sum_{i=0}^{d-1} s_i 2^i\right) xM - 2^{t+1} < 2^d (C_i + S_i) < (s_d - \bar{s}_d)\left(2^d - s_d 2^d + \sum_{i=0}^{d-1} s_i 2^i\right) xM + 2^{t+1}. \tag{17}$$
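The add-or-subtract rule described above (subtract a shifted modulus when the sign is positive, add it when the sign is negative) can be sketched on ordinary integers. The sketch makes several simplifying assumptions that are not the paper's design: it uses the exact sign rather than a short-precision estimate, it applies each correction immediately instead of collecting all sign bits first, it takes x = 1, and it uses plain unsigned multiplier digits, which costs one extra stage (d + 1 in total). The names `reduce_by_sign_stages` and `modmul_high_radix` are hypothetical.

```python
def reduce_by_sign_stages(v, m, stages):
    """Bring v from [-2**stages * m, 2**stages * m) into [-m, m).

    Each stage inspects the (exact) sign of the running value and then
    subtracts or adds a shifted copy of the modulus.
    """
    if m <= 0 or not -(m << stages) <= v < (m << stages):
        raise ValueError("v is outside the range the stages can handle")
    sign_bits = []
    for i in range(stages, 0, -1):
        s = 1 if v < 0 else 0      # 0 means positive, 1 means negative
        sign_bits.append(s)
        shifted = m << (i - 1)     # 2**(i-1) * M, obtained by a shift
        v = v + shifted if s else v - shifted
    return v, sign_bits


def modmul_high_radix(x, y, m, d):
    """Compute (x * y) mod m, scanning y in d-bit digits, most significant first."""
    if not (m > 0 and 0 <= x < m and y >= 0 and d >= 1):
        raise ValueError("expected 0 <= x < m, y >= 0 and d >= 1")
    ndigits = max(1, -(-y.bit_length() // d))
    mask = (1 << d) - 1
    p = 0                          # partial result, kept in [-m, m)
    for j in range(ndigits - 1, -1, -1):
        digit = (y >> (j * d)) & mask
        v = (p << d) + digit * x   # shift by d positions, add the partial product
        p, _ = reduce_by_sign_stages(v, m, d + 1)
    return p + m if p < 0 else p   # final conversion to a non-negative value


print(reduce_by_sign_stages(37, 5, 3))
print(modmul_high_radix(123, 456, 1009, 4))
```

Each stage halves the bound on the magnitude of the running value, so after the last stage the partial result lies in a symmetric range around zero and only the final result needs converting to a positive value.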
The effect of the sign estimation is apparent when two successive ranges are examined. In Fig. 5, two successive ranges overlap. Had a full precision magnitude comparison been used, the length of the ranges would have been $M$. Because the short magnitude comparison is used, the length of the ranges is $M + 2^{t+2}$, and two successive ranges share $2^{t+2}$ of common range.

Fig. 5. The effect of the sign estimation technique on the partial results. [Figure: number line marked at $-2M$, $-M$, $0$, $M$, $2M$ and $3M$, with offsets $-2^{t+1}$ and $2^{t+1}$.]

Once the partial products and the reduction terms $y_i M$ have been added to the partial results of the previous cycle, the rule of modular reduction can be applied in order to keep these results in the range $]-xM, xM[$. The parameter of the sign estimation, $t$, is made less than or equal to $n-1$ ($t \le n-1$), so that $2^t \le M$. The second step is to determine the value of $x$ that produces values $y_i$ that are equal to powers of 2. If no such value of $x$ is found, then numbers $y_i$ that are sums of powers of 2 can be considered. Once the values of $x$ and $y_i$ are determined, the multiples of $M$ that the sign estimation stages use can be found. However, these values may not be powers of 2, in which case they cannot be generated by shift operations. To circumvent this problem, the parameter $t$ is appropriately changed, so that any multiple of $M$ can be written as a sum of some powers of 2.
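The idea of writing a multiple of M as a sum of powers of 2 can be illustrated as follows: a multiplier that is itself a power of 2 needs a single shift, and any other multiplier needs one shift per set bit followed by additions. This is a minimal sketch on ordinary integers; the function names `powers_of_two_terms` and `multiple_by_shifts` are hypothetical, and the sketch does not model how the paper chooses t, x or the values y_i.

```python
def powers_of_two_terms(y):
    """Return the exponents e, in increasing order, such that y is the sum of 2**e."""
    if y < 0:
        raise ValueError("y must be non-negative")
    return [e for e in range(y.bit_length()) if (y >> e) & 1]


def multiple_by_shifts(m, y):
    """Form y*m using only left shifts of m and additions."""
    return sum(m << e for e in powers_of_two_terms(y))


print(powers_of_two_terms(4))    # a power of 2: one shift is enough
print(powers_of_two_terms(6))    # 6 = 2 + 4: two shifts and one addition
print(multiple_by_shifts(13, 6))
```

The number of exponents returned is the number of shifted copies of M that have to be added, which is why multipliers that are single powers of 2 are the cheapest to generate.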
9. Conclusion
In this paper, two new iterative algorithms for modular multiplication have been presented. The implementation of these algorithms yields scalable architectures that can be used for any modulus without altering the design. This makes them useful for performing the modular multiplication operation, which is the basis of cryptosystems and authentication schemes. Serial implementations have shown that the radix-2 algorithm has a better area usage than similar structures available in the literature, while its speed performance is rather worse. This drawback has been addressed in the radix-4 algorithm, which exhibits better area usage than similar structures with relatively similar speed performance. The parallel implementation of these algorithms has also shown that the radix-4 algorithm has the best area usage, while its speed performance is similar to that of the structures proposed in the literature. This makes the radix-4 algorithm a good choice for digit-serial parallel implementations. The design of higher radix multipliers has also been investigated. In this case, only the multiplication operands are fed to these multipliers, while their parameters are left to the designer to choose in such a way that only values that are powers of 2 are used during the modular multiplication, thus making this process simpler.
References
[1] G.R. Blakely, A computer algorithm for the product AB
modulo M, IEEE Transactions on Computers 32 (1983)
497–500.
[2] C.D. Chiou, T.C. Yang, Iterative modular multiplication
algorithm without magnitude comparison, Electronic
Letters 30 (30) (1994).
[3] Chung Nan Lyu, David W. Matula, Redundant Binary
Booth Recoding, in: Proceedings of the 12th Symposium
on Computer Arithmetic, July 19–21, Bath, England, 1995,
pp. 50–58.
[4] W. Diffie, M.E. Hellman, New directions in cryptography,
IEEE Transactions on Information Theory 22 (1976)
644–654.
[5] W.L. Freking, K.K. Parhi, A unified method for iterative
computation of modular multiplication and reduction
operations, in: Proceedings of the 1999 IEEE International
Conference on Computer Design, ICCD ’99, 1999, pp.
80–87.
[6] C.K. Koc, C.Y. Hung, Carry Save Adders for computing
the product AB modulo N, Electronic Letters 26 (1990)
899–900.
[7] C.K. Koc, C.Y. Hung, Bit-level systolic arrays for modular
multiplication, Journal of VLSI Signal Processing 3 (1991)
215–223.
[8] P. Kornerup, High-radix modular multiplication for
cryptosystems, in: Proceedings of the 11th Symposium
on Computer Arithmetic, IEEE Computer Society Press,
Windsor, Canada, 1993, pp. 277–285.
[9] M.C. Mekhallalati, Novel algorithms and architectures for
multiplication, Ph.D. Thesis, Department of Electrical and
Electronic Engineering, University of Nottingham, UK,
1997.
[10] H. Orup, Simplifying quotient determination in high-radix
modular multiplication, Proceedings of the 12th Sympo-
sium on Computer Arithmetic (ARITH ’95).
[11] R.L. Rivest, A. Shamir, L. Adleman, A method for
obtaining digital signatures and public-key cryptosystems,
Communications of the ACM 21 (2) (1978) 120–126.
[12] Shand, J. Vuillemin, Fast implementation of RSA
cryptography, Proceedings of the 11th IEEE Symposium
on Computer Arithmetic, 1993.
[13] K.R. Solan Jr., Comment on: a computer algorithm for the
product AB modulo M, IEEE Transactions on Computers
34 (1985) 290–292.
[14] Takagi, Generating a power of an operand by a table look-
up and a multiplication, Proceedings of the IEEE 13th
Symposium on Computer Arithmetic, July 1997, pp.
126–131.
[15] A. Tenca, M.D. Ercegovac, Design of high-radix
digit-slices for on-line computations, SPIE Conference
on High-Speed Computing, Digital Signal Processing,
and Filtering Using Reconfigurable Logic, November
1996.
[16] C.D. Walter, Space/time trade-offs for higher radix
modular multiplication using repeated addition, IEEE
Transactions on Computers 46 (2) (1997) 139–141.
[17] M.C.W. Wu, Y.F. Chou, General modular multiplication
by block multiplication and table lookup, in: Proceedings
of the IEEE International Symposium on Circuits and
Systems (ISCAS), 1994, pp. 295–298.