Analysis of the influence of register file size on energyconsumption, code size, and execution time

9

Transcript of Analysis of the influence of register file size on energyconsumption, code size, and execution time

1Revision of Manus ript 1934Analysis of the In uen e of Register File Sizeon Energy Consumption, Code Sizeand Exe ution TimeL. Wehmeyer, M.K. Jain, S. Steinke, P. Marwedel, M. BalakrishnanAbstra t|Interest in low power embedded systems has in- reased onsiderably in the past few years. To produ e lowpower ode and to allow an estimation of power onsumptionof software running on embedded systems, a power modelwas developed based on physi al measurement using an eval-uation board and integrated into a ompiler and pro�ler.The ompiler uses the power information to hoose instru -tion sequen es onsuming less power, whereas the pro�lergives information about the total power onsumed duringexe ution of the generated program.The used ompiler is parameterized su h that e.g. theregister �le size may be hanged. The resulting ode isevaluated with respe t to ode size, performan e and power onsumption for di�erent register �le sizes. The extra tedinformation is espe ially useful during appli ation analysisand ar hite ture spa e exploration in ASIP design. Ouranalysis gives the designer the ability to estimate the desir-able register �le size for an ASIP design.The size of the register �le should be onsidered as adesign parameter sin e it has a strong impa t on the energy onsumption of embedded systems.Keywords|low power, ompiler, power model, appli ationanalysis, register �le sizeI. Introdu tionRe ently, energy onsumption in embedded systems hasbe ome more and more important, mainly due to the fa tthat many systems are now being designed as mobile de-vi es, i.e. they have to operate on battery power instead ofusing abundant power from wall so kets. One importantaspe t of su h devi es is their running time. How mu henergy saving is possible by modifying software alone wasre ently demonstrated by Siemens [1℄: the standby time ofone of their mobile phones was in reased from 100 to 160hours without hanges in the hardware. Using our energyaware ompiler framework en we are able to generate lowpower ode, perform spe i� optimizations aimed at energysaving and evaluate the resulting program's energy dissi-pation. The basis for the in orporated power model wasgiven by physi al measurement using an evaluation board.The fa t that the en ompiler is parameterizable en-ables further studies: By hanging the on�guration of theproposed target pro essor within the ompiler on�gura-tion �le, it is possible to evaluate the e�e t of ar hite tural hanges on ode quality. This is parti ularly interestingThis work has been supported by a DST-DAAD proje t and byAgilent Te hnologies, USA.Lars Wehmeyer, Stefan Steinke and Peter Marwedel are with theDepartment of Computer S ien e XII of the University of Dort-mund, Germany. E-mail: fwehmeyer,steinke,marwedelg�ls12. s.uni-dortmund.deManoj Kumar Jain and M. Balakrishnan are with the Departmentof Computer S ien e and Engineering, Indian Institute of Te hnology,Delhi, India. E-mail: fmanoj,mbalag� se.iitd.ernet.in

for ar hite tural design exploration targeting ASIPs whi hhave a limited number of variable parameters. ASIC de-velopment, despite resulting in highly eÆ ient designs, isoften too expensive whereas general purpose pro essors donot provide the required performan e. ASIPs are oftentargeted today, sin e they o�er high exibility and perfor-man e at a relatively low ost. Usually, ASIPs are designedfor a small set of appli ations whi h enables the designerto exploit knowledge about appli ation hara teristi s soas to meet the performan e, ost and power requirements.Using the en ompiler in the design pro ess, we show howto explore the design spa e provided by the register �le sizein ASIP design.The rest of this work is organized as follows: Chap-ter II gives an overview of previous work on erning lowpower ompiler optimizations, energy models and ASIP de-sign methodologies. Chapter III introdu es the en om-piler framework along with the integrated instru tion levelpower model. Chapter IV gives a short overview of theARM7TDMI pro essor that was used for our experiments.The following hapter shows the used appli ation ben h-marks, whereas hapter VI explains the experimental setupand the obtained results. Finally, the last hapter gives asummary and provides some areas for future work.II. Related WorkCompiler optimizations like power aware instru tion se-le tion, i.e. hoosing the instru tion sequen e that will ause the least energy dissipation when the program is ex-e uted, will not require hanges in the hardware to redu eenergy onsumption. Often, optimizing for performan ewill result in a program that also yields good energy re-sults. There are ases, however, where these two goals willresult in di�erent instru tion sequen es. The ompiler op-timization te hnique `Register Pipelining' is an example forthis e�e t. Usually aimed at performan e as well as energyoptimization, this optimization tries to redu e the numberof power hungry o�- hip memory a esses within a loop byholding array values that are needed in later loop iterationswithin the pro essor registers. This leads to additional in-stru tions being inserted into the ode, thus prolongingexe ution time. Sin e memory a esses onsume up to anorder of magnitude more power than instru tions operat-ing merely on the register �le, the resulting program may onsume less overall energy despite its longer running time.This parti ular example an be found in [2℄.Another ompiler based optimization for power is in-stru tion s heduling. Subsequent instru tions are ordered

2so as to in ur as little hange in ir uit state as possible.Some te hniques, aimed espe ially at VLIW ar hite tures,are des ribed in [3℄.Beside these ompiler oriented optimizations, there is afurther group of optimizations that ombines ar hite tural hanges with modi� ations to the ompiler. If e.g. a spe ialbus en oding used in an ar hite ture is not transparent tothe programming interfa e, the ompiler may need to takethe oding into a ount to generate orre t ode. Using as rat h pad memory within the system's memory hierar hymay require ompiler support to eÆ iently take advantageof this feature, sin e in ontrast to a hes, it is not trans-parent to software. Therefore, either the programmer orthe ompiler has to de ide whi h data or instru tions areto be pla ed within the s rat h-pad memory [4℄, [5℄.One ar hite tural parameter that also requires ompilersupport is the register �le size. There have been severalapproa hes to allow ompilers to adapt to di�erent targetar hite tures by supplying them with ar hite tural infor-mation about the target pro essor. In the Trimaran om-piler [6℄, the ma hine des ription language MDES is usedto model the underlying hardware. A di�erent approa hwas taken by [7℄, where a ompiler an be retargeted todi�erent pro essors of the DSP domain des ribed in MI-MOLA [8℄, a VHDL-like hardware des ription language.In this approa h it is possible to spe ify the ar hite turein the form of an exe utable spe i� ation and at the sametime supply the ne essary information for the ompiler.One re ent publi ation by Brooks et al. [9℄ presents aframework enabling power analysis and investigation of thee�e t of hardware modi� ations as well as ompiler opti-mizations. Their power analysis is based on parameter-izable power models of ommon stru tures found in modernmi ropro essors.For VLIW ar hite tures, Zalamea et al. [10℄ provide re-sults on erning y le time, area and power onsumptionfor register �les of di�erent sizes.A number of approa hes to ASIP design where the designspa e onsists of a number of ar hite tural parameters likenumber and kind of fun tional units, issue width and thesize of a hes have been reported [11℄. Most of these sear hthe design spa e for an area-time tradeo�. If power is alsoevaluated then it is on the basis of ir uit power modelswhi h are primarily non-appli ation spe i� [12℄, [13℄. Toour knowledge, this is the �rst work whi h evaluates energy onsumption by hanging an ASIP ar hite tural feature i.e.register �le size, in an appli ation spe i� evaluation.III. The en ompiler frameworkIn this work, the en ompiler framework was hosen for ode generation and evaluation. en is an energy aware ompiler developed for the RISC lass of ar hite tures, a-pable of generating ode with redu ed energy onsump-tion. en uses a ost fun tion that not only onsiders the`traditional' optimization options like ode size and exe u-tion time, but also the energy onsumption. To be ableto make de isions during ode generation, the ompiler re-quires a database ontaining information about the power

onsumption of single instru tions. This information analso be used to determine the energy onsumption of an in-stru tion sequen e by adding the individual ontributionsof ea h instru tion and a ounting for inter-instru tion ef-fe ts ( .f. se tion III-A).The input for the en ompiler onsists of an appli- ation program written in the programming language C.Some standard optimizations are performed on this �le us-ing the LANCE frontend [14℄. The ba kend of the ompilerthen generates assembly instru tions using the tree pat-tern mat her olive [15℄, whi h employs the energy modelas a ost fun tion during ode sele tion. Register allo a-tion is performed using the graph oloring based algorithmdes ribed in [16℄. Some ba kend optimizations like instru -tion s heduling, register pipelining, memory layout, jumpoptimization or peep hole optimizations are also available.ISS

profilinginformation trace file

encc

profiler

energy

LinkerAssembler & executableassemblyC-Code

databaseFig. 1. Work ow within the en frameworkThe use of en is simpli�ed by a graphi al user inter-fa e, whi h gives the user full ontrol over the ompilationpro ess. The available optimization options are made up of'exe ution time', 'energy' and ' ode size'. Supplemental op-timizations like instru tion s heduling, register pipeliningor support for di�erent kinds of memory are also provided.Using the GUI, the user an e.g. generate and exe ute as-sembly ode and run the pro�ler to generate a report onexe ution time, memory a esses and energy onsumption.The pro�ler that was used to generate the results pre-sented in this work will be presented in greater detail inse tion III-C.Several ar hite tural parameters of the targeted pro es-sor an easily be hanged in the ompiler. After re ompi-lation, the ompiler is able to generate ode for a modi�edtarget ar hite ture. Examples for these parameters in ludethe register �le size, number of register used for subrou-tine parameter passing as well as the available addressingmodes. This feature of en was used to retarget the om-piler from the ARM7 to the LEON pro essor, whi h is aSPARC V8 based ar hite ture. The VHDL model of theLEON is available under the GNU publi li ense [17℄.For this work, the ompiler on�guration was initially setup to des ribe the ARM7TDMI pro essor [18℄. The num-ber of registers was then su essively hanged to generate a ompiler for pro essors di�ering in the number of registers.The performan e of the generated programs was evaluatedin a simulation run using the energy database: The out-put generated by the instru tion set simulator is passed tothe pro�ler whi h al ulates the number of a esses to dif-ferent kinds of memory, the number of bytes o upied ininstru tion and data memory as well as overall energy on-sumption. In this way, all the exe uted instru tions along

3with their individual ontribution to energy dissipation are onsidered.A. Instru tion Level Energy ModelIn the literature, power and energy are often used syn-onymously. \Low Power" and \Energy Aware" have de-veloped into standard phrases. Generally, the importantfa tor to onsider in embedded system design is energy,sin e battery lifetime is ru ial.The power model used in the en ompiler is based onthe model developed by Tiwari et al. [19℄, whi h distin-guishes between basi osts and inter-instru tion e�e ts.The basi osts onsist of the measured urrent during ex-e ution of a single instru tion. The additional power on-sumption due to the hange of internal ir uit states ausedby di�ering subsequent instru tions has to be a ountedfor. This ontribution is summed up in the so- alled interinstru tion e�e t.Our enhan ement of Tiwari's power model onsists ofthe onsideration of power onsumption within the memorysystem [20℄. This ontribution to overall power should notbe ignored sin e the energy onsumed by o�- hip memory an be signi� antly higher than the amount dissipated bythe pro essor itself.For our target pro essor, the ARM7TDMI, neither aVHDL model that ould be simulated to determine powernor an instru tion level power model is available. There-fore, the instru tion level power values had to determinedby experimental physi al measurement using an evaluationboard.B. Physi al Power MeasurementTo perform the measurements, we used the evaluationboard AT91EB01 by Atmel [21℄. The pro essor used on thisboard is an AT91M40400, onsisting of an ARM7TDMI ore with 4 KB on- hip s rat h-pad memory. Additionally,the board o�ers 512 KB o�- hip memory. To perform thepower measurements, we used the onne tions on the boardthat allow dire t monitoring of the pro essor urrent. Tomeasure memory power onsumption, we had to ut allsupply power onne tions of one of the memory hips andinsert an amperemeter into the ele tri ir uit. The evalua-tion board with all the onne tions for power measurementin pla e is shown in �gure 2.The digital multimeter used for the measurements is an\ESCORT 95". It is hara terized by a basi pre ision of0.06% and a measuring frequen y of 20 measurements perse ond. Using this multimeter, one instru tion at a timewas measured and its power onsumption determined. Inorder to do this, the instru tion was repeated many timeswithin a loop. We found 100 instru tion instan es to suf-� e so as to get a stable reading on the digital multimeterand to be sure we are a tually measuring the instru tionwithin the loop, not the loop's bran h instru tions. Theaverage pro essor and memory urrents that an be readfrom the multimeter are measured during exe ution of thisloop, and the result is transferred to the power database

Fig. 2. The evaluation board with measuring onne tions. Top left:pro essor urrent. Top middle: memory urrent. Bottom: addressbus monitoringused by the ompiler and by the pro�ler. The results ofthese measurements an be found in [22℄.

40 45 50 55 60

ADD & LOGICAL

SUB & COMPARE

SHIFT

MOV

BRANCH

MULTIPLY

LOAD

STORE

PUSH

POP

current [mA]

38 Fig. 3. Measured power onsumption of individual instru tionsFigure 3 gives an idea of the possible improvements that an be a hieved by using the power model in the ode sele -tion phase of the ompiler. Note that the power requiredto a ess external memory is not in luded in this �gure!Even without the memory urrent, it be omes lear thatthe instru tions that by far onsume the most power arememory a ess instru tions. Consequently, there is a highpotential for optimizations espe ially in those instru tionsinvolving memory a esses.Another aspe t that has to be onsidered in the powermodel is the inter instru tion e�e t that des ribes thepower onsumed by hanges in the system state when dif-ferent instru tions are exe uted in sequen e. The in u-en e of this e�e t was also measured using the des ribedmethodology using tuples of instru tions instead of singleinstru tions. The exa t results of these measurements anbe found in [22℄. The overall average inter instru tion ef-fe t ontributes about 2mA to power onsumption, whi his below 5% for nearly all instru tions. Therefore, the inter

4instru tion e�e t was taken into a ount in all experimentsby adding these average 2mA to the base osts of the in-stru tions. To show that pre ision does not su�er from thissimpli� ation, we ondu ted some validation experimentsto determine the pre ision of our power model. Resultsshow that the average deviation of our power model pre-di tions from the measured results was only 1.7% [22℄. Wethus laim that handling the inter instru tion e�e t in theway des ribed above is appropriate for our purpose. Ourpower model is suÆ iently pre ise to allow it to be usedwithin a power aware ompiler.C. Integration of Power Model into the en ompilerThe values measured as des ribed in the last se tion weretransferred into a power database that is used by the om-piler during ode sele tion and within the pro�ler. The ode sele tor uses the database to hoose optimal instru -tion sequen es with respe t to power, whereas the pro�lerevaluates the overall power onsumption of a program thatis being exe uted.The en ompiler uses the LANCE-frontend [14℄ totransform the initial C- ode to an intermediate representa-tion in 3-address-format. From this form, the ompiler gen-erates data ow trees that an be passed to the ode sele -tor, whi h performs the a tual hoi e of the orrespondingassembler instru tions. As a ode sele tor, we hose olive[15℄, the su essor of the well-known iburg. olive requires agrammar to spe ify whi h assembly instru tions over thenodes in the data ow tree. A ording to a supplied ostfun tion, it will de ide on an optimal over for ea h treeand generate the orresponding assembly ode stru ture.Sin e olive allows for arbitrary ost fun tions, the orre-sponding values are retrieved from the power database andfed to olive's ost fun tion. The ode sele tor will thenguarantee that an optimal over is generated for ea h data ow tree.The power information from the database is not onlyused during ode sele tion, but also for pro�ling duringsimulation runs. Sin e we are investigating overall energy onsumption of ben hmark programs in our work, the pro-�ler takes the output of the instru tion set simulator, whi h ontains information on exe uted instru tions and memorya esses, and annotates this information with the orre-sponding power values from the database. The individualvalues are summed up and a statisti al output for the om-plete program is generated.IV. The ARM7TDMI Pro essorThe ARM7TDMI 32-bit RISC pro essor [18℄ was hosenfor our work sin e it is widely used in embedded systemstoday, and is parti ularly re ommended for ultra low en-ergy appli ations, making it a suitable target pro essor forthe en ompiler. It o�ers a 32 bit RISC instru tion set,16 general purpose registers with 32 bits ea h as well as anALU, a multiplier and a dedi ated barrel shifter, but nodata or instru tion a he ( .f. �gure 4).To redu e energy onsumption, the ARM7TDMI fea-tures an additional 16 bit instru tion set alled THUMB

Fig. 4. Blo k Diagram of the ARM7TTDMI pro essorwith less fun tionality ompared to the 32 bit instru tionset. One major restri tion is that only eight of the 16general purpose registers are available in THUMB mode.Some of the 32-bit instru tions using e.g. onditional exe- ution or the multiply-a umulate operation are also notsupported. The energy savings ompared to the 32-bitinstru tion set is due to the fa t that ode density is onaverage 30% higher in THUMB mode [23℄ as only 16 bitsof instru tion data have to be fet hed for ea h instru tioninstead of 32. Sin e instru tion fet hes onsume a largeamount of energy, this te hnique generally results in lowerenergy dissipation during program exe ution. Sin e exe u-tion times are not signi� antly longer when the THUMBinstru tion set is used, it is the �rst hoi e for energy awareembedded systems.V. Ben hmark SuiteTo investigate the e�e t of hanging the register �lesize, a number of ben hmark appli ations have to be se-le ted. The areas that are overed by ARM pro essors in- lude automotive equipment, onsumer entertainment, dig-ital imaging, industrial appli ations, networking, se urity,storage and wireless appli ations [24℄.The hosen ben hmarks were taken from the domains ofdigital signal pro essing and multimedia, along with stan-dard sorting algorithms.The biquad N se tions program, part of the DSP-kernelben hmark suite [25℄, performs the �ltering of input val-ues through N biquad IIR se tions. latti e init al ulatesthe output of a latti e �lter, whereas matrix mult imple-ments the multipli ation of two 2D matri es. me ivlin isa demonstration for a multimedia appli ation, mainly on-sisting of integer arithmeti operations. The standard sort-ing algorithms all sort a given array of integers using dif-ferent methods.Note that the ARM7TDMI does not feature a oating

5TABLE IBen hmarks used for the experiments� biquad N se tions DSP� latti e init DSP� matrix-mult mult. of two m x n matri es� me ivlin media appli ation� bubble sort standard sorting algorithm� heap sort standard sorting algorithm� insertion sort standard sorting algorithm� sele tion sort standard sorting algorithmpoint unit. Sin e the use of the data types float or doublewould result in ineÆ ient library alls, only data of typeinteger is generally used in real-world appli ations. In or-der to provide realisti ben hmark appli ations, our hosenprograms only onsider data of type integer as well.VI. Experimental Methodology & ResultsIn this se tion we �rst present the methodology used togather the results presented in this work. Along with theplots for the obtained values, some appli ation spe i� ex-planations for the results are given. This analysis allows anexperien ed programmer to de ide on a minimum numberof registers by analyzing the C sour e ode of an appli a-tion.Using the ompiler en , the number of registers avail-able for the appli ation is varied by hanging the orre-sponding value in the ompiler on�guration �le. Thismodi�ed ompiler is then used to ompile the ben hmarks.The resulting assembly ode is analyzed and exe uted us-ing the instru tion set simulator for the ARM7TDMI. The hosen register �le size is limited to the range from threeto eight. Three is the minimum number of registers ne -essary to exe ute our ben hmarks in THUMB instru tionset mode, and eight is the number of registers in the 'real'ARM pro essor. The instru tion set simulator does notprovide a means of re on�guring register �le size, hen esimulation is limited to appli ations using a maximum ofeight registers. This limitation is not a great drawba k,however, sin e most of the used ben hmarks do not bene�tsigni� antly from more than eight registers. For a stati analysis of the generated assembly ode not requiring sim-ulation, up to 20 registers were onsidered.It should be mentioned that we have not taken into a - ount some of the potential positive e�e ts that arise fromusing less registers. A smaller register �le size will auseless hip area to be used and will also bring down power onsumption in the ir uitry itself. Another e�e t that wasnot onsidered is the fa t that using less registers generallyresults in shorter instru tion words, sin e the operands inregisters an be addressed with fewer bits.`Live values' are values that will be read at a later pointin time during program exe ution ( .f. [26℄). In order toexe ute a program that requires more live values than avail-able registers, additional memory is needed to hold the livevalues. Su h values are usually stored in external memory

and retrieved just before they are needed. (Another solu-tion for this problem is data-regeneration, in whi h valuesare not always stored to memory but may also be re al u-lated [27℄). If the register �le size is very small, the ompilerthus needs to insert ode to store values to memory on ethey have been omputed and to retrieve them from mem-ory when they are needed. This extra ode is alled spill ode, as the values are 'spilled' to memory. It is lear thatthis will in rease the size of the generated program. Duringexe ution of the program, the spill ode will require timeto exe ute, thus prolonging exe ution time. Finally, sin e(o�- hip) memory a esses are usually very power-hungry,one an expe t the power onsumption for a program to goup signi� antly if a lot of spill ode is generated. Conse-quently, a strong de rease in appli ation performan e is tobe expe ted when the register �le size is too small for anappli ation.As mentioned in hapter III-B, the evaluation board usedfor the power measurements o�ers a 4 KB on- hip s rat h-pad memory and a 512 KB o�- hip memory ( .f. �gure5). S rat h-pad memory is like an on- hip a he withoutany tag memory or a he organization hardware, i.e. themanagement of the s rat h-pad memory is left to the pro-grammer. The en ompiler supports the use of di�erentmemory types in su h a way that the developer may spe -ify where instru tions and data are to be stored. Sin eon- hip memory is limited to 4 KB, program ode anddata do not �t into the s rat h-pad memory, so we de- ided to keep the data o�- hip while transferring the in-stru tions to the s rat h-pad memory. This makes sensefor our observations, sin e we are interested in the e�e tof a esses to data in memory and less in the e�e t of in-stru tion fet hing. By partitioning the memory obje ts inthe des ribed way, instru tion fet hes a ount only for asmall additional power onsumption and exe ution time,whereas data transfers whi h are also required for storingor retrieving spilled variables show a high ontribution tooverall power onsumption and exe ution time. The ho-sen memory layout also oin ides with the situation oftenfound in embedded systems, where the program is usuallystored in a ROM and only the dynami data is stored inexternal RAM.�������������������������������������������������������

�������������������������������������������������������

coreARM7TDMI

Pro-gram

proc.

�����������������������������������������������������������������������������������������������������������������������������������������������

�����������������������������������������������������������������������������������������������������������������������������������������������

4KBon

chipRAM

512KB off chip RAM

DataFig. 5. Memory on�guration of the Atmel AT91M40400 evaluationboardA. Results for the Ratio of Spill Instru tions to Total Stati Code SizeThe addition of spill ode for de reasing number of reg-isters leads to a larger ode size. The spilling instru tionsare made up of memory load and store operations as well as

6the ne essary address al ulations. Sin e the total numberof instru tions related to spilling is not very meaningful,the ratio of spill instru tions to total ode size is given.The obtained results are shown in �gure 6.As expe ted, the ratio of spill ode de reases with in- reasing number of registers. In our observations, we foundthat the 'saturation' of the ben hmark appli ations, i.e. thepoint where no spill ode is present in the assembly ode,varies between 7 and 19 registers. This value provides anupper bound for the register �le size: if an ASIP designerwants to make sure that his appli ation performan e is notimpaired by spill ode, he will have to determine this 'sat-uration' value for a parti ular appli ation. The plot alsoprovides information about the maximum register pressure,i.e. the maximum number of values that are live at thesame time.0

0,1

0,2

0,3

0,4

0,5

0,6

0,7

0,8

3 4 5 6 7 8 9 10

Number of Registers

Rat

io o

f S

pill

Inst

ruct

ion

s to

To

tal

Co

de

(sta

tic)

biquad lattice_initmatrix-multme_ivlinbubble_sortheap_sortinsertion_sortselection_sortFig. 6. Ratio of spill instru tions to total number of instru tionsHaving looked at this �rst stati analysis, we found thatthe results are useful when minimal ode size (a stati prop-erty of the ode) is the target. However, the results are oflittle use for dynami runtime analysis of appli ations: Spill ode introdu ed within the innermost loop of an appli a-tion is exe uted many times, but ounted only on e in thisobservation. To be able to estimate the in uen e of spill ode at runtime, we have to make further experiments andobserve the behavior of the appli ations when they are ex-e uted. A dynami analysis is also ne essary to be able toextra t any kind of information on the power onsumptionof the appli ation.B. Results for the Number of Exe uted Instru tionsFor the results presented in this se tion, the pro�ler wasused to ount the number of instru tions that were exe- uted by the instru tion set simulator. The results areshown in �gure 7. The instru tion ount values for thedi�erent ben hmark appli ation were s aled in this �gureto produ e the results in a single plot. This is a eptablesin e the trends of the urves are preserved.As expe ted, the resulting plots now show more mean-ingful trends sin e spill ode inserted within an inner loopis ounted ea h time the loop is exe uted.In some urves, there is one remarkable sharp bend, e.g.in the program insertion sort. When the number of regis-

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

3 4 5 6 7 8

Number of Registers

Nu

mb

er o

f In

stru

ctio

ns

(

mill

ion

s)

biquad (x 500)lattice_init (x 2)matrix-mult (x 250)me_ivlin (x 1)bubble_sort (x 4)heap_sort (x 14)insertion_sort (x 6)selection_sort (x 3)Fig. 7. Number of exe uted instru tions over number of registersters is in reased from four to �ve registers, the instru tion ount de reases substantially, whereas an in rease from �veto six registers hardly hanges the number of exe uted in-stru tions. To understand the ause for this behavior, wetake a look at the innermost loop of the appli ation whi his exe uted more often than any other part of the ode andtherefore has a strong in uen e on the dynami behavior ofthe program. The innermost loop is shown in the original Csour e ode and in the assembly ode generated assumingeight available registers (�gures 8 and 9, respe tively).

Fig. 8. Inner loop of insertion sort: C ode

Fig. 9. Inner loop of insertion sort: assembly odeIt is possible to determine whi h C variable is stored inwhi h physi al register of the ARM7TDMI pro essor by omparing the two ode fragments. For example, indx2is stored in register r3 sin e it is the result of the �rst

7subtra tion in C as well as in the assembly ode. indx iskept in register r6, but sin e this variable is not read againwithin the loop (it is only present in the initialization of thefor-loop), it is not live within the loop and an therefore benegle ted. Continuing the analysis of the assembly ode,we ome to the mapping shown in table II. The expression(indx2-1)*4 is used for addressing the one-dimensionalarray This[℄whose elements of type int require four bytesea h.This table provides an explanation of the sharp bendin the urve for the insertion_sort ben hmark: sin e�ve registers (r0, r2, r3, r4, r7) are required to hold thevalues that are live within the innermost loop, redu ingthe number of registers to four will result in spilling withinthe innermost loop, whi h will inevitably lead to a severeperforman e degradation of the appli ation. The knees inthe other appli ations' urves an be explained in a similarway, i.e. by analyzing the required number of live valueswithin the innermost loop.TABLE IIMapping of variables to registersVariable Registerindx2 r3indx2-1, (indx2-1)*4 r2This[℄ r0temp_val r7 ur_val r4To summarize, an ASIP designer an de ide on the min-imum requirements for register �le size by using a re on-�gurable ompiler and looking at the results of dynami runtime analysis. On the other hand, given he has an un-derstanding of the ode generation pro ess for a parti ularar hite ture, he an estimate the number of required reg-isters by determining the number of live values within theinnermost loop.C. Results for the Number of Cy lesAs the next step, the number of y les required for exe u-tion of the ben hmarks was extra ted. The results obtainedfor number of y les are shown in �gure 10. Again, thevalues are s aled appropriately.The general behavior of these urves is similar to the re-sults shown in the previous �gure on erning instru tion ount, sin e y les and instru tions are strongly interde-pendent. Ea h instru tion requires a ertain amount of y les to exe ute (the so- alled CPI value - y les per in-stru tion). For the ARM7TDMI, the CPI has a value be-tween one and �ve, not taking into a ount the e�e t ofpipelining. Store and Load operations require more y lesto omplete than arithmeti operations: the latter requireonly one y le, whereas a Store and Load to/from o�- hipmemory takes two and three y les, respe tively. Hen e,one additional spill instru tion within the appli ation re-quires several more y les every time the spill instru tion isexe uted. That is why the sharp bends in the urves remain

0

0.5

1

1.5

2

2.5

3

3 4 5 6 7 8

Number of Registers

Nu

mb

er o

f C

ycle

s (m

illio

ns)

biquad (x 650)lattice_init (x 1)matrix-mult (x 100)me_ivlin (x 1)bubble_sort (x 3)heap_sort (x 12)insertion_sort (x 5)selection_sort (x 6)Fig. 10. Number of y les over number of registersin the same spots while the slope of the urves hanges.Sin e we are observing the e�e ts of spill ode in the pro-gram, the same appli ation hara teristi s (number of livevalues in the innermost loop) that were dis ussed in theprevious se tion an be used to explain these observations.D. Results for Power ConsumptionIn this se tion we exploit the fa t that the en ompilerframework is energy aware: the pro�ler is apable of ana-lyzing the power ontributions of every exe uted instru -tion, also taking into a ount the inter instru tion e�e t.These power ontributions are added up to give a valuefor the omplete program's power onsumption during ex-e ution. The results obtained for the average power on-sumption in relation to the number of available registersare shown in �gure 11.

200

220

240

260

280

300

320

340

360

380

400

3 4 5 6 7 8

Number of Registers

Ave

rag

e P

ow

er C

on

sum

pti

on

(m

W)

biquadlattice_initmatrix-multme_ivlinbubble_sortheap_sortinsertion_sortselection_sortFig. 11. Average power onsumption over number of registersData load and store instru tions introdu ed due tospilling in rease the power onsumption signi� antly sin ea esses to o�- hip memory are very power onsuming inRISC ar hite tures. Due to the used memory layout, theo�- hip data a ess instru tions related to spilling onsumea large per entage of the overall power of the appli ation.E. Results for Energy ConsumptionComing to one of the most important aspe ts of embed-ded system design today, we now take a look at the energy

8 onsumption of our ben hmarks in relation to the register�le size. The general behavior of the urves in �gure 12is obvious: sin e y le time as well as power onsumptionwent down with in reasing register �le size, energy shouldshow a very strong de rease for growing number of regis-ters.The strong orrelation between register �le size and en-ergy an also whi h show that, using the on�gurationwe hose for our experiments (instru tions in s rat h-padmemory, data o�- hip), additional memory a esses re-quired for spilling onsume an order of magnitude moreenergy than e.g. an ADD instru tion. This information isgiven in table III. TABLE IIIEnergy onsumption of ADD, LDR and STR instru tionsInstru tion Av. Energy Consumption� 32.768 MHzADD 4.91 nWsLOAD 49.4 nWsSTORE 48.7 nWs

0

0.005

0.01

0.015

0.02

0.025

0.03

0.035

3 4 5 6 7 8

Number of Registers

En

erg

y C

on

sum

pti

on

(W

S)

biquad (x 650)lattice_init (x 1)matrix-mult (x 100)me_ivlin (x 1)bubble_sort (x 3)heap_sort (x 12)insertion_sort (x 5)selection_sort (x 6)Fig. 12. Energy onsumption over number of registersSin e power as well as y le time ontribute to the en-ergy de rease observed here, it is interesting to investigatewhi h of these two fa tors has the higher ontribution tooverall energy savings. Figure 13 provides the result ofour investigation: the main ontribution to energy savings omes from saving y les, only a small amount of energyis saved due to the use of instru tions that onsume lesspower. This shows that the laten y of the used memorywill also have a strong e�e t on the a hievable energy sav-ings. For low-energy ompiler systems we an dedu e thatoptimizing sour e ode for time is generally a good ap-proa h towards energy eÆ ient software. The ontributionof power to energy savings should not be negle ted, how-ever, sin e often (su h as in mobile devi es), improvementsin power onsumption by as little as 5% are regarded asinteresting in the industry.The fa t that the number of registers is an importantfa tor during the design pro ess of an ASIP is underlinedby the results given in table IV whi h shows the relative en-

������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

60 %

40 %

20 %

0 %3 4 5 6

80 %

87

to Decrease in Power

Energy Reduction dueto Decrease in Numberof Cycles

Reference Energy

Energy Reduction due

��������

��������

��������

��������

120 %

100 %

Number of Registers

Nor

mal

ized

val

ues

����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

Fig. 13. Contribution of power and time to energy savingsTABLE IVEnergy savings by hanging register file sizeben hmark 3 ! 4 4 ! 5 5 ! 6 6 ! 7 7 ! 8biquad 62.89% 13.35% 11.26% 10.42% 6.71%latti e init 17.45% 20.97% 18.92% 19.78% 16.84%matrix-mult 33.44% 18.57% 14.99% 24.26% 23.28%me ivlin 59.28% 43.16% 36.29% 18.24% 20.47%bubble sort 39.07% 55.63% 16.78% 0.74% 9.63%heap sort 22.35% 24.87% 18.27% 33.24% 10.16%insertion sort 45.47% 57.11% 5.60% 2.63% 0.92%sele tion sort 29.68% 22.59% 30.14% 2.94% 0.00%ergy savings possible by in reasing the number of availableregisters one by one. The average energy savings possiblein the range from three to eight registers amount to ap-proximately 22%. Sin e all urves are starting to level outeven in the range we have onsidered here, it is lear thatthe positive ontribution of additional registers be omessmaller and smaller the more registers are already presentin the design. One should bear in mind, however, that ea hadditional register will also onsume additional energy dueto added apa itan es and required hip area. This energyhas not been onsidered in our approa h, but will be thesubje t of future work.VII. Con lusion and Future WorkIn this work, we des ribe the methodology used to de-sign and implement an instru tion level power model forthe en ompiler and show its integration into the om-piler framework. Physi al measurements using an evalua-tion board were performed to gather data on erning in-stru tion level power onsumption. In our experiments,the number of registers for the ARM7TDMI pro essor was hanged within the en ompiler. This design parameteris of importan e in ASIP design sin e it strongly in uen esenergy onsumption in those embedded systems that aresensitive to memory a esses. Several ben hmark appli a-tions were ompiled using the ustomized ompiler. Someappli ation hara teristi s responsible for the observed re-sults on erning number of spill instru tions, number ofexe uted instru tions, y le time required for exe ution aswell as power onsumption and energy dissipation wereidenti�ed. The results obtained are useful during the de-sign of ASIPs, espe ially at an early stage of design spa eexploration where the designer an de ide how many reg-isters an appli ation will need to work within the requiredtime and/or energy onstraints for the �nal pro essor. Fu-

9ture work on erning the register �le size will onsist oftaking additional observations for di�erent ar hite tures,in luding the LEON ar hite ture whi h in ontrast to theARM7TDMI features both instru tion and data a he.Considering the ar hite tural onsequen es of modifyingthe register �le size as well as hanging instru tion odingwill also be part of our future work. To further broadenthe onsidered design spa e, we plan to develop a retar-getable instru tion set simulator whi h will enable us tomake measurements for an arbitrary number of registersinstead of being limited to the maximum number the pro- essor provides. In the ourse of the experiments, we hopeto identify other parameters that may have a strong e�e ton energy onsumption of embedded systems. Also, moreexperiments will have to be performed using a ompilerthat supports spe i� features of the targeted ASIP. Weplan to work on designing and implementing algorithmsfor the automati extra tion of appli ation hara teristi sto help in an early estimation for number of required reg-isters in the ASIP design ow.Referen es[1℄ K. Groebmair. Bestselling C25 mobile phone nowperforms even better: The \C25 power" featureslonger standby time and improved sound quality.http://www.siemens.de/i /produ ts/ d/english/index/news/pressreleases/index.html or http://194.121.239.192/ gi-bin/db4web .exe/db/pressedb/i p user/en/pm einzeln.d4w?pm id=759, August 1999.[2℄ S. Steinke, R. S hwarz, L. Wehmeyer, and P. Marwedel. Lowpower ode generation for a RISC pro essor by Register Pipelin-ing. Te hni al Report 754, University of Dortmund, Dept. ofComputer S ien e XII, 2001.[3℄ C. Lee, J. Lee, T. Hwang, and S. Tsai. Compiler Optimizationon Instru tion S heduling for Low Power. In 13th InternationalSymposium on System Synthesis. ACM, Septermber 2000.[4℄ P.R. Panda, N.D. Dutt, and A.Ni olau. Memory Issues in Em-bedded Systems-On-Chip. Kluwer A ademi Publishers, 1999.[5℄ S. Steinke, C. Zobiegala, L. Wehmeyer, and P. Marwedel. Mov-ing Program Obje ts to S rat h-Pad Memory for Energy Redu -tion. Te hni al Report 756, University of Dortmund, Dept. ofComputer S ien e XII, 2001.[6℄ Trimaran. Trimaran Homepage. "http://www.trimaran.org/index.html", 1999.[7℄ R. Leupers. Retargetable Code Generation for Digital SignalPro essors. Kluwer A ademi Publishers, 1997.[8℄ R. J�ohnk and P. Marwedel. MIMOLA Referen e Manual Version3.45. Te hni al Report 470, University of Dortmund, Dept. ofComputer S ien e XII, Mar h 1993.[9℄ D. Brooks, V. Tiwari, and M. Martonosi. Watt h: A Frameworkfor Ar hite tural-Level Power Analysis and Optimizations. InPro eedings of the ACM ISCA 2000, June 2000.[10℄ J. Zalamea, J. Llosa, E. Ayguade, and M. Valero. Software andHardware Te hniques to Optimize Register File Utilization inVLIW Ar hite tures. In Pro . of the International Workshopon Advan ed Compiler Te hnology for High Performan e andEmbedded Systems (IWACT), July 2001.[11℄ M.K. Jain, M. Balakrishnan, and A. Kumar. ASIP DesignMethodologies : Survey and Issues. In Pro eedings of the Four-teenth International Conferen e on VLSI Design, pages 76{81,January 2001.[12℄ J. Kin, C. Lee, W.H. Mangione-Smith, and M. Potkonjak. PowerEÆ ient Mediapro essors: Design Spa e Exploration. In Pro- eedings of the Design Automation Conferen e (DAC), June1999.[13℄ D. Singh, J. Rabaey, M. Pedram, F. Catthoor, S. Rajgopal,N. Sehgal, and T. Mozden. Power ons ious CAD Tools andMethodologies: A Perspe tive. In Pro eedings of the IEEE, April1995.[14℄ R. Leupers. LANCE V2.0. "http://ls12-www. s.uni-dortmund.de/�leupers/lan eV2/lan eV2.html", 2000.

[15℄ C.W. Fraser, D.R. Hanson, and T.A. Proebsting. Engineering aSimple, EÆ ient Code Generator Generator. Te hni al report,AT & T Bell Laboratories, 1992.[16℄ A.W. Appel. Modern Compiler Implementation in C. Cam-bridge University Press, Cambridge, New York u.a., 1998.[17℄ J. Gaisler. Gaisler Resear h. "http://www.gaisler. om", 2001.[18℄ ARM Ltd. ARM Ltd. Homepage. "http://www.arm. om", 2000.[19℄ V. Tiwari, S. Malik, and A. Wolfe. Power Analysis Of Em-bedded Software: A First Step Towards Software Power Min-imization. In Pro eedings of the IEEE/ACM InternationalConferen e on Computer-Aided Design ICCAD, pages 384{390,November 1994.[20℄ S. Steinke, M. Knauer, L. Wehmeyer, and P. Marwedel. AnA urate and Fine Grain Instru tion-Level Energy Model sup-porting Software Optimizations. In International Workshopon Power And Timing Modeling, Optimization and Simulation(PATMOS), September 2001.[21℄ Atmel. Atmel Corporation Homepage. "http://www.atmel. om", 2000.[22℄ M. Theokharidis. Measuring Energy onsumption ofARM7TDMI Pro essor Instru tions (in german), available from"http://ls12-www. s.uni-dortmund.de/publi ations/theses".Master's thesis, University of Dortmund, Department ofComputer S ien e XII, 2000.[23℄ Advan ed RISC Ma hines Ltd (ARM). An Introdu tion toThumb. available from: "http://www.arm. om", Mar h 1995.[24℄ Advan ed RISC Ma hines Ltd(ARM). ARM Powered[tm℄ produ ts."http://www.arm. om/arm/arm powered?OpenDo ument",2001.[25℄ V. Zivojnovi , J. Martinez, C. S hlaeger, and H. Meyr. DSP-stone: A DSP-Oriented Ben hmarking Methodology. In Pro- eedings of the International Conferen e on Signal Pro essingAppli ations and Te hnology, ICSPAT'94, Dallas, Texas, O to-ber 1994.[26℄ St. S. Mu hni k. Advan ed Compiler Design and Implementa-tion. Morgan Kaufmann Publishers, San Fran is o, California,1. edition, 1997.[27℄ C.H. Gebotys. Network Flow Approa h to Data Regenerationfor Low Energy Embedded System Synthesis. In IntegratedComputer-Aided Engineering, J. Wiley Publishers, 1998.