
To appear in the 33rd Int. Symp. on Microarchitecture (MICRO-33), December 10-13, 2000, Monterey, California


Modulo Scheduling for a Fully-Distributed Clustered VLIW Architecture

Jesús Sánchez and Antonio González

Dept. of Computer Architecture, Universitat Politècnica de Catalunya

Barcelona - SPAIN

E-mail: {fran,antonio}@ac.upc.es

Abstract

Clustering is an approach that many microprocessors are adopting in recent times in order to mitigate the increasing penalties of wire delays. In this work we propose a novel clustered VLIW architecture which has all its resources partitioned among clusters, including the cache memory. A modulo scheduling scheme for this architecture is also proposed. This algorithm takes into account both register and memory inter-cluster communications so that the final schedule results in a cluster assignment that favors cluster locality in cache references and register accesses. It has been evaluated for both 2- and 4-cluster configurations and for differing number and latencies of inter-cluster buses. The proposed algorithm produces schedules with very low communication requirements and outperforms previous cluster-oriented schedulers.

1. Introduction

Technology projections point to wire delays as being one of the main hurdles for improving instruction throughput of future microprocessors [23]. As wire delays grow relative to gate delays and feature sizes shrink, the percentage of on-chip transistors that can be reached in a single cycle will decrease, and microprocessors will become communication bound rather than capacity bound [1][14].

Techniques to solve this problem at all levels, from applications to technology, will be crucial for performance. Clustering is an effective microarchitectural approach to mitigate the negative effect of wire delays. The main idea is to have a hierarchical organization of the interconnection wires such that units that communicate frequently are interconnected through short and fast wires. On the other hand, units that rarely communicate can use longer and slower wires. In other words, the microarchitecture exploits what we may call communication locality. Several commercial microprocessors have adopted this approach, such as the Alpha 21264 [10], which is a superscalar processor, but this trend is even more common for VLIW processors used in the embedded/DSP domain. Examples of the latter are Texas Instruments' TMS320C6000 [24], Equator's MAP1000 [15] and Analog's TigerSharc [8].

Clustering can be applied to different parts of the microarchitecture. Cluster microarchitectures proposed so far, both in the commercial and research arena, distribute the functional units and register files, but the data cache is considered a centralized resource. This centralized organization challenges the scalability of these architectures. Besides, some studies point out that the access time (in number of cycles) to the memory structures is likely to increase with future technologies, even when their capacity is kept constant [1]. This suggests that short-latency memory structures should be even smaller than they are today. Because of these two reasons, we believe that a distributed cache memory architecture is key for increasing the performance of future microarchitectures.

In this work we propose a clustered VLIW microarchitecture with a distributed cache memory. This architecture has all the resources distributed: instruction fetch, execute and memory units. It resembles very much a multiprocessor, with the exception that all the clusters progress in a lockstep mode, and inter-cluster register communications are controlled by the compiler by means of certain fields in the ISA. Because of this resemblance we refer to this architecture as a multiVLIWprocessor.

The effectiveness of this microarchitecture strongly depends on the ability of the compiler to generate code that balances the workload of the different clusters and results in few inter-cluster communications. In this work we propose a modulo scheduling technique for multiVLIWprocessors. The proposed scheduler includes some heuristics for minimizing inter-cluster register communication, based on the information provided by the data dependence graph. Besides, it implements a powerful memory locality analysis based on Cache Miss Equations [9], which guides the scheduling of memory instructions with the objective of minimizing inter-cluster memory communications.


[Figure 1. Microarchitecture of a multiVLIWprocessor: N clusters, each with INT, FP and MEM functional units, a register file, a local data cache and a local instruction cache; the clusters are interconnected by register buses and by memory buses that also reach main memory.]

Some previous work related to scheduling of instructions for clustered VLIW architectures can be found in the literature for non-cyclic [6][4][11][18] and cyclic code [17][7][22], but to the best of our knowledge this is the first study that deals with a clustered VLIW architecture that has a distributed data cache.

The rest of this paper is organized as follows. Section 2 describes the architecture of the multiVLIWprocessor and some basic background on modulo scheduling. An example that motivates the proposed algorithm is shown in Section 3. In Section 4, the proposed algorithm is described and Section 5 shows performance results obtained for different configurations. Finally, the main conclusions of the work are drawn in Section 6.

2. MultiVLIWProcessors

In this section we first describe the microarchitecture of multiVLIWprocessors and then we review some basic concepts of modulo scheduling for the proposed architecture.

2.1. Microarchitecture

Our base architecture (see Figure 1) is composed of several clusters, each one executing a fixed part of each VLIW instruction. All clusters work in lockstep mode, i.e., any stall in one cluster also stalls the other clusters. Every cycle, all clusters fetch their corresponding parts of a new VLIW instruction from their local instruction caches. Each cluster consists of several functional units, a register file and a local data cache memory in addition to the local instruction cache. Functional units can be of three different types: integer arithmetic, floating-point arithmetic or memory access. For the sake of simplicity, we consider that all clusters are homogeneous (i.e., with the same number and type of functional units), but the proposed techniques can be generalized for heterogeneous clusters.

Register values generated by one cluster and needed by another one are communicated through a set of buses that are shared by all clusters (called register buses). A value that is put on a register bus can come from either the local register file or the output of a functional unit through a short-circuit. On the other hand, a value that is read from the bus can be stored in a register file, feed a functional unit, or both. Thus, instruction register operands can be read from either the local register file or any bus, and instruction results can be written into the register file and to any register bus. All register communication operations are explicitly encoded in the appropriate fields of the VLIW instruction, which are set at compile time. Thus, no additional hardware is needed to manage and arbitrate register buses.

The detailed VLIW instruction format is shown in Figure 2. Each instruction for a particular cluster consists of the following fields: an operation for each functional unit in that particular cluster (FUj), and the source (IN BUS) and target (OUT BUS) of the bus (there are as many IN/OUT fields as number of buses). The IN BUS field indicates, if necessary, the register in the local register file in which the value that is in IRV has to be stored. The IRV (Incoming Register Value) is a special register in each cluster that latches the value that comes from the bus. The OUT BUS field indicates from which local register a value has to be issued to the bus, if any. If the register is being written in that cycle, the data will be bypassed from the output of the corresponding functional unit. Since a bus is a resource shared by all the clusters, when one particular cluster places data on the bus (OUT BUS), the bus will be busy during the entire bus latency and no other instruction can use it (a bus is considered by the scheduling algorithm as another resource in the reservation table).

[Figure 2. VLIW instruction format: each cluster's slice holds one operation (OP, SRC1, SRC2, TARGET) per functional unit plus IN BUS and OUT BUS fields per bus; FU inputs come from a register, a bus or a constant, and bus input/output fields name a register or are null/unused.]
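To make the field layout concrete, the following is a minimal sketch of how the per-cluster instruction slice described above could be represented. The paper specifies only which fields exist, not their binary encoding, so all type names, field widths and the FU/bus counts here are illustrative assumptions.

```c
/* Illustrative encoding of the per-cluster VLIW instruction fields.
 * Names and sizes are hypothetical; the paper does not define a layout. */

#define NUM_FUS      3   /* functional units per cluster (assumption) */
#define NUM_BUSES    1   /* register buses (assumption) */
#define NUM_CLUSTERS 2   /* 2-cluster configuration (assumption) */

typedef struct {
    unsigned op;      /* opcode */
    unsigned src1;    /* register, bus (IRV), constant or unused */
    unsigned src2;
    unsigned target;  /* destination register */
} FUOperation;

typedef struct {
    FUOperation fu[NUM_FUS];     /* one operation per functional unit */
    unsigned in_bus[NUM_BUSES];  /* IN BUS: local register that stores
                                    the value latched in IRV, if any  */
    unsigned out_bus[NUM_BUSES]; /* OUT BUS: local register whose value
                                    is issued to the bus, if any      */
} ClusterInstruction;

typedef struct {
    ClusterInstruction cluster[NUM_CLUSTERS]; /* one slice per cluster */
} VLIWInstruction;
```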

Regarding memory accesses, a load/store issued by a cluster first tries its local L1 data cache. If the data is found, the access is satisfied with minimum latency. Otherwise, the hardware tries the cache of the other clusters or, finally, the access is solved by the main memory. Both the local memories and main memory are interconnected through one or several buses (called memory buses). As the cache is physically partitioned among the clusters, coherence among the local caches and the main memory has to be kept. For this reason, a snoopy MSI protocol [5] has been implemented. This protocol is completely transparent to the ISA, and further, both the coherence and the bus arbitration are managed by the hardware. When a memory access misses in its local cache, the miss request is queued in a local MSHR (Miss information/Status Handling Register) structure, since the L1 data cache is non-blocking [12]. Then, the access has to compete for a free memory bus in order to access a remote cache or the main memory.
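The miss-handling path just described can be summarized in code. Only the 10-entry MSHR size (Section 5.1) comes from the paper; the types, field names and allocation policy below are illustrative assumptions.

```c
/* Sketch of the miss path: a local-cache miss is queued in the
 * cluster's MSHR and then competes for a free memory bus. */

#include <stdbool.h>

#define MSHR_ENTRIES 10   /* per-cluster MSHR size (Section 5.1) */

typedef struct {
    bool     valid;
    unsigned address;     /* missing cache line address */
    unsigned dest_reg;    /* register waiting for the value */
} MSHREntry;

typedef struct {
    MSHREntry entry[MSHR_ENTRIES];
} MSHR;

/* Returns the allocated entry index, or -1 if the MSHR is full, in
 * which case the instruction stalls until an entry is freed. */
int mshr_allocate(MSHR *m, unsigned address, unsigned dest_reg) {
    for (int i = 0; i < MSHR_ENTRIES; i++) {
        if (!m->entry[i].valid) {
            m->entry[i] = (MSHREntry){ true, address, dest_reg };
            return i; /* next: arbitrate for a free memory bus */
        }
    }
    return -1;
}
```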

All the dependences with memory operations are dynamically checked, since the scheduler may have considered an optimistic latency for these instructions (i.e., hit in the local cache). If any dependence is not met, the dependent instruction stalls in all clusters until the hazard is resolved.

2.2. Background on Modulo Scheduling

Software pipelining is a very effective technique to statically schedule loops. The most popular scheme to perform software pipelining is called modulo scheduling [20][13]. The two main parameters that statically characterize a modulo scheduled loop are the initiation interval (II) and the stage count (SC). The former reflects the number of cycles that a kernel iteration takes (assuming no stalls), whereas the latter shows how many iterations are overlapped, and determines the length of the prolog and epilog.

For a clustered VLIW architecture, both the II and the SC can be affected by inter-cluster register communications. If the communication buses become saturated, a higher II is required. Moreover, communication operations may increase the length of the schedule, and therefore the SC may be increased. Thus, the IPC of a clustered VLIW architecture will in general be lower than that of an equivalent unified VLIW architecture with the same resources. On the other hand, a clustered architecture may reduce critical delays such as the register file access time and the bypass latency [19], and allow for faster clock rates.

For this paper, which focuses on modulo scheduling for multiVLIWprocessors, the number of cycles needed to execute a particular modulo scheduled loop can be modeled through the following expression [21]:

NCYCLE_Total = NCYCLE_Compute + NCYCLE_Stall

Where NCYCLE_Compute represents a fixed number of cycles that depends on the particular static schedule produced by the compiler. During these cycles the processor is doing useful (or at least scheduled) work. NCYCLE_Stall represents the number of cycles where the processor is stalled, and depends on several factors as we detail below. The value of NCYCLE_Compute can be computed before executing the loop if the number of times the loop is executed (NTIMES) and the number of iterations of each execution (NITER) are known, as shown by the next expression:

NCYCLE_Compute = NTIMES * ((NITER + SC - 1) * II)

The value of NCYCLE_Stall cannot be computed statically. It represents the number of stall cycles due to incomplete information managed by the compiler. For instance, some memory instruction latencies may be unknown since the compiler does not know whether they will hit in the first-level cache. If the value loaded by a memory instruction feeds another operation (i.e., the latter depends on the former) but the latter was scheduled using an underestimation of the memory latency, it will stall until the memory access is finished. In the assumed microarchitecture, the final latency of a memory instruction depends on three factors:

• Latency of memory accesses, which depends on the memory level that satisfies the access: local cache, remote cache or main memory.

• Number of entries in the MSHR of the lockup-free caches. If there is no available entry for a new miss request, the instruction stalls until there is a free entry.

• Cycles waiting for a free bus and bus latency.

Thus, considering all of these factors, the total latency of a memory access can be represented by this formula:

LAT_MemAccess = LAT_Cache + MISS_LC * (NC_WaitingEntry + NC_WaitingBus + LAT_MemoryBus + max(LAT_Cache, MISS_RC * LAT_MainMemory))

Where both MISS_LC and MISS_RC represent binary values that are 1 if the access misses in the local cache and in the remote caches respectively, or 0 otherwise. NC_WaitingEntry represents the number of cycles that a miss access is waiting for an available entry in the MSHR. NC_WaitingBus is the number of cycles that the access is waiting for a free bus. Note that a bus can also be busy for coherence operations, and this is taken into account by our simulator. Finally, although we have considered LAT_MainMemory as a fixed parameter, note that in the above expression this number could be smaller for some references if an earlier miss has already started loading the relevant cache line. This fact has also been accounted for in our simulator.
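The formula above transcribes directly into code. The sketch below is a literal translation: the miss flags and waiting times are passed as inputs here, whereas in the real machine they are dynamic events observed by the simulator. With the parameters of Section 3 (2-cycle cache, 2-cycle bus, 10-cycle memory), a local miss that also misses remotely and never waits costs 2 + (0 + 0 + 2 + 10) = 14 cycles.

```c
/* Direct transcription of the LAT_MemAccess formula above. */
int lat_mem_access(int lat_cache, int lat_memory_bus, int lat_main_memory,
                   int miss_lc, int miss_rc,          /* binary: 0 or 1 */
                   int nc_waiting_entry, int nc_waiting_bus) {
    int remote = miss_rc * lat_main_memory;
    int tail   = remote > lat_cache ? remote : lat_cache; /* max(...) */
    return lat_cache +
           miss_lc * (nc_waiting_entry + nc_waiting_bus +
                      lat_memory_bus + tail);
}
```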

3. Motivating Example for the Proposed Scheduler

The objective of this study is twofold: first, to demonstrate that when the data cache is partitioned among the different clusters, the selection of the cluster where each memory instruction is scheduled is very important and can dramatically affect the final performance of a program (the same holds for register values, but this has already been shown by previous papers); second, to propose a modulo scheduler that takes into account both register and memory inter-cluster communications.

In this section, we illustrate through an example how the cluster selection can affect the total number of cycles in which a code section is executed. Consider that we want to perform modulo scheduling of a loop whose code and dependence graph are shown in Figure 3. Assume the processor consists of 2 clusters, each one with its local register file and data cache (direct-mapped), and 2 functional units: one for arithmetic operations (with 2-cycle latency) and one for memory operations. There is one inter-cluster register bus with a 2-cycle latency. The latencies for memory accesses are: 2 cycles for a local cache, 2 cycles for a bus transaction and 10 cycles for an access to main memory.

[Figure 3. Motivating example: the loop DO I = 1, N, 2 / A(I) = B(I)*C(I) + B(I+1)*C(I+1) / ENDDO, its dependence graph (four loads LD1-LD4 feeding two multiplies, an add and a store), and two alternative 2-cluster schedules: (a) with II = 3 and (b) with II = 4.]

For this loop, the minimum initiation interval (mII) for an equivalent unified architecture with the same resources is 3 cycles. The partition and scheduling that minimizes the number of register communications between clusters and, thus, achieves the same II as the equivalent unified architecture is shown in Figure 3(a). In this figure, the left part represents the partition of the operations between the clusters whereas the right part shows the modulo reservation table obtained after modulo scheduling. Each operation is scheduled in a particular slot and the number in brackets represents the stage at which this operation is scheduled. The usage of the register bus is also shown in this table. Whenever a bus transaction takes place, the corresponding bus time slot is reserved and it is indicated by a C in the reservation table.

Then, the NCYCLE_Compute of the resulting loop can be computed as:

NCYCLE_Compute(a) = NTIMES * ((N + 4 - 1) * 3) = NTIMES * (N + 3) * 3

However, suppose that both arrays B and C are located in memory at a distance that is a multiple of the local cache memory size. This means that we will have ping-pong interferences between LD1 and LD2, and between LD3 and LD4. Thus, the spatial locality exhibited by the four instructions cannot be exploited and the four accesses always miss. The result is that the instruction(s) that consume the memory values suffer many stalls. In the example, the VLIW instruction that contains the multiplications cannot continue its execution until the misses are satisfied. Assuming that we have sufficient memory buses, the number of cycles that the instruction stalls is the latency of a bus transaction plus an access to main memory, since the latency of the local cache was taken into account by the scheduler. Then, the number of stall cycles is:

NCYCLE_Stall(a) = NTIMES * N * (2 + 10) = NTIMES * N * 12

An alternative scheduling is shown in Figure 3(b). Based on the locality properties previously observed, in this second alternative the cluster assignment is selected in order to take advantage of the locality exhibited by memory instructions. For this reason, LD1 and LD3 are scheduled in the same cluster in order to profit from their group reuse, and the same applies for LD2 and LD4, which are scheduled in the other cluster. In this way, ping-pong interferences are removed and we can take advantage of the spatial reuse. However, as we can see in the example, in this case two communications of register values are needed per iteration, and then the II has to be increased from 3 to 4. Thus, NCYCLE_Compute is computed as:

NCYCLE_Compute(b) = NTIMES * ((N + 3 - 1) * 4) = NTIMES * (N + 2) * 4

However, the miss rate of LD3 and LD4 is 25% (assuming eight data elements per cache block), and LD1 and LD2 always hit (excepting the first iteration). Thus, the number of stall cycles is:

NCYCLE_Stall(b) = NTIMES * N * (2 * (2 + 10) * 0.25) = NTIMES * N * 6

Then, putting it all together, the total number of cycles for the two strategies is:

NCYCLE_Total(a) = NTIMES * (15 * N + 9)

NCYCLE_Total(b) = NTIMES * (10 * N + 8)
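The arithmetic above can be checked mechanically. The following snippet evaluates both schedules with the expressions derived in this section (NTIMES is factored out); the iteration count n is an arbitrary value chosen for illustration.

```c
/* Numeric check of the two schedules in the motivating example. */
#include <stdio.h>

int main(void) {
    int n = 1000;                 /* iterations per loop execution */

    /* Schedule (a): II = 3, SC = 4, every load misses (2+10 stall). */
    int compute_a = (n + 4 - 1) * 3;
    int stall_a   = n * (2 + 10);

    /* Schedule (b): II = 4, SC = 3, LD3/LD4 miss 25% of the time. */
    int compute_b = (n + 3 - 1) * 4;
    int stall_b   = (int)(n * (2 * (2 + 10) * 0.25));

    printf("(a) total = %d cycles\n", compute_a + stall_a); /* 15*n + 9 */
    printf("(b) total = %d cycles\n", compute_b + stall_b); /* 10*n + 8 */
    return 0;
}
```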

Therefore, we can conclude that the second strategy, which takes into account both register and memory communications, achieves a schedule that is about 1.5 times faster than the original one, which is optimized only for register communications.

4. Register and Memory Communication-Aware Modulo Scheduler

In this section we present a modulo scheduler that tries to minimize both register and memory inter-cluster communications and at the same time balance the workload. We first review a previously proposed scheduler, which is very effective at minimizing register communications, and which we will use as a baseline for comparisons. Then, we present the data locality analysis framework that is used by the scheduler. Finally, the modulo scheduler is described.

4.1. Baseline Algorithm

We use as the baseline algorithm the one proposed in previous work [22], which was shown to be very effective at minimizing register communications and maximizing the workload balance. In that work, the target architecture was similar to the one proposed in Section 2.1, but in that case all clusters accessed a shared L1 cache. Below, we briefly review the algorithm proposed there. For more details, the interested reader is referred to the original paper [22].

The algorithm employs a unified assign-and-schedule approach, that is, cluster selection and scheduling of operations is done in a single step. The heuristic for selecting a cluster is the number of edges that exit from the dependence subgraph corresponding to all the nodes already scheduled in a particular cluster. This value represents a measure of the number of register communications. An attempt is made to schedule an operation (i.e., a node in the dependence graph) in all the clusters in which there is an available slot. The one chosen is the one in which the best profit from output edges is achieved (that is, the difference between output edges before and after including this operation in the partial schedule). All the operations are scheduled using the same algorithm and following a particular order that is crucial for performance. If an instruction cannot be scheduled (because no issue slot is available, or there are not enough registers, or the register buses are saturated), the II is increased and the whole process is re-started (except the ordering).
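A minimal sketch of this output-edge heuristic follows. The helper functions (queries on the modulo reservation table and the dependence graph) are assumptions introduced for illustration; the paper describes the heuristic only in prose.

```c
/* Sketch of the baseline cluster selection: try every cluster with a
 * free slot and keep the one with the best profit in output edges
 * (edges leaving the subgraph of nodes already assigned there). */

#define NUM_CLUSTERS 2

/* Assumed helpers, not part of the paper's text:
 *   has_free_slot()      - modulo reservation table query
 *   output_edges()       - edges exiting the current per-cluster subgraph
 *   output_edges_with()  - same, after tentatively adding the node     */
extern int has_free_slot(int cluster, int node);
extern int output_edges(int cluster);
extern int output_edges_with(int cluster, int node);

int select_cluster(int node) {
    int best = -1, best_profit = -2147483647;
    for (int c = 0; c < NUM_CLUSTERS; c++) {
        if (!has_free_slot(c, node))
            continue;
        /* profit = output edges before minus after adding the node */
        int profit = output_edges(c) - output_edges_with(c, node);
        if (profit > best_profit) {
            best_profit = profit;
            best = c;
        }
    }
    return best; /* -1: no slot anywhere, so increase II and restart */
}
```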

4.2. Overview of the Cache Miss Equations

Cache Miss Equations (CME) is an analytical framework to model the cache behavior that is very accurate for codes that make use of scalar variables and affine¹ array references, which is very common in numeric applications. The framework was proposed by Ghosh, Martonosi and Malik [9]. CME describe the precise relationship among the iteration space, array sizes, base addresses and cache parameters for a loop nest.

A direct solution of the CME is an NP problem, which makes it infeasible for many practical cases. The problem can basically be stated as counting integer points inside an exponential number of polyhedra. However, Bermudo et al. [3] proposed some techniques to speed up the counting process by exploiting some intrinsic properties of the particular type of polyhedra generated by the CME. Further, Vera et al. [25] proposed a sampling scheme in order to estimate the solution by means of confidence intervals.

1. An array reference is affine if the expressions that indicate the referenced element in each dimension are linear functions of the loop induction variables.
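To make the footnote concrete, the loop below contrasts affine and non-affine references; it is an illustrative example, not code from the paper, and assumes indices stay in bounds.

```c
/* Examples of the affine references the CME framework handles:
 * index expressions must be linear in the loop induction variables. */
void affine_examples(double A[100][100], double x[100], int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            A[i][j]        = 0.0;  /* affine */
            A[2*i + 3][j]  = 1.0;  /* affine: linear in i and j */
            x[(i*i) % 100] = 2.0;  /* NOT affine: quadratic in i */
        }
    }
}
```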


[Figure 4. RMCA modulo scheduling step by step: nodes are sorted; for each node the possible clusters are selected; memory nodes are assigned by the best gain in cache misses and non-memory nodes by the best gain in output edges; remaining ties are broken by minimizing register requirements; the node is then scheduled, or the II is increased if no cluster is feasible.]

These two techniques together drastically reduce the computing time to just about a few seconds per loop for most programs, and then the time required to compute and solve the equations is comparable to the time required by other typical optimizations of the compiler. In this paper, we use this implementation of the CME to estimate the amount of reuse that is exploited by any subset of memory instructions. CME will allow the scheduler to estimate the amount of memory communications among clusters or between clusters and main memory. The scheduler uses this information to guide its scheduling decisions. For instance, given a memory instruction, it is beneficial to schedule it in a cluster where there already are other instructions from which it reuses data (group reuse). On the other hand, it is detrimental to schedule the instruction in a cluster where there already are other instructions that cause many cache conflicts with the current one. CME allow the scheduler to quantify the amount of reuse and conflicts among any group of instructions of the same loop nest. CME are used to produce the following statistics:

• The number of misses incurred by a set of memory references for a particular cache configuration (capacity, block size and associativity).

• The miss ratio of a particular memory instruction in this set.

4.3. Scheduler for a Distributed Cache

The proposed algorithm is called RMCA (which stands for Register and Memory Communication-Aware) modulo scheduling. It is an evolution of the algorithm reviewed in Section 4.1 and its main steps are depicted in Figure 4 (new features are shown in gray boxes). All nodes in the data dependence graph are first sorted according to the criteria used by the original paper [22]. This ordering minimizes the number of nodes that have both predecessors and successors in the set of nodes that precede them in the order. Then, cluster selection and scheduling is performed in a single step following that order. However, there is now a distinction between two types of nodes: (a) memory operations and (b) non-memory operations. For operations of the latter group, the algorithm does not change. However, when a memory operation is scheduled, a different strategy is used. Instead of choosing the cluster where the gain from output register edges is maximized, the cluster selection depends on the profit from cache misses. In other words, each time a memory operation is scheduled, all clusters are tried and, for each one, the number of cache misses contributed by memory operations scheduled in that cluster, before and after introducing the current operation, is computed through the CME. Then, the cluster where this gain is maximized is chosen. If more than one cluster is optimal with respect to cache misses, the scheduler selects one of them using the same strategy as for non-memory operations. Although the solver of the CME has to be repeatedly invoked, the method is very fast due to the optimizations mentioned in Section 4.2, and the time required by the scheduler is a small percentage of the total compilation time.
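The memory-node branch of RMCA can be sketched as follows. The CME query functions are a hypothetical interface to the solver of [9]; their names, like the other helpers, are assumptions for illustration.

```c
/* Sketch of RMCA cluster selection for memory operations: compare the
 * CME-estimated misses of each cluster's memory operations before and
 * after tentatively adding the new operation, and keep the best gain. */

#define NUM_CLUSTERS 2

extern int has_free_slot(int cluster, int node);
extern int cme_misses(int cluster);                /* misses of current set */
extern int cme_misses_with(int cluster, int node); /* after adding node     */
extern int select_by_output_edges(int node);       /* baseline heuristic    */

int select_cluster_rmca(int node, int is_memory_op) {
    if (!is_memory_op)
        return select_by_output_edges(node); /* unchanged for non-memory */

    int best = -1, best_gain = -2147483647;
    for (int c = 0; c < NUM_CLUSTERS; c++) {
        if (!has_free_slot(c, node))
            continue;
        /* gain in cache misses from placing the node in cluster c */
        int gain = cme_misses(c) - cme_misses_with(c, node);
        if (gain > best_gain) {
            best_gain = gain;
            best = c;
        }
        /* a tie on gain would fall back to the output-edge heuristic */
    }
    return best;
}
```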

This algorithm tries to minimize the number of cache misses, and thus it attempts to minimize the inter-cluster memory communications. However, the latency of these communications can be hidden by scheduling some load instructions using the cache-miss latency (binding prefetching, as proposed in [21]). When a load is scheduled using the cache-miss latency, the operation that consumes the data read by the load will not be stalled, because it is scheduled assuming the worst-case latency. However, scheduling instructions using a larger latency can have a negative effect on both register pressure and the length of the schedule. On one hand, the lifetime of the load destination register is increased. On the other hand, the II can be increased if this instruction belongs to a recurrence and the increased latency makes the recurrence the most restrictive constraint on the II. Besides, the length of the schedule of a single iteration may increase, which may cause an increase in the SC, which in turn affects the durations of the prologue and epilogue. Therefore, as shown in [21], it may be much more effective to schedule with a miss latency only those loads that are likely to miss. This can be done as long as the latency does not increase the II with respect to the schedule produced when loads are scheduled with a hit latency.

Thus, the proposed scheme includes another step: once the target cluster of an instruction is determined, it is scheduled using the cache-miss latency if the miss ratio of this instruction in this particular cluster (considering the partial schedule produced so far) is greater than a certain threshold, and provided that this latency does not increase the II if the operation is in a recurrence. The assumed miss latency is the time to access main memory, that is, LAT_Cache + LAT_MemoryBus + LAT_MainMemory (note that we do not consider the memory bus contention since it is not known at this moment, although it could be estimated).
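This extra step reduces to a single latency decision per load. The sketch below captures it under the same hedges as before: the helper names are illustrative assumptions, and the miss-ratio estimate is the CME statistic described in Section 4.2.

```c
/* Sketch of the latency-selection step: use the miss latency for loads
 * whose estimated miss ratio exceeds the threshold, unless that would
 * lengthen a recurrence and force a larger II. */

extern double cme_miss_ratio(int cluster, int load); /* CME statistic */
extern int    on_recurrence(int load);
extern int    ii_would_increase(int load, int latency);

int schedule_latency(int cluster, int load, double threshold,
                     int lat_cache, int lat_memory_bus, int lat_main_memory) {
    /* assumed miss latency: time to reach main memory; bus contention
     * is unknown at schedule time and ignored here, as in the text */
    int miss_latency = lat_cache + lat_memory_bus + lat_main_memory;

    if (cme_miss_ratio(cluster, load) > threshold &&
        !(on_recurrence(load) && ii_would_increase(load, miss_latency)))
        return miss_latency;   /* binding prefetching [21] */

    return lat_cache;          /* optimistic hit latency */
}
```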

Note that with this scheme some memory instructions are scheduled with the miss latency even if their miss ratio is lower than 100%. This may happen, for instance, for instructions with spatial locality. In this case, loop unrolling could be used to generate multiple instances of the same instruction such that one of them always misses and the others always hit [16]. However, we have not considered this optimization in this paper.

5. Performance Results

This section analyzes the performance of the proposed scheduler. The main performance metric that we use is the number of cycles executing instructions of modulo scheduled loops. Note that this metric does not include the effect of clustering on the cycle time; thus, differences observed for different schedulers on the same architecture directly translate into differences in execution time. However, the numbers of cycles for different architectures should be weighted by the cycle time to measure differences in execution time. Since we are concerned with differences among alternative schedulers, we prefer not to include the effect of cycle time in our metric, to isolate the effect of the schedulers. A study of the impact of clustering on cycle time can be found elsewhere [19], as well as on energy consumption [26], which is another important factor that can be reduced through clustering.

5.1. Configurations and Benchmarks

The scheduling algorithm has been evaluated for three different configurations of the multiVLIWprocessor architecture. These configurations are shown in Table 1. The first configuration is called Unified and it is composed of a single cluster with four functional units of each type (integer, floating point and memory) and a unique register file of 64 general-purpose registers.

This configuration represents our baseline. Both the 2-cluster and 4-cluster configurations have the register file partitioned (into two and four partitions respectively). The former has 2 functional units of each type and 32 registers per cluster, and the latter includes 1 functional unit of each type and a register file of 16 registers per cluster. The three configurations are 12-way issue.

For all configurations, the total L1 cache size is 8KB, divided into equal sizes among the different clusters. This cache capacity is realistic for embedded/DSP processors. For instance, the TI TMS320C6711 has an L1 data cache of 4 Kbytes [24]. In our architecture, each local cache is direct-mapped, non-blocking with 10 entries in the MSHR. An access to a local cache is satisfied in 2 cycles, whereas an access to main memory takes 10 cycles. For the clustered configurations we will present results for different number and latency of both register and memory buses.

The modulo scheduling algorithm has been implemented in the ICTINEO compiler [2] and some SPECfp95 benchmarks have been evaluated: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d and apsi. Note that modulo scheduling is an effective technique for both numeric and multimedia applications, but it is not so effective for applications such as SPECint95 due to the small number of iterations for each loop execution and the abundance of conditionals.

The performance figures shown in this section refer to the modulo scheduling of innermost loops with a number of iterations greater than four. Our measurements show that code inside such innermost loops represents about 90% of all the executed instructions, so that the statistics for innermost loops are quite representative of the whole program. Only instructions that belong to modulo scheduled loops are taken into account by the simulator. Thus, the programs were run until the first 100 million memory instructions in these loops, using the ref input data set.

Resources         Unified   2-cluster   4-cluster
INT / cluster        4          2           1
FP / cluster         4          2           1
MEM / cluster        4          2           1
REGS / cluster      64         32          16

Latencies         INT    FP
MEM                2      2
ARITH              1      3
MUL/ABS            2      6
DIV/SQR/TRG        6     18

Table 1. MultiVLIWProcessor configurations and operation latencies


[Figure 5. Results obtained for an unbounded number of buses (averaged for all benchmarks): normalized number of cycles for (a) 2-cluster and (b) 4-cluster configurations; bars for the Unified configuration and for the Baseline and RMCA schedulers, grouped by latency of register buses (LRB = 1, 2, 4) and latency of memory buses (LMB = 1, 2, 4), with one bar per threshold value (1.00, 0.75, 0.25, 0.00).]

5.2. An Unbounded Number of Buses

Before considering realistic configurations, we have evaluated an architecture with an unbounded number of buses to test the performance of the proposed algorithm under extreme situations where bus bandwidth is not a problem. The remaining parameters of the architecture are those listed in Section 5.1 and the latency of the buses is parametrized. Figure 5 shows the normalized number of cycles averaged for all benchmarks, for 2 and 4 clusters and the different latencies considered. The first set of four bars represents the results for the unified configuration. The rest represent the results for the clustered configurations for different latencies of register buses (LRB - Latency of Register Buses) and memory buses (LMB - Latency of Memory Buses). For the different sets, we have evaluated two different schedulers:

• The baseline scheduler outlined in Section 4.1, which is very effective at minimizing register communications.

• The proposed algorithm, which takes into account both register and memory communications, and which is labeled as RMCA.

Each set of four bars represents the results obtained for different values of the cache miss threshold (from 1.00 to 0.00) that determines whether a load is attempted to be scheduled with a miss latency. Note that threshold 1.00 represents the traditional scheme, that is, always using the cache-hit latency for memory operations. On the other hand, threshold 0.00 is most similar to the one proposed in [21], where all operations that do not cause an increment of the II (due to recurrences) are scheduled using the cache-miss latency. The only difference is the locality analysis employed, which is more powerful in this paper. Each bar is split into two parts: the compute time (or NCYCLE_Compute) is the black/grey part, whereas the stall time (or NCYCLE_Stall) is the white one.

From these graphs we can see that for all configurations (number of clusters, latencies and thresholds) the scheme that takes into account memory communications (RMCA) outperforms the one that ignores this feature (Baseline). As expected, for smaller values of the threshold the compute time increases (since it may increase both the II, due to register requirements, and the SC, due to an increase in the length of the schedule) but the stall time decreases. Note that with a threshold of 0.00 the stall time is almost zero for all configurations and the number of cycles for the multiVLIWprocessor is comparable to that of the unified configuration. We can also observe that for small thresholds (0.25 or 0.00) both the Baseline and RMCA strategies achieve similar performance, since the latency of cache misses is hidden by scheduling loads with the cache-miss latency. Nevertheless, note that for an unbounded number of buses the time waiting for a free bus (NC_WaitingBus) is zero, and hence, if the latency is hidden, the number of misses has no effect. However, as we will see in the next section, when the number of memory buses is limited, the difference between both schemes will be notable, since the schedules produced by the RMCA scheme require much less communication.

5.3. Evaluation of Realistic Configurations

We have shown the potential benefits that can be achieved when memory communications are taken into account by the scheduler. In this section we study the results when a realistic inter-cluster communication network is considered.

We have evaluated configurations with a fixed number and latency of register buses (2 buses with 1-cycle latency) and with a different number and latency of memory buses. In Figure 6 we can see the results for both 2 and 4 clusters. Each set of four bars has the same meaning as in the previous section. The first set represents the results for the unified configuration. The rest are the averaged results for the different strategies (Baseline and RMCA) for 1 and 2 buses (NMB - Number of Memory Buses) and 1 and 4 cycles of latency (LMB - Latency of Memory Buses).

[Figure 6. Results obtained when the number of buses is limited (averaged for all benchmarks): normalized number of cycles for (a) 2-cluster and (b) 4-cluster configurations, for NMB = 1, 2 memory buses and LMB = 1, 4 cycles of latency, with one bar per threshold value (1.00, 0.75, 0.25, 0.00).]

We can observe in these graphs that, as in the unbounded study, the RMCA strategy outperforms the Baseline for all configurations. However, now, for small values of the threshold, the difference between both strategies is more remarkable, mainly for 4 clusters. For the most effective threshold (0.00), the RMCA scheme outperforms the baseline scheduler by about 5% for 2 clusters and 20% for 4 clusters. We have observed that the reason for this difference is the time spent waiting for an available bus in order to initiate a communication. When the number of memory buses is unbounded, this value is zero, because there is always an available bus. However, when the number of buses is limited, reducing the number of misses is also important, since the lesser the number of local cache misses, the lesser the number of accesses competing for a free bus time slot.

6. Conclusions

In this work we have proposed a novel microarchitecture, called multiVLIWprocessor, which has a fully-distributed clustered VLIW organization. The main novelty of this architecture with respect to previous proposals for clustered VLIW processors is the distributed data cache, which introduces new challenges to the instruction scheduler.

In this paper we have also presented a modulo scheduler designed for this particular architecture. This scheduler, by means of a powerful locality analysis based on the Cache Miss Equations and an analysis of the register data dependence graph, generates code with very low inter-cluster communication requirements. We have also shown that the proposed scheduler outperforms previous schemes that just focused on register communications.

Acknowledgements

This work has been supported by the Spanish Ministry of Education under contract CICYT-TIC 511/98 and the ESPRIT Project MHAOTEU (EP24942).

References

[1] V. Agarwal, M.S. Hrishikesh, S.W. Keckler and D. Burger, "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures", in Procs. of the 27th Int. Symp. on Computer Architecture, pp. 248-259, June 2000

[2] E. Ayguadé, C. Barrado, A. González, J. Labarta, D. López, S. Moreno, D. Padua, F. Reig, Q. Riera and M. Valero, "Ictineo: a Tool for Research on ILP", in Supercomputing'96 (SC'96), Research Exhibit "Polaris at Work", 1996

[3] N. Bermudo, X. Vera, A. González and J. Llosa, "An Efficient Solver for Cache Miss Equations", in Procs. of the Int. Symp. on Performance Analysis of Systems and Software, April 2000

[4] A. Capitanio, N. Dutt and A. Nicolau, "Partitioned Register Files for VLIWs: A Preliminary Analysis of Tradeoffs", in Procs. of the 25th Int. Symp. on Microarchitecture, pp. 292-300, 1992

[5] D. Culler and J.P. Singh, "Parallel Computer Architecture: A Hardware/Software Approach", Morgan Kaufmann Publishers, Inc., 1999

[6] J.R. Ellis, "Bulldog: A Compiler for VLIW Architectures", MIT Press, pp. 180-184, 1986

[7] M.M. Fernandes, J. Llosa and N. Topham, "Distributed Modulo Scheduling", in Procs. of the Int. Symp. on High-Performance Computer Architecture, pp. 130-134, Jan. 1999

[8] J. Fridman and Z. Greenfield, "The TigerSharc DSP Architecture", IEEE Micro, pp. 66-76, Jan.-Feb. 2000

[9] S. Ghosh, M. Martonosi and S. Malik, "Cache Miss Equations: an Analytical Representation of Cache Misses", in Procs. of the Int. Conf. on Supercomputing (ICS'97), pp. 317-324, July 1997

[10] L. Gwennap, "Digital 21264 Sets New Standard", Microprocessor Report, 10(14), Oct. 1996

[11] S. Jang, S. Carr, P. Sweany and D. Kuras, "A Code Generation Framework for VLIW Architectures with Partitioned Register Banks", in Procs. of the 3rd Int. Conf. on Massively Parallel Computing Systems, April 1998

[12] D. Kroft, "Lockup-Free Instruction Fetch/Prefetch Cache Organization", in Procs. of the 8th Int. Symp. on Computer Architecture, pp. 81-87, 1981

[13] M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines", in Procs. of the Conf. on Programming Language Design and Implementation, pp. 318-328, June 1988

[14] D. Matzke, "Will Physical Scalability Sabotage Performance Gains?", IEEE Computer, Vol. 30, No. 9, pp. 37-39, Sept. 1997

[15] "MAP1000 Unfolds at Equator", Microprocessor Report, 12(16), Dec. 1998

[16] T.C. Mowry, M.S. Lam and A. Gupta, "Design and Evaluation of a Compiler Algorithm for Prefetching", in Procs. of the 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), pp. 62-73, Oct. 1992

[17] E. Nystrom and A.E. Eichenberger, "Effective Cluster Assignment for Modulo Scheduling", in Procs. of the 31st Int. Symp. on Microarchitecture, pp. 103-114, 1998

[18] E. Özer, S. Banerjia and T.M. Conte, "Unified Assign and Schedule: A New Approach to Scheduling for Clustered Register File Microarchitectures", in Procs. of the 31st Int. Symp. on Microarchitecture, pp. 308-315, Nov. 1998

[19] S. Palacharla, N.P. Jouppi and J.E. Smith, "Complexity-Effective Superscalar Processors", in Procs. of the 24th Int. Symp. on Computer Architecture, pp. 1-13, June 1997

[20] B.R. Rau and C.D. Glaeser, "Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing", in Procs. of the 14th Ann. Workshop on Microprogramming, pp. 183-198, Oct. 1981

[21] J. Sánchez and A. González, "Cache Sensitive Modulo Scheduling", in Procs. of the 30th Int. Symp. on Microarchitecture, pp. 338-348, Dec. 1997

[22] J. Sánchez and A. González, "The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures", in Procs. of the 29th Int. Conf. on Parallel Processing, pp. 555-562, Aug. 2000

[23] Semiconductor Industry Association, "The National Technology Roadmap for Semiconductors: Technology Needs", 1997

[24] Texas Instruments Inc., "TMS320C62x/67x CPU and Instruction Set Reference Guide", 1998

[25] X. Vera, J. Llosa, A. González and C. Ciuraneta, "A Fast Implementation of Cache Miss Equations", in Procs. of the 8th Int. Workshop on Compilers for Parallel Computers, pp. 319-326, Jan. 2000

[26] V.V. Zyuban, "Low-Power High-Performance Superscalar Architectures", PhD Thesis, Dept. of Computer Science and Engineering, University of Notre Dame, Jan. 2000