
Energy-Efficient Multiprocessor Systems-on-Chip for Embedded Computing:

Exploring Programming Models and Their Architectural Support

Francesco Poletti, Antonio Poggiali, Davide Bertozzi, Luca Benini,

Pol Marchal, Mirko Loghi, and Massimo Poncino

Abstract—In today’s multiprocessor SoCs (MPSoCs), parallel programming models are needed to fully exploit hardware capabilities

and to achieve the 100 Gops/W energy efficiency target required for Ambient Intelligence Applications. However, mapping abstract

programming models onto tightly power-constrained hardware architectures imposes overheads which might seriously compromise

performance and energy efficiency. The objective of this work is to perform a comparative analysis of message passing versus shared

memory as programming models for single-chip multiprocessor platforms. Our analysis is carried out from a hardware-software

viewpoint: We carefully tune hardware architectures and software libraries for each programming model. We analyze representative

application kernels from the multimedia domain, and identify application-level parameters that heavily influence performance and

energy efficiency. Then, we formulate guidelines for the selection of the most appropriate programming model and its architectural

support.

Index Terms—MPSoCs, embedded multimedia, programming models, task-level parallelism, energy efficiency, low power.


1 INTRODUCTION

The traditional dichotomy between shared memory and message passing as programming models for multiprocessor systems has consolidated into a well-accepted partitioning. For small-to-medium-scale multiprocessor systems, there is an undisputed consensus on cache-coherent architectures based on shared memory. In contrast, large-scale high-performance multiprocessor systems have converged toward nonuniform memory access (NUMA) architectures based on message passing (MP) [3], [4].

The appearance of Multi-Processor Systems-on-Chip (MPSoCs) in the multiprocessing scenario, however, has somewhat called this picture into question. In fact, several peculiarities differentiate these architectures from classical multiprocessing platforms. First, their "on-chip" nature reduces the cost of interprocessor communication. The cost of sending a message on an on-chip bus is, in fact, at least one order of magnitude lower (power- and performance-wise) than that of an off-chip bus, thus pushing toward message-passing-based programming models. On the other hand, the cost of on-chip memory accesses is also smaller than that of off-chip memories; this makes cache-coherent architectures based on shared memory competitive.

Second, MPSoCs are resource-constrained systems. This implies that, while performance is still critical, other cost metrics, such as power consumption, must be considered. Unfortunately, it is not usually possible to optimize power and performance concurrently, and one quantity must typically be traded off against the other.

Third, unlike traditional message passing systems, some MPSoC architectures are highly heterogeneous. For instance, some platforms are a mix of standard processor cores and application-specific processors such as DSPs or microcontrollers [19], [8]. Conversely, other platforms are highly modular and reminiscent of traditional multiprocessor architectures [22], [24]. While, in the former case, message passing is the only viable alternative (some of the processing engines may even be cacheless), in the latter case, a cache-coherence model seems to be the most intuitive choice.

All of these issues indicate that the choice between the two programming models is not so well-defined for MPSoCs. The objective of this work is precisely that of exploring what factors may affect this choice, albeit from a novel and more exhaustive perspective. Although our analysis considers the two traditional dimensions of the problem, namely, the architecture and the software, they are both considered from the software perspective. In particular, we assume that the


. F. Poletti and L. Benini are with DEIS, University of Bologna, Viale Risorgimento 2/2, 40100 Bologna (BO), Italy. E-mail: {fpoletti, lbenini}@deis.unibo.it.

. A. Poggiali is with STMicroelectronics, Centro Direzionale Colleoni, via Cardano 2-palazzo Dialettica, 20041 Agrate Brianza (MI), Italy. E-mail: [email protected].

. D. Bertozzi is with the Engineering Department, University of Ferrara, Via Saragat 1, 44100 Ferrara (FE), Italy. E-mail: [email protected].

. P. Marchal is with ESAT KULeuven-IMEC vzw, Kapeldreef 75, 3001 Heverlee, Belgium. E-mail: [email protected].

. M. Loghi and M. Poncino are with the Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy. E-mail: {mirko.loghi, massimo.poncino}@polito.it.

Manuscript received 25 Aug. 2005; revised 26 May 2006; accepted 10 Sept. 2006; published online 6 Mar. 2007. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TC-0283-0805. Digital Object Identifier no. 10.1109/TC.2007.1040.

0018-9340/07/$25.00 © 2007 IEEE Published by the IEEE Computer Society

variable "architecture" is determined by the programming model. The actual dimension then becomes the programming model (shared memory versus message passing), under the assumption that each model corresponds to an underlying architecture that is optimized for it.

This assumption, which is at the core of this work, stems from considering the inefficiency incurred when mapping high-level programming models (such as message passing) onto generic architectures in terms of software and communication overhead. This conflicts with the trend of designing optimized, custom-tailored architectures showing very high power and communication efficiency in a restricted target application domain (application-specific MPSoCs).

On the software side, conversely, we consider more traditional parameters, the most important being the workload allocation strategy. However, we also consider more application-specific parameters that affect the communication (e.g., the size of the messages or the communication/computation ratio).

Unlike previous works, we do not simply rewrite benchmarks under different programming models for a given architecture. In our case, using a different model implies using a different architecture, and the software is modified accordingly so as to exploit the optimized communication features provided by the hardware. It is worth emphasizing that we do not want to demonstrate the superiority of one paradigm over the other. Rather, we show that, for a given target application, there may not be a programming model which is consistently better than the other. Our focus is on media and signal processing applications commonly found in MPSoC platforms.

Our exploration leverages an accurate multiprocessor simulation environment that provides cycle-accurate simulation and estimation of power consumption, based on 0.13 μm technology-homogeneous industrial power models [40].

In summary, the main contributions of our work are:

1. the creation of a flexible and accurate MPSoC performance and power analysis environment,

2. the development of highly optimized hardware assists and software libraries for supporting message passing and shared memory programming abstractions on an MPSoC platform,

3. comparative energy and performance analysis of message passing and shared memory hardware- and software-tuned MPSoC architectures for coarse-grain parallel workloads typical of the multimedia application domain, and

4. derivation of general guidelines for matching a task-level parallel application with a target hardware-software platform.

2 RELATED WORK

Parallel programming and parallel architectures have been extensively studied in the past 40 years in the domain of high-performance general-purpose computing [3]. Our review of related works focuses primarily on multiprocessor SoC architectures for embedded applications [23], [19], [20], [21], [17].

From the software viewpoint, there is little consensus on the programmer view offered in support of these highly parallel MPSoC platforms. In many cases, very little support is offered and the programmer is in charge of explicitly managing data transfers and synchronization. Clearly, this approach is extremely labor-intensive, error-prone, and leads to poorly portable software. For this reason, MPSoC platform vendors are devoting an increasing amount of effort to offering more abstract programmer views through middleware libraries and their APIs. Message passing and shared memory are the two most common approaches.

Message passing was first studied in the high-performance multiprocessor community, where many techniques have been developed for reducing message delivery latency [10], [11], [9]. Message passing has also entered the world of embedded MPSoC platforms. In this context, it is usually implemented on top of a shared memory architecture (e.g., TI OMAP [21], Philips Eclipse [8], Toshiba Kawasaki [7], and Philips Nexperia [19]). Hence, shared memory is likely to become a performance/energy bottleneck, even when DMAs are used to increase the transfer efficiency.

Therefore, several authors have recently proposed support for message passing on a distributed memory architecture. Two interesting case studies are presented in [6], [5]. The above approaches have limited support for synchronization and limited flexibility in matching the application to the communication architecture; e.g., in [5], remote memories are always accessed with a DMA-like engine even though this is not the most efficient strategy for small message sizes.

Even though message passing has received some attention, shared memory is the most common programmer abstraction in today's MPSoCs. However, the presence of a memory hierarchy with locally cached data is a major source of complexity in shared-memory approaches. Broadly speaking, approaches for solving the cache coherence problem fall into two major classes: hardware-based approaches and software-based ones. The former impose cache coherence by adding suitable hardware which guarantees coherence of cached data [46], [47], [3], whereas the latter impose coherence by limiting the caching of shared data [48]. This can be done by the programmer, the compiler, or the operating system.

In embedded MPSoC platforms, shared memory coherence is often supported only through software libraries which rely on the definition of noncacheable memory regions for shared data or on cache flushing at selected points of the execution flow. However, there are a few exceptions that rely on hardware cache coherence, especially for platforms which have a high degree of homogeneity in computational node architecture [24].

The literature on comparing message passing and shared memory as programming models in large-scale general-purpose multiprocessors is quite rich ([25], [26], [27], [28], [29], [30], [31], [32]). Early works ([25], [26], [27]) compare a shared memory program against a similar program written with a message passing library that was implemented in shared memory on the same machine. The first two works provide strong evidence of the superiority of message passing, a conclusion which the third work partially calls into question.

These works do not actually explore programming styles since they do not use the architectural variable. The performance of a message passing library simulated on a shared memory computer is likely to be quite different from that of a more complex library running on message passing hardware. Also, the programs were executed on a real machine, which limited the comparison to elapsed time.

Simulation was used in [28] to compare message traffic in the two programming models by writing applications in a parallel language that supports high-level communication primitives of the two types. Translation onto the target architecture is done through a compiler, which, however, affects the interpretation of the comparison. Chandra et al. [29] performed a more controlled analysis by carefully writing the applications for the same hardware platform. Their conclusions partially upset the superiority of message passing in favor of the shared memory paradigm. More recent works ([30], [31], [32]) focused again on specific platforms such as high-end SMPs.

From our perspective, these works have several limitations, which we address in our analysis. First and foremost, all methods but [29] refer to a specific architecture, which is thus not considered as a dimension of the exploration. Second, none of them explicitly refers to MPSoCs as an architectural target; therefore, power and energy are never considered as valuable design metrics. Third, nonrealistic software architectures are sometimes considered (e.g., [51], [52]).

3 HARDWARE ARCHITECTURES

The architecture of the hardware platform is designed to provide efficient support for the different styles of parallel programming. Therefore, our MPSoC simulation platform was extended in order to model and simulate the following architectures.

3.1 Shared Memory Architecture

This architecture consists of a variable number of processor cores (ARM7 simulation models will be deployed for our analysis framework) and of a shared memory device to which the shared addressing space is mapped.

As an extension, each processor also has a private memory connected to the bus where it can store its own local variables and data structures (see Fig. 1). In order to guarantee data coherence under concurrent multiprocessor accesses, the shared memory can be configured to be noncacheable, but, in this case, it can only be inefficiently accessed by means of single bus transfers.

This inefficiency might be overcome by creating copies of shared memory locations in private memory (i.e., using shared memory only as a communication channel). Data would then become cacheable and could be accessed via burst transfers, at the cost of moving a larger volume of data through the bus.

Alternatively, the shared memory can be declared cacheable, but, in this case, cache coherence has to be ensured. We have enhanced the platform by adding hardware coherence support based on a write-through policy, which can be configured either as Write-Through Invalidate (WTI) or Write-Through Update (WTU).

The hardware snoop devices for both the invalidate and the update case are depicted in Fig. 2. The snoop devices sample the bus signals to detect the transaction which is being performed on the bus, the involved data, and the originating core. The input pinout of the snoop device depends, of course, on the particular bus implemented in the system; Fig. 2 reports the specific example of the interface with the STBus interconnect from STMicroelectronics, although signal lines with identical content can be found in most communication architecture specifications.

Fig. 1. Shared memory architecture.

Fig. 2. Interface and operations of the Snoop Device for the (a) invalidate and (b) update policies.

When a write operation is flagged, the corresponding action is performed, i.e., invalidation for the WTI policy, rewriting of the data for the WTU one. Write operations are performed in two steps. The first one is performed by the core, which drives the proper signals on the bus, while the second one is performed by the target memory, which sends its acknowledge back to the master core to notify it of operation completion (there can be an explicit and independent response phase in the communication protocol or a ready signal assertion in a unified bus communication phase). The write ends only when the second step is completed and when the snoop device is allowed to consistently interact with the local cache. Of course, the snoop device must ignore write operations performed by its associated processor core. In our simulation model, synchronization between the core and the snoop device in a computation tile is handled by means of a local hardware semaphore for mutually exclusive access to the cache memory.
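The snoop behavior just described can be summarized in a minimal C sketch; all names (cache_line_t, snoop_on_bus_write) and the line layout are illustrative assumptions, not the simulator's actual code.

```c
/* Hypothetical sketch of the WTI/WTU snoop action, not the real model. */
typedef enum { WTI, WTU } snoop_policy_t;

typedef struct {
    unsigned tag;
    int      valid;
    unsigned data[8];   /* one cache line of 8 words, for illustration */
} cache_line_t;

/* Invoked for every write observed on the bus that was issued by
 * ANOTHER core (the snoop device ignores its own core's writes). */
void snoop_on_bus_write(cache_line_t *line, unsigned addr,
                        unsigned value, snoop_policy_t policy)
{
    if (line == NULL || !line->valid)        /* address not cached here */
        return;
    if (policy == WTI)
        line->valid = 0;                     /* Write-Through Invalidate */
    else
        line->data[(addr / 4) % 8] = value;  /* Write-Through Update */
}
```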

Hardware semaphores and slaves for interrupt generation are also connected to the bus (Fig. 1). The interrupt device allows processors to send interrupt signals to each other. This hardware primitive is needed for interprocessor communication and is mapped in the global addressing space. For an interrupt to be generated, a write must be issued to the proper address of the device. The semaphore device is also needed for synchronization among the processors; it implements test-and-set operations, the basic primitive required to build semaphores.
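A minimal sketch of how software might spin on such a test-and-set device follows; the base address and the convention that a read atomically tests and sets the location are assumptions for illustration, not the platform's documented interface.

```c
/* SEM_BASE is a hypothetical memory-mapped address of the semaphore
 * device; a read is assumed to perform an atomic test-and-set. */
#define SEM_BASE ((volatile unsigned *)0x10000000)

static inline void sem_wait(int id)
{
    /* 0 means the semaphore was free and is now taken by this core. */
    while (SEM_BASE[id] != 0)
        ;  /* spin until acquired */
}

static inline void sem_signal(int id)
{
    SEM_BASE[id] = 0;   /* writing releases the semaphore */
}
```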

Further details of the shared memory architecture can be found in Table 1.

The template followed by this shared memory architecture reflects the design approach of many semiconductor companies to the implementation of shared memory multiprocessor architectures. As an example, the MPCore processor implements the ARM11 microarchitecture and can be configured to contain one to four processor cores while supporting fully coherent data caches [18].

3.2 Message-Oriented Distributed Memory Architecture

Message passing helps in mastering the design complexity of highly parallel systems, provided the transfer cost on the underlying architecture can be limited. We therefore consider a distributed memory architecture with lightweight hardware extensions for message passing, as depicted in Fig. 3.

In the proposed architecture, a scratchpad memory, a semaphore, and a DMA unit are attached to each processor core. The different processor tiles are connected through the shared bus (STBus). In order to send a message, a producer writes into the message queue stored in its local scratchpad memory, without generating any traffic on the interconnect. Once the data is in the message queue, the corresponding consumer (running on another processor) can fetch the message to its own scratchpad, directly or via a DMA controller. For this purpose, the scratchpad memories are connected as slaves to the communication fabric and their space is made visible to any other processor on the platform. The DMA engine attached to each core enables efficient data transfers between scratchpad and nonlocal memories (cf. [43]): It supports multiple outstanding data channels and has a dedicated connection for fast access to the local scratchpad memory.

As far as synchronization is concerned, when a producer intends to generate a message, it locally checks an integer semaphore which contains the number of free messages in the queue. If enough space is available, it decrements the semaphore and stores the message in its scratchpad. Completion of the write transaction and availability of the message are signaled to the consumer by incrementing a semaphore located in the consumer's scratchpad memory. This single write operation goes through the bus. Semaphores are therefore distributed among the processing elements, resulting in two advantages: The read/write traffic to the semaphores is distributed, and the producer (consumer) can locally poll whether space (a message) is available, thereby reducing bus traffic.
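The producer-side protocol can be condensed into the following C sketch; the structure layout and names (msg_queue_t, mq_send) are illustrative assumptions, not the actual library internals.

```c
#include <string.h>

/* Illustrative model of the producer-side send described above. */
typedef struct {
    volatile int *free_slots;      /* semaphore in the LOCAL scratchpad    */
    volatile int *consumer_avail;  /* semaphore in the CONSUMER scratchpad */
    char         *queue_buf;       /* message queue in the local scratchpad */
    int           msg_size, n_msgs, head;
} msg_queue_t;

int mq_send(msg_queue_t *q, const void *msg)
{
    if (*q->free_slots == 0)       /* local poll: no bus traffic */
        return 0;                  /* queue full; caller retries  */
    (*q->free_slots)--;            /* reserve a free message slot */

    /* Store the message in the local scratchpad queue. */
    memcpy(q->queue_buf + q->head * q->msg_size, msg, q->msg_size);
    q->head = (q->head + 1) % q->n_msgs;

    /* The only transaction crossing the bus: increment the
     * availability semaphore located in the consumer's scratchpad. */
    (*q->consumer_avail)++;
    return 1;
}
```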

The details of the message passing architecture can be found in Table 1.

The architecture of the recently announced Cell processor [17], developed by Sony, IBM, and Toshiba, shares many similarities with the template we are considering in this paper. The Cell processor features eight vector processors equipped with local storage and connected through a data-ring-based system interconnect. The individual processing elements can use this bus to communicate with each other, including the transfer of data between the units, which act as peers of the network.

4 SOFTWARE SUPPORT

A software library is an essential part of any of today's multiprocessor systems. In order to support software developers in programming the two optimized hardware platforms, we have implemented two architecture-specific communication and synchronization libraries exposing high-level APIs. The ultimate objective is to hide low-level architectural details, such as memory maps, management of hardware semaphores, and intermediate data transfers, from the programmers, while keeping the overhead introduced by the programming library as low as possible from a performance and power viewpoint.

TABLE 1. Technical Details of the Architectural Components.

Fig. 3. Message-oriented distributed memory architecture.

Concerning the shared memory architecture, we opted for porting a standard communication library onto the MPSoC platform: the System V IPC library, which is the native communication library for heavyweight processes under the Unix operating system. This allows software designers to develop their applications on host PCs and to easily port their code onto the MPSoC virtual platform for validation and fine-grained software tuning on the target architecture.

The message-oriented architecture, by contrast, is specifically tuned for MPSoC implementations, and its effectiveness was proven in [44]. As a consequence, we needed a communication library able to fully exploit the features of this architecture. Moreover, we expect that porting the standard message passing libraries traditionally used in the parallel computing domain might cause excessive overhead in resource-constrained MPSoCs. For this reason, we developed our own optimized message passing library, custom-tailored for the scratchpad-based distributed memory architecture we are considering.

4.1 A Lightweight Porting of the System V IPC Library for Shared Memory Programming

4.1.1 Brief Introduction to the IPC Standard

System V IPC is a communication library for heavyweight processes based on permanent kernel-resident objects. Each object is identified by a unique kernel ID. These objects can be created, accessed, and manipulated only by the kernel itself, granting mutual exclusion between processes. Three different types of objects, named facilities, are defined: message queues, semaphores, and shared memory. Processes can communicate through System V IPC objects using ad hoc APIs that are specific to each facility.

Message Queues are objects similar to pipes and FIFOs. A message queue allows different processes to exchange data with each other in the form of messages in compliance with the FIFO semantics. Messages can have different sizes and different priorities. The send API (msgsnd) puts a message in the queue, suspending the calling process if there is not enough free space. On the other hand, the receive API (msgrcv) extracts from the queue the first message that satisfies the calling process's requests in terms of size and priority. If there is no valid message, or if there are no messages at all, the calling process is suspended until a valid message is written to the queue. A special control API (msgctl) allows processes to manage and delete the queue object.
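For reference, a minimal round trip through these standard calls looks as follows (the queue key, buffer size, and message type are arbitrary choices for illustration):

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct msg_ex { long mtype; char mtext[64]; };

void queue_round_trip(void)
{
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600); /* create a queue */
    struct msg_ex m = { .mtype = 1 };
    strcpy(m.mtext, "hello");

    msgsnd(qid, &m, sizeof m.mtext, 0);    /* suspends if queue is full */
    msgrcv(qid, &m, sizeof m.mtext, 1, 0); /* first message of type 1   */
    msgctl(qid, IPC_RMID, NULL);           /* delete the queue object   */
}
```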

Semaphore objects consist of a set of classic Dijkstra semaphores. A process calling the "operation" API (semop) can wait and signal on any semaphore of the set. Moreover, System V IPC allows processes to request more than one operation on the semaphore set at the same time; the API ensures that the operations will be executed atomically. A special control API (semctl) allows processes to initialize and delete the semaphore object.
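A classic wait/signal pair built on these calls is sketched below (the initial value of 1 is an arbitrary illustration; union semun must be defined by the caller on most systems):

```c
#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; };   /* minimal caller-side definition */

void sem_demo(void)
{
    int sid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    union semun arg = { .val = 1 };
    semctl(sid, 0, SETVAL, arg);        /* initialize semaphore 0 to 1 */

    struct sembuf wait_op   = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct sembuf signal_op = { .sem_num = 0, .sem_op = +1, .sem_flg = 0 };

    semop(sid, &wait_op, 1);            /* P(): suspends if value is 0 */
    /* ... critical section ... */
    semop(sid, &signal_op, 1);          /* V(): wakes a waiting process */

    semctl(sid, 0, IPC_RMID);           /* delete the semaphore set */
}
```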

Shared memory objects are buffers of memory which a process can link to its own memory space through the attach API (shmat). All processes which have attached a shared memory buffer see the same buffer and can share data by directly reading and writing it. As the memory spaces of the processes are different, the shared buffer could be attached by the attach API at different addresses for each process. Therefore, processes are not allowed to exchange pointers which refer to the shared buffer. In order to successfully share a pointer, its absolute address must be changed into an offset relative to the starting location of the shared buffer. A special control API (shmctl) allows processes to mark a buffer for destruction. A buffer marked for destruction is removed by the kernel when there are no more processes linked to it. A process can unlink a shared buffer from its memory space using the detach API (shmdt).
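The pointer-sharing rule above amounts to storing offsets rather than absolute addresses; a minimal sketch (segment size and the node structure are arbitrary):

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

struct node { size_t next_off; int value; };

void share_pointer_as_offset(void)
{
    int   shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    char *base  = shmat(shmid, NULL, 0);   /* may map at ANY address */

    struct node *a = (struct node *)base;
    struct node *b = (struct node *)(base + sizeof *a);
    a->next_off = (size_t)((char *)b - base);  /* offset, not a pointer */

    /* Another process, after its own shmat, recovers the pointer as
     * (struct node *)(its_own_base + a->next_off). */

    shmdt(base);                    /* detach from this address space  */
    shmctl(shmid, IPC_RMID, NULL);  /* mark the buffer for destruction */
}
```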

4.1.2 Implementation and Optimizations

Some implementation details of the MPSoC communication library compliant with the System V IPC standard follow. All objects that must be accessed in a mutually exclusive way are stored in the shared memory. Therefore, a dynamic allocator was introduced in order to efficiently implement data allocation in shared memory. All original IPC kernel structures were optimized by removing much process/permission-related information in order to reduce shared memory occupancy and, therefore, API overhead. In our library implementation targeting MPSoCs, mutual exclusion on the critical sections of an object is ensured by means of hardware mutexes that are accessible in the shared memory space. Each IPC object is protected by a different hardware mutex, allowing parallel execution on different objects.

MPSoC platforms are typically resource-constrained. Therefore, we decided not to implement some of the features of System V IPC. At the moment, priorities in the message queue facility and atomic multioperations on semaphore sets have not been implemented. These features are not critical in System V IPC, so their absence will only marginally affect code portability.

The MPSoC IPC library was tested and optimized to improve the performance of the APIs. The length of the critical sections was reduced as much as possible in order to optimize code efficiency. Similarly, the number of shared memory accesses was significantly reduced. Moreover, in case of repeated read accesses to the same memory location, we hold the read value in a local variable. Write operations were optimized by avoiding useless write accesses to shared memory (e.g., rewriting the same value).
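The last two optimizations amount to coding patterns like the following illustrative fragment (names are hypothetical):

```c
volatile int *shared_loc;       /* a location in shared memory */

void update_if_changed(int new_value)
{
    int cached = *shared_loc;   /* one shared-memory read, value held */
    if (cached != new_value)    /* skip the write if nothing changes  */
        *shared_loc = new_value;
    /* further logic reuses 'cached' instead of re-reading the bus */
}
```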

Since the benchmarks we will use in the experimental results make extensive use of the semaphore facility, we assessed the cost incurred by our library in managing this facility. We created an ad hoc benchmark where two tasks run on two different processors: The first one periodically releases a certain semaphore, while the second one waits on that semaphore. We measured the time to perform signal and wait over 40 iterations. It turned out that the overhead of using System V IPC with respect to manual management of the hardware semaphores is negligible (only 2 percent).

Dynamic memory allocation is never exploited by our benchmarks since they allocate shared memory during initialization and free it before exiting; therefore, we excluded those two phases from system performance measurements. Moreover, we do not use message queues, which would involve mapping a message passing paradigm on top of shared memory, i.e., on top of an architecture which is not optimized for messaging; this goes in the opposite direction with respect to our initial assumptions.


4.2 Message Passing Library

We also built a set of high-level APIs to support a message passing programming style on the message-oriented distributed memory architecture described above. Our library simplifies the programming stage and is flexible enough to explore the design space. The most important functions are listed in Table 2.

To instantiate a queue, both the producer and the consumer must run an initialization routine. To initialize the producer side, the corresponding task must call sq_init_producer. It takes as arguments the identifier of the consumer, the message size, the number of messages in the queue, and a binary value. The last argument specifies whether the producer should poll the producer's semaphore or suspend itself until an interrupt is generated by the semaphore. The consumer is initialized with sq_init_consumer. It requires the identifier of the consumer itself, the location of the read buffer, and the poll/suspend flag. In detail, the second parameter indicates the address where the function sq_read will store the message transferred from the producer's message queue. This address can be mapped either to the private memory or to the local scratchpad memory.

The producer sends a message with the sq_write(_dma) function. This function copies the data from *source to a free message block inside the queue buffer. This transfer can either be carried out by the core or via a DMA transfer (_dma). Instead of copying the data from *source into a message block, the producer can decide to directly generate data in a free message block. The sq_getToken_write function returns a free block in the queue's buffer on which the producer can operate. When the data is ready, the producer notifies the consumer of its availability with sq_putToken_write. The consumer transfers a message from the producer's queue to a private message buffer with sq_read(_dma). Again, the transfer can be performed either by a local DMA or by the core itself.

Our approach thus supports: 1) either processor- or DMA-initiated data transfers to remote memories, 2) either polling-based or interrupt-based synchronization, and 3) flexible allocation of the consumer's message buffer, i.e., on a scratchpad or on a private memory at a higher level of the hierarchy. A usage sketch follows.
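Putting the API together, a producer/consumer pair might be programmed as below; the prototypes are paraphrased from the argument descriptions above and are therefore assumptions, not the library's verbatim signatures.

```c
/* Prototypes paraphrased from the text (assumed, not verbatim). */
extern void sq_init_producer(int consumer_id, int msg_size,
                             int n_msgs, int poll);
extern void sq_init_consumer(int consumer_id, void *read_buf, int poll);
extern void sq_write(void *source);
extern void sq_read(void *dest);

#define CONSUMER_ID 1
#define MSG_WORDS   (8 * 8)     /* one 8 x 8 matrix per message */
#define N_MSGS      4
#define POLL        1           /* 1 = active polling, 0 = suspend */

int out_msg[MSG_WORDS];
int in_buf[MSG_WORDS];          /* read buffer: private or scratchpad */

void producer_task(void)
{
    sq_init_producer(CONSUMER_ID, sizeof out_msg, N_MSGS, POLL);
    for (;;) {
        /* ... fill out_msg ... */
        sq_write(out_msg);      /* or sq_write_dma for large messages */
    }
}

void consumer_task(void)
{
    sq_init_consumer(CONSUMER_ID, in_buf, POLL);
    for (;;) {
        sq_read(in_buf);        /* or sq_read_dma */
        /* ... process in_buf ... */
    }
}
```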

4.2.1 Low Overhead Implementation and Tuneability

The library implementation is very lightweight since it is based on C macros that do not introduce significant overhead with respect to the manual management of hardware resources. A producer-consumer exchange of data programmed via the library showed just a 1 percent overhead with respect to manual control of the transfer by the programmer without high-level abstractions.

More interestingly, the library flexibility can be used for fine-tuning the porting of an application onto the target architecture. In fact, the library can exploit several features of the underlying hardware, such as processor- versus DMA-driven data transfers or interrupt-based versus polling-based synchronization. A simple case study shows the potential benefits of this approach. Let us consider a functional pipeline of eight matrix multiplication tasks. Each stage of this pipeline takes a matrix as input, multiplies it with a local matrix, and passes the result to the next stage. We iterate the pipeline 20 times. We ran the benchmark on architectures with eight and four processors, respectively. In the first case, only one task is executed on each processor, while, in the second, we added concurrency by mapping two tasks to each core. First, we compare three different configurations of the message-oriented architecture (Table 3). We execute the pipeline for two matrix sizes: 8 × 8 and 32 × 32 elements. In the latter case, longer messages are transmitted.

Analyzing the results in Fig. 4, which refer to the case where one task runs on each processor, we can observe that a DMA is not always beneficial in terms of throughput. For small messages, the overhead of setting up the DMA transfer is not justified. In case of larger messages, the DMA-based solution outperforms processor-driven transfers. Conversely, employing a DMA always leads to an energy reduction, even if the duration of the benchmark is longer, thanks to a more power-efficient data transfer. Note that the energy of all system components (DMA included) is accounted for in the energy plot. Results have been derived through functional simulation and technology-homogeneous power models (0.13 μm technology).

TABLE 2. APIs of Our Message Passing Library.

TABLE 3. Different Message Passing Implementations.

Fig. 4. Comparison of message passing implementations in a pipelined benchmark with eight cores from Table 3.

Furthermore, the way in which a consumer is notified of the arrival of a message plays an important role, performance- and energy-wise. The consumer has to wait until the producer releases the consumer's local semaphore. With a single task per processor (Fig. 4), the overhead related to the interrupt routine can slow down the system, depending on the communication versus computation ratio, and polling is, in general, more efficient. On the contrary, with two tasks per processor (Fig. 5, referring to matrices of 8 × 8 elements), the interrupt-based approach performs better. In this case, it is more convenient to suspend the task because the concurrent task scheduled on the same processor is in the "ready" state. Instead, with active polling, the processor is stalled and the other task cannot be scheduled.

From this example, we thus conclude that, in order to optimize energy and throughput, the implementation of message passing should be matched with the application's workload characteristics. This is only feasible by deploying a flexible message passing library.

5 FIRST-LEVEL CLASSIFICATION IN THE SOFTWARE DOMAIN

Given the two complete and optimized hardware-software architectures for the shared memory and the message passing platforms, we now put them to work and try to capture which application characteristics and mapping decisions determine their relative performance and energy dissipation. The ultimate objective is to identify design guidelines.

Our next step in this direction is to provide a first-level classification in the software domain. We try to capture some relevant application features that can make the difference in discriminating between programming paradigms. We recall that we are targeting parallel applications and, in particular, the multimedia and signal processing application domain. Relevant application features are as follows:

. Workload allocation policy. It determines the way a parallel workload is assigned to the parallel computation units for the processing stage. For the class of applications we are targeting, there are two main policies:

1. Master-Slave paradigm. The volume of data processed by each computation resource is reduced by splitting it among multiple slave tasks operating in a coordinated fashion. A master task is usually in charge of preprocessing data, activating slave operation, and synchronizing the whole system. Workload splitting can be irregular or regular [33]. Horizontal, vertical, and cross-slicing are well-known examples of regular data partitioning in video decoding. From an energy viewpoint, the benefits from shortening the execution time might be counterbalanced by the higher number of operating processors, thus giving rise to a nontrivial trade-off between application speedup and overall energy dissipation [36].

2. Pipelining. Pipelining is a traditional solution for throughput-constrained systems [34]. Each pipelined application consists of a sequence of computation stages, wherein a number of identical tasks are performed, executing on disjoint sets of input data. Computation at each stage may be performed by specialized application-specific components or by homogeneous cores. Many embedded signal processing applications follow this parallelization pattern [35].

. The degree of data sharing among concurrent tasks. Slave tasks may have to process data sets that are common to other concurrent tasks, as in the case of the reference frame for motion compensation in parallel video decoding. In the limit, all processing data could be needed by all slaves. In this case, a shared memory programming paradigm relies on the availability of shared processing data in shared memory, at the cost of increased memory contention. On the contrary, employing message passing on a distributed architecture for this case would give rise to a multicast communication pattern having the master processor as the source of processing data and the slave processors as the receivers. Finding the most efficient solution from a performance and energy viewpoint is again a nontrivial issue. Cache coherence support is also critical. For instance, our shared memory architecture can largely reduce the overhead for keeping shared data coherent. If a task changes shared data, it has to update/notify all other tasks with which it shares the data. On a shared memory architecture, slaves can snoop the useful updates directly from the shared bus, thus avoiding the transmission of updates to all tasks, which would congest the network and slow down program execution.

. The granularity of processing data. Signal processing pipelines might operate on data units as small as single pixels (e.g., pixel-level video graphics pipelines) and as large as entire frames. An increased data granularity has a different impact on the volume of traffic to be moved across the bus based on the chosen application coding style. A somewhat higher communication cost should be traded off against the advantages given by other architectural mechanisms (e.g., data cacheability). Our exploration framework aims at spanning this trade-off and at identifying the low-level effects that come into play to determine it.

Fig. 5. Task scheduling impact on synchronization in a pipelined benchmark with four cores from Table 3.

. Data locality. Optimizing for data locality has been the main focus of many studies in the last three decades or so [14]. While locality optimization efforts span a very large spectrum, ranging from cache locality to memory locality to communication locality, one can identify a common goal behind them: maximizing the reuse of data in nearby locations, i.e., minimizing the number of accesses to data in far locations. Numerous abstractions and paradigms have been developed in the past to capture data reuse information and exploit it for enhancing data locality. In this work, we refer to data locality when a piece of data is still in a cache upon reuse. Many embedded image and video processing applications operate on large multidimensional arrays of signals using multilevel nested loops. An important feature of these codes is the regularity in data accesses, which can be exploited by an optimizing compiler to improve cache memory performance [12]. In contrast, many scientific applications require sparse data structures and demonstrate irregular data access patterns, thus resulting in poor data locality [13].

. Computation-to-communication ratio. This ratio provides an indication of the communication overhead with respect to the overall computation time. In general, when this ratio is skewed toward the communication side, bandwidth issues become critical in determining system performance. A good computation-to-communication ratio, together with the minimization of load imbalance, is the requirement for scalable parallel algorithms in the parallel computing domain. Hiding communication behind computation is the most straightforward way to reduce the weight of communication, but other techniques can be used, such as message compression or smart mapping strategies.

We now experimentally examine how the above application features influence the choice between message passing and shared memory coding styles. Our approach is to make highly accurate comparisons of a few representative design points in the software domain, rather than making abstract comparisons covering a wide space at the cost of limited accuracy. The accuracy of our analysis is ensured by our timing-accurate modeling and simulation environment. Varying hardware and software parameters in the considered design points will allow us to draw stable conclusions and to point out power-performance trade-offs.

Our exploration space is depicted in Fig. 6. We split the software space based on the workload allocation policy and the degree of sharing of processing data. We aim at performing an accurate comparison of programming paradigms within the identified space partitions. Our investigations within each subspace will take into account other application parameters such as data granularity, computation/communication ratio, and data locality. To analyze each software subspace, we have designed a set of representative and parameterizable parallel benchmarks. The latter consist of several kernels which can typically be found inside embedded system applications: matrix manipulations (such as addition and multiplication), encryption engines, and signal processing pipelines. Handling parameterizable application kernels instead of entire applications provides us with the flexibility to vary computation as well as communication parameters of the parallel software, thus extending the scope of our analysis and making our conclusions more stable. Such flexibility for space exploration is frequently not allowed by complete real-life applications. Each kernel has been mapped using both the shared memory and the message passing coding style. Importantly, the code has been deeply optimized for each programming paradigm for a fair and realistic comparison.

. Benchmark I—Parallel Matrix Multiplication. A matrix multiplication algorithm was partitioned according to the master-slave paradigm. It was chosen to allow the analysis of applications wherein processing data is shared among the slave processors. In fact, each slave processor uses half of the entire source matrices and produces a slice of the result matrix (Fig. 7). All slices are composed together by the master processor, which is then in charge of reactivating the slave processors for a new iteration. This program is developed so as to maximize the sharing of the read-only variables (the source matrices) and to minimize the sharing of the variables that need to be updated. The size of the matrices can be arbitrarily set. A master-driven barrier synchronization mechanism is required to allow a new parallel computation to start only once the previous one (i.e., processing at all the slave processors) has completed. Overall, we simulated five processors: one master and four slaves. A sketch of the slave computation follows.
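The slave-side computation then reduces to a loop over one slice of the result; the row-band split below is a minimal illustrative sketch (the sources are shared read-only, and the exact partitioning is an assumption, not the benchmark's actual code).

```c
#define N        32    /* matrix size (configurable in the benchmark) */
#define N_SLAVES 4

/* Each slave reads the shared source matrices A and B and writes
 * only its own horizontal slice of the result C. */
void slave_multiply(const int A[N][N], const int B[N][N],
                    int C[N][N], int slave_id)
{
    int rows = N / N_SLAVES;
    for (int i = slave_id * rows; i < (slave_id + 1) * rows; i++)
        for (int j = 0; j < N; j++) {
            int acc = 0;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;       /* no write sharing between slaves */
        }
}
```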

Fig. 6. Exploration space. Within each space partition, other software parameters have been explored, such as data locality, computation/communication ratio, and data granularity.

. Benchmark II—DES encryption. The DES (Data Encryption Standard) algorithm was chosen as an example of an application that easily matches the master-slave workload allocation policy. DES encrypts and decrypts data using a 64-bit key. It splits input data into 64-bit chunks and outputs a stream of 64-bit ciphered blocks. Since each input element is encrypted independently of all others, the algorithm can be easily parallelized. An initiator task dispatches 64-bit blocks together with a 64-bit key to n calculator tasks for encryption (Fig. 8, top). A collector task rebuilds an output stream by concatenating the ciphered blocks of text coming from the calculator tasks. Please note that computation at each slave task is completely independent since the sets of input data are completely disjoint. We modified the benchmark so as to increase the size of the exchanged data units to multiples of 64 bits, thus exploring different data granularities. Here, slave tasks just need to be independently synchronized with the producer, which alternately provides input data to all of the slaves, and with the collector task. In this benchmark, no shared data exists. Overall, we simulated six processors: the producer, the consumer, and four slaves.

. Benchmark III—Signal Processing Pipeline. This application consists of several signal processing tasks executing in a pipelined fashion. Each processor computes a two-dimensional filtering task (which, in practice, reduces to matrix multiplications) and feeds its output to the next processor in the pipeline. All pipeline stages perform computations on disjoint sets of input data, as depicted in Fig. 8, bottom. Synchronization mechanisms (interrupts and/or semaphores) were used for correct data propagation across the pipeline stages. We simulated an eight-stage signal processing chain. For the pipeline-based workload allocation policy, we did not explore the case of processing data shared among the pipeline stages because we consider it to be of minor interest for the multimedia domain.

We have optimized the code of these benchmarks for both the shared memory and the message passing paradigm, as described hereafter. When using the message passing library, we always selected the active polling configuration since we always run a single task per processor. In this context, interrupts do not result in better resource utilization, but only in scheduling overhead. Moreover, in our comparison with shared memory, we used the best message passing performance result, which was sometimes given by using DMA and other times by using processor-driven transfers.

Moreover, since the system interconnect is a shared bus, we expect the update-based cache coherence protocol to have an advantage over the invalidate-based one. In fact, when the producer writes data to shared memory and those data are in the caches of other cores, the data is directly updated without further bus transactions. This inherent broadcasting mechanism brings even more advantages when many data blocks are shared among slave processors. For these reasons, we use the update protocol, in contrast to many previous papers targeting parallel computers [42].

Finally, in order to eliminate the impact of I/O on benchmark execution (this aspect is outside the scope of our analysis), we assume that input data is stored in an on-chip memory, from where it is moved or accessed according to the programming style.

6 EXPERIMENTAL RESULTS

In this section, we examine how the application characteristics and mapping decisions influence the performance and energy ratio between shared memory and message passing. First, we explain the simulation framework in which these experiments are conducted.

6.1 Simulation Framework

Our experimental framework was based on the MPARM simulation environment [39], which performs functional, cycle-true simulation of ARM-based multiprocessor systems. This level of accuracy is particularly important for MPSoC platforms, where small architectural features might determine macroscopic performance differences. Of course, simulation accuracy has to be traded off against simulation performance (up to 200,000 cycles/sec with the MPARM platform). MPARM makes available a complete analysis toolkit, allowing monitoring of the performance and energy dissipation (based on industry-provided power models) of platform components for the execution of software routines as well as of an entire benchmark. Simulation is cycle accurate and bus-signal accurate. Our virtual platform leverages technology-homogeneous (0.13 μm) power models of all system components (processor cores, system interconnect, memory devices) provided by STMicroelectronics [40], [41]. Processor core models take into account the cache power dissipation, which accounts for a large fraction of overall power.

Fig. 7. Workload allocation policies for parallel matrix multiplication.

Fig. 8. Workload allocation policy for the DES encryption algorithm (top) and the signal processing pipeline (bottom).

6.2 Master-Slave, Shared Data

We ran the parallel matrix multiply (MM) benchmark with varying matrix sizes and D-cache sizes and for the two different hardware-software architectures. We measured the execution time for processing 20 matrices. Then, we modified the benchmark so as to perform the sum of matrices instead of multiplications (synthetic benchmark, synth-MM), thus exploring the computation versus communication ratio.

Results are reported in Fig. 9; the y-axis represents the ratio between the execution times of the benchmark in the message passing (MP) and in the shared memory (SHM) version. Fig. 9a refers to the MM benchmark, while Fig. 9b refers to synth-MM. In the diagrams, values greater than 1 thus denote a better performance (shorter execution time) of shared memory over message passing. The scratchpad was sized big enough to contain the largest processing data, since this involved realistic cuts (8 kB) while playing only a marginal role in energy dissipation. The benchmark has good data locality; therefore, we expect shared memory to be effective in this case. Furthermore, with message passing, shared data blocks have to be sent to the slave processors as explicitly replicated messages, thus originating a communication overhead. Our simulation runs only partially confirm these intuitions, as depicted in Fig. 9a. We observe that, as we increase the data size, a corresponding increase in data cache misses affects shared memory performance, thus making message passing competitive. This loss of performance can be recovered by increasing the cache size. In the plot, we show that the performance ratio goes back above 1 with cache sizes of 4 kB. The same ratio can actually be obtained with 8 kB caches, even if a fully associative cache is instantiated. This saturation point is clearly related to the matrix size.

However, with large matrices, the advantage of shared memory over message passing decreases with respect to smaller matrices: Since the computational load of the MM benchmark increases faster than its communication load (the computation has O(N^3) complexity, while the communication load is only O(N^2), where N indicates the matrix size), message passing leverages its advantage of performing the computation on a more efficient memory (the scratchpad), thus making up for the communication overhead. In general, with larger matrices, the performance of message passing and shared memory tends to converge, provided the cache and the scratchpad sizes can be arbitrarily increased to deal with larger data sets.
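To make the scaling argument explicit, the computation-to-communication ratio grows linearly with the matrix size (the constants c1 and c2 below are illustrative proportionality factors, not measured values):

```latex
\frac{T_{\mathrm{comp}}}{T_{\mathrm{comm}}}
  \approx \frac{c_1 N^3}{c_2 N^2}
  = \frac{c_1}{c_2}\, N
```

so doubling N roughly doubles the amount of computation available to amortize each unit of communication.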

At the rightmost point of Fig. 9a, the designer has to decide whether it is more convenient to increase the cache size and have shared memory outperform message passing, or to adopt the message passing paradigm. Since the energy plots for the two programming paradigms exhibit the same trend as Fig. 9 (and, therefore, we have not reported them), we can draw two conclusions. First, increasing the cache size to 4 kB with matrix size 32 makes shared memory not only more performance-efficient, but also more energy-efficient. The reason can be deduced from Table 4: In this case, the data cache energy is almost negligible with respect to the instruction cache and processor contributions. Therefore, a larger data cache reduces cache misses and, hence, application execution times in this context.

With the synth-MM benchmark (Fig. 9b), the ratio between the computational load and the communication load does not vary with the size of the data; therefore, the communication overhead of the message passing solution increases with respect to the shared memory version, where there is no need to move data. The same trend is followed by the energy curves, which are therefore not reported for lack of space.

For the shared memory version of MM and synth-MM, we report only the results of the cache-coherent platform, due to the poor performance shown by the noncoherent platform.

6.3 Master-Slave, Nonshared Data

In this experiment, we ran the DES benchmark in the message passing and shared memory versions for varying granularity of processing data. In this case, the computation complexity is similar to that of the synth-MM benchmark, and this might lead to the conclusion that shared memory is the right choice here. However, this benchmark also emphasizes other features that call the previous conclusions into question.

First, this is a synchronization-intensive benchmark, and previous work in the parallel computing domain agrees that performing synchronization by means of shared memory variables is inherently inefficient [28]. However, this disadvantage of shared memory over message passing (which can exploit the synchronization implicit in the arrival of a message) can be counterbalanced by using interrupt-based synchronization. The issue is to find out whether, in an MPSoC domain, using interrupts in a shared memory system is more costly than the mechanism used to wait for messages in a message passing implementation.
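
The following minimal sketch contrasts the two waiting styles under discussion. It assumes a bare-metal platform with memory-mapped semaphores; the addresses and the wait_for_interrupt() primitive are hypothetical placeholders, not the actual API of our software libraries.

#include <stdint.h>

extern void wait_for_interrupt(void); /* hypothetical: sleep until next IRQ */

/* Semaphore locations: one in shared memory, one in the local scratchpad.
 * Both addresses are illustrative. */
static volatile uint32_t *const remote_sem = (uint32_t *)0x90000F00;
static volatile uint32_t *const local_sem  = (uint32_t *)0x00008FF0;

/* Shared memory style: avoid loading the bus with remote polling;
 * suspend until the producer's release generates an interrupt. */
static void shm_wait(void)
{
    while (*remote_sem == 0)
        wait_for_interrupt();  /* idle task runs; DES task rescheduled on IRQ */
    *remote_sem = 0;           /* consume the token */
}

/* Message passing style: the semaphore lives in the consumer's scratchpad,
 * so busy-waiting is local and generates no bus traffic. */
static void mp_wait(void)
{
    while (*local_sem == 0)
        ;                      /* local spin, off the system bus */
    *local_sem = 0;
}

The trade-off is then between the task-switch and interrupt-handling cost paid by shm_wait() and the (bus-free) spin cycles consumed by mp_wait().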

Second, a static profiling of the DES benchmark points out poor data locality. Similarly, many scientific applications do not exhibit much temporal locality, as all or most of the application data set is rewritten on each iteration of the algorithm. Finally, the DES input data sets for each processor are disjoint, thus minimizing the advantage of using update-based cache coherence protocols. It is difficult to predict how the above features combine to determine the final performance and energy metrics in the MPSoC domain, thus motivating our simulation-based analysis. Results for the DES benchmark are reported in Fig. 10.

Fig. 10. Throughput for the DES benchmark as a function of data granularity.

At first, let us observe the relevant impact of synchronization on performance. On one hand, throughput increases as the size of the exchanged data units increases. In fact, the processors still elaborate the same overall amount of data, but they exchange data units with larger granularity, thus incurring fewer synchronization events. Please note that the increase in communication granularity translates into a linear increase of computation per data unit, thus resulting in the linear increase of throughput.
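
As a first-order check, let D be the total amount of data processed and g the data unit size; with a hypothetical fixed cost t_sync per synchronization event, the number of events and their aggregate cost are

    N_sync = D / g,    T_sync = (D / g) * t_sync,

so enlarging the granularity g directly reduces the synchronization overhead, consistent with the throughput trend of Fig. 10.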

On the other hand, for small data units, shared memory scales worse than message passing due to the high overhead associated with interrupt handling. In fact, the idle task is scheduled to avoid polling remote semaphores, and the DES task is rescheduled when an interrupt is received. On the contrary, message passing can poll a distributed local semaphore without accessing the bus. This inefficiency incurred by shared memory significantly impacts its performance with respect to that of message passing, which is clearly the best solution for small data units.

In addition, Fig. 10 also shows that the message passing approach clearly outperforms shared memory over the whole range of explored data granularity. Unlike the synth-MM benchmark, where a larger data size results in an increasing efficiency of shared memory over message passing, here the advantage of message passing over shared memory does not shrink, but stays constant over the range of explored data unit sizes.

In fact, as the data footprint increases, the lower synchronization overhead of shared memory is progressively counterbalanced by the increasing cache miss ratio of the consumer processor, and the two low-level effects compensate for each other, as shown by the parallel curves in Fig. 10.

In this case, the degrading data cache performance is not related to cache conflicts, but rather to the limited cache size. In fact, as Fig. 10 indicates, a fully associative cache provides negligible performance benefits. On the contrary, shared memory performance can be significantly improved by increasing the data cache size from the default 4 kB to 8 kB. The underlying reason is that, while the cache miss ratio of all slave processors stays constant as the data size increases, this does not hold for the consumer, which reads the slave output data from shared memory. While, for small data units, the corresponding memory locations can be contained in the consumer cache without conflicts, a larger data footprint causes an increasing number of conflicts in the 4 kB data cache (from 4 to 11 percent) that penalizes shared memory.
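
A rough capacity argument captures this effect. If the consumer buffers the output of P slaves with data units of g bytes each (P and g are illustrative parameters, not the exact benchmark settings), its working set is roughly P * g bytes; misses stay rare only while P * g fits in the 4 kB data cache, and enlarging the cache to 8 kB simply postpones the point at which the conflicts appear.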

Interestingly, further increasing the data cache size from 8 kB to 16 kB leads to a performance saturation effect, which indicates that, in this scenario, a message passing solution is inherently more effective. Moreover, reverting to such large caches also starts impacting system energy, as illustrated in Fig. 11. The trend of the energy curves is strongly correlated to the performance plot in that a higher throughput determines a shorter execution time to process the same amount of data.

Fig. 11. Energy for the DES benchmark as a function of data granularity.

6.4 Pipelining

We finally ran the pipelined matrix processing benchmarks (multiplication and addition) and report the simulation results in Fig. 12.

Consider Fig. 12a, i.e., matrix multiplication. This benchmark has features common to both the MM and DES benchmarks. Like MM, here we have high data locality and high computation complexity. Like DES, we have a high impact of the synchronization mechanisms. Results show that, for small matrices, the more efficient synchronization carried out by message passing is compensated for by the higher time spent for interprocessor communication: With shared memory, cache updates occur in parallel with task execution, while, with message passing, the small data size is not favorable to using a DMA due to the programming overhead. The pros and cons of each paradigm compensate for each other, and we do not observe any performance difference.
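
The following sketch illustrates the kind of size test a message passing runtime can apply when deciding between a DMA transfer and a processor-driven copy. The cost constants and the dma_copy() driver call are assumptions for illustration, not the actual interface of our platform.

#include <stddef.h>
#include <string.h>

/* Illustrative cost parameters (cycles); real values are platform-specific. */
#define DMA_SETUP_CYCLES    400U  /* programming the DMA engine's registers */
#define CPU_CYCLES_PER_BYTE   4U  /* processor-driven copy rate             */
#define DMA_CYCLES_PER_BYTE   1U  /* DMA streaming rate                     */

extern void dma_copy(void *dst, const void *src, size_t n); /* assumed driver */

/* Send a message payload, falling back to a software copy when the DMA
 * programming overhead cannot be amortized over the transfer size. */
void send_payload(void *dst, const void *src, size_t n)
{
    size_t cpu_cost = n * CPU_CYCLES_PER_BYTE;
    size_t dma_cost = DMA_SETUP_CYCLES + n * DMA_CYCLES_PER_BYTE;

    if (dma_cost < cpu_cost)
        dma_copy(dst, src, n);  /* large payload: setup cost is amortized */
    else
        memcpy(dst, src, n);    /* small payload: copy it in software     */
}

With these illustrative constants, the break-even transfer size is about 134 bytes, matching the observation that small matrices do not justify the DMA programming overhead.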

Although counterintuitive, if the matrices become large, the higher computation efficiency of message passing (shared memory incurs a significant cache miss ratio) does not determine an overall better performance of message passing. In fact, since the pipeline stages are almost perfectly balanced, all data transfers between pairs of communicating processors occur in parallel at the same time, thus creating localized peaks of bus congestion that increase the transfer times. This explains the similar performance of message passing and shared memory also for large data.

In Fig. 12b, the shared memory solution outperforms the message passing one as the matrix size increases, reflecting what we have already seen in the synth-MM benchmark. However, if the matrices are small, the high synchronization efficiency of message passing generates performance benefits, as seen for DES. Moreover, in the rightmost part of the plot, we can see that cache-coherent shared memory and non-cache-coherent shared memory tend to have the same performance. In fact, cache-coherent shared memory suffers from a high percentage of cache misses, and this counterbalances the more efficient accesses to shared memory.

In Fig. 13a, we see that the shared memory variant consumes more energy since we have an increase in data cache misses. On the contrary, in Fig. 13b, communication plays a more significant role; therefore, message passing progressively becomes less energy-efficient.

6.4.1 Impact of Mapping Decisions

For balanced pipelines, message passing suffers from the high peak bandwidth utilization problem that limits its performance. Let us now show that this limitation can be relieved by taking the proper course of action and that the resulting performance cannot be matched by shared memory under any cache setting. We consider a pipeline of matrix multiplications where a different number of operations is performed at each stage, thus making the pipeline unbalanced (see Table 5). The rightmost bars in Fig. 14 indicate that message passing outperforms shared memory in this context, even though the difference is not significant. However, if a lower throughput is acceptable, by rearranging the task allocation to processors and allowing more tasks to run on the same processor, we can get a more noticeable differentiation between message passing and shared memory, provided communication is taken into account in the mapping framework. We focused on a 500 Mbit/s target throughput and considered two mappings that meet the performance constraint while generating different amounts of bus traffic. The mappings are reported in Table 6; the first one was communication-optimized by using the framework in [53]. Looking at the results in Fig. 14, the message passing implementation of mapping 1 outperforms that of mapping 2. The performance difference can be explained by the peaks in bandwidth utilization, which increase the time spent in transferring data. Finally, the plot shows that shared memory performance is always lower than that of message passing, whatever the cache configuration (size and associativity), thus proving the higher efficiency of message passing in this context.

TABLE 6. Mapping of Tasks on the Processors

Fig. 14. Bit rate achieved with the different mappings.
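
To make the notion of a communication-optimized mapping concrete, the toy estimator below charges bus traffic only for pipeline edges that cross processor boundaries. The task count, edge volumes, and the two candidate mappings are invented for illustration and do not reproduce the setup of Table 6.

#include <stdio.h>

#define NTASKS 6

/* Bytes exchanged between consecutive pipeline stages i and i+1
 * (illustrative volumes). */
static const long edge_bytes[NTASKS - 1] = { 4096, 4096, 8192, 8192, 4096 };

/* Only edges whose endpoints live on different processors hit the bus;
 * tasks co-mapped on one processor communicate through local memory. */
static long bus_traffic(const int proc_of[NTASKS])
{
    long total = 0;
    for (int i = 0; i < NTASKS - 1; i++)
        if (proc_of[i] != proc_of[i + 1])
            total += edge_bytes[i];
    return total;
}

int main(void)
{
    int mapping1[NTASKS] = { 0, 0, 1, 1, 2, 2 }; /* communication-optimized  */
    int mapping2[NTASKS] = { 0, 1, 0, 2, 1, 2 }; /* same load, more crossings */
    printf("mapping1 bus bytes/iteration: %ld\n", bus_traffic(mapping1));
    printf("mapping2 bus bytes/iteration: %ld\n", bus_traffic(mapping2));
    return 0;
}

A mapping framework such as [53] minimizes exactly this kind of crossing traffic subject to the throughput constraint, which is why mapping 1 generates fewer bandwidth peaks than mapping 2.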


Fig. 12. Throughput for pipelined matrix processing. (a) Matrix multiplication. (b) Matrix addition.

Fig. 13. Energy for pipelined matrix processing. (a) Matrix multiplication. (b) Matrix addition.

TABLE 5. The Computation Cost of Each Task of the Pipeline

7 CONTRASTING PROGRAMMING PARADIGMS FOR MPSOCS AND PARALLEL COMPUTERS

Our exploration has pointed out some of the main differences between programming paradigms for MPSoCs with respect to those for the parallel computing domain. We summarize them as follows:

. In shared memory platforms, the use of shared buses makes update-based cache coherence protocols effective for producer-consumer communication without generating traffic overhead, as is the case for many network-centric parallel computer architectures. Furthermore, caches tend to smooth the distribution of data traffic, hence reducing the probability of traffic peaks on the interconnect.

. MPSoCs have access to a fast communication architecture integrated on the die together with the processors. As a result, memory can be accessed faster and, thus, the cache lines can be refilled more quickly than on a traditional multiprocessor architecture. In practice, this also means that, on an MPSoC, the same performance can be obtained with a smaller cache, even if this causes cache misses to increase. The latter insight is often used by designers to reduce chip area and, thus, manufacturing cost. However, if the communication architecture becomes congested, the communication delay increases again and the extra cache misses then result in a high performance loss and in a system energy overhead associated with longer execution times. Hence, even though the same performance can be obtained with a smaller cache, the smaller cache makes the performance more sensitive to bus congestion, potentially limiting the efficiency of shared memory.

. In the MPSoC context, the software infrastructure is far more lightweight than in traditional parallel systems. Therefore, many performance overhead sources that have traditionally been considered negligible or marginal now come into play and, in some cases, might make the difference. Two relevant examples that have emerged throughout this work are the overhead for DMA programming (which must be compared with the size of the data to be moved) and for interrupt handling (to be compared with the bus congestion induced by semaphore polling). Surprisingly, solutions that are apparently inefficient might turn out to provide the best performance, such as processor-driven data transfers and polling-based synchronization. A similar issue concerns the porting of standard messaging libraries to MPSoC platforms. The porting of these libraries (such as the SystemV IPC library considered in this work or the MPI primitives) has to be combined with an optimization and customization effort for the platform instance in order to reduce the performance overhead. As an example, the latency of several thousand cycles incurred by MPI primitives [45] in traditional parallel systems would seriously impair MPSoC performance. This further stresses the importance of hardware extensions for the different programming paradigms, as we have provided in this work.

. In message passing architectures, local memories in processor nodes cannot be as large as in traditional distributed memory multiprocessor systems. On the other hand, software-controlled scratchpad memories exhibit a negligible access cost, performance- and energy-wise. We think that this feature, combined with technology constraints in memory fabrication, will further differentiate MPSoC platforms from distributed parallel computers. We expect this to impact the architecture of the memory hierarchy, which will have to store large data sets off-chip while, at the same time, avoiding the bottleneck of centralized off-chip memory controllers. Considering these issues is outside the scope of this work, which has therefore assumed that processing data can be entirely contained in scratchpad memories of reasonable size.

8 DESIGN GUIDELINES

A designer can choose the architectural template and the programming paradigm that best suit their needs based on a few relevant features of the parallel application under development. Our analysis has shown the importance of the workload allocation policy, the computation/communication ratio, the degree of sharing of input data among working processors, and data locality in differentiating between the performance and energy of the message passing versus the shared memory programming paradigm. Since our approach is centered around the accuracy of the exploration framework, we restricted our analysis to three relevant scenarios for future MPSoC platforms, which were extensively and accurately investigated by means of synthetic and parameterizable benchmarks. This leads us to the following guidelines for system designers, condensed in the decision sketch after the list:

. For the case where many working processors share the same input processing data, shared memory typically outperforms message passing. Shared memory leverages the implicit broadcasting support offered by the write-through update cache coherence protocol. In contrast, message passing suffers from the overhead for explicitly replicated input messages and for postprocessing updates of shared data stored in local memories. Obviously, an application with a low computation/communication ratio emphasizes shared memory efficiency. The only nontrivial case where message passing turns out to be competitive is that of computation-intensive applications with large data sets. In fact, message passing profits from a more efficient computation in scratchpad memory, while the shared memory implementation starts suffering from cache misses. We have shown that shared memory performance can be restored by means of proper data cache sizing, since this has only a marginal impact on system energy. However, the performances of both programming paradigms tend to converge in these operating conditions.

. For synchronization-intensive applications, message passing provides potential for the implementation of more efficient synchronization mechanisms and, hence, for shorter application execution times. In particular, this point makes the difference in the presence of processing data with a small footprint. Synchronization events can be very costly for MPSoC systems in terms of bus congestion for remote semaphore polling or performance overhead for interrupt handling and task switching. The frequency and duration of these events and, hence, their impact on application execution metrics, depend on the amount of computation performed on each input data unit, on the input data granularity, and on the relative waiting times between synchronized tasks. We have observed that this issue certainly determines better system performance and energy for message passing when small input data is to be processed in synchronization-intensive applications.

. Many applications (e.g., scientific computation, cryptography) make use of iterative algorithms showing poor temporal locality, where all or most of the input data set is rewritten at each iteration of the algorithm. In this scenario, message passing turns out to be a more effective solution than shared memory, even though different cache settings might reduce the gap. The message passing solution is also the most energy-efficient.

. With regard to signal processing pipelines, what really makes the difference between the two programming paradigms is the computation/communication ratio and the data granularity. For small data sets, message passing again profits from the more efficient synchronization mechanism, which is key for pipeline implementations. On the other hand, as the data footprint increases, message passing proves slightly more effective only for computation-intensive pipeline stages. However, in this regime, message passing performance is extremely sensitive to peak bus bandwidth utilization and, for balanced pipelines or significant peak bandwidth requirements (associated with input data reading or output data generation), shared memory becomes competitive. Instead, shared memory noticeably outperforms message passing with a low computation/communication ratio and large data sets, since the communication overhead of message passing cannot be amortized by enough computation in scratchpad memory.
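
As a compact summary, the sketch below encodes these guidelines as a decision routine. The workload descriptor and the numeric thresholds are illustrative, not calibrated values from our experiments; a real design flow should tune them against measurements such as those reported above.

/* Decision sketch condensing the guidelines above. */
typedef enum { SHARED_MEMORY, MESSAGE_PASSING, COMPARABLE } paradigm_t;

typedef struct {
    int    shared_input;     /* many workers read the same input data     */
    int    sync_intensive;   /* frequent, fine-grained synchronization    */
    int    poor_locality;    /* data set rewritten on each iteration      */
    int    small_footprint;  /* data units fit comfortably in cache       */
    double comp_comm_ratio;  /* computation per unit of communication     */
} workload_t;

paradigm_t suggest_paradigm(const workload_t *w)
{
    if (w->shared_input) {
        /* Update-based coherence broadcasts shared input for free; only
         * compute-heavy kernels with large data close the gap. */
        if (w->comp_comm_ratio > 10.0 && !w->small_footprint)
            return COMPARABLE;
        return SHARED_MEMORY;
    }
    if (w->sync_intensive && w->small_footprint)
        return MESSAGE_PASSING;  /* cheap local-spin synchronization wins */
    if (w->poor_locality)
        return MESSAGE_PASSING;  /* caches buy little; scratchpads do not */
    /* Pipelines: high comp/comm favors MP; low ratios and large data
     * favor SHM. */
    return (w->comp_comm_ratio > 1.0) ? MESSAGE_PASSING : SHARED_MEMORY;
}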

9 CONCLUSIONS

This paper explores programming paradigms for parallel multimedia applications on MPSoCs. Our analysis points out that the trade-offs spanned by MPSoC platforms can be very different from those of traditional parallel systems and provides some design guidelines to discriminate between the message passing and shared memory programming paradigms in relevant subspaces of the software space.

ACKNOWLEDGMENTS

This work was supported in part by SRC under contract no. 1188 and in part by STMicroelectronics.

REFERENCES

[1] G. Declerck, “A Look into the Future of Nanoelectronics,” Proc. IEEE Symp. VLSI Technology, pp. 6-10, 2005.
[2] Ambient Intelligence, W. Weber, J. Rabaey, and E. Aarts, eds. Springer, 2005.
[3] D. Culler, J. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1999.
[4] L. Hennessy and D. Patterson, Computer Architecture—A Quantitative Approach, third ed. Morgan Kaufmann, 2003.
[5] S. Hand, A. Baghdadi, M. Bonacio, S. Chae, and A. Jerraya, “An Efficient Scalable and Flexible Data Transfer Architecture for Multiprocessor SoC with Massive Distributed Memory,” Proc. 41st Design Automation Conf., pp. 250-255, 2004.
[6] F. Gilbert, M. Thul, and N. When, “Communication Centric Architectures for Turbo-Decoding on Embedded Multiprocessors,” Proc. Design and Test in Europe Conf., pp. 351-356, 2003.
[7] H. Arakida et al., “A 160mW, 80nA Standby, MPEG-4 Audiovisual LSI with 16Mb Embedded DRAM and a 5 GOPS Adaptive Post Filter,” Proc. IEEE Int’l Solid-State Circuits Conf., pp. 62-63, 2003.
[8] M. Rutten, J. van Eijndhoven, E. Pol, E. Jaspers, P. van der Wolf, O. Gangwal, and A. Timmer, “Eclipse: Heterogeneous Multiprocessor Architecture for Flexible Media Processing,” Proc. Int’l Parallel and Distributed Processing Conf., pp. 39-50, 2002.
[9] U. Ramachandran, M. Solomon, and M. Vernon, “Hardware Support for Interprocess Communication,” IEEE Trans. Parallel and Distributed Systems, vol. 1, pp. 318-329, July 1990.
[10] M. Banekazemi, R. Govindaraju, R. Blackmore, and D. Panda, “MP-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems,” IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 10, pp. 1081-1093, Oct. 2001.
[11] W. Lee, W. Dally, S. Keckler, N. Carter, and A. Chang, “An Efficient Protected Message Interface,” Computer, pp. 68-75, Mar. 1998.
[12] N.E. Crosbie, M. Kandemir, I. Kolcu, J. Ramanujam, and A. Choudhary, “Strategies for Improving Data Locality in Embedded Applications,” Proc. 15th Int’l Conf. VLSI Design, 2002.
[13] M.M. Strout, L. Carter, and J. Ferrante, “Rescheduling for Locality in Sparse Matrix Computations,” Lecture Notes in Computer Science, p. 137, 2001.
[14] M. Kandemir, “Two-Dimensional Data Locality: Definition, Abstraction, and Application,” Proc. Int’l Conf. Computer Aided Design, pp. 275-278, 2005.
[15] G. Byrd and M. Flynn, “Producer-Consumer Communication in Distributed Shared Memory Multiprocessors,” Proc. IEEE, pp. 456-466, Mar. 1999.
[16] K. Tachikawa, “Requirements and Strategies for Semiconductor Technologies for Mobile Communication Terminals,” Proc. Electron Devices Meeting, pp. 1.2.1-1.2.6, 2003.


[17] D. Pham et al., “The Design and Implementation of a First-Generation CELL Processor,” Proc. Int’l Solid State Circuits Conf. (ISSCC), Feb. 2005.
[18] ARM Semiconductor, “ARM11 MPCore Multiprocessor,” http://arm.convergencepromotions.com/catalog/753.htm, 2007.
[19] Philips Semiconductor, “Philips Nexperia Platform,” www.semiconductors.philips.com/products/nexperia/home, 2007.
[20] STMicroelectronics Semiconductor, “Nomadik Platform,” www.st.com/stonline/prodpres/dedicate/proc/proc.htm, 2007.
[21] Texas Instrument Semiconductor, “OMAP5910 Platform,” http://focus.ti.com/docs/prod/folders/print/omap5910.html, 2007.
[22] MPCore Multiprocessors Family, www.arm.com/products/CPUs/families/MPCoreMultiprocessors.html, 2007.
[23] Intel Semiconductor, “IXP2850 Network Processor,” http://www.intel.com, 2007.
[24] B. Ackland et al., “A Single Chip, 1.6 Billion, 16-b MAC/s Multiprocessor DSP,” IEEE J. Solid State Circuits, vol. 35, no. 3, Mar. 2000.
[25] C. Lin and L. Snyder, “A Comparison of Programming Models for Shared Memory Multiprocessors,” Proc. Int’l Conf. Parallel Processing, pp. 163-170, 1990.
[26] T.A. Ngo and L. Snyder, “On the Influence of Programming Models on Shared Memory Computer Performance,” Proc. Int’l Conf. Scalable and High Performance Computing, pp. 284-291, 1992.
[27] T.J. LeBlanc and E.P. Markatos, “Shared Memory vs. Message Passing in Shared-Memory Multiprocessors,” Proc. Symp. Parallel and Distributed Processing, pp. 254-263, Dec. 1992.
[28] A.C. Klaiber and H.M. Levy, “A Comparison of Message Passing and Shared Memory Architectures for Data Parallel Programs,” Proc. Int’l Symp. Computer Architecture, pp. 94-105, 1994.
[29] S. Chandra, J.R. Larus, and A. Rogers, “Where Is Time Spent in Message-Passing and Shared-Memory Programs,” Proc. Int’l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 61-73, 1994.
[30] S. Karlsson and M. Brorsson, “A Comparative Characterization of Communication Patterns in Applications Using MPI and Shared Memory on an IBM SPI,” Proc. Int’l Workshop Comm., Architecture, and Applications for Network-Based Parallel Computing, pp. 189-201, 1998.
[31] H. Shan and J.P. Singh, “A Comparison of MPI, SHMEM and Cache-Coherent Shared Address Space Programming Models on the SGI Origin2000,” Proc. Int’l Conf. Supercomputing, pp. 329-338, 1999.
[32] H. Shan, J.P. Singh, L. Oliker, and R. Biswas, “Message Passing vs. Shared Address Space on a Cluster of SMPs,” Proc. Int’l Parallel and Distributed Processing Symp., Apr. 2001.
[33] D. Altilar and Y. Paker, “Minimum Overhead Data Partitioning Algorithms for Parallel Video Processing,” Proc. 12th Int’l Conf. Domain Decomposition Methods, 2001.
[34] S. Bakshi and D.D. Gajski, “Hardware/Software Partitioning and Pipelining,” Proc. ACM/IEEE Design Automation Conf., pp. 713-716, 1997.
[35] W. Liu and V.K. Prasanna, “Utilizing the Power of High-Performance Computing,” IEEE Signal Processing Magazine, pp. 85-100, Sept. 1998.
[36] J.P. Kitajima, D. Barbosa, and W. Meira Jr., “Parallelizing MPEG Video Encoding Using Multiprocessors,” Proc. Brazilian Symp. Computer Graphics and Image Processing (SIBGRAPI), pp. 215-222, Sept. 1999.
[37] M. Stemm and R.H. Katz, “Measuring and Reducing Energy Consumption of Network Interfaces in Hand-Held Devices,” IEICE Trans. Comm., vol. E80-B, no. 8, pp. 1125-1131, 1997.
[38] V. Raghunathan, C. Schurgers, S. Park, and M. Srivastava, “Energy Aware Wireless Microsensor Networks,” IEEE Signal Processing Magazine, pp. 40-50, Mar. 2002.
[39] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini, and R. Zafalon, “Analyzing On-Chip Communication in a MPSoC Environment,” Proc. Design and Test in Europe Conf. (DATE), pp. 752-757, Feb. 2004.
[40] M. Loghi, M. Poncino, and L. Benini, “Cycle-Accurate Power Analysis for Multiprocessor Systems-on-a-Chip,” Proc. Great Lakes Symp. VLSI, pp. 401-406, Apr. 2004.
[41] A. Bona, V. Zaccaria, and R. Zafalon, “System Level Power Modeling and Simulation of High-End Industrial Network-on-Chip,” Proc. Design and Test in Europe Conf. (DATE), pp. 318-323, Feb. 2004.
[42] G.T. Byrd and M.J. Flynn, “Producer-Consumer Communication in Distributed Shared Memory Multiprocessors,” Proc. IEEE, vol. 87, pp. 456-466, Mar. 1999.
[43] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, and J.M. Mendias, “An Integrated Hardware/Software Approach for Run-Time Scratchpad Management,” Proc. Design Automation Conf., vol. 2, pp. 238-243, July 2004.
[44] F. Poletti, A. Poggiali, and P. Marchal, “Flexible Hardware/Software Support for Message Passing on a Distributed Shared Memory Architecture,” Proc. Design and Test in Europe, vol. 2, pp. 736-741, Mar. 2004.
[45] MPI-2 Standard, http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/mpi2-report.htm, 2007.
[46] P. Stenstrom, “A Survey of Cache Coherence Schemes for Multiprocessors,” Computer, vol. 23, no. 6, pp. 12-24, June 1990.
[47] M. Tomasevic and V.M. Milutinovic, “Hardware Approaches to Cache Coherence in Shared-Memory Multiprocessors,” IEEE Micro, vol. 14, nos. 5-6, pp. 52-59, Oct./Dec. 1994.
[48] I. Tartalja and V.M. Milutinovic, “Classifying Software-Based Cache Coherence Solutions,” IEEE Software, vol. 14, no. 3, pp. 90-101, Mar. 1997.
[49] A. Moshovos, B. Falsafi, and A. Choudhary, “JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers,” Proc. High Performance Computer Architecture Conf., pp. 85-97, Jan. 2001.
[50] C. Saldanha and M. Lipasti, “Power Efficient Cache Coherence,” High Performance Memory Systems, pp. 63-78, Springer-Verlag, 2003.
[51] M. Ekman, F. Dahlgren, and P. Stenstrom, “Evaluation of Snoop-Energy Reduction Techniques for Chip-Multiprocessors,” Proc. Int’l Symp. Computer Architecture, May 2002.
[52] M. Ekman, F. Dahgren, and P. Stenstrom, “TLB and Snoop Energy-Reduction Using Virtual Caches in Low-Power Chip-Multiprocessors,” Proc. Int’l Symp. Low Power Electronics and Design, pp. 243-246, Aug. 2002.
[53] M. Ruggiero, A. Guerri, D. Bertozzi, F. Poletti, and M. Milano, “Communication-Aware Allocation and Scheduling Framework for Stream-Oriented Multi-Processor Systems-on-Chip,” Proc. Design and Test in Europe, vol. 1, pp. 3-9, Mar. 2006.
[54] P. Banerjee, J. Chandy, M. Gupta, J. Holm, A. Lain, D. Palermo, S. Ramaswamy, and E. Su, “Overview of the PARADIGM Compiler for Distributed Memory Message-Passing Multicomputers,” Computer, vol. 28, no. 3, pp. 37-47, Mar. 1995.
[55] M. Gupta, E. Schonberg, and H. Srinavasan, “A Unified Framework for Optimizing Communication in Data-Parallel Programs,” IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 7, pp. 689-704, July 1996.

Francesco Poletti received the Laurea degree in computer science from the University of Bologna, Italy, in 2003. In 2003, he joined the research group of Professor Luca Benini in the Department of Electronics, Computer Science and Systems (DEIS) at the University of Bologna. His research is mostly in the field of embedded MPSoC systems. He is currently involved in cycle-accurate simulation infrastructures for multiprocessor embedded systems and in the design of parallel applications optimized for the architecture, ranging from biomedical to multimedia systems. The aim of his research is the exploration of the design space to identify optimal architectural trade-offs among speed, area, and power. Additionally, he is working on the topic of memory hierarchies, with emphasis on the usage of ScratchPad Memories (SPMs) and dedicated hardware support.


Antonio Poggiali graduated in computer science engineering in July 2005 from the Electronic and Computer Science Engineering Department at the “Alma Mater Studiorum” University, Bologna, Italy, with a thesis on the optimization of multimedia software for MPSoC systems. Currently, he works for STMicroelectronics in the Advanced System Technology Division-Advanced Microprocessor Design Group as an architectural designer for low power microprocessors.

Davide Bertozzi received the PhD degree in 2003 from the University of Bologna, Italy, with an oral dissertation on “Energy-Efficient Connectivity of Network Devices: From Wireless Local Area Networks to Micro-Networks of Interconnects.” He is an assistant professor in the Engineering Department at the University of Ferrara, Italy. He has been a visiting researcher at international universities (Stanford University) and in the semiconductor industry (STMicroelectronics—Italy, Samsung Electronics—Korea, Philips—Holland, NEC America—USA). He is a member of the technical program committees of several conferences and a reviewer for many technical journals. His research interests concern system level design issues in the domain of single-chip multiprocessors, with emphasis on both the hardware (communication and I/O) and software architecture (programming paradigms, application portability).

Luca Benini received the PhD degree in electrical engineering from Stanford University in 1997. He is a full professor in the Department of Electrical Engineering and Computer Science (DEIS) at the University of Bologna. He also holds a visiting faculty position at the Ecole Polytechnique Federale de Lausanne. His research interests are in the design of system-on-chip platforms for embedded applications. He is also active in the area of energy-efficient smart sensors and sensor networks. He has published more than 250 papers in peer-reviewed international journals and conferences, four books, and several book chapters. He has been program chair and vice chair of the Design Automation and Test in Europe Conference. He has been a member of the technical program committees and organizing committees of several technical conferences, including the Design Automation Conference, the International Symposium on Low Power Design, and the Symposium on Hardware-Software Codesign. He is an associate editor of the IEEE Transactions on Computer Aided Design of Circuits and Systems and the ACM Journal on Emerging Technologies in Computing Systems.

Pol Marchal received the engineering degree and PhD degree in electrical engineering from the Katholieke Universiteit Leuven, Belgium, in 1999 and 2005, respectively. He currently holds a position as a senior researcher at IMEC, Leuven. Dr. Marchal’s research interests are in all aspects of the design of digital systems, with special emphasis on technology-aware design techniques for low-power systems.

Mirko Loghi received the DrEng degree (summa cum laude) in electrical engineering from the University La Sapienza of Rome in 2001 and the PhD degree in computer science from the University of Verona in 2005. He is currently a postdoctoral fellow at the Politecnico di Torino. His research interests include low-power design, embedded systems, and multiprocessor systems.

Massimo Poncino received the DrEng degree in electrical engineering and the PhD degree in computer engineering, both from the Politecnico di Torino. He is an associate professor of computer science at the Politecnico di Torino. His research interests include several aspects of design automation of digital systems, with particular emphasis on the modeling and optimization of low-power systems. He has coauthored more than 180 journal and conference papers, as well as a book on low-power memory design. He is an associate editor of the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems and a member of the technical program committees of several technical conferences, including the International Symposium on Low Power Design.

