Runtime Resource Management Techniques for Many-core Architectures: The 2PARMA Approach

Alexandros Bartzas1, Patrick Bellasi2, Iraklis Anagnostopoulos1, Cristina Silvano2, William Fornaciari2, Dimitrios Soudris1, Diego Melpignano3, Chantal Ykman-Couvreur4

1 Institute of Communications and Computer Systems, Athens, Greece; 2 Politecnico di Milano, Italy; 3 STMicroelectronics, France

4 IMEC, Interuniversity Micro-electronics Center, Leuven, Belgium

Abstract— Real-time applications, hard or soft, raise the challenge of unpredictability. This is an extremely difficult problem in the context of modern, dynamic, multi-processor platforms which, while providing potentially high performance, make the task of timing prediction extremely difficult. Also, with the growing software content in embedded systems and the diffusion of highly programmable and re-configurable platforms, software is given an unprecedented degree of control over resource utilization. Existing approaches to Runtime Resource Management (RTRM) still require considerable design-time effort, in which profiling information is gathered and analyzed in order to construct a runtime scheduler that can be lightweight. There is a trade-off to be made between design-time and runtime efforts. In this paper we present a framework for RTRM on many-core architectures. This RTRM offers optimal resource partitioning, adaptive dynamic data management and adaptive runtime scheduling of the different application tasks and of the accesses to the data. Furthermore, the 2PARMA RTRM takes into account: i) the requirements and specifications of many-core architectures, applications and design techniques; ii) OS support for resource management and iii) a design space exploration phase.

Keywords: Runtime resource management, task, memory and power management

1. Introduction

The current trend in computing architectures is to replace complex superscalar architectures with many processing units connected by an on-chip network. This trend is mostly dictated by inherent silicon technology limits, which are drawing closer as process densities increase. The number of cores integrated in a single chip is expected to continue to increase rapidly in the coming years, moving from multi-core to many-core architectures. This trend will require a global rethinking of software and hardware approaches.

This work is partially supported by the E.C. funded FP7-248716 2PARMA Project, www.2parma.eu

Multi-core architectures are nowadays prevalent in general-purpose computing and in high-performance computing. In addition to dual- and quad-core general-purpose processors, more scalable multi-core architectures are widely adopted for high-end graphics and media processing. Such platforms are becoming widespread as silicon technology moves into the sub-50nm nodes. The transition to multi-core is almost a forced choice to escape the silicon efficiency crisis caused by the looming power wall, the increase in application complexity and the design complexity gap under tightening time-to-market constraints. While multi-core architectures are common in general-purpose and domain-specific computing, there is no one-size-fits-all solution. General-purpose multi-cores are still designed to deliver outstanding single-thread performance under very general conditions in terms of workload mix, memory footprint, runtime environment and legacy code compatibility. These requirements lead to architectures featuring a few complex, high-clock-speed “mega-cores” with complex instruction sets, deep pipelines, non-blocking multi-level caches with hardware-supported coherency and advanced virtualization support. Today, we see a trend towards many-core fabrics, with a throughput-oriented memory hierarchy featuring software-controlled local memories, FIFOs and specialized DMA engines. As a result, an SoC platform today is a highly heterogeneous system. It integrates a general-purpose multi-core CPU and a number of domain-specific many-core subsystems. Examples of such emerging multi-core platforms are Intel's SCC [1] and ST's Platform 2012 [2].

System-level design and optimization of computing systems is a highly challenging task, especially since such systems are becoming more and more complex, from both hardware and software perspectives [3]. Over the last few years, the main focus in the design of computing systems has been to provide good performance while at the same time achieving low power consumption. To achieve optimal results, good coordination between hardware and software design is required. Therefore, memory-intensive applications running on embedded platforms (e.g., multimedia) must be closely linked to the underlying Operating System (OS) and efficiently utilize the available hardware resources. Putting all this together, it is clear that developing a complete, working system is an integration nightmare [3].

In this paper we present a framework for Runtime Resource Management (RTRM) on many-core architectures. This RTRM offers i) optimal resource partitioning among the different resource requirements of the applications running on the hardware platform; ii) adaptive dynamic data management (dynamic allocation and de-allocation of heap data); iii) adaptive runtime scheduling of the different application tasks and of the accesses to the data. Furthermore, adequate power management techniques as well as the integration into the Linux Operating System (OS) are currently under development. The 2PARMA RTRM takes into account: i) the requirements and specifications of many-core architectures, applications and design techniques; ii) OS support for resource management and iii) a design space exploration phase.

The rest of the paper is organized as follows. Background information regarding RTRM is provided in Section 2, whereas Section 3 gives an overview of the proposed RTRM framework. The runtime resource management component is presented in Section 4, the adaptive task management component in Section 5 and the adaptive dynamic memory management component in Section 6. Finally, conclusions are drawn and future work is outlined in Section 7.

2. Background

Schaumont et al. [4] demonstrate the use of hierarchical configuration on an image processing example application. The authors' methodology starts by profiling a set of applications from a target application domain to determine the right point in the configuration design space. In this way, a set of commonly used and computationally intensive kernels can be identified. Parameterizable implementations of these kernels form the building blocks of the reconfigurable platform. These blocks are then pre-instantiated into the FPGA fabric and can be programmed at runtime with a minimal amount of configuration input. The authors suggest using compiler techniques to map an application onto a given set of parameterizable IP blocks.

Keller et al. [5] describe the use of so-called software decelerators. By using freely available soft IP cores, the designer can take advantage of an easier software application design process. In addition, certain algorithms, such as a sequential state machine, use fewer hardware resources when implemented on a soft IP core, while still meeting the necessary performance requirements. The authors describe a configuration hierarchy case study.

Faruque et al. [6] present runtime application mapping performed in a distributed manner using agents, targeting adaptive NoC-based heterogeneous multi-processor systems. The authors claim that a centralized RTRM may suffer from a series of problems such as a single point of failure and a large volume of monitoring traffic. However, Nollet et al. [7] present a centralized runtime resource management scheme that is able to efficiently manage a NoC containing fine-grain reconfigurable hardware tiles, together with two task migration algorithms. The resource management heuristic consists of a basic algorithm completed with reconfigurable add-ons. The basic heuristic combines ideas from multiple resource management approaches.

Shirazi et al. [8] present a method for managing reconfigurable designs which supports runtime configuration transformation. This method involves structuring the reconfiguration manager into three components: a monitor, a loader, and a configuration store. To improve reconfiguration speed, the authors have developed a scheme to implement the loader in hardware. The main motivation for implementing part of the runtime manager in hardware or in a separate accelerator is to avoid the overhead caused by the runtime management functionality [9], [10], and to increase determinism. As presented in [11], when moving RTRM functionality to a platform hardware service, the runtime overhead is kept to a minimum as decision making is done in parallel by specialized hardware. In addition, hardware can operate at a finer granularity without incurring a performance penalty.

3. Runtime resource management: An overview

The development of new computing systems requires tuning of the software applications to specific hardware blocks and platforms as well as to the relevant input data instances. The behaviour of these applications heavily depends on the nature of the input data samples, thus making them strongly data-dependent. For this reason, it is necessary to profile them extensively with representative samples of the actual input data. An important aspect of this profiling is carried out not only at the dynamic data type level, which actually steers the designers' choice of implementation of these data types, but also at the functional level. We characterize the software metadata that these optimizations require, and we present a methodology, as well as appropriate techniques, to obtain this information from the original application. It is equally important for the designer to have a good knowledge of the platform characteristics. With both pieces of information at hand (software metadata and platform characteristics), the designer can characterize the runtime behaviour of the application and determine its working modes and the reconfiguration overheads.

This paper focuses on the design of an RTRM framework, targeting the optimization of both computing fabric resource usage and applications' Quality-of-Service (QoS) requirements. Specifically regarding power management, we investigate OS services supporting runtime management. The main available OS frameworks related to power, hot-spot and process variation management have been analyzed and their characteristics compared, to lay the foundation for the future design of a new framework supporting the QoS-based runtime management of a generic many-core computing platform. To support the new power manager, both application behaviour and computing platform characteristics should be identified, described and properly reported to the runtime manager framework. This involved a deeper investigation of two main aspects: the platform description and the interfaces with applications.

On one side, the platform description requires the identification and definition of a proper set of hardware metadata to represent a generic computation fabric. We identified a suitable formalism to support both portability and efficiency of the runtime control solution. On the other side, the need to interface and interact with applications is motivated by collecting application requirements, in terms of resources and expected behavior, and notifying control decisions according to the running optimization strategy.

Together, these aspects define the basic building blocks of a portable runtime resource manager, independent of any specific computation fabric and able to control different applications. Based on these results we started the design of a runtime resource management framework at the OS level to support many-core computing platforms. In the meantime, we also investigated a runtime optimization policy which could be used to efficiently allocate resources to applications according to multiple objectives.

3.1 The role of the Runtime Resource Manager (RTRM)

The RTRM framework is composed of a set of modules providing services to different “classes of users” (depicted in Figure 1). Two of these users are represented by applications and target platforms. Applications usually have different Working Modes (WMs), each one defining an expected Quality-of-Service (QoS) and the corresponding needs in terms of computational resources (e.g. processing elements, memory, bandwidth, etc.). Target platforms define a set of available resources, each with specific features, operating modes, and monitoring and control points. Moreover, the platform architecture could define a specific functional relationship between available resources (e.g. clusters of processing elements and a certain memory hierarchy).

The RTRM is a component placed between the applications and the target platform, which is in charge of managing applications' access to the platform's available resources. This management is a quite complex activity generally aimed at meeting conflicting goals: maximizing applications' performance while reducing energy consumption. How this twofold

Fig. 1: Overview of the 2PARMA RunTime Resource Management Framework.

goal can be achieved is beyond the scope of this paper. Here it is important to understand that the RTRM tool must be supported in its role by the applications and the target platform. Indeed, both of these “users” are required to provide the framework with information that can be effectively exploited to accomplish its task.

The overall structure of the required information coming from applications and target platforms (meta-data) should satisfy three main design goals:

1) Completeness: all the information required to properly support the RTRM management activities should be considered and represented.

2) Portability: the meta-data should describe application and target platform properties independently of each other. Indeed, for the success of the final solution it is desirable to support a “write once and run everywhere” approach. Thus, it should be possible to define application meta-data independently of the target platform it will run on.

3) Simplicity: the information required by the RTRM represents an “overhead” for developers of both applications and platforms, so it is important to identify a solution that is as effortless as possible to use. Better still if the solution can be defined as an extension of the classical design and development flow of each system abstraction layer.

Overall, these requirements must be considered to properly define the collection of meta-data and the interfaces to acquire them from the applications and the target platform.
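To make the three goals concrete, the meta-data could be sketched as two independent descriptions that only the RTRM ever joins. The following Python sketch is purely illustrative; the names (`AppMetadata`, `PlatformMetadata`, `feasible_modes`) and field layout are our own assumptions, not part of the 2PARMA framework. Portability is served by expressing demands against abstract resource classes rather than concrete platform units.

```python
from dataclasses import dataclass, field

@dataclass
class AppWorkingMode:
    name: str
    qos_level: int                               # higher = better perceived quality
    demands: dict = field(default_factory=dict)  # resource class -> amount

@dataclass
class AppMetadata:
    app_name: str
    priority: int                                # coarse priority (0 = critical)
    working_modes: list = field(default_factory=list)

# The platform is described separately, in its own meta-data; only the
# RTRM ever combines the two views.
@dataclass
class PlatformMetadata:
    capacities: dict = field(default_factory=dict)  # resource class -> total

def feasible_modes(app: AppMetadata, plat: PlatformMetadata):
    """Working modes whose demands fit the platform capacities."""
    return [wm for wm in app.working_modes
            if all(plat.capacities.get(r, 0) >= amt
                   for r, amt in wm.demands.items())]
```

Because the application side never names physical units, the same `AppMetadata` can be checked against any platform description, which is exactly the "write once and run everywhere" property.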

4. The RTRM Component

The RTRM component is composed of a set of modules running at different levels of abstraction and providing services to different “client” modules. The client modules represent the users of the services provided by this tool. Three main classes of users can be easily identified:

1) User applications, which can be either critical or best-effort. The former are applications, generally known at design time and provided by the device producer/integrator, which implement fundamental tasks for the target device. The latter are instead applications unknown at design time, which each user could add and use once the system is already in production. Critical applications are usually fine-tuned off-line to be highly efficient, e.g., by Design Space Exploration (DSE) techniques, and their resource requirements are considered mandatory for the proper functioning of the device. On the contrary, best-effort applications are expected to negotiate resource usage with the RTRM before actually accessing resources.

2) Resource tuning tools for the optimization of specific resources. Within this class, three main users can be identified: code optimizers (not further discussed in this paper), the runtime task manager (Section 5) and the runtime dynamic data manager (Section 6). The RTRM will provide support to these tools to better achieve their optimization goals while still guaranteeing that the system-wide optimization policy is met.

3) Platform-specific controllers representing the lower-level interface to the many-core computing fabric services. This is usually the computing fabric device driver, suitably extended to provide the required runtime-management-specific services.

To provide a flexible and efficient implementation, a hierarchical design of the RTRM should be considered, where the monitoring, management, control and optimization strategies operate at different granularities and abstraction levels. At least three different granularities can be considered: user-space, kernel-space and fabric/device-space. The first two correspond to the host side while the latter is the computation fabric side. An overall view of the tool architecture and main components is reported in Fig. 1.

The hierarchical architecture of the optimization policy makes it possible to reduce control complexity by distributing it across different levels, each considering a different level of detail. Each granularity level collects requirements from the higher level, runs a specific optimization policy, and finally identifies a set of constraints delivered to the lower levels. This approach keeps the overhead at each level of the hierarchy under control. Moreover, it allows time-critical and fine-tuning decisions to run at lower levels, close to each controlled resource, while a global optimization strategy runs at higher levels. This approach guarantees prompt, low-latency handling of critical events while still ensuring system-wide optimization.
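The downward flow of constraints can be sketched as follows. This is an assumed, simplified model of the hierarchy (the level names and the splitting rule are our own, not the 2PARMA design), meant only to show how each level narrows the decisions of the one below:

```python
# Each level receives requirements from above, applies its own policy,
# and emits tighter constraints for the level below.

def system_level(requested_pes, total_pes):
    # Coarse, system-wide decision: cap the overall budget.
    return min(requested_pes, total_pes)

def kernel_level(budget_pes, per_app_requests):
    # Split the budget among applications, largest request first.
    grants, left = {}, budget_pes
    for app, req in sorted(per_app_requests.items(),
                           key=lambda kv: -kv[1]):
        grants[app] = min(req, left)
        left -= grants[app]
    return grants

def fabric_level(grant):
    # Fine-grained, low-latency placement close to the resource.
    return list(range(grant))   # PE indices assigned to the app

budget = system_level(requested_pes=10, total_pes=8)
grants = kernel_level(budget, {"viewer": 6, "codec": 4})
```

Note how only `fabric_level` touches physical placement: time-critical decisions stay close to the resource, while the global cap is computed once at the top.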

An overall view of the requirements for the development of an effective RTRM tool is represented in Fig. 2, which proposes a target-based view. Indeed, we can identify design requirements, user interaction requirements, functional requirements and finally system-integration requirements. Some of these requirements are imposed by the RTRM tool on other system components, while others relate to the tool itself.

4.1 Imposed requirements

The set of imposed requirements must be satisfied by the users of the tool in order to be properly integrated with the RTRM. Each of these requirements is addressed either by the resources (controllers), i.e., platform devices and their drivers, or by the applications1.

a) Definition of resource working modes [resources]: The RTRM tool needs a complete view of the working modes (WMs) of controlled resources. Each working mode is defined by a set of properties such as the amount of available resources, the power consumption and the constraints on switching from one mode to another. This information on resource working modes will be collected by the RTRM tool through a lower-abstraction-level module, presumably an extension of the computing fabric driver. In any case, any component in a use-case which represents a resource (i.e. a platform subsystem) must completely define its working modes and notify them to the RTRM.
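A working-mode descriptor of this kind might be sketched as below. The mode names, capacity and power figures, and switch latencies are invented for illustration; the point is simply that each mode bundles capacity, power and transition constraints:

```python
# Hypothetical working-mode table for one subsystem: each mode declares
# its available capacity and power draw.
WORKING_MODES = {
    "full":  {"active_pes": 16, "power_mw": 900},
    "half":  {"active_pes": 8,  "power_mw": 480},
    "sleep": {"active_pes": 0,  "power_mw": 15},
}

# Constraint on switching: latency (in microseconds) between modes.
SWITCH_LATENCY_US = {
    ("sleep", "full"): 250, ("sleep", "half"): 180,
    ("half", "full"): 40,   ("full", "half"): 40,
    ("half", "sleep"): 60,  ("full", "sleep"): 90,
}

def cheapest_mode(min_pes):
    """Lowest-power mode that still offers at least min_pes elements."""
    ok = [(m, p) for m, p in WORKING_MODES.items()
          if p["active_pes"] >= min_pes]
    return min(ok, key=lambda mp: mp[1]["power_mw"])[0]
```

With such a table, the RTRM can trade power against capacity and account for the switching cost before deciding to change modes.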

b) Definition of resource control points [resources]: In order for the RTRM tool to perform optimizations of resource usage, it requires the ability to tune some parameters of the corresponding platform subsystems. Thus, these subsystems are required to expose their control points to the RTRM and to define how modifying these control points impacts the subsystem's behavior in terms of both power consumption and performance. Usually, the optimization actions performed by the RTRM using these control points correspond to switching a subsystem from one of its working modes to another.

c) Resource observability [resources]: To properly run its optimization policy, the RTRM will rely on an updated and complete view of the resources' state, both in terms of power consumption and performance. Thus, every subsystem representing a resource to be controlled by this tool is expected to expose some observability points. These points will be represented by metrics that can be mapped onto

1For improved readability, the target of each requirement is indicated right after the requirement name.

Fig. 2: The main requirements for the RTRM tool

resource power consumption and performance, and thus they will generally correspond to the properties defining a working mode.

d) Priority-based access to resources [applications]: Access to resources is priority based: critical applications can preempt resources used by best-effort applications. In general, every application will be associated with a resource access and usage priority by means of a system configuration. In addition to a coarse-grained classification into critical and best-effort applications, it will also be possible to define a fine-grained priority level. Each application will have a default priority; however, a proper mechanism (similar to those already available in operating systems to define the priority of a process) will allow privileged users to change this level. Critical applications are required to use this mechanism to properly configure their priority.
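A minimal sketch of such priority-based access is given below, assuming only the two coarse classes (critical and best-effort) and counting resources in processing elements; the function name and tuple layout are hypothetical, not a 2PARMA interface:

```python
CRITICAL, BEST_EFFORT = 0, 1

def allocate(request, free_pes, grants):
    """Try to serve request = (app, app_class, need).

    grants is a list of (app, app_class, pes) currently holding resources.
    Returns (remaining_free, new_grants, evicted_apps).
    """
    app, app_class, need = request
    if free_pes >= need:
        return free_pes - need, grants + [(app, app_class, need)], []
    if app_class == CRITICAL:
        # Reclaim from best-effort holders until the request fits.
        reclaim, evicted, kept = free_pes, [], []
        for g in grants:
            if reclaim < need and g[1] == BEST_EFFORT:
                reclaim += g[2]
                evicted.append(g[0])
            else:
                kept.append(g)
        if reclaim >= need:
            return reclaim - need, kept + [(app, app_class, need)], evicted
    return free_pes, grants, []   # denied; best-effort never preempts
```

A fine-grained priority level would simply replace the two-class test with a numeric comparison when choosing eviction victims.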

e) User-space representing application [applications]: The RTRM tool allows kernel-space resident clients; however, it expects every client to have a corresponding controlling user-space application. For the purposes of the optimization policy, the user-space application will define resource access rights and priority.

f) Query resource availability [applications]: The system resources are shared among different applications which run concurrently and compete for their usage. Resource availability can change at runtime according to the working conditions, e.g., workload, hot-spots and failures. Thus, an application is required to query the RTRM tool to learn about resource availability. According to the resource availability, an application will have the chance to know exactly what QoS level can be obtained from the system. This requirement is mandatory only for best-effort applications. On the contrary, critical applications can always assume that their required resources are available, and it is the RTRM's responsibility to ensure that they are. Proper interfaces will be designed to allow clients to request information on resources by specifying:

• Resource class, e.g., processing elements, clusters, memory, communication channels

• Usage constraints, e.g., usage time, access policy, functional requirements (e.g., maximum latency between two PEs, minimum granted bandwidth, etc.)

g) Get and release resources [applications]: To provide system-wide optimization, the RTRM must always have an updated and complete view of the resources' state, and specifically of resource usage and availability. This information is required to properly support the resource accounting feature provided by the RTRM, and thus each user is required to notify when a resource is acquired and released. The “get” method obtains a reference to a “virtual resource”, which will be properly mapped to a “physical resource” by the RTRM according to its optimization strategy and the runtime working conditions. The “release” method notifies the RTRM that the resource is no longer needed by the requesting client and is thus available for other uses or optimizations.
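The get/release protocol could look like the following sketch, where the handle returned by `get` is opaque so the RTRM remains free to remap it behind the scenes; the class and method names are our own assumptions, not an actual 2PARMA interface:

```python
import itertools

class ResourceAccountant:
    """Tracks virtual handles and their private virtual-to-physical map."""

    def __init__(self, physical_pes):
        self._free = set(physical_pes)
        self._map = {}                  # virtual handle -> physical PE
        self._ids = itertools.count(1)

    def get(self):
        """Acquire one virtual PE, or None if nothing is available."""
        if not self._free:
            return None
        handle = next(self._ids)
        self._map[handle] = self._free.pop()
        return handle

    def release(self, handle):
        """Return the underlying physical PE to the free pool."""
        self._free.add(self._map.pop(handle))

    def available(self):
        return len(self._free)
```

Because clients only ever see handles, the accountant can later change which physical PE backs a handle without the client noticing, which is what enables the optimization strategies described here.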

h) Handle RTRM notifications [applications]: Resource availability can change at runtime due to changing working conditions. Applications that have access to some resources may be asked to adapt to these changing conditions. In this case, a notification is sent by the RTRM to the involved applications to give them the chance to reconfigure themselves.
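Such a notification path might be sketched as a simple callback registry; the names (`Notifier`, `decoder_adapt`) and the adaptation rule are invented for illustration:

```python
class Notifier:
    """Lets applications register a reconfiguration callback."""

    def __init__(self):
        self._callbacks = {}

    def register(self, app, on_change):
        self._callbacks[app] = on_change

    def resources_changed(self, new_budget):
        # Give every registered application a chance to reconfigure
        # by choosing a new working mode for the new budget.
        return {app: cb(new_budget) for app, cb in self._callbacks.items()}

def decoder_adapt(budget):
    # Example policy: drop to a low-quality mode below 4 PEs.
    return "hq" if budget >= 4 else "lq"

n = Notifier()
n.register("decoder", decoder_adapt)
```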

4.2 Covered requirements

A different set of requirements can be identified which must be satisfied by the RTRM in order to provide an effective implementation of the runtime manager.

a) Monitor resource performance: To properly perform resource and performance optimization, the RTRM will rely on an updated view of the usage and behavior of each subsystem, and its resources, at different levels of detail. This information could also be exposed on request to use-case clients, according to their rights. For example, the RTRM will probably collect statistics on processing element and memory usage, and information on communication channel congestion (bandwidth) or behavior (latency).

b) Dynamic resource partitioning: Each application may have a different impact on the overall user experience, which defines its “priority level”. Thus, the RTRM will provide support to handle both critical workloads, which have hard requirements in terms of resource usage and execution behavior, and best-effort workloads, for which a penalty either does not impact perceived behavior or produces a tolerated QoS degradation. Both classes of applications will access the available resources through a single runtime resource manager. Thus, the role of this tool is to account for available resources, and to grant access to these resources to requesting applications according to their priority level. In general, critical applications are profiled and optimized off-line, for example using DSE techniques, while best-effort applications will not be optimized. In the former case, the applications will have specific resource requirements that should be granted by the RTRM in order to meet the desired and designed behavior. In the latter case, however, a best-effort policy can be used, which cannot guarantee the runtime behavior. In any case, we want resource access to be granted fairly whenever resources are available. To efficiently manage this scenario, the resources can be dynamically partitioned, taking into consideration current QoS requirements and resource availability. This dynamic partitioning makes it possible to grant resources to critical workloads while dynamically yielding them to best-effort workloads when (and only while) they are not required by critical ones, thus optimizing resource usage and fairness.
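The yield-while-unused behavior reduces to a simple invariant, sketched below under the assumption that resources are counted in processing elements (the function and its arguments are hypothetical):

```python
def partition(total_pes, critical_demand, best_effort_demand):
    """Return (critical_grant, best_effort_grant).

    Critical workloads are served first; best-effort workloads may
    borrow only the slack that critical ones are not currently using.
    """
    crit = min(critical_demand, total_pes)
    slack = total_pes - crit            # yielded only while unused
    return crit, min(best_effort_demand, slack)
```

When the critical demand rises again, re-running `partition` shrinks the best-effort grant automatically, which is the preemption path described in the imposed requirements.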

c) Resource abstraction: An effective RTRM targeting mobile multimedia platforms cannot disregard the context of systems hosting multiple applications running concurrently on the same many-core computing fabric, each with its own application-specific requirements. In such a context, efficiently managing the available resources is a challenging goal. The mapping of applications onto the available resources may change during the device's lifetime, and the currently effective mapping should be based on specific quantitative metrics, e.g., throughput, memory bandwidth and execution latency. In parallel to the application-specific requirements, we must also address non-functional aspects such as power consumption, energy efficiency and thermal profiles. The last is especially relevant in the scaling technologies targeted by upcoming many-core computing platforms. The presence of these two types of requirements makes the mapping decision more complex. To address this problem, the RTRM tool will maintain a decoupled perspective of the resources between the users and the underlying hardware. The user applications see virtual resources, e.g., the number of processing elements available, but they are not aware of which physical resources are effectively available. At runtime the RTRM performs the virtual-to-physical mapping according to the current objective function (low power, high performance, etc.) and runtime phenomena (process variation, temporal and spatial temperature gradients, hardware failures and, above all, workload variation).
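As an illustration of virtual-to-physical mapping driven by one of these runtime phenomena, the sketch below picks physical processing elements by current temperature; the function is an assumption of ours, not a 2PARMA API:

```python
def map_virtual_to_physical(n_virtual, temperatures):
    """Map n_virtual PEs onto the coolest physical PEs.

    temperatures: physical PE index -> current temperature (degrees C).
    The application only knows it got n_virtual PEs; which physical
    elements back them is decided here and may change over time.
    """
    coolest = sorted(temperatures, key=temperatures.get)
    return coolest[:n_virtual]
```

Swapping the sort key (for power, failure history, or proximity) changes the objective without any change visible to the application, which is the point of the decoupling.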

d) Multi-objective optimization policy: The optimization policy should be system-wide and able to consider multiple metrics, e.g., power consumption, performance indexes and thermal gradients.

e) Dynamic optimization policy: The optimization policies at every abstraction level will be tunable at runtime to some extent, for instance to associate different levels of importance with the different optimization goals. Moreover, in the case of high-level and system-wide optimization policies, it will be possible to completely change the policy at runtime. This enables the design of multiple lightweight policies that fit specific optimization scenarios well.

f) Low runtime overheads: The overall RTRM overhead should not noticeably impact the performance of the controlled system, either on the host side or, especially, on the computing fabric side. This requirement will be satisfied by adopting a hierarchical design for the framework, with control policies distributed at different levels performing a refined optimization.

5. The Task Scheduling Component

To address the challenges introduced by future embedded computing, the task scheduling component needs to provide the following features:

• First, a variety of applications should be supported: mobile communications, networking, automotive and avionic applications, multimedia in the automobile and Internet applications interfaced with many embedded control systems. These applications may run concurrently, and may start and stop at any time. Each application may have multiple configurations, with different constraints imposed by the external world or the user (deadlines and quality requirements, such as audio and video quality, or output accuracy), different usages of various types of platform resources (processing elements, memories and communication bandwidth) and different costs (performance, power consumption).

• Second, a holistic view of platform resources should be supported. This is needed for global resource allocation decisions optimizing a utility function (also called Quality of user Experience - QoE), given the available platform resources. This QoE allows a trade-off, negotiated with the user, between diverse QoS requirements and costs.

• Third, the platform resource usage and the application mapping on the platform should be transparently optimized. This is needed to facilitate application development and to manage the QoS requirements without rewriting the application.

• Next, the task scheduling component should dynamically adapt to a changing context. This is needed to achieve high efficiency under a changing environment. QoS requirements and platform resources must be scaled dynamically (e.g., by adjusting the clock frequencies and voltages, or by switching off some functions) in order to control the energy/power consumption and the heat dissipation of the platform.

• Different heuristics should be allowed, since a single heuristic cannot be expected to fit all application domains and optimization goals.

• Finally, since the task scheduling component is intended for embedded platforms, only a lightweight implementation is acceptable. To address this challenge, this component should interface with design-time exploration to alleviate its runtime decision making.

The task scheduling component manages and optimizes the application mapping, taking into account the possible application configurations, the available platform resources, the QoS requirements, the application constraints, and the optimization goal. In the following we give an overview of the issues the task scheduling component focuses on in the 2PARMA approach.

a) Interface with design-time exploration: Whereas the functional specification of an application is fixed, there may be several specific algorithms or implementations for a given application. An application implementation can also take several forms (fixed logic, configurable logic, software) and offer different characteristics. These application configurations, with associated meta-data (e.g., QoS, platform resource usage, costs), are provided at design time to enable fast exploration during runtime decisions. The interface will be extended from [12].

b) Reconfiguration: Ideally, for any application, all functionalities should be accessible at any time. However, based on the user requirements, the available platform resources, the limited energy/power budget of the platform, and the target platform autonomy, it may not be possible to integrate all these functionalities on the platform at the same time. Hence the application developer has to organize the application into application modes, each one specifying a different subset of functionalities. Moving from one application mode to another may be needed at run time due to user interactions or to changes in platform resource availability when new applications are activated. Adapting the mapping of an active application is called dynamic reconfiguration or task migration. The key challenge is of course to maintain the real-time behavior and the data integrity of the overall set of active applications. On the one hand, dynamic reconfiguration is a powerful mechanism to improve platform utilization, avoiding situations where some computing resources are idle while others are overloaded. On the other hand, important issues are not only task state representation and message consistency during reconfiguration, but also the reconfiguration time, which limits the performance of the overall system. A high-level modeling and simulation framework for this reconfiguration issue will be developed.

c) Switching: Environment changes may give rise to a re-selection of application modes and configurations. The resulting switching must be seamless. To that end, switching points are introduced in the applications, where it is checked whether a switch is requested. A smooth transition from the current configuration to the new one is provided as follows. On the one hand, the RTRM can signal the concerned application at any time when a switch is requested. On the other hand, whenever the application reaches a switching point inside its code, it checks whether a switch is requested. If so, the application enters an interrupted state and transfers all relevant state information to the RTRM. The RTRM communicates the newly selected application configuration and the received state information to the concerned cores of the platform.
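The switching-point protocol can be sketched as below. All identifiers are illustrative: the RTRM raises a flag at any time, but the application only reacts at its switching points, where it snapshots the minimal state the RTRM needs to restart it under the new configuration:

```cpp
#include <atomic>

// Flag written asynchronously by the RTRM when a switch is requested.
std::atomic<bool> switch_requested{false};

// Minimal relevant state transferred to the RTRM at interruption.
struct AppState { int next_item; };

// Processes n items; at each switching point the request flag is checked.
// Returns the captured state (next_item == n means the run completed
// without being interrupted).
AppState run(int n) {
    AppState st{0};
    for (int i = 0; i < n; ++i) {
        // ... process item i under the current configuration ...
        st.next_item = i + 1;
        if (switch_requested.load()) break;   // switching point
    }
    return st;   // handed to the RTRM together with the interrupted status
}
```

Placing the check at item granularity keeps the state snapshot small and well-defined, which is what makes the transition seamless.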

d) Optimal selection of application configurations: The QoS requirements and the optimization goal are defined through the QoE manager. This goal is translated into an abstract mathematical function, called the utility function (e.g., performance, power consumption, battery life, QoS, or a weighted combination of them). A fast and lightweight priority-based heuristic, extended from [13], selects near-optimal configurations for the active applications. It works as follows:

• It selects exactly one configuration for each active application from the multi-dimensional set of configurations derived by the design-time exploration. The selection is done according to the available platform resources, in order to optimize the utility function while satisfying the application constraints.

• The heuristic cannot always guarantee a feasible solution, i.e., the selection of a set of configurations, one per active application, within the available platform resources. In that case, the constraints of the application with the lowest priority are relaxed and the heuristic is executed once again.
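The structure of such a priority-ordered selection can be sketched with a toy greedy pass. This is only a structural mirror of the heuristic of [13] (which is a multi-dimension multi-choice knapsack heuristic); the one-dimensional integer budget and all names are our assumptions:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// One design-time configuration: resource cost and utility contribution.
struct Config { int cost; int utility; };
struct App { int priority; std::vector<Config> configs; int chosen; };

// Picks one configuration per application, highest priority first, greedily
// maximizing utility within the remaining budget. Returns the total utility,
// or -1 when some application has no fitting configuration (the caller would
// then relax the lowest-priority application's constraints and retry).
int select_configs(std::vector<App>& apps, int budget) {
    std::vector<size_t> order(apps.size());
    std::iota(order.begin(), order.end(), (size_t)0);
    std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
        return apps[a].priority > apps[b].priority;   // high priority first
    });
    int total = 0;
    for (size_t idx : order) {
        App& a = apps[idx];
        a.chosen = -1;
        for (size_t c = 0; c < a.configs.size(); ++c)
            if (a.configs[c].cost <= budget &&
                (a.chosen < 0 ||
                 a.configs[c].utility > a.configs[a.chosen].utility))
                a.chosen = (int)c;
        if (a.chosen < 0) return -1;   // infeasible under current constraints
        budget -= a.configs[a.chosen].cost;
        total += a.configs[a.chosen].utility;
    }
    return total;
}
```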

e) Monitoring of application parameters: For many applications, the processing requirements to obtain results in real time can be impractical. This is due to the increasing complexity of the applications developed for advanced platforms. Nevertheless, a lower result quality might be acceptable if the results are obtained within the required time. The proposed technique works as follows. In order to trade result quality for the time required to obtain it, a set of well-chosen application parameters is tuned and monitored at run time once the time requirement is set by the user. The runtime monitoring technique is part of the application. It is intended to monitor the application behavior and act on the application parameters in order to meet the performance requirement while maximizing the result quality. This technique is called iteratively in the application to monitor the application execution time, check whether the deadline is met, and take decisions concerning the tuning of the application parameters. It modifies the application parameters either to improve the result quality if the deadline is met, or to lower the result quality if it is not. To enable efficient runtime decision making, the technique makes use of Pareto-optimal combinations of parameters, also derived by the design-time exploration.
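The iterative monitor-and-tune step can be sketched as a single quality knob moved along a Pareto-ordered parameter set. The `Tuner` name, the integer level index and the one-step adjustment rule are assumptions made for illustration:

```cpp
// Illustrative deadline-driven quality controller: level 0 is the highest
// quality Pareto point, max_level the lowest. Called iteratively from the
// application after each processing step.
struct Tuner {
    int level;       // index into Pareto-ordered parameter combinations
    int max_level;   // lowest-quality setting available

    void update(double exec_ms, double deadline_ms) {
        if (exec_ms > deadline_ms) {
            if (level < max_level) ++level;   // deadline missed: drop quality
        } else if (level > 0) {
            --level;                          // slack available: raise quality
        }
    }
};
```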

6. The Dynamic Memory Management Component

In [14], the design decisions that form the DMM design space have been presented and implemented inside a C++ library that enables the modular construction of every valid DMM configuration. Through automated exploration, combining differing design decisions generates customized DMM configurations. Each design decision can be viewed as a building block, and the DM manager is constructed by binding together DMM building blocks (Fig. 3). However, these solutions are fully static in the sense that the DMM mechanisms (i.e., the fitting mechanisms, the coalescing/splitting thresholds, etc.) selected to customize the DMM cannot be altered during runtime to adapt to workloads different from the simulated ones.

6.1 DMM Design Space

First we provide the basic DMM terminology that will be used throughout this Section.

Fig. 3: Static versus adaptive DMM.

• Heap: The heap refers to the memory pool responsible for the allocation and de-allocation of dynamic data (arbitrarily-sized data blocks, allocated in arbitrary order, that live an arbitrary amount of time). In the C/C++ programming languages, dynamic memory management is performed through the malloc/new functions for allocation and the free/delete functions for de-allocation, respectively.

• Heap Fragmentation: Fragmentation is defined as the maximum amount of memory allocated from the operating system divided by the maximum amount of memory required by the application. In multi-threaded memory allocators there are three types of fragmentation, namely internal fragmentation, external fragmentation and heap blow-up. Internal fragmentation [15] is generated when the DM manager returns a memory block that is larger than the initially requested size. External fragmentation [15] refers to the situation where a memory request cannot be served even though there are available memory blocks in the heap that could serve the request if they were merged. Finally, heap blow-up [16] is a special kind of fragmentation found in multi-threaded memory allocators.

• Heap Memory Footprint: The heap memory footprint refers to the maximum heap memory (allocated and freed) that the DM manager reserves. More precisely, it refers to the maximum memory that is occupied, taking into consideration the memory consumed by the block payload, the block header and the unused space resulting from padding and alignment.
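Internal fragmentation as defined above is easy to make concrete. The sketch below assumes, purely for illustration, an allocator that rounds request sizes up to the next power of two (a common size-class scheme, not necessarily the one used here); the wasted bytes are the difference between the returned and the requested size:

```cpp
#include <cstddef>

// Smallest power of two >= n (for n >= 1).
std::size_t round_up_pow2(std::size_t n) {
    std::size_t b = 1;
    while (b < n) b <<= 1;
    return b;
}

// Internal fragmentation of one allocation under the assumed size-class
// scheme: bytes returned minus bytes actually requested.
std::size_t internal_frag(std::size_t requested) {
    return round_up_pow2(requested) - requested;
}
```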

Fig. 4 shows the design/decision space concerning dynamic memory management. More specifically, we recognize the following taxonomy:

• Inter-Heap Categories:

– Architectural Scheme Category: It determines the way the dynamic memory allocator organizes and architects its heaps in order to exploit the available thread-level parallelism in memory management.

– Data Coherency Decisions Category: It deals with the existence (or not) and the structure of the synchronization mechanisms that ensure data coherency in each heap.

– Inter-Heap Allocation Decisions Category: It manages the way in which threads allocate memory at the inter-heap level. Allocation at this level is strongly connected with decisions concerning both the thread grouping used to share a heap and the thread-to-heap mapping. Allocation decisions of finer granularity, i.e., fit policies etc., are included in the intra-heap design space.

– Inter-Heap De-allocation Decisions Category: It includes the decisions concerning ownership-aware [16] de-allocation of each memory block and placement decisions for the de-allocated blocks.

– Inter-Heap Emptiness Decisions Category: It manages the potential memory blow-up of the multi-threaded application and considers decisions that reduce or bound the worst-case memory blow-up.

• Intra-Heap Categories:

– Intra-Heap Block Structure Category: It handles the data structures that organize the memory blocks inside each heap of the DMM.

– Intra-Heap Pool Organization Category: It defines the per-heap pool organization, i.e., single pool, one pool per size, traversing order, etc.

– Intra-Heap Block Allocation/De-allocation Categories: They deal with the operations that satisfy the DM allocation and de-allocation requests.

– Intra-Heap Splitting/Coalescing Categories: They formalize the decisions that handle the current block coalescing and splitting techniques [15], i.e., the threshold logic for coalescing and splitting the blocks.

The selection of certain decisions heavily affects the coherency of other decisions within a DM manager. Thus, DMM decisions exhibit various inter-dependencies, depicted as arrows in Fig. 4. We recognize two types of inter-dependencies. Excluding inter-dependencies (solid arrows) are generated when a DMM decision disables, either semantically or structurally, the incorporation of other categories or DMM decisions. Linked inter-dependencies (dashed arrows) are generated in cases in which a DMM decision affects other decisions but does not disable their use. For example, if the coalescing frequency is set to zero, the whole category K is automatically excluded (an excluding inter-dependency). However, the coalescing frequency in category K affects the splitting frequency in category J without eliminating its usage, since a DM manager that only splits memory blocks but never coalesces them is a semantically and structurally viable DMM solution.

Each possible DMM configuration can be generated by properly combining the available design decisions, with respect to the parameter inter-dependencies. In [14], the parameter space has been used within a design-time exploration procedure for generating application-specific MTh-DMM configurations. While in this paper we target the design of runtime adaptive DM managers, we briefly discuss the overall design space, since it forms the basis of our analysis regarding the selection of design decisions that can be configured during runtime. The DMM parameter space can be conceptually partitioned into two smaller sub-spaces, namely the inter-heap and the intra-heap sub-spaces. The inter-heap sub-space captures decisions that are shared between the threads, such as the overall heap organization, thread-to-heap mapping policies, etc. The intra-heap sub-space includes decisions that manage each instantiated heap's internal structure, i.e., the number of free-lists, the allocation mechanisms, etc. Each parameter sub-space is further partitioned into several decision categories (white boxes in Fig. 4), each including the DMM design decisions (colored boxes in Fig. 4).

Fig. 4: The multi-threaded dynamic memory management (MTh-DMM) design space.

6.2 Runtime Adaptive DMM

To enable runtime adaptivity of the DMM, we need to move towards the DMM solution shown on the right side of Fig. 3. Comparing the static and the runtime-tunable DMM schemes, we recognize three major extensions that need to be considered in order to move towards more adaptive configurations:

• Identification of the subset of the available design decisions that will be enhanced towards runtime reconfiguration (the identification of the runtime tuning knobs of the DM manager).

• Extension of the DM manager to provide runtime-monitoring data.

• Extraction of the decision-making process that the runtime controller will implement in order to generate the new DMM configuration.

In order to extract the runtime-tunable DMM parameters, we have to evaluate the switching overheads imposed during parameter reconfiguration. DMM parameter reconfiguration occurs among the parameters of the same design decision (colored boxes in Fig. 4). For example, the parameters found in the design decision regarding the allocation fitting algorithms (category H) include the first-fit algorithm, the best-fit algorithm with searching thresholds and the next-fit algorithm. Since a DM manager has to incorporate a fit algorithm in order to be able to reuse the already allocated memory, we have to evaluate the trade-offs associated with switching during runtime from one fit algorithm to another. The same evaluation has to be performed for each DMM design decision.
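To make the category-H alternatives concrete, the sketch below implements first fit and a thresholded best fit over a free list of block sizes. The flat `std::vector` free list and the `limit` parameter (modelling the search-percentage threshold as a number of entries) are simplifying assumptions:

```cpp
#include <cstddef>
#include <vector>

// First fit: return the index of the first free block large enough for the
// request, or -1 when none fits.
int first_fit(const std::vector<std::size_t>& free_list, std::size_t req) {
    for (std::size_t i = 0; i < free_list.size(); ++i)
        if (free_list[i] >= req) return (int)i;
    return -1;
}

// Best fit with a searching threshold: examine only the first `limit`
// entries (the search-percentage knob) and return the tightest fit among
// them, or -1 when none fits.
int best_fit(const std::vector<std::size_t>& free_list, std::size_t req,
             std::size_t limit) {
    int best = -1;
    for (std::size_t i = 0; i < free_list.size() && i < limit; ++i)
        if (free_list[i] >= req &&
            (best < 0 || free_list[i] < free_list[best]))
            best = (int)i;
    return best;
}
```

Note how lowering `limit` makes best fit degrade gracefully towards first-fit-like behavior, which is exactly the trade-off the numerical search-percentage parameter exposes.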

A designer who wants to extract the runtime-tunable parameters has to traverse the DMM design decisions and analyze the expected overheads in case runtime switching is considered. Our analysis identified two classes of DMM decisions, namely (i) the numerical DMM tunable parameters and (ii) the algorithmic DMM tunable parameters. Fig. 5 depicts the runtime-tunable DMM parameters along with the valid transitions that can be performed between them [17]. The rectangular boxes refer to the numerical DMM tunable parameters, while the ellipsoid boxes refer to the algorithmic DMM tunable parameters. The numerical DMM parameters are characterized as runtime-tunable since they can be tuned through a simple write to a memory address, which does not impose any severe switching overhead. The numerical DMM parameters recognized as tunable are the following: (i) the threshold values regarding the maximum allowable inter-heap emptiness (category E); (ii) the decisions regarding the number of free blocks moved to a global shared memory space whenever the inter-heap emptiness crosses the allowable threshold (category E); (iii) the numerical values that set the maximum percentage of the available free space that has to be searched by the best-fit algorithm (category H); (iv) the parameters that control the triggering of the splitting mechanism, i.e., the minimum block size above which splitting has to be performed (category J); and (v) the parameters that control the triggering of the coalescing mechanism, i.e., the maximum block size, resulting from the merging of two adjacent free blocks, above which no coalescing is triggered (category K).

The class of algorithmic DMM tunable parameters includes parameters that are algorithms but whose tuning does not affect the architecture of the instantiated DM manager. In this case, multiple DMM management policies and mechanisms are switched during runtime and applied to the same data structures in a mutually exclusive manner. More specifically, the analysis identified the following algorithmic DMM tunable parameters: (i) the thread-to-heap mapping algorithms in category C; (ii) the allocation search algorithms, i.e., FIFO, LIFO, etc. (category H); and (iii) the allocation fit algorithms, i.e., First Fit, Best Fit, Next Fit, etc. (category H). In order to enable such runtime algorithmic switching, a large portion of the C DMM library has been rewritten to completely decouple the DMM's data structures from their manipulation services. In addition, for this class of DMM tunable parameters no recompilation is needed, since the differing DMM mechanisms are compiled once to structure a specific DMM instance and can be switched with each other during runtime by writing a specific global variable that guides the internal paths of the malloc/free functions.

Fig. 5: DMM runtime-tunable parameters.
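The global-variable mechanism just described can be sketched as follows. The identifiers are illustrative, not the library's: both fit policies are compiled once into the allocation path, and the controller retargets it at run time by writing one variable, without recompilation and without touching the heap's data structures:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Policy selector written by the runtime controller; read on every
// allocation to guide the internal path of the allocator.
enum class FitPolicy { FirstFit, BestFit };
std::atomic<FitPolicy> g_fit{FitPolicy::FirstFit};

// Allocation path: picks a free block under whichever policy is currently
// set, over the same free-list data structure in both cases.
int pick_block(const std::vector<std::size_t>& free_list, std::size_t req) {
    int best = -1;
    for (std::size_t i = 0; i < free_list.size(); ++i) {
        if (free_list[i] < req) continue;
        if (g_fit.load() == FitPolicy::FirstFit) return (int)i;
        if (best < 0 || free_list[i] < free_list[best]) best = (int)i;
    }
    return best;   // tightest fit, or -1 when nothing fits
}
```

Because the policies only differ in how they traverse shared structures, switching is safe as long as it is applied in a mutually exclusive manner, as stated above.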

To keep track of the state of the DM manager during runtime, proper software monitoring mechanisms have been integrated. Through software monitors, runtime statistics are collected to provide the controller with useful information on different variables at each moment. Based on these runtime statistics, the controller makes the decision regarding the next DMM configuration (the value assignment to the runtime-tunable DMM parameters). In particular, through the DMM runtime statistics the controller keeps track of the actual fragmentation, and when a certain threshold is exceeded, it reconfigures the DM manager towards a fragmentation-aware configuration.

The monitor’s structure can be configured at design time. In particular, the following runtime statistics have been considered: (i) total memory footprint, (ii) requested memory per time-slot, (iii) actual memory per time-slot and (iv) per-heap memory.

Fig. 6: Comparative fragmentation study around worst-case execution windows.

The source code of the DMM library has been annotated at specific points with operations that update the DMMStats by writing the new values of the dmmStats fields.
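The annotation pattern can be sketched as below. The field and function names are illustrative stand-ins, not the library's actual dmmStats layout: allocation sites update a statistics record, and the controller triggers reconfiguration when a footprint-to-requested ratio (a simple fragmentation proxy) crosses a threshold:

```cpp
#include <cstddef>

// Illustrative statistics record updated from annotated allocator paths.
struct DMMStats {
    std::size_t requested = 0;   // bytes asked for by the application
    std::size_t footprint = 0;   // bytes actually reserved by the manager
};

// Annotation called at each allocation site.
void on_alloc(DMMStats& s, std::size_t req, std::size_t reserved) {
    s.requested += req;
    s.footprint += reserved;
}

// Controller check: true when the fragmentation proxy exceeds the threshold
// and the manager should switch to a fragmentation-aware configuration.
bool needs_reconfig(const DMMStats& s, double threshold) {
    return s.requested > 0 &&
           (double)s.footprint / (double)s.requested > threshold;
}
```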

6.3 Adaptive DMM Evaluation

We evaluate the effectiveness of our approach based on the Larson benchmark [18], which simulates the workload of a multi-threaded server. The adaptive DMM technique is used by this application through the invocation of the standard C APIs (malloc() and free()). Three different dynamic memory managers were used [17]: (1) a static performance-optimized DMM (StaticDMM1), employing a first-fit policy without splitting/coalescing mechanisms; (2) a static footprint/fragmentation-optimized DMM (StaticDMM2), employing a best-fit policy with a 100% search percentage and splitting/coalescing mechanisms with minimum split size; and (3) a runtime-tunable footprint/fragmentation-optimized DMM (AdaptiveDMM).

Fig. 6 compares the three examined DMMs during a time window around their worst fragmentation cases. The proposed runtime-tunable DMM represents an efficient intermediate solution between the two static ones, since AdaptiveDMM is much more efficient than StaticDMM1 and very close to StaticDMM2. In particular, AdaptiveDMM improves on StaticDMM1 by 25.1% and 69.9% on average footprint and fragmentation, respectively. More details can be found in [17].

7. Conclusion

In this paper we presented the design of a new runtime resource management framework at the OS level to support many-core computing platforms. The RTRM framework is composed of a set of modules providing services to different “classes of users”. We characterized the software metadata that QoS optimizations require, and we presented a methodology, as well as appropriate techniques, to obtain this information from the original application. In the 2PARMA approach, the RTRM is a component sitting between the applications and the target platform, in charge of managing application access to the available platform resources. Early experimental results showed that the presented adaptive DMM used by the 2PARMA RTRM is more efficient than StaticDMM1, with 25.1% and 69.9% gains on average footprint and fragmentation, respectively.

References

[1] T. G. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe, “The 48-core SCC processor: the programmer’s view,” in Proc. of SC. IEEE Computer Society, 2010, pp. 1–11.

[2] STMicroelectronics and CEA, “Platform 2012: A Many-core Programmable Accelerator for Ultra-Efficient Embedded Computing in Nanometer Technology,” 2010. [Online]. Available: http://www.cmc.ca/en/NewsAndEvents/~/media/English/Files/Events/20101105_Whitepaper_Final.pdf

[3] A. Sangiovanni-Vincentelli, “Quo vadis, SLD? Reasoning about the trends and challenges of system level design,” Proc. of IEEE, vol. 95, no. 3, pp. 467–506, 2007.

[4] P. Schaumont, I. Verbauwhede, K. Keutzer, and M. Sarrafzadeh, “A quick safari through the reconfiguration jungle,” in Proc. of DAC. ACM, 2001, pp. 172–177.

[5] E. Keller, G. Brebner, and P. James-Roxby, “Software decelerators,” in Field-Programmable Logic and Applications, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2003, vol. 2778, pp. 385–395.

[6] M. A. Al Faruque, R. Krist, and J. Henkel, “ADAM: run-time agent-based distributed application mapping for on-chip communication,” in Proc. of DAC. ACM, 2008, pp. 760–765.

[7] V. Nollet, T. Marescaux, P. Avasare, D. Verkest, and J.-Y. Mignolet, “Centralized run-time resource management in a network-on-chip containing reconfigurable hardware tiles,” in Proc. of DATE. IEEE Computer Society, 2005, pp. 234–239.

[8] N. Shirazi, W. Luk, and P. Y. K. Cheung, “Run-time management of dynamically reconfigurable designs,” in Proc. of FPL. Springer-Verlag, 1998, pp. 59–68.

[9] P. Kohout, B. Ganesh, and B. Jacob, “Hardware support for real-time operating systems,” in Proc. of CODES+ISSS. ACM, 2003, pp. 45–51.

[10] P. G. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, and G. Nicolescu, “Parallel programming models for a multi-processor SoC platform applied to high-speed traffic management,” in Proc. of CODES+ISSS. ACM, 2004, pp. 48–53.

[11] V. Nollet, D. Verkest, and H. Corporaal, “A safari through the MPSoC run-time management jungle,” Journal of Signal Processing Systems, vol. 60, pp. 251–268, 2010.

[12] C. Ykman-Couvreur, P. Avasare, G. Mariani, C. Silvano, and V. Zaccaria, “Linking run-time resource management of embedded multi-core platforms with automated design-time exploration,” IET Comput. Digit. Tech., vol. 5, no. 2, pp. 123–135, 2011.

[13] C. Ykman-Couvreur, V. Nollet, F. Catthoor, and H. Corporaal, “Fast multi-dimension multi-choice knapsack heuristic for MP-SoC run-time management,” ACM TECS, 2011.

[14] S. Xydis, A. Bartzas, I. Anagnostopoulos, D. Soudris, and K. Pekmestzi, “Custom multi-threaded dynamic memory management for multiprocessor system-on-chip platforms,” in Proc. of IC-SAMOS, 2010, pp. 102–109.

[15] P. R. Wilson, M. S. Johnstone, M. Neely, and D. Boles, “Dynamic storage allocation: A survey and critical review,” in Proc. of IWMM. Springer-Verlag, 1995, pp. 1–116.

[16] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson, “Hoard: a scalable memory allocator for multithreaded applications,” SIGPLAN Not., vol. 35, pp. 117–128, November 2000.

[17] S. Xydis, I. Stamelakos, A. Bartzas, and D. Soudris, “Runtime tuning of dynamic memory management for mitigating footprint-fragmentation variations,” in Proc. of PARMA Workshop. VDE Verlag, 2011, pp. 27–36.

[18] P.-A. Larson and M. Krishnan, “Memory allocation for long-running server applications,” in Proc. of ISMM. ACM, 1998, pp. 176–185.