
Exploiting Location-Aware Task Execution on Future Large-scale Many-core Architectures

Panayiotis Petrides and Pedro Trancoso
Department of Computer Science

University of Cyprus, Cyprus

Frederico Pratas, Leonel Sousa
SiPS, INESC-ID/IST

Universidade Técnica de Lisboa

Abstract

Multi-core designs have overcome the major limitations of previous monolithic processor designs. As the number of on-chip cores increases, so does the amount of instructions that can be executed concurrently, thus resulting in an increase of the overall memory bandwidth pressure. One approach to overcome this obstacle is to integrate multiple memory controllers on the same chip, increasing the available off-chip bandwidth. While this significantly improves the memory bandwidth, agnostic operating system policies and architectures, designed for best average performance, result in potential memory access bottlenecks. To tackle this problem, in this work we propose a bandwidth-aware scheduling policy that is able to assign applications to cores in order to achieve a better bandwidth balance among clusters of cores within the chip. The evaluation of the proposed scheduler was done using a workload composed of a mix of applications with different requirements on clustered multi-core processors with 32 to 128 cores. Experimental results indicate that a bandwidth-aware scheduling policy is able to achieve a performance benefit of up to 2x for most applications compared with the agnostic assignment policy.

1. Introduction

The de-facto standard in processor design is the multi-core architecture. This architecture offers the benefit of an increased degree of parallelism to provide better performance, without the drawbacks of the previous monolithic designs, such as high power consumption and complex design. As technology improves, the integration level increases, leading to an increase in the number of cores per chip. While this results in a further increase of the degree of parallelism, it may not necessarily lead to improved performance, even when considering highly parallel applications. The increasing number of processing units per chip results in a higher demand for “feeding” those units with both instructions and data. At the same time, neither the number

Figure 1: Clustered multi-core architecture.

of pins on the chip, nor the links to memory improve at the same rate as the number of cores. Thus we are faced with a new limitation, also known as the Bandwidth Wall [13].

Architects have tackled the Bandwidth Wall in two different ways. A first approach is to increase the size of the Last-Level Cache, usually shared by all cores in the chip at the lowest level of the on-chip memory hierarchy. Intuitively, a larger cache will better hold the required data, thus reducing the number of out-of-chip misses. Nevertheless, some applications do not have such a “good” behavior. In addition, large caches are usually slow and consume larger area (which could be used for more cores) and more power. Also, as the number of cores increases, the contention for that cache will increase, thus resulting in a higher access latency. A second approach is to integrate the memory controller into the chip and also to use multiple controllers. This integration significantly reduces the memory access latency, and the use of multiple controllers can also provide a better service for the memory requests. The cost of this solution is reflected in the die area required. In addition, it is still unclear how multiple memory controllers are managed on a large-scale multi-core chip. Intuitively, the memory requests are better satisfied with multiple controllers, as each of these handles the requests of a fixed number of cores. For

example, the recent Intel Single-chip Cloud Computer (SCC) [7] implements 48 Intel Pentium-like cores with a two-level cache hierarchy, divided into four groups of 12, known as clusters, each served by one of four on-chip memory controllers, as depicted in Figure 1. The memory controllers use DDR3-800 technology. The work presented herein uses a variant of this architecture with more recent Core2Duo cores as the baseline. Even though, on average, this type of architecture provides large memory bandwidth, it is possible to imagine a case where the OS assigns a set of applications with high bandwidth requirements to a particular cluster, placing more pressure on its controller.

Clearly, bandwidth management is becoming a serious issue as the number of cores increases. In order to address this issue, in this work we propose a Bandwidth-Aware Scheduler that has, as its main goal, to balance the bandwidth requests among the different controllers of a clustered multi-core architecture. The scheduler monitors the bandwidth utilization of all clusters and tries to balance it: when it detects a cluster that requires more bandwidth than is available, it migrates some of the more demanding applications to another cluster with lower bandwidth utilization. In order to design and evaluate the Bandwidth-Aware Scheduler we analyze the performance requirements for a diverse set of applications and propose a categorization of the applications in terms of memory bandwidth and execution time.

Overall, the main contributions of this work are the following:

• characterization of the bandwidth requirements for different applications;
• creation of bandwidth-aware application workloads;
• proposal of a bandwidth-aware scheduler;
• evaluation of the bandwidth-aware scheduler.

The results obtained from the experiments performed herein have been collected via binary instrumentation of real applications on a native system. They showed that a performance benefit of up to 2x is achieved for most applications, in comparison with the agnostic assignment policy. These results clearly show that bandwidth-awareness is necessary to achieve good performance in future large-scale systems, and that the proposed bandwidth-aware scheduler is a technique with considerable potential to address the bandwidth limitations of large multi-core systems.

The remainder of this paper is organized as follows. Section 2 presents the detailed architecture of the proposed scheduler. Section 3 presents the experimental setup, whereas Section 4 explains the methodology for categorizing the applications according to memory bandwidth and execution time. Section 5 presents the memory bandwidth evolution according to the size of a cluster. Section 6 presents the results obtained with the proposed scheduler. Finally, previous work is presented in Section 7 and Section 8 concludes this paper.

2. Proposed Scheduler

This Section presents the proposed bandwidth-aware scheduler. Its objective is to assign applications to cores on a multi-core processor, providing a balance between the available bandwidth and the applications’ requirements.

At a high level, this objective is achieved by monitoring the bandwidth utilization on the different clusters of cores. If a bandwidth violation is observed, i.e., the bandwidth utilization passes a certain threshold (the bandwidth supported by the dedicated controller), the scheduler will select the applications responsible for the bandwidth violation and will try to place them on another cluster with smaller bandwidth utilization.

The details of the scheduler’s procedure are presented in the flow diagram depicted in Figure 2. In this system, the execution of the applications is interrupted every N clock cycles, simultaneously on all cores. This N thus represents what is known as the time quantum of the system, or execution phase. At that point the scheduler is executed.

The first step of the scheduler is to determine the distribution of the applications on each cluster, i.e., for the quantum that has completed, how many applications have Low, Medium, and High bandwidth demands on each cluster (represented as Li:Mi:Hi). For this classification the scheduler estimates the bandwidth demand by using the number of Last-Level Cache (LLC) misses per application per quantum. This value is obtained from the performance counters [11] of each core. Together with the Bandwidth Threshold Values, which have been predefined using several different applications on the current machine (see Section 4 for a detailed presentation), we classify each application’s bandwidth demand. In the second and third steps the scheduler adds the bandwidth estimations for the applications and normalizes the value considering the maximum bandwidth supported by the cluster, to determine the bandwidth utilization of each cluster (UBWi for cluster i). At this point, the scheduler is able to determine if the bandwidth limit has been exceeded in any of the clusters. Given that this limit is physical, in practice this means that the applications in such a cluster have been delayed due to bandwidth contention. If no cluster exceeded its bandwidth limit, the scheduler completes its task and the applications are all resumed on the same cores as before to improve locality.
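As a concrete illustration, the classification and utilization steps above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the quantum length and the 6.4 GB/s cluster limit used below are assumptions, while the line size and the 0.27/0.82 GB/s thresholds come from Sections 3 and 4.

```python
# Sketch of the scheduler's first steps: LLC misses per quantum ->
# estimated bandwidth -> L/M/H class and per-cluster utilization UBW_i.

LINE_BYTES = 32                   # cache line size of the simulated hierarchy
QUANTUM_S = 0.01                  # assumed quantum length in seconds
LOW_T, HIGH_T = 0.27e9, 0.82e9    # bandwidth thresholds (B/s) from Section 4

def bandwidth_demand(llc_misses):
    """Estimated off-chip demand: one cache line fetched per LLC miss."""
    return llc_misses * LINE_BYTES / QUANTUM_S

def classify(llc_misses):
    """Map an application's LLC miss count to its L/M/H bandwidth class."""
    bw = bandwidth_demand(llc_misses)
    if bw < LOW_T:
        return "L"
    if bw > HIGH_T:
        return "H"
    return "M"

def cluster_utilization(per_app_llc_misses, cluster_max_bw):
    """UBW_i: summed estimated demand over the cluster's peak bandwidth."""
    total = sum(bandwidth_demand(m) for m in per_app_llc_misses)
    return total / cluster_max_bw
```

A utilization above 1.0 corresponds to the bandwidth violation that triggers the adaptation procedure.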

If one or more clusters have exceeded the bandwidth limit, the scheduler will try to move applications around to obtain a better bandwidth balance among the clusters. This is achieved with the steps within the “Adaptation Procedure” in Figure 2. The heuristic used tries to balance the clusters by exchanging applications from the cluster with the highest bandwidth utilization with applications from the

Figure 2: Flow chart of the Bandwidth-Aware Scheduler.

cluster with the lowest bandwidth utilization. Consequently, the first step is to determine which clusters have the maximum and minimum bandwidth utilizations, cluster a with UBWa and cluster b with UBWb, respectively. In the next steps the scheduler evaluates the application exchanges before it actually performs them. In an exchange scenario the scheduler will pick a group of applications, as the problem is usually caused by a set of applications, thus avoiding multiple consecutive exchanges. Thus, the exchange will remove from the considered cluster applications of a certain bandwidth category and exchange them with applications of lower bandwidth demands. As such, for cluster a, the application distribution will change from the original La:Ma:Ha to the new distribution La’:Ma’:Ha’. As the scheduler is evaluating the potential of exchanging applications from cluster a to b and vice-versa, there is a fixed set of exchanges that are possible. The scheduler uses a Compatibility Distribution Lookup Table, which has been previously constructed for the current system setup, to find out the new valid application distributions. This table is looked up using the current a and b distributions. In an iterative way, the scheduler evaluates all new valid distributions for a and b (Lai:Mai:Hai and Lbi:Mbi:Hbi) by determining the bandwidth utilization for each (UBWai and UBWbi) and finally the Bandwidth Balance, which is a metric that accounts for the bandwidth of the combined clusters, i.e., the sum of the bandwidth that exceeds the limit for each cluster. After all possible valid distributions are evaluated, the scheduler picks the distribution that predicts the minimum Bandwidth Balance. If the new state of the selected distribution achieves a balance that is better than that of the current distribution, then the scheduler migrates the necessary applications to achieve the new distribution in both clusters and the execution is resumed. Otherwise, the scheduler repeats the same procedure to evaluate the exchanges as described above, but between the same cluster a and a new cluster b, the one with the second-lowest Bandwidth Utilization.
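The exchange evaluation can be sketched as below. The per-class representative bandwidths are hypothetical values chosen for illustration; the Bandwidth Balance follows the definition above, i.e., the sum of the per-cluster excess over the limit.

```python
# Sketch of the exchange evaluation (illustrative, not the authors' code).
# A distribution is a tuple (nL, nM, nH); per_class_bw gives an assumed
# representative bandwidth (B/s) for each class.

def utilization(dist, per_class_bw, limit):
    """UBW: summed per-class demand normalized by the cluster limit."""
    return sum(n * bw for n, bw in zip(dist, per_class_bw)) / limit

def bandwidth_balance(dist_a, dist_b, per_class_bw, limit):
    """Sum of the utilization exceeding the limit on each cluster."""
    return sum(max(0.0, utilization(d, per_class_bw, limit) - 1.0)
               for d in (dist_a, dist_b))

def best_exchange(candidates, per_class_bw, limit):
    """Candidate (dist_a, dist_b) pair with the minimum Bandwidth Balance."""
    return min(candidates,
               key=lambda ab: bandwidth_balance(ab[0], ab[1],
                                                per_class_bw, limit))
```

Spreading the High-bandwidth applications across both clusters reduces the combined excess, which is exactly what the minimum Bandwidth Balance selects.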

While the procedure above seems time consuming, the description corresponds to the complete scheduler procedure, which in practice can be simplified. For example, instead of testing all possible valid new distributions, the scheduler marks some distributions as having higher potential and only tests those. Similarly, if the exchange between the clusters with maximum and minimum bandwidth utilization does not provide a better balance, the scheduler gives up and avoids checking other possible exchanges.

Finally, it is relevant to notice that in some cases more than one cluster may have a utilization that exceeds the limit. In such a case, the scheduler, as described above, follows a conservative approach and first handles the cluster with the highest utilization. The other clusters will eventually be handled in the following quanta, one at a time.

3. Experimental Setup

A series of steps have been followed in order to evaluate the proposed bandwidth-aware scheduler. Firstly, a set of different real sequential applications was selected. These applications were simulated using a binary instrumentation tool and the obtained results were used to analyze the memory bandwidth requirements. The bandwidth results allowed us to analyze the effects of the proposed scheduler for different configurations of the target architecture.

The applications used herein represent a wide variety of different workloads of real applications. A brief description of the main characteristics of the selected applications is presented in Table 1. The TPC-H benchmark queries are workloads based on decision support systems applied to large data sets; sixteen queries with different complexities were used and the respective results presented in this document use the tpch Qx nomenclature, where x denotes the query identifier.

The DEX application is a high-performance graph database querying system oriented to handle large graph databases. DEX includes 8 different queries which can be

Table 1: Applications Description.

Application/Benchmark   Brief Description
TPC-H                   Decision support benchmark [16]
DEX                     A graph-based database query application [10]
MrBayes                 Bioinformatics application that performs Bayesian inference of phylogeny [14]
Biobench                Benchmark suite containing different bioinformatic algorithms [1]
NAMD                    Computer chemistry application for molecular dynamics simulation [12]
PARSEC                  Benchmark of representative applications from different areas [3]

executed independently or jointly for different workloads. In this case we have selected the most intensive queries, 4 and 8, and we also used the joint configuration for the workload scale factors 3, 4 and 5. The results obtained are named according to dex y Qx, where x represents the query number and y the scale factor; dex y All is used to distinguish the joint configurations.

As representatives of scientific computing applications we selected MrBayes and Biobench as bioinformatics applications and NAMD as a computational chemistry application. In particular, MrBayes performs Bayesian inference of phylogeny. Seventeen DNA data sets with various sizes were used as input workloads and the respective results are identified as mrbayes x, where x is related to the data set size. Regarding the Biobench benchmark suite, it consists of bioinformatics applications which are based on data mining and are applied to large amounts of data; namely, we have considered as representatives of this benchmark phylip protdist, phylip protpars, fasta dna, fasta protein and hmmer with the respective default workloads. NAMD is a molecular dynamics simulation application for high-performance simulation of large biomolecular systems; versions with different precision arithmetics, namd single and namd double, were considered.

Finally, to cover other application areas we selected PARSEC as it includes various benchmarks from areas such as computer vision, video encoding, financial analytics and image processing. The considered representatives for this benchmark were blackscholes, streamcluster and freqmine.

We firstly monitored the execution of the applications in order to determine the off-chip memory references. For these experiments we have used the Pin Binary Instrumentation tool [9] for simulating the cache memory of a processor by monitoring the memory addresses referenced by each application. More specifically, we have used Pin 2.7.31933 for Linux and x86 architectures. In order to simulate the memory cache of a processor we have developed a tool for Pin which simulates a two-level cache with the following characteristics: a split first-level cache for instructions and data, each with a total size of 32KB, 32B cache line and associativity of 4, and a unified level-two cache with a

total size of 2MB, 32B cache line size and associativity of 8, with a write-back, write-allocate policy. In order to capture the execution phases of the applications we have included in our tool a counter to monitor the executed instructions, and recorded results for each phase, namely the number of memory hits and misses, the number of memory accesses at each cache level, and the total number of executed instructions.

After collecting these results for the different applications, the off-chip bandwidth demands for each phase and the total execution time of the applications are computed by considering the characteristics of the Intel Core 2 Duo architecture (in particular the Intel Q9559 at 2.83GHz). In more detail, we have considered the following: an instruction latency of 1 cycle; a latency of 3 cycles for the L1 cache, with three simultaneous accesses; a latency of 14 cycles for the L2 cache, with ten simultaneous accesses; and a latency of 183 cycles for the off-chip memory, with ten simultaneous accesses. Both these calculations and the results obtained for the different scheduling policies used as inputs the parameters from the Pin Binary Instrumentation tool, and the scheduler was simulated according to the flow chart presented in Figure 2.
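Under one possible reading of this latency model (effective cycles equal latency divided by the number of simultaneous accesses, which is our assumption, not something the text states explicitly), the per-phase time and bandwidth computation might look like:

```python
# Assumed performance model for one execution phase, built from the
# latencies listed above for the 2.83 GHz Core 2 Duo baseline.

FREQ_HZ = 2.83e9    # baseline clock frequency
LINE_BYTES = 32     # cache line size of the simulated hierarchy

def phase_cycles(instructions, l1_accesses, l2_accesses, mem_accesses):
    """Estimated cycles for a phase under the latency/overlap model."""
    return (instructions * 1          # 1 cycle per instruction
            + l1_accesses * 3 / 3     # 3 cycles, 3 simultaneous accesses
            + l2_accesses * 14 / 10   # 14 cycles, 10 simultaneous accesses
            + mem_accesses * 183 / 10)  # 183 cycles, 10 simultaneous accesses

def phase_bandwidth(mem_accesses, cycles):
    """Off-chip bandwidth in B/s: line-sized transfers over phase time."""
    return mem_accesses * LINE_BYTES / (cycles / FREQ_HZ)
```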

4. Bandwidth Characterization

In order to perform our analysis we started by classifying the different applications mentioned in the previous section according to two dimensions: execution time and memory bandwidth requirements. Three classes were considered for each dimension, as depicted in Figure 3 and presented in Table 2. It is relevant to notice that the characterization of the applications in terms of bandwidth requirements is not independent of the applications’ execution time. For example, the impact of an application with high bandwidth requirements but short execution time is much smaller than that of an application with high bandwidth requirements and long execution time. Consequently, the applications are classified according to their execution time into: short, for less than 10 seconds; medium, between 10 and 100 seconds; and long, above 100 seconds. Moreover, during the execution of an application one can observe phases of

Figure 3: Classification of the applications according to the execution time and memory bandwidth.

high memory bandwidth demand and phases of low and/or medium memory bandwidth demand. In order to compute the average bandwidth requirements of each application, a weighted average is applied to reflect the contribution of these different bandwidth phases of the applications. Thus, we classify the applications according to their memory bandwidth requirements by calculating the overall bandwidth average for all applications and setting the following thresholds: (i) high bandwidth demanding applications are those that have an average bandwidth 50% above the overall average (namely above 0.82 GB/s), (ii) low bandwidth demanding applications are those with an average bandwidth 50% below the overall average (namely below 0.27 GB/s), and (iii) medium bandwidth demanding applications are those with an average bandwidth between the previous two thresholds.
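A minimal sketch of this two-dimensional classification, using the thresholds quoted above; the phase representation and function names are illustrative assumptions:

```python
# Illustrative classification: time-weighted average bandwidth over
# phases, then thresholding on both dimensions (GB/s, seconds).

def average_bandwidth(phases):
    """Time-weighted average over (bandwidth_GBs, duration_s) phases."""
    total_time = sum(t for _, t in phases)
    return sum(bw * t for bw, t in phases) / total_time

def classify(avg_bw_gbs, exec_time_s):
    """Return the (execution-time class, bandwidth class) pair."""
    bw = ("low" if avg_bw_gbs < 0.27
          else "high" if avg_bw_gbs > 0.82
          else "medium")
    t = ("short" if exec_time_s < 10
         else "medium" if exec_time_s <= 100
         else "long")
    return t, bw
```

The weighting matters: a brief 1 GB/s burst inside a mostly idle run still averages out to a medium or low classification.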

As it is possible to observe from Figure 3, several applications fall into the same category. In order to reduce the exploration space in our analysis, a representative application was chosen for each category; these are marked in bold in Table 2. The representatives (R) were selected by calculating the distance of the applications to the “center” (C) of each class, and selecting the minimum as given by the expression

R = min(d_i), i = 1, · · · , N (1)

where N represents the number of applications in the class and d_i the distance between application i and C. C is the point at the density center of each class, i.e., according to the distribution of the corresponding applications. Considering each pair (x_i, y_i) = (bandwidth_i, time_i) as the Cartesian coordinates for every application i, then we have C = (x′, y′), where x′ and y′ are given by the expressions

x′ = (∑_i x_i) / N,  y′ = (∑_i y_i) / N,  i = 1, · · · , N (2)

and the distances di are calculated according to

d_i = √(|x_i − x′|² + |y_i − y′|²) (3)
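Equations (1)–(3) translate directly into code; the sketch below returns the index of the representative within a class (the function names are ours):

```python
# Representative selection: the application whose (bandwidth, time)
# point lies closest to the class centroid C, per Equations (1)-(3).
from math import sqrt

def centroid(points):
    """C = (x', y'): the mean of the class's (x_i, y_i) coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n,
            sum(y for _, y in points) / n)

def representative(points):
    """Index i minimizing d_i = sqrt((x_i - x')^2 + (y_i - y')^2)."""
    cx, cy = centroid(points)
    d = [sqrt((x - cx) ** 2 + (y - cy) ** 2) for x, y in points]
    return d.index(min(d))
```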

5. Multi-core Execution within Bandwidth Limits

One of the objectives of this work is to investigate how the bandwidth limitations affect the execution of the applications when executed simultaneously. In a first step, the overall bandwidth requirements of the applications were evaluated by executing a scenario where all the cores are running the same application. In order to combine applications in the same cluster of cores, or even in different clusters, some assumptions had to be made. First of all, we considered combinations of applications of the same execution time category but of different bandwidth categories. This assumption was made in order to avoid asymmetry between short and long applications. Since applications may have different numbers of executed instructions, applications with a smaller number of instructions continue their execution until the completion of the longest application.

In Figure 4 we show the bandwidth requirements for different application distributions within a cluster of cores, i.e., the number of cores on the x-axis is served by a single memory controller. Thus, the X:Y:Z labels in Figure 4 denote the ratio of Low, Medium and High applications executing within the cluster. The results show the linear increase of the off-chip bandwidth requirements with the number of cores per cluster, which is of major impact to the scalability of multi-core processors. Moreover, for different application mixes, the bandwidth demands behave differently, and the controller (see dashed line) limits the scalability of the architecture; more specifically, we can see that for a 33:33:33 mix it does not saturate before 8 cores, while for the 00:00:100 mix it does. In order to show the impact of the overall bandwidth requirements on the execution time of the applications, we have calculated, for the configura-

Figure 4: Bandwidth requirements of application combinations according to the number of cores, for short and long executions.

Table 2: Applications Categorization Results according to Execution Time - Bandwidth Requirements (representatives are marked in bold)

Short - Low: tpch Q3, tpch Q6, tpch Q7, tpch Q8, tpch Q12, tpch Q13, tpch Q16, dex 3 Q4, dex 3 Q8, dex 3 All, dex 4 Q4, dex 4 Q8, dex 5 Q8, dex 5 All, namd single
Short - Medium: tpch Q10, tpch Q14, tpch Q15
Short - High: tpch Q11
Medium - Low: dex 4 All, dex 5 Q4, namd double, freqmine
Medium - Medium: tpch Q1, tpch Q2
Medium - High: tpch Q9, streamcluster, mrbayes 10x5000, 10x20000, 20x5000, 50x1000, 50x5000, 100x1000
Long - Low: phylip protdist, fasta dna, fasta protein
Long - Medium: hmmer, phylip protpars
Long - High: mrbayes 10x50000, 20x20000, 20x50000, 50x20000, 100x5000, 100x20000, 100x50000

Figure 5: Extra phases for the DDR3-800 (6.4 GB/s) configuration: (a) short applications, (b) long applications.

tion of the 6.4 GB/s off-chip bandwidth, the number of extra phases that the applications have to sustain due to the limited available bandwidth per cluster. This calculation was made considering the results of the different application distributions presented previously and by checking, for each phase, whether the overall bandwidth demand is satisfied by the available bandwidth or not. If it is not satisfied, we calculate how many extra phases are required to satisfy the bandwidth requirements. To determine this value we assume that the execution of the applications is interrupted until the bandwidth overflow is satisfied, and also that in each phase we are able to satisfy up to the available bandwidth. We consider the whole execution of the applications in order to determine the total phases of penalty. From the results depicted in Figure 5 it is obvious that, as the number of cores in the cluster increases, there is a linear increase in the extra phases. It is important to mention that Short and Medium execution time applications are relatively more influenced by the available bandwidth in terms of extra phases, when compared to the Long execution time applications.
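One way to sketch this extra-phase computation, under our interpretation of the text (overflow beyond the limit is served in whole additional phases, each providing at most the available bandwidth):

```python
# Assumed extra-phase accounting: per phase, demand above the available
# per-cluster bandwidth spills into additional whole phases.
from math import ceil

def extra_phases(phase_demands_gbs, available_gbs=6.4):
    """Total extra phases over a whole execution (DDR3-800: 6.4 GB/s)."""
    extra = 0
    for demand in phase_demands_gbs:
        if demand > available_gbs:
            # the overflow is drained at up to the available bandwidth
            extra += ceil((demand - available_gbs) / available_gbs)
    return extra
```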

From this analysis we have shown that the scalability of multi-core processors is highly dependent on the off-chip memory bandwidth.

6. Results with the Proposed Scheduler

As stated in the beginning of the previous section, the execution starts with a random distribution of the applications among the available cores. Then, the proposed scheduler is responsible for adapting the workload among the different clusters according to the bandwidth demands of the applications. In this section, to evaluate the proposed scheduler (Bandwidth-Aware), we performed several experiments. For each of the four on-chip clusters, different distributions of the applications have been experimented with. Moreover, different cluster configurations were considered, by taking into account combinations of 8, 16 and 32 cores per cluster. Regarding the applications, because the number of different distributions over the available clusters increases rapidly with the number of cores, we have restricted the exploration space to some representative scenarios. More specifically, we restricted the application distributions at two levels: global chip application distribution, and distribution within a cluster of cores. In both cases we in-

[Figure 6 here: four bar charts, panels (a) short 50:50:00, (b) short 50:00:50, (c) short 00:50:50, and (d) short 33:33:33, plotting Execution Time [Extra Phases] against Cluster Configuration [#clusters x #cores-per-cluster] (4 x 8, 4 x 16, 4 x 32) for the Random, Bandwidth-Aware, and Oracle schedulers.]

Figure 6: Extra phases for the Random, Oracle, and proposed Bandwidth-Aware scheduling policies, for combinations of short applications.

[Figure 7 here: four bar charts, panels (a) long 50:50:00, (b) long 50:00:50, (c) long 00:50:50, and (d) long 33:33:33, plotting Execution Time [Extra Phases] against Cluster Configuration [#clusters x #cores-per-cluster] (4 x 8, 4 x 16, 4 x 32) for the Random, Bandwidth-Aware, and Oracle schedulers.]

Figure 7: Extra phases for the Random, Oracle, and proposed Bandwidth-Aware scheduling policies, for combinations of long applications.

Table 3: Example of a Compatibility Distribution Lookup Table (L:M:H)

Initial Distribution    Next Possible Distributions
00:00:100               00:50:50, 50:00:50
00:100:00               00:50:50, 50:50:00
100:00:00               50:00:50, 50:50:00
00:50:50                00:00:100, 00:100:00
50:00:50                00:00:100, 100:00:00
50:50:00                00:100:00, 100:00:00

In both cases we investigate the different distributions for the following scenarios: i) executing only High, or Medium, or Low bandwidth-demanding applications (i.e., 00:00:100, 00:100:00, and 100:00:00); ii) executing evenly High-Medium, High-Low, or Medium-Low bandwidth-demanding applications (i.e., 00:50:50, 50:00:50, and 50:50:00); and iii) executing evenly High, Medium, and Low bandwidth-demanding applications (i.e., 33:33:33). Distributions of only Low, Medium, and High bandwidth applications are not shown for the global chip analysis, since in that case the overall bandwidth distribution of the system is the same for every cluster, i.e., it is always balanced.

To implement the proposed scheduler we also had to define the bandwidth thresholds used internally during the monitoring period for tagging each phase of the applications, according to the categories presented in Section 4. Moreover, we also defined the Compatibility Distribution Lookup Table for the different cluster configurations. Table 3 shows an example lookup table used for the chip configurations. This table can include other ratios, according to the granularity adopted by the scheduler to move applications between clusters. There is a tradeoff between the degree of adaptability achieved and the complexity of the adaptation procedure, as it has to consider more possibilities. In this case we determined the nearest distribution of applications that can be migrated between clusters. For example, a cluster executing only Low bandwidth applications (100:00:00) can switch to a distribution of evenly High and Low (50:00:50) or Medium and Low (50:50:00) bandwidth applications.
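As an illustration, the threshold tagging and table-driven adaptation described above can be sketched as follows. This is a minimal sketch: the threshold values and function names are hypothetical, not taken from the actual implementation.

```python
# Sketch of the Compatibility Distribution Lookup Table (Table 3).
# A distribution "L:M:H" gives the percentage of Low-, Medium-, and
# High-bandwidth-demanding applications running on a cluster.
COMPATIBILITY = {
    "00:00:100": ["00:50:50", "50:00:50"],
    "00:100:00": ["00:50:50", "50:50:00"],
    "100:00:00": ["50:00:50", "50:50:00"],
    "00:50:50":  ["00:00:100", "00:100:00"],
    "50:00:50":  ["00:00:100", "100:00:00"],
    "50:50:00":  ["00:100:00", "100:00:00"],
}

# Hypothetical per-phase bandwidth thresholds (GB/s) for tagging a
# phase as Low, Medium, or High demanding -- values are illustrative.
LOW_THRESHOLD, HIGH_THRESHOLD = 2.0, 8.0

def tag_phase(measured_bw):
    """Tag a monitored phase according to its measured bandwidth."""
    if measured_bw < LOW_THRESHOLD:
        return "Low"
    if measured_bw < HIGH_THRESHOLD:
        return "Medium"
    return "High"

def next_distributions(current):
    """Return the nearest distributions a cluster may switch to."""
    return COMPATIBILITY[current]

# A cluster running only Low-bandwidth applications (100:00:00) may
# switch to evenly High/Low (50:00:50) or Medium/Low (50:50:00).
print(next_distributions("100:00:00"))  # ['50:00:50', '50:50:00']
```

A finer-grained table (e.g., including 33:33:33 entries) would give the scheduler more migration options at the cost of a larger search at each adaptation step.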

Finally, for the different defined distributions and scenarios we obtained the penalty, in extra phases, that an application pays under each scheduling policy. For comparison purposes, in addition to the proposed scheduler we also consider two other schedulers: i) a non-bandwidth-aware random static scheduler (Random), representing a common scenario and used as the baseline, as it is close to existing agnostic policies; and ii) a scheduler that takes into account a priori the overall application bandwidth characteristics to define the best static placement of the applications across the chip (Oracle), representing the perfect static scenario with minimum penalty. For the Random approach we determined the extra-phases penalty for the different possible distributions and used the results to calculate the average, the best, and the worst case scenarios. The Oracle case translates into only one solution, as it always uses the best case scenario, which corresponds to the best distribution of the Random case.
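The relation between the three comparison points can be made concrete with a small sketch; the penalty numbers below are illustrative, not measured values from the paper.

```python
# Extra-phases penalty measured for each possible application
# placement under the Random scheduler -- numbers are illustrative.
random_penalties = [42, 18, 30, 55, 25]

# Random is reported as the average, best, and worst case over all
# possible placements it may pick.
random_avg = sum(random_penalties) / len(random_penalties)
random_best = min(random_penalties)
random_worst = max(random_penalties)

# The Oracle always picks the best static placement, so its penalty
# coincides with the best case of the Random scheduler.
oracle_penalty = random_best

print(random_avg, random_best, random_worst)  # 34.0 18 55
```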

From the results depicted in Figures 6 and 7, we can observe that the results for the proposed scheduler are, for the majority of the experiments, very close to those obtained with the Oracle scheduler. We can also observe a significant penalty reduction, highlighting the importance of the proposed scheduler for both short- and long-running applications. Another important observation is the high impact of the Bandwidth-Aware scheduler versus Random scheduling across the different configurations. More specifically, we observe high variations when using the Random scheduling technique, which translate into a higher extra-phases penalty than the penalty observed for architectures with uniform clusters. Notice that the Bandwidth-Aware scheduler always achieves a number of extra phases that matches or is near the lowest value of the Random scheduler. The same can be observed in the other charts, with the average variance for the Random scheduler being 1.9x. Our results also show the adaptiveness of the Bandwidth-Aware scheduler to different architectures: even when varying the core configuration, the scheduler continues to perform well.

The described observations are also corroborated by the estimated speedup results depicted in Figures 8(a) and 8(b), for the short and long applications respectively. These figures show the speedup results when compared to the average results of the Random approach. We can see that the difference between the Oracle and the Bandwidth-Aware schedulers is not very significant, being on average less than 5%, and in general the Bandwidth-Aware scheduler tends to be better than the Oracle for longer applications. This happens because the Bandwidth-Aware scheduler is able to refit the applications during the whole execution, adapting the workload to the system according to their needs, while the other two schedulers are static and do not account for on-the-fly bandwidth variations. Moreover, the obtained speedups also show that, in general, a significant performance improvement over the common scenario can be achieved when the proposed scheduler is used. Namely, we obtained average speedups of 1.35x for the low class and 1.34x for the high class. Overall, speedups of up to about 2x are achieved, and only in two cases does our proposed solution perform similarly to the average Random. Thus, these results show that a bandwidth-aware scheduler is a valuable approach to improve both global performance and system predictability.
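The speedup metric used above is simply the ratio of the Random-average execution time to the scheduler's execution time, both measured in phases; a small sketch with illustrative numbers:

```python
# Speedup relative to the average Random placement:
# speedup = T_random_avg / T_scheduler (times measured in phases).
def speedup(t_random_avg, t_scheduler):
    return t_random_avg / t_scheduler

# Illustrative phase counts for one workload mix (not measured data).
t_random_avg = 90.0   # average over all Random placements
t_bw_aware = 45.0     # Bandwidth-Aware result
print(speedup(t_random_avg, t_bw_aware))  # 2.0
```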

[Figure 8 here: two bar charts, panels (a) short class and (b) long class, plotting Speedup relative to the Random average against Cluster Configuration [#clusters x #cores-per-cluster] (4 x 8, 4 x 16, 4 x 32) for the application combinations 00:50:50, 33:33:33, 50:00:50, and 50:50:00, comparing the Oracle and Bandwidth-Aware schedulers.]

Figure 8: Speedup results obtained with different combinations of applications; from left to right: 00:50:50, 33:33:33, 50:00:50, 50:50:00.

7. Related Work

The latest generations of multi-core processors used a single off-chip memory controller to connect the processors, usually through a common bus, e.g., the Front Side Bus in Intel architectures. This type of architecture creates a bottleneck in the scalability of memory bandwidth. Newer processor designs, like the Intel Nehalem microarchitecture or the AMD Opteron (Shanghai) processors [2], integrate the memory controllers on-chip to improve the memory latency and the bandwidth scalability. An interesting study comparing the two approaches, off-chip and integrated memory controllers, is presented in [6]. In this study the authors compare the Intel Core 2 (Penryn model) with the Intel Nehalem processors, showing that a single-channel Nehalem processor (which supports a triple-channel memory controller) is faster than the dual-channel Core 2 processor. This shows the importance of memory bandwidth and latency in the execution of applications.

Rogers et al. [13] studied bandwidth as a limiting factor of multi-core scalability, showing that different architecture design techniques must be considered in order not to hit the bandwidth wall. They evaluated several such techniques to overcome this limitation, including cache compression, DRAM caches, 3D-stacked caches, link compression, and sectored caches.

Long et al. [8] characterized different applications from the SPEC benchmark suite according to their bandwidth demands by considering the phases of their execution. More specifically, they studied the memory traces of the applications during their execution and categorized them as Low, Medium, and High bandwidth-demanding applications using the arithmetic-to-memory ratio of each phase. In their study, they vary the number of cores in a multi-core system and investigate the relative performance of the applications when multiple instances of the same application, and pairs of applications, are executed concurrently. Their results show an important impact on application performance due to bandwidth sharing among applications as the number of cores increases, and they suggest a way to overcome this limitation by using phase prefetching.

Fedorova et al. [5] suggested that the operating system for heterogeneous multi-cores should handle thread scheduling considering three objectives: optimal performance, fair CPU sharing, and balanced core assignment. Scheduling is done by monitoring physical variables and controlling according to all three objectives. Thread migration is performed only if the performance benefit of migrating a thread to another core is greater than that of doing nothing, provided that no large negative impact on system-wide measures, like core-assignment balance and response-time fairness, will arise as a result of the migration.

Shelepov and Fedorova [15] address scheduling in heterogeneous multi-core systems in the case of short-lived threads, which do not allow monitoring to reach a near-optimal scheduling solution. The general idea is to schedule threads according to their architectural signatures, which are composed of certain microarchitecture-independent characteristics generated offline. These signatures can be embedded in the applications' binaries. Finally, the scheduler knows the system's characteristics and selects the best match according to the demands and the available resources.

Fedorova et al. [4] studied the performance degradation of applications in multi-core systems when the applications share the same resources, such as the last-level cache and the memory controllers, among others. In this work an analytical model and a more flexible, implementable scheduling algorithm were presented. They point out how applications could benefit when contention on the shared resources is handled.

In our work we address the bandwidth utilization of multi-core systems by proposing a bandwidth-aware scheduler, considering that no hardware changes are made to the system. More specifically, we suggest that a bandwidth-aware scheduler can satisfy the bandwidth requirements of the applications by monitoring them during their execution and trying to utilize the available bandwidth, redistributing applications among clusters of cores. Moreover, we have categorized the applications according to their bandwidth requirements in order to set up the thresholds that the scheduler considers when deciding on a new distribution of the applications.
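The monitoring-and-redistribution idea summarized above can be rendered as a rough sketch; the data structures, tags, and rebalancing move below are our own hypothetical simplification, not the actual implementation.

```python
from collections import Counter

def distribution(cluster_tags):
    """Summarize a cluster's applications as an L:M:H percentage string."""
    counts = Counter(cluster_tags)
    total = len(cluster_tags)
    pct = lambda k: round(100 * counts.get(k, 0) / total)
    return f"{pct('Low'):02d}:{pct('Medium'):02d}:{pct('High'):02d}"

# Hypothetical per-application bandwidth tags after one monitoring
# period, for two 4-application clusters.
clusters = {
    0: ["High", "High", "High", "High"],  # bandwidth-saturated cluster
    1: ["Low", "Low", "Low", "Low"],      # underutilized cluster
}

# A simple rebalancing move: swap one High application from the
# saturated cluster with one Low application from the idle cluster.
clusters[0].remove("High"); clusters[1].remove("Low")
clusters[0].append("Low");  clusters[1].append("High")

print(distribution(clusters[0]))  # 25:00:75
print(distribution(clusters[1]))  # 75:00:25
```

In the real scheduler such moves would be driven by the Compatibility Distribution Lookup Table rather than by an ad-hoc swap.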

8. Conclusions

In this paper we have analyzed and evaluated a wide range of applications in terms of their memory bandwidth demands and how these impact the performance of multi-core processors. We have shown the importance of considering these demands and have proposed a bandwidth-aware scheduling policy. This policy can be used in future multi-core architectures to avoid the Bandwidth Wall and to improve the efficiency of next generations of processors. More specifically, we have shown that even for short-running applications a bandwidth-aware scheduling policy benefits execution by a factor of up to 2x in comparison with an agnostic scheduling policy. The effectiveness of the proposed dynamic policy is considerable, since it led to performance that is on average less than 5% worse than the results obtained with the Oracle policy, which assumes a priori knowledge of the bandwidth requirements. Overall, bandwidth has proven to be a determinant factor in application performance, and bandwidth-aware scheduling shows good potential for avoiding this new wall. The proposed bandwidth-aware scheduling policy is a step forward in the design of operating systems for the next generations of multi-core processors.

References

[1] K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin, B. Jacob, C.-W. Tseng, and D. Yeung. BioBench: A benchmark suite of bioinformatics applications. In IEEE International Symposium on Performance Analysis of Systems and Software, pages 2–9, 2005.

[2] AMD. Product Brief: Quad-Core AMD Opteron Processor.http://www.amd.com/, 2008.

[3] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. In PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 72–81, New York, NY, USA, 2008. ACM.

[4] A. Fedorova, S. Blagodurov, and S. Zhuravlev. Managing contention for shared resources on multicore processors. Commun. ACM, 53(2):49–57, 2010.

[5] A. Fedorova, D. Vengerov, and D. Doucette. Operating system scheduling on heterogeneous core systems. In Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007.

[6] I. Gavrichenkov. First Look at the Nehalem Microarchitecture. http://www.xbitlabs.com/articles/cpu/display/nehalem-microarchitecture.html, 2008.

[7] Intel. Single-chip Cloud Computer. http://techresearch.intel.com/UserFiles/en-us/File/terascale/SCC-Overview.pdf, 2009.

[8] G. Long, D. Fan, and J. Zhang. Characterizing and understanding the bandwidth behavior of workloads on multi-core processors. In Euro-Par '09: Proceedings of the 15th International Euro-Par Conference on Parallel Processing, pages 110–121, Berlin, Heidelberg, 2009. Springer-Verlag.

[9] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 190–200, New York, NY, USA, 2005. ACM.

[10] N. Martínez-Bazán, V. Muntés-Mulero, S. Gómez-Villamor, J. Nin, M.-A. Sánchez-Martínez, and J.-L. Larriba-Pey. DEX: high-performance exploration on large graphs for information retrieval. In CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 573–582, New York, NY, USA, 2007. ACM.

[11] M. Pettersson. Linux x86 performance-monitoring countersdriver, 2001.

[12] J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kale. NAMD: Biomolecular simulation on thousands of processors. In SC '02: Proceedings of the ACM/IEEE Conference on Supercomputing, 2002.

[13] B. M. Rogers, A. Krishna, G. B. Bell, K. Vu, X. Jiang, and Y. Solihin. Scaling the bandwidth wall: challenges in and avenues for CMP scaling. In ISCA '09: Proceedings of the 36th Annual International Symposium on Computer Architecture, pages 371–382, New York, NY, USA, 2009. ACM.

[14] F. Ronquist and J. Huelsenbeck. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19:1572–1574, 2003.

[15] D. Shelepov and A. Fedorova. Scheduling on heterogeneous multicore processors using architectural signatures. In Proceedings of the Workshop on the Interaction between Operating Systems and Computer Architecture, 2008.

[16] Transaction Processing Performance Council. TPC Benchmark H (Decision Support) Standard Specification, Revision 2.6.1, June 2006.