Bridging Functional Heterogeneity in Multicore Architectures

Dheeraj Reddy, David Koufaty, Paul Brett, Scott Hahn
{dheeraj.reddy,david.a.koufaty,paul.brett,scott.hahn}@intel.com

Intel Corporation

ABSTRACT
Heterogeneous processors that mix big high performance cores with small low power cores promise excellent single–threaded performance coupled with high multi–threaded throughput and higher performance–per–watt. A significant portion of the commercial multicore heterogeneous processors are likely to have a common instruction set architecture (ISA). However, due to limited design resources and goals, each core is likely to contain ISA extensions not yet implemented in the other core. Therefore, such heterogeneous processors will have inherent functional asymmetry at the ISA level and face significant software challenges. This paper analyzes the software challenges to the operating system and the application layer software on a heterogeneous system with functional asymmetry, where the ISA of the small and big cores overlaps. We look at the widely deployed Intel® Architecture and propose solutions to the software challenges that arise when a heterogeneous processor is designed around it. We broadly categorize functional asymmetries into those that can be exposed to application software and those that should be handled by system software. While one can argue that newly written software should be heterogeneity–aware, it is important that we find ways in which legacy software can extract the best performance from heterogeneous multicore systems.

Categories and Subject Descriptors
D.4.1 [Operating Systems]: Process Management; C.1.3 [Processor Architectures]: Other Architecture Styles - Heterogeneous (hybrid) systems

General Terms
Algorithms, Performance

Keywords
Functional Heterogeneity, Multicore, Shared Asymmetric ISA, Intel® Architecture, Operating systems

1. INTRODUCTION
Advances in semiconductor technology have enabled processor manufacturers to integrate more and more cores on a chip. Most commercial multi-core processors consist of identical cores, where each core implements sophisticated micro-architecture techniques, such as super-scalar and out–of–order execution, to achieve high single–thread performance. This approach can incur high energy costs as the number of cores continues to grow. Alternatively, a processor can contain many simple, low–power cores, possibly with in–order

execution. This approach, however, sacrifices single–thread performance and benefits only applications with thread–level parallelism.

A heterogeneous system integrates a mix of big and small cores, and thus can potentially achieve the benefits of both [1, 7, 9, 15, 16, 27, 29]. Despite their significant benefits in power and performance, heterogeneous architectures pose significant challenges to operating system design [6], which has traditionally assumed homogeneous hardware.

Heterogeneous architectures can be broadly classified into two classes that are not mutually exclusive: functional asymmetry and performance asymmetry. Functional asymmetry refers to architectures where cores have different or overlapping instruction set architectures (ISA). For example, some cores may be general–purpose while others perform fixed functions such as encryption and decryption. Performance asymmetry refers to architectures where cores differ in performance (and power) due to differences in micro-architecture or frequency. Within the functional asymmetry design space, two trends are likely to exist: the ISA might be non–overlapping (such as the case of the Cell processor [24] or hybrid CPU/GPU designs [22]) or the ISA might be overlapping.

We believe a significant portion of future commercial multicore heterogeneous processors are likely to have a common ISA. However, due to limited design resources or micro-architectural goals, the big core is likely to contain ISA extensions not yet implemented in the small core. Therefore, such heterogeneous processors will have inherent functional asymmetry at the ISA level and face significant software challenges. This paper analyzes the software challenges to the operating system and the application layer software on a heterogeneous system with functional asymmetry, where the small core ISA overlaps with the big core ISA as in Figure 1-(a) or one is a proper subset of the other as in Figure 1-(b). We find that software has several options to leverage heterogeneity. However, each option is influenced greatly by the design goals of a particular software stack.

Our paper is novel in several ways. First, previous work on heterogeneous architectures has focused on the performance advantages of designing performance asymmetric systems assuming a homogeneous instruction set. This is very unlikely since small processors stand to gain power/performance advantages by sacrificing rarely used hardware or adding


Figure 1: Overlap of instruction set architecture (ISA). In (a) the two cores share a common ISA, but each has special purpose features. In (b) one core's ISA is a proper subset of the other core's.

hardware accelerators. Additionally, processor manufacturers are constrained by their design resources and will try to reuse existing core designs with their existing ISA as much as possible. Second, previous work on functional asymmetry has been limited to kernel support for user-level ISA. To our knowledge, ours is the first comprehensive analysis of the diverse heterogeneity challenges faced by the operating system, runtime and user space, and a systematic methodology to bridge the gap between the different ISAs.

The rest of this paper is organized as follows. Section 2 describes the different design alternatives to support heterogeneity for each feature set. Section 3 enumerates the most common functional gaps that need to be addressed, either with software or a combination of software and hardware. It also establishes a baseline of must-have features across architectures. Section 4 describes two hardware prototypes used to implement some of the ideas presented. Section 5 discusses how to address some of the implicit assumptions about homogeneity made by operating systems that break in the presence of heterogeneous systems. Section 6 discusses related work and Section 7 presents our conclusions.

2. DESIGN SPACE
Heterogeneous multicore architectures present new challenges in designing both the hardware–operating system and operating system–application interfaces: (1) how should the operating system exploit the heterogeneity — are the heterogeneous capabilities of sufficient value that the operating system should always use them (even if this requires migrating execution to other core types), should the feature be used opportunistically if available, or should the feature be ignored completely? and (2) how should the operating system expose the heterogeneity to applications — should it present the illusion of a homogeneous system (either with or without the heterogeneous feature) or should it expose the heterogeneity, requiring applications to cope with the resulting asymmetry?

These questions must be answered independently for each heterogeneous feature, weighing the potential functional and performance benefits against the added complexity in the operating system (which must manage the demand on resources and provide sharing mechanisms, security, etc.) and applications. Choices are often constrained by hardware implementation issues which limit flexibility and existing operating

Model       User Space View   OS Support
Restricted  Homogeneous       De-featuring
Hybrid      Heterogeneous     Hetero-API extensions
Unified     Homogeneous       Feature unification

Figure 2: Overview of the solution space. The operating system must make choices of what features it exploits and what features it exposes to user level.

system design choices which favor different software calling conventions (system API, device driver, IOCTL, etc.).

In addressing these key questions, a number of common architectural strategies have emerged, each of which can be applied independently to different features. Figure 2 lists these solutions and their impact on the operating system and user space. In the next sections, we examine each of these approaches in detail.

2.1 The Restricted Model
The Restricted Model leverages our assumption that future heterogeneous multi-core processors share a significant portion of their ISA, sufficient to provide a complete execution environment to system software. The restricted model targets software at a hypothetical core which implements only this shared ISA.
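As a minimal sketch (the feature words below are hypothetical sample values, not taken from any real processor), the restricted model's target ISA can be derived by intersecting the per-core CPUID feature words:

```c
/* Intersect per-core CPUID feature words (e.g. the ECX word of
 * CPUID leaf 1): the result describes the hypothetical common
 * core that the restricted model targets. */
unsigned int common_features(const unsigned int *ecx, int n)
{
    unsigned int common = ~0u;
    for (int i = 0; i < n; i++)
        common &= ecx[i];
    return common;
}
```

For example, intersecting a big-core word 0x00080003 (bit 19 = SSE4.1) with a small-core word 0x00000003 yields 0x00000003: SSE4.1 is masked out of the common ISA.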

An operating system which chooses to use the restricted model internally for the handling of a heterogeneous feature would be implemented to utilize only the common ISA uniformly provided on all the cores. Independent of this choice, an operating system could choose to expose the restricted model to higher level software by hiding (preventing enumeration), disabling (preventing usage) or forbidding (by convention) the asymmetric ISA features from higher level software.

Implementing the restricted model presents a number of challenges:

• There must exist some useful, common subset of the ISA on all cores in a system. Hence the restricted model cannot be implemented where each core type presents similar but incompatible implementations of a feature.

• The common ISA features, which are specific to the combination of both cores, must have been identified at operating system design time, significantly increasing the software enabling costs of any new heterogeneous platform.

• The common subset of the ISA represents a new architecture, similar to but independent from either of the core architectures, onto which system software will need to be ported and validated.

• System software will not be able to utilize the benefit of the heterogeneous features of the platform, thereby reducing the benefit gained by adding heterogeneity to the platform.


• Additional hardware support may be required in the platform to prevent third party applications from enumerating or using heterogeneous features which require OS support for configuration or state management.

2.2 The Hybrid Model
In the cases where no common ISA exists, or the heterogeneity exposed by the platform provides sufficient benefit to software, the operating system may choose to explicitly implement and/or expose the asymmetry available in the platform. The Hybrid Model requires potentially significant software modifications to gain the benefits of the heterogeneous platform.

In the case where the operating system explicitly addresses heterogeneity, the operating system can be divided into core specific and core agnostic portions. Core specific code within the operating system, for example interrupt handlers and the scheduler, will never migrate and hence can utilize dedicated data structures and code paths to manage and benefit from the heterogeneous features. Core agnostic code within the OS is typically preemptable and can be migrated by the CPU scheduler, and can be enabled using the same techniques as heterogeneity–aware applications.

Heterogeneous software must first interrogate the system to establish the heterogeneity available in the platform, for example by querying a new system database or testing each of the available cores (Figure 3). The existing core affinity capabilities presented by operating systems can then be used to enable the application to adapt to the heterogeneous environment. However, heterogeneous applications utilizing core affinity must balance the cost of using core affinity with the benefit gained from the heterogeneous feature. Selectively executing code optimized for the current processor (Figure 4) incurs a potentially significant system call overhead to set and clear the affinity mask (measurements on an Intel® Xeon® Processor E5450 running Linux 2.6.27 show this overhead can be as high as 5000 cycles). This cost can be amortized over many instructions by forcing the thread to remain on the big cores (Figure 5), at the risk of overloading these cores with threads which cannot be migrated.
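The amortization argument can be made concrete with a back-of-the-envelope sketch: pinning costs two affinity system calls (set and restore), so the optimized path must save at least that many cycles before it pays off. The ~5000-cycle figure is the measurement quoted above; the per-element saving is a made-up workload parameter:

```c
/* Elements that must be processed before the ~2 x 5000 cycle
 * affinity overhead is recovered. cycles_saved_per_elem is a
 * hypothetical per-element gain of the optimized (e.g. SSE4.1)
 * path over the fallback; result rounds up. */
unsigned long break_even_elems(unsigned long syscall_cycles,
                               unsigned long cycles_saved_per_elem)
{
    return (2 * syscall_cycles + cycles_saved_per_elem - 1)
           / cycles_saved_per_elem;
}
```

With 5000-cycle syscalls and a hypothetical 4-cycle saving per element, roughly 2500 elements are needed before per-use pinning beats the fallback; below that, staying pinned to the big cores or using the fallback algorithm is preferable.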

Implementation of such modifications may require heterogeneity–aware compilers, multi-path libraries, or "fat binaries" containing multiple optimized versions of applications, requiring the developer to select independent algorithms optimized for the platform asymmetry.

This diversity in implementations creates a resource management issue for an operating system: applications which are executing algorithms optimized for one type of core may not be able to execute efficiently (or at all) on the other cores. Heterogeneous applications can express these dependencies using CPU affinity, but at the cost of limiting OS load balancing – impacting the ability of the scheduler to optimize system throughput.

// test_capability tests for SSE 4.1
int test_capability(void) {
    unsigned int eax, ebx, ecx, edx;
    eax = 1;
    asm("cpuid" : "=a" (eax), "=b" (ebx),
                  "=c" (ecx), "=d" (edx)
                : "0" (eax), "2" (ecx));
    return (ecx & 1 << 19);
}

cpu_set_t discovery(void) {
    cpu_set_t current, capable;
    int CPUS = sysconf(_SC_NPROCESSORS_CONF);
    int i;

    CPU_ZERO(&capable);
    for (i = 0; i < CPUS; i++) {
        // set affinity to cpu i
        CPU_ZERO(&current);
        CPU_SET(i, &current);
        if (!sched_setaffinity(0, sizeof(cpu_set_t), &current)) {
            // if the cpu has the capability
            if (test_capability()) {
                CPU_SET(i, &capable);
            }
        }
    }
    return capable;
}

Figure 3: Discovering SSE4.1 in a Linux application

cpu_set_t previous, current;

sched_getaffinity(0, sizeof(cpu_set_t), &previous);
// pin the task to the current core
CPU_ZERO(&current);
CPU_SET(sched_getcpu(), &current);
if (!sched_setaffinity(0, sizeof(cpu_set_t), &current) &&
    test_capability()) {
    // use SSE 4.1
} else {
    // use alternate algorithm
}
sched_setaffinity(0, sizeof(cpu_set_t), &previous);

Figure 4: Linux application code to utilize SSE4.1 only if it is present on the current CPU

// early on in the code, we build maps of all and
// SSE4.1 capable processors
cpu_set_t default_mask;          // "default" is a C keyword
cpu_set_t capable = discovery();
sched_getaffinity(0, sizeof(cpu_set_t), &default_mask);

// at each use of SSE4.1 instructions, we set the
// CPU mask, possibly forcing process migration
sched_setaffinity(0, sizeof(cpu_set_t), &capable);

// use SSE 4.1
...

// finally, restore default affinity to maximize
// throughput
sched_setaffinity(0, sizeof(cpu_set_t), &default_mask);

Figure 5: Linux application code to utilize SSE4.1 instructions, forcing migration if required

2.3 The Unified Model
The Unified Model attempts to hide the resource management issues of exposing heterogeneity by presenting the superset of capabilities of the cores to software.

Capabilities present in any of the cores in the architecture are exposed to software as if they are present on all the cores in the system, with the operating system providing transparent services to ensure the correct functioning of software which uses features not present on the current core.

When a process attempts to utilize a capability not present on the current core, the core transfers control into the operating system fault handler. For example, software which executes an instruction on a core which does not support that instruction generates an illegal instruction fault. Operating systems designed for homogeneous systems will normally interpret this fault condition as a fatal error and terminate the process (for example, Linux will deliver a SIGILL to the user process). However, on a heterogeneous system the fault may have occurred because the process executed an instruction which is only present on some subset of the available cores. With the unified model, the operating system can use one of a number of strategies to continue execution of the faulting process, namely process migration, instruction emulation and instruction proxying.

2.3.1 Process Migration
The operating system migrates the process to another core type and re–executes the faulting instruction, on the assumption that this core will support the missing capability. Processes which fault at the same instruction address on each core type are handled in the normal way for a homogeneous system. This approach was introduced in [18] and is referred to as fault-and-migrate.
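The core-selection step of fault-and-migrate can be sketched as follows (the feature masks and core count are hypothetical; the real logic would live in the kernel's illegal-instruction handler):

```c
#define NCORES 4

/* Per-core feature masks: a set bit means the feature is
 * implemented. In this sample, cores 2 and 3 additionally
 * implement the feature at bit 2. */
static const unsigned int core_mask[NCORES] = { 0x3u, 0x3u, 0x7u, 0x7u };

/* On an illegal-instruction fault, return a core that implements
 * everything the faulting instruction needs, or -1 if no core
 * does (a true fault: deliver SIGILL as on a homogeneous system). */
int pick_target_core(unsigned int needed)
{
    for (int i = 0; i < NCORES; i++)
        if ((core_mask[i] & needed) == needed)
            return i;
    return -1;
}
```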

Process migration provides a simple mechanism to present the unified model to software; however, if many applications attempt to use heterogeneous capabilities present on only a few cores, process migration may cause the system to become imbalanced. Using load balancing to migrate heterogeneous processes back onto the other cores could result in excessive process migrations, as processes bounce between cores. Finally, if the ISA of one core type is not a strict subset of the other core type, it is possible for pathological applications to bounce between cores as alternate instructions may force migrations.

2.3.2 Instruction Emulation
As an alternative to migrating the faulting process, an operating system could choose to emulate the faulting instruction. Early Intel® Architecture systems included instructions for floating point operations which were only present if an 8087 math coprocessor was installed in the system. Operating systems such as Linux provided an instruction emulator for the 8087 for systems which did not include the hardware. However, software emulation is significantly slower than hardware, not just because the dedicated hardware accelerators are not present, but also because a fault and decode operation must be taken for each emulated instruction. Hypervisors face a similar issue with emulation of privileged instruction sequences, and have developed techniques to reduce (but not eliminate) the performance overhead. Xen [3] uses the QEMU [5] emulator to decode streams of many instructions, reducing the number of trips into the fault handler.

VMware [25] uses binary translation with aggressive caching and pinhole optimization to achieve a similar result.
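As an illustrative sketch of the emulation path (using a population-count instruction as a stand-in for a missing opcode; this is not the paper's implementation), the fault handler decodes the faulting instruction and computes its result in software before resuming the process:

```c
/* Software emulation of a POPCNT-style instruction, as an
 * illegal-instruction handler might compute it on a core
 * lacking the opcode. Kernighan's method: each iteration
 * clears the lowest set bit. */
unsigned int emulate_popcnt(unsigned int x)
{
    unsigned int count = 0;
    while (x) {
        x &= x - 1;
        count++;
    }
    return count;
}
```

Each emulated execution still pays the fault and decode overhead discussed above, which is why batching decoded instruction streams (as in Xen/QEMU) or binary translation (as in VMware) is attractive.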

2.3.3 Instruction Proxying
Finally, the operating system could encapsulate the faulting instruction in a lightweight execution context and pass it to the other core type to be executed. Executing only the faulting instruction on the other core type can allow the process to gain the performance benefit of process migration with a reduced risk of over–committing the more capable core types. However, encapsulating the faulting instruction's context is complex and expensive. Each operation will incur a significant performance penalty because of the inter–core communication required, not just on the receiving core which must perform the work but also on the sending core, since it will be difficult to schedule other work in the short period in which it would be waiting for completion of the instruction. For any instruction sequence involving memory operations, the core will have to change memory contexts to execute the instruction, impacting cached data.

2.3.4 Common Problems
Although the unified model provides a powerful tool for legacy application support, each technique presents significant problems since they require that each core type must generate an exception for every feature that it does not implement. Features which present a common interface but provide alternate definitions would therefore be problematic to implement. Additionally, the techniques required to implement the unified model introduce significant performance overheads, and both Process Migration and Instruction Proxying add intercore dependencies which could result in faults or deadlocks in performance critical sections such as fault handlers and device drivers. Since operating system changes to address platform heterogeneity would be required in order to expose or exploit the unified model, we believe the benefit to the operating system of the unified model is limited and it is much more likely that a combination of the restricted model and the hybrid model would be used for operating system code.

3. FUNCTIONAL ASYMMETRY
Bridging the gap between instruction sets on heterogeneous systems requires both hardware and software support, including policies for the heterogeneous model in effect for each asymmetric core feature. In this section, we describe how various functional asymmetry issues affect the operating system and give examples on Intel® Architecture platforms. While this is not an exhaustive list, it showcases some of the most common challenges in bringing up the operating system components that depend on hardware features, including feature enumeration, memory models, and the instruction set.

3.1 Feature enumeration
Enumerating the features supported by each core and adopting a model to use and expose those features to applications is perhaps the most critical support needed for a true heterogeneity–aware operating system. Two key issues need to be addressed: as described in Section 2, the operating system must decide what heterogeneous features it will support and which ones will be exposed to user space.


3.1.1 Core features
Traditionally, operating systems have been designed for homogeneous hardware and would check feature availability on one processor, assuming it exists everywhere else. Examples of this are instruction subsets, architectural registers and operating modes such as virtualization. Feature enumeration must be fixed in one of three ways. First, for features that will be supported only if they exist in every core (the restricted model), the feature check needs to use a global set of common features instead of core–specific feature support. Second, for features that will be supported asymmetrically (the hybrid model), most data structures would need to be made per–core to guarantee only support in the current core is checked. Moreover, initialization of such features made in one processor (for example, the boot processor) will require awareness of the feature availability in other cores. Finally, features that would be supported transparently regardless of hardware support (the unified model) require an approach similar to the hybrid model, with the addition of software and hardware support described next.
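A minimal sketch of the per-core bookkeeping the hybrid model implies (feature bit position is the real SSE4.1 encoding; the sample values are hypothetical): the check consults the word of the core it is running on, not a boot-time global:

```c
#define NCORES     2
#define FEAT_SSE41 19   /* CPUID.1:ECX bit 19 */

/* Hybrid model: one feature word per core instead of a single
 * global copied from the boot processor. Sample values only. */
static const unsigned int percore_ecx[NCORES] = {
    1u << FEAT_SSE41,   /* big core implements SSE4.1 */
    0u                  /* small core does not */
};

int core_has_feature(int cpu, int bit)
{
    return (percore_ecx[cpu] >> bit) & 1u;
}
```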

3.1.2 Exposing features to user space
An operating system must decide what kind of asymmetry, if any, it wants to expose to user space. This is important because libraries and code that are not asymmetry–aware could break. For example, if a library checks for the existence of a certain high performance feature that only exists in a subset of the cores, it could lead to lower performance when the feature is checked on a core without it. Even worse, if the feature is detected, but the thread is later migrated to another core without it, it could lead to a fault. In Section 3.4 we discuss a potential solution for this issue.

It is not practical to predict what policies a general purpose operating system will support. For this reason, we propose that hardware provide a mechanism to allow software to select which features will be exposed to user level on each core. This could be implemented in hardware by intercepting those instructions used for feature enumeration executed in user space, similar to the techniques used in virtualization [12]. In the Intel® Architecture, the CPUID instruction is used for feature enumeration, so one alternative is to raise an exception when user code executes the CPUID instruction and let the operating system generate the appropriate result for each feature in the fault handler. Figure 6 shows an example implementation. Using this mechanism to implement a restricted or unified model, user code that uses CPUID to detect features does not have to be modified, while the kernel fault handler provides the set of features that it wants to expose to user level. Alternatively, hardware can provide overrides that allow the operating system to selectively disable features that it does not want to expose to user level, easing the implementation of the Restricted Model.

The method above is generic enough to allow the flexibility to support many models, but any other method that allows the operating system to mask out the existence of a hardware feature would suffice.

User code:

    // get core features
    EAX = 1
    CPUID                    // traps to the kernel
    // detect feature
    if (ECX & mask) then
        // feature present
        ...
    else
        // feature not present
        ...
    endif

Kernel code (CPUID fault handler):

    if (EAX == 1) then
        // expose desired features
        ECX = ...
    else
        CPUID
    endif
    // fix return address
    ...
    IRET

Figure 6: Exposing a desired target instruction set to user space.

3.1.3 Non–architectural features
Some cores provide features that are model–specific and therefore non–architectural. Such features need to be addressed in software. For example, the address, bits and encoding of non–architectural model–specific registers in the Intel® Architecture can vary across cores. Their use in the kernel should be conditional on the core type where the kernel is executing.

3.2 Memory and interrupt domains
One of the fundamental principles on which most traditional operating systems rely is the ability of all processors to share a single physical memory address space and interrupt domain. Although some non–traditional operating systems [4] and clustering operating systems [2, 20] exist that do not require a single domain across the system, they still assume a single domain within each node.

A single memory domain requires the system to provide coherent access to all memory in the system to each core. A single interrupt domain requires that the system be able to perform symmetric inter–processor communication.

Our work assumes that the hardware provides both. Therefore, all memory in the system must be cache coherent and shared. Similarly, internal or external interrupt controllers must support addressing any core in the system. One of the consequences of this is that processors that support the advanced x2APIC architecture of Intel® Architecture processors cannot be mixed with processors that only support the incompatible xAPIC architecture, unless the x2APIC is only run in compatibility mode (i.e. xAPIC mode).

One consequence of the single memory domain is that if the cores do not support the same physical address width, the operating system is limited to using the lowest physical address width of all the cores in the system for general purpose memory management. The remaining physical memory beyond this limit would be unused or dedicated to special purposes.
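A sketch of that constraint (the widths below are hypothetical examples): general-purpose memory management is capped by the minimum physical address width among the core types:

```c
/* Usable general-purpose physical memory is bounded by the
 * smallest physical address width across all core types;
 * memory above this limit is left unused or special-purposed. */
unsigned long long usable_phys_bytes(const int *widths, int n)
{
    int min_w = widths[0];
    for (int i = 1; i < n; i++)
        if (widths[i] < min_w)
            min_w = widths[i];
    return 1ULL << min_w;
}
```

For example, mixing a hypothetical 40-bit big core with a 36-bit small core limits general-purpose memory to 2^36 bytes (64 GiB).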

3.3 Virtual memory and paging
There are many hardware features that control the operation of the virtual memory and paging subsystems in the operating system. For convenience of discussion we subdivide them into three groups: page tables, paging structure caches, and cache control.

3.3.1 Page tables
Page tables control the translation between virtual addresses and physical addresses, with different paging modes associated with differences in page size and attributes. The operating system expects all processors to support the same paging modes or, alternatively, it will use only the common paging modes across all cores. While there could be a performance or functional advantage in using a paging mode only available in certain cores (for example a large page size or a no–execute bit), these page tables might be shared by threads executing on different core types and the cost of remapping them outweighs its benefits.

Given that the paging modes are the same and that all memory should be shared, it follows that the virtual address space width should be the same across all cores.

3.3.2 Paging structure caches
The design of paging structure caches such as translation lookaside buffers (TLB) is not visible architecturally; therefore any differences in their sizes or associativity would only have a performance impact and are irrelevant to the functionality of the operating system. However, they might have an impact on the design of the operating system scheduler [14].

Address space identifiers (ASID) are commonly used in many architectures to allow selective flushing of the TLB and to improve performance when switching VM contexts. Support for ASID in all cores is optional, as the operating system can either forgo ASID support completely or maintain ASIDs in software and use them only on cores that support them.
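The software-managed option can be sketched as follows: the operating system assigns an ASID per address space and uses it only on cores that support ASIDs; on other cores every context switch implies a full TLB flush. This is an illustrative sketch under assumed names (`struct mm`, `switch_mm`), not the paper's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ASID 256  /* assumed size of the hardware ASID space */

struct mm {
    uint16_t asid;    /* 0 means "not yet allocated" */
};

static uint16_t next_asid = 1;

static uint16_t asid_for(struct mm *mm)
{
    if (mm->asid == 0) {
        if (next_asid == MAX_ASID)
            next_asid = 1;    /* rollover: a real kernel also flushes here */
        mm->asid = next_asid++;
    }
    return mm->asid;
}

/* Returns true if the switch requires a full TLB flush. */
static bool switch_mm(struct mm *mm, bool core_has_asid)
{
    if (!core_has_asid)
        return true;          /* no ASID support: flush on every switch */
    (void)asid_for(mm);       /* tag TLB entries instead of flushing */
    return false;
}
```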

3.3.3 Cache control
Cache control refers to all features that control the cacheability of data, including memory types and cache line size. Significant effort is made by the operating system and application software to avoid cache line alignment issues and minimize the amount of false sharing. Additionally, software that depends on a certain cache line size to flush data out of the cache (e.g. CLFLUSH) might not behave properly when the cache line size varies across cores. For this reason it is recommended that the line size be the same across all cores.
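To see why a varying line size breaks flush loops, consider a sketch of the address arithmetic such code performs. If a thread queries a 128-byte line size and then migrates to a core with 64-byte lines, the stride is too large and half of the lines are never flushed. The helpers below are hypothetical; the actual CLFLUSH call is only indicated by a comment.

```c
#include <assert.h>
#include <stdint.h>

/* Number of flush operations a line-by-line flush loop issues,
 * given the line size the software assumed. */
static int lines_flushed(uintptr_t start, uintptr_t len, unsigned assumed_line)
{
    int n = 0;
    for (uintptr_t p = start & ~(uintptr_t)(assumed_line - 1);
         p < start + len; p += assumed_line)
        n++;                  /* clflush((void *)p) would go here */
    return n;
}

/* Number of lines actually covering the buffer on the current core. */
static int lines_present(uintptr_t len, unsigned real_line)
{
    return (int)((len + real_line - 1) / real_line);
}
```

With a 512-byte buffer, assuming 128-byte lines issues 4 flushes while a 64-byte-line core holds the data in 8 lines, leaving 4 lines unflushed.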

The memory type can be controlled by different structures in hardware. In the Intel® Architecture, memory type is derived from a combination of the memory type in the page tables and the memory type range registers (MTRR). Software expects that the memory type resolution and the visible behavior of the resulting memory type be exactly the same across all cores. Without it, software would invariably use additional memory fences or stronger memory types to enforce the desired behavior in hardware, leading to a loss of performance.

3.4 Instruction asymmetry
There are two types of instruction asymmetry relevant to heterogeneous systems: the overlapping instructions that exist on all cores and the non–overlapping instructions that exist only on a subset of cores.

3.4.1 Overlapping instructions
In order to simplify the programming model, instructions that exist on two or more cores must have identical encodings and result in identical state changes. Otherwise, the whole software stack, including compilers, operating systems, runtimes and applications, would be hampered by the potentially non-deterministic behavior of each instruction. While runtime checks for core-specific behavior can be done, they are not a practical solution, since a check would have to be performed atomically before every code path that uses instructions whose behavior is not identical.

While guaranteeing identical encodings and results is straightforward for integer arithmetic instructions, it might pose more challenges for floating point arithmetic. We expect cores to support the same precision and rounding modes.

3.4.2 Non–overlapping instructions
The instruction set exposed by each core is likely to differ, resulting in each core type having specialized or advanced instructions not implemented on other cores. While it is desirable that user level code execute only instructions available on the core it is running on, this might be difficult to achieve, particularly for legacy software. Indeed, executing an instruction that does not exist on one core can be transparently handled by the kernel by performing a fault and migrate as described in Section 2.3.1. While this is generally not desirable for high performance computing, it constitutes a last line of defense for correct execution.

There are two types of instructions that might exist only on some cores. The first type comprises those that operate on architectural state already present in every core. For example, SSE4 instructions [11] operate on the same XMM registers that previous SSE instructions do. In this case it is enough to migrate the thread using the existing context switching method. The second type consists of those instructions that operate on new state only available on certain cores. This new challenge requires changing the context switch code in two ways:

• The save area in the task structure must be able to accommodate all the state across cores.

• Context switching should only save and restore the portions of the state that are relevant to the core it is operating on.

One example of how this can be done using high performance context switching is depicted in Figure 7. On the Intel® Architecture, the XSAVE instruction [11] allows saving of base and extended architectural state. There are several possibilities for how to implement this, depending on the characteristics of the extended states implemented on each core. One approach would be for each core to be aware of the architectural state implemented by the other cores and design the save area format accordingly, as allowed by the definition of the save area. The operating system would set the save masks appropriately to only save the state applicable to each core. Another solution, shown in Figure 7, combines one core with no extended state with one core with extended state. Using the compatibility of the legacy save area, the



Figure 7: Mixing state save instructions with XSAVE and FXSAVE.

core without extended state can use the older FXSAVE instruction, while the core with extended state can use the newer XSAVE instruction.
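The first approach described above, where each core saves only the state it implements, reduces to computing a save mask per core, in the spirit of the XSAVE feature masks. The sketch below is illustrative; the bit assignments are hypothetical, and the task save area is assumed to be sized for the union of state across all core types.

```c
#include <assert.h>
#include <stdint.h>

#define STATE_FPU_SSE  0x1ULL   /* legacy state, present on all cores */
#define STATE_AVX      0x4ULL   /* extended state, big cores only */

/* Mask passed to the state-save instruction on a context switch:
 * only state that the task uses AND the current core implements. */
static uint64_t save_mask(uint64_t task_state, uint64_t core_features)
{
    return task_state & core_features;
}
```

A thread using AVX saved on a small core thus stores only its FPU/SSE state, while on a big core the full extended area is written.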

Of course, as previously stated, this is not a high performance solution. It is desirable that code that is not asymmetry aware be either restricted to the right core type or limited to using the common instruction set. This avoids the performance impact of frequent migrations and the unnecessary oversubscription of certain core types.

In the case of kernel code, it should refrain from using instructions that are not available on the current core (using conditional code) or simply use the common set of instructions across all cores. Support like fault and migrate is not feasible in kernel mode for two reasons. First, certain code paths that are non–preemptible cannot be transparently migrated. For example, if the code assumes that the local core's runqueue is locked, this assumption would be broken if the thread migrated. Second, even if the code is preemptible, the faulting instruction might have been intended to change the privileged state of the local core. Migrating it would incorrectly make that state change on a different core. Finally, it is hard to envision a situation where the performance benefits of using fault and migrate for kernel code would outweigh the complexity of addressing the previous issues.

3.5 System topology
Today's systems include a variety of topologies that are handled by the operating system. Examples of this are cache hierarchies, memory subsystems and non–uniform memory access (NUMA), core to socket mapping, and simultaneous multithreading.

The operating system must address the potential for asymmetric configurations. For example, the number of cores per socket could be different on each socket, as could the size or the number of sharers of a cache. This does not represent a particular challenge, since the operating system is already likely to be using hierarchical data structures for this purpose.

One notable aspect of how the operating system deals with system topology is scheduling. When cache and memory hierarchies differ, care must be taken to create scheduling domains that reflect the differences in cache topologies among the cores and sockets. Similarly, scheduling optimizations for simultaneous multithreading [23] should be applied only to those cores that support it. Finally, scheduling optimizations can exploit the diversity in the workload to improve system performance [26, 14].

3.6 Timing infrastructure
There are many platform counters that can be used for timing purposes. As long as they are globally shared in the platform they do not represent a challenge in heterogeneous systems. Modern processors such as those from the Intel® Architecture family provide a high resolution counter called the time stamp counter (TSC) that is incremented every clock cycle. Since the clock cycle is the minimum unit of time, the TSC provides the highest resolution counter available in the platform. Additionally, the TSC can be read quickly, usually orders of magnitude faster than other platform counters.

In order to support the continued use of the TSC for timing purposes, hardware must guarantee that the TSC is incremented at a constant rate across all cores in the system. For heterogeneous systems this implies that it is desirable to have the TSC increment at the same constant frequency on all cores, regardless of the actual frequency of each core. Moreover, the TSC should not be affected by sleep or frequency events. Otherwise, the operating system will not use the TSC and will obtain timing information from other, slower sources.
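The failure mode is easy to see from the conversion the timekeeping code performs. A single calibrated frequency turns tick counts into time; the sketch below (frequencies chosen only for illustration) shows that ticks taken on a core running at a different rate yield a wrong interval.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a TSC tick count to nanoseconds using one calibrated
 * frequency, as a homogeneity-assuming timekeeper would. */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t tsc_hz)
{
    return ticks * 1000000000ULL / tsc_hz;
}
```

One second of ticks from a 3 GHz core converts correctly with a 3 GHz calibration, but one second of ticks from a 1.6 GHz core interpreted with the same calibration reports only about 533 ms, and the error accumulates.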

3.7 RAS and debug
Reliability, availability and serviceability (RAS) features are often critical in certain market segments. As such, the requirements on RAS features will likely be dictated by the design goals of the CPU. While software does not require the RAS features to be identical across cores, their usefulness is limited by the lowest common feature set. Supporting asymmetry in other features is possible, such as limiting the propagation of certain errors, which could be enabled only on cores that support it.

Debugging of heterogeneous systems extends well beyond the operating system kernel. While the hardware infrastructure for debug is likely to offer some advanced features in certain cores, it is expected that most basic functionality will be common across all cores to enable debuggers. Debuggers and other tools will need to be updated to reflect the asymmetry in the cores, particularly the non–common state and the decoding/disassembly of non–overlapping instructions.

3.8 Performance monitoring
Most modern CPUs include some fundamental infrastructure to collect performance counters that enable developers and other users to understand application behavior. Many tools such as Intel® VTune rely on the existence of these counters to debug applications.

There are two types of performance counters in the Intel® Architecture. First, the architectural performance events are those that behave consistently across microarchitectures. The Intel® Architecture implements several versions of architectural events, and it would be desirable for all cores to support the same version. Second, the non–architectural events are those that are specific to the core. Due to the varying microarchitectures, non–architectural events would likely differ significantly between the cores. All layers of the software stack, including the kernel, would need to be aware of this asymmetry. Tools such as VTune would need to be updated to provide the user with proper interfaces to collect the proper events on each core.

3.9 Virtualization
If the virtualization features are not identical across cores, a VMM cannot effectively use those features. For example, it is not worth the effort for a VMM to support two guest–to–physical memory translation algorithms and data structures simultaneously, say page table shadowing and guest–to–physical page tables, to be used conditionally on the presence of support for extended page tables on the current core.

Therefore, the VMM is likely to use only the common set of virtualization features. Yet even this is complicated, as some virtualization architectures such as Intel's VMX hide the memory layout of the virtual machine control structure (VMCS) from software. This format changes as new virtualization features are added and new per–VM state is saved in the VMCS, which led to the introduction of the VMCS revision id. Hence, the format and revision id of the VMCS might be different across cores, and one core might not be able to load a VMCS with a different revision id.

Two solutions are possible. The desired one is for hardware to provide the same VMCS format and revision id, even if some fields are deprecated on some cores due to unimplemented features. The other approach, much less desirable, requires the VMM to do a translation during a VM migration across cores of different types. For example, the VMCS can be read into a format–independent memory region using VMCS reads and saved in the new format using VMCS writes on the new core. This process adds to the cost of a VM migration, but could be mitigated by reducing the number of such migrations, for example by creating processor pools by core type.
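The translate-on-migration option can be sketched as a field-by-field round trip through a format-independent image. The `vmread`/`vmwrite` functions below are hypothetical stand-ins for the VMX VMREAD/VMWRITE instructions, which are the only portable way to access a VMCS; the field count and layout are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define N_FIELDS 4   /* toy field count */

struct vmcs { uint32_t revision_id; uint64_t opaque[N_FIELDS]; };

static uint64_t vmread(const struct vmcs *v, int field)
{
    return v->opaque[field];
}

static void vmwrite(struct vmcs *v, int field, uint64_t val)
{
    v->opaque[field] = val;
}

/* Read every field into a format-independent image on the old core,
 * then write it back field by field in the new core's format. */
static void migrate_vmcs(const struct vmcs *from, struct vmcs *to)
{
    uint64_t image[N_FIELDS];
    for (int f = 0; f < N_FIELDS; f++)
        image[f] = vmread(from, f);
    for (int f = 0; f < N_FIELDS; f++)
        vmwrite(to, f, image[f]);
}

/* Convenience wrapper: round-trip one field value across revisions. */
static uint64_t roundtrip_field(uint64_t val, int field)
{
    struct vmcs a = { 1, {0} }, b = { 2, {0} };
    vmwrite(&a, field, val);
    migrate_vmcs(&a, &b);
    return vmread(&b, field);
}
```

Each field access is a VM instruction, which is what makes this path expensive relative to a revision-id-compatible VMCS.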

3.10 Summary
Table 1 summarizes the discussion of the previous sections. Adoption of heterogeneous systems will require hardware and software co–development. Whilst the operating system can support some degree of functional asymmetry, it requires hardware to provide a basic subset of a common ISA, including single memory and interrupt domains, identical memory types and identical behavior for common instructions. Software, on the other hand, can bridge the gap in functionality for certain classes of functional asymmetry, including core feature set, topology and non–overlapping instructions. However, in order to fully exploit the performance of asymmetric systems, both the operating system and the user–level layer need to be aware of it, including libraries, debuggers and performance monitoring tools.

In some cases, software is limited in the amount of asymmetry that it can tolerate or exploit. For instance, asymmetric paging modes or virtualization features would require the additional complexity of multiple kernel algorithms and state migration between them, so it is unlikely that the operating system itself would use these features. On the other hand, applications can always deal with asymmetric features, either by explicitly programming to the hybrid model or by hiding the asymmetric features in the operating system using the restricted model or the unified model, the latter being particularly useful for legacy applications.

4. HARDWARE PROTOTYPES
To identify and further investigate the operating system issues that might occur in heterogeneous systems, we experimented with two hardware platforms. The first prototype consisted of processors from different microarchitecture families in a single system. The second prototype uses identical processors but with certain features disabled in selected cores.

4.1 Multi-family platform
This platform consists of a dual socket system featuring processors from different processor families, a dual-core Intel® Atom™ Processor N330 and a quad-core Intel® Xeon® Processor E5450. This platform thus gives us a pairing of a set of high performance big cores (E5450) with a set of power efficient small cores (N330). The small cores are an order of magnitude more power efficient than the big cores but have significantly lower performance than the big cores. We verified the raw integer performance difference between the cores using a set of microbenchmarks. While the big cores do not support any form of SMT, the small cores are two-way SMT cores. This asymmetric core prototype has significant architectural and topological as well as performance asymmetry. These systems have a modified firmware that supports the different core types. These platforms have also been modified such that an N330 core is always the bootstrap processor (BSP); this ensures that the core with the smaller feature set is always the bootstrap core. Though there is no such requirement for a heterogeneous system, it makes our platform bring–up process relatively easy. We disabled the use of the TSC as a clocksource since the two processors were running at different clock frequencies; thus, the HPET was the only clocksource in our operating system. Our modified Linux kernels and the software stack above them could run various performance-oriented algorithms previously proposed [17, 14] for performance asymmetric heterogeneous systems.

4.2 Defeatured platform
The second platform consists of a system featuring the 2011 2nd Generation Intel® Core™ Processor codenamed "Sandy Bridge". These cores come with the Intel Advanced Vector Extensions (AVX), a new 256-bit SIMD instruction set that accelerates floating point intensive applications. The AVX instructions depend on the presence of the XSAVE/XRSTOR instructions to save and restore the extended AVX state. Using proprietary tools, we disabled the XSAVE/XRSTOR and, therefore, AVX instructions in a subset of the cores that we designate as small cores. This provides for architectural asymmetry in the heterogeneous platform but almost no performance asymmetry. Applications which take advantage of the newer AVX instructions will be able to run only on the big cores; they will experience illegal–instruction exceptions if they run on the small cores. To overcome this heterogeneity, we resorted to the unified model as discussed in Section 2.3. We created the floating point state for all the


Topic                    Subtopic              Hardware                     Software
Domains                  Memory                Same
                         Interrupt             Same
Features                 Core features         Expose to SW                 Manage
                         User space features   Add HW support               Policy
Memory                   Page tables                                        The Restricted Model
                         Physical address                                   The Restricted Model
                         ASID                                               The Restricted Model
                         Memory types          Same
                         Cache line size       Same
                         Paging caches         Transparent
Instructions             Overlapping           Same
                         Non–overlapping                                    Policy & manage
Topology                 Cache, NUMA, SMT                                   Manage
Timing                   TSC frequency         Same desirable               Use if same
RAS                                                                         The Restricted Model, others optional
Debug                                          Basic features same          Impact to tools
Performance monitoring   Architectural         Same
                         Non–architectural                                  Manage in tools
Virtualization           Features                                           The Restricted Model
                         Revision id           Same desirable               Manage

Table 1: Summary of hardware and software requirements; details are given in the respective sections. Blank entries mean that there are no requirements, while "The Restricted Model" refers to support for only the subset of the feature[s] common to all cores in the system.

processes corresponding to the big core feature set. Thus, each process has floating point state which assumes that it will run on the big core. Obviously, when the process runs on the small core, the floating point instructions operate on a subset of this state; AVX instructions, on the other hand, operate on the complete state. To verify that our architectural hypothesis is indeed reflected in software behavior, we hand–crafted a few sample applications using AVX instructions that exhibited the fault–and–migrate behavior and could exploit the heterogeneity using the techniques mentioned in Sections 2.2 and 2.3. We also verified that migrating legacy applications (SPEC 2006 benchmarks) across the cores of this heterogeneous system does not result in any functional deviation or incorrect results.

5. OVERCOMING HOMOGENEITY
System software running on current multicore architectures makes several implicit assumptions about the homogeneous nature of the platform. In many cases, these assumptions prevent the basic bootstrapping of the operating system itself on heterogeneous systems. However, a larger problem occurs when core algorithms such as those that govern resource allocation assume homogeneity. This results in silent suboptimal behavior of the system and, in many cases, outright incorrect behavior. In this section we describe several such assumptions that we encountered while running modern operating system stacks on the two custom heterogeneous systems described in Section 4. Even though we focus the discussion mostly on the Linux® operating system, we believe that most of these issues are also present in other operating systems. For example, our analysis of the FreeBSD® software stack reveals that the issues we discuss below also apply to it.

5.1 CPU feature flags
Early in the boot process, the Linux kernel discovers the features that a CPU supports. However, there is only one such persistent data structure, which stores the flags of the bootstrap processor (BSP). These capability bits are then logically ANDed with the features that are discovered when the application processors (AP) are enumerated. Subsequently, whenever the kernel software needs to check for a particular feature flag, it refers to these common capability bits. While this ensures that software that assumes homogeneity will work in the kernel, it was designed to overcome minor differences in processor steppings and only supports a Restricted Model kernel, thus sacrificing the advantages that the big cores may potentially provide.
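The contrast between the ANDed common mask and a heterogeneity-aware alternative can be sketched in a few lines of C. The feature bit names are hypothetical; the point is that the common mask (the Restricted Model) hides big-core features that a per-cpu lookup would preserve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FEAT_SSE2 0x1ULL
#define FEAT_AVX  0x2ULL

/* Today's approach: one persistent mask, ANDed across all CPUs. */
static uint64_t common_caps(const uint64_t *per_cpu, int ncpus)
{
    uint64_t common = ~0ULL;
    for (int i = 0; i < ncpus; i++)
        common &= per_cpu[i];
    return common;
}

/* Heterogeneity-aware approach: consult the mask of a specific CPU. */
static bool cpu_has(const uint64_t *per_cpu, int cpu, uint64_t feat)
{
    return (per_cpu[cpu] & feat) != 0;
}
```

With a big core advertising SSE2+AVX and a small core advertising only SSE2, the common mask drops AVX everywhere, while the per-cpu check still exposes it on the big core.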

In user space, most application software written to take advantage of advanced processor features depends on native instructions (such as CPUID) to discover processor features. While this is not a problem on homogeneous systems, when the operating system chooses to present the Hybrid Model to user space, it can cause software to underutilize the system or generate exceptions. While solutions like fault and migrate (Section 2.3.1) handle these issues as a last resort, feature detection on heterogeneous systems should be rewritten to detect per-cpu features using either native methods or new software interfaces that the application layer can count on.

5.2 Runtime code patching
One of the ways in which the Linux kernel takes advantage of the variations and improvements in hardware is by means of the "alternatives" mechanism. This allows the kernel to optimize itself at boot time, when it knows exactly what platform it is running on. At build time, the kernel stores two (or more) implementations of a given functionality. Only the baseline implementation is included in the initial loadable kernel image, targeting the ISA supported by all the processors in a family. For example, the floating point state saving and restoring code will target the ISA supported by the Pentium processor in the Intel® Architecture. The alternate implementations are stored in a special ELF section. After discovering the platform at boot time, the kernel walks through this special ELF section and the alternative implementations are patched into the kernel image. After the patching process is finished, the running kernel image behaves as if it has been configured/compiled for the exact platform it is running on. Examples of such patching in the Linux kernel include memory barriers, saving and restoring the floating point state, SMP primitives, and prefetch hints.

Figure 8: Scheduler domains in Linux. Each box represents one scheduler domain.

In the context of heterogeneous systems, this runtime patching mechanism may not work, because the alternatives mechanism assumes that the platform is homogeneous. This is an example of an optimization that needs to be redesigned in the context of heterogeneity. One crude method that we have implemented is to replace the alternatives mechanism with real alternative code paths in the kernel image that check for the presence of the specific features. However, other optimizations in this area are possible.
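The "real alternative code paths" approach amounts to replacing boot-time patching with a per-call runtime check. A minimal sketch, using hypothetical stand-ins for the FPU save variants (the strings merely name which path was taken):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Two implementations of the same functionality are kept in the
 * image; neither is patched out at boot. */
static const char *save_fpu_legacy(void) { return "fxsave"; }
static const char *save_fpu_ext(void)    { return "xsave"; }

/* Runtime dispatch on the current core's features replaces the
 * boot-time "alternatives" patching. */
static const char *save_fpu(bool core_has_xsave)
{
    return core_has_xsave ? save_fpu_ext() : save_fpu_legacy();
}
```

The cost is one predictable branch per call, versus zero after patching; the benefit is that a thread migrating between core types always takes a correct path.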

The run–time code patching feature is specific to Linux; the BSD family of operating systems does not seem to have a similar feature.

5.3 System topology
Load balancing in multiprocessor systems depends on the accurate discovery of the system topology. In Linux, the complete system topology is divided into an elegant tree of scheduling domains which may share various parts of the architecture to a smaller or larger degree. An example of a scheduling domain is a group of cores that share the last level cache. An illustration of these domains can be seen in Figure 8. The load balancing code starts at the lowest level, in which each domain is a logical processor, and then moves up the system topology identifying imbalances in each domain. If it finds an imbalance, it attempts to rectify it. This algorithm assumes that the topology is balanced or symmetric. However, in the case of asymmetric core architectures, there is no guarantee that the system topology will be balanced.

One illustration of an asymmetric scheduling domain topology is shown in Figure 9. In such a topology, as we reach the system topology nodes which are not balanced themselves,


Figure 9: Scheduler domains in asymmetric topologies. Load balancing needs to account for differences in topology and compute power.

the balancing act at the NUMA node domain will result in an imbalanced system, because the child domains, which are the packages, have different numbers of logical processors in them. One way to mitigate this is to quantitatively take into account the compute power of a scheduler domain itself. This allows us to balance across asymmetric topologies because their asymmetries are masked by the compute power factor.
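The compute power factor can be sketched as scaling each domain's load by its capacity before comparing. The fixed-point scale of 1024 mirrors the style of capacity units used by scheduler implementations, but the functions and units here are hypothetical.

```c
#include <assert.h>

/* Load per unit of compute capacity, in fixed point (scale 1024). */
static int normalized_load(int runnable_tasks, int capacity)
{
    return runnable_tasks * 1024 / capacity;
}

/* Pick the busiest child domain by normalized, not raw, load. */
static int busiest(const int *tasks, const int *capacity, int ndomains)
{
    int best = 0;
    for (int i = 1; i < ndomains; i++)
        if (normalized_load(tasks[i], capacity[i]) >
            normalized_load(tasks[best], capacity[best]))
            best = i;
    return best;
}
```

For example, a 4-logical-processor package running 4 tasks (normalized load 1024) is less loaded than a 2-logical-processor package running 3 tasks (1536), even though its raw task count is higher.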

5.4 System timekeeping
Linux uses the timestamp counter (TSC) for fine–grained timekeeping. Coarse–grained timekeeping can be done using a variety of timer options available on the platform. However, the TSC runs at the frequency of the processor clock. Linux calibrates the TSC using a known external timer once, on the BSP, and assumes that the calibration holds for all the APs. In a heterogeneous system we may easily have cores that run at different frequencies. This introduces a small error in every standard time measurement, which then builds up cumulatively. Over time (which can be fairly small in wall clock terms) it can become a source of significant error in idle events and fine grained timing related tasks, such as those in the file system, which depend on gettimeofday. As discussed in Section 3.6, the lack of a fixed rate TSC in the E5450/N330 prototype leads to unreliable timing, so our system uses a core-independent timer source (the HPET) for timing purposes.

5.5 Trampoline code
The bootstrap process in a typical multiprocessor requires that the BSP send an inter–processor interrupt to each AP along with a vector. The vector address is the address of the trampoline code. The Linux kernel assumes that this trampoline code is the same for all the processors, which could break in the presence of heterogeneity. In current processor architectures, the trampoline code more or less handles all the processors in a single processor family. However, it is conceivable that if there are cores with different microarchitectures and feature sets that differ slightly in their initial bootstrap process, the bootstrap code will need to be customized for the different processors. This aspect of heterogeneity did not have an impact on any of our prototypes.

5.6 Power efficient idling
Since processor cores can be in an idle state for significant amounts of time, idle state management is a fairly important job of the Linux kernel. The cpuidle driver handles the idle–state management in the Linux kernel. Device driver developers can provide hints to the idle management subsystem about the tolerable latencies in a given device state. The cpuidle driver ensures that there is only one idle loop algorithm for all the cores in the system; however, each of the cores can be in a different state at any given point in time. The idle states are initialized only once, for the BSP, and it is assumed that the same holds true for all the cores. On one of our hardware prototypes the Linux kernel used only C1 state idle handling via the MWAIT instruction. Ideally, one would expect the operating system to use per–cpu idle loops that are optimized for a given core architecture for maximal power savings and appropriate entry and exit latencies. For a heterogeneous system, Linux should have a per–cpu cpuidle driver, which could differ across cores and use the C–states that are specific to each core.

5.7 Dynamic voltage and frequency scaling
Modern processors support a wide range of operating system visible dynamic voltage and frequency scaling (DVFS) states. DVFS provides a convenient method to save battery power, as lower clock speeds (at low voltages) mean lower power consumption. One can think of various governors that act on different cores at different times in heterogeneous systems. As in the case of the cpuidle behavior, the Linux kernel has a single driver that provides the mechanism for driving the various cores in the platform to different P–states (the ACPI term for the various performance/DVFS states). During initialization, this driver populates the various P–states that the core it ran on is capable of transitioning to. This information is copied into all the per–cpu information nodes. On a heterogeneous system this is not a correct assumption, because the cores may have different sets of possible P–states. Thus, we need a per–core-type (or, more generally, per–core) mechanism to drive the individual cores to different P–states as appropriately chosen by a myriad of governing policies.
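The per-core mechanism reduces to each core carrying its own P-state table instead of inheriting the BSP's copy. A sketch with invented frequency values and hypothetical helper names:

```c
#include <assert.h>

/* Per-core-type table of available P-state frequencies,
 * ordered from highest to lowest. */
struct pstate_table {
    const int *freq_mhz;
    int n;
};

static int highest_freq(const struct pstate_table *t)
{
    return t->freq_mhz[0];
}

/* Clamp a governor's requested frequency to a state this
 * particular core actually supports. */
static int clamp_freq(const struct pstate_table *t, int req_mhz)
{
    for (int i = 0; i < t->n; i++)
        if (t->freq_mhz[i] <= req_mhz)
            return t->freq_mhz[i];
    return t->freq_mhz[t->n - 1];
}
```

A governor request is then resolved against the table of the core being driven, so a small core is never asked for a big core's P-state.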

5.8 Machine check architecture
Intel® Architecture processors implement a mechanism that can detect and report hardware errors, called the Machine Check Architecture (MCA). To this end, they contain several error–reporting register banks, each associated with one or more hardware units. Our E5450/N330 hardware prototype consisted of cores that supported different numbers of MCA banks. The Linux kernel assumes that the banks discovered on the BSP are applicable to all the APs; in fact, it forces the number of banks to be the same. The actual meaning of the error codes reported is likely to differ between cores in a heterogeneous system. Thus, the identification of the MCA and the subsequent hooking up of the error decoding notifier chain should be made core type specific in heterogeneous systems.

5.9 Non–architectural features
The Linux kernel sometimes uses non–architectural features such as model–specific registers (MSR). While the use of such features in a homogeneous system will not create any problems, in heterogeneous systems such features always need to be conditional on the individual core type. One example of such use is the last branch record (LBR) feature in the perf subsystem. When these registers need to be reset, the addresses of the registers are model–specific, and differ between the E5450 and N330 processors. The Linux kernel assumes that the registers are the same, resulting in machines crashing in strange ways.

Operating systems have generalized certain properties of the cores (such as the C–states and P–states) as platform devices. Since in homogeneous systems the mechanism and the possible states of the cores are identical, this approximation holds. However, in heterogeneous systems both the possible states and the mechanism to change state differ, so these devices need to be per–core, with distinct properties in terms of states as well as mechanisms. In many cases, the assumption of homogeneity is built into the many data structures the kernel handles. While we cover some specific examples in the previous sections, more generically the kernel needs to ensure that any data structure that is CPU specific is made per-CPU. For example, in the case of our E5450/N330 system, the N330 processor features simultaneous multithreading while the E5450 does not. In Linux, a single variable accounts for the number of hardware threads that a single core supports, leading to incorrect construction of the scheduler domain hierarchy on this system.

6. RELATED WORK
Prior work has demonstrated the benefits of single ISA heterogeneous architectures in achieving improved performance per watt. Kumar et al. [15, 16] demonstrated the performance benefits of single ISA heterogeneous architectures. Annavaram et al. [1] varied the amount of energy expended to process instructions according to the amount of available parallelism. Ghiasi et al. [7] demonstrated improved efficiency by using application characteristics to control dynamic voltage and frequency scaling. Hill and Marty [10] show that asymmetric multi-core designs provide greater potential speedup than symmetric designs.

Knauerhase et al. [13] demonstrated the benefits of dynamic runtime observation in the operating system to optimize performance on heterogeneous architectures. Koufaty et al. [14] added bias scheduling for heterogeneous systems to match threads to the core type that can maximize system throughput, based on dynamically matching workload characteristics to core microarchitecture. Shelepov and Fedorova [28] proposed Heterogeneity-Aware Signature-Supported scheduling, using per-thread architectural signatures to avoid the need for dynamic profiling.

The majority of prior research assumes either single or disjoint ISAs. Our work lies in between and considers overlapping ISAs, which we believe is more practical. For disjoint ISAs, most designs manage the “different” cores as coprocessors or peripherals and incur high overhead when moving contexts across address spaces. CUDA® exposes graphics processors as coprocessors through libraries and OS drivers [22]. Cell® offloads predefined code blocks to Synergistic Processor Elements [8], and EXOCHI offloads to a graphics processor via libraries and compiler extensions [29]. These designs place a great burden on programmers, whereas we allow the operating system to transparently manage all cores as traditional CPUs.


Li et al. [19] presented a comprehensive study of OS support for heterogeneous architectures in which cores have asymmetric performance and overlapping, but non-identical, instruction sets, enabling transparent execution and fair sharing of the different types of cores. We complement this work by presenting a detailed analysis of a real heterogeneous platform.

MISP [9, 29] employs proxy execution similar to fault-and-migrate. However, it requires hardware support for user-level faults and inter-processor communication. Hypervisors such as Xen [3] and VMware [25] use fault-and-emulate techniques to implement instructions not supported by hardware virtualization features.

Existing operating systems that expose heterogeneity generally avoid resource management issues by having secondary schedulers for the heterogeneous resources, accessed via a device driver model. Probably the most prevalent example is the use of device drivers for graphics cards, which hide domain-specific compute engines.

In the research community, Helios [21] introduces satellite kernels to export a single, uniform set of OS abstractions across heterogeneous systems, retargeting applications to available ISAs by compiling them to an intermediate language. Barrelfish [4] takes a distributed-system approach, hiding CPU asymmetry behind “CPU drivers” and exposing CPU capabilities directly to applications, which must adapt based on the dynamic view of the system held in a system knowledge base.

7. CONCLUSIONS
In this paper, we analyze various methods that can be used to bridge the functional issues that arise in heterogeneous systems whose processor cores come from different microarchitecture families. We explore the design choices available to software designers and discuss the pros and cons of each. We find that the software stack, including operating systems and applications, will need to adapt to exploit the heterogeneous nature of the underlying platform. We have prototyped some of our software modifications on two hardware prototypes of heterogeneous systems. We conclude that such modifications will allow the smooth, backward-compatible adoption of heterogeneous systems into the computing realm. Even as we enumerate the software choices to overcome heterogeneity, there is no one choice that suits all software stacks. The design goals of individual software stacks will certainly influence the way in which they approach heterogeneity. Although our study and experience have a number of limitations, our findings can influence the direction in which operating systems evolve as well as hardware design choices. By way of hardware and software prototypes, we identify several assumptions about homogeneity that current operating systems make. Further, our experience with hardware prototyping emphasizes that a symbiotic evolution of hardware and software is necessary for this category of heterogeneous systems to fulfill its potential.

8. REFERENCES
[1] M. Annavaram, E. Grochowski, and J. Shen. Mitigating Amdahl’s law through EPI throttling. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298–309, June 2005.

[2] A. Barak and O. La’adan. The MOSIX multicomputer operating system for high performance computing. In Future Generation Computer Systems, 1998.

[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating System Principles, New York, NY, USA, Oct. 2003. ACM.

[4] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the 22nd ACM Symposium on Operating System Principles, pages 29–44, New York, NY, USA, Oct. 2009. ACM.

[5] F. Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the 2005 USENIX Annual Technical Conference, Apr. 2005.

[6] F. A. Bower, D. J. Sorin, and L. P. Cox. The impact of dynamically heterogeneous multicore processors on thread scheduling. IEEE Micro, 28(3):17–25, May/Jun 2008.

[7] S. Ghiasi, T. Keller, and F. Rawson. Scheduling for heterogeneous processors in server systems. In Proceedings of the 2nd Conference on Computing Frontiers, pages 199–210, May 2005.

[8] M. Gschwind. The Cell broadband engine: Exploiting multiple levels of parallelism in a chip multiprocessor. International Journal of Parallel Programming, 35(3), June 2007.

[9] R. A. Hankins, G. N. Chinya, J. D. Collins, P. H. Wang, R. Rakvic, H. Wang, and J. P. Shen. Multiple instruction stream processor. In Proceedings of the 33rd Annual International Symposium on Computer Architecture, pages 114–127, June 2006.

[10] M. Hill and M. Marty. Amdahl’s law in the multicore era. IEEE Computer, 41(7):33–38, July 2008.

[11] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2: Instruction Set Reference. Intel Corporation, June 2009.

[12] Intel Corporation. Intel® Virtualization Technology FlexMigration Application Note 323850. http://www.intel.com/Assets/PDF/manual/323850.pdf, May 2010.

[13] R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. Using OS observations to improve performance in multi-core systems. IEEE Micro, 28(3):54–66, May 2008.

[14] D. Koufaty, D. Reddy, and S. Hahn. Bias scheduling in heterogeneous multi-core architectures. In Proceedings of the Fifth European Conference on Computer Systems, New York, NY, USA, Apr. 2010. ACM.

[15] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pages 81–92, Dec. 2003.


[16] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas. Single-ISA heterogeneous multi-core architectures for multithreaded workload performance. In Proceedings of the 31st Annual International Symposium on Computer Architecture, pages 64–75, June 2004.

[17] T. Li, D. Baumberger, D. Koufaty, and S. Hahn. Efficient operating system scheduling for performance-asymmetric multi-core architectures. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, Nov. 2007.

[18] T. Li, P. Brett, B. Hohlt, R. Knauerhase, S. D. McElderry, and S. Hahn. Operating system support for shared-ISA asymmetric multi-core architectures. In Proceedings of the Fifth Annual Workshop on the Interaction between Operating Systems and Computer Architecture, June 2009.

[19] T. Li, P. Brett, R. Knauerhase, D. Koufaty, D. Reddy, and S. Hahn. Operating system support for overlapping-ISA heterogeneous multi-core architectures. In Proceedings of the Sixteenth International Symposium on High-Performance Computer Architecture, Jan. 2010.

[20] C. Morin, R. Lottiaux, G. Vallée, P. Gallard, G. Utard, R. Badrinath, and L. Rilling. Kerrighed: a single system image cluster operating system for high performance computing. In Proceedings of the 9th International Euro-Par Conference, Aug. 2003.

[21] E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt. Helios: heterogeneous multiprocessing with satellite kernels. In Proceedings of the 22nd ACM Symposium on Operating System Principles, pages 221–234, New York, NY, USA, Oct. 2009. ACM.

[22] NVIDIA. NVIDIA CUDA Programming Guide, Version 1.1. NVIDIA Corporation, Nov. 2007.

[23] S. Parekh, S. Eggers, H. Levy, and J. Lo. Thread-sensitive scheduling for SMT processors. 2000.

[24] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa. The design and implementation of a first generation CELL processor. In IEEE International Solid-State Circuits Conference Digest of Technical Papers, pages 184–185, Feb. 2005.

[25] M. Rosenblum and T. Garfinkel. Virtual machine monitors: Current technology and future trends. In Computer, volume 38, pages 39–47, May 2005.

[26] J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodurov. A comprehensive scheduler for asymmetric multicore systems. In Proceedings of the Fifth European Conference on Computer Systems, pages 139–152, New York, NY, USA, Apr. 2010. ACM.

[27] S. Balakrishnan, R. Rajwar, M. Upton, and K. Lai. The impact of performance asymmetry in emerging multicore architectures. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506–517, June 2005.

[28] D. Shelepov and A. Fedorova. Scheduling on heterogeneous multicore processors using architectural signatures. In Proceedings of the Fourth Annual Workshop on the Interaction between Operating Systems and Computer Architecture, June 2008.

[29] P. H. Wang, J. D. Collins, G. N. Chinya, H. Jiang, X. Tian, M. Girkar, N. Y. Yang, G.-Y. Lueh, and H. Wang. EXOCHI: Architecture and programming environment for a heterogeneous multi-core multithreaded system. In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, pages 156–166, June 2007.
