NANOS: Effective Integration of Fine-grain Parallelism Exploitation and Multiprogramming


E. Ayguadé1, C. R. Calidonna2, J. Corbalan1, M. Giordano2, M. Gonzalez1, H. C. Hoppe3, J. Labarta1, M. Mango Furnari2, X. Martorell1, N. Navarro1, D. S. Nikolopoulos4, J. Oliver1, T. S. Papatheodorou4 and E. D. Polychronopoulos4

Abstract

The objective of the NANOS project is to investigate possible ways to accomplish both high system throughput and application performance for parallel applications in multiprogrammed environments on shared–memory multiprocessors. The target of the project has been the development of a complete environment in which interactions between mechanisms and policies at different levels (application, compiler, threads library and kernel) are carefully coordinated, in order to achieve the aforementioned goals. The environment integrates techniques proposed in different research frameworks, enabling the exploitation of their combined potential and the development of new algorithms and ideas. The NANOS environment includes 1) an application development environment consisting of an application structure visualization and transformation tool; 2) an extended OpenMP parallelizing compiler; 3) a runtime user–level threads library; 4) an application performance visualization and analysis tool; 5) a processor manager; and 6) a system activity visualization tool. NANOS focuses on numerical applications written in Fortran77 and targets a current state of the art computer, the SGI Origin2000.

1. Introduction

Parallel processing, which was originally conceived for high–end systems, is currently migrating down to low–cost workstations. This trend will continue towards PC–systems in the future. Programmers for these parallel architectures find significant difficulties in getting the highest performance out of their programs and have to face issues like low–level architectural details, coherence protocols and data distribution. System software (compiler and runtime support) for these architectures lacks the required functionality to exploit all the parallelism available in the applications. Finally, programmers frequently assume a dedicated system while developing their programs; however, production runs are usually performed in multi–user, multiprogrammed environments. Events like blocking system calls, paging, starting and termination of processes introduce a high variability in the system workload and directly affect system throughput and performance of individual applications.

The parallelization of large applications requires the support of interactive compilation environments, equipped with program structure visualization and interaction mechanisms, to exploit both user and compiler knowledge to guide the parallelization process. The novelty of our approach resides in offering a hierarchical graphical representation of the program that allows the user to navigate through different application granularity levels, interact with the information provided by the compiler and perform the appropriate parallelization actions. The analysis of performance requires an environment for instrumentation of parallel programs, and the analysis and visualization of traces gathered from their actual execution.

1 European Center for Parallelism of Barcelona, Universitat Politècnica de Catalunya. Barcelona, Spain. Corresponding author: E. Ayguadé ([email protected]). Project web page: http://www.ac.upc.es/nanos.
2 Istituto di Cibernetica, Consiglio Nazionale delle Ricerche. Arco Felice, Napoli, Italy.
3 Pallas GmbH. Brühl, Germany.
4 High Performance Information Systems Laboratory, University of Patras. Patras, Greece.

This closes the cyclic parallelization process in which, at each step, the parallelization is refined by combining user knowledge, compiler techniques and dynamic data gathered from instrumentation.

Currently, the majority of commercial and research parallelizing compilers target the exploitation of a single level of parallelism around loops (for example, the current version of the SGI MIPSpro compiler, the SUIF compiler infrastructure [9] or the MOERAE infrastructure on top of the Polaris compiler [10]). Exploiting a single level of parallelism means that there is a single thread (master) that produces work for other processors (slaves). Once parallelism is activated, the execution environment ignores new opportunities for parallel work creation.
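
As a point of reference, the following fragment shows what single–level parallelism looks like in OpenMP Fortran: only the outer loop is distributed, and the inner loop runs sequentially inside each thread. This is an illustrative example only, not code taken from any of the benchmarks discussed later.

      SUBROUTINE SCALE(A, N, M)
      INTEGER N, M, I, J
      REAL A(N, M)
C     One master thread reaches the directive; the iterations of the
C     outer J loop are shared among the slave threads, while the inner
C     I loop runs sequentially inside each thread.
C$OMP PARALLEL DO PRIVATE(I)
      DO J = 1, M
         DO I = 1, N
            A(I, J) = 2.0 * A(I, J)
         END DO
      END DO
      END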

Multi–level parallelism enables the generation of work from several simultaneously executing threads. Once parallelism is activated, new opportunities for parallel work creation result in the generation of work for all or a restricted set of processors. Multi–level parallelization may play an important role in providing scalable programming and execution models; nested parallelism may also provide further opportunities for work distribution, both around loops and sections. The OpenMP application program interface [19], jointly defined by a group of major computer hardware and software vendors, includes in its current definition the exploitation of multi–level parallelism through the nesting of work–sharing constructs. Our project contributes to this issue by providing an implementation of OpenMP with true multi–level parallelism support and some additional extensions oriented towards the definition of processor groups.
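
The fragment below sketches, in standard OpenMP Fortran, how such nesting is expressed: the outer PARALLEL DO distributes blocks among a first team of threads, and each member opens a nested parallel region over the rows of its block. An environment restricted to a single level would serialize the inner region, whereas a runtime with true multi–level support can spawn work for additional processors. This is an illustrative example; the NANOS processor–group extensions are not shown.

      SUBROUTINE RELAX(A, N, NB)
      INTEGER N, NB, I, K
      REAL A(N, NB)
C     Outer level: blocks are distributed among a first team of threads.
C$OMP PARALLEL DO PRIVATE(I)
      DO K = 1, NB
C     Inner level: each outer thread opens a nested parallel region and
C     distributes the rows of its own block.
C$OMP PARALLEL DO
         DO I = 1, N
            A(I, K) = 0.25 * A(I, K)
         END DO
      END DO
      END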

Previous work on supporting multi–level parallelism focused on providing some kind of coordination support to allow the interaction of a set of program modules (task parallelism) in the framework of HPF [11] parallel programs (data parallelism) for distributed memory architectures [5,6,8,26]. Recently, KAI (Kuck and Associates, Inc.) has made proposals to OpenMP to support multi–level parallelism through the WorkQueue mechanism [12], in which work can be created dynamically, even recursively, and put into queues. Within the WorkQueue model, nested queuing permits a hierarchy of queues to arise, mirroring recursion and linked data structures.

The research conducted during the recent past in operating system scheduling for multiprogrammed shared–memory multiprocessors revealed that different forms of gang scheduling [4,20] and dynamic space–sharing with process control [17,30] were able to deliver satisfactory throughput and acceptable performance for parallel applications in multiprogrammed environments. Although both proposals were extensively explored in the literature, their adoption in contemporary operating systems is so far quite limited, either due to their implementation complexity or due to their poor integration in traditional operating system schedulers and modern Non–Uniform Memory Access (NUMA) architectures. The kernel–level scheduling policies proposed in NANOS attempt to bridge the gap between research proposals and contemporary operating systems and adapt kernel scheduling to the architectural characteristics of multiprogrammed NUMA multiprocessors. To this end, we designed kernel–level algorithms along four major guidelines. First, the processor allocation performed by the operating system is dynamically guided by the actual resource requirements of each application in terms of parallelism. Second, applications communicate with the kernel directly through shared memory instead of using system calls, in order to be able to adapt immediately to any kernel–side scheduling event that affects their execution, such as changes of the workload, page faults and blocking system calls. Third, the scheduling algorithms exploit to the maximum extent the cache footprints of kernel threads by maintaining a history of affinity bindings of threads to processors, in order to overcome the performance implications of frequent thread migrations. Fourth, the implementation of the algorithms enables their integration in any standard priority–aging time–sharing scheduler.

Figure 1 shows the structure of the whole system developed in the NANOS project. The different components assist the parallelization, performance prediction, analysis and tuning of parallel applications. The whole environment provides an advanced compiler (the NanosCompiler) and runtime support (through the NthLib user–level threads library) for multi–level hierarchical parallelism exploitation. Finally, the application–kernel cooperation mechanism and the combined space/time–sharing scheduling policies aim at offering an efficient environment for the execution of multiprogrammed workloads. A major objective achieved within NANOS is the development of a technology demonstrator for the complete hierarchy. It allows effective experimentation to coordinate the mechanisms and policies carried out at all levels involved. Although the availability of this system in itself gives us an enormous research potential, we also envisage that different component technologies can be extracted from our environment and exploited in other frameworks. These components are easily portable to a variety of parallel platforms.


Figure 1: Structure of the whole NANOS environment: main components and interfaces.

2. OpenMP NanosCompiler

The NanosCompiler is a source–to–source parallelizing compiler for OpenMP applications written in Fortran 77. It has been implemented on top of the Parafrase–2 compiler [22] around its hierarchical internal program representation. The representation captures the parallelism expressed by the user (through OpenMP directives) and the parallelism automatically discovered by the compiler through a detailed analysis of data and control dependences. The compiler is finally responsible for encapsulating work into threads, establishing their execution precedences and selecting the mechanisms to execute them in parallel. The code generated by the compiler contains calls to the user–level threads library (NthLib) API.

The NanosCompiler is able to exploit multiple levels of parallelism and generate work from multiple simultaneously executing threads. Once parallelism is activated on a coarse level, new opportunities for parallelism on a finer level result in the generation of work for all or for restricted groups of processors. One of the objectives of the project is to prove that multi–level parallelization can be instrumental in providing scalability in a programming and execution model; nested parallelism provides further opportunities for work distribution, both for parallel loops and code sections [2].

The NanosCompiler offers a GUI [3] that assists the user in the parallelization of sequential applications, providing her/him with information to understand the application structure and parallelization opportunities. The application is represented as a hierarchical task graph (HTG [7]), where tasks are application units (HTG nodes) and precedences (HTG arcs) represent serialization of their execution. The GUI includes (as shown in Figure 2) a visualization module that assists the user in navigating through the hierarchical representation of the application structure and provides information about data and control dependences and precedences. The GUI provides an OpenMP wizard that allows the user to introduce directives by directly interacting with this hierarchical representation (refining task granularities and precedences). Finally, the GUI offers an interface that guides the user through the process of generating valid and efficient parallel code with calls to the NthLib.

Figure 2: GUI for the NanosCompiler. This environment offers graphical application development for the OpenMP programming model and drives parallel code generation for the NthLib user–level threads library.

3. User–level threads library (NthLib)

NthLib is a user–level threads library primarily designed to provide runtime support at the backend of the NanosCompiler. The main focus is to provide effective support for multiple levels of parallelism. The library not only supports the structured parallelism offered by OpenMP but also supports the execution of parallel tasks expressed by means of a general unstructured hierarchical task graph. The NthLib API is simple enough to be used for multithreaded programming on a variety of multiprocessor platforms.

NthLib has been designed with the objective of providing efficient mechanisms to manage the parallelism, at different levels of granularity, present in parallel applications. On one hand, the library offers a simple interface for the management of fine–grain loop–level parallelism (through work descriptors), designed to minimize fork/join overheads [15]. On the other hand, the library provides mechanisms for the management of coarser levels of parallelism (through work queues), where the specification of precedences among threads is necessary, thus supporting the unstructured parallelism found in arbitrary task graphs [16]. These general mechanisms allow the management of private thread stacks that contain variables used by the thread itself and that have to be maintained while other spawned threads are active.

The structural elements available in the library (like the work descriptors and the global and per–processor local queues) enable good load balancing and the exploitation of data locality. Processors search for work in their own local queue and the global queue as soon as they become idle. A processor executing a thread can spawn new parallelism by simply creating new thread descriptors; pointers to these descriptors can be inserted into the ready–to–run queues globally or individually. For example, a single descriptor can be inserted into all queues, or individual descriptors can be put into a particular queue or a group of queues. Using these mechanisms, the compiler can control the generation and distribution of work to processors, targeting data locality and a balanced workload allocation among the executing processors. Alternative organizations for the data structures holding ready–to–run descriptors are under evaluation; for instance, the use of hierarchical queues to implement user–level scheduling strategies for applications with inherent multi–level parallelism [18].

Finally, the library is flexible enough to allow experimentation with user–supplied scheduling policies as well as dynamic mechanisms directly supported by the library, such as work stealing or runtime exploitation of data locality. For example, dynamic user–level scheduling techniques analogous to Dynamic Bisectioning Scheduling (DBS) [25] enable a processor to partially steal work from the local queue of another processor, thus quickly reacting to dynamic variations of load or adapting to unpredictable loads.

4. Instrumentation environment for performance analysis and visualization

The NanosCompiler allows the instrumentation of both sequential and parallel application programs. The former is useful for the prediction of parallel performance, while the latter captures the dynamic application behavior for performance analysis. The instrumentation for performance prediction is based on the generation of calls to an instrumentation library targeted to Dimemas [21], a parametric simulator of parallel architectures. The instrumentation for performance analysis and visualization is based on the generation of calls to an instrumentation library that gathers information from the hardware counters of the machine, records the execution status of each thread and inserts events related to the OpenMP directives. Either Dimemas or the instrumentation library generates a trace suitable for being visualized with Paraver [13].

Figure 3 shows a snapshot of the visualization with Paraver. Different colors in the trace are used to indicate the current status of each thread: idle (light blue), running (dark blue), blocked (red) or creating work (yellow). Flags are used to signal events during the execution; they have associated types and values related to the original program and OpenMP parallelization, and to display performance statistics (cache misses, invalidations, etc.) directly gathered from the MIPS hardware counters. For instance, the trace visualized in Figure 3 shows the generation of work when two levels of parallelism are defined (four sections executing in CPU1, CPU5, CPU9 and CPU13, each one forking loop–level parallelism for four additional threads). After a barrier, the main thread (CPU1) generates work for itself and the rest of the processors to exploit a single level of loop parallelism.

5. Kernel–interface and scheduling policies

Besides the execution time of individual applications, system utilization and throughput are important performance metrics that must be considered in current multiprocessor systems that have to run workloads consisting of different kinds of applications (such as number crunching, large–scale simulations, data base engines and web server applications). The project aims at proposing efficient kernel scheduling policies and a new kernel–application interface to support efficient parallel execution in environments with diverse workloads. As an initial approach, we have implemented the interface in the context of a user–level CPU manager.

The kernel–interface provides mechanisms to dynamically establish a dialog between the kernel and individual applications. Reacting to this dialog, the kernel (re)allocates physical processors to applications, according to the active scheduling policies, the application demands and the availability of system resources. Application adaptability and recovery of preempted work are accomplished through a lightweight communication path between applications and the kernel, implemented through shared memory.
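
The sketch below illustrates the idea of application adaptability in plain OpenMP Fortran: before each parallel phase, the application asks how many processors it is currently granted and adjusts its degree of parallelism accordingly. In NANOS this value is read from the shared–memory area exported by the kernel interface; since that interface is not reproduced here, the standard routine OMP_GET_NUM_PROCS() is used merely as a stand–in for the granted value.

      PROGRAM ADAPT
      INTEGER STEP, I, GRANTED
      INTEGER OMP_GET_NUM_PROCS
      REAL A(1000)
      DO STEP = 1, 10
C        In NANOS the number of granted processors is read from shared
C        memory on every scheduling event; here it is only approximated
C        by the number of processors visible to the OpenMP runtime.
         GRANTED = OMP_GET_NUM_PROCS()
         CALL OMP_SET_NUM_THREADS(GRANTED)
C$OMP PARALLEL DO
         DO I = 1, 1000
            A(I) = REAL(I) * STEP
         END DO
      END DO
      PRINT *, A(1000)
      END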

Figure 3: Instrumentation of real parallel execution and visualization with Paraver. It allows the user to navigate along the whole execution, visualize events related to the OpenMP parallelization, and statistically analyze and measure execution times and hardware counters.

The NANOS environment provides a platform for research on scheduling policies that combine different degrees of space and time–sharing, to reduce system overhead, improve locality, increase throughput and assure fairness. For instance, the Dynamic Space Sharing (DSS [23]) policy applies dynamic space–sharing among executing parallel applications based on their processor requirements and the system load. DSS has been effectively combined with some time–sharing policies to provide long–term fairness and integration in general–purpose kernel schedulers.
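
As a concrete illustration of dynamic space–sharing, the routine below implements the classical equipartition idea, which also appears as one of the policies evaluated in Section 6: the available processors are divided evenly among the running applications, never exceeding an application's request, and leftovers are redistributed. This is a generic sketch for illustration only; it is not the DSS algorithm itself, which additionally takes the system load into account.

      SUBROUTINE EQUIPART(REQ, ALLOC, NAPP, NPROC)
      INTEGER NAPP, NPROC
      INTEGER REQ(NAPP), ALLOC(NAPP)
      INTEGER LEFT, I, SHARE, GIVE, NUNSAT
C     Start with empty allocations and all applications unsatisfied.
      DO I = 1, NAPP
         ALLOC(I) = 0
      END DO
      LEFT = NPROC
      NUNSAT = NAPP
C     Repeatedly hand each still-unsatisfied application an equal share
C     of the remaining processors, capped by its request.
 10   IF (LEFT .GT. 0 .AND. NUNSAT .GT. 0) THEN
         SHARE = MAX(LEFT / NUNSAT, 1)
         NUNSAT = 0
         DO I = 1, NAPP
            IF (ALLOC(I) .LT. REQ(I) .AND. LEFT .GT. 0) THEN
               GIVE = MIN(SHARE, REQ(I) - ALLOC(I), LEFT)
               ALLOC(I) = ALLOC(I) + GIVE
               LEFT = LEFT - GIVE
            END IF
            IF (ALLOC(I) .LT. REQ(I)) NUNSAT = NUNSAT + 1
         END DO
         GOTO 10
      END IF
      END

For example, with requests of 16, 16, 16, 24, 16 and 8 threads on 64 processors, this sketch hands out 12, 11, 11, 11, 11 and 8 processors respectively.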

Another outcome of the NANOS project is a tool to visualize kernel scheduling in IRIX. It allows monitoring the status and migration of each application running on the system throughout its lifetime. It is implemented as a high priority user–level process that investigates and records the status of each application. Paraver is used to visualize and analyze the effects of the scheduling policies in terms of number of process migrations, average time for the execution of bursts, etc. For example, Figure 4 shows the execution trace for a workload composed of six applications running on a system with 64 processors. Each color in the trace represents an application. The trace on the left shows the execution when the original scheduler in IRIX 6.4 is used. The trace on the right shows the execution of the same workload using one of the policies developed in the project. The (yellow) vertical lines represent migrations of kernel threads from one processor to another. For example, in the time frame shown in Figure 4, the scheduling policy reduces the number of process migrations from 1300 to 500.

Figure 4: Visualization of workload scheduling using IRIX 6.4 and one of the kernel scheduling policies proposed in the project that effectively combines space and time–sharing.

6. Performance evaluation

A set of benchmarks, representing a wide range of important HPC applications, has been used to assess the functionality, robustness and performance of the complete software suite. For the evaluation of kernel interaction and scheduling policies, several workloads composed of the previous benchmarks are used to simulate a multiprogrammed environment running parallel applications. The experiments reported in this paper have been done on a Silicon Graphics Origin2000 system [14,28] with 64 R10k processors, running at 250 MHz with 4 MB of secondary cache each. For all compilations we have used the native SGI compiler [27] with flags set to -64 -Ofast=ip27 -LNO:prefetch_ahead=1:auto_dist=on.

6.1 Single–application performance

The additional functionalities offered by the whole system introduce negligible performance penalty in the execution of programs with a single level of parallelism around loops. For example, Figure 5 shows the performance for two applications from SPEC95 [29] (Swim and Tomcatv) compiled with the native SGI OpenMP compiler (label MP) and the NanosCompiler (label NANOS). Both applications have loop–level parallelism. Notice that the performance of the two versions is similar up to a certain number of processors (16 in Swim and 24 in Tomcatv) and after that the NANOS version performs better; in any case, the performance drops when more than 32 processors are used.

Figure 5: Performance for two SPEC95 programs: Swim and Tomcatv.

The objective of the experimental work conducted in the NANOS project is to show that, in some cases, exploiting additional levels of coarser parallelism may result in higher performance returns as the number of processors is increased. Therefore, multi–level parallelism has the potential to play an important role in providing better scalability because it offers further opportunities for work distribution, both for loops and sections. Locality issues have to be taken into consideration by identifying groups of processors that work on specific parts of the application. The programmer has the task of minimizing the negative performance effects caused by data movement that may occur among groups of processors; this overhead may easily counteract any gain produced by better work distribution.

For example, Figure 6 shows the performance obtained from the execution of the Hydro2D SPEC95 application. Two different parallelization strategies have been evaluated for this benchmark: single and multi–level [2]. Notice that the performance plot for the single–level strategy (parallelization of the outermost loops only) saturates beyond 16 processors because of insufficient work availability. A second level of parallelism (around sections, each one working on a different array) increases the performance and achieves scalability up to a larger number of processors (48). Additional levels of parallelism can be exploited in Hydro2D, but they cause loss of data locality and lead to excessive overheads.
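
Schematically, the two–level strategy corresponds to the following kind of OpenMP structure (an illustrative fragment only, not code taken from Hydro2D): independent sections update different arrays, and each section opens an inner level of loop parallelism.

      SUBROUTINE SWEEP(A, B, N)
      INTEGER N, I, J
      REAL A(N), B(N)
C     First level: the two sections are given to two outer threads.
C$OMP PARALLEL SECTIONS
C$OMP SECTION
C     Second level: each section forks loop-level parallelism over
C     the array it works on.
C$OMP PARALLEL DO
      DO I = 1, N
         A(I) = A(I) + 1.0
      END DO
C$OMP SECTION
C$OMP PARALLEL DO
      DO J = 1, N
         B(J) = 2.0 * B(J)
      END DO
C$OMP END PARALLEL SECTIONS
      END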

Figure 6: Parallel performance on an SGI Origin2000 system for two parallelization strategies in the Hydro2D benchmark.

6.2 Performance for multiprogrammed workloads

The workloads used to evaluate the impact of the tight cooperation between the kernel and parallel applications, and the behavior of different scheduling policies, are composed of combinations of the benchmarks used for the evaluation of individual application performance. Workloads are either homogeneous (composed of several instances of the same benchmark with the same processor
requirements) or heterogeneous (composed of different benchmarks requesting different numbers of processors), both in open and closed systems. The workloads model heavily loaded multiprogrammed systems and stress critical performance parameters of the processors and the memory hierarchy, thus leaving the kernel scheduler little flexibility and adaptability to avoid performance degradation.

For example, Figure 7 shows the average slowdown for some of the SPEC95 applications when executed in homogeneous workloads in closed system mode (each row represents a workload composed of 8 instances of the same benchmark, each asking for 16 processors). The slowdown is measured with respect to the parallel execution in a dedicated system. For each program, several scheduling policies are used (original irix–mp, equipartition and several versions of the proposed DSS policy enhanced with time–sharing disciplines [24]). Figure 8 shows the average slowdown and number of complete instances for each application in a single workload composed of the parallel applications listed in the table. This evaluation has been performed using two scheduling policies (original irix–mp and equipartition).

Workload                      Slowdown (w.r.t. dedicated execution)
Application     # processors  Irix–mp  Equipartition  Sw–dss  Ssw–dss  Vtq–dss
8 x Hydro2D     16            2.78     1.43           1.38    1.45     1.39
8 x Swim        16            4.67     3.41           3.31    3.40     3.12
8 x Tomcatv     16            2.11     1.85           1.69    1.72     1.66
8 x Turb3D      16            3.03     2.48           2.47    2.44     2.34

Figure 7: Average slowdown for some benchmarks in homogeneous workloads in closed system mode.

Workload                        Slowdown                # complete instances
Application    # of processors  Irix–mp  Equipartition  Irix–mp  Equipartition
BT             16               1.92     2.00           6        6
Ltomcatv       1, 16            1.54     1.14           2        3
Swim           16               6.61     2.17           4        6
Turb3D         24               2.39     1.33           1        3
Hydro2D        16               3.69     1.56           1        3
Su2cor         8                1.17     1.03           4        4

Figure 8: Average slowdown and number of complete instances for the benchmarks composing a heterogeneous workload in closed system mode. The measures are taken in a time frame that starts at the first completed instance and finishes at the 7th completed instance (both for the first arriving application).

The proposed kernel scheduling methodologies demonstrate solid performance improvements over the native IRIX multiprogrammed execution environment. These improvements are mainly attributed to three factors: 1) the instantaneous adaptability of parallel programs, which exploit the NANOS kernel interface to regulate their degree of parallelism according to the available resources; 2) the potential of the hybrid space and time–sharing policies that let applications exploit parallelism even when executed in a heavily multiprogrammed environment; and 3) the improved memory performance of parallel applications due to the minimization of thread migrations and the maximal exploitation of thread–to–processor affinity.

7. Concluding remarks

The NANOS environment integrates compiler, runtime system and operating system techniques to achieve high system throughput and good application performance for parallel applications in multiprogrammed environments on shared–memory multiprocessor systems.

Our approach is based on two layers of interaction: user–application and application–kernel. The user–application interaction takes place during the program parallelization phase by means of the NanosCompiler GUI that lets the programmer analyze the parallelism discovered by the compiler and specify manual parallelization by means of OpenMP directive annotation. A compiler–based instrumentation environment enables performance prediction and visualization processes to tune the parallelization of the application from dynamically gathered information.

The application–kernel interaction is realized through NthLib. NthLib has been primarily designed to provide runtime support at the backend of the NanosCompiler and enable the effective exploitation of multiple levels of parallelism with minimal overhead. At the same time, NthLib integrates a communication mechanism, based on shared memory, to interface with the kernel scheduler. This interface provides mechanisms that let the applications regulate their degree of parallelism by taking into account the available system resources and recover their preempted threads to ensure their progress in the presence of a heavy multiprogramming load.

The performance evaluation on a set of benchmarks demonstrates that the additional functionalities offered by the whole system introduce negligible performance penalty in the execution of programs with a single level of parallelism around loops. It also shows that multi–level parallelism may play an important role in providing better scalability for parallel applications because it offers further opportunities for work distribution. The proposed kernel scheduling methodologies demonstrate solid performance improvements over the native IRIX multiprogrammed execution environment.

In summary, the major objective achieved within the NANOS project has been the development of a technology demonstrator for the complete hierarchy. It allows effective experimentation to coordinate the mechanisms and policies carried out at all levels involved. For instance, using the NANOS environment we can investigate the following issues: Is it possible to effectively exploit multiple levels of parallelism in OpenMP applications? How does locality interfere with overhead and load balancing issues? Which system level (compiler/library/kernel) should be responsible for which resource management decision? How do parallel applications written in traditional parallel programming models adapt to multiprogrammed environments?

Although the availability of the system in itself gives us an enormous research potential, we also envisage that different component technologies can be extracted from our environment and exploited in other frameworks. Examples of tools that can be used as stand–alone products are: a portable OpenMP compiler and runtime system; a visualization tool for OpenMP program structures; a dynamically loaded instrumentation library and visualization tool for programs compiled for the SGI system; and a tool for the visualization of IRIX scheduling (process activity and migration). These components are easily portable to other parallel platforms.

Acknowledgements

This work has been supported by: the European Commission through the ESPRIT NANOS project 21907; the Ministry of Education of Spain under contracts TIC97-1445CE and TIC98-511; the Consiglio Nazionale delle Ricerche under grant CNR/ASI 1995-RS-04; and by Silicon Graphics Inc. (Italy) under contract BS/CIB/97091.

References

1. E. Ayguadé, X. Martorell, J. Labarta, M. Gonzalez and N. Navarro. Exploiting Parallelism Through Directives in the Nano-Threads Programming Model. 10th Int. Workshop on Languages and Compilers for Parallel Computing (LCPC97). Minneapolis (USA), August 1997.
2. E. Ayguadé, X. Martorell, J. Labarta, M. Gonzalez and N. Navarro. Exploiting Multiple Levels of Parallelism in Shared-memory Multiprocessors: A Case Study. Technical Report UPC-DAC-1998-48, 1998.
3. C. Calidonna, M. Giordano and M. Mango Furnari. A Graphic Parallelizing Environment for User-compiler Interaction. 13th Int. Conference on Supercomputing (ICS99). Rhodes (Greece), June 1999.
4. D. Feitelson. A Survey of Scheduling in Multiprogrammed Parallel Systems. Research Report RC 19790 (87657), IBM T.J. Watson Research Center, Revised Version, 1997.
5. I. Foster, B. Avalani, A. Choudhary and M. Xu. A Compilation System that Integrates High Performance Fortran and Fortran M. Scalable High Performance Computing Conference, Knoxville (TN), May 1994.
6. I. Foster, D. R. Kohr, R. Krishnaiyer and A. Choudhary. Double Standards: Bringing Task Parallelism to HPF Via the Message Passing Interface. Supercomputing'96, November 1996.
7. M. Girkar and C. Polychronopoulos. Automatic Extraction of Functional Parallelism from Ordinary Programs. IEEE Transactions on Parallel and Distributed Systems, vol. 3(2), March 1992.
8. T. Gross, D. O'Hallaron and J. Subhlok. Task Parallelism in a High Performance Fortran Framework. IEEE Parallel and Distributed Technology, vol. 2, no. 3, Fall 1994.
9. M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S. W. Liao, E. Bugnion and M. S. Lam. Maximizing Multiprocessor Performance with the SUIF Compiler. IEEE Computer, December 1996.
10. S. W. Kim, M. Voss and R. Eigenmann. MOERAE: Portable Interface between a Parallelising Compiler and Shared-Memory Multiprocessor Architectures. http://yake.ecn.purdue.edu/seon/moerae.html, 1998.
11. C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele and M. E. Zosel. The High Performance Fortran Handbook. Scientific Programming, 1994.
12. Kuck and Associates, Inc. WorkQueue Parallelism Model. http://www.kai.com, Fall 1998.
13. J. Labarta, S. Girona, V. Pillet, T. Cortes and L. Gregoris. DiP: A Parallel Program Development Environment. 2nd Int. Euro-Par Conference. Lyon (France), August 1996. http://www.cepba.upc.es/paraver.
14. J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. 24th Int. Symposium on Computer Architecture, Philadelphia (USA), 1997.
15. X. Martorell, E. Ayguadé, N. Navarro, J. Corbalan, M. Gonzalez and J. Labarta. Thread Fork/Join Techniques for Multi-level Parallelism Exploitation in NUMA Multiprocessors. 13th Int. Conference on Supercomputing (ICS99). Rhodes (Greece), June 1999.
16. X. Martorell, J. Labarta, N. Navarro and E. Ayguadé. A Library Implementation of the Nano-Threads Programming Model. 2nd Int. Euro-Par Conference. Lyon (France), August 1996.
17. C. McCann, R. Vaswani and J. Zahorjan. A Dynamic Processor Allocation Policy for Multiprogrammed Shared Memory Multiprocessors. ACM Trans. on Computer Systems, vol. 11(2), 1993.
18. D. Nikolopoulos, E. Polychronopoulos and T. Papatheodorou. Efficient Run-time Thread Management for the Nano-Threads Programming Model. Workshop on Runtime Systems for Parallel Programming (IPPS/SPDP98). Orlando (USA), April 1998.
19. OpenMP Organization. Fortran Language Specification, v. 1.0. http://www.openmp.org, October 1997.
20. J. Ousterhout. Scheduling Techniques for Concurrent Systems. Proc. of the Distributed Computing Systems Conference, pp. 22-30, 1982.
21. Pallas GmbH. Information about products available at http://www.pallas.com.
22. C. Polychronopoulos, M. Girkar, M. R. Haghighat, C. L. Lee, B. Leung and D. Schouten. Parafrase-2: an Environment for Parallelizing, Partitioning and Scheduling Programs on Multiprocessors. International Journal of High Speed Computing, vol. 1(1), 1989.
23. E. Polychronopoulos, X. Martorell, D. Nikolopoulos, J. Labarta, T. Papatheodorou and N. Navarro. Kernel-level Scheduling for the Nano-threads Programming Model. 12th Int. Conference on Supercomputing (ICS98). Melbourne (Australia), July 1998.
24. E. Polychronopoulos, D. Nikolopoulos, T. Papatheodorou, N. Navarro and X. Martorell. An Efficient Kernel Scheduling Methodology for Multiprogrammed Shared Memory Multiprocessors. Technical Report TR-010399, University of Patras, 1999.
25. E. Polychronopoulos and T. Papatheodorou. Scheduling User-level Threads on Scalable Shared-memory Multiprocessors. 5th Int. Euro-Par Conference. Toulouse (France), September 1999.
26. S. Ramaswamy. Simultaneous Exploitation of Task and Data Parallelism in Regular Scientific Computations. Ph.D. Thesis, University of Illinois at Urbana-Champaign, 1996.
27. Silicon Graphics Computer Systems (SGI). MIPSpro Fortran 77 Programmer's Guide, 1996.
28. Silicon Graphics Computer Systems (SGI). Origin200 and Origin2000 Technical Report, 1996.
29. SPEC Organization. The Standard Performance Evaluation Corporation. http://www.spec.org.
30. A. Tucker. Efficient Scheduling on Multiprogrammed Shared Memory Multiprocessors. PhD Dissertation, Technical Report CSLTR94601, Stanford University, 1994.