Cluster Comput (2008) 11: 57–73, DOI 10.1007/s10586-007-0051-6

Integrated parallel performance views

Aroon Nataraj · Allen D. Malony · Sameer Shende · Alan Morris

Received: 16 March 2007 / Accepted: 29 October 2007 / Published online: 27 November 2007
© Springer Science+Business Media, LLC 2007

Abstract The influences of the operating system and system-specific effects on application performance are increasingly important considerations in high performance computing. OS kernel measurement is key to understanding the performance influences and the interrelationship of system and user-level performance factors. The KTAU (Kernel TAU) methodology and Linux-based framework provides parallel kernel performance measurement from both a kernel-wide and process-centric perspective. The first characterizes overall aggregate kernel performance for the entire system. The second characterizes kernel performance when it runs in the context of a particular process. KTAU extends the TAU performance system with kernel-level monitoring, while leveraging TAU's measurement and analysis capabilities. We explain the rationale and motivations behind our approach, describe the KTAU design and implementation, and show working examples on multiple platforms demonstrating the versatility of KTAU in integrated system/application monitoring.

Keywords Parallel performance · Kernel · Linux · Instrumentation · Measurement

A. Nataraj · A.D. Malony · S. Shende · A. Morris
Department of Computer and Information Science, University of Oregon, Eugene, OR, USA
e-mail: [email protected]

A.D. Malony
e-mail: [email protected]

S. Shende
e-mail: [email protected]

A. Morris
e-mail: [email protected]

1 Introduction

The performance of parallel applications on high-performance computing (HPC) systems is a consequence of the user-level execution of the application code and system-level (kernel) operations that occur while the application is running. As HPC systems evolve towards ever larger and more integrated parallel environments, the ability to observe all performance factors, their relative contributions and interrelationship, will become important to comprehensive performance understanding. Unfortunately, most parallel performance tools operate only at the user-level, leaving the kernel-level performance artifacts obscure. OS factors causing application performance bottlenecks, such as those demonstrated in [1, 2], are difficult to assess by user-level measurements alone. An integrated methodology and framework to observe OS actions relative to application activities and performance has yet to be fully developed. Minimally, such an approach will require OS kernel performance monitoring.

The OS influences application performance both directly and indirectly. Actions the OS takes independent of the application's execution have indirect effects on its performance. In contrast, we say that the OS directly influences application performance when its actions are a result of or in support of application execution. This motivates two different monitoring perspectives of OS performance in order to understand the effects. One perspective is to view the entire kernel operation as a whole, aggregating performance data from all active processes in the system and including the activities of the OS when servicing system-calls made by applications as well as activities not directly related to applications (such as servicing hardware interrupts or keeping time). We will refer to this as the kernel-wide perspective, which is helpful in understanding OS behavior and in identifying and removing kernel hot spots. However, this view does not provide comprehensive insight into what parts of a certain application spend time inside the OS and why.

Another way to view OS performance is within the context of a specific application's execution. Application performance is affected by the interaction of user-space behavior with the OS, as well as what is going on in the rest of the system. By looking at how the OS behaves in the context of individual processes (versus aggregate performance) we can provide a detailed view of the interactions between programs, daemons, and system services. This process-centric perspective is helpful in tuning the OS for a specific workload, tuning the application to better conform to the OS configuration, and in exposing the source of performance problems (in the OS or the application).

Both the kernel-wide and process-centric perspectives are important in OS performance measurement and analysis on HPC systems. The challenge is how to support both perspectives while providing a monitoring infrastructure that gives detailed visibility of kernel actions and is easy to use by different tools. For example, the interactions between applications and the OS mainly occur through five different mechanisms: system-calls, exceptions, interrupts (hard and soft), scheduling, and signals. It is important to understand all forms of interactions as the application performance is influenced by each. Of these, system calls are the easiest to observe as they are synchronous with respect to the application. Also, importantly, system calls are serviced inside the kernel relative to the context of the calling process, allowing greater flexibility in how these actions are monitored. On the other hand, interrupts, scheduling, and exceptions occur asynchronously and usually in the interrupt context, where the kernel-level facilities and state constrain monitoring options and make access to user-space performance data difficult. One problem is observing all the different program-OS interactions and correlating performance across the user/OS boundary. Another is observing these interactions across a parallel environment and collectively gathering and analyzing the performance data. Support for this needs to be explicit, especially for post-processing, analysis, and visualization of parallel data.

Our approach is the development of a new Linux Kernel performance measurement facility called KTAU, which extends the TAU [3] performance system with kernel-level monitoring. KTAU allows both a kernel-wide and process-centric perspective of OS performance to be obtained to show indirect and direct influences on application performance. KTAU provides a lightweight profiling facility and a tracing capability that can generate a detailed event log. We describe KTAU's design and architecture in Sect. 4. We demonstrate KTAU's features and versatility through working examples on different platforms in Sect. 5. We begin with a discussion of related work in Sect. 2 as background to the problem. The paper concludes with final remarks and future directions of KTAU in Sect. 6.

2 Background and related work

To better understand the design and functionality of KTAU, we first review related research and technology in OS monitoring. Tools operating at the kernel level only and those that attempt to combine user/kernel performance data are most closely related to KTAU's objectives. These tools employ different techniques and mechanisms for instrumentation and measurement. These techniques are the primary distinguishing factor between the existing approaches. We begin with kernel-only tools and then present tools explicitly supporting merged user/kernel performance data. For lack of space, we show only a few candidate tools in each category. Table 1 compares related work to KTAU on several dimensions including instrumentation, measurement techniques, availability of combined user/kernel performance analysis, explicit support for parallel performance data collection and visualization, SMP, and supported OSes.

2.1 Dynamic instrumentation

Dynamic (runtime) instrumentation [4] modifies the executable code directly, thereby saving the cost of recompilation and relinking. In KernInst [5], a kernel-module uses dynamic instrumentation to insert measurement code into kernel routines. Different performance measurements observing various kernel actions are possible. While KernInst incurs overhead only for dynamically instrumented events, trampolining adds delay and there is a cost in altering instrumentation during execution. KernInst, by itself, does not support merging user and kernel performance data.

The dynamic approach used in Sun's DTrace [6] separates instrumentation infrastructure, instrumentation placement, performance data collection (by providers), and performance data usage (by consumers), allowing multiple virtualized consumers. DTrace allows users to dynamically enable and manage thousands of probes. Application-level instrumentation traps into the kernel where the kernel-logging facility logs both application and kernel performance data. However, because trapping into the kernel can be costly, using DTrace itself is not a scalable option for combined user/kernel performance measurement of parallel HPC codes.

2.2 Static source instrumentation

2.2.1 Kernel tracing tools

In contrast to KernInst and DTrace, Linux Trace Toolkit (LTT) [7] and K42 Tracing [8] are based on kernel source instrumentation.

Table 1 OS-only and integrated user/OS performance tools (related work classification)

Tool            Instrumentation   Measurement              Combined user/kernel   Parallel           SMP   OS
KernInst [5]    dynamic           flexible                 not explicit           not explicit       yes   Solaris
DTrace [6]      dynamic           flexible                 trap into OS           not explicit       yes   Solaris
LTT [7]         source            trace                    not explicit           not explicit       yes   Linux
K42 [8]         source            trace                    partial                not explicit       yes   K42
KLogger [14]    source            trace                    not explicit           not explicit       yes   Linux
OProfile [11]   N/A               flat profile             partial                not explicit       yes   Linux
KernProf [10]   gcc (callgraph)   flat/callgraph profile   not explicit           not explicit       yes   Linux
[15]            source            trace                    syscall only           not explicit       no    Linux
CrossWalk [13]  dynamic           flexible                 syscall only           not explicit       yes   Solaris
DeBox [12]      source            profile/trace            syscall only           not explicit       yes   Linux
KTAU+TAU        source            profile/trace            full                   explicit support   yes   Linux

Both are tools for tracing kernel behavior. In the case of LTT, the Linux kernel source code is modified to include LTT macros in specified functions and LTT data structures for holding the trace information. During execution it requires a user-space daemon to periodically (synchronously) empty the trace-buffer to secondary storage to prevent loss of data. While LTT can provide rich information about kernel function, its use of heavy-weight timing (gettimeofday) has high overhead and low precision, and its use of a single trace buffer requires global locks. Some of the limitations are being addressed [9], but there is no support for merging data from the user and kernel space.

2.2.2 Kernel profiling tools

Source-instrumented kernels also support tools to measure performance in the form of profile statistics. SGI's KernProf [10] (under callgraph modes) is one example. It consists of a kernel-patch and a user-space utility called kernprof that is used to control the kernel implementation and analyze the data. By compiling the Linux Kernel with the -pg option to gcc, every function is instrumented with code to track call counts and parent/child relationships. The gprof program uses this information to generate a callgraph of the kernel. As every function is being profiled, the overhead at runtime can be significant. KernProf does not provide extensive support for online merging of user and kernel data.

2.3 Statistical sampling tools

In contrast to instrumentation that modifies the kernel code (source or executable), tools based on statistical sampling periodically interrupt the normal flow of control to determine the location of execution. Program Counter (PC) values are sampled and stored to be interpreted with symbol information to build histograms of how performance is distributed. This approach can observe both user-mode and kernel-mode operation across the system including libraries, applications, kernel-image and kernel-modules on a per processor basis. For example, OProfile [11] is meant to be a continuous profiler for Linux, meaning it is always turned on. OProfile's state-maintenance is similar to a tracing approach. It keeps track of sampled PC values, time-stamps, process-switches and user-kernel mode switches in lock-free per-processor log buffers. Periodically, these buffers are coalesced into a single event-buffer that is written to a user daemon. The post-processing produces flat and callgraph profiles. User/kernel information can be merged, but not online. Its shortcomings include an inability to provide online information (as it performs a type of partial tracing) and the requirement of a daemon. Other issues stem from the inaccuracy of sampling based profiles.

2.4 Merged user-kernel performance analysis tools

The tools above do not provide explicit support to track and merge performance measurements between the application and the kernel during execution. To better understand program-OS interaction and localize bottlenecks in the program/OS execution stack, it is important to associate kernel actions with specific application processes. It may also allow identifying intrusive effects such as excessive scheduling or interrupts that can steal cycles from applications. Here we describe tools supporting varying levels of online merged user-kernel performance measurement and analysis. These tools incorporate different instrumentation and measurement mechanisms discussed above.

DeBox [12] makes system-call performance a first-class result by providing kernel-mode details (time in kernel, resource contention, and time spent blocked) as in-band data on a per-syscall basis. The in-band and online nature is emphasized as the kernel performance data can be used by applications for debugging or tuning. An application can control how the performance data is collected, filtered, and used on a per-call basis. However, DeBox only focuses on the system-call interfaces and does not provide any means of collecting interrupt, exception, or scheduler data outside of system calls. A more pronounced implementation-level shortcoming is its inability to maintain performance data across multiple system-calls (the performance state is re-initialized at the start of each syscall). Applications using libraries (or components) may not have easy access to the call-sites of system-calls (in fact the source for libraries may not be available). If within a single library routine multiple system calls are made (which is not uncommon), then only the last syscall's performance data will be available.

CrossWalk [13] is a tool that walks a merged call-graph across the user-kernel boundary in search of the real source of performance bottlenecks. Using a specified performance threshold, it tries to find routines that take longer than the threshold, starting its search from the main() function. If the bottleneck is inside a system-call, CrossWalk walks the kernel-side call-graph into the system-call (using KernInst) and further tries to isolate the performance bottleneck in kernel instrumented functions. Locating call-sites is done with static binary analysis (for call instructions) or by dynamic run-time instrumentation (for indirect calls such as function pointers and virtual functions). CrossWalk's requirement of knowing call-site locations allows it to bridge the user-kernel gap only across system-calls. Profiling other aspects of program-OS interaction is not supported. Also, it is clearly not possible to know beforehand where the call-sites for exceptions (e.g., page-faults), scheduling, and interrupts will occur. Therefore, CrossWalk's call-graph walking method cannot be easily applied to the complete set of application-OS interactions. In addition, support for multi-threaded applications is unavailable.

2.5 Discussion

Table 1 summarizes the tools discussed above and others we studied. While many of the tools mentioned can perform kernel-only performance analysis, they are unable to produce valuable merged information for all aspects of program-OS interaction. Merged information is absent, in particular, for indirect influences such as interrupts and scheduling that can have a significant performance impact on parallel workloads. In addition, online OS performance information and the ability to function without a daemon is not widely available. A daemon-based model causes extra perturbation and makes online merged user/kernel performance monitoring more expensive. KTAU resorts to a daemon only when proprietary/closed-source applications disallow source instrumentation. Most of the tools do not provide explicit support for collecting, analyzing, and visualizing parallel performance data. KTAU aims to provide explicit support for online merged user/kernel performance analysis for all program-OS interactions in parallel HPC execution environments.

3 Objectives

The background discussion above provides a rich research context for the KTAU project. In relation to this work, the distinguishing high-level objectives for KTAU are:

• Support low-overhead OS performance measurement at multiple levels of function and detail.
• Provide a kernel-wide perspective of OS performance.
• Provide a process-centric perspective of application performance within the kernel.
• Merge user-level and kernel-level performance information across all program-OS interactions.
• Provide online information and the ability to function without a daemon where possible.
• Support profiling and tracing for kernel-wide and process-centric views in parallel systems.
• Leverage existing parallel performance analysis tools.
• Deliver a robust OS measurement system that can be used in scalable production environments.

While none of the tools studied meet these objectives fully, their design and implementation highlight important concerns and provide useful contrasts to help understand KTAU's approach.

The measurements made in the kernel will be driven by the observation needs of the performance evaluation problems anticipated. Our goal is to provide an integrated kernel measurement system that allows observation of multiple functional components of the OS and choice of the level of performance measurement granularity of these components. Like LTT, KTAU targets key parts of the Linux Kernel, including interrupt handlers, the scheduling subsystem, system calls, and the network subsystem. Structured OS observation is also seen in K42, DTrace, and KernProf, whereas other tools provide limited kernel views (e.g., only system calls). KTAU provides two types of measurement of kernel performance: profiling and tracing. Most of the other tools provide only one type. The choice allows measurement detail to be matched to observation requirements.

Any measurement of the OS raises concerns of overhead and influence on application (and kernel) execution. The frequency and detail of measurement are direct factors, but how the OS is instrumented is an important determinant as well. The above contrasts of dynamic versus source (compiled) versus interrupt-driven instrumentation identify variant strategies (and technologies) which result in different overhead effects. While each approach has its devotees, for any one, it is more important to assess overhead impact as best as possible. Instrumentation in KTAU is compiled into the kernel. As a result, there are issues about how to control (turn on/off) measurements and questions of execution perturbation. We present experiment data later to address these concerns.

KTAU's primary objectives are to provide access to OS performance information from kernel-wide and process-centric perspectives. Tools such as LTT, KernProf, and OProfile provide a kernel-wide view. DeBox, DTrace, and CrossWalk provide capabilities for generating a process-centric view, with possible restrictions on the scope of kernel observation. In contrast, KTAU aims to support both perspectives. KTAU enables process-centric observation through runtime association of kernel performance with process context to a degree that is not apparent in other tools.

The performance data produced by KTAU is intentionally compatible with that produced by the TAU performance system [3]. By integrating with TAU and its post-processing tools, KTAU inherits powerful profile and trace analysis and visualization tools, such as ParaProf [16] for profiling and Vampir [17] and Jumpshot [18] for trace visualization.

The KTAU project is part of the ZeptoOS [19] research project at Argonne National Laboratory funded by the DOE Office of Science "Extreme Performance Scalable Operating Systems" program. Thus, KTAU is intended to be used in performance evaluation of large-scale parallel systems and in runtime performance monitoring/analysis for kernel module adaptation in ZeptoOS.

4 Architecture

The KTAU system was designed for performance measurement of OS kernels, namely the Linux Kernel. As depicted in Fig. 1, the KTAU architecture is organized into five distinct components:

• Kernel instrumentation
• Kernel measurement system
• KTAU proc filesystem
• libKtau user-space library
• Clients of KTAU, including the integrated TAU framework and daemons.

While KTAU is specialized for Linux, the architecture could generally apply to other Unix-style OSes. In the following sections we discuss the KTAU architecture components and their implementation in more detail.

Fig. 1 KTAU architecture

4.1 Kernel instrumentation

Our instrumentation approach in KTAU inserts measurement code in the Linux Kernel source. The kernel instrumentation uses C macros and functions that allow KTAU to intercept the kernel execution path and record measured performance data. The instrumentation points are used to provide profiling data, tracing data, or both through the standard Linux Kernel configuration mechanism (e.g., make menuconfig). Instrumentation points are grouped based on various aspects of the kernel's operation, such as in which subsystem they occur (e.g. scheduling, networking) or in what contexts they arise (e.g., system calls, interrupts, bottom-half handling). To control the scope and degree of instrumentation, compile-time configuration options determine which groups of instrumentation points are enabled. Boot-time kernel options can also be used to enable or disable instrumentation groups.

Three types of instrumentation macros are provided in KTAU: entry/exit event, atomic event, and event mapping. The entry/exit event macro performs measurements between an entry and exit point. Currently, KTAU supports high-resolution timing based on low-level hardware timers (known as the Time Stamp Counter on the Intel architecture and the Time Base on the PowerPC). The KTAU entry/exit event instrumentation keeps track of the event activation (i.e., instrumentation) stack depth and uses it to calculate inclusive and exclusive performance data.

Not all instrumentation points of interest conform to entry/exit semantics or consist of monotonically increasing values. The atomic event macro allows KTAU to instrument events that occur stand-alone and to measure values specific to kernel operation, such as the sizes of network packets sent and received in the network subsystem. The entry/exit and atomic event macros are patterned on the instrumentation API used in TAU.
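
To illustrate the entry/exit idea, the following user-space C sketch mimics how timer reads combined with an activation stack yield inclusive and exclusive times. It is a minimal analogue of the mechanism described above, assuming hypothetical names (ktau_entry, ktau_exit); it is not the actual KTAU kernel code or API.

    /* Illustrative sketch: entry/exit timing with an activation stack,
     * analogous to the KTAU entry/exit macro. Names are hypothetical. */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define MAX_DEPTH 64

    static struct { const char *name; uint64_t start; uint64_t child; } stack[MAX_DEPTH];
    static int depth = 0;

    static uint64_t now_ns(void) {             /* stands in for TSC/Time Base reads */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static void ktau_entry(const char *name) { /* push an activation record */
        stack[depth].name = name;
        stack[depth].start = now_ns();
        stack[depth].child = 0;
        depth++;
    }

    static void ktau_exit(void) {              /* pop and account inclusive/exclusive */
        depth--;
        uint64_t incl = now_ns() - stack[depth].start;
        uint64_t excl = incl - stack[depth].child;
        if (depth > 0)                         /* charge inclusive time to the parent */
            stack[depth - 1].child += incl;
        printf("%-12s inclusive=%llu ns exclusive=%llu ns\n",
               stack[depth].name, (unsigned long long)incl, (unsigned long long)excl);
    }

    int main(void) {
        ktau_entry("sys_read");
        ktau_entry("do_softirq");              /* nested event */
        ktau_exit();
        ktau_exit();
        return 0;
    }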

62 Cluster Comput (2008) 11: 57–73

The performance data generated by the two instrumentation event macros must be stored with respect to the events defined. If we were only interested in supporting a kernel performance view, we could statically allocate structures for this purpose. However, one of KTAU's main strengths is in tracking process context within the kernel to produce process-centric performance data. The event mapping macro solves the problem of how to associate kernel events to some context, such as the current process context. The macro is used to create a unique identity for each instrumentation point, serving to map the measured performance data to dynamically allocated event performance structures. A global mapping index is incremented for the first invocation of every instrumented event. A static instrumentation ID variable created within the event code accepts the current global index value and binds that to the event. The instrumentation ID can then be used as an index into a table where event performance data is stored. The same mechanisms are used when mapping to specific process contexts.
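
The mapping idea can be sketched in plain C: a static variable inside each instrumentation point latches a value from a global counter on first use, and that ID indexes an event table. This is an illustrative reconstruction from the description above (all names are hypothetical), not the actual KTAU macro.

    /* Illustrative sketch of the event-mapping mechanism: a static ID per
     * instrumentation point is bound once from a global counter and then
     * indexes a (per-context) event table. Names are hypothetical. */
    #include <stdio.h>

    #define MAX_EVENTS 256

    struct event_perf { const char *name; unsigned long calls; };

    static int next_event_id = 0;                      /* global mapping index */
    static struct event_perf event_table[MAX_EVENTS];  /* per-context in the real system */

    /* Each use of this macro creates one static ID, bound on first invocation. */
    #define KTAU_EVENT(ev_name)                              \
        do {                                                 \
            static int id = -1;                              \
            if (id < 0) {                                    \
                id = next_event_id++;                        \
                event_table[id].name = ev_name;              \
            }                                                \
            event_table[id].calls++;                         \
        } while (0)

    static void handle_syscall(void) { KTAU_EVENT("sys_call_entry"); }
    static void handle_irq(void)     { KTAU_EVENT("irq_entry"); }

    int main(void) {
        handle_syscall(); handle_syscall(); handle_irq();
        for (int i = 0; i < next_event_id; i++)
            printf("%-16s calls=%lu\n", event_table[i].name, event_table[i].calls);
        return 0;
    }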

Using KTAU macros, we have instrumented the Linux Kernel with default points in the scheduling and network subsystems, and in interrupt handlers and system calls. The instrumentation points are similar to those defined in LTT. In certain cases, we reduce the number of locations by wrapping common code with multiplexing instrumentation. For example, instead of instrumenting every system call, the common system call entry code (in entry.S) is instrumented. The event mapping macro enables the different system call types to be distinguished. With the KTAU instrumentation macros, others are free to further instrument the Linux Kernel.

4.2 KTAU measurement system

The role of the KTAU measurement system is to collect kernel performance data and manage the life-cycle of per-process profile/trace data structures. The measurement system runs within the Linux Kernel and is engaged whenever a process is created or dies, a kernel event occurs, or performance data is accessed. Upon process creation, KTAU adds a measurement structure to the process's task structure in the Linux process control block. The size of this data structure is appropriately configured for profiling or tracing. When tracing is used, a fixed-size circular trace buffer (of configurable length) is created for each process. Using this scheme, trace data may be lost if the buffer is not read fast enough by user-space applications or daemons. The KTAU measurement structure also contains state and synchronization variables.
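
A minimal sketch of such a per-process measurement structure with a fixed-size circular trace buffer is shown below. Field names, sizes, and the recording logic are our own illustrative assumptions, not the actual KTAU definitions.

    /* Illustrative sketch of a per-process measurement structure holding a
     * circular trace buffer, as described above. All names are hypothetical. */
    #include <stdio.h>
    #include <stdint.h>

    #define KTAU_TRACE_SLOTS 1024        /* configurable length in the real system */

    struct ktau_trace_event {
        uint32_t event_id;               /* mapped instrumentation ID */
        uint64_t timestamp;              /* hardware timer value */
    };

    struct ktau_measurement {
        struct ktau_trace_event trace[KTAU_TRACE_SLOTS];  /* circular buffer */
        unsigned long head;              /* next write position */
        unsigned long wrapped;           /* writes after the buffer first filled */
        /* profile table, state and synchronization variables would also live here */
    };

    /* Record one trace event; unread entries are overwritten once the buffer wraps. */
    static void trace_record(struct ktau_measurement *m, uint32_t id, uint64_t ts)
    {
        struct ktau_trace_event *slot = &m->trace[m->head % KTAU_TRACE_SLOTS];
        if (m->head >= KTAU_TRACE_SLOTS)
            m->wrapped++;                /* data can be lost if not drained in time */
        slot->event_id = id;
        slot->timestamp = ts;
        m->head++;
    }

    int main(void)
    {
        static struct ktau_measurement m;     /* zero-initialized */
        for (uint64_t t = 0; t < 2000; t++)
            trace_record(&m, (uint32_t)(t % 5), t);
        printf("head=%lu wrapped=%lu\n", m.head, m.wrapped);
        return 0;
    }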

4.3 KTAU proc filesystem

The proc filesystem scheme was chosen as a standard mechanism for interfacing user-space clients with the KTAU measurement system. As shown in Fig. 1, KTAU exposes two entries under /proc/ktau, called profile and trace. User-space clients gain access to the entries via the libKtau user library (see Sect. 4.4). We designed the interface to be session-less (where a session is described as persisting across multiple invocations from user-space). In this scheme a profile read operation, for instance, requires first a call to determine profile size and another call to retrieve the actual data into an allocated buffer. The interface does not maintain any saved state between calls (even though the size of profile data may change between calls). This design choice was made to avoid possible resource leaks due to misbehaving clients and the complexity required to handle such cases.

To accommodate frequent-use clients, such as daemons that repeatedly retrieve data at small intervals, a memory-mapped buffer strategy is used. The memory pages spanning the user buffer are first locked into memory and mapped into the kernel address space using the get_user_pages kernel routine in Linux. This removes the need to repeatedly copy to/from user space.

4.4 KTAU user API and libKtau library

The KTAU User API provides access to a small set of easy-to-use functions that hide the details of the KTAU proc filesystem protocol. The libKtau library exports the API to shield applications using KTAU from changes to kernel-mode components. libKtau provides functions for kernel control (for merging, overhead calculation), kernel data retrieval, data conversion (ASCII to/from binary), and formatted stream output. Internally, libKtau performs IOCTLs on the /proc/ktau files to access the kernel performance data.

Kernel-mode performance data is requested from user-space in one of three ways, 'self', 'others' and 'all', referring to the target process(es). The different modes exist to provide optimal data transfer and reduced locking.

4.5 KTAU clients

Different types of clients can use KTAU. These include daemons that perform system-wide profiling/tracing, self-profiling clients (interested only in their own performance information), Unix-like command-line clients for timing programs, and internal KTAU timing/overhead query utilities. We briefly describe the KTAU clients currently implemented and used in our experiments.

4.5.1 KTAUD—the KTAU daemon

KTAUD periodically extracts profile and trace data from the kernel. It can be configured to gather information for all processes or a subset of processes. Thus, KTAUD uses the 'others' and 'all' modes of libKtau. KTAUD is required primarily to monitor closed-source applications that cannot be instrumented.

4.5.2 TAU

The TAU performance system is used for application-level performance measurement and analysis. We have updated TAU to function as a client of KTAU, gathering kernel profile and trace data through the libKtau API. It is in this manner that we are able to merge application and kernel performance data.

4.5.3 runKtau

We created the runKtau client in a manner similar to the Unix time command. time spawns a child process, executes the required job within that process, and then gathers rudimentary performance data after the child process completes. runKtau does the same, except it extracts the process's detailed KTAU profile.

5 KTAU experiments

We conducted a series of experiments to demonstrate the features of KTAU on different Linux installations. The methodology followed for the first set of experiments sought to exercise the kernel in a well understood, controlled fashion and then use KTAU performance data to validate that the views produced are correct, revealing, and useful. The second set of experiments investigated KTAU's use on a large cluster environment under different instrumentation configurations to explore perturbation effects and highlight the benefit of kernel performance insight for interpreting application behavior. Unfortunately, not all of the performance studies we have done can be reported here, given the space provided. Thus, we present only a subset of those using the LU application from the NAS Parallel Benchmarks (NPB) (version 2.3) [20] and the ASCI Sweep3D [21]. In addition to other NPB applications, we have also experimented with the LMBENCH micro-benchmark for Linux [22] (not shown here). Finally, we have conducted experiments with KTAU as integrated in the ZeptoOS distribution and run on the IBM BG/L machine (see [23]).

5.1 Controlled experiments

Simple benchmarks (LU, LMBENCH) were run on two platforms to show how the performance perspectives supported by KTAU appear in practice. Our testbeds (patched with KTAU) included:

• neutron: 4-CPU Intel P3 Xeon 550 MHz, 1 GB RAM, Linux 2.6.14.3 kernel
• neuronic: 16-node 2-CPU Intel P4 Xeon 2.8 GHz, 2 GB RAM/node, Redhat Linux 2.4 kernel.

We use the LU benchmark to demonstrate several things about using KTAU and its benefits to integrated performance analysis. First, the experiments show KTAU's ability to gather kernel performance data in parallel environments. This data is used to detect and identify causes of performance inefficiencies in node operation. Second, analysis of kernel-level performance is able to reuse TAU tools for profile and trace presentation. Third, we see how KTAU provides merged user-kernel views, allowing application performance to be further explained by aspects of kernel behavior. Again, TAU tools can be applied to the merged performance information.

To test whether KTAU can detect performance artifacts, we introduced a performance anomaly in our experiments in the form of an artificially induced system workload. Periodically, while the LU application is running, an "overhead" process wakes up (after sleeping 10 seconds) and performs a CPU-intensive busy loop for a duration of 3 seconds. This occurs only on one compute node, with the result of disrupting the LU computation. We expect to see the effects of this anomalous behavior as a performance artifact in the kernel and user profile and trace data.
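
A minimal sketch of such an interference process is shown below. The sleep and busy-loop durations follow the description above, while the specific implementation is an illustrative reconstruction, not the load generator actually used in the experiments.

    /* Sketch of an "overhead" process: sleep 10 s, then busy-loop for ~3 s,
     * repeatedly, to steal cycles from the application on one node. */
    #include <time.h>
    #include <unistd.h>

    static void busy_loop(double seconds)
    {
        struct timespec start, now;
        volatile unsigned long sink = 0;       /* keep the loop from being optimized away */
        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
            sink++;
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) +
                 (now.tv_nsec - start.tv_nsec) / 1e9 < seconds);
    }

    int main(void)
    {
        for (;;) {
            sleep(10);        /* idle period */
            busy_loop(3.0);   /* CPU-intensive burst that perturbs the LU run */
        }
        return 0;
    }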

Figure 2A presents a ParaProf display of kernel-level activity on 8 physical nodes of a 16-processor LU run (numbers represent host ids), aggregated across all processes running on each node. In the performance bargraph of node 8, we clearly see greater time in scheduling events introduced by the extra "overhead" process running on the node. To further understand what processes contributed to Host 8's kernel-wide performance view, the node profile in Fig. 2B shows kernel activity for each specific process (numbers represent process ids (pids)). It is clear that apart from the two LU processes (PID:28744 and PID:28746), there are many processes on the system. Also PID:28649 (the extra overhead process) is seen to be the most active and is the cause of the difference in the kernel-wide view. The views can help pinpoint causes and locations of significant perturbation in the system.

Figure 2D demonstrates the benefit of KTAU's integrated user/kernel profile to see performance influences not observed by user-level measurement alone. This view compares TAU's application-only profile with KTAU's, from the same LU run. A single node's profile is chosen. Each pair of bars represents a single routine's exclusive time in seconds; the upper bar represents the integrated view, and the lower bar represents the standard TAU view. There are two key differences between the two. Kernel-level routines (such as schedule, system calls and interrupts) are additional in the KTAU view. Also the exclusive times of user-functions have been reduced to not include time spent in the kernel and represent the "true" exclusive time in the combined user/kernel call-stack.

(The authors recommend that Fig. 2 be viewed online.)

Fig. 2 Controlled experiments with LU benchmark

This comparison can identify compute-intensive routines with unusual kernel-times and also where in the kernel the time is spent.

Processes can be rescheduled voluntarily because they are waiting for some event (such as I/O completion or message arrival from another process). They are involuntarily rescheduled due to time-slice expiry. It is useful to differentiate the two cases in performance measurement. To do so we add, within schedule(), a new instrumentation point called schedule_vol() which is called when the need_resched process flag is unset (signaling the time slice has not expired). We ran the NPB LU application on our 4-processor SMP host with this instrumentation. Due to weak CPU affinity, the four LU processes mostly stay on their respective processors. In addition we also run a daemon, pinned to CPU-0, that periodically performs a busy loop stealing cycles away from the LU application.

Figure 2C shows the result of voluntary versus involuntary scheduling on the four processes. LU-0 (top bar) is seen to be significantly affected by involuntary scheduling; its voluntary scheduling time is small. The reverse is true of the three other LU processes as they were waiting (voluntarily) for LU-0 to catch up. This example demonstrates how the source of dilation and imbalance in system loads can be better isolated.

Tracing can capture the spatial and temporal aspects of performance using a log of events. TAU application traces using the hardware timers can be correlated with KTAU kernel-level traces. Figure 2E shows kernel-level activity within a user-space MPI_Send() call by the LU benchmark. Two snapshots from Vampir of TAU application and KTAU kernel traces have been merged. MPI_Send() is implemented by sys_writev(), sock_sendmsg() and tcp_sendmsg(). The bottom-half handling (do_softirq()) and tcp receive routines are not directly related to the send operation and instead occur when a bottom-half lock is released and pending softirqs are automatically checked.

5.2 Chiba experiments

The second set of experiments we report target a larger-scale cluster environment. Argonne National Lab generously allocated a 128-node slice of the Chiba-City cluster for use with KTAU. Each node is a dual-processor Pentium III running at 450 MHz with 512 MB RAM. All nodes are connected by Ethernet. The OS is a KTAU-patched Linux 2.6.14.2 kernel. With this experimental cluster, we again used the LU benchmark, this time to assess the performance overhead of KTAU for a scalable application in different instrumentation modes. Serendipitously, we also needed to use KTAU to track down and resolve a performance issue related to the system configuration.

Initially, just to check out our experimental Chiba cluster, we configured KTAU with all instrumentation points enabled and ran the LU (Class C) application on 128 processes, first with 128 nodes (128x1) and then with 2 threads per node over 64 nodes (64x2). Our measurement involved merged user and OS level profiling using KTAU and TAU. The mean total execution time of LU over five runs for the 128x1 was 295.6 seconds and for the 64x2 configuration was 504.9 seconds. This was very surprising. We thought using 128 separate nodes may have advantages such as less contention for the single Ethernet interface, more bandwidth per thread, and less interference from daemons. But could these differences be sufficient to account for the 73.2% slowdown seen?

We used KTAU to help explain this performance bug and hopefully make optimizations to bring the 64x2 execution performance more in line with the 128x1 runs. On examining just the user-space profile produced by TAU we found several interesting features. The largest contributors to the profile were MPI_Recv and rhs routines. There was considerable time spent in MPI_Recv across the ranks, except for two of the ranks which showed far lower MPI_Recv times. Figure 3 shows a histogram of the MPI_Recv routine's exclusive times. The left-most two outliers (ranks 61 and 125) also showed relatively larger times for the rhs routine. The user-level profile alone was unable to shed more light on the reasons behind these observations.

On examining the merged user/OS profile from KTAU additional observations were made. We first examined MPI_Recv to see what types of OS interactions it had and how long they took. Figure 4 shows MPI_Recv's kernel call groups (i.e., those kernel routines active during MPI_Recv execution). The top bar is the mean across all the ranks and the next two bars are MPI ranks 125 and 61, respectively.

Fig. 3 MPI_Recv: excl. time (seconds)

Fig. 4 MPI_Recv OS interactions

On average, most of MPI_Recv was spent inside scheduling, but comparatively less for ranks 125 and 61.

Scheduling is of two types: voluntary, where processes yield the processor waiting for completion of certain events, and involuntary, where the process is pre-empted due to processor contention. Figures 5 and 6 were generated from the KTAU profile and show the cumulative distribution function (CDF) of involuntary and voluntary scheduling activity experienced by the LU application threads. The X-axis is logscale to accentuate the curve's shape. In Fig. 5, the bottom of the curve labeled 64x2 Anomaly shows a small proportion of threads experiencing relatively very low voluntary scheduling activity. This trend is reversed for 64x2 Anomaly in Fig. 6. Again two ranks (61, 125) differ significantly from the others, with large involuntary and small voluntary scheduling.

Fig. 5 Voluntary scheduling (NPB LU)

Fig. 6 Involuntary scheduling (NPB LU)

These are the same two ranks that also showed higher rhs compute times in the user-level profile.

Surprisingly, we found that both of these ranks run on the same node (ccn10). Our first intuition was that OS-level interference and/or daemon activity on ccn10 was causing the problem. Low interrupt and bottom-half times suggested OS interference was not the major cause. The involuntary scheduling instead suggested daemon activity causing pre-emption of the LU tasks on ccn10. Again, the views from KTAU allowed easy testing of our hypothesis. Figure 7 shows the activity of all processes (including daemons, kernel threads, and applications) on ccn10. Each bar represents a single process that was active while the LU application was running on ccn10. The bottom two bars represent the LU tasks themselves. It is clear that no significant daemon activity occurred during the LU run as all other processes have minuscule execution times compared to the LU execution duration. This invalidated the daemon hypothesis.

The involuntary scheduling on ccn10 of the LU tasks could only be explained if the two LU tasks themselves pre-empted each other. But why would this occur?

Fig. 7 Node ccn10 OS activity (NPB LU)

On examination of the node itself (kernel logs and /proc/cpuinfo) we found that the OS had erroneously detected only a single processor (the reason is still being investigated). This matched exactly with KTAU measurements as the LU tasks were contending for the single available processor on ccn10.

As a result, the other MPI ranks experienced high voluntary scheduling as they waited for the slow node (ccn10) in MPI_Recv. Remote influences will be seen as voluntary scheduling, where processes are made to wait for other processes. And local slow-down due to pre-emption is seen as involuntary scheduling. By removing the faulty node from the MPI ring and re-running our LU experiment, the benchmark reported an average total execution time of 396.7 seconds. This was an improvement of 21.4 percent, which is still 36.1 percent slower than the 128-node case.

We then examined the profiles of the runs without the faulty node. The curve 64x2 in Figs. 5 and 6 shows lower scheduling activity than 64x2 Anomaly, as expected. But there still remains pre-emptive scheduling (Fig. 6) of 2.5 to 7 seconds duration across the ranks. With only a few hundred milliseconds worth of daemon activity accounted for, the LU tasks were still pre-empting each other. We decided to pin the tasks (using cpu affinity) one per processor so as to avoid unnecessary pre-emption between the LU threads.
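
Pinning of this kind can be done on Linux with the sched_setaffinity interface. The snippet below is a generic illustration of pinning the calling process to one CPU based on a node-local rank; it is not the exact procedure used in the experiments described here.

    /* Generic illustration of pinning a process to a single CPU on Linux,
     * e.g., one MPI task per processor to avoid mutual pre-emption. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Pin the calling process to the given CPU (0 or 1 on a dual-CPU node). */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        return sched_setaffinity(0, sizeof(mask), &mask);   /* 0 = this process */
    }

    int main(int argc, char **argv)
    {
        int local_rank = (argc > 1) ? atoi(argv[1]) : 0;     /* e.g., rank on this node */
        if (pin_to_cpu(local_rank % 2) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to CPU %d\n", local_rank % 2);
        return 0;
    }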

The pinned 64x2 run provided only a meager improvement of 3.2% (or 13.1 seconds) over the non-pinned runs. The curve labeled 64x2 Pinned in Fig. 6 shows much reduced pre-emptive scheduling (0.2 to 1.1 seconds). Surprisingly, KTAU reported a substantial increase in voluntary scheduling times. Figure 5 shows the increase in voluntary scheduling of 64x2 Pinned over 64x2 and also a nearly bi-modal distribution. To understand why the 64x2 Pinned run was experiencing greater imbalance and idle-waits, we examined another "usual suspect" of system interference, namely interrupts. Figure 8 shows 64x2 Pinned has a prominent bi-modal distribution with approximately half of the threads experiencing most of the interrupts. This was clearly the major cause of the imbalance leading to the high idle-wait times.

Table 2 Exec. time (secs) and % slowdown from the 128x1 configuration (NPB LU and ASCI Sweep3D experiments)

Config.          NPB LU                               ASCI Sweep3D
                 Exec. time   %Diff. from 128x1       Exec. time   %Diff. from 128x1
128x1            295.6        0                       369.9        0
64x2 Anomaly     512.2        73.2%                   639.3        72.8%
64x2             402.53       36.1%                   428.96       15.9%
64x2 Pinned      389.4        31.7%                   427.9        15.6%
64x2 Pin,I-Bal   335.96       13.6%                   404.6        9.4%

Fig. 8 Interrupt activity (NPB LU)

It suggested that interrupt-balancing (or irq-balancing) was not enabled. Therefore all interrupts were being serviced by CPU0, which meant LU threads pinned to CPU0 were being delayed and threads pinned to CPU1 were waiting idly.

Re-running the experiment with both pinning and irq-balancing enabled gave a total execution time of 335.96 seconds. This was a 16.5% improvement over the configuration without pinning and irq-balancing. Using KTAU's performance reporting capabilities, the performance gap between the 64x2 and 128x1 configurations had been reduced from 73.2% to 13.6%. This would have been impossible to do with user-level profile data alone. We also experimented with the ASCI Sweep3D benchmark [21] in this study. The gap there was reduced from 72.8% to 9.4%. These results are summarized in Table 2.

However, there was still a difference in performance between the 64x2 pinned, interrupt-balanced run and the 128x1 run (13.6% in NPB LU and 9.4% in Sweep3D). An advantage of correlating user-level and OS measurement is that the merged profiles can provide insight regarding the existence and causes of imbalance in the system. Examining kernel interactions of purely compute-bound segments of user-code is one way of investigating imbalance. In a synchronized parallel system with a perfectly load-balanced and synchronous workload, compute and communication phases happen in all processing elements (PE) at generally the same time.

Fig. 9 OS TCP in compute (Sweep3D)

But assume some local slowdown causes a PE X's compute-phase to be longer than another PE Y's. The compute-bound segment of user-code at PE X will be interrupted by kernel-level network activity initiated by PE Y at the start of its I/O phase. This will further dilate X's compute phase causing more imbalance effects.

Consider Sweep3D. Figure 9 shows the number of calls of kernel-level TCP routines that occurred within a compute-bound phase (free of any communication) of Sweep3D inside the sweep() routine. Note also that our MPI version of Sweep3D did not use any asynchronous (non-blocking) communication primitives that explicitly request compute/communication overlap. Larger call numbers indicate a greater mixing of computation and communication work, suggesting greater imbalance due to increased background TCP activity. The 64x2 Pinned, I-Bal curve shows a significantly larger number of TCP calls than the 128x1 curve, implying that imbalance is reduced in the 128x1 configuration. Note, the total number of kernel-level TCP calls throughout the Sweep3D run did not differ significantly between the 64x2 and 128x1 cases. This suggests that the "missing" TCP routine calls in the 128x1 run (Fig. 9) occur elsewhere in the I/O phase (where they should occur).

But what causes the imbalance to decrease in the 128x1 case? Is the extra processor absorbing some of the TCP activity?

Fig. 10 Time/OS TCP call (Sweep3D)

The curve labeled 128x1 Pin,IRQ CPU1 denotes a modified 128x1 configuration where both the Sweep3D process and all interrupts were pinned to the second CPU (CPU1). This curve closely follows 128x1. Also, by default, irq-balancing was not enabled for the 128x1 configuration. Thus, the extra free processor was not causing the performance improvement.

Figure 10 shows the CDF of exclusive time of a single kernel-level TCP operation across all kernel TCP operations. There is a marked difference of approximately 11.5% over the entire range between the 64x2 and 128x1 configurations. Hence, TCP activity was more expensive when two processors were used for computation. Interrupt-balancing blindly distributes interrupts (and the bottom-halves that follow) to both processors. Data destined for a thread running on CPU0 may be received by the kernel on CPU1, causing cache-related slowdowns. Therefore, the dilation in TCP processing times seen in the 64x2 run is very likely cache related ([24] also found evidence of TCP/IP cache problems on SMP).

5.3 Compute/communication overlap experiments

Overlap of computation and communication (or I/O) is beneficial in application codes because communication can be performed in otherwise idle periods. In MPI, this is done through the use of asynchronous (non-blocking) operations such as MPI_Isend and MPI_Irecv. When observing performance in the presence of compute/communication overlap, it is difficult to discern how performance problems should be attributed: to the application algorithm, to system-level interference, to the communications sub-system, or to other reasons.

We investigate these issues using a simple MPI send-receive benchmark shown in Fig. 11. This benchmark allows us to easily demonstrate effects that are difficult to produce in a controlled manner with a more complex code.

Fig. 11 Send-Recv benchmark

Fig. 12 Send-Recv performance

It consists of two MPI ranks (0,1) on two separate nodes that continuously perform sends and receives with computation in between. We will use it to examine scenarios where overlap of computation and kernel-assisted communication occurs implicitly (without being explicitly requested, say, by using MPI_Isend or MPI_Irecv) and show how KTAU can provide a better understanding of otherwise hard to comprehend performance data.
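
The structure of such a benchmark (Fig. 11) can be sketched as below. Routine names, iteration counts, and message sizes are our own assumptions, reconstructed from the description in the text; the actual benchmark may differ in detail.

    /* Sketch of the two-rank send-receive benchmark with compute phases,
     * reconstructed from the description of Fig. 11 (illustrative only). */
    #include <mpi.h>

    #define ITERS     500
    #define MSG_WORDS 4096

    static void do_work(void)
    {
        volatile double x = 0.0;
        for (int i = 0; i < 2000000; i++)
            x += i * 0.5;                 /* placeholder computation */
    }
    static void compute_top(void)    { do_work(); }
    static void compute(void)        { do_work(); }
    static void compute_bottom(void) { do_work(); }

    int main(int argc, char **argv)
    {
        int rank;
        static double buf[MSG_WORDS];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = 1 - rank;              /* exactly two ranks: 0 and 1 */

        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                /* The send may return early due to OS-level buffering; the
                 * actual transmission can then overlap compute_top(). */
                MPI_Send(buf, MSG_WORDS, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
                compute_top();
                compute();
                compute_bottom();
                MPI_Recv(buf, MSG_WORDS, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, MSG_WORDS, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                compute_top();
                compute();
                compute_bottom();
                MPI_Send(buf, MSG_WORDS, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }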

We begin by executing, for several hundred iterations, the benchmark instrumented with TAU over two nodes. The nodes are connected by Ethernet and run a KTAU kernel. We expect that the three computation routines (compute_top(), compute() and compute_bottom()) take similar amounts of time to execute as they are identical. We also expect that the respective routines on both nodes (ranks 0,1) take about the same time. Figure 12 shows a ParaProf screen-shot of user-level performance data from TAU. The top bars represent Rank0 (labeled n,c,t 0,0,0) and the bottom bars represent Rank1 (labeled n,c,t 1,0,0).

Fig. 13 MPI_Recv (OS)

Fig. 14 Compute_Top (OS)

Two features stand out. compute_top() takes significantly (>20%) longer on Rank0, and MPI_Recv() takes significantly longer on Rank1. The user-level profile does not shed light on the reasons behind these differences.

To investigate further, we look at the kernel-level activity occurring during the execution of these two routines. Figures 13 and 14 show the kernel call group data for MPI_Recv() and compute_top() respectively. MPI_Recv, on Rank1, is dilated due to larger wait time (scheduling). The compute_top() routine, on Rank0, owes its longer time to kernel-level TCP processing and bottom-half handling (which aids in TCP processing as well).

Based on this kernel-level information, we can update the benchmark block diagram (Fig. 15) to better explain the user-level performance data. On Rank0, MPI_Send returns immediately without transmitting data due to OS-level buffering. The actual transmission occurs overlapped over compute_top(), dilating that routine's execution time. On Rank1, the MPI_Send at the end of the loop also returns immediately due to OS buffering. But, in that case, the transmission is overlapped over MPI_Recv() at the beginning of the next loop iteration, explaining the longer time experienced by MPI_Recv() on Rank1.

Fig. 15 Send-Recv (with OS data)

Fig. 16 Send-Recv-Imbalanced benchmark

This demonstrates how OS buffering can introduce overlap implicitly, making performance data hard to interpret. Integrating OS-level performance information into user-level data can aid in that interpretation.

We next turn to the topic of imbalance and show that it can implicitly cause overlap of computation and communication to occur. The send-receive benchmark is modified by adding extra busy-looping in Rank0, as shown in Fig. 16. This causes imbalance in the loads of the two ranks.

Fig. 17 Send-Recv-Imbalanced performance

Fig. 18 MPI_Recv (OS)

While our implementation adds this imbalance explicitly in the code (as a separate routine), it may also occur because of OS interference. Keeping it explicit allows us to control where it occurs in the experiment.

Figure 17 shows the user-level profile of the imbalancedbenchmark. The difference seen in compute_top() has beenpreviously explained to be due to OS buffering. The dif-ference in the MPI_Recv() routine between the two ranksis far greater than before. In addition, there is a significant(>30%) difference in compute_bottom() between the ranks.We turn to KTAU’s kernel call group data for the latter tworoutines (Figs. 18, 19). MPI_Recv()’s dilation is still due towait time (scheduling). And compute_bottom() on Rank0 isinterrupted by TCP and bottom-half processing.

A reconstruction of the imbalanced send-recv benchmark block diagram shows that Rank1's MPI_Recv takes much longer because it waits for Rank0's MPI_Send (which in turn has been delayed by the added imbalance). The reason for OS TCP processing within Rank0's compute_bottom() is more interesting. When Rank1 reaches its MPI_Send() and starts to send data, Rank0, having been delayed, is still within compute_bottom().

Fig. 19 Compute_Bottom (OS)

Fig. 20 Send-Recv-Imbalanced benchmark (with OS data)

Although an MPI_Recv() has not been posted, Rank0's OS accepts and buffers the incoming data. These OS-level TCP receives are overlapped over compute_bottom(), causing it to be dilated.

Using KTAU we have demonstrated that kernel-assisted communication can cause overlap implicitly for two reasons: (i) OS-level buffering on MPI_Sends and MPI_Recvs, and (ii) imbalance across the ranks (caused by, for example, workload imbalance or by OS interference). The resulting user-level performance data can be difficult to understand as it can be unclear why certain routines are dilated. Using performance data integrated across the OS and user-level helps uncover the occurrence and cause of overlap, which in turn improves performance understanding.


Table 3 Perturbation: total exec. time (secs)

               NPB LU Class C (16 nodes)                                ASCI Sweep3D (128 nodes)
Metric         Base      Ktau Off   ProfAll   ProfSched   ProfAll+Tau   Base     ProfAll+Tau
Min            468.36    463.6      477.13    461.66      475.8         –        –
% Min Slow.    0         0          1.87      0           1.58          –        –
Avg            470.812   470.86     481.748   471.164     484.12        368.25   369.9
% Avg Slow.    0         0.01       2.32      0.07        2.82          0        0.49

5.4 Perturbation experiments

The above exercises clearly demonstrate KTAU's utility in performance measurement and analysis of parallel applications, especially when system-level effects cause performance problems. However, a source-based approach to kernel instrumentation, coupled with direct measurement, raises concerns about perturbation. The following experiments investigated the effects of KTAU measurement overhead on perturbation with respect to the level of instrumentation.

With a correctly configured Chiba cluster, we ran the LU benchmark again on 16 nodes and measured the overall application slowdown under five different configurations:

• Base: A vanilla Linux kernel and uninstrumented LU.
• Ktau Off: The kernel is patched with KTAU and all instrumentation points are compiled in. Boot-time control turns off all OS instrumentation by setting flags that are checked at runtime (see the sketch after this list).
• ProfAll: The same as Ktau Off, but with all OS instrumentation points turned on.
• ProfSched: Only the instrumentation points of the scheduler subsystem are turned on.
• ProfAll+Tau: ProfAll, but with user-level TAU instrumentation points enabled in all LU routines.
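The Ktau Off configuration relies on instrumentation that is compiled in but gated by flags checked at runtime. The sketch below shows the general shape of such a gate in C; it is our own user-space illustration of the mechanism, and the names (ktau_enabled, ktau_entry, ktau_exit) are placeholders rather than KTAU's actual symbols.

#include <stdio.h>

/* Hypothetical runtime flag; in the kernel the corresponding control would
   be set from a boot-time option and checked at every probe. */
static int ktau_enabled = 0;

/* Trivial stand-ins for the real measurement actions (reading a timer,
   updating per-process profile structures). */
static unsigned long entry_count = 0, exit_count = 0;

static void ktau_entry(unsigned int event_id)
{
    if (!ktau_enabled)      /* "Ktau Off": a single predictable branch, nothing else */
        return;
    entry_count++;          /* placeholder for timestamping and pushing the event */
    (void)event_id;
}

static void ktau_exit(unsigned int event_id)
{
    if (!ktau_enabled)
        return;
    exit_count++;           /* placeholder for timestamping and updating the profile */
    (void)event_id;
}

/* An instrumented routine keeps its probes compiled in; the flag alone
   decides whether they do any measurement work. */
static void instrumented_routine(void)
{
    ktau_entry(1);
    /* ... original routine body ... */
    ktau_exit(1);
}

int main(void)
{
    instrumented_routine();     /* probes present but disabled (Ktau Off)    */
    ktau_enabled = 1;           /* e.g. flipped by boot-time/runtime control */
    instrumented_routine();     /* probes now active (ProfAll)               */
    printf("entries=%lu exits=%lu\n", entry_count, exit_count);
    return 0;
}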

Table 3 shows the percentage slowdown as calculated from minimum and average execution times (over five experiments) with respect to the baseline experiment. In some cases, the instrumented times ran faster (which was surprising), and we report this as a 0% slowdown.

These results are remarkable for two reasons. First, with Ktau Off instrumentation (complete kernel instrumentation inserted, but disabled) we see no statistically significant slowdown for LU. While different workloads will certainly produce different kernel behavior, these results suggest that KTAU instrumentation can be compiled into the kernel and left disabled with negligible effect. Second, the slowdown for fully enabled KTAU instrumentation is small and, on average, increases only slightly when TAU instrumentation is added. If only performance data for kernel scheduling is of interest, KTAU's overhead is very small. It is also worth noting that the average execution time for the Sweep3D Base experiment was 368.25 seconds, versus 369.9 seconds for the fully instrumented ProfAll+Tau experiment, an average slowdown of 0.49%. For completeness, Table 4 reports the direct KTAU overhead introduced by a single instrumentation operation (start or stop).

Table 4 Direct overheads (cycles)

Operation   Mean    Std. Dev.   Min
Start       244.4   236.3       160
Stop        295.3   268.8       214
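Per-operation overheads of this kind can be estimated by wrapping the probe in a cycle counter. The sketch below shows one way to do so on x86 with the compiler's __rdtsc() intrinsic; it illustrates the measurement idea only and is not the harness actually used to produce Table 4 (probe_under_test is a placeholder for the start or stop operation being timed).

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define TRIALS 100000

/* Stand-in for the operation being timed (e.g. a start or stop probe). */
static void probe_under_test(void)
{
    __asm__ volatile ("" ::: "memory");   /* keep the call from being optimized away */
}

int main(void)
{
    uint64_t total = 0, min = UINT64_MAX;
    int i;

    for (i = 0; i < TRIALS; i++) {
        uint64_t t0 = __rdtsc();          /* read the time-stamp counter */
        probe_under_test();
        uint64_t t1 = __rdtsc();
        uint64_t d  = t1 - t0;

        total += d;
        if (d < min)
            min = d;
    }

    printf("mean %.1f cycles, min %llu cycles over %d trials\n",
           (double)total / TRIALS, (unsigned long long)min, TRIALS);
    return 0;
}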

6 Conclusions and future work

The desire for a kernel monitoring infrastructure that can provide both a kernel-wide and a process-centric performance perspective led us to the design and development of KTAU. KTAU is unique in its ability to measure the complete program-OS interaction, its support for joint daemon and program access to kernel performance data, and its integration with a robust application performance measurement and analysis system, TAU. In this paper, we demonstrated various aspects of KTAU's functionality and how KTAU is used for different kernel performance monitoring objectives. The present release of KTAU supports the Linux 2.4 and 2.6 kernels for 32-bit x86 and PowerPC platforms. We are currently porting KTAU to 64-bit IA-64 and PPC architectures and will be installing KTAU on a 16-processor SGI Prism Itanium-2 Linux system and a 16-processor IBM p690 Linux system in our lab.

There are several important KTAU objectives to pursue in the future. Because KTAU relies on direct instrumentation, we will always be concerned with the degree of measurement overhead and KTAU efficiency. Additional perturbation experiments will be conducted to assess KTAU's impact and to improve KTAU's operation. However, based on our current perturbation results, we believe a viable option for kernel monitoring is to instrument the kernel source directly, leave the instrumentation compiled in, and implement dynamic measurement control to enable and disable kernel-level events at runtime. This is in contrast to other kernel monitoring tools, such as KernInst, that use dynamic instrumentation. Presently, the instrumentation in KTAU is static. Our next step will be to develop mechanisms to dynamically disable and enable instrumentation points without requiring rebooting or recompilation. Finally, we will continue to improve the performance data sources that KTAU can access and to improve integration with TAU's user-space capabilities, providing better correlation between user and kernel performance information. This work includes performance counter access in KTAU, better support for merged user-kernel call-graph profiles, phase-based profiling, and merged user-kernel traces. We will also continue to port KTAU to other HPC architectures.

The scope of KTAU application is quite broad. We have ported KTAU to the IBM BG/L machine [23] as part of our work on the ZeptoOS project [19, 23] (KTAU is part of the ZeptoOS distribution). We will be evaluating I/O node performance of the BG/L system, and once a ZeptoOS port is complete, KTAU will also be used to provide kernel performance measurement and analysis for dynamically adaptive kernel configuration. I/O performance characterization and kernel configuration evaluation are equally of interest on any cluster platform running Linux. Other groups have also expressed interest in KTAU. Wolski is evaluating KTAU as a way of instrumenting the Linux kernel for the purpose of building customized OS images for HPC applications [25]. In addition, that group is looking to use KTAU as part of a suite of tools employed for monitoring HPC and distributed systems.

Acknowledgements The KTAU research was supported by the Department of Energy's Office of Science (contract no. DE-FG02-05ER25663) and the National Science Foundation (grant no. NSF CCF 0444475). The authors thank Suravee Suthikulpanit for his contribution to KTAU's implementation and Rick Bradshaw for his assistance with the Chiba-City cluster.

References

1. Petrini, F., Kerbyson, D.J., Pakin, S.: The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In: SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, p. 55. IEEE Computer Society, Washington (2003)

2. Jones, T., et al.: Improving the scalability of parallel jobs by adding parallel awareness to the operating system. In: SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, Washington (2003)

3. TAU: Tuning and Analysis Utilities, http://www.cs.uoregon.edu/research/paracomp/tau/

4. Hollingsworth, J.K., Miller, B.P., Cargille, J.: Dynamic program instrumentation for scalable performance tools. Tech. Rep. CS-TR-1994-1207 (1994). [Online]. Available: citeseer.ist.psu.edu/75570.html

5. Tamches, A., Miller, B.P.: Fine-grained dynamic instrumentation of commodity operating system kernels. Oper. Syst. Des. Implement., 117–130 (1999)

6. Cantrill, B.M., Shapiro, M.W., Leventhal, A.H.: Dynamic instrumentation of production systems. In: USENIX '04: Proceedings of the 2004 USENIX Annual Technical Conference, p. 13. USENIX, Boston (2004)

7. Yaghmour, K., Dagenais, M.R.: Measuring and characterizing system behavior using kernel-level event logging. In: USENIX '00: Proceedings of the 2000 USENIX Annual Technical Conference, p. 15. USENIX, Boston (2000)

8. Wisniewski, R.W., Rosenburg, B.: Efficient, unified, and scalable performance monitoring for multiprocessor operating systems. [Online]. Available: citeseer.csail.mit.edu/675589.html

9. Richard, M.D., et al.: Efficient and accurate tracing of events in Linux clusters. [Online]. Available: citeseer.ist.psu.edu/627702.html

10. SGI kernprof, http://oss.sgi.com/projects/kernprof/

11. OProfile, http://sourceforge.net/projects/oprofile/

12. Ruan, Y., Pai, V.: Making the "box" transparent: System call performance as a first-class result. In: USENIX '04: Proceedings of the 2004 USENIX Annual Technical Conference, p. 15. USENIX, Boston (2004)

13. Mirgorodskiy, A., Miller, B.P.: Crosswalk: A tool for performance profiling across the user-kernel boundary. [Online]. Available: citeseer.csail.mit.edu/692418.html

14. Etsion, Y., Tsafrir, D., Kirkpatrick, S., Feitelson, D.G.: Fine-grained kernel logging with klogger: Experience and insights. Technical Report 2005-35, School of Computer Science and Engineering, The Hebrew University of Jerusalem (2005)

15. Sharma, S., Bridges, P.G., Maccabe, A.B.: A framework for analyzing Linux system overheads on HPC applications. In: LACSI '05: Proceedings of the 2005 Los Alamos Computer Science Institute Symposium, Santa Fe, NM, USA, p. 17 (2005)

16. Bell, R., Malony, A.D., Shende, S.: A portable, extensible, and scalable tool for parallel performance profile analysis. In: Lecture Notes in Computer Science, vol. 2790, pp. 17–26. Springer, Berlin (2003)

17. Nagel, W.E., Arnold, A., Weber, M., Hoppe, H.C., Solchenbach, K.: VAMPIR: Visualization and analysis of MPI resources. Supercomputer 12(1), 69–80 (1996). [Online]. Available: citeseer.ist.psu.edu/nagel96vampir.html

18. Zaki, O., Lusk, E., Gropp, W., Swider, D.: Toward scalable performance visualization with Jumpshot. Int. J. High Perform. Comput. Appl. 13(3), 277–288 (1999). [Online]. Available: citeseer.ist.psu.edu/zaki99toward.html

19. ZeptoOS: The small Linux for big computers, http://www.mcs.anl.gov/zeptoos/

20. Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, D., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A., Schreiber, R.S., Simon, H.D., Venkatakrishnan, V., Weeratunga, S.K.: The NAS parallel benchmarks. Int. J. Supercomput. Appl. 5(3), 63–73 (1991). [Online]. Available: citeseer.ist.psu.edu/bailey95nas.html

21. Hoisie, A., Lubeck, O.M., Wasserman, H.J., Petrini, F., Alme, H.: A general predictive performance model for wavefront algorithms on clusters of SMPs. In: International Conference on Parallel Processing, p. 219 (2000)

22. McVoy, L.W., Staelin, C.: lmbench: Portable tools for performance analysis. In: USENIX Annual Technical Conference, pp. 279–294 (1996). [Online]. Available: citeseer.ist.psu.edu/mcvoy96lmbench.html

23. Nataraj, A., Malony, A., Morris, A., Shende, S.: Early experiences with KTAU on the IBM BG/L. In: EuroPar06: European Conference on Parallel Processing (2006)

24. Bhattacharya, S., Apte, V.: A measurement study of the Linux TCP/IP stack performance and scalability on SMP systems. In: 1st International Conference on COMmunication Systems softWAre and middlewaRE (COMSWARE) (2006)

25. Personal communication—Application Specific Linux, http://www.cs.ucsb.edu/~lyouseff/ASL.htm


Aroon Nataraj has been pursuing a Ph.D. degree in the areas of parallel performance instrumentation, measurement and analysis at the University of Oregon since 2003. He was employed as a Member of Technical Staff at Sun Microsystems and as a Network Software Engineer at Intel Corporation in 2001 and 2002, respectively. He received a Master of Science and Technology (M.Sc. Tech.) degree in Information Systems from the Birla Institute of Technology and Science, Pilani, India in 2001.

Allen D. Malony is a Professor in the Department of Computer and Information Science at the University of Oregon. He received the B.S. and M.S. degrees in Computer Science from the University of California, Los Angeles in 1980 and 1982, respectively. He received the Ph.D. degree from the University of Illinois at Urbana-Champaign in October 1990. He joined the faculty at Oregon in 1991, spending his first year as a Fulbright Research Scholar and visiting Professor at Utrecht University in The Netherlands. He was awarded the NSF National Young Investigator award in 1994. In 1999 he was a Fulbright Research Scholar to Austria, located at the University of Vienna. He was awarded the prestigious Alexander von Humboldt Research Award for Senior U.S. Scientists by the Alexander von Humboldt Foundation in 2002. His research interests are in parallel computing, performance analysis, supercomputing, and scientific software environments. He is the Director of the Performance Research Laboratory, the NeuroInformatics Center, and the Computational Science Institute at the University of Oregon.

Sameer Shende is a research faculty member in the NeuroInformatics Center at the University of Oregon and the President of ParaTools, Inc. He received the Bachelor of Technology (B.Tech.) degree in Electrical Engineering from the Indian Institute of Technology, Bombay, India in 1991. He received the M.S. and Ph.D. degrees from the University of Oregon in 1996 and 2001, respectively. His research interests are in the area of performance analysis of parallel programs, instrumentation, measurement, and techniques for mapping performance data. He helped develop the TAU performance system.

Alan Morris is a research assistant at the University of Oregon, where he works on the TAU Performance System. He received his B.S. in Computer Science from the University of Utah in 2002 and his M.S. in Computer Science from the University of Arizona in 2004.