I/O, performance analysis, and performance data immersion



I/O, Performance Analysis, and Performance Data Immersion*

Daniel A. Reed, Christopher L. Elford, Tara Madhyastha, Will H. Scullin, Ruth A. Aydt, Evgenia Smirni
Department of Computer Science
University of Illinois
Urbana, Illinois 61801

Abstract

A large and important class of national challenge applications is irregular, with complex, data dependent execution behavior, and dynamic, with time varying resource demands. We believe the solution to the performance optimization conundrum is integration of dynamic performance instrumentation and on-the-fly performance data reduction with configurable, malleable resource management algorithms, and a real-time adaptive control mechanism that automatically chooses and configures resource management algorithms based on application request patterns and observed system performance. Within the context of parallel input/output optimization, we describe the components of such a closed-loop control system based on the Pablo performance analysis environment, a portable parallel file system (PPFS), and virtual environments for study of dynamic performance data and interactive control of file system policies.

1 Introduction

It is increasingly clear that a large and important class of national challenge applications is irregular, with complex, data dependent execution behavior, and dynamic, with time varying resource demands. For example, recent studies of application input/output behavior [2] showed that tuning file system policies to exploit knowledge of application access patterns can increase performance by more than an order of magnitude [5].

Unfortunately, the space of possible performance optimizations is large and non-convex, and the best match of application and resource management technique is seldom obvious a priori. Moreover, because the interactions between application and system software change across applications and during a single application's execution, runtime libraries and resource management policies should automatically and unobtrusively adapt to rapidly changing application behavior.

*This work was supported in part by the Advanced Research Projects Agency under ARPA contracts DAVT63-91-C-0029 and DABT63-93-C-0040, by the National Science Foundation under grants NSF IRI 92-12976, NSF ASC 92-12369, and NSF CDA 94-01124, and by the National Aeronautics and Space Administration under NASA Contracts NGT-51023, NAG-1-613, and USRA 5555-22.

Current performance instrumentation and analysis tools provide the data necessary to understand the causes of poor performance a posteriori, but alone they are insufficient to adapt to temporally varying application resource demands and system responses. We believe the solution to the performance optimization conundrum is integration of dynamic performance instrumentation and on-the-fly performance data reduction with configurable, malleable resource management algorithms, and a real-time adaptive control mechanism that automatically chooses and configures resource management algorithms based on application request patterns and observed system performance.

Based on this thesis, we describe the ongoing integration of the Pablo performance analysis environment [15, 14, 16] with input/output analysis software [2], flexible parallel file system policies [5], immersive virtual environments for interactive file system policy selection and optimization [18, 20], and an infrastructure for automatic classification of input/output behavior [8]. We also briefly describe Pablo's role in the Scalable I/O Initiative [13], a wide-ranging project to characterize application input/output performance, develop new language interfaces and file systems, and prototype input/output software.

The remainder of this paper is organized as follows. In §2, we describe performance instrumentation and analysis techniques and summarize the lessons learned during the development of the Pablo performance analysis environment. In §3, we describe our experiences using software performance instrumentation techniques to capture and analyze the input/output dynamics of large, input/output intensive parallel applications, and the relation of this effort to the new Scalable I/O Initiative. This is followed in §4 by a brief description of a flexible infrastructure for studying parallel file system policies, and in §5 by a description of virtual environments for study of dynamic performance data and interactive control of file system policies. In §6 we sketch our vision of a closed-loop adaptive control system for resource management policy selection and tuning, and in §7 we describe related work.

Finally, §8 summarizes the current state of our work and outlines plans for continued research.

2 Performance Analysis Techniques

In [15], we argued that an ideal performance analysis environment should support interactive insertion of instrumentation points, as well as subsequent data analysis, reduction, and display. For monitoring and adaptive control of resource management policies, performance environments must also support real-time data analysis and extraction. These constraints dictate performance environments that are portable across a range of parallel architectures, allowing resource management policy tuning on multiple parallel systems; scalable with the size of the system being studied, allowing optimization on large systems; and extensible, allowing addition of environment functionality as needed.

These are not mere suppositions, but are based on our experience building three generations of performance analysis tools for multiple generations of parallel systems [9, 17, 15]. In an earlier collaboration with Intel, we developed a distributed memory performance analysis environment that included operating system and application software instrumentation [17], hardware support for performance data capture [10], and a data analysis and visualization system [11, 12]. Although this environment provided detailed data on system and application software behavior, it could not be ported to other parallel systems without major modifications, it could not be extended without intimate knowledge of the environment's internal software structure, and it could not accommodate hundreds of processors.

To redress these problems, for the past five years we have worked to develop a software performance analysis infrastructure, called Pablo [15, 14, 16], for portable, extensible performance data capture and analysis. Below, we describe the Pablo environment, with emphasis on those components that support real-time performance monitoring and adaptive control.

2.1 Pablo Software Infrastructure

The complete Pablo¹ environment consists of

- an extensible performance data metaformat (SDDF) and associated library that separates the structure of performance data records from their semantics, allowing easy addition of new performance data types,
- an instrumenting parser capable of generating instrumented SPMD source code,
- extensible instrumentation libraries that can capture event traces, counts, or interval times and reduce the captured performance data on the fly, and
- graphical performance data display and sonification tools, based on coarse-grain graphical programming, that support rapid prototyping of performance analyses.

¹Pablo is a registered trademark of the Board of Trustees of the University of Illinois.
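As a hypothetical illustration of the second and third items, the fragment below shows the kind of wrapper an instrumenting parser might emit around an application I/O call so that a data capture library sees timestamped entry and exit events. The event identifiers and the trace_event() routine are invented for this sketch; they are not the actual Pablo interface.

    /* Hypothetical sketch of instrumented output: an application read() call is
     * wrapped so that a data capture library records timestamped entry and exit
     * events.  The event identifiers and trace_event() are invented for
     * illustration; they are not the actual Pablo interface. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>
    #include <fcntl.h>

    #define EVT_READ_BEGIN 100
    #define EVT_READ_END   101

    static void trace_event(int event_id, long p1, long p2)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        /* A real capture library would buffer a binary record (see the
         * self-describing format discussed in Section 2.2), not print;
         * printing keeps this sketch self-contained. */
        printf("%ld.%06ld event %d args %ld %ld\n",
               (long)tv.tv_sec, (long)tv.tv_usec, event_id, p1, p2);
    }

    static ssize_t instrumented_read(int fd, void *buf, size_t count)
    {
        trace_event(EVT_READ_BEGIN, fd, (long)count);
        ssize_t n = read(fd, buf, count);          /* the original application call */
        trace_event(EVT_READ_END, fd, (long)n);
        return n;
    }

    int main(void)
    {
        char buf[4096];
        int fd = open("/etc/hostname", O_RDONLY);  /* any readable file will do */
        if (fd >= 0) {
            instrumented_read(fd, buf, sizeof buf);
            close(fd);
        }
        return 0;
    }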

As Figure 1 suggests, the Pablo instrumentation software captures dynamic performance data via instrumented source code that is linked with a data capture library. During program execution, the instrumentation code generates performance data that can either be directly recorded by the data capture library or processed by one or more data analysis extensions prior to recording. After program execution completes, the data can be analyzed by a toolkit of data transformation modules that process the self-describing data format, together with a graphical programming model that allows users to interactively connect and configure a data analysis graph.

This Pablo software infrastructure has been used as a basis for a portion of Intel's performance analysis tools on the Intel Paragon XP/S [19] and has been integrated with data parallel compilers to study the performance of Fortran D and HPF codes [1]. For additional details on the Pablo software design philosophy and our experiences, see [15, 14, 16].

2.2 Pablo Software Extensibility

Given the rapid changes in hardware platforms and programming models and the performance optimization problems inherent in an evolving market, software performance tools must be capable of evolving to accommodate unanticipated needs. Since its development, we have retargeted the Pablo software to study application input/output patterns [2], World Wide Web access patterns [6], data parallel languages [1], and parallel file system policies [5].

One of the keys to the Pablo environment's adaptability has been the inclusion of a self-defining data format (SDDF). SDDF data streams consist of a group of record descriptors and record instances. Much as structure declarations in the C programming language specify templates for storage allocation, SDDF descriptors define the structure of record instances. The data following the descriptors consists of a stream of descriptor tag and data record pairs. Each tag identifies the descriptor that defines the juxtaposed data. By separating the structure of data from its semantics, the Pablo SDDF library permits construction of tools that can extract and process SDDF records and record fields with minimal knowledge of the data's deeper semantics.

The second key to Pablo's adaptability has been the design of a set of software extension interfaces for the Pablo data capture library. Via these interfaces, one can register a set of performance data analysis modules that create and process performance data records prior to extraction. The function of such extensions can range from simple matching of procedure entry/exit records for computing procedure invocation lifetimes to sophisticated calculation of input/output activity histograms based on input/output activity traces or correlation of dynamic program behavior with data parallel source code [1].

Finally, the ability to extract performance data via several mechanisms (e.g., files or network sockets) has enabled a variety of data analysis modes. Real-time transmission of captured (and optionally processed) performance data from each node of a parallel system makes possible real-time display and, with the requisite interfaces, interactive control of application or system behavior. As described in §5, we have used this capability of the Pablo library to develop a virtual environment for real-time data display and control.
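To make the self-describing idea concrete, the sketch below writes one record descriptor followed by tagged record instances and then reads them back using only the descriptor. The binary layout, tag values, and field names are invented for illustration; they are not the actual SDDF syntax or library API.

    /* A minimal sketch of the self-describing idea behind SDDF: a stream of
     * record descriptors followed by (tag, record) pairs.  This is NOT the
     * actual SDDF format or library; layouts and tags are invented purely to
     * illustrate how a tool can size and skip records it does not understand. */
    #include <stdio.h>
    #include <string.h>

    struct field_desc { char name[32]; size_t size; };      /* one field of a record */
    struct record_desc {                                     /* analogous to a C struct declaration */
        int tag;                    /* identifies record instances of this type */
        char name[32];
        int nfields;
        struct field_desc fields[8];
    };

    static size_t record_size(const struct record_desc *d)
    {
        size_t s = 0;
        for (int i = 0; i < d->nfields; i++) s += d->fields[i].size;
        return s;
    }

    int main(void)
    {
        /* Descriptor for a hypothetical "I/O read" event record. */
        struct record_desc read_desc = { 1, "IORead", 3,
            { { "timestamp", sizeof(double) },
              { "bytes",     sizeof(long)   },
              { "duration",  sizeof(double) } } };

        FILE *f = tmpfile();
        if (!f) return 1;

        /* Writer: emit the descriptor once, then tagged record instances. */
        fwrite(&read_desc, sizeof read_desc, 1, f);
        double ts = 12.5, dur = 0.003; long bytes = 8192;
        fwrite(&read_desc.tag, sizeof(int), 1, f);
        fwrite(&ts, sizeof ts, 1, f);
        fwrite(&bytes, sizeof bytes, 1, f);
        fwrite(&dur, sizeof dur, 1, f);

        /* Reader: knows only the descriptor, not the record's semantics, yet can
         * size, skip, or extract the fields of each tagged record. */
        rewind(f);
        struct record_desc d;
        fread(&d, sizeof d, 1, f);
        int tag;
        while (fread(&tag, sizeof tag, 1, f) == 1) {
            char buf[256];
            fread(buf, 1, record_size(&d), f);   /* layout known from the descriptor */
            printf("record tag %d (%s), %zu bytes of payload\n", tag, d.name, record_size(&d));
        }
        fclose(f);
        return 0;
    }

A real stream carries many descriptors, and a consumer dispatches on the tag preceding each record to locate the matching descriptor before deciding whether to process or skip it.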

[Figure 1: Pablo Instrumentation Infrastructure. Components shown: graphical instrumentation interface, Fortran or C source code, instrumentation points, call tree description, parser, instrumented Fortran or C source code, Fortran or C compiler, instrumented object code, linker, data capture library with extension interface and (planned) dynamic clustering, instrumented executable program, parallel machine, and standard metaformat performance data; the legend distinguishes vendor-supplied from project-developed components.]

3 Input/Output Characterization

The acceptance of massively parallel systems as the vehicle of choice for high-performance computing has produced a wide variety of machines and an even wider variety of potential input/output configurations, most with inadequate input/output capacity and performance. Understanding application input/output patterns is the first step to optimizing application performance with extant file systems and to developing more effective parallel input/output file system policies. By extending the Pablo performance instrumentation and analysis software, we have created an infrastructure for quantifying current input/output patterns in parallel applications and measuring file system responses to those patterns.

3.1 Pablo Input/Output Extensions

Any instrumentation system must strike a balance between instrumentation detail and perturbation of application behavior. When characterizing input/output behavior, use of the input/output system to extract performance data can lead to particularly pernicious perturbations. To capture and analyze application input/output data while minimizing input/output perturbations, we exploited the Pablo data capture library's extension interfaces to develop a suite of input/output analysis routines that support both real-time calculation of statistical summaries and capture of detailed event traces.

The former trades computation perturbation for input/output perturbation: summaries are computed on the fly rather than written to storage as voluminous traces.

Detailed input/output event traces include the time, duration, size, and other parameters of each input/output operation. Statistical summaries can take any one of three forms: file lifetime, time window, or file region. File lifetime summaries include the number and total duration of file reads, writes, seeks, opens, and closes, as well as the number of bytes accessed for each file and the total time each file was open. Time window summaries contain similar data, but allow one to specify a window of time for summarization. Finally, file region summaries are the spatial analog of time window summaries; they define a summary over the accesses to a file region.
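As a sketch of what a file lifetime summary might accumulate, the structure below mirrors the counts and durations described above and is updated as each I/O event is observed. The field set and update routine are illustrative; they are not the Pablo I/O extension's actual record layout.

    /* Illustrative file-lifetime summary, accumulated as each I/O event is seen.
     * This mirrors the counts and durations described in the text; it is not the
     * actual Pablo I/O extension's record layout. */
    #include <stdio.h>

    enum io_op { IO_READ, IO_WRITE, IO_SEEK, IO_OPEN, IO_CLOSE, IO_NOPS };

    struct file_lifetime_summary {
        long   count[IO_NOPS];       /* number of reads, writes, seeks, opens, closes */
        double duration[IO_NOPS];    /* total time spent in each operation type (seconds) */
        long   bytes_accessed;       /* bytes read or written */
        double time_open;            /* total time the file was open */
    };

    static void record_io_event(struct file_lifetime_summary *s,
                                enum io_op op, double duration, long bytes)
    {
        s->count[op] += 1;
        s->duration[op] += duration;
        if (op == IO_READ || op == IO_WRITE)
            s->bytes_accessed += bytes;
    }

    int main(void)
    {
        struct file_lifetime_summary s = { {0}, {0}, 0, 0.0 };
        record_io_event(&s, IO_OPEN,  0.020, 0);
        record_io_event(&s, IO_READ,  0.004, 8192);
        record_io_event(&s, IO_READ,  0.003, 8192);
        record_io_event(&s, IO_CLOSE, 0.001, 0);
        s.time_open = 5.3;   /* close timestamp minus open timestamp, tracked elsewhere */
        printf("%ld reads, %ld bytes accessed, %.3f s reading\n",
               s.count[IO_READ], s.bytes_accessed, s.duration[IO_READ]);
        return 0;
    }

A time window or file region summary would carry the same counters plus the time window or byte range over which they were accumulated.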

3.2 A Characterization Example

During the past two years, we and others have used the Pablo input/output analysis software to study the behavior of a wide variety of parallel applications on the Intel Paragon XP/S and IBM SP/2 [2]. These applications represent a snapshot of current input/output practice on scalable parallel machines and reflect the developers' input/output design choices based on perceived and actual limitations of available input/output systems. These initial codes are but a small part of the nascent Scalable Input/Output Initiative's (SIO) code suite, described in §3.4.

As one example, we have tracked the performance of multiple versions of an electron scattering code [21] across three releases of the Intel OSF/1 parallel file system (PFS) on the Caltech Paragon XP/S. Electron scattering results are of interest in many contexts, including semiconductor fabrication.

[Figure 2: Intel Paragon XP/S Input/Output Performance (Electron Scattering Code). Panels show total I/O time (seconds) by processor identifier and read and write durations (seconds) over execution time, for OSF 1.2 (non-optimized) and OSF 1.3 (optimized).]

This particular electron scattering code computes the scattering probabilities for low-energy electron-molecule collisions by solving linear systems whose terms are obtained via numerical quadrature. Because the quadrature data is too voluminous to fit in the memory of the processors, it is written to disk and retrieved when needed. For a modest, though representative, data set the electron scattering code executes for roughly one and three quarter hours on 128 processors of the Intel Paragon XP/S. Production data sets have similar behavior, but with ten to fifty hour executions on 256 or 512 processors and much larger input/output data volumes.

Figure 2 shows the behavior of two snapshots of this code. In the first version, measured on Intel OSF 1.2, all processors concurrently open and read a set of initialization files with a variety of request sizes. Following this, all the processors compute and a single processor writes the quadrature data. A small interval of data reuse occurs near the end of the computation. In the second version, measured sixteen months later on Intel OSF 1.3, a single processor reads the initialization data in large chunks and broadcasts the data to the other nodes. The processors write quadrature data, but in larger units matched to the stripe size of the Intel parallel file system.

As Figure 2 shows, the effects of these application optimizations and improvements in the Intel parallel file system are striking. By avoiding concurrent file opens and reads, file system contention is reduced, and total input/output time decreases by nearly an order of magnitude. Likewise, because the Intel file system does no caching and is optimized for large requests, increasing the read and write request sizes better exploits the file system's capabilities and reduces total read/write time even though the individual write times increase.
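The second version's read strategy, one process reading in large chunks and broadcasting to the rest, is a common pattern. The sketch below illustrates it in C with MPI purely for concreteness; the paper does not show the application's code, and the message-passing layer, file name, and chunk size here are illustrative.

    /* Illustrative single-reader-plus-broadcast input pattern, as in the second
     * version of the code described above.  MPI is used only for concreteness;
     * the original application's message-passing layer and file names are not
     * specified in the paper. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK_BYTES (4 * 1024 * 1024)   /* large requests suit PFS better than many small ones */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *chunk = calloc(1, CHUNK_BYTES);
        long nread = 0;

        if (rank == 0) {                         /* only one process touches the file system */
            FILE *f = fopen("init.dat", "rb");   /* hypothetical initialization file */
            if (f) {
                nread = (long)fread(chunk, 1, CHUNK_BYTES, f);
                fclose(f);
            }
        }
        /* Everyone else receives the data over the interconnect instead of
         * issuing concurrent small reads that contend for the file system. */
        MPI_Bcast(&nread, 1, MPI_LONG, 0, MPI_COMM_WORLD);
        MPI_Bcast(chunk, CHUNK_BYTES, MPI_BYTE, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("broadcast %ld bytes of initialization data\n", nread);
        free(chunk);
        MPI_Finalize();
        return 0;
    }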

3.3 Input/Output Implications

As analysis of the electron scattering code's behavior suggests, there are many opportunities to improve the performance of current parallel file systems (e.g., through better support for caching and prefetching) and to reduce the burden faced by application developers when tuning their codes to exploit file system idiosyncrasies. In particular, our characterization study [2] has shown that parallel applications exhibit a wide variety of input/output request patterns, with both very small and very large request sizes, sequential and non-sequential access, and a variety of temporal variations.

Small input/output requests are best managed by aggregation, prefetching, caching, and write-behind, whereas large requests are better served by streaming data directly to or from storage devices and application buffers. In consequence, we believe that a parallel file system design that relies on a single, system-imposed file system policy is unlikely to be successful. In short, exploitation of input/output access pattern knowledge in caching and prefetching systems is crucial to obtaining a substantial fraction of peak input/output performance. Inherent in such an adaptive approach is the need to identify access patterns and choose policies based on access pattern characteristics. We will return to these topics in §4 and §6.

3.4 Scalable I/O Initiative

The Scalable I/O (SIO) Initiative [13] is a newly launched multi-agency (ARPA, DOE, NSF, and NASA) research program that includes application and system input/output characterization, networking, file systems and file system application programming interfaces (APIs), compiler and language support, and basic operating system services. The primary hardware testbeds for this research are the Intel Paragon XP/S at Caltech, the IBM SP-2 at Argonne National Laboratory, and the Convex Exemplar and Silicon Graphics Power Challenge Array at NCSA.

Organized as five major working groups (performance characterization, operating systems, languages and runtime systems, applications, and software integration) that include application developers, academic researchers, and vendors, the initiative's goal is a broad-based attack on the input/output problem that will produce not only new input/output optimization ideas but also the embodiment of those ideas in system software prototypes. In addition to the five working groups, two multidisciplinary groups are designing APIs for parallel file systems and developing a suite of parallel input/output benchmarks.

The SIO Initiative builds on our earlier input/output characterization effort: the Pablo performance environment's input/output analysis software is the basis for the SIO Initiative's study of application input/output behavior. In addition, we are working with Intel and other vendors to augment disk and network device drivers with instrumentation to record physical input/output behavior and to integrate this support with an extended version of the Pablo performance data capture software. By correlating application input/output request patterns with system input/output responses, one can quantify the efficacy of particular file system policies and, as described in §6, exploit this data to adaptively tune file systems.

4 Flexible Parallel File Systems

The data from our application input/output studies show wide variation in temporal and spatial access patterns, and it seems unlikely that a single file system policy can minimize physical input/output operations and overlap all unavoidable physical input/output with application computation. To explore the thesis that performance is best maximized by tuning file system policies to application behavior, we continue to develop a portable parallel file system (PPFS) [5] to study the interaction of application access patterns, file caching and prefetching algorithms, and application file data distributions.

4.1 PPFS Architecture

PPFS consists of a user-level input/output library that is portable across both parallel systems (the Intel Paragon XP/S, IBM SP/2, and Convex Exemplar) and workstation clusters. Although a user-level implementation sacrifices control of low-level details (e.g., file block placement), it allows us to extend the software with minimal effort and to conduct a wide range of experiments more quickly than would be possible via system software modifications.

[Figure 3: Intel Paragon XP/S PPFS Benchmark Performance (Sixteen I/O Nodes). Surface plots of time (seconds) versus request size (1-128 KB) and processor count (1-256) for (a) reads and (b) writes under PFS OSF 1.2 and under PPFS.]

As Figure 6 shows, the key software components of PPFS are clients, servers, metadata server(s), and caching agents. A client consists of an instance of the user application code and the local caching, prefetching, and bookkeeping software that permits an application to use PPFS. Clients communicate with the metadata server to determine which data servers and caching agents will satisfy input/output requests. Caching agents are shared caches via which multiple clients share files. Finally, input/output servers are the ultimate resolvers of requests. Via this architecture, PPFS provides application control of a variety of file caching, prefetching, data layout, and coherence policies.
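As a purely hypothetical sketch of what per-file policy control might look like from the application's side, the fragment below bundles the tunable choices named above into one structure. The types, constants, and the ppfs_set_policy() call are invented; the actual PPFS interface is described in [5], not here.

    /* Hypothetical sketch of per-file policy control in the spirit of PPFS.
     * All names are invented for illustration; the real PPFS interface is
     * described in [5]. */
    #include <stdio.h>
    #include <stddef.h>

    enum cache_policy    { CACHE_NONE, CACHE_LRU, CACHE_MRU };
    enum prefetch_policy { PREFETCH_NONE, PREFETCH_SEQUENTIAL, PREFETCH_STRIDED };

    struct io_policy {
        enum cache_policy    cache;
        size_t               cache_bytes;      /* per-client cache size */
        enum prefetch_policy prefetch;
        int                  prefetch_depth;   /* blocks to read ahead */
        size_t               write_aggregate;  /* coalesce writes up to this many bytes */
    };

    /* Stand-in for a call that would forward the policy to the client-side
     * caching and prefetching layer and to the caching agents. */
    static int ppfs_set_policy(const char *path, const struct io_policy *p)
    {
        printf("%s: cache=%d (%zu bytes), prefetch=%d (depth %d), aggregate=%zu\n",
               path, p->cache, p->cache_bytes, p->prefetch, p->prefetch_depth,
               p->write_aggregate);
        return 0;
    }

    int main(void)
    {
        /* A sequential, small-request read phase might ask for caching and
         * deep sequential prefetching on its input file. */
        struct io_policy seq_read = { CACHE_LRU, 4u << 20, PREFETCH_SEQUENTIAL, 8, 0 };
        ppfs_set_policy("init.dat", &seq_read);

        /* A write-intensive phase might instead ask for write aggregation. */
        struct io_policy agg_write = { CACHE_NONE, 0, PREFETCH_NONE, 0, 256u << 10 };
        ppfs_set_policy("quad.dat", &agg_write);
        return 0;
    }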

4.2 PPFS Instrumentation

To analyze the internal behavior of PPFS, we have used the Pablo instrumentation environment to capture event traces of internal PPFS state transitions, procedure calls, and input/output events. In addition, each PPFS client and server maintains data on file system performance (e.g., queue lengths and delays, service times, and request throughputs). Together, these provide the data needed to understand the efficacy of particular policy choices and configurations and, as we shall see in §6, to interactively adjust policy parameters.

4.3 PPFS Experiences

By carefully choosing PPFS parameters and data management policies to match application access patterns, it is possible to increase application input/output performance by an order of magnitude over that achievable with vendor parallel file systems [5]. Intuitively, request aggregation, write-behind, prefetching, and caching policies match the application request stream to the capabilities of the underlying file system.

As an example of the efficacy of matching file system policies to application access patterns, Figure 3 shows the time to read and write a 64 MB file using a variety of request sizes and numbers of processors on an Intel Paragon XP/S with sixteen I/O nodes; see [5] for details. For the read benchmark, each processing node reads the entire file. For the write benchmark, each processing node writes an equal portion of the file, with record-level interleaving.

Because PFS does not cache or prefetch data, large numbers of small reads are inefficient: each read incurs the full input/output latency, and the input/output nodes each see large numbers of small requests. Small writes suffer from many of the same problems as small reads; file system call overhead dominates total execution time.

By matching policies to access patterns, PPFS combines multiple pending read requests for the same record, and subsequent requests for the same record can be serviced from the server caches. For writes, PPFS aggregates multiple small requests, creating large, contiguous data blocks that can be written efficiently.

In general, our experiments with large research codes have shown that investment in a malleable infrastructure is repaid with increased input/output performance. Tuning the file system policies to application needs, rather than forcing the application to use inappropriate and inefficient file access modes, is the key to performance. However, such tuning requires either that the application developer provide file system hints based on knowledge of access patterns or that adaptation be automated based on observed application behavior. Because our experience has shown that developers often do not understand, nor wish to understand, detailed input/output behavior, we believe automated adaptation is most appropriate.
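To make the write-aggregation idea concrete, the sketch below coalesces small sequential writes into one large contiguous buffer before issuing a single file system request. It illustrates the general technique only; it is not the PPFS implementation, and the buffer size and file name are placeholders.

    /* Illustration of write aggregation: many small, contiguous application
     * writes are coalesced into one large buffer and issued as a single file
     * system request.  A sketch of the general technique, not PPFS itself. */
    #include <stdio.h>
    #include <string.h>

    #define AGG_BYTES (256 * 1024)    /* flush once this much data has accumulated */

    struct write_aggregator {
        FILE  *file;
        char   buf[AGG_BYTES];
        size_t used;
    };

    static void agg_flush(struct write_aggregator *a)
    {
        if (a->used > 0) {
            fwrite(a->buf, 1, a->used, a->file);   /* one large request instead of many small ones */
            a->used = 0;
        }
    }

    static void agg_write(struct write_aggregator *a, const void *data, size_t len)
    {
        if (a->used + len > sizeof a->buf)
            agg_flush(a);
        memcpy(a->buf + a->used, data, len);       /* small writes land in memory */
        a->used += len;
    }

    int main(void)
    {
        struct write_aggregator a = { fopen("quad.dat", "wb"), {0}, 0 };
        if (!a.file) return 1;
        char record[512] = {0};
        for (int i = 0; i < 10000; i++)
            agg_write(&a, record, sizeof record);  /* 10,000 small writes ... */
        agg_flush(&a);                             /* ... become roughly 20 large ones */
        fclose(a.file);
        return 0;
    }

On the read path, caching and prefetching play the symmetric role for small requests, as described above.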

5 Performance Data Immersion

Given the sensitivity of performance to the interaction of application request patterns with parallel system resource management policies, it is rarely possible to predict which of the potentially hundreds of performance metrics will best illuminate this interaction. In principle, it is possible to capture all metrics of potential interest and analyze the data off-line; in practice, the large volume of performance data makes this approach impractical. Instead, interactive selection and real-time display of performance metrics are necessary.

Although animated, two-dimensional graphics displays can provide some insights, understanding the evolving relations among abstract, multivariate data remains a difficult task. Understanding the shape of complex, three-dimensional objects is greatly simplified by reliance on binocular stereo images and our ability to change the viewing perspective; head and body movements allow one to compare a series of images.

By exploiting the Pablo environment's support for real-time data extraction, we have developed an immersive virtual environment, called Avatar [20, 18], for real-time, three-dimensional display of dynamic performance data. Avatar operates with (a) a head-mounted display and tracker, (b) the CAVE [3] virtual reality theater, or (c) a workstation display with stereo glasses. At present, Avatar supports three domain-independent display metaphors that accept real-time data from the Pablo instrumentation library (a three-dimensional generalization of scatterplot matrices [18], a "time tunnel," and a geographic display [6]), with data sonification and sound spatialization.

5.1 Time Tunnels

The Avatar time tunnel display metaphor captures the time-evolutionary behavior of a parallel code via a display consisting of a cylinder whose major axis is time. Along the cylinder periphery, each line is composed of segments, where the color and length of each segment indicate the type and duration of each behavior type in a parallel program. Cross-processor interactions (e.g., via message passing) are represented by chords that cut through the interior of the cylinder.

[Figure 4: Electron Scattering I/O Dynamics (Time Tunnel Views).]

Figure 4 shows end and side snapshots of the time tunnel for the input/output behavior of the electron scattering code described in §3; in the figure, time increases as one moves forward from the viewer. File open operations are shown near the periphery, with subsequent file read operations shown near the center. The high cost of file opens is manifest as line segments of long duration. Likewise, contention for file access appears as segments that begin at the same time on each processor but complete at differing times.

5.2 Scattercubes

Although the time tunnel metaphor allows one to study processor interactions and the durations of application behaviors, it cannot show correlations among system performance metrics (e.g., paging or network packet rates). Scattercubes, a generalization of simple scatterplot matrices, allow one to study such metric correlations.

In a two-dimensional scatterplot matrix, all possible pairs of dimensions for a set of N-dimensional data are plotted against each other in separate scatterplots arranged in an N by N matrix. This shows all possible two-dimensional data projections and can be used to determine data dispersion and bivariate correlations. Our three-dimensional generalization of scatterplot matrices, which we call a scattercube, contains N³ three-dimensional scatterplots, allowing users to walk around and inside the data.

In each scatterplot cube, the coordinate axes correspond to three of the N performance metrics, and the time-varying position of each processor in the scatterplot cube is determined by the current values of the associated performance metrics. Geometrically, the behaviors of the p processors define a set of p curves in an N-dimensional performance metric space, with each scatterplot cube showing a different three-dimensional projection of this trajectory. To help analyze data point trajectories, it is possible to display them using history ribbons (i.e., markers of data paths); interactively enabling history ribbons for a subset of the data points allows one to see whether the selected points cluster in one or more scattercubes.

[Figure 5: System Metrics (Scattercube View).]

Figures 5 and 7 show the interiors of scattercubes for two different sets of performance metrics. In both figures, the origin of the metric space is located at the rear of the cube, with the location of each processor denoted by an octahedron. History lines show the movement of the processors in the metric space, with evident clustering.

5.3 Immersion Experiences

We have successfully used Avatar to study the correlations of large numbers of dynamic performance metrics from parallel applications, with emphasis on input/output dynamics, and to display real-time data on the behavior of World Wide Web (WWW) servers [20, 6]. In both cases, the ability to walk and fly through the data, to examine it from multiple perspectives, and to interactively change real-time display attributes has proven invaluable: we have gained insights into behavioral and performance metric interactions that were not possible with two-dimensional graphics displays.
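As a small illustration of the scattercube geometry only (Avatar's rendering code is not shown in the paper), the sketch below maps each processor's current metric values to its position in the cube whose axes are metrics (i, j, k), normalized to the unit cube. The metric names and values are invented.

    /* Illustrative scattercube geometry: for axes (i, j, k), a processor's
     * position is its current values of those three metrics, normalized to the
     * unit cube.  This only illustrates the mapping, not Avatar's internals. */
    #include <stdio.h>

    #define N_METRICS 4   /* e.g., I/O time, paging rate, packet rate, CPU utilization */
    #define N_PROCS   3

    struct point3 { double x, y, z; };

    static struct point3 cube_position(const double metrics[][N_METRICS],
                                       const double metric_max[N_METRICS],
                                       int p, int i, int j, int k)
    {
        struct point3 pos = {
            metrics[p][i] / metric_max[i],
            metrics[p][j] / metric_max[j],
            metrics[p][k] / metric_max[k]
        };
        return pos;
    }

    int main(void)
    {
        double metrics[N_PROCS][N_METRICS] = {   /* current metric values, one row per processor */
            { 12.0, 0.3, 80.0, 0.9 },
            {  3.5, 0.1, 20.0, 0.4 },
            { 14.2, 0.5, 95.0, 0.2 }
        };
        double metric_max[N_METRICS] = { 20.0, 1.0, 100.0, 1.0 };

        /* One of the N*N*N cubes: axes are metrics 0, 2, and 3. */
        for (int p = 0; p < N_PROCS; p++) {
            struct point3 q = cube_position(metrics, metric_max, p, 0, 2, 3);
            printf("processor %d -> (%.2f, %.2f, %.2f)\n", p, q.x, q.y, q.z);
        }
        return 0;
    }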

Finally, real-time modification of application behavior and resource management policies, described in §6, has proven especially useful.

6 Adaptive Control: The Future

As our analysis of input/output dynamics showed, optimizing application performance requires a judicious match of application request patterns and resource management policies. Based on our instrumentation of large research codes and subsequent discussions with the code developers, we have found that the developers are often either unaware of their application dynamics or surprised by the interaction of request patterns with system resource management policies.² Moreover, because the interactions between application and system software change across applications and during a single application's execution, we believe runtime libraries and resource management policies are needed that can automatically adapt to rapidly changing application behavior.

We believe the solution to the performance optimization conundrum is integration of dynamic performance instrumentation and on-the-fly performance data reduction with configurable, malleable resource management algorithms, and a real-time adaptive control mechanism that automatically chooses and configures resource management algorithms based on application request patterns and observed system performance. Though completely automated policy configuration is the ultimate goal, intermediate goals must necessarily include interactive optimization.

²This is not surprising. Their interests are in the science embodied in the application. Performance is an issue only when its lack precludes new scientific research.

[Figure 6: PPFS Interactive Control. Schematic showing PPFS clients and servers (each with prefetch/cache components and cache and write-back controls) coupled to Avatar through a control translator, with SDDF records and control packets carried over TCP, UDP, or native messages.]

6.1 Interactive Optimization

To develop better resource management policies, the policy developer must first understand the dynamics of policy and application interaction. This intuition is best gained by varying policies and policy parameters and studying the effects of these changes. By interactively modifying system parameters, one can test large numbers of policy alternatives during a single experiment and study just those performance metrics needed to understand the causes of variations in policy performance. In addition, interactive control and real-time performance data display make it possible for application developers to choose resource management policies and steer their computation's behavior based on their own domain-specific knowledge.

As a test of this approach to interactive performance optimization, we developed a prototype based on an extended version of the PPFS parallel file system, the Pablo performance instrumentation library, and the Avatar virtual environment; see [18] for details. As described in §4, we augmented PPFS with Pablo performance instrumentation to capture data on file system policies (e.g., client and server cache hit rates and request service durations). During application use of PPFS services, this performance data is extracted in real time and displayed in the Avatar virtual environment.

As the schematic of Figure 6 shows, Avatar and PPFS are coupled by an adaptive control mechanism that allows users to interactively change file system policies based on observed performance and to study the effects of those changes. Each PPFS client and server can accept requests to change either file system policies (e.g., cache management) or policy parameters (e.g., cache size).

[Figure 7: Interactive I/O Policy Selection.]

Figure 7 shows the Avatar interface for file system policy control. By interactively adjusting cache sizes and prefetch amounts, the user can match file system behavior to application resource demands. Because the effects of these changes are immediately visible in the virtual environment, the user can quickly explore a wide variety of file system alternatives.

6.2 Closed Loop Optimization

Automatic, dynamic adaptation of resource management policies to changing application resource demands requires many of the same software components as interactive optimization, though such a closed-loop system replaces the human with a set of algorithmic decision procedures. In particular, closed-loop optimization requires the following components (a sketch of how they combine appears after this list):

- distributed software sensors for real-time performance capture and synthesis of quantitative performance metrics,
- distributed actuators that can enable and configure resource management policies,
- automatic behavioral classification techniques to identify and group application resource request patterns,
- a set of local (per parallel task) and global (per parallel program) decision procedures for selecting resource management policies and enabling actuators based on observed application resource requests and the system responses captured by performance sensors, and
- performance analysis tools to assess the performance of current decision procedures, to conduct parametric performance studies using captured application behavioral data, and to identify improved decision procedures.
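A minimal sketch of how such a loop might be organized follows. The sensor, classifier, and actuator names and the thresholds are invented for illustration; this is not the authors' implementation, and the policy mapping simply restates the guidance of §3.3.

    /* Minimal sketch of a closed-loop controller: read sensor metrics, classify
     * the recent request pattern, and reconfigure the file system policy through
     * an actuator.  All names are invented; they are not the interfaces of
     * Pablo, PPFS, or Avatar. */
    #include <stdio.h>
    #include <unistd.h>

    enum access_class { ACC_SEQUENTIAL_SMALL, ACC_SEQUENTIAL_LARGE, ACC_IRREGULAR };

    struct sensor_report {         /* synthesized from performance sensor records */
        double mean_request_bytes;
        double sequential_fraction;
        double cache_hit_rate;
    };

    static struct sensor_report read_sensors(void)
    {
        /* Placeholder: a real system would reduce SDDF-style records on the fly. */
        struct sensor_report r = { 2048.0, 0.95, 0.40 };
        return r;
    }

    static enum access_class classify(const struct sensor_report *r)
    {
        if (r->sequential_fraction > 0.8)
            return r->mean_request_bytes < 64 * 1024 ? ACC_SEQUENTIAL_SMALL
                                                     : ACC_SEQUENTIAL_LARGE;
        return ACC_IRREGULAR;
    }

    static void configure_policy(enum access_class c)
    {
        /* Placeholder actuator: print instead of sending a control packet. */
        switch (c) {
        case ACC_SEQUENTIAL_SMALL: puts("enable caching, prefetching, and write-behind"); break;
        case ACC_SEQUENTIAL_LARGE: puts("stream directly between devices and buffers");   break;
        case ACC_IRREGULAR:        puts("fall back to a conservative default policy");    break;
        }
    }

    int main(void)
    {
        enum access_class current = ACC_IRREGULAR;
        for (int step = 0; step < 3; step++) {      /* a real controller runs for the job's lifetime */
            struct sensor_report r = read_sensors();
            enum access_class c = classify(&r);
            if (c != current) {                     /* act only on a change, to keep the loop stable */
                configure_policy(c);
                current = c;
            }
            sleep(1);                               /* sampling interval */
        }
        return 0;
    }

Acting only when the classification changes, and sampling at a bounded rate, is one simple way to address the stability concern raised in §7.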

Realizing such an adaptive control infrastructure requires solving a plethora of open research problems in control theory and automatic classification. For instance, automatic, qualitative classification of application request patterns is needed to choose policy families. With qualitative knowledge that file read requests are roughly sequential, strided, and of modest size, the decision procedures should choose a cache-based prefetch policy. Given that policy selection, performance sensors placed in the file system read routines would provide the data needed to choose cache sizes and prefetch distances.

As one step toward automatic policy selection, we have begun exploring general techniques for qualitative classification of file access patterns [8]. We extended the PPFS software infrastructure to include a trained feed-forward artificial neural network that accepts file request attributes (e.g., read or write, file offset, and request size). The result is a qualitative classification along three axes: sequentiality, read/write mix, and request size. The qualitative classification is then used to select PPFS file system policies and policy parameters.

The preliminary results of our automatic classification and policy selection study [8] are extremely encouraging. For a suite of input/output intensive applications with complex, time-varying access patterns, we have achieved substantial speedups over what is possible with standard Unix file systems. Because the selection of file system policies is automatic, based on application access characteristics and known file system performance, users need not understand either the detailed behavior of their application or the response of the system to the offered load. This lessens the cognitive burden of performance optimization and increases the portability of the application code.
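The sketch below shows only the shape of such a classifier: a tiny feed-forward network mapping per-request features to scores on the three qualitative axes. The topology, weights, and feature scaling are untrained placeholders; the trained network of [8] is not reproduced here.

    /* Shape of a feed-forward access-pattern classifier: request features in,
     * qualitative scores for sequentiality, read/write mix, and request size out.
     * Topology and weights are untrained placeholders, not the network of [8];
     * they only illustrate the forward pass. */
    #include <math.h>
    #include <stdio.h>

    #define N_IN     3      /* is_write, normalized offset delta, normalized request size */
    #define N_HIDDEN 4
    #define N_OUT    3      /* sequentiality, read/write mix, size class */

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

    static void classify(const double in[N_IN], double out[N_OUT],
                         double w1[N_HIDDEN][N_IN], const double b1[N_HIDDEN],
                         double w2[N_OUT][N_HIDDEN], const double b2[N_OUT])
    {
        double hidden[N_HIDDEN];
        for (int h = 0; h < N_HIDDEN; h++) {
            double s = b1[h];
            for (int i = 0; i < N_IN; i++) s += w1[h][i] * in[i];
            hidden[h] = sigmoid(s);
        }
        for (int o = 0; o < N_OUT; o++) {
            double s = b2[o];
            for (int h = 0; h < N_HIDDEN; h++) s += w2[o][h] * hidden[h];
            out[o] = sigmoid(s);      /* near 1.0 means sequential / write-heavy / large */
        }
    }

    int main(void)
    {
        /* Placeholder weights; a deployed classifier would load trained values. */
        double w1[N_HIDDEN][N_IN] = { {0.5,-0.3,0.1}, {-0.2,0.8,0.4}, {0.3,0.3,-0.6}, {0.1,-0.1,0.9} };
        double b1[N_HIDDEN] = { 0.0, 0.1, -0.1, 0.0 };
        double w2[N_OUT][N_HIDDEN] = { {0.7,-0.4,0.2,0.1}, {-0.3,0.5,0.6,-0.2}, {0.2,0.1,-0.5,0.8} };
        double b2[N_OUT] = { 0.0, 0.0, 0.0 };

        double request[N_IN] = { 0.0, 0.02, 0.01 };   /* a small, nearly sequential read */
        double axes[N_OUT];
        classify(request, axes, w1, b1, w2, b2);
        printf("sequentiality %.2f, read/write mix %.2f, size %.2f\n", axes[0], axes[1], axes[2]);
        return 0;
    }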

7 Related Work

There is a rich history of control theory, and many adaptive real-time systems have been built. The control problems posed by massively parallel systems are both similar to and different from those posed by adaptive control of industrial processes. However, the problem of hysteresis is common to both: the feedback loop for performance monitoring and control parameter adjustment must be stable.

Several systems have been built that support application behavior steering (i.e., guiding a computation toward interesting phenomena), though there have been fewer efforts to interactively steer or adaptively control application performance. LeBlanc [7] has developed an adaptive real-time system for robotic control that consists of a multiprocessor executing a group of adaptive cognitive tasks. More recently, Schwan [4] has developed an adaptive control mechanism for a molecular dynamics code and an atmospheric modeling code on the KSR-1 that allows users to steer application load distribution and to automatically adjust thread locking policies based on expected synchronization delay. Performance data from sensors can activate actuators inserted in the code by the developer. These actuators can change program behavior based on current conditions and measured performance.

8 Conclusions

Experience has shown that a large and important class of national challenge applications is irregular, with complex, data dependent execution behavior, and dynamic, with time varying resource demands. Because parallel system performance is extremely sensitive to the match of application resource demands and resource management policies, resource management policies must adapt to application-specific resource demands. For parallel file systems, we have shown that such adaptation can dramatically increase achieved application input/output performance.

With flexibility comes the need for mechanisms to specify resource policy choices and configuration parameters. Based on our instrumentation of large research codes and subsequent discussions with the code developers, we have found that the developers are often either unaware of their application dynamics or surprised by the interaction of request patterns with system resource management policies. Thus, one needs either an interactive scheme for exploring resource management policy alternatives and their performance or an automatic scheme that can identify application resource demands and select appropriate resource management policies.

We believe the ultimate solution to the performance optimization conundrum is integration of dynamic performance instrumentation and on-the-fly performance data reduction with configurable, malleable resource management algorithms, and a real-time adaptive control mechanism that automatically chooses and configures resource management algorithms based on application request patterns and observed system performance. Such a system would be complemented by an interactive virtual environment for real-time monitoring and performance tuning.

Acknowledgments

Phyl Crandall contributed to the early evaluation of input/output behavior, and Andrew Chien has been an insightful collaborator. Ilkyr Hamzaoglu, Jay Huber, and David Blumenthal helped develop and extend the portable parallel file system, showing that major input/output performance improvements were possible with file system tuning. Finally, Steve Lamm, Luis Tavera, and Keith Shields contributed greatly to the development of Avatar, our virtual environment for performance data immersion and control.

References

[1] Adve, V. S., Mellor-Crummey, J., Anderson, M., Kennedy, K., Wang, J., and Reed, D. A. An Integrated Compilation and Performance Analysis Environment for Data Parallel Programs. In Proceedings of Supercomputing '95 (Dec. 1995).

[2] Crandall, P. E., Aydt, R. A., Chien, A. A., and Reed, D. A. Characterization of a Suite of Input/Output Intensive Applications. In Proceedings of Supercomputing '95 (Dec. 1995).

[3] Cruz-Neira, C., Sandin, D. J., and DeFanti, T. Surround-Screen Projection-Based Virtual Reality: The Design and Implementation of the CAVE. In SIGGRAPH '93 Proceedings (Aug. 1993), Association for Computing Machinery.

[4] Eisenhauer, G., Gu, W., Schwan, K., and Mallavarupu, N. Falcon: Toward Interactive Parallel Programs, the Online Steering of a Molecular Dynamics Program. In Proceedings of the Third International Symposium on High-Performance Distributed Computing (Aug. 1994).

[5] Huber, J. V., Elford, C. L., Reed, D. A., Chien, A. A., and Blumenthal, D. S. PPFS: A High-Performance Portable Parallel File System. In Proceedings of the 9th ACM International Conference on Supercomputing (July 1995), pp. 385–394.

[6] Lamm, S. E., Scullin, W. H., and Reed, D. A. Real-time Geographic Visualization of World Wide Web Traffic. Submitted for publication (1996).

[7] LeBlanc, T. J., and Markatos, E. P. Operating System Support for Adaptive Real-time Systems. In Proceedings of the Seventh IEEE Workshop on Real-Time Operating Systems and Software (May 1990), pp. 1–10.

[8] Madhyastha, T., and Reed, D. A. Intelligent, Adaptive File System Policy Selection. Submitted for publication (1995).

[9] Malony, A. D., and Reed, D. A. Visualizing Parallel Computer System Performance. In Instrumentation for Future Parallel Computing Systems, M. Simmons, R. Koskela, and I. Bucher, Eds. Addison-Wesley Publishing Company, 1989, pp. 59–90.

[10] Malony, A. D., and Reed, D. A. A Hardware-Based Performance Monitor for the Intel iPSC/2 Hypercube. In 1990 ACM International Conference on Supercomputing (June 1990), Association for Computing Machinery, pp. 213–216.

[11] Malony, A. D., Reed, D. A., Arendt, J. W., Aydt, R. A., Grabas, D., and Totty, B. K. An Integrated Performance Data Collection, Analysis, and Visualization System. In Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications (Monterey, CA, Mar. 1989), Association for Computing Machinery, pp. 229–236.

[12] Malony, A. D., Reed, D. A., and Rudolph, D. C. Integrating Performance Data Collection, Analysis, and Visualization. In Parallel Computer Systems: Performance Instrumentation and Visualization, M. Simmons, R. Koskela, and I. Bucher, Eds. Addison-Wesley Publishing Company, 1990, pp. 73–97.

[13] Poole, J. T. Scalable I/O Initiative. California Institute of Technology, available at http://www.ccsf.caltech.edu/SIO/, 1995.

[14] Reed, D. A. Performance Instrumentation Techniques for Parallel Systems. In Models and Techniques for Performance Evaluation of Computer and Communications Systems, L. Donatiello and R. Nelson, Eds. Springer-Verlag Lecture Notes in Computer Science, 1993, pp. 463–490.

[15] Reed, D. A. Experimental Performance Analysis of Parallel Systems: Techniques and Open Problems. In Proceedings of the 7th International Conference on Modelling Techniques and Tools for Computer Performance Evaluation (May 1994), pp. 25–51.

[16] Reed, D. A., Aydt, R. A., Noe, R. J., Roth, P. C., Shields, K. A., Schwartz, B. W., and Tavera, L. F. Scalable Performance Analysis: The Pablo Performance Analysis Environment. In Proceedings of the Scalable Parallel Libraries Conference, A. Skjellum, Ed. IEEE Computer Society, 1993, pp. 104–113.

[17] Reed, D. A., and Rudolph, D. C. Experiences with Hypercube Operating System Instrumentation. International Journal of High-Speed Computing (Dec. 1989), 517–542.

[18] Reed, D. A., Shields, K. A., Tavera, L. F., Scullin, W. H., and Elford, C. L. Virtual Reality and Parallel Systems Performance Analysis. IEEE Computer (Nov. 1995), 57–67.

[19] Ries, B., Anderson, R., Auld, W., Breazeal, D., Callaghan, K., Richards, E., and Smith, W. The Paragon Performance Monitoring Environment. In Proceedings of Supercomputing '93 (Nov. 1993), pp. 850–859.

[20] Scullin, W. H., Kwan, T. T., and Reed, D. A. Real-time Visualization of NCSA's World Wide Web Data. In Symposium on Visualizing Time-Varying Data (Sept. 1995).

[21] Winstead, C., and McKoy, V. Studies of Electron-Molecule Collisions on Massively Parallel Computers. In Modern Electronic Structure Theory, D. R. Yarkony, Ed., vol. 2. World Scientific, 1994.