A Design Pattern Language for Engineering (Parallel) Software: Merging the PLPP and OPL Projects

Kurt Keutzer
Dept. of Electrical Engineering and Computer Sciences, UC Berkeley
Berkeley, CA
keutzer@eecs.berkeley.edu

Berna L. Massingill
Dept. of Computer Science, Trinity University
San Antonio, TX
berna.massingill@trinity.edu

Timothy G. Mattson
Intel Corp.
DuPont, WA
timothy.g.mattson@intel.com

Beverly A. Sanders
Dept. of Computer and Information Sciences, University of Florida
Gainesville, FL
sanders@cise.ufl.edu

ABSTRACT
Parallel programming is stuck. To make progress, we need to step back and understand the software people wish to engineer. We do this with a design pattern language. This paper provides background for a lively discussion of this pattern language. We present the context for the problem, the layers in the design pattern language, and descriptions of the patterns themselves.

Categories and Subject Descriptors
D.3.3 [Concurrent Programming]: Parallel programming.

General Terms
Algorithms, Design.

Keywords
Parallel programming, design patterns, pattern language.

1. THE SEMINAL INTELLECTUAL CHALLENGE OF COMPUTER SCIENCE

The key to parallel programming is software architecture. And the key to software architecture is design patterns organized into a pattern language. In particular, echoing [3]:

We believe software architectures can be built up from a manageable number of design patterns. These patterns define the building blocks of all software engineering and are fundamental to the practice of architecting parallel software. Hence, an effort to propose, argue about, and finally agree on what constitutes this set of patterns is the seminal intellectual challenge of our field.

If we only “argue” among a select few, the final result will fall short of our lofty goal. To “define the building blocks of all (parallel) software engineering” we need to engage a wide community of computer scientists in a debate to arrive at a meaningful consensus.

This paper presents a snapshot of a “grand consensus pattern language”. This was produced by a painstaking union of two design pattern language projects, OPL [3] and PLPP [4]. The paper is stripped of motivation, background, related work, or other expository elements expected from a formal paper. These would just get in the way of the task before us now … to stimulate a lively debate and make progress on the “seminal intellectual challenge of our field”.

This debate comes down to three interlocking questions.

1. Have we included the requisite set of patterns in OPL?

2. Are the layers in our pattern language correct? Have we defined them appropriately, and do the names of each layer convey that meaning?

3. Is the description of each pattern correct?

This paper is broken down into three parts. The first defines the context for this discussion: What is the problem we are solving, and what are the key abstractions we’ll use in the discussion? We then launch directly into the design pattern language itself with a definition of the basic structure of the language. Finally, we close with a description of the patterns. We urge you to read the Context and Structure sections carefully. Then pick a problem you know well. Decompose it into a composition of patterns using the patterns from our pattern language. Then check the specific patterns used in the decomposition to see if you agree with them.

… and with that out of the way … let the great debate begin.

2. CONTEXT
Words are the medium of exchange in any discussion. If we can’t agree on what our words mean, then this debate will go nowhere. Hence, whether you completely agree with us or not, for the sake of this discussion, let’s agree to use the following definitions.


• Program: A sequence of instructions represented in human-readable text (source code) designed to execute on a computer. One or more programs working together constitute a body of software.

• Application: Software that solves a problem when it executes on a computer.

• Software architecture: The high-level organization of a program described in terms of a hierarchical composition of design patterns.

• Implementation: The translation of a high-level software architecture into a program.

• Computation: The particular sequence of instructions that execute when a program runs.

• Task: A logically related sequence of instructions during a computation. A computation consists of one or more tasks.

• Unit of execution (UE): An entity managed by the computer system that executes instructions. Examples include a Linux process, a kernel thread, a user thread, a CUDA thread, an OpenCL work-item, etc.

• Concurrency: In general, the state of a computation in which more than one task is active and able to make progress at one time. We use the term specifically to describe a quality of a software architecture that gives tasks the potential to be executed simultaneously.

• Parallelism: A concurrent computation for which multiple active tasks actually make progress at the same time. In particular, this term refers to the quality of an implementation to map onto units of execution that execute simultaneously.

• Parallel algorithm: An algorithm that exploits parallelism to solve a problem of a given size in less time. An example is a parallel algorithm for image segmentation.

• Coordination: The collection of operations that together manage the creation, destruction, communication, and synchronization of multiple UEs.

Using these definitions, we can precisely state the problem we are trying to solve:

We want to describe the architecture of application software that will execute on systems that support parallel computation. This software must be (1) developed in a productive fashion, (2) correct and (3) efficient. This fundamentally reduces to exposing concurrency that can be exploited within a parallel algorithm as collections of tasks whose execution is coordinated so the problem is solved correctly and efficiently.

We use a hierarchical composition of design patterns as a mechanism to write down the solution to this design problem.

3. OVERALL STRUCTURE OF OUR PATTERN LANGUAGE
Work on this pattern language began in the late 1990s. The result was a design pattern language for parallelism that we now call PLPP [4]. More recently, a research community centered at UC Berkeley expanded the scope of the initiative considerably to consider the larger problem of the architecture of well-engineered software [3]. The design pattern language from this effort is called OPL, the patterns for which are available online [5]. These two projects have now been merged into a single pattern language. This merged pattern language, which we will continue to call OPL, is presented in Figure 1.

Figure 1. The structure of OPL and the design pattern categories.

The layers in the pattern language correspond to the major phases of mapping a problem onto a parallel computation. We have five major categories of patterns. The top two categories, structural patterns and computational patterns, sit at the same level of the hierarchy and cooperate to create one layer of the software architecture, known as the high-level architecture.

1. Structural patterns: Describe the overall organization of the application and the way the computational elements that make up the application interact. These patterns are closely related to the architectural styles discussed in [6]. Informally, these patterns correspond to the “boxes and arrows” an architect draws to describe the overall organization of an application.

2. Computational patterns: These patterns describe the classes of computations that make up the application. They are essentially the thirteen motifs made famous in [1], but described more precisely as patterns rather than simply computational families. These patterns can be viewed as defining the “computations occurring in the boxes” defined by the structural patterns. Note that some of these patterns define complicated design problems in their own right and serve as entry points into smaller design pattern languages focused on a specific class of computations. This is yet another example of the hierarchical nature of the software design problem.

In OPL, the top two categories, the structural and computational patterns, are tightly coupled. A software architect thinks about his or her problem, chooses a structural pattern, and then considers the computational patterns required to solve the problem. The selection of computational patterns may suggest a different overall structure for the architecture and force a reconsideration of the appropriate structural patterns. Alternatively, an architect may immediately identify the key computational pattern, and then identify the structural patterns that are necessary to support this computation. This process, moving between structural and computational patterns, continues until the designer settles on a high-level design for the problem. This is the architecture of the solution, and it is expressed as a composition of structural and computational patterns.

Our structural and computational patterns may be used to define the software architecture of both serial and parallel programs. Ideally, the designer working at this high level will not need to focus on parallel computing issues even for a parallel program. However, for the remaining layers of Our Pattern Language, parallel programming is a primary concern. We divide the remaining patterns in Our Pattern Language, the parallel design patterns, into the following three layers.

1. Parallel algorithm strategies: These patterns define high-level strategies to exploit concurrency within a computation for execution on a parallel computer. They address the different ways concurrency is naturally expressed within a problem, providing well known techniques to exploit that concurrency in parallel execution.

2. Implementation strategies: These are the structures that are realized in source code to support (1) how the program itself is organized and (2) common data structures specific to parallel programming.

3. Parallel execution patterns: These are the approaches often embodied in a runtime system that supports the execution of a parallel program.

Patterns in these three lower layers are naturally coupled. For example, a problem using the divide and conquer algorithm strategy is likely to utilize a fork-join implementation strategy, which is commonly supported at the execution level with a thread pool. These connections between patterns are a key point in the text of the patterns.

Sitting beneath the patterns are the parallel foundation constructs. These are the common “hooks” used within the text of the patterns to connect algorithms to constructs in a parallel programming environment. This layer will be of interest to parallel language designers interested in the fundamental constructs required from programming languages that support software architected with OPL. We will not discuss these further in this paper, however, so we can remain focused on identifying key patterns and understanding how they fit together.

In summary, we can think of the layers in OPL as different stages in a process to map a problem onto a computation that executes on a parallel system:

• Structural and computational patterns: What software architecture best describes the high-level structure and computations of the application?

• Parallel algorithm strategies: How does the software architecture map onto parallel algorithms?

• Implementation strategies: How do algorithms map onto source code?

• Parallel execution patterns: How is the source code realized as an executing program?

4. DESIGN PATTERN DESCRIPTIONS
In this section we describe the patterns within each layer of the pattern language. The full text of the patterns is available either online [5] or, for the lower layers of the pattern language, in [4]. Note that this content is unchanged from [3] for the structural and computational patterns. Readers familiar with those patterns may want to skip directly to the lower three layers of the pattern language.

4.1 Structural Patterns

• Pipe and Filter: These problems are characterized by data flowing through modular phases of computation. The solution constructs the program as filters (computational elements) connected by pipes (data communication channels). Alternatively, they can be viewed as a graph with computations as vertices and communication along edges. Data flows through the succession of stateless filters, each taking input only from its input pipe(s), transforming that data, and passing the output to the next filter via its output pipe.

• Agent and Repository: These problems are naturally organized as a collection of data elements that are modified at irregular times by a flexible set of distinct operations. The solution is to structure the computation in terms of a single centrally-managed data repository, a collection of autonomous agents that operate upon the data, and a manager that schedules the agents’ access to the repository and enforces consistency.

• Process Control: Many problems are naturally modeled as a process that either must be continuously controlled or must be monitored until completion. The solution is to define the program analogously to a physical process-control pipeline: Sensors sense the current state of the process to be controlled; controllers determine which actuators are to be affected; and actuators actuate the process. This process control may be continuous and unending (e.g., a heater and thermostat), or it may have some specific termination point (e.g., production on an assembly line).

• Event-Based/Implicit Invocation: Some problems are modeled as a collection of processes or tasks that respond to events in a medium by issuing their own events into that medium. The structure of these processes is highly flexible and dynamic, as processes may know nothing about the origin of the events, their orientation in the medium, or the identity of processes that receive events they issue. The solution is to represent the program as a collection of agents that execute asynchronously: listening for events in the medium, responding to events, and issuing events for other agents into the same medium. The architecture enforces a high-level abstraction so invocation of an event for an agent is implicit; i.e., not hardwired to a specific controlling agent.

• Model-View-Controller: Some problems are naturally described in terms of an internal data model, a variety of ways of viewing the data in the model, and a series of user controls that either change the state of the data in the model or select different views of the model. While conceptually simple, such systems become complicated if users can directly change the formatting of the data in the model or view-renderers come to rely on particular formatting of data in the model. The solution is to segregate the software into three modular components: a central data model that contains the persistent state of the program; a controller that manages updates of the state; and one or more agents that export views of the model. In this solution the user cannot modify either the data model or the view except through public interfaces of the model and view respectively. Similarly the view renderer can only access data through a public interface and cannot rely on internals of the data model.

• Iterative Refinement: Some problems may be viewed as the application of a set of operations over and over to a system until a predefined goal is realized or a constraint is met. The number of applications of the operation in question may not be predefined, and the number of iterations may not be able to be statically determined. The solution to these problems is to wrap a flexible iterative framework around the operation that operates as follows: An iteration of the computation is performed; the results are checked against a termination condition; and depending on the results of the check, the computation completes or proceeds to the next iteration.
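
A sketch of the iterative framework, with the operation, state, and termination test left abstract, might look like the following:

    /* Iterative refinement sketch: apply an operation repeatedly until a
       termination condition is met. The two functions are placeholders. */
    extern double apply_operation(double state);        /* one refinement step */
    extern int converged(double state, double tol);     /* termination check   */

    double iterate(double state, double tol, int max_iters) {
        for (int iter = 0; iter < max_iters; iter++) {
            state = apply_operation(state);              /* perform an iteration   */
            if (converged(state, tol))                   /* check the results      */
                break;                                   /* goal or constraint met */
        }
        return state;
    }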

• Map Reduce: For an important class of problems, the same function may be applied to many independent data sets, with the final result being some sort of summary or aggregation of the results of those applications. While there are a variety of ways to structure such computations, the problem is to find the one that best exploits the computational efficiency latent in this structure. The solution is to define a program structured as two distinct phases. In phase one a single function is mapped onto independent sets of data. In phase two the results of mapping that function onto the sets of data are reduced. The reduction may be a summary computation or merely a data reduction.
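
A serial sketch of the two phases, with the mapped function f left abstract and a sum standing in for the reduction, might look like this:

    /* Map Reduce sketch: phase one maps a function over independent data
       sets; phase two reduces the per-set results (here, a sum). */
    extern double f(const double *data, int n);     /* function mapped in phase one */

    double map_reduce(const double **sets, const int *sizes, int nsets) {
        double partial[nsets];                      /* one result per data set (C99 VLA) */
        for (int s = 0; s < nsets; s++)             /* phase one: map                    */
            partial[s] = f(sets[s], sizes[s]);
        double result = 0.0;
        for (int s = 0; s < nsets; s++)             /* phase two: reduce (here, a sum)   */
            result += partial[s];
        return result;
    }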

• Layered Systems: Sophisticated software systems naturally evolve over time by building more complex operations on top of simple ones. The problem is that if each successive layer comes to rely on the implementation details of each lower layer, then such systems soon become ossified, as they are unable to easily evolve. The solution is to structure the program as multiple layers in a way that enforces a separation of concerns. This separation should ensure that: (1) only adjacent layers interact and (2) interacting layers are only concerned with the interfaces presented by other layers. Such a system is able to evolve much more freely.

• Puppeteer: Some problems require a collection of agents to interact in potentially complex and dynamic ways. While the agents are likely to exchange some data, and some reformatting is required, the interactions primarily involve the coordination of the agents and not the creation of persistent shared data. The solution is to introduce a manager to coordinate the interaction of the agents, i.e., a puppeteer, to centralize the control over a set of agents and to manage the interfaces between the agents.

• Arbitrary Static Task Graph: Sometimes it’s simply not clear how to use any of the other structural patterns in OPL, but still the software system must be architected. In this case, the last resort is to decompose the system into independent tasks whose pattern of interaction is an arbitrary graph. Since this must be expressed as a fixed software structure, the structure of the graph is static and does not change once the computation is established.

4.2 Computational Patterns

• Backtrack, Branch and Bound: Many problems are naturally expressed as either a search over a space of variables to find an assignment of values to the variables that resolves a Yes/No question (a decision procedure) or a search for an assignment of values to the variables that gives a maximal or minimal value to a cost function over the variables, respecting some set of constraints. The challenge is to organize the search such that solutions to the problem, if they exist, are found, and the search is performed as computationally efficiently as possible. The solution strategy for these problems is to impose an organization on the space to be searched that allows subspaces that do not contain solutions to be pruned as early as possible.

• Circuits: Some problems are best described as Boolean operations on individual Boolean values or vectors (bit-vectors) of Boolean values. The most direct solution is to represent the computation as a combinational circuit and, if persistent state is required in the computation, to describe the computation as a sequential circuit: that is, a mixture of combinational circuits and memory elements (such as flip-flops).

• Dynamic Programming: Some search problems have the additional characteristic that the solution to a problem of size N can always be assembled out of solutions to problems of size ≤ N-1. The solution in this case is to exploit this property to efficiently explore the search space by finding solutions incrementally and not looking for solutions to larger problems until the solutions to relevant subproblems are found.

• Dense Linear Algebra: A large class of problems can be expressed in terms of linear operations applied to matrices and vectors for which most elements are nonzero. Computations are organized as a sequence of arithmetic expressions acting on dense arrays of data. The operations and data-access patterns are well defined mathematically, so data can be pre-fetched and processors can execute close to their theoretical peak performance. Applications of this pattern often make heavy use of standard library routines called the Basic Linear Algebra Subprograms (BLAS). Since the BLAS are available for most processors as highly tuned libraries, and most linear algebra problems spend the bulk of their time carrying out operations defined in the BLAS, an application can achieve highly tuned performance “for free” just by calling the BLAS.
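
For example, a dense matrix multiply C = alpha*A*B + beta*C is typically expressed as a single call to the level-3 BLAS routine dgemm. The sketch below uses the C interface and assumes square, row-major matrices of order n; the wrapper function name is illustrative:

    /* Dense linear algebra sketch: delegate the matrix multiply to a tuned
       BLAS library rather than writing the triple loop by hand. */
    #include <cblas.h>

    void matmul(int n, const double *A, const double *B, double *C) {
        /* C = 1.0 * A * B + 0.0 * C, all matrices n-by-n, row-major */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    }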

• Sparse Linear Algebra: This includes a large class of problems expressed in terms of linear operations over sparse matrices (i.e., matrices for which it is advantageous to explicitly take into account the fact that many elements are zero). Solutions are diverse and include a wide range of direct and iterative methods.

• Finite State Machine: Some problems have the character that a machine needs to be constructed to control or arbitrate a piece of real or virtual machinery. Other problems have the character that an input string needs to be scanned for syntactic correctness. Both problems can be solved by creating a finite state machine that monitors the sequence of input for correctness and may optionally produce intermediate output.

• Graph Algorithms: A broad range of problems are naturally represented as actions on graphs of vertices and edges. Solutions to this class of problems involve building the representation of the problem as a graph and applying an appropriate graph traversal or partitioning algorithm that results in the desired computation.

• Graphical Models: Many problems are naturally represented as graphs of random variables, where the edges represent correlations between variables. Typical problems include inferring probability distributions over a set of hidden states, given observations on a set of observed states, or estimating the most likely state of a set of hidden states, given observations. To address this broad class of problems there is an equally broad set of solutions known as graphical models.

• Monte Carlo: Monte Carlo approaches use random sampling to understand properties of large sets of points. Sampling the set of points produces a useful approximation to the correct result.
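
A classic illustrative instance is estimating pi by sampling random points in the unit square and counting the fraction that falls inside the quarter circle:

    /* Monte Carlo sketch: estimate pi from the fraction of random points
       in the unit square that fall inside the quarter circle. */
    #include <stdlib.h>

    double estimate_pi(long samples) {
        long hits = 0;
        for (long i = 0; i < samples; i++) {
            double x = (double)rand() / RAND_MAX;    /* sample a point in [0,1] x [0,1] */
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)                /* inside the quarter circle?      */
                hits++;
        }
        return 4.0 * (double)hits / (double)samples; /* area ratio approximates pi/4    */
    }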

• N-Body: These are problems in which the properties of each member of a system depend on the state of every other member of the system. For modest-sized systems, computing each interaction explicitly for every point is feasible (a naïve O(N²) solution). In most cases, however, the arrangement of the members of the system in space is used to define an approximation scheme that produces an approximate solution with lower complexity than the naïve solution.

• Spectral Methods: These problems involve systems that are defined in terms of more than one representation. For example, a periodic sequence in time can be represented as a set of discrete points in time or as a linear combination of frequency components. This pattern addresses problems where changing the representation of a system can convert a difficult problem into a straightforward algebraic problem. The solutions depend on an efficient mechanism to carry out the transformation, such as a fast Fourier transform.

• Structured Mesh: These problems represent a system in terms of a discrete sampling of points over a domain that is naturally defined by a mesh. For a structured mesh, the points are tied to the geometry of the domain by a regular process. Solutions to these problems are computed for each point based on computations over neighborhoods of points (explicit methods) or as solutions to linear systems of equations (implicit methods).

• Unstructured Mesh: Some problems that are based on meshes utilize meshes that are not tightly coupled to the geometry of the underlying problems. In other words, these meshes are irregular relative to the problem geometry. The solutions are similar to those for the structured mesh (i.e., explicit or implicit) but in the sparse case, the computations require gather and scatter operations over sparse data.

4.3 Parallel Algorithm Strategy Patterns

• Task Parallelism: Suppose that the chosen structural and computational patterns decompose the computation into a collection of tasks. This pattern addresses the problem of how to schedule the tasks for execution in a way that keeps the work balanced between the processing elements of the parallel computer and manages any dependencies between tasks so that the correct answer is produced regardless of the details of how the tasks execute. The well-known embarrassingly parallel pattern (with no dependencies between the tasks) is a special case.

• Pipeline: Suppose that the chosen structural and computational patterns decompose the computation into a stream of data elements and a sequence of operations to perform on these elements. At first glance, there appears to be little opportunity for concurrency, since each operation on a particular data element depends on the previous operations on that element. If the computations on the different data elements are independent, however, parallelism can be introduced by setting up a series of fixed coarse-grained tasks (stages) with data flowing between them in an assembly-line-like manner. Initially, the computation processes only a single element; but as the first data element flows to the second stage, the first stage begins processing the second data element in parallel with the second stage processing the first, and so on. As the pipeline fills, the number of tasks executing in parallel grows up to the number of stages in the pipeline (the so-called depth of the pipeline). This pattern is frequently used with the Pipe-and-Filter and Process Control patterns.

• Discrete Event: Suppose that a computation has been structured as a loosely connected collection of tasks that interact at unpredictable points in time. The solution is to set up an event handler infrastructure and then launch a collection of tasks whose interaction is handled through the event handler. The handler is an intermediary between tasks, and in many cases the tasks do not need to know the source or destination for the events. Discrete Event is often used with problems, such as GUIs and discrete event simulations, that are handled with the Event-Based/Implicit Invocation, Model-View-Controller, or Process Control patterns.

• Speculation: Suppose that the computation has been decomposed into a number of tasks that are not completely independent, but where conflicts are expected to occur only infrequently when the computation is actually executed. An effective solution may be to just run the tasks independently, that is, speculate that no conflicts will occur, and then clean up after the fact and retry in the rare situations where a conflict does occur. Two essential elements of this solution are (1) an easily identifiable safety check to determine whether the computation ran without conflicts and can thus be committed and (2) the ability to roll back and re-compute the cases where conflicts occur.

• Data Parallelism: Some computations are defined in terms of a collection of data elements to which the same task is applied. In such cases, the concurrency is expressed in terms of the data, defining the collection of data elements and then applying the task to each element. At the simplest level, this pattern results in programs constructed from sequences of calls to the vector intrinsic functions included with most modern microprocessors. The data-parallel approach, however, can be applied much more broadly by making the task applied to each element a complex sequence of instructions or by including collective communication operations such as reductions or prefix scans.

• Divide and Conquer: Suppose that the computation can be decomposed into tasks that are generated by repeatedly splitting a larger problem into smaller subproblems until the subproblems are small enough to be solved directly, and then combining the results of these subproblems to get the solution to the original problem. This pattern addresses the issues that arise when computing the subproblem solutions in parallel. Divide and Conquer is often used together with the Dynamic Programming, Dense Linear Algebra, and Spectral Methods computational patterns and is implemented with the Fork/Join and/or Task Queue implementation strategy patterns.
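
A sketch of the pattern using OpenMP tasks to express the fork/join structure is shown below; the summation problem and the cutoff value are illustrative:

    /* Divide and conquer sketch: split the range, solve the halves as
       independent OpenMP tasks, then combine the partial sums. */
    #define CUTOFF 1000

    double sum(const double *a, int lo, int hi) {
        if (hi - lo < CUTOFF) {                  /* small enough: solve directly */
            double s = 0.0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;            /* split into two subproblems   */
        double left, right;
        #pragma omp task shared(left)            /* fork: left half as a task    */
        left = sum(a, lo, mid);
        right = sum(a, mid, hi);                 /* right half in the current UE */
        #pragma omp taskwait                     /* join before combining        */
        return left + right;
    }

In use, sum would be called from inside an OpenMP parallel region (for example, from a single construct) so that the generated tasks have UEs to run on.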

• Geometric Decomposition: Suppose that the computation can be decomposed by dividing the key data structures within a problem into regular chunks and assigning a task to update each chunk. This pattern addresses the issues that arise when the chunks are updated in parallel. In many cases, the computation involves iteratively updating the chunks of the data structure, such that in each iteration the new value for a particular data element depends on values from neighboring chunks. In this case, the computation for each iteration breaks down into three components: (1) exchanging boundary data, (2) updating the interiors of each chunk, and (3) updating boundary regions. The optimal size of the chunks is dictated by the properties of the memory hierarchy. This pattern is often used with the Structured Mesh and Dense Linear Algebra computational strategy patterns.

4.4 Implementation Strategy Patterns

4.4.1 Program Structure Patterns
This set of patterns deals with the structure of the source code.

• Single Program Multiple Data (SPMD): The program is organized by giving all UEs the same source code (i.e., single program). Each UE has a unique identifier or rank that can be used to index into multiple data sets (MD) or branch into different subsets of instructions. For example, the following loop uses the thread rank ID to divide up the work of the loop among threads:

for (i = ID*itersPerThread; i < (ID+1)*itersPerThread; i++) { a[i] = f(b[i]); }

This pattern can be used with most of the concurrent algorithm strategy patterns and is the pattern of choice when using MPI.
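
For example, a minimal SPMD sketch using MPI might look like the following, where the block decomposition of the loop and the stand-in work in the loop body are purely illustrative:

    /* SPMD sketch with MPI: every UE runs this same program; behavior is
       specialized by the rank returned from MPI_Comm_rank. */
    #include <mpi.h>
    #include <stdio.h>
    #define N 1000000

    int main(int argc, char **argv) {
        int rank, size;
        double local = 0.0, global = 0.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this UE's unique rank (ID) */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of UEs        */
        int itersPerRank = N / size;            /* block decomposition        */
        int start = rank * itersPerRank;
        int end = (rank == size - 1) ? N : start + itersPerRank;
        for (int i = start; i < end; i++)
            local += 1.0 / (i + 1.0);           /* stand-in for real work     */
        /* combine the partial results on rank 0 */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("result = %f\n", global);
        MPI_Finalize();
        return 0;
    }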

• Data Parallel/Index Space: The main idea of this pattern is to define an abstract index space and map the data structures in the computation onto that index space. Computational kernels (e.g., “work-items” in OpenCL or “threads” in CUDA) operate on these data structures in parallel for each point in the abstract index space. The pattern is appropriate for problems that are mostly data parallel and is frequently used together with the Data Parallelism pattern.
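
For example, a kernel written in OpenCL C for a one-dimensional index space might look like the sketch below; the kernel name and arguments are illustrative, and the host code that sets up the index space is omitted:

    /* Index-space sketch in OpenCL C: one work-item per point of a 1-D index space. */
    __kernel void scale_add(__global float *a,
                            __global const float *b,
                            const float alpha)
    {
        size_t i = get_global_id(0);   /* this work-item's point in the index space */
        a[i] = a[i] + alpha * b[i];    /* one element updated per work-item         */
    }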

• Fork/Join: The computation is organized as a set of functions or tasks that execute within a shared address space. New UEs can be created (fork) at any time to execute tasks in parallel. A thread can wait for another thread to terminate (join) when its results are needed, or to impose structure on the computation. Fork/Join is a very flexible pattern that is the underlying strategy of OpenMP. It is frequently used with the Task Parallelism and Divide and Conquer patterns.
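
A minimal fork/join sketch using POSIX threads is shown below; the worker function is an illustrative placeholder for a real task:

    /* Fork/join sketch with POSIX threads: the parent forks worker UEs,
       then joins them when their results are needed. */
    #include <pthread.h>
    #include <stdio.h>
    #define NTHREADS 4

    void *worker(void *arg) {                    /* placeholder task body  */
        printf("task %ld running\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)      /* fork: create new UEs   */
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)       /* join: wait for results */
            pthread_join(t[i], NULL);
        return 0;
    }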

• Actors: Some computations can be organized as a set of objects where an object encapsulates part of the state of the computation (instance variables) and the operations on those variables (methods). In some cases, an effective way to introduce parallelism is to associate UEs with objects, thus creating active entities called actors. Method calls then correspond to message passing between actors. This pattern is frequently used with the Discrete Event pattern.

• Loop-Level Parallelism: The computation is organized as a modest number of compute-intensive loops, with the loop bodies in any given loop transformed so the loop iterations can safely execute in any order. After transforming the loops as needed to support safe concurrent execution, the serial compute-intensive loops are replaced with parallel loop constructs (such as the for worksharing construct in OpenMP). A common goal of these solutions is to create a single source that can be executed correctly both serially and in parallel, depending on the target system and compiler.
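
As a sketch, a loop whose iterations have been made independent can be annotated with the OpenMP for worksharing construct; the saxpy-style loop body here is illustrative:

    /* Loop-level parallelism sketch: the same source compiles and runs
       correctly serially (the pragma is ignored) or in parallel with OpenMP. */
    void saxpy(int n, float alpha, const float *x, float *y) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];          /* iterations are independent */
    }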

• Task Queue: For task-parallel computations with independent tasks, the challenge is how to schedule the execution of the tasks to balance the computational load among the processing elements of a parallel computer. One solution is to place the tasks into a task queue. Each UE repeatedly pulls a task out of the queue, carries out the computation, and then goes back to the queue for the next task. Note that master-worker algorithms are basically a technique for implementing a task queue.
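
The following sketch shows the core of this pattern with the shared queue reduced to a mutex-protected counter over a fixed set of task indices; the task body do_task is an illustrative placeholder:

    /* Task queue sketch: each UE repeatedly pulls the next task index from
       a shared counter (the simplest possible "queue") until none remain. */
    #include <pthread.h>
    #define NTASKS 1000

    static int next_task = 0;                    /* shared queue state    */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    extern void do_task(int t);                  /* placeholder task body */

    void *worker(void *arg) {                    /* run by each UE        */
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int t = (next_task < NTASKS) ? next_task++ : -1;
            pthread_mutex_unlock(&lock);
            if (t < 0) break;                    /* no tasks left             */
            do_task(t);                          /* carry out the computation */
        }
        return NULL;
    }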

4.4.2 Data Structure Patterns

• Shared Queue: Some computations involve items that must be handled in (approximately) the order in which they are generated. The solution is to define a shared queue where the safe management of the queue is built into the operations upon the queue.

• Shared Map: Many computations are organized around a data structure that can be considered as a mapping between a key and an associated value. As we parallelize algorithms, we need to extend this “shared map” into a parallel context and create a dictionary data structure that is safe and efficient to update in parallel. This requires creating an abstract data type with well defined “put” and “get” operations with synchronization protocols to assure their safe (and scalable) operation.

• Partitioned Graph: A graph is typically a single monolithic structure with edges indicating relations among vertices. The problem addressed by this pattern is how to organize concurrent computation on this single structure in such a way that computations on many parts of the graph can be done concurrently. The desired solution is a strategy for partitioning the graph such that synchronization is minimized and the workload is balanced.

• Distributed Array: The array is a critical data structure in many computations. Arrays are often too large to fit in a single processor's memory, or must be split up to enable parallel updates. In such cases, the arrays must be decomposed and distributed across the memories of the UEs. Complex bookkeeping is then required to map between global indices in the original problem domain and the local indices visible to a particular thread or process. The solution is to define a distributed array data type and fold the complicated index algebra into access methods on the distributed array. The programmer still needs to handle potentially complex index algebra, but it is localized to one place and can possibly be reused across programs that use similar array data types.
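
A sketch of the index algebra that such a data type hides, assuming a simple one-dimensional block distribution, is shown below; the type and function names are illustrative:

    /* Distributed array sketch: a 1-D block distribution and the
       global-to-local index mapping it implies. The access function hides
       the index algebra from the rest of the program. */
    typedef struct {
        double *local;     /* this UE's block of the global array */
        int     block;     /* number of elements in each block    */
        int     my_rank;   /* which block this UE owns            */
    } dist_array;

    /* Return a pointer to global element g if it is local, otherwise NULL
       (a real implementation would communicate to fetch remote elements). */
    double *dist_array_get(dist_array *a, int g) {
        int owner  = g / a->block;               /* rank that owns element g  */
        int offset = g % a->block;               /* local index in that block */
        return (owner == a->my_rank) ? &a->local[offset] : 0;
    }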

• Shared Data: Programmers should attempt to minimize the amount of data that is updated concurrently by multiple threads. When this cannot be avoided, the solution is to give a well-defined API for accessing the data and build in the synchronization protocols necessary to allow safe concurrent access. A variety of implementation techniques that trade off implementation complexity for scalability are available.

4.5 Parallel Execution Patterns

• Multiple Instruction Multiple Data (MIMD): The computation proceeds as a set of independent UEs, each with its own program counter, its own data, and potentially its own set of instructions to execute. They interact at discrete points through some sort of coordination event (e.g., message passing). This is the approach used on HPC clusters.

• Single Instruction Multiple Data (SIMD): The computation proceeds using multiple UEs, but there is one stream of instructions and a single program counter. All the UEs execute the same sequence of instructions together “in lock step”, though in many systems a UE can be masked out of the computation for any given step. This is the approach used in vector units and on traditional GPU processors.

• Thread Pool: The OS maintains a pool of threads. When threads are needed, they are pulled from the pool, complete their work, and then return to the pool. This separates the high cost of thread creation and destruction from the execution of a program. This approach is commonly used when implementing an OpenMP runtime.

• Task Graph: A set of tasks with specific dependencies constraining their execution can be considered as a graph with tasks as the nodes and dependencies represented by the edges. The resulting graph is typically a directed acyclic graph. This graph can be used to define how a collection of tasks orchestrates its execution. This approach is used in runtime systems that support a data-flow style of programming and has more recently become an important technique in parallel linear algebra.

• Transactions: Concurrent tasks are executed “as if” they are independent. All accesses are monitored so that if a conflict does occur, the state can be rolled back and the conflict resolved. When conflicts do not occur, the computation can proceed with large amounts of concurrency. Transactions have been used in databases for many years. More recently, this approach has been used in general-purpose parallel programming with the abstraction of transactional memory to support the rollback feature. Transactions are an important mechanism for implementing the Speculation pattern.

5. SUMMARY, CONCLUSIONS, AND FUTURE WORK

This completes our description of the patterns in the latest version of OPL. Now we can begin the hard work of debating the names and contents of these patterns as well as the structure of OPL itself. This debate is critical, since to achieve our long-term goal this pattern language needs to represent a broad consensus view.

The next step will be to write down the text for each of the patterns. We have been working on this effort within the OPL project for the last few years and have made great progress [5]. What is needed now is to review what we have written at workshops such as ParaPLoP and refine the patterns until we get them right.

At that point, are we done? Will the pattern language alone solve the parallel software problem? Clearly the answer is “no”. A pattern language is not a programming language. It captures the essential elements of software design, but it does not directly generate executable code. For that, we need frameworks that help turn a software design based on design patterns into code. That work is occurring in parallel to our work on the patterns, but it will not take final form until the basic OPL design pattern language is complete. We hope that after you help us refine OPL, you’ll join us in our frameworks research as well.

6. REFERENCES

[1] K. Asanovic, et al. The Landscape of Parallel Computing Research: A View From Berkeley. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2006-183. 2006.

[2] W-M. Hwu, K. Keutzer, T. Mattson. The concurrency challenge. IEEE Design and Test, 25, 4, 2008, pp. 312-320.

[3] K. Keutzer and T. G. Mattson. A design pattern language for engineering (parallel) software. Intel Technology Journal, 13, 4, 2010.

[4] T. G. Mattson, B. A. Sanders, B. L. Massingill. Patterns for Parallel Programming. Addison Wesley, 2004.

[5] OPL. http://parlab.eecs.berkeley.edu/wiki/patterns/patterns.

[6] M. Shaw and D. Garlan. Software Architecture: Perspectives on an Emerging Discipline. Prentice Hall, 1995.