
IEEE Transactions on Power Systems, Vol. 7, No. 2, May 1992

Parallel Processing in Power Systems Computation

An IEEE Committee Report by a Task Force of the Computer and Analytical Methods Subcommittee of the Power Systems Engineering Committee

Co-chairmen: Daniel J. Tylavsky, Anjan Bose Members: Fernando Alvarado, Ramon Betancourt, Kevin Clements, Gerald T. Heydt, Garng Huang, Maria Ilic,

Massimo La Scala, M.A. Pai, Chris Pottle, Sarosh Talukdar, James Van Ness, Felix Wu

ABSTRACT

The availability of parallel processing hardware and software presents an opportunity and a challenge to apply this new computation technology to solve power system problems. The allure of parallel processing is that this technology has the potential to be cost effectively used on computationally intense problems. The objective of this paper is to define the state of the art and identify what we see to be the most fertile grounds for future research in parallel processing as applied to power system computation. As always, such projections are risky in a fast changing field, but we hope that this paper will be useful to the researchers and practitioners in this growing area.

INTRODUCTION - WHAT IS PARALLEL PROCESSING?

Unlike most topics which have a rich research history, parallel processing is still relatively young by power systems standards. This is evident from the difficulty in obtaining agreement on what is meant in the power community by parallel processing. Borrowing from [1], we take a working definition with minor variations as,

Parallel processing is a form of information processing in which two or more processors, together with some form of inter-processor communications system, co-operate on the solution of a problem.

Within processes which have this attribute, many issues arise which are not found in other disciplines because of the uniqueness of the power system application. There also exist many issues which are of broad concern to the parallel processing community as a whole, as expounded in [2]. These issues may be broken into three main groups as follows.

91 SM 503-3 PWRS  A paper recommended and approved by the IEEE Power System Engineering Committee of the IEEE Power Engineering Society for presentation at the IEEE/PES 1991 Summer Meeting, San Diego, California, July 28 - August 1, 1991. Manuscript submitted January 25, 1991; made available for printing May 17, 1991.

Processor architecture encompasses one class of issues which arises and includes all aspects of performance, design, control, and use. This includes the single instruction multiple data (SIMD) and multiple instruction multiple data (MIMD) classes of machines, which may have shared or local memory. The SIMD class of machines as used here includes vector processors such as the Cray, IBM 3090/VF, as well as the MPP, Connection Machine, etc. The MIMD class of machines with local memory will be used here to include distributed processing systems as well as such machines as the iPSC and NCube. The class of MIMD shared memory machines includes the BBN Butterfly, Balance, Encore, Alliant FX-8, etc. The MIMD class is also understood to include the more specialized architectures such as those designed for neural networks research, transputer designs, etc., and the class of parallel vector processors which use more than one vector pipeline simultaneously. Distributed processing systems are an important subclass of MIMD machines. The machines in this subclass are unique in that the architecture of each machine is generally different, each may be of a different type using code based on different languages, execution is carried out asynchronously, and communication paths may use different media. An introduction to one scheme for classifying these architectures may be found in [3].

Software development issues include transparency, portability, task scheduling, vectorization, and performance evaluation. Transparency as used here means the ease with which software written for a set number of processors can be reformulated for another number of processors. Portability as used here means the ease with which software can be compiled by different compilers. Compilers are an issue which is of particular importance for production line use.

Algorithm development is the third major issue which includes the design and analysis of new numerical and symbolic methods to match existing or new architectures. Testing these algorithms to ensure accuracy, and evaluation of their performance is also an issue. New problems with solutions that become computationally feasible because of the existence of high speed parallel processors represent fertile ground for the development of new algorithms and numeric/symbolic methods which are tailored to this new technology. Examples of such symbolic/numeric methods are new high level languages. Both coarse and fine grain algorithms are of interest to match available system architectures.


POWER SYSTEM PROBLEMS

The application of parallel processing to power systems analysis is motivated by the desire for faster computation and not by the structure of the problems. Except for those analytical procedures that require repeat solutions, like contingency analysis, there are no obvious parallelisms inherent in the mathematical structure of power system problems. Thus, for a particular problem a parallel (or near-parallel) formulation has to be found that is amenable to expression as a parallel algorithm. This solution then has to be implemented on a particular parallel machine, keeping in mind that computational efficiency is dependent on the suitability of the parallel architecture to the parallel algorithm.

The interconnected generation and transmission system is inherently large, and any problem formulation tends to have thousands of equations. The most common analysis, the power flow, requires the solution of a large set of nonlinear algebraic equations, approximately two for each node. The usual algorithm of iterative matrix solutions exploits the extreme sparsity of the underlying network connectivity to gain speed and conserve storage. Parallel algorithms for handling dense matrices are not competitive with sequential sparse matrix methods, and since the pattern of sparsity is irregular, parallel sparse matrix methods have been difficult to find. The power flow describes the steady state condition of the power network and thus the formulation (or some variation) is a subset of several other important problems like the optimal power flow or transient stability. An effective parallelization of the power flow problem would also help speed up these other solutions.

The transient stability program is used extensively for off-line studies but has been too slow for on-line use. A significant speed up by parallel processing, in addition to the usual efficiencies, would allow on-line transient stability analysis, a prospect that has spurred research in this area. The transient stability problem requires the solution of differential equations that represent the dynamics of the rotating machines together with the algebraic equations that represent the connecting network. This set of differential algebraic equations (DAE) has various nonlinearities, and some sort of numerical method is usually used to obtain a step-by-step time solution. Each machine may be represented by two to twenty differential equations, and so a 2000 bus power network with 300 machines may require 3000 differential equations and 4000 algebraic equations. In terms of structure, the differential equations can be looked upon as block diagonal (one block for each machine) with the sparse algebraic equations providing the interconnection between the machine blocks. This block diagonal structure has made the transient stability problem more amenable to parallel processing than the power flow problem. Research results to date seem to bear this out.

Other power system analysis problems are slowly being subjected to parallel processing by various researchers. Short circuit calculations require the same kind of matrix handling as the power flow, and the calculation of electromagnetic transients is mathematically similar to the transient stability solution although the models can be more complicated. Steady-state stability (or small disturbance stability) analysis requires the calculation of eigenvalues for very large matrices. The optimal power flow (OPF) optimizes some cost function using the various limitations of the power system as inequality constraints and the power flow equations as equality constraints. Usually the OPF refers to optimization for one operating condition, while unit commitment and hydro-thermal coordination require optimization over time. This optimization problem, especially if there are many overlapping water and fuel constraints, can be extremely large even without the power flow constraints. Reliability calculations, especially when considering generation and transmission together, can be quite extensive and may require Monte Carlo techniques. Production costing is another large example.

It is the size of these problems and the consequent solution times that encourage the search for parallel processing approaches. Even before parallel computers became a potential solution, the concept of decomposing a large problem to address the time and storage problems in sequential computers had been applied to many of these power system problems. In fact, there is a rich literature of decomposition/aggregation methods, some more successful than others, that have been specifically developed for these problems. The use of parallel computers can take advantage of these decomposition/aggregation techniques, but usually a certain amount of adaptation is necessary. Much of the research in applying parallel processing to power systems has its roots in this literature. This report, however, is confined to examining the efforts that apply parallel computers to specific power system problems rather than the much larger area of methods and algorithms that are potentially applicable to parallel computers.

STATE OF THE ART

Linear Algebraic Equations. Many power system problems have large portions which can be easily parallelized. However, solution of most power system problems requires the solution of the linear algebraic problem in the form,

A x = b    (1)

where A is characterized as large with random sparsity, is typically incidence symmetric and is often numerically symmetric. Also, x and b may or may not be sparse. There are a large number of direct and indirect algorithms for solving this problem. The most effective method on serial processors for power system application to date is the use of triangular factorization along with forward/backward substitution, defined by

L D U = A    (2a)

L y = b    (2b)

D U x = y    (2c)


The two distinct phases to this problem are the factorization phase, (2a), and the substitution phase, (2b,c). The algorithms in which such solutions are required will dictate whether both phases can be processed simultaneously. For example, in a full Newton power flow the Jacobian and mismatches are recreated on each iteration, so (2) must be solved repeatedly. The fast decoupled power flow requires that (2a) be solved once and (2b,c) be repeatedly solved on each iteration. Thus there exists a need for parallelizing (2), (2a), and (2b,c). Much work has been done on algorithms for parallel triangular factorization [4-10] and/or forward and backward substitution [4-9,11,12]. Many of these algorithms have attempted to take the serial factorization/substitution problem and exploit available parallelism through reordering/partitioning of the A matrix. This has been effective in reducing the number of precedence relationships, which is governed by the maximum factor path length. (The length of the longest factor path in the elimination tree seems to represent a fundamental limit in minimizing the number of precedence relationships during factorization [13].) Algorithm development for use on array processors has also been investigated. Fundamentally new algorithms attempting to minimize the precedence relationships within the forward and backward substitution problems include the multiple factorization scheme [14] and the use of sparse inverse factors [12, 15, 16]. Indirect methods have also been revisited in an attempt to minimize the number of precedence relationships in solving (2) [17]. These algorithms have done a great deal to assuage the concerns that the precedence relationships inherent in the substitution phase represented an insurmountable obstacle.
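As an illustrative aside (not part of the original report), the following Python sketch separates the two phases of (2) on a small dense matrix: the factorization (2a) is done once and the substitutions (2b,c) are repeated for several right-hand sides, as in a fast decoupled power flow. The matrix and routines are hypothetical teaching examples; production codes exploit sparsity and careful ordering.

```python
# Minimal dense LDU factorization and forward/backward substitution sketch.
import numpy as np

def ldu_factor(A):
    """Factor A = L D U with unit-diagonal L and U (Doolittle-style elimination)."""
    n = A.shape[0]
    L, U = np.eye(n), np.eye(n)
    d = np.zeros(n)
    A = A.astype(float)                          # work on a copy of A
    for k in range(n):
        d[k] = A[k, k]
        L[k+1:, k] = A[k+1:, k] / d[k]           # multipliers form column k of L
        U[k, k+1:] = A[k, k+1:] / d[k]           # scaled pivot row forms row k of U
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])   # Schur complement update
    return L, d, U

def ldu_solve(L, d, U, b):
    """Substitution phase: L y = b (forward), then D U x = y (backward)."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = b[i] - L[i, :i] @ y[:i]           # forward substitution
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = y[i] / d[i] - U[i, i+1:] @ x[i+1:]  # backward substitution
    return x

A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
L, d, U = ldu_factor(A)                          # phase (2a), performed once
for b in (np.array([1., 2., 3.]), np.array([0., 1., 0.])):
    x = ldu_solve(L, d, U, b)                    # phases (2b,c), repeated per right-hand side
    assert np.allclose(A @ x, b)
```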

While algorithm development has yielded good theoretical results, little software has been developed for parallel machines to date. [9, 18] report parallel factorization and substitution results on the iPSC which are unimpressive. Full factorization can be accomplished with maximum speed gains on the order of 2, and with parallel gains of about 10 when factorization is halted before the densest portion of the matrix is encountered (partial factorization). [19] has produced experimental results for solving (2) with a vector processor. These results show promise of being able to take advantage of a portion of the capability of the widely available vector machines. These experimental results are helping to distinguish some of the real issues in parallel processing from non-issues.

An alternate approach to using Newton's method to obtain the solution to a set of nonlinear equations is to iterate directly on the equations,

f(x) = 0    (3)

This approach has a rich algorithmic history in power flow literature, with the Gauss-Seidel method being the most often cited algorithm. The Gauss-Seidel approach in its basic form (i.e. excluding block Gauss-Seidel) is not suitable for parallel processing because of its inherent sequentiality. However, variations do exist which generate exploitable parallelism. Gauss-Jacobi algorithms do not suffer from this sequentiality but may require a large number of iterations to converge. Such algorithms perform better on the algebraic equations characteristic of the transient stability problem (the nonlinear network equations and discretized differential equations) than on the power flow model because of the different methods used in modeling generators. Transient stability algorithms are discussed below.
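The data-dependency difference described above can be made concrete with a small sketch (not from the report, using an arbitrary 3x3 linear system): each Gauss-Jacobi update depends only on the previous iterate, so all components could be updated on separate processors, while each basic Gauss-Seidel update needs the components already computed in the same sweep.

```python
# Contrast of Gauss-Jacobi (parallelizable) and Gauss-Seidel (sequential) sweeps
# on the fixed point x = (b - (A - D) x) / diag(A) for a small test system.
import numpy as np

A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
b = np.array([1., 2., 3.])
d = np.diag(A)

def jacobi_sweep(x):
    # Every component uses only the previous iterate, so all updates
    # could be evaluated simultaneously on separate processors.
    return (b - (A @ x - d * x)) / d

def gauss_seidel_sweep(x):
    # Each component uses the values already updated in this sweep,
    # forcing the updates to run one after another.
    x = x.copy()
    for i in range(len(x)):
        x[i] = (b[i] - A[i, :] @ x + A[i, i] * x[i]) / A[i, i]
    return x

x_j = x_gs = np.zeros(3)
for _ in range(50):
    x_j, x_gs = jacobi_sweep(x_j), gauss_seidel_sweep(x_gs)
print(np.allclose(A @ x_j, b), np.allclose(A @ x_gs, b))   # both converge here
```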

Team-based approaches that combine several methods, such as Newton's method, Gauss-Seidel and residue minimization, have been reported in [20]. Each method is assigned a computer and searches for a solution to the entire set of equations. Progress is reported to a central controller. This controller interrupts methods that are doing poorly and restarts them from the latest results of the methods that are doing well. On a number of difficult test problems, such teams have performed admirably, quickly finding solutions even when the individual methods, working alone, fail miserably.

The transient stability problem is defined by a set of nonlinear differential-algebraic equations (DAEs):

ẋ = f(y, x)    (4)

0 = g(y, x)    (5)

where (4) describes the machine dynamics and (5) the network static behavior. Sequential solution algorithms, developed over several decades, use two basic approaches. The first is the partitioned approach, where (4) is solved by an integration method (e.g. fourth order Runge-Kutta) and at every time step (5) is solved separately. In the simultaneous approach, (4) is discretized (e.g. by the trapezoidal method) and then solved together with (5) at each time step using some Newton-like method. The latter approach tends to be faster because the sparse Jacobian can be held constant, and this variation is known as the very dishonest Newton (VDHN) method. Also, the transient stability problem can be stiff, which means that an explicit method like the Runge-Kutta may require very small time steps, and hence longer computation times, to avoid numerical instability, whereas an implicit method like the trapezoidal is inherently stable but provides variable accuracy.
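To make the VDHN idea concrete, here is a minimal sketch (not from the report) that applies trapezoidal discretization to a toy one-machine swing model and reuses a single frozen Jacobian for every iteration and time step. The model, constants, and step size are illustrative assumptions only.

```python
# Trapezoidal discretization of a toy swing equation solved with a "very
# dishonest Newton" iteration: the residual Jacobian is built once and reused.
import numpy as np

M, D, Pm, Pmax = 10.0, 1.0, 0.9, 1.8            # toy machine parameters (assumed)
def f(z):                                        # z = [rotor angle, speed deviation]
    delta, omega = z
    return np.array([omega, (Pm - Pmax * np.sin(delta) - D * omega) / M])

def jac(z):                                      # Jacobian of f with respect to z
    delta, _ = z
    return np.array([[0.0, 1.0],
                     [-Pmax * np.cos(delta) / M, -D / M]])

h = 0.01                                         # time step (s)
z = np.array([np.arcsin(Pm / Pmax), 0.0])        # start at equilibrium ...
z[1] = 0.5                                       # ... then perturb the speed
J = np.eye(2) - 0.5 * h * jac(z)                 # frozen Jacobian of the trapezoidal residual
for _ in range(200):                             # simulate 2 s
    z_prev, z_new = z, z.copy()
    for _ in range(10):                          # dishonest Newton iterations
        r = z_new - z_prev - 0.5 * h * (f(z_new) + f(z_prev))
        z_new = z_new - np.linalg.solve(J, r)    # reuse J instead of re-evaluating it
        if np.linalg.norm(r) < 1e-10:
            break
    z = z_new
print("final angle (rad), speed deviation:", z)
```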

Both of these approaches, of course, can be used on parallel computers, but the decomposition of the problem and the subsequent relaxation present many new variations. The decomposition of the system variables into groups is known as parallelization in space (i.e. variable space). In addition, since several time steps can be solved simultaneously, it is possible to parallelize in time. The most obvious parallelization in space is the decomposition of equation (4) into sets of equations for each separate machine, the interconnection being provided by (5) which is kept together [21]. The first suggestion for parallelization in time was made in [22], forming the Newton equations at each time step and solving them simultaneously.

When (4) is decomposed, relaxation can be done on the differential equations directly or on the discretized set of equations. The former, known as waveform relaxation, was suggested in [23] and the latter in [24]. Actually, in [24] it is shown that the discrete version of (4) together with (5) can be decomposed to each system variable and solved simultaneously for all time steps by Picard's (relaxation) method. This provides the maximum possible parallelization in space and time but, even though convergence is usually achieved, it requires many more iterations. Various schemes to obtain the most efficient solution have been proposed [25,26].
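The parallel-in-time idea can be illustrated with a Picard-style waveform iteration on a toy scalar equation (a sketch assuming a made-up test problem, not taken from [23,24]): every point of the time window is updated from the previous iterate's waveform, so all time steps could in principle be assigned to different processors.

```python
# Picard (waveform relaxation) iteration over a whole time window for x' = -2x.
import numpy as np

f = lambda x: -2.0 * x                # toy dynamics; exact solution is exp(-2t)
h, steps, x0 = 0.05, 40, 1.0
t = h * np.arange(steps + 1)

x = np.full(steps + 1, x0)            # initial guess for the entire waveform
for _ in range(60):                   # waveform (Picard) iterations
    integrand = f(x)
    # cumulative trapezoidal integral of f(x) along the previous iterate
    integral = np.concatenate(
        ([0.0], np.cumsum(0.5 * h * (integrand[1:] + integrand[:-1]))))
    x = x0 + integral                 # all time points updated "in parallel"
print(np.max(np.abs(x - np.exp(-2.0 * t))))   # only the small discretization error remains
```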

Implementation of these algorithms on actual parallel computers has been slow because of the scarce availability of such hardware. An approximate implementation, using trapezoidal integration for (4) and reducing (5) to the machine terminals, on an iPSC hypercube with 32 processors [18] showed speedup gains of up to 6. More recent implementations on the hypercube, using full representations of machines and network, have produced speedups of over 10 with a parallel (in space and time) version of the very dishonest Newton method [27]. Similar speedups have also been obtained with a relaxed Newton method on the Balance and Alliant shared-memory machines.

In general, this limited amount of experience has shown that as the number of processors increases, the efficiency decreases quite quickly. Computation time also decreases, and speedup gains of about a magnitude can be obtained by using a few tens of processors. Any further gain is swamped out by the extra overhead: communication time in the message passing machines and memory contention in the shared memory machines. Thus, even though high levels of parallelization are shown to be feasible [24] for the transient stability problem, their implementation on finer grained machines will require either different algorithmic approaches or different machine architectures to get significantly higher speedups.

Several other power system problems have been the target of parallel computing application, although to a lesser extent. The parallelization of the optimal power flow solution poses problems similar to that of network decomposition. A scheme called the textured algorithm was proposed [28] for optimal reactive control. Finding the eigenvalues/eigenvectors of the large matrix representing steady-state stability behavior has been implemented on a hypercube [29]. This algorithm is inherently parallel except for the setting up of the initial matrix, and large gains are obtained from parallelization. Reliability calculations for the combined generation and transmission system have been parallelized by scheduling the contingency calculations simultaneously on different nodes. Speedup of about one magnitude was obtained in [30] when SYREL, a production program for reliability calculations, was implemented on a 16 node hypercube. Experiments on a DOS-based shared memory multiprocessor [31] have shown that multi-area reliability calculation, hydro production costing by Monte Carlo simulation, and corrective rescheduling for different contingencies can be parallelized with high efficiency.
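Contingency-type studies are the clearest example of problems that decompose into independent cases. The sketch below (not from [30,31]; solve_case is a hypothetical stand-in for a full power flow or reliability evaluation) shows the scheduling pattern using a pool of worker processes in place of hypercube nodes.

```python
# Independent contingency cases farmed out to separate worker processes.
from multiprocessing import Pool

def solve_case(outage_id):
    # Placeholder for one contingency solution; each case is independent,
    # so no inter-process communication is needed during the computation.
    return outage_id, sum(i * i for i in range(10_000)) % (outage_id + 7)

if __name__ == "__main__":
    contingencies = range(64)
    with Pool(processes=8) as pool:            # 8 workers play the role of 8 nodes
        results = dict(pool.map(solve_case, contingencies))
    worst = max(results, key=results.get)      # e.g. rank cases by a severity index
    print("worst case:", worst)
```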

Performance

Definitions. Successful implementations are those which match the communication requirements and granularity of the algorithm to the communication architecture, communication speed, and granularity of the processor. The granularity of a processor architecture is characterized by the ratio of a measure of the execution speed (perhaps in MIPS) of each processor to a measure of the intercommunication speed (perhaps baud rate) between processors. Those architectures with relatively high ratios are referred to as coarse grain architectures, while low ratios characterize fine grain machines. Coarse grain machines generally tend to have comparatively few processors relative to fine grain machines. Similarly, coarse grain parallel algorithms are those with parallel processes that allow a large number of computations for every word of intercommunication required, while fine grain algorithms require that only a few computations can be completed before words from the communication bus are required. Fine grain algorithms tend to be divisible into a large number of parts relative to coarse grain schemes. A mismatch between algorithm and architecture can lead to a significant degradation of performance as demonstrated in [9], while a good match can lead to moderate to significant gains [18,19,30].
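A toy numerical illustration of this ratio follows (the machine figures, the use of words per second rather than baud rate, and the classification threshold are all arbitrary assumptions for the sketch).

```python
# Granularity ratio: processor execution speed over interprocessor communication speed.
def granularity(mips, words_per_second):
    return mips * 1e6 / words_per_second       # computations available per word communicated

machines = {"message-passing hypercube": granularity(5.0, 2.5e5),
            "shared-memory multiprocessor": granularity(5.0, 5.0e6)}
for name, ratio in machines.items():
    kind = "coarse grain" if ratio > 10 else "fine grain"
    print(f"{name}: ~{ratio:.0f} computations per word -> {kind}")
```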

Architecture. Communication architecture determines the processors between which direct communication is possible and the number of simultaneous messages which may be passed. A mismatch between algorithm and machine leads to bus contention problems which can severely affect performance. Few bus contention problems have been reported in the hypercube structure, even on problems requiring much simultaneous communication, e.g., parallelizing (2). If the effective intercommunication speed is relatively low, such as with the iPSC, idle processor time due to synchronization difficulties may occur in spite of the availability of ample communication paths. Shared memory machines, e.g., Balance and Alliant, which have a broadcast communication scheme suffer bus contention problems quickly if the data communication requirement is heavy.

Software. Software development on the existing machines is currently very time intensive. Machines with local memory like the NCube currently have no parallel compilers. This requires that all code portions be broken down into parallel processes manually and distributed to each processor. Synchronization of all processes and all data communication is under programmer control via compiler-recognized language extensions. The advantage of such a scheme is that the code can be thoroughly optimized by taking advantage of parallelism which might be invisible to an unsophisticated compiler. The disadvantage is that converting and debugging an existing code to run on such a machine can become quite a tedious and lengthy chore. Further, the code developed cannot be transported to another local memory machine with ease because language extensions have not been standardized. Transparency in these machines is very low since manual task aggregation/decomposition is needed to reapportion the algorithm into different sized segments and redesign the flow of data to minimize bus contention and achieve load balancing.

Shared memory machines such as the Balance and Alliant have resident parallel compilers for a limited set of languages, typically FORTRAN and/or C. The advantage of these compilers is that they can take dusty deck software and convert it to parallelized execution modules with no effort on the part of the user. Of course, since these codes are often not written with parallelism in mind, little of the code can be parallelized, leaving parallel gains only slightly greater than 1.0 regardless of the number of processors used. Compiler-native language extensions can be used to control which portions of the code are parallelized or vectorized. Modified compilers containing an expanded set of language extensions, such as those available from Argonne National Laboratory, can be used to gain control over aggregating work to form tasks and over task assignment. These language extensions are easy to use but provide little or no control over some factors which govern the efficiency of the parallelized version. For example, paging may not be under programmer control, leading to an inordinate number of page faults. This can drastically affect the performance of all processors when a broadcast scheme is used for communication. When problems occur which are not controllable with available language extensions, more than a casual familiarity with the machine may be required. Under these conditions, the user may be forced to do much of the task definition, assignment, and synchronization manually. Further, language extensions have not been standardized, so portability is low in many instances. These compilers have a very high degree of transparency since the number of processors to be used is controlled by one variable which can easily be changed at compile time.

There are many other issues which arise depending on the depth to which the user is immersed. Software performance is often not handled consistently for simple parallel processing systems and becomes more intricate with the complexity and diversity seen in distributed processing systems. Database support is often critical for production grade environments.

Algorithm Performance. Once the code is optimized, regardless of the machine used, the problems that are most pernicious are slow or bottlenecked communication paths and the serial slowdown (S2) factor prevalent in parallel algorithms. Communication problems ultimately may be hardware bound but can sometimes be ameliorated with the help of language extensions. Design of an algorithm which suffers from a large or even moderate S2 factor can lead to an implementation which may not be capable of unity gain. For example, if an algorithm suffers from an S2 factor of 20 (i.e., runs on a serial processor twenty times as slow as the fastest industry accepted algorithm), then ideally 20 processors are needed to achieve unity gain. However, since efficiencies decrease as more processors are added, it would not be unreasonable to see efficiencies halved as processors are doubled, leading to less than unity gain regardless of the number of processors applied. In practice, efficiencies often decrease more quickly, leading to saturation of the gain curve and, eventually, negative incremental gain. Typically, an S2 factor greater than 4 or 5 leads to a parallel implementation which does not transcend the unity gain barrier with currently available architectures. A good ideal-parallel-gain estimator for candidate parallel algorithms, which uses timing data taken from a serial processor, is Amdahl's law,

G = T / (Ts + Tp/N)

where T, Ts, and Tp are the serial execution times of the entire code, of the non-parallelizable code portion, and of the parallelizable code portion respectively, and N is the number of processors. This is also written as,

G = 1/(fs + fp/N) ≤ 1/fs

where fs = Ts/T and fp = Tp/T are the non-parallelizable and parallelizable code-execution-time fractions, and where the limit as N gets large gives the upper bound on G as shown. In some problems the equality may be reached [30]. Overhead due to communication requirements, bus contention, additional paging requirements, unbalanced task assignment, synchronization testing, locking/unlocking of privileged data, etc., causes significant degradation of the gain as N grows. The type and amount of overhead experienced will be different for different machines. Hence alternative schemes for the aggregation and distribution of tasks may be needed to optimize the performance of the algorithm on different machines.
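A short calculation (an illustrative sketch, not from the report) shows how Amdahl's law interacts with the S2 factor discussed above: the ideal gain saturates at 1/fs, so an algorithm with an assumed serial slowdown of 5 and a 5% non-parallelizable fraction can never beat the best sequential code by more than a factor of 4, even with unlimited processors.

```python
# Amdahl's law combined with the serial-slowdown (S2) factor.
def amdahl_gain(fs, n):
    """Ideal gain over one processor for serial fraction fs on n processors."""
    return 1.0 / (fs + (1.0 - fs) / n)

fs, s2 = 0.05, 5.0          # 5% non-parallelizable code, S2 = 5 vs best serial algorithm
for n in (1, 4, 16, 64, 256):
    g = amdahl_gain(fs, n)
    print(f"N={n:4d}  ideal gain={g:6.2f}  gain over best serial={g / s2:6.2f}")
# The ideal gain tends to 1/fs = 20, so the speedup over the best sequential
# algorithm is bounded by 20 / 5 = 4 regardless of the number of processors.
```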

Vector Processors. Vector processors are available that use more than one vector pipeline simultaneously. From this point of view such processors are shared memory MIMD machines and the above comments apply. Focusing on a single processor reveals a fine grained SIMD machine whose architecture supports parallelism at the micro-code level, known as vector processing. Vector processors form a unique and important type of parallel processor because they have the highest FLOP (floating point operation) rates and fastest cycle times (less than 10 nanoseconds) available today. Further, this class of parallel processors has found wide acceptance and is currently in production line use by other disciplines. This suggests that such machines will remain available in the future with, most probably, significantly lower cost. The compilers available with the single processor machines are sophisticated (relative to non-vector machine compilers) with good diagnostics for enhancing vectorization. Language extensions are not standardized, so portability is low. Transparency is relatively high at the vector or multiprocessor levels.

These machines have not been exploited much by the power community since vectorization is inefficient on sparse matrix/vector problems. Recent progress has been reported by [19] on parallelizing the power flow problem. Also, [32] has shown that frequency domain simulation of the transient stability problem allows for significant vectorization and pipelining gains.

Distributed Processing. Distributed processing (DP) involves the use of networks of computers, which may not be close geographically, for very coarse grained parallel processing. Distributed processing is the most general form of parallel processing since it will, in general, involve many different types of processors executing different programs which are written in different languages, and asynchronous communication channels whose speeds may vary over a wide range and whose architecture will be unique for each system. Since several different languages and programming styles may be involved, assembling such systems can be difficult. However, some software aids that reduce the difficulty are beginning to appear [33]. Problems still exist in establishing the "correctness" of distributed algorithms for automation/control, and for evaluating their performance.

Distributed processing systems have been used in power system applications (e.g. distribution automation) for some time because of the large geographical region covered by the system. Their use is on the increase and will be so for the foreseeable future. Fueling the growth of DP systems over centralized computing systems are increased computing power per dollar, improved reliability, gradual performance degradation as components fail, and relatively high computing speeds for parallelizable tasks. However, DP systems are more difficult to analyze and control than centralized systems, and geographical separation of the processors makes global memory unwieldy and asynchronous execution imperative. Further, it is more difficult for a processor to acquire and maintain accurate and current information about the rest of the system when communication paths are limited. Many such systems, however, often require communication between one processor and only a few others. Ameliorating strategies for a communication channel loss contingency are easy to devise for many applications.

Distribution automation (DA) is currently using DP to implement distribution operation and control tasks along with new customer-service functions, under local-processor control. This takes advantage of the fact that many DA functions have significant local components which cannot be performed in a cost-efficient manner in real time using centralized computing.

Special Architectures. Other specialized architectures such as array processors [34,35] and transputers [36] have been and continue to be studied for power system applications. For example, an extensible network of T800 transputers, one per bus, using the EMTP method of characteristics and the trapezoidal rule, can simulate a balanced bulk power transmission network with an integration step size of under 80 µs, corresponding to real-time operation for a system with a minimum line length of 24 km. This application is particularly interesting because it uses the transmission line transit time to decouple the node problems. Data flow machines, such as the MIT Tagged Token Architecture, are being used experimentally for solving (2) [37, 38]. Systolic arrays, along with many special purpose architectures, have yet to enjoy power system application implementations.

RESEARCH ISSUES

In the last decade a significant amount of research has been conducted in parallel processing of power system problems. Most of this work has been in the development of algorithms, while actual testing on multiprocessor architectures is more at the beginning stages. Most of the results, although encouraging, do not yet indicate clear cut paths for the development of production grade engineering tools. Probably the largest uncertainty today is the evolution of the hardware. Although several different parallel machines are commercially available, they are largely being used for research purposes, and their acceptability to industry and long term viability are very unpredictable.

Since the main thrust of the research so far has been the search for parallel algorithms for specific power system problems, a body of knowledge about the suitability of certain algorithmic approaches has already been created. However, the search for better algorithms is a never ending one, and their application to other power system problems has to be continued. The main goal in developing an algorithm is to maximize its parallelism and minimize the data dependencies between the parallel parts. For some problems, like contingency analysis or eigenvalue calculations, the parallelism is obvious and there are no data dependencies. For others, like power flow and transient stability, the parallelism and data dependencies are the result of careful algorithm design. There is an obvious trade-off here because more parallelism, that is more processors, may require more data sharing. Also, an increase in parallelism may affect convergence of an iterative method adversely, forcing another trade-off consideration. Thus the design of a good parallel algorithm, just like that of a sequential one, is an art, because even though algorithms can be chosen for certain characteristics, their actual performance can only be determined by careful testing.


The testing of these algorithms has largely been done on sequential machines by trying to simulate a parallel environment. Since this requires making many assumptions about the parallel architectures, the projected results for algorithm performance tend to be optimistic. The testing on actual parallel machines has been limited, mainly because of their lack of availability. However, the best implementation of a particular algorithm on a particular parallel architecture is an open question and is critical to the success of this approach. Certain algorithms may be more suitable for certain architectures (e.g. if a lot of data is to be shared by all processors, a data sharing machine may work better than a message passing machine). But more than that, the actual implementation of organizing the schedule of calculation and I/O for each processor affects the performance of an algorithm significantly.

In this respect, the search for parallel solutions is more complicated than that for sequential ones. The three way matching of the problem, algorithm and architecture is a difficult task, and since no particular commercially available architecture has stood out as being more viable, concentrating on one type may not be a good idea. This, of course, raises the interesting possibility of defining a structure that is particularly suitable for power system applications (or even a subset) in the hope that such a machine would then become commercially available to capture the power industry market. However, any one industry has traditionally not been large enough to exert such influence, and this outcome is probably unlikely.

Another interesting prospect is the influence of other technologies on parallel architectures. Optical computing may bring completely new possibilities in design. The application of neural networks will also influence hardware design and the application of neural networks to power system problems is another area with great potential.

As researchers have grappled with parallel solutions to improve computational performance, the measure for such performance has been difficult to devise. Since computer hardware varies in speed, the standard way to measure the performance of algorithms on sequential machines is to compare their CPU times on the same machine. For parallel algorithms the architecture is part of the solution, and getting the hardware out of the measure is difficult. It has become common to measure the effect of parallelization by running the algorithm on one processing node and comparing this to the run times when using more nodes. Although this is a good measure of efficiency and speed-up, it does not reflect the comparison of performance with the best known sequential algorithm. One way to measure this would be to use as the basis of comparison the best known sequential algorithm on one processing node. Since extremely few parallel algorithms are faster than the best sequential algorithm when run on a sequential machine, this basis for comparison would be much more realistic from an engineering viewpoint.
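The difference between the two baselines can be seen with made-up numbers (an illustrative sketch, not measured data): a parallel algorithm with an S2 factor of 4 can show an impressive relative speed-up while its engineering speed-up over the best sequential code remains modest.

```python
# Two speed-up baselines: the parallel algorithm on one node vs. the best sequential algorithm.
t_best_sequential = 10.0     # best sequential algorithm, one processor (seconds, assumed)
t_parallel_1node  = 40.0     # parallel algorithm on a single node (S2 factor of 4, assumed)
t_parallel_16node = 3.2      # parallel algorithm on 16 nodes (assumed)

relative_speedup    = t_parallel_1node / t_parallel_16node     # 12.5x, looks impressive
engineering_speedup = t_best_sequential / t_parallel_16node    # 3.1x, the realistic figure
print(relative_speedup, engineering_speedup)
```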

As pointed out in the section on prior work, research in parallel solutions has concentrated on only a few of the large computation problems in power systems. The area of transient stability has attracted particular attention because there is a perceived need for on-line analysis to determine the security of the power system. The power flow problem has also been a target because of its pervasive use in the industry and also because it fits in very well with the matrix solution research being carried out outside the power area. Other areas have been touched but not explored to a great extent.

It is probably fair to say that the industrial need is more in the area of on-line applications in the control center where faster computation gives the power system operator more automatic analysis to help in decision making. Obviously, any breakthroughs will be beneficial for off-line use because those analyses that are now too slow for interactive use can be made so, and more of those that are already quite fast can be run in the same time period. That is, faster programs increase engineering efficiency.

The most time critical function in the control center is static contingency analysis. Dynamic contingency analysis is still not feasible. If the total contingency analysis problem becomes tractable, the next obvious issue is the determination of corrective and protective control. At the moment the optimal power flow is starting to be used to calculate such control action, but the possible scenarios are so numerous (possible contingencies combined with possible flow or voltage violations combined with possible control actions) that only certain aspects can be tackled today.

The optimal operation of the power system, taking into account only the security concerns above, is a big problem. But economic concerns are also very important. The continual economic dispatching of units and the scheduling of units over time, taking into account hydro and other fuel prices and availability, are also very large problems. If some of the uncertainties in load forecasting, generator forced outages, and other parameters are taken into account, the problem becomes extremely complicated. The coming deregulated environment is certain to increase the uncertainty level in the system, and incorporating these uncertainties into the optimal operation question is going to be a challenge. These economic calculations (except for generator dispatching) are not time critical but are in the operations environment and are large.

System planning, reliability, and production costing are all areas requiring large calculations. Parallel processing could help in speeding up the present day programs. These analyses are only used off-line for non-time-critical calculations, and the urgency for major speedups is not high.


Another computational area with which power engineers have slowly become more involved is the software environment itself. From interactive off-line to control center on-line programs, the use of computer graphics has become extensive. Large data bases in all areas of the electric power industry have generated their own problems of maintenance and coordination. The use of symbolic manipulation of data and analysis has changed the way workers are utilizing the computer resources. Speed is a concern in all these areas, and parallel solutions may be possible in many cases.

Serial implementations of simulated neural network computing have been performed for such problems as control center alarm processing and load forecasting. Simulated neural network computing is a highly parallel fine grained process which may be capable of using the fine grained SIMD machines such as the MPP. It appears to be practical on problems of small dimension, with input variables well below 100, and for systems whose internal complexity makes conventional model development difficult or impractical. Simulated neural network computing is still in its infancy but may have an important niche in computing applications of the future.


CONCLUSION

The rapidly changing computational technology is providing the power system engineer new ways to increase cost and speed efficiency. The use of vector and array processors to speed up certain calculations and the design of distributed processors for system monitoring and control are already commonplace in the power industry. However, the appearance of parallel architectures and the projected advances in hardware and software emphasize that a more systematic approach for their application to power system computation is needed. It is no longer meaningful to develop the best algorithm without asking about the architecture for which it is best suited. This paper tries to summarize the state of the art in an attempt to capture the developing approach of matching algorithms to architectures. The growing literature is an indication of future promise.

There remain a plethora of problems and architectures which have not been examined, and algorithms yet to be developed which can have a significant impact on differential equation, algebraic equation, and optimization problems. The challenge of matching new/existing algorithms to new/existing architectures remains a central research issue. Research problems including neural network use and genetic algorithms all involve the central challenge of matching highly parallelized algorithms to new and existing parallel/distributed architectures. The biggest challenge is to learn to look at problems with new eyes so that highly parallel algorithms, even ones with "embarrassing" parallelism, can be found which do not suffer from serial slowdown. Power system engineers have traditionally made important contributions in the field of computation. We will never have a better time than now to continue the tradition.

ACKNOWLEDGEMENT

A two day workshop sponsored by the National Science Foundation under grant #87-15715 enabled this task force to produce this report. The attendees were: F. Alvarado, University of Wisconsin; R. Betancourt, San Diego State University; A. Bose, Arizona State University; K. Clements, Worcester Polytechnic Institute; G.T. Heydt, National Science Foundation; G. Huang, Texas A&M; M. Ilic, M.I.T.; M. La Scala, University of Bari, Italy; M.A. Pai, University of Illinois; C. Pottle, Cornell University; S. Talukdar, Carnegie Mellon University; D.J. Tylavsky, Arizona State University; J. Van Ness, Northwestern University; F. Wu, University of California. In addition, the contributions of J.S. Chai, X.Y. Chen, N. Zhu, Arizona State University, and L. Murphy, University of California, Berkeley, are acknowledged.

REFERENCES

[1] K. Hwang, F. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, New York, 1984.

[2] J.M. Ortega and R.G. Voigt, Solution of Partial Differential Equations on Vector and Parallel Computers, SIAM, Philadelphia, 1985.

[3] R.A. Duncan, "A Survey of Parallel Computer Architectures," IEEE Computer, pp. 5-16, Feb. 1990.

[4] D.E. Barry, C. Pottle and K. Wirgau, "A Technology Assessment Study of Near Term Computer Capabilities and Their Impact on Power Flow and Stability Simulation Programs", EPRI-TPS-77-749 Final Report, 1978.

[5] F.M. Brasch, J.E. Van Ness and S.C. Kang, "Evaluation of Multiprocessor Algorithms for Transient Stability Problems", EPRI-EL-947, 1978.

[6] F.M. Brasch, J.E. Van Ness and S.C. Kang, "Design of Multiprocessor Structures for Simulation of Power System Dynamics", EPRI-EL-1756, 1981.

[7] A.M. Erisman, "Decomposition and Sparsity with Application to Distributed Computing", Exploring Applications of Parallel Processing to Power System Analysis Problems, EPRI-EL-566-SR, 1977.

[8] J. Fong and C. Pottle, "Parallel Processing of Power System Analysis Problems via Simple Parallel Microcomputer Structures," Exploring Applications of Parallel Processing to Power System Analysis Problems, EPRI-EL-566-SR, 1977.

[9] K. Lau, D.J. Tylavsky and A. Bose, "Coarse Grain Scheduling in Parallel Triangular Factorization and Solution of Power System Matrices", IEEE/PES Summer Meeting, Minneapolis, July 1990.

[10] D.J. Tylavsky, "Quadrant Interlocking Factorization: A Form of Block L-U Factorization", Proc. IEEE, 1986.

[11] A. Abur, "A Parallel Scheme for the Forward/Backward Substitutions in Solving Sparse Linear Equations", IEEE Trans. on Power Systems, PWRS-3, 1988.

[12] F.L. Alvarado, D.C. Yu and R. Betancourt, "Partitioned Sparse A-1 Methods", IEEE Trans. on Power Systems, May 1990.

[13] W.F. Tinney, V. Brandwajn and S.M. Chan, "Sparse Vector Methods", IEEE Trans. on Power Apparatus and Systems, February 1985.

[14] J.E. Van Ness and G. Molina, "Multiple Factoring in the Parallel Solution of Algebraic Equations", EPRI EL-3893, 1983.

[15] R. Betancourt and F.L. Alvarado, "Parallel Inversion of Sparse Matrices", IEEE Trans. on Power Systems, Feb. 1986.

[16] M.K. Enns, W.F. Tinney and F.L. Alvarado, "Sparse Matrix Inverse Factors", IEEE Trans. on Power Systems, May 1990.

[17] D.J. Tylavsky and B. Gopalakrishnan, "Precedence Relationship Performance of an Indirect Matrix Solver", IEEE/PES Summer Meeting, Long Beach, CA, 1989.

[18] S.Y. Lee, H.D. Chiang, K.G. Lee and B.Y. Ku, "Parallel Power System Transient Stability Analysis on Hypercube Multiprocessors", IEEE Power Industry Computer Applications Conference, Seattle, May 1989.

[19] A. Gomez and R. Betancourt, "Implementation of the Fast Decoupled Load Flow on a Vector Computer", IEEE PES Winter Meeting, Atlanta, 1990.

[20] S.N. Talukdar, S.S. Pyo and Ravi Mehrotra, "Distributed Processors for Numerically Intense Problems", Final Report, EPRI Project RP 1764-3, 1983.

[21] W.L. Hatcher, F.M. Brasch and J.E. Van Ness, "A Feasibility Study for the Solution of Transient Stability Problems by Multiprocessor Structures", IEEE Trans. on PAS, Nov/Dec 1977

[22] F.L. Alvarado, "Parallel Solution of Transient Problems by Trapezoidal Integration", IEEE Trans. on PAS, May/June 1979.

[23] M. Ilic-Spong, M.L. Crow and M.A. Pai, "Transient Stability Simulation by Waveform Relaxation Methods", IEEE Trans. on Power Systems, Nov. 1987.

[24] M. La Scala, A. Bose, D.J. Tylavsky and J.S. Chai, "A Highly Parallel Method for Transient Stability Analysis", IEEE Power Industry Computer Applications Conference, Seattle, May 1989.

[25] M. La Scala, M. Brucoli, F. Torelli and M. Trovato, "A Gauss-Jacobi-Block-Newton Method for Parallel Transient Stability Analysis," IEEE PES Winter Meeting, Atlanta, February 1990.

[26] M. La Scala, R. Sbrizzai and F. Torelli, "A Pipelined-in-Time Algorithm for Transient Stability Analysis," IEEE PES Summer Meeting, Minneapolis, July 1990.

[27] J.S. Chai, N. Zhu, A. Bose and D.J. Tylavsky, "Parallel Newton Type Methods for Power System Stability Analysis Using Local and Shared Memory Multiprocessors," IEEE PES Winter Power Meeting, New York, February 1991.

[28] J. Zaborszky, G. Huang and W. Lu, "A Textured Model for Computationally Efficient Reactive Power Control and Management," IEEE Trans. on Power Apparatus and Systems, July 1985.

[29] J.E. Van Ness and D.J. Boratynska-Stadnicka, "A Partitioning Algorithm for Finding Eigenvalues and Eigenvectors", Power Systems Computation Conference, Graz, Austria, August 1990.


[30] D.J. Boratynska-Stadnicka, M.G. Lauby and J.E. Van Ness, "Converting an Existing Computer Code to a Hypercube Computer," IEEE Power Industry Computer Applications Conference, Seattle, May 1989.

[31] M.J. Teixeira, H.J.C.P. Pinto, M.W.F. Pereira and M.F. McCoy, "Developing Concurrent Processing Applications to Power System Planning and Operations", IEEE Power Industry Computer Applications Conference, Seattle, May 1989.

[32] D.J. Tylavsky, P.E. Crouch and D.J. Atresh, "Frequency Domain Relaxation of Power System Dynamics", IEEE Trans. on Power Systems, May 1990.

[33] S.N. Talukdar and E. Cardozo, "An Environment for Rule- based Blackboards and Distributed Problem Solving", Readings in Distributed Artificial Intelligence, Morgan Kaufmann Publishers, San Mateo, CA, 1988.

[34] R. Podmore, M. Liveright and S. Virmani, "Application of an Array Processor for Power System Network Computations," IEEE Power Industry Computer Applications Conference, Cleveland, May 1979.

[35] C. Pottle, "Array and Parallel Processor in On-line Computations", EPRI EL-2363, 1982.

[36] C. Pottle, "Prospects for the Real-Time Simulation of Bulk Power Systems", Proc. 1988 Grainger Lecture Series, pp. 82-89, Urbana, IL, May 1988.

[37] Yu and Wang, "A New Parallel LU Decomposition Method", IEEE Trans. on Power Systems, Feb. 1990.

[38] Yu and Wang, "A New Approach for the Forward and Backward Substitution of Parallel Solution of Sparse Linear Equations Based on Data Flow Architecture", IEEE Trans. on Power Systems, May 1990.


Discussion

ANDRE MORELATO, University of Campinas, Brazil :

The authors are to be congratulated on an informative and helpful study of the state of the art on parallel processing applications to power systems. I would like to clarify one statement made in the paper.

Even though the precedence relationships inherent in solving eq. (2) are not an insurmountable obstacle, they remain a fundamental issue in the effective application of parallel processing to power systems problems. It is well known that parallelism is most effective if we perform in parallel the part of the problem which requires a large number of cycles. In power systems that part corresponds to the solution of the sparse linear system Ax = b. However, it has been recognized that the precedence relationships between elementary operations can impose a significant limitation on achieving very high speedups in parallel sparse matrix algorithms, even if the implementation is able to match the algorithm to the parallel machine [1]. On the other hand, vector processors seem to be inefficient on sparse matrix problems.

The huge efficiency of the sequential sparse matrix methods is based on the extreme exploitation of sparsity. On the contrary, would the sparsity be a burden for handling power network problems in parallel? I would appreciate the authors' comments on this point.

Reference

[1] A. Padilha and A. Morelato, "A W-Matrix Methodology for Solving Sparse Network Equations on Multiprocessor Computers", IEEE/PES Summer Meeting, San Diego, CA, July/Aug. 1991 (Paper 482-0-PWRS).

Manuscript received August 22, 1991.

DANIEL J. TYLAVSKY, ANJAN BOSE (on behalf of the Task Force): We would like to thank Dr. Morelato for his clarification and agree that the repeat solution of the linear equations becomes the bottleneck in any parallelization of power system calculations. We briefly address here his question of whether the sparsity of the linear equations is a burden to parallelization.

Sparsity is an aid, rather than a burden, in coarse grain parallel processing for several reasons. First, the sparse structure determines the number of precedence relationships in the factorization phase of the calculations. Second, the number of partitions needed in the use of the W-matrix method is a function of the sparsity structure and amount of acceptable fill-in. Thus in general, the more dense the problem, the smaller the amount of exploitable parallelism. The difficulty lies in being able to take advantage of this parallelism to improve upon the very efficient sequential processing that has already been developed.

Sparsity, however, does inhibit some fine grain parallel processing techniques such as vectorization, as noted by Dr. Morelato. It is thought in some circles that vectorization will be replaced by pipelines capable of more general fine grain parallel processing. If these systems come to fruition, then sparsity may be an aid, rather than a hindrance, in pipeline processing.

Manuscript received November 19, 1991.