Large scale atomistic polymer simulations using Monte Carlo methods for parallel vector processors



Computer Physics Communications 144 (2002) 1–22

www.elsevier.com/locate/cpc


Alfred Uhlherr a,∗, Stephen J. Leak b, Nadia E. Adam b,1, Per E. Nyberg b,2, Manolis Doxastakis c,d, Vlasis G. Mavrantzas c,d, Doros N. Theodorou c,d

a CSIRO Molecular Science, Bag 10, Clayton South, Victoria 3169, Australia
b NEC Australia Pty Ltd, 635 Ferntree Gully Rd, Glen Waverley 3150, Australia

c Department of Chemical Engineering, University of Patras, GR 26500 Patras, Greece
d Institute of Chemical Engineering and High Temperature Chemical Processes, GR 26500 Patras, Greece

Received 28 May 2001; accepted 28 November 2001

Abstract

In this paper we discuss the implementation of advanced variable connectivity Monte Carlo (MC) simulation methods for studying large (>10^5 atom) polymer systems at the atomic level. Such codes are intrinsically difficult to optimize since they involve a mixture of many different elementary MC steps, such as reptation, flip, end rotation, concerted rotation and volume fluctuation moves. In particular, connectivity altering MC moves, such as the recently developed directed end bridging (DEB) algorithm, are required in order to vigorously sample the configuration space. Techniques for effective vector implementation of such moves are described. We also show how a simple domain decomposition method can provide a general and efficient means of parallelizing these complex MC protocols. Benchmarks are reported for a 192,000 atom simulation of polydisperse linear polyethylene with an average chain length C6000, for simulations using 1 to 8 processors and a variety of MC protocols. © 2002 Elsevier Science B.V. All rights reserved.

PACS: 02.70.Lq; 61.20.Ja; 61.25.Hq; 36.20.Ey

Keywords: Molecular simulation; Polyethylene melt; Parallel computing

1. Introduction

In principle, molecular simulation methods [1–4] appear ideal for enabling rational design of new polymer materials, thus further increasing the importance of these materials to all aspects of modern society. The problem is that real polymers exhibit complex behavior over a large range of length and time scales. To fully address this complexity, simulation of real polymer materials from first principles requires a hierarchical approach [5,6]. Even

* Corresponding author. E-mail address: [email protected] (A. Uhlherr).

1 Present address: Wilson Synchrotron Laboratory, Cornell University, Ithaca, NY 14853, USA.
2 Present address: HNSX Supercomputers Inc., 52 Hymus Blvd, Pointe-Claire, Quebec, H9R 1C9, Canada.

0010-4655/02/$ – see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0010-4655(01)00464-7


at the lower levels of this hierarchy, simulation of a single polymer phase at the atomistic level is problematic, due to the high chain lengths used in most technologically useful materials and the strong influence of the chain length on most material properties. Long chains require not only a large number of atoms, but also very long relaxation times, which scale as the third power of the chain length or greater. A typical commercial grade polymer melt (such as the C6000 melt considered here) equilibrates in a time period of 10^-3 to 10^1 seconds, which is far beyond the present capabilities of conventional simulation techniques such as molecular dynamics (MD) [6].

In previous work [7–9] we have shown that advanced variable connectivity MC methods such as end bridging (EB) and directed end bridging (DEB) can be used to overcome the time scale problem. Such methods can equilibrate polymer melts at all length scales, at a rate which varies only weakly (and can actually increase) as a function of the chain length. Here we address the length scale problem, and in particular the general notion that scaling up to large system sizes is more difficult for atomistic MC than for MD. The ability to perform large scale simulations (>10^5 atoms) is important if we are to take full advantage of the ability to simulate long chains.

The other key feature of end bridging methods is that they generate polydisperse systems of predefined molecular weight distribution (MWD). This MWD can be chosen to mimic that which is observed experimentally. Real polymers such as polyolefins often have high polydispersities, and many of their important processing characteristics are dominated by the high molecular weight “tail”, i.e. a small fraction of very long chains. In principle, an atomistic simulation could use the same type of MWD, provided a large enough number of atoms is used.

Previous atomistic MC polymer simulations have been performed for moderate sized systems (~10^4 atoms) over long times on single workstation or server processors. The ability to routinely perform large simulations is reliant on having codes that run efficiently on high performance computing facilities. Here we focus on implementation for vector-parallel architectures. In the final section we consider some implications for distributed parallel architectures.

2. Molecular model

The program discussed here is designed to simulate melts of linear polyethylene using a united atom “polybead” model. Extensions to more complex polymers such as polyisoprene and polypropylene have been described [10,11]. The potential function is the same as that defined previously [8]. CH2 and CH3 units are treated as equivalent, spherically symmetric united atoms, separated by fixed bond lengths of 1.54 Å. The interaction between pairs of united atoms is governed by a Lennard–Jones nonbond potential,

$$u_{\mathrm{LJ}}(r_{ij}) = 4\varepsilon\left[\left(\frac{\sigma}{r_{ij}}\right)^{12} - \left(\frac{\sigma}{r_{ij}}\right)^{6}\right], \qquad (1)$$

where r_ij is the distance between atoms i and j. The respective values of the well depth ε/k_B and the interaction diameter σ are 49.3 K and 3.94 Å. Explicit pairwise interactions are truncated at 2.3σ and augmented by standard long range corrections [1]. The angle θ_i between the bonds on atom i is governed by a harmonic potential,

$$u_{\mathrm{bend}}(\theta_i) = \tfrac{1}{2} k_\theta (\theta_i - \theta_0)^2, \qquad (2)$$

with a bending stiffness k_θ/k_B = 57,950 K·rad^-2 and equilibrium angle θ_0 = 112°. The torsional energy for the dihedral angle about bond i is of the form

$$u_{\mathrm{tor}}(\phi_i) = \sum_{m=0}^{5} A_m \cos^m(\phi_i). \qquad (3)$$

For m = 0–5, the coefficients A_m/k_B are set to 1116, 1462, −1578, −368, 3156 and −3788 K, respectively.
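The paper does not list source code; the following Python sketch simply evaluates the three potential terms of Eqs. (1)–(3) with the parameter values quoted above (all energies divided by k_B, so the returned values are in Kelvin). Function and variable names are illustrative only.

```python
import numpy as np

# Parameter values quoted in the text, with energies expressed as E/k_B in Kelvin.
EPS_K = 49.3                 # Lennard-Jones well depth epsilon/k_B [K]
SIGMA = 3.94                 # Lennard-Jones diameter sigma [Angstrom]
K_BEND = 57950.0             # bending stiffness k_theta/k_B [K rad^-2]
THETA0 = np.radians(112.0)   # equilibrium bond angle [rad]
A_TORS = [1116.0, 1462.0, -1578.0, -368.0, 3156.0, -3788.0]  # A_m/k_B [K]

def u_lj(r):
    """Lennard-Jones nonbond energy (Eq. (1)) for separation r [Angstrom], in K."""
    sr6 = (SIGMA / r) ** 6
    return 4.0 * EPS_K * (sr6 * sr6 - sr6)

def u_bend(theta):
    """Harmonic bond-angle energy (Eq. (2)) for angle theta [rad], in K."""
    return 0.5 * K_BEND * (theta - THETA0) ** 2

def u_tors(phi):
    """Torsional energy (Eq. (3)) as a power series in cos(phi), in K."""
    c = np.cos(phi)
    return sum(A_TORS[m] * c ** m for m in range(6))

if __name__ == "__main__":
    print(u_lj(4.5), u_bend(np.radians(114.0)), u_tors(np.radians(60.0)))
```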


All simulation results are for the semi-grand nNμ*PT ensemble [7] at temperature T = 450 K and pressure P = 1 atm. The total number of chains n is 32, while the total number of united atom sites N is 192,000. The spectrum of chain-length dependent chemical potentials is defined to give a rectangular distribution of chain lengths ranging from 2400 to 9600 mers, yielding a polydispersity index of 1.12. The average chain length is thus 6000, corresponding to a number average molecular weight of approximately 84,000 g mol^-1. This molecular weight is significantly larger than that considered in our previous work [9,12], and far exceeds other quantitative bulk polymer simulations performed to date. Such a molecular weight is comparable to commercial grades of high density polyethylene (HDPE) that are commonly used in injection moulding applications. However, the simulated polydispersity is still much lower than typical experimental ranges.

3. Parallel Monte Carlo protocol

Parallel MD simulations have become commonplace, due largely to the successful application of the familiar concept of domain decomposition [13–17], whereby the dynamics of atoms in different spatial regions (domains) of the simulation cell can be updated synchronously by different processors, utilizing communicated coordinate data from adjacent domains. By contrast, MC simulations of condensed matter are difficult to parallelize effectively [18], except for studies that utilize many replicate runs or ensembles/protocols that are intrinsically parallel in nature [2,18–20]. This is because such simulations are based on the Metropolis scheme [21] or another importance sampling method, requiring a strict sequence of small random displacements, each involving only one or a few particles. Synchronizing such a long sequence of small individual moves for interacting particles across multiple processors quickly leads to serious communication overheads, as does spreading neighbor interaction calculations for each atom across multiple processors [22,23]. In some model systems, notably Ising spin lattices, simultaneous moves can be spread across multiple processors by taking advantage of simplified interaction symmetries or other characteristics inherent in the model [24–26]. Transferring such domain decomposition processes to atomistic simulations with finite interaction ranges is possible [27], but again leads to inefficient message passing and synchronization overheads.

Atomistic polymer MC simulations in particular require an elaborate mixture of different moves to ensure full sampling of configuration space [28]. Hence previous attempts to parallelize such simulations have been highly specific to individual move algorithms [29,30]. For example, in the configuration bias (CB) algorithm [2,31], the generation of a set of trial atom positions can be performed as a set of microtasks [30]. The scope of such parallelization schemes is rather limited, because:

• each move algorithm must be parallelized in a different manner,
• the optimal spread across processors will vary for different algorithms, molecular systems and computer architectures,
• shared memory and significant communication overheads are required.

In this work we use a domain decomposition scheme based on the “checkerboard” method [24] for Ising spin systems, which we have called the “Collingwood” method. The basis of the method is outlined in Fig. 1, and a minimal code sketch is given after the following list. Specifically:

• independent sequences of individual Monte Carlo moves are performed in parallel within independent active regions (stripes), separated by inactive regions that are wider than the largest interaction distance,
• attempted moves across the boundary of an active region are rejected,
• the active regions are periodically redefined with randomized positions and orientations.
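The sketch below illustrates these three rules for a one-dimensional stripe geometry in a periodic box. It is not the authors' code: the names, the equal-width stripes and the simple Metropolis test are assumptions for illustration (the actual implementation matches domains to neighbor cells and gives excess cells to the master domain, as described in Section 4).

```python
import numpy as np

def define_stripes(box_length, n_domains, cutoff, rng):
    """Randomly orient and place n_domains active stripes in a periodic box.

    Each domain owns an equal slab of the box; its active region is the slab
    shortened by `cutoff`, so that active regions in different domains are
    separated by at least the interaction range.
    """
    axis = rng.integers(3)                 # stripe orientation: x, y or z
    origin = rng.uniform(0.0, box_length)  # randomly shifted "origin" plane
    width = box_length / n_domains
    edges = [(d * width, (d + 1) * width - cutoff) for d in range(n_domains)]
    return axis, origin, edges

def in_active_region(x, axis, origin, edges, domain, box_length):
    """True if coordinate vector x lies inside the active stripe of `domain`."""
    s = (x[axis] - origin) % box_length    # position relative to the origin plane
    lo, hi = edges[domain]
    return lo <= s < hi

def attempt_move(x_new, delta_E, axis, origin, edges, domain,
                 box_length, beta, rng):
    """Metropolis trial with the extra boundary condition of the parallel
    scheme: a trial position that leaves the active region is rejected
    outright, which preserves detailed balance across the stripe boundaries."""
    if not in_active_region(x_new, axis, origin, edges, domain, box_length):
        return False
    return True if delta_E <= 0 else rng.random() < np.exp(-beta * delta_E)
```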



Fig. 1. Schematic representation of the domain decomposition scheme. (a) Independent sequences of individual Monte Carlo moves (e.g., displacement of the shaded atom) are performed in parallel within independent “active” regions. (b) The active regions are periodically redefined. To maintain the detailed balance condition, atoms are forbidden from leaving the current active region.

The direct rejection of moves outside active regions is the simplest means to satisfy the vital detailed balance or “microscopic reversibility” criterion [2]. Inactive particles outside the active region are held fixed, and are thus not selected for trial MC moves. This ensures that the net flux between configurations involving particle transfers across these artificial boundaries is strictly zero. A formal proof of microscopic reversibility is given in Appendix A. It can readily be seen that the overall simulation follows a strict Markov sequence, as required [2], but the sequence number of each individual move becomes arbitrary. The remaining issue is the “cycle” time between domain definitions, which can also be chosen arbitrarily with no loss of rigor. However, too long a cycle would obviously impact on the ergodicity of the simulation. In this work we simply use a convenient number of MC steps, noting that any loss of ergodicity would be reflected directly in the benchmark results. One obvious consequence is that any ability to interpret the MC displacements in terms of physically realistic particle “dynamics” is reduced. Other more general features and implications of the scheme, as applied to simple atomic fluids, will be reported in a future publication.

This simple parallel scheme is suitable for either shared or distributed memory facilities, and is applicable in principle to any MC move that requires only localized displacements of atoms. Nevertheless, applying the method to chain molecular systems requires some care. The fact that moves in the different domains are performed asynchronously is important. Different MC move algorithms required for high polymer simulations can require very different amounts of CPU time, which can also vary substantially depending on whether or not the move is accepted. MC domain decomposition strategies requiring synchronized simultaneous moves on different processors [27,32] are clearly unsuitable in these circumstances. However, our asynchronous approach has the additional caveat that different types of moves can also lead to differences in the boundary conditions that are applied to restrict the motion of atoms to within each active domain.

The types of MC moves considered in this work are detailed in Fig. 2. “Flip” moves [8] are used to perform small displacements for a single backbone atom. “End rotation” moves [33] perform large displacements for an end atom. In each of these two cases, the boundary condition is identical to that for simple atomic systems. If the new trial position for the atom is outside the active domain, the move is simply rejected.


Fig. 2. Simplified two-dimensional representation of the different Monte Carlo moves employed in the parallel polymer simulations. (a) End rotation, (b) flip, (c) concerted rotation, (d) reptation, (e) directed end bridging, here with l = 3, (f) volume fluctuation. Unfilled circles denote the atoms which are displaced to new positions (shaded circles), while remaining atoms (filled circles) are held fixed.

Concerted rotation (ConRot) moves [8,33] involve displacement of a sequence of 5 backbone atoms. The first and last atoms are given a small displacement via a torsion “driver”, while the remaining 3 atom positions are chosen from a set of feasible solutions (“bridges”) to the resultant geometric constraint equations. All feasible bridges need to be identified for both the forward and reverse moves, as the ratio of these numbers appears in the Monte Carlo acceptance criterion [8]. The appropriate boundary condition for such a move is to scan through lists of the feasible bridges for both forward and reverse moves, deleting any which lead to one or more of the trimer atoms crossing the active boundaries. In addition, any forward move where either driver atom crosses an active boundary is immediately rejected. End bridging (EB) moves involve the displacement of an atom trimer only, and hence require the same boundary condition as the trimer bridge component of the ConRot move.
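A minimal sketch of this boundary condition, assuming hypothetical data structures: each candidate bridge is a sequence of three trial atom positions, and `in_active` is a membership test such as `in_active_region` above. The surviving forward and reverse counts are what would enter the ConRot/EB acceptance ratio.

```python
def filter_bridges(bridges, in_active):
    """Drop any candidate trimer bridge with an atom outside the active region."""
    return [b for b in bridges if all(in_active(x) for x in b)]

def conrot_bridge_counts(forward_bridges, reverse_bridges, in_active):
    """Apply the boundary condition to both bridge lists and return their sizes.
    A zero forward count means the move is rejected immediately."""
    nf = len(filter_bridges(forward_bridges, in_active))
    nr = len(filter_bridges(reverse_bridges, in_active))
    return nf, nr
```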

Directed internal bridging (DIB) and directed end bridging (DEB) moves [9,34] displace a sequence of l atoms, where l is typically in the range 3–8. Each of the first l − 3 atoms is regrown by the configuration bias (CB) technique [31], i.e. by energetically-weighted random selection from a set of trial configurations. The last 3 atoms are regrown by trimer bridging as described above. The old configuration is then reconstructed using the same sequence of CB regrowth and trimer bridging. The appropriate boundary condition for both CB and bridging steps is to delete trial configurations where an atom has crossed the boundary.
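The following sketch shows one CB regrowth step with this boundary condition, under the assumption of a standard Rosenbluth-weighted selection among n_q trials (the paper does not show its selection routine, so names and the returned weight are illustrative): trials that leave the active region are given zero weight, and the sum of the surviving Boltzmann factors would enter the move's acceptance criterion.

```python
import numpy as np

def cb_select(trial_positions, trial_energies, in_active, beta, rng):
    """Configuration-bias selection of one trial position out of n_q candidates.

    Trials outside the active region are deleted (zero weight); the remaining
    trials are chosen with Boltzmann weights.  Returns the chosen position and
    the Rosenbluth-type weight W of the step (0.0 if no feasible trial remains).
    """
    w = np.array([np.exp(-beta * e) if in_active(x) else 0.0
                  for x, e in zip(trial_positions, trial_energies)])
    W = w.sum()
    if W == 0.0:
        return None, 0.0          # no feasible trial configuration: reject
    k = rng.choice(len(w), p=w / W)
    return trial_positions[k], W
```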

EB and DEB moves have an additional complication, in that their acceptance is dependent on the lengths of the two chains involved [7,9]. For high molecular weight polymers, each of these chains is likely to straddle several active domains. Hence, if such moves were to be performed in different domains, they would no longer be independent, and a strict sequence would need to be maintained across the processes. Here we use the simple solution of restricting variable connectivity moves to the first domain (master process) only, which allows us to maintain the asynchronous form of the domain decomposition.

Reptation moves [35] consist of the deletion of an atom at one end of a given chain, and the simultaneous growth of an atom at the other end of the chain. Such moves are valuable in maximizing the efficiency of the variable connectivity moves [9]. However, reptation moves are also difficult to parallelize by asynchronous domain decomposition, because the two ends will generally be in different domains. For simulations with reptations but


without variable connectivity moves like EB or DEB, it would be possible to perform independent reptations in each domain, by restricting candidate reptation moves to chains which have both ends in the same domain (regardless of the position of the rest of the chain). However, this would prevent us from using EB or DEB moves, since these create a new chain end in their own domain, and would thus need to be synchronized with reptations in other domains. Hence to implement reptation moves in this work we must further restrict them to chains which have both ends in the first domain. Since both EB/DEB and reptation moves are performed only within this domain by a single process, they remain strictly sequential.

The final type of move used here is the volume fluctuation move [36], which maintains constant pressure. Such a move involves simultaneous displacement of all atoms, and cannot be parallelized by our domain decomposition method. In principle it would be possible to perform volume fluctuations on specified domains, using a procedure analogous to the SLAB method of Escobedo and de Pablo [37]. However, this method was devised for flexible chains, and would be very difficult to implement for atomistic chains with fixed or stiffly constrained bond angles.

Here we utilize the fact that volume fluctuations are performed much less frequently than other MC moves, and simply separate them from the domain decomposition process. In this way one or more volume fluctuations may be performed at the end of each decomposition cycle. Each volume fluctuation move requires a full recalculation of the potential energy of the system. This requires a double summation over all interacting atom pairs; the corresponding loops can be readily parallelized by standard multitasking.
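As a rough illustration of this parallel double sum, the sketch below splits the outer pair loop across worker processes. It is an assumption-laden stand-in: Python multiprocessing replaces the shared-memory microtasking used on the actual vector machine, and a brute-force minimum-image search replaces the cell and Verlet lists of the real code.

```python
import numpy as np
from multiprocessing import Pool

def pair_energy_slice(args):
    """Nonbond LJ energy of all pairs (i, j > i) for i in a slice of atoms,
    using the minimum image convention and a spherical cutoff."""
    x, box, lo, hi, eps, sigma, rc = args
    e = 0.0
    for i in range(lo, hi):
        d = x[i + 1:] - x[i]
        d -= box * np.round(d / box)          # minimum image
        r2 = (d * d).sum(axis=1)
        r2 = r2[r2 < rc * rc]
        sr6 = (sigma * sigma / r2) ** 3
        e += (4.0 * eps * (sr6 * sr6 - sr6)).sum()
    return e

def total_nonbond_energy(x, box, eps, sigma, rc, n_tasks=8):
    """Split the outer loop of the pair double sum across n_tasks workers,
    mimicking the microtasked full energy recalculation for a volume move."""
    n = len(x)
    bounds = np.linspace(0, n - 1, n_tasks + 1, dtype=int)
    jobs = [(x, box, bounds[k], bounds[k + 1], eps, sigma, rc)
            for k in range(n_tasks)]
    with Pool(n_tasks) as pool:
        return sum(pool.map(pair_energy_slice, jobs))
```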

4. Implementation

The basic structure of the code is depicted in Fig. 3. The code uses the MPI protocol [38] to perform simultaneous asynchronous MC moves in the different active domains as different processes. All DEB and reptation moves are performed in the first domain by the master process. Since master and slave processes have different protocols or “mixes” of MC moves, the average CPU time per MC move may be different. Hence, the ratio of the number of MC steps performed by master vs. slave processes can be adjusted heuristically for load balancing.

Traditionally, a single MC “cycle” for a system of N atoms corresponds to N simple displacement moves [2]. Here we define a cycle as the total number of MC moves performed for a defined allocation of active domains. For a system of 192,000 atoms, a cycle length of 200,000 steps is convenient, corresponding to several minutes of fully independent multiprocessing between message passing operations.

At the beginning of each cycle, we define a new set of domains, and their associated active regions or “stripes”. This procedure is illustrated in Fig. 4. The positions and orientation of the domains are selected at random. For convenience the domain positions may be matched to the nonbond neighbor cells. Orientations are limited to the three planes perpendicular to the x-, y- and z-axes. The domain widths and stripe widths/separations are calculated from the system size, the nonbond interaction range and the number of processes. Since the number of cells is an integer value, in some cases the number of processors chosen dictates that the domains have differing widths. Here we divide cells up evenly between the different domains, with any excess being allocated to the first domain.
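A small sketch of this cell allocation (hypothetical function name) reproduces the master/slave width ratios quoted later in this section for the 16-cell box used here:

```python
def domain_widths(n_cells, n_domains):
    """Split an integer number of neighbor cells evenly among the domains,
    giving any excess cells to the first (master) domain."""
    base = n_cells // n_domains
    excess = n_cells - base * n_domains
    return [base + excess] + [base] * (n_domains - 1)

# domain_widths(16, 2) -> [8, 8]
# domain_widths(16, 3) -> [6, 5, 5]
# domain_widths(16, 5) -> [4, 3, 3, 3, 3]
# domain_widths(16, 8) -> [2, 2, 2, 2, 2, 2, 2, 2]
```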

At the beginning of each cycle, the necessary atom coordinates are copied from the master process to the slave processes. At the end of the cycle, the updated coordinates are transferred back to the master process. The change in energy of the entire system is calculated by summing the energy changes over all the domains. Single (or, if necessary, multiple) volume fluctuation moves are then performed, using the updated coordinates for the entire system. Note that performing a specified type of MC move at regular intervals satisfies the balance condition, rather than the detailed balance condition [39]. The master process then performs all input/output operations, notably a regular dump of the system configuration, plus a full neighbor list re-evaluation and a series of data integrity checks.
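The per-cycle communication pattern can be summarised with the mpi4py skeleton below. This is not the authors' Fortran/MPI code: the stub functions, the array shapes and the merge-by-index step are assumptions used to show where the broadcast, gather and reduction would sit relative to the MC moves and the volume fluctuation.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

coords = np.zeros((192000, 3)) if rank == 0 else None   # master holds the system

def run_mc_cycle(x, domain):
    """Stub: asynchronous MC moves inside this domain's active stripe.
    Returns indices and new positions of displaced atoms, plus the energy change."""
    return np.empty(0, dtype=int), np.empty((0, 3)), 0.0

def volume_fluctuation(x, delta_E):
    """Stub: constant-pressure volume move performed on the full system."""
    return x

for cycle in range(10):
    coords = comm.bcast(coords, root=0)                  # master -> all: coordinates
    idx, pos, dE = run_mc_cycle(coords, domain=rank)     # independent moves per domain
    updates = comm.gather((idx, pos), root=0)            # all -> master: updated atoms
    total_dE = comm.reduce(dE, op=MPI.SUM, root=0)       # sum energy changes
    if rank == 0:
        for i, p in updates:                             # merge per-domain updates
            coords[i] = p
        coords = volume_fluctuation(coords, total_dE)    # then I/O and integrity checks
```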

In the form used here, more than 80% (21,000+ lines) of the polymer simulation code is executed in parallel by domain decomposition. The only inter-processor communication occurs at the beginning and end of each cycle. Hence the implementation is well suited to the MPI paradigm of independent processors with distributed memory.


Fig. 3. Simplified structure chart for the parallel code for atomistic polymer simulation. The dashed line is used to distinguish routines executed by the master process only, from routines executed by all processes in parallel.

Considerable effort was devoted to optimizing the vector performance and scalability of the code. This is largely governed by a series of list structures, as described previously [8]. The following lists are used:

• the list of nonbond neighbor cells, described above,
• a conventional Verlet neighbor list [1] for each atom,
• a list of small cells for fast screening of atom overlaps in ConRot and DEB moves,
• an end segment neighbor list, containing target atoms for potential DEB moves.

In addition to these, it is optimal to define an augmented neighbor list for any moves that use configuration bias (such as DEB). Each atom regrown by CB requires the creation of n_q trial configurations of that atom, with a full evaluation of the nonbond energy of the atom in that configuration. The typical value n_q = 20 is used throughout the present work. In principle, these nonbond terms can be calculated using the neighbor cells. In practice, it is more efficient to use these neighbor cells to create an augmented neighbor list, containing all atoms that are within interaction range for all n_q trial configurations of the newly grown atom. This may be readily accomplished using


Fig. 4. Schematic representation of the domain decomposition procedure, as implemented for 3 processors. The solid thick line denotes the randomly selected temporary “origin” plane, while dashed thick lines denote the other domain boundaries. The grid of small dotted squares represents the neighbor cells. Shading is used to indicate the active regions. Note that (a) the first (master) domain is larger than the remaining domains, (b) the domains are matched to the cell boundaries but the active regions are not, (c) the separation of active regions is set to the nonbond cutoff length of 2.3σ, (d) the domain definitions use periodic boundaries.

Fig. 5. Schematic representation of the procedure for vectorizing the configuration bias algorithm (e.g., DEB moves). For n_q trial configurations q of the regrown atom, a spherical augmented neighbor list (large circle) is defined which encompasses all interacting neighbors of each configuration q (smaller shaded circle).

a spherical cutoff of the nonbond interaction cutoff plus one bond length, centred on the previous atom. This procedure is illustrated in Fig. 5.

The nonbond neighbor cells are the key to simulation performance for molecular systems such as those studied here. The Verlet lists, end lists and augmented CB lists for each atom are evaluated from a 3 × 3 × 3 array of neighbor cells, while the domain decomposition also uses the cell definitions. The Verlet list uses a shell of 1 Å to reduce the number of list re-evaluations following local moves. As described above, the CB list uses a shell of one bond length, in this case 1.54 Å. The end lists are somewhat smaller, as they are governed by the maximum spatial separation of a pair of atoms connected by 4 bonds. Here we set this maximum radius to 5.1 Å, with a 0.3 Å shell.
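A sketch of the augmented CB list construction, assuming a brute-force minimum-image search in place of the 3 × 3 × 3 cell scan used in the actual code (function and argument names are hypothetical). Since every trial position of the regrown atom lies within one bond length of the previous atom, a single list with radius (nonbond cutoff + bond length) covers all n_q trial configurations and can be reused for each of their energy evaluations.

```python
import numpy as np

def augmented_cb_list(prev_atom, coords, box, rc_nonbond, bond_length):
    """Indices of all atoms that could interact with any trial position of the
    atom grown next to `prev_atom` (spherical cutoff rc_nonbond + bond_length)."""
    r_list = rc_nonbond + bond_length
    d = coords - prev_atom
    d -= box * np.round(d / box)                 # minimum image convention
    return np.nonzero((d * d).sum(axis=1) < r_list * r_list)[0]

# Usage with the values quoted in the text (2.3*sigma = 9.062 A, bond = 1.54 A):
# neighbors = augmented_cb_list(x_prev, coords, box, rc_nonbond=9.062, bond_length=1.54)
```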


Hence the dimensions of the neighbor cells are governed by the use of augmented CB lists. This cell dimension is thus set to a minimum value equal to the nonbond cutoff plus one bond length (= 2.3σ + 1.54 Å, or 10.602 Å), then adjusted upwards so that the box dimension corresponds to an integer number of cells. It is worth noting that it is possible for the number of nonbond cells in the system to change between domain decomposition cycles, due to the volume fluctuations. This would in turn lead to differences in the parallel domain sizes between cycles, which may influence the performance (but not the rigor) of the code. This was not observed for the system sizes studied in this work.

The dimensions of the full cell used here correspond to 16 × 16 × 16 nonbond cells. Hence the ratio of widths for master and slave domains is 8:8, 6:5:5, 4:4:4:4, 4:3:3:3:3 and 2:2:2:2:2:2:2:2, respectively, for 2, 3, 4, 5 and 8 processors.

The direct and indirect use of the nonbond neighbor cells for all nonbond interactions confers order-N scaling on the code's performance, i.e. the CPU time is an approximately linear function of the number of MC moves, which should be chosen to be proportional to the total number of atoms N. The extensive use of neighbor lists allows vectorization of all nonbond evaluations. This is particularly important for DEB moves, since each move of length l requires a total of 2l − 6 augmented lists, with n_q passes through each list, corresponding to the n_q trial configurations q.

The other major time consuming portion of the code is the solution of the trimer geometric equations [8,33] used in ConRot and DEB moves. After an efficient systematic scan is used to narrow the search for potential solutions, the subsequent convergence utilizes a weighted bisection method with a comparatively small (and irregular) number of iterations, which is difficult to vectorize effectively. Hence this small but frequently used portion of code runs relatively slowly. By judicious inlining and minor code rearrangements, the time used in this set of routines was reduced to approximately 25% of the total CPU time.

5. Run profile

Fig. 6 shows a typical run profile for the parallel code, using a simple mix of ConRot, end rotation and flip moves. Note that in this instance we wish to ensure an equal MC mix across all 8 processors. Hence the run does not include reptation or DEB moves. The run uses the large system size (192,000 atoms) but for a relatively short duration of 5 × 10^6 MC steps. All configurational data is saved to disk every 10^6 MC steps; for postprocessing of production runs this interval is a convenient compromise of code performance, storage and statistical considerations.

The timing results may be summarized as follows:

• the total CPU time used by the run is 20,289 seconds,
• the CPU time for the MPI master process is 4054 seconds, compared to an average of 2320 seconds for each slave process,
• the variation between slave processes is less than 0.5%,
• the master CPU time includes 1220 seconds of CPU time microtasked across 8 processors, measuring 153 seconds of parallel time, with less than 1% variation in execution time for each microtask,
• the total serial plus parallel time is 2990 seconds,
• the elapsed (wall-clock) time is 7084 seconds, reflecting the external load on the machine in a non-dedicated environment.

The parallel microtasking time corresponds to re-evaluations of the total potential energy, largely for volume fluctuation moves. The consistency of CPU times amongst slaves and also amongst microtasks indicates well balanced parallel code with a very even load distribution.

Since the MC mix for each process is the same in this run, we can deduce that the CPU time spent by the master process on regular parallel MC moves is about 2320 seconds. Discounting the microtasked time, the total serial


Fig. 6. Approximate parallel profile for Run 1/8. Large bars represent processes while small bars represent microtasks. Dark shading represents system time as opposed to user time.

time is thus 4054 − 2320 − 1220 = 514 seconds. Of this time, some 251 seconds is used for a one-off evaluation of the total non-bond energy of the system, without the neighbor lists. While this routine may still be used to occasionally verify the integrity of the neighbor lists, for long runs this time contribution becomes insignificant. The CPU time required for routines related to input/output operations is approximately 98 seconds. This leaves a remainder of 165 seconds for miscellaneous initialization and bookkeeping operations.

The fractional serial time α is 514/20289 = 0.025, corresponding to 97.5% parallelism. The processor speedup p can then be calculated by Amdahl's Law [40] as (0.025 + 0.975/8)^-1, or 6.8. Alternatively we can calculate the processor speedup in the long time limit, p∞, by subtracting the one-off 251 second non-bond calculation from the total serial time. This gives a fractional serial time α = 0.013, or 98.7% parallelism, and a speedup p∞ = 7.3.
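This calculation can be written as a one-line helper; the sketch below (illustrative names) reproduces the figures quoted in the text from the measured serial and total CPU times.

```python
def amdahl_speedup(serial_time, total_cpu_time, n_proc):
    """Processor speedup from Amdahl's law, given the serial CPU time and the
    total CPU time of the run."""
    alpha = serial_time / total_cpu_time        # fractional serial time
    return 1.0 / (alpha + (1.0 - alpha) / n_proc)

# amdahl_speedup(514, 20289, 8)        -> ~6.8  (p for Run 1/8)
# amdahl_speedup(514 - 251, 20289, 8)  -> ~7.3  (p_infinity, long-run limit)
```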

For the present run, the average floating point performance for the master process is 363 MFlops, with 95% vector operations and a mean vector length of 150. For the slave processors, which provide a better measure of overall performance for long production runs, the equivalent figures are 207 MFlops, 77% vector operations and a vector length of 84. Note that when adding reptation and DEB moves, these figures increase to 511 (282) MFlops, 95% (83%) vector operations and 160 (101) vector length for master (slave) processes. From these observations we can estimate that the per-processor performance averages 40–70 MFlops for the scalar portions of the code, and 600–800 MFlops for the vector portions, depending on the MC mix and the processor count. The latter is rather lower than the peak processor performance of 8 GFlops, since the average vector length is governed largely by the average number of nonbond interactions per atom, which is well below the vector register length of 512 words. Hence these figures reflect a code that benefits substantially, but less than ideally, from vector hardware.

It is worth noting that the present implementation is designed to be optimal for a shared memory architecture. Specifically, the frequent full re-evaluations of the nonbond neighbor cells are reliant on the ease with which such an operation may be microtasked. For running on a fully distributed network, it would most likely be more efficient to recover the cell lists from the different processes by a masked gather operation, rather than re-evaluate the lists from scratch for each parallel cycle.


6. Performance measures

In order to quantitatively compare the performance of the code for varying numbers of processors, it is necessary to consider not only the extra CPU overheads associated with parallelization, but also any loss of efficiency in the Monte Carlo configurational sampling imposed by the domain decomposition process. Firstly, attempted moves that are near the domain boundaries are less likely to be accepted. Secondly, those moves that are accepted have smaller average displacements. Since the parallelization process influences different MC moves in different ways, it is also necessary to consider a range of MC mixes.

The MC acceptance ratio gives a very incomplete picture of the efficiency of sampling configuration space for these complex molecular systems and MC mixes [9]. Instead we define first and second degree autocorrelation functions for the second neighbor vector (N2V-ACF) about each atom i,

$$f_1(t) = \left\langle \mathbf{b}_i(t + t_0) \cdot \mathbf{b}_i(t_0) \right\rangle, \qquad (4)$$

$$f_2(t) = \frac{3}{2}\left\langle \left( \mathbf{b}_i(t + t_0) \cdot \mathbf{b}_i(t_0) \right)^2 \right\rangle - \frac{1}{2}, \qquad (5)$$

where b_i denotes the unit vector from atom i − 1 to its second neighbor atom i + 1, and angled brackets denote an ensemble average over all atoms i and multiple time origins t_0. The second degree function (Eq. (5)) is the most suitable for simulations with variable connectivity moves [9], as it is not affected by artefacts associated with renumbering of the atoms. The “time” variable t for these functions corresponds to the number of MC steps, and is here scaled to correspond to the total CPU time (on all processors).
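For reference, Eqs. (4) and (5) can be evaluated from stored configurations as in the sketch below. The array layout (frames × atoms × 3, holding the unit vectors b_i at each stored configuration) is an assumption for illustration.

```python
import numpy as np

def n2v_acf(b, max_lag):
    """First- and second-degree second neighbor vector ACFs (Eqs. (4)-(5)).

    b has shape (n_frames, n_atoms, 3); the average runs over all atoms and
    all available time origins t_0 for each lag.
    """
    f1 = np.empty(max_lag + 1)
    f2 = np.empty(max_lag + 1)
    n_frames = b.shape[0]
    for lag in range(max_lag + 1):
        # dot products b_i(t0 + lag) . b_i(t0) for every origin and atom
        dots = np.einsum('tij,tij->ti', b[:n_frames - lag], b[lag:])
        f1[lag] = dots.mean()
        f2[lag] = 1.5 * (dots ** 2).mean() - 0.5
    return f1, f2
```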

Another common measure of local relaxation rate is the atomic mean squared displacement function,

$$f_a(t) = \left\langle \left( \mathbf{x}_i(t + t_0) - \mathbf{x}_i(t_0) \right)^2 \right\rangle, \qquad (6)$$

where x_i denotes the coordinates of atom i. The slope of this function allows one to define an effective atomic diffusion coefficient,

$$D_a = \left\langle \frac{f_a(t)}{6t} \right\rangle. \qquad (7)$$

For these simulations, this quantity is dominated by the efficiency of reptation moves [9].

Similarly, the long-time relaxation of global chain characteristics can be quantified via the mean squared displacement function of the chain centre of mass x_COM, which can be evaluated by ensemble averaging over all chains using the expression

$$f_c(t) = \left\langle \left( \mathbf{x}_{\mathrm{COM}}(t + t_0) - \mathbf{x}_{\mathrm{COM}}(t_0) \right)^2 \right\rangle. \qquad (8)$$

The chain centre of mass diffusion coefficient D_c is then defined as

$$D_c = \left\langle \frac{f_c(t)}{6t} \right\rangle. \qquad (9)$$
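A minimal sketch of how Eqs. (6)–(9) translate into an analysis routine, again with an assumed (frames × particles × 3) array of unwrapped coordinates and with "time" measured as CPU time per stored frame, as defined above:

```python
import numpy as np

def msd(x, max_lag):
    """Mean squared displacement (Eq. (6) or (8)), averaged over particles and
    time origins; x has shape (n_frames, n_particles, 3) and must be unwrapped."""
    out = np.empty(max_lag + 1)
    n = x.shape[0]
    for lag in range(max_lag + 1):
        d = x[lag:] - x[:n - lag]
        out[lag] = (d * d).sum(axis=2).mean()
    return out

def diffusion_coefficient(msd_values, dt_per_frame):
    """Effective diffusion coefficient D = <f(t)/(6t)> (Eqs. (7) and (9)),
    averaged over the nonzero lags."""
    t = np.arange(1, len(msd_values)) * dt_per_frame
    return np.mean(msd_values[1:] / (6.0 * t))
```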

One may quantitatively compare the local relaxation rate between serial and parallel runs by determining the time multiple r required to superimpose the second neighbor autocorrelation functions. The product of r with the processor speedup p gives the net parallel speedup, i.e. the scale-up obtained in MC configurational sampling by employing more processors. Similarly, the product r · p∞ gives the ideal net speedup for infinitely long runs. The global (long range configurational) relaxation rate is easier to compare, since it is measured directly by a single quantity, the diffusion coefficient D_c. Hence the corresponding speedup measures are D_c · p and D_c · p∞.


7. Benchmarks

Table 1 lists the MC mix, number of processors and acceptance ratios for each of the benchmark runs considered in this work. The runs are labeled via a shorthand describing the MC mix and the number of processors; hence Run 1/8 denotes the first MC mix executed on 8 processors. Each run also contains 0.0005% volume fluctuation moves, which are not listed. All runs are of the same, relatively short length of 5 × 10^6 MC steps. The DEB moves are of length l randomly selected in the range 3–6 [9]. Note that the maximum fraction of reptation moves is nominally set to 5%, but that the actual proportion of reptation moves varies from run to run. The requirement that reptation moves only be performed by the master process, on chains with both ends in the corresponding active domain, means that frequently the number of candidate reptation moves drops to zero. In such an event, attempted reptation moves are supplanted by extra attempted end rotation moves.

The reduction in acceptance ratio with increasing number of processors can be seen in Table 1. A small reduction is apparent for ConRot moves, and a rather more obvious reduction is seen for DEB moves. For the remaining MC move types, any reduction is too small to be readily measurable. Such trends are not surprising; ConRot moves and (particularly) DEB moves involve the largest displacements and the largest numbers of atoms. Note that in a DEB move, there is no direct rejection due to attempted displacement of an atom outside the active domain. Rather, the domain boundary reduces the number of feasible CB configurations and trimer solutions, thus reducing the likelihood that a feasible bridge can be identified. It is worth emphasizing that the data in Table 1 show how the fraction of accepted moves is reduced slightly by parallelization, but not how the average atom displacement in each accepted move is reduced.

Table 2 lists the CPU times and performance measures for the runs detailed in Table 1. Again the elapsed (wall-clock) times are given only for qualitative purposes. Using our domain decomposition Monte Carlo method, optimum load balancing may be readily achieved manually by simply adjusting the ratio of MC steps performed

Table 1. Run details for the Monte Carlo simulations

Run        Flip           End rotation   Reptation      ConRot         DEB
mix/proc   mix%   acc%    mix%   acc%    mix%   acc%    mix%   acc%    mix%   acc%
1/1        15.0   76.7    1.0    15.9    0.0    –       84.0   8.0     0.0    –
1/8        15.0   76.5    1.0    15.2    0.0    –       84.0   7.5     0.0    –
2/1        15.0   76.6    1.0    17.0    5.0    8.1     79.0   8.0     0.0    –
2/2        15.0   76.5    1.0    16.8    5.0    7.9     79.0   7.9     0.0    –
2/4        15.0   76.5    1.8    16.1    4.2    7.3     79.0   7.8     0.0    –
2/8        15.0   76.4    5.6    16.1    0.4    6.4     79.0   7.5     0.0    –
3/1        15.0   76.6    1.0    17.0    5.0    7.7     69.0   8.0     10.0   0.22
3/3        15.0   76.6    1.2    16.6    4.8    8.0     69.0   7.8     10.0   0.18
3/5        15.0   76.7    2.9    16.6    3.1    8.7     69.0   7.7     10.0   0.15
4/1        15.0   76.6    1.0    16.8    5.0    7.6     64.0   8.1     15.0   0.20
4/2        15.0   76.6    1.0    16.5    5.0    8.2     64.0   7.9     15.0   0.17
4/3        15.0   76.5    1.1    16.9    4.9    7.2     64.0   7.8     15.0   0.18
4/4        15.0   76.6    2.8    16.2    3.2    8.2     64.0   7.8     15.0   0.15
5/1        15.0   76.6    1.0    17.0    5.0    8.1     59.0   8.0     20.0   0.18
5/2        15.0   76.5    1.0    18.0    5.0    8.6     59.0   7.9     20.0   0.17
5/3        15.0   76.5    1.1    16.4    4.8    7.3     59.0   7.8     20.0   0.14
6/1        15.0   76.7    1.0    17.3    5.0    7.9     49.0   8.0     30.0   0.21
6/2        15.0   76.6    1.0    17.0    5.0    8.0     49.0   7.9     30.0   0.17
7/1        15.0   76.7    1.0    17.5    5.0    8.3     39.0   8.0     40.0   0.19
7/2        15.0   76.6    1.0    16.8    5.0    8.0     39.0   8.0     40.0   0.15
8/1        15.0   76.6    1.0    17.0    5.0    8.0     29.0   8.0     50.0   0.20
9/1        15.0   76.7    1.0    17.6    5.0    8.6     19.0   8.1     60.0   0.19
10/3       8.0    76.5    4.0    17.3    4.0    8.3     60.0   7.7     24.0   0.17


Table 2. Benchmark results for the Monte Carlo simulations (symbols as defined in the text)

mix/proc   Elapsed (s)   CPU (s)   p      p∞     Da (Å^2 s^-1)   Dc (Å^2 s^-1)   r
1/1        32,745        20,291    1.00   1.00   5.4 × 10^-6     2.4 × 10^-9     1.00
1/8        7084          20,289    6.77   7.36   4.9 × 10^-6     2.5 × 10^-9     0.81
2/1        31,837        19,877    1.00   1.00   2.1 × 10^-5     1.7 × 10^-7     1.00
2/2        25,401        19,655    1.90   1.93   1.8 × 10^-5     8.6 × 10^-8     0.99
2/4        47,160        19,580    3.61   3.75   1.3 × 10^-5     6.5 × 10^-8     0.95
2/8        6756          19,705    6.58   7.15   5.9 × 10^-6     4.2 × 10^-9     0.85
3/1        34,530        20,520    1.00   1.00   2.0 × 10^-4     0.17            1.00
3/3        15,736        19,644    2.81   2.89   4.9 × 10^-5     0.11            0.96
3/5        11,519        19,348    4.40   4.63   2.9 × 10^-5     0.10            0.92
4/1        49,442        20,578    1.00   1.00   2.0 × 10^-4     0.22            1.00
4/2        30,803        19,658    1.96   1.98   1.2 × 10^-4     0.16            1.02
4/3        14,650        19,653    2.84   2.92   6.6 × 10^-5     0.15            1.01
4/4        8881          19,319    3.62   3.76   3.5 × 10^-5     0.08            0.96
5/1        49,660        20,644    1.00   1.00   3.0 × 10^-4     0.16            1.00
5/2        34,581        19,491    1.97   1.99   1.3 × 10^-4     0.19            1.02
5/3        32,323        19,051    2.79   2.87   4.9 × 10^-5     0.16            0.99
6/1        50,601        21,478    1.00   1.00   3.1 × 10^-4     0.20            1.00
6/2        18,406        19,364    2.00   2.03   1.0 × 10^-4     0.23            1.01
7/1        31,568        21,354    1.00   1.00   3.6 × 10^-4     0.22            1.00
7/2        18,763        19,027    2.03   2.06   1.4 × 10^-4     0.22            1.02
8/1        32,738        21,953    1.00   1.00   4.1 × 10^-4     0.15            1.00
9/1        32,752        22,017    1.00   1.00   4.5 × 10^-4     0.21            1.00
10/3       6976          16,436    2.93   3.03   1.3 × 10^-4     0.20            –

by master and slave processes during each cycle. As it happens, when this ratio is simply set to 1, then the MC mixes in Table 1 are all close to the optimum balance.

The measured processor speedup p shows excellent scaling for all runs, with >90% utilization of 4 or fewer processors. For >4 processors, the difference between p and the long time extrapolation p∞ becomes more pronounced, reflecting the Amdahl's Law effect of the initial serial nonbond calculation. Note that for MC mixes with a large proportion of DEB moves, the parallelization apparently becomes slightly superlinear. This is due to the reduced CPU time required for failed DEB moves, and hence is misleading unless the effects on local and global MC relaxation rate are also taken into account, as discussed later.

Table 2 lists the atomic diffusion coefficient in Å^2 per CPU second. As mentioned previously, this is not so much a measure of relaxation rate as a specific measure of the efficiency of reptation moves. Clearly this is much reduced with increasing number of processors, due to the reduced domain size and the requirement that all reptations be restricted to chains with both ends in a single specified domain. The reptation efficiency also increases with increasing DEB fraction in the MC mix, due to the synergy of these two moves in reducing “shuttling” [9].

The second neighbor vector autocorrelation functions provide a more useful measure of local relaxation rate. Fig. 7 compares the first and second degree N2V-ACFs for fixed connectivity runs with MC mix 2, using a variable number of processors. The ACF decay rate decreases slightly with increasing number of processors. The value of r, the relative relaxation rate for multiple processors, is given by the time multiple required to superimpose the ACF with that for a single processor. These values are listed in Table 2. For example, the 8 processor run provides 85% of the local configurational sampling efficiency per unit CPU time of the equivalent single processor run. This CPU time is distributed across an effective total of p = 6.58 processors. Hence the net speedup in throughput obtained by using 8 processors is 6.58 × 0.85 = 5.6, when measured in terms of a local equilibration measure such as the second neighbor ACF. For long time (i.e. production) runs, where p∞ = 7.15, the net speedup is 6.1. The corresponding net speedup for 4 processors is 3.75 × 0.95 = 3.6.


Fig. 7. Second neighbor vector autocorrelation functions of (a) first degree and (b) second degree, for Monte Carlo simulations of mix 2 with 1, 2, 4 and 8 processors.

Fig. 8. Second neighbor vector autocorrelation functions (second degree) for single processor Monte Carlo simulations, with different move mixes (see Table 1).

Fig. 8 shows how the second degree N2V-ACF for single processor runs varies with the MC mix. The dominant variable in the mix is the replacement of ConRot moves with DEB moves. While it appears that the local relaxation rate is reduced with increasing proportion of DEB moves, it should be pointed out that these runs examine the short-time relaxation only. Over longer times, runs with high DEB content are less non-exponential or “stretched” [9]. Without DEB moves, there is limited scope for local relaxation in long chain polymer simulations. The main


Fig. 9. Net parallel speedup as measured by the local equilibration measure r · p, as a function of number of processors, for different MC mixes.

point to emphasize is simply that it is more appropriate to quantify the effect of parallelization on local relaxation rate for a single MC mix, than across multiple mixes; hence the definition of the relaxation ratio r.

The net speedup r · p is shown as a function of processors for the different MC mixes in Fig. 9. Clearly the net parallel scaling is excellent for a small number of processors, but deteriorates as we approach the maximum number of processors for the prescribed system size and domain geometry. Our hypothetical maximum is 16 processors, but this would correspond to infinitely thin active “stripes” and hence zero throughput. The obvious alternative would be to use a different domain geometry, such as spherical active domains dispersed in an inactive matrix (e.g., via an FCC or other efficient packing). Obviously, an increased number of atoms would also lead to better utilization of a large number of processors. Overall, these results confirm that domain decomposition is a very viable strategy for MC simulation of molecular systems that are governed by local relaxation processes.

The necessity for global relaxation of the configurations in long chain polymer systems poses a rather more stringent demand. Table 2 lists the measured effective chain centre of mass diffusion coefficients, D_c, for the benchmark runs. It is immediately apparent that the runs without DEB moves are simply unable to attain global relaxation. The diffusion rate does tend to increase slightly with increasing proportion of DEB moves, as assessed more carefully in previous work [9]. Nevertheless, the difference is not great, and 10% DEB moves are clearly sufficient to provide effective global relaxation. The major benefit of increasing the DEB content is then to improve the statistics in measuring properties that depend on chain length [8]. The disadvantage in the present case is that it reduces the maximum number of processors.

The other noticeable trend is that the diffusion rate per unit CPU drops with increasing number of processors. This is of course due to the reduced number of candidate DEB moves when the size of the master domain is reduced. D_c is anomalously low for Run 4/4, as compared with, say, Run 3/5. This most likely reflects the fact that, as described in the Implementation section, the width of the master domain is actually the same for 4 or 5 processors.

Fig. 10 shows the net speedup in terms of global configurational relaxation, given as the product of D_c by the effective processor count p. It can be seen that, while multiple processors are clearly superior to a single processor, the dependence on the actual number of processors is very weak. Clearly any benefits obtained by increasing the number of processors are offset by the reduced domain size and the resultant reduced effectiveness of the DEB moves.


Fig. 10. Net parallel speedup as measured by the global equilibration measure D_c · p, as a function of number of processors, for different MC mixes.

Fig. 11. Comparison of second neighbor vector autocorrelation functions (second degree) for MC mix 10 with enlarged master domain, compared to other mixes, using 3 processors.

On the basis of these results, MC mix 10 (Run 10/3) was devised to further optimize the DEB efficiency. The difference between this 3-processor run and earlier runs is that the size of the master domain was increased relative to the other two domains, in the ratio 8:4:4. The corresponding data for D_c and p, shown in Table 2, confirms that this run has the highest net efficiency of global configurational sampling. As shown in Fig. 11, the net efficiency of local configurational sampling also compares favorably with other 3-processor runs, notably mix 5 which has a similar ratio of ConRot to DEB moves.


8. Long time equilibration

We now consider the performance of the code for production purposes in simulating a 32-chain polyethylene system of 192,000 atoms at thermodynamic equilibrium. The initial configuration of 32 C6000 chains, built using a distance geometry packing method [41], is equilibrated over 3 × 10^8 steps using 3 or 4 processors, and is analyzed here during 1 × 10^9 production steps using 3 processors. The MC move mix is the same as for Run 10/3. The structural and thermophysical characteristics of the melt are directly comparable to those reported previously for linear PE simulations of lower molecular weights [8], as well as experimentally, and will be detailed in a future publication.

Fig. 12 shows the atomic mean squared displacement of the 192,000 atom polyethylene melt. The average displacement of each atom is quite low, due to the relative inefficiency of reptation moves for multi-processor runs. This displacement corresponds to less than half the length of the periodic simulation cell. For a conventional simulation with a fixed composition ensemble, this would imply incomplete sampling of configuration space. However, in the variable connectivity semigrand ensemble, all atoms have equal probabilities of any sequence position in a chain of any given length. Hence the only equilibration requirement on the average atomic displacement is that it must be larger than the interatomic spacing. Satisfying this condition requires less than 10^7 MC steps.

Fig. 12 also shows the mean squared displacement of the chain centres of mass. While we have found in the past that this is the most suitable means for quantitatively characterizing the large scale degrees of freedom [8,9], in the present work an unusual difficulty arises. For such large molecular systems, it is necessary to analyze configurations separated by intervals of 10^6 MC steps or greater, to keep the resultant file sizes to manageable levels. However, the rate of displacement of chain centres of mass for these long chain systems is so great that the average displacement during such an interval is comparable to the cell size. Hence the reported equilibration rate is actually a lower bound on the true rate. This observation again represents a stark contrast to a conventional, fixed-connectivity simulation.

Fig. 13 depicts the autocorrelation functions for the second neighbor and end-to-end vectors. The latter function decays to zero very rapidly (about 2 × 10^7 MC steps), indicating that a full scale production run samples a very large number of independent global chain configurations. The former decays more slowly, confirming that it is the local, rather than global, chain relaxation that governs the overall equilibration.

Fig. 12. Atomic and centre-of-mass mean squared displacement functions for extended MC simulation of 32-chain C6000 melt.


Fig. 13. Second neighbor and end-to-end vector autocorrelation functions for extended MC simulation of 32-chain C6000 melt.

Fig. 14. Square of the chain radius of gyration for extended MC simulation of the 32-chain C6000 melt. Circles and line represent instantaneous values and cumulative average, respectively.

Fig. 14 displays the evolution of the chain radius of gyration during the course of the simulation. The instantaneous values (averaged over the 32 chains in the cell) exhibit large and rapid fluctuations, while the cumulative average converges to a constant value, again reflecting vigorous sampling of the global chain characteristics.

Finally, Fig. 15 depicts the squares of the chain end-to-end distance and radius of gyration as functions of molecular weight. In each case a linear dependence is observed, which may be extrapolated to the origin, while the slopes differ by a factor of 6. This reflects the expected behavior for a bulk homopolymer melt. Hence, even though the dimensions of the longer chains often exceed the size of the periodic simulation cell, the resultant finite size effects [42] on these static properties are sufficiently small that they are difficult to detect for the present molecular system.


Fig. 15. Mean squared end-to-end distance and radius of gyration as a function of chain length l for MC simulation of the 32-chain (polydisperse) C6000 melt.

The results in Figs. 12 to 15 thus confirm that it is now feasible to perform production simulations of genuine high polymer melts, which can achieve full equilibration on all length scales.

9. Discussion

From the results shown here, it is clear that conventional, fixed connectivity Monte Carlo simulations, involving arbitrarily long chains, detailed interatomic potentials and complex move protocols, can be performed very efficiently on parallel processors. The procedure we have developed is very general and flexible, requires minimal communication between tasks, and is suited equally well to shared memory or distributed memory architectures. In principle there is no limit to the number of processors that can be used. In practice, very large systems or a large number of processors will require careful consideration of the optimum domain geometry and transfer of coordinate data. The use of vector processor systems provides additional performance benefits. However, the overall vector performance is limited by the complex algorithms used to solve the geometric rebridging equations, which would require major redesign for fully optimal vector operation. Alternative formulations for fixed and variable bond angles [43–45] may provide opportunities for enhancing vector performance.

For variable connectivity simulations using the simple master/slave protocol described here, a comparatively small number of processors (2–5) is clearly desirable, simply because of the limits imposed by restricting the vital connectivity-altering moves to a single master task. This restriction can be partially alleviated by exaggerating the size difference between master and slave domains.

The choice of processor configurations for production work naturally depends on the size and number of simulations required and on the nature of the available computing resources. A large Monte Carlo simulation can be broken up into shorter independent blocks and run in embarrassingly parallel fashion [18] on one or a few processors, although for high polymers the long decorrelation times limit the effectiveness of this approach. An interesting possibility for performing a single large run would be to utilize a fast machine for the master process, and several smaller machines (e.g., personal computers) for slave processes. Since the master/slave ratio of iterations is fully adjustable, such a ratio could be chosen for optimal load balancing on this lopsided cluster. Alternatively, the domain decomposition approach used here can be run efficiently on unscheduled heterogeneous networks (e.g., nondedicated PCs) by setting each domain decomposition cycle to a fixed wall-clock time, rather than a fixed


number of iterations. Another potential avenue for parallel optimization is to overlap serial operations (notably input/output) on the master process with further MC moves on the slave processes, with appropriate adjustment of the master/slave iteration ratio.

An alternative approach would be to perform connectivity-altering moves in all domains simultaneously. This would require implementation of a global sequence counter, to maintain the integrity of the Markov sequence. For example, one might make use of the high rejection rate for DEB moves prior to the actual Metropolis trial, such that the outcome of each DEB trial is held pending completion of all previously commenced moves (in all domains) involving the two chains. The two new chain lengths can then be calculated unambiguously, and enter into the Metropolis acceptance criterion. Note that this sacrifices one of the speed advantages obtainable for truncated molecular weight distributions, whereby typically half of all DEB trial moves may be rejected instantly due to incorrect new chain lengths, prior to any expensive energy or geometry calculations. The other disadvantage of allowing DEB moves in multiple domains is that the chain lengths must be communicated between processors at irregular intervals, for example via interrogation of shared memory or a separate “sentry” process. Reptations could then also be performed in all domains, for those chains with both ends in the same active domain, as long as these moves update the chain length information for the DEB moves. These refinements await future development, which will be made easier by the enhanced communication features of new parallel programming standards such as MPI-2.0 [38].
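The bookkeeping implied by such a global sequence counter might look like the following Python sketch. It is only an assumption about one possible implementation (the class and attribute names are invented), showing how a DEB trial could be held pending until all previously commenced moves on its two chains have completed, so that the new chain lengths entering the Metropolis criterion are unambiguous.

    from dataclasses import dataclass, field

    @dataclass
    class SequenceLedger:
        counter: int = 0                          # global move sequence counter
        completed: set = field(default_factory=set)

        def start_move(self) -> int:
            self.counter += 1
            return self.counter

        def finish_move(self, seq: int) -> None:
            self.completed.add(seq)

    @dataclass
    class PendingDEB:
        seq: int
        chains: tuple                             # the two chains whose lengths change
        waiting_on: set                           # sequence numbers of earlier moves on these chains

        def ready(self, ledger: SequenceLedger) -> bool:
            # The Metropolis decision may be finalized only once every earlier
            # move touching either chain has completed.
            return self.waiting_on <= ledger.completed

    # Usage: when a DEB trial is generated, record the sequence numbers of moves
    # already in flight on the two chains, and defer the accept/reject step.
    ledger = SequenceLedger()
    earlier = {ledger.start_move(), ledger.start_move()}     # two moves in flight on these chains
    trial = PendingDEB(seq=ledger.start_move(), chains=("A", "B"), waiting_on=set(earlier))
    for s in earlier:
        ledger.finish_move(s)
    assert trial.ready(ledger)                    # now the new chain lengths can be computed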

10. Conclusions

We have shown how a conceptually simple, general method of parallelizing Monte Carlo simulations can be successfully applied to large systems of densely packed long chain molecules, with complex interatomic potential functions and MC move protocols. The method can be used to perform quantitative simulations of high polymer melts, on a scale which considerably exceeds previous efforts.

Fixed connectivity simulations have good scaling characteristics for a moderate number of processors, achieving net speedup factors of 3.6 and 6.1 for 4 and 8 processors, respectively. Provided sufficient memory is available, the number of processors may be matched to the number of atoms, and increased without limit. The performance benefit obtained through vector processing is also significant, but less than that observed for many applications (such as molecular dynamics).

Variable connectivity simulations, which use advanced algorithms such as DEB for vigorous sampling of configuration space, can also be parallelized efficiently. However, for our current implementation which restricts such moves to a single domain, the optimum number of processors remains small (2–5) regardless of system size.

Acknowledgements

This work was supported by the High Performance Computing and Communications Centre (HPCCC) and NEC Australia under the joint research and development program (HRDP). We wish to thank Robert Bell, Marek Michalewicz, Len Makin and Jeroen van den Muyzenberg (HPCCC) and Michael Rezny (NEC) for their technical assistance.

Appendix A

We seek to establish that the domain decomposition scheme depicted in Fig. 1 satisfies the detailed balance condition. Initially we consider the conventional situation where the active domain is infinite, i.e. it comprises the entire simulation box and all associated periodic images. Consider two configurations, $m$ and $n$, where the new configuration $n$ is obtainable from the previous configuration $m$ via one of the moves employed in the simulation. Detailed balance requires [1,2] that

\[
P_{\mathrm{seq}}(m \to n) = P_{\mathrm{seq}}(n \to m), \tag{A.1}
\]

where $P_{\mathrm{seq}}(m \to n)$ and $P_{\mathrm{seq}}(n \to m)$ denote the a priori probabilities for the forward and reverse transitions, respectively. Note that

\[
P_{\mathrm{seq}}(i \to j) = P(i)\,T_{\mathrm{seq}}(i \to j), \tag{A.2}
\]

where $P(i)$ is the equilibrium probability of the configuration $i$ in the specified statistical mechanical ensemble, while $T_{\mathrm{seq}}(i \to j)$ is the transition probability for the move from $i$ to $j$, which is defined by the rules for attempting and accepting the moves in a sequential simulation of the system.

When domain decomposition is applied, the new a priori transition probabilities, $P_{\mathrm{par}}(m \to n)$ and $P_{\mathrm{par}}(n \to m)$, are given by summing over all definitions of the active domains,

\[
P_{\mathrm{par}}(m \to n) = \sum_{D} P_D\, H_D(m \to n)\, P_{\mathrm{seq}}(m \to n), \tag{A.3}
\]
\[
P_{\mathrm{par}}(n \to m) = \sum_{D} P_D\, H_D(n \to m)\, P_{\mathrm{seq}}(n \to m), \tag{A.4}
\]

where $P_D$ denotes the probability of selecting a particular set of domains $D$, and $H_D(i \to j)$ equals 1 for domain definition $D$ if all atoms of $i$ which are displaced in $j$ remain within the same active domain, or 0 otherwise. In the case of a connectivity-altering move, $H_D(i \to j)$ equals 1 if all atoms of $i$ which are displaced in $j$ remain within a specific (“master”) active domain, or 0 otherwise.

By rejecting all moves in which any displaced or newly generated atoms protrude out of the active domain, and by disabling all connectivity-altering moves outside the nominated master domain, we ensure that

\[
H_D(i \to j) = H_D(j \to i) \tag{A.5}
\]

for any arbitrary definition $D$ of the active domains. By substituting (A.1) and (A.5) into (A.3) and (A.4), one obtains

\[
P_{\mathrm{par}}(m \to n) = P_{\mathrm{par}}(n \to m), \tag{A.6}
\]

thus satisfying detailed balance.
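For concreteness, the indicator $H_D(i \to j)$ defined above can be realized as a simple geometric test. The Python sketch below (assumed data layout, not the production code) returns 1 only if every displaced atom lies inside the active domain both before and after the move, which makes the test manifestly symmetric between forward and reverse moves and hence yields Eq. (A.5).

    def inside(pos, lo, hi):
        # True if a 3D position lies within the axis-aligned active domain [lo, hi).
        return all(lo[k] <= pos[k] < hi[k] for k in range(3))

    def h_domain(old_positions, new_positions, lo, hi):
        # Return 1 if all displaced atoms remain in the active domain, else 0.
        for old, new in zip(old_positions, new_positions):
            if old != new:                        # atom is displaced by the move
                if not (inside(old, lo, hi) and inside(new, lo, hi)):
                    return 0
        return 1

    # Example: a flip move displacing one atom within a cubic active domain.
    lo, hi = (0.0, 0.0, 0.0), (10.0, 10.0, 10.0)
    old = [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
    new = [(1.0, 1.0, 1.0), (2.5, 1.8, 2.2)]      # only the second atom moves
    print(h_domain(old, new, lo, hi))             # 1 -> move proceeds to the Metropolis test

By construction h_domain gives the same answer when the old and new configurations are interchanged, which is the symmetry property used in the argument above.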

References

[1] M.P. Allen, D.J. Tildesley, Computer Simulation of Liquids, Clarendon Press, Oxford, 1987.
[2] D. Frenkel, B. Smit, Understanding Molecular Simulation, Academic Press, San Diego, 1996.
[3] A.R. Leach, Molecular Modelling: Principles and Applications, Addison-Wesley Longman, Harlow, 1996.
[4] R.J. Sadus, Molecular Simulation of Fluids, Elsevier, Amsterdam, 1999.
[5] A. Uhlherr, D.N. Theodorou, Curr. Opin. Solid St. Mater. Sci. 3 (1998) 544.
[6] J. Baschnagel, K. Binder, P. Doruker, A.A. Gusev, O. Hahn, K. Kremer, W.L. Mattice, F. Müller-Plathe, M. Murat, W. Paul, S. Santos, U.W. Suter, V. Tries, Adv. Polym. Sci. 152 (2000) 41.
[7] P.V.K. Pant, D.N. Theodorou, Macromolecules 28 (1995) 7224.
[8] V.G. Mavrantzas, T.D. Boone, E. Zervopoulou, D.N. Theodorou, Macromolecules 32 (1999) 5072.
[9] A. Uhlherr, V.G. Mavrantzas, M. Doxastakis, D.N. Theodorou, Macromolecules (submitted).
[10] M. Doxastakis, V.G. Mavrantzas, D.N. Theodorou, J. Chem. Phys. (submitted).
[11] C.T. Samara, Ph.D. Thesis, University of Patras, December 2000.
[12] V.G. Mavrantzas, D.N. Theodorou, Macromol. Theory Simul. 9 (2000) 500.
[13] D.C. Rapaport, E. Clementi, Phys. Rev. Lett. 57 (1986) 695.
[14] M.P. Allen, Theor. Chim. Acta 84 (1993) 399.
[15] D. Brown, J.H.R. Clarke, M. Okuda, T. Yamazaki, Comput. Phys. Comm. 74 (1993) 67.
[16] A.B. Sinha, K. Schulten, H. Keller, Comput. Phys. Comm. 78 (1994) 265.
[17] S. Plimpton, B. Hendrickson, J. Comput. Chem. 17 (1996) 326.
[18] G.S. Heffelfinger, Comput. Phys. Comm. 128 (2000) 219.
[19] U.H.E. Hansmann, Chem. Phys. Lett. 281 (1997) 140.
[20] M.W. Deem, AIChE J. 44 (1998) 2569.
[21] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, J. Chem. Phys. 21 (1953) 1087.
[22] D.M. Jones, J.M. Goodfellow, J. Comput. Chem. 14 (1993) 127.
[23] A.P. Carvalho, J.A.N.F. Gomes, M.N.D.S. Cordeiro, J. Chem. Inf. Comput. Sci. 40 (2000) 588.
[24] G.S. Pawley, K.C. Bowler, R.D. Kenway, D.J. Wallace, Comput. Phys. Comm. 37 (1985) 251.
[25] G.T. Barkema, T. MacFarland, Phys. Rev. E 50 (1994) 1623.
[26] D. Handscomb, Math. Comput. Simul. 47 (1998) 319.
[27] G.S. Heffelfinger, M.E. Lewitt, J. Comput. Chem. 17 (1996) 250.
[28] E. Leontidis, B.M. Forrest, A.H. Widmann, U.W. Suter, J. Chem. Soc. Faraday Trans. 91 (1995) 2355.
[29] K. Esselink, L.D.J.C. Loyens, B. Smit, Phys. Rev. E 51 (1995) 1560.
[30] A.H. Widmann, U.W. Suter, Comput. Phys. Comm. 92 (1995) 229.
[31] D. Frenkel, G.C.A.M. Mooij, B. Smit, J. Phys.: Condens. Matter 4 (1992) 3053.
[32] M. Müller, K. Binder, W. Oed, J. Chem. Soc. Faraday Trans. 91 (1995) 2369.
[33] L.R. Dodd, T.D. Boone, D.N. Theodorou, Mol. Phys. 78 (1993) 961.
[34] A. Uhlherr, Macromolecules 33 (2000) 1351.
[35] M. Vacatello, G. Avitabile, P. Corradini, A. Tuzi, J. Chem. Phys. 73 (1980) 548.
[36] I.R. McDonald, Chem. Phys. Lett. 3 (1969) 241.
[37] F.A. Escobedo, J.J. de Pablo, Macromol. Theory Simul. 4 (1995) 691.
[38] http://www.mpi-forum.org.
[39] V.I. Manousiouthakis, M.W. Deem, J. Chem. Phys. 110 (1999) 2753.
[40] E.V. Krishnamurthy, Parallel Processing: Principles and Practice, Addison-Wesley, Reading, MA, 1987, p. 251.
[41] R.V.N. Melnik, A. Uhlherr, J.H. Hodgkin, F. de Hoog, in: M. Deville, R. Owens (Eds.), Proc. 16th IMACS World Congress on Scientific Computation, Applied Mathematics and Simulation, Lausanne, 2000, p. 799.
[42] K. Kremer, G.S. Grest, J. Chem. Phys. 92 (1990) 5057.
[43] M.G. Wu, M.W. Deem, Mol. Phys. 97 (1999) 559.
[44] C. Wick, J.I. Siepmann, Macromolecules 33 (2000) 7207.
[45] Z. Chen, F.E. Escobedo, J. Chem. Phys. 113 (2000) 11382.