Computer Physics Communications 184 (2013) 1827–1833


Enhancement of DFT-calculations at petascale: Nuclear Magnetic Resonance, Hybrid Density Functional Theory and Car–Parrinello calculations

Nicola Varini a,b,c,∗, Davide Ceresoli d, Layla Martin-Samos h,e, Ivan Girotto a,f, Carlo Cavazzoni g

a ICHEC, 7th Floor Tower Building Trinity Technology & Enterprise Campus, Grand Canal Quay, Dublin 2, Ireland
b iVEC, 26 Dick Perry Ave, Kensington, WA 6151, Australia
c Research and Development, Curtin University, GPO Box U 1987, Perth, WA 6845, Australia
d CNR Istituto di Scienze e Tecnologie Molecolari (CNR-ISTM), c/o Department of Chemistry, University of Milan, via Golgi 19, 20133 Milan, Italy
e CNR Istituto Officina Molecolare (CNR-IOM), c/o SISSA, via Bonomea 235, 34136 Trieste, Italy
f International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34014 Trieste, Italy
g CINECA, via Magnanelli 6/3, 40033 Casalecchio di Reno, Italy
h University of Nova Gorica, Materials Research Laboratory, Vipavska cesta 11C, 5270 Ajdovscina, Slovenia

Article info

Article history:
Received 12 November 2012
Received in revised form 1 March 2013
Accepted 5 March 2013
Available online 13 March 2013

Keywords:
Parallelization
k-points
Plane waves DFT
Car–Parrinello
NMR
Exact exchange

Abstract

One of the most promising techniques for studying the electronic properties of materials is based on the Density Functional Theory (DFT) approach and its extensions. DFT has been widely applied to traditional solid state physics problems, where periodicity and symmetry play a crucial role in reducing the computational workload. With growing compute capability and the development of improved DFT methods, the range of potential applications now includes other scientific areas such as Chemistry and Biology. However, cross-disciplinary combinations of traditional Solid-State Physics, Chemistry and Biology drastically increase the system complexity while reducing the degree of periodicity and symmetry. Large simulation cells containing hundreds or even thousands of atoms are needed to model these kinds of physical systems, and their treatment remains a computational challenge even with modern supercomputers. In this paper we describe our work to improve the scalability of Quantum ESPRESSO (Giannozzi et al., 2009 [3]) for treating very large cells and huge numbers of electrons. To this end we have introduced an extra level of parallelism, over electronic bands, in three kernels for solving computationally expensive problems: the Sternheimer equation solver (Nuclear Magnetic Resonance, package QE-GIPAW), the Fock operator builder (electronic ground state, package PWscf) and most of the Car–Parrinello routines (Car–Parrinello dynamics, package CP). Final benchmarks show our success in computing the Nuclear Magnetic Resonance (NMR) chemical shift of a large biological assembly, the electronic structure of defected amorphous silica with hybrid exchange–correlation functionals and the equilibrium atomic structure of eight Porphyrins anchored to a Carbon Nanotube, on many thousands of CPU cores.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Understanding the details of atomic/molecular structures, as well as the dynamics of solid materials and biological systems, is one of the major challenges confronting physical and chemical science in the early 21st century. Such knowledge has a direct impact on

∗ Corresponding author at: iVEC, 26 Dick Perry Ave, Kensington, WA 6151, Australia. Tel.: +61 864368621.

E-mail addresses: [email protected], [email protected] (N. Varini), [email protected] (D. Ceresoli).

0010-4655/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.cpc.2013.03.003

important issues in our society like the design, synthesis and processing of new eco-friendly or high-efficiency materials; the development of new energy sources; the control of materials degradation and recycling; and the design of drugs for specific pathologies.

The complexity of these areas of research is growing rapidly, along with the need for computational tools to deal with such complexity. Computer simulation is the only way to study these large physical systems. However, the required large-scale computational resources are extremely expensive and do not come for free: in addition to the scientific value of the simulation, scientific codes must show good scalability and efficiency to gain access to world-class supercomputing facilities.


In this respect, parameter-free first-principles (aka ab initio) atomistic calculations in the framework of Density Functional Theory (DFT) [1,2] are quite popular models, as they combine reasonable chemical accuracy with affordable complexity and good scalability. The currently available codes implementing those models [3–8] have enabled computer simulations at the frontier of science.

Computer technology is rapidly changing. The trajectory of major computer manufacturers over the last few years illustrates that vendors are rapidly increasing the number of cores used to build supercomputing technology, while reducing the clock frequency of each single compute element. The most powerful supercomputers in the world are today equipped with millions of cores capable of scheduling billions of threads concurrently. Without substantial updating, many legacy codes will not be able to exploit this massive parallelism. Nevertheless, such horsepower gives the opportunity to address problems of unprecedented size and complexity. This paper presents a parallelization strategy, based on the distribution of electronic band loops, for efficiently scaling calculations of hybrid functionals and Nuclear Magnetic Resonance (NMR). In addition, we describe how the same parallelization strategy impacts the scalability of the Car–Parrinello computational kernels.

The paper is organized as follows. In Section 2 we briefly review related work reported by other groups. In Section 3 we describe the equations implemented in the QE-GIPAW code to calculate NMR shielding tensors, and in Section 4 the formulas needed to evaluate the Fock-exchange operator and energy. In Section 5 we outline the parallelization strategy based on band and q-point distribution. In Section 6 we report and discuss benchmark results of our parallelization. Finally, in the last section we present our conclusions and perspectives on this parallelization strategy.

2. Related work

The idea of distributing computation over electronic bands was pioneered in the '90s by the CASTEP [9] code and was found to be effective on early vector machines (e.g. the Cray Y-MP). In subsequent years, the improvement of collective communication, along with the availability of efficient FFT libraries, enabled good scaling of plane-wave codes up to a few hundred CPUs with a simple two-level parallelization (k-points and plane waves). Only recently have top-level HPC machines featured a huge number of CPU cores (10^4–10^5) with a limited amount of memory. As the number of atoms in the system increases, the number of k-points can be reduced, deteriorating the scaling and parallel efficiency of the codes. At this stage, in order to exploit the large number of CPU cores, an extra level of parallelism, over electronic bands, must be introduced. The iterative diagonalization was the first kernel to benefit from this strategy, allowing scalability up to a few thousand cores for electronic ground-state calculations with traditional exchange–correlation functionals. ABINIT implemented a block preconditioned conjugate gradient algorithm [10], while VASP avoids explicit orthogonalization of bands through the RM-DIIS algorithm. Moreover, the QBOX [8] code was designed purposely for Blue Gene machines with a clever data distribution and fully distributed linear algebra (ScaLAPACK [11]). Similar schemes have been implemented in CPMD, BigDFT and CASTEP [9], and are being implemented within different packages of the Quantum ESPRESSO distribution. Regarding the Exact Exchange (EXX) kernel, the NWCHEM code introduced an extra level of parallelism over occupied states and reported excellent scaling up to 2048 CPUs [12]. In this work we report scaling results up to many thousands of CPU cores, thanks to the band parallelization on replicated data of two of the most computationally intensive kernels (linear response and the Fock operator). In addition, we show that every step of the Car–Parrinello algorithm can be fully distributed (both data and computation), enabling petascale computation. This parallelization strategy makes it possible to fully exploit the new EU Tier-0 petascale facilities such as CURIE [13], JUGENE [14] and FERMI [15].

3. The GIPAW equations for the induced current and NMR shieldings

The GIPAW method (Gauge Including Projector Augmented Wave) makes it possible to calculate the current induced by an infinitesimal external field, and hence the NMR shielding tensors, in a periodic system by means of linear response. The GIPAW method is an extension of the PAW method [16] for reconstructing all-electron wavefunctions and expectation values from a pseudopotential calculation. This is essential to accurately compute the response of the valence electrons in the regions near the nuclei, which determines the NMR shielding.

The GIPAW method was formulated by Mauri [17] and Pickard [18–20] and we refer the reader to the original papers for the detailed derivation of the method. In the following, we show only the resulting set of equations for the induced current and outline the flow of the code.

We start by noting that the GIPAW transformation of a quantum mechanical operator O (Eq. (17) of Ref. [18]) gives rise to two kinds of terms. The first is the operator O itself (which we call the bare term), acting on the valence wavefunctions over all space. The second is a non-local operator acting only inside spherical augmentation regions centered around each atom. The non-local projectors enjoy the property of vanishing beyond a cutoff radius, equal to the cutoff radius of the pseudopotential. This term is called the reconstruction term and is not computationally expensive, except for systems with more than 2000 atoms. For larger systems, work is in progress to block-distribute the non-local projectors and use the ScaLAPACK [11] library to perform the linear algebra operations.

Therefore, in the following we will focus on the evaluation of the bare contribution to the induced current and to the magnetic susceptibility of the system. The induced current is defined as

$$
\mathbf{j}^{(1)}_{\mathrm{bare}}(\mathbf{r}',\mathbf{q}) = \frac{1}{c}\sum_{\mathbf{k}} w_{\mathbf{k}} \sum_{\mathbf{q}=\pm q\hat{\mathbf{x}},\pm q\hat{\mathbf{y}},\pm q\hat{\mathbf{z}}} \frac{1}{2q} \sum_{n\in\mathrm{occ}} \mathrm{Re}\,\frac{2}{i}\, \langle u_{n\mathbf{k}} |\, J^{\mathrm{para}}_{\mathbf{k},\mathbf{k}+\mathbf{q}}(\mathbf{r}')\, G_{\mathbf{k}+\mathbf{q}}(\epsilon_{n\mathbf{k}})\, \mathbf{B}\times\hat{\mathbf{q}}\cdot\mathbf{v}_{\mathbf{k}+\mathbf{q},\mathbf{k}} \,| u_{n\mathbf{k}} \rangle \qquad (1)
$$

where the first sum is over k-points with weights w_k. The second sum is over a star of q-points, where q ≪ 1 is the inverse wavelength of the external magnetic field and q̂ is a unit vector. The third sum is over the occupied bands, and u_nk is the valence eigenstate at k-point k and band index n. The bracket is evaluated from right to left, by first applying the non-local velocity operator

$$
\mathbf{v}_{\mathbf{k},\mathbf{k}'} = -i\nabla + \mathbf{k}' + \frac{1}{i}\left[\mathbf{r},\, V^{\mathrm{nl}}_{\mathbf{k},\mathbf{k}'}\right] \qquad (2)
$$

to |u_nk⟩, then by applying the Green's function, formally defined as

$$
G_{\mathbf{k}+\mathbf{q}}(\epsilon) = \left[ H_{\mathbf{k}+\mathbf{q},\mathbf{q}} - \epsilon \right]^{-1} \qquad (3)
$$

where H_k′,k = e^{−ik′·r} H e^{ik·r} is the periodic Kohn–Sham Hamiltonian. Then, we apply the paramagnetic current operator to the left, to ⟨u_nk|:

$$
J^{\mathrm{para}}_{\mathbf{k},\mathbf{k}'} = -\frac{1}{2}\left[ (-i\nabla + \mathbf{k})\,|\mathbf{r}'\rangle\langle\mathbf{r}'| + |\mathbf{r}'\rangle\langle\mathbf{r}'|\,(-i\nabla + \mathbf{k}') \right]. \qquad (4)
$$

Finally, the bracket product is evaluated and summed to obtain the current field. The induced magnetic field is obtained easily from the Biot–Savart law, after Fourier-transforming to reciprocal space:

$$
\mathbf{B}^{(1)}_{\mathrm{bare}}(\mathbf{G}) = \frac{4\pi}{c}\,\frac{i\mathbf{G}\times\mathbf{j}^{(1)}_{\mathrm{bare}}(\mathbf{G})}{G^{2}}. \qquad (5)
$$
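The reciprocal-space Biot–Savart step is simple enough to spell out. The following is a minimal numpy sketch of Eq. (5), not the QE-GIPAW implementation; the array layout, the atomic-unit value of c and the handling of the G = 0 term are assumptions made for illustration only.

```python
import numpy as np

def induced_field(G, j_G, c=137.036):
    """Evaluate Eq. (5): B(G) = (4*pi/c) * i G x j(G) / G^2.

    G   : (ngm, 3) reciprocal lattice vectors (illustrative layout, atomic units).
    j_G : (ngm, 3) Fourier components of the bare induced current.
    The G = 0 component is left at zero here; in the text it is fixed by the
    sample shape and the macroscopic susceptibility of Eq. (7).
    """
    G2 = np.einsum('gi,gi->g', G, G)                    # |G|^2 for every G-vector
    denom = np.where(G2 > 0.0, G2, 1.0)[:, None]        # avoid division by zero at G = 0
    B = 4.0 * np.pi / c * 1j * np.cross(G, j_G) / denom
    B[G2 == 0.0] = 0.0                                  # G = 0 term handled separately
    return B
```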

This procedure is repeated with B along each of the three Cartesian directions, and the bare NMR shielding tensor is obtained by


evaluating the induced magnetic field response at each nuclear coordinate. The G = 0 term in Eq. (5) depends on the macroscopic shape of the sample and on the magnetic susceptibility tensor χ_bare. The expression of χ_bare is very similar to Eq. (1):

$$
\overleftrightarrow{Q}(\mathbf{q}) = -\frac{1}{c^{2}}\sum_{\mathbf{k}} \frac{w_{\mathbf{k}}}{\Omega} \sum_{\mathbf{q}=q\hat{\mathbf{x}},q\hat{\mathbf{y}},q\hat{\mathbf{z}}} \sum_{n\in\mathrm{occ}} \mathrm{Re}\,\frac{1}{i}\, \langle u_{n\mathbf{k}} |\, \hat{\mathbf{q}}\times(-i\nabla + \mathbf{k})\; G_{\mathbf{k}+\mathbf{q}}(\epsilon_{n\mathbf{k}})\, \mathbf{B}\times\hat{\mathbf{q}}\cdot\mathbf{v}_{\mathbf{k}+\mathbf{q},\mathbf{k}} \,| u_{n\mathbf{k}} \rangle \qquad (6)
$$

where Ω is the cell volume and

$$
\overleftrightarrow{\chi}_{\mathrm{bare}} = \frac{\overleftrightarrow{F}(q) - 2\overleftrightarrow{F}(0) + \overleftrightarrow{F}(-q)}{q^{2}}, \qquad F_{ij}(q) = (2-\delta_{ij})\,Q_{ij}(q), \quad i,j \in \{x,y,z\}. \qquad (7)
$$

Since the only difference between Eqs. (1) and (6) is the operator to the left of the Green's function, the two formulas are evaluated back-to-back in the same loop.
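Eq. (7) is a second-order central finite difference in q, which is why Q must be evaluated at q = 0 and on the q-star. A small numpy sketch of this step (function and argument names are illustrative, not QE routines):

```python
import numpy as np

def chi_bare(Q_plus, Q_zero, Q_minus, q):
    """Eq. (7): chi_bare = [F(q) - 2 F(0) + F(-q)] / q^2, with F_ij = (2 - delta_ij) Q_ij.

    Q_plus, Q_zero, Q_minus : 3x3 arrays of Q_ij evaluated at +q, 0 and -q.
    q                       : the small but finite wavevector magnitude.
    """
    weight = 2.0 - np.eye(3)          # (2 - delta_ij), applied element-wise
    def F(Q):
        return weight * Q
    return (F(Q_plus) - 2.0 * F(Q_zero) + F(Q_minus)) / q ** 2
```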

Finally, the NMR shielding tensor σ(R) is obtained by adding the bare and reconstruction contributions to the induced magnetic field, evaluated at each nuclear position R:

$$
\overleftrightarrow{\sigma}(\mathbf{R}) = -\left.\frac{\partial \mathbf{B}^{(1)}(\mathbf{r})}{\partial \mathbf{B}}\right|_{\mathbf{r}=\mathbf{R}}. \qquad (8)
$$

3.1. The Sternheimer equation

The most time-consuming operation in GIPAW is applying the Green's function (Eq. (3)) to a generic ket |w_nk⟩:

$$
|g_{n\mathbf{k}+\mathbf{q}}\rangle = G_{\mathbf{k}+\mathbf{q}}(\epsilon_{n\mathbf{k}})\,|w_{n\mathbf{k}}\rangle. \qquad (9)
$$

To avoid the inversion of large matrices or, equivalently, the summation over all empty states, we solve instead the Sternheimer equation
$$
\left[ H_{\mathbf{k}+\mathbf{q},\mathbf{k}} - \epsilon_{n\mathbf{k}} + \alpha P_{\mathbf{k}+\mathbf{q},\mathbf{k}} \right] |g_{n\mathbf{k}+\mathbf{q}}\rangle = -Q_{\mathbf{k}+\mathbf{q},\mathbf{k}}\,|w_{n\mathbf{k}}\rangle \qquad (10)
$$

where P_k+q,k = Σ_{n∈occ} |u_nk+q⟩⟨u_nk| is the projector over the occupied manifold and Q_k+q,k = 1 − P_k+q,k is the projector over the empty states. α is chosen as twice the bandwidth of the occupied states in order to make the left-hand side positive definite.
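For illustration, applying the occupied- and empty-state projectors to a trial vector amounts to two matrix products. The sketch below simplifies P_k+q,k to a projector built from a single set of orthonormal orbitals; all array names are hypothetical.

```python
import numpy as np

def project_occupied(u_occ, w):
    """Apply P = sum_n |u_n><u_n| to a trial vector w.

    u_occ : (npw, nocc) array whose orthonormal columns are occupied states
            (a single set of orbitals, for illustration).
    """
    return u_occ @ (u_occ.conj().T @ w)

def project_empty(u_occ, w):
    """Apply Q = 1 - P, projecting w onto the empty-state manifold."""
    return w - project_occupied(u_occ, w)

# In this simplified picture, the right-hand side of Eq. (10) is
# -project_empty(u_occ, w) for each band's vector w.
```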

Eq. (10) constitutes a set of N independent linear equations, where N is the number of occupied states, and is solved for |g_nk+q⟩ by the conjugate gradient (CG) method. In order to take advantage of level-3 BLAS operations, the CG update is initially performed on all electronic bands. Then, starting from the second iteration, the electronic bands are divided into two groups, each occupying a contiguous memory area: converged and not converged. A band is converged when its residual falls below a threshold (10^-7 Ryd). Therefore, we update only the bands which have not yet converged. Before returning from the subroutine, the bands are sorted back according to the original band index n.
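The regrouping idea can be sketched as follows. This is a schematic numpy block-CG, not the actual Fortran routine: apply_A stands in for the operator on the left-hand side of Eq. (10), and a boolean mask takes the place of the contiguous converged/not-converged memory blocks used in QE-GIPAW.

```python
import numpy as np

def block_cg(apply_A, rhs, tol=1e-7, max_iter=200):
    """Solve A x_n = b_n for a block of band vectors at once (schematic).

    apply_A : callable acting column-wise on an (npw, nbnd) array.
    rhs     : (npw, nbnd) right-hand sides, one column per band.
    Converged bands are excluded from later updates, so each CG step
    remains a dense matrix-matrix (level-3 BLAS like) operation.
    """
    x = np.zeros_like(rhs)
    r = rhs - apply_A(x)                              # initial residuals, all bands at once
    p = r.copy()
    for _ in range(max_iter):
        active = np.linalg.norm(r, axis=0) > tol      # bands not yet converged
        if not active.any():
            break
        Ap = apply_A(p[:, active])
        rr = np.sum(np.conj(r[:, active]) * r[:, active], axis=0)
        alpha = rr / np.sum(np.conj(p[:, active]) * Ap, axis=0)
        x[:, active] += p[:, active] * alpha
        r_new = r[:, active] - Ap * alpha
        beta = np.sum(np.conj(r_new) * r_new, axis=0) / rr
        p[:, active] = r_new + p[:, active] * beta
        r[:, active] = r_new
    return x
```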

3.2. Outline of the GIPAW code

Before running an NMR calculation with the QE-GIPAW code [21], it is necessary to run a self-consistent (SCF) calculation with the PW code on a (possibly relaxed) structure, in order to obtain the ground-state wavefunctions |u_nk⟩. The QE-GIPAW code reads from disk the wavefunctions and the electronic density generated by PW and computes the induced current and the magnetic susceptibility.

After an initialization phase, the QE-GIPAW code runs as follows:

1. Loop over k-points.
   1. Read |u_nk⟩ from disk, all bands.
   2. Set q = 0 and for each n compute v_k,k|u_nk⟩, G_k,k(ϵ_nk)v_k,k|u_nk⟩, and (−i∇ + k)|u_nk⟩.
   3. Compute Q(0) according to Eq. (6).
   4. Loop over the star of q-points: q = ±q x̂, ±q ŷ, ±q ẑ.
      1. Diagonalize the KS Hamiltonian at k + q.
      2. For each n compute v_k+q,k|u_nk⟩, G_k+q,k(ϵ_nk)v_k+q,k|u_nk⟩, and (−i∇ + k + q)|u_nk⟩.
      3. Compute Q(q) and j(1)_bare(r′, q).
2. Parallel execution only: reduce (parallel sum) Q and j(1)_bare over k-points and plane waves.
3. Solve the Biot–Savart equation (Eq. (5)), evaluate the induced magnetic field at each nuclear coordinate, output the NMR shielding tensors and terminate.

All reconstruction terms are evaluated after steps 1.3 and 1.4.3, with little computational cost. When employing ultrasoft [22] or PAW pseudopotentials, there is an additional evaluation of the Green's function for each k and q. The details can be found in Ref. [20].

Thus the flowchart of the code is based on three nested loops: over k-points, the q-star and bands. Each term can be calculated independently and, in order to run efficiently at the petaflop scale, we have distributed the calculation of every individual term over all the CPUs. Parallel communication is performed only at the end of the three loops and consumes little time. The only bottleneck is the diagonalization step at k + q. We use the wavefunctions calculated at k as starting vectors in order to reduce the number of Davidson or CG [3] iterations. Unfortunately, the Davidson and CG algorithms cannot be parallelized easily over bands, because of the Gram–Schmidt orthogonalization. We are currently seeking a diagonalization method which does not require an orthogonalization step, such as the RM-DIIS method [7].

Moreover, the explicit diagonalization can be avoided for very large simulation cells by choosing the magnitude of q equal to the first reciprocal lattice vector, i.e. q = 2π/a, where a is the lattice spacing of a cubic supercell. In fact, by the Bloch theorem, the wavefunctions at k + q are given simply by:
$$
|u_{n,\mathbf{k}+\mathbf{q}_{2\pi/a}}\rangle = e^{i\mathbf{q}_{2\pi/a}\cdot\mathbf{r}}\,|u_{n\mathbf{k}}\rangle. \qquad (11)
$$

This method (which we call commensurate) provides an enormous speed-up and scaling for very large systems, but it may worsen the accuracy of the calculated NMR chemical shifts. We are currently testing this method and results will be reported in a future paper.
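A minimal sketch of Eq. (11) on a real-space FFT grid, under the assumption of a cubic cell with q along x; all sizes and the placeholder orbital below are illustrative.

```python
import numpy as np

a = 10.0                                     # cubic cell side (illustrative units)
n = 32                                       # FFT grid points per direction
x = np.linspace(0.0, a, n, endpoint=False)   # real-space grid along x
q = 2.0 * np.pi / a                          # |q| equal to the first reciprocal lattice vector

rng = np.random.default_rng(0)
u_k = rng.standard_normal((n, n, n)) + 1j * rng.standard_normal((n, n, n))  # placeholder u_nk(r)

phase = np.exp(1j * q * x)[:, None, None]    # e^{i q x}, with q chosen along x
u_kq = phase * u_k                           # Eq. (11): u_{n,k+q}(r) = e^{i q.r} u_{nk}(r)
```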

4. The Exact Exchange (EXX) within the plane wave method

After the work of A. Becke in 1993, ''A new mixing of Hartree–Fock and local density-functional theories'' [23], it has nowadays become very common to include a fraction of Exact eXchange (formally, Fock exchange) in Density Functional calculations. Exchange–correlation functionals including this fraction are called Hybrid Functionals (HFs). The use of HFs enables at least a partial correction of the self-interaction error and the inclusion of a non-local contribution to the Hamiltonian. In other words, the inclusion of an EXX fraction, with respect to traditional DFT functionals such as the Local Density Approximation (LDA) or the Generalized Gradient Approximation (GGA) [24], is mainly used to improve the agreement with experiment of band gaps and of the energetics of small molecules and solids.

The EXX energy contribution is defined as

$$
E_x = -\frac{1}{2}\sum_{\mathbf{k}}\sum_{\mathbf{q}}\sum_{i,j}^{\mathrm{occ}} \int d\mathbf{r}\,d\mathbf{r}'\; \frac{u^{*}_{i\mathbf{k}}(\mathbf{r})\, u_{j\mathbf{k}-\mathbf{q}}(\mathbf{r})\, u^{*}_{j\mathbf{k}-\mathbf{q}}(\mathbf{r}')\, u_{i\mathbf{k}}(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|} \qquad (12)
$$


which in reciprocal space reads

$$
E_x = -\frac{1}{2}\sum_{\mathbf{k}}\sum_{i,j}^{\mathrm{occ}}\sum_{\mathbf{G}}\sum_{\mathbf{q}} \frac{M^{*}_{i\mathbf{k},j\mathbf{k}-\mathbf{q}}(\mathbf{G})\, M_{i\mathbf{k},j\mathbf{k}-\mathbf{q}}(\mathbf{G}')}{|\mathbf{G}+\mathbf{q}|^{2}} \qquad (13)
$$

where M_ik,jk−q(G) = FT[ u*_jk−q(r) u_ik(r) ] is the Fourier transform of the product of two Bloch wavefunctions. These quantities are usually called ''overlap charge densities''. In plane-wave-based codes, the overlap charge densities are calculated via Fast Fourier Transform (FFT). From the reciprocal-space representation the wavefunctions are Fourier transformed to real space, the overlap product is calculated, and the result is transformed back to reciprocal space, where the sums over occupied states, q, k and G are performed [25,26].
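The FFT-based evaluation of the overlap charge densities can be sketched at the Γ point as follows. This is only a schematic of the structure of Eqs. (12)-(13): prefactors, the k and q sums and the treatment of the divergent G + q = 0 term are omitted, and all names and normalizations are illustrative.

```python
import numpy as np

def exx_energy_gamma(u_r, G2):
    """Schematic Gamma-point version of the double sum over occupied states.

    u_r : (nocc, n1, n2, n3) occupied orbitals on the real-space FFT grid.
    G2  : (n1, n2, n3) squared moduli |G|^2 of the grid's G-vectors.
    """
    nocc = u_r.shape[0]
    mask = G2 > 0.0                      # skip the divergent G = 0 term
    ex = 0.0
    for i in range(nocc):
        for j in range(nocc):
            # overlap charge density M_ij(G) = FT[ u_j*(r) u_i(r) ]
            m_g = np.fft.fftn(np.conj(u_r[j]) * u_r[i]) / u_r[i].size
            ex -= 0.5 * np.sum(np.abs(m_g[mask]) ** 2 / G2[mask])
    return ex
```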

In the PWscf code of the Quantum ESPRESSO distribution, the convergence of the Kohn–Sham (KS) Hamiltonian containing an EXX fraction is achieved through two nested loops. The inner loop converges the KS equations at fixed EXX potential, while the outer one updates the EXX potential (V_x) and checks the overall accuracy. Before adding the first guess for V_x, PWscf performs a first fully self-consistent calculation with a traditional non-hybrid exchange–correlation functional. The EXX potential projected on an electronic state |u_ik⟩ is calculated as

$$
V_x|u_{i\mathbf{k}}\rangle = -\sum_{j}^{\mathrm{occ}}\sum_{\mathbf{G}}\sum_{\mathbf{q}} \frac{M^{*}_{i\mathbf{k},j\mathbf{k}-\mathbf{q}}(\mathbf{G})\, M_{i\mathbf{k},j\mathbf{k}-\mathbf{q}}(\mathbf{G}')}{|\mathbf{G}+\mathbf{q}|^{2}}. \qquad (14)
$$

It is easy to see that distributing the computational workload at the level of the sum over j is very convenient, as it just implies a reduction at the end of the loops.
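A hedged mpi4py sketch of this distribution, with one band group per MPI rank for simplicity (the real code also distributes plane waves inside each band group); the pair term of Eq. (14) is left as a placeholder.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
me, nbgrp = comm.Get_rank(), comm.Get_size()   # here: one band group per MPI rank

nbnd, npw = 256, 1000                          # occupied bands / plane waves (illustrative)
ibnd_start = (nbnd * me) // nbgrp              # local slice of the sum over j
ibnd_end = (nbnd * (me + 1)) // nbgrp

vx_psi = np.zeros(npw, dtype=complex)          # partial V_x|u_ik> accumulated locally
for j in range(ibnd_start, ibnd_end):
    pair_term = np.zeros(npw, dtype=complex)   # placeholder for the j-th term of Eq. (14)
    vx_psi += pair_term
# a single reduction at the end of the loop restores the full sum over occupied states
comm.Allreduce(MPI.IN_PLACE, vx_psi, op=MPI.SUM)
```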

Note, however, that in the current Quantum ESPRESSO implementation, hybrid functionals cannot be used to compute linear-response quantities (i.e. NMR), because one would have to implement and solve the coupled Hartree–Fock equations simultaneously [27]. On the contrary, in pure DFT (no EXX) the zeroth-order (Kohn–Sham) and first-order (Sternheimer) equations are fully decoupled.

5. Parallelization strategy

Historically, the Quantum ESPRESSO distribution has been designed and used for modeling condensed-matter problems where the physical system can be mapped into a periodically repeated primitive cell. In these cases, the number of electrons (and consequently the number of electronic bands) is relatively small, while the number of k-points (Brillouin zone sampling) is usually high. As a result, the distribution of k-points across processes is the natural method of parallelization, especially considering that the DFT Hamiltonian is diagonal in k and the only source of communication arises from a sum in the calculation of the total energy. However, the boom of new organic and hybrid electronic and optoelectronic materials (originated by the need to replace silicon-based technologies), along with the fast evolution of supercomputers, presents brand-new scenarios extending the boundaries among condensed matter, chemistry and biology. Most of the new topics involve systems with a huge number of atoms (and consequently a huge number of electrons) and either low symmetry (e.g. surfaces or wires) or no symmetry at all (e.g. biological molecules).

The development activity focused on three main DFT applications: Nuclear Magnetic Resonance (NMR), Exact Exchange (EXX) and Car–Parrinello calculations. These computational models are implemented in the Quantum ESPRESSO distribution in three scientific codes named QE-GIPAW [21], PWscf (EXact eXchange part only) and CP, respectively. For all these packages, performance analysis indicated that the highly computationally intensive sections were nested inside loops over electronic bands. New levels of parallelism were introduced to better distribute this computation, as these algorithms do not present data dependencies along the band dimension. Applied, this simple concept becomes a breakthrough that makes it possible to simulate new scientific problems efficiently on large-scale supercomputers and to discover emerging physical effects that arise at a scale that was computationally unreachable before.

One of the most computationally intensive algorithms of the QE-GIPAW code is the evaluation of the linear response of the wavefunctions to an external magnetic field. This is done by solving the Sternheimer equation, which is composed of n = 1 ... N independent linear systems of the form
$$
\left( H^{(0)} - E^{(0)}_{n} \right) |\Psi^{(1)}_{n}\rangle = H^{(1)}\,|\Psi^{(0)}_{n}\rangle \qquad (15)
$$
where H^(0) − E^(0)_n = G(E_n)^{-1} is the inverse of the Green's function, H^(1) is the perturbing magnetic field, and |Ψ^(0)_n⟩ are the unperturbed wavefunctions, obtained by a previous SCF calculation with PWscf. Here |Ψ^(1)_n⟩ are the unknowns and the index n runs over all occupied electronic states. Prior to the present work, the Sternheimer equations were solved band-by-band by the conjugate gradient method, with a clever regrouping of the not-yet-converged bands in order to exploit level-3 BLAS operations. In the new implementation, we further distribute the occupied bands over CPUs, and the CG algorithm has been modified to work with groups of bands. The new scheme is reported in pseudo-language form in Fig. 2. As the number of electronic bands becomes large, the same model might also improve the scalability of plane-wave DFT codes, so that they run efficiently on thousands of cores in parallel.

The code analysis brought out that the GIPAW calculation is based on a simultaneous run over a 7-fold loop in one of the outermost routines. The image communicator (see Fig. 1), already implemented in the modular structure of the package, is designed for such a purpose, but it was not yet used by the QE-GIPAW code. A new communicator (later called the electronic bands communicator) has been introduced in order to take advantage of this level of parallelism (see Fig. 1). As presented in the next section, even higher scalability is reachable when this is exploited. Despite the simple model, the introduction of new levels of parallelism required a considerable re-factoring due to the complexity of the software structure.
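In the spirit of Fig. 1, the communicator hierarchy can be sketched with mpi4py as two successive splits; the counts and the mapping of ranks to images and band groups below are illustrative assumptions, not the Quantum ESPRESSO implementation.

```python
from mpi4py import MPI

world = MPI.COMM_WORLD
me = world.Get_rank()
nimage, nbgrp = 7, 2                 # e.g. 7 images (q = 0 plus the q-star), 2 band groups

# first split: one image per independent part of the GIPAW calculation
image_id = me % nimage
image_comm = world.Split(color=image_id, key=me)

# second split: electronic band groups inside each image
bgrp_id = image_comm.Get_rank() % nbgrp
band_comm = image_comm.Split(color=bgrp_id, key=image_comm.Get_rank())

# ranks inside band_comm distribute the plane waves of their bands, while
# different band groups take different slices of the loop over occupied bands
```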

A similar parallelization scheme has been implemented for the Exact-Exchange (EXX) kernel of the PWscf code. We worked at the band level, as described for the QE-GIPAW code, because the target was to enhance the parallelization for simulating big systems. As presented in Fig. 3, once again it was possible to parallelize the inner loop over the electronic bands by introducing the electronic bands MPI communicator (i.e., by splitting the ibnd index across processors so that each works on a different local section of the array rho). In this case the parallelization scales much more effectively because it is implemented on a high-level loop that performs the whole computational workload of the EXX applications. In other words, the sum over occupied states that appears in the Green function calculation and in the EXX self-energy has been distributed across processors.

The CP code has benefited from a very similar parallelization strategy. At large scale, performance analysis underlined that the main computational bottlenecks are the wavefunction orthogonalization (where the use of ScaLAPACK could likely be an improvement) and the evaluation of both the forces on the electronic degrees of freedom and the charge density, especially for systems with a very large number of electrons.

Both the electronic forces and the charge density are computed iteratively over the electronic bands, performing a distributed 3D FFT at each iteration.


Fig. 1. The picture represents the levels of parallelism implemented within the Quantum ESPRESSO software packages. On the right-hand side the hierarchy is logically represented from the highest to the deepest level of parallelism, expressed using multi-threading on top of the MPI distribution. The left-hand side shows how this hierarchy is mapped onto MPI groups of processes (or communicators). From top to bottom, four of these levels are divided into smaller groups from the previous level (each black arrow stands for a splitting stage), starting from MPI_COMM_WORLD, which represents all the MPI processes. For each sub-group, processes are identified from 0 to n − 1, where n is the number of processes in the given MPI communicator. At the last stage each process is no longer considered as a member of a set of processes (MPI communicator) but as a single entity from which threads are created.

Fig. 2. Code sections from the QE-GIPAW code: (a) original implementation; (b) new implementation. The original number of iterations, nbnd, is divided evenly among the band groups used, so each group computes ibnd_end − ibnd_start iterations. At the end of the loop the results of each group are summed up.

It is well known that distributed 3D FFTs do not scale well, because they are parallelized along the z axis and the number of points in this dimension does not grow at the same rate as the number of bands. As a direct consequence, the overall scalability is limited by the parallel FFT to a few hundred tasks. A hybrid MPI + OpenMP parallelization scheme allows the scalability to be extended up to a few thousand cores (see also CPMD [28,29]). The band parallelization was introduced to scale the workload of the two loops which compute the forces and the charge density, in the same way as described for the QE-GIPAW routines and the EXX kernel, see Figs. 2 and 3. However, contrary to QE-GIPAW and the EXX kernel of PWscf, here both data and loops were distributed among the different band groups. The data and the parallelism over the g-vectors have been replicated, so that all the g-vector loops are left untouched. Although the g-vectors are replicated across band groups, the new implementation requires less memory: the wavefunction arrays are distributed along two dimensions, the g-vector and band indices, while two-dimensional arrays having the number of atoms and the band index as dimensions are distributed along the band index. Indeed, when the band parallelism is turned off, the memory capacity is a limit for a large physical system.
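The two-dimensional distribution of the wavefunction arrays can be illustrated with a small, self-contained sketch; all sizes except the 2616 occupied bands of the CNT10POR8 benchmark are made up for the example.

```python
def local_slice(n, nproc, rank):
    """Even partition of n items over nproc processes (illustrative helper)."""
    start = (n * rank) // nproc
    end = (n * (rank + 1)) // nproc
    return start, end

ngw, nbnd = 200000, 2616       # plane-wave (g-vector) and band dimensions (ngw is illustrative)
nproc_bgrp, nbgrp = 64, 8      # ranks inside a band group / number of band groups (illustrative)

ig0, ig1 = local_slice(ngw, nproc_bgrp, 3)   # g-vector slice of rank 3 inside its band group
ib0, ib1 = local_slice(nbnd, nbgrp, 2)       # band slice owned by band group 2
local_wfc_shape = (ig1 - ig0, ib1 - ib0)     # local block of the wavefunction array
print(local_wfc_shape)                       # -> (3125, 327)
```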

Besides the parallelization of the electronic forces and the charge density, the new scheme has also been partially used in the orthogonalization subroutine. In particular, band parallelization has been used to distribute the setup of the matrices for the orthogonalization (these matrices are in fact the product of one wavefunction against all the other wavefunctions), whereas the diagonalization and the matrix multiplications performed with ScaLAPACK have been left replicated across the band groups.

In the next section we will show the scalability improvements obtained with these new parallelization schemes.

6. Benchmark results

Computer technologies are rapidly evolving and the number of compute cores needed to build the most powerful supercomputing infrastructures worldwide is drastically increasing (over a million at the time of writing). As a consequence, access to petascale supercomputing facilities commonly calls for a specific requirement, namely scalability. Even for algorithms that are not embarrassingly parallel, speedup can be achieved at higher orders of magnitude of core count. However, in such cases efficiency will be poor and therefore less useful as a measure; when the goal is to tackle scientific challenges that would be impossible otherwise, accepting this loss of efficiency is a reasonable compromise. Within


Fig. 3. Code sections from the EXX code: (a) original implementation; (b) new implementation. The original number of iterations, nbnd, is divided evenly among the band groups used, so each group computes ibnd_end − ibnd_start iterations. At the end of the loop the results of each group are summed up.

Table 1
GIPAW benchmark data for 592 cholesterol atoms on CURIE [13]. Wall-clock times are given in minutes.

# Cores   # Threads   # Bands   Time     Efficiency (%)
  64        1           1       621.15      100
  64        1           2       694.66       89
 128        1           1       533.33       58
 128        1           2       416.69       73
 256        1           4       300.79       57

Table 2
GIPAW benchmark data for 592 cholesterol atoms on JUGENE [14]. Wall-clock times are given in minutes. Benchmarks were executed with a number of band groups equal to 2.

# Cores   # Threads   Time     Efficiency (%)   Time Eq. (10)
 128        1         416.69       100             242.11
 896        1          81.59        73              58.71
1792        2          51.82        58              37.60
3584        2          36.00        41              24.81

this scenario, we enabled three scientific codes on petascale facilities so that the user community can finally exploit such compute capabilities. Validation and performance analysis of our development work were performed using real data. In particular, a cholesterol system with 592 atoms and 600 electronic bands was used for the QE-GIPAW tests. The EXX tests were performed on a 109-atom a-SiO2 supercell with an oxygen interstitial, while for CP we obtained a series of benchmark results on a CNT10POR8 system (one hydrogen-saturated carbon nanotube with four porphyrin rings chemically linked to the CNT surface; the overall system comprises 1532 atoms and 5232 electrons, i.e., 2616 occupied bands). Calculations have been run on EU Tier-0 systems: CURIE [13], a three-partition Linux cluster by BULL equipped with Intel x86-64 technology; JUGENE [14], an IBM Blue Gene/P architecture; and FERMI [15], an IBM Blue Gene/Q.

Table 1 shows the numbers measured for the GIPAW calculation on the CURIE Tier-0 system. The first row presents the reference value obtained running on only two nodes: 64 cores is the lowest count that allows such a calculation to be performed within the 24-hour wall-clock limit, and reaching convergence is necessary to obtain timing information. The second row reports the elapsed time obtained with the new version using two band groups, 32 cores for each group. Here the old approach still performs better, since the diagonalization routine is not parallelized over the bands. However, when doubling the number of cores, the parallelization over the electronic bands reduces the wall time: the fourth row of Table 1 shows a lower execution time and consequently a better efficiency (fifth column). De facto, 128 cores was the limit of the older version of the code, as already at this stage the efficiency is low. We obtain the same efficiency running the new version of the code on 256 cores using 4 band groups. The new level of parallelism thus allows the scalability to be increased by a factor of two at large core counts. While the code sections computing the Green's function scale almost perfectly, the iterative diagonalization is now the actual bottleneck: at this point 1/3 of the overall time is spent solving the diagonalization problem, which does not take advantage of the parallelization over electronic bands. The parallelization over electronic bands cannot therefore be pushed further to improve the scalability, so QE-GIPAW runs efficiently with 2 band groups for this example. This is the point where users are supposed to exploit the other level of parallelism introduced.

As described in the previous section, the code runs seven independent calculations in the outermost routine. For our experiments we took 128 cores as the reference. As shown in Table 2, by using this new level of parallelism and running on 896 cores, the code scales with 73% efficiency with respect to 128 cores (93% if compared with the old version of the code at the same number of cores). Although this is the best result reached in terms of efficiency, further scaling with the introduction of the hybrid MPI + OpenMP approach shows that it is possible to reduce the wall time to 36 min when running on 3584 cores. Thanks to the developments presented in this paper, users can now scale to at least 14 times the number of cores compared with the old version of the same code.

The analysis of the results continues with the EXX calculation. As described in the previous section, the routine Vexx carries out almost all of the computational workload. In order to test the impact of the new parallelization we simulated a system of 108 atoms with 800 bands on the Tier-0 JUGENE machine. Despite the low number of atoms, full convergence requires a huge amount of core hours: as shown in Table 3, almost 300 000 core hours were necessary to complete this computation. For completeness it must be underlined that the current implementation only allows norm-conserving pseudopotentials [30], and as a consequence the energy cutoff used for the Fock operator (EXX operator) is the density cutoff, i.e. four times the wavefunction cutoff, although a smaller cutoff could be used.

The routine Vexx, which calculates the Exact-Exchange potential, scales almost perfectly when doubling the number of cores and the number of band groups: we measure an efficiency of about 97% moving from 32768 to 65536 cores if we compare the execution time of the Vexx routine alone. However, as presented in Table 3,


Table 3
EXX benchmark data for 109 SiO2 atoms on JUGENE [14].

# Cores   # Bands   Elapsed time (s)   Efficiency (%)
32768       64          523.45             100
65536      128          311.59              84

Fig. 4. Scalability of the CP kernel of Quantum ESPRESSO on BG/Q, using the CNT10POR8 benchmark: a 1532-atom system of a hydrogen-saturated chiral carbon nanotube, with four porphyrin rings chemically linked to the surface of the nanotube.

the global time also scales almost perfectly at a large number of cores. Indeed, since Vexx is extremely computationally intensive with respect to the overall execution time, the code scales efficiently up to 65536 cores. The level of parallelism introduced by the new electronic bands communicator clearly has a larger impact on the EXX code than on the QE-GIPAW code, as in this case almost all of the computational workload can be parallelized over the electronic bands.

The benchmark analysis for the CP code (see Fig. 4) shows that the time spent in the ortho subroutines decreases when increasing the number of cores, although not linearly, whereas the computation of dforce and rhoofr, as well as the other subroutines containing loops over electronic bands, scales almost linearly with the number of cores. It is worth noting that, when increasing the number of band groups to the point where the electronic force and charge density computations become negligible, the linear algebra contained in the ortho subroutine will again become the main bottleneck of the code. At that point a new strategy will have to be implemented.

7. Conclusions

The work presented in this paper shows the advantage of the band parallelization approach for simulating challenging systems when the number of bands becomes huge; this was implemented for two DFT applications. (In fact, in order to accurately reproduce experimental data it is often necessary to keep the system size big enough.) The results presented here suggest that it is mandatory to use this strategy on petascale hardware and beyond. Recently, memory distribution has been introduced in some critical parts of Quantum ESPRESSO. These two strategies coupled together are the workhorse that allows efficient DFT simulations on world-class supercomputers.

Acknowledgments

This work was financially supported both by the PRACE First Implementation Project, funded in part by the EU's 7th Framework Programme (FP7/2007–2013) under grant agreement no. RI-261557, and by Science Foundation Ireland (grant 08/HEC/I1450). Benchmarks were carried out on the JUGENE [14] machine at Jülich, on CURIE [13] at the CEA and on FERMI [15]. DC is indebted to Ari P. Seitsonen, Uwe Gerstmann and Francesco Mauri for coauthoring the first version of the QE-GIPAW code. We acknowledge Emine Küçükbenli for sharing the cholesterol model, Arrigo Calzolari for the CNT10POR8 model, and Paolo Giannozzi for useful discussions. We acknowledge the Quantum ESPRESSO Foundation for its support.

References

[1] P. Hohenberg, W. Kohn, Phys. Rev. 136 (1964) B864.
[2] W. Kohn, L.J. Sham, Phys. Rev. 140 (1965) A1133.
[3] P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, G.L. Chiarotti, M. Cococcioni, I. Dabo, A. Dal Corso, S. de Gironcoli, S. Fabris, G. Fratesi, R. Gebauer, U. Gerstmann, C. Gougoussis, A. Kokalj, M. Lazzeri, L. Martin-Samos, N. Marzari, F. Mauri, R. Mazzarello, S. Paolini, A. Pasquarello, L. Paulatto, C. Sbraccia, S. Scandolo, G. Sclauzero, A.P. Seitsonen, A. Smogunov, P. Umari, R.M. Wentzcovitch, J. Phys.: Condens. Matter 21 (2009) 395502. http://www.quantum-espresso.org.
[4] X. Gonze, et al., Comput. Mater. Sci. 25 (2002) 478. Also see http://www.abinit.org.
[5] J.M. Soler, E. Artacho, J.D. Gale, A. García, J. Junquera, P. Ordejón, D. Sánchez-Portal, J. Phys.: Condens. Matter 14 (2002) 2745.
[6] R. Car, M. Parrinello, Phys. Rev. Lett. 55 (1985) 2471.
[7] G. Kresse, J. Hafner, Phys. Rev. B 47 (1993) RC558; G. Kresse, D. Joubert, Phys. Rev. B 59 (1999) 1758.
[8] F. Gygi, IBM J. Res. Dev. 52 (2008) 137.
[9] P.J. Hasnip, M.I.J. Probert, K. Refson, M. Plummer, M. Ashworth, Cray Users Group 2009 Proceedings, 2009.
[10] F. Bottin, S. Leroux, A. Knyazev, G. Zérah, Comput. Mater. Sci. 42 (2008) 329.
[11] http://www.netlib.org/scalapack/.
[12] E.J. Bylaska, K. Tsemekhman, S.B. Baden, J.H. Weare, H. Jonsson, J. Comput. Chem. 32 (2011) 54.
[13] http://www-hpc.cea.fr/en/complexe/tgcc-curie.htm.
[14] http://www2.fz-juelich.de/jsc/jugene.
[15] Fermi BG/Q. http://www.cineca.it/en/hardware/ibm-bgq.
[16] P.E. Blöchl, Phys. Rev. B 50 (1994) 17953.
[17] F. Mauri, B.G. Pfrommer, S.G. Louie, Phys. Rev. Lett. 77 (1996) 5300.
[18] C.J. Pickard, F. Mauri, Phys. Rev. B 63 (2001) 245101.
[19] C.J. Pickard, F. Mauri, Phys. Rev. Lett. 91 (2003) 196401.
[20] J.R. Yates, C.J. Pickard, F. Mauri, Phys. Rev. B 76 (2007) 024401.
[21] D. Ceresoli, A.P. Seitsonen, U. Gerstmann, F. Mauri, E. Küçükbenli, S. de Gironcoli, N. Varini, http://qe-forge.org/projects/qe-gipaw.
[22] D. Vanderbilt, Phys. Rev. B 41 (1990) 7892.
[23] A.D. Becke, J. Chem. Phys. 98 (1993) 1372.
[24] J.P. Perdew, K. Burke, M. Ernzerhof, Phys. Rev. Lett. 78 (1997) 1396.
[25] S. Chawla, G.A. Voth, J. Chem. Phys. 108 (1998) 4697.
[26] M.C. Gibson, S. Brand, S.J. Clark, Phys. Rev. B 73 (2010) 125120.
[27] J. Gerratt, I.M. Mills, J. Chem. Phys. 49 (1968) 1719.
[28] J. Hutter, A. Curioni, Chem. Phys. Chem. 6 (2005) 1788.
[29] J. Hutter, A. Curioni, Parallel Comput. 31 (2005) 1.
[30] N. Troullier, J.L. Martins, Phys. Rev. B 43 (1991) 1993.