Daubechies wavelets for linear scaling density functional theory


arXiv:1401.7441v1 [cond-mat.mtrl-sci] 29 Jan 2014

Daubechies Wavelets for Linear Scaling Density Functional Theory

Stephan Mohr,1,2 Laura E. Ratcliff,2 Paul Boulanger,2,3 Luigi Genovese,2 Damien Caliste,2 Thierry Deutsch,2 and Stefan Goedecker1

1Institut für Physik, Universität Basel, Klingelbergstr. 82, 4056 Basel, Switzerland
2Laboratoire de simulation atomistique (L Sim), SP2M, UMR-E CEA / UJF-Grenoble 1, INAC, Grenoble, F-38054, France
3Institut Néel, CNRS and Université Joseph Fourier, B.P. 166, 38042 Grenoble Cedex 09, France

(Dated: January 30, 2014)

We demonstrate that Daubechies wavelets can be used to construct a minimal set of optimized localized contracted basis functions in which the Kohn-Sham orbitals can be represented with an arbitrarily high, controllable precision. Ground state energies and the forces acting on the ions can be calculated in this basis with the same accuracy as if they were calculated directly in a Daubechies wavelets basis, provided that the amplitude of these contracted basis functions is sufficiently small on the surface of the localization region, which is guaranteed by the optimization procedure described in this work. This approach reduces the computational costs of DFT calculations, and can be combined with sparse matrix algebra to obtain linear scaling with respect to the number of electrons in the system. Calculations on systems of 10,000 atoms or more thus become feasible in a systematic basis set with moderate computational resources. Further computational savings can be achieved by exploiting the similarity of the contracted basis functions for closely related environments, e.g. in geometry optimizations or combined calculations of neutral and charged systems.

I. INTRODUCTION

The Kohn-Sham (KS) formalism of density functional theory (DFT)1,2 is one of the most popular electronic structure methods due to its good balance between accuracy and speed. Thanks to the development of new approximations to the exchange correlation functional, this approach now allows many quantities (bond lengths, vibration frequencies, elastic constants, etc.) to be calculated with errors of less than a few percent, which is sufficient for many applications in solid state physics, chemistry, materials science, biology, geology and many other fields. Although the KS approach has some shortcomings – e.g. its inability to accurately describe the HOMO-LUMO separation or many-body (e.g. excitonic) effects, thus reducing its predictive power in the field of optics – it has become the standard for the quantum simulation of matter and also provides a well defined starting point for more accurate methods, such as the GW approximation3.

Despite the efforts put forth to increase the efficiency of DFT calculations and the increasing computing power of modern supercomputers, the applicability of standard calculations is limited to systems containing about a thousand atoms, which is small compared to the size of systems of interest in nanoscience. The reason for this is that standard electronic structure programs using systematic basis sets such as plane waves4–6, finite elements7 or wavelets8 need a number of operations that scales as the number of orbitals, N_orb, squared times the number of basis functions, N_basis, used to represent them. Since both the number of orbitals and the number of basis functions scale as the number of atoms, the overall cost scales as O(N_orb² N_basis) = O(N_atom³). Electronic structure programs that use Gaussians9 or atomic orbitals10 require in a standard implementation a matrix diagonalization which scales as O(N_basis³).

To circumvent this problem, one can exploit Kohn's nearsightedness principle11–13, which states that, for systems with a finite gap or for metals at finite temperature, all physical quantities are determined by the local environment. This is a consequence of the exponentially fast decay of the density matrix14–20. Therefore, it is theoretically possible to express the KS wavefunctions of a given system in terms of a minimal, localized basis set. In order to get highly accurate results while still keeping the size of the basis relatively small, such a basis has to depend on the local chemical environment. If this basis set were known or could be approximated beforehand, it would lead to a computationally cheap tight-binding like approach21,22. Of course, in practice it is not possible to determine this optimal localized basis set beforehand; instead it has to be built up iteratively during the calculation. This would result in O(N_orb³) scaling, which is still equivalent to O(N_atom³), but with a much smaller prefactor than systematic approaches (e.g. plane waves) where the number of basis functions is far greater than the number of orbitals (N_basis ≫ N_orb).

However, the use of a strictly localized basis offers yet another possibility. As has been demonstrated during the past twenty years23,24, it is possible to truncate the density matrix and thus transform it into a sparse form by neglecting elements either when they are below a certain threshold, or when they correspond to localized orbitals which are too distant from each other. This reduces the complexity of the algorithm to O(N_orb) = O(N_atom) and leads to so-called linear scaling (LS) DFT methods. Methods of this type have been implemented in numerous codes such as onetep25, Conquest26, CP2K27 and siesta28. Note, however, that the extent of the truncation impacts the accuracy due to the imposition of an additional constraint on the system, and is therefore left as a freely selectable parameter for the user. This additional constraint also comes at the cost of extra computational steps, so that the prefactor is greater than for standard DFT codes, even for a single iteration in the self-consistency cycle. Furthermore, there can be problems with ill-conditioning when using strictly localized basis sets, which further increases the prefactor. The combination of these two problems means that for small systems the total calculation time is actually greater when one imposes locality, but thanks to the better scaling, there is a crossover point where the new algorithms become more efficient.
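
The distance-based truncation just described can be sketched in a few lines of NumPy; this is an illustration of the principle, not code from any of the cited packages, and the cutoff radius, box size and random matrix are invented:

```python
import numpy as np

def truncate_density_matrix(K, centers, r_cut):
    """Zero all matrix elements connecting localized orbitals whose
    centers are farther apart than r_cut, yielding a sparse matrix."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    return np.where(d <= r_cut, K, 0.0)

# illustrative data: 100 orbitals randomly placed in a 20x20x20 box
rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 20.0, size=(100, 3))
A = rng.standard_normal((100, 100))
K = 0.5 * (A + A.T)                      # symmetric stand-in for a kernel
K_trunc = truncate_density_matrix(K, centers, r_cut=5.0)
print("kept fraction:", np.count_nonzero(K_trunc) / K.size)
```

In a production code the zeros would of course be exploited through a genuine sparse storage format; the dense mask above only illustrates which elements survive.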

Our minimal set of localized contracted basis functions, called support functions in the following, is obtained by an environment dependent optimization where the support functions are represented in terms of a fixed underlying wavelet basis set. In the language of quantum chemistry, these support functions could be denoted as environment dependent contracted wavelets. Because of the environment dependency, however, the size of this basis set is, for a given accuracy, much smaller than that of typical contracted Gaussian basis sets, and we therefore also refer to it as a minimal basis set.

The choice of the underlying basis set is one of the most important aspects impacting the accuracy and efficiency of a linear scaling DFT code. Ideally, it should feature compact support while still being orthogonal, thus allowing for a systematic convergence – properties which are all offered by Daubechies wavelets basis sets29. Furthermore, wavelets have built in multiresolution properties, enabling an adaptive mesh with finer sampling close to the atoms where the most significant part of the orbitals is located; this can be particularly beneficial for inhomogeneous systems. Wavelets also have the distinct advantage that calculations can be performed with all the standard boundary conditions – free, wire, surface or periodic. This also means we can perform calculations on charged and polarized systems using free boundary conditions without the need for a compensating background charge. It is therefore evident that the combination of the above features makes wavelets ideal for a LS DFT code.

This paper is organized as follows. We first give an overview of the method, focussing in particular on the imposition of the localization constraint in Daubechies wavelets. We then discuss the details, highlighting the novel features, following which we consider the calculation of atomic forces. For this latter point, we demonstrate the remarkable result that, thanks to the compact support of Daubechies wavelets, the contribution of the Pulay-like forces, arising from the introduction of the localization regions, can be safely neglected in a typical calculation. We then present results for a number of systems, illustrating the accuracy of the method for ground state energies and atomic forces. We also demonstrate the improved scaling compared with standard BigDFT, showing that we are able to achieve linear scaling. Finally, we highlight two cases where the minimal basis functions can be reused, resulting in further significant computational savings.

II. MINIMAL CONTRACTED BASIS

A. Kohn-Sham formalism in a minimal basis set

The standard approach for performing Kohn-Sham DFT calculations is to calculate the Kohn-Sham orbitals |Ψ_i⟩ which satisfy the equation

H_KS |Ψ_i⟩ = ε_i |Ψ_i⟩ ,    (1)

with

H_KS = −(1/2) ∇² + V_KS[ρ] + V_PSP ,    (2)

where V_KS[ρ] contains the Hartree potential – solution to the Poisson equation – and the exchange-correlation potential, while V_PSP contains the potential arising from the pseudopotential and the external potential created by the ions. In the case of BigDFT, these are norm-conserving GTH-HGH30 pseudopotentials and their Krack variants31, possibly with a nonlinear core correction32. In our approach the KS orbitals are in turn expressed as a linear combination of support functions |φ_α⟩:

|Ψ_i(r)⟩ = ∑_α c_i^α |φ_α(r)⟩ .    (3)

The density – which can be obtained from the one-electron orbitals via ρ(r) = ∑_i f_i |Ψ_i(r)|², where f_i is the occupation number of orbital i – is given by

ρ(r) = ∑_{α,β} φ_α*(r) K^{αβ} φ_β(r) ,    (4)

where K^{αβ} = ∑_i f_i c_i^α c_i^β is the density kernel. The latter is related to the density matrix formulation of Hernandez and Gillan33, since – as follows from Eq. (3) –

F(r, r′) = ∑_i f_i |Ψ_i(r)⟩⟨Ψ_i(r′)| = ∑_{α,β} |φ_α(r)⟩ K^{αβ} ⟨φ_β(r′)| .    (5)

Thus the density kernel is the representation of the density matrix in the support function basis. We choose to have real support functions and thus from now on we will neglect the complex notation for this quantity.
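
As a small numerical illustration of Eqs. (3)–(5) – with invented dimensions and random coefficients, not data from an actual calculation – the kernel and the density on a real-space grid can be assembled and checked against the direct orbital expression:

```python
import numpy as np

# illustrative sizes: 4 occupied orbitals in a basis of 6 real support
# functions, sampled on 50 real-space grid points
rng = np.random.default_rng(1)
n_orb, n_sf, n_grid = 4, 6, 50
c = rng.standard_normal((n_sf, n_orb))     # expansion coefficients c_i^alpha
f = np.full(n_orb, 2.0)                    # occupation numbers f_i
phi = rng.standard_normal((n_sf, n_grid))  # support functions on the grid

# density kernel K^{alpha beta} = sum_i f_i c_i^alpha c_i^beta
K = (c * f) @ c.T

# rho(r) = sum_{alpha,beta} phi_alpha(r) K^{alpha beta} phi_beta(r), Eq. (4)
rho = np.einsum('ag,ab,bg->g', phi, K, phi)

# the same density from the orbitals psi_i = sum_alpha c_i^alpha phi_alpha
psi = c.T @ phi
rho_direct = (f[:, None] * psi**2).sum(axis=0)
assert np.allclose(rho, rho_direct)
```

The final assertion is exactly the statement that the kernel is the density matrix represented in the support function basis.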

The density matrix decays exponentially with respect to the distance |r − r′| for systems with a finite gap or for metals at finite temperature14–20. In these cases it can therefore be represented by strictly localized basis functions. A natural and exact choice for these would be the maximally localized Wannier functions, which have the same exponential decay34. Of course, these Wannier functions are not known beforehand. Therefore, in our case, the contracted basis functions are constructed in situ during the self-consistency cycle and are expected to reach a quality similar to that of the exact Wannier functions.

In the formalism we have presented so far, the KS orbitals have to be optimized by minimizing the total energy with respect to the support functions and density kernel. For a self-consistent calculation this is equivalent to minimizing the band structure energy, i.e.

E_BS = ∑_{α,β} K^{αβ} H_{αβ} ,    (6)

subject to the orthonormality condition of the KS orbitals,

⟨Ψ_i | Ψ_j⟩ = ∑_{α,β} c_i^{α*} S_{αβ} c_j^β = δ_{ij} ,    (7)

where H_{αβ} = ⟨φ_α| H_KS |φ_β⟩ and S_{αβ} = ⟨φ_α | φ_β⟩ are the Hamiltonian and overlap matrices of the support functions, respectively. This is equivalent to imposing the idempotency condition on the density kernel K^{αβ},

∑_{γ,δ} K^{αγ} S_{γδ} K^{δβ} = K^{αβ} ,    (8)

which can be achieved using the McWeeny purification scheme35 or by directly imposing the orthogonality constraint on the coefficients c_i^α.
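
The McWeeny scheme referenced above iterates a cubic polynomial of the kernel until Eq. (8) holds. The sketch below is a generic textbook illustration with an invented 6×6 test case (an orthonormal basis, S = identity), not the BigDFT implementation:

```python
import numpy as np

def mcweeny_purify(K, S, n_iter=20):
    """Drive the density kernel toward idempotency K S K = K (Eq. 8)
    using the McWeeny polynomial K <- 3 KSK - 2 KSKSK."""
    for _ in range(n_iter):
        KS = K @ S
        K = 3.0 * KS @ K - 2.0 * KS @ KS @ K
    return K

# invented test case: perturb an exactly idempotent kernel, then purify
rng = np.random.default_rng(2)
n, n_occ = 6, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
K = Q[:, :n_occ] @ Q[:, :n_occ].T        # idempotent, trace = n_occ
noise = rng.standard_normal((n, n))
K = K + 0.01 * (noise + noise.T)         # small symmetric perturbation
S = np.eye(n)
K = mcweeny_purify(K, S)
print(np.max(np.abs(K @ S @ K - K)))     # residual driven toward zero
```

The iteration converges quadratically as long as the occupation eigenvalues of KS start in the basin of attraction of 0 and 1, which is why it is well suited to re-purifying a kernel that is already nearly idempotent.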

The algorithm therefore consists of two key components: support function and density kernel optimization. The workflow is illustrated in Fig. 1; it consists of a flexible double loop structure, with the outer loop controlling the overall convergence, and two inner loops which optimize the support functions and density kernel, respectively. The first of these inner loops is done non-self-consistently (i.e. with a fixed potential), whereas the second one is done self-consistently.

B. Daubechies wavelets in BigDFT

BigDFT8 uses the orthogonal least asymmetric Daubechies29 family of order 2m = 16, illustrated in Fig. 2. These functions have a compact support and are smooth, which means that they are also localized in Fourier space. This wavelet family is able to exactly represent polynomials up to 8th order. Such a basis is therefore an optimal choice given that we desire at the same time locality and interpolating power. An exhaustive presentation of the use of wavelets in numerical simulations can be found in Ref. 36. A wavelet basis set is generated by the integer translates of the scaling functions and wavelets, with arguments measured in units of the grid spacing h. In three dimensions, a wavelet basis set can easily be obtained as the tensor product of one-dimensional basis functions, combining wavelets and scaling functions along each coordinate of the Cartesian grid (see e.g. Ref. 8).

FIG. 1. Structure of the minimal basis approach: for the basis optimization loop the hybrid scheme can be used instead of trace or energy minimization, and for the kernel optimization loop either direct minimization or the Fermi operator expansion method can be used in place of diagonalization.


FIG. 2. Least asymmetric Daubechies wavelet family of order 2m = 16; both the scaling function φ(x) and wavelet ψ(x) differ from zero only within the interval [1−m, m].

In a simulation domain, we have three categories of grid points: those which are closest to the atoms (“fine region”) carry one (three-dimensional) scaling function and seven (three-dimensional) wavelets; those which are further away from the atoms (“coarse region”) carry only one scaling function, corresponding to a resolution which is half that of the fine region; and those which are even further away (“empty region”) carry neither scaling functions nor wavelets. The fine region is typically the region where chemical bonding takes place, whereas the coarse region covers the region where the tails of the wavefunctions decay smoothly to zero. We therefore have two resolution levels whilst maintaining a regularly spaced grid.

A support function φ_α(r) can be expanded in this wavelet basis as follows:

φ(r) = ∑_{i1,i2,i3} s_{i1,i2,i3} ϕ_{i1,i2,i3}(r) + ∑_{j1,j2,j3} ∑_{ℓ=1}^{7} d^{(ℓ)}_{j1,j2,j3} ψ^{(ℓ)}_{j1,j2,j3}(r) ,    (9)

where ϕ_{i1,i2,i3}(r) = ϕ(x − i1) ϕ(y − i2) ϕ(z − i3) is the tensor product of three one-dimensional scaling functions centered at the grid point (i1, i2, i3), and ψ^{(ℓ)}_{j1,j2,j3}(r) are the seven tensor products containing at least one one-dimensional wavelet centered on the grid point (j1, j2, j3). The sums over i1, i2, i3 (j1, j2, j3) run over all grid points where scaling functions (wavelets) are centered, i.e. all the points of the coarse (fine) grid.

To determine these regions of different resolution, we construct two spheres around each atom a: a small one with radius R_a^f = λ^f · r_a^f and a large one with radius R_a^c = λ^c · r_a^c (R_a^c > R_a^f). The values of r_a^f and r_a^c are characteristic for each atom type and are related to the covalent and van der Waals radii, whereas λ^f and λ^c can be specified by the user in order to control the accuracy of the calculation. The fine (coarse) region is then given by the union of all the small (large) spheres, as shown in Fig. 3. Hence in BigDFT the basis set is controlled by these three user specified parameters. By reducing h and/or increasing λ^c and λ^f the computational degrees of freedom are incremented, leading to a systematic convergence of the energy.
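
The union-of-spheres rule can be sketched as a three-way classification of grid points. This is a minimal illustration with made-up atom positions and a single radius per region (in BigDFT the radii are per atom type), not the code's actual data structures:

```python
import numpy as np

def classify_grid(points, atoms, r_fine, r_coarse):
    """Assign each grid point to the fine, coarse or empty region
    according to the union-of-spheres rule (r_fine < r_coarse)."""
    # distance of every grid point to every atom
    d = np.linalg.norm(points[:, None, :] - atoms[None, :, :], axis=-1)
    fine = (d <= r_fine).any(axis=1)              # inside any small sphere
    coarse = (d <= r_coarse).any(axis=1) & ~fine  # large spheres only
    empty = ~fine & ~coarse
    return fine, coarse, empty

# invented example: two atoms on a 10x10x10 unit-spaced grid
atoms = np.array([[3.0, 3.0, 3.0], [6.0, 3.0, 3.0]])
g = np.arange(0.0, 10.0, 1.0)
points = np.stack(np.meshgrid(g, g, g, indexing='ij'), axis=-1).reshape(-1, 3)
fine, coarse, empty = classify_grid(points, atoms, r_fine=1.5, r_coarse=4.0)
print(fine.sum(), coarse.sum(), empty.sum())
```

Only the fine points then carry wavelet coefficients in addition to the scaling function coefficient, and the empty points carry no degrees of freedom at all.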

III. LOCALIZATION REGIONS

Thanks to the nearsightedness principle it is possible to define a basis of strictly localized support functions such that the KS orbitals given in terms of this contracted basis are exactly equivalent to the expression based solely on the underlying Daubechies basis. However, as presented so far, the support functions φ_α(r) of Eq. (9) are expanded over the entire simulation domain. We want them to be strictly localized while still containing various resolution levels, as illustrated by Fig. 3, and so we set to zero all scaling function and wavelet coefficients which lie outside a sphere of radius R_cut around the point R_α on which the support function is centered. In general, these centers R_α could be anywhere, but we choose them to be centered on an atom a and we thus assume from now on that R_α = R_a. Consequently we define a localization projector L^(α), which is written in the Daubechies basis space as

L^(α)_{i1,i2,i3; j1,j2,j3} = δ_{i1 j1} δ_{i2 j2} δ_{i3 j3} θ(R_cut − |R_{(i1,i2,i3)} − R_α|) ,    (10)

where θ is the Heaviside function. We use this projector to constrain the function |φ_α⟩ to be localized throughout the calculation, i.e.

|φ_α⟩ = L^(α) |φ_α⟩ .    (11)

FIG. 3. A two level adaptive grid for an alkane: the high resolution grid points are shown with bold points while the low resolution grid points are shown with smaller points. Also visible are three localization regions (red, blue and green) with radii of 3.7 Å centered on different atoms in which certain support functions will reside. The coarse grid points shown in yellow do not belong to any of the three localization regions.

Clearly, if |φ_α⟩ is localized around R_α and R_cut is large enough, L^(α) leaves |φ_α⟩ unchanged and no approximation is introduced to the KS equation.
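
In coefficient space the projector of Eq. (10) is simply a diagonal 0/1 mask over grid points. A sketch, with an invented grid, cutoff and coefficient vector:

```python
import numpy as np

def localization_mask(grid_points, center, r_cut):
    """Diagonal of the projector L^(alpha): 1 inside the sphere of
    radius r_cut around the support function center, 0 outside."""
    return (np.linalg.norm(grid_points - center, axis=1) <= r_cut).astype(float)

# illustrative 10x10x10 unit-spaced grid, one support function
g = np.arange(10.0)
grid = np.stack(np.meshgrid(g, g, g, indexing='ij'), axis=-1).reshape(-1, 3)
mask = localization_mask(grid, center=np.array([5.0, 5.0, 5.0]), r_cut=3.0)

phi = np.random.default_rng(3).standard_normal(grid.shape[0])
phi_loc = mask * phi                         # enforce |phi> = L |phi>, Eq. (11)
assert np.allclose(mask * phi_loc, phi_loc)  # L is a projector: L^2 = L
```

Because the mask is diagonal and idempotent, applying it twice changes nothing, which is exactly the projector property used in the derivations below.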

It is important to note that the localization constraint of Eq. (11) determines the expression of dφ_α/dR_β outside the localization region of φ_α. Indeed, as the Daubechies basis set is independent of R_α, differentiating Eq. (11) with respect to R_β leads to

(1 − L^(α)) |dφ_α/dR_β⟩ = δ_{αβ} (∂L^(α)/∂R_α) |φ_α⟩ .    (12)

This result will be used in Appendix C 2 to demonstrate that the Pulay-like forces are negligible for a typical calculation with our approach.

A. Imposing the localization constraint

In what follows, we demonstrate that choosing the support functions to be orthogonal allows for a more straightforward application of the localization constraint. Due to the orthonormality of the KS orbitals we cannot directly minimize the band structure energy (Eq. (6)) with respect to the support functions; rather we have to minimize the following functional:

Ω = ∑_{α,β} K^{αβ} H_{αβ} − ∑_{i,j} ∑_{α,β} Λ_{ij} (c_i^{α*} c_j^β S_{αβ} − δ_{ij}) ,    (13)

with the Lagrange multiplier coefficients Λ_{ij} determined by the relation

∑_{i,j} c_i^{α*} c_j^β Λ_{ij} = ∑_{ρ,σ} K^{αρ} H_{ρσ} (S^{−1})^{σβ} ,    (14)

where (S^{−1})^{αβ} is the inverse overlap matrix. The gradient |g_α⟩ = |δΩ/δ⟨φ_α|⟩ is therefore

|g_α⟩ = ∑_β K^{αβ} H_KS |φ_β⟩ − ∑_{β,ρ,σ} K^{αρ} H_{σρ} (S^{−1})^{σβ} |φ_β⟩ .    (15)

However, we wish to impose the localization condition |φ_α⟩ = L^(α)|φ_α⟩ on the support functions and therefore the functional to be minimized becomes

Ω′ = Ω − ∑_α ⟨φ_α| (1 − L^(α)) |ℓ_α⟩ ,    (16)

where the components of the vector |ℓ_α⟩ are the Lagrange multipliers of this locality constraint. The gradient for Ω′, |δΩ′/δ⟨φ_α|⟩, can therefore be written as

|g′_α⟩ = |g_α⟩ − (1 − L^(α)) |ℓ_α⟩ .    (17)

Using the stationarity condition 0 = |g′_α⟩ and combining with the fact that 1 − L^(α) is a projection operator, i.e. (1 − L^(α))² = 1 − L^(α), we have

(1 − L^(α)) |ℓ_α⟩ = (1 − L^(α)) |g_α⟩ .    (18)

Therefore, using Eq. (17),

|g′_α⟩ = L^(α) |g_α⟩ ,    (19)

i.e. the gradient is explicitly localized. This yields the following result for the gradient:

|g′_α⟩ = ∑_{β,ρ} K^{αρ} (S^{1/2})_ρ^β [ L^(α) H_KS |φ̄_β⟩ − ∑_σ ⟨φ̄_σ| H_KS |φ̄_ρ⟩ L^(α) |φ̄_σ⟩ ] .    (20)

Here the localized gradient is expressed in terms of the orthogonalized support functions |φ̄_α⟩ = ∑_β (S^{−1/2})_α^β |φ_β⟩. Requiring the support functions to be orthogonal, i.e. S_{αβ} = δ_{αβ}, therefore further simplifies the evaluation of the gradient, as it is no longer necessary to calculate S^{−1} or S^{1/2}. Moreover, it avoids the need for distinguishing between covariant and contravariant indices37.

B. Localization of the Hamiltonian application

As shown in Ref. 8, the Hamiltonian operator in a Daubechies wavelets basis set is defined by a set of convolution operations, combined with the application of nonlocal pseudopotential projectors. The nature of these operations is such that H_KS |φ_β⟩ will have a greater extent than |φ_β⟩. We therefore define a second localization operator, L′^(β), with a corresponding cutoff radius R′_cut, such that R′_cut is equal to R_cut plus half of the convolution filter length times the grid spacing, which in our case corresponds to an additional eight grid points. When applying the Hamiltonian, we impose

H_KS |φ_β⟩ = L′^(β) H_KS L^(β) |φ_β⟩ .    (21)

For both the convolution operations and the nonlocal pseudopotential applications, this procedure guarantees that the Hamiltonian application is exact within the localization region of L^(β). However, the values of H_KS L^(β)|φ_β⟩ are approximated outside this region due to the semilocal nature of the convolutions and the pseudopotential projectors. This impacts the evaluation of the Hamiltonian matrix H_{αβ} for all elements whose localization regions do not coincide, and thus also affects the gradient |g_α⟩. Nonetheless, we have verified that further enlargement of R′_cut has a negligible impact on the accuracy, while adding additional overheads; see Sec. VI.

Apart from these technical details, most of the basic operations are identical to their implementation in the standard BigDFT code8 and are therefore not repeated here. The only difference is that these operations are now done only in the localization regions (corresponding to either L^(β) or L′^(β)) and not in the entire computational volume.

IV. SELF-CONSISTENT CYCLE

A. Support function optimization

As an initial guess for the support functions we use atomic orbitals, which are generated by solving the atomic Schrödinger equation and therefore possess long tails which need to be truncated at the borders of the localization regions. If the values at the borders are not negligible, the resulting kink will cause the kinetic energy to become very large due to the definition of the Laplacian operator in a wavelet basis set, and so to assure stability during the optimization procedure, the localization regions would need to be further enlarged. To overcome this problem, even for small localization regions, it is advantageous to decrease the extent of the atomic orbitals before the initial truncation by adding a confining quartic potential centered on each atom, a_α (r − R_α)⁴, to the atomic Schrödinger equation.

For the first few iterations of the outer loop (Fig. 1) we maintain the quartic confining potential of the atomic input guess. This implies that the total Hamiltonian becomes dependent on the support function, H_α = H_KS + a_α (r − R_α)⁴, and we can no longer minimize the band structure energy (Eq. (6)) to obtain the support functions. Instead, we choose to minimize the functional

min_{φ_α} ∑_α ⟨φ_α| H_α |φ_α⟩ ,    (22)

while applying both orthogonality and localization constraints on the support functions, as discussed in Section III A.

Apart from the improved localization, the use of the confining potential has yet another advantage. The band structure energy (Eq. (6)) is invariant under unitary transformations among the support functions if there are no localization constraints. This corresponds to some zero eigenvalues in the Hessian characterizing the optimization of the support functions. The introduction of a localization constraint violates this invariance and leads to small but non-zero eigenvalues. The condition number, defined as the ratio of the largest and smallest (nonzero) eigenvalue of the Hessian, can thus become very large, potentially turning the optimization into a strongly ill-conditioned problem. On the other hand, if the unitary invariance is heavily violated in Eq. (22) by the introduction of a strong localization potential, the small eigenvalues grow and the condition number improves as a consequence.

After a few iterations of the outer loop, the support functions are sufficiently localized to continue the optimization without a confining potential, i.e. by minimizing the band structure energy. This procedure will lead to highly accurate support functions while still preserving locality. As an alternative it is also possible to define a so-called “hybrid mode” which combines the two categories of support function optimization and thus provides a smoother transition between the two. In this case the target function is given by

Ω_hy = ∑_α K^{αα} ⟨φ_α| H_α |φ_α⟩ + ∑_{α, β≠α} K^{αβ} ⟨φ_α| H_KS |φ_β⟩ .    (23)

In the beginning a strong confinement is used, making this expression similar to the functional of Eq. (22); however the confining potential is reduced throughout the calculation so that towards the end the strength of the confinement is negligible and Eq. (23) reverts to the full energy expression. A prescription for reducing the confinement is presented in Appendix A.

1. Orthogonalization

The Lagrange multiplier formalism conserves the orthonormality of the contracted basis only to first order. An additional explicit orthogonalization therefore has to be performed after each update of the contracted basis to restore exact orthonormality. This is done using the Löwdin procedure. The calculation of S^(−1/2), which is required in this context, can pose a bottleneck; however, as our basis functions are close to orthonormality, the exact calculation can safely be replaced by a first order Taylor approximation.
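As an illustration, the Löwdin step and its first-order shortcut can be sketched in a few lines of NumPy. This is a toy sketch on a discretized grid (the function name and test basis are our own), not the BigDFT routine:

```python
import numpy as np

def lowdin_orthonormalize(phi, first_order=False):
    """Orthonormalize the rows of phi (basis functions sampled on a grid)
    via the Loewdin transformation phi' = S^(-1/2) phi."""
    S = phi @ phi.T                                   # overlap matrix
    if first_order:
        # First-order Taylor expansion around S = I:
        # S^(-1/2) = (I + (S - I))^(-1/2) ~ (3 I - S) / 2,
        # accurate when the basis is already close to orthonormal.
        S_inv_half = 0.5 * (3.0 * np.eye(len(S)) - S)
    else:
        w, V = np.linalg.eigh(S)                      # exact S^(-1/2)
        S_inv_half = V @ np.diag(w ** -0.5) @ V.T
    return S_inv_half @ phi

# A nearly orthonormal toy basis: orthonormal rows plus a small perturbation
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 4)))
phi = Q.T + 1e-2 * rng.standard_normal((4, 50))

exact = lowdin_orthonormalize(phi)
approx = lowdin_orthonormalize(phi, first_order=True)
```

The first-order variant deviates from exact orthonormality only at second order in (S − I), which is why it is safe once the support functions are nearly orthonormal.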

It is important to note that the support functions will only be nearly orthogonal rather than exactly orthogonal, as exact orthogonality is in general not possible for functions exhibiting compact support in a discretized space. This near-orthogonality is in contrast with most other minimal basis implementations, which use fully non-orthogonal support functions25,26. The asymptotic decay behavior of the orthogonal and non-orthogonal support functions is identical, but the prefactor differs, leading to a better localization of the non-orthogonal functions38. In practice, however, we have found that the introduction of the orthogonality constraint does not significantly increase the required size of the localization regions, provided that a sufficiently strong confining potential is applied to localize the support functions at the start of the calculation.

In order to counteract the small deviations from idempotency caused by the changing overlap matrix, we purify the density kernel during the support function optimization, either by directly orthonormalizing the expansion coefficients of the KS orbitals or by using the McWeeny purification transformation.
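A minimal sketch of such a purification step, written here for a non-orthogonal basis with overlap matrix S (the McWeeny iteration K ← 3KSK − 2KSKSK). The toy matrices are our own and only illustrate how a slightly non-idempotent kernel is driven back to idempotency:

```python
import numpy as np

def mcweeny_step(K, S):
    """One McWeeny purification step in a non-orthogonal basis:
    K <- 3 K S K - 2 K S K S K; fixed points satisfy K S K = K."""
    KS = K @ S
    return 3.0 * KS @ K - 2.0 * KS @ KS @ K

rng = np.random.default_rng(1)
n, nocc = 6, 3
B = rng.standard_normal((n, n))
S = np.eye(n) + 0.05 * (B + B.T)               # nearly orthonormal basis

# An exactly idempotent kernel built from S-orthonormal occupied vectors ...
C = rng.standard_normal((n, nocc))
M = C.T @ S @ C
w, V = np.linalg.eigh(M)
C = C @ V @ np.diag(w ** -0.5) @ V.T           # now C^T S C = I
K = C @ C.T                                    # satisfies K S K = K

# ... perturbed slightly, mimicking the effect of a changing overlap matrix
P = rng.standard_normal((n, n))
K_pert = K + 5e-3 * (P + P.T)

for _ in range(4):                             # a few purification steps
    K_pert = mcweeny_step(K_pert, S)
```

Because the iteration converges quadratically near an idempotent kernel, a handful of steps suffices when the deviations are small, as they are during the support function optimization.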

2. Gradient and preconditioning

The optimization is done via a direct minimization scheme or with direct inversion in the iterative subspace (DIIS)39, both combined with an efficient preconditioning scheme. The derivation of the gradient of the target function Ω with respect to the support functions involves some subtleties for the trace and hybrid modes, since in these cases the Hamiltonian depends explicitly on the support function, leading to an asymmetry of the Lagrange multiplier matrix Λαβ = ⟨φα | δΩ/δ⟨φβ| ⟩. In order to derive the gradient expression correctly we follow the same guidelines as Ref. 40; assuming nearly orthogonal orbitals, the final result is given by

    | g_\alpha \rangle = \frac{\delta \Omega}{\delta \langle \phi_\alpha |} - \frac{1}{2} \sum_\beta \left( \Lambda_{\alpha\beta} + \Lambda_{\beta\alpha} \right) | \phi_\beta \rangle .    (24)

This is a generalization of the ordinary expression and is thus also valid if the Hamiltonian does not explicitly depend on the support function, i.e. for the energy mode. As discussed in Section III A, the gradient is suitably localized once derived, i.e. |gα⟩ ← L(α)|gα⟩.
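The role of the symmetrized multiplier matrix in Eq. (24) can be checked numerically: with the symmetrization, the overlaps ⟨φβ|gα⟩ form an antisymmetric matrix, so the orthonormality of the support functions is preserved to first order along the gradient direction. A toy NumPy sketch, with random matrices standing in for HKS and the per-function confining terms:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 4                                    # grid points, support functions
phi, _ = np.linalg.qr(rng.standard_normal((m, n)))   # orthonormal columns

A = rng.standard_normal((m, m))
H_KS = 0.5 * (A + A.T)
# Per-function Hamiltonians H_alpha = H_KS + (confinement for alpha),
# modelled here by a different random diagonal for each alpha.
H_alpha = [H_KS + np.diag(rng.random(m)) for _ in range(n)]

# h_alpha = delta Omega / delta <phi_alpha| = H_alpha |phi_alpha>
hvec = np.column_stack([H_alpha[a] @ phi[:, a] for a in range(n)])

Lam = phi.T @ hvec               # Lambda_{alpha beta} = <phi_alpha | h_beta>
g = hvec - phi @ (0.5 * (Lam + Lam.T))          # Eq. (24)

# <phi_beta | g_alpha> is antisymmetric, so d/dt <phi_alpha | phi_beta> = 0
# to first order along the gradient direction.
overlap_dot = phi.T @ g
```

Note that Λ is genuinely asymmetric here because each function sees a different Hamiltonian, which is exactly the situation in the trace and hybrid modes.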

To precondition the gradient |gα⟩ we use the standard kinetic preconditioning scheme8. To ensure that the preconditioning does not negatively impact the localization of the gradient, we have found that it is important to add an extra term to account for the confining potential, if present. In this case, the expression becomes

    \left( -\frac{1}{2} \nabla^2 + a_\alpha (\mathbf{r} - \mathbf{R}_\alpha)^4 - \varepsilon_\alpha \right) | g^{\mathrm{prec}}_\alpha \rangle = - | g_\alpha \rangle ,    (25)

where εα is an approximate eigenvalue. This also has the effect of improving the convergence. The inclusion of the confining potential adds only a small overhead, as it can be evaluated via convolutions in the same manner as the kinetic energy. Furthermore, the preconditioning equations do not need to be solved with high accuracy, only approximately.
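A one-dimensional toy analogue of Eq. (25) illustrates the structure of the preconditioning equation; the grid, confinement strength and eigenvalue shift are illustrative choices, and a direct solve stands in for the approximate iterative solve that suffices in practice:

```python
import numpy as np

# 1D analogue of Eq. (25): solve (-1/2 d^2/dx^2 + a (x - x0)^4 - eps) g_prec = -g
# on a finite-difference grid; grid size, a and eps are illustrative values.
n, L = 200, 10.0
x = np.linspace(-L / 2, L / 2, n)
h = x[1] - x[0]

# Second-order finite-difference Laplacian (zero boundary conditions)
lap = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
       + np.diag(np.ones(n - 1), -1)) / h**2

a, x0, eps = 0.01, 0.0, -1.0        # eps < 0 keeps the operator positive here
A = -0.5 * lap + np.diag(a * (x - x0)**4) - eps * np.eye(n)

g = np.exp(-x**2)                   # a localized gradient
# In practice the system is solved only approximately (e.g. a few conjugate
# gradient iterations); a direct solve is used here for brevity.
g_prec = np.linalg.solve(A, -g)
```

Because the quartic term grows towards the borders, the solution stays strongly localized, which is the point of including the confining potential in the preconditioner.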


B. Density kernel optimization

For the optimization of the density kernel we have implemented three schemes: diagonalization, direct minimization and the Fermi operator expansion method (FOE)41,42. Once the kernel has been updated, we recalculate the charge density via Eq. (4); the new density is then used to update the potential, with an optional step wherein the density is mixed with the previous one in order to improve convergence. This procedure is repeated until the kernel is converged; in practice, we consider convergence to have been reached once the mean difference between the densities of two consecutive iterations falls below a given threshold, i.e. ∆ρ < c.

The direct diagonalization method consists of solving the generalized eigenproblem for a given Hamiltonian and overlap matrix. Its implementation therefore relies straightforwardly on linear algebra solvers and will not be detailed here.

In the direct minimization approach the band structure energy is minimized subject to the orthogonality of the support functions. To this end, we express the gradient of the Kohn-Sham orbitals, |gi⟩ = HKS|Ψi⟩ − Σj Λij|Ψj⟩, in terms of the support functions, i.e. |gi⟩ = Σα d^α_i |φα⟩. The d^α_i are obtained by solving

    \sum_\alpha S_{\beta\alpha} d^\alpha_i = \sum_\alpha H_{\beta\alpha} c^\alpha_i - \sum_j \sum_\alpha S_{\beta\alpha} c^\alpha_j \Lambda_{ji} , \qquad \Lambda_{ji} = \sum_{\gamma,\delta} c^{\gamma *}_j c^\delta_i H_{\gamma\delta} .    (26)

The coefficients are optimized using this gradient via steepest descent or DIIS. Once the gradient has converged to the required threshold, the density kernel is calculated from the coefficients and occupancies.
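In matrix form, Eq. (26) amounts to solving S d = H c − S c Λ with Λ = cᵀHc. A NumPy sketch (toy H and S, not BigDFT data) also verifies that the gradient vanishes when c holds the occupied generalized eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(2)
n, nocc = 8, 3
A = rng.standard_normal((n, n))
H = 0.5 * (A + A.T)                            # toy Hamiltonian matrix
B = rng.standard_normal((n, n))
S = np.eye(n) + 0.05 * (B + B.T)               # toy nearly orthonormal overlap

def coeff_gradient(c):
    """d from Eq. (26): solve S d = H c - S c Lambda with Lambda = c^T H c."""
    Lam = c.T @ H @ c
    return np.linalg.solve(S, H @ c - S @ c @ Lam)

# At the minimum, c holds the occupied generalized eigenvectors of (H, S);
# obtain them via a Cholesky transformation of the eigenproblem.
L = np.linalg.cholesky(S)
Linv = np.linalg.inv(L)
w, Y = np.linalg.eigh(Linv @ H @ Linv.T)
c_opt = Linv.T @ Y[:, :nocc]                   # H c = S c diag(w), c^T S c = I

g_opt = coeff_gradient(c_opt)                  # should vanish
g_rand = coeff_gradient(rng.standard_normal((n, nocc)))
```

In the linear scaling context the solve with S would of course not be done exactly; as noted below, a Taylor expansion of S⁻¹ suffices since S is close to the identity.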

In the Fermi operator expansion method, the density matrix may be defined as a function of the Hamiltonian as F = f(H), where f is the Fermi function. In terms of the support functions, this corresponds to an expression for the density kernel in terms of the Hamiltonian matrix, i.e. K = f(H). The central idea of the FOE24,41,42 is to find an expression for f(H) which can be efficiently evaluated numerically. One particularly simple possibility is a polynomial expansion; for numerical stability we use Chebyshev polynomials43. As will be shown in detail in Appendix B, the density kernel can be constructed using only matrix-vector multiplications thanks to the recursion formulae for the Chebyshev matrix polynomials.
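The idea can be sketched for a dense toy Hamiltonian in an orthonormal basis (so that K = f(H) directly): the Chebyshev coefficients of the Fermi function are computed on the spectrum scaled to [−1, 1], and the kernel is assembled through the two-term recursion. For brevity the recursion is applied to full matrices rather than column by column, and exact spectral bounds are used; both are simplifications of the actual scheme:

```python
import numpy as np

def foe_kernel(H, mu, kT, deg=200):
    """Density kernel K = f(H) via a Chebyshev expansion of the Fermi
    function f, using only the recursion T_{k+1} = 2 H T_k - T_{k-1}."""
    w = np.linalg.eigvalsh(H)                 # exact bounds; a simplification
    emin, emax = w[0] - 0.1, w[-1] + 0.1
    a = 2.0 / (emax - emin)                   # map spectrum onto [-1, 1]
    b = -(emax + emin) / (emax - emin)
    Hs = a * H + b * np.eye(len(H))

    def fermi_scaled(x):                      # Fermi function in the
        return 1.0 / (1.0 + np.exp(((x - b) / a - mu) / kT))  # scaled variable

    # Chebyshev coefficients from Gauss-Chebyshev nodes
    k = np.arange(deg)
    theta = np.pi * (k + 0.5) / deg
    fvals = fermi_scaled(np.cos(theta))
    c = np.array([(2.0 / deg) * np.sum(fvals * np.cos(j * theta))
                  for j in range(deg)])
    c[0] *= 0.5

    T_prev, T_curr = np.eye(len(H)), Hs.copy()
    K = c[0] * T_prev + c[1] * T_curr
    for j in range(2, deg):
        T_prev, T_curr = T_curr, 2.0 * Hs @ T_curr - T_prev
        K = K + c[j] * T_curr
    return K

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
H = 0.5 * (A + A.T)
K = foe_kernel(H, mu=0.0, kT=0.5)
```

Applying the recursion to one column at a time turns each step into a matrix-vector product, which is what makes the method compatible with sparse storage and localization regions.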

1. Suitability of the methods

All three methods for calculating the density kernel (direct diagonalization, direct minimization and FOE) yield the same final result; the main differences thus lie in their performance, where one of the most important points is the performance of the linear algebra. Due to the localized nature of the support functions, the overlap, Hamiltonian and density kernel matrices are in general sparse, with the level and pattern of sparsity depending on the localization radii of the support functions and the dimensionality of the system in question. We can take advantage of this sparsity by storing and using these matrices in compressed form; indeed, this is necessary to achieve a fully linear scaling algorithm.

For diagonalization, exploiting the sparsity is very hard due to the lack of efficient parallel solvers for sparse matrices. The method therefore performs badly for large systems due to its cubic scaling, but thanks to its small prefactor it can still be useful for smaller systems consisting of a few hundred atoms.

For direct minimization the situation is better, since both the solution of the linear system of Eq. (26) and the orthonormalization can be approximated using Taylor expansions for S⁻¹ and S^(−1/2), respectively. It can also be easily parallelized, so that the cubic scaling terms only become problematic for systems containing more than a few thousand atoms, as demonstrated in Section VII (see Fig. 7). Indeed, for moderate system sizes, the extra overhead associated with the manipulation of sparse matrices makes dense matrix algebra the cheaper option, and so calculations on many physically interesting systems are already considerably accelerated.

Even though the minimization method does not scale linearly in its current implementation, there are some situations where its use is advantageous. For example, the unoccupied states are generally not well represented by the contracted basis set following the optimization procedure, and in cases where we have only the density kernel and not the coefficients, it becomes necessary to use another approach to calculate them, such as optimizing a second set of minimal basis functions44, which can be expensive. However, as with the diagonalization approach, the ability to work directly with the coefficients makes it possible to optimize the support functions and coefficients so as to accurately represent a few states above the Fermi level at the same time as the occupied states, without significantly impacting the cost.

The FOE approach leads to linear scaling if the sparsity of the Hamiltonian is exploited and if the build-up of the density matrix is done within localization regions. These localization regions do not need to be identical to the localization regions employed for the calculation of the contracted basis set; in general they will actually be larger. The final result turns out to be relatively insensitive to the size chosen for the density matrix localization regions, above a sensible minimum value. By exploiting this sparsity, we can access system sizes of around ten thousand atoms with a moderate use of parallel resources.

C. Parallelization

Like standard BigDFT, we have a multi-level MPI/OpenMP parallelization scheme45; the details of the MPI parallelization are presented in Appendix D. In Fig. 4 we show the effective speedup as a function of the number of cores for a large water droplet, keeping the number of OpenMP threads fixed at four. We measured the parallelization up to 3840 cores, taking a 160 core run as the reference. As can be seen, the effective speedup reaches about 92% of the ideal value at 480 cores, decreasing to 62% for 3840 cores. Fitting the data to Amdahl's law46 shows that at least 97% of the code has been parallelized; the true value is higher, as the times shown include the communications and are relative to 160 cores rather than to a serial run. Fig. 4 also shows a breakdown of the total calculation time into different categories, where we see that the communications start to become limiting for the highest number of cores, demonstrating that this is the upper bound appropriate for this system size. For a larger system, this limit will of course be higher.

FIG. 4. Parallel scaling for a water droplet containing 960 atoms for between 160 and 3840 cores, with the effective and ideal speedup with respect to 160 cores, including a fit of Amdahl's law [below], and a breakdown of the calculation time for selected numbers of processors [above]. The categories presented are for communications, linear algebra, convolutions, calculation of the potential and density, and miscellaneous.
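The quoted fit can be reproduced schematically by modelling the runtime as t(k) ∝ (1 − P) + P/k, with k the core count relative to the 160-core reference, and fitting the single parameter P to the speedup measurements. The numbers below are illustrative values in the spirit of Fig. 4, not the actual measured data, and the normalization to the 160-core reference is our own reading of the fit:

```python
import numpy as np

def amdahl_speedup(cores, P, ref_cores=160):
    """Speedup relative to a ref_cores run for parallel fraction P:
    S(k) = 1 / ((1 - P) + P / k), with k = cores / ref_cores."""
    k = cores / ref_cores
    return 1.0 / ((1.0 - P) + P / k)

# Hypothetical (cores, speedup w.r.t. 160 cores) measurements; illustrative
# numbers in the spirit of Fig. 4, not the actual data.
cores = np.array([160.0, 480.0, 960.0, 1920.0, 3840.0])
measured = np.array([1.00, 2.83, 5.22, 9.02, 14.20])

# One-parameter least-squares fit: scan candidate values of P
Ps = np.linspace(0.90, 0.999, 991)
errors = [np.sum((amdahl_speedup(cores, P) - measured) ** 2) for P in Ps]
P_best = Ps[int(np.argmin(errors))]
```

With data of this shape the fitted parallel fraction comes out close to 0.97, matching the behavior described above.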

V. CALCULATION OF IONIC FORCES

In a self-consistent KS calculation (i.e. when the charge density is derived from the numerical set of wavefunctions), the forces acting on atom a are given by the negative gradient of the band structure energy with respect to the atomic positions Ra. The Hellmann-Feynman force, given by the expression

    \mathbf{F}^{(\mathrm{HF})}_a = - \sum_i f_i \left\langle \Psi_i \left| \frac{\partial H}{\partial \mathbf{R}_a} \right| \Psi_i \right\rangle ,    (27)

involves only the functional derivative of the Hamiltonian operator. This term is evaluated numerically in the computational setup used to express the ground state energy. As explained in more detail in Appendix C 1, with the cubic version of BigDFT only the Hellmann-Feynman term contributes to the forces, as the remaining part tends to zero in the limit of small grid spacings.

However, when the KS orbitals are expressed in terms of the support functions, there is an additional contribution which is not captured by the computational setup. As demonstrated in Appendix C 2, it is given by

    \mathbf{F}^{(P)}_a = -2 \sum_{\alpha,\beta} \mathrm{Re} \left( K^{\alpha\beta} \left\langle \chi_\beta \left| \frac{\partial \mathcal{L}(\alpha)}{\partial \mathbf{R}_a} \right| \phi_\alpha \right\rangle \right) ,    (28)

where

    | \chi_\alpha \rangle = H_{\mathrm{KS}} | \phi_\alpha \rangle - \sum_j \sum_{\rho,\sigma} c^\rho_j \varepsilon_j c^{\sigma *}_j S_{\sigma\alpha} | \phi_\rho \rangle    (29)

is the residual vector of the support function |φα⟩, which is related to the support function gradient |gα⟩ (see Eq. (15)). This term can be considered as the equivalent of a Pulay contribution to the ionic forces, arising from the explicit dependence of the localization operators on the atomic positions.

The vector ∂L(α)/∂Ra |φα⟩ depends only on the values of the support function on the borders of the localization regions (Eq. (C12)). Therefore, if the scalar product between the residues |χβ⟩ and the values of the support functions |φα⟩ at the boundaries of their localization regions is smaller than the norm of the residue itself (which quantifies the accuracy of the results), the Pulay term can be safely neglected.

As mentioned in Section IV A, the Laplacian operator in the wavelet basis causes the kinetic energy to be high if the values at the edges are non-negligible. Such a situation is therefore penalized by the energy minimization, and so the values at the borders are guaranteed to remain low. Indeed, we have seen excellent agreement between the Hellmann-Feynman term alone and the forces calculated using standard BigDFT, as will be demonstrated in Sections VI and VIII A.

The Hellmann-Feynman force is thus the only relevant term even in the contracted basis approach, and is given by

    \mathbf{F}^{(\mathrm{HF})}_a = - \sum_{\alpha,\beta} K^{\alpha\beta} \left\langle \phi_\alpha \left| \frac{\partial H}{\partial \mathbf{R}_a} \right| \phi_\beta \right\rangle .    (30)

It is identical to the implementation in standard BigDFT8, and so the different terms are not repeated here. The only difference is that instead of applying the operator to the wavefunctions, we now apply it to all overlapping support functions. This can be done efficiently since each support function overlaps with only a few neighbors.

VI. ACCURACY

We have applied our minimal basis approach to a number of systems, depicted in Fig. 5, in order to demonstrate both its accuracy and its applicability. All calculations have been done using the local density approximation (LDA) exchange-correlation functional47 and HGH pseudopotentials30. In addition, we have used free boundary conditions, avoiding the need for the supercell approximation. The values of the wavelet basis parameters for the different systems, as well as the localization radii, were selected in order to achieve accuracies better than 1 meV/atom. This corresponded to values between 0.13 Å and 0.20 Å for h, between 5.0 and 7.0 for λc, and between 7.0 and 8.0 for λf. Unless otherwise stated, we have used the direct minimization scheme for the density kernel optimization. For hydrogen atoms one basis function was used per atom, whereas for all other elements four basis functions were used per atom, except where otherwise stated.

A. Benchmark systems

We demonstrate excellent agreement with the traditional cubic scaling method for both energies and forces, of the order of 1 meV/atom for the energy and a few meV/Å for the forces, as shown in Tab. I. We also demonstrate systematic convergence of the total energy and forces with respect to the localization radius for a molecule of C60, where the largest localization regions are close to the total system size, as depicted in Fig. 6. For all these systems the level of accuracy achieved for the forces is of the same order as that of the cubic code using the Hellmann-Feynman term only.

B. Silicon defect energy

In order to demonstrate the accuracy of our method for a practical application we calculated the energy of a vacancy defect in a hydrogen terminated silicon cluster containing 291 atoms, shown in Fig. 5(f). As shown in Tab. II, the difference in defect energy between the cubic reference calculation and the linear version is 129 meV using 4 support functions per Si atom and 1 support function per hydrogen atom. Even more accurate results can be achieved by increasing the number of support functions per Si atom to 9, which reduces the error to 12 meV. Increasing the localization radii does not further improve the accuracy, as the result is already within the noise level. To achieve these results, the support functions were optimized using the hybrid mode and the density kernel was optimized using the FOE approach with a cutoff of 7.94 Å for the kernel construction.

C. Consistency of energies and forces

Following the discussion in Section V, we have calculated the average value of the support functions on the borders of their localization regions for various systems, and found it to be at least three orders of magnitude smaller than the norm of the support function residue (defined in Eq. (C1)). This is in line with our expectations, as discussed in Section V, and implies that the Pulay terms should be negligible compared to the error introduced into the Hellmann-Feynman term by the localization constraint. Indeed, this agrees with the calculated forces for the systems presented thus far. To further verify that the Pulay term can be neglected, and to quantify the different sources of error, we have also checked that the calculated forces are consistent with the energy, i.e. that they correspond to its negative derivative. To this end, initial and final configurations R(a) and R(b) of a given system were chosen, where R represents the atomic positions. Small steps ∆R were then taken between R(a) and R(b). If the forces F are correctly evaluated we should have

    \Delta E = \int_a^b \mathbf{F}(\mathbf{r}) \cdot d\mathbf{r} \approx \sum_{\mu=a}^{b} \mathbf{F}(\mathbf{R}_\mu) \cdot \Delta \mathbf{R}_\mu ,    (31)

where μ labels the intermediate steps between configurations R(a) and R(b). This approximation can be compared with the exact value obtained by directly calculating the energy difference, i.e. E(R(b)) − E(R(a)). These two values should agree with each other up to the noise level of the calculation.
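The same consistency check can be illustrated on a model energy surface, where the force is known analytically as F = −∇E, so that the discrete line integral of Eq. (31) reproduces the energy difference up to the integration error. The model potential and path are our own:

```python
import numpy as np

def energy(R):
    """Model energy surface (anharmonic well); stands in for E(R)."""
    return 0.5 * np.sum(R**2) + 0.1 * np.sum(R**4)

def force(R):
    """Analytic force F = -dE/dR for the model surface."""
    return -(R + 0.4 * R**3)

Ra = np.array([0.2, 0.0, 0.1])                 # initial configuration R(a)
Rb = np.array([1.0, 0.5, -0.3])                # final configuration R(b)

nstep = 1000
dR = (Rb - Ra) / nstep
R = Ra.copy()
work = 0.0
for _ in range(nstep):
    work += force(R + 0.5 * dR) @ dR           # midpoint rule for F . dR
    R += dR

dE = energy(Rb) - energy(Ra)
# With consistent forces, the line integral matches -dE up to the
# discretization error (the analogue of the noise level in Tab. III).
```

In a DFT calculation the residual discrepancy between the two values measures the combined noise from the basis, localization and kernel approximations, which is exactly what the setups below disentangle.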

To analyze the different terms contributing to the noise in the forces, we use a combination of the hybrid and FOE methods with five progressive setups which give an estimate of the magnitude of the various error sources:

1. Using the cubic scaling scheme, where all orbitals can extend over the full simulation cell.
2. Using the minimal basis approach, but without localization constraints or confining potential.
3. Applying a confining potential but no localization constraints.
4. Using a finite localization radius of 7.94 Å for the density kernel, but not for the support functions.
5. Applying in addition strong localization radii of 4.76 Å to the support functions.

This test was done for a 92 atom alkane; despite the relatively small system size, the introduction of finite cutoff radii for the support functions and the density kernel construction has a strong effect, since for chain-like structures the volume of the localization region is only a small fraction of the total computational volume. The results are shown in Tab. III, for a step size of ∆Rμ = 0.003 Å. As expected, without the application of the localization constraint, the errors for the minimal basis calculations are of the same order of magnitude as for the reference cubic calculation. This is also the case when a finite cutoff radius for the construction of the density kernel is introduced. Once a finite localization is imposed on the support functions, the discrepancy between the energy difference and the force integral increases by an order of magnitude; however, it remains small, in agreement with our previous observations about the Pulay forces for sufficiently large localization radii.

FIG. 5. The different systems studied, where gray denotes carbon, white hydrogen, red oxygen, gold nitrogen, bronze boron, green silicon and blue sulfur. The values used for the support function localization radii are also given: (a) cinchonidine (C19H22N2O), Rcut = 5.29 Å; (b) boron cluster48, Rcut = 5.82 Å; (c) alkane, Rcut = 5.29 Å; (d) water droplet, Rcut = 5.29 Å; (e) C60; (f) silicon cluster, Rcut = 4.76 Å; (g) SiC49, Rcut = 5.82 Å; (h) ladder polythiophene, Rcut = 5.29 Å.

TABLE I. Energy and force differences between the minimal basis approach and the standard cubic version of the code. For the forces, the root mean squared force is given for each approach, as well as the average difference between each component.

                Num.    Energy (eV)                                    Forces (eV/Å)
                atoms   Min. basis   Cubic        (Min.−Cub.)/atom     Min. basis   Cubic        Av. (Min.−Cub.)
Cinchonidine    44      −4.273·10³   −4.273·10³   2.0·10⁻³             2.734·10⁻²   2.772·10⁻²   4.3·10⁻³
Boron cluster   80      −6.141·10³   −6.141·10³   2.4·10⁻³             1.305·10⁻²   1.316·10⁻²   4.2·10⁻³
Alkane          257     −1.592·10⁴   −1.592·10⁴   3.7·10⁻⁴             3.760·10⁻²   3.767·10⁻²   1.0·10⁻³
Water           450     −7.011·10⁴   −7.012·10⁴   1.7·10⁻³             3.730·10⁻²   3.730·10⁻²   2.5·10⁻³

VII. SCALING AND CROSSOVER POINT

We have applied the minimal basis method to alkanes, using both the direct minimization and FOE approaches for the kernel optimization. The time taken per iteration is compared with the traditional cubic-scaling version in Fig. 7. The number of iterations required to reach convergence is similar for the cubic and minimal basis approaches and is approximately constant across system sizes, so that the total time taken shows similar behaviour. The results clearly demonstrate the improved scaling of the method, with a crossover point for the total time at around 150 atoms. This will of course be system dependent; the chain-like nature of the alkanes makes them a particularly favorable system for the minimal basis approach. We also plot cubic polynomials for the timing data; whilst the cubic scaling approach only has a very small cubic term, both this and the quadratic term are noticeably reduced for both the FOE and direct minimization approaches. Indeed, the FOE method is predominantly linear scaling, compared to direct minimization, which has larger quadratic and cubic terms, mainly due to the linear algebra, as expected.

FIG. 6. Convergence of the total energy and forces with respect to the localization radius as compared to the cubic values for a molecule of C60. For large localization regions the accuracy of the forces is sufficient for high accuracy geometry optimizations.

TABLE II. Total energies and energy differences for a silicon cluster (Fig. 5(f)) with and without a vacancy defect. The first column shows the number of support functions per Si and H atom respectively, and the last column shows the difference in defect energy between the cubic and linear versions.

          pristine (eV)   vacancy (eV)   ∆ (eV)    ∆ − ∆cubic (meV)
cubic     −20674.223      −20563.056     111.167   –
4/1       −20667.556      −20556.518     111.038   129
9/1       −20672.856      −20561.701     111.155   12

TABLE III. Force calculations for the five setups described in the text, with all quantities given in eV.

Setup   ∆E         ∫F(r)·dr   diff.
1       5.666961   5.667190   −2.3·10⁻⁴
2       5.666966   5.666999   −3.3·10⁻⁵
3       5.666958   5.667024   −6.5·10⁻⁵
4       5.667239   5.667024   2.1·10⁻⁴
5       5.669992   5.673043   −3.1·10⁻³

The minimal basis approach also gives considerable savings in memory; for the above example the memory requirements of the cubic version still prohibit calculations on systems bigger than around 2000 atoms for the chosen number of processors, whereas for the minimal basis method the memory requirements allow calculations of up to 4000 atoms using direct minimization, and nearly 8000 atoms with FOE.

To take full advantage of the improvements made to BigDFT, it is not enough for the time taken per iteration to scale favorably with respect to system size; it is also necessary that the number of iterations needed to reach convergence does not increase with system size.

FIG. 7. Comparison between the time taken per iteration and memory usage for the cubic scaling and minimal basis approaches using both direct minimization (Dmin) and FOE for increasing length alkanes, where the time is for the wavefunction optimization, neglecting the input guess. The coefficients for the corresponding cubic polynomials ax³ + bx² + cx are shown (cubic: 2.8·10⁻⁹, 8.4·10⁻⁶, 4.1·10⁻³; Dmin: 1.3·10⁻¹⁰, 6.1·10⁻⁷, 3.9·10⁻³; FOE: 2.1·10⁻¹¹, 1.2·10⁻¹¹, 3.8·10⁻³). A fixed number of 301 MPI tasks and 8 OpenMP threads was used.

We have demonstrated such behavior for randomly generated non-equilibrium water droplets of increasing size, as shown in Fig. 8. The number of iterations required to reach a good level of agreement with the cubic scaling version of the code remains approximately constant, with the fluctuations due to the random noise in the bond lengths of the water molecules. Furthermore, the energy converges rapidly to a value very close to that obtained with the cubic code, as illustrated by the upper panel. We have also observed similar convergence behavior for other systems, including alkanes, as mentioned above.

VIII. FLEXIBILITY OF THE MINIMAL BASIS FORMALISM

A. Geometry optimization

As a further test of the quality of the forces, and as a demonstration of the flexibility of the minimal basis formalism, we have performed a geometry optimization for a segment of a SiC nanotube containing 288 atoms, depicted in Fig. 5(g). Here we can take advantage of the minimal basis formalism by reusing the optimized support functions from the previous geometry step as an improved input guess, moving them with the atoms using an interpolation scheme to account for atomic displacements which are not multiples of the grid spacing h. This has the effect of reducing the number of iterations required to converge the support functions for each new geometry. In fact, for cases where the atoms have only moved a small amount, they will hardly need optimizing at all, and so substantial savings can be made. A similar procedure also exists for the cubic version, but the minimal basis approach can profit much more because of the direct relation between the support function centers and the atomic positions.

FIG. 8. Convergence behavior for water droplets, showing the convergence with respect to outer loop iteration number for the system containing 60 atoms [above] and the energy difference with the converged cubic value after certain numbers of iterations [below]. The number of iterations is indicated by the color used; circles denote iterations with a confining potential and squares without.

We compared the convergence behavior and time taken for the minimal basis approach, both with and without reusing the support functions at each geometry step, with those for the standard cubic approach; the results are shown in Fig. 9. It is clear that the Hellmann-Feynman forces are sufficiently accurate to optimize the structure to the required level; in this case forces of below 10⁻² eV/Å are readily achieved. For this system size we are below the crossover point, such that when the support functions are reset at each geometry step the time taken per step is greater than that for the cubic approach. However, the reuse of support functions results in a significant reduction in the number of steps required to fully converge the support functions, and so the total time is less than that required for the cubic approach. This means that the crossover point will be reduced for geometry optimizations or molecular dynamics calculations, opening up further possibilities for the highly accurate study of the dynamics of large systems.

B. Charged systems

As previously mentioned, the ability to use free boundary conditions is essential for charged systems. This has enabled us to perform calculations on isolated segments of ladder polythiophene (LPT) (Fig. 5(h)), initially in a neutral state and then adding a charge of plus or minus two electrons. The support functions from the neutral case are also well suited to the charged system, so that only kernel optimizations are required, which can reduce the computational cost by an order of magnitude.

FIG. 9. Geometry optimization for a segment of a SiC nanotube for the cubic and minimal basis approaches, where the support functions are regenerated from atomic orbitals at each geometry step ("reset") and where they are reformatted for reuse at the next iteration ("reformat"). The time taken for each step, cumulative time, force convergence and average distance from the final structure optimized using standard BigDFT are plotted for each step of the geometry optimization.

TABLE IV. Energy differences between the standard cubic version and the minimal basis approach for absolute energies (∆EQ) and energy differences with the neutral system (|∆(EQ − E0)|) for 63 atom segments of LPT with a net charge Q. Results are shown both for fully optimized support functions ("opt.") and for support functions reused from the neutral calculation ("unopt."). All results are in meV.

Q     Sup. func.   ∆EQ   |∆(EQ − E0)|
0     opt.         176   –
−2    opt.         147   28
      unopt.       292   117
+2    opt.         128   47
      unopt.       200   24

In Tab. IV we compare the agreement between the minimal basis approach and the standard cubic approach for a system containing 63 atoms. We demonstrate an agreement of the order of 100 meV for the energy differences, both for the fully optimized set of support functions and for the reuse of the support functions from the neutral system. For the negatively charged calculations we have also confirmed that this level of accuracy is maintained up to 300 atoms, beyond which size the cost of calculations with the cubic version of BigDFT increases significantly.

In order to converge the results obtained with the min-864

imal basis to a good level of accuracy, we used 9 support865

13

functions for carbon and sulfur and 1 per hydrogen. For866

charged systems we have found that the direct minimiza-867

tion method is more stable, as it allows us to update the868

coefficients in smaller steps before updating the kernel869

and therefore density, rather than fully converging them870

before each update.871

We expect such support function reuse to be generally872

applicable for systems where the addition of a charge873

only results in a perturbation of the electronic structure.874

However it may be necessary to optimize a few unoccu-875

pied states (using direct minimization or diagonalization)876

in order to ensure that the contracted basis is sufficiently877

accurate for negatively charged systems.878

IX. CONCLUSION

We have presented a self-consistent minimal basis approach within BigDFT which leads to a reduced scaling behavior with system size and allows the treatment of larger systems than can be treated with the cubic version; for very large systems linear scaling is clearly visible. The use of a small set of nearly orthogonal contracted basis functions which are optimized in situ in the underlying wavelet basis set gives rise to sparse matrices of relatively small size. For the optimization of these so-called support functions we use a confining potential which on the one hand helps to keep the support functions strictly localized, and on the other hand helps to alleviate the notorious ill-conditioning which is typical of linear scaling approaches.

The standard cubic scaling version of BigDFT has been previously demonstrated to give highly accurate results and so we use this as a standard of comparison for our method. We have demonstrated for a number of different systems excellent agreement with the cubic version for both energy and forces. In particular, we have demonstrated that it is not necessary to include Pulay-like correction terms to the atomic forces, thanks to the nature of the Laplacian operator in the wavelet basis which ensures the support functions remain negligible on the borders of the localization regions. In addition, we have shown consistent convergence behavior across a range of system sizes. From the viewpoint of scaling with the number of atoms we have demonstrated linear scaling for the FOE method where the linear algebra has been written to exploit the sparsity of the matrices.

Finally, we have highlighted some of the advantages of using localized support functions expressed in a wavelet basis set. These include the ability to further accelerate geometry optimizations by reusing the support functions from the previous geometry, and the possibility of achieving a good level of accuracy for a charged calculation by reusing the support functions from a neutral calculation. By directly working in the basis of the support functions, we can therefore reduce the number of degrees of freedom needed to express the KS operators for a targeted accuracy. Aside from reducing the computational overhead, this flexible approach paves the way for future developments, where the contracted basis functions can be reused in other situations, including for example constrained DFT calculations of large systems. Work is ongoing in this direction.

The authors would like to acknowledge funding from the European project MMM@HPC (RI-261594), the CEA-NANOSCIENCE BigPOL project, the ANR projects SAMSON (ANR-AA08-COSI-015) and NEWCASTLE, and the Swiss CSCS grants s142 and h01. CPU time and assistance were provided by CSCS, IDRIS, Oak Ridge National Laboratory and Argonne National Laboratory.

Appendix A: Prescription for reducing the confinement

To derive a prescription for reducing the confinement, it is assumed that the change in the target function between successive iterations of the minimization procedure can be approximated to first order by

∆Ω′(n) = ∑_α 〈gα(n)|∆φα(n)〉 , (A1)

where |∆φα(n)〉 is the change in support function between iterations n and n+1, i.e. |∆φα(n)〉 = |φα(n+1)〉 − |φα(n)〉, and |gα(n)〉 is the gradient of the target function with respect to the support function at iteration n. Due to the influence of the confinement and the localization regions, the gradient of the support functions and thus ∆Ω′(n) will not go down to zero. However, the actual change in the target function, ∆Ω(n), will at some point go to zero, meaning that further optimization becomes impossible for the localization region and confining potential currently used. In this case, the only way to further minimize the target function is to decrease the confining potential. Therefore at each step of the minimization the ratio between the actual and estimated decreases in the target function is determined:

κ = ∆Ω(n) / ∆Ω′(n) . (A2)

This value is then used to update the confinement prefactor, aα, at the start of the following support function optimization loop, via

aα^new = κ aα^old . (A3)

If κ is of the order of one, this implies there is still some scope for optimizing the support functions using the current confining potential and it should not be updated. If, on the other hand, κ is much smaller, it will hardly be possible to further improve the support functions and so the magnitude of the confining potential should be decreased. In this way one gets a smooth transformation from the hybrid expression to the energy expression.
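The update rule of Eqs. (A2)–(A3) can be sketched in a few lines. This is a minimal illustration, not the BigDFT implementation; in particular, the guard that leaves the prefactor unchanged when κ is of order one is our reading of the prose above, and the exact threshold used in practice is an assumption.

```python
def update_confinement(a_old, d_omega_actual, d_omega_predicted):
    """Update the confinement prefactor a_alpha following Eqs. (A2)-(A3).

    d_omega_actual:    actual change of the target function in this iteration
    d_omega_predicted: first-order estimate of the change, Eq. (A1)
    """
    kappa = d_omega_actual / d_omega_predicted   # Eq. (A2)
    if kappa >= 1.0:
        # Ratio of order one: the support functions can still be improved
        # with the current confinement, so leave the prefactor unchanged.
        return a_old
    # Optimization is stalling: weaken the confining potential, Eq. (A3).
    return kappa * a_old
```

For example, a stalled optimization with κ = 0.5 halves the prefactor, while κ ≥ 1 leaves it untouched.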


Appendix B: Fermi operator expansion

In the FOE method the density kernel is given as a sum of Chebyshev polynomials. Since these polynomials are only defined in the interval [−1, 1], it is necessary to shift and scale the Hamiltonian such that its eigenvalue spectrum lies within this interval. If εmin and εmax are the smallest and largest eigenvalues that would result from diagonalizing the Hamiltonian matrix according to Hci = εiSci, then the scaled Hamiltonian, H̄, has to be built using

H̄ = σ(H − τS), (B1)

with

σ = 2/(εmax − εmin), τ = (εmin + εmax)/2 . (B2)

Now the density kernel can be calculated according to

K ≈ p(H′) = (c0/2) I + ∑_{i=1}^{npl} ci Ti(H′) (B3)

with

H′ = S^{−1/2} H̄ S^{−1/2},
H̄ = σ(H − τS),
σ = 2/(εmax − εmin), τ = (εmin + εmax)/2,
(B4)

where I is the identity matrix, Ti the Chebyshev polynomial of order i, S the support function overlap matrix and S^{−1/2} is calculated using a first order Taylor expansion. To determine the expansion coefficients ci, one has to recall that the density matrix of Eq. (5) is a projection operator onto the occupied subspace of the KS orbitals:

〈ψi|F|ψj〉 = f(εj)δij . (B5)

Since F and H have the same eigenfunctions, one can express the polynomial p(H) in the same way, leading to

〈ψi|p(H)|ψj〉 = p(εj)δij (B6)

with

p(ε) = c0/2 + ∑_{i=1}^{npl} ci Ti(ε). (B7)

By comparing Eqs. (B5) and (B6) it becomes clear that the polynomial expansion p(ε) has to approximate the Fermi function f(ε) in the interval [−1, 1]. Thus the coefficients ci are simply given by the expansion of the Fermi function in terms of the Chebyshev polynomials. The time for this step is negligible compared to the other operations related to the FOE. However, in practice it turns out that it is advantageous to replace the Fermi function by

f(ε) = (1/2)[1 − erf((ε − µ)/∆ε)], (B8)

since it approaches the limits 0 and 1 faster as one goes away from the chemical potential. ∆ε is typically a fraction of the band gap.

The last step is to evaluate the Chebyshev polynomials and to build the density kernel. If the lth column of the Chebyshev matrix T is denoted by t_l, then these vectors fulfill the recursion relation

t_l^0 = e_l,
t_l^1 = H′ e_l,
t_l^{j+1} = 2H′ t_l^j − t_l^{j−1},
(B9)

where e_l is the lth column of the identity matrix. The lth column of the density kernel, denoted by k_l, is then given by the linear combination of all the columns t_l according to Eq. (B3), i.e.

k_l = (c0/2) t_l^0 + ∑_{i=1}^{npl} ci t_l^i . (B10)

This demonstrates that the density kernel can be constructed using only matrix vector multiplications.
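The whole construction of Eqs. (B1)–(B10) can be sketched with NumPy. This is a dense toy sketch under simplifying assumptions: the basis is taken orthonormal (S = I), so H′ reduces to the scaled Hamiltonian, the spectral bounds are obtained by diagonalization rather than estimated, and `chebinterpolate` returns the plain series ∑ ci Ti, so the c0/2 convention of Eq. (B3) is already absorbed into its first coefficient. The real implementation instead exploits matrix sparsity.

```python
import math
import numpy as np
from numpy.polynomial import chebyshev as C

def foe_kernel(H, mu, delta, npl=200):
    """Density kernel via the Fermi operator expansion, Eqs. (B1)-(B10).

    Toy sketch: assumes an orthonormal basis (S = I) and dense matrices.
    mu is the chemical potential, delta the smearing width of Eq. (B8).
    """
    n = len(H)
    # Scale and shift the spectrum into [-1, 1] (Eqs. (B1)-(B2)); the exact
    # spectral bounds are padded slightly for safety.
    eigs = np.linalg.eigvalsh(H)
    emin, emax = eigs[0] - 0.1, eigs[-1] + 0.1
    sigma = 2.0 / (emax - emin)
    tau = 0.5 * (emin + emax)
    Hs = sigma * (H - tau * np.eye(n))

    # Chebyshev coefficients of the erf-smoothed Fermi function (Eq. (B8)),
    # expressed on the scaled energy axis x = sigma*(eps - tau).
    erf = np.vectorize(math.erf)
    f = lambda x: 0.5 * (1.0 - erf((x - sigma * (mu - tau)) / (sigma * delta)))
    c = C.chebinterpolate(f, npl)

    # Column recursion of Eq. (B9), done for all columns at once:
    # T_0 = I, T_1 = H', T_{j+1} = 2 H' T_j - T_{j-1}.
    t_prev, t_cur = np.eye(n), Hs.copy()
    K = c[0] * t_prev + c[1] * t_cur
    for j in range(2, npl + 1):
        t_prev, t_cur = t_cur, 2.0 * Hs @ t_cur - t_prev
        K += c[j] * t_cur          # accumulate the columns as in Eq. (B10)
    return K
```

For a gapped spectrum the resulting K agrees with the kernel obtained by applying Eq. (B8) to the eigenvalues directly, and is nearly idempotent.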

Since the correct value of the Fermi energy is initially unknown, this procedure has to be repeated until the correct value has been found, so that Tr(KS) is equal to the number of electrons in the system. The band-structure energy can then be calculated by reversing the scaling and shifting operations:

EBS = Tr(KH̄)/σ + τ Tr(KS). (B11)
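The search for the chemical potential can be sketched as a bisection on Tr(KS). This is an illustrative sketch, not the production algorithm: `kernel_at` is a hypothetical routine returning the density kernel for a trial µ (for instance an FOE construction), and the loop exploits the fact that Tr(KS) is monotonically increasing in µ.

```python
import numpy as np

def find_fermi_level(kernel_at, S, nelec, mu_lo, mu_hi, tol=1e-8, maxit=200):
    """Bisect the chemical potential until Tr(KS) equals the electron count.

    kernel_at(mu) is assumed to return the density kernel for a trial mu.
    mu_lo and mu_hi must bracket the Fermi level.
    """
    mu = 0.5 * (mu_lo + mu_hi)
    for _ in range(maxit):
        mu = 0.5 * (mu_lo + mu_hi)
        n = np.trace(kernel_at(mu) @ S)
        if abs(n - nelec) < tol:
            break
        if n < nelec:
            mu_lo = mu   # too few electrons: raise the chemical potential
        else:
            mu_hi = mu   # too many electrons: lower it
    return mu
```

With four well-separated levels and two electrons, the bisection settles inside the gap between the second and third eigenvalues.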

Appendix C: Pulay forces

1. The traditional cubic approach

Numerically, the set of |Ψi〉 is expressed in a finite basis set. This means that the action of HKS can in principle lie outside the span of the |Ψi〉. Let us suppose that the KS Hamiltonian and orbitals are expressed in a basis set which is complete enough to describe them within a targeted accuracy. For the Daubechies basis in the traditional BigDFT approach, this happens when the grid spacing h is such as to describe the PSP and orbital oscillations, and the radii λc,f such as to contain the decreasing tails of the wavefunctions. This situation indeed corresponds to the traditional setup of a BigDFT run. We can therefore define a residual function

|χi〉 = HKS|Ψi〉 − εi|Ψi〉 , (C1)

which is of course zero when the numerical KS orbital is the exact KS orbital. By definition 〈Ψj|χi〉 = 0 ∀i, j. The norm of this vector, once projected in the basis set used to express |Ψi〉, is often used as a convergence criterion for the ground state energy.

Even though the basis set is finite, the orthogonality of the KS orbitals holds exactly, implying Re(〈Ψi|dΨi/dRa〉) = 0. It is thus easy to show that the numerical atomic forces are defined as follows:

−dEBS/dRa = −∑_i 〈Ψi|∂HKS/∂Ra|Ψi〉 − 2∑_i Re(〈χi|dΨi/dRa〉), (C2)

where the first term of the right hand side of the above equation is the Hellmann-Feynman contribution to the forces. The norm of |χi〉 (Eq. (C1)) can be reduced within the same basis set to meet this targeted accuracy. Therefore the projection of |dΨi/dRa〉 onto the basis set used for the calculation can be safely neglected as it is associated with the same numerical precision. Consequently, the atomic forces can be evaluated by the Hellmann-Feynman term only as the remaining part is proportional to the desired accuracy.

2. The minimal basis approach

As mentioned in the main text, when the KS orbitals are expressed in terms of the support functions, an additional Pulay-like term should in principle be taken into account. To demonstrate this, we define – in analogy to Eq. (C1) – the support function residue |χα〉, which becomes, using the identity Hρσ = ∑_j cρj εj c*σj,

|χα〉 = HKS|φα〉 − (∑_{ρ,σ} |φρ〉 Hρσ 〈φσ|) |φα〉
     = HKS|φα〉 − ∑_j ∑_{ρ,σ} cρj εj c*σj Sσα |φρ〉 .
(C3)

Next, inserting the definition of χi (Eq. (C1)) into the non-Hellmann-Feynman contribution of Eq. (C2) and using the relation Re(〈Ψi|dΨi/dRa〉) = 0 one obtains

Fa − Fa^(HF) = −2∑_i Re(〈Ψi|HKS|dΨi/dRa〉). (C4)

Expanding the KS orbitals in terms of the support functions, using the relation Hαβ = ∑_j ∑_{ρ,σ} εj cρj c*σj Sαρ Sσβ and the orthonormality of the KS orbitals, we can write

Fa − Fa^(HF) = −2∑_i ∑_{α,β} Re(c*αi cβi 〈φα|HKS|dφβ/dRa〉) − 2∑_i ∑_{α,β} Re(c*αi (dcβi/dRa) 〈φα|HKS|φβ〉)
            = −2∑_i ∑_{α,β} Re(c*αi cβi 〈φα|HKS|dφβ/dRa〉) − 2∑_i ∑_{β,σ} Re((dcβi/dRa) εi c*σi Sσβ) .
(C5)

From the orthonormality of the KS orbitals one can derive the relation

2∑_{α,β} Re((dcαi/dRa) c*βi Sαβ) = −∑_{α,β} c*αi cβi dSαβ/dRa . (C6)

Inserting this into Eq. (C5) yields

Fa − Fa^(HF) = −2∑_i ∑_{α,β} Re(c*αi cβi 〈φα|HKS|dφβ/dRa〉) + ∑_i ∑_{β,σ} Re(cβi c*σi (dSσβ/dRa) εi) .
(C7)

Again using the KS orthonormality condition, we can write

Fa − Fa^(HF) = −2∑_i ∑_{α,β} Re(c*αi cβi 〈φα|HKS|dφβ/dRa〉) + ∑_{i,j} ∑_{α,β,ρ,σ} Re(c*αi cβi c*σj εi cρj Sαρ dSσβ/dRa)
            = −2∑_{α,β} Re(Kβα 〈φα|HKS|dφβ/dRa〉) + 2∑_j ∑_{α,β,ρ,σ} Re(Kβα c*σj εj cρj Sαρ 〈φσ|dφβ/dRa〉) ,
(C8)

which becomes in terms of the support function residue of Eq. (C3)

Fa − Fa^(HF) = −2∑_{α,β} Re(Kβα 〈χα|dφβ/dRa〉) . (C9)

This result contains Eq. (C2) when no localization projectors are applied to the support functions. Therefore the only term of the forces which cannot be captured within the localization regions is the part which is projected outside. The extra Pulay term due to the localization constraint is therefore

Fa^(P) = −2∑_{α,β} Re(Kβα 〈χα|(1 − L(β))|dφβ/dRa〉) . (C10)


Using Eq. (12), we can show

Fa^(P) = −2∑_{α,β} Re(Kβα 〈χα|∂L(β)/∂Ra|φβ〉) . (C11)

When the localization regions are atom-centered, the derivative of the projector L(a) (as defined in Eq. (11)) can be evaluated analytically in the underlying basis set and is given by

[∂L(α)/∂Rβ]_{i1,i2,i3;j1,j2,j3} = δαβ δ_{i1j1} δ_{i2j2} δ_{i3j3} (R_{(i1,i2,i3)} − Rα)/Rcut × δ(Rcut − |R_{(i1,i2,i3)} − Rα|) . (C12)

This demonstrates that the Pulay term is only associated with the value of the support functions at the border of the localization regions.

Appendix D: Parallelization

It is a natural choice to divide the support functions between MPI tasks so that each one handles only a subset of support functions. For some operations these can be treated independently but for others, such as the calculation of scalar products between overlapping support functions needed to build the overlap and Hamiltonian matrices, communication of support functions between MPI tasks is required. One could directly exchange in a point-to-point fashion the parts of the support functions which overlap with each other, so that the scalar products can be calculated locally on each task. Although conceptually straightforward, this has severe drawbacks. Firstly, the amount of data being communicated is tremendous since the support functions generally have quite a notable overlap. This also results in a very poor ratio between computation and communication – in the extreme case where each task handles only one support function, each communicated element is only used for one operation. Secondly, there can be enormous load imbalancing for free boundary conditions, as support functions in the center of the system usually have more neighboring support functions than those near the edges. Finally, the data is split into a large number of small messages, which could result in a large overhead due to the latency of the network.

We therefore use a different approach, which requires a so-called “transposed” rather than “direct” arrangement of data. In this layout the simulation cell is partitioned among MPI tasks and the support functions are distributed to the various tasks such that each one can calculate a partial overlap matrix for a given region of the cell. Each task therefore has to receive those parts of all support functions which extend into its region. The partial matrices are then summed to build the full overlap matrix using MPI_Allreduce. This partitioning of the cell is done such that the load balancing among the MPI tasks is optimal, which in general does not correspond to a naive uniform distribution of the simulation cell. To

FIG. 10. An example depicting four support functions (Roman numerals) in a system consisting of four grid points (Arabic numerals) [above]. The support functions are constructed such that each extends over two grid points. The various data layouts are also illustrated [below]: the direct layout where each MPI task has all the data for certain support functions [left]; the naive transposed layout where each MPI task has all the data for some grid points [centre]; and the optimized transposed layout which is similar to the previous case, but with an optimal load balancing [right].

FIG. 11. Illustration of the transposition process for the system shown in Fig. 10. In step a, the data is rearranged locally on each MPI task, after which it is communicated using a single collective call (MPI_Alltoallv), as shown in step b. Finally in step c it is again rearranged locally to reach the final layout.

determine the optimal layout a weight is assigned to each grid point, given by m², where m is the number of support functions touching it (if symmetry can be exploited the weight should rather be m(m+1)/2); the total weight (i.e. the sum of all partial weights) is then divided among all MPI tasks as evenly as possible. In Fig. 10 this procedure is illustrated with a toy example, where in the upper part the support functions and their overlaps are shown and in the lower part the resulting direct and transposed (both naive and optimal) data layouts are given.
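The weight-based partitioning can be sketched as follows. This is a simplified one-dimensional illustration (contiguous grid-point ranges, weight m² per point, greedy cuts), not the actual BigDFT routine; the function name and greedy strategy are our own for the example.

```python
def partition_grid(m, ntasks):
    """Split grid points into ntasks contiguous chunks of balanced weight.

    m[i] is the number of support functions touching grid point i; its
    weight is m[i]**2, the cost of the partial overlap-matrix contributions
    computed there. A greedy sweep cuts whenever the running sum reaches
    the next multiple of the ideal per-task share.
    """
    w = [mi * mi for mi in m]
    total = float(sum(w))
    cuts, acc, task = [0], 0.0, 1
    for i, wi in enumerate(w):
        acc += wi
        if task < ntasks and acc >= task * total / ntasks:
            cuts.append(i + 1)   # close the current task's range here
            task += 1
    cuts.append(len(w))
    return [list(range(cuts[k], cuts[k + 1])) for k in range(ntasks)]
```

For the toy example of Fig. 10 (four grid points, each touched by two support functions) the two tasks each receive two grid points.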

In addition to the better load balancing this approach has the advantage that considerably less data has to be communicated – since the transposed layout is just a redistribution of the standard layout, the total amount of data that is communicated is equal to the total size of all support functions, whereas in the point-to-point approach, the same data is often sent to multiple processes. Furthermore, the communication can be done more efficiently: after some local rearrangement of the data for each MPI task, it can be communicated with a single MPI call (MPI_Alltoallv) – in practice there are two calls since the coarse and fine parts are handled separately. After the data has been received some local rearrangement is again required to reach the correct layout. These three steps – local rearrangement, communication and further local rearrangement – are illustrated in Fig. 11. Due to the latency of the network, two MPI calls will likely be more efficient than the very large number of small messages that have to be sent for the point-to-point approach.

For the calculation of the charge density, which is formally identical to the calculation of scalar products, a similar approach is used. Since these two operations are the most important ones from the viewpoint of communication and parallelization, this results in an excellent scaling with respect to the number of cores.
