
GPU ACCELERATED QUANTUM CHEMISTRY

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF CHEMISTRY

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

NATHAN LUEHR

February 2015

http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/hb803mt5913

© 2015 by Nathan Luehr. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Todd Martinez, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Hans Andersen

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Vijay Pande

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


ABSTRACT

This dissertation develops techniques to accelerate quantum chemistry calculations

using commodity graphical processing units (GPUs). As both the principal bottleneck

in finite basis calculations and a highly parallel task, the evaluation of Gaussian

integrals is a prime target for GPU acceleration. Methods to tailor quantum chemistry

algorithms from the bottom up to take maximum advantage of massively parallel

processors are described. Special attention is paid to making maximum use of

performance features typical of modern GPUs, such as high single precision

performance. After developing an efficient integral direct self-consistent field (SCF)

procedure for GPUs that is an order of magnitude faster than typical CPU codes, the

same machinery is extended to the configuration interaction singles (CIS) and time-

dependent density functional theory (TDDFT) methods. Finally, this machinery is

applied to molecular dynamics (MD) calculations. To extend the time scale accessible

to MD calculations of large systems, an ab initio multiple time step (MTS) approach

is developed. For small systems, up to a few dozen atoms, an interactive interface

enabling a virtual molecular modeling kit complete with realistic ab initio forces is

developed.


ACKNOWLEDGEMENTS

I thank my advisor Dr. Todd Martínez for his guidance, advice, and patience

over the past six years of work. Working in his lab has provided tremendous

professional and personal growth and has been much more fun than graduate school

has any right to be. I am also indebted to my predecessor, Ivan Ufimtsev, who began

the GPU effort and gently brought me up to speed. I have also had the advantage of

working with Dr. Tom Markland, and Dr. Christine Isborn on various projects and

have benefited tremendously in learning from them. I thank my father, Dr. Craig

Luehr, for his life-long encouragement to pursue interesting questions, and his

patience in reviewing even the least interesting sections of this dissertation. Finally, I

thank my wife, Tracy, for her unfailing encouragement and support in pursuing this

work.


TABLE OF CONTENTS

CHAPTER TITLE PAGE

Title Page i

Abstract v

Acknowledgements vi

Table of Contents vii

List of Tables viii

List of Illustrations ix

Introduction 1

1 Background and Review 7

2 Integral-Direct Fock Construction on GPUs 33

3 Dynamic Precision ERI Evaluation 53

4 DFT Exchange-Correlation Evaluation on GPUs 71

5 Excited State Electronic Structure on GPUs: CIS and TDDFT 91

6 Multiple Time Step Integrators for Ab Initio MD 115

7 Interactive Ab Initio Molecular Dynamics 141

Bibliography 165


LIST OF TABLES

TABLE TITLE PAGE

1.1 Runtime Comparison for GPU ERI Algorithms 27

3.1 RHF Energies Computed in Single and Double Precision 60

3.2 Double and Dynamic Precision Final RHF Energies 64

3.3 Runtime Comparison of Dynamic and Full Double Precision 66

4.1 Comparison of Becke Grid Calculations by Precision 78

4.2 Performance of CPU and GPU Becke Weight Calculations 79

4.3 Timing Breakdown: CPU and GPU Grid Generation for BPTI 80

5.1 Accuracy and Performance of GPU CIS Algorithm 103

5.2 TD-BLYP AX Build Timings for Various Quadrature Grids 106

5.3 Properties for First Bright State of Several Dendrimers 109

6.1 Performance of MTS and Velocity Verlet Integrators 134

7.1 Small Molecule TeraChem Optimization Improvements 157

7.2 Wall Time per MD Time Step for Various Small Molecules 160


LIST OF ILLUSTRATIONS

FIGURE TITLE PAGE

1.1 1 Block – 1 Contracted ERI Mapping 23

1.2 1 Thread – 1 Contracted ERI Mapping 25

1.3 1 Thread – 1 Primitive ERI Mapping 26

1.4 ERI Grid Sorted by Angular Momentum 30

2.1 GPU J-Engine Algorithm 38

2.2 Organization of Coulomb ERIs by Schwarz Bound 42

2.3 GPU K-Engine Algorithm 48

3.1 Arrangement of Double and Single Precision Coulomb ERIs 59

3.2 Geometries Used for Benchmarking Mixed Precision 60

3.3 Mixed Precision Error versus Precision Threshold 61

3.4 Additional Test Systems for Dynamic Precision 65

3.5 Fock Construction Speedups by Precision 66

4.1 Pseudo-code of Serial Becke Weight Calculation 75

4.2 Benchmark Molecules for Becke Weight Kernels 79

4.3 Linear Scaling of CPU and GPU Becke Weight Kernels 80

4.4 One-Dimensional and Three-Dimensional SCF Test Systems 84

4.5 First SCF Iteration Timing Breakdown 85

4.6 Parallel Efficiency for SCF Calculations on Multiple GPUs 87

4.7 TeraChem vs. GAMESS: SCF Performance for Water Clusters 89

5.1 Geometries of Four Generations of Oligothiophene Dendrimers 99


5.2 Additional Systems to Benchmark Excited State Calculations 99

5.3 CIS Convergence Using Single and Double Precision 101

5.4 Timing Breakdown for Construction of AX Vectors on GPU 104

5.5 TDDFT First Excitation Energy versus System Size 108

6.1 H2O/OH- Dissociation Curves for CASE and RHF 124

6.2 Total Energy for 21ps MTS-LJFRAG Simulation 125

6.3 Energy Drift for Ab Initio MTS Integrators 127

6.4 Power Spectra: Velocity Verlet and MTS Integrators 129

6.5 Power Spectra: MTS Integrators with Various Model Forces 130

6.6 Power Spectra: CASE Verlet and CASE MTS Integrators 131

6.7 Energy Conservation of 21ps MTS-CASE Trajectory 133

7.1 Schematic of Interactive MD Communication 145

7.2 Histogram of Step Times for Interactive and Batch MD 148

7.3 Schematic of Visualized and Simulated Systems 152

7.4 Multi-GPU Parallelization Strategies 155

7.5 Total Energy Curve for AI-IMD Simulation of HCl 159

7.6 Geometries for AI-IMD Benchmark Calculations 160

7.7 Snapshots of an Interactive Simulation 161

7.8 Interactive Proton Transfer in Imidazole 162


INTRODUCTION

For the field of quantum chemistry, the circumvention of computational

bottlenecks is a key concern. After all, the non-relativistic Schrödinger equation in

principle describes the chemistry of a great many important organic and biological

systems.1 In practice, however, the generality and accuracy of this equation can only

be accessed at a tremendous computational cost that scales factorially with the size of

the system. Many sophisticated ab initio approximations2 have been developed to

reduce the required effort to a polynomial function of the system’s size while retaining

the general applicability of the Schrödinger equation along with an absolute accuracy

in the computed energy of at least 0.5 kcal/mol, which is ~k_BT at 300 K and thus the

threshold at which energies become chemically relevant.

The history of quantum chemistry has been shaped by a series of

algorithmic developments. However, the impact of computer hardware has been

equally important. Because each algorithm was developed to run on a particular

machine, the performance characteristics of each computer shaped the algorithms that

were developed. For example, the expression of correlated wavefunction methods

exclusively in terms of dense linear algebra operations is arguably a direct result of the

efficiency of BLAS on traditional processors. As clock speeds and serial CPU

performance ramped up in the 1990s and early 2000s, processor architectures were

heavily consolidated until only a few remained, most notably Intel’s x86. Given the

favorable cost and performance of CPUs, it is not surprising that quantum chemistry

methods have extensively targeted the CPU.


However, in recent years the CPU’s serial performance has essentially

stagnated. As a result, alternative architectures are gaining traction for many

workloads. Multi-core CPUs have become commonplace. Massively parallel

streaming architectures designed for use in graphics processing units (GPUs) provide a

more extreme contrast with traditional serial processors. Today’s widening landscape

of processor designs raises the question of what shape quantum chemistry methods

will take as they move beyond the CPU.

In the following chapters we seek to answer this question by tailoring various

quantum chemistry algorithms to GPUs. This is a particularly attractive architecture

both because key computational bottlenecks in quantum chemistry map extremely

well onto massively parallel GPU processors and because low-cost high-performance

hardware is readily available and continuously improved. The introduction of the

Compute Unified Device Architecture3 (CUDA) as an extension to the C language

also greatly simplifies GPU programming, making it easily accessible for scientific

programming. The methods described in the following chapters form the foundation of

the TeraChem quantum chemistry program, which was designed from the ground up

to make optimal use of GPU hardware.4,5

The importance of this work is at least threefold. First, the methods presented

here are important in themselves because they dramatically accelerate certain quantum

chemistry calculations and make previously difficult calculations routine.6-9 Second, as

finite size constraints become dominant in hardware design, machines will become

increasingly parallel. As a result, many features and limitations that exist on modern

GPUs will become ubiquitous on high-performance architectures of the future, making


our work highly transferable. Third, the discussion of how to map quantum chemistry

calculations onto computer hardware goes beyond mere code optimization, since there

is a kind of natural selection at play in method development. If certain operations can

be accelerated, then methods that exploit these operations may gain advantages and be

chosen for future development. The present work follows this pattern. For example,

after introducing and optimizing the Coulomb and exchange operators for use in self-

consistent field (SCF) calculations10,11 in chapters 2-4, these same operations are

exploited to accelerate excited state methods6 in chapter 5.

At the same time that traditional processors are reaching their performance

limits, it is becoming much cheaper to fabricate custom architectures. The present

work focuses on optimizing quantum chemistry methods for a particular alternative

architecture. Perhaps in the future the inverse process will be feasible, and processors

will be tailored as much to quantum chemistry as vice versa. The series of ANTON

machines designed for classical MD is perhaps an early example of this

trend.12-18 The less successful GRAPE-DR architecture is similar,19 but perhaps shows

how much must be gleaned from the study of existing architectures before efficient

custom hardware can be designed for quantum chemistry.

The following chapters are organized as follows. Chapter one gives a brief

introduction to quantum chemistry and introduces the electron repulsion integrals

(ERIs) which represent an important computational bottleneck. We also review the

McMurchie-Davidson20 approach that can be used to evaluate these integrals as well

as early work to evaluate ERIs on GPU processors.21,22 Chapter two covers the

efficient implementation of Coulomb and exchange operators in TeraChem.11,23


Chapter three introduces dynamic precision, which is an important technique to tailor

integral evaluation to GPUs that provide much more single than double precision

performance.10 Chapter four discusses the implementation of density functional theory

(DFT) exchange-correlation potentials on GPUs.24,25 In chapter five we extend our

Coulomb and exchange operators to excited state configuration interaction singles

(CIS) and linear response time-dependent DFT (TDDFT).6 Finally, we consider

methods that leverage GPU quantum chemistry to extend the reach of ab initio

molecular dynamics (AIMD). In chapter six we discuss the use of multiple time step

(MTS) integrators to accelerate AIMD in large systems.8 And in chapter seven we turn

to accelerating calculations on small systems and introduce an interactive quantum

chemistry interface built on real-time AIMD.

REFERENCES

(1) Dirac, P. A. M. P R Soc Lond a-Conta 1929, 123, 714.

(2) Helgaker, T.; Jørgensen, P.; Olsen, J. Molecular electronic-structure theory; Wiley: New York, 2000.

(3) Schwegler, E.; Challacombe, M.; Head-Gordon, M. J. Chem. Phys. 1997, 106, 9708.

(4) PetaChem, LLC, 2010.

(5) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2009, 5, 2619.

(6) Isborn, C. M.; Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2011, 7, 1814.

(7) Kulik, H. J.; Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Phys. Chem. B 2012, 116, 12501.

(8) Luehr, N.; Markland, T. E.; Martinez, T. J. J. Chem. Phys. 2014, 140, 084116.


(9) Ufimtsev, I. S.; Luehr, N.; Martinez, T. J. J. Phys. Chem. Lett. 2011, 2, 1789.

(10) Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2011, 7, 949.

(11) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2009, 5, 1004.

(12) Grossman, J. P.; Kuskin, J. S.; Bank, J. A.; Theobald, M.; Dror, R. O.; Ierardi, D. J.; Larson, R. H.; Ben Schafer, U.; Towles, B.; Young, C.; Shaw, D. E. Acm Sigplan Notices 2013, 48, 549.

(13) Grossman, J. P.; Towles, B.; Bank, J. A.; Shaw, D. E. Des Aut Con 2013.

(14) Kuskin, J. S.; Young, C.; Grossman, J. P.; Batson, B.; Deneroff, M. M.; Dror, R. O.; Shaw, D. E. Int S High Perf Comp 2008, 315.

(15) Larson, R. H.; Salmon, J. K.; Dror, R. O.; Deneroff, M. M.; Young, C.; Grossman, J. P.; Shan, Y. B.; Klepeis, J. L.; Shaw, D. E. Int S High Perf Comp 2008, 303.

(16) Shaw, D. E.; Deneroff, M. M.; Dror, R. O.; Kuskin, J. S.; Larson, R. H.; Salmon, J. K.; Young, C.; Batson, B.; Bowers, K. J.; Chao, J. C.; Eastwood, M. P.; Gagliardo, J.; Grossman, J. P.; Ho, C. R.; Ierardi, D. J.; Kolossvary, I.; Klepeis, J. L.; Layman, T.; McLeavey, C.; Moraes, M. A.; Mueller, R.; Priest, E. C.; Shan, Y. B.; Spengler, J.; Theobald, M.; Towles, B.; Wang, S. C. Conf Proc Int Symp C 2007, 1.

(17) Shaw, D. E.; Dror, R. O.; Salmon, J. K.; Grossman, J. P.; Mackenzie, K. M.; Bank, J. A.; Young, C.; Deneroff, M. M.; Batson, B.; Bowers, K. J.; Chow, E.; Eastwood, M. P.; Ierardi, D. J.; Klepeis, J. L.; Kuskin, J. S.; Larson, R. H.; Lindorff-Larsen, K.; Maragakis, P.; Moraes, M. A.; Piana, S.; Shan, Y. B.; Towles, B. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis 2009.

(18) Towles, B.; Grossman, J. P.; Greskamp, B.; Shaw, D. E. Conf Proc Int Symp C 2014, 1.

(19) Makino, J.; Hiraki, K.; Inaba, M. 2007 Acm/Ieee Sc07 Conference 2010, 548.

(20) McMurchie, L. E.; Davidson, E. R. J. Comp. Phys. 1978, 26, 218.

(21) Yasuda, K. J. Comp. Chem. 2008, 29, 334.

(22) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2008, 4, 222.

(23) Titov, A. V.; Ufimtsev, I. S.; Luehr, N.; Martinez, T. J. J. Chem. Theo. Comp. 2013, 9, 213.


(24) Hwu, W.-m. GPU computing gems; Elsevier: Amsterdam ; Burlington, MA, 2011.

(25) Yasuda, K. J. Chem. Theo. Comp. 2008, 4, 1230.


CHAPTER ONE

BACKGROUND AND REVIEW

In this chapter we provide a brief review of the Hartree-Fock (HF) and Density

Functional Theory (DFT) quantum chemistry methods within atom-centered Gaussian

basis sets. We focus particularly on the evaluation of electron repulsion integrals

(ERIs) because these play an important role in the following chapters. The

performance of self-consistent field (SCF) methods such as HF and DFT depends on

two principal bottlenecks. The first is the evaluation of ERIs. Formally, for a basis set

containing N functions, a total of O(N^4) ERIs must be evaluated. In the limit of large

systems, efficient screening of negligibly small ERIs can reduce this number to O(N^2)

or, for certain insulating systems, even O(N).1-5 The second bottleneck is the update of

the orbitals/density between SCF iterations. This is traditionally performed by

diagonalizing the N-by-N Fock matrix to obtain its eigenvectors and eigenvalues.

Eigensolvers applied to dense matrices run with a complexity of O(N^3). However,

using sparse matrix algebra it is possible, again in asymptotically large systems, to

achieve O(N) scaling for this step as well.6 Thus, formal asymptotic analysis is of

limited use since the dominant bottleneck results from prefactors rather than scaling

exponents. Empirically, for systems up to at least 10,000 basis functions, integral

evaluation dominates the SCF runtime, and thus the following chapters focus

primarily on the GPU acceleration of ERI evaluation.

Numerous ERI evaluation schemes have been developed for use in traditional

CPU codes. For very high angular momentum, Rys quadrature methods7 may provide


an advantage on GPUs due to their smaller memory footprint.8-10 For low angular

momentum basis functions, however, the Rys and simpler McMurchie-Davidson11

approaches provide comparable performance, and the simplicity of the latter is

preferred here.

QUANTUM CHEMISTRY REVIEW

Full derivations of the HF and DFT methods as well as in-depth descriptions of

various ERI evaluation algorithms can be found elsewhere.12-14 Here we provide a

brief background in order to put the subsequent chapters in context. Unless otherwise

noted, we assume a spin-restricted wavefunction ansatz and atomic units.

Self-Consistent Field Equations in Gaussian Basis Sets

A primitive Gaussian function is defined as follows.

\chi_i(\vec{r}) = N_i \, (r_x - x_i)^{n_i} (r_y - y_i)^{l_i} (r_z - z_i)^{m_i} \, e^{-\alpha_i (\vec{r} - \vec{R}_i)^2}    (1.1)

Here \vec{r} is the three-dimensional electronic coordinate, \vec{R}_i = (x_i, y_i, z_i) is the primitive's Cartesian center (usually coinciding with the location of an atom), \alpha_i is an exponent determining the spatial extent of the function, and N_i is a normalization constant chosen so that the following holds.

\int_{-\infty}^{\infty} \chi_i(\vec{r}) \, \chi_i(\vec{r}) \, d\vec{r} = 1    (1.2)

The nonnegative integers, n_i, l_i, and m_i, fix the function's angular momentum in the Cartesian x-, y-, and z-directions. Their sum, \lambda_i = n_i + l_i + m_i, gives the primitive's total angular momentum. Functions with \lambda_i = 0, 1, 2 are termed s-, p-, and d-functions, respectively. The set of (\lambda_i + 1)(\lambda_i + 2)/2 primitive functions that differ only by the distribution of \lambda_i into n, l, and m is referred to as a primitive shell.

\chi_I = \{ \chi_i(\vec{r}) \;|\; \vec{R}_i = \vec{R}_I, \; \alpha_i = \alpha_I, \; \lambda_i = \lambda_I \}    (1.3)

We use the lower case indices i, j, k, and l to refer to primitive functions and the

capital letters I, J, K, and L for primitive shells.

In order to more closely approximate solutions to the atomic Schrödinger

equation, several primitive functions (all sharing a common center, Rµ , and angular

momenta, nµ , lµ , and mµ ) are combined together into contracted basis shells using

fixed contraction weights, ci.

\phi_\mu(\vec{r}) = \sum_{i \in \mu} c_i \, \chi_i(\vec{r})    (1.4)

Here a segmented basis is assumed in which each primitive contributes to a single

basis function. Greek indices are used for contracted AO quantities. The notation

i \in \mu specifies that the primitive \chi_i(\vec{r}) belongs to the AO contraction \phi_\mu(\vec{r}). These

contracted functions are termed atomic orbitals (AOs), in analogy to the hydrogen atom's one-electron orbitals which they resemble, and they form the basis in which the Schrödinger equation will be solved.

The AOs are further combined by linear contraction into molecular orbitals

(MOs), each of which represents a one-particle spatial probability distribution for an

electron in the multi-atom system.

\psi_i(\vec{r}) = \sum_\mu^N C_{\mu i} \, \phi_\mu(\vec{r})    (1.5)

The MO coefficients, Cµi , are free parameters, and their determination is the primary

objective of the SCF procedure. In order to describe an n-electron system, the one-

electron MOs are combined with spin functions in a Slater determinant.

\Psi(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n) = \frac{1}{\sqrt{n!}}
\begin{vmatrix}
\sigma_1(\vec{x}_1) & \sigma_2(\vec{x}_1) & \cdots & \sigma_n(\vec{x}_1) \\
\sigma_1(\vec{x}_2) & \sigma_2(\vec{x}_2) & \cdots & \sigma_n(\vec{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_1(\vec{x}_n) & \sigma_2(\vec{x}_n) & \cdots & \sigma_n(\vec{x}_n)
\end{vmatrix}    (1.6)

For the spin-restricted case in which two electrons occupy each spatial orbital, the spin

orbitals (depending on both spatial and spin electronic degrees of freedom) can be

defined as follows (where \omega_k is the spin degree of freedom for the kth electron):

\sigma_{2n-1}(\vec{x}_k) = \psi_n(\vec{r}_k) \, \alpha(\omega_k), \qquad \sigma_{2n}(\vec{x}_k) = \psi_n(\vec{r}_k) \, \beta(\omega_k)    (1.7)

The energy of the wavefunction, \Psi, representing n electrons in a system containing A fixed atomic nuclei (each with charge Z_a and located at position \vec{R}_a) is derived from the expectation value of the electronic Hamiltonian, H:

H = \sum_i^n \left[ -\sum_a^A \frac{Z_a}{|\vec{r}_i - \vec{R}_a|} - \frac{\nabla_i^2}{2} + \frac{1}{2} \sum_{j \neq i} \frac{1}{|\vec{r}_i - \vec{r}_j|} \right]    (1.8)

E_{RHF} = \frac{\langle \Psi | H | \Psi \rangle}{\langle \Psi | \Psi \rangle} = 2 \sum_i^{n/2} \langle \psi_i | h | \psi_i \rangle + \sum_{i,j}^{n/2} \left[ 2 (\psi_i \psi_i | \psi_j \psi_j) - (\psi_i \psi_j | \psi_i \psi_j) \right]    (1.9)

Here the MOs are assumed, without loss of generality, to be orthonormal.

\langle \psi_i | \psi_j \rangle = \delta_{ij}    (1.10)

The one-electron core Hamiltonian operator, h , accounts for electron-nuclear

attraction and electron kinetic energy,

h(\vec{r}) = -\sum_a^A \frac{Z_a}{|\vec{r} - \vec{R}_a|} - \frac{\nabla^2}{2}    (1.11)

and the two-electron repulsion integrals (ERIs) account for pairwise repulsive

interactions between electrons.

(\psi_i \psi_j | \psi_k \psi_l) = \int d^3\vec{r}_1 \int d^3\vec{r}_2 \, \frac{\psi_i^*(\vec{r}_1) \psi_j(\vec{r}_1) \psi_k^*(\vec{r}_2) \psi_l(\vec{r}_2)}{|\vec{r}_1 - \vec{r}_2|}    (1.12)

For Kohn-Sham DFT, a similar energy expression is obtained by using the

determinant to describe non-interacting pseudo-particles whose total density matches

the ground state electron density.15

\rho(\vec{r}) = 2 \sum_i^{n/2} | \psi_i(\vec{r}) |^2    (1.13)

In this case, components of the Hartree-Fock energy provide good approximations for

the DFT kinetic energy and classical electron repulsion. An additional density-

dependent exchange-correlation functional, E_{XC}[\rho], corrects for the relatively small

energetic effects of electron exchange and correlation as well as errors from

approximating the kinetic energy as that of the Kohn-Sham determinant.

E_{DFT} = 2 \sum_i^{n/2} \langle \psi_i | h | \psi_i \rangle + \sum_{i,j}^{n/2} 2 (\psi_i \psi_i | \psi_j \psi_j) + E_{XC}[\rho]    (1.14)

Given the exact exchange-correlation functional, E_{XC}[\rho], equation (1.14) would provide

the exact ground state energy. Unfortunately, the exact functional is not known in any

computationally feasible form. In practice a variety of approximate functionals are


often employed. For simplicity, we focus on the remarkably successful class of

generalized gradient approximation (GGA) functionals. These take the form of an

integral over a local xc-kernel that depends only on the total density and its gradient.

E_{XC}[\rho] = \int f_{xc}\!\left( \rho(\vec{r}), \, |\nabla\rho(\vec{r})|^2 \right) d\vec{r}    (1.15)

To calculate the HF or DFT ground state electronic configuration, we vary the MO coefficients, C_{\mu i}, to minimize E_{RHF} or E_{DFT} under the constraint of equation (1.10) that the MOs remain orthonormal. Functional variation ultimately results in the following conditions on the MO coefficients.

F(P) \, C = S \, C \, \varepsilon    (1.16)

Here P is the density matrix represented in the AO basis;

P_{\mu\nu} = \sum_i^{n} C_{\mu i} C_{\nu i}^*    (1.17)

\varepsilon is a diagonal matrix of MO energies (formally, this matrix is the set of Lagrange multipliers enforcing the constraint that all the molecular orbitals remain orthonormal, i.e. equation (1.10)); S is the AO overlap matrix;

S_{\mu\nu} = \langle \phi_\mu | \phi_\nu \rangle    (1.18)

and F(P) is the non-linear Fock operator, defined slightly differently for HF and DFT

as follows.

F^{HF}_{\mu\nu}(P) = h_{\mu\nu} + \sum_{\lambda\sigma}^N P_{\lambda\sigma} \left[ 2 (\mu\nu|\sigma\lambda) - (\mu\lambda|\nu\sigma) \right]    (1.19)

F^{DFT}_{\mu\nu}(P) = h_{\mu\nu} + 2 \sum_{\lambda\sigma}^N (\mu\nu|\lambda\sigma) P_{\sigma\lambda} + V^{XC}_{\mu\nu}    (1.20)

Here h is the core Hamiltonian from equation (1.11) in the AO basis,

h_{\mu\nu} = \left\langle \phi_\mu \left| -\sum_a^A \frac{Z_a}{|\vec{r}_1 - \vec{R}_a|} - \frac{\nabla_1^2}{2} \right| \phi_\nu \right\rangle    (1.21)

and the two electron ERIs are defined in the AO basis as follows.

(\mu\nu|\lambda\sigma) = \int d^3\vec{r}_1 \int d^3\vec{r}_2 \, \frac{\phi_\mu(\vec{r}_1) \phi_\nu(\vec{r}_1) \phi_\lambda(\vec{r}_2) \phi_\sigma(\vec{r}_2)}{|\vec{r}_1 - \vec{r}_2|}
 = \sum_{i \in \mu} \sum_{j \in \nu} \sum_{k \in \lambda} \sum_{l \in \sigma} c_i c_j c_k c_l \int d^3\vec{r}_1 \int d^3\vec{r}_2 \, \frac{\chi_i(\vec{r}_1) \chi_j(\vec{r}_1) \chi_k(\vec{r}_2) \chi_l(\vec{r}_2)}{|\vec{r}_1 - \vec{r}_2|}
 = \sum_{i \in \mu} \sum_{j \in \nu} \sum_{k \in \lambda} \sum_{l \in \sigma} c_i c_j c_k c_l \, [ij|kl]    (1.22)

Note that round braces refer to ERIs involving contracted basis functions while square

braces refer to primitive ERIs. Finally, for DFT, VXC is determined by functional

differentiation of the exchange-correlation energy expression.

V^{XC}_{\mu\nu} = \left\langle \phi_\mu \left| \frac{\delta E_{XC}}{\delta \rho} \right| \phi_\nu \right\rangle    (1.23)

Because the HF and DFT Fock operators are non-linear, equation (1.16) cannot

be solved in closed form. Instead an iterative approach is used. Starting from some

guess for the density matrix, P, the Fock matrix is constructed and then diagonalized

to obtain a matrix of approximate MO coefficients, C. The MO coefficients are then used

to construct an improved guess for the density matrix using equation (1.17), and the

process is repeated until F and P converge to stable values.

Evaluating Electron Repulsion Integrals

Having described the basic SCF working equations we turn now to the

evaluation of primitive ERIs.

[ij|kl] = \int d^3\vec{r}_1 \int d^3\vec{r}_2 \, \frac{\chi_i(\vec{r}_1) \chi_j(\vec{r}_1) \chi_k(\vec{r}_2) \chi_l(\vec{r}_2)}{|\vec{r}_1 - \vec{r}_2|}    (1.24)

Efficient evaluation of the Coulomb integrals within Gaussian basis sets begins by invoking the Gaussian product theorem (GPT) of equation (1.25). This allows a pair of Gaussian functions at different centers to be rewritten as a combined Gaussian function centered at a point, \vec{P}, between the original centers.16

e^{-\alpha_i (\vec{r} - \vec{R}_i)^2} \, e^{-\alpha_j (\vec{r} - \vec{R}_j)^2} = K_{ij} \, e^{-\zeta_{ij} (\vec{r} - \vec{P}_{ij})^2}

\zeta_{ij} = \alpha_i + \alpha_j

K_{ij} = e^{-\frac{\alpha_i \alpha_j}{\alpha_i + \alpha_j} (\vec{R}_i - \vec{R}_j)^2}

\vec{P}_{ij} = \frac{\alpha_i \vec{R}_i + \alpha_j \vec{R}_j}{\alpha_i + \alpha_j}    (1.25)
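For concreteness, a minimal CUDA C++ sketch of the pair quantities in equation (1.25) is given below. The struct and function names are illustrative only (they are not TeraChem's); the routine simply evaluates \zeta_{ij}, K_{ij}, and \vec{P}_{ij} for one primitive pair and is marked __host__ __device__ so it could be used either on the CPU during pair pre-calculation or inside a GPU kernel.

    #include <cmath>

    // Pair quantities of equation (1.25) for a single primitive pair (illustrative).
    struct PrimPair {
        double zeta;        // zeta_ij = alpha_i + alpha_j
        double K;           // K_ij = exp(-alpha_i*alpha_j/zeta_ij * |Ri - Rj|^2)
        double Px, Py, Pz;  // P_ij = (alpha_i*Ri + alpha_j*Rj) / zeta_ij
    };

    __host__ __device__ inline PrimPair makePair(double ai, double xi, double yi, double zi,
                                                 double aj, double xj, double yj, double zj)
    {
        PrimPair p;
        p.zeta = ai + aj;
        double dx = xi - xj, dy = yi - yj, dz = zi - zj;
        p.K  = exp(-ai * aj / p.zeta * (dx * dx + dy * dy + dz * dz));
        p.Px = (ai * xi + aj * xj) / p.zeta;
        p.Py = (ai * yi + aj * yj) / p.zeta;
        p.Pz = (ai * zi + aj * zj) / p.zeta;
        return p;
    }

Pair structures of roughly this shape, built once per Fock construction, are the inputs assumed by the illustrative ERI sketches later in this chapter.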

Applying equation (1.25) separately to the bra, \chi_i(\vec{r}_1)\chi_j(\vec{r}_1), and ket, \chi_k(\vec{r}_2)\chi_l(\vec{r}_2), primitive pairs of equation (1.24) results in two charge distributions, \rho_{ij} and \rho_{kl}, and reduces the four-center ERI to a simpler two-center problem.

[ij|kl] = [\rho_{ij} | \rho_{kl}]    (1.26)

The pair distributions, \rho_{ij}, can be factored into x-, y-, and z-components, which will greatly simplify the problem.

\rho_{ij} = N_i N_j K_{ij} \; {}^x\rho_{ij} \, {}^y\rho_{ij} \, {}^z\rho_{ij}    (1.27)

The x-component of the bra distribution is shown below, the other terms being analogous.

{}^x\rho_{ij}(x_1) = (x_1 - x_i)^{n_i} (x_1 - x_j)^{n_j} \, e^{-\zeta_{ij} (x_1 - X_{ij})^2}    (1.28)

Following McMurchie and Davidson we expand the pair distributions of equation (1.28) exactly in a basis of Hermite Gaussians, \{\Lambda_t\}.11

{}^x\rho_{ij}(x_1) = \sum_t^{n_i + n_j} {}^xE_t^{n_i n_j} \, \Lambda_t^x(x_1)    (1.29)

\Lambda_t^x(x_1) = \left( \frac{\partial}{\partial X_{ij}} \right)^t e^{-\zeta_{ij} (x_1 - X_{ij})^2}    (1.30)

Again, analogous expressions expand the y- and z-components. The expansion coefficients, {}^xE_t^{n_i n_j}, are calculated from simple recurrence relations given below.11

{}^xE_t^{mn} = 0, \quad \text{where } t < 0 \text{ or } t > m + n

{}^xE_t^{m+1,n} = \frac{1}{2\zeta_{ij}} \, {}^xE_{t-1}^{mn} + X_{P_{ij} R_i} \, {}^xE_t^{mn} + (t+1) \, {}^xE_{t+1}^{mn}

{}^xE_t^{m,n+1} = \frac{1}{2\zeta_{ij}} \, {}^xE_{t-1}^{mn} + X_{P_{ij} R_j} \, {}^xE_t^{mn} + (t+1) \, {}^xE_{t+1}^{mn}    (1.31)

Here X_{PQ} is shorthand for P_x - Q_x. The pair distribution from equation (1.27) is then written as follows.

\rho_{ij}(\vec{r}_1) = N_i N_j K_{ij} \sum_t^{n_i + n_j} \sum_u^{l_i + l_j} \sum_v^{m_i + m_j} E_{tuv}^{ij} \, \Lambda_{tuv}(\vec{r}_1)    (1.32)

E_{tuv}^{ij} = {}^xE_t^{n_i n_j} \, {}^yE_u^{l_i l_j} \, {}^zE_v^{m_i m_j}    (1.33)

\Lambda_{tuv}(\vec{r}_1) = \Lambda_t^x(x_1) \, \Lambda_u^y(y_1) \, \Lambda_v^z(z_1)    (1.34)
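The recurrences of equation (1.31) translate almost line for line into code. The following CPU-side sketch (our own array layout and bounds, not a production routine) fills the one-dimensional {}^xE_t^{mn} table for a single Cartesian direction, assuming the prefactor N_i N_j K_{ij} of equation (1.32) is applied separately so that E_0^{00} = 1.

    #include <cstring>

    constexpr int LMAX = 4;                  // highest per-index angular momentum handled here
    constexpr int TMAX = 2 * LMAX + 1;

    // E[m][n][t] holds xE_t^{mn} of equation (1.31) for one Cartesian direction.
    // zeta = zeta_ij, XPA = P_ij,x - R_i,x, XPB = P_ij,x - R_j,x.
    void hermiteE(double E[LMAX + 1][LMAX + 1][TMAX],
                  int m_max, int n_max, double zeta, double XPA, double XPB)
    {
        std::memset(E, 0, sizeof(double) * (LMAX + 1) * (LMAX + 1) * TMAX);
        E[0][0][0] = 1.0;                    // K_ij is carried in the pair prefactor

        const double oo2z = 1.0 / (2.0 * zeta);
        for (int m = 0; m < m_max; ++m)      // raise the first index
            for (int t = 0; t <= m + 1; ++t) {
                double v = XPA * E[m][0][t];
                if (t > 0)      v += oo2z * E[m][0][t - 1];
                if (t + 1 <= m) v += (t + 1) * E[m][0][t + 1];
                E[m + 1][0][t] = v;
            }
        for (int m = 0; m <= m_max; ++m)     // raise the second index
            for (int n = 0; n < n_max; ++n)
                for (int t = 0; t <= m + n + 1; ++t) {
                    double v = XPB * E[m][n][t];
                    if (t > 0)          v += oo2z * E[m][n][t - 1];
                    if (t + 1 <= m + n) v += (t + 1) * E[m][n][t + 1];
                    E[m][n + 1][t] = v;
                }
    }

In the GPU kernels discussed later, tables of this kind are instead generated in registers with the loops fully unrolled for each specific angular momentum class.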

And the overall integral from equation (1.24) is expanded as follows.

[ij|kl] = N_i N_j N_k N_l K_{ij} K_{kl} \sum_t^{n_i+n_j} \sum_u^{l_i+l_j} \sum_v^{m_i+m_j} \sum_{t'}^{n_k+n_l} \sum_{u'}^{l_k+l_l} \sum_{v'}^{m_k+m_l} E_{tuv}^{ij} \, E_{t'u'v'}^{kl} \, V_{tuv}^{t'u'v'}    (1.35)

Here V_{tuv}^{t'u'v'} represent Hermite Coulomb integrals, which are defined through equations (1.30) and (1.34) as partial derivatives of an s-function Coulomb integral.

V_{tuv}^{t'u'v'} = [\Lambda_{tuv} | \Lambda_{t'u'v'}]
 = (-1)^{t'+u'+v'} \left( \frac{\partial}{\partial X_{ij}} \right)^{t+t'} \left( \frac{\partial}{\partial Y_{ij}} \right)^{u+u'} \left( \frac{\partial}{\partial Z_{ij}} \right)^{v+v'} \int\!\!\int \frac{e^{-\zeta_{ij} (\vec{r}_1 - \vec{P}_{ij})^2} \, e^{-\zeta_{kl} (\vec{r}_2 - \vec{P}_{kl})^2}}{|\vec{r}_1 - \vec{r}_2|} \, d\vec{r}_1 \, d\vec{r}_2    (1.36)

To evaluate V_{tuv}^{t'u'v'}, the simple Coulomb integral on the right in equation (1.36) is first expressed in terms of the Boys function, F_n.

\int\!\!\int \frac{e^{-\zeta_{ij} (\vec{r}_1 - \vec{P}_{ij})^2} \, e^{-\zeta_{kl} (\vec{r}_2 - \vec{P}_{kl})^2}}{|\vec{r}_1 - \vec{r}_2|} \, d\vec{r}_1 \, d\vec{r}_2 = \frac{2\pi^{5/2}}{\zeta_{ij} \zeta_{kl} \sqrt{\zeta_{ij} + \zeta_{kl}}} \, F_0\!\left( \frac{\zeta_{ij} \zeta_{kl}}{\zeta_{ij} + \zeta_{kl}} \, (\vec{P}_{ij} - \vec{P}_{kl})^2 \right)    (1.37)

F_n(x) = \int_0^1 t^{2n} e^{-x t^2} \, dt    (1.38)

In practice the Boys function is computed using an interpolation table and downward recursion.14
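For reference, a minimal evaluation of F_n(x) is sketched below. It seeds the highest required order with the convergent series F_n(x) = e^{-x} \sum_i (2x)^i / [(2n+1)(2n+3)\cdots(2n+2i+1)] and then applies the downward recursion F_n(x) = [2x F_{n+1}(x) + e^{-x}] / (2n+1); this is a reference implementation only, with the series standing in for the tabulated interpolation used in production codes.

    #include <cmath>
    #include <vector>

    // Boys functions F_0(x) .. F_nmax(x) of equation (1.38) (reference implementation).
    std::vector<double> boys(int nmax, double x)
    {
        std::vector<double> F(nmax + 1, 0.0);
        const double ex = std::exp(-x);

        // Seed the highest order with a convergent (if slow for large x) series.
        double term = 1.0 / (2.0 * nmax + 1.0);
        double sum  = term;
        for (int i = 1; i < 200 && term > 1.0e-17 * sum; ++i) {
            term *= 2.0 * x / (2.0 * nmax + 2.0 * i + 1.0);
            sum  += term;
        }
        F[nmax] = ex * sum;

        // Downward recursion: F_n = (2x F_{n+1} + e^{-x}) / (2n + 1).
        for (int n = nmax - 1; n >= 0; --n)
            F[n] = (2.0 * x * F[n + 1] + ex) / (2.0 * n + 1.0);
        return F;
    }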

Next we define the auxiliary functions, R_{tuv}^n, as follows.

R_{tuv}^n = \left( \frac{\partial}{\partial X_{ij}} \right)^t \left( \frac{\partial}{\partial Y_{ij}} \right)^u \left( \frac{\partial}{\partial Z_{ij}} \right)^v \left( \frac{-2 \zeta_{ij} \zeta_{kl}}{\zeta_{ij} + \zeta_{kl}} \right)^n F_n\!\left( \frac{\zeta_{ij} \zeta_{kl}}{\zeta_{ij} + \zeta_{kl}} \, (\vec{P}_{ij} - \vec{P}_{kl})^2 \right)    (1.39)

Noting that

R_{tuv}^0 = \left( \frac{\partial}{\partial X_{ij}} \right)^t \left( \frac{\partial}{\partial Y_{ij}} \right)^u \left( \frac{\partial}{\partial Z_{ij}} \right)^v F_0\!\left( \frac{\zeta_{ij} \zeta_{kl}}{\zeta_{ij} + \zeta_{kl}} \, (\vec{P}_{ij} - \vec{P}_{kl})^2 \right)    (1.40)

Equation (1.36) now becomes the following.

V_{tuv}^{t'u'v'} = (-1)^{t'+u'+v'} \, \frac{2\pi^{5/2}}{\zeta_{ij} \zeta_{kl} \sqrt{\zeta_{ij} + \zeta_{kl}}} \, R^0_{t+t', u+u', v+v'}\!\left( \frac{\zeta_{ij} \zeta_{kl}}{\zeta_{ij} + \zeta_{kl}}, \, (\vec{P}_{ij} - \vec{P}_{kl})^2 \right)    (1.41)

The utility of the auxiliaries, R_{tuv}^n, is that they can be efficiently computed from the Boys function starting at R_{000}^n using the following recurrence relations.

R_{t+1,u,v}^{n} = t \, R_{t-1,u,v}^{n+1} + (X_{ij} - X_{kl}) \, R_{t,u,v}^{n+1}

R_{t,u+1,v}^{n} = u \, R_{t,u-1,v}^{n+1} + (Y_{ij} - Y_{kl}) \, R_{t,u,v}^{n+1}

R_{t,u,v+1}^{n} = v \, R_{t,u,v-1}^{n+1} + (Z_{ij} - Z_{kl}) \, R_{t,u,v}^{n+1}    (1.42)
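A naive recursive transcription of equations (1.39) and (1.42) is shown below. F is assumed to hold the Boys values F_0 through F_{t+u+v+n} at the argument of equation (1.39), with alpha = \zeta_{ij}\zeta_{kl}/(\zeta_{ij}+\zeta_{kl}) and (X, Y, Z) = \vec{P}_{ij} - \vec{P}_{kl}. Production GPU kernels instead unroll these relations into straight-line register code for each angular momentum class, as discussed at the end of this chapter.

    #include <cmath>
    #include <vector>

    // Hermite Coulomb auxiliaries R^n_{tuv} of equations (1.39)-(1.42) (illustrative).
    double hermiteR(int t, int u, int v, int n,
                    double alpha, double X, double Y, double Z,
                    const std::vector<double>& F)
    {
        if (t == 0 && u == 0 && v == 0)
            return std::pow(-2.0 * alpha, n) * F[n];       // base case, equation (1.39)
        if (t > 0) {
            double val = X * hermiteR(t - 1, u, v, n + 1, alpha, X, Y, Z, F);
            if (t > 1) val += (t - 1) * hermiteR(t - 2, u, v, n + 1, alpha, X, Y, Z, F);
            return val;
        }
        if (u > 0) {
            double val = Y * hermiteR(t, u - 1, v, n + 1, alpha, X, Y, Z, F);
            if (u > 1) val += (u - 1) * hermiteR(t, u - 2, v, n + 1, alpha, X, Y, Z, F);
            return val;
        }
        double val = Z * hermiteR(t, u, v - 1, n + 1, alpha, X, Y, Z, F);
        if (v > 1) val += (v - 1) * hermiteR(t, u, v - 2, n + 1, alpha, X, Y, Z, F);
        return val;
    }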

This brings us to the final expression for evaluating the ERI of equation (1.24).

[ij|kl] = N_{ij} N_{kl} \sum_t^{n_i+n_j} \sum_u^{l_i+l_j} \sum_v^{m_i+m_j} \sum_{t'}^{n_k+n_l} \sum_{u'}^{l_k+l_l} \sum_{v'}^{m_k+m_l} E_{tuv}^{ij} \, E_{t'u'v'}^{kl} \, \frac{(-1)^{t'+u'+v'}}{\sqrt{\zeta_{ij} + \zeta_{kl}}} \, R^0_{t+t', u+u', v+v'}    (1.43)

N_{ij} = \frac{N_i N_j K_{ij} \sqrt{2} \, \pi^{5/4}}{\zeta_{ij}}    (1.44)

Here Nij is a convenience factor that combines all scalar factors for the bra (or ket) pair

distribution. In practice, the AO contraction coefficients from equation (1.22), cicj in

the case of the bra, are also included in this factor.

For s-functions, each quartet of primitive shells, [ij|kl], generates a single

integral. For shells with higher angular momentum a shell quartet generates multiple

integrals, since each primitive shell contains multiple functions. For example, each

shell quartet with the momentum pattern [sp|sd] will generate 18 integrals (since there are three functions in the p-shell, six in the d-shell, and one in each s-shell). These 18 integrals, however, involve the same set of auxiliary integrals, R_{tuv}^0, and Hermite contraction coefficients, {}^xE_t^{mn}. As a result, it is advantageous to generate the

intermediates once, and then evaluate equation (1.43) repeatedly, once for each

integral in the shell quartet.


Screening Negligible Integrals

The Fock contributions in equations (1.19) and (1.20) nominally involve

contributions from N^4 ERIs. However, several strategies are routinely used to avoid

calculating most of these integrals. First, the ERIs possess eight-fold symmetry so that

[ij|kl] = [kl|ij] = [ij|lk]. The point group symmetry of the molecule is sometimes also

used to eliminate even more redundant integrals. However, large systems rarely

possess such symmetry, so this approach is not applicable to the present work.

Many of the remaining integrals are so small that they can be neglected

without affecting the computed molecular properties. Because each AO basis function

is localized in space, a pair distribution, \rho_{ij}, will approach zero exponentially as the distance between primitive functions increases. Thus, an AO ERI, (\mu\nu|\lambda\sigma), will be non-negligible only if \mu is centered near \nu and \lambda is near \sigma. For large systems, this reduces the number of integrals to a more manageable N^2. In order to efficiently identify significant ERIs, a Cauchy-Schwarz inequality can be applied to either contracted or primitive integrals.17

[ij|kl] \le [ij|ij]^{1/2} \, [kl|kl]^{1/2}    (1.45)

For primitive integrals, this Schwarz bound is easily computed, because in [ij|ij] integrals the bra and ket pair distributions share a common center, greatly simplifying the integral expressions. Thus, by checking the integral bound for each shell quartet, it is possible to avoid computing many small integrals altogether. Another advantage of the Schwarz bound is that it can be decomposed into bra and ket parts, and thus the quantities [ij|ij]^{1/2} can be computed once and stored with each pair distribution rather than being recomputed for every shell quartet.
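The practical effect of equation (1.45) is that each pair distribution carries a single precomputed number, and a quartet is screened with one multiply and one compare. A schematic host-side version (types and names are ours) might look as follows.

    #include <vector>

    struct ShellPair {
        int    bra, ket;     // shells forming this pair distribution
        double bound;        // [ij|ij]^(1/2), stored once per pair (equation (1.45))
    };

    // Count the quartets whose Schwarz bound survives the neglect threshold;
    // a real code would enqueue the surviving quartets for integral evaluation.
    long countSurvivingQuartets(const std::vector<ShellPair>& pairs, double threshold)
    {
        long kept = 0;
        for (std::size_t b = 0; b < pairs.size(); ++b)
            for (std::size_t k = b; k < pairs.size(); ++k)   // (bra|ket) = (ket|bra)
                if (pairs[b].bound * pairs[k].bound > threshold)
                    ++kept;
        return kept;
    }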

INTRODUCTION TO CUDA AND GPU PROGRAMMING

Each GPU is a massively parallel device, containing thousands of execution

cores. However, the performance of these processors results not only from the raw

width of execution units, but also from a hierarchy of parallelism that forms the

foundation of the hardware architecture and is ingeniously exposed to the programmer

through the CUDA programming model.18 Developers must understand and respect

these hierarchical boundaries if their programs are to run efficiently on GPUs.

At the lowest level, the CUDA programmer writes a small procedure – called a

kernel in “CUDA-speak” – that is to be executed by tens of thousands of individual

threads in parallel. Although each CUDA thread is logically autonomous, the

hardware does not execute each thread independently. Instead, instructions are

scheduled for groups of 32 threads, called warps, in single-instruction-multiple-thread

(SIMT) fashion. Every thread in a warp executes the same instruction stream, with

threads masked to null operations (no-ops) for instructions in which they do not

participate.

Warps are grouped into larger blocks of up to 1024 threads. Blocks are

assigned to local groups of execution units called streaming multiprocessors (SMs).

The SM provides hardware-based intra-block synchronization methods and a small

on-chip shared memory often used for intra-block communication. CUDA blocks can

be indexed in 1, 2, or 3 dimensions at the convenience of the programmer.


At the highest level, blocks are organized into a CUDA grid. As with blocks,

the grid can be up to 3 dimensional. In general, the grid contains many more blocks

and threads than the GPU has physical execution units. When a grid is launched, a

hardware scheduler streams CUDA blocks onto the processors. By breaking a task into

fine-grained units of work, the GPU can be kept constantly busy, maximizing

throughput performance.

In CUDA the memory is also structured hierarchically. The host memory

usually provides the largest space, but can only be accessed through the PCIe data bus

which suffers from latencies on the order of several thousand instruction cycles. The

GPU’s main (global) memory provides several gigabytes of high-bandwidth memory

capable of more than 250 GB/s of sustained throughput. In order to enable this

bandwidth, global memory accesses incur long latencies, on the order of 500 clock

cycles. Global memory operations are handled in parallel mirroring the SIMT warp

design. The large width of the memory controller allows simultaneous memory access

by all threads of a warp as long as those threads target contiguous memory locations.

Low-latency, on-chip memory is also available. Most usefully, each block can use up

to 64 KB of shared memory for intra-block communication, and each thread may use

up to 255 local registers in which to store intermediate results.
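The hierarchy described above is easiest to see in a toy kernel. The example below (ours, not drawn from TeraChem) scales an array: each thread computes one element from its block and thread indices, and because consecutive threads in a warp touch consecutive addresses, each warp's global memory traffic coalesces into a few wide transactions.

    #include <cuda_runtime.h>

    __global__ void scaleArray(const double* in, double* out, double s, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (idx < n)
            out[idx] = s * in[idx];                       // coalesced load and store
    }

    void launchScale(const double* d_in, double* d_out, double s, int n)
    {
        int threadsPerBlock = 256;                        // 8 warps per block
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scaleArray<<<blocks, threadsPerBlock>>>(d_in, d_out, s, n);
    }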

Consideration of the hardware design suggests the following basic strategies

for maximizing the performance of GPU kernels.

1) Launch many threads, ideally one to two orders of magnitude more threads than

the GPU has execution cores. For example, a Tesla K20 with 2496 cores may not

reach peak performance until at least O(105) threads are launched. Having

reach peak performance until at least O(10^5) threads are launched. Having

work. All threads are hardware scheduled, making them very lightweight to create,

unlike host threads. Also, the streaming model ensures that the GPU will not

execute more threads than it can efficiently schedule. Thus oversubscription will

not throttle performance. Context switches are also instantaneous, and this is

beneficial because they allow a processor to stay busy when it might otherwise be

stalled, for example, waiting for a memory transaction to complete.

2) Keep each thread as simple as possible. Threads with smaller shared-memory and

register footprints can be packed more densely onto each SM. This allows the

schedulers to hide execution and memory latencies by increasing the chance that a

ready-to-execute warp will be available on any given clock cycle.

3) Decouple your algorithm to be as data parallel as possible. Synchronization

between threads always reduces the effective concurrency available to the GPU

schedulers and should be minimized. For example, it is often better to re-compute

intermediate quantities rather than build shared caches, sometimes even when the

intermediates require hundreds of cycles to compute.

4) Maintain regular memory access patterns. On the CPU this is done temporally

within a single thread; on the GPU it is more important to do it locally among

threads in a warp.

5) Maintain uniform control flow within a warp. Because of the SIMT execution

paradigm, all threads in the warp effectively execute every instruction needed by

any thread in the warp. Pre-organizing work by expected code-path can eliminate

divergent control flow within each warp and improve performance.


These strategies have well known analogues for CPU programming; however,

the performance penalty resulting from their violation is usually much more severe in

the case of the GPU. The tiny size of GPU caches relative to the large number of in-

flight threads defeats any possibility of cushioning the performance impact of non-

ideal programming patterns. In such cases, the task of optimization goes far beyond

simple FLOP minimization, and the programmer must weigh tradeoffs among each of the above considerations in the design.

GPU ERI EVALUATION

Parallelization Strategies

The primary challenge in implementing ERI routines on GPUs is deciding how

to map the integrals onto the GPU's execution units. Because each ERI can be

computed independently, there are many possible ways to decompose the work into

CUDA grids and blocks. For simplicity we will ignore the screening of negligible

integrals until the next chapter, and consider a simplified calculation in which each

AO is formed by a contraction of s-functions only. In this case, equation (1.43)

simplifies tremendously to the following expression.

[ij|kl] = \frac{N_{ij} N_{kl}}{\sqrt{\zeta_{ij} + \zeta_{kl}}} \, F_0\!\left( \frac{\zeta_{ij} \zeta_{kl}}{\zeta_{ij} + \zeta_{kl}} \, (\vec{P}_{ij} - \vec{P}_{kl})^2 \right)    (1.46)

Since we are now interested in integrals over contracted AO functions, we note that

the pair prefactors, Nij, now include the AO contraction coefficients.
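Written out as code, equation (1.46) becomes a handful of arithmetic operations per primitive quartet. The device function below is an illustrative sketch (in particular, a closed-form F_0 stands in for the table-driven Boys evaluation used in practice); it assumes the pair quantities of the earlier PrimPair sketch and pair prefactors N_ij that already absorb the contraction coefficients and the factors of equation (1.44).

    #include <math.h>

    // F_0(x) = (1/2) sqrt(pi/x) erf(sqrt(x)); small-x limit 1 - x/3.
    __host__ __device__ inline double boys0(double x)
    {
        return (x < 1.0e-12) ? 1.0 - x / 3.0
                             : 0.5 * sqrt(3.14159265358979323846 / x) * erf(sqrt(x));
    }

    // Primitive [ss|ss] ERI of equation (1.46) from precomputed pair quantities.
    __host__ __device__ inline double ssss(double Nij, double zij, double Pijx, double Pijy, double Pijz,
                                           double Nkl, double zkl, double Pklx, double Pkly, double Pklz)
    {
        double dx = Pijx - Pklx, dy = Pijy - Pkly, dz = Pijz - Pklz;
        double zsum = zij + zkl;
        double T = zij * zkl / zsum * (dx * dx + dy * dy + dz * dz);
        return Nij * Nkl * boys0(T) / sqrt(zsum);
    }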

A convenient way to organize ERI evaluation is to expand unique pairs of

atomic orbitals, \{ \phi_\mu \phi_\nu \,|\, \mu \le \nu \}, into a vector of dimension N(N+1)/2. The outer product of this vector with itself then produces a matrix whose elements are quartets, \{ \phi_\mu\phi_\nu, \phi_\lambda\phi_\sigma \,|\, \mu \le \nu, \, \lambda \le \sigma \}, each representing a (bra|ket) contracted AO integral. This is

illustrated by the blue square in figure 1. Due to (bra|ket) = (ket|bra) symmetry among

the ERIs, only the upper triangle of the integral matrix need be computed. Clearly ERI

evaluation is embarrassingly parallel at the level of AOs. However, each AO integral

in the grid can include contributions from many primitive integrals. In order to

parallelize the calculation over the more finely grained primitives, a degree of

coordination must be introduced among threads. Here we review three broadly

representative decomposition schemes10,19 that will guide our work in later chapters.

Figure 1: Schematic of 1 Block – 1 Contracted Integral (1B1CI) mapping. Cyan squares on left represent contracted ERIs each mapped to the labeled CUDA block of 64 threads. Orange squares show mapping of primitive ERIs to CUDA threads (green and blue boxes, colored according to CUDA warp) for two representative integrals, the first a "contraction" over a single primitive ERI and the second involving 3^4 = 81 primitive contributions.

The first strategy assigns a CUDA block to evaluate each contracted AO ERI

and maps a 2-dimensional CUDA grid onto the 2D ERI grid. The threads within each

block work together to compute a contracted integral in parallel. This approach is

termed the one block – one contracted integral (1B1CI) scheme. It is illustrated in

figure 1. Each cyan square represents a contracted integral. The CUDA block

responsible for each contracted ERI is labeled within the square. Lower triangular


blocks, labeled idle in Figure 1, would compute redundant integrals due to

(bra|ket) = (ket|bra) symmetry. These blocks exit immediately and, because of the GPU's efficient thread scheduling, contribute negligibly to the overall execution time.

Each CUDA block is made up of 64 worker threads arranged in a single dimension.

Blocks are represented by orange rectangles in Figure 1. The primitive integrals are

mapped cyclically onto the threads, and each thread collects a partial sum in an on-

chip register. The first thread computes and accumulates integrals 1, 65, etc. while the

second thread handles integrals 2, 66, etc. After all primitive integrals have been

evaluated, a block level reduction produces the final contracted integral.

Two cases deserving particular consideration are illustrated in Figure 1. The

upper thread block shows what happens for very short contractions, in the extreme

case, a single primitive. Since there is only one primitive to compute, all threads other

than the first will sit idle. A similar situation arises in the second example. Here an

ERI is calculated over four AOs each with contraction length 3 for a total of 81

primitive integrals. In this case, none of the 64 threads are completely idle. However,

some load imbalance is still present, since the first 17 threads compute a second

integral while the remainder of the warp, threads 18-31, execute unproductive no-op

instructions. It should be noted that threads 32-63 do not perform wasted instructions

because the entire warp skips the second integral evaluation. Thus, “idle” CUDA

threads do not always map to idle execution units. Finally, as contractions lengthen,

load imbalance between threads in a block will become negligible in terms of the

runtime, making the 1B1CI strategy increasingly efficient.
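A bare-bones 1B1CI kernel for the s-function case might be organized as below. This is a sketch under simplifying assumptions (it reuses the PrimPair and ssss routines from the earlier sketches, stores the prefactor N_ij in a parallel array, and expects 64 threads per block launched on an nContr-by-nContr grid); real kernels additionally handle screening, higher angular momentum, and accumulation into the Fock matrix.

    // One CUDA block per contracted ERI (1B1CI). contrStart/contrLen give, for each
    // contracted bra or ket pair distribution, the offset and number of its primitive
    // pairs in the pairs/pairN arrays.
    __global__ void eri1B1CI(const PrimPair* pairs, const double* pairN,
                             const int* contrStart, const int* contrLen,
                             double* result, int nContr)
    {
        int braC = blockIdx.y;                               // contracted bra pair index
        int ketC = blockIdx.x;                               // contracted ket pair index
        if (braC >= nContr || ketC >= nContr || ketC < braC) // (bra|ket) = (ket|bra): idle block
            return;

        int nBra = contrLen[braC], nKet = contrLen[ketC];
        int nPrim = nBra * nKet;                             // primitive ERIs in this contraction

        double sum = 0.0;
        for (int p = threadIdx.x; p < nPrim; p += blockDim.x) {   // cyclic mapping to threads
            PrimPair b = pairs[contrStart[braC] + p / nKet];
            PrimPair k = pairs[contrStart[ketC] + p % nKet];
            sum += ssss(pairN[contrStart[braC] + p / nKet], b.zeta, b.Px, b.Py, b.Pz,
                        pairN[contrStart[ketC] + p % nKet], k.zeta, k.Px, k.Py, k.Pz);
        }

        __shared__ double partial[64];                       // block-level reduction
        partial[threadIdx.x] = sum;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            result[braC * nContr + ketC] = partial[0];
    }

A launch of the form eri1B1CI<<<dim3(nContr, nContr), 64>>>(...) then realizes the mapping of Figure 1.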


Figure 2: Schematic of 1 Thread – 1 Contracted Integral (1T1CI) mapping. Cyan squares represent contracted ERIs and CUDA threads. Thread indices are shown in parentheses. Each CUDA block (red outlines) computes 16 ERIs with each thread accumulating the primitives of an independent contraction, in a local register.

A second parallelization strategy assigns entire contracted integrals to

individual CUDA threads. Since all primitives within a contraction are computed by a

single GPU thread, the sum of the final ERI can be accumulated in a local register,

avoiding the final reduction step. This coarser decomposition, which is termed the one

thread – one contracted integral (1T1CI) strategy, is illustrated in figure 2. The

contracted integrals are again represented by cyan squares, but each CUDA block,

represented by red outlines, now handles multiple contracted integrals rather than just

one. The 2-D blocks shown in Figure 2 are given dimensions 4x4 for illustrative

purposes. In practice, blocks sized at least 16x16 threads should be used. Because

threads within the same warp execute in SIMT fashion, warp divergence will result

whenever neighboring ERIs involve contractions of different lengths. To eliminate

these imbalances, the ERI grid must be pre-sorted by contraction length so that blocks

handle ERIs of uniform contraction length.
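For comparison, the corresponding 1T1CI sketch (same simplifying assumptions and data layout as the 1B1CI sketch above) gives each thread an entire contracted ERI, so the sum stays in a register and no reduction step is required.

    // One thread per contracted ERI (1T1CI); pairs are assumed pre-sorted so that
    // neighboring threads handle contractions of equal length.
    __global__ void eri1T1CI(const PrimPair* pairs, const double* pairN,
                             const int* contrStart, const int* contrLen,
                             double* result, int nContr)
    {
        int braC = blockIdx.y * blockDim.y + threadIdx.y;
        int ketC = blockIdx.x * blockDim.x + threadIdx.x;
        if (braC >= nContr || ketC >= nContr || ketC < braC)
            return;                                          // idle thread

        double sum = 0.0;
        for (int b = 0; b < contrLen[braC]; ++b)
            for (int k = 0; k < contrLen[ketC]; ++k) {
                PrimPair pb = pairs[contrStart[braC] + b];
                PrimPair pk = pairs[contrStart[ketC] + k];
                sum += ssss(pairN[contrStart[braC] + b], pb.zeta, pb.Px, pb.Py, pb.Pz,
                            pairN[contrStart[ketC] + k], pk.zeta, pk.Px, pk.Py, pk.Pz);
            }
        result[braC * nContr + ketC] = sum;                  // accumulated entirely in a register
    }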


Figure 3: Schematic of 1 Thread – 1 Primitive Integral (1T1PI) mapping. Cyan squares represent two-dimensional tiles of 16 x 16 primitive ERIs, each of which is assigned to a 16 x 16 CUDA block as labeled. Red lines indicate divisions between contracted ERIs. The orange box shows assignment of primitive ERIs to threads (grey squares) within a block that contains contributions to multiple contractions.

The third strategy maps each thread to a single primitive integral (1T1PI) and

ignores boundaries between primitives belonging to different AO contractions. A

second reduction step is then employed to sum the final contracted integrals from their

constituent primitive contributions. This is illustrated in Figure 3. The 1T1PI approach

provides the finest grained parallelism of the three mappings considered. It is similar

to the 1B1CI in that contracted ERIs are again broken up between multiple threads.

Here however, the primitives are distributed to CUDA blocks without considering the

contraction of which they are members. In Figure 3, cyan squares represent 2D CUDA

blocks of dimension 16x16, and red lines represent divisions between contracted

integrals. Because the block size is not an even multiple of contraction length, the

primitives computed within the same block will, in general, contribute to multiple

contracted ERIs. This approach results in perfect load balancing (for primitive

evaluation), since each thread does exactly the same amount of work. It is also notable

in that 1T1PI imposes few constraints on the ordering of primitive pairs, since they no

longer need to be grouped or interleaved by parent AO indices. However, the


advantages of the 1T1PI scheme are greatly reduced if we also consider the

subsequent reduction step needed to produce the final contracted ERIs. These reductions involve inter-block communication and, for highly contracted basis sets, can prove

more expensive than the ERI evaluation itself.

Table 1 summarizes benchmarks for each of the strategies described above.10

The example system consisted of 64 hydrogen atoms arranged in a 4x4x4 cubic lattice

with 0.74Å separating nearest neighbors. Two basis sets are considered. The 6-311G

basis represents a low contraction limit in which most (two-thirds) of the AOs include

a single primitive component. Here the 1T1PI mapping provides the best performance.

At such a minimal contraction level, very few ERIs must be accumulated between

block boundaries, minimizing required inter-block communication. The 1T1CI

method takes a close second since, for small contractions, it represents a parallel

decomposition that is only slightly coarser than the ideal 1T1PI scheme. The 1B1CI

scheme, on the other hand, takes a distant third. This is due to its poor load balancing

since, for the 6-311G basis, over 85% of the contracted ERIs involve nine or fewer

primitive ERI contributions. Thus, the vast majority of the 64 threads in each 1B1CI

CUDA block do no work.

Basis     GPU 1B1CI   GPU 1T1CI   GPU 1T1PI   CPU PQ Pre-calc   GPU-CPU Transfer   GAMESS
6-311G    7.086       0.675       0.428       0.009             0.883              170.8
STO-6G    1.608       1.099       2.863       0.012             0.012              90.6

Table 1: Runtime comparison for evaluating ERIs of the 64 H atom lattice using the 1B1CI, 1T1CI, and 1T1PI methods. Times are given in seconds. All GPU calculations were run on an Nvidia 8800 GTX. CPU pre-calculation records the time required to build pair quantities prior to launching GPU kernels. GPU-CPU transfer provides the time required to copy the completed ERIs from device to host memory. Timings for the CPU-based GAMESS program running on an Opteron 175 CPU are included for comparison.


The STO-6G basis provides a sharp contrast. Here each contracted ERI

includes 6^4 = 1296 primitive ERI contributions. As a result, for the 1T1PI scheme, the

reduction step becomes much more involved, in fact requiring more time than

primitive ERI evaluation itself. This illustrates that, for massively parallel

architectures, organizing communication is often just as important as minimizing

arithmetic instructions when optimizing performance. The 1T1CI scheme performs

similarly to the 6-311G case. The fact that all ERIs now involve uniform contraction

lengths provides a slight boost, since it requires only 60% more time to compute twice

as many primitive ERIs compared to the 6-311G basis. The 1B1CI method improves

dramatically, as every thread of every block is now responsible for at least 20

primitive ERIs.

Finally, it should be noted that simply transferring ERIs between host and

device can take longer than the ERI evaluation itself, especially in the case of low

basis set contraction. This means that, for efficient GPU implementations, ERIs can be

re-evaluated from scratch faster than they can be fetched from host memory, and much

faster than they can be fetched from disk. This will prove an important consideration

in the next chapter.

Extension to Higher Angular Momentum

Some additional considerations are important for extension to basis functions

of higher angular momentum. For non-zero angular momentum functions, shells

contain multiple functions. As noted above, all integrals within an ERI shell depend on

the same auxiliary integrals, R_{tuv}^0, and Hermite contraction coefficients, {}^xE_t^{mn}. Thus, it

is advantageous to have each thread compute an entire shell of primitive integrals. For


example, a thread computing a primitive ERI of class [sp|sp] is responsible for a total

of nine integrals.

The performance of GPU kernels is quite sensitive to the register footprint of

each thread. As threads use more memory, the total number of concurrent threads

resident on each SM decreases. Fewer active threads, in turn, reduce the GPU’s ability

to hide execution latencies and lower throughput performance. Because all threads in

a grid reserve the same register footprint, a single grid handling both low and, more

complex, high angular momentum integrals will apply the worst-case memory

requirements to all threads. To avoid this, separate kernels must be written for each

class of integral.

Specialized kernels also provide opportunities to further optimize each routine

and reduce memory usage, for example, by unrolling loops or eliminating

conditionals. This is particularly important for ERIs involving d- and higher angular

momentum functions, where loop overheads become non-trivial. For high angular

momentum integrals it is also possible to use symbolic algebra libraries to generate

unrolled kernels that are optimized for the GPU.20

Given a basis set of mixed angular momentum shells, we could naively extend

any of the decomposition strategies presented above as follows. First, build the pair

quantities as prescribed, without consideration for angular momentum class. Then

launch a series of ERI kernels, one for each momentum class, assigning a compute

unit (either block or thread depending on strategy being extended) to every ERI in the

grid. Work units assigned to ERIs that do not apply to the appropriate momentum

class could exit immediately. This strategy is illustrated for a hypothetical system


containing four s-shells and one p-shell in the left side of Figure 4. Each square

represents a shell quartet of ERIs, that is all ERIs resulting from combination of the

various angular momentum functions within each of the included AO shells. The

elements are colored by total angular momentum class, and a specialized kernel

evaluates elements of each color. Unfortunately, the number of integral classes

increases rapidly with the maximum angular momentum in the system. The inclusion

of d-shells would already result in the vast majority of the threads in each kernel

exiting without doing any work.

Figure 4: ERI grids colored by angular momentum class for a system containing four s-shells and one p-shell. Each square represents all ERIs for a shell quartet (a) Grid when bra and ket pairs are ordered by simple loops over shells. (b) ERI grid for same system with bra and ket pairs sorted by angular momentum, ss, then sp, then pp. Each integral class now handles a contiguous chunk of the total ERI grid.

A better approach is illustrated on the right side of Figure 4. Here we have

sorted the bra and ket pairs by the angular momenta of their constituents, ss then sp

and last pp. As a result, the ERIs of each class are localized in contiguous sub-grids,

and kernels can be dimensioned to exactly cover only the relevant integrals.
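The sort itself is an inexpensive host-side preprocessing step. A minimal sketch (illustrative structure and key only) using the C++ standard library is shown below; a stable sort keeps any previous ordering within each class intact.

    #include <algorithm>
    #include <vector>

    struct AOPair {
        int momentumClass;      // e.g. 0 = ss, 1 = sp, 2 = pp
        int braShell, ketShell; // pair quantities would follow in a production structure
    };

    // Group the bra/ket pair list into contiguous angular momentum classes, as in
    // panel (b) of Figure 4, so each specialized kernel covers one contiguous sub-grid.
    void sortPairsByClass(std::vector<AOPair>& pairs)
    {
        std::stable_sort(pairs.begin(), pairs.end(),
                         [](const AOPair& a, const AOPair& b) {
                             return a.momentumClass < b.momentumClass;
                         });
    }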

REFERENCES

(1) Challacombe, M.; Schwegler, E. J. Chem. Phys. 1997, 106, 5526.

(2) Burant, J. C.; Scuseria, G. E.; Frisch, M. J. J. Chem. Phys. 1996, 105, 8969.


(3) Schwegler, E.; Challacombe, M. J. Chem. Phys. 1996, 105, 2726.

(4) Schwegler, E.; Challacombe, M.; Head-Gordon, M. J. Chem. Phys. 1997, 106, 9708.

(5) Ochsenfeld, C.; White, C. A.; Head-Gordon, M. J. Chem. Phys. 1998, 109, 1663.

(6) Rudberg, E.; Rubensson, E. H. J Phys-Condens Mat 2011, 23.

(7) Rys, J.; Dupuis, M.; King, H. F. J. Comp. Chem. 1983, 4, 154.

(8) Yasuda, K. J. Comp. Chem. 2008, 29, 334.

(9) Asadchev, A.; Allada, V.; Felder, J.; Bode, B. M.; Gordon, M. S.; Windus, T. L. J. Chem. Theo. Comp. 2010, 6, 696.

(10) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2008, 4, 222.

(11) McMurchie, L. E.; Davidson, E. R. J. Comp. Phys. 1978, 26, 218.

(12) Parr, R. G.; Yang, W. Density-functional theory of atoms and molecules; Oxford University Press: Oxford, 1989.

(13) Szabo, A.; Ostlund, N. S. Modern Quantum Chemistry; McGraw Hill: New York, 1982.

(14) Helgaker, T.; Jørgensen, P.; Olsen, J. Molecular electronic-structure theory; Wiley: New York, 2000.

(15) Kohn, W.; Sham, L. J. Phys Rev 1965, 140, 1133.

(16) Boys, S. F. Proc. Roy. Soc. Lon. A 1950, 200, 542.

(17) Whitten, J. L. J. Chem. Phys. 1973, 58, 4496.

(18) NVIDIA In Design Guide; NVIDIA Corporation: docs.nvidia.com, 2013.

(19) Ufimtsev, I. S.; Martinez, T. J. Comp. Sci. Eng. 2008, 10, 26.

(20) Titov, A. V.; Ufimtsev, I. S.; Luehr, N.; Martinez, T. J. J. Chem. Theo. Comp. 2013, 9, 213.


CHAPTER TWO

INTEGRAL-DIRECT FOCK CONSTRUCTION

ON GRAPHICAL PROCESSING UNITS

Because ERIs remain constant from one SCF iteration to the next, it was once

common practice to pre-compute all numerically significant ERIs prior to the SCF. At

each iteration the two-electron Fock contributions of equation (2.1) would then be

generated from contracted ERIs, (\mu\nu|\lambda\sigma), stored, for example, on disk.

G_{\mu\nu}(P) = \sum_{\lambda\sigma}^N P_{\lambda\sigma} \left[ 2 (\mu\nu|\lambda\sigma) - (\mu\lambda|\nu\sigma) \right]    (2.1)

This procedure certainly minimized the floating-point operations involved in the

calculation. However, for systems containing thousands of basis functions, ERI

storage quickly becomes impractical. The integral-direct approach, pioneered by

Almlof,1 avoids the storage of ERIs by re-computing them on the fly with each

formation of the Fock matrix.

ERI evaluation represents a tremendous bottleneck in the integral-direct

approach. Thus, early implementations were careful to generate only symmetry-unique

ERIs, �µ!�"#( ) where µ !" , ! "# , and µ! " #$ . Here µ! and !" are compound

indices corresponding to the element numbers in an upper triangular matrix. Each ERI

was then combined with various density elements and scattered into multiple locations

in the Fock matrix. This reduces the number of ERIs that must be evaluated by a factor

of eight compared to a naïve implementation (based on the eightfold symmetry among

the ERIs).


Beyond alleviating storage capacity bottlenecks, the direct approach offers

performance advantages over conventional algorithms based on integral storage. As

observed above, ERIs can sometimes be re-calculated faster than they can be recalled

from storage (even when this storage is just across a fast PCIe bus). As advances in

instruction throughput continue to outpace those for communication bandwidths, this

balance will shift even further in favor of integral-direct algorithms. Another

advantage results from knowledge of the density matrix during Fock construction.1 By

augmenting the usual Schwarz bound with density matrix elements as follows,

�µ!�"#( )P"# $ µ!�µ!( )1/2

"#�"#( )1/2P"# (2.2)

the direct approach is able to eliminate many more integrals than is possible for pre-

computed ERIs since even quite large integrals are often multiplied by vanishing

density matrix elements.

Almlof and Ahmadi also suggested dividing the calculation of G into separate

Coulomb, J, and exchange, K, contributions, double calculating any ERIs that are

common to both.2

J_{μν} = Σ_{λσ}^{N} (μν|λσ) P_{λσ}   (2.3)

K_{μν} = Σ_{λσ}^{N} (μλ|νσ) P_{λσ}   (2.4)

This division offers two primary advantages. First, for the Coulomb operator in

equation (2.3), the density elements, P_{λσ}, can be pre-contracted with the ket, |λσ).

This provides an important optimization as described later in this chapter. Second, for

the exchange operator in equation (2.4) only a few non-negligible contributions need


to be computed. The density matrix in insulating systems with finite band gap decays

exponentially with distance.3,4 Because Gaussian basis functions are localized, the AO

density matrix remains sparse. As noted in the previous chapter, the bra, μν, and ket,

λσ, pairs are also sparse due to the locality of the Gaussian basis set. Thus, equation

(2.4) couples the bra and ket through a sparse density matrix, and few ERI

contributions survive screening in large systems. Separately calculating the exchange

term in equation (2.4) then adds few ERIs compared to the number of ERIs required

by equation (2.3) alone.

The considerations above apply perhaps even more forcefully on GPUs. As

already observed in the context of forming contracted ERIs, the GPU’s wide execution

units benefit from longer contractions of primitive ERIs. In the previous chapter, this

explained the improved performance of the 1B1CI approach for the hydrogen lattice

test case in moving from the 6-311G basis to the more highly contracted STO-6G.

Longer contractions parallelize more evenly across many cores. Expanded in terms of

primitive ERIs, the sums in Eqs. (2.3) and (2.4) include many more contributions than

any contracted ERI considered in chapter 1. Thus to simplify the parallel structure of

the Coulomb and exchange algorithms and improve GPU performance, the

construction of contracted ERIs, (μν|λσ), is abandoned in the present chapter in

favor of direct construction of Coulomb and exchange matrix elements from primitive

Gaussian functions.


Even within the Coulomb and exchange operators, ERI symmetry must not be

taken for granted. For example, although each ERI, (μν|λσ), makes multiple

contributions to the exchange matrix,

(μν|λσ) P_{νσ} → K_{μλ}        (λσ|μν) P_{σν} → K_{λμ}
(νμ|λσ) P_{μσ} → K_{νλ}        (λσ|νμ) P_{σμ} → K_{λν}
(μν|σλ) P_{νλ} → K_{μσ}        (σλ|μν) P_{λν} → K_{σμ}
(νμ|σλ) P_{μλ} → K_{νσ}        (σλ|νμ) P_{λμ} → K_{σν}        (2.4)

gathering disparate density matrix elements introduces irregular memory access

patterns and scattering outputs to the Fock matrix creates dependencies between

threads computing different ERIs. GPU performance is extremely sensitive to these

considerations, so that even the eight-fold reduction in work available from exploiting

the full symmetry among ERIs could be swamped by an even larger performance

slowdown resulting from fragmented memory accesses. It is helpful to start, as below,

from a naïve, but completely parallel algorithm, and then exploit ERI symmetry only

where it provides a practical benefit.

The remainder of the chapter describes the algorithm used to implement

integral-direct Coulomb and exchange operators in TeraChem. A final performance

evaluation will be delayed until the implementation of DFT exchange-correlation

terms has also been described in chapter 4.

GPU J-ENGINE

The strategies for handling ERIs developed in the previous chapter provide a

good starting point for the evaluation of the Coulomb operator in equation (2.3).5 As


in the previous chapter, we first consider ERIs involving only s-functions in which

each quartet of AO functions produces a single integral. This provides a clear context

in which to describe the overall structure of our approach. Later, details for evaluating

higher angular momentum functions will be provided. The first step is again to

enumerate AO pairs, χ_μχ_ν, for the bra and ket. For the moment we consider the full

lists of N^2 function pairs and consider the integral matrix, I, constructed as the N^2-by-

N^2 product of the bra-pair column vector, (μν|, with the ket-pair row vector, |λσ).

I_{μν,λσ} = (μν|λσ)   (2.5)

Inserting equation (2.5) into (2.3) casts the Coulomb operator as a matrix-vector

product between the integral matrix and a vector of length N^2 built by re-dimensioning

the usual N-by-N one-particle density matrix.

With this picture in mind, several plausible mappings to CUDA threads and

blocks suggest themselves. A simple strategy is to assign each thread to a single bra-

pair, (μν|, and have it sweep over all kets, |λσ), and density elements, P_{λσ},

accumulating the products (μν|λσ)P_{λσ} to compute an independent Coulomb

element, J_{μν}. This strategy is similar to the 1T1CI scheme of chapter 1 and maximizes

the independence of each thread but at the cost of rather coarse parallelism. In order to

saturate a massively parallel GPU (or ideally several GPUs) it is preferable to employ

multiple threads to compute each J_{μν}. Thus a preferred approach uses 2D thread

blocks so that threads in each row stride across the integral matrix, each accumulating

a partial sum. This is shown in figure 1 for an illustrative block size of 2x2. In practice


a block size of 8x8 was shown to be near optimal across a range of empirical test

calculations. Once all integrals have been evaluated, a final reduction within each row

of the CUDA block produces the final J_{μν} elements. The reduction step adds negligibly to

the runtime regardless of primitive contraction length, because it is performed only

once per Coulomb element, rather than once per contracted ERI as in the 1B1CI case

described in the previous chapter.

Figure 1: Schematic representation of a J-Engine kernel for one angular momentum class, e.g., (ss|ss). Cyan squares represent significant ERI contributions. Sorted bra and ket vectors are represented by triangles to left and above grid. The path of a 2x2 block as it sweeps across the grid is shown in orange. The final reduction across rows of the block is illustrated within the inset to the right.
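To make the mapping concrete, the following is a minimal CUDA sketch of the ss-class kernel just described, not the TeraChem implementation. It assumes hypothetical packed pair structures (SPair) whose prefactor K already contains the contraction coefficients (and, for ket pairs, the symmetrized density factor), and it evaluates primitive [ss|ss] integrals in the standard pair-exponent / Boys-function form.

#include <cmath>

struct SPair {              // hypothetical packed data for one primitive s-s pair
    float eta;              // pair exponent
    float K;                // prefactor; ket pairs also carry the density factor
    float x, y, z;          // pair center
};

__device__ float boysF0(float T) {                  // zeroth-order Boys function
    return (T < 1.0e-12f) ? 1.0f
                          : 0.5f * sqrtf(3.14159265f / T) * erff(sqrtf(T));
}

// One 8x8 block: each row owns a bra pair; threads in the row stride over all ket pairs.
__global__ void jEngineSS(const SPair* bra, int nBra,
                          const SPair* ket, int nKet, float* Jprim) {
    __shared__ float partial[8][8];
    int row = blockIdx.x * blockDim.y + threadIdx.y;
    float acc = 0.0f;
    if (row < nBra) {
        SPair b = bra[row];
        for (int k = threadIdx.x; k < nKet; k += blockDim.x) {
            SPair q  = ket[k];
            float dx = b.x - q.x, dy = b.y - q.y, dz = b.z - q.z;
            float T  = b.eta * q.eta / (b.eta + q.eta) * (dx*dx + dy*dy + dz*dz);
            acc += b.K * q.K * rsqrtf(b.eta + q.eta) * boysF0(T);  // [ss|ss] * density
        }
    }
    partial[threadIdx.y][threadIdx.x] = acc;
    __syncthreads();
    if (threadIdx.x == 0 && row < nBra) {           // row reduction: one J element per row
        float sum = 0.0f;
        for (int t = 0; t < 8; ++t) sum += partial[threadIdx.y][t];
        Jprim[row] = sum;                           // contracted into J_{mu nu} on the host
    }
}

A launch of the form jEngineSS<<<(nBra + 7) / 8, dim3(8, 8)>>>(...) covers the grid of Figure 1; Schwarz screening and higher angular momenta are omitted here for brevity.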

The above discussion ignored ERI symmetry for clarity. However, having

determined an efficient mapping of the Coulomb problem onto the GPU execution

units, it is important to consider what symmetries can be exploited without upsetting

the structure of the algorithm. We first note that J_{μν} is symmetric. Thus it is sufficient

to compute its upper triangle, and only the N(N+1)/2 bra pairs with μ ≤ ν need to be

considered. Similarly, for ket pairs the terms (μν|λσ)P_{λσ} and (μν|σλ)P_{σλ} can be

computed together as

(μν|λσ) (P_{λσ} + P_{σλ}) / (1 + δ_{λσ}),

where the Kronecker delta is used to handle the special case along the diagonal of the

density matrix. Thus, using a slightly transformed density, ERI symmetry again allows a

reduction to ket pairs with λ ≤ σ.
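As a small illustration of the transformed density just described (and of the ket prefactor in equation (2.8) below), the symmetrized factor can be computed on the host as in the following sketch; the row-major layout and names are illustrative only.

// Symmetrized density factor (P_ls + P_sl) / (1 + delta_ls) for one ket pair.
inline float ketDensityFactor(const float* P, int N, int lam, int sig) {
    float sym = P[lam * N + sig] + P[sig * N + lam];
    return (lam == sig) ? 0.5f * sym : sym;   // diagonal: divide by (1 + 1)
}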

If the AO shells are ordered by angular momentum, as suggested in the

previous chapter, then these symmetry reductions will also conveniently reduce the

number of specialized momentum-specific kernels that are needed. For example,

assuming a basis including s-, p-, and d-functions, the reduced pairs will include a

total of six momentum classes, ss, sp, sd, pp, pd, and dd, and require 36 specialized

Coulomb kernels, less than half of the 3^4 = 81 total momentum classes that might be

expected.

It is not possible to exploit the final class of ERI symmetry,

�µ!�"#( ) = "#�µ!( ) , without creating dependencies between rows of the ERI grid

and, as a result, performance sapping inter-block communication. Thus, ignoring the

screening of negligibly small integrals discussed below, the GPU J-Engine nominally

computes N4/4 integrals.

For clarity, the discussion of ERI symmetry has been carried out in terms of

contracted ERIs. However, building such contracted intermediates would require

many sums of irregular length that are difficult to parallelize efficiently across many

cores. Thus it is advantageous to construct the Coulomb operator directly from


primitive ERIs. Continuing with s-functions for the moment, the primitive Coulomb

matrix elements are calculated as follows.

J̃_{ij} = c_i c_j Σ_{kl} [ij|kl] c_k c_l P_{κ_k κ_l}   (2.6)

Here we define the AO index vector, κ_i, to select the contracted AO index to which

the i-th primitive function belongs. (Since at present our discussion is limited to s-

functions, we ignore the structure of functions organized within shells.) The

coefficients, c_j, represent weights of primitive functions within AO contractions. The

final Coulomb elements are then computed in a second summation step as follows.

J_{μν} = Σ_{i∈μ, j∈ν} J̃_{ij}   (2.6)

The evaluation of equation (2.6) can be carried out as described above, except

that the bra and ket AO pairs, χ_μχ_ν, are now replaced by expanded sets of primitive

pairs, { χ_iχ_j | i ∈ μ, j ∈ ν, μ ≤ ν }. The bra prefactor from equation (1.44) is now

augmented with the contraction coefficients.

N_{ij}^{bra} = c_i c_j N_{ij}   (2.7)

In constructing the ket pairs, the prefactor is pre-multiplied by both the contraction

coefficients and the appropriate density matrix element.

N_{kl}^{ket} = c_k c_l N_{kl} ( P_{κ_kκ_l} + P_{κ_lκ_k} ) / ( 1 + δ_{κ_kκ_l} )   (2.8)

Along with these prefactors, the quantities η_{ij} and P⃗_{ij} from equation (1.25), and each

pair's Schwarz contribution, B_{ij}^{bra} = [ij|ij]^{1/2} and B_{kl}^{ket} = [kl|kl]^{1/2} |P_{κ_kκ_l}|, for each bra and

ket pair, is transferred to the GPU. As illustrated in figure 1, a CUDA kernel then

processes the primitive ERI grid formed as the outer product of bra and ket pair arrays.

The resulting primitive Coulomb elements are returned to the host where the sum of

equation (2.6) is carried out to provide the final matrix elements.
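A host-side sketch of this final summation step (the second equation (2.6)) might look as follows; the primitive-to-AO index map kappa, the bra-pair index arrays, and all names are illustrative, not the TeraChem code.

#include <vector>

void contractPrimitiveJ(const std::vector<float>& Jprim,   // one value per bra primitive pair
                        const std::vector<int>& braI,       // primitive i of each bra pair
                        const std::vector<int>& braJ,       // primitive j of each bra pair
                        const std::vector<int>& kappa,      // primitive -> contracted AO index
                        int nAO, std::vector<double>& J) {  // accumulate in double on the host
    J.assign(static_cast<size_t>(nAO) * nAO, 0.0);
    for (size_t p = 0; p < Jprim.size(); ++p) {
        int mu = kappa[braI[p]], nu = kappa[braJ[p]];
        J[mu * nAO + nu] += Jprim[p];                       // bra pairs cover mu <= nu
    }
    for (int mu = 0; mu < nAO; ++mu)                        // J is symmetric
        for (int nu = mu + 1; nu < nAO; ++nu)
            J[nu * nAO + mu] = J[mu * nAO + nu];
}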

As noted above, screening small ERIs can reduce the computational

complexity of Coulomb construction from O(N^4) to O(N^2). Screening is introduced in

three passes. During the initial enumeration of primitive pairs, a conservative bound is

used to remove all primitive pairs for which B_{ij}^{bra} < ε_pair ≈ 10^{-15} hartree.^a Next, when

building the bra and ket arrays, pairs satisfying equations (2.9) and (2.10), respectively,

are also filtered out.

B_{ij}^{bra} ≤ ε_screen / max_{kl} B_{kl}^{ket}   (2.9)

B_{kl}^{ket} ≤ ε_screen / max_{ij} B_{ij}^{bra}   (2.10)

Here 10^{-12} is a typical value of ε_screen.^b Finally, prior to evaluating each ERI, the GPU

kernel evaluates the four-center density-weighted Schwarz bound, and the

computationally intensive ERI evaluation is skipped whenever the following holds.

B_{ij}^{bra} B_{kl}^{ket} = [ij|ij]^{1/2} [kl|kl]^{1/2} |P_{κ_kκ_l}| ≤ ε_screen   (2.11)

^a ε_pair corresponds to the variable THREPRE in TeraChem. ^b ε_screen corresponds to the variable THRECL in TeraChem.


Figure 2: Organization of ERIs for Coulomb formation. Rows and columns correspond to primitive bra and ket pairs respectively. Each ERI is colored according to the magnitude of its Schwarz bound. Data is derived from calculation on ethane molecule. Left grid is obtained by arbitrary ordering of pairs within each angular momentum class and suffers from load imbalance because large and small integrals are computed in neighboring cells. Right grid sorts bra and ket primitives by Schwarz contribution within each momentum class, providing an efficient structure for parallel evaluation.

The left half of figure 2 shows a typical distribution of Schwarz bounds for

ERIs in the primitive Coulomb grid. Merely skipping small integrals still requires

evaluating the bound of every cell in this grid. Also, because large and small integrals

are interspersed throughout the grid, CUDA warps will suffer from divergence

between threads that are evaluating an ERI and threads that are not. A more efficient

approach to screening can be achieved by sorting the final lists of bra and ket pair

quantities in descending order by B_{ij}^{bra} and B_{kl}^{ket}, respectively. The resulting grid,

shown on the right in figure 2, eliminates warp divergence because the significant

integrals are condensed in a contiguous region for each momentum class of ERI.

Furthermore, the bounds are now guaranteed to decrease across each row, so each

block may safely exit when the first negligible ERI is located. This eliminates the

overhead of even checking the Schwarz bound for negligible quartets.
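The host-side preparation described above can be sketched as follows. The thresholds, types, and names are illustrative; in practice the lists are built and sorted separately for each angular momentum class.

#include <algorithm>
#include <vector>

struct PairData { float bound; int i, j; };   // Schwarz contribution plus primitive indices

void screenAndSort(std::vector<PairData>& bra, std::vector<PairData>& ket,
                   float epsPair, float epsScreen) {
    auto dropBelow = [](std::vector<PairData>& v, float cut) {
        v.erase(std::remove_if(v.begin(), v.end(),
                [cut](const PairData& p) { return p.bound < cut; }), v.end());
    };
    auto maxBound = [](const std::vector<PairData>& v) {
        float m = 0.0f;
        for (const auto& p : v) m = std::max(m, p.bound);
        return m;
    };
    dropBelow(bra, epsPair);                              // conservative pre-filter
    dropBelow(ket, epsPair);
    float mk = maxBound(ket), mb = maxBound(bra);
    if (mk > 0.0f) dropBelow(bra, epsScreen / mk);        // equation (2.9)
    if (mb > 0.0f) dropBelow(ket, epsScreen / mb);        // equation (2.10)
    auto byBound = [](const PairData& a, const PairData& b) { return a.bound > b.bound; };
    std::sort(bra.begin(), bra.end(), byBound);           // descending Schwarz bound
    std::sort(ket.begin(), ket.end(), byBound);
}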


For clarity, the discussion above has considered only s-functions. For integrals

involving higher angular momentum functions, it is advantageous to compute all ERIs

belonging to a shell quartet simultaneously. In order to maintain fine-grained

parallelism, it would be desirable to distribute a shell quartet among threads in a block.

However, ERI evaluation involves extensive recurrence relations such as equation

(1.42) which cannot be efficiently parallelized between many SIMT processing cores.

Thus, it is preferable to assign an independent thread to compute all ERIs within a

shell quartet. The final algorithm follows the basic pattern described for s-functions,

except that instead of function pairs, the bra and ket vectors are now built from

primitive shell pairs, χ_Iχ_J.

The quantities N_{IJ}^{bra}, η_{IJ}, and P⃗_{IJ} are uniform for all functions within the shell

!PIJ are uniform for all functions within the shell

pair and are thus organized in pair data arrays as above. The Schwarz contributions,

Bijbra/ket , are not strictly uniform for d-type functions and above. However, to maintain

rotational invariance, the Schwarz bound is computed treating both primitives as

spherical s-functions, a quantity which is uniform across the shell pair. Because each

shell pair now spans several functions, the ket bound, B_{KL}^{ket}, must use the maximum

density element over the shell block.

B_{KL}^{ket} = [KL|KL]^{1/2} P_{KL}^{max}   (2.12)

P_{KL}^{max} = max_{k∈K, l∈L} |P_{κ_kκ_l}|   (2.12)

Additional pair data is required to compute ERIs of non-zero angular

momentum. Since the density elements differ for each function within the shell block


they can no longer be included in the ket prefactor, N_{kl}^{ket}, as in equation (2.8). The

Hermite expansion coefficients, E_{tuv}^{ij}, from equation (1.33) are also needed for every

function pair ij in the shell pair IJ.

An important simplification both in terms of the required pair data and

computational cost of the final integral is available by inserting equation (1.43) into

equation (2.3).2

J_{μν} = Σ_{i∈μ, j∈ν} N_{ij} c_i c_j Σ_{tuv} E_{tuv}^{ij} Σ_{kl} [ N_{kl} c_k c_l / √(η_{ij}+η_{kl}) ] Σ_{t'u'v'} (−1)^{t'+u'+v'} P_{κ_kκ_l} E_{t'u'v'}^{kl} R_{t+t', u+u', v+v'}^{0}   (2.12)

Because the quantities N_{kl}, c_k c_l, and η_{kl} are uniform for all function pairs within the

shell pair, the above sum can be segmented by primitive shell pair, KL, as follows.

J_{μν} = Σ_{i∈μ, j∈ν} N_{ij} c_i c_j Σ_{tuv} E_{tuv}^{ij} Σ_{KL} [ N_{KL} c_K c_L / √(η_{ij}+η_{KL}) ] Σ_{t'u'v'} D_{t'u'v'}^{KL} R_{t+t', u+u', v+v'}^{0}   (2.12)

D_{t'u'v'}^{KL} = Σ_{k∈K, l∈L} (−1)^{t'+u'+v'} P_{κ_kκ_l} E_{t'u'v'}^{kl}   (2.12)

Rather than separately packing McMurchie-Davidson coefficients and density

elements for every primitive function pair, combined coefficients are generated

through equation (2.12) for each pair of ket shells.

Equation (2.12) is computed in two steps in TeraChem. First the pair data

described above is copied to the GPU where the following Hermite Coulomb

contributions are computed.

J_{tuv}^{ij} = N_{ij} c_i c_j Σ_{KL} [ N_{KL} c_K c_L / √(η_{ij}+η_{KL}) ] Σ_{t'u'v'} D_{t'u'v'}^{KL} R_{t+t', u+u', v+v'}^{0}   (2.12)


In exact analogy to the algorithm described for s-functions above, each row of CUDA

threads is assigned a bra shell pair, χ_Iχ_J, and strides across the list of kets,

accumulating all associated J_{tuv}^{ij} values in separate registers. After all significant

Coulomb contributions have been computed, reductions within the rows of each block

produce the sums of equation (2.12). The Hermite Coulomb contributions are then

copied back to host memory where, in analogy to equation (2.6), the bra expansion

coefficients are used to produce the final Coulomb elements as follows.

J_{μν} = Σ_{i∈μ, j∈ν} Σ_{tuv} E_{tuv}^{ij} J_{tuv}^{ij}   (2.12)

GPU K-ENGINE

The GPU-based construction of the exchange operator in equation (2.4)

follows a similar approach to that employed in the GPU J-Engine algorithm.5 Here we

highlight adjustments that are required to accommodate the increased complexity of

exchange, which results from the splitting of the output index, μν, between the bra

and ket. As a result, rows of the bra-by-ket ERI grid do not contribute to a single

matrix element but scatter across an entire row of the K matrix. Additionally,

symmetry among the ERIs is more difficult to exploit for exchange since symmetric

pairs, e.g., (μν|λσ) ↔ (μν|σλ), now contribute to multiple matrix elements. The

split of the density index, λσ, between bra and ket also precludes the pre-contraction

of density elements into the pair quantities as was done for the ket pairs in the J-

Engine above.


Such complications could be naïvely removed by changing the definitions of

bra and ket to so-called physicist notation, where pairs include a primitive from each

of two electrons.

(μν|λσ) = ⟨μλ|νσ⟩   (2.13)

With such μλ and νσ pairs, a GPU algorithm analogous to the J-Engine could easily

be developed. Unfortunately, the new definitions of bra and ket also affect the

pairwise Schwarz bound, which now becomes the following.

⟨μλ|μλ⟩^{1/2} ⟨νσ|νσ⟩^{1/2} = (μμ|λλ)^{1/2} (νν|σσ)^{1/2}   (2.14)

As the distance, R, between χ_μ and χ_λ increases, the quantity (μμ|λλ) decays

slowly, as 1/R. This should be compared to a decay of e^{−R^2} for (μλ|μλ). This

weaker bound means that essentially all N^4 ERIs would need to be examined, leading

to a severe performance reduction compared to the N^2 Coulomb algorithm. Thus, the

scaling advantages of the μν/λσ pairing are very much worth maintaining, even at

the cost of reduced hardware efficiency.

The K-Engine begins with the usual step of enumerating AO shell pairs, χ_μχ_ν.

Because the χ_μχ_ν and χ_νχ_μ pairs contribute to different matrix elements, we neglect

symmetry and construct AO pairs for both μ ≤ ν and μ > ν. As with Coulomb

evaluation, the pairs are separated by angular momentum class, and different kernels

are tuned for each type of integral. Inclusion of μ > ν pairs requires additional pair

and kernel classes compared to the J-Engine, since, for example, kernels handling ps

pairs are distinct from those handling sp pairs.


Simply sorting the bra and ket primitive pairs by Schwarz bound would leave

contributions to individual K elements scattered throughout the ERI grid. In order to

avoid inter-block communication, it is necessary to localize these exchange

contributions. This is accomplished by sorting the primitive bra and ket pairs by μ-

and ν-index, respectively. Then the ERIs contributing to each element, K_{μν}, form a

contiguous tile in the ERI grid. This is illustrated in figure 3. Within each segment of

μ-pairs, the primitive pairs are additionally sorted by their Schwarz contributions, (μλ|μλ)^{1/2},

so that significant integrals are concentrated in the top left of each μν-tile.
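A minimal host-side sketch of this two-level ordering (grouping by bra μ index, then sorting by Schwarz contribution within each group) is given below with illustrative types; the ket list is ordered analogously by ν.

#include <algorithm>
#include <vector>

struct KPair { int mu; int shellPair; float bound; };  // bound = (mu lam | mu lam)^{1/2}

void orderExchangePairs(std::vector<KPair>& pairs) {
    std::sort(pairs.begin(), pairs.end(), [](const KPair& a, const KPair& b) {
        if (a.mu != b.mu) return a.mu < b.mu;   // contiguous mu segments form the K tiles
        return a.bound > b.bound;               // large ERIs sit at the top-left of each tile
    });
}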

Since the density index is split between bra and ket, density information cannot

be included in the primitive expansion coefficients or Schwarz contributions. Instead

additional vectors are packed for each angular momentum class, P_{ss}, P_{sp}, etc. The

packed density is ordered by shell pair and organized into CUDA vector types to allow

for efficient fetching. For example, each entry in the P_{sp} block contains three

elements, sp_x, sp_y, and sp_z, packed into a float4 (assuming single precision) vector

type. The maximum overall density element in a shell pair, P_{νσ}^{max}, is also pre-

computed for each shell pair, in order to minimize memory access when computing

the exchange Schwarz bound.

| [χ_μχ_ν|χ_λχ_σ] P_{νσ} | ≤ [χ_μχ_ν|χ_μχ_ν]^{1/2} [χ_λχ_σ|χ_λχ_σ]^{1/2} P_{νσ}^{max}   (2.15)


Figure 3: Schematic of a K-Engine kernel. Bra and ket pair arrays are represented by triangles to left and above grid. The pairs are grouped by μ and ν index and then sorted by bound. The paths of four blocks are shown in orange, with the zigzag pattern illustrated by arrows in the top right. The final reduction of an exchange element within a 2x2 block is shown to the right.

To map the exchange calculation to the GPU, a 2-D CUDA grid is employed in

which each block computes a single K_{μν} element and is thus assigned to a tile of the

primitive ERI grid. Each CUDA block passes through its tile of [μν|λσ] primitive

integrals in a zigzag pattern, computing one primitive shell quartet per thread. Ordering pairs

by bound allows a block to skip to the next row as soon as the boundary of

insignificant integrals is located. As with the J-Engine, the uniformity of ERI bounds

within each block is maximized by dimensioning square 2-D CUDA blocks. Figure 3

shows a 2x2 block for illustrative purposes; a block dimension of at least 8x8 is used

in practice. When all significant ERIs have been evaluated, a block-level reduction is


used to compute the final exchange matrix element. As with the J-Engine above, this

final reduction represents the only inter-thread communication required by the K-

Engine algorithm and detracts negligibly from the overall performance. The K-Engine

approach is similar to the 1B1CI ERI scheme described in chapter 1. However,

because each block is responsible for thousands of primitive ERIs, exchange

evaluation does not suffer from the load imbalance observed for the 1B1CI algorithm.

The structure of the ERI grid allows neighboring threads to fetch contiguous

elements of bra and ket data. Access to the density matrix is more erratic, because it is

not possible to order pairs within each μν-tile simultaneously by Schwarz bound and λ

or σ index. As a result, neighboring threads must issue independent density fetches

that are in principle scattered across the density matrix. In practice, once screening is

applied, most significant contributions to an element, K_{μν}, arise from localized λσ-

blocks, where λ is near μ and σ is near ν. Thus, CUDA textures can be used to

ameliorate the performance penalty imposed by the GPU on non-sequential density

access.

Another difficulty related to the irregular access of density elements is that the

density weighted Schwarz bounds for primitive ERIs calculated through equation

(2.15) are not strictly decreasing across each row or down each column of a μν-tile.

As a result, the boundary between significant and negligible ERIs is not as sharply

defined as in the Coulomb case. Yet, using the density to preempt integral evaluation

in exchange is critical. This can be appreciated by comparison with the Coulomb

operator in which the density couples only to the ket pair. Since in general both the

density element, P_{λσ}, and Schwarz bound, (λσ|λσ)^{1/2}, decrease as the distance r_{λσ}

increases, density weighting the Coulomb Schwarz bound has the effect of making

already small (λσ|λσ)^{1/2} terms even smaller. In exchange, on the other hand, the

density couples the bra and ket so that small density matrix elements can reduce

otherwise large bounds and greatly reduce the number of ERIs that need to be

evaluated. In fact, for large insulating systems, we expect the total number of non-

negligible ERIs to reduce from N^2 Coulomb integrals to just N exchange ERIs.

In order to incorporate the density into the ERI termination condition, the usual

exit threshold, ε_screen,^a is augmented by an additional guard multiplier, G ≈ 10^{-3}.^b Each

warp of 32 threads then terminates ERI evaluation across each row of the ERI-tile

when it reaches a contiguous set of primitive ERIs where the following holds for every

thread.

[χ_μχ_ν|χ_μχ_ν]^{1/2} [χ_λχ_σ|χ_λχ_σ]^{1/2} P_{νσ}^{max} < G ε_screen   (2.16)

In principle, this non-rigorous exit condition could neglect significant integrals in

worst-case scenarios. However, in practice, empirical tests demonstrate that the exit

procedure produces the same result obtained when density information is not

exploited.
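In CUDA this warp-level exit test can be expressed with a vote intrinsic, as in the following device-side sketch (illustrative names; the per-thread bound quantities are assumed to be already loaded):

// True only when equation (2.16) holds for every active thread in the warp.
__device__ bool warpShouldExit(float braBound, float ketBound, float pMax,
                               float epsScreen, float guard /* ~1e-3 */) {
    float bound = braBound * ketBound * pMax;      // density-weighted Schwarz bound
    return __all_sync(0xffffffffu, bound < guard * epsScreen);
}

Inside the tile loop, a warp executes something like if (warpShouldExit(...)) break; to advance to the next row, while individual ERIs whose own bound falls below ε_screen are still skipped thread by thread.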

For SCF calculations, the density and exchange matrices are symmetric. In this

case, we need only compute the upper triangle of matrix elements. This amounts to

exploiting (μν|λσ) ↔ (λσ|μν) ERI symmetry and means that we calculate N^4/2

^a ε_screen corresponds to the variable THREEX in TeraChem. ^b G corresponds to the variable KGUARD in TeraChem.


ERIs (in the absence of screening). This is four times more ERIs than are generated by

traditional CPU codes that take full advantage of the eight-fold ERI symmetry.

However, comparisons by instruction count are too simplistic when analyzing the

performance of massively parallel architectures. In the present case, freeing the code

of inter-block dependencies boosts GPU performance more than enough to

compensate for a four-fold work increase.

Because the AO density matrix elements, P_{μν}, in insulating systems rapidly

decay to zero with increasing distance, r_{μν}, it is possible to pre-screen entire

exchange elements, K_{μν}, for which the sum in equation (2.4) will be negligibly small.

A rigorous bound on each exchange matrix element can be evaluated using the

Schwarz inequality as follows.6

K_{μν} = Σ_{λσ}^{N} (μλ|νσ) P_{λσ} ≤ Σ_{λσ}^{N} (μλ|μλ)^{1/2} |P_{λσ}| (νσ|νσ)^{1/2}   (2.17)

Or, casting the AO Schwarz bounds in matrix form, Q_{μν} = (μν|μν)^{1/2}:

K_{μν} ≤ (QPQ)_{μν}   (2.18)

Whenever R⃗_μ is far from R⃗_ν, P_{λσ} will be zero for all χ_λ near χ_μ and χ_σ near χ_ν. In

practice, it is possible to impose a simple distance cutoff so that K_{μν} is approximated

as follows.

K_{μν} = Σ_{λσ}^{N} (μλ|νσ) P_{λσ}   if r_{μν} < R_MASK,   and   K_{μν} = 0   otherwise   (2.19)


The mask condition is trivially pre-computed for each element, K_{μν}, and packed in a

bit mask, 1 indicating that the exchange element should be evaluated and 0 indicating

that it should be skipped. Each block of CUDA threads checks its mask at the

beginning of the exchange kernel, and blocks that are assigned a zero bit exit

immediately. As will be shown for practical calculations in chapter 4, this simple

distance mask accelerates the K-Engine algorithm to scale essentially linearly with the

size of the system without introducing bookkeeping overhead^{7-10} that could easily

reduce the K-Engine’s overall efficiency.
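A sketch of the mask construction and of the corresponding block-level early exit is given below; R_MASK, the shell-center arrays, and the bit layout are illustrative assumptions, not TeraChem internals.

#include <cstdint>
#include <cmath>
#include <vector>

// One bit per K element: 1 means the bra and ket centers are within R_MASK.
std::vector<uint32_t> buildExchangeMask(const std::vector<double>& Rx,
                                        const std::vector<double>& Ry,
                                        const std::vector<double>& Rz,
                                        double rMask) {
    int n = static_cast<int>(Rx.size());                   // one center per AO shell
    std::vector<uint32_t> mask((static_cast<size_t>(n) * n + 31) / 32, 0u);
    for (int mu = 0; mu < n; ++mu)
        for (int nu = 0; nu < n; ++nu) {
            double dx = Rx[mu]-Rx[nu], dy = Ry[mu]-Ry[nu], dz = Rz[mu]-Rz[nu];
            if (std::sqrt(dx*dx + dy*dy + dz*dz) < rMask) {
                size_t bit = static_cast<size_t>(mu) * n + nu;
                mask[bit / 32] |= (1u << (bit % 32));       // 1 = evaluate K_{mu nu}
            }
        }
    return mask;
}

// In the exchange kernel, each block checks its bit and returns immediately if it is 0:
//     size_t bit = size_t(muTile) * nShells + nuTile;
//     if (!(mask[bit / 32] & (1u << (bit % 32)))) return;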

REFERENCES

(1) Almlof, J.; Faegri, K.; Korsell, K. J. Comp. Chem. 1982, 3, 385.

(2) Ahmadi, G. R.; Almlof, J. Chem. Phys. Lett. 1995, 246, 364.

(3) Kohn, W. Phys Rev 1959, 115, 809.

(4) des Cloizeaux, J. Phys Rev 1964, 135, A685.

(5) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2009, 5, 1004.

(6) Kussmann, J.; Ochsenfeld, C. J. Chem. Phys. 2013, 138.

(7) Burant, J. C.; Scuseria, G. E.; Frisch, M. J. J. Chem. Phys. 1996, 105, 8969.

(8) Ochsenfeld, C.; White, C. A.; Head-Gordon, M. J. Chem. Phys. 1998, 109, 1663.

(9) Schwegler, E.; Challacombe, M. J. Chem. Phys. 1999, 111, 6223.

(10) Neese, F.; Wennmohs, F.; Hansen, A.; Becker, U. Chem. Phys. 2009, 356, 98.

! "#!

CHAPTER THREE

DYNAMIC PRECISION ERI EVALUATIONa

GPUs, originally designed as consumer video game hardware, have shown tremendous potential for many tasks in computational chemistry, including electronic structure theory,1-10 ab initio molecular dynamics,11 and empirical force-field based molecular dynamics.8,12-14 The emergence of the CUDA development framework from NVIDIA has simplified the repurposing of this hardware for scientific computing,15 compared to early efforts on similar architectures that had relied on explicitly recasting scientific algorithms in terms of low-level graphics instructions.8,16 Nevertheless, because GPUs have been carefully designed for maximum performance in specific graphics processing tasks, it is still important when adapting scientific codes to run on GPUs to respect specialized hardware constraints such as memory access patterns and the non-uniform efficiency of floating-point arithmetic in different precisions. It is unlikely that these limitations will ever be fully eliminated, since they provide the foundation of the GPU's computational prowess. GPU hardware excels in massively parallel floating-point computation in part because it is good at little else.

The Coulomb and exchange algorithms developed in chapter 2 have, of course, already been carefully designed to maximize fine-grained parallelism and regular memory access patterns. The present chapter extends that work to consider the optimal use of floating-point precision in the Coulomb and exchange operators.

a Adapted with permission from N. Luehr, I. S. Ufimtsev, and T. J. Martinez, J. Chem. Theo. Comp. 2011, 7, 949-954. Copyright (2011) American Chemical Society.

ERI values span a wide dynamic range, from ~10^3 down to ~10^{-10} Hartrees, below which ERIs can be neglected while maintaining chemical accuracy. Single precision cannot accurately accumulate numbers across this wide dynamic range of thirteen decimal orders. Thus, early scientific programs relied on tricks such as compensated summation in order to maintain precision during large calculations.17 With the advent of hardware-based double precision instructions for commodity CPUs in the 1980s, it became standard practice to use double precision uniformly for all floating-point operations in quantum chemistry calculations. In fact, for various architectural and algorithmic reasons, single precision arithmetic offered only a slight performance advantage over double precision.

Like early CPUs, the first CUDA-enabled GPUs had no support for double precision arithmetic, demanding care in their use for quantum chemistry applications. Conveniently, the latest GPUs fully support double precision arithmetic with performance in the range of several hundred GFLOPS, superior to that offered by CPUs. Nevertheless, even the latest GPUs continue to provide up to 32x more performance for single precision operations. This disparity stems from the hardware's pedigree in graphics, where there is little need for double precision accuracy and the necessary increase in circuitry is difficult to justify. On any processor, single precision exhibits further performance advantages as a result of its smaller memory footprint, which reduces data bandwidth requirements2 and increases the number of values that can be cached in on-chip registers. Thus, for maximum performance on GPUs, it remains important to favor single precision arithmetic as much as possible.

To balance GPU performance with chemical accuracy, quantum chemistry implementations have adopted mixed precision approaches in which double precision operations are added sparingly to an otherwise single precision calculation. Matrix multiplication in the context of resolution-of-the-identity Møller-Plesset perturbation theory has been shown to provide accurate mixed precision results, even when the majority of operations are carried out in single precision.4,6 Single precision ERI evaluation has been successfully augmented with double precision accumulation into the matrix elements of the Coulomb and exchange operators.3,5 "Double precision accumulation" simply means that the ERIs are evaluated in single precision but a double precision variable is used to accumulate the products of density matrix elements and ERIs which make up the final operator (e.g. Coulomb or exchange). For example, the Coulomb operator can be constructed as:

J_{μν}^{64+} = Σ_{λσ} P_{λσ}^{32} (μν|λσ)^{32}   (3.1)

where the superscripts indicate the number of bits of precision used to compute each term and the ERIs are given as:

(μν|λσ) = ∫∫ φ_μ(r_1) φ_ν(r_1) φ_λ(r_2) φ_σ(r_2) / |r_1 − r_2| dr_1 dr_2   (3.2)

Computing a few of the largest ERIs in full double precision has also been shown5 to improve accuracy compared to calculations using only single precision for all ERIs. Incremental construction of the Fock matrix18 can also improve the accuracy of single precision ERI evaluation.3 Finally, it has been suggested that full single precision can be safely employed in the earliest SCF iterations.5 Such strategies have proven effective in improving mixed precision results for many calculations reported to date. However, no systematic study of mixed precision ERI evaluation has been undertaken. In fact, the precision cutoffs must be chosen carefully for each molecular system studied to guarantee that the absolute error is within a tolerable range.

In the present chapter, we introduce a systematic method of both controlling error and minimizing double precision operations in mixed precision Fock matrix construction. We begin by describing a mixed precision scheme in which all integrals are computed on the GPU, with large ERIs calculated in full double precision and small integrals in single precision with double precision accumulation. We show that the relative error in such calculations is well behaved for an array of systems and can be controlled to provide an effective precision and performance intermediate between that of single and full double precision. In order to further decrease the number of double precision integral evaluations, we suggest a new method that we term dynamic precision, in which the effective precision is adjusted dynamically between SCF iterations. In this way the minimum number of double precision integrals can be used to obtain any desired level of accuracy. Finally, we present performance and accuracy results to benchmark the dynamic precision approach as implemented in TeraChem.19

MIXED PRECISION IMPLEMENTATION

The magnitude of an ERI is commonly bounded using the Schwarz inequality.20 In direct SCF codes,18 the bound can be further reduced using elements of the density matrix as:

|(μν|λσ) P_{λσ}| ≤ (μν|μν)^{1/2} (λσ|λσ)^{1/2} |P_{λσ}|   (3.3)

Because only absolute accuracy is required in chemical applications, Yasuda introduced a cutoff on the density-weighted Schwarz bound to group ERIs into two batches.5 Those whose bound fell below the cutoff were calculated in single precision on the GPU. Integrals whose bound was larger than the cutoff were evaluated in double precision on the CPU.

J_{μν}^{64+} = Σ_{λσ} P_{λσ}^{32} (μν|λσ)^{32}   if |P_{λσ}| (μν|μν)^{1/2} (λσ|λσ)^{1/2} ≤ ε_prec,
            Σ_{λσ} P_{λσ}^{64} (μν|λσ)^{64}   if |P_{λσ}| (μν|μν)^{1/2} (λσ|λσ)^{1/2} > ε_prec   (3.4)

As has been previously noted,1 with the advent of robust double precision support on the GPU it is no longer necessary to evaluate some ERIs on the CPU. We have implemented a mixed precision Fock matrix evaluation scheme similar to that suggested by Yasuda. Instead of using the CPU, however, we developed double precision analogues of our previously reported single precision Coulomb and exchange routines.3 Implementation details are described in the previous chapter.

In order to make the most of the GPU's memory bandwidth, the double and single precision integrals are handled in a two-pass algorithm. As previously described, our ERI algorithm operates directly on the bra and ket primitive Gaussian pairs, which are sorted by decreasing Schwarz bound. In the first pass, data for the largest primitive pairs is packed into double precision arrays, and any ERI whose bound is greater than the precision threshold is calculated using double precision GPU kernels. In the second pass, smaller primitive pairs are added to the bra and ket data, which is reassembled into single precision arrays and processed by single precision kernels.

Because the four-index, density-weighted Schwarz bound is computed only within the GPU kernels, some duplication occurs between the sets of single and double precision primitive pairs, and each kernel must filter out individual ERIs whose bound falls outside of the relevant range. Since, as before, the pair quantities are ordered by Schwarz bound, the single and double precision integrals reside in contiguous blocks as shown in figure 1. This minimizes warp divergence on the GPU, since neighboring threads will in general skip or compute integrals together. When filtering, it is essential that both the single and double precision kernels handle the bounds identically. Otherwise, the different rounding behavior exhibited by single and double precision arithmetic will cause integrals close to the bound to be skipped or double counted. In our implementation, the double precision kernels cast the bound quantities to single precision before determining whether the associated integrals and their contributions will be evaluated.

Figure 1: Organization of double and single precision workloads within Coulomb ERI grids. Rows and columns correspond to primitive bra and ket pairs. On left each ERI is colored according to the magnitude of its Schwarz bound. On right ERIs are colored by required precision. Yellow ERIs require double precision while those in green may be evaluated in single precision. Blue ERIs are neglected entirely.
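The essential point of the two-pass filter, that both kernels must make bitwise-identical decisions, can be illustrated with the following device-side sketch (hypothetical names; the actual TeraChem kernels differ):

// Decide whether a given ERI belongs to the current pass (equation (3.4)).
// Both the FP64 and FP32 kernels cast the density-weighted Schwarz bound to float
// first, so they partition the ERIs identically and nothing is skipped or double counted.
__device__ bool evaluateInThisPass(float braBound, float ketBound,   // already cast to float
                                   float epsPrec, bool doublePrecisionPass) {
    float bound = braBound * ketBound;        // density-weighted Schwarz bound
    bool large  = bound > epsPrec;            // large ERIs belong to the FP64 pass
    return doublePrecisionPass ? large : !large;
}

In the single precision pass, the products of density elements and ERIs are still accumulated into a double precision register, which is the double precision accumulation of equation (3.1).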

MIXED PRECISION ACCURACY

E61!4(.123.1'!'6(7,!*,!:*+3)1!C!71)1!26('1,!-'!)19)1'1,8-8*51!81'8!2-'1'!

8(! '830/! 861!-223)-2*1'!(:! '151)-.!4*L10!9)12*'*(,! 86)1'6(.0'A!$1(418)*1'!71)1!

9)19-)10! -8! @(86! -,!(98*4*N10!XZ]gM>#=$!4*,*434!-,0!-!0*'8()810! +1(418)/!

(@8-*,10! @/! 91):()4*,+! IJE! 0/,-4*2'! -8! YC???hA! E61! 0([email protected]! 9)12*'*(,! $%&!

*49.141,8-8*(,! 7-'! :*)'8! @1,264-);10! -+-*,'8! $HSOQQC=! 3'*,+! 861! 01:-3.8!

$HSOQQ!2(,51)+1,21!-,0! 87(>1.128)(,! 86)1'6(.0'A!E61! )1'3.8*,+! :*,-.! 1,1)+*1'<!

'6(7,! *,! [email protected]! =<! *,0*2-81! +((0! -+)1141,8A! Q7*826*,+! 8(! '*,+.1! 9)12*'*(,! OXK'!

01+)-01'! 861! )1'3.8! @/! #>"! ()01)'! (:! 4-+,*8301A! Z(7151)<! *8! '6(3.0! @1!

1496-'*N10!86-8!:()!4(.123.1'!2(,8-*,*,+!-'!4-,/!-'!(,1!63,0)10!-8(4'<!'*,+.1!

9)12*'*(,!9)(5*01'!-01\3-81!)1'3.8'!i!861!-@'(.381!1))()!(:!861!1,1)+/!2(493810!

7*86!'*,+.1!9)12*'*(,!OXK'! *'!71..!@1.(7!=!;2-.g4(.!151,! :()!I13)(;*,*,!H!7*86!

="T!-8(4'A!!


Figure 2: Molecular geometries used to benchmark the correlation between precision cutoff and the effective precision of the final energy. Optimized geometries (shown here) were used in addition to distorted nonequilibrium geometries prepared by carrying out RHF/STO-3G NVT dynamics at 2000 K (1000 K for Ascorbic Acid).

              Ascorbic Acid    Lactose          Cyanobacterial Toxin
GAMESS        -680.5828947     -1289.6666250    -2491.2058893
TeraChem DP   -680.5828947     -1289.6666250    -2491.2058890
TeraChem SP   -680.5828071     -1289.6664266    -2491.2053660

              Neurokinin A     5x6 Nanotube
GAMESS        -4089.6883770    -13790.1415171
TeraChem DP   -4089.6883762    -13790.1415176
TeraChem SP   -4089.6879824    -13790.1389987

Table 1: RHF/6-31G final energies in Hartrees compared between GAMESS (set at default convergence and two-electron thresholds), our GPU accelerated TeraChem code performing all calculations in double precision (TeraChem DP), and TeraChem using single precision for ERIs with double precision accumulation into the Fock matrix elements (TeraChem SP). Distorted nonequilibrium geometries from RHF/STO-3G NVT dynamics at 2000 K (1000 K for Ascorbic Acid) were used.

Next we evaluated the mixed precision approach by varying the precision threshold between 10^{-9} Hartrees (nearly all integrals are evaluated in double precision) and 1.0 Hartrees (essentially all integrals use single precision). Negligibly small ERIs were screened according to the density-weighted Schwarz upper bound of equation (3.3) using a conservative threshold to ensure that any differences in the final energy above ~10^{-7} Hartrees would be dominated by mixed precision errors. The average relative energy difference between full double and mixed precision SCF energies for the ten test geometries described above is plotted as a function of precision threshold in figure 3. Although the absolute error increases along with the total energy of the systems, the relationship between the precision cutoff and relative error is roughly linear and is independent of system size and basis set. Thus, each threshold can be associated with an effective relative error in the energy.


Figure 3. Average relative precision error in final RHF energies versus the precision threshold. Both minimized and distorted non-equilibrium geometries for the molecules in figure 2 are included in averages. Error bars represent two standard deviations above the mean. The black line represents the empirical error bound given by equation (3.5).

Using figure 3 it is possible to empirically pre-select a precision threshold corresponding to any accuracy requirement using only the estimated total energy (to calculate relative error). A reasonably conservative empirical bound on the precision error is given in the following formula and plotted for reference in figure 3.

Err(Thre) = 2.0 × 10^{-6} Thre^{0.7}   (3.5)

By inverting equation (3.5), we could select an appropriate precision threshold at the start of the SCF procedure and use the minimum allowable effective precision. However, this would require early iterations, whose density matrices are highly approximate, to use the full level of precision needed at convergence. This is especially wasteful in very large systems since the accuracy required at convergence nears that of full double precision. To further reduce the use of double precision in early iterations while still achieving the required accuracy at convergence, we introduce a dynamic precision approach, described below.

DYNAMIC PRECISION IMPLEMENTATION

The essence of the dynamic precision approach is to use equation (3.5) to select a different threshold for each iteration of the SCF procedure. Early iterations have been shown to tolerate relatively large errors in the Fock matrix without hampering convergence.3,22 We take the maximum element of the DIIS error vector23 from the previous iteration as a metric of this tolerance, and at each iteration we select a threshold providing precision safely below the DIIS error. This ensures that the precision error is a small contributor to the total error. By reducing the precision threshold gradually as convergence progresses, it is possible to approach full double precision results while minimizing the number of actual double precision operations.

To improve performance, our SCF code uses an iterative update approach (also known as incremental Fock matrix formation) to build up the Fock matrix over the course of the SCF procedure.18 The update approach decomposes the Fock matrix as

F_{i+1}(P_{i+1}) = F_i(P_i) + F(P_{i+1} − P_i),   (3.6)

so that only the last term needs to be calculated in each SCF iteration. Here P_i and F_i(P_i) are the density and Fock matrices at the i-th SCF iteration. Because changes in the density matrix become very small near convergence, the iterative Fock approach allows many additional integrals to be screened. Typically this provides an overall speedup between 2x and 3x over the conventional SCF approach. However, the naïve implementation of dynamic precision described above causes the iterative Fock method to converge incorrectly, because each update of the Fock matrix cannot correct for the precision error of the previous step.

Rather than abandoning the iterative Fock algorithm altogether, we introduce the following adjustment. When the relative DIIS error drops below the error bound of the current precision threshold, the threshold is reduced to provide enough accuracy for several orders of magnitude reduction in the DIIS error. Each time the precision is improved, the Fock matrix is recalculated from scratch. Between threshold reductions, the faster iterative Fock update scheme can be safely employed.
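A host-side sketch of this schedule is given below. It inverts the empirical bound of equation (3.5) and triggers a full Fock rebuild whenever the threshold tightens; the headroom factor and the exact trigger condition are illustrative simplifications of the logic described above, not the TeraChem implementation.

#include <cmath>

double thresholdForError(double targetRelErr) {         // invert equation (3.5)
    return std::pow(targetRelErr / 2.0e-6, 1.0 / 0.7);
}

// Returns the precision threshold for this iteration; sets rebuildFock when it changes.
double updatePrecision(double diisRelErr, double currentThre, bool* rebuildFock) {
    *rebuildFock = false;
    double currentErrBound = 2.0e-6 * std::pow(currentThre, 0.7);   // Err(Thre)
    if (diisRelErr < currentErrBound) {                  // precision is now the limiting error
        double target = 1.0e-3 * diisRelErr;             // headroom for further DIIS reduction
        currentThre   = thresholdForError(target);
        *rebuildFock  = true;                            // incremental updates cannot fix old error
    }
    return currentThre;                                  // otherwise keep incremental Fock updates
}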

RESULTS

To benchmark our dynamic precision approach, we performed RHF energy calculations on the test geometries presented above, as well as some larger systems shown in figure 4. Table 2 demonstrates the accuracy provided by our dynamic precision approach. In each calculation, the dynamic precision method is successful in reproducing the full double precision results to within the convergence criteria. Furthermore, the number of SCF iterations required to reach convergence (also shown in Table 2) is essentially identical between dynamic and double precision. Finally, the SCF energy difference between dynamic and double precision remains fairly constant over the range of test systems, indicating that our empirical error bound is reasonably calibrated.

Columns for each system below: Double Precision Final Energy, Iter; Dynamic Precision Final Energy, Iter; Precision Error; Convergence Threshold

Ascorbic Acid (Minimum)

-680.6986413210 16 -680.6986413153 16 5.70E-09 10-7

-680.6986413213 12 -680.6986413898 12 6.85E-08 10-5

Ascorbic Acid (1000K)

-680.5828947151 17 -680.5828947066 17 8.50E-09 10-7

-680.5828947060 12 -680.5828947665 12 6.05E-08 10-5

Lactose (Minimum)

-1290.0883460632 14 -1290.0883460414 14 2.18E-08 10-7

-1290.0883460086 10 -1290.0883459660 10 4.26E-08 10-5

Lactose (2000K)

-1289.6666249592 15 -1289.6666249614 15 2.20E-09 10-7

-1289.6666249365 11 -1289.6666248603 11 7.62E-08 10-5

Cyano Toxin (Minimum)

-2492.3971992758 19 -2492.3971992873 19 1.15E-08 10-7

-2492.3971992730 13 -2492.3971985116 13 7.61E-07 10-5

Cyano Toxin (2000K)

-2491.2058890235 21 -2491.2058890017 21 2.18E-08 10-7

-2491.2058889916 13 -2491.2058886707 13 3.21E-07 10-5

Neurokinin A (Minimum)

-4091.3672645555 19 -4091.3672645944 20 3.89E-08 10-7

-4091.3672645489 14 -4091.3672644494 14 9.95E-08 10-5

Neurokinin A (2000K)

-4089.6883762179 21 -4089.6883761946 21 2.33E-08 10-7

-4089.6883760772 15 -4089.6883758130 15 2.64E-07 10-5

Nanotube (Minimum)

-13793.7293925221 24 -13793.7293925323 23 1.02E-08 10-7

-13793.7293924922 15 -13793.7293928287 15 3.37E-07 10-5

Nanotube (2000K)

-13790.1415175662 29 -13790.1415175584 27 7.80E-09 10-7

-13790.1415175332 18 -13790.1415191026 18 1.57E-06 10-5

Crambin

-17996.6562925538 18 -17996.6562926036 18 4.98E-08 10-7

-17996.6562925535 12 -17996.6562927894 12 2.36E-07 10-5

Ubiquitin

-29616.4426376594 24 -29616.4426376596 24 2.00E-10 10-7

-29616.4426376302 18 -29616.4426376655 18 3.53E-08 10-5

T-Cadherin EC1

-36975.6726049407 21 -36975.6726049394 21 1.30E-09 10-7

-36975.6726049265 16 -36975.6726048777 16 4.88E-08 10-5

Ribonuclease A

-50813.1471248227 19 -50813.1471248179 19 4.80E-09 10-7

-50813.1471247051 12 -50813.1471250247 12 3.20E-07 10-5

Table 2. Comparison of double and dynamic precision final RHF/6-31G energies (listed in Hartree). Precision error is taken as the absolute difference between double and dynamic precision energies. The number of SCF iterations required to reach convergence is listed (Iter) as well as the threshold used to converge the maximum element of the DIIS error matrix.

! M"!

!

Figure 4. Additional molecules used to test the dynamic precision algorithm.

Table 3 summarizes the performance of our algorithm on two GPU platforms. On the older Tesla C1060 GPU, dynamic precision accelerates the construction of the Fock matrix by up to 4x over full double precision. The Tesla C2050 includes a greater proportion of double precision units, and as a result the performance margin between double and single precision arithmetic is narrowed. However, even here dynamic precision Fock matrix construction is between 2x and 3x faster than that performed in full double precision. Figure 5 compares Fock matrix construction using dynamic, mixed, and single precision to full double precision calculations on Tesla C1060 GPUs. Here mixed precision refers to statically fixing the precision threshold for the entire SCF procedure at the value prescribed by inverting equation (3.5). Dynamic precision consistently outperforms the simpler mixed precision scheme despite requiring periodic rebuilds of the Fock matrix. More importantly, dynamic precision consistently provides between 70 and 80% of the performance of single precision while providing results comparable to full double precision, and this pattern remains intact even for the largest systems.

System            Tesla C1060 (Dynamic Runtime / Fock Speedup / Total Speedup)    Tesla C2050 (Dynamic Runtime / Fock Speedup / Total Speedup)
Ascorbic Acid     2.23 sec / 4.0 / 3.4        2.93 sec / 2.0 / 1.4
Lactose           9.70 sec / 4.1 / 3.8        8.41 sec / 2.2 / 1.9
Cyano Toxin       87.66 sec / 4.2 / 4.0       68.44 sec / 2.4 / 2.3
Neurokinin A      197.91 sec / 3.8 / 3.7      149.76 sec / 2.3 / 2.3
Nanotube          1716.88 sec / 3.0 / 2.7     1155.58 sec / 3.0 / 2.6
Crambin           1104.22 sec / 2.9 / 2.5     762.09 sec / 2.1 / 1.8
Ubiquitin         11833.58 sec / 2.8 / 2.5    7517.68 sec / 2.3 / 2.0
T-Cadherin EC1    17408.21 sec / 2.7 / 2.4    10781.42 sec / 2.3 / 1.9
Ribonuclease A    21869.37 sec / 2.7 / 2.3    --- / --- / ---

Table 3. Runtime comparison between dynamic and full double precision for RHF/6-31G single point energy calculations converged to a maximum DIIS error of 10-5 a.u. Calculations were run on a dual Intel Xeon X5570 platform with 72 gigabytes of ram. Smaller systems (from ascorbic acid to neurokinin) utilized a single GPU, while 8 GPUs operated in parallel for the larger systems. The speedups for Fock matrix construction (Coulomb and exchange operator evaluation, including all data packing and host-GPU transfers) are listed along with the speedup of the entire energy evaluation. The Tesla C2050 could not treat Ribonuclease A at the RHF/6-31G level due to memory constraints.

Figure 5. Speedups for the construction of the Fock matrix accumulated over all SCF iterations using single, dynamic, and mixed precision relative to full double precision performance. Calculations were run on Nvidia Tesla C1060 GPUs and were converged to a DIIS error of 10-7 a.u. Mixed precision calculations used a static precision threshold chosen for each system to give an absolute accuracy of 10-7 Hartree. A single GPU was employed for the smaller systems (from 20 to 200 atoms), while the larger systems utilized 8 GPUs in parallel. Single precision failed to converge for ubiquitin.


CONCLUSION

We have demonstrated that by dynamically adjusting the ratio of integrals calculated in single and double precision on the GPU it is possible to minimize the number of double precision arithmetic operations in constructing the Fock matrix while still systematically controlling the error. Exploiting this flexibility, we have customized our Fock matrix routines for maximum performance on the GPU. Our dynamic precision implementation is able to achieve in excess of 70% of single precision's performance while maintaining accuracy comparable to full double precision.

For extremely large systems, the required relative accuracy may well extend beyond the capacity of double precision.24 In this limit the approach outlined above will again prove useful in systematically improving double precision with a minimum of higher precision arithmetic operations. A more comprehensive multi-precision strategy can be easily envisioned, for example using single, double, and quadruple precision evaluation of different ERIs according to their magnitude. Furthermore, the same dynamic precision approach can be applied to the calculation of the exchange-correlation operator in density functional theory, and similar performance gains will be obtained.

REFERENCES

(1) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2008, 4, 222.

(2) Ufimtsev, I. S.; Martinez, T. J. Comp. Sci. Eng. 2008, 10, 26.

(3) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2009, 5, 1004.


(4) Vogt, L.; Olivares-Amaya, R.; Kermes, S.; Shao, Y.; Amador-Bedolla, C.; Aspuru-Guzik, A. J. Phys. Chem. A 2008, 112, 2049.

(5) Yasuda, K. J. Comp. Chem. 2008, 29, 334.

(6) Olivares-Amaya, R.; Watson, M. A.; Edgar, R. G.; Vogt, L.; Shao, Y. H.; Aspuru-Guzik, A. J. Chem. Theo. Comp. 2010, 6, 135.

(7) Asadchev, A.; Allada, V.; Felder, J.; Bode, B. M.; Gordon, M. S.; Windus, T. L. J. Chem. Theo. Comp. 2010, 6, 696.

(8) Anderson, J. A.; Lorenz, C. D.; Travesset, A. J. Comp. Phys. 2008, 227, 5342.

(9) Genovese, L.; Ospici, M.; Deutsch, T.; Mehaut, J.-F.; Neelov, A.; Goedecker, S. J. Chem. Phys. 2009, 131, 034103.

(10) Yasuda, K. J. Chem. Theo. Comp. 2008, 4, 1230.

(11) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2009, 5, 2619.

(12) Friedrichs, M. S.; Eastman, P.; Vaidyanathan, V.; Houston, M.; Legrand, S.; Beberg, A. L.; Ensign, D. L.; Bruns, C. M.; Pande, V. S. J. Comp. Chem. 2009, 30, 864.

(13) Harvey, M. J.; Giupponi, G.; DeFabiritiis, G. J. Chem. Theo. Comp. 2009, 5, 1632.

(14) Stone, J. E.; Phillips, J. C.; Freddolino, P. L.; Hardy, D. J.; Trabuco, L. G.; Schulten, K. J. Comp. Chem. 2007, 28, 2618.

(15) Kirk, D. B.; Hwu, W. W. Programming Massively Parallel Processors: A Hands-On Approach; Morgan Kauffman Burlington, MA, 2010.

(16) Levine, B.; Martinez, T. J. Abst. Pap. Amer. Chem. Soc. 2003, 226, U426.

(17) Kahan, W. Comm. ACM 1965, 8, 40.

(18) Almlof, J.; Faegri, K.; Korsell, K. J. Comp. Chem. 1982, 3, 385.

(19) Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2011, 7, 949.

(20) Whitten, J. L. J. Chem. Phys. 1973, 58, 4496.

(21) Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S. J.; Windus, T. L.; Dupuis, M.; Montgomery, J. A. J. Comp. Chem. 1993, 14, 1347.

(22) Rudberg, E.; Rubensson, E. H.; Salek, P. J. Chem. Theo. Comp. 2009, 5, 80.


(23) Pulay, P. J. Comp. Chem. 1982, 3, 556.

(24) Takashima, H.; Kitamura, K.; Tanabe, K.; Nagashima, U. J. Comp. Chem. 1999, 20, 443.


CHAPTER FOUR

DFT EXCHANGE-CORRELATION EVALUATION ON GPUS

The exchange-correlation potential is the final ingredient for a complete GPU

accelerated SCF program. Although not as intensive as the Coulomb and exchange

procedures, the evaluation of the exchange-correlation potential represents a second

major bottleneck for DFT calculations. We limit our consideration here to the common

class of generalized gradient approximation (GGA) functionals. In the general spin-

polarized case, where \rho_\alpha \neq \rho_\beta, equations (1.15) and (1.23) for the exchange-

correlation energy and potential become the following.1,2

E_{XC} = \int f_{xc}\big(\rho_\alpha(\vec{r}),\rho_\beta(\vec{r}),\gamma_{\alpha\alpha}(\vec{r}),\gamma_{\alpha\beta}(\vec{r}),\gamma_{\beta\beta}(\vec{r})\big)\, d^3\vec{r}   (4.1)

V_{\mu\nu}^{XC\alpha} = \int \left[ \frac{\partial f_{xc}}{\partial\rho_\alpha}\,\chi_\mu\chi_\nu + \left( 2\frac{\partial f_{xc}}{\partial\gamma_{\alpha\alpha}}\nabla\rho_\alpha + \frac{\partial f_{xc}}{\partial\gamma_{\alpha\beta}}\nabla\rho_\beta \right)\cdot\nabla(\chi_\mu\chi_\nu) \right] d^3\vec{r}   (4.2)

Where an analogous expression to equation (4.2) is used for the beta-spin V_{\mu\nu}^{XC\beta}, and \gamma represents the gradient invariants as follows.

\gamma_{\sigma\sigma'}(\vec{r}) = \nabla\rho_\sigma(\vec{r}) \cdot \nabla\rho_{\sigma'}(\vec{r})   (4.3)

Unlike the Gaussian ERIs used to construct the Coulomb and exchange

operators, typical xc-kernel functions, fxc , are not amenable to analytical integration.

Instead, real-space numerical quadrature is employed to evaluate equations (4.1) and

(4.2). The quadrature grid itself proves to be non-trivial, and we consider it first before

discussing the evaluation of the exchange-correlation potential more properly.


QUADRATURE GRID GENERATIONa

Because molecular potentials and charge densities exhibit discontinuous cusps

at each nucleus, accurate numerical integration requires very dense quadrature grids at

least in these local regions. Thus efficient quadrature schemes introduce spherical

grids centered at each atom with integration points becoming more diffuse at larger

radial distances. For example, combining Euler-Maclaurin3 radial and Lebedev4

angular quadratures for each atom-centered grid allows integration around an atom

centered at \vec{R}_a as follows,

\int f(\vec{r})\, d^3\vec{r} \approx \sum_i \sum_j E_i L_j\, f(\vec{R}_a + \vec{r}_{ij}) = \sum_p \omega_p\, f(\vec{R}_a + \vec{r}_p)   (4.4)

where E_i and L_j represent radial and angular weights respectively, p is a combined ij index, and \omega_p = E_i L_j. Each atomic grid independently integrates over all of R^3 but is

most accurate in the region of its own nucleus. In forming molecular grids, quadrature

points originating from different atoms must be weighted to normalize the total sum

over all atoms and to ensure that each atomic quadrature dominates in the region

around its nucleus. An elegant scheme introduced by Becke5 accomplishes this by

defining a spatial function, w_a(\vec{r}), for each atomic quadrature, a, that gives the non-negative weight assigned to that quadrature in the region of \vec{r}. Double counting is avoided by constraining the weight function so that the following holds for all \vec{r}.

\sum_a w_a(\vec{r}) = 1   (4.5)

a Adapted from GPU Computing Gems, Emerald Edition, N. Luehr, I. Ufimtsev, T. Martinez, “Dynamical Quadrature Grids, Applications in Density Functional Calculations,” 35-42, Copyright (2011), with permission from Elsevier.


Constructing w_a(\vec{r}) so that it is near unity in the region of the a-th nucleus also causes

each atomic quadrature to dominate in its most accurate region. The final quadrature is

then evaluated as follows.

\int f(\vec{r})\, d^3\vec{r} \approx \sum_a \sum_p w_a(\vec{R}_a + \vec{r}_p)\, \omega_p\, f(\vec{R}_a + \vec{r}_p) = \sum_a \sum_p w_{ap}\, \omega_p\, f(\vec{r}_{ap})   (4.6)

Because the quadrature grid is now slaved to the molecular geometry, it must

provide well-defined gradients with respect to nuclear motion.6 This is easily

accomplished by ensuring that w_a(\vec{r}) is differentiable. Thus, Becke proposed the

following weighting scheme.5

w_{np} = \frac{P_n(\vec{r}_{np})}{\sum_m P_m(\vec{r}_{np})}   (4.7)

P_a(\vec{r}) = \prod_{b \neq a} s(\mu_{ab})   (4.8)

\mu_{ab}(\vec{r}) = \frac{|\vec{R}_a - \vec{r}| - |\vec{R}_b - \vec{r}|}{|\vec{R}_a - \vec{R}_b|} = \frac{r_a - r_b}{R_{ab}}   (4.9)

s(\mu) = \frac{1}{2}\big(1 - g(\mu)\big)   (4.10)

g(\mu) = p(p(p(\mu)))   (4.11)

p(\mu) = \frac{3}{2}\mu - \frac{1}{2}\mu^3   (4.12)
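As a quick check of the weighting scheme (a worked example added here for illustration), consider a point midway between the two nuclei of a homonuclear diatomic. There r_a = r_b, so \mu_{ab} = 0, and from equations (4.10)-(4.12), p(0) = 0, g(0) = 0, and s(0) = 1/2. Both cell functions are then equal, P_a = P_b = 1/2, and equation (4.7) gives

w_a = \frac{P_a}{P_a + P_b} = \frac{1}{2},

so the two atomic quadratures share such a point equally, as symmetry requires.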

As is easily verified from the equations above, the effort to compute the Becke

weight at each grid point is determined by the denominator of equation (4.7) and

scales as O(N^2) with the number of atoms in the system. Since the total number of grid

points also increases linearly with number of atoms, the total effort to compute the


quadrature grid is O(N^3). Although the prefactor is modest compared to Coulomb and exchange evaluation, generating the quadrature grid still becomes the dominant

bottleneck for calculations on large systems that are now routinely handled by the

methods described in chapters 2 and 3.

In order to reduce the formal scaling of grid generation, Stratmann and

coworkers replaced Becke’s switching function, s(µ) defined in equation (4.10), with

the following piecewise definition, where the parameter a is given the empirically

determined value of 0.64.7

s(\mu) = \begin{cases} 1 & -1 \le \mu \le -a \\ \frac{1}{2}\big(1 - z(\mu)\big) & -a \le \mu \le +a \\ 0 & +a \le \mu \le +1 \end{cases}   (4.13)

z(\mu) = \frac{1}{16}\left[ 35(\mu/a) - 35(\mu/a)^3 + 21(\mu/a)^5 - 5(\mu/a)^7 \right]   (4.14)

Equation (4.13) drives the Becke weight to zero significantly faster for distant

quadrature points than equation (4.10). This allows a distance cutoff to be introduced

so that the sum and product of equations (4.7) and (4.8) need only consider a small set

of atoms in the neighborhood of \vec{r}_{np}. For large systems, the number of significant

neighbors becomes constant leading to linear scaling.
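To make the piecewise form concrete, the following is a minimal CUDA sketch of the switching function of equations (4.13) and (4.14); the function name and the hard-coded parameter a = 0.64 are illustrative rather than taken from the production code.

#include <cuda_runtime.h>

// Piecewise Stratmann switching function, equations (4.13)-(4.14).
// The cutoff parameter a = 0.64 is the empirical value quoted in the text.
__host__ __device__ inline float stratmann_s(float mu)
{
    const float a = 0.64f;
    if (mu <= -a) return 1.0f;                 // point lies firmly in atom a's cell
    if (mu >=  a) return 0.0f;                 // point belongs to a neighboring cell
    float x  = mu / a;
    float x2 = x * x;
    // z(mu) = (1/16)[35x - 35x^3 + 21x^5 - 5x^7], written in Horner form
    float z = (1.0f / 16.0f) * x * (35.0f + x2 * (-35.0f + x2 * (21.0f - 5.0f * x2)));
    return 0.5f * (1.0f - z);                  // smooth interior branch of eq. (4.13)
}

Because s(\mu) is exactly 0 or 1 outside |\mu| < a, any neighbor whose \mu value exceeds +a can be skipped entirely, which is what makes the distance cutoff, and the resulting linear scaling, possible.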

In our implementation, the GPU is used to evaluate equation (4.7) for each grid

point since this step dominates the runtime of the grid setup procedure. Before the

kernel is called, the spherical grids are expanded around each atom by table lookup,

the points are sorted into spatial bins, and lists of neighbor atoms are generated for

each bin. These steps are nearly identical for CPU and GPU implementations. The


table lookup and point sorting are already very fast on the CPU, and no GPU

implementation is planned. Building atom lists for each bin is more time consuming

and may benefit from further GPU acceleration in the future, but is not a focus of the

present chapter.

The serial implementation of Becke weight generation falls naturally into the

triple loop pseudo-coded in figure 1. Since the denominator includes the numerator as

one of its terms, we focus on calculating the former, setting aside the numerator in a

local variable when we come upon it.

Figure 1: Pseudo-code listing of Becke Kernel as a serial triple loop.

In order to accelerate this algorithm on the GPU, we first needed to locate a

fine-grained yet independent task to serve as the basis for our device threads. These

arise naturally by dividing loop iterations in the serial implementation into CUDA

blocks and threads. Our attempts at optimization fell into two categories. The finest

level of parallelism was to have each block cooperate to produce the weight for a

single point (block-per-point). In this scheme CUDA threads were arranged into 2D

01 for each point in point_array
02   P_a = P_sum = 0.0f
03   list = point.atom_list
04   for each atomA in list
05     Ra = dist(atomA, point)
06     P_tmp = 1.0f
07     for each atomB in list
08       Rb = dist(atomB, point)
09       rec_Rab = rdist(atomA, atomB)
10       mu = (Ra - Rb) * rec_Rab
11       P_tmp *= S(mu)
12       if (P_tmp < P_THRE)
13         break
14     P_sum += P_tmp
15     if (atomA == point.center_atom)
16       P_a = P_tmp
17   point.weight = point.spherical_weight * P_a / P_sum


blocks. The y-dimension specified the atomA index, and the x-dimension specified the

atomB index (following notation from figure 1). The second optimization scheme split

the outermost loop and assigned an independent thread to each weight (thread-per-

point).

In addition to greater parallelism, the block-per-point arrangement offered an

additional caching advantage. Rather than re-calculating each atom-to-point distance

(needed at lines 5 and 8), these values were cached in shared memory avoiding

roughly nine floating-point operations in the inner loop.

Unfortunately, this boost was more than offset by the cost of significant block

level cooperation. First of all, lines 11 and 14 required block-level reductions in the x

and y dimensions respectively. Secondly, because the number of members in each

atom list varies from as few as 2 to more than 80 atoms, any chosen block dimension

was necessarily suboptimal for many bins. Finally, the early exit condition at line 12

proved problematic. Since s(µ) falls between 0.0 and 1.0, each iteration of the atomB

loop can only reduce the value of P_tmp. Once it has been effectively reduced to zero,

we can safely exit the loop. As a result, calculating multiple “iterations” of the loop in

parallel required wasted terms to be calculated before the early exit was reached by all

threads of a warp.

On the other hand, the coarser parallelism of the thread-per-point did not

reduce its occupancy on the GPU. In the optimized kernel each thread required a

modest 29 registers and no shared memory. Thus, reasonable occupancies of 16 warps

per SM were obtained (using a Tesla C1060 GPU). Since a usual calculation requires


millions of quadrature points (and thus threads), there was also plenty of work

available to saturate the GPU's many execution cores.

The thread-per-point implementation was, however, highly sensitive to

branching within a warp. Because each warp is executed in SIMD fashion,

neighboring threads can drag each other through non-essential instructions. This can

happen in two ways, 1) a warp contains points from bins with varying neighbor list

lengths or 2) threads in a warp exit at different iterations of the inner loop. Since many

bins have only a few points and thus warps often span multiple bins, case one can

become an important factor. Fortunately, it is easily mitigated by presorting the bins

such that neighboring warps will have similar loop limits.

The second case is more problematic and limited initial versions of the kernel

to only 132 GFLOPS. This was only a third of the Tesla C1060’s single issue

potential. Removing the early exit and forcing the threads to complete all inner loop

iterations significantly improved the GPU’s floating-point throughput to 260

GFLOPS, 84% of the theoretical single-issue peak. However, the computation time

also increased, as the total amount of work more than tripled for our test geometry. To

minimize branching, the points within each bin were further sorted such that nearest

neighbors were executed in nearby threads. In this way, each thread in a warp behaved

as similarly as possible under the branch conditions in the code. This adjustment

ultimately provided a modest performance increase to 187 GFLOPS of sustained

performance, about 60% of the single issue peak. However, thread divergence remains

a significant barrier to greater efficiency.
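For concreteness, a simplified thread-per-point kernel corresponding to figure 1 is sketched below. Array names and the data layout are illustrative, the early-exit threshold is arbitrary, and Becke's original s(\mu) from equations (4.10)-(4.12) is used in place of the production switching function; the real kernel also relies on the presorting of bins and points described above.

#include <cuda_runtime.h>

// Becke's original switching function, equations (4.10)-(4.12); the production
// kernel would use the Stratmann form with its distance cutoff instead.
__device__ inline float switch_s(float mu)
{
    float p = mu;
    for (int k = 0; k < 3; ++k) p = 1.5f * p - 0.5f * p * p * p;   // g = p(p(p(mu)))
    return 0.5f * (1.0f - p);
}

// Thread-per-point weight kernel following the triple loop of figure 1.
// xyz / sph_weight hold point data; atom_xyz stores, for every point, the
// coordinates of the atoms in its presorted neighbor list; center[p] is the
// position of the point's parent atom inside that list.  All names and
// layouts are illustrative.
__global__ void becke_weights(const float3 *xyz, const float *sph_weight,
                              const int *center, const int *list_start,
                              const int *list_len, const float3 *atom_xyz,
                              float *weight, int npoints)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= npoints) return;

    float3 r  = xyz[p];
    int first = list_start[p], n = list_len[p];
    float P_a = 0.0f, P_sum = 0.0f;

    for (int ia = 0; ia < n; ++ia) {                       // atomA loop (line 04)
        float3 Ra = atom_xyz[first + ia];
        float da  = sqrtf((r.x-Ra.x)*(r.x-Ra.x) + (r.y-Ra.y)*(r.y-Ra.y) + (r.z-Ra.z)*(r.z-Ra.z));
        float P_tmp = 1.0f;
        for (int ib = 0; ib < n && P_tmp > 1e-10f; ++ib) { // atomB loop with early exit
            if (ib == ia) continue;
            float3 Rb = atom_xyz[first + ib];
            float db  = sqrtf((r.x-Rb.x)*(r.x-Rb.x) + (r.y-Rb.y)*(r.y-Rb.y) + (r.z-Rb.z)*(r.z-Rb.z));
            float Rab = sqrtf((Ra.x-Rb.x)*(Ra.x-Rb.x) + (Ra.y-Rb.y)*(Ra.y-Rb.y) + (Ra.z-Rb.z)*(Ra.z-Rb.z));
            P_tmp *= switch_s((da - db) / Rab);            // cell function, eq. (4.8)
        }
        P_sum += P_tmp;                                    // denominator of eq. (4.7)
        if (ia == center[p]) P_a = P_tmp;                  // numerator of eq. (4.7)
    }
    weight[p] = sph_weight[p] * P_a / P_sum;               // line 17 of figure 1
}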


After the bins and points have been sorted on the host, the data is moved to the

GPU. We copy each point's Cartesian coordinates, spherical weights (\omega_p in equation

(4.4)), central atom index, and bin index to GPU global memory arrays. The atomic

coordinates for each bin list are copied to the GPU in bin-major order. This order

allows threads across a bin barrier to better coalesce their reads, which is necessary

since threads working on different bins often share a warp. Once the data is arranged

on the GPU, the kernel is launched, calculates the Becke weight and combines it with

the spherical weight in place. Finally, the weights are copied back to the host and

stored for later use in the DFT calculation.

Precision   Number of Electrons   Difference from Full Double Precision   Kernel Execution Time (ms)
Single      1365.9964166131       3.5x10^-6                               306.85
Mixed       1365.9964131574       3.9x10^-8                               307.68
Double      1365.9964131181       N/A                                     3076.76

Table 1: Comparison of single, mixed, and full double precision calculation of total charge for Olestra, a molecule with 1366 electrons.

Because the GPU is up to an order of magnitude faster using single rather than

double precision arithmetic, our kernel was designed to minimize double precision

operations. Double precision is most important when accumulating small numbers into

larger ones or when taking differences of nearly equal numbers. With this in mind we

were able to improve our single precision results using only a handful of double

precision operations. Specifically, the accumulation at line 14 and the final arithmetic

of line 17 in figure 1 were carried out in double precision. This had essentially no

impact on performance, but improved correlation with the full double precision results

by more than an order of magnitude.
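The pattern can be illustrated with a small stand-alone example (not taken from the kernel itself): single-precision terms are generated as usual, but the running sum is kept in double precision, which is exactly the role played by lines 14 and 17 of figure 1.

#include <cstdio>

// Stand-alone illustration of the mixed-precision accumulation pattern:
// individual terms are produced in single precision, but the running sum
// is accumulated in double.
int main()
{
    const int n = 1 << 20;
    float  single_sum = 0.0f;    // all-float accumulation, loses low-order bits
    double mixed_sum  = 0.0;     // double accumulation of the same float terms

    for (int i = 1; i <= n; ++i) {
        float term = 1.0f / (float)i;   // stand-in for a single-precision P_tmp
        single_sum += term;
        mixed_sum  += (double)term;     // the only double-precision operation
    }
    std::printf("single: %.10f   mixed: %.10f\n", single_sum, mixed_sum);
    return 0;
}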


The mixed precision GPU code can be compared to a CPU implementation of

the same algorithm. To level the playing field, the CPU code uses the same mixed

precision scheme developed for the GPU. However, the CPU implementation was not

parallelized to multiple cores. The reference timings were taken using an Intel Xeon

X5570 at 2.93 GHz and an nVidia Tesla C1060 GPU. Quadrature grids were

generated for a representative set of test geometries shown in figure 2. These ranged in

size from about 100 to nearly 900 atoms.

Figure 2: Test Molecules

Molecule       Atoms (Avg. List)   Points     GPU Kernel Time (ms)   CPU Kernel Time (ms)   Speedup
Taxol          110 (19)            607828     89.1                   12145.3                136X
Valinomycin    168 (25)            925137     212.0                  26132.2                123X
Olestra        453 (18)            2553521    262.9                  30373.6                116X
BPTI           875 (39)            4785813    2330.3                 364950.8               157X

Table 2: Performance comparison between CPU and GPU Becke weight calculation. Sorting and GPU data transfers are not included. Along with the total number of atoms in each molecule, the mean length of the neighbor list for that molecule is reported in parentheses. Times are reported in milliseconds.

Table 2 compares the CPU and GPU treatments of the Becke weight

evaluation (code corresponding to figure 1). The observed scaling is not strictly linear

with system size. This is because the test molecules exhibit varying atomic densities.

When the atoms are more densely packed, more neighbors appear in the inner loops.


We can use the average size of the neighbor lists for each system to crudely account

for the variation in atomic density. Doing so produces the expected linear result.

Figure 3: Linear scaling of CPU and GPU kernels. Fits are constrained to pass through (0,0). The effective atom count accounts for the varying density of atoms in each test system. The slope of each linear fit gives performance in ms per effective atom.

Although a speedup of up to 150X over the CPU is impressive, it ignores the

larger picture of grid generation as a whole. In any real world application the entire

procedure from spherical grid lookup to bin sorting and memory transfers must be

taken into account. A timings breakdown is shown in table 3 for the BPTI test

geometry.

Category       GPU Timing (ms)   CPU Timing (ms)
Atom List      6179              6040
Sorting        286               N/A
GPU Overhead   316               N/A
Becke Kernel   2330              364951
Total Time     9720              371590

Table 3: Timing breakdown between CPU and GPU calculations for BPTI. Sorting represents the time to sort the bins and points to minimize branching in the GPU kernel. GPU Overhead includes packing and transferring point and atom list data for the GPU.

The massive kernel improvement translates into a more modest 38X global speedup. GPU acceleration has essentially eliminated the principal bottleneck in the

CPU code. The Becke weight kernel accounts for up to 98% of the total CPU runtime

but on the GPU is overshadowed by the previously insignificant atom list step. This is

a practical example of Amdahl’s law that the total speedup is limited by the proportion

of code that is left un-parallelized. The GPU implementation requires slightly more


time to generate the atom lists due to the additional sorting by length required to

minimize warp divergence on the GPU.

The final weights for many individual quadrature points are negligibly small.

These are removed from the final grid, and the Cartesian coordinates and weights of

all remaining points are stored for use in later integration.

w_{ap}\,\omega_p \rightarrow \omega_i, \qquad \vec{r}_{ap} \rightarrow \vec{r}_i   (4.15)
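A host-side sketch of this pruning step is shown below; the structure layout and the cutoff value are illustrative assumptions, not taken from the production code.

#include <vector>

// Illustrative point record: Cartesian coordinates plus the combined weight
// w_ap * omega_p already evaluated by the Becke weight kernel.
struct GridPoint { float x, y, z, w; };

// Drop points whose combined weight is negligible and relabel the survivors
// with a single running index i, as in equation (4.15).
std::vector<GridPoint> prune_grid(const std::vector<GridPoint> &pts,
                                  float cutoff = 1e-14f)
{
    std::vector<GridPoint> kept;
    kept.reserve(pts.size());
    for (const GridPoint &p : pts)
        if (p.w > cutoff) kept.push_back(p);   // survivors keep coordinates and weight
    return kept;
}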

EXCHANGE-CORRELATION POTENTIAL EVALUATION

With the numerical quadrature grid in hand, we now consider the evaluation of

the exchange-correlation potential, Vxc in equation (4.2). Following Yasuda’s GPU

implementation,2 the calculation of Vxc is broken down into three parts. First, the

electronic density and its gradient are evaluated at each grid point. Second, the DFT

functional is evaluated at each grid point. And third, the functional values are summed

to produce the final matrix elements.

As with the ERI kernels considered previously, the calculation is segmented by

the angular momenta of the basis functions. For GGA functionals, however, the

exchange-correlation potential involves only one-electron terms, so that kernels deal

with pairs rather than quartets of basis shells. Thus, for a basis limited to s- , p-, and d-

functions, only six kernel classes are needed, ss, sp, sd, pp, pd, and dd. Threads

assigned to higher angular momenta shells again compute terms for all primitive

functions within the shell pair.


The alpha-spin electronic density and density gradient at each grid point, \vec{r}_i, are calculated from the one-particle density matrix P^\alpha as follows (with analogous

expressions for the beta spin case).

\rho_\alpha(\vec{r}_i) = \sum_{\mu\nu} P^\alpha_{\mu\nu}\, \chi_\mu(\vec{r}_i)\, \chi_\nu(\vec{r}_i)   (4.16)

\nabla\rho_\alpha(\vec{r}_i) = 2 \sum_{\mu\nu} P^\alpha_{\mu\nu}\, \chi_\mu(\vec{r}_i)\, \nabla\chi_\nu(\vec{r}_i)   (4.17)

To compute these quantities, each CUDA thread is assigned a unique grid point and

loops over significant primitive shell pairs. As a result of the exponential decay of the

basis functions, \chi_\mu, the sum for each grid point need only include a few significant AOs. Significant AO primitive pairs are spatially binned by the center of charge, \vec{P}_{ij} in

equation (1.25), and sorted by increasing exponent. Each thread loops over all bins,

but exits each bin as soon as the first negligible point-pair interaction is reached. The

quadrature points are similarly binned to minimize warp divergence by ensuring that

neighboring threads share similar sets of significant basis functions.
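The following heavily simplified CUDA kernel illustrates the thread-per-grid-point evaluation of equations (4.16) and (4.17) for s-type primitive pairs only; the pair structure, array names, and the absence of any binning or early exits are simplifications relative to the implementation described above.

#include <cuda_runtime.h>

// One contracted s|s primitive pair in Gaussian-product form (cf. eq. 1.25):
// prefactor K, combined exponent alpha, density matrix element P, and the
// center of charge (x, y, z).  The layout is illustrative.
struct PairSS { float K, alpha, P, x, y, z; };

// Each thread owns one quadrature point and accumulates eqs. (4.16)/(4.17)
// over all pairs; binning, sorting, and early exits are omitted here.
__global__ void density_ss(const float3 *grid, int npts,
                           const PairSS *pairs, int npairs,
                           float *rho, float3 *grad)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= npts) return;

    float3 r = grid[i];
    float  d = 0.0f;
    float3 g = {0.0f, 0.0f, 0.0f};

    for (int p = 0; p < npairs; ++p) {
        float dx = r.x - pairs[p].x, dy = r.y - pairs[p].y, dz = r.z - pairs[p].z;
        float r2 = dx*dx + dy*dy + dz*dz;
        float t  = pairs[p].P * pairs[p].K * __expf(-pairs[p].alpha * r2);
        d += t;                                   // density, eq. (4.16)
        float s = -2.0f * pairs[p].alpha * t;     // gradient of the s|s pair product
        g.x += s * dx;  g.y += s * dy;  g.z += s * dz;   // eq. (4.17)
    }
    rho[i]  = d;
    grad[i] = g;
}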

Next, the density at each quadrature point is copied to host memory and used

to evaluate the xc-kernel function and its various derivatives on the grid. Specifically,

the following values are computed for each grid point.

a_i = \omega_i\, f_{xc}\big(\rho_\alpha(\vec{r}_i),\rho_\beta(\vec{r}_i),\gamma_{\alpha\alpha}(\vec{r}_i),\gamma_{\alpha\beta}(\vec{r}_i),\gamma_{\beta\beta}(\vec{r}_i)\big)

b_i = \omega_i\, \frac{\partial f_{xc}(\vec{r}_i)}{\partial\rho_\alpha}

c_i = \omega_i \left[ 2\,\frac{\partial f_{xc}(\vec{r}_i)}{\partial\gamma_{\alpha\alpha}}\nabla\rho_\alpha + \frac{\partial f_{xc}(\vec{r}_i)}{\partial\gamma_{\alpha\beta}}\nabla\rho_\beta \right]   (4.18)


This step has a small computational cost and can be performed on the host without

degrading performance. This is desirable as a programming convenience since

implementing various density functionals can be performed without editing CUDA

kernels, and because the host provides robust and efficient support for various

transcendental functions needed by some functionals. During this step, the functional

values, ai , are also summed to evaluate the total exchange-correlation energy per

equation (4.1).
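As an illustration of this host-side step, the loop below evaluates a_i, b_i, and the running E_XC sum of equation (4.18) using only the spin-unpolarized Slater exchange term, for which the gradient coefficients c_i vanish; a real GGA functional would add the gamma-dependent derivatives. The function name and argument layout are illustrative.

#include <cstddef>
#include <cmath>
#include <vector>

// Host-side evaluation of the grid quantities of equation (4.18), sketched
// with the spin-unpolarized Slater exchange kernel only.
void eval_xc_on_grid(const std::vector<double> &rho,   // density at each point
                     const std::vector<double> &w,     // quadrature weight omega_i
                     std::vector<double> &a,           // a_i of eq. (4.18)
                     std::vector<double> &b,           // b_i of eq. (4.18)
                     double &Exc)                      // running sum, eq. (4.1)
{
    const double pi = 3.14159265358979323846;
    const double Cx = 0.75 * std::cbrt(3.0 / pi);      // Slater exchange constant
    Exc = 0.0;
    for (std::size_t i = 0; i < rho.size(); ++i) {
        double r13 = std::cbrt(rho[i]);
        a[i] = w[i] * (-Cx * rho[i] * r13);            // omega_i * f_xc, with f_xc = -Cx rho^(4/3)
        b[i] = w[i] * (-4.0 / 3.0) * Cx * r13;         // omega_i * d f_xc / d rho
        Exc += a[i];                                   // quadrature for E_XC
    }
}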

The final step is to construct AO matrix elements for the exchange-correlation

potential. This task is again performed on the GPU. For all non-negligible primitive

pairs, \chi_K\chi_L, the pair quantities K_{KL}, \vec{P}_{KL}, and the pair exponent from equation (1.25) are packed into arrays and transferred to device memory. The grid coordinates, \vec{r}_i, are spatially binned and transferred to the GPU with corresponding values b_i and c_i. A CUDA

kernel then performs the summation,

\Delta V_{\mu\nu} = \sum_i \chi_\mu(\vec{r}_i)\left[ \frac{b_i}{2}\chi_\nu(\vec{r}_i) + c_i \cdot \nabla\chi_\nu(\vec{r}_i) \right]
= \sum_{k\in\mu} \sum_{l\in\nu} c_k c_l \sum_i \chi_k(\vec{r}_i)\left[ \frac{b_i}{2}\chi_l(\vec{r}_i) + c_i \cdot \nabla\chi_l(\vec{r}_i) \right]   (4.19)

and the final matrix elements are computed as follows.

V_{\mu\nu}^{XC} = \Delta V_{\mu\nu} + \Delta V_{\nu\mu}   (4.20)

This calculation can be envisioned as a 2D matrix in which each row

corresponds to a primitive shell pair, \chi_K\chi_L, and each column is associated with a

single grid point. The CUDA grid is configured as a 1-D grid of 2-D blocks, similar to

the GPU J-Engine described in chapter 2. A stack of blocks spanning the primitive


pairs sweeps across the matrix. The points are spatially binned so that entire bins can

be skipped whenever a negligible point-primitive pair combination is encountered. As

in the Coulomb algorithm, each row of threads across a CUDA block cooperates to

produce a primitive matrix element, \Delta V_{kl}. These are copied to the host where the final

summation into AO matrix elements takes place.
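A bare-bones version of this summation, restricted to s-type primitives and with all screening and binning stripped out, might look as follows; the structure layout and the one-thread-per-pair mapping are illustrative (the production kernel uses 2D blocks with a per-row reduction, as described above).

#include <cuda_runtime.h>

// One primitive s-type basis function: contraction coefficient, exponent, center.
struct PrimS { float c, alpha, x, y, z; };

// Simplified form of equation (4.19) for s functions, where
// grad chi_l = -2 * alpha_l * (r - R_l) * chi_l.  Accumulation is in double.
__global__ void xc_matrix_ss(const PrimS *prims, int nprim,
                             const float3 *grid, const float *b, const float3 *c,
                             int npts, double *dV)
{
    int kl = blockIdx.x * blockDim.x + threadIdx.x;
    if (kl >= nprim * nprim) return;
    PrimS pk = prims[kl / nprim];
    PrimS pl = prims[kl % nprim];

    double acc = 0.0;
    for (int i = 0; i < npts; ++i) {
        float3 r = grid[i];
        float dxk = r.x-pk.x, dyk = r.y-pk.y, dzk = r.z-pk.z;
        float dxl = r.x-pl.x, dyl = r.y-pl.y, dzl = r.z-pl.z;
        float chik = pk.c * __expf(-pk.alpha*(dxk*dxk + dyk*dyk + dzk*dzk));
        float chil = pl.c * __expf(-pl.alpha*(dxl*dxl + dyl*dyl + dzl*dzl));
        float cdot = c[i].x*dxl + c[i].y*dyl + c[i].z*dzl;              // c_i . (r - R_l)
        acc += (double)(chik * (0.5f*b[i]*chil - 2.0f*pl.alpha*cdot*chil));
    }
    dV[kl] = acc;    // primitive Delta V_kl, contracted into AO elements on the host
}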

SCF PERFORMANCE EVALUATION

Having completed our discussion of GPU acceleration of SCF calculations we

illustrate the application of our work to practical calculations. All GPU calculations

described below were carried out using the TeraChem quantum chemistry package. To

elucidate the computational scaling of our GPU implementation with increasing

system size, we consider two types of systems. First, linear alkene chains ranging in

length between 25 and 707 carbon atoms provide a benchmark at the low-dimensional

limit. Second, cubic water clusters containing between 10 and 988 molecules provide

a dense, three-dimensional test system that is more representative of interesting

condensed phase systems. Examples of each system type are shown in figure 4.

Figure 4: One-dimensional alkene and three dimensional water-cube test systems. Alkene lengths vary from 24 to 706 carbon atoms and water cubes range from 10 to nearly 850 water molecules. A uniform density is used for all water boxes.


For each system, a restricted B3LYP calculation was performed on a single

Tesla M2090 GPU using the 6-31G basis set. Figure 5 shows the timing breakdown

during the first SCF iteration for the J-Engine, K-Engine, linear algebra (LA), and

DFT exchange-correlation potential. For the water clusters, the standard and distance-

masked K-Engines were tested in separate runs and are both provided for comparison.

A conservative screening distance of 8Å was chosen for the distance-masked K-

Engine. Exponents resulting from a power fit of each series are provided in the legend.

Figure 5: First SCF iteration timings in seconds for a) linear alkenes and b) cubic water clusters. Total times are further broken down into J-Engine, K-Engine, distance-masked K-Engine, linear algebra (LA) and DFT exchange-correlation contributions. For water clusters, total SCF times are shown for both the naïve and distance-masked (mask) K-Engine. All calculations were performed using a single Tesla M2090 GPU and the 6-31G basis set. Power fits show scaling with increasing system size, and the exponent for each fit is provided in the legend.

The linear algebra time is dominated by the diagonalization of the Fock matrix,

which scales visibly worse than the Fock formation routines. For large systems,

diagonalization will clearly become the principal bottleneck. This is particularly true

for low-dimensional systems, where the Fock formation is particularly efficient due to

the sparsity of the density matrix. However, for the more general, three-dimensional

clusters, linear algebra remains a small contribution to the total time past 10,000 basis


functions. If the more efficient masked K-Engine can be employed, the crossover

point, at which linear algebra becomes dominant, shifts to about 8,000 basis functions.

Although the time per basis function is lower for the alkene test series, the

overall scaling is similar for both test series. This behavior is good evidence that we

have penetrated the asymptotic regime of large systems, where the dimensionality of

the physical system should impact the prefactor rather than the exponent. That we

easily reach this crossover point on GPUs is itself noteworthy. Even more noteworthy

is the fact that the K-Engine exhibits sub-quadratic scaling without imposing any of

the assumptions or book-keeping mechanisms such as neighbor lists common among

linear scaling exchange algorithms.8-11 The further imposition of a simple distance

mask on exchange elements provides an effective linear scaling approach for

insulating systems. The scaling of the SCF with the masked K-Engine further

exemplifies the impact of the linear algebra computation on the overall efficiency. The

N^3 scaling of LA becomes significantly more dominant in large systems. The validity

of the distance-based exchange screening approximation was confirmed by comparing

the final converged absolute SCF energies of each water cluster calculated with and

without screening. The absolute total electronic energy is accurate to better than 0.02

kcal/mol in all systems considered, which is well within chemical accuracy.

Our GPU J-Engine scales quadratically with system size, which comports well

with our scaling analysis in chapter 2. Further acceleration would require modification

of the long-ranged Coulomb potential, for example by factorization into a multipole

expansion.12 Extrapolating beyond figure 5, this may become necessary for water

clusters beyond 12,000 basis functions, when the J-Engine finally becomes the


dominant component in Fock formation. As for DFT, the GPU exchange-correlation

methods exhibit linear scaling, matching the scaling efficiency that has been achieved

in CPU codes, but, as demonstrated below, with much greater performance.

Parallelizing Fock construction over multiple GPUs is accomplished by

splitting the CUDA grids into equal chunks in the y-dimension. For small systems, this

strategy is inefficient because it doubles the latency involved in transferring data

between the host and device and in launching GPU kernels. For large systems the

kernel execution times grow and such latencies become negligible compared to the

benefits of ideal load balancing provided by this splitting strategy, as shown in figure

6.
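The splitting itself amounts to a short host-side loop of the following form; the kernel stub and helper names are illustrative, and the real code passes the full packed pair-quantity arrays rather than an empty argument list.

#include <algorithm>
#include <cuda_runtime.h>

// Kernel stub standing in for a J-, K-, or exchange-correlation Fock kernel;
// y_offset shifts blockIdx.y so that each device covers its own slice of the
// full CUDA grid.
__global__ void fock_tile(int y_offset)
{
    int row = y_offset + blockIdx.y;   // effective block row in the full grid
    (void)row;                         // work on this block row (omitted)
}

// Split a kernel's y-dimension evenly across all visible GPUs.  Launches are
// asynchronous, so the devices execute their slices concurrently.
void launch_on_all_gpus(int total_y_blocks, int x_blocks, dim3 block)
{
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu < 1) return;
    int chunk = (total_y_blocks + ngpu - 1) / ngpu;
    for (int dev = 0; dev < ngpu; ++dev) {
        int y0 = dev * chunk;
        int ny = std::min(chunk, total_y_blocks - y0);
        if (ny <= 0) break;
        cudaSetDevice(dev);
        fock_tile<<<dim3(x_blocks, ny), block>>>(y0);
    }
    for (int dev = 0; dev < ngpu; ++dev) {   // wait for every slice to finish
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }
}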

Figure 6: Multi-GPU parallel efficiency for J-Engine, K-Engine, and exchange-correlation Fock formation based on 1st iteration time for water clusters, run on 2 M2090 GPUs.

The scaling efficiency of a code provides a useful check on the quality of the

algorithm and its implementation. However, the absolute performance is of much

greater importance for any practical applications. To assess the GPU's overall usefulness for quantum chemistry, we will again use our water box test cases treated at the B3LYP/6-31G level of theory. Rather than a single GPU, we will now employ

four cards, across two GPU architectures: the Tesla M2090 and newer GTX Titan.


Notably, a 1533 atom single-point energy calculation on a cubic water cluster required

only 2.47 hours for the entire SCF procedure, and ab initio dynamics, requiring

thousands of single point evaluations are feasible for systems containing thousands of

basis functions.

As a point of comparison, the same calculations were carried out using the

CPU implementation available in the GAMESS program.13 Parameters such as

integral neglect thresholds and SCF convergence criteria were matched as closely as

possible to the previous GPU calculations. The CPU calculation was also parallelized

over all 8 CPU cores available in our dual Xeon X5680 3.33GHz server. Figure 7

shows the speedup of the GPU implementation relative to GAMESS for the first SCF

iteration including both Fock construction and diagonalization, the latter of which is

performed in both codes on the CPU. Performance is similar between the two GPU

models on small structures for both alkenes and water clusters. However, GTX Titan

exhibits a larger speedup relative to GAMESS above 100 orbitals and reaches 400x at

the largest systems examined. During Fock construction, the CPU cores are left idle in

our implementation. It is also possible to reserve some work for the CPU,14 but given

the performance advantage of our GPU implementation, this would offer a rather

small performance improvement.


Figure 7: Total SCF time of TeraChem on 8 CPUs and 4 GPUs, relative to GAMESS on 8 CPUs for water clusters.

REFERENCES

(1) Pople, J. A.; Gill, P. M. W.; Johnson, B. G. Chem. Phys. Lett. 1992, 199, 557.

(2) Yasuda, K. J. Chem. Theo. Comp. 2008, 4, 1230.

(3) Murray, C. W.; Handy, N. C.; Laming, G. J. Mol. Phys. 1993, 78, 997.

(4) Lebedev, V. I.; Laikov, D. N. Dokl Akad Nauk+ 1999, 366, 741.

(5) Becke, A. D. J. Chem. Phys. 1988, 88, 2547.

(6) Johnson, B. G.; Gill, P. M. W.; Pople, J. A. J. Chem. Phys. 1993, 98, 5612.

(7) Stratmann, R. E.; Scuseria, G. E.; Frisch, M. J. Chem. Phys. Lett. 1996, 257, 213.

(8) Burant, J. C.; Scuseria, G. E.; Frisch, M. J. J. Chem. Phys. 1996, 105, 8969.

(9) Ochsenfeld, C.; White, C. A.; Head-Gordon, M. J. Chem. Phys. 1998, 109, 1663.

(10) Schwegler, E.; Challacombe, M. J. Chem. Phys. 1999, 111, 6223.

(11) Neese, F.; Wennmohs, F.; Hansen, A.; Becker, U. Chem. Phys. 2009, 356, 98.

(12) White, C. A.; Head-Gordon, M. J. Chem. Phys. 1994, 101, 6593.

(13) Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S. J.; Windus, T. L.; Dupuis, M.; Montgomery, J. A. J. Comp. Chem. 1993, 14, 1347.


(14) Asadchev, A.; Gordon, M. S. J. Chem. Theo. Comp. 2012, 8, 4166.


CHAPTER FIVE

EXCITED STATE ELECTRONIC STRUCTURE

ON GPUS: CIS AND TDDFTa

In the previous chapters we described GPU algorithms to accelerate the

construction of the Fock matrix for use in SCF calculations. We now demonstrate how

those efficient Fock methods can be applied more widely to dramatically increase the

performance of post-SCF calculations such as single excitation configuration

interaction (CIS),1 time-dependent Hartree-Fock (TDHF) and linear response time-

dependent density functional theory (TDDFT).2-6 These single-reference methods are

widely used for ab initio calculations of electronic excited states of large molecules

(more than 50 atoms, thousands of basis functions) because they are computationally

efficient and straightforward to apply.7-9 Although highly correlated and/or multi-

reference methods such as multireference configuration interaction (MRCI10),

multireference perturbation theory (MRMP11 and CASPT212), and equation-of-motion

coupled cluster methods (SAC-CI13 and EOM-CC14,15) allow for more reliably

accurate treatment of excited states, including those with double excitation character,

these are far too computationally demanding for large molecules.

CIS/TDHF is essentially the excited state corollary of the ground state Hartree-

Fock method, and thus similarly suffers from a lack of electron correlation. Because

a Adapted from C.M. Isborn, N. Luehr, I.S. Ufimtsev, and T.J. Martinez, J. Chem. Theo. Comput. 2011, 7, 1814-1823. This is an unofficial adaptation of an article that appeared in an ACS publication. ACS has not endorsed the content of this adaptation or the context of its use.


of this, CIS/TDHF excitation energies are consistently overestimated, often by ~1eV.8

The TDDFT method includes dynamic correlation through the exchange correlation

functional, but standard non-hybrid TDDFT exchange-correlation functionals tend to

underestimate excitation energies, particularly for Rydberg and charge transfer states.5

The problem in charge-transfer excitation energies is due to the lack of the correct 1/r

Coulombic attraction between the separated charges of the excited electron and hole.16

Charge transfer excitation energies are generally improved with hybrid functionals and

also with range separated functionals that separate the exchange portion of the DFT

functional into long and short range contributions.17-21 Neither CIS nor TDDFT (with

present-day functionals) properly includes the effects of dispersion, but promising

results have been obtained with an empirical correction to standard DFT

functionals,22,23 and there are continued efforts to include dispersion directly in the

exchange-correlation functional.24,25 Both the CIS and TDDFTa single reference

methods lack double excitations and are unable to model conical intersections or

excitations in molecules that have multi-reference character.26,27 In spite of these

limitations, the CIS and TDDFT methods can be generally expected to reproduce

trends for one-electron valence excitations, which are a majority of the transitions of

photochemical interest. TDDFT using hybrid density functionals, which include some

percentage of Hartree-Fock (HF) exact exchange, have been particularly successful in

modeling the optical absorption of large molecules. Furthermore, the development of

new DFT functionals and methods is an avid area of research, with many new

a By TDDFT we here refer to the adiabatic linear response formalism with presently available functionals.


functionals introduced each year. Thus it is likely that the quality of the results

available from TDDFT will continue to improve. A summary of the accuracy currently

available for vertical excitation energies is available in a recent study by Jacquemin et

al. which compares TDDFT results using 29 functionals for ~500 molecules.28

Although CIS and TDDFT are the most tractable methods for treating excited

states of large molecules, their computational cost still prevents application to many

systems of photochemical interest. Thus, there is considerable interest in extending the

capabilities of CIS/TDDFT to even larger molecules, beyond hundreds of atoms.

Quantum mechanics/molecular mechanics (QM/MM) schemes provide a way

to model the environment of a photophysically interesting molecule by treating the

molecule with QM and the surrounding environment with MM force fields.29-33

However, it is difficult to know when the MM approximations break down and when a

fully QM approach is necessary. With fast, large scale CIS/TDDFT calculations, all

residues of a photoactive protein could be treated quantum mechanically to explore the

origin of spectral tuning, for example. Explicit effects of solvent-chromophore

interactions, including hydrogen bonding, charge transfer, and polarization, could be

fully included at the ab initio level in order to model solvatochromic shifts.

GPUs provide a promising route to large-scale CIS and TDDFT calculations.

GPUs have been applied to achieve speed-ups of 1-2 orders of magnitude in ground

state electronic structure,34-36 ab initio molecular dynamics37 and empirical force field-

based molecular dynamics calculations.38-41 In this chapter we extend the

implementation of GPU quantum chemistry beyond the SCF methods36,42 described in

previous chapters to the calculation of excited electronic states. The performance


provided by GPU hardware for evaluating ERIs allows full QM treatment of the

excited states of very large systems – both large chromophores and chromophores in

which the environment plays a critical role and should be treated with QM. In this

chapter we present the results of implementing CIS and TDDFT within the Tamm-

Dancoff approximation using GPUs to drastically accelerate the bottlenecks of two-

electron integral evaluation, density functional quadrature, and matrix multiplication.

This results in CIS calculations over 200x faster than those achieved running on a

comparable CPU platform. Benchmark CIS/TDDFT timings are presented for a

variety of systems.

CIS/TDDFT IMPLEMENTATION USING GPUS

The linear response formalism of TDHF and TDDFT has been thoroughly

presented in review articles.4,7,8,43 Only the equations relevant for this work are

presented here. The TDHF/TDDFT working equation is

\begin{pmatrix} A & B \\ B & A \end{pmatrix} \begin{pmatrix} X \\ Y \end{pmatrix} = \omega \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} X \\ Y \end{pmatrix} ,   (5.1)

where for TDHF (neglecting spin indices for simplicity)

A_{ai,bj} = \delta_{ij}\delta_{ab}(\varepsilon_a - \varepsilon_i) + (ia|jb) - (ij|ab) ,   (5.2)

B_{ai,bj} = (ia|bj) - (ib|aj) ,   (5.3)

and for TDDFT

A_{ai,bj} = \delta_{ij}\delta_{ab}(\varepsilon_a - \varepsilon_i) + (ia|jb) + (ij|f_{xc}|ab) ,   (5.4)

B_{ai,bj} = (ia|bj) + (ib|f_{xc}|aj) .   (5.5)

The ERIs are defined as usual.


(ia|jb) = \int\!\!\int \frac{\phi_i(\vec{r}_1)\,\phi_a(\vec{r}_1)\,\phi_j(\vec{r}_2)\,\phi_b(\vec{r}_2)}{|\vec{r}_1 - \vec{r}_2|}\, d\vec{r}_1\, d\vec{r}_2   (5.6)

Within the adiabatic approximation of DFT, the exact time-dependent exchange-

correlation potential is approximated using the time-independent ground state

exchange-correlation functional E_{xc}[\rho] as follows.

(ia|f_{xc}|jb) = \int\!\!\int \phi_i(\vec{r}_1)\,\phi_a(\vec{r}_1)\, \frac{\delta^2 E_{xc}}{\delta\rho(\vec{r}_1)\,\delta\rho(\vec{r}_2)}\, \phi_j(\vec{r}_2)\,\phi_b(\vec{r}_2)\, d\vec{r}_1\, d\vec{r}_2   (5.7)

The i,j and a,b indices represent occupied and virtual MOs, respectively, in the

Hartree-Fock/Kohn-Sham (KS) ground state determinant.

Setting the B matrix to zero within TDHF results in the CIS equation, while in

TDDFT this same neglect yields the equation known as the Tamm-Dancoff

approximation (TDA):

AX = \omega X .   (5.8)

In part because DFT virtual orbitals provide a better starting approximation to the

excited state than HF orbitals, the neglect of the B matrix within TDA is known to

accurately reproduce full TDDFT results for non-hybrid DFT functionals.7,8,44

Furthermore, previous work has shown that a large contribution from the B matrix in

TDDFT (and to a lesser extent also in TDHF) is often indicative of a poor description

of the ground state, either due to singlet-triplet instabilities or multi-reference

character.45 Thus, if there is substantial deviation between the full TDDFT and TDA-

TDDFT excitation energies, the TDA results will often be more accurate.

A standard iterative Davidson algorithm46 has been implemented to solve the

CIS/TDA-TDDFT equations. As each AX matrix-vector product is formed, the


required two-electron integrals are calculated over primitive basis functions within the

atomic orbital (AO) basis directly on the GPU. Within CIS, the AX matrix-vector

product is calculated as follows.

\big(A^{CIS}X\big)_{bj} = \sum_{ia} \left[ \delta_{ij}\delta_{ab}(\varepsilon_a - \varepsilon_i) + (ia|jb) - (ij|ab) \right] X_{ia}   (5.9)

\sum_{ia} \left[ (ia|jb) - (ij|ab) \right] X_{ia} = \sum_{\mu\nu} C_{\mu j} C_{\nu b} F_{\mu\nu}   (5.10)

F_{\mu\nu} = \sum_{\lambda\sigma} T_{\lambda\sigma} \left\{ (\mu\nu|\lambda\sigma) - (\mu\lambda|\nu\sigma) \right\}   (5.11)

T_{\lambda\sigma} = \sum_{ia} X_{ia} C_{\lambda i} C_{\sigma a}   (5.12)

Here C_{\lambda i} are the SCF MO coefficients of the HF/KS determinant, and T_{\lambda\sigma} is a non-

symmetric transition density matrix. For very small systems there is no performance

advantage with GPU computation of the matrix multiplication steps in equations

(5.10) and (5.12), in which case we compute these quantities on the CPU. For larger

systems, these matrix multiplications are performed on the GPU using dgemm

calls to the CUBLAS library, a CUDA BLAS implementation.47
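A minimal sketch of these two transformation steps using cuBLAS is given below, assuming column-major device arrays; function and buffer names are illustrative, and the AO-basis build of equation (5.11) between the two transformations is carried out by the J- and K-engines as described next. The scratch buffer tmp must hold at least max(nbf*nv, no*nbf) doubles.

#include <cublas_v2.h>

// Form the AO transition density T = C_occ * X * C_virt^T of eq. (5.12).
// nbf/no/nv are basis, occupied, and virtual dimensions.
void form_transition_density(cublasHandle_t h, int nbf, int no, int nv,
                             const double *Cocc, const double *Cvirt,
                             const double *X, double *tmp, double *T)
{
    const double one = 1.0, zero = 0.0;
    // tmp(nbf x nv) = Cocc(nbf x no) * X(no x nv)
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, nbf, nv, no,
                &one, Cocc, nbf, X, no, &zero, tmp, nbf);
    // T(nbf x nbf) = tmp * Cvirt^T
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_T, nbf, nbf, nv,
                &one, tmp, nbf, Cvirt, nbf, &zero, T, nbf);
}

// After the J- and K-engines have assembled F (eq. 5.11) from T, transform it
// back to the MO basis, R = C_occ^T * F * C_virt, the two-electron part of
// eq. (5.10).
void backtransform_F(cublasHandle_t h, int nbf, int no, int nv,
                     const double *Cocc, const double *Cvirt,
                     const double *F, double *tmp, double *R)
{
    const double one = 1.0, zero = 0.0;
    // tmp(no x nbf) = Cocc^T * F
    cublasDgemm(h, CUBLAS_OP_T, CUBLAS_OP_N, no, nbf, nbf,
                &one, Cocc, nbf, F, nbf, &zero, tmp, no);
    // R(no x nv) = tmp * Cvirt
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, no, nv, nbf,
                &one, tmp, no, Cvirt, nbf, &zero, R, no);
}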

Equation (5.11) represents the dominant computational bottleneck of this

integral direct CIS algorithm. Fortunately, this equation is equivalent to the Coulomb

and exchange matrix builds described in chapters 2 – 4 except that the non-symmetric

transition density, T, replaces the symmetric one-particle density matrix, P. Thus, the

portion of the F matrix from the product of T_{\lambda\sigma} with the first integral in equation (5.11) is computed with the GPU J-engine algorithm. The portion of the F matrix from the product of T_{\lambda\sigma} with the second integral in equation (5.11) is computed with the K-engine algorithm. The Coulomb matrix remains symmetric even with a non-symmetric


transition density matrix. Thus, it is possible to continue to eliminate redundant \mu\nu \leftrightarrow \nu\mu bra and \lambda\sigma \leftrightarrow \sigma\lambda ket pairs as in the SCF routine as long as the sum of transpose density elements is used as follows.

J_{\mu\nu} = J_{\nu\mu} = \sum_{\lambda \geq \sigma} (\mu\nu|\lambda\sigma)\left( T_{\lambda\sigma} + (1 - \delta_{\lambda\sigma})\, T_{\sigma\lambda} \right)   (5.13)

The excited state exchange matrix is not symmetric. We must thus calculate both the

upper and lower triangle contributions with separate calls to the ground state GPU K-

engine routines. Since the SCF K-engine already ignores \mu\nu \leftrightarrow \nu\mu bra and \lambda\sigma \leftrightarrow \sigma\lambda ket symmetries, the result is that, ignoring screening, all O(N^4) ERIs are computed for the excited state K contribution, while the J term requires the same O(N^4/4) ERIs

computed in the ground state.

Compared to the J-engine, the K-engine suffers from a second disadvantage for

both ground and excited state calculations. Unlike the J-matrix GPU implementation,

the K-matrix algorithm cannot map the density matrix elements onto the ket integral

data, since the density index now spans both bra and ket indices. This leads to two

negative consequences. First, each thread must load an independent density matrix

element which results in slower non-coalesced GPU memory accesses. Second, the

sparsity of the density cannot be included in the pre-sorting of ket pairs. Thus, the

integral bounds are not guaranteed to strictly decrease and a mix of significant and

insignificant ERIs must be computed. As a result of these drawbacks, the exchange

algorithm is less efficient on the GPU and actually takes longer to calculate than its

Coulomb counterpart even though density screening should in principle leave many

fewer significant ERI contributions in the case of exchange. In practice a K/J timing


ratio for ground state SCF calculations is observed to be between 3 and 5 for systems

studied below.

Evaluation of the derivative of the exchange-correlation potential needed for

TDDFT excited states7 is performed using the same numerical quadrature as the

ground state exchange-correlation potential. We again use a Becke type quadrature

scheme48 with Lebedev angular49 and Euler-Maclaurin radial50 quadrature grids. The

same basic approach described in the previous chapter is again employed to compute

the exchange-correlation matrix elements of equation (5.7).35,51 The expensive steps

are evaluating the electron density/gradient at the grid quadrature points to

numerically evaluate the necessary functionals and summing the values on the grid to

assemble the matrix elements of the exchange-correlation potential. For the excited

state calculations, we generate the second functional derivative of the exchange

correlation functional only once, saving its value at each quadrature point in memory.

Then, for each Davidson iteration, the appropriate integrals are evaluated, paired with

the saved functional derivative values and summed into matrix elements using GPU

kernels analogous to the ground state case detailed in the previous chapter.

RESULTS AND DISCUSSION

We evaluate the performance of our GPU-based CIS/TDDFT algorithm on a

variety of test systems: B3PPQ – 6,6’-bis(2-(1-triphenyl)-4-phenylquinoline - an

oligoquinoline recently synthesized and characterized by the Jenekhe group for use in

OLED devices52 and characterized theoretically by Tao and Tretiak;53 four generations

of oligothiophene dendrimers that are being studied for their interesting photophysical

properties54-56; the entire photoactive yellow protein (PYP)57 solvated by TIP3P58


water molecules; and deprotonated trans-thiophenyl-p-coumarate, an analogue of the

PYP chromophore59 that takes into account the covalent cysteine linkage, solvated

with an increasing number of QM waters.

Figure 1. Structures, number of atoms, and basis functions (fns) using the 6-31G basis set for four generations of oligothiophene dendrimers, S1-S4. Carbon atoms are orange, sulfur atoms are yellow.

Figure 2. Structures, number of atoms, and basis functions (fns) for the 6-31G basis for benchmark systems photoactive yellow protein (PYP), the solvated PYP chromophore, and oligoquinoline B3PPQ. For PYP, carbon, nitrogen, oxygen and sulfur atoms are green, blue, red, and yellow, respectively. For the other molecules, atom coloration is as given in figure 1, with additional red and blue coloration for oxygen and nitrogen atoms, respectively.

Benchmark structures are shown in figures 1 and 2 along with the number of

atoms and basis functions for a 6-31G basis set. For the solvated PYP chromophore,

only three structures are shown in figure 2, but benchmark calculations are presented


for 15 systems with increasing solvation, starting from the chromophore in vacuum

and adding water molecules up to a 16 Ångstrom solvation shell, which corresponds to

900 water molecules.

For our benchmark TDDFT calculations, we use the generalized gradient

approximation with Becke’s exchange functional60 combined with the Lee, Yang, and

Parr correlation functional61 (BLYP). During the SCF procedure for the ground state

wavefunction, we use two different DFT grids. A sparse grid of ~1000 grid

points/atom is used to converge the wave function until the DIIS error reaches a value

of 0.01, followed by a more dense grid of ~3000 grid points/atom until the ground

state wave function is fully converged. This denser grid is also used for the excited

state TDDFT timings reported herein, unless otherwise noted. The Coulomb and

exchange integral screening thresholds are set to 1x10^-11 atomic units. Coulomb and

exchange integrals with products of the density element and Schwarz bound below the

integral screening threshold are not computed, and exchange evaluation is terminated

when the products of density element and Schwarz bound fall below this value times a

guard factor of 0.001. The timings reported were obtained using a dual quad-core Intel

Xeon X5570 platform with 72 GB RAM and eight Tesla C1060 GPUs.

All CPU operations are performed in full double precision arithmetic,

including one-electron integral evaluation, integral accumulation, and diagonalization

of the subspace matrix of A. Calculations carried out on the GPU (Coulomb and

exchange operator construction and DFT quadrature) use mixed precision unless

otherwise noted. The mixed precision integral evaluation is a hybrid of 32-bit and 64-

bit arithmetic. In this case, integrals with Schwarz bounds larger than 0.001 a.u. are


computed in full double precision, and all others are computed in single precision with

double precision accumulation into the final matrix elements. To study the effects of

using single precision on excited state calculations, we have run the same CIS

calculations using both single and double precision integral evaluation for many of our

benchmark systems.

Figure 3. Plot of single and double precision convergence behavior for the 1st CIS/6-31G excited state of five benchmark systems. A typical convergence threshold of 10^-5 in the residual norm is indicated with a straight black line. Convergence behavior is generally identical for single and double precision integration until very small residual values well below the convergence threshold. Some calculations do require double precision for convergence. One such example is shown here for a snapshot of the PYP chromophore (PYPc) with 94 waters.

In general we find that mixed (and often even single) precision arithmetic on

the GPU is more than adequate for CIS/TDDFT, with convergence being achieved in

as many or fewer iterations than is required for the identical convergence criterion for

GAMESS. In most cases we find that the convergence behavior is nearly identical for

single and double precision until the residual vector is quite small. Figure 3 shows the

typical single and double precision convergence behavior as represented by the CIS

residual vector norm convergence for B3PPQ, the 1st and 3rd generations of

oligothiophene dendrimers S1 and S3, and a snapshot of the PYP chromophore

surrounded by 14 waters. A common convergence criterion on the residual norm is


shown with a straight black line at 10^-5 a.u. Note that for the examples in figure 3, we

are not using mixed precision – all two electron integrals on the GPU are done in

single precision. This is therefore an extreme example (other calculations detailed in

this paper used mixed precision where large integrals and quadrature contributions are

calculated in double precision) and serves to show that CIS and TDDFT are generally

quite robust, irrespective of the precision used in the calculation. Nevertheless, a few

problematic cases have been found in which single precision integral evaluation is not

adequate and where double precision is needed to achieve convergence.a During the

course of hundreds of CIS calculations performed on snapshots of the dynamics of the

PYP chromophore solvated by various numbers of water molecules, a small number

(<1%) of cases yield ill-conditioned Davidson convergence when single precision is

used for the GPU-computed ERIs and quadrature contributions. For illustration, the

single and double precision convergence behavior for one of these rare cases, here the

PYP chromophore with 94 waters, is shown in figure 3. In practice, this is not a

problem since one can always switch to double precision and this can be done

automatically when convergence problems are detected.
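The control logic amounts to a simple retry policy, sketched below with hypothetical hooks; set_gpu_precision and davidson_converged are stand-ins for interfaces that are not shown in the text, not actual TeraChem functions.

#include <stdexcept>

enum class Precision { Mixed, Double };

// Hypothetical hooks: one switches the GPU ERI/quadrature kernels between
// mixed and full double precision, the other runs the Davidson solve and
// reports whether it reached the residual threshold.
void set_gpu_precision(Precision p);
bool davidson_converged(Precision p, double tol);

// Retry policy sketched from the discussion above: attempt the excited-state
// solve with mixed precision and rerun in full double precision only when
// convergence problems are detected (the rare, <1%, ill-conditioned cases).
void solve_cis_with_fallback(double tol)
{
    set_gpu_precision(Precision::Mixed);
    if (davidson_converged(Precision::Mixed, tol)) return;

    set_gpu_precision(Precision::Double);
    if (!davidson_converged(Precision::Double, tol))
        throw std::runtime_error("CIS/TDDFT solve did not converge in double precision");
}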

Timings and excitation energies for some of the test systems are given in table

1 and compared to the GAMESS quantum chemistry package version 12 Jan 2009

(R3). The GAMESS timings are obtained using the same Intel Xeon X5570 eight-core

machine as for the GPU calculations (where GAMESS is running in parallel over all

eight cores). The numerical accuracy of the excitation energies for mixed precision

a Of course, if the convergence threshold was made sufficiently small, all calculations would require double (or better) precision throughout.


GPU integral evaluation is excellent for all systems studied; the largest discrepancy

between GAMESS and our GPU implementation is less than 0.0001eV. Speedups are

given for both the total SCF time and CIS computation time, with a large increase in

performance times obtained using the GPU for both ground and excited state methods.

The speedups increase as system size increases, with SCF speedups outperforming

CIS speedups. For the largest system compared with GAMESS, which is the 29 atom

chromophore of PYP surrounded by 487 QM water molecules, the speedup is well

over 400x for SCF and 200x for CIS.

Molecule (atoms; basis functions)            CIS time (s)            Speedup       ΔE S0/S1 (au)
                                             GPU       GAMESS        SCF    CIS    GPU          GAMESS
B3PPQ oligoquinoline (112; 700)              38.6      371.5         15     10     0.1729276    0.1729293
S2 oligothiophene dendrimer (128; 958)       117.5     755.9         15     6      0.1511511    0.1511509
PYP chromophore + 101 waters (332; 1501)     164.8     3032.7        52     18     0.1338027    0.1338021
PYP chromophore + 146 waters (467; 2086)     286.7     8654.9        90     30     0.1337468    0.1337463
PYP chromophore + 192 waters (605; 2684)     431.5     20546.8       133    48     0.1338623    0.1338617
PYP chromophore + 261 waters (812; 3581)     715.4     57800.5       212    81     0.1339657    0.1339651
PYP chromophore + 397 waters (1220; 5349)    1459.3    243975.7      353    167    0.1341203    0.1341196
PYP chromophore + 487 waters (1490; 6519)    2408.1    562606.6      421    234    0.1343182    0.1343174

Table 1. Accuracy and performance of the CIS algorithm on a dual Intel Xeon X5570 (eight CPU cores) with 72 GB RAM. GPU calculations use eight Tesla C1060 GPU cards.

The dominant computational parts in building the CIS/TDDFT AX vector can

be divided into Coulomb matrix, exchange matrix and DFT contributions. Figure 4

plots the CPU+GPU time consumed by each of these three contributions (both CPU

and GPU times are included here, although the CPU time is a very small fraction of

the total). J and K timings are taken from an average of the ten initial guess AX builds

for a CIS calculation, and the DFT timings are from an average of the initial guess AX

builds for a TD-BLYP calculation. The initial guess transition densities are very sparse

and thus this test highlights the differing efficiency of screening and thresholding in


these three contributions. The J-timings for CIS and BLYP are similar, and only those

for CIS are reported. Power law fits are shown as solid lines and demonstrate near-

linear scaling behavior of all three contributions to the AX build. The coulomb matrix

and DFT quadrature steps are closest to linear scaling, with observed scaling of N^1.2 and N^1.1, respectively, where N is the number of basis functions. The exchange contribution scales as N^1.6. These empirical scaling data demonstrate that with proper

sorting and integral screening, the AX build in CIS and TDDFT scales much better

than quadratic, with no loss of accuracy in excitation energies.

Figure 4. Contributions to the time for building an initial AX vector in CIS and TD-BLYP. Ten initial X vectors are created based on MO energy gaps, and the timing reported is the average time for building AX for those ten vectors. The timings are obtained on a dual Intel Xeon X5570 platform with eight Tesla C1060 GPUs. Data (symbols) are fit to power law (solid line, fitting parameters in inset). Fewer points are included for the TD-BLYP timings because the SCF procedure does not convergea for the solvated PYP chromophore with a large number of waters or for the full PYP protein.

a This is due to the well-known problem with non-hybrid DFT functionals having erroneously low-lying charge transfer states that can prevent SCF convergence.

Fit parameters (figure 4 inset): y = 7.3x10^-5 N^1.2 (R = 0.97); y = 5.0x10^-5 N^1.6 (R = 0.99); y = 4.9x10^-4 N^1.04 (R = 0.97).


Of the three integral contributions (Coulomb, exchange, and DFT quadrature),

the computation of the exchange matrix is clearly the bottleneck. This is due to the

three issues with exchange computation previously discussed: 1) the J-engine

algorithm takes full advantage of density sparsity because of efficient density

screening that is not possible for our K-engine implementation, 2) exchange kernels

access the density in memory non-contiguously, and 3) exchange lacks the \mu\nu \leftrightarrow \nu\mu and \lambda\sigma \leftrightarrow \sigma\lambda symmetries. It is useful to compare the time required to calculate the exchange contribution to the first ground-state SCF iteration (which is the most

expensive iteration due to the use of Fock matrix updating62) and to the AX vector

build for CIS (or TDDFT). We find that the exchange contribution is on average 1.5x

slower in CIS/TDDFT compared to ground state SCF. One might have expected the

excited state computation to be 2x slower because of the two K-engine calls, but the

algorithm is able to exploit the greater sparsity of the transition density matrix

(compared to the ground state density matrix).

Due to efficient screening of ERI contributions to the Coulomb matrix, the J-

engine similarly exploits the increased sparseness of the transition density, and

therefore is faster than the ground state 1st iteration J-engine calculation. In the current

implementation the Coulomb evaluation profits more from transition density sparsity

than that for exchange since it scales better with system size (N^1.2 vs N^1.6).

As can be seen in figure 4, the DFT integration usually takes more time than

the Coulomb contribution, in spite of the fact that DFT integration scales more nearly

linearly with system size. This is because of the larger prefactor for DFT integration,

which is related to the density of the quadrature grids used. It has previously been


noted63 that very sparse grids can be more than adequate for TDDFT. We further

support this claim with the data presented in table 2, where we compare the lowest

excitation energies and average TD-BLYP integration times for the initial guess

vectors for six different grids on two of the test systems. For both molecules, the

excitation energies from the sparsest grid agree well with those of the more dense

grids, but with a substantial reduction in integration time, suggesting that a change to

an ultra sparse grid for the TDDFT portion of the calculation could result in

considerable time savings with little to no loss of accuracy. The TD-BLYP values

computed with NWChem64 using the default ‘medium’ grid are also given to show the

accuracy of our implementation. The small (<0.001eV) differences in excitation

energies between our GPU-based TD-BLYP and NWChem are likely due to slightly

differing ground state densities, which differ in energy by 0.000008 a.u. for the

chromophore and 0.0019 a.u. for the S2 dendrimer.

Table 2. TD-BLYP timings (average time for the DFT quadrature in one AX build for the initial 10 AX vectors) and first excitation energies using increasingly dense quadrature grids. For comparison, NWChem excitation energies are also given using the default 'medium' grid. Number of points/atom refers to the pruned grid for TeraChem and the unpruned grid for NWChem. NWChem was run on a different architecture, so timings are not directly comparable.

PYP chromophore (29 atoms)
grid            points     points/atom   time (s)   ΔE (au)
0               29497      1017          0.06       0.08516162
1               81461      2809          0.11       0.08516510
2               182872     6305          0.22       0.08516472
3               330208     11386         0.38       0.08516266
4               841347     29011         0.91       0.08516267
5               2126775    73337         2.30       0.08516268
NWChem/medium              21655         n/a        0.08516691

S2 dendrimer (128 atoms)
grid            points     points/atom   time (s)   ΔE (au)
0               141684     1106          0.28       0.08394399
1               382576     2988          0.51       0.08394430
2               848918     6632          0.98       0.08394427
3               1506502    11769         1.64       0.08394433
4               3770640    29458         3.72       0.08394431
5               9472331    74002         9.24       0.08394431
NWChem/medium              25061         n/a        0.08395211


GPU-accelerated CIS and TDDFT methods can calculate excited states of

much larger molecules than can currently be studied with previously existing ab initio

methods. For the well-behaved valence transitions in the PYP systems, CIS

convergence requires very few Davidson iterations. The total wall time (SCF+CIS)

required to calculate the 1st CIS/6-31G excited state of the entire PYP protein (10,869

basis functions) is less than 7 hours, with ~5.5 hours devoted to the SCF procedure,

and ~1.5 hours to the CIS procedure. Most (1.2 hours) of the CIS time is spent

computing exchange contributions. We can thus treat the protein with full QM and

study how mutation within PYP will affect the absorbance. For any meaningful

comparison with the experimental absorption energy of PYP at 2.78 eV,59 many

configurations need to be taken into account. For this single configuration, the CIS

excitation energy of 3.69 eV is much higher than the experimental value, as expected

with CIS. The TD-B3LYP bright state (S5) is closer to the experimental value, but still

too high at 3.33 eV.

Solvatochromic studies in explicit water are problematic for standard DFT

methods, including hybrid functionals, due to the well-known difficulty in treating

charge transfer excitations.16,65 In calculating the timings for the first excited state of

the PYP chromophore with increasing numbers of waters, we found that the energy of

the CIS first excited state quickly leveled off and stabilized, while that for TD-BLYP

and TD-B3LYP generally decreased to nonsensical values, at which point the ground

state SCF convergence was also problematic. This behavior of the first excitation

energies for the PYP chromophore with increasing numbers of waters is shown in

figure 5 for CIS, TD-BLYP, and TD-B3LYP. While the 20% HF exchange in the


hybrid TD-B3LYP method does improve the excitation energies over TD-BLYP, the

energies are clearly incorrect for both methods, and a higher level of theory or a range-

corrected functional19,21 is certainly necessary for studying excitations involving

explicit QM waters.

Figure 5. The first excitation energy (eV) of the PYP chromophore with increasing numbers of surrounding water molecules. Both TD-BLYP and TD-B3LYP exhibit spurious low-lying charge transfer states. The geometry is taken from a single AMBER dynamics snapshot.

The recent theoretical work by Badaeva et al. examining the one and two

photon absorbance of oligothiophene dendrimers was limited to results for the first

three generations S1-S3, even though experimental results were available for S4.54-56

In table 3, we compare our GPU accelerated results on the first bright excited state

(oscillator strength > 1.0) using TD-B3LYP within the TDA to the full TD-B3LYP

and experimental results. Results within the TDA are comparable to those from full

TD-B3LYP, for both energies and transition dipole moments. Our results for S4 show

the continuing trend of decreasing excitation energy and increasing transition dipole

moment with increasing dendrimer generation.


Table 3. Experimental and calculated vertical transition energies (eV) and transition dipole moments (D) for the lowest energy bright state. Experimental results were taken from Refs. 54 and 56. GPU accelerated TD-B3LYP was computed within the Tamm-Dancoff approximation. TD-B3LYP results taken from Ref. 56.

        Exp        GPU TD-B3LYP          CPU TD-B3LYP
        ΔE (eV)    ΔE (eV)    µge (D)    ΔE (eV)    µge (D)
S1      3.25       3.13       9.2        3.00       8.0
S2      3.25       2.98       10.6       2.92       10.1
S3      3.20       2.93       12.0       2.89       11.9
S4      3.19       2.83       13.3       --         --

CONCLUSION

We have implemented ab initio CIS and TDDFT calculations using GPUs,

allowing full QM calculation of the excited states of large systems. The numerical

accuracy of the excitation energies is shown to be excellent using mixed precision

integral evaluation. A small percentage of cases require full double precision

integration. For these occasional issues, we can easily switch to full double precision

to achieve the desired convergence. The ability to use lower precision in much of the

CIS and TDDFT calculation is reminiscent of the ability to use coarse grids when

calculating correlation energies, as shown previously for pseudospectral methods.63,66-

69 Recently, it has also been shown70 that single precision can be adequate for

computing correlation energies with Cholesky decomposition methods which are

closely related to pseudospectral methods.71 Both quadrature and precision errors

generally behave as relative errors, while chemical accuracy is an absolute standard

(often taken to be ~1 kcal/mol). Thus, coarser grids and/or lower precision can be

safely used when the quantity being evaluated is itself small (and therefore less

relative accuracy is required), as is the case for correlation energies and/or excitation

energies.
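A simple numerical illustration of this argument, using representative (not measured) magnitudes:

# Relative errors matter less when the target quantity is itself small.
HARTREE_TO_KCAL = 627.5
rel_err = 1.0e-5          # representative relative error from a coarse grid or low precision
total_energy = 1000.0     # a.u., a typical total energy for a large molecule
excitation_energy = 0.1   # a.u., a typical excitation energy

print(rel_err * total_energy * HARTREE_TO_KCAL)       # ~6.3 kcal/mol, above chemical accuracy
print(rel_err * excitation_energy * HARTREE_TO_KCAL)  # ~0.0006 kcal/mol, far below 1 kcal/mol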


For some of the smaller benchmark systems, we present speedups as compared

to the GAMESS quantum chemistry package running over 8 processor cores. The

speedups obtained for CIS and TDDFT calculations range from 6x to 234x, with the speedup increasing with system size.

The increased size of the molecules that can be treated using our GPU-based

algorithms exposes some failings of DFT and TDDFT. Specifically, the charge-

transfer problem16 of TDDFT and the delocalization problem72 of DFT both seem to

become more severe as the molecules become larger, especially for the case of

hydrated chromophores with large numbers of surrounding quantum mechanical water

molecules.

REFERENCES

(1) Foresman, J. B.; Head-Gordon, M.; Pople, J. A.; Frisch, M. J. The Journal of Physical Chemistry 1992, 96, 135.

(2) Runge, E.; Gross, E. K. U. Physical Review Letters 1984, 52, 997.

(3) Gross, E. K. U.; Kohn, W. Physical Review Letters 1985, 55, 2850.

(4) Casida, M. E. In Recent Advances in Density Functional Methods; Chong, D. P., Ed.; World Scientific: Singapore, 1995.

(5) Casida, M. E.; Jamorski, C.; Casida, K. C.; Salahub, D. R. The Journal of Chemical Physics 1998, 108, 4439.

(6) Appel, H.; Gross, E. K. U.; Burke, K. Physical Review Letters 2003, 90, 043005.

(7) Hirata, S.; Head-Gordon, M.; Bartlett, R. J. The Journal of Chemical Physics 1999, 111, 10774.

(8) Dreuw, A.; Head-Gordon, M. Chem. Rev. 2005, 105, 4009.

(9) Burke, K.; Werschnik, J.; Gross, E. K. U. The Journal of Chemical Physics 2005, 123, 062206.


(10) Dallos, M.; Lischka, H.; Shepard, R.; Yarkony, D. R.; Szalay, P. G. J. Chem. Phys. 2004, 120, 7330.

(11) Kobayashi, Y.; Nakano, H.; Hirao, K. Chem. Phys. Lett. 2001, 336, 529.

(12) Roos, B. O. Acc. Chem. Res. 1999, 32, 137.

(13) Tokita, Y.; Nakatsuji, H. J. Phys. Chem. B 1997, 101, 3281.

(14) Krylov, A. I. Ann. Rev. Phys. Chem. 2008, 59, 433.

(15) Stanton, J. F.; Bartlett, R. J. J. Chem. Phys. 1993, 98, 7029.

(16) Dreuw, A.; Weisman, J. L.; Head-Gordon, M. The Journal of Chemical Physics 2003, 119, 2943.

(17) Iikura, H.; Tsuneda, T.; Yanai, T.; Hirao, K. The Journal of Chemical Physics 2001, 115, 3540.

(18) Heyd, J.; Scuseria, G. E.; Ernzerhof, M. The Journal of Chemical Physics 2003, 118, 8207.

(19) Tawada, Y.; Tsuneda, T.; Yanagisawa, S.; Yanai, T.; Hirao, K. The Journal of Chemical Physics 2004, 120, 8425.

(20) Rohrdanz, M. A.; Herbert, J. M. The Journal of Chemical Physics 2008, 129, 034107.

(21) Rohrdanz, M. A.; Martins, K. M.; Herbert, J. M. The Journal of Chemical Physics 2009, 130, 054112.

(22) Grimme, S. J. Comp. Chem. 2004, 25, 1463.

(23) Grimme, S. J. Comp. Chem. 2006, 27, 1787.

(24) Vydrov, O. A.; Van Voorhis, T. J. Chem. Phys. 2010, 132, 164113.

(25) Dion, M.; Rydberg, H.; Schroder, E.; Langreth, D. C.; Lundqvist, B. I. Phys. Rev. Lett. 2004, 92, 246401.

(26) Maitra, N. T.; Zhang, F.; Cave, R. J.; Burke, K. The Journal of Chemical Physics 2004, 120, 5932.

(27) Levine, B. G.; Ko, C.; Quenneville, J.; Martinez, T. J. Mol. Phys. 2006, 104, 1039.

(28) Jacquemin, D.; Wathelet, V.; Perpeate, E. A.; Adamo, C. J. Chem. Theo. Comp. 2009, 5, 2420.


(29) Warshel, A.; Levitt, M. J. Mol. Biol. 1976, 103, 227.

(30) Virshup, A. M.; Punwong, C.; Pogorelov, T. V.; Lindquist, B. A.; Ko, C.; Martinez, T. J. J. Phys. Chem. B 2009, 113, 3280.

(31) Ruckenbauer, M.; Barbatti, M.; Muller, T.; Lischka, H. J. Phys. Chem. A 2010, 114, 6757.

(32) Polli, D.; Altoe, P.; Weingart, O.; Spillane, K. M.; Manzoni, C.; Brida, D.; Tomasello, G.; Orlandi, G.; Kukura, P.; Mathies, R. A.; Garavelli, M.; Cerullo, G. Nature 2010, 467, 440.

(33) Schafer, L.; Groenhof, G.; Boggio-Pasqua, M.; Robb, M. A.; Grubmuller, H. PLoS Comp. Bio. 2008, 4, e1000034.

(34) Vogt, L.; Olivares-Amaya, R.; Kermes, S.; Shao, Y.; Amador-Bedolla, C.; Aspuru-Guzik, A. The Journal of Physical Chemistry A 2008, 112, 2049.

(35) Yasuda, K. J. Chem. Theo. Comp. 2008, 4, 1230.

(36) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2009, 5, 1004.

(37) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2009, 5, 2619.

(38) Stone, J. E.; Phillips, J. C.; Freddolino, P. L.; Hardy, D. J.; Trabuco, L. G.; Schulten, K. J. Comp. Chem. 2007, 28, 2618.

(39) Anderson, J. A.; Lorenz, C. D.; Travesset, A. J. Comp. Phys. 2008, 227, 5342.

(40) Liu, W.; Schmidt, B.; Voss, G.; Muller-Wittig, W. Comp. Phys. Comm. 2008, 179, 634.

(41) Friedrichs, M. S.; Eastman, P.; Vaidyanathan, V.; Houston, M.; Legrand, S.; Beberg, A. L.; Ensign, D. L.; Bruns, C. M.; Pande, V. S. J. Comp. Chem. 2009, 30, 864.

(42) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2008, 4, 222.

(43) Grabo, T.; Petersilka, M.; Gross, E. K. U. Journal of Molecular Structure: THEOCHEM 2000, 501-502, 353.

(44) Hirata, S.; Head-Gordon, M. Chem. Phys. Lett. 1999, 314, 291.

(45) Cordova, F.; Doriol, L. J.; Ipatov, A.; Casida, M. E.; Filippi, C.; Vela, A. J. Chem. Phys. 2007, 127, 164111.

(46) Davidson, E. R. J. Comp. Phys. 1975, 17, 87.


(47) White, C. A.; Head-Gordon, M. J. Chem. Phys. 1994, 101, 6593.

(48) Becke, A. D. The Journal of Chemical Physics 1988, 88, 2547.

(49) Lebedev, V. I.; Laikov, D. N. Dokl. Akad. Nauk 1999, 366, 741.

(50) Murray, C. W.; Handy, N. C.; Laming, G. J. Mol. Phys. 1993, 78, 997.

(51) Brown, P.; Woods, C.; McIntosh-Smith, S.; Manby, F. R. J. Chem. Theo. Comp. 2008, 4, 1620.

(52) Hancock, J. M.; Gifford, A. P.; Tonzola, C. J.; Jenekhe, S. A. The Journal of Physical Chemistry C 2007, 111, 6875.

(53) Tao, J.; Tretiak, S. J. Chem. Theo. Comp. 2009, 5, 866.

(54) Ramakrishna, G.; Bhaskar, A.; Bauerle, P.; Goodson, T. The Journal of Physical Chemistry A 2007, 112, 2018.

(55) Harpham, M. R.; Suzer, O.; Ma, C.-Q.; Bauerle, P.; Goodson, T. J. Am. Chem. Soc. 2009, 131, 973.

(56) Badaeva, E.; Harpham, M. R.; Guda, R.; Suzer, O.; Ma, C.-Q.; Bauerle, P.; Goodson, T.; Tretiak, S. J. Phys. Chem. B 2010, 114, 15808.

(57) Yamaguchi, S.; Kamikubo, H.; Kurihara, K.; Kuroki, R.; Niimura, N.; Shimizu, N.; Yamazaki, Y.; Kataoka, M. Proceedings of the National Academy of Sciences 2009, 106, 440.

(58) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. The Journal of Chemical Physics 1983, 79, 926.

(59) Nielsen, I. B.; Boye-Peronne, S.; El Ghazaly, M. O. A.; Kristensen, M. B.; Brondsted Nielsen, S.; Andersen, L. H. Biophysical Journal 2005, 89, 2597.

(60) Becke, A. D. Physical Review A 1988, 38, 3098.

(61) Lee, C.; Yang, W.; Parr, R. G. Phys. Rev. B 1988, 37, 785.

(62) Almlof, J.; Faegri, K.; Korsell, K. J. Comp. Chem. 1982, 3, 385.

(63) Ko, C.; Malick, D. K.; Braden, D. A.; Friesner, R. A.; Martinez, T. J. J. Chem. Phys. 2008, 128, 104103.

(64) Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Straatsma, T. P.; VanDam, H. J. J.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T. L.; deJong, W. A. Comp. Phys. Comm. 2010, 181, 1477.


(65) Grimme, S.; Parac, M. ChemPhysChem 2003, 4, 292.

(66) Martinez, T. J.; Carter, E. A. J. Chem. Phys. 1993, 98, 7081.

(67) Martinez, T. J.; Carter, E. A. J. Chem. Phys. 1994, 100, 3631.

(68) Martinez, T. J.; Carter, E. A. J. Chem. Phys. 1995, 102, 7564.

(69) Martinez, T. J.; Mehta, A.; Carter, E. A. J. Chem. Phys. 1992, 97, 1876.

(70) Vysotskiy, V. P.; Cederbaum, L. S. J. Chem. Theo. Comp. 2010, Articles ASAP; DOI: 10.1021/ct100533u.

(71) Martinez, T. J.; Carter, E. A. In Modern Electronic Structure Theory, Part II; Yarkony, D. R., Ed.; World Scientific: Singapore, 1995, p 1132.

(72) Cohen, A. J.; Mori-Sanchez, P.; Yang, W. Science 2008, 321, 792.


CHAPTER SIX

MULTIPLE TIME STEP INTEGRATORS

FOR AB INITIO MOLECULAR DYNAMICSa

The GPU-accelerated electronic structure methods developed in previous

chapters provide an excellent foundation for ab initio molecular dynamics (AIMD)

simulations. For ab initio dynamics, the calculation of the electronic structure at each

time step is the dominant bottleneck, and GPU acceleration does little to alter this

balance. However, with GPUs, AIMD can be routinely applied to very large systems,

containing up to one thousand atoms. As is well understood from classical dynamics

simulations, such large systems include a mix of fast and slow degrees of freedom.

The fastest modes determine the maximum acceptable time step, while the slower

modes typically represent the motions of scientific interest. In the present chapter we

discuss the application of multiple time step (MTS) integrators to extend the

simulation time scale accessible to large AIMD simulations.

MTS integration techniques1-7 are standard tools used to increase the

computational efficiency of molecular dynamics calculations based on empirical force

fields. These MTS schemes exploit the fact that the forces in chemical systems can

typically be split into fast-varying and slow-varying parts. This splitting is then

leveraged to integrate the slow-varying parts with a longer time step and the fast-

varying parts with a shorter time step. Since the “slow forces”, such as the long-range

a Adapted with permission from N. Luehr, T.E. Markland, and T.J. Martinez, J. Chem. Phys. 2014, 140, 084116. Copyright 2014, AIP Publishing LLC.


electrostatic interactions, are typically more computationally expensive to evaluate

than the “fast forces”, such as covalent bond stretches, the ability to evaluate them less

often affords significant speed-ups. For empirical potentials this separation is often

straightforward as the Hamiltonian is typically written as a sum of terms such as van

der Waals, bond stretching, torsional, and electrostatic interactions which can be easily

assigned to the “slow” or “fast” part.

In contrast, ab initio molecular dynamics (AIMD) schemes compute the

potential energy surface on which the nuclei evolve by solving the electronic

Schrödinger equation at each time step. This introduces significant flexibility,

allowing for bond rearrangement (difficult with empirical force fields that generally

assume a prescribed bonding topology),8 electron transfer,9 and transitions between

electronic states.10,11 The most straightforward AIMD approach is the Born-

Oppenheimer scheme (BOMD).12-14 In BOMD, the electronic degrees of freedom are

assumed to relax adiabatically at each nuclear geometry, and an electronic structure

problem is solved fully self-consistently at each time step. An advantage of this

approach is that dynamics always occur on a Born-Oppenheimer potential energy

surface. In contrast, other AIMD methods employ an extended Lagrangian scheme

(Car-Parrinello or CPMD),15,16 where new fictitious degrees of freedom corresponding

to the coefficients in the electronic wavefunction are integrated simultaneously with

the nuclear motion. The CPMD method avoids iteration to self-consistency in the

solution of the electronic wavefunction at each time step, at the expense of introducing

electronic time-scales that are faster than those of atomic motion and thus

necessitating smaller time steps for CPMD than would be possible using BOMD.


MTS schemes have been developed to mitigate the computational cost of integrating

fast electronic degrees of freedom in CPMD.17-19 These exploit the time-scale

separation between the fictitious electronic and nuclear degrees of freedom and allow

the outer (nuclear) time step to approach the BOMD limit.

Applying MTS schemes to decompose nuclear motions in AIMD methods

presents a much greater challenge since ab initio forces do not naturally separate into

fast-varying and slow-varying components. Thus, for a long time it has appeared that

MTS schemes could not be applied straightforwardly to BOMD (or, equivalently, to

the nuclear degrees of freedom in CPMD). Recent work has shown that this

conclusion is too harsh, demonstrating that one can treat different components of the

electronic structure problem (specifically, the Hartree-Fock and Moller-Plesset

contributions) with different time steps in an MTS scheme.20 This approach leverages

the well-known fact that the dynamic electron correlation correction to the Hartree-

Fock potential energy surface varies slowly with molecular geometry.

In this chapter, we demonstrate two ways of splitting the electronic

Hamiltonian that enable AIMD calculations to exploit MTS integrators. The first of

these relies on a fragment decomposition of the Hamiltonian. Such fragment

decompositions have been previously proposed to accelerate electronic structure

computations for large molecular systems.21-30 In those cases, the energy expression in

terms of fragments is viewed as an approximation to the true potential energy surface,

and neglected interactions (e.g., relating to charge transfer between fragments) are

rarely quantified or corrected. In contrast, our scheme uses the fragment

decomposition only as an intermediary representation during inner time steps with


corrections included at outer steps. As such, the dynamics occurs on the potential

surface without any fragment approximations.

The second MTS scheme we introduce exploits a splitting of the Coulomb

operator in the electronic Hamiltonian. This is closely related to the Coulomb-

Attenuated Schrödinger Equation (CASE) approximation that has been proposed to

accelerate electronic structure calculations.31,32 Again, the advantage of our approach

is that unlike the CASE approximation, which entirely neglects long-range

electrostatic effects, our scheme yields results which are equivalent to those obtained

in a calculation employing the usual Coulomb operator while simultaneously allowing

for the computational speedups afforded by the CASE approach.

THEORY

Trotter factorization of the Liouville operator provides a systematic approach

to derive symplectic time-reversible molecular dynamics integrators for systems

containing many time-scales.3 We begin by briefly reviewing this formalism in order

to highlight its desirable properties for the AIMD force splitting schemes presented

below. The classical Liouville operator for n degrees of freedom with coordinates, xk,

and conjugate momenta, pk is

iL = \sum_{k}^{n} \left( \dot{x}_k \frac{\partial}{\partial x_k} + f_k \frac{\partial}{\partial p_k} \right) \qquad (6.1)

where f_k is the force acting on the kth degree of freedom. The classical propagator, e^{iLT}, exactly evolves the system by a time period T from an initial phase space point \Gamma(t) = \{x(t), p(t)\} at time t to its destination at time t + T via the operation \Gamma(t + T) = e^{iLT}\,\Gamma(t).

For a multidimensional system with a general choice of interactions, this

operation cannot be performed analytically for the full propagator. The Trotter

factorization method solves this problem by splitting the Liouville operator,3

iL = iL_x + iL_p = \sum_{k}^{n} \dot{x}_k \frac{\partial}{\partial x_k} + \sum_{k}^{n} f_k \frac{\partial}{\partial p_k} \qquad (6.2)

and applying the symmetric Suzuki-Trotter formula.33,34

e^{a+b} \approx \left( e^{a/2M} \, e^{b/M} \, e^{a/2M} \right)^{M} + O\!\left( \frac{1}{M^{2}} \right) \qquad (6.3)

Thus one obtains

\Gamma(t + \Delta T) = e^{iL_p \Delta T / 2} \, e^{iL_x \Delta T} \, e^{iL_p \Delta T / 2} \, \Gamma(t) \qquad (6.4)

where ΔT = T/M is the time step. This expression becomes exact as M → ∞, i.e. ΔT → 0. In practice one uses a finite time step, which is sufficiently small to well

family of Trotter factorized integrators retains important symplectic and time-

reversible properties.3,35

The above Trotter factorized integrator is identical to the traditional Velocity

Verlet scheme.36 The operator e^{iL_p ΔT} can be shown to perform the operation of a momentum shift through a time interval ΔT:3

\{ x(t), p(t) \} \rightarrow \{ x(t), \; p(t) + \Delta T \, f(x(t)) \} \qquad (6.5)

while e^{iL_x ΔT} performs a coordinate shift:

\{ x(t), p(t) \} \rightarrow \{ x(t) + \Delta T \, p(t)/m, \; p(t) \} \qquad (6.6)

where m is the vector of masses for the degrees of freedom.
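For concreteness, Eqs. (6.4)-(6.6) translate directly into the familiar half-kick/drift/half-kick update. The sketch below assumes a generic force(x) routine and is not tied to any particular electronic structure code.

def velocity_verlet_step(x, p, m, dT, force):
    """One Trotter-factorized (velocity Verlet) step, cf. Eq. (6.4)."""
    p = p + 0.5 * dT * force(x)   # e^{iL_p dT/2}: momentum half-shift, Eq. (6.5)
    x = x + dT * p / m            # e^{iL_x dT}:   coordinate shift, Eq. (6.6)
    p = p + 0.5 * dT * force(x)   # e^{iL_p dT/2}: momentum half-shift
    return x, p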

The strength of the Trotter factorization, however, is that it allows much more

general decompositions of the Liouville operator. We now consider the case where the

total force on each degree of freedom can be separated into fast and slow components.

f_i = f_i^{F} + f_i^{S} \qquad (6.7)

Splitting the momentum shift operator, the Liouville operator can now be decomposed

into three terms.

iL = iL_x + iL_p^{F} + iL_p^{S} = \sum_{k}^{n} \dot{x}_k \frac{\partial}{\partial x_k} + \sum_{k}^{n} f_k^{F} \frac{\partial}{\partial p_k} + \sum_{k}^{n} f_k^{S} \frac{\partial}{\partial p_k} \qquad (6.8)

Proceeding as above with Suzuki-Trotter expansion, one obtains the following MTS

integrator.3

\Gamma(t + \Delta T) = e^{iL_p^{S} \Delta T / 2} \left( e^{iL_p^{F} \delta T / 2} \, e^{iL_x \delta T} \, e^{iL_p^{F} \delta T / 2} \right)^{N} e^{iL_p^{S} \Delta T / 2} \, \Gamma(t) \qquad (6.9)

Here, ΔT = NδT is the outer time step, and the bracketed term evolves the positions and momenta of the system by the smaller inner time step, δT, under the action of

step can be chosen with respect to the fastest slow force and evaluated 1/Nth as often

compared to traditional Verlet integrators. Since force evaluation is the dominant

computational step in AIMD calculations, the MTS approach can in principle provide


an N-fold speedup over traditional integrators, as long as the appropriate

decomposition into slow and fast forces can be found.
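A minimal sketch of the integrator defined by Eq. (6.9); fast_force and slow_force are placeholders for whichever splitting is adopted below.

def respa_step(x, p, m, DT, N, fast_force, slow_force):
    """One outer MTS (r-RESPA) step of length DT = N * dT, cf. Eq. (6.9)."""
    dT = DT / N
    p = p + 0.5 * DT * slow_force(x)       # outer half-kick with the slow force
    for _ in range(N):                     # N inner velocity Verlet steps
        p = p + 0.5 * dT * fast_force(x)
        x = x + dT * p / m
        p = p + 0.5 * dT * fast_force(x)
    p = p + 0.5 * DT * slow_force(x)       # closing outer half-kick
    return x, p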

In BOMD both long- and short-range atomic forces depend on nonlinear SCF

equations. Thus, a strict algebraic force decomposition of the type commonly used in

classical MTS integrators is not possible. Fortunately, the RESPA strategy is flexible

enough to allow a broad range of numerical decompositions. We introduce the

following, almost trivial, scheme.

iL = \sum_{k}^{n} \dot{x}_k \frac{\partial}{\partial x_k} + \sum_{k}^{n} F_k^{\mathrm{mod}} \frac{\partial}{\partial p_k} + \sum_{k}^{n} \left( F_k^{AI} - F_k^{\mathrm{mod}} \right) \frac{\partial}{\partial p_k} = iL_x + iL_p^{F} + iL_p^{S} \qquad (6.10)

Here F_k^AI is the full ab initio force acting on the kth degree of freedom, while F_k^mod is an approximate model force intended to capture the short-range behavior of F_k^AI.

Assuming the model force is smooth and numerically well behaved, the resulting ab

initio MTS integrators evolve dynamics on the same potential energy surface as

traditional BOMD approaches. Of course, a poor choice of the model force could

leave short-range, high-frequency components in the difference force, F_k^AI - F_k^mod,

limiting the outer time step and defeating any speedup. However, the same logic also

works in reverse. We are free to relax an exact model for the short-range force until

short-range discrepancies appear in F_k^AI - F_k^mod on the same timescale as the fastest

long-range forces. In other words, the outer time step retains some ability to correct

errors in our short-range model force. In practice we will show that sufficiently

accurate model forces are easily computed, so it is the latter line of reasoning that is

most relevant.
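In code, the splitting of Eq. (6.10) amounts to handing the model force to the inner loop and the difference force to the outer half-kicks of the RESPA step sketched above; the gradient routines named here are placeholders.

def make_mts_forces(ab_initio_force, model_force):
    """Fast/slow pair for Eq. (6.10): fast = F^mod, slow = F^AI - F^mod."""
    fast = model_force
    slow = lambda x: ab_initio_force(x) - model_force(x)
    return fast, slow

# e.g. fast, slow = make_mts_forces(full_bomd_force, fragment_model_force)
#      x, p = respa_step(x, p, m, DT=2.5, N=5, fast_force=fast, slow_force=slow)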


Our first approach to model the short-range ab initio force is to split an

extended system into small independent fragments. A very general approach would

require a method to fragment macro-molecules across covalent bonds as well as

automatically distribute electrons among fragments. For simplicity, the present work

focuses on systems where these refinements are not needed, in particular, large water

clusters. In the method we christen MTS-FRAG, the model force is constructed as a

sum of independent ab initio gradient calculations on each water molecule in vacuo.

Because all molecular calculations require equal computational effort, the work to

compute the model force scales linearly with the size of the system (i.e., number of

fragments). On the other hand, the global ab initio gradient needed in the outer time

step requires computational effort that is at least quadratic in the system size [for

conventional implementations of Hartree-Fock (HF) or density functional theory

(DFT)]. Thus, for large systems, the short-range force calculation becomes essentially

free and we expect the overall speedup to increase linearly with the size of the outer

MTS time step. The same analysis applies to other ab initio methods such as

perturbation theory and coupled cluster. In these cases, the scaling of effort with

respect to molecular size is often considerably steeper, which will make our MTS-

FRAG scheme even more efficient compared to an implementation where the ab initio

force is evaluated using HF or DFT.
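A sketch of the MTS-FRAG model force; gradient_in_vacuo is a hypothetical helper returning the ab initio gradient of one isolated water molecule.

import numpy as np

def frag_model_force(coords, fragments, gradient_in_vacuo):
    """MTS-FRAG model force: sum of independent in vacuo fragment forces.

    coords    : (n_atoms, 3) Cartesian coordinates
    fragments : list of index arrays, one per water molecule
    """
    force = np.zeros_like(coords)
    for idx in fragments:
        # Each fragment is an independent gas-phase calculation of fixed cost,
        # so the total work scales linearly with the number of fragments.
        force[idx] = -gradient_in_vacuo(coords[idx])
    return force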

As presented above, the fragment approach completely ignores all

intermolecular interactions during the inner time steps. It is illustrative to also consider

a simple refinement, in which we add Lennard-Jones and Coulomb terms between all

W water molecules to the model force.


F_k^{LJ} = -\nabla_k \sum_{m=1}^{W} \sum_{n=m+1}^{W} \left( \frac{A}{r_{O_m O_n}^{12}} - \frac{C}{r_{O_m O_n}^{6}} + \sum_{i \in m} \sum_{j \in n} \frac{q_i q_j}{r_{ij}} \right) \qquad (6.11)

We denote an MTS method using this Lennard-Jones augmented fragment model

force as MTS-LJFRAG. In the present work, we used empirical parameters (A, C, and

charges qk) directly from the TIP3P water model without modification.37 Evaluation of

this term adds negligible effort to the more expensive ab initio fragment calculations

but allows atoms to avoid strongly repulsive regions near neighboring molecules

during inner integration steps. Polarization and charge transfer effects are still

neglected, but these are expected to vary on longer time scales. Further refinement of

these fragment models is no doubt possible. However, we introduce these coarse

approximations here in order to explore the ability of the outer time step to correct for

small but significant errors in the short-range model force.
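For concreteness, a direct pair-loop evaluation of the inter-fragment term of Eq. (6.11) is sketched below as an energy (the model force is minus its gradient). The parameter values are representative TIP3P-style numbers assumed here for illustration; the definitive values are those of Ref. 37.

import numpy as np

A_LJ, C_LJ = 582.0e3, 595.0    # assumed O-O Lennard-Jones parameters (kcal/mol * A^12, A^6)
Q = {"O": -0.834, "H": 0.417}  # assumed TIP3P-style point charges (e)
COULOMB_K = 332.06             # kcal/mol * A / e^2

def interfragment_energy(coords, waters):
    """Pairwise LJ (O-O) plus Coulomb (all sites) energy between water fragments.

    waters: list of dicts such as {"O": iO, "H": [iH1, iH2]} holding atom indices.
    """
    energy = 0.0
    for m in range(len(waters)):
        for n in range(m + 1, len(waters)):
            r_oo = np.linalg.norm(coords[waters[m]["O"]] - coords[waters[n]["O"]])
            energy += A_LJ / r_oo**12 - C_LJ / r_oo**6
            sites_m = [(waters[m]["O"], Q["O"])] + [(h, Q["H"]) for h in waters[m]["H"]]
            sites_n = [(waters[n]["O"], Q["O"])] + [(h, Q["H"]) for h in waters[n]["H"]]
            for i, qi in sites_m:
                for j, qj in sites_n:
                    energy += COULOMB_K * qi * qj / np.linalg.norm(coords[i] - coords[j])
    return energy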

The primary limitation of the fragment approaches described above is that the

atomic decomposition into fragments cannot be adjusted during the course of a

simulation without destroying the reversibility and energy conservation of the

integrator. This rules out arbitrary bond rearrangements, one of the key features we

would like to preserve from traditional AIMD. As an alternative we consider the

MTS-CASE scheme, where the model force is obtained from an electronic structure

calculation with a truncated (i.e. short-range) Coulomb operator. This is accomplished

with the following substitution in the electronic Hamiltonian (and also in the Coulomb

interaction between the nuclear point charges).

\frac{1}{r} \rightarrow \frac{\mathrm{erfc}(\omega r)}{r} \qquad (6.12)


Here ω is a constant parameter with units of inverse distance that determines the

range at which the Coulomb interaction is effectively quenched. Because the CASE

model force does not require an explicit molecular decomposition, it should, in

principle, have no difficulty with bond rearrangements. However, the accuracy with

which the CASE approximation can describe distorted transition geometries is a

serious concern. Figure 1 shows potential energy scans at the restricted Hartree-Fock

(RHF) and CASE levels of theory for the pictured dissociation of a hydroxide ion and

water molecule. While CASE accurately reproduces the RHF minimum-energy

oxygen-oxygen distance, the binding energy is catastrophically underestimated. Such

artifacts imply that physically relevant trajectories cannot be obtained from the CASE

potential surface. However, in MTS-CASE the outer time step corrects for the

difference between the CASE potential and the full RHF surface. Hence the MTS-

CASE trajectory evolves on the full potential surface and so should accurately

describe transition geometries.
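To make the attenuation of Eq. (6.12) concrete, the short sketch below evaluates the surviving fraction of the bare 1/r interaction, erfc(ωr), at the screening distances used later; this is a numerical illustration only, not part of the TeraChem implementation.

import math

BOHR_PER_ANGSTROM = 1.0 / 0.52918

def surviving_fraction(r_angstrom, omega_bohr):
    """Fraction of the bare Coulomb interaction left after the Eq. (6.12) substitution."""
    return math.erfc(omega_bohr * r_angstrom * BOHR_PER_ANGSTROM)

for omega in (0.33, 0.18):
    for r in (1.6, 2.9, 5.0):
        print(f"omega = {omega} bohr^-1, r = {r} A: {surviving_fraction(r, omega):.3f}")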

Figure 1: Dissociation curves for an H2O/OH- cluster at the CASE and RHF levels of theory using the 6-31G basis set. Potential curves are generated from optimized geometries at constrained oxygen-oxygen distances. The energy for each curve is taken relative to its minimum value. The equilibrium bond distances are very similar. However, CASE severely under-estimates the binding energy, and includes an unphysical kink near 5 Angstroms.


RESULTS AND DISCUSSION

Figure 2: Energy conservation for 21ps simulation of an (H2O)57 water cluster using the MTS-LJFRAG integrator with outer and inner time steps of 2.5 and 0.5fs respectively. The simulation was run at the RHF/3-21G level of theory in the NVE ensemble after 5ps NVT equilibration at 350K. The cluster was confined by a spherical boundary chosen to lead to a density of 1g/mL. The blue curve shows total energy in kcal/mol shifted by +2.72x10^6 kcal/mol. The cyan line is a least-squares fit of total energy to show drift. Slope is 1.39x10^-2 kcal/mol/ps for all degrees of freedom, i.e. 2.74x10^-5 kcal/mol/ps/dof. The red curve shows total kinetic energy for scale.

In order to test these approaches, we implemented the ab initio MTS methods

described above in a development version of TeraChem.38 We first simulated the

dynamics of an (H2O)57 cluster at the RHF/3-21G level of theory. In all calculations,

the cluster was confined to a density of 1g/mL by applying a spherical quadratic

repulsive potential to oxygen atoms beyond the sphere’s radius. The barrier potential

was included in the inner time step of the MTS trajectories due to its negligible


computational cost compared to the ab initio force evaluation. The system was first

equilibrated for 5 ps using a standard Velocity Verlet36 integrator with the Bussi-

Parrinello thermostat.39 During equilibration we used an MD time step of 1.0 fs, the

target temperature was 350K, and the thermostat time constant was 100 fs.

After equilibration, we collected a series of 21ps micro-canonical (NVE)

trajectories using a range of time steps for each of the MTS-FRAG, MTS-LJFRAG,

and MTS-CASE integrators outlined above. All simulations were started from

identical initial conditions. Baseline dynamics using the standard Velocity Verlet

integrator were collected with time steps between 0.5 and 1.5fs, since higher values

lead to unstable trajectories. Ab initio MTS trajectories used an inner time step of

0.5fs, with outer time steps ranging from 1.5 to 3.0fs. For the MTS-CASE integrator,

two screening ranges were tested: ω = 0.33 Bohr-1, which corresponds to an effective screening distance of only 1.6 Å, and ω = 0.18 Bohr-1, which corresponds to 2.9 Å.

The drift in total energy over the course of a simulation provides a test for the

quality of an integrator. An energy curve for an MTS-LJFRAG simulation using an

outer time step of 2.5fs is shown in figure 2. The drift was extracted by performing a

linear fit of the total energy over the 21ps trajectory and is almost unnoticeable on the

scale of the natural kinetic energy fluctuations of the system. The extracted slope is

2.74x10^-5 kcal/mol/ps/dof, i.e. during the entire 21ps trajectory 5.75x10^-4 kcal/mol of

energy is added to each degree of freedom due to inaccuracies in the integration of the

equations of motion. To put this in perspective, this is equivalent to a 0.289 K rise in

temperature of the system over that time. Current AIMD trajectories are generally

limited to around 100ps and hence even on this time-scale the increase in temperature


due to the integrator would be just over 1K. This is likely to cause a negligible

difference in any desired properties and could be removed by an extremely gentle

thermostat.
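A sketch of how the drift and the corresponding effective temperature rise can be extracted from the total-energy time series (the array names are placeholders):

import numpy as np

KB_KCAL = 0.0019872  # Boltzmann constant, kcal/mol/K

def energy_drift(times_ps, total_energy_kcal, n_dof):
    """Least-squares drift of the total energy, per degree of freedom."""
    slope, _ = np.polyfit(times_ps, total_energy_kcal, 1)  # kcal/mol/ps, whole system
    per_dof = slope / n_dof                                # kcal/mol/ps/dof
    return per_dof, per_dof / KB_KCAL                      # also expressed in K/ps/dof

# For (H2O)57, n_dof = 3 * 171; a drift of 2.74e-5 kcal/mol/ps/dof over 21 ps deposits
# about 5.8e-4 kcal/mol per dof, i.e. roughly the 0.29 K temperature rise quoted above.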

Figure 3 compares the energy conservation of each MTS trajectory. All MTS

methods perform well up to an outer time step of 2.5fs, which is more than double the

maximum stable time step possible with a standard Velocity Verlet integrator. MTS-

CASE with ω = 0.33 is approaching the edge of acceptability. This is not surprising

considering the relatively short-range interactions that the difference potential

describes when the model force is screened so severely. Increasing the screening

distance (decreasing ω) allows MTS-CASE to be tuned to an acceptable level of

energy conservation. The success of the MTS integrators suggests that the inner model

forces accurately include the high frequency components of the interactions, e.g.

stretches and bends, leaving a smooth slowly varying force at the outer time step.

When the outer time step is increased to 3fs, the performance of all the MTS methods

degrades significantly.

Figure 3: Energy drift in units of Kelvin per degree of freedom per ps for a range of integration time steps. Drifts calculated as slope of least squares fit to total energy from a 21ps NVE trajectory simulation of (H2O)57 at the RHF/3-21G level of theory. All trajectories were started from the same initial conditions, generated by 5ps NVT simulation at 350K. MTS trajectories used an inner time step of 0.5fs. [Plot: absolute drift (K/dof/ps) vs. time step (fs) for Velocity Verlet, Fragment, LJ-Fragment, MTS-CASE (ω = 0.18), and MTS-CASE (ω = 0.33).]


The failure of all methods with a 3fs outer time step is not surprising since

MTS integrators such as RESPA are known to suffer from non-linear resonance

instabilities.40,41 These instabilities, which arise due to interactions between the fast

forces and long time steps, mean that the fastest forces in the system limit how

infrequently the slowest interactions must be calculated at the outer time step. In the

case of our system the fastest modes, due to OH stretches of the dangling hydrogen

atoms at the surface of the cluster, oscillate at 4000 cm-1 (figure 4), corresponding to a

time period of τ = 8.5 fs. For a harmonic oscillator one can show that the maximum stable outer time step is given by ΔT_max = τ/π, which yields a resonance limit on the

time step of 2.7fs.42,43 This matches our observations precisely, and suggests that the

present limitations of our method do not stem from inadequacy of the model

potentials. It has been shown that resonance instabilities can be effectively mitigated

using specially designed Langevin thermostats.41,44 The application of these

techniques here is beyond the scope of the present study, but should increase the outer

time step by a further factor of 2-4 fold and thus further improve the computational

speedups reported here.
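The resonance bound quoted above follows from a one-line estimate, reproduced here from the 4000 cm-1 stretch frequency:

import math

C_CM_PER_FS = 2.998e-5   # speed of light in cm/fs

def resonance_limit_fs(wavenumber_cm):
    """Maximum stable outer time step, dT_max = tau / pi, for a harmonic mode."""
    period_fs = 1.0 / (wavenumber_cm * C_CM_PER_FS)  # tau = 1 / (c * wavenumber)
    return period_fs / math.pi

print(resonance_limit_fs(4000.0))  # ~2.7 fs, consistent with the breakdown near 3 fs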


Figure 4: Power spectrum comparison between Velocity Verlet with 0.5fs time step (red), Velocity Verlet with 1.0fs time step (blue) and MTS-LJFRAG integrator with 2.5 and 0.5fs outer and inner time steps respectively (green). Spectra are based on 21ps NVE trajectories at the RHF/3-21G level of theory. System consists of 57 water molecules confined by a spherical boundary to a density of 1g/mL.

While energy conservation provides a test of the stability of MTS methods one

would also like to ensure that more subtle properties of the system remain unchanged.

For example, figure 4 shows the power spectrum of the system using 0.5fs and 1.0fs

time steps with a traditional Velocity Verlet integrator. Although in figure 3 we found

these both to provide acceptable energy conservation, the 1.0fs Verlet integrator shows

a clear frequency shift at high frequencies compared to the 0.5fs Verlet and MTS

integrators. Hence, although using standard Velocity Verlet may be stable, one should

take care regarding the properties obtained when large time-steps are employed.
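For reference, such spectra are commonly obtained from the Fourier transform of the atomic velocities (or of their autocorrelation function); the sketch below is a minimal version of that procedure under those assumptions, not the exact post-processing used here.

import numpy as np

def power_spectrum(velocities, dt_fs):
    """Crude vibrational power spectrum from a velocity time series.

    velocities : (n_steps, n_atoms, 3) array sampled every dt_fs femtoseconds.
    Returns wavenumbers (cm^-1) and normalized power.
    """
    v = velocities - velocities.mean(axis=0)
    power = (np.abs(np.fft.rfft(v, axis=0)) ** 2).sum(axis=(1, 2))  # sum atoms and xyz
    wavenumber = np.fft.rfftfreq(v.shape[0], d=dt_fs) / 2.998e-5    # 1/fs -> cm^-1
    return wavenumber, power / power.max()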


Figure 5: Power spectrum comparison between standard Velocity Verlet (red), and MTS trajectories. Upper box compares simple fragment (blue), and Lennard-Jones augmented fragment (green) methods. Lower box compares MTS-CASE integration using two values of omega, 0.33 (blue) and 0.18 (green), which represent 1.6 and 2.9Å cutoffs of the Coulomb potential respectively. Spectra are based on 21ps NVE trajectories at the RHF/3-21G level of theory. System consists of 57 water molecules confined by a spherical boundary to a density of 1g/mL. MTS integrators use 2.5fs and 0.5fs for outer and inner time steps respectively. Verlet integrator uses 0.5fs time step.

Figure 5 shows the power spectra from the MTS trajectories using a 2.5fs

outer time step and 0.5fs inner time step compared with that obtained using standard

Velocity Verlet with a 0.5fs time step. In all cases the agreement is very good, with the

peak positions being very well reproduced. Remarkably all MTS methods with a 2.5fs

outer time step capture the power spectrum better than a Verlet integrator using a 1.0fs

time step. Peak intensities show greater variation, though this is likely due to the large

statistical error bars on power spectra obtained from a single 21ps trajectory.

It is clear from figure 5 that the empirically fit terms in the LJFRAG model

improve on the simpler FRAG approach. This suggests that the unmodified TIP3P


force field is capturing some of the relevant high frequency intermolecular forces,

such as those from hydrogen bonds that are not included by the simpler monomer

fragment approach. Among the CASE spectra, the ω = 0.18 model shows only marginal improvement over the coarser ω = 0.33 cutoff. Both outperform the simple fragment

approach, and are roughly equivalent to the Lennard-Jones fragment model.

Figure 6: Power spectra resulting from Velocity Verlet integrated dynamics on the CASE (blue) and RHF (red) potential energy surfaces compared with MTS-CASE (green) integration. The CASE approximation uses a cutoff of 0.33 Bohr-1 (1.6 Å) for both the Verlet and MTS integrators. Verlet-RHF and MTS-CASE spectra are based on 21ps NVE trajectories using the 3-21G basis set. Verlet-CASE spectrum is based on a shorter 14ps trajectory at the same level of theory. The system consists of 57 water molecules confined by a spherical boundary to a density of 1g/mL. Outer and inner timesteps for the MTS integrator are 2.5fs and 0.5fs respectively. Verlet integrators use 0.5fs time steps.

It is also instructive to compare MTS-CASE dynamics with that obtained by

traditional (Velocity Verlet) integration on the CASE potential surface. Figure 6

compares the power spectrum obtained from a 14ps simulation of dynamics on the

CASE (ω = 0.33) potential energy surface using a 0.5fs Velocity Verlet integrator to that obtained from 0.5fs (RHF) Verlet and 2.5fs/0.5fs MTS-CASE (ω = 0.33). The

Verlet-CASE power spectrum is in pronounced disagreement with that obtained from

dynamics on the RHF surface. This is not surprising given the major quantitative


differences between the RHF and CASE potential surfaces shown in figure 1. It is

notable that the differences are most pronounced at the highest frequencies since one

might expect the O-H stretch to be relatively well preserved within CASE, given that

the Coulomb operator is unmodified at short ranges. However, this peak shows much

less structure in the CASE spectra and is significantly red-shifted from 4000 to

3600cm-1. Given these significant differences, it is remarkable that the MTS-CASE

trajectory gives a power spectrum close to that obtained on the RHF surface despite

using the CASE potential as the model for high-frequency force updates.

Thus far we have only considered water clusters, which do not undergo

covalent bond rearrangements on the simulated time-scale. However, one of the main

advantages of AIMD simulation is the ability for the system to undergo spontaneous

covalent bond breaking and formation during the simulation. Hence to evaluate the

ability of our MTS-CASE approach to describe bond rearrangements, we simulated

the proton transfer dynamics of a hydroxide ion solvated in a cluster of 64 water

molecules. The system was equilibrated using the same procedure and boundary

conditions as the (H2O)57 cluster described above. A 21ps microcanonical simulation

was run at the RHF/3-21G level of theory using the MTS-CASE (ω = 0.33) integrator. Figure 7 shows a 2ps window of the total trajectory. The energy drift of 1.54x10^-4 kcal/mol/ps/dof (0.077 K/ps/dof) was calculated by a linear fit to the entire 21ps

trajectory. For each frame the hydroxide oxygen atom was determined by assigning

each hydrogen atom to the nearest oxygen atom and then reporting the index of the

oxygen atom with a single assigned hydrogen (red line in figure 7).
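A sketch of the assignment rule just described (the index lists are placeholders for the oxygen and hydrogen atoms of the cluster):

import numpy as np

def hydroxide_oxygen(coords, o_indices, h_indices):
    """Assign each H to its nearest O and return the O holding exactly one H."""
    counts = {o: 0 for o in o_indices}
    for h in h_indices:
        dists = [np.linalg.norm(coords[h] - coords[o]) for o in o_indices]
        counts[o_indices[int(np.argmin(dists))]] += 1
    singles = [o for o, c in counts.items() if c == 1]
    return singles[0] if len(singles) == 1 else None  # None if the assignment is ambiguous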


Figure 7: Energy conservation of MTS-CASE integrator. Plot shows 2ps window from a longer 21ps trajectory. The NVE simulation was run at the RHF/3-21G level of theory after 5ps of NVT calibration to 350K. The inner and outer time steps were 0.5 and 2.5fs. The Coulomb attenuation parameter was 0.33 Bohr-1. The system was made up of a hydroxide ion solvated by 64 water molecules and was confined by a spherical barrier to a density of 1g/mL. A representative snapshot shows a proton transition between two oxygen atoms highlighted in blue. The blue curve shows the total energy in kcal/mol shifted by -3.10x10^6 kcal/mol. The green line gives a least squares fit of the total energy curve for the entire 21ps trajectory. Its slope records a drift of 8.7x10^-2 kcal/mol/ps for the entire system or 1.5x10^-4 kcal/mol/ps/dof. The red line indicates the index of the hydroxide oxygen atom for each time step. This was determined by assigning each hydrogen atom to its nearest oxygen neighbor, and then reporting the oxygen with a single assigned hydrogen. The magenta curve shows the total kinetic energy for scale.

As shown in the upper panel of figure 7, a proton oscillates primarily between

two oxygen atoms (shown in blue in the lower panel). During the entire 21ps

trajectory we observed over 500 proton transfer events. However, the total energy drift

is comparable to that reported above for the nonreactive (H2O)57 cluster. This is

remarkable on two counts. First, it demonstrates that the MTS scheme is applicable

even when bonds are being formed and broken – a difficult task for empirical force

fields and also for the fragment-based MTS-FRAG and MTS-LJFRAG schemes


described above. Second, the CASE approach we are using here is the more aggressively screened of the two we have considered. With a screening factor of ω = 0.33 bohr-1, the Coulomb interaction between any two charged particles is already

being attenuated at a distance of 1.6Å. Hence, in the inner time step, the proton barely

interacts with any electronic orbitals other than those on the two nearest heavy atoms.

All other interactions are corrected in the outer time step, which demonstrates

remarkable robustness by its ability to maintain the energy conservation observed

above.

Table 1: Performance of MTS Lennard-Jones fragment compared to standard Velocity Verlet integrator. The system consists of 120 water molecules confined by a spherical boundary to a density of 1g/mL. The simulation was run at the RHF/6-31G level of theory using a single NVIDIA Tesla C2070 GPU. Step sizes for MTS integrator refer to the outer time step; 0.5fs inner time steps were used throughout. Timings are averaged over 100 MD steps and are in units of wall-time seconds per outer step. Speedups are computed by comparison to the 0.5fs Velocity Verlet integrator.

Integrator        Time Step (fs)   Time/Step (sec)   Steps/Day   fs/day   Speedup
Velocity Verlet   0.5              430               201         100      1.0
Velocity Verlet   1.0              422               204         204      2.0
MTS-LJFrag        1.5              460               188         282      2.8
MTS-LJFrag        2.0              471               183         367      3.7
MTS-LJFrag        2.5              485               178         446      4.4

Finally, we consider the computational efficiency of our ab initio MTS

approach based on our initial implementation in TeraChem. Table 1 summarizes the

performance of the Lennard-Jones fragment MTS scheme relative to a standard

Velocity Verlet integrator with a time step of 0.5fs. Even though the Velocity Verlet

integrator is stable with a 1fs time step (figure 3), such a large time step noticeably

alters the dynamics, as demonstrated by the shift of the high frequency peak in the

power spectrum (figure 4). Hence, Velocity Verlet with a 0.5fs time step is the

appropriate comparison. These calculations were carried out with a larger (H2O)120

cluster treated at the RHF/6-31G level of theory and confined to a density of 1g/mL by


a spherical boundary. All calculations used a single Tesla C2070 GPU. Because the

outer time step still requires a full SCF gradient evaluation, the best that can be

achieved is a 5x speedup, if we compare an MTS integrator with a 2.5fs outer time

step to a standard Velocity Verlet method with a time step of 0.5fs. We are able to

achieve up to 4.4x speedup, which is over 88% efficient.
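The efficiency figure follows directly from the entries of Table 1; a quick check under that reading of the table:

# Wall time per simulated fs, from Table 1 (seconds per step / step size in fs).
t_verlet_per_fs = 430.0 / 0.5   # 0.5 fs Velocity Verlet
t_mts_per_fs = 485.0 / 2.5      # 2.5 fs outer MTS-LJFrag
speedup = t_verlet_per_fs / t_mts_per_fs
print(speedup, speedup / 5.0)   # ~4.4x, i.e. over 88% of the ideal 5x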

As with the fragment models, the CASE approximation leads to linear scaling

computational effort in evaluating the inner time steps.32 This is now due to improved

screening of electron repulsion integrals afforded by a short-range Coulomb operator.

The present experimental code does not yet exploit the improved CASE screening.

However, previous work has shown that CASE at the screening levels employed here

offers significant computational speed-ups.32

CONCLUSION

In this paper we have demonstrated new approaches that allow MTS

integrators to be applied generally to AIMD calculations. We exploited the ability of

the outer MTS time steps to correct low frequency modeling errors within the inner

time steps. Thus we were able to employ drastically simplified short-range

approximations to the ab initio forces in the inner time steps. Despite these

computational savings, the resulting methods remain robust, exhibiting symplecticity

and time-reversibility and providing excellent energy conservation and even improved

dynamical properties compared to Verlet integrators with moderately large time steps.

As with single-step integrators our ab initio MTS methods can be systematically

improved by reducing the time steps or, for MTS-CASE, also by increasing the

screening distance.


The model forces used here are inspired by linear scaling approximations.

However, while linear scaling methods attempt to globally capture all significant

interactions, MTS model forces need only represent the highest frequencies within the

system, and even here low frequency differences between the model and ab initio

systems can be tolerated. This means that much looser thresholds can be employed in

the case of MTS model forces. Thus, even for systems too small to reach the crossover

point for traditional linear scaling methods, our MTS scheme can provide a significant

speedup. For larger systems where linear scaling approaches are applicable, the MTS

approaches introduced here can still be applied, since their looser thresholds should

allow the model forces to be computed more cheaply than the more accurate linear-

scaling gradient.

Other model forces are obviously possible. For example, our use of the TIP3P

force field suggests a completely empirical model force. However, the benefits of such

models are limited by the cost of the global gradient calculation in the outer time step.

In the present case, reducing the computational effort of the inner time step would at

best yield a 13% speedup over the fragment approaches described above. Thus, future

work should focus on extending the outer time step and on reducing the cost of the

global gradient evaluation. For example, employing existing Langevin thermostats44 to

remove spurious resonance effects should allow the outer time step to be extended to

10 or even 20 fs and provide greater performance gains.

REFERENCES

(1) Tuckerman, M. E.; Berne, B. J. The Journal of Chemical Physics 1991, 95, 8362.


(2) Tuckerman, M. E.; Berne, B. J.; Martyna, G. J. The Journal of Chemical Physics 1991, 94, 6811.

(3) Tuckerman, M. E.; Berne, B. J.; Martyna, G. J. The Journal of Chemical Physics 1992, 97, 1990.

(4) Tuckerman, M. E.; Berne, B. J.; Rossi, A. The Journal of Chemical Physics 1991, 94, 1465.

(5) Tuckerman, M. E.; Martyna, G. J.; Berne, B. J. The Journal of Chemical Physics 1990, 93, 1287.

(6) Grubmuller, H.; Heller, H.; Windemuth, A.; Schulten, K. Mol. Sim. 1991, 6, 121.

(7) Streett, W. B.; Tildesley, D. J.; Saville, G. Mol. Phys. 1978, 35, 639.

(8) Carloni, P.; Rothlisberger, U.; Parrinello, M. Acc. Chem. Res. 2002, 35, 455.

(9) VandeVondele, J.; Sulpizi, M.; Sprik, M. Ang. Chem. Int. Ed. 2006, 45, 1936.

(10) Virshup, A. M.; Punwong, C.; Pogorelov, T. V.; Lindquist, B. A.; Ko, C.; Martinez, T. J. J. Phys. Chem. B 2009, 113, 3280.

(11) Ben-Nun, M.; Martinez, T. J. Adv. Chem. Phys. 2002, 121, 439.

(12) Barnett, R. N.; Landman, U.; Nitzan, A.; Rajagopal, G. J. Chem. Phys. 1991, 94, 608.

(13) Leforestier, C. J. Chem. Phys. 1978, 68, 4406.

(14) Payne, M. C.; Teter, M. P.; Allan, D. C.; Arias, T. A.; Joannopoulos, J. D. Rev. Mod. Phys. 1992, 64, 1045.

(15) Car, R.; Parrinello, M. Phys. Rev. Lett. 1985, 55, 2471.

(16) Tuckerman, M. E.; Ungar, P. J.; von Rosenvinge, T.; Klein, M. L. J. Phys. Chem. 1996, 100, 12878.

(17) Hartke, B.; Gibson, D. A.; Carter, E. A. Int. J. Quant. Chem. 1993, 45, 59.

(18) Gibson, D. A.; Carter, E. A. J. Phys. Chem. 1993, 97, 13429.

(19) Tuckerman, M. E.; Parrinello, M. J. Chem. Phys. 1994, 101, 1316.

(20) Steele, R. P. J. Chem. Phys. 2013, 139, 011102.

(21) Yang, W. T. Physical Review Letters 1991, 66, 1438.


(22) Yang, W. T. Physical Review A 1991, 44, 7823.

(23) Gordon, M. S.; Freitag, M.; Bandyopadhyay, P.; Jensen, J. H.; Kairys, V.; Stevens, W. J. J. Phys. Chem. A 2001, 105, 293.

(24) Gordon, M. S.; Smith, Q. A.; Xu, P.; Slipchenko, L. V. Ann. Rev. Phys. Chem. 2013, 64, 553.

(25) He, X.; Zhang, J. Z. H. J. Chem. Phys. 2006, 124, 184703.

(26) Steinmann, C.; Fedorov, D. G.; Jensen, J. H. PLOS One 2013, 8, e60602.

(27) Xie, W.; Orozco, M.; Truhlar, D. G.; Gao, J. J. Chem. Theo. Comp. 2009, 5, 459.

(28) Fedorov, D. G.; Nagata, T.; Kitaura, K. Phys. Chem. Chem. Phys. 2012, 14, 7562.

(29) Pruitt, S. R.; Addicoat, M. A.; Collins, M. A.; Gordon, M. S. Phys. Chem. Chem. Phys. 2012, 14, 7752.

(30) Collins, M. A. Phys. Chem. Chem. Phys. 2012, 14, 7744.

(31) Adamson, R. D.; Dombroski, J. P.; Gill, P. M. W. Chem. Phys. Lett. 1996, 254, 329.

(32) Adamson, R. D.; Dombroski, J. P.; Gill, P. M. W. J. Comp. Chem. 1999, 20, 921.

(33) Suzuki, M. Commun Math Phys 1976, 51, 183.

(34) Trotter, H. F. Proc. Amer. Math. Soc. 1959, 10, 545.

(35) Sanz-Serna, J. M.; Calvo, M. P. Numerical Hamiltonian Problems; Chapman and Hall: London, 1994.

(36) Swope, W. C.; Andersen, H. C.; Berens, P. H.; Wilson, K. R. J. Chem. Phys. 1982, 76, 637.

(37) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. The Journal of Chemical Physics 1983, 79, 926.

(38) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2009, 5, 2619.

(39) Bussi, G.; Donadio, D.; Parrinello, M. J. Chem. Phys. 2007, 126, 014101.

(40) Biesiadecki, J. J.; Skeel, R. D. J. Comp. Phys. 1993, 109, 318.


(41) Ma, Q.; Izaguirre, J. A.; Skeel, R. D. Siam J Sci Comput 2003, 24, 1951.

(42) Barth, E.; Schlick, T. J. Chem. Phys. 1998, 109, 1633.

(43) Han, G. W.; Deng, Y. F.; Glimm, J.; Martyna, G. Comp. Phys. Comm. 2007, 176, 271.

(44) Morrone, J. A.; Markland, T. E.; Ceriotti, M.; Berne, B. J. J. Chem. Phys. 2011, 134, 014103.


CHAPTER SEVEN

INTERACTIVE AB INITIO MOLECULAR DYNAMICS

The previous chapter was focused on extending AIMD to large systems, where

long time scales are important. In this chapter we explore the impact of accelerated

AIMD applied to small systems containing up to a few dozen atoms. In this regime,

the steady advance of computers could soon transform the basic models used to

understand and describe chemistry. In terms of quantitative models in general,

researchers no longer seek human comprehensible, closed-form equations such as the

ideal gas law. Instead algorithmic models that are evaluated through intensive

computer simulations have become the norm. As a few examples, ab initio electronic

structure, molecular dynamics (MD), and minimum energy reaction path optimizers

are now standard tools for describing chemical systems.

Despite their often-impressive accuracy, quantitative models sometimes

provide surprisingly little scientific insight. For example, individual MD trajectories

are chaotic and can be just as inscrutable to human understanding as the physical

experiment they seek to elucidate. Qualitative cartoon-like models, it seems, are

essential to inspire human imagination and satisfy our curiosity. As an illustration,

consider the pervasive use of primitive ball-and-stick type molecular models in

chemistry. The success of these physical models lies as much in their ability to capture

human imagination and support interesting geometric questions as it does in their

inherent realism. Useful models must be both accurate and playful.


Fortunately, computers are capable of much more than crunching numbers for

quantitative models. With the development of computer graphics and the explosion of

immersive gaming technologies, computers also provide a powerful platform for

human interaction. Starting with works by Johnson1 and Levinthal2 in the 1960’s,

molecular viewers were developed first to visualize and manipulate x-ray structures

and later to visualize the results of MD simulations as molecular movies. The next

goal was to allow researchers to interact with realistic physical simulations in real time

as they ran. Along these lines, the Sculpt project provided an interactive geometry

optimizer that included a user-defined spring force in a modified steepest descent

optimizer.3 By furnishing the molecular potential from a classical force field further

simplified with rigid bonds and strict distance cutoffs for non-bonded interactions,

Sculpt could achieve real-time calculation rates for protein systems containing up to

eighty residues, which was certainly impressive at the time.

Later work replaced Sculpt’s geometry optimizer with a MD kernel.4,5 Rather

than being limited to minimum energy structures, the user could then probe dynamical

behavior of protein systems, watching the dynamics trajectory unfold in real time and

insert arbitrary spring forces to steer the dynamics in any direction of interest. Force-

feedback devices have also been used to control molecular tugs.6 These allow users to

feel as well as see the molecular interactions and increase the precision of user control.

Of course, arbitrary user interaction is a (sometimes large) source of energy flow into

the simulation. Aggressive thermostats are necessary to reduce this heating.4 As a

consequence, the results of interactive dynamics are not immediately applicable, for

example, in calculating statistical properties. However, for small forces the trajectories


will explore phase space regions that are still relevant to dynamics on standard

ensembles, and thus offer many qualitative mechanistic insights.

Classical force fields suffer from two disadvantages that hamper their

application to general-purpose chemical modeling. First they are empirically tuned,

and as a result are valid only in a finite region of configuration space. This is

particularly problematic for interactive simulations, where the user is free to pull the

system into highly distorted configurations. The second, more important disadvantage

is that covalent bonds modeled by classical springs cannot be rearranged during the

simulation. Ab initio forces, calculated from first principles electronic structure theory,

do not suffer from these disadvantages and provide ideal potentials for use with

interactive dynamics. However, due to prohibitive computational costs, it remains

difficult to evaluate ab initio forces at the fast rate necessary to support real-time

dynamics.

Recently, a divide-and-conquer implementation of the semi-empirical atom

superposition and electron delocalization molecular orbital (ASED-MO) model has

been developed to support real-time energy minimization and MD in the SAMSON

program.7,8 SAMSON offers impressive performance, able to perform real-time

calculations on systems containing up to thousands of atoms. User interaction is

implemented by alternating between “user action steps” in which the user moves or

inserts atoms, and standard optimization or MD steps. In order to keep the

computational complexity manageable, SAMSON applies several approximations

beyond those inherent in ASED-MO. The global system is split into overlapping sub-

problems that are solved independently in parallel. Also for large systems, distant


atoms are frozen so that only forces for atoms in the region of user interaction need to

be calculated in each step.

Several previous attempts at full ab initio interactive dynamics have been

reported. In early work, Marti and Reiher avoided on the fly electronic structure

calculations by using an interpolated potential surface.9 The potential surface is pre-

calculated over some relevant configuration space. Then during dynamics, the force at

each time step is obtained from a simple moving least-squares calculation. However, it

is difficult to predict a priori what regions of configuration space will be visited. A

partial solution is to periodically add additional interpolation points where and when

higher accuracy is desired.10 However, the number of needed interpolation points

grows exponentially with the dimensionality of the system. Thus, for non-trivial

systems, it is essential to evaluate the ab initio gradient on the fly. Recently, the feasibility of such calculations has been tested using standard packages and tools. Combining the Turbomole DFT package, minimal basis sets, effective core potentials, and a quad-core processor, Haag and Reiher achieved update rates on the order of a second for systems containing up to 8 atoms.11

In the following we present the results of our own implementation of

interactive ab initio dynamics. By using the GPU accelerated TeraChem12 code and

carefully streamlining the calculation, interactive simulations are possible for systems

up to a few dozen atoms. The final result is a virtual molecular modeling kit that

combines intuitive human interaction with the accuracy and generality of an ab initio

quantum mechanical description.


METHOD

The ab initio interactive MD (AI-IMD) system described below is based on the

interactive MD (IMD) interface that was previously developed to enable interactive

steered molecular dynamics in the context of classical force fields.6 A high level

overview of the original scheme is shown in figure 1. Molecular visualization and

management of the haptic (or “touch”) interface is handled by VMD.13 Along with the

current molecular geometry, VMD displays a pointer that the user controls through a

3D haptic device (shown in figure 5). Using the pointer, the user can select and tug an atom, feeling the generated force through feedback to the haptic device. VMD also

sends the haptic forces to a separate MD program, in our case TeraChem,12 where they

are included with the usual ab initio gradient in the following haptic-augmented force.

F(R, t) = -\nabla E_{\mathrm{qm}}(R) + F^{\mathrm{hap}}(t) \qquad (7.1)

After integrating the system forward in time, TeraChem returns updated coordinates

for display in VMD.

Figure 1: Schematic representation of the IMD interface previously developed for classical MD calculations. VMD is responsible for visualization while TeraChem performs AIMD calculations in real time.

A major advantage of the IMD scheme is that it uses spring forces to cushion

the user’s interaction with the system, rather than raw position updates. This allows

the user to add weak biases that do not totally disrupt the initial momentum of the


system. It also avoids severe discontinuities that would destroy the numerical

stability of standard MD integrators. As a result, the system’s dynamics and energy are

always well-defined, albeit for a time-dependent Hamiltonian, and the magnitude of

haptic perturbations can in principle be accurately measured and controlled.

Communication across the IMD interface is asynchronous. In VMD the render

loop does not wait for updated coordinates between draws, and force updates are sent

to TeraChem continuously as they happen rather than waiting for the next MD time-

step. Similarly, TeraChem does not wait to receive haptic force updates between time

steps. This scheme was designed to minimize communication latencies.6

Asynchronous communication also logically decouples the software components, allowing each to operate on generic streams of coordinates and simplifying the process of adapting the system from classical to ab initio MD, as detailed below.

Due to the considerations above, the IMD approach provides a robust starting

point for AI-IMD. However, several adjustments to the classical IMD approach are

needed to accommodate ab initio calculations. These are detailed in the following

sections.

Simulation Rate

A primary benefit of interactive modeling is that molecular motions, as well as

static structures, can be intuitively represented and manipulated. Thus it is critical to

maintain a sensation of smooth motion. To achieve this, past research has targeted

simulation rates of at least ten or twenty MD steps per second.4,5,8 Such update rates

are comparable to video frame rates and certainly result in smooth motion. At present,

however, quantum chemistry calculations requiring less than fifty milliseconds are


possible only for trivially small systems. In order to reach larger molecules and basis

sets, it is important to decouple graphical updates from the underlying MD time steps.

Ultimately, the necessary simulation rate is dictated not by graphics considerations,

but by the time scale of the motion being studied. The goal of interactive MD is to

shift the movement of atoms to the time scale of seconds on which humans naturally

perceive motion. For molecular vibrations, this requires simulating a few

femtoseconds of molecular trajectory per second of wall time. Assuming a 1 fs time step and no more than about five simulated femtoseconds per wall-clock second, each gradient evaluation is then allowed at least 200 ms to execute. Experiments

show that up to a full second between ab initio gradient evaluations is workable,

though the resulting dynamics become increasingly sluggish.

In addition to high performance, an interactive interface requires a uniform

simulation rate. Each second of displayed trajectory must correspond to a standard

interval of simulated time. This is critical both to convey a visual sensation of smooth

motion as well as to provide consistent haptic physics. For example, the effective

power input of a given haptic input will increase with the simulation rate, since the

same duration of pull in wall time translates into longer pulls in simulated time.6

Problematically, the effort needed to evaluate the ab initio gradient varies widely

depending on the molecular coordinates. For many geometries, such as those near

equilibrium, the SCF equations can be converged in just a few iterations by using

guess orbitals from previous MD steps. For strongly distorted geometries, however,

hundreds of SCF iterations are sometimes required. Even worse, the SCF calculation may diverge, causing the entire calculation to abort. Since users tend to drive the system away from equilibrium, difficult-to-converge geometries are more common in

interactive MD than in dynamics run on traditional ensembles (see figure 2). To

handle these distorted geometries, we employ the very robust ADIIS+DIIS

convergence accelerator.14,15 However, for well-behaved geometries, ADIIS+DIIS was

found on average to require more iterations than DIIS alone. Thus, the best approach

is to converge with standard DIIS for up to ~20 iterations and to switch over to ADIIS+DIIS only when DIIS fails.
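The switching logic itself is simple. The following is a minimal sketch of the DIIS-first, ADIIS+DIIS-fallback strategy just described; the iteration callbacks, the convergence measure, and the choice to restore the previous-step guess before switching are illustrative assumptions, not TeraChem's actual control flow.

#include <functional>

// Try plain DIIS first; if it has not converged within diis_limit iterations,
// restart the SCF with the more robust ADIIS+DIIS accelerator.
bool converge_scf(const std::function<double()>& diis_step,        // one DIIS iteration; returns max |SDF - FDS|
                  const std::function<double()>& adiis_diis_step,  // one ADIIS+DIIS iteration (same convention)
                  const std::function<void()>&   reset_guess,      // restore the previous-MD-step density guess
                  double tol = 2.0e-5, int diis_limit = 20, int max_iter = 200)
{
    // Phase 1: plain DIIS, cheap and usually sufficient near equilibrium.
    for (int it = 0; it < diis_limit; ++it)
        if (diis_step() < tol) return true;

    // Phase 2: DIIS has stalled (distorted geometry); restart with ADIIS+DIIS.
    reset_guess();
    for (int it = 0; it < max_iter; ++it)
        if (adiis_diis_step() < tol) return true;

    return false;    // caller may abort or hold the MD step until the SCF recovers
}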

Figure 2: Histogram of wall times for 1000 steps of MD run with user interaction (red) and without haptic input (blue). The system was the uncharged imidazole molecule pictured. The simulation used the unrestricted Hartree-Fock electronic structure method with the 3-21G basis. The SCF at each step was converged to 2.0e-5 in the maximum element of the commutator SDF-FDS.

In order to establish a fixed simulation rate we consider the variance in timings

for individual MD steps illustrated for a particular system in figure 2. Noting that the

large variance in MD step timings is primarily the result of a few outliers, the target

wall time for a simulation step, T_wall, can be set far below the worst-case gradient evaluation time. For the vast majority of steps, the ab initio gradient then completes before T_wall has elapsed, and the MD integrator must pause briefly to allow the display

to keep pace. For pathologically long MD steps, such as those requiring many ADIIS

iterations, the displayed trajectory is frozen and haptic input ignored until the MD step

completes.
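A minimal sketch of this pacing scheme is shown below, assuming a hypothetical md_step() callback that performs one full AI-IMD step; it simply sleeps away whatever portion of T_wall the gradient did not use, while pathologically slow steps overrun and leave the display frozen.

#include <chrono>
#include <functional>
#include <thread>

void run_paced_md(const std::function<void()>& md_step,
                  std::chrono::milliseconds t_wall, int n_steps) {
    using clock = std::chrono::steady_clock;
    for (int n = 0; n < n_steps; ++n) {
        const auto start = clock::now();
        md_step();                                   // ab initio gradient + integration (variable cost)
        const auto elapsed = clock::now() - start;
        if (elapsed < t_wall)                        // finished early: pause so every step
            std::this_thread::sleep_for(t_wall - elapsed);   // maps onto the same wall-time interval
        // otherwise the step overran T_wall; the displayed trajectory simply freezes meanwhile
    }
}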


Integrating the haptic force

A key strength of AI-IMD is that it allows users to make and break bonds

between atoms. However, this level of control requires haptic forces that are stronger

than the bonding interactions between atoms. The situation in classical IMD is very

different. Classical force fields in general cannot handle bond reorganization. Thus,

IMD was originally conceived to control only weak non-bonded interactions. In

practice, the inclusion of this strong haptic force in equation (7.1) can produce

noticeable energetic artifacts in the MD simulation. For example, an unbound atom

vibrating around a fixed haptic pointer might visibly gain amplitude with each

oscillation. Such non-conservative dynamics result in rapid heating of the system that

quickly swamps the user’s control. The cause is simply that the MD time step that

would appropriately integrate a closed system is too long to accurately handle the

stronger haptic forces. An obvious solution is to use a shorter time step. However,

reducing the time step would adversely slow the simulation rate, severely reducing the

size of systems that can be modeled interactively. As the system becomes more

sluggish, the user will also tend to increase the haptic force, exacerbating the problem.

A more elegant solution is to use a multiple time step integrator, such as

reversible RESPA,16 to separate the haptic forces from the ab initio interactions. The

weaker ab initio forces can then use a longer MD time step and be evaluated less

frequently (in wall time) than the stronger haptic forces. Between each ab initio

update, l sub-steps are used to accurately integrate the haptic force as follows.

\begin{aligned}
&V_i^{(n+1/2,\,0)} \leftarrow V_i^{(n,\,l)} + A_i^{(n)}\,\frac{\Delta t}{2} \\
&\left.
\begin{aligned}
V_i^{(n+1/2,\,m+1/2)} &\leftarrow V_i^{(n+1/2,\,m)} + \frac{F_i^{\mathrm{hap}}(t_{n,m})}{2m_i}\,\delta t \\
X_i^{(n,\,m+1)} &\leftarrow X_i^{(n,\,m)} + V_i^{(n+1/2,\,m+1/2)}\,\delta t \\
V_i^{(n+1/2,\,m+1)} &\leftarrow V_i^{(n+1/2,\,m+1/2)} + \frac{F_i^{\mathrm{hap}}(t_{n,m+1})}{2m_i}\,\delta t
\end{aligned}
\right\} \quad m = 0,\dots,l-1 \\
&X_i^{(n+1,\,0)} \leftarrow X_i^{(n,\,l)} \\
&A_i^{(n+1)} \leftarrow -\frac{\nabla_i E_{\mathrm{qm}}\!\left(X^{(n+1,\,0)}\right)}{m_i} \\
&V_i^{(n+1,\,0)} \leftarrow V_i^{(n+1/2,\,l)} + A_i^{(n+1)}\,\frac{\Delta t}{2}
\end{aligned}
\qquad (7.2)

Here \Delta t is the MD time step between ab initio force updates, while the inner time step, \delta t = \Delta t / l, governs the interval between haptic force evaluations. A is the acceleration due only to the ab initio forces, while F^{\mathrm{hap}}(t_{n,m}) is the haptic force vector evaluated at time t_{n,m} = n\Delta t + m\,\delta t. Compared to the computational costs of an

electronic structure calculation, it is trivial to integrate the haptic force in each sub-

step. Thus, the MTS scheme runs at the full speed of the simpler velocity Verlet

algorithm.
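To make the update pattern of equation (7.2) concrete, the sketch below implements one reversible-RESPA-style step under the stated splitting. The callbacks haptic_force() and ab_initio_accel() are hypothetical stand-ins for the spring force and the TeraChem gradient, and a single mass is used for brevity; this is an illustration of the scheme, not the production integrator.

#include <functional>
#include <vector>

struct State { std::vector<double> x, v, a; double mass; };   // a single mass for brevity

using Field = std::function<std::vector<double>(const std::vector<double>&, double)>;

void mts_step(State& s, double dt, int l, double t0,
              const Field& haptic_force, const Field& ab_initio_accel) {
    const double ddt = dt / l;                       // inner (haptic) time step, delta t = dt / l
    const size_t n = s.x.size();

    for (size_t i = 0; i < n; ++i)                   // half-kick from the slow ab initio acceleration
        s.v[i] += 0.5 * s.a[i] * dt;

    for (int m = 0; m < l; ++m) {                    // l velocity-Verlet sub-steps for the haptic force
        std::vector<double> f = haptic_force(s.x, t0 + m * ddt);
        for (size_t i = 0; i < n; ++i) s.v[i] += 0.5 * ddt * f[i] / s.mass;
        for (size_t i = 0; i < n; ++i) s.x[i] += ddt * s.v[i];
        f = haptic_force(s.x, t0 + (m + 1) * ddt);
        for (size_t i = 0; i < n; ++i) s.v[i] += 0.5 * ddt * f[i] / s.mass;
    }

    s.a = ab_initio_accel(s.x, t0 + dt);             // one expensive gradient per outer step
    for (size_t i = 0; i < n; ++i)                   // closing half-kick from the new slow acceleration
        s.v[i] += 0.5 * s.a[i] * dt;
}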

A second difficulty arises when incorporating strong haptic forces in MD. As

shown in figure 1 above, in the classical IMD scheme, VMD is responsible for

calculating the haptic force based on the currently displayed positions of the targeted

atom and haptic pointer. However, due to communication latencies and the time

required for VMD to complete each graphical update, the forces received by

TeraChem lag the simulated trajectory by at least 7 ms. Importantly, this lag shifts the phase of the forces so that they resonantly drive the system's nuclear motion. The result is again uncontrolled

heating of the haptic vibrational mode. The solution is to modify the IMD scheme so

that haptic positions, rather than pulling forces, are sent to TeraChem. TeraChem can

then calculate the haptic forces at the correct instantaneous molecular geometry.
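For reference, the haptic coupling itself is just a spring between the tugged atom and the pointer, so evaluating it inside TeraChem at the instantaneous geometry is trivial. The snippet below is a minimal sketch of that force; the bare Hookean form, with no cutoff or damping, is an assumption of the sketch.

#include <array>

using Vec3 = std::array<double, 3>;

// F_hap = -k (x_atom - x_pointer): a Hookean pull of the selected atom toward the pointer.
Vec3 haptic_spring_force(const Vec3& x_atom, const Vec3& x_pointer, double k_spring) {
    Vec3 f{};
    for (int d = 0; d < 3; ++d)
        f[d] = -k_spring * (x_atom[d] - x_pointer[d]);
    return f;
}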

Decoupling display from simulation

Having developed a robust MD integrator for interactive simulations, we now

consider how to best display molecular motion to the user. In order to maintain the

visual sensation of smooth motion, the displayed coordinates must be updated with a

minimum frequency of ~20Hz. Assuming, as above, that it requires several hundred

milliseconds of wall time to evaluate each MD step, multiple visual frames must be

generated for each simulated MD step. To accomplish this we distinguish between the

simulated system which consists of the usual coordinates, velocities, and accelerations

at each time step, i, and a separate display system, !X(t) , which is continuous in time

and closely follows the simulated system, !X(i!t) " Xi .

By separating the simulation and display problems, the numerical integrity of

the overall model can be maintained while maximizing interactivity. Robust MD

integrators guarantee the numerical stability of the simulation and provide well-

defined physical properties, such as potential and kinetic energies, at each MD time

step. At the same time, the displayed system is free to compromise accuracy for

additional responsiveness. This is advantageous because the tolerances of human

perception are much more forgiving than those required for stable numerical

integration.


Figure 3: Schematic of simulated and visualized systems. Visualized frames (right boxes) lag the simulated coordinates (left boxes) by one ab initio gradient evaluation. The visualized coordinates are interpolated between MD time steps, but the simulated and visualized coordinates match after each step. The visualized haptic position is read from the device at each visual frame, but is sampled by the simulation once per MD step. The l MTS substeps of equation 7.2 calculate the haptic forces from a common haptic position.

Consider a simplified case in which the simulated trajectory is integrated using

the velocity Verlet integrator.17 In each MD step, the simulated system is propagated

\Delta t femtoseconds forward in simulation time as follows.

Step 1:\quad X^{(n+1)} = X^{(n)} + V^{(n)}\,\Delta t + \tfrac{1}{2}A^{(n)}\,\Delta t^{2}

Step 2:\quad A_i^{(n+1)} = \frac{-\nabla_i E_{\mathrm{ai}}\!\left(X^{(n+1)}\right) + F_i^{\mathrm{hap}}\!\left((n+1)\Delta t\right)}{m_i}

Step 3:\quad V^{(n+1)} = V^{(n)} + \frac{A^{(n)} + A^{(n+1)}}{2}\,\Delta t \qquad (7.3)

Here the evaluation of the ab initio gradient dominates the wall time required to

compute each MD step, T_wall. Latency between haptic inputs and the system's response would be minimized by immediately displaying each coordinate vector, X^{(n+1)}, as it is calculated in step 1. However, in this case displaying further motion


during the time-consuming calculation of A^{(n+1)} in step 2 would require extrapolation forward in time. In practice, such extrapolation leads to noticeable artifacts as the simulated and displayed coordinates diverge. An alternative is to buffer the displayed trajectory by one MD step. Thus X^{(n)} is displayed while X^{(n+1)} is calculated in step 1, and the display can then interpolate toward a known X^{(n+1)} during the succeeding gradient

evaluation. In this way, smooth motion is achieved. The distinction between simulated

and visualized systems is illustrated in figure 3.

A variety of interpolation schemes are possible for \tilde{X}. For example, linear interpolation would give the following trajectory.

\tilde{X}(nT_{wall} + t) = X^{(n)} + \frac{X^{(n+1)} - X^{(n)}}{T_{wall}}\,t \qquad (7.4)

Here t is given in wall time. While equation (7.4) provides continuous molecular

coordinates, the visualized velocity jumps discontinuously by

\Delta\tilde{V} = \frac{A^{(n+1)}\,\Delta t^{2}}{T_{wall}} \qquad (7.5)

after each MD step. These velocity jumps can be reduced to

\Delta\tilde{V} = \frac{\left(A^{(n)} - A^{(n-1)}\right)\Delta t^{2}}{2\,T_{wall}} \qquad (7.6)

by continuously accelerating the display coordinates over the interpolated path as

follows.

\tilde{X}(nT_{wall} + t) = X^{(n)} + V^{(n)}\,\frac{\Delta t}{T_{wall}}\,t + \frac{1}{2}A^{(n)}\,\frac{\Delta t^{2}}{T_{wall}^{2}}\,t^{2} \qquad (7.7)

This trajectory is again continuous in coordinate space, and additionally, in the special

case of a constant acceleration, the velocity is also continuous between MD steps.

Since for MD simulations, molecular forces change gradually from step to step,

equation (7.7) is sufficient to reduce velocity discontinuities below the threshold of

human perception.

The simple approach of equation (7.7) is easily extended to the more

complicated MTS integrator developed above. For example, the displayed trajectory

can be defined piecewise between sub-steps as follows.

\tilde{X}\!\left(\left(n + \tfrac{m}{l}\right)T_{wall} + t\right) = X^{(n,m)} + V^{(n,m)}\,\frac{\Delta t}{T_{wall}}\,t + \frac{1}{2}A^{(n,m)}\,\frac{\Delta t^{2}}{T_{wall}^{2}}\,t^{2} \qquad (7.8)

However, since the ab initio gradient “kicks” the velocity only at the outer time step,

equation (7.8) corresponds to a linear interpolation in the ab initio forces similar to

equation (7.4). Amortizing the ab initio acceleration over the time step, while applying

each haptic acceleration over its own sub-step, produces a smoother trajectory.

\tilde{X}\!\left(\left(n + \tfrac{m}{l}\right)T_{wall} + t\right) = X^{(n,0)} + V^{(n,0)}\,\frac{\Delta t}{T_{wall}}\,t + \frac{1}{2}A^{(n)}\,\frac{\Delta t^{2}}{T_{wall}^{2}}\,t^{2} \qquad (7.9)

As a result of this smoothing, the simulated and displayed systems do not

match at each sub-step. However, by applying the same haptic forces calculated for

the simulated system to the visual system, the systems will match to within machine

precision at the outer time steps. To avoid any buildup of round-off error, the display

coordinates are re-synced to the simulated system after each step.
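A sketch of the display-side evaluation corresponding to equations (7.7) and (7.9) is given below. It assumes the start-of-step coordinates, velocities, and ab initio accelerations are available to the routine; the interface is illustrative and not the exact one used in TeraChem.

#include <vector>

// Displayed frame at wall-clock offset t into the current outer step:
// simulated time is rescaled by dt / T_wall so the step spans exactly T_wall of wall time.
std::vector<double> display_frame(const std::vector<double>& x0,   // X^(n,0)
                                  const std::vector<double>& v0,   // V^(n,0)
                                  const std::vector<double>& a0,   // A^(n), ab initio only
                                  double dt,                        // outer MD time step
                                  double t_wall,                    // wall time per outer step
                                  double t)                         // wall-clock offset, 0 <= t <= t_wall
{
    const double s = dt / t_wall;                   // wall-time -> simulated-time scale factor
    std::vector<double> xd(x0.size());
    for (size_t i = 0; i < x0.size(); ++i)
        xd[i] = x0[i] + v0[i] * s * t + 0.5 * a0[i] * s * s * t * t;
    return xd;                                      // re-synced to the simulated X at each outer step
}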


In the present implementation, TeraChem is responsible for calculating both

the simulated and displayed trajectories. At intervals of about 10 ms, a communication thread evaluates \tilde{X}(t) and sends the updated coordinates to VMD.

Optimizing TeraChem for small systems

TeraChem was originally designed to handle large systems such as

proteins.18,19 To enable these huge calculations, the electron repulsion integrals (ERIs) contributing to the Coulomb and exchange operators are computed using GPUs. To maximize performance,

TeraChem uses custom unrolled procedures for each angular momentum class of ERIs

(e.g. one function handles Coulomb (ss|ss) type contributions, another (ss|sp), and so

on).20 For large systems, near perfect load balancing is achieved by distributing each

type of ERI across all GPUs in a computer. This is shown in the top frame of figure 4.

However, for small systems, there is not enough parallel work in an ERI batch to saturate even one GPU, much less as many as eight. Thus, a new strategy was implemented, as illustrated in the lower frame of figure 4.

Figure 4: Multi-GPU parallelization strategies. (A) Static load balancing ideal for large systems, where each class of integrals (e.g. sssp) provides enough work to saturate all GPUs. (B) Dynamic scheduling in which each GPU computes independent integral classes, allowing small jobs to saturate multiple GPUs with work.

Here each ERI class runs on a single GPU, and different GPUs run different types of

ERI kernels in parallel. The throughput of each GPU was further improved by using

CUDA streams to enable simultaneous execution of multiple kernels on each GPU.


Dynamic load balancing was enabled through Intel Cilk Plus work-stealing queues.

Together these schemes significantly increase the parallelism exposed by a small

calculation.
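The sketch below illustrates the dynamic scheduling idea of figure 4(B): one host thread per GPU pops ERI classes from a shared queue and launches them on a private stream. For brevity a std::atomic counter stands in for the Intel Cilk Plus work-stealing queues, and eri_class_kernel is a hypothetical placeholder for the unrolled per-class integral kernels.

#include <atomic>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void eri_class_kernel(int eri_class) {
    // placeholder: a real kernel would evaluate all integrals of this angular momentum class
}

void gpu_worker(int device, std::atomic<int>& next, int n_classes) {
    cudaSetDevice(device);                          // all work from this thread targets one GPU
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int c = next.fetch_add(1); c < n_classes; c = next.fetch_add(1))
        eri_class_kernel<<<128, 256, 0, stream>>>(c);   // classes from different threads overlap across GPUs
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

void launch_all_eri_classes(int n_classes) {
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);
    std::atomic<int> next{0};
    std::vector<std::thread> workers;
    for (int g = 0; g < n_gpus; ++g)
        workers.emplace_back(gpu_worker, g, std::ref(next), n_classes);
    for (auto& w : workers) w.join();
}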

When handling small systems, it is also essential to minimize execution

latencies that may not have been apparent in larger calculations. In GPU code,

communication across the PCIe bus is a common cause of latency. This was

minimized by using asynchronous CUDA streams, which are provided as part of the

CUDA runtime. CUDA streams reduce latencies for host-GPU communication and

enable memory transfers to overlap kernel execution. Another important cause of

latency is the allocation and release of GPU memory. While memory allocation is an

expensive operation on most architectures, allocations are particularly costly on the

GPU, triggering synchronization across devices and even serializing the execution of

kernels run on separate GPUs. To eliminate these costs, large blocks of memory are

pre-allocated from each GPU at the start of the calculation. Individual CPU threads

request these pre-allocated blocks as needed. A mutual exclusion lock is used to

guarantee thread safety during the assignment of GPU memory blocks. Once assigned,

the memory block is the sole property of a single thread and can be used without

further synchronization. Extending this system, blocks of page-locked host memory

are also pre-allocated and distributed along with the GPU memory blocks. By

assembling GPU input data in page-locked host memory, the driver can avoid staging

transfers through an internal buffer, further improving latency.
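The memory-pool idea can be sketched as follows: one large device allocation and one matching page-locked host allocation are made up front, and threads carve aligned sub-blocks out of them under a mutex. The bump-allocator policy, block alignment, and sizes here are illustrative assumptions, not TeraChem's actual allocator.

#include <cstddef>
#include <mutex>
#include <new>
#include <cuda_runtime.h>

class GpuMemoryPool {
public:
    explicit GpuMemoryPool(std::size_t bytes) : capacity_(bytes) {
        cudaMalloc(&device_base_, bytes);           // single up-front device allocation
        cudaMallocHost(&host_base_, bytes);         // matching pinned (page-locked) host block
    }
    ~GpuMemoryPool() { cudaFree(device_base_); cudaFreeHost(host_base_); }

    // Hand a sub-block to the calling thread; once assigned, the block belongs to
    // that thread and can be used without further synchronization.
    void acquire(std::size_t bytes, void** device_ptr, void** host_ptr) {
        const std::size_t padded = (bytes + 255) & ~std::size_t(255);   // keep blocks 256-byte aligned
        std::lock_guard<std::mutex> lock(mutex_);
        if (offset_ + padded > capacity_) throw std::bad_alloc();
        *device_ptr = static_cast<char*>(device_base_) + offset_;
        *host_ptr   = static_cast<char*>(host_base_) + offset_;
        offset_ += padded;
    }
    void release_all() {                            // e.g. at the end of each Fock build
        std::lock_guard<std::mutex> lock(mutex_);
        offset_ = 0;
    }
private:
    void* device_base_ = nullptr;
    void* host_base_   = nullptr;
    std::size_t capacity_ = 0, offset_ = 0;
    std::mutex mutex_;
};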

Table 1: Result of optimizing TeraChem for small systems. All systems were run at the RHF/6-31G* level of theory. Times are total wall times per MD step averaged over five 1 fs time steps. The initial guess was taken from the previous step for all 5 time steps, and the wave function was converged to 1.0e-5 Hartree in the maximum element of the commutator, SPF-FPS, where S, P, and F are the overlap, one-particle density, and Fock matrices respectively. For the geometries see chapter 4 figure 2 (taxol and olestra) and figure 6 below (imidazole).

Imidazole (9 atoms)
                Unoptimized           Optimized
  GPUs       Seconds   SpdUp       Seconds   SpdUp
     1         1.98     1.0          0.82     1.0
     2         2.09     0.9          0.44     1.9
     3         2.32     0.9          0.31     2.6
     4         2.63     0.8          0.35     2.4
     5         3.05     0.6          0.22     3.7
     6         3.36     0.6          0.20     4.1
     7         3.79     0.5          0.18     4.6
     8         4.57     0.4          0.17     4.8

Taxol (110 atoms)
                Unoptimized           Optimized
  GPUs       Seconds   SpdUp       Seconds   SpdUp
     1        87.58     1.0         71.30     1.0
     2        52.47     1.7         39.20     1.8
     3        41.25     2.1         28.10     2.5
     4        36.82     2.4         23.28     3.1
     5        35.44     2.5         20.96     3.4
     6        34.40     2.5         18.72     3.8
     7        34.28     2.6         17.42     4.1
     8        34.78     2.5         16.68     4.3

Olestra (453 atoms)
                Unoptimized           Optimized
  GPUs       Seconds   SpdUp       Seconds   SpdUp
     1       511.08     1.0        480.17     1.0
     2       283.54     1.8        269.04     1.8
     3       208.98     2.4        193.87     2.5
     4       173.09     3.0        161.41     3.0
     5       159.73     3.2        146.15     3.3
     6       147.99     3.5        133.01     3.6
     7       142.27     3.6        125.35     3.8
     8       139.03     3.7        119.24     4.0

ASSESSMENT

In order to benchmark the numerical stability of the AI-IMD integrator, we

first consider the trivial diatomic system, HCl. Because the user does work through the

haptic forces, interactive dynamics will not in general conserve the total energy of the

system. Energy is conserved, however, in the special case of a fixed pulling position

since the spring force can be construed as a function of only the molecular geometry,

and the Hamiltonian becomes time independent. Indeed, in this case, the resulting

dynamics simply evolves on a force-modified potential energy surface.21 Using this

special case, the AI-IMD integrator is validated as follows. First the HCl molecule is

minimized at the RHF/6-31G** level of theory. The resulting geometry is aligned

along the y-axis with the chlorine atom located at the origin and the hydrogen at

y=1.265 Angstroms. The test system also includes a haptic spring connecting the

hydrogen atom to a fixed pulling point at y=1.35 Angstroms.

Starting from the above initial geometry, AI-IMD calculations were run using

varying haptic spring constants and MTS sub-steps. All calculations used a fixed time

step of 1.0fs to integrate the ab initio forces. For smaller haptic force constants, below

0.7 Hartree/Bohr², the haptic-induced motions occur on timescales comparable to those driven by the ab initio forces encountered in near-equilibrium MD. Thus, a simple velocity Verlet

integrator conserves energy about as well as the MTS approach developed above. At

larger forces, however, velocity Verlet becomes increasingly unstable. Figure 5

demonstrates the stability of our AI-IMD integrator for a spring constant of 2.6

Hartree/Bohr². Here the velocity Verlet integrator rapidly diverges, while an MTS

integrator using 5 sub-steps remains stable. Although overall energy drifts increase at

higher forces, the MTS approach does not exhibit wild divergence even at a force constant of 4.0 Hartree/Bohr². This is important because explosive divergence is a

much greater obstacle to controlling the system than is a slow energy drift. In general,

the motion of the haptic tugging point will itself induce heating so that a thermostat is

already required to counter the slow accumulation of energy during long simulations.

Figure 5: Total energy for an IMD simulation of HCl with a fixed tugging point. HCl was aligned with the y-axis, with Cl initially at the origin and H at y=1.265 Angstroms. The haptic force pulled toward y=1.35 Angstroms with a force constant of 2.6 Hartree/Bohr². A standard velocity Verlet integrator (red) suffers from wild energy oscillations and diverges before reaching 500 fs. The AI-IMD MTS integrator (blue) used 5 sub-steps per ab initio force evaluation and shows no instability throughout the entire 5 ps trajectory (of which 1 ps is shown).

We turn next to more interesting systems that better illustrate the potential of

AI-IMD. As currently implemented in TeraChem, AI-IMD can be applied to systems

containing up to a few dozen atoms using double-zeta basis sets at the SCF level of

theory. For smaller systems polarization functions can also be employed. AI-IMD is

thus well suited to treat reactions of small organic molecules. Table 2 shows the

average wall time per MD step for simulations on a variety of representative

molecules. Here the average is calculated from short 250-step non-interactive trajectories using the spin-restricted Hartree-Fock method and various basis sets. The initial geometries were optimized at the Hartree-Fock/3-21G level of theory and are shown in figure 6. At each MD step, the wave function was converged to 1.0e-4

Hartree in the maximum element of the commutator, SPF-FPS, where S, P, and F are


the overlap, one-particle density, and Fock matrices respectively. The initial density

for each SCF is obtained from the converged density of the previous MD step. The

Coulomb and exchange operators are computed predominantly using single precision

floating-point operations, with double precision used to accumulate each contribution

into the Fock matrix. This scheme has been shown to be accurate for systems up to

100 atoms.22
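As a toy illustration of this precision strategy (not TeraChem's actual integral kernels), the kernel below evaluates each contribution in single precision but keeps the running sum in a double-precision accumulator; the per-thread partial sums would then be reduced into the Fock matrix in double precision. compute_contribution is a placeholder for an ERI-times-density term.

#include <cuda_runtime.h>

__device__ float compute_contribution(int k) {      // placeholder single-precision work item
    return 1.0f / (1.0f + static_cast<float>(k));
}

__global__ void accumulate_mixed(double* partial, int n_items) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    double acc = 0.0;                                // double-precision accumulator per thread
    for (int k = tid; k < n_items; k += stride)
        acc += static_cast<double>(compute_contribution(k));  // promote each float term before summing
    partial[tid] = acc;                              // host (or a reduction kernel) combines the partials
}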

Table 2: Wall time per MD time step. Times listed in ms and averaged over 250 steps of AIMD at the RHF level of theory. Initial geometries are shown in figure 6. The MD time step was 1.0 fs. Simulations used 8 Nvidia GTX 970 GPUs in parallel.

  Molecule (atoms)      STO-3G     3-21G    6-31G**
  HCl (2)                 3.16      3.84      39.52
  CH3F+OH- (7)           11.60     15.40      92.24
  Imidazole (9)          21.64     36.00     157.28
  Caffeine (24)         101.60    195.56     717.88
  Quinoline (30)        112.72    201.68     858.84
  Spiropyran (40)       246.88    401.84    2492.20

Figure 6: Initial geometries for benchmark calculations listed in table 2. All structures represent minima at the RHF/3-21G level of theory.

Figure 7 illustrates a typical interactive simulation. Here the quinoline system

introduced above is interactively modeled using RHF and the STO-3G basis set. Each

ab initio gradient is allowed 200 ms of wall time. The MTS integrator uses 10 sub-steps, and the force constant of the haptic spring is set to 0.15 Hartree/Bohr².

Throughout the simulation, the coordinates of the three ammonia nitrogen atoms are


frozen to avoid their diffusion into the vacuum. This system has been previously used

to experimentally study long-distance proton transfer.23 In a similar spirit, we use a

haptic tug to remove a proton from the central ammonia and form an ammonium ion

on the right as shown in panels (B) and (C) of figure 7. The system then spontaneously

responds by transferring three additional protons and lengthening the central C-C bond

between the quinoline rings to form 7-ketoquinoline as shown in panel (D).

Figure 7: Snapshots of an interactive simulation of 7-hydroxyquinoline with three ammonia molecules. (A) The simulation is started from the geometry shown in figure 6. The Cartesian coordinates of the ammonia nitrogen atoms are fixed to avoid diffusion. (B) Force is applied to transfer a proton from the central ammonia to form an ammonium ion on the right (C). Subsequently, three protons spontaneously transfer, ultimately converting 7-hydroxyquinoline to the tautomeric 7-ketoquinoline (D).

We find force-feedback through the haptic device to be an important feature in

AI-IMD for several reasons. First, it improves user control by providing a cue to

pulling distance. Even with a 3D display, it is sometimes difficult to estimate exact

positions within the simulated system. Since the force increases as the spring is


stretched, the feedback force helps the user determine where and how hard they are

pulling on the system. Second, feedback can provide vector field information that is

not easily represented visually. For example, in simulating the imidazole molecule

shown in figure 8, the N-H bonds can in general be bent much more easily than

stretched. In attempting to transfer the N-H proton between nitrogen atoms, the user

feels that a bending motion is easier and naturally moves the proton along a realistic

path. It the user tries to pull the atom into a forbidden region, here the middle of the !-

bonding structure of the aromatic ring, the proton resists and instead follows the path

shown around the perimeter. It is difficult to represent the field of such repulsions

visually. Thus, a force-feedback interface provides a unique and useful perspective.

Figure 8: Interactive proton transfer in the imidazole molecule. The haptic pointer is attached to a proton originally on the far side of the molecule (translucent atom) and used to pull it across to the nearer nitrogen atom. The path of the haptic pointer is shown in green, while that of the transferring proton is colored white.

CONCLUSION

Interactive models have a long history in chemical research. The venerable ball-and-stick model, invented more than a century ago, still plays an important role in shaping our understanding of chemistry. In that same spirit, interactive ab initio

calculations represent a new synthesis of intuitive human interfaces with accurate

numerical methods. AI-IMD can already be applied to systems containing up to a few

dozen atoms at the Hartree-Fock level of theory. As computers and algorithms

continue to improve, the scope of on-the-fly calculations will continue to widen. Work

is already in progress to extend our work here to higher-level DFT methods and to

provide on-the-fly orbital display. These features present no technical challenges

beyond what has already been accomplished for Hartree-Fock calculations above.

The integration methods presented here provide robust energy conservation for

strong haptic forces as long as the ab initio forces are similar to those encountered in

equilibrium MD. Of course, the user is free to slam the system into much higher-

energy configurations where this assumption simply does not hold. More research is

needed to develop graceful ways to conserve the system’s energy in such cases, for

example by switching to an empirical description which can be rapidly evaluated at a

much shorter time step than is possible for the ab initio forces. Such developments

would greatly improve the resilience of AI-IMD simulations, particularly for non-

expert users. The final test of any model is whether it provides fertile ground in which

researchers can formulate and test imaginative scientific questions. This explains the

longevity of existing interactive models in chemistry and surely recommends AI-IMD

as an area deserving much future research.

REFERENCES

(1) Johnson, C. K. OR TEP: A FORTRAN Thermal-Ellipsoid Plot Program for Crystal Structure Illustrations, Oak Ridge National Laboratory, 1965.

(2) Levinthal, C. Sci Am 1966, 214, 42.

(3) Surles, M. C.; Richardson, J. S.; Richardson, D. C.; Brooks, F. P. Protein Sci 1994, 3, 198.

(4) Leech, J.; Prins, J. F.; Hermans, J. Ieee Comput Sci Eng 1996, 3, 38.


(5) Prins, J. F.; Hermans, J.; Mann, G.; Nyland, L. S.; Simons, M. Future Gener Comp Sy 1999, 15, 485.

(6) Stone, J. E.; Gullingsrud, J.; Schulten, K. ACM Symposium on 3D Graphics 2001, 191.

(7) Bosson, M.; Grudinin, S.; Redon, S. J. Comp. Chem. 2013, 34, 492.

(8) Bosson, M.; Richard, C.; Plet, A.; Grudinin, S.; Redon, S. J. Comp. Chem. 2012, 33, 779.

(9) Marti, K. H.; Reiher, M. J. Comp. Chem. 2009, 30, 2010.

(10) Haag, M. P.; Marti, K. H.; Reiher, M. Chemphyschem 2011, 12, 3204.

(11) Haag, M. P.; Reiher, M. Int J Quantum Chem 2013, 113, 8.

(12) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2009, 5, 2619.

(13) Humphrey, W.; Dalke, A.; Schulten, K. J Mol Graph Model 1996, 14, 33.

(14) Hu, X. Q.; Yang, W. T. J. Chem. Phys. 2010, 132.

(15) Pulay, P. J. Comp. Chem. 1982, 3, 556.

(16) Tuckerman, M.; Berne, B. J.; Martyna, G. J. J. Chem. Phys. 1992, 97, 1990.

(17) Swope, W. C.; Andersen, H. C.; Berens, P. H.; Wilson, K. R. J. Chem. Phys. 1982, 76, 637.

(18) Kulik, H. J.; Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Phys. Chem. B 2012, 116, 12501.

(19) Ufimtsev, I. S.; Luehr, N.; Martinez, T. J. J. Phys. Chem. Lett. 2011, 2, 1789.

(20) Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2008, 4, 222.

(21) Ong, M. T.; Leiding, J.; Tao, H. L.; Virshup, A. M.; Martinez, T. J. J. Am. Chem. Soc. 2009, 131, 6377.

(22) Luehr, N.; Ufimtsev, I. S.; Martinez, T. J. J. Chem. Theo. Comp. 2011, 7, 949.

(23) Tanner, C.; Manca, C.; Leutwyler, S. Chimia 2004, 58, 234.


BIBLIOGRAPHY

Adamson, R. D., J. P. Dombroski, and P. M. W. Gill. "Chemistry without Coulomb Tails." Chemical Physics Letters 254.5-6 (1996): 329-36.

---. "Efficient Calculation of Short-Range Coulomb Energies." J. Comp. Chem. 20 (1999): 921-27.

Ahmadi, G. R., and J. Almlof. "The Coulomb Operator in a Gaussian Product Basis." Chemical Physics Letters 246.4-5 (1995): 364-70.

Almlof, J., K. Faegri, and K. Korsell. "Principles for a Direct Scf Approach to Lcao-Mo Ab Initio Calculations." Journal of Computational Chemistry 3.3 (1982): 385-99.

Anderson, Joshua A., Chris D. Lorenz, and A. Travesset. "General Purpose Molecular Dynamics Simulations Fully Implemented on Graphics Processing Units." Journal of Computational Physics 227.10 (2008): 5342-59.

Anglada, E., J. Junquera, and J. M. Soler. "Efficient Mixed-Force First-Principles Molecular Dynamics." Physical Review E 68.5 (2003): 055701.

Appel, H., E. K. U. Gross, and K. Burke. "Excitations in Time-Dependent Density-Functional Theory." Physical Review Letters 90.4 (2003): 043005.

Asadchev, A., et al. "Uncontracted Rys Quadrature Implementation of up to G Functions on Graphical Processing Units." Journal of Chemical Theory and Computation 6.3 (2010): 696-704.

Asadchev, A., and M. S. Gordon. "Mixed-Precision Evaluation of Two-Electron Integrals by Rys Quadrature." Computer Physics Communications 183.8 (2012): 1563-67.

---. "New Multithreaded Hybrid Cpu/Gpu Approach to Hartree-Fock." Journal of Chemical Theory and Computation 8.11 (2012): 4166-76.

Badaeva, E., et al. "Excited-State Structure of Oligothiophene Dendrimers: Computational and Experimental Study." J. Phys. Chem. B 114 (2010): 15808-17.


Barnett, R. N., et al. "Born-Oppenheimer Dynamics Using Density Functional Theory - Equilibrium and Fragmentation of Small Sodium Clusters." J. Chem. Phys. 94 (1991): 608-16.

Barth, E., and T. Schlick. "Extrapolation Versus Impulse in Multiple-Timestepping Schemes. Ii. Linear Analysis and Applications to Newtonian and Langevin Dynamics." Journal of Chemical Physics 109.5 (1998): 1633-42.

Becke, A. D. "A Multicenter Numerical-Integration Scheme for Polyatomic-Molecules." Journal of Chemical Physics 88.4 (1988): 2547-53.

---. "A Multicenter Numerical Integration Scheme for Polyatomic Molecules." The Journal of Chemical Physics 88.4 (1988): 2547-53.

---. "Density-Functional Exchange-Energy Approximation with Correct Asymptotic Behavior." Physical Review A 38.6 (1988): 3098.

Ben-Nun, M., and T. J. Martinez. "Ab Initio Quantum Molecular Dynamics." Adv. Chem. Phys. 121 (2002): 439-512.

Bhaskaran-Nair, K., et al. "Noniterative Multireference Coupled Cluster Methods on Heterogeneous Cpu-Gpu Systems." Journal of Chemical Theory and Computation 9.4 (2013): 1949-57.

Biesiadecki, J. J., and R. D. Skeel. "Dangers of Multiple Time-Step Methods." Journal of Computational Physics 109.2 (1993): 318-28.

Bosson, M., S. Grudinin, and S. Redon. "Block-Adaptive Quantum Mechanics: An Adaptive Divide-and-Conquer Approach to Interactive Quantum Chemistry." Journal of Computational Chemistry 34.6 (2013): 492-504.

Bosson, M., et al. "Interactive Quantum Chemistry: A Divide-and-Conquer Ased-Mo Method." Journal of Computational Chemistry 33.7 (2012): 779-90.

Boys, S. F. "Electronic Wave Functions. 1. A General Method of Calculation for the Stationary States of Any Molecular System." Proceedings of the Royal Society of London Series a-Mathematical and Physical Sciences 200.1063 (1950): 542-54.

Brown, P., et al. "Massively Multicore Parallelization of Kohn-Sham Theory." J. Chem. Theo. Comp. 4 (2008): 1620-26.

Burant, J. C., G. E. Scuseria, and M. J. Frisch. "A Linear Scaling Method for Hartree-Fock Exchange Calculations of Large Molecules." Journal of Chemical Physics 105.19 (1996): 8969-72.


Burke, Kieron, Jan Werschnik, and E. K. U. Gross. "Time-Dependent Density Functional Theory: Past, Present, and Future." The Journal of Chemical Physics 123.6 (2005): 062206-9.

Bussi, G., D. Donadio, and M. Parrinello. "Canonical Sampling through Velocity Rescaling." Journal of Chemical Physics 126.1 (2007): 014101.

Car, Roberto, and Michele Parrinello. "Unified Approach for Molecular Dynamics and Density-Functional Theory." Phys. Rev. Lett. 55 (1985): 2471-74.

Carloni, P., U. Rothlisberger, and M. Parrinello. "The Role and Perspective of Ab Initio Molecular Dynamics in the Study of Biological Systems." Acc. Chem. Res. 35 (2002): 455-64.

Case, David A., et al. "The Amber Biomolecular Simulation Programs." Journal of Computational Chemistry 26.16 (2005): 1668-88.

Casida, Mark E. "Time Dependent Density Functional Response Theory for Molecules." Recent Advances in Density Functional Methods. Ed. Chong, D. P. Singapore: World Scientific, 1995.

Casida, Mark E., et al. "Molecular Excitation Energies to High-Lying Bound States from Time-Dependent Density-Functional Response Theory: Characterization and Correction of the Time-Dependent Local Density Approximation Ionization Threshold." The Journal of Chemical Physics 108.11 (1998): 4439-49.

Challacombe, M., and E. Schwegler. "Linear Scaling Computation of the Fock Matrix." Journal of Chemical Physics 106.13 (1997): 5526-36.

Cohen, A. J., P. Mori-Sanchez, and W. Yang. "Insights into Current Limitations of Density Functional Theory." Science 321 (2008): 792-94.

Collins, M. A. "Systematic Fragmentation of Large Molecules by Annihilation." Phys. Chem. Chem. Phys. 14 (2012): 7744-51.

Cordova, F., et al. "Troubleshooting Time-Dependent Density Functional Theory for Photochemical Applications: Oxirane." J. Chem. Phys. 127 (2007): 164111.

Dallos, M., et al. "Analytic Evaluation of Nonadiabatic Coupling Terms at the Mr-Ci Level. Ii. Minima on the Crossing Seam: Formaldehyde and the Photodimerization of Ethylene." J. Chem. Phys. 120 (2004): 7330-39.

Davidson, E. R. "Iterative Calculation of a Few of Lowest Eigenvalues and Corresponding Eigenvectors of Large Real-Symmetric Matrices." Journal of Computational Physics 17.1 (1975): 87-94.


Davidson, Ernest R. "The Iterative Calculation of a Few of the Lowest Eigenvalues and Corresponding Eigenvectors of Large Real-Symmetric Matrices." Journal of Computational Physics 17.1 (1975): 87-94.

Daw, M. S. "Model for Energetics of Solids Based on the Density-Matrix." Physical Review B 47.16 (1993): 10895-98.

DePrince, A. E., and J. R. Hammond. "Coupled Cluster Theory on Graphics Processing Units I. The Coupled Cluster Doubles Method." Journal of Chemical Theory and Computation 7.5 (2011): 1287-95.

DePrince, A. E., et al. "Density-Fitted Singles and Doubles Coupled Cluster on Graphics Processing Units." Molecular Physics 112 (2014): 844-52.

Deraedt, H., and B. Deraedt. "Applications of the Generalized Trotter Formula." Physical Review A 28.6 (1983): 3575-80.

des Cloizeaux, J. "Energy Bands + Projection Operators in Crystal - Analytic + Asymptotic Properties." Physical Review 135.3A (1964): A685-+.

Devadoss, C., P. Bharathi, and J. S. Moore. "Energy Transfer in Dendritic Macromolecules: Molecular Size Effects and the Role of an Energy Gradient." Journal of the American Chemical Society 118.40 (1996): 9635-44.

Dion, M., et al. "Van Der Waals Density Functional for General Geometries." Phys. Rev. Lett. 92 (2004): 246401.

Dirac, P. A. M. "Quantum Mechanics of Many-Electron Systems." Proceedings of the Royal Society of London Series a-Containing Papers of a Mathematical and Physical Character 123.792 (1929): 714-33.

Dreuw, A., and M. Head-Gordon. "Single-Reference Ab Initio Methods for the Calculation of Excited States of Large Molecules." Chemical Reviews 105.11 (2005): 4009-37.

Dreuw, Andreas, Jennifer L. Weisman, and Martin Head-Gordon. "Long-Range Charge-Transfer Excited States in Time-Dependent Density Functional Theory Require Non-Local Exchange." The Journal of Chemical Physics 119.6 (2003): 2943-46.

Fedorov, D. G., T. Nagata, and K. Kitaura. "Exploring Chemistry with the Fragment Molecular Orbital Method." Phys. Chem. Chem. Phys. 14 (2012): 7562-77.

Foresman, James B., et al. "Toward a Systematic Molecular Orbital Theory for Excited States." The Journal of Physical Chemistry 96.1 (1992): 135-49.


Friedrichs, Mark S., et al. "Accelerating Molecular Dynamic Simulation on Graphics Processing Units." Journal of Computational Chemistry 30.6 (2009): 864-72.

Genovese, L., et al. "Density Functional Theory Calculation on Many-Cores Hybrid Cpu-Gpu Architectures." J. Chem. Phys. 131 (2009): 034103.

Gibson, D. A. , and E. A. Carter. "Time-Reversible Multiple Time Scale Ab Initio Molecular Dynamics." J. Phys. Chem. 97 (1993): 13429-34.

Gordon, M. S., et al. "The Effective Fragment Potential Method: A Qm-Based Mm Approach to Modeling Environmental Effects in Chemistry." J. Phys. Chem. A 105 (2001): 293-307.

Gordon, M. S., et al. "Accurate First Principles Model Potentials for Intermolecular Interactions." Ann. Rev. Phys. Chem. 64 (2013): 553-78.

Grabo, T., M. Petersilka, and E. K. U. Gross. "Molecular Excitation Energies from Time-Dependent Density Functional Theory." Journal of Molecular Structure: THEOCHEM 501-502 (2000): 353-67.

Grimme, Stefan. "Accurate Description of Van Der Waals Complexes by Density Functional Theory Including Empirical Corrections." Journal of Computational Chemistry 25.12 (2004): 1463-73.

---. "Semiempirical Gga-Type Density Functional Constructed with a Long-Range Dispersion Correction." Journal of Computational Chemistry 27.15 (2006): 1787-99.

Grimme, S., and M. Parac. "Substantial Errors from Time-Dependent Density Functional Theory for the Calculation of Excited States of Large Pi Systems." ChemPhysChem 4 (2003): 292-95.

Gross, E. K. U., and Walter Kohn. "Local Density-Functional Theory of Frequency-Dependent Linear Response." Physical Review Letters 55.26 (1985): 2850.

Grossman, J. P., et al. "Hardware Support for Fine-Grained Event-Driven Computation in Anton 2." Acm Sigplan Notices 48.4 (2013): 549-60.

Grossman, J. P., et al. "The Role of Cascade, a Cycle-Based Simulation Infrastructure, in Designing the Anton Special-Purpose Supercomputers." 2013 50th Acm / Edac / Ieee Design Automation Conference (Dac) (2013).

Grubmuller, H., et al. "Generalized Verlet Algorithm for Efficient Molecular Dynamics Simulations with Long-Range Interactions." Mol. Sim. 6 (1991): 121-42.


Guidon, M., et al. "Ab Initio Molecular Dynamics Using Hybrid Density Functionals." Journal of Chemical Physics 128.21 (2008): 214104.

Haag, M. P., K. H. Marti, and M. Reiher. "Generation of Potential Energy Surfaces in High Dimensions and Their Haptic Exploration." Chemphyschem 12.17 (2011): 3204-13.

Haag, M. P., and M. Reiher. "Real-Time Quantum Chemistry." International Journal of Quantum Chemistry 113.1-2 (2013): 8-20.

Han, G. W., et al. "Error and Timing Analysis of Multiple Time-Step Integration Methods for Molecular Dynamics." Computer Physics Communications 176.4 (2007): 271-91.

Hancock, Jessica M., et al. "High-Efficiency Electroluminescence from New Blue-Emitting Oligoquinolines Bearing Pyrenyl or Triphenyl Endgroups." The Journal of Physical Chemistry C 111.18 (2007): 6875-82.

Harpham, Michael R., et al. "Thiophene Dendrimers as Entangled Photon Sensor Materials." Journal of the American Chemical Society 131.3 (2009): 973-79.

Hartke, B. , D. A. Gibson, and E. A. Carter. "Multiple Time Scale Hartree-Fock Molecular Dynamics." Int. J. Quant. Chem. 45 (1993): 59-70.

Harvey, M. J., G. Giupponi, and G. DeFabiritiis. "Acemd: Accelerating Biomolecular Dynamics in the Microsecond Time Scale." J. Chem. Theo. Comp. 5 (2009): 1632-39.

He, X., and J. Z. H. Zhang. "The Generalized Molecular Fractionation with Conjugate Caps/Molecular Mechanics Method for Direct Calculation of Protein Energy." J. Chem. Phys. 124 (2006): 184703.

Helgaker, Trygve, Poul Jørgensen, and Jeppe Olsen. Molecular Electronic-Structure Theory. New York: Wiley, 2000.

Heyd, Jochen, Gustavo E. Scuseria, and Matthias Ernzerhof. "Hybrid Functionals Based on a Screened Coulomb Potential." The Journal of Chemical Physics 118.18 (2003): 8207-15.

Hirata, So, and Martin Head-Gordon. "Time-Dependent Density Functional Theory within the Tamm-Dancoff Approximation." Chemical Physics Letters 314.3-4 (1999): 291-99.


Hirata, So, Martin Head-Gordon, and Rodney J. Bartlett. "Configuration Interaction Singles, Time-Dependent Hartree--Fock, and Time-Dependent Density Functional Theory for the Electronic Excited States of Extended Systems." The Journal of Chemical Physics 111.24 (1999): 10774-86.

Hu, X. Q., and W. T. Yang. "Accelerating Self-Consistent Field Convergence with the Augmented Roothaan-Hall Energy Function." Journal of Chemical Physics 132.5 (2010).

Humphrey, W., A. Dalke, and K. Schulten. "Vmd: Visual Molecular Dynamics." Journal of Molecular Graphics & Modelling 14.1 (1996): 33-38.

Hwu, Wen-mei. Gpu Computing Gems. Amsterdam ; Burlington, MA: Elsevier, 2011.

Iikura, Hisayoshi, et al. "A Long-Range Correction Scheme for Generalized-Gradient-Approximation Exchange Functionals." The Journal of Chemical Physics 115.8 (2001): 3540-44.

Isborn, C. M., et al. "Excited-State Electronic Structure with Configuration Interaction Singles and Tamm-Dancoff Time-Dependent Density Functional Theory on Graphical Processing Units." Journal of Chemical Theory and Computation 7.6 (2011): 1814-23.

Jacquemin, Denis, et al. "Extensive Td-Dft Benchmark: Singlet-Excited States of Organic Molecules." Journal of Chemical Theory and Computation 5.9 (2009): 2420-35.

Johnson, B. G., P. M. W. Gill, and J. A. Pople. "The Performance of a Family of Density Functional Methods." Journal of Chemical Physics 98.7 (1993): 5612-26.

Johnson, Carroll K. Or Tep: A Fortran Thermal-Ellipsoid Plot Program for Crystal Structure Illustrations. Oak Ridge, TN: Oak Ridge National Laboratory, 1965.

Jorgensen, William L., et al. "Comparison of Simple Potential Functions for Simulating Liquid Water." The Journal of Chemical Physics 79.2 (1983): 926-35.

Kahan, W. "Further Remarks on Reducing Truncation Errors." Communications of the Acm 8.1 (1965): 40-&.

Kirk, D. B., and W. W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Burlington, MA: Morgan Kauffman 2010.


Ko, C., et al. "Pseudospectral Time-Dependent Density Functional Theory." J. Chem. Phys. 128 (2008): 104103.

Kobayashi, Y., H. Nakano, and K. Hirao. "Multireference Moller-Plesset Perturbation Theory Using Spin-Dependent Orbital Energies." Chem. Phys. Lett. 336 (2001): 529-35.

Kohn, W. "Analytic Properties of Bloch Waves and Wannier Functions." Physical Review 115.4 (1959): 809-21.

Kohn, W., and L. J. Sham. "Self-Consistent Equations Including Exchange and Correlation Effects." Physical Review 140.4A (1965): 1133-&.

Krylov, A. I. "Equation-of-Motion Coupled Cluster Methods for Open-Shell and Electronically Excited Species: The Hitchhiker's Guide to Fock Space." Ann. Rev. Phys. Chem. 59 (2008): 433-62.

Kulik, H. J., et al. "Ab Initio Quantum Chemistry for Protein Structures." Journal of Physical Chemistry B 116.41 (2012): 12501-09.

Kuskin, J. S., et al. "Incorporating Flexibility in Anton, a Specialized Machine for Molecular Dynamics Simulation." 2008 Ieee 14th International Symposium on High Peformance Computer Architecture (2008): 315-26.

Kussmann, J., and C. Ochsenfeld. "Pre-Selective Screening for Matrix Elements in Linear-Scaling Exact Exchange Calculations." Journal of Chemical Physics 138.13 (2013).

Langlois, J.-M., et al. "Rule-Based Trial Wave Functions for Generalized Valence Bond Theory." J. Phys. Chem. 98 (1994): 13498-505.

Larson, R. H., et al. "High-Throughput Pairwise Point Interactions in Anton, a Specialized Machine for Molecular Dynamics Simulation." 2008 Ieee 14th International Symposium on High Peformance Computer Architecture (2008): 303-14.

Lebedev, V. I., and D. N. Laikov. "Quadrature Formula for the Sphere of 131-Th Algebraic Order of Accuracy." Doklady Akademii Nauk 366.6 (1999): 741-45.

---. "Quadrature Formula for the Spehre of 131-Th Algebraic Order of Accuracy." Dokl. Akad. Nauk 366 (1999): 741-45.

Lee, Chengteh, Weitao Yang, and Robert G. Parr. "Development of the Colle-Salvetti Correlation-Energy Formula into a Functional of the Electron Density." Physical Review B 37.2 (1988): 785.


Leech, J., J. F. Prins, and J. Hermans. "Smd: Visual Steering of Molecular Dynamics for Protein Design." Ieee Computational Science & Engineering 3.4 (1996): 38-45.

Leforestier, C. "Classical Trajectories Using the Full Ab Initio Potential Energy Surface H-+Ch4 -> Ch4 + H-." J. Chem. Phys. 68 (1978): 4406-10.

Levine, B., and T. J. Martinez. "Hijacking the Playstation2 for Computational Chemistry." Abst. Pap. Amer. Chem. Soc. 226 (2003): U426.

Levine, Benjamin G., et al. "Conical Intersections and Double Excitations in Time-Dependent Density Functional Theory." Molecular Physics 104.5-7 (2006): 1039-51.

Levinthal, C. "Molecular Model-Building by Computer." Scientific American 214.6 (1966): 42-&.

Li, X. H., V. E. Teige, and S. S. Iyengar. "Can the Four-Coordinated, Penta-Valent Oxygen in Hydroxide Water Clusters Be Detected through Experimental Vibrational Spectroscopy?" Journal of Physical Chemistry A 111.22 (2007): 4815-20.

Li, X. P., R. W. Nunes, and D. Vanderbilt. "Density-Matrix Electronic-Structure Method with Linear System-Size Scaling." Physical Review B 47.16 (1993): 10891-94.

Liu, Weiguo, et al. "Accelerating Molecular Dynamics Simulations Using Graphics Processing Units with Cuda." Comp. Phys. Comm. 179.9 (2008): 634-41.

Luehr, Nathan, Thomas E. Markland, and Todd J. Martinez. "Multiple Time Step Integrators in Ab Initio Molecular Dynamics." Journal of Chemical Physics 140 (2014): 084116.

Luehr, N., I. S. Ufimtsev, and T. J. Martinez. "Dynamic Precision for Electron Repulsion Integral Evaluation on Graphical Processing Units (Gpus)." Journal of Chemical Theory and Computation 7.4 (2011): 949-54.

Ma, Q., J. A. Izaguirre, and R. D. Skeel. "Verlet-I/R-Respa/Impulse Is Limited by Nonlinear Instabilities." Siam Journal on Scientific Computing 24.6 (2003): 1951-73.

Ma, W. J., et al. "Gpu-Based Implementations of the Noniterative Regularized-Ccsd(T) Corrections: Applications to Strongly Correlated Systems." Journal of Chemical Theory and Computation 7.5 (2011): 1316-27.


Ma, Z. H., and M. E. Tuckerman. "On the Connection between Proton Transport, Structural Diffusion, and Reorientation of the Hydrated Hydroxide Ion as a Function of Temperature." Chemical Physics Letters 511.4-6 (2011): 177-82.

Maitra, Neepa T., et al. "Double Excitations within Time-Dependent Density Functional Theory Linear Response." The Journal of Chemical Physics 120.13 (2004): 5932-37.

Makino, J., K. Hiraki, and M. Inaba. "Grape-Dr: 2-Pflops Massively-Parallel Computer with 512-Core, 512-Gflops Processor Chips for Scientific Computing." 2007 Acm/Ieee Sc07 Conference (2010): 548-58.

Marti, K. H., and M. Reiher. "Haptic Quantum Chemistry." Journal of Computational Chemistry 30.13 (2009): 2010-20.

Martinez, T. J., and E. A. Carter. "Pseudospectral Double-Excitation Configuration Interaction." J. Chem. Phys. 98 (1993): 7081-85.

---. "Pseudospectral Moller-Plesset Perturbation Theory through Third Order." J. Chem. Phys. 100 (1994): 3631-38.

---. "Pseudospectral Methods Applied to the Electron Correlation Problem." Modern Electronic Structure Theory, Part Ii. Ed. Yarkony, D. R. Singapore: World Scientific, 1995. 1132-65.

---. "Pseudospectral Multi-Reference Single and Double-Excitation Configuration Interaction." J. Chem. Phys. 102 (1995): 7564-72.

Martinez, T. J., A. Mehta, and E. A. Carter. "Pseudospectral Full Configuration Interaction." J. Chem. Phys. 97 (1992): 1876-80.

Marx, D., A. Chandra, and M. E. Tuckerman. "Aqueous Basic Solutions: Hydroxide Solvation, Structural Diffusion, and Comparison to the Hydrated Proton." Chemical Reviews 110.4 (2010): 2174-216.

Mcmurchie, L. E., and E. R. Davidson. "One-Electron and 2-Electron Integrals over Cartesian Gaussian Functions." Journal of Computational Physics 26.2 (1978): 218-31.

McMurchie, Larry E., and Ernest R. Davidson. "One- and Two-Electron Integrals over Cartesian Gaussian Functions." Journal of Computational Physics 26.2 (1978): 218-31.


Minary, P., M. E. Tuckerman, and G. J. Martyna. "Long Time Molecular Dynamics for Enhanced Conformational Sampling in Biomolecular Systems." Physical Review Letters 93.15 (2004): 150201.

Morrone, J. A., et al. "Efficient Multiple Time Scale Molecular Dynamics: Using Colored Noise Thermostats to Stabilize Resonances." Journal of Chemical Physics 134.1 (2011): 014103.

Murray, C. W. , N. C. Handy, and G. J. Laming. "Quadrature Schemes for Integrals of Density Functional Theory." Mol. Phys. 78 (1993): 997.

Neese, F., et al. "Efficient, Approximate and Parallel Hartree-Fock and Hybrid Dft Calculations. A 'Chain-of-Spheres' Algorithm for the Hartree-Fock Exchange." Chemical Physics 356.1-3 (2009): 98-109.

Nielsen, I. B., et al. "Absorption Spectra of Photoactive Yellow Protein Chromophores in Vacuum." Biophysical Journal 89.4 (2005): 2597-604.

NVIDIA. Cuda C Programming Guide. 2013. Design Guide. March 7, 2014.

Ochsenfeld, C., C. A. White, and M. Head-Gordon. "Linear and Sublinear Scaling Formation of Hartree-Fock-Type Exchange Matrices." Journal of Chemical Physics 109.5 (1998): 1663-69.

Olivares-Amaya, R., et al. "Accelerating Correlated Quantum Chemistry Calculations Using Graphical Processing Units and a Mixed Precision Matrix Multiplication Library." Journal of Chemical Theory and Computation 6.1 (2010): 135-44.

Ong, M. T., et al. "First Principles Dynamics and Minimum Energy Pathways for Mechanochemical Ring Opening of Cyclobutene." Journal of the American Chemical Society 131.18 (2009): 6377-+.

Parr, Robert G., and Weitao Yang. Density-Functional Theory of Atoms and Molecules. International Series of Monographs on Chemistry. Oxford: Oxford University Press, 1989.

Payne, M. C., et al. "Iterative Minimization Techniques for Ab Initio Total Energy Calculations - Molecular Dynamics and Conjugate Gradients." Rev. Mod. Phys. 64 (1992): 1045-97.

PetaChem, LLC. "Http://Www.Petachem.Com." Web. November 30, 2010.

Polli, D., et al. "Conical Intersection Dynamics of the Primary Photoisomerization Event in Vision." Nature 467 (2010): 440-43.


Pople, J. A., P. M. W. Gill, and B. G. Johnson. "Kohn-Sham Density-Functional Theory within a Finite Basis Set." Chemical Physics Letters 199.6 (1992): 557-60.

Prins, J. F., et al. "A Virtual Environment for Steered Molecular Dynamics." Future Generation Computer Systems 15.4 (1999): 485-95.

Pruitt, S. R., et al. "The Fragment Molecular Orbital and Systematic Molecular Fragmentation Methods Applied to Water Clusters." Phys. Chem. Chem. Phys. 14 (2012): 7752-64.

Pulay, P. "Improved Scf Convergence Acceleration." Journal of Computational Chemistry 3.4 (1982): 556-60.

Ramakrishna, Guda, et al. "Oligothiophene Dendrimers as New Building Blocks for Optical Applications†." The Journal of Physical Chemistry A 112.10 (2007): 2018-26.

Reza Ahmadi, G., and Jan Almlof. "The Coulomb Operator in a Gaussian Product Basis." Chemical Physics Letters 246.4-5 (1995): 364-70.

Rohrdanz, Mary A., and John M. Herbert. "Simultaneous Benchmarking of Ground- and Excited-State Properties with Long-Range-Corrected Density Functional Theory." The Journal of Chemical Physics 129.3 (2008): 034107-9.

Rohrdanz, Mary A., Katie M. Martins, and John M. Herbert. "A Long-Range-Corrected Density Functional That Performs Well for Both Ground-State Properties and Time-Dependent Density Functional Theory Excitation Energies, Including Charge-Transfer Excited States." The Journal of Chemical Physics 130.5 (2009): 054112-8.

Roos, B. O. "Theoretical Studies of Electronically Excited States of Molecular Systems Using Multiconfigurational Perturbation Theory." Acc. Chem. Res. 32 (1999): 137-44.

Rubensson, E. H., E. Rudberg, and P. Salek. "Density Matrix Purification with Rigorous Error Control." Journal of Chemical Physics 128.7 (2008).

Ruckenbauer, M., et al. "Nonadiabatic Excited-State Dynamics with Hybrid Ab Initio Quantum-Mechanical/Molecular-Mechanical Methods: Solvation of the Pentadieniminium Cation in Apolar Media." J. Phys. Chem. A 114 (2010): 6757-65.

Rudberg, E., and E. H. Rubensson. "Assessment of Density Matrix Methods for Linear Scaling Electronic Structure Calculations." Journal of Physics-Condensed Matter 23.7 (2011).

Rudberg, E., E. H. Rubensson, and P. Salek. "Automatic Selection of Integral Thresholds by Extrapolation in Coulomb and Exchange Matrix Constructions." Journal of Chemical Theory and Computation 5.1 (2009): 80-85.

Runge, Erich, and E. K. U. Gross. "Density-Functional Theory for Time-Dependent Systems." Physical Review Letters 52.12 (1984): 997.

Rys, J., M. Dupuis, and H. F. King. "Computation of Electron Repulsion Integrals Using the Rys Quadrature Method." Journal of Computational Chemistry 4.2 (1983): 154-57.

Sanz-Serna, J. M., and M. P. Calvo. Numerical Hamiltonian Problems. London: Chapman and Hall, 1994.

Schafer, L., et al. "Chromophore Protonation State Controls Photoswitching of the Fluoroprotein asFP595." PLoS Comp. Bio. 4 (2008): e1000034.

Schmidt, M. W., et al. "General Atomic and Molecular Electronic-Structure System." Journal of Computational Chemistry 14.11 (1993): 1347-63.

Schwegler, E., and M. Challacombe. "Linear Scaling Computation of the Hartree-Fock Exchange Matrix." Journal of Chemical Physics 105.7 (1996): 2726-34.

---. "Linear Scaling Computation of the Fock Matrix. Iv. Multipole Accelerated Formation of the Exchange Matrix." Journal of Chemical Physics 111.14 (1999): 6223-29.

---. "Linear Scaling Computation of the Fock Matrix. Iii. Formation of the Exchange Matrix with Permutational Symmetry." Theoretical Chemistry Accounts 104.5 (2000): 344-49.

Schwegler, E., M. Challacombe, and M. Head-Gordon. "Linear Scaling Computation of the Fock Matrix. 2. Rigorous Bounds on Exchange Integrals and Incremental Fock Build." Journal of Chemical Physics 106.23 (1997): 9708-17.

Shaw, D. E., et al. "Anton, a Special-Purpose Machine for Molecular Dynamics Simulation." Isca'07: 34th Annual International Symposium on Computer Architecture, Conference Proceedings (2007): 1-12.

Shaw, D. E., et al. "Millisecond-Scale Molecular Dynamics Simulations on Anton." Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (2009).

Stanton, J. F., and R. J. Bartlett. "The Equation of Motion Coupled Cluster Method. A Systematic Biorthogonal Approach to Molecular Excitation Energies, Transition Probabilities, and Excited State Properties." J. Chem. Phys. 98 (1993): 7029-39.

Steele, R. P. "Communication: Multiple-Timestep Ab Initio Molecular Dynamics with Electron Correlation." J. Chem. Phys. 139 (2013): 011102.

Steinmann, C., D. G. Fedorov, and J. H. Jensen. "Mapping Enzymatic Catalysis Using the Effective Fragment Molecular Orbital Method: Towards All Ab Initio Biochemistry." PLOS One 8 (2013): e60602.

Stone, John E., Justin Gullingsrud, and Klaus Schulten. "A System for Interactive Molecular Dynamics Simulation." ACM Symposium on 3D Graphics (2001): 191.

Stone, John E., et al. "Accelerating Molecular Modeling Applications with Graphics Processors." Journal of Computational Chemistry 28.16 (2007): 2618-40.

Strain, M. C., G. E. Scuseria, and M. J. Frisch. "Achieving Linear Scaling for the Electronic Quantum Coulomb Problem." Science 271.5245 (1996): 51-53.

Stratmann, R. E., G. E. Scuseria, and M. J. Frisch. "Achieving Linear Scaling in Exchange-Correlation Density Functional Quadratures." Chemical Physics Letters 257.3-4 (1996): 213-23.

Streett, W. B., D. J. Tildesley, and G. Saville. "Multiple Time-Step Methods in Molecular Dynamics." Mol. Phys. 35 (1978): 639-48.

Surles, M. C., et al. "Sculpting Proteins Interactively - Continual Energy Minimization Embedded in a Graphical Modeling System." Protein Science 3.2 (1994): 198-210.

Suzuki, M. "Generalized Trotters Formula and Systematic Approximants of Exponential Operators and Inner Derivations with Applications to Many-Body Problems." Communications in Mathematical Physics 51.2 (1976): 183-90.

Swope, W. C., et al. "A Computer-Simulation Method for the Calculation of Equilibrium-Constants for the Formation of Physical Clusters of Molecules - Application to Small Water Clusters." Journal of Chemical Physics 76.1 (1982): 637-49.

Szabo, Attila, and Neil S. Ostlund. Modern Quantum Chemistry. New York: McGraw Hill, 1982.

Takashima, H., et al. "Is Large-Scale Ab Initio Hartree-Fock Calculation Chemically Accurate? Toward Improved Calculation of Biological Molecule Properties." Journal of Computational Chemistry 20.4 (1999): 443-54.

Tanner, C., C. Manca, and S. Leutwyler. "7-Hydroxyquinoline·(NH3)3: A Model for Excited State H-Atom Transfer Along an Ammonia Wire." Chimia 58.4 (2004): 234-36.

Tao, Jianmin, and Sergei Tretiak. "Optical Absorptions of New Blue-Light Emitting Oligoquinolines Bearing Pyrenyl and Triphenyl Endgroups Investigated with Time-Dependent Density Functional Theory." Journal of Chemical Theory and Computation 5.4 (2009): 866-72.

Tawada, Yoshihiro, et al. "A Long-Range-Corrected Time-Dependent Density Functional Theory." The Journal of Chemical Physics 120.18 (2004): 8425-33.

Titov, A. V., et al. "Generating Efficient Quantum Chemistry Codes for Novel Architectures." Journal of Chemical Theory and Computation 9.1 (2013): 213-21.

Tokita, Y., and H. Nakatsuji. "Ground and Excited States of Hemoglobin CO and Horseradish Peroxidase CO: SAC/SAC-CI Study." J. Phys. Chem. B 101 (1997): 3281-89.

Towles, B., et al. "Unifying on-Chip and Inter-Node Switching within the Anton 2 Network." 2014 ACM/IEEE 41st Annual International Symposium on Computer Architecture (ISCA) (2014): 1-12.

Trotter, H. F. "On the Product of Semi-Groups of Operators." Proc. Amer. Math. Soc. 10 (1959): 545-51.

Tuckerman, M., B. J. Berne, and G. J. Martyna. "Reversible Multiple Time Scale Molecular-Dynamics." Journal of Chemical Physics 97.3 (1992): 1990-2001.

Tuckerman, Mark E., and Bruce J. Berne. "Molecular Dynamics in Systems with Multiple Time Scales: Systems with Stiff and Soft Degrees of Freedom and with Short and Long Range Forces." The Journal of Chemical Physics 95 (1991): 8362-64.

Tuckerman, Mark E., Bruce J. Berne, and G. J. Martyna. "Molecular Dynamics Algorithm for Multiple Time Scales: Systems with Long Range Forces." The Journal of Chemical Physics 94 (1991): 6811-15.

---. "Reversible Multiple Time Scale Molecular Dynamics." The Journal of Chemical Physics 97 (1992): 1990-2001.

Tuckerman, Mark E., Bruce J. Berne, and A. Rossi. "Molecular Dynamics Algorithm for Multiple Time Scales: Systems with Disparate Masses." The Journal of Chemical Physics 94 (1991): 1465-69.

Tuckerman, M. E., A. Chandra, and D. Marx. "Structure and Dynamics of OH-(aq)." Accounts of Chemical Research 39.2 (2006): 151-58.

Tuckerman, Mark E., G. J. Martyna, and Bruce J. Berne. "Molecular Dynamics Algorithm for Condensed Systems with Multiple Time Scales." The Journal of Chemical Physics 93 (1990): 1287-91.

Tuckerman, M. E., D. Marx, and M. Parrinello. "The Nature and Transport Mechanism of Hydrated Hydroxide Ions in Aqueous Solution." Nature 417.6892 (2002): 925-29.

Tuckerman, M. E., and Michele Parrinello. "Integrating the Car-Parrinello Equations. Ii. Multiple Time-Scale Techniques." J. Chem. Phys. 101 (1994): 1316-29.

Tuckerman, M. E., et al. "Ab Initio Molecular Dynamics Simulations." J. Phys. Chem. 100 (1996): 12878-87.

Ufimtsev, I. S., N. Luehr, and T. J. Martinez. "Charge Transfer and Polarization in Solvated Proteins from Ab Initio Molecular Dynamics." Journal of Physical Chemistry Letters 2.14 (2011): 1789-93.

Ufimtsev, I. S., and T. J. Martinez. "Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation." Journal of Chemical Theory and Computation 4.2 (2008): 222-31.

---. "Graphical Processing Units for Quantum Chemistry." Computing in Science & Engineering 10.6 (2008): 26-34.

---. "Quantum Chemistry on Graphical Processing Units. 3. Analytical Energy Gradients, Geometry Optimization, and First Principles Molecular Dynamics." Journal of Chemical Theory and Computation 5.10 (2009): 2619-28.

---. "Quantum Chemistry on Graphical Processing Units. 2. Direct Self-Consistent-Field Implementation." Journal of Chemical Theory and Computation 5.4 (2009): 1004-15.

Valiev, M., et al. "NWChem: A Comprehensive and Scalable Open-Source Solution for Large Scale Molecular Simulations." Comp. Phys. Comm. 181 (2010): 1477.

VandeVondele, J., M. Sulpizi, and M. Sprik. "From Solvent Fluctuations to Quantitative Redox Properties of Quinones in Methanol and Acetonitrile." Angew. Chem. Int. Ed. 45 (2006): 1936-38.

Virshup, A. M., et al. "Photodynamics in Complex Environments: Ab Initio Multiple Spawning Quantum Mechanical/Molecular Mechanical Dynamics." J. Phys. Chem. B 113 (2009): 3280-91.

---. "Photodynamics in Complex Environments: Ab Initio Multiple Spawning Quantum Mechanical/Molecular Mechanical Dynamics." J. Phys. Chem. B 113 (2009): 3280-91.

Vogt, L., et al. "Accelerating Resolution-of-the-Identity Second-Order Moller-Plesset Quantum Chemistry Calculations with Graphical Processing Units." Journal of Physical Chemistry A 112.10 (2008): 2049-57.

Vydrov, O. A., and T. Van Voorhis. "Implementation and Assessment of a Simple Nonlocal Van Der Waals Density Functional." J. Chem. Phys. 132 (2010): 164113.

Vysotskiy, V. P., and L. S. Cederbaum. "Accurate Quantum Chemistry in Single Precision Arithmetic: Correlation Energy." J. Chem. Theo. Comp. Articles ASAP; DOI: 10.1021/ct100533u (2010).

Warshel, A., and M. Levitt. "Theoretical Studies of Enzymatic Reactions: Dielectric, Electrostatic and Steric Stabilization of the Carbenium Ion in the Reaction of Lysozyme." J. Mol. Biol. 103 (1976): 227-49.

Watson, M. A., et al. "Accelerating Correlated Quantum Chemistry Calculations Using Graphical Processing Units." Computing in Science & Engineering 12.4 (2010): 40-50.

White, C. A., and M. Head-Gordon. "Derivation and Efficient Implementation of the Fast Multipole Method." Journal of Chemical Physics 101.8 (1994): 6593-605.

Whitten, J. L. "Coulombic Potential-Energy Integrals and Approximations." Journal of Chemical Physics 58.10 (1973): 4496-501.

---. "Coulombic Potential Energy Integrals and Approximations." The Journal of Chemical Physics 58.10 (1973): 4496-501.

Xie, W., et al. "X-Pol Potential: An Electronic Structure Based Force Field for Molecular Dynamics Simulation of a Solvated Protein in Water." J. Chem. Theo. Comp. 5 (2009): 459-67.

Yamaguchi, Shigeo, et al. "Low-Barrier Hydrogen Bond in Photoactive Yellow Protein." Proceedings of the National Academy of Sciences 106.2 (2009): 440-44.

Yang, W. T. "Direct Calculation of Electron-Density in Density-Functional Theory." Physical Review Letters 66.11 (1991): 1438-41.

---. "Direct Calculation of Electron-Density in Density-Functional Theory - Implementation for Benzene and a Tetrapeptide." Physical Review A 44.11 (1991): 7823-26.

Yasuda, K. "Accelerating Density Functional Calculations with Graphics Processing Unit." Journal of Chemical Theory and Computation 4.8 (2008): 1230-36.

---. "Two-Electron Integral Evaluation on the Graphics Processor Unit." Journal of Computational Chemistry 29.3 (2008): 334-42.