
Transcript of arXiv:2111.10466v1 [quant-ph] 19 Nov 2021

Simulation of quantum physics with Tensor Processing Units: brute-force computation of ground states and time evolution

Markus Hauru,1 Alan Morningstar,2,1 Jackson Beall,1 Martin Ganahl,1 Adam Lewis,1 and Guifre Vidal1

1 Sandbox@Alphabet, Mountain View, CA 94043, USA
2 Department of Physics, Princeton University, Princeton, NJ 08544, USA

(Dated: November 23, 2021)

Tensor Processing Units (TPUs) were developed by Google exclusively to support large-scale machine learning tasks. TPUs can, however, also be used to accelerate and scale up other computationally demanding tasks. In this paper we repurpose TPUs for the challenging problem of simulating quantum spin systems. Consider a lattice model made of N spin-1/2 quantum spins, or qubits, with a Hamiltonian H = Σ_i h_i that is a sum of local terms h_i and a wavefunction |Ψ⟩ consisting of 2^N complex amplitudes. We demonstrate the usage of TPUs for both (i) computing the ground state |Ψ_gs⟩ of the Hamiltonian H, and (ii) simulating the time evolution |Ψ(t)⟩ = e^{−itH} |Ψ(0)⟩ generated by this Hamiltonian starting from some initial state |Ψ(0)⟩. The bottleneck of the above tasks is computing the product H |Ψ⟩, which can be implemented with remarkable efficiency utilising the native capabilities of TPUs. With a TPU v3 pod, with 2048 cores, we simulate wavefunctions |Ψ⟩ of up to N = 38 qubits. The dedicated matrix multiplication units (MXUs), the high bandwidth memory (HBM) on each core, and the fast inter-core interconnects (ICIs) together provide performance far beyond the capabilities of general purpose processors.

The study of quantum many-body phenomena in strongly correlated systems is among the most challenging computational tasks in modern physics. Describing the quantum mechanical wavefunction of a many-body system, say of N interacting quantum spins, requires computational resources that scale exponentially with the system size N. Sophisticated numerical approaches have been devised over the years to tackle such problems. For instance, Quantum Monte Carlo (QMC) methods bypass storing the full wavefunction by instead sampling over statistically significant configurations [1–3], whereas tensor network (TN) algorithms exploit the structure of entanglement in ground states of local Hamiltonians to obtain an efficient quasi-exact description [4–9]. These highly successful methods are customarily used to study emergent quantum phenomena at scale. They have, however, a restricted range of applicability. For instance, none of them can correctly simulate a long Hamiltonian evolution: QMC due to the sign problem, TNs due to the build-up of entanglement over time. In such situations, a brute-force computation requiring exponentially many resources is still today the only reliable approach. A brute-force computation, free of statistical and/or truncation errors, is also useful even in regimes where QMC and TN methods are expected to work, e.g. to benchmark their performance.

Google's Tensor Processing Units (TPUs) are application-specific integrated circuits (ASICs) exclusively designed for accelerating training and inference of machine learning models at scale [10, 11]. Here we consider the third generation (v3) of TPUs. We can think of a TPU v3 pod, with 2048 cores, 100+ petaFLOPS (in half precision) and 32 TB of high bandwidth memory (HBM), as a special-purpose supercomputer. That is, instead of being designed for general purpose high performance computing (HPC) tasks, a TPU pod is a supercomputer optimized to excel at a class of specialized workloads required for machine learning. Nevertheless, one can still inquire whether TPUs' acceleration and scalability may be repurposed for other computational tasks [12–26], in analogy with how Graphics Processing Units (GPUs), originally conceived to accelerate the rendering of 3D graphics in gaming, have extended to a much broader range of applications, including general purpose HPC and artificial intelligence.

In this paper we explain how to use TPUs for brute-force simulations of quantum many-body physics. In short, the wavefunction |Ψ⟩ of N spin-1/2 quantum spins, or qubits, is distributed over the available HBM and then updated under the action of a local Hamiltonian H,

|Ψ⟩ ↦ |Ψ′⟩ = H |Ψ⟩ ,    (1)

which requires both large matrix multiplications on each TPU core and repeatedly re-distributing the wavefunction over the cores. Matrix multiplications and wavefunction re-distribution are handled by the TPU's remarkably fast matrix multiply units (MXUs) and inter-core interconnects (ICIs), respectively. Fig. 1 shows typical update times for up to N = 38 qubits.

The ability to update the wavefunction |Ψ⟩ according to Eq. (1) can then be used as the basis for a number of computations, which we illustrate for quantum spin chain Hamiltonians. First we obtain the ground state energy E_gs and ground state wavefunction |Ψ_gs⟩ of H,

H |Ψ_gs⟩ = E_gs |Ψ_gs⟩ ,    E_gs = min_{|Ψ⟩} ⟨Ψ|H|Ψ⟩ / ⟨Ψ|Ψ⟩ .    (2)

We then use the update (1) to simulate time evolution according to H,

|Ψ(t)⟩ = e^{−itH} |Ψ(0)⟩ ,    (3)

starting from some initial state |Ψ(0)⟩. From the wavefunction |Ψ⟩, one can compute all sorts of derived properties, such as correlation functions or entanglement measures, which we also illustrate.


Tensor Processing Units.— TPUs come in boards. Each board holds 8 TPU cores controlled by a CPU host. Multiple boards can be joined together to form larger units, up to a so-called pod, which for third generation TPUs consists of 2048 cores (that is, 256 boards). Each TPU v3 core has 16 GB of high bandwidth memory (HBM) attached to it, which amounts to 128 GB of HBM per board, or 32 TB per pod. Each TPU v3 core also has two matrix multiplication units (MXUs), which together deliver 52.5 teraFLOPS (floating-point operations per second) of matrix multiplication performance in half precision. MXUs natively perform matrix products by truncating single precision inputs to a half precision format called bfloat16, and accumulating the result again in single precision. Single precision matrix products are then achieved by composing six passes through the MXU (so the above FLOP counts can be divided by six to yield the equivalent single precision FLOPS). Finally, double precision is also available through software emulation with significant additional time and memory overheads [27].

A key aspect of TPUs, both in single-board and multi-board configurations, is that all the cores are directly connected to nearest neighbors in a toroidal mesh using fast ICIs and can therefore communicate without going through their CPU host(s). As a result, we observe multi-core performance scaling nearly linearly in the number of cores (e.g. 420 teraFLOPS per board, 100+ petaFLOPS per pod) even in tasks requiring significant communication between cores.

Code to run on TPUs can be written in several different frameworks, but we use the Python library JAX, which interfaces with the XLA just-in-time compiler [28, 29]. Parallelism between cores follows the single instruction multiple data (SIMD) paradigm. The HBM stores arrays in chunks of 8 × 128, and the MXUs natively perform matrix products of 128 × 128 matrices. Care must be taken when writing code for TPUs to make sure array dimensions come in multiples of these sizes, to maintain high performance.
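To make the shape constraint concrete, here is a minimal JAX sketch (our own illustration, not code from the paper; the shapes and names are ours). It applies a 128 × 128 operator to a block of amplitudes whose trailing axes are already MXU-aligned, requesting effectively single precision accumulation:

    import jax
    import jax.numpy as jnp

    # Toy block of amplitudes: trailing axes are multiples of 8 and 128,
    # so no zero-padding is triggered on the TPU.
    psi_block = jnp.zeros((2**15, 8, 128), dtype=jnp.complex64)
    h = jnp.zeros((128, 128), dtype=jnp.complex64)  # a 7-qubit term

    @jax.jit
    def apply_term(block, h):
        # Precision.HIGHEST asks XLA to compose several bfloat16 MXU
        # passes into an effectively single precision matrix product.
        return jnp.einsum('ab,ijb->ija', h, block,
                          precision=jax.lax.Precision.HIGHEST)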

Wavefunction distribution.— Consider the wavefunction |Ψ⟩ of N spin-1/2 quantum spins, or N qubits, characterized by the 2^N complex amplitudes Ψ_{b1 b2 ··· bN} = ⟨b1 b2 ··· bN|Ψ⟩. Each amplitude is indexed by a bit-string (b1, ..., bN), where b_i ∈ {0, 1}, and stored here as a pair of single precision floating-point numbers for the real and imaginary parts. Assume that 2^Ng TPU cores are available. [For instance, a board has 2^3 = 8 cores and thus Ng = 3, whereas a full pod has 2^11 = 2048 cores, or Ng = 11.] We divide the N qubits into two groups: the first Ng qubits and the remaining Nl = N − Ng qubits, which are called global and local qubits, respectively. We also decompose each bit-string (b1, ..., bN) into global and local pieces (b1, ..., b_Ng) and (b_{Ng+1}, ..., bN). We then break the 2^N-element array Ψ_{b1 ··· b_Ng b_{Ng+1} ··· bN} into 2^Ng sub-arrays of 2^Nl elements each, corresponding to the components with constant global bit-string (b1, ..., b_Ng), and store each sub-array locally in the HBM of a single TPU core. For instance, for a TPU board (Ng = 3) we would have 8 sub-arrays distributed over 8 cores, labelled #0 to #7. [Diagram omitted: the distributed wavefunction is drawn with one line per qubit, the lines on the left corresponding to global qubits.] TPU memory considerations require that we store each local sub-array as an array with two or more dimensions, where the last two dimensions are multiples of 8 and 128 respectively; see App. A for more details.
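As a toy illustration of this layout (our own sketch; the paper's actual distribution code is not shown), one can form the 2^Ng sub-arrays by reshaping the flat vector of amplitudes, with the global bit-string as the leading axis:

    import jax
    import jax.numpy as jnp

    N, Ng = 30, 3                 # 30 qubits over 2**3 = 8 cores (one board)
    Nl = N - Ng

    # Flat vector of 2**N amplitudes (for illustration only; at scale, a
    # real run would build each shard directly on its core).
    psi = jnp.zeros((2**N,), dtype=jnp.complex64)

    # One sub-array of 2**Nl amplitudes per core; the trailing axes of size
    # 2**3 = 8 and 2**7 = 128 match the HBM tiling described above.
    shards = psi.reshape(2**Ng, 2**(Nl - 10), 2**3, 2**7)
    distributed = jax.pmap(lambda x: x)(shards)   # one shard per device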

Wavefunction update.— Consider now a local Hamiltonian H = Σ_{i=1}^{P} h_i that decomposes as a sum of P local terms h_i, each acting on 7 qubits. We focus on 7-qubit terms because they fit the memory constraints of TPUs; see App. B for how to turn a more generic local Hamiltonian into this form. In order to apply H on |Ψ⟩ as in Eq. (1), we compute a sequence of products h_i |Ψ⟩ for i = 1, ..., P, which we accumulate in wavefunctions |Ψ^(i)⟩,

|Ψ^(i)⟩ = |Ψ^(i−1)⟩ + h_i |Ψ⟩ ,    i = 1, ..., P,    (4)

where |Ψ^(0)⟩ = 0 is the null vector and |Ψ^(P)⟩ = H |Ψ⟩ contains the final result. Each Hamiltonian term h_i is represented by a 2^7 × 2^7 = 128 × 128 matrix and broadcast to each TPU core. Assume first that the term h_i acts on the last 7 local qubits. We then simply multiply the distributed array for Ψ, regarded as a 2^{N−7} × 2^7 matrix, with the 2^7 × 2^7 matrix h_i. This product is performed for each sub-array locally on each TPU core, utilizing the MXUs. When h_i acts on other local qubits, we can choose to permute their order so that h_i effectively acts on the last 7 local qubits. This requires reshaping and/or transposing the local sub-array of Ψ on each core, while paying due attention to the restrictions on array shapes mentioned above (see also App. A). All the above operations are executed in parallel and without inter-core communication, with each core manipulating its local sub-array according to an identical set of instructions.
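A single-core sketch of this accumulation (our own illustration with simplified shapes; in the actual code each core applies the same instructions to its shard):

    import jax.numpy as jnp

    def apply_hamiltonian(psi_matrix, terms):
        # psi_matrix: local amplitudes reshaped to (2**(N-7), 128);
        # terms: list of 128 x 128 matrices h_i, each already permuted
        # to act on the last 7 local qubits. Implements Eq. (4).
        out = jnp.zeros_like(psi_matrix)
        for h in terms:
            out = out + psi_matrix @ h.T    # accumulate h_i |psi>
        return out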

Finally, when the term h_i acts on at least one global qubit, we first re-distribute the wavefunction |Ψ⟩, by swapping (possibly a subset of) the global qubits with an equivalent number of local qubits, in such a way that the Hamiltonian term h_i ends up acting on local qubits, at which point we can compute h_i |Ψ⟩ as described above. For instance, in a single-board (i.e. 8-core) set-up, with Ng = 3, we may want to swap the three global qubits with the first three local qubits.

The re-distribution of |Ψ⟩, executed remarkably fast thanks to the ICIs, requires substantial communication between the cores. Fig. 1 shows the total time required for the update |Ψ⟩ ↦ H |Ψ⟩ for an example Hamiltonian. Using a full TPU v3 pod, the wavefunction for N = 38 qubits is updated in about 1 second.
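The global-local swap itself can be expressed with a JAX collective; the following is a schematic sketch under our own simplifying assumptions (2^Ng devices, and the first Ng local qubits exposed as the leading axis of each block), not the paper's implementation:

    import jax
    from functools import partial

    @partial(jax.pmap, axis_name='cores')
    def swap_global_and_local(block):
        # block has shape (2**Ng, 2**(Nl - Ng)) on each core. all_to_all
        # exchanges the leading local axis with the device axis, which is
        # precisely the swap of global and local qubits described above.
        return jax.lax.all_to_all(block, 'cores', split_axis=0, concat_axis=0)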

Computation of ground states.— To illustrate the computation of a ground state |Ψ_gs⟩, Eq. (2), we apply the Lanczos algorithm, based on the update |Ψ⟩ ↦ H |Ψ⟩ as described in App. D, to the 1D XXZ Hamiltonian,

H = J Σ_{i=1}^{N} (X_i X_{i+1} + Y_i Y_{i+1} + Δ Z_i Z_{i+1}),    (5)

where X, Y, and Z are the Pauli matrices and site N is identified with site 0 (periodic boundary conditions). For the couplings J = −1 and Δ = 1/2, this model is known to be at a quantum critical point, in the universality class of a bosonic conformal field theory with unit central charge [31]. Our current implementation, which handles generic 1D local Hamiltonians, does not exploit any of the several symmetries of the XXZ model, e.g. translation invariance or internal U(1) symmetry. The 2-qubit terms in the above Hamiltonian are merged together into 7-qubit terms (see App. B) to optimize the use of the MXUs. For this computation we use at most 128 TPU v3 cores (Ng = 7) to reach up to N = 34 qubits, with the total time for the computation being around 12 minutes.
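For reference, a minimal numpy sketch (ours, not the paper's code) of the nearest-neighbor XXZ term appearing in Eq. (5); in the actual computation such 4 × 4 terms are blocked into 128 × 128 seven-qubit terms as in App. B:

    import numpy as np

    X = np.array([[0, 1], [1, 0]], dtype=np.complex64)
    Y = np.array([[0, -1j], [1j, 0]], dtype=np.complex64)
    Z = np.array([[1, 0], [0, -1]], dtype=np.complex64)

    J, Delta = -1.0, 0.5
    # Two-site term k_[i,i+1] of the XXZ chain, a 4 x 4 Hermitian matrix.
    k = J * (np.kron(X, X) + np.kron(Y, Y) + Delta * np.kron(Z, Z))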

Importantly, from the distributed array for |Ψ⟩ we can extract a large variety of quantities, including e.g. wavefunction overlaps ⟨Ψ1|Ψ2⟩, expectation values of observables such as the energy ⟨Ψ|H|Ψ⟩, correlation functions and entanglement measures.

[Figure 1: time (s, log scale) for the update versus number of qubits (20–38); curves for CPU and for 8, 32, 128, 512, and 2048 (projected) TPU cores.]

FIG. 1. Compute time for the update |Ψ⟩ ↦ H |Ψ⟩ for a 1D, nearest neighbor Hamiltonian such as that in Eq. (5), as a function of the number of qubits. Different colors represent different numbers of TPU cores, ranging from a single board (8 cores) to a full pod (2048 cores). Times are averaged over a large number of repetitions. Results for 2048 cores are extrapolated estimates, due to temporary resource constraints. CPU results were run on an 8-core 2.3 GHz Intel Xeon workstation, using a Numba-based [30] implementation. They represent a typical workstation, rather than cutting-edge high-performance computing hardware.

[Figure 2: left panel, XX two-point correlator versus distance; right panel, second Rényi entropy versus subsystem size; curves for N = 22, 26, 30, 34.]

FIG. 2. Properties of the ground state of the 1D XXZ model in Eq. (5) with J = −1, Δ = 1/2. Left: the ⟨X1 Xn⟩ − ⟨X1⟩⟨Xn⟩ connected two-point correlator as a function of the distance n. Right: second Rényi entropy S2 of a subsystem as a function of subsystem size. The different lines are for different system sizes.

This requires additional operations that can also be easily implemented on TPUs. The reduced density matrix ρ_A = Tr_Ā |Ψ⟩⟨Ψ| over a subset A of qubits (where the trace is over the rest of the qubits, denoted Ā), and derived quantities such as von Neumann and Rényi entropies, can also be computed for subsystems of up to N/2 qubits. Handling them requires distributed linear algebra methods on TPUs, discussed in Ref. 19. As an example, Fig. 2 shows the connected two-point correlator ⟨Ψ|X1 Xn|Ψ⟩ − ⟨Ψ|X1|Ψ⟩⟨Ψ|Xn|Ψ⟩ as a function of the distance n between spins, as well as the second Rényi entropy S2(ρ_A) = −log2 Tr[(ρ_A)^2], where ρ_A corresponds to a subsystem A of contiguous spins.

Simulation of time evolution.— Our second application is the simulation of a time evolution |Ψ(t)⟩ = e^{−itH} |Ψ(0)⟩ that generates large amounts of entanglement. We define a suitably small time interval δt such that q ≡ t/δt is an integer, write the time evolution operator for time t, exp(−itH), as the q-fold product of the time evolution operator for δt, exp(−itH) = Π_{j=1}^{q} exp(−i δt H), and then approximate each operator exp(−i δt H) as a sequence of six terms of the form (1 − i a_n δt H),

exp(−i δt H) = (1 − i a_6 δt H) × ··· × (1 − i a_2 δt H) × (1 − i a_1 δt H) + O(δt^7),    (6)

where the coefficients a_n are solved for numerically, see App. C. The dominant term neglected above is H^7 δt^7 / 7!. Each update |Ψ⟩ ↦ (1 − i a_n δt H) |Ψ⟩ requires computing |Ψ⟩ ↦ H |Ψ⟩ and is implemented while keeping only a few copies of |Ψ⟩ in memory simultaneously.
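Schematically, the resulting time stepping is a pair of nested loops (a sketch under our own naming; apply_H stands for the distributed update |Ψ⟩ ↦ H |Ψ⟩ described earlier):

    def evolve(apply_H, psi, a_coeffs, dt, q):
        # exp(-i t H)|psi> with t = q*dt, via Eq. (6): six factors
        # (1 - i a_n dt H) per time step, each costing one H|psi>.
        for _ in range(q):
            for a in a_coeffs:
                psi = psi - 1j * a * dt * apply_H(psi)
        return psi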

For this demonstration we choose a 1D local Hamiltonian of the form H = Σ_{i=1}^{N} h_{[i,i+5]}, where each h_{[i,i+5]} is a Hermitian matrix of size 2^6 × 2^6 that operates on qubits i through i + 5, thus exemplifying the use of k-qubit Hamiltonian terms for k > 2. H is set to have periodic boundaries. The matrix elements of h_{[i,i+5]}, which represents a generic non-integrable many-body interaction, are Gaussian distributed with mean zero and standard deviation √6 · 2^{−5} (such that the average Frobenius norm of each term is ⟨‖h_{[i,i+5]}‖⟩ = √6). The 6-qubit terms h_{[i,i+5]} are still pairwise blocked into 7-qubit terms (see App. B) in order to optimally utilize the MXUs.

Fig. 3 shows the second Rényi entropy, as a function of time t, of various subsystems for the state |Ψ(t)⟩ = exp(−itH) |Ψ(0)⟩, where the initial state |Ψ(0)⟩ has no entanglement, and the system has N = 32 qubits. On 32 TPU v3 cores (Ng = 5), simulating this evolution, which includes q = 500 time steps with δt = 0.02, and evaluating the entropies took 44 minutes. At early times the entropy is seen to grow similarly for all subsystems, regardless of their size. However, in time each subsystem saturates to a constant amount of entropy, dependent only on the size of the subsystem. Specifically, for a subsystem of M qubits this saturation value is roughly M bits of entropy, except for subsystems that cover a large proportion of the total system, such as M = N/2, for which it is somewhat less, as expected [32]. For the larger subsystems we see a sustained period of linear growth, which is evidence for ballistic growth of entanglement, as expected for a random Hamiltonian [33]. The rapid growth of entanglement all the way to saturation makes this time evolution very hard to simulate using tensor network methods such as Matrix Product States (MPS), which rely on the wavefunction being only moderately entangled and thus compressible.

[Figure 3: second Rényi entropy versus time; one curve per subsystem size 2, 4, 6, 8, 10, 12, 14, 16.]

FIG. 3. Second Rényi entropy S2 of various subsystems, as a function of time t, during an entangling time evolution |Ψ(t)⟩ = e^{−itH} |Ψ(0)⟩ on N = 32 qubits, starting from an unentangled product state |Ψ(0)⟩ and according to a 1D random Hamiltonian with 6-qubit terms. For each subsystem size, the entropy initially grows until saturating to a maximum value corresponding to a random wavefunction.

Indeed, in this example an MPS description would require a central bond dimension close to 2^16 = 65,536. In other words, the largest MPS tensors have a similar size as the whole wavefunction (2^32 complex coefficients), making the MPS approach even more costly than the present brute-force approach.

Discussion.— TPUs, originally designed for machine learning workloads, can be effectively repurposed to accelerate and scale up a number of other computationally demanding tasks [12–26]. In this paper we have demonstrated that TPUs can be used to compute ground states and simulate real time evolution in many-body systems. A very high matrix multiplication throughput on each core (enabled by the MXUs), combined with the ability to directly connect up to thousands of TPU cores together (using the fast ICIs) for a single parallel simulation with dozens of terabytes of HBM, allowed us to consider an unusually large number N of qubits at remarkable speed, see e.g. Fig. 1. This required adjusting to a number of characteristic features of TPUs and their XLA compiler, such as operating under the SIMD paradigm and using specific shapes for the matrices and arrays. Importantly, operating with k-qubit Hamiltonian terms for k < 7 has a computational cost similar to the k = 7 case, for which we observe peak performance.

Here we have considered local Hamiltonians H in one dimension for concreteness. It is of course also possible to address 2D/3D local Hamiltonians, e.g. for square/cubic lattices, where sizes 6×6, 4×9 or 4×3×3 are easily accessible. To go well beyond these system sizes, one has to give up a brute-force, exact simulation based on storing the full wavefunction, and use instead approximate methods such as tensor network approaches. Tensor network methods can also be significantly accelerated and scaled up with TPUs, allowing e.g. for MPS representations of 100-qubit wavefunctions with unusually large bond dimensions [25], capable of describing larger amounts of entanglement than is possible on regular hardware.

ACKNOWLEDGMENTS

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Sandbox is a team within the Alphabet family of companies, which includes Google, Verily, Waymo, X, and others. GV is a CIFAR fellow in the Quantum Information Science Program and a Distinguished Visiting Research Chair at Perimeter Institute. Research at Perimeter Institute is supported by the Government of Canada through the Department of Innovation, Science and Economic Development and by the Province of Ontario through the Ministry of Research, Innovation and Science.

[1] H. G. Evertz, "The loop algorithm," Advances in Physics 52, 1–66 (2003), arXiv:cond-mat/9707221.

[2] N. V. Prokof'ev, B. V. Svistunov, and I. S. Tupitsyn, "Exact, complete, and universal continuous-time worldline Monte Carlo approach to the statistics of discrete quantum systems," Journal of Experimental and Theoretical Physics 87, 310–321 (1998), arXiv:cond-mat/9703200.

[3] Olav F. Syljuasen and Anders W. Sandvik, "Quantum Monte Carlo with directed loops," Phys. Rev. E 66, 046701 (2002), arXiv:cond-mat/0202316.

[4] Steven R. White, "Density matrix formulation for quantum renormalization groups," Phys. Rev. Lett. 69, 2863–2866 (1992).

[5] Guifre Vidal, "Efficient simulation of one-dimensional quantum many-body systems," Phys. Rev. Lett. 93, 040502 (2004).

[6] F. Verstraete and J. I. Cirac, "Renormalization algorithms for quantum many-body systems in two and higher dimensions," (2004), arXiv:cond-mat/0407066.

[7] G. Vidal, "A class of quantum many-body states that can be efficiently simulated," Phys. Rev. Lett. 101, 110501 (2008), arXiv:quant-ph/0610099.

[8] Michael Levin and Cody P. Nave, "Tensor renormalization group approach to 2D classical lattice models," Phys. Rev. Lett. 99, 120601 (2007), arXiv:cond-mat/0611687.

[9] Jutho Haegeman, J. Ignacio Cirac, Tobias J. Osborne, Iztok Pizorn, Henri Verschelde, and Frank Verstraete, "Time-dependent variational principle for quantum lattices," Phys. Rev. Lett. 107 (2011), arXiv:1103.0936.

[10] Norman Jouppi, Doe Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson, "A domain-specific supercomputer for training deep neural networks," Communications of the ACM 63, 67–78 (2020).

[11] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al., "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17 (Association for Computing Machinery, New York, NY, USA, 2017) pp. 1–12.

[12] Francois Belletti, Davis King, Kun Yang, Roland Nelet, Yusef Shafi, Yi-Fan Chen, and John Anderson, "Tensor processing units for financial Monte Carlo," (2020), arXiv:1906.02818 [cs.DC].

[13] Qing Wang, Matthias Ihme, Yi-Fan Chen, and John Anderson, "A TensorFlow simulation framework for scientific computing of fluid flows on tensor processing units," (2021), arXiv:2108.11076 [physics.comp-ph].

[14] Zhixin Pan and Prabhat Mishra, "Hardware acceleration of explainable machine learning using tensor processing units," (2021), arXiv:2103.11927 [cs.LG].

[15] Tianjian Lu, Thibault Marin, Yue Zhuo, Yi-Fan Chen, and Chao Ma, "Accelerating MRI reconstruction on TPUs," (2020), arXiv:2006.14080 [cs.CE].

[16] Chao Ma, Thibault Marin, TJ Lu, Yi-Fan Chen, and Yue Zhuo, "Nonuniform fast Fourier transform on TPUs," (2021).

[17] Tianjian Lu, Yi-Fan Chen, Blake Hechtman, Tao Wang, and John Anderson, "Large-scale discrete Fourier transform on TPUs," (2020), arXiv:2002.03260 [cs.MS].

[18] Fantine Huot, Yi-Fan Chen, Robert Clapp, Carlos Boneti, and John Anderson, "High-resolution imaging on TPUs," (2019), arXiv:1912.08063 [cs.CE].

[19] Adam G. M. Lewis et al., "Tensor Processing Units for Distributed Dense Linear Algebra," Sandbox@Alphabet, in preparation.

[20] Martin Ganahl et al., "Tensor Processing Units for Simulating Quantum Circuits," Sandbox@Alphabet, in preparation.

[21] Alan Morningstar, Markus Hauru, Jackson Beall, Martin Ganahl, Adam G. M. Lewis, Vedika Khemani, and Guifre Vidal, "Simulation of quantum many-body dynamics with Tensor Processing Units: Floquet prethermalization," Sandbox@Alphabet, in preparation.

[22] Ross Shillito, Alexandru Petrescu, Joachim Cohen, Jackson Beall, Markus Hauru, Martin Ganahl, Adam G. M. Lewis, Alexandre Blais, and Guifre Vidal, "Classical simulation of superconducting quantum hardware using Tensor Processing Units," Sandbox@Alphabet, in preparation.

[23] Ryan Pederson et al., "Tensor Processing Units for Quantum Chemistry," Sandbox@Alphabet, in preparation.

[24] Ruyi Song et al., "Simulation of Spin Light Emitting Diodes using Tensor Processing Units," Sandbox@Alphabet, in preparation.

[25] Martin Ganahl et al., "Density Matrix Renormalization Group using Tensor Processing Units," Sandbox@Alphabet, in preparation.

[26] Erik Gustafson, Burt Holzman, James Kowalkowski, Henry Lamm, Andy C. Y. Li, Gabriel Perdue, Sergio Boixo, Sergei Isakov, Orion Martin, Ross Thomson, et al., "Large scale multi-node simulations of Z2 gauge theory quantum circuits using Google Cloud Platform," (2021), arXiv:2110.07482 [quant-ph].

[27] All the results presented in this paper correspond to end-to-end computations conducted in single precision. This is in contrast with the double precision often used in scientific computing. Our numerical experiments confirmed that single precision, capable of roughly 7 digits of accuracy, was sufficient to yield stable, numerically precise results. This is consistent with the error accumulation analysis of Ref. 21, where very deep quantum circuits, with several millions of gates, were successfully simulated using single precision.

[28] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang, "JAX: composable transformations of Python+NumPy programs," (2018).

[29] https://tensorflow.org/xla, accessed: 2021-10-01.

[30] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert, "Numba: A LLVM-based Python JIT compiler," in Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM '15 (Association for Computing Machinery, New York, NY, USA, 2015).

[31] Malte Henkel, Conformal Invariance and Critical Phenomena (Springer, Berlin, Heidelberg, 1999).

[32] Patrick Hayden, Debbie Leung, and Andreas Winter, "Aspects of generic entanglement," Communications in Mathematical Physics 265, 95–117 (2006).

[33] Hyungwon Kim and David A. Huse, "Ballistic spreading of entanglement in a diffusive nonintegrable system," Phys. Rev. Lett. 111, 127205 (2013), arXiv:1306.4306.

Appendix A: Distribution and manipulation of an N-qubit wavefunction on TPUs

Consider the wavefunction |Ψ⟩ of a system of N qubits,

|Ψ⟩ = Σ_{b1=0}^{1} Σ_{b2=0}^{1} ··· Σ_{bN=0}^{1} Ψ_{b1 b2 ··· bN} |b1 b2 ··· bN⟩ ,    (A1)

which is characterized by 2^N complex amplitudes Ψ_{b1 b2 ··· bN} = ⟨b1 b2 ··· bN|Ψ⟩. In this appendix we explain how to distribute the 2^N amplitudes Ψ_{b1 b2 ··· bN} over a number 2^Ng of TPU cores. We also explain how to update the distributed wavefunction under the action of a 7-qubit operator h, |Ψ⟩ ↦ h |Ψ⟩.

We index each amplitude Ψ_{b1 b2 ··· bN} by a bit-string (b1, ..., bN), where b_i ∈ {0, 1}, corresponding to the computational basis vector |b1 b2 ··· bN⟩. We then divide the N qubits into two groups. The first group contains Ng qubits, which we call global qubits. The second group contains the remaining Nl = N − Ng qubits, referred to as local qubits. Accordingly, each bit-string (b1, ..., bN) is divided into a global part (b1, ..., b_Ng) and a local part (b_{Ng+1}, ..., bN). The global part (b1, ..., b_Ng) determines the TPU core where the amplitude is stored. For instance, for Ng = 3 we have 8 cores, labelled {(000), (001), (010), ..., (111)}, and core (000) stores the amplitudes Ψ_{000 b4 ··· bN}, core (001) stores the amplitudes Ψ_{001 b4 ··· bN}, etc. In this way, core (c1, ..., c_Ng) (where we temporarily label the global part of the bit-string using c's to emphasize that they are constant on a given core) stores the 2^Nl-element sub-array Ψ_{c1 ··· c_Ng b_{Ng+1} ··· bN} with components labelled by the local bit-string (b_{Ng+1}, ..., bN) corresponding to the local qubits.

In other words, we think of the 2^N amplitudes Ψ_{b1 b2 ··· bN} of a single distributed array as 2^Ng sub-arrays of size 2^Nl, each one stored locally on the corresponding TPU core.

Data in TPU memory is stored in chunks of size 8 × 128. This means that, for any array, the last two indices must have sizes that are multiples of 8 and 128 (otherwise the array will be padded with zeros until it has that shape, incurring a waste of memory that should be avoided). Given that 2^3 = 8 and 2^7 = 128, this corresponds to using the last two indices of the array to label at least 3 and 7 qubits, respectively. We thus store the wavefunction using a distributed array with a shape that adjusts to this rule, for instance the shape (2^Ng, 2^{Nl−10}, 2^3, 2^7), that is, where each local sub-array on each core has shape (2^{Nl−10}, 2^3, 2^7).

Our next step is to consider the update |Ψ⟩ ↦ h |Ψ⟩, where h represents a 7-qubit operator, or a matrix with shape (128, 128), which we broadcast so that each core holds a copy. Let us label the components h_{α⃗ α⃗′} of this matrix by indices α⃗ resulting from composing 7 binary indices, α⃗ = (α1, α2, ..., α7), with α_i ∈ {0, 1}. We next consider three cases, depending on which qubits the 7-qubit operator h acts on.

(i) First we assume that h acts on the last 7 local qubits. We can then simply multiply h and the local |Ψ⟩ together by contracting over the last index of the local sub-array,

Ψ_{c1 ··· c_Ng b_{Ng+1} ··· b_{N−7} α⃗} ↦ Σ_{α⃗′} h_{α⃗ α⃗′} Ψ_{c1 ··· c_Ng b_{Ng+1} ··· b_{N−7} α⃗′} .    (A2)

Importantly, this update of the distributed wavefunction can be accomplished by having each TPU core update its corresponding local sub-array, so that the 2^Ng local sub-arrays are updated in parallel without need for inter-core communication.

(ii) Next we assume that the 7-qubit gate acts on arbitrary local qubits, and not just the last 7 qubits as before. In this case, we re-organize each of the local sub-arrays so as to move the 7 targeted qubits to the last 7 positions, followed by the above contraction, and then by the reversal of the above re-organization. The re-organization of a local sub-array consists of a sequence of reshapes and transpositions. For instance, we could reshape the local sub-array from shape (2^6, 2^3, 2^7) to shape (2^2, 2^7, 2^7) and then transpose the two last indices of the resulting local sub-array. Notice that at each step of this example, we store a local array where the ranges of the last two indices correspond to, at least, 3 and 7 qubits, respectively (that is, to multiples of 8 and 128), so as to prevent padding with zeros. Once again, all these operations are performed in parallel without need for inter-core communication.
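In numpy-like notation, the re-organization in this example reads (a toy single-core sketch with our own variable names):

    import numpy as np

    block = np.zeros((2**6, 2**3, 2**7), dtype=np.complex64)  # local sub-array
    h = np.zeros((2**7, 2**7), dtype=np.complex64)            # 7-qubit term

    tmp = block.reshape(2**2, 2**7, 2**7)    # expose the 7 targeted qubits
    tmp = tmp.transpose(0, 2, 1)             # move them to the last axis
    tmp = np.einsum('ab,ijb->ija', h, tmp)   # contract with h as in (A2)
    tmp = tmp.transpose(0, 2, 1)             # undo the transposition
    block = tmp.reshape(2**6, 2**3, 2**7)    # undo the reshape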

(iii) Finally, to apply the update |Ψ⟩ ↦ h |Ψ⟩ when h acts on a set of qubits that includes at least one of the global qubits, we first re-distribute the wavefunction so that all the targeted qubits become local, then proceed as above, and then re-distribute the resulting wavefunction back to the original form. For instance, if Ng = 3 and h acts on the three global qubits, but not on the first three local qubits, we can re-distribute the wavefunction by swapping the global qubits with the first three local qubits. Notice that this transformation cannot be performed locally on each TPU core, but requires instead substantial communication between the cores. TPUs can perform such inter-core communication remarkably fast, thanks to the dedicated ICIs that connect the cores directly, bypassing the CPU hosts on each TPU board.

Appendix B: Local Hamiltonian

As discussed in the main text, TPUs natively do matrix products of size 128 × 128, or 2^7 × 2^7. In this Appendix we explain how to use this ability to compute the update |Ψ⟩ ↦ H |Ψ⟩ for a local Hamiltonian H that decomposes as a sum of local terms.

Consider a local Hamiltonian

H = Σ_{i=1}^{Q} k_i ,    (B1)

where each of the Q terms k_i is an operator that acts non-trivially on at most 7 qubits, possibly less. Recall that on a TPU, a matrix multiplication involving matrices with dimensions less than 128 × 128 is done by padding the matrices with zeros until they are of size 128 × 128. That means that we would like to be working with local terms h_i that act on exactly 7 qubits, not less. Accordingly, our first goal is to rewrite the above Hamiltonian as

H = Σ_{i=1}^{P} h_i ,    (B2)

where each 7-qubit term h_i may come from adding two or more terms k_i in Eq. (B1). As an example, consider a 1D local Hamiltonian where each term k_i acts on two consecutive qubits, denoted k_{[i,i+1]}. Then we define the 7-qubit terms h_i, denoted h_{[i,i+6]}, by adding together 6 of such terms. For instance, the first 7-qubit operator reads

h_{[1,7]} = k_{[1,2]} ⊗ 1_5 + 1_1 ⊗ k_{[2,3]} ⊗ 1_4 + 1_2 ⊗ k_{[3,4]} ⊗ 1_3 + ··· + 1_5 ⊗ k_{[6,7]} .    (B3)

Here 1_m is the identity on m qubits. We can similarly define h_{[7,13]}, h_{[13,19]}, etc. Notice that, in this example, if e.g. N = 22, the last term h_{[19,22]} only acts on 4 qubits. In that case, we tensor h_{[19,22]} with the identity 1_3 on 3 additional qubits, into a 7-qubit operator h_{[16,22]} = 1_3 ⊗ h_{[19,22]}.
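A small numpy sketch of this blocking (our own helper, not the paper's code) builds h_{[1,7]} from the six 4 × 4 terms k_{[1,2]}, ..., k_{[6,7]}:

    import numpy as np

    def block_two_qubit_terms(ks):
        # ks: list of six 4 x 4 terms [k_[1,2], ..., k_[6,7]].
        # Returns the 128 x 128 seven-qubit term of Eq. (B3).
        h = np.zeros((2**7, 2**7), dtype=np.complex64)
        for m, k in enumerate(ks):          # m qubits of identity on the left
            h += np.kron(np.kron(np.eye(2**m), k), np.eye(2**(5 - m)))
        return h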

To compute |Ψ⟩ ↦ H |Ψ⟩ for H = Σ_i h_i, we compute a sequence of products h_i |Ψ⟩ for i = 1, ..., P, which we accumulate in wavefunctions |Ψ^(i)⟩,

|Ψ^(i)⟩ = |Ψ^(i−1)⟩ + h_i |Ψ⟩ ,    i = 1, ..., P,    (B4)

where |Ψ^(0)⟩ = 0 is the null vector and |Ψ^(P)⟩ = H |Ψ⟩ contains the final result. Notice that during the above steps we need to keep at least three distributed arrays on the TPUs, corresponding to the wavefunctions |Ψ⟩, h_i |Ψ⟩ and |Ψ^(i−1)⟩.

It is instructive to evaluate what the loss in efficiency is if we disregard the rule that matrix products should involve matrices of shape (2^7, 2^7) = (128, 128) (or, more generally, multiples of these dimensions). Suppose, as a concrete example, that our goal is to multiply the 2^N-element distributed array for an N-qubit wavefunction |Ψ⟩ [properly reshaped into a matrix with shape (2^{N−2}, 2^2)] by a two-qubit term k [or matrix with shape (2^2, 2^2)]. On a CPU, this matrix product would require roughly 2^{N−2} × 2^2 × 2^2 = 2^{N+2} floating-point multiplications (and a similar number of additions, which we ignore in our counting for simplicity). On a TPU, we first need to pad both matrices with zeros until they reach shapes (2^{N−2}, 2^7) and (2^7, 2^7), respectively. We see that storing the wavefunction in this format requires the same memory as 2^5 = 32 copies of |Ψ⟩ stored without padding. For large N, this is a massive waste of memory! In addition, the matrix multiplication now requires 2^{N−2} × 2^7 × 2^7 = 2^{N+12} floating-point multiplications, which is 2^{10} = 1024× more than on a CPU! Again, a massive waste of compute time.

Given the constraints forced on us by the MXU, how can we more efficiently accomplish the above product? We proceed by augmenting the two-qubit operator k into a 7-qubit operator h, given by h = k ⊗ 1_5, that acts as the identity on 5 additional qubits. We then reshape the 2^N-element array into a matrix of shape (2^{N−7}, 2^7) and multiply it by the (2^7, 2^7)-shaped matrix for h. Notice that no padding with zeros is now required. In addition, the matrix multiplication now requires roughly 2^{N−7} × 2^7 × 2^7 = 2^{N+7} floating-point operations, which is 2^5 = 32× more than the operations needed on a CPU (still a very significant waste, but better than the factor 1024×). In addition, if we can join the two-qubit term k with other two-qubit terms that are also part of the same Hamiltonian, we can include them in the same matrix multiplication. For instance, in Eq. (B3) we joined 6 two-qubit terms into a single seven-qubit term. In that case the MXU forces us into a factor 32/6 ≈ 5.3× more operations compared to a CPU. Needless to say, the extreme efficiency of the MXU by far compensates for the generation of additional floating-point operations. Finally, we remark that for 7-qubit operators (or operators acting on an even larger number of qubits), the MXU does not force us into generating additional floating-point operations, achieving its optimal efficiency.

Appendix C: Expansion of the time evolution operator

To numerically time evolve a quantum state, we take the time evolution operator for a small time δt, exp(−i δt H), and Taylor expand the exponential to sixth order:

exp x = Σ_{n=0}^{6} x^n / n! + O(x^7) = Π_{n=1}^{6} (1 + a_n x) + O(x^7) .    (C1)

The latter form is nothing but a rewriting of the polynomial Σ_{n=0}^{6} x^n / n! in terms of its roots a_n, which can be solved for numerically. To accuracy sufficient for single precision, the roots are

a_1 = 0.37602583 − 0.13347447i ,    a_2 = a_1^* ,
a_3 = −0.05612287 − 0.25824122i ,    a_4 = a_3^* ,
a_5 = 0.18009704 + 0.30409897i ,    a_6 = a_5^* .
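The a_n can be reproduced with a few lines of numpy (a sketch of ours, using the fact that each factor 1 + a_n x vanishes at x = −1/a_n, so the a_n are minus the reciprocals of the zeros of the Taylor polynomial):

    import math
    import numpy as np

    # Degree-6 Taylor polynomial of exp(x), coefficients ordered from
    # highest to lowest degree as numpy.roots expects.
    coeffs = [1.0 / math.factorial(n) for n in range(6, -1, -1)]
    a = -1.0 / np.roots(coeffs)   # the six coefficients a_1, ..., a_6
    print(np.sort_complex(a))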

Appendix D: Memory efficient Lanczos algorithm

The Lanczos method is a matrix-free diagonalization technique which can be used to approximate extremal (largest in magnitude) eigenvector-eigenvalue pairs of a large Hermitian linear operator H. The only requirements (besides H being Hermitian) are the ability to efficiently compute the action of H on a given vector x, and enough memory to store at least four such vectors (plus some small amount of extra memory to perform auxiliary computations).

Algorithm 1 outlines the basic Lanczos method. Starting from a normalized initial guess x for an extremal eigenstate of H, the Lanczos method iteratively builds an orthogonal basis Q_j ≡ [x_0, x_1, ..., x_j] of Krylov vectors x_n for the Krylov subspace span{x, Hx, ..., H^j x}. The algorithm makes explicit use of the Hermiticity of H to obtain the orthogonal basis Q_j through a three-step recurrence relation which at any point involves only three consecutive Krylov vectors.


Algorithm 1 Lanczos algorithm

     1: function Lanczos(H, x, K, δ)
     2:     x_{-1} = zeros_like(x)
     3:     x_0 = x
     4:     Q_{-1} = []
     5:     for n = 0 ... K-1 do
     6:         β_n = ‖x_n‖
     7:         if β_n < δ then            ▷ invariant subspace found
     8:             return {α_0, ..., α_{n-1}}, {β_0, ..., β_n}, Q_{n-1}
     9:         end if
    10:         x_n ← x_n / β_n
    11:         Q_n = [Q_{n-1}, x_n]
    12:         x_{n+1} = H x_n
    13:         α_n = ⟨x_{n+1}, x_n⟩       ▷ α_n ∈ R
    14:         x_{n+1} ← x_{n+1} − α_n x_n − β_n x_{n-1}
    15:     end for
    16:     return {α_0, ..., α_{K-1}}, {β_0, ..., β_K}, Q_{K-1}
    17: end function

Algorithm 2 Lanczos algorithm for computing a tri-diagonalization of H

     1: function Lanczos_tridiag(H, x, K, δ)
     2:     x_{-1} = zeros_like(x)
     3:     x_0 = x
     4:     for n = 0 ... K-1 do
     5:         β_n = ‖x_0‖
     6:         if β_n < δ then            ▷ invariant subspace found
     7:             return {α_0, ..., α_{n-1}}, {β_0, ..., β_n}
     8:         end if
     9:         x_0 ← x_0 / β_n
    10:         x_1 = H x_0
    11:         α_n = ⟨x_1, x_0⟩           ▷ α_n ∈ R
    12:         x_1 ← x_1 − α_n x_0 − β_n x_{-1}
    13:         x_{-1} ← x_0
    14:         x_0 ← x_1
    15:     end for
    16:     return {α_0, ..., α_{K-1}}, {β_0, ..., β_K}
    17: end function

Algorithm 3 Lanczos algorithm for computing an approximate ground state of H

     1: function Lanczos_groundstate(H, x, {α_i}, {β_i}, {v_i})
     2:     u = zeros_like(x)
     3:     x_{-1} = zeros_like(x)
     4:     x_0 = x
     5:     for n = 0 ... len({v_i}) do
     6:         x_0 ← x_0 / β_n
     7:         u ← u + v_n x_0
     8:         x_1 = H x_0
     9:         x_1 ← x_1 − α_n x_0 − β_n x_{-1}
    10:         x_{-1} ← x_0
    11:         x_0 ← x_1
    12:     end for
    13:     return u
    14: end function

The Lanczos method converges towards the dominant eigenstate, i.e. the one with largest-in-magnitude eigenvalue, or a random linear superposition of such states if the eigenvalue is degenerate. To achieve convergence towards the (algebraically) lowest eigenstate, one can e.g. apply a uniform spectral shift to H. The algorithm produces the real coefficients {α_n}, {β_n} of the tridiagonal matrix T_j,

    T_j ≡ ⎛ α_0  β_1                 ⎞
          ⎜ β_1  α_1  β_2            ⎟
          ⎜      β_2  α_2   ⋱        ⎟      (D1)
          ⎜           ⋱     ⋱    β_j ⎟
          ⎝                 β_j  α_j ⎠

with j at most equal to K − 1, where K is the Krylov space dimension chosen by the user, but smaller if the algorithm terminates early (see below). The matrices Q_j, T_j and H satisfy the Lanczos relation

    H Q_j = Q_j T_j + β_{j+1} x_{j+1} e_{j+1}^T    (D2)

with e_{j+1} the (j+1)-st Euclidean basis vector of dimension j + 1. The algorithm terminates early if an invariant subspace is found, i.e. if a newly generated Krylov vector x_n is a linear superposition of previous Krylov vectors. In this case the eigenvalues of T_j and the corresponding Ritz vectors (see below) are exact eigenvalues and eigenvectors of H. Given an eigenvector v of T_j with eigenvalue λ, the corresponding Ritz vector u is obtained by expanding v in the Krylov basis Q_j:

    u = Q_j v .    (D3)

The pair (λ, u) is called a Ritz pair. If the method does not terminate early, the Ritz pair (λ, u) corresponding to an extremal eigenvector-eigenvalue pair (λ, v) of T_j is an approximate eigenvalue-eigenvector pair of H. A rough measure of the quality of this approximation is given by the residual norm

    ‖H u − λ u‖ = ‖H Q_j v − Q_j T_j v‖ = β_K |v_j| ,

which is given by the modulus of the last element of v times β_K. The quality of the approximation usually increases with the Krylov dimension K, with typical values of K ranging from a few dozen to a few hundred.

The Lanczos method is infamously ill-conditioned, resulting in loss of orthogonality, in finite precision arithmetic, of the constructed Krylov basis Q_j for modest sizes of K, and the appearance of spurious degeneracies ("ghost modes") in the spectrum of T_j. We note that for the case of only computing a single extremal eigenvector, ghost modes are of no particular concern.

Algorithm 1 requires the storage of the constructed Krylov basis Q_j in memory. For large values of j this can become prohibitive. One can work around this issue by splitting Algorithm 1 into two separate runs. During the first run one only computes the tridiagonal matrix T_j without storing the Krylov vectors Q_j. This requires only enough memory to store three consecutive Krylov vectors x_n in memory. One then diagonalizes T_j to obtain the extremal eigenpair (λ, v). Finally, one uses the expansion coefficients v as input to a second Lanczos run during which the desired extremal Ritz vector u is accumulatively computed from the reconstructed Krylov basis. The second run requires enough memory to store four Krylov vectors. The two algorithms are outlined in Algorithms 2 and 3. This approach of course comes at the expense of the additional computational time required for the second run. This latter, memory-efficient implementation of Lanczos is what we use to find the ground state of a 1D spin chain and obtain, for instance, the results in Fig. 2.
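A compact single-machine numpy rendering of this two-pass scheme (our own sketch: matvec stands for the distributed |Ψ⟩ ↦ H |Ψ⟩ update, and all TPU sharding is omitted) is:

    import numpy as np

    def lanczos_tridiag(matvec, x, K, delta=1e-8):
        # First pass (Algorithm 2): tridiagonal coefficients only,
        # keeping three Krylov vectors in memory.
        alphas, betas = [], []
        x_prev, x0 = np.zeros_like(x), x.copy()
        for n in range(K):
            beta = np.linalg.norm(x0)
            betas.append(beta)
            if beta < delta:               # invariant subspace found
                break
            x0 = x0 / beta
            x1 = matvec(x0)
            alpha = np.real(np.vdot(x1, x0))
            alphas.append(alpha)
            x1 = x1 - alpha * x0 - beta * x_prev
            x_prev, x0 = x0, x1
        return np.array(alphas), np.array(betas)

    def lanczos_groundstate(matvec, x, alphas, betas, v):
        # Second pass (Algorithm 3): rebuild the Krylov vectors and
        # accumulate the Ritz vector u = sum_n v[n] x_n on the fly.
        u = np.zeros_like(x)
        x_prev, x0 = np.zeros_like(x), x.copy()
        for n in range(len(v)):
            x0 = x0 / betas[n]
            u = u + v[n] * x0
            x1 = matvec(x0)
            x1 = x1 - alphas[n] * x0 - betas[n] * x_prev
            x_prev, x0 = x0, x1
        return u

    # Usage: diagonalize the tridiagonal matrix T, then expand its lowest
    # eigenvector in the Krylov basis (after shifting H if needed so that
    # the ground state is the dominant eigenstate).
    # alphas, betas = lanczos_tridiag(matvec, x, K)
    # T = np.diag(alphas) + np.diag(betas[1:len(alphas)], 1) \
    #     + np.diag(betas[1:len(alphas)], -1)
    # lam, vecs = np.linalg.eigh(T)
    # u = lanczos_groundstate(matvec, x, alphas, betas, vecs[:, 0])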