
Transcript of arXiv:2205.04730v1 [quant-ph] 10 May 2022

Theory of Quantum Generative Learning Models with Maximum Mean Discrepancy

Yuxuan Du,1, ∗ Zhuozhuo Tu,2 Bujiao Wu,3 Xiao Yuan,3 and Dacheng Tao1

1 JD Explore Academy, Beijing 101111, China
2 School of Computer Science, The University of Sydney, Darlington, NSW 2008, Australia
3 Center on Frontiers of Computing Studies, School of Computer Science, Peking University, Beijing 100871, China

The intrinsic probabilistic nature of quantum mechanics invokes endeavors of designing quantum generative learning models (QGLMs) with computational advantages over classical ones. To date, two prototypical QGLMs are quantum circuit Born machines (QCBMs) and quantum generative adversarial networks (QGANs), which approximate the target distribution in explicit and implicit ways, respectively. Despite the empirical achievements, the fundamental theory of these models remains largely obscure. To narrow this knowledge gap, here we explore the learnability of QCBMs and QGANs from the perspective of generalization when their loss is specified to be the maximum mean discrepancy. Particularly, we first analyze the generalization ability of QCBMs and identify their superiorities when the quantum devices can directly access the target distribution and the quantum kernels are employed. Next, we prove how the generalization error bound of QGANs depends on the employed Ansatz, the number of qudits, and input states. This bound can be further employed to seek potential quantum advantages in Hamiltonian learning tasks. Numerical results of QGLMs in approximating quantum states, Gaussian distribution, and ground states of parameterized Hamiltonians accord with the theoretical analysis. Our work opens the avenue for quantitatively understanding the power of quantum generative learning models.

I. INTRODUCTION

Learning is a generative activity that constructs its own interpretations of information and draws inferences on them [1]. This comprehensive philosophy sculpts a substantial subject in artificial intelligence, which is designing powerful generative learning models (GLMs) [2, 3] to capture the distribution Q describing the real-world data shown in Fig. 1(a). Concisely, a fundamental concept behind GLMs is estimating Q by a tunable probability distribution Pθ. In the past decades, a plethora of GLMs, e.g., the Helmholtz machine [4], variational auto-encoders [5], and generative adversarial networks (GANs) [6, 7], have been proposed. Attributed to the efficacy and flexibility of handling Pθ, these GLMs have been broadly applied to myriad scientific domains and gained remarkable success, including image synthesis and editing [8–10], medical imaging [11], molecule optimization [12, 13], and quantum computing [14–19]. Despite the wide success, their limitations have recently been recognized from different perspectives. Concretely, energy-based GLMs suffer from the expensive runtime of estimating and sampling the partition function [20]; variational auto-encoders tend to produce unrealistic and blurry samples when applied to complex datasets [21]; GANs encounter the issues of mode collapse, divergence, and inferior performance in simulating discrete distributions [22–25].

∗ [email protected]

Envisioned by the intrinsic probabilistic nature of quantum mechanics and the superior power of quantum computers [26–28], quantum generative learning models (QGLMs) are widely believed to further enhance the ability of GLMs. Concrete evidence has been provided by Refs. [29, 30], showing that when fault-tolerant quantum computers are available, QGLMs could surpass GLMs with provable quantum advantages such as stronger model expressivity and exponential speedups. Since fault-tolerant quantum computing is still absent, attention has recently shifted to designing QGLMs that can be efficiently carried out on noisy intermediate-scale quantum (NISQ) machines [31–33] with computational advantages on certain tasks [34–37]. Toward this goal, a leading strategy is constructing QGLMs through variational quantum algorithms [38, 39]. These QGLMs can be mainly divided into two categories, depending on whether the probability distribution Pθ is explicitly formulated or not. For the explicit QGLMs, a variational quantum Ansatz U(θ) (a.k.a. parameterized quantum circuit [40]) forms the distribution Pθ. Primary protocols belonging to this class are quantum circuit Born machines (QCBMs) [41–44], quantum variational auto-encoders [45], and quantum Boltzmann machines [46, 47]. As for the implicit QGLMs, a mainstream protocol is quantum generative adversarial networks (QGANs) [48–57]. Different from QCBMs, the Ansatz U(θ) in QGANs implicitly constructs Pθ in the sense that the output of the quantum circuit amounts to an example sampled from Pθ [58]. Extensive experimental studies have demonstrated the feasibility of QGLMs for different learning tasks, e.g., image generation [50, 59], state approximation [34, 60], and drug design [61, 62].

FIG. 1: The paradigm of quantum generative learning models. (a) The data explored in generative learning includes both classical and quantum scenarios. (b) The approaches of QGLMs to access the target data distribution Q. When Q is classical, QGLMs operate on its samples on the classical side or encode its samples into quantum circuits. When Q is quantum, QGLMs may directly access it without sampling. (c) The paradigm of QCBMs with MMD loss. The left and right panels depict the setup of classical and quantum kernels, respectively. (d) The scheme of QGANs with MMD loss for the continuous distribution Q. (e) QAOA, hardware-efficient, and tensor-network based Ansatze are covered by U(θ) in Eq. (3). (f) The metrics exploited in QGLMs to measure the discrepancy between the generated and target distributions. (g) When Q and k(·, ·) are both quantum, QGLMs may attain generalization advantages over classical GLMs.

A crucial vein in quantum machine learning [27] is understanding the learnability of a given quantum learning model from the perspective of generalization [63, 64]. A QGLM with good generalization means that the population distance between Pθ and Q is close to the empirical distance between the empirical distributions of Pθ and Q [65, 66]. In this respect, generalization can

not only be used to evaluate the performance of different QGLMs, but also provide guidelines to design QGLMs with computational advantages. Despite its importance, prior literature on the generalization theory of QGLMs is scarce, in sharp contrast with quantum discriminative learning [67–73]. Specifically, Ref. [74] showed that an unbounded loss function facilitates the training of QGLMs without barren plateaus [75, 76], and Ref. [77] showed that quantum Bayesian networks have a higher expressive power than their classical counterparts. The key challenges in demystifying the generalization of QGLMs stem from two factors: as depicted in Fig. 1(a)-(f), the distribution Q for QGLMs, the way QGLMs are implemented, and the selection of the objective function and the employed Ansatz are diverse; and the evaluation of the distance between Pθ and Q is intricate due to the curse of dimensionality.

To shrink the above knowledge gap, here we study the learnability of QGLMs with the maximum mean discrepancy (MMD) [78]. The attention on the MMD loss originates from the fact that many QGLMs employ it as the loss function to measure the difference between two distributions [42–44, 51]. Through the lens of statistical learning theory [79], we separately unveil the power of QGLMs for discrete and continuous distributions. That is, when Q is discrete and can be efficiently accessed by quantum machines, we prove that quantum kernels can greatly benefit the generalization ability of QCBMs over their classical counterparts. Meanwhile, to attain a similar generalization error, the required computational overhead for QCBMs with classical kernels is significantly larger than that of QCBMs with quantum kernels. This separation advocates using quantum kernels to underscore potential quantum advantages of QGLMs. When Q is continuous, we connect generalization with model expressivity and then leverage a statistical tool, the covering number, to quantify the generalization of QGANs. Concisely, we prove that the generalization error of QGANs is upper bounded by $O\big(1/n + 1/m + N d^k \sqrt{N_{gt} N_{ge}}/n\big)$, where n, m, d, k, N_ge, and N_gt refer to the number of reference samples, the number of training examples, the dimension of a qudit, the maximum number of qudits each gate acts on, the number of encoding gates, and the number of trainable gates, respectively. This bound explicitly exhibits how the encoding strategy and the adopted Ansatz affect the generalization error, which not only provides practical guidance for designing advanced QGANs, but also helps discover potential advantages of QGLMs.

II. MAIN RESULTS

To better present our main results, let us first recap QCBMs and QGANs, introduce the maximum mean discrepancy loss and the measure of generalization, and exhibit some properties of QGLMs with the MMD loss.


QCBMs. The paradigm of QCBM is shown in Fig. 1(c).

Specifically, an N-qubit Ansatz U(θ) with θ ∈ Θ is applied to a fixed input state ρ0 = (|0⟩⟨0|)^⊗N to form the parameterized distribution Pθ ∈ P_Θ. The probability of sampling i ∈ [2^N] from the distribution Pθ is

Pθ(i) = Tr(Π_i U(θ) ρ0 U(θ)†),  (1)

where Π_i = |i⟩⟨i| refers to the projector onto the computational basis state i. Given n examples {x^(j)}_{j=1}^n sampled from Pθ, its empirical distribution is defined as P^n_θ(i) = (1/n) ∑_{j=1}^n δ_{x^(j)}(i), with δ_(·)(·) being the indicator function.
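To make Eq. (1) concrete, the following is a minimal NumPy sketch of a QCBM's Born distribution and its empirical estimate; the two-qubit circuit (RY rotations followed by a CNOT), the parameter values, and the shot count are illustrative assumptions, not the architecture used in the paper.

```python
import numpy as np

def ry(theta):
    """Single-qubit RY rotation."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def qcbm_distribution(thetas):
    """Born distribution P_theta(i) = Tr(Pi_i U rho_0 U^dagger) for a toy
    2-qubit circuit: RY on each qubit followed by a CNOT (illustrative only)."""
    cnot = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 1],
                     [0, 0, 1, 0]])
    U = cnot @ np.kron(ry(thetas[0]), ry(thetas[1]))
    psi = U @ np.array([1.0, 0.0, 0.0, 0.0])   # U(theta)|00>
    return np.abs(psi) ** 2                     # diagonal of U rho_0 U^dagger

def empirical_distribution(p, n, rng=np.random.default_rng(0)):
    """P^n_theta(i) = (1/n) * sum_j delta_{x^(j)}(i) from n measurement shots."""
    samples = rng.choice(len(p), size=n, p=p)
    return np.bincount(samples, minlength=len(p)) / n

p = qcbm_distribution([0.3, 1.2])
print(p, empirical_distribution(p, n=1000))
```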

QGANs. The schematic of QGANs is shown in Fig. 1(d).

An N-qubit Ansatz U(θ) is used to realize the generator Gθ(·), which maps an example z sampled from a prior distribution P_Z to the generated example x, i.e., x := Gθ(z) ∈ R^{2^N}. Formally, the j-th component of x is

x_j = Tr(Π_j U(θ) ρ_z U(θ)†),  ∀ j ∈ [2^N],  (2)

where ρ_z refers to the quantum state encoding z. Given n examples {x^(i)}_{i=1}^n produced by {Gθ(z^(i))}_{i=1}^n, the empirical distribution is P^n_θ(dx) = (1/n) ∑_{i=1}^n δ_{x^(i)}(dx). Notably, when Q is discrete, the mechanism of QGANs is equivalent to that of QCBMs (refer to SM A for details). For this reason, in this study we only focus on applying QGANs to estimate continuous distributions Q.
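As a companion sketch for Eq. (2), the toy generator below encodes z into a single-qubit state, applies a parameterized rotation, and returns the measurement probabilities as the generated example x; the one-qubit encoding, the RY gates, and the Gaussian prior P_Z are illustrative assumptions rather than the circuit used in the experiments.

```python
import numpy as np

def ry(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -s], [s, c]])

def qgan_generator(theta, z):
    """Toy 1-qubit generator G_theta(z) realizing Eq. (2):
    x_j = Tr(Pi_j U(theta) rho_z U(theta)^dagger), with the illustrative encoding z -> RY(z)|0>."""
    psi_z = ry(z) @ np.array([1.0, 0.0])   # rho_z = |psi_z><psi_z|
    psi = ry(theta) @ psi_z                # apply U(theta) to the encoded state
    return np.abs(psi) ** 2                # components of the generated example x

rng = np.random.default_rng(1)
z_batch = rng.normal(size=5)               # z ~ P_Z (here a standard normal prior)
print(np.array([qgan_generator(0.7, z) for z in z_batch]))
```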

Throughout the whole study, the Ansatz employed in both QCBMs and QGANs takes the generic form

$U(\theta) = \prod_{l=1}^{N_g} U_l(\theta)$,  (3)

where θ are trainable parameters living in the parameter space Θ, U_l(θ) ∈ U(d^k) refers to the l-th quantum gate acting on at most k qudits with k ≤ N, and U(d^k) denotes the unitary group of dimension d^k (d = 2 for qubits) [80]. The form of U(θ) covers almost all Ansatze used in VQAs; some constructions are given in Fig. 1(e).
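For concreteness, a minimal sketch of assembling U(θ) as the product in Eq. (3) for qubits (d = 2) is given below, padding each k-local gate with identities as in footnote [80]; the RY-plus-CNOT layering loosely mirrors the hardware-efficient layout of Fig. 2(a) but is an illustrative assumption, not the paper's exact construction.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])

def ry(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -s], [s, c]])

def embed(gate, position, n_qubits):
    """Pad a k-local gate with identities, cf. footnote [80]: I^(N-k) (x) U_l(theta)."""
    k = int(np.log2(gate.shape[0]))
    ops = [I2] * position + [gate] + [I2] * (n_qubits - position - k)
    return reduce(np.kron, ops)

def ansatz(thetas, n_qubits, layers):
    """U(theta) = prod_l U_l(theta): RY layers interleaved with a CNOT chain (illustrative)."""
    U = np.eye(2 ** n_qubits)
    t = iter(thetas)
    for _ in range(layers):
        for q in range(n_qubits):
            U = embed(ry(next(t)), q, n_qubits) @ U
        for q in range(n_qubits - 1):
            U = embed(CNOT, q, n_qubits) @ U
    return U

U = ansatz(np.random.default_rng(0).uniform(0, 2 * np.pi, 6), n_qubits=3, layers=2)
print(np.allclose(U.conj().T @ U, np.eye(8)))   # unitarity check
```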

Maximum Mean Discrepancy. Consider an unknown distribution Q and a parametrized family of model distributions P_Θ defined on the same space. The maximum mean discrepancy (MMD) loss [78], which measures the difference between Pθ ∈ P_Θ and Q, is

MMD²(Pθ ∥ Q) = E[k(x, x′)] + E[k(y, y′)] − 2 E[k(x, y)],

where the expectations are taken over the randomness of x, x′ ∼ Pθ and y, y′ ∼ Q, and k(·, ·) denotes a predefined kernel (e.g., the linear or radial basis function (RBF) kernel). Without loss of generality, we assume max_{θ∈Θ} MMD²(Pθ ∥ Q) ≤ C1 with C1 being a constant. QGLMs aim to find an estimator minimizing the MMD loss,

$\hat{\theta} = \arg\min_{\theta \in \Theta} \mathrm{MMD}^2(P_\theta \,\|\, Q)$.  (4)

If the distributions Pθ and Q cannot be accessed directly, the evaluation of these expectations becomes intractable, and we instead consider the empirical MMD loss, an unbiased estimator of the MMD loss proposed by [78], i.e.,

$\mathrm{MMD}^2_U(P_1 \,\|\, P_2) := \frac{1}{n(n-1)} \sum_{i \neq i'}^{n} k\big(x^{(i)}, x^{(i')}\big) + \frac{1}{m(m-1)} \sum_{j \neq j'}^{m} k\big(y^{(j)}, y^{(j')}\big) - \frac{2}{nm} \sum_{i,j} k\big(x^{(i)}, y^{(j)}\big)$.

The minimizer of MMD²_U yields

$\hat{\theta}^{(n,m)} = \arg\min_{\theta \in \Theta} \mathrm{MMD}^2_U(P^n_\theta \,\|\, Q^m)$,  (5)

where P^n_θ and Q^m refer to the empirical distributions of Pθ and Q defined above, respectively. See SM B for optimizing QCBMs and QGANs with the MMD loss.
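A minimal NumPy sketch of the unbiased estimator MMD²_U entering Eq. (5) is given below; the RBF kernel, its bandwidth, and the synthetic samples are illustrative choices. Swapping in a different kernel only changes the `kernel` argument.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) evaluated pairwise."""
    d2 = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, kernel=rbf_kernel):
    """Unbiased MMD^2_U estimator of Gretton et al. [78]:
    off-diagonal means of k(x,x') and k(y,y') minus twice the cross term."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 3))   # samples playing the role of P_theta
Y = rng.normal(0.5, 1.0, size=(200, 3))   # samples playing the role of Q
print(mmd2_unbiased(X, Y))
```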

We follow the classical routine [81] to define the generalization error of QGLMs as follows. When either the kernel or the target distribution Q is classical, the generalization error of QGLMs is

$R_C = \mathrm{MMD}^2(P_{\hat{\theta}^{(n,m)}} \,\|\, Q) - \inf_{\theta \in \Theta} \mathrm{MMD}^2(P_\theta \,\|\, Q)$.  (6)

When the kernel k(·, ·) is quantum [82] and the distribution Q can be efficiently accessed by quantum machines, the generalization error of QGLMs is

$R_Q = \mathrm{MMD}^2(P_{\hat{\theta}} \,\|\, Q) - \inf_{\theta \in \Theta} \mathrm{MMD}^2(P_\theta \,\|\, Q)$.  (7)

Intuitively, both R_C and R_Q evaluate the divergence between the estimated and the optimal MMD loss, where a lower R_C or R_Q implies a better learning performance.

Quantum kernel in MMD loss. The choice of the kernel k(·, ·) in Eq. (4) is flexible. As shown in the right panel of Fig. 1(c), when it is specified to be a quantum kernel (e.g., the linear kernel) and Q can be directly accessed by QGLMs (i.e., Q can be efficiently prepared as a quantum state), the corresponding MMD loss can be efficiently calculated.

Lemma 1. Suppose the distribution Q can be directly accessed by QGLMs. When the quantum kernel is adopted, the MMD loss in Eq. (4) can be estimated within an error ε with O(1/ε²) sample complexity.

The proof of Lemma 1 is provided in SM C. This lemma delivers a crucial message: when both k(·, ·) and Q are quantum, MMD²(Pθ ∥ Q) can be efficiently estimated by QGLMs, with a runtime cost that is independent of the dimension of the data space. In contrast, for GLMs, the runtime cost of calculating the MMD loss scales polynomially with the sample sizes n and m. This runtime discrepancy warrants the good performance of QGLMs explained in the subsequent context.
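To illustrate the sample-complexity claim of Lemma 1, the sketch below assumes the linear quantum kernel of footnote [82], for which MMD²(Pθ ∥ Q) between two quantum states ρ and σ reduces to Tr(ρ²) + Tr(σ²) − 2 Tr(ρσ), and classically simulates swap-test sampling, whose ancilla reads 0 with probability (1 + Tr(ρσ))/2. The state dimension and shot budget are illustrative, and the swap-test route is one possible realization rather than the estimator analyzed in SM C.

```python
import numpy as np
rng = np.random.default_rng(0)

def random_state(dim):
    """Haar-like random pure state, returned as a density matrix."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    v /= np.linalg.norm(v)
    return np.outer(v, v.conj())

def swap_test_overlap(rho, sigma, shots):
    """Estimate Tr(rho sigma) by sampling the swap-test ancilla,
    which reads 0 with probability (1 + Tr(rho sigma)) / 2."""
    p0 = (1 + np.real(np.trace(rho @ sigma))) / 2
    counts0 = rng.binomial(shots, p0)
    return 2 * counts0 / shots - 1

rho, sigma = random_state(8), random_state(8)   # stand-ins for P_theta and Q (3 qubits)
shots = 10_000                                  # O(1/eps^2) shots for precision eps ~ 1e-2
mmd2_est = (swap_test_overlap(rho, rho, shots)
            + swap_test_overlap(sigma, sigma, shots)
            - 2 * swap_test_overlap(rho, sigma, shots))
mmd2_exact = np.real(np.trace((rho - sigma) @ (rho - sigma)))
print(mmd2_est, mmd2_exact)
```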

A. Generalization of QCBMs

A central topic in QCBMs is understanding whether quantum kernels can provide generalization advantages over classical ones. The following theorem provides an affirmative answer; its proof is postponed to SM D.


FIG. 2: Simulation results of QCBM. (a) The implementation of QCBMs when N = 6. The label ‘L1 − 1’ refers to repeating the architecture highlighted by the brown color L1 − 1 times. The gate U(θ_i^l) refers to the RY gate applied on the i-th qubit in the l-th layer of the Ansatz U(θ). (b) The visualization of a two-qubit GHZ state. (c) The upper and lower panels separately show the simulation results of QCBMs in the task of estimating the discrete Gaussian distribution when N = 8 and N = 12. The labels ‘Q’, ‘P_θ^Q’, and ‘P_θ^G, n’ stand for the target distribution, the output of QCBM with the quantum kernel, and the output of QCBM with the RBF kernel and n samples, respectively. The inner plots evaluate the statistical performance of QCBM through the KL divergence, where the x-axis labels the number of examples n. (d) The four box-plots separately show the simulation results of QCBM in the task of approximating the N-qubit GHZ state with N = 4, 6, 8, 10. The y-axis refers to the fidelity. The x-axis refers to the applied kernels in QCBM, where the label ‘Q’ represents the quantum kernel and the remaining four labels refer to the RBF kernel with n samples.

Theorem 1. Following the settings in Lemma 1, when the employed kernel k(·, ·) can either be realized by quantum or classical machines, with probability at least 1 − δ, the generalization error of QCBMs satisfies

$R_Q \le R_C \le C_1 \sqrt{\frac{8}{n} + \frac{8}{m}}\, \sqrt{C_2}\left(2 + \sqrt{\log\frac{1}{\delta}}\right)$,  (8)

where C1 ≥ max_θ MMD²(Pθ ∥ Q) and C2 = sup_x k(x, x).

It indicates that when Q is quantum, the generalization error of QCBMs with the quantum kernel is strictly lower than that of their classical counterparts. Remarkably, most tasks in quantum many-body physics and quantum information processing satisfy the requirement of directly accessing the target distribution Q [34]. In consequence, QCBMs may achieve generalization advantages in these regimes. In addition, the upper bound in Eq. (8) implies that the decisive factor for improving the generalization of QCBMs is simultaneously increasing n and m. As such, for QCBMs with classical kernels, there exists a tradeoff between generalization and runtime complexity, which is not the case for quantum kernels, as warranted by Lemma 1. Besides, the factor C2 suggests that the choice of kernels also affects the generalization of QCBMs. As with quantum discriminative learning models [70], the factor C1 connects the expressivity of QCBM with its generalization, in the sense that an Ansatz with excessive expressivity may degrade generalization. All of these observations provide practical guidance for designing QCBMs.

Remark. In SM E, we partially address another long-standing problem in quantum generative learning theory, i.e., whether QCBMs are superior to classical GLMs in performance. Concisely, we show that for certain Q, QCBMs can attain a lower inf_{θ∈Θ} MMD²(Pθ ∥ Q) than a typical GLM, the restricted Boltzmann machine [83], which may lead to a better generalization ability.

We conduct numerical simulations to examine the potential advantages of QCBMs claimed in Theorem 1. The first task is applying QCBMs to estimate the discrete Gaussian distribution N(N, µ, σ) [42, 44], where N specifies the range of events with x ∈ [2^N], and µ and σ refer to the mean and variance, respectively. An intuition of N(N, µ, σ) is shown in Fig. 2(c), labeled by Q. The hyperparameter settings are as follows. For all simulations, we fix µ = 1 and σ = 8. There are two settings of the qubit count, i.e., N = 8, 12. The quantum kernel and the RBF kernel are adopted to compute the MMD loss. For the RBF kernel, the number of samples is set as n = 100, 1000, and ∞. The employed Ansatz of QCBM is depicted in Fig. 2(a) with L1 = 8 for N = 8 and L1 = 12 for N = 12. The maximum number of iterations is T = 50. For each setting, we repeat the training 5 times to assess the robustness of the results.
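For reference, a short sketch of the discrete Gaussian target N(N, µ, σ) and of the KL divergence reported in Fig. 2(c) is given below; the truncated-and-renormalized Gaussian construction and the uniform comparison distribution are assumptions made for illustration, and the paper's exact normalization may differ.

```python
import numpy as np

def discrete_gaussian(N, mu, sigma):
    """Target N(N, mu, sigma): Gaussian weights over x in [2^N], renormalized.
    (A standard construction; the paper's exact convention may differ.)"""
    x = np.arange(2 ** N)
    w = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return w / w.sum()

def kl_divergence(q, p, eps=1e-12):
    """KL(q || p), the metric reported in the inner plots of Fig. 2(c)."""
    q, p = q + eps, p + eps
    return float(np.sum(q * np.log(q / p)))

q = discrete_gaussian(N=8, mu=1, sigma=8)
p_uniform = np.full(2 ** 8, 1 / 2 ** 8)   # e.g., the output of an untrained QCBM
print(kl_divergence(q, p_uniform))
```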

The simulation results of QCBMs are illustrated in Fig. 2(c). The two outer plots exhibit the approximated distributions under different settings. In particular, for both N = 8, 12, the distribution generated by QCBM with the quantum kernel well approximates Q. Measured by the KL divergence, the similarity of these two distributions is 0.18 and 1.6 for N = 8, 12, respectively. In contrast, when the adopted kernels are classical and the number of measurements is finite, QCBMs exhibit inferior performance. Namely, by increasing n and m from 50 to 1000, the KL divergence between the approximated distribution and the target distribution only decreases from 1.28 to 0.59 in the case of N = 8. Moreover, under the same setting, the KL divergence does not manifestly decrease when N = 12, which requires larger n and m to attain a good approximation, as suggested by Theorem 1. This argument is warranted by the numerical results with the setting n = m → ∞, where the achieved KL divergence is comparable with that of QCBM with the quantum kernel. Nevertheless, the runtime complexity of QCBMs with the classical kernel scales polynomially with n and m. According to Lemma 1, under this scenario, QCBMs with the quantum kernel embrace runtime advantages.

We next follow Ref. [41] to apply QCBMs to the task of preparing GHZ states, a.k.a. “cat states” [84]. An intuition is depicted in Fig. 2(b). The choice of GHZ states is motivated by their importance in quantum information. The formal expression of an N-qubit GHZ state is |GHZ⟩ = (|0⟩^⊗N + |1⟩^⊗N)/√2. The hyperparameter settings are as follows. The number of qubits is set as N = 4, 6, 8, 10 and the corresponding depth is L1 = 4, 6, 8, 10. The other settings are identical to those used in the task of discrete Gaussian approximation.
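The fidelity metric of Fig. 2(d) can be computed as in the short sketch below; the placeholder "generated" state is a uniform superposition chosen only for illustration.

```python
import numpy as np

def ghz_state(N):
    """|GHZ> = (|0...0> + |1...1>) / sqrt(2) as a 2^N-dimensional vector."""
    psi = np.zeros(2 ** N)
    psi[0] = psi[-1] = 1 / np.sqrt(2)
    return psi

def fidelity(target, psi):
    """F = |<target|psi>|^2, the y-axis of Fig. 2(d)."""
    return float(np.abs(np.vdot(target, psi)) ** 2)

ghz = ghz_state(4)
trial = np.ones(2 ** 4) / 4.0   # placeholder generated state (uniform superposition)
print(fidelity(ghz, trial))
```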

The simulation results, as illustrated in Fig. 2(d), indicate that QCBMs with quantum kernels outperform those with RBF kernels when n and m are finite. This observation becomes more apparent as N increases. For all settings of N, the averaged fidelity between the generated states of QCBMs with the quantum kernel and the target |GHZ⟩ is above 0.99, whereas the obtained averaged fidelity for QCBMs with the RBF kernel is 0.46 for N = 10 and n = m = 100. Meanwhile, as in the prior task, the RBF kernel attains performance competitive with the quantum kernel only when n = m → ∞, at the price of an unaffordable computational overhead.

B. Generalization of QGANs

QGANs formulated in Eq. (2) are specified to be a class of learning protocols estimating continuous distributions Q and thus only concern R_C in Eq. (6). In this scenario, it is of paramount importance to unveil how the generalization of QGANs depends on the data-uploading method and the structural information of the Ansatz. The following theorem makes a concrete step toward this goal; the proof is deferred to SM F.

Theorem 2. Assume the kernel k(·, ·) is C3-Lipschitz. Suppose that the quantum circuit U(z) employed to prepare ρ_z contains in total N_ge parameterized gates, each acting on at most k qudits. Following the notations in Eqs. (5) and (6), with probability at least 1 − δ, the generalization error of QGANs, R_C in Eq. (6), is upper bounded by

$8\sqrt{\frac{8 C_2^2 (n+m)}{nm}\ln\frac{1}{\delta}} + \frac{48}{n-1} + \frac{144\, d^k \sqrt{N_{gt}+N_{ge}}}{n-1}\, C_4$,  (9)

where C2 = sup_x k(x, x) and C4 = N ln(1764 C_3^2 n N_{ge} N_{gt}) + 1.
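To make the scaling of Eq. (9) tangible, the snippet below evaluates its right-hand side for a few sample sizes; every numerical input (qubit count, gate counts, Lipschitz and kernel constants) is a placeholder rather than a value derived in the paper.

```python
import numpy as np

def qgan_generalization_bound(n, m, N, d, k, N_gt, N_ge, C2, C3, delta):
    """Right-hand side of Eq. (9); all inputs are illustrative placeholders."""
    C4 = N * np.log(1764 * C3 ** 2 * n * N_ge * N_gt) + 1
    term1 = 8 * np.sqrt(8 * C2 ** 2 * (n + m) / (n * m) * np.log(1 / delta))
    term2 = 48 / (n - 1)
    term3 = 144 * d ** k * np.sqrt(N_gt + N_ge) / (n - 1) * C4
    return term1 + term2 + term3

# Increasing the number of samples n = m tightens the bound, as discussed below.
for n in (10 ** 3, 10 ** 4, 10 ** 5):
    print(n, qgan_generalization_bound(n, n, N=3, d=2, k=2, N_gt=24, N_ge=6,
                                       C2=1.0, C3=1.0, delta=0.01))
```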

The results of Theorem 2 describe how different components of QGANs affect the generalization bound and convey four-fold implications. First, according to the first term on the right-hand side of Eq. (9), to achieve a tight upper bound of R_C, the ratio between the number of examples sampled from Q and Pθ should satisfy m/n = 1. In this case, the generalization error decreases and finally converges to zero with increasing n or m. Second, R_C linearly depends on the kernel term C2, exponentially depends on k in Eq. (3), and sublinearly depends on the number of trainable quantum gates N_gt. These observations underpin the importance of controlling the expressivity of the adopted Ansatz and selecting proper kernels to ensure both good learning performance and generalization of QGANs. Third, the way of preparing ρ_z is a decisive factor in the generalization of QGANs. In other words, to improve the performance of QGANs, the prior distribution P_Z and the number of encoding gates N_ge should be carefully designed. Last, the explicit dependence on the architecture of the Ansatz connects the generalization of QGANs with their trainability. Concretely, a large N_gt or k may induce barren plateaus in training QGLMs [74, 75] and result in inferior learning performance. Meanwhile, it also leads to a degraded generalization error bound. This inherent connection calls for a unified framework to simultaneously enhance the trainability and generalization of QGANs.

Remark. We emphasize that most kernels, such as the RBF, linear, and Matern kernels, satisfy the Lipschitz condition employed in Theorem 2 [85]. Meanwhile, the above results can be efficiently generalized to noisy settings by leveraging the analysis in Ref. [70]. Moreover, the results achieved in Theorem 2 provide an efficient way to compare the generalization error of QGANs with different Ansatze; see SM G for details. In addition, in SM I, we explain that Theorem 2 implies potential advantages of QGANs in the task of Hamiltonian learning.

FIG. 3: Simulation results of QGANs. (a) The implementation of QGANs when the number of qubits is N = 3. U(z) refers to the encoding circuit that loads the example z. The meaning of ‘L1 − 1’ is identical to the one explained in Fig. 2. The gate U(θ_i^l) refers to the RY and RZ gates applied on the i-th qubit in the l-th layer of U(θ). (b) The outer plot shows the training loss of QGANs with varied settings of m. The x-axis refers to the number of iterations. The inner plot shows the generalization property of the trained QGANs by evaluating the MMD loss. (c) The visualization of the exploited 3D Gaussian distribution. The label ‘Xa-Xb’ means projecting the 3D Gaussian onto the Xa-Xb plane, with a, b belonging to the x, y, z axes. (d) The generated data sampled from the trained QGAN with varied settings of m. From the upper to the lower panel, m equals 2, 10, and 200, respectively.

To validate the results in Theorem 2, we apply a variant of the style-QGAN proposed by [86] to generate 3D correlated Gaussian distributions with varied settings. Two key modifications in our protocol are constructing the quantum generator following Fig. 3(a) and replacing the trainable discriminator by the MMD loss. See SM G for the

construction details. The target distribution Q is a 3D correlated Gaussian distribution centered at µ = (0, 0, 0) with covariance matrix

$\sigma = \begin{pmatrix} 0.5 & 0.1 & 0.25 \\ 0.1 & 0.5 & 0.1 \\ 0.25 & 0.1 & 0.5 \end{pmatrix}$.

The sampled examples from Q are visualized in Fig. 3(c). The hyperparameter settings employed in the training procedure are as follows. The number of reference samples n ranges from 200 to 10000 and we keep n = m. The layer depth of G(θ) is set as L ∈ {2, 4, 6, 8}. Each setting is repeated 5 times to collect the statistical results.

The simulation results are exhibited in Figs. 3(b)-(d). The outer plot in Fig. 3(b) shows that for all settings of m, the empirical MMD loss, i.e., MMD_U(P^n_{θ̂^{(n,m)}} ∥ Q^m), converges after 60 iterations, where the averaged loss is 0.0114, 0.0077, and 0.0054 for m = 2, 10, 200, respectively. The inner plot measures the expected MMD loss, i.e., the trained QGANs are employed to generate 10000 new examples, on which MMD_U is evaluated to estimate the MMD. The averaged expected MMD loss for m = 2, 10, 200 is 0.1178, 0.0122, and 0.0041, respectively. Since the generalization error of QGANs R_C is proportional to |MMD_U − MMD|, an immediate observation is that QGANs with a large m obtain a better generalization ability, which echoes Theorem 2. For illustration, we depict the generated distributions of the trained QGANs in Fig. 3(d). With increasing m, the learned distribution becomes close to the real distribution in Fig. 3(c). See SM H for more simulations.

III. DISCUSSIONS

We conduct a comprehensive study to quantify the generalization ability of QGLMs, including QCBMs and QGANs, with the MMD loss. Although the attained theoretical results do not exhibit generic exponential advantages, we clearly show that under certain tasks and model settings, QCBMs and QGANs can surpass classical learning models. Moreover, we provide a succinct and direct way to compare the generalization of QGLMs with different Ansatze. Extensive numerical simulations have been conducted to support our theoretical statements. These theoretical and empirical observations deepen our understanding of the capabilities of QGLMs and benefit the design of advanced QGLMs.

The techniques developed in this study are general and provide a novel approach to theoretically investigate the power of QGLMs. For instance, a promising direction is uncovering the generalization properties of other QGLMs such as quantum auto-encoders and quantum Boltzmann machines. Furthermore, our work mainly concentrates on QGLMs with the MMD loss, whereas a promising research direction is to derive the generalization of QGLMs with other loss functions such as the Sinkhorn divergence, Stein discrepancy [87], and Wasserstein distance [88].

For QCBMs, Theorem 1 unveils that quantum kernels can greatly benefit their generalization and reduce the computational overhead compared with classical kernels when the target distribution is quantum. These results underline that an important future direction will be identifying how to use QGLMs to gain substantial quantum advantages in practical applications, e.g., quantum many-body physics, quantum sensing, and quantum information processing.

For QGANs, Theorem 2 hints that their generalization error explicitly depends on the qudit count, the structural information of the employed Ansatze, the adopted encoding method, and the choice of the prior distribution. These results enable us to theoretically understand the power of QGLMs and provide practical guidance for devising novel Ansatze to enhance the learning performance of QGLMs. Besides, since an Ansatz with too much expressivity may degrade the generalization ability of QGANs, it is necessary to integrate various advanced techniques, such as quantum circuit architecture design techniques [89–94], to boost the performance of QGLMs. From the theoretical perspective, the entangled relation between expressivity and generalization in QGLMs calls for a deeper understanding from each side.

[1] Merlin C Wittrock. Generative processes of comprehen-sion. Educational psychologist, 24(4):345–376, 1989.

[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deeplearning. nature, 521(7553):436, 2015.

[3] Jurgen Schmidhuber. Deep learning in neural networks:An overview. Neural Networks, 61:85–117, 2015.

[4] Peter Dayan, Geoffrey E Hinton, Radford M Neal, andRichard S Zemel. The helmholtz machine. Neural com-putation, 7(5):889–904, 1995.

[5] Diederik P. Kingma and Max Welling. Auto-encodingvariational bayes. In Yoshua Bengio and Yann LeCun,editors, 2nd International Conference on Learning Repre-sentations, ICLR 2014, Banff, AB, Canada, April 14-16,2014, Conference Track Proceedings, 2014.

[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, BingXu, David Warde-Farley, Sherjil Ozair, Aaron Courville,and Yoshua Bengio. Generative adversarial nets. InAdvances in neural information processing systems, pages2672–2680, 2014.

[7] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial networks. In 5th Interna-tional Conference on Learning Representations, ICLR2017, 2017.

[8] Ran Yi, Yong-Jin Liu, Yu-Kun Lai, and Paul L Rosin.APDrawingGAN: Generating artistic portrait drawingsfrom face photos with hierarchical gans. In IEEE Con-ference on Computer Vision and Pattern Recognition(CVPR ’19), pages 10743–10752, 2019.

[9] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarialnetworks. In Proceedings of the IEEE/CVF Conferenceon Computer Vision and Pattern Recognition, pages4401–4410, 2019.

[10] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, andJun-Yan Zhu. Semantic image synthesis withspatially-adaptive normalization. In Proceedings of theIEEE/CVF Conference on Computer Vision and PatternRecognition, pages 2337–2346, 2019.

[11] Karim Armanious, Chenming Jiang, Marc Fischer,Thomas Kustner, Tobias Hepp, Konstantin Nikolaou,Sergios Gatidis, and Bin Yang. Medgan: Medical imagetranslation using gans. Computerized Medical Imagingand Graphics, 79:101684, 2020.

[12] Lukasz Maziarka, Agnieszka Pocha, Jan Kaczmarczyk,Krzysztof Rataj, Tomasz Danel, and Micha l Warcho l.Mol-cyclegan: a generative model for molecular opti-mization. Journal of Cheminformatics, 12(1):1–18, 2020.

[13] Oscar Mendez-Lucio, Benoit Baillif, Djork-Arne Clevert,David Rouquie, and Joerg Wichard. De novo gener-ation of hit-like molecules from gene expression signa-tures using artificial intelligence. Nature communications,11(1):1–10, 2020.

[14] Shahnawaz Ahmed, Carlos Sanchez Munoz, Franco Nori, and Anton Frisk Kockum. Quantum state tomography with conditional generative adversarial networks. Physical Review Letters, 127(14):140502, 2021.

[15] Juan Carrasquilla, Giacomo Torlai, Roger G Melko,and Leandro Aolita. Reconstructing quantum stateswith generative models. Nature Machine Intelligence,1(3):155–161, 2019.

[16] Alistair WR Smith, Johnnie Gray, and MS Kim. Efficientquantum state sample tomography with basis-dependentneural networks. PRX Quantum, 2(2):020348, 2021.

[17] Andrea Rocchetto, Edward Grant, Sergii Strelchuk,Giuseppe Carleo, and Simone Severini. Learning hardquantum distributions with variational autoencoders.npj Quantum Information, 4(1):1–7, 2018.

[18] Roger G Melko, Giuseppe Carleo, Juan Carrasquilla,and J Ignacio Cirac. Restricted boltzmann machines inquantum physics. Nature Physics, 15(9):887–892, 2019.

[19] Giuseppe Carleo, Yusuke Nomura, and Masatoshi Imada.Constructing exact representations of quantum many-body systems with deep neural networks. Nature com-munications, 9(1):1–11, 2018.

[20] Yilun Du and Igor Mordatch. Implicit generation andmodeling with energy based models. In Advances in Neu-ral Information Processing Systems, volume 32. CurranAssociates, Inc., 2019.

[21] Alexey Dosovitskiy and Thomas Brox. Generating im-ages with perceptual similarity metrics based on deepnetworks. Advances in neural information processingsystems, 29:658–666, 2016.

[22] Sanjeev Arora, Andrej Risteski, and Yi Zhang. Do GANslearn the distribution? some theory and empirics. InInternational Conference on Learning Representations,2018.

[23] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio,and Wenjie Li. Mode regularized generative adversarialnetworks. arXiv preprint arXiv:1612.02136, 2016.

[24] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin.Which training methods for gans do actually converge?In International conference on machine learning, pages3481–3490. PMLR, 2018.

[25] Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, andJieping Ye. A review on generative adversarial networks:Algorithms, theory, and applications. IEEE Transactionson Knowledge and Data Engineering, 2021.

[26] Richard P Feynman. Quantum mechanical computers.Between Quantum and Cosmos, pages 523–548, 2017.

[27] Jacob Biamonte, Peter Wittek, Nicola Pancotti, PatrickRebentrost, Nathan Wiebe, and Seth Lloyd. Quantummachine learning. Nature, 549(7671):195, 2017.

[28] Aram W Harrow and Ashley Montanaro. Quantumcomputational supremacy. Nature, 549(7671):203, 2017.

[29] Xun Gao, Z-Y Zhang, and L-M Duan. A quantummachine learning algorithm based on generative models.Science advances, 4(12):eaat9004, 2018.

[30] Seth Lloyd and Christian Weedbrook. Quantum generative adversarial learning. Physical review letters, 121(4):040502, 2018.

[31] John Preskill. Quantum computing in the nisq era andbeyond. Quantum, 2:79, 2018.

[32] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon,Joseph C Bardin, Rami Barends, Rupak Biswas, SergioBoixo, Fernando GSL Brandao, David A Buell, et al.Quantum supremacy using a programmable supercon-ducting processor. Nature, 574(7779):505–510, 2019.

[33] Yulin Wu, Wan-Su Bao, Sirui Cao, Fusheng Chen, Ming-Cheng Chen, Xiawei Chen, Tung-Hsun Chung, Hui Deng,Yajie Du, Daojin Fan, et al. Strong quantum compu-tational advantage using a superconducting quantumprocessor. Physical review letters, 127(18):180501, 2021.

[34] Hsin-Yuan Huang, Michael Broughton, Jordan Cotler,Sitan Chen, Jerry Li, Masoud Mohseni, Hartmut Neven,Ryan Babbush, Richard Kueng, John Preskill, et al.Quantum advantage in learning from experiments. arXivpreprint arXiv:2112.00778, 2021.

[35] Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni,Ryan Babbush, Sergio Boixo, Hartmut Neven, and Jar-rod R McClean. Power of data in quantum machinelearning. Nature communications, 12(1):1–9, 2021.

[36] Xinbiao Wang, Yuxuan Du, Yong Luo, and Dacheng Tao.Towards understanding the power of quantum kernels inthe NISQ era. Quantum, 5:531, August 2021.

[37] Yuxuan Du and Dacheng Tao. On exploring practicalpotentials of quantum auto-encoder with advantages.arXiv preprint arXiv:2106.15432, 2021.

[38] Marco Cerezo, Andrew Arrasmith, Ryan Babbush, Si-mon C Benjamin, Suguru Endo, Keisuke Fujii, Jarrod RMcClean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio,et al. Variational quantum algorithms. Nature ReviewsPhysics, 3(9):625–644, 2021.

[39] Kishor Bharti, Alba Cervera-Lierta, Thi Ha Kyaw,Tobias Haug, Sumner Alperin-Lea, Abhinav Anand,Matthias Degroote, Hermanni Heimonen, Jakob SKottmann, Tim Menke, et al. Noisy intermediate-scale quantum algorithms. Reviews of Modern Physics,94(1):015004, 2022.

[40] Marcello Benedetti, Erika Lloyd, Stefan Sack, and Mat-tia Fiorentini. Parameterized quantum circuits as ma-chine learning models. Quantum Science and Technology,4(4):043001, 2019.

[41] Marcello Benedetti, Delfina Garcia-Pintos, Oscar Per-domo, Vicente Leyton-Ortega, Yunseong Nam, and Ale-jandro Perdomo-Ortiz. A generative modeling approachfor benchmarking and training shallow quantum circuits.npj Quantum Information, 5(1):1–9, 2019.

[42] Jin-Guo Liu and Lei Wang. Differentiable learning ofquantum circuit born machines. Physical Review A,98(6):062324, 2018.

[43] Brian Coyle, Maxwell Henderson, Justin Chan Jin Le, Ni-raj Kumar, Marco Paini, and Elham Kashefi. Quantumversus classical generative modelling in finance. QuantumScience and Technology, 6(2):024013, 2021.

[44] Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, andDacheng Tao. Expressive power of parametrized quan-tum circuits. Phys. Rev. Research, 2:033125, Jul 2020.

[45] Amir Khoshaman, Walter Vinci, Brandon Denis, EvgenyAndriyash, Hossein Sadeghi, and Mohammad H Amin.Quantum variational autoencoder. Quantum Scienceand Technology, 4(1):014001, 2018.

[46] Yuta Shingu, Yuya Seki, Shohei Watabe, Suguru Endo, Yuichiro Matsuzaki, Shiro Kawabata, Tetsuro Nikuni, and Hideaki Hakoshima. Boltzmann machine learning with a variational quantum algorithm. Physical Review A, 104(3):032413, 2021.

[47] Chang Yu Hsieh, Qiming Sun, Shengyu Zhang, andChee Kong Lee. Unitary-coupled restricted boltzmannmachine ansatz for quantum simulations. npj QuantumInformation, 7(1):1–10, 2021.

[48] Christa Zoufal, Aurelien Lucchi, and Stefan Woerner.Quantum generative adversarial networks for learningand loading random distributions. npj Quantum Infor-mation, 5(1):1–9, 2019.

[49] Jinfeng Zeng, Yufeng Wu, Jin-Guo Liu, Lei Wang,and Jiangping Hu. Learning and inference on gener-ative adversarial quantum circuits. Physical Review A,99(5):052306, 2019.

[50] He-Liang Huang, Yuxuan Du, Ming Gong, Youwei Zhao,Yulin Wu, Chaoyue Wang, Shaowei Li, Futian Liang,Jin Lin, Yu Xu, et al. Experimental quantum generativeadversarial networks for image generation. PhysicalReview Applied, 16(2):024051, 2021.

[51] Yiming Huang, Hang Lei, Xiaoyu Li, and Guowu Yang.Quantum maximum mean discrepancy gan. Neurocom-puting, 454:88–100, 2021.

[52] Jonathan Romero and Alan Aspuru-Guzik. Variationalquantum generators: Generative adversarial quantummachine learning for continuous distributions. AdvancedQuantum Technologies, 4(1):2000003, 2021.

[53] Kaitlin Gili, Marta Mauri, and Alejandro Perdomo-Ortiz.Evaluating generalization in classical and quantum gen-erative models. arXiv preprint arXiv:2201.08770, 2022.

[54] Kouhei Nakaji and Naoki Yamamoto. Quantum semi-supervised generative adversarial network for enhanceddata classification. Scientific reports, 11(1):1–10, 2021.

[55] Paolo Braccia, Leonardo Banchi, and Filippo Caruso.Quantum noise sensing by generating fake noise. PhysicalReview Applied, 17(2):024002, 2022.

[56] Abhinav Anand, Jonathan Romero, Matthias Degroote,and Alan Aspuru-Guzik. Noise robustness and experi-mental demonstration of a quantum generative adver-sarial network for continuous distributions. AdvancedQuantum Technologies, 4(5):2000069, 2021.

[57] Xu-Fei Yin, Yuxuan Du, Yue-Yang Fei, Rui Zhang, Li-Zheng Liu, Yingqiu Mao, Tongliang Liu, Min-Hsiu Hsieh,Li Li, Nai-Le Liu, Dacheng Tao, Yu-Ao Chen, and Jian-Wei Pan. Efficient Bipartite Entanglement DetectionScheme with a Quantum Adversarial Solver. Phys. Rev.Lett. 128, 110501 (2022), 2022. arXiv:2203.07749v1.

[58] Implicit probabilistic models do not specify the distri-bution of the data itself, but rather define a stochasticprocess that, after training, aims to draw samples fromthe underlying data distribution.

[59] Daiwei Zhu, Norbert M Linke, Marcello Benedetti,Kevin A Landsman, Nhung H Nguyen, C HuertaAlderete, Alejandro Perdomo-Ortiz, Nathan Korda,A Garfoot, Charles Brecque, et al. Training of quan-tum circuits on a hybrid quantum computer. Scienceadvances, 5(10):eaaw9918, 2019.

[60] Ling Hu, Shu-Hao Wu, Weizhou Cai, Yuwei Ma, Xiang-hao Mu, Yuan Xu, Haiyan Wang, Yipu Song, Dong-LingDeng, Chang-Ling Zou, et al. Quantum generative ad-versarial learning in a superconducting quantum circuit.Science advances, 5(1):eaav2761, 2019.

[61] Junde Li, Rasit O Topaloglu, and Swaroop Ghosh. Quantum generative models for small molecule drug discovery. IEEE Transactions on Quantum Engineering, 2:1–8, 2021.

[62] Yu-Xin Jin, Jun-Jie Hu, Qi Li, Zhi-Cheng Luo, Fang-Yan Zhang, Hao Tang, Kun Qian, and Xian-Min Jin.Quantum Deep Learning for Mutant COVID-19 StrainPrediction, 2022. arXiv:2203.03556v1.

[63] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learn-ing and generalization in overparameterized neural net-works, going beyond two layers. In Proceedings of the33rd International Conference on Neural InformationProcessing Systems, pages 6158–6169, 2019.

[64] Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, Shan You,and Dacheng Tao. Learnability of quantum neural net-works. PRX Quantum, 2(4):040337, 2021.

[65] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, andYi Zhang. Generalization and equilibrium in generativeadversarial nets (gans). arXiv preprint arXiv:1703.00573,2017.

[66] Shengjia Zhao, Hongyu Ren, Arianna Yuan, JiamingSong, Noah Goodman, and Stefano Ermon. Bias andgeneralization in deep generative models: An empiri-cal study. Advances in Neural Information ProcessingSystems, 31, 2018.

[67] Amira Abbas, David Sutter, Christa Zoufal, AurelienLucchi, Alessio Figalli, and Stefan Woerner. The powerof quantum neural networks. Nature ComputationalScience, 1(6):403–409, 2021.

[68] Leonardo Banchi, Jason Pereira, and Stefano Piran-dola. Generalization in quantum machine learning:A quantum information standpoint. PRX Quantum,2(4):040321, 2021.

[69] Matthias C Caro, Hsin-Yuan Huang, M Cerezo, Ku-nal Sharma, Andrew Sornborger, Lukasz Cincio, andPatrick J Coles. Generalization in quantum ma-chine learning from few training data. arXiv preprintarXiv:2111.05292, 2021.

[70] Yuxuan Du, Zhuozhuo Tu, Xiao Yuan, and Dacheng Tao.Efficient measure for the expressivity of variational quan-tum algorithms. Physical Review Letters, 128(8):080506,2022.

[71] Hsin-Yuan Huang, Richard Kueng, and John Preskill.Information-theoretic bounds on quantum advan-tage in machine learning. Physical Review Letters,126(19):190505, 2021.

[72] Junyu Liu, Khadijeh Najafi, Kunal Sharma, FrancescoTacchino, Liang Jiang, and Antonio Mezzacapo. Ananalytic theory for the dynamics of wide quantum neuralnetworks, 2022. arXiv:2203.16711v1.

[73] Yang Qian, Xinbiao Wang, Yuxuan Du, Xingyao Wu,and Dacheng Tao. The dilemma of quantum neuralnetworks. arXiv preprint arXiv:2106.04975, 2021.

[74] Maria Kieferova, Ortiz Marrero Carlos, and NathanWiebe. Quantum generative training using r\’enyi diver-gences. arXiv preprint arXiv:2106.09567, 2021.

[75] Jarrod R McClean, Sergio Boixo, Vadim N Smelyanskiy,Ryan Babbush, and Hartmut Neven. Barren plateausin quantum neural network training landscapes. Naturecommunications, 9(1):1–6, 2018.

[76] Kaining Zhang, Min-Hsiu Hsieh, Liu Liu, and DachengTao. Toward trainability of deep quantum neural net-works. arXiv preprint arXiv:2112.15002, 2021.

[77] Xun Gao, Eric R Anschuetz, Sheng-Tao Wang, J Ignacio Cirac, and Mikhail D Lukin. Enhancing generative models via quantum correlations. arXiv preprint arXiv:2101.08354, 2021.

[78] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch,Bernhard Scholkopf, and Alexander Smola. A kerneltwo-sample test. J. Mach. Learn. Res., 13(null):723–773,mar 2012.

[79] Vladimir Vapnik. The nature of statistical learning the-ory. Springer science & business media, 1999.

[80] For simplicity, the definition of the N-qubit Ansatz U(θ) in Eq. (3) omits some identity operators. The complete description for the l-th layer is I_{d^{N−k}} ⊗ U_l(θ).

[81] GK Dziugaite, DM Roy, and Z Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Uncertainty in Artificial Intelligence: Proceedings of the 31st Conference, UAI 2015, pages 258–267, 2015.

[82] k(·, ·) is quantum means that the specified kernel can beefficiently computed by quantum algorithms. Represen-tative examples are linear kernel and polynomial kernels[95].

[83] Geoffrey E Hinton. A practical guide to training re-stricted boltzmann machines. In Neural networks: Tricksof the trade, pages 599–619. Springer, 2012.

[84] Michael A Nielsen and Isaac L Chuang. Quantum compu-tation and quantum information. Cambridge UniversityPress, 2010.

[85] Armin Lederer, Jonas Umlauft, and Sandra Hirche. Uni-form Error Bounds for Gaussian Process Regression withApplication to Safe Control. Curran Associates Inc., RedHook, NY, USA, 2019.

[86] Carlos Bravo-Prieto, Julien Baglio, Marco Ce, AnthonyFrancis, Dorota M Grabowska, and Stefano Carrazza.Style-based quantum generative adversarial networks formonte carlo events. arXiv preprint arXiv:2110.06933,2021.

[87] Francois-Xavier Briol, Alessandro Barp, Andrew B Dun-can, and Mark Girolami. Statistical inference for gener-ative models with maximum mean discrepancy. arXivpreprint arXiv:1906.05944, 2019.

[88] Shouvanik Chakrabarti, Huang Yiming, Tongyang Li,Soheil Feizi, and Xiaodi Wu. Quantum wasserstein gen-erative adversarial networks. Advances in Neural Infor-mation Processing Systems, 32, 2019.

[89] David Amaro, Carlo Modica, Matthias Rosenkranz, Mat-tia Fiorentini, Marcello Benedetti, and Michael Lubasch.Filtering variational quantum algorithms for combina-torial optimization. Quantum Science and Technology,7(1):015021, 2022.

[90] M Bilkis, M Cerezo, Guillaume Verdon, Patrick J Coles,and Lukasz Cincio. A semi-agnostic ansatz with variablestructure for quantum machine learning. arXiv preprintarXiv:2103.06712, 2021.

[91] Yuxuan Du, Tao Huang, Shan You, Min-Hsiu Hsieh, andDacheng Tao. Quantum circuit architecture search: errormitigation and trainability enhancement for variationalquantum solvers. arXiv preprint arXiv:2010.10217, 2020.

[92] Kehuan Linghu, Yang Qian, Ruixia Wang, Meng-JunHu, Zhiyuan Li, Xuegang Li, Huikai Xu, Jingning Zhang,Teng Ma, Peng Zhao, et al. Quantum circuit architecturesearch on a superconducting processor. arXiv preprintarXiv:2201.00934, 2022.

[93] En-Jui Kuo, Yao-Lung L Fang, and Samuel Yen-Chi Chen. Quantum architecture search via deep reinforcement learning. arXiv preprint arXiv:2104.07715, 2021.

[94] Shi-Xin Zhang, Chang-Yu Hsieh, Shengyu Zhang, and Hong Yao. Differentiable quantum architecture search. arXiv preprint arXiv:2010.08561, 2020.

[95] Maria Schuld and Nathan Killoran. Quantum machinelearning in feature hilbert spaces. Physical review letters,122(4):040504, 2019.

[96] Marcello Benedetti, Edward Grant, Leonard Wossnig,and Simone Severini. Adversarial quantum circuit learn-ing for pure state approximation. New Journal of Physics,21(4):043023, 2019.

[97] Martin Arjovsky, Soumith Chintala, and Leon Bottou.Wasserstein generative adversarial networks. In Interna-tional Conference on Machine Learning, pages 214–223,2017.

[98] Mehdi Mirza and Simon Osindero. Conditional gener-ative adversarial nets. arXiv preprint arXiv:1411.1784,2014.

[99] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xi-aogang Wang, Xiaolei Huang, and Dimitris N Metaxas.Stackgan: Text to photo-realistic image synthesis withstacked generative adversarial networks. In Proceed-ings of the IEEE International Conference on ComputerVision, pages 5907–5915, 2017.

[100] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, IanGoodfellow, and Brendan Frey. Adversarial autoen-coders. arXiv preprint arXiv:1511.05644, 2015.

[101] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep learning. MIT press, 2016.

[102] Stephen Boyd, Stephen P Boyd, and Lieven Vanden-berghe. Convex optimization. Cambridge universitypress, 2004.

[103] Maria Schuld, Ville Bergholm, Christian Gogolin, JoshIzaac, and Nathan Killoran. Evaluating analytic gra-dients on quantum hardware. Physical Review A,99(3):032331, 2019.

[104] Harry Buhrman, Richard Cleve, John Watrous, andRonald De Wolf. Quantum fingerprinting. PhysicalReview Letters, 87(16):167902, 2001.

[105] Hirotada Kobayashi, Keiji Matsumoto, and TomoyukiYamakami. Quantum merlin-arthur proof systems: Aremultiple merlins more helpful to arthur? In InternationalSymposium on Algorithms and Computation, pages 189–198. Springer, 2003.

[106] Jacob Biamonte. Universal variational quantum compu-tation. Physical Review A, 103(3):L030401, 2021.

[107] Nicolas Le Roux and Yoshua Bengio. Representationalpower of restricted boltzmann machines and deep beliefnetworks. Neural computation, 20(6):1631–1649, 2008.

[108] Xun Gao and Lu-Ming Duan. Efficient representation ofquantum many-body states with deep neural networks.Nature communications, 8(1):662, 2017.

[109] Shahar Mendelson. A few notes on statistical learningtheory. In Advanced lectures on machine learning, pages1–40. Springer, 2003.

[110] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Tal-walkar. Foundations of Machine Learning. AdaptiveComputation and Machine Learning. MIT Press, Cam-bridge, MA, 2 edition, 2018.

[111] Richard M Dudley. The sizes of compact subsets ofhilbert space and continuity of gaussian processes. Jour-nal of Functional Analysis, 1(3):290–330, 1967.

[112] Thomas Barthel and Jianfeng Lu. Fundamental limitations for measurements in quantum many-body systems. Phys. Rev. Lett., 121:080406, Aug 2018.

[113] Sergio Boixo, Sergei V Isakov, Vadim N Smelyanskiy, Ryan Babbush, Nan Ding, Zhang Jiang, Michael J Bremner, John M Martinis, and Hartmut Neven. Characterizing quantum supremacy in near-term devices. Nature Physics, 14(6):595, 2018.

[114] Qingling Zhu, Sirui Cao, Fusheng Chen, Ming-ChengChen, Xiawei Chen, Tung-Hsun Chung, Hui Deng, YajieDu, Daojin Fan, Ming Gong, Cheng Guo, Chu Guo,Shaojun Guo, Lianchen Han, Linyin Hong, He-LiangHuang, Yong-Heng Huo, Liping Li, Na Li, Shaowei Li,Yuan Li, Futian Liang, Chun Lin, Jin Lin, Haoran Qian,Dan Qiao, Hao Rong, Hong Su, Lihua Sun, LiangyuanWang, Shiyu Wang, Dachao Wu, Yulin Wu, Yu Xu,Kai Yan, Weifeng Yang, Yang Yang, Yangsen Ye, Jiang-han Yin, Chong Ying, Jiale Yu, Chen Zha, Cha Zhang,Haibin Zhang, Kaili Zhang, Yiming Zhang, Han Zhao,Youwei Zhao, Liang Zhou, Chao-Yang Lu, Cheng-ZhiPeng, Xiaobo Zhu, and Jian-Wei Pan. Quantum compu-tational advantage via 60-qubit 24-cycle random circuitsampling, 2021.

[115] Scott Aaronson and Lijie Chen. Complexity-theoreticfoundations of quantum supremacy experiments. In 32ndComputational Complexity Conference, page 1, 2017.

[116] Julia Kempe, Alexei Kitaev, and Oded Regev. The com-plexity of the local hamiltonian problem. Siam journalon computing, 35(5):1070–1097, 2006.


Supplementary Material: “Theory of Quantum Generative Learning Models with Maximum Mean Discrepancy”

SM A: Schematic of QGANs in the discrete and continuous settings

For the purpose of elucidation, in this section we first demonstrate the equivalence of QCBMs and QGANs in the discrete setting when the loss function is specified to be the MMD, and then introduce the basic theory of GANs and QGANs, especially QGANs with the MMD loss.

Equivalence between QCBMs and QGANs when Q is discrete. In accordance with the explanations in Refs. [48, 96], when a QGAN is applied to estimate a discrete distribution Q (e.g., quantum state approximation), the quantum generator aims to directly capture the distribution of the data itself. This violates the criteria of implicit generative models, where a stochastic process is employed to draw samples from the underlying data distribution after training. More specifically, when Q is discrete, the output of the quantum circuit for both QCBM and QGAN takes the form Pθ(i) = Tr(Π_i U(θ) ρ0 U(θ)†) in Eq. (1). The concept ‘adversarial’ originates from the way of optimizing θ. Instead of using a deterministic distance measure (e.g., the KL divergence) as in QCBMs, QGANs utilize a discriminator Dγ, implemented by either a trainable parameterized quantum circuit or a neural network, to maximally separate Pθ(i) from Q. The behavior of simultaneously updating θ (to minimize the loss) and γ (to maximize the loss) is termed quantum generative adversarial learning. In this regard, when we replace the trainable Dγ by the deterministic measure MMD, QGAN takes a mathematical form equivalent to QCBM.

Basic theory of (classical) GANs and QGANs when Q is continuous. The fundamental mechanism ofGAN [6] and its variations [97–100] is as follows. GAN sets up a two-players game: the generator G creates data thatpretends to come from the real data distribution Q to fool the discriminator D, while D tries to distinguish the fakegenerated data from the real training data. Mathematically, G and D corresponds to two a differentiable functions.In particular, the input of G is a latent variable z and its output is x, i.e., G : G(z,θ)→ x with θ being trainableparameters for G. The role of the latent variable z is ensuring GAN to be a structured probabilistic model [101].The input of D can either be the generated data x or the real data y ∼ Q and its output corresponds to the binaryclassification result (real or fake), respectively. The mathematical expression of D yields D : D(x,y,γ)→ (0, 1) withγ being trainable parameters for D. If the distribution learned by G equals to the real data distribution, i.e., Pθ = Q,then D can never discriminate between the generated data and the real data and this unique solution is called Nashequilibrium [6].

To reach the Nash equilibrium, the training process of GANs corresponds to the minimax optimization. Namely, thediscriminator D updates γ to maximize the classification accuracy, while the generator G updates θ to minimize theclassification accuracy. With this regard, the optimization of GAN follows

minθ

maxγL(Dγ(Gθ(z)), Dγ(x)) := Ex∼Q[Dγ(x)] + Ez∼PZ [(1−Dγ(Gθ(z))] , (A1)

where Q is the distribution of training dataset, and PZ is the probability distribution of the latent variable z. Ingeneral, Gθ and Dγ are constructed by deep neural networks, and their parameters are updated iteratively usinggradient descent methods [102].

The key difference between GANs and QGANs is the way of implementing Gθ and Dγ . Particularly, in QGANs,either G, D, or both can be realized by variational quantum circuits instead of deep neural networks. The trainingstrategy of QGANs is similar to classical GANs. In this study, we focus on QGANs with MMD loss, which can betreated as the quantum extension of MMD-GAN [81]. Unlike conventional GANs and QGANs, MMD-GAN and QGANreplace a trainable discriminator with MMD. In this way, the family of discriminators is substituted with a familyH of test functions, closed under negation, where the optimization of D can be completed with the analytical form.Therefore, the goal of QGANs is finding an estimator minimizing an unbiased MMD loss, i.e.,

MMD2U (Pθ ||Q) :=

1

n(n− 1)

n∑i 6=i′

k(x(i),x(i′)) +1

m(m− 1)

m∑j 6=j′

k(y(j),y(j′))− 2

nm

∑i,j

k(x(i),y(j)) (A2)

where x(i) ∼ Pθ and y(j) ∼ Q.

SM B: Optimization of QGANs with MMD loss

For self-consistency, in this section, we introduce the elementary backgrounds of the optimization of QGLMs withMMD loss. See Ref. [78] for elaborations.

12

MMD loss. Let k : X × X → R be a Borel measurable kernel on X , and consider the reproducing kernel Hilbertspace Hk associated with k (see Berlinet and Thomas-Agnan [2004]), equipped with inner product 〈·, ·〉Hk . Let

Pk(X ) be the set of Borel probability measures µ such that∫X

√k(x,x)µ(dx) < ∞. The kernel mean embedding

Πk(µ) =∫k(·, y)µ(dy), interpreted as a Bochner integral, defines a continuous embedding from Pk(X ) into Hk. The

mean embedding pulls-back the metric on Hk generated by the inner product to define a pseudo- metric on Pk(X )called the maximum mean discrepancy MMD: Pk(X )× Pk(X )→ R+, i.e.,

MMD(P1 ||P2) = ‖Πk(P1)−Πk(P2)‖Hk . (B1)

The MMD loss has a particularly simple expression that can be derived through an application of the reproducingproperty (f(x) = 〈f, k(·,x)〉Hk), i.e.,

MMD2(P1 ||P2) :=

∥∥∥∥∫Xk(·,x)P1(dx)−

∫Xk(·,x)P2(dx)

∥∥∥∥2

Hk

=

∫X

∫Xk(x,y)P1(dx)P1(dy)− 2

∫X

∫Xk(x,y)P1(dx)P2(dy) +

∫X

∫Xk(x,y)P2(dx)P2(dy)

=Ex,y∼P1(k(x,y))− 2Ex∼P1,y∼P2

(k(x,y)) + Ex,y∼P2(k(x,y)), (B2)

which provides a closed form expression up to calculation of expectations.

Optimization of QCBMs with MMD loss. The goal of QCBMs is finding an estimator minimizing the lossfunction MMD2(Pθ ||Q) in Eq. (B2), where Pθ is defined in Eq. (1). The optimization is completed by the gradient

based descent optimizer. The updating rule satisfies θ(t+1) = θ(t)−η∇θ MMD2(Pθ ||Q) and η is the learning rate.Concretely, the partial derivative of the j-th entry satisfies

∂MMD2(Pθ ||Q)

∂ θj

=∂ Ex,y∼Pθ (k(x,y))− 2Ex∼Pθ,y∼Q(k(x,y)) + Ex,y∼Q(k(x,y))

∂ θj

=∑x,y

k(x,y)

(Pθ(y)

∂ Pθ(x)

∂ θj+ Pθ(x)

∂ Pθ(y)

∂ θj

)− 2

∑x,y

k(x,y)∂ Pθ(x)

∂ θjQ(y)

=∑x,y

k(x,y)(Pθ(y)

(Pθ+π

2 ej(x)− Pθ−π2 ej (x)

)+ Pθ(x)

(Pθ+π

2 ej(y)− Pθ−π2 ej (y)

))−2∑x,y

k(x,y)(Pθ+π

2 ej(x)− Pθ−π2 ej (x)

)Q(y)

=Ex∼Pθ+π2ej,y∼Pθ (k(x,y))− Ex∼Pθ−π

2ej,y∼Pθ (k(x,y)) + Ex∼Pθ,y∼Pθ+π

2ej

(k(x,y))− Ex∼Pθ,y∼Pθ−π2ej

(k(x,y))

−2Ex∼Pθ+π2ej,y∼Q(k(x,y)) + 2Ex∼Pθ−π

2ej,y∼Q(k(x,y)), (B3)

where the last second equality employs the parameter shift rule [103] to calculate the partial derivative ∂ Pθ(y)/∂ θjand ∂ Pθ(x)/∂ θj . According to Lemma 1, the six expectation terms in Eq. (B3) can be analytically and efficientlycalculated when the k(·, ·) is quantum. In the case of classical kernels, the six expectation terms in Eq. (B3) areestimated by the sample mean.

Optimization of QGANs with MMD loss. We next derive the gradients of QGANs with respect to the `-th entry.Since the evaluation of expectation is runtime expensive when Q is continuous, QGANs employ an unbiased estimator

of the MMD loss in Eq. (A2) to update θ. The updating rule at the t-th iteration is θ(t+1) = θ(t)−η∇θ MMD2U (Pθ ||Q).

According to the chain rule, we have

∂MMD2U (Pθ ||Q)

∂ θ`

=∂ 1n(n−1)

∑ni6=i′ k(Gθ(z(i)), Gθ(z(i′)))− 2

nm

∑i,j k(Gθ(z(i)),y(j))

∂ θ`

=1

n(n− 1)

n∑i 6=i′

∂k(Gθ(z(i)), Gθ(z(i′)))

∂Gθ(z(i))

∂Gθ(z(i))

∂ θ`+∂k(Gθ(z(i)), Gθ(z(i′)))

∂Gθ(z(i′))

∂Gθ(z(i′))

∂ θ`− (B4)

2

nm

∑i,j

∂k(Gθ(z(i)),y(j))

∂Gθ(z(i))

∂Gθ(z(i))

∂ θ`(B5)

13

where the first equality uses ∂ 1m(m−1)

∑mj 6=j′ k(y(j),y(j′))/∂ θ` = 0, each derivative ∂k(Gθ(z(i)), Gθ(z(i′)))/∂Gθ(z(i))

can be easily computed for standard kernels, and the derivative ∂Gθ(z(i))/∂ θ` for ∀i ∈ n, j ∈ [m] can be computedvia the parameter shift rule. Therefore, the gradients of QGANs with MMD loss can be achieved.

SM C: Proof of Lemma 1

We first introduce the definition of kernels.

Definition 1 (Definition 2, [95]). Let X be a nonempty set, called the input set. A function k : X × X → C is called

kernel if the Gram matrix K with entries Km,m′ = k(xm,xm′) is positive semi-definite.

Proof of Lemma 1. Here we separately elaborate on the calculation of MMD loss when the target distribution Q can beefficiently prepared by a diagonalized mixed state σ =

∑y Q(y) |y〉 〈y| or a pure quantum state |Ψ〉 =

∑y

√Q(y) |y〉.

Diagonalized mixed states. In this setting, the quantum kernel corresponds to the linear kernel, i.e., k(a,a′) = 〈a,a′〉.Recall the definition of the MMD loss is MMD2(Pθ ||Q) = E(k(x,x′)) − 2E(k(x,y)) + E(k(y,y′)). The first termequals to

E (k(x,x′)) = EPθ,Pθ (〈x,x′〉) = EPθ,Pθ (δx,x′) =∑x

P2θ(x). (C1)

Similarly, the second term equals to

E (k(x,y)) =∑x

Pθ(x)Q(x). (C2)

And the third term equals to

E (k(y,y′)) =∑y

Q2(y). (C3)

The above three terms can be effectively and analytically evaluated by quantum Swap test when the input state ofQCBM in Eq. (1) is a full rank mixed state, e.g., ρ0 = I2N /2N . Denote the output state of QCBM as ρ = U(θ)ρ0U(θ)†.This state is also diagonalized and its diagonalized entry records Pθ, i.e.,

ρ =∑x

Pθ(x) |x〉 〈x| . (C4)

According to [104, 105], given two mixed states %1 and %2, the output of Swap test is 1/2 + Tr(%1%2)/2 with an additiveerror ε in O(1/ε2) runtime. As such, when %1 = %2 = ρ, the first term E (k(x,x′)) can be calculated by Swap test,because Tr(ρρ) =

∑x P

2θ(x). Likewise, through setting %1 = ρ, and %2 = σ (%1 = %2 = σ), the second (third) term

can be efficiently evaluated by Swap test with an additive error ε. In other words, by leveraging Swap test, we canestimate MMD loss with an additive error ε in O(1/ε2) runtime cost.

Pure states. In this setting, the quantum kernel corresponds to a nonlinear kernel, i.e., k(a,a′) = 〈 a√P(a)

, a′√P(a′)〉,

where P(a) stands for the probability of sampling a and∑a P(a) = 1. With this regard, the explicit form of MMD

loss yields

MMD(Pθ ||Q)

=E(k(x,x′))− 2E(k(x,y)) + E(k(y,y′))

=∑x

∑x′

Pθ Pθ

⟨x√Pθ(x)

,x′√Pθ(x′)

⟩− 2

∑x

∑y

Pθ Q

⟨x√Pθ(x)

,y√Q(y)

⟩+∑y

∑y′

QQ

⟨y√Q(y)

,y′√Q(y′)

⟩=∑x

Pθ(x) +∑y

Q(y)− 2∑x

√Pθ(x)Q(x)

=2− 2∑x

√Pθ(x)Q(x). (C5)

The above results indicate that the evaluation of MMD loss amounts to calculating∑x

√Pθ(x)Q(x). Denote

the generated state of QCBM as |Φ(θ)〉 = U(θ) |0〉⊗n = eiφ∑x Pθ(x) |x〉, where ρ0 = (|0〉 〈0|)⊗n. When the target

14

distribution Q refers to a pure quantum state |Ψ〉 =∑y

√Q(y) |y〉, the term

∑x

√Pθ(x)Q(x) can be evaluated by

Swap test [104], i.e.,

| 〈Φ(θ)|Ψ〉 |2 =

2N∑i=1

2N∑j=1

√Pθ(x)Q(y) 〈x|y〉

2

=

2N∑i=1

Pθ(x)Q(x)

2

. (C6)

According to [104], taking account of the sample error, this term can be estimated within an additive error ε in O(1/ε2)runtime complexity.

SM D: Proof of Theorem 1 (generalization of QCBMs)

The proof of Theorem 1 adopts the following lemma.

Lemma 2 (Adapted from Theorem 1,[87]). Suppose that the kernel k(·, ·) is bounded. Following the notations inTheorem 1, when the number of examples sampled from P

θ(n,m) and Q is n and m, with probability 1− δ,

MMD(Pθ(n,m) ||Q) ≤ inf

θ∈ΘMMD(Pθ ||Q) + 2

√2

n+

2

m

√supx∈X

k(x,x)

(2 +

√2

δ

). (D1)

Proof of Theorem 1. To prove Theorem 1, we first derive the upper bound of the generalization error RC under thegeneric setting. Then, we analyze RQ under the specific setting where the quantum kernel is employed and the targetdistribution Q can be directly accessed by quantum machines.

The calculation of RC . Recall the definition of θ(n,m)

in Eq. (5). Let us first rewrite the generalization error as

RC =MMD2(Pθ(n,m) ||Q)− inf

θ∈ΘMMD2(Pθ ||Q)

=

(MMD(P

θ(n,m) ||Q)− inf

θ∈ΘMMD(Pθ ||Q)

)(MMD(P

θ(n,m) ||Q) + inf

θ∈ΘMMD(Pθ ||Q)

)≤2C1

∣∣∣∣MMD(Pθ(n,m) ||Q)− inf

θ∈ΘMMD(Pθ ||Q)

∣∣∣∣ , (D2)

where the second equality uses infθ∈Θ MMD2(Pθ ||Q) = (infθ∈Θ MMD(Pθ ||Q))2 and the inequality employs theMMD(P

θ(n,m) ||Q) + infθ∈Θ MMD(Pθ ||Q) ≤ 2 MMD(P

θ(n,m) ||Q) ≤ 2C1.

In conjunction with the above equation with the results of Lemma 2, we obtain that with probability at least 1− δ,the upper bound of the generalization error of QGCM yields

RC ≤ 4C1

(2

n+

√2

m

)√supx∈X

k(x,x)

(2 +

√log

2

δ

). (D3)

The calculation of RQ. Recall the definition of θ in Eq. (4). When QCBM adopts the quantum kernel and the targetdistribution Q can be directly accessed by quantum machines, the minimum argument of the loss function yields

θ = arg minθ∈Θ MMD2(Pθ ||Q). Following the definition of the generalization error, we obtain

RQ =MMD2(Pθ ||Q)− infθ∈Θ

MMD2(Pθ ||Q)

≤ MMD2(Pθ(n,m) ||Q)− inf

θ∈ΘMMD2(Pθ ||Q)

= RC , (D4)

where the inequality is supported by the definition of θ, i.e, MMD2(Pθ ||Q) = minθ∈Θ MMD2(Pθ ||Q) ≤MMD2(P

θ(n,m) ||Q).

Combining the results of RC and RQ in Eqs. (D3) and (D4), we obtain that with probability at least 1− δ,

RQ ≤ RC ≤ 4C1

(2

n+

√2

m

)√supx∈X

k(x,x)

(2 +

√log

2

δ

). (D5)

15

SM E: The comparison between QCBMs and GLMs for estimating discrete distributions

In this section, we emphasize a central question in QGLMs, i.e., when both variational quantum circuits and neuralnetworks are used to implement Pθ, which one can attain a lower infθ∈Θ MMD2(Pθ ||Q). The importance of this issuecomes from Eq. (7), where the generalization error bound becomes meaningful when infθ∈Θ MMD2(Pθ ||Q) is small.In this respect, it is necessary to understand whether QCBMs allow a lower infθ∈Θ MMD2(Pθ ||Q) over (classical)GLMs.

In what follows, we analyze when QCBMs promise a lower infθ∈Θ MMD2(Pθ ||Q) over a typical GLM—restrictedBoltzmann machine (RBM) [83]. To be more specific, consider that both QCBMs and RBMs are universal approximatorswith an exponential number of trainable parameters [106, 107] with infθ∈Θ MMD2(Pθ ||Q) → 0, we focus on themore practical scenario in which the number of parameters polynomially scales with the feature dimension, the qubitcount, and the number of visible neurons. Denote the space of the parameterized distributions formed by QCBMs (or

RBM) as PQCBMΘ (or PRBM

Θ ). The superiority of QCBMs can be identified by showing infθ∈Θ MMD2(PQCBMθ ||Q) ≤

infθ∈Θ MMD2(PRBMθ ||Q). This amounts to finding a distribution Q satisfying(

Q ∈ PQCBMΘ

)∧(Q /∈ PRBM

Θ

). (E1)

According to the results in [108], there is a large class of quantum states meeting the above requirement. Representativeexamples include projected entangled pair states and ground states of k-local Hamiltonians.

SM F: Proof of Theorem 2 (generalization of QGANs)

The proof of Theorem 2 utilizes the following three lemmas. For clearness, we defer the proof of Lemmas 4 and 5 toSM F 1 and F 2, respectively.

Lemma 3 (McDiarmids inequality, [109]). Let f : X1 ×X2 × ...×XN → R and assume there exists c1, ..., c2 ≥ 0 suchthat, for all k ∈ 1, ..., N, we have

supx1,...,xk,xk,...xN

|f(x1, ...,xk, ...,xN )− f(x1, ..., xk, ...,xN )| ≤ ck. (F1)

Then for all ε ≥ 0 and independent random variables ξ1, ...ξN in X ,

Pr(|f(ξ1, ...ξN )− E(f(ξ1, ...ξN ))| ≥ ε) ≤ exp

(−2ε2∑Nn=1 c

2n

). (F2)

Lemma 4. Following the notations in Theorem 2, define G = k(Gθ(·), Gθ(·))|θ ∈ Θ and G+ = k(Gθ(z), Gθ(·))|θ ∈Θ, z ∈ Z. Given the set S = z(i)ni=1, we have

E(supθ∈Θ|Ez,z′(k(Gθ(z), Gθ(z′)))− 1

n(n− 1)

∑i 6=i′

k(Gθ(z(i)), Gθ(z(i′)))|)

≤ 8

n− 1+

24√d2k(Nge +Ngt)

n− 1

(1 +N ln(1764C2

3 (n− 1)NgeNgt)). (F3)

Lemma 5. Following the notations in Theorem 2, define W = k(Gθ(·), ·)|θ ∈ Θ and W+ = k(Gθ(·),y)|θ ∈ Θ,y ∈Y. Given the set S = z(i)ni=1 and the set y(j)mj=1, we have

E

supθ∈Θ

∣∣∣Ez,y(k(Gθ(z),y))− 1

mn

∑i∈[n],j∈[m]

k(Gθ(z(i)),y(j))∣∣∣

≤ 8

n+

24√d2k(Nge +Ngt)

n

(1 +N ln(1764C2

3nNgeNgt)). (F4)

We are now ready to prove Theorem 2.

16

Proof of Theorem 2. Let E(θ) = MMD2U (Pnθ ||Qm) and T (θ) = MMD2(Pθ ||Q). Note that the generalization error

equals to

RC = T (θ(n,m)

)− T (θ∗). (F5)

In the remainder of the proof, when no confusion occurs, we abbreviate θ(n,m)

as θ for clearness. The above equationis upper bounded by

RC = E(θ)− E(θ) + T (θ)− T (θ∗) ≤ E(θ∗)− E(θ) + T (θ)− T (θ∗) ≤ |T (θ)− E(θ)|+ |T (θ∗)− E(θ∗)|, (F6)

where the first inequality employs the definition of θ with E(θ∗) ≥ E(θ) and the second inequality uses the property ofabsolute value function. In the following, we derive the probability for supθ∈Θ |T (θ)− E(θ)| < ε, which in turn canachieve the upper bound of RC .

According to the explicit form of the MMD loss, supθ∈Θ |E(θ)− T (θ)| satisfies

supθ∈Θ

∣∣MMD2U (Pnθ ||Qm)−MMD2(Pθ ||Q)

∣∣=supθ∈Θ

∣∣∣Ez,z′(k(Gθ(z), Gθ(z′)))− 2Ez,y(k(Gθ(z),y)) + Ey,y′(k(y,y′))− 1

n(n− 1)

∑i 6=i′

k(Gθ(z(i)), Gθ(z(i′)))

+2

mn

∑i∈[n],j∈[m]

k(Gθ(z(i)),y(j))− 1

m(m− 1)

∑j 6=j′

k(y(j),y(j′))∣∣∣

≤supθ∈Θ

∣∣∣Ey,y′(k(y,y′))− 1

m(m− 1)

∑j 6=j′

k(y(j),y(j′))∣∣∣︸ ︷︷ ︸

T1

+ supθ∈Θ

∣∣∣Ez,z′(k(Gθ(z), Gθ(z′)))− 1

n(n− 1)

∑i6=i′

k(Gθ(z(i)), Gθ(z(i′)))∣∣∣︸ ︷︷ ︸

T2

+2 supθ∈Θ

∣∣∣Ez,y(k(Gθ(z),y))− 1

mn

∑i∈[n],j∈[m]

k(Gθ(z(i)),y(j))∣∣∣

︸ ︷︷ ︸T3

(F7)

where the inequality comes from the Jensen inequality.

We next separately derive the upper bounds of the terms T1, T2, and T3 in Eq. (F7).

Upper bound of T1. The upper bound of T1 only depends on the examples sampled from the target distributionQ, which is independent of θ ∈ Θ. With this regard, T1 can be taken out of the supremum and we apply theconcentration inequality in Lemma 3 to derive the upper bound of |Ey,y′(k(y,y′)) − 1

m(m−1)

∑j 6=j′ k(y(j),y(j′))|.

Recall the precondition of employing Lemma 3 is finding the upper bound on f(·). Let the function f(·) be1

m(m−1)

∑j 6=j′ k(y(j),y(j′)). For each ` ∈ 1, ...,m, the desired upper bound yields∣∣∣∣∣∣− 1

m(m− 1)(∑

j 6=j′,j 6=`

k(y(j),y(j′)) +∑j′ 6=`

k(y(`),y(j′))) +1

m(m− 1)(∑

j 6=j′,j 6=`

k(y(j),y(j′)) +∑j′ 6=`

k(y(`),y(j′)))

∣∣∣∣∣∣=

∣∣∣∣∣∣ 1

m(m− 1)

∑j′ 6=`

(k(y(`),y(j′)))− k(y(`),y(j′)))

)∣∣∣∣∣∣≤2C2

m, (F8)

where the inequality leverages the assumption that the kernel k(·, ·) is upper bounded by C2.Given this upper bound, we obtain

Pr(T1 ≥ ε)

=Pr

∣∣∣∣∣∣Ey,y′(k(y,y′))− 1

m(m− 1)

∑j 6=j′

k(y(j),y(j′))

∣∣∣∣∣∣ ≥ ε

≤exp

(− ε2

8C22

m

)= δT1, (F9)

17

where the inequality exploits the results in Lemma 3.Upper bound of T2. We next use the concentration inequality to quantify the upper bound of T2. The derivation issimilar to that of T1. In particular, supported by the results of Lemma 3, we have

Pr(|T2− E(T2)| ≥ ε) ≤ exp

(− ε2

8C22

n

)= δT2. (F10)

Suppose that E(T2) ≤ ε1, an immediate observation is that

Pr(T2 ≥ ε1 + ε) ≤ exp

(− ε2

8C22

n

)= δT2. (F11)

In other words, the derivation of the upper bound of T2 amounts to analyzing the upper bound ε1.Upper bound of T3. Following the same routine with the derivation of the upper bound of T2, we obtain

Pr(|T3− E(T3)| ≥ ε) ≤ exp

(− ε2

8C22

nm

n+m

)= δT3. (F12)

Suppose that E(T3) ≤ ε2. The above result hints that

Pr(T3 ≥ ε2 + ε) ≤ exp

(− ε2

8C22

nm

n+m

)= δT3. (F13)

Summing up Eqs. (F9), (F11), and (F13), the union bound gives

Pr

(supθ∈Θ|E(θ)− T (θ)| ≥ ε1 + 2ε2 + 4ε

)≤ δT1 + δT2 + δT3

⇒Pr

(supθ∈Θ|E(θ)− T (θ)| ≥ ε1 + 2ε2 + 4ε

)≤ 3δT3

⇒Pr(|E(θ)− T (θ)| ≥ ε1 + 2ε2 + 4ε

)≤ 3δT3. (F14)

This yields that with probability at least 1− 3δT3,

2(ε1 + 2ε2 + 4ε) ≥ |E(θ)−T (θ)|+ |E(θ∗)−T (θ∗)| ≥ |E(θ)−T (θ)− E(θ∗) + T (θ∗)| ≥ |T (θ)−T (θ∗)| = RC . (F15)

According to the explicit forms of ε1 and ε2 achieved in Lemmas 4 and 5, with probability 1− 3δT3, the generalizationerror of QGANs is upper bounded by

RC

≤8ε+ 2

(8

n− 1+

24√d2k(Nge +Ngt)

n− 1

(1 +N ln(1764C2

3 (n− 1)NgeNgt)))

+4

(8

n+

24√d2k(Nge +Ngt)

n

(1 +N ln(1764C2

3nNgeNgt)))

≤8

√8C2

2 (n+m)

nmln(

1

3δT3) + 6

(8

n− 1+

24√d2k(Ngt +Nge)

n− 1

(1 +N ln(1764C2

3nNgeNgt)))

. (F16)

where the last inequality uses the relation between δT3 and ε in Eq. (F13) with ε =√

8C22 (n+m) ln(1/δT3)/(nm),

1/n < 1/(n− 1), and n− 1 < n.

1. Proof of Lemma 4

Recall Lemma 4 aims to derive the upper bound of E(T2) in Eq. (F7). In Ref. [81], the authors utilize a statisticalmeasure named Rademacher complexity to quantify these two terms. Different from the classical counterpart, here weadopt an another statistical measure, i.e., covering number, to derive the upper bound of E(T2). This measure allowsus to identify how E(T2) scales with the qubit count N and the architecture of the employed Ansatz such as thetrainable parameters Ngt and the types of the quantum gates. For self-consistency, we provide the formal definition ofcovering number and Rademacher as follows.

18

Definition 2 (Covering number, [110]). The covering number N (U , ε, ‖ · ‖) denotes the least cardinality of any subsetV ⊂ U that covers U at scale ε with a norm ‖ · ‖, i.e.,

supA∈U

minB∈V‖A−B‖ ≤ ε. (F17)

Definition 3 (Rademacher, [110]). Let µ be a probability measure on X , and let F be a class of uniformly boundedfunctions on X . Then the Rademacher complexity of F is

Rn(F) = EµEσ1,...,σn

(1√n

supf∈F

∣∣∣∣∣n∑i=1

σif(x(i))

∣∣∣∣∣), (F18)

where σ = (σ1, ..., σn) is a sequence of independent Rademacher variables taking values in −1, 1 and each withprobability 1/2, and x1, ...,xn ∈ X are independent, µ-distributed random variables.

Intuitively, the covering number concerns the minimum number of spherical balls with radius ε that occupies thewhole space; the Rademacher complexity measures the ability of functions from F to fit random noise. The relationbetween Rademacher complexity and covering number is established by the following Dudley entropy integral bound.

Lemma 6 (Adapted from [81, 111]). Let F = f : X × X → R and F+ = h = f(x, ·) : f ∈ F ,x ∈ X andF+ ⊂ B(L∞(X )). Given the set S = x(1), ...,x(n) ∈ X , denote the Rademacher complexity of F+ as Rn(F+), itsatisfies

Rn(F+) ≤ infα>0

(4α+

12√n

∫ 1

α

√ln(N ((F+)|S , ε, ‖ · ‖2)dε

), (F19)

where (F+)|S = [f(x,x(i))]i=1:n : f ∈ F ,x ∈ X denotes the set of vectors formed by the hypothesis with S.

Ref. [81] hinges on the term E(T2) with the Rademacher complexity, as stated in the following lemma.

Lemma 7 (Adapted from Lemma 1, [81]). Following notations in Theorem 2 and Lemma 6, define G =k(Gθ(·), Gθ(·))|θ ∈ Θ and G+ = k(Gθ(z), Gθ(·))|θ ∈ Θ, z ∈ Z. Given the set S = z(i)ni=1, we have

E(T2) ≤ 2√n− 1

Rn−1(G+),

where Rn−1(G+) refers to the Rademacher’s complexity of G+.

In conjunction with the above two lemmas, the term E(T2) is upper bounded by the covering number of G+. Assuch, the proof of Lemma 4 utilizes the following three lemmas, which are used to formalize the relation of coveringnumber of two metric spaces, quantify the covering number of variational quantum circuits, and evaluate the coveringnumber of the space living in N -qubit quantum states, respectively.

Lemma 8 (Lemma 5, [112]). Let (H1, d1) and (H2, d2) be metric spaces and f : H1 → H2 be bi-Lipschitz such that

d2(f(x), f(y)) ≤ Kd1(x,y), ∀x,y ∈ H1, (F20)

and

d2(f(x), f(y)) ≥ kd1(x,y), ∀x,y ∈ H1 with d1(x,y) ≤ r. (F21)

Then their covering numbers obey

N (H1, 2ε/k, d1) ≤ N (H2, ε, d2) ≤ N (H1, ε/K, d1), (F22)

where the left inequality requires ε ≤ kr/2.

Lemma 9 (Lemma 2, [70]). Define the operator group as

Hcirc :=U(θ)ΠjU(θ)†|θ ∈ Θ

. (F23)

Suppose that the employed encoding Ansatz U(θ) containing in total Ng gates, each gate ui(θ) acting on at most k

qudits, and Ngt ≤ Ng gates in U(θ) are trainable. The ε-covering number for the operator group Hcirc with respect tothe operator-norm distance obeys

N (Hcirc, ε, ‖ · ‖) ≤(

7Ngt‖Πj‖ε

)d2kNgt, (F24)

where ‖Πj‖ denotes the operator norm of Πj.

19

Lemma 10. Define the input state group as B :=ρz := U(z)†(|0〉 〈0|)U(z)

∣∣∣z ∈ Z. Suppose that the employed

quantum circuit U(z) containing in total Nge parameterized gates to load z and each gate ui(z) acting on at most kqudits. The ε-covering number for B with respect to the operator-norm distance obeys

N (B, ε, ‖ · ‖) ≤(

7Ngeε

)d2kNge. (F25)

Proof of Lemma 10. The proof is identical to that presented in Lemma 2 of Ref. [70].

We are now ready to prove Lemma 4.

Proof of Lemma 4. Recall the aim of Lemma 4 is to obtain the upper bound of E(T2). In conjunction with Lemmas 6and 7, we obtain

E(T2) ≤ E2√n− 1

infα>0

(4α+

12√n− 1

∫ 1

α

√ln(N ((G+)|S , ε, ‖ · ‖2)dε

), (F26)

where G+ = k(Gθ(z), Gθ(·))|θ ∈ Θ, z ∈ Z and S denotes the set z(1), ...,z(n−1) sampled from the prior distributionPZ , and (G+)|S = [k(Gθ(z), Gθ(z(i))]i=1:n−1 : θ ∈ Θ, z ∈ Z denotes the set of vectors formed by the hypothesiswith S. In other words, the upper bound of E(T2) is quantified by the covering number N ((G+)|S , ε, ‖ · ‖2).

We next follow the definition of covering number to quantify how N ((G+)|S , ε, ‖ · ‖2) depends on the structureinformation of the employed Ansatz and the input quantum states. Denote Qε1 as an ε1-covering of the set Q1 =Gθ(z)|θ ∈ Θ and Qε3 as an ε3-covering of the set Q = Gθ(z)|z ∈ Z. Then, the covering number N ((G+)|S , ε, ‖·‖2)can be upper bounded by N ((Q1)|S , ε1, ‖ · ‖2) and N ((Q3)|S , ε3, ‖ · ‖3). Mathematically, according to the explicitexpression of (G+)|S , we have for any (θ, z) and (θ′, z′)

∥∥∥[k(Gθ(z), Gθ(z(i)))]i=1:n−1 − [k(Gθ′(z′), Gθ′(z

(i)))]i=1:n−1

∥∥∥2

=∥∥∥[k(Gθ(z), Gθ(z(i)))]i=1:n−1 − [k(Gθ′(z), Gθ(z(i)))]i=1:n−1 + [k(Gθ′(z), Gθ(z(i)))]i=1:n−1

−[k(Gθ′(z), Gθ′(z(i)))]i=1:n−1 + [k(Gθ′(z), Gθ′(z

(i)))]i=1:n−1 − [k(Gθ′(z′), Gθ′(z

(i)))]i=1:n−1

∥∥∥2

≤∥∥∥[k(Gθ(z), Gθ(z(i)))]i=1:n−1 − [k(Gθ′(z), Gθ(z(i)))]i=1:n−1

∥∥∥2

+∥∥∥[k(Gθ′(z), Gθ(z(i)))]i=1:n−1

−[k(Gθ′(z), Gθ′(z(i)))]i=1:n−1

∥∥∥2

+∥∥∥[k(Gθ′(z), Gθ′(z

(i)))]i=1:n−1 − [k(Gθ′(z′), Gθ′(z

(i)))]i=1:n−1

∥∥∥2

≤C3

(√n− 1

∥∥Gθ(z)−Gθ′(z)∥∥+

∥∥∥[∥∥Gθ′(z(i))−Gθ(z(i))∥∥]i=1:n−1

∥∥∥2

+√n− 1

∥∥Gθ′(z)−Gθ′(z′)∥∥) , (F27)

where the first inequality uses the triangle inequality and the last inequality exploits C3-Lipschitz property of thekernel. Following the definition of covering number, the above relationship indicates that if for any θ there exists θ′

such that∥∥Gθ(z)−Gθ′(z)

∥∥ ≤ ε1 holds for every z, and for any z there exists z′ such that∥∥Gθ(z)−Gθ(z′)

∥∥ ≤ ε3holds for every θ, the composition of the covering sets Qε1 and Qε3 forms the covering set of (G+)|S . That is, thecovering number of (G+)|S is upper bounded by

N ((G+)|S , C3

√n− 1(2ε1 + ε3), ‖ · ‖2) ≤ N (Q1, ε1, ‖ · ‖2)×N (Q3, ε3, ‖ · ‖2). (F28)

In other words, to quantify the ε-covering of (G+)|S , it is equivalent to deriving the upper bound of

N (Q1, ε/(3C3

√n− 1), ‖ · ‖2) and N (Q3, ε/(3C3

√n− 1), ‖ · ‖2), respectively. We next separately derive these two

quantities.

The upper bound of N (Q1, ε/(3C3

√n− 1), ‖ · ‖2). Let Q4 be an ε

3C32N√n−1

-cover of Hcirc in Eq. (F23). Then, for

any θ, there exists θ′ such that ‖U(θ)ΠjU(θ) − U(θ′)ΠjU(θ′)‖ ≤ ε3C32N

√n−1

for every j with U(θ′)ΠjU(θ′) ∈ Q4.

20

This leads that for any z, we have∥∥∥Gθ(z)−Gθ′(z)∥∥∥

2

=∥∥∥[Tr(U(θ)ΠjU(θ)†ρz)− Tr(U(θ′)ΠjU(θ′)†ρz)]j=1:2N

∥∥∥2

≤∥∥∥[‖U(θ)ΠjU(θ)† − U(θ′)ΠjU(θ′)†‖]j=1:2N

∥∥∥2

≤2Nε

3C32N√n− 1

3C3

√n− 1

, (F29)

where the first inequality comes from the Cauchy-Schwartz inequality and the second inequality follows the definitionof covering number.

The above observation means that the covering set of N (Q1, ε/(3C3

√n− 1), ‖ · ‖2) is independent with z and its

covering number is upper bound by N (Hcirc, ε3C32N

√n−1

, ‖ · ‖2). Then, by leveraging the results in Lemma 9, we

obtain

N(Q1,

ε

3C3

√n− 1

, ‖ · ‖2)≤ N

(Hcirc,

ε

3C32N√n− 1

, ‖ · ‖2)≤(

21C32N√n− 1Ngt‖Πj‖ε

)d2kNgt. (F30)

The upper bound of N (Q3, ε/(3C3

√n− 1), ‖ · ‖2). Let Q5 be an ε

3C32N√n−1

-cover of B in Eq. (F25). Then, for any

encoding state ρz ∈ B, there exists ρz′ ∈ Q5 with ‖ρz−ρz′‖ ≤ ε3C32N

√n−1

. By expanding the term∥∥Gθ′(z)−Gθ′(z′)

∥∥,

we obtain the following result, i.e., for any θ′,∥∥Gθ′(z)−Gθ′(z′)∥∥

=∥∥[Tr(ΠjU(θ′)ρzU(θ′)†)− Tr(ΠjU(θ′)ρz′U(θ′)†)]j=1:2N

∥∥=∥∥[Tr(U(θ′)†ΠjU(θ′)(ρz − ρz′))]j=1:2N

∥∥≤∥∥∥[∥∥ρz − ρz′∥∥]j=1:2N

∥∥∥≤2N

ε

3C32N√n− 1

3C3

√n− 1

, (F31)

where the first inequality uses Tr(AB) ≤ Tr(A)‖B‖ when 0 A and Tr(U(θ′)†ΠjU(θ′)) = Tr(Πj) = 1 for ∀j ∈[2N ], and the last inequality follows the definition of covering number. The achieved relation means that thecovering set of N (Q3, ε/(3C3

√n− 1), ‖ · ‖2) does not depend on θ and its covering number is upper bounded by

N (B, ε3C32N−1

√n−1

, ‖ · ‖2).

Then, based on the results in Lemma 10, we have

N(Q3,

ε

3C3

√n− 1

, ‖ · ‖2)≤ N

(B, ε

3C32N√n− 1

, ‖ · ‖2)≤(

21C32N√n− 1Ngeε

)d2kNge. (F32)

Combining Eqs. (F30) and (F32), the covering number N ((G+)|S , ε, ‖ · ‖2) in Eqn. (F28) is upper bounded by

N((G+)|S , ε, ‖ · ‖2

)≤N

(Hcirc,

ε

3C32N−1√n− 1

, ‖ · ‖2)×N

(B, ε

3C32N−1√n− 1

, ‖ · ‖2)

≤(

21C32N√n− 1Ngt‖Πj‖ε

)d2kNgt (21C32N

√n− 1Ngeε

)d2kNge=(21C32N

√n− 1Ngt)

d2kNgt(21C32N√n− 1Nge)

d2kNge

(1

ε

)d2k(Nge+Ngt)

. (F33)

21

Denote C5 = (21C32N√n− 1Ngt)

d2kNgt

d2k(Nge+Ngt) (21C32N√n− 1Nge)

d2kNge

d2k(Nge+Ngt) . Using Lemma 6, we obtain

E(T2) ≤ 2√n− 1

infα>0

4α+12√n− 1

∫ 1

α

√√√√ln

((C5

ε

)d2k(Nge+Ngt))dε

=

2√n− 1

infα>0

(4α+

12√d2k(Nge +Ngt)√

n− 1

∫ 1

α

√ln

(C5

ε

)dε

)

≤ 2√n− 1

infα>0

(4α+

12√d2k(Nge +Ngt)√

n− 1

(ε+ ε ln

(C5

ε

)) ∣∣∣1ε=α

). (F34)

For simplicity, we set α = 1/√n− 1 in Eq. (F34) and then E(T2) is upper bounded by

E(T2)≤ 2√n− 1

(4√n− 1

+12√d2k(Nge +Ngt)√

n− 1

(ε+ ε ln

(C5

ε

)) ∣∣∣1ε=α

)

≤ 8

n− 1+

24√d2k(Nge +Ngt)

n− 1(1 + lnC5) . (F35)

Since the two exponent terms in C5 are no larger than 1, we have C5 ≤ (21C32N√n− 1Ngt)(21C32N

√n− 1Nge).

This relation further simplifies the upper bound E(T2) as

E(T2) ≤ 8

n− 1+

24√d2k(Nge +Ngt)

n− 1

(1 +N ln(1764C2

3 (n− 1)NgeNgt)). (F36)

2. Proof of Lemma 5

Proof of Lemma 5. The proof of Lemma 5 is very similar to the one of Lemma 4 and thus we skip it here.

SM G: Generalization of QGANs with varied Ansatz

The derived generalization error bound in Theorem 2 is succinct, which can be directly employed to quantifythe generalization ability of QGANs with the specified Ansatze. For concreteness, here we separately analyze thegeneralization ability of QGANs with two typical classes of Ansatze, i.e., the hardware-efficient Ansatze and thequantum approximation optimization Ansatze.

Hardware-efficient Ansatze. An N -qubits hardware-efficient Ansatz is composed of L layers, i.e., U(θ) =∏Ll=1 U(θl) with L ∼ poly(N), where U(θl) is composed of parameterized single-qubit gates and fixed two-qubit gates.

In general, the topology of U(θl) for any l ∈ [L] is the same and each qubit interacts with at least one parameterizedsingle-qubit gate and two qubits gates. Mathematically, we have U(θl) = (⊗Ni=1Us)Ueng with Us = RZ(β)RY (γ)RZ(ν)being realized by three rotational qubit gates and γ, β, ν ∈ [0, 2π). The number of two-qubit gates in each layer isset as N and the connectively of two-qubit gates aims to adapt to the topology restriction of quantum hardware.The entangled layer Ueng contains two-qubit gates, i.e., CNOT gates, whose connectivity adapts the topology of thequantum hardware.

An example of 4-qubit QGAN with hardware-efficient Ansatz is illustrated in the left panel of Fig. ??(a). Underthis setting, when both the encoding unitary and the trainable unitary adopt the hardware-efficient Ansatz, we havek = 2, d = 2, Nge = LEN , Ngt = L(3N), and Ng = L(3N +N) = 4L. Based on the above settings, we achieve thegeneralization error of ab N -qubit QGAN with the hardware-efficient Ansatze, supported by Theorem 2, i.e., withprobability at least 1− δ,

RC ≤ 8

√8C2

2 (n+m)

nmln

1

δ+

48

n− 1+

576√N(LE + 3L)

n− 1(N ln(5292C2

3nN2LEL) + 1), (G1)

where C2 = supx k(x,x).

22

UE(z(i)) U(l)

|0i RY (z(i)1 ) U(l

1)

|0i RY (z(i)2 ) U(l

2)

|0i RY (z(i)3 ) U(l

3)

|0i RY (z(i)4 ) U(l

4)

×𝐿! ×𝐿 ×𝐿! ×𝐿

UE(z(i)) U(l)

|0i RY (z(i)1 ) RX(l

1)

UC(l+1)

|0i RY (z(i)2 ) RX(l

2)

|0i RY (z(i)3 ) RX(l

3)

|0i RY (z(i)4 ) RX(l

4)

(a) (b)

FIG. 4: Illustration of QGANs with different Ansatze. The left panel presents that both the encoding method and thetrainable unitary of QGANs employ the hardware-efficient Ansatze. The right panel presents a class of QGANs such that theencoding method uses the hardware-efficient Ansatze and the trainable unitary is implemented by the quantum approximateoptimization Ansatze.

Quantum approximate optimization Ansatze. Fig. 4 (b) plots the quantum approximate optimization

Ansatze. The mathematical expression of this Ansatz takes the form U(θ) =∏Ll=1 U(θl), where the l-th layer

U(θl) = UB(θl)UC(θl) is implemented by the driven Hamiltonian UB(θl) = ⊗Ni=1RX(θli) and the target Hamiltonian

UC(θl) = exp(−iθll+1HC) with HC being a specified Hamiltonian. Under this setting, when the encoding unitaryis constructed by the hardware-efficient Ansatz and the trainable unitary is realized by the quantum approximateoptimization Ansatze, we have k = N , d = 2, Nge = LEN , and Ngt = L(N + 1). Based on the above settings, weachieve the generalization error of an N -qubit QGAN with the quantum approximate optimization Ansatze, supportedby Theorem 2, i.e., with probability at least 1− δ,

RC ≤ 8

√8C2

2 (n+m)

nmln

1

δ+

48

n− 1+

144× 2N√

(N + 1)(L+ LE)

n− 1(N ln(1764C2

3nLLEN(N + 1)) + 1), (G2)

where C2 = supx k(x,x).

SM H: More details of numerical simulations

1. Hyper-para and metrics

RBF kernel. The explicit expression of the radial basis function (RBF) kernel is k(x,y) = exp(− ||x−y||2

σ2 ), where

σ refers to the bandwidth. In all simulations, we set σ−2 as 0.25, 4 for QCBMs and −0.001, 1, 10 for QGANs,respectively.

KL divergence. We use the KL divergence to measure the similarity between the generated distribution Pθ and truedistribution Q. In the discrete setting, its mathematical expression is KL(Pθ ||Q) =

∑i Pθ(i) log(Q(i)/Pθ(i)) ∈ [0,∞).

In the continuous setting, KL(Pθ ||Q) =∫Pθ(dx) log(Q(dx)/Pθ(dx))dx ∈ [0,∞). When the two distributions are

exactly matched with P = Q, we have KL(P ||Q) = 0.

State fidelity. Suppose that the generated state is |Ψ(θ)〉 and the target state is |Ψ∗〉. The state fidelity [84] forpure states is defined as F = | 〈Ψ(θ)|Ψ∗〉 |2.

Optimizer. For QCBMs, the classical optimizer is assigned as L-BFGS-B algorithm and the tolerance for terminationis set as 10−12. For QGANs, the classical optimizer is assigned as Adam with default paramters. The source code.The QCBM is realized by Python, Numpy, Scipy, Tensorflow, Keras, and QIBO. We release the source code in Githubrepository.

Hardware parameters. All simulation results in this study are completed by the classical device with Intel(R)Xeon(R) Gold 6267C CPU @ 2.60GHz and 128 GB memory.

23

𝑁 = 8

𝑁 = 12

0.001

7e-5

0.0110.015

0.3

0.03

2.122.21

(𝒂) (𝒂)

FIG. 5: Generating discrete Gaussian with varied settings of QCBMs. (a) The simulation results with the variednumber samples n (corresponding to x-axis). The upper and lower panels demonstrate the achieved MMD loss for N = 8, 12,respectively. (b) The simulation results of QCBM with N = 12, the quantum kernel, and n→∞ for the varied circuit depth(corresponding to x-axis). The left and right panels separately show the achieved MMD loss and the KL divergence between thegenerated and the target distributions.

2. More simulation results related to the task of discrete Gaussian approximation

Training loss of QCBMs. Fig. 5(a) plots the last iteration training loss of QCBMs. All hyper-parameter settingsare identical to those introduced in the main text. The x-axis stands for the setting of n used to compute MMDU inEq. (A2). The simulation results indicate that the performance QCBM with RBM kernel is steadily enhanced with theincreased n. When n → ∞, its performance approaches to the QCBM with quantum kernel. These phenomenonsaccord with Theorem 1.

Effect of circuit depth. We explore the performance of QCBMs with quantum kernels by varying the employedAnsatz. Specifically, we consider the case of N = 12 and set L1 in Fig. 1(a) as 4, 6, 8, 10. The collected simulationresults are shown in Fig. 5(b). In conduction with Fig. 1(c), QCBM with L1 = 6 attains the best performance overall settings, where the achieved MMD loss is 7× 10−5 and the KL divergence is 0.03. This observation implies thatproperly controlling the expressivity of Ansatz, which effects the term C1 in Theorem 1, contributes to improve thelearning performance of QCBM.

3. More simulation results related to the task of GHZ state approximation

Approximated states. The approximated GHZ states of QGANs with different random seeds discussed in Fig. 3(d)are depicted in Fig. 6. Specifically, the difference between the approximated state and the target GHZ state becomesapparent with the decreased number of examples n and the increased number of qubits N . These observations echowith the statement of Theorem 1.

4. More simulation results related to the task of 3D Gaussian approximation

Implementation of the modified style-QGANs. Recall the quantum generator adopted Gθ(z) in the modifiedstyle-QGANs is exhibited in Fig. 3(a). Different from the original proposal applying the re-uploading method, themodified quantum generator first uploads the prior example z using U(z) followed by the Ansatz U(θ). Such amodification facilitates the analysis of the generalization behavior of QGANs as claimed in Theorem 2.

The construction details of U(z) and U(θ) are illustrated in Fig. 7. Particularly, the circuit layout of

U(z) and the l-th layer of U(θ) is the same. Mathematically, U(z) = UE(γ4)(⊗3i=1U(γi)), where U(γi)) =

RZ(z3) RY(z2) RZ(z2) RY(z1),∀i ∈ [3] and UE(γ4) = (I2 ⊗ CRY(z2))(CRY(z1) ⊗ I2) refers to the entangle-

ment layer. Similarly, for the l-th layer of U(θ), its mathematical expression is Ul(θ) = UE(γ4)(⊗3i=1U(γi)),

where U(γi)) = RZ(θ3) RY(θ2) RZ(θ2) RY(θ1),∀i ∈ [3]. When l is odd, the entanglement layer takes the formUE(γ4) = (I2 ⊗ CRY(θ2))(CRY(θ1)⊗ I2); otherwise, its implementation is shown in the lower right panel of Fig. 7.

The optimization of the modified style-QGAN follows an iterative manner. At each iteration, a classical optimizerleverages the batch gradient descent method to update the trainable parameters θ minimizing MMD loss. After T

24

𝑁 = 4

Seed 1 Seed 2 Seed 3 Seed 4 Seed 5

𝑁 = 6

𝑁 = 8

𝑁 = 10

FIG. 6: The approximated GHZ state with the varied number of qubits. The label follows the same meaningexplained in Fig. 3.

U(z) =

U(1)

UE(l4)U(2)

U(3)

UE() = RY()

RY()

or

RY()

RY()

UE(l) = RY(l1)

RY(l2)

or

RY(l2)

RY(l1)

U(l) =

U(l1)

UE(l4)U(l

2)

U(l3)

U(l) =

U(l1)

UE(l4)U(l

2)

U(l3)

FIG. 7: The implementation detail of the modified style-QGANs. The circuit architecture of U(z) and the l-th layer

of U(θ) is identical, which is depicted in the upper right panel. The lower panel plots the construction of the entanglement layerUE(γl).

iterations, the optimized parameters are output as the estimation of the optimal results. The Pseudo code of the

25

FIG. 8: The simulation results of QGANs with the varied number of quantum gates. The left panel shows theMMD loss of QGANs during 120 iterations. The label ‘L = a’ refers to set the block number as L1 = a + 1. The right panelevaluates the generalization property of trained QGANs by calculating the expected MMD loss. The x-axis refers to L.

modified style-QGAN is summarized in Alg. 1.

Algorithm 1: The modified style-QGAN

Data: Training set yimi=1, number of examples n, learning rate η, iterations T , MMD loss;Result: Output the optimized parameters.

1 Randomly divide y(i)mi=1 into mmini mini batches with batch size b;2 Initialize parameters θ;3 while T > 0 do4 Regenerate noise inputs z(i)ni=1 every r iterations;5 for j ← 1,mmini do6 Generate x(i)ni=1 with x(i) = Gθ(z(i));

7 Compute the b’th minibatch’s gradient ∇MMD2(Pnθ ||Qb) ;

8 θ ← θ−η∇MMD2(Pnθ ||Qb) ;

9 end10 T ← T − 1;

11 end

Code availability. The source code for conducting all numerical simulations will be available in Github repositoryhttps://github.com/yuxuan-du/QGLM-Theory.

Simulation results. Here we examine how the number of examples m and the number of trainable gates Ng effectthe generalization of QGANs. The experimental setup is identical to those introduced in the main text. To attain avaried number of Ng, the circuit depth of Ansatz in Fig. 3(a) is set as L = 2, 4, 6, 8. Other hyper-parameters are fixedwith T = 800, n = m = 5000, batch size b = 64. We repeat each setting with 5 times to collect the simulation results.

Effect of the number of examples. Let us first focus on the setting L = 2 and m = 5000. In conjunction withthe simulation results in the main text (i.e., L = 2 and m = 2, 10, 200), the simulation results in Fig. 8 indicate thatan increased number of n and m contribute to a better generalization property. Specifically, at t = 120, the averagedempirical MMD loss (MMDU ) of QGANs is 0.0086, which is comparable with other settings discussed in the maintext. The averaged expected MMD loss is 0.0045, which is similar to the setting with m = 200. In other words, whenthe number of examples m and n exceeds a certain threshold, the generalization error of QGANs is dominated byother factors instead of m and n.

Effect of the number of trainable gates. We next study how the number of trainable gates effects thegeneralization error of QGANs. Following the structure of the employed Ansatz shown in Fig. 3(a), varying the numberof trainable gates amounts to varying the number of blocks L1. The results of QGANs with varied L1 are illustratedin Fig. 8. For all setting of QGANs, their empirical MMD loss fast converges after 40 iterations. Nevertheless, theirexpected MMD loss is distinct, where a larger L1 (or equivalently, a larger number of trainable gates Ng) implies ahigher expected MMD loss and leads to a worse generalization. These observations accord with the result of Theorem2 in the sense that an Ansatz with the overwhelming expressivity may incur a poor generalization ability of QGANs.

SM I: Implications of Theorem 2 from the perspective of potential advantages

Here we explore how the results of Theorem 2 contribute to see potential advantages of QGANs. To do so, we firsttheoretically formalize the task of parameterized Hamiltonian learning and prove this task is computationally hard for

26

classical computers. Then we conduct numerical simulations to apply QGANs to tackle parameterized Hamiltonianlearning problems.

1. Theoretical analysis

Let us first introduce the parameterized Hamiltonian learning problem. Define an N -qubit parameterized Hamiltonianas H(a), where a refers to the interaction parameter sampled from a prior continuous distribution D. For instance,the parameter a can be specified to the strength of the transverse magnetic field and D can be set as the uniformdistribution. Define |φ(a)〉 as the ground state of H(a). The aim of the parameterized Hamiltonian learning is usingm training samples a(i), |φ(a(i))〉mi=1 to approximate the distribution of the ground states for H(a) with a ∼ D, i.e.,|φ(a)〉 ∼ Q. If a trained QGLM can well approximate Q, then it can prepare the ground state of H(a′) for an unseenparamter a′ ∼ D. This property may contribute to investigating many crucial behaviors in condensed-matter systems.

We note that using QGLM instead of GLM to approximate Q allows certain computational advantages, warrantedby the following lemma.

Lemma 11. Suppose that Q refers to the distribution of the ground states for parameterized Hamiltonians H(a) witha ∼ D. Under the quantum threshold assumption, there exists a distribution Q that can be efficiently represented byQGLMs but is computationally hard for GLMs.

The results of Lemma 11 indicate the superiority of QGLMs over GLMs, i.e., Q ∈ PQΘ and Q /∈ PCΘ. In conjunction thissalient observation with the definition of generalization, we can conclude that when the number of parameters scales with

O(poly(N)), there may exist a certain kernel leading to infθ∈Θ MMD2(PQθ ||Q)→ 0 and infθ∈Θ MMD2(PCθ ||Q) > 0.This separation is crucial in quantum machine learning. In particular, although both the generalization error RC andRQ are continuously decreased by increasing n and m, the separated expressive power between GLMs and QGMLsmeans that the MMD loss for GLMs can not converge to zero and fail to exactly learn the target distribution Q.

2. Proof of Lemma 11

The proof of Lemma 11 is established on the results of quantum random circuits, which is widely believed tobe classically computationally hard and in turn can be used to demonstrate quantum advantages on NISQ devices[113, 114]. The construction of a random quantum circuit is as follows. Denote H(N, s) as the distribution over the

quantum circuit C under 2D-lattice (√N ×

√N) structure, where C is composed of s two-qubit gates, each of them is

drawn from the 2-qubit Haar random distribution, and s is required to be greater than the number of qubits N . Forsimplicity, here we choose s = 2N2 to guarantee the hardness of simulating the distribution H(N, s). The operatingrule for the i-th quantum gate satisfies:

• If i ≤ N , the first qubit of the i-th gate is specified to be the i-th qubit and the second is selected randomly fromits neighbors;

• If i > N , the first qubit is randomly selected from 1, 2, ..., N and randomly select the second qubit from itsneighbors.

Following the same routine, Ref. [115] proposed a Heavy Output Generation (HOG) problem detailed below to separatethe power between classical and quantum computers on the distribution of the output quantum state after performinga circuit C sampled from H(N, s) on the initial state |0N 〉.

Definition 4 (HOG, [115]). Given a random quantum circuit C ∼ H(N, s) for s ≥ N2, generate k binary stringsz1, z2, · · · , zk in 0, 1N such that at least 2/3 fraction of zi’s are greater than the median of all probabilities of thecircuit outputs C |0N 〉.

Concisely, under the quantum threshold assumption, there do not exist classical samples that can spoof the ksamples output by a random quantum circuit with success probability at least 0.99 [115]. In addition, they prove thatquantum computers can solve the HOG problem with high success probability. For completeness, we introduce thequantum threshold assumption as follows.

Assumption 1 (quantum threshold assumption, [115]). There is no polynomial-time classical algorithm that takes arandom quantum circuit C as input with s ≥ N2 and decides whether 0n is greater than the median of all probabilitiesof C with success probability 1/2 + Ω

(2−N

).

27

FIG. 9: Simulation results of QGANs in parameterized Hamiltonian learning. The left panel shows the fidelity ofthe approximated ground states and the exact ground states of Hamiltonian H(a) with varied a. The right panel exhibits theestimated and exact ground energies of a class of Hamiltonians H(a).

We are now ready to prove Lemma 11.

Proof of Lemma 11. The core idea of the proof is to show that there exists a ground state |φ(θ)〉 of a HamiltonianH(a), which can be efficiently prepared by quantum computers but is computationally hard for classical algorithms.To achieve this goal, we connect |φ(θ)〉 with the output state of random quantum circuits.

With the quantum threshold assumption, Aarsonson and Chen [115] proved that there exists a quantum state |Φ〉generated from a random circuit C sampling from H

(N, 2N2

)such that HOG problem on this instance is classically

hard, with success probability at least 0.99. Conversely, |Φ〉 can be efficiently prepared by the parameterized quantum

circuit U(θ∗) whose topology is identical to C. Moreover, due to the fact that quantum adiabatic algorithms with2-local Hamiltonians can implement universal quantum computational tasks [116], the quantum state |Φ〉 ≡ |φ(a∗)〉 =

U(θ∗) |0〉⊗N must correspond to the ground state of a certain Hamiltonian H(a∗).We now leverage the above result to design a parameterized Hamiltonian learning task that separates the power of

classical and quantum machines. In particular, we restrict the target distribution Q as the delta distribution, where

the probability of sampling |Φ〉 ≡ |φ(a∗)〉 = U(θ∗) |0〉⊗N equals to one and the probability of sampling other groundstates of H(a)′ with a′ 6= a∗ and a′ ∼ D is zero. In this way, the Hamiltonian learning task is reduced to using QGLMor GLM to prepare the quantum state |Φ〉. This task can be efficiently achieved by QGML but is computationallyhard for GLMs.

3. Numerical simulations

We apply QGANs introduced in the main text to study the parameterized Hamiltonian learning problem formulatedin Section I 1. In particular, the parameterized Hamiltonian is specified as the XXZ spin chain, i.e.,

H(a) =

N∑i=1

(XiXi+1 + YiYi+1 + aZiZi+1) + η

N∑i=1

Zi. (I1)

In all numerical simulations, we set N = 2 and η = 0.25. The distribution D for the parameter a is uniform ranging from−0.2 to 0.2. In the preprocessing stage, to collect m referenced samples a(j), |φ(a(j))〉mj=1, we first uniformly sample

a(j)mj=1 points from D and calculate the corresponding eigenstates of H(a(j))mj=1 via the exact diagonalization.The setup of QGAN is as follows. The prior distribution PZ is set as D. The encoding unitary is U(a) =

CNOT(RZ(a)RY(a))⊗ (RZ(a)RY(a)). The hardware-efficient Ansatz is used to implement U(θ) =∏Ll=1 Ul(θ) with

Ul(θ) = CNOT(RZ(θl,1)RY(θl,2))⊗ (RZ(θl,3)RY(θl,4)). The number of blocks is L = 4. The number of training andreferenced examples is set as n = m = 9. The Adagrad optimizer is used to update θ. The total number of iterationsis set as T = 80. We employ 5 different random seeds to collect the statistical results. To evaluate the performance ofthe trained QGAN, we apply it to generate in total 41 ground states of H(a) with a ∈ [−0.2, 0.2] and compute thefidelity with the exact ground states.

The simulation results are shown in Fig. 9. Specifically, for all settings of a, the fidelity between the approximatedground state output by QGANs and the exact ground state is above 0.97. Moreover, when a ∈ [−0.06, 0.06], the

28

fidelity is near to 1. The right panel depicts the estimated ground energy using the output of QGANs, where themaximum estimation error is 0.125 when a = 0.2. These observations verify the ability of QGANs in estimating theground states of parameterized Hamiltonians.