
CSE 525: Randomized algorithms & probabilistic analysis
Lecture notes

Spring 2019

James R. Lee
Paul G. Allen School of Computer Science & Engineering

University of Washington

Contents

1 First moments
  1.1 The probabilistic method
  1.2 Linearity of expectation
  1.3 The method of conditional expectation
  1.4 Markov’s inequality
  1.5 Crossing number inequalities

2 Second moments
  2.1 Threshold phenomena in random graphs
  2.2 Chebyshev’s inequality and second moments
  2.3 Unbiased estimators
  2.4 Percolation on a tree
  2.5 Using unbiased estimators to count

3 Chernoff bounds
  3.1 Randomized rounding
  3.2 Some more applications
    3.2.1 Balls in bins
    3.2.2 Randomized Quicksort
  3.3 Negative correlation

4 Martingales
  4.1 Doob martingales
  4.2 The Hoeffding-Azuma inequality
  4.3 Proof
  4.4 Additional applications
    4.4.1 Concentration in product spaces
    4.4.2 Tighter concentration of the chromatic number

5 Memoryless random variables and low-diameter partitions
  5.1 Random tree embeddings
  5.2 Random low-diameter partitions
  5.3 Memoryless random variables
  5.4 The partitioning algorithm

6 Low-distortion embeddings
  6.1 Distances to subsets
    6.1.1 Fréchet’s embedding
    6.1.2 Bourgain’s embedding

7 The curse of dimensionality and dimension reduction
  7.1 The Johnson-Lindenstrauss lemma

8 Compressive sensing and the RIP
  8.1 The restricted isometry property
  8.2 Random construction of RIP matrices

9 Concentration for sums of random matrices
  9.1 Symmetric matrices
  9.2 The method of exponential moments for matrices
  9.3 Large-deviation bounds

10 Spectral sparsification
  10.1 Laplacians of graphs
    10.1.1 Spectral sparsification
  10.2 Random sampling
    10.2.1 Effective resistances

11 Random walks and electrical networks
  11.1 Hitting times and cover times
  11.2 Random walks and electrical networks
  11.3 Cover times
  11.4 Matthews’ bound

12 Markov chains and mixing times
  12.1 The Fundamental Theorem
  12.2 Eigenvalues and mixing
  12.3 Mixing times
  12.4 Some Markov chains

13 Eigenvalues, expansion, and rapid mixing
  13.1 Conductance
  13.2 Multi-commodity flows
  13.3 The Gibbs sampler

References


1 First moments

1.1 The probabilistic method

An old math puzzle goes: Suppose there are six people in a room; some of them shake hands. Prove that there are at least three people who all shook each others' hands or three people such that no pair of them shook hands.

Generalized a bit, this is the classic Ramsey problem. The diagonal Ramsey numbers $R(k)$ are defined as follows: $R(k)$ is the smallest integer $n$ such that in every two-coloring of the edges of the complete graph $K_n$ by red and blue, there is a monochromatic copy of $K_k$, i.e. there are $k$ nodes such that all of the $\binom{k}{2}$ edges between them are red or all of the edges are blue. A solution to the puzzle above asserts that $R(3) \leq 6$ (and it is easy to check that, in fact, $R(3) = 6$).

In 1929, Ramsey proved that $R(k)$ is finite for every $k$. We want to show that $R(k)$ must grow pretty fast; in fact, we'll prove that for $k \geq 3$, we have $R(k) > \lfloor 2^{k/2} \rfloor$. This requires finding a coloring of $K_n$ that doesn't contain any monochromatic $K_k$. To do this, we'll use the probabilistic method: we'll give a random coloring of $K_n$ and show that it satisfies our desired property with positive probability. This proof appeared in a paper of Erdős from 1947, and this is the example that starts Alon and Spencer's famous book devoted to the probabilistic method.

Lemma 1.1. If $\binom{n}{k} 2^{1 - \binom{k}{2}} < 1$, then $R(k) > n$. In particular, $R(k) > \lfloor 2^{k/2} \rfloor$ for $k \geq 3$.

Proof. Consider a uniformly random 2-coloring of the edges of $K_n$: every edge is colored red or blue independently with probability half each. For any fixed set of $k$ vertices $H$, let $E_H$ denote the event that the induced subgraph on $H$ is monochromatic. An easy calculation yields
$$P(E_H) = 2 \cdot 2^{-\binom{k}{2}}.$$
Since there are $\binom{n}{k}$ possible choices for $H$, we can use the union bound:
$$P(\exists H \text{ such that } E_H \text{ occurs}) \leq 2 \cdot 2^{-\binom{k}{2}} \cdot \binom{n}{k}.$$
Thus if $2^{1-\binom{k}{2}} \binom{n}{k} < 1$, then with positive probability, no event $E_H$ occurs. Thus there must exist at least one coloring with no monochromatic $K_k$. One can check that if $k \geq 3$ and $n = \lfloor 2^{k/2} \rfloor$, then this condition is satisfied.

We have employed the following basic tool.

Tool 1.2 (Union bound). If $A_1, A_2, \ldots, A_m$ are arbitrary events, then
$$P(A_1 \cup A_2 \cup \cdots \cup A_m) \leq P(A_1) + P(A_2) + \cdots + P(A_m).$$

1.2 Linearity of expectation

Let's look at a couple more examples of the probabilistic method in action. We'll use a basic fact in probability: linearity of expectation.

Tool 1.3 (Linearity of expectation). If $X_1, X_2, \ldots, X_n$ are discrete real-valued random variables, then
$$E[X_1 + X_2 + \cdots + X_n] = E[X_1] + E[X_2] + \cdots + E[X_n].$$
The great fact about this identity is that we don't need to know anything about the relationships between the random variables; linearity of expectation holds no matter what the dependence structure is.

MAX-3SAT. Let's consider a 3-CNF formula over the variables $x_1, x_2, \ldots, x_n$. Such a formula has the form $\varphi = C_1 \wedge C_2 \wedge \cdots \wedge C_m$, where each clause is an OR of three literals involving distinct variables: $C_i = z_{i_1} \vee z_{i_2} \vee z_{i_3}$. A literal is a variable or its negation. For instance, $(x_2 \vee x_3 \vee x_4) \wedge (x_3 \vee x_5 \vee x_1) \wedge (x_1 \vee x_5 \vee x_4)$ is a 3-CNF formula.

Claim 1.4. If $\varphi$ is a 3-CNF formula with $m$ clauses, then there exists an assignment that makes at least $\frac{7}{8} m$ clauses evaluate to true.

Proof. We will prove this using the probabilistic method. For every variable independently, we choose a uniformly random truth assignment: true or false each with probability 1/2. Let $A_i$ equal 1 if clause $C_i$ is satisfied by our random assignment, and equal 0 otherwise. Then $P(A_i = 1) = 7/8$ because there are 7 ways to satisfy a clause out of the 8 possible truth values for its literals.

Let $A = A_1 + \cdots + A_m$ denote the total number of satisfied clauses. By linearity of expectation, we have
$$E[A] = \sum_{i=1}^m E[A_i] = \frac{7}{8}\, m. \qquad (1.1)$$
Since a random assignment satisfies $\frac{7}{8} m$ clauses in expectation, there must exist at least one assignment that satisfies this many clauses.
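As a quick illustration (not from the notes), here is a minimal Python sketch that draws uniformly random assignments and counts satisfied clauses; averaged over many trials, the count concentrates near $\frac{7}{8} m$. The clause encoding (a clause is a tuple of signed integer literals) is my own convention.

```python
import random

def random_assignment_count(clauses, n, trials=10000):
    """Estimate E[# satisfied clauses] under a uniformly random assignment.

    A clause is a tuple of nonzero ints: +i means x_i, -i means NOT x_i.
    """
    total = 0
    for _ in range(trials):
        assign = [random.random() < 0.5 for _ in range(n + 1)]  # index 0 unused
        total += sum(
            any(assign[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        )
    return total / trials

# Example: three clauses on five variables; the average is close to (7/8) * 3.
clauses = [(2, -3, 4), (-3, 5, 1), (1, -5, 4)]
print(random_assignment_count(clauses, n=5))
```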

MAX-CUT. Consider an undirected graph $G = (V, E)$. A cut is a subset $S \subseteq V$, and we use $E(S, \bar S)$ to denote the set of edges crossing the cut $S$. This is the set of edges with one endpoint in $S$ and one not in $S$.

Claim 1.5. In any graph $G = (V, E)$, there exists a cut $S \subseteq V$ that cuts at least half the edges, i.e., $|E(S, \bar S)| \geq \frac{|E|}{2}$.

Proof. We construct a random set $S \subseteq V$ by including every vertex in $S$ independently with probability 1/2. For an edge $e \in E$, let $A_e = 1$ if $e$ crosses the cut $S$, and 0 otherwise. First, it should be apparent that $P(A_e = 1) = 1/2$. Therefore by linearity of expectation,
$$E\left[|E(S, \bar S)|\right] = \sum_{e \in E} E[A_e] = \frac{|E|}{2}.$$
Thus there must exist at least one cut $S$ that has at least half the edges crossing it.
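A minimal sketch of this random cut (my own toy graph, not from the notes): each vertex joins $S$ independently with probability 1/2, and on average at least half the edges are cut.

```python
import random

def random_cut_size(edges):
    """Sample S uniformly at random and count edges crossing (S, complement of S)."""
    vertices = {v for e in edges for v in e}
    side = {v: random.random() < 0.5 for v in vertices}
    return sum(side[u] != side[v] for u, v in edges)

# Example: a 4-cycle has 4 edges; the random cut crosses 2 of them on average.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(sum(random_cut_size(edges) for _ in range(10000)) / 10000)
```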

1.3 The method of conditional expectation

Claim 1.4 asserts that there exists an assignment satisfying at least $\frac{7}{8} m$ clauses, but what if we wish to actually find one? One way is to randomly sample from the underlying distribution and then check the resulting assignment. Analyzing the probability of success will require our first tail bound; we'll get there in the next section.

Let's examine another way that actually results in a deterministic algorithm. Let $S(x_1, x_2, \ldots, x_n)$ denote the expected number of satisfied clauses given a partial truth assignment to the input variables, where we choose the unassigned variables uniformly at random. We will use T to denote true, F to denote false, and ? to denote that no assignment has been chosen for that variable.

For instance, $S(?, ?, \ldots, ?)$ denotes the expected number of satisfied clauses in a random assignment, and we have already seen (cf. (1.1)) that
$$S(?, ?, \ldots, ?) = \frac{7}{8}\, m.$$

Note that a simple linear-time algorithm can compute $S(x_1, x_2, \ldots, x_n)$ for any partial assignment $x_1, \ldots, x_n \in \{\mathrm{T}, \mathrm{F}, ?\}$ by simply going through the clauses one by one.

As an example, consider the clause $x_1 \vee x_2 \vee x_4$. The probability that a random assignment satisfies this is 7/8. If we assign $x_1 = \mathrm{F}$, then the probability becomes 3/4, and if we set $x_1 = \mathrm{T}$, then the probability becomes 1.

Observe that
$$S(?, ?, \ldots, ?) = \frac{1}{2} S(\mathrm{F}, ?, \ldots, ?) + \frac{1}{2} S(\mathrm{T}, ?, \ldots, ?).$$
Since $S(?, ?, \ldots, ?) \geq \frac{7}{8} m$, it must hold that $S(\mathrm{F}, ?, \ldots, ?) \geq \frac{7}{8} m$ or $S(\mathrm{T}, ?, \ldots, ?) \geq \frac{7}{8} m$. As we have just argued, it's possible to compute both these quantities and figure out which is larger. We can then set $x_1$ to the corresponding value and keep assigning truth values recursively. Eventually, this process ends at a full assignment to the variables that satisfies at least $\frac{7}{8} m$ clauses. The key property we employed here is the ability to efficiently compute the conditional expectation of the underlying random variable under a partial assignment.
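Here is a minimal Python sketch of this derandomization (my own illustration, reusing the signed-literal clause convention from the earlier sketch; helper names are hypothetical): it fixes one variable at a time by comparing the two conditional expectations.

```python
from fractions import Fraction

def expected_satisfied(clauses, assign):
    """S(x_1,...,x_n): expected # of satisfied clauses when variables missing
    from `assign` are filled in uniformly at random."""
    total = Fraction(0)
    for clause in clauses:
        unset, satisfied = 0, False
        for lit in clause:
            val = assign.get(abs(lit))
            if val is None:
                unset += 1
            elif val == (lit > 0):
                satisfied = True
        if satisfied:
            total += 1
        else:
            total += 1 - Fraction(1, 2 ** unset)  # prob. an unset literal saves it
    return total

def conditional_expectation_max3sat(clauses, n):
    """Deterministically build an assignment satisfying at least 7m/8 clauses."""
    assign = {}
    for v in range(1, n + 1):
        best = max((True, False),
                   key=lambda b: expected_satisfied(clauses, {**assign, v: b}))
        assign[v] = best
    return assign

clauses = [(2, -3, 4), (-3, 5, 1), (1, -5, 4)]
a = conditional_expectation_max3sat(clauses, 5)
print(a, expected_satisfied(clauses, a))  # final value = exact number satisfied
```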

1.4 Markov’s inequality

The probabilistic method shows the existence of an object, but it doesn't necessarily give us a randomized algorithm to construct it. If we just know that the probability of an event is non-zero, it could still be very tiny; we might need to do an arbitrarily large number of random experiments before we get a positive outcome. Sometimes we can say more.

Tool 1.6 (Markov's inequality). Let $X$ be a non-negative random variable. Then for any $\alpha > 0$, we have
$$P[X \geq \alpha] \leq \frac{E[X]}{\alpha}.$$

The proof of this lemma is easy; we leave it as an exercise.

Consider now our MAX-3SAT example above. Let $X$ denote the number of unsatisfied clauses in a random truth assignment. We know from the preceding analysis that $E[X] \leq \frac{1}{8} m$. Markov's inequality tells us that for any $\varepsilon > 0$,
$$P\left[X \geq \left(\frac{1}{8} + \varepsilon\right) m\right] \leq \frac{m/8}{(1/8 + \varepsilon)m} = \frac{1}{1 + 8\varepsilon} \leq 1 - \varepsilon.$$
The last inequality is only true if we assume $\varepsilon \leq 7/8$, but for any value $\varepsilon > 7/8$, the probability is clearly zero.

This means that, with probability at least $\varepsilon$, we will get an assignment that satisfies at least a $(7/8 - \varepsilon)$-fraction of clauses. So in expectation, after $1/\varepsilon$ samples, we will get an assignment that is very close to the one guaranteed to exist. The same kind of reasoning applies to our MAX-CUT analysis.

1.5 Crossing number inequalities

Let's look at one more application of linearity of expectation. It is almost as elementary as the examples above, but has some powerful consequences in incidence geometry and sum-product estimates.

If $G = (V, E)$ is an undirected graph, we use the notation $\mathrm{cr}(G)$ to denote the crossing number of $G$. This is the minimum number of edge crossings required to draw $G$ in the plane. A drawing of the graph means that the vertices are mapped to distinct points, and each edge is drawn as a closed, continuous curve of bounded length. The following result is due independently to Leighton and Ajtai-Chvátal-Newborn-Szemerédi.

Theorem 1.7. If $G$ is a graph with $n$ vertices and $m$ edges, and $m \geq 4n$, then
$$\mathrm{cr}(G) \geq \frac{m^3}{64 n^2}.$$

Note that for dense graphs, i.e. those with $m = \Omega(n^2)$, we get $\Omega(n^4)$ crossings (the most possible up to a constant factor). We start with a basic fact: Euler's formula implies that, in every planar graph (a planar graph $G$ is one for which $\mathrm{cr}(G) = 0$), we have $m \leq 3n - 6$.

Thus if $m > 3n$, we must have $\mathrm{cr}(G) \geq 1$. Since we can always remove one crossing from a drawing by removing one edge from the underlying graph, this gives us
$$\mathrm{cr}(G) \geq m - 3n. \qquad (1.2)$$

This is still pretty weak. But now we will use random sampling to do seriously heavy amplification.

Proof of Theorem 1.7. Suppose we have a drawing of $G$ in the plane. We will make some assumptions about this drawing (which are without loss of generality). We may assume that every edge crossing involves four distinct vertices. If an edge crosses itself, that can be fixed by short-circuiting the loops. If two edges emanating from the same vertex cross each other, they can be uncrossed without affecting the rest of the drawing (draw a picture to convince yourself). So we may assume that the only crossings are between edges $\{x, y\}$ and $\{u, v\}$ where $x, y, u, v$ are all distinct vertices.

Now we will construct a (random) graph $G_p$ by keeping every vertex of $G$ independently with probability $p$. The value of $p$ will be chosen soon. Let $n_p$ and $m_p$ denote the number of vertices and edges remaining in $G_p$, and let $c_p$ denote the number of crossings remaining in our drawing (after the edges and vertices not remaining in $G_p$ are removed).

Every vertex remains with probability $p$. By independence, an edge remains with probability $p^2$. Finally, a crossing remains with probability $p^4$, since we said that every crossing has to involve four distinct vertices, and in order for a crossing to remain, all of those four vertices must be in $G_p$. Thus linearity of expectation gives us:
$$E[n_p] = pn \qquad (1.3)$$
$$E[m_p] = p^2 m \qquad (1.4)$$
$$E[c_p] = p^4\, \mathrm{cr}(G). \qquad (1.5)$$
But from (1.2), we know that $c_p \geq m_p - 3n_p$, and thus $E[c_p] \geq E[m_p] - 3 E[n_p]$. Plugging in our values above yields
$$p^4\, \mathrm{cr}(G) \geq p^2 m - 3pn,$$
or equivalently
$$\mathrm{cr}(G) \geq \frac{m}{p^2} - \frac{3n}{p^3}.$$
Finally, we set $p = \frac{4n}{m}$ (note $p \leq 1$ since we have assumed $m \geq 4n$). This yields
$$\mathrm{cr}(G) \geq \frac{m^3}{16 n^2} - \frac{3 m^3}{64 n^2} = \frac{m^3}{64 n^2},$$

completing our proof.


2 Second moments

2.1 Threshold phenomena in random graphs

Consider a positive integer $n$ and a value $p \in [0, 1]$. Perhaps the simplest model of random (undirected) graphs is $G_{n,p}$. To sample a graph from $G_{n,p}$, we add every edge $\{i, j\}$ (for $i \neq j$ and $i, j \in \{1, \ldots, n\}$) independently with probability $p$. For example, if $X$ denotes the number of edges in a $G_{n,p}$ graph, then $E[X] = p \binom{n}{2}$.

A 4-clique in a graph is a set of four nodes such that all $\binom{4}{2} = 6$ possible edges between the nodes are present. Let $G$ be a random graph sampled according to $G_{n,p}$, and let $C_4$ denote the event that $G$ contains a 4-clique. It will turn out that if $p \gg n^{-2/3}$, then $G$ contains a 4-clique with probability close to 1, while if $p \ll n^{-2/3}$, then $P[C_4]$ will be close to 0. Thus $p = n^{-2/3}$ is a "threshold" for the appearance of a 4-clique.

Remark 2.1. Here we use the asymptotic notation $f(n) \gg g(n)$ to denote that $\lim_{n \to \infty} f(n)/g(n) = \infty$. Similarly, $f(n) \ll g(n)$ means that $\lim_{n \to \infty} f(n)/g(n) = 0$.

We can use a simple first moment calculation for one side of our desired threshold behavior.

Lemma 2.2. If $p \ll n^{-2/3}$, then $P[C_4] \to 0$ as $n \to \infty$.

Proof. Let $X$ denote the number of 4-cliques in $G \sim G_{n,p}$. We can write $X = \sum_S X_S$ where the set $S$ runs over all $\binom{n}{4}$ subsets of four vertices in $G$, and $X_S = 1$ if there is a 4-clique on $S$, and $X_S = 0$ otherwise. We have $P[X_S = 1] = p^6$ since all 6 edges must be present, thus by linearity of expectation $E[X] = p^6 \binom{n}{4}$. So if $p \ll n^{-2/3}$, then $E[X] \to 0$ as $n \to \infty$. But now Markov's inequality implies that
$$P[C_4] = P[X \geq 1] \leq E[X] \to 0.$$

On the other hand, proving that $p \gg n^{-2/3} \Rightarrow P[C_4] \to 1$ is more delicate. Even though a first moment calculation implies that, in this case, $E[X] \to \infty$, this is not enough to conclude that $P[C_4] \to 1$. For instance, it could be the case that with probability $1 - \frac{1}{n^2}$ we have no 4-cliques, but with probability $\frac{1}{n^2}$ we see all $\binom{n}{4}$ possible 4-cliques. In that case, $E[X] = \Theta(n^2)$, but still the probability of seeing a 4-clique would be only $\frac{1}{n^2}$.

We need to exploit the fact that the appearances of distinct 4-cliques are mostly independent events. Certainly if $S$ and $S'$ are two disjoint sets of vertices, then the corresponding random variables $X_S$ and $X_{S'}$ are independent.

2.2 Chebyshev’s inequality and second moments

First, we recall the notion of the variance of a real-valued random variable $X$: $\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$. If we know something about the variance, we can improve upon Markov's inequality.

Tool 2.3 (Chebyshev inequality). If $X$ is a real-valued random variable with $\mu = E[X]$, then for every $\alpha > 0$,
$$P\left[|X - \mu| \geq \alpha\right] \leq \frac{\mathrm{Var}(X)}{\alpha^2}.$$

Proof. Simply apply Markov's inequality to the nonnegative random variable $(X - \mu)^2$.

One consequence of the Chebyshev inequality is the following simple fact.


Corollary 2.4. If $X$ is a real-valued random variable, then
$$P(X = 0) \leq \frac{\mathrm{Var}(X)}{(E[X])^2}.$$

Proof. Using the Chebyshev inequality, we have:
$$P(X = 0) \leq P\left(|X - E[X]| \geq E[X]\right) \leq \frac{\mathrm{Var}(X)}{(E[X])^2}.$$

Now we are in position to analyze the other side of the threshold. It will help to have the following definition: for two random variables $X$ and $Y$, we define their covariance by $\mathrm{Cov}(X, Y) = E[XY] - E[X] E[Y]$. Note that if $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$.

Lemma 2.5. If $X = X_1 + \cdots + X_n$, then
$$\mathrm{Var}(X) = \sum_{i=1}^n \mathrm{Var}(X_i) + \sum_{i=1}^n \sum_{j \neq i} \mathrm{Cov}(X_i, X_j).$$

Proof. Observe first that for any random variable $X$, we have $\mathrm{Var}(X) = E[X^2] - (E[X])^2$, hence
$$\mathrm{Var}(X) = E\left[\left(\sum_{i=1}^n X_i\right)^2\right] - \left(E\left[\sum_{i=1}^n X_i\right]\right)^2.$$
Expanding the squares and collecting terms yields the desired result.

Lemma 2.6. If $p \gg n^{-2/3}$, then $P[C_4] \to 1$ as $n \to \infty$.

Proof. As before, let $X = \sum_S X_S$ denote the number of 4-cliques in $G \sim G_{n,p}$. Our goal is to show that $P[X = 0] \to 0$ as $n \to \infty$. Using Corollary 2.4, it suffices to show that $\mathrm{Var}(X) \ll (E[X])^2$. The main task will be evaluating $\mathrm{Var}(X)$.

Write
$$\mathrm{Var}(X) = \sum_S \mathrm{Var}(X_S) + \sum_S \sum_{T \neq S} \mathrm{Cov}(X_S, X_T).$$
First, observe that since $X_S$ is a $\{0,1\}$ random variable, we have $\mathrm{Var}(X_S) \leq E[X_S^2] = E[X_S]$, yielding
$$\mathrm{Var}(X) \leq E[X] + \sum_S \sum_{T \neq S} \mathrm{Cov}(X_S, X_T). \qquad (2.1)$$

Now we evaluate the second sum. The value $\mathrm{Cov}(X_S, X_T)$ depends on $|S \cap T|$. In particular, if $|S \cap T| \leq 1$, then $\mathrm{Cov}(X_S, X_T) = 0$, since $S$ and $T$ share no possible edges, hence the random variables $X_S$ and $X_T$ are independent.

Next, note that $\mathrm{Cov}(X_S, X_T) \leq E[X_S X_T]$, and the latter is the probability that we have a 4-clique on both $S$ and $T$. Thus if $|S \cap T| = 2$, then $\mathrm{Cov}(X_S, X_T) \leq p^{11}$ (since for $X_S = X_T = 1$ to happen, we need 11 edges to be present). Similarly, if $|S \cap T| = 3$, then $\mathrm{Cov}(X_S, X_T) \leq p^9$. So now we are simply left to count the possibilities:
$$\sum_S \sum_{T \neq S} \mathrm{Cov}(X_S, X_T) \leq O(n^6)\, p^{11} + O(n^5)\, p^9.$$

(Make sure you understand why this line is true!) From (2.1), we conclude that
$$\mathrm{Var}(X) \leq O(n^4)\, p^6 + O(n^6)\, p^{11} + O(n^5)\, p^9.$$
Also, recall that $E[X] = \binom{n}{4} p^6 = \Theta(n^4 p^6)$, so $(E[X])^2 = \Theta(n^8 p^{12})$. In particular, if $p \gg n^{-2/3}$, then $\mathrm{Var}(X) \ll (E[X])^2$. Combined with Corollary 2.4, this yields the desired result.
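As an illustrative simulation (my own, not from the notes): sample $G(n,p)$ for $p$ somewhat above and below $n^{-2/3}$ and estimate $P[C_4]$ empirically. The brute-force 4-clique check is $O(n^4)$, so $n$ is kept small.

```python
import random
from itertools import combinations

def has_4_clique(n, p):
    """Sample G(n, p) and report whether it contains a 4-clique (brute force)."""
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = random.random() < p
    return any(
        all(adj[a][b] for a, b in combinations(S, 2))
        for S in combinations(range(n), 4)
    )

n, trials = 40, 100
for c in (0.4, 1.0, 2.5):          # p = c * n^(-2/3)
    p = c * n ** (-2 / 3)
    freq = sum(has_4_clique(n, p) for _ in range(trials)) / trials
    print(f"p = {c} * n^(-2/3): empirical P[C4] = {freq:.2f}")
```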


2.3 Unbiased estimators

Suppose we want to estimate the area of a unit disk in the plane. One way to accomplish this is via a Monte Carlo algorithm: we sample a uniformly random point $x \in [0,1]^2$ in the unit square, and then check whether $x_1^2 + x_2^2 \leq 1$. Let $X$ be the indicator of the event that $x$ is in the unit disk. Then
$$E[X] = \frac{\mathrm{area}(\text{disk} \cap [0,1]^2)}{\mathrm{area}([0,1]^2)} = \frac{\pi}{4}.$$
So the random variable $X$ is an unbiased estimator for $\pi/4$.

There are many situations in which one wants to estimate the probability of some event, but doing so directly is difficult. In those cases, having an unbiased estimator is quite helpful. Crucially, the usefulness of such an estimator depends on its variance. First, let's recall the following simple fact (you should be able to prove it easily using Lemma 2.5).

Fact 2.7. If $X_1, X_2, \ldots, X_n$ are independent, then
$$\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n).$$

We recall that a family of random variables is said to be "i.i.d." if they are "independent and identically distributed," i.e. they are independent samples from the same distribution.

Theorem 2.8. For every $\varepsilon > 0$, the following holds. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. Let $\bar X = \frac{X_1 + X_2 + \cdots + X_n}{n}$ be their empirical mean. If $n \geq \frac{4}{\varepsilon^2} \cdot \frac{\sigma^2}{\mu^2}$, then
$$P\left[|\bar X - \mu| \geq \varepsilon \mu\right] \leq \frac{1}{4}.$$

Proof. By linearity, $E[\bar X] = \mu$, and by independence and Fact 2.7, $\mathrm{Var}(\bar X) = \sigma^2/n$. So Chebyshev's inequality implies that
$$P\left[|\bar X - \mu| \geq \varepsilon \mu\right] \leq \frac{\sigma^2}{n \mu^2 \varepsilon^2} \leq \frac{1}{4},$$
where the final bound uses our assumption on $n$.

Observe that if each $X_i$ is a $\{0,1\}$ random variable, then $\mathrm{Var}(X_i) = \mu - \mu^2$, so $\sigma^2 \leq \mu$. In this case, the required number of samples simplifies to $n \geq \frac{4}{\varepsilon^2} \cdot \frac{1}{\mu}$. This represents the intuitive fact that if we are trying to estimate the probability of a very rare event (so that $\mu$ is very small), then we will need many samples to get a decent estimate.

The median trick. Although we stopped at $\frac{1}{4}$, we can improve the probability of a poor estimate with a small additional expense. Suppose that we do $N$ trials of the above experiment (requiring $N \cdot n$ samples overall). Let $X'$ be the median value of these $N$ trials. Then as long as $N \geq 2 \log_{4/3} \frac{1}{\delta}$, we will have $P[|X' - \mu| > \varepsilon \mu] \leq \delta$.

To see this, let $Y_1, \ldots, Y_N$ be indicators such that $Y_i$ is equal to 1 if the $i$th trial gives an empirical mean in the range $[\mu(1-\varepsilon), \mu(1+\varepsilon)]$. We know from Theorem 2.8 that $P[Y_i = 1] \geq 3/4$, and the variables $Y_i$ are independent. The only way that $X' \notin [\mu(1-\varepsilon), \mu(1+\varepsilon)]$ is if at least half the trials end negatively, i.e. $Y_1 + \cdots + Y_N \leq N/2$.

Claim 2.9. If we perform $2k+1$ coin flips and $P[\text{heads}] \geq 3/4$, then $P[\leq k \text{ flips are heads}] \leq (3/4)^k$.

One can prove this directly by counting and estimating some binomial coefficients. In Lecture 5, we will see how to prove this and related results in a more general way, and that will start our foray into the concentration of measure phenomenon.
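To make this concrete, here is a minimal Python sketch (my own illustration, with sample sizes chosen for the $\pi/4$ example rather than tuned to the theorem): $N$ independent empirical means of the unit-disk indicator, followed by the median.

```python
import random
import statistics

def empirical_mean(n):
    """Mean of n i.i.d. indicators of x_1^2 + x_2^2 <= 1 for uniform x in [0,1]^2."""
    hits = sum(random.random() ** 2 + random.random() ** 2 <= 1 for _ in range(n))
    return hits / n

def median_of_means(n, N):
    """Median of N independent empirical means, each based on n samples."""
    return statistics.median(empirical_mean(n) for _ in range(N))

# mu = pi/4 ~ 0.785; with eps = 0.05, n >= 4/(eps^2 * mu) ~ 2040 samples suffice.
print(median_of_means(n=2100, N=15))
```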


2.4 Percolation on a tree

In the last lecture, we used the fact that for a real-valued random variable $X$, we have
$$P(X = 0) \leq \frac{\mathrm{Var}(X)}{(E[X])^2}.$$
Let us see that second moments can also control the probability that a non-negative random variable is non-zero. (A generalization of this inequality often goes by the name "Paley-Zygmund inequality.")

Lemma 2.10. If $X$ is a non-negative random variable, then
$$P(X > 0) \geq \frac{(E[X])^2}{E[X^2]}.$$

Proof. The proof relies on the Cauchy-Schwarz inequality: for any two real-valued random variables $Y$ and $Z$,
$$E|YZ| \leq \sqrt{E[Y^2]}\, \sqrt{E[Z^2]}.$$
We will use $Y = X$ and $Z = \mathbf{1}_{\{X > 0\}}$, i.e. $Z$ is the indicator of the event that $X > 0$:
$$E[X] = E\left[X \mathbf{1}_{\{X>0\}}\right] \leq \sqrt{E[X^2]}\, \sqrt{E\left[\mathbf{1}_{\{X>0\}}\right]} = \sqrt{E[X^2]}\, \sqrt{P(X > 0)}.$$
Rearranging and squaring both sides completes the proof.

Let's now consider the following model of percolation on the complete (rooted) binary tree of depth $n$. Fix a parameter $p \in [0, 1]$ and also an orientation of the tree (so that we can refer to "left" and "right" children). Suppose that every left edge is independently deleted with probability $1-p$ and every right edge is independently deleted with probability $p$. Let $X = \sum_\ell Z_\ell$ denote the number of leaves which can reach the root of the tree. Here, the sum is over all $2^n$ leaves $\ell$, and $Z_\ell$ is the indicator random variable that is 1 precisely when the path from the root to the leaf $\ell$ remains intact. We are interested in the probability that there exists at least one reachable leaf, i.e. $P[X > 0]$.

The first moment. It is relatively easy to compute the expected value of $X$:
$$E[X] = \sum_{i=0}^n \binom{n}{i} p^i (1-p)^{n-i} = (p + 1 - p)^n = 1.$$
The summation variable $i$ indexes the number of left turns that a root-leaf path makes. If a path makes $i$ left turns, then the probability it remains intact is $p^i (1-p)^{n-i}$. Furthermore, we can specify a root-leaf path by a sequence of "left" and "right" turns; this implies that the number of leaves whose root-leaf path contains exactly $i$ left turns is $\binom{n}{i}$.

This calculation doesn't tell us that there is a reachable leaf with decent probability. Certainly if $p = 0$ or $p = 1$, then there is a reachable leaf with probability one (the right-most and left-most paths, respectively). But what about intermediate values of $p$? What if $p = 1/2$?


The second moment. Let's now compute the second moment, but we'll be a bit more clever. Let $X_n$ denote the number of reachable leaves in a tree of depth $n$. If $X_L$ and $X_R$ denote the number of reachable leaves under the left and right subtrees, then $X_n^2 = (X_L + X_R)^2 = X_L^2 + X_R^2 + 2 X_L X_R$. Observe that $E[X_L^2] = p\, E[X_{n-1}^2]$, $E[X_R^2] = (1-p)\, E[X_{n-1}^2]$, $E[X_L] = p\, E[X_{n-1}] = p$, and $E[X_R] = (1-p)\, E[X_{n-1}] = 1-p$. Since $X_L$ and $X_R$ are independent, we have
$$E[X_n^2] = E[X_{n-1}^2] + 2p(1-p).$$
Since $X_0 = 1$, we get $E[X_n^2] = 1 + 2np(1-p)$. Now applying Lemma 2.10 gives
$$P(X > 0) \geq \frac{1}{1 + 2np(1-p)}.$$
Thus if $p = \Theta(1/n)$, there exists a reachable leaf with constant probability. If $p = 1/2$, we get a reachable leaf with probability at least $2/(n+2)$.

Notice that this still leaves a number of interesting questions to be answered. We know that $P(X > 0) \geq 2/(n+2)$, but $E[X_n^2] = \Theta(n)$. Is it the case that we see $\Theta(\sqrt{n})$ reachable leaves with probability $\Theta(1/n)$?

[See the video of the discrete torus being covered by random walk; the asymptotics of the cover time are determined by a related percolation process on a complete tree.]

2.5 Using unbiased estimators to count

Consider a formula in disjunctive normal form (a DNF formula), e.g.
$$\varphi = (X_1 \wedge X_2 \wedge X_3) \vee (X_2 \wedge X_5) \vee \cdots$$
We can easily determine whether such a formula is satisfiable (we just check whether each term separately is satisfiable). On the other hand, counting the number of satisfying assignments to such a formula is #P-hard. (That means the problem is at least as hard as an NP-complete problem; this class of counting problems is thought to be extremely difficult to solve.)

The reason for this hardness is easy to see. Suppose that $\varphi$ is a CNF formula. Then de Morgan's laws can be used to write $\neg\varphi$ as a DNF formula in polynomial time. But if $\varphi$ is a formula on $n$ variables, then
$$\#\{\text{satisfying assignments to } \varphi\} = 2^n - \#\{\text{satisfying assignments to } \neg\varphi\}.$$

Nevertheless, we will show that one can obtain an efficient algorithm to approximate the number of satisfying assignments.

Theorem 2.11 (Karp and Luby, 1983). Given a DNF formula $\varphi$ with $n$ variables and $m$ terms, and a number $\varepsilon > 0$, there is an algorithm running in time $O(mn/\varepsilon^2)$ that outputs a value $Z$ such that
$$P\left[(1-\varepsilon) Z \leq N(\varphi) \leq (1+\varepsilon) Z\right] \geq \frac{3}{4},$$
where $N(\varphi)$ is the number of satisfying assignments to $\varphi$.


As we saw in the last lecture, one can use the median trick to amplify the probability of correctness to be very close to 1. One should note that a naïve Monte Carlo algorithm will not work so well: if we choose one of the $2^n$ assignments uniformly at random and check whether it satisfies $\varphi$, then the number we are trying to estimate is $\mu = \frac{N(\varphi)}{2^n}$, but this could be exponentially small. The unbiased estimator theorem (Theorem 2.8) would dictate that we use exponentially many samples, making our algorithm quite inefficient.

Proof of Theorem 2.11. The main idea will be to apply the Monte Carlo method to a more suitable space. Let $S_i$ be the set of assignments which satisfy the $i$th term in $\varphi$. Our goal is to compute the size of the union $\bigcup_{i=1}^m S_i$.

Let $U = \{(a, i) : a \in S_i\}$. Say that the pair $(a, i) \in U$ is special if $i$ is the first term that the assignment $a$ satisfies. Since every satisfying assignment has exactly one first term that it satisfies, we have the following.

Claim 2.12. The number of special pairs is precisely $\left|\bigcup_{i=1}^m S_i\right|$.

Now we apply the Monte Carlo algorithm to estimate $|S|/|U|$, where $S$ is the set of special pairs. Note that $|S_i| = 2^{n - q_i}$ where $q_i$ is the number of distinct variables in term $i$. Thus we can easily compute $|U| = \sum_{i=1}^m |S_i|$. In particular, if we obtain a multiplicative approximation to $\mu = |S|/|U|$, then we can obtain a multiplicative approximation to $|S| = N(\varphi)$.

To generate a random sample from $U$, we choose $i$ with probability $|S_i|/|U|$, and then pick a random satisfying assignment for term $i$. Finally, this is extended to a uniformly random assignment on the remaining $n - q_i$ variables.

Thus we are left to estimate $\mu = |S|/|U|$. But it's easy to see that $|U| \leq m|S|$, since each satisfying assignment can satisfy at most $m$ terms. Therefore $\mu \geq \frac{1}{m}$. The unbiased estimator theorem now states that we need at most $\frac{4}{\varepsilon^2 \mu} \leq \frac{4m}{\varepsilon^2}$ samples in order to achieve our goal.

A naïve implementation of the algorithm requires $O(nm^2/\varepsilon^2)$ time, but further improvements [Karp-Luby-Madras 1989] show how to implement the algorithm in $O(mn/\varepsilon^2)$ time.
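Here is a minimal Python sketch of the estimator described above (my own encoding: a term is a tuple of signed integer literals; the sample count is illustrative, not tuned to the theorem's guarantee).

```python
import random

def karp_luby_estimate(terms, n, samples=20000):
    """Estimate N(phi) for a DNF formula phi = OR of terms.

    A term is a tuple of nonzero ints: +i means x_i, -i means NOT x_i.
    """
    sizes = [2 ** (n - len(set(abs(l) for l in t))) for t in terms]  # |S_i|
    U = sum(sizes)
    special = 0
    for _ in range(samples):
        # Choose term i with probability |S_i| / |U|.
        i = random.choices(range(len(terms)), weights=sizes)[0]
        # Uniform element of S_i: forced literals of term i, uniform on the rest.
        assign = {v: random.random() < 0.5 for v in range(1, n + 1)}
        for lit in terms[i]:
            assign[abs(lit)] = lit > 0
        # The pair (assign, i) is special iff i is the first satisfied term.
        first = next(
            j for j, t in enumerate(terms)
            if all(assign[abs(lit)] == (lit > 0) for lit in t)
        )
        special += (first == i)
    return U * special / samples

terms = [(1, 2, 3), (2, -5), (-1, 4)]
print(karp_luby_estimate(terms, n=5))  # approximates the number of satisfying assignments (16 here)
```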

Remark 2.13. The algorithm can be used to approximate the size of the union of any collection of finite sets, as long as we can compute their individual sizes and sample random elements from each of them.

A straightforward generalization of the algorithm allows us to approximate the probability that a random assignment satisfies the formula in a probabilistic DNF model where each variable is true independently with some probability $p_i$.

3 Chernoff bounds

We have seen how knowledge of the variance of a random variable $X$ can be used to control the deviation of $X$ from its mean. This is the heart of the second moment method. But often we can control even higher moments, and this allows us to obtain much stronger concentration properties.

A prototypical example is when $X_1, X_2, \ldots, X_n$ is a family of independent (but not necessarily identically distributed) $\{0,1\}$ random variables and $X = X_1 + X_2 + \cdots + X_n$. Let $p_i = E[X_i]$ and define $\mu = E[X] = p_1 + p_2 + \cdots + p_n$. In that case, we have the following multiplicative form of the "Chernoff bound."


Theorem 3.1 (Chernoff bound, multiplicative error). For every $\beta \geq 1$, it holds that
$$P\left[X \geq \beta\mu\right] \leq \left(\frac{e^{\beta - 1}}{\beta^\beta}\right)^{\mu}, \qquad (3.1)$$
and
$$P\left[X \leq \frac{\mu}{\beta}\right] \leq \left(e^{1/\beta - 1}\, \beta^{1/\beta}\right)^{\mu}. \qquad (3.2)$$

It's easy to use these formulae, but it sometimes helps to employ the slightly weaker bounds: if we put $\beta = 1 + \delta$ for $0 < \delta < 1$,
$$P\left[X \geq (1+\delta)\mu\right] \leq e^{-\delta^2 \mu / 3},$$
$$P\left[X \leq (1-\delta)\mu\right] \leq e^{-\delta^2 \mu / 2}.$$

The main point is that these tail bounds go down exponentially in the mean $\mu$ and the multiplicative deviation $\beta$, as opposed to the previous tail bounds we've seen (Markov and Chebyshev) that only go down polynomially.
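To see the gap numerically, here is a small Python sketch (my own illustration) comparing the exact upper tail of a Bin(n, 1/2) variable with the Chebyshev bound and the multiplicative Chernoff bound (3.1).

```python
from math import comb, exp, log

n, p = 200, 0.5
mu, var = n * p, n * p * (1 - p)
beta = 1.5                                          # look at P[X >= beta * mu]
threshold = beta * mu

exact = sum(comb(n, k) for k in range(int(threshold), n + 1)) / 2 ** n
chebyshev = var / (threshold - mu) ** 2             # P[|X - mu| >= (beta - 1) mu]
chernoff = exp(mu * (beta - 1 - beta * log(beta)))  # (e^(beta-1) / beta^beta)^mu

print(f"exact     = {exact:.3e}")
print(f"Chebyshev = {chebyshev:.3e}")
print(f"Chernoff  = {chernoff:.3e}")
```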

Proof of Theorem 3.1. Much as we proved Chebyshev's inequality by applying Markov's inequality to the random variable $|X - E[X]|^2$, the Chernoff bound is proved by applying a function to the underlying random variable $X$ and then applying Markov's inequality.

Let $t > 0$ be a parameter we will choose later and write
$$P[X \geq \beta\mu] = P\left[e^{tX} \geq e^{t\beta\mu}\right] \leq \frac{E\left[e^{tX}\right]}{e^{t\beta\mu}}. \qquad (3.3)$$
The point of applying the function $X \mapsto e^{tX}$ is that we can exploit independence:
$$E\left[e^{tX}\right] = E\left[e^{t(X_1 + \cdots + X_n)}\right] = E\left[\prod_{i=1}^n e^{tX_i}\right] = \prod_{i=1}^n E\left[e^{tX_i}\right]. \qquad (3.4)$$
Now write:
$$E\left[e^{tX_i}\right] = (1 - p_i) + p_i e^t = 1 + p_i(e^t - 1) \leq e^{p_i(e^t - 1)},$$
where the last inequality uses $1 + x \leq e^x$, which is valid for all $x \in \mathbb{R}$.

where the last inequality uses 1 + x 6 ex which is valid for all x ∈ R.Plugging this into (3.4) yields

Ee tX6

n∏i1

epi(e t−1)

eµ(e t−1)

Now recalling (3.3), we haveP[X > βµ] 6 eµ(e t

−1−βt) .

Choosing t ln β yields (3.7). One can prove (3.8) similarly.


3.1 Randomized rounding

A classical technique in the field of approximation algorithms is to write down a linear programming relaxation of a combinatorial problem. The linear program (LP) is then solved in polynomial time, and one rounds the fractional solution to an integral solution that is, hopefully, not too much worse than the optimal solution. A classical example goes back to Raghavan and Thompson.

Let $D = (V, A)$ be a directed network, and suppose that we are given a sequence of terminal pairs $(s_1, t_1), (s_2, t_2), \ldots, (s_k, t_k)$ where $\{s_i, t_i\} \subseteq V$. We use $\mathcal{I} = (D, \{(s_i, t_i)\})$ to denote this instance of the min-congestion disjoint paths problem. The goal is to choose, for every $i$, a directed $s_i$-$t_i$ path $\gamma_i$ in $D$ so as to minimize the maximum congestion of an arc $e \in A$:
$$\mathrm{opt}(\mathcal{I}) = \min_{\{\gamma_i\}} \max_{e \in A} \#\{i : e \in \gamma_i\}.$$

This problem is NP-hard. Our goal will be an approximation algorithm that outputs a solution $\{\gamma_i\}$ so that the congestion of every edge is at most $\alpha \cdot \mathrm{opt}$. The number $\alpha$ is called the approximation factor of our algorithm.

A fractional relaxation. Our approach will be to compute first a fractional solution that sends 1 unit of flow from $s_i$ to $t_i$ for every $i$. A flow can be thought of in the following way. Let $\mathcal{P}_i$ denote the set of simple, directed $s_i$-$t_i$ paths in $D$, and let $\mathcal{P}$ denote the set of all simple directed paths in $D$. Here, simple means that no arc is repeated.

A multi-flow $F$ is a mapping $F : \mathcal{P} \to \mathbb{R}_+$ of paths to nonnegative real numbers. The multi-flow $F$ routes the demands $\{(s_i, t_i)\}$ if, for every $i = 1, \ldots, k$, we have $\sum_{\gamma \in \mathcal{P}_i} F(\gamma) = 1$, i.e. we send at least one unit of flow from $s_i$ to $t_i$ for every $i$. Finally, the congestion of an arc $e \in A$ under the flow $F$ is the value $\mathrm{con}_F(e) = \sum_{\gamma \in \mathcal{P} : e \in \gamma} F(\gamma)$, i.e. the amount of flow passing through the arc $e$.

We make the definition:
$$\mathrm{LP}(\mathcal{I}) = \min_F \max_{e \in A} \mathrm{con}_F(e),$$
where the minimum is over all multi-flows $F$ that route the demands $\{(s_i, t_i)\}$. It should be clear that $\mathrm{LP}(\mathcal{I}) \leq \mathrm{opt}(\mathcal{I})$. The reason we write $\mathrm{LP}(\mathcal{I})$ is that this value can be computed by a linear program of polynomial size. This is not precisely clear from our formulation because there are possibly exponentially many paths in $\mathcal{P}$, but there is a compact formulation of the LP using standard techniques (see the remark at the end of this section).

Given a multi-flow $F$, we will round it to an integral multi-flow $F'$, where an integral flow is one such that, for every $i = 1, \ldots, k$, we have $F'(\gamma) = 1$ for exactly one $\gamma \in \mathcal{P}_i$. Note that an integral flow represents a solution to the initial disjoint paths problem. Furthermore, we will now show that for some $\alpha \geq 1$, we have
$$\max_{e \in A} \mathrm{con}_{F'}(e) \leq \alpha \cdot \left(1 + \max_{e \in A} \mathrm{con}_F(e)\right).$$
In particular, if we apply this to the optimal fractional flow $F^*$, we arrive at
$$\max_{e \in A} \mathrm{con}_{F'}(e) \leq \alpha \cdot \left(1 + \max_{e \in A} \mathrm{con}_{F^*}(e)\right) = \alpha\left(1 + \mathrm{LP}(\mathcal{I})\right) \leq \alpha\left(1 + \mathrm{opt}(\mathcal{I})\right) \leq 2\alpha \cdot \mathrm{opt}(\mathcal{I}),$$
implying that we have achieved a $2\alpha$-approximation to the optimal solution. (Note that we have used the trivial bound $\mathrm{opt} \geq 1$.)


Theorem 3.2. Let $n = |V|$ and suppose that $n \geq 4$. If there is a multi-flow $F$ that routes the demands $(s_1, t_1), \ldots, (s_k, t_k)$, then there exists an integral multi-flow $F'$ that routes the demands, and furthermore
$$\max_{e \in A} \mathrm{con}_{F'}(e) \leq C\, \frac{\log n}{\log\log n} \left(1 + \max_{e \in A} \mathrm{con}_F(e)\right), \qquad (3.5)$$
where $C > 0$ is a universal constant.

Proof. We will produce a random integral multi-flow $F'$ that routes the demands $\{(s_i, t_i)\}$ and argue that it satisfies the conditions of the theorem with high probability.

For every $i = 1, \ldots, k$, we do the following independently. We know that $\sum_{\gamma \in \mathcal{P}_i} F(\gamma) = 1$. Thus we can think of $F$ as providing a probability distribution over $s_i$-$t_i$ paths. We let $\gamma_i$ denote a random $s_i$-$t_i$ path chosen with probability $F(\gamma)$ for $\gamma \in \mathcal{P}_i$.

The set of paths $\gamma_1, \gamma_2, \ldots, \gamma_k$ gives us an integral multi-flow $F'$. We are left to bound the maximum congestion of an edge. To this end, fix an edge $e \in A$. For every $\gamma \in \mathcal{P}$ such that $e \in \gamma$, let $X_\gamma$ be the indicator random variable that is 1 when the path $\gamma$ is chosen in the rounding. Then the number of paths going through the edge $e$ after rounding is given by the random variable
$$\mathrm{con}_{F'}(e) = \sum_{\gamma : e \in \gamma} X_\gamma. \qquad (3.6)$$
We may assume that $\mathrm{con}_F(e) \geq 1$ because we are comparing $\mathrm{con}_{F'}(e)$ to $1 + \mathrm{con}_F(e)$ in (3.5). First, we have

$$E[\mathrm{con}_{F'}(e)] = \sum_{\gamma : e \in \gamma} E[X_\gamma] = \sum_{\gamma : e \in \gamma} F(\gamma) = \mathrm{con}_F(e).$$
So at least in expectation, the congestion does not increase. If the $X_\gamma$ random variables were independent, then we could apply the Chernoff bound. Unfortunately, this is not necessarily the case. For instance, if $\gamma, \gamma' \in \mathcal{P}_i$ both contain the edge $e$, then $X_\gamma$ and $X_{\gamma'}$ are not independent; in fact, at most one of them can be equal to 1. Thus we will first rewrite $\mathrm{con}_{F'}(e)$ as a sum of independent $\{0,1\}$ random variables.

Let $Y_i$ be the indicator variable that equals 1 if the unique $s_i$-$t_i$ path in $F'$ uses the edge $e$, i.e. if $e \in \gamma_i$. Then the $Y_i$ are independent (since we round each $s_i$-$t_i$ pair independently). Moreover, we have $Y_i = \sum_{\gamma \in \mathcal{P}_i : e \in \gamma} X_\gamma$, so $\mathrm{con}_{F'}(e) = \sum_{i=1}^k Y_i$.

Since $\mathrm{con}_{F'}(e)$ is a sum of independent $\{0,1\}$ random variables, we can apply the Chernoff bound (Theorem 3.1) to conclude that

$$P\left[\mathrm{con}_{F'}(e) \geq \beta \cdot \mathrm{con}_F(e)\right] \leq \left(\frac{e^{\beta-1}}{\beta^\beta}\right)^{\mathrm{con}_F(e)} \leq \frac{e^{\beta-1}}{\beta^\beta},$$
where in the last inequality we have used our assumption that $\mathrm{con}_F(e) \geq 1$. We would like to choose the latter bound to be at most $n^{-3}$. To do this, we need to choose $\beta = C\, \frac{\log n}{\log\log n}$ for some constant $C$. (You should check that this is the right choice of $\beta$.)

Setting $\beta$ like this, we have
$$P\left[\mathrm{con}_{F'}(e) \geq \beta \cdot \mathrm{con}_F(e)\right] \leq \frac{1}{n^3},$$
and thus by a union bound over the $n^2$ possible edges,
$$P\left[\exists e \in A \text{ such that } \mathrm{con}_{F'}(e) \geq \beta \cdot \mathrm{con}_F(e)\right] \leq n^2 \cdot \frac{1}{n^3} \leq \frac{1}{n}.$$
Thus with probability at least $1 - \frac{1}{n}$, our integral flow $F'$ satisfies the claim of the theorem.
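A minimal sketch of the rounding step only (mine, with a hypothetical path-decomposition input; solving the LP and decomposing the optimal flow into paths is assumed to be done elsewhere): for each terminal pair, sample one path with probability equal to its fractional weight, then report the resulting arc congestions.

```python
import random
from collections import Counter

def round_multiflow(fractional_paths):
    """Randomized rounding of a fractional multi-flow.

    fractional_paths[i] is a list of (path, weight) pairs for terminal pair i,
    where a path is a tuple of arcs and the weights sum to 1.
    Returns the chosen paths and the congestion of each arc after rounding.
    """
    congestion = Counter()
    chosen = []
    for paths in fractional_paths:
        weights = [w for _, w in paths]
        path = random.choices([p for p, _ in paths], weights=weights)[0]
        chosen.append(path)
        congestion.update(path)
    return chosen, congestion

# Toy instance: two terminal pairs, each splitting its unit of flow over two paths.
fractional_paths = [
    [((("s1", "a"), ("a", "t1")), 0.5), ((("s1", "b"), ("b", "t1")), 0.5)],
    [((("s2", "a"), ("a", "t2")), 0.7), ((("s2", "b"), ("b", "t2")), 0.3)],
]
paths, cong = round_multiflow(fractional_paths)
print(cong.most_common(3))  # w.h.p. the max is within O(log n / log log n) of the fractional congestion
```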


Remark 3.3. Note that if we knew $\mathrm{con}_F(e) \geq C' \log n$ for some constant $C'$ and every $e \in A$, then we could actually choose $\beta = O(1)$ and still achieve a bound of $n^{-3}$ on the probability of an over-congested edge. This means that if all the fractional congestions are $\Omega(\log n)$, we can get an $O(1)$-approximation.

Remark 3.4. To compute the optimal fractional multi-flow, we write a linear program with variables $\{F_e : e \in A\}$. Our program should minimize the value $\lambda$ such that $F_e \leq \lambda$ for every $e \in A$. Moreover, to make sure that the variables $\{F_e : e \in A\}$ correspond to an actual flow, we should add the flow constraints at every non-terminal vertex: the flow in should be equal to the flow out of the vertex. At terminals, we have to allow there to be a surplus or deficit based on whether we are at a source or a sink. This program has $O(m)$ variables and $O(m + n)$ linear constraints, where $m = |A|$ and $n = |V|$.

3.2 Some more applications

In the preceding lecture, we saw our first large-deviation inequality. Let $X_1, X_2, \ldots, X_n$ be a family of independent (but not necessarily identically distributed) $\{0,1\}$ random variables and $X = X_1 + X_2 + \cdots + X_n$. Let $p_i = E[X_i]$ and define $\mu = E[X] = p_1 + p_2 + \cdots + p_n$. We recall the following multiplicative form of the "Chernoff bound."

Theorem 3.5 (Chernoff bound, multiplicative error). For every $\beta \geq 1$, it holds that
$$P\left[X \geq \beta\mu\right] \leq \left(\frac{e^{\beta - 1}}{\beta^\beta}\right)^{\mu}, \qquad (3.7)$$
and
$$P\left[X \leq \frac{\mu}{\beta}\right] \leq \left(e^{1/\beta - 1}\, \beta^{1/\beta}\right)^{\mu}. \qquad (3.8)$$

3.2.1 Balls in bins

Suppose we throw $m$ balls uniformly at random into $n$ bins. For $i = 1, \ldots, n$, let $X^{(i)}$ denote the number of balls that land in bin $i$. Let $Z := \max(X^{(1)}, \ldots, X^{(n)})$ denote the maximum load. Even to bound $E[Z]$ seems tricky, and it is often the case that evaluating the expected maximum of a family of random variables requires understanding their tail behavior.

Let $X^{(i)}_j$ be the indicator random variable that is 1 if ball $j$ lands in bin $i$. Then $E[X^{(i)}_j] = 1/n$, and hence by linearity of expectation, $E[X^{(i)}] = m/n$. Applying Theorem 3.5 yields, for any $i = 1, \ldots, n$,
$$P\left[X^{(i)} \geq \beta \frac{m}{n}\right] \leq \left(\frac{e}{\beta}\right)^{\beta m / n}. \qquad (3.9)$$

Let’s consider two representative regimes.

Regime I: $m = n$. By choosing $\beta = c\, \frac{\log n}{\log\log n}$ for $c > 1$ large enough (as in the preceding lecture), (3.9) gives
$$P\left[X^{(i)} \geq \beta\right] \leq n^{-3}.$$
Now the deviation probability is small enough to apply a union bound:
$$P[Z \geq \beta] \leq \sum_{i=1}^n P\left[X^{(i)} \geq \beta\right] \leq n \cdot n^{-3} \leq \frac{1}{n}.$$
Thus with high probability, the maximum load is $O\left(\frac{\log n}{\log\log n}\right)$.


Regime II: $m \geq c n \log n$ for a constant $c > 0$. In this case, we have $m/n \geq c \log n$, so applying (3.9) and a union bound gives
$$P\left[Z \geq \beta \frac{m}{n}\right] \leq n \cdot \left(\frac{e}{\beta}\right)^{c\beta \log n}.$$
Now choosing $\beta$ to be a sufficiently large constant (depending on $1/c$) gives $P\left[Z \geq \beta \frac{m}{n}\right] \leq \frac{1}{n}$, implying that the maximum load is only an $O(1)$ factor more than its expectation.

In comparing the two regimes, note that as the expected number of balls per bin rises, we actually get more concentration around the mean.
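A quick simulation (my own sketch) of the two regimes: as $m/n$ grows, the maximum load hugs the average load $m/n$ much more tightly.

```python
import math
import random
from collections import Counter

def max_load(m, n):
    """Throw m balls into n bins uniformly at random; return the maximum load."""
    counts = Counter(random.randrange(n) for _ in range(m))
    return max(counts.values())

n = 10_000
for m in (n, 10 * n * int(math.log(n))):  # Regime I: m = n; Regime II: m >> n log n
    z = max_load(m, n)
    print(f"m = {m}: max load {z}, average load {m / n:.1f}")
```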

3.2.2 Randomized Quicksort

I don't particularly like this application since it requires some unnatural machinations to get independent random variables. In the next lecture, we will see that large-deviation inequalities hold for martingales, and this argument becomes more natural.

But a quick recap: consider the numbers $1, 2, \ldots, n$. We construct a random rooted tree $T$ where each node $v \in V(T)$ has an associated subset $S_v \subseteq \{1, \ldots, n\}$ defined as follows: for the root $r \in V(T)$, we have $S_r = \{1, \ldots, n\}$. Then inductively, for a node $v$ with $|S_v| > 1$, we partition $S_v$ uniformly at random into two sets $S_v^L, S_v^R$ with $|S_v^L|, |S_v^R| \geq 1$, and we give $v$ two children labeled by these sets. Thus $T$ has precisely $n$ leaves labeled by the singleton sets $\{1\}, \ldots, \{n\}$.

Let $D_i$ denote the depth of the leaf labeled by $\{i\}$. The following claim is straightforward to verify inductively: the number of comparisons made by Quicksort is precisely $D_1 + D_2 + \cdots + D_n$. (Strictly speaking, this is only true because we are including the pivot in one of the two child lists, but a more clever implementation would only do better.)

Claim 3.6. There is a constant $C \geq 1$ such that for any $i \in \{1, 2, \ldots, n\}$, it holds that $P[D_i > C \log n] \leq n^{-2}$.

Taking a union bound gives
$$P\left[\#\text{comparisons} > C n \log n\right] \leq \frac{1}{n},$$
i.e. Quicksort runs in $O(n \log n)$ time with high probability.

Fix an element $i \in \{1, \ldots, n\}$. Let $S_0, S_1, \ldots, S_{D_i} = \{i\}$ be the labels of the nodes occurring from the root down to the leaf labeled $\{i\}$ in $T$. Define $S_j := \{i\}$ for $j > D_i$. Then an elementary calculation gives

$$P\left[|S_{j+1}| \leq \tfrac{1}{2}|S_j|\right] \geq \frac{1}{2} \qquad \forall j \geq 0.$$
Define:
$$Y_j = \begin{cases} 1 & \text{if } |S_{j+1}| \leq \frac{1}{2}|S_j| \text{ or } S_{j+1} = S_j \\ 0 & \text{otherwise.} \end{cases}$$

The next claim is straightforward to verify.

Claim 3.7. For every $j \geq 0$, we have $P[Y_j = 1] \geq \frac{1}{2}$. Moreover, if $\sum_{j=0}^{M-1} Y_j \geq \log_2 n$, then $D_i \leq M$.

If the $Y_j$ were independent random variables, then we could apply (3.8) to conclude that
$$P\left[Y_0 + Y_1 + \cdots + Y_{M-1} \leq \frac{M}{2\beta}\right] \leq \left(e^{1/\beta - 1}\, \beta^{1/\beta}\right)^{M/2}.$$


Choosing $M = \Theta(\log n)$ and $\beta = \Theta(1)$ would then yield Claim 3.6.

Unfortunately, these random variables are not independent, as clearly $Y_j = 1$ for $j \geq D_i$, for instance. One solution is to use a hack: we can couple the $Y_j$ random variables to independent random variables $\hat Y_j$ such that $\hat Y_j = 1 \Rightarrow Y_j = 1$ and $P[\hat Y_j = 1] = 1/2$. Then we can legitimately apply the Chernoff bound to the family $\{\hat Y_j\}$ and reach the same conclusion.

This is easy to do by defining
$$\hat Y_j = Z_j Y_j,$$
where $\{Z_j\}$ is a collection of $\{0,1\}$ random variables such that $Z_j = 1$ with probability $1/(2 P[Y_j = 1])$ (so that $Z_j$ is independent of $Y_j$ conditioned on $S_j$). Note that this definition makes sense since $P[Y_j = 1] \geq 1/2$.

Now we have $P[\hat Y_j = 1] = 1/2$ for all $j \geq 0$, and moreover, the random variables $\hat Y_j$ are independent:
$$P\left[\hat Y_j = 1 \mid \hat Y_0, \ldots, \hat Y_{j-1}, \hat Y_{j+1}, \ldots\right] = \frac{1}{2} \qquad \forall j \geq 0.$$

Even this preceding fact is slightly tricky to verify, and the whole argument doesn't reflect one's natural intuition: independence shouldn't matter as long as we have probability at least 1/2 to reduce the size of $S_{j+1}$ conditioned on $S_j$ with $|S_j| > 1$. That is the purview of martingale theory, and we will cover large-deviation inequalities for martingales in the next lecture.

3.3 Negative correlation

Say that a collection $X_1, \ldots, X_n$ of random variables is negatively correlated if it holds that for any subset $S \subseteq [n]$:
$$E\left[\prod_{i \in S} X_i\right] \leq \prod_{i \in S} E[X_i]. \qquad (3.10)$$

Note that if $X_1, \ldots, X_n$ are independent, then this holds with equality.

Examples. We will state some examples of negatively correlated families (without proof).

1. Loads of the bins. The variables $\{X^{(i)} : i = 1, \ldots, n\}$ from Section 3.2.1 are negatively correlated.

   This stands to reason: telling you that some bins have unusually large (resp., small) load makes the expected load of the remaining bins smaller (resp., larger).

2. Random permutations. If $(X_1, \ldots, X_n)$ is a uniformly random permutation of $\{1, 2, \ldots, n\}$, then the family $\{X_i\}$ is negatively correlated.

   The intuition here is the same as in the previous example.

3. Random spanning trees. Suppose $G = (V, E)$ is an undirected graph and $T$ is a uniformly random spanning tree of $G$. For $e \in E$, let $X_e$ denote the indicator random variable that is 1 precisely when $e$ is an edge of $T$. Then the family $\{X_e : e \in E\}$ is negatively correlated.

   Suppose I tell you that $X_{e_1} = \cdots = X_{e_k} = 1$ for some edges $e_1, \ldots, e_k \in E$. Intuitively, one can contract the connected components in the graph spanned by $\{e_1, \ldots, e_k\}$ and consider a uniformly random spanning tree on the rest. Now the problem of connecting everything together has become easier, and thus $P[X_e = 1]$ decreases for $e \notin \{e_1, \ldots, e_k\}$. Certainly if $e$ connects two vertices that are already connected in the graph with edges $\{e_1, \ldots, e_k\}$, then $P[X_e = 1 \mid X_{e_1} = \cdots = X_{e_k} = 1] = 0$.

   Actually proving negative correlation for this family is non-trivial.

It turns out that the Chernoff bounds of Theorem 3.5 hold if we consider negatively correlated random variables. It is still an area of active research to determine good notions of "negative dependence" in general settings. In particular, the notion of negative correlation above is unsuitable for many settings, especially because it can be hard to verify and does not satisfy natural closure properties (making it difficult to derive new negatively correlated families from old ones).

Theorem 3.8 (Chernoff for negatively correlated random variables). If, in the statement of Theorem 3.5, we only assume that $X_1, \ldots, X_n$ are negatively correlated $\{0,1\}$ random variables (instead of independent), then the conclusion still holds.

Proof. To see this, note that the one place we used independence in the proof of Theorem 3.1 is in the calculation: when $X = X_1 + \cdots + X_n$,
$$E\left[e^{tX}\right] = E\left[\prod_{i=1}^n e^{tX_i}\right] = \prod_{i=1}^n E\left[e^{tX_i}\right].$$

Let us see that the inequality $E\left[e^{tX}\right] \leq \prod_{i=1}^n E\left[e^{tX_i}\right]$ still holds when $X_1, \ldots, X_n$ are only assumed to be negatively correlated.

To this end, let $\tilde X_1, \ldots, \tilde X_n$ be independent $\{0,1\}$ random variables with $E[\tilde X_i] = E[X_i]$ for each $i = 1, \ldots, n$, and define $\tilde X := \tilde X_1 + \cdots + \tilde X_n$. For any nonnegative integer $k$,

$$E[X^k] = \sum_\alpha \binom{k}{\alpha_1, \ldots, \alpha_n}\, E\left[X_1^{\alpha_1} X_2^{\alpha_2} \cdots X_n^{\alpha_n}\right],$$
where the sum is over all $\alpha \in \mathbb{Z}^n$ with $\alpha_i \geq 0$ and $\sum_i \alpha_i = k$. Using the negative correlation property (and the fact that each $X_i$ is $\{0,1\}$-valued, so $X_i^{\alpha_i} = X_i$ whenever $\alpha_i \geq 1$), this gives

$$E[X^k] \leq \sum_\alpha \binom{k}{\alpha_1, \ldots, \alpha_n}\, E[X_1^{\alpha_1}]\, E[X_2^{\alpha_2}] \cdots E[X_n^{\alpha_n}] = \sum_\alpha \binom{k}{\alpha_1, \ldots, \alpha_n}\, E[\tilde X_1^{\alpha_1}]\, E[\tilde X_2^{\alpha_2}] \cdots E[\tilde X_n^{\alpha_n}],$$
where the equality follows because $X_i$ and $\tilde X_i$ have the same distribution for every $i$. Finally, note that by independence,
$$\sum_\alpha \binom{k}{\alpha_1, \ldots, \alpha_n}\, E[\tilde X_1^{\alpha_1}]\, E[\tilde X_2^{\alpha_2}] \cdots E[\tilde X_n^{\alpha_n}] = \sum_\alpha \binom{k}{\alpha_1, \ldots, \alpha_n}\, E\left[\tilde X_1^{\alpha_1} \tilde X_2^{\alpha_2} \cdots \tilde X_n^{\alpha_n}\right] = E[\tilde X^k].$$

We conclude that for every integer $k \geq 0$,
$$E[X^k] \leq E[\tilde X^k]. \qquad (3.11)$$

Using the Taylor expansion
$$e^{tX} = 1 + tX + \frac{t^2 X^2}{2} + \frac{t^3 X^3}{6} + \cdots,$$
and applying (3.11) to each term gives
$$E\left[e^{tX}\right] \leq E\left[e^{t\tilde X}\right] = \prod_{i=1}^n E\left[e^{t\tilde X_i}\right] = \prod_{i=1}^n E\left[e^{tX_i}\right],$$
yielding our desired inequality. Now the proof of the Chernoff bound can proceed exactly as in the preceding lecture.


4 Martingales

We have seen that if $X = X_1 + \cdots + X_n$ is a sum of independent $\{0,1\}$ random variables, then $X$ is tightly concentrated around its expected value $E[X]$. The fact that the random variables were $\{0,1\}$-valued was not essential; similar concentration results hold if we simply assume that they are in some bounded range $[-L, L]$. One can also relax the independence assumption, as we will see next.

Consider a sequence of random variables $X_0, X_1, X_2, \ldots$. The sequence $\{X_i\}$ is called a discrete-time martingale if it holds that
$$E[X_{i+1} \mid X_0, X_1, \ldots, X_i] = X_i$$
for every $i = 0, 1, 2, \ldots$. More generally, the sequence $\{X_i\}$ is a martingale with respect to another sequence of random variables $\{Y_i\}$ if, for every $i$, it holds that
$$E[X_{i+1} \mid Y_0, Y_1, \ldots, Y_i] = X_i.$$
Note that this is equivalent to $E[X_{i+1} - X_i \mid Y_0, Y_1, \ldots, Y_i] = 0$. If one thinks of $Y_0, Y_1, \ldots, Y_i$ as all the "information" up to time $i$, then this says that the difference $X_{i+1} - X_i$ is unbiased conditioned on the past up to time $i$. Observe that for any $i$, we have
$$E[X_i] = E\left[E[X_i \mid X_0, \ldots, X_{i-1}]\right] = E[X_{i-1}] = \cdots = E[X_0].$$

Martingales form an extremely useful class of random processes that appear in a vast array of settings (e.g., finance, machine learning, information theory, statistical physics, etc.). The classic example is that of a gambler whose bankroll is $X_0$. At each time, she chooses to play some game in the casino at some stakes. If we assume that every game is fair (that is, the expected utility from playing the game is 0), then the sequence $X_0, X_1, \ldots$ forms a martingale, where $X_i$ is the amount of money she has at time $i$.

Remark 4.1. The correct level of generality at which to define martingales involves a filtration. Formally, this is an increasing sequence of $\sigma$-algebras on our measure space $(\Omega, \mu, \mathcal{F})$: $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}$. A sequence of random variables $\{X_i\}$ is a martingale with respect to the filtration $\{\mathcal{F}_i\}$ if $E[X_{i+1} \mid \mathcal{F}_i] = X_i$ for every $i \geq 0$.

4.1 Doob martingales

One reason martingales are so powerful is that they model a situation where one gains progressively more information over time. Suppose that $U$ is a set of objects, and $f : U \to \mathbb{R}$. Let $X$ be a random variable taking values in $U$, and let $\{Y_i\}$ be another sequence of random variables. The associated Doob martingale is given by
$$X_i = E[f(X) \mid Y_0, Y_1, \ldots, Y_i].$$
In words, this is our "estimate" for the value of $f(X)$ given the information contained in $Y_0, \ldots, Y_i$. To see that this is always a martingale with respect to $\{Y_i\}$, observe that
$$E[X_{i+1} \mid Y_0, \ldots, Y_i] = E\left[E[f(X) \mid Y_0, \ldots, Y_{i+1}] \mid Y_0, \ldots, Y_i\right] = E[f(X) \mid Y_0, \ldots, Y_i] = X_i,$$
where we have used the tower rule of conditional expectations.


Balls in bins. Suppose we throw $m$ balls into $n$ bins one at a time. At step $i$, we place ball $i$ in a uniformly random bin. Let $C_1, C_2, \ldots, C_m$ be the sequence of (random) choices, and let $C$ denote the final configuration of the system, i.e. exactly which balls end up in which bins.

Now we can consider a functional like $f(C) = \#\{\text{empty bins}\}$. If $X_i = E[f(C) \mid C_1, \ldots, C_i]$, then $\{X_i\}$ is a (Doob) martingale. It is straightforward to calculate that
$$E[X_m] = E[X_0] = E[f(C)] = n \cdot \left(1 - \frac{1}{n}\right)^m.$$
Suppose we are interested in the concentration of $X_m = f(C)$ around its mean value. Of course, we can write $X_m = Z_1 + \cdots + Z_n$ where $Z_i$ is the indicator of whether the $i$th bin is empty after all the balls have been thrown. But note that the $Z_i$ variables are not independent; in particular, if I tell you that $Z_1 = 1$ (bin 1 is empty), it decreases slightly the likelihood that other bins are empty.
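To see this Doob martingale concretely, note that conditioned on the first $i$ choices, a bin is empty at the end iff it is currently empty and each of the remaining $m - i$ balls misses it; this gives a closed form for $X_i$. The Python sketch below (mine, using that closed form as an assumption spelled out in the comments) computes one sample path of the martingale.

```python
import random

def doob_empty_bins_path(m, n):
    """One sample path of the Doob martingale X_i = E[# empty bins | C_1, ..., C_i].

    Conditioned on the first i throws, a bin is empty at the end iff it is empty
    now and each of the remaining m - i balls misses it, so
    X_i = (# currently empty bins) * (1 - 1/n)^(m - i).
    """
    empty = n                      # number of currently empty bins
    path = [n * (1 - 1 / n) ** m]  # X_0 = E[f(C)]
    for i in range(1, m + 1):
        # Ball i lands in a currently-empty bin with probability empty / n.
        if random.random() < empty / n:
            empty -= 1
        path.append(empty * (1 - 1 / n) ** (m - i))
    return path

path = doob_empty_bins_path(m=1000, n=1000)
print(round(path[0], 1), path[-1])  # X_0 = n(1 - 1/n)^m ~ n/e; X_m is the realized count
```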

The vertex exposure filtration. Recall that $G_{n,p}$ denotes the random graph model where an undirected graph on $n$ vertices is chosen by including every edge independently with probability $p$. Suppose the vertices are numbered $1, 2, \ldots, n$. Let $G \sim G_{n,p}$ and denote by $G_i$ the induced subgraph on the vertices $\{1, \ldots, i\}$. $G_0$ denotes the empty graph.

Let $\chi(G)$ denote the chromatic number of $G$, and consider the Doob martingale
$$X_i = E[\chi(G) \mid G_0, \ldots, G_i].$$
If we wanted to understand concentration properties of $X_n = \chi(G)$, this seems even more daunting. The chromatic number is a very complicated parameter of a graph! Nevertheless, we will now see that martingale concentration inequalities allow us to achieve tight concentration using very limited information about a sequence of random variables.

4.2 The Hoeffding-Azuma inequality

Say that a martingale $\{X_i\}$ has $L$-bounded increments if
$$|X_{i+1} - X_i| \leq L$$
for all $i \geq 0$. (The preceding inequality is meant to hold with probability 1.)

Theorem 4.2. For every $L > 0$, if $\{X_i\}$ is a martingale with $L$-bounded increments, then for every $\lambda > 0$ and $n \geq 0$, we have
$$P[X_n \geq X_0 + \lambda] \leq e^{-\frac{\lambda^2}{2L^2 n}},$$
$$P[X_n \leq X_0 - \lambda] \leq e^{-\frac{\lambda^2}{2L^2 n}}.$$

It’s useful to note the following special case of the theorem.

Corollary 4.3. Suppose that $Z_1, Z_2, \ldots, Z_n$ are independent random variables taking values in the interval $[-L, L]$. Put $Z = Z_1 + \cdots + Z_n$ and $\mu = E[Z]$. Then for every $\lambda > 0$, we have
$$P\left[Z \geq \mu + \lambda\right] \leq e^{-\lambda^2/(2L^2 n)},$$
$$P\left[Z \leq \mu - \lambda\right] \leq e^{-\lambda^2/(2L^2 n)}.$$


The Lipschitz condition. Recall the setting of Doob martingales, where $U$ is a set. Suppose that we can describe every element $u \in U$ by a sequence of values $u = (u_1, u_2, \ldots, u_n)$. (For instance, every configuration of $m$ balls in $n$ bins can be described by the sequence of which balls go into which bins.)

Say that $f$ is $L$-Lipschitz if it holds that for every $i = 1, \ldots, n$ and for every two elements $u = (u_1, u_2, \ldots, u_i, \ldots, u_n) \in U$ and $u' = (u_1, u_2, \ldots, u'_i, \ldots, u_n) \in U$ that differ only in the $i$th coordinate, we have
$$|f(u) - f(u')| \leq L.$$

Let $Z = (Z_1, \ldots, Z_n)$ be a $U$-valued random variable such that the random variables $Z_i$ are independent. We now confirm that the Doob martingale $X_i = E[f(Z) \mid Z_1, \ldots, Z_i]$ has $L$-bounded increments.

Let $Z'_{i+1}$ be an independent copy of $Z_{i+1}$ conditioned on $Z_1, \ldots, Z_i$, and let $Z' = (Z_1, \ldots, Z_i, Z'_{i+1}, Z_{i+2}, \ldots, Z_n)$. Then:
$$|X_{i+1} - X_i| = \left|E[f(Z) \mid Z_1, \ldots, Z_{i+1}] - E[f(Z) \mid Z_1, \ldots, Z_i]\right| = \left|E\left[f(Z) - f(Z') \mid Z_1, \ldots, Z_{i+1}\right]\right| \leq E\left[\left|f(Z) - f(Z')\right| \mid Z_1, \ldots, Z_{i+1}\right] \leq L,$$
where in the last step we have used the fact that the term inside the absolute value signs is always at most $L$ by the $L$-Lipschitz property of $f$, and the fact that $Z$ and $Z'$ differ in at most one coordinate.

Remark 4.4. Note the power of Theorem 4.2 combined with this construction of Doob martingales. It means that if we have any random variable $Z = (Z_1, \ldots, Z_n)$ that is built out of independent pieces of information $\{Z_i\}$, and some quantity $f(Z)$ that we care about does not depend too much on changing any single piece of information, then $f(Z)$ is tightly concentrated about its mean. This is a vast generalization of the fact that sums of independent, bounded random variables are highly concentrated (cf. Corollary 4.3).

The number of empty bins. First let's apply this to balls and bins. Recall that for a sequence of choices $C_1, \ldots, C_m$ (where $C_i$ is the bin that the $i$th ball is thrown into), we put $f(C_1, \ldots, C_m)$ to be the number of empty bins. Then clearly $f$ is 1-Lipschitz: changing the fate of ball $i$ can only change the number of empty bins by 1. Therefore the corresponding martingale $X_i = E[f(C_1, \ldots, C_m) \mid C_1, \ldots, C_i]$ has 1-bounded increments, and Azuma's inequality implies that
$$P[X_m \geq X_0 + \lambda] \leq e^{-\frac{\lambda^2}{2m}}.$$
Recall that $X_0 = E[X_m] = n\left(1 - \frac{1}{n}\right)^m$. Consider the situation where $m = n$ and thus $X_0 \approx \frac{n}{e}$. If we put $\lambda = C\sqrt{n}$, we see that with high probability the number of empty bins is in the interval $\frac{n}{e} \pm O(\sqrt{n})$.

e ± O(√n).The chromatic number. Similarly, consider the vertex exposure martingale. We have to be a littlemore careful here to describe a graph G by a sequence (Z1 , . . . , Zn) of independent random variables.The key is to think about Zi containing the information on edges from vertex i to the vertices1, . . . , i − 1 so that we have independence.
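The following small simulation (an added illustration with arbitrary parameter choices) throws m = n balls into n bins and checks that the number of empty bins indeed stays in a window of width O(√n) around n/e.

import random, math

n, trials = 10000, 200
samples = []
for _ in range(trials):
    occupied = set(random.randrange(n) for _ in range(n))   # m = n balls
    samples.append(n - len(occupied))                        # empty bins
mean = sum(samples) / trials
print("n/e          :", n / math.e)
print("sample mean  :", mean)
print("max deviation:", max(abs(s - n / math.e) for s in samples),
      " (compare with sqrt(n) =", math.sqrt(n), ")")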

The chromatic number. Similarly, consider the vertex exposure martingale. We have to be a little more careful here to describe a graph G by a sequence (Z_1, . . . , Z_n) of independent random variables. The key is to think of Z_i as containing the information about the edges from vertex i to the vertices 1, . . . , i − 1, so that we have independence.

Since we can identify a graph G with the vector (Z_1, . . . , Z_n), we can think of the chromatic number as a function χ(Z_1, . . . , Z_n). The function χ satisfies the 1-Lipschitz property because changing the edges adjacent to some vertex i can only change the chromatic number by 1: it cannot increase by more than one because we could always give i a brand new color, and it cannot decrease by more than one because if we can color the modified graph without vertex i using c colors, then we can color the whole graph with c + 1 colors.

So the martingale X_i = E[χ(G) | Z_1, . . . , Z_i] = E[χ(G) | G_1, . . . , G_i] has 1-bounded increments, and Azuma's inequality tells us that

P[χ(G) ≥ E[χ(G)] + λ] ≤ e^{−λ²/(2n)}.

Even without having any idea how to compute E[χ(G)], we are able to say something significantabout its concentration properties.

Remark 4.5. By the way, if G ∼ G_{n,1/2}, then E[χ(G)] ≈ n/(2 log₂ n), so the concentration window here (which is O(√n)) is again quite small with respect to the expectation. In the next lecture, we will see how a more clever use of Azuma's inequality can achieve even better concentration of χ(G).

4.3 Proof

We will actually prove the following generalization of Theorem 4.2.

Theorem 4.6. Suppose that {X_i} is a sequence of random variables satisfying the property that for every subset of distinct indices i_1 < i_2 < · · · < i_k, we have

E[X_{i_1} X_{i_2} · · · X_{i_k}] = 0.

Then for every λ > 0 and n ≥ 1, it holds that

P[∑_{i=1}^n X_i ≥ λ] ≤ exp(−λ² / (2 ∑_{i=1}^n ‖X_i‖²_∞)).

Here, ‖X_i‖_∞ is the essential supremum of X_i, i.e., the least value L such that |X_i| ≤ L with probability one.

The reason Theorem 4.6 proves Theorem 4.2 is as follows: Suppose that {Z_i} is a martingale with respect to the sequence of random variables {Y_i}, and let X_i = Z_i − Z_{i−1}. Consider distinct indices i_1 < i_2 < · · · < i_k. Then:

E[X_{i_1} · · · X_{i_k}] = E[X_{i_1} · · · X_{i_{k−1}} E[Z_{i_k} − Z_{i_k−1} | Y_0, . . . , Y_{i_k−1}]] = 0,

where the final equality follows from the defining property of a martingale.

Proof of Theorem 4.6. Note that from our assumptions, for any sequences of constants {a_i} and {b_i}, we have

E[∏_{i=1}^n (a_i + b_i X_i)] = ∏_{i=1}^n a_i.   (4.1)

Also, observe that for any a, the function f(x) = e^{ax} is convex. Thus for x ∈ [−1, 1], it lies below the line connecting e^{−a} to e^{a}. In other words, for x ∈ [−1, 1],

e^{ax} ≤ (e^a + e^{−a})/2 + x · (e^a − e^{−a})/2 = cosh(a) + x sinh(a).


Combining this with (4.1), we have for any t:

E[e^{t ∑_{i=1}^n X_i}] ≤ E[∏_{i=1}^n (cosh(t‖X_i‖_∞) + (X_i/‖X_i‖_∞) sinh(t‖X_i‖_∞))]
= ∏_{i=1}^n cosh(t‖X_i‖_∞) ≤ e^{(t²/2) ∑_{i=1}^n ‖X_i‖²_∞},

where the final inequality follows from cosh(x) = ∑_k x^{2k}/(2k)! ≤ ∑_k x^{2k}/(2^k k!) = e^{x²/2}.

Now we are in position to apply the method of Laplace transforms:

P[∑_{i=1}^n X_i ≥ λ] ≤ E[e^{t ∑_{i=1}^n X_i}]/e^{tλ} ≤ e^{(t²/2) ∑_{i=1}^n ‖X_i‖²_∞ − tλ}.

Setting t = λ / ∑_{i=1}^n ‖X_i‖²_∞ finishes the proof.

4.4 Additional applications

4.4.1 Concentration in product spaces

Define U = {1, 2, . . . , 6}^n. Define the Hamming distance between x, y ∈ U by

H(x, y) := #{i ∈ [n] : x_i ≠ y_i},

and if A ⊆ U, define H(x, A) := min{H(x, y) : y ∈ A}. The following theorem shows that U exhibits "concentration of measure": starting with any sufficiently large set A ⊆ U, most of the points in U will be very close to A (the distance to A will be much smaller than the diameter of U).

Theorem 4.7. Consider any subset A ⊆ U with |A| ≥ 6^{n−1}. Then for any c > 0,

|{x ∈ U : H(x, A) ≤ (c + 2)√n}| / 6^n ≥ 1 − e^{−c²/2}.   (4.2)

Proof. Let Z = (Z_1, . . . , Z_n) ∈ U be a uniformly random point. Define the Doob martingale X_i = E[H(Z, A) | Z_1, . . . , Z_i]. Since the map x ↦ H(x, A) is 1-Lipschitz, we know that |X_i − X_{i−1}| ≤ 1 for every i = 1, 2, . . . , n. Thus if µ = E[H(Z, A)], Azuma's inequality yields

P[H(Z, A) ≤ µ − c√n] ≤ e^{−c²/2}
P[H(Z, A) ≥ µ + c√n] ≤ e^{−c²/2}.

It is not immediately obvious how to calculate µ, but we can get a good bound using concentration. If µ > 2√n, then the first inequality (applied with c = 2) gives

P[H(Z, A) = 0] ≤ e^{−2²/2} = e^{−2} < 1/6,

but we know that P[H(Z, A) = 0] = |A|/6^n ≥ 1/6; thus µ ≤ 2√n. Now apply the second inequality, yielding

P[H(Z, A) ≥ 2√n + c√n] ≤ e^{−c²/2}.

This is precisely our goal (4.2).


4.4.2 Tighter concentration of the chromatic number

Previously, using the vertex exposure martingale, we were able to prove reasonable concentration for χ(G) when G ∼ G_{n,p}. In what follows, we will put p = n^{−α} for some α > 0. We will show that, surprisingly, if α > 5/6, then with probability tending to one, χ(G) is concentrated on one of four values. In what follows, we will say that an event E_n (explicitly or implicitly indexed by n) holds "with high probability" if P(E_n) → 1 as n → ∞.

Lemma 4.8. For any c > 0 and α > 5/6, the following holds for G ∼ G_{n,p}: With high probability, every induced subgraph on at most c√n vertices is 3-colorable.

Proof sketch. Let S be a smallest subset of V(G) that is not 3-colorable (if no such set exists, we are done). Then every x ∈ S must have at least three neighbors in S: otherwise, since S \ {x} is 3-colorable, a 3-coloring of S \ {x} could be extended to x, and S would be 3-colorable as well. Thus the number of edges in the induced subgraph G[S] is at least 3|S|/2.

But it is unlikely that any set S with |S| ≤ c√n has at least 3|S|/2 edges inside it. To see this, let t = |S|, and let us compute the probability for a fixed set S: It is at most

p^{3t/2} · ( (t choose 2) choose 3t/2 ) ≤ p^{3t/2} O(t)^{3t/2}.

Now we take a union bound over all sets of size at most T:

∑_{t ≤ T} p^{3t/2} O(t)^{3t/2} (n choose t) ≤ O(pT)^{3T/2} O(n/T)^T.

The latter inequality holds as long as T ≪ n. Now using p = n^{−α} and T ≤ c√n, this is bounded by

O(n)^{(1/2 − α) · 3T/2} · O(n)^{T/2},

and the latter quantity is o(1) as long as (3/2)(1/2 − α) < −1/2, i.e., α > 5/6.

Theorem 4.9. With high probability, χ(G) takes one of four different values.

Proof. Fix a number ε > 0 that we will send to 0. Let u = u(n, p, ε) be the smallest integer so that P[χ(G) ≤ u] > ε. Observe that, by the choice of u, we have P[χ(G) ≥ u] ≥ 1 − ε.

Let Y = Y(G) be the minimal size of a set of vertices S such that χ(G \ S) ≤ u. Consider the vertex exposure martingale for G ∼ G_{n,p}. Note that Y is 1-Lipschitz with respect to the exposure process, because we can always add the modified vertex to S. Thus we can apply Azuma's inequality to the corresponding Doob martingale to conclude that

P[Y ≥ µ + λ√n] ≤ e^{−λ²/2}   (4.3)
P[Y ≤ µ − λ√n] ≤ e^{−λ²/2},   (4.4)

where µ = E[Y].

Let us choose λ so that e^{−λ²/2} = ε. By the definition of u, we have P[Y = 0] > ε. We conclude from (4.4) that µ ≤ λ√n. Now using (4.3), we see that P[Y ≥ 2λ√n] ≤ ε.

By Lemma 4.8, we may assume that every subset of at most 2λ√n vertices is 3-colorable, by throwing away an ε-fraction of graphs. Now observe that Y < 2λ√n implies that G is (u + 3)-colorable: G \ S is u-colorable and |S| < 2λ√n, so S can be colored with 3 additional colors. We conclude that

P[χ(G) ∈ {u, u + 1, u + 2, u + 3}] ≥ 1 − 3ε.

Sending ε → 0 completes the proof.


5 Memoryless random variables and low-diameter partitions

5.1 Random tree embeddings

Let (X, d) be a metric space on n points. Many problems in computation involve data pointsequipped with a natural distance. Metric spaces also arise as the solutions to linear programmingrelaxations of combinatorial problems. Often, it is useful to embed a metric space into a “simpler”one while changing the distances as little as possible. This is beneficial if an algorithmic problemcan be solved more easily on the simple space.

A prime example of a simple metric space is a tree metric. Let T be a graph-theoretic tree with vertex set V(T) and edge set E(T). We will assume that the tree is equipped with a nonnegative length ℓ(e) on each edge e ∈ E(T). There is a canonical shortest-path metric on V(T), which we denote d_T.

A natural goal is to try and map our space (X, d) into a tree metric so that we preserve distancesmultiplicatively. In other words, we would look for a map f : X → V(T) so that

d(x, y) ≤ d_T(f(x), f(y)) ≤ D · d(x, y)   ∀x, y ∈ X,

and the distortion D is as small as possible. Unfortunately, this doesn’t work so well. For instance, if(X, d) is the shortest-path metric on an unweighted n-cycle, one can show that any such mappingmust have D > Ω(n). (Getting this argument right is actually a little tricky, but it is true.)

On the other hand, if we allow ourselves to use a random embedding, then we can approximate distances well in expectation. Consider again the shortest-path metric on an n-cycle C_n. Let T be the random (unweighted) tree that results from deleting a single uniformly random edge of C_n. It's easy to see that for every x, y ∈ V(C_n), we have

d_{C_n}(x, y) ≤ E[d_T(x, y)] ≤ 2 · d_{C_n}(x, y).

The pair with the largest distortion is an edge {x, y} of C_n; the expected length of {x, y} in T is (1 − 1/n) · 1 + (1/n)(n − 1) ≤ 2.
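Here is a short Python check (an added illustration, not from the notes): for the n-cycle it averages the tree distance over the n choices of deleted edge and verifies non-contraction together with the factor-2 bound in expectation; n is an arbitrary illustration size.

import itertools

n = 9

def cycle_dist(x, y):
    a = (y - x) % n
    return min(a, n - a)

def tree_dist(x, y, e):
    # e indexes the deleted edge {e, e+1 mod n}; the remaining graph is a path.
    a = (y - x) % n                       # length of the clockwise arc x -> y
    deleted_on_cw_arc = (e - x) % n < a   # does the deleted edge lie on that arc?
    return (n - a) if deleted_on_cw_arc else a

for x, y in itertools.combinations(range(n), 2):
    expected = sum(tree_dist(x, y, e) for e in range(n)) / n
    d = cycle_dist(x, y)
    assert d <= expected <= 2 * d, (x, y, expected, d)
print("non-contracting and 2-expanding in expectation: verified for n =", n)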

Non-contracting tree embeddings. We now formalize the goal of embedding into a random tree.We say that (X, d) admits a random tree embedding with distortion D if there exists a random tree metricT and a random map F : X → V(T) that satisfies the following two properties:

1. Non-contracting. With probability one, for every x, y ∈ X, we have d_T(F(x), F(y)) ≥ d(x, y).

2. Non-expanding in expectation. For all x, y ∈ X,

E[d_T(F(x), F(y))] ≤ D · d(x, y).

There are many scenarios in approximation algorithms and online algorithms where such embeddings can be used to reduce solving a problem in the general case to solving it on a tree, by losing a factor of D in the approximation ratio (see Homework #3 for an example). It's also the case that such mappings can be useful for preconditioning diagonally dominant linear systems; in this case, one usually loses a factor of D (or D^{O(1)}) in the running time. In a sequence of works, Bartal showed that one can achieve D = O(log n log log n). The optimal bound was obtained a few years later.


Theorem 5.1 (Fakcharoenphol-Rao-Talwar 2003). Every n-point metric space (X, d) admits a random tree embedding with distortion O(log n).

In Homework #3, you will prove that the theorem holds with D = O((log n)²). We now discuss the basic primitive one needs.

5.2 Random low-diameter partitions

Given a parameter ∆ > 0 (the diameter bound), our goal is to construct a random partition X = C_1 ∪ C_2 ∪ · · · ∪ C_k of X into sets with diam(C_i) ≤ ∆ for every i = 1, . . . , k. Here, diam(S) = max_{x,y∈S} d(x, y) denotes the maximum distance within a subset S ⊆ X.

Of course, this is easy (we could simply decompose X into singletons). We will also require that for every x, y ∈ X, we have

P[x and y are separated in P] ≤ (d(x, y)/∆) · α.   (5.1)

We say that a partition P = {C_1, C_2, . . . , C_k} separates x and y if x ∈ C_i, y ∈ C_j, and i ≠ j. Our goal will be to prove that such random partitions always exist if we take α = 8 ln n.

Exercise: Prove that if X ⊆ R and d(x, y) = |x − y|, then we can find such a random partition with α = 1. This demonstrates why the scaling in (5.1) is natural.

5.3 Memoryless random variables

Geometric random variables. Write X ∼ Geom(p) to denote the fact that X is a geometric random variable with mean 1/p. Recall that X is the number of independent coin flips needed to get heads when the coin comes up heads with probability p. It is easy to see that for every k ≥ 1, we have P[X = k] = (1 − p)^{k−1} p. One can also check that

P[X ≥ k] = (1 − p)^{k−1}   ∀k ≥ 1   (5.2)
P[X = k | X ≥ j] = (1 − p)^{k−j} p   ∀1 ≤ j ≤ k.

The last property expresses that a geometric random variable is "memoryless": the distribution of X conditioned on {X > j} is the same as the distribution of X + j.
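As a quick numerical illustration (added here; the parameters p, j, and the number of trials are arbitrary), one can compare the conditional law of X − j given X > j with the law of a fresh geometric variable.

import random

p, j, trials = 0.3, 4, 200000

def geom():
    k = 1
    while random.random() >= p:
        k += 1
    return k

samples = [geom() for _ in range(trials)]
cond = [x - j for x in samples if x > j]       # law of X - j given X > j
fresh = [geom() for _ in range(len(cond))]     # fresh copies of X
for k in range(1, 5):
    print(k,
          round(sum(c == k for c in cond) / len(cond), 3),
          round(sum(c == k for c in fresh) / len(fresh), 3))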

Exponential random variables. There is also a continuous memoryless distribution: the exponential distribution. If X is exponential with rate µ (and hence mean 1/µ), then it has density µe^{−µt} for t ≥ 0, and the distribution of X conditioned on {X > s} is the same as that of X + s, for every s ≥ 0.

5.4 The partitioning algorithm

Consider again a metric space (X, d) and a parameter ∆ > 0. Also recall our goal of producing a random partition whose sets have diameter at most ∆ and such that (5.1) holds with α = 8 ln n. We may assume that ∆ ≥ 4 ln n, else we are done.

For simplicity, let us assume that d(x, y) ∈ {0, 1, 2, . . . , n}. This will make the analysis slightly easier without sacrificing any of the essential details. Recall also that for x ∈ X and r ≥ 0, the ball of radius r around x is defined by

B(x, r) = {y ∈ X : d(x, y) ≤ r}.


Order the points X = {x_1, . . . , x_n} arbitrarily. We produce the following random partition. For each i = 1, 2, . . . , n, we choose an independent random variable R_i ∼ Geom(4 ln n / ∆), and set

C_i = B(x_i, R_i) \ ⋃_{j<i} C_j.

Thus the ith set is B(x_i, R_i) with the points that have already been clustered removed. Our partition is P = {C_1, C_2, . . . , C_n}.

Note that, as stated, the algorithm could potentially output a set of diameter bigger than ∆: there's even a chance that R_1 > ∆. If max_i R_i > ∆/2, we will instead output the partition P* = {{x} : x ∈ X} into singleton clusters. This ensures that the diameters of our sets are always bounded, and we will show that this eventuality happens only with very small probability. Let's use E to denote the event that max_i R_i > ∆/2.
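The procedure translates directly into code. The sketch below is an added illustration: the representation of the metric as a dict-of-dicts `dist` with integer distances and the cap on the geometric parameter are assumptions of the sketch, not part of the notes.

import math, random

def low_diameter_partition(points, dist, Delta):
    # parameter of the geometric radii; capped below 1 as a safeguard
    p = min(4 * math.log(len(points)) / Delta, 0.999)
    radii = {}
    for x in points:
        r = 1
        while random.random() >= p:      # flip coins until heads
            r += 1
        radii[x] = r
    if max(radii.values()) > Delta / 2:  # the bad event E
        return [[x] for x in points]     # fall back to the partition P*
    clusters, assigned = [], set()
    for x in points:                     # fixed arbitrary order x_1, ..., x_n
        C = [y for y in points
             if y not in assigned and dist[x][y] <= radii[x]]
        assigned.update(C)
        if C:
            clusters.append(C)
    return clusters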

Now fix x, y ∈ X. Let E_{x,y} be the event that x and y end up in different sets of the partition P. (We are ignoring P* for now.) We are interested in proving an upper bound on P(E_{x,y}). In order to do this, it's helpful to think about the process as "growing" balls around x_1, x_2, . . . in order until the whole space is partitioned. The growing picture is valid because the random radii are geometrically distributed: we can think of each center x_i flipping coins until one comes up heads, incrementing its radius by one each time the coin comes up tails.

With this picture in mind, it’s intuitive that we only need to start getting worried about Ex ,y

occurring when some ball B(xi , R) “reaches” one of x or y. Until then, there are lots of growingballs that die out before they ever see one of x or y.

Let's make this intuition precise. Let Z_i denote the event that {x, y} ∩ C_i ≠ ∅ and {x, y} ∩ C_j = ∅ for all j < i. In other words, C_i is the first set that contains one of x or y (and it possibly contains both). Then

P(E_{x,y}) = ∑_{i=1}^n P[Z_i] · P[E_{x,y} | Z_i].

So if we can show that P[E_{x,y} | Z_i] ≤ p for every i, it will imply that P[E_{x,y}] ≤ p.

Fix i. Let us suppose that d(x, x_i) ≤ d(y, x_i) (otherwise interchange the roles of x and y).

Next, note that Z_i ⇒ R_i ≥ d(x_i, x). Thus conditioned on Z_i, we have the following: If R_i ∈ [d(x_i, x), d(x_i, y)), then E_{x,y} occurs; otherwise R_i ≥ d(x_i, y) and E_{x,y} does not occur.

The memoryless property of the geometric distribution means that R_i − d(x_i, x), conditioned on {R_i ≥ d(x_i, x)}, again has law Geom(4 ln n / ∆). We conclude that

P[E_{x,y} | Z_i] ≤ P[R_i ∈ [d(x_i, x), d(x_i, y)) | R_i ≥ d(x_i, x)]
= P[R_i < d(x_i, y) | R_i ≥ d(x_i, x)]
= P[X < d(x_i, y) − d(x_i, x)],

where X ∼ Geom(4 ln n / ∆).

Using (5.2) and the fact that d(x_i, y) − d(x_i, x) ≤ d(x, y), we have

P[X < d(x_i, y) − d(x_i, x)] ≤ P[X < d(x, y)] = 1 − P[X ≥ d(x, y)] = 1 − (1 − 4 ln n/∆)^{d(x,y)−1}.

Computation yields

1 − (1 − 4 ln n/∆)^{d(x,y)−1} ≤ 1 − (1 − 4 ln n/∆)^{d(x,y)} ≤ 1 − (1 − d(x, y) · 4 ln n/∆) = (d(x, y)/∆) · 4 ln n,

where we have used the fact that (1 − ε)^k ≥ 1 − εk for all ε ∈ [0, 1] and k ≥ 1.

Thus P[E_{x,y}] ≤ (d(x, y)/∆) · 4 ln n. We are almost done, but recall that we sometimes output the partition P* instead of P. Thus

P[x and y are separated] ≤ P[E] + (d(x, y)/∆) · 4 ln n.

Using a union bound along with (5.2), we have

P[E] ≤ n · P[R_1 > ∆/2] ≤ n · (1 − 4 ln n/∆)^{∆/2} ≤ n · e^{−2 ln n} = 1/n.

Here, we have used the fact that (1 − 1/k)^k ≤ 1/e for k ≥ 1. We conclude that

P[x and y are separated] ≤ 1/n + (d(x, y)/∆) · 4 ln n ≤ (d(x, y)/∆) · 8 ln n,

using the fact that d(x, y)/∆ ≥ 1/n whenever x ≠ y (since d(x, y) ∈ {0, 1, 2, . . . , n} and we may assume ∆ ≤ n).

6 Low-distortion embeddings

Let (X, d) be a finite metric space with n = |X|. Recall that the distance function d : X × X → R_+ satisfies the axioms of a metric: For all x, y, z ∈ X:

1. d(x, y) = 0 ⟺ x = y
2. d(x, y) = d(y, x)
3. d(x, y) ≤ d(x, z) + d(z, y)

While properties (2) and (3) are essential for us, the implication d(x, y) = 0 ⇒ x = y is often not particularly important. If a distance function satisfies only (2), (3), and d(x, x) = 0, it is commonly called a pseudometric.

Metric spaces arise in a variety of mathematical and scientific domains since they abstract theproperties of many natural notions of “similarity” between objects. Consider, for instance, thelatency between nodes in a network, the travel-distance between cities, the edit distance betweengenetic sequences, or various similarity measures between proteins.

Often one might first try to understand a given metric space (X, d) by comparing it to a well-understood space. For instance, one could think about a mapping F : X → R^k into Euclidean space R^k equipped with the Euclidean norm ‖x‖₂ = √(x₁² + · · · + x_k²). One way to measure how well this mapping preserves the geometry of X is via the bilipschitz distortion. This is the smallest number D > 0 such that

d(x, y) ≤ ‖F(x) − F(y)‖₂ ≤ D · d(x, y)   ∀x, y ∈ X.   (6.1)

Today we will prove the following result.

Theorem 6.1 (Bourgain 1985). Every n-point metric space embeds into some Euclidean space Rk withbilipschitz distortion D, where D 6 O(log n).

We will show that this is possible with k ≤ O((log n)²), but in the next lecture, we will see that for general reasons, one can achieve k ≤ O(log n).


6.1 Distances to subsets

6.1.1 Fréchet’s embedding

Let us first show how to achieve the significantly worse bounds D ≤ √n and k = n. Enumerate the points X = {x_1, x_2, . . . , x_n} and let F : X → R^n be defined by F(x) = (F_1(x), . . . , F_n(x)), where

F_i(x) = d(x, x_i).

First, note that every coordinate is 1-Lipschitz: For all x, y ∈ X,

|F_i(x) − F_i(y)| = |d(x, x_i) − d(y, x_i)| ≤ d(x, y),

where we have used the triangle inequality. From this, we get

‖F(x) − F(y)‖₂² = ∑_{i=1}^n |F_i(x) − F_i(y)|² ≤ n · d(x, y)²,   (6.2)

implying that ‖F(x) − F(y)‖₂ ≤ √n · d(x, y) for all x, y ∈ X.

On the other hand, for any x, y ∈ X, choosing the index i with x_i = y, it holds that

‖F(x) − F(y)‖₂ ≥ |F_i(x) − F_i(y)| = d(x, x_i) = d(x, y).

Therefore (6.1) is satisfied with D = √n.
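A small Python sketch of Fréchet's embedding (added for illustration): the input `dist` is assumed to be an n × n matrix of pairwise distances satisfying the triangle inequality, and the check verifies non-contraction and expansion at most √n.

import math

def frechet_embedding(dist):
    # row x of the distance matrix is exactly F(x) = (d(x, x_1), ..., d(x, x_n))
    return [list(row) for row in dist]

def check_distortion(dist):
    n = len(dist)
    F = frechet_embedding(dist)
    worst = 1.0
    for x in range(n):
        for y in range(x + 1, n):
            emb = math.sqrt(sum((F[x][i] - F[y][i]) ** 2 for i in range(n)))
            assert dist[x][y] <= emb + 1e-9     # no contraction
            worst = max(worst, emb / dist[x][y])
    assert worst <= math.sqrt(n) + 1e-9         # expansion at most sqrt(n)
    return worst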

6.1.2 Bourgain’s embedding

To get improved distortion, we will construct our coordinates out of distances to subsets instead of distances to single points. For a subset S ⊆ X and x ∈ X, let us define

d(x, S) := min_{y∈S} d(x, y).

First, observe that such maps are also 1-Lipschitz: The triangle inequality yields d(x, S) ≤ d(y, S) + d(x, y), hence

|d(x, S) − d(y, S)| ≤ d(x, y)   ∀x, y ∈ X, S ⊆ X.   (6.3)

For some number m ≤ O(log n) that we will choose later, let

{S_{t,j} : t = 1, 2, . . . , ⌊log₂ n⌋,  j = 1, 2, . . . , m}

denote independent random subsets S_{t,j} ⊆ X, where S_{t,j} is formed by sampling every point of X independently with probability 2^{−t}. Our embedding is

F(x) = (d(x, S_{1,1}), . . . , d(x, S_{1,m}), d(x, S_{2,1}), . . . , d(x, S_{2,m}), . . . , d(x, S_{⌊log₂ n⌋,1}), . . . , d(x, S_{⌊log₂ n⌋,m})).
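The construction translates directly into code. The sketch below is an added illustration: `dist` is an assumed n × n distance matrix, `m` the number of repetitions per scale, and the treatment of an empty sample S_{t,j} is an arbitrary convention of the sketch.

import math, random

def bourgain_embedding(dist, m):
    n = len(dist)
    T = max(1, int(math.log2(n)))          # number of scales, floor(log2 n)
    coords = []
    for t in range(1, T + 1):
        for _ in range(m):
            S = [y for y in range(n) if random.random() < 2.0 ** (-t)]
            if not S:
                coords.append([0.0] * n)   # convention: d(x, empty set) = 0
            else:
                coords.append([min(dist[x][y] for y in S) for x in range(n)])
    # F(x) is the vector of x-th entries across all coordinates
    return [[c[x] for c in coords] for x in range(n)]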


From (6.3), we see that

‖F(x) − F(y)‖₂ ≤ √(m ⌊log₂ n⌋) · d(x, y)   ∀x, y ∈ X   (6.4)

(just as in (6.2)).

We move on to the lower bound. To this end, we define the closed and open balls: For R ≥ 0,

B(x, R) = {y ∈ X : d(x, y) ≤ R},   B°(x, R) = {y ∈ X : d(x, y) < R}.

Fix x, y ∈ X and for t = 1, 2, . . . , ⌊log₂ n⌋, let r_t be the smallest radius such that

max{|B(x, r_t)|, |B(y, r_t)|} ≥ 2^t.

Let t* be the smallest value of t such that r_t > d(x, y)/4, and reassign r_{t*} = d(x, y)/4. Note that

d(x, y)/4 = r_1 + (r_2 − r_1) + (r_3 − r_2) + · · · + (r_{t*} − r_{t*−1}).   (6.5)

We will use the sets St , j to get a contribution of rt − rt−1 to the lower bound, and therefore (6.5)shows we will get a contribution of Ω(d(x , y)).

So consider now some t ∈ {1, 2, . . . , t*}. For the sake of the analysis, let r_0 = 0. Note that, by the definition of r_{t−1}, we have at least one of |B(x, r_{t−1})| ≥ 2^{t−1} or |B(y, r_{t−1})| ≥ 2^{t−1}. Without loss of generality, assume that it holds for x. It is also true that |B°(y, r_t)| < 2^t. Let us summarize:

|B(x, r_{t−1})| ≥ 2^{t−1},   |B°(y, r_t)| < 2^t.

Let S_t ⊆ X be a random subset where every point is sampled independently with probability 2^{−t}. Consider the event

E_t = {S_t ∩ B(x, r_{t−1}) ≠ ∅ and S_t ∩ B°(y, r_t) = ∅}.

Notice that

E_t occurs ⇒ |d(x, S_t) − d(y, S_t)| ≥ r_t − r_{t−1}.   (6.6)

Claim 6.2. P(E_t) ≥ 1/12.

Proof. Observe that r_t ≤ d(x, y)/4, hence B(x, r_{t−1}) and B°(y, r_t) are disjoint. In particular, the two events composing E_t are independent, and it suffices to lower bound their probabilities separately. First, note that

P(S_t ∩ B(x, r_{t−1}) ≠ ∅) = 1 − P(S_t ∩ B(x, r_{t−1}) = ∅) = 1 − (1 − 2^{−t})^{|B(x, r_{t−1})|} ≥ 1 − (1 − 2^{−t})^{2^{t−1}} ≥ 1 − 1/√e ≥ 1/3,

where we have used the fact that (1 − 1/k)^k ≤ 1/e for k ≥ 1. Next, calculate

P(S_t ∩ B°(y, r_t) = ∅) = (1 − 2^{−t})^{|B°(y, r_t)|} ≥ (1 − 2^{−t})^{2^t} ≥ 1/4,

where we have used (1 − 1/k)^k ≥ 1/4 for k ≥ 2.


Now let E_{t,j} be the event corresponding to (6.6) for the set S_{t,j}.

Corollary 6.3. If Ω(m) of the events {E_{t,j} : j = 1, . . . , m} occur, then

‖F(x) − F(y)‖₂² ≥ Ω(m) (r_t − r_{t−1})².

We can say something more: If it holds that

Ω(m) of the events {E_{t,j} : j = 1, . . . , m} occur for every t = 1, 2, . . . , ⌊log₂ n⌋,   (6.7)

then, since the contributions come from disjoint sets of coordinates,

‖F(x) − F(y)‖₂² ≥ Ω(m) ∑_{t=1}^{t*} (r_t − r_{t−1})² ≥ Ω(m/t*) (∑_{t=1}^{t*} (r_t − r_{t−1}))² ≥ Ω(m/t*) d(x, y)² ≥ Ω(m/log n) d(x, y)².

The second inequality is Cauchy-Schwarz, and the third is from (6.5).

Combining this with (6.4), our map has distortion O(log n) as long as we choose m large enough so that (6.7) holds with probability, say, 1 − 1/n³. That's because we can then take a union bound over all pairs x, y ∈ X. But since each event E_{t,j} occurs with probability at least 1/12, a simple Chernoff bound shows that choosing some m ≤ O(log n) suffices.

7 The curse of dimensionality and dimension reduction

The "curse" of dimensionality refers to the fact that many algorithmic approaches to problems in R^k become exponentially more difficult as k grows. This is essentially due to volume growth: the k-dimensional Euclidean ball of radius r satisfies vol_k(B(0, 2r)) = 2^k vol_k(B(0, r)).

On the other hand, in high dimensions, the concentration of measure phenomenon can come toour aid: Sufficiently smooth functionals on Rk are tightly concentrated around their expected value.A prototypical example is the Johnson-Lindenstrauss dimension reduction lemma.

7.1 The Johnson-Lindenstrauss lemma

Lemma 7.1 (Johnson-Lindenstrauss). For every n ≥ 1 and every n-point subset X ⊆ R^n, the following holds. For every ε > 0, there is a linear map A : R^n → R^k such that

(1 − ε)‖x − y‖₂² ≤ ‖A(x) − A(y)‖₂² ≤ (1 + ε)‖x − y‖₂²   ∀x, y ∈ X,   (7.1)

and k ≤ 24 ln n / ε².

Proof. We may assume that 0 < ε < 1/2. We will define a random linear map A : R^n → R^k and then argue that A satisfies (7.1) with high probability. Let {X_i^{(j)} : i = 1, . . . , k, j = 1, . . . , n} be a family of i.i.d. N(0, 1) random variables, and define a k × n matrix A by A_{ij} = X_i^{(j)}/√k.

Claim 7.2. For every u ∈ R^n with ‖u‖₂ = 1, we have

P[‖Au‖₂² ∉ [1 − ε, 1 + ε]] ≤ 2e^{−ε²k/8}.

Let us finish the proof of Lemma 7.1 and then prove the claim. By setting u = (x − y)/‖x − y‖₂ for every pair x, y ∈ X and taking a union bound over these (at most) n² pairs, we have

P[(1 − ε)‖x − y‖₂² ≤ ‖A(x − y)‖₂² ≤ (1 + ε)‖x − y‖₂²  ∀x, y ∈ X] ≥ 1 − 2n² e^{−ε²k/8}.

Taking k = 24 ln n / ε², the latter probability is at least 1 − 2/n, showing that the desired map A exists (and can be found by a randomized algorithm).
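Here is a minimal numerical sketch of the random map in the proof (added for illustration; all sizes are arbitrary choices): a k × n Gaussian matrix scaled by 1/√k applied to an arbitrary point set, with the squared-distance ratios printed for a few pairs.

import numpy as np

rng = np.random.default_rng(0)
n_points, dim, eps = 200, 1000, 0.5
k = int(24 * np.log(n_points) / eps**2)
X = rng.normal(size=(n_points, dim))
A = rng.normal(size=(k, dim)) / np.sqrt(k)
Y = X @ A.T

for _ in range(5):
    i, j = rng.integers(n_points, size=2)
    if i == j:
        continue
    ratio = np.sum((Y[i] - Y[j])**2) / np.sum((X[i] - X[j])**2)
    print(round(float(ratio), 3))   # should lie in [1 - eps, 1 + eps]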


Proof of Claim 7.2. Note first that

‖Au‖₂² = (1/k) ∑_{i=1}^k (∑_{j=1}^n u_j X_i^{(j)})².

By the 2-stability property of independent N(0, 1) random variables, each ∑_{j=1}^n u_j X_i^{(j)} again has distribution N(0, 1). Thus we have

‖Au‖₂² = (1/k) ∑_{i=1}^k Y_i²,

where the Y_i are i.i.d. N(0, 1).

Since ‖Au‖₂² is a sum of independent random variables, it makes sense to use the method of Laplace transforms. So far we have only done this for bounded random variables, but since N(0, 1) variables have a quickly diminishing tail, we can hope that the method will work here as well.

For some parameter λ > 0, we have

P[‖Au‖₂² ≥ 1 + ε] = P[∑_{i=1}^k (Y_i² − 1) ≥ εk] = P[e^{λ ∑_{i=1}^k (Y_i² − 1)} ≥ e^{ελk}]
≤ E[e^{λ ∑_{i=1}^k (Y_i² − 1)}] / e^{ελk} = ∏_{i=1}^k E[e^{λ(Y_i² − 1)}] / e^{ελk}.

Thus our remaining goal is to bound E[e^{λ(Y² − 1)}] when Y is N(0, 1). First, observe that

E[e^{λY²}] = (1/√(2π)) ∫_{−∞}^{∞} e^{−t²/2} e^{λt²} dt = 1/√(1 − 2λ)   (7.2)

for 0 < λ < 1/2. (Clearly the integral is divergent for λ ≥ 1/2.) Thus in this range of λ,

E[e^{λ(Y² − 1)}] = e^{−λ}/√(1 − 2λ).

Let us finally calculate (using the Taylor expansion log(1 − x) = −∑_{k≥1} x^k/k):

log E[e^{λ(Y² − 1)}] = −λ − (1/2) log(1 − 2λ) ≤ 2λ² (1 + (2λ) + (2λ)² + · · ·) = 2λ²/(1 − 2λ).

Therefore:

P[‖Au‖₂² ≥ 1 + ε] ≤ e^{2λ²k/(1−2λ) − ελk}.

Choosing λ = ε/4 and using 0 < ε < 1/2, we conclude that P[‖Au‖₂² ≥ 1 + ε] ≤ e^{−ε²k/8}, completing the proof. A similar argument shows that P[‖Au‖₂² ≤ 1 − ε] ≤ e^{−ε²k/8}.


Remark 7.3. A simple volume bound shows that the Θ(log n) dependence in the dimension is necessary. A linear-algebraic argument of Alon shows that the dimension must be at least Ω(log n / (ε² log(1/ε))). Very recently (FOCS 2017), Larsen and Nelson established a lower bound of Ω(log n / ε²), showing that Lemma 7.1 is tight up to the constant factor.

Remark 7.4. Lemma 7.1 actually works for any independent family {X_i^{(j)}} where each random variable satisfies a sub-Gaussian tail bound: E[e^{αX}] ≤ e^{Cα²} for some constant C ≥ 1. For instance, if X is a uniform ±1 random variable, then E[e^{αX}] = (e^α + e^{−α})/2 ≤ e^{α²/2} (recall that to prove this, you should Taylor expand both sides).

The proof uses a clever trick. Note that if Z is a N(0, 1) random variable (independent of X), then E[e^{αZ}] = e^{α²/2} for any α. Now write

E[e^{λX²}] = E[e^{(√(2λ) X)²/2}] = E[e^{√(2λ) Z X}] ≤ E[e^{2CλZ²}] = 1/√(1 − 4Cλ),

for 0 < λ < 1/(4C), where the last equality is exactly the identity we proved in (7.2). Given this bound, we can finish the proof just as in Claim 7.2.

8 Compressive sensing and the RIP

The reason that extreme compression of photographs or audio recordings is possible is that thecorresponding images are often sparse in the correct basis (e.g., the Fourier or wavelet basis). Thusone can take a very detailed photo and then zero out all the small coefficients, vastly compressingthe image while also preserving the bulk of the important information.

Problematically, despite only recording a small amount of information at the end (say, s largeFourier coefficients), in order to figure out which coefficients to save, we had to perform a verydetailed measurement (making our camera pretty expensive). Compressive sensing is the idea that, ifwe do a few random linear measurements, then we can capture the large coefficients without firstknowing what they are.

Sparse recovery. Let us formalize the sparse recovery problem. Our signal will be a point x ∈ R^n, and we will have a linear measurement map Φ : R^n → R^m that makes m linear measurements, where hopefully m ≪ n. Say that a signal x ∈ R^n is s-sparse if ‖x‖₀ ≤ s, where ‖·‖₀ denotes the number of non-zero coordinates of its argument. For s-sparse signals x to be uniquely recoverable from the measurements Φ(x), the following property is necessary and sufficient: For every pair of distinct s-sparse vectors x, y ∈ R^n, it holds that Φ(x) ≠ Φ(y).

Given the measurements M = Φ(x), we might want to recover the unique corresponding s-sparse vector x. It would be natural to solve the following optimization: min ‖y‖₀ subject to Φ(y) = M. Clearly the optimizer y* satisfies ‖y*‖₀ ≤ s, so by the unique encoding property for s-sparse vectors and the fact that Φ(x) = Φ(y*), it must be that x = y*. Unfortunately, ℓ₀ optimization subject to linear constraints is an NP-hard problem.

Instead, one often solves the problem: min ‖y‖₁ subject to Φ(y) = M. This is a linear program and can thus be solved efficiently. It is often referred to as the "basis pursuit" algorithm. Remarkably, if we choose the map Φ appropriately, then the optimum solution y* satisfies x = y*, yielding an efficient algorithm for sparse recovery.
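The following sketch (added for illustration) writes basis pursuit as a linear program over y = y⁺ − y⁻ with y⁺, y⁻ ≥ 0 and solves it with scipy's linprog; the Gaussian measurement matrix and the chosen sizes are illustrative assumptions, not part of the notes.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m, s = 200, 60, 5
Phi = rng.normal(size=(m, n)) / np.sqrt(m)
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.normal(size=s)
M = Phi @ x

c = np.ones(2 * n)                 # objective: sum(y_plus) + sum(y_minus) = ||y||_1
A_eq = np.hstack([Phi, -Phi])      # constraint: Phi (y_plus - y_minus) = M
res = linprog(c, A_eq=A_eq, b_eq=M, bounds=[(0, None)] * (2 * n))
y = res.x[:n] - res.x[n:]
print("recovery error:", np.linalg.norm(y - x))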


8.1 The restricted isometry property

We will now formalize the property of the map Φ : R^n → R^m that makes efficient sparse recovery possible. For s ≥ 1, let δ_s = δ_s(Φ) be the smallest number such that for every s-sparse vector x ∈ R^n, we have

(1 − δ_s)‖x‖₂² ≤ ‖Φ(x)‖₂² ≤ (1 + δ_s)‖x‖₂².   (8.1)

It will help to think about this parameter in a slightly different way as well. Let T ⊆ [n] index a subset of |T| = s columns of Φ (thought of as an m × n matrix). Let Φ_T : R^s → R^m be the linear map corresponding to the matrix formed from the columns of Φ indexed by T. Then the above property is equivalent to requiring that for every |T| = s and x ∈ R^s, we have

(1 − δ_s)‖x‖₂² ≤ ‖Φ_T(x)‖₂² ≤ (1 + δ_s)‖x‖₂².   (8.2)

Theorem 8.1. If δ_{2s}(Φ) < 1, then Φ has the unique recovery property for s-sparse vectors. If δ_{2s}(Φ) < √2 − 1, then ℓ₁-minimization performs s-sparse recovery.

Proof. We will prove only the first assertion. Suppose that x, y ∈ R^n are s-sparse vectors. Then x − y is 2s-sparse, hence if Φ(x) = Φ(y), then (8.1) gives

0 = ‖Φ(x − y)‖₂² ≥ (1 − δ_{2s})‖x − y‖₂²,

and therefore x = y when δ_{2s} < 1.

8.2 Random construction of RIP matrices

Let us define the m × n random matrix Φ by setting Φ_{ij} = X_i^{(j)}/√m, where {X_i^{(j)}} is a family of i.i.d. N(0, 1) random variables. With high probability, this matrix will have the RIP for appropriately chosen parameters.

Theorem 8.2. For every n ≥ s ≥ 1 and 0 < δ < 1, there is an m ≤ O((s/δ²) log(n/s) + s log(1/δ)) such that with high probability, δ_s(Φ) ≤ δ.

Proof. Fix a subset T ⊆ [n] with |T| = s. Let E_T denote the event that ‖Φ_T(x)‖₂ ∈ [1 − δ, 1 + δ] for all x ∈ R^s with ‖x‖₂ = 1. We will show that

P[E_T] ≥ 1 − 2 (16/δ)^s e^{−δ²m/48}.   (8.3)

Assuming this is true, we can take a union bound over all T with |T| = s, yielding

P[δ_s(Φ) ≤ δ] = P[E_T for every T ⊆ [n], |T| = s] ≥ 1 − 2 (16/δ)^s e^{−δ²m/48} (n choose s).

Using the fact that log (n choose s) ≤ O(s log(n/s)), we can conclude by choosing m as in the theorem statement so that this probability is at least, say, 1 − 1/n.

Thus we are left to prove (8.3). Let N be a δ/4-net of the unit sphere in R^s: a collection of unit vectors such that for every x ∈ R^s with ‖x‖₂ = 1, there is an x′ ∈ N with ‖x − x′‖₂ ≤ δ/4. A simple volume argument shows we can choose such a net with |N| ≤ (16/δ)^s.

Now using Claim 7.2 (Claim 1.2 of Lecture 10) and a union bound over N, we have

P(∀x ∈ N, ‖Φ_T(x)‖₂ ∈ [1 − δ/4, 1 + δ/4]) ≥ 1 − 2 (16/δ)^s e^{−δ²m/48}.


We are left to show that ‖Φ_T(x)‖₂ ∈ [1 − δ/4, 1 + δ/4] for all x ∈ N implies ‖Φ_T(x)‖₂ ∈ [1 − δ, 1 + δ] for all x ∈ R^s with ‖x‖₂ = 1.

This uses a clever trick. We will define a sequence of points {x_i : i ≥ 0} such that x_i/‖x_i‖₂ ∈ N for every i ≥ 0. For any y ∈ R^s, let Γ(y) = ‖y‖₂ · y′, where y′ ∈ N is the closest point of N to y/‖y‖₂. Note that by the net property, we have ‖y − Γ(y)‖₂ ≤ (δ/4)‖y‖₂.

Consider now some x with ‖x‖₂ = 1. Define x_0 := Γ(x), x_1 := Γ(x − x_0), and so on:

x_{i+1} := Γ(x − (x_0 + · · · + x_i)).

Then:

x = x_0 + (x − x_0) = x_0 + x_1 + (x − x_0 − x_1) = · · · = ∑_{i=0}^∞ x_i.

By a simple induction, we have ‖x_i‖₂ ≤ (δ/4)^i, and by construction, x_i/‖x_i‖₂ ∈ N for every i ≥ 0. Now we can use our assumption that ‖Φ_T(y)‖₂ ∈ [1 − δ/4, 1 + δ/4] for every unit vector y ∈ N to write

‖Φ_T(x)‖₂ ≤ ∑_{i=0}^∞ ‖Φ_T(x_i)‖₂ ≤ (1 + δ/4) ∑_{i=0}^∞ (δ/4)^i = (1 + δ/4)/(1 − δ/4) ≤ 1 + δ,

where the last inequality follows from δ < 1. For the other side, write

‖Φ_T(x)‖₂ ≥ ‖Φ_T(x_0)‖₂ − ∑_{i=1}^∞ ‖Φ_T(x_i)‖₂ ≥ (1 − δ/4) − (δ/4)(1 + δ/4) ∑_{i=0}^∞ (δ/4)^i = 1 − δ/4 − δ(1 + δ/4)/(4(1 − δ/4)) ≥ 1 − δ,

where we again used δ < 1. We have thus confirmed (8.3), completing the proof.

Remark 8.3. Note that we must always perform at least s "measurements," even if we know exactly the s important coordinates. The preceding theorem says that we can do unique (and efficient) recovery with only O(s log n) measurements without knowing anything about the input signal except that it's s-sparse.

Remark 8.4. In a more realistic model, we might expect that our signal is of the form x = x_s + y, where x_s is s-sparse and ‖y‖₂ ≤ ε‖x‖₂. In other words, the signal has s large coordinates plus "noise." The RIP and basis pursuit algorithms can also be used to provide guarantees in this setting.

9 Concentration for sums of random matrices

Consider the graph sparsification problem: Given a graph G = (V, E), we want to approximate G (in a sense to be defined later) by a sparse graph H = (V, E′). Generally we would like that E′ ⊆ E and moreover that |E′| is as small as possible, say O(n) or O(n log n), where n = |V|. We will be able to do this by choosing a (nonuniform) random sample of the edges, but to analyze such a process, we will need a large-deviation inequality for sums of random matrices.

9.1 Symmetric matrices

If A is a d × d real symmetric matrix, then A has all real eigenvalues, which we can order λ_1(A) ≥ λ_2(A) ≥ · · · ≥ λ_d(A). The operator norm of A is

‖A‖ := max_{‖x‖₂=1} ‖Ax‖₂ = max{|λ_i(A)| : i ∈ {1, . . . , d}}.


The trace of A is Tr(A) = ∑_{i=1}^d A_{ii} = ∑_{i=1}^d λ_i(A). The trace norm of A is ‖A‖_* = ∑_{i=1}^d |λ_i(A)|. A symmetric matrix is positive semidefinite (PSD) if all its eigenvalues are nonnegative. Note that for a PSD matrix A, we have Tr(A) = ‖A‖_*. We also recall the matrix exponential e^A = ∑_{k=0}^∞ A^k/k!, which is well-defined for every real symmetric A and is itself a real symmetric matrix. If A is symmetric, then e^A is always PSD, as the next argument shows.

Every real symmetric matrix can be diagonalized, writing A = U^T D U, where U is an orthogonal matrix (i.e., UU^T = U^T U = I) and D is diagonal. One can easily check that A^k = U^T D^k U for any k ∈ N, thus A^k and A are simultaneously diagonalizable. It follows that A and e^A are simultaneously diagonalizable. In particular, we have λ_i(e^A) = e^{λ_i(A)}.

Finally, note that for symmetric matrices A and B, we have |Tr(AB)| ≤ ‖A‖ · ‖B‖_*. To see this, let {u_i} be an orthonormal basis of eigenvectors of B with Bu_i = λ_i(B) u_i. Then

|Tr(AB)| = |∑_{i=1}^d ⟨u_i, A B u_i⟩| = |∑_{i=1}^d λ_i(B) ⟨u_i, A u_i⟩| ≤ ∑_{i=1}^d |λ_i(B)| · ‖A‖ = ‖B‖_* ‖A‖.

Many classical statements are either false or significantly more difficult to prove when translated to the matrix setting. For instance, while e^{x+y} = e^x e^y = e^y e^x is true for arbitrary real numbers x and y, it is only the case that e^{A+B} = e^A e^B if A and B are simultaneously diagonalizable. However, somewhat remarkably, a matrix analog does hold if we work inside the trace.

Theorem 9.1 (Golden-Thompson inequality). If A and B are real symmetric matrices, then

Tr(e^{A+B}) ≤ Tr(e^A e^B).

Proof. We can prove this using the non-commutative Hölder inequality: For any even integer p ≥ 2 and real symmetric matrices A_1, A_2, . . . , A_p,

Tr(A_1 A_2 · · · A_p) ≤ ‖A_1‖_{S_p} ‖A_2‖_{S_p} · · · ‖A_p‖_{S_p},

where ‖A‖_{S_p} = (Tr((A^T A)^{p/2}))^{1/p} is the Schatten p-norm. Consider real symmetric matrices U, V. Applying this with A_1 = · · · = A_p = UV gives, for every even p ≥ 2:

Tr((UV)^p) ≤ ‖UV‖_{S_p}^p = Tr((V^T U^T U V)^{p/2}) = Tr((V U² V)^{p/2}) = Tr((U² V²)^{p/2}),

where the last equality uses the cyclic property of the trace. Applying this inequality repeatedly now yields, for every even p,

Tr((UV)^p) ≤ Tr(U^p V^p).

If we now take U = e^{A/p} and V = e^{B/p}, this gives

Tr((e^{A/p} e^{B/p})^p) ≤ Tr(e^A e^B).   (9.1)

For p large, we can use the Taylor approximation e^{A/p} = I + A/p + O(1/p²), and similarly for e^{B/p}. Thus e^{A/p} e^{B/p} = e^{(A+B)/p + O(1/p²)}. Therefore taking p → ∞ in (9.1) gives

Tr(e^{A+B}) ≤ Tr(e^A e^B).
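Since the inequality is easy to test numerically, here is a small sanity check (added for illustration, not part of the proof) on random symmetric matrices using scipy's matrix exponential.

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
for _ in range(5):
    A = rng.normal(size=(6, 6)); A = (A + A.T) / 2   # random symmetric matrices
    B = rng.normal(size=(6, 6)); B = (B + B.T) / 2
    lhs = np.trace(expm(A + B))
    rhs = np.trace(expm(A) @ expm(B))
    assert lhs <= rhs + 1e-8
    print(round(float(lhs), 3), "<=", round(float(rhs), 3))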


9.2 The method of exponential moments for matrices

We will now consider a random d × d real matrix X. The entries X_{ij} of X are all (not necessarily independent) random variables. We have seen inequalities (like those named after Chernoff and Azuma) which assert that if X = X_1 + X_2 + · · · + X_n is a sum of independent random numbers, then X is tightly concentrated around its mean. Our goal now is to prove a similar fact for sums of independent random symmetric matrices.

First, observe that the trace is a linear operator; this is easy to see from the fact that it is the sum of the diagonal entries of its argument. If A and B are arbitrary real matrices, then Tr(A + B) = Tr(A) + Tr(B). This implies that if X is a random matrix, then E[Tr(X)] = Tr(E[X]). Note that E[X] is the matrix defined by (E[X])_{ij} = E[X_{ij}].

Suppose that X_1, X_2, . . . , X_n are independent random real symmetric matrices. Let X = X_1 + X_2 + · · · + X_n, and let S_k = X_1 + · · · + X_k be the partial sum of the first k terms, so that X = S_n. Our first goal will be to bound the probability that X has an eigenvalue bigger than t. To do this, we will try to extend the method of exponential moments to work with symmetric matrices, as discovered by Ahlswede and Winter. It is much simpler than previous approaches that only worked for special cases.

Note that for β > 0, we have λ_i(e^{βX}) = e^{βλ_i(X)}. Therefore:

P[max_i λ_i(X) ≥ t] = P[max_i λ_i(e^{βX}) ≥ e^{βt}] ≤ P[Tr(e^{βX}) ≥ e^{βt}],   (9.2)

where the last inequality uses the fact that all the eigenvalues of e^{βX} are nonnegative, hence Tr(e^{βX}) = ∑_i λ_i(e^{βX}) ≥ max_i λ_i(e^{βX}).

Now Markov's inequality implies that

P[Tr(e^{βX}) ≥ e^{βt}] ≤ E[Tr(e^{βX})] / e^{βt}.   (9.3)

As in our earlier uses of the Laplace transform, our goal is now to bound E[Tr(eβX)] by a productthat has one factor for each term Xi .

In the matrix setting, this is more subtle. Using Theorem 9.1,

E[Tr(e^{βX})] = E[Tr(e^{β(S_{n−1} + X_n)})] ≤ E[Tr(e^{βS_{n−1}} e^{βX_n})].

Now we push the expectation over X_n inside the trace:

E[Tr(e^{βS_{n−1}} e^{βX_n})] = E[Tr(e^{βS_{n−1}} E[e^{βX_n} | X_1, . . . , X_{n−1}])] = E[Tr(e^{βS_{n−1}} E[e^{βX_n}])],

where we have used independence to pull e^{βS_{n−1}} outside the inner expectation and then to remove the conditioning. Finally, we use the fact that Tr(AB) ≤ ‖A‖ · ‖B‖_* and that ‖B‖_* = Tr(B) when B has all nonnegative eigenvalues (as is the case for e^{βS_{n−1}}):

E[Tr(e^{βS_{n−1}} E[e^{βX_n}])] ≤ ‖E[e^{βX_n}]‖ · E[Tr(e^{βS_{n−1}})].

Doing this n times yields

E[Tr(e^{βX})] ≤ Tr(I) ∏_{i=1}^n ‖E[e^{βX_i}]‖ = d ∏_{i=1}^n ‖E[e^{βX_i}]‖.


Combining this with (9.2) and (9.3) yields

P[max_i λ_i(X) ≥ t] ≤ e^{−βt} d ∏_{i=1}^n ‖E[e^{βX_i}]‖.

We can also apply this to −X to get

P[‖X‖ ≥ t] ≤ e^{−βt} d (∏_{i=1}^n ‖E[e^{βX_i}]‖ + ∏_{i=1}^n ‖E[e^{−βX_i}]‖).   (9.4)

9.3 Large-deviation bounds

Let Y be a random, symmetric, PSD d × d matrix with E[Y] = I. Suppose that ‖Y‖ ≤ L with probability one.

Theorem 9.2. If Y_1, Y_2, . . . , Y_n are i.i.d. copies of Y, then for any ε ∈ (0, 1) the following holds. Let λ_1, λ_2, . . . , λ_d denote the eigenvalues of (1/n) ∑_{i=1}^n Y_i. Then

P[{λ_1, λ_2, . . . , λ_d} ⊆ [1 − ε, 1 + ε]] ≥ 1 − 2d exp(−ε²n/(4L)).

There is a slightly nicer way to write this using the Löwner ordering of symmetric matrices: We write A ⪰ B to denote that the matrix A − B is positive semidefinite. We can rewrite the conclusion of Theorem 9.2 as

P[(1 − ε) I ⪯ (1/n) ∑_{i=1}^n Y_i ⪯ (1 + ε) I] ≥ 1 − 2d exp(−ε²n/(4L)).   (9.5)

Proof of Theorem 9.2. Define X_i := Y_i − E[Y_i] and X := X_1 + · · · + X_n. Then (9.5) is equivalent to

P[‖X‖ ≥ εn] ≤ 2d exp(−ε²n/(4L)).

We know from (9.4) that it will suffice to bound ‖E[e^{βX_i}]‖ for each i. To do this, we will use the fact that

1 + x ≤ e^x ≤ 1 + x + x²   ∀x ∈ [−1, 1].

Note that if A is a real symmetric matrix, then since I, A, A², and e^A are simultaneously diagonalizable, this yields

I + A ⪯ e^A ⪯ I + A + A²   (9.6)

for any A with ‖A‖ ≤ 1.

‖A‖ max‖x‖21 |xTAx |. So consider some x ∈ Rd with ‖x‖2 1 and write

|xT Xi x | |xTYi x − xT E[Yi]x |n

6 |xTYi x | 6 L ,

where we have used the fact that since Yi is PSD, so is E[Yi], and thus xT E[Yi]x and xTYi x are bothnonnegative. We also used our assumption that ‖Yi‖ 6 L. We conclude that ‖Xi‖ 6 L.

Moreover, we have

E[X_i²] = E[(Y_i − E[Y_i])²] = E[Y_i²] − (E[Y_i])² ⪯ E[Y_i²] ⪯ E[‖Y_i‖ Y_i] ⪯ L E[Y_i],

where in the final step we again used the assumption ‖Y_i‖ ≤ L. Finally, since E[Y_i] = I, we conclude that ‖E[X_i²]‖ ≤ L.

Therefore for any β ≤ 1/L, we can apply (9.6), yielding

E[e^{βX_i}] ⪯ I + β E[X_i] + β² E[X_i²] = I + β² E[X_i²] ⪯ e^{β² E[X_i²]}.

We conclude that for β ≤ 1/L, we have ‖E[e^{βX_i}]‖ ≤ e^{β²L}.

Plugging this into (9.4), we see that

P[‖X‖ ≥ εn] ≤ 2d e^{−εnβ} e^{β²Ln}.

Choosing β := ε/(2L) yields

P[‖X‖ ≥ εn] ≤ 2d e^{−ε²n/(4L)},

completing the argument.
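A small experiment (added for illustration, with arbitrary parameters) in the setting of Theorem 9.2: take Y = d · vv^T for a uniformly random unit vector v, so that E[Y] = I and ‖Y‖ = d = L, and watch the eigenvalues of the empirical mean tighten around 1 as n grows.

import numpy as np

rng = np.random.default_rng(3)
d = 10                      # here L = d
for n in (100, 1000, 10000):
    S = np.zeros((d, d))
    for _ in range(n):
        v = rng.normal(size=d); v /= np.linalg.norm(v)
        S += d * np.outer(v, v)          # one sample Y_i with E[Y_i] = I
    eig = np.linalg.eigvalsh(S / n)
    print(n, round(float(eig.min()), 3), round(float(eig.max()), 3))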

Finally, we can prove a generalization of Theorem 9.2 for random matrices whose expectation isnot the identity.

Theorem 9.3. Let Z be a d × d random real, symmetric, PSD matrix. Suppose also that Z ⪯ L · E[Z] (with probability one) for some L ≥ 1. If Z_1, Z_2, . . . , Z_n are i.i.d. copies of Z, then for any ε > 0, it holds that

P[(1 − ε) E[Z] ⪯ (1/n) ∑_{i=1}^n Z_i ⪯ (1 + ε) E[Z]] ≥ 1 − 2d e^{−ε²n/(4L)}.

In other words, the empirical mean (1/n) ∑_{i=1}^n Z_i is very close (in a spectral sense) to the expectation E[Z].

Proof of Theorem 9.3. This is made difficult only because it may be that A = E[Z] is not invertible. Suppose that A = U D U^T, where D is a diagonal matrix D = diag(λ_1, λ_2, . . . , λ_k, 0, . . . , 0), k is the rank of A, and λ_i ≠ 0 for i = 1, . . . , k. Then the pseudoinverse of A is defined by

A^+ = U diag(λ_1^{−1}, λ_2^{−1}, . . . , λ_k^{−1}, 0, . . . , 0) U^T.

Note that A A^+ = A^+ A = I_{im(A)}, where I_{im(A)} denotes the operator that acts as the identity on the image of A and annihilates ker(A).

Since A is PSD, A^+ is also PSD, and we can define A^{+/2} as the square root of A^+. One can write this explicitly as

A^{+/2} = U diag(λ_1^{−1/2}, λ_2^{−1/2}, . . . , λ_k^{−1/2}, 0, . . . , 0) U^T.

Now to prove Theorem 9.3, it suffices to apply Theorem 9.2 to the matrices Y_i = A^{+/2} Z_i A^{+/2}. Verification is left as an exercise.

10 Spectral sparsification

Consider the graph sparsification problem: Given a graph G = (V, E), we want to approximate G (in a sense to be defined shortly) by a sparse graph H = (V, E′). Generally we would like that E′ ⊆ E and moreover that |E′| is as small as possible, say O(n) or O(n log n), where n = |V|.


10.1 Laplacians of graphs

In everything that follows, we will consider n-vertex graphs with vertex set V 1, 2, . . . , n.For an edge e ∈ E with e i , j and i < j, we define the vector xe ei − e j where ei are thestandard basis vectors in Rn . We also define the n × n matrix Le xe xT

e . Notice that this matrix has(Le)ii (Le) j j 1 and (Le)i j (Le) ji −1; the rest of the entries are zero. This matrix has rank oneand is positive semidefinite: For every v ∈ Rn , we have

vT Le v (vi − v j)2 .Now for a graph G (V, E), we define the (combinatorial) Laplacian of G by

LG

∑e∈E

Le .

It should be clear that LG is also PSD (since it is a nonnegative sum of PSD matrices) and

vT LGv

∑i , j∈E

(vi − v j)2 . (10.1)

If G is equipped with nonnegative edge weights we > 0 : e ∈ E, we define the correspondingweighted Laplacian by LG

∑e∈E we Le .

10.1.1 Spectral sparsification

Spielman and Teng introduced the following notion of spectral graph approximation. Consider (possibly weighted) graphs H and G. We say that H ε-spectrally approximates G for some ε > 0 if

(1 − ε) L_G ⪯ L_H ⪯ (1 + ε) L_G.   (10.2)

Recall that this is equivalent to requiring that for every v ∈ R^n, we have

(1 − ε) v^T L_G v ≤ v^T L_H v ≤ (1 + ε) v^T L_G v.

From this expression, we see that spectral approximation is stronger than cut approximation. Indeed, consider any subset of vertices S ⊆ V and the corresponding characteristic vector v = 1_S. Then v^T L_G v = w_G(E(S, S̄)) and v^T L_H v = w_H(E(S, S̄)), where these two expressions denote the total weight of the edges of G crossing the cut (S, S̄) and the total weight of the edges of H crossing the cut (S, S̄), respectively. In particular, (10.2) entails that the weight of every cut in H is within a 1 ± ε factor of the weight of the corresponding cut in G.

10.2 Random sampling

We will prove the following theorem.

Theorem 10.1 (Spielman-Srivastava 2008). For every ε > 0, the following holds. For every unweighted, connected graph G = (V, E), there exists a weighted graph H = (V, E′) such that E′ ⊆ E, |E′| ≤ O(n log n / ε²), and H ε-spectrally approximates G.

To see that weights are necessary in Theorem 10.1, consider the case when G is an n-clique. Inthat case, we will need to put large weights on any sparse graph H that approximates G (recallingthat, in particular, the weight of every cut in H should be close to the size of the corresponding cutin G).


A sampling algorithm. Suppose we have probabilities p_e > 0 such that ∑_{e∈E} p_e = 1. Then we can consider the following algorithm: Set all edge weights w_e := 0 for every e ∈ E. For i = 1, 2, . . . , k, sample an edge e (independently of previous choices) with probability p_e and update

w_e := w_e + 1/(k p_e).

Let H be the corresponding (random) weighted graph.

It is easy to see that L_H = ∑_{i=1}^k Z_i, where Z_i = (1/(k p_{e(i)})) L_{e(i)} and e(i) is the edge sampled in the ith iteration. Moreover, we can calculate, for any i ∈ {1, . . . , k}:

E[Z_i] = ∑_{e∈E} p_e · (1/(k p_e)) L_e = (1/k) L_G.

Therefore E[L_H] = L_G: the expectation of our "approximator" equals L_G. Moreover, our approximator L_H is a sum of i.i.d. random matrices. In order to achieve (10.2) using a small value of k, we need concentration for this sum. And in order to achieve concentration using the approach of Lecture 14, we need a bound on the eigenvalues of the individual summands.

Claim 10.2. For any κ ≥ 1, if Z_i ⪯ κ E[Z_i] (with probability one), then we can choose k = O((κ/ε²) log n) and achieve (10.2) with high probability.

Proof. Theorem 9.3 (Theorem 1.3 of Lecture 14) states that under our assumptions, we have

P[(1 − ε) L_G ⪯ ∑_{i=1}^k Z_i ⪯ (1 + ε) L_G] ≥ 1 − 2n e^{−ε²k/(4κ)}.

Plugging in a suitable k = O((κ/ε²) log n), we can make this expression at least 1 − 1/n.

So we are left to choose sampling probabilities p_e that guarantee Z_i ⪯ κ E[Z_i] for some reasonable value of κ. A natural choice would be uniform sampling: p_e = 1/|E|. But this can fail to give non-trivial bounds: Suppose that G is the union of two disjoint n/2-cliques connected by a single edge a. In order for H to spectrally approximate G, we had better include the edge a (otherwise the corresponding cut in H will have weight zero, while it has non-zero weight in G). But this would mean we need to sample Θ(n²) edges if we do uniform random sampling (since G contains Θ(n²) edges, only one of which is a). That doesn't yield a particularly sparse graph.

Remark 10.3. It is not too difficult to see that no matter what choice we make for pe we will needat least Ω(n log n) edges to be sampled. Consider G to be the n-path. In that case, we will need tosample every edge at least once to approximate all n − 1 cuts in G. The standard coupon collectorbound dictates that we will need at least Ω(n log n) samples.

10.2.1 Effective resistances

Instead, we need to choose the probabilities pe so that important edges (like the single edge a inthe preceding example) have a very good chance of being sampled. To do this, we will set pe to beproportional to the effective resistance of the edge e. We’ll see more about effective resistances (andtheir relationship to random walks) in the next lecture.

For now, we can simply give the definition: For every edge e, we set

R_e := Tr(L_e L_G^+),

where L_G^+ is the pseudoinverse of L_G.

Since G is connected, L_G has rank n − 1. That is easy to see from the formula (10.1): If v is a multiple of the all-ones vector (1, 1, . . . , 1), then v^T L_G v = 0. On the other hand, if v^T L_G v = ∑_{{i,j}∈E} (v_i − v_j)² = 0 and G is connected, it must be that v_1 = v_2 = · · · = v_n. In other words, L_G v = 0 implies that v is a multiple of (1, 1, . . . , 1). Thus L_G has rank n − 1. That means we can write L_G = ∑_{i=2}^n λ_i w_i w_i^T for some orthonormal family of vectors {w_i} ⊆ R^n and λ_i > 0. One then defines L_G^+ = ∑_{i=2}^n (1/λ_i) w_i w_i^T. Now we have L_G L_G^+ = I_{im(L_G)}. We will also need the positive square root: L_G^{+/2} = ∑_{i=2}^n (1/√λ_i) w_i w_i^T.

Finally, we define our sampling probabilities p_e = R_e/(n − 1). First, let's verify that these are indeed probabilities. Recalling that L_e = x_e x_e^T, we have R_e = Tr(L_e L_G^+) = x_e^T L_G^+ x_e ≥ 0, since L_G^+ is PSD (because L_G is PSD). Moreover, using linearity of the trace and the definition of L_G:

∑_{e∈E} R_e = Tr(∑_{e∈E} L_e L_G^+) = Tr(L_G L_G^+) = Tr(I_{im(L_G)}) = n − 1,

where the last equality follows because, as we discussed, L_G has rank n − 1.

Note that with these sampling probabilities, we have

Z_i ⪯ κ · E[Z_i]  ⟺  ((n − 1)/(k R_e)) L_e ⪯ κ (1/k) L_G  ⟺  L_e ⪯ (κ R_e/(n − 1)) L_G.

By Claim 10.2 with κ = n − 1, we are done with the proof of Theorem 10.1 once we verify the following.

Lemma 10.4. For every edge e ∈ E, we have L_e ⪯ R_e L_G.

Proof. We need to prove that v^T L_e v ≤ R_e v^T L_G v for every v ∈ R^n. But we need only prove this for v ⊥ (1, . . . , 1), since (1, . . . , 1) ∈ ker(L_G) ∩ ker(L_e). Since im(L_G^{+/2}) contains every vector orthogonal to (1, . . . , 1), it suffices to prove the inequality for v = L_G^{+/2} w with w ∈ R^n and w ⊥ (1, . . . , 1):

(L_G^{+/2} w)^T L_e (L_G^{+/2} w) ≤ R_e (L_G^{+/2} w)^T L_G (L_G^{+/2} w).

Using the fact that L_G^{+/2} is symmetric and w ⊥ ker(L_G), the right-hand side equals R_e w^T w. On the other hand, the left-hand side is

w^T L_G^{+/2} L_e L_G^{+/2} w ≤ ‖L_G^{+/2} L_e L_G^{+/2}‖ w^T w ≤ Tr(L_G^{+/2} L_e L_G^{+/2}) w^T w = R_e w^T w,

completing the proof.

Remark 10.5. The proof is a bit easier to understand in the following way: We want to prove L_e ⪯ R_e L_G. It would be nice to simply multiply on the left and right by L_G^{+/2}, yielding

L_G^{+/2} L_e L_G^{+/2} ⪯ R_e I_{im(L_G)}.

This latter statement is true because the maximum eigenvalue of L_G^{+/2} L_e L_G^{+/2} is certainly at most its trace, which is equal to R_e.

If A and B are symmetric matrices and C is invertible, then C A C^T ⪯ C B C^T ⟺ A ⪯ B (this is an easy exercise). On the other hand, if C is singular (like L_G^{+/2}), then we need to be careful about what happens on ker(C).
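Putting the pieces of this section together, here is a minimal sketch (added for illustration) of sampling by effective resistances; the constant in the choice of k and the use of a dense pseudoinverse are arbitrary simplifications of the sketch, not part of the notes.

import numpy as np

def sparsify(edges, n, eps, rng=np.random.default_rng(4)):
    # Build L_G and the edge vectors x_e
    L = np.zeros((n, n))
    X = []
    for (i, j) in edges:
        x = np.zeros(n); x[i], x[j] = 1.0, -1.0
        X.append(x)
        L += np.outer(x, x)
    Lplus = np.linalg.pinv(L)
    R = np.array([x @ Lplus @ x for x in X])   # effective resistances R_e
    p = R / R.sum()                            # R.sum() = n - 1 for connected G
    k = int(9 * n * np.log(n) / eps**2)        # k = O(n log n / eps^2)
    w = np.zeros(len(edges))
    for e in rng.choice(len(edges), size=k, p=p):
        w[e] += 1.0 / (k * p[e])
    LH = sum(w[e] * np.outer(X[e], X[e]) for e in range(len(edges)))
    return w, LH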


11 Random walks and electrical networks

Let G = (V, E) be an undirected graph. The random walk on G is a Markov chain on V that, at each time step, moves to a uniformly random neighbor of the current vertex.

For x ∈ V, we use d_x to denote the degree of vertex x. More formally, the random walk on G is the following process {X_t}: We start at some node X_0 = v_0 ∈ V. Then if X_t = v, we put X_{t+1} = u with probability 1/d_v for every neighbor u of v.

11.1 Hitting times and cover times

One can study many natural properties of the random walk. For two vertices u, v ∈ V, we define the hitting time H_{uv} from u to v as the expected number of steps for the random walk to hit v when started at u. Formally, define the random variable T = min{t ≥ 0 : X_t = v}. Then H_{uv} = E[T | X_0 = u].

The cover time of G starting from u is the quantity cov_u(G), the expected number of steps needed to visit every vertex of G when started at u. Again, we can define this formally: Let T = min{t ≥ 0 : {X_0, X_1, . . . , X_t} = V}. Then cov_u(G) = E[T | X_0 = u]. Finally, we define the cover time of G as cov(G) = max_{u∈V} cov_u(G).

11.2 Random walks and electrical networks

It turns out that random walks (on undirected graphs) are very closely related to electrical networks. We recall the basics of such networks now. Again, we let G = (V, E) be a connected, undirected graph, which we think of as an electrical circuit with a unit resistor on every edge.

If we create a potential difference at two vertices (by, say, connecting the positive and negative terminals of a battery), then we induce an electrical flow in the graph. Between every two nodes u, v there is a potential difference φ_{u,v} ∈ R. Electrical networks satisfy the following three laws.

(K1) The flow into every node equals the flow out.

(K2) The sum of the potential differences around any cycle is equal to zero.

(Ohm) The current flowing from u to v on an edge e = {u, v} is precisely φ_{u,v}/r_{uv}, where r_{uv} is the resistance of {u, v}. [In other words, V = iR.]

In our setting, all resistances are equal to one, but one can define things more generally: If we put conductances {c_{uv}} on the edges {u, v} ∈ E, then the corresponding random walk operates as follows: If X_t = u, then X_{t+1} = v with probability c_{uv}/∑_{w∈V} c_{uw} for every neighbor v of u. In that case, we would have r_{uv} = 1/c_{uv}.

Remark 11.1. In fact, (K2) is related to a somewhat more general fact. The potential differences are given, naturally, by differences in a potential: There exists a map ϕ : V → R such that φ_{u,v} = ϕ(u) − ϕ(v). If G is connected, then the potential ϕ is uniquely defined up to a translation.

To define the potential ϕ, put ϕ(v_0) = 0 for some fixed node v_0. Now for any v ∈ V and any path γ = ⟨v_0, v_1, v_2, . . . , v_k = v⟩ in G, we can define ϕ(v) = φ_{v_0,v_1} + φ_{v_1,v_2} + · · · + φ_{v_{k−1},v_k}. This is well-defined (independent of the choice of path γ), since by (K2) the potential differences around every cycle sum to zero.

Finally, we make an important definition: The effective resistance Reff(u , v) between two nodesu , v ∈ V is defined to be the necessary potential difference created between u and v to induce acurrent of one unit to flow between them. If we imagine the entire graph G acting as a single “wire”


between u and v, then Reff(u , v) denotes the effective resistance of that single wire (recall Ohm’slaw). We now prove the following.

Theorem 11.2. If G = (V, E) has m edges, then for any two nodes u, v ∈ V, we have

H_{uv} + H_{vu} = 2m R_eff(u, v).

In order to prove this, we will set up four electrical networks corresponding to the graph G. We label these networks (A)-(D).

label these networks (A)-(D).

(A) We inject d_x units of flow at every vertex x ∈ V, and extract ∑_{x∈V} d_x = 2m units of flow at vertex v.

(B) We inject d_x units of flow at every vertex x ∈ V, and extract 2m units of flow at vertex u.

(C) We inject 2m units of flow at vertex u and extract d_x units of flow at every vertex x ∈ V.

(D) We inject 2m units of flow at vertex u and extract 2m units of flow at vertex v.

We will use the notation φ^{(A)}_{x,y}, φ^{(B)}_{x,y}, etc. to denote the potential differences in each of these networks.

Lemma 11.3. For any vertex u ∈ V, we have H_{uv} = φ^{(A)}_{u,v}.

Proof. Calculate: For u ≠ v,

d_u = ∑_{w∼u} φ^{(A)}_{u,w} = ∑_{w∼u} (φ^{(A)}_{u,v} − φ^{(A)}_{w,v}) = d_u φ^{(A)}_{u,v} − ∑_{w∼u} φ^{(A)}_{w,v},

where we have first used (K1), then (K2). Rearranging yields

φ^{(A)}_{u,v} = 1 + (1/d_u) ∑_{w∼u} φ^{(A)}_{w,v}.

Now observe that the hitting times satisfy the same set of linear equations: For u ≠ v,

H_{uv} = 1 + (1/d_u) ∑_{w∼u} H_{wv}.

We conclude that H_{uv} = φ^{(A)}_{u,v} as long as this system of linear equations has a unique solution. But consider some other solution H′_{uv} and define f(u) = H_{uv} − H′_{uv}. Plugging this into the preceding family of equations yields

f(u) = (1/d_u) ∑_{w∼u} f(w).

Such a map f is called harmonic, and it is a well-known fact that every harmonic function f on a finite, connected graph is constant. Since f(v) = H_{vv} − H′_{vv} = 0, this implies that f ≡ 0, and hence the family of equations has a unique solution, completing the proof.


Remark 11.4. To prove that every harmonic function on a finite, connected graph is constant, we can look at the corresponding Laplace operator: (L f)(u) = d_u f(u) − ∑_{w∼u} f(w). A function f is harmonic if and only if L f = 0. But we have already seen that, on a connected graph, the Laplacian has rank n − 1 and ker(L) = span(1, . . . , 1); i.e., the only harmonic functions on our graph are the constant functions.

Define now the commute time between u and v as the quantity C_{uv} = H_{uv} + H_{vu}. We restate and prove Theorem 11.2.

Theorem 11.5. In any connected graph with m edges, we have C_{uv} = 2m R_eff(u, v) for every pair of vertices u, v ∈ V.

Proof. From Lemma 11.3, we have H_{uv} = φ^{(A)}_{u,v}. By symmetry, H_{vu} = φ^{(B)}_{v,u} as well. Since network (C) is the reverse of network (B), this yields H_{vu} = φ^{(C)}_{u,v}. Finally, since network (D) is the sum of networks (A) and (C), by linearity we have

φ^{(D)}_{u,v} = φ^{(C)}_{u,v} + φ^{(A)}_{u,v} = H_{vu} + H_{uv} = C_{uv}.

Finally, note that φ^{(D)}_{u,v} = 2m R_eff(u, v) by the definition of effective resistance, since network (D) has exactly 2m units of current flowing from u to v. This yields the claim of the theorem.
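Theorem 11.5 is easy to verify numerically. The sketch below (added for illustration, on a small hard-coded graph) computes hitting times by solving the linear system from Lemma 11.3 and the effective resistance as x_{uv}^T L^+ x_{uv}, as in Section 10.

import numpy as np

def hitting_times_to(v, A):
    # Solve H_u = 1 + (1/d_u) sum_{w ~ u} H_w for u != v, with H_v = 0.
    n = len(A)
    deg = A.sum(axis=1)
    idx = [u for u in range(n) if u != v]
    M = np.eye(n - 1)
    b = np.ones(n - 1)
    for a, u in enumerate(idx):
        for col, w in enumerate(idx):
            if A[u, w]:
                M[a, col] -= A[u, w] / deg[u]
    H = np.linalg.solve(M, b)
    out = np.zeros(n); out[idx] = H
    return out

A = np.array([[0,1,1,0],[1,0,1,1],[1,1,0,1],[0,1,1,0]], dtype=float)
m = int(A.sum() / 2)
L = np.diag(A.sum(axis=1)) - A
Lplus = np.linalg.pinv(L)
u, v = 0, 3
Reff = Lplus[u, u] + Lplus[v, v] - 2 * Lplus[u, v]
Cuv = hitting_times_to(v, A)[u] + hitting_times_to(u, A)[v]
print(Cuv, 2 * m * Reff)   # the two numbers should agree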

11.3 Cover times

We can now use Theorem 11.5 to give a universal upper bound on the cover time of any graph.

Theorem 11.6. For any connected graph G = (V, E), we have cov(G) ≤ 2|E|(|V| − 1).

Proof. Fix a spanning tree T of G. Then we have

cov(G) ≤ ∑_{{x,y}∈E(T)} C_{xy}.

The right-hand side can be interpreted as a very particular way of covering the graph G: Start at some node x_0 and "walk" around the edges of the spanning tree in order x_0, x_1, x_2, . . . , x_{2(n−1)} = x_0. If we require the walk to first go from x_0 to x_1, then from x_1 to x_2, etc., we get the sum ∑_{i=0}^{2(n−1)−1} H_{x_i x_{i+1}} = ∑_{{x,y}∈E(T)} C_{xy}. This is one particular way to visit every node of G, so it gives an upper bound on the cover time.

Finally, we note that if {x, y} is an edge of the graph, then by Theorem 11.5, we have C_{xy} = 2|E| R_eff(x, y) ≤ 2|E|. Here we use the fact that for every edge {x, y} of a graph, the effective resistance is at most the resistance, which is at most one. This completes the proof.

Remark 11.7. The last stated fact is a special case of the Rayleigh monotonicity principle. This statesthat adding edges to the graph (or, more generally, decreasing the resistance of any edge) cannotincrease any effective resistance. In the other direction, removing edges from the graph (or, moregenerally, increasing the resistance of any edge) cannot decrease any effective resistance. A similarfact is false for hitting times and commute times, as we will see in the next few examples.


Examples.

1. The path. Consider first $G$ to be the path on vertices $0, 1, \dots, n$. Then $H_{0n} + H_{n0} = C_{0n} = 2n \cdot R_{\mathrm{eff}}(0,n) = 2n^2$. Since $H_{0n} = H_{n0}$ by symmetry, we conclude that $H_{0n} = n^2$. Note that Theorem 11.6 implies that $\mathrm{cov}(G) \le 2n^2$, and clearly $\mathrm{cov}(G) \ge H_{0n} = n^2$, so the upper bound is off by at most a factor of 2.

2. The lollipop. Consider next the "lollipop graph," which is a path of length $n/2$ from $u$ to $v$ with an $(n/2)$-clique attached to $v$. We have $H_{uv} + H_{vu} = C_{uv} = \Theta(n^2) \cdot R_{\mathrm{eff}}(u,v) = \Theta(n^3)$. On the other hand, we have already seen that $H_{uv} = \Theta(n^2)$. We conclude that $H_{vu} = \Theta(n^3)$, hence $\mathrm{cov}(G) \ge \Omega(n^3)$. Again, the bound of Theorem 11.6 is $\mathrm{cov}(G) \le O(n^3)$, so it's tight up to a constant factor here as well.

3. The complete graph. Finally, consider the complete graph $G$ on $n$ nodes. In this case, Theorem 11.6 gives $\mathrm{cov}(G) \le O(n^3)$, which is way off from the actual value $\mathrm{cov}(G) = \Theta(n \log n)$ (since this is just the coupon collector problem in flimsy disguise).
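The scalings in these examples are easy to observe empirically. The following Monte Carlo sketch (my own; the graphs and trial counts are arbitrary choices) estimates cover times of the path and of the complete graph and normalizes them by $n^2$ and $n \log n$ respectively.

```python
import math
import random

def cover_time_once(neighbors, start=0):
    """Run a random walk until every vertex is visited; return the number of steps."""
    unvisited = set(range(len(neighbors))) - {start}
    x, steps = start, 0
    while unvisited:
        x = random.choice(neighbors[x])
        unvisited.discard(x)
        steps += 1
    return steps

def average_cover_time(neighbors, trials=200):
    return sum(cover_time_once(neighbors) for _ in range(trials)) / trials

n = 64
path = [[j for j in (i - 1, i + 1) if 0 <= j <= n] for i in range(n + 1)]
clique = [[j for j in range(n) if j != i] for i in range(n)]

print(average_cover_time(path) / n**2)                 # roughly constant in n
print(average_cover_time(clique) / (n * math.log(n)))  # roughly constant in n
```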

11.4 Matthews’ bound

The last example shows that sometimes Theorem 11.6 doesn't give such a great upper bound. Fortunately, a relatively simple bound gets us within an $O(\log n)$ factor of the cover time.

Theorem 11.8. If $G = (V, E)$ is a connected graph and $R_{\max} := \max_{x,y \in V} R_{\mathrm{eff}}(x,y)$ is the maximum effective resistance in $G$, then
$$|E| \, R_{\max} \le \mathrm{cov}(G) \le O(\log n) \, |E| \, R_{\max} \,.$$

Proof. One direction is easy:
$$\mathrm{cov}(G) \ge \max_{u,v} H_{uv} \ge \frac12 \max_{u,v} C_{uv} = \frac12 \cdot 2|E| \max_{u,v} R_{\mathrm{eff}}(u,v) = |E| \, R_{\max} \,.$$

For the other direction, we will examine a random walk of length $2c|E|R_{\max} \log n$ divided into $\log n$ epochs of length $2c|E|R_{\max}$. Note that for any vertex $v$ and any epoch $i$, we have
$$\mathbb{P}[v \text{ unvisited in epoch } i] \le \frac{1}{c} \,.$$

This is because no matter what vertex is the first of epoch $i$, we know that the hitting time to $v$ is at most $\max_u H_{uv} \le \max_u C_{uv} \le 2|E|R_{\max}$. Now Markov's inequality tells us that the probability it takes more than $2c|E|R_{\max}$ steps to hit $v$ is at most $1/c$.

Therefore the probability we don't visit $v$ in any epoch is at most $c^{-\log n} = n^{-\log c}$, and by a union bound, the probability that there is some vertex left unvisited after all the epochs is at most $n^{1-\log c}$.

We conclude that
$$\mathrm{cov}(G) \le 2c|E|R_{\max} \log n + n^{1-\log c} \cdot 2n^3 \,,$$
where we have used the weak upper bound on the cover time provided by Theorem 11.6. Choosing $c$ to be a large enough constant makes the second term negligible, yielding
$$\mathrm{cov}(G) \le O(|E| R_{\max} \log n) \,,$$
as desired.


One can make some improvements to this “soft” proof, yielding the following stronger bounds.

Theorem 11.9. For any connected graph $G = (V, E)$, the following holds. Let $t_{\mathrm{hit}}$ denote the maximum hitting time in $G$. Then
$$\mathrm{cov}(G) \le t_{\mathrm{hit}} \left( 1 + \frac12 + \cdots + \frac1n \right).$$

Moreover, if we define for any subset $A \subseteq V$ the quantity $t^A_{\min} = \min_{u,v \in A,\, u \neq v} H_{uv}$, then
$$\mathrm{cov}(G) \ge \max_{A \subseteq V} \, t^A_{\min} \left( 1 + \frac12 + \cdots + \frac{1}{|A|-1} \right). \qquad (11.1)$$

For the proofs, consult Chapter 11 of the Levin-Peres-Wilmer book http://pages.uoregon.edu/dlevin/MARKOV/markovmixing.pdf.

Kahn, Kim, Lovász, and Vu showed that the best lower bound in (11.1) is within an $O((\log \log n)^2)$ factor of $\mathrm{cov}(G)$, improving over the $O(\log n)$-approximation in Theorem 11.8. In a paper with Jian Ding and Yuval Peres, we showed that one can compute an $O(1)$ approximation using a multi-scale generalization of the bound (11.1) based on Talagrand's majorizing measures theory.

12 Markov chains and mixing times

Consider a finite state space $\Omega$ and a transition kernel $P : \Omega \times \Omega \to [0,1]$ such that for every $x \in \Omega$, $\sum_{y \in \Omega} P(x,y) = 1$. The Markov chain corresponding to the kernel $P$ is the sequence of random variables $X_0, X_1, X_2, \dots$ such that for every $t \ge 0$, we have $\mathbb{P}[X_{t+1} = y \mid X_t = x] = P(x,y)$. Note that we also have to specify a distribution for the initial state $X_0$.

Corresponding to every such process, one can consider the (weighted) directed graph $D = (\Omega, A)$ with $A = \{(x,y) : P(x,y) > 0\}$ and edge weights $w(x,y) = P(x,y)$. Then the random process $\{X_t\}$ corresponds precisely to random walk on $D$: At every time step, one moves from the current vertex $x$ to a neighbor $y$ with probability $P(x,y)$.

Convergence to stationarity. For every $t \ge 0$, let $P^t(x,y) = \mathbb{P}[X_t = y \mid X_0 = x]$. The Markov chain described by $P$ is said to be irreducible if for every $x, y \in \Omega$, there is some $t$ such that $P^t(x,y) > 0$; in other words, there is always some way to reach any state from any other. This corresponds precisely to the digraph $D$ being strongly connected. The chain is aperiodic if for every $x, y \in \Omega$,
$$\gcd\{t : P^t(x,y) > 0\} = 1 \,.$$

Theorem 12.1 (Fundamental Theorem of Markov Chains). If $P$ is irreducible and aperiodic, then there is a unique probability measure $\pi : \Omega \to [0,1]$ such that for every $x, y \in \Omega$, we have
$$P^t(x,y) \to \pi(y) \quad \text{as } t \to \infty \,.$$

In other words, the Markov chain "forgets" where it started and converges to a unique limiting distribution. This is referred to as the stationary measure $\pi$.

Reversibility. A Markov chain is said to be reversible with respect to the measure $\mu$ if for every $x, y \in \Omega$, we have $\mu(x)P(x,y) = \mu(y)P(y,x)$. (These are called the "detailed balance conditions.") The chain is said to be reversible if it is reversible with respect to some probability measure. Note that reversible chains correspond precisely to random walks on (weighted) undirected graphs.


Also, if $P$ is irreducible and aperiodic—and hence has a unique stationary measure $\pi$ by Theorem 12.1—then actually $\pi = \mu$. To see this, note that by the detailed balance conditions: For every $y \in \Omega$, we have
$$\sum_{x \in \Omega} \mu(x) P(x,y) = \sum_{x \in \Omega} \mu(y) P(y,x) = \mu(y) \sum_{x \in \Omega} P(y,x) = \mu(y) \,. \qquad (12.1)$$

The left-hand side can be interpreted as the probability of going to $y$ in one step when started from the measure $\mu$. Now Theorem 12.1 implies that if we start from distribution $\mu$, then we converge to $\pi$; on the other hand, (12.1) says that if we start distributed according to $\mu$, then we stay that way under the chain. Thus $\mu = \pi$. This provides a nice local way to check that some measure is the stationary measure of the chain.
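This local check is easy to carry out numerically. The sketch below (my own, not from the notes) verifies the detailed balance conditions for a candidate measure and then confirms that the measure is indeed stationary, in the spirit of (12.1).

```python
import numpy as np

def is_reversible(P, mu, tol=1e-12):
    Q = mu[:, None] * P                     # Q[x, y] = mu(x) P(x, y)
    return np.allclose(Q, Q.T, atol=tol)    # detailed balance: Q symmetric

# Random walk on a small weighted undirected graph: P(x, y) = w(x, y) / deg(x).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
deg = W.sum(axis=1)
P = W / deg[:, None]
mu = deg / deg.sum()                        # candidate stationary measure

assert is_reversible(P, mu)                 # detailed balance holds
assert np.allclose(mu @ P, mu)              # hence mu is stationary, as in (12.1)
```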

Remark 12.2. If $P$ is irreducible, but not necessarily aperiodic, then there is still a unique stationary distribution, i.e. a probability measure $\pi$ such that for every $x \in \Omega$, $\sum_{y \in \Omega} \pi(y) P(y,x) = \pi(x)$. But it may not be the case that the chain converges to $\pi$ from some starting states.

For instance, if the chain is given by a directed graph with two nodes $\Omega = \{x, y\}$ and arcs $(x,y)$ and $(y,x)$, then $\pi = (1/2, 1/2)$ is the unique stationary measure, but the chain does not converge to $\pi$ when starting in either state $x$ or $y$ (because of periodicity).

For our purposes, aperiodicity is a rather weak obstruction to mixing. Given any chain $P$ and a number $\alpha \in (0,1)$, we can consider the chain $P' = \alpha I + (1-\alpha)P$. If $P$ is irreducible, then so is $P'$. Moreover, for any such $\alpha$, the chain $P'$ is aperiodic (even if $P$ was not). When measuring convergence to equilibrium, this $\alpha$ "self loop" probability does not slow down the chain too much.

12.1 The Fundamental Theorem

Let us sketch a proof of Theorem 12.1. We want to begin with a stationary measure $\pi$ for $P$, i.e., a probability distribution $\pi$ on $\Omega$ (interpreted as a row vector) such that $\pi P = \pi$. Suppose $\{X_t\}$ is the Markov chain with transition law $P$ and for $x \in \Omega$, define
$$\tau^+_x := \min\{t > 0 : X_t = x\} \,.$$

We do not prove the following lemma (see, e.g., Section 1.5.3 in the Levin-Peres-Wilmer book).

Lemma 12.3. Suppose that $P$ is irreducible and aperiodic. If we define
$$\pi(x) = \frac{1}{\mathbb{E}[\tau^+_x \mid X_0 = x]} \,,$$
then $\pi$ is a probability distribution satisfying $\pi P = \pi$.

Now let us argue that $P^t(x,y) \to \pi(y)$ for every $x, y \in \Omega$. Let $\Pi$ denote the $|\Omega| \times |\Omega|$ matrix in which every row is $\pi$.

Fact 12.4. It holds that $\Pi P = \Pi$ and $Q \Pi = \Pi$ for every row-stochastic matrix $Q$.

Using the fact that $P$ is irreducible and aperiodic, choose $r$ such that $P^r(x,y) > 0$ for every $x, y \in \Omega$. Let $\theta < 1$ be such that
$$P^r(x,y) \ge (1-\theta)\,\pi(y) \qquad \forall x, y \in \Omega \,.$$

Then we can write
$$P^r = (1-\theta)\,\Pi + \theta Q \,, \qquad (12.2)$$


where $Q$ is stochastic. Now we claim that for every $k \ge 1$:
$$P^{kr} = (1-\theta^k)\,\Pi + \theta^k Q^k \,. \qquad (12.3)$$

If this holds, then for $s < r$, we have
$$P^{kr+s} = P^{kr} P^s = \left[ (1-\theta^k)\,\Pi + \theta^k Q^k \right] P^s \,,$$
and thus $P^t \to \Pi$ as $t \to \infty$, completing the proof of Theorem 12.1.

We prove (12.3) by induction on $k$. The case $k = 1$ is (12.2). In the general case, write
$$P^{r(k+1)} = P^{rk} P^r = \left[ (1-\theta^k)\,\Pi + \theta^k Q^k \right] P^r = (1-\theta^k)\,\Pi P^r + \theta^k Q^k \left[ (1-\theta)\,\Pi + \theta Q \right],$$
where in the second equality we use (12.2). Now Fact 12.4 implies that $\Pi P^r = \Pi$ and $Q^k \Pi = \Pi$, hence
$$P^{r(k+1)} = \left[ (1-\theta^k) + \theta^k(1-\theta) \right] \Pi + \theta^{k+1} Q^{k+1} = (1-\theta^{k+1})\,\Pi + \theta^{k+1} Q^{k+1} \,,$$
completing the proof.
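The decomposition (12.2) and the identity (12.3) can be illustrated numerically. In the sketch below (my own; the chain is an arbitrary small example), we extract the largest admissible $1-\theta$, form $Q$, check that it is stochastic, and verify (12.3) for several values of $k$.

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])           # a small irreducible, aperiodic chain
r = 2
Pr = np.linalg.matrix_power(P, r)
assert (Pr > 0).all()                        # r is chosen so that P^r is strictly positive

w, V = np.linalg.eig(P.T)                    # stationary measure: left eigenvector for eigenvalue 1
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
Pi = np.tile(pi, (len(pi), 1))               # the matrix whose every row is pi

one_minus_theta = (Pr / Pi).min()            # largest 1 - theta with P^r >= (1 - theta) Pi entrywise
theta = 1.0 - one_minus_theta
Q = (Pr - (1.0 - theta) * Pi) / theta
assert (Q >= -1e-12).all() and np.allclose(Q.sum(axis=1), 1.0)   # Q is stochastic

for k in range(1, 6):                        # verify (12.3)
    lhs = np.linalg.matrix_power(P, r * k)
    rhs = (1.0 - theta**k) * Pi + theta**k * np.linalg.matrix_power(Q, k)
    assert np.allclose(lhs, rhs)
```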

12.2 Eigenvalues and mixing

It will be useful to give a more quantitative proof of Theorem 12.1 in the reversible case. To do this, we again think of $P$ as an $\Omega \times \Omega$ matrix. If we also think of a probability measure $\mu \in \mathbb{R}^\Omega$ as a row vector, then $\mu P$ denotes the distribution that arises by starting at $\mu$ and taking one step of the chain associated to $P$.

If $P$ is reversible with respect to $\pi$, then (12.1) implies that $\pi P = \pi$, i.e. $\pi$ is a (left) eigenvector with eigenvalue 1. We now analyze the other eigenvalues of $P$.

Real eigenvalues. Note that $P$ is not necessarily a symmetric matrix, but we can prove that $P$ is similar to a symmetric matrix. Let $D$ denote the diagonal matrix with $D_{xx} = \pi(x)$. Then
$$\left( \sqrt{D}^{-1} P \sqrt{D} \right)_{xy} = \frac{\sqrt{\pi(y)}}{\sqrt{\pi(x)}}\, P(x,y) \,.$$

But by the detailed balance conditions, this is equal to $\frac{\sqrt{\pi(x)}}{\sqrt{\pi(y)}}\, P(y,x)$, which is the $(y,x)$ entry of the same matrix. Thus $\sqrt{D}^{-1} P \sqrt{D}$ is a real, symmetric matrix and hence has real eigenvalues. This implies that $P$ also has real eigenvalues.

All eigenvalues in $[-1, 1]$. Now note that for any $v \in \mathbb{R}^\Omega$, we have
$$\|vP\|_1 \le \||v|P\|_1 = \|v\|_1 \,, \qquad (12.4)$$
where $|v|$ denotes the vector whose entries are the absolute values of the corresponding entries of $v$. This is simply because $P$ is an averaging operator.

Now suppose that $vP = \lambda v$. Then using (12.4),
$$|\lambda| \cdot \|v\|_1 = \|vP\|_1 \le \|v\|_1 \,,$$
implying that $|\lambda| \le 1$.


Unique eigenvector with eigenvalue 1. Suppose now that $v = vP$ and consider the corresponding Laplacian matrix $L = D - DP$ (using our notation for "edge Laplacians," this is $\frac12 \sum_{x,y} \pi(x)P(x,y)\, L_{x,y}$). One can check that this matrix is symmetric, since $DP$ is symmetric by the detailed balance conditions. As we saw in Lectures 14-15, for any vector $w$ we have
$$w L w^{\mathsf{T}} = \frac12 \sum_{x,y} \pi(x) P(x,y)\, (w_x - w_y)^2 \,.$$

This is easiest to see by writing $L = \sum_{x,y} c_{xy} L_{xy}$, where $c_{xy} := \pi(x)P(x,y)$ and $L_{xy}$ is the (unweighted) Laplacian corresponding to the graph with a single edge $\{x,y\}$. (The factor $1/2$ is due to the fact that we are summing over all ordered pairs $(x,y)$ vs. all edges $\{x,y\}$.)

Let $w = vD^{-1}$. Then $vP = v \Rightarrow wL = 0 \Rightarrow wLw^{\mathsf{T}} = 0$. Thus $w_x = w_y$ whenever $P(x,y) > 0$. But since the chain $P$ is irreducible, we can connect every pair $x, y$ by a chain of such implications, implying that $w = \alpha(1, 1, \dots, 1)$ is a multiple of the all-ones vector. But this implies that $v = wD$ is a multiple of $\pi$. Since $P$ is an averaging operator, it preserves the $\ell_1$ norm, hence $\alpha = 1$ and $v = \pi$.

Not a bipartite graph. Now we claim that if $P$ is aperiodic, $-1$ cannot be an eigenvalue of $P$. Suppose, for the sake of contradiction, that $vP = -v$ for some $v \neq 0$. Again, let $|v|$ denote the vector whose entries are the absolute values of the corresponding entries in $v$. Since $|v_x| = |(vP)_x| \le (|v|P)_x$ for every $x$, we have
$$\|v\|_2^2 = \sum_x |v_x|\,|(vP)_x| \le |v| P |v|^{\mathsf{T}} \le \|v\|_2^2 \,,$$
where the last inequality follows from the fact that all the eigenvalues of $P$ lie in $[-1,1]$. We conclude that $|v|P = |v|$, implying that (after rescaling $v$) $|v| = \pi$.

Finally, observe that $vP = -v$ implies that for every $x$, one has $v_x = -(vP)_x$, hence
$$\pi(x)\,\mathrm{sgn}(v_x) = v_x = -(vP)_x = -\sum_y P(y,x)\, v_y = -\sum_y P(y,x)\,\pi(y)\,\mathrm{sgn}(v_y) \,.$$

But by the detailed balance conditions, we have $\pi(x) = \sum_y P(y,x)\,\pi(y)$. Hence it must be that $\mathrm{sgn}(v_x) = -\mathrm{sgn}(v_y)$ whenever $P(y,x) > 0$.

Thus if we set $L = \{x : v_x < 0\}$ and $R = \{x : v_x > 0\}$, then $P(x,y) > 0$ implies $x$ and $y$ are on different sides of the bipartition. (Note that this is a bipartition since $|v| = \pi$ implies that $v_x \neq 0$ for every $x \in \Omega$.) But this implies that for $x, y$ on the same side of the bipartition, we have $P^t(x,y) = 0$ when $t$ is odd, contradicting the fact that $P$ was assumed aperiodic.

Convergence to stationarity (spectral argument). Let us fix an inner product in which the matrix $P$ is self-adjoint:
$$\langle u, v \rangle_{L^2(\pi)} = \sum_{x \in \Omega} \pi(x)\, u_x v_x \,,$$
and let $\|u\|_{L^2(\pi)} = \sqrt{\langle u, u \rangle_{L^2(\pi)}}$ denote the corresponding Euclidean norm.

Consider any vector $w \in \mathbb{R}^\Omega$. Let $\lambda_1 = 1, \lambda_2, \dots, \lambda_n$ denote the (left) eigenvalues of $P$ arranged so that $1 = |\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|$. Since $P$ is self-adjoint with respect to $L^2(\pi)$, we can choose an $L^2(\pi)$-orthonormal basis $v^{(1)}, v^{(2)}, \dots, v^{(n)}$ of corresponding eigenvectors with $v^{(1)}$ a multiple of $\pi$.

Recall that from the above reasoning, we have $|\lambda_i| < 1$ for $i > 1$. Write $w = \sum_{i=1}^n \alpha_i v^{(i)}$, and note that for any $t \ge 0$,
$$w P^t = \alpha_1 v^{(1)} + \sum_{i \ge 2} \lambda_i^t \alpha_i v^{(i)} \,.$$


In particular, we have
$$\left\| w P^t - \alpha_1 v^{(1)} \right\|_{L^2(\pi)}^2 = \sum_{i \ge 2} \lambda_i^{2t} |\alpha_i|^2 \le \lambda_2^{2t}\, \|w\|_{L^2(\pi)}^2 \,. \qquad (12.5)$$

Since $|\lambda_2| < 1$, this implies that $\|wP^t - \alpha_1 v^{(1)}\|_{L^2(\pi)} \to 0$ as $t \to \infty$, showing that $wP^t \to \alpha_1 v^{(1)}$, where we recall that $v^{(1)}$ is a multiple of $\pi$.

Note that if $w$ has all non-negative entries, then since $P$ is an averaging operator, we have $\|wP^t\|_1 = \|w\|_1$, hence $wP^t \to \|w\|_1\, \pi$. Finally, observe that if $w = e_x$ then this implies $e_x P^t \to \pi$, which is exactly the claim of Theorem 12.1 (in the reversible case). For later use, we note the following consequence of (12.5): If $x \in \Omega$, then
$$\left\| e_x P^t - \pi \right\|_{L^2(\pi)}^2 \le \lambda_2^{2t}\, \pi(x) \,. \qquad (12.6)$$
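Here is a small numerical illustration (my own, under the conventions above) of the geometric decay in (12.5) and (12.6): for a lazy walk on a cycle, the deviation of $e_x P^t$ from $\pi$ in the $L^2(\pi)$ norm shrinks at least as fast as $\lambda_2^t$.

```python
import numpy as np

# Lazy random walk on a 5-cycle (reversible, irreducible, aperiodic, uniform pi).
n = 5
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
P = 0.5 * np.eye(n) + 0.25 * A
pi = np.full(n, 1.0 / n)

S = np.diag(pi**-0.5) @ P @ np.diag(pi**0.5)     # symmetric matrix similar to P
eigs = np.sort(np.linalg.eigvalsh(S))[::-1]
lam2 = max(abs(eigs[1]), abs(eigs[-1]))

x = 0
for t in [5, 10, 20, 40]:
    row = np.linalg.matrix_power(P, t)[x]         # the distribution e_x P^t
    dist = np.sqrt(np.sum(pi * (row - pi) ** 2))  # its L^2(pi) distance to pi
    assert dist <= lam2**t + 1e-12                # decays at least geometrically in lambda_2
```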

12.3 Mixing times

Now we have seen that any irreducible, aperiodic Markov chain $P$ on a finite state space $\Omega$ converges to a unique stationary measure $\pi$. We are not only concerned with convergence, but also the rate of convergence—we would like to be able to sample efficiently from $\pi$.

To this end, we first introduce a metric on the space of probability measures on $\Omega$: For any two measures $\mu$ and $\nu$ on $\Omega$, the total variation distance is defined by
$$d_{\mathrm{TV}}(\mu, \nu) \stackrel{\mathrm{def}}{=} \frac12 \|\mu - \nu\|_1 = \frac12 \sum_{x \in \Omega} |\mu(x) - \nu(x)| \,.$$

As an exercise, one can also show that $d_{\mathrm{TV}}(\mu, \nu) = \max_{A \subseteq \Omega} |\mu(A) - \nu(A)|$.

For simplicity of notation, let us define $p^{(x)}_t$ to be the distribution $e_x P^t$ (i.e., the distribution of the chain started at $x$ after $t$ steps). For any $t \ge 0$ and $x \in \Omega$, define the quantity $\Delta_x(t) = d_{\mathrm{TV}}(\pi, p^{(x)}_t)$, and set $\Delta(t) = \max_{x \in \Omega} \Delta_x(t)$. For $\varepsilon > 0$, we denote
$$\tau(\varepsilon) = \min\{t : \Delta(t) \le \varepsilon\} \,.$$

In words, this is the first time $t$ such that, starting from any initial state, the measure of the chain after $t$ steps is within $\varepsilon$ of the stationary measure. Finally, by convention, one takes $\tau_{\mathrm{mix}} = \tau(1/2e)$ as the mixing time of the Markov chain $P$. Note that the precise value of $\varepsilon$ is not so important; as the following lemma shows, once we have obtained the mixing time, further convergence to stationarity happens very fast.
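For a small chain, $\Delta(t)$ and $\tau_{\mathrm{mix}}$ can simply be computed by brute force. The sketch below (my own; the example chain is a lazy walk on a cycle) implements the definitions directly.

```python
import numpy as np

def total_variation(mu, nu):
    return 0.5 * np.abs(mu - nu).sum()

def mixing_time(P, pi, eps=1.0 / (2 * np.e)):
    """Smallest t with Delta(t) = max_x d_TV(e_x P^t, pi) <= eps."""
    n = len(pi)
    Pt = np.eye(n)
    t = 0
    while max(total_variation(Pt[x], pi) for x in range(n)) > eps:
        Pt = Pt @ P
        t += 1
    return t

# Lazy random walk on a 6-cycle.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
P = 0.5 * np.eye(n) + 0.25 * A
pi = np.full(n, 1.0 / n)
print(mixing_time(P, pi))
```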

Lemma 12.5. For every $t \ge 0$, we have
$$\Delta(t) \le \exp\left( -\left\lfloor \frac{t}{\tau_{\mathrm{mix}}} \right\rfloor \right).$$

In particular, for every $\varepsilon > 0$, it holds that $\tau(\varepsilon) \le \tau_{\mathrm{mix}} \lceil \ln(1/\varepsilon) \rceil$.

We will not prove this, but it can be done using the coupling characterization of total variation distance that appears in Homework #6. Finally, we can use our proof of Theorem 12.1 in the reversible case to give an upper bound on $\tau_{\mathrm{mix}}$ in terms of the spectral gap of the chain.


Theorem 12.6. Let $P$ be a reversible, irreducible, aperiodic Markov chain on the state space $\Omega$. Suppose that $P$ has eigenvalues $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, and let $\lambda(P) = \max\{|\lambda_2|, |\lambda_n|\}$. Then
$$\tau_{\mathrm{mix}} \le \left\lceil \frac{1 + \ln(1/\pi_{\min})}{1 - \lambda(P)} \right\rceil,$$
where $\pi_{\min} := \min\{\pi(x) : x \in \Omega\}$.

Proof. Consider $x \in \Omega$ and $\varepsilon > 0$. Recall that
$$d_{\mathrm{TV}}(p^{(x)}_t, \pi) = \frac12 \sum_{y \in \Omega} \left| P^t(x,y) - \pi(y) \right| = \frac12 \sum_{y \in \Omega} \pi(y) \left| \frac{P^t(x,y)}{\pi(y)} - 1 \right| \le \frac12 \left( \sum_{y \in \Omega} \pi(y) \left( \frac{P^t(x,y)}{\pi(y)} - 1 \right)^2 \right)^{1/2},$$

where the last inequality uses $\mathbb{E}[X^2] \ge (\mathbb{E}[X])^2$. Observe that
$$\sum_{y \in \Omega} \pi(y) \left( \frac{P^t(x,y)}{\pi(y)} - 1 \right)^2 = \sum_{y \in \Omega} \frac{\left( P^t(x,y) - \pi(y) \right)^2}{\pi(y)} \le \frac{1}{\pi_{\min}^2} \sum_{y \in \Omega} \pi(y) \left( P^t(x,y) - \pi(y) \right)^2 = \frac{1}{\pi_{\min}^2} \left\| e_x P^t - \pi \right\|_{L^2(\pi)}^2 \,.$$

Combining the preceding two inequalities with (12.6) gives
$$d_{\mathrm{TV}}(p^{(x)}_t, \pi)^2 \le \frac{1}{4\pi_{\min}^2} \left\| e_x P^t - \pi \right\|_{L^2(\pi)}^2 \le \frac{\pi(x)}{4\pi_{\min}^2}\, \lambda(P)^{2t} \,.$$

Now setting $t = \left\lceil \frac{1}{1-\lambda(P)} \ln \frac{1}{\varepsilon \pi_{\min}} \right\rceil$ and using the fact that $(1-\delta)^{1/\delta} \le e^{-1}$ for $\delta > 0$ yields
$$d_{\mathrm{TV}}(p^{(x)}_t, \pi)^2 \le \frac{\varepsilon^2}{4} \,,$$
implying $d_{\mathrm{TV}}(p^{(x)}_t, \pi) \le \varepsilon/2$. Setting $\varepsilon = 1/e$ and recalling the definition of $\tau_{\mathrm{mix}}$ yields the desired result.

Finally, one should note that this bound is essentially tight up to the $O(\log(1/\pi_{\min}))$ factor.

Theorem 12.7. Under the assumptions of Theorem 12.6, we have
$$\tau_{\mathrm{mix}} \ge \frac{1}{1 - \lambda(P)} - 1 \,.$$

Proof. Let $v$ be an eigenvector of $P$, viewed as acting on functions (so that $(Pv)_x = \sum_y P(x,y) v_y$), with eigenvalue $\lambda$ satisfying $|\lambda| = \lambda(P) \neq 1$. In that case, since the constant function is the eigenfunction with eigenvalue 1 and $P$ is self-adjoint in $L^2(\pi)$, we see that $v$ is orthogonal to it, i.e. $\sum_{y \in \Omega} \pi(y) v_y = 0$. It follows that for $t > 0$ and any $x \in \Omega$,
$$|\lambda^t v_x| = |(P^t v)_x| = \left| \sum_y \left( P^t(x,y) - \pi(y) \right) v_y \right| \le \|v\|_\infty \sum_y \left| P^t(x,y) - \pi(y) \right| = 2\|v\|_\infty\, d_{\mathrm{TV}}(p^{(x)}_t, \pi) \,.$$

Now choose $x$ so that $|v_x| = \|v\|_\infty$, yielding
$$d_{\mathrm{TV}}(p^{(x)}_t, \pi) \ge \frac12 \lambda(P)^t \,.$$

Therefore $\lambda(P)^{\tau_{\mathrm{mix}}} \le 1/e$, implying that
$$\tau_{\mathrm{mix}} \ge \frac{-1}{\log(1 - (1 - \lambda(P)))} \ge \frac{1}{1 - \lambda(P)} - 1 \,,$$
where in the final inequality we have used that $\log(1-a) \ge 1 + \frac{1}{a-1}$ for all $a \in [0, 1)$.


So we see that up to a $\log(1/\pi_{\min})$ factor, the spectral gap $1 - \lambda(P)$ controls the mixing time of the chain: If we set $\tau_{\mathrm{rel}} := \frac{1}{1-\lambda(P)}$ (commonly called the "relaxation time" of the chain), then
$$\tau_{\mathrm{rel}} - 1 \le \tau_{\mathrm{mix}} \le O(\log(1/\pi_{\min}))\, \tau_{\mathrm{rel}} \,.$$

12.4 Some Markov chains

One famous state space is the set of all permutations of $n$ objects (for $n = 52$, say). In this case, $|\Omega| = n!$. Here are some shuffles:

1. Random transposition. At every step, we choose two uniformly random positions $i$ and $j$ (with replacement) and swap the cards at positions $i$ and $j$.

2. Top to random. We take the top card and insert it at one of the $n$ positions in the deck uniformly at random. (A small code sketch of this step follows the list.)

3. Riffle shuffle. We split the deck into two parts $L$ and $R$ uniformly at random, and then take a uniformly random interleaving of $L$ and $R$.
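As promised, here is a minimal sketch (my own) of a single step of the top-to-random shuffle; the deck is a Python list with position 0 as the top.

```python
import random

def top_to_random_step(deck):
    card = deck.pop(0)                      # remove the top card
    pos = random.randint(0, len(deck))      # n possible positions (len(deck) is now n - 1)
    deck.insert(pos, card)
    return deck

deck = list(range(52))
for _ in range(300):                        # roughly n log n steps suffice for this shuffle
    top_to_random_step(deck)
```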

And here's a combinatorial example: Let $G = (V, E)$ be a graph with maximum degree at most $\Delta$, and suppose we have $q$ colors with $q \ge \Delta + 1$ (so we are assured that $G$ is $q$-colorable). Let $\Omega$ be the set of all proper $q$-colorings of $G$. Here is a natural Markov chain: Suppose we have a proper coloring $\chi : V \to [q]$. We choose a uniformly random $v \in V$ and a uniformly random color $c \in [q]$. If no neighbor of $v$ in $\chi$ has color $c$, then we recolor $v$ with $c$. Otherwise, we stay at the current coloring.
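Concretely, one step of this coloring chain (a form of Glauber dynamics) can be sketched as follows; this is my own illustration with an arbitrary small example graph, not code from the notes.

```python
import random

def coloring_step(chi, neighbors, q):
    """chi: dict vertex -> color in range(q); neighbors: dict vertex -> list of vertices."""
    v = random.choice(list(chi))
    c = random.randrange(q)
    if all(chi[u] != c for u in neighbors[v]):   # recolor only if c is unused in the neighborhood
        chi[v] = c
    return chi

# Example: a 4-cycle with q = 4 colors (Delta = 2, so q >= Delta + 2).
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
chi = {0: 0, 1: 1, 2: 0, 3: 1}                   # a proper coloring to start from
for _ in range(1000):
    coloring_step(chi, neighbors, q=4)
```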

This example demonstrates the complex structure of Markov chains on combinatorial state spaces. For what values of $q$ (depending on $\Delta$) is the chain irreducible? It turns out that if $q \ge \Delta + 2$, then the chain is always irreducible, and the stationary measure is uniform on proper $q$-colorings. A huge open problem in MCMC (Markov chain Monte Carlo) is to resolve the following conjecture.

Conjecture 12.8. For all $q \ge \Delta + 2$, this Markov chain has mixing time $O(n \log n)$, where $n = |V|$.

The best bound (due to Vigoda, 1999) is that this holds for $q > \frac{11}{6}\Delta$.¹

13 Eigenvalues, expansion, and rapid mixing

Let $P$ be the transition kernel of a reversible, irreducible, aperiodic Markov chain on the state space $\Omega$. Suppose that $P$ has stationary measure $\pi$ (this exists and is unique by the Fundamental Theorem of Markov Chains). Let us also assume that all the eigenvalues of $P$ lie in $[0, 1]$. In the last lecture, we proved that they must lie in $[-1, 1]$. Now by replacing $P$ with $P' = \frac12 I + \frac12 P$, we can ensure that all eigenvalues are nonnegative while only changing the mixing time by a factor of 2.

Suppose the eigenvalues of $P$ are $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{|\Omega|} \ge 0$. In the last lecture, we defined $\tau_{\mathrm{mix}}$ and showed that
$$\frac{1}{1 - \lambda_2} - 1 \le \tau_{\mathrm{mix}} \le O(\log(1/\pi_{\min}))\, \frac{1}{1 - \lambda_2} \,,$$
where $\pi_{\min} := \min\{\pi(x) : x \in \Omega\}$ is the minimum stationary probability. In other words, up to a factor of $O(\log(1/\pi_{\min}))$, the mixing time is controlled by the inverse spectral gap of $P$.

¹This year (2019), a group from MIT has improved this bound slightly.


The Gibbs distribution on matchings. To understand the phrase "rapid mixing," let us consider sampling from a particular measure on an exponentially large state space. Fix an $n$-vertex graph $G = (V, E)$ and consider the set $\mathcal{M}(G)$ of all matchings in $G$; these are precisely the subsets of the edges $E$ in which every vertex has degree at most one. It is clear that $\mathcal{M}(G)$ can be very large; for instance, in the complete graph on $2n$ vertices, we have $\log|\mathcal{M}(G)| = \Theta(n \log n)$.

For a parameter $\lambda \ge 1$, let $\pi_\lambda$ denote the measure on $\mathcal{M}(G)$ where a matching $m$ has probability proportional to $\lambda^{|m|}$. Here, $|m|$ denotes the number of edges in $m$. Thus $\pi_\lambda(m) = \lambda^{|m|}/Z$, where
$$Z = \sum_{m \in \mathcal{M}(G)} \lambda^{|m|}$$
is the corresponding partition function, which can itself be very difficult to compute. (In fact, the ability to approximate $Z$ efficiently is essentially equivalent to the ability to sample efficiently from $\pi_\lambda$.)
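For very small graphs one can of course compute $Z$ and $\pi_\lambda$ by brute-force enumeration, which is exactly what becomes infeasible at scale and motivates the sampling approach. A sketch (my own; the example graph is arbitrary):

```python
from itertools import combinations

def all_matchings(edges):
    for k in range(len(edges) + 1):
        for subset in combinations(edges, k):
            vertices = [v for e in subset for v in e]
            if len(vertices) == len(set(vertices)):   # no vertex is covered twice
                yield subset

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]      # a tiny example graph
lam = 2.0
weights = {m: lam ** len(m) for m in all_matchings(edges)}
Z = sum(weights.values())
pi_lambda = {m: w / Z for m, w in weights.items()}
print(len(weights), Z)
```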

Our goal is to produce a sample from a distribution that is very close to $\pi_\lambda$. To do this, we will define a Markov chain on $\mathcal{M}(G)$ whose stationary distribution is $\pi_\lambda$. We will then show that $\tau_{\mathrm{mix}} \le n^{O(1)}$, implying that there is a polynomial-time algorithm to sample via simulating the chain for $n^{O(1)}$ steps. In general, for such an exponentially large state space indexed by objects of size $n$, we say that the chain is "rapidly mixing" if the mixing time is at most $n^{O(1)}$.

13.1 Conductance

For a pair of states $x, y \in \Omega$, define $Q(x,y) = \pi(x)P(x,y)$ and note that since $P$ is reversible, the detailed balance conditions give us $Q(x,y) = Q(y,x)$. For two sets $S, T \subseteq \Omega$, define $Q(S,T) = \sum_{x \in S} \sum_{y \in T} Q(x,y)$. Finally, given a subset $A \subseteq \Omega$, we define its conductance as the quantity
$$\Phi(A) = \frac{Q(A, \bar{A})}{\pi(A)} \,.$$

Note that $Q(A, \bar{A})$ represents the "ergodic flow" from $A$ to $\bar{A}$—this is the probability of a transition going between $A$ and $\bar{A}$ at stationarity. The conductance $\Phi(A)$ has a straightforward operational interpretation: It is precisely the probability that one step of the Markov chain leaves $A$ when we start from the stationary measure restricted to $A$. Note that if $\Phi(A)$ is small, we expect that the chain might get "trapped" inside $A$, and thus perhaps such a "bottleneck" could be an obstruction to mixing. In fact, we will see momentarily that this is true, and moreover, these are the only obstructions to rapid mixing.

We define the conductance of the chain $P$ to capture the conductance of the "worst" set:
$$\Phi_* = \min_{\pi(A) \le \frac12} \Phi(A) \,.$$

Then we have the following probabilistic version of the discrete Cheeger inequality (proved independently by Jerrum-Sinclair and Lawler-Sokal in the context of Markov chains on discrete spaces).

Theorem 13.1. It always holds that
$$\frac12 (\Phi_*)^2 \le 1 - \lambda_2 \le 2\Phi_* \,.$$


This is a basic fact in spectral graph theory; we will not prove it here. Let us mention, though, that the right-hand side is straightforward—it verifies that indeed a low-conductance set is an obstruction to rapid mixing. The left-hand side, which claims that those are the only such obstructions, is more subtle.

The best way to prove the right-hand side is as follows: Recall the inner product
$$\langle u, v \rangle_{\ell^2(\pi)} = \sum_{x \in \Omega} \pi(x)\, u_x v_x$$
and the associated Euclidean norm $\|v\|_{\ell^2(\pi)} = \sqrt{\langle v, v \rangle_{\ell^2(\pi)}}$. Then using the variational principle for eigenvalues, we have
$$\lambda_2 = \max\left\{ \langle v, vP \rangle_{\ell^2(\pi)} \;:\; \langle v, \mathbf{1} \rangle_{\ell^2(\pi)} = 0,\ \|v\|_{\ell^2(\pi)} = 1 \right\},$$
where $\mathbf{1}$ denotes the all-ones vector. Consider now any $A \subseteq \Omega$ with $\pi(A) \le \frac12$, and define
$$v_x = \begin{cases} \sqrt{\dfrac{1-\pi(A)}{\pi(A)}} & x \in A\,, \\[2ex] -\sqrt{\dfrac{\pi(A)}{1-\pi(A)}} & x \notin A\,. \end{cases}$$

Note that $\langle v, \mathbf{1} \rangle_{\ell^2(\pi)} = \pi(A)\sqrt{\frac{1-\pi(A)}{\pi(A)}} - (1-\pi(A))\sqrt{\frac{\pi(A)}{1-\pi(A)}} = 0$, and
$$\|v\|_{\ell^2(\pi)}^2 = 1 - \pi(A) + \pi(A) = 1 \,.$$

Therefore
$$1 - \lambda_2 \le \langle v, v(I - P) \rangle_{\ell^2(\pi)} = \frac12 \sum_{x,y} Q(x,y)(v_x - v_y)^2 \,,$$
where the last equality is the usual one we have done with Laplacian matrices (like $I - P$) in preceding lectures. But the latter quantity is precisely
$$Q(A, \bar{A}) \left( \sqrt{\frac{1-\pi(A)}{\pi(A)}} + \sqrt{\frac{\pi(A)}{1-\pi(A)}} \right)^2 \le \frac{2\, Q(A, \bar{A})}{\pi(A)} = 2\Phi(A) \,,$$
where the inequality uses the fact that $\pi(A) \le \frac12$. Taking the minimum over such $A$ yields $1 - \lambda_2 \le 2\Phi_*$.
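Both sides of Theorem 13.1 are easy to check numerically on small chains by enumerating all sets. The sketch below (my own; the example is a lazy walk on a path) computes $\Phi_*$ by brute force and compares it with the spectral gap.

```python
import numpy as np
from itertools import combinations

def conductance(P, pi):
    """Brute-force Phi_* = min over A with pi(A) <= 1/2 of Q(A, complement) / pi(A)."""
    n = len(pi)
    Q = pi[:, None] * P
    best = np.inf
    for k in range(1, n):
        for A in combinations(range(n), k):
            A = list(A)
            if pi[A].sum() <= 0.5 + 1e-12:
                comp = [x for x in range(n) if x not in A]
                best = min(best, Q[np.ix_(A, comp)].sum() / pi[A].sum())
    return best

# Lazy random walk on a path with 5 vertices.
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
deg = A.sum(axis=1)
P = 0.5 * np.eye(n) + 0.5 * A / deg[:, None]
pi = deg / deg.sum()

phi = conductance(P, pi)
lam2 = np.sort(np.linalg.eigvalsh(np.diag(pi**0.5) @ P @ np.diag(pi**-0.5)))[-2]
assert 0.5 * phi**2 <= 1 - lam2 <= 2 * phi      # the two sides of Theorem 13.1
```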

13.2 Multi-commodity flows

Although Theorem 13.1 gives a nice characterization of rapid mixing in terms of conductance, the quantity $\Phi_*$ is NP-hard to compute, and can be difficult to get a handle on for explicit chains. Thus we now present another connection between conductance and multi-commodity flows.

We consider a multi-commodity flow instance on a graph with vertices corresponding to the states $\Omega$ and edges $\{x, y\}$ with capacity $Q(x,y)$. The demand between $x$ and $y$ is $\pi(x)\pi(y)$. Let $C^*$ be the optimal congestion that can be achieved by a multi-commodity flow satisfying all the demands (recalling that the congestion of an edge in a given flow is the ratio of the total flow over the edge to its capacity).

Theorem 13.2. It holds that
$$\frac{1}{2C^*} \le \Phi_* \le \frac{1}{C^*}\, O(\log|\Omega|) \,.$$


The right-hand side is due to Leighton and Rao (1988). We will only need the much simpler left-hand side inequality, which can be proved as follows. Suppose there exists a flow achieving congestion $C$ and consider some $A \subseteq \Omega$. Then
$$C \cdot Q(A, \bar{A}) \ge \pi(A)\pi(\bar{A}) \,.$$

This is because the left-hand side represents an upper bound on the total flow going across the cut—$Q(A, \bar{A})$ is the capacity across the cut $(A, \bar{A})$, and we have to rescale by $C$ to account for the congestion. On the other hand, $\pi(A)\pi(\bar{A})$ represents the amount of flow that must be traveling across the cut to satisfy the demands. If $\pi(A) \le \frac12$, we conclude that
$$Q(A, \bar{A}) \ge \frac{\pi(A)\pi(\bar{A})}{C} \ge \frac{\pi(A)}{2C} \,,$$
completing the proof.

Remark 13.3 (Proof sketch of RHS of Theorem 13.2). (This is related to HW#4(c), which would give the worse bound $O((\log|\Omega|)^2)$.) If we use linear programming duality to characterize $C^*$, it has the following dual representation:
$$\frac{1}{C^*} = \min_{d} \frac{\sum_{x,y} Q(x,y)\, d(x,y)}{\sum_{x,y \in \Omega} \pi(x)\pi(y)\, d(x,y)} \,, \qquad (13.1)$$
where the minimum is over all symmetric distance functions $d(x,y)$ on $\Omega \times \Omega$ that satisfy the triangle inequality $d(x,y) \le d(x,z) + d(z,y)$ for all $x, y, z \in \Omega$.

Recall that every finite metric space $(X, d)$ admits a mapping $F : X \to \mathbb{R}^n$ with distortion $D \le O(\log n)$, i.e.,
$$\frac{d(x,y)}{D} \le \|F(x) - F(y)\|_2 \le d(x,y) \qquad \text{for all } x, y \in X \,.$$

Now let us decompose the Euclidean distance on $\mathbb{R}^n$ into a convex combination over cuts. First, note that for any $a, b \in \mathbb{R}$, we have
$$|a - b| = \int_{-\infty}^{\infty} |\chi_s(a) - \chi_s(b)|\, ds \,,$$
where $\chi_s := \mathbf{1}_{(-\infty, s]}$. In other words, $\chi_s(a) = 1$ if $a \le s$ and $\chi_s(a) = 0$ otherwise.

Let $g$ denote a random $n$-dimensional Gaussian vector, i.e., $g = (g_1, \dots, g_n)$ where the $g_i$ are i.i.d. $N(0,1)$ random variables. Recall that for $u, v \in \mathbb{R}^n$, we have $\|u - v\|_2^2 = \mathbb{E}[\langle u - v, g \rangle^2]$, because $\langle u - v, g \rangle$ is an $N(0, \|u - v\|_2^2)$ random variable (by the 2-stability property of normal random variables). One can also calculate: If $g_0$ is an arbitrary normal random variable with mean zero, then
$$\mathbb{E}[|g_0|] = \sqrt{\tfrac{2}{\pi}}\, \sqrt{\mathbb{E}[g_0^2]} \,.$$

Therefore:
$$\|u - v\|_2 = \sqrt{\mathbb{E}[\langle u - v, g \rangle^2]} = \sqrt{\tfrac{\pi}{2}}\; \mathbb{E}\left| \langle u - v, g \rangle \right|.$$

We thus arrive at the following "cut decomposition" for all of $\mathbb{R}^n$:
$$\|u - v\|_2 = \sqrt{\tfrac{\pi}{2}}\; \mathbb{E}_g\left[ \int_{-\infty}^{\infty} \left| \chi_s(\langle u, g \rangle) - \chi_s(\langle v, g \rangle) \right| ds \right].$$


Suppose now that $d$ is the optimal metric in (13.1) and let $F : \Omega \to \mathbb{R}^n$ denote a distortion-$D$ embedding with $D \le O(\log n)$. The distortion condition yields
$$\frac{1}{C^*} \ge \frac{1}{D} \cdot \frac{\sum_{x,y} Q(x,y)\, \|F(x) - F(y)\|_2}{\sum_{x,y} \pi(x)\pi(y)\, \|F(x) - F(y)\|_2} = \frac{1}{D} \cdot \frac{\mathbb{E}_g\left[ \int_{-\infty}^{\infty} \sum_{x,y} Q(x,y)\, \left| \chi_s(\langle F(x), g \rangle) - \chi_s(\langle F(y), g \rangle) \right| ds \right]}{\mathbb{E}_g\left[ \int_{-\infty}^{\infty} \sum_{x,y} \pi(x)\pi(y)\, \left| \chi_s(\langle F(x), g \rangle) - \chi_s(\langle F(y), g \rangle) \right| ds \right]} \,.$$

Finally, we observe that (for non-negative $f$ and $h$)
$$\frac{\int f(x)\, dx}{\int h(x)\, dx} \ge \min_x \frac{f(x)}{h(x)} \,.$$

Thus there exists some choice of $g \in \mathbb{R}^n$ and $s \in \mathbb{R}$ such that
$$\frac{1}{C^*} \ge \frac{1}{D} \cdot \frac{\sum_{x,y} Q(x,y)\, |\chi_s(\langle F(x), g \rangle) - \chi_s(\langle F(y), g \rangle)|}{\sum_{x,y} \pi(x)\pi(y)\, |\chi_s(\langle F(x), g \rangle) - \chi_s(\langle F(y), g \rangle)|} \,,$$
but the latter ratio is precisely $\frac{Q(A, \bar{A})}{\pi(A)\pi(\bar{A})}$ for the set $A = \{x \in \Omega : \langle F(x), g \rangle \le s\}$, hence
$$\frac{1}{C^*} \ge \frac{1}{D} \cdot \frac{Q(A, \bar{A})}{\pi(A)\pi(\bar{A})} \ge \frac{1}{2D} \cdot \frac{Q(A, \bar{A})}{\min(\pi(A), \pi(\bar{A}))} \ge \frac{\Phi_*}{2D} \,,$$
verifying the RHS of Theorem 13.2.

13.3 The Gibbs sampler

Recall now that our goal is to sample from the Gibbs measure $\pi_\lambda$ introduced earlier. The following Markov chain is due to Jerrum and Sinclair. If we are currently at a matching $m \in \mathcal{M}(G)$, we define our local transition as follows.

1. With probability 1/2, we stay at $m$.

2. Otherwise, choose an edge $e = \{u, v\} \in E(G)$ uniformly at random and:

   (a) If both $u$ and $v$ are unmatched in $m$, set $m := m \cup \{e\}$.

   (b) If $e \in m$, then with probability $1/\lambda$, put $m := m \setminus \{e\}$, and otherwise stay at $m$.

   (c) If exactly one of $u$ or $v$ is matched in $m$, then let $e'$ be the unique edge of $m$ that contains one of $u$ or $v$, and put $m := (m \setminus \{e'\}) \cup \{e\}$.

   (d) If both $u$ and $v$ are matched (but $e \notin m$), stay at $m$.

Exercise: Show that this chain is reversible with respect to the measure πλ.
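For concreteness, here is a minimal sketch (my own implementation of the rules above, with hypothetical helper names) of a single transition of this chain.

```python
import random

def matching_step(m, edges, lam):
    """One transition from a matching m, given as a set of frozenset edges."""
    if random.random() < 0.5:
        return m                                    # step 1: stay put with probability 1/2
    u, v = random.choice(edges)                     # step 2: a uniformly random edge e = {u, v}
    e = frozenset((u, v))
    matched = {x for f in m for x in f}
    if u not in matched and v not in matched:       # (a) both endpoints free: add e
        return m | {e}
    if e in m:                                      # (b) e is in m: remove it with probability 1/lam
        return m - {e} if random.random() < 1.0 / lam else m
    if (u in matched) != (v in matched):            # (c) exactly one endpoint matched: slide
        w = u if u in matched else v
        e_prime = next(f for f in m if w in f)
        return (m - {e_prime}) | {e}
    return m                                        # (d) both endpoints already matched elsewhere

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]            # a 4-cycle as a toy example
m = set()
for _ in range(10000):
    m = matching_step(m, edges, lam=2.0)
```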

Now we would like to prove that this chain is rapidly mixing by giving a low-congestion multi-commodity flow in the corresponding graph. In fact, we will give an "integral flow," i.e. we will specify, for every pair of matchings $x, y \in \mathcal{M}(G)$, a path $\gamma_{xy}$.

To do this, consider the edges of $x$ to be colored red and the edges of $y$ to be colored blue. Then the colored union $x \cup y$ is a multi-graph where every node has degree at most 2. It is easy to see that every such graph breaks into a disjoint union of paths and even-length cycles. (Note also the trivial cycles of length two when $x$ and $y$ share an edge.)

The path $\gamma_{xy}$ will "fix" each of these components one at a time (in some arbitrary order). The trivial cycles are already fine (we don't have to move those edges). To explain how to handle the path components, we look at a simple example. Suppose the path is $e_1, e_2, e_3, e_4, e_5, e_6$. Then we define a path from the red matching to the blue matching (in this component) as follows:
$$\{e_1, e_3, e_5\} \to \{e_3, e_5\} \to \{e_2, e_5\} \to \{e_2, e_4\} \to \{e_2, e_4, e_6\} \,.$$

Note that each transition is a valid step of the chain. We can do a similar thing for a cycle by firstdeleting a red edge so that it becomes a path.

Congestion analysis. So now we have given a path $\gamma_{xy}$ between every pair of states $x, y \in \mathcal{M}(G)$. In the flow, this path should carry flow value $\pi_\lambda(x)\pi_\lambda(y)$ so that it satisfies the corresponding demand. We are left to analyze the total weight of the paths that use a given "edge" (a transition) of the chain. The interested reader is referred to the beautiful argument in the original paper of Jerrum and Sinclair.
