
Games and Economic Behavior 37, 340–366 (2001). doi:10.1006/game.2000.0841, available online at http://www.idealibrary.com

A Behavioral Learning Process in Games¹

Jean-François Laslier

CNRS and Laboratoire d'Économétrie, École Polytechnique, Paris, France

Richard Topol

CNRS and CREA, École Polytechnique, Paris, France

and

Bernard Walliser²

CERAS, École Nationale des Ponts et Chaussées, and CREA, École Polytechnique, Paris, France

Received July 4, 1996; published online September 14, 2001

This paper studies the cumulative proportional reinforcement (CPR) rule, according to which an agent plays, at each period, an action with a probability proportional to the cumulative utility that the agent has obtained with that action. The asymptotic properties of this learning process are examined for a decision-maker under risk, where it converges almost surely toward the expected utility maximizing action(s). The process is further considered in a two-player game; it converges with positive probability toward any strict pure Nash equilibrium and converges with zero probability toward some mixed equilibria (which are characterized). The CPR rule is compared in its principles with other reinforcement rules and with replicator dynamics. Journal of Economic Literature Classification Number: C72. © 2001 Academic Press

Key Words: evolution; learning; Nash equilibrium; Polya urn; reinforcement.

1. INTRODUCTION

A host of recently introduced game-theoretic learning models study dynamic systems of boundedly rational players. Specific attention is

¹ We thank Michel Balinski, Nicolas Bouleau, Olivier Compte, Yuri Kaniovski, Tom Palfrey, Sylvain Sorin, and especially Michel Benaïm and an associate editor for very helpful comments.

² To whom correspondence should be addressed. E-mail: [email protected].



focused on the asymptotic behavior of such systems and how it compares with the usual equilibrium concepts (Fudenberg and Levine, 1998; Young, 1998). In fact, two types of learning processes, defined in an infinite repetition of a basic normal form game, may be considered according to the information that players are able to gather and compute (Walliser, 1998). In epistemic learning, players observe all past moves of their opponents, estimate their next actions based on various "forecasting rules," and play myopically a best response to their expected moves. In behavioral learning, players only observe the utility payoffs from their previous actions and revise their strategies through "reinforcement rules" that favor the best and inhibit the worst.

The first learning process was analyzed extensively, especially its convergence properties; that is, the asymptotic distribution of actions that may or may not coincide with a Nash equilibrium. The most usual rule is "fictitious play," wherein each player assumes that the opponent's future action is played with a probability equal to its past frequency, and plays a pure action that is a best response (Robinson, 1950; Nachbar, 1990). A generalization is the "smooth fictitious play" rule, wherein each player has the same expectation of his or her opponents' future actions, but defines a stochastic best response in which the actions with greater utility are played with higher probabilities (Kaniovski and Young, 1995; Benaïm and Hirsch, 1999).

The second learning process has been studied far less, even though it was suggested early (Bush and Mosteller, 1955) and convergence results were already proved in some specific cases (Arthur, 1993). A general "index rule" states that a player computes from past payoffs a utility index associated with each action, then chooses an action with a probability strictly increasing with the value of its index. The more specific cumulative proportional reinforcement (CPR) rule, studied extensively in this paper, considers the utility index to be the cumulated utility obtained in the past and assumes that the probability of choosing an action is proportional to this index.

The foregoing rule is similar to the one introduced by Roth and Erev (1995) to test its predictions in laboratory experiments. But these authors emphasized the first periods of the process, rather than the long-term behavior of the rule. Conversely, Borgers and Sarin (1997, 2000) studied the asymptotic properties of two specific reinforcement rules; their rules are rather complex, because the utility index is an adaptive one and includes aspiration levels that follow their own evolution schemes. Finally, Posch (1997) studied the long-term properties (especially the cycles) resulting from a rule that is similar to CPR, except with the utility index subject to some normalization conditions.

A learning rule cannot be intrinsically rational, because a player is not able to simulate his or her opponent's behavior (in epistemic learning) or is not even aware that he or she faces another player (in behavioral learning).


Hence one can think of many different rules, showing various forms of bounded rationality in a dynamic setting, and these rules can be compared only according to some criteria stated by the modeler. A first criterion is the implementation cost of the rule, which is related to the nature of the observations that it needs and to the computational complexity of its application. A second criterion is its efficiency, which concerns the asymptotic state to which it leads (if any) and the speed of convergence (typically not well known).

Three efficiency conditions may be stated and interpreted according to the exploration–exploitation trade-off, relevant for any repeated decision process in an imperfectly known stochastic environment (Gittins, 1989). The learning process must be optimizing in a stationary environment, which implies that it allows for a successful exploitation at its end; that is, it converges to the (expected) utility maximizing action. The learning process must be flexible in a nonstationary environment, which implies that it allows for an extended exploration at its beginning, avoiding early lock-in to some undesirable action or cycle. The learning process must be progressive in any kind of environment, which implies that it allows for a smooth transition between exploration and exploitation, so experimentation slows down.

The results proved in this paper show that the CPR rule satisfies these three efficiency conditions. In the long run, the rule converges toward the optimizing action, because an action with high (average) utility is played more often, hence further increasing its cumulative utility by positive retroaction. In early periods, the rule is open to experimentation of all actions, because the prior indexes of each action are assumed to be equal, except if the agent has some prior information on their (expected) utility. In the intermediate periods, the rule assumes a smooth transition because the probability of each action evolves slowly, tending more and more toward the best actions.

The fictitious play rule does not satisfy the second condition, because the process can be trapped in a subset of actions (Shapley, 1964). The point is that fictitious play is deterministic and allows passive experimentation during the natural course of the game, but does not allow active experimentation by voluntary deviations from it. The smooth fictitious play rule does not satisfy the first condition, because it cannot converge toward a pure action; hence it excludes maximizing actions. The point is that the rule always reacts stochastically, because the expected utility of each action (conditional on others' past frequency of actions) is always strictly positive.

The CPR rule, like smooth fictitious play, leads formally to a stochastic process in discrete time, which may be studied for two-player games using Polya urns in two equivalent ways. One may consider two coupled linear Polya urns, wherein the ball colors in each urn correspond to the player's actions and the number of balls of a given color corresponds to the past


cumulative utility of that action. One may also consider a unique nonlinear Polya urn, wherein the ball colors correspond to any combination of actions and the number of balls of a given color is the number of times the corresponding combination of actions occurred in the past. The second form happens to be more convenient for the analysis, because convergence results are now available for nonlinear Polya urns (Hill et al., 1980; Arthur et al., 1984; Pemantle, 1989; Benaïm and Hirsch, 1996; Benaïm, 1999).

The paper is organized as follows. Section 2 introduces the CPR rule and studies the stochastic dynamic process that it induces, as well as the associated deterministic dynamic process. Section 3 studies the CPR rule used by a decision-maker in a stationary and risky environment and proves that the process converges toward the expected utility maximizing action(s). Section 4 studies the CPR rule used by both players of a two-player game and proves that the process converges with positive probability toward a strict Nash equilibrium and with zero probability toward "linearly unstable" equilibria. Section 5 compares, at the level of assumptions, the CPR rule with other reinforcement rules and with replicator dynamics.

2. THE CUMULATIVE PROPORTIONAL REINFORCEMENT RULE

2.1. Definition

For a player in a repeated-choice situation, consider the following variables, where i ∈ I denotes an action and t ∈ ℕ the time:

• δ_i(t) is the Kronecker function of the played action: δ_i(t) = 1 if i is played at t and 0 otherwise.

• u(t) is the utility (or "payoff") provided by the played action; it is assumed positive.

The CPR rule, a reinforcement rule and more precisely an index rule (see Section 5.1), is defined as follows:

• The cumulative utility of action i at time t, CU_i(t), is the sum of the utilities obtained up to t by using action i, including some positive initial value:

$$ CU_i(t+1) = CU_i(t) + \delta_i(t)\,u(t) = \sum_{\tau=1}^{t} \delta_i(\tau)\,u(\tau) + CU_i(1). $$

• The probability of choosing action i at time t, p_i(t), is proportional to its cumulative utility:

$$ p_i(t) = \frac{CU_i(t)}{CU(t)}, \qquad \text{with } CU(t) = \sum_j CU_j(t). $$
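To make the rule concrete, here is a minimal Python sketch of a CPR agent. The class name and the illustrative payoffs are ours, not the paper's; all initial cumulative utilities are set to the same positive value, as assumed above.

```python
import random

class CPRAgent:
    """Cumulative proportional reinforcement over a finite action set."""

    def __init__(self, n_actions, initial_cu=1.0):
        # CU_i(1) > 0 for every action: equal priors unless stated otherwise.
        self.cu = [initial_cu] * n_actions

    def choose(self):
        # p_i(t) = CU_i(t) / sum_j CU_j(t): draw an action proportionally.
        total = sum(self.cu)
        r, acc = random.uniform(0.0, total), 0.0
        for i, w in enumerate(self.cu):
            acc += w
            if r <= acc:
                return i
        return len(self.cu) - 1

    def update(self, action, payoff):
        # CU_i(t+1) = CU_i(t) + delta_i(t) u(t), with u(t) > 0.
        self.cu[action] += payoff

# Usage sketch: a two-action decision problem with hypothetical payoffs.
agent = CPRAgent(n_actions=2)
for t in range(1000):
    a = agent.choose()
    u = random.choice([2.0, 1.0]) if a == 0 else 1.0  # action 0 is better on average
    agent.update(a, u)
print(agent.cu)
```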


If payoffs are integers, then an individual who behaves according to the CPR rule can be characterized by a player's urn, which contains balls of |I| different colors, corresponding to the various actions. Each time that an action is played and a payoff is received, the individual adds to his or her urn a number of balls of the corresponding color equal to the received payoff. The player's urn is called the purse. At time t, the number of balls of color i in the player's purse is CU_i(t). Each time that the player has to choose an action, he or she picks a ball at random from the purse and plays the corresponding action. Hence the probability p_i(t) of playing i at time t is proportional to CU_i(t).

The definition of the CPR rule is independent of the individual's environment. We consider two environments:

• Nature is a passive opponent that can be in state h ∈ H. The player receives a constant and positive utility u_{i,h} from the combination of action and state (i, h). At each period, Nature selects state h with a constant probability q_h.

• A second player is an active opponent who chooses an action h ∈ H. Player 1 receives utility u_{i,h} and player 2 receives utility v_{i,h} from the combination of actions (i, h), both constant and positive. Player 2 uses the CPR rule with his or her own cumulative utility CV_h(t) and probability q_h(t).

The first case cannot be reduced to the second, because the stationarity of q_h for Nature would imply that CV_h is also stationary, which is not compatible with a cumulative utility (even if it is a limit case for v_{i,h} = 0 for all i and h). Nonetheless, the two cases can be treated simultaneously by considering the joint probability r_{i,h} of the combination (i, h),

$$ r_{i,h}(t) = p_i(t)\,q_h(t), $$

which depends only on the history of the process. The process can be characterized by a more abstract urn called the modeler's urn. The modeler's urn is unique but contains balls of |I| · |H| different types, corresponding to all possible combinations of actions. Each time that the combination (i, h) is played, one ball of type (i, h) is added. Hence the number n_{i,h}(t) of balls of type (i, h) in the urn at date t is proportional to the past frequency x_{i,h} of the combination (i, h). At each period, one ball is picked out of the modeler's urn, and the corresponding actions are played. The probability r_{i,h} of picking an (i, h) ball is not equal to the proportion x_{i,h} of balls of type (i, h) (the modeler's urn is a nonlinear Polya urn) but, as we show, the vector r is related to the vector x by a "transition urn function" r = r(x) that does not depend on time.


2.2. The Stochastic Process

Let x_{i,h}(t) = n_{i,h}(t)/(t − 1) be the past frequency of the combination (i, h). Write

$$ u_{i\cdot} = \sum_h x_{i,h}\,u_{i,h}, \qquad u_{\cdot\cdot} = \sum_i u_{i\cdot}. $$

The probability of playing i at time t is

$$ p_i(t) = \frac{CU_i(t)}{CU(t)} = \frac{\sum_k n_{i,k}(t)\,u_{i,k}}{\sum_{j,k} n_{j,k}(t)\,u_{j,k}} = \frac{\sum_k x_{i,k}(t)\,u_{i,k}}{\sum_{j,k} x_{j,k}(t)\,u_{j,k}} = \frac{u_{i\cdot}(t)}{u_{\cdot\cdot}(t)}. \qquad (1) $$

If needed, the corresponding notations for the other player are

$$ v_{\cdot h}(t) = \sum_i x_{i,h}(t)\,v_{i,h}, \qquad v_{\cdot\cdot}(t) = \sum_{i,h} x_{i,h}(t)\,v_{i,h}, \qquad q_h(t) = \frac{v_{\cdot h}(t)}{v_{\cdot\cdot}(t)}. $$

The (time-invariant) transition urn functions r = r(x) are then as follows:

• for one player against Nature:

$$ r_{i,h} = p_i\,q_h = \frac{\sum_k x_{i,k}\,u_{i,k}}{\sum_{j,k} x_{j,k}\,u_{j,k}}\; q_h = \frac{u_{i\cdot}}{u_{\cdot\cdot}}\; q_h; \qquad (2) $$

• for two players:

$$ r_{i,h} = p_i\,q_h = \frac{\sum_k x_{i,k}\,u_{i,k}}{\sum_{j,k} x_{j,k}\,u_{j,k}} \cdot \frac{\sum_j x_{j,h}\,v_{j,h}}{\sum_{j,k} x_{j,k}\,v_{j,k}} = \frac{u_{i\cdot}}{u_{\cdot\cdot}} \cdot \frac{v_{\cdot h}}{v_{\cdot\cdot}}. \qquad (3) $$
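For concreteness, the two-player transition urn function (3) can be computed directly from the payoff matrices and the frequency matrix x. The following sketch (NumPy assumed; the function name and the example matrices are illustrative) implements equations (1)–(3):

```python
import numpy as np

def transition_urn_function(x, U, V):
    """Transition urn function r = r(x) of equation (3), two-player case.

    x : |I| x |H| matrix of past frequencies x_{i,h} (nonnegative, sums to 1)
    U : |I| x |H| matrix of positive payoffs u_{i,h} for player 1
    V : |I| x |H| matrix of positive payoffs v_{i,h} for player 2
    """
    u_i = (x * U).sum(axis=1)          # u_{i.} = sum_k x_{i,k} u_{i,k}
    v_h = (x * V).sum(axis=0)          # v_{.h} = sum_j x_{j,h} v_{j,h}
    p = u_i / u_i.sum()                # p_i = u_{i.} / u_{..}   (equation (1))
    q = v_h / v_h.sum()                # q_h = v_{.h} / v_{..}
    return np.outer(p, q)              # r_{i,h} = p_i q_h

# Usage sketch on a 2 x 2 game with a uniform frequency matrix.
U = np.array([[2.0, 1.0], [1.0, 2.0]])
V = np.array([[2.0, 1.0], [1.0, 2.0]])
x = np.full((2, 2), 0.25)
print(transition_urn_function(x, U, V))
```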

The stochastic evolution of the modeler's urn is given by

$$ n_{i,h}(t+1) = n_{i,h}(t) + \varepsilon_{i,h}(t), $$

with ε_{i,h}(t) equal to 1 with probability r_{i,h}(t) and 0 with probability 1 − r_{i,h}(t). In frequency terms, the same relation can be written as

$$ x_{i,h}(t+1) - x_{i,h}(t) = \frac{1}{t}\bigl(-x_{i,h}(t) + \varepsilon_{i,h}(t)\bigr). \qquad (4) $$

Hence the conditional expected value of the increment of x is

$$ E\bigl[\Delta x_{i,h}(t) \mid x(t)\bigr] = \frac{1}{t}\bigl(-x_{i,h}(t) + r_{i,h}(t)\bigr). \qquad (5) $$

The stochastic evolution of the player's purse is given by

$$ CU_i(t+1) = CU_i(t) + \varepsilon_i(t), $$

with ε_i(t) being u_{i,h} with probability p_i(t)q_h(t) (for any h) and 0 with probability 1 − p_i(t). In probability terms, the same relation is

$$ p_i(t+1) = \frac{CU_i(t) + u_{j,h}\,\delta_{i,j}}{CU(t) + u_{j,h}} $$


with probability p_j(t)q_h(t). Hence the conditional expected value of the increment of p is

$$ E\bigl[\Delta p_i(t) \mid x(t)\bigr] = \frac{p_i(t)}{CU(t)} \sum_{j,h} \frac{p_j(t)\,q_h(t)\,\bigl(u_{i,h} - u_{j,h}\bigr)}{\Bigl(1 + \dfrac{u_{i,h}}{CU(t)}\Bigr)\Bigl(1 + \dfrac{u_{j,h}}{CU(t)}\Bigr)}. \qquad (6) $$

Because the number CU(t) that appears in this expression is a function of the |I| · |H| numbers x_{i,h}(t), it is not possible to write the expected value of the increment Δp_i(t) as a function of the |I| + |H| variables p_i(t) and q_h(t). The coupling of the two purses really defines an |I| · |H|-dimensional urn process.

We now state a first result, which proves that the CPR rule generates endless active experimentation.

Proposition 1. In a stochastic process generated by the CPR rule, each action is chosen an infinite number of times.

Proof. Consider the indexes associated with action i and with the other actions: CU_i(t) ≥ CU_i(1) and CU_j(t) ≤ t max_h u_{j,h}. For some positive constants a, b, and c, the probability of playing action i at time t can therefore be bounded from below, independently of the history of the system before t:

$$ p_i(t) \ge \frac{a}{b + ct}. $$

It follows that the probability of never playing i after time t is less than the infinite product

$$ \prod_{\tau=t}^{\infty} \left(1 - \frac{a}{b + c\tau}\right) = 0. $$

The result follows. Q.E.D.

2.3. The Associated Deterministic Process

To sum up, the dynamic stochastic process generated by the CPR rule is completely determined by the stochastic evolution of the |I| · |H|-dimensional state variable x(t) = (x_{i,h}(t))_{(i,h) ∈ I×H}. This evolution is specified by the initial state x(0) and (4), which in vectorial form is

$$ x(t+1) - x(t) = \frac{1}{t}\bigl(-x(t) + \varepsilon(t)\bigr), \qquad t \in \mathbb{N}. \qquad (7) $$

We call a sequence (x(t))_{t∈ℕ} randomly obtained by (7) simply a "CPR sequence." The associated deterministic difference equation is obtained by taking the expectation of the variation of x [see (5)],

$$ x(t+1) - x(t) = \frac{1}{t}\bigl(-x(t) + r(x(t))\bigr), \qquad t \in \mathbb{N}, $$


where r is the transition urn function (2, 3). In continuous time, the ordinary differential equation (ODE) associated with (7) is

$$ \dot{x}(t) = \frac{1}{t}\bigl(-x(t) + r(x(t))\bigr), \qquad t \in (0, +\infty). $$

Notice that the coefficient 1/t applies to all coordinates of the vector x; therefore, the trajectories of the previous ODE are the same as the trajectories of the following ODE, on which we work:

$$ \dot{x}(t) = -x(t) + r(x(t)), \qquad t \in (0, +\infty). \qquad (8) $$

We call a trajectory (x(t)) determined by (8) a "deterministic CPR trajectory."

The question is then to know whether a random CPR sequence is close to a deterministic CPR trajectory. To state that such is the case, we need three technical notions:

• A semiflow on ℝ^N is a continuous map

$$ \Phi : [0, +\infty) \times \mathbb{R}^N \to \mathbb{R}^N, \qquad (t, x_0) \mapsto \Phi(t, x_0), $$

such that Φ(0, x_0) = x_0 and Φ(t + s, x_0) = Φ(t, Φ(s, x_0)).

• A continuous function x : [0, +∞) → ℝ^N is an asymptotic pseudotrajectory for Φ if, for any T > 0,

$$ \lim_{t \to \infty}\; \sup_{0 \le h \le T} \bigl\| x(t+h) - \Phi(h, x(t)) \bigr\| = 0. $$

• The continuous piecewise affine interpolation of a sequence (x(t))_{t∈ℕ} is the function x : [0, +∞) → ℝ^N defined by

$$ x(t+s) = (1-s)\,x(t) + s\,x(t+1), \qquad t \in \mathbb{N},\ s \in [0, 1]. $$

The ODE (8) has a unique solution for each initial condition. The solution starting from an initial condition x_0 and considered at time t can be denoted by Φ(t, x_0). It follows from basic properties of differential equations that Φ is a semiflow. The definition of an asymptotic pseudotrajectory for Φ means that for any length T, the function x on an interval [t, t + T] is approximately a solution to (8), with arbitrary accuracy for t large enough. Notice that it does not say that x on [t, +∞) is, even approximately, a solution to (8). More precisely, say that a function x is a limit trajectory for Φ if

$$ \lim_{t \to \infty}\; \sup_{0 \le h} \bigl\| x(t+h) - \Phi(h, x(t)) \bigr\| = 0. $$


Clearly, a limit trajectory is an asymptotic pseudotrajectory. To see that the converse is not true, consider the following counterexample on the positive real line: Φ(h, a) = a and x(t) = log t. Here, for any t,

$$ \sup_{0 \le h} \bigl| x(t+h) - \Phi(h, x(t)) \bigr| = +\infty, $$

and hence x is not a limit trajectory. But for any T,

$$ \sup_{0 \le h \le T} \bigl| x(t+h) - \Phi(h, x(t)) \bigr| = \log(t+T) - \log t $$

tends to 0 when t tends to infinity, and hence x is an asymptotic pseudotrajectory.

The following lemma asserts that, according to the previous definitions,

the stochastic process can be approximated by its expectation. The main argument is that in urn processes, the speed of evolution decreases like 1/t.

Lemma 1. The interpolation of a CPR sequence is almost surely an asymptotic pseudotrajectory of the semiflow Φ induced by the associated ordinary differential equation.

Proof. The result is deduced from Propositions 4.1 and 4.2 of Benaïm (1999), whose hypotheses are easily checked. First, the stochastic and deterministic processes stay in a bounded space M (here the simplex of ℝ^N with N = |I| · |H|). Second, the vector field on the right side of (8), −x + r(x), is bounded and Lipschitz on M, so that it is globally integrable and the associated semiflow Φ is well defined. Third, the steps γ_t = 1/t of the process (7) are deterministic, with Σ_t γ_t² < ∞, and the perturbations ε(t) are bounded. Q.E.D.
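A loose numerical illustration of Lemma 1, under assumptions of our own choosing (a 2 × 2 coordination game and an asymmetric starting point), is to integrate the ODE (8) by an Euler scheme and to run the stochastic recursion (7) side by side; the small offset added to 1/t stands in for a few initial balls in the urn.

```python
import numpy as np

rng = np.random.default_rng(0)
U = np.array([[2.0, 1.0], [1.0, 2.0]])   # illustrative payoffs, player 1
V = np.array([[2.0, 1.0], [1.0, 2.0]])   # illustrative payoffs, player 2

def r_of_x(x):
    # Transition urn function r(x) of equation (3).
    p = (x * U).sum(axis=1); p = p / p.sum()
    q = (x * V).sum(axis=0); q = q / q.sum()
    return np.outer(p, q)

x0 = np.array([[0.4, 0.2], [0.2, 0.2]])  # asymmetric starting frequencies

# Deterministic CPR trajectory: Euler scheme for the ODE (8), dx/dt = -x + r(x).
x_det, dt = x0.copy(), 0.01
for _ in range(5000):
    x_det += dt * (-x_det + r_of_x(x_det))

# Stochastic CPR sequence: recursion (7); the offset in 1/(t + 20) plays the
# role of a few initial balls already present in the modeler's urn.
x_sto = x0.copy()
for t in range(1, 50001):
    r = r_of_x(x_sto)
    ball = rng.choice(4, p=r.ravel())
    eps = np.zeros(4); eps[ball] = 1.0
    x_sto += (1.0 / (t + 20)) * (-x_sto + eps.reshape(2, 2))

print("deterministic trajectory endpoint:\n", np.round(x_det, 3))
print("stochastic state after 50000 periods:\n", np.round(x_sto, 3))
```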

3. INDIVIDUAL DECISION-MAKING

3.1. Rest Points

A natural candidate for an asymptotic state of the CPR stochastic process is a rest point of the associated deterministic process (8). These are the points x such that, for all i and h,

$$ x_{i,h} = r_{i,h} = p_i\,q_h. \qquad (9) $$

The rest points are easy to characterize.

Proposition 2. In individual decision-making, a state x is a rest point if and only if all actions played in x give the same expected utility: if p_i > 0, then Σ_h q_h u_{i,h} = u_{··}.


Proof. According to (1), for all i and h,

$$ x_{i,h} = p_i\,q_h = \frac{u_{i\cdot}}{u_{\cdot\cdot}} \cdot q_h. $$

Moreover, u_{i·} can be written with p_i and q_k as

$$ u_{i\cdot} = \sum_k x_{i,k}\,u_{i,k} = p_i \sum_k q_k\,u_{i,k}. $$

Because q_h > 0 for some h, if i is such that p_i > 0, then Σ_k q_k u_{i,k} = u_{··}. Q.E.D.

For instance, playing a specific action with probability 1 constitutes a rest point. Any maximizing action (or any combination of maximizing actions if there are several) provides a rest point.

3.2. Convergence of the Deterministic Trajectories

We now prove that the deterministic CPR trajectory converges to the choice of the maximizing action (if unique) or of a mixture of maximizing actions.

Proposition 3. In individual decision-making, any deterministic CPR trajectory converges to a state in which the decision-maker plays only maximizing actions.

Proof. Let x(·) be a trajectory of the ODE (8) with an initial position x(0) in the simplex of ℝ^{I×H}. By using ẋ_{i,k} = −x_{i,k} + p_i q_k, it follows that for any i,

$$ \frac{\dot{p}_i}{p_i} = \frac{\dot{u}_{i\cdot}}{u_{i\cdot}} - \frac{\dot{u}_{\cdot\cdot}}{u_{\cdot\cdot}} = \frac{p_i \sum_k q_k\,u_{i,k}}{u_{i\cdot}} - \frac{\sum_{j,k} p_j\,q_k\,u_{j,k}}{u_{\cdot\cdot}} = \frac{\sum_k q_k\,u_{i,k} - \sum_{j,k} p_j\,q_k\,u_{j,k}}{u_{\cdot\cdot}}. $$

This relation states that the CPR rule strengthens the actions whose expected payoff is larger than the mean. If i and j are two actions yielding the same expected payoff, then ṗ_i/p_i = ṗ_j/p_j = (ṗ_i + ṗ_j)/(p_i + p_j), so that p_i/p_j is constant.

Let M denote the set of maximizing actions and let p_M = Σ_{i∈M} p_i. If M = I, then p_i is constant for all i. If not, then let u_1 denote the maximum expected payoff in M and u_2 denote the maximum expected payoff in I\M. If p_M > 0, then

$$ \frac{\dot{p}_M}{p_M} = \frac{u_1 - p_M\,u_1 - \sum_{j \notin M} p_j \sum_h q_h\,u_{j,h}}{u_{\cdot\cdot}} \;\ge\; (1 - p_M)\,\frac{u_1 - u_2}{u_1}. $$


From this differential inequality, it follows that p_M(t) tends to 1 when t tends to infinity; the decision-maker tends to choose only maximizing actions. Hence the limit of the trajectory is easily found: for all k, lim_{t→∞} x_{i,k}(t) = 0 if i ∉ M and lim_{t→∞} x_{i,k}(t) = (p_i(0)/Σ_{j∈M} p_j(0)) q_k if i ∈ M. Q.E.D.

3.3. Convergence of the Stochastic Process

The next proposition states that the stochastic process converges almost surely to the choice of the maximizing action (if unique) or of a mixture of maximizing actions. The proof, given in Appendix A, is grounded on a lower-bound comparison that allows one to apply the global convergence result for two-color urns of Hill et al. (1980).

Proposition 4. In individual decision-making, the CPR process almost surely converges to a state in which the decision-maker plays only maximizing actions.
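A Monte Carlo sketch consistent with Proposition 4 (the payoff table and the parameters are illustrative, not taken from the paper):

```python
import random

def cpr_decision(utilities, state_probs, periods=20000, seed=0):
    """Simulate the CPR rule against a stationary risky environment.

    utilities[i][h]  : positive payoff of action i in state h (assumed example)
    state_probs[h]   : probability that Nature draws state h
    Returns the final choice probabilities p_i(t) = CU_i(t) / CU(t).
    """
    rng = random.Random(seed)
    cu = [1.0] * len(utilities)               # equal positive initial indexes
    states = list(range(len(state_probs)))
    for _ in range(periods):
        i = rng.choices(range(len(cu)), weights=cu)[0]   # p_i proportional to CU_i
        h = rng.choices(states, weights=state_probs)[0]  # Nature's draw
        cu[i] += utilities[i][h]
    total = sum(cu)
    return [c / total for c in cu]

# Action 0 has expected utility 1.6, action 1 has 1.4: Proposition 4 predicts
# that the choice probability of action 0 tends to 1 almost surely.
print(cpr_decision(utilities=[[2.0, 1.0], [1.0, 2.0]], state_probs=[0.6, 0.4]))
```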

4. TWO-PLAYER GAME

4.1. Rest Points

The rest points of the associated deterministic process (5, 8) are the points x such that (9) holds: x_{i,h} = r_{i,h} = p_i q_h. The equality x_{i,h} = p_i q_h for all i and h means that the strategies played in the past by the two players form statistically independent distributions. This is a different statement from the independence of the instantaneous choices described by r_{i,h} = p_i q_h, and it generally does not hold outside rest points. Again, the rest points can be characterized as follows.

Proposition 5. In a two-person game, a state is a rest point if and only if

(i) players' past choices are statistically uncorrelated: ∀ i, h, x_{i,h} = p_i q_h; and

(ii) all played actions give the same expected utility: if p_i ≠ 0, then Σ_k q_k u_{i,k} = u_{··}, and if q_h ≠ 0, then Σ_j p_j v_{j,h} = v_{··}.

Proof. Expressing x_{i,h} = p_i q_h with the notations introduced in (1), we have

$$ x_{i,h} = \frac{u_{i\cdot}}{u_{\cdot\cdot}} \cdot \frac{v_{\cdot h}}{v_{\cdot\cdot}}. $$

Writing u_{i·} and v_{·h} with p_i and q_h gives

$$ u_{i\cdot} = p_i \sum_k q_k\,u_{i,k} \qquad \text{and} \qquad v_{\cdot h} = q_h \sum_j p_j\,v_{j,h}. $$


Thus, under the conditions p_i ≠ 0 and q_h ≠ 0,

$$ \sum_k q_k\,u_{i,k} = u_{\cdot\cdot} \qquad \text{and} \qquad \sum_j p_j\,v_{j,h} = v_{\cdot\cdot}. \qquad \text{Q.E.D.} $$

Notice that Σ_k q_k u_{i,k} is the payoff for player 1 of the pure strategy i against the mixed strategy q. Thus the previous proposition asserts that the players use mixed strategies that form a Nash equilibrium of the game restricted to the pure strategies played with positive probability. Therefore, any Nash equilibrium of the game or of any of its subgames provides a rest point. For instance, any combination (i, h) of pure strategies provides a rest point (with x_{i,h} = 1).

4.2. Stability of the Deterministic System

To study the stability of the rest points of the differential equation (8), some definitions are needed. For an ordinary differential equation ẋ = f(x), let f′(x) denote the Jacobian matrix of f at x. A rest point x (where f(x) = 0) is linearly stable if all of the eigenvalues of f′(x) have strictly negative real parts. It is linearly unstable if some eigenvalues of f′(x) have strictly positive real parts. It is elliptic if it is neither linearly stable nor linearly unstable, that is, if some eigenvalues are purely imaginary and the others have strictly negative real parts. Here f(x) = r(x) − x, so that f′(x) = r′(x) − Id, where Id denotes the identity matrix. A complex number λ is an eigenvalue of f′(x) if and only if µ = λ + 1 is an eigenvalue of r′(x).

Concerning notations, at any rest point x characterized by the mixed strategies p and q, let

I_0 = {i ∈ I : p_i = 0} and I_+ = {i ∈ I : p_i > 0},
H_0 = {h ∈ H : q_h = 0} and H_+ = {h ∈ H : q_h > 0},

$$ u_i = \sum_k \frac{u_{i,k}}{u_{\cdot\cdot}}\,q_k, \qquad (10) $$

and

$$ v_h = \sum_j \frac{v_{j,h}}{v_{\cdot\cdot}}\,p_j. \qquad (11) $$

The eigenvalues of f′(x) are given by the following lemma, which is proved in Appendix B. The eigenvalue λ = −1 appears because the process stays in the simplex {Σ_{i,h} x_{i,h} = 1}; the sum of the rows of r′(x) is null. The other eigenvalues describe the stability of the process inside the simplex


(the associated eigenvectors are parallel to the simplex):

Lemma 2. If x is a rest point, then the eigenvalues λ of f′(x) are

λ = u_i − 1, for i ∈ I_0 (|I_0| values);
λ = v_h − 1, for h ∈ H_0 (|H_0| values);
λ = ±k, by opposite pairs (|I_+| + |H_+| − 2 values);

and

λ = −1 (of order 1 + (|I| − 1)(|H| − 1)).

It is then possible to study the stability of the rest points by focusing on the Nash equilibria. A strict Nash equilibrium is an equilibrium in which every deviation strictly loses. The following results can be stated.

Proposition 6. Rest points can be classified into four categories:

(i) A pure strict Nash equilibrium is linearly stable.
(ii) A pure nonstrict Nash equilibrium is elliptic.
(iii) A nonpure Nash equilibrium is not linearly stable.
(iv) A rest point that is not a Nash equilibrium is linearly unstable.

Proof. Let x be a rest point, characterized by the mixed strategies p and q. A Nash equilibrium is characterized by the fact that an action that is not played does not give a higher mean utility than the played actions. For the first player, ∀ i ∈ I_0, u_i ≤ 1, and similarly for the second player, ∀ h ∈ H_0, v_h ≤ 1.

(i) The eigenvalues of f′(x) are only −1, u_i − 1 < 0, and v_h − 1 < 0.
(ii) The eigenvalues of f′(x) are only −1, u_i − 1 ≤ 0, and v_h − 1 ≤ 0; notice that 0 appears for an alternative best response.
(iii) The eigenvalues of f′(x) include opposite values ±k; hence one of them has a nonnegative real part.
(iv) The eigenvalues of f′(x) include either u_i − 1 > 0 or v_h − 1 > 0. Q.E.D.

Mixed equilibria can be elliptic or linearly unstable; the results are summarized in the following table.

Rest point             Linearly stable    Elliptic    Linearly unstable
Pure strict Nash              ×
Pure nonstrict Nash                           ×
Non-pure Nash                                 ×               ×
Non-Nash                                                      ×


The stability of a rest point can be explicitly checked through the eigenvalues of f′(x), thanks to the computable expressions of u_i, v_h, ū, and v̄ given in Appendix B. For instance, the following corollary (proved in Appendix C) concerns two-player games with two actions for each player. These games, as is well known, generically belong to three categories.

Corollary 1. In a generic 2 × 2 game, the following results hold:

(i) If the game has no pure equilibrium, then the mixed equilibrium is elliptic.
(ii) If the game has one pure equilibrium, then it is linearly stable.
(iii) If the game has two pure equilibria, then these are linearly stable and the mixed equilibrium is linearly unstable.
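As a numerical cross-check of Corollary 1, the eigenvalue formula λ² = ∇(u)∇(v)/(u·· v··) derived in Appendix C can be evaluated at the totally mixed equilibrium of a 2 × 2 game. The sketch below (NumPy assumed; both payoff matrices are illustrative) returns a negative value in the no-pure-equilibrium case (purely imaginary λ, elliptic case) and a positive value in the two-pure-equilibria case (real λ of opposite signs, linearly unstable case):

```python
import numpy as np

def mixed_eq_eigenvalue_sq(U, V):
    """lambda^2 = grad(u) grad(v) / (u.. v..) at the totally mixed equilibrium
    of a generic 2x2 game with positive payoffs (formula of Appendix C)."""
    # Mixed equilibrium: each player makes the opponent indifferent.
    q1 = (U[1, 1] - U[0, 1]) / ((U[0, 0] - U[1, 0]) + (U[1, 1] - U[0, 1]))
    p1 = (V[1, 1] - V[1, 0]) / ((V[0, 0] - V[0, 1]) + (V[1, 1] - V[1, 0]))
    p = np.array([p1, 1 - p1]); q = np.array([q1, 1 - q1])
    grad_u = q[0] * (U[0, 0] - U[1, 0])      # = q2 (u22 - u12) as well
    grad_v = p[0] * (V[0, 0] - V[0, 1])      # = p2 (v22 - v21) as well
    u_dd = p @ U @ q                         # u.. at the rest point x = p q^T
    v_dd = p @ V @ q
    return grad_u * grad_v / (u_dd * v_dd)

# Matching-pennies-like game (no pure equilibrium): lambda^2 < 0, elliptic.
U = np.array([[2.0, 1.0], [1.0, 2.0]]); V = np.array([[1.0, 2.0], [2.0, 1.0]])
print(mixed_eq_eigenvalue_sq(U, V))        # negative
# Coordination game (two pure equilibria): lambda^2 > 0, linearly unstable.
U = V = np.array([[2.0, 1.0], [1.0, 2.0]])
print(mixed_eq_eigenvalue_sq(U, V))        # positive
```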

4.3. Convergence of the Stochastic Process

Consider first the stochastic behavior around unstable rest points. If a rest point is unstable for the associated deterministic differential system, then it is generally possible to prove that the probability for a random sequence to converge toward this rest point is zero. One important condition is that the stochastic perturbations send the process away from any stabilizing direction around the rest point. The simplest way to ensure this is to have perturbations that can go in any direction (inside the simplex). The proof of the following proposition verifies that this is indeed the case for the CPR process at interior rest points.

Proposition 7. If x* is a linearly unstable rest point in the interior of the simplex, then the probability that the CPR process converges to x* is zero.

Proof. Recall that from (4) and (5), one can write the stochastic process as

$$ x_{i,h}(t+1) - x_{i,h}(t) = \frac{1}{t}\bigl(-x_{i,h}(t) + r_{i,h}(t)\bigr) + \frac{1}{t}\,\xi_{i,h}(t), $$

with

$$ \xi_{i,h}(t) = -r_{i,h}(t) + \varepsilon_{i,h}(t), \qquad E\bigl[\xi_{i,h}(t) \mid x(t)\bigr] = 0. $$

Claim. Let x* be in the interior of the simplex. Let θ be a normed vector parallel to the simplex: Σ_{i,h} θ_{i,h} = 0 and Σ_{i,h} θ²_{i,h} = 1. Then there exist a constant c > 0 and a neighborhood N of x* such that for all x(t) ∈ N,

$$ E\bigl[\max(\xi(t)\cdot\theta,\ 0) \mid x(t)\bigr] > c. \qquad (12) $$

Proof of the claim. Because θ is a normalized vector parallel to the simplex, all of its coordinates cannot be equal. Let (i, h) be such that θ_{i,h} = max_{j,k} θ_{j,k}; then θ_{i,h} > r · θ = Σ_{j,k} r_{j,k} θ_{j,k}. Recall that with probability r_{i,h}, ε_{i,h} = 1 and ε_{j,k} = 0 for (j, k) ≠ (i, h). Thus with probability r_{i,h}, ξ · θ = θ_{i,h} − r · θ. It follows that E[max(ξ · θ, 0)] ≥ r_{i,h}(θ_{i,h} − r · θ) = c(r, θ) > 0. The quantity c(r, θ) varies continuously with r and θ; θ lies in a compact set and r is a continuous function of x, having all of its coordinates positive at x*. Hence min c(r, θ) is strictly positive in a neighborhood of x* and can be taken as the lower bound in (12). Hence the claim is proved.

The conclusion is drawn from Theorem 1 of Pemantle (1989), whose hypotheses are easily checked—in particular, condition 6 about the directions of perturbation, which follows directly from the preceding claim. Q.E.D.

Consider now the stochastic behavior around stable rest points. If a rest point is stable for the deterministic differential system, then it generally can be proved that the probability for a random sequence to converge toward this rest point is strictly positive. One important condition is that the stochastic perturbations make the considered rest point reachable from any interior point in the simplex. For an urn process like the CPR process, it is well known that any point is reachable from any interior point, as stated in the following proof.

Proposition 8. If x* is a linearly stable rest point, then there is a strictly positive probability that the CPR process converges toward x*.

Proof. A point x* in the simplex is said to be reachable from another point x_0 if for any neighborhood N of x* there exists a finite number k such that the probability that x(t + k) ∈ N, knowing that x(t) = x_0, is strictly positive. In an urn process like the CPR process, the order of magnitude of the stochastic perturbation is 1/t. From the facts that 1/t tends to 0 and that Σ_{τ≤t} 1/τ tends to infinity, it is easily deduced that any point is reachable from any interior point. The proposition then follows from Theorem 7.3 of Benaïm (1999). Q.E.D.

From Propositions 7 and 8 and from points (i) and (iv) in Proposition 6, two convergence results follow (the other cases remaining undetermined).

Proposition 9. The CPR process (i) converges, with some positive probability, toward any pure strict Nash equilibrium and (ii) never converges toward a non-Nash rest point.

These results apply to two categories of 2 × 2 games.

Corollary 2. In a generic 2 × 2 game, the following results hold:

(i) If the game has one pure equilibrium, then the CPR process converges with positive probability toward it.
(ii) If the game has two pure equilibria, then the CPR process converges with positive probability toward any of them and with zero probability toward the mixed one.

Note that when the game has pure Nash equilibria, it is not proved that the process converges with probability 1 toward them. In other terms, the question of the probability that the process does not converge has not been answered. All results for games are local, contrary to the global results available for decision-making.
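The following simulation sketch (illustrative payoffs and parameters) lets two CPR players interact in a 2 × 2 coordination game; in line with Corollary 2(ii), the empirical choice probabilities typically drift toward one of the two strict equilibria:

```python
import numpy as np

rng = np.random.default_rng(1)
# Coordination game (illustrative payoffs): two strict pure equilibria.
U = np.array([[2.0, 1.0], [1.0, 2.0]])   # player 1
V = np.array([[2.0, 1.0], [1.0, 2.0]])   # player 2

def run_cpr(periods=20000):
    cu = np.ones(2)                       # player 1 cumulative utilities
    cv = np.ones(2)                       # player 2 cumulative utilities
    for _ in range(periods):
        i = rng.choice(2, p=cu / cu.sum())
        h = rng.choice(2, p=cv / cv.sum())
        cu[i] += U[i, h]
        cv[h] += V[i, h]
    return cu / cu.sum(), cv / cv.sum()

# Over several runs, the choice probabilities concentrate on one action
# for each player, i.e. near one of the two strict equilibria.
for _ in range(5):
    p, q = run_cpr()
    print(np.round(p, 2), np.round(q, 2))
```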

5. COMPARISON WITH OTHER RULES

5.1. Other Reinforcement Rules

Among all possible reinforcement rules, the index rules have two properties:

1. The action chosen at date t is not directly linked to the preceding one. This excludes rules defined by a transition probability from the action at t − 1 to the action at t.

2. The action depends only on a specific utility index for each action. This excludes rules that depend on a global aspiration level evolving adaptively (Borgers and Sarin, 1995; Karandikar et al., 1998).

More precisely, an index rule is defined by two functions:

1. The player defines a utility index associated with action i as a function of the past history of that action,

$$ U_i(t+1) = g\Bigl(\bigl(\delta_i(\tau)\bigr)_{\tau \le t},\ \bigl(u(\tau)\bigr)_{\tau \le t},\ U_i(1),\ \lambda\Bigr). $$

2. The player chooses action i with a probability that is an increasing function of the utility index,

$$ p_i(t+1) = \frac{f\bigl(U_i(t+1)\bigr)}{\sum_j f\bigl(U_j(t+1)\bigr)}. $$

Of course, by changing g, one can always take f as linear. However, the choice probability is frequently expressed as a power function,

$$ p_i(t+1) = \frac{\bigl[U_i(t+1)\bigr]^d}{\sum_j \bigl[U_j(t+1)\bigr]^d}, $$

with two specific cases: d = 1, proportional choice, and d = +∞, maximizing choice. The utility index is frequently expressed in a linear form,

$$ U_i(t+1) = \lambda(t)\,U_i(t) + \mu(t)\,\delta_i(t)\,u(t), $$


with two specific one-parameter cases: λ(t) = 1, incremental utility, and λ(t) = 1 − µ(t), autoregressive utility.

Consider first the index rules with proportional choice and incremental utility,

$$ p_i(t+1) = \frac{U_i(t+1)}{U(t+1)} \qquad \text{and} \qquad \Delta U_i(t) = \mu(t)\,\delta_i(t)\,u(t), $$

where U(t+1) = Σ_j U_j(t+1). A brief computation shows that the variation in the probability of choice is proportional to u(t) and to δ_i(t) − p_i(t). One can write

$$ \Delta p_i(t) = \nu(t)\,u(t)\,\bigl(\delta_i(t) - p_i(t)\bigr), \qquad \text{with } \nu(t) = \frac{\mu(t)}{U(t+1)}. $$

Three standard rules belong to this class:

• Cross (1973) and Borgers and Sarin (1997): ν(t) = 1; hence (with u(t) < 1), µ(t) = U(1) / ∏_{τ=1}^{t} (1 − u(τ)).

• Erev and Roth (1998): ν(t) = 1/(1 − γ + u(t)); hence µ(t) = (1/(1 − γ)) ∏_{τ=1}^{t} (1 + u(τ)/(1 − γ)).

• CPR rule: ν(t) = 1/CU(t+1); hence µ(t) = 1.

In these rules, the utility index is increased only for the action played at the current period. Moreover, the increase is proportional to the current utility, with a coefficient µ(t) growing with time (constant for the CPR rule). Hence the utility index appears as a discounted index with a discount factor µ(t)/µ(t+1) smaller than 1 and changing with time (constant and equal to 1 for the CPR rule). Note that the CPR rule is the only rule that remains invariant under a positive linear transformation of utility.
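The linear index-rule family just described can be captured in a few lines; in the sketch below (function name, payoff closure, and parameter values are ours), the CPR rule corresponds to λ(t) = 1 and µ(t) = 1, while the discounted index rule discussed next takes λ(t) = λ < 1 and µ(t) = 1 − λ:

```python
import random

def index_rule(payoff, periods, lam, mu, d=1.0, u0=1.0, seed=0):
    """Generic linear index rule: U_i(t+1) = lam(t) U_i(t) + mu(t) delta_i(t) u(t),
    choice probability proportional to U_i(t)^d.  lam and mu are passed as
    functions of the period t (an assumption of this sketch); d = 1 is
    proportional choice."""
    rng = random.Random(seed)
    U = [u0, u0]
    for t in range(1, periods + 1):
        weights = [x ** d for x in U]
        i = rng.choices([0, 1], weights=weights)[0]
        u = payoff(i, rng)
        U = [lam(t) * U[j] + (mu(t) * u if j == i else 0.0) for j in range(2)]
    return U

payoff = lambda i, rng: 2.0 if i == 0 else 1.0     # illustrative payoffs
# CPR rule: lam = 1, mu = 1 (cumulative utility, proportional choice).
print(index_rule(payoff, 5000, lam=lambda t: 1.0, mu=lambda t: 1.0))
# Discounted index rule: lam = 0.9, mu = 1 - lam (autoregressive utility).
print(index_rule(payoff, 5000, lam=lambda t: 0.9, mu=lambda t: 0.1))
```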

Consider now the index rules with proportional choice and autoregressive utility,

$$ p_i(t+1) = \frac{U_i(t+1)}{U(t+1)} \qquad \text{and} \qquad U_i(t+1) = \lambda(t)\,U_i(t) + \bigl(1 - \lambda(t)\bigr)\,\delta_i(t)\,u(t). $$

Two standard rules belong to this class:

• Discounted index rule: λ(t) = λ < 1; hence

$$ U_i^{Di}(t+1) = (1-\lambda)\sum_{\tau=1}^{t} \lambda^{t-\tau}\,\delta_i(\tau)\,u(\tau) + \lambda^t\,U_i(1). $$

• Average index rule: λ(t) = 1 − δ_i(t)/(t x_{i·}(t)); hence

$$ U_i^{Av}(t+1) = \frac{\sum_{\tau=1}^{t} \delta_i(\tau)\,u(\tau)}{t\,x_{i\cdot}(t)} = \frac{\sum_h u_{i,h}\,x_{i,h}(t)}{x_{i\cdot}(t)}, \qquad \text{where } x_{i\cdot} = \sum_h x_{i,h}. $$

The CPR rule is a limit case of the discounted index rule when the constant discount rate equals 1, but it has very different properties. The modeler's urn is still relevant, but the number of balls added at each period is


exponentially increasing with time (Kilani and Lesourne, 1995). The evolution of the frequency of balls is still given by (4), but the decreasing speed of evolution 1/t is changed to a constant 1 − λ. The usual convergence results about urn processes are no longer applicable. (The steps γ_t are constant and thus do not satisfy the condition Σ_{t>0} γ_t² < ∞.) For instance, it can be shown directly that for a decision-maker, the process may be locked in a nonmaximizing action. Consider a situation with one state of nature and two possible actions yielding utilities u_{1,1} > u_{2,1} > 0. If the dominated action 2 was played during the first t periods, then

$$ U_1^{Di}(t+1) = U_1(1)\,\lambda^t \qquad \text{and} \qquad U_2^{Di}(t+1) = U_2(1)\,\lambda^t + u_{2,1}\sum_{\tau=1}^{t} \lambda^{\tau}. $$

The probability of playing action 1 at t + 1 is then

$$ p_1(t+1) = \frac{U_1^{Di}(t+1)}{U_1^{Di}(t+1) + U_2^{Di}(t+1)} \le \frac{U_1(1)\,\lambda^t}{u_{2,1}\sum_{\tau=1}^{t}\lambda^{\tau}} \le \frac{U_1(1)}{u_{2,1}}\,\lambda^{t-1}. $$

The probability of always playing action 2 is then larger than the infinite product ∏_{t≥1} (1 − (U_1(1)/u_{2,1}) λ^{t−1}) and is thus strictly positive.
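A quick Monte Carlo sketch of this lock-in effect (illustrative payoffs; for simplicity the discounted index here uses the recursion U_i(t+1) = λU_i(t) + δ_i(t)u(t), i.e. µ(t) = 1):

```python
import random

def lock_in_frequency(lam, runs=500, periods=1500, u=(2.0, 1.0), u0=1.0):
    """Fraction of runs in which a discounted-index learner (discount lam,
    proportional choice) ends up playing the dominated action 2 most often."""
    locked = 0
    for s in range(runs):
        rng = random.Random(s)
        idx = [u0, u0]                      # U_1(1) = U_2(1) = u0 > 0
        pulls = [0, 0]
        for _ in range(periods):
            i = rng.choices([0, 1], weights=idx)[0]
            idx[i] = lam * idx[i] + u[i]    # lam = 1 recovers the CPR index
            idx[1 - i] = lam * idx[1 - i]
            pulls[i] += 1
        locked += pulls[1] > pulls[0]
    return locked / runs

print("discounted (lam=0.5):", lock_in_frequency(0.5))   # clearly positive
# much smaller; Proposition 4 rules out asymptotic lock-in under CPR
print("CPR limit (lam=1):   ", lock_in_frequency(1.0))
```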

Consider finally index rules with maximizing choice. One example from artificial intelligence is the Q-learning rule (Watkins and Dayan, 1992), proposed as an algorithm converging to the solution of a dynamic program. It uses a utility index

$$ U_i^{Ql}(t+1) = U_i^{Ql}(t) + \alpha(t)\,\delta_i(t)\Bigl[\gamma\,\max_j\bigl\{U_j^{Ql}(t)\bigr\} - U_i^{Ql}(t) + u(t)\Bigr]. $$

Another example is the Gittins rule (Gittins, 1989), which has interesting optimality properties in individual decision-making when the probabilities of the states of nature are not well known. It uses an index that can be explicitly stated only for specific second-order probability distributions on the states distribution. A last example is fictitious play, which is not a reinforcement rule but can be reinterpreted in this way. The corresponding utility index is somewhat similar to the average index U^{Av}:

$$ U_i^{Fp}(t+1) = \sum_h u_{i,h}\,x_{\cdot h}(t) = U_i^{Av}(t+1)\,\frac{\sum_h u_{i,h}\,x_{i\cdot}(t)\,x_{\cdot h}(t)}{\sum_h u_{i,h}\,x_{i,h}(t)}. $$

Both indexes coincide when past plays of actions i and h are uncorrelated. But even if they can be reinterpreted as index rules, all of these rules look quite different from the CPR rule.


5.2. Replicator Dynamics

The CPR rule can also be compared with replicator dynamics, which is studied extensively in evolutionary game theory (Friedman, 1991; Weibull, 1995). Of course, their basic principles are different: individual learning in the case of the CPR rule and social selection in replicator dynamics. But the equations that describe the two processes have intriguing similarities and differences whose exposition may help to position both models.

Concerning population dynamics, two populations are considered. At time t, let p_i(t) be the proportion of agents in the first population who play action i, and let q_h(t) be the proportion of agents in the second population who play action h. Replicator dynamics considers that agents of the two populations meet stochastically and reproduce proportionally to the utility that they get. On average, the utility obtained by an agent playing action i in the first population is u(e_i, q), and the utility obtained on average by the agents of the first population is u(p, q). The transition between two periods is given by

$$ \Delta p_i(t) = \alpha\,p_i(t)\,\bigl[u(e_i, q(t)) - u(p(t), q(t))\bigr] = \alpha\,p_i(t)\left[\sum_h u_{i,h}\,q_h(t) - \sum_{j,h} u_{j,h}\,p_j(t)\,q_h(t)\right] = \alpha\,p_i(t)\sum_{j,h} p_j(t)\,q_h(t)\,\bigl(u_{i,h} - u_{j,h}\bigr), $$

with α = 1 in the standard replicator and

$$ \alpha = \left[\sum_{j,h} u_{j,h}\,p_j(t)\,q_h(t)\right]^{-1} $$

in the normalized replicator.

As concerns individual learning, the CPR process is conveniently described by (6). When t is large, CU(t) also becomes large, so that the terms u_{i,h}/CU(t) can be neglected, and the transition can be written as

$$ E\bigl[\Delta p_i(t) \mid x(t)\bigr] \simeq \alpha'\,p_i(t)\sum_{j,h} p_j(t)\,q_h(t)\,\bigl(u_{i,h} - u_{j,h}\bigr), $$

with

$$ \alpha' = \frac{1}{t}\left[\sum_{j,h} u_{j,h}\,x_{j,h}(t)\right]^{-1}. $$

Apart from the fact that the equations are compared asymptotically and in expectation, they highlight two differences between CPR learning and normalized replicator dynamics (compare α with α'). First, the speed of


adjustment is decreasing with a factor 1/t in the CPR process and constant in replicator dynamics. The interpretation is that in CPR learning, cumulated utility is relevant for the player, whereas in replicator dynamics, short-term utility influences the system at each period. Second, the CPR process takes into account the possible correlations in the players' past choices of actions (using the variables x_{i,h}), whereas replicator dynamics uses the present probabilities p_i and q_h. The interpretation is that in CPR learning, reinforcement is based on the player's past personal experiments, whereas in replicator dynamics, it is based on the simultaneous payoffs for the various strategies in the present state of the system. In some cases, these differences are sufficient to induce different behavior for the two processes. For instance, although the "decreasing-speed" effect does not affect deterministic trajectories, it is important for stochastic processes. The comparison of the CPR rule with the discounted index rule in Section 5.1 illustrates this phenomenon.
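For comparison, a one-population step of the replicator transition above can be sketched as follows (NumPy assumed; the payoff matrix is illustrative and the opponent mix q is held fixed for brevity):

```python
import numpy as np

U = np.array([[2.0, 1.0], [1.0, 2.0]])   # illustrative payoff matrix, player 1

def replicator_step(p, q, normalized=True):
    """One discrete step of the (normalized) replicator for player 1:
    Delta p_i = alpha p_i sum_{j,h} p_j q_h (u_{i,h} - u_{j,h})."""
    mean_u = p @ U @ q                    # u(p, q)
    fitness = U @ q                       # u(e_i, q)
    alpha = 1.0 / mean_u if normalized else 1.0
    return p + alpha * p * (fitness - mean_u)

p = np.array([0.3, 0.7]); q = np.array([0.3, 0.7])
for _ in range(50):
    p = replicator_step(p, q)             # q would be updated with V in a full model
print(p)   # mass shifts toward the action that does best against q
```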

APPENDIX A: PROOF OF PROPOSITION 4

Let u_i denote the expected utility of action i, u_i = Σ_{h∈H} q_h u_{i,h}. Let u_1 = max_i u_i denote the maximum expected utility, let M denote the set of maximizing actions, and let u_2 = max_{i∉M} u_i, with 0 < u_2 < u_1. Assume first that there is only one maximizing action, i = 1.

Part 1. Consider a two-color (1 and 2) urn in which one ball is added at each time, according to an urn function s. For 0 ≤ y ≤ 1, if y is the proportion of balls of color 1, then the probability of adding a ball of color 1 is s(y). Suppose that s has the following properties: s is continuous, s(0) = 0, and there exists y* > 0 such that s(y*) = y*, s(y) > y for 0 < y < y*, and s(y) < y for y > y*. Then y = 0 is the unique "upcrossing fixed point" and y = y* is the unique "downcrossing fixed point" for s. It follows from Hill et al. (1980) that with probability 1, the proportion of balls of color 1 tends to y*.

For instance, with s(y) = y u_1/(y u_1 + (1 − y) u_2), one obtains convergence of the CPR process to the maximizing action in the case where two strategies are available to the decision-maker and there is no randomness on Nature's side.

Part 2. Returning to the stochastic and multicolored CPR process x, let (Ω, µ) be a probability space on which x is realized: (x(t, ω))_{t≥1} is the CPR trajectory for the draw ω ∈ Ω. Let x_{i·} = Σ_h x_{i,h} denote the past frequency of action i. From Proposition 1, each action i is played an infinite number of times; thus, by the law of large numbers, the ratio x_{i,h}(t, ω)/x_{i·}(t, ω), which gives the frequency of the state of nature h among the times when i was chosen before t, tends almost surely to q_h when t tends to infinity. Therefore, there exists Ω′ ⊆ Ω with µ(Ω′) = 1 such that, for all ω ∈ Ω′ and all ε > 0, there exists T(ω, ε) such that for all i and h,

$$ \forall\, t > T(\omega, \varepsilon), \qquad q_h - \varepsilon < \frac{x_{i,h}(t, \omega)}{x_{i\cdot}(t, \omega)} < q_h + \varepsilon. $$

From the definition (1) of p_i, one can then deduce the bound

$$ \forall\, t > T(\omega, \varepsilon), \qquad p_i(t, \omega) \ge (1 - 2\varepsilon)\,\frac{x_{i\cdot}(t, \omega)\,u_i}{\sum_j x_{j\cdot}(t, \omega)\,u_j}. $$

Applied to the maximizing action i = 1 compared to the others, one gets

$$ \forall\, t > T(\omega, \varepsilon), \qquad p_1(t, \omega) \ge s_\varepsilon\bigl(x_{1\cdot}(t, \omega)\bigr), \qquad (13) $$

with s_ε(x_{1·}) = (1 − 2ε) x_{1·} u_1/(x_{1·} u_1 + (1 − x_{1·}) u_2). This inequality means that the probability of choosing action i = 1 is larger than the function s_ε of the past frequency of that action.

Part 3. For ε > 0, consider the function s_ε. This function has two fixed points: 0 (upcrossing) and y*_ε = 1 − 2εu_1/(u_1 − u_2) (downcrossing). According to Part 1 of this proof, a two-color urn with urn function s_ε converges to y*_ε. By building such an urn process y(t, ω) on the same space (Ω, µ) as the CPR process x(t, ω), one obtains from (13) that there exists Ω_ε ⊆ Ω′ with µ(Ω_ε) = 1 such that, on the event Ω_ε, p_1(t, ω) ≥ y(t, ω) for all t > T(ω, ε), and lim_{t→∞} y(t, ω) = y*_ε. Thus, for all ω ∈ Ω_ε, there exists T′(ω, ε), which one can choose larger than T(ω, ε), such that

$$ \forall\, t > T'(\omega, \varepsilon), \qquad p_1(t, \omega) \ge y(t, \omega) \ge y^*_\varepsilon - \varepsilon. $$

Choosing nested events Ω_ε for various ε and taking the intersection of these nested almost-sure events, one obtains that, almost surely, lim_{t→∞} p_1(t) = 1. Hence the limit of the trajectory is easily found: almost surely, for all k, lim_{t→∞} x_{i,k}(t) = 0 if i ≠ 1 and lim_{t→∞} x_{1,k}(t) = q_k. This completes the proof of the proposition in the case of a unique maximizing action.

Part 4. Assume now that there are several maximizing actions. The conclusion can be derived from the unique-maximizing-action case by considering an appropriate "subprocess." Let i ∈ M be a maximizing action and let J = (I\M) ∪ {i}. Consider only the dates at which the chosen action is in J. More precisely, for ω ∈ Ω and for T > 1, let t^J_T(ω) denote the date at which, for the Tth time, the chosen action belongs to J. Because each action is chosen infinitely many times, t^J_T(ω) is defined for all integers T > 1. Consider the stochastic process x^J defined, for all j ∈ J and all h ∈ H, by

$$ x^J_{j,h}(T, \omega) = \frac{x_{j,h}\bigl(t^J_T(\omega), \omega\bigr)}{\sum_{j' \in J,\, h' \in H} x_{j',h'}\bigl(t^J_T(\omega), \omega\bigr)}. $$

This process is indeed a CPR process, because the relative probabilities of choosing any two actions depend only on the cumulative utilities for these actions. Because there is a unique maximizing action for this subprocess, we obtain that for a nonmaximizing action j ∉ M, the choice probability p^J_j(T, ω) tends to 0 when T tends to infinity, for almost all ω. But from the definition of the CPR rule, the probability of playing a given action decreases each time this action is not chosen; thus the fact that p^J_j(T, ω) tends to 0 implies that p_j(t, ω) itself tends to 0.

APPENDIX B: EIGENVALUES IN A TWO-PLAYER GAME

Differentiation of (3) provides the following expression for r′(x) at a rest point x characterized by the mixed strategies p and q (with i ≠ j and h ≠ k):

$$ \frac{\partial r_{i,h}}{\partial x_{i,h}} = q_h(1-p_i)\frac{u_{i,h}}{u_{\cdot\cdot}} + p_i(1-q_h)\frac{v_{i,h}}{v_{\cdot\cdot}}, \qquad \frac{\partial r_{i,h}}{\partial x_{i,k}} = q_h(1-p_i)\frac{u_{i,k}}{u_{\cdot\cdot}} - p_i q_h\frac{v_{i,k}}{v_{\cdot\cdot}}, $$

$$ \frac{\partial r_{i,h}}{\partial x_{j,h}} = -p_i q_h\frac{u_{j,h}}{u_{\cdot\cdot}} + p_i(1-q_h)\frac{v_{j,h}}{v_{\cdot\cdot}}, \qquad \text{and} \qquad \frac{\partial r_{i,h}}{\partial x_{j,k}} = -p_i q_h\frac{u_{j,k}}{u_{\cdot\cdot}} - p_i q_h\frac{v_{j,k}}{v_{\cdot\cdot}}. $$

Let a = (a_{i,h})_{i,h} be an eigenvector of r′(x) associated with an eigenvalue µ,

$$ r'(x) \cdot a = \mu\,a. \qquad (14) $$

For all i and h, this reads

$$ \mu\,a_{i,h} = q_h \sum_k \frac{u_{i,k}}{u_{\cdot\cdot}}\,a_{i,k} + p_i \sum_j \frac{v_{j,h}}{v_{\cdot\cdot}}\,a_{j,h} - p_i q_h \sum_{j,k}\left(\frac{u_{j,k}}{u_{\cdot\cdot}} + \frac{v_{j,k}}{v_{\cdot\cdot}}\right) a_{j,k}. $$

Write the equation as

$$ \mu\,a_{i,h} = q_h\,\tilde{u}_i + p_i\,\tilde{v}_h, \qquad (15) $$

with

$$ \tilde{u}_i = \sum_k \frac{u_{i,k}}{u_{\cdot\cdot}}\,a_{i,k} - p_i \sum_{j,k} \frac{u_{j,k}}{u_{\cdot\cdot}}\,a_{j,k} \qquad \text{and} \qquad \tilde{v}_h = \sum_j \frac{v_{j,h}}{v_{\cdot\cdot}}\,a_{j,h} - q_h \sum_{j,k} \frac{v_{j,k}}{v_{\cdot\cdot}}\,a_{j,k} \qquad (16) $$

(these quantities depend on the eigenvector a and should not be confused with the numbers u_i and v_h defined in (10) and (11)).


Note that

$$ \sum_i \tilde{u}_i = \sum_h \tilde{v}_h = 0. \qquad (17) $$

From (15), for any i,

$$ \mu \sum_k \frac{u_{i,k}}{u_{\cdot\cdot}}\,a_{i,k} = u_i\,\tilde{u}_i + p_i \sum_k \frac{u_{i,k}}{u_{\cdot\cdot}}\,\tilde{v}_k $$

and

$$ \mu \sum_{j,k} \frac{u_{j,k}}{u_{\cdot\cdot}}\,a_{j,k} = \sum_j u_j\,\tilde{u}_j + \sum_{j,k} p_j\,\frac{u_{j,k}}{u_{\cdot\cdot}}\,\tilde{v}_k. $$

Thus if µ ≠ 0, then for any i, (16) is equivalent to

$$ \mu\,\tilde{u}_i = u_i\,\tilde{u}_i + p_i\left[\sum_k \frac{u_{i,k}}{u_{\cdot\cdot}}\,\tilde{v}_k - \sum_j u_j\,\tilde{u}_j - \sum_{j,k} p_j\,\frac{u_{j,k}}{u_{\cdot\cdot}}\,\tilde{v}_k\right]. \qquad (18) $$

Likewise, one also obtains, for any h,

$$ \mu\,\tilde{v}_h = v_h\,\tilde{v}_h + q_h\left[\sum_j \frac{v_{j,h}}{v_{\cdot\cdot}}\,\tilde{u}_j - \sum_k v_k\,\tilde{v}_k - \sum_{j,k} q_k\,\frac{v_{j,k}}{v_{\cdot\cdot}}\,\tilde{u}_j\right]. \qquad (19) $$

Equations (18) and (19) show that the |I| + |H| numbers ũ_i, ṽ_h are obtained by solving a linear system. This system can be written in matrix form as

$$ \mu \begin{pmatrix} \tilde{u} \\ \tilde{v} \end{pmatrix} = m \begin{pmatrix} \tilde{u} \\ \tilde{v} \end{pmatrix}. $$

The coefficients of the matrix m are

$$ m_{i,i} = (1 - p_i)\,u_i, \qquad m_{i,j} = -p_i\,u_j \quad (j \ne i), \qquad m_{i,h} = p_i\left(\frac{u_{i,h}}{u_{\cdot\cdot}} - \sum_j p_j\,\frac{u_{j,h}}{u_{\cdot\cdot}}\right), $$

with similar expressions for m_{h,h}, m_{h,k} (k ≠ h), and m_{h,i}. Thus the eigenvalues µ ≠ 0 are the eigenvalues of a square matrix of dimension |I| + |H|. To find them, observe that if i belongs to I_0, then (18) reduces to µũ_i = u_iũ_i; the row of m corresponding to i is 0 except on the diagonal. Thus the characteristic polynomial P(µ) = det(m − µId) of m can be factorized as

$$ P(\mu) = \left(\prod_{i \in I_0} (u_i - \mu)\right)\left(\prod_{h \in H_0} (v_h - \mu)\right) Q(\mu), $$


where Q is the characteristic polynomial of the restriction of m to the coordinates in I_+ and H_+. Hence µ = u_i (i ∈ I_0) and µ = v_h (h ∈ H_0) give the first two subsets of eigenvalues in Lemma 2.

From the definitions (10) and (11), it can be seen that Q depends only on the restriction of the game (u, v) to the actions in I_+ and H_+. To avoid heavy notation, we now suppose that I_+ = I and H_+ = H. From the study of rest points, we know that if i is played, Σ_k u_{i,k} q_k = u_{··}; that is,

∀ i, h:  u_i = v_h = 1.

Also using (17), (18) can be written as

$$ \mu\,\tilde{u}_i = \tilde{u}_i + \sum_k p_i\left[\frac{u_{i,k}}{u_{\cdot\cdot}} - \sum_j p_j\,\frac{u_{j,k}}{u_{\cdot\cdot}}\right]\tilde{v}_k. \qquad (20) $$

Letting

$$ \bar{u}_{i,k} = p_i\left[\frac{u_{i,k}}{u_{\cdot\cdot}} - \sum_j p_j\,\frac{u_{j,k}}{u_{\cdot\cdot}}\right] \qquad \text{and} \qquad \bar{v}_{h,j} = q_h\left[\frac{v_{j,h}}{v_{\cdot\cdot}} - \sum_k q_k\,\frac{v_{j,k}}{v_{\cdot\cdot}}\right], \qquad (21) $$

one finds two matrices ū and v̄ of dimensions |I| × |H| and |H| × |I| such that

$$ \lambda \begin{pmatrix} \tilde{u} \\ \tilde{v} \end{pmatrix} = \begin{pmatrix} 0 & \bar{u} \\ \bar{v} & 0 \end{pmatrix} \begin{pmatrix} \tilde{u} \\ \tilde{v} \end{pmatrix} \qquad (22) $$

(recall that λ = µ − 1). Conversely, if (λ, ũ, ṽ) is a solution of (22) such that Σ_i ũ_i = Σ_h ṽ_h = 0, then (18) and (19) are satisfied, and thus one has a solution to (14). If (λ, ũ, ṽ) is a solution of (22) such that Σ_i ũ_i ≠ 0 or Σ_h ṽ_h ≠ 0, then summing (20) shows that λ = 0.

From the system (22), one finds that

$$ \lambda^2\,\tilde{u} = \bar{u}\,\bar{v}\,\tilde{u}. $$

Hence if λ is an eigenvalue of f′(x) with λ ≠ −1 (µ ≠ 0), then λ² is an eigenvalue of ūv̄ (as well as of v̄ū). Note also that if (ũ, ṽ) is an eigenvector associated with λ, then (ũ, −ṽ) is another eigenvector, associated with −λ. Consequently, except for λ = −1, the third subset of eigenvalues in Lemma 2 is composed of opposite pairs (k, −k).

Finally, the sum of the eigenvalues is easily computed as the trace of r′(x),

$$ \mathrm{Tr}\bigl(r'(x)\bigr) = \sum_i (1 - p_i)\sum_h q_h\,\frac{u_{i,h}}{u_{\cdot\cdot}} + \sum_h (1 - q_h)\sum_i p_i\,\frac{v_{i,h}}{v_{\cdot\cdot}} = \sum_{i \in I_0} (u_i - 1) + \sum_{h \in H_0} (v_h - 1) + |I| + |H| - 2, $$


and

$$ \mathrm{Tr}\bigl(f'(x)\bigr) = \mathrm{Tr}\bigl(r'(x)\bigr) - |I|\cdot|H| = \sum_{i \in I_0} (u_i - 1) + \sum_{h \in H_0} (v_h - 1) - (|I| - 1)(|H| - 1) - 1. $$

Therefore, the order of multiplicity of λ = −1 is (|I| − 1)(|H| − 1) + 1, which gives the fourth subset of eigenvalues in Lemma 2. Moreover, because the size of m is (|I| + |H|) × (|I| + |H|), the extra value µ = 1 (λ = 0) appears twice in (22).

APPENDIX C: PROOF OF COROLLARY 1

From Proposition 6 and the observation that pure Nash equilibria are generically strict, one deduces that in a generic 2 × 2 game, a pure equilibrium is linearly stable and that no other rest point is linearly stable. At a totally mixed equilibrium (p, q) (with I = {1, 2} and H = {1, 2}), (21) can be written as

$$ \bar{u} = \frac{p_1 p_2}{u_{\cdot\cdot}} \begin{pmatrix} u_{1,1} - u_{2,1} & u_{1,2} - u_{2,2} \\ u_{2,1} - u_{1,1} & u_{2,2} - u_{1,2} \end{pmatrix}. $$

Because p_1 and p_2 are positive, q_1 u_{1,1} + q_2 u_{1,2} = q_1 u_{2,1} + q_2 u_{2,2} = u_{\cdot\cdot}; thus, denoting

$$ \nabla(u) = q_1\,(u_{1,1} - u_{2,1}) = q_2\,(u_{2,2} - u_{1,2}), $$

one can write

$$ \bar{u} = \frac{p_1 p_2}{u_{\cdot\cdot}}\,\nabla(u) \begin{pmatrix} 1/q_1 & -1/q_2 \\ -1/q_1 & 1/q_2 \end{pmatrix}. $$

Similar computations done for v̄, with

$$ \nabla(v) = p_1\,(v_{1,1} - v_{1,2}) = p_2\,(v_{2,2} - v_{2,1}), $$

show that, apart from λ = −1 (with multiplicity 2), the eigenvalues are λ = 0 (extra double root) and λ such that

$$ \lambda^2 = \frac{\nabla(u)\,\nabla(v)}{u_{\cdot\cdot}\,v_{\cdot\cdot}}. $$


Hence there are two cases, with ∇(u)∇(v) positive or negative, corresponding to whether the mixed equilibrium is or is not the only equilibrium of the game.

REFERENCES

Arthur, B. (1993). "On Designing Economic Agents Who Behave Like Human Agents," J. Evol. Econ. 3, 1–22.
Arthur, B., Ermoliev, Y., and Kaniovski, Y. (1984). "A Generalized Urn Problem and Its Applications," Kibernetika 1, 49–56.
Benaïm, M. (1999). "Dynamics of Stochastic Approximation Algorithms," in Le Séminaire de Probabilités, Vol. 33 (J. Azéma and M. Yor, Eds.). Berlin: Springer-Verlag.
Benaïm, M., and Hirsch, M. (1996). "Asymptotic Pseudo-Trajectories and Chain Recurrent Flows, with Applications," J. Dynam. Diff. Eq. 8, 141–176.
Benaïm, M., and Hirsch, M. (1999). "Mixed Equilibria and Dynamical Systems Arising from Fictitious Play in Perturbed Games," Games Econ. Behav. 29, 36–72.
Borgers, T., and Sarin, R. (1997). "Learning through Reinforcement and Replicator Dynamics," J. Econ. Theory 77, 1–14.
Borgers, T., and Sarin, R. (2000). "Naive Reinforcement Learning with Endogenous Aspirations," Int. Econ. Rev. 41, 921–950.
Bush, R., and Mosteller, F. (1955). Stochastic Models for Learning. New York: Wiley.
Cross, J. (1973). "A Stochastic Learning Model of Economic Behavior," Quart. J. Econ. 87, 239–266.
Erev, I., and Roth, A. (1998). "Predicting How People Play Games: Reinforcement Learning in Experimental Games with Unique Mixed Strategy Equilibria," Amer. Econ. Rev. 88, 848–881.
Friedman, D. (1991). "Evolutionary Games in Economics," Econometrica 59, 637–666.
Fudenberg, D., and Levine, D. (1998). Theory of Learning in Games. Cambridge, MA: MIT Press.
Gittins, J. (1989). Multi-Armed Bandit Allocation Indices. New York: Wiley.
Hill, B., Lane, D., and Sudderth, W. (1980). "A Strong Law for Some Generalized Urn Processes," Ann. Probab. 8, 214–226.
Kaniovski, Y., and Young, P. (1995). "Learning Dynamics in Games with Stochastic Perturbations," Games Econ. Behav. 11, 330–363.
Karandikar, R., Mookherjee, D., Ray, D., and Vega-Redondo, F. (1998). "Evolving Aspirations and Cooperation," J. Econ. Theory 80, 292–331.
Kilani, K., and Lesourne, J. (1995). "Endogenous Preferences, Self-Organizing Systems and Consumer Theory," mimeo 95-6, CNAM, Paris.
Nachbar, J. (1990). "Evolutionary Selection Dynamics in Games," Int. J. Game Theory 19, 59–89.
Pemantle, R. (1989). "Non-Convergence to Unstable Points in Urn Models and Stochastic Approximations," Ann. Probab. 18, 698–712.
Posch, M. (1997). "Cycling in a Stochastic Learning Algorithm for Normal Form Games," J. Evol. Econ. 7, 193–207.
Robinson, J. (1950). "An Iterative Method for Solving a Game," Ann. Math. 54, 296–301.
Roth, A., and Erev, I. (1995). "Learning in Extensive Form Games: Experimental Data and Simple Dynamic Models in the Intermediate Term," Games Econ. Behav. 8, 164–212.
Shapley, L. (1964). "Some Topics in Two-Person Games," in Advances in Game Theory (M. Dresher, L. Shapley, and A. Tucker, Eds.). Princeton, NJ: Princeton University Press.
Walliser, B. (1998). "A Spectrum of Equilibration Processes in Games," J. Evol. Econ. 8, 67–87.
Watkins, C., and Dayan, P. (1992). "Q-Learning," Mach. Learn. 8, 279–292.
Weibull, J. (1995). Evolutionary Game Theory. Cambridge, MA: MIT Press.
Young, P. (1998). Individual Strategy and Social Structure. Princeton, NJ: Princeton University Press.