Learning to play Go using recursive neural networks

Lin Wu, Pierre Baldi *

School of Information and Computer Sciences, Institute for Genomics and Bioinformatics

University of California Irvine, Irvine, CA 92697, USA

Abstract

Go is an ancient board game that poses unique opportunities and challenges for artificial intelligence. Currently, there are no computer Go programs that can play at the level of a good human player. However, the emergence of large repositories of games is opening the door for new machine learning approaches to address this challenge. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into recursive neural networks, derived from a probabilistic Bayesian network architecture. The recursive neural networks in turn integrate local information across the board in all four cardinal directions and produce local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end, or at various other stages, of the game. Local area targets for training can be derived from datasets of games played by human players. In this approach, while requiring a learning time proportional to N^4, skills learned on a board of size N^2 can easily be transferred to boards of other sizes. A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. Possible directions for further improvements are briefly discussed.

Key words: computer games, computer Go, machine learning, evaluation function, recursive neural networks, knowledge transfer

1. Introduction

Go is an ancient board game, over 3,000 years old (Iwamoto, 1972; Cobb, 2002), that poses unique opportunities and challenges for artificial intelligence and machine learning. The rules of Go are deceptively simple: two opponents alternately place black and white stones on the empty intersections of an odd-sized square board, traditionally of size 19 × 19. The goal of the game, in simple terms, is for each player to capture as much territory as possible across the board by encircling the opponent's stones. This disarming simplicity of the rules of Go, however, conceals a formidable combinatorial complexity (Berlekamp and Wolfe, 1994). On a 19 × 19 board, there are approximately 3^(19×19) ≈ 10^172.24 possible board configurations and, on average, on the order of 200-300 possible moves

* Corresponding author. Tel.: +1 949 8245809; fax: +1 824 9813. Email address: [email protected] (Pierre Baldi).

at each step of the game, preventing any form of semi-exhaustive search. For comparison purposes, the game of chess has a much smaller branching factor, on the order of 35-40 (Russell and Norvig, 2002; Plaat et al., 1996). Today, computer chess programs, built essentially on search techniques and running on a simple desktop, can rival or even surpass the best human players. In contrast, and in spite of several decades of research efforts and progress in hardware speed, the best Go programs of today are easily defeated by an average human amateur player (Bouzy and Cazenave, 2001; Fotland, 1993; Chen, 1990; Graepel et al., 2001; Enzenberger, 1996, 2003).

Besides the intrinsic challenge of the game, and the non-trivial market created by over 100 million players worldwide, Go raises other important questions for our understanding of natural or artificial intelligence in the distilled setting created by the simple rules of the game, uncluttered by the endless complexities of the "real world". For instance, are there any "intelligent" solutions to Go and if


so what would characterize them? Note that to many observers current computer solutions to chess appear to be "brute force", hence "unintelligent". But is this perception correct, or an illusion? Is there something like true intelligence beyond "brute force" and computational power? Where is Go situated in the apparent tug-of-war between intelligence and sheer computational power? To understand such questions, the methods used to develop computer Go programs are as important as the performance of the resulting players. This is particularly true in the context of certain current efforts aimed at solving Go by pattern matching within very large databases of Go games. How "intelligent" is pattern matching?

A second set of issues has to do with whether our attempts at solving Go ought to mimic how human experts play. Notwithstanding the fact that we do not know how human brains play Go, here the experience gained again with chess, as well as with many other problems, clearly shows that mimicking the human brain is not necessary in order to develop good computer Go programs. However, we may still find useful inspiration in the way humans seem to play, for instance in trying to balance tactical and strategic goals.

Finally, and in connection with human playing, we note that humans in general do not learn to play Go directly on boards of size 19 × 19, but rather start by learning on smaller board sizes, typically 9 × 9. How can we develop algorithms capable of transferring knowledge from one board size to another?

The emergence of large datasets of Go game records, and the ability to record even larger datasets through Web and peer-to-peer technologies, opens the door for new pattern-matching and statistical machine learning approaches to Go. Here we take modest steps towards addressing these challenges by developing a scalable machine learning approach to Go capable of knowledge transfer across board sizes. Clearly, good evaluation functions and search algorithms are essential ingredients of computer board-game systems (Beal, 1990; Schrufer, 1989; Ryder, 1971; Abramson, 1990; Raiko, 2004). Here we focus primarily on the problem of learning a good evaluation function for Go in a scalable way. We do include simple search algorithms in our system, as many other programs do, but these algorithms are not the primary focus. By scalability we imply that a main goal is to develop a system more or less automatically, using machine learning approaches, with minimal human intervention and handcrafting. The system ought to be able to transfer information across board sizes, for instance from 9 × 9 to 19 × 19.

We take inspiration from three ingredients that seem to be essential to the human Go evaluation process: the understanding of local patterns (Chen, 1992), the ability to combine patterns, and the ability to relate tactical and strategic goals. Our system is built to learn these three capabilities automatically and attempts to combine the strengths of existing systems while avoiding some of their weaknesses. The system is capable of automatically learning the propensity of local patterns from a library of games. Propensity and

other local tactical information are fed into a set of recursive neural networks, derived from a probabilistic Bayesian network architecture. The networks in turn integrate local information across the board and produce local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. The main results we present here are derived on a 19 × 19 board using a player trained using only 9 × 9 game data.

2. Data

Because the approach to be described emphasizes scalability and learning, we are able to train our system at a given board size and use it to play at different sizes, both larger and smaller. Pure bootstrap approaches to Go, where computer players are initialized randomly and play large numbers of games, such as evolutionary approaches or reinforcement learning (Sutton and Barto, 1998), have been tried (Schraudolph et al., 1994). We have implemented these approaches and used them for small board sizes 5 × 5 and 7 × 7, where no training data was available to us. However, in our experience, these approaches do not scale up well to larger board sizes. For larger board sizes, better results are obtained using training data derived from records of games played by humans. We have used available data for training and validation at board sizes 9 × 9, 13 × 13, and 19 × 19. The data is briefly described below.

Data for 9 × 9 Boards: This dataset consists of 3,495 games. We randomly selected 3,166 games (90.6%) for training, and the remaining 328 games (9.4%) for validation. Most of the games in this dataset are played by amateurs. A subset of 424 games (12.13%) have at least one player with an olf ranking of 29, corresponding to a 3 kyu player.

Data for 13 × 13 Boards: This dataset consists of 4,175 games. Most of the games, however, are played by rather weak players and therefore cannot be used effectively for training. For validation purposes, we retained a subset of 91 games where both players have an olf ranking greater than or equal to 25, the equivalent of a 6 kyu player.

Data for 19 × 19 Boards: This high-quality dataset consists of 1,835 games played by professional players (at least 1 dan). A subset of 1,131 games (61.6%) are played by 9 dan players (the highest possible ranking). This is the dataset used in Stern et al. (2005).


3. System Architecture

3.1. Evaluation Function, Outputs, and Targets

Because Go is a game about territory, it is sensible to try to compute "expected territory" as the evaluation function, and to decompose this expectation as a sum of local probabilities. More specifically, let Aij(t) denote the ownership of intersection ij on the board at time t during the game. At the end of a game (Chen and Chen, 1999; Benson, 1976), each intersection can be black, white, or neutral¹. Black is represented as 1, white as 0, and neutral as 0.5. The same scheme with 0.5 for empty intersections, or more complicated schemes, can be used to represent ownership at various intermediate stages of the game. Let Oij(t) be the output of the learning system at intersection ij at time t in the game. Likewise, let Tij(t) be the corresponding training target. In the simplest case, we can use Tij(t) = Aij(T), where T denotes the end of the game. In this case, the output Oij(t) can be interpreted as the probability Pij(t), estimated at time t, of owning the ij intersection at the end of the game. Likewise, ∑_ij Oij(t) is the estimate, computed at time t, of the total expected area at the end of the game.

Propagation of information provided by targets/rewards computed at the end of the game only, however, can be problematic. With a dataset of training examples, this problem can be addressed because intermediary area values Aij(t) are available for training for any t. In the simulations presented here, we use a simple scheme

Tij(t) = (1 − w) Aij(T) + w Aij(t + k)    (1)

Here w ≥ 0 is a parameter that controls the convex combination between the area at the end of the game and the area at some step t + k in the nearer future. The special case w = 0 corresponds to the simple case described above where only the area at the end of the game is used in the target function.
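As an illustration only, the target of Equation (1) can be computed directly from per-intersection ownership labels recorded along a game. The sketch below assumes a hypothetical ownership array indexed by move number; the function name, the data layout, and the default values w = 0.25 and k = 2 (the settings reported in the Results) are ours, not the authors' code.

```python
import numpy as np

def territory_targets(ownership, t, k=2, w=0.25):
    """Convex-combination training targets of Eq. (1).

    ownership: array of shape (T+1, N, N) giving the ownership label of every
               intersection after each move (1 = black, 0 = white, 0.5 = neutral/empty).
    t:         index of the current position.
    Returns the (N, N) array of targets T_ij(t)."""
    T_end = ownership.shape[0] - 1      # index of the final position
    near = min(t + k, T_end)            # clip t + k at the end of the game
    return (1.0 - w) * ownership[T_end] + w * ownership[near]

# Setting w = 0 recovers the simple end-of-game target described above.
```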

To learn the local probabilities, and hence the evaluation function, from the targets, we propose to use a directed graphical model (Bayesian network) which in turn leads to a directed acyclic graph recursive neural network (DAG-RNN) architecture.

3.2. DAG-RNN Architectures

The basic architecture is closely related to an architecture originally proposed for a problem in a completely different area, the prediction of protein contact maps (Pollastri and Baldi, 2002; Baldi and Pollastri, 2003). As a Bayesian network, the basic architecture can also be viewed as a generalization of input-output bidirectional HMMs

1 This is called "seki". Seki is a stalling situation where two live groups share liberties and neither group can fill these liberties without becoming a prisoner of the opponent.

(Baldi et al., 1999) to two-dimensional problems. The architecture is described in terms of the DAG (Directed Acyclic Graph) of Figure 1, where the nodes are arranged in six lattice planes reflecting the Go board spatial organization. Each plane contains N × N nodes arranged on the vertices of a square lattice. In addition to the input and output planes, there are four hidden planes for the lateral propagation and integration of information across the Go board in each cardinal direction. Thus each intersection ij in an N × N board is associated with six variables. These variables consist of the input Iij, four hidden variables H^NE_ij, H^NW_ij, H^SW_ij, H^SE_ij, and the output Oij. Within each hidden plane, the edges of the square lattice are oriented towards one of the four cardinal directions (NE, NW, SE, and SW). Directed edges within a column of this architecture are given in Figure 1 and run from the input node to the hidden nodes, and from the input and hidden nodes to the output node.

To simplify belief propagation and learning, in a DAG-RNN the probabilistic relationship between a node variable and its parent node variables is replaced by a deterministic relationship parameterized by a neural network. Recursivity arises from weight sharing or stationarity, i.e. from the plate structure of the Bayesian network, when the same neural network is replicated at multiple locations. Thus the previous Bayesian network architecture leads to a DAG-RNN architecture consisting of five neural networks of the form

\[
\begin{aligned}
O_{ij} &= N_O\big(I_{ij},\, H^{NW}_{ij},\, H^{NE}_{ij},\, H^{SW}_{ij},\, H^{SE}_{ij}\big)\\
H^{NE}_{ij} &= N_{NE}\big(I_{ij},\, H^{NE}_{i-1,j},\, H^{NE}_{i,j-1}\big)\\
H^{NW}_{ij} &= N_{NW}\big(I_{ij},\, H^{NW}_{i+1,j},\, H^{NW}_{i,j-1}\big)\\
H^{SW}_{ij} &= N_{SW}\big(I_{ij},\, H^{SW}_{i+1,j},\, H^{SW}_{i,j+1}\big)\\
H^{SE}_{ij} &= N_{SE}\big(I_{ij},\, H^{SE}_{i-1,j},\, H^{SE}_{i,j+1}\big)
\end{aligned}
\qquad (2)
\]

where, for instance, N_O is a single neural network for the output that is shared across all spatial output locations. In addition, since Go is "isotropic" we use a single network shared across the four hidden planes. Thus N_NE is a single neural network for the lateral North-East propagation of information that is shared across all spatial locations in the North-East plane. In addition, Go involves strong boundary effects, and therefore we add one neural network N_C for the corners, shared across all four corners, and one neural network N_S for each side position, shared across all side positions and across all four sides.

In short, the entire Go DAG-RNN architecture is described by four feedforward NNs (corner, side, lateral, output) that are shared at all corresponding locations. For each one of these feedforward neural networks, we have experimented with several architectures, but we typically use a single hidden layer. The DAG-RNN in the main simulation results uses 16 hidden nodes and 8 output nodes for the lateral propagation networks, and 16 hidden nodes and one output node for the output network. All transfer functions are logistic. The total number of free parameters is 5,642 for the architecture used in the Results.
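To make the recursion of Equation (2) concrete, the sketch below propagates hidden states plane by plane, sweeping each plane in the orientation of its edges so that the two parent nodes of Equation (2) are always computed before their child, and then combines input and hidden states at each output node. It is a simplified illustration under stated assumptions: each shared network is reduced to a single logistic layer without bias, boundary parents are replaced by zero vectors instead of the dedicated corner and side networks, and the params container is a placeholder rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dag_rnn_forward(I, params, hidden=8):
    """One forward pass of the four-plane DAG-RNN of Eq. (2).

    I:      (N, N, F) array of input features per intersection.
    params: shared weights; params[d] has shape (hidden, F + 2*hidden) for each
            plane d in {"NE","NW","SW","SE"}, params["O"] has shape (F + 4*hidden,).
    Returns the (N, N) array of territory ownership probabilities O_ij."""
    N = I.shape[0]
    H = {d: np.zeros((N, N, hidden)) for d in ("NE", "NW", "SW", "SE")}
    # Offsets of the two parent nodes in each hidden plane (see Eq. (2)).
    parents = {"NE": (-1, -1), "NW": (+1, -1), "SW": (+1, +1), "SE": (-1, +1)}
    # Sweep order per plane, chosen so parents are visited before children.
    sweeps = {
        "NE": (range(N), range(N)),
        "NW": (range(N - 1, -1, -1), range(N)),
        "SW": (range(N - 1, -1, -1), range(N - 1, -1, -1)),
        "SE": (range(N), range(N - 1, -1, -1)),
    }
    for d, (di, dj) in parents.items():
        rows, cols = sweeps[d]
        for i in rows:
            for j in cols:
                p1 = H[d][i + di, j] if 0 <= i + di < N else np.zeros(hidden)
                p2 = H[d][i, j + dj] if 0 <= j + dj < N else np.zeros(hidden)
                H[d][i, j] = sigmoid(params[d] @ np.concatenate([I[i, j], p1, p2]))
    O = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            x = np.concatenate([I[i, j]] + [H[d][i, j] for d in ("NE", "NW", "SW", "SE")])
            O[i, j] = sigmoid(params["O"] @ x)   # local territory ownership probability
    return O
```

Training would unfold this computation in space and back-propagate through the shared weights, as described below.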


Table 1
Input features. The first three features (stone type, influence, and propensity) are properties associated with the corresponding intersection and a fixed number of surrounding locations. The other properties are group properties involving variable numbers of neighboring stones.

Feature | Description
b,w,e | stone type: black, white or empty
influence | influence from the stones of the same color and the opposing color
propensity | local statistics computed from 3 × 3 patterns in the training data (Section 3.3)
Neye | number of true eyes
N1st | number of liberties, which is the number of empty intersections connected to a group of stones; we also call these the 1st-order liberties
N2nd | number of 2nd-order liberties, defined as the liberties of the 1st-order liberties
N3rd | number of 3rd-order liberties, defined as the liberties of the 2nd-order liberties
N4th | number of 4th-order liberties, defined as the liberties of the 3rd-order liberties
O1st | features of the weakest connected opponent group (stone type, number of liberties, number of eyes)
O2nd | features of the second weakest connected opponent group (stone type, number of liberties, number of eyes)

Fig. 1. The DAG underlying the recursive neural network architecture. Nodes are regularly arranged in one input plane, one output plane, and four hidden planes, one for each cardinal direction. In each plane, nodes are arranged on a square lattice. All the directed edges of the square lattice in each hidden plane are oriented in the direction of one of the four possible cardinal corners: NE, NW, SW, and SE. Node NE_{i,j}, for instance, receives directed connections from its parent nodes NE_{i,j-1} and NE_{i+1,j}. Additional directed edges run vertically in each column ij. The input node I_ij is connected to four corresponding hidden nodes, one for each hidden plane. The input node and the hidden nodes are connected to the output node O_ij.


Note that in spite of the deterministic reparameterization of the Bayesian network, the overall model remains probabilistic in the sense that the outputs Oij of the logistic functions at each spatial location vary between zero and one, and can be interpreted as territory ownership probabilities.

Because the underlying graph is acyclic, these networks can be unfolded in space and training can proceed by simple gradient descent (back-propagation), taking into account relevant symmetries and weight sharing. In terms of symmetries, for instance, all game positions can be used for training a 'black' player that is about to make a move, simply by switching the color of the stones on those training moves where 'white' is about to play. Likewise, rotated board positions can be used for training. Because of the weight sharing, networks trained at one board size can immediately be applied to any other board size, thus providing a simple mechanism for reusing and extending acquired knowledge. For a board of size N × N, the training procedure scales like O(WMN^4), where W is the number of adjustable weights and M is the number of training games. There are roughly N^2 board positions in a game and, for each position, N^2 outputs Oij need to be trained, hence the O(N^4) scaling. Both the game records and the positions within each selected game record are randomly selected, without replacement, during training. Weights are updated essentially online, once every 10 game positions, with a learning rate of 0.00001. Training a single player on the 9 × 9 dataset takes on the order of a week on a current desktop computer, corresponding roughly to 50 training epochs at 3 hours per epoch.
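The color-switching and rotation symmetries mentioned above can be made explicit as a small data-augmentation step. The sketch assumes a +1/-1/0 stone encoding and ownership targets in [0, 1]; both the encoding and the function name are illustrative rather than taken from the paper.

```python
import numpy as np

def training_variants(board, target, to_move):
    """Symmetry variants of a training position.

    board:  N x N array with +1 = black, -1 = white, 0 = empty.
    target: N x N ownership/area target (1 = black, 0 = white, 0.5 = neutral).
    If white is to move, colors are swapped so the network always sees
    'black to play'; all eight rotations/reflections are generated as well."""
    if to_move == "white":
        board, target = -board, 1.0 - target   # swap colors in board and target
    variants = []
    for k in range(4):
        b, t = np.rot90(board, k), np.rot90(target, k)
        variants.append((b, t))
        variants.append((np.fliplr(b), np.fliplr(t)))   # mirror of each rotation
    return variants
```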

3.3. Inputs

At a given board intersection, the input vector Iij has multiple components from the list given in Table 1. The first three components, stone type, influence, and propensity, are associated with the corresponding intersection and a fixed number of surrounding locations. Influence and propensity are described below in more detail. The remaining features correspond to group properties (Chen, 1989) involving variable numbers of neighboring stones and are self-explanatory for those familiar with Go. The group Gij associated with a given intersection is the maximal set of stones of the same color that are connected to it. Neighboring (or connected) opponent groups of Gij are groups of the opposite color that are directly connected (adjacent) to Gij. The idea of using higher-order liberties is taken from Werf et al. (2003). O1st and O2nd provide the number of true eyes and the number of liberties of the weakest and the second weakest neighboring opponent groups. Weakness here is defined in lexicographic order with respect to the number of eyes first, followed by the number of liberties. Typical input vectors used in the experiments consist of a subset of these input features and typically have on the order of one hundred components.
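For concreteness, the group and liberty features of Table 1 can be extracted with a standard flood fill. The sketch below assumes a +1/-1/0 board encoding and takes one plausible reading of "liberties of the liberties" for the higher-order counts; the names and helpers are ours, for illustration.

```python
import numpy as np

def group_and_liberties(board, i, j):
    """Flood-fill the group containing (i, j); return (group, 1st-order liberties)."""
    color = board[i, j]
    assert color != 0, "intersection must carry a stone"
    N = board.shape[0]
    group, liberties, stack = set(), set(), [(i, j)]
    while stack:
        x, y = stack.pop()
        if (x, y) in group:
            continue
        group.add((x, y))
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if 0 <= nx < N and 0 <= ny < N:
                if board[nx, ny] == 0:
                    liberties.add((nx, ny))          # adjacent empty point
                elif board[nx, ny] == color:
                    stack.append((nx, ny))           # same-color stone: extend group
    return group, liberties

def next_order_liberties(board, current, seen):
    """Empty neighbors of the current liberty set not counted so far, i.e. one
    plausible reading of the 'liberties of the liberties' of Table 1."""
    N = board.shape[0]
    nxt = set()
    for (x, y) in current:
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if 0 <= nx < N and 0 <= ny < N and board[nx, ny] == 0 and (nx, ny) not in seen:
                nxt.add((nx, ny))
    return nxt
```

Calling next_order_liberties repeatedly, with seen accumulating all lower-order liberties, would yield the N2nd, N3rd, and N4th counts of Table 1 under this reading.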

Fig. 2. The ten unique 3 × 3 grid locations on a 9 × 9 board under rotation and mirror symmetries.

Influence: We use two types of influence calculation. Both algorithms are based on Chen's method (Chen, 2002). The first one is an exact implementation of Chen's method, which has a non-stringent influence propagation rule. The second one uses a stringent influence propagation rule. In Chen's method, any opponent stone can block the propagation of influence. In the stringent implementation, an opponent stone can block the propagation of influence if and only if it is stronger than the stone emitting the influence. Strength is again defined in lexicographic order with respect to the number of eyes first, followed by the number of liberties.

Propensity (Automated Learning and Scoring of a Pattern Library):

We develop a method to learn local patterns and their value automatically from a database of games. The basic method is illustrated in the case of the 3 × 3 patterns used in some of the simulations. Considering rotation and mirror symmetries, there are 10 unique locations for a 3 × 3 window on a 9 × 9 board, as shown in Figure 2 (Ralaivola et al., 2005). For each such 3 × 3 window, there are 3^9 possible stone patterns. The 3 × 3 windows at different board locations have different symmetries. For example, the unique location at the center of the board has 8 symmetries (4 rotations and 2 mirror symmetries), while the corner location, which can occur at 4 positions, has only 2 mirror symmetries. It is easy to see that for each unique location, the product of the number of symmetries and the number of board positions where it can occur must equal 8. Given any 3 × 3 pattern of stones in a board window and a set of games, we then compute nine numbers, one for each intersection. These numbers are local indicators of strength or propensity. The propensity Sij(p) of each intersection ij associated with stone pattern p and a 3 × 3 window w is defined as:

\[
S^{w}_{ij}(p) = \frac{NB_{ij}(p) - NW_{ij}(p)}{NB_{ij}(p) + NW_{ij}(p) + C} \qquad (3)
\]

where NBij(p) is the number of times that pattern p ends with a black stone at intersection ij at the end of the games in the data, and NWij(p) is the same for a white stone.


Fig. 3. Example of a 3 × 3 pattern with high propensity values. The numbers on the left board represent the calculated propensities at each intersection. The right board represents a corresponding tactical battle that might occur in the course of a game when it is white's turn to place a stone. To attack the black pattern, white would need to take both C and D. However, none of these positions can be taken directly. Hence white places a stone in A, to which black can respond by placing a stone in C, or by placing a stone in a different region of the board. If black places a stone in a different region of the board, white has the option to take B, forcing black to take D.

Both NBij(p) and NWij(p) are computed taking into account the location and the symmetries of the corresponding window w. C plays a regularizing role in the case of rare patterns and is set to 1 in the simulations. Thus S^w_ij(p) is an empirical normalized estimate of the local differential propensity towards conquering the corresponding intersection in the local context provided by the corresponding pattern and window. An example of propensity calculations is given in Figure 3, showing one of the highest ranking patterns calculated from the 9 × 9 training data set.

In general, a given intersection ij on the board is covered by several 3 × 3 windows. For instance, a corner location is associated with only one 3 × 3 window, whereas the center location is associated with nine different windows. Thus, for a given intersection ij on a given board, we can compute a value S^w_ij(p) for each different window that contains the intersection. In the following simulations, a single final value Sij(p) is computed by averaging over the different w's. However, more complex schemes that retain more information can easily be envisioned by, for instance: (1) also computing the standard deviation of the S^w_ij(p) as a function of w; (2) using a weighted average, weighted by the importance of the window w; and (3) using the entire set of S^w_ij(p) values, as w varies around ij, to augment the input vector.
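As an illustration of Equation (3), the sketch below accumulates the NB and NW counts over a collection of finished games and converts them into propensity scores. It keys each window by its raw board location rather than the ten canonical locations of Figure 2, omits the symmetry folding, and uses data structures of our own choosing, so it is a sketch of the counting scheme rather than the authors' procedure.

```python
import numpy as np
from collections import defaultdict

def window_pattern(board, top, left):
    """The 3x3 stone pattern anchored at (top, left); a fuller implementation
    would also canonicalize it over the window's rotation/mirror symmetries."""
    return tuple(board[top:top + 3, left:left + 3].flatten())

def learn_propensities(games, C=1.0):
    """Accumulate the NB/NW counts of Eq. (3) and return propensity tables.

    games: iterable of (positions, final_owner), where positions is a list of
    N x N boards (+1 black, -1 white, 0 empty) and final_owner is an N x N
    array of end-of-game ownership (+1 black, -1 white, 0 neutral)."""
    NB = defaultdict(lambda: np.zeros((3, 3)))
    NW = defaultdict(lambda: np.zeros((3, 3)))
    for positions, final_owner in games:
        N = final_owner.shape[0]
        for board in positions:
            for top in range(N - 2):
                for left in range(N - 2):
                    key = ((top, left), window_pattern(board, top, left))
                    owner = final_owner[top:top + 3, left:left + 3]
                    NB[key] += (owner == 1)     # final owner is black
                    NW[key] += (owner == -1)    # final owner is white
    # Propensity per (window location, pattern): one 3x3 score array each.
    return {k: (NB[k] - NW[k]) / (NB[k] + NW[k] + C) for k in NB}
```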

3.4. Move Selection and Search

For a given position, the next move can be selected using one-level search by considering all possible legal moves and computing the estimate of the total expected area E = ∑_ij Oij(T) at the end of the game, or at some intermediate position, or a combination of both, where Oij(t) are the outputs (predicted probabilities) of the DAG-RNNs.

Fig. 4. Validation error versus game phase. Curves correspond to a DAG-RNN system trained and validated using 9 × 9 amateur game data. The phase of the game is represented on the x axis by the number of stones present on the board. The y axis corresponds to the relative entropy between the predicted membership probabilities and the true probabilities, computed on the validation dataset. The four curves correspond to the validation error of the DAG-RNN after 1, 2, 30, and 38 epochs of training.

The next move can be chosen by maximizing this evaluation function (1-ply search). Alternatively, the next move can be chosen from all the legal moves with a probability proportional to e^(E/Temp), where Temp is a temperature parameter (Brugmann, 1993; Schraudolph et al., 1994; Stern et al., 2005; Bouzy and Helmstetter, 2003). The value of Temp plays an important role and is discussed in the Results. We have also experimented with a few other simple search schemes, such as 2-ply search (MinMax).
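A minimal sketch of the temperature-based move selection just described, assuming an evaluate callback that returns the predicted total territory E for a candidate move (the callback and board representation are placeholders, not part of the paper's code).

```python
import numpy as np

def select_move(legal_moves, evaluate, temp=1.0, rng=None):
    """Pick a move with probability proportional to exp(E / temp), where E is
    the predicted total territory after the move (1-ply lookahead)."""
    rng = rng or np.random.default_rng()
    E = np.array([evaluate(move) for move in legal_moves])
    if temp == 0:                      # greedy limit: take the top-ranked move
        return legal_moves[int(np.argmax(E))]
    logits = (E - E.max()) / temp      # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return legal_moves[rng.choice(len(legal_moves), p=probs)]
```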

4. Results

We have trained a large number of players using the methods described above. In the absence of training data, we used pure bootstrap approaches (e.g. reinforcement learning) at sizes 5 × 5 and 7 × 7, with results that were encouraging but clearly insufficient. Not surprisingly, when used to play at larger board sizes, the RNNs trained at these small board sizes yield rather weak players. The quality of most 13 × 13 games available to us is too poor for proper training, although a small subset can be used for validation purposes. Furthermore, we do not have any data for sizes N = 11, 15, and 17. Thus the most interesting results we report are derived by training the DAG-RNNs using the 9 × 9 game data, and using these systems to play at 9 × 9 and, more importantly, at larger board sizes. Several 9 × 9 players achieve comparable top performance. For conciseness, here we report the results obtained with one of them, trained with target parameters w = 0.25 and k = 2 in Equation 1.

Figure 4 shows how the validation error changes as training progresses. Validation error here is defined as the relative entropy between the output probabilities produced by the RNN and the target probabilities, computed on the validation data.


Fig. 5. Average move probabilities versus game phase when the temperature parameter Temp is equal to 1, 2, 10 and 100 respectively. The moves are from the 19 × 19 human expert dataset. Probabilities are computed using a DAG-RNN system trained on the amateur 9 × 9 dataset.

The validation error decreases quickly during the first epochs, each epoch corresponding to one pass through the training data. In this case, no substantial decrease in validation error is observed after 30 training epochs. Note also how the error is smaller towards the end of the game, due to the reduction in uncertainty as the game progresses.

For any input board position, the trained DAG-RNNs produce a set of local probabilities Oij. To derive a corresponding player, or to assess the quality of the system by examining its ability to retrieve or rank human expert moves, it is necessary to convert these local probabilities into global probabilities of individual moves and to rank them. One important issue in converting local output probabilities into global move probabilities proportional to e^(∑_ij Oij / Temp) is the choice of the temperature parameter Temp.

Figure 5 shows the average probability of the move played by human experts in the 19 × 19 board dataset, during different phases of the game. These probabilities are computed using the DAG-RNN trained using the 9 × 9 amateur dataset. The four curves correspond to four different values of the temperature parameter (Temp = 1, 2, 10 and 100). Overall, as expected, when the temperature decreases, the average probability increases. Furthermore, this probability also increases with the phase of the game. When the temperature is zero, the system assigns probability one to the move associated with the largest predicted area, and zero to all the other moves. Thus the maximum value of the average probability for the human moves is reached when Temp = 0 and is equal to the fraction of matched moves, i.e. the fraction of human moves that are ranked first according to the system. This fraction varies with the phase of the game. For instance, it is close to 10% for the 19 × 19 test dataset containing 2,601 moves from games played by human experts, taken between move 80 and move 83.

A choice of Temp = 0, however, is not sound since it corresponds to a state of overconfidence and exclusive focus on a single move. Thus, to determine the value of the temperature parameter, we maximize the average probability of the human moves after removing all the matched moves (Figure 6), which yields in this case an optimal temperature parameter Temp = 0.833 for moves 80-83.

Fig. 6. The average probability of human expert moves after removing the 'matched' moves. The x axis represents temperature on a logarithmic scale. The average is computed using the 19 × 19 data with moves 80-83. The probabilities are computed from the output of the 9 × 9 system trained on human amateur data. The maximum has a value of 0.014796 ≈ 1/68 and is reached for Temp = 0.833.

With this choice of temperature we can now compute the probability of any move, rank all moves accordingly, and evaluate the performance of the system by looking at how it ranks and scores the moves played by human experts. Thus we can compute the average probability of moves played by human expert players according to the DAG-RNN or other probabilistic systems such as (Stern et al., 2005). In Table 2, we report such probabilities for several systems and at different board sizes. In this test, the 9 × 9 validation dataset has 1,298 moves from amateur games between move 18 and 21. The 19 × 19 test dataset has 2,601 moves from professional games between move 80 and 83. The value of the temperature T is 1.25 for board size 9 × 9, 0.909 for 13 × 13, and 0.833 for 19 × 19. For size 19 × 19, we use the same test set used in (Stern et al., 2005). Boltzmann5 and BoltzmannLiberties are their results reported in the pre-published version of their NIPS 2004 paper. At this size, the probabilities in the table are computed using moves 80-83 of each game. For boards of size 19 × 19, a random player that selects moves uniformly at random among all legal moves assigns a probability of 1/281 = 1/[(19×19)−80] to the moves played by professional players in the dataset. BoltzmannLiberties was able to improve this probability to 1/194. The best DAG-RNNs trained using amateur data at 9 × 9 are capable of further improving this probability by over one order of magnitude, to 1/15. This probability corresponds to the fact that roughly 10% of human expert moves are ranked at the top by the DAG-RNN. Note that these information-retrieval metrics are only indicative of performance since, for instance, the move played by human experts may not always be the best possible move. But, to summarize, the DAG-RNN trained on amateur 9 × 9 data improves the probability of human expert moves on boards of size 19 × 19 by more than one order of magnitude over random.
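The evaluation just described, the average probability assigned to the experts' moves together with the fraction of moves ranked first, can be computed as sketched below. The move_scores callback standing in for the DAG-RNN plus 1-ply search, and the softmax conversion with the Section 3.4 temperature, are assumptions for illustration.

```python
import numpy as np

def expert_move_probability(positions, expert_moves, move_scores, temp=0.833):
    """Average probability assigned to expert moves and fraction ranked first.

    positions:    list of board positions to evaluate.
    expert_moves: the move actually played in each position.
    move_scores:  callable returning {move: predicted total territory E}."""
    probs, matched = [], 0
    for pos, played in zip(positions, expert_moves):
        scores = move_scores(pos)                        # E for every legal move
        moves, E = zip(*scores.items())
        p = np.exp((np.array(E) - max(E)) / temp)
        p /= p.sum()
        probs.append(p[moves.index(played)])
        matched += played == moves[int(np.argmax(p))]    # expert move ranked first?
    return float(np.mean(probs)), matched / len(positions)
```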


Table 2
Probabilities assigned by different systems to moves played by human players. Random players choose one move uniformly at random over all legal moves. The 9 × 9 validation dataset has 1,298 moves from amateur games taken between move 18 and 21. The 19 × 19 dataset has 2,601 moves from human expert games taken between move 80 and 83. For boards of size 19 × 19, a random player assigns probability 1/281 to the human expert moves. The DAG-RNN trained on the 9 × 9 amateur dataset assigns an average probability of 1/15 to the human expert moves. About 1 in 10 human expert moves is ranked first among all possible legal moves by the DAG-RNN.

Board Size | System | Log Probability | Probability
9 × 9 | Random player | -4.13 | 1/62
9 × 9 | RNN (1-ply search) | -1.86 | 1/7
13 × 13 | Random player | -4.88 | 1/132
13 × 13 | RNN (1-ply search) | -2.27 | 1/10
19 × 19 | Random player | -5.64 | 1/281
19 × 19 | Boltzmann5 | -5.55 | 1/254
19 × 19 | BoltzmannLiberties | -5.27 | 1/194
19 × 19 | RNN (1-ply search) | -2.70 | 1/15 [10% ranked 1st]

Fig. 7. Example of an outstanding move based on territory predictions made by the DAG-RNN. For each intersection, the height of the green bar represents the estimated probability that the intersection will be owned by black at the end of the game. The figure on the left shows the predicted probabilities if black passes. The figure on the right shows the predicted probabilities if black makes the move at N12. The overall predicted territory is 153.623 if black passes, and 164.007 if black plays N12. Move N12 causes the greatest increase in area and is the top-ranked move for the DAG-RNN. Indeed, this is the move selected in the game played by Zhou, Heyang (black, 8 dan) and Chang, Hao (white, 9 dan) on 10/22/2000. Black won the game by 2.75 points.


Figure 8 provides a ROC-like curve by displaying the percentage of moves made by professional human players on boards of size 19 × 19 that are contained in the m top-ranked moves according to the DAG-RNN trained on 9 × 9 amateur data, for various values of m across all phases of the game. For instance, when there are 80 stones on the board, and hence on the order of 300 legal moves available, there is a 50% chance that a move selected by a very highly ranked human player (9 dan) is found among the top 30 choices produced by the DAG-RNN.

A remarkable example where the top-ranked move according to the DAG-RNN coincides with the move actually played in a game between two very highly ranked players is given in Figure 7, illustrating also the underlying probabilistic territory calculations.

5. Discussion

In summary, we have designed a DAG-RNN approach for the game of Go and demonstrated that it can learn territory predictions fairly well.


Fig. 8. Percentage of moves made by expert human players on boards of size 19 × 19 that are contained in the m top-ranked moves according to the DAG-RNN trained on 9 × 9 amateur data, for various values of m. The baseline associated with the red curve corresponds to a random uniform player.

Systems trained using only a set of 9 × 9 amateur games achieve surprisingly good performance on a 19 × 19 test set that contains 1,835 games played by human experts. While the system described here achieves over one order of magnitude of improvement in the probability of human expert moves over random, it is clear that one order of magnitude of improvement in this probability remains to be made in order to achieve human expert competence on boards of size 19 × 19. The methods and results presented also clearly point to several possible related directions of improvement.

A first obvious direction is to train systems of size greater than 9 × 9. In spite of the O(N^4) scaling issue, training 19 × 19 systems is possible over a period of months. One issue, however, is the amount of training data. The datasets used are a good starting point, but it ought to be possible to collect much larger datasets. Creating a large, freely available repository of game records should be a key enabling step in order to develop better players and should have high priority. With current database and Web technologies, the creation of very large annotated game repositories is certainly possible.

A second direction is to further exploit patterns that are larger than 3 × 3, especially at the beginning of the game when the board is sparsely occupied and matching of large patterns is possible using, for instance, Zobrist hashing techniques (Zobrist, 1970). Techniques related to pattern matching are also likely to benefit from large databases and directly impinge on the fundamental question of the nature of intelligence raised in the Introduction, since a purely pattern-matching approach is likely to be regarded as a brute force approach.

A third direction is to combine different players, such as players trained at different board sizes, or players trained on different phases of the game. Perhaps surprisingly, in preliminary results so far we have not been able to gain any significant improvements by training different players specialized on different phases of the game (Figure 9).

Finally, it is also clear that the exploration/exploitation tradeoff plays a key role in Go, and better search strategies ought to be investigated, together with how one can effectively combine search and learning strategies.

Fig. 9. Probability of moves played by human experts in the 19 × 19 dataset versus game phase. Probabilities are computed according to 6 DAG-RNN players trained using data from different phases of the game in the 9 × 9 dataset. Each player, represented as pn1.n2, is trained using data between move n1 and move n2. The target is the board configuration at n2. Probabilities are evaluated on the first 200 moves. Player p0.81, trained on all moves, achieves the best results, particularly at early stages of the game.

Simple exhaustive schemes, such as 1-ply search, are too weak both in terms of their shallowness and their time complexity (Table 3). With W adjustable weights, we have seen that the learning complexity scales like O(WMN^4), with M training games. The 1-ply playing complexity, on the other hand, scales like O(WN^6): O(N^2) moves in a game, with O(N^2) possible moves at each position and O(N^2) sub-moves to explore.

Table 3
Time complexity for training and playing for different board sizes. Training column: approximate time complexity for training on one game (O(WN^4)). Playing column: approximate time complexity for playing one game using 1-ply search (O(WN^6)). The training time for Go boards of size 5 × 5 is used as the basic unit of time.

Board Size | Training | Playing
5 × 5 | 1 | 25
9 × 9 | 10.5 | 34
13 × 13 | 45.8 | 308.9
19 × 19 | 208.5 | 3010.9

Interestingly, random search methods (Abramson, 1990) have been used effectively to design a 9 × 9 Go program (Raiko, 2004) which can run on a palm computer. Random search methods, together with further analyses of the exploration/exploitation tradeoff in the armed bandit problem and in tree search (Auer et al., 2002; Kocsis and Szepesvari, 2006; Coulom, 2006), have also been used very recently to design an effective 9 × 9 player (Gelly et al., 2006). While random search methods can scale up to 19 × 19 in speed, they do not seem to scale up in player quality. Thus the problem of finding effective search methods that preserve the speed and scalability of random search but achieve better results, possibly by leveraging an evaluation function, is likely to play a central role in the ongoing quest for intelligent Go programs.


Acknowledgments

The work of PB and LW has been supported by a Laurel Wilkening Faculty Innovation award and awards from NSF, BREP, and Sun Microsystems to PB. We would like to thank Jianlin Chen for developing a Go graphical user interface, Nicol Schraudolph for providing the 9 × 9 and 13 × 13 data, and David Stern for providing the 19 × 19 data.

References

Abramson, B., 1990. Expected-outcome: a general model of static evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (2), 182–193.

Auer, P., Cesa-Bianchi, N., Fischer, P., 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), 235–256.

Baldi, P., Brunak, S., Frasconi, P., Pollastri, G., Soda, G., 1999. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 15, 937–946.

Baldi, P., Pollastri, G., 2003. The principled design of large-scale recursive neural network architectures: DAG-RNNs and the protein structure prediction problem. Journal of Machine Learning Research 4, 575–602.

Beal, D. F., 1990. A generalised quiescence search algorithm. Artificial Intelligence 43 (1), 85–98.

Benson, D. B., 1976. Life in the game of Go. Information Sciences 10, 17–29.

Berlekamp, E., Wolfe, D., 1994. Mathematical Go: Chilling Gets the Last Point. A K Peters, Wellesley, MA.

Bouzy, B., Cazenave, T., 2001. Computer Go: An AI-oriented survey. Artificial Intelligence 132 (1), 39–103.

Bouzy, B., Helmstetter, B., 2003. Monte Carlo Go developments. Advances in Computer Games Conference 10.

Brugmann, B., 1993. Monte Carlo Go.

Chen, K., 1989. Group identification in computer Go. Heuristic Programming in Artificial Intelligence: the First Computer Olympiad 1, 195–210.

Chen, K., 1990. The move decision process of Go Intellect. Computer Go 14, 9–17.

Chen, K., 1992. Attack and defense. Heuristic Programming in Artificial Intelligence 3, 146–156.

Chen, K., Chen, Z., 1999. Static analysis of life and death in the game of Go. Information Sciences 121 (1-2), 113–134.

Chen, Z., 2002. Semi-empirical quantitative theory of Go part 1: Estimation of the influence of a wall. ICGA Journal 25 (4), 211–218.

Cobb, W. S., 2002. The Book of Go. Sterling Publishing Co., New York, NY.

Coulom, R., 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In: Proceedings of the 5th International Conference on Computers and Games, Turin, Italy.

Enzenberger, M., 1996. The integration of a priori knowledge into a Go playing neural network.

Enzenberger, M., 2003. Evaluation in Go by a neural network using soft segmentation. 10th Advances in Computer Games Conference, 97–108.

Fotland, D., 1993. Knowledge representation in the Many Faces of Go.

Gelly, S., Wang, Y., Valizadegan, H., Teytaud, O., 2006. UCT and MoGo: exploration vs exploitation in computer Go. In: Demos of Advances in Neural Information Processing Systems.

Graepel, T., Goutrie, M., Kruger, M., Herbrich, R., 2001. Learning on graphs in the game of Go. In: Proceedings of the International Conference on Artificial Neural Networks, ICANN 2001. pp. 347–352.

Iwamoto, K., 1972. Go for Beginners. Pantheon Books, New York, NY.

Kocsis, L., Szepesvari, C., 2006. Bandit-based Monte-Carlo planning. In: ECML.

Plaat, A., Schaeffer, J., Pijls, W., de Bruin, A., 1996. Exploiting graph properties of game trees. In: 13th National Conference on Artificial Intelligence (AAAI'96). pp. 234–239.

Pollastri, G., Baldi, P., 2002. Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics 18, S62–S70.

Raiko, T., 2004. The Go-playing program called Go81. In: Proceedings of the Finnish Artificial Intelligence Conference, STeP 2004.

Ralaivola, L., Wu, L., Baldi, P., 2005. SVM and pattern-enriched common fate graphs for the game of Go. ESANN 2005, 485–490.

Russell, S. J., Norvig, P., 2002. Artificial Intelligence: A Modern Approach, 2nd Edition. Prentice Hall.

Ryder, J. L., 1971. Heuristic analysis of large trees as generated in the game of Go. Ph.D. thesis, Department of Computer Science, Stanford University.

Schraudolph, N. N., Dayan, P., Sejnowski, T. J., 1994. Temporal difference learning of position evaluation in the game of Go. In: Advances in Neural Information Processing Systems 6. pp. 817–824.

Schrufer, G., 1989. A strategic quiescence search. ICCA Journal 12 (1), 3–9.

Stern, D. H., Graepel, T., MacKay, D. J. C., 2005. Modelling uncertainty in the game of Go. In: Advances in Neural Information Processing Systems 17. pp. 1353–1360.

Sutton, R. S., Barto, A., 1998. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts; London, England.

Werf, E., Herik, H., Uiterwijk, J., 2003. Learning to score final positions in the game of Go. In: Advances in Computer Games: Many Games, Many Challenges. pp. 143–158.

Zobrist, A. L., 1970. A new hashing method with application for game playing. Technical report 88, University of Wisconsin, April 1970. Reprinted in ICCA Journal 13 (2), 1990, pp. 69–73.
