A Latent Variable Model for Generative Dependency Parsing



Proceedings of the 10th Conference on Parsing Technologies, pages 144–155, Prague, Czech Republic, June 2007. © 2007 Association for Computational Linguistics


Ivan Titov
University of Geneva
24, rue Général Dufour
CH-1211 Genève 4, Switzerland
[email protected]

James Henderson
University of Edinburgh
2 Buccleuch Place
Edinburgh EH8 9LW, United Kingdom
[email protected]

Abstract

We propose a generative dependency parsing model which uses binary latent variables to induce conditioning features. To define this model we use a recently proposed class of Bayesian Networks for structured prediction, Incremental Sigmoid Belief Networks. We demonstrate that the proposed model achieves state-of-the-art results on three different languages. We also demonstrate that the features induced by the ISBN's latent variables are crucial to this success, and show that the proposed model is particularly good on long dependencies.

1 Introduction

Dependency parsing has been a topic of active research in natural language processing during the last several years. The CoNLL-X shared task (Buchholz and Marsi, 2006) made a wide selection of standardized treebanks for different languages available for the research community and allowed for easy comparison between various statistical methods on a standardized benchmark. One of the surprising things discovered by this evaluation is that the best results are achieved by methods which are quite different from state-of-the-art models for constituent parsing, e.g. the deterministic parsing method of (Nivre et al., 2006) and the minimum spanning tree parser of (McDonald et al., 2006). All the most accurate dependency parsing models are fully discriminative, unlike constituent parsing, where all the state-of-the-art methods have a generative component (Charniak and Johnson, 2005; Henderson, 2004; Collins, 2000). Another surprising thing is the lack of latent variable models among the methods used in the shared task. Latent variable models would allow complex features to be induced automatically, which would be highly desirable in multilingual parsing, where manual feature selection might be very difficult and time consuming, especially for languages unknown to the parser developer.

In this paper we propose a generative latent variable model for dependency parsing. It is based on Incremental Sigmoid Belief Networks (ISBNs), a class of directed graphical models for structured prediction problems recently proposed in (Titov and Henderson, 2007), where they were demonstrated to achieve competitive results on the constituent parsing task. As discussed in (Titov and Henderson, 2007), computing the conditional probabilities which we need for parsing is in general intractable with ISBNs, but they can be approximated efficiently in several ways. In particular, the neural network constituent parsers in (Henderson, 2003) and (Henderson, 2004) can be regarded as coarse approximations to their corresponding ISBN model.

ISBNs use history-based probability models. The most common approach to handling the unbounded nature of the parse histories in these models is to choose a pre-defined set of features which can be unambiguously derived from the history (e.g. (Charniak, 2000; Collins, 1999; Nivre et al., 2004)). Decision probabilities are then assumed to be independent of all information not represented by this finite set of features. ISBNs instead use a vector of binary latent variables to encode the information about the parser history. This history vector is similar to the hidden state of a Hidden Markov Model. But unlike the graphical model for an HMM, which specifies conditional dependency edges only between adjacent states in the sequence, the ISBN graphical model can specify conditional dependency edges between states which are arbitrarily far apart in the parse history. The source state of such an edge is determined by the partial output structure built at the time of the destination state, thereby allowing the conditional dependency edges to be appropriate for the structural nature of the problem being modeled. This structure sensitivity is possible because ISBNs are a constrained form of switching model (Murphy, 2002), where each output decision switches the model structure used for the remaining decisions.

We build an ISBN model of dependency parsing using the parsing order proposed in (Nivre et al., 2004). However, instead of performing deterministic parsing as in (Nivre et al., 2004), we use this ordering to define a generative history-based model, by integrating word prediction operations into the set of parser actions. Then we propose a simple, language-independent set of relations which determine how latent variable vectors are interconnected by conditional dependency edges in the ISBN model. ISBNs also condition the latent variable vectors on a set of explicit features, which we vary in the experiments.

In experiments we evaluate both the performance of the ISBN dependency parser compared to previous work, and the ability of the ISBN model to induce complex history features. Our model achieves state-of-the-art performance on the languages we test, significantly outperforming the model of (Nivre et al., 2006) on two languages out of three and demonstrating about the same results on the third. In order to test the model's feature induction abilities, we train models with two different sets of explicit conditioning features: the feature set individually tuned by (Nivre et al., 2006) for each considered language, and a minimal set of local features. These models achieve comparable accuracy, unlike with the discriminative SVM-based approach of (Nivre et al., 2006), where careful feature selection appears to be crucial. We also conduct a controlled experiment where we use the tuned features of (Nivre et al., 2006) but disable the feature induction abilities of our model by eliminating the edges connecting latent state vectors. This restricted model achieves far worse results, showing that it is exactly the capacity of ISBNs to induce history features which is the key to its success. It also motivates further research into how feature induction techniques can be exploited in discriminative parsing methods.

We analyze how the relation accuracy changes with the length of the head-dependent relation, demonstrating that our model very significantly outperforms the state-of-the-art baseline of (Nivre et al., 2006) on long dependencies. Additional experiments suggest that both the feature induction abilities and the use of beam search contribute to this improvement.

The fact that our model defines a probability model over parse trees, unlike the previous state-of-the-art methods (Nivre et al., 2006; McDonald et al., 2006), makes it easier to use this model in applications which require probability estimates, e.g. in language processing pipelines. Also, as with any generative model, it may be easy to improve the parser's accuracy by using discriminative retraining techniques (Henderson, 2004) or data-defined kernels (Henderson and Titov, 2005), with or even without the introduction of any additional linguistic features. In addition, there are some applications, such as language modeling, which require generative models. Another advantage of generative models is that they do not suffer from the label bias problem (Bottou, 1991), which is a potential problem for conditional or deterministic history-based models, such as (Nivre et al., 2004).

In the remainder of this paper, we will first review general ISBNs and how they can be approximated. Then we will define the generative parsing model, based on the algorithm of (Nivre et al., 2004), and propose an ISBN for this model. The empirical part of the paper then evaluates both the overall accuracy of this method and the importance of the model's capacity to induce features. Additional related work will be discussed in the last section before concluding.


2 The Latent Variable Architecture

In this section we will begin by briefly introducing the class of graphical models we will be using, Incremental Sigmoid Belief Networks (Titov and Henderson, 2007). ISBNs are designed specifically for modeling structured data. They are latent variable models which are not tractable to compute exactly, but two approximations exist which have been shown to be effective for constituent parsing (Titov and Henderson, 2007). Finally, we present how these approximations can be trained.

2.1 Incremental Sigmoid Belief Networks

An ISBN is a form of Sigmoid Belief Network (SBN) (Neal, 1992). SBNs are Bayesian Networks with binary variables and conditional probability distributions in the form:

P(S_i = 1 | Par(S_i)) = σ( ∑_{S_j ∈ Par(S_i)} J_{ij} S_j ),

where S_i are the variables, Par(S_i) are the variables which S_i depends on (its parents), σ denotes the logistic sigmoid function, and J_{ij} is the weight for the edge from variable S_j to variable S_i in the graphical model. SBNs are similar to feed-forward neural networks, but unlike neural networks, SBNs have a precise probabilistic semantics for their hidden variables. ISBNs are based on a generalized version of SBNs where variables with any range of discrete values are allowed. The normalized exponential function ("soft-max") is used to define the conditional probability distributions at these nodes.
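As a concrete illustration, the following minimal numpy sketch computes this conditional distribution for a single variable; the function and variable names (sbn_conditional, J_row, parent_values) are ours, not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sbn_conditional(J_row, parent_values):
    """P(S_i = 1 | Par(S_i)): logistic sigmoid of the weighted sum of the
    binary parent values, with J_row holding the weights J_ij."""
    return sigmoid(np.dot(J_row, parent_values))

# Example: one variable with three binary parents.
J_row = np.array([0.5, -1.2, 2.0])
parents = np.array([1, 0, 1])
print(sbn_conditional(J_row, parents))  # probability that S_i = 1
```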

To extend SBNs for processing arbitrarily long sequences, such as a parser's sequence of decisions D^1, ..., D^m, SBNs are extended to a form of Dynamic Bayesian Network (DBN). In DBNs, a new set of variables is instantiated for each position in the sequence, but the edges and weights are the same for each position in the sequence. The edges which connect variables instantiated for different positions must be directed forward in the sequence, thereby allowing a temporal interpretation of the sequence.

Incremental Sigmoid Belief Networks (Titov and Henderson, 2007) differ from simple dynamic SBNs in that they allow the model structure to depend on the output variable values. Specifically, a decision is allowed to affect the placement of any edge whose destination is after the decision. This results in a form of switching model (Murphy, 2002), where each decision switches the model structure used for the remaining decisions. The incoming edges for a given position are a discrete function of the sequence of decisions which precede that position. This makes the ISBN an "incremental" model, not just a dynamic model. The structure of the model is determined incrementally as the decision sequence proceeds.

ISBNs are designed to allow the model structure to depend on the output values without overly complicating the inference of the desired conditional probabilities P(D^t | D^1, ..., D^{t-1}), the probability of the next decision given the history of previous decisions. In particular, it is never necessary to sum over all possible model structures, which in general would make inference intractable.

2.2 Modeling Structures with ISBNs

ISBNs are designed for modeling structured data where the output structure is not given as part of the input. In dependency parsing, this means they can model the probability of an output dependency structure when the input only specifies the sequence of words (i.e. parsing). The difficulty with such problems is that the statistical dependencies in the dependency structure are local in the structure, and not necessarily local in the word sequence. ISBNs allow us to capture these statistical dependencies in the model structure by having model edges depend on the output variables which specify the dependency structure. For example, if an output specifies that there is a dependency arc from word w_i to word w_j, then any future decision involving w_j can directly depend on its head w_i. This allows the head w_i to be treated as local to the dependent w_j even if they are far apart in the sentence.

This structurally-defined notion of locality is particularly important for the model's latent variables. When the structurally-defined model edges connect latent variables, information can be propagated between latent variables, thereby providing an even larger structural domain of locality than that provided by single edges. This provides a potentially powerful form of feature induction, which is nonetheless biased toward a notion of locality which is appropriate for the structure of the problem.


2.3 Approximating ISBNs

(Titov and Henderson, 2007) proposes two approximations for inference in ISBNs, both based on variational methods. The main idea of variational methods (Jordan et al., 1999) is, roughly, to construct a tractable approximate model with a number of free parameters. The values of the free parameters are set so that the resulting approximate model is as close as possible to the original graphical model for a given inference problem.

The simplest example of a variational method is the mean field method, which uses a fully factorized distribution Q(H|V) = ∏_i Q_i(h_i|V) as the approximate model, where V are the visible (i.e. known) variables, H = h_1, ..., h_l are the hidden (i.e. latent) variables, and each Q_i is the distribution of an individual latent variable h_i. The free parameters of this approximate model are the means μ_i of the distributions Q_i.

(Titov and Henderson, 2007) proposes two approximate models based on the variational approach. First, they show that the neural network of (Henderson, 2003) can be viewed as a coarse mean field approximation of ISBNs, which they call the feed-forward approximation. This approximation imposes the constraint that the free parameters μ_i of the approximate model are only allowed to depend on the distributions of their parent variables. This constraint increases the potential for a large approximation error, but it significantly simplifies the computations by allowing all the free parameters to be set in a single pass over the model.
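The following sketch illustrates the spirit of this single-pass computation under our own simplifying assumptions (one weight matrix per incoming connection, no bias terms): each new mean vector is a deterministic function of the means of its connected predecessor vectors only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward_means(parent_means, weight_matrices):
    """Coarse (feed-forward) approximation: the means of a new latent state
    vector depend only on the means of its connected predecessor vectors,
    so all means can be set in a single forward pass over the model."""
    total = sum(W @ mu for W, mu in zip(weight_matrices, parent_means))
    return sigmoid(total)

# Example: a new 4-dimensional state connected to two previous states.
mu_prev1 = np.full(4, 0.5)
mu_prev2 = np.array([0.9, 0.1, 0.3, 0.7])
W1, W2 = np.random.randn(4, 4), np.random.randn(4, 4)
print(feedforward_means([mu_prev1, mu_prev2], [W1, W2]))
```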

The second approximation proposed in (Titov and Henderson, 2007) takes into consideration the fact that, after each decision is made, all the preceding latent variables should have their means μ_i updated. This approximation extends the feed-forward approximation to account for the most important components of this update. They call this approximation the mean field approximation, because a mean field approximation is applied to handle the statistical dependencies introduced by the new decisions. This approximation was shown to be a more accurate approximation of ISBNs than the feed-forward approximation, while remaining tractable. It was also shown to achieve significantly better accuracy on constituent parsing.

2.4 Learning

Training these approximations of ISBNs is done to maximize the fit of the approximate models to the data. We use gradient descent, and a regularized maximum likelihood objective function. Gaussian regularization is applied, which is equivalent to the weight decay standardly used in neural networks. Regularization was reduced through the course of learning.

Gradient descent requires computing the derivatives of the objective function with respect to the model parameters. In the feed-forward approximation, this can be done with the standard Backpropagation learning used with neural networks. For the mean field approximation, propagating the error all the way back through the structure of the graphical model requires a more complicated calculation, but it can still be done efficiently (see (Titov and Henderson, 2007) for details).
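For concreteness, a single regularized update might look like the sketch below; the learning rate and regularization strength are placeholder values of ours, not settings reported in the paper.

```python
import numpy as np

def regularized_step(weights, log_likelihood_grad, learning_rate=0.01, l2=1e-4):
    """One gradient step on the regularized log-likelihood. The Gaussian
    prior contributes a -l2 * weights term to the gradient, which is the
    same as the weight decay standardly used with neural networks."""
    return weights + learning_rate * (log_likelihood_grad - l2 * weights)

# Dummy example; in practice the gradient comes from backpropagation.
w = np.zeros(5)
grad = np.array([0.2, -0.1, 0.0, 0.3, -0.4])
print(regularized_step(w, grad))
```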

3 The Dependency Parsing Algorithm

The sequences of decisions D^1, ..., D^m which we will be modeling with ISBNs are the sequences of decisions made by a dependency parser. For this we use the parsing strategy for projective dependency parsing introduced in (Nivre et al., 2004), which is similar to a standard shift-reduce algorithm for context-free grammars (Aho et al., 1986). It can be viewed as a mixture of bottom-up and top-down parsing strategies, where left dependencies are constructed in a bottom-up fashion and right dependencies are constructed top-down. For details we refer the reader to (Nivre et al., 2004). In this section we briefly describe the algorithm and explain how we use it to define our history-based probability model.

In this paper, as in the CoNLL-X shared task, we consider labeled dependency parsing. The state of the parser is defined by the current stack S, the queue I of remaining input words and the partial labeled dependency structure constructed by previous parser decisions. The parser starts with an empty stack S and terminates when it reaches a configuration with an empty queue I. The algorithm uses 4 types of decisions:

1. The decision Left-Arc_r adds a dependency arc from the next input word w_j to the word w_i on top of the stack and selects the label r for the relation between w_i and w_j. Word w_i is then popped from the stack.

2. The decision Right-Arc_r adds an arc from the word w_i on top of the stack to the next input word w_j and selects the label r for the relation between w_i and w_j.

3. The decision Reduce pops the word w_i from the stack.

4. The decision Shift_{w_j} shifts the word w_j from the queue to the stack.

Unlike the original definition in (Nivre et al., 2004), the Right-Arc_r decision does not shift w_j to the stack. However, the only thing the parser can do after a Right-Arc_r decision is to choose the Shift_{w_j} decision. This subtle modification does not change the actual parsing order, but it does simplify the definition of our graphical model, as explained in section 4.
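The following sketch shows the four decision types as we read them, with the modified Right-Arc_r that leaves the shift to a separate decision; the data structures and the toy example are our own.

```python
def left_arc(stack, queue, arcs, label):
    # Arc from the front of the queue to the stack top; pop the stack top.
    head, dependent = queue[0], stack.pop()
    arcs.append((head, dependent, label))

def right_arc(stack, queue, arcs, label):
    # Arc from the stack top to the front of the queue. Unlike Nivre et al.
    # (2004), the front word is NOT shifted here; the only decision that can
    # follow is Shift.
    arcs.append((stack[-1], queue[0], label))

def reduce_(stack):
    # Pop the stack top (its head has already been assigned).
    stack.pop()

def shift(stack, queue):
    # Move the front of the queue onto the stack; in the generative model
    # this decision also predicts the next word of the queue.
    stack.append(queue.pop(0))

# Toy run on a 3-word sentence, with words represented by their positions.
stack, queue, arcs = [], [1, 2, 3], []
shift(stack, queue)                    # stack=[1], queue=[2, 3]
left_arc(stack, queue, arcs, "SBJ")    # arc 2 -> 1, stack=[]
shift(stack, queue)                    # stack=[2], queue=[3]
right_arc(stack, queue, arcs, "OBJ")   # arc 2 -> 3
shift(stack, queue)                    # stack=[2, 3], queue=[] (parsing ends)
print(arcs)                            # [(2, 1, 'SBJ'), (2, 3, 'OBJ')]
```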

We use a history-based probability model, which decomposes the probability of the parse according to the parser decisions:

P(T) = P(D^1, ..., D^m) = ∏_t P(D^t | D^1, ..., D^{t-1}),

where T is the parse tree and D^1, ..., D^m is its equivalent sequence of parser decisions. Since we need a generative model, the action Shift_{w_j} also predicts the next word in the queue I, w_{j+1}; thus P(Shift_{w_j} | D^1, ..., D^{t-1}) is the probability both of the shift operation and of the word w_{j+1}, conditioned on the current parsing history.¹

Instead of treating each D^t as an atomic decision, it is convenient to split it into a sequence of elementary decisions D^t = d^t_1, ..., d^t_n:

P(D^t | D^1, ..., D^{t-1}) = ∏_k P(d^t_k | h(t, k)),

where h(t, k) denotes the parsing history D^1, ..., D^{t-1}, d^t_1, ..., d^t_{k-1}. We split Left-Arc_r and Right-Arc_r each into two elementary decisions: first, the parser decides to create the corresponding arc; then, it decides to assign a relation r to the arc. Similarly, we decompose the decision Shift_{w_j} into an elementary decision to shift a word and a prediction of the word w_{j+1}. In our experiments we use datasets from the CoNLL-X shared task, which provide additional properties for each word token, such as its part-of-speech tag and some fine-grained features. This information implicitly induces a word clustering, which we use in our model: first we predict a part-of-speech tag for the word, then a set of word features, treating the feature combination as an atomic value, and only then the particular word form. This approach allows us both to decrease the effect of sparsity and to avoid normalization across all the words in the vocabulary, significantly reducing the computational expense of word prediction.

¹ In preliminary experiments, we also considered look-ahead, where the word is predicted earlier than it appears at the head of the queue I, and "anti-look-ahead", where the word is predicted only when it is shifted to the stack S. Early prediction allows conditioning decision probabilities on the words in the look-ahead and thus speeds up the search for an optimal decision sequence. However, the loss of accuracy with look-ahead was quite significant. The described method, where a new word is predicted when it appears at the head of the queue, led to the most accurate model and quite efficient search. The anti-look-ahead model was both less accurate and slower.

Figure 1: An ISBN for estimating P(d^t_k | h(t, k)).
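A sketch of this factored word prediction, assuming (our assumption) direct access to the three component distributions; in the actual model each factor is computed from the latent state vector rather than from fixed tables.

```python
def shift_word_probability(p_tag, p_feats_given_tag, p_form_given_tag_feats,
                           tag, feats, form):
    """Probability of the predicted word as a chain of elementary decisions:
    first its part-of-speech tag, then its feature bundle (one atomic value),
    then the word form. Normalization is only over tags, over bundles for a
    tag, and over forms for a (tag, bundle) pair, never over the full
    vocabulary at once."""
    return (p_tag[tag]
            * p_feats_given_tag[tag][feats]
            * p_form_given_tag_feats[(tag, feats)][form])

# Toy distributions for a single prediction step (illustrative numbers only).
p_tag = {"NOUN": 0.6, "VERB": 0.4}
p_feats_given_tag = {"NOUN": {"sg": 0.7, "pl": 0.3}, "VERB": {"pres": 1.0}}
p_form_given_tag_feats = {("NOUN", "sg"): {"dog": 0.2, "cat": 0.1},
                          ("NOUN", "pl"): {"dogs": 0.3},
                          ("VERB", "pres"): {"runs": 0.5}}
print(shift_word_probability(p_tag, p_feats_given_tag, p_form_given_tag_feats,
                             "NOUN", "sg", "dog"))   # 0.6 * 0.7 * 0.2
```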

4 An ISBN for Dependency Parsing

In this section we define the ISBN model we use for dependency parsing. An example of this ISBN for estimating P(d^t_k | h(t, k)) is illustrated in figure 1. It is organized into vectors of variables: latent state variable vectors S^{t'} = s^{t'}_1, ..., s^{t'}_n, representing an intermediate state at position t', and decision variable vectors D^{t'}, representing a decision at position t', where t' ≤ t. Variables whose values are given at the current decision (t, k) are shaded in figure 1; latent and current decision variables are left unshaded.

As illustrated by the edges in figure 1, the probability of each state variable s^{t'}_i (the individual circles in S^{t'}) depends on all the variables in a finite set of relevant previous state and decision vectors, but there are no direct dependencies between the different variables in a single state vector. For each relevant decision vector, the precise set of decision variables which are connected in this way can be adapted to a particular language. As long as these connected decisions include all the new information about the parse, the performance of the model is not very sensitive to this choice. This is because ISBNs have the ability to induce their own complex features of the parse history, as demonstrated in the experiments in section 6.

The most important design decision in building an ISBN model is choosing the finite set of relevant previous state vectors for the current decision. By connecting to a previous state, we place that state in the local context of the current decision. This specification of the domain of locality determines the inductive bias of learning with ISBNs. When deciding what information to store in its latent variables, an ISBN is more likely to choose information which is immediately local to the current decision. This stored information then becomes local to any following connected decision, where it again has some chance of being chosen as relevant to that decision. In this way, the information available to a given decision can come from arbitrarily far away in the chain of interconnected states, but it is much more likely to come from a state which is relatively local. Thus, we need to choose the set of local (i.e. connected) states in accordance with our prior knowledge about which previous decisions are likely to be particularly relevant to the current decision.

To choose which previous decisions are particularly relevant to the current decision, we make use of the partial dependency structure which has been decided so far in the parse. Specifically, the current latent state vector is connected to a set of 7 previous latent state vectors (if they exist) according to the following relationships:

1. Input Context: the last previous state with the same queue I.

2. Stack Context: the last previous state with the same stack S.

3. Right Child of Top of S: the last previous state where the rightmost right child of the current stack top was on top of the stack.

4. Left Child of Top of S: the last previous state where the leftmost left child of the current stack top was on top of the stack.

5. Left Child of Front of I:² the last previous state where the leftmost child of the front element of I was on top of the stack.

6. Head of Top: the last previous state where the head word of the current stack top was on top of the stack.

7. Top of S at Front of I: the last previous state where the current stack top was at the front of the queue.

Each of these 7 relations has its own distinct weight matrix for the resulting edges in the ISBN, but the same weight matrix is used at each position where the relation is relevant.

All these relations but the last one are motivated by linguistic considerations. The current decision is primarily about what to do with the current word on the top of the stack and the current word on the front of the queue. The Input Context and Stack Context relationships connect to the most recent states used for making decisions about each of these words. The Right Child of Top of S relationship connects to a state used for making decisions about the most recently attached dependent of the stack top. Similarly, the Left Child of Front of I relationship connects to a state for the most recently attached dependent of the queue front. The Left Child of Top of S is the first dependent of the stack top, which is a particularly informative dependent for many languages. Likewise, the Head of Top can tell us a lot about the stack top, if it has been chosen already.
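The sketch below shows one way (our own formulation, not code from the paper) to pick out a few of these related states: each past state records the parser configuration at the time its latent vector was created, and we search backwards for the most recent state satisfying each condition.

```python
def latest(history, condition):
    """Most recent past state satisfying `condition`, or None. Each state in
    `history` is a dict recording the parser configuration at the moment the
    corresponding latent vector was created."""
    for state in reversed(history):
        if condition(state):
            return state
    return None

def related_states(history, stack, queue, heads):
    """A few of the structural relations used to connect latent vectors;
    heads[w] is the head of w (if already decided). The remaining relations
    follow the same pattern, using the left/right children decided so far."""
    top = stack[-1] if stack else None
    return {
        "input_context": latest(history, lambda s: s["queue"] == list(queue)),
        "stack_context": latest(history, lambda s: s["stack"] == list(stack)),
        "head_of_top": latest(history, lambda s: top in heads
                              and s["top"] == heads[top]),
        "top_at_front_of_queue": latest(history, lambda s: s["front"] == top),
    }
```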

A second motivation for including a state in the local context of a decision is that it might contain information which has no other route for reaching the current decision. In particular, it is generally a good idea to ensure that the immediately preceding state is always included somewhere in the set of connected states. This requirement ensures that information, at least theoretically, can pass between any two states in the decision sequence, thereby avoiding any hard independence assumptions. The last relation, Top of S at Front of I, is included mainly to fulfill this requirement. Otherwise, after a Shift_{w_j} operation, the preceding state would not be selected by any of the relationships.

² We refer to the head of the queue as the front, to avoid unnecessary ambiguity of the word head in the context of dependency parsing.

As indicated in figure 1, the probability of each elementary decision d^{t'}_k depends both on the current state vector S^{t'} and on the previously chosen elementary action d^{t'}_{k-1} from D^{t'}. This probability distribution has the form of a normalized exponential:

P(d^{t'}_k = d | S^{t'}, d^{t'}_{k-1}) = Φ_{h(t',k)}(d) exp( ∑_j W_{dj} s^{t'}_j ) / ∑_{d'} Φ_{h(t',k)}(d') exp( ∑_j W_{d'j} s^{t'}_j ),

where Φ_{h(t',k)} is the indicator function of the set of elementary decisions that may possibly follow the last decision in the history h(t', k), and the W_{dj} are the weights. Now it is easy to see why the original decision Right-Arc_r (Nivre et al., 2004) had to be decomposed into two distinct decisions: the decision to construct a labeled arc and the decision to shift the word. Use of the composite Right-Arc_r would have required the introduction of individual parameters for each pair (w, r), where w is an arbitrary word in the lexicon and r is an arbitrary dependency relation.
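A minimal sketch of this distribution, where W plays the role of the decision weight matrix and the boolean mask `allowed` stands in for the indicator function Φ_{h(t',k)}; the names are ours.

```python
import numpy as np

def decision_distribution(W, state, allowed):
    """Normalized exponential over the decisions permitted by the history:
    W[d] is the weight vector for decision d, `state` is the latent state
    vector (its means, in the approximations), and `allowed` is a boolean
    mask implementing the indicator function."""
    scores = np.where(allowed, np.exp(W @ state), 0.0)
    return scores / scores.sum()

# Example: 5 candidate decisions, of which only 3 are legal here.
W = np.random.randn(5, 8)
state = np.random.rand(8)
allowed = np.array([True, False, True, True, False])
print(decision_distribution(W, state, allowed))
```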

5 Searching for the Best Tree

ISBNs define a probability model which does not make any a priori assumptions of independence between any decision variables. As we discussed in section 4, the use of relations based on the partial output structure makes it possible to take into account statistical interdependencies between decisions closely related in the output structure, but separated by multiple decisions in the input structure. This property leads to exponential complexity of complete search. However, the success of the deterministic parsing strategy which uses the same parsing order (Nivre et al., 2006) suggests that it should be relatively easy to find an accurate approximation to the best parse with heuristic search methods. Unlike (Nivre et al., 2006), we cannot use a look-ahead in our generative model, as was discussed in section 3, so a greedy method is unlikely to lead to a good approximation. Instead we use a pruning strategy similar to that described in (Henderson, 2003), where it was applied to a considerably harder search problem: constituent parsing with a left-corner parsing order.

We apply fixed beam pruning after each decision Shift_{w_j}, because knowledge of the next word in the queue I helps distinguish unlikely decision sequences. We could have used best-first search between Shift_{w_j} operations, but this still leads to relatively expensive computations, especially when the set of dependency relations is large. However, most of the word pairs can participate in only a very limited number of distinct relations. Thus, we pursue only a fixed number of relations r after each Left-Arc_r and Right-Arc_r operation.

Experiments with a variety of post-shift beam widths confirmed that very small validation performance gains are achieved with widths larger than 30, and sometimes even a beam of 5 was sufficient. We also found that allowing only 5 different relations after each dependency prediction operation had virtually no effect on the validation accuracy.
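A schematic version of this pruning regime; the beam width and relation cut-off are the values reported above, while the candidate representation is our own simplification.

```python
import heapq

def prune_beam(candidates, beam_width=30):
    """Keep the `beam_width` most probable partial derivations; applied
    after every Shift decision, once the next word is known. Each candidate
    is a (log_prob, derivation) pair."""
    return heapq.nlargest(beam_width, candidates, key=lambda c: c[0])

def prune_relations(scored_labels, max_relations=5):
    """After a Left-Arc / Right-Arc decision, pursue only the most probable
    `max_relations` dependency labels."""
    return sorted(scored_labels, key=lambda c: c[0], reverse=True)[:max_relations]

# Toy example with log-probabilities.
print(prune_beam([(-1.2, "derivation A"), (-0.4, "derivation B"),
                  (-3.0, "derivation C")], beam_width=2))
print(prune_relations([(-0.1, "SBJ"), (-2.3, "OBJ"), (-4.0, "ATR")],
                      max_relations=2))
```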

6 Empirical Evaluation

In this section we evaluate the ISBN model for dependency parsing on three treebanks from the CoNLL-X shared task. We compare our generative models with the best parsers from the CoNLL-X task, including the SVM-based parser of (Nivre et al., 2006) (the MALT parser), which uses the same parsing algorithm. To test the feature induction abilities of our model we compare results with two feature sets: the feature set tuned individually for each language by (Nivre et al., 2006), and another feature set which includes only obvious local features. This simple feature set comprises only features of the word on top of the stack S and the front word of the queue I. We compare the gain from using tuned features with the similar gain obtained by the MALT parser. To obtain these results we train the MALT parser with the same two feature sets.³

In order to distinguish the contribution of ISBN's feature induction abilities from the contribution of our estimation method and search, we perform another experiment. We use the tuned feature set and disable the feature induction abilities of the model by removing all the edges between latent variable vectors. Comparison of this restricted model with the full ISBN model shows how important the feature induction is. Also, comparison of this restricted model with the MALT parser, which uses the same set of features, indicates whether our generative estimation method and use of beam search is beneficial.

³ The tuned feature sets were obtained from http://w3.msi.vxu.se/~nivre/research/MaltParser.html. We removed look-ahead features for the ISBN experiments but preserved them for the experiments with the MALT parser. Analogously, we extended the simple features with a 3-word look-ahead for the MALT parser experiments.

6.1 Experimental Setup

We used the CoNLL-X distributions of the Danish DDT treebank (Kromann, 2003), the Dutch Alpino treebank (van der Beek et al., 2002) and the Slovene SDT treebank (Džeroski et al., 2006). The choice of these treebanks was motivated by the fact that they are all freely distributed and have very different training set sizes: 195,069 tokens for Dutch, 94,386 tokens for Danish and only 28,750 tokens for Slovene. It is generally believed that discriminative models win over generative models with a large amount of training data, so we expected to see a similar trend in our results. The test sets are about equal in size and contain about 5,000 scoring tokens each.

We followed the experimental setup of the shared task and used all the information provided for the languages: gold standard part-of-speech tags and coarse part-of-speech tags, word form, word lemma (lemma information was not available for Danish) and a set of fine-grained word features. As we explained in section 3, we treated these sets of fine-grained features as an atomic value when predicting a word. However, when conditioning on words, we treated each component of this composite feature individually, as it proved to be useful on the development set. We used frequency cutoffs: we ignored any property (e.g., word form, feature or even part-of-speech tag⁴) which occurs in the training set less than 5 times. Following (Nivre et al., 2006), we used the pseudo-projective transformation they proposed to cast non-projective parsing tasks as projective.

⁴ Part-of-speech tags for multi-word units in the Danish treebank were formed as the concatenation of the tags of the component words, which led to quite a sparse set of part-of-speech tags.

ISBN models were trained using a small development set taken out from the training set, which was used for tuning learning parameters and for early stopping. The sizes of the development sets were: 4,988 tokens for the larger Dutch corpus, 2,504 tokens for Danish and 2,033 tokens for Slovene. The MALT parser was always trained on the entire training set. We expect that the mean field approximation should demonstrate better results than the feed-forward approximation on this task, as is theoretically expected and was confirmed on the constituent parsing task (Titov and Henderson, 2007). However, the sizes of the testing sets would not allow us to perform any conclusive analysis, so we decided not to perform these comparisons here. Instead we used the mean field approximation for the smaller two corpora and the feed-forward approximation for the larger one. Training the mean field approximation on the larger Dutch treebank is feasible, but would significantly reduce the possibilities for tuning the learning parameters on the development set and, thus, would increase the randomness of model comparisons.
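A sketch of the frequency cutoff as we understand it; the unknown-symbol convention is ours.

```python
from collections import Counter

def apply_cutoff(values, min_count=5, unknown="<UNK>"):
    """Replace any property value (word form, feature bundle, tag, ...) seen
    fewer than `min_count` times in the training data with a single unknown
    symbol, so rare values do not each receive their own parameters."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else unknown for v in values]

# Toy example on a list of word forms.
forms = ["the"] * 6 + ["dog"] * 5 + ["xylophone"]
print(apply_cutoff(forms)[-3:])   # ['dog', 'dog', '<UNK>']
```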

All model selection was performed on the development set and a single model of each type was applied to the testing set. We used a state variable vector consisting of 80 binary variables, as it proved sufficient in preliminary experiments. For the MALT parser we replicated the parameters from (Nivre et al., 2006) as described in detail on their web site.

The labeled attachment scores for the ISBN with tuned features (TF) and local features (LF), and for the ISBN with tuned features and no edges connecting latent variable vectors (TF-NA), are presented in table 1, along with the results for the MALT parser with both tuned and local features, the MST parser (McDonald et al., 2006), and the average score (Aver) across all systems in the CoNLL-X shared task. The MST parser is included because it demonstrated the best overall result in the task, non-significantly outperforming the MALT parser, which, in turn, achieved the second best overall result. The labeled attachment score is computed using the same method as in the CoNLL-X shared task, i.e. ignoring punctuation. Note that, though we tried to completely replicate the training of the MALT parser with the tuned features, we obtained slightly different results. The original published results for the MALT parser with tuned features were 84.8% for Danish, 78.6% for Dutch and 70.3% for Slovene. The improvement of the ISBN models (TF and LF) over the MALT parser is statistically significant for Dutch and Slovene. Differences between their results on Danish are not statistically significant.

             Danish   Dutch   Slovene
ISBN  TF       85.0    79.6      72.9
      LF       84.5    79.5      72.4
      TF-NA    83.5    76.4      71.7
MALT  TF       85.1    78.2      70.5
      LF       79.8    74.5      66.8
MST            84.8    79.2      73.4
Aver           78.3    70.7      65.2

Table 1: Labeled attachment score on the testing sets of the Danish, Dutch and Slovene treebanks.
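For reference, a sketch of the labeled attachment metric as we understand it from the shared task setup; the punctuation filter here is a crude simplification of ours.

```python
def labeled_attachment_score(gold, predicted):
    """Fraction of scoring (non-punctuation) tokens whose predicted head and
    dependency label both match the gold standard. Each sequence holds
    (form, head, label) triples, one per token, in the same order."""
    scored = correct = 0
    for (form, g_head, g_label), (_, p_head, p_label) in zip(gold, predicted):
        if all(not ch.isalnum() for ch in form):   # crude punctuation filter
            continue
        scored += 1
        correct += (g_head == p_head and g_label == p_label)
    return correct / scored

gold = [("A", 2, "det"), ("dog", 3, "sbj"), ("runs", 0, "root"), (".", 3, "punct")]
pred = [("A", 2, "det"), ("dog", 3, "obj"), ("runs", 0, "root"), (".", 3, "punct")]
print(labeled_attachment_score(gold, pred))   # 2 of 3 scoring tokens correct
```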

6.2 Discussion of Results

The ISBN with tuned features (TF) achieved significantly better accuracy than the MALT parser on 2 languages (Dutch and Slovene), and demonstrated essentially the same accuracy on Danish. The results of the ISBN are among the two top published results on all three languages, including the best published results on Dutch. All three models, MST, MALT and ISBN, demonstrate much better results than the average result in the CoNLL-X shared task. These results suggest that our generative model is quite competitive with respect to the best models, which are both discriminative.⁵ We would expect further improvement of ISBN results if we applied discriminative retraining (Henderson, 2004) or reranking with data-defined kernels (Henderson and Titov, 2005), even without introduction of any additional features.

We can see that the ISBN parser achieves about the same results with local features (LF). Local features by themselves are definitely not sufficient for the construction of accurate models, as seen from the results of the MALT parser with local features (and look-ahead). This result demonstrates that ISBNs are a powerful model for feature induction.

The results of the ISBN without edges connecting latent state vectors are slightly surprising and suggest that without feature induction the ISBN is significantly worse than the best models. This shows that the improvement is coming mostly from the ability of the ISBN to induce complex features, and not from either using beam search or from the estimation procedure. It might also suggest that generative models are probably worse for the dependency parsing task than discriminative approaches (at least for larger datasets). This motivates further research into methods which combine powerful feature induction properties with the advantages of discriminative training. Although discriminative reranking of the generative model is likely to help, the derivation of fully discriminative feature induction methods is certainly more challenging.

⁵ Note that the development set accuracy predicted correctly the testing set ranking of the ISBN TF, LF and TF-NA models on each of the datasets, so it is fair to compare the best ISBN result among the three with the other parsers.

            to root      1      2    3-6     >6
Da  ISBN       95.1   95.7   90.1   84.1   74.7
    MALT       95.4   96.0   90.8   84.0   71.6
Du  ISBN       79.8   92.4   86.2   81.4   71.1
    MALT       73.1   91.9   85.0   76.2   64.3
Sl  ISBN       76.1   92.5   85.6   79.6   54.3
    MALT       59.9   92.1   85.0   78.4   47.1
Av  ISBN       83.6   93.5   87.3   81.7   66.7
    MALT       76.2   93.3   87.0   79.5   61.0
    Improv      7.5    0.2    0.4    2.2    5.7

Table 2: F1 score of labeled attachment as a function of dependency length on the testing sets of Danish, Dutch and Slovene.

In order to better understand the differences in performance between the ISBN and MALT, we analyzed how relation accuracy changes with the length of the head-dependent relation. The harmonic mean of precision and recall of labeled attachment, the F1 measure, for the ISBN and MALT parsers with tuned features is presented in table 2. The F1 score is computed for four different ranges of lengths and for attachments directly to root. Along with the results for each of the languages, the table includes their mean (Av) and the absolute improvement of the ISBN model over MALT (Improv). It is easy to see that the accuracy of both models is generally similar for small distances (1 and 2), but as the distance grows the ISBN parser starts to significantly outperform MALT, achieving a 5.7% average improvement on dependencies longer than 6 word tokens. When the MALT parser does not manage to recover a long dependency, the highest scoring action it can choose is to reduce the dependent from the stack without specifying its head, thereby attaching the dependent to the root by default. This explains the relatively low F1 scores for attachments to root (evident for Dutch and Slovene): though recall of attachment to root is comparable to that of the ISBN parser (82.4% for MALT against 84.2% for ISBN, on average over the 3 languages), precision for the MALT parser is much worse (71.5% for MALT against 83.1% for ISBN, on average).

The considerably worse accuracy of the MALT parser on longer dependencies might be explained both by the use of a non-greedy search method in the ISBN and by the ability of ISBNs to induce history features. To capture a long dependency, the MALT parser has to keep a word on the stack during a long sequence of decisions. If at any point during the intermediate steps this choice seems not to be locally optimal, then the MALT parser will choose the alternative and lose the possibility of the long dependency.⁶ By using a beam search, the ISBN parser can maintain the possibility of the long dependency in its beam even when other alternatives seem locally preferable. Also, long dependencies are often more difficult, and may be systematically different from local dependencies. The designer of a MALT parser needs to discover predictive features for long dependencies by hand, whereas the ISBN model can discover them automatically. Thus we expect that the feature induction abilities of ISBNs have a strong effect on the accuracy of long dependencies. This prediction is confirmed by the differences between the results of the normal ISBN (TF) and the restricted ISBN (TF-NA) model. The TF-NA model, like the MALT parser, is biased toward attachment to root; it attaches to root 12.0% more words on average than the normal ISBN, without any improvement of recall and with a great loss of precision. The F1 score on long dependencies for the TF-NA model is also negatively affected in the same way as for the MALT parser. This confirms that the ability of the ISBN model to induce features is a major factor in improving the accuracy of long dependencies.

⁶ The MALT parser is trained to keep the word as long as possible: if both Shift and Reduce decisions are possible during training, it always prefers to shift. Though this strategy should generally reduce the described problem, it is evident from the low precision score for attachment to root that it cannot completely eliminate it.

7 Related Work

There has not been much previous work on latent variable models for dependency parsing. Dependency parsing with Dynamic Bayesian Networks was considered in (Peshkin and Savova, 2005), with limited success. Roughly, the model considered the whole sentence at a time, with the DBN being used to decide which words correspond to leaves of the tree. The chosen words are then removed from the sentence and the model is recursively applied to the reduced sentence. Recently several latent variable models for constituent parsing have been proposed (Koo and Collins, 2005; Matsuzaki et al., 2005; Prescher, 2005; Riezler et al., 2002). In (Matsuzaki et al., 2005) non-terminals in a standard PCFG model are augmented with latent variables. A similar model of (Prescher, 2005) uses a head-driven PCFG with latent heads, thus restricting the flexibility of the latent-variable model by using explicit linguistic constraints. While the model of (Matsuzaki et al., 2005) significantly outperforms the constrained model of (Prescher, 2005), they both are well below the state-of-the-art in constituent parsing. In (Koo and Collins, 2005), an undirected graphical model for constituent parse reranking uses dependency relations to define the edges. Thus, it should be easy to apply a similar method to reranking dependency trees.

Undirected graphical models, in particular Conditional Random Fields, are the standard tools for shallow parsing (Sha and Pereira, 2003). However, shallow parsing is effectively a sequence labeling problem and therefore differs significantly from full parsing. As discussed in (Titov and Henderson, 2007), undirected graphical models do not seem to be suitable for history-based parsing models.

Sigmoid Belief Networks (SBNs) were used originally for character recognition tasks, but later a dynamic modification of this model was applied to the reinforcement learning task (Sallans, 2002). However, their graphical model, approximation method, and learning method differ significantly from those of this paper. The extension of dynamic SBNs with incrementally specified model structure (i.e. Incremental Sigmoid Belief Networks, used in this paper) was proposed and applied to constituent parsing in (Titov and Henderson, 2007).


8 Conclusions

We proposed a latent variable dependency parsing model based on Incremental Sigmoid Belief Networks. Unlike state-of-the-art dependency parsers, it uses a generative history-based model. We demonstrated that it achieves state-of-the-art results on a selection of languages from the CoNLL-X shared task. The parser uses a vector of latent variables to represent an intermediate state and uses relations defined on the output structure to construct the edges between latent state vectors. These properties make it a powerful feature induction method for dependency parsing, and it achieves competitive results even with very simple explicit features. The ISBN model is especially accurate at modeling long dependencies, achieving an average improvement of 5.7% over the state-of-the-art baseline on dependencies longer than 6 words. Empirical evaluation demonstrates that competitive results are achieved mostly because of the ability of the model to induce complex features and not because of the use of a generative probability model or a specific search method. As with other generative models, it can be further improved by the application of discriminative reranking techniques. Discriminative methods are likely to allow it to significantly improve over the current state-of-the-art in dependency parsing.⁷

⁷ The ISBN dependency parser will soon be made downloadable from the authors' web page.

Acknowledgments

This work was funded by Swiss NSF grant 200020-109685, UK EPSRC grant EP/E019501/1, and EU FP6 grant 507802 for project TALK. We thank Joakim Nivre and Sandra Kübler for an excellent tutorial on dependency parsing given at COLING-ACL 2006.

References

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1986. Compilers: Principles, Techniques and Tools. Addison Wesley.

Léon Bottou. 1991. Une approche théorique de l'apprentissage connexionniste: Applications à la reconnaissance de la parole. Ph.D. thesis, Université de Paris XI, Paris, France.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proc. of the Tenth Conference on Computational Natural Language Learning, New York, USA.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proc. 43rd Meeting of Association for Computational Linguistics, pages 173–180, Ann Arbor, MI.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proc. 1st Meeting of North American Chapter of Association for Computational Linguistics, pages 132–139, Seattle, Washington.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.

Michael Collins. 2000. Discriminative reranking for natural language parsing. In Proc. 17th Int. Conf. on Machine Learning, pages 175–182, Stanford, CA.

S. Džeroski, T. Erjavec, N. Ledinek, P. Pajas, Z. Žabokrtský, and A. Žele. 2006. Towards a Slovene dependency treebank. In Proc. Int. Conf. on Language Resources and Evaluation (LREC), Genoa, Italy.

James Henderson and Ivan Titov. 2005. Data-defined kernels for parse reranking derived from probabilistic models. In Proc. 43rd Meeting of Association for Computational Linguistics, Ann Arbor, MI.

James Henderson. 2003. Inducing history representations for broad coverage statistical parsing. In Proc. joint meeting of North American Chapter of the Association for Computational Linguistics and the Human Language Technology Conf., pages 103–110, Edmonton, Canada.

James Henderson. 2004. Discriminative training of a neural network statistical parser. In Proc. 42nd Meeting of Association for Computational Linguistics, Barcelona, Spain.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. In Michael I. Jordan, editor, Learning in Graphical Models. MIT Press, Cambridge, MA.

Terry Koo and Michael Collins. 2005. Hidden-variable models for discriminative reranking. In Proc. Conf. on Empirical Methods in Natural Language Processing, Vancouver, B.C., Canada.

Matthias T. Kromann. 2003. The Danish dependency treebank and the underlying linguistic theory. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories (TLT), Växjö, Sweden.


Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. 2005. Probabilistic CFG with latent annotations. In Proceedings of the 43rd Annual Meeting of the ACL, Ann Arbor, MI.

Ryan McDonald, Kevin Lerman, and Fernando Pereira. 2006. Multilingual dependency analysis with a two-stage discriminative parser. In Proc. of the Tenth Conference on Computational Natural Language Learning, New York, USA.

Kevin P. Murphy. 2002. Dynamic Belief Networks: Representation, Inference and Learning. Ph.D. thesis, University of California, Berkeley, CA.

Radford Neal. 1992. Connectionist learning of belief networks. Artificial Intelligence, 56:71–113.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proc. of the Eighth Conference on Computational Natural Language Learning, pages 49–56, Boston, USA.

Joakim Nivre, Johan Hall, Jens Nilsson, Gülşen Eryiğit, and Svetoslav Marinov. 2006. Pseudo-projective dependency parsing with support vector machines. In Proc. of the Tenth Conference on Computational Natural Language Learning, pages 221–225, New York, USA.

Leon Peshkin and Virginia Savova. 2005. Dependency parsing with dynamic Bayesian network. In AAAI, 20th National Conference on Artificial Intelligence, Pittsburgh, Pennsylvania.

Detlef Prescher. 2005. Head-driven PCFGs with latent-head statistics. In Proc. 9th Int. Workshop on Parsing Technologies, Vancouver, Canada.

Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell, and Mark Johnson. 2002. Parsing the Wall Street Journal using a Lexical-Functional Grammar and discriminative estimation techniques. In Proc. 40th Meeting of Association for Computational Linguistics, Philadelphia, PA.

Brian Sallans. 2002. Reinforcement Learning for Factored Markov Decision Processes. Ph.D. thesis, University of Toronto, Toronto, Canada.

Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proc. joint meeting of North American Chapter of the Association for Computational Linguistics and the Human Language Technology Conf., Edmonton, Canada.

Ivan Titov and James Henderson. 2007. Constituent parsing with incremental sigmoid belief networks. In Proc. 45th Meeting of Association for Computational Linguistics, Prague, Czech Republic.

L. van der Beek, G. Bouma, J. Daciuk, T. Gaustad, R. Malouf, G. van Noord, R. Prins, and B. Villada. 2002. The Alpino dependency treebank. Computational Linguistics in the Netherlands (CLIN).
