
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 232–241, Sydney, July 2006. © 2006 Association for Computational Linguistics

A Discriminative Model for Tree-to-Tree Translation

Brooke Cowan, MIT CSAIL
[email protected]

Ivona Kučerová, MIT Linguistics Department
[email protected]

Michael Collins, MIT CSAIL
[email protected]

Abstract

This paper proposes a statistical, tree-to-tree model for producing translations. Two main contributions are as follows: (1) a method for the extraction of syntactic structures with alignment information from a parallel corpus of translations, and (2) use of a discriminative, feature-based model for prediction of these target-language syntactic structures, which we call aligned extended projections, or AEPs. An evaluation of the method on translation from German to English shows similar performance to the phrase-based model of Koehn et al. (2003).

1 Introduction

Phrase-based approaches (Och and Ney, 2004) to statistical machine translation (SMT) have recently achieved impressive results, leading to significant improvements in accuracy over the original IBM models (Brown et al., 1993). However, phrase-based models lack a direct representation of syntactic information in the source or target languages; this has prompted several researchers to consider various approaches that make use of syntactic information.

This paper describes a framework for tree-to-tree based statistical translation. Our goal is to learn a model that maps parse trees in the source language to parse trees in the target language. The model is learned from a corpus of translation pairs, where each sentence in the source or target language has an associated parse tree. We see two major benefits of tree-to-tree based translation. First, it is possible to explicitly model the syntax of the target language, thereby improving grammaticality. Second, we can build a detailed model of the correspondence between the source and target parse trees, with the aim of constructing translations that preserve the meaning of source-language sentences.

Our translation framework involves a process where the source-language parse tree is broken down into a sequence of clauses, and each clause is then translated separately. A central concept we introduce in the translation of clauses is that of an aligned extended projection (AEP). AEPs are derived from the concept of an extended projection in lexicalized tree adjoining grammars (LTAG) (Frank, 2002), with the addition of alignment information that is based on work in synchronous LTAG (Shieber and Schabes, 1990). A key contribution of this paper is a method for learning to map German clauses to AEPs using a feature-based model with a perceptron learning algorithm.

We performed experiments on translation from German to English on the Europarl data set. Evaluation in terms of both BLEU scores and human judgments shows that our system performs similarly to the phrase-based model of Koehn et al. (2003).

1.1 A Sketch of the Approach

This section provides an overview of the translation process. We will use the German sentence "wir wissen daß das haupthemmnis der vorhersehbare widerstand der hersteller war" as a running example. For this example we take the desired translation to be "we know that the main obstacle has been the predictable resistance of manufacturers".

Translation of a German sentence proceeds in the following four steps:

Step 1: The German sentence is parsed and then broken down into separate parse structures for a sequence of clauses. For example, the German example above is broken into a parse structure for the clause "wir wissen" followed by a parse structure for the subordinate clause "daß ... war". Each of these clauses is then translated separately, using steps 2–3 below.

Step 2: An aligned extended projection (AEP) is predicted for each German clause. To illustrate this step, consider translation of the second German clause, which has the following parse structure:


s-oc
  kous-cp daß
  np-sb [1]
    art das
    nn haupthemmnis
  np-pd [2]
    art der
    adja vorhersehbare
    nn widerstand
    np-ag
      art der
      nn hersteller
  vafin-hd war

Note that we use the indices [1] and [2] to identify the two modifiers (arguments or adjuncts) in the clause, in this case a subject and an object.

A major part of the AEP is a parse-tree fragment that is similar to a TAG elementary tree (see also Figure 2):

SBAR
  that
  S
    NP
    VP
      V has
      VP
        V been
        NP

Following the work of Frank (2002), we will refer to a structure like this as an extended projection (EP). The EP encapsulates the core syntactic structure in the English clause. It contains the main verb been, as well as the function words that and has. It also contains a parse tree "spine" which has the main verb been as one of its leaves, and has the clause label SBAR as its root. In addition, it specifies positions for arguments in the clause, in this case NPs corresponding to the subject and object.

An AEP contains an EP, as well as alignment information about where the German modifiers should be placed in the extended projection. For example, the AEP in this case would contain the tree fragment shown above, together with an alignment specifying that the modifiers [1] and [2] from the German parse will appear in the EP as subject and object, respectively.

Step 3: The German modifiers are translated and placed in the appropriate positions within the AEP. For example, the modifiers "das haupthemmnis" and "der vorhersehbare widerstand der hersteller" would be translated as "the main obstacle" and "the predictable resistance of manufacturers", respectively, and then placed into the subject and object positions in the AEP.

Step 4: The individual clause translations are combined to give a final translation. For example, the translations "we know" and "that the main obstacle has been ..." would be concatenated to give "we know that the main obstacle has been ...".

The main focus of this paper will be Step 2: the prediction of AEPs from German clauses. AEPs are detailed structural objects, and their relationship to the source-language clause can be quite complex. We use a discriminative feature-based model, trained with the perceptron algorithm, to incrementally predict the AEP in a sequence of steps. At each step we define features that allow the model to capture a wide variety of dependencies within the AEP itself, or between the AEP and the source-language clause.

1.2 Motivation for the Approach

Our approach to tree-to-tree translation is motivated by several observations. Breaking the source-language tree into clauses (Step 1) considerably simplifies the difficult problem of defining an alignment between source and target trees. Our impression is that high-quality translations can be produced in a clause-by-clause fashion.[1] The use of a feature-based model for AEP prediction (Step 2) allows us to capture complex syntactic correspondences between English and German, as well as grammaticality constraints on the English side.

In this paper, we implement the translation of modifiers (Step 3) with the phrase-based system of Koehn et al. (2003). The modifiers in our data set are generally small chunks of text such as NPs, PPs, and ADJPs, which by definition do not include clauses or verbs. In our approach, we use the phrase-based system to generate n-best lists of candidate translations and then rerank the translations based on grammaticality, i.e., using criteria that judge how well they fit the position in the AEP. In future work, we might use finite state machines in place of a reranking approach, or recursively apply the AEP approach to the modifiers.

Stitching translated clauses back together (Step 4) is a relatively simple task: in a substantial majority of cases, the German clauses are not embedded, but instead form a linear sequence that accounts for the entire sentence. In these cases we can simply concatenate the English clause translations to form the full translation. Embedded clauses in German are slightly more complicated, but it is not difficult to form embedded structures in the English translations.

[Footnote 1: Note that we do not assume that all of the translations in the training data have been produced in a clause-by-clause fashion. Rather, we assume that good translations for test examples can be produced in this way.]

Section 5.2 of this paper describes the features we use for AEP prediction in translation from German to English. Many of the features of the AEP prediction model are specifically tuned to the choice of German and English as the source and target languages. However, it should be easy to develop new feature sets to deal with other languages or treebanking styles. We see this as one of the strengths of the feature-based approach.

In the work presented in this paper, we focus on the prediction of clausal AEPs, i.e., AEPs associated with main verbs. One reason for this is that clause structures are particularly rich and complex from a syntactic perspective. This means that there should be considerable potential in improving translation quality if we can accurately predict these structures. It also means that clause-level AEPs are a good test-bed for the discriminative approach to AEP prediction; future work may consider applying these methods to other structures such as NPs, PPs, ADJPs, and so on.

2 Related Work

There has been a substantial amount of previous work on approaches that make use of syntactic information in statistical machine translation. Wu (1997) and Alshawi (1996) describe early work on formalisms that make use of transductive grammars; Graehl and Knight (2004) describe methods for training tree transducers. Melamed (2004) establishes a theoretical framework for generalized synchronous parsing and translation. Eisner (2003) discusses methods for learning synchronized elementary tree pairs from a parallel corpus of parsed sentences. Chiang (2005) has recently shown significant improvements in translation accuracy, using synchronous grammars. Riezler and Maxwell (2006) describe a method for learning a probabilistic model that maps LFG parse structures in German into LFG parse structures in English.

Yamada and Knight (2001) and Galley et al. (2004) describe methods that make use of syntactic information in the target language alone; Quirk et al. (2005) describe similar methods that make use of dependency representations. Syntactic parsers in the target language have been used as language models in translation, giving some improvement in accuracy (Charniak et al., 2001). The work of Gildea (2003) involves methods that make use of syntactic information in both the source and target languages.


EP for the verb know:
S
  NP-A
  VP
    V know
    SBAR-A

EP for the verb been:
SBAR-A
  IN that
  S
    NP-A
    VP
      V has
      VP
        V been
        NP-A

EP for the noun obstacle:
NP
  D the
  N obstacle

Figure 1: Extended projections for the verbs know and been, and for the noun obstacle. The EPs were taken from the parse tree for the sentence "We know that the main obstacle has been the predictable resistance of manufacturers."

Other work has attempted to incorporate syntactic information through reranking approaches applied to n-best output from phrase-based systems (Och et al., 2004). Another class of approaches has shown improvements in translation through reordering, where source-language strings are parsed and then reordered, in an attempt to recover a word order that is closer to the target language (Collins et al., 2005; Xia and McCord, 2004).

Our approach is closely related to previous work on synchronous tree adjoining grammars (Shieber and Schabes, 1990; Shieber, 2004), and the work on TAG approaches to syntax described by Frank (2002). A major departure from previous work on synchronous TAGs is in our use of a discriminative model that incrementally predicts the information in the AEP. Note also that our model may include features that take into account any part of the German clause.

3 A Translation Architecture Based on Aligned Extended Projections

3.1 Background: Extended Projections (EPs)

Extended projections (EPs) play a crucial role in the lexicalized tree adjoining grammar (LTAG) (Joshi, 1985) approach to syntax described by Frank (2002). In this paper we focus almost exclusively on extended projections associated with main verbs; note, however, that EPs are typically associated with all content words (nouns, adjectives, etc.). As an example, a parse tree for the sentence "we know that the main obstacle has been the predictable resistance of manufacturers" would make use of EPs for the words we, know, main, obstacle, been, predictable, resistance, and manufacturers. Function words (in this sentence that, the, has, and of) do not have EPs; instead, as we describe shortly, each function word is incorporated in an EP of some content word.

Figure 1 has examples of EPs. Each one is an LTAG elementary tree which contains a single content word as one of its leaves. Substitution nodes (such as NP-A or SBAR-A) in the elementary trees specify the positions of arguments of the content words. Each EP may contain one or more function words that are associated with the content word. For verbs, these function words include items such as modal verbs and auxiliaries (e.g., should and has); complementizers (e.g., that); and wh-words (e.g., which). For nouns, function words include determiners and prepositions.

Elementary trees corresponding to EPs form the basic units in the LTAG approach described by Frank (2002). They are combined to form a full parse tree for a sentence using the TAG operations of substitution and adjunction. For example, the EP for been in Figure 1 can be substituted into the SBAR-A position in the EP for know; the EP for obstacle can be substituted into the subject position of the EP for been.

3.2 Aligned Extended Projections (AEPs)

We now build on the idea of extended projections to give a detailed description of AEPs. Figure 2 shows examples of German clauses paired with the AEPs found in training data.[2] The German clause is assumed to have n (where n ≥ 0) modifiers. For example, the first German parse in Figure 2 has two arguments, indexed as [1] and [2]. Each of these modifiers must either have a translation in the corresponding English clause, or must be deleted.

[Footnote 2: Note that in this paper we consider translation from German to English; in the remainder of the paper we take English to be synonymous with the target language in translation and German to be synonymous with the source language.]

An AEP consists of the following parts (a code sketch of the full record follows the list):

STEM: A string specifying the stemmed form of the main verb in the clause.

SPINE: A syntactic structure associated with the main verb. The structure has the symbol V as one of its leaf nodes; this is the position of the main verb. It includes higher projections of the verb such as VPs, Ss, and SBARs. It also includes leaf nodes NP-A in positions corresponding to noun-phrase arguments (e.g., the subject or object) of the main verb. In addition, it may contain leaf nodes labeled with categories such as WHNP or WHADVP where a wh-phrase may be placed. It may include leaf nodes corresponding to one or more complementizers (common examples being that, if, so that, and so on).

VOICE: One of two alternatives, active or passive, specifying the voice of the main verb.

SUBJECT: This variable can be one of three types. If there is no subject position in the SPINE variable, then the value for SUBJECT is NULL. Otherwise, SUBJECT can either be a string, for example there,[3] or an index of one of the n modifiers in the German clause.

OBJECT: This variable is similar to SUBJECT, and can also take three types: NULL, a specific string, or an index of one of the n German modifiers. It is always NULL if there is no object position in the SPINE; it can never be a modifier index that has already been assigned to SUBJECT.

WH: This variable is always NULL if there is no wh-phrase position within the SPINE; it is always a non-empty string (such as which, or in which) if a wh-phrase position does exist.

MODALS: This is a string of verbs that constitute the modals that appear within the clause. We use NULL to signify an absence of modals.

INFL: The inflected form of the verb.

MOD(i): There are n modifier variables MOD(1), MOD(2), ..., MOD(n) that specify the positions for German arguments that have not already been assigned to the SUBJECT or OBJECT positions in the spine. Each variable MOD(i) can take one of six possible values:

• null: This value is chosen if and only if the modifier has already been assigned to the subject or object position.

• deleted: This means that a translation of the i'th German modifier is not present in the English clause.

• pre-sub: The modifier appears after any complementizers or wh-phrases, but before the subject of the English clause.

• post-sub: The modifier appears after the subject of the English clause, but before the modals.

• in-modals: The modifier appears after the first modal in the sequence of modals, but before the second modal or the main verb.

• post-verb: The modifier appears somewhere after the main verb.

[Footnote 3: This happens in the case where there exists a subject in the English clause which is not aligned to a modifier in the German clause. See, for instance, the second example in Figure 2.]
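To make the inventory of variables concrete, the following is a minimal sketch of an AEP as a record type, written in Python. The field names mirror the variables defined above; the class itself, the bracketed string encoding of the spine, and the validity checks are our own illustration, not a data structure from the paper.

    from dataclasses import dataclass, field
    from typing import List, Optional, Union

    # The six placement values a MOD(i) variable may take (see the list above).
    MOD_VALUES = {"null", "deleted", "pre-sub", "post-sub", "in-modals", "post-verb"}

    @dataclass
    class AEP:
        """One aligned extended projection (section 3.2). SUBJECT and OBJECT
        are NULL (None), a literal string such as "there", or a 1-based
        index of a modifier in the German clause."""
        stem: str                        # STEM: stemmed main verb, e.g. "be"
        spine: str                       # SPINE, e.g. "SBAR-A(IN(that) S(NP-A VP(V NP-A)))"
        voice: str                       # VOICE: "active" or "passive"
        subject: Union[None, str, int]   # SUBJECT
        obj: Union[None, str, int]       # OBJECT
        wh: Optional[str]                # WH, e.g. "which"
        modals: Optional[str]            # MODALS, e.g. "has"
        infl: str                        # INFL: inflected main verb, e.g. "been"
        mods: List[str] = field(default_factory=list)  # MOD(1)..MOD(n) placements

        def __post_init__(self):
            assert self.voice in ("active", "passive")
            assert all(m in MOD_VALUES for m in self.mods)

    # The first training example of Figure 2, transcribed:
    example = AEP(stem="be", spine="SBAR-A(IN(that) S(NP-A VP(V NP-A)))",
                  voice="active", subject=1, obj=2, wh=None,
                  modals="has", infl="been", mods=["null", "null"])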


Example 1. German clause:

s-oc
  kous-cp daß
  np-sb [1]
    art das
    nn haupthemmnis
  np-pd [2]
    art der
    adja vorhersehbare
    nn widerstand
    np-ag
      art der
      nn hersteller
  vafin-hd war

Paraphrase: that [np-sb the main obstacle] [np-pd the predictable resistance of manufacturers] was

English AEP:
  STEM: be
  SPINE: SBAR-A(IN(that) S(NP-A VP(V NP-A)))
  VOICE: active
  SUBJECT: 1
  OBJECT: 2
  WH: NULL
  MODALS: has
  INFL: been
  MOD1: null
  MOD2: null

Example 2. German clause:

s
  pp-mo [1]
    appr zwischen
    piat beiden
    nn gesetzen
  vvfin-hd bestehen
  adv-mo [2] also
  np-sb [3]
    adja erhebliche
    adja rechtliche
    $, ,
    adja praktische
    kon und
    adja wirtschaftliche
    nn unterschiede

Paraphrase: [pp-mo between the two pieces of legislation] exist so [np-sb significant legal, practical and economic differences]

English AEP:
  STEM: be
  SPINE: S(NP-A VP(V NP-A))
  VOICE: active
  SUBJECT: "there"
  OBJECT: 3
  WH: NULL
  MODALS: NULL
  INFL: are
  MOD1: post-verb
  MOD2: pre-sub
  MOD3: null

Example 3. German clause:

s-rc
  prels-sb die
  vp
    pp-mo [1]
      appr an
      pdat jenem
      nn tag
    pp-mo [2]
      appr in
      ne tschernobyl
    vvpp-hd gezündet
  vafin-hd wurde

Paraphrase: which [pp-mo on that day] [pp-mo in chernobyl] released were

English AEP:
  STEM: release
  SPINE: SBAR(WHNP SG-A(VP(V)))
  VOICE: passive
  SUBJECT: NULL
  OBJECT: NULL
  WH: which
  MODALS: was
  INFL: released
  MOD1: post-verb
  MOD2: post-verb

Figure 2: Three examples of German parse trees, together with their aligned extended projections (AEPs) in the training data. Note that in the second example the correspondence between the German clause and its English translation is not entirely direct. The subject in the English is the expletive there; the subject in the German clause becomes the object in English. This is a typical pattern for the German verb bestehen. The German PP "zwischen ..." appears at the start of the clause in German, but is post-verbal in the English. The modifier also (whose English translation is so) is in an intermediate position in the German clause, but appears in the pre-subject position in the English clause.

4 Extracting AEPs from a Corpus

A crucial step in our approach is the extraction of training examples from a translation corpus. Each training example consists of a German clause paired with an English AEP (see Figure 2).

In our experiments, we used the Europarl corpus (Koehn, 2005). For each sentence pair from this data, we used a version of the German parser described by Dubey (2005) to parse the German component, and a version of the English parser described by Collins (1999) to parse the English component. To extract AEPs, we perform the following steps:

NP and PP Alignment To align NPs and PPs, first all German and English nouns, personal and possessive pronouns, numbers, and adjectives are identified in each sentence and aligned using GIZA++ (Och and Ney, 2003). Next, each NP in an English tree is aligned to an NP or PP in the corresponding German tree in a way that is consistent with the word-alignment information. That is, the words dominated by the English node must be aligned only to words dominated by the German node, and vice versa. Note that if there is more than one German node that is consistent, then the one rooted at the minimal subtree is selected.
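The consistency requirement and the minimal-subtree tie-break can be stated compactly. The sketch below is a hypothetical Python rendering: nodes are represented as (label, set of word positions dominated) and align is a set of (English position, German position) links from GIZA++; the paper does not specify its data structures, so these representations are our own.

    def consistent(e_span, g_span, align):
        # Every alignment link that touches one side must stay inside the other.
        for (e, g) in align:
            if (e in e_span) != (g in g_span):
                return False
        return True

    def align_english_np(e_span, german_nodes, align):
        # Align an English NP to a consistent German NP or PP node.
        candidates = [(label, g_span) for (label, g_span) in german_nodes
                      if label.startswith(("np", "pp"))
                      and consistent(e_span, g_span, align)]
        if not candidates:
            return None
        # More than one consistent node: the minimal subtree wins.
        return min(candidates, key=lambda c: len(c[1]))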

Clause Alignment and AEP Extraction The next step in the training process is to identify German/English clause pairs which are translations of each other. We first break each English or German parse tree into a set of clauses; see Appendix A for a description of how we identify clauses. We retain only those training examples where the English and German sentences have the same number of clauses. For these retained examples, define the English sentence to contain the clause sequence ⟨e_1, e_2, ..., e_n⟩, and the German sentence to contain the clause sequence ⟨g_1, g_2, ..., g_n⟩. The clauses are ordered according to the position of their main verbs in the original sentence. We create n candidate pairs ⟨(e_1, g_1), (e_2, g_2), ..., (e_n, g_n)⟩ (i.e., we force a one-to-one correspondence between the two clause sequences). We then discard any clause pair (e, g) which is inconsistent with the NP/PP alignments for that sentence.[4]

[Footnote 4: A clause pair is inconsistent with the NP/PP alignments if it contains an NP/PP on either the German or English side which is aligned to another NP/PP which is not within the clause pair.]
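Under the one-to-one assumption, clause pairing reduces to pairing the i'th clauses and filtering by the consistency test of footnote 4. A minimal sketch, assuming clauses and NPs/PPs are represented as sets of word positions (our simplification, continuing the representation of the previous sketch):

    def clause_pairs(e_clauses, g_clauses, np_links):
        # np_links: (english_np, german_np) span pairs from the NP/PP alignment step.
        if len(e_clauses) != len(g_clauses):
            return []                           # discard: differing clause counts
        pairs = []
        for e, g in zip(e_clauses, g_clauses):  # forced 1-to-1 correspondence
            # Consistent iff each aligned NP/PP pair lies inside both clauses
            # of the pair, or inside neither of them.
            if all((e_np <= e) == (g_np <= g) for (e_np, g_np) in np_links):
                pairs.append((e, g))
        return pairs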


Note that this method is deliberately conservative (i.e., high precision, but lower recall), in that it discards sentence pairs where the English/German sentences have different numbers of clauses. In practice, we have found that the method yields a large number of training examples, and that these training examples are of relatively high quality. Future work may consider improved methods for identifying clause pairs, for example methods that make use of labeled training examples.

An AEP can then be extracted from each clause pair. The EP for the English clause is first extracted, giving values for all variables except for SUBJECT, OBJECT, and MOD(1), ..., MOD(n). The values for the SUBJECT, OBJECT, and MOD(i) variables are derived from the alignments between NPs/PPs, and an alignment of other modifiers (ADVPs, ADJPs, etc.) derived from GIZA++ alignments. If the English clause has a subject or object which is not aligned to a German modifier, then the value for SUBJECT or OBJECT is taken to be the full English string.

5 The Model

5.1 Beam search and the perceptron

In this section we describe linear history-based models with beam search, and the perceptron algorithm for learning in these models. These methods will form the basis for our model that maps German clauses to AEPs.

We have a training set of n examples (x_i, y_i) for i = 1 ... n, where each x_i is a German parse tree, and each y_i is an AEP. We follow previous work on history-based models, by representing each y_i as a series of N decisions ⟨d_1, d_2, ..., d_N⟩. In our approach, N will be a fixed number for any input x: we take the N decisions to correspond to the sequence of variables STEM, SPINE, ..., MOD(1), MOD(2), ..., MOD(n) described in section 3. Each d_i is a member of a set D_i which specifies the set of allowable decisions at the i'th point (for example, D_2 would be the set of all possible values for SPINE). We assume a function ADVANCE(x, ⟨d_1, d_2, ..., d_{i-1}⟩) which maps an input x together with a prefix of decisions d_1 ... d_{i-1} to a subset of D_i. ADVANCE is a function that specifies which decisions are allowable for a past history ⟨d_1, ..., d_{i-1}⟩ and an input x. In our case the ADVANCE function implements hard constraints on AEPs (for example, the constraint that the SUBJECT variable must be NULL if no subject position exists in the SPINE). For any input x, a well-formed decision sequence for x is a sequence ⟨d_1, ..., d_N⟩ such that for i = 1 ... N, d_i ∈ ADVANCE(x, ⟨d_1, ..., d_{i-1}⟩). We define GEN(x) to be the set of all decision sequences (or AEPs) which are well-formed for x.

The model that we will use is a discriminatively-trained, feature-based model. A significant advantage to feature-based models is their flexibility: it is very easy to sensitize the model to dependencies in the data by encoding new features. To define a feature-based model, we assume a function φ(x, ⟨d_1, ..., d_{i-1}⟩, d_i) ∈ R^d which maps a decision d_i in context (x, ⟨d_1, ..., d_{i-1}⟩) to a feature vector. We also assume a vector α ∈ R^d of parameter values. We define the score for any partial or complete decision sequence y = ⟨d_1, d_2, ..., d_m⟩ paired with x as:

SCORE(x, y) = Φ(x, y) · α    (1)

where Φ(x, y) = Σ_{i=1}^{m} φ(x, ⟨d_1, ..., d_{i-1}⟩, d_i). In particular, given the definitions above, the output structure F(x) for an input x is the highest-scoring well-formed structure for x:

F(x) = argmax_{y ∈ GEN(x)} SCORE(x, y)    (2)

To decode with the model we use a beam-search method. The method incrementally builds an AEP in the decision order d_1, d_2, ..., d_N. At each point, a beam contains the top M highest-scoring partial paths for the first m decisions, where M is taken to be a fixed number. The score for any partial path is defined in Eq. 1. The ADVANCE function is used to specify the set of possible decisions that can extend any given path in the beam.

To train the model, we use the averaged perceptron algorithm described by Collins (2002). This combination of the perceptron algorithm with beam search is similar to that described by Collins and Roark (2004).[5] The perceptron algorithm is a convenient choice because it converges quickly, usually taking only a few iterations over the training set (Collins, 2002; Collins and Roark, 2004).

[Footnote 5: Future work may consider alternative algorithms, such as those described by Daumé and Marcu (2005).]
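The decoder and the training loop can be sketched in a few lines. In the Python sketch below, phi(x, prefix, d) returns a list of feature names (a sparse binary vector), advance implements ADVANCE, and alpha is a dict standing in for the weight vector α; the beam handling and the averaging scheme are our simplifications of the algorithms in Collins (2002) and Collins and Roark (2004), not the authors' code (in particular, we omit the early-update refinement).

    from collections import defaultdict

    def beam_decode(x, n_decisions, advance, phi, alpha, M=10):
        # Incrementally build an AEP, keeping the top-M scoring prefixes (Eqs. 1-2).
        # Assumes ADVANCE always offers at least one allowable decision.
        beam = [((), 0.0)]
        for _ in range(n_decisions):
            expanded = []
            for prefix, score in beam:
                for d in advance(x, prefix):          # allowable next decisions
                    gain = sum(alpha.get(f, 0.0) for f in phi(x, prefix, d))
                    expanded.append((prefix + (d,), score + gain))
            beam = sorted(expanded, key=lambda p: -p[1])[:M]
        return beam[0][0]                             # highest-scoring AEP found

    def train(data, advance, phi, T=5):
        # Averaged perceptron: on an error, promote gold features, demote predicted.
        alpha, total = defaultdict(float), defaultdict(float)
        for _ in range(T):
            for x, y_gold in data:
                y_hat = beam_decode(x, len(y_gold), advance, phi, alpha)
                if y_hat != y_gold:
                    for i, d in enumerate(y_gold):
                        for f in phi(x, y_gold[:i], d):
                            alpha[f] += 1.0
                    for i, d in enumerate(y_hat):
                        for f in phi(x, y_hat[:i], d):
                            alpha[f] -= 1.0
                for f in list(alpha):                 # accumulate for averaging
                    total[f] += alpha[f]
        return {f: v / (T * len(data)) for f, v in total.items()}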

5.2 The Features of the Model

The model's features allow it to capture dependencies between the AEP and the German clause, as well as dependencies between different parts of the AEP itself.

1 main verb
2 any verb in the clause
3 all verbs, in sequence
4 spine
5 tree
6 preterminal label of left-most child of subject
7 terminal label of left-most child of subject
8 suffix of terminal label of right-most child of subject
9 preterminal label of left-most child of object
10 terminal label of left-most child of object
11 suffix of terminal label of right-most child of object
12 preterminal label of the negation word nicht (not)
13 is either of the strings es gibt (there is/are) or es gab (there was/were) present?
14 complementizers and wh-words
15 labels of all wh-nonterminals
16 terminal labels of all wh-words
17 preterminal label of a verb in first position
18 terminal label of a verb in first position
19 terminal labels of all words in any relative pronoun under a PP
20 are all of the verbs at the end?
21 nonterminal label of the root of the tree
22 terminal labels of all words constituting the subject
23 terminal labels of all words constituting the object
24 the leaves dominated by each node in the tree
25 each node in the context of a CFG rule
26 each node in the context of the RHS of a CFG rule
27 each node with its left and right sibling
28 the number of leaves dominated by each node in the tree

Table 1: Functions of the German clause used for making features in the AEP prediction model.

The features included in φ can consist of any function of the decision history ⟨d_1, ..., d_{i-1}⟩, the current decision d_i, or the German clause. In defining features over AEP/clause pairs, we make use of some basic functions which look at the German clause and the AEP (see Tables 1 and 2). We use various combinations of these basic functions in the prediction of each decision d_i, as described below.

STEM: Features for the prediction of STEM conjoin the value of this variable with each of the functions in lines 1–13 of Table 1. For example, one feature is the value of STEM conjoined with the main verb of the German clause. In addition, φ includes features sensitive to the rank of a candidate stem in an externally-compiled lexicon.[6]

SPINE: Spine prediction features make use of the values of the variables SPINE and STEM from the AEP, as well as functions of the spine in lines 1–7 of Table 2, conjoined in various ways with the functions in lines 4, 12, and 14–21 of Table 1.

[Footnote 6: The lexicon is derived from GIZA++ and provides, for a large number of German main verbs, a ranked list of possible English translations.]

1 does the SPINE have a subject?
2 does the SPINE have an object?
3 does the SPINE have any wh-words?
4 the labels of any complementizer nonterminals in the SPINE
5 the labels of any wh-nonterminals in the SPINE
6 the nonterminal labels SQ or SBARQ in the SPINE
7 the nonterminal label of the root of the SPINE
8 the grammatical category of the finite verbal form INFL (i.e., infinitive, 1st-, 2nd-, or 3rd-person pres, pres participle, sing past, plur past, past participle)

Table 2: Functions of the English AEP used for making features in the AEP prediction model.

Note that the functions in Table 2 allow us to look at substructure in the spine. For instance, one of the features for SPINE is the label SBARQ or SQ, if it exists in the candidate spine, conjoined with a verbal preterminal label if there is a verb in the first position of the German clause. This feature captures the fact that German yes/no questions begin with a verb in the first position.

VOICE: Voice features in general combine values of VOICE, SPINE, and STEM, with the functions in lines 1–5, 22, and 23 of Table 1.

SUBJECT: Features used for subject prediction make use of the AEP variables VOICE and STEM. In addition, if the value of SUBJECT is an index i (see section 3), then φ looks at the nonterminal label of the German node indexed by i as well as the surrounding context in the German clausal tree. Otherwise, φ looks at the value of SUBJECT. These basic features are combined with the functions in lines 1, 3, and 24–27 of Table 1.

OBJECT: We make similar features to those for the prediction of SUBJECT. In addition, φ can look at the value predicted for SUBJECT.

WH: Features for WH look at the values of WH and SPINE, conjoined with the functions in lines 1, 15, and 19 of Table 1.

MODALS: For the prediction of MODALS, φ looks at MODALS, SPINE, and STEM, conjoined with the functions in lines 2–5 and 12 of Table 1.

INFL: The features for INFL include the values of INFL, MODALS, SUBJECT, and VOICE, and the function in line 8 of Table 2.

MOD(i): For the MOD(i) variables, φ looks at the values of MODALS, SPINE, and the current MOD(i), as well as the nonterminal label of the root node of the German modifier being placed, and the functions in lines 24 and 28 of Table 1.
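As an illustration of how these conjunctions are realized, here is a sketch of a feature function for the STEM decision; the accessor names on the German clause (main_verb, verbs) stand in for lines 1–3 of Table 1 and are hypothetical, as is the string encoding of each feature.

    def stem_features(clause, stem, lexicon_rank):
        # Conjoin the candidate STEM value with functions of the German clause.
        feats = ["STEM=%s;main_verb=%s" % (stem, clause.main_verb)]        # line 1
        feats += ["STEM=%s;verb=%s" % (stem, v) for v in clause.verbs]     # line 2
        feats.append("STEM=%s;verbs=%s" % (stem, "_".join(clause.verbs)))  # line 3
        feats.append("STEM=%s;lex_rank=%d" % (stem, lexicon_rank))  # external lexicon
        return feats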


6 Deriving Full Translations

As we described in section 1.1, the translation of a full German sentence proceeds in a series of steps: a German parse tree is broken into a sequence of clauses; each clause is individually translated; and finally, the clause-level translations are combined to form the translation for a full sentence. The first and last steps are relatively straightforward. We now show how the second step is achieved, i.e., how AEPs can be used to derive English clause translations from German clauses.

We will again use the following translation pair as an example: "daß das haupthemmnis der vorhersehbare widerstand der hersteller war" / "that the main obstacle has been the predictable resistance of manufacturers".

First, an AEP like the one at the top of Figure 2 is predicted. Then, for each German modifier which does not have the value deleted, an English translation is predicted. In the example, the modifiers "das haupthemmnis" and "der vorhersehbare widerstand der hersteller" would be translated to "the main obstacle" and "the predictable resistance of manufacturers", respectively.

A number of methods could be used for translation of the modifiers. In this paper, we use the phrase-based system of Koehn et al. (2003) to generate n-best translations for each of the modifiers, and we then use a discriminative reranking algorithm (Bartlett et al., 2004) to choose between these modifiers. The features in the reranking model can be sensitive to various properties of the candidate English translation, for example the words, the part-of-speech sequence, or the parse tree for the string. The reranker can also take into account the original German string. Finally, the features can be sensitive to properties of the AEP, such as the main verb or the position in which the modifier appears (e.g., subject, object, pre-sub, post-verb, etc.) in the English clause. See Appendix B for a full description of the features used in the modifier translation model. Note that the reranking stage allows us to filter translation candidates which do not fit syntactically with the position in the English tree. For example, we can parse the members of the n-best list, and then learn a feature which strongly disprefers prepositional phrases if the modifier appears in subject position.

Finally, the full string is predicted. In our example, the AEP variables SPINE, MODALS, and INFL in Figure 2 give the ordering <that SUBJECT has been OBJECT>. The AEP and modifier translations would be combined to give the final English string. In general, any modifiers assigned to pre-sub, post-sub, in-modals or post-verb are placed in the corresponding position within the spine. For example, the second AEP in Figure 2 has a spine with ordering <SUBJECT are OBJECT>; modifiers 1 and 2 would be placed in positions post-verb and pre-sub, giving the ordering <MOD2 SUBJECT are OBJECT MOD1>. Note that modifiers assigned post-verb are placed after the object. If multiple modifiers appear in the same position (e.g., post-verb), then they are placed in the order seen in the original German clause.
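The placement rules above amount to a small linearization routine. The sketch below assumes the spine has already been flattened into a list of slots; that flattening, and the slot names, are our own simplification rather than the paper's representation (it also assumes an OBJECT slot exists whenever post-verb modifiers do).

    def linearize(slots, subject, modals, verb, obj, placements, translations):
        # slots: e.g. ["that", "SUBJECT", "MODALS", "VERB", "OBJECT"].
        # placements: the MOD(i) values; translations: the modifier strings,
        # kept in original German order so same-position modifiers keep that order.
        def at(pos):
            return [translations[i] for i, p in enumerate(placements) if p == pos]
        out = []
        for slot in slots:
            if slot == "SUBJECT":
                out += at("pre-sub") + [subject] + at("post-sub")
            elif slot == "MODALS":
                ms = (modals or "").split()
                out += ms[:1] + at("in-modals") + ms[1:]
            elif slot == "VERB":
                out.append(verb)
            elif slot == "OBJECT":
                out += [obj] + at("post-verb")   # post-verb modifiers follow the object
            else:
                out.append(slot)                 # complementizers and wh-words
        return " ".join(w for w in out if w)

    # Second AEP of Figure 2, i.e. <MOD2 SUBJECT are OBJECT MOD1>: prints
    # "so there are ... differences between the two pieces of legislation".
    print(linearize(["SUBJECT", "VERB", "OBJECT"], "there", None, "are",
                    "significant legal, practical and economic differences",
                    ["post-verb", "pre-sub", "null"],
                    ["between the two pieces of legislation", "so", ""]))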

7 Experiments

We applied the approach to translation from German to English, using the Europarl corpus (Koehn, 2005) for our training data. This corpus contains over 750,000 training sentences; we extracted over 441,000 training examples for the AEP model from this corpus, using the method described in section 4. We reserved 35,000 of these training examples as development data for the model. We used a set of features derived from those described in section 5.2. This set was optimized using the development data through experimentation with several different feature subsets.

Modifiers within German clauses were translated using the phrase-based model of Koehn et al. (2003). We first generated n-best lists for each modifier. We then built a reranking model (see section 6) to choose between the elements in the n-best lists. The reranker was trained using around 800 labeled examples from a development set.

The test data for the experiments consisted of 2,000 sentences, and was the same test set as that used by Collins et al. (2005). We use the model of Koehn et al. (2003) as a baseline for our experiments. The AEP-driven model was used to translate all test set sentences where all clauses within the German parse tree contained at least one verb and there was no embedding of clauses; there were 1,335 sentences which met these criteria. The remaining 665 sentences were translated with the baseline system. This set of 2,000 translations had a BLEU score of 23.96. The baseline system alone achieved a BLEU score of 25.26 on the same set of 2,000 test sentences. We also obtained judgments from two human annotators on 100 randomly-drawn sentences on which the baseline and AEP-based outputs differed. For each example the annotator viewed the reference translation, together with the two systems' translations presented in a random order. Annotator 1 judged 62 translations to be equal in quality, 16 translations to be better under the AEP system, and 22 to be better for the baseline system. Annotator 2 judged 37 translations to be equal in quality, 32 to be better under the baseline, and 31 to be better under the AEP-based system.

8 Conclusions and Future Work

We have presented an approach to tree-to-tree based translation which models a new representation, aligned extended projections, within a discriminative, feature-based framework. Our model makes use of an explicit representation of syntax in the target language, together with constraints on the alignments between source and target parse trees.

The current system presents many opportunities for future work. For example, improvement in accuracy may come from a tighter integration of modifier translation into the overall translation process. The current method, using an n-best reranking model to select the best candidate, chooses each modifier independently and then places it into the translation. We intend to explore an alternative method that combines finite-state machines representing the n-best output from the phrase-based system with finite-state machines representing the complementizers, verbs, modals, and other substrings of the translation derived from the AEP. Selecting modifiers using this representation would correspond to searching the finite-state network for the most likely path. A finite-state representation has many advantages, including the ability to easily incorporate an n-gram language model.

Future work may also consider expanded definitions of AEPs. For example, we might consider AEPs that include larger chunks of phrase structure, or we might consider AEPs that contain more detailed information about the relative ordering of modifiers. There is certainly room for improvement in the accuracy with which AEPs are predicted in our data; the feature-driven approach allows a wide range of features to be tested. For example, it would be relatively easy to incorporate a syntactic language model (i.e., a prior distribution over AEP structures) induced from a large amount of English monolingual data.

Appendix A: Identification of Clauses

In the English parse trees, we identify clauses as follows. Any non-terminal labeled by the parser of (Collins, 1999) as SBAR or SBAR-A is labeled as a clause root. Any node labeled by the parser as S or S-A is also labeled as the root of a clause, unless it is directly dominated by a non-terminal labeled SBAR or SBAR-A. Any node labeled SG or SG-A by the parser is labeled as a clause root, unless (1) the node is directly dominated by SBAR or SBAR-A; or (2) the node is directly dominated by a VP, and the node is directly preceded by a verb (POS tag beginning with V) or modal (POS tag beginning with M). Any node labeled VP is marked as a clause root if (1) the node is not directly dominated by a VP, S, S-A, SBAR, SBAR-A, SG, or SG-A; or (2) the node is directly preceded by a coordinating conjunction (i.e., a POS tag labeled as CC).

In German parse trees, we identify any nodes labeled as S or CS as clause roots. In addition, we mark any node labeled as VP as a clause root, provided that (1) it is preceded by a coordinating conjunction, i.e., a POS tag labeled as KON; or (2) it has one of the functional tags -mo, -re, or -sb.
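The German-side rules translate directly into a predicate over parse-tree nodes. A minimal sketch, assuming each node carries its label, its functional tags, and the POS tag of the immediately preceding word (these attribute names are ours):

    def is_german_clause_root(node):
        if node.label in ("S", "CS"):
            return True
        if node.label == "VP":
            if node.preceding_pos == "KON":       # (1) preceded by a conjunction
                return True
            if any(t in node.tags for t in ("-mo", "-re", "-sb")):  # (2) functional tag
                return True
        return False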

Appendix B: Reranking Modifier Translations

The n-best reranking model for the translation of modifiers considers a list of candidate translations. We hand-labeled 800 examples, marking the element in each list that would lead to the best translation. The features of the n-best reranking algorithm are combinations of the basic features in Tables 3 and 4.

Each list contained the n-best translations produced by the phrase-based system of Koehn et al. (2003). The lists also contained a supplementary candidate "DELETED", signifying that the modifier should be deleted from the English translation. In addition, each candidate derived from the phrase-based system contributed one new candidate to the list, signifying that the first word of the candidate should be deleted. These additional candidates were motivated by our observation that the optimal candidate in the n-best list produced by the phrase-based system often included an unwanted preposition at the beginning of the string.


1 candidate string
2 should the first word of the candidate be deleted?
3 POS tag of first word of candidate
4 POS tag of last word of candidate
5 top nonterminal of parse of candidate
6 modifier deleted from English translation?
7 first candidate on n-best list
8 first word of candidate
9 last word of candidate
10 rank of candidate in n-best list
11 is there punctuation at the beginning, middle, or end of the string?
12 if the first word of the candidate should be deleted, what is the string that is deleted?
13 if the first word of the candidate should be deleted, what is the POS tag of the word that is deleted?

Table 3: Functions of the candidate modifier translations used for making features in the n-best reranking model.

1 the position of the modifier (0–4) in the AEP
2 main verb
3 voice
4 subject prediction
5 German input string

Table 4: Functions of the German input string and predicted AEP output used for making features in the n-best reranking model.

Acknowledgements

We would like to thank Luke Zettlemoyer, Regina Barzilay, Ed Filisko, and Ben Snyder for their valuable comments and help during the writing of this paper. Thanks also to Jason Rennie and John Barnett for providing human judgments of the translation output. This work was funded by NSF grants IIS-0347631, IIS-0415030, and DMS-0434222, as well as a grant from NTT, Agmt. Dtd. 6/21/1998.

References

H. Alshawi. 1996. Head automata and bilingual tiling: translation with minimal representations. ACL 96.

P. Bartlett, M. Collins, B. Taskar, and D. McAllester. 2004. Exponentiated gradient algorithms for large-margin structured classification. Proceedings of NIPS 2004.

P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. 1993. The mathematics of statistical machine translation. Computational Linguistics, 22(1):39–69.

E. Charniak, K. Knight, and K. Yamada. 2001. Syntax-based language models for statistical machine translation. ACL 01.

D. Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. ACL 05.

M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania.

M. Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. EMNLP 02.

M. Collins and B. Roark. 2004. Incremental parsing with the perceptron algorithm. ACL 04.

M. Collins, P. Koehn, and I. Kučerová. 2005. Clause restructuring for statistical machine translation. ACL 05.

H. Daumé III and D. Marcu. 2005. Learning as search optimization: approximate large margin methods for structured prediction. ICML 05.

A. Dubey. 2005. What to do when lexicalization fails: parsing German with suffix analysis and smoothing. ACL 05.

J. Eisner. 2003. Learning non-isomorphic tree mappings for machine translation. ACL 03, Companion Volume.

R. Frank. 2002. Phrase Structure Composition and Syntactic Dependencies. Cambridge, MA: MIT Press.

M. Galley, M. Hopkins, K. Knight, and D. Marcu. 2004. What's in a translation rule? HLT-NAACL 04.

D. Gildea. 2003. Loosely tree-based alignment for machine translation. ACL 03.

J. Graehl and K. Knight. 2004. Training tree transducers. NAACL-HLT 04.

A. Joshi. 1985. How much context-sensitivity is necessary for characterizing structural descriptions – tree-adjoining grammar. Cambridge University Press.

P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. HLT-NAACL 03.

P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. MT Summit 05.

I. D. Melamed. 2004. Statistical machine translation by parsing. ACL 04.

F. J. Och, D. Gildea, S. Khudanpur, A. Sarkar, K. Yamada, A. Fraser, S. Kumar, L. Shen, D. Smith, K. Eng, V. Jain, Z. Jin, and D. Radev. 2004. A smorgasbord of features for statistical machine translation. HLT-NAACL 04.

F. J. Och and H. Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.

F. J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

C. Quirk, A. Menezes, and C. Cherry. 2005. Dependency tree translation: syntactically informed phrasal SMT. EACL 05.

S. Riezler and J. Maxwell. 2006. Grammatical machine translation. HLT-NAACL 06.

S. Shieber. 2004. Synchronous grammars as tree transducers. In Proceedings of the Seventh International Workshop on Tree Adjoining Grammar and Related Formalisms.

S. Shieber and Y. Schabes. 1990. Synchronous tree-adjoining grammars. In Proceedings of the 13th International Conference on Computational Linguistics.

D. Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

F. Xia and M. McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. COLING 04.

K. Yamada and K. Knight. 2001. A syntax-based statistical translation model. ACL 01.
