
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 620–629, Honolulu, October 2008. © 2008 Association for Computational Linguistics

Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation

Roy W. Tromble (1), Shankar Kumar (2), Franz Och (2), and Wolfgang Macherey (2)

(1) Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
[email protected]

(2) Google Inc., 1600 Amphitheatre Pkwy., Mountain View, CA 94043, USA
{shankarkumar,och,wmach}@google.com

Abstract

We present Minimum Bayes-Risk (MBR) decoding over translation lattices that compactly encode a huge number of translation hypotheses. We describe conditions on the loss function that will enable efficient implementation of MBR decoders on lattices. We introduce an approximation to the BLEU score (Papineni et al., 2001) that satisfies these conditions. The MBR decoding under this approximate BLEU is realized using Weighted Finite State Automata. Our experiments show that the Lattice MBR decoder yields moderate, consistent gains in translation performance over N-best MBR decoding on Arabic-to-English, Chinese-to-English and English-to-Chinese translation tasks. We conduct a range of experiments to understand why Lattice MBR improves upon N-best MBR and study the impact of various parameters on MBR performance.

1 Introduction

Statistical language processing systems for speech recognition, machine translation or parsing typically employ the Maximum A Posteriori (MAP) decision rule, which optimizes the 0-1 loss function. In contrast, these systems are evaluated using metrics based on string-edit distance (Word Error Rate), n-gram overlap (BLEU score (Papineni et al., 2001)), or precision/recall relative to human annotations. Minimum Bayes-Risk (MBR) decoding (Bickel and Doksum, 1977) aims to address this mismatch by selecting the hypothesis that minimizes the expected error in classification. Thus it directly incorporates the loss function into the decision criterion. The approach has been shown to give improvements over the MAP classifier in many areas of natural language processing, including automatic speech recognition (Goel and Byrne, 2000), machine translation (Kumar and Byrne, 2004; Zhang and Gildea, 2008), bilingual word alignment (Kumar and Byrne, 2002), and parsing (Goodman, 1996; Titov and Henderson, 2006; Smith and Smith, 2007).

In statistical machine translation, MBR decoding is generally implemented by re-ranking an N-best list of translations produced by a first-pass decoder; this list typically contains between 100 and 10,000 hypotheses. Kumar and Byrne (2004) show that MBR decoding gives optimal performance when the loss function is matched to the evaluation criterion; in particular, MBR under the sentence-level BLEU loss function (Papineni et al., 2001) gives gains on BLEU. This is despite the fact that the sentence-level BLEU loss function is an approximation to the exact corpus-level BLEU.

A different MBR-inspired decoding approach is pursued in Zhang and Gildea (2008) for machine translation using Synchronous Context Free Grammars. A forest generated by an initial decoding pass is rescored using dynamic programming to maximize the expected count of synchronous constituents in the tree that corresponds to the translation. Since each constituent adds a new 4-gram to the existing translation, this approach approximately maximizes the expected BLEU.

In this paper we explore a different strategy to perform MBR decoding over Translation Lattices (Ueffing et al., 2002), which compactly encode a huge number of translation alternatives relative to an N-best list. This is a model-independent approach in that the lattices could be produced by any statistical MT system; both phrase-based and syntax-based systems would work in this framework. We will introduce conditions on the loss functions that can be incorporated in Lattice MBR decoding. We describe an approximation to the BLEU score (Papineni et al., 2001) that will satisfy these conditions. Our Lattice MBR decoding is realized using Weighted Finite State Automata.

We expect Lattice MBR decoding to improve upon N-best MBR primarily because lattices contain many more candidate translations than the N-best list. This has been demonstrated in speech recognition (Goel and Byrne, 2000). We conduct a range of translation experiments to analyze lattice MBR and compare it with N-best MBR. An important aspect of our lattice MBR is the linear approximation to the BLEU score. We will show that MBR decoding under this score achieves a performance that is at least as good as the performance obtained under the sentence-level BLEU score.

The rest of the paper is organized as follows. We review MBR decoding in Section 2 and give the formulation in terms of a gain function. In Section 3, we describe the conditions on the gain function for efficient decoding over a lattice. The implementation of lattice MBR with Weighted Finite State Automata is presented in Section 4. In Section 5, we introduce the corpus BLEU approximation that makes it possible to perform efficient lattice MBR decoding. An example of lattice MBR with a toy lattice is presented in Section 6. We present lattice MBR experiments in Section 7. A final discussion is presented in Section 8.

2 Minimum Bayes Risk Decoding

Minimum Bayes-Risk (MBR) decoding aims to find the candidate hypothesis that has the least expected loss under the probability model (Bickel and Doksum, 1977). We begin with a review of MBR decoding for Statistical Machine Translation (SMT).

Statistical MT (Brown et al., 1990; Och and Ney, 2004) can be described as a mapping of a word sequence F in the source language to a word sequence E in the target language; this mapping is produced by the MT decoder δ(F). If the reference translation E is known, the decoder performance can be measured by the loss function L(E, δ(F)). Given such a loss function L(E, E′) between an automatic translation E′ and the reference E, and an underlying probability model P(E|F), the MBR decoder has the following form (Goel and Byrne, 2000; Kumar and Byrne, 2004):

\hat{E} = \arg\min_{E' \in \mathcal{E}} R(E') = \arg\min_{E' \in \mathcal{E}} \sum_{E \in \mathcal{E}} L(E, E') P(E|F),

where R(E′) denotes the Bayes risk of candidate translation E′ under the loss function L.

If the loss function between any two hypotheses can be bounded, L(E, E′) ≤ L_max, the MBR decoder can be rewritten in terms of a gain function G(E, E′) = L_max − L(E, E′):

\hat{E} = \arg\max_{E' \in \mathcal{E}} \sum_{E \in \mathcal{E}} G(E, E') P(E|F).   (1)

We are interested in performing MBR decoding under a sentence-level BLEU score (Papineni et al., 2001), which behaves like a gain function: it varies between 0 and 1, and a larger value reflects a higher similarity. We will therefore use Equation 1 as the MBR decoder.

We note that E represents the space of translations. For N-best MBR, this space E is the N-best list produced by a baseline decoder. We will investigate the use of a translation lattice for MBR decoding; in this case, E will represent the set of candidates encoded in the lattice.

In general, MBR decoding can use different spaces for hypothesis selection and risk computation: the argmax and the sum in Equation 1 (Goel, 2001). As an example, the hypothesis could be selected from the N-best list while the risk is computed based on the entire lattice. Therefore, the MBR decoder can be written more generally as follows:

\hat{E} = \arg\max_{E' \in \mathcal{E}_h} \sum_{E \in \mathcal{E}_e} G(E, E') P(E|F),   (2)

where E_h refers to the hypothesis space from which the translations are chosen, and E_e refers to the evidence space that is used for computing the Bayes risk. We will present experiments (Section 7) to show the relative importance of these two spaces.
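
To make Equation 2 concrete, the following Python sketch scores every hypothesis-space candidate against an evidence space of (translation, posterior) pairs under a user-supplied gain function; the function names and the toy unigram-overlap gain are illustrative assumptions rather than the gain used in this paper.

    from typing import Callable, List, Tuple

    def mbr_decode(hyp_space: List[str],
                   evidence: List[Tuple[str, float]],
                   gain: Callable[[str, str], float]) -> str:
        """Pick the hypothesis with the highest expected gain (Equation 2)."""
        best_hyp, best_score = None, float("-inf")
        for e_prime in hyp_space:
            expected_gain = sum(p * gain(e, e_prime) for e, p in evidence)
            if expected_gain > best_score:
                best_hyp, best_score = e_prime, expected_gain
        return best_hyp

    def unigram_overlap_gain(e: str, e_prime: str) -> float:
        """Toy stand-in for a sentence-level gain such as BLEU."""
        ref, hyp = set(e.split()), e_prime.split()
        return sum(1 for w in hyp if w in ref) / max(len(hyp), 1)

    # Hypothesis space = evidence space = a tiny N-best list with posteriors.
    nbest = [("a b d e", 0.5), ("b c d e", 0.3), ("b c d a", 0.2)]
    print(mbr_decode([e for e, _ in nbest], nbest, unigram_overlap_gain))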


3 Lattice MBR Decoding

We now present MBR decoding on translation lattices. A translation word lattice is a compact representation for very large N-best lists of translation hypotheses and their likelihoods. Formally, it is an acyclic Weighted Finite State Acceptor (WFSA) (Mohri, 2002) consisting of states and arcs representing transitions between states. Each arc is labeled with a word and a weight. Each path in the lattice, consisting of consecutive transitions beginning at the distinguished initial state and ending at a final state, expresses a candidate translation. Aggregation of the weights along the path (using the log semiring's extend operator) produces the weight of the path's candidate, H(E, F), according to the model. In our setting, this weight will imply the posterior probability of the translation E given the source sentence F:

P(E|F) = \frac{\exp(\alpha H(E, F))}{\sum_{E' \in \mathcal{E}} \exp(\alpha H(E', F))}.   (3)

The scaling factor α ∈ [0, ∞) flattens the distribution when α < 1, and sharpens it when α > 1.
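
Equation 3 is simply a scaled softmax over the model scores H(E, F). A minimal sketch, assuming the per-path scores are already available as a list:

    import math

    def scaled_posteriors(scores, alpha):
        """P(E|F) from model scores H(E,F) under Equation 3 with scaling factor alpha."""
        logits = [alpha * h for h in scores]
        m = max(logits)                        # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        return [e / z for e in exps]

    scores = [-2.0, -2.5, -4.0]
    print(scaled_posteriors(scores, alpha=0.2))   # flatter distribution
    print(scaled_posteriors(scores, alpha=1.0))   # sharper distribution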

Because a lattice may represent a number of candidates exponential in the size of its state set, it is often impractical to compute the MBR decoder (Equation 1) directly. However, if we can express the gain function G as a sum of local gain functions g_i, then we now show that Equation 1 can be refactored and the MBR decoder can be computed efficiently. We loosely call a gain function local if it can be applied to all paths in the lattice via WFSA intersection (Mohri, 2002) without significantly multiplying the number of states.

In this paper, we are primarily concerned with local gain functions that weight n-grams. Let N = {w_1, ..., w_|N|} be the set of n-grams and let a local gain function g_w : E × E → R, for w ∈ N, be as follows:

g_w(E, E') = \theta_w \, \#_w(E') \, \delta_w(E),   (4)

where θ_w is a constant, #_w(E′) is the number of times that w occurs in E′, and δ_w(E) is 1 if w ∈ E and 0 otherwise. That is, g_w is θ_w times the number of occurrences of w in E′, or zero if w does not occur in E. We first assume that the overall gain function G(E, E′) can then be written as a sum of local gain functions and a constant θ_0 times the length of the hypothesis E′:

G(E, E') = \theta_0 |E'| + \sum_{w \in N} g_w(E, E')   (5)
         = \theta_0 |E'| + \sum_{w \in N} \theta_w \, \#_w(E') \, \delta_w(E).

Given a gain function of this form, we can rewrite the risk (the sum in Equation 1) as follows:

\sum_{E \in \mathcal{E}} G(E, E') P(E|F)
    = \sum_{E \in \mathcal{E}} \Big( \theta_0 |E'| + \sum_{w \in N} \theta_w \, \#_w(E') \, \delta_w(E) \Big) P(E|F)
    = \theta_0 |E'| + \sum_{w \in N} \theta_w \, \#_w(E') \sum_{E \in \mathcal{E}_w} P(E|F),

where E_w = {E ∈ E | δ_w(E) > 0} represents the paths of the lattice containing the n-gram w at least once. The MBR decoder on lattices (Equation 1) can therefore be written as

\hat{E} = \arg\max_{E' \in \mathcal{E}} \Big\{ \theta_0 |E'| + \sum_{w \in N} \theta_w \, \#_w(E') \, p(w|\mathcal{E}) \Big\}.   (6)

Here p(w|E) = Σ_{E ∈ E_w} P(E|F) is the posterior probability of the n-gram w in the lattice. We have thus replaced a summation over a possibly exponential number of items (E ∈ E) with a summation over the number of n-grams that occur in E, which is at worst polynomial in the number of edges in the lattice that defines E. We compute the posterior probability of each n-gram w as:

p(w|\mathcal{E}) = \sum_{E \in \mathcal{E}_w} P(E|F) = \frac{Z(\mathcal{E}_w)}{Z(\mathcal{E})},   (7)

where Z(E) = Σ_{E′ ∈ E} exp(αH(E′, F)) is the denominator in Equation 3 and Z(E_w) = Σ_{E′ ∈ E_w} exp(αH(E′, F)). Z(E) and Z(E_w) represent the sums of the weights of all paths in the lattices E and E_w respectively; these sums are computed in the log semiring, where log+(x, y) = log(e^x + e^y) is the collect operator (Mohri, 2002).

4 WFSA MBR Computations

We now show how the Lattice MBR decision rule (Equation 6) can be implemented using Weighted Finite State Automata (Mohri, 1997). There are four steps involved in decoding, starting from weighted finite-state automata representing the candidate outputs of a translation system. We describe these steps in the setting where the evidence lattice E_e may be different from the hypothesis lattice E_h (Equation 2).

1. Extract the set of n-grams that occur in the evidence lattice E_e. For the usual BLEU score, n ranges from one to four.

2. Compute the posterior probability p(w|E) of each of these n-grams.

3. Intersect each n-gram w, carrying an appropriate weight (from Equation 6), with an initially unweighted copy of the hypothesis lattice E_h.

4. Find the best path in the resulting automaton.

Computing the set of n-grams N that occur in a finite automaton requires a traversal, in topological order, of all the arcs in the automaton. Because the lattice is acyclic, this is possible. Each state q in the automaton has a corresponding set of n-grams N_q ending there.

1. For each state q, N_q is initialized to {ε}, the set containing the empty n-gram.

2. Each arc in the automaton extends each of its source state's n-grams by its word label, and adds the resulting n-grams to the set of its target state. (ε arcs do not extend n-grams, but transfer them unchanged.) n-grams longer than the desired order are discarded.

3. N is the union over all states q of N_q.
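
A minimal sketch of this traversal for an acyclic lattice given as a list of arcs sorted by the topological order of their source states; the (source, target, word) data layout is an assumption made for illustration, with epsilon arcs carrying the word None.

    from collections import defaultdict

    def lattice_ngrams(arcs, max_order=4):
        """Collect all n-grams up to max_order that occur on paths of an acyclic lattice."""
        # N_q: the n-grams (as tuples) ending at state q, initialised with the empty n-gram.
        suffixes = defaultdict(lambda: {()})
        all_ngrams = set()
        for src, dst, word in arcs:            # arcs sorted topologically by source state
            for ng in list(suffixes[src]):
                if word is None:               # epsilon arcs transfer n-grams unchanged
                    suffixes[dst].add(ng)
                    continue
                extended = ng + (word,)
                if len(extended) > max_order:  # discard n-grams longer than the order
                    continue
                suffixes[dst].add(extended)
                all_ngrams.add(extended)
        return all_ngrams

    arcs = [(0, 1, "a"), (0, 1, "b"), (1, 2, "c"), (2, 3, "d")]
    print(sorted(lattice_ngrams(arcs, max_order=2)))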

Given an n-gram w, we construct an automaton matching any path containing the n-gram, and intersect that automaton with the lattice to find the set of paths containing the n-gram (E_w in Equation 7). If E represents the weighted lattice, we compute, in the log semiring (Mohri, 2002), E_w = E ∩ (w̄ w Σ*), where w̄ = \overline{Σ* w Σ*} is the language of all strings that do not contain the n-gram w. The posterior probability p(w|E) of the n-gram w can then be computed as the ratio of the total weight of paths in E_w to the total weight of paths in the original lattice (Equation 7).

For each n-gram w ∈ N, we then construct an automaton that accepts an input E with weight equal to the product of the number of times the n-gram occurs in the input (#_w(E)), the n-gram factor θ_w from Equation 6, and the posterior probability p(w|E). The automaton corresponds to the weighted regular expression (Karttunen et al., 1996): w̄ (w/(θ_w p(w|E)) w̄)*.

We successively intersect each of these automata with an automaton that begins as an unweighted copy of the lattice E_h. This automaton must also incorporate the factor θ_0 for each word. This can be accomplished by intersecting the unweighted lattice with the automaton accepting (Σ/θ_0)*. The resulting MBR automaton computes the total expected gain of each path. A path in this automaton that corresponds to the word sequence E′ has cost θ_0|E′| + Σ_{w ∈ N} θ_w #_w(E′) p(w|E) (the expression within the curly brackets in Equation 6).

Finally, we extract the best path from the resulting automaton, in the (max, +) semiring (Mohri, 2002), giving the lattice MBR candidate translation according to the gain function (Equation 6).
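
Putting the pieces together, the sketch below evaluates the expression inside the braces of Equation 6 for explicitly given candidate strings; in the paper this scoring and the final argmax are carried out by automata intersection and a best-path search over the hypothesis lattice, so explicit enumeration here is only an illustrative assumption. The theta table is keyed by n-gram order, as in the parameterization of Equations 11 and 12.

    def expected_gain(candidate, theta_0, theta, ngram_post, max_order=4):
        """theta_0*|E'| + sum_w theta_w * #_w(E') * p(w|E)  -- Equation 6."""
        words = candidate.split()
        score = theta_0 * len(words)
        for n in range(1, max_order + 1):
            for i in range(len(words) - n + 1):        # every occurrence contributes
                w = tuple(words[i:i + n])
                score += theta[n] * ngram_post.get(w, 0.0)
        return score

    def lattice_mbr(candidates, theta_0, theta, ngram_post, max_order=4):
        """Select the candidate with the highest expected gain."""
        # ngram_post would come from the lattice (Equation 7); theta_0, theta from Equation 12.
        return max(candidates,
                   key=lambda c: expected_gain(c, theta_0, theta, ngram_post, max_order))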

5 Linear Corpus BLEU

Our Lattice MBR formulation relies on the decomposition of the overall gain function as a sum of local gain functions (Equation 5). We here describe a linear approximation to the log(BLEU score) (Papineni et al., 2001) which allows such a decomposition. This will enable us to rewrite log(BLEU) as a linear function of n-gram matches and the hypothesis length. Our strategy will be to use a first-order Taylor-series approximation to what we call the corpus log(BLEU) gain: the change in corpus log(BLEU) contributed by the sentence relative to not including that sentence in the corpus.

Let r be the reference length of the corpus, c_0 the candidate length, and {c_n | 1 ≤ n ≤ 4} the numbers of n-gram matches. Then the corpus BLEU score B(r, c_0, c_n) can be defined as follows (Papineni et al., 2001):

\log B = \min\left(0,\, 1 - \frac{r}{c_0}\right) + \frac{1}{4} \sum_{n=1}^{4} \log \frac{c_n}{c_0 - \Delta_n}
       \approx \min\left(0,\, 1 - \frac{r}{c_0}\right) + \frac{1}{4} \sum_{n=1}^{4} \log \frac{c_n}{c_0},

where we have ignored Δ_n, the difference between the number of words in the candidate and the number of its n-grams. If L is the average sentence length in the corpus, Δ_n ≈ (n − 1) c_0 / L.

The corpus log(BLEU) gain is defined as the change in log(BLEU) when a new sentence's (E′) statistics are added to the corpus statistics:

G = \log B' - \log B,

where the counts in B′ are those of B plus those of the current sentence. We will assume that the brevity penalty (the first term in the above approximation) does not change when adding the new sentence. In experiments not reported here, we found that taking the brevity penalty into account at the sentence level can cause large fluctuations in lattice MBR performance on different test sets. We therefore treat only the c_n as variables.

The corpus log(BLEU) gain is approximated by a first-order vector Taylor-series expansion about the initial values of c_n:

G \approx \sum_{n=0}^{4} (c'_n - c_n) \left. \frac{\partial \log B'}{\partial c'_n} \right|_{c'_n = c_n},   (8)

where the partial derivatives are given by

\frac{\partial \log B}{\partial c_0} = \frac{-1}{c_0}, \qquad \frac{\partial \log B}{\partial c_n} = \frac{1}{4 c_n}.   (9)

Substituting the derivatives into Equation 8 gives

G = \Delta \log B \approx -\frac{\Delta c_0}{c_0} + \frac{1}{4} \sum_{n=1}^{4} \frac{\Delta c_n}{c_n},   (10)

where each Δc_n = c′_n − c_n counts the statistic in the sentence of interest, rather than in the corpus as a whole. This score is therefore a linear function of the counts of words Δc_0 and n-gram matches Δc_n. Our approach ignores the count clipping present in the exact BLEU score, where a correct n-gram present once in the reference but several times in the hypothesis is counted as correct only once. Such an approach is also followed in Dreyer et al. (2007).
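
As a small numeric illustration of Equation 10, the sketch below evaluates the linear gain for made-up sentence and corpus statistics (the numbers are assumptions, not values from the paper):

    def delta_log_bleu(delta_c, c):
        """Linear approximation to the corpus log(BLEU) gain (Equation 10).

        delta_c[0] is the sentence's candidate length and delta_c[1..4] its n-gram
        matches; c[0..4] are the corresponding corpus-level totals.
        """
        return -delta_c[0] / c[0] + 0.25 * sum(delta_c[n] / c[n] for n in range(1, 5))

    corpus = [10000, 7000, 4000, 2500, 1500]    # c_0 .. c_4 (hypothetical)
    sentence = [20, 15, 9, 5, 3]                # delta_c_0 .. delta_c_4 (hypothetical)
    print(delta_log_bleu(sentence, corpus))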

Using the above first-order approximation to the gain in log corpus BLEU, Equation 9 implies that θ_0 and θ_w from Section 3 would have the following values:

\theta_0 = \frac{-1}{c_0}, \qquad \theta_w = \frac{1}{4 c_{|w|}}.   (11)

5.1 N-gram Factors

We now describe how the n-gram factors (Equation 11) are computed. The factors depend on a set of n-gram matches and counts (c_n; n ∈ {0, 1, 2, 3, 4}). These factors could be obtained from a decoding run on a development set. However, doing so could make the performance of lattice MBR very sensitive to the actual BLEU scores on a particular run. We would like to avoid such a dependence and instead obtain a set of parameters which can be estimated from multiple decoding runs without MBR. To achieve this, we make use of the properties of n-gram matches. It is known that the average n-gram precisions decay approximately exponentially with n (Papineni et al., 2001). We now assume that the number of matches of each n-gram is a constant ratio r times the matches of the corresponding (n−1)-gram.

If the unigram precision is p, we can obtain the n-gram factors (n ∈ {1, 2, 3, 4}) of Equation 11 as a function of the parameters p and r, and the number of unigram tokens T:

\theta_0 = \frac{-1}{T}, \qquad \theta_n = \frac{1}{4 T p \, r^{n-1}}.   (12)

We set p and r to the average values of unigram precision and precision ratio across multiple development sets. Substituting the above factors into Equation 6, we find that the MBR decision does not depend on T; therefore any value of T can be used.
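
A minimal sketch of the factor computation in Equation 12, using the aren values of p and r quoted later in Section 7.3 (0.85 and 0.72); T is set arbitrarily since, as noted above, the MBR decision does not depend on it.

    def ngram_factors(p, r, T=1.0, max_order=4):
        """theta_0 and theta_1..theta_4 from Equation 12."""
        theta_0 = -1.0 / T
        theta = {n: 1.0 / (4.0 * T * p * r ** (n - 1)) for n in range(1, max_order + 1)}
        return theta_0, theta

    theta_0, theta = ngram_factors(p=0.85, r=0.72)
    print(theta_0, theta)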

6 An Example

Figure 1 shows a toy lattice and the final MBR automaton (Section 4) for BLEU with a maximum n-gram order of 2. We note that the MBR hypothesis (bcde) has a higher decoder cost relative to the MAP hypothesis (abde). However, bcde gets a higher expected gain (Equation 6) than abde since it shares more n-grams with the Rank-3 hypothesis (bcda). This illustrates how a lattice can help select MBR translations that differ from the MAP translation.

7 Experiments

We now present experiments to evaluate MBR decoding on lattices under the linear corpus BLEU gain. We start with a description of the data sets and the SMT system.


[Figure 1: An example translation lattice with decoder costs (top) and its MBR automaton for BLEU-2 (bottom). The bold path in the top is the MAP hypothesis and the bold path in the bottom is the MBR hypothesis. The precision parameters in Equation 12 are set to T = 10, p = 0.85, r = 0.72.]

Dataset   aren   zhen   enzh
dev1      1353   1788   1664
dev2       663    919    919
blind     1360   1357   1859

Table 1: Statistics (number of sentences) over the development and test sets.

7.1 Development and Blind Test Sets

We present our experiments on the constrained data track of the NIST 2008 Arabic-to-English (aren), Chinese-to-English (zhen), and English-to-Chinese (enzh) machine translation tasks (http://www.nist.gov/speech/tests/mt/). In all language pairs, the parallel and monolingual data consists of all the allowed training sets in the constrained track.

For each language pair, we use two development sets: one for Minimum Error Rate Training (Och, 2003; Macherey et al., 2008), and the other for tuning the scale factor for MBR decoding. Our development sets consist of the NIST 2004/2003 evaluation sets for both aren and zhen, and the NIST 2006 (NIST portion)/2003 evaluation sets for enzh. We report results on NIST 2008, which is our blind test set. Statistics computed over these data sets are reported in Table 1.

7.2 MT System Description

Our phrase-based statistical MT system is similar to the alignment template system described in Och and Ney (2004). The system is trained on parallel corpora allowed in the constrained track. We first perform sentence and sub-sentence chunk alignment on the parallel documents. We then train word alignment models (Och and Ney, 2003) using 6 Model-1 iterations and 6 HMM iterations. An additional 2 iterations of Model-4 are performed for the zhen and enzh pairs. Word alignments in both source-to-target and target-to-source directions are obtained using the Maximum A Posteriori (MAP) framework (Matusov et al., 2004). An inventory of phrase pairs up to length 5 is then extracted from the union of source-target and target-source alignments. Several feature functions are then computed over the phrase pairs. 5-gram word language models are trained on the allowed monolingual corpora. Minimum Error Rate Training under BLEU is used for estimating approximately 20 feature function weights over the dev1 development set.

Translation is performed using a standard dynamic programming beam-search decoder (Och and Ney, 2004) with two decoding passes. The first decoder pass generates either a lattice or an N-best list. MBR decoding is performed in the second pass. The MBR scaling parameter (α in Equation 3) is tuned on the dev2 development set.

7.3 Translation Results

We next report translation results from lattice MBR decoding. All results will be presented on the NIST 2008 evaluation sets. We report results using the NIST implementation of the BLEU score, which computes the brevity penalty using the shortest reference translation for each segment (NIST, 2002-2008). The BLEU scores are reported at the word level for aren and zhen but at the character level for enzh. We measure statistical significance using 95% confidence intervals computed with paired bootstrap resampling (Koehn, 2004). In all tables, systems in a column show statistically significant differences unless marked with an asterisk.

We first compare lattice MBR to N-best MBR decoding and MAP decoding (Table 2). In these experiments, we hold the likelihood scaling factor α constant; it is set to 0.2 for aren and enzh, and 0.1 for zhen.


BLEU (%)        aren   zhen    enzh
MAP             43.7   27.9    41.4
N-best MBR      43.9   28.3*   42.0
Lattice MBR     44.9   28.5*   42.6

Table 2: Lattice MBR, N-best MBR, and MAP decoding. On zhen, Lattice MBR and N-best MBR do not show statistically significant differences.

The translation lattices are pruned using Forward-Backward pruning (Sixtus and Ortmanns, 1999) so that the average number of arcs per word (the lattice density) is 30. For N-best MBR, we use N-best lists of size 1000. To match the loss function, Lattice MBR is performed at the word level for aren/zhen and at the character level for enzh. Our lattice MBR is implemented using the Google OpenFst library (http://www.openfst.org/). In our experiments, p and r (Equation 12) have values of 0.85/0.72, 0.80/0.62, and 0.63/0.48 for aren, zhen, and enzh respectively.

We note that Lattice MBR provides gains of 0.2-1.0 BLEU points over N-best MBR, which in turn gives 0.2-0.6 BLEU points over MAP. These gains are obtained on top of a baseline system that has competitive performance relative to the results reported in the NIST 2008 Evaluation (http://www.nist.gov/speech/tests/mt/2008/doc/mt08_official_results_v0.html). This demonstrates the effectiveness of lattice MBR decoding as a realization of MBR decoding which yields substantial gains over the N-best implementation.

The gains from lattice MBR over N-best MBR could be due to a combination of factors. These include: 1) better approximation of the corpus BLEU score, 2) a larger hypothesis space, and 3) a larger evidence space. We now present experiments to tease apart these factors.

Our first experiment restricts both the hypothesis and evidence spaces in lattice MBR to the 1000-best list (Table 3). We compare this to N-best MBR with: a) sentence-level BLEU, and b) sentence-level log BLEU.

The results show that when restricted to the 1000-best list, Lattice MBR performs slightly better than N-best MBR (with sentence BLEU) on aren/enzh, while N-best MBR is better on zhen.

BLEU (%)                            aren    zhen    enzh
Lattice MBR, Lin. Corpus BLEU       44.2    28.1    42.2
N-best MBR, Sent. BLEU              43.9*   28.3*   42.0*
N-best MBR, Sent. Log BLEU          44.0*   28.3*   41.9*

Table 3: Lattice and N-best MBR (with sentence BLEU / sentence log BLEU) on a 1000-best list. In each column, entries with an asterisk do not show statistically significant differences.

                               BLEU (%)
Hyp Space     Evid Space       aren    zhen    enzh
Lattice       Lattice          44.9    28.5    42.6
1000-best     Lattice          44.6    28.5    42.6
Lattice       1000-best        44.1*   28.0*   42.1
1000-best     1000-best        44.2*   28.1*   42.2

Table 4: Lattice MBR with restrictions on hypothesis and evidence spaces. In each column, entries with an asterisk do not show statistically significant differences.

We hypothesize that on aren/enzh, the linear corpus BLEU gain (Equation 10) is better correlated with the actual corpus BLEU than sentence-level BLEU, while the opposite is true on zhen. N-best MBR gives similar results with either sentence BLEU or sentence log BLEU. This confirms that using a log BLEU score does not change the outcome of MBR decoding and further justifies our Taylor-series approximation of the log BLEU score.

We next attempt to understand factors 2 and 3. To do that, we carry out lattice MBR when either the hypothesis or the evidence space in Equation 2 is restricted to the 1000-best hypotheses (Table 4). For comparison, we also include results from lattice MBR when both hypothesis and evidence spaces are identical: either the full lattice or the 1000-best list (from Tables 2 and 3).

These results show that lattice MBR results are almost unchanged when the hypothesis space is restricted to a 1000-best list. However, when the evidence space is shrunk to a 1000-best list, there is a significant degradation in performance; these latter results are almost identical to the scenario when both evidence and hypothesis spaces are restricted to the 1000-best list. This experiment throws light on what makes lattice MBR effective over N-best MBR. Relative to the N-best list, the translation lattice provides a better estimate of the expected BLEU score. On the other hand, there are few hypotheses outside the 1000-best list which are selected by lattice MBR.

Finally, we show how the performance of lattice MBR changes as a function of the lattice density. The lattice density is the average number of arcs per word and can be varied using Forward-Backward pruning (Sixtus and Ortmanns, 1999). Figure 2 reports the average number of lattice paths and BLEU scores as a function of lattice density. The results show that Lattice MBR performance generally improves when the size of the lattice is increased. However, on zhen, there is a small drop beyond a density of 10. This could be due to low-quality (low posterior probability) hypotheses that get included at the larger densities and result in a poorer estimate of the expected BLEU score. On aren and enzh, there are some gains beyond a lattice density of 30. These gains are relatively small and come at the expense of higher memory usage; we therefore work with a lattice density of 30 in all our experiments. We note that Lattice MBR is operating over lattices which are gigantic in comparison to the number of paths in an N-best list. At a lattice density of 30, the lattices in aren contain on average about 10^81 hypotheses!

7.4 Lattice MBR Scale Factor

We next examine the role of the scale factor α in lattice MBR decoding. The MBR scale factor determines the flatness of the posterior distribution (Equation 3). It is chosen using a grid search on the dev2 set (Table 1). Figure 3 shows the variation in BLEU scores on eval08 as this parameter is varied. The results show that it is important to tune this factor. The optimal scale factor is identical for all three language pairs. In experiments not reported in this paper, we have found that the optimal scaling factor on a moderately sized development set carries over to unseen test sets.
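
A sketch of such a grid search; decode_with_alpha and corpus_bleu are hypothetical placeholders for the lattice MBR decoder and the evaluation metric, and the grid itself is an assumption.

    def tune_scale_factor(dev_lattices, dev_refs, decode_with_alpha, corpus_bleu,
                          grid=(0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0)):
        """Pick the alpha of Equation 3 that maximizes corpus BLEU on the dev set."""
        best_alpha, best_bleu = None, float("-inf")
        for alpha in grid:
            hyps = [decode_with_alpha(lat, alpha) for lat in dev_lattices]
            bleu = corpus_bleu(hyps, dev_refs)
            if bleu > best_bleu:
                best_alpha, best_bleu = alpha, bleu
        return best_alpha, best_bleu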

7.5 Maximum n-gram Order

Lattice MBR decoding (Equation 6) involves computing a posterior probability for each n-gram in the lattice. We would like to speed up the Lattice MBR computation (Section 4) by restricting the maximum order of the n-grams in the procedure. The results (Table 5) show that on aren, there is no degradation if we limit the maximum order of the n-grams to 3. However, on zhen/enzh, there is an improvement from considering 4-grams. We can therefore reduce the Lattice MBR computations for aren.

Max n-gram order   aren   zhen   enzh
1                  38.7   26.8   40.0
2                  44.1   27.4   42.2
3                  44.9   28.0   42.4
4                  44.9   28.5   42.6

Table 5: Lattice MBR BLEU (%) as a function of the maximum n-gram order.

8 Discussion

We have presented a procedure for performing Minimum Bayes-Risk decoding on translation lattices. This is a significant development in that the MBR decoder operates over a very large number of translations. In contrast, the current N-best implementation of MBR can be scaled to, at most, a few thousand hypotheses. If the number of hypotheses is greater than, say, 20,000, N-best MBR becomes computationally expensive. The lattice MBR technique is efficient when performed over an enormous number of hypotheses (up to 10^80) since it takes advantage of the compact structure of the lattice. Lattice MBR gives consistent improvements in translation performance over N-best MBR decoding, which is used in many state-of-the-art research translation systems. Moreover, we see gains on three different language pairs.

There are two potential reasons why Lattice MBR decoding could outperform N-best MBR: a larger hypothesis space from which translations could be selected, or a larger evidence space for computing the expected loss. Our experiments show that the main improvement comes from the larger evidence space: a larger set of translations in the lattice provides a better estimate of the expected BLEU score. In other words, the lattice provides a better posterior distribution over translation hypotheses relative to an N-best list. This is a novel insight into the workings of MBR decoding. We believe this could possibly be employed when designing discriminative training approaches for machine translation. More generally, we have found a component in machine translation where the posterior distribution over hypotheses plays a crucial role.


[Figure 2: Lattice MBR BLEU (%) vs. lattice density for aren/zhen/enzh. Each point also shows log_e(average number of paths).]

[Figure 3: Lattice MBR BLEU (%) with various scale factors α for aren/zhen/enzh.]

We have shown the effect of the MBR scaling factor on the performance of lattice MBR. The scale factor determines the flatness of the posterior distribution over translation hypotheses. A scale of 0.0 means a uniform distribution, while 1.0 implies that there is no scaling. This is an important parameter that needs to be tuned on a development set. There has been prior work in MBR speech recognition and machine translation (Goel and Byrne, 2000; Ehling et al., 2007) which has shown the need for tuning this factor. Our MT system parameters are trained with Minimum Error Rate Training, which assigns a very high posterior probability to the MAP translation. As a result, it is necessary to flatten the probability distribution so that MBR decoding can select hypotheses other than the MAP hypothesis.

Our Lattice MBR implementation is made possible by the linear approximation of the BLEU score. This linearization technique has been applied elsewhere when working with BLEU: Smith and Eisner (2006) approximate the expectation of the log BLEU score. In both cases, a linear metric makes it easier to compute the expectation. While we have applied lattice MBR decoding to the approximate BLEU score, we note that our procedure (Section 3) is applicable to other gain functions which can be decomposed as a sum of local gain functions. In particular, our framework might be useful with translation metrics such as TER (Snover et al., 2006) or METEOR (Lavie and Agarwal, 2007).

In contrast to a phrase-based SMT system, a syntax-based SMT system (e.g. Zollmann and Venugopal (2006)) can generate a hypergraph that represents a generalized translation lattice with words and hidden tree structures. We believe that our lattice MBR framework can be extended to such hypergraphs with loss functions that take into account both BLEU scores and parse tree structures.

Lattice- and forest-based search and training procedures are not yet common in statistical machine translation. However, they are promising because the search space of translations is much larger than the typical N-best list (Mi et al., 2008). We hope that our approach will provide some insight into the design of lattice-based search procedures along with the use of non-linear, global loss functions such as BLEU.

References

P. J. Bickel and K. A. Doksum. 1977. Mathematical Statistics: Basic Ideas and Selected Topics. Holden-Day Inc., Oakland, CA, USA.

P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. 1990. A Statistical Approach to Machine Translation. Computational Linguistics, 16(2):79–85.

M. Dreyer, K. Hall, and S. Khudanpur. 2007. Comparing Reordering Constraints for SMT Using Efficient BLEU Oracle Computation. In SSST, NAACL-HLT 2007, pages 103–110, Rochester, NY, USA, April.

N. Ehling, R. Zens, and H. Ney. 2007. Minimum Bayes Risk Decoding for BLEU. In ACL 2007, pages 101–104, Prague, Czech Republic, June.

V. Goel and W. Byrne. 2000. Minimum Bayes-Risk Automatic Speech Recognition. Computer Speech and Language, 14(2):115–135.

V. Goel. 2001. Minimum Bayes-Risk Automatic Speech Recognition. Ph.D. thesis, Johns Hopkins University, Baltimore, MD, USA.

J. Goodman. 1996. Parsing Algorithms and Metrics. In ACL, pages 177–183, Santa Cruz, CA, USA.

L. Karttunen, J-P. Chanod, G. Grefenstette, and A. Schiller. 1996. Regular Expressions for Language Engineering. Natural Language Engineering, 2:305–328.

P. Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In EMNLP, Barcelona, Spain.

S. Kumar and W. Byrne. 2002. Minimum Bayes-Risk Word Alignments of Bilingual Texts. In EMNLP, pages 140–147, Philadelphia, PA, USA.

S. Kumar and W. Byrne. 2004. Minimum Bayes-Risk Decoding for Statistical Machine Translation. In HLT-NAACL, pages 169–176, Boston, MA, USA.

A. Lavie and A. Agarwal. 2007. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In SMT Workshop, ACL, pages 228–231, Prague, Czech Republic.

W. Macherey, F. Och, I. Thayer, and J. Uszkoreit. 2008. Lattice-based Minimum Error Rate Training for Statistical Machine Translation. In EMNLP, Honolulu, Hawaii, USA.

E. Matusov, R. Zens, and H. Ney. 2004. Symmetric Word Alignments for Statistical Machine Translation. In COLING, Geneva, Switzerland.

H. Mi, L. Huang, and Q. Liu. 2008. Forest-Based Translation. In ACL, Columbus, OH, USA.

M. Mohri. 1997. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, 23(3).

M. Mohri. 2002. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350.

NIST. 2002-2008. The NIST Machine Translation Evaluations. http://www.nist.gov/speech/tests/mt/.

F. Och and H. Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.

F. Och and H. Ney. 2004. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics, 30(4):417–449.

F. Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In ACL, Sapporo, Japan.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. BLEU: a Method for Automatic Evaluation of Machine Translation. Technical Report RC22176 (W0109-022), IBM Research Division.

A. Sixtus and S. Ortmanns. 1999. High Quality Word Graphs Using Forward-Backward Pruning. In ICASSP, Phoenix, AZ, USA.

D. Smith and J. Eisner. 2006. Minimum Risk Annealing for Training Log-Linear Models. In ACL, Sydney, Australia.

D. Smith and N. Smith. 2007. Probabilistic Models of Nonprojective Dependency Trees. In EMNLP-CoNLL, Prague, Czech Republic.

M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In AMTA, Boston, MA, USA.

I. Titov and J. Henderson. 2006. Loss Minimization in Parse Reranking. In EMNLP, Sydney, Australia.

N. Ueffing, F. Och, and H. Ney. 2002. Generation of Word Graphs in Statistical Machine Translation. In EMNLP, Philadelphia, PA, USA.

H. Zhang and D. Gildea. 2008. Efficient Multi-pass Decoding for Synchronous Context Free Grammars. In ACL, Columbus, OH, USA.

A. Zollmann and A. Venugopal. 2006. Syntax Augmented Machine Translation via Chart Parsing. In HLT-NAACL, New York, NY, USA.
