Transcript of arXiv:1803.06966v2 [cs.CL] 18 Apr 2018

Polyglot Semantic Parsing in APIs

Kyle Richardson†, Jonathan Berant‡, Jonas Kuhn†

†Institute for Natural Language Processing, University of Stuttgart, Germany
{kyle,jonas}@ims.uni-stuttgart.de

‡Tel-Aviv University, Israel
[email protected]

Abstract

Traditional approaches to semantic parsing (SP) work by training individual models for each available parallel dataset of text-meaning pairs. In this paper, we explore the idea of polyglot semantic translation, or learning semantic parsing models that are trained on multiple datasets and natural languages. In particular, we focus on translating text to code signature representations using the software component datasets of Richardson and Kuhn (2017a,b). The advantage of such models is that they can be used for parsing a wide variety of input natural languages and output programming languages, or mixed input languages, using a single unified model. To facilitate modeling of this type, we develop a novel graph-based decoding framework that achieves state-of-the-art performance on the above datasets, and apply this method to two other benchmark SP tasks.

1 Introduction

Recent work by Richardson and Kuhn (2017a,b) and Miceli Barone and Sennrich (2017) considers the problem of translating source code documentation to lower-level code template representations as part of an effort to model the meaning of such documentation. Example documentation for a number of programming languages is shown in Figure 1, where each docstring description in red describes a given function (blue) in the library. While capturing the semantics of docstrings is in general a difficult task, learning the translation from descriptions to formal code representations (e.g., formal representations of functions) is proposed as a reasonable first step towards learning more general natural language understanding models in the software domain. Under this approach, one can view a software library, or API, as a kind of parallel translation corpus for studying text → code or code → text translation.

1. (en, Java) Documentation:
   * Returns the greater of two long values
   public static long max(long a, long b)

2. (en, Python) Documentation:
   max(self, a, b):
   """Compares two values numerically and returns the maximum"""

3. (en, Haskell) Documentation:
   -- | "The largest element of a non-empty structure"
   maximum :: forall a. Ord a => t a -> a

4. (de, PHP) Documentation:
   * gibt den größeren dieser Werte zurück. ("returns the greater of these values")
   max (mixed $value1, mixed $value2)

Figure 1: Example source code documentation.

Richardson and Kuhn (2017b) extracted the standard library documentation for 10 popular programming languages across a number of natural languages to study the problem of text to function signature translation. Initially, these datasets were proposed as a resource for studying semantic parser induction (Mooney, 2007), or for building models that learn to translate text to formal meaning representations from parallel data. In follow-up work (Richardson and Kuhn, 2017a), they proposed using the resulting models to do automated question-answering (QA) and code retrieval on target APIs, and experimented with an additional set of software datasets built from 27 open-source Python projects.

As traditionally done in SP (Zettlemoyer and Collins, 2012), their approach involves learning individual models for each parallel dataset or language pair, e.g., (en, Java), (de, PHP), and (en, Haskell). Looking again at Figure 1, we notice that while programming languages differ in terms of representation conventions, there is often overlap between the functionality implemented and naming in these different languages (e.g., the max function), and redundancy in the associated linguistic descriptions. In addition, each English description (Figure 1.1-1.3) describes max differently using the synonyms greater, maximum, largest. In this case, it would seem that training models on multiple datasets, as opposed to single language pairs, might make learning more robust, and help to capture various linguistic alternatives.

With the software QA application in mind, an additional limitation is that their approach does not allow one to freely translate a given description to multiple output languages, which would be useful for comparing how different programming languages represent the same functionality. The model also cannot translate between natural languages and programming languages that are not observed during training. While software documentation is easy to find in bulk, if a particular API is not already documented in a language other than English (e.g., Haskell in de), it is unlikely that such a translation will appear without considerable effort by experienced translators. Similarly, many individual APIs may be too small or poorly documented to build individual models or QA applications, and will in some way need to bootstrap off of more general models or resources.

To deal with these issues, we aim to learn more general text-to-code translation models that are trained on multiple datasets simultaneously. Our ultimate goal is to build polyglot translation models (cf. Johnson et al. (2016)), or models with shared representations that can translate any input text to any output programming language, regardless of whether such language pairs were encountered explicitly during training. Inherent in this task is the challenge of building an efficient polyglot decoder, or a translation mechanism that allows such crossing between input and output languages. A key challenge is ensuring that such a decoder generates well-formed code representations, which is not guaranteed when one simply applies standard decoding strategies from SMT and neural MT (cf. Cheng et al. (2017)). Given our ultimate interest in API QA, such a decoder must also facilitate monolingual translation, or being able to translate to specific output languages as needed.

To solve the decoding problem, we introduce a new graph-based decoding and representation framework that reduces to solving shortest path problems in directed graphs. We investigate several translation models that work within this framework, including traditional SMT models and models based on neural networks, and report state-of-the-art results on the technical documentation task of Richardson and Kuhn (2017b,a). To show the applicability of our approach to more conventional SP tasks, we apply our methods to the GeoQuery domain (Zelle and Mooney, 1996) and the Sportscaster corpus (Chen et al., 2010). These experiments also provide insight into the main technical documentation task and highlight the strengths and weaknesses of the various translation models being investigated.

2 Related Work

Our approach builds on the baseline models introduced in Richardson and Kuhn (2017b) (see also Deng and Chrupała (2014)). Their work is positioned within the broader SP literature, where traditionally SMT (Wong and Mooney, 2006a) and parsing (Zettlemoyer and Collins, 2009) methods are used to study the problem of translating text to formal meaning representations, usually centering around QA applications (Berant et al., 2013). More recently, there has been interest in using neural network approaches either in place of (Dong and Lapata, 2016; Kocisky et al., 2016) or in combination with (Misra and Artzi, 2016; Jia and Liang, 2016; Cheng et al., 2017) these traditional models; we look at the latter idea in this paper.

Work in NLP on software documentation has accelerated in recent years due in large part to the availability of new data resources through websites such as StackOverflow and GitHub (cf. Allamanis et al. (2017)). Most of this recent work focuses on processing large amounts of API data in bulk (Gu et al., 2016; Miceli Barone and Sennrich, 2017), either for learning longer executable programs from text (Yin and Neubig, 2017; Rabinovich et al., 2017), or solving the inverse problem of code-to-text generation (Iyer et al., 2016; Richardson et al., 2017). In contrast to our work, these studies do not look explicitly at translating to target APIs, or at non-English documentation.

The idea of polyglot modeling has gained some traction in recent years for a variety of problems (Tsvetkov et al., 2016) and has appeared within work in SP under the heading of multilingual SP (Jie and Lu, 2014; Duong et al., 2017). A related topic is learning from multiple knowledge sources or domains (Herzig and Berant, 2017), which is related to our idea of learning from multiple APIs. When building models that can translate between unobserved language pairs, we use the term zero-shot translation from Johnson et al. (2016).

3 Baseline Semantic Translator

Problem Formulation. Throughout the paper, we refer to target code representations as API components. In all cases, components will consist of formal representations of functions, or function signatures (e.g., long max(int a, int b)), which include a function name (max), a sequence of arguments (int a, int b), and other information such as a return value (long) and namespace (for more details, see Richardson (2018)). For a given API dataset D = {(x_i, z_i)}_{i=1}^{n} of size n, the goal is to learn a model that can generate exactly a correct component sequence z = (z_1, ..., z_{|z|}), within a finite space C of signatures (i.e., the space of all defined functions), for each input text sequence x = (x_1, ..., x_{|x|}). This involves learning a probability distribution p(z | x). As such, one can think of this underlying problem as a constrained MT task.

In this section, we describe the baseline approach of Richardson and Kuhn (2017b). Technically, their approach has two components: a simple word-based translation model and a task-specific decoder, which is used to generate a k-best list of candidate component representations for a given input x. They then use a discriminative model to rerank the translation output using additional non-word level features. The goal in this section is to provide the technical details of their translation approach, which we improve in Section 4.

3.1 Word-based Translation Model

The translation models investigated in Richardson and Kuhn (2017b) use a noisy-channel formulation where p(z | x) ∝ p(x | z)p(z) via Bayes' rule. By assuming a uniform prior on output components, p(z), the model therefore involves estimating p(x | z), which under a word-translation model is computed as p(x | z) = ∑_{a ∈ A} p(x, a | z), where the summation ranges over the set of all many-to-one word alignments A from x → z, with |A| equal to (|z| + 1)^{|x|}. They investigate various types of sequence-based alignment models (Och and Ney, 2003), and find that the classic IBM Model 1 outperforms more complex word models. This model factors in the following way and assumes an independent word generation process:

    p(x | z) = (1 / |A|) ∏_{j=1}^{|x|} ∑_{i=0}^{|z|} p_t(x_j | z_i)    (1)

where each p_t defines a multinomial distribution over a given component term z for all words x.
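
For concreteness, the following is a minimal Python sketch of the Equation 1 computation; the dictionary-based p_t lookup and the explicit <NULL> token (the z_0 position) are illustrative conventions, not details from the paper:

    def model1_score(x_words, z_words, pt):
        # Sketch of Equation 1: p(x|z) under IBM Model 1, including the
        # NULL word (z_0) and the alignment normalizer |A| = (|z|+1)^|x|.
        # `pt` is a (x_word, z_word) -> probability dict; in practice
        # p_t would be estimated off-line with EM.
        z_null = ["<NULL>"] + list(z_words)
        norm = float(len(z_null)) ** len(x_words)
        prob = 1.0
        for x in x_words:
            prob *= sum(pt.get((x, z), 1e-12) for z in z_null)
        return prob / norm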

The decoding problem for the above translation model involves finding the most likely output z, which requires solving an argmax_z over Equation 1. In the general case, this problem is known to be NP-complete for the models under consideration (Knight, 1999), largely due to the large space of possible predictions z. Richardson and Kuhn (2017b) avoid these issues by exploiting the finiteness of the target component search space (an idea we also pursue here and discuss more below), and describe a constrained decoding algorithm that runs in time O(|C| log |C|). While this works well for small APIs, it becomes less feasible when dealing with large sets of APIs, as in the polyglot case, or with more complex semantic languages typically used in SP (Liang, 2013).

4 Shortest Path Framework

To improve the baseline translation approach used previously (Section 3.1), we pursue a graph-based approach. Given the formulation above and the finiteness of our prediction space C, our approach exploits the fact that we can represent the complete component search space for any set of APIs as a directed acyclic finite-state automaton (DAFSA), such as the one shown graphically in Figure 2. The underlying graph is constructed by concatenating all of the component representations for each API of interest and applying standard finite-state construction and minimization techniques (Mohri, 1996). Each path in the resulting compact automaton is therefore a well-formed component representation.

Using an idea from Johnson et al. (2016), we add to each component representation an artificial token that identifies the output programming language or library. For example, the two edges from the initial state 0 in Figure 2 are labeled as 2C and 2Clojure, which identify the C and Clojure programming languages respectively. All paths starting from the right of these edges are therefore valid paths in each respective programming language. The paths starting from the initial state 0, in contrast, correspond to all valid component representations in all languages.
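
As an illustration of the underlying construction, here is a minimal Python sketch that builds a prefix-sharing graph over identifier-prefixed component sequences; the paper instead compiles the full list into a minimized DAFSA with finite-state tools (see Section 5.3), so this toy version, which only shares prefixes, is an assumption for illustration:

    def build_component_graph(components):
        # Build a prefix-sharing DAG (a trie) over tokenized component
        # sequences; state 0 is the initial state, `finals` collects
        # accepting states. A real DAFSA would also merge suffixes.
        edges, finals, next_state = {}, set(), 1
        for seq in components:
            state = 0
            for tok in seq:
                if (state, tok) not in edges:
                    edges[(state, tok)] = next_state
                    next_state += 1
                state = edges[(state, tok)]
            finals.add(state)
        return edges, finals

    # Each component is prefixed with its artificial language identifier:
    graph, finals = build_component_graph([
        ["2C", "numeric", "math", "ceil", "arg"],
        ["2Clojure", "algo", "math", "ceil", "x"],
    ])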

Figure 2: A DAFSA representation for a portion of the component sequence search space C that includes math functions in C and Clojure, and an example path/translation (in bold): 2C numeric math ceil arg.

Decoding reduces to the problem of finding a path for a given text input x. For example, given the input the ceiling of a number, we would want to find the paths corresponding to the component translations numeric math ceil arg (in C) and algo math ceil x (in Clojure) in the graph shown in Figure 2. Using the trick above, our setup facilitates both monolingual decoding, i.e., generating components specific to a particular output language (e.g., the C language via the path shown in bold), and polyglot decoding, i.e., generating any output language by starting at the initial state 0 (e.g., C and Clojure).

We formulate the decoding problem using a variant of the well-known single source shortest path (SSSP) algorithm for directed acyclic graphs (DAGs) (Johnson, 1977). This involves a graph G = (V, E) (nodes V and labeled edges E, see the graph in Figure 2), and taking an off-line topological sort of the graph's vertices. Using a data structure d ∈ R^{|V|} (initialized as ∞^{|V|}, as shown in Figure 2), the standard SSSP algorithm (which is the forward update variant of the Viterbi algorithm (Huang, 2008)) works by searching forward through the graph in sorted order and finding for each node v an incoming labeled edge u, with label z, that solves the following recurrence:

    d(v) = min_{(u,z) : (u,v,z) ∈ E} { d(u) + w(u, v, z) }    (2)

where d(u) is the shortest path score from a unique source node b to the incoming node u (computed recursively) and w(u, v, z) is the weight of the particular labeled edge. The weight of the resulting shortest path is commonly taken to be the sum of the path edge weights as given by w, and the output translation is the sequence of labels associated with each edge. This algorithm runs in linear time over the size of the graph's adjacency matrix (Adj) and can be extended to find k SSSPs. In the standard case, a weighting function w is provided by assuming a static weighted graph. In our translation context, we replace w with a translation model, which is used to dynamically generate edge weights during the SSSP search for each input x by scoring the translation between x and each edge label z encountered.

Algorithm 1 Lexical Shortest Path Search
Input: Input x of size n, DAG G = (V, E), lexical translation function p_t, source node b with initial score o.
Output: Shortest component path
 1: d[V[G]] ← ∞, π[V[G]] ← Nil, d[b] ← o
 2: s[V[G], n] ← 0.0                          ▷ Shortest path sums at each node
 3: for each vertex u ≥ b ∈ V[G] in sorted order do
 4:   for each vertex and label (v, z) ∈ Adj[u] do
 5:     score ← −log [ ∏_{i}^{n} ( p_t(x_i | z) + s[u, i] ) ]
 6:     if d[v] > score then
 7:       d[v] ← score, π[v] ← u
 8:       for i in 1, ..., n do               ▷ Update scores
 9:         s[v, i] ← p_t(x_i | z) + s[u, i]
10: return FINDPATH(π, |V|, b)
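
A minimal Python sketch of the Equation 2 recurrence with a static weight function; the adjacency-dict representation and the assumption that states are numbered in topological order are illustrative conventions:

    def dag_sssp(n_states, adj, weight, source=0):
        # Forward-update SSSP over a DAG whose states are numbered in
        # topological order. `adj` maps u -> [(v, z), ...]; `weight` is
        # w(u, v, z), static here but replaced by a translation model
        # in our setting.
        INF = float("inf")
        d = [INF] * n_states
        back = [None] * n_states      # back-pointers for path recovery
        d[source] = 0.0
        for u in range(source, n_states):
            if d[u] == INF:
                continue
            for v, z in adj.get(u, []):
                cand = d[u] + weight(u, v, z)
                if cand < d[v]:
                    d[v], back[v] = cand, (u, z)
        return d, back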

Given this general framework, many different translation models can be used for scoring. In what follows, we describe two types of decoders based on lexical translation (or unigram) and neural sequence models. Technically, each decoding algorithm involves modifying the standard SSSP search procedure by adding an additional data structure s to each node (see Figure 2), which is used to store information about translations (e.g., running lexical translation scores, RNN state information) associated with particular shortest paths. By using these two very different models, we can get insight into the challenges associated with the technical documentation translation task. As we show in Section 6, each model achieves varying levels of success when subjected to a wider range of SP tasks, which reveals differences between our task and other SP tasks.

4.1 Lexical Translation Shortest Path

In our first model, we use the lexical translation model and probability function p_t in Equation 1 as the weighting function, which can be learned efficiently off-line using the EM algorithm. When attempting to use the SSSP procedure to compute this equation for a given source input x, we immediately have the problem that such a computation requires a complete component representation z (Knight and Al-Onaizan, 1998). We use an approximation (details are provided as supplementary material) that involves ignoring the normalizer |A| and exploiting the word independence assumption of the model, which allows us to incrementally compute translation scores for individual source words given output translations corresponding to shortest paths during the SSSP search.

The full decoding algorithm is shown in Algorithm 1, which modifies the standard SSSP search as presented in Cormen et al. (2009). The main modification involves adding a data structure s ∈ R^{|V| × |x|} (initialized as 0.0 at line 2) that stores a running sum of source word scores given the best translations at each node, which can be used for computing the inner sum in Equation 1. For example, given an input utterance ceiling function, s_6 in Figure 2 contains the independent translation scores for the words ceiling and function given the edge label numeric and p_t. Later on in the search, these scores are used to compute s_7, which will provide translation scores for each word given the edge sequence numeric math. Taking the product over any given s_j (as done in line 5 of Algorithm 1 to compute score) will give the probability of the shortest path translation at that particular point j. Here, the transformation into −log space is used to find the minimum incoming path. Standardly, the data structure π can be used to retrieve the shortest path back to the source node b (done via the FINDPATH method).
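
The following Python sketch mirrors Algorithm 1, extending the basic SSSP sketch above with the running per-word sums s; as before, the dictionary-based p_t and the topological state numbering are illustrative assumptions:

    import math

    def lexical_shortest_path(x_words, adj, n_states, pt, source=0):
        # Forward SSSP where each node additionally stores per-word sums
        # s[u][i], so the Model 1 inner sum of Equation 1 is computed
        # incrementally (normalizer |A| ignored, as in the text).
        n, INF = len(x_words), float("inf")
        d = [INF] * n_states
        back = [None] * n_states
        s = [[0.0] * n for _ in range(n_states)]
        d[source] = 0.0
        for u in range(source, n_states):
            if d[u] == INF:
                continue
            for v, z in adj.get(u, []):
                sums = [pt.get((w, z), 1e-12) + s[u][i]
                        for i, w in enumerate(x_words)]
                score = -sum(math.log(t) for t in sums)
                if d[v] > score:
                    d[v], back[v] = score, (u, z)
                    s[v] = sums
        # follow `back` from a final state to recover the shortest path
        return d, back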

4.2 Neural Shortest Path

Our second set of models uses neural networks to compute the weighting function in Equation 2. We use an encoder-decoder model with global attention (Bahdanau et al., 2014; Luong et al., 2015), which has the following two components:

Encoder Model. The first is an encoder network, which uses a bi-directional recurrent neural network architecture with LSTM units (Hochreiter and Schmidhuber, 1997) to compute a sequence of forward annotations or hidden states (→h_1, ..., →h_{|x|}) and a sequence of backward hidden states (←h_1, ..., ←h_{|x|}) for the input sequence (x_1, ..., x_{|x|}). Standardly, each word is then represented as the concatenation of its forward and backward states: h_j = [→h_j ; ←h_j].

Decoder Model. The second component is a decoder network, which directly computes the conditional distribution p(z | x) as follows:

    log p(z | x) = ∑_{i=1}^{|z|} log p_Θ(z_i | z_{<i}, x)    (3)

    p_Θ(z_i | z_{<i}, x) = softmax(f(Θ, z_{<i}, x))    (4)

where f is a non-linear function that encodes information about the sequence z_{<i} and the input x given the model parameters Θ. We can think of this model as an ordinary recurrent language model that is additionally conditioned on the input x using information from our encoder. We implement the function f in the following way:

    f(Θ, z_{<i}, x) = W_o η_i + b_o    (5)

    η_i = MLP(c_i, g_i)    (6)

    g_i = LSTM_dec(g_{i−1}, E_out[z_{i−1}], c_i)    (7)

where MLP is a multi-layer perceptron model with a single hidden layer, E_out ∈ R^{|Σ_dec| × e} is a randomly initialized embedding matrix, g_i is the decoder's hidden state at step i, and c_i is a context vector that encodes information about the input x and the encoder annotations. Each context vector c_i in turn is a weighted sum of each annotation h_j against an attention vector α_{i,j}, or c_i = ∑_{j=1}^{|x|} α_{i,j} h_j, which is jointly learned using an additional single-layered multi-layer perceptron defined in the following way:

    α_{i,j} ∝ exp(e_{i,j}),  e_{i,j} = MLP(g_{i−1}, h_j)    (8)
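
A small NumPy sketch of the attention computation in Equation 8 and the context vector c_i; W_att and v_att are hypothetical parameters of the one-layer attention MLP:

    import numpy as np

    def attention_context(g_prev, H, W_att, v_att):
        # Score each encoder annotation h_j against the previous decoder
        # state g_{i-1} with a one-layer MLP, normalize with a softmax,
        # and take the weighted sum as the context vector c_i.
        e = np.array([v_att @ np.tanh(W_att @ np.concatenate([g_prev, h]))
                      for h in H])           # e_{i,j} = MLP(g_{i-1}, h_j)
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                 # α_{i,j} ∝ exp(e_{i,j})
        return alpha @ H, alpha              # c_i = Σ_j α_{i,j} h_j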

Lexical Bias and Copying. In contrast to standard MT tasks, we are dealing with a relatively low-resource setting where the sparseness of the target vocabulary is an issue. For this reason, we experimented with integrating lexical translation scores using a biasing technique from Arthur et al. (2016). Their method is based on the following computation for each token z_i:

    bias_i = P′ α_i,  where P′ ∈ R^{|Σ_dec| × |x|} with P′[k, j] = p_{t′}(z_k | x_j), and α_i = (α_{i,1}, ..., α_{i,|x|})^⊤

Algorithm 2 Neural Shortest Path Search
Input: Input x, DAG G, neural parameters Θ and non-linear function f, beam size l, source node b with initial score o.
Output: Shortest component path
 1: d[V[G]] ← ∞, d[b] ← o, π[V[G]] ← Nil
 2: s[V[G]] ← Nil                             ▷ Path state information
 3: s[b] ← InitState()                        ▷ Initialize source state
 4: for each vertex u ≥ b ∈ V[G] in sorted order do
 5:   if isinf(d[u]) then continue
 6:   p ← s[u]                                ▷ Current state at node u, or z_{<i}
 7:   L[l] ← argmax_{(v_1,...,v_k) ∈ Adj[u]} softmax(f(Θ, p, x))   ▷ top-l edges
 8:   for each vertex and label (v, z) ∈ L do
 9:     score ← −log p_Θ(z | p, x) + d[u]
10:    if d[v] > score then
11:      d[v] ← score, π[v] ← u
12:      s[v] ← UpdateState(p, z)
13: return FINDPATH(π, |V|, b)

The matrix P′ uses the inverse (p_{t′}) of the lexical translation function p_t already introduced to compute the probability of each word in the target vocabulary Σ_dec (the rows) given each word in the input x (the columns); it is then weighted by the attention vector from Equation 8. bias_i is then used to modify Equation 5 in the following way:

    f_bias(Θ, z_{<i}, x) = W_o η_i + b_o + log(bias_i + ε)

where ε is a hyper-parameter that helps to preserve numerical stability and biases more heavily on the lexical model when set lower.
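
A small NumPy sketch of the biasing computation; the matrix P_inv of inverse lexical translation scores is assumed to be precomputed:

    import numpy as np

    def biased_logits(logits, alpha, P_inv, eps=1e-5):
        # Weight the inverse lexical probabilities p_t'(z_k | x_j)
        # (rows: target words, columns: input words) by the current
        # attention alpha_i, then add the result to the decoder logits
        # in log space; eps is the stability hyper-parameter above.
        bias = P_inv @ alpha          # bias_i, one score per target word
        return logits + np.log(bias + eps)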

We also experiment with the copying mechanism from Jia and Liang (2016), which works by allowing the decoder to choose from a set of latent actions, a_j, that includes writing target words according to Equation 5, as done standardly, or copying source words from x, or copy[x_i], according to the attention scores in Equation 8. A distribution is then computed over these actions using a softmax function and particular actions are chosen accordingly during training and decoding.

Decoding and Learning. The full decoding procedure is shown in Algorithm 2, which again modifies the standard SSSP search. We change the data structure s to contain the decoder's RNN state at each node. We also modify the scoring (line 7, which uses Equation 4) to consider only the top l edges or translations at that point, as opposed to imposing a full search. When l is set to 1, for example, the procedure does a greedy search through the graph, whereas when l is large the procedure is closer to a full search.

In general terms, the decoder described above works like an ordinary neural decoder, with the difference that each decision (i.e., each new target-side word translation) is constrained (in line 7) by the transitions allowed in the underlying graph in order to ensure the well-formedness of each component output. Standardly, we optimize these models using stochastic gradient descent with the objective of finding parameters Θ that minimize the negative conditional log-likelihood of the training dataset.
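
A small sketch of the top-l restriction in line 7 of Algorithm 2; the vocab_index helper, mapping edge labels to rows of the decoder's output layer, is an assumed convenience:

    import numpy as np

    def top_l_edges(logits, out_edges, vocab_index, l=5):
        # Softmax the decoder scores over the full vocabulary, restrict
        # them to the edge labels actually leaving this node, and keep
        # only the l best candidates for expansion.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        scored = sorted(((probs[vocab_index[z]], v, z)
                         for v, z in out_edges), reverse=True)
        return scored[:l]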

4.3 Monolingual vs. Polyglot Decoding

Our framework facilitates both monolingual and polyglot decoding. In the first case, the decoder requires a graph associated with the output semantic language (more details in the next section) and a trained translation model. The latter case requires taking the union of all datasets and graphs (with artificial identifier tokens) for a collection of target datasets and training a single model over this global dataset. In this setting, we can then decode to a particular language using the language identifiers or decode without specifying the output language. The main focus in this paper is investigating polyglot decoding, and in particular the effect of training models on multiple datasets when translating to individual APIs or SP datasets.

When evaluating our models and building QA applications, it is important to be able to generate the k best translations. This can easily be done in our framework by applying standard k-SSSP algorithms (Brander and Sinclair, 1995). We use an implementation of the algorithm of Yen (1971), which works on top of the SSSP algorithms introduced above by iteratively finding deviating or branching paths from an initial SSSP (more details are provided in the supplementary materials).
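
As an aside, off-the-shelf implementations of Yen's algorithm exist; for instance, networkx's shortest_simple_paths enumerates simple paths in increasing weight order. The toy graph below is illustrative; in our setting the edge weights would come from the translation model's −log scores for a given input:

    from itertools import islice
    import networkx as nx

    # Build a small weighted DAG and take the 3 best paths from 0 to 4.
    G = nx.DiGraph()
    G.add_weighted_edges_from([
        (0, 1, 1.2), (0, 2, 0.4), (1, 3, 0.3), (2, 3, 1.0), (3, 4, 0.1),
    ])
    k_best = list(islice(
        nx.shortest_simple_paths(G, 0, 4, weight="weight"), 3))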

5 Experiments

We experimented with two main types of resources: 45 API documentation datasets and two multilingual benchmark SP datasets. In the former case, our main objective is to test whether training polyglot models (shown as polyglot in Tables 1-2) on multiple datasets leads to an improvement when compared to training individual monolingual models (shown as monolingual in Tables 1-2). Experiments involving the latter datasets are meant to test the applicability of our general graph and polyglot method to related SP tasks, and are also used for comparison against our main technical documentation task.

Figure 3: Test Acc@1 for the best monolingual models (in yellow/left) compared with the best lexical polyglot model (green/right) across all 45 technical documentation datasets.

5.1 Datasets

Technical API Docs. The first dataset includes the Stdlib and Py27 datasets of Richardson and Kuhn (2017b,a), which are publicly available via Richardson (2017). Stdlib consists of short description and function signature pairs for 10 programming languages in 7 natural languages, and Py27 contains the same type of data for 27 popular Python projects in English mined from GitHub. We also built new datasets from the Japanese translation of the Python 2.7 standard library, as well as the Lua stdlib documentation in a mixture of Russian, Portuguese, German, Spanish and English.

Taken together, these resources consist of 79,885 training pairs, and we experiment with training models on Stdlib and Py27 separately as well as together (shown as + more in Table 1). We use a BPE subword encoding (Sennrich et al., 2015) of both input and output words to make the representations more similar, and transliterated all datasets (excluding the Japanese datasets) to an 8-bit Latin encoding. Graphs were built by concatenating all function representations into a single word list and compiling this list into a minimized DAFSA. For our global polyglot dataset, this resulted in a graph with 218,505 nodes, 313,288 edges, and 112,107 paths or component representations over an output vocabulary of 9,324 words.

Mixed GeoQuery and Sportscaster. We run experiments on the GeoQuery 880 corpus using the splits from Andreas et al. (2013), which includes geography queries for English, Greek, Thai, and German paired with formal database queries, as well as a seed lexicon or NP list for each language. In addition to training models on each individual dataset, we also learn polyglot models trained on all datasets concatenated together. We also created a new mixed language test set that was built by replacing NPs in 803 test examples with one or more NPs from a different language using the NP lists mentioned above (see examples in Figure 4). The goal in the last case is to test our model's ability to handle mixed language input. We also ran monolingual experiments on the English Sportscaster corpus, which contains human generated soccer commentary paired with symbolic meaning representations produced by a simulation of four games.

For GeoQuery graph construction, we built a single graph for all languages by extracting general rule templates from all representations in the dataset, and exploited additional information and patterns using the Geobase database and the semantic grammars used in Wong and Mooney (2006b). This resulted in a graph with 2,419 nodes, 4,936 edges and 39,482 paths over an output vocabulary of 164 words. For Sportscaster, we directly translated the semantic grammar provided in Chen and Mooney (2008) to a DAFSA, which resulted in a graph with 98 nodes, 86 edges and 830 paths.

5.2 Experimental Setup

For the technical datasets, the goal is to see if our model generates correct signature representations from unobserved descriptions using exact match. We follow exactly the experimental setup and data splits from Richardson and Kuhn (2017b), and measure the accuracy at 1 (Acc@1), the accuracy in the top 10 (Acc@10), and MRR.
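
For reference, a minimal sketch of how these metrics can be computed from ranked k-best output lists (an illustration, not the paper's evaluation code):

    def kbest_metrics(kbest_lists, golds):
        # Acc@1, Acc@10, and MRR from ranked k-best candidate lists
        # (best candidate first), scored by exact match against gold.
        acc1 = acc10 = mrr = 0.0
        for kbest, gold in zip(kbest_lists, golds):
            if gold in kbest:
                rank = kbest.index(gold) + 1
                acc1 += rank == 1
                acc10 += rank <= 10
                mrr += 1.0 / rank
        n = float(len(golds))
        return acc1 / n, acc10 / n, mrr / n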

For the GeoQuery and Sportscaster experiments, the goal is to see if our models can generate correct meaning representations for unseen input. For GeoQuery, we follow Andreas et al. (2013) in evaluating extrinsically by checking that each representation evaluates to the same answer as the gold representation when executed against the Geobase database. For Sportscaster, we evaluate by exact match to a gold representation.

             Method                      Acc@1  Acc@10   MRR

Stdlib
  monolingual  RK Trans + rerank         29.9    69.2   43.1
  polyglot     Lexical SP                33.2    70.7   45.9
  polyglot     Lexical SP + more         33.1    69.7   45.5
  polyglot     Neural SP + bias          12.1    34.3   19.5
  polyglot     Neural SP + copy bias     13.9    36.5   21.5

Py27
  monolingual  RK Trans + rerank         32.4    73.5   46.5
  polyglot     Lexical SP                41.3    77.7   54.1
  polyglot     Lexical SP + more         40.5    76.7   53.1
  polyglot     Neural SP + bias           8.7    25.5   14.2
  polyglot     Neural SP + copy bias      9.0    26.9   15.1

Table 1: Test results on the Stdlib and Py27 tasks averaged over all datasets and compared against the best monolingual results from Richardson and Kuhn (2017b,a), or RK.

5.3 Implementation and Model Details

We use the Foma finite-state toolkit of Hulden (2009) to construct all graphs used in our experiments. We also use the Cython version of DyNet (Neubig et al., 2017) to implement all the neural models (see the supplementary materials for more details).

In the results tables, we refer to the lexical and neural models introduced in Section 4 as Lexical Shortest Path and Neural Shortest Path, where models that use copying (+ copy) and lexical biasing (+ bias) are marked accordingly. We also experimented with adding a discriminative reranker to our lexical models (+ rerank), using the approach from Richardson and Kuhn (2017b), which uses additional lexical (e.g., word match and alignment) features and other phrase-level and syntax features. The goal here is to see if these additional (mostly non-word level) features help improve on the baseline lexical models.

6 Results and Discussion

Technical Documentation Results. Table 1 shows the results for Stdlib and Py27. In the monolingual case, we compare against the best performing models in Richardson and Kuhn (2017b,a). As summarized in Figure 3, our experiments show that training polyglot models on multiple datasets can lead to large improvements over training individual models, especially on the Py27 datasets, where using a polyglot model resulted in a nearly 9% average increase in accuracy @1. In both cases, however, the best performing lexical models are those trained only on the datasets they are evaluated on, as opposed to training on all datasets (i.e., + more). This is surprising given that training on all datasets doubles the size of the training data, and shows that adding more data does not necessarily boost performance when the additional data is from another distribution.

             Method                                 Acc@1  Acc@10

Standard GeoQuery
  monolingual  UBL (Kwiatkowski et al., 2010)        74.2      –
  monolingual  TreeTrans (Jones et al., 2012)        76.8      –
  monolingual  nHT (Susanto and Lu, 2017)            83.3      –
  monolingual  Lexical Shortest Path                 68.6   92.4
  monolingual  Lexical Shortest Path + rerank        74.2   94.1
  monolingual  Neural Shortest Path                  73.5   91.1
  monolingual  Neural Shortest Path + bias           78.0   92.8
  monolingual  Neural Shortest Path + copy bias      77.8   92.1
  polyglot     Lexical Shortest Path                 67.3   92.9
  polyglot     Lexical Shortest Path + rerank        75.2   94.7
  polyglot     Neural Shortest Path                  78.0   91.4
  polyglot     Neural Shortest Path + bias           78.9   91.7
  polyglot     Neural Shortest Path + copy bias      79.6   91.9

Mixed GeoQuery
  polyglot     Best Monolingual Model                 4.2   18.2
  polyglot     Lexical Shortest Path + rerank        71.1   94.3
  polyglot     Neural Shortest Path + copy bias      75.2   90.0

Sportscaster
  monolingual  PCFG (Börschinger et al., 2011)       74.2      –
  monolingual  wo-PCFG (Börschinger et al., 2011)    86.0      –
  monolingual  Lexical Shortest Path                 40.3   86.8
  monolingual  Lexical Shortest Path + rerank        70.3   90.2
  monolingual  Neural Shortest Path                  81.9   94.8
  monolingual  Neural Shortest Path + bias           83.4   93.9
  monolingual  Neural Shortest Path + copy bias      83.3   90.5

Table 2: Test results for the standard (above) and mixed (middle) GeoQuery tasks averaged over all languages, and results for the English Sportscaster task (below).

The neural models are strongly outperformed by all other models both in the monolingual and polyglot case (only the latter results are shown), even when lexical biasing is applied. While surprising, this is consistent with other studies on low-resource neural MT (Zoph et al., 2016; Östling and Tiedemann, 2017), where datasets of comparable size to ours (e.g., 1 million tokens or less) typically fail against classical SMT models. This result has also been found in relation to neural AMR semantic parsing, where similar issues of sparsity are encountered (Peng et al., 2017). Even doubling the amount of training data by training on all datasets (results not shown) did not improve the accuracy, suggesting that much more data is needed (more discussion below).

Beyond increases in accuracy, our polyglot models support zero-shot translation, as shown in Figure 4, which can be used for translating between unobserved language pairs (e.g., (es, Clojure) and (ru, Haskell), as shown in 1-2), or for finding related functionality across different software projects (as shown in 3). These results were obtained by running our decoder model without specifying the output language. We note, however, that the decoder can be constrained to selectively translate to any specific programming language or project (e.g., in a QA setting). Future work will further investigate the decoder's polyglot capabilities, which are currently hard to evaluate since we do not have an annotated set of function equivalences between different APIs.

1. Source API (stdlib): (es, PHP)
   Input: Devuelve el mensaje asociado al objeto lanzado. ("Returns the message associated with the thrown object.")
   Output:
     Language: PHP      Function Translation: public string Throwable::getMessage ( void )
     Language: Java     Function Translation: public String lang.getMessage( void )
     Language: Clojure  Function Translation: (tools.logging.fatal throwable message & more)

2. Source API (stdlib): (ru, PHP)
   Input: konvertiruet stroku iz formata UTF-32 v format UTF-16. ("Converts a string from UTF-32 format to UTF-16 format.")
   Output:
     Language: PHP      Function Translation: string PDF_utf32_to_utf16 ( ... )
     Language: Ruby     Function Translation: String#toutf16 => string
     Language: Haskell  Function Translation: Encoding.encodeUtf16LE :: Text -> ByteString

3. Source API (py): (en, stats)
   Input: Compute the Moore-Penrose pseudo-inverse of a matrix.
   Output:
     Project: sympy     Function Translation: matrices.matrix.base.pinv_solve( B, ... )
     Project: sklearn   Function Translation: utils.pinvh( a, cond=None, rcond=None, ... )
     Project: stats     Function Translation: tools.pinv2( a, cond=None, rcond=None )

4. Mixed GeoQuery (de/gr)
   Input: Wie hoch liegt der höchstgelegene Punkt in Αλαμπάμα? ("How high is the highest point in Alabama?")
   Logical Form Translation: answer(elevation_1(highest(place(loc_2(stateid('alabama'))))))

Figure 4: Examples of zero-shot translation when running in polyglot mode (1-3, function representations shown in a conventionalized format), and mixed language parsing (4).

Semantic Parsing Results. SP results are summarized in Table 2. In contrast to the technical documentation task, here the neural models, especially those with biasing and copying, strongly outperform all other models and are competitive with related work. In the GeoQuery case, we compare against two classic grammar-based models, UBL and TreeTrans, as well as a feature-rich neural hybrid tree model (nHT). We also see that the polyglot GeoQuery model achieves the best performance, demonstrating that training on multiple datasets helps in this domain as well. In the Sportscaster case we compare against two PCFG learning approaches; the second and best performing model (wo-PCFG) involves a grammar that encodes complex word-order constraints.

The real advantage of training a polyglot model is shown in the results related to mixed language parsing (i.e., the middle set of results). Here we compared against the best performing monolingual English model (Best Monolingual Model), which clearly does not have a way to deal with multilingual NPs. We also find the neural model to be more robust than the lexical models with reranking.

While the lexical models overall perform poorly on both tasks, the weakness of this model is particularly acute in the Sportscaster case. We found that mistakes are largely related to the ordering of arguments, which these lexical (unigram) models are blind to. That these models still perform reasonably well on the Geo task shows that such ordering issues are less of a factor in this domain.

Discussion. Having results across related SP tasks allows us to reflect on the nature of the main technical documentation task. Consistent with recent findings (Dong and Lapata, 2016), we show that relatively simple neural sequence models are competitive with, and in some cases outperform, traditional grammar-based SP methods on benchmark SP tasks. However, this result is not observed in our technical documentation task, in part because this problem is much harder for neural learners given the sparseness of the target data and the lack of redundancy. For this reason, we believe our datasets provide new challenges for neural-based SP, and serve as a cautionary tale about the scalability and applicability of commonly used neural models to lower-resource SP problems.

In general, we believe that focusing on polyglot and mixed language decoding is not only of interest to applications (e.g., mixed language API QA) but also allows for new forms of SP evaluation that are more revealing than translation accuracy alone. When comparing the accuracy of the best monolingual Geo model and the worst performing neural polyglot model, one could mistakenly think that these models have equal abilities, though the polyglot model is much more robust and general. Moving forward, we hope that our work helps to motivate more diverse evaluations of this type.

7 Conclusion

In this paper, we look at learning from multiple API libraries and datasets in the context of learning to translate text to code representations and other SP tasks. To support polyglot modeling of this type, we developed a novel graph-based decoding method and experimented with various SMT and neural MT models that work in this framework. We report a mixture of positive results specific to each task and set of models using this polyglot modeling idea, some of which reveal interesting limitations of different approaches to SP. We also introduced two new API datasets, and a mixed language version of GeoQuery, that will be released to facilitate further work on polyglot SP.

Acknowledgements

This work was funded by the Deutsche Forschungsgemeinschaft (DFG) via SFB 732, project D2. We thank Alex Fraser for helpful feedback on an earlier draft.

References

Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2017. A Survey of Machine Learning for Big Code and Naturalness. arXiv preprint arXiv:1709.06182.

Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic Parsing as Machine Translation. In Proceedings of ACL-2013, pages 47–52.

Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating Discrete Translation Lexicons into Neural Machine Translation. arXiv preprint arXiv:1606.02006.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of EMNLP-2013, pages 1533–1544.

Benjamin Börschinger, Bevan K. Jones, and Mark Johnson. 2011. Reducing Grounded Learning Tasks to Grammatical Inference. In Proceedings of EMNLP-2011, pages 1416–1425.

A. W. Brander and M. C. Sinclair. 1995. A Comparative Study of k-Shortest Path Algorithms. In Proceedings of the 11th UK Performance Engineering Workshop.

David L. Chen, Joohyun Kim, and Raymond J. Mooney. 2010. Training a Multilingual Sportscaster: Using Perceptual Context to Learn Language. Journal of Artificial Intelligence Research 37:397–435.

David L. Chen and Raymond J. Mooney. 2008. Learning to Sportscast: A Test of Grounded Language Acquisition. In Proceedings of ICML-2008, pages 128–135.

Jianpeng Cheng, Siva Reddy, Vijay Saraswat, and Mirella Lapata. 2017. Learning Structured Natural Language Representations for Semantic Parsing. arXiv preprint arXiv:1704.08387.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms. MIT Press.

Huijing Deng and Grzegorz Chrupała. 2014. Semantic Approaches to Software Component Retrieval with English Queries. In Proceedings of LREC-2014, pages 441–450.

Li Dong and Mirella Lapata. 2016. Language to Logical Form with Neural Attention. arXiv preprint arXiv:1601.01280.

Long Duong, Hadi Afshar, Dominique Estival, Glen Pink, Philip Cohen, and Mark Johnson. 2017. Multilingual Semantic Parsing and Code-Switching. In Proceedings of CoNLL-2017, page 379.

Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API Learning. arXiv preprint arXiv:1605.08535.

Jonathan Herzig and Jonathan Berant. 2017. Neural Semantic Parsing over Multiple Knowledge-Bases. arXiv preprint arXiv:1702.01569.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9(8).

Liang Huang. 2008. Advanced Dynamic Programming in Semiring and Hypergraph Frameworks. In Proceedings of COLING-2008 (tutorial notes).

Mans Hulden. 2009. Foma: a Finite-State Compiler and Library. In Proceedings of EACL.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In Proceedings of ACL.

Robin Jia and Percy Liang. 2016. Data Recombination for Neural Semantic Parsing. arXiv preprint arXiv:1606.03622.

Zhanming Jie and Wei Lu. 2014. Multilingual Semantic Parsing: Parsing Multiple Languages into Semantic Representations. In Proceedings of COLING-2014, pages 1291–1301.

Donald B. Johnson. 1977. Efficient Algorithms for Shortest Paths in Sparse Networks. Journal of the ACM 24(1):1–13.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2016. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. arXiv preprint arXiv:1611.04558.

Bevan Keeley Jones, Mark Johnson, and Sharon Goldwater. 2012. Semantic Parsing with Bayesian Tree Transducers. In Proceedings of ACL-2012, pages 488–496.

Kevin Knight. 1999. Decoding Complexity in Word-Replacement Translation Models. Computational Linguistics 25(4):607–615.

Kevin Knight and Yaser Al-Onaizan. 1998. Translation with Finite-state Devices. In Proceedings of AMTA.

Tomas Kocisky, Gabor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. 2016. Semantic Parsing with Semi-Supervised Sequential Autoencoders. In Proceedings of EMNLP-2016, pages 1078–1087.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing Probabilistic CCG Grammars from Logical Form with Higher-order Unification. In Proceedings of EMNLP-2010, pages 1223–1233.

Percy Liang. 2013. Lambda Dependency-Based Compositional Semantics. arXiv preprint arXiv:1309.4408.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-Based Neural Machine Translation. arXiv preprint arXiv:1508.04025.

Antonio Valerio Miceli Barone and Rico Sennrich. 2017. A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation. arXiv preprint arXiv:1707.02275.

Dipendra Kumar Misra and Yoav Artzi. 2016. Neural Shift-Reduce CCG Semantic Parsing. In Proceedings of EMNLP-2016, pages 1775–1786.

Mehryar Mohri. 1996. On Some Applications of Finite-State Automata Theory to Natural Language Processing. Natural Language Engineering 2(1):61–80.

Raymond Mooney. 2007. Learning for Semantic Parsing. In Proceedings of CICLING.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The Dynamic Neural Network Toolkit. arXiv preprint arXiv:1701.03980. https://github.com/clab/dynet.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics 29(1):19–51.

Robert Östling and Jörg Tiedemann. 2017. Neural Machine Translation for Low-resource Languages. arXiv preprint arXiv:1708.05729.

Xiaochang Peng, Chuan Wang, Daniel Gildea, and Nianwen Xue. 2017. Addressing the Data Sparsity Issue in Neural AMR Parsing. In Proceedings of ACL.

Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract Syntax Networks for Code Generation and Semantic Parsing. In Proceedings of ACL.

Kyle Richardson. 2017. Code-Datasets. https://github.com/yakazimir/Code-Datasets.

Kyle Richardson. 2018. A Language for Function Signature Representations. arXiv preprint arXiv:1804.00987.

Kyle Richardson and Jonas Kuhn. 2017a. Function Assistant: A Tool for NL Querying of APIs. In Proceedings of EMNLP.

Kyle Richardson and Jonas Kuhn. 2017b. Learning Semantic Correspondences in Technical Documentation. In Proceedings of ACL.

Kyle Richardson, Sina Zarrieß, and Jonas Kuhn. 2017. The Code2Text Challenge: Text Generation in Source Code Libraries. In Proceedings of INLG.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural Machine Translation of Rare Words with Subword Units. arXiv preprint arXiv:1508.07909.

Raymond Hendy Susanto and Wei Lu. 2017. Semantic Parsing with Neural Hybrid Trees. In Proceedings of AAAI-2017, pages 3309–3315.

Yulia Tsvetkov, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W. Black, Lori Levin, and Chris Dyer. 2016. Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning. arXiv preprint arXiv:1605.03832.

Yuk Wah Wong and Raymond J. Mooney. 2006a. Learning for Semantic Parsing with Statistical Machine Translation. In Proceedings of HLT-NAACL-2006, pages 439–446.

Yuk Wah Wong and Raymond J. Mooney. 2006b. Learning for Semantic Parsing with Statistical Machine Translation. In Proceedings of NAACL.

Jin Y. Yen. 1971. Finding the k Shortest Loopless Paths in a Network. Management Science 17(11):712–716.

Pengcheng Yin and Graham Neubig. 2017. A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of ACL.

John M. Zelle and Raymond J. Mooney. 1996. Learning to Parse Database Queries using Inductive Logic Programming. In Proceedings of AAAI-1996, pages 1050–1055.

Luke S. Zettlemoyer and Michael Collins. 2009. Learning Context-Dependent Mappings from Sentences to Logical Form. In Proceedings of ACL-2009, pages 976–984.

Luke S. Zettlemoyer and Michael Collins. 2012. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. arXiv preprint arXiv:1207.1420.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of ACL.