
Proceedings of NAACL-HLT 2018, pages 987–998, New Orleans, Louisiana, June 1–6, 2018. ©2018 Association for Computational Linguistics

Universal Dependency Parsing for Hindi-English Code-switching

Irshad Ahmad Bhat
LTRC, IIIT-H, Hyderabad, India
[email protected]

Riyaz Ahmad Bhat
Interaction Labs, Bangalore, India
[email protected]

Manish Shrivastava
LTRC, IIIT-H, Hyderabad, India
[email protected]

Dipti Misra Sharma
LTRC, IIIT-H, Hyderabad, India
[email protected]

Abstract

Code-switching is a phenomenon of mixing grammatical structures of two or more languages under varied social constraints. Code-switching data differ so radically from the benchmark corpora used in the NLP community that the application of standard technologies to these data degrades their performance sharply. Unlike standard corpora, these data often need to go through additional processes such as language identification, normalization and/or back-transliteration for their efficient processing. In this paper, we investigate these indispensable processes and other problems associated with syntactic parsing of code-switching data and propose methods to mitigate their effects. In particular, we study dependency parsing of code-switching data of Hindi and English multilingual speakers from Twitter. We present a treebank of Hindi-English code-switching tweets under the Universal Dependencies scheme and propose a neural stacking model for parsing that efficiently leverages the part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks. We also present normalization and back-transliteration models with a decoding process tailored for code-switching data. Results show that our neural stacking parser is 1.5% LAS points better than the augmented parsing model, and that our decoding process improves results by 3.8% LAS points over first-best normalization and/or back-transliteration.

1 Introduction

Code-switching¹ (henceforth CS) is the juxtaposition, within the same speech utterance, of grammatical units such as words, phrases, and clauses

¹ Code-mixing is another term in the linguistics literature used interchangeably with code-switching. Both terms are often used to refer to the same or similar phenomenon of mixed language use.

belonging to two or more different languages (Gumperz, 1982). The phenomenon is prevalent in multilingual societies where speakers share more than one language and is often prompted by multiple social factors (Myers-Scotton, 1995). Moreover, code-switching is most prominent in colloquial language use in daily conversations, both online and offline.

Most of the benchmark corpora used in NLP for training and evaluation are based on edited monolingual texts which strictly adhere to the norms of a language related, for example, to orthography, morphology, and syntax. Social media data in general, and CS data in particular, deviate from these norms implicitly set forth by the choice of corpora used in the community. This is why current technologies often perform poorly on social media data, be it monolingual or mixed-language data (Solorio and Liu, 2008b; Vyas et al., 2014; Çetinoğlu et al., 2016; Gimpel et al., 2011; Owoputi et al., 2013; Kong et al., 2014). CS data offer additional challenges over monolingual social media data, as code-switching transforms the data in many ways, for example by creating new lexical forms and syntactic structures that mix the morphology and syntax of two languages, making them much more diverse than any monolingual corpus (Çetinoğlu et al., 2016). As current computational models fail to cater to the complexities of CS data, there is often a need for dedicated techniques tailored to its specific characteristics.

Given the peculiar nature of CS data, it has been widely studied in the linguistics literature (Poplack, 1980; Gumperz, 1982; Myers-Scotton, 1995), and more recently there has been a surge in studies concerning CS data in NLP as well (Solorio and Liu, 2008a,b; Vyas et al., 2014; Sharma et al., 2016; Rudra et al., 2016; Joshi et al., 2016; Bhat et al., 2017; Chandu et al., 2017; Rijhwani et al.,


2017; Guzman et al., 2017, and others). Besides the individual computational works, a series of shared tasks and workshops on preprocessing and shallow syntactic analysis of CS data have also been conducted at multiple venues such as Empirical Methods in NLP (EMNLP 2014 and 2016), the International Conference on NLP (ICON 2015 and 2016) and the Forum for Information Retrieval Evaluation (FIRE 2015 and 2016). Most of these works have addressed preliminary tasks such as language identification, normalization and/or back-transliteration, as CS data often need to go through these additional processes for their efficient processing. In this paper, we investigate these indispensable processes and other problems associated with syntactic parsing of code-switching data and propose methods to mitigate their effects. In particular, we study dependency parsing of Hindi-English code-switching data of multilingual Indian speakers from Twitter. Hindi-English code-switching presents an interesting scenario for the parsing community. Mixing among typologically diverse languages intensifies structural variations, which makes parsing more challenging. For example, there will be many sentences containing: (1) both SOV and SVO word orders², (2) both head-initial and head-final genitives, (3) both prepositional and postpositional phrases, etc. More importantly, neither the Hindi nor the English treebank provides any training instance for these mixed structures within individual sentences. In this paper, we present the first code-switching treebank that provides the syntactic annotations required for parsing mixed-grammar syntactic structures. Moreover, we present a parsing pipeline designed explicitly for Hindi-English CS data. The pipeline comprises several modules such as a language identification system, a back-transliteration system, and a dependency parser. The gist of these modules and our overall research contributions are listed as follows:

• back-transliteration and normalization models based on encoder-decoder frameworks with sentence decoding tailored for code-switching data;

• a dependency treebank of Hindi-English code-switching tweets under the Universal Dependencies scheme; and

² Order of Subject, Object and Verb in transitive sentences.

• a neural parsing model which learns POS tagging and parsing jointly and also incorporates knowledge from the monolingual treebanks using neural stacking.

2 Preliminary Tasks

As preliminary steps before parsing CS data, we need to identify the language of tokens and normalize and/or back-transliterate them to enhance parsing performance. These steps are indispensable for processing CS data; without them, performance drops drastically, as we will see in the Results section. We need normalization of non-standard word forms and back-transliteration of Romanized Hindi words to address the out-of-vocabulary problem and the lexical and syntactic ambiguity introduced by contracted word forms. As we train separate normalization and back-transliteration models for Hindi and English, we need language identification to select which model to use for inference on each word form. Moreover, we also need language information for decoding the best word sequences.

2.1 Language Identification

For the language identification task, we train a multilayer perceptron (MLP) stacked on top of a recurrent bidirectional LSTM (Bi-LSTM) network, as shown in Figure 1.

Figure 1: Language identification network.

We represent each token by a concatenated vector of its English embedding, back-transliterated Hindi embedding, character Bi-LSTM embedding and flag embedding (an English dictionary flag and a word length flag with length bins of 0-3, 4-6, 7-10, and 10+). These concatenated vectors are passed to a Bi-LSTM network to generate a sequence of hidden representations which encode the contextual information spread across the sentence. Finally, the output layer uses a feed-forward neural network with a softmax function to produce a probability distribution over the language tags. We train the network on our CS training set concatenated with the dataset provided in the ICON 2015³ shared task (728 Facebook comments) on language identification, and evaluate it on the datasets from Bhat et al. (2017). We achieve state-of-the-art performance on both development and test sets (Bhat et al., 2017). The results are shown in Table 1.

Label                Precision   Recall   F1-Score   Count
hi                   97.76       98.09    97.92      1465
en                   96.87       98.83    97.84      1283
ne                   94.33       79.17    86.08       168
acro                 92.00       76.67    83.64        30
univ                 99.71      100.00    99.86       349
average              97.39       97.42    97.36      3295
(Bhat et al., 2017)  -           96.10    -             -

Table 1: Language identification results on the CS test set.
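As a concrete illustration of the non-lexical flag features described above, the following sketch computes the English-dictionary flag and the length-bin flag for each token. The tiny dictionary is a toy stand-in; the actual model uses a full English lexicon and feeds these flags through learned embeddings.

```python
# Sketch of the "flag" features: an English-dictionary flag and a
# word-length flag bucketed into the bins 0-3, 4-6, 7-10 and 10+.
# ENGLISH_DICT is a toy stand-in for a real English lexicon.
ENGLISH_DICT = {"anyone", "tell", "me", "account", "please"}

def length_bin(word):
    """Map a token to one of four length bins: 0-3, 4-6, 7-10, 10+."""
    n = len(word)
    if n <= 3:
        return 0
    if n <= 6:
        return 1
    if n <= 10:
        return 2
    return 3

def flag_features(word):
    """Return (in_english_dict, length_bin) for one token."""
    return (int(word.lower() in ENGLISH_DICT), length_bin(word))

features = [flag_features(w) for w in "cn anyone tel me plz".split()]
```

In the network these two integers index small randomly initialized embedding tables, whose vectors are concatenated with the word and character representations.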

2.2 Normalization and Back-transliteration

We learn two separate but similar character-level models for normalization-cum-transliteration of noisy Romanized Hindi words and normalization of noisy English words. We treat both normalization and back-transliteration as a general sequence-to-sequence learning problem. In general, our goal is to learn a mapping from non-standard English and Romanized Hindi word forms to standard forms in their respective scripts. In the case of Hindi, we address normalization and back-transliteration of Romanized Hindi words using a single model. We use the attention-based encoder-decoder model of Luong et al. (2015) with global attention for learning. For Hindi, we train the model on the transliteration pairs (87,520) from the Libindic transliteration project⁴ and Brahmi-Net (Kunchukuttan et al., 2015), which are further augmented with noisy transliteration pairs (175,668) for normalization. Similarly, for normalization of noisy English words, we train the model on noisy word forms (429,715) synthetically generated from the English vocabulary. We use simple rules such as dropping non-initial vowels and replacing consonants based on their phonological proximity to generate synthetic data for

³ http://ltrc.iiit.ac.in/icon2015/
⁴ https://github.com/libindic/indic-trans

normalization. Figure 2 shows some of the noisy forms generated from standard word forms using simple and finite rules, which include vowel elision (please → pls), interchanging similar consonants and vowels (cousin → couzin), replacing consonant or vowel clusters with a single letter (twitter → twiter), etc. From here onwards, we will refer to both normalization and back-transliteration as normalization.


Figure 2: Synthetic normalization pairs generated for a sample of English words using hand-crafted rules.
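A minimal sketch of such hand-crafted noising rules is shown below. The rule set is illustrative, not the authors' exact one: vowel elision, collapsing doubled letters, and one phonologically close consonant swap.

```python
import re

# Illustrative noising rules in the spirit described above.
def noisy_forms(word):
    forms = set()
    # vowel elision: drop non-initial vowels (please -> pls)
    forms.add(word[0] + re.sub(r"[aeiou]", "", word[1:]))
    # collapse doubled letters (twitter -> twiter)
    forms.add(re.sub(r"(.)\1", r"\1", word))
    # s/z interchange (cousin -> couzin)
    forms.add(word.replace("s", "z"))
    forms.discard(word)  # keep only genuinely noisy variants
    return forms
```

Applying several such rules, alone and in combination, over the whole English vocabulary yields the kind of synthetic (noisy, standard) training pairs shown in Figure 2.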

At inference time, our normalization models predict the most likely word form for each input word. However, the single-best output from a model may not always be the best option considering the overall sentential context. Contracted word forms in social media content are quite often ambiguous and can represent different standard word forms. For example, the noisy form 'pt' can expand to different standard word forms such as 'put', 'pit', 'pat', 'pot' and 'pet'. The choice of word depends solely on the sentential context. To select contextually relevant forms, we use an exact search over the n-best normalizations from the respective models, extracted using beam-search decoding. The best word sequence is selected using Viterbi decoding over the b^n possible word sequences, scored by a trigram language model, where b is the beam width and n is the sentence length. The language models are trained on monolingual Hindi and English data using the KenLM toolkit (Heafield et al., 2013). For each word, we extract the five best normalizations (b = 5). Decoding the best word sequence is a non-trivial problem for CS data due to the lack of normalized and back-transliterated CS data for training a language model. One obvious solution is to apply decoding to the individual language fragments of a CS sentence (Dutta et al., 2015). One major problem with this approach is that the language models used for scoring are trained on complete sentences but are applied to sentence fragments. Scoring individual CS fragments often leads to wrong word selection due to incomplete context, particularly at fragment peripheries. We solve this problem by using a 3-step decoding process that works on two separate versions of a CS sentence, one in Hindi and one in English. In the first step, we replace the first-best back-transliterated forms of Hindi words with their translation equivalents using a Hindi-English bilingual lexicon.⁵ An exact search is used over the top 5 normalizations of the English words, the translation equivalents of the Hindi words, and the actual words themselves. In the second step, we decode the best word sequence over the Hindi version of the sentence by replacing the best English word forms decoded in the first step with their translation equivalents. An exact search is used over the top 5 normalizations of the Hindi words, the dictionary equivalents of the decoded English words, and the original words. In the final step, English and Hindi words are selected from their respective decoded sequences using the language tags predicted by the language identification system. Note that the bilingual mappings are only used to aid the decoding process by making the CS sentences lexically monolingual so that the monolingual language models can be used for scoring; they are not used in the final decoded output. The overall decoding process is shown in Figure 3.

Figure 3: The 3-step decoding process for the sentence "Yar cn anyone tel me k twitr account bnd ksy krty hn plz" (Friend, can anyone tell me how to close a Twitter account, please).
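The Viterbi search over per-token candidate normalizations described above can be sketched as follows. The toy scoring function stands in for the KenLM trigram model; with a trigram LM, the Viterbi state is the pair of preceding words.

```python
# Sketch of Viterbi decoding over per-token candidate normalizations.
# candidates: one list of candidate word forms per position.
# lm_logprob(w, p2, p1): log P(w | p2, p1), a stand-in for a trigram LM.
def viterbi_decode(candidates, lm_logprob):
    best = {("<s>", "<s>"): (0.0, [])}  # state -> (score, sequence)
    for cands in candidates:
        new_best = {}
        for (p2, p1), (score, seq) in best.items():
            for w in cands:
                s = score + lm_logprob(w, p2, p1)
                key = (p1, w)  # new state: previous two words
                if key not in new_best or s > new_best[key][0]:
                    new_best[key] = (s, seq + [w])
        best = new_best
    return max(best.values())[1]

# toy LM: rewards "<s> please" and "please put", penalizes everything else
def toy_lm(w, p2, p1):
    good = {("<s>", "please"), ("please", "put")}
    return 0.0 if (p1, w) in good else -1.0

best_seq = viterbi_decode([["poles", "please"], ["pit", "put", "pot"]], toy_lm)
```

With b = 5 candidates per token, the state space stays small (at most b states per previous word), so this search is far cheaper than enumerating all b^n sequences.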

Both of our normalization and back-transliteration systems are evaluated on the

⁵ An off-the-shelf MT system would have been appropriate for this task; however, we would first need to adapt it to CS data, which is itself a non-trivial task.

evaluation set of Bhat et al. (2017). Results of our systems are reported in Table 2, with a comparison of accuracies based on the nature of the decoding used. The results clearly show the significance of our 3-step decoding over first-best and fragment-wise decoding.

           Hindi                              English
Data-set   Tokens   FB      FW      3-step    Tokens   FB      FW      3-step
Dev        1549     82.82   87.28   90.01     34       82.35   88.23   88.23
Test       1465     83.54   88.19   90.64     28       71.42   75.21   81.71

Table 2: Normalization accuracy based on the number of noisy tokens in the evaluation set. FB = first best, FW = fragment-wise.

3 Universal Dependencies for Hindi-English

Recently, Bhat et al. (2017) provided a CS dataset for the evaluation of their parsing models, which they trained on the Hindi and English Universal Dependencies (UD) treebanks. We extend this dataset by annotating 1,448 more sentences. Following Bhat et al. (2017), we first sampled CS data from a large set of tweets of Indian language users that we crawled from Twitter using Tweepy⁶, a Twitter API wrapper. We then used a language identification system trained on the ICON dataset (see Section 2) to filter Hindi-English CS tweets from the crawled Twitter data. Only those tweets were selected that satisfied a minimum code-switching ratio of 30:70 (%). From this dataset, we manually selected 1,448 tweets for annotation. The selected tweets were thoroughly checked for their code-switching ratio. For POS tagging and dependency annotation, we used version 2 of the Universal Dependencies guidelines (De Marneffe et al., 2014),

⁶ http://www.tweepy.org/


while language tags are assigned based on the tag set defined in Solorio et al. (2014) and Jamatia et al. (2015). The dataset was annotated by two expert annotators who have been associated with annotation projects involving syntactic annotation for around 10 years. Nonetheless, we also ensured the quality of the manual annotations by carrying out an inter-annotator agreement analysis. We randomly selected a dataset of 150 tweets which were annotated by both annotators for both POS tagging and dependency structures. The inter-annotator agreement is 96.20% accuracy for POS tagging, and 95.94% UAS and 92.65% LAS for dependency parsing.

We use our dataset for training, while the development and evaluation sets from Bhat et al. (2017) are used for tuning and evaluating our models. Since the annotations in these datasets follow version 1.4 of the UD guidelines, we converted them to version 2 using carefully designed rules. Statistics of the data are given in Table 3.

Data-set   Sentences   Tokens   Hi      En      Ne    Univ    Acro
Train      1,448       20,203   8,363   8,270   698   2,730   142
Dev        225         3,411    1,549   1,300   151   379     32
Test       225         3,295    1,465   1,283   168   349     30

Table 3: Data statistics. The Dev set is used for tuning model parameters, while the Test set is used for evaluation.

4 Dependency Parsing

We adapt the transition-based parser of Kiperwasser and Goldberg (2016) as our base model and incorporate POS tag and monolingual parse tree information into the model using neural stacking, as shown in Figures 4 and 6.

4.1 Parsing Algorithm

Our parsing models are based on an arc-eager transition system (Nivre, 2003). The arc-eager system defines a set of configurations for a sentence w1,...,wn, where each configuration C = (S, B, A) consists of a stack S, a buffer B, and a set of dependency arcs A. For each sentence, the parser starts with an initial configuration where S = [ROOT], B = [w1,...,wn] and A = ∅, and terminates with a configuration C if the buffer is empty and the stack contains the ROOT. The parse trees derived from transition sequences are given by A. To derive the parse tree, the arc-eager system defines four types of transitions (t): Shift, Left-Arc, Right-Arc, and Reduce.
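A minimal sketch of the four arc-eager transitions is given below, with unlabeled arcs and without the preconditions a full parser would enforce (e.g. Reduce only when the stack top already has a head).

```python
# Minimal sketch of the arc-eager transition system with unlabeled arcs.
# A configuration is (stack, buffer, arcs); arcs is a set of
# (head, dependent) pairs and 0 is the ROOT.
def shift(stack, buf, arcs):
    return stack + [buf[0]], buf[1:], arcs

def left_arc(stack, buf, arcs):
    # stack top becomes a dependent of the buffer front, then is popped
    return stack[:-1], buf, arcs | {(buf[0], stack[-1])}

def right_arc(stack, buf, arcs):
    # buffer front becomes a dependent of the stack top, then is pushed
    return stack + [buf[0]], buf[1:], arcs | {(stack[-1], buf[0])}

def reduce_(stack, buf, arcs):
    # pop the stack top (it must already have a head)
    return stack[:-1], buf, arcs

# toy run over a 3-word sentence where word 2 heads words 1 and 3
config = ([0], [1, 2, 3], set())
config = shift(*config)      # push word 1
config = left_arc(*config)   # 1 <- 2
config = right_arc(*config)  # ROOT -> 2
config = right_arc(*config)  # 2 -> 3
stack, buf, arcs = config
```

In the actual parser, the choice among these four transitions at each configuration is scored by the neural network described in Section 4.2.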

We use the training-by-exploration method of Goldberg and Nivre (2012) for decoding a transition sequence, which helps mitigate error propagation at evaluation time. We also use the pseudo-projective transformations of Nivre and Nilsson (2005) to handle the higher percentage of non-projective arcs in the CS data (∼2%). We use the most informative scheme, head+path, to store the transformation information.


Figure 4: POS tagging and parsing network based on the stack-propagation model proposed by Zhang and Weiss (2016).

4.2 Base Models

Our base model is a stack of a tagger network and a parser network, inspired by the stack-propagation model of Zhang and Weiss (2016). The parameters of the tagger network are shared and act as a regularization on the parsing model. The model is trained by minimizing a joint negative log-likelihood loss for both tasks. Unlike Zhang and Weiss (2016), we compute the gradients of the log-loss function simultaneously for each training instance. While the parser network is updated given the parsing loss only, the tagger network is updated with respect to both tagging and parsing losses. Both the tagger and parser networks comprise an input layer, a feature layer, and an output layer, as shown in Figure 4. Following Zhang and Weiss (2016), we refer to this model as stack-prop.
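The joint objective can be sketched as the sum of the two negative log-likelihoods; the softmax and NLL helpers below are generic illustrations, not the authors' code.

```python
import numpy as np

# Sketch of the joint objective: the sum of the tagging and parsing
# negative log-likelihoods for one training instance.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nll(logits, gold):
    return -np.log(softmax(logits)[gold])

def joint_loss(tag_logits, gold_tag, parse_logits, gold_transition):
    # Gradients of this sum reach the shared tagger parameters through
    # both terms; parser-only parameters see only the parsing term.
    return nll(tag_logits, gold_tag) + nll(parse_logits, gold_transition)

# uniform logits: 2 POS tags and 4 arc-eager transitions
loss = joint_loss(np.zeros(2), 0, np.zeros(4), 1)
```

Because the shared Bi-LSTM sits below both output layers, backpropagating this summed loss is what makes the tagger sensitive to downstream parsing errors.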


Tagger network: The input layer of the tagger encodes each input word in a sentence by concatenating a pre-trained word embedding with its character embedding given by a character Bi-LSTM. In the feature layer, the concatenated word and character representations are passed through two stacked Bi-LSTMs to generate a sequence of hidden representations which encode the contextual information spread across the sentence. The first Bi-LSTM is shared with the parser network, while the other is specific to the tagger. Finally, the output layer uses a feed-forward neural network with a softmax function to produce a probability distribution over the Universal POS tags. We only use the forward and backward hidden representations of the focus word for classification.

Parser network: Similar to the tagger network, the input layer encodes the input sentence using word and character embeddings, which are then passed to the shared Bi-LSTM. The hidden representations from the shared Bi-LSTM are then concatenated with the dense representations from the feed-forward network of the tagger and passed through the Bi-LSTM specific to the parser. This ensures that the tagging network is penalized for parsing errors caused by error propagation, by back-propagating the gradients to the shared tagger parameters (Zhang and Weiss, 2016). Finally, we use a non-linear feed-forward network to predict the labeled transitions for the parser configurations. From each parser configuration, we extract the top node in the stack and the first node in the buffer and use their hidden representations from the parser-specific Bi-LSTM for classification.

dis rat ki barish alwayz scares me .
This night of rain always scares me .

Figure 5: Code-switching tweet showing grammatical fragments from Hindi and English.

4.3 Stacking Models

It seems reasonable that limited CS data would complement the large monolingual data in parsing CS data, and that a parsing model which leverages both would significantly improve parsing performance. While a parsing model trained on our limited CS data might not be enough to accurately parse the individual grammatical fragments of Hindi and English, the preexisting Hindi and

English treebanks are large enough to provide sufficient annotations to capture their structure. Similarly, parsing models trained on the Hindi and English data may not be able to properly connect the divergent fragments of the two languages, as these models lack evidence for such mixed structures in the monolingual data. This will happen quite often, as Hindi and English are typologically very diverse (see Figure 5).


Figure 6: Neural stacking-based parsing architecture for incorporating monolingual syntactic knowledge.

As discussed above, we adapted feature-level neural stacking (Zhang and Weiss, 2016; Chen et al., 2016) for joint learning of POS tagging and parsing. Similarly, we also adapt this stacking approach for incorporating monolingual syntactic knowledge into the base CS model. Recently, Wang et al. (2017) used neural stacking for injecting syntactic knowledge of English into a graph-based Singlish parser, which led to significant improvements in parsing performance. Unlike Wang et al. (2017), our base stacked models allow us to transfer POS tagging knowledge as well as parse tree knowledge.

As shown in Figure 6, we transfer both POS tagging and parsing information from the source


model trained on the augmented Hindi and English data. For tagging, we augment the input layer of the CS tagger with the MLP layer of the source tagger. For transferring parsing knowledge, hidden representations from the parser-specific Bi-LSTM of the source parser are augmented with the input layer of the CS parser, which already includes the hidden layer of the CS tagger and the word and character embeddings. In addition, we also add the MLP layer of the source parser to the MLP layer of the CS parser. The MLP layers of the source parser are generated using raw features from the CS parser configurations. Apart from the addition of these learned representations from the source model, the overall CS model remains similar to the base model shown in Figure 4. The tagging and parsing losses are back-propagated by traversing back the forward paths to all trainable parameters in the entire network for training, and the whole network is used collectively for inference.

5 Experiments

We train all of our POS tagging and parsing models on the training sets of the Hindi and English UD-v2 treebanks and our Hindi-English CS treebank. For tuning and evaluation, we use the development and evaluation sets from Bhat et al. (2017). We conduct multiple experiments in gold and predicted settings to measure the effectiveness of the sub-modules of our parsing pipeline. In the predicted settings, we use POS taggers separately trained on the Hindi, English and CS training sets. All of our models use word embeddings from transformed Hindi and English embedding spaces to address the problem of lexical differences prevalent in CS sentences.

5.1 Hyperparameters

Word representations: For the language identification, POS tagging and parsing models, we include the lexical features in the input layer of our neural networks using 64-dimensional pre-trained word embeddings, while we use randomly initialized embeddings within a range of [−0.1, +0.1] for non-lexical units such as POS tags and dictionary flags. We use 32-dimensional character embeddings for all three models and 32-dimensional POS tag embeddings for the pipelined parsing models. The distributed representations of the Hindi and English vocabularies are learned separately from the Hindi and English monolingual corpora. The

English monolingual data contains around 280Msentences, while the Hindi data is comparativelysmaller and contains around 40M sentences. Theword representations are learned using Skip-grammodel with negative sampling which is imple-mented in word2vec toolkit (Mikolov et al.,2013). We use the projection algorithm of Artetxeet al. (2016) to transform the Hindi and En-glish monolingual embeddings into same semanticspace using a bilingual lexicon (∼63,000 entries).The bilingual lexicon is extracted from ILCI andBojar Hindi-English parallel corpora (Jha, 2010;Bojar et al., 2014). For normalization models,we use 32-dimensional character embeddings uni-formly initialized within a range of [−0.1,+0.1].
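Artetxe et al. (2016) constrain the bilingual mapping to be orthogonal, which preserves monolingual invariance and has a closed-form Procrustes solution. The sketch below shows that solution on toy data standing in for the Hindi/English lexicon embeddings; the function name and dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def learn_orthogonal_mapping(X, Z):
    """Solve min_W ||X W - Z||_F over orthogonal W (orthogonal Procrustes),
    the kind of constrained mapping Artetxe et al. (2016) use to bring two
    monolingual embedding spaces together via a bilingual lexicon.
    Rows of X and Z are embeddings of translation pairs."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

# Toy data: the "source" space is an exact rotation of the "target" space,
# so the learned mapping should recover that rotation.
rng = np.random.default_rng(1)
Z = rng.normal(size=(100, 64))                 # target-side embeddings
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
X = Z @ Q.T                                    # source-side embeddings
W = learn_orthogonal_mapping(X, Z)             # X @ W lives in Z's space
```

After projection, a word's Hindi and English embeddings are comparable with cosine similarity, which is what lets a single parser consume both vocabularies.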

Hidden dimensions The POS-tagger-specific Bi-LSTMs have 128 cells, while the parser-specific Bi-LSTMs have 256 cells. The Bi-LSTM in the language identification model has 64 cells. The character Bi-LSTMs have 32 cells in all three models. The hidden layer of the MLP has 64 nodes for the language identification network, 128 nodes for the POS tagger and 256 nodes for the parser. We use the hyperbolic tangent as the activation function in all tasks. In the normalization models, we use single-layered Bi-LSTMs with 512 cells for both encoding and decoding of character sequences.
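For reference, the layer sizes above can be gathered into a single configuration table. This is purely a bookkeeping sketch of the numbers reported in this section, not part of the released code.

```python
# Hidden-layer sizes reported in the paper, collected in one place.
HIDDEN = {
    "langid":     {"word_bilstm": 64,  "char_bilstm": 32, "mlp": 64},
    "tagger":     {"word_bilstm": 128, "char_bilstm": 32, "mlp": 128},
    "parser":     {"word_bilstm": 256, "char_bilstm": 32, "mlp": 256},
    "normalizer": {"encoder_bilstm": 512, "decoder_bilstm": 512},
}
ACTIVATION = "tanh"  # hyperbolic tangent used in all tasks
```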

Learning For the language identification, POS tagging and parsing networks, we use momentum SGD for learning with a minibatch size of 1. The LSTM weights are initialized with random orthonormal matrices as described by Saxe et al. (2013). We set the dropout rate to 30% for the POS tagger and parser Bi-LSTM and MLP hidden states, while for the language identification network we set the dropout to 50%. All three models are trained for up to 100 epochs, with early stopping based on the development set.
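The orthonormal initialization of Saxe et al. (2013) can be sketched with a QR decomposition of a Gaussian matrix. This is an illustrative numpy version under the usual square-matrix assumption; DyNet ships its own initializers, so the function below is not the paper's code.

```python
import numpy as np

def orthonormal_init(rows, cols, rng):
    """Random orthonormal weight matrix (Saxe et al., 2013): Q from the QR
    decomposition of a Gaussian matrix, with column signs fixed so the
    result is uniformly distributed over orthogonal matrices."""
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))      # sign correction per column
    return q[:rows, :cols] if rows >= cols else q[:cols, :rows].T

rng = np.random.default_rng(2)
W = orthonormal_init(128, 128, rng)  # e.g. a tagger Bi-LSTM weight block
```

Orthonormal initial weights keep the singular values of the recurrent transition near 1, which mitigates vanishing and exploding gradients early in training.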

For normalization, we train our encoder-decoder models for 25 epochs using vanilla SGD. We start with a learning rate of 1.0 and, after 8 epochs, halve it after every subsequent epoch. We use a mini-batch size of 128, and the gradient is rescaled whenever its norm exceeds 5. We use a dropout rate of 30% for the Bi-LSTM.
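The schedule and gradient clipping just described amount to two small functions. This is a sketch of the stated recipe; the exact epoch indexing (halving starting after epoch 8) is our reading of the text.

```python
import numpy as np

def learning_rate(epoch, base_lr=1.0, halve_after=8):
    """lr = 1.0 for the first 8 epochs, then halved after every
    subsequent epoch (epochs are 1-indexed here)."""
    if epoch <= halve_after:
        return base_lr
    return base_lr * 0.5 ** (epoch - halve_after)

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient whenever its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

g = clip_gradient(np.full(100, 1.0))  # norm 10.0, rescaled down to 5.0
```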

The language identification, POS tagging and parsing code is implemented in DyNet (Neubig et al., 2017), and for normalization without decoding we use the Open-NMT toolkit for neural machine translation (Klein et al., 2017). All the code is available at https://github.com/irshadbhat/nsdp-cs and the data is available at https://github.com/CodeMixedUniversalDependencies/UD_Hindi_English.

6 Results

In Table 4, we present the results of our main model, which uses neural stacking both for learning POS tagging and parsing and for knowledge transfer from the Bilingual model. Transferring POS tagging and syntactic knowledge using neural stacking gives a 1.5% LAS improvement7 over a naive data-augmentation approach. The Bilingual model, which is trained on the union of the Hindi and English data sets, is the least accurate of all our parsing models. However, it achieves better or near state-of-the-art results on the Hindi and English evaluation sets (see Table 5). Compared to the best system in the CoNLL 2017 Shared Task on Universal Dependencies (Zeman et al., 2017; Dozat et al., 2017), our results for English are around 3% better in LAS, while for Hindi they are only 0.5% LAS points worse. The CS model, trained only on the CS training data, is slightly more accurate than the Bilingual model. Adding the CS data to the combined Hindi-English data supplies the syntactic structures relevant for parsing mixed grammar, which are otherwise missing in the individual datasets. The average improvement of around ∼5% LAS clearly shows their complementary nature.

                      Gold (LID+TRN)    Auto (LID+TRN)
  Model               UAS      LAS      UAS      LAS
  Bilingual           75.26    65.41    73.29    63.18
  CS                  76.69    66.90    75.84    64.94
  Augmented           80.39    71.27    78.95    69.51
  Neural Stacking     81.50    72.44    80.23    71.03
  (Bhat et al., 2017) 74.16    64.11    66.18    54.40

Table 4: Accuracy of different parsing models on the evaluation set. POS tags are jointly predicted with parsing. LID = language tag, TRN = transliteration/normalization.

Table 6 summarizes the POS tagging results on the CS evaluation set. The tagger trained on the CS training data is 2.5% better than the Bilingual tagger. Adding the CS training data to the Hindi and English training sets further improves the accuracy by 1%. However, our stack-prop tagger achieves the highest accuracy of 90.53% by leveraging POS information from the Bilingual tagger using neural stacking.

7 The improvements discussed in the running text are for the models evaluated in the auto setting.

                     Pipeline                               Stack-prop
            Gold POS         Auto POS
  Data-set  UAS     LAS      POS     UAS     LAS     POS     UAS     LAS
  Hindi     95.66   93.08    97.52   94.08   90.69   97.65   94.36   91.02
  English   89.95   87.96    95.75   87.71   84.59   95.80   88.30   85.30

Table 5: POS and parsing results for the Hindi and English monolingual test sets using the pipeline and stack-prop models.

                      Gold (LID+TRN)      Auto (LID+TRN)
  Model               Pipeline   SP       Pipeline   SP
  Bilingual           88.36      88.12    86.71      86.27
  CS                  90.32      90.38    89.12      89.19
  Augmented           91.20      91.50    90.02      90.20
  Neural Stacking     91.76      91.90    90.36      90.53
  (Bhat et al., 2017) 86.00      –        85.30      –

Table 6: POS tagging accuracies of different models on the CS evaluation set. SP = stack-prop.

Pipeline vs Stack-prop Table 7 summarizes the parsing results of our pipeline models, which use predicted POS tags as input features. Compared to our stack-prop models (Table 4), the pipeline models are less accurate (stack-prop yields an average improvement of 1% LAS across models), which clearly emphasizes the significance of back-propagating the parsing loss into the tagging parameters as well.

                      Gold (LID+TRN+POS)   Auto (LID+TRN+POS)
  Model               UAS      LAS         UAS      LAS
  Bilingual           82.29    73.79       72.09    61.18
  CS                  82.73    73.38       75.20    64.64
  Augmented           85.66    77.75       77.98    69.16
  Neural Stacking     86.87    78.57       78.90    69.45

Table 7: Accuracy of different parsing models on the test set using predicted language tags, normalized/back-transliterated words and predicted POS tags. POS tags are predicted separately before parsing. In the Neural Stacking model, only parsing knowledge from the Bilingual model is transferred.

Significance of normalization We also conducted experiments to evaluate the impact of normalization on both POS tagging and parsing. The results are shown in Table 8. As expected, tagging and parsing models that use normalization without decoding achieve an average improvement of 1% over the models that do not use normalization at all. However, our 3-step decoding leads to higher gains in both tagging and parsing accuracies. We achieved around 2.8% improvement in tagging and around 4.6% in parsing over the models that use the first-best word forms from the normalization models. More importantly, there is only a moderate drop in accuracy (1.4% LAS points) caused by normalization errors (see the results in Table 4 for gold vs. auto normalization).

  System             POS      UAS      LAS
  No Normalization   86.98    76.25    66.02
  First Best         87.74    78.26    67.22
  3-step Decoding    90.53    80.23    71.03

Table 8: Impact of normalization and back-transliteration on the POS tagging and parsing models.

Monolingual vs Cross-lingual Embeddings We also conducted experiments with monolingual and cross-lingual embeddings to evaluate the need for transforming the monolingual embeddings into the same semantic space for processing CS data. Results are shown in Table 9. Cross-lingual embeddings bring around ∼0.5% improvement in both tagging and parsing. Cross-lingual embeddings are essential for removing the lexical differences that are among the main problems encountered in CS data. Addressing the lexical differences supports better learning by exposing the syntactic similarities between the languages.

  Embedding      POS      UAS      LAS
  Monolingual    90.07    79.46    70.53
  Crosslingual   90.53    80.23    71.03

Table 9: Impact of monolingual and cross-lingual embeddings on stacking model performance.

7 Conclusion

In this paper, we have presented a dependency parser designed explicitly for Hindi-English CS data. The parser uses the neural stacking architecture of Zhang and Weiss (2016) and Chen et al. (2016) for learning POS tagging and parsing and for knowledge transfer from Bilingual models trained on the Hindi and English UD treebanks. We have also presented normalization and back-transliteration models with a decoding process tailored for CS data. Our neural stacking parser is 1.5% LAS points better than the augmented parsing model and 3.8% LAS points better than the one which uses first-best normalizations.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.

Irshad Bhat, Riyaz A. Bhat, Manish Shrivastava, and Dipti Sharma. 2017. Joining hands: Exploiting monolingual treebanks for parsing of code-mixing data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, Valencia, Spain, pages 324–330.

Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Vít Suchomel, Aleš Tamchyna, and Daniel Zeman. 2014. HindEnCorp – Hindi-English and Hindi-only corpus for machine translation. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). European Language Resources Association (ELRA), Reykjavik, Iceland.

Özlem Çetinoğlu, Sarah Schulz, and Ngoc Thang Vu. 2016. Challenges of computational processing of code-switching. In Proceedings of the Second Workshop on Computational Approaches to Code Switching. Association for Computational Linguistics, Austin, Texas, pages 1–11.

Khyathi Raghavi Chandu, Manoj Chinnakotla, Alan W. Black, and Manish Shrivastava. 2017. WebShodh: A code mixed factoid question answering system for web. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, pages 104–111.

Hongshen Chen, Yue Zhang, and Qun Liu. 2016. Neural network for heterogeneous annotations. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 731–741.

Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, volume 14, pages 4585–4592.

Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30.

Sukanya Dutta, Tista Saha, Somnath Banerjee, and Sudip Kumar Naskar. 2015. Text normalization in code-mixed social media text. In Proceedings of the 2nd International Conference on Recent Trends in Information Systems (ReTIS).

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers – Volume 2. Association for Computational Linguistics, pages 42–47.

Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Proceedings of the 24th International Conference on Computational Linguistics, pages 959–976.

John J. Gumperz. 1982. Discourse Strategies, volume 1. Cambridge University Press.

Gualberto Guzman, Joseph Ricard, Jacqueline Serigos, Barbara E. Bullock, and Almeida Jacqueline Toribio. 2017. Metrics for modeling code-switching across corpora. In Proceedings of Interspeech 2017, pages 67–71.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In ACL (2), pages 690–696.

Anupam Jamatia, Björn Gambäck, and Amitava Das. 2015. Part-of-speech tagging for code-mixed English-Hindi Twitter and Facebook chat messages. page 239.

Girish Nath Jha. 2010. The TDIL program and the Indian Language Corpora Initiative (ILCI). In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA).

Aditya Joshi, Ameya Prabhu, Manish Shrivastava, and Vasudeva Varma. 2016. Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, pages 2482–2491.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4:313–327.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL. https://doi.org/10.18653/v1/P17-4012.

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, and Chris Dyer. 2014. A dependency parser for tweets. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1001, page 1012.

Anoop Kunchukuttan, Ratish Puduppully, and Pushpak Bhattacharyya. 2015. Brahmi-Net: A transliteration and script conversion system for languages of the Indian subcontinent.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Carol Myers-Scotton. 1995. Social Motivations for Codeswitching: Evidence from Africa. Oxford University Press.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT).

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL-HLT, pages 380–390.

Shana Poplack. 1980. Sometimes I'll start a sentence in Spanish y termino en español: Toward a typology of code-switching. Linguistics 18(7-8):581–618.

Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, Kalika Bali, and Chandra Shekhar Maddila. 2017. Estimating code-switching on Twitter with a novel generalized word-level language detection technique. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1971–1982.

Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury, and Niloy Ganguly. 2016. Understanding language preference for expression of opinion and sentiment: What do Hindi-English speakers do on Twitter? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1131–1141. https://aclweb.org/anthology/D16-1121.

Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

Arnav Sharma, Sakshi Gupta, Raveesh Motlani, Piyush Bansal, Manish Shrivastava, Radhika Mamidi, and Dipti M. Sharma. 2016. Shallow parsing pipeline – Hindi-English code-mixed social media text. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 1340–1345.

Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steve Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, and Pascale Fung. 2014. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code-Switching. EMNLP 2014, Conference on Empirical Methods in Natural Language Processing, October 2014, Doha, Qatar.

Thamar Solorio and Yang Liu. 2008a. Learning to predict code-switching points. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 973–981.

Thamar Solorio and Yang Liu. 2008b. Part-of-speech tagging for English-Spanish code-switched text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1051–1060.

Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, and Monojit Choudhury. 2014. POS tagging of English-Hindi code-mixed social media content. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 14, pages 974–979.

Hongmin Wang, Yue Zhang, GuangYong Leonard Chan, Jie Yang, and Hai Leong Chieu. 2017. Universal dependencies parsing for colloquial Singaporean English. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, pages 1732–1744.

Daniel Zeman, Martin Popel, Milan Straka, Jan Hajič, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinková, Jan Hajič jr., Jaroslava Hlaváčová, Václava Kettnerová, Zdeňka Urešová, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria de Paiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonça, Tatiana Lando, Rattima Nitisaroj, and Josie Li. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, Vancouver, Canada, pages 1–19.

Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved representation learning for syntax. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1557–1566.

A Supplemental Material

A.1 Example Annotations from our CS Treebank

Annotated CS example sentences (the dependency-tree diagrams are rendered as figures in the original and are not reproduced here; tweet spellings are preserved verbatim):

1. i thought mosam different hoga bas fog hy
2. Thand bhi odd even formula follow Kr rhi h ;-)
3. Tum kitne fake account banaogy
4. Ram Kapoor reminds me of boondi ke laddu
5. Has someone told Gabbar cal kya hai ?
6. Enjoying Dilli ki sardi after a long time .
7. Biggboss dekhne wali awaam can unfollow me .
8. Kaafi depressing situation hai yar
9. Some people are double standards ki dukaan
10. There is no seperate emoji for khushi ke aansu .