Towards a syntactically annotated corpus of Maltese: Theory and tools

Slavomír /bulbul/ Čéplö (CUNI)
Ján /...j/ Bátora (Sonic Studio s.r.o.)


Introduction

What? “Constituent Order Variation and Information Structure in Maltese: A Corpus Analysis”

Where? Institute of the Czech National Corpus, Faculty of Arts, Charles University in Prague

How? “For the purposes of this thesis, a balanced and representative corpus of appropriate size … This corpus will then be used to create a simple treebank which serves as the primary source of research data.”

Introduction

How exactly?

1. Add texts from various domains to the corpus.

2. Add PoS tagging, lemmatization and morphological annotation.

3. Create a manually annotated treebank to be used for statistical analysis and, ultimately, parser training.

WORK IN PROGRESS
EVERYTHING IS SUBJECT TO CHANGE

Preliminaries: Current state of Maltese corpora

MLRS v2.0 (beta) (2014)
- ~130 million tokens
- CQPweb
- PoS-tagged

bulbulistan beta (2013)
- ~180 million tokens
- NoSketchEngine
- PoS-tagged

MLRS v3.0 (coming soon, 2015)
- ~200 million tokens
- PoS-tagged (new scheme)
- Lemmas, morphological annotation

Current state of Maltese corpora: Annotation

TOKEN
Fid- daħla tiegħu l- Belt , ir- Re tal- Karnival ilbieraħ kellu warajh folla ta' protesta .

Current state of Maltese corpora: Annotation

TOKEN      TAG
Fid-       PREP_DEF
daħla      NOUN
tiegħu     GEN_PRON
l-         DEF
Belt       NOUN
,          X_PUN
ir-        DEF
Re         NOUN
tal-       GEN_DEF
Karnival   NOUN
ilbieraħ   ADV
kellu      VERB_PSEU
warajh     PREP_PRON
folla      NOUN
ta'        GEN
protesta   NOUN
.          X_PUN
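
As a minimal sketch of how such token/tag annotation might be read programmatically: the slides do not specify a file format, so the code below assumes a one-token-per-line, tab-separated layout of the kind commonly used with CQP-style corpus tools. The function name `parse_vertical` and the sample string are hypothetical, introduced only for illustration.

```python
def parse_vertical(text):
    """Split one-token-per-line text into (token, tag) pairs.

    Assumes the tag is the last whitespace-separated field on each line
    (an assumption about the format, not something the slides state).
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        token, tag = line.rsplit(None, 1)
        pairs.append((token, tag))
    return pairs


# First six tokens of the example sentence, in the assumed layout.
SAMPLE = "Fid-\tPREP_DEF\ndaħla\tNOUN\ntiegħu\tGEN_PRON\nl-\tDEF\nBelt\tNOUN\n,\tX_PUN\n"

pairs = parse_vertical(SAMPLE)
for token, tag in pairs:
    print(f"{token:<10}{tag}")
```

Keeping the result as a list of pairs preserves token order, which any later chunking or syntactic layer would depend on.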

PoS Tagging v2 (2015)

TAG             DESCRIPTION
ADJ             adjective
ADV             adverb
COMP            complementizer
CONJ_CORD       coordinating conjunction
CONJ_SUB        subordinating conjunction
DEF             article
FOC             focus particle
FUT             future particle
GEN             genitive particle
GEN_DEF         genitive particle with article
GEN_PRON        genitive particle with pronoun
HEMM            existential verb
INT             interjection
KIEN            auxiliary
LIL             oblique particle
LIL_DEF         oblique particle with article
LIL_PRON        oblique particle with pronoun
NEG             verbal negator
NOUN            noun
NOUN_PROP       proper noun
NUM_CRD         cardinal numeral
NUM_FRC         fractions
NUM_ORD         ordinal numeral
NUM_WHD         number one
PART_ACT        active participle
PART_PASS       passive participle
PREP            preposition
PREP_DEF        preposition with article
PREP_PRON       preposition with pronoun
PROG            progressive particle
PRON_DEM        demonstrative pronoun
PRON_DEM_DEF    demonstrative pronoun with article
PRON_INDEF      indefinite pronoun
PRON_INT        interrogative pronoun
PRON_PERS       personal pronoun
PRON_PERS_NEG   personal pronoun with negator
PRON_REC        reciprocal pronoun
PRON_REF        reflexive pronoun
QUAN            quantifier
VERB            verb
VERB_PSEU       pseudoverb
X_ABV           abbreviation
X_BOR           bordel
X_DIG           digits
X_ENG           English words
X_FOR           other foreign
X_PUN           punctuation

PoS Tagging v2 (2015)

Hierarchy of tagging decisions:

1. Semantics
ADJ, NOUN, VERB, PRON, QUAN, FUT, PROG …

2. Morphology
ADJ vs. PART_PASS/PART_ACT, VERB vs. VERB_PSEU, GEN vs. GEN_DEF/GEN_PRON

3. Syntax
ADV = {hawn, issa, … waħdi}
PRON_INDEF = {ħadd, xejn, kulħadd}
ADJ>NOUN conversion = l-/DEF poplu/NOUN Malti/ADJ vs. Il-/DEF Malti/NOUN hu/PRON_PERS ġeneruż/ADJ

Current state of Maltese corpora: Annotation

TOKEN   TAG   LEMMA   MORPHOLOGY
(tokens and tags as in the TOKEN/TAG table above; the LEMMA and MORPHOLOGY columns are COMING SOON)

Current state of Maltese corpora: Annotation

TOKEN   TAG   SYNTAX
(tokens and tags as in the TOKEN/TAG table above; the SYNTAX column is empty at this point)

Theory

Syntactic annotation: Some basic questions

Theoretical:
- What kind of theoretical approach?
- How deep?

Practical:
- What annotation tools?
- What data format?
- Who will do the annotation?

Theoretical faithfulness vs. simplicity

Option 1: Constituency

FID-DAĦLA TIEGĦU L-BELT, IR-RE TAL-KARNIVAL ILBIERAĦ KELLU WARAJH FOLLA TA’ PROTESTA.
In DEF-entry his DEF-city, DEF-king GEN-DEF-carnival yesterday had after.him crowd GEN protest

Option 2: Dependency

FID-DAĦLA TIEGĦU L-BELT, IR-RE TAL-KARNIVAL ILBIERAĦ KELLU WARAJH FOLLA TA’ PROTESTA.

Constituency vs dependency

Constituency | Dependency

Less theoretical baggage: X
Deals better with constituent order phenomena: X
Better performance and availability of parsers [1]: X
Intuitiveness / Familiarity: X
Established framework for cross-linguistic comparison [2]: X
Enables the recovery of semantic information: X
We in Prague like it: X

[1] MaltParser, Stanford Parser
[2] “Universal Dependencies” (http://universaldependencies.github.io/docs/)

Option 3?

Stein, A. (2008). “Syntactic Annotation of Old French Text Corpora” (http://corpus.revues.org/1510)
Abeillé, A. & Barrier, N. (2004). “Enriching a French Treebank”. LREC 4

Option 3: Combine constituency and dependency

1. Clausal constituents are defined as virtual nodes on a tree

2. The structure of a constituent is defined in terms of dependency

Option 3: Combine constituency and dependency

3. It’s chunks all the way down

a) Easier to work with for manual annotation
b) Certain types of chunks can only become specific constituents
c) Chunking can be done automatically
d) Analysis of chunk structure can be done automatically >>> bottom-up conversion to full dependency

FID-DAĦLA TIEGĦU L-BELT, IR-RE TAL-KARNIVAL ILBIERAĦ KELLU WARAJH FOLLA TA’ PROTESTA.

Option 3: Combine constituency and dependency

Chunking: Basics

“… identifying and classifying the flat, non-overlapping segments of a sentence…” (Jurafsky and Martin 2009: 485)

FID-DAĦLA TIEGĦU L-BELT, IR-RE TAL-KARNIVAL ILBIERAĦ KELLU WARAJH FOLLA TA’ PROTESTA.

Somewhat easily automated:
- Rule-based
- Statistical (YamCha, Apache OpenNLP)
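
To make the rule-based option concrete, here is a toy sketch in Python. The rules are invented purely for illustration and are not the authors' chunker: they assume that article-like tags (`DEF`, `PREP_DEF`, `GEN`, `GEN_DEF`) open a noun chunk, that a following `NOUN` or `GEN_PRON` continues it, and that labels use the common B-/I-/O convention (B = begins a chunk, I = inside a chunk, O = outside any chunk). The names `NP_OPENERS` and `chunk` are hypothetical.

```python
# Tags that open a new NP chunk in this toy rule set (an assumption).
NP_OPENERS = {"DEF", "PREP_DEF", "GEN", "GEN_DEF"}


def chunk(tagged):
    """Assign B-/I-/O chunk labels to (token, tag) pairs using toy rules."""
    labels = []
    prev_tag = None
    for _token, tag in tagged:
        if tag == "X_PUN":
            label = "O"
        elif tag in NP_OPENERS:
            label = "B-NP"
        elif tag == "NOUN":
            # A noun continues an NP opened by an article-like tag,
            # otherwise it starts an NP of its own.
            label = "I-NP" if prev_tag in NP_OPENERS else "B-NP"
        elif tag == "GEN_PRON":
            # A possessive attaches to the NP it directly follows, if any.
            label = "I-NP" if labels and labels[-1].endswith("-NP") else "B-NP"
        elif tag == "ADV":
            label = "B-ADVP"
        else:
            label = "O"  # tags not covered by these toy rules
        labels.append(label)
        prev_tag = tag
    return labels


SAMPLE = [("Fid-", "PREP_DEF"), ("daħla", "NOUN"), ("tiegħu", "GEN_PRON"),
          ("l-", "DEF"), ("Belt", "NOUN"), (",", "X_PUN")]

for (token, tag), label in zip(SAMPLE, chunk(SAMPLE)):
    print(f"{token:<8}{tag:<10}{label}")
```

A real rule set would need to cover the full tagset and the chunking principles; a statistical chunker would instead learn such decisions from manually chunked data.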

Chunking: Principles

1. Primacy of content words

“Dependency relations hold primarily between content words, rather than being indirect relations mediated by function words. ... Function words attach as direct dependents of the most closely related content word.” (http://universaldependencies.github.io/docs/u/overview/syntax.html)

2. Maximum extent

IT- TFAJLA LANQAS BISS KIENET SE TAGĦTI KASU.

NP VC NP

FID-DAĦLA TIEGĦU L-BELT ...

NP

PP

Chunking: Principles

3. Minimum size

Exception:

FTIT FTIT , IN- NIES BDEW/VERB JITĦAJRU/VERB JMORRU/VERB JGĦIXU/VERB HEMM.

VC

FID-DAĦLA TIEGĦU L-BELT ...

NP NP

NP

Chunking: Categories

Label        Full name                  Notes
NP           Noun phrase                Nouns, pronouns
VC           Verbal complex/chain
ADVP         Adverbial phrase
ADJP         Adjectival phrase
PP           Prepositional phrase       PREP_PRON
SBAR (tbc)   Complementizer             li, ma (as in “iżjed ma jgħaddi ż- żmien...”)
CCONJP       Coordinating conjunction
SCONJP       Subordinating conjunction  Including combinations of PREP+COMP (wara li, qabel ma)
INTJ         Interjection
LST (tbr?)   List
OTH          Other
O

Chunking: Example

TOKEN      TAG         SYNTAX
Fid-       PREP_DEF    B-NP
daħla      NOUN        I-NP
tiegħu     GEN_PRON    I-NP
l-         DEF         B-NP
Belt       NOUN        I-NP
,          X_PUN       O
ir-        DEF         B-NP
Re         NOUN        I-NP
tal-       GEN_DEF     B-NP
Karnival   NOUN        I-NP
ilbieraħ   ADV         B-ADVP
kellu      VERB_PSEU   B-VP
warajh     PREP_PRON   B-ADVP
folla      NOUN        B-NP
ta'        GEN         B-NP
protesta   NOUN        I-NP
.          X_PUN       O
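
The SYNTAX column uses B-/I-/O labels: B- begins a chunk, I- continues it, and O marks tokens outside any chunk. A minimal sketch of how these labels could be grouped back into chunks follows. The function name `bio_to_chunks` is hypothetical, and the labels are copied from the example as given.

```python
def bio_to_chunks(rows):
    """Group (token, label) rows into a list of (chunk_type, [tokens])."""
    chunks = []
    current = None
    for token, label in rows:
        if label == "O":
            current = None  # token belongs to no chunk
            continue
        prefix, _, name = label.partition("-")
        # Start a new chunk on B-, or on an I- that has no matching open chunk.
        if prefix == "B" or current is None or current[0] != name:
            current = (name, [token])
            chunks.append(current)
        else:
            current[1].append(token)
    return chunks


ROWS = [
    ("Fid-", "B-NP"), ("daħla", "I-NP"), ("tiegħu", "I-NP"),
    ("l-", "B-NP"), ("Belt", "I-NP"), (",", "O"),
    ("ir-", "B-NP"), ("Re", "I-NP"), ("tal-", "B-NP"), ("Karnival", "I-NP"),
    ("ilbieraħ", "B-ADVP"), ("kellu", "B-VP"), ("warajh", "B-ADVP"),
    ("folla", "B-NP"), ("ta'", "B-NP"), ("protesta", "I-NP"), (".", "O"),
]

chunks = bio_to_chunks(ROWS)
for name, tokens in chunks:
    print(name, " ".join(tokens))
```

Because chunks are flat and non-overlapping, a single left-to-right pass is enough; no stack or tree structure is needed at this level.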

Syntactic annotation: Basic principles

1. Clausal constituents are defined as virtual nodes
2. The structure of a constituent is defined in terms of dependency
3. Chunks all the way down
4. Simplicity (no superfluous categories) even at the expense of shortcuts…
5. … while making sure all the linguistic information is preserved
6. As close to the Universal Dependencies notation as possible

Syntactic annotation: Clausal constituents

Label             Full name             UD equivalent     Notes
NSUBJ/NSUBJPASS   Nominal subject       NSUBJ/NSUBJPASS
CSUBJ/CSUBJPASS   Clausal subject       CSUBJ/CSUBJPASS
VC                Verbal complex        ROOT              VERB, VERB_PSEU, HEMM
COP               Copula                ROOT              KIEN, PRON_PERS, qiegħed and ġie (passive constructions only)
CC                Copular complement    n/a               Including PART_PASS in passive constructions
DOBJ              Direct object         DOBJ
IOBJ              Indirect object       IOBJ
VCOMP             Verbal complement     n/a               “Insibha diffiċli”, “toqgħod attent”
ADVMOD            Adverbial modifier    ADVMOD
DISC              Discourse element     DISCOURSE         Sentence-level focus, vocative, interjections, “grazzi”, “jekk jogħġbok”...
LIST              Lists, enumerations   LIST
OTH               Other                 DEP
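
The label-to-UD correspondences in this table can be written down as a simple lookup, sketched below. This only covers renaming clause-level labels; it is not the planned conversion to full UD annotation, which would also have to build the dependency structure inside each constituent. The names `UD_EQUIVALENT` and `to_ud` are hypothetical, and the combined rows NSUBJ/NSUBJPASS and CSUBJ/CSUBJPASS are assumed to split into separate one-to-one entries.

```python
# Clausal constituent label -> UD equivalent, as listed in the table.
# None stands for "n/a" (no direct UD equivalent given).
UD_EQUIVALENT = {
    "NSUBJ": "NSUBJ",
    "NSUBJPASS": "NSUBJPASS",
    "CSUBJ": "CSUBJ",
    "CSUBJPASS": "CSUBJPASS",
    "VC": "ROOT",
    "COP": "ROOT",
    "CC": None,
    "DOBJ": "DOBJ",
    "IOBJ": "IOBJ",
    "VCOMP": None,
    "ADVMOD": "ADVMOD",
    "DISC": "DISCOURSE",
    "LIST": "LIST",
    "OTH": "DEP",
}


def to_ud(label):
    """Return the UD equivalent of a clausal label, or None for "n/a".

    Raises KeyError for labels that are not in the table.
    """
    return UD_EQUIVALENT[label.upper()]


for label in ("VC", "COP", "DISC", "VCOMP"):
    print(label, "->", to_ud(label))
```

The two `None` entries (CC and VCOMP) mark exactly the places where a full conversion would need extra rules rather than a rename.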

Syntactic annotation: Special cases

1. Passive sentences

Note: In both cases, rules 4 and 5 apply:
Rule 4: DObj is used for the passive “by-object”/agent in both cases; copula + copular complement are used for the combination ġie + PART_PASS.
Rule 5: The sentences are unambiguously identified as passive by the use of NSubj, and the presence of NSubj identifies the DObj as the “by-object”/agent, so there is no need to establish a separate label.

Syntactic annotation: Special cases

2. Verbless sentences

Syntactic annotation: Special cases

3. PRON_INT as interrogatives

Syntactic annotation: Special cases

4. Coordination

5. Two constituents with the same function

Syntactic annotation: Special cases

6. Various non-attached elements

7. Verbal complement

Syntactic annotation: Complex sentences

1. Simple sentence / main clause / root: <S>

2. Coordinated independent clauses (including parataxis): <Sind> (UD conj)

Syntactic annotation: Complex clauses

3. Main clause with subordinate clauses
a) Complementizer clause: <Scomp> (UD ccomp/xcomp)


Syntactic annotation: Complex clauses

3. Main clause with subordinate clauses
b) Adverbial clause: <Sadv> (UD advcl)

Syntactic annotation: Some basic questions (partially) answered

Theoretical:
- What kind of theoretical approach? ✓
- How deep? ✓

Practical:
- What annotation tools?
- What data format?
- Who will do the annotation?

Tools

Tools: Introduction

Hard truth no. 1: Most NLP applications require human-annotated data.

Hard truth no. 2: Most software tools to produce said data are really not that good.

Tools: POS Tagger

Tools: PoS tagging

Tools: Manual chunking

Tools: Parsing

In conclusion: To do

- Refine the annotation system
- Develop an annotation manual
- Complete the development of PoSTagger (and come up with a better name)
- Develop conversion to full UD annotation
- while (accuracy < 90%) { annotate(); train(); tune(); test(); }