Towards a syntactically annotated corpus of Maltese: Theory and tools
Slavomír /bulbul/ Čéplö (CUNI)
Ján /...j/ Bátora (Sonic Studio s.r.o.)
Introduction
What? “Constituent Order Variation and Information Structure in Maltese: A Corpus Analysis”
Where? Institute of the Czech National Corpus, Faculty of Arts, Charles University in Prague
How? “For the purposes of this thesis, a balanced and representative corpus of appropriate size … This corpus will then be used to create a simple treebank which serves as the primary source of research data.”
Introduction
How exactly?
1. Add texts from various domains to the corpus.
2. Add PoS tagging, lemmatization and morphological annotation.
3. Create a manually annotated treebank to be used for statistical analysis and, ultimately, parser training.
WORK IN PROGRESS
EVERYTHING IS SUBJECT TO CHANGE
Preliminaries: Current state of Maltese corpora
MLRS v2.0 (beta) (2014)
- ~130 million tokens
- CQP web
- PoS tagged
bulbulistan beta (2013)
- ~180 million tokens
- NoSketchEngine
- PoS tagged
MLRS v3.0 (coming soon 2015)
- ~200 million tokens
- PoS tagged (new scheme)
- Lemmas, morphological annotation
Current state of Maltese corpora: Annotation
TOKEN
Fid-
daħla
tiegħu
l-
Belt
,
ir-
Re
tal-
Karnival
ilbieraħ
kellu
warajh
folla
ta'
protesta
.
Current state of Maltese corpora: Annotation
TOKEN     TAG
Fid-      PREP_DEF
daħla     NOUN
tiegħu    GEN_PRON
l-        DEF
Belt      NOUN
,         X_PUN
ir-       DEF
Re        NOUN
tal-      GEN_DEF
Karnival  NOUN
ilbieraħ  ADV
kellu     VERB_PSEU
warajh    PREP_PRON
folla     NOUN
ta'       GEN
protesta  NOUN
.         X_PUN
PoS Tagging v2 (2015)
ADJ            adjective
ADV            adverb
COMP           complementizer
CONJ_CORD      coordinating conjunction
CONJ_SUB       subordinating conjunction
DEF            article
FOC            focus particle
FUT            future particle
GEN            genitive particle
GEN_DEF        genitive particle with article
GEN_PRON       genitive particle with pronoun
HEMM           existential verb
INT            interjection
KIEN           auxiliary
LIL            oblique particle
LIL_DEF        oblique particle with article
LIL_PRON       oblique particle with pronoun
NEG            verbal negator
NOUN           noun
NOUN_PROP      proper noun
NUM_CRD        cardinal numeral
NUM_FRC        fractions
NUM_ORD        ordinal numeral
NUM_WHD        number one
PART_ACT       active participle
PART_PASS      passive participle
PREP           preposition
PREP_DEF       preposition with article
PREP_PRON      preposition with pronoun
PROG           progressive particle
PRON_DEM       demonstrative pronoun
PRON_DEM_DEF   demonstrative pronoun with article
PRON_INDEF     indefinite pronoun
PRON_INT       interrogative pronoun
PRON_PERS      personal pronoun
PRON_PERS_NEG  personal pronoun with negator
PRON_REC       reciprocal pronoun
PRON_REF       reflexive pronoun
QUAN           quantifier
VERB           verb
VERB_PSEU      pseudoverb
X_ABV          abbreviation
X_BOR          bordel
X_DIG          digits
X_ENG          English words
X_FOR          other foreign
X_PUN          punctuation
PoS Tagging v2 (2015)
Hierarchy of tagging decisions:
1. Semantics
ADJ, NOUN, VERB, PRON, QUAN, FUT, PROG …
2. Morphology
ADJ vs. PART_PASS/PART_ACT, VERB vs. VERB_PSEU, GEN vs. GEN_DEF/GEN_PRON
3. Syntax
ADV = {hawn, issa, … waħdi}
PRON_INDEF = {ħadd, xejn, kulħadd}
ADJ>NOUN conversion = l-/DEF poplu/NOUN Malti/ADJ vs. Il-/DEF Malti/NOUN hu/PRON_PERS ġeneruż/ADJ
Current state of Maltese corpora: Annotation
TOKEN     TAG        LEMMA  MORPHOLOGY
Fid-      PREP_DEF
daħla     NOUN
tiegħu    GEN_PRON
l-        DEF
Belt      NOUN
,         X_PUN
ir-       DEF
Re        NOUN
tal-      GEN_DEF
Karnival  NOUN
ilbieraħ  ADV
kellu     VERB_PSEU
warajh    PREP_PRON
folla     NOUN
ta'       GEN
protesta  NOUN
.         X_PUN
LEMMA, MORPHOLOGY: COMING SOON
Current state of Maltese corpora: Annotation
TOKEN     TAG        SYNTAX
Fid-      PREP_DEF
daħla     NOUN
tiegħu    GEN_PRON
l-        DEF
Belt      NOUN
,         X_PUN
ir-       DEF
Re        NOUN
tal-      GEN_DEF
Karnival  NOUN
ilbieraħ  ADV
kellu     VERB_PSEU
warajh    PREP_PRON
folla     NOUN
ta'       GEN
protesta  NOUN
.         X_PUN
Syntactic annotation: Some basic questions
Theoretical:
- What kind of theoretical approach?
- How deep?
Practical:
- What annotation tools?
- What data format?
- Who will do the annotation?
Theoretical faithfulness vs. Simplicity
Option 1: Constituency
FID-DAĦLA TIEGĦU L-BELT, IR-RE TAL-KARNIVAL ILBIERAĦ KELLU WARAJH FOLLA TA’ PROTESTA.
In DEF-entry his DEF-city, DEF-king GEN-DEF-carnival yesterday had after.him crowd GEN protest
Option 2: Dependency
FID-DAĦLA TIEGĦU L-BELT, IR-RE TAL-KARNIVAL ILBIERAĦ KELLU WARAJH FOLLA TA’ PROTESTA.
Constituency vs dependency
Constituency | Dependency
Less theoretical baggage: X
Deals better with constituent order phenomena: X
Better performance and availability of parsers [1]: X
Intuitiveness / Familiarity: X
Established framework for cross-linguistic comparison [2]: X
Enables the recovery of semantic information: X
We in Prague like it: X
[1] MaltParser, Stanford Parser
[2] “Universal Dependencies” (http://universaldependencies.github.io/docs/)
Option 3?
Stein, A. (2008). “Syntactic Annotation of Old French Text Corpora” (http://corpus.revues.org/1510)
Abeillé, A. & Barrier, N. (2004). “Enriching a French Treebank”. LREC 4
Option 3: Combine constituency and dependency
1. Clausal constituents are defined as virtual nodes on a tree
2. The structure of a constituent is defined in terms of dependency
Option 3: Combine constituency and dependency
3. It’s chunks all the way down
a) Easier to work with for manual annotation
b) Certain types of chunks can only become specific constituents
c) Chunking can be done automatically
d) Analysis of chunk structure can be done automatically >>> bottom-up conversion to full dependency
FID-DAĦLA TIEGĦU L-BELT, IR-RE TAL-KARNIVAL ILBIERAĦ KELLU WARAJH FOLLA TA’ PROTESTA.
Option 3: Combine constituency and dependency
Chunking: Basics
“… identifying and classifying the flat, non-overlapping segments of a sentence…” (Jurafsky and Martin 2009: 485)
FID-DAĦLA TIEGĦU L-BELT, IR-RE TAL-KARNIVAL ILBIERAĦ KELLU WARAJH FOLLA TA’ PROTESTA.
Somewhat easily automated:
- Rule-based
- Statistical (YamCha, Apache OpenNLP)
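To make the rule-based option concrete, here is a minimal sketch of a chunker that works only from the PoS tags of the example sentence. The rule sets (`NP_OPENERS`, `NP_MEMBERS`, `VERBAL`, `SINGLE`) and the function name `chunk_tags` are assumptions made for illustration, and so are the chunk labels (NP, VC, ADVP, PP); this is not the rule set used in the project.

```python
# Hypothetical rules over the PoS tags; illustrative only, not the authors' rules.
NP_OPENERS = {"DEF", "PREP_DEF", "GEN", "GEN_DEF"}   # function words that open a new NP
NP_MEMBERS = {"NOUN", "NOUN_PROP", "GEN_PRON"}       # continue an open NP, else start one
VERBAL = {"VERB", "VERB_PSEU"}                       # consecutive verbs form one chain
SINGLE = {"ADV": "ADVP", "PREP_PRON": "PP"}          # each token is its own chunk


def chunk_tags(tags):
    """Map a list of PoS tags to BIO chunk labels using the toy rules above."""
    labels = []
    prev = None  # chunk type of the previous token, or None if it was outside a chunk
    for tag in tags:
        if tag in NP_OPENERS:
            label, prev = "B-NP", "NP"
        elif tag in NP_MEMBERS:
            label = "I-NP" if prev == "NP" else "B-NP"
            prev = "NP"
        elif tag in VERBAL:
            label = "I-VC" if prev == "VC" else "B-VC"
            prev = "VC"
        elif tag in SINGLE:
            label, prev = "B-" + SINGLE[tag], SINGLE[tag]
        else:
            label, prev = "O", None  # punctuation and anything not covered by a rule
        labels.append(label)
    return labels


# PoS tags of "Fid-daħla tiegħu l-Belt, ir-Re tal-Karnival ilbieraħ kellu warajh folla ta' protesta."
sentence_tags = [
    "PREP_DEF", "NOUN", "GEN_PRON", "DEF", "NOUN", "X_PUN",
    "DEF", "NOUN", "GEN_DEF", "NOUN", "ADV", "VERB_PSEU",
    "PREP_PRON", "NOUN", "GEN", "NOUN", "X_PUN",
]
bio = chunk_tags(sentence_tags)
print(list(zip(sentence_tags, bio)))
```

A statistical chunker would learn such decisions from annotated data instead of hard-coding them; the hand-written rules here only show why the task is "somewhat easily" automated once PoS tags are available.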
Chunking: Principles
1. Primacy of content words
“Dependency relations hold primarily between content words, rather than being indirect relations mediated by function words. ... Function words attach as direct dependents of the most closely related content word.” (http://universaldependencies.github.io/docs/u/overview/syntax.html)
2. Maximum extent
IT- TFAJLA LANQAS BISS KIENET SE TAGĦTI KASU.
Chunk labels: NP VC NP
FID-DAĦLA TIEGĦU L-BELT ...
Chunk labels: NP; PP
Chunking: Principles
3. Minimum size
Exception:
FTIT FTIT , IN- NIES BDEW/VERB JITĦAJRU/VERB JMORRU/VERB JGĦIXU/VERB HEMM.
Chunk label: VC
FID-DAĦLA TIEGĦU L-BELT ...
Chunk labels: NP NP; NP
Chunking: Categories
Label       Full name                  Notes
NP          Noun phrase                Nouns, pronouns
VC          Verbal complex/chain
ADVP        Adverbial phrase
ADJP        Adjectival phrase
PP          Prepositional phrase       PREP_PRON
SBAR (tbc)  Complementizer             li, ma (as in “iżjed ma jgħaddi ż- żmien...”)
CCONJP      Coordinating conjunction
SCONJP      Subordinating conjunction  Including combinations of PREP+COMP (wara li, qabel ma)
INTJ        Interjection
LST (tbr?)  List
OTH         Other
O
Chunking: Example
TOKEN     TAG        SYNTAX
Fid-      PREP_DEF   B-NP
daħla     NOUN       I-NP
tiegħu    GEN_PRON   I-NP
l-        DEF        B-NP
Belt      NOUN       I-NP
,         X_PUN      O
ir-       DEF        B-NP
Re        NOUN       I-NP
tal-      GEN_DEF    B-NP
Karnival  NOUN       I-NP
ilbieraħ  ADV        B-ADVP
kellu     VERB_PSEU  B-VP
warajh    PREP_PRON  B-ADVP
folla     NOUN       B-NP
ta'       GEN        B-NP
protesta  NOUN       I-NP
.         X_PUN      O
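The SYNTAX column above uses B-/I-/O labels: B- opens a chunk, I- continues it, O marks a token outside any chunk. As a minimal sketch, the following groups the annotated tokens back into labelled chunks. The function name `bio_to_chunks` is hypothetical; the tokens and labels are copied from the example.

```python
def bio_to_chunks(tokens, labels):
    """Group tokens into (chunk_label, [tokens]) pairs from B-/I-/O labels."""
    chunks = []
    current = None  # the chunk being built, as (label, [tokens]), or None
    for token, label in zip(tokens, labels):
        if label == "O":
            # Token outside any chunk: close whatever chunk was open.
            if current:
                chunks.append(current)
            current = None
            continue
        prefix, _, name = label.partition("-")
        if prefix == "B" or current is None or current[0] != name:
            # A B- label (or a stray I- label) starts a new chunk.
            if current:
                chunks.append(current)
            current = (name, [token])
        else:
            current[1].append(token)
    if current:
        chunks.append(current)
    return chunks


tokens = ["Fid-", "daħla", "tiegħu", "l-", "Belt", ",", "ir-", "Re", "tal-",
          "Karnival", "ilbieraħ", "kellu", "warajh", "folla", "ta'", "protesta", "."]
labels = ["B-NP", "I-NP", "I-NP", "B-NP", "I-NP", "O", "B-NP", "I-NP", "B-NP",
          "I-NP", "B-ADVP", "B-VP", "B-ADVP", "B-NP", "B-NP", "I-NP", "O"]

chunks = bio_to_chunks(tokens, labels)
for name, words in chunks:
    print(name, " ".join(words))
```

Treating a stray I- label as the start of a new chunk is a design choice of this sketch; a stricter reader could reject such input instead.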
Syntactic annotation: Basic principles
1. Clausal constituents are defined as virtual nodes
2. The structure of a constituent is defined in terms of dependency
3. Chunks all the way down
4. Simplicity (no superfluous categories) even at the expense of shortcuts…
5. … while making sure all the linguistic information is preserved
6. As close to the Universal Dependencies notation as possible
Syntactic annotation: Clausal constituents
Label            Full name            UD equivalent    Notes
NSUBJ/NSUBJPASS  Nominal subject      NSUBJ/NSUBJPASS
CSUBJ/CSUBJPASS  Clausal subject      CSUBJ/CSUBJPASS
VC               Verbal complex       ROOT             VERB, VERB_PSEU, HEMM
COP              Copula               ROOT             KIEN, PRON_PERS, qiegħed and ġie (passive constructions only)
CC               Copular complement   n/a              Including PART_PASS in passive constructions
DOBJ             Direct object        DOBJ
IOBJ             Indirect object      IOBJ
VCOMP            Verbal complement    n/a              “Insibha diffiċli”, “toqgħod attent”
ADVMOD           Adverbial modifier   ADVMOD
DISC             Discourse element    DISCOURSE        Sentence-level focus, vocative, interjections, “grazzi”, “jekk jogħġbok”...
LIST             Lists, enumerations  LIST
OTH              Other                DEP
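The correspondence between the clausal constituent labels and their UD equivalents can be kept as a simple lookup table, for instance when exporting annotations to a UD-style format. The sketch below assumes a plain dictionary; the names `UD_EQUIVALENT` and `to_ud` are hypothetical, and the entries are copied from the table above, with `None` standing for "n/a".

```python
# Label -> UD equivalent, as listed in the table above; None where the table says "n/a".
UD_EQUIVALENT = {
    "NSUBJ": "NSUBJ",
    "NSUBJPASS": "NSUBJPASS",
    "CSUBJ": "CSUBJ",
    "CSUBJPASS": "CSUBJPASS",
    "VC": "ROOT",
    "COP": "ROOT",
    "CC": None,
    "DOBJ": "DOBJ",
    "IOBJ": "IOBJ",
    "VCOMP": None,
    "ADVMOD": "ADVMOD",
    "DISC": "DISCOURSE",
    "LIST": "LIST",
    "OTH": "DEP",
}


def to_ud(label):
    """Return the UD equivalent of a clausal constituent label, or None for 'n/a'."""
    return UD_EQUIVALENT[label.upper()]


# Labels that have no direct UD counterpart and need special handling on export.
without_ud = [label for label, ud in UD_EQUIVALENT.items() if ud is None]
print(without_ud)
```

How the two labels without a UD counterpart (CC, VCOMP) should be exported is left open here, as it is in the table.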
Syntactic annotation: Special cases
1. Passive sentences
Note: In both cases, rules 4 and 5 apply:
4 - DObj is used for the passive “by-object”/agent in both cases; copula + copular complement are used for the combination ġie + PART_PASS.
5 - The sentences are unambiguously identified as passive by the use of NSubjPass, and the presence of NSubjPass identifies the DObj as the “by-object”/agent, so there is no need to establish a separate label.
Syntactic annotation: Complex sentences
1. Simple sentence / main clause / root: <S>
2. Coordinated independent clauses (including parataxis): <Sind> (UD conj)
Syntactic annotation: Complex clauses
3. Main clause with subordinated clauses
a) Complementizer clause: <Scomp> (UD ccomp/xcomp)
Syntactic annotation: Complex clauses
3. Main clause with subordinated clauses
b) Adverbial clause: <Sadv> (UD advcl)
Syntactic annotation: Some basic questions (partially) answered
Theoretical:
- What kind of theoretical approach? ✓
- How deep? ✓
Practical:
- What annotation tools?
- What data format?
- Who will do the annotation?
Tools: Introduction
Hard truth no. 1: Most NLP applications require human-annotated data.
Hard truth no. 2: Most software tools to produce said data are really not that good.