Building a Hierarchical Annotated Corpus of Thai Using Phrase Structure Grammar


International Journal of Advanced Intelligence
Volume 5, Number 1, pp.56-69, July, 2013.
© AIA International Advanced Information Institute

Rattasit Sukhahuta
Computer Science Department, Faculty of Science, Chiang Mai University,
Chiangmai, Thailand*
[email protected]

Prasert Luekhong
College of Integrated Science and Technology, Rajamangala University of Technology Lanna,
Chiangmai, Thailand
[email protected]

Received (18, December 2012)
Revised (31, May 2013)

With the amount of exploratory research on Thai natural language processing increasing every year, there is a growing need for annotated corpora as computational resources. Constructing a Treebank and annotating sentences are time-consuming processes that require a tremendous amount of resources and NLP tooling. This paper introduces an approach to automatically building a large Treebank for the Thai language using phrase structure annotation. Context-free grammar rules based on phrase structure grammar for the Thai language are developed in order to produce a syntactically annotated corpus automatically. The experimental results show a major improvement in efficiency with little human intervention.

Keywords: Thai Tree Bank; Orchid Part-of-Speech; ANTLR; Thai Context Free Grammar

1. Introduction

The growing research interest in Thai natural language processing has raised concern about the availability of a large annotated corpus, called a Treebank. A Treebank is usually created according to a specific linguistic theory and carries syntactic annotation. The ambiguity of Thai language structure needs to be addressed because it is a bottleneck for further analysis. In this research, we describe the design of a phrase structure grammar and the development of a grammar parser that assigns grammatical structure to an input text by annotating it with labeled brackets. The approach is divided into two parts. First, Thai written text is processed with morphological analysis and each word is assigned a part-of-speech (POS) tag. Second, the phrase structure grammar (PSG) rules for the Thai language are defined in Extended Backus–Naur Form (EBNF), following the principles of phrase structure and X-bar theory, which aims at the common properties of different types of syntactic constituents. The annotated text is usually in the form of syntactic bracket labeling, so that sentences are grouped into a hierarchical phrase structure. A Treebank is one of the important resources in natural language processing. It is essentially used for studying syntactic phenomena, and it has been used for training, algorithm comparison and results evaluation in computational linguistics. Constructing a Treebank is a complicated process that requires a tremendous amount of work and resources. The annotated text is assigned POS tags and organized into a hierarchical phrase structure in labeled brackets, which are corrected by human annotators.

*239 Huaykaew Rd. Tumbol Suthep, Amphur Muang, Chiangmai, Thailand 50200

In recent years, Treebanks have been created for many languages, but they are still rare for Thai. For English, the Penn Treebank [1] has proven to be an important resource containing over 4.5 million words of American English. In that work, entirely manual and semi-automated tagging were compared; bracketing was partially automated by automatically parsing the output of the POS tagging phase and simplifying it to yield a skeletal syntactic representation, which was then corrected by human annotators. In addition, the authors implemented a new syntactic annotation scheme designed to highlight aspects of predicate-argument structure [2]. They believe that this revised form of annotation will provide a corpus of annotated material that is useful for training stochastic parsers on surface syntax, for training stochastic parsers that work at one level of analysis beyond surface syntax, and at the same time provide a consistent database for use in linguistic research. Lepage et al. [3] presented a Treebank of 6,553 sentences of Japanese conversations in the domain of hotel reservations that exploits Tesnière’s structural syntax framework. Correspondences between surface texts and trees are ensured by means of intervals. Brants et al. [4] presented the TIGER Treebank, a corpus of 35,000 syntactically annotated German newspaper sentences based on the annotation scheme used for the NEGRA corpus, which is concerned with the use of secondary edges in coordination, verb sub-categorization, finer distinctions concerning the German expletive and a different treatment of structured proper nouns. Abbas et al. [5] presented a Treebank of the Urdu language based on probabilistic grammar. Han et al. [6] presented the Penn Korean Treebank, a 54-thousand-word Korean Treebank that uses phrase structure annotation based on morpho-syntactic phenomena. The Penn Arabic Treebank [7] was proposed by Maamouri et al., the Columbia Arabic Treebank by Habash and Roth [8], and a high-quality Treebank of the Chinese language, the Penn Chinese Treebank, was proposed by Xue et al. [9] and Yan et al. [10].

For the Thai language, Ruangrajitpakorn et al. [11] proposed an algorithm for creating a Thai Treebank based on the Categorial Grammar (CG) formalism. In that work, a Thai corpus was parsed with an existing CG syntactic dictionary and an LALR parser. The corrected parse trees were collected as a preliminary CG Treebank, which consists of 50,346 trees from 27,239 utterances with CG tags and composition. Ruangrajitpakorn, Trakultaweekoon & Supnithi [12] used only a small set of syntactic categories to construct large trees, with the assistance of linguistic knowledge for parsing the Thai language. In addition, the VB-EM algorithm was exploited to adjust the parameters of trees in order to select the best tree. This yields 70.62% accuracy for bracketing recovery. Rishøj et al. [13] extended the Thai Categorial Grammar Treebank [10] to dependency trees through automatic tree transformation.

This paper discusses the approach to and design of automatic Thai Treebank building, and a morphological process to deal with the language. The paper is arranged as follows: Section 2 describes the process of Thai morphological analysis and part-of-speech tagging. Section 3 discusses the construction of the context-free grammar rules for Thai phrases. Section 4 presents and discusses the experimental results. Finally, Section 5 concludes with the final results of the research and discusses our future work.

2. Thai Morphological Analysis

Thai is a language with a complex orthography and its own writing system. Thai words can be created in many different ways, such as compounding, reduplication, deletion and borrowing. Thai has no word inflection for case, sex or gender; therefore, the word form does not change. There is only one form of a word for different syntactic categories. For instance, the word ไป (go) has the same form for all tenses and genders. Additional words are normally added to the sentence to clarify the meaning. In Thai writing, words are written together without any indicator of the boundaries between words or sentences. Spaces are usually used to separate contexts, but they cannot be used as an indicator of sentence boundaries. In Thai morphology, a word by itself often cannot convey the complete meaning; rather, a group of words, normally in the form of a phrase, is needed. Many Thai words are compounds in which a new word is created from a combination of words, such as แม่ (mother) and มด (ant), which form the new word แม่มด (witch) with a meaning completely different from the original words. Moreover, the words that appear in a sentence are often in serial form, for instance, serial verbs, serial auxiliaries and serial determiners. Text preprocessing tasks such as word segmentation and part-of-speech tagging therefore often become necessary when performing Thai natural language analysis.

In this research, the Orchid annotated corpus has been used to study the constituents of Thai grammar. It has proven to be a complete resource for Thai word segmentation with POS tags. The corpus consists of 23,125 sentences, and each sentence has an average of 10-25 words. The sentences are segmented and tagged with parts of speech. In Orchid, the POS tags have been extended from the original 8 general Thai syntactic categories [a]. For instance, the noun (NOUN) is further divided into 6 subcategories: NPRP, NCNM, NONM, NLBL, NCMN and NTTL [12]. In this corpus, the training documents are divided into sentences. The following shows an example of a tagged sentence within the corpus:

Input: ผู้วิจัยสามารถที่จะวางแผนดำเนินการทั้งทางเทคนิคและอุปกรณ์เครื่องมือ
"Researcher can plan and carry out both technical issue and equipment tools"

Output: ผู้วิจัย/NCMN <space>/PUNC สามารถ/XVAM ที่จะ/JSBR วางแผน/VACT ดำเนินการ/VACT ทั้ง/JCRG ทาง/NCMN เทคนิค/NCMN และ/JCRG อุปกรณ์/NCMN เครื่องมือ/NCMN
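The word/TAG format of the tagged output can be read back into word and tag pairs with a few lines of code. The following is a minimal sketch in Python, not part of the original system; the function name parse_tagged is a hypothetical helper introduced only for illustration, and the input is assumed to be whitespace-separated tokens as in the example above.

```python
def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' is kept whole and only the tag is separated.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

# First five tokens of the tagged sentence shown above.
tagged = "ผู้วิจัย/NCMN <space>/PUNC สามารถ/XVAM ที่จะ/JSBR วางแผน/VACT"
pairs = parse_tagged(tagged)
tags = [tag for _, tag in pairs]
print(" ".join(tags))
```

The tag sequence obtained this way is the kind of input on which the POS patterns of Section 3 operate.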

We noticed that a sentence in the corpus can be either complete or incomplete. Although Thai writing and the Thai sentence have been clearly defined, many Thai sentences found in the documents are incomplete. Often the subject is omitted from a sentence that begins with a conjunction and continues the previous sentence. For instance, the following sentence begins with a conjunction and is considered a sentence in the training corpus.

Input: ซึ่งทั้งนี้เนื่องจากการประสานงานและความร่วมมือเป็นไปอย่างดีของนักวิจัยและผู้เกี่ยวข้อง
‘because of well coordination and cooperation of the researchers and people involved’

Output: ซึ่ง/JSBR ทั้งนี้/JCRG <space>/PUNC เนื่องจาก/JSBR การ/FIXN ประสานงาน/VACT <space>/PUNC และ/JCRG ความ/FIXN ร่วมมือ/VACT เป็น/VSTA ไป/XVAM อย่าง/FIXV ดี/VATT ของ/RPRE นักวิจัย/NCMN และ/JCRG ผู้เกี่ยวข้อง/NCMN

Phrasal structure is a grouping of words that belong to the same syntactic group. A phrase provides a more complete meaning than a single word. Each phrase constituent contains a head and modifiers. The advantage of phrase structure is that it exposes the different constituents that form a sentence. For example, a sentence that consists of Subject + Verb + Object can be expressed with the phrasal categories NP + VP + NP. Each phrasal category can in turn be organized into a hierarchical structure, such as NP -> N + Modifier, and so on. Sentences that contain all constituents are referred to as complete sentences. The phrasal categories usually contain the constituent head and its modifier, where the modifier is organized at the projection level in X-bar theory [11]. Therefore, we defined three main phrasal categories (NP, VP, PP) and five clausal structures: label noun, classifier, adverb node, ending and clause separator.

[a] A syntactic category for the Thai language normally consists of the 8 general categories of noun, verb, adjective, adverb, preposition, determiner, auxiliary and conjunction.

3. Phrase Structure Pattern Extraction

The hierarchical annotated sentence building approach uses phrase structure annotation for syntactic labeled bracketing. Since Thai has a rich orthographic structure, it is important to understand how word patterns are formed into phrases and then into a sentence. The purpose is to annotate a Thai sentence in a hierarchical form of phrase structure. This task is usually performed by human annotators with linguistic knowledge and an understanding of Thai syntactic structure. For a large corpus, it is difficult to do the task manually, since it requires many linguists and everyone must follow the same annotation rules and standards. In this work, the grammar patterns are learned and extracted from an annotated corpus containing sentences that have been segmented and tagged with Orchid part-of-speech tags. The technique is to identify the constituent head of each phrase and then find the modifiers around it. For example, consider the sentence with the POS pattern:

Input: NPRP NCMN XVBM XVMM VACT XVAE RPRE VACT VSTA XVAE

A common noun (NCMN), active verb (VACT) and preposition (RPRE) will be identified as the heads of the phrasal categories, and the modifiers around them are:

NPRP NCMN ; noun phrase pattern for series nouns
XVBM XVMM VACT XVAE ; pre-verb auxiliary + active verb + post-verb auxiliary
RPRE ; prepositional
VACT VSTA XVAE ; active verb + stative verb + ending auxiliary verb

In the process of grammar rule extraction, we created a pattern learning algorithm that performs backward and forward pattern searches from the head, using the part-of-speech tags annotated in the Orchid corpus. Tables 1 and 2 list the pre-defined rules for extracting Verb, Noun and Classifier patterns, which are manually learned from the training corpus and need to be specified in the algorithm. The process begins with the identification of the head node. The projection heads are the main syntactic categories, such as NCMN, NPRP, VACT and RPRE, which are commonly used to identify the type of a phrase. Once the head is found, the modifiers and arguments are searched for in both directions, forward and backward. The engine then extracts the phrasal pattern by scanning the annotated text starting from each head to find its modifiers, as illustrated in Algorithm 1. In each pass of the extraction process, a set of patterns is stored along with their frequencies. The results of the training process are the phrasal patterns with the frequency of each pattern. There are 511 patterns extracted for the phrasal groups, which can be divided into forward and backward patterns. The extracted patterns of each phrasal group are examined based on their structure and frequency of occurrence (Table 2). Each pattern contains at least the head and its modifier. For instance, the pattern VACT XVAE ADVN contains VACT as the head and XVAE ADVN as the modifier. The extracted patterns are ranked according to their frequency of occurrence.

Algorithm 1 Phrase Structure Pattern Learning Algorithm

S = {s | s is a stop node}
n ∈ node ∧ n ∉ S

Read Line from sentence file
while Line ≠ empty do
    Split Line into node[ ]
    for index = 0 to size(node) − 1 do
        if n = HeadNode then
            while n ∉ S do
                Pattern += n
                Left Shift for Backward
                Right Shift for Forward
            end while
        end if
    end for
end while
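Algorithm 1 can be sketched in a few lines of Python. This is an illustrative reading of the pseudocode, not the authors' implementation: the function name extract_patterns and the two stop sets are assumptions, and the stop sets are deliberately abbreviated subsets of the VACT lists in Table 1. The head tag is added to the pattern first, and the scan then moves one position at a time until it reaches a stop node or the end of the sentence.

```python
from collections import Counter

# Abbreviated stop sets for illustration only; Table 1 gives the full lists.
FORWARD_STOP = {"VACT", "PUNC", "NCMN", "NPRP", "JCRG", "JSBR", "RPRE"}
BACKWARD_STOP = {"VACT", "PUNC", "NCMN", "NPRP", "JCRG", "JSBR", "RPRE"}

def extract_patterns(tags, head, stop, step):
    """Grow one pattern from every occurrence of `head`.

    step=+1 scans forward (right shift), step=-1 scans backward (left shift).
    """
    patterns = []
    for i, tag in enumerate(tags):
        if tag != head:
            continue
        pattern = [tag]
        j = i + step
        while 0 <= j < len(tags) and tags[j] not in stop:
            pattern.append(tags[j])
            j += step
        if step < 0:
            pattern.reverse()  # restore left-to-right order
        patterns.append(" ".join(pattern))
    return patterns

# The POS pattern used as the example in this section.
sentence = "NPRP NCMN XVBM XVMM VACT XVAE RPRE VACT VSTA XVAE".split()
forward = Counter(extract_patterns(sentence, "VACT", FORWARD_STOP, +1))
backward = Counter(extract_patterns(sentence, "VACT", BACKWARD_STOP, -1))
print(forward)
print(backward)
```

Accumulating the two counters over every sentence in the corpus would give pattern frequencies of the kind reported for the forward and backward verb patterns.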

Table 1. Learning Pattern for Verb and Noun Categories

HeadNode: VACT
Forward StopNode: VACT|PUNC|NCMN|NPRP|NCNM|DDAN|DDAC|DDBQ|DDAQ|DIAC|DIBQ|DIAQ|DCNM|DONM|JCRG|JCMP|JSBR|RPRE|PPRS|PDMN|PNTR|PREL|NTTL|EAFF|EITT|PPRS|PDMN|PNTR|PREL|CNIT|CLTV|CMTR|CFQC|CVBL|JCRG|JCMP|JSBR|FIXN|FIXV

HeadNode: VACT
Backward StopNode: VACT|PUNC|NCMN|NPRP|NCNM|DDAN|DDAC|DDBQ|DDAQ|DIAC|DIBQ|DIAQ|DCNM|DONM|JCRG|JCMP|JSBR|PPRS|PDMN|PNTR|PREL|ADVN|ADVI|ADVP|ADVS|EITT|EAFF|CNIT|CLTV|CMTR|CFQC|CVBL|PPRS|PDMN|PNTR|PREL|JCRG|JCMP|JSBR|RPRE

HeadNode: NCMN
Forward StopNode: NCMN|PUNC|VACT|RPRE|DDAN|DDAC|DDBQ|DDAQ|DIAC|DIBQ|DIAQ|DCNM|DONM|XVBM|XVAM|XVMM|XVBB|XVAE|ADVN|ADVI|ADVP|ADVS|EAFF|EITT|PPRS|PDMN|PNTR|PREL|CNIT|CLTV|CMTR|CFQC|CVBL|JCRG|JCMP|JSBR

HeadNode: NCMN
Backward StopNode: NCMN|PUNC|VACT|DDAN|DDAC|DDBQ|DDAQ|DIAC|DIBQ|DIAQ|DCNM|DONM|XVBM|XVAM|XVMM|XVBB|XVAE|ADVN|ADVI|ADVP|ADVS|EAFF|EITT|CNIT|CLTV|CMTR|CFQC|CVBL|PPRS|PDMN|PNTR|PREL|JCRG|JCMP|JSBR|RPRE

Table 2. Learning Pattern for Classifier Category

HeadNode: CNIT CLTV CMTR CFQC CVBL
Forward StopNode: VACT|NCMN|PUNC|JCRG|JCMP|JSBR|RPRE|PPRS|PDMN|PNTR|PREL|EITT|EAFF|CNIT|CLTV|CMTR|CFQC|CVBL|XVBM|XVAM|XVMM|XVBB|XVAE|DDAN|DDAC|DDBQ|DDAQ|DIAC|DIBQ|DIAQ|DCNM|DONM|FIXV|FIXN

HeadNode: CNIT CLTV CMTR CFQC CVBL
Backward StopNode: VACT|NCMN|PUNC|JCRG|JCMP|JSBR|RPRE|PPRS|PDMN|PNTR|PREL|EITT|EAFF|CNIT|CLTV|CMTR|CFQC|CVBL|XVBM|XVAM|XVMM|XVBB|XVAE|DDAN|DDAC|DDBQ|DDAQ|DIAC|DIBQ|DIAQ|DCNM|DONM

Table 3. The results of extracted pattern

                    Forward Patterns    Backward Patterns
Verb Phrase (VP)    279                 51
Noun Phrase (NP)    64                  48
Classifier          41                  28

We then used the patterns as a guideline for manually constructing the context-free grammar rules, expressed in Extended Backus–Naur Form (EBNF) notation. With EBNF, we constructed the grammar patterns according to the phrase structure of Thai using regular-expression operators: ‘+’ (one or more), ‘*’ (zero or more) and ‘?’ (optional). The following statements show a simplified version of the context-free grammar rules, without the action rules:

--------------------------- Sentence ------------------------------
statement : (sentence)+
sentence : sub_sentence ((conjunction | clause_separator) sub_sentence)*
sub_sentence : (phrase_struture)+
phrase_struture : noun_phrase | verb_phrase | prepositional_phrase |
                  classifier_node | label_noun | ending

-------------------------- Noun Phrase ----------------------------
noun_phrase : noun_bar_specifier_relclause
noun_bar_specifier_relclause : noun_bar (relative_clause)?
noun_bar : (series_determiners)?
noun_words : serial_noun | personal_pronoun (post_noun)*
post_noun : (adverb_node | adjective_phrase definite_determiner |
            classifier_node | prepositional_phrase
noun_words : (serial_noun | personal_pronoun)
serial_noun : (noun | nominal_prefix) (noun | nominal_prefix)
nominal_prefix : (FIXN)+(serial_verb | serial_noun)
label_noun : NLBL
noun : NCMN | NTTL | NPRP
-------------------------- Verb Phrase ----------------------------
verb_phrase : verb_bar_specifier_relclause (noun_phrase)?
              (prepositional_phrase)?
verb_bar_specifier_relclause : verb_bar(relative_clause)?
verb_bar : (pre_verb)* verb_words (post_verb)*
pre_verb : (negator | series_preverb_auxiliary)
verb_words : (serial_verb | adverbial_prefix)
post_verb : (series_postverb_auxiliary | adverb_node | classifier_node)
serial_verb : verb (verb)*
verb : (VACT | VSTA )
attributive_verb : VATT
-----------------------Prepositional Phrase------------------------
prepositional_phrase : prepositional_bar (classifier_node)?
prepositional_bar : preposition (noun_phrase)?
preposition : RPRE
-----------------------Classifier--------------------------------
classifier_node : nominal_number (serial_classifier)?
                | classifier definite_determiner
main_classifier : serial_classifier
serial_classifier : classifier (classifier)*
nominal_number : (DCNM | NCNM | NONM | DONM)
post_classifier : series_definite_determiners
classifier : (CNIT | CLTV | CMTR | CFQC | CVBL)
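To make the top-down reading of these rules concrete, the following Python sketch hand-codes a small subset of them as a recursive-descent parser over (word, tag) pairs. It is an illustration under stated assumptions, not the ANTLR-generated parser used in this work: only noun, serial verb and preposition rules are covered, serial_noun is simplified to one or more nouns, the PRE_AUX set is an assumed stand-in for pre_verb, and all function names are hypothetical. Tokens that no rule accepts are emitted as [ERR].

```python
NOUN = {"NCMN", "NTTL", "NPRP"}     # noun : NCMN | NTTL | NPRP
VERB = {"VACT", "VSTA"}             # verb : (VACT | VSTA)
PREP = {"RPRE"}                     # preposition : RPRE
PRE_AUX = {"XVBM", "XVAM", "XVMM"}  # assumption: stands in for pre_verb

def leaf(token):
    word, tag = token
    return f"({tag} {word})"

def run(tokens, i, tagset):
    """Consume consecutive tokens whose tag is in tagset."""
    j = i
    while j < len(tokens) and tokens[j][1] in tagset:
        j += 1
    return [leaf(t) for t in tokens[i:j]], j

def noun_phrase(tokens, i):
    nouns, j = run(tokens, i, NOUN)  # serial_noun, simplified to noun+
    if not nouns:
        return None, i
    return "(NP (N' " + " ".join(nouns) + "))", j

def prepositional_phrase(tokens, i):
    if i >= len(tokens) or tokens[i][1] not in PREP:
        return None, i
    np, j = noun_phrase(tokens, i + 1)  # preposition (noun_phrase)?
    parts = ["(P' " + leaf(tokens[i]) + ")"] + ([np] if np else [])
    return "(PP " + " ".join(parts) + ")", j

def verb_phrase(tokens, i):
    pre, j = run(tokens, i, PRE_AUX)   # (pre_verb)*
    verbs, k = run(tokens, j, VERB)    # serial_verb : verb (verb)*
    if not verbs:
        return None, i
    parts = ["(V' " + " ".join(pre + verbs) + ")"]
    # (noun_phrase)? (prepositional_phrase)?
    for rule in (noun_phrase, prepositional_phrase):
        sub, k = rule(tokens, k)
        if sub:
            parts.append(sub)
    return "(VP " + " ".join(parts) + ")", k

def parse(tokens):
    out, i = [], 0
    while i < len(tokens):
        for rule in (noun_phrase, verb_phrase, prepositional_phrase):
            sub, j = rule(tokens, i)
            if sub:
                out.append(sub)
                i = j
                break
        else:
            out.append("[ERR]")  # no rule in this sketch matched the token
            i += 1
    return "(ROOT (S " + " ".join(out) + "))"

sentence = [("ตลาด", "NCMN"), ("สินค้า", "NCMN"), ("ที่", "PREL"),
            ("นำ", "XVAM"), ("มา", "XVAM"), ("ขาย", "VACT"),
            ("ใน", "RPRE"), ("ประเทศไทย", "NPRP")]
result = parse(sentence)
print(result)
```

Because relative clauses are left out of this sketch, the relative pronoun tagged PREL is reported as [ERR]; the full grammar handles it through the relative_clause rule.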

The grammar begins with the statement rule, a non-terminal for sentences, which is the highest level in the tree. A sentence can be divided into one or more sub-sentences, which are joined by a conjunction or a clause separator; in this case, a space is used as the separator. A sub-sentence can consist of different phrases, such as a noun phrase (NP), verb phrase (VP) or prepositional phrase (PP). To cope with the complexity of the Thai language, the rules are relaxed by allowing sentences to be either complete or incomplete; in this case, Classifier, Label Noun and Ending are allowed to appear by themselves as nodes in the parse tree. In this research, we have developed a Thai parser using ANTLR. The parser development was based on the phrase structure grammar rules, with a focus on optimizing the pre-processing of linguistic analysis. The parsing technique used in this study is top-down: the parser builds the syntax tree starting from the root node and works down to the leaves. The grammar rules use Extended Backus-Naur Form (EBNF) notation, which can describe context-free languages that are not LL(k) or LR(k). The top-down parsing strategy is augmented with syntactic predicates, which provide selective backtracking to resolve non-LL(1) decisions, and with semantic predicates, which handle context-sensitive tree patterns. The parser is divided into two modules. First, the scanner, or lexical analyser, scans the input stream from left to right to recognize valid strings called tokens; it works on the basis of deterministic finite automata (DFA) derived from regular expressions. Second, the parser recognizes the phrase structure of a group of words according to the predefined grammar rules. For each rule, action scripts are implemented in the C# programming language. The input sentences for the syntactic parser had already undergone morphological analysis and part-of-speech tagging with the Orchid tagset. The parser output is in the format of labeled brackets and can be displayed as a tree structure with the visualization tool we developed. The following sentence illustrates the phrase structure annotation with bracket labeling:

ตลาดสินค้าที่นำมาขายในประเทศไทย
marketing products that are sold in Thailand

(ROOT (S (NP(N’(NCMN(ตลาด) NCMN(สินค้า)) RelC(Pn(Prel(PREL(ที่)))
VP(V’(XVAM(นำ) XVAM(มา) VACT(ขาย)) PP(P’(RPRE(ใน))
NP(N’(NPRP(ประเทศไทย))))))))

4. Experimental Results and Discussion

The experiments for creating the Thai Treebank were performed on the Orchid Thai corpus [14], which has already been segmented and tagged with POS and consists of 23,125 sentences. The input sentences are pre-processed by organizing them into one sentence per line. We then parse each sentence using our parser. With the resulting Treebank, we study the probability of each phrasal pattern of the CFG rules using the NLTK toolkit. We have also developed a visualization tool using the NLTK toolkit to display a parse tree and evaluate the results (shown in Fig. 1). The parser parsed 20,369 sentences, or 88.08%. Sentences that contain at least one error (“[ERR]”) are considered unparsed and are stored in a different location. Many of the errors come from rule patterns that have not been specified in our grammar rules. They can be resolved by adding these rules to the CFGs, although this has to be done carefully since the errors may come from the following sources:

Table 4. Example of the extracted patterns for forward verb

Patterns                Frequency
VACT                    27443
VACT XVAE               2164
VACT FIXN               1893
VACT VSTA               950
VACT ADVN               659
VACT FIXN VATT          340
VACT FIXN VSTA          284
VACT XVAE ADVN          205
VACT XVAE VSTA          147
VACT XVAM               126
VACT XVBM               96
VACT ADVN XVAE          88
VACT VATT               85
VACT XVAE XVAE          83
VACT FIXV VATT          81
VACT XVBM VSTA          73
VACT ADVN ADVN          67
VACT ADVP               38
VACT VSTA XVAE          38
VACT XVAE FIXV VATT     35

Fig. 1. Example of the parsed result in a visualized tree

• There are errors during morphological processing, such as incorrect word segmentation or part-of-speech tagging, as in the following example,

Table 5. Example of the extracted patterns for backward verb

Patterns                     Frequency
VACT                         17802
FIXN VACT                    10260
XVAM VACT                    2347
XVBM VACT                    2166
VSTA FIXN VACT               869
XVAE VACT                    793
XVBM XVAM VACT               564
XVMM VACT                    499
XVBM XVMM VACT               271
VSTA VACT                    246
XVBM VSTA FIXN VACT          208
XVAM VSTA FIXN VACT          85
VATT VACT                    65
XVBM XVMM VSTA FIXN VACT     59
XVMM VSTA FIXN VACT          49
XVBM VSTA VACT               38
XVAM XVAM VACT               37
XVBM XVBM VACT               35
XVAM XVBM VACT               33
XVAM VSTA VACT               24

INPUT: และ/JCRG หวัง/VSTA เป็น/FIXN อย่าง/FIXV ยิ่ง/VATT ว่า/JSBR จะ/XVBM ได้รับ/VSTA ความ/FIXN ร่วมมือ/VACT มาก/ADVN ยิ่งขึ้นไป/ADVN

OUTPUT: (ROOT (S (JCRG และ) (VP(V’(VSTA หวัง)))(NP(N’(FIXN เป็น)(FIXV อย่าง) (VATT ยิ่ง) [ERR] )) (VP(V’(XVBM จะ)(VSTA ได้รับ))) (NP(N’(N’(FIXN ความ) (VACT ร่วมมือ)) (ADV(ADVN มาก)(ADVN ยิ่งขึ้นไป))))))

• The ‘<space>’ used as a context or clause separator is placed incorrectly.
• The preposition is left out of the sentence. For example, when the word ของ (‘of’) is missing, the result is an incorrect parse tree:

บ้านของผม => house of I => NCMN RPRE PRON
บ้านผม => house I => NCMN NCMN

• Many Thai sentences are not complete sentences, as they are not in the form Subject + Verb + Object. We resolved this problem by using parsing rules that allow the statement to be either a complete or an incomplete sentence. The incomplete sentences are intermediate-level phrases such as NP, VP and PP and constituent nodes such as Adverb and Classifier.
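The reported coverage follows directly from the sentence counts, and the separation of unparsed sentences can be expressed as a simple filter on the [ERR] marker. The sketch below is illustrative only; the function names split_by_error and coverage are hypothetical and are not taken from the authors' tooling.

```python
def split_by_error(parses):
    """Separate bracketed parses into (parsed, unparsed) lists.

    Any parse that contains the '[ERR]' marker counts as unparsed.
    """
    parsed = [p for p in parses if "[ERR]" not in p]
    unparsed = [p for p in parses if "[ERR]" in p]
    return parsed, unparsed

def coverage(parsed_count, total_count):
    """Percentage of sentences parsed without any error marker."""
    return 100.0 * parsed_count / total_count

# Counts reported in this section: 20,369 parsed out of 23,125 sentences.
print(f"{coverage(20369, 23125):.2f}%")  # → 88.08%

good, bad = split_by_error([
    "(ROOT (S (NP (N' (NCMN a)))))",
    "(ROOT (S (NP (N' (FIXN b) [ERR]))))",
])
```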

The results of the extraction process are CFG rule patterns with the frequency and probability of each pattern. There are 19,594 patterns in the CFG rules: 4,062 non-terminal-level rules (20.73%) and 15,532 terminal-level rules (79.27%). With the Treebank that we have created, we are able to study the phrasal patterns and to compute the probability of each CFG rule pattern. This has led to the construction of a probabilistic context-free grammar (PCFG) Treebank for the Thai language, which is part of our ongoing research. Examples of the Thai PCFG rules are shown in Table 6.

Table 6. Example of the Thai probabilistic context free grammar (PCFG)

Pattern           Frequency    Probability
NP->N’            46678        0.088637641
N’->NCMN          33084        0.062823765
VP->V’            22136        0.042034424
P’->RPRE          21492        0.040811521
PP->P’ NP         19994        0.037966944
N’->N’ PP         13500        0.025635378
V’->VACT          10494        0.019927234
FIXN->การ         9792         0.018594194
V’->VSTA          9228         0.017523205
N’->NCMN NCMN     7499         0.014239978
N’->N’ CL’        5998         0.011389703
RPRE->ของ         5690         0.010804837
Pn->Prel          5553         0.010544685
Prel->PREL        5553         0.010544685
RPRE->ใน          5340         0.010140216
JCRG->และ         5339         0.010138317
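The probabilities in Table 6 are numerically consistent with each frequency divided by a single overall total of about 526,616 rule occurrences (for example, 46678 / 0.088637641), rather than with normalization within each left-hand side, which is the more conventional PCFG estimate. The Python sketch below computes both quantities for a handful of rules so that the difference is visible. It is an illustration, not the NLTK-based procedure used in the experiments; the function names are hypothetical, and because only four rules from the table are included, the resulting values will not match the table.

```python
from collections import defaultdict

def relative_frequency(counts):
    """Share of each rule among all rule occurrences."""
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()}

def per_lhs_probability(counts):
    """Conventional PCFG estimate: normalize within each left-hand side."""
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {(lhs, rhs): n / lhs_totals[lhs]
            for (lhs, rhs), n in counts.items()}

# Four rules from Table 6, keyed as (left-hand side, right-hand side).
counts = {
    ("N'", "NCMN"): 33084,
    ("N'", "N' PP"): 13500,
    ("V'", "VACT"): 10494,
    ("V'", "VSTA"): 9228,
}
share = relative_frequency(counts)
conditional = per_lhs_probability(counts)

# Total number of rule occurrences implied by the first row of Table 6.
implied_total = 46678 / 0.088637641
print(round(implied_total))
```

With per-left-hand-side normalization the probabilities of all expansions of one non-terminal sum to 1, which is the property that probabilistic parsing algorithms usually assume.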

5. Conclusion

Treebank development is labor intensive and requires a tremendous amount of NLP resources. This paper describes our approach to automatically annotating Thai sentences with labeled brackets. In this research, the phrasal patterns extracted from the annotated corpus are used to construct the CFG rules, which results in a Thai parser for phrase structure annotation. We also emphasized the construction of Thai phrases using context-free grammars, including the noun phrase, verb phrase and prepositional phrase, and introduced additional phrasal nodes (e.g. classifier, labeling, ending and punctuation) into the phrase structure. Since the development of the parser itself is beyond the scope of this research, we used the ANTLR language translator tool, which provides syntactic predicates, to parse the sentences via arbitrary expressions using semantic and syntactic context. With this approach we are able to automatically create sentences annotated with bracket labeling for the Thai Treebank corpus.

Acknowledgments

This paper is part of the research project with research funding granted for the year 2012 by the Faculty of Science, Chiang Mai University.

References

1. M. Marcus and M. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank, Computational linguistics, 19, pp. 313–330, 1993.
2. M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. The Penn treebank: Annotating predicate argument structure, In Proceedings of the workshop on Human Language Technology, 1994, pp. 114–119.
3. Y. Lepage, S. Ando, S. Akamine, and H. Iida. An Annotated Corpus in Japanese Using Tesniere’s Structural Syntax, In ACL-COLING Workshop on Processing of Dependency-Based Grammars, 1998, pp. 109–115.
4. S. Brants, S. Dipper, and S. Hansen. The TIGER treebank, In Proceedings of the workshop on treebanks and linguistic theories, 2002, pp. 24–41.
5. Q. Abbas, N. Karamat, and S. Niazi. Development of Tree-bank Based Probabilistic Grammar for Urdu Language, International Journal of Electrical & Computer Sciences IJECS, 9(9), 2002.
6. C. Han, N. R. Han, E. S. Ko, H. Yi, and M. Palmer. Penn korean treebank: Development and evaluation, In Proc. Pacific Asian Conf. Language and Comp, 2002.
7. M. Maamouri, A. Bies, T. Buckwalter, and W. Mekki. The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus, In NEMLAR Conference on Arabic Language Resources and Tools, 2004, pp. 102–109.
8. N. Habash and R. M. Roth. Catib: The columbia arabic treebank, In Proceedings of the ACL-IJCNLP 2009, August, pp. 221–224, 2009.
9. N. Xue, F. Xia, F.-D. Chiou, and M. Palmer. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus, In Natural Language Engineering, vol. 11, no. 2, pp. 207–238, Jun. 2005.
10. Jiajun Yan, Bracewell B. David, Shingo Kuroiwa and Fuji Ren. Chinese semantic dependency analysis, Construction of a Treebank and its use, In classification, ACM Transactions on Speech and Language Processing, Vol.4, No.2, pp.1-20, 2007.
11. T. Ruangrajitpakorn, K. Trakultaweekoon, and T. Supnithi. A Syntactic Resource for Thai: CG Treebank, In Proceedings of the 7th Workshop on Asian Language Resources, 2009, pp. 96–102.
12. T. Ruangrajitpakorn, P. Boonkwan, T. Supnithi, and P. Bangcharoensap. A Semi-Supervised Approach on Using Syntactic Prior Knowledge for Construction Thai Treebank, In Asian Language Processing (IALP), 2010 International Conference on, 2010, pp. 285–288.
13. C. Rishøj, T. Ruangrajitpakorn, P. Boonkwan, and T. Supnithi. Automatic Transformation of the Thai Categorial Grammar Treebank to Dependency Trees, In aclweb.org, pp. 1243–1250, 2011.
14. V. Sornlertlamvanich, T. Charoenporn, and H. Isahara. ORCHID: Thai part-of-speech tagged corpus, In Orchid, TR-NECTEC-1997-001, 1996, no. April, pp. 5–19.
15. A. Kornai. The X-bar theory of phrase structure, Language, 66(1), pp. 24–50, 1990.
16. Fuji Ren and Bracewell B. David. Advanced Information Retrieval, Electronic Notes in Theoretical Computer Science, 225(1), pp.303-317, 2009.
17. Fuji Ren. From Cloud Computing to Language Engineering, Affective Computing and Advanced Intelligence, International Journal of Advanced Intelligence, 2(1), pp.1-14, 2010.

Rattasit Sukhahuta
He received a Ph.D. degree in Information Systems in 2001 from the University of East Anglia, United Kingdom. He is working as a full-time lecturer and assistant professor at the Computer Science Department, Faculty of Science, Chiang Mai University. His research interests include Natural Language Processing, Information Extraction and Language Understanding.

Prasert Luekhong
He received a master's degree in Internet and Information Technology in 2003 from Naresuan University, Thailand. He is now pursuing his Ph.D. degree in Computer Science at Chiang Mai University, Thailand. His research interests include Machine Translation and Natural Language Processing.