Chunking Italian: Linguistic and Task-oriented Evaluation*

Stefano Federici, Simonetta Montemagni, Vito Pirrelli

Istituto di Linguistica Computazionale, CNR, Via della Faggiola 32, Pisa, Italy

e-mail: {stefano,simo,vito}@ilc.pi.cnr.it

Abstract

This paper reports on the experience of developing and applying a shallow parsing scheme (chunking) to unrestricted Italian texts, with a view to the automatic acquisition of lexical information from corpora and the prospective definition of further, more complex levels of syntactic analysis. The first part of the paper illustrates the adopted annotation scheme in detail, relating it to more established linguistic notions and to some issues specific to Italian syntactic analysis. The second part of the paper focuses on a detailed evaluation of relevant issues such as: the reliability of text chunking with finite state technology, the usability of a chunked text as a source for automatic acquisition of lexical information, and the amenability of the chunking scheme to further, more complex levels of syntactic annotation.

1. Introduction

This paper* reports on the development of a scheme and tools for shallow parsing of unrestricted Italian texts, carried out in Pisa over the last two years in the framework of the SPARKLE project (LE-2111)1, with a view to:

1) defining an underspecified, scalable syntactic parsing level, which combines the property of being attainable without resorting to a rich syntactic lexicon with the further bonus of being linguistically adequate;

2) automatically acquiring, from a corpus annotated according to the developed scheme, lexical information such as subcategorization frames, collocations, lexico-semantic preferences, etc.

Section 2 introduces the basic notions underlying the annotation scheme, henceforth referred to as the “chunking scheme”. Section 3 gives a detailed account of the Italian typology of chunks, with extensive exemplification and some background linguistic justification. Section 4 is devoted to an evaluation of the performance of the Italian Chunker on unrestricted texts, by testing its precision and recall on a test bed drawn from a corpus of financial newspaper articles. Moreover, the section defines the basis for a task-oriented evaluation of the Italian chunking scheme by providing i) the results of a significant experiment in the automatic acquisition of lexical information (namely subcategorization frames) from chunked texts and ii) an estimate of the complexity of the task of adding, to the flat syntactic structure imposed by the chunking scheme, a further level of syntactic analysis specified for functional dependencies.

* The work reported in this paper was jointly carried out by the authors in the framework of the SPARKLE (Shallow PARsing and Knowledge extraction for Language Engineering) project (LE-2111). For the specific concerns of the Italian Academy only, S. Federici is responsible for sections 2 and 4.1, S. Montemagni for 3.1, 1 and 5, and V. Pirrelli for 3.2, 4.2 and 4.3.

1 The SPARKLE consortium consists of four academic partners (University of Stuttgart (Germany), University of Sussex (UK), Computer Laboratory of Cambridge University (UK), Consorzio Pisa Ricerche (coordinating partner, Italy)) and three industrial partners (SHARP Laboratories of Europe, Rank Xerox European Research Centre, Daimler-Benz AG). SPARKLE is an LE DGXIII-supported project.

2. Text Chunking: Basics

As a first approximation, chunking a text means segmenting it into an unstructured sequence of syntactically organized text units called “chunks” (Abney, 1991). In our view, this is to be done with a minimum of presupposed linguistic knowledge, that is, through recourse to an “empty” syntactic lexicon containing no other information than the entry’s lemma, part of speech and morpho-syntactic features. The resulting analyses are flat: all chunks are represented at the same structural level, as daughters of the same top node. In turn, each chunk C is a syntactically organized structure (defined in terms of attribute-value pairs), which displays chunk-specific features as well as the nature and scope of the dependencies holding between the words covered by C. Text chunking is carried out by a finite state automaton (hereafter referred to as the “Chunker”; Federici et al., 1996) which takes as input a morpho-syntactically tagged text. A whole range of objectives lies at the heart of the development of the chunking annotation scheme and its related software: minimization of typical problems in parsing real texts (such as parse overgeneration and undergeneration), identification of reliable syntactic constituents in a text through the sparsest possible available knowledge, and prospective acquisition of lexical information from corpora.

Under our interpretation, a chunk is a textual unit of adjacent word tokens: accordingly, discontinuous chunks are not allowed. Words which are covered by a single chunk share the property of being related through dependency chains which can be identified unambiguously in context with no recourse to lexical information other than part of speech and morpho-syntactic features. A chunk is always a maximal, non-recursive text unit of this type; it cannot be embedded within a more inclusive chunk. To be more concrete, a sentence such as le nuove tecnologie informatiche hanno un sempre maggiore impatto sul processo produttivo ‘the new computer technologies have an ever increasing impact on the productive process’ will be segmented into six chunks as follows:

A. [le nuove tecnologie]
B. [informatiche]
C. [hanno]
D. [un sempre maggiore impatto]
E. [sul processo]
F. [produttivo]

where each chunk includes a sequence of adjacent word tokens which are mutually related through dependency links of some specifiable kind. The chunk-internal

structure (described in detail in Federici et al., 1996) keeps track of these dependency links.
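By way of illustration only, the flat chunk sequence above can be rendered as a simple data structure. The sketch below is not the attribute-value notation of Federici et al. (1996); the field names (label, tokens, potgov) and the category labels (anticipated here from Section 3) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Chunk:
    """A chunk as a flat record of adjacent word tokens.

    Field names are illustrative assumptions, not the feature
    names of the original attribute-value notation."""
    label: str                    # chunk category (labels anticipated from Section 3)
    tokens: List[str]             # the adjacent word tokens covered by the chunk
    potgov: Optional[str] = None  # potential governor: the rightmost element

# The six chunks of the example sentence, as a flat sequence (no inter-chunk links):
sentence = [
    Chunk("N_C",   ["le", "nuove", "tecnologie"],           potgov="tecnologie"),
    Chunk("ADJ_C", ["informatiche"],                        potgov="informatiche"),
    Chunk("FV_C",  ["hanno"],                               potgov="hanno"),
    Chunk("N_C",   ["un", "sempre", "maggiore", "impatto"], potgov="impatto"),
    Chunk("P_C",   ["sul", "processo"],                     potgov="processo"),
    Chunk("ADJ_C", ["produttivo"],                          potgov="produttivo"),
]
```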

A chunked text does not contain information about the nature and scope of inter-chunk dependencies. Hence, if two text substrings are assigned different chunks, this does not necessarily exclude the existence of a dependency relationship between the two. For example, the chunked representation above says nothing about the relationship between impatto ‘impact’ and sul processo ‘on the process’, but this does not entail that such a relationship cannot possibly hold. Simply, the lexical knowledge available to the Chunker makes it impossible to state unambiguously which chunk relates to its neighbouring chunks and what the nature of this relationship is. If we abstract away from the lexical content of the chunks in the sentence above, E is potentially dependent on either D or C, and this cannot be decided unless subcategorization knowledge is resorted to (e.g. that impatto ‘impact’, unlike the verb avere ‘have’, subcategorizes for a prepositional phrase headed by su ‘on’). For lack of this knowledge, the inter-chunk dependency is left underspecified.

In the example above, use of underspecification (to be read here as unattachment) also motivates the different, admittedly non-conventional treatment of the adjectives informatico ‘computerized’ and produttivo ‘productive’ on the one hand, and nuove ‘new’ and maggiore ‘increasing’ on the other. Note that nuove and maggiore are, as it were, “trapped” between the determiner and the noun, thus becoming part of a wider (nominal) chunk. In the case of informatico and produttivo, on the other hand, the adjectives form independent chunks. The different treatment reflects the intuitive idea that the Chunker can go for unambiguous dependencies only. The position of nuove in context leaves no ambiguity as to its governor (i.e. the ensuing noun), and this is captured by making adjective and noun part of the same chunk. The same is not true of postnominal adjectival modifiers, which are then kept apart as independent chunks.

Our definition of chunk draws crucially on the notion of “potential governor”. A chunk contains at most one potential governor (marked in bold in the chunked sentence above). This is always the rightmost element of the word sequence covered by the chunk and generally (but not always) represents the syntactic head of the chunk. From an inter-chunk perspective, the potential governor is the word with which neighbouring chunks can syntactically combine in a dependency relationship. Clearly, the nature and direction of this dependency (whether from head to dependent or from dependent to head) is contingent on whether the potential governor subcategorizes for something or is subcategorized for by something else in the context considered. Although, as already pointed out above, the Chunker ignores inter-chunk dependencies, it nonetheless paves the way for them to be assigned at a later processing stage, by charting, as it were, the map of possible linguistic units (namely the potential governors) among which syntactic dependencies can possibly hold.

3. Chunking Scheme

We define two basic types of chunks: “phrase chunks” and “marker chunks”. Phrase chunks contain a potential governor. Marker chunks contain elements which cannot possibly act as a potential governor due to their categorial status, i.e. punctuation and coordinating conjunctions. Marker chunks are nonetheless quite important for the purposes of lexical acquisition, as they keep track of text markers which are used as “signposts” in the acquisition phase (Federici et al., 1998). In what follows, we will focus on phrase chunks only.

3.1 Typology of phrase chunks

The typology of phrase chunks in the Italian chunking annotation scheme is summarized in the table below.

NAME      TYPE                 POTGOV                  EXAMPLES
ADJ_C     adjectival chunk     adj                     bello ‘nice’, molto bello ‘very nice’
BE_C      predicative chunk    adj, past part          è bello ‘(it/(s)he) is nice’, è caduto ‘(it/he) fell’
ADV_C     adverbial chunk      adv                     sempre ‘always’
SUBORD_C  subordinating chunk  conj                    quando ‘when’, dove ‘where’
N_C       nominal chunk        noun, pron, verb, adj   la mia casa ‘my house’, io ‘I’, questo ‘this’, l’aver fatto ‘having done’, il bello ‘the nice (one)’
P_C       prepositional chunk  noun, pron, verb, adj   di mio figlio ‘of my son’, di quello ‘of that (one)’, dell’aver fatto ‘of having done’, del bello ‘of the nice (one)’
FV_C      finite verbal chunk  verb                    sono stati fatti ‘(they) have been done’, rimangono ‘(they) remain’
G_C       gerundival chunk     verb                    mangiando ‘eating’
I_C       infinitival chunk    verb                    per andare ‘to go’, per aver fatto ‘to have done’
PART_C    participial chunk    verb                    finito ‘finished’

Table 1: Typology of phrase chunks

The set of categories given above departs from the set of traditional phrasal categories used in constituency-based syntax in many respects. More granular distinctions are made here, e.g. verbal chunks are partitioned into subclasses on the basis of the verb mood. On the other hand, classical categories such as that_clause, wh_clause and so on, do not appear in the list since they happen to be decomposed into sequences of basic chunks; this is again due to lack of certainty in identifying their borders on the basis of the available knowledge. In what follows, we provide, for each chunk category, a definition and some relevant examples.

ADJ_C

ADJ_Cs are chunks beginning with any premodifying adverbs and intensifiers and ending with a head adjective. This definition provides a necessary but not sufficient condition for the identification of an ADJ_C. In fact, adjectival phrases occurring in pre-nominal position are not marked as distinct chunks, since their relationship to the governing noun is unambiguously identified within the nominal chunk (see the example sentence above). The same holds for predicate adjectival phrases governed by the verb essere ‘be’, which are part of a BE_C (see below). ADJ_Cs thus include:

• post-nominal adjectival phrases, either immediately following the noun they modify or placed further down in the sentence:
⇒ [N_C un bambino N_C] [ADJ_C bravo ADJ_C] ‘a good boy’
⇒ [N_C la progettazione N_C] [P_C di tecniche P_C] [P_C di base P_C] [ADJ_C indispensabili ADJ_C] [P_C al progresso P_C] [ADJ_C industriale ADJ_C] ‘the design of basic techniques indispensable to industrial progress’

• predicate adjectival phrases which are not governed by the verb essere ‘be’:
⇒ [FV_C diventa FV_C] [ADJ_C più difficile ADJ_C] ‘(it) gets more difficult’
⇒ [FV_C lo considera FV_C] [ADJ_C molto opportuno ADJ_C] ‘(he) considers it very appropriate’

The fact that predicate adjectival phrases governed by copulative verbs other than essere are treated differently from adjectival phrases governed by the verb essere follows from the assumption that the Chunker relies only on basic linguistic information concerning lemmata and parts of speech. The class of copulative verbs is a potentially open class, and its definition thus goes beyond the range of linguistic knowledge presupposed by the Chunker.

BE_C

BE_Cs consist of a form of the verb essere ‘be’ and an ensuing adjective or past participle, including any intervening adverbial phrase. E.g.:

⇒ [BE_C è intelligente BE_C] ‘(he) is intelligent’
⇒ [BE_C è molto bravo BE_C] ‘(he) is very good’
⇒ [BE_C è appena arrivato BE_C] ‘(he) just arrived’

ADV_C

ADV_Cs extend from any adverbial pre-modifier to the head adverb. Once more, this definition provides a necessary but not sufficient condition for an ADV_C. In fact, adverbial phrases that occur between an auxiliary and a past participle form are not identified as distinct chunks, due to their unambiguous dependency on the verb. By the same token, adverbs which happen to immediately premodify verbs or adjectives are part of a verbal chunk and an adjectival chunk respectively. Finally, noun phrases used adverbially (e.g. questa mattina ‘this morning’) are treated as nominal chunks (see below). E.g.:

⇒ [FV_C ha sempre camminato FV_C] [ADV_C molto ADV_C] ‘(he) has always walked a lot’
⇒ [FV_C ha finito FV_C] [ADV_C molto rapidamente ADV_C] ‘(he) has finished very quickly’

SUBORD_C

SUBORD_Cs are chunks which include a subordinating conjunction. A subordinating conjunction is chunked as an independent chunk in its own right only when it is not immediately followed by a verbal group. Compare, for example, the chunk structure of the following sentence

⇒ [FV_C non so FV_C] [SUBORD_C quando SUBORD_C] [N_C il direttore N_C] [FV_C mi riceverà FV_C] ‘(I) do not know when the director will receive me’

with the chunk structure of the following sentence, which differs from the previous one in having the subject of the subordinate clause in postverbal position:

⇒ [FV_C non so FV_C] [FV_C quando mi riceverà FV_C] [N_C il direttore N_C].

N_C

N_Cs extend from the beginning of the noun phrase to its head. They include nominal chunks headed by nouns, pronouns, verbs in their infinitival form when preceded by an article (i.e. Italian nominalised infinitival constructions) and proper names. Noun phrases functioning adverbially (e.g. questa mattina ‘this morning’) are also treated as nominal chunks. All kinds of modifiers and/or specifiers occurring between the beginning of the noun phrase and the head are included in N_Cs. E.g.:

⇒ [N_C un bravo bambino N_C] ‘a good boy’
⇒ [N_C tutte le possibili soluzioni N_C] ‘all possible solutions’
⇒ [N_C i sempre più frequenti contatti N_C] ‘the ever more frequent contacts’
⇒ [N_C questo N_C] ‘this’
⇒ [N_C il camminare N_C] ‘walking’
⇒ [N_C il bello N_C] ‘the nice (one)’

In the chunking scheme, nominal chunks cover only a portion of the range of linguistic phenomena normally taken care of by nominal phrases: namely only noun phrases with prenominal complementation.
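Since an N_C runs from the beginning of the noun phrase to its head, its recognition can be approximated by a regular pattern over part-of-speech tags. The following sketch is a deliberately simplified stand-in for the actual finite state automaton; the tag set (DET, ADV, ADJ, NUM, NOUN) and the pattern itself are assumptions made for illustration.

```python
import re

# Hypothetical POS-tagged input, one (word, tag) pair per token:
tagged = [("i", "DET"), ("sempre", "ADV"), ("più", "ADV"),
          ("frequenti", "ADJ"), ("contatti", "NOUN")]

# Simplified N_C pattern over the tag sequence: an optional determiner,
# any prenominal modifiers (each with optional adverbial premodification),
# then the head noun. The real Chunker is a finite state automaton over
# a much richer morpho-syntactic tag set.
N_C_PATTERN = re.compile(r"(DET )?((ADV )*(ADJ|NUM) )*NOUN")

tags = " ".join(tag for _, tag in tagged)
print(bool(N_C_PATTERN.fullmatch(tags)))  # True -> [N_C i sempre più frequenti contatti N_C]
```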

P_C

P_Cs go from a preposition to the head of the ensuing nominal group. Most of the criteria given for N_Cs also apply here. Typical instances of P_Cs are:

⇒ [P_C per i prossimi due anni P_C] ‘for the next two years’
⇒ [P_C fino a un certo punto P_C] ‘up to a certain point’

FV_C

FV_Cs are finite verb chunks which include all intervening modals, ordinary and causative auxiliaries as well as medial adverbs and clitic pronouns, up to the head verb. E.g.:

• verbal chunk with auxiliary or modal verb and medial adverb:
⇒ [FV_C può ancora camminare FV_C] ‘(he) can still walk’

• verbal chunk with pre-modifying adverb:
⇒ [FV_C non ha mai fatto FV_C] [ADV_C così ADV_C] ‘(he) has never done so’

• the auxiliary essere ‘be’ in periphrastic verb forms (whether active or passive), such as sono caduto ‘I fell’, sono stato colpito ‘I was hit’, or mi sono accorto ‘I realized’, is dealt with as part of a finite verb chunk, unless the verb essere is followed by a past participle which the dictionary also categorises as an adjective; in the latter case it is chunked as a BE_C (see above):
⇒ [FV_C è FV_C] [N_C un simpatico ragazzo N_C] ‘(he) is a nice guy’

• fronted auxiliaries constitute separate FV_Cs:
⇒ [FV_C può FV_C] [N_C la commissione N_C] [I_C deliberare I_C] [P_C su questa materia P_C]? ‘can the Commission deliberate on this topic?’

• periphrastic causative constructions:
⇒ [FV_C fece studiare FV_C] [N_C il bambino N_C] ‘(he) let the child study’

• clitic pronouns are part of the chunk headed by the immediately adjacent verb:
⇒ [FV_C lo ha sempre fatto FV_C] ‘(he) has always done it’

G_C

G_Cs contain a gerund form. When part of a tensed verb group (e.g. in progressive constructions), the gerundival verb form is not marked independently. G_Cs also include gerund forms functioning as noun phrases. E.g.:

⇒ [FV_C sta studiando FV_C] ‘(he) is studying’
⇒ [G_C studiando G_C] [FV_C ho imparato FV_C] [ADV_C molto ADV_C] ‘by studying (I) have learned a lot’

I_C

Infinitival chunks (I_Cs) include both bare infinitives and infinitives introduced by a preposition. E.g.:

⇒ [FV_C ha promesso FV_C] [I_C di arrivare I_C] [ADV_C presto ADV_C] ‘(he) has promised to arrive early’
⇒ [FV_C desidera FV_C] [I_C partire I_C] [ADV_C domani ADV_C] ‘(he) wishes to leave tomorrow’

PART_C

A past participle chunk (PART_C) includes participial constructions such as:

⇒ [PART_C finito PART_C] [N_C il lavoro N_C], [N_C Giovanni N_C] [FV_C andò FV_C] [P_C a casa P_C] ‘(having) finished the job, John went home’

3.2 Outstanding issues: chunk categories and underspecification

The category of a chunk cannot always be identified with certainty by the Chunker. The problem can partly be circumvented through the use of underspecified chunk categories, which add to the inventory provided above. The Chunker resorts to underspecified categories in cases of systematic ambiguity. For instance, the chunk type di_C covers chunks introduced by a complex form of the preposition di ‘of’ (i.e. di fused with an article), which can be interpreted either as a preposition or as a partitive article: e.g. [di_C dello zucchero di_C] ‘some sugar’ / ‘of sugar’. The category di_C is compatible with both analyses and thus subsumes both N_C and P_C.

The systematic ambiguity between adjectives and past or present participles is another case in point. Consider the phrase un’immagine colorata ‘a coloured picture’ and its chunked representation below:

A. [N_C un’immagine N_C]
B. [?_C colorata ?_C]

Here, the potential governor in B can be either a past participle form of the verb colorare ‘colour’ or an adjective (colorato ‘coloured’). The corresponding chunk category would vary accordingly between PART_C and ADJ_C respectively. This ambiguity is preserved by means of the underspecified chunk category ADJPART_C, which subsumes both ADJ_C and PART_C: ?_C above would thus be replaced by [ADJPART_C colorata ADJPART_C].

Finally, the homography between the relative pronoun che and the subordinating conjunction che gives rise to yet another possible syntactic ambiguity. che_C is introduced as a cover chunk for both constructions: it starts from an occurrence of che and is constructed like a SUBORD_C (see the definition above). Crucially, it leaves the contextually appropriate part of speech of che underspecified.
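The subsumption relation between underspecified and fully specified categories can be made explicit as a small table; this becomes relevant in Section 4.1, where automatic chunks are matched against hand-coded ones. The sketch below is ours; in particular, the cover set assumed for che_C (a pronominal versus a conjunction reading) is an interpretation of the description above, not a specification from the scheme.

```python
# Cover sets for the underspecified categories discussed in the text.
# The entry for che_C is an assumption based on the description above.
SUBSUMES = {
    "di_C":      {"N_C", "P_C"},
    "ADJPART_C": {"ADJ_C", "PART_C"},
    "che_C":     {"SUBORD_C", "N_C"},
}

def compatible(auto_label: str, gold_label: str) -> bool:
    """True if the automatically assigned label either equals the
    hand-coded one or is an underspecified category subsuming it."""
    return auto_label == gold_label or gold_label in SUBSUMES.get(auto_label, set())

assert compatible("ADJPART_C", "PART_C")   # underspecified, but compatible
assert not compatible("ADJ_C", "PART_C")   # fully specified mismatch: an error
```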

4. Evaluation

In this section we evaluate i) the accuracy and robustness of the Italian Chunker; ii) the usefulness of the chunking scheme for the task of bootstrapping a syntactic lexicon from a chunked corpus; and, finally, iii) the amenability of chunked texts to further, more refined levels of syntactic analysis, with a view to bootstrapping a full syntactic analysis. For the purposes of iii) we give an estimate of the amount of dependency ambiguity left unsolved in a chunked text, and a quantitative assessment of the portion of that ambiguity which is likely to be curtailed by relying on a lexicon augmented with subcategorization information.

4.1 Accuracy of Chunking

For the purposes of evaluation, the Italian Chunker was run on the SPARKLE Italian test bed, a sample of 200 sentences drawn from a corpus of financial newspaper articles. Input sentences were tagged automatically by an Italian stochastic tagger (Picchi, 1994). The results of chunking were then evaluated for precision and recall against a version of the same sample chunked by hand. Chunk precision and recall were computed on the basis of the data reported in the table below.

                      AUTOMATICALLY CHUNKED TEXT    MANUALLY CHUNKED TEXT
total no. of chunks   3,842                         3,883
matches               3,520

Table 2: Comparing automatic and manual chunking

which gives i) the number of chunks automatically identified by the Chunker, ii) the number of manually annotated chunks and iii) the total number of matches between i) and ii). These figures give a recall of 90.65% (3,520/3,883) and a precision of 91.62% (3,520/3,842).
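For concreteness, these figures can be reproduced directly from the raw counts of Table 2. The short sketch below (variable names are ours) also anticipates the adjusted precision obtained further down by discounting the 83 unidentified chunks as “no answers”.

```python
# Raw counts from Table 2 (variable names are ours):
auto_chunks  = 3842   # chunks identified by the Chunker
gold_chunks  = 3883   # chunks in the hand-annotated sample
matches      = 3520   # full + compatible matches
unidentified = 83     # "no answer" chunks, discussed below

recall    = matches / gold_chunks   # 0.9065 -> 90.65%
precision = matches / auto_chunks   # 0.9162 -> 91.62%

# Discounting the "no answers" from the denominator:
adjusted_precision = matches / (auto_chunks - unidentified)  # 0.9364 -> 93.64%

print(f"recall={recall:.2%}  precision={precision:.2%}  adjusted={adjusted_precision:.2%}")
```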

In this calculation, both full and compatible matches are counted as full matches. A compatible match is one between a fully specified hand-coded chunk and an automatically identified chunk left underspecified for the chunk category and/or the potential governor, provided that the underspecified chunk subsumes the fully specified one. The reason why all compatible matches, 620 in the whole test sample considered, are counted as full matches for evaluation is basically that underspecification is part and parcel of the chunking scheme, as it responds to the linguistic intuition underpinning the scheme’s design: “settle on dependable choices only, given the linguistic knowledge available”. In fact, most of the 620 underspecified representations found are due to systematic categorial ambiguities of Italian (as in the case of di_C and che_C) which cannot possibly be solved on the basis of the Chunker’s a priori knowledge.

On the basis of the figures in the table above, there are a number of cases (namely 322) for which a match could not be found. This class includes: i) automatically unidentified chunks, that is cases where the Chunker was unable either to assign a category or to identify linguistically sound boundaries, and ii) truly mistaken chunks, containing either a wrong category or a wrong boundary or both. There are 83 instances of case i) (corresponding to 2.16% of the total number of identified chunks) and 239 instances of case ii), or simply errors (6.22%). If unidentified chunks are considered as “no answers”, and are accordingly discounted from the total number of identified chunks for evaluation, then precision rises to 93.64%.

Chunk matching was required to involve all of the following levels at the same time:
• chunk category
• chunk boundary
• potential governor (lemma and part of speech)

Both compatible matches and erroneous analyses were classified according to the level at which a mismatch occurs. Results are reported in the table below:

                     COMPATIBLE MATCH    ERROR    TOTAL
chunk category       538                 185      723
chunk boundary       not appl.           95       95
potential governor   440                 197      637
TOTAL                978                 477      1,455

Table 3: Typology of mismatches

With both compatible matches and errors, mismatches mainly involve the chunk category and the potential governor; mismatches at the chunk boundary level occur more rarely and only in the case of erroneous analyses.

The reader will also note that the total number of mismatch types (1,455) outnumbers the number of actually observed mismatches, be they compatible matches (620) or errors (239). In fact, the same per-chunk mismatch can simultaneously involve different factors, e.g. the chunk category and the potential governor. For instance, whenever an underspecified chunk category is selected, information about the potential governor may also be underspecified, since it has to include all potential governors which are amenable to the underspecified category (e.g. an ADJPART_C is assigned a list of potential governors corresponding to both the adjectival and verbal readings of the word form occurring in the text).

Chunking errors were also classified on the basis of their primary source. Firstly, there are errors due to the imperfect coverage of the dictionary used by the morpho-syntactic tagger. Other errors originate at the level of tagging, when the wrong part of speech for the word in context is selected. Finally, there are errors made by the Chunker itself, despite the availability of the relevant knowledge. The table below illustrates the distribution of errors per source.

                      ERROR SOURCE
                      chunker        tagger         dictionary     total
ERROR TYPE            n.     %       n.     %       n.     %       n.     %
chunk category        15     8.1     141    76.2    29     15.7    185    100
chunk boundary        12     12.6    65     68.4    18     19.0    95     100
potential governor    16     8.1     152    77.2    29     14.7    197    100
total                 43     9.0     358    75.1    76     15.9    477    100

Table 4: Errors and their source

More than 70% of the errors originate, on average, at the tagging level. This is a lower-bound estimate, since it does not include tagger errors that were corrected at the chunking level: the Chunker contains several built-in strategies for recovering from systematic errors of the tagger, and tagging errors corrected during chunking are not included in the figures above. Another significant group of errors is due to the limited coverage of the morphological dictionary used by the tagger. The remaining errors are genuine chunking mistakes: on average, they amount to only 9% of all identified errors, that is, less than 0.2% of all identified chunks.

4.2 Chunking and Lexical Acquisition

A way to assess the usefulness of the Italian chunking scheme is to consider the suitability of a chunked text as a source for the automatic acquisition of lexical information. For this purpose, chunked texts were scoured by an analogy-based algorithm for the acquisition of verb subcategorization frames from attested, unfiltered usages of a verb in real texts (Federici et al., 1998; Carroll et al., 1997a). An experiment was carried out for 30 verbs. The acquisition corpus was a collection of (chunked) financial newspaper articles amounting to one million word tokens. Acquired frames were evaluated against a Lexical Test Suite (LTS) developed for the 30 verbs, starting from the information provided by a general-purpose computational lexicon, the Italian PAROLE lexicon (Ruimy et al., 1997), integrated with domain-specific lexical evidence acquired through manual analysis of the acquisition corpus.

The quality of acquired lexical information was measured in terms of type recall, i.e. the percentage of correctly acquired subcategorization patterns with respect to LTS, and type precision, i.e. the percentage of correct patterns among all patterns acquired.

The process of evaluating the performance of the system relative to LTS could in principle be reduced to a report of per-frame type recall/precision. However, an analogy-based algorithm for the acquisition of frames from corpus data relies merely on attested evidence, with no consistency or completeness check on acquired frames. Since running texts are typically fraught with ellipses, acquired frames often lack the specification of omitted complements. This general feature of real language usage combines in Italian with so-called pro-drop phenomena, which make the acquisition of subjectless frames fairly likely. Hence, an evaluation based on per-frame type recall only is unsuitable for our purposes unless more flexible measures are concurrently adopted: we thus introduced the measure of per-slot recall (see below). Type precision is instead measured per frame only.

LTS contains 209 frames, 16% of which appear to be domain-specific. On average, each verb in LTS is attested under 6.97 different frames. The following table gives a detailed account of the reliability of acquired information in terms of: i) per-frame recall; ii) per-slot recall; iii) precision.

       PER-FRAME RECALL    PER-SLOT RECALL    PRECISION
LTS    55%                 68%                75%

Table 5: Recall and precision

Per-frame recall always requires a full match of acquired frames against those in LTS; this entails that acquired frames with omitted complements are not counted as correctly acquired subcategorization patterns. A more flexible measure, tolerant of complement omission, is given by per-slot recall. For each acquired frame, the overall per-slot recall is gauged as the sum of all single per-slot recall scores divided by the number of arguments in the LTS frame. When the acquired frame covers the LTS frame in its entirety, the overall per-slot recall is trivially 1; when the LTS frame is covered only partially by the acquired frame, the overall per-slot recall is less than 1. To illustrate, consider the following LTS frame of the verb tagliare ‘cut’, covering the construction tagliare qualcosa con uno strumento ‘to cut something with an instrument’:

FRAME SLOT    ACQUIRED LEXICON    RECALL
subj          1                   1
obj           1                   1
iobj_con      -                   -
overall       2                   0.66

Table 6: Per-slot recall for a frame of tagliare ‘cut’

Here, a positive match is found only for the subject and object slots. As a consequence, the overall per-slot recall figure amounts to 0.66 (i.e. 2/3). Precision in Table 5 is calculated as the number of correctly acquired frames divided by the number of all frames acquired, respectively 298 and 397. Correctness of acquired frames was manually checked. The disproportion between correctly acquired frames (298) and frames in LTS (209) is explained by the large number of elliptic but still correct acquired frames. The overall ratio between fully specified reference frames and correctly acquired ones can be gauged by dividing the corresponding figures: 298/209. On average, one LTS frame is matched by 1.4 acquired frames. Per verb, the ratio oscillates between a minimum of 0.5 and a maximum of 3.
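As a worked restatement of Table 6, per-slot recall can be computed as the fraction of LTS argument slots covered by the acquired frame. The function below is a sketch of the measure as defined above, with slot names taken from the table.

```python
def per_slot_recall(lts_frame: set, acquired_frame: set) -> float:
    """Fraction of the LTS frame's argument slots covered by the acquired frame."""
    return len(lts_frame & acquired_frame) / len(lts_frame)

# LTS frame of tagliare 'cut': tagliare qualcosa con uno strumento.
lts      = {"subj", "obj", "iobj_con"}
acquired = {"subj", "obj"}   # the instrumental complement was not acquired

print(per_slot_recall(lts, acquired))   # 2/3, i.e. the 0.66 of Table 6
```

On this measure the acquired frame scores 2/3, whereas per-frame recall, requiring a full match, would score it 0.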

An estimate of the overall coverage of the acquired information can be inferred from the percentage of utterly missed frames (that is, LTS frames for which the acquisition engine was not able to contribute any sort of information), which amounts to 23%; the complement of this figure, i.e. 77%, is a good estimate of the actual amount of information acquired.

Finally, we checked the results of the experiment by token recall against a manual analysis of the attested subcategorization frames in the acquisition corpus. Given the amount of manual checking required, this was done for three verbs only. It turned out that none of the frames actually used in the corpus eluded the analogy-based acquisition algorithm.

4.3 Beyond Chunking

To what extent is our chunking scheme scalable? To give an objective answer to this question, we evaluated the amenability of the chunking scheme to a more complex level of syntactic annotation. In particular, we focused on the automatic identification of inter-chunk syntactic dependencies on the basis of the information provided by the automatically acquired syntactic lexicon. Hereafter, we refer to this level of syntactic analysis as “hyper-chunking”.

Assessment of the feasibility of hyper-chunking an already chunked text was carried out by projecting the acquired subcategorization frames onto a test bed of chunked sentences drawn from the SPARKLE Italian test bed (see Section 4.1), with a view to identifying contextually appropriate functional dependencies between chunks. The functional dependencies thus identified were evaluated against a version of the test bed manually annotated at the functional level. Note that the acquired lexicon used in this experiment consists of verb subcategorization frames only; a more comprehensive evaluation would require simultaneous consideration of subcategorization information for all words occurring in the specific context.

For the purposes of this evaluation, 100 different verb types were randomly selected out of the 400 verbs attested in the SPARKLE Italian test bed. The selected verbs occur in 160 nuclear sentences (either main or subordinate clauses) and entertain an overall number of 275 inter-chunk dependencies. We concentrated on the correct identification of these verb dependencies only.

For each nuclear sentence headed by a test verb V, we first assessed the complexity of the task of establishing the dependency environment of V. This was done by measuring the degree of potential ambiguity of each chunked context C of V in terms of the prior probability P, for each chunk in C, of syntactically depending on V.

Following Basili et al. (1997), P is gauged on the basis of the cardinality of the chunk’s collision set, that is, on the basis of the number of linguistic units (phrase chunks) which the chunk in question can potentially depend on. Consider, for instance, the following chunked nuclear Italian sentence:

[N_C il bambino] [FV_C ha dato] [N_C un libro] [P_C alla mamma] ‘the child gave a book to his mother’

On the basis of general syntactic principles, the chunk [P_C alla mamma] in the above context can, for lack of explicit subcategorization knowledge, depend on either the nominal chunk [N_C un libro] or the verbal chunk. Our definition of collision set also includes, as a single case, the theoretically possible case where the chunk in question depends on some other governor occurring in another adjacent nuclear sentence (for a detailed discussion of the criteria guiding the definition of collision sets, see Carroll et al., 1997b). Thus, in the case above, the cardinality of the P_C’s collision set is 3. The prior probability P of getting the dependency of a chunk in C right with no lexical information is thus estimated as 1/|CS| (where |CS| is the cardinality of the collision set). This amounts to 1/3 in the case at hand. Intuitively, P corresponds to the probability of getting the dependency of the chunk right by chance.

We then calculated the posterior probability P’ for a chunk occurring in C to syntactically depend on the verb head V in question, given the subcategorization frames of V in an available lexicon. In particular, P’ is calculated as a function of the subcategorization information provided for the verb V by the automatically acquired lexicon (Carroll et al., 1997a). For example, if the P_C in question (alla mamma) is found in a subcategorization frame in the list of acquired frames of dare ‘give’, then we assume that P’ is equal to 1. In other words, if a given chunk category is found in a subcategorization frame of V, then the chunk bearing that category in the chunked sentence is classified as a complement of V with certainty (this is certainly a simplifying assumption, but it is as far as we can get given the constraints of our evaluation, which is based on verb subcategorization only). P’ is 0 if the chunk under analysis does not appear in any of the frames of the verb. More problematic cases of P’ estimation, where the value ranges between 0 and 1, are considered in detail in Carroll et al. (1997b).

Finally, we calculated Kononenko and Bratko’s (1991) information score (IS) relative to the two probability distributions P and P’ for each chunk attested in a context C of V, as follows:

a) IS = -log2(P) + log2(P’)          if P’ > P
b) IS = log2(1 - P) - log2(1 - P’)   if P’ < P

Intuitively, IS is a measure of how good our classification is relative to the difficulty of the task: the correct classification into an a priori more probable class has a lower value than the correct classification into a less probable class. Thus, for each chunk, a positive IS indicates that our system correctly predicted the attachment of the chunk to V. Vice versa, a negative IS indicates that the correct attachment of the chunk to V was missed.

Given the chunked context C, its overall IS is eventually obtained as the sum of the IS of each chunk c in C divided by the overall number of chunks in C (expressed below as |C|):

IS(C) = 1/|C| ∑c∈C IS(c)

Following Kononenko and Bratko, an estimate of the initial difficulty of the task of classifying each chunk as syntactically dependent on V in a test bed T is then given by the average entropy E. E is a measure of the probability of assigning the appropriate dependency to each chunk in T randomly (under the assumption that each assignment is independent of any other):

E = 1/|T| ∑c∈T -log2 P(c)

If there were no ambiguity in assigning each chunk the right dependency, E would be 0. In our test bed, E is 0.99. This means that one has, on average, one chance out of two of getting the dependency right randomly2.

The gain in the hyper-chunking task through use of the acquired lexical information is then represented by the absolute information score ISa. ISa is the sum of all IS values scored by the acquired lexicon for each chunk in each nuclear sentence, divided by the overall number of chunks in the test sentences:

ISa = 1/|T| ∑c∈T IS(c)

For ISa we obtained a value of 0.70, whereas the value corresponding to the perfect classification (i.e. with each chunk being assigned the contextually appropriate dependency with certainty) is ISa = E = 0.99. A comparative evaluation of the validity of our classification can thus be gauged through the relative information score ISr = ISa/E. In our evaluation, ISr is 0.71, whereas the maximum value of ISr, corresponding to the perfect classification, is 1.

2 This figure includes cases such as the occurrence of syntactically non-ambiguous clitic pronouns, which inevitably lowers the perplexity of the task.
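The scoring machinery above can be restated compactly. The sketch below implements IS, E, ISa and ISr as defined; the per-chunk (P, P’) pairs are invented for illustration, since the test-bed data are not reproduced here.

```python
from math import log2

def info_score(p: float, p_post: float) -> float:
    """Kononenko & Bratko (1991) information score for one chunk,
    given the prior p = 1/|collision set| and the posterior p_post
    derived from the acquired lexicon."""
    if p_post > p:
        return -log2(p) + log2(p_post)          # correct, informative prediction
    if p_post < p:
        return log2(1 - p) - log2(1 - p_post)   # prediction worse than chance
    return 0.0                                  # no gain, no loss

# Invented (prior, posterior) pairs for the chunks of a toy test bed T:
pairs = [(1/3, 1.0), (1/2, 1.0), (1/3, 0.0), (1/2, 1.0)]

E   = sum(-log2(p) for p, _ in pairs) / len(pairs)          # average entropy: task difficulty
ISa = sum(info_score(p, q) for p, q in pairs) / len(pairs)  # absolute information score
ISr = ISa / E                                               # relative score; 1 = perfect
print(f"E={E:.2f}  ISa={ISa:.2f}  ISr={ISr:.2f}")
```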

5. Conclusions

In this paper, we showed that chunking unrestricted texts is a feasible task. The basic linguistic knowledge required is easily attainable on a large scale, and the necessary finite-state parsing technology is suitable and sufficiently reliable. We also showed that the feasibility of the task is accompanied by the adequacy and usefulness of the results obtained for different Natural Language Processing (NLP) tasks.

In the perspective proposed in this paper, chunking is seen as a first step on the way to full syntactic parsing, whereby parses are left underspecified when they are undecidable given the lexical knowledge available. We proved that this level of underspecified syntactic annotation can be designed so as to be linguistically sound. This also explains why chunked texts appear to be particularly suited as input for automatic lexical acquisition: preliminary identification of syntactic chunks and, more crucially, of potential governors significantly curtails the search space for either arguments or modifiers of a head in context. We believe that other NLP tasks which do not require full text understanding, such as information retrieval and filtering, can equally benefit from relying on a chunked representation.

We also have reason to believe that a chunked text lends itself quite naturally to being scaled up to more refined levels of syntactic analysis, once the availability of a large subcategorization lexicon makes it possible to identify, at a later stage, all the inter-chunk syntactic dependencies which were left unsolved at an earlier stage. In this way, we should be able to bootstrap a dependency-based syntactic analysis from chunking, through the intermediate ancillary process of automatically acquiring a full subcategorization lexicon. Note that underspecification (e.g. unattachment, assignment of underspecified chunk categories) plays a significant role here, as further analysis levels never require backtracking on parse decisions taken at the chunking level.

References

Abney, S. (1991). Parsing by Chunks. In D. Bouchard & K. Lefel (Eds.), Views on Phrase Structure. Kluwer Academic Publishers.

Basili, R., Candito, M.H., Pazienza, M.T., Velardi, P. (1997). Evaluating the information gain of probability-based PP-disambiguation methods. In D. Jones & H. Somers (Eds.), New Methods in Language Processing. UCL Press, London.

Carroll, J., Briscoe, T., Calzolari, N., Federici, S., Montemagni, S., Pirrelli, V., Grefenstette, G., Sanfilippo, A., Carroll, G., Rooth, M. (1996). Specification of Phrasal Parsing. Deliverable 1, Work Package 1, EC project SPARKLE “Shallow Parsing and Knowledge Extraction for Language Engineering” (LE-2111). Available at <http://www.ilc.pi.cnr.it/sparkle>.

Carroll, G., Light, M., Prescher, D., Rooth, M., Carroll, J., Briscoe, T., Korhonen, A., McCarthy, D., Calzolari, N., Federici, S., Montemagni, S., Pirrelli, V. (1997a). Syntactic and Semantic Type and Selection. Deliverable 5.1, Work Package 5, EC project SPARKLE “Shallow Parsing and Knowledge Extraction for Language Engineering” (LE-2111).

Carroll, J., Briscoe, T., Federici, S., Montemagni, S., Pirrelli, V., Prodanof, I., Vannocchi, M. (1997b). Phrasal Parsing Software. Deliverable 3.2, Work Package 3, EC project SPARKLE “Shallow Parsing and Knowledge Extraction for Language Engineering” (LE-2111). Available at <http://www.ilc.pi.cnr.it/sparkle>.

Federici, S., Montemagni, S., Pirrelli, V. (1996). Shallow Parsing and Text Chunking: a View on Underspecification in Syntax. In J. Carroll (Ed.), Proceedings of the Workshop on Robust Parsing, ESSLLI 1996, 12-16 August 1996, Prague, Czech Republic.

Federici, S., Montemagni, S., Pirrelli, V. (1998). An Analogy-based System for Lexicon Acquisition. SPARKLE Working Paper.

Kononenko, I., Bratko, I. (1991). Information-Based Evaluation Criterion for Classifier’s Performance. Machine Learning, 6, pp. 67-80.

Picchi, E. (1994). Statistical Tools for Corpus Analysis: a tagger and lemmatizer for Italian. In W. Martin et al. (Eds.), Proceedings of the Sixth Euralex International Congress, Amsterdam, 30-8/3-9 1994, pp. 501-510.

Ruimy, N., Battista, M., Corazzari, O., Gola, E., Spanu, A. (1997). Italian Lexicon Documentation. WP3.11, LE-PAROLE, Pisa.