Shallow Parsing and Text Chunking: a View on Underspecification in Syntax



SHALLOW PARSING AND TEXT CHUNKING: A VIEW ON UNDERSPECIFICATION IN SYNTAX*

Stefano Federici 1, Simonetta Montemagni 1, Vito Pirrelli 2
1 Parola sas, v. del Borghetto 35, 56124 Pisa, Italy
2 ILC-CNR, v. della Faggiola 32, 56126 Pisa, Italy

{stefano,simo,vito}@ilc.pi.cnr.it

Abstract. This paper illustrates a technique of shallow parsing, named “text chunking”, whereby “parse incompleteness” is reinterpreted as “parse underspecification”. A text is chunked into structured units which can be identified with certainty on the basis of the available knowledge. The chunking process stops at the level of granularity beyond which the analysis becomes undecidable. We argue that a chunked syntactic representation can usefully be exploited as such for non-trivial NLP applications which do not require full text understanding, such as automatic lexical acquisition and information retrieval.

1. Introduction

Under the wide umbrella of “shallow parsing” lies a variety of different approaches to parsing (e.g. rule-based vs stochastic techniques) and of purposes for having a sentence parsed (e.g. acquisition of lexical knowledge from textual sources, machine translation, style checking etc.). Due to this state of affairs, we still lack a unique, consensual definition of what shallow parsing technically means, beyond the generic interpretation of the term as covering all sorts of approaches to partial or incomplete parsing. In this paper we describe a technique of text chunking whose output, following Jensen’s (1993) terminology, consists of a “syntactic sketch”, amenable to a variety of different levels of further syntactic refinement, whether constituency- or dependency-based. We suggest that this technique amounts to a particular view of the notion of shallow parsing, whereby “parse incompleteness” is reinterpreted as “parse underspecification” relative to possible, more refined levels of analysis. Nonetheless, we are confident that an underspecified syntactic representation in terms of textual chunks can usefully be exploited as such for non-trivial NLP tasks and applications such as automatic lexical acquisition and information retrieval.

2. Shallow Parsing: Basics

Broadly speaking, shallow parsers take as input the output of a morphological analyser, possibly but not necessarily disambiguated in context through the application of a tagger. Besides differences in the typology of the output representation, all shallow parsers share the property that the resulting analyses need not be complete: i.e. unrecognised structures, as well as unidentified dependency relations between words in text, are left unspecified as to their nature and scope. The output of a shallow parser is computed on the basis of a minimum of presupposed linguistic knowledge, which adds to the information accrued from the input representation: namely morphosyntactic, lemma and word order information. We can express this by saying that the starting point of a shallow parser is typically a sort of “empty syntactic lexicon”, and that its resulting analyses are not lexically driven: e.g., all constituents are represented on a par, as daughters of the sentence node, given the impossibility of relying on lexical information (mainly subcategorization frames) to establish the appropriate dependencies.

* Work reported in this paper has been carried out within the framework of the EC SPARKLE project “Shallow Parsing and Knowledge Extraction for Language Engineering” (LE-2111). All ideas illustrated here are the outcome of a joint effort. However, for the specific concerns of the Italian Academy, S. Federici is responsible for sections 3.3 and 4, S. Montemagni for sections 3.1 and 5, V. Pirrelli for sections 1, 2 and 3.2.

3. Chunking

This section is devoted to a description of the notion of shallow parsing we intend to put forward here, which hereafter will be referred to as “text chunking”. As a first approximation, chunking a text means segmenting it into an unstructured sequence of syntactically organized text units called “chunks”, each of which displays the range of relations holding between its internal words.

3.1 What is a Chunk

Under our interpretation, a chunk is a textual unit of adjacent word tokens: accordingly, discontinuous chunks are not allowed. Word tokens internal to a chunk share the property of being mutually linked through those dependency chains which can be identified unambiguously with no recourse to idiosyncratic lexical information other than part of speech and lemma. A chunk is always a maximal, non-recursive text unit of this type: hence, it cannot be embedded within a more inclusive chunk (a detailed account of this constraint is given below). To be more concrete, a sentence such as the interested watcher could always observe the stars visible to the naked eye will be chunked as follows:

A. [the interested watcher]
B. [could always observe]
C. [the stars]
D. [visible]
E. [to the naked eye]

The sentence is segmented into five chunks. Each chunk includes a sequence of adjacent word tokens (a text substring) which are mutually related through dependency links of some specifiable kind. For example, in chunks A and B the following dependency chains are observed:

the interested watcher   (the → watcher; interested → watcher)
could always observe     (could → observe; always → observe)

The fact that two text substrings are assigned different chunks does not necessarily entail that there is no dependency relationship linking the two. For example, the chunked representation above says nothing about the relationship between visible and to the naked eye, but this is not to exclude that such a relationship may exist. Simply, on the basis of the knowledge available to the chunker, it is impossible to state unambiguously which chunk a given chunk relates to among its neighbours, and what the nature of this relationship is. If we abstract away from the lexical content of the chunks in the sentence above, E is potentially dependent on either D, C or B, and this cannot be decided unless lexical knowledge of some kind (e.g. that visible subcategorizes for a prepositional phrase headed by to) is resorted to. We therefore say that the output of a chunker is underspecified as to the nature and scope of inter-chunk (as opposed to intra-chunk) relations.
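To make the segmentation step concrete, the example sentence above can be reproduced with a toy pattern-based chunker. This is an illustrative sketch only, not the authors’ implementation: the Penn-style POS tags and the small pattern inventory are our own assumptions.

```python
# Toy chunker: greedy longest-match of POS-tag patterns, illustrating how
# maximal, non-recursive chunks can be carved out of a tagged sentence.
# (Illustrative only; tags and patterns are assumptions, not the authors' rules.)

TAGGED = [("the", "DT"), ("interested", "JJ"), ("watcher", "NN"),
          ("could", "MD"), ("always", "RB"), ("observe", "VB"),
          ("the", "DT"), ("stars", "NN"), ("visible", "JJ"),
          ("to", "IN"), ("the", "DT"), ("naked", "JJ"), ("eye", "NN")]

# Patterns are tried in order at each position; longer ones come first, so
# every chunk is maximal (e.g. [to the naked eye], not [to the] + [naked eye]).
PATTERNS = [
    ("P_C",   ["IN", "DT", "JJ", "NN"]), ("P_C",  ["IN", "DT", "NN"]),
    ("N_C",   ["DT", "JJ", "NN"]),       ("N_C",  ["DT", "NN"]),
    ("FV_C",  ["MD", "RB", "VB"]),       ("FV_C", ["MD", "VB"]), ("FV_C", ["VB"]),
    ("ADJ_C", ["JJ"]),
]

def chunk(tagged):
    out, i = [], 0
    while i < len(tagged):
        for cat, pat in PATTERNS:
            if [tag for _, tag in tagged[i:i + len(pat)]] == pat:
                out.append((cat, [word for word, _ in tagged[i:i + len(pat)]]))
                i += len(pat)
                break
        else:  # no pattern matched: leave the token as an unidentified chunk
            out.append(("?_C", [tagged[i][0]]))
            i += 1
    return out

for cat, words in chunk(TAGGED):
    print(cat, " ".join(words))
```

Run on the tagged sentence, this yields exactly the five chunks A-E above; crucially, the output commits to nothing about how [visible] or [to the naked eye] attach to their neighbours.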

Note further the different, admittedly unconventional treatment of visible as opposed to that of interested and naked: in the latter case the adjectival modifier is “trapped” between a determiner and a noun and is part of a wider chunk, while in the former the adjective is a chunk in its own right. This is in keeping with the idea that the chunker only finds those relations which can be identified with certainty. Since the governor of postnominal modifiers cannot be identified with the same degree of certainty as in the case of prenominal modification, the two constructions are chunked differently: a prenominal modifier is part of the same chunk as its modified head, while a postnominal modifier gives rise to an independent chunk.

3.1.1 Chunks and Potential Governors

Our definition of chunk draws on the notion of “potential governor”. A chunk contains at most one potential governor, that is, an element internal to the chunk on which neighbouring chunks can syntactically depend either as arguments or adjuncts (hereafter comprehensively referred to as “complements”). A potential governor is always the rightmost element of a chunk. It represents a kind of syntactic “handle” externally available to other chunks for them to hang onto syntactically. The potential governors in A and B above are watcher and observe respectively. Incidentally, in the chunked sentence above, there is no other chunk-external element depending on watcher, while observe takes the chunk [the stars] as a direct object. As mentioned above, all these inter-chunk dependencies are invisible to the chunker. According to our definition, a potential governor G is a suitable candidate as the head of a chunk other than the one to which G belongs. In most cases a potential governor is also the head of its own chunk, but this is not always true. Unlike heads in the classic sense, potential governors indicate solely the potential of a chunk for syntactically combining with other chunks in such a way that the latter are governed by the former. As a result, not all heads are potential governors, nor are all potential governors heads. We will elaborate this point in section 3.2.2.

3.2 Chunk Typology

We distinguish two basic sorts of chunk, namely “governing chunks” and “non governing chunks”, depending on whether or not they contain a potential governor. Non governing chunks are those which contain elements that cannot possibly act as potential governors due to their categorial status: i.e. punctuation and coordinating conjunctions. In what follows, we will focus on governing chunks only.

3.2.1 Typology of Governing Chunks

A typology of governing chunks is provided in the table below.

NAME       TYPE                        POTGOV                  EXAMPLES
ADJ_C      adjectival chunk            adj                     bello ‘nice’; molto bello ‘very nice’
PA_C       predicate adjectival chunk  adj                     è bello ‘(it/(s)he) is nice’; è molto simpatico ‘(it/(s)he) is very nice’
ADV_C      adverbial chunk             adv                     sempre ‘always’
SUBORD_C   subordinating chunk         conj                    quando ‘when’; dove ‘where’
N_C        nominal chunk               noun, pron, verb, adj   la mia casa ‘my house’; io ‘I’, questo ‘this’; l’aver fatto ‘having done’; il bello ‘the nice (one)’
P_C        prepositional chunk         noun, pron, verb, adj   di mio figlio ‘of my son’; di quello ‘of that (one)’; dell’aver fatto ‘of having done’; del bello ‘of the nice (one)’
FV_C       finite verbal chunk         verb                    sono stati fatti ‘(they) have been done’; rimangono ‘(they) remain’
G_C        gerundival chunk            verb                    mangiando ‘eating’
I_C        infinitival chunk           verb                    per andare ‘to go’; per aver fatto ‘to have done’
PART_C     participial chunk           verb                    finito ‘finished’
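The category/governor constraints in the table lend themselves to a direct encoding as a lookup table. The snippet below is a sketch of such an encoding (Python and the function name are our choices; the data simply restates the table):

```python
# Admissible part-of-speech values for the potential governor (POTGOV)
# of each governing chunk category, as listed in the table above.
ADMISSIBLE_POTGOV = {
    "ADJ_C":    {"adj"},
    "PA_C":     {"adj"},
    "ADV_C":    {"adv"},
    "SUBORD_C": {"conj"},
    "N_C":      {"noun", "pron", "verb", "adj"},
    "P_C":      {"noun", "pron", "verb", "adj"},
    "FV_C":     {"verb"},
    "G_C":      {"verb"},
    "I_C":      {"verb"},
    "PART_C":   {"verb"},
}

def well_formed(chunk_cat, potgov_pos):
    """True if a chunk of category chunk_cat may have a potential
    governor with part of speech potgov_pos."""
    return potgov_pos in ADMISSIBLE_POTGOV.get(chunk_cat, set())

# A nominalised infinitive may head an N_C ("lo scrivere"), but only
# a verb may head a finite verbal chunk.
assert well_formed("N_C", "verb")
assert not well_formed("FV_C", "noun")
```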

The set of syntactic categories given above departs in several respects from the set of traditional phrasal categories used in constituency-based syntax. In general, more granular distinctions are made here. Verbal chunks are partitioned on the basis of verb mood. Classical categories such as that_clause, wh_clause and so on do not appear in the list, since they are decomposed into sequences of basic chunks. Nominal chunks are intended to cover only a portion of the range of linguistic phenomena normally taken care of by nominal phrases: namely, only noun phrases with prenominal complementation. Chunks can be classified according to the category of the potential governor they contain (column 3). There are four chunk types, namely FV_C, G_C, I_C and PART_C, which can only contain a verb as a potential governor. A verb can also act as a potential governor within nominal and prepositional chunks (N_C and P_C), but only in particular contexts (i.e. Italian nominalised infinitival constructions introduced by an article: [N_C lo scrivere N_C] [N_C libri N_C] ‘writing books’). Nouns, pronouns and adjectives are the other admissible potential governors in N_Cs and P_Cs. Potential governors within ADJ_C, ADV_C and SUBORD_C are adjectives, adverbs and subordinating conjunctions respectively. The potential governor in a PA_C is the predicate adjective.

3.2.2 Syntactic Heads and Potential Governors

We consider here in more detail the relationship between two largely overlapping notions, syntactic head (as this notion emerges from the relevant literature; see for instance Hudson 1984 and Pollard and Sag 1994) and potential governor. In particular, we will focus on those cases where the two notions appear to diverge significantly. In PA_Cs, the head is the copula, but the role of potential governor is played by the adjective.
This move is justified from our perspective on the basis of the observation that external dependencies can only involve the predicate adjective (as opposed to the copula). At the same time, we observe that the relation between the predicate adjective, the copula and possible other elements of the chunk is not ambiguous, and this makes it possible to group them within the same chunk. Consider now the case of P_Cs. Prepositions, which are defined as governing elements (i.e. heads) under most syntactic approaches, are excluded here from the list of potential governors within prepositional chunks. Again, this follows from the fact that prepositions cannot function as the candidate governor of a neighbouring chunk, since the governing scope of a preposition never goes beyond the limits of the chunk it appears in. The only exception to this generalization is the case of coordination as in for you and me, where the preposition is chunked with the first conjoined element only, the second one being part of a distinct chunk (i.e. [P_C for you P_C] and [N_C me N_C]). In order to establish inter-chunk dependencies, the head of the noun phrase embedded in a P_C is more relevant than an introducing preposition. Consider, for instance, the sequence of prepositional chunks [P_C1 con colpevole uso P_C1] [P_C2 di armi P_C2] ‘[with guilty use] [of weapons]’, where the nominal governor in P_C1 is crucial to establish what other chunk P_C2 depends on.

Considerations of this sort explain why the range of potential governors of nominal and prepositional chunks coincides.

3.2.3 Chunk Categories and Underspecification

It is not always the case that the category of a chunk can be identified unambiguously given the available knowledge: this problem can be circumvented through the use of underspecified chunk categories, which add to the inventory provided above. The chunking process resorts to underspecified categories in cases of systematic ambiguity. As far as Italian is concerned, for instance, di_C is a chunk introduced by di, which can be interpreted either as a preposition or as a partitive article: e.g. [di_C dello zucchero di_C] 'some sugar' / 'of sugar'. The category di_C is compatible with both analyses and thus subsumes both N_C and P_C. The systematic ambiguity between adjectives and past or present participles is another case in point. Consider the phrase un’immagine colorata ‘a coloured picture’ and its chunked representation below:

[N_C un’immagine N_C] [?_C colorata ?_C]

Here, the potential governor in ?_C can either be a verb (colorare ‘to colour’) or an adjective (colorato ‘coloured’), and the corresponding chunk category can vary between PART_C and ADJ_C respectively. This ambiguity is captured through the underspecified chunk category ADJPART_C, subsuming both ADJ_C and PART_C: ?_C above is thus replaced by [ADJPART_C colorata ADJPART_C].
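One way to operationalise underspecified categories is as nodes in a small subsumption hierarchy, so that a later, better-informed stage checks compatibility rather than equality. A minimal sketch (the function name is ours; the category names come from the text):

```python
# Each underspecified category subsumes a set of fully specified ones;
# a fully specified category subsumes only itself.
SUBSUMES = {
    "di_C":      {"N_C", "P_C"},        # di as partitive article vs preposition
    "ADJPART_C": {"ADJ_C", "PART_C"},   # adjective vs participle
}

def compatible(assigned, refined):
    """True if the chunker's (possibly underspecified) category `assigned`
    is consistent with the more refined hypothesis `refined`."""
    return refined in SUBSUMES.get(assigned, {assigned})

# colorata may later be resolved either way without contradicting the chunker:
assert compatible("ADJPART_C", "ADJ_C") and compatible("ADJPART_C", "PART_C")
```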

3.3 Inside and Outside a Chunk

Given the criteria for chunking sketchily illustrated so far, it remains to be seen what sort of dependencies hold between the elements of a chunk, and how we represent undecidable dependencies which hold between chunks.

3.3.1 Intra-Chunk Dependencies

Each chunk is a syntactically organized structure, which displays the nature and scope of the dependencies holding between its internal words. It is described by a set of attribute-value pairs whose configuration varies according to the chunk type and category. In the definition of governing chunks, two attributes need to be specified obligatorily: i) Chunk Category (CC), whose possible values are given in the table above; ii) POTential GOVernor (POTGOV), whose value is the lemma of the potential governor, specified for its part of speech and (optionally) other morpho-syntactic features. An elementary chunk is exemplified below for the expression la legislazione ‘the legislation’:

[ [CC: N_C] [POTGOV: legislazione#SF] ]

This basic structure can contain further attributes specifying, for instance, the preposition “introducing” the chunk (in P_C and I_C chunks), or adjectival premodifiers (i.e. intervening between a determiner and a potential governor) in N_Cs and P_Cs:

a questo riguardo ‘in this respect’
[ [CC: P_C] [PREP: a] [POTGOV: riguardo#SM] ]

per verificare ‘to verify’
[ [CC: I_C] [PREP: per] [POTGOV: verificare#VTP] ]

un bravo bambino ‘a good boy’
[ [CC: N_C] [MOD: bravo] [POTGOV: bambino#SM] ]

All verbal chunks (i.e. FV_C, I_C) can also contain an indication of: the clitic pronoun(s) occurring with the verb (CLIT); the auxiliary (AUX) used in periphrastic verb forms; the modal verb (MODAL) used in modal constructions; the causative verb (CAUS) used in causative constructions.

lo disse ‘(he) said it’
[ [CC: FV_C] [CLIT: lo] [POTGOV: dire#VT] ]

farlo ‘to do it’
[ [CC: FV_C] [CLIT: lo] [POTGOV: fare#VT] ]

è stata trasmessa ‘(it) has been transmitted’
[ [CC: FV_C] [AUX: essere] [POTGOV: trasmettere#VT] ]

ha dichiarato ‘(he) has declared’
[ [CC: FV_C] [AUX: avere] [POTGOV: dichiarare#VTR] ]

che possono essere sbarcati ‘which can be landed’
[ [CC: FV_C] [INTRO: che] [AUX: essere] [MODAL: potere] [POTGOV: sbarcare#VTI] ]

lascia intendere ‘(he) lets (someone) understand’
[ [CC: FV_C] [CAUS: lasciare] [POTGOV: intendere#VTIPB] ]

Subordinating conjunctions, when immediately followed by a verbal group, are included in the corresponding verbal chunk (i.e. FV_C or I_C) and recorded as the value of the SUBCONJ attribute:

dove si trova ‘where (it) is’
[ [CC: FV_C] [SUBCONJ: dove] [CLIT: si] [POTGOV: trovare#VTBP] ]

Note that subordinating conjunctions which are not immediately followed by a verbal chunk, as in dove la mia famiglia si trova ‘where my family is’, are treated as independent chunks (SUBORD_C), since no dependency relation between the conjunction and the immediately following nominal chunk can be assumed with certainty.

3.3.2 Inter-Chunk Dependencies

Dependencies which cannot be identified unambiguously as to their scope and nature are distributed over different chunks. By way of exemplification, consider the Italian phrase insiemi di leggi utili, whose English translation can be either ‘sets of useful laws’ or ‘useful sets of laws’. Accordingly, depending on the interpretation, utile enters into two different dependency chains:

insiemi di leggi utili   (utile modifying legge: ‘sets of useful laws’)
insiemi di leggi utili   (utile modifying insieme: ‘useful sets of laws’)

Put in a nutshell, the chunked output which follows is compatible with both interpretations, since the potential governors of utile (i.e. insieme and legge), as well as utile itself, are assigned different chunks: N_C, P_C and ADJ_C respectively.

[ [CC: N_C] [POTGOV: insieme#SM] ]
[ [CC: P_C] [PREP: di] [POTGOV: legge#SF] ]
[ [CC: ADJ_C] [POTGOV: utile#A] ]

Chunking a text means trusting irrevocable parsing decisions only, that is, identifying those intra-chunk relations which can reliably be spotted on the basis of the linguistic information available at the current level of analysis. These decisions should never be disconfirmed or revoked at later stages of processing; if they are, this counts as a chunking failure. This approach to shallow parsing results in a neat way of dealing with parsing indecision, which amounts to saying nothing about matters on which nothing (or very little) can be said with certainty. We thus depart from other approaches to parsing indecision entertained in the literature, which, roughly speaking, either always rely on the most likely structure (e.g. minimal attachment, Frazier and Fodor 1978; right attachment, Kimball 1972), or pack local ambiguities as alternative analyses inserted as part of a common (tree) structure (sub-tree sharing and local packing, Tomita 1987 and Alshawi 1992).
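The way a chunked representation keeps both readings open can be sketched in code: the attribute-value structures above become dictionaries, and the set of admissible attachments for a chunk is simply enumerated rather than decided. (A sketch; we make the simplifying assumption that candidate governors are sought only among preceding chunks, as in this example.)

```python
# Chunked output for "insiemi di leggi utili" as attribute-value structures.
chunks = [
    {"CC": "N_C",   "POTGOV": "insieme#SM"},
    {"CC": "P_C",   "PREP": "di", "POTGOV": "legge#SF"},
    {"CC": "ADJ_C", "POTGOV": "utile#A"},
]

def candidate_governors(chunks, i):
    """Potential governors the i-th chunk may depend on. The chunker
    enumerates them but does not choose: the choice (and the nature of
    the relation) is left to later, lexically informed stages."""
    return [c["POTGOV"] for c in chunks[:i]]

# Both readings remain available: utile may attach to insieme or to legge.
print(candidate_governors(chunks, 2))
```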

4. A Chunker for Italian: Preliminary Results

In this section, preliminary results of our Italian chunker are provided. The chunker takes as input the output of a stochastic tagger of Italian (Picchi 1994) and segments the text into sequences of chunks. The results of a set of experiments are reported in the table below. Each test was performed on texts belonging to different domains and types, extracted from the Italian Reference Corpus (Bindi et al. 1991): namely, a biology textbook, a document on tax legislation and a financial newspaper, for an overall amount of 3,485 sentences and 74,406 word tokens.

                                 biology textbook   tax legislation   financial newspaper
words                            13,306             21,050            40,050
sentences                        606                1,229             1,650
average no. of words/sentence    21.95              17.12             24.27
processing time per word token   0.0031 secs        0.0032 secs       0.0030 secs
processing time per sentence     0.068 secs         0.054 secs        0.073 secs

                                     biology textbook     tax legislation      financial newspaper
governing chunks
  1 word                             3,047     42.13%     5,734     45.09%     8,169     38.53%
  2 words                            3,198     44.22%     5,664     44.54%     10,124    47.75%
  3 words                            833       11.52%     1,156      9.09%     2,489     11.74%
  4 words                            143        1.98%     145        1.14%     382        1.80%
  ≥5 words                           11         0.15%     17         0.13%     40         0.19%
  total                              7,232    100.00%     12,716   100.00%     21,204   100.00%
governing chunks (% of all chunks)             76.26%               77.60%               76.05%
non governing chunks                 2,129     22.45%     3,648     22.27%     5,883     21.10%
unidentified chunks                  122        1.29%     22         0.13%     795        2.85%
total no. of chunks                  9,483    100.00%     16,386   100.00%     27,882   100.00%

(Word-length percentages are relative to the governing-chunk totals; the percentages in the last four rows are relative to the total number of chunks.)

A common trend emerges from the experiments performed on different text domains and types: in all cases, governing chunks represent more than three quarters of the total number of segmented chunks; the remaining cases are non governing chunks and, in a negligible percentage of cases, chunks of an unidentified nature. One and two word chunks taken together represent more than 80% of governing chunks, irrespective of type and domain of text; three word chunks are around 10% of the overall number of governing chunks, whereas longer chunks (i.e. with more than three words) occur fairly rarely. Clearly, these percentages do not reflect the text coverage of governing chunks. More indicative figures in this respect are reported in the table below for one of the texts considered, namely the financial newspaper.

            governing chunks   no. of covered words   covered text
1 word      8,169              8,169                  20.40%
2 words     10,124             20,248                 50.55%
3 words     2,489              7,467                  18.64%
4 words     382                1,528                   3.81%
≥5 words    40                 203                     0.50%
total       21,204             37,615                 94.00%

Governing chunks cover, as a whole, 94% of the entire segmented text. The coverage of one word chunks is 20%, while 2 word chunks account for half of the text. Note that chunks consisting of three words or more, which amount to 14% only of governing chunks, represent more than 22% of text. On average, the word length of a governing chunk is 1.7, which means that each sentence is split into an average number of 12 governing chunks. A more detailed model of chunk distribution per sentence is not available yet. These figures are nonetheless sufficient to give the reader an idea of the amount of structure which a chunker is able to impose on a morpho-syntactically tagged text.
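The quoted averages can be re-derived from the coverage table for the financial newspaper; the exact quotients come out at about 1.77 and 12.9, in line with the approximate 1.7 and 12 quoted above. A quick back-of-the-envelope check:

```python
# Governing-chunk counts and covered words per chunk-length class
# (financial newspaper; key 5 stands for the ">=5 words" class).
chunks  = {1: 8169, 2: 10124, 3: 2489, 4: 382, 5: 40}
covered = {1: 8169, 2: 20248, 3: 7467, 4: 1528, 5: 203}

total_chunks = sum(chunks.values())    # 21,204 governing chunks
total_words  = sum(covered.values())   # 37,615 covered words

avg_chunk_len = total_words / total_chunks   # ~1.77 words per governing chunk
chunks_per_sentence = total_chunks / 1650    # ~12.9 governing chunks/sentence

print(round(avg_chunk_len, 2), round(chunks_per_sentence, 2))
```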

4.1 Accuracy of Chunking

Chunking accuracy depends on the accuracy of tagging. In two different evaluation tests we assessed the accuracy of the chunker against both a manually corrected tagged corpus and one with unsupervised tagging. In the former case the accuracy rate averages 99%; in the latter it drops to about 93%. Most of the 7% chunking failures in the experiment with no hand-checking of morphosyntactic tagging (about 70% of them) are due to the presence of wrong tags. The remaining 30% of the failures are genuine chunking mistakes: they amount to 2% of all identified chunks. It should be observed, however, that this percentage is greater than the 1% failure rate we obtained in the experiment with a manually corrected tagged corpus. The difference can arguably be explained as a kind of domino effect: tagging mistakes clearly have an impact also on the chunking of tag sequences which, strictly speaking, do not contain wrong tags. Some of the problems connected with wrong tags are tackled through the use of underspecified chunk categories (section 3.2.3). This is the case, for instance, of di_C chunks, which are assigned by the chunker regardless of the tagger’s interpretation of di as a preposition or an article. From a preliminary analysis of samples of chunked text it appears that, on average, underspecified chunks represent about 12% of all identified chunks. Finally, it is worth noting here that manual inspection of unidentified chunks (which range between 0.13% and 2.85% of all chunks in the tests carried out so far) offers an invaluable opportunity for gathering knowledge about frozen or semi-frozen grammatical expressions (e.g. complex prepositions such as al di là ‘beyond’, made up of a preposition fused with its article + preposition + adverb) which often elude general principles of grammar, even at a fairly low level such as chunking. We intend to capitalise on this acquisition process by augmenting the chunker with a repository of this type of often neglected linguistic knowledge, represented as a set of ready-made chunks. Besides, at the moment we are working on the idea of merging the two steps of tagging and chunking into one, so that knowledge about chunk width can be used as an indicator of the width of the context within which a sequence of ambiguous tags is most likely to be disambiguated with high accuracy.

5. Discussion

The “minimalist” approach to shallow parsing described in these pages segments a text into units which can be identified with certainty on the basis of the comparatively small amount of linguistic knowledge available at this stage. These units, called chunks, are assigned a structured representation wherein dependency links are made explicit. The chunking process stops at the level of granularity beyond which the analysis becomes undecidable, i.e. whenever more than one parsing decision is admissible on the basis of the available knowledge. When this occurs, be it a case of overgeneration or a genuine ambiguity, the chunked output is not committed to any such decision, but is compatible with them all. In the case of so-called undergeneration, a chunking failure, given the local character of this analysis, never involves the whole sentence, but affects a limited portion of text (corresponding to a limited number of adjacent chunks). As a result, unidentified chunks never block the chunking process as a whole, as is often the case in syntactic parsing with a generative grammar. This syntactic sketch can usefully be exploited as such in NLP applications which do not necessarily require full text understanding. For example, we are confident that the chunker output can be used with considerable gains in the automatic acquisition of lexical knowledge: 1) the preliminary identification of syntactic chunks significantly reduces the search space of candidate heads and their corresponding complements in a text; 2) (semi-)frozen expressions which give rise to untypical tag sequences can usefully be spotted, since they are likely to produce local failures in chunking. This is part of current work in the framework of the SPARKLE project (Carroll et al. 1996). Furthermore, we believe that similar considerations hold for information retrieval, which can benefit considerably from operating on organized text units rather than on raw text.
Yet, the chunker can also be conceived of as the first component of a complex syntactic parsing system; this initial component produces a syntactic sketch which is to be refined and revised at further processing stages. The output representation produced by the chunker is compatible with both constituency- and dependency-based approaches to syntactic analysis. To give an example, a nominal phrase such as una interessante ricerca scientifica ‘an interesting scientific research’ would result in the following chunked representation:

[ [CC: N_C] [MOD: interessante] [POTGOV: ricerca#SF] ]
[ [CC: ADJ_C] [POTGOV: scientifico#AF] ]

The corresponding dependency- and constituency-based representations of the phrase (in 1 and 2 below) would then be computed on top of the chunked input:

(1) una interessante ricerca scientifica  (dependency arcs link una, interessante and scientifica to ricerca; the arc from scientifica is dashed)

(2) [NP una [ADJP interessante ADJP] ricerca [ADJP scientifica ADJP] NP]

In (1), the dashed arc represents the dependency link that is not explicitly represented within the chunked representation, a representation which nonetheless already isolates the relevant items of this dependency chain, namely the potential governors of N_C and ADJ_C respectively. In (2), the nominal phrase with both pre- and post-modifiers is the result of recombining the nominal chunk (which includes the prenominal modifier) with the postnominal adjectival chunk. It is interesting to note that in both cases the decisions taken by the chunker relate monotonically to more elaborate levels of analysis, where the full amount of linguistic structure is output correctly. Revisions and refinements never involve the unpacking of existing chunks. This means that a chunk might, in some cases, not include all relevant linguistic information, but will certainly represent the core of a more inclusive syntactic structure. We can say that every chunk represents a kind of syntactic atom: structures identified at later stages of syntactic analysis can only fully contain (or be contained in) it. The approach illustrated in this paper has been tested on Italian. It is reasonable to believe that the amount of linguistic structure which can reliably be imposed on a morpho-syntactically tagged text by chunking it is subject to language-specific variation. However, the basic vocabulary of chunk markers and the battery of criteria for chunk identification remain fairly stable across considerably different languages such as English, German and Italian. Steven Abney (1996), Mats Rooth and Glenn Carroll (p.c.) have independently developed sets of criteria for chunking which overlap considerably with the machinery illustrated here, in spite of some fundamental differences in the final purposes of the produced parses.

References

Abney S., 1996, Chunk Stylebook, Manuscript, University of Tuebingen. Available at

<http://www.sfs.nphil.uni-tuebingen.de/~abney/96i.ps.gz>.

Alshawi H. (ed.), 1992, The Core Language Engine, The MIT Press, Cambridge, Massachusetts.

Bindi R., Monachini M., Orsolini P., 1991, Italian Reference Corpus. Key for Consultation, NERC WP7-13, Istituto di Linguistica Computazionale, CNR, Pisa.

Carroll J., Briscoe T., Calzolari N., Federici S., Montemagni S., Pirrelli V., Grefenstette G., Sanfilippo A., Carroll G., Rooth M., 1996, Specification of Phrasal Parsing, Deliverable 1, Work Package 1, EC project SPARKLE “Shallow Parsing and Knowledge Extraction for Language Engineering” (LE-2111). Available at <http://www.ilc.pi.cnr.it/sparkle>.

Frazier L., Fodor J., 1978, ‘The sausage machine: A new two-stage parsing model’, Cognition, 6, pp. 291-325.

Hudson R., 1984, Word Grammar, Basil Blackwell, Oxford.

Jensen K., 1993, ‘PEG: the PLNLP English Grammar’, in Jensen K., Heidorn G.E., Richardson S.D. (eds.), Natural Language Processing: The PLNLP Approach, Kluwer Academic Publishers, Boston, pp. 29-45.

Kimball J., 1972, ‘Seven principles of surface structure parsing in natural language’, Cognition, 2, pp. 15-47.

Picchi E., 1994, ‘Statistical Tools for Corpus Analysis: a tagger and lemmatizer for Italian’, in W. Martin et al. (eds.), Proceedings of the Sixth Euralex International Congress (Amsterdam, 30-8/3-9 1994), pp. 501-510.

Pollard C., Sag I., 1994, Head-Driven Phrase Structure Grammar, CSLI, Stanford, CA.

Tomita M., 1987, ‘An Efficient Augmented-Context-Free Parsing Algorithm’, Computational Linguistics, 13, pp. 31-46.