

A. Gelbukh (Ed.): CICLing 2005, LNCS 3406, pp. 474 – 485, 2005. © Springer-Verlag Berlin Heidelberg 2005

Language Documentation: The Nahuatl Grammar

Mike Maxwell 1 and Jonathan D. Amith 2

1 Linguistic Data Consortium [email protected]

2 Gettysburg College [email protected]

Abstract. We describe an on-going documentation project for Nahuatl, an indigenous language of Mexico. While we follow standard recommendations for documenting text corpora and for the dictionary, the usual recommendations are not explicit concerning the grammar. Since Nahuatl is an agglutinating language, the morphological component of the grammar is highly complex. Accordingly, we consider it essential to not only provide static information about the language, such as a lexicon and parsed text, but dynamic documentation in the form of a working morphological grammar. When compiled into a finite state transducer, this grammar provides parses for arbitrary inflected forms, including many not in the corpus, as well as the generation of the partial or full inflectional paradigms. In keeping with the archival goals of language documentation, we argue that this grammar should be simultaneously human readable and computer processable, so that it will be re-implementable in future computational tools. The notion of literate computing provides the appropriate paradigm for these dual goals.

1 Language Description and Documentation

“It is to be lamented… that we have suffered so many of the Indian tribes already to extinguish, without our having previously collected and deposited in the records of literature, the general rudiments at least of the languages they spoke. Were vocabularies formed of all the languages spoken in North and South America, preserving their appellations of the most common objects in nature, of those which must be present to every nation barbarous or civilized, with the inflections of their nouns and verbs, their principles of regimen and concord, and these deposited in all the public libraries, it would furnish opportunities to those skilled in the languages of the old world to compare them with these, now or at a future time, and hence to construct the best evidence of the derivation of this part of the human race.” –Thomas Jefferson (1781-1782) Notes on the State of Virginia

There are over 6000 languages in the world today [1]. The diversity of these languages has the potential to provide us with a window into the mind, an understanding of what it means to be human, and a way of reconstructing pre-history that we can attain in no other way.


Sadly, many of these languages are on the verge of extinction. A common estimate is that during this century at least half of these languages will disappear; the worst-case predictions are considerably more grim.1

All languages have a story to tell about the human language capacity, but there are some questions which can only be answered by certain languages. For example, there are only a handful of languages which have the word order Object-Verb-Subject, and only one of these (Hixkaryana, [2]) is well documented. Had these few languages disappeared from the face of the Earth before being documented, we might not have known that this word order was even possible in human languages, much less what the properties associated with this word order might be.

While language change is inexorable, the loss of languages is not. Languages can be preserved by continuing to be spoken, but they can also be preserved by being written down and described. The long-term survival of knowledge of ancient languages such as Latin, Classical Greek, and Sanskrit is due in no small part to the efforts of a handful of speakers of these languages who wrote down their grammars. Some other ancient languages remain more or less accessible to this day because they have been preserved in written form, which we have been able to decode, usually with the help of bilingual documents such as the famed Rosetta Stone. In contrast, the complete loss of all other languages of that era is due to their lack of documentation.

Accordingly, linguists and speakers of many minority languages have focused increasing effort in recent decades on language preservation—preventing extinction—and documentation, the latter in recognition of the fact that at least some preservation efforts will fail.

Descriptions of languages are arguably the one field of human knowledge where scholarly writings of today will retain their value for the foreseeable future. Assuming that civilization survives and knowledge increases, all other fields of endeavor—with the possible exception of historical narrative—will some day be replaced by a better understanding. The works of today’s astronomers, biologists, and engineers will be superseded by those of future generations of astronomers, biologists, and engineers. Not so descriptions of dying languages: once the primary data can no longer be produced because there are no more native speakers, no one can write more exhaustive descriptions than those that have already been made.

Naturally, not all language documentation is equal. At one end of the spectrum of methodologies, groups such as the Volkswagen Foundation (www.volkswagenstiftung.de/foerderung/index_e.html) have advocated a breadth-first approach, calling for preservation of large text and audio or video corpora of languages, and little more. This has the advantage of spreading funding and human resources over as wide a variety of languages as possible. At the other end of the spectrum, one might advocate intensive investigation of individual languages, resulting in the in-depth knowledge that some research programs (such as generative linguistics) require.

Arguably both approaches are useful; if we cannot know a lot about all languages, then we should at least know a lot about a few languages, and a little about a lot of languages. But even at the intensive end of the spectrum, some focusing of effort is needed. Each language has its individual story to tell, hence there is some sense in concentrating particular effort on those aspects of each language that are unique. To take one example, Lushootseed, a nearly extinct Central Coast Salish language of Washington State (United States), employs a process called ‘reduplication’ in its morphology. Urbanczyk [3] examined reduplication in this language, with the goal of illuminating what a reduplicative process in a human language can be. A crucial point in the analysis turned on cases where a reduplicative affix occurred on stems with a particular stress pattern. The corpus which she used was reasonably large, and had been painstakingly collected over years by competent linguists. But it turns out that in this large corpus, there are only four instances of this affix on the relevant stems—and these four instances are evenly split over whether they support the analysis.

1 See e.g. the Linguistic Society of America’s FAQ at http://lsadc.org/faq/endangered.htm.

Clearly Lushootseed is a case where directed data gathering would have been helpful. But gathering data with the aim of answering a particular question is possible only if the questions are known. And the questions which a particular language is suited to answering can be known only if the language has been analyzed in enough detail to know where its individual genius lies.

Examples could be multiplied, but we believe the point is obvious: for at least some languages, data collection must be intensive, thorough, and focused. But how can we ensure that our data collection is accurate? While a linguist may be needed to ask the right questions and to propose the solutions, humans are less good at testing those solutions rigorously; we are much too apt to overlook problems, or not to apply the analysis consistently. Computers, on the other hand, are nothing if not thorough and rigorous; as programmers know all too well, computers will not overlook “minor” faults. Hence, if we can use the computer to test the grammatical analysis, thereby finding the holes in our analysis and in our data collection, we have an ideal marriage between human and machine.

2 The Nahuatl Project

One of us (Jonathan Amith) has been documenting Ameyaltepec and Oapan Nahuatl, two dialects of the Nahuatl language from the Balsas Valley of central Guerrero, Mexico. After having lived in these villages for five years in the early 1980s, in 2000 he returned to begin a long-term language documentation effort. This has involved writing the draft of a pedagogical grammar, compiling and making available on-line a 10,000 word dictionary of Ameyaltepec and Oapan Nahuatl, and beginning a long-term effort to compile an extensive textual corpus (in audio and transcribed form).

Nahuatl is an agglutinating language, that is, inflected words may carry several affixes. In particular, transitive verbs commonly bear at least two prefixes, indicating agreement in person and number with both subject and object. Additional prefixes may mark other features, such as direction of movement or non-specific arguments. Verbs also commonly take from two to several suffixes, marking tense and aspect, as well as number and other features. Nouns are generally marked as either possessed or unpossessed; they are often used as predicates, in which case subject marking is obligatory. Moreover, there is relatively productive noun incorporation into verbs, putting Nahuatl into the class of at least weakly2 polysynthetic languages.
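The layered prefix-stem-suffix structure just described can be sketched as a simple item-and-arrangement concatenation. The morpheme shapes below (ni- ‘1sg subject’, k-/kin- object prefixes, -s ‘future’) follow Classical Nahuatl textbook forms rather than the authors' Ameyaltepec/Oapan data, and the stem and glosses are illustrative only:

```python
# Illustrative item-and-arrangement sketch of Nahuatl-style verb inflection.
# Morpheme shapes roughly follow Classical Nahuatl, not the Balsas dialects.
SUBJ = {"1sg": "ni", "2sg": "ti", "3sg": ""}   # subject agreement prefixes
OBJ = {"3sg": "k", "3pl": "kin"}               # object agreement prefixes
TENSE = {"pres": "", "fut": "s"}               # a single toy tense suffix

def inflect(stem: str, subj: str, obj: str, tense: str) -> str:
    """Glue agreement prefixes and a tense suffix onto a verb stem."""
    return SUBJ[subj] + OBJ[obj] + stem + TENSE[tense]

print(inflect("itta", "1sg", "3sg", "pres"))  # nikitta, roughly 'I see it'
print(inflect("itta", "2sg", "3pl", "fut"))   # tikinittas
```

A real grammar must of course also handle allomorphy and stem classes, which is where the finite state machinery earns its keep.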

In addition, allomorphy is common. Many prefixes have two forms, one used before a vowel, the other before a consonant. Some suffixal allomorphy occurs as well, but the most complex allomorphy involves stems. Verb stems regularly have three allomorphs; the differences among these allomorphs vary among verbs. While the origin of this variation can be traced back to historical stages of Nahuatl, the regularity has become obscured over time by sound changes, so that stem allomorphy is now best described in terms of verb classes. The Ameyaltepec/Oapan dictionary lists about a dozen such classes. By the judicious use of phonological rules, we can reduce that by half. Still, the complexity of the variation makes it difficult for the linguist to ensure by inspection that the written grammar accounts for all the forms.3
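The same strategy of trading listed allomorphs for a phonological rule can be shown in miniature. The sketch below assumes, purely for illustration, a ki- object prefix whose prevocalic allomorph k- is derived by deleting the prefix vowel before a vowel-initial stem; the actual conditioning in the Balsas dialects may differ:

```python
# Sketch: derive prefix allomorphy by rule instead of listing both forms.
# The ki-/k- pattern is modeled on the Classical Nahuatl object prefix;
# this is an illustration of the technique, not the authors' analysis.
VOWELS = set("aeio")

def attach(prefix: str, stem: str) -> str:
    """Attach a prefix, deleting its final /i/ before a vowel-initial stem."""
    if prefix.endswith("i") and stem[0] in VOWELS:
        prefix = prefix[:-1]          # i-deletion before a vowel
    return prefix + stem

print(attach("ki", "itta"))   # kitta  (prevocalic allomorph k-)
print(attach("ki", "kaki"))   # kikaki (preconsonantal allomorph ki-)
```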

Two years ago, we began building a computationally implemented morphological grammar of Nahuatl. At present, the verbal morphology (the most complicated aspect of Nahuatl morphology) is substantially in place, with some irregular forms still to be accounted for. Using a finite state transducer engine allows us to both parse inflected words and generate arbitrary inflected forms from a suitable meaning representation.

The forms we work with are in a shallow orthography. The orthography abstracts away from some phonological processes, particularly across word boundaries.

In summary, we are engaged in intensive and thorough documentation of the Ameyaltepec/Oapan dialects of Nahuatl, and our work is focused in part on the morphology, particularly the inflectional morphology. Computational tools are thus essential for ensuring the accuracy and coverage of our analysis. For this purpose, we rely on a morphological transducer, capable of parsing words found in our text corpus into their constituent morphemes. Failure to parse a given word indicates a problem, either in the text (such as a misspelling) or in the grammar.4
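This failure-driven workflow is easy to sketch. In the toy version below, the compiled transducer is mocked by dictionary lookup (`toy_analyses.get`); a real deployment would call the finite state parser instead:

```python
def find_unparsed(corpus, parse):
    """Return words the grammar cannot analyze: each is either a
    transcription error or a gap in the lexicon/grammar."""
    return sorted({w for w in corpus if parse(w) is None})

# Mock parser: maps surface forms to morpheme analyses (illustrative data).
toy_analyses = {"nikitta": "ni-k-itta", "kikaki": "ki-kaki"}
corpus = ["nikitta", "kikaki", "nikkitta"]  # the last word is misspelled

print(find_unparsed(corpus, toy_analyses.get))  # ['nikkitta']
```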

3 Electronic Language Documentation Versus Archivability

“…documentation projects are usually tied to software version, file formats, and system configurations having a lifespan of three to five years… In the very generation when the rate of language death is at its peak, we have chosen to use moribund technologies, and to create endangered data. When the technologies die, unique heritage is either lost or encrypted.” –Bird and Simons, Seven dimensions of portability for language documentation and description.

2 Incorporation is lexically governed; hence incorporation in Nahuatl is not as productive as in a highly polysynthetic language such as Mohawk.

3 A reviewer asked whether we have attempted automatic learning of phonological rules. We have not; the complexity of Nahuatl morphology exceeds the capabilities of any morphological and phonological learning tools that we know of, and in any case grammatical analysis was underway before computational implementation began. However, we expect that the corpus and grammar we are creating will prove useful in testing such learning tools.

4 We are also using the computational grammar as part of a web-based language learning tool.


Bird and Simons [4] lay out standards for electronic language documentation and description.5 High on their list is the requirement that documentation should be in plain text form, using plain text annotation (e.g. XML), as opposed to non-human-readable binary formats (such as those used in many databases or word processors). We have followed those guidelines with respect to our lexicon and texts, but the grammar is a different problem. As discussed above, we have built a computational grammar for our own purposes, and we want to include this in the final archivable language documentation. Our motivation for including the grammar is three-fold. First, a parser can use a computer-readable grammar to parse words in another corpus, or words obtained by elicitation from a native speaker. Second, a parser facilitates enrichment of the lexicon by processing texts and determining which forms are missing from the lexicon. Third, a generator can use such a grammar to generate forms which are not in a corpus but which (assuming the grammar has been written correctly) are nevertheless grammatically possible.6

This last point—the need to be able to reliably generate forms which do not happen to be attested in a corpus—has been highlighted by recent work in the phonology of Yawelmani (Yowlumne) Yokuts [5]. It turns out that over two thirds of the wordforms from this language which were used as crucial evidence in theoretical debates about phonology in the past thirty years have been constructed by non-speaker linguists on the basis of published descriptions. Many of these constructed forms appear to be erroneous [6]. If there were a morphological generator which had been tested7 against corpora and/or native speakers, the discussion of Yawelmani—and theoretical phonology in general—might have taken a different turn.

A transducer is a morphological engine that combines the capabilities of a parser and a generator. In our project, we are using the Xerox finite state transducer [7]. This is available only in executable form, and runs under current versions of Microsoft Windows, Linux, Solaris, and Macintosh OS X. The choice of this tool is in part related to the morphological and phonological complexities of Nahuatl, as well as the need for bidirectionality (both parsing and generation). The Xerox transducer is simply the only general and currently available tool we know of which provides the ability to write rules in a more or less linguistically motivated formalism. In particular, Xerox-style grammar rules allow one to craft an item-and-arrangement grammar,8 together with (morpho-)phonological rules to produce predictable allomorphs (and other mechanisms for non-predictable allomorphy, i.e. suppletion).
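The architecture just described, a morphotactic lexicon whose outputs are passed through ordered (morpho)phonological rules, can be mimicked in miniature outside the Xerox tools. The morphemes and the single rule below are invented for illustration and do not reproduce the authors' grammar:

```python
from itertools import product

# Morphotactics: which morphemes may concatenate, with '-' as the boundary.
# Morphemes are Classical-Nahuatl-style and purely illustrative.
subject = ["ni-", ""]          # 1sg vs. 3sg (zero)
obj = ["ki-"]                  # object prefix; its k- allomorph is derived by rule
stems = ["itta", "kaki"]       # 'see', 'hear'

def phonology(form: str) -> str:
    """Apply ordered rewrite rules, then erase boundary markers."""
    form = form.replace("i-i", "i")   # toy rule: /i/ drops before /i/ at a boundary
    return form.replace("-", "")      # boundaries are erased last

surface = [phonology(s + o + st) for s, o, st in product(subject, obj, stems)]
print(surface)  # ['nikitta', 'nikikaki', 'kitta', 'kikaki']
```

The point of the division of labor is the same as in xfst: the lexicon states only what combines with what, while predictable allomorphy falls out of the rules.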

5 Bird and Simons [4] make a distinction between language documentation (the primary data) and description (analyses). In our project, we are doing both. But for reasons of brevity, we will not repeat this dichotomy in the remainder of this paper, instead using the terms ambiguously.

6 Of course the danger is that the grammar will produce forms which are not in the corpus because they are incorrect; that danger can only be avoided by careful checking. The danger of producing ungrammatical forms from a morphological grammar is less than the same danger with a syntactic grammar, since apart from productive compounding, the set of morphological forms is finite, unlike the set of sentences. Moreover, irregular forms are almost never rare, so that once the basic vocabulary has been covered, the remaining forms tend to be very predictable. But the point remains: careful checking is necessary.

7 Methodically, but not necessarily exhaustively.

8 The Xerox tools do provide for certain kinds of item-and-process affixation processes.


The formalism resembles that of the American structuralists, in the sense that there is no provision for the use of rules written in terms of phonological (distinctive) features (much less autosegmental-style analyses, or constraint-based theories). While this means that the analysis is couched in terms of a severely dated theory, for purposes of documenting a grammar, this is not a disadvantage: the changes which later theories of morphology made are largely irrelevant to Nahuatl (which has no large-scale non-concatenative morphological processes, the focus of much more recent work).9 Likewise, the theories of phonology which came later were, for the most part, either in areas which are not crucial to Nahuatl (such as stress and tone), or were aimed at solving questions of descriptive or explanatory adequacy which are not essential to documenting the grammar of a language. In summary, we are interested in getting the facts of Nahuatl right, and in drawing attention to the generalizations that we know of, leaving it to future generations of linguists to draw conclusions of theoretical import. For our purpose, we have found the Xerox transducers to be perfectly adequate.

However, the use of a computationally interpreted grammar gives rise to a dilemma. Given that the best practice for documentation is to use plain text, while the inclusion of a working morphological grammar implies an executable program, there is a tension between the long-term goals of language documentation and our desire to provide a complete description of Nahuatl morphology. We want the grammar to be runnable for years to come. We do not want to rely on the continued existence of a particular executable version of a morphological engine, an engine which may only be executable under a fixed set of operating systems on a fixed set of CPUs.10

In theory, one could overcome this problem by considering the archival form of the grammar to include a computer-readable form of the Xerox tools, together with an executable form of the Linux operating system. But this overlooks issues of hardware compatibility (does the archival grammar include an entire PC to run the software on, or perhaps an emulator?), as well as more mundane issues of copyright.

Alternatively, one might hope that the owners of proprietary software (Xerox, in this case) will some day release the source code for the software that we are dependent on. Whether this hope will be realized in any particular case is of course unknowable. Moreover, releasing the source code does not necessarily mean that anyone can produce an executable form of the program, since the programming languages the tools are written in may not exist in the long term.11

9 More recent theories of morphology often assume a feature-based approach to grammatical meaning, which the Xerox tools also do not directly support. Again, this is not highly relevant to our grammar.

10 This portability problem was brought forcibly to our attention after a recent server upgrade from the Linux OS to the very similar FreeBSD OS. Our Xerox tools ceased to work; the transducer could not compile the grammar. The problem was resolved by installing Linux compatibility libraries. But that solution may not be available ten years from now, and it will almost certainly not be available to investigators who might wish to use our grammar a century hence.

11 This is not a minor quibble: one of the authors (Maxwell) released the source code to his own morphological parser (Hermit Crab, www.sil.org/computing/hermitcrab/) a decade ago. Unfortunately, it was coded in versions of Prolog and C that are no longer available, and porting to currently available versions of Prolog would be non-trivial. ‘Open source’ is not the complete answer.


For purposes of language documentation, therefore, the limitation of requiring a particular executable program, compiled for a particular combination of operating system and CPU, is unacceptable. This means that the grammar we have written for the Xerox transducer is, by itself, inadequate to the goal of language documentation.

Fortunately, the programming languages that the tools interpret (xfst and lexc) are based on the abstract language of regular expressions. The theory of finite state automata and transducers, which generate and parse the languages defined by regular expressions, is well understood, and forms the groundwork of much of modern computing theory. Thus, to the extent that the notation expected by the Xerox tools is generic, it meets our needs for longevity as well as could be expected from any such formalism. Unfortunately, the xfst/lexc programming languages do have idiosyncrasies in their notation (as do all such notations).
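Because xfst replace rules denote regular relations, their basic effect (though not the efficiency of a compiled transducer) can be approximated with ordinary regular expressions. The sketch below renders a rule of the form A -> B || L _ R as a regex substitution; the simplified w-to-h rule shown is adapted from the grammar's w → h rule, with its boundary condition omitted:

```python
import re

def replace_rule(a: str, b: str, left: str = "", right: str = ""):
    """Approximate an xfst replace rule A -> B || L _ R as a regex
    substitution (adequate for simple, fixed-width contexts only)."""
    pattern = (f"(?<={left})" if left else "") + a + (f"(?={right})" if right else "")
    rx = re.compile(pattern)
    return lambda s: rx.sub(b, s)

# A simplified w -> h || _ k rule (the real rule also requires a morpheme
# boundary before /k/):
wk2h = replace_rule("w", "h", right="k")
print(wk2h("chi:xte:wka"))  # chi:xte:hka
```

This equivalence is exactly why a plain-text statement of the rules, accompanied by prose, remains reimplementable even if the Xerox executables become unusable.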

One solution would be to simply document the xfst and lexc notation, perhaps by including a computer-readable version of the manual in the archival form of our grammar. But this is a high price to pay, not only in terms of the size of the manual, but more importantly in terms of the burden that it would impose on users. Anyone wishing to understand our grammar would need to first read the programming language manual.

A better approach, we argue in the next section, is to document the effect of the rules and other notation at the point in the grammar where they are used.

4 Two Forms of Documentation

“…one deliberately writes a paper, not just comments, along with code.”—Doug McIlroy. “Programming Pearls: A Literate Program”, CACM, June 1986, pg. 478-479.

In language engineering, best practice for the documentation of computational grammars calls for “two types of documentation: one within the grammar itself and one as an overview of the grammar” [8:169]. Language engineering differs from language documentation with respect to the purpose of the grammar, but this recommendation, at least, is relevant to both communities.

In our Nahuatl project, the task of writing the human-readable, or prose, version of the grammar has fallen to one of us (Jonathan Amith, who speaks Nahuatl), while the task of writing the computer-processable grammar, with its own documentation, falls to the other (Mike Maxwell, who “speaks” the xfst programming language). In the process of writing the xfst version, Maxwell found himself frequently writing comments into the source code that were paraphrases of the prose grammar, often including references to section numbers in the human-readable grammar. The two documents were thus intertwined, in the sense that a thorough understanding of the source code and its comments required jumping back to the prose grammar. Of course, this meant that revisions had to be made in both places.

Moreover, the prose grammar was sometimes ambiguous, or omitted details that were necessary in order to write the machine-readable grammar. For example, there are two processes of consonant cluster simplification: degemination, which reduces a sequence of two identical consonants to one, and affricate reduction, which changes an alveolar affricate to a fricative in the environment preceding another alveolar consonant. The grammar gave examples of the application of these rules in which the second alveolar consonant was /n/, /l/, and /t/, but did not explain what happened to a sequence of two alveolar affricates /ts+ts/. In theory, both rules would be applicable. But since the two rules are mutually bleeding (the application of one precludes the application of the other), in fact only one of them could apply. Under xfst (as well as the particular theory of phonology on which xfst is roughly based), which rule would apply is determined by the rule order.12 For purposes of the xfst grammar, the author of the xfst grammar therefore had to know which result was appropriate, but the human-readable grammar was silent on this point. (As it happens, it is the affricate reduction rule which applies, and this is now documented.)
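The ordering question can also be checked mechanically. The toy implementation below is our own simplification (it treats the affricate ‘ts’ as a single segment and ignores everything but these two rules); it confirms that the two orders give different outputs for /ts+ts/, which is why the prose grammar had to specify one:

```python
import re

ALVEOLAR = {"t", "s", "n", "l", "ts"}

def segs(s):
    return re.findall(r"ts|.", s)   # treat the affricate 'ts' as one segment

def degemination(xs):
    """Reduce a sequence of two identical segments to one."""
    out = []
    for x in xs:
        if out and out[-1] == x:
            continue                 # drop the second of two identical segments
        out.append(x)
    return out

def affricate_reduction(xs):
    """ts -> s before another alveolar consonant."""
    return ["s" if x == "ts" and i + 1 < len(xs) and xs[i + 1] in ALVEOLAR
            else x for i, x in enumerate(xs)]

form = segs("tsts")                  # underlying /ts+ts/
print("".join(degemination(affricate_reduction(form))))  # sts (order used)
print("".join(affricate_reduction(degemination(form))))  # ts  (other order)
```

Whichever rule applies first bleeds the other, so the surface form depends entirely on the stipulated order, just as the prose grammar now records.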

The two forms of the grammar, with their individual documentation, are thus complementary: the prose form is easier for the human reader to understand. The machine readable grammar, on the other hand, is more precise and less ambiguous.

In this sense, the distinction is like that between a traditional written grammar, and a generative grammar written in a precise formalism. But in a practical sense, the machine readable grammar is even more precise than the generative grammar, because it can be run on a computer and tested for correctness against data. While in theory it is possible to hand-test a generative grammar, in practice this becomes nearly impossible for grammars with any degree of complexity (and our Nahuatl grammar certainly exceeds this degree of complexity, at least for us).

Given that both a verbal description and an implementable form of the grammar are necessary for purposes of language documentation, the key issues are: how to minimize duplication between the two forms, keep them in sync, and make both grammars usable for decades, or better, centuries.

5 Literate Programming for Language Documentation

“Literate programming…is the difference between performing and exposing a magic trick.” –Ross Williams, FunnelWeb Tutorial Manual

For purposes of language documentation the ideal grammar would combine the human readable and computer readable versions into one. This is precisely what the discipline of literate programming was invented for.

Donald Knuth [9] created the term “literate programming” to describe a way of writing computer programs that inverted the usual way of documenting code. Instead of writing a document consisting of source code which the computer could understand, and then sprinkling in comments for humans who might need to decipher it (to fix bugs, or improve it in some way), the literate programming document becomes a way of explaining to a human being how the program works, and contains the source code at appropriate points. An executable program can be extracted from the document, a process which Knuth referred to as ‘tangling’. The process of producing an elegantly formatted human-readable document he called ‘weaving’.
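A minimal tangle step can be written in a few lines. The chunk markers below are invented for illustration and are not Knuth's WEB syntax; the point is only that the executable program is mechanically recoverable from the prose document:

```python
def tangle(document: str, open_mark="<<code>>", close_mark="<<end>>"):
    """Extract the code chunks from a literate document, in order."""
    chunks, in_code, current = [], False, []
    for line in document.splitlines():
        if line.strip() == open_mark:
            in_code, current = True, []
        elif line.strip() == close_mark:
            in_code = False
            chunks.append("\n".join(current))
        elif in_code:
            current.append(line)
    return "\n".join(chunks)

doc = """Prose explaining a rule.
<<code>>
define wk2h [w -> h || _ k];
<<end>>
More prose."""

print(tangle(doc))  # define wk2h [w -> h || _ k];
```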

The advantages of literate programming are even more important for language documentation than for computer programming. Computer programs have a longevity measured in years, or at best, in decades. Language grammars, on the other hand, should remain accessible for centuries, even millennia.13

12 Or more precisely, the order in which the finite state transducers representing the two rules are composed.

We sketch here what such a literate grammar looks like. Consider the fragment of the Nahuatl grammar shown in the figure below.

4.3 h-deletion

/h/ deletes everywhere except word-finally (including before zero morphs). This applies only to underlying /h/, not to the various /h/s that are derived from /w/ and /k/; that is, the rules deriving /h/ from /w/ and /k/ counterfeed h-deletion.

Any /h/ internal to a lexically listed stem is presumably non-derived, and should not be subject to this rule. This condition is encoded in the xfst version of the rule by the requirement that the right-hand environment contain an overt morpheme boundary ('%-'; the '%' is an escape character required by the xfst notation before a special character like '-').

This rule bleeds the rule of e-epenthesis (below).

define hDeletion [h -> 0 || [? - c] _ %- [Cons | Vowel]];

The '[? - c]' in the left environment of h-deletion prevents the rule from applying to the 'h' of the grapheme 'ch' (it encodes the regular expression "any single character except 'c'").

4.4 w → h

/w/ becomes /h/ before /k/. Again, this applies only in derived environments, which we have encoded in xfst by requiring an explicit boundary marker.

This rule counterfeeds the above rule of h-deletion (section 4.3). It should therefore also be bled by the rule of k-Drop (section 4.2), by transitivity of ordering. It is unclear whether w → h interacts crucially with any other rules.

define wk2h [w -> h || _ Nulls %- k];

The following is an example of the application of this rule:

chi:x-te:w-ka-0 'wait.for/V-upon.parting-PLUP-SG' → chi:xte:hka (by w → h)

Fig. 1. A fragment of the Nahuatl grammar

The rules shown there are explained in prose, as is their interaction with other rules. As would be expected in a grammatical description intended for human consumption, we include examples of rule application in this grammar fragment. It is in fact common practice to include examples in traditional computational grammars [8]. Well-commented code is also preferred in programming languages, including computational grammars; but most code is in fact under-commented. The expository nature of literate programming, as applied to grammars of natural languages, makes it more natural to supply examples for the human reader's benefit.

13 One might question the longevity of the English or Spanish in which the grammatical description is written, not to mention that of the dictionary glosses, although we see no alternative here. Of course, the art of machine translation may render this a moot question.
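To make the rule interaction concrete, the ordering in the figure (h-deletion applying before w → h, so that derived /h/ survives) can be approximated with simple regular-expression replacements. This is our own illustrative sketch, not the project's xfst grammar: it ignores the Nulls and e-epenthesis details, and the form tah-ko is a hypothetical input invented for the demonstration.

```python
import re

# Underlying forms use '-' as the morpheme boundary and '-0' for a
# zero morph, as in the grammar fragment above.

def h_deletion(form):
    # h -> 0 / [any char except c] _ - [segment]: deletes underlying
    # /h/ before an overt morpheme boundary; the left context [? - c]
    # keeps the 'h' of the digraph 'ch' intact.
    return re.sub(r"(?<=[^c])h(?=-[a-z])", "", form)

def w_to_h(form):
    # w -> h / _ - k (the Nulls term is ignored for simplicity)
    return re.sub(r"w(?=-k)", "h", form)

def surface(form):
    # Rule ordering: h-deletion precedes w -> h, so an /h/ created by
    # w -> h escapes deletion (counterfeeding).
    form = h_deletion(form)
    form = w_to_h(form)
    return form.replace("-0", "").replace("-", "")

print(surface("chi:x-te:w-ka-0"))  # chi:xte:hka -- derived /h/ survives
print(surface("tah-ko"))           # tako -- underlying /h/ deletes (hypothetical form)
```

Reversing the order of the two calls in surface() would feed the derived /h/ into h-deletion, wrongly yielding chi:xte:ka; the counterfeeding order is what keeps the attested form.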

The formatting—bold for section headings, italics for names of rules, the use of Courier font for programming code—is also intended to be an aid to understanding, and is not available in the comment fields of (most) computer languages. The formatting itself is of course not present in the XML file which is the archival form of the grammar, since that would be contrary to our aims of portability. Instead, the formatting is provided by style sheets, in standard XML fashion.

Walsh [10] has proposed a small extension to XML to support literate programming. The extension consists of two primitives, <src:fragment> and <src:fragref>, which respectively allow the inclusion of programming code fragments, and references to those fragments. These extensions are well suited to being used with DocBook [11], an XML format commonly used for books and articles. Stylesheets are then used to produce the nicely formatted human-readable document (in printed format, PDF file, HTML format, etc.), as well as the actual program. We are currently engaged in producing a grammatical description of Nahuatl, complete with working programming code, in this literate programming format. This will then serve as the archival format for the Nahuatl grammar.
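The "tangling" step that extracts the program from such a document can be sketched in a few lines of Python: collect the <src:fragment> elements and expand each <src:fragref> reference recursively. The element names <src:fragment> and <src:fragref> follow Walsh's proposal; the namespace URI, the <article> wrapper, and the id/linkend attribute names are our assumptions for the sake of a runnable example.

```python
import xml.etree.ElementTree as ET

# Namespace URI assumed for illustration; substitute the one Walsh defines.
SRC = "http://nwalsh.com/xmlns/litprog/fragment"

DOC = """\
<article xmlns:src="http://nwalsh.com/xmlns/litprog/fragment">
  <para>The top-level rule is built from two fragments:</para>
  <src:fragment id="top">
define hDeletion [h -&gt; 0 || [? - c] _ %- [Cons | Vowel]];
<src:fragref linkend="wk2h"/></src:fragment>
  <src:fragment id="wk2h">define wk2h [w -&gt; h || _ Nulls %- k];</src:fragment>
</article>"""

def tangle(frag_id, fragments):
    """Expand one fragment, recursively resolving its fragrefs."""
    frag = fragments[frag_id]
    parts = [frag.text or ""]
    for child in frag:
        if child.tag == "{%s}fragref" % SRC:
            parts.append(tangle(child.get("linkend"), fragments))
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(DOC)
fragments = {f.get("id"): f for f in root.iter("{%s}fragment" % SRC)}
print(tangle("top", fragments))  # the assembled xfst source
```

A stylesheet applied to the same document would instead render the prose and typeset the fragments, giving the human-readable view from the identical archival source.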

6 Conclusion

“Always remember that you’re not just writing for the next couple months or years, but possibly for the next couple of thousands of years.”—Elliotte Rusty Harold and W. Scott Means, XML in a Nutshell, p. 96.

We have argued that language documentation calls for not only a verbal description of the grammar, but a computational analysis as well. We have also argued why this need is ill served by a traditional programming approach, and in its place have advocated a literate computing approach, intertwining the verbal description and the computational analysis. And finally, we have described how our own language documentation project, encompassing several dialects of the Nahuatl language, is using such a documentation method.

Epilogue: Object Oriented Grammar Editors

We argued above that the human-readable and machine-readable forms of the grammar are complementary. As linguists, we cannot help noting that the term ‘complementarity’ is related to the linguistic term ‘complementary distribution’, suggesting that the two forms of the grammar might, in fact, be best viewed as allo-grammars: dual manifestations of a single underlying structure. For example, consider the grammar fragment in the figure below, incorporating both prose and code:


The object-marking prefixes of the Oapan dialect are shown in the chart below:

       Singular   Plural
1      ne:ch-     te:ch-
2      mits-      me:ch-
3      k-         kim-

The xfst code for this is as follows:

define ObjPre [[[1SgO .x. {ne:ch}]%-"@D.PERS.1@"]
              |[[2SgO .x. {mits} ]%-"@D.PERS.2@"]
              |[[3SgO .x. k      ]%-             ]
              |[[1PlO .x. {te:ch}]%-"@D.PERS.1@"]
              |[[2PlO .x. {me:ch}]%-"@D.PERS.2@"]
              |[[3PlO .x. {kim}  ]%-             ]];

Fig. 2. Another fragment of the Nahuatl grammar

The xfst ‘define’ repeats much of the information in the table. This is obvious for the phonological form of the morphemes, but it is also true for the representation of their meaning: the glosses in the xfst code are directly related to the table labels: ‘1SgO’ is equivalent to ‘1’ (row label) plus ‘Singular’ (column label) plus ‘Object’ (from the caption); this relation between morphosyntactic features and glosses is argued in more detail in [12]. The flag diacritics (the xfst constructs containing the ‘@’ signs) repeat the person information. Even the hyphens at the right end of the morphemes in both representations (representing boundary markers) are in fact a “view” of the fact that this is a table of prefixes (as opposed to suffixes or roots).

The table and the xfst code are thus best seen as views of a single underlying structure: a slot in an inflectional schema, together with the affixes that can fill that slot. Suppose that we had a single computational object representing that structure, and that we could design views of that object that looked like the table or like the xfst code. Then we would simply insert a reference to that object in the prose of the grammar, choosing the table view to give human-readable form, and the xfst code view to produce the machine-readable form.
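Such a dual-view object can be sketched as follows. This is a minimal illustration of the idea, not part of any existing tool: the AffixSlot class, its method names, and its internal representation are our own invention.

```python
# One underlying object -- an inflectional slot plus its affixes --
# with two "allo-grammar" views: a chart for the human reader and an
# xfst 'define' statement for the machine-readable grammar.

class AffixSlot:
    def __init__(self, name, affixes):
        self.name = name        # e.g. "ObjPre"
        self.affixes = affixes  # list of (gloss, form, flag-diacritic)

    def as_table(self):
        # Human-readable view: a person-by-number chart of the prefixes.
        rows = {}
        for gloss, form, _ in self.affixes:
            person, number = gloss[0], gloss[1:3]  # "1SgO" -> "1", "Sg"
            rows.setdefault(person, {})[number] = form + "-"
        lines = ["   Singular   Plural"]
        for person in sorted(rows):
            lines.append(f"{person}  {rows[person]['Sg']:10} {rows[person]['Pl']}")
        return "\n".join(lines)

    def as_xfst(self):
        # Machine-readable view: an xfst 'define' statement.
        alts = [f"[[{gloss} .x. {{{form}}}]%-{flag}]"
                for gloss, form, flag in self.affixes]
        return f"define {self.name} [" + "\n  |".join(alts) + "];"

obj_pre = AffixSlot("ObjPre", [
    ("1SgO", "ne:ch", '"@D.PERS.1@"'),
    ("2SgO", "mits",  '"@D.PERS.2@"'),
    ("3SgO", "k",     ""),
    ("1PlO", "te:ch", '"@D.PERS.1@"'),
    ("2PlO", "me:ch", '"@D.PERS.2@"'),
    ("3PlO", "kim",   ""),
])

print(obj_pre.as_table())
print(obj_pre.as_xfst())
```

The prose of the grammar would then contain only a reference to obj_pre; the stylesheet (or editor) chooses as_table() for the human-readable rendering and as_xfst() when compiling the transducer, so the two views can never fall out of sync.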

Similarly, the entire grammar—from individual affixes to phonological rules—can be conceived of as a collection of objects.

An object-oriented approach was taken in the LinguaLinks project [13-15], developed by SIL during the 1990s; and it is being taken in the current SIL FieldWorks project (http://fieldworks.sil.org/). Such an approach is capable of incorporating a literate computing view directly in the user interface.

An object-oriented grammar development system comes with a cost: the objects must somehow be integrated into a traditional word processor (or other document-producing tool), or else the capabilities of the traditional word processor must be replicated in the object-oriented programming environment. Additionally, the programmers of the object-based grammar tools must anticipate the potential needs of all users. Not everyone will want to use such an environment. But even for grammars developed in such an environment, the object-oriented database is a binary format; another format, such as an XML-based format, is needed for archival purposes.

Thus, regardless of the methodology used to create the grammar, we believe that the literate programming approach we have outlined in this paper constitutes the best practice for grammar documentation: a text (XML) based format which intertwines verbal description and the programming source code for the grammar.

References

1. Grimes, B.F.: Ethnologue: Languages of the World. 14th ed. Summer Institute of Linguistics, Dallas, 2000

2. Derbyshire, D.: Hixkaryana. Lingua Descriptive Studies 1. North-Holland, Amsterdam, 1979

3. Urbanczyk, S.: Patterns of Reduplication in Lushootseed. Outstanding Dissertations in Linguistics. Garland Publishing, New York, 2001

4. Bird, S. and Simons, G.: Seven dimensions of portability for language documentation and description. Language 79, 2003, 557-582

5. Weigel, W.F.: The Yokuts Canon: A case study in the interaction of theory and description. Paper presented at the Annual Meeting of the Linguistic Society of America. San Francisco, 2002

6. Blevins, J.: A Reconsideration of Yokuts Vowels. International Journal of American Linguistics 70, 2004, 33-51

7. Beesley, K.R. and Karttunen, L.: Finite State Morphology. CSLI Studies in Computational Linguistics. University of Chicago Press, Chicago, 2003

8. Butt, M. and King, T.H.: Grammar Writing, Testing and Evaluation. In Farghaly, A. (ed) Handbook for Language Engineers. CSLI Publications, Stanford, 2003, 129-179

9. Knuth, D.E.: Literate programming. The Computer Journal 27, 1984, 97-111

10. Walsh, N.: Literate Programming in XML. In XML 2002. Baltimore, MD, 2002

11. Walsh, N. and Muellner, L.: DocBook: The Definitive Guide. O'Reilly & Associates, Inc., Sebastopol, California, 1999

12. Hayashi, L.S., Maxwell, M.B., and Simons, G.: A Morphological Glossing Assistant. In Proceedings of the International LREC Workshop on Resources and Tools in Field Linguistics. Las Palmas, Spain, 2002

13. Rettig, M., Simons, G., and Thomson, J.: Extended objects. Communications of the ACM 36, 1993, 19-24

14. Simons, G.F. and Thomson, J.V.: Multilingual Data Processing in the CELLAR Environment. In Nerbonne, J. (ed) Linguistic Databases. Center for the Study of Language and Information, Stanford, CA, 1997, 203-34

15. Simons, G.F.: The Nature of Linguistic Data and the Requirements of a Computing Environment for Linguistic Research. In Lawler, J. (ed) Using Computers in Linguistics: A Practical Guide. Routledge, New York, NY, 1998, 10-25