The Bible as a parallel corpus: Annotating the ‘Book of 2000 Tongues’

Computers and the Humanities 33: 129–153, 1999. © 1999 Kluwer Academic Publishers. Printed in the Netherlands.

The Bible as a Parallel Corpus: Annotating the

“Book of 2000 Tongues”

PHILIP RESNIK, MARI BROMAN OLSEN and MONA DIAB
Department of Linguistics and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA (E-mail: {resnik,molsen,mdiab}@umiacs.umd.edu)

Abstract. We report on a project to annotate biblical texts in order to create an aligned multilingual Bible corpus for linguistic research, particularly computational linguistics, including automatically creating and evaluating translation lexicons and semantically tagged texts. The output of this project will enable researchers to take advantage of parallel translations across a wider number of languages than previously available, providing, with relatively little effort, a corpus that contains careful translations and reliable alignment at the near-sentence level. We discuss the nature of the text, our annotation process, preliminary and planned uses for the corpus, and relevant aspects of the Corpus Encoding Standard (CES) with respect to this corpus. We also present a quantitative comparison with dictionary and corpus resources for modern-day English, confirming the relevance of this corpus for research on present day language.

Key words: Bible, computational linguistics, parallel corpora, Corpus Encoding Standard, translation lexicons

1. Why This Text?

1.1. THE NATURE OF THE TEXT

The Bible is a widely available, representative sample of carefully translated texts in a variety of styles in a broad range of languages. These properties uniquely suit our purposes, which include research in lexical semantics, construction of translation lexicons, and evaluation of semantic tagging for multilingual machine translation and other natural language processing applications. The question of which documents comprise the biblical canon has been a source of historical and theological controversy (Weigelt, 1988; Niebuhr, 1997), but considerations of availability – which texts are most consistently available in electronic form in multiple languages – have led us to define our corpus as comprising the 66 books described in Figure 1.¹

Taking all the books together, the corpus represents at least 30–40 authors in a variety of text styles, including representative samples of narrative, poetry, and correspondence. The New Testament subcorpus alone “compares favourably in size to other major collections analysed by scholars . . . approximately as large as if not larger than the corpus of Homer’s Iliad, of Homer’s Odyssey, of Sophocles, of Aeschylus, of Herodotus. . . [with] individual books. . . comparable in size to other well-known classical texts: e.g. Plato’s Apology approximates the size of Paul’s Romans or 1 Corinthians” (Porter, 1989).

− OLD TESTAMENT (Hebrew canon)
  • Pentateuch [also known as Law, Torah]: Genesis, Exodus, Leviticus, Numbers, Deuteronomy
  • Historical Books: Joshua, Judges, Ruth, 1 Samuel, 2 Samuel, 1 Kings, 2 Kings, 1 Chronicles, 2 Chronicles, Ezra, Nehemiah, Esther
  • Poetry and Wisdom Literature: Job, Psalms, Proverbs, Ecclesiastes, Song of Solomon
  • The Prophets:
    ∗ Latter (Literary, Writing) Prophets: Isaiah, Jeremiah, Lamentations, Ezekiel
    ∗ Minor Prophets: Daniel, Hosea, Joel, Amos, Obadiah, Jonah, Micah, Nahum, Habakkuk, Zephaniah, Haggai, Zechariah, Malachi
− NEW TESTAMENT
  • The Gospels and Acts: Matthew, Mark, Luke, John, Acts
  • The Letters: Romans, 1 Corinthians, 2 Corinthians, Galatians, Ephesians, Philippians, Colossians, 1 Thessalonians, 2 Thessalonians, 1 Timothy, 2 Timothy, Titus, Philemon, Hebrews, James, 1 Peter, 2 Peter, 1 John, 2 John, 3 John, Jude, Revelation

Figure 1. Biblical texts used in this work.

As a resource for research using corpus-based statistical methods in computational linguistics, the Bible is small by current standards (e.g. see Church and Mercer, 1993); with some variation for language and translation, it is typically on the order of 800,000 words and 4–5 megabytes. However, this is comparable to some monolingual corpora widely used for corpus-based research, such as the Brown Corpus of American English (Kucera and Francis, 1967). Furthermore, the breadth of languages offers an opportunity for research not generally available with the larger corpora in use today.

1.2. AVAILABILITY

Originally written in Hebrew, Aramaic, and Greek, between 1000 B.C. and 100 A.D., the Bible is the world’s most translated book (Connolly, 1996). The first complete translation, the Latin “Vulgate,” or common version, was made in the 4th century by Jerome. By 1804 there were 67 languages with at least one book translated – 68 by the beginning of this century. Now the number tops 2197, with more than 363 complete Bibles, and 905 New Testaments.²,³ Additional translations are in process, including 680 new languages by personnel in national bible societies assisted by the United Bible Society (UBS) alone.⁴ Furthermore, the above figures changed dramatically in the 10 months since the first draft of this paper: 180 more translations are in progress, 13 more Bibles are completed, and 25 New Testaments. Many versions are now available electronically, either as text in the public domain, or bundled with search software for a modest fee. See for example, Bibleworks,⁵ as well as Section 2.5.

Several sites on the World Wide Web provide reference information on Bible translations. The UBS maintains a site that provides information on whether and when translations were first made in a given language; for example, see search results for Warlpiri, Sorbish and Quechua (edited for brevity) in Figure 2.⁶ Each response includes a link to the local Bible Society, from which certain texts may be ordered, at least in print.

----------------------------------------------------------------
You searched for Warlpiri
This language is sometimes called Wailbri or Walpiri.
Warlpiri is spoken in Australia (Northern Terr., Hooker Creek)
First publication of:
A single book of the Bible 1985
----------------------------------------------------------------
You searched for sorbish
sorbish is referred to in this database as Sorbian: Upper.
Sorbian: Upper is spoken in Germany (SE, Upper Saxony.)
First publication of:
A single book of the Bible 1670
The New Testament 1706
The Bible 1728
----------------------------------------------------------------
You searched for quechua
That language name could not be found in the database. This may
be for one of three possible reasons:
...

Possible Matches: Languages in the database that closely match
your query:
quechua: apurimac
quechua: ayacucho
...

Possible Matches: Variant language names that match your query:
ambo-pasco quechua
ancash conchucos quechua
...

----------------------------------------------------------------

Figure 2. Sample output: Scriptures of the World.

Language                    Description                    Price
QUECHUA, CUZCO              1988 Peruvian Bible Society    33
QUECHUA, AYACUCHO           1987 Peruvian Bible Society    33
QUECHUA, NORTH BOLIVIAN     New Testament
QUECHUA,AYACUCHO            New Testament
QUECHUA,CENTRAL BOLIVIAN    New Testament
QUECHUA,SAN MARTIN          New Testament

Figure 3. Sample output: Catalogue of World Scriptures.

The Forum of Bible Agencies, whose members include the UBS and Wycliffe Bible Translators (with the participation of the Summer Institute of Linguistics (SIL)), maintains a consolidated site with a Catalogue of World Scriptures.⁷ Most references are to print Bibles; however, one goal of the site is to provide links to electronic Bible versions as they become available. The Bibles Online link failed to return any response for Warlpiri or Sorbish, but produced the text output in Figure 3 for Quechua. The Languages of the World link retrieved results for both Warlpiri and Quechua, and provided a link to Sorbish under both Country and Variant Name links.

The American Bible Society is experimenting with HTML markup and web distribution of more comprehensive information on all translations. Figure 4 is a sample showing the proposed format of the Book of 2000 Tongues project, an electronic update of their Book of 1000 Tongues (Liana Lupas and Erroll F. Rhodes, eds., 1939 and 1972).⁸

A number of versions of the Bible are currently available in electronic media and on the Web, including multiple language versions – for example, the Bible Gateway web site offers multiple English translations (NIV, NASB, RSV, KJV, Darby, YLT) and versions (or portions) in German, Swedish, Latin, French, Spanish, and Tagalog.⁹ However, our survey of available on-line versions indicates that, while most versions appear in a format useful for browsing and searching, there is no parallel corpus of the Bible, in the sense of a collection of documents that is both marked up monolingually according to a standard set of conventions and also explicitly aligned across languages.

For example, the Bible Gateway site makes it possible for a user to retrieve particular passages, and even entire chapters at once. However, the markup in the retrieved text does not explicitly relate verses to their counterparts in other languages. In particular, although their encoding uses DT and DD elements to encode verse numbers and verse contents, respectively, this structural choice is used primarily for its presentational effect – the uniform indentation of verses even when they occupy multiple lines of text. For example, consider Genesis 1:3–4 in an English version (NIV) and French version (LSG) at the site.¹⁰

ENGLISH

Speakers: 450,000,000 first language speakers (1991 est.);
800,000,000 total including second language speakers (Ethn12).

Location: United Kingdom, United States, international.

Kinship: Indo-European / Germanic / West / North Sea / English.

1526 New Testament [Repr.+1836, +1837, +1989; Facs. 1862, +1976]
Peter Schoeffer, Worms

1530 +Pentateuch [Repr. +1967, +1992] Hans Luft, Marburg
(= J. Hoochstraten, Antwerp)

1531 Jonah [Facs. +1863] Martin de Kayser, Antwerp? Translated
by William Tyndale (Hychyns)
...

Figure 4. Sample output: Proposed Book of 2000 Tongues.

〈DT〉3〈DD〉 And God said, “Let there be light,” and there was light.

〈DT〉4〈DD〉 God saw that the light was good, and he separated the light from the darkness.

〈DT〉3〈DD〉 Dieu dit: Que la lumière soit! Et la lumière fut.

〈DT〉4〈DD〉 Dieu vit que la lumière était bonne; et Dieu sépara la lumière d’avec les ténèbres.

Nothing in the encoding of the verse provides a mapping from Genesis 1:3 in English to the corresponding verse in French, as opposed to, say, Genesis 2:3 or Exodus 12:3. Nor can the parallel relationship between these verses or even the books themselves be inferred from other elements on the page, without additional knowledge that English Genesis corresponds to French Genèse, Exodus to Exode, and so forth. Making this parallelism explicit is, we believe, part of what distinguishes creating a parallel corpus from the activity of collecting multiple translations.

In summary, the Bible is available in print form for a huge range of languages, and in on-line form for a respectable and growing subset of those languages. However, to our knowledge, the project of creating a parallel corpus for the Bible has not previously been attempted.

1.3. CAREFUL TRANSLATION

Because to most of its translators the Bible represents God’s Word, it is among the world’s most carefully translated texts. As suggested by the explosive increase in number of translations described above, wide use and acceptance of translations is a relatively recent phenomenon. A large part of the history of translation theory in general is directly or indirectly due to past and present biblical translation (Nida, 1964; Nida and Taber, 1969; Robinson, 1991). Much current theory and practice grew out of the zeal of Protestant missionaries of the past two decades, eager to present the Bible in the language of the people (Bassnett-McGuire, 1980).

Numerous publications on Bible translation discuss nuances and difficulties of translation at a microscopic level, sometimes providing sentence-by-sentence guidance at the linguistic, semantic, and pragmatic levels (see, for example, Beekman and Callow, 1974; Blight, 1992; Deibler, 1993; Moore, 1993; deWaard-Smalley, 1979). Similar care is taken in the introduction of new translations: a recent attempt to introduce gender neutral language in the New International Version (NIV), last revised in 1983, met with a great outcry in some segments of the population, leading the publisher to abandon the project (LeBlanc, 1997). Translations in established languages, therefore, tend to be conservative. And first translations in a given language are subject to rigorous scrutiny at every stage. We may therefore be confident that the texts we have are as accurate as humanly possible.

In languages with multiple translations, texts could also be paired according to age and style of translation: there are many translation contemporaries of the King James Version, for example Luther’s German and the Spanish Reina de Valera, which have a formal, literary feel. In contrast, versions from UBS and SIL tend to follow “dynamic equivalence” theories of translation (Nida, 1964; Nida and Taber, 1969), which attempt to make the impact of the original text idiomatic for today. For example, an original Greek phrase in Colossians 1:20 translates literally as “the blood of his cross,” whereas the Good News Bible has “God made peace through his Son’s sacrificial death on the cross” (GNB, 1976), emphasis added. Alignment of such translations would therefore serve as an important source for pairs of idioms and figures of speech.¹¹

1.4. STANDARD STRUCTURE AND VERSE ALIGNMENT

One problem working with parallel corpora is that most often they are not explicitly aligned – for example, a considerable amount of work has been done attempting to automatically align the Canadian Hansards (parliamentary proceedings in English and French), at the sentence and sometimes word level.¹² In contrast, the structure of many Biblical texts is largely standardized in terms of books, chapters, and verses, particularly for the 66-book collection we have designated as our object of study. Thus, verse-level alignment for our collection comes almost for free; in fact, the first aim of this project is to represent that standardized structure in a consistent format.

Of course, some variation among translations exists within individual verses; finer-grained alignments, e.g. word-aligned parallel corpora, will still require further work. For example, if versions differ in their source texts (say, the Masoretic Text and the Septuagint), there will be variation in the versions not attributable to differences between the target languages. Source textual variation within verses does not affect the fundamental alignment of the verses, but may introduce noise into word-alignment programs between the two versions (see Section 3.2). However, the structure inherent in the Bible allows us to identify coarser variation such as presence/absence of particular sections. For example, some later translations eliminate, relocate, or footnote passages found in the KJV (as well as its contemporaries and descendants). The last part of Mark 16:8 and John 7:53–8:11 are contested: both are attested in the plurality of manuscripts, but not in the oldest texts, discovered since the KJV translation. Verse alignment limits the impact of such omissions, as such cases simply result in null pairings. (See Melamed, 1996c for discussion of automatic methods for detecting omissions in translations, as well as discussion in Section 3.2.)

Thus far, such inconsistencies represent only a relative handful of instances across the entire collection that were easy to deal with. In addition, the consistent verse-level alignments provide appropriate training material for algorithms that learn to do lower-level alignments on the basis of correctly aligned text (Melamed, 1996a). Verse-level alignment may also be used as a source of test material for algorithms that attempt to produce sentence-level alignments.
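
To make the notion of a null pairing concrete, the following minimal Python sketch pairs two versions by their shared book/chapter/verse labels. The function name align_verses and the dictionary representation are illustrative assumptions, not part of the project's tooling. The sample verses are those quoted in Section 1.2, with the French verse 4 dropped artificially to simulate an omission.

```python
from typing import Dict, List, Optional, Tuple

Pairing = Tuple[str, Optional[str], Optional[str]]


def align_verses(first: Dict[str, str], second: Dict[str, str]) -> List[Pairing]:
    """Pair verses of two versions by shared label; a missing verse becomes None."""
    # Follow the order of the first version, then add labels only the second has.
    labels = list(first) + [label for label in second if label not in first]
    return [(label, first.get(label), second.get(label)) for label in labels]


# Toy data (assumption): verse 4 is removed from the French side on purpose.
english = {
    "GEN:1:3": "And God said, “Let there be light,” and there was light.",
    "GEN:1:4": "God saw that the light was good, and he separated the light from the darkness.",
}
french = {"GEN:1:3": "Dieu dit: Que la lumière soit! Et la lumière fut."}

for label, en, fr in align_verses(english, french):
    print(label, "|", en, "|", fr)
```

Under this representation an omitted verse costs nothing more than a None on one side, which is the sense in which verse alignment limits the impact of omissions.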

2. Annotation

2.1. INTERMEDIATE FORMAT

Annotating individual language versions of the 66-book collection (or, in some cases, just the New Testament) requires a simple 3-level hierarchy of text elements (book, chapter, verse). Our ultimate goal is a target annotation format using Standard Generalized Markup Language (SGML), for which the structure of the texts is specified in declarative fashion using a document type definition (DTD); this is discussed in detail in Section 2.2. However, in our initial pass through the annotation process, we labeled elements as b (book), c (chapter), and v (verse), producing a simple intermediate representation that captures the major structural levels without conforming to any particular DTD. The following examples show a single verse, Matthew 1:7, in 12 languages:¹³

CHINESE: <v id="MAT:1:7"> </v>
DANISH: <v id="MAT:1:7">og Salomon avlede Roboam; og Roboam avlede Abia; og Abia avlede Asa;</v>
ENGLISH: <v id="MAT:1:7">And Solomon begat Roboam; and Roboam begat Abia; and Abia begat Asa;</v>
FINNISH: <v id="MAT:1:7">Salomolle syntyi Rehabeam, Rehabeamille syntyi Abia, Abialle syntyi Aasa;</v>
FRENCH: <v id="MAT:1:7">Salomon engendra Roboam; Roboam engendra Abia; Abia engendra Asa;</v>
GREEK: <v id="MAT:1:7">solomwn de egennhsen ton roboam roboam de egennhsen ton abia abia de egennhsen ton asa</v>
INDONESIAN: <v id="MAT:1:7">Salomo memperanakkan Rehabeam, Rehabeam memperanakkan Abia, Abia memperanakkan Asa,</v>
LATIN: <v id="MAT:1:7">Salomon autem genuit Roboam Roboam autem genuit Abiam Abia autem genuit Asa</v>
SPANISH: <v id="MAT:1:7">Salomón Engendró a Roboam; Roboam Engendró a Abías; Abías Engendró a Asa;</v>
SWAHILI: <v id="MAT:1:7">Solomoni alimzaa Rehoboamu, Rehoboamu alimzaa Abiya, Abiya alimzaa Asa,</v>
SWEDISH: <v id="MAT:1:7">Salomo födde Roboam, Roboam födde Abia. Abia födde Asaf;</v>
VIETNAMESE: <v id="MAT:1:7">Salomôn sinh Roboam, Roboam sinh Abya, Abya sinh Asa,</v>

In all these cases, the intermediate encoding for book and chapter elements is identical, as follows, where the ellipsis represents the sequence of verse elements for all the verses in the chapter.

<b id="MAT"><c id="MAT:1">...</c></b>

The labels (id attributes) for elements make it possible to identify verses independent of context, by including the book and chapter in the label, e.g. “GEN:1:1” for Genesis, chapter 1, verse 1. This will allow users to take advantage of simple tools, such as Unix ‘grep’ for finding a particular verse, as well as more powerful SGML-based tools. Although standard codes for book abbreviations exist for publications such as the journal New Testament Studies,¹⁴ these abbreviations differed for various languages and were not consistent in length. We used the abbreviations in Table I: the first three letters of each book, except for Philippians (PHI) and Philemon (PHM), Jude (JUD) and Judges (JDG), since the first three characters failed to distinguish books. Numbers were treated as the first character for books like 1 Kings.
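
The abbreviation rule just described can be sketched in a few lines of Python. The function names book_code and verse_label and the EXCEPTIONS table are hypothetical names introduced for illustration; the rule itself (first three characters, a leading number counting as the first character, with the four listed exceptions) follows the description above and agrees with the entries of Table I.

```python
# The four books named in the text as exceptions to the first-three-letters rule.
EXCEPTIONS = {
    "Philippians": "PHI",
    "Philemon": "PHM",
    "Jude": "JUD",
    "Judges": "JDG",
}


def book_code(title: str) -> str:
    """Return a three-character code for an English book title."""
    if title in EXCEPTIONS:
        return EXCEPTIONS[title]
    # Dropping spaces makes a leading number count as the first character,
    # e.g. "1 Kings" -> "1KI" and "Song of Solomon" -> "SON".
    return title.replace(" ", "")[:3].upper()


def verse_label(title: str, chapter: int, verse: int) -> str:
    """Build a context-independent label such as "GEN:1:1"."""
    return f"{book_code(title)}:{chapter}:{verse}"


print(verse_label("Genesis", 1, 1))
print(book_code("1 Kings"), book_code("Judges"), book_code("Philemon"))
```

Because every label carries its book and chapter, a plain string search for a label is enough to locate a verse, which is what makes tools like grep usable on the intermediate format.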

Table I. Three-character book title abbreviations

Genesis          GEN    Isaiah        ISA    Romans           ROM
Exodus           EXO    Jeremiah      JER    1 Corinthians    1CO
Leviticus        LEV    Lamentations  LAM    2 Corinthians    2CO
Numbers          NUM    Ezekiel       EZE    Galatians        GAL
Deuteronomy      DEU    Daniel        DAN    Ephesians        EPH
Joshua           JOS    Hosea         HOS    Philippians      PHI
Judges           JDG    Joel          JOE    Colossians       COL
Ruth             RUT    Amos          AMO    1 Thessalonians  1TH
1 Samuel         1SA    Obadiah       OBA    2 Thessalonians  2TH
2 Samuel         2SA    Jonah         JON    1 Timothy        1TI
1 Kings          1KI    Micah         MIC    2 Timothy        2TI
2 Kings          2KI    Nahum         NAH    Titus            TIT
1 Chronicles     1CH    Habakkuk      HAB    Philemon         PHM
2 Chronicles     2CH    Zephaniah     ZEP    Hebrews          HEB
Ezra             EZR    Haggai        HAG    James            JAM
Nehemiah         NEH    Zechariah     ZEC    1 Peter          1PE
Esther           EST    Malachi       MAL    2 Peter          2PE
Job              JOB    Matthew       MAT    1 John           1JO
Psalms           PSA    Mark          MAR    2 John           2JO
Proverbs         PRO    Luke          LUK    3 John           3JO
Ecclesiastes     ECC    John          JOH    Jude             JUD
Song of Solomon  SON    Acts          ACT    Revelation       REV

2.2. TARGET FORMAT

For our target format, we chose verse-level annotation of structure conforming to the Corpus Encoding Standard (CES) subset of the TEI (Ide, 1996), which includes document type definitions (DTDs) for primary data (cesDoc), linguistic annotation (cesAna), and alignment of parallel texts (cesAlign). In many respects, creating a parallel corpus from multiple versions of the Bible is an excellent match for the CES. Since the corpus is being created primarily for use in corpus-based computational linguistics research, the restrictions imposed by the CES, in comparison to the full generality of the TEI, are suited to the task (CES Sec. 0.2.3). Moreover, the CES contains useful and explicit guidelines for encoding, and the consistent structure and content of biblical text should make it straightforward to achieve Level 1 conformance to the standard using fully automatic conversion of original files. Level 1 conformance (CES Sec. 4.2) mainly requires that the document be valid in terms of the cesDoc DTD, that it include appropriate header documentation, and that it be encoded consistently and in conformance with CES down to the paragraph level. Since encoding of paragraphs varies in the biblical text and is often absent entirely, we take Level 1 conformance to mean down to the verse level (also see Section 2.3).

It may also be possible to achieve Level 2 conformance to the guidelines for primary data using fully automatic methods, although this requires further consideration. Level 2 conformance goes beyond the minimum by requiring correct paragraph-level markup – as noted, it may be possible to accomplish this trivially because most versions of the Bible do not contain paragraphs. It also requires consistency in any marking of sub-paragraph elements and conversion of certain text features into SGML – e.g., replacing special characters such as dashes and pound signs with corresponding SGML entities, using SGML to indicate quotations, and so forth – but these conversions are amenable to automatic processing.

Level 3 conformance goes significantly beyond Level 2 in its requirements for consistently and explicitly marking up different kinds of elements in the text. Achieving Level 3 conformance is not a goal in this project because reliable identification of all the specified sub-paragraph elements, particularly names, would require significant manual effort.¹⁵

As a means for encoding parallelism of corpora, the cesAlign encoding conventions are ideally suited for the present task, since they permit an arbitrary degree of parallelism, and because alignment information is recorded in an external document, making it trivial to work with a monolingual subset or any n-way parallel subset.¹⁶ In the case at hand, the alignment document can be created for multiple parallel Bible versions trivially by encoding one-to-one links between book/chapter/verse labels, as illustrated in Figure 5. Alignment at the sub-verse level would be considerably more tricky, of course, since translations may reflect different decisions about how verses are broken into clauses, etc. We leave this for future work.

It is worth noting that, for many purposes, construction of an alignment document is unnecessary: since aligned verses are identified by identical verse tags, one need not incur the overhead of maintaining an alignment document and updating it every time a new version is added to the corpus. We anticipate, however, that as the Corpus Encoding Standard becomes better established software tools will be developed to handle more general CES-encoded parallel corpora, where the alignment is less trivial. (For example, the Multext effort (Véronis, 1996) is involved not only in developing standards, such as CES, and producing multilingual corpora encoded according to those standards, such as Orwell’s 1984, but also in creating tools that support the standards.¹⁷) Under those circumstances, it will be worth taking the simple extra step of generating an explicit alignment document in order to take advantage of the tools that may require it.
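
Since the links are one-to-one over identical labels, generating them is mechanical. The sketch below is a minimal illustration under that assumption; the function name link_elements is hypothetical, and the output merely imitates the link lines shown in Figure 5 rather than producing a complete, validated cesAlign document.

```python
from typing import Iterable, List


def link_elements(labels: Iterable[str], n_versions: int) -> List[str]:
    """Emit one link line per verse label, repeating the label once per version."""
    links = []
    for label in labels:
        # Identical labels across versions give a trivial one-to-one link.
        targets = " ; ".join([label] * n_versions)
        links.append(f'<link xtargets="{targets}">')
    return links


# Three-way alignment (e.g. English, French, Spanish) for two verse labels.
for line in link_elements(["GEN:1:1", "GEN:1:2"], n_versions=3):
    print(line)
```

Adding a version to the corpus then only changes n_versions, which reflects how little information the explicit alignment document adds beyond the shared verse labels.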

Despite the fact that the CES is well suited for our task in many ways, we found two areas in which the CES draft was problematic for our purposes at the time we initiated this work. The first is merely a question of scope, which may be remedied as the standard develops: one goal of our corpus-based research using the Bible is to investigate word sense and semantic issues, and these are explicitly outside the purview of the current CES draft (CES Sec. 0.2.4).

<cesAlign ...>
  <cesHeader ...>
    ...
    <translations>
      <translation trans.loc="Bible.EN.sgml" lang=en wsd="ISO8859-1" n=1>
      <translation trans.loc="Bible.FR.sgml" lang=fr wsd="ISO8859-1" n=2>
      <translation trans.loc="Bible.ES.sgml" lang=es wsd="ISO8859-1" n=3>
    </translations>
  </cesHeader>

  <linkList>
    <!-- verse alignments -->
    <linkGrp targType="seg">
      <link xtargets="GEN:1:1 ; GEN:1:1 ; GEN:1:1">
      <link xtargets="GEN:1:2 ; GEN:1:2 ; GEN:1:2">
      ...
      <link xtargets="REV:22:21 ; REV:22:21 ; REV:22:21">
    </linkGrp>
  </linkList>
</cesAlign>

Figure 5. Illustration of an alignment document for three-way parallel version of the Bible in English, French, and Spanish.

Second, and more important, the verse structure of the Bible does not respect the linguistic subdivisions chosen in earlier drafts of the CES, at least with regard to the encoding standard for primary data (cesDoc). At the level of the text division, the 〈div〉 element can be used for books and chapters. Below that level, the standard provides the paragraph, sentence, and token as basic elements. However, verses can consist of text above or below the sentence level, as in (1) and (2), respectively.

(1) <v id="GEN:1:31">And God saw every thing that he had made,
    and, behold, it was very good. And the evening and the
    morning were the sixth day. </v>

(2) <v id="GEN:10:13">And Mizraim begat Ludim, and Anamim, and
    Lehabim, and Naphtuhim, </v>

    <v id="GEN:10:14">And Pathrusim, and Casluhim,
    (out of whom came Philistim,) and Caphtorim. </v>

We considered using 〈p〉 (paragraph) for verse elements, resulting in a Level 1 CES-conformant encoding.¹⁸ However, identifying Bible verses with either 〈p〉 or 〈s〉 elements sacrifices standardization at the semantic level (CES Sec. 1.3.3), allowing those elements to represent something different in our corpus than the conventional CES denotation of these elements, namely paragraphs and orthographic sentences. In the case of 〈s〉, the CES itself differs subtly from the TEI standard. In CES the 〈s〉 tag is specifically intended to capture “s-units, or orthographic sentences,” and it is permitted to nest (CES Sec. 4.5.9.4, and N. Ide, p. c.). In TEI, 〈s〉 “contains a sentence-like division of a text” (TEI Guidelines Section 15.1, Linguistic Segment Categories) and does not nest.

Another option considered is identifying verses with the chunk element from the cesAna DTD for linguistic annotations: each verse would consist of a chunk comprising the series of tokens within that verse. This preserves adherence to the standard at the semantic level, but sacrifices the notion of verse as a meaningful structural element at the level of the primary encoding, shifting the burden to the linguistic level of encoding.

Ultimately, consultation on this topic with the coordinator of the CES has led to the addition of a 〈seg〉 tag to the CES standard. The TEI guidelines define 〈seg〉 as containing “any arbitrary phrase-level unit of text (including other 〈seg〉 units)”; CES now treats it as sufficiently general that it may include 〈p〉 and other paragraph-level elements as well as being contained by them. The type of a 〈seg〉 will be identified using a type attribute, so verses in biblical text can now be identified as 〈seg type="verse"〉. The book/chapter/verse distinctions needed for the corpus are thus accounted for in the markup, and there is flexibility for the future: at some later point, further annotation within 〈seg〉 can be added, to identify verse-internal sentence divisions, for example.
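
As a rough illustration of how the intermediate v elements of Section 2.1 might be rewritten as verse segments, consider the Python sketch below. It is an assumption-laden sketch rather than the project's converter: the function name to_seg is hypothetical, and carrying the id attribute over to the seg element is our assumption, since the text above specifies only the type attribute. The input is example (2) above.

```python
import re

# Matches an intermediate-format verse element such as <v id="GEN:10:13">...</v>.
_V_ELEMENT = re.compile(r'<v id="([^"]+)">(.*?)</v>', re.DOTALL)


def to_seg(text: str) -> str:
    """Rewrite each <v id="..."> element as <seg type="verse" id="...">."""
    # Assumption: the verse label is kept as an id attribute on the new element.
    return _V_ELEMENT.sub(r'<seg type="verse" id="\1">\2</seg>', text)


sample = '<v id="GEN:10:13">And Mizraim begat Ludim, and Anamim, and Lehabim, and Naphtuhim, </v>'
print(to_seg(sample))
```

A regular expression suffices here only because verse elements in the intermediate format do not nest; richer markup inside verses would call for a real SGML parser.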

The issue is more general than our application of the CES to annotating Bible text. For example, encoding speech data presents similar problems, since the basic structural element in conversational speech is the turn or utterance (e.g. see the Child Language Data Exchange System database (MacWhinney, 1991)).¹⁹ Like verses, turns may comprise material both below and above the level of the sentence. The TEI guidelines for encoding speech suggest the use of the 〈u〉 element to capture utterances, with 〈seg〉 used to “subdivide the divisions of a spoken text into units smaller than the individual utterance or turn” (TEI Guidelines, Sec. 11.3.1); the 〈div〉 element is used to “aggregate utterances or other parts of a transcript into units smaller than a complete text” (TEI Guidelines, Sec. 11.1.1).

2.3. INPUT FORMATS

Within a particular electronic version of the Bible, data formats are fairly consistent. Once low-level character set issues are dealt with – some pertaining to non-Latin character sets, and some involving the transition from a PC to Unix platform – the input formats seem to group according to a reasonably small set of dimensions. These include:

1. Line breaks: whether verses are implicitly delimited by appearing one per line,or broken across lines and delimited in another fashion.

THE BIBLE AS A PARALLEL CORPUS 141

2. File breaks: whether separate files are given for each book, for OT and NT, orthe entire text is in a single file.

3. Labels: whether book labels appear explicitly in a file or are implicit in the filename; whether verses are explicitly numbered and, if so, whether those labelsalso include chapter and verse (e.g. “1” vs. “1:1”).

4. Header information: whether files contain information regarding the edition,translation, etc.

5. Formatting codes: whether the documents are essentially in plain-text formator contain embedded formatting.

6. Information beyond book/chapter/verse: section headings, paragraph structure,footnotes, etc. Since our goal is to capture only the book/chapter/verse struc-ture common across different versions, these other forms of information areignored.

An on-line Swahili version of the New Testament, for example, illustrates embedded formatting, with separate marking for chapters and verses and a section heading that is discarded (Matthew 2:1–2):

    \c 2
    \s Wageni kutoka mashariki
    \p
    \v 1 Yesu alizaliwa mjini Bethlehemu, mkoani Yudea, wakati
    Herode alipokuwa mfalme. Punde tu baada ya kuzaliwa kwake,
    wataalamu wa nyota kutoka mashariki walifika Yerusalemu,
    \v 2 wakauliza, <<Yuko wapi mtoto, Mfalme wa Wayahudi,
    aliyezaliwa? Tumeiona nyota yake ilipotokea mashariki, tukaja
    kumwabudu.>>
    \p

A French version illustrates plain text with one verse per line and the name of the book being repeated with each chapter heading (Matthew 2:1–2):20

    Matthieu 2

    1. Jésus étant né à Bethléhem en Judée, au temps du
    roi Hérode, voici des mages d’Orient arrivèrent à
    Jérusalem,

    2. et dirent: Où est le roi des Juifs qui vient de naître?
    car nous avons vu son étoile en Orient, et nous sommes
    venus pour l’adorer.
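As an illustration of how little machinery a format like the French one requires, the following is a minimal sketch, in Python rather than the PERL used by the project's scripts, of a converter from that layout to the intermediate book/chapter/verse format. It assumes one verse per line in the source file, a chapter heading consisting of the book name followed by the chapter number, and an optional period after the verse number. The function name annotate and its parameters book_id and book_name are hypothetical, and escaping of markup characters is an addition of this sketch, not something the project's scripts are known to do.

```python
import re
from xml.sax.saxutils import escape

# Verse lines: a verse number, an optional period, then the verse text.
VERSE = re.compile(r"^(\d+)\.?\s+(.*\S)\s*$")

def annotate(lines, book_id, book_name):
    """Return intermediate-format lines for one book.

    book_id (e.g. "MAT") and book_name (e.g. "Matthieu") are supplied by
    the caller; both parameters are assumptions of this sketch.
    """
    heading = re.compile(r"^" + re.escape(book_name) + r"\s+(\d+)\s*$")
    out = ['<b id="%s">' % book_id]
    chapter = None
    for raw in lines:
        line = raw.strip()
        m = heading.match(line)
        if m:
            # Chapter heading such as "Matthieu 2": close the old
            # chapter element, if any, and open the new one.
            if chapter is not None:
                out.append("</c>")
            chapter = m.group(1)
            out.append('<c id="%s:%s">' % (book_id, chapter))
            continue
        m = VERSE.match(line)
        if m and chapter is not None:
            verse, text = m.groups()
            out.append('<v id="%s:%s:%s">%s</v>'
                       % (book_id, chapter, verse, escape(text)))
        # Anything else (blank lines, headers, footnotes) is ignored.
    if chapter is not None:
        out.append("</c>")
    out.append("</b>")
    return out
```

Checking for the chapter heading before the verse pattern matters for books whose names begin with a numeral, since a heading for such a book would otherwise be read as a verse line.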

The simple, uniform structure of the source text appears to greatly reduce the variation in document encoding for the on-line source documents. Minor variation within a version does occur – for example, in one version a period sometimes follows verse numbers and sometimes not – but this is easily handled. By organizing the annotated versions book by book, we also eliminate potential problems in reordering; for example, Hebrews is the 58th book in the English Bible and the 63rd in the German, with the relative order of every other book identical.

    # Input format: One verse per line, one book per file, verses
    # numbered as "chapter:verse", no header info, plain text input.

    while (<STDIN>)
    {
      # The trailing \r matches the carriage return of PC-style line endings
      if (($chapter, $verse, $line) = /^(\d+):(\d+)\s+(.*)\r$/)
      {
        # Possibly deal with boundary between chapters
        if ($chapter != $current_chapter)
        {
          # Close old chapter element if there is one
          if ($current_chapter != 0)
          { print "</c>\n"; }

          # Update current chapter, open new chapter element
          $current_chapter = $chapter;
          print "<c id=\"$book:$chapter\">\n";
        }

        # Print verse element
        print "<v id=\"$book:$chapter:$verse\">$line</v>\n";
      }
    }

Figure 6. Example of PERL code for intermediate-level annotation.

2.4. THE ANNOTATION PROCESS

This project could not have been undertaken if the input or target formats required any but the most minimal manual effort. Fortunately, a simple PERL script suffices for each of the versions we have looked at so far, written for one language and easily adapted for others. In order to quantify this we kept track of the coding time and found that the average annotation time for the first nine languages was 5.3 hours, 3.5 hours on average if we exclude the first two, which served as templates for the rest. In simple cases scripts have been written and executed in less than an hour and we are confident that a version in any given language should not require more than a few days’ effort. The simplicity of the program’s main loop should be apparent even for those not familiar with the PERL programming language (Figure 6).

Figure 7 illustrates representations of the book of Genesis through the entire process, including input representations coming from English and French on-line versions of the text, the results of our automatically-generated intermediate representation, and the projected target representation (excluding the alignment document, already illustrated in Figure 5).
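The step from the intermediate representation to the projected target can be sketched in a few lines. The following Python fragment is an illustration only, not the project's conversion code: it assumes the intermediate file is well-formed XML, maps 〈b〉 to a 〈div〉 of type part, 〈c〉 to a 〈div〉 of type chapter, and 〈v〉 to 〈seg type="verse"〉, and omits the cesdoc wrapper and header. The function name to_target_div is hypothetical.

```python
import xml.etree.ElementTree as ET

def to_target_div(book_xml):
    """Map one intermediate <b> element (given as a string) to the
    nested <div>/<seg> structure of the projected target format."""
    book = ET.fromstring(book_xml)
    part = ET.Element("div", {"id": book.get("id"), "type": "part"})
    for c in book.findall("c"):
        chapter = ET.SubElement(
            part, "div", {"id": c.get("id"), "type": "chapter"})
        for v in c.findall("v"):
            seg = ET.SubElement(
                chapter, "seg", {"type": "verse", "id": v.get("id")})
            seg.text = v.text  # verse text is carried over unchanged
    return part

if __name__ == "__main__":
    sample = ('<b id="GEN"><c id="GEN:1">'
              '<v id="GEN:1:1">In the beginning...</v></c></b>')
    print(ET.tostring(to_target_div(sample), encoding="unicode"))
```

Because the identifiers are copied over verbatim, the book/chapter/verse labels of the intermediate format remain usable as alignment keys in the target.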

2.5. STATUS

Up-to-date information about the project is maintained at 〈http://benjamin.umd.edu/parallel/〉. As of this writing, eight versions of the complete 66-book Bible (English, Chinese, Danish, French, Indonesian, Latin, Swedish, and Spanish) and four versions of the 27-book New Testament subset (Finnish, Greek, Swahili, and Vietnamese) have been annotated in our intermediate book/chapter/verse format. We have not yet completed conversion to the CES target format, but this process appears straightforward. We also note that, given files in the intermediate format, conversion to another format such as full TEI would be similarly straightforward, and might provide an opportunity to take advantage of tools written for that standard.

We hope to make rapid progress through the remainder of our inventory of electronic source versions. Currently our total set includes complete or partial Bibles in Arabic, Chinese, Danish, English, Finnish, French, German, Greek, Hungarian, Indonesian, Italian, Korean, Latin, Quechua, Swahili, Swedish, Tagalog, Turkish, Vietnamese, and Warlpiri.

We are in the process of reviewing the status of each Bible version with respect to restrictions on its redistribution. We are optimistic about the possibility of our making our annotated versions available, although it is necessary to be cautious about redistributing copyrighted materials. We note that even if we are entirely unable to redistribute our versions of the text, an alternative mechanism may exist for making available TEI-encoded versions of even commercial versions of the on-line Bible: release of our annotation scripts and instructions for their use, so that members of the community can acquire source versions themselves (via downloading of publicly available versions, or purchase of commercial versions) and create private annotated versions of the text for their own use using our software.

3. Research Uses for Aligned Bibles

3.1. TRANSLATION AND COMPARATIVE LINGUISTICS

An important part of the translation process is the comparison of previous translations, into the same or different languages. Miles Coverdale, translator of the first printed English Bible (in 1535) writes: “one translation declareth, openeth and illustrateth another, and . . . in many cases one is a plain commentary unto another” (quoted in the introduction to Vaughan, 1967). The KJV translators also stood firmly on the shoulders of the translation giants, including in their subtitle, “Translated out of the Original Tongues and with the Former Translations Diligently Compared and Revised” (KJV).

Genesis: English source HTML
    Header information...
    〈H2〉Genesis 1〈/H2〉
    〈DL COMPACT〉
    〈DT〉1〈DD〉In the beginning. . .
    . . .
    〈/DL〉
    Other information (footnotes, etc.)...

Genesis: French source text
    Header information...
    Genese 1
    1. Au commencement. . .
    2 La terre etait informe. . .
    . . .

Intermediate file GEN.EN
    〈b id="GEN"〉
    〈c id="GEN:1"〉
    〈v id="GEN:1:1"〉In the beginning. . .〈/v〉
    . . .
    〈/c〉
    〈/b〉

Intermediate file GEN.FR
    〈b id="GEN"〉
    〈c id="GEN:1"〉
    〈v id="GEN:1:1"〉Au commencement,. . .〈/v〉
    . . .
    〈/c〉
    〈/b〉

English target (all books combined)
    〈cesdoc . . .〉
    〈cesheader . . .〉
    . . .
    〈/cesheader〉
    〈text〉
    〈body lang=en id=Bible〉
    〈div id="GEN" type=part〉
    〈div id="GEN:1" type=chapter〉
    〈seg type="verse" id="GEN:1:1"〉In the beginning. . .〈/seg〉
    . . .
    〈/div〉
    〈/div〉
    . . .
    〈div id="REV" type=part〉
    〈div id="REV:1" type=chapter〉
    . . .
    〈seg type="verse" id="REV:22:21"〉The grace of the Lord. . .〈/seg〉
    〈/div〉
    〈/div〉
    〈/body〉
    〈/text〉
    〈/cesdoc〉

French target (all books combined)
    〈cesdoc . . .〉
    〈cesheader . . .〉
    . . .
    〈/cesheader〉
    〈text〉
    〈body lang=fr id=Bible〉
    〈div id="GEN" type=part〉
    〈div id="GEN:1" type=chapter〉
    〈seg type="verse" id="GEN:1:1"〉Au commencement. . .〈/seg〉
    . . .
    〈/div〉
    〈/div〉
    . . .
    〈div id="REV" type=part〉
    〈div id="REV:1" type=chapter〉
    . . .
    〈seg type="verse" id="REV:22:21"〉Que la grâce. . .〈/seg〉
    〈/div〉
    〈/div〉
    〈/body〉
    〈/text〉
    〈/cesdoc〉

Figure 7. Stages in the annotation process.

Having an aligned text of the type we describe facilitates research in the original languages of the Bible, as well as in comparative linguistics more generally. Particularly in languages with no living speakers, the subtleties of the text are revealed primarily through examination of a variety of translated forms by expert scholars. The work of Olsen (1997) illustrates this methodology. In that research, Olsen explores, among other things, the potential translations of certain New Testament Greek grammatical forms as a way of discovering the range of meanings conveyed by such forms. The resulting data were used for a theoretical separation between the semantic (uncancelable) and pragmatic (cancelable and variable) meaning, a distinction important for understanding and translating the language, as well as for lexicons in general, computational or otherwise.

3.2. RESOURCE ACQUISITION FOR NATURAL LANGUAGE PROCESSING

Parallel corpora are increasingly of interest in natural language processing, with applications in cross-language information retrieval (Hull and Oard, 1997), machine translation (e.g. Brown et al., 1990), approaches to word sense disambiguation (Brown et al., 1991), and in computational lexicography (Melamed, 1996b). However, corpora reliably aligned at the word or even sentence level are difficult to obtain even for commonly found language pairs, and for “low density” languages – those for which few resources exist – even more difficult to find.

The Bible is an interesting alternative to investigate: as discussed above, it can potentially yield a multi-way parallel corpus with representation from every language family, with the content carefully translated and nearly sentence-level alignment included. Although it is not large, parallel corpora of significantly smaller size have yielded useful results (Resnik and Melamed, 1997). Its content is more specialized than, say, contemporary newspaper text; however, it does cover a very wide range of linguistic phenomena and domains of world knowledge; for example, see the range of conceptual categories in the Louw-Nida thesaurus-like lexicon for the New Testament (Louw and Nida, 1989) and the results in Section 4.
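The sense in which alignment comes “included” can be made concrete: once two versions are in the intermediate format, verses that share a book:chapter:verse identifier are paired directly. The sketch below, in Python, is illustrative only; the function names verses and align are hypothetical, and it assumes well-formed intermediate files. It also reports identifiers present in only one version, since versions may differ in which verses they contain.

```python
import xml.etree.ElementTree as ET

def verses(book_xml):
    """Return {verse id: verse text} for one intermediate-format book."""
    root = ET.fromstring(book_xml)
    return {v.get("id"): (v.text or "") for v in root.iter("v")}

def align(book_a, book_b):
    """Pair verses that share an id, in the order of version A.

    Returns (pairs, ids only in A, ids only in B).
    """
    a, b = verses(book_a), verses(book_b)
    pairs = [(vid, a[vid], b[vid]) for vid in a if vid in b]
    only_a = [vid for vid in a if vid not in b]
    only_b = [vid for vid in b if vid not in a]
    return pairs, only_a, only_b

if __name__ == "__main__":
    en = ('<b id="GEN"><c id="GEN:1">'
          '<v id="GEN:1:1">In the beginning...</v></c></b>')
    fr = ('<b id="GEN"><c id="GEN:1">'
          '<v id="GEN:1:1">Au commencement...</v></c></b>')
    for vid, text_en, text_fr in align(en, fr)[0]:
        print(vid, "|", text_en, "|", text_fr)
```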

We are also investigating parallel versions of the Bible as a possible resource for bootstrapping natural language resources, especially for work in machine translation and cross-language information retrieval. Preliminary work shows promise. For example, in order to assess whether it might be possible to use the Bible corpus as a starting point for cross-language retrieval work involving the Indonesian national language (Bahasa Indonesia), we located an on-line Indonesian version of the Bible on the Web, annotated it according to the intermediate-level format, aligned it with a modern English version (NIV), performed a simple statistical analysis of the parallel corpus based on the log-likelihood ratio (Dunning, 1993), utilized the resulting statistics to extract an initial word-to-word translation lexicon, and finally used that translation lexicon to demonstrate simple cross-language document retrieval by supplying English queries to locate Indonesian documents – all in the space of less than a week. Naturally it is important to progress from anecdotal results to systematic investigation with rigorous evaluation; our longer-range plans include applying the techniques described by Resnik and Melamed (1997) for extracting and assessing word correspondences from parallel corpora, and then using techniques for identifying multi-word units (Melamed, 1997).
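To make the statistical step concrete, the following Python sketch computes the log-likelihood ratio statistic (G², as in Dunning, 1993) for word pairs that co-occur in aligned verses. It is an illustration under stated assumptions, not a description of the analysis actually run: the choice of the verse pair as the co-occurrence unit, the counting of each word at most once per verse, and the restriction to pairs that co-occur more often than chance would predict are all assumptions of this sketch, and the function names g2 and score_pairs are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def g2(k11, k12, k21, k22):
    """Log-likelihood ratio statistic (G^2) for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    total = 0.0
    cells = zip(product((0, 1), repeat=2), (k11, k12, k21, k22))
    for (i, j), observed in cells:
        if observed > 0:
            expected = rows[i] * cols[j] / n
            total += observed * math.log(observed / expected)
    return 2.0 * total

def score_pairs(verse_pairs):
    """Score (source word, target word) pairs over aligned verses.

    verse_pairs is a list of (source tokens, target tokens) tuples,
    one per aligned verse.
    """
    n = len(verse_pairs)
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in verse_pairs:
        s, t = set(src), set(tgt)          # count once per verse
        src_count.update(s)
        tgt_count.update(t)
        joint.update(product(s, t))
    scores = {}
    for (e, f), k11 in joint.items():
        # Keep only pairs seen together more often than expected by chance.
        if k11 * n > src_count[e] * tgt_count[f]:
            k12 = src_count[e] - k11       # e without f
            k21 = tgt_count[f] - k11       # f without e
            k22 = n - k11 - k12 - k21      # neither
            scores[(e, f)] = g2(k11, k12, k21, k22)
    return scores
```

Ranking the scored pairs and keeping the best candidate for each source word would yield a crude word-to-word lexicon of the kind described; more careful selection methods are the subject of the work cited above.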

Melamed (1998) reports additional preliminary results on the Bible data. With our assistance, he used the English (NIV)/French (LSG) pairing, in the intermediate format, to manually create a benchmark annotation of “translational equivalence” at the level of words and phrases for 250 pairs of verses. As an example, if the English phrase Jacob ’s ladder were presented alongside French la échelle de Jacob (with ’s and la automatically identified as separate tokens), the annotation would indicate links for ladder with the phrase la échelle, de with ’s, and Jacob with Jacob. In his project he hired bilingual speakers to create these benchmark links using software developed in collaboration with this project. Obviously “translational equivalence” is a complicated notion; part of the process included development of a detailed style guide for annotators and the application of rigorous statistical methods in order to evaluate their level of consistency and agreement.21 Melamed uses the annotation set to formally evaluate statistical methods in lexical acquisition, and suggests its possible use in other areas such as modeling and evaluation for machine translation.

In future work we intend to compare translation lexicons automatically acquired from the Bible corpus to benchmark sets like the one created by Melamed and to bilingual lexicons acquired by other means. For example, we have available to us Spanish-English and Arabic-English lexicons, with Korean-English and Chinese-English in process, and other projects planned (Dorr, 1997). We will investigate specifically (i) the extent to which the biblically-based lexicons could be considered “core” or “seed” lexicons, and (ii) what would be needed (in terms of coverage and resources) to scale up the biblically-based lexicons for use in interlingual machine translation (Dorr, 1994; Dorr and Olsen, 1997). Partial answers to these questions follow.

4. Coverage Results for Present-Day Language

The first question to ask when investigating the Bible as a linguistic resource is: to what extent is the text relevant for research on present day language? After all, research on present day language is unlikely to find itself concerned with centurions, and “making sacrifices” means something rather different now than it did in Abraham’s time. On the other hand, biblical texts are concerned with a wide range of human activities, emotions, and concerns that are as present in modern life as they were several thousand years ago. Even if the corpus does not in the least resemble a complete database, how useful might it be as a resource, particularly for bootstrapping linguistic knowledge when other resources are unavailable? We investigated this question by conducting two comparisons, the first using the Longman Dictionary of Contemporary English (LDOCE) (Proctor, 1978), and the second with the Brown Corpus (Kucera and Francis, 1967).

LDOCE, unlike most dictionaries, uses a control vocabulary. Approximately 2200 open-class words, closed-class words, and affixes were carefully selected in advance to serve as the definitional basis for the English language: every definition is written using only items from that set. The LDOCE control vocabulary can be thought of as a learner’s vocabulary for English, comparable to learner’s wordlists for other languages (cf. Beheydt and Wieers, 1991; Buchanan, 1927; Kleijn and Nieuwborg, 1983; Sciarone, 1977; Vander Beke, 1929). We investigated the extent to which the modern English subset of our Bible corpus, the New International Version (NIV), covers the LDOCE control vocabulary.

In order to accomplish this comparison, we needed two lists: the LDOCE control vocabulary and the list of words found in the NIV. Since the NIV list contained word forms rather than lemmata (root forms), we automatically reduced all words on that list to lemmata using a simple algorithm relying on a small set of morphological rules and the lexical databases from WordNet (Miller, 1990). Words on the LDOCE list were not lemmatized – for example, feelings appears on the list but feeling does not – however, identical words in multiple part of speech categories, such as bite (noun and verb) and blind (adjective and verb) were treated as single items. This led to a list of 2212 items in the LDOCE vocabulary and 11151 items in the Bible corpus vocabulary. A fully automated intersection of the two lists contained 1727 items, or 78.1% of the LDOCE control vocabulary. A random sample of 100 items from this intersection includes the following:

ability, anxious, anyone, art, ashamed, at, baby, behave, below, birth, bit, bite, black, blame, bone, bread, broad, brother, build, calm, cheek, circular, clay, cliff, cloth, contain, continue, control, course, cry, cure, damage, declare, ditch, dog, educate, elect, fear, finish, fit, fond, frame, hang, heart, indeed, insect, its, land, lid, look, lose, love, memory, mirror, model, moral, narrow, neither, opinion, palace, particular, pool, presence, press, price, pronounce, pull, religious, rent, ring, seem, sew, shadow, side, single, snow, soap, square, stand, steady, strength, strike, sword, take, thick, thin, throw, tonight, track, troublesome, trust, undo, vote, waist, weave, west, wheel, wine, within, worst

A manual analysis of the 485 “missed” items, illustrated below, shows that the coverage estimate of 78.1% is overly pessimistic: if only the “truly missing” items below are counted as misses, the coverage figure is 85%.

• Items truly missing (331). Mostly words pertaining to modern artifacts and concepts (atom, bicycle, bullet, clock, electric), political and religious groups (British, Buddhist, Christianity), names of measurements (cent, dollar, pence), and units of time (century, Friday, January). Other misses include everyday vocabulary just not found in Biblical text (berry, box, cat, curl, duck).

• Different spelling/missed owing to tokenization or lemmatization (71). In order to identify words in the Bible corpus, a simple tokenizer was used, for example to separate father’s into father plus ’s, and words were lemmatized as described above. However, this analysis was not perfect and led to missing some items that are in fact on the LDOCE list, such as no one and according to (not tokenized in the Bible corpus as single items), hardly (not present as an adverb in the Bible, although hard and -ly are attested), and feelings (included in the LDOCE vocabulary although the singular feeling is not). Spelling differences were also found, since LDOCE uses British spellings and the NIV uses American spellings (theatre/theater, practise/practice, neighbour/neighbor).

• Affixes with examples in Bible text (52). The LDOCE control vocabulary includes, as separate items, 54 prefixes and suffixes. The majority of these are in fact attested in the Bible corpus list; examples include -able (abominable, acceptable, accountable, admirable), -ful (useful, watchful, willful, youthful), -like (flowerlike, ghostlike).

• Words having morphologically related words that are attested (31). For example, although adventure does not appear in the Bible corpus, adventurers does.
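The arithmetic behind these figures is simple enough to re-derive. The Python sketch below shows the set intersection underlying the automatic comparison and recomputes the two percentages from the counts reported above; the function name coverage and the variable names are hypothetical, and the word lists themselves are not reproduced here.

```python
def coverage(target_vocab, corpus_vocab):
    """Return (covered items, fraction of target_vocab found in corpus_vocab)."""
    target, corpus = set(target_vocab), set(corpus_vocab)
    covered = target & corpus
    return covered, len(covered) / len(target)

# Counts taken from the text above.
LDOCE_SIZE = 2212
AUTOMATIC_MATCHES = 1727
MISSED = {
    "truly missing": 331,
    "spelling/tokenization/lemmatization": 71,
    "affixes attested in Bible text": 52,
    "morphologically related word attested": 31,
}

# Fully automatic estimate: 1727 of 2212 items.
automatic = AUTOMATIC_MATCHES / LDOCE_SIZE
# Adjusted estimate: only the "truly missing" items count as misses.
adjusted = (LDOCE_SIZE - MISSED["truly missing"]) / LDOCE_SIZE

if __name__ == "__main__":
    print("missed items: %d" % sum(MISSED.values()))
    print("automatic coverage: %.1f%%" % (100 * automatic))
    print("adjusted coverage:  %.1f%%" % (100 * adjusted))
```

The four categories of missed items sum to the 485 items outside the automatic intersection, which is a useful consistency check on the reported breakdown.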

Naturally the exact coverage figure depends upon the intended use of the corpus. For example, morphologically related words in the text, even if not exact matches, might well be useful to a lexicographer. And, of course, coverage of word forms or even lemmata does not guarantee that the overlap pertains to the relevant senses of the words under consideration. Idioms, concepts expressed as phrases, and cross-language divergences in linguistic expression (Dorr, 1994) complicate the picture still further. Still, we find this simple comparison with the LDOCE control vocabulary to be a surprising confirmation of the Bible’s relevance as a corpus for investigating modern language: the corpus contains at least one instance of use for some 80% of the most “useful” words in modern English. This coverage, combined with verse-level parallelism across languages, suggests that the corpus can in fact be a useful resource for investigating cross-linguistic lexical relationships in present day languages.

In a second comparison, we considered the coverage of the Bible corpus with respect to frequently used words, rather than words that are useful in dictionary definitions. We conducted an automatic comparison substantially similar to the one above, looking at the most frequent lemmata in the Brown Corpus of American English (Kucera and Francis, 1967), widely used as a source of word frequency information in English. Table II shows the coverage figures of the NIV vocabulary for the most frequent 500, 1000, 1500, 2000, and 4000 lemmata in the Brown corpus.

Table II. NIV coverage of most-frequent Brown corpus vocabulary

    Brown Top N    NIV Vocabulary Coverage (Percentage)
    500            454 (91%)
    1000           849 (85%)
    1500           1203 (80%)
    2000           1508 (75%)
    4000           2512 (63%)

The degree of overlap is encouraging both for the most frequent 500 words in the Brown corpus – of which approximately 20% are closed class items – and for larger n. A random sample of 100 items from the intersection of NIV vocabulary with the 2000 most frequent words in the Brown Corpus includes the following:

accompany, achievement, address, afternoon, argue, arrive, board, breakfast, brief, building, but, buy, call, choose, climb, collect, come, conclude, determine, director, dream, dry, encourage, everyone, exchange, extent, family, fast, fat, favor, force, future, happen, here, hide, his, husband, impression, increase, influence, insist, intend, iron, lady, lift, make, moment, narrow, nine, observation, occasion, officer, opportunity, oppose, participate, perform, philosophy, plan, play, please, pleasure, public, reflect, relation, relative, remove, reply, requirement, road, root, satisfy, search, select, sheet, silence, simple, single, sleep, snake, someone, space, spirit, spread, staff, straight, strip, survive, sweet, test, tragedy, trial, watch, wave, west, wheel, wind, wine, work, yard, yield

The most frequent 2000 words (types) in the Brown corpus account for 84% of the word instances (tokens); the percentage is higher if numbers and proper names are not considered as words to be covered. That is, if one were to select as a vocabulary of interest the most frequent 2000 words in the Brown Corpus, covering nearly five sixths of the corpus’s 1 million words of text, fully 75% of the words in that vocabulary could be found in the NIV. The most frequent 4000 words in the Brown corpus cover nine tenths of the word instances, and 62.8% of those 4000 words are found in NIV.
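The two quantities in play here, the share of tokens accounted for by the top N types and the share of those types found in the Bible vocabulary, are easy to confuse. The following Python sketch computes both for an arbitrary token list; it is an illustration only, with a hypothetical function name, and it assumes a non-empty token list that has already been lemmatized.

```python
from collections import Counter

def top_n_report(reference_tokens, corpus_vocab, n):
    """For the n most frequent types in reference_tokens, return
    (share of reference tokens they account for,
     share of those types that occur in corpus_vocab)."""
    counts = Counter(reference_tokens)
    top = [word for word, _ in counts.most_common(n)]
    token_share = sum(counts[w] for w in top) / sum(counts.values())
    type_coverage = sum(1 for w in top if w in corpus_vocab) / len(top)
    return token_share, type_coverage

if __name__ == "__main__":
    # Toy data standing in for a reference corpus and a Bible vocabulary.
    tokens = ["the"] * 5 + ["of"] * 3 + ["atom"] * 2
    vocab = {"the", "of"}
    print(top_n_report(tokens, vocab, 2))
    print(top_n_report(tokens, vocab, 3))
```

In the terms of the paragraph above, the first number corresponds to the 84% figure for N = 2000 and the second to the 75% figure.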

These coverage figures seem particularly noteworthy given that the Brown Corpus was created to span many genres, including such decidedly non-biblical material as sports and science fiction. In order to assess the relationship of coverage to genre more carefully, we conducted the coverage comparison separately for each of the 15 genres in the Brown corpus. The results, in Table III, are strikingly intuitive: the vocabulary of the Bible is least comprehensive for categories that focus on science, technology, and current events, and most comprehensive for categories that are more aligned to universal concerns of everyday life. Interestingly, coverage reaches about two thirds of the most frequent 2000 words even for the least well covered of the categories.

Table III. NIV coverage of Brown corpus vocabulary: most frequent 2000 words in each genre

    Genre                                          NIV Vocabulary Coverage    (Percentage)
    K  FICTION: GENERAL                            1563                       78.2%
    N  FICTION: ADVENTURE                          1540                       77.0%
    G  BELLES-LETTRES                              1504                       75.2%
    P  FICTION: ROMANCE                            1501                       75.1%
    F  POPULAR LORE                                1496                       74.8%
    L  FICTION: MYSTERY                            1453                       72.7%
    D  RELIGION                                    1450                       72.5%
    E  SKILLS AND HOBBIES                          1412                       70.6%
    B  PRESS: EDITORIAL                            1408                       70.4%
    R  HUMOR                                       1346                       67.3%
    J  LEARNED                                     1332                       66.6%
    A  PRESS: REPORTAGE                            1332                       66.6%
    C  PRESS: REVIEWS                              1331                       66.6%
    M  FICTION: SCIENCE                            1321                       66.1%
    H  MISCELLANEOUS: GOVERNMENT & HOUSE ORGANS    1316                       65.8%

This comparison of the NIV vocabulary with the Brown Corpus shares the same pitfalls as the comparison with the LDOCE control vocabulary: coverage of lemmata represents only one level of analysis, and other relevant considerations such as phrasal units and word senses have yet to be explored. However, like the LDOCE analysis, the comparison with frequent vocabulary in the Brown Corpus confirms that biblical text, far from being concerned only with archaic topics, is a relevant resource for research on the lexicon in present day language.22

5. Conclusion

We have reported on a project to annotate biblical texts in order to create an aligned multilingual Bible corpus for research purposes. At present, we have implemented a standard intermediate-level annotation, delimiting book, chapter, and verse, for a growing collection of languages. The availability of on-line versions of the Bible leads us to be optimistic about the prospect of creating a resource that covers a wide variety of languages and will be valuable to specialists in translation, linguistics, and the computational analysis of language.

Analyses using the LDOCE control vocabulary and the Brown Corpus of American English show that the Bible covers a wide range of modern-day vocabulary. This coverage, and the collection’s remarkable qualities as a cross-language database, make the Bible-as-multilingual-corpus a unique resource for linguistic research.


Acknowledgments

This paper is a revised and extended version of a paper presented at the TEI-10 conference in Providence, Rhode Island. The work was supported, in part, by Department of Defense contract MDA90496C1250, DARPA/ITO Contract N66001-97-C-8540, Army Research Laboratory contract DAAL03-91-C-0034 through Battelle, and by Sun Microsystems Laboratories. We are grateful to Stephen Clark, Dan Melamed, participants at the TEI-10 conference and five anonymous TEI-10 and CHum reviewers for helpful comments relating to this work.

Notes

1 See (Alexander and Alexander, 1980; Connolly, 1996). Subdivisions of the prophets are from Freeman (1968). “Minor” prophets were first designated as such by the Latin church in the time of Augustine and Jerome based on length of writing. The book of Daniel is sometimes in the “writings” section of the Hebrew canon, and the books of Joshua, Judges, 1 and 2 Samuel, 1 and 2 Kings are sometimes called the “former prophets.”

2 〈http://www.lib.cam.ac.uk/Handbook/Guide15.html#tag1, http://www.biblesociety.org/trans-gr.htm〉

3 The URLs in this paper reflect sites available at the time of the last revision, October 1998. Sites change rapidly and references should not be cited without checking accuracy: at least one site in our paper changed locations twice in the reviewing period.

4 〈http://www.biblesociety.org/translat.htm〉

5 〈http://www.bibleworks.com/contacts.htm, http://www.omroep.nl/eo/bible/〉

6 〈http://www.biblesociety.org/translat.htm〉

7 〈http://www.scriptureresources.org〉

8 〈http://www.americanbible.org/LOCALNAGIM.nsf/Pages/Home〉

9 〈http://bible.gospelcom.net/bible/〉

10 Note that the French source in this case is encoded using ISO-8859-1 (Latin-1) rather than using SGML entity references such as &eacute; for character ‘e’ with an acute accent.

11 Special issues arise in automatically creating translation lexicons that include non-compositional pairs; see, e.g. Melamed (1997) and Section 3.2.

12 〈http://www-rali.iro.umontreal.ca/TransSearch/TS-simple-uen.cgi〉

13 In the final version, accented characters will be encoded according to SGML conventions; here they are left in a simpler form for readability.

14 〈http://www.cup.cam.ac.uk/journals/nts/ntsifc.htm〉

15 It may be possible to identify some elements automatically using, e.g., the name and place list for the New Testament provided in the Louw-Nida lexicon (Louw and Nida, 1989). For another possible use of that lexicon with the biblical text, see Section 3.2.

16 〈http://www.cs.vassar.edu/CES/CES1-5.html#ToCalign〉

17 〈http://nl.ijs.si/ME〉, 〈http://nl.ijs.si/ME/CD/docs/1984.html〉

18 〈http://www.cs.vassar.edu/CES/CES1-4.5.html〉

19 〈http://childes.psy.cmu.edu/〉

20 For ease of presentation, we show the French verses broken across lines. In the source text, however, there is only a space – no line break – between du and roi, à and Jérusalem, etc.

21 Some might well argue that the notion of translational equivalence is not just complicated but undefinable; for interesting discussion see (Hofstadter, 1997).

22 Dan Melamed (p.c.) suggests that the coverage figures in this section are best evaluated in comparison to how well the modern-day resources cover each other. The full Brown Corpus vocabulary list includes 2094 of the 2212 items on the LDOCE list, or 94.6%. The LDOCE control vocabulary covers 1233 of the most frequent 2000 items in the Brown corpus, or 61.7%.

References

Alexander, D. and P. Alexander.Eerdman’s Concise Bible Handbook. William B. Eerdman’s, 1980.Bassnet-McGuire, S.Translation Studies. New York: Methuen, 1980.Beekman, J. and J. Callow.Translating the Word of God. Zondervan, Grand Rapids, MI, 1974.Beheydt, L. and T. Wieers.Elementair woordenboek Nederlands, 1991.Blight, R. C. Translation Problems from A to Z. Summer Institute of Linguistics, 1992.Brown, P., J. Cocke, S. D. Pietra, V. D. Pietra, F. Jelinek, R. Mercer, and P. Roossin. “A Statistical

Approach to Machine Translation”.Computational Linguistics, 16(2) (1990), 79–85.Brown, P., S. Della Pietra, V. Della Pietra, and R. Mercer. “A Statistical Approach to Sense Dis-

ambiguation in Machine Translation”. InFourth DARPA Workshop on Speech and NaturalLanguage. Pacific Grove, CA, February, 1991.

Buchanan, M. A.A Graded Spanish Word Book. Toronto, 1927.Church, Kenneth W. and Robert Mercer. “Introduction to the Special Issue on Computational

Linguistics Using Large Corpora”.Computational Linguistics, 19(1) (1993), 1–24.Connolly, K. The Indestructible Book. Baker, Grand Rapids, MI, 1996.Deibler, E.An Index of Implicit Information in the Gospels. Summer Institute of Linguistics, 1993.deWaard, J. and W. A. Smalley.A Translator’s Handbook to the Book of Amos. United Bible Society,

1979.Dorr, B. J. “Machine Translation Divergences: A Formal Description and Proposed Solution”.

Computational Linguistics, 20(4) (1994), 597–633.Dorr, B. J. “Large-scale Dictionary Construction for Foreign Language Tutoring and Interlingual

Machine Translation”.Machine Translation, 12(4) (1994), 271–322.Dorr, B. J. and M. B. Olsen. “Deriving Verbal and Compositional Lexical Aspect for NLP Appli-

cations”. In Proceedings of the 35th Annual Meeting of the Association for ComputationalLinguistics (ACL-97). Madrid, Spain, July 7–12, 1997, pp. 151–158.

Dunning, T. “Accurate Methods for the Statistics of Surprise and Coincidence”.ComputationalLinguistics, 19(1) (1993), 61–74, March.

Freeman, H. E.An Introduction to the Old Testament Prophets. Chicago: Moody Press, 1968.GNB. Good News Bible: The Bible in Today’s English Version. 1976.Hofstadter, D. R.Le Ton Beau De Marot : In Praise of the Music of Language. Basic Books, 1997.Hull, D. A. and D. W. Oard. Symposium on Cross-Language Text and Speech Retrieval. Technical

Report SS-97-04, American Association for Artificial Intelligence, Menlo Park, CA, March,1997.

Ide, N. Corpus Encoding Standard: Document CES 1, version 1.4, October.http://www.cs.vassar.edu/CES/, 1996.

KJV. The Holy Bible, Authorized King James Version.Kleijn, P. de and E. Nieuwborg.Basiswoordenboek Nederlands. Leuven, 1983.Kucera, H. and W. Francis.Computational Analysis of Present-day American English. Brown

University Press: Providence, R.I., 1967.LeBlanc, D. Hands Off My NIV! InChristianity Today, June, 1997.Louw, J. P. and E. A. Nida. Greek-English Lexicon of the New Testament Based on Semantic

Domains, 2nd edition. New York: United Bible Societies, 1989.MacWhinney, B.The CHILDES Project: Tools for Analyzing Talk. Erlbaum, 1991.

THE BIBLE AS A PARALLEL CORPUS 153

Melamed, I. D. “A Geometric Approach to Mapping Bitext Correspondence”. InConference onEmpirical Methods in Natural Language Processing. Philadelphia, Pennsylvania, 1996a.

Melamed, I. D. “Automatic Construction of Clean Broad-coverage Translation Lexicons”. In Proceedings of the 2nd Conference of the Association for Machine Translation in the Americas. Montreal, Canada, 1996b.

Melamed, I. D. “Automatic Detection of Omissions in Translations”. In Proceedings of the 16th Annual Conference on Computational Linguistics (COLING-96). Copenhagen, 1996c.

Melamed, I. D. “Automatic Discovery of Non-compositional Compounds in Parallel Data”. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP-97). Brown University, August, 1997.

Melamed, I. D. Manual Annotation of Translational Equivalence: The Blinker Project. Technical Report 98-07, University of Pennsylvania, 1998.

Miller, G. “WordNet: An On-line Lexical Database”. International Journal of Lexicography, 3(4) (1990) (Special Issue).

Moore, B. R. Doublets in the New Testament. Summer Institute of Linguistics, 1993.

Nida, E. A. Towards a Science of Translating. E. J. Brill, Leiden, 1964.

Nida, E. A. and C. R. Taber. The Theory and Practice of Translation. E. J. Brill, Leiden, 1969.

Niebuhr, G. “Mass Marketing Makes Nonbiblical Texts Readily Accessible”. In New York Times, December, 1997.

Olsen, M. B. A Semantic and Pragmatic Model of Lexical and Grammatical Aspect. New York: Garland, 1997.

Porter, S. “Verbal Aspect in the Greek of the New Testament, with Reference to Tense and Mood”. In Studies in Biblical Greek, Vol. 1. Ed. D. A. Carson, New York: Peter Lang, 1989.

Proctor, P., ed. Longman Dictionary of Contemporary English (LDOCE). Longman Group, 1978.

Resnik, P. and I. D. Melamed. “Semi-automatic acquisition of domain-specific translation lexicons”. In Fifth Conference on Applied Natural Language Processing. Washington, D.C., 1997.

Robinson, D. The Translator’s Turn. Baltimore and London: The Johns Hopkins University Press, 1991.

Sciarone, A. G. Vocabolario fondamentale delle lingua italiana. Minerva Italica, Bergamo, 1977.

Vander Beke, G. E. French Word List. New York: Macmillan, 1929.

Vaughan, C. The New Testament from 26 Translations. Zondervan, Grand Rapids, MI, 1967.

Véronis, J. Multext Home Page: Document MUL1, version 0.1, April. http://www.lpl.univ-aix.fr/projects/multext/, 1996.

Weigelt, M. A. “Textual Criticism of the Bible”. In Baker Encyclopedia of the Bible, Vol. I A-I. Ed. W. A. Elwall, Grand Rapids, MI, 1988.