
Journal of the Text Encoding Initiative 

Issue 3 | November 2012
TEI and Linguistics
Piotr Bański, Eleonora Litta Modignani Picozzi and Andreas Witt (dir.)

Electronic version
URL: http://journals.openedition.org/jtei/475
DOI: 10.4000/jtei.475
ISSN: 2162-5603

Publisher
TEI Consortium

Electronic reference
Piotr Bański, Eleonora Litta Modignani Picozzi and Andreas Witt (dir.), Journal of the Text Encoding Initiative, Issue 3 | November 2012, « TEI and Linguistics » [Online], Online since 05 November 2012, connection on 13 April 2020. URL : http://journals.openedition.org/jtei/475 ; DOI : https://doi.org/10.4000/jtei.475

This text was automatically generated on 13 April 2020.

TABLE OF CONTENTS

Editorial Introduction to the Third Issue
Piotr Bański, Eleonora Litta Modignani Picozzi and Andreas Witt

The TEI and Current Standards for Structuring Linguistic Data: An Overview
Maik Stührenberg

A TEI P5 Document Grammar for the IDS Text Model
Harald Lüngen and C. M. Sperberg-McQueen

Creating Lexical Resources in TEI P5: A Schema for Multi-purpose Digital Dictionaries
Gerhard Budin, Stefan Majewski and Karlheinz Mörth

Consistent Modeling of Heterogeneous Lexical Structures
Laurent Romary and Werner Wegstein

A TEI Schema for the Representation of Computer-mediated Communication
Michael Beißwenger, Maria Ermakova, Alexander Geyken, Lothar Lemnitzer and Angelika Storrer

Building and Maintaining the TEI LingSIG Bibliography: Using Open Source Tools for an Open Content Initiative
Piotr Bański, Stefan Majewski, Maik Stührenberg and Antonina Werthmann

Editorial Introduction to the Third Issue
Piotr Bański, Eleonora Litta Modignani Picozzi and Andreas Witt

1 Linguistics had a strong presence at the TEI’s beginnings, being represented by names

as significant as those of Nancy Ide, Donald E. Walker, and Antonio Zampolli.

Linguistics was mentioned explicitly in the names of two of its three founding

organizations: Association for Computers and the Humanities, Association for

Computational Linguistics, and Association for Literary and Linguistic Computing. It

was the main focus of one of the four initial committees (http://www.tei-c.org/Vault/

AB/abj01.txt) and, within several years of the inception of the work on the TEI

Guidelines, the British National Corpus clearly demonstrated the TEI’s usefulness for

encoding language resources.

2 While the TEI proved successful in annotating basic grammatical information in an in-

line fashion, by the time the BNC was compiled there was a rapid development in

corpus studies, directed not only at the volume of primary data but also at annotations

that gradually began to provide information beyond part-of-speech categorization and

lemmatization. Architectures were needed which would provide simple and fast

deployment, describing exactly the information that was needed without the overhead

of extra markup and using a flatter metadata structure. This is how specifications such

as CES (primarily for morphosyntactic and alignment annotation) and TigerXML (for

syntactic annotation, both hierarchical and relational) were developed and began to be

adopted by the linguistic community.

3 The early 2000s saw rapid development of language resources encoded in CES (Corpus

Encoding Standard) developed from TEI P3, and then XCES, as well as TigerXML, both

of which exceeded the TEI in their popularity within tightly focused linguistic circles. It

should also be pointed out that, while the robust stand-off mechanisms of the TEI are

still being refined, CES and then XCES provided basic reference mechanisms which

proved extremely popular among corpus creators. Similar is the case of feature

structure markup: while the ISO/TEI feature structure schema offers numerous ways to

encode linguistic information (Witt et al. 2009), in the absence of feature structure

validation mechanisms, corpus builders adopted much simpler solutions.

4 This state of affairs has gradually been changing: TEI P5, as a mature XML-based toolkit

that supports all the newest XML technologies, once again could be an important

player in the market of annotation standards (in the case of the TEI, a more precise

phrase could be annotation standard toolkits) and has recently been applied to encode

major linguistic enterprises such as the National Corpus of Polish, with its impressive

stand-off architecture featuring a number of separate annotation layers

(Przepiórkowski and Bański 2010).

5 The TEI special interest group for linguists (LingSIG), founded in 2010, has as its aim

making the TEI even more competitive in the area of linguistic annotation frameworks,

while maintaining close connections with the work performed at ISO TC37 SC4, the ISO

committee devoted to the management of language resources.

6 At the time of writing, the SIG has met twice (at the TEI conferences in Zadar and in

Würzburg) where a series of micropresentations were offered on various topics

connecting the TEI and linguistics. It was also from the participants of the Würzburg

meeting that most submissions for the present issue were received.

7 This issue begins with an overview of the current annotation standards landscape. “The

TEI and Current Standards for Structuring Linguistic Data: An Overview,” by Maik

Stührenberg, provides a remarkable summary of the most recent efforts to create

international standards for the annotation of linguistic corpora, developed by the ISO

technical committee for Terminology and other Language and Content Resources (ISO/

TC 37). This article opens a window onto the world of standards creation, detailing the

steps necessary for a set of protocols to become a standard, contrasting that with

community discussion-based specifications such as the TEI Guidelines, and showing

how the latter have been influential in the creation of de facto standards.

8 The second paper, “A TEI P5 Document Grammar for the IDS Text Model,” by corpus

linguistics specialist Harald Lüngen and a veteran of both TEI and XML, C.M. Sperberg-

McQueen, presents the process of making the legacy data of DeReKo (Deutsches

Referenzkorpus, the largest archive of German written text, collected at the IDS

Mannheim since 1964) compatible with the current version of TEI P5. The paper

describes the steps taken to encode the corpus since the early 1990s through a detailed

analysis of the way the IDS text model evolved, culminating in the preparation of

an ODD file which, in turn, documents the model.

9 Gerhard Budin, Stefan Majewski, and Karlheinz Mörth write about a similar effort in

the area of dictionary encoding. Their paper describes the work of the Institute for

Corpus Linguistics and Text Technology (ICLTT) of the Austrian Academy of Sciences in

a number of projects involving both the digitisation of print dictionaries and the

creation of new born-digital lexicographical data. The article explores how even within

the restrictions imposed by the TEI dictionary module, an attentive customisation with

an eye to interoperability with other standards and digital NLP tools makes TEI P5 a

model that can be applied over a variety of digitisation projects. The article touches on

issues of hierarchies, polyfunctionality of certain elements in the dictionary module,

word-class information, and interoperability of the markup schema with other digital

frameworks. The authors present the project’s experience in encoding

morphosyntactic information, linguistic varieties and writing systems, etymology,

semantics, and specific production metadata, ultimately proving the value of the

customised TEI P5 dictionary module both in the representation of digital dictionaries

and the potential for use in NLP related applications.

10 “Consistent Modeling of Heterogeneous Lexical Structures” by Laurent Romary and

Werner Wegstein highlights issues concerning the interoperability of a variety of data

sources in lexical data modelling. This article starts by underlining the difficulties

arising from building ad hoc data models from the TEI Guidelines’ Dictionaries chapter,

which inevitably leads to poor accessibility. The authors focus on lexical structures and

propose a more generic methodology based on the concept of crystals, the smallest

units in a construct that can help divide a document into regular chunks of information

that can be processed more easily by external tools.

11 Michael Beißwenger, Maria Ermakova, Alexander Geyken, Lothar Lemnitzer, and

Angelika Storrer present a novel application of TEI P5 in the description of computer-

mediated communication. Their paper, “A TEI Schema for the Representation of

Computer-mediated Communication,” introduces an XML schema which provides a

structure for the encoding of the structural units of communication in not only forums,

blogs, and bulletin boards but also instant messaging, wikis, and Twitter feeds, as well

as the annotation of these units. The paper offers an interesting view on the processing

of a new literary genre characterised by precise interaction features such as emoticons,

interaction words, acronyms, and so on, and on the need for TEI P5 to cater for such

forms of text.

12 Piotr Bański, Stefan Majewski, Maik Stührenberg, and Antonina Werthmann take on a

more general issue relating to the social and infrastructural aspect of the SIG, and

present a proposal for integrating a TEI markup exporter into the general-purpose

citation manager Zotero. The paper provides a glimpse into the origins of the SIG’s

online presence and articulates a proposal for specific choices within TEI bibliographic

elements to suggest a coherent and interchangeable way of sharing and maintaining

bibliographic reference stores.

13 The guest editors of the volume wish to express their thanks to the authors and the

reviewers, and acknowledge the work by the Journal of the Text Encoding Initiative regular

editors, Susan Schreibman and Kevin Hawkins, in bringing the issue into uniform

shape.

BIBLIOGRAPHY

Przepiórkowski, Adam and Piotr Bański. 2011. “XML Text Interchange Format in the National

Corpus of Polish.” In Explorations across Languages and Corpora, edited by Stanisław Góźdź-

Roszkowski, 55–65. Frankfurt am Main: Peter Lang.

Witt, Andreas, Georg Rehm, Erhard Hinrichs, Timm Lehmberg, and Jens Stegmann. 2009.

“SusTEInability of linguistic resources through feature structures.” Literary & Linguistic Computing

24 (3): 363–372. doi:10.1093/llc/fqp024.

The TEI and Current Standards for Structuring Linguistic Data: An Overview

Maik Stührenberg

1. Introduction

1 During the last decade linguistic annotation of corpora has undergone a substantial

change. While in the late 20th century annotation formats were developed and used

exclusively for projects or within small communities, we now have a large number of

standardization efforts carried out by the International Organization for

Standardization (ISO), addressing, in particular, new advancements in technology such

as very large and multiply annotated corpora. An overview is given by Ide and Romary

(2007) and Declerck et al. (2007).

2 In addition, these standardization efforts are increasingly adopted in international

projects such as CLARIN (Common Language Resources and Technology Infrastructure)

and FLARENET (Fostering Language Resources Network).1 Both projects involve

harmonization of formats and standards for language resources and technology with

the goal of making these much more accessible to researchers via component metadata

registries (see Broeder et al. 2011) and by providing guidelines to choose particular

specifications (see Monachini et al. 2011).

3 Of course, international standards are not developed in isolation, without any reference

to established de facto standards such as the TEI Guidelines. However, there are some

differences that can be observed when comparing the TEI Guidelines to these

specifications with respect to various aspects of markup languages such as the formal

model, the notation, and the annotation model.

4 After a short overview of the process of standardization of international standards, we

will contrast this process with the development of community-based specifications,

such as the TEI Guidelines. After this introduction, a number of ISO standards that deal

with the annotation of language corpora will be examined. The TEI’s influence on the

development of these standards will then be discussed. This paper will conclude with recommendations for scholars and researchers who deal

with linguistically annotated corpora.

2. Current International Standards

2.1. International Standardization

5 The term standard can have two meanings. On the one hand, the term can denote

international (or national) industry norms and standards—that is, specifications

developed by organizations that have been assigned to this task, such as ANSI

(American National Standards Institute) in the USA or DIN (Deutsches Institut für

Normung) in Germany. Such standards are called de jure standards. On the other hand, there are also de facto (or market-driven) standards, i.e., specifications

that are not endorsed by a standards organization but have achieved a greater

popularity compared to similar specifications. An obvious example of such a de facto

standard is the original file format of Microsoft Word: the ubiquitous “doc” format. In

this case, the status of the specification is based on the dominant market position of the

respective company. Another example is the tagset of the TEI Guidelines, the status of

which can be explained by its broad acceptance by scholars around the world.

6 De jure standards are developed by international committees, usually under the

auspices of the International Organization for Standardization (ISO) and comprising

members from various national standards bodies. ISO, for example, has technical

committees (TC), divided into subcommittees (SC) and then into working groups (WG)

chartered to work on a distinctive topic. But the work of developing a standard often

begins in one or more national bodies, since technical committees are made up of

national representatives of various stakeholders such as industry, NGOs, government, or

academia. Therefore, each national organization for standardization (a member body)

decides to participate in a number of technical committees. These national bodies often

reflect the structure of ISO, allowing for straightforward collaboration between

corresponding committees in different countries.

7 A relevant ISO subcommittee in the field of linguistic annotation is ISO/TC 37/SC 4 (in

this case, “SC 4” stands for subcommittee 4), called “Language Resource Management”, of the

technical committee “Terminology and other Language and Content Resources”. It is

divided into six working groups (WG):

WG 1: Basic descriptors and mechanisms for language resources

WG 2: Annotation and representation schemes

WG 3: Multilingual information representation

WG 4: Lexical resources

WG 5: Workflow of language resource management

WG 6: Linguistic annotation.2

8 These working groups develop relevant specifications for the field of linguistic

annotation.

9 ISO has a protocol for the proposal process (International Organization for

Standardization/International Electrotechnical Commission 2012) in which proposals

must pass through seven stages, each of which takes some time, before becoming

official standards:

Preliminary stage

Proposal stage

Preparatory stage

Committee stage

Enquiry stage

Approval stage

Publication stage

10 The first stage marks the introduction of a Preliminary Work Item (PWI), which can be

introduced by members of the working group or by outside interested parties. After a

positive internal review, it becomes a New Work Item Proposal (NP). At that time it

reaches the proposal stage, in which the so-called P-members (“participating

members”) of the respective committee (or sub-committee) have to vote in favor or

against the further pursuit of this item.3 If the majority of the P-members cast a

positive vote and at least five P-members signal a willingness to participate in the

standardization process, the NP is added as a new project of the WG, reaching the

beginning of the preparatory stage.

11 In each of the following stages the status of the proposal changes according to

substantial improvements that have been made. The committee stage is the first stage

at which the Committee Draft (CD), as it’s then called, is commented on by national

bodies of the TC/SC. This stage ends when all technical issues have been resolved. In

that case the CD is transformed into a Draft International Standard (DIS) and enters the

enquiry stage.

12 At this stage the DIS will be circulated to all national bodies for a ballot. A vote can be

either positive, negative, or an abstention; in the two former cases the vote may be

accompanied by editorial or technical comments. The DIS is approved if a two-thirds

majority of the P-members’ votes are in favor and not more than one-quarter of the

total votes cast are negative. In that case it will be registered as a Final Draft

International Standard (FDIS), proceeding to the approval stage.4

13 From this point onwards the text of the FDIS is usually not publicly available for free

(although there are exceptions to this rule). As a result, researchers often consult and

cite Committee Drafts or Draft International Standards in their work. However, such a

time-consuming and consensus-driven process means that major changes often exist

between draft versions and the final International Standard. In contrast, openly

developed standards such as the TEI Guidelines are often publicly available both as

drafts and final versions, which eases the adoption of changes between different

versions.

14 The boundaries between de facto and de jure standards can be very weak; in fact,

sometimes de facto standards have become de jure standards. For example, Simons (2007)

explains the long process of developing a standard for describing language codes,

starting from Ethnologue and ending with the International Standard ISO 639-3:2007.5

15 In the next section we will discuss some de jure standards that have been developed in

ISO/TC 37/SC 4 that may affect the work of current and future linguists.6

2.2. Feature Structures (FS)

16 Feature Structures are general-purpose data structures consisting of a named feature

and its value (or values). Complex feature structures contain a group of individual

features allowing for a representation of various kinds of information. In linguistics, feature structures are best known as part of Head-driven Phrase

Structure Grammar (HPSG).7

17 Feature structure representations have been a part of the TEI Guidelines from the very

beginning.8 However, during the transition from P4 to P5 a substantial amount of work was

undertaken to improve the tag set and to clarify its underlying formal logic.

18 The following is an example of a TEI-based linguistic feature structure:

<fs>
  <f name="CAT">
    <symbol value="np" />
  </f>
  <f name="AGR">
    <fs>
      <f name="NUM">
        <symbol value="sing" />
      </f>
      <f name="PER">
        <symbol value="third" />
      </f>
    </fs>
  </f>
</fs>

Figure 1: TEI-based feature structure for a linguistic annotation (from Stegmann and Witt 2009).

19 This feature structure consists of two features. The first, named “CAT”, is a simple

feature that has the atomic feature value “np”. The second, named “AGR”, is a complex

feature (that is, its value consists of other feature structures), containing the features

“NUM” and “PER”.

20 A few key players in the TEI community submitted the P5 revision of the feature

structure annotation format for standardization as the two-part ISO standard 24610.

While the first part, ISO 24610-1:2006, describes feature structures (including the

representation format shown in the example above and an informal overview of the

basic characteristics of feature structures), the second part, ISO 24610-2:2011, discusses

the feature system declaration described in Chapter 18.11 of the TEI Guidelines.

21 Both parts of ISO 24610 use a RELAX NG grammar that is a subset of the TEI’s P5

document grammar with only slight changes (for example, a different root element). As

one may observe, there is a five-year gap between the two parts of ISO 24610. In

addition, ISO 24610-1 was scheduled for a regular revision that should have been

finished in early 2012. However, due to time constraints on the part of the involved

experts, work on the Committee Draft for the revision has been put on hold, leaving ISO

24610-1:2006 as the current version.

2.3. The Linguistic Annotation Framework (LAF)

22 Development of the Linguistic Annotation Framework began in 2005, and it became an

approved standard in 2012 (ISO 24612). Its goal is to establish a definitive standard

based on widely used de facto standards such as the TEI, the Corpus Encoding Standard

(CES, see Ide 1998), and its successor XCES (Ide et al. 2000).

23 LAF provides a framework for representing linguistic annotation of various kinds. It

includes an abstract data model for general-purpose linguistic annotation (in contrast

to more specific annotation formats such as the Morpho-Syntactic Annotation

Framework discussed in the next section) and an XML serialization format called Graph

Annotation Format (GrAF), which serves as a pivot format for mapping between user-

defined annotation formats. The data model consists of three parts: (1) anchors that

define regions by referencing locations in the primary data (that is, the data to be

annotated); (2) a graph structure, consisting of nodes, edges and links to the before-

mentioned regions; and (3) an annotation structure comprising a directed graph

referencing regions or other annotations. The nodes in this graph are associated with

feature structures providing the annotation content. LAF does not include data

categories but instead relies on ISO 12620:2009, the International Standard for

describing data categories, and on ISOcat, an implementation of ISO 12620:2009

developed in ISO/TC 37/SC 3.9

24 A language resource conforming to LAF consists of the primary data; a base

segmentation (that is, at least one document that provides anchors and therefore

defines regions of the primary data); a number of annotation documents containing

nodes, edges and feature structures; and a set of header files (metadata). By storing

primary data and annotation in separate files, LAF uses stand-off annotation (see

Thompson and McKelvie 1997), similar to CES and XCES, to more easily encode

overlapping and discontiguous regions than if these were encoded in a single file. The

anchors are nodes that are located between base units of the primary data. Depending

on the type of primary data (text, audio, video, or other) the base unit can be a

character, a segment of time, or another useful unit of segmentation. An annotation

document contains annotations associated with the nodes in the graph that reference

regions of the primary data. While stand-off annotation would allow the combination of

several linguistic annotation layers into a single annotation document (see Stührenberg

and Jettka 2009), the standard recommends the use of separate annotation files for the

purpose of exchange.

25 Figure 2 shows a fragment of an example annotation document containing a

header, nodes, edges, and annotations (taken from ISO/FDIS 24612).

<?xml version="1.0" encoding="UTF-8"?>
<graph xmlns="http://www.xces.org/ns/GrAF/1.0/">
  <graphHeader>
    <labelsDecl>
      <labelUsage label="fullTextAnnotation" occurs="1"/>
      <labelUsage label="Target" occurs="171"/>
      <labelUsage label="FE" occurs="372"/>
      <labelUsage label="sentence" occurs="32"/>
      <labelUsage label="annotationSet" occurs="171"/>
      <labelUsage label="NamedEntity" occurs="32"/>
    </labelsDecl>
    <dependencies>
      <dependsOn type="fntok"/>
    </dependencies>
    <annotationSpaces>
      <annotationSpace as.id="FrameNet" default="true"/>
    </annotationSpaces>
  </graphHeader>
  <node xml:id="fn-n156"/>
  <a label="FE" ref="fn-n156">
    <fs>
      <f name="FE" value="Speaker"/>
      <f name="rank" value="1"/>
      <f name="GF" value="Ext"/>
      <f name="PT" value="NP"/>
    </fs>
  </a>
  <!-- [...] -->
  <edge xml:id="e233" from="fn-n156" to="fn-n133"/>
  <!-- [...] -->
  <region xml:id="r1" anchors="980 9190"/>
  <region xml:id="r2" anchors="980 993"/>
  <!-- [...] -->
  <node xml:id="a232">
    <link targets="r1"/>
  </node>
  <node xml:id="a233">
    <link targets="r2"/>
  </node>
  <!-- [...] -->
  <a label="R Gesture Units 1" ref="a232"/>
  <a label="preparation" ref="a233"/>
</graph>

Figure 2: An example annotation document using the Graph Annotation Format (GrAF).

26 LAF takes input from several other specifications: the header files resemble the ones

used in CES, which in turn are based on TEI headers. ISO 24610-1:2006 can be used for

these feature structures. However, the standard recommends its own representation

format shown in figure 2 as a more concise notation.

27 What is somewhat disturbing is the fact that a document grammar for the Graph

Annotation Format was removed when the draft standard moved from DIS to

FDIS. The DIS version contained an XML schema file in the informative annex of the

specification while the FDIS contains only fragments of a RELAX NG document

grammar. Since the FDIS was approved as an International Standard in 2012 without any

comments regarding this topic, we assume that this is also the case for the final

version.

2.4. The Syntactic Annotation Framework (SynAF)

28 The Syntactic Annotation Framework (SynAF, ISO 24615:2010) pursues the goal of

defining both a meta-model for syntactic annotation and a set of data categories. In

contrast to the more specific Morpho-Syntactic Annotation Framework (MAF), which is

discussed in the next subsection, SynAF had already been published as an International

Standard in 2010. The latest version that is publicly available for free is ISO/FDIS 24615,

but an early version is discussed by Declerck (2006). SynAF is based on the Penn

Treebank initiative, the Negra/Tiger initiative, and the ISST initiative and has been

developed mainly by the LIRICS Consortium. While MAF deals with part of speech,

morphological and grammatical features, SynAF deals with the annotation of syntactic

constituency of groups of MAF word forms within sentence boundaries.

29 The meta-model for SynAF contains the generic class of Syntactic Nodes and Syntactic

Edges, which together form a Syntactic Graph. Syntactic Nodes can be differentiated

into T_Nodes (terminal nodes)—that is, the morpho-syntactic annotated word forms of

MAF, defined over one or more spans—and NT_Nodes (non-terminal nodes of a syntax

tree). The T_Nodes are annotated with syntactic data categories according to the word

level, whereas the NT_Nodes are annotated with syntactic categories according to the

phrase, clause, or sentence level.

30 Syntactic Edges are used to represent relations between Syntactic Nodes, such as

dependency relations. The edges can be specified as primarySyntacticEdge (expressing

the constituency relationship) or secondarySyntacticEdge, which “may be used to

express the relationship between a head and a coreferent of its omitted dependent”

(ISO/FDIS 24615, 14). Since the standard does not propose a specific tag set but only

generic classes and specific data categories, there are several possible serialization

formats. Romary et al. (2011) propose the <tiger2> XML format; another natural

selection would be the Graph Annotation Format defined in LAF.
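
As a rough sketch of how such a serialization might look, the small constituency fragment below encodes the noun phrase "new wallpaper" as two terminal nodes dominated by a non-terminal node, reusing only the GrAF elements shown in figure 2; the identifiers, offsets, and the "cat" label and feature name are assumptions for illustration, not prescribed by SynAF:

<graph xmlns="http://www.xces.org/ns/GrAF/1.0/">
  <!-- terminal nodes anchored to character regions of the primary data -->
  <region xml:id="r1" anchors="15 18"/>
  <region xml:id="r2" anchors="19 28"/>
  <node xml:id="t1">
    <link targets="r1"/>
  </node>
  <node xml:id="t2">
    <link targets="r2"/>
  </node>
  <!-- a non-terminal node annotated as a noun phrase -->
  <node xml:id="nt1"/>
  <a label="cat" ref="nt1">
    <fs>
      <f name="cat" value="NP"/>
    </fs>
  </a>
  <!-- primary (constituency) edges from the non-terminal to its daughters -->
  <edge xml:id="e1" from="nt1" to="t1"/>
  <edge xml:id="e2" from="nt1" to="t2"/>
</graph>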

2.5. The Morpho-Syntactic Annotation Framework (MAF)

31 The Morpho-Syntactic Annotation Framework is closely connected to the Syntactic

Annotation Framework (SynAF) discussed in the previous section. MAF is not yet an

International Standard but is in the stage of an FDIS (ISO/FDIS 24611). The last version

freely available to the public is ISO/CD 24611. However, the basic concepts of the

specification, such as the two-level structuring of tokens and word forms and the

handling of ambiguity, are discussed by Clément and de la Clergerie (2005).

32 MAF uses stand-off annotation as well and represents an annotated document as the

primary data (called a “raw document” by Clément and de la Clergerie 2005) and a set

of annotations. An input document can be divided into tokens, which can be used as

anchors for word forms. Tokens resemble the regions in LAF—that is, they represent

segments of the primary data. MAF does not provide an addressing schema used to

refer to positions but instead relies on externally defined addressing schemas.10

33 Similar to LAF, these tokens can be organized in a directed acyclic graph (DAG) called a

token lattice. Word forms carry the annotation by using feature structure

representations and refer to tokens in an m:n-relation (where one or more tokens

anchor one or more word forms). Word forms, too, can be organized—in a word form

lattice. Figure 3 shows an example annotation of the sentence “I wanna put up new

wallpaper.”11

<maf xmlns="http://www.iso.org/ns/MAF" document="sample.txt"
     addressing="char_offset">
  <olac:olac
      xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
      xmlns="http://purl.org/dc/elements/1.1/">
    <creator>Maik Stührenberg</creator>
  </olac:olac>
  <token xml:id="t1" form="I" from="0" to="1"/>
  <token xml:id="t2" join="right" form="wan" from="2" to="5"/>
  <token xml:id="t3" join="left" form="na" from="5" to="7"/>
  <token xml:id="t4" form="put" from="8" to="11"/>
  <token xml:id="t5" form="up" from="12" to="14"/>
  <token xml:id="t6" form="new" from="15" to="18"/>
  <token xml:id="t7" form="wall" from="19" to="23"/>
  <token xml:id="t8" form="paper" from="23" to="28"/>
  <token xml:id="t9" form="." from="28" to="29">.</token>
  <wordForm lemma="I" tokens="#t1">
    <fs>
      <f name="pos">
        <symbol value="PP"/>
      </f>
    </fs>
  </wordForm>
  <wordForm lemma="want" tokens="#t2">
    <fs>
      <f name="pos">
        <symbol value="VBP"/>
      </f>
    </fs>
  </wordForm>
  <wordForm lemma="to" tokens="#t3">
    <fs>
      <f name="pos">
        <symbol value="TO"/>
      </f>
    </fs>
  </wordForm>
  <wordForm tokens="#t2 #t3"/>
  <wordForm lemma="put" tokens="#t4"/>
  <wordForm lemma="up" tokens="#t5"/>
  <wordForm lemma="put_up" tokens="#t4 #t5">
    <fs>
      <f name="pos">
        <symbol value="VB"/>
      </f>
    </fs>
  </wordForm>
  <wordForm lemma="new" tokens="#t6">
    <fs>
      <f name="pos">
        <symbol value="JJ"/>
      </f>
    </fs>
  </wordForm>
  <wordForm lemma="wallpaper" tokens="#t7 #t8">
    <fs>
      <f name="pos">
        <symbol value="NN"/>
      </f>
    </fs>
  </wordForm>
</maf>

Figure 3: Example annotation using MAF’s current serialization format.

34 Instead of stand-off annotation, it is possible to use inline annotation for the token

content; in fact, most examples in ISO/CD 24611 use this notation. In this case the value

of the @form attribute would be used as element content of the <token> element and

the @from and @to attributes would be omitted. According to the standard, however, this

is not recommended since it may conflict with other annotations.
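
For the first tokens of figure 3, the inline variant would look roughly as follows; this is a sketch based on the description above, not an example taken from the standard, and it leaves out everything not needed to show the difference:

<maf xmlns="http://www.iso.org/ns/MAF">
  <!-- inline variant: the token string (the @form value) becomes element
       content, and the character-offset attributes @from and @to are dropped -->
  <token xml:id="t1">I</token>
  <token xml:id="t2" join="right">wan</token>
  <token xml:id="t3" join="left">na</token>
  <!-- [...] remaining tokens and word forms as in figure 3 -->
</maf>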

35 The morpho-syntactic content is represented by feature structures: ISO/CD 24611

directly refers to ISO 24610-1:2006. Metadata may be included according to the OLAC

metadata specification (Simons and Bird 2008) using the OLAC namespace as seen in

figure 3.

36 In addition, ISO/FDIS 24611 contains a RELAX NG-like specification, some annotated

examples and a list of morpho-syntactic data categories as part of its appendixes.

3. The Relation of the TEI to the Current de jure Standards

37 In this section the relation between the TEI and the previously mentioned standards

will be discussed, focusing on aspects of their notation format and annotation models.

Bański and Przepiórkowski have already stated the fact that the TEI is a direct ancestor

of these standards:

The current standards that have been or are being established by ISO TC 37 SC 4 committee …, known together as the LAF (Linguistic Annotation Framework) family of standards, … descend in part from an early application of the TEI, back when the TEI was still an SGML-based standard. That application was the Corpus Encoding Standard …, later redone in XML and known as XCES …. XCES was a conceptual predecessor of the current ISO LAF pivot format for syntactic interoperability of annotation formats, GrAF (Graph Annotation Framework) …. GrAF defines an XML serialization of the LAF data model consisting of directed acyclic graphs with annotations (also expressible as graphs), attached to nodes. This basic data model is in fact common to the TEI formats defined for the NCP, the LAF family of standards, and the other standards and best practices …. (2010b, 36)

3.1. Influence on the Data Model

38 In the field of Digital Humanities there has been the assumption that text is

hierarchically structured (see, for example, Coombs et al. 1987 or the OHCO thesis

postulated by DeRose et al. 1990 and Renear et al. 1996, stating that a text is an Ordered

Hierarchy of Content Objects), and therefore markup languages which were developed

to annotate mainly textual content use the formal model of a tree.

39 But in fact, several authors argue that the formal model of XML

instances is that of a graph: Abiteboul et al. 2000, Polyzotis and Garofalakis 2002, Gou

and Chirkova 2007, Møller and Schwartzbach 2011, and Jettka and Stührenberg 2011. In

particular, XML-inherent integrity constraints—that is, ID/IDREF/

IDREFS token-type attributes (in XML DTD syntax) or xs:ID/xs:IDREF/

xs:IDREFS and xs:key/xs:keyref (in XSD syntax), respectively, which are

supported by document grammar formalisms—can be used to represent graph

structures in XML. An example of such an XML serialization of a graph can be

observed in the way in which an edge in GrAF is constructed by referring to the IDs of

already established nodes via the @from and @to attributes. Similar examples can be

found in the XStandoff format (Stührenberg and Jettka 2009; Witt et al. 2011; Jettka and

Stührenberg 2011).
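
As a minimal sketch of this mechanism, with hypothetical element names not taken from any of the specifications discussed here, a DTD can declare ID and IDREF attributes so that a validating parser guarantees that every edge points at an existing node, which is precisely the integrity needed to serialize a graph in XML:

<!DOCTYPE graph [
  <!ELEMENT graph (node+, edge*)>
  <!ELEMENT node EMPTY>
  <!ATTLIST node id ID #REQUIRED>
  <!ELEMENT edge EMPTY>
  <!ATTLIST edge from IDREF #REQUIRED
                 to   IDREF #REQUIRED>
]>
<graph>
  <node id="n1"/>
  <node id="n2"/>
  <node id="n3"/>
  <!-- a validating parser rejects dangling references, e.g. to="n9" -->
  <edge from="n1" to="n2"/>
  <edge from="n1" to="n3"/>
</graph>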

40 Apart from a representation format for graphs, networks, and trees found in TEI since

P3, the refined and enhanced feature structure representation format of TEI P5 has

been a great step in establishing a more expressive formal model. In addition, other

specifications developed for various projects, such as XStandoff, NITE (Carletta et al.

2005), or the Potsdamer Austauschformat für linguistische Annotation12 (PAULA, Dipper

et al. 2007), propagate graph-based formal models.

41 Therefore, the TEI cannot be seen as the direct or single ancestor of the current

standards in development. However, it seems that this newer graph-based formal

model (which depends on the existence of a document grammar using the

aforementioned integrity constraints) may play a greater part in future XML formats

(especially those for structuring multiply annotated data), and one may argue that the

TEI has accompanied this change from a strictly hierarchical to a graph-based formal

model.

3.2. Influence on Notation Format

42 The notation format that is used by all standards discussed here is stand-off

annotation. Although stand-off annotation is not a generic TEI concept, the TEI

Guidelines have long included mechanisms to deal with overlapping markup, namely

milestone elements, fragmentation and reconstruction, and multiple encodings of the

same information.13 Moreover, it was the previously mentioned Corpus Encoding

Standard (CES), a modification of TEI P3, that made stand-off annotation the default

model for linguistic corpora. In the current version of the TEI (P5) the term “stand-off

markup” is discussed in Chapters 16.9 and 20.4, firmly establishing the concept of

separating primary data and markup in the wider text encoding community. This

support for stand-off annotation is rated as a crucial point by Bański and

Przepiórkowski: “Any standards adopted for these levels should allow for stand-off

annotation, as is now common practice and as is virtually indispensable in the case of

many levels of annotation, possibly involving conflicting hierarchies” (2010a, 98).
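
A minimal sketch of the idea in TEI terms keeps the primary data and a part-of-speech layer in separate documents connected only by pointers; the file names, the tokenization, and the use of <linkGrp> and <link> for the connection are assumptions chosen for illustration, not a normative recipe:

<!-- primary.xml: tokenized primary data -->
<p xmlns="http://www.tei-c.org/ns/1.0">
  <w xml:id="w1">new</w>
  <w xml:id="w2">wallpaper</w>
</p>

<!-- pos.xml: stand-off annotation layer pointing back into primary.xml;
     #pos.JJ and #pos.NN are assumed to identify category definitions
     declared elsewhere -->
<linkGrp xmlns="http://www.tei-c.org/ns/1.0" type="pos">
  <link target="primary.xml#w1 #pos.JJ"/>
  <link target="primary.xml#w2 #pos.NN"/>
</linkGrp>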

43 Although stand-off annotation can still be cumbersome to manage (especially when

positions in the primary data are used to establish anchors and regions), some software

products have been developed during the past years to support this notation—for

example, the web-based annotation platform “Serengeti” (which uses XStandoff—see

Stührenberg et al. 2007; Poesio et al. 2011) or the “Glozz Annotation Platform”

(Widlöcher and Mathet 2009). Among the various candidates for dealing with multiple

(and possibly overlapping) annotations, stand-off markup seems to be the most

promising. (See Bański 2010 for a discussion of advantages and disadvantages of using

TEI stand-off annotation.)

3.3. Influence on the Annotation Model

44 One of the building blocks of the TEI’s success among various scholars is the fact that it

does not define a normative standard but rather guidelines. These recommendations

try not to constrain the user to a single way of encoding but leave a large amount of

personal freedom (and responsibility) to the user, while other annotation formats try

to be as strict as possible to reflect a certain annotation model and theory.

45 The generic markup that is manifested in the TEI’s feature structure representation is

informed by this permissive attitude. As a consequence, all current International

Standards for linguistic data use generic elements and attributes (and especially

feature structures) to store annotation information. The use of such generic markup

has both advantages and disadvantages. On the one hand it helps to separate the

meaning (the concept) of an annotation from its serialization (a separation introduced

by Bayerl et al. 2003 and Witt 2004), establishing a basis for multiply annotated corpora.

But on the other hand, a generic annotation format is generally more verbose and

makes little use of the hierarchical relations between elements inherent in XML.

In addition, it relies heavily on a given set of standardized data categories to assure the

comparability of annotation.
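
To illustrate the trade-off, the same part-of-speech statement is shown below first in a generic feature-structure encoding and then in a hypothetical task-specific tag set, the latter invented here purely for contrast:

<!-- generic encoding: the linguistic content lives in data categories -->
<fs>
  <f name="pos">
    <symbol value="NN"/>
  </f>
</fs>

<!-- hypothetical specific tag set: the content is hard-wired into the vocabulary -->
<w pos="NN">wallpaper</w>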

4. Conclusion

46 A comparison of the TEI Guidelines with the International Standards discussed in the

previous sections leaves us with mixed results. On the one hand, the ISO specifications

have the advantage of being de jure standards (at least once the standardization process

is finished in the case of MAF). On the other hand, this status is a mixed blessing. Since

International Standards are the outcome of a procedure relying on consensus, the

results are often compromise-ridden. Moreover, specifications can get mired in long

approval processes: LAF is a case in point, since it took so many years to reach the

status of an International Standard. This long gestation raised problems for other

standards, such as MAF, that refer to LAF’s components even before the standard was

finalized. In addition, users not familiar with the relationships between the different

standards may find it difficult to keep track of specification status and dependencies.

To help such users, we have developed a web-based information system presenting an

overview of these relations (Stührenberg et al. 2012).

47 In contrast, the TEI Guidelines represent a stable and mature representation format for

annotation. Although it is also based on consensus, by maintaining a greater variety of

possible annotation solutions it is less prone to compromise.14 Another advantage over

the standards discussed in this article is that the TEI can be used as is without the need

to add further specifications, such as an external metadata format. In addition, the TEI

tag set is highly modular and can be modified easily by using the web-based “Roma”

tool, resulting in a strict or rich feature set depending on one’s own needs. The

comprehensive Guidelines themselves and a large helpful community complement

these benefits. Therefore, it should not be surprising that the TEI remains a

recommended annotation format for encoding linguistic corpora, following

Przepiórkowski and Bański: “We conjecture that—given the stability, specificity and

extensibility of TEI P5 and the relative instability and generality of some of the other

proposed standards—this approach is currently the optimal way of following corpus

encoding standards.” (2009, 250).

48 However, with International Standards such as the Linguistic Annotation Framework,

the Morpho-Syntactic Annotation Framework, and the Syntactic Annotation

Framework, normative efforts to ease the exchange of linguistically annotated data are

finally emerging. It will be interesting to observe the final version of MAF and

especially the application of LAF and MAF in the wild.

49 Regarding the relationship between the TEI Guidelines and the discussed de jure

standards, one can observe that the former may have influenced current specifications

in many ways. However, especially for the data model and notation format, other

projects and specifications played important roles as well.

5. Recommendations

50 Current linguistic researchers are spoiled for choice: in addition to well-established de

facto standards such as the TEI, international de jure standards are on the rise. Projects

such as CLARIN or FLARENET promise to help users choose among them by providing

recommendations and guidelines, as does the aforementioned web-based information

system. Apart from that, it seems that the combination of generic annotation formats

such as the feature structure representation format present in the TEI P5, ISO

24610-1:2006, and ISO 24610-2:2011 and respective data category sets will be a valid

candidate for a sustainable annotation format. Data categories should be registered via

the official implementation of ISO 12620:2009, ISOcat, available at http://

www.isocat.org.

51 A practical additional interim solution could be the setup of an ISOcat TEI data category

set providing all of the elements and attributes in P5. In conjunction with a stylesheet

transforming inline TEI to a stand-off TEI feature structure representation (with the

respective ISOcat references), the resulting output format should be compatible with

ISO 24610-1:2006 and could be used as a starting point for LAF-based annotations.
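
The following is a minimal sketch of such a stylesheet, assuming inline <w> elements that carry @xml:id and a hypothetical @pos attribute; the xml:id naming convention in the output stands in for a proper linking mechanism and for the ISOcat references mentioned above:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.tei-c.org/ns/1.0">
  <xsl:output method="xml" indent="yes"/>
  <!-- collect part-of-speech values from inline <w> elements and
       re-express them as stand-off TEI/ISO 24610-1 feature structures -->
  <xsl:template match="/">
    <fvLib n="stand-off part-of-speech annotation">
      <xsl:for-each select="//tei:w[@pos]">
        <!-- one feature structure per annotated word; the identifier
             convention fs_ plus the word id stands in for real linking -->
        <fs xml:id="fs_{@xml:id}">
          <f name="pos">
            <!-- in a full solution the value would carry an ISOcat reference -->
            <symbol value="{@pos}"/>
          </f>
        </fs>
      </xsl:for-each>
    </fvLib>
  </xsl:template>
</xsl:stylesheet>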

52 As a side-effect, users familiar with the TEI could use their existing annotation tool

chain. Future versions of the TEI Guidelines should further embrace the noticeable

trend of using stand-off notation, possibly introducing it to a broader range of

linguistic researchers and even to other, non-linguistic uses of the TEI.

BIBLIOGRAPHY

Abiteboul, Serge, Peter Buneman, and Dan Suciu. 2000. Data on the Web: From Relations to

Semistructured Data and XML. San Francisco: Morgan Kaufman.

Bański, Piotr. 2010. “Why TEI standoff annotation doesn’t quite work: and why you might want to

use it nevertheless.” In Proceedings of Balisage: The Markup Conference, 2010. Vol. 5 of Balisage

Series on Markup Technologies. doi:10.4242/BalisageVol5.Banski01.

Bański, Piotr, and Adam Przepiórkowski. 2010a. “TEI P5 as a Text Encoding Standard for

Multilevel Corpus Annotation.” In Digital Humanities 2010 Conference Abstracts, 98–100. http://

dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/pdf/ab-616.pdf.

———. 2010b. “The TEI and the NCP: the Model and its Application.” In Proceedings of LREC 2010

Workshop on Language Resources: From Storyboard to Sustainability and LR Lifecycle Management, 34–

38. http://www.lrec-conf.org/proceedings/lrec2010/workshops/W20.pdf.

Bayerl, Petra Saskia, Harald Lüngen, Daniela Goecke, Andreas Witt, and Daniel Naber. 2003.

“Methods for the Semantic Analysis of Document Markup.” In Proceedings of the 2003 ACM

Symposium on Document Engineering, 161–170. New York: ACM.

Broeder, Daan, Oliver Schonefeld, Thorsten Trippel, Dieter van Uytvanck, and Andreas Witt. 2011.

“A Pragmatic Approach to XML Interoperability – the Component Metadata Infrastructure

(CMDI).” In Proceedings of Balisage: The Markup Conference 2011. Vol. 7 of Balisage Series on Markup

Technologies. doi:10.4242/BalisageVol7.Broeder01.

Carletta, Jean, Stefan Evert, Ulrich Heid, and Jonathan Kilgour. 2005. “The NITE XML toolkit: data

model and query language.” Language Resources and Evaluation 39 (4): 313–334.

Clément, Lionel, and Èric Villemonte de la Clergerie. 2005. “MAF: A Morphosyntactic Annotation

Framework.” In Proceedings of the 2nd Language & Technology Conference: Human Language

Technologies as a Challenge for Computer Science and Linguistics, 90–94. Poznań, Poland:

Wydawnictwo Poznańskie.

Coombs, James H., Allen H. Renear, and Steven J. DeRose. 1987. “Markup Systems and the Future

of Scholarly Text Processing.” Communications of the ACM 30 (11): 933–947.

Dalby, David, Lee Gillam, Christopher Cox, and Debbie Garside. 2004. “Standards for Language

Codes: Developing ISO 639.” In LREC 2004: Fourth International Conference on Language Resources and

Evaluation, 127–130. Paris: ELRA.

Declerck, Thierry. 2006. “SynAF: Towards a Standard for Syntactic Annotation.” In Book of

Abstracts [conference abstracts from LREC 2006], 229–232. Paris: ELRA.

Declerck, Thierry, Nancy Ide, and Thorsten Trippel. 2007. “Interoperable Language Resources.”

SDV – Sprache und Datenverarbeitung 31 (01/02): 101–113.

DeRose, Steven J., David G. Durand, Elli Mylonas, and Allen H. Renear. 1990. “What is text,

really?” Journal of Computing in Higher Education 1 (2): 3–26.

Dipper, Stefanie, Michael Götze, Uwe Küssner, and Manfred Stede. 2007. “Representing and

Querying Standoff XML.” In Datenstrukturen für linguistische Ressourcen und ihre Anwendungen. Data

Structures for Linguistic Resources and Applications, edited by Georg Rehm, Andreas Witt, and Lothar

Lemnitzer, 337–346. Tübingen: Gunter Narr.

Gou, Gang, and Rada Chirkova. 2007. “Efficiently Querying Large XML Data Repositories: A

Survey.” IEEE Transactions on Knowledge and Data Engineering 19 (10): 1381–1403.

Ide, Nancy, Patrice Bonhomme, and Laurent Romary. 2000. “XCES: An XML-based Encoding

Standard for Linguistic Corpora.” In Second International Conference on Language Resources and

Evaluation, 825–830. Paris: European Language Resources Association.

Ide, Nancy. 1998. “Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora”.

In First International Conference on Language Resource and Evaluation, 463–470. Paris: ELRA.

Ide, Nancy, and Laurent Romary. 2007. “Towards International Standards for Language

Resources.” In Evaluation of Text and Speech Systems, edited by Laila Dybkjaer, Holmer Hemsen, and

Wolfgang Minker, 263–284. Dordrecht: Springer.

International Organization for Standardization/International Electrotechnical Commission. 2012.

“ISO/IEC Directives, Part 1: Procedures for the technical work.” 9th Edition, March 8, 2012.

http://isotc.iso.org/livelink/livelink?

func=ll&objId=10563026&objAction=Open&nexturl=%2Flivelink%2Flivelink%3Ffunc%3Dll%26objId%3D4230455%26objAction%3Dbrowse%26sort%3Dsubtype

Jettka, Daniel, and Maik Stührenberg. 2011. “Visualization of concurrent markup: From trees to

graphs, from 2D to 3D.” In Proceedings of Balisage: The Markup Conference 2011. Vol. 7 of Balisage

Series on Markup Technologies. doi:10.4242/BalisageVol7.Jettka01.

Langendoen, D. Terence, and Gary F. Simons. 1995. “A Rationale for the TEI Recommendations for

Feature-Structure Markup.” Computers and the Humanities 29 (3): 191–209.

Monachini, Monica, Valeria Quochi, Nicoletta Calzolari, Núria Bel, Gerhard Budin, Tommaso

Caselli, Khalid Choukri, Gil Francopoulo, Erhard Hinrichs, Steven Krauwer, Lothar Lemnitzer,

Joseph Mariani, Jan Odijk, Stelios Piperidis, Adam Przepiorkowski, Laurent Romary, Helmut

Schmidt, Hans Uszkoreit, and Peter Wittenburg. 2011. “The Standards’ Landscape Towards an

Interoperability Framework: The FLaReNet proposal Building on the CLARIN Standardisation

Action Plan.” http://www.flarenet.eu/sites/default/files/FLaReNet_Standards_Landscape.pdf.

Møller, Anders, and Michel I. Schwartzbach. 2011. “XML Graphs in Program Analysis.” Science of

Computer Programming 76 (6): 492–515.

Poesio, Massimo, Nils Diewald, Maik Stührenberg, Jon Chamberlain, Daniel Jettka, Daniela

Goecke, and Udo Kruschwitz. 2011. “Markup Infrastructure for the Anaphoric Bank: Supporting

Web Collaboration.” In Modeling, Learning and Processing of Text Technological Data Structures, edited

by Alexander Mehler, Kai-Uwe Kühnberger, Henning Lobin, Harald Lüngen, Angelika Storrer, and

Andreas Witt, 197–218. Berlin: Springer.

Pollard, Carl, and Ivan A. Sag. 1987. Information-based Syntax and Semantics. Menlo Park: CSLI.

Pollard, Carl, and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. Chicago: The University

of Chicago Press.

Polyzotis, Neoklis, and Minos Garofalakis. 2002. “Statistical Synopses for Graph-Structured XML

Databases.” In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data,

358–369. New York: ACM.

Przepiórkowski, Adam, and Piotr Bański. 2009. “Which XML Standards for Multilevel Corpus

Annotation?” http://bach.ipipan.waw.pl/~adamp/Papers/2009-ltc-tei/ltc-030-

przepiorkowski.pdf.

Renear, Allen H., Elli Mylonas, and David D. Durand. 1996. “Refining Our Notion of What Text

Really Is: The Problem of Overlapping Hierarchies.” Selected Papers from the ALLC/ACH Conference,

Christ Church, Oxford, April 1992. Vol. 4 of Research in Humanities Computing. 263–280.

Romary, Laurent, Amir Zeldes, and Florian Zipser. 2011. “<tiger2/> – Serialising the ISO SynAF

Syntactic Object Model.” Computing Research Repository (CoRR). http://arxiv.org/pdf/1108.0631v1.

Simons, Gary F. 2007. “Linguistics as a community activity: The paradox of freedom through

standards.” In Time and Again: Theoretical and Experimental Perspectives on Formal Linguistics: Papers

in Honor of D. Terence Langendoen, edited by William D. Lewis, Simin Karimi, Heidi Harley, and Scott

Farrar, 235–250. Amsterdam: John Benjamins.

Simons, Gary F., and Steven Bird. 2008. “OLAC Metadata.” Open Language Archives Community

Standard. http://www.language-archives.org/OLAC/metadata-20080531.html.

Stegmann, Jens, and Andreas Witt. 2009. “TEI Feature Structures as a Representation Format for

Multiple Annotation and Generic XML Documents.” Proceedings of Balisage: The Markup Conference.

Vol. 3 of Balisage Series on Markup Technologies. doi:10.4242/BalisageVol3.Stegmann01.

Stührenberg, Maik, Daniela Goecke, Nils Diewald, Irene Cramer, and Alexander Mehler. 2007.

“Web-based Annotation of Anaphoric Relations and Lexical Chains.” In Proceedings of the Linguistic

Annotation Workshop, 140–147. http://www.aclweb.org/anthology/W/W07/W07-1523.pdf.

Stührenberg, Maik, and Daniel Jettka. 2009. “A Toolkit for Multi-dimensional Markup: The

Development of SGF to XStandoff.” Proceedings of Balisage: The Markup Conference 2009. Vol. 3 of

Balisage Series on Markup Technologies. doi:10.4242/BalisageVol3.Stuhrenberg01.

Stührenberg, Maik, Antonina Werthmann, and Andreas Witt. 2012. “Guidance through the

Standards Jungle for Linguistic Resources.” In Proceedings of the LREC 2012 Workshop on Collaborative

Resource Development and Delivery, 9–13.

Thompson, Henry S., and David McKelvie. 1997. “Hyperlink Semantics for Standoff Markup of

Read-only Documents.” In Proceedings of SGML Europe ’97: The next decade – Pushing the Envelope,

227–229.

Widlöcher, Antoine, and Yann Mathet. 2009. “La plate-forme Glozz : environnement d’annotation

et d’exploration de corpus”. In Actes de la 16e conférence sur le Traitement Automatique des Langues

Naturelles (TALN 2009) – Session posters. http://www-lipn.univ-paris13.fr/taln09/pdf/

TALN_120.pdf.

Witt, Andreas. 2004. “Multiple Hierarchies: New Aspects of an Old Solution.” Proceedings of

Extreme Markup Languages, Montréal. http://conferences.idealliance.org/extreme/html/2004/Witt01/EML2004Witt01.html.

Witt, Andreas, Daniela Goecke, Maik Stührenberg, and Dieter Metzing. 2011. “Integrated

Linguistic Annotation Models and Their Application in the Domain of Antecedent Detection”. In

Modeling, Learning and Processing of Text Technological Data Structures, edited by Alexander Mehler,

Kai-Uwe Kühnberger, Henning Lobin, Harald Lüngen, Angelika Storrer, and Andreas Witt, 197–

218. Berlin: Springer.

NOTES

1. See the projects’ websites at http://www.clarin.eu/ and http://www.flarenet.eu/,

respectively, for further information.

2. The website located at http://www.tc37sc4.org/ provides some further information.

3. P-members are contrasted with O-members, who only observe but still have the right

to comment on the process.

4. If no negative votes are cast the DIS proceeds to the publication stage immediately.

5. See Dalby et al. (2004) for further details about the design philosophy of this special

standard.

6. Apart from the specifications discussed in this section there are of course other

standards that may be of interest, such as the Lexical Markup Framework (LMF, ISO

24613:2008). However, due to space restrictions we limit the discussion to the

annotation formats described in this article. We will not discuss in detail any metadata

standards, such as ISO 12620:2009 (Data Category Registry, DCR), which can be used

together with generic annotation formats to provide further semantics for a

linguistically encoded text.

7. For an overview of HPSG, see Pollard and Sag (1987, 1994).

8. See Langendoen and Simons (1995) for a discussion of the TEI recommendations for feature

structure markup.

9. See http://www.isocat.org for more information about both ISO 12620:2009 and

about the ISOcat registry.

10. The current version of MAF includes the notion that “character offsets may be

sufficient” in the simplest case.

11. The original example was taken from http://korpling.german.hu-berlin.de/tiger2/

homepage/tiger1.html and was adapted to meet further MAF requirements.

12. Potsdam Interchange Format for Linguistic Annotation.

13. Early usage of stand-off annotation can be found in the second phase of the TIPSTER

project in 1996. A discussion of the concept can be found in Thompson and McKelvie

(1997). The P3 version of the TEI did not include the term stand-off as such but

supported the connection of analytic and interpretive markup outside of textual

markup and embedded markup (Chapter 14.9). The current P5 includes a whole chapter

dealing with stand-off markup (Chapter 16.9).

14. One has to admit that one of the disadvantages of the TEI is the fact that it

frequently allows too many ways of annotating a certain text feature. This can also be

seen as a limiting compromise.

ABSTRACTS

The TEI has served for many years as a mature annotation format for corpora of different types,

including linguistically annotated data. Although it is based on the consensus of a large

community, it does not have the legal status of a standard. During the last decade, efforts have

been undertaken to develop definitive de jure standards for linguistic data that not only act as a

normative basis for the exchange of language corpora but also address recent advancements in

technology, such as web-based standards, and the use of large and multiply annotated corpora.

In this article we will provide an overview of the process of international standardization and

discuss some of the international standards currently being developed under the auspices of ISO/

TC 37, a technical committee called “Terminology and other Language and Content Resources”.


After that the relationship between the TEI Guidelines and these specifications, according to

their formal model, notation format, and annotation model, will be discussed. The conclusion of

the paper provides recommendations for dealing with language corpora.

INDEX

Keywords: feature structures, ISO/TC 37/SC 4, Linguistic Annotation Framework (LAF), Morpho-

Syntactic Annotation Framework (MAF), standards, Syntactic Annotation Framework (SynAF)

AUTHOR

MAIK STÜHRENBERG

Maik Stührenberg received his Ph.D. in Computational Linguistics and Text Technology from

Bielefeld University in 2012. After graduating in 2001, he worked on various projects at Justus-

Liebig-Universität Gießen, Bielefeld, and at the Institut für Deutsche Sprache (IDS, Institute for

the German Language) in Mannheim. He is currently employed as a research assistant at Bielefeld

University and is involved in NA 105-00-06 AA, the German mirror committee of ISO TC37 SC4.

His main research interests include specifications for structuring multiply annotated data

(especially linguistic corpora), query languages, and query processing.


A TEI P5 Document Grammar for the IDS Text Model

Harald Lüngen and C. M. Sperberg-McQueen

1. Introduction

1 The Institut für Deutsche Sprache (IDS) in Mannheim, Germany, hosts the German

Reference Corpus (DEREKO), the largest archive in the world of corpora of contemporary

written German. With over 5 billion word tokens,1 DEREKO contains fiction, scientific

texts, newspaper articles, and a wide variety of other text types. The corpora in DEREKO

have been collected since 1964 and are licensed for academic use via the IDS corpus

access platform COSMAS II.2 They are used by linguistics researchers at the IDS and at

other institutions around the world.

2 All corpora within DEREKO are marked up with metadata and annotations according to

the IDS text model, which is currently realized in IDS-XCES, an IDS-specific adaptation

of the XCES corpus encoding standard (Ide et al. 2000). This paper describes the

features of the IDS text model and our ongoing project, named I5 (short for “IDS-TEI P5”), in which we are preparing a TEI P5 ODD document for this text model. Since DEREKO

is not available for direct download, a migration to TEI P5 had not been highly

prioritized (since no one outside the IDS would directly benefit from such a

conversion). However, it is hoped that a TEI P5 document grammar for the IDS text

model will facilitate the building and maintenance of quality assurance tools, will

enable the IDS to abandon the older in-house annotation format and therefore enable

new project members to familiarize themselves more quickly and easily with the

model. In the long run, we hope that the migration to TEI P5 will contribute to a

harmonization and standardization process in which tools will be produced that are

able to deal with large-scale TEI data (cf. Kupietz et al. 2010).

3 This paper begins with background on the nature and purposes of the corpora collected

at IDS and the motivation for the I5 project (section 1). It continues with a description

of the origin and history of the IDS text model (section 2), and a description (section 3)


of the techniques used to automate, as far as possible, the preparation of the ODD file

documenting the IDS text model. It ends with some concluding remarks (section 4). A

survey of the additional features of the IDS-XCES realization of the IDS text model is

given in an appendix.3

2. Origin and History of the IDS Text Model

4 IDS researchers at different locations in Germany initially created corpora for specific

research purposes, each encoded using a home-grown encoding scheme. Examples of

these early corpora are the Wendekorpus,4 the Bonner Zeitungskorpus,5 and the

Mannheimer Korpora 1 and 2,6 all of which are available to this day as part of the German

Reference Corpus, DEREKO. A unified text and annotation format for all IDS corpus texts

was first introduced in 1991.

2.1. BOT

5 The first attempt at a unified pivot format for IDS corpora was called BOT (an acronym

formed out of the initials of “Beginning of text”). It grew out of the COSMAS (Corpus

Search, Management and Analysis System, later known as COSMAS I) project, which lasted

from 1991 to 2003. Among its goals were the integration of the various existing single

corpora into a common representation format, the centralization of corpus acquisition

and encoding activities at the IDS, and the development of corpus access software for

linguistic research. The first version of BOT was defined by Cyril Belica of IDS in 1992

and remains the basis of the IDS text model. It is a character-based format whose header part contains bibliographic metadata expressed in seven data fields (see table 1 and the example below it), which form the minimum of bibliographic data. Each field occupies a single line in the file and has a two-part structure (field_name:value_string).

Field name | Semantics
BOTC | corpus identifier
BOTD | document identifier
BOTT | text identifier
BOTd | resolved document identifier
BOTt | elaborated bibliographic reference
BOTi | reduced bibliographic reference
BOTP | processing information: is page numbering encoded in the corpus text as in the source or not?

Table 1: The seven fields of the BOT minimum

BOTC:DIV

BOTD:WC4

BOTT:WC4.04004

BOTd:Christa Wolf: Essays/Gespräche/Reden/Briefe

1959-1974

BOTt:DIV/WC4.04004 Wolf, Christa: Das siebte Kreuz,

[Nachwort], (Entstehung: 1963), In: Wolf, Christa:

Werke, Bd. 4, Essays/Gespräche/Reden/Briefe 1959-

1974, Hrsg.: Hilzinger, Sonja. - München:

Luchterhand Literaturverlag, 1999, S. 24-41

BOTi:DIV/WC4.04004 Wolf, Christa: Das siebte Kreuz,

[Nachwort], (Entstehung: 1963), In: Wolf, Christa:

Werke, Bd. 4, Essays/Gespräche/Reden/Briefe 1959-

1974, Hrsg.: Hilzinger, Sonja. - München:

Luchterhand Literaturverlag, 1999, S. 24-41

BOTP:1

Example of a corpus text header as a BOT minimum. The line breaks within the fields are not present in the original.

6 The fields BOTC, BOTD, and BOTT reflect a three-level hierarchical structure in this

model: corpus, document, and text. A corpus contains one or more documents, and a

document contains one or more texts, and each corpus, document, or text would have

such a header. In the model, a text is defined as a relatively independent, coherent sequence

of natural language utterances that has emerged from natural communicative situations.7 A

text may comprise, for example, one or sometimes several newspaper articles, a journal

article, a short story, or an extract of a literary work. Texts are combined to form a

document according to certain aspects such as source, chronological sequence, topic, or

text type—for example, texts from one edition of a particular day’s newspaper would

form one document. However, not every document contains more than one text: a

corpus of the collected works of one author would contain one document per novel,

each of which would include a single text.

7 The fields BOTd, BOTt, and BOTi contain the bibliographic reference in different

degrees of detail,8 each of which was needed for different presentation modes (for

example, as part of a corpus overview or of a KWIC view of query results).

8 Later, the field BOT+ent (for “Entstehungszeit”, the time of creation, if known, or

otherwise of the first edition) was also included in the BOT minimum because the

(approximate) year when a literary work was actually written can differ considerably

from the year of publication of the source used in the composition of the corpus. The

collected works of Thomas Mann, for instance, all have 1960 as their year of

publication, while his first novel, Buddenbrooks, first appeared in 1901. If only the date

of publication were recorded, discrepancies between the date of composition and the

date of publication would distort linguistic analyses of language variation over time.


9 BOT also included a number of “surrounding tags” for inline annotations such as b+…

+b for a caption or u+…+u for a heading, using what is sometimes known as the

“Mannheim Conventions” (“Mannheimer Konvention”, MK). This convention was based

on markup as used in several of the earlier corpora (see, for example, Kolvenbach

1989).

10 Within the COSMAS project, all existing IDS corpora (about 28 million tokens) were

converted into the BOT/MK format using a set of conversion scripts.9 By 1993 they were

accessible via the new corpus research system, also named COSMAS, to researchers at

the IDS, and by 1996 to researchers all over the world via a web interface.

11 While the P2 version of the TEI Guidelines had been published in 1992, IDS staff chose

not to adopt the TEI at that time, both because the IDS was not yet receiving any text

data in SGML and because COSMAS had already been designed to use BOT/MK syntax.

12 Many useful types of information were only implicit in BOT/MK or were missing

entirely and therefore unavailable for researchers to use in queries. Moreover, the

follow-up project COSMAS II had started in 1995, and one of its goals was to allow the

creation of virtual corpora (cf. Kupietz and Keibel 2009), but the original BOT/MK

format did not contain all the fields necessary to do this. Consequently, in the years

1993–1998 many more fields were added to the BOT header, in particular new fields for

the components of bibliographic information (this information had been included as an

unparsed string in the first version of BOT). The more recent field names all start with

‘BOT+’, e.g. BOT+a (author), BOT+ti (title), BOT+u (subtitle), BOT+X (text type), BOT+b

(volume), BOT+in (title of a collection in which the document or text was contained),

and the above-mentioned BOT+ent. Altogether the revised BOT header has 38 fields

available. Moreover, two basic templates of a BOT header were defined, each specifying

a subset of the full set of BOT fields: Template 1 was used for independent works and

dependent works contained in collections, and Template 2 was used for newspaper and

journal articles.10
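For illustration, a fragment of such an extended header might look as follows. The field names are those listed above; the values are drawn from the Christa Wolf example given earlier, but their distribution over the component fields is shown here only schematically and is not reproduced from an actual corpus header:

BOT+a:Wolf, Christa
BOT+ti:Das siebte Kreuz
BOT+b:4
BOT+in:Essays/Gespräche/Reden/Briefe 1959-1974
BOT+ent:1963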

13 New texts to be added to the corpora were encoded according to the new version of

BOT/MK. The values of the fields BOTd, BOTt, and BOTi, which contained the various

versions of a bibliographic text string, were now automatically assembled at a later

stage of the conversion from the fields that contained the components. By 1998, the IDS

corpora comprised approximately 260 million tokens.

2.2. Conversion to IDS-CES

14 The year 1999 saw the start of DEREKO, a project for the acquisition and annotation of a

German Reference Corpus (Deutsches Referenzkorpus),11 conducted in cooperation with

the universities of Stuttgart and Tübingen and lasting until 2002, when it reached 1.8

billion tokens.12 Two important goals of DEREKO were, first, mass acquisition of texts by

obtaining licenses from publishing houses and individuals, and, second, the use of CES,

a new corpus encoding standard (Ide 1998) based on TEI. Between 1998 and 2003, a

mapping of all BOT/MK fields and inline markup into the CES structure of elements and

attributes was specified. Certain features of the BOT/MK markup, however, could not

be rendered within the CES markup; therefore, additional elements and attributes were

defined on top of CES, yielding IDS-CES, the IDS-specific adaptation of CES. As far as

possible, the additional elements and attributes were taken from the TEI P3 Guidelines

(ACH/ACL/ALLC 1994), but several had to be defined totally outside CES and TEI, with


care taken to name and define them in the style of CES. In particular, it was decided

that the three-way hierarchical structure with the units corpus, document, and text

should be retained, although CES/TEI provided only <cesCorpus> and <cesDoc>.

Hence, <idsCorpus>, <idsDoc>, and <idsText> were defined to replace these.

Another element that was newly introduced in IDS-CES is <creatDate> for the time

of creation, i.e. for the value of the field BOT+ent.13 Initially, IDS-CES was used as an

exchange format only, as COSMAS still employed BOT/MK internally. Newly acquired

texts (some of which arrived in SGML) were first encoded in BOT/MK, and a converter

(TRADUCES14) was developed to transform the new and old BOT texts into IDS-CES.

15 Starting in 2001, the BOT/MK format was extended again, this time under the name

“BOTX”. For BOTX, new markup was defined: u+zz+, u+zzz+, etc. for sub-headings at

different levels, li+ for list items, and other tags for tables, preface, table of contents,

footnotes (which had previously not been marked up or had even been removed), and

more textual features. The idea behind this extension was that all the features of a

document—including not only previously unrepresented layout features but also tables

of contents and imprint information—should be representable within the IDS text

model so that the source document’s layout would be reconstructible. The endeavor

was also inspired by the many elements and attributes offered by CES for document

features that were not captured by BOT/MK. Many of these features were in fact

already marked up in the SGML source documents that the IDS received but then

dropped in the BOT/MK representation, so it seemed worthwhile to make an effort to

retain them. For some time, all incoming texts were converted to BOTX using small

specialized programming routines that a programmer designed by checking the

original layout in the corresponding hardcopy edition of the text. Since all the older

corpus texts remained in plain BOT/MK, all BOTX texts first had to be converted to

BOT/MK (that is, some markup had to be removed automatically) for their integration

in COSMAS I.

16 BOTX and BOT/MK still had some flaws. For instance, the order of MK annotations (the

inline annotations) is not fixed and was sometimes unclear—for example, when a

passage is in a foreign language, is a quotation, and is printed in italics. Moreover, some

tags are ambiguous character sequences that happen to appear in the source text, so

since around 2004 several alternative tags taken from the TEI have been introduced in

BOT/MK, such as <line>...</line> instead of ‘…/’. Incoming texts were marked

up with the new tags and then converted to IDS-CES, but the existing corpora were not

retroactively changed.

17 For completeness, we would like to mention that around 2007, some more new markup

was added to BOT/MK only, namely the three fields BOT+D, BOT+V, and BOT+R for

specifying the results of the IDS duplicate detection module, and the field BOT+th for

results from the IDS thematic classification module.15 These fields are mapped to

<classDecl> and sub-elements in IDS-CES.

18 IDS-CES was introduced as the internal corpus representation format in COSMAS II, the

successor of the research software COSMAS I, which was finally taken out of service in

2003. Under COSMAS II, the BOTX texts were directly converted to IDS-CES without loss

of information.16


2.3. Conversion to IDS-XCES

19 In 2000, the first XCES specification was released (Ide, Bonhomme, and Romary 2000),

in which the SGML-based Corpus Encoding Standard was redefined on an XML basis. In

2006, an IDS-XCES DTD was developed, consisting of the XCES DTD with the addition of

those elements and attributes that had already been added to CES to form IDS-CES.

20 The corpus archive (containing around 2.4 billion tokens at that time) was converted in

2006. The mapping from IDS-CES to IDS-XCES was entirely automated using XSLT.17

(The differences between XCES and IDS-XCES are described in the appendix.) In 2008,

IDS-XCES was introduced as the internal corpus data format in COSMAS.

21 All incoming texts are still initially encoded in the IDS pivot format BOT/MK or its

extension BOTX. So the chain of conversions for new text data to be integrated in

COSMAS II is currently original format → BOT(X) → IDS-CES → IDS-XCES.

22 The following diagram illustrates the long and complex development of IDS-XCES

described above, by which this format was derived from TEI P3 and TEI P4 (TEI

Consortium 2001), through CES and XCES and the local changes at IDS. As a result of the

long chain of derivation, the relationship between the text model of the TEI Guidelines

and the IDS text model (and the corresponding differences between the two markup

systems) are hard to take in at a glance.

Figure 1: Development of the IDS-XCES DTD

23 In fact, one motivation for preparing a TEI P5-based ODD file for the IDS text model is to

make the relation between the two text models simpler and clearer.

24 The Appendix gives a brief account of IDS-XCES, mainly by specifying its differences to

the original XCES.


3. Preparation of an ODD File

25 As the summary above has made clear, elements and attributes from TEI P3 and TEI P4

come into IDS-XCES through two channels: some are inherited through XCES, while

others are retroactively added in IDS-XCES.

26 The goal of the I5 project is to reorganize the definition of the IDS vocabulary as a

single set of modifications taking TEI P5 as its base and using the new customization

mechanism specified in TEI P5, which uses an ODD (“one document does it all”) file to

specify a particular customization instead of relying on the customization mechanisms

built into a particular schema language. TEI P5 defines a specific XML tag set for use in

ODD files and prescribes an algorithm for processing ODD files to generate customized

versions of the TEI encoding scheme. This prescribed algorithm is implemented by

software available from the TEI Consortium under the name Roma. As indicated in the

diagram below, Roma reads the TEI P5 specification of the vocabulary and the ODD file

provided by the user and generates on demand from them a DTD, a schema document

in Relax NG or XSD notation, or reference documentation for the elements and

attributes included in the specified customization.

Figure 2: Generating document grammars and documentation using Roma
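To give a first impression of what such a customization looks like, the following is a schematic sketch of an ODD file rather than the actual I5 ODD; the identifier i5, the choice of start element, and the selection of modules are illustrative assumptions:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata describing the customization -->
  </teiHeader>
  <text>
    <body>
      <!-- the schemaSpec element gathers the declarations of the customization;
           its ident and start values are assumptions made for this sketch -->
      <schemaSpec ident="i5" start="idsCorpus">
        <moduleRef key="tei"/>
        <moduleRef key="core"/>
        <moduleRef key="header"/>
        <moduleRef key="textstructure"/>
        <!-- element-level modifications (deletions, renamings, additions)
             follow here as elementSpec declarations, or are pulled in from
             specGrp fragments elsewhere in the document via specGrpRef -->
      </schemaSpec>
    </body>
  </text>
</TEI>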

27 The immediate concrete goal of the I5 project, therefore, is to prepare an ODD file

which will, when processed by Roma, produce document grammars suitable for use by

IDS in processing the DEREKO archive.

3.1. Conditions for the Language as a Set of Documents

28 What language should those document grammars describe? The core requirements for

the language to be defined can be summarized schematically in this way:

I5 : P5 ≃ IX : P3

L(I5) ±⊆ L(P5)

L(I5) ≡ L(IX)

29 In these and subsequent formulae, the following abbreviations are used for brevity:

P3, P4, and P5 denote the document grammars (or in some cases the languages defined by

those grammars) of TEI P3, TEI P4, and TEI P5 respectively.

CES and XCES denote the document grammars of the original Corpus Encoding Standard and

its XML revision.


IX denotes the document grammar of the current IDS-XCES DTD (as realized in file ids-

xces.dtd).

I5 denotes the grammar developed by this project.

For any document grammar x, L(x) denotes the language recognized (or defined) by x.

E(x) denotes the set of elements defined in an XML vocabulary or document grammar x.

DRK denotes the German Reference Corpus (Deutsches Referenzkorpus) DEREKO, viewed as a

set of documents (and thus as a language defined by enumeration of its sentences).

30 With these notational conventions in place, we can reformulate the core requirements

for the I5 project.

31 First, the I5 vocabulary should stand in roughly the same relation to TEI P5 as the

current IDS-XCES vocabulary stands to TEI P3 (or TEI P4):

I5 : P5 ≃ IX : P3

32 Second, the language should be (more or less) a subset of the language defined by TEI

P5:

L(I5) ±⊆ L(P5)

33 Third, the language defined by the I5 document grammar should at least in principle be

equivalent, or nearly equivalent, to that defined by IDS-XCES:

L(I5) ≡ L(IX) (?)

34 Since one goal of the project is to incorporate in I5 the improvements on TEI P3

incorporated in TEI P5, absolute equivalence in all details is probably not in fact

desirable (hence the addition of the question mark vis-à-vis the initial formulation).

35 Some further requirements or desiderata can also be identified and expressed

formulaically. If absolute equivalence is not the goal, then the language of I5 needs to

be constrained in other ways. The language of I5 should be similar to that of IDS-XCES,

even if not absolutely equivalent. That is, perhaps strict equivalence (≡) should be

replaced by similarity (≃):

L(I5) ≃ L(IX)

36 Every document in DEREKO should be legal against the new document grammar:

DRK ⊆ L(I5)

37 Empirically, it can be observed that DEREKO exercises only a proper subset of the current

document grammar IDS-XCES: for example, there are a number of attributes defined in

the grammar which do not in fact occur in the corpus. It is a design decision (still being

considered) whether to retain such currently unused constructs, in the expectation

that they may be used later, or to eliminate them so as to make the language of I5 be

more nearly equivalent to that of DEREKO:

DRK ⊊ L(I5) ?

DRK = L(I5) ?

3.2. Realization of a Grammar in ODD

38 To specify a particular customization of the TEI vocabulary, an ODD file must specify

the inclusion or exclusion of individual:

TEI modules

elements within a module

attributes within a module


39 Some examples may help make it clear how this is done.

40 For example, the following ODD-file fragment includes the tei and core modules in a

customization:

<TEI ...> ...

<specGrp xml:id="specgroup-core">

<moduleRef key="tei"/><!-- required -->

<moduleRef key="core"/>

<!--* abbr address analytic author bibl biblScope biblStruct

* corr date distinct editor foreign gap gloss head

* hi imprint item l label lb lg list measure mentioned

* monogr name note num orig p pb ptr pubPlace publisher

* q quote ref reg respStmt sp speaker stage term time title

*-->

<p>Delete unneeded elements.</p>

<specGrpRef target="#specgroup-core-deletions"/>

<p>Rename some elements.</p>

<specGrpRef target="#specgroup-core-renamings"/>

</specGrp>

...

</TEI>

41 Individual elements may be excluded by specifying mode="delete" on an

appropriate <elementSpec> element:

<elementSpec ident="add" module="core" mode="delete"/>

<elementSpec ident="addrLine" module="core" mode="delete"/>

<elementSpec ident="binaryObject" module="core" mode="delete"/>

<elementSpec ident="cb" module="core" mode="delete"/>

<elementSpec ident="choice" module="core" mode="delete"/>

42 I5 must deal with several different sets of elements:

Some elements should be incorporated from TEI P5. TEI P5 elements not present in IDS-

XCES, on the other hand, should be excluded.

Elements present in XCES but not present in TEI P5 must be defined. (They could be taken

over from an XCES ODD file, if one existed, but there is not currently any ODD-defined

version of XCES.)

Additional elements found in IDS-XCES but not in XCES or TEI P5 must be defined.

43 That is, E(I5) =

(E(IX) ∩ E(P5))

∪ (E(IX) ∩ (E(XCES) ∖ E(P5)))

∪ (E(IX) ∖ (E(XCES) ∪ E(P5)))


44 Note that the elements in the last group are not necessarily IDS extensions to XCES:

they may also include elements in TEI P3 which are inherited by IDS-XCES from TEI P3,

but which are no longer included in the TEI vocabulary in version P5.

45 It is possible to identify the elements which belong in each of the subsets described

manually, given sufficient patience and capacity for tedious detail. It is significantly

more convenient, however, to make the machine help us in the task. This can be done

in a three-step process:

Encode the relevant document grammars as XML documents.

Compare them using XQuery.

Generate the appropriate ODD declarations automatically.

46 A number of tools exist which can provide XML representations of DTDs. For the work

described here, we have used a simple application based on SWI Prolog (Wielemaker

n.d.), which loads a DTD and emits an XML representation of the DTD. The following

example shows a fragment of the IDS-XCES document grammar in this representation:


<dtd>

<desc>This document

(<code>2011/blackmesatech/IDS/interim/ids_xces_onefile.v3.xml</code>)

is an XML representation of

<code>2011/blackmesatech/IDS/interim/onefile.dtd</code> made by

dtdxml.pl on <date value="2011-09-11">2011-09-11</date></desc>

<elemdecl gi="gloss">

<star>

<or>

<elem>#pcdata</elem>

<or>

<elem>abbr</elem>

<or>

<elem>date</elem>

<or>

<elem>num</elem>

<or>

<!--* ... *-->

</or>

</or>

</or>

</or>

</or>

</star>

</elemdecl>

<attlist gi="gloss">

<att>

<name>id</name>

<type>id</type>

<dft>

<implied/>

</dft>

</att>

<att>

<name>n</name>

<type>cdata</type>

<dft>

<implied/>

</dft>

</att>

<att>

<name>xml:lang</name>

<type>cdata</type>

<dft>

<implied/>

</dft>


</att>

<!--* ... *-->

</attlist>

</dtd>

47 It is then a straightforward task to use XQuery to identify the first set of elements: IDS

elements which appear in TEI P5:

(: find the IDS elements that appear in the TEI Guidelines :)

declare namespace TEI = "http://www.tei-c.org/ns/1.0";

declare variable $dir.TEI := "file:/home/TEI";

declare variable $dir.IDS := "file:/Users/cmsmcq/2011/blackmesatech/IDS";

declare variable $P5 := doc(

concat($dir.TEI,

"/P5/Source/Guidelines/en/guidelines-en.xml"

));

declare variable $ids-xces := doc(

concat($dir.IDS,

'/interim/ids_xces_onefile.v3.xml'

));

<elements>{

for $e in $ids-xces/dtd/elemdecl

let $gi := string($e/@gi),

$elemspec := $P5//TEI:elementSpec

[@ident = $gi]

where $elemspec

order by $gi

return <e gi="{$gi}" module="{$elemspec/@module}"/>

}</elements>

48 After setting up variables for the text of TEI P5 and the XML encoding of the IDS-XCES

DTD ($P5 and $ids-xces, respectively), the query identifies each element name in

the IDS-XCES DTD ($gi, for generic identifier) and then finds ($elemspec) the

specification for that element in TEI P5, if there is one. If such an element specification

exists in TEI P5, then an XML element is returned giving the name of the element and

its module.

49 Once the basic query is formulated, it is simple to modify the return statement to

return instead the appropriate ODD declaration for the element:


(: ... :)

return <elementSpec module="{$module}"

ident="{$teigi}" mode="change">

<altIdent>{$idsgi}</altIdent>

</elementSpec>

(: ... :)

50 Similar queries can be constructed to generate appropriate ODD declarations for

elements to be suppressed from TEI P5 or added to it.
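As a sketch of how this might be done for the deletions, the following query inverts the comparison above: it walks through the element specifications of TEI P5 and emits a mode="delete" declaration for every element that has no declaration in the IDS-XCES DTD. The variable declarations repeat those of the query shown earlier; the grouping of the output into a <specGrp> and its xml:id are illustrative choices, not part of the actual I5 workflow:

(: generate deletion declarations for TEI P5 elements absent from IDS-XCES :)
declare namespace TEI = "http://www.tei-c.org/ns/1.0";
declare variable $dir.TEI := "file:/home/TEI";
declare variable $dir.IDS := "file:/Users/cmsmcq/2011/blackmesatech/IDS";
declare variable $P5 := doc(
  concat($dir.TEI,
    "/P5/Source/Guidelines/en/guidelines-en.xml"
));
declare variable $ids-xces := doc(
  concat($dir.IDS,
    '/interim/ids_xces_onefile.v3.xml'
));

<TEI:specGrp xml:id="specgroup-p5-deletions">{
  for $elemspec in $P5//TEI:elementSpec
  let $gi := string($elemspec/@ident)
  (: keep only P5 elements with no declaration in the IDS-XCES DTD :)
  where empty($ids-xces/dtd/elemdecl[@gi = $gi])
  order by $gi
  return
    <TEI:elementSpec ident="{$gi}"
      module="{$elemspec/@module}"
      mode="delete"/>
}</TEI:specGrp>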

3.3. Documentation

51 The ODD file is designed as a form of literate program which allows us to embed the

formal declarations of the document grammar in a human-readable document and

intertwine the schema with the documentation. The I5 project endeavors to make the

relation of TEI P5 to the IDS customization of TEI P5 easier to understand by treating

the ODD file itself not, as is sometimes done, primarily as input to Roma but primarily

as a document intended for human readers. The screen shot below illustrates the

principle: it shows the part of the document beginning with the ODD fragment given

above, which embeds the tei and core modules of TEI P5, in a style derived from the

IDS house style for Web pages.

52 A significant part of the effort in the I5 project is the preparation of appropriate tag-set

documentation for IDS-specific elements and attributes and for IDS-specific usages for


standard TEI and XCES constructs. Descriptions of the elements and attributes of the

document grammar are taken in part from TEI P5, in part transcribed from the XCES

documentation, and in part written from scratch. The individual element specifications

are embedded in the ODD file, as can be seen in the screen shot below; they can also be

extracted by Roma and integrated with documentation for standard TEI elements in the

form of reference documentation.

4. Conclusions

53 The history of the IDS text model and its markup reflects, in its individual way, several

important trends in the processing of natural language data generally and in corpus

linguistics more particularly.

54 From the earliest beginnings, corpus data at IDS used markup to record important

information about the text and to make explicit certain information within the text

which would otherwise have been inaccessible to automatic processing. The early

collections, however, all used idiosyncratic markup. Because of the large (for the times)

volume of data it had collected, IDS became convinced earlier than some projects of the

need to develop a standardized system of text representation. Like many who are early

to perceive the need for standardization, IDS developed its own standard format, in the

form of BOT. This standardization effort paid off: it made feasible the significant

investment in infrastructure represented by the COSMAS I project.

55 During COSMAS II, TEI markup was introduced in the form of CES. The broad coverage,

non-prescriptive approach, and sheer size of TEI P2, P3, and P4 made them daunting to

many prospective users: hard to understand and thus hard to adopt. CES and XCES,

which took a more focused, domain-specific approach, were more prescriptive, smaller,

and easier to understand; in consequence, they were easier for IDS to adopt as the basis


for its SGML and XML formats. Experience showed, however, that some TEI constructs

omitted from CES as unnecessary for corpus-linguistic work were needed, after all, to

handle the broad variety of texts and textual constructs which turn up in large corpora

like DEREKO. So IDS-CES and IDS-XCES found it necessary to bring some elements and

attributes back from TEI P3 and P4.

56 With I5, the IDS text model is directly derived from the TEI text model; the relation of

I5 to TEI, defined as it is by a single ODD file, will be somewhat easier to discern than

the relation of IDS-CES to TEI P3 or of IDS-XCES to TEI P4. The relation to XCES will still

be relatively easy to identify: by comparing the I5 ODD file to the extension files of

XCES, any reader will be able to see which TEI elements are retained in one

customization but not the other, and which additional elements and attributes are

common to the two.

Appendix: Features of IDS-XCES

57 As indicated in Section 2.3, the format IDS-XCES is based on the XCES document

grammar as defined in Ide, Bonhomme, and Romary (2000) and the XCES DTD files.18

These DTD files have been taken and modified as necessary for the IDS text model as

described below. The IDS-XCES document grammar comprises the files ids-xcesdoc.dtd, ids-lat1.ent, ids.xcustomize.ent, and

ids.xheader.elt;19 the XCES DTD files xcesAlign.dtd and xcesAna.dtd have

no equivalents among the IDS-XCES DTD source files. The former file has no equivalent

because there is no need to align corpus data in the monolingual German reference

corpus. The latter has no equivalent because linguistic annotations, apart from the

sentence segmentation, play (almost) no role in the IDS text model. Instead, several

layers of linguistic markup are provided as standoff annotation in separate files.20 Still,

IDS-XCES allows the specification of morphosyntactic annotations in attributes of the

<w> element: for a small number of corpora, there are versions of IDS-XCES documents

with inline linguistic annotations added to the element <w>.

58 In IDS-XCES, some IDS-specific elements and attributes have been added to the original

XCES, and in doing so, some of the XCES content models have been modified. These

additional elements and attributes can be grouped into those that are essentially

(context-dependent) renamings of XCES elements, those that have been taken from the

TEI P3 (or P4) specification (such as <textDesc> and <front>) and those that are

neither in XCES nor in TEI P3 (or P4).

59 In the following sections, we will give a summary of the most important features of IDS-

XCES compared with XCES. We give examples of elements and their characteristics,

without presenting complete content models including all attributes. The complete

changes are documented formally in a synopsis at http://www.ids-mannheim.de/kl/

projekte/korpora/idsxces.html.

A.1. Corpus Structure and Header

Element name | Possible parents | Modeled on | Meaning | Example
<idsCorpus> | XML document root | XCES: <cesCorpus> | corpus |
<idsDoc> | <idsCorpus> | XCES: <cesDoc> | document |
<idsText> | <idsDoc> | XCES: <cesDoc> | text |
<idsHeader> | <idsCorpus>, <idsDoc>, <idsText> | XCES: <cesHeader> | header |
<korpusSigle> | <titleStmt> | IDS-specific | corpus ID (formerly BOTC) | <korpusSigle>DIV</korpusSigle>
<dokumentSigle> | <titleStmt> | IDS-specific | document ID (formerly BOTD) | <dokumentSigle>DIV/SGP</dokumentSigle>
<textSigle> | <titleStmt> | IDS-specific | text ID (formerly BOTT) | <textSigle>DIV/SGP.00000</textSigle>
<c.title> | <titleStmt> | IDS-specific | corpus title |
<d.title> | <titleStmt> | IDS-specific | document title |
<t.title> | <titleStmt> | IDS-specific | text title |
<pagination> | <editorialDecl> | IDS-specific | whether page numbering is present or not (processing information; formerly BOTP) | <pagination type="yes"/>*

* Pagination information is included in a @type attribute, which is available for many elements in both XCES and TEI.

Table 2: Examples of elements added for the description of the corpus structure and header

60 Those elements that are essentially renamings of XCES elements are the high-level

components <idsCorpus>, <idsDoc>, and <idsText>—representing the three-

level corpus structure of the IDS text model (from <cesCorpus> and <cesDoc>)—

and <idsHeader> (from <cesHeader>).

61 In the content model for the <idsHeader>, the CES element <titleStmt> has been

substantially revised to contain one of <korpusSigle>, <dokumentSigle>, or

<textSigle> and one of <c.title>, <d.title>, or <t.title> to mark the ID

and title of a corpus, document, or text (respectively).
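Schematically, and using the identifiers from the examples in table 2, the overall structure of an IDS-XCES corpus can thus be pictured as follows; this is a simplified sketch in which attributes, the elements between <idsHeader> and <titleStmt>, and most other header content are omitted, so that only the nesting of the renamed high-level components is visible:

<idsCorpus>
  <idsHeader>
    <!-- corpus-level header; intervening structure omitted -->
    <titleStmt>
      <korpusSigle>DIV</korpusSigle>
      <c.title>...</c.title>
    </titleStmt>
  </idsHeader>
  <idsDoc>
    <idsHeader>
      <titleStmt>
        <dokumentSigle>DIV/SGP</dokumentSigle>
        <d.title>...</d.title>
      </titleStmt>
    </idsHeader>
    <idsText>
      <idsHeader>
        <titleStmt>
          <textSigle>DIV/SGP.00000</textSigle>
          <t.title>...</t.title>
        </titleStmt>
      </idsHeader>
      <text>
        <!-- the corpus text itself -->
      </text>
    </idsText>
  </idsDoc>
</idsCorpus>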


A.2. Front and Back Matter

Element name | Possible parents | Modeled on | Meaning | Example
<front> | <text> | TEI P3/P4: <front> | front matter |
<back> | <text> | TEI P3/P4: <back> | back matter |
<titlePage> | <front> | TEI P3/P4: <titlePage> | title page |
<docTitle> | <titlePage> | TEI P3/P4: <docTitle> | document title as part of the source | <docTitle><titlePart type="main"><s>Jacques Hilarius Sandsacks Psychoschmarotzer</s></titlePart><titlePart type="desc"><s>Roman</s></titlePart></docTitle>*
<docImprint> | <front> | TEI P3/P4: <docImprint> | imprint | <docImprint>Aufbau-Verlag</docImprint>

* Since <docTitle> contains the title as it occurs printed in the source, it is part of the object text, and can be divided into sentence-like divisions marked by <s>.

62 One group of non-XCES elements found in IDS-XCES includes <front> and <back>

and their child elements, all of which were taken from the TEI P3 (or P4) specifications.

A.3. Drama

Element name | Possible parents | Modeled on | Meaning | Example
<stage> | <div>, <sp>, <s> | TEI P3/P4: <stage> | stage direction or extra-linguistic event in debate | <stage>(Beifall bei der CDU/CSU und der FDP)</stage>

63 For the encoding of drama and records of parliamentary debates, the element

<stage> (for stage directions) was adopted from TEI P3 (or P4).

A.4. Page Breaks and Pointers

Element name | Possible parents | Modeled on | Meaning | Example
<pb> | non-header elements with mixed content | TEI P3/P4: <pb> | page break | <pb n="38" TEIform="pb"/>
<lb> | non-header elements with mixed content | TEI P3/P4: <lb> | line break | <lb TEIform="pb"/>
<ptr> | non-header elements with mixed content | TEI P3/P4: <ptr> | pointer to XML ID | <ptr rend="number" targType="note" targOrder="u" target="shs.00000-n2-f2"/>
<xptr> | non-header elements with mixed content | TEI P3/P4: <xptr> | | <xptr targType="pb" targOrder="u" doc="korpref.mk2" from="WF1.00004-168-PB168" to="DITTO" TEIform="xptr"/>

64 A group of “milestone” elements—<pb>, <lb>, <ptr>, and <xptr>—has been added

to IDS-XCES as part of almost all mixed content models. They are adopted from the TEI

P3 (or P4) to mark page breaks, line breaks, and references to other corpora,

documents, texts (e.g. from a bibliography section), sections, pages etc.

A.5. Corrections and Completions

Element name | Possible parents | Modeled on | Meaning | Example
<orig> | non-header elements with mixed content | TEI P3/P4: <orig> | spelling variant or morphological ellipsis | <orig reg="Ferienheime">Ferien-</orig> und Kinderheime

65 The <orig> element with its attribute @reg has also been adopted from the TEI P3 (or

P4). In some corpora it is used to mark and complete morphological ellipsis and

sometimes used to mark spelling variants.

A.6. Morphosyntactic Inline Annotations

Element name | Possible parents | Modeled on | Meaning | Example
<w> | non-header elements with mixed content | TEI P3/P4: <w> | word form | <w ana="NOU com sg n dat">Telefon</w>

66 The <w> element with its attribute @ana has been adopted from the TEI P3 (or P4) to

mark word forms and to provide morphosyntactic analyses for them. Only a handful of

the IDS corpora, however, contain such inline annotations.

A.7. Time of Creation

Element name | Possible parents | Modeled on | Meaning | Example
<creatDate> | <creation> | IDS-specific | time of creation | <creation><creatDate>2001</creatDate><creatRef>(Erstveröffentlichung: Frankfurt a.M., 2001)</creatRef><creatRefShort>(Erstv. 2001)</creatRefShort></creation>
<creatRef> | <creation> | IDS-specific | reference to the first edition |
<creatRefShort> | <creation> | IDS-specific | short version of the reference to the first edition |

67 The elements under <creation> are used to encode available information about the

time of creation of a text and the publication date of the first edition, if known. In TEI

P3 and P4, the contents of <creation> can be marked up using generic <bibl> and

<date> elements, but TEI does not provide an unambiguous way to indicate that a

particular bibliographic reference and date inside a <creation> element are for the

first edition. In CES and XCES, <creation> contains only character data with no

substructure at all.

A.8. Text Description

Element name | Possible parents | Modeled on | Meaning | Example
<textDesc> | <profileDesc> | TEI P3/P4: <textDesc> | wrapper for text description |
<textType> | <textDesc> | IDS-specific | text type according to type inventory (BOT+x) | <textType>Roman</textType>
<textTypeRef> | <textDesc> | IDS-specific | text type as it is to appear in the bibliographic string (BOT+X) | <textTypeRef>Tageszeitung</textTypeRef>
<textTypeArt> | <textDesc> | IDS-specific | text type of a specific article (BOT+xa) | <textTypeArt>Interview</textTypeArt>
<textDomain> | <textDesc> | IDS-specific | subject area (BOT+r) | <textDomain>Regionales / Unterhaltung/Kultur</textDomain>
<column> | <textDesc> | IDS-specific | original label of newspaper section as in the source (BOT+ress) | <column>FERNSEHEN</column>


68 IDS-specific elements under <textDesc> are used to encode genre, text type,

newspaper section or subject area according to different classification schemes.

A.9. Edition Information

Element name | Possible parents | Modeled on | Meaning | Example
<further> | <edition> | IDS-specific | further edition of the same source, with year (BOT+gg) | <further>5. Auflage 1998 (1. Auflage 1997)</further>
<kind> | <edition> | IDS-specific | kind of edition of the source (BOT+g) | <kind>Taschenbuch</kind>
<appearance> | <edition> | IDS-specific | “physical” appearance of the source (BOT+e) | <appearance>Microfiche</appearance>

69 IDS-specific elements under <edition> are used to encode information about other

existing editions or the range of existing editions, the kind of edition (paperback,

special edition etc.), and the kind of object that was used as the source (photocopy,

microfiche, etc.).

A.10. Bibliographic Reference

Element name | Possible parents | Modeled on | Meaning | Example
<reference> | <sourceDesc> | IDS-specific | bibliographic reference string | <reference type="short" assemblage="regular">DIV/SGP.00000 Szendrödi: Jacques Hilarius Sandsacks Psychoschmarotzer, 2001</reference>

70 The element <reference> may appear multiple times under <sourceDesc>, with

different values of its @type attribute specifying various versions of the bibliographic

reference string required for different modes of display and information in the

@assemblage attribute about whether it has been automatically assembled from

other elements or not.
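An IDS-XCES source description might therefore contain, for instance, both an elaborated and a reduced form of the reference. In the following sketch the content of the longer string is elided, and the @type value "full" is a hypothetical label; only "short" is attested in the example above:

<sourceDesc>
  <!-- the @type value "full" is hypothetical; "short" is taken from the table above -->
  <reference type="full" assemblage="regular">DIV/SGP.00000 Szendrödi: ...</reference>
  <reference type="short" assemblage="regular">DIV/SGP.00000 Szendrödi: Jacques Hilarius Sandsacks Psychoschmarotzer, 2001</reference>
</sourceDesc>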


BIBLIOGRAPHY

Association for Computers and the Humanities (ACH), Association for Computational Linguistics

(ACL), and Association for Literary and Linguistic Computing (ALLC). 1999. Guidelines for Electronic

Text Encoding and Interchange (TEI P3), edited by C. M. Sperberg-McQueen and Lou Burnard.

Chicago and Oxford: Text Encoding Initiative. First published 1993. http://www.tei-c.org/Vault/

GL/P3/index.htm.

al-Wadi, Doris and Irmtraud Jüttner. 1996. “Textkorpora des Instituts für Deutsche Sprache: Zur

einheitlichen Struktur der bibliographischen Beschreibung der Korpustexte.” In LDV-INFO 8.

Informationsschrift der Arbeitsstelle Linguistische Datenverarbeitung, edited by IDS, 1–85. Mannheim.

Belica, Cyril, Marc Kupietz, Andreas Witt, and Harald Lüngen. 2011. “The Morphosyntactic

Annotation of DEREKO: Interpretation, Opportunities, and Pitfalls.” In Grammatik und Korpora 2009.

Dritte Internationale Konferenz. Mannheim, 22.4.–24.9.2009, edited by Marek Konopka, Jacqueline

Kubczak, Christian Mair, František Šticha, and Ulrich Hermann Waßner, 451–469.

Korpuslinguistik und interdisziplinäre Perspektiven auf Sprache 1. Tübingen: Narr.

Ide, Nancy. 1998. “Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora.”

Proceedings of the First International Language Resources and Evaluation Conference, 463–470.

Granada, Spain.

Ide, Nancy, Patrice Bonhomme, and Laurent Romary. 2000. “XCES: An XML-based Standard for

Linguistic Corpora.” In Proceedings of the Second Language Resources and Evaluation Conference (LREC),

825–830. Athens, Greece.

Kolvenbach, Monika. 1988/1989. “Schreibkonventionen für IDS-Korpora.” In LDV-INFO 7.

Informationsschrift der Arbeitsstelle Linguistische Datenverarbeitung. Edited by Tobias Brückner.

Kupietz, Marc. 2005. Near-Duplicate Detection in the IDS Corpora of Written German. Technical Report

KT-2006-01. Institut für Deutsche Sprache, Mannheim.

Kupietz, Marc and Holger Keibel. 2009. “The Mannheim German Reference Corpus (DEREKO) as a

basis for empirical linguistic research.” Working Papers in Corpus-based Linguistics and Language

Education 3, edited by Makoto Minegishi and Yuji Kawaguchi, 53–59. Tokyo: Tokyo University of

Foreign Studies (TUFS).

Kupietz, Marc, Oliver Schonefeld, and Andreas Witt. 2010. “The German Reference Corpus: New

developments building on almost 50 years of experience.” In Language Resources: From Storyboard

to Sustainability and LR Lifecycle Management, edited by Victoria Arranz and Laura van Eerten.

http://www.lrec-conf.org/proceedings/lrec2010/workshops/W20.pdf

Perkuhn, Rainer, Cyril Belica, Doris al-Wadi, Meike Lauer, Kathrin Steyer, and Christian Weiß.

2005. “Korpustechnologie am Institut für Deutsche Sprache.” In Korpuslinguistik deutsch: synchron

– diachron – kontrastiv, edited by Johannes Schwitalla and Werner Wegstein, 57–70. Tübingen,

Germany.

TEI Consortium. 2001. TEI P4: Guidelines for Electronic Text Encoding and Interchange: XML-Compatible

Edition, edited by C. M. Sperberg-McQueen and Lou Burnard. N.p.: TEI Consortium. http://

www.tei-c.org/release/doc/tei-p4-doc/html/.

TEI Consortium. 2012. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.1.0.

Last updated June 17. N.p.: TEI Consortium. http://www.tei-c.org/release/doc/tei-p5-doc/en/

html/index.html.


Wielemaker, Jan. n.d. “SWI-Prolog SGML/XML parser.” SWI-Prolog. http://www.swi-prolog.org/

pldoc/package/sgml.html.

NOTES

1. http://www.ids-mannheim.de/kl/projekte/korpora/

2. http://www.ids-mannheim.de/cosmas2/

3. We would like to thank Doris al-Wadi, Cyril Belica, Marc Kupietz, and Eric Seubert for their

enormous help regarding our questions about the history of the IDS text model.

4. Texts from 1989–1990 that document the political change that led to reunification, prepared

by the IDS and the former Zentralinstitut für Sprachwissenschaft.

5. Bonn newspaper corpus, from various years between 1949 and 1974, prepared in the 1970s.

6. Mannheim corpus 1 and 2, with texts from 1949 to 1974.

7. Cf. Perkuhn et al. 2005, 61 (our translation).

8. Since the document identifier consists of three capital letters usually derived from

the initials of the author and/or initials of content words from the title of the

document, the resolved document identifier (field BOTd) also corresponds to an

abbreviated version of the bibliographic reference.

9. Like BOT itself, the conversion scripts were prepared by Cyril Belica.

10. The many additional BOT fields and the two basic templates were all specified by

Doris al-Wadi and Irmtraud Jüttner (al-Wadi and Jüttner 1996).

11. The name DEREKO (Deutsches Referenzkorpus) has been in use since then for the

archive of contemporary written-language corpora at the IDS.

12. The project would later be called DEREKO-I.

13. The specification of the mapping and the definition of IDS-specific elements were

prepared by Doris al-Wadi of IDS.

14. TRADUCES was developed by Eric Seubert of IDS.

15. These fields were added by Marc Kupietz of IDS (Kupietz 2005).

16. BOTX was defined by Eric Seubert.

17. The specification of the mapping and the conversion script were prepared by Marc

Kupietz.

18. These may be downloaded from http://www.xces.org/dtds.html.

19. These files may be downloaded from http://corpora.ids-mannheim.de/idsxces1/

DTD/.

20. See Belica et al. 2011.


ABSTRACTS

This paper describes work in progress on I5, a TEI-based document grammar for the corpus

holdings of the Institut für Deutsche Sprache (IDS) in Mannheim and the text model used by IDS

in its work. The paper begins with background information on the nature and purposes of the

corpora collected at IDS and the motivation for the I5 project (section 1). It continues with a

description of the origin and history of the IDS text model (section 2), and a description (section

3) of the techniques used to automate, as far as possible, the preparation of the ODD file

documenting the IDS text model. It ends with some concluding remarks (section 4). A survey of

the additional features of the IDS-XCES realization of the IDS text model is given in an appendix.

INDEX

Keywords: corpora, ODD, DTD, CES, XCES

AUTHORS

HARALD LÜNGEN

Harald Lüngen has been a researcher in the area of corpus linguistics at the Institut für Deutsche

Sprache in Mannheim, Germany, since 2011, specialising in the construction and maintenance of

the German Reference Corpus DEREKO and in methods of corpus analysis. Before that, he worked

as a computational linguist and project scientist in the fields of computational lexicology and

morphology, text parsing, and text technology.

C. M. SPERBERG-MCQUEEN

C. M. Sperberg-McQueen (Black Mesa Technologies LLC) is a consultant specializing in helping

memory institutions solve information management problems and preserve cultural heritage

information for the future by using descriptive markup: XML, XSLT, XQuery, XML Schema, and

related technologies. He co-edited the XML 1.0 specification and the first versions of the TEI

Guidelines.


Creating Lexical Resources in TEI P5

A Schema for Multi-purpose Digital Dictionaries

Gerhard Budin, Stefan Majewski and Karlheinz Mörth

AUTHOR'S NOTE

This paper is based on a presentation given at the TEI Members’ Meeting 2011 in

Würzburg, Germany.

1. Background

Lexicography, the art of compiling dictionaries, is one of the oldest branches of

linguistics. All remnants of early lexicographic writings stem from Asia, and the oldest

extant precursors of modern dictionaries were Sumerian/Akkadian clay tablets dating

from the second millennium BC; these early lexicographic endeavours represent a very

modern type of text—a bilingual dictionary (Snell-Hornby 1986, 208) which, in most

areas of the world, would not emerge until at least 2,000 years later.

In contrast to the Sumerian clay tablets, most other early testimonies of this academic

tradition were monolingual in nature. The Sanskrit grammarian Yāska1 is regarded by

many as the earliest known Indian lexicographer; his Nirukta was a treatise on

etymology and semantics, containing a glossary of irregular verbs. Chinese

lexicography is some centuries younger: the Erya (author unknown) is the most ancient

Chinese writing that falls into the broader category of dictionaries (Wilkinson 2000, 62).

Although the creation of modern dictionaries is considered to have begun in Europe

with the rise of national languages, there is no clearly discernible demarcation line

between pre-modern and modern dictionary production. Some outstanding works

emerged in the 17th and 18th centuries. Jean Nicot’s Trésor de la langue Française was

printed in 1606, Agnolo Monosini’s Vocabulario della lingua italiana appeared in 1612,

Johann Christoph Adelung’s Grammatisch-kritisches Wörterbuch der Hochdeutschen

Mundart followed in 1781, and Samuel Johnson finished his Dictionary of the English


Language in 1755.2 The first large-scale Chinese dictionary from this time period, the

Kangxi zidian, dates from 1716 (Wilkinson 2000, 64).

The latest step in this long history is the transition to digital

methods. Today, digital technology is not only used to produce print dictionaries;

rather, many dictionaries exist solely in digital form. Information and communication

technology has become pervasive in all stages of the modern dictionary creation

process: both data acquisition and representation of lexical knowledge rely heavily on

this technology. Furthermore, dictionary makers have shifted from traditional methods

such as introspection and interviews of competent speakers towards more empirical

methods based on lexicographic research using increasingly sophisticated digital

resources such as corpora (large digital text collections that reflect real-world language

usage).

2. The ICLTT’s Dictionaries

The Institute for Corpus Linguistics and Text Technology (ICLTT) of the Austrian

Academy of Sciences has been conducting a number of lexicographic projects,

including both digitizing print dictionaries and creating born-digital lexicographic

data. The lexicographic data produced in these projects are designed to serve a variety

of purposes for both linguistic research and lexicography. To ensure that NLP tools

available at the institute would work with all the data, a uniform encoding system for

all projects was needed. The integration of digital corpus data with the lexicographic

infrastructure has been an important goal and plays an important role in all these

efforts.

The ICLTT as an institution has grown out of several projects. One of the best known

results of these projects is probably the Austrian Academy Corpus (AAC), a digital

collection of German language texts stemming from the 19th and 20th centuries. The

digital texts contained in the AAC were collected with a literary, a socio-historic and a

lexicographic perspective in mind, but in spite of the literary and historical focus in

setting up the corpus, it is increasingly used by linguists (Moerth 2002).

2.1. Print Dictionaries

The main motive behind setting up the corpus was the institute’s involvement in a

longstanding text-lexicographic project which produced two dictionaries designed to

ease access to one of Austria’s most important works of twentieth-century literature,

Karl Kraus’ magazine Die Fackel. The first volume was a dictionary of idioms and

idiomatic expressions; the second one a comprehensive listing and documentation of

insults and invective terms.

In recent years, the institute has shifted from addressing the needs of literary scholars

by focusing on particular works of literature to catering to the needs of linguists by

devoting resources to smaller and more diverse projects. The ICLTT has also

contributed to the production of the largest German-Russian dictionary ever produced

(Dobrovolsky 2008–2010), which was published as a cooperative project of the Austrian

and the Russian Academies of Sciences.

In addition to creating new print dictionaries, the institute has also digitized historical

dictionaries and even incorporated them into the AAC in order to extend the collection


of texts to as many types of written language as possible. Currently, efforts are being

made to make this data TEI P5 compliant.

2.2. Born-digital Dictionaries

Dictionaries are increasingly created in and for the digital world. Apart from digitizing

paper dictionaries, the ICLTT has also started to create new digital lexical resources,

some of which build on the department’s digital text collections. These include

dictionaries for doing variational linguistics on German as written and spoken in

Austria, Early Modern German, and Arabic; a GUI tool for converting German

Wiktionary data to TEI P5;3 and a comprehensive Dictionary of Modern Persian Single Word

Verbs to be used as the basis for a morphological analyzer. The variation among these

projects has been brought about to a certain degree by the ICLTT’s role as Austria’s

CLARIN and DARIAH coordinator.

3. Data Formats

In choosing a uniform encoding system for all ICLTT data, the department’s staff

surveyed data formats in use. Although most of the relevant dictionary productions of

the recent past have relied on digital data and methods, there is little consensus on

standards. A great number of divergent formats have coexisted: MULTILEX and GENELEX

(GENEric LEXicon) are systems that are associated with the Expert Advisory Group on

Language Engineering Standards (EAGLES).4 Other formats used in digital dictionary

projects are OLIF (Open Lexicon Interchange Format),5 MILE (Multilingual ISLE Lexical

Entry),6 LIFT (Lexicon Interchange Format),7 OWL (Web Ontology Language)8 and DICT

(Dictionary Server Protocol),9 the latter being an important dictionary delivery format

(Faith 1997).

Another standard considered was ISO 1951 (“Presentation/representation of entries in

dictionaries – requirements, recommendations and information”). Although this

standard focuses on encoding the presentation of lexicographical data in dictionaries

for human use in what is called LEXml (Lexicographical Markup Language), it seems

that after a few years of existence only a few publishing houses have been using this

format (such as Langenscheidt, Munich) for their dictionary production line.

Last but not least, when looking for an encoding standard for machine readable

dictionaries, ISO 24613:2008 (“Language resource management – Lexical markup

framework (LMF)”), the ISO standard for natural language processing (NLP) and

machine-readable dictionaries (MRD), must be considered. Recently, there have been

discussions about the possibility of creating a TEI serialization of LMF (Romary 2010).

In modeling lexicographic data, it has become common practice to conceptualize the

underlying structures as tree-like constructs, which makes XML an ideal syntax for

expressing the data. Another option, from software engineering, is UML (Unified

Modeling Language)10 which in turn can easily be serialized into an XML vocabulary.

This approach was taken by the authors of LMF.

For our projects, the final “short list” contained ISO 1951, LMF and the TEI dictionary

module. ISO 1951 was ruled out from the very beginning, among other reasons because of its

lack of support in the community. LMF, in turn, has gained more support in the dictionary-

producing community. Given the still small amount of available data using LMF and


ongoing discussions, the decision was made to move towards TEI and keep an eye on

the LMF specification as it develops.11

4. TEI Dictionary Module

The TEI dictionary module appears to be the de facto encoding standard for

dictionaries digitized from print sources. As such, “TEI for dictionaries” has a

longstanding tradition. Interestingly, the most recent versions of the TEI Guidelines

contain a passage that indicates that the authors had in mind a much wider range of

dictionaries:

... The elements described here may also be useful in the encoding of computational lexica and similar resources intended for use by language-processing software; they may also be used to provide a rich encoding for word lists, lexica, glossaries, etc. included within other documents. (TEI Consortium P5 2012, 247)

This passage reflects a considerable conceptual extension of the initial purpose of the

module.12 However, the idea of extending the scope of the TEI dictionary module for use

by language-processing software is not at all as far-fetched as it may seem at first

glance. The fact that there are people interested in the issue has been documented by

the large audience of the workshop “Tightening the Representation of Lexical Data: A

TEI Perspective,” held at the 2011 Annual Conference and Members’ Meeting of the TEI

Consortium (Würzburg, Germany). Actually, the TEI’s ability to adapt to many types of

dictionaries makes it an ideal candidate for such an endeavor.

A fundamental problem we came up against when we started to model our dictionary

data was the lack of available examples against which we could compare our data. It

would have been beneficial if more projects had made at least samples of their data

publicly accessible.13 Many of the examples which can be found on the TEI website are

repetitive and are by no means exhaustive.14 However, getting hold of examples in

other encoding languages is not easy either: ISO 1951 seems to be used by a single

publishing house and LMF has not won much ground in the field, though there are

some data available for the latter.15

5. ICLTT’s TEI Schema

The following sections outline selected features of the ICLTT’s customization of the TEI

P5 dictionary module. The system has been used successfully for lexicographic data

encoding at the department, where it is meant to be a multi-purpose system targeting

both human users and software applications. The following four requirements

featured strongly in our decision in favor of TEI encoding:

Acquaintance with the overall TEI system: as the department has been working with TEI on

text encoding projects, a number of colleagues are conversant with TEI and have used it

from the very beginning of our dictionary projects;

Intuitiveness of the TEI system: the concise and yet expressive set of elements is definitely

more easily readable to human lexicographers working on the XML source than for instance

the LMF serialization proposed in ISO 24613:2008;

Consistency with other language resources contained in the same collection: the intention

was to keep the encoding system of the dictionary resources in line with other textual data

to be integrated with these lexicographic resources;


Adaptability to the needs of dictionaries to be used in natural language processing (NLP).

In order to make the TEI dictionary module usable for NLP purposes, it has been

necessary to tighten the many combinatorial options of TEI P5—that is, to constrain the

content models of various elements.
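By way of illustration only, a minimal ODD customization of this kind (not the ICLTT's actual schema; the identifier and the module selection are our assumptions) could simply leave out the elements discussed below when pulling in the dictionary module:

<schemaSpec ident="dictNLP" start="TEI">
   <moduleRef key="tei"/>
   <moduleRef key="header"/>
   <moduleRef key="core"/>
   <moduleRef key="textstructure"/>
   <!-- include the dictionary module but exclude elements the schema does not allow -->
   <moduleRef key="dictionaries" except="entryFree superEntry hom dictScrap"/>
</schemaSpec>

Further tightening of individual content models can be expressed in the same ODD by means of <elementSpec mode="change"> declarations or Schematron rules embedded in <constraintSpec> elements.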

5.1. Representing Lemmas

In TEI, dictionaries are a specific type of text and are therefore encoded with <text> elements, which are made up of optional <front> and <back> matter. The

dictionary entries are placed in a <body> element.

<TEI>

<teiHeader>

...

</teiHeader>

<text>

<front>...</front>

<body>

<entry>...</entry>

<entry>...</entry>

<entry>...</entry>

...

...

...

</body>

<back>...</back>

</text>

</TEI>

Individual entries may be seen as the core of all lexicographic encoding; the structure

of dictionary entries can display a great variety of different forms.16 This also accounts

for the fact that the P5 version of the Guidelines (250) offers three elements to encode

this type of microtext: <entry>, <entryFree>, and <superEntry>.

The <superEntry> element can be used to group entries together and is not used in

our schema. As the name implies, <entryFree> represents a single entry with a

comparatively large number of acceptable elements that may be arranged in many

different ways. In TEI P5, <entryFree> can contain 30 different elements from the

dictionary module alone.17 The great flexibility of this element makes it suitable for

digitizing print dictionaries, but in creating strictly defined dictionary structures to be

used by software, this flexibility is of lesser value.

In contrast to <entryFree>, the <entry> element allows for only ten sub-elements:

<case>, <def>, <etym>, <form>, <gramGrp>, <hom>, <sense>, <usg>, <xr>,

and <dictScrap>. The dictionary schema described in this paper only contains the

simple <entry> element (combinatorial options were further restricted by excluding

both <dictScrap> and <hom> elements from the list of possible child elements).


Simple dictionary entries invariably start with a lemma. Optionally, entries contain an

indication of the word class of the lemma and one or more <sense> elements. A

typical entry has a structure like this:

<entry>

<form type="lemma">

...

</form>

<gramGrp>

<gram type="pos">...</gram>

</gramGrp>

<sense>

...

</sense>

...</entry>

In many cases, it is difficult for lexicographers to decide whether to integrate lexical

items into one single entry or rather to make two or more entries. Lexical homonymy

in TEI dictionaries is often encoded using the <hom> element, as in the following

abridged example.

<entry>

<form type="lemma"><orth>Schloss</orth></form>

<hom>

<sense>

<cit type="translation" xml:lang="en">

<quote>castle, palace</quote></cit>

</sense>

</hom>

<hom>

<sense>

<cit type="translation" xml:lang="en">

<quote>(pad)lock</quote></cit>

</sense>

</hom>

</entry>

As a basic principle, we have attempted to keep hierarchies in our encoding system as

flat as possible. This is why the <hom> element has been excluded from the set of


possible elements. That is, in cases of homonymy, lexicographers have to either work

with entries that contain several senses or to create separate entries, which would be

encoded in TEI as follows:

<entry>

<form type="lemma"><orth>Schloss</orth></form>

<sense>

<cit type="translation" xml:lang="en">

<quote>castle, palace</quote></cit>

</sense>

</entry>

<entry>

<form><orth>Schloss</orth></form>

<sense>

<cit type="translation" xml:lang="en">

<quote>(pad)lock</quote></cit>

</sense>

</entry>

The same encoding pattern is applied to grammatical homonyms and polyfunctional

items—that is, homographs that are semantically related but have different word classes.

However, encoding homonyms in separate <entry> elements can be problematic,

especially when lexical items belong to different word classes and need to be

distinguished (consider an example from English: “talk” as a verb versus as a noun). For

us, the deciding factor was whether the word class difference manifests itself in the

semantic description, the <sense> block in TEI nomenclature. Whenever different

part-of-speech labels would need to be assigned to <sense> elements (such as with all

grammatical homonyms), the lexical items were encoded in separate <entry> elements rather than in one.

Polyfunctionality is a very common phenomenon and has posed problems in almost all

our projects. Our approach, as detailed above, has pros and cons. However, our main

argument in favor of splitting entries—putting each homonym into a separate

<entry>—is that it makes access to the particular lexical items more straightforward.

Working along these lines, part-of-speech labels only appear on the top-most level of

the entry together with the lemma, not within <sense> elements. If necessary, the

relation between entries could be made explicit by <re> (related entry) elements or

some system of links.
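A hypothetical English example (not taken from the ICLTT data; the identifiers, the German glosses, and the <xr>-based cross-references are our own illustrative choices) shows how such a split might look:

<entry xml:id="talk_v">
   <form type="lemma"><orth>talk</orth></form>
   <gramGrp>
      <gram type="pos">verb</gram>
   </gramGrp>
   <sense>
      <cit type="translation" xml:lang="de"><quote>sprechen, reden</quote></cit>
   </sense>
   <!-- optional link to the related noun entry -->
   <xr type="related"><ref target="#talk_n">talk (noun)</ref></xr>
</entry>
<entry xml:id="talk_n">
   <form type="lemma"><orth>talk</orth></form>
   <gramGrp>
      <gram type="pos">noun</gram>
   </gramGrp>
   <sense>
      <cit type="translation" xml:lang="de"><quote>Gespräch, Vortrag</quote></cit>
   </sense>
   <xr type="related"><ref target="#talk_v">talk (verb)</ref></xr>
</entry>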

It is obvious that the decision of whether to split entries also depends on what one

plans to do with a particular set of data. For some of our projects, we have plans to

enrich lexical data using corpora: looking for new, hitherto unregistered word forms,

doing statistics on word forms, etc.


5.2. Encoding Word Class Information

A fixed component of all single-word dictionary entries is a block containing word-class

information. In early experiments, we encoded this information within the <form> element representing the lemma. While TEI allows word-class information to appear in

various locations within an <entry> element, the motivation behind putting it within

<form> was that it seemed to be more consistent to say that the lemma, rather than

the entry, belongs to a particular word class. In addition, putting the <gramGrp> element in the lemma's <form> element allowed <gramGrp> elements containing

part-of-speech information to appear inside <form> elements, yielding an additional

simplification of the schema.

Over time, we have come back to a more canonical TEI encoding, abandoning this

rather atypical practice. This change of attitude was, among other things, motivated by

experiments in converting our data into an LMF-conformant XML serialization: in LMF,

@part-of-speech is defined as an attribute of the element <LexicalEntry>.18
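Sketched roughly (neither snippet reproduces actual ICLTT data), the change amounts to moving <gramGrp> out of the lemma's <form>:

<!-- earlier experiment (abandoned): word-class information inside the lemma's <form> -->
<entry>
   <form type="lemma">
      <orth>Schloss</orth>
      <gramGrp><gram type="pos">noun</gram></gramGrp>
   </form>
   ...
</entry>

<!-- current, canonical placement: <gramGrp> as a sibling of <form> within <entry> -->
<entry>
   <form type="lemma"><orth>Schloss</orth></form>
   <gramGrp><gram type="pos">noun</gram></gramGrp>
   ...
</entry>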

Practical experience has also led us to change the usage of elements inside the <gramGrp> element. Initially, word-class information was encoded using the <gramGrp> element,

which can contain a number of other elements such as <case>, <gen>, <mood>,

<pos>, and <tns>. For example:

...

<gramGrp>

<pos>noun</pos>

</gramGrp>

...

We now only allow the <gram> element within <gramGrp>, using attributes to

distinguish various word-class categories. The above example can be rewritten to its

<gram> equivalent like this:

...

<gramGrp>

<gram type="pos">noun</gram>

</gramGrp>

...

Choice of appropriate terminology is important when labeling lemmas with word

classes. Scholars working on digital resources have long needed to maintain

consistency both within a project and with the terminology agreed upon by the community at large.

Nowadays, it also involves interoperability with other digital resources, especially by

referring to publicly accessible frameworks (concept repositories) to make the

linguistic terminology explicit. In the field of linguistics, two such frameworks play an


increasingly important role: the so-called GOLD standard, the General Ontology for

Linguistic Description (Farrar and Langendoen 2003), and ISOcat, the ISO TC37/SC4

Data Category Registry (Kemps-Snijders et al. 2009). The most important feature of the

web-based ISOcat registry is that it provides persistent identifiers (PIDs) for all the

concepts registered in the database, allowing for explicit reference to terms used.

So far, we have attempted to make use of ISOcat terminology in the ICLTT

customization without explicitly referring to the ISOcat terms in the encoding of the

entries. However, we have started to experiment with an alternative way of marking up

word-class information that makes explicit reference to the concept repository, as

exemplified in the following excerpts:

...

<gramGrp>

<gram type="pos" corresp="#vrbNoun"/>

</gramGrp>

...

The label of the @corresp attribute above refers to a feature structure that, in turn,

provides an explicit reference to the particular entry in the ISOcat database:

<fs type="partOfSpeech">

<f xml:id="vrbNoun" name="verbalNoun" fVal="http://www.isocat.org/datcat/DC-3858"/>
<f xml:id="comNoun" name="commonNoun" fVal="http://www.isocat.org/datcat/DC-385"/>
<f xml:id="prNoun" name="properNoun" fVal="http://www.isocat.org/datcat/DC-384"/>

</fs>

5.3. Morphosyntactic Information

Dictionary entries often contain more grammatical forms of the headword. In

traditional lexicography, particular word forms are usually given in order to point the

user to irregularities in inflectional paradigms. In a digital dictionary, which does not

have any spatial limitations, it is not uncommon to have more comprehensive lists of

word forms.

5.3.1. <gramGrp> vs. Feature Structures

The ICLTT has experimented with entries giving only inflectional irregularities and also

those giving complete paradigms; in either case, each word form is encoded with a

<form> element. Whatever the intended use of these word forms, a system is needed


to identify their function. The traditional TEI way to do this would be to enter the

morphosyntactic details of a <form> in a <gramGrp> element:

...

<form type="inflected">

<gramGrp>

<pos value="verb"/>

<tns value="present"/>

<number value="singular"/>

<mood value="indicative"/>

<per value="2"/>

</gramGrp>

<orth>gehst</orth>

</form>

...

In search of a more generic approach, we resorted to a system combining feature

structures19 and ISOcat grounded values. Instead of using the <gramGrp> element as a

child of <form>, the @ana (analytic) attribute is added to the <form> element.

...

<form type="inflected" ana="#v_pres_ind_sg_p2">

<orth>gehst</orth>

</form>

...

The labels used to construct the pointers in the @ana attribute are human-readable

abbreviations. In this part of the system, we have attempted to proceed in line with the

ISO TC37/SC4–related MAF (Morphosyntactic Annotation Framework) draft

specification, in particular Chapter 8 on morpho-syntactic content (ISO 24611 2008, 21).

The components of the value of the @ana attribute are resolved in a feature structure

library:


<fvLib>

...

<fs xml:id="v_pres_ind_sg_p2" name="v_pres_ind_sg_p2" feats="#pos.verb #tns.pres #mood.ind #num.sg #pers.2"/>

...

</fvLib>

<fLib>

<f xml:id="pos.verb" name="pos"><symbol value="verb"/></f>

...

<f xml:id="tns.pres" name="tense"><symbol value="present"/></f>

...

<f xml:id="mood.ind" name="mood"><symbol value="indicative"/></f>

...

<f xml:id="num.sg" name="number"><symbol value="singular"/></f>
<f xml:id="num.pl" name="number"><symbol value="plural"/></f>

...

<f xml:id="pers.2" name="person"><symbol value="2nd"/></f>

...

</fLib>

This method of annotating morphosyntactic phenomena is not only extremely concise

(the information is only referenced through links), it also allows for the assignment of

multiple interpretations of the content of the <orth> element. The @ana attribute can contain any number of so-called data.pointers, each separated by

whitespace:

...

<form type="inflected" ana="#v_pres_ind_pl_p1 #v_pres_ind_pl_p3">

<orth>gehen</orth>

</form>

...

5.3.2. A Particular Case: Encoding Roots of Semitic Words

Any general-purpose system such as the TEI is bound to have conceptual gaps. A

particular problem of our projects involving Semitic languages was how to deal with

what in Semitic studies is commonly referred to as a root. In Semitic morphology, word

forms are constructed on top of two, three, or four consonants. These consonants,

which function as abstract linguistic units, form what is commonly called “the root”,

i.e. the semantic skeleton of all morphologically derived forms. The scholars working

with and on the described encoding system were very reluctant to use the TEI element

<form> for the particular purpose, as this would have meant stretching the semantics

of the element too much. Roots are neither word forms nor stems. In order to avoid “tag


abuse”, we first experimented with the TEI’s feature-structure capabilities. Here is an

example taken from our Colloquial Cairene Arabic Dictionary (safar is Arabic for ‘journey’).

...

<form type="lemma">

<form type="lemma"><orth>safar</orth></form>

<fs><f name="root"><string>sfr</string></f></fs>

...

However, our current practice is to encode the root of each lemma by means of the

<gramGrp> element holding the word-class information. Adding an additional

<gram> element to <gramGrp> appears to be a both concise and conceptually

consistent solution to the problem:

...

<gramGrp>

<gram type="pos">noun</gram>

<gram type="root">sfr</gram>

</gramGrp>

...

5.4. Identifying Linguistic Varieties and Writing Systems

When encoding digital texts, linguistic varieties are usually identified using so-called

language codes, of which there are several systems. An older (yet very versatile) system

is Verbix Language Codes, which makes use of the old SIL codes.20 LS-2010

(Linguasphere language codes) is a rather recent system which was published in 2000

and updated in 2010. It contains over 32,000 codes. The most widely used standard is

ISO 639.

All these systems are incomplete and, if still being maintained, continue to evolve. A

downside to all of them is the lack of support coming from the many scholarly

disciplines involved in their use. In addition to the high (and ever-changing) number of

linguistic varieties on our globe, a further aspect has to be taken into

consideration: many linguists also need codes for historic linguistic varieties as well as

for living ones.

In TEI encoding, it has become common practice to make use of the global21 attribute

@xml:lang, incorporated into the TEI from the World Wide Web Consortium’s XML

Specification. TEI prescribes this attribute to identify both linguistic varieties and

writing systems. In this hybrid approach, the value of the attribute should be

constructed in accordance with Best Current Practice 47 (BCP 47)22 which in turn refers to

and aggregates a number of ISO standards (639-1, 639-2, ISO 15924, ISO 3166).23


BCP 47 defines an extensible system that is sufficiently expressive to identify most

standard linguistic varieties. Language tags are assembled from a sequence of

components (which are also called subtags), each separated by a hyphen. All subtags

except for the first one are optional and have to be arranged in a particular order. The

first subtag is an ISO 639 language code (the two-letter ISO 639-1 code where available) and indicates the linguistic variety; the second

one is a region code (an ISO 3166-1 country code or a UN M.49 area code). For example, es-MX stands for Spanish as spoken in

Mexico, es-419 for Spanish as spoken in Latin America. In addition, the ISO 639-3 three-

letter language codes and ISO 15924 script codes are used. One can specify, for instance, that

the language being used in a particular encoded element is the Gan variety

(gan) of Chinese (zh) as spoken in Hong Kong (HK) and written in Latin characters

(Latn); these subtags have to be arranged in the proper order: zh-gan-Latn-HK.

While identifiers for standard linguistic varieties are adequate for many text encoding

projects, some of our projects in variational linguistics, especially dialectology, need to

provide locational granularity beyond what is specified in the second subtag. To solve

this problem, ICLTT staff make use of private use subtags (which, according to BCP 47,

must be introduced with an x singleton). They help to indicate particular geographical

locations and writing systems that cannot be identified by one of the standards

referenced by BCP 47. Consider the following case of the representation of the lemma

for Egyptian Arabic book:

...

<form type="lemma">

<orth xml:lang="ar-arz-x-cairo-vicav">kitāb</orth> </form>

...

In constructing these labels, ISO standards have been applied wherever possible. The

value of the BCP 47 language tag (that is, the value of the @xml:lang attribute) starts

with the shortest available ISO 639 code: ar stands for Arabic. This is followed by an

extended language subtag. ISO 639-3 provides 30 identifiers for what in the

specification is called individual languages, which all belong to the macrolanguage

Arabic.24 The three-letter subtag arz stands for Egyptian Arabic.25 Unfortunately,

this is not precise enough for the purposes of dialectology, as Egyptian Arabic itself

is subdivided into a great number of quite divergent dialects, which our system has to

accommodate (with private use subtags, as explained above). The schema we are using

constructs these subtags from two components: location and writing system. The first

component (location) does not require further explanation, whereas the second

component (writing system) in this example is vicav, which stands for Viennese Corpus of

Arabic Varieties (transcription), a hybrid system for transcription that attempts to

represent the most common current usage in the community. While this system of

constructing language labels has served our purposes very well, for documentary

purposes it is still recommended to specify the exact meaning of the toponym (the first

component of our private use subtag) in the <teiHeader> of the dictionary. 26 We

hope that future standards for language tags will allow for geo-spatial references with

much finer granularity.
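One plausible way to document such a private use subtag, sketched here as an assumption on our part rather than as the ICLTT's actual practice, is a <language> declaration in the header's <langUsage> section:

<profileDesc>
   <langUsage>
      <!-- illustrative wording; the ident value matches the tag used in the entries -->
      <language ident="ar-arz-x-cairo-vicav">Egyptian Arabic as spoken in Cairo, Egypt,
         transcribed according to the VICAV transcription system</language>
   </langUsage>
</profileDesc>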


The following example is taken from a Modern Persian dictionary entry; the English

translation of the lemma is ‘to go, to walk’.

...

<form type="lemma">

<orth xml:lang="fa-Arab">رفتن</orth>

<orth xml:lang="fa-x-modDMG">raftan</orth>

</form>

...

The two letters fa identify the language (Modern Persian, ISO 639-1), and Arab indicates

the writing system (ISO 15924).27 The private use subtag indicates the system used to

transcribe the Arabic characters. In this particular case, modDMG is a modified version of

the system of the Deutsche Morgenländische Gesellschaft. Documentation of the system and

the applied modifications are explained in the dictionary’s <teiHeader>.

5.5. Etymologies

The encoding of etymologies is straightforward in TEI. As in canonical TEI, our schema

allows the <etym> element as a child of <entry>. <etym> in turn contains one or more

<lang> elements. To make the information inside the <lang> element explicit, a

@sameAs attribute is added whose value points to feature structures referring to an

ISO 639-2 value.

...

<etym><lang sameAs="#iso2_la">Latin</lang></etym>

...
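The target of the @sameAs pointer is not shown in the article; by analogy with the feature structures used above for part-of-speech values, it might (as an assumption on our part) look something like this:

<fs type="languageCodes">
   <!-- ISO 639-2 codes: "lat" for Latin, "grc" for Ancient Greek -->
   <f xml:id="iso2_la" name="iso639-2"><string>lat</string></f>
   <f xml:id="iso2_grc" name="iso639-2"><string>grc</string></f>
</fs>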

5.6. Adding Semantics

So far, we have discussed phenomena pertaining to orthography and morphology, but

we have not yet touched on equivalents or translations of the lemmas. All such

information is placed in one or more <sense> elements. In monolingual

dictionaries, equivalents of the lemma are encoded as <def> elements; definition in

this context means a synonym or a paraphrase. When working on bi- and multi-

lingual data, translations are encoded as <cit> elements, and the content proper is

placed in <quote> elements within these.28 Translations in more than one language

are encoded by means of several <cit> elements.


<entry>

<form type="lemma"><orth>Schloss</orth></form>

<sense>

<cit type="translation" xml:lang="en">

<quote>castle, palace</quote></cit>

<cit type="translation" xml:lang="fr">

<quote>château, palais</quote></cit>

</sense>

...

In addition to the <def> and <cit> elements, our schema only allows <gramGrp> and <usg> inside the <sense> element.

...

<sense>

<usg type="dom">colour</usg>

<cit type="translation" xml:lang="en">

<quote>black</quote></cit>

</sense>

...

5.6.1. Grammatical Valency

The appropriate encoding of grammatical phenomena often called valency or

government is still not entirely resolved in the TEI Guidelines. The Guidelines provide

only two examples for the <colloc> element; both are encoded with a @type attribute that has the value prep (for preposition). One is an entry for French médire de,

which in English translates as “to speak ill of”.


<entry>

<form>

<orth>médire</orth>

</form>

<gramGrp>

<pos>v</pos>

<subc>t ind</subc>

<colloc type="prep">de</colloc>

</gramGrp>

</entry>

The second example is an entry with Chinese shuō “to speak” as lemma, followed here

by the resultative particle dào, which can be rendered in this context as of or about.

<entry>

<form>

<orth>說</orth> </form>

<gramGrp>

<colloc type="prep">到</colloc>

</gramGrp>

</entry>

The solution we had in mind would have to reach beyond what, to a

majority of linguists, would be acceptable as a collocate. For this reason, we decided to

consider other encoding options.

A uniform system for specifying a lexical item’s main complements (arguments in

linguistic nomenclature) was needed. Note that this part of our encoding system is still

in its infancy. However, it is important to mention that this kind of information is

invariably marked up within the <sense> element. Our current encoding is illustrated

by the following excerpt:


...

<sense>

<gramGrp>

<gram type="argument">in</gram>

</gramGrp>

<cit type="translation" xml:lang="en">

<quote>sich interessieren (für)</quote></cit>

</sense>

...

In our customization, the <gram> element is used to list selected arguments relevant

to the material of a specific project. None of the projects aims at the exhaustive

coverage of arguments. We have also been thinking about making use of feature

structures, as in the following example:

...

<fs type="syntacticBehaviour">

<f name="coreArguments" feats="#optSubj #oblPrepObj "/>

</fs>

...

The above structure will appear very familiar to readers conversant with LMF (Lexical

Markup Framework). With a generic solution designed along these lines, a precise

expression of valency or government is achievable. It would also be feasible to

differentiate between mandatory and optional arguments.
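On such an approach, the pointers #optSubj and #oblPrepObj would resolve to entries in a feature library; the following is only a sketch under our own assumptions about naming:

<fLib>
   <f xml:id="optSubj" name="argument">
      <fs type="arg">
         <f name="function"><symbol value="subject"/></f>
         <f name="optionality"><symbol value="optional"/></f>
      </fs>
   </f>
   <f xml:id="oblPrepObj" name="argument">
      <fs type="arg">
         <f name="function"><symbol value="prepositionalObject"/></f>
         <f name="optionality"><symbol value="obligatory"/></f>
      </fs>
   </f>
</fLib>

Distinguishing mandatory from optional arguments would then reduce to reading off the optionality feature.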

5.6.2. Dictionary Examples

As explained above, all ICLTT dictionary projects are tightly interlinked with corpus-

building activities. For this reason, the encoding of examples in dictionary entries

requires particular attention. The relation between dictionary and corpus has to be

seen as bidirectional: on the one hand, lexicographic data are designed to be used in

the analysis of corpora, yet on the other hand, corpora are used to enhance and refine

dictionaries.

One important requirement was identified at the outset of our work: dictionary

examples must be reusable in different entries of a dictionary. As we did not want to

duplicate data in the dictionary, the natural choice was to work with <ptr> elements

to reference examples.

In TEI P5, dictionary examples are encoded as <cit> elements with @type attributes.

Except for the value of the @type attribute, they look exactly like translations. The

following example is taken from an isiZulu-English glossary:


...

<cit type="exampleSentence" xml:id="amanzi_ayabanda_01">

<quote>Amanzi ayabanda.</quote>

<cit type="translation" xml:lang="en">

<quote>The water is cold.</quote>

</cit>

</cit>

...

In our TEI-encoded dictionaries, examples such as the one above are children of the

<body> element. Our dictionary editing program organizes dictionaries into three

basic units—one metadata record (a <teiHeader> element) for the whole dictionary,

an open number of entries, and dictionary examples (which can be multi-word

expressions, phrases, or sentences with respective translations)—each of which is

stored as a separate database record. Examples can then be linked to particular

<sense> elements through a unique identifier which is referenced via the @target attribute of a <ptr> element:

<entry xml:id="amanzi_01">

<form type="lemma">

<orth>amanzi</orth>

</form>

...

<sense>

<cit type="translation" xml:lang="en">

<quote>water</quote>

</cit>

<ptr type="exampleSentence" target="#amanzi_ayabanda_01"/>

</sense>

</entry>

Usually, one example <cit> element contains a single <quote> element.

Nevertheless, in some cases multiple <quote> elements might be required, such as to

give the example in several orthographic representations (with the @xml:lang attribute differentiating them). The following example is again taken from the

Colloquial Cairene Arabic dictionary:


...

<cit type="exampleSentence" xml:id="id_dinya_harr_01">

<quote xml:lang="ar-arz-x-cairo-vicav">id-dinya ḥarrᴵ ʡawi in-nahar-da.</quote>

<quote xml:lang="ar-arz-x-cairo-modDMG">id-dinyaᴵ ḥarr ’awi in-nahar-da.</quote>

<quote xml:lang="ar-arz-x-cairo-IPA">id-dinya ḥarrᴵ ’awi in-nahar-da.</quote>

<quote xml:lang="ar-arz-Arab-x-cairo">الدنیا حر قوی النهارده.</quote>

<cit type="translation" xml:lang="en">

<quote>It’s very hot today.</quote>

</cit>

</cit>

...

5.7. Metadata at the Level of the Dictionary Entry

Recording production metadata has been a recurring issue in many of the ICLTT’s

encoding projects, and the lexicographic work is no exception. It is common knowledge

that the TEI provides very efficient mechanisms to make statements about all kinds of

responsibility in the <teiHeader> element. However, problems arise when such

statements are needed on a more granular level than the whole TEI document.29 In

parts of our lexicographic work, we need to make responsibility statements not only

about the whole dictionary but also about particular entries.

In everyday lexicographic work, it is not enough to assign the ID of one single

lexicographer to an entry; one might want to trace who did what and at what time. As

neither <revisionDesc> nor <change> may be used as child elements of

<entry>, we considered various options to accommodate this information in our TEI

structures. The intention was not to store production-related metadata only as a

separate field in the database but to preserve this data in a self-contained manner as

part of the entries so that this data would be passed on whenever a digital dictionary

gets distributed.

Two elements were singled out which appeared to be plausible candidates to handle

metadata about revisions of entries: <div> and <note>. These elements both have

sufficiently generic semantics and, most importantly, may be used as children of the

<entry> element. We first tried to encode metadata on revisions like this:


...

<note type="revisionDesc">

<list>

<item><date when="2011-10-11"/>charly, added POS</item>

</list>

</note>

...

We wanted to stay as close as possible to comparable TEI structures without bending

the semantics of particular elements. We decided in favor of a <div> element for

revisions, containing a feature structure. This <div> element is inserted as the last

element at the end of the entry. Each modification of the entry is registered by means

of an <fs> element:

...

<div type="revisionDesc">

<fs type="change">

<f name="who">charly</f>

<f name="when">2011-10-15</f>

<f name="what">added POS</f>

</fs>

...

</div>

...

The <fs> element corresponds to the TEI <change> element, and the single features

(<f> elements) correspond to the attributes and content of <change>. Such constructs can also be

used to register status information: labels carrying values such as proposal, draft, and

approved can be used to control release of selected entries to the public.
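A status record along the same lines might, purely as a sketch (the feature names are our assumption), look like this:

<div type="revisionDesc">
   <!-- release status of the entry, alongside the change records -->
   <fs type="status">
      <f name="value"><symbol value="draft"/></f>
   </fs>
   <fs type="change">
      <f name="who">charly</f>
      <f name="when">2011-10-15</f>
      <f name="what">added POS</f>
   </fs>
</div>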

6. Tools

So far, work on these digital lexical resources has been accomplished using a software

application developed in-house. The program was initially used in collaborative

glossary editing projects carried out as part of language courses at the University of

Vienna. As it proved to be flexible and adaptable enough, it has been put to use in the

ICLTT’s dictionary projects.

At the heart of the software application is the dictionary editing client, a standalone

application temporarily dubbed the Viennese Lexicographic Editor (VLE). It supports web-

based editing and dictionary entries are stored on a web server. All additional software

components (PHP and MySQL) are open-source and freely available. Communication


between the dictionary client (VLE) and the server has been implemented as a RESTful

web service.

While the dictionary editor is geared towards general use with XML data, it is

particularly suited to and customized for use with TEI-encoded data. In addition to

fully customizable XSLT stylesheets, the tool includes a number of helpful built-in

features described in brief below.

Configurable keyboard layouts are designed to support the input of Unicode characters

usually not available in standard key assignments. Recent VLE versions allow the

automatic assignment of a keyboard layout to particular @xml:lang attributes to spare users

manual switching between keyboard layouts. For example, when the user works on

contents of an element provided with an @xml:lang="ru" attribute, VLE

automatically activates the Russian keyboard layout; on entering an element with the

attribute @xml:lang="de", it switches back to the German layout.

Entry-specific metadata can be generated automatically whenever an entry is saved.

IDs of both entries and examples are created automatically on the basis of the contents

of the respective items.

Another feature of the dictionary editor is a special module that assists with the

integration of corpus examples into dictionaries. The principal idea behind this module

was optimizing access to digital corpora: the corpus interface of the dictionary writing

application enables lexicographers to launch corpus queries and insert the retrieved examples into

existing dictionary entries without copy-and-pasting via the clipboard, which would

inevitably result in a lot of inefficient typing or clicking.30

The validation of our dictionary data currently uses XML Schema, but the most recent

versions of VLE have been delivered with a newly integrated library that is also capable

of validating the data against RelaxNG schemas.

7. Conclusion

The heterogeneity of linguistic annotation has been and will remain a major obstacle

for interoperability and reusability of language resources. Over the past few years,

there has been increased awareness among developers and users of the need to achieve

a higher degree of convergence in many parts of their encoding systems. ICLTT staff

members’ previous experiences with LMF have shaped the TEI customization, and the

draft MAF specification is significantly influencing linguistically motivated TEI

applications. In creating digital dictionaries, both of these ISO specifications (and

others referenced by them) will continue to complement the work with the TEI

Guidelines.

All of our lexicographic endeavors have been guided by a vision of an ever more

densely knit web of dictionaries and more reusable, standards-based, and ideally

publicly available language resources. Such resources and the respective tools for

creation and access form an integral part of state-of-the-art ICT infrastructures. The

ICLTT’s interest in furthering the outreach of the TEI and integrating the Guidelines

into the newly evolving digital infrastructures has, among other reasons, been

motivated by their strong commitment to the European infrastructure projects CLARIN

and DARIAH.


In conclusion, we would like to emphasize that our customization of the TEI P5

dictionary module has proved to be a solid foundation for new lexicographic projects.

While there is no doubt that much work remains to be done, we strongly believe that

the results of our experiments furnish ample evidence that TEI P5 can be used not only

to represent digitized print dictionaries but also for NLP purposes.

BIBLIOGRAPHY

Atkins, Beryl T.S., and Michael Rundell. 2008. The Oxford Guide to Practical Lexicography. Oxford;

New York: Oxford University Press.

Banski, Piotr, and Beata Wójtowicz. 2009. “FreeDict: An Open Source Repository of TEI-encoded

Bilingual Dictionaries”. Paper presented at the 2009 Conference and Members’ Meeting of the TEI

Consortium, Ann Arbor, Michigan, November 9–15, 2009. http://www.tei-c.org/Vault/

MembersMeetings/2009/files/Banski+Wojtowicz-TEIMM-presentation.pdf.

Bel, Nuria, Nicoletta Calzolari, and Monica Monachini, eds. 1995. “Common Specifications and

Notation for Lexicon Encoding and Preliminary Proposal for the Tagsets”. MULTEXT Deliverable

D1.6.1B. Pisa.

Budin, Gerhard, Heinrich Kabas, and Karlheinz Moerth. 2012. “Towards Finer Granularity in

Metadata: Analysing the Contents of Digitised Periodicals”. In Journal of the Text Encoding Initiative

2. doi: 10.4000/jtei.416.

Budin, Gerhard, and Karlheinz Mörth. 2011. “Hooking up to the Corpus: the Viennese

Lexicographic Editor’s Corpus Interface”. In Electronic Lexicography in the 21st Century: New

Applications for New Users: Proceedings of eLex 2011, Bled, 10–12 November 2011, edited by Iztok Kosem

and Karmen Kosem, 52–59. Ljubljana: Trojina, Institute for Applied Slovene Studies.

Dobrovolsky, Dmitry O. 2008–2010. Neues Deutsch-Russisches Grosswörterbuch. 3 vols. Moscow: AST.

Faith, R. 1997. A Dictionary Server Protocol. http://www.rfc-editor.org/rfc/rfc2229.txt.

Farrar, Scott, and D. Terence Langendoen. 2003. “A Linguistic Ontology for the Semantic Web”.

GLOT International 7 (3): 97–100.

Hass, Ulrike, ed. 2005. Grundfragen der elektronischen Lexikographie: Elexiko, das Online-

Informationssystem zum deutschen Wortschatz. Berlin; New York: W. de Gruyter.

Hausmann, Franz Joseph, Oskar Reichman, Herbert Ernst Wiegand, and Ladislav Zgusta, eds.

1989–1991. Dictionaries. An International Encyclopedia of Lexicography. 3 vols. Berlin; New York: W. de

Gruyter.

Ide, Nancy, Adam Kilgarriff, and Laurent Romary. 2000. “A Formal Model of Dictionary Structure

and Content”. In Proceedings of the Ninth EURALEX International Congress: EURALEX 2000: Stuttgart,

Germany, August 8th–12th, 2000, 113–126. Stuttgart: Universität Stuttgart, Institut für

maschinelle Sprachverarbeitung.

Ide, Nancy, Jean Véronis, Susan Warwick-Armstrong, and Nicoletta Calzolari. 1992. “Principles for

Encoding Machine Readable Dictionaries”. In EURALEX ’92 Proceedings: Papers Submitted to the 5th


EURALEX International Congress on Lexicography in Tampere, Finland. Tampere, Finland: Tampereen

Yliopisto.

ISO-24611 (Draft). 2008. Language resource management — Morpho-syntactic annotation framework.

ISO-24613. 2008. Language resource management – Lexical markup framework (LMF).

Kemps-Snijders, Marc, Menzo Windhouwer, Peter Wittenburg, and Sue Ellen Wright. 2009.

“ISOcat: Remodelling Metadata for Language Resources”. In International Journal on Metadata,

Semantics and Ontologies 4: 261–276.

Mörth, Karlheinz. 2002. “The Representation of Literary Texts by Means of XML: Some

Experiences of Doing Markup in Historical Magazines.” In Digital Evidence. Selected Papers from DRH

2000, Digital Resources for the Humanities Conference, edited by Michael Fraser, Nigel Williamson, and

Marilyn Deegan, 17–32. London: Office for Humanities Communication.

Romary, Laurent, Susanne Salmon-Alt, and Gil Francopoulo. 2004. “Standards Going Concrete:

From LMF to Morphalou”. In Workshop on Enhancing and Using Electronic Dictionaries. Geneva:

Coling.

Romary, Laurent. 2010. “Standardization of the Formal Representation of Lexical Information for

NLP”. In Dictionaries: An International Encyclopedia of Lexicography. Supplementary Volume: Recent

Developments with Special Focus on Computational Lexicography. http://arxiv.org/abs/0911.5116.

Romary, Laurent. 2010. “Using the TEI Framework as a Possible Serialization for LMF”. Paper

presented at RELISH workshop, August 4–5, 2010, Nijmegen, Netherlands. http://hal.archives-

ouvertes.fr/docs/00/51/17/69/PDF/NijmegenLexicaAugust2010.pdf.

Sarup, Lakshman. 1920–27. The Nighantu and the Nirukta: The Oldest Indian Treatise on Etymology,

Philology and Semantics. Delhi.

Snell-Hornby, Mary. 1986. “The Bilingual Dictionary: Victim of its own Tradition?” In The History

of Lexicography, edited by Reinhard Hartmann, 207–218. Amsterdam: John Benjamins.

TEI Consortium. 2012. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.1.0.

Last updated June 17. N.p.: TEI Consortium. http://www.tei-c.org/release/doc/tei-p5-doc/en/

html/index.html.

Wegstein, Werner, Mirjam Blümm, Dietmar Seipel, and Christian Schneiker. 2009.

“Digitalisierung von Primärquellen für die TextGrid-Umgebung: Modellfall Campe-Wörterbuch“.

http://www.textgrid.de/fileadmin/TextGrid/reports/TextGrid_R4_1.pdf.

Wilkinson, Endymion. 2000. Chinese History. A Manual. Cambridge, Mass.: Harvard University Asia

Center.

NOTES

1. There is no reliable information available as to his date of birth. Tradition assumes

the 5th or 6th century BC. See Sarup (1920–27, 54).

2. While none of these works can be regarded as an absolute first, they can all be seen

as important milestones in their respective traditions.

3. A project working on Russian Wiktionary versions is the Wiktionary-Export project

which also produces TEI versions (http://wiktionary-export.nataraj.su/en/about.html).

4. http://www.ilc.cnr.it/EAGLES96/home.html


5. http://www.olif.net/

6. http://www.w3.org/2001/sw/BestPractices/WNET/ISLE_D2.2-D3.2.pdf

7. http://code.google.com/p/lift-standard/

8. http://www.w3.org/TR/owl-ref/

9. http://tools.ietf.org/html/rfc2229

10. A standardized object-oriented modeling language.

11. The ICLTT’s dictionary editor VLE provides a tool to convert some of the TEI

encoded dictionary data into LMF. This end is achieved by making use of XSLT

stylesheets to transform the TEI data into an XML format that looks very much like the

XML serialization found in the ISO specification.

12. This also shows in the fact that the P4 chapter was titled “Print dictionaries”,

whereas the current P5 version bears the title “Dictionaries”.

13. An example of what we would like to see more of can be found on the ICLTT’s

experimental Showcase website: http://corpus3.aac.ac.at/showcase/index.php/

dictionary. In this dictionary interface, each entry can also be viewed with its TEI

encoding.

14. Among the well-documented examples of TEI P5 encoded dictionaries, there is the

CAMPE dictionary, a product of the TextGrid project (Wegstein 2009). While most data

in the field are not easily available, let alone for reusing or further development, a

number of P5-compliant dictionaries were made freely available by the FreeDict project

(Banski 2009).

15. See the LMF website: http://www.lexicalmarkupframework.org/.

16. The general structure of these items of lexicographic information has been

discussed in various publications before. See Ide et al. (1992), Ide et al. (2000), and

Romary (2011).

17. These are <case>, <colloc>, <def>, <etym>, <form>, <gen>, <gramGrp>,

<hom>, <hyph>, <iType>, <lang>, <lbl>, <mood>, <number>, <oRef>,

<oVar>, <orth>, <pRef>, <pVar>, <per>, <pos>, <pron>, <re>, <sense>,

<subc>, <superEntry>, <syll>, <tns>, <usg>, and <xr>.

18. ISO-24613:2008(E), 39.

19. Feature structures are a general-purpose data structure that have become a widely

used means of representation in linguistics. They have a longstanding tradition in the

TEI. A chapter on the topic in the TEI Guidelines goes back to P3 (Sperberg-McQueen

and Burnard 1994, 394–431).

20. http://wiki.verbix.com/Documents/VerbixLanguageCodes

21. Global attributes can be used on all elements of the TEI encoding scheme.

22. BCPs are published by the Internet Engineering Task Force together with RFC

(request for comments) documents.

23. BCP 47 is made up of two IETF documents: RFC 5646 and RFC 4647. A good overview

is given in TEI Consortium 2012, liv.

24. The registration authority for ISO 639-3 is SIL International (http://www.sil.org/

iso639-3/codes.asp).


25. It is interesting that W3C discourages the use of macrolanguage subtags (http://

www.w3.org/International/questions/qa-choosing-language-tags.en#langsubtag). The

label arz-x-cairo-vicav would be as clear as ar-arz-x-cairo-vicav.

26. While Cairo, Illinois (USA), will probably not be confused with the Egyptian capital

in this context, other ambiguities will definitely occur.

27. The language identifier fa has the “Suppress-Script: Arab” entry set in the IANA

registry. That means that it is the default and should be omitted. However, we decided

to be more explicit in such cases as the different <orth> elements are being used in

our markup scheme exactly for the purpose of representing different writing systems.

28. The structure of the <sense> block has been heavily affected by the transition

from P4 to P5. The <trans> and <tr> elements have been removed from P5.

29. In a paper presented at the TEI Members’ Meeting last year, we discussed the

possibility of assigning TEI headers through links to particular divisions of text

documents (Budin and Moerth 2011).

30. See Budin 2011.

ABSTRACTS

Although most of the relevant dictionary productions of the recent past have relied on digital

data and methods, there is little consensus on formats and standards. The Institute for Corpus

Linguistics and Text Technology (ICLTT) of the Austrian Academy of Sciences has been

conducting a number of varied lexicographic projects, both digitising print dictionaries and

working on the creation of genuinely digital lexicographic data. This data was designed to serve

varying purposes: machine-readability was only one. A second goal was interoperability with

digital NLP tools. To achieve this end, a uniform encoding system applicable across all the

projects was developed. The paper describes the constraints imposed on the content models of

the various elements of the TEI dictionary module and provides arguments in favour of TEI P5 as

an encoding system not only being used to represent digitised print dictionaries but also for NLP

purposes.

INDEX

Keywords: P5, dictionaries, digital lexicography, NLP

AUTHORS

GERHARD BUDIN

Gerhard Budin is full professor for terminology studies and translation technologies at the

Centre of Translation Studies at the University of Vienna, director of the Institute for Corpus

Linguistics and Text Technology of the Austrian Academy of Sciences, member (kM) of the


Austrian Academy of Sciences, and holder of the UNESCO Chair for Multilingual, Transcultural

Communication in the Digital Age. He also serves as vice-president of the International Institute

for Terminology Research and Chair of a technical sub-committee in the International Standards

Organization (ISO) focusing on terminology and language resources (ISO/TC 37/SC 2 2001–2009,

SC 1 2009-present). His main research interests are language technologies, corpus linguistics, and

knowledge engineering, E-Learning technologies and collaborative work systems, distributed

digital research environments, terminology studies, ontology engineering, cognitive systems,

cross-cultural knowledge communication and knowledge organization, philosophy of science,

and information science.

STEFAN MAJEWSKI

Stefan Majewski studied English Language and Literature as well as Sociology at the University of

Vienna and Electronics at the Vienna University of Technology. He graduated in English

Linguistics with a focus on research infrastructures for corpus linguistics. Currently, he is

working at the Austrian Academy of Sciences, where he coordinates and works for the “Data

Service Infrastructure for the Social Sciences and Humanities” (DASISH) project. He is also

employed by the Göttingen State and University Library, where he works for the “TextGrid”

project in research and development. His current interests focus on research infrastructures and

annotation systems.

KARLHEINZ MÖRTH

Karlheinz Mörth is senior researcher and project leader at the Institute for Corpus Linguistics

and Text Technology (ICLTT) of the Austrian Academy of Sciences, lecturer at the University of

Vienna and co-head of the DARIAH Virtual Competency Centre 1 (eInfrastructure). Proceeding

from a broad background in cultural, literary and linguistic studies, he has been working on a

number of scholarly digital projects. He has contributed to the design and creation of the

Austrian Academy Corpus (AAC), taking responsibility for text encoding and software

development. His current research activities focus on eLexicography and text technology for

linguistic research.


Consistent Modeling of Heterogeneous Lexical Structures

Laurent Romary and Werner Wegstein

AUTHOR'S NOTE

The authors would like to thank the reviewers of earlier versions of this paper,

especially reviewer A, for their very detailed analysis and constructive criticism that

contributed to the profile of our paper.

1. Pooling Lexical Sources: A Digital Humanities Perspective

1 Our paper addresses the problem of interoperability between heterogeneous data

sources, an issue that has regularly been the object of many debates within the Text

Encoding Initiative (TEI) community and in general within many standardization

groups providing models or formats for data interchange. At the core of the problem is

the trade-off between expressivity—offering a flexible platform for representing a

variety of possible structures—and processability—being able to predict under which

conditions some data can be the object of a blind interchange, in particular in the

context of their being processed by an arbitrary generic tool.

2 This trade-off has no generic solution, but it regularly arises in defining the

components of such an expansive modeling platform as the TEI Guidelines. The TEI

specifications are an expression of a balance of interests between the many, varied use

cases from the community and the need to abstract away from such examples in order

to design recommendations that new users can easily understand and apply in the

context of their own encoding endeavours.

3 Throughout the TEI Guidelines one finds a stratification of corrections, constraints, and

new features added over time, which have left some constructs as hybrid data models


and which leave the user wondering which representation is the “optimal” one in a

given context, leading to heterogeneous encoding practice in the global data space of

existing TEI documents. Over the years, this has become more and more an issue as

documents are increasingly accessible online and scholars increasingly collaborate on

projects using TEI documents. That is, the “stratification” of the Guidelines has

worsened the problem of interoperability.

4 In this paper we will focus on lexical structures, which we believe represent a typical

case of the interoperability problem in terms of pooling data from heterogeneous

sources. We have asked ourselves whether the TEI chapter dedicated to lexical data,

simply entitled “Dictionaries,” should not be revised or at least be accompanied by

further constraints on its usage so that basic operations related to the querying,

displaying, or merging of lexical information could be made more straightforward.

5 From a digital humanities perspective, we want to understand if it is possible to find a

balance between expressing precise constraints on the encoding of a primary source

and leaving some freedom to the scholar who will see the encoding activity as a step in

his research process. This is why we have made an attempt to identify a generic

methodology for expressing encoding constraints on source texts based on the idea of

local representation or crystals (Romary 2009). These crystals correspond to elementary

constructs at a low level of granularity in a document, which, independently of the

broader organization of the document itself, can be used to express a certain concept in

an extremely regular way, thus making the further reuse of this information chunk

easier. In this context, interoperability is related to the capacity of a person or a tool to

process encoded crystals within a document independently of its origin.

6 After presenting the general background for modeling and representing lexical

sources, we give an overview of the various crystals that form the basis of most existing

types of lexical entries. For each of these crystals we make systematic

recommendations with corresponding supporting arguments. In the second part of the

paper we illustrate our proposals with concrete cases taken from various dictionary

and lexical database projects.

2. Modeling Tools for Lexical Resources

7 The case of lexical data as presented in a dictionary offers an interesting experimental

setting for studying interoperability in the context of standardisation. It is complex

enough to reflect the variability which is intrinsic to the TEI Guidelines while providing

a limited observational setting for studying the granular structure of lexical entries as

well as the rather high internal coherence that one specific lexical source usually has.

Lexical resources also reflect the variety of analytical points of view that one may have

on linguistic information, ranging from quite descriptive and verbose objects in the

domain of standard human-oriented dictionaries to fully structured databases like

those developed in the natural language processing domain.

8 In this paper we consider only lexical resources that are encoded semasiologically—

where entries are determined according to the forms found in a language and further

refined into the different senses that have been deemed relevant for this form. This

word-to-sense organization is usually seen as the most appropriate for the

representation of large coverage lexica, as opposed to onomasiological representations

(concept-to-term), which better take into account the organization of domain-specific


vocabularies (terminologies). The semasiological perspective is usually the underlying

model for traditional print dictionaries as well as for large-scale lexica in the natural-

language-processing domain (Halpern 2006; Atkins et al. 2002).

9 There are two main international standardization activities that are relevant for the

modeling and the representation of semasiological resources: the Lexical Markup

Framework (LMF) and TEI. In accordance with the modeling strategy of ISO committee

TC 37, LMF (which has been standardised as ISO 24613:2008) provides a group of meta-

models that can be combined to produce specific data models applicable to a wide

range of lexical types or components including machine-readable lexica, morphology, syntax, semantics, and multi-word expressions. Even though the LMF specification provides a possible XML serialisation, it tends to be agnostic as to the actual

implementation of the models it allows one to describe. On the other hand, the TEI has

been seminal in offering a reference XML vocabulary for the representation of

dictionaries, which is mostly compliant with LMF principles.1 However, the variety of

constructions that the TEI actually allows for the representation of the same lexical

phenomenon could possibly be seen as a hindrance to the achievement of deep

interoperability across heterogeneous lexical resources.

10 In this paper we take as a starting point the positions described by LMF and the latest

release of the TEI Guidelines2 in order to provide further insights into how to build

lexical resources or dictionaries relying on a systematic use of standardised constructs.

The work presented here is also based upon some core principles that have

systematically guided our work, both theoretically and practically, through the in-

depth presentation of examples that have served as experimental background for

testing our proposals. Even though the present work is not about modeling XML

structures at large, several of these principles are derived from a more global concept

of the kind of semantics that XML constructs convey and the way to actually reflect this

in the design of XML formats.

11 With this perspective in mind, two generic constraints that affect the organization and

semantics of lexical structures can be stated; both are illustrated in the sketch that follows this list:

Semantic grouping: Features that jointly convey a given meaning in a lexical entry should be

systematically grouped together, even when only one such feature occurs and even at the

cost of favoring more deeply-structured representations.

Hierarchical dependency: Features, or groups thereof, which qualify a given level (for

instance, an entry), are considered to be inherited by subcomponents (typically the senses)

of the lexical entry unless otherwise stated (Ide, Kilgarriff, and Romary 2000). (Here and

below, we use “level” to refer to a hierarchical relationship within the data structure.)
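Both constraints can be seen in a minimal sketch of our own (the French noun porte is chosen arbitrarily): the single gender feature is still wrapped in its <gramGrp> container, and the grammatical information stated at entry level is understood to hold for both senses unless one of them states otherwise.

<entry>
   <form type="lemma">
      <orth>porte</orth>
   </form>
   <!-- semantic grouping: a single feature is still placed in its container -->
   <gramGrp>
      <gen norm="feminine">f.</gen>
   </gramGrp>
   <!-- hierarchical dependency: both senses inherit the gender stated above -->
   <sense n="1">
      <def>door</def>
   </sense>
   <sense n="2">
      <def>gate</def>
   </sense>
</entry>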

12 From these constraints we will progressively derive specific recommendations for the

local organization of lexical entries as guided by a crystal-based analysis. Comparing

these with real data, and in particular with legacy dictionaries, we will try to

understand possible transition schemes from weakly structured data to more

standardized constructs.


3. Core Proposals: Towards a Systematic Description of Lexical Crystals

3.1. Crystals as Coherent Sub-structures

13 Introducing the concept of crystals in data modeling in general and in the TEI

Guidelines in particular reflects the need to describe data structures that act as

scaffolding for a coherent group of components (or elements in XML terminology).

More precisely, a crystal can be defined as an independent group of connected elements (a

clique) with semantic coherence. A typical example of a crystal is a structured

bibliographical entry using the TEI’s <biblStruct> element. This element contains

internal structure (comprising <analytic>, <monogr> with <imprint>, and

<series>), can be inserted at various places within the TEI architecture, and can be

further expanded by other components or crystals (for example, <author>).
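A schematic sketch of such a <biblStruct> crystal (our own, with the content elided) shows its fixed internal organization and the anchor points where it can be further expanded:

<biblStruct>
   <analytic>
      <!-- article-level description; <author> is itself a further crystal -->
      <author>...</author>
      <title level="a">...</title>
   </analytic>
   <monogr>
      <title level="j">...</title>
      <imprint>
         <publisher>...</publisher>
         <date>...</date>
      </imprint>
   </monogr>
   <series>
      <title level="s">...</title>
   </series>
</biblStruct>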

14 Without introducing any specific formalism here, we might define a crystal by:

The set of mandatory and optional components that may occur in the crystal

The structural organization of the crystal, stating in particular the hierarchical relations

between components

The anchor points of the crystal (<analytic>, <monogr> with <imprint>, and

<series>), where it can be further expanded

The global semantics of the crystal, in complement to the specific semantics of its

component elements

15 A crystal is thus a modeling tool that can be used to provide a coherent description of a

subset taken from a more complex data model (as is typically the case with the TEI

Guidelines). To illustrate this, we will briefly demonstrate how the TEI Guidelines

chapter on dictionaries can serve as a basis for implementing LMF, and point out some

consequences this could have on the data architecture that we recommend for certain

TEI elements.

16 As a starting point, let us consider the LMF subset depicted in figure 1, which

implements the semasiological view of a lexical entry. This UML diagram states that a

Lexical Entry is characterised by at least one Form component with which a hierarchically

embedded series of Sense components may be associated. The Form component is

further refined by means of an optional Form Representation component, which can be

used to represent the various concrete implementations of a lexical form (e.g. phonetic,

graphical, etc.). Finally, each component of the meta-model (corresponding here to a

UML class) can be further characterised by properties attached to each of them.


Figure 1: The Lexical Entry sub-structure of the LMF core package

17 Transposed to the TEI world, the LMF metamodel can be expressed as a TEI crystal

rooted on the <entry> element. This crystal, depicted in figure 2, states that the

minimal lexical entry in a sense as defined by TEI uses the <entry>, <form> and

<sense> elements, with <form> being further decomposed by means of a series of

elements implementing the Form Representation component of LMF.3 The picture also

introduces three new classes, which could gather up all further descriptive elements

needed to refine <entry>, <form>, and <sense>: model.entryDesc,

model.formDesc, and model.senseDesc.

18 This first presentation of the TEI lexical entry as a crystal illustrates how this concept

may help in describing complex structures that rely on constraints that go beyond (and deeper than) what we normally express by means of DTDs or schemas. Even though we

do not systematically analyze the equivalences between LMF and the TEI in the

following section, we hope that the preceding explanation will help the reader

understand the logic behind the various constraints explained in subsequent sections.

In a pattern analogous to the internal structure of the <cit> element, we see the

organization of the various elements of this lexical entry crystal as a combination of a

structural description (direct dependency of one element on another) and a descriptive

dimension (further constraints applicable to the group of elements).

Figure 2: The ideal element-class organization of a TEI lexical entry
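Spelled out as a skeleton (our own sketch; the three model.*Desc classes are the proposal made above, not part of the current Guidelines), the crystal of figure 2 reads as follows:

<entry>
   <!-- members of the proposed model.entryDesc class, e.g. <gramGrp> -->
   <form>
      <!-- Form Representation: <orth>, <pron>, ... -->
      <!-- members of the proposed model.formDesc class -->
   </form>
   <sense>
      <!-- members of the proposed model.senseDesc class, e.g. <def>, <usg>, <cit> -->
      <sense>
         <!-- senses may be hierarchically embedded -->
      </sense>
   </sense>
</entry>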

3.2. Morphographical Descriptions

19 In a semasiologically structured lexical entry, form information gives one or more

realizations of a word—whether graphical, phonetic, or iconic (by means of a picture

or drawing)—which can be used to find the corresponding lexical unit. Such


information may comprise abstract identifiers for the headword, namely the lemma,

morphological components or categories (such as the consonantal pattern in Arabic),

or any inflectional variant that can be associated with the entry.

20 The central issue in describing the corresponding morphographical crystal is that it

should be based upon an abstract representation of Form as a component, which in turn

groups together all the possible realizations of the corresponding form (the Form

Representation component in LMF), as well as the associated constraints. In terms of

good practices, one should thus refrain from providing a form representation

(realization) in isolation and always include it within an embedding <form> element.4

Unless there is only one form associated with a given lexical entry, the form type (such

as a lemma or inflected form) should be provided to ensure its univocal identification.

21 As a consequence, the minimal structure associated with a TEI-encoded lexical entry—

where the only information given is that of a lemma (here, the French word chat; (en)

cat)—should be encoded as follows:

<entry>

<form type="lemma">

<orth>chat</orth>

</form>

</entry>

22 On this basis, additional variants of the form (such as pronunciation) can be added to

the same form container, together with complementary information characterizing

them. For instance, when more than one orthography is used to provide the form, the

appropriate @type attribute should be used to qualify the corresponding orthography.

In the following example, the lemma for the Korean word “치다” (chida; (en) to hit) is

provided in Hangul ((ko) 한글) orthography together with a Romanized

form.

<form type="lemma">

<orth type="한글">치다</orth>

<orth type="romanized">chida</orth>

</form>

23 As a next step, we advocate the definition of stable values for the @type attribute on

<orth>, adopting ISO 15924 to refer to the script.
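Under such a convention, the Korean example above could be typed with the ISO 15924 codes Hang and Latn (a sketch; whether the codes appear bare, as here, or with an explicit iso15924: prefix, as in the Schwan example later in this paper, remains an editorial choice):

<form type="lemma">
   <orth type="Hang">치다</orth>
   <orth type="Latn">chida</orth>
</form>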

24 When alternative forms are provided, indicating, for example, inflectional variation,

then the variants should be encoded in full in order to reflect linguistic differences. For

instance, the example provided in Annex B of LMF (clergyman) is reformulated in TEI as

follows:


<entry>

<gramGrp>

<pos>commonNoun</pos>

</gramGrp>

<form type="lemma">

<orth>clergyman</orth>

</form>

<form type="inflected">

<orth>clergyman</orth>

<gramGrp>

<number>singular</number>

</gramGrp>

</form>

<form type="inflected">

<orth>clergymen</orth>

<gramGrp>

<number>plural</number>

</gramGrp>

</form>

</entry>

3.3. Grammatical Information

25 Grammatical information may appear at various points within a dictionary entry; it is

there to provide additional information about the core objects comprising the entry. In

the lexicographic tradition grammatical information qualifies the lemma, or rather,

since the lemma is just a code representing the entry as a whole, syncretizes the

grammatical features that apply by default to all possible occurrences of the word.

However, the grammatical information can also occur at many other possible levels of

the entry, qualifying inflected forms in a more precise way (as in the “clergyman”

example above), indicating specific constraints associated to a sense, or even qualifying

the occurrence within an example of phrasal expression. As a whole, a grammatical

crystal defined according to these principles may be used at any place where the usage

of a word is described.

26 The notation for grammatical features within human-oriented dictionaries varies

greatly: a given grammatical constraint can, for instance, be represented by a

prototypical morpheme (e.g. der / die / das to indicate grammatical gender in German)

or by means of a descriptive phrase (used in the plural form). At best, idiosyncratic codes

are used (e.g. masc., fém.), though they are not always consistently applied within a

single dictionary, let alone across dictionaries. There is no doubt that such a situation

prevents one from querying lexical entries that include grammatical constraints in a

coherent way. It is therefore a priority to establish requirements for the representation

of grammatical features in a way that is both standard and yet preserves the initial

editorial choices. As a basis for such recommendations we recommend that TEI-based


encoding of dictionary entries should be in keeping with the following elementary

principles:

Grammatical features should systematically be embedded within a <gramGrp> container

element, even if only one feature is present and even if the grammatical information is split

up so that more than one <gramGrp> container may be necessary.

Whereas one should be flexible with the textual content of a grammatical descriptor, it is of

utmost importance to normalize the intended value by means of a @norm attribute.

27 For instance, when a value for the grammatical gender is given by means of a

determiner, the @norm attribute will provide the reference value (e.g. as a code from

the ISOcat data category registry).5 Depending on the encoder’s editorial choices, a

minimal encoding might look like the following example:

<form type="lemma">

<gramGrp>

<gen norm="feminine">die</gen>

</gramGrp>

<orth>Katze</orth>

</form>

28 A more elaborate encoding scheme could lead to the following lemma structure:

<form type="lemma">

<form type="marker">

<gramGrp>

<pos norm="determiner"/>

<gen norm="feminine"/>

</gramGrp>

<orth>die</orth>

</form>

<form type="head">

<gramGrp>

<pos norm="noun"/>

<gen norm="feminine"/>

</gramGrp>

<orth>Katze</orth>

</form>

</form>

29 In general, such grammatical descriptions should be thought of as being equivalent to

the provision of feature structures and thus mappable onto an <fs> element. For

instance, the preceding minimal encoding example (omitting the orthographic form) is

equivalent to:


<fs>

<f name="gender"><symbol value="feminine"/></f>

</fs>
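Along the same lines, the two-feature <gramGrp> of the more elaborate Katze encoding above would correspond to the following feature structure (again a sketch of our own):

<fs>
   <f name="pos"><symbol value="noun"/></f>
   <f name="gender"><symbol value="feminine"/></f>
</fs>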

30 The next stage in providing a recommendation is to make sure that values for the

@norm attribute are stable within a project and, when possible, across projects. We

recommend two complementary strategies (an ODD-based sketch of the first follows this list):

For a given project, document and publicize the values used for the norm attribute so that

the community may be aware of possible discrepancies

Relate such values to entries in the ISOcat data category registry so that they are mapped

onto standardized conceptual references.
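One possible implementation of the first strategy (a sketch only; neither the element chosen nor the value list is prescribed by the Guidelines) is to record the permitted values in the project’s ODD customization, where they are both documented and machine-checkable, for instance for @norm on <gen>:

<elementSpec ident="gen" module="dictionaries" mode="change">
   <attList>
      <attDef ident="norm" mode="change">
         <!-- values used in this project; each should also be mapped to an ISOcat entry -->
         <valList type="closed">
            <valItem ident="feminine"/>
            <valItem ident="masculine"/>
            <valItem ident="neuter"/>
         </valList>
      </attDef>
   </attList>
</elementSpec>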

31 It should be noted that at the time of writing, there is an item on the TEI Council

agenda to better integrate mechanisms available in ISO 12620:2009 (the standard which

defines the structure of ISOcat) within the TEI architecture to facilitate such mappings.

We can thus expect that these recommendations may become in due course standard

practice within the TEI community.

3.4. Senses as Systematic Entry Points

32 The representation level introduced by the Sense component in LMF and its

counterpart <sense> in the TEI Guidelines is an essential concept implementing the

semasiological perspective of a dictionary. Still, a “lazy” encoding style for dictionary

entries could lead to the idea that such a structure is superfluous when, for instance, a

word can directly be described at the same level as the morphological and grammatical

information by a simple definition or a translation that is a child of <entry>. Indeed,

it is often the case in the simplest forms of legacy lexical structures that senses are not

explicitly separated out in the microstructure of the entry. We consider this bad

practice and recommend that <sense> be used to enclose all descriptors that describe

the signified (as opposed to the signifier, that is the <form>, in the Saussurian sense).

33 As can be observed from the variety of constraints that may apply to a <sense>

element within a lexical entry, the underlying understanding of the semasiological

model extends to the organization of senses that do not rely on strict semantic criteria

(Ide, Kilgarriff, and Romary 2000). This is not so much of a paradox when we think of

the numerous ways by which semantic variation may be observed, among which we can

include pure morpho-syntactic or syntactic markers. As a result, we consider that

<sense> should be used to describe any subdivision reflecting a variation in usage for

a given word. In an extreme case, applying automatic collocation extraction tools

(Kilgarriff and Tugwell 2002) may result in generating lexical entries automatically

where senses correspond to the various collocation classes that the tool has

determined.


34 We thus see the sense component in LMF and the <sense> element in TEI as a generic

container organizing the further description of a signifier, which may contain

information related to (all three kinds are combined in the sketch following this list):

The actual syntactico-semantic restriction applicable to the sense being described, for

instance by means of further grammatical constraints, a definition, or some usage

restriction

The provision of further illustrative information, in particular contextualized examples or

translations (see the section on the <cit> element below)

Relational information referring to external information expressing the same meaning,

either within another lexical entry or an external ontological reference (such as in the

lexical database project WordNet, described by Miller and Fellbaum [2007]).
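A sketch of our own combines the three kinds of information within a single <sense>; the values and the cross-reference target are invented placeholders:

<sense n="1">
   <!-- restriction: grammatical and usage constraints plus a definition -->
   <gramGrp>
      <subc norm="transitive">trans.</subc>
   </gramGrp>
   <usg type="dom">music</usg>
   <def>...</def>
   <!-- illustration: a contextualized example -->
   <cit type="example">
      <quote>...</quote>
   </cit>
   <!-- relation: a reference to an entry or external resource expressing the same meaning -->
   <xr type="synonym">
      <ref target="#some-other-entry">...</ref>
   </xr>
</sense>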

35 In order to actually facilitate further querying, it is important that each feature

intended to be associated with a sense be precisely typed. Precise typing requires

that clearly defined typologies be associated with elements such as <usg> and <cit>.

Furthermore, dictionary projects should be able to document precisely how much

restrictive or illustrative information is inherited along embedded senses. For instance,

a clear editorial strategy should state whether grammatical constraints replace or

complete existing ones at a higher level of a sense hierarchy.

3.5. <cit>: A Generic Linguistic Quotation Tool

36 The <cit> element in TEI P5 is the result of a merger of several constructs from

former editions of the TEI chapter on dictionaries that had been created to handle

examples and translations in dictionary entries. The underlying aim of the new

framework was twofold. On the one hand, the objective was to provide greater

coherence to the way language excerpts appear not only in dictionaries but in textual

content in general. On the other hand, the TEI Council wanted to design a sound

framework for dealing with additional references or constraints provided in a lexical

entry to complement the quoted object itself, taking into account that such refinements

may lead to recursive constructs. In terms of interoperability across TEI-based

applications, the main vision behind the <cit> element, and the crystal it shapes, is to

provide entry points for generic searches for quoted language in texts, from the point

of view both of the full-text content and of providing a systematized representation of

constraints associated with the full text.

37 Language quotations in text may indeed take many different forms. In dictionaries the

most basic quotation is simply a phrase or sentence exemplifying the headword. Most

of the time, this quotation does not appear alone but is refined according to two main

axes:

Indication of the source of the quotation, for instance the following from P5 2.0.0: ‘La valeur

n’attend pas le nombre des années’ (Corneille)

Provision of usage information, stating constraints that the example is bound by, such as

domain or pronunciation, as in the following from P5 2.0.0: some … 4. (S~ and any are used

with more): Give me ~ more/s@'mO:(r)/

38 In the case of multilingual dictionaries, language quotations are similarly used to

provide equivalences for the entry (or sub-sense thereof) in the target language. In a

way that is similar to the monolingual case, further refinement of the encoding

structure of a quotation may indicate some source or usage information, but it may also


document the target language proper. A usual case here is the indication of the

grammatical gender of a noun equivalent in the target language.

39 Quotation constructs are not covered in LMF but can easily be modeled as an extension

to the LMF core packages. Figure 3 is a simple representation for such an extension.

The approach is similar to the one we advocate above for grammatical information in

relation to senses, in which the quoted text is embedded in a quotation construct even

if no refinement is actually stated.

Figure 3: An LMF extension for quotations represented in a dictionary

40 In the TEI Guidelines, the quotation construct is implemented by means of the <cit>

element, which has the following characteristics:

The quoted object may be realized not only by means of a <quote> or <q> (both from the

model.qLike class) but also as a more elaborated construct such as an XML object

(<egXML>, a member of model.egLike).

The refinement of a quotation can be instantiated as a bibliographic reference (using an

element from model.biblLike), as a pointer or external reference to a constraint (using

an element from model.ptrLike), as specific lexicographic features such as grammatical

constraints (using an element from model.entryPart), or through the inclusion of

feature structures in <cit>—accidental by design—which are part of model.global. It

should be noted that a refinement can actually be an embedded <cit> (by virtue of the

inclusion of model.entryPart in the content model of <cit>), thus offering, for

example, a natural way to provide a translation of a quotation.

41 Note that the TEI Guidelines already systematize the values of the @type attribute to

“example” and “translation” for use in dictionaries.

42 Given the variety of possible cases where <cit> may be used and the potentially

infinite combinations of refinement, it may be difficult to provide clear requirements

for its application. Basically a proper usage of <cit> should allow a human reader or a

processor to identify one quoted object and treat all other components as refinements

in which semantics are understood in a conjunctive way (in other words, all

refinements apply en bloc to the quoted object). By default, the quoted object should be

the first child of the <cit> element or, in general, the first child that is a member of

either model.qLike or model.egLike.


43 Although the second part of this paper provides several applications of <cit> in the

context of our observational corpus, we can illustrate here some basic usages of this

element from examples available in the TEI Guidelines.

44 In the following prototypical case, a simple example for the headword is associated

with a refinement giving the pronunciation of part of the quoted text:

<cit type="example">

<quote>Give me <oRef/> more</quote>

<pron extent="part">s@'mO:(r)</pron>

</cit>

45 The next example illustrates the representation of a translation refined with a

grammatical feature:

<cit type="translation" xml:lang="fr">

<quote>habilleur</quote>

<gramGrp>

<gen>m</gen>

</gramGrp>

</cit>

46 Finally, we cannot resist presenting a recursive case where the embedded <cit> is

used as an additional descriptive element for the quoted text at the higher level:

<cit type="example">

<quote>she was horrified at the expense.</quote>

<cit type="translation" xml:lang="fr">

<quote>elle était horrifiée par la dépense.</quote>

</cit>

</cit>

4. Illustrated Guidelines for Early Printed Dictionaries

4.1. Lexicographical Justification

47 We tested our encoding concepts using printed dictionaries from the second half of the

18th century for two reasons. First, in the history of English lexicography the early 18th

century marks the beginning of modern dictionary practice (Landau 2001, 60–66).

Samuel Johnson’s Dictionary of the English Language, first published in 1755, perfectly


embodies these advances in lexicography. Johnson is the first English lexicographer to

include thousands of other quoted “‘authorities’ within his text as illustrations of word

use” (Reddick 1996, 9). His dictionary also brought together “for the first time key

conventions for future dictionary presentation: the folio6 design is a system of

typography that displays the structure of each entry, though there are inconsistencies

of abbreviation and ambiguities” (Luna 2005, 193). Thus this dictionary offers an ideal

test bed to study problems in providing a consistent encoding in P5 of a source

document that offers notational inconsistencies. Second, because Johann Christoph

Adelung7 translated Samuel Johnson’s dictionary into German (Adelung 1783–1796),

Johnson’s dictionary opens up additional perspectives for the study of bilingual

lexicographical resources in the 18th century and research into the history of revision

and the reuse of dictionaries.

48 We test our modeling of lexicographic structures with three samples from Johnson’s

monolingual dictionary representing the most frequent word-classes: the adjective

ABLE, the verb To APPLAUD, and all entries for the noun APPLE (the use of all caps

versus small caps by Johnson is explained below). We further compare Johnson’s apple

entries with the section of apple entries in Adelung’s bilingual English-German

translation of Johnson’s dictionary. To illustrate the differing encoding structures of

bilingual German-English dictionaries we use Ebers’ entry FÄHIG, the equivalent of

ABLE. As a source for this entry, Ebers obviously used only the German-French

dictionary of Christian Friedrich Schwan (Schwan 1782), so we include Schwan’s entry

FÆHIG in order to illustrate dictionary reuse across languages in the 18th century. The

images of the encoded pages are given as a supplement to this article.

4.2. Typographic Analysis and Text Encoding

49 Luna begins his essay on the typographic design of Johnson’s dictionary with some

reflexions on how a typographer would analyze a dictionary: “In particular, how does a

typographer look at a dictionary that is also a cultural artifact, as Samuel Johnson’s

Dictionary of the English Language undoubtedly is?” (2005, 175). Building on a more wide-

ranging definition of typography as “configuration of verbal graphic language,” Luna

concludes, “the main concern of this essay is not the quality of the printing, nor the

nature of the paper, nor even the origin of the founts of type used to compose the

Dictionary, but how its visual presentation reflects the structure of the text, its usability,

and perhaps even its compiler’s intentions” (2005, 175).

50 This concept comes very close to what a TEI encoding of a dictionary in an adequate

granularity should achieve: reflecting the structure of the encoded text, facilitating re-

usability in electronic form and—at its best—assisting in the detection of the author’s

intentions. In order to put our aim of a consistent modeling of heterogeneous

structures into practice, we follow some basic principles.

51 We adopt a conservative editorial view for our literal transcription (see section 9.5.1 of

P5) and try to keep the latter close to the printed original: we do not add any character to the original text or delete any, we transcribe the text in the order in which it appears

in the source, we preserve the linear structures of the text with <pb>, <cb> and <lb>,

and we retain the end-of-line hyphenation (see section 3.2.2 of P5). Given the orthographical variation within the texts of the dictionaries, this approach makes transcription much easier. For clarity and to ensure a consistent encoding we encode only a few


structurally important typographic features (significant use of typeface and italics) at

the level of the lexical entry.8

4.3. Encoding Practice at the <entry> Level

52 With re-usability, interoperability, and sustainability of the dictionary entries in mind,

we use two attributes to refine the <entry> element: @xml:id to guarantee a robust

and reliable non-ambiguous identification and @type for classification of the entries.

53 The @xml:id attribute is composed of four parts, each separated by a dot:

two initials of the author’s name and a combination of six letters or numbers to identify the

encoded edition precisely

four digits for the year of publication

six digits for the running number of the entry (given as a random value in the examples)

the lemma, transcribed in lower case only and with any incidental spaces replaced by

underlines.

54 Thus our sample entry ABLE in Samuel Johnson’s dictionary is assigned the @xml:id

'sjdict1f.1755.000123.able'. In the first part, “sj” is taken from Samuel Johnson, “dict”

reflects the title Dictionary of the English Language, and “1” indicates the edition and “f”

the format folio (because edition and format are both rather important for a precise

identification of the different printed editions of Johnson’s dictionary). Such markers are not

necessary for Adelung (Henne 2001, 170), Ebers (Lewis 2012), and Schwan.

55 We use the TEI @type attribute of <entry> to distinguish typographically or verbally

marked types of entries and map them onto corresponding identifiers of the ISOcat

data category registry. The @type attribute used on <entry> belongs to the attribute

class att.entryLike, which includes a list of suggested values for @type. For the

entries in Johnson’s Dictionary we had to add some more fine-grained distinctions to the

list of suggested values.

56 An occasional user of Johnson’s Dictionary may be puzzled about the typesetting of

entry headwords. Thus APPLAUD and APPLE are in full caps, while APPLAUSE and APPLE

TREE are in small caps. Now and then, however, entries appear typeset in italic capital

letters, e.g. ABORIGINES and ABRACADABRA. In his preface, Johnson explains the

background for these marked differences, which for him reflect basic lexicographical

distinctions: “In the investigation both of the orthography and signification of words,

their ETYMOLOGY was necessarily to be considered, and they were therefore to be divided

into primitives and derivatives. A primitive word, is that which can be traced no

further to any English root; . . . Derivatives, are all those that can be referred to any

word in English of greater simplicity” (1755, 3f). Thus primitives or roots are marked by

full caps and the derivatives by small caps. Furthermore, the entries in italic capital

letters indicate foreign words used in the English language (Luna 2005, 181).

57 As Luna notices (2005, 196 fn. 24), this distinction of entries echoes a completely

different way of organizing a dictionary: word-families, represented by roots (in

alphabetical order), followed by their derivatives (ordered non-alphabetically into

morphological or etymological groups). Since Johnson used a single alphabetical order

for all entries, this organizing principle is no longer clearly visible. It is only faintly

reflected in the differentiation of the lemmas. But it is still implicit and that is why we

think it should be encoded explicitly as a significant feature of the dictionary structure.



Accordingly, we map the entries representing lexical units in Johnson’s Dictionary onto

the ISOcat identifiers root or derivation, and use foreign to indicate foreign words. Two examples follow: ABLE and APPLE of Love.

<entry xml:id="sjdict1f.1755.000123.able" type="Root">

<form type="lemma" norm="able">

<lb/><orth rend="allcaps">A'BLE</orth><pc>.</pc>

<gramGrp><pos norm="adjective">adj.</pos></gramGrp>

</form>
<etym>...</etym>
<sense>...</sense>
</entry>

<entry xml:id="sjdict1f.1755.000346.apple_of_love" type="Phrase">

<form type="lemma" norm="apple of love">

<lb/><orth><hi rend="smallcaps">APPLE</hi> <hi rend="italics">of

Love</hi></orth><pc>.</pc>

<gramGrp><pos norm="noun"/></gramGrp>

</form>

<sense>

<cit type="Encyclopedic_Information">

<quote><lb/>Apples of love are of three sorts; ...</quote>

<bibl><author>Mortimer</author>’s <title>Husbandry</title>.</bibl>

</cit>

</sense>

</entry>

58 The typography of the entry APPLE of Love―small caps for apple though belonging to the

root entries, italics for of love, and the word class information missing from the source

(though supplied in the encoding)―indicates uncertainty about the word status of the

entry. Furthermore, the classification as type phrase may require some explanation.

Valerie Adams comments in her introduction to word-formation on the distinction

between words and phrases: “Certain noun-preposition-noun phrases also show their

incomplete unification by the possibility of pluralizing the first noun” (1976, 9). Since

the illustrative quotation of Mortimer’s book on Husbandry starts with the plural form

“apples”, we regard the type “Phrase” here as justified and did not consider alternative

ISOcat options.

4.4. The <form> Block

59 The <form> element is designed to contain information on the written form (encoded

using <orth>) and, if present, the spoken form (encoded using <pron>) of one

lemma. We use <form> with two attributes: a @type attribute to distinguish the

lemma from any given inflected forms and a @norm attribute to even out any

orthographic variation, such as the use of upper or lower case, hyphenation, or special

markers to indicate the stress position within the orthographic representation of the

lemma. The <form> block contains a number of elements including <orth> and

<gramGrp>; the TEI <stress> element, designed for stress patterns given


separately, is not applicable here; in any case, we did not want to split up

the orthographic representation any further or change it.

60 Within <orth>, typographic details are stored in a @rend attribute. In Johnson’s

Dictionary we use it to store his typographic differentiation of the printed entries: that

is, his distinction between all caps and small caps. In Schwan’s dictionary it is used to

distinguish two different orthographic representations of the German lemma, the first

with Antiqua capital letters only, the second with upper and lower case, depending on

the German orthography, and using a Fraktur typeface.

61 We use <gramGrp> to collect grammatical information such as part-of-speech (in a

<pos> element) or gender (in a <gen> element). Quite often, grammatical

information precedes or follows the orthographic representation of the entry, such as

the infinitive marker To in entries for verbs in Johnson’s dictionary or the determiner

der, die, das in German noun entries. We capture this information with a <gram> element and a @type attribute containing the appropriate ISOcat value. Without

exception, we store all elements that interpret grammatical features like <pos>,

<gen>, or <gram> within a <gramGrp> element, once again using a @norm attribute

to map the different grammatical descriptions given in the dictionaries to an ISOcat

entry. This way, we avoid conflicts with the order of text on the printed page and can

adjust inconsistencies like missing word class information, such as by adding an empty

<pos> element with a @norm attribute based on information collected elsewhere in

the entry. One example is Johnson’s entry APPLAUD that requires two <gramGrp> elements to capture the grammatical information:


<pb n="148"/><cb n="APP"/>

<entry xml:id="sjdict1f.1755.000234.applaud" type="Root">

<lb/><form type="lemma" norm="applaud">

<gramGrp><gram type="infinitiveParticle">To</gram></gramGrp>

<orth rend="allcaps">APPLA'UD</orth><pc>.</pc>

<gramGrp><pos norm="verb">v.a.</pos></gramGrp>

</form>

<etym>

<pc>[</pc><mentioned xml:lang="la">applaudo</mentioned><pc>,</pc>

<lang><abbr>Lat.</abbr></lang><pc>]</pc>

</etym>

<lb/><sense>

<num>1.</num>

<def>To praise by clapping the hand.</def>

</sense>

<lb/><sense>

<num>2.</num>

<def>To praise in general.</def>

</sense>

<cit type="example">

<lb/><quote>I would applaud thee to the very echo,

<lb/>That should applaud again.</quote>

<bibl><author><abbr>Shakesp.</abbr></author><title>Macbeth</title>.</bibl>

</cit>

<cit type="example">

<lb/><quote>Nations unborn your mighty names shall sound,

<lb/>And worlds applaud that must not yet be found!</quote>

<bibl><author>Pope</author>.</bibl>

</cit>

</entry>

62 Our use of <pc> is governed by the principle that we avoid punctuation marks as

delimiters of text in elements within <form> and within <etym>; this is for ease of

reusability and searching.
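A constructed contrast (not taken from the source) illustrates the principle:

<!-- discouraged: the delimiter becomes part of the searchable string -->
<orth rend="allcaps">APPLE-GRAFT.</orth>

<!-- preferred: the delimiter is recorded separately in <pc> -->
<orth rend="allcaps">APPLE-GRAFT</orth><pc>.</pc>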

63 In testing our encoding concept we encountered some phenomena―word class in

grammar and hyphenation in orthography―which prompted us to reinforce our aim of

consistently modeling heterogeneous lexicographical data through normalization. The

first case has to do with an old problem of word classes: the categories of adjective and

adverb in German. Ebers defines the part-of-speech information in his entry fähig with

the abridged Latin terms adj. et adv. This concept—one word, two word classes—is not

compatible with the present-day understanding of word classes in German: since

adverbs in German are never inflected and fähig is capable of inflection, this word is

generally regarded as an adjective in any present-day dictionary of German. Of course,

we do not alter Ebers’ word class definition, but we suggest resolving the word class

conflict in this and in comparable cases by standardizing the value of the @norm


attribute on <pos>, using the ISOcat value adjective only. Ebers’ example entry fähig in

abridged form:

<entry xml:id="jedictge.1796.000999.fähig" type="main">

<form xml:lang="de" type="lemma" norm="fähig">

<lb/><orth>Fähig</orth><pc>,</pc>

<gramGrp>

<pos norm="adjective" xml:lang="la">adj. et adv.</pos>

</gramGrp>

</form>

<sense> ... </sense>

</entry>

64 The second phenomenon has to do with hyphenation, an old problem primarily but not

only in the English language. First, consider Johnson’s noun compounds with apple in

abridged form:


<entry xml:id="sjdict1f.1755.000347.apple-graft" type="derivation">

<form type="lemma" norm="apple graft">

<lb/><orth rend="smallcaps">APPLE-GRAFT</orth><pc>.</pc>

<gramGrp><pos norm="noun">n.s.</pos></gramGrp>

</form>

<etym><pc>[</pc>from

<mentioned corresp="#sjdict1f.1755.000345.apple">apple</mentioned>

<lbl>and</lbl>

<mentioned corresp="#sjdict1f.1755.009999.graft">graft</mentioned>

<pc>.]</pc>

</etym>

<sense> ... </sense>

</entry>

<entry xml:id="sjdict1f.1755.000348.apple-tart" type="derivation">

<form type="lemma" norm="apple tart">

<lb/><orth rend="smallcaps">APPLE-TART</orth><pc>.</pc>

<gramGrp><pos norm="noun"/></gramGrp>

</form>

<etym><pc>[</pc>from

<mentioned corresp="#sjdict1f.1755.000345.apple">apple</mentioned>

<lbl>and</lbl>

<mentioned corresp="#sjdict1f.1755.029999.tart">tart</mentioned>

<pc>.]</pc>

</etym>

<sense> ... </sense>

</entry>

<entry xml:id="sjdict1f.1755.000349.apple_tree" type="derivation">

<form type="lemma" norm="apple tree">

<lb/><orth rend="smallcaps">APPLE TREE</orth><pc>.</pc>

<gramGrp><pos norm="noun"><abbr>n.s.</abbr></pos></gramGrp>

</form>

<etym><pc>[</pc>from

<mentioned corresp="#sjdict1f.1755.000345.apple">apple</mentioned>

<lbl>and</lbl>

<mentioned corresp="#sjdict1f.1755.039999.tree">tree</mentioned>

<pc>.]</pc>

</etym>

<sense> ... </sense>

</entry>

<entry xml:id="sjdict1f.1755.000350.apple_woman" type="derivation">

<form type="lemma" norm="apple woman">

<lb/><orth rend="smallcaps">APPLE WOMAN</orth><pc>.</pc>

<gramGrp><pos norm="noun"><abbr>n.s.</abbr></pos></gramGrp>

</form>

<etym><pc>[</pc>from

<mentioned corresp="#sjdict1f.1755.000345.apple">apple</mentioned>


<lbl>and</lbl>

<mentioned corresp="#sjdict1f.1755.049999.woman">woman</mentioned>

<pc>.]</pc>

</etym>

<sense> ... </sense>

</entry>

65 Apart from the special case “APPLE of love,” both “APPLE-GRAFT” and “APPLE-TART” are

hyphenated, whereas “APPLE TREE” and “APPLE WOMAN” are spelled as two separate words.

There is no consistent distinction here between open (word-spaced) and hyphenated

compounds. Noel Osselton gives a compact résumé of “variation of hyphenated

compounds” in entries and their steady downgrading in the second half of the

dictionary from the letter M onwards (2005). Against this background we have used the

@norm attribute of <form> in order to provide the best support for search

procedures: we have retained the original hyphenated and open compound spellings

from Johnson’s text but have encoded the open or word-spaced form on the @norm attribute as the standardized form.

66 In his translation of Johnson’s apple entries, Adelung takes a different view. He unifies

the hyphenated spelling for all the apple compounds, downgrades the hybrid entry

Apple of love to appear as a form mentioned within the base entry apple, and adds more

compounds, taken from other sources mentioned in the introduction:


<entry xml:id="jagkwbed.1783.000999.apple" type="main">

<form xml:lang="en" type="lemma" norm="apple">

<lb/><orth>'Apple</orth><pc>,</pc>

<gramGrp>

<pos norm="noun" xml:lang="la">subst.</pos>

</gramGrp>

<pc>(</pc><pron>äpp'l</pron><pc>,</pc>

</form>

<etym><mentioned><lang xml:lang="ang">angels.</lang>

<lang xml:lang="nds">niederd.</lang>aep- <lb/>pel</mentioned>

<pc>,</pc> <mentioned><lang xml:lang="de">deutsch</lang> Apfel</mentioned>

<pc>.</pc><pc>)</pc>

</etym>

<sense xml:lang="de">

<num>1)</num>

<def>Die Frucht des <lb/>Apfelbaumes,</def>

<cit type="translation"><quote>der Apfel.</quote></cit>

</sense>

<sense xml:lang="de">

<num>2)</num>

<cit type="Encyclopedic_Information">

<quote>Wegen eini-<lb/>ger Ähnlichkeit in der Gestalt ...</quote>

</cit>

<cit type="Encyclopedic_Information">
<quote><mentioned xml:lang="en">The Apple of love, Love-apple</mentioned>
o-<lb/>der <mentioned xml:lang="en">Wolf's Peach</mentioned>,
<cit type="translation" xml:lang="de"><quote>Liebesapfel</quote>
</cit>
<term xml:lang="la">Lycoper-<lb/>sicon <name nymRef="Linné">Linn.</name></term>
auch wohl eine Art des <term xml:lang="la">Sola-<lb/>num</term>;
<mentioned xml:lang="en">the Mad-apple</mentioned>,
</quote>
</cit>
</sense>

<sense xml:lang="de">

<num>3)</num>

<usg>Figürlich,</usg><def>die Pupille in dem Auge,</def>

<cit type="translation"><quote>der <lb/>Augapfel,</quote></cit>

<xr type="synonym"><lbl>welcher wohl auch

<ref xml:lang="en" target="#adwbeng1.1783.009999.eye-ball">

Eye-ball</ref> ge-<lb/>nannt wird.</lbl>

</xr>

</sense>

</entry>

<entry xml:id="jagkwbed.1783.001000.apple-coar" type="main">

<form xml:lang="en" type="lemma" norm="apple coar">


<lb/><orth>'Apple-coar</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<etym><lbl>von</lbl>

<mentioned xml:lang="en" corresp="#jagkwbed.1783.000999.apple">

apple 1)</mentioned>

</etym>

<sense>

<def>der Griebs oder Gröbs in dem Apfel.</def>

</sense>

</entry>

<entry xml:id="jagkwbed.1783.001001.apple-graft" type="main">

<form xml:lang="en" type="lemma" norm="apple graft">

<lb/><orth>'Apple-graft</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense>...</sense>

</entry>

<entry xml:id="jagkwbed.1783.001002.apple-loft" type="main">

<form xml:lang="en" type="lemma" norm="apple loft">

<lb/><orth>'Apple-loft</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense>...</sense>

</entry>

<entry xml:id="jagkwbed.1783.001003.apple-monger" type="main">

<form xml:lang="en" type="lemma" norm="apple monger">

<lb/><orth>'Apple-monger</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense>...</sense>

</entry>

<entry xml:id="jagkwbed.1783.001004.apple-paring" type="main">

<form xml:lang="en" type="lemma" norm="apple paring">

<lb/><orth>'Apple-paring</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense>...</sense>

</entry>

<entry xml:id="jagkwbed.1783.001005.apple-roaster" type="main">

<form xml:lang="en" type="lemma" norm="apple roaster">

<lb/><orth>'Apple-roaster</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense>...</sense>

</entry>


<entry xml:id="jagkwbed1.1783.001006.apple-squire" type="main">

<form xml:lang="en" type="lemma" norm="apple squire">

<lb/><orth>'Apple-squire</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense>...</sense>

</entry>

<entry xml:id="jagkwbed.1783.001007.apple-tart" type="main">

<form xml:lang="en" type="lemma" norm="apple tart">

<lb/><orth>'Apple-tart</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense>...</sense>

</entry>

<entry xml:id="jagkwbed.1783.001008.apple-thorn" type="main">

<form xml:lang="en" type="lemma" norm="apple thorn">

<lb/><orth>'Apple-thorn</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense>...</sense>

</entry>

<entry xml:id="jagkwbed.1783.001009.apple-tree" type="main">

<form xml:lang="en" type="lemma" norm="apple tree">

<lb/><orth>'Apple-tree</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense>...</sense>

</entry>

<entry xml:id="jagkwbed.1783.001010.apple-woman" type="main">

<form xml:lang="en" type="lemma" norm="apple woman">

<lb/><orth>'Apple-woman</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense>...</sense>

</entry>

67 These examples illustrate that, despite differences in detail, the <entry> and <form>

information can be encoded using the same pattern. Missing standard information (like

word class) can be supplied without modification of the transcription of the printed

text. Even if the encoding cuts into typographical structures (such as <pron> in

Adelung’s dictionary), it does not corrupt the transcription.

4.5. <etym>: Between Etymology and Word-Formation

68 As noted above, Johnson emphasized the importance of etymology in his preface.

Accordingly, he opens his dictionary with a grammar, and, in the introduction to the


chapter “Of DERIVATION”, explains: “That the English language may be more easily made

understood, it is necessary to enquire how its derivative words are deduced from their

primitives, and how the primitives are borrowed from other languages” (1755, 47). In

compound word entries, he uses square brackets following the part-of-speech

information to mark the root components of the compound—his derivatives (for

example, in APPLE-GRAFT: [from apple and graft]); for root entries, he provides

information about related words in Indo-European, Romance or Germanic languages, if

necessary with an English translation (for example, in ABLE: [habile, Fr. habilis, Lat.

Skilful, ready.]). In accordance with Johnson’s method, we use the <etym> element for

both cases. The <etym> element requires no additional attribute to distinguish these

two cases since its content structure clearly indicates to what type of entry a given

<etym> belongs and how it is to be interpreted:

<entry xml:id="sjdict1f.1755.000123.able" type="Root">

<form>...</form>

<etym><pc>[</pc>

<mentioned xml:lang="fr">habile</mentioned><pc>,</pc>

<lang><abbr>Fr.</abbr> </lang>

<mentioned xml:lang="la">habilis</mentioned><pc>,</pc>

<lang><abbr>Lat.</abbr></lang>

<lb/><gloss xml:lang="en">Skilful<pc>,</pc> ready<pc>.</pc>

</gloss><pc>]</pc>

</etym>

69 In the encoding of the entry ABLE above, the content of <etym> consists of two

<mentioned> elements, each with a <lang> and possibly a <gloss>, meaning it

must be a root entry.

<entry xml:id="sjdict1f.1755.000347.apple-graft" type="derivation">

<form>...</form>

<etym><pc>[</pc>from

<mentioned corresp="#sjdict1f.1755.000345.apple">apple</mentioned>

<lbl>and</lbl>

<mentioned corresp="#sjdict1f.1755.009999.graft">graft</mentioned>

<pc>.]</pc>

</etym>

70 In the encoding of the entry APPLE-GRAFT, the content of <etym> consists of two

<mentioned> elements, each with a @corresp attribute that points to other entries

within the same dictionary, indicating a derivation. While the effort of identifying the

target entry and inserting the corresponding @xml:id attribute is not insignificant,

from our point of view the resulting network of linked entries is worth the effort.


4.6. Stepwise Refinement of <sense>: <num>, <def>, and

<gramGrp> with <gram>

71 The function of <sense> as a container for the semasiological information of

dictionary entries was explained in the first half of this paper. Some sections of the

encoding of ABLE can illustrate the flexibility of the concept of crystals for the

encoding of complex semantic structures. The first step of refinement adds <num> elements to label the different <sense>s.

<entry xml:id="sjdict1f.1755.000123.able" type="Root">

<form> ... </form>

<etym> ... </etym>

<sense>

<lb/><num>1.</num>

<def>...</def><cit>...</cit><cit>...</cit>

</sense>

<sense>

<lb/><num>2.</num>

<def>Having power sufficient; enabled.</def>

<cit type="example">

<lb/><quote>All mankind acknowledge themselves able and

sufficient to <lb/> do many things, which actually they never do.

</quote>

<bibl><author>South</author>’s <title>Serm.</title></bibl>

</cit>

</sense>

<sense>

<lb/><num>3.</num>

<gramGrp>

<gram type="syntax">Before a verb, with the participle

<hi rend="italics">to</hi></gram>

</gramGrp>,

<def>it signifies generally hav-<lb/>ing the power</def>;

<gramGrp>

<gram type="syntax">before a noun, with <hi rend="italics">for</hi></gram>

</gramGrp>,

<def> it means <hi>qualified</hi></def>.

<!-- instances of <cit type="example"> omitted for brevity -->

</sense>

72 In a second step—<num>3.</num>—one <sense> element is used to combine the

morpho-syntactic features “able + to before a verb” in the <gramGrp> container with

the semasiological definition “signifies generally having the power” contained in the

<def> element. In a different construction with able, the morpho-syntactic feature

“before a noun, with for” in <gramGrp> and <gram> is connected with the definition


‘it means qualified’ in <def>. Usually we find grammatical information in a kind of

shorthand in the source, which is likewise encoded briefly:

<gramGrp><pos norm="noun">n.s.</pos></gramGrp>

73 For ABLE, we have a discursive example, which is interesting not only in its own

right but also because it combines two clearly distinct syntactic structures and

divergent semantic paraphrases into one sense. The <cit> examples that follow in

sense number 3 repeat the structures and illustrate both usages:

<cit type="example">

<lb/><quote>Wrath is cruel, and anger is outrageous; but who is

able <lb/> to stand before envy?</quote>

<bibl><title>Prov.</title>

<biblScope type="part">xxvii.</biblScope>

<biblScope type="ll">4.</biblScope>

</bibl>

</cit>

<cit type="example">

<lb/><quote>There have been some inventions also, which have been

<lb/>able for the utterance of articulate sounds,

as the speaking of <lb/>certain words.</quote>

<bibl><author>Wilkin</author>’s <title>Mathematical Magic</title>.

</bibl>

</cit>

74 The phrases able to and able for are marked by italics in the print dictionary, but this

was not captured in the encoding. Furthermore, while the refinement of the encoding

could be extended to word level and features of a fine-grained morpho-syntactical analysis, this is beyond what we want to illustrate in this paper. Therefore we have encoded only enough to support the analysis of syntax.

4.7. Bilingual Dictionaries: A Shift of Perspective

75 The consistent modeling of heterogeneous lexical structures can be extended to the

more complex structures we find in the two bilingual dictionaries, Adelung’s English-

German translation of Johnson’s dictionary (1783–1796) and Ebers’ New and Complete

Dictionary of the German and English Languages (1796), compiled using Adelung’s and

Schwan’s lexicographical materials. Nevertheless a comparable precision in the

encoding can be achieved. Let us first compare the entry Apple-tart in Johnson’s

dictionary and Adelung’s translation:


<entry xml:id="sjdict1f.1755.000348.apple-tart" type="derivation">

<form type="lemma" norm="apple tart">

<lb/><orth rend="smallcaps">APPLE-TART</orth><pc>.</pc>

<gramGrp><pos norm="noun"/></gramGrp>

</form>

<etym><pc>[</pc>from

<mentioned corresp="#sjdict1f.1755.000345.apple">apple</mentioned>

<lbl>and</lbl>

<mentioned corresp="#sjdict1f.1755.029999.tart">tart</mentioned>

<pc>.]</pc>

</etym>

<sense>

<def>A tart made of apples.</def>

<cit type="example">

<lb/><quote>What, up and down carv’d like an apple-tart.</quote>

<lb/><bibl><author>Shakespeare</author>'s

<title>Taming of the Shrew</title>.

</bibl>

</cit>

</sense>

</entry>

<entry xml:id="jagkwbed.1783.001007.apple-tart" type="main">

<form xml:lang="en" type="lemma" norm="apple tart">

<lb/><orth>'Apple-tart</orth><pc>,</pc>

<gramGrp><pos norm="noun" xml:lang="la">subst.</pos></gramGrp>

</form>

<sense xml:lang="de">

<def>eine Torte von Ä-<lb/>pfeln,</def>

<cit type="translation"><quote>eine Äpfeltorte.</quote></cit>

</sense>

</entry>

76 In contrast to Johnson, Adelung, meeting the requirements of an English-German

dictionary, left out the <etym> element on word-formation and the Shakespeare

quotation and added the word-class information. He translated Johnson’s definition of

apple-tart almost literally into German and then added the slightly strange German

compound Äpfeltorte.

77 The encoding of the translation becomes more complex because of the mix of two

languages, which requires additional control over the scope and inheritance of the

@xml:lang attribute. The use of the German plural form Äpfel in Äpfeltorte may have

been inspired by Johnson’s plural definition and the fact that a decent apple-tart

requires more than one apple. Ten years later, in Adelung’s monolingual German

dictionary, the entry shows no umlaut and the definition is derived from a recipe that

puts the sliced apples on top (1793–1801, vol. 1, 412).
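An abridged and annotated reduction of the Adelung entry above (our own) makes the mechanism explicit: the language declared on an element is inherited by its descendants until another @xml:lang overrides it.

<entry>
   <form xml:lang="en" type="lemma">
      <orth>'Apple-tart</orth>       <!-- the English headword -->
   </form>
   <sense xml:lang="de">             <!-- the description switches to German -->
      <def>eine Torte von Äpfeln,</def>
      <cit type="translation">       <!-- inherits xml:lang="de" from <sense> -->
         <quote>eine Äpfeltorte.</quote>
      </cit>
   </sense>
</entry>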


78 In a final look at Ebers’ German-English dictionary, the randomly chosen sample entry

fähig shows the problems in encoding bilingual dictionaries when translation from

the mother tongue into a foreign language is involved.


<entry xml:id="jedictge.1788.000999.fähig" type="main">

<form xml:lang="de" type="lemma" norm="fähig">

<lb/><orth>Fähig</orth><pc>,</pc>

<gramGrp><pos xml:lang="la" norm="adjective">adj. et adv.</pos>

</gramGrp>

</form>

<sense>

<def xml:lang="de">tüchtig, geschickt</def>

<cit type="translation" xml:lang="en">

<quote>capable, able, apt, fit, proper.</quote>

</cit>

<cit type="example" xml:lang="de">

<quote>zu etwas fähig seyn,</quote></cit>

<cit type="translation" xml:lang="en">

<quote>to be capable or <lb/>fit for a Thing.</quote></cit>

<lb/><cit type="example" xml:lang="de">

<quote>sie ist des Erbrechts nicht fähig</quote></cit>

<cit type="translation" xml:lang="en">

<quote>she is <lb/>incapable for Succession.</quote></cit>

</sense>

<sense>

<def xml:lang="de">fähig, lehrsam, gelehrig,</def>

<cit type="translation" xml:lang="en">

<quote>docile, teach- <lb/>able.</quote></cit>

<lb/><cit type="example" xml:lang="de">

<quote>fähig etwas zu erfinden</quote></cit>

<cit type="translation" xml:lang="en">

<quote>inventive.</quote></cit>

<cit type="example" xml:lang="de">

<quote>der Unterweisung fähig</quote></cit>

<cit type="translation" xml:lang="en">

<quote>susceptible of <lb/>Discipline, of Instruction</quote></cit>

<lb/><cit type="example" xml:lang="de">

<quote>er ist fähig alles zu unternehmen</quote></cit>

<cit type="translation" xml:lang="en">

<quote>he <lb/>is a Man that will undertake any <lb/>Thing</quote></cit>

</sense>

<sense>

<def xml:lang="de">fähig machen,</def>

<cit type="translation" xml:lang="en">

<quote>to enable or fit, to in-<lb/>capacitate, to habilitate.</quote>

</cit>

<lb/><cit type="example" xml:lang="de">

<quote>der Hunger macht einen zu allem fähig,</quote></cit>


<lb/><cit type="translation" xml:lang="en">

<quote>Hunger breaks through Stone-<lb/>Walls, or Hunger drives

the Wolf <lb/>out of the Forest.</quote></cit>

<lb/><cit type="example" xml:lang="de">

<quote>einen wieder fähig machen,</quote></cit>

<cit type="translation" xml:lang="en">

<quote>to rehabi-<lb/>litate, re-enable, re-instate, re-<lb/>store, or re-establish one</quote></cit>

</sense>

</entry>

79 At first glance, the main lexicographical problem here is to specify the different senses

of fähig: first in German (with a separate <sense>, each containing a <def>, for each

sense), then in translating the German adjectives into English equivalents (using

<cit type="translation">), and finally in adding English translations (in <cit type="translation">) of German example phrases (in <cit type="example">) containing the adjective. Unlike in Johnson’s dictionary, the

senses are not numbered, and the principle of their ordering is not quite clear.

80 Recalling the longish title of Ebers’ dictionary, New and Complete Dictionary of the German

and English Languages Composed Chiefly After the German Dictionaries of Mr. Adelung and of

Mr. Schwan, it is worthwhile taking a closer look at Ebers’ possible sources. The entry

fähig in Adelung’s dictionaries (1774–1786, vol. 2; 1793–1801, vol. 2) is built around two

numbered senses and looks completely different. But checking Christian Friedrich

Schwan’s Nouveau dictionnaire de la langue allemande et françoise: Composé sur les

dictionnaires de M. Adelung et de l’Acad. Françoise (1782, 519) shows clearly how Ebers had

compiled this entry of his dictionary:


<entry xml:id="csdictaf.1782.000999.fähig" type="main">

<form xml:lang="de" rend="iso15924:Latn" type="lemma" norm="fähig">

<lb/><orth>FÆHIG</orth><pc>,</pc>

<pc>(</pc><orth rend="iso15924:Latf">fähig</orth><pc>)</pc>

<gramGrp>

<pos xml:lang="fr" norm="adjective">adj. & adv.</pos>

</gramGrp>

</form>

<sense rend="iso15924:Latn">

<def xml:lang="de">tüchtig, geschikt;</def>

<cit type="translation" xml:lang="fr">

<quote>Capable, habile, propre.</quote></cit>

<cit type="example" xml:lang="de"><quote>Zu etwas fähig seyn;</

quote></cit>

<lb/><cit type="translation" xml:lang="fr">

<quote>être capable de qq. ch. être propre à une chose.</quote></cit>

<lb/><cit type="example" xml:lang="de">

<quote>Sie ist des Erbrechts nicht fähig;</quote></cit>

<cit type="translation" xml:lang="fr">

<quote>elle n'est pas <lb/>habile à succéder.</quote></cit>

</sense>

<sense rend="iso15924:Latn">

<abbr>It.</abbr><def xml:lang="de">Fähig, lehrsam, geleh-<lb/>rig</def>

<cit type="translation" xml:lang="fr"><quote>docile.</quote></cit>

<cit type="example" xml:lang="de">

<quote>Fähig etwas zu erfinden;</quote></cit>

<cit type="translation" xml:lang="fr"><quote>inven-<lb/>tif.</quote></

cit>

<cit type="example" xml:lang="de">

<quote>Der Unterweisung fähig;</quote></cit>

<cit type="translation" xml:lang="fr">

<quote>susceptible de di-<lb/>scipline.</quote></cit>

<cit type="example" xml:lang="de">

<quote>Er ist fähig alles zu unternèhmen;</quote></cit>

<lb/><cit type="translation" xml:lang="fr">

<quote>il est homme à tout entreprendre.</quote></cit>

<cit type="example" xml:lang="de">

<quote>Dinge, die<lb/>nicht jedermann zu verstehen fähig ist;</quote>

</cit>

<cit type="translation" xml:lang="fr">

<quote>des <lb/>choses qui ne sont pas à la portée de tout

le mon-<lb/>de</quote></cit>

<cit type="example" xml:lang="de">

<quote>Er ist nicht fähig, euch in geringsten zu<lb/>schaden</quote></cit>

<cit type="translation" xml:lang="fr">

Journal of the Text Encoding Initiative, Issue 3 | November 2012

102

<quote>il est incapable de vour nuire aucunement.</quote></cit>

<lb/><cit type="example" xml:lang="de"><quote>Fähig machen</quote></

cit>

<cit type="translation" xml:lang="fr"><quote>habiliter.</quote></cit>

<cit type="example" xml:lang="de">

<quote>Der Hunger macht <lb/>einen zu allem fähig;</quote></cit>

<cit type="translation" xml:lang="fr">

<quote>la faim chasse le loup hors<lb/>du bois.</quote></cit>

<cit type="example" xml:lang="de">

<quote>Einen wieder fähig machen;</quote></cit>

<cit type="translation" xml:lang="fr">

<quote>réhabi-<lb/>liter qq. un.</quote></cit>

</sense>

</entry>

81 With the exception of two phrases—“Dinge, die nicht jedermann zu verstehen fähig ist”

and “Er ist nicht fähig euch in geringsten zu schaden”—Ebers has copied the German

text of Schwan’s dictionary and replaced the French translation equivalents with English

ones. The encoding problems remain the same, and we think that the solution we

propose is adequate.

5. Conclusion

82 Above we applied our encoding suggestions for the <form> block to Johnson’s entry To

APPLAUD but did not comment on the unusual structure of the elements <sense> and

<cit>: two numbered senses, followed by two quotations. A look at the last edition

(the fourth folio edition of 1773), which was considerably revised and prepared for

publication by Johnson himself, can make the author’s original intentions clearer.

Thanks to Anne McDermott’s excellent CD-ROM edition, published in 1996, we have

access to an SGML encoding of the texts of both the first and fourth folio editions and

can compare not only the texts themselves but also the change in the encoding over the

years, from the TEI P3 SGML of 1994 to the current P5 using XML Schema:

83 First folio edition [TEI P5]:


<entry xml:id="sjdict1f.1755.000234.applaud" type="Root" >

<lb/><form type="lemma" norm="applaud">

<gram type="infinitiveParticle">To</gram>

<orth rend="allcaps">APPLA'UD</orth><pc>.</pc>

<gramGrp><pos norm="verb">v.a.</pos></gramGrp>

</form>

<etym>

<pc>[</pc><mentioned xml:lang="la">applaudo</mentioned><pc>,</pc>

<lang><abbr>Lat.</abbr></lang><pc>]</pc>

</etym>

<lb/><sense>

<num>1.</num>

<def>To praise by clapping the hand.</def>

</sense>

<lb/><sense>

<num>2.</num>

<def>To praise in general.</def>

</sense>

<cit type="example">

<lb/><quote>I would applaud thee to the very echo,

<lb/>That should applaud again.</quote>

<bibl><author><abbr>Shakesp.</abbr></author><title>Macbeth</title>.</bibl>

</cit>

<cit type="example">

<lb/><quote>Nations unborn your mighty names shall sound,

<lb/>And worlds applaud that must not yet be found!</quote>

<bibl><author>Pope</author>.</bibl>

</cit>

</entry>

84 Anne McDermott’s fourth folio edition [TEI P3 SGML]:


<ENTRYFREE ID="J4APPLAUD-1" N="1999" TYPE="4">IV

<FORM>

<HI REND="ital">To</HI> <HI REND="acp">APPLA'UD.</HI>

</FORM>

<PB SIG="Bb2r" MACFILE=":4:100:148.CAL" PCFILE="4100148.CAL">

<POS><HI REND="ital">v.a.</HI></POS>

<ETYM>[<HI REND="ital">applaudo,</HI> Lat.]</ETYM>

<SENSE N="1">

<DEF>

<NUM>1.</NUM> To praise by clapping the hand.

</DEF>

<EG TYPE="verse">

<QUOTE>

<L>I would <HI REND="ital">applaud</HI> thee to the very echo,</L>

<L>That should <HI REND="ital">applaud</HI> again.</L>

</QUOTE>

<AUTHOR><HI REND="ital">Shakesp.</HI></AUTHOR>

<TITLE><HI REND="ital">Macbeth.</HI></TITLE>

</EG>

</SENSE>

<SENSE N="2">

<DEF>

<NUM>2.</NUM> To praise in general.

</DEF>

<EG TYPE="verse">

<QUOTE>

<L>Nations unborn your mighty names shall sound,</L>

<L>And worlds <HI REND="ital">applaud</HI> that must

not yet be sound!</L>

</QUOTE>

<AUTHOR><HI REND="ital">Pope.</HI>

</AUTHOR>

</EG>

</SENSE>

</ENTRYFREE>

85 We can conclude:

1. The transcription of the entry APPLAUD in the SGML version of the fourth folio edition shows clearly that Johnson had intended to illustrate each definition with an illustrative quotation, as elsewhere in the dictionary, and that the unusual structure of the first folio text—two numbered senses, followed by two quotations—is simply a typesetting error.

2. Both encodings have many structural features in common: with the exception of <cit> and <pc>, all elements used in our encoding were available in TEI P3, whereas the mechanisms usable at the attribute level are not comparable. But the main difference is the style of the encoding: although the SGML version is very close to the typography of the text, our encoding, using crystals, aims more at interpreting typographical detail in order to capture lexicographic and linguistic data and to constrain encoding options in favor of robust interoperability and reusability of resources.

A. Appendix: Facsimiles

A.1. Johnson, Entry “ABLE”

Facsimile A.1.: Page with entry “ABLE” from Johnson (1755).


A.2. Johnson, Entries “To APPLAUD” and “APPLE”

Facsimile A.2.: Page with entries “To APPLAUD” and “APPLE” from Johnson (1755).

A.3. Adelung, Entry “Apple”

Facsimile A.3.: Page with entry “Apple” from Adelung (1783–1796).


A.4. Ebers, Entry “FÄHIG”

Facsimile A.4.: Page with entry “FÄHIG” from Ebers (1796).


A.5. Schwan, Entry “FÆHIG”

Facsimile A.5.: Page with entry “FÆHIG” from Schwan (1782).

BIBLIOGRAPHY

Adams, V. 1976. An Introduction to Modern English Word-Formation. London: Longman.

Adelung, J. C. 1774–1786. Versuch eines vollständigen grammatisch-kritischen Wörterbuches Der

Hochdeutschen Mundart, mit beständiger Vergleichung der übrigen Mundarten, besonders aber der

Oberdeutschen. 5 vols. Leipzig: Breitkopf.

Adelung, J. C. 1783–1796. Neues grammatisch-kritisches Wörterbuch der Englischen Sprache für die

Deutschen; vornehmlich aus dem größern englischen Werke des Hrn. Samuel Johnson nach dessen vierten

Ausgabe gezogen und mit vielen Wörtern, Bedeutungen und Beyspielen vermehrt. 2 vols. Leipzig: im

Schwickertschen Verlage.

Adelung, J. C. 1793–1801. Grammatisch-kritisches Wörterbuch der Hochdeutschen Mundart, mit

beständiger Vergleichung der übrigen Mundarten, besonders aber der Oberdeutschen, von Johann

Christoph Adelung, Churfürstl. Sächs. Hofrathe und Ober-Bibliothekar . . . . 4 vols. Leipzig: Breitkopf.

Atkins, S., N. Bel, F. Bertagna, P. Bouillon, N. Calzolari, C. Fellbaum, R. Grishman, R. Lenci, C.

MacLeod, M. Palmer, G. Thurmair, M. Villegas, and A. Zampolli. 2002. “From Resources to

Applications. Designing the Multilingual ISLE Lexical Entry.” In Proceedings of the 3rd International

Conference on Language Resources and Evaluation, 687–693.


Ebers, J. 1796. New and Complete Dictionary of the German and English Languages composed chiefly after

the German Dictionaries of Mr. Adelung and of Mr. Schwan. . . . Vol. 1. Leipzig: Breitkopf and Haertel.

Halpern, J. 2006. “The role of lexical resources in CJK natural language processing.” In Proceedings

of the Workshop on Multilingual Language Resources and Interoperability, 9–16.

Henne, H., ed. 2001. Deutsche Wörterbücher des 17. und 18. Jahrhunderts. Einführung und Bibliographie.

Hildesheim/Zürich/New York: Olms.

Ide, N., A. Kilgarriff, and L. Romary. 2000. “A Formal Model of Dictionary Structure and Content.”

In Proceedings of Euralex 2000, Stuttgart, 113–126. http://hal.archives-ouvertes.fr/hal-00164625.

Johnson, S. 1755. A Dictionary of the English Language . . . . 2 vols. London: W. Strahan.

Kilgarriff, A., and D. Tugwell. 2002. “Sketching Words.” In Lexicography and Natural Language Processing: A

Festschrift in Honour of B. T. S. Atkins, ed. Marie-Hélène Corréard, 125–137. Stuttgart: EURALEX.

http://www.kilgarriff.co.uk/Publications/2002-KilgTugwell-AtkinsFest.pdf.

Landau, S. I. 2001. Dictionaries. The Art and Craft of Lexicography. 2nd ed. Cambridge: Cambridge

University Press.

Lewis, D. 2012. Die Wörterbücher von Johannes Ebers. Studien zur frühen englisch-deutschen

Lexikographie. PhD diss., University of Würzburg (in print).

Luna, P. 2005. “The typographic design of Johnson’s Dictionary.” In Anniversary Essays on Johnson’s

Dictionary, ed. Jack Lynch and Anne McDermott, 175–197. Cambridge: Cambridge University Press.

McDermott, A. ed. 1996. Samuel Johnson, A Dictionary of the English Language, on CD-ROM. The First

and Fourth Editions. Cambridge: Cambridge University Press.

Miller, George A., and Christiane Fellbaum. 2007. “WordNet Then and Now.” Language Resources

and Evaluation 41: 209–214. doi:10.1007/s10579-007-9044-6.

Osselton, N. E. 2005. “Hyphenated Compounds in Johnson’s Dictionary.” In Anniversary Essays on

Johnson’s Dictionary, ed. Jack Lynch and Anne McDermott, 160–174. Cambridge: Cambridge

University Press.

Reddick, A. 2006. The Making of Johnson’s Dictionary 1746–1773. Rev. ed. Cambridge: Cambridge

University Press.

Romary, L. 2009. “ODD as a generic specification platform.” Paper presented at Text Encoding in

the Era of Mass Digitization: Conference and Members’ Meeting of the TEI Consortium. http://

hal.inria.fr/inria-00433433.

Schwan, C. F. 1782. Nouveau Dictionnaire de la Langue Allemande et Françoise . . . . Vol. 1. Mannheim:

Chez C.F. Schwan et M. Fontaine.

TEI Consortium. 2012. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.0.2.

Last updated February 2. TEI Consortium. http://www.tei-c.org/Guidelines/P5/.

NOTES

1. Some LMF packages, such as the description of subcategorization frames, do not yet

have any equivalent in the TEI vocabulary, but the TEI extension mechanisms do

facilitate the description of such extensions.


2. Note that some of the changes proposed in this paper (in particular regarding the

systematic use of <sense>) have already been integrated into the December 2011

release (2.0.0, Laurentian).

3. Ideally, this should correspond to model.formPart, but in the current version of

the TEI Guidelines this class is cluttered with other components which are there for

purely syntactic (practical) reasons. We would limit this class to the form-related elements <orth>,

<pron>, <hyph>, <syll>, and <stress>.

4. Even if this is not allowed in the <entry> element, form representations still

appear in: <cit>, <dictScrap>, <entryFree>, and <nym>, because of their

membership in model.entryPart.

5. http://www.isocat.org/

6. Paul Luna here analyzes the typography of Johnson’s folio edition of his dictionary

(as opposed to the different typography and text structure of the quarto and octavo

editions). Folio is the old measure of the size of a book and an indispensable term for

research on Johnson’s dictionaries.

7. Since Adelung’s name appears neither on the title page nor elsewhere in the front

matter, his role as a translator is little known. It is worth mentioning the publication

context. Adelung studied and translated Johnson’s dictionary while working on the two

editions of his own German dictionaries. The first volume of his translation, containing

the letters A to J, was published in 1783. This was after nearly three years of work—

according to his preface (p. xii)—and before he finished the fifth and last volume of the

first edition of his German dictionary which he had started in 1773 (Adelung 1774–

1786). Thirteen years later, in 1796, he published the second volume of his translation

with the letters K to Z, after having finished the first two volumes of the second and

final edition of his German dictionary (Adelung 1793–1801). Against this background,

future research into structural relations between Johnson’s Dictionary of the English

Language and Adelung’s German dictionaries looks promising.

Almost at the same time, Johannes Ebers used Adelung’s lexicographical materials to

compile a German-English counterpart in three volumes with a very elaborate title New

and Complete Dictionary of the German and English Languages composed chiefly after the

German Dictionaries of Mr. Adelung and of Mr. Schwan ... (Ebers 1796).

8. We do not encode the two typefaces for Latin script used by German printers of

Adelung’s and Ebers’ dictionaries because there is a fixed relation between language

(encoded using @xml:lang) and the typeface: for German texts the Fraktur variant is

used, whereas for other languages Antiqua is used. We only encode exceptions to this

rule, such as in Schwan’s German-French dictionary, where ISO 15924 codes are used

for the representation of names of scripts. We do not encode the indentation and

alignment structure, nor do we encode italics in the contexts of part-of-speech labels

(in a <pos> element), of cited forms in <etym> (if printed in italics), of the lemmata

used in illustrative quotations (in a <cit> element), or of the names of authors and

their works in the sources for the illustrative quotations (in a <bibl> element).


ABSTRACTS

Our paper outlines a proposal for the consistent modeling of heterogeneous lexical structures in

semasiological dictionaries, based on the element structures described in detail in chapter 9

(Dictionaries) of the TEI Guidelines. The core of our proposal describes a system of relatively

autonomous lexical “crystals” that can, within the constraints of the relevant element’s

definition, be combined to form complex structures for the description of morphological form,

grammatical information, etymology, word-formation, and meaning for a lexical structure.

The encoding structures we suggest guarantee sustainability and support re-usability and

interoperability of data. This paper presents case studies of encoding dictionary entries in order

to illustrate our concepts and test their usability.

We comment on encoding issues involving <entry>, <form>, <etym>, and on refinements to

the internal content of <sense>.

INDEX

Keywords: dictionary encoding, semasiological dictionary, entry, form, sense, Samuel Johnson,

Dictionary of the English Language

AUTHORS

LAURENT ROMARY

Laurent Romary is Directeur de Recherche for INRIA (France) and guest scientist at Humboldt

University (Berlin, Germany). He carries out research on the modeling of semi-structured

documents, with a specific emphasis on texts and linguistic resources. He received a PhD degree

in computational linguistics in 1989 and his Habilitation in 1999. He launched and for several years

directed the Langue et Dialogue team at Loria (Nancy, France) and participated in several

national and international projects related to the representation and dissemination of language

resources and on man-machine interaction, coordinating the MLIS/DHYDRO, IST/MIAMM, and

eContent/Lirics projects. He has been the editor of ISO standard 16642 (TMF – Terminological

Markup Framework) and is the chairman of ISO committee TC 37/SC 4 on Language Resource

Management, as well as member (2001–2007) then chair (2008–2011) of the TEI Council. In the

recent years, he led the Scientific Information directorate at CNRS (2005–2006) and established

the Max-Planck Digital Library (Sept. 2006–Dec. 2008). He currently contributes to the

establishment and coordination of the DARIAH infrastructure in Europe as transitional director.

WERNER WEGSTEIN

Werner Wegstein is a professor emeritus of German Linguistics and Computational Philology. His

publications include a scholarly edition of an Old-High German Glossary (Ph.D. 1985), the first

complete reverse index to a Middle High German dictionary (1990, together with E. Koller and

N.R. Wolf), a Habilitation on computer-based philology (1995), conference papers on the

application of IT to medieval German (2001), and a recently co-authored work on corpus

linguistics (Korpuslinguistik deutsch: synchron, diachron, kontrastiv, 2005). He hosted the TEI

workshops in Würzburg; is a founding member of TextGrid, the humanities partners of the


German D-Grid initiative; and at present is active in a project researching the interaction

between the sciences and humanities in the field of variation.


A TEI Schema for the Representation of Computer-mediated Communication
Michael Beißwenger, Maria Ermakova, Alexander Geyken, Lothar Lemnitzer and Angelika Storrer

1. Introduction

1 In the past three decades, computer networks and especially the Internet have brought

forth new and emerging genres of interpersonal communication which are the subject

of research in the field of “computer-mediated communication” (henceforth CMC). In

general, genres such as e-mail, online forums, chats, instant messaging, or weblogs

stand in the tradition of well-known genres such as spoken conversations or written

letters. On the other hand, they display linguistic and structural features which differ

from both speech and written text (see below for details) and which can be traced back

to the ways in which interlocutors adapt to the technical potentials and limitations of

computer-mediated communication.

2 Recent surveys on the use of the Internet (such as “ARD/ZDF-Onlinestudie”,1 conducted

annually in Germany) show that the use of CMC applications is an important part of

everyday communication. To gain a better understanding of these new forms of

mediated communication and their linguistic peculiarities, we need tools and models

that allow one to analyze them on a broad empirical basis and with the help of corpus

technology and methods from computational linguistics. One important prerequisite

for that would be a common format for the representation and exchange of CMC

resources. Even though CMC phenomena are no longer a completely new field of

research within the humanities, such a format still does not exist.

3 In this paper, we present an XML schema for the representation of genres of computer-

mediated communication that is conformant with the encoding framework defined by

the TEI. Up to now, the encoding of CMC genres and document types has not been a

focus of the TEI. Our schema takes the modules as well as the element and attribute


classes of the P5 version of the TEI Guidelines (released on November 1, 2007) as a

starting point and uses the TEI customization mechanism to extend support to these

genres and document types. The focus of the schema is on those CMC genres which are

written and dialogic―threads in forums and bulletin boards, chat and instant messaging

conversations, wiki talk pages, weblog discussions, microblogging on Twitter, and

conversations on “social network” sites. The schema has been developed in the context

of the project “Deutsches Referenzkorpus zur internetbasierten Kommunikation”

(DeRiK, Beißwenger et al. 2012),2 which is a joint initiative of TU Dortmund University

and the Berlin-Brandenburg Academy of Sciences and the Humanities (BBAW). The

project is embedded in the scientific network Empirische Erforschung internetbasierter

Kommunikation (http://www.empirikom.net/), funded by the Deutsche

Forschungsgemeinschaft (DFG). The aim of the project is to build a corpus on language

use in the German-speaking Internet which covers the most popular CMC genres. The

corpus is designed to be integrated into the corpora and lexical resource framework

provided by the project “Digitales Wörterbuch der deutschen Sprache” (DWDS)3 at the

BBAW “Zentrum Sprache”.

4 Since all corpus resources of the DWDS project are already encoded according to the

TEI encoding framework, and since there is not yet a common standard for an XML/TEI

representation of the structural and linguistic properties of CMC resources, the project

group decided that the TEI would be an optimal basis for the annotation of the DeRiK

data—assuming that the encoding framework of the TEI would prove to be flexible

enough to be adapted to the particularities of CMC discourse. In particular, we

formulated the following requirements for our schema:

It should provide a model that is adapted to the structural particularities of CMC discourse;

in particular, the interlocutors’ contributions to conversations in forums, chats, wiki and

weblog discussions, etc. can be adequately described neither as utterances in speech nor as

paragraphs in traditional writing.

It should provide elements for the annotation of units which are often regarded as “typical”

for language use on the web and which are of special interest to anyone who wants to

compare linguistic features of CMC discourse with the language documented in text corpora

(such as the DWDS corpora); in the DeRiK context, a special focus lies on units which we

subsume under the category interaction signs (including emoticons, interaction words, and

addressing terms).

It should be open to extensions by other researchers in the field of empirical CMC research

or by corpus designers who want to adapt the schema for their own project purposes

(especially on the microlevel, which―in the terminology of our project―is the level below

the individual user contribution).

On the macrolevel (the level above the individual user contributions), its structure should be

oriented toward surface phenomena and thus be as independent as possible from any

specific theory of CMC discourse; this will allow use of the macrostructure model of the

schema as a basic document structure in as many projects as possible; in addition, it will

allow automation of the generation of the basic TEI structure of CMC documents (which is an

important requirement, especially in projects that aim at building large corpora).

It should allow for an easy (but reversible) anonymization of CMC data for purposes in which

the annotated data should be made available as a resource for other researchers or for the

public (as is intended with the DeRiK corpus as part of the DWDS framework).


It should provide all information and metadata which are necessary for using and

referencing random excerpts from the data as references in a general language dictionary as

well as in the results of a corpus query (as is the case in the DWDS online portal).

5 First we will give an outline of the motivation and context of the project. We then will

describe the design of our schema in detail and illustrate some of our basic modeling

decisions with the help of examples from our data.4 The schema itself, its

documentation, and some encoded example documents can be found online.5

6 The current version of the schema will form the foundation of the annotation of CMC

documents in the DeRiK context. Since it is meant to be a core model for representing

CMC, it can be modified and extended by others according to their own specific

perspectives on CMC data. It will have to prove its adequacy for the resource types in

focus by being used and analyzed by more researchers and corpus builders than just its

authors. The schema and its further discussion could be a first step towards an

integration of features for the representation of CMC genres into a future version of the

TEI Guidelines.

2. Motivation and Project Background

2.1. Motivation

7 The motivation for building a corpus of German CMC is to close a gap in the range of

corpora currently available for the study of CMC and contemporary German in general.

Hardly any annotated specialized corpora of CMC exist, and general corpora of

contemporary German do not systematically include language as used on the Internet

(Beißwenger and Storrer 2008). This is a blatant gap, since online communication

has become an important part of everyday communication and can no longer be

ignored when documenting contemporary everyday language use. The field of corpus

linguistics is aware of that gap. In addition to the DeRiK project, which aims to build a

German CMC corpus and integrate it into the DWDS general language corpora, there

are similar ideas or projects for other languages as well. One example is the SoNaR

project which aims at building a balanced reference corpus of contemporary Dutch

including a subcorpus of CMC (Reynaert et al. 2010).

8 Due to a lack of standards for representing CMC, up to now corpus-based research

projects focusing on features of CMC discourse have typically developed their own,

project-specific encoding schemas (see, for example, the XML encoding for chats that

has been designed for the resources included in the Dortmund Chat Corpus, 2003–2009).6

This complicates, and may even prevent, the sharing of this data across

projects, which is all the more regrettable because the individual projects add valuable

structural and semantic information to their data through their annotations (not to

mention the time and person hours required to annotate the data). The potential for

sharing, merging, and comparing corpora, particularly in contrastive linguistic

research, calls for a basic schema which suits the needs of various projects and which is

easy to handle and extend.

9 In addition, such a schema should be compliant with encoding frameworks already

widely used in existing text and speech corpora. This would allow the schema to meet

the needs not only of scholars interested in CMC but also of those interested in


phenomena of contemporary language in general or in comparative analyses of

linguistic phenomena in CMC corpora or corpora of “traditional” text or speech genres.

10 Since many resources within the humanities are already using the encoding framework

provided by the Text Encoding Initiative (TEI), a basic schema for CMC would ideally

comply with this. As will be shown in section 3 of this paper, TEI has the power and

flexibility to describe CMC structures and features even though modules and elements

covering the particularities of CMC discourse are not yet implemented in the TEI.

Therefore, a TEI-compliant XML schema for CMC discourse requires additional

modules. Considering the relevance of the Internet as a communication medium, a

separate module for CMC document types and features could be an important

extension for a future version of the TEI Guidelines.

2.2. The DeRiK Corpus in the Context of the DWDS System

11 Designers of balanced corpora representing the current state of a language should be

sure to include all relevant types of genres in which the contemporary use of this

language is embodied. Nowadays, for a language like German with a strong online

presence, this should include genres of computer-mediated communication. In the

project Deutsches Referenzkorpus zur internetbasierten Kommunikation (DeRiK), 7 we are

aiming to build a corpus of German CMC covering data from the most popular CMC

genres. Data sampling is guided by the findings of the ARD/ZDF-Onlinestudie, which

shows the popularity of various genres among German online users. For practical

reasons, though, the project will sample only those domains and genres for which

intellectual property rights have been cleared. The data will be integrated into and presented

through the DWDS, a digital lexical system developed by and hosted at the BBAW. The

system offers one-click access to three different types of resources (Geyken 2007):

Lexical resources: a common language dictionary,8 an etymological dictionary, and a

thesaurus;

Corpus resources: a balanced reference corpus (called the “DWDS core corpus”) of German

from 1900 to the present. The corpus is balanced among nearly equal shares of journalistic

texts, scientific prose, functional texts, and fiction. Until recently, CMC did not play a role

either as an independent text genre or as part of one or more of these genres; additionally, a

set of newspaper corpora and specialized corpora that are not part of the DWDS core corpus

(such as German newspapers from Jewish communities edited in the first decades of the 20th

century);

Statistical resources for words and word combinations.

12 In the web interface, these resources are displayed alongside one another in separate

panels (see fig. 1). Information in all corpus panels can be retrieved through a linguistic

search engine which allows the user to search for patterns of single words,

combinations of words, combinations of words and part-of-speech patterns, and more.

It is thus possible to retrieve examples for multi-word phrases (e.g., collocations) and

grammatical constructions (such as a verb used in the passive voice).



Figure 1: Web interface of the DWDS system

13 The DeRiK corpus will be integrated into this framework as an independent panel as

well as a subcorpus of the DWDS core corpus and, thus, fill the “CMC gap” in the

current version of the corpus.

14 The integration of a CMC reference corpus into the DWDS system will be valuable for

various research and application fields, for example:

Lexicology and lexicography: Besides genre-specific discourse markers and Internet jargon

(like “lol”), new vocabulary is characteristic of CMC discourse. One example is “gruscheln”, a

verb describing the act of virtually approaching another person in the German social network

StudiVZ (English paraphrase: “to poke”). Furthermore, the disembodiment of synchronous

written communication leads to a metaphorical usage of verbs like “knuddeln” (en: “to hug

[somebody]”). These features should be documented and described in lexical resources.

Language variation and stylistics: The linguistic peculiarities and the stylistic aspects of CMC

are described in the CMC-related literature.9 However, most empirical studies on the matter

have been based upon small and project-related datasets. The DeRiK corpus will provide a

broader basis for qualitative and quantitative investigations on linguistic features and

linguistic variation in German CMC. The DWDS framework will facilitate the comparison of

CMC genres with corpora of other written genres; it will, thus, be easier to investigate how

new patterns and genres emerge.

Language teaching: Internet communication has become an important part of everyday

communication. Thus, language- and culture-specific properties of CMC should also be

regarded in communicative approaches to Second Language Teaching. In this context, the

DeRiK corpus and the lexicographic documentation of CMC vocabulary in the DWDS

dictionary may be useful resources. In school teaching, native German-speaking pupils may use the

DWDS system to compare written language and CMC corpora and to explore how style varies

across different genres (Beißwenger and Storrer 2011).


3. Specification of the Schema

3.1. CMC Genres, Document Types, and Features Covered by the

Schema

15 In a broader sense, computer-mediated communication comprises all communication

“that takes place between human beings via the instrumentality of computers”

(Herring 1996, 1). In a narrower sense, the term “computer-mediated communication”

is used for such forms of communication that are based on computer networks (usually

the Internet). According to John December 1996, those forms of computer-mediated

communication can also be subsumed under the category “Internet-based

communication,” including all communication that “takes place on the global

collection of networks that use the TCP/IP protocol suite for data exchange”. Internet-

based communication can be accessed using client software on desktop or mobile

computers or through applications for the use of online services on mobile

communication devices such as mobile and smart phones.

16 Taking into account the focus of the DeRiK project, we restrict the focus of our schema

to forms of communication which are (i) based on the TCP/IP protocol suite for data

exchange, (ii) dialogic (with all participating users being able to switch between the

role of a recipient/reader and the role of a producer/author of messages), and (iii)

based on writing as the main encoding medium for the users’ dialogue contributions

(that is, the verbal parts of the contributions must be encoded using writing, though

they may also include graphics, embedded audio, or video files). Thus, the present

version of our schema does not cover communication which is mediated via computers

while not being Internet-based (such as SMS communication), monologic forms of

Internet-based communication (such as static webpages), or spoken online

communication using audio or video conferencing software (such as Skype or

Teamspeak).

17 Our schema focuses on those forms of computer-mediated communication in which

written dialogue contributions of more than one interlocutor are displayed in the same

document. In its present version, the schema excludes communication via e-mail and

on Usenet in which each user contribution is stored in a separate (e-mail) document. In

our opinion, the representation of documents that render only one text message

(which, in addition, may have other documents in a vast range of file formats as

attachments) demands a different base structure than documents which preserve

sequences of contributions by two or more users. We do not exclude e-mail and Usenet

conversations from the DeRiK project in general; we simply do not claim that the

schema we describe below is able to adequately cover their features.

18 The schema draft that we describe in the following sections gives a core model for the

representation of the following types of CMC documents:

threads in online forums and in bulletin boards;

discussion threads on talk pages in wikis;

logfiles of conversations in webchats, on Internet Relay Chat (IRC), and in instant messaging

applications;

sequences of user postings in online guestbooks (which have a structure similar to chat or

instant-messaging logfiles);


sequences of postings and threads on profile pages and in discussion sections of social

network sites;

sequences of user postings on Twitter (such as “timelines” of postings that include the same

thematic hashtag);

discussion threads in weblogs;

sequences of review postings for products presented on online shopping sites;

threads and sequences of “private messages” preserved in users’ individual mailboxes on

social network sites or learning platforms.

19 The status of our schema is that of a core model for the representation of CMC. This

means that the schema is meant to provide elements for the representation of the basic

structural peculiarities on the macrolevel and of some prominent linguistic features

that can be found on the microlevel of CMC discourse. The structural elements on the

microlevel are those elements that can be found in the content of individual users’

contributions to CMC conversations, while the constituting structural elements of the

macrolevel are the users’ contributions themselves. Structures on the microlevel (or

microstructures) are made of linguistic units, punctuation, media objects, and

hyperlinks. The current version of our schema confines itself to those microstructural

elements that can be regarded as typical for CMC―especially the CMC-specific

interaction signs (section 3.5 below). The schema could be extended in such a way that it

covers further linguistic and structural phenomena of CMC discourse (for an overview

of linguistic features in German CMC discourse, see, for example, Runkehl et al. [1998]

and Storrer [2009]; for English, see, for example, Crystal [2001] and the contributions in

Herring [1996]). The schema presented in the following sections is open to such

extensions.

3.2. Basic Modeling Decision: Customizing TEI’s Basic Formats for

the Representation of Text Structure

20 None of the modules in the current version of the TEI Guidelines can be adopted “as is”

for creating a model for the representation of CMC. There are many elements in the

default text structure module which are useful for describing the structure of individual

users’ contributions to CMC discourse, but CMC documents can be regarded as text

documents only in a very technical sense since they include stretches of written

language which, due to their separation through line-breaks, appear paragraph-like. On

the other hand, the dialogic structure of CMC discourse appears similar to the structure

of spoken conversations (covered by the transcribed speech module), but the production

of the users’ contributions to CMC dialogues is a monologic activity and, thus, more

text-like than speech, in which the interlocutor perceives and processes the verbal

utterance nearly simultaneously with its production by the speaker. Therefore, neither

of these modules, nor any other module in P5, provides a model of interpersonal

communication that fits the particularities of the main constituting elements of CMC

discourse. These are the stretches of text that an individual user produces in private

and then passes on to the server through performing a “posting” action (usually by

hitting the [ENTER] key on the keyboard or by clicking on a [SEND] or [SUBMIT] button on

the screen).

21 The commonalities and differences of CMC discourse with text and speech have been

widely addressed in the CMC literature. CMC can best be described as (synchronous or


asynchronous) written or typed conversation (Werry 1996; Storrer 2001; Beißwenger

2002) or as interactive written discourse (Ferrara et al. 1991; Werry 1996), which has to be

regarded as crucially different from spoken conversation as well as from texts since it

uses features of textuality for the purpose of dialogic exchange (see also, for example,

Crystal 2001, 25–48; Hoffmann 2004; Zitzen and Stein 2005): Just like text, CMC is

written. In some CMC genres, the users can apply text formatting features and

paragraph structuring to their contributions. In contrast to texts and similar to spoken

conversation, CMC discourse is dialogic, while the users’ contributions to CMC

dialogues are being composed in a private activity, then sent to the server, then

displayed on the screens; it is not until then that they can be read by other users

(Beißwenger 2003, 2007). This “pre-transmission composition” protocol for the

production of dialogue contributions in CMC is text-like, not speech-like. Accordingly,

even in synchronous modes of CMC (chat and instant messaging), the users lack the

possibility to provide simultaneous feedback or to perceive and process the

contributions of their interlocutors simultaneously with their verbalization (which has

crucial consequences for the interactional management layer, especially turn-taking in

conversation; see, for example, Garcia and Jacobs 1998, 1999; Herring 1999; Beißwenger

2003, 2007; Schönfeldt and Golato 2003; Ogura and Nishimoto 2004; Zitzen and Stein

2005). As can be seen by observing message composition in chat sessions, the message

production includes subprocesses of evaluation and revision (re-writing) which are

particular to the production of text (see, for example, the findings on message

production in chats in Beißwenger [2007, 2010]). All in all, CMC can thus be considered

as more than just a hybrid of text and speech (Crystal 2001, 48). Therefore, neither text

nor speech provides an adequate model for its description. But considering the form

and production of user contributions to CMC conversations, a text model seems to be a

better starting point for practical modeling purposes than a speech model. Or, in

Crystal’s words, “[o]n the whole, Internet language is better seen as writing which has

been pulled some way in the direction of speech rather than as speech which has been

written down” (2011, 21). Still, this does not mean that written language is a good

model for CMC per se; but certain structural features specific to written language can

also be found in CMC, and therefore, a model for the description of text can provide

more elements that can be adopted for the description of written CMC than a model for

speech which is bound to completely different conditions of verbalization and mutual

perception.

22 For our schema, we decided to use the TEI header module in P5 as the basis for the

representation of metadata in CMC documents (with some minor customizations which

will be described in section 3.5 below). For the representation of the document

structure, we decided to tailor a customized version of the TEI default text structure

module and, additionally, of some elements from the common core module (especially

the <p> element for the annotation of paragraphs). The main issues that we had to deal

with while customizing the respective TEI modules for the representation of CMC were

(i) the question of how to represent the users’ written contributions as the main

constituting elements of CMC conversations, (ii) the question of how to represent CMC-

specific types of grouping sequences of users’ contributions to larger units (threads and

logfiles), and (iii) the question of how to differentiate between the inner structure of the

individual users’ contribution and the structure of the CMC discourse (the first being

controlled by the user, the second being the result of an interactional achievement of


all participating users and/or of a certain server routine for ordering incoming user

postings).

23 Regarding (i), we decided to introduce a new element <posting> and assign it to the

divLike class of elements (section 3.3.1 below). Regarding (ii), we decided to

introduce two new <div> types and name them thread and logfile (section 3.3.2 below).

Regarding (iii), we decided to use the <p> element for segmentations in the content of

postings (CMC microstructure) and to use <div> elements for segmentations above the

posting level (CMC macrostructures).
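
As a minimal sketch of the resulting document skeleton (constructed here for illustration, with placeholder content; the attributes of <posting> are introduced in section 3.3.1 below), a chat document would thus look roughly as follows:

<body>
  <!-- one division of type "logfile" (or "thread") encloses the complete CMC document -->
  <div type="logfile">
    <!-- each <posting> is one user contribution, sent to the server en bloc -->
    <posting>
      <p>Content of the first user contribution ...</p>
    </posting>
    <posting>
      <p>Reply by a second user ...</p>
    </posting>
  </div>
</body>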

3.3. Elements of the Document Macrostructure

3.3.1. The <posting> Element

24 The element <posting> is the basic CMC-specific element in our schema. In CMC

documents it represents the largest structural unit that can be assigned to one author

and one point in time. The category posting is defined as a content unit that has been

sent to the server “en bloc”. Its function is to make a (written) contribution to the

ongoing dialogue. After being sent (“posted”) to the server, the submitted unit is

displayed in the CMC document as one continuous stretch of content (text plus

embedded media objects such as graphics or video files, etc.). It is usually assigned to

the user name of its author (the user who has sent the unit to the server) and often also

to a certain point in time (indicated through a timestamp). Therefore, postings can be

recognized by their formal structure and, thus, be annotated automatically, even if

they may have different forms and structures in different CMC genres or applications.

Figure 2: Macrostructure of a Wikipedia talk page (excerpt)

25 The example given in figure 2 shows an excerpt from a Wikipedia talk page. Individual

user postings all end with a signature that gives the author’s name and a timestamp.


For example, the signature of posting 1 assigns the posting to an author named Netpilots

and indicates that it was received by the server at 10:36, July 28, 2011 (CEST). More

information about the author can be found on the author’s profile page, which can be

accessed through the hyperlink underlying the name.

26 In a Wikipedia talk page, there is a convention to use a paragraph break to separate

each author’s posting. This makes the sequence of postings in the document appear like

a sequence of paragraphs in a text document. In addition, individual postings can have

internal structure. Posting 1, for example, structures its content into two paragraphs

and a bullet list with two items. Furthermore, the author of posting 1 uses hyperlinks to

connect certain segments of his posting with other Wikipedia pages (“Schwäbisch

Gmünd” and “Facebook”) and with Web resources external to Wikipedia

(“Gescheiterter Bud-Spencer-Tunnel/Focus.de” and “Artikel im Tages-Anzeiger”), plus

bold font weight to highlight the segment “Bud Spencer Tunnel” in the first paragraph.

27 In addition to the paragraph breaks between postings, the postings in example 1 are

also separated from each other by different levels of indentation. The indentations

were deliberately added by the authors in an attempt to create thread structures,

similar to those in discussion groups. Thus, the level of indentation is a feature of the

posting itself and not something that has been automatically assigned by the server.

28 The example given in figure 3 shows an excerpt from a chat logfile. In this case, the

postings are linearly placed one after another in the order of their arrival on the chat

server. In the user chat interface, each individual posting is rendered as a block, and

the server automatically adds information about the authors―the user’s nickname,

which is inserted in front of every posting.

105 Dill: die rosi ihr englisch ist nihct vom feinsten [rosi’s english is not the best]

106 Rosenstaub1979: Nö [Nope]

107 Rosenstaub1979: is schon zuuulang her [it’s been toooooo long]

108 Dill: aber rosi ist prächtig [but rosi is magnificent]

109 Dill: prachtvoll [grand]

110 Rosenstaub1979: Ich glaube, so 9 Jahre [I think, about 9 years]

111 Rosenstaub1979: *lol* @Dill [*lol* @Dill]

112 Dill: 9 jahre? [9 years?]

113 Rosenstaub1979: Ja, kommt fast hin [Yes, that’s about right]

Figure 3: Sequence of postings in a chat room

29 A posting represents a category in its own right which is different from text or speech.

Below, we examine the TEI elements for divisions and paragraphs (components of texts)

and for utterances (components of spoken discourse) to check whether they would

suffice to encode postings.

30 According to the TEI Guidelines, the paragraph element <p> is used to mark “the

fundamental organizational unit for all prose texts, being the smallest regular unit into

which prose can be divided” (TEI P5: 3.1) while the element <div> identifies

subdivisions of a text, such as chapters or sections (TEI P5: 4.1). Being defined as an

“organizational unit” (of a text), the notion of the paragraph implies that there is an

author or at least an author-like authority (editor or publisher) who makes certain

structuring decisions while composing his text and, thus, divides it into a series of units

(for example, according to subtopics and information units). In CMC, on the other hand,

one author’s reach ends with the beginning and end of his current posting while the

structure of the sequence of postings is either due to a server routine (as in chat

logfiles) or a joint achievement of the group of users (as in Wikipedia talk pages and in

certain forums). Thus, the resulting structure is not based on any sort of authorial

structuring of the text. Modeling a user posting as a paragraph would therefore reduce

the original concept of the paragraph to absurdity: a paragraph is a holistic unit

determined by (one author’s) global text coherence, whereas a posting in CMC is an

atomic constituent of a written dialogue determined by the ongoing dialogue’s local

coherence.

31 For example, in figure 3, the user Rosenstaub sends posting 106 (“Nope”) as a direct

reaction to the previous posting 105 from user Dill. This reaction of hers was not

previously determined by an author (as is the case, for example, with individual

characters’ utterances in dramatic dialogues), but she reacted in this way because the

previous posting created a context which made this type of response seem sensible for

her locally. Before reading posting 105, Rosenstaub could not even know herself that her

own next contribution would be “Nope”; the intention for her “Nope” response is

directly caused through the reception and processing of posting number 105. On the

other hand, user Dill, when he sends his posting number 105, does not know which type

of posting will follow in 106 (or if any reaction at all will come from Rosenstaub) because

there is no author who planned the entire dialogue in advance; instead, the dialogue is

developed by the users as they go along; at the same time, each posting creates a

context for the partners’ responses that follow. Both participants are acting according

to their own communication goals; but neither of the participants can precisely predict

in advance how the dialogue will really develop.

32 Postings also differ greatly from utterances in spoken conversation. Thus, the element

<u> (utterance) from the TEI’s spoken module (“transcribed speech”)―describing “a

stretch of speech usually preceded and followed by silence or by a change of speaker”

(TEI P5: 8.3.1)―is also an inadequate option for the conceptualization of postings. The

simultaneity of verbalization, perception, and mental processing as one very central

characteristic of spoken utterances is not present in postings: Due to the “pre-

transmission composition” protocol discussed above, the turn-taking apparatus does


not function in the same way as in spoken conversation. Postings―like texts―are first

produced in their entirety; the composition process can accordingly not be tracked by

the other participants, its result (after having been submitted to and transmitted by the

server) can only be read retrospectively. In spoken conversation, on the other hand, the

listeners can give immediate feedback and, thus, directly react to (and affect) the

ongoing verbalization; they can anticipate the completion of turn-constructional units

and negotiate turns simultaneously with the linear unfolding of the current speaker’s

utterance (see, for example, Sacks, Schegloff and Jefferson 1974; Schegloff 2007).

33 Therefore, in our schema, the element <posting> is the basic structural element of a

CMC document. We consider it a macrostructural element, but it is the pivot between the

higher level macrostructural components thread and logfile (see section 3.3.2) and the

microstructure of the content which it encloses (see section 3.5). The structure of

<posting> is based on that of the existing <div> element.

34 The <div> and <posting> elements have the following similarities:

<div> and <posting> are high-level elements, belonging to the same class

(model.divLike);

<div> and <posting> contain the major divisions of text;

<div> and <posting> have similar internal content.

35 It is important to note that <posting>, like <div>, does not belong to the class of

pLike elements. One <posting> may consist of one or more paragraphs, similar to a

<div>. While a division may represent, for example, a chapter of a book, <posting> represents one user contribution to some computer-mediated communication event

(forum, blog, web-discussion, or chat). Such a contribution can contain multiple

paragraphs, just like <div>. In the chat example given in figure 3, all postings consist

of exactly one paragraph and the portion of text exhibits no special markup, but on the

Wikipedia talk page given in figure 2, some of the postings contain divisions and

markup that the authors inserted into the content of their postings in order to

structure their content. Therefore, <posting> cannot be a model.pLike element.

36 The <div> and <posting> elements have the following differences:

<div> is a self-nesting element, while <posting> is not;

<posting>s can only appear inside of a division which encloses one complete CMC

document (such as an entire forum thread, an entire blog with user comments, or a chat

logfile).

37 In other words, <posting> is a child element of <div> and shares its content model

except that it does not contain divisions and does not embed itself. Normally,

<posting> consists of one or more paragraphs. In some cases a posting contains a

head, typically with a title.

38 Attributes in the following classes can be used with the posting element:

att.ascribed, att.datable, att.global, att.typed. The most commonly

used attributes for posting are @synch and @who. @synch is used to signify the time

when a posting arrives at the server. Such sequential points in time are ordered on a

timeline encoded separately from the postings in the same XML document (in the

<front> section, as shown in the code snippet in fig. 4 and section 3.4). The @who attribute refers to the profile of the person who submitted the posting. Profiles of all

users who contributed to the conversation recorded in one CMC document are listed in

the header of the XML document. The <person> element is used for this purpose.


39 In addition, we introduce new attributes in the TEI customization specifically for use

with the <posting> element: @revisedWhen, @revisedBy, and @indentLevel.

The first two attributes are similar to @synch and @who but differ from them in the

following aspect: they mark the time when a posting was revised and the person who

revised it (as happens, in some cases, in wiki and forum discussions). These

attributes take into account the fluidity of the CMC medium. Both the @who and the

@revisedBy attributes are added to the att.ascribed class; @synch and

@revisedWhen are added to the att.datable class. The values of @synch, @who,

@revisedWhen, and @revisedBy are URIs which point to a profile and to a point of

a timeline. The @indentLevel attribute is added to the att.global class. Its

function is to mark the (relative) level of indentation of the text in a posting (as defined

by its author). The value of this attribute must be an integer from 1 to ∞ depending on

the level of the indentation of the posting (see the encoding example given in fig. 5).

Figure 4: This example contains an encoding of a user profile, a part of the timeline, and one posting. For the complete encoding of this XML document, see http://www.empirikom.net/bin/view/Themen/CmcTEI.
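
A minimal sketch of the kind of encoding shown in figure 4 might look as follows. The identifiers, timestamp values, and posting content are invented for illustration, and the exact element nesting is an assumption based on the description above (profiles in the header, the timeline in the <front> section) rather than a reproduction of the project's own files.

  <!-- in the header: a profile of a participant, constructed from data recoverable from the interaction -->
  <person xml:id="A02">
    <persName>Rosenstaub</persName>
  </person>

  <!-- in the front section: a fragment of the timeline of server arrival times -->
  <timeline origin="#t105">
    <when xml:id="t105" absolute="2011-06-29T14:22:05"/>
    <when xml:id="t106" absolute="2011-06-29T14:22:31"/>
  </timeline>

  <!-- in the body: one posting, linked to its author and its arrival time -->
  <posting who="#A02" synch="#t106" indentLevel="1">
    <p>nö</p>
  </posting>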


Figure 5: Encoding of postings 1 and 2 from the example given in figure 2

3.3.2. Threads and logfiles

40 As stated earlier, we use the term macrostructure to describe how series of postings are

arranged in CMC documents: CMC macrostructures do not emerge from the actions of

just one user but from all posting activities of all users involved in a CMC conversation,

plus server routines for ordering incoming user postings. Thus, the structuring on the

macrostructure level of a CMC document has a different status from the structuring

inserted by one and the same author into the content of his postings. In order to

differentiate between divisions on the macro- and the microstructural levels of CMC,

we therefore reserve the <p> element exclusively for divisions in the content of

individual postings, while we use the <div> element exclusively for the representation

of divisions on the macrolevel. In addition, we differentiate between two major types of

macrostructures in CMC:

logfiles, which arrange the sequence of postings in chronological order based on when they

reached the server (see the examples given in fig. 7)

threads, which structure the sequence of postings in two dimensions:

the above/below dimension, which usually stands for a temporal “before/after” relation;

the left/right dimension, in which one can use indentation to emphasize the topical

affiliation of one message to a previous message (see the example given in fig. 6).

41 To differentiate these two CMC-specific macrostructure types, we use the values thread

and logfile on the @type attribute of <div>.
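
The two macrostructure types can be sketched roughly as follows; the nesting shown is an assumption based on the <div>/<posting> relation described in section 3.3.1, the pointer values are invented, and the posting content is elided.

  <!-- logfile: postings in the order in which they reached the server -->
  <div type="logfile">
    <posting who="#A01" synch="#t01"><p>...</p></posting>
    <posting who="#A02" synch="#t02"><p>...</p></posting>
  </div>

  <!-- thread: chronological order plus indentation marking topical affiliation -->
  <div type="thread">
    <posting who="#A01" synch="#t01" indentLevel="1"><p>...</p></posting>
    <posting who="#A02" synch="#t02" indentLevel="2"><p>...</p></posting>
  </div>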



Figure 6: Differentiation between CMC macro- and microstructures in a CMC “thread” macrostructure

Figure 7: CMC “logfile” macrostructure

3.4. Metadata and Anonymization

3.4.1. Metadata

42 The TEI customization needs to account for metadata specific to CMC. In our context, it

is convenient to add metadata to each individual document, and the TEI header is

sufficient to record data relevant to the description of a CMC document. However, we

want to draw the attention of the reader to the following features which are particular

to the CMC document type:

Documents are quite difficult to identify on the Web. Mechanisms of persistent identifiers

are just now gaining ground and are far from being well established. We therefore follow a

double strategy: in cases where we are able to refer to a persistent identifier (as is the case

with versions of Wikipedia talk pages), we include that information as a part of the source


description. In cases where we cannot refer to a persistent identifier, we download the web

page and store it as a digital copy and refer to it in the source description (see the sketch following this list).

As a part of the metadata, we store the profiles of the participants in the computer-mediated

interactions included in our corpus. We construct these profiles from those data recoverable

from the interaction. The reasons for doing so are explained below.

In addition, we store a timeline on which the individual users’ contributions (postings) are

situated via the @synch attribute of the element <posting> (see section 3.3.1). We are

aware that in most cases, we can only capture the point in time when a contribution is

received and processed by the server, but the interesting point for purposes of

documentation and analysis is the relative chronological order of contributions and not the

absolute point in time.
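
A sketch of how the source description for a single CMC document might be recorded in the TEI header is given below. The elements <sourceDesc>, <bibl>, and <ref> are conventional TEI options rather than a prescription of the schema described here, and the target value is only a placeholder for a persistent identifier.

  <sourceDesc>
    <bibl>
      <title>Wikipedia talk page (archived version)</title>
      <!-- placeholder: persistent identifier of the page version, where one exists -->
      <ref target="..."/>
    </bibl>
  </sourceDesc>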

3.4.2. Anonymization

43 In order to be able to distribute the collected CMC data as widely as possible, we need to

anonymize the data. Our anonymization strategy shall support the following goals:

Every user of the data shall be able to associate a certain set of postings in a CMC document

to a user. This user, however, shall not be identifiable as an individual of the “real world”.

Despite that, some privileged (“authorized”) users shall be able to see and maintain the data

which could be used to identify an individual person as the author of certain postings. It

might be useful to automatically or individually recover only certain features of a (set of)

user(s), such as their gender, if such data are available.

44 To achieve these particular goals, we perform the following steps:

All of the recoverable personal data of a CMC participant are collected into a person profile

in a <person> element. This profile is provided with a value of @xml:id which is unique

within the particular TEI document. All person profiles are stored in the header of the

document; thus, they can easily be separated from the body of the document and therefore

be hidden from the less privileged users of the data.

Each <posting> is linked to a person profile via the @who attribute, which points to the

value of an @xml:id of a <person> element.

Instances of user names in segments of a given posting are also linked to a <person> (see

section 3.5.1.5 below).

45 We are aware that the procedure of identifying names and maintaining person

profiles can be a time-consuming task. However, this effort is in some cases

unavoidable and a necessary prerequisite for the publication and distribution of

valuable data. We therefore want to ensure that a reliable anonymization strategy

exists and can be used in such cases.

46 For an example of this strategy in use, see the example in figure 4 (section 3.3.1).

3.5. Elements of the Document Microstructure

3.5.1. CMC-specific Types of Interaction Signs

47 Up to now, many assumptions about the Internet’s impact on language change have

been based upon small datasets and the linguistic intuition and experience of the

researchers. An annotation standard for typical elements of Internet

jargon―emoticons and acronyms, to name just two―would help to investigate their

usage and dissemination across (sub)languages and digital genres on a broader


empirical basis. However, there is no common terminology to classify the elements of

Internet jargon, nor consensus about the status of these elements in a natural language

grammar framework. To fill this gap, we have developed an annotation schema for

these phenomena on the microstructure level of CMC documents. The basic linguistic

description category of our approach is termed an interaction sign; in the schema,

instances of interaction signs such as emoticons, acronyms, etc. are represented using

the element <interactionTerm>. Below we briefly introduce the category of an

interaction sign and embed it into a broader grammatical framework. By means of

examples, we describe how the category and its subcategories are used for the

annotation of our German reference corpus.

48 First and foremost, our schema serves the annotation needs of the DeRiK project. Some

of the subcategories may be specific to German CMC, so it is clear that the annotation

schema suggested below has to be developed further and discussed within the CMC

community. For example, the set of subcategories of interaction sign may have to be

extended and adapted for other languages. In principle, we consider our proposal as a

first step towards the development of an annotation standard that will facilitate cross-

language, cross-genre, and micro-diachronic investigations of elements of Internet

jargon in CMC corpora. The schema favors a grammatical perspective, but it is open for

extensions motivated by other fields of research such as cultural studies or sentiment

analysis.

3.5.1.1. Interaction Signs: Definition and Subclasses

49 Spoken discourse typically contains elements like “hm”, “well”, “oh my god”, “oops”,

and “wow”. Grammar frameworks usually categorize them as interjections (see, for

example, Greenbaum 1996; McArthur et al. 1998; Blake 2008) or Interjektionen (DUDEN

2005), inserts (Biber et al. 1999; Biber et al. 2002), discourse markers (Schiffrin 1986),

discourse particles, or Gesprächspartikeln (DUDEN 1995). These interjections are different

from responsives like “yes” and “no”, which can occur in both spoken and written

dialogues.

50 In the system of syntactic categories of the three-volume German grammar of the

Mannheim Institut für Deutsche Sprache, Grammatik der deutschen Sprache (Zifonun,

Hoffmann, and Strecker 1997, henceforth GDS),10 both interjections and responsives are

categorized as Interaktive Einheiten (henceforth IE). In spoken discourse, IEs serve as

devices for conversation management: they can be used to express reactions to a

partner’s utterances or to display the speaker’s emotions.11 One important syntactic

feature of IE is that they are not integrated in the sentence’s syntactic structure (Ehlich

1986; Trabant 1998). Instead, they are often either used as sentence-equivalent

utterances (like “nö” in posting 106 of the example given in fig. 3 above) or used in

front of or after the sentence boundaries (like “ja, sollte eigentlich” in posting 2 of the

example given in fig. 2).

51 Many CMC-specific elements like emoticons and acronyms occur in the same positions

and have similar functions as IEs in spoken discourse. It is, thus, not surprising that

grammars―if they describe them at all―classify these elements as interjections.12 In

the STTS tagset, a standard for German part-of-speech classification,13 most IEs would

best be annotated using the POS tags ITJ (Interjektion) or PTKANT (Antwortpartikel); in the

CLAWS2 tagset for English,14 they would fit into the category UH (interjection).


52 But this simple solution is not sufficient for corpus-based research on CMC jargon

across languages, cultures, and genres. On the one hand, elements like emoticons are

language-independent iconic signs that cannot be classified as syntactic units of

natural languages in a strong, narrow sense. On the other hand, iconic signs like the

emoticon “:-)” and symbolic signs like the abbreviation “*s*” (derived from the English

“smile”) are often used as synonyms. All these elements share topological and

functional features with natural language interjections in spoken discourse. By

subsuming all of these elements of Internet jargon under one category, “interaction

sign”, we want to account for their functional and semantic similarities (see fig. 8).

Figure 8: Typology of interaction signs (with examples)

53 In our schema, we introduce an element <interactionTerm> as a phrase-level

element (in the model.phrase class) which encloses one or more instances of

subclasses of interaction signs. The <interactionTerm> element can have

members of att.global as attributes. In addition, we introduce elements for the

following subclasses of interaction signs: the two subclasses of “Interaktive Einheiten”

as described by the GDS (interjection and responsive) and the four subclasses for elements

which are typically—but not exclusively—used in written CMC discourse

(<emoticon>, <interactionWord>, <interactionTemplate>, and

<addressingTerm>). Each of the elements is assigned a set of attributes by which

their occurrence in the corpus documents can be sub-classified according to formal,

positional, semiotic, semantic, and functional criteria. In the following, we outline the

underlying basic ideas of choosing these categories and describe the properties of the

elements introduced in our schema for their representation in our corpus data.
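
As a rough illustration of the wrapper function of <interactionTerm>, a combination of an emoticon and an interaction word (such as the combination of :o) and *freu* discussed with figure 9 below) might be encoded along the following lines. Whether the characters themselves appear as element content, and the absence of attributes here, are assumptions made for brevity.

  <interactionTerm>
    <emoticon>:o)</emoticon>
    <interactionWord>*freu*</interactionWord>
  </interactionTerm>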

3.5.1.2. Emoticons

54 Emoticons are iconic units created using the keyboard. They are often used to portray

facial expressions, and they typically serve as emotion, illocution, or irony markers.

Due to their iconic character, the use of emoticons is not restricted to CMC in one

particular language; instead, the same emoticons can be found in CMC data in different

languages. There are several systems of emoticons: besides the Western-style

emoticons, there are, for example, Japanese and Korean style variants. Postings 3 and 5

in the example given in figure 2 include Japanese-style emoticons (“Kawaiicons”);

Western-style emoticons can be found in the example given in figure 9.


Figure 9: Postings on a Wikipedia talk page displaying instances of the Western-style emoticons :o) and ;o) and instances of the interaction words *freu* (“happy”) and *g* (< “grin”). The combination of :o) and *freu* in posting 5 is an example of an interaction term that consists of two types of interaction signs.

55 In our schema, instances of emoticons are represented using the <emoticon>

element, which is assigned to the gLike element class. Conventionally, elements of

this class contain non-Unicode characters and glyphs. Although most emoticons are

produced as a sequence of keyboard characters (dot, comma, colon, and the like), the

resulting figure is comparable in its semiotic status to graphic characters. While some

smiley faces have been included in Unicode, the variety of emoticons is still larger than

can be captured by Unicode characters alone. That is why we place the <emoticon> element in the class of gLike elements.

56 The <emoticon> element includes attributes from the att.global class and a

number of new attributes from other classes, such as @style,

@systemicFunction, @contextFunction, and @topology, the first three of

which are members of the att.typed class. The @style attribute describes the

native region of an emoticon. The value list of @style is currently set to Western,

Japanese, Korean, and Other. The attributes @systemicFunction and

@contextFunction (explained below) share the following list of values:

emotionMarker:positive, emotionMarker:negative, emotionMarker:neutral,

emotionMarker:unspec, responsive, ironyMarker, illocutionMarker, virtualEvent.

57 The distinction between a systemic and a context function reflects the semantic

differentiation between the expression meaning and the utterance meaning of lexicalized

linguistic units (cf. Löbner 2002). The idea is that, comparable to other lexemes, these

types of emoticons (and other interaction words; see section 3.5.1.3) commonly used in

CMC can be assigned a general, context-independent meaning. On the Web, there are

many lists displaying the “most common emoticons” with descriptions of their


meaning (systemic function). Figure 10 shows an excerpt from Wikipedia’s list of

Western emoticons; the left column renders types of emoticons, the right column gives

short paraphrases of their (context-independent and, thus, systemic) function, as

assigned by the authors.

58 In a given context of use, the function of an instance of a given type of emoticon may

vary from its systemic function. Figure 11 shows an example (b) in which the smiley :-))

and its variant :), which are usually assigned the systemic function of a positive

emotion marker (“happy face”, see entry in fig. 10), are used for marking irony. The

context function of these elements in (b), thus, differs from their systemic function. On

the other hand, in (a) in figure 11, the context function of “:)” is identical with the

systemic function; here, the emoticon is used for displaying a positive emotion of

happiness.

59 The @topology attribute (which is a member of att.placement) captures the

position of the emoticon relative to the text to which it belongs. Consequently, the

range of values is set to front_position, back_position, intermediate_position, standalone.

Icon                                                 Meaning

>:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^)          Smiley or happy face […]

>:D :-D :D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 8-)    Laughing, big grin, laugh with spectacles

:-))                                                 Very happy

>:[ :-( :( :-c :c :-< :< :-[ :[ :{ >.> <.< >.<       Frown, sad

:-||                                                 Angry

>;] ;-) ;) *-) *) ;-] ;] ;D ;^)                      Wink, smirk

>:P :-P :P X-P x-p xp XP :-p :p =p :-Þ :Þ :-b :b     Tongue sticking out, cheeky/playful […]

Figure 10: Excerpt from the list of Western emoticons as given in the English Wikipedia, page “List of emoticons” (as of 2012-02-01)

11a: 178 system Shadok kommt aus dem Raum Alshain herein.

Shadok comes in from the room Alshain.

185 marc30 Holla Shaddy :)

Hey Shaddy :)

189 Shadok heya marc30 ;o)

hey marc30 ;o)

11b: 536 Thor Thor... ärgert sich immer noch, daß die franzosen den pott nicht behalten

haben *gg*

Thor… is still upset that the french didn’t hold on to the pott *gg*

544 Erdbeere$ Erdbeere$ ärgert sich mit .... der pott geht an frankreich und wir bekommen

die küste

Erdbeere$ feels your pain …. the pott goes to france and we get the coast

554 Bochum Bochum tritt erdbeere in den arsch :-))


Bochum kicks erdbeere in the butt :-))

564 Erdbeere$ ohh wie nett :)

ohh how nice :)

Figure 11: Convergence (11a) and divergence (11b) of systemic function and context function (excerpt from document no. 2221006 in the Dortmund Chat Corpus).
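
A sketch of how the convergence in 11a and the divergence in 11b might be annotated is given below. The attribute values are taken from the value lists introduced above, but the concrete encoding, including the rendering of the emoticon characters as element content, is an assumption rather than a reproduction of the project's own figures.

  <!-- 11a, posting 185: ":)" used as a positive emotion marker; context function equals systemic function -->
  <emoticon style="Western" topology="back_position"
            systemicFunction="emotionMarker:positive"
            contextFunction="emotionMarker:positive">:)</emoticon>

  <!-- 11b, posting 554: ":-))" systemically a positive emotion marker, here used to mark irony -->
  <emoticon style="Western" topology="back_position"
            systemicFunction="emotionMarker:positive"
            contextFunction="ironyMarker">:-))</emoticon>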

3.5.1.3. Interaction Words

60 Interaction words are symbolic linguistic units. Their morphological construction is based

on a word or a phrase of a given language which describes expressions, gestures, bodily

actions, or virtual events―for example, the units sing, g (< grins, “grin”), fg (< fat grin), s

(< smile), wildsei (“being wild”) in figure 12 are used as emotion or illocution markers

(postings 865, 876, 880), irony markers (postings 878, 879, 886) or to playfully mimic

simulated bodily activity (posting 864):

858 Turnschuh OHNE DEUTSCHLAND FAHRN WIR ZUR EM!

WE ARE GOING TO THE EUROPEAN CUP WITHOUT GERMANY

859 system Ryo hat die Farbe gewechselt

Ryo changed colors

860 Gangrulez jo schade

yep too bad

861 system Windy123 geht in einen anderen Raum: Forum

Windy123 is going to another room: Forum

862 juliana alle leute müssen ihre fernseher bei media markt bezahlen

all the people have to pay for their TV at media markt

863 juliana haha

haha

864 Turnschuh Es gab mal ein Rudi Völler.......es gab mal ein Rudi Völler.....♫sing♫

There once was a Rudi Völler.......there once was a Rudi Völler.....♫sing♫

865 Ryo *g*

*g*

866 Gangrulez hehe..das wurd eh gerichtlich gestoppt juliana

hehe..that was stopped by the courts anyway juliana

867 juliana echt?

really?

868 oz gang: echt ??

gang: really ??

869 Gangrulez ja


yeah

870 juliana wieso?

why?

871 Gangrulez wettbewerbsverzerrung

distortion of competition

872 Naturkonstantler Fussball ist sooo unendlich unwichtig...

Soccer is sooo incredibly unimportant…

873 juliana versteh ich nicht. ich fand es war ein cooler trick

I don’t understand. I thought it was a cool trick

874 Gangrulez aber es war eine Art Glücksspiel

but it was a kind of gamble

875 Turnschuh mag auch keinen Fussball......nur wollte ich das letzte Deutschlandspiel

sehen *fg*

Turnschuh also doesn’t like soccer......but I would have liked to have seen the last

Germany game *fg*

876 Chris-Redfield *s* aber net erlaubt @ juli

*s* but not allowed @ juli

877 juliana fußball ist nen dreck wichtig. es ist ein spiel. hauptsache, die jungen

männer haben sich fitgehalten und ihrer gesundheit was getan :)

soccer isn’t worth it. it’s a game. Main thing, the young men have kept fit and

done something for their health :)

878 Gangrulez und das entspircht nicht dem Handel *g

and that wasn’t the deal *g

879 juliana chris, du weißt doch, daß ich ein gesetzesbrecher bin *g*

chris, you do know that i am a law breaker *g*

880 Chris-Redfield ja ich weiß *s*

yes i know *s*

881 juliana *wildsei*

*being wild*

882 juliana naja... äh.

oh well… um.

883 Gangrulez ach ich muss ja noch ne mail schreiben..

oh i have to write an e-mail..

884 juliana ich geh zu meinem buch und...

I’m going to go to my book and…

885 system Gangrulez geht in einen anderen Raum: sphere

Gangrulez goes to another room: sphere


886 Naturkonstantler

vielleicht können wir ja mal eine Greencard für potentielle Fussballspieler

einführen... ich werde eine Petition bein B-tag einreichen... Ja, so bin ich,

ich sorge mich um das Wohl der Allgemeinheit! *g*

maybe we can introduce a green card one day for potential soccer players… I will

submit a petition to congress… Yes, that’s how I am, I care for society’s well-being!

*g*

887 juliana mal schaun

we’ll see

888 system juliana verlässt den Raum

juliana leaves the room

Figure 12: Excerpt of a social chat displaying instances of interaction words (postings 864, 865, 875, 876, 878, 879, 880, 881, 886) and of addressing terms (868, 876)

61 The element <interactionWord> in our schema is a member of

model.global.spoken. It shares properties of the <kinesic>, <incident>,

and <vocal> elements in TEI. The element <interactionWord> is provided with

attributes from the class att.global and several new attributes: @formType,

@systemicFunction, @contextFunction, @topology, and

@semioticSource. The attributes @systemicFunction, @contextFunction,

and @topology are used for the <emoticon> element. @formType is in the

att.typed class of attributes and is used to describe morphological properties of the

<interactionWord>. The list of values is currently set to simple, complex, and

abbreviated. The attribute @semioticSource is in the att.typed class of attributes

and is used to describe the semiotic mode that forms the basis for an interaction word;

its current list of values is set to mimic (such as for grins “grin” and stirnrunzel “frown”),

gesture (such as for kopfschüttel “shake head” and wink “wave”), bodilyReaction (such as

for schluck “gulp”, seufz “sigh”, and hüstel “little cough”), sound (such as for plätscher

“splash” and blubb “plop”), action (such as for tanz “dancing”, knuddle “cuddling”, erklär

“explaining”, and mampf “munching”), sentiment (such as for freu “happy”), process

(such as for träum “dreaming”), and emotion (such as for schäm “ashamed”).


Figure 13: Encoding snippet for example 11b from figure 11
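
In the same spirit as figure 13, a hedged sketch of how the interaction word *gg* from example 11b might be annotated is given below. The functional classification and the rendering of the characters as element content are illustrative assumptions, not a reproduction of the published snippet.

  <interactionTerm>
    <interactionWord formType="abbreviated" semioticSource="mimic"
                     systemicFunction="emotionMarker:positive"
                     contextFunction="ironyMarker">*gg*</interactionWord>
  </interactionTerm>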

3.5.1.4. Interaction Templates

62 Interaction templates are units that the user does not generate with the keyboard but by

activating a template which automatically inserts a previously prepared text or

graphical element into a space of the user’s choice.

63 The category of interaction templates includes graphic smileys, chosen by the user of a

CMC environment from a finite list of elements. These often portray facial expressions

but can depict almost anything; in the case of animated GIFs, they can even portray

entire scenes as moving pictures. This clearly goes beyond what can be expressed using

only keyboard-generated emoticons. On the other hand, users can invent new

emoticons by combining keyboard characters, while template-generated units are

always bound to predefined templates.

64 The element <interactionTemplate> in our schema belongs to the

model.global class of elements. It is provided with the att.global class of

attributes and a few new attributes which belong to different classes. The most

important attributes for this element are @type, @motion, @systemicFunction,

and @contextFunction.

65 The attribute @type characterizes the surface form of the inserted figure; its list of

values is currently set to iconic, verbal, and iconic-verbal.

66 The @motion attribute belongs to the att.typed class and has two possible values:

static and animated.

67 The attributes @systemicFunction and @contextFunction have already been

introduced in section 3.5.1.2, but one additional value of attribute


@systemicFunction should be mentioned: “evaluation” is used to express whether

the enclosed graphic element expresses appreciation or disapproval.
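
A minimal sketch of an interaction template annotation follows. Since such units are inserted by the platform rather than typed, the element is shown here as empty; this, together with the particular attribute values chosen, is an assumption made for illustration.

  <!-- a static graphic smiley selected from the platform's predefined list -->
  <interactionTemplate type="iconic" motion="static"
                       systemicFunction="emotionMarker:positive"
                       contextFunction="emotionMarker:positive"/>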

3.5.1.5. Addressing Terms

68 Addressing terms address an utterance to a particular interlocutor (see the examples in

the postings 868 and 876 in fig. 12). The most widely used form is the “@” character

combined with a specification of the addressee’s name.

69 The element <addressingTerm> in our schema belongs to the model.nameLike

class of elements. While this element usually carries no attributes, our customization

makes the att.global attributes available. The content of <addressingTerm> is

restricted to two elements: <addressMarker> and <addressee>.

70 The <addressMarker> element belongs to the class model.labelLike (used to

gloss or explain parts of a document) and is provided with the att.global class of

attributes. The purpose of <addressMarker> is to identify or to highlight the

addressee in a posting. This is typically achieved by using the “at” sign (“@”) or one of

a set of fixed phrases (English: “to”; German: “an” or “für”).

71 The element <addressee> is placed in the model.nameLike.agent class. It

includes the @who, @scope, and @formType attributes, plus those from the

att.global class. Addressees are often addressed using abbreviated or

nickname forms of their usernames, so the name of the addressee given in the

addressing term might not be identical with the username of the interlocutor. We

would like to enable the users of our corpus to retrieve the alternative form from the

data even after the corpus data have been anonymized (as explained in section 3.4). We

use the @formType attribute for this purpose and assign it the following set of values:

persNameFull, persNameAbbreviation, and persNameNickname. Thus, the attribute

@formType allows us to describe cases like the ones illustrated through the examples

in figure 14:

14a:

306 Lantonie Lantonie heiratet Thor....

Lantonie is marrying Thor….

308 Lantonie :))

:))

323 zora wos? *eifersüchtel*@lanto

what? *jealous*@lanto

14b:

104 Chris-Redfield tom ram ist doch nicht alles im leben *g*

tom ram is not all there is in life *g*

108 TomcatMJ nö, aber hilft dem server weiter@c-r :-)

no, but helps the server@c-r :-)

14c:


117 Raebchen Raebchen rät allen Pärchen, nicht auf Deck zu knutschen (sowas hat die Titanic

sinken lassen! habe ich im Film gesehen)

Raebchen advises all couples not to make out on deck (that’s what made the Titanic

sink! i saw it in the movie)

123 McMike *lol*@Raeby

*lol*@Raeby

14d:

89 McMike könntet Ihr mich bitte zum Käpten ernennen?

could you all please appoint me captain?

94 ineli26 ineli26 ernennt McMike zum Kapitaen

Ineli26 appoints McMike captain

[…]

160 McMike Monk, kannst Du das steuer übernehmen?

Monk, can you take over the wheel?

164 Monk klar wohin solls gehen?

of course where to?

169 McMike Monk immer dem Fön nach

Monk keep following the Foen

172 ineli26 lol @ kapitaen

lol @ kapitaen

Figure 14: Types of addressees’ names in addressing terms: abbreviated form (14a and 14b) and nickname form (14c and 14d) (excerpts from documents no. 2221006, 2221007, and 2221001 in the Dortmund Chat Corpus)

72 The @scope attribute is added to the att.scoping class. This attribute is used to

specify whether one or more persons or groups are addressed; the values of this

attribute are all, group, individual, and unspec.

73 The @who attribute identifies the addressee (the recipient of the

posting). Its value points to the @xml:id of the <person> element for the

addressee.15

74 Figure 15 gives an encoding example for addressing terms in chat postings.


Figure 15: Encoding snippet for postings 868 and 876 from the example in figure 12
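
Along the lines of figure 15, the addressing term “@ juli” in posting 876 might be encoded roughly as follows; the pointer value is a placeholder, and the attribute values are chosen from the value lists above for illustration only.

  <addressingTerm>
    <addressMarker>@</addressMarker>
    <addressee who="#A_juliana" scope="individual"
               formType="persNameAbbreviation">juli</addressee>
  </addressingTerm>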

3.5.2. User Signatures

75 An important element of the microstructure in postings in forums, bulletin boards, and

wiki discussions is the signature text predefined by a user and inserted into a posting

automatically (usually at its end). It often includes the name of the user plus additional

text (such as sayings, proverbs, quotes, or personal information about the user) or

graphics. In our schema, we do not represent signatures as a part of every single

posting; instead, we mark the position in the posting where the user signature is placed

and describe its content only once in the <person> element.

76 For the representation of the signature text’s position in the postings and for the

description of the signature content, we introduce two special elements: The element

<autoSignature> is an empty element contained in the model.pPart.edit class. It replaces the signature text in the posting. The user’s signature is kept in the

element <signatureContent> in the <person> element; it is placed in the

model.persStateLike class and referenced by the @target attribute on

<autoSignature>.
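
A sketch of this mechanism, with invented identifiers and elided content, might look as follows; the use of @xml:id on <signatureContent> and of a paragraph inside it are assumptions made to render the pointing explicit.

  <!-- in the posting: the inserted signature text is replaced by an empty element -->
  <p>... <autoSignature target="#sig_A01"/></p>

  <!-- in the person profile in the header: the signature content, recorded once -->
  <person xml:id="A01">
    <signatureContent xml:id="sig_A01"><p>...</p></signatureContent>
  </person>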

3.5.3. Postscripts, Openers, and Closers

77 Some elements in CMC discourse are similar to elements used in epistolary

correspondence. However, their use is less restricted than with their functional

equivalents in written letters.

78 One element of this type is the <postscript>. In CMC, a complete posting can be

marked by a user as a postscript (for example by introducing it with “p.s.”); in other

cases, a postscript can be a part of a paragraph (see the examples given in fig. 16). The


current TEI definition of the <postscript> element does not offer any opportunity

to encode such cases. In our schema, we therefore introduced a <seg type=“postscript”> for their annotation.

16a:

p.s.: ich hasse einfache antworten deshalb würde ich die antwort von <<user2>> kritisieren wollen:

warum ist der “normal-christliche” lebensstil in so feste bahnen zementiert? warum läuft es

trotzdem so schief. […]

p.s.: i hate simple answers which is why I would like to criticize the answer given by <<user2>>: why is the

“normal Christian” lifestyle so strictly regulated? Why despite this does it still go wrong. […]

(Follow-up message of user1 to his own prior posting in a blog discussion; anonymized)

16b:

Die genannten Quellen sind für die Fragestellung in keinster Weise reputabel, d.h. auch danach läge

Theoriefindung vor. In Volkach heisst die Mainbrücke auch nur Mainbrücke, weil es für

Einheimischen nur diese eine gibt. Aber der Eigentümer, das Land Bayern, hat natürlich mehrere

Mainbrücken, daher ist es nun einmal die Mainbrücke Volkach. Also Fahrradbrücke wird das

Bauwerk sicher nicht heissen, man müsste halt mal bei der Bauverwaltung der Stadt Konstanz

nachfragen. Anderenfalls dann doch gemäß reputabler Literatur auf Geh- und Radwegbrücke über den

Seerhein bei Konstanz verschieben. --Störfix 21:55, 13. Jul. 2011 (CEST) P.S. oder die Brücke endlich

z.B. nach einem verdienten OB benennen ;-)

The mentioned sources are in no way trustworthy for this question, i.e. it would be conspiracy theory. In

Volkach the Main Bridge is only called the Main Bridge because there is only the one for the locals. But the

owner, the state of Bavaria, of course, has several Main bridges, making this one the Main Bridge Volkach.

Thus, this construction will definitely not be called Bike Bridge, you would have to ask at the City of

Constance’s planning department. Otherwise, stick with the same terminology as in the more respectable

literature, Geh- und Radwegbrücke über den Seerhein bei Konstanz. --Störfix 21:55, 13. Jul. 2011 (CEST) P.S. or

finally name the bridge after a deserving mayor ;-)

(Wikipedia talk page for the article “Geh- und Radwegbrücke über den Seerhein bei Konstanz”)

Figure 16: Types of postscripts in CMC: postscript posting (16a), postscript as part of a paragraph within a posting (16b)

79 CMC is characterized by a less conventional style of writing than

epistolary correspondence, and this affects the form of a posting. We assume that, similar

to conventional discourse types such as letters, some kinds of postings (especially in

asynchronous CMC genres such as forums, bulletin boards, and Wikipedia talk pages)

have a structure which consists of an opening part, the main part of a message, and a

closing part. However, the opening and closing parts are in many cases neither cleanly

separated from the body of the message nor necessarily the first or last part of the

message (see example below). Additionally, an opener or closer element can appear

more than once in a posting.

80 Unfortunately, the elements of the current TEI P5 framework which come closest to

these structures (the <opener> and <closer> elements) are too restricted in their

distribution. For example, the element <opener> may appear exclusively at the top of

a division, while <closer> is permitted at the bottom of a document only. For us to

use these elements, the content model for <div>s would have to be loosened to allow

these elements to appear in other places. Specifically, it would be useful if the


<opener> and <closer> elements could join the inter-level elements so that they

would be able to appear within as well as in between chunks of text. In the current

version of our schema, we use <seg> elements for the annotation of openers and

closers in CMC postings and use a @type attribute with a value of “opener” or “closer”

(see the example given in fig. 17).

Figure 17: Opener and closer inside one posting, encoded using the <seg> element
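
A sketch of such an encoding, with invented posting content, is given below; the exact placement of the <seg> elements inside the paragraph follows the description above and is not a reproduction of the published figure.

  <posting who="#A03" synch="#t07">
    <p><seg type="opener">Hallo zusammen,</seg> ...
       <seg type="closer">Gruß, ...</seg>
       <seg type="postscript">P.S. ...</seg></p>
  </posting>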

4. Conclusions and Outlook

81 We have shown in this paper that the TEI Guidelines offer an appropriate way of

structurally encoding documents of various CMC genres. We demonstrated this by

focusing on some of these genres—chats, forum, and wiki discussions, in particular—

and on some features of dialogic CMC which have figured prominently in the linguistic

literature about this text type.

82 Customization of the TEI Guidelines is one way of adapting the TEI encoding framework

to new genres and document types. However, considering the relevance of CMC in

today’s everyday communication, it could be an important extension to future versions

of the TEI Guidelines to include a standard for the representation of the features and

peculiarities of CMC genres and document types. Such a standard should include a

model for the representation of those structural and linguistic features of CMC

discourse which are not yet covered by the modules and elements in the P5 version of

the TEI Guidelines (among others, a <posting> element for representing the main

constituting units of the CMC document structure and elements for the annotation of

typical Internet jargon units such as the interaction signs described in section 3.5.1). A

standard for the representation of CMC discourse should take into account that the

distribution and content model of certain elements from existing modules in TEI P5

would have to be modified in order to use them for the annotation of their functional

equivalents in CMC postings. As shown in the example of postscript-, opener-, and

closer-like elements in CMC (see section 3.5.3), the position of the equivalent TEI

elements in the structure of the postings is less restricted than in epistolary

correspondence. In cases like these, a modification of existing TEI elements (the

elements <postscript>, <opener>, and <closer>) would ideally account for

both CMC’s orientation toward traditional text types and text elements and CMC’s

free and creative use and modification of them.

83 CMC is constantly gaining popularity, both as a medium of communication and as an

object of study. We therefore want to suggest with this paper that the TEI offers users a


framework for annotating resources of this type. We hope that the schema presented

here might pave the way for such a development.

84 Much still has to be done to achieve a fuller understanding of CMC genres and their

peculiarities. This is not due to a lack of studies of this kind of communication, but to a

constant change both in the ways in which the medium is used and in its technological

frameworks. CMC is a fluid mode of communication, and we probably will have to

constantly adapt our modeling and schema to new forms and media of CMC which will

emerge in the future. We are confident that the TEI Guidelines will provide an

appropriate framework for this. We hope that further discussion of the schema

presented in this paper will help uncover the extent to which its core features can be

appropriate for the representation of CMC discourse in languages other than German

(and especially those with writing systems not using the Latin alphabet).

85 For DeRiK in particular, we are facing the following challenges in the near future:

Acquiring texts in larger quantities: Up to now we have been working with a small sample of

texts of various genres. In the future we will acquire a larger set of documents for our

reference corpus—ideally 10 million tokens per year. We have to clear the rights of many of

the text sources unless they have already been cleared by the providers, as is the case

with Wikipedia talk pages, for example. We hope that we can acquire substantial portions of

data from projects focused on empirical research in the field of CMC (including the projects

from partners in the Empirikom network). Ideally, this would be a win-win situation: the

partners would get their texts curated and distributed in a way that the empirical basis of

their research could be used to replicate their work or to perform comparable research on

the same data, and more users and researchers could find and use this data easily.

Analyzing CMC texts linguistically: Software for automatic analysis and annotation of texts is

optimized for well-formed written clauses and sentences. CMC texts will therefore pose

challenges to these tools on different levels, from tokenization and sentence boundary

detection to part-of-speech tagging and syntactic parsing. We hope to have shown with the

examples in this paper that, seen from the perspective of a normative grammar for written

text, many productions of CMC are not “well-formed”. It will be a major challenge to find

and describe the regularities in text production which seem to be irregular at first sight. NLP

tools have to be adjusted accordingly. Of course there is a continuum ranging from well-

thought-out—and well-formulated—texts and dialogues (such as on Wikipedia talk pages or

scientific blogs) to very informal and highly speech-like contributions in some chat sessions.

Tools for the linguistic analysis of CMC should be able to cover the whole range.

Annotating the collected data using our TEI schema: Last but not least, the data collected for

integration in our corpus will be annotated using the schema presented in this paper. We

assume that some of its structure can be generated automatically on the basis of filters that

transform structural patterns of the raw data format (such as HTML) into the target format;

other components of the schema (especially the functional subclassification of types of

interaction signs using attributes) will, at least in the beginning, require manual or, at best,

semi-automatic encoding. Further analyses of CMC-specific units on the microlevel of

postings may help to develop strategies for a partial automatization of this task; we hope

that further discussions in the context of the Empirikom network will contribute to this.

Providing a framework for managing a corpus of CMC data: Scripts will be needed to transform

CMC data from various sources to the TEI target format; ideally this will be a framework

which can be parameterized for each individual source. In addition, scripts will be needed to

transform the TEI/XML-encoded data into something which can be displayed nicely; XSLT


scripts will be an appropriate means. We will provide such scripts and tools alongside the

schema and documentation on our website. Additional facilities will be provided by the

DWDS framework (see section 2.2).

BIBLIOGRAPHY

References

Beißwenger, Michael. 2002. “Getippte ‘Gespräche’ und ihre trägermediale Bedingtheit: Zum

Einfluß technischer und prozeduraler Faktoren auf die kommunikative Grundhaltung beim

Chatten.” In Moderne Oralität, edited by Ingo W. Schröder and Stéphane Voell, 265–299. Marburg:

Reihe Curupira.

———. 2003. “Sprachhandlungskoordination im Chat.” Zeitschrift für germanistische Linguistik 31

(2): 198–231.

———. 2007. Sprachhandlungskoordination in der Chat-Kommunikation. Linguistik, Impulse, &

Tendenzen 26. Berlin: de Gruyter.

———. 2010. “Chattern unter die Finger geschaut: Formulieren und Revidieren bei der

schriftlichen Verbalisierung in synchroner internetbasierter Kommunikation.” In Nähe und

Distanz, edited by Vilmos Àgel and Mathilde Hennig, 247–294. Linguistik, Impulse, & Tendenzen 35.

Berlin: de Gruyter.

Beißwenger, Michael, and Angelika Storrer. 2008. “Corpora of Computer-Mediated Communication.” In Corpus Linguistics. An

International Handbook. Volume 1, edited by Anke Lüdeling and Merja Kytö, 292–308. Handbooks of

Linguistics and Communication Science 29.1. Berlin: de Gruyter.

———. 2011. “Digitale Sprachressourcen in

Lehramtsstudiengängen: Kompetenzen – Erfahrungen – Desiderate.” In Language Resources and

Technologies in E-Learning and Teaching, edited by Frank Binder, Henning Lobin, and Harald

Lüngen. Special issue, Journal for Language Technology and Computational Linguistics 26 (1): 119–139.

http://media.dwds.de/jlcl/2011_Heft1/9.pdf.

Beißwenger, Michael, Maria Ermakova, Alexander Geyken, Lothar Lemnitzer, and Angelika

Storrer. 2012. “DeRiK: A German Reference Corpus of Computer-Mediated Communication.”

Digital Humanities 2012. http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/

derik-a-german-reference-corpus-of-computer-mediated-communication/.

Biber, Douglas et al. 1999. Longman Grammar of Spoken and Written English. Edinburgh: Pearson

Education Limited.

Biber, Douglas, Susan Conrad and Geoffrey Leech. 2002. Longman Student Grammar of Spoken and

Written English. Edinburgh: Pearson Education Limited.

Blake, Barry J. 2008. All About Language. New York: Oxford University Press.

Crystal, David. 2001. Language and the Internet. Cambridge: Cambridge University Press.


Danet, Brenda, and Susan C. Herring, eds. 2007. The Multilingual Internet. Language, Culture, and

Communication Online. New York: Oxford University Press.

December, John. 1996. “Units of Analysis for Internet Communication,” Journal of Computer-

Mediated Communication 1 (4). Accessed February 03, 2012, http://jcmc.indiana.edu/vol1/issue4/

december.html.

DUDEN. 1995. Die Grammatik. 5th ed. Mannheim: Bibliographisches Institut.

DUDEN. 2005. Die Grammatik. 7th ed. Mannheim: Bibliographisches Institut.

Ehlich, Konrad. 1986. Interjektionen. Tübingen: Niemeyer.

Ferrara, Kathleen, Hans Brunner, and Greg Whittemore. 1991. “Interactive written discourse as

an emergent register.” Written Communication 8 (1): 8–34.

Garcia, Angela Cora, and Jennifer Baker Jacobs. 1998. “The Interactional Organization of

Computer Mediated Communication in the College Classroom.” Qualitative Sociology 21 (3): 299–

317.

———. 1999. “The Eyes of the Beholder: Understanding the Turn-Taking System in Quasi-

Synchronous Computer-Mediated Communication.” Research on Language and Social Interaction 32

(4): 337–367.

Geyken, Alexander. 2007. “The DWDS corpus: A reference corpus for the German language of the

20th century”. In Collocations and Idioms, edited by Christiane Fellbaum, 23–40. London:

Continuum Press.

Greenbaum, Sidney. 1996. The Oxford English Grammar. New York: Oxford University Press.

Herring, Susan C. 1996. “Introduction.” In Computer-Mediated Communication: Linguistic, Social and

Cross-Cultural Perspectives, edited by Susan C. Herring, 1–10. Pragmatics & Beyond n.s. 39.

Amsterdam: John Benjamins.

———. 1999. “Interactional Coherence in CMC.” Journal of Computer-Mediated Communication 4 (4).

http://jcmc.indiana.edu/vol4/issue4/herring.html.

Herring, Susan C., ed. 1996. Computer-Mediated Communication: Linguistic, Social and Cross-Cultural

Perspectives. Pragmatics & Beyond n.s. 39. Amsterdam: John Benjamins.

Herring, Susan, ed. 2010/2011. Computer-Mediated Conversation. Special issue, Language@Internet

7/8. http://www.languageatinternet.org/.

Hoffmann, Ludger. 2004. “Chat und Thema.” In Internetbasierte Kommunikation, edited by Michael

Beißwenger, Ludger Hoffmann, and Angelika Storrer, 103–122. Osnabrücker Beiträge zur

Sprachtheorie 50.

Klappenbach, Ruth, and Wolfgang Steinitz, eds. 1962–1977. Wörterbuch der deutschen

Gegenwartssprache. 6 vols. Berlin: Akademie-Verlag.

Löbner, Sebastian. 2002. Understanding Semantics. London: Edward Arnold Publishers.

McArthur, Tom, ed. 1998. Concise Oxford Companion to the English Language. Oxford: Oxford

University Press.

Ogura, Kanayo, and Kazushi Nishimoto. 2004. “Is a Face-to-Face Conversation Model Applicable to

Chat Conversations?” Paper presented at the Eighth Pacific Rim International Conference on

Artificial Intelligence, 2004. http://ultimavi.arc.net.my/banana/Workshop/PRICAI2004/Final/

ogura.pdf.


Reynaert, Martin, Nelleke Oostdijk, Orphée De Clercq, Henk van den Heuvel, and Franciska de

Jong. 2010. “Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch

Reference Corpus,” Proceedings of the Seventh Conference on International Language Resources and

Evaluation (LREC'10): 2693–2698. Accessed February 03, 2012 http://eprints.eemcs.utwente.nl/

18001/01/LREC2010_549_Paper_SoNaR.pdf

Runkehl, Jens, Peter Schlobinski, and Torsten Siever. 1998. Sprache und Kommunikation im Internet:

Überblick und Analysen. Opladen: Westdeutscher Verlag.

Sacks, Harvey, Emanuel A. Schegloff, and Gail Jefferson. 1974. “A Simplest Systematics for the

Organization of Turn-Taking for Conversation,” Language 50 (4): 696–735.

Schegloff, Emanuel A. 2007. Sequence Organization in Interaction. Vol. 1 of A Primer in Conversation

Analysis. Cambridge: Cambridge University Press.

Schiffrin, Deborah. 1986. Discourse markers. Vol. 5 of Studies in Interactional Sociolinguistics.

Cambridge: Cambridge University Press.

Schönfeldt, Juliane, and Andrea Golato. 2003. “Repair in Chats: A Conversation Analytic

Approach.” Research on Language and Social Interaction 36 (3): 241–284.

Storrer, Angelika. 2001. “Getippte Gespräche oder dialogische Texte? Zur

kommunikationstheoretischen Einordnung der Chat-Kommunikation.” In Sprache im Alltag:

Beiträge zu neuen Perspektiven in der Linguistik; Herbert Ernst Wiegand zum 65. Geburtstag gewidmet,

edited by Andrea Lehr, Matthias Kammerer, Klaus-Peter Konerding, Angelika Storrer, Caja

Thimm, and Werner Wolski, 439–465. Berlin: de Gruyter.

———. 2009. “Rhetorisch-stilistische Eigenschaften der Sprache des Internets.” In Rhetorik und

Stilistik – Rhetorics and Stylistics: Ein internationales Handbuch historischer und systematischer

Forschung, edited by Ulla Fix, Andreas Gardt, and Joachim Knape, 2211–2226. Berlin: de Gruyter.

TEI Consortium. 2012. TEI P5: Guidelines for Electronic Text Encoding and Interchange. http://www.tei-

c.org/Guidelines/P5/.

Trabant, Jürgen. 1998. Artikulationen: Historische Anthropologie der Sprache. Frankfurt: Suhrkamp.

Werry, Christopher C. 1996. “Linguistic and interactional features of Internet Relay Chat.” In

Computer-Mediated Communication: Linguistic, Social and Cross-Cultural Perspectives, edited by Susan

C. Herring, 47–63. Pragmatics & Beyond n.s. 39. Amsterdam: John Benjamins.

Zifonun, Gisela, Ludger Hoffmann, and Bruno Strecker. 1997. Grammatik der deutschen Sprache. 3

vols. Schriften des Instituts für deutsche Sprache 7.1–7.3. Berlin: de Gruyter.

Zitzen, Michaela, and Dieter Stein. 2005. “Chat and conversation: a case of transmedial stability?”

Linguistics 42 (5): 983–1021.

WWW Resources

ARD/ZDF Onlinestudie (1997–2011). http://www.ard-zdf-onlinestudie.de/.

Digitales Wörterbuch der deutschen Sprache (DWDS). http://www.dwds.de/.

Dortmunder Chat-Korpus. http://www.chatkorpus.tu-dortmund.de/.

Grammis 2.0: das grammatische Informationssystem des Instituts für deutsche Sprache (IDS).

http://hypermedia.ids-mannheim.de/.


“Online documentation of the DeRiK TEI schema for the representation of computer-mediated

communication.” http://www.empirikom.net/bin/view/Themen/CmcTEI.

“Projekt: Deutsches Referenzkorpus zur internetbasierten Kommunikation (DeRiK).” http://

www.empirikom.net/bin/view/Themen/DeRiK.

Scientific network (DFG). “Empirische Erforschung internetbasierter Kommunikation“

(“Empirical Research on Internet-based Communication“). http://www.empirikom.net.

“STTS Tag Table.” Institute for Natural Language Processing. http://www.ims.uni-stuttgart.de/

projekte/corplex/TagSets/stts-table.html.

Text Encoding Initiative (TEI). http://www.tei-c.org/index.xml.

“UCREL CLAWS2 Tagset.” University Centre for Computer Corpus Research on Language. http://

ucrel.lancs.ac.uk/claws2tags.html.

NOTES

1. http://www.ard-zdf-onlinestudie.de

2. For a brief description of the project, see also http://www.empirikom.net/bin/view/

Themen/DeRiK.

3. http://www.dwds.de/

4. We would like to thank the members of the scientific network Empirikom as well as

Laurent Romary and the participants of the Annual Conference and Members’ Meeting

of the TEI Consortium 2011 in Würzburg for valuable discussions on the subject and for

their comments on previous versions of the schema.

5. http://www.empirikom.net/bin/view/Themen/CmcTEI

6. http://www.chatkorpus.tu-dortmund.de

7. http://www.empirikom.net/bin/view/Themen/DeRiK

8. This dictionary is based on a six-volume printed dictionary, the Wörterbuch der

deutschen Gegenwartssprache (WDG, en.: Dictionary of Contemporary German) published

between 1962 and 1977 and compiled at the Deutsche Akademie der Wissenschaften.

9. Recent overviews are given in Storrer 2009 and Herring 2010/2011.

10. An online version of the GDS is available at http://hypermedia.ids-mannheim.de/; a

brief description of the category interaction sign (Interaktive Einheit) can be found in

module http://hypermedia.ids-mannheim.de/call/public/sysgram.ansicht?

v_typ=d&v_id=370.

11. See GDS (362): “Ihre Funktion besteht in der unmittelbaren (oft automatisiert

ablaufenden) Lenkung von Gesprächspartnern, die sich elementar auf die laufende

Handlungskooperation, Wissensverarbeitung und den Ausdruck emotionaler

Befindlichkeit erstrecken kann” (“Their function consists in the immediate, often automatic, steering of conversation partners, which can fundamentally extend to the ongoing cooperation in action, the processing of knowledge, and the expression of emotional states”).

12. See, for example, DUDEN (2005, sec. 892) and Ehlich (1986).

13. See the STTS tag table: http://www.ims.uni-stuttgart.de/projekte/corplex/

TagSets/stts-table.html.

14. See the CLAWS2 tagset: http://ucrel.lancs.ac.uk/claws2tags.html.

15. This is part of the anonymization strategy discussed in section 3.4.


ABSTRACTS

The paper presents an XML schema for the representation of genres of computer-mediated

communication (CMC) that is compliant with the encoding framework defined by the TEI. It was

designed for the annotation of CMC documents in the project Deutsches Referenzkorpus zur

internetbasierten Kommunikation (DeRiK), which aims at building a corpus on language use in the

most popular CMC genres on the German-speaking Internet. The focus of the schema is on those

CMC genres which are written and dialogic―such as forums, bulletin boards, chats, instant

messaging, wiki and weblog discussions, microblogging on Twitter, and conversation on “social

network” sites.

The schema provides a representation format for the main structural features of CMC discourse

as well as elements for the annotation of those units regarded as “typical” for language use on

the Internet. The schema introduces an element <posting>, which describes stretches of text

that are sent to the server by a user at a certain point in time. Postings are the main constituting

elements of threads and logfiles, which, in our schema, are the two main types of CMC

macrostructures. For the microlevel of CMC documents (that is, the structure of the <posting>

content), the schema introduces elements for selected features of Internet jargon such as

emoticons, interaction words, and addressing terms. It allows for easy anonymization of CMC data

for cases in which the annotated data are made publicly available, and it includes metadata

which are necessary for referencing arbitrary excerpts from the data, for example as references in dictionary

entries or as results of corpus queries.

Documentation of the schema as well as encoding examples can be retrieved from the web at

http://www.empirikom.net/bin/view/Themen/CmcTEI. The schema is meant to be a core model

for representing CMC that can be modified and extended by others according to their own

specific perspectives on CMC data. It could be a first step towards an integration of features for

the representation of CMC genres into a future new version of the TEI Guidelines.

INDEX

Keywords: computer-mediated communication, CMC, web genres, thread, logfile, forum, chat

AUTHORS

MICHAEL BEISSWENGER

Michael Beißwenger is a researcher and lecturer for German Linguistics at TU Dortmund

University. He graduated from the University of Heidelberg with an M.A. in German Philology

and History (2000) and finished his Ph.D. (“Dr.phil.”) in German Linguistics at TU Dortmund

University with a monograph on interactional management in chats

(“Sprachhandlungskoordination in der Chat-Kommunikation,” Berlin/New York: de Gruyter,

2007). Since 2010, he has been the coordinator of the scientific network “Empirical Research on

Internet-based Communication” (http://www.empirikom.net/) funded by the German Research

Foundation (DFG).


MARIA ERMAKOVA

Maria Ermakova is studying Historical Linguistics (M.A.) at Humboldt University Berlin. Since

2010 she has been working as a research assistant for the Digital Dictionary of the German

Language project (DWDS) at the Berlin-Brandenburg Academy of Sciences (BBAW).

ALEXANDER GEYKEN

Alexander Geyken is a researcher at the Berlin-Brandenburg Academy of Sciences (BBAW) where

he is Head of the Digital Dictionary of the German Language (DWDS), a long-term project of the

BBAW.

LOTHAR LEMNITZER

Lothar Lemnitzer is a lexicographer and researcher at the Berlin-Brandenburg Academy of

Sciences (BBAW). He has written introductory books in German about corpus linguistics and

lexicography. He graduated from the University of Heidelberg and finished his Ph.D. (“Dr. phil.”)

in English Linguistics at the University of Münster. He currently uses large corpora of

contemporary German as a basis for the compilation of articles for the Digital Dictionary of the German Language (DWDS).

ANGELIKA STORRER

Angelika Storrer has been professor of German linguistics at TU Dortmund University since 2002. Her research interests include computational lexicography, corpus-based methods in linguistics, and language on the Internet. As a member of the Berlin-Brandenburg Academy of Sciences (BBAW) she is involved in the work on the Digital Dictionary of the German Language (DWDS).


Building and Maintaining the TEI LingSIG Bibliography: Using Open Source Tools for an Open Content Initiative

Piotr Bański, Stefan Majewski, Maik Stührenberg and Antonina Werthmann

1. Introduction

1 While the TEI has been successful in becoming a de facto standard for numerous

applications in Digital Humanities, its status in the area of linguistic annotation is not

as clear. After the initial success of the TEI-encoded British National Corpus (Dunlop

1995), the TEI has given way to simpler and more specialized formats for corpus

annotation, such as (X)CES (Ide et al. 1996; Ide 2000), TigerXML (Mengel and Lezius

2000; Lezius 2002), and, more recently, PAULA (Dipper and Götze 2005; Dipper et al.

2007). Currently, the ISO TC37 SC4 committee is working on the so-called LAF

(Linguistic Annotation Framework) family of standards: see Stührenberg (2012) for

more details.

2 The LingSIG (the “TEI for Linguists” special interest group of the TEI)1 has been created

to examine the actual and potential relationship between TEI markup and the needs

and requirements of linguists. This goal may require adapting (or re-adapting) TEI

markup to the common tasks faced in everyday linguistic practice. In order to achieve

that, a serious review of existing resources is needed, as well as access to people who

are experts in the relevant areas. Both these infrastructural subtasks can be supported

by creating a comprehensive bibliography of works dealing with linguistic markup that

is TEI-inspired or that may inspire new TEI solutions. This bibliography can serve both

as a repository of knowledge and as a resource that can attract non-TEI markup

specialists by providing them with a useful service.

3 This paper addresses an infrastructural issue of universal relevance—the collective

creation of a shared bibliography—congenial with the TEI’s overall aims and

methodology and presented here in the context of the LingSIG. Below, we describe a


combination of open-source general tools and an open-access approach to creating

knowledge repositories. We believe that, for an initiative such as the TEI, it is important

to choose non-proprietary, freely available solutions. If these solutions have the

advantage of attracting new users and promoting the initiative itself, so much the

better, especially if it is done in a non-committal way: no one using the LingSIG

bibliographic repository has to be a user of the TEI. On the other hand, the solution

described here may enhance the culture of sharing that the TEI has grown within.

4 In what follows, we first mention the roots of the idea to establish a repository of

bibliographic references in the context of the TEI LingSIG, then briefly describe Zotero

—the tool that has been chosen to create, store and access the repository—and finally

present the TEI-Zotero Translator—initially a separate Firefox add-on and now part of the Zotero package—that further connects the communities involved by creating a

bridge between the bibliographic recommendations of the TEI Guidelines and the

activities of the LingSIG.

2. LingSIG Reference Library

5 The reference library discussed here is the product of activities connected with the

“TEI for Linguists” special interest group of the TEI (LingSIG). The LingSIG’s roots reach

back to the Digital Humanities conference in London in 2010, where its future

conveners met and decided to prepare a formal application to the TEI Council outlining

the SIG’s aims. What soon followed was the informal “LLiZ” (Linguistic Lunch in Zadar),

organized by Piotr Bański, and the first official SIG meeting during the 2010 Annual

Meeting of the TEI Consortium in Zadar. During that meeting, the participants agreed

that one of the aims that the SIG should address is the creation of a common repository

of references to works that should be taken into account in the process of building a

consistent set of TEI encoding proposals targeting the needs of linguists.

6 The first version of the reference library was created as a TEI Wiki resource and

announced on the SIG mailing list, but, despite an initially positive reaction, the low

number of responses indicated that the barrier to active contribution was too high. It

became obvious that, although using a wiki opened the resource for collective building, it

was only a partially successful move: the results could only be pasted straight from the

wiki page and each time had to be reformatted to conform to a given style sheet.

Furthermore, only a simple web-page search was available to locate references and a

lot of work would have to be devoted to maintaining the entries in a uniform shape. A

more flexible resource was needed that combined the Web 2.0 idea of collective

building and maintenance with greater flexibility of the result format, easier access to

bibliographic data and better search facilities. At this point, the decision was made to

transfer the development to the Zotero platform.2

7 These days, a researcher’s life is punctuated with deadlines. With the date of the next

TEI meeting approaching fast, Zotero-based development manifested one more

advantage over wiki-based creation: it was rapid. It took only a moment to import the

BibTeX of Maik Stührenberg’s extensive linguistic-markup-oriented bibliography and

only several days of Antonina Werthmann’s post-editing to create a sizeable and usable

resource.


3. Zotero

8 Zotero is an open-source citation manager.3 Citation management software is nowadays

a standard component in the preparation workflow for scientific texts; most of the

available tools offer a standard set of features, including adding and editing

bibliographic references, exporting citations formatted according to most standard

academic citation styles, working with citations directly from a word processor using a

plug-in, and creating searchable catalogues of references. While Zotero offers all these

functionalities, it is unique in that it was specifically designed to be used within the

context of a web browser.4

Figure 1. Zotero user-interface, complementing web-oriented research

9 Zotero’s functionality is designed mainly for web-based research activities. Given the

extensive repositories of publicly accessible library catalogues, proprietary services

such as Google Scholar, pre-print archives such as arXiv.org, and countless online

archives of journals, this functionality can be expected to cover a great part of the

bibliographic work for scientific writing in many disciplines. Zotero includes import

translators which allow the direct import of bibliographic data for items discovered

while browsing the Web, reducing time otherwise spent on creating citations manually.

10 Apart from having all the advantages of standard web-oriented tools, Zotero offers

cloud-based synchronisation features that allow any item edited or changed in one

Zotero instance by one collaborator to be updated automatically in all the other

instances. Zotero’s rigid data model and import translators help to reduce the number

of errors that can be introduced by collaborative editing.5

11 Zotero comes in two flavours: as a plugin for current web browsers or as a stand-alone

tool. The first option was built as a plugin for Mozilla Firefox, but since the release of

version 3.0.2, the Zotero Connector is also available as a plugin for Google Chrome and

Apple Safari. The Zotero stand-alone version, which runs under Windows, Mac OS X

and Linux, has been available since early 2012. Both versions feature connectors to web


browsers and plugins for popular word processors, such as Microsoft Word or

OpenOffice/LibreOffice/NeoOffice.

3.1. Creating Bibliographies

12 New bibliographic items can be edited manually or created automatically from the

content of a particular site that the user is visiting (using an import translator). In the

first case, the information is entered into a form with predefined fields corresponding

to particular types of items (book, book section, journal article, etc.; see the lower right

part of fig. 1). In the case of automatic generation of bibliographic items, the required

metadata is copied automatically from web pages, though accuracy and completeness depend on whether an import translator is available for the cited content. Supported sites include homepages of publishers, library catalogues, databases of journals and books, but also sites such as scholar.google.com, amazon.com, or popular blogging platforms.

The availability and quality of the assisted automatic creation of bibliographic items

within the Zotero database is dependent on whether the site provides such information

and on whether Zotero provides a suitable import plugin, whose presence is indicated

by an icon in the browser’s address bar. This icon generally corresponds to the

available item types and supplies a one-click solution: by clicking the icon, the

user saves all the corresponding metadata in the Zotero database. If a PDF file is

available as well, it will be automatically attached to the newly created item. After

creating a Zotero item, one may modify it by correcting or adding metadata entries.

Finally, the item can be tagged with categories, keywords and additional information.

13 In addition to importing data from individual Web pages, Zotero also supports import

of bibliographic metadata in the following bibliographic file formats: MODS (Metadata

Object Description Schema),6 BibTeX, RIS (Research Information System Format),

Refer/BibIX,7 and Unqualified Dublin Core RDF. Recent discussions on TEI-L and

between developers indicated that there is some interest in creating import facilities

for TEI bibliographies as well. The LingSIG plans to implement an import feature either through a student project or once a project that already uses the exporter could immediately benefit from reversing the flow of information.

3.2. Working with Reference Libraries

14 Once a Zotero library has been created, it is not only possible to use the information

stored in the metadata of the respective bibliographic items but also to add notes and

attachments (such as electronic versions of articles). In addition, the ability to define

tags allows for a very flexible categorization scheme (in addition to the use of folders to

organize library items). For the LingSIG library, we have chosen tags such as “XCES”, “TEI”, “EXMARaLDA”, and “BNC”; since these tags can be used for both searching and

organizing items, they constitute a facility that is powerful and easy to use.

15 Libraries created with Zotero can then be shared among the members of the respective

Zotero groups. By joining the LingSIG group,8 new members are allowed to use the

collection and to add to it in a manner much more straightforward than that offered by

wiki-based solutions. All members of the group are allowed to modify the library.9

Changes made by group members can be synchronized with the online library either on

demand or automatically. Apart from accessing the library via Zotero front-ends, one


can also use APIs for read- and write-access to the library using other tools. File

attachments can be synchronized via Zotero File Storage or WebDAV.

3.3. Exporting Bibliographies

16 Storing bibliographic items in a Zotero database opens up several export possibilities.

Citations and reference lists can be generated by Zotero in a great variety of

bibliographic styles as defined by the Citation Style Language (CSL).10 Some styles,

including Chicago, MLA, APA, and Vancouver, are already predefined in Zotero. Others

can be installed via the Zotero Style Repository.11

17 Apart from exporting single or multiple library items, Zotero can create reports,

interactive timelines, and reference lists (the last in a variety of formats, such as HTML

or RTF, and according to different styles). It thus promises to be a nearly universal

writing aid for the members of the LingSIG, and by extension, the entire TEI

community. This is made even more obvious by the fact that, thanks to work by Stefan

Majewski and feedback from the TEI community, Zotero is now able to export TEI XML

<biblStruct> elements directly. This is the topic of the following section.

4. TEI and Zotero

18 As we have shown above, there are numerous reasons for choosing Zotero for citation

management. While Zotero’s integration with major word processors is sufficient for

many purposes, text-encoding scholars often have more advanced needs. For this

reason, some members of the TEI community have begun developing tools capable of

transforming bibliographic items from Zotero to structures that may be used with TEI-

encoded documents. The resulting prototypes addressed particular requirements of

specific tasks and were not meant to be general-purpose tools, but the creation of the

TEI Zotero translator—once a separate Firefox plugin but now integrated into the

Zotero code itself—opens the way towards potential standardization in this area.

4.1. Possible Translation Workflows

19 Two approaches have been used for exporting bibliographic items from Zotero to TEI.

Firstly, it is possible to take one of the standardized output formats that are supported

by default (such as MODS12 and Zotero RDF13) and translate that into TEI XML by means

of an XSL transformation. Another option is to extend Zotero to provide facilities to

directly export its library to TEI XML. From the conceptual perspective, both

approaches are similar: the main challenge is to find the appropriate mapping between

Zotero fields and their closest matches in the TEI. Nevertheless, they differ in the

workflow required to generate the TEI encoding. The first approach requires an

additional transformational step after the initial export into an intermediate format.14

The other approach implements the transformation as a built-in Zotero feature that

might be selected as an option on export. Clearly, the latter requires one fewer step by

the user, offers greater stability (due to its lesser dependence on an intermediate

format controlled by a third party), and makes the task of maintenance simpler: only

the initial and the target data structures have to be considered, not how these map to

the intermediate format. The downside of this approach is that it requires the export


translator to be written in non-XML technology (in the case at hand, ECMAScript). In

what follows, we concentrate on the built-in exporter and, hence, on the direct

mapping from Zotero fields to TEI XML structures.
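
For concreteness, the first (intermediate-format) approach could look roughly like the following XSLT 1.0 fragment, which assumes MODS records exported from Zotero as its input. It is a minimal, hypothetical sketch for illustration only: it maps just one personal name, a title, and a date, and it is not the stylesheet referenced in note 14.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mods="http://www.loc.gov/mods/v3"
    xmlns="http://www.tei-c.org/ns/1.0">
  <xsl:output method="xml" indent="yes"/>
  <!-- gather all exported MODS records into a single listBibl -->
  <xsl:template match="/">
    <listBibl>
      <xsl:apply-templates select="//mods:mods"/>
    </listBibl>
  </xsl:template>
  <!-- map each record to a biblStruct; only the monographic level is sketched -->
  <xsl:template match="mods:mods">
    <biblStruct>
      <monogr>
        <author>
          <xsl:value-of select="mods:name[@type='personal'][1]/mods:namePart[1]"/>
        </author>
        <title>
          <xsl:value-of select="mods:titleInfo[1]/mods:title"/>
        </title>
        <imprint>
          <date>
            <xsl:value-of select="mods:originInfo/mods:dateIssued"/>
          </date>
        </imprint>
      </monogr>
    </biblStruct>
  </xsl:template>
</xsl:stylesheet>

Keeping such a stylesheet in step with both the intermediate format and the TEI target is precisely the maintenance burden that the built-in exporter avoids.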

4.2. Data-mapping Decisions

20 Given an object that represents the items that should be exported, the translator has to

construct the most appropriate output representation. It is therefore essential to know

all possible data structures in the source format and their equivalents in the target

format. The documentation for Zotero plug-in developers is not explicit about the

available data fields in the source database. Nevertheless, as an open source project,

Zotero offers information on the data structures in its source code and in the ample

selection of available export translators, especially the translators to Zotero RDF and to

MODS, which provide good guidance on the availability and handling of the data fields.

21 In TEI encoding, it is often possible to represent information in multiple ways. That is

because the TEI offers a toolkit which has to be customized, with the particular

modeling decisions dependent on the particular use cases. While numerous out-of-the-

box TEI customizations exist, in the area addressed here no ready-made solutions are

available and each project tends to make its own choices. For the TEI Zotero export

translator, encoding decisions have been made at three levels, discussed in the sections

that follow: base encoding (section 4.2.1), item-type-specific encoding (section 4.2.2),

and item-specific encoding (4.2.3). By laying those decisions out for scrutiny, and by

offering the translator as a solution employed by the LingSIG bibliography, we hope to

take a step toward standardizing the resulting format.

4.2.1. Base Encoding

22 The fundamental modeling decision concerning the translator was made at the level of

what we call the “base encoding”: the choice among the three possible top-level

elements for bibliographic references (<bibl>, <biblStruct>, and <biblFull>).

For the purpose of Zotero’s export to TEI, the top-level element <biblStruct> is

used. In what follows, we justify this choice.

23 The element <bibl> is a container for any kind of bibliographic reference that

features a mixed content model: it may contain a mixture of plain text and elements in

any order. Therefore, <bibl> is specifically suited for the representation of existing

bibliographies (that is, the transcription of physical source documents), but it is not the

optimal choice for born-digital bibliographies designed for further processing. For the

latter, it is crucial to have unified, predictable encoding. For this purpose, the element

<biblStruct> was devised. It requires a specific structure and ensures that

particular types of information—especially the core information about the author, the

place of publication, and the title—are stored at the same location in the structure. The

core set of information is structured by bibliographic level: using the element

<monogr> for the monographic level, <analytic> for the analytic level, and

<series> for the series level. This distinction is particularly useful when it comes to

making formatting decisions in XSLT.

24 <biblFull> is similar to <biblStruct> in that it is highly structured, but it

follows a different approach: it uses the same content model as <fileDesc>, and is


thus less rigid with respect to ordering the relevant information. The more predictable

structure of <biblStruct> and its advantages for processing were the factors that

determined the choice for the base target encoding for the export from Zotero to TEI.

25 Bibliographic items are typically arranged in a list-like structure. Consequently, some

kind of a structuring device or a container has to be used to hold the individual items.

As suggested by the Guidelines, the <listBibl> element is used for this purpose in

the output of the translator. The base encoding for the Zotero export is therefore a

<listBibl> containing multiple <biblStruct>s.
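
As a minimal illustration of this base encoding, a monograph (here the 2007 de Gruyter volume mentioned earlier in this issue) would be encoded roughly as follows; the fragment is a hand-written sketch rather than verbatim translator output:

<listBibl>
  <biblStruct>
    <monogr>
      <author>
        <forename>Michael</forename>
        <surname>Beißwenger</surname>
      </author>
      <title>Sprachhandlungskoordination in der Chat-Kommunikation</title>
      <imprint>
        <pubPlace>Berlin/New York</pubPlace>
        <publisher>de Gruyter</publisher>
        <date>2007</date>
      </imprint>
    </monogr>
  </biblStruct>
</listBibl>

Each further item exported from the library simply adds another <biblStruct> to the surrounding <listBibl>.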

4.2.2. Item-type-specific Encoding

26 The second level concerns the item-type-specific encoding—that is, the way in which

the item type for a Zotero item (“journal article”, “book section”, etc.) affects the

mapping to the corresponding elements within the <biblStruct>. While every item

type within the Zotero database features a unique set of properties, many of these

properties are shared and the mapping to TEI is the same irrespective of the type. For

example, the place of publication will always be mapped to the element <pubPlace> within the <imprint> part of <biblStruct>. Nevertheless, some mappings are affected by the item type: for example, the property item.title15 maps to <title> within <analytic> for analytic item types such as “journal article” or “book section”, and to <title> within <monogr> for types that do not have an analytic level.

27 The first fundamental question at this level of encoding is whether the given item

features an analytic level. The TEI Zotero translator defines the item types journal

article, book section, magazine article, newspaper article, and conference paper as

analytic. While Zotero has a schema that determines which fields may be used for a

bibliographic item of a specific type, it does not require the user to enter a minimal

amount of data for any item type. In practice, this can lead to situations where it is not

possible to meet the minimal requirements for <biblStruct>. For the rare cases

where no title is given for a bibliographic resource, an empty <title> element is

generated in <monogr> or <analytic>, respectively—in other words, the

translator remains neutral with respect to apparent omissions in the content of Zotero

items and translates them into corresponding empty elements in the TEI markup, thus

making them easier to spot in the process of validation.
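
To illustrate the effect of the item type, a Zotero item of type “journal article” (here Stührenberg 2012 from the bibliography below) would receive an analytic level roughly as follows; this again is a hand-written sketch, and the exact attributes and the placement of <biblScope> may differ in the actual translator output:

<biblStruct>
  <analytic>
    <author>
      <forename>Maik</forename>
      <surname>Stührenberg</surname>
    </author>
    <title>The TEI and Current Standards for Structuring Linguistic Data: An Overview</title>
  </analytic>
  <monogr>
    <title level="j">Journal of the Text Encoding Initiative</title>
    <imprint>
      <biblScope unit="issue">3</biblScope>
      <date>2012</date>
    </imprint>
  </monogr>
</biblStruct>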

4.2.3. Item-specific Encoding

28 Decisions made at the level of the individual bibliographic items are determined by the

values of the Zotero fields for these items. Firstly, as has been mentioned, the TEI

Zotero translator depends on which of the available fields are actually filled in by the

user. Secondly, for fields that may hold an arbitrary number of individual values, the

exporter will handle items differently depending on how many values they have. In

particular, the area where Zotero provides great flexibility is the assignment of

responsibilities for the creation of the work referenced, and these need to be carefully

mapped to TEI.

29 In Zotero, any bibliographic item can have an arbitrary number of creators of a

particular type. The available creator types are determined by the item type (for

example, in Zotero books may have editors while websites do not have editors but

rather contributors). Many of the Zotero creator types have direct equivalents in the


TEI (for example, creator.type with the values “editor” and “seriesEditor” can both be mapped to the element <editor>). Nevertheless, this does not apply to

all available types (for example, creator.type with the value “contributor”). For

those creator types that do not map directly to TEI elements, a <respStmt> is used

with an element <resp> that contains the name of the Zotero creator type. Consider

the following example:

<respStmt>
  <resp>contributor</resp>
  <persName>
    <forename>Kevin</forename>
    <surname>Hawkins</surname>
  </persName>
</respStmt>

30 The above fragment is the typical choice for the encoding of information about a

contributor to a wiki, while the following fragment would be the encoding of

information concerning the authorship of the present paper:

<author>
  <forename>Piotr</forename>
  <surname>Bański</surname>
</author>
<author>
  <forename>Stefan</forename>
  <surname>Majewski</surname>
</author>
<author>
  <forename>Maik</forename>
  <surname>Stührenberg</surname>
</author>
<author>
  <forename>Antonina</forename>
  <surname>Werthmann</surname>
</author>

31 This is an example of how the structure of the exported item is determined by the

content available within the given data field.

4.3. Output Options

32 Apart from the direct representation of the item data, the TEI Zotero translator offers a

set of output options. First of all, it optionally generates @xml:id attributes for each

exported <biblStruct>. These IDs are generated from the name of the author, the


year of publication, and, if there is more than one reference per author per year, a character for disambiguation (e.g. “Dipper2005b”). Secondly,

the translator can optionally put a simple minimal TEI document around the

<listBibl> for use cases where a complete TEI file is needed for processing or

validation. Finally, since Zotero organizes bibliographic items in collections, it is

possible to represent Zotero’s collection structure within the generated TEI. Collections

in Zotero can, first of all, nest. Secondly, individual bibliographic items may be put into

multiple collections. As <listBibl> can nest as well, it is ideally suited to

representing Zotero collections. The title of the collection is put in a <head> element

at the beginning of the <listBibl> corresponding to the exported collection.

<listBibl>
  <head>Recent Papers</head>
  <listBibl>
    <head>to be read</head>
    <biblStruct>
      <!-- bibliographic item -->
    </biblStruct>
  </listBibl>
</listBibl>
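
With the optional @xml:id generation and the optional document wrapper both switched on, the overall output takes roughly the following shape (a sketch only; the header content actually produced by the translator may differ):

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Bibliography exported from Zotero</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished working material.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Exported from a Zotero group library.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <listBibl>
        <biblStruct xml:id="Dipper2005b">
          <!-- bibliographic content as described above -->
        </biblStruct>
      </listBibl>
    </body>
  </text>
</TEI>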

33 While the TEI Zotero translator is now a mature piece of software, as evidenced by its

recent inclusion into the mainstream Zotero distribution, some important

functionality, such as import facilities for existing TEI-encoded bibliographies, is still

missing. It should be stressed, however, that the translator has been released under an

open-source license and is thus open to contributions in the form of code patches,

feedback, and general discussion.16

5. Summary and Conclusions

34 The present paper highlights the needs relevant for modern collaborative research

practice and, using the example of the TEI LingSIG, shows how Zotero answers many of

the demands that such practice creates. The existence of Zotero-to-TEI translation tools

further confirms that this is not a random choice, and the fact that the tool described

here, the TEI Zotero translator, has been integrated into Zotero testifies to the

reception of the ideas presented here by a broader community of developers and users.

35 The findings reported here go beyond the confines of the LingSIG for two reasons: its

Zotero repository is meant to be usable beyond the SIG and even the TEI community,

and the co-operative resource-building strategy recommended here constitutes a

feasible blueprint for other open-content and open-source initiatives. Also, the

mapping solution used by the translator follows a set of choices that are subject to

community acceptance as the potential de facto way of creating bibliographies.

36 Apart from the matter of acceptance of the Zotero-to-TEI mapping choices, which is an

issue to be decided by the TEI community, we have identified some features that Zotero


users would benefit from. One is the need to ensure preservation of Zotero databases

via automatic backups, versioning, or the like. It would also be beneficial in some

contexts to be able to require a value for some fields, such as the “title” field, possibly

by having incomplete citations appear in a shared “waiting room” before they are

added to the store as complete references. Being able to restrict and directly

manipulate the inventory of tags defined for a particular bibliography store would also

help ensure the overall consistency of the database.

37 The final issue concerns the definition and implementation of a TEI-to-Zotero mapping

(in the other direction). At first glance, it seems reasonable to expect to be able to

import <biblStruct> objects into Zotero, but more concrete solutions will require

needs analysis and further funding.

BIBLIOGRAPHY

Dipper, S. and M. Götze. 2005. “Accessing Heterogeneous Linguistic Data – Generic XML-based

Representation and Flexible Visualization”. In Human Language Technologies as a Challenge for

Computer Science and Linguistics: 2nd Language & Technology Conference, April 21–23, 2005: Proceedings,

206–210, Poznań, Poland: Wydawnictwo Poznańskie.

Dipper, S., M. Götze, U. Küssner, and M. Stede. 2007. “Representing and Querying Standoff XML”.

In Datenstrukturen für linguistische Ressourcen und ihre Anwendungen. Data Structures for Linguistic

Resources and Applications. Proceedings of the Biennial GLDV Conference 2007, edited by G. Rehm, A.

Witt, and L. Lemnitzer, 337–346, Tübingen, Germany: Gunter Narr Verlag.

Dunlop, D. 1995. “Practical considerations in the use of TEI headers in a large corpus.” Computers

and the Humanities 29: 85–98.

Ide, N., G. Priest-Dorman, and J. Véronis. 1996. Corpus Encoding Standard (CES). Technical report,

Expert Advisory Group on Language Engineering Standards (EAGLES).

Ide, N., P. Bonhomme, and L. Romary. 2000. “XCES: An XML-based Encoding Standard for

Linguistic Corpora”. In Proceedings of the Second International Conference on Language Resources and Evaluation

(LREC 2000), 825–830. Athens: European Language Resources Association (ELRA).

Lezius, W. 2002. Ein Suchwerkzeug für syntaktisch annotierte Textkorpora. Ph.D. Thesis, Institut für

Maschinelle Sprachverarbeitung der Universität Stuttgart.

Mengel, A. and W. Lezius. 2000. “An XML-based encoding format for syntactically annotated

corpora”. In Proceedings of the Second International Conference on Language Resources and Evaluation

(LREC 2000), 21–126. Athens: European Language Resources Association (ELRA).

Stührenberg, M. 2012. “The TEI and Current Standards for Structuring Linguistic Data: An

Overview.” Journal of the Text Encoding Initiative, 3.


NOTES

1. http://wiki.tei-c.org/index.php/SIG:TEI_for_Linguists

2. We are grateful to Stuart Yeates for the initial suggestion to use Zotero, made on the

LingSIG mailing list. We also wish to acknowledge the pioneer role that the SIG on

Education has played by setting up a Zotero repository of TEI-related works at http://

www.zotero.org/groups/tei. At the time when the LingSIG repository was created, the

general TEI repository had barely started, and the two were developed in parallel. Our

repository differs in scope, as its primary focus is linguistic markup, be it TEI or not.

Thus, the two repositories merely overlap to some extent. However, it is worth noting

that users who belong to both groups have all the resources at their disposal and can

combine them (and automatically detect and merge duplicates) in the user’s private

Zotero space. It is also worth noting that, unlike the Education SIG’s library, which is a

unitary resource that can only be searched by string-matching, the LingSIG library

features catalog-based and tag-based categories.

3. For a comparison of citation managers, see http://en.wikipedia.org/wiki/

Comparison_of_reference_management_software. What played a decisive role in our

case is that Zotero is open-source, cross-platform, web-oriented, and extremely

flexible.

4. In the presence of a running stand-alone instance, browser add-ons become merely

interfaces, or “connectors”, between the web content accessed by the browser and the

database controlled by the stand-alone Zotero.

5. One shortcoming of Zotero’s features for collaboration is the lack of version history

and the ease of propagation of errors introduced into the content. That is, if a major

maintenance error occurs, as, for example, when one participant accidentally deletes a

set of bibliographic items, there is no version history available that could be used to

revert the changes. Therefore, frequent manual backups by the project participants are

advisable pending an enhancement that targets this issue. On the other hand, Zotero

provides the functionality for duplicate detection and merging that is not present in

wiki-like resources.

6. MODS is developed by the Library of Congress. See http://www.loc.gov/standards/

mods/mods-schemas.html for schema files.

7. The import format of the EndNote citation manager is based on the Refer/BibIX

format.

8. The LingSIG group at Zotero is accessible at https://www.zotero.org/groups/tei-lingsig.

9. This is not the only possible administrative choice in Zotero groups, but any attempt

to limit the write access would run counter to the aim of the entire project, which is to involve as many contributors as possible.

10. See http://citationstyles.org/citation-style-language/schema/ for the current

version of the CSL schema in the RELAX NG notation. Since CSL 1.0, the schema is not

only supported by Zotero but by the Mendeley reference manager as well.

11. The Zotero Style Repository is located at http://www.zotero.org/styles/. The styles

can be used with any client software that supports CSL 1.0.

12. For more information on the Metadata Object Description Schema, see http://

www.loc.gov/standards/mods/.


13. Zotero RDF is the custom export format of Zotero that can also export attached files

and notes.

14. Laura Mandell’s XSL Transformation from Zotero RDF to TEI follows this approach

(see http://wiki.tei-c.org/index.php/ZoteroToTEI).

15. Properties of the items as provided by Zotero are used in dot-notation (i.e.

item.property).

16. Contributions are welcome via email to Stefan Majewski or via https://github.com/smjwsk/translators or http://code.google.com/p/tei-zotero-translator/. The author follows discussions on TEI-L.

ABSTRACTS

The present contribution addresses an infrastructural issue of universal relevance, considered here in the specific context of the TEI. We describe a combination of open-source tools and an open-

access approach to creating knowledge repositories that have been employed in building a

bibliographic reference library for the “TEI for Linguists” special interest group (LingSIG). The

authors argue that, for an initiative such as the TEI, it is important to choose open, freely

available solutions. If these solutions have the advantage of attracting new users and promoting

the initiative itself, so much the better, especially if it is done in a non-committal way: no one

using the LingSIG bibliographic repository has to be a member of the LingSIG or a “TEI-er” in

general.

INDEX

Keywords: LingSIG, Zotero, structured bibliography, reference management, collaborative

workflow

AUTHORS

PIOTR BAŃSKI

Piotr Bański is an assistant professor of Linguistics at the Institute of English Studies of the

University of Warsaw, and a senior researcher at the Institut für Deutsche Sprache in Mannheim,

where he is the coordinator of the project “Corpus Analysis Platform of the Next Generation”. He

is also an elected member of the TEI Technical Council for term 2011–2012 and an expert of the

ISO TC37 SC4 committee for Language Resources Management. His current interests focus mostly

on text encoding as well as the creation and use of robust language resources.

STEFAN MAJEWSKI

Stefan Majewski studied English Language and Literature as well as Sociology at the University of

Vienna, in addition to Electronics at the Vienna University of Technology. He graduated in

English Linguistics with a focus on research infrastructures for corpus linguistics. Currently, he is


working at the Austrian Academy of Sciences, where he is coordinating and working for the

“Data Service Infrastructure for the Social Sciences and Humanities” (DASISH) project. He is also

employed by the Göttingen State and University Library, where he works for the “TextGrid”

project in research and development. His current interests focus on research infrastructures and

annotation systems.

MAIK STÜHRENBERG

Maik Stührenberg received his Ph.D. in Computational Linguistics and Text Technology from

Bielefeld University in 2012. After graduating in 2001, he worked on various projects at the

Justus-Liebig-Universität Gießen and Bielefeld University. He is currently employed as a research

assistant at the Institut für Deutsche Sprache (IDS, Institute for the German Language) in

Mannheim as a member of the CLARIN-D project and is involved in the NA 105-00-06 AA, the

German mirror committee of ISO TC37 SC4. His main research interests include specifications for

structuring multiply annotated data (especially linguistic corpora), query languages, and query

processing.

ANTONINA WERTHMANN

Antonina Werthmann studied computational linguistics at Heidelberg University and has been

working since 2011 as a research assistant at the Institut für Deutsche Sprache in Mannheim (IDS,

Institute for the German Language) as a member of the CLARIN-D project. Her main research

tasks consist in the collection, description, and extension of the information systems on standards in the field of the annotation of linguistic resources.
