On Solving the Problem of Textual Contamination

12
1 Paper presented at the Workshop “Information Technologies and Innovation in Sanskrit Based Indian Studies” held the University of Vienna, Department for South Asian, Tibetan and Buddhist Studies on 26 March 2011. On solving the problem of textual contamination by means of computer aided stemmatics Philipp Maas, University of Vienna 1) What is “Computer aided stemmatics”? Computer aided stemmatics is a method to build reliable hypotheses of transmission histories mainly by comparing different text versions. The resulting hypothesis is then used as a guideline to establish the earliest reconstructable version of this work (called archetype in stemmatics) and to get an idea of how (and why) the literary work changed in the course of transmission from the oldest reconstructable text version into several different versions. Computer aided stemmatics combines two complementary approaches: first, the use of cladistic software to calculate and analyze cladograms, i.e. trees having an internested branching structure. Cladograms serve as rough estimates of the transmission history. The second approach is the philo- logical discussion of variant readings, which basically consist of checking whether the computer generated structure of a tree is acceptable from a philological-historical perspective. If philological-historical arguments con- tradict the cladistic hypothesis, the hypothesis has to be modified accor- dingly in order to transform the cladogram into a stemma. Modifications may concern the structure of the tree as well as decisions regarding the over- all utility of witnesses for building a stemma.

Transcript of On Solving the Problem of Textual Contamination

1

Paper presented at the Workshop “Information Technologies and Innovation in Sanskrit Based

Indian Studies” held the University of Vienna, Department for South Asian, Tibetan and Buddhist

Studies on 26 March 2011.

On solving the problem of textual contamination by means of

computer aided stemmatics

Philipp Maas, University of Vienna

1) What is “Computer aided stemmatics”?

Computer aided stemmatics is a method to build reliable hypotheses of

transmission histories mainly by comparing different text versions.

The resulting hypothesis is then used as a guideline to establish the earliest

reconstructable version of this work (called archetype in stemmatics) and to

get an idea of how (and why) the literary work changed in the course of

transmission from the oldest reconstructable text version into several

different versions.

Computer aided stemmatics combines two complementary approaches: first,

the use of cladistic software to calculate and analyze cladograms, i.e. trees

having an internested branching structure. Cladograms serve as rough

estimates of the transmission history. The second approach is the philo-

logical discussion of variant readings, which basically consist of checking

whether the computer generated structure of a tree is acceptable from a

philological-historical perspective. If philological-historical arguments con-

tradict the cladistic hypothesis, the hypothesis has to be modified accor-

dingly in order to transform the cladogram into a stemma. Modifications

may concern the structure of the tree as well as decisions regarding the over-

all utility of witnesses for building a stemma.

2

1) An introduction to the basic principles of cladistics

Cladistics is a method which is used in different academic fields like

archaeology, linguistics, cultural sciences, in different philologies and, of

course, in its homeland of evolutionary biology. All these fields share the

aim to build hypotheses of the historical development of objects under

investigation (like artefacts, languages, rituals, recipes, literary works, and

biological species) from a recording of identical and different characteristics

of these objects. In cladistics, the recording of characteristics is transformed

into a graphic representation of the historical development in the from of a

branching diagram, a tree. A tree should account for the distribution of

characters (or variants) among the taxa (or manuscripts) under investigation.

In choosing the tree which fits the data best, the so-called parsimony

principle is used. This principle – frequently referred to as Occam’s razor –

is based on the assumption that if there are several alternative solutions for a

scientific problem, the most economical – or parsimonious – solution is to

be preferred as long as it is not contradicted by other known facts.

Parsimony translates into textual cladistics as follows.

Different versions of a text differ from each other in their variants. If two or

more textual witnesses share the same variant as against all other witnesses,

there are basically two possible explanations. Either one and the same

variant occurred several times in the history of transmission or the variant

occurred only once and was subsequently copied. The second explanation,

the more parsimonious one, is the one to be preferred under normal

conditions.

3

2) The application of cladistics under favourable conditions

Fig. 1: Data matrix created from ten instances of variation in six manuscripts

This data matrix records the variants occurring at ten exemplary instances of

variation for six manuscripts from a collation of the CS Vimānasthāna 8.

Within each column identical numerical symbols represent identical

readings in the manuscripts. For example, in column no. 1, number 0 is

recorded for manuscripts Ad, Chd, P1s, which means that they share a

common reading exclusively as against all other manuscripts. For manu-

scripts Ap1 and Ba1 number 2 is recorded, which means that these two

witnesses share a different reading; and finally the symbol 4 for B1d

indicates that this witness has a peculiar reading not shared by one of the

other manuscript. The choice of the numerical symbols is arbitrary, but it is

important to record variants in such a manner that identical symbols

represent identical readings.

In order to grant explanatory power to the present example, it is necessary

that we take each place of a variation to represent a class of similar cases,

which would occur in a recording of a not too short text passage.

Parsimony software is used to find the best, i.e. most parsimonious, repre-

sentation of the data in form of a bifurcate tree. The most parsimonious re-

presentation is the one that explains the distribution of variants among the

textual witnesses by assuming the lowest degree of textual changes along the

branch lines. It therefore amounts to assuming textual changes only in

instance, where the data provide evidence for such a change.

4

Fig. 2: Unrooted tree calculated from ten unordered and equally weighted variants. CI 1.0.

I fed the just mentioned data matrix to the cladistic computer program

PAUP* and applied the exhaustive search option under parsimony criteria.

The program found very quickly one most parsimonious unrooted tree out of

all together 105 possibilities. All recorded variants support this tree, which

means that there is a perfect consistency in the data. The degree of con-

sistency between data and tree can be measured and translated into the so-

called consistency index. In our case, where there is perfect consistency, the

consistency index has the numerical value of 1.0. A lower degree of

consistency would, of course, result in a lower value of the consistency

index.

In the present case, the high consistency between data and tree is not a big

surprise, since we are dealing here not with a real textual transmission, but

with exemplary variants selected with a view to demonstrate how cladistic

software works under favourable conditions and how the results of cladistics

compare to the results of classical stemmatics.

In order to do so, we first have to root the up to now un-rooted tree, since an

un-rooted tree cannot serve as a hypothesis on the development of a text in

time. Rooting does not affect the structure of the tree. All the lines that make

up the un-rooted tree remain unchanged. Rooting is nothing more than

5

identifying the particular point on a tree which deserves the apex position,

and then pulling this point upwards, which leaves the lines of the tree

hanging down. The position of the root in a cladistic tree corresponds to the

position of the archetype, the oldest reconstructable witness, in a stemma.

There is no way in text genealogy to identify the root in a cladogram by

mere numerical calculations. At least one variant which is exclusively trans-

mitted by a single group of manuscripts and which can confidently be

judged as being original has to be identified on the basis of philological con-

siderations. If the same group of manuscripts also contains at least one clear

variant of secondary origin, this group must go back to one of the hyp-

archetypes. All the other available witnesses accordingly go back to the

other hyparchetype, so that the archetype can be located at that part of an

unrooted tree which connects the hyparchetypes.

Fig. 3: Variant no. 2 traced along the tree

In the present example, manuscripts Ad Chd P1s transmit in variant no. 2 a

reading of secondary origin, where as Ap1 B1d Bad preserve at the same

instance the genuine reading. It may be noted that from a cladistic point of

view it is unclear whether the one or the other reading is original.

6

Fig. 4: Variant no. 8 traced along the tree

In variant no. 8, the opposite is the case. Ap1 B1d Bad transmit a reading of

secondary origin, whereas Ad P1s (and Chd) preserve the more original one.

According to the interpretation of these two readings, the root of the tree can

be located along the branch which connects the two groups of manuscripts

Ad Chd P1s and Ap1 B1d Bad.

As the result of the cladistic analysis we now have a hypothesis of the

development of the CS in time, which is based on a numerical calculation

and on the philological judgement of two variant readings.

But what is the relationship of this hypothesis to the remaining eight variants

when judged from a philological perspective? How would the result

achieved so far compare to a classical stemma, built by applying the princi-

ples of textual criticism as formulated by Paul Maas?

3) A methodological comparison of cladistics with classical stemmatics

The existence of peculiar readings in all manuscripts indicates that none of

the available witnesses was the immediate exemplar of one of the others.

The position of the taxa at the terminal points of branches in the cladogram

7

corresponds to the position they would have in a classical stemma at the end

of each line of transmission.

Fig. 5: Variant no. 5 traced along the tree

Moreover, classical stemmatics would ascribe a common exemplar to Chd

and P1s due to the connective error they share exclusively in variant no. 5.

And the same holds true for Ap1d B1d, which share a connective error in

variant no. 4.

Fig. 6: Variant no. 4 traced along the tree

From a stemmatic perspective, however, this reading is difficult to judge,

since an editor would have to know the temporal relationship between the

8

variants sa�darśitāni and darśitāni a priori in order to judge at the omission

of the preverb sa� as a connective error. In cladistics, such a judgement is

not necessary. On the backdrop of all available data, the individual reading

darśitāni can only be taken to be a corruption of sa�darśitāni.

The different interpretations of this variant in cladistics and stemmatics hint

at a fundamental difference between these two methods. Whereas classical

stemmatics is based on the judgement of individual readings, cladistics

involves considerations of the logical structure of the whole tree and on the

interrelationship of individual variants.

Under favourable conditions, i.e. in a transmission in which there is perfect

consistency between data and tree, cladistics produces the same result as

classical stemmatics at its best, but cladistics achieves this result much

faster. Cladistics is therefore superior to classical stemmatics in terms of

efficiency. Moreover, it is superior in terms of reliability, since in classical

stemmatics a misjudgement of the temporal relationship of alternative

variants may lead either to misjudging the data as partly contradictory or to

ascribing a wrong stemmatic position to one or more witnesses. This is ruled

out in cladistics. And finally, cladistics is to be preferred over classical

stemmatics also on the basis of methodological considerations. Its

application requires much less back-ground assumptions on the temporal

relationship of alternative readings than stemmatics. Whereas cladistics

requires just a small number of philological judgements in order to root a

previously established tree, classical stemmatics requires the philological

judgement of variants for each and every step of tree building.

4) Contamination in classical stemmatics

It may seem, however, that in demonstrating the superiority of cladistics

over stemmatics I am beating a dead horse, since stemmatics is nowadays

virtually out of use. The reason for this is the fact the classical stemmatics

9

simply does not work for real existing textual traditions of Sanskrit (and

other) works, since in real transmissions the occurrence of contamination

(i.e. mixing of two or more text versions into a single copy) is to be

assumed. In the following part of my presentation, I would therefore like to

show how cladistics works in dealing with contaminated transmissions.

5) Contamination in cladistics

In order to discuss the capability of cladistics in a contaminated trans-

mission, I recorded the variants of witness J1d to our data matrix. Manu-

script J1d was identified as being strongly contaminated in a previous study.

The cladistic software *PAUP calculates again one most parsimonious tree

from the data. This time, however, not all data is in harmony with the tree

structure. There is a logical conflict between different possibilities of

mapping the data on the tree. This conflict shows itself in a consistency

index of ca. 0.96 which is lower than the optimum amount of 1.0.

Fig. 7: Most parsimonious tree containing J1d. Consistency Index ca. 0.92

After rooting, the tree shape is almost identical to that of the previously

discussed tree. The only difference is that the newly added manuscript J1d

received the prominent position of being a sister of the youngest common

exemplar of Ad Chd and P1s. If this hypothesis is accepted, the readings of

10

J1d are decisive for a reconstruction of the archetype. All cases, in which

J1d reads together with two witnesses out of the family Ap1d, B1d and Ba1d

would then have to be considered as being of archetypal origin. But is this

hypothesis reliable?

From a cladistic point of view the hypothesis is simply the best one can

achieve, since the tree in its present shape corresponds to the most

parsimonious resolution of the data. From a historical-philological per-

spective, we are, however, obliged to question the cladistic hypothesis. The

clue to judge the reliability of the cladistic hypothesis is a discussion of

variant no 7, in which the data does not fit the tree.

Fig. 8: Variant no. 7 traced along the tree

The problem with this variant is that it cannot be traced consistently along

the present tree, since J1d and P1s share the common reading akleśam

asai��utva� as against akleśasahi��utva� in Ad and Chd. If we are willing

to accept the structure of the tree, we have to accept that an identical textual

change occurred in J1d and P1s independently. This explanation is quite

unconvincing, especially if we take the present variant not as an individual

case but as representing a whole class of variants. It is therefore necessary to

take an alternative explanation into consideration and to change the tree

structure in such a way that variant no. 7 indicates that J1d and P1s are in

fact sister manuscripts.

11

In the modified tree, the consistency of the data drops to a value of 0.92,

since now two variants do not show a most parsimonious resolution along

the tree.

Fig 9: Variant no. 2 traced along an alternative tree

The first inconsistent variant is no. 2, in which, as we have already seen,

manuscripts Ad Chd and P1s share the common error kā as against kārya in

all other witnesses.

Fig. 10: Variant no. 5 traced along an alternative tree

The second case is quite similar. Here manuscripts Chd and P1s share the

common writing error yathā as against the correct reading yasya in all other

manuscripts. J1d, on the other hand, does not transmit this error.

These two inconsistencies can easily be explained – without taking recourse

to improbable likelihoods – as the result of corrections. In all cases in which

12

J1d is free from variants of secondary origin transmitted in P1s, its readings

agree with the readings we find within the group Ap1d, B1d and Ba1d.

Therefore it is quite obvious that the scribe of J1d must have had access not

only to an exemplar which is closely related to P1s, but also to a second one,

which derived from the common archetype of Ap1d, B1d and Ba1d. In other

words, J1d can be considered a contaminated copy. Its true genealogical

position therefore is not close to the archetype as was indicated by cladistics,

but in close proximity to P1s.

6) Methodological consequences

First of all, it should have become clear that cladistics as such does not

provide a solution for the problem of contamination. Whenever two or more

exemplars were mixed systematically in order to improve the readings of the

main exemplar in the copy, cladistics will inevitably ascribe a too early

position to this contaminated witness. On the other hand, cladistic software

provides the opportunity to investigate the structure of cladistic trees and to

judge the informative value of individual variants.

The key to spot contaminated witnesses is the following consideration:

Unobjectionable readings pass frequently from a secondary exemplar into a

copy, since clear writing errors will usually be corrected as soon as they are

detected. Therefore, a small number of shared writing errors provides more

evidence for a close stemmatic relationship of witnesses than a large number

of meaningful readings. Based on this assumption it is possible to identify

contaminated manuscripts, to trace their main exemplar, to determine to

region of a stemma from which they received contaminational readings, and

to decide which manuscripts are decisive for the reconstruction of the

archetype.

The use of cladistic software therefore has a large potential to cushion or

even solve the long standing problem of contamination in stemmatics.