
Physica A 388 (2009) 916–926


Shakespeare and other English Renaissance authors as characterized by Information Theory complexity quantifiers

Osvaldo A. Rosso a,b,∗, Hugh Craig c, Pablo Moscato a

a Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, School of Electrical Engineering and Computer Science, The University of Newcastle, University Drive, Callaghan NSW 2308, Australia
b Chaos & Biology Group, Instituto de Cálculo, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Pabellón II, Ciudad Universitaria, 1428 Ciudad de Buenos Aires, Argentina
c Centre for Literary and Linguistic Computing, School of Humanities and Social Science, Faculty of Education and Arts, The University of Newcastle, University Drive, Callaghan NSW 2308, Australia

Article info

Article history: Received 30 May 2008; received in revised form 18 August 2008; available online 24 November 2008.

Keywords: Entropy; Statistical complexity; English literature; Information Theory

Abstract

We introduce novel Information Theory quantifiers in a computational linguistic study that involves a large corpus of English Renaissance literature. The 185 texts studied (136 plays and 49 poems in total), with first editions that range from 1580 to 1640, form a representative set of its period. Our data set includes 30 texts unquestionably attributed to Shakespeare; in addition we also included A Lover's Complaint, a poem which generally appears in Shakespeare collected editions but whose authorship is currently in dispute. Our statistical complexity quantifiers combine the power of the Jensen–Shannon divergence with the entropy variations as computed from a probability distribution function of the observed word use frequencies. Our results show, among other things, that for a given entropy poems display higher complexity than plays, that Shakespeare's work falls into two distinct clusters in entropy, and that his work is remarkable for its homogeneity and for its closeness to overall means.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

The application of concepts and tools from mathematics, physics and computer science to literary texts has generated considerable interest recently. Frequently the object of these studies is not the more traditional one of attributing disputed texts to authors. They aim instead at a more general understanding of the pattern of word frequency and word distribution in a corpus of literary texts. Quantifiers associated both with the first-order statistics of words and other linguistic elements (see Refs. [1–3]) and with high-order statistics [4,5] have been employed, especially in order to quantify the size, coherence and distribution of vocabularies in language samples of various kinds. In the present contribution we apply two quantifiers from Information Theory, one well established – entropy – and a second more novel – statistical complexity – to a set of drama and poetry texts. The results show that, when used together, these quantifiers reveal patterns of likeness and difference in the texts which relate in interesting ways to established categories like mode and genre, offer promising indications in relation to the internal chronological development of authorial styles, and help establish broad differences between authors. Our study thus forms part of a general move to use quantifiers to characterise the profile and dynamics of a language corpus, and of sub-corpora within it, to supplement the more usual application of such measures to the classification of texts by author.

∗ Corresponding author at: Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, School of Electrical Engineering and Computer Science, The University of Newcastle, University Drive, Callaghan NSW 2308, Australia. Tel.: +61 2 4921 6175; fax: +61 2 4921 6929. E-mail addresses: [email protected], [email protected] (O.A. Rosso), [email protected] (H. Craig), [email protected] (P. Moscato).

0378-4371/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.physa.2008.11.018

1.1. Information Theory quantifiers and vocabulary structure

The best known quantifier of information, the entropy measure (the Shannon entropy), has been discussed by Thoiron and by Hoover in relation to the stylistics of literary texts [6,7]. The consensus is that entropy does not relate satisfactorily to the more generalised concept of literary richness, and in general is useful neither as an indicator of style nor as a marker of authorial difference. However, before this contribution, statistical complexity had not been considered in relation to vocabulary patterns.

Several statistical complexity measures (SCM) have recently been introduced in the literature. They are all based on the product of an entropic measure, H, and a distance (in probability space) to a fixed reference state, Q. The latter quantity is usually called the disequilibrium. The statistical complexity vanishes both for a totally random process and for a purely periodic one. It was López-Ruiz, Mancini and Calbet (LMC) who introduced the original product functional form for a statistical complexity measure [8]. Given a probability distribution P associated to the state of a system, the LMC measure C^(LMC) is the product of a normalized entropy H (normalized Shannon entropy) times the disequilibrium Q, given by the Euclidean "distance" from P to P_e, the uniform distribution. Martín, Plastino and Rosso (MPR) [9,10] introduced a modification of the LMC measure, and we follow their lead by using the Jensen–Shannon divergence in this contribution. The intensive MPR statistical complexity has the form

C^{(MPR)}_{JS}[P] = Q_J[P, P_e] \cdot H_S[P],    (1)

where, for the probability distribution P = {p_j; j = 1, ..., N}, with N the number of possible states of the system under study, we associate the entropic measure

H_S[P] = S[P]/S_{max} = \left( -\sum_{j=1}^{N} p_j \ln(p_j) \right) \Big/ S_{max},    (2)

with S_{max} = S[P_e] = \ln N (so that 0 ≤ H_S ≤ 1), where P_e = {1/N, ..., 1/N} is the uniform distribution and S denotes the Shannon entropy. The disequilibrium Q_J is defined in terms of the extensive Jensen–Shannon divergence [10] and is given by

Q_J[P, P_e] = Q_0 \cdot JS[P, P_e],    (3)

with

JS[P_1, P_2] = S[(P_1 + P_2)/2] - S[P_1]/2 - S[P_2]/2    (4)

the above-mentioned Jensen–Shannon divergence, and

Q_0 = -2 \left\{ \frac{N+1}{N} \ln(N+1) - 2\ln(2N) + \ln N \right\}^{-1},    (5)

a normalization constant equal to the inverse of the maximum possible value of JS[P, P_e], i.e., the value obtained when one of the components of P, say p_m, equals unity and the remaining p_j vanish. Thus, the disequilibrium Q_J is an intensive quantity. The disequilibrium Q reflects the system's "architecture", being different from zero if there exist "privileged", or "more likely", states among the accessible ones.

We remark that the above statistical complexity measure quantifies not only randomness but also the degree of correlational structure, and consequently it is not a trivial function of the entropy, in the sense that, for a given H_S value, there exists a range of possible C values between a minimum C_min and a maximum C_max. Thus, evaluating C provides important additional information regarding the peculiarities of a probability distribution, not already carried by the entropy. A general procedure for obtaining the bounds C_min and C_max corresponding to the family of statistical complexity measures is given in Ref. [11]. For details, see Refs. [8,10–14].
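To make the definitions concrete, the following is a minimal numerical sketch (not the authors' code) of the quantifiers in Eqs. (1)–(5); the function names and the toy distribution are illustrative assumptions only.

```python
# Minimal sketch (assumption: plain numpy is sufficient) of the quantifiers
# defined in Eqs. (1)-(5).
import numpy as np

def shannon_entropy(p):
    """S[P] = -sum_j p_j ln p_j, skipping zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def normalized_entropy(p):
    """H_S[P] = S[P] / ln N, Eq. (2), with N the number of possible states."""
    return shannon_entropy(p) / np.log(len(p))

def jensen_shannon(p, q):
    """JS[P1, P2] = S[(P1 + P2)/2] - S[P1]/2 - S[P2]/2, Eq. (4)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return shannon_entropy((p + q) / 2.0) - shannon_entropy(p) / 2.0 - shannon_entropy(q) / 2.0

def mpr_complexity(p):
    """C_JS^(MPR)[P] = Q_J[P, P_e] * H_S[P], Eqs. (1), (3) and (5)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    p_e = np.full(n, 1.0 / n)  # uniform distribution P_e
    # Q_0 is the inverse of the maximum attainable JS[P, P_e], Eq. (5).
    q0 = -2.0 / (((n + 1.0) / n) * np.log(n + 1.0) - 2.0 * np.log(2.0 * n) + np.log(n))
    return q0 * jensen_shannon(p, p_e) * normalized_entropy(p)

# A delta-like distribution gives H_S ~ 0 and C ~ 0; the uniform distribution
# gives H_S = 1 and C = 0, as required of a statistical complexity measure.
p = np.array([0.5, 0.2, 0.2, 0.1])
print(normalized_entropy(p), mpr_complexity(p))
```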

1.2. Data description and probability distribution functions

We assembled 136 plays and 49 poems for this study (M = 185 being the total number of texts considered), all but one of which can be confidently attributed to a single author. The exception is A Lover's Complaint, which generally appears in collected editions of Shakespeare but whose authorship has recently been disputed on the grounds that its style is unlike Shakespeare's [15]. Shakespeare's works are the largest authorial group in the study, and one of our aims is to examine the characteristics of his plays and his poems in relation to the larger corpus. We included in the study the twenty-seven plays and three poems accepted as his unaided work.

Other significant authorial collections in the plays set are those by Ben Jonson (seventeen), Thomas Middleton (twelve), John Fletcher and John Ford (both seven), and Christopher Marlowe (six). George Chapman and John Davies of Hereford each have seven poems in the collection, and Edmund Spenser has four.


Table 1. List of authors (in alphabetical order) of the works considered in the present study, with the classification of the texts as plays and poems. The total number of texts considered is 185: 136 plays and 49 poems.

Author | Plays | Poems | Total
Francis Beaumont | 1 | – | 1
Richard Brome | 1 | – | 1
Thomas Campion | – | 1 | 1
Thomas Carew | – | 1 | 1
Elizabeth Carey | 1 | – | 1
George Chapman | 3 | 7 | 10
Henry Chettle | 1 | – | 1
Samuel Daniel | – | 1 | 1
Sir William Davenant | 1 | – | 1
John Davies | – | 7 | 7
John Day | 1 | – | 1
Thomas Dekker | 5 | – | 5
Edward de Vere, Earl of Oxford | – | 1 | 1
John Donne | – | 3 | 3
Michael Drayton | – | 1 | 1
John Fletcher | 7 | – | 7
John Ford | 7 | 2 | 9
Thomas Goffe | 1 | – | 1
Robert Greene | 4 | – | 4
Fulke Greville, Lord Brooke | 1 | 1 | 2
William Haughton | 2 | – | 2
George Herbert | – | 3 | 3
Mary Herbert | – | 1 | 1
Thomas Herrick | – | 1 | 1
Thomas Heywood | 5 | 1 | 6
Ben Jonson | 17 | 2 | 19
Thomas Kyd | 2 | – | 2
Aemilia Lanyer | – | 1 | 1
Thomas Lodge | 1 | – | 1
John Lyly | 4 | – | 4
Christopher Marlowe | 6 | 1 | 7
Shakerly Marmion | 1 | – | 1
John Marston | 2 | – | 2
Philip Massinger | 2 | – | 2
Thomas Middleton | 12 | – | 12
Anthony Munday | 1 | – | 1
Thomas Nashe | 1 | 1 | 2
George Peele | 4 | – | 4
Henry Porter | 1 | – | 1
Sir Walter Raleigh | – | 1 | 1
Thomas Rowley | 2 | – | 2
William Shakespeare | 28 | 3 | 31
James Shirley | 3 | 1 | 4
Sir Philip Sidney | – | 1 | 1
Robert Southwell | – | 1 | 1
Edmund Spenser | – | 4 | 4
Sir John Suckling | 1 | – | 1
Cyril Tourneur | 1 | – | 1
uncertain | – | 1 | 1
John Webster | 3 | 1 | 4
Robert Wilmot | 1 | – | 1
Robert Wilson | 3 | – | 3
Lady Mary Wroth | – | 1 | 1
TOTAL | 136 | 49 | 185

On the other hand, many authors are represented by a single play or a single poem, and they provide a reference and perspective to the study. The texts span the period 1580–1640, a large slice of the English Renaissance or Early Modern period (Shakespeare is associated with plays from the 1590s to the early 1610s). The complete list of authors in the corpus and the number of corresponding texts by genre is given in Table 1.

Our variables for this study are frequencies of words. It is important to consider two types of words. One of these is known as function words, like the, and, and I: a closed set, rarely added to and common in most texts, of words which have a grammatical function rather than bearing meaning in themselves. The second group is the lexical words, like nouns, adjectives and main verbs. Clearly, the frequencies of these words are different. Some are relatively common, but a good proportion appear only once in the entire corpus.

The texts of plays and poems we used were from early printed editions, with irregular spelling. We used electronic text from online collections such as Literature Online [16] wherever possible, edited them so as to separate headings, stage


directions and so on from the main text, and proof-read them against hardcopy from microfilm. In some cases we created our own electronic text from microfilm copies. We adopted two different approaches to standardising spelling. A set of two hundred function words were all modernized in the texts. Among these two hundred are a number of homographs tagged to separate distinct grammatical functions (that as relative, demonstrative, and conjunction, for instance). The list includes words that would be among the commonest in almost any corpus of English, such as the the and and already mentioned, as well as forms common in Early Modern English but now obsolete, such as thou and the verb form wert. Each instance of these two hundred word-forms was inspected throughout the texts to ensure that instances could be counted even in contractions like th' and I'll and distinguished from similar forms with a different sense, such as the nouns art and May. The rest – all those words not in the list of two hundred function words – were dealt with through a library of equivalent spellings. We collected groups of spelling variants, such as spring, springe and sprynge, and counted instances of these as instances of a single headword or parent word (in this case, spring). Where a spelling belonged to two or more different parent words we discarded it. After these various operations we were left with 56,877 different words to add to the set of two hundred function words.

In this way three word sets will be considered: (a) k = 1: the function words set (N^(1) = 200 words); (b) k = 2: the lexical words set (N^(2) = 56,877 words); (c) k = 3: the total words set, given by the union of the function and lexical words sets (N^(3) = 57,077 words). Once the working set (k = 1, 2, 3) for each considered text (i = 1, ..., M) is fixed, the word frequency f_j^(k,i) is determined (j = 1, ..., N^(k)). Note that in this way all texts are measured within the same common framework, and we can therefore compare the corresponding results without considering the total length (in words) of each text.

In some preliminary investigations, we observed that if one adopts a more sophisticated text indexing system based on the assignment of appropriately weighted single terms, as described by Salton and Buckley [17], in order to equalize the length of the analyzed texts, similar results are obtained in the characterization of poems and plays. In consequence, we postponed for future work a more detailed analysis of results using this type of technique applied to the assignment of the corresponding probability distributions, and continued with the present simpler approach.

The corresponding probability distribution function (PDF), for each text (i) in the word set considered (k), is defined as

P^(k,i) ≡ {p_j^(k,i); j = 1, ..., N^(k)}, where

p_j^{(k,i)} = f_j^{(k,i)} \Big/ \sum_{j=1}^{N^{(k)}} f_j^{(k,i)}.    (6)

Also, we define the word average probability distribution over a set of texts (i = 1, ..., M) as P^(k)_ave ≡ {⟨p_j^(k)⟩; j = 1, ..., N^(k)}, where

\langle p_j^{(k)} \rangle = \langle f_j^{(k)} \rangle \Big/ \sum_{j=1}^{N^{(k)}} \langle f_j^{(k)} \rangle,    (7)

with

\langle f_j^{(k)} \rangle = \left[ \sum_{i=1}^{M} f_j^{(k,i)} \right] \Big/ M,    (8)

and M the total number of works considered.
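A small sketch of how the word PDFs of Eqs. (6)–(8) can be assembled is given below; it assumes texts are already tokenized, spelling-standardised lists of words, and the variable names and toy data are illustrative only.

```python
# Sketch of the per-text PDF of Eq. (6) and the average PDF of Eqs. (7)-(8),
# computed over a fixed word set (e.g. the 200 function words for k = 1).
from collections import Counter
import numpy as np

def text_pdf(tokens, word_set):
    """p_j = f_j / sum_j f_j over the fixed word set, Eq. (6)."""
    counts = Counter(tokens)
    f = np.array([counts.get(w, 0) for w in word_set], dtype=float)
    return f / f.sum() if f.sum() > 0 else f

def average_pdf(token_lists, word_set):
    """Average word PDF over a set of M texts, Eqs. (7)-(8)."""
    freqs = np.array([[Counter(toks).get(w, 0) for w in word_set]
                      for toks in token_lists], dtype=float)
    mean_f = freqs.mean(axis=0)   # <f_j>, Eq. (8)
    return mean_f / mean_f.sum()  # <p_j>, Eq. (7)

# Hypothetical toy usage; the real word sets are those described in the text.
word_set = ["the", "and", "i", "of", "love"]
texts = [["the", "love", "of", "the", "king"],
         ["i", "love", "and", "i", "sing"]]
print(text_pdf(texts[0], word_set))
print(average_pdf(texts, word_set))
```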

1.3. Computational results

The Normalized Shannon Entropy H_S and the MPR Statistical Complexity C^(MPR)_JS for the M = 185 texts constituting our corpus were evaluated using the function words set (k = 1, N^(1) = 200 words), and the results are displayed in Fig. 1(a) and (b). All texts are localized in a narrow region of the HC-plane: 0.761 < H_S < 0.857 and 0.227 < C^(MPR)_JS < 0.303. This can be understood as a measure of the homogeneity of the corpus considered in the present work.

The relatively high values of entropy imply that even though the function words are used very often in the texts, their use is not uniform and presents an underlying structure, which is reflected in the relatively high scores in complexity. From Fig. 1(b) it is clear that plays and poems do form broad clusters, but the two modes are characterized by the same range of variation in entropy and complexity. In consequence the function words set cannot be used to classify the texts by mode, since the clusters overlap in the HC-plane.

The values of entropy and statistical complexity obtained for the corpus under analysis considering the total words set (k = 3, N^(3) = 57,077 words, M = 185) are shown in Fig. 2(a) and (b). As in the previous case, we see that all texts are localized in a small portion of the HC-plane: 0.513 < H_S < 0.580 and 0.481 < C^(MPR)_JS < 0.523, confirming in this way the presence of structure (Zipf's law) as well as complexity (measured by the closeness to the upper bound provided by the maximum attainable complexity) in the analyzed corpus.


Fig. 1. (a) Characterization of plays and poems in the Normalized Shannon Entropy (H_S[P]) versus MPR Statistical Complexity (C^(MPR)_JS[P]) plane. The corresponding plays and poems PDF were evaluated based on the function words set (k = 1). The continuous lines represent the maximal and minimal (C_max and C_min) statistical complexity curves. (b) Detail.

The change in the localization of the corpus as a whole in the HC-plane in the present case (total words set, see Fig. 2) compared with the previous one (function words set, see Fig. 1) follows from the inclusion of the lexical words, which, even though their frequencies are in general low, make some contribution to the corresponding quantifiers. It is also interesting to note that the contribution of the function words in the set, and their high frequencies of appearance relative to the lexical words, are responsible for the decrease in entropy values. From Fig. 2(b) it is clear that plays and poems can be grouped into two clusters with only a very small overlap. In consequence the total words set could be used to classify the texts by mode. Separately, neither the entropy nor the statistical complexity measure serves to distinguish between poems and plays. On entropy, plays and poems are represented among both the highest and lowest values. On statistical complexity, plays range from the lowest value to the second highest, and poems range from below the average to the highest.
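As an illustration of how such HC-plane ranges can be obtained, the short sketch below reuses the functions from the earlier sketches; `texts` and `word_set` are the hypothetical toy objects introduced there, not the actual corpus.

```python
# Place each text in the HC-plane and report the observed ranges of the
# normalized entropy and the MPR statistical complexity (cf. Figs. 1 and 2).
hs_values, c_values = [], []
for tokens in texts:
    p = text_pdf(tokens, word_set)
    hs_values.append(normalized_entropy(p))
    c_values.append(mpr_complexity(p))

print("H_S range:", min(hs_values), "to", max(hs_values))
print("C_JS^(MPR) range:", min(c_values), "to", max(c_values))
```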

1.4. Stylistic differences

Entropy reflects the spread of the total words of a text among the different words available. Texts that are essentially in a single style and tend to be repetitive have low entropy scores, and unpredictable, internally varied texts have high scores. Among plays, the typical low-scoring text is a comedy with a domestic setting, whose characters use a simple, restricted, vernacular vocabulary. Examples are Henry Porter's Two Angry Women of Abingdon (H_S = 0.52570) and Thomas Heywood's The Wise Woman of Hogsdon (H_S = 0.52703). On the other hand, comedies with some or all of the action set in courts, like Jonson's Cynthia's Revels (H_S = 0.56571) and Poetaster (H_S = 0.56262), can be relatively high in entropy. At the other end of the spectrum are courtly tragedies, with dialogue in a high style and an abundance of exotic references, like John Ford's The Ladies' Trial (H_S = 0.57667) and John Marston's Sophonisba (H_S = 0.56962). The range for poems is from lyrics or vernacular narratives to highly wrought, high-flown tragic narratives or laments. The extremes here are George Herbert's collection of devotional lyrics, The Church Militant (H_S = 0.51363) at the low end, and William Shakespeare's classical narrative The Rape of Lucrece (H_S = 0.57904) at the high one.

Plays and poems are not so well mixed if both axes are taken into account. For a given entropy value poems are consistently higher on statistical complexity than are plays. Poems and plays are almost equally likely to occupy a given span of entropy values, but within that span poems dominate the higher complexity scores. The obvious contrast between the form of poems and plays – presentation by a single voice, as against dialogue – offers an explanation. Even though the overall dispersion of total words among the different words is much the same, as reflected in the entropy score, dialogue brings with it more pronouns, and a preponderance of other function words which mark deixis and anaphora (as with here and this) and modality (should, must, might and so on). One rough measure of this is the proportion of the whole text made up by instances of the 200 function words we counted (a sketch of this computation is given below).

As we mentioned above, for a given entropy value there exists a range of possible values of statistical complexity. In Table 2 a select group of plays and poems with a similar, mid-range normalized entropy score and the corresponding statistical complexity is given. Also shown in this table is the proportion of their total words represented by the 200 function words. Against the general tendency for poems to score higher than plays, Fulke Greville's play Mustapha (C^(MPR)_JS = 0.51006) has a higher statistical complexity score than Chapman's poem Euthymiae Raptus (C^(MPR)_JS = 0.50703). The comparison is between a stately tragedy with long, disquisitory speeches and a poem which is in the form of a dialogue (between the poet and the Lady Peace). Hamlet (C^(MPR)_JS = 0.49031) and Every Man out of His Humour (C^(MPR)_JS = 0.48608) have low statistical complexity values given their entropy score. Both are strongly interlocutory plays, with many scenes made up of brief exchanges between characters, while The Tempest (C^(MPR)_JS = 0.50094) inclines to longer expository speeches (including an inset masque with mythological characters).
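The rough measure mentioned above (the "% of Total" column of Table 2) can be sketched as follows; the toy function word set is only a stand-in for the authors' list of 200 modernized forms.

```python
# Share of a text's running words accounted for by the counted function words.
def function_word_share(tokens, function_words):
    hits = sum(1 for w in tokens if w in function_words)
    return 100.0 * hits / len(tokens)

# Toy stand-in: the ten highest-ranked function words of Table 5.
function_words = {"the", "and", "i", "of", "a", "you", "is", "my", "to", "it"}
sample = ["to", "be", "or", "not", "to", "be", "that", "is", "the", "question"]
print(function_word_share(sample, function_words))  # 40.0
```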


Fig. 2. (a) Characterization of plays and poems in the entropy–complexity plane (H_S × C^(MPR)_JS). The corresponding plays and poems PDF were evaluated based on the total words set (k = 3). The continuous lines represent the maximal and minimal (C_max and C_min) statistical complexity curves. (b) Detail.

Table 2. Select group of plays and poems with a similar, mid-range normalized entropy score (H_S[P]) and the corresponding statistical complexity (C^(MPR)_JS[P]). % of Total is the proportion of their total words represented by the 200 function words.

Author | Text | Mode | H_S[P] | C^(MPR)_JS[P] | % of Total
Ben Jonson | Every Man Out of His Humor | Play | 0.54990 | 0.48608 | 58
William Shakespeare | Hamlet | Play | 0.55461 | 0.49031 | 58
William Shakespeare | The Tempest | Play | 0.55360 | 0.50094 | 58
George Chapman | Euthymiae Raptus | Poem | 0.55398 | 0.50703 | 53
Fulke Greville | Mustapha | Play | 0.55353 | 0.51006 | 52
George Chapman | Hymnus in Cynthiam | Poem | 0.55003 | 0.51354 | 48

Fig. 3. Characterization of William Shakespeare's works in the entropy–complexity plane (H_S × C^(MPR)_JS). The corresponding works PDF were evaluated based on the total words set (k = 3) frequencies for plays and poems. The continuous line represents the maximal (C_max) statistical complexity curve. Note that the William Shakespeare texts fall into two clearly distinguished groups, the first corresponding to early and middle comedies and two tragedies (H_S ≤ 0.543) and the second to history plays, later comedies and the rest of the tragedies (H_S ≥ 0.549); see also Table 3. The localization of A Lover's Complaint, of uncertain authorship, is also displayed.

Table 2 shows that for these texts, as statistical complexity increases, the 200 function words occupy a smaller proportion of the total words used; plays tend to give way to poems; and dialogue yields to monologue.

The disposition of William Shakespeare's works in the HC-plane is given in Fig. 3. The corresponding word probability distributions were evaluated based on the total words set (k = 3, N^(3) = 57,077 words) for plays and poems. Dates of first performance are taken from Harbage and Schoenbaum [18]. William Shakespeare's work spreads widely across both the entropy and complexity ranges, with texts both below average and close to the maximum of the entropy for this set, and with both high and low values for statistical complexity.


Table 3. Shakespeare plays in the study with classification according to genre (comedy, tragedy and history) and date of first performance. Numerical values correspond to the Normalized Shannon Entropy (H_S[P]) and MPR Statistical Complexity (C^(MPR)_JS[P]) evaluated using the total words set (k = 3). The distance (Jensen–Shannon divergence, JS[P, P_ave]) between the work and the average word PDF (taken over all the plays without inclusion of Shakespeare's works) is given in the last column. The texts are arranged according to the grouping observed in Fig. 3.

W. Shakespeare texts | Genre | Date | H_S[P] | C^(MPR)_JS[P] | JS[P, P_ave]
Comedy of Errors | Comedy | 1592 | 0.53647 | 0.49343 | 0.11054
Two Gentlemen of Verona | Comedy | 1593 | 0.53618 | 0.49202 | 0.10358
The Merchant of Venice | Comedy | 1596 | 0.54273 | 0.49206 | 0.09483
Much Ado About Nothing | Comedy | 1598 | 0.53236 | 0.48690 | 0.09600
As You Like It | Comedy | 1599 | 0.53886 | 0.48908 | 0.09182
The Merry Wives of Windsor | Comedy | 1600 | 0.53402 | 0.48473 | 0.11214
Twelfth Night | Comedy | 1600 | 0.53729 | 0.48878 | 0.09626
All's Well That Ends Well | Comedy | 1603 | 0.54118 | 0.48887 | 0.08576
Julius Caesar | Tragedy | 1599 | 0.54126 | 0.49495 | 0.10777
Othello | Tragedy | 1604 | 0.54235 | 0.48954 | 0.08926

Love's Labour's Lost | Comedy | 1595 | 0.55893 | 0.49865 | 0.10471
Midsummer Night's Dream | Comedy | 1595 | 0.55052 | 0.50039 | 0.11167
Troilus and Cressida | Comedy | 1602 | 0.55953 | 0.49782 | 0.09906
Cymbeline | Comedy | 1609 | 0.55611 | 0.49515 | 0.08256
A Winter's Tale | Comedy | 1610 | 0.54985 | 0.49245 | 0.08553
The Tempest | Comedy | 1611 | 0.55360 | 0.50094 | 0.10430
King John | History | 1591 | 0.55720 | 0.50048 | 0.10123
Richard III | History | 1593 | 0.55656 | 0.49760 | 0.09320
Richard II | History | 1595 | 0.56127 | 0.50349 | 0.10650
Henry IV Part 1 | History | 1597 | 0.55212 | 0.49406 | 0.09722
Henry IV Part 2 | History | 1597 | 0.55338 | 0.49369 | 0.09317
Henry V | History | 1599 | 0.55928 | 0.49388 | 0.10157
Romeo and Juliet | Tragedy | 1595 | 0.55300 | 0.49727 | 0.09638
Hamlet | Tragedy | 1601 | 0.55461 | 0.49031 | 0.08857
King Lear | Tragedy | 1605 | 0.55812 | 0.49810 | 0.08815
Antony and Cleopatra | Tragedy | 1607 | 0.55724 | 0.49812 | 0.09606
Coriolanus | Tragedy | 1608 | 0.54964 | 0.49180 | 0.09606

Interestingly, the William Shakespeare plays form two well-defined clusters along the entropy axis (see Fig. 3 and Table 3). The first group, with lower entropy values (H_S ≤ 0.543), consists of early to middle comedies and two tragedies. There is no play with a later date than 1604 (Othello). The second group (H_S ≥ 0.549) contains all the history plays in the set, and a group of mostly (not exclusively) later comedies and tragedies. The six plays included in our analysis with dates of 1605 and after are all in this group. The history plays as a group are higher in entropy; their specificity in location and their focus on the details of war and politics might be expected to give rise to a more dispersed vocabulary. In comedy and tragedy, there is a tendency for later plays to be in the second group, higher in entropy, reflecting a style that is sprawling rather than compact, exotic rather than domestic, and varied rather than consistent. Among the William Shakespeare poems, the variation in entropy follows the lines of genre. The Sonnets (H_S = 0.55830), with their mixture of personal and more widely philosophical reference, are lower in entropy than the comic mythological narrative Venus and Adonis (H_S = 0.56390), which in its turn is lower than the tragic Rape of Lucrece (H_S = 0.57904), set in classical Rome.

In order to compare the William Shakespeare works with the rest of the corpus we evaluate the average PDF P^(k)_ave over all plays without the inclusion of the William Shakespeare plays (M = 109); see Eqs. (7) and (8). In the present study we consider this P^(k)_ave as the reference word PDF. We are then able to compare the PDF associated with each author's work included in the corpus against the average PDF by evaluating the corresponding distance, using the Jensen–Shannon divergence (see Figs. 4 and 5).

Plays are concentrated in the range 10,000 to 27,500 words. The length of the poems ranges from approximately 1,600 to 21,500 words. In Fig. 4 we present the Jensen–Shannon divergence between each play or poem PDF, P, and the average PDF based on the total word set, P^(k=3)_ave, as a function of the total number of words of the corresponding work. Plays and poems present a moderate linear anti-correlation (Pearson coefficient R = −0.718 and −0.640, respectively) between the total number of words and the distance to the average distribution. This behaviour is expected because we are working with a fixed number of words in the evaluation of the corresponding PDF: if the analyzed text has a larger number of words, the number of non-zero probabilities (non-zero word frequencies) is expected to increase and the PDF is expected to present less kurtosis. For texts with larger word totals, large values of entropy and small complexity are expected (the associated PDF will be closer to the uniform distribution).

For the bulk of plays (see Fig. 5) the distance from the average distribution is in the range 0.08 < JS[P^(3), P^(3)_ave] < 0.12. The poems have a wider spread, 0.14 < JS[P^(3), P^(3)_ave] < 0.26. The three plays with the largest distance from the average distribution are The Battle of Alcazar (10,832 words, JS = 0.16899) and The Love of King David and Fair Bethsabe (14,808 words, JS = 0.15923), both by George Peele, and Cornelia by Thomas Kyd (15,042 words, JS = 0.15042).
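The Fig. 4 analysis can be sketched as below, reusing the earlier helper functions; `all_texts` and `reference_texts` are hypothetical token-list collections standing in for the full corpus and for the non-Shakespearean plays respectively.

```python
# Distance of each work's PDF from the reference average PDF, set against the
# work's length in words, with the Pearson coefficient of the two series.
import numpy as np

def distance_versus_length(all_texts, word_set, reference_texts):
    p_ave = average_pdf(reference_texts, word_set)
    distances = [jensen_shannon(text_pdf(toks, word_set), p_ave) for toks in all_texts]
    lengths = [len(toks) for toks in all_texts]
    r = np.corrcoef(lengths, distances)[0, 1]  # Pearson correlation coefficient
    return distances, lengths, r
```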


Fig. 4. Jensen–Shannon divergence between each play or poem PDF (P) and the average PDF (P_ave, taken over all considered plays excluding those by William Shakespeare) versus the word total of the corresponding work. Both word probability distributions were evaluated based on the total words set.

Fig. 5. Jensen–Shannon divergence between each play or poem word PDF (P) and the average PDF (P_ave), JS[P, P_ave], versus the distance (Jensen–Shannon divergence) between the word PDF (P) and the uniform PDF (P_e), JS[P, P_e]. The average word PDF P_ave was taken over all considered plays excluding those by William Shakespeare. All probability distributions were evaluated based on the total words set.

The Battle of Alcazar is the story of a heroic Englishman fighting in Morocco for the King of Portugal, with numerous speeches of high-romantic ambition. The Love of King David adheres closely to biblical sources. Cornelia is a translation from a French original, dealing in stately fashion with Julius Caesar's defeat of the followers of Pompey the Great. It is easy to understand how they might all depart markedly in language use from the norm established by the wider patterns of non-Shakespearean drama. At the other extreme, the lowest-scoring play on JS[P^(3), P^(3)_ave] is William Shakespeare's late comedy Cymbeline (27,153 words, JS = 0.07874). It is in a subdued, courtly style, has multiple plots, includes a mix of comic and tragic elements, and draws on some contrasting source materials. The result evidently defines the broad average for drama across the six decades covered by our corpus, balancing the extremes of domestic comedy, romantic tragedy, and geopolitical history plays.

In the case of the poems, the largest distances are for Hymnus in Noctem (2,994 words, JS = 0.25290) and Hymnus in Cynthiam (3,963 words, JS = 0.24122), both by George Chapman, and The Church Militant by George Herbert (2,149 words, JS = 0.25315). These are opposed to William Shakespeare's Sonnets, a collection of lyrics which has the lowest score of any of the poems (17,581 words, JS = 0.12418). The Sonnets are mostly addresses to a beloved, in a conversational style which one could imagine brought them closer than any other poems in the collection to the norm established by the plays.

Table 4 shows the average entropy and complexity for the fifteen playwrights with more than two plays in our corpus, with the standard error for the two measures, as well as the overall averages for the two measures and their standard errors. Given the extraordinary status of Shakespeare as a playwright, it is of interest to see how his works fare in this comparison. The average entropy of his plays is close to the overall average; their average statistical complexity is a little below it. In each case the standard error is among the very lowest. The overall averages were all calculated without Shakespeare's plays, to allow a comparison unaffected by his own large representation in the set (27 plays out of 136). Thus, far from being characterised by exceptionally high values in entropy and complexity, Shakespeare's plays are in fact unremarkable in entropy and among the lowest in statistical complexity, and are remarkable for their narrow dispersion around the mean.
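A sketch of how a Table 4-style summary can be produced is shown below; `records` is a hypothetical list of (author, H_S, C) triples obtained with the earlier sketches, and the field names are illustrative only.

```python
# Per-author mean and standard error of entropy and complexity (cf. Table 4).
from collections import defaultdict
import numpy as np

def author_summary(records):
    groups = defaultdict(list)
    for author, hs, c in records:
        groups[author].append((hs, c))
    summary = {}
    for author, values in groups.items():
        arr = np.array(values)                            # shape (works, 2)
        mean = arr.mean(axis=0)
        # Standard error of the mean; meaningful only for authors with more
        # than one work, as in Table 4.
        se = arr.std(axis=0, ddof=1) / np.sqrt(len(arr))
        summary[author] = {"works": len(arr),
                           "H_mean": mean[0], "H_se": se[0],
                           "C_mean": mean[1], "C_se": se[1]}
    return summary
```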


Table 4. The average entropy and statistical complexity for the playwrights with more than two plays in our corpus, with the standard error for the two measures, as well as the overall averages for the two measures and their standard errors. The overall averages were all calculated without Shakespeare's plays.

Author | Works | ⟨H_S⟩ | SE[H_S] | ⟨C^(MPR)_JS⟩ | SE[C^(MPR)_JS]
George Chapman | 3 plays | 0.55647 | 0.00052 | 0.49987 | 0.00114
Thomas Dekker | 5 plays | 0.56034 | 0.00572 | 0.50255 | 0.00260
John Fletcher | 7 plays | 0.54483 | 0.00262 | 0.49589 | 0.00220
John Ford | 7 plays | 0.56224 | 0.00584 | 0.50305 | 0.00282
Robert Greene | 4 plays | 0.55077 | 0.00156 | 0.50238 | 0.00206
Thomas Heywood | 5 plays | 0.54281 | 0.00628 | 0.49566 | 0.00298
Ben Jonson | 17 plays | 0.54903 | 0.00231 | 0.49146 | 0.00153
John Lyly | 4 plays | 0.53821 | 0.00185 | 0.49709 | 0.00128
Christopher Marlowe | 6 plays | 0.54631 | 0.00530 | 0.49744 | 0.00225
Thomas Middleton | 12 plays | 0.54514 | 0.00530 | 0.49492 | 0.00083
George Peele | 4 plays | 0.54435 | 0.00668 | 0.49838 | 0.00167
William Shakespeare | 27 plays | 0.54903 | 0.00175 | 0.49424 | 0.00091
James Shirley | 3 plays | 0.53757 | 0.00339 | 0.49209 | 0.00198
John Webster | 3 plays | 0.55488 | 0.00612 | 0.49436 | 0.00263
Robert Wilson | 3 plays | 0.54273 | 0.00412 | 0.49730 | 0.00316

PLAYS AVERAGE | 110 plays | 0.54956 | 0.00113 | 0.49749 | 0.00702

Table 5. The 1st to 25th ranked function words and the corresponding probability in the average PDF (k = 1).

Rank (j) | Function word | p_j^(k=1)
1 | the | 0.05327
2 | and | 0.04998
3 | I | 0.04780
4 | of | 0.03415
5 | a | 0.03322
6 | you | 0.03104
7 | is | 0.02620
8 | my | 0.02530
9 | to (infinitive) | 0.02363
10 | it | 0.02287
11 | in (preposition) | 0.02209
12 | to (preposition) | 0.01707
13 | not | 0.01702
14 | your | 0.01637
15 | will (verb) | 0.01621
16 | me | 0.01545
17 | be | 0.01518
18 | with | 0.01448
19 | but | 0.01420
20 | have | 0.01333
21 | this | 0.01306
22 | he | 0.01268
23 | for (preposition) | 0.01251
24 | his | 0.01218
25 | as | 0.01121

In Fig. 5 the distances between the PDF of each work (plays and poems) included in the corpus and the average word PDF (JS[P^(3), P^(3)_ave]), and between each PDF and the uniform word PDF (JS[P^(3), P^(3)_e]), are shown. A clear distinction between plays and poems is observed. In particular, we note that the plays present less dispersion than the poems. This observation could be associated with the fact that poems have fewer total words. Note that the William Shakespeare plays are distributed widely along the range of values for plays, confirming the observation made previously from Fig. 4 and Table 4.

By construction, in the PDF based on the total word set (k = 3) the function words (k = 1) are associated with the largest probabilities (largest word frequencies). In Fig. 6, we compare the average PDF associated with the function word set (k = 1) with the corresponding PDF of Hamlet by William Shakespeare (JS[P^(3), P^(3)_ave] = 0.08857) and of The Battle of Alcazar by George Peele (JS[P^(3), P^(3)_ave] = 0.17141). In order to facilitate the comparison between the two PDF, we ranked the average PDF (k = 1) by increasing values of word probability (p_j^(1)). In Table 5 the twenty-five function words with the largest average probability are given. The diagonal in Fig. 6 represents the ideal case in which the two PDF are equivalent (P^(1) ≡ P^(1)_ave).
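A sketch of the ranking used for Fig. 6 follows; `p_ave` and `p_text` are function-word PDFs (k = 1) produced with the earlier sketches, and `word_set` lists the corresponding words.

```python
# Pair each function word's average probability with its probability in a
# single work, ordered by increasing average probability (cf. Fig. 6).
import numpy as np

def ranked_comparison(p_ave, p_text, word_set):
    order = np.argsort(p_ave)  # increasing average probability
    return [(word_set[j], float(p_ave[j]), float(p_text[j])) for j in order]

# A scatter of the two probability columns against each other, with the
# diagonal marking P = P_ave, reproduces one panel of Fig. 6.
```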

The list of the commonest words will of course vary with text type and genre. The fact that some of the function words in the list have been tagged according to grammatical function makes comparison with untagged corpora a little difficult, but it is worth noting that the ten most common words in the "Imaginative fiction" part of the British National Corpus [19] are, in order of overall frequency, the, and, to, a, of, I, he, was, she, and in.


Fig. 6. (a) Ranked average word PDF (P_ave, taken over all considered plays excluding those by William Shakespeare) versus the word PDF (P) corresponding to William Shakespeare's play Hamlet. Both word probability distributions were evaluated based on the function words set. The distance between P and P_ave for k = 3 is given at the top of the figure. (b) Same as before, for Peele's Battle of Alcazar.

In the case of Hamlet we observe that the corresponding PDF follows the average PDF closely. As the frequency of words increases, so, broadly, does the divergence from the mean, since the absolute values are greater. Among the commonest words, the occurs more often in Hamlet than the average (k = 1, p(Shakespeare) = 0.06294, p(ave) = 0.05327); and occurs at about the average level (k = 1, p(Shakespeare) = 0.05165, p(ave) = 0.04998); and I occurs markedly below it (k = 1, p(Shakespeare) = 0.03481, p(ave) = 0.04780).

For the Peele play we see clearly that the associated word PDF presents large deviations from the diagonal. The use of the (k = 1, p(Peele) = 0.08208, p(ave) = 0.05327), and (k = 1, p(Peele) = 0.10251, p(ave) = 0.04998) and of (k = 1, p(Peele) = 0.07484, p(ave) = 0.03415) is much higher than the average values. On the other hand, the use of I (k = 1, p(Peele) = 0.01578, p(ave) = 0.04780) is much lower than the average value. In addition, we observe that the probabilities associated with the words you, it, not, will (verb), me, but, and have are lower than the corresponding average probabilities by one order of magnitude. This pattern of use reflects a paucity of interactive dialogue in the play, and a predominance of speeches of exposition and description.

1.5. Conclusions

Counts of individual words in a text are among the simplest data to collect. Entropy, divergence from the mean and statistical complexity are well understood. Yet in this case, the combination of data and method reveals results which reflect the judgments readers make (even if only subliminally) and arrays texts in ways which prompt a rethinking of literary analysis at the macroscopic level. Texts from various modes, genres and authors are set alongside each other on a systematic basis. The individual peculiarities of the texts can then be balanced against the wider patterns of likeness and difference which are revealed by quantitative analysis. We can thus add language to other areas already explored through these quantities, such as species richness [20], dynamical systems [13,14,21], stochastic processes [22,23], biological systems [12,24,25] and genome-wide expression [26]. In each case the methods offer genuinely new insights into complex and sometimes baffling domains.

Fig. 6(a) and (b) also reflect a wider finding of this study, that William Shakespeare's plays generally use vocabulary items at a rate which is very close to the norm for the drama of his time. His exceptional achievement is certainly not to be found in any wide divergence from the mean of vocabulary use established by his contemporaries. In fact, his work is unusual if anything for its constant closeness to the average use of words at the time. It is the Shakespeare plays that diverge least from the mean of any plays in the set, and the Shakespeare poems that diverge least from the mean of any poems (Table 4 and Figs. 4 and 5). Our results would suggest that we should think of William Shakespeare in word-use terms as a playwright who was extremely well attuned to the general practice of his day, producing an exceptionally large set of plays which all remain within a broadly defined common linguistic standard. If his dialogue in general strikes audiences and readers as just and satisfying, this impression may derive at least in part from his tendency in vocabulary use to fit with the patterns of the drama of his day, which may of course reflect in turn a wider educated usage among English speakers of the time.

Finally, the study offers some evidence that William Shakespeare may not be the author of the poem A Lover's Complaint. The values calculated for this work place it far from the three canonical Shakespeare poems in Figs. 3–5. We need to bear in mind, however, that mode, genre and chronology, rather than authorship, are the factors that predominate in the disposition of texts in the entropy–statistical complexity plane, and thus these results are far from conclusive. We aim to pursue statistical aspects of this problem further in a future study.


Acknowledgments

This work was partially supported by the Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina (PIP 5687/05, PIP 6036/05) and ANPCyT, Argentina (PICT 11-21409/04). OAR and PM gratefully acknowledge support from the Australian Research Council (ARC) Centre of Excellence in Bioinformatics, Australia.

References

[1] L.L. Gonçalves, L.B. Gonçalves, Fractal power law in literary English, 2005. cond-mat/0501361.
[2] M.A. Montemurro, D.H. Zanette, Entropic analysis of the role of words in literary texts, Adv. Complex. Syst. 5 (2002) 7–17.
[3] H. Zhou, G.W. Slater, A metric to search for relevant words, Physica A 329 (2003) 309–327.
[4] Z.L. Antiqueira, M.G.V. Nunes, O.N. Oliveira Jr., L. da F. Costa, Strong correlations between text quality and complex networks features, Physica A 373 (2007) 811–820.
[5] D.R. Amancio, L. Antiqueira, T.A.S. Pardo, L. da F. Costa, O.N. Oliveira, M.G.V. Nunes, Complex networks analysis of manual and machine translations, Internat. J. Modern Phys. C 19 (2008) 583–598.
[6] P. Thoiron, Diversity measures and entropy as measures of lexical richness, Comput. Humanities 20 (1986) 197–202.
[7] D. Hoover, Another perspective on vocabulary richness, Comput. Humanities 37 (2003) 151–178.
[8] R.L. López-Ruiz, H.L. Mancini, X. Calbet, A statistical measure of complexity, Phys. Lett. A 209 (1995) 321–326.
[9] M.T. Martín, A. Plastino, O.A. Rosso, Statistical complexity and disequilibrium, Phys. Lett. A 311 (2003) 126–132.
[10] P.W. Lamberti, M.T. Martín, A. Plastino, O.A. Rosso, Intensive entropic non-triviality measure, Physica A 334 (2004) 119–131.
[11] M.T. Martín, A. Plastino, O.A. Rosso, Generalized statistical complexity measures: Geometrical and analytical properties, Physica A 369 (2006) 439–462.
[12] O.A. Rosso, M.T. Martín, A. Figliola, K. Keller, A. Plastino, EEG analysis using wavelet-based informational tools, J. Neurosci. Methods 153 (2006) 163–182.
[13] A.M. Kowalski, M.T. Martín, A. Plastino, O.A. Rosso, Bandt–Pompe approach to the classical-quantum transition, Physica D 233 (2007) 21–31.
[14] O.A. Rosso, H.A. Larrondo, M.T. Martín, A. Plastino, M.A. Fuentes, Distinguishing noise from chaos, Phys. Rev. Lett. 99 (2007) 154102.
[15] B. Vickers, Shakespeare, "A Lover's Complaint", and John Davies of Hereford, Cambridge University Press, Cambridge, 2007.
[16] Literature Online, Proquest Information and Learning. URL: http://www.lion.chadwyck.com/.
[17] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Inform. Process. Manag. 24 (1988) 513–523.
[18] A. Harbage, S. Schoenbaum, Annals of English Drama 975–1700, University of Philadelphia Press, Philadelphia, 1964.
[19] The British National Corpus, version 2 (BNC World), distributed by Oxford University Computing Services on behalf of the BNC Consortium, 2001. URL: http://www.natcorp.ox.ac.uk/.
[20] L. Jost, Entropy and diversity, Oikos 11 (2006) 363–374.
[21] O.A. Rosso, R. Vicente, C.R. Mirasso, Encryption test of pseudo-aleatory messages embedded on chaotic laser signals: An information theory approach, Phys. Lett. A 372 (2008) 1018–1023.
[22] L. Zunino, D.G. Perez, M.T. Martín, M. Garavaglia, A. Plastino, O.A. Rosso, Permutation entropy of fractional Brownian motion and fractional Gaussian noise, Phys. Lett. A 372 (2008) 4768–4774.
[23] H.L. Larrondo, M.T. Martín, C.M. Gonzalez, A. Plastino, O.A. Rosso, Random number generators and causality, Phys. Lett. A 352 (2006) 421–425.
[24] A.M. Korol, R.J. Rasia, O.A. Rosso, Alterations of thalassemic erythrocytes detected by wavelet entropy, Physica A 375 (2007) 257–264.
[25] O.A. Rosso, Entropy changes in brain function, Int. J. Psychophysiol. 64 (2007) 75–80.
[26] R. Berretta, A. Mendes, P. Moscato, Selection of discriminative gene microarray experiments using mathematical programming, J. Res. Practice Inform. Technol. 99 (2007) 287–299.