
Writer Verification of Historical Documents Among Cohort Writers

Gregory R. Ball and Sargur N. Srihari
CEDAR, University at Buffalo
520 Lee Entrance, Suite 202
Amherst, NY 14228, USA

{grball, srihari}@cedar.buffalo.edu

Roger Stritmatter
Department of Humanities
Coppin State University
2500 West North Avenue
Baltimore, MD 21216-3698, USA
[email protected]

Abstract

Over the last century forensic document science has developed progressively more sophisticated pattern recognition methodologies for ascertaining the authorship of disputed documents. We present a writer verification method and an evaluation of its performance on historical documents with known and unknown writers. The questioned document is compared against handwriting samples of Herman Melville, a 19th-century American author who has been hypothesized to be the writer, as well as against samples crafted by several writers from the same time period. The comparison led to a high-confidence result as to the questioned document's writership and gives evidence for the validity of the writer verification method in the context of historical documents. Such methodology can be applied to many questioned historical documents, in both literary and legal fields.

1. Introduction

Over the last century forensic document science has developed progressively more sophisticated pattern recognition methodologies for ascertaining the authorship of disputed documents. These include advances not only in computer-assisted stylometrics, but also in forensic handwriting analysis such as that used in CEDAR-FOX [1], a forensic examination tool that captures "the uniqueness of an individual writer by generating a probability distribution of distances between instances of character classes in the known document and the general population, as represented by a set of many samples" and employs a log-likelihood ratio "to determine the strength of confidence for the opinion." In earlier work [2], we applied this method to a set of historical documents. In this paper, we establish the generalized validity of the writer verification method for historical documents by analyzing historical documents created by writers matching the demographics of the writers previously compared (similar historical era, gender, etc.).

Because documents typically studied by literary historians are usually printed rather than handwritten, the application of handwriting analysis in the humanities has been more limited than stylometrics [3, 6], primarily concentrating on a limited number of still unresolved questions such as the alleged appearance of William Shakespeare's hand in the manuscript of Sir Thomas More (see Howard-Hill [4] for a survey). However, in cases where both known and unknown samples are handwritten, and contain sufficient sample size to generate reproducible results, forensic handwriting analysis may provide a gold standard for identifying authorship, for literary as well as legal documents.

2 Writer Verification

2.1 CEDAR-FOX

The CEDAR-FOX system has a number of interactive features that make it useful as an experimental tool: feature extractors for handwriting, confidence value computation, and an interface to scan handwritten document images, obtain line and word segments, and automatically extract features for handwriting matching after performing character recognition and/or word recognition (with or without interactive human assistance in the recognition process, e.g., by providing word truth).

One objective of CEDAR-FOX is "writer verification," which provides a measure of confidence as to whether two samples (questioned document and exemplar document) belong to the same individual or to two different individuals. Verification is the task of determining whether a pair of handwriting samples was written by the same individual. Given a new pair of documents, verification is performed as follows: (i) writing element extraction, (ii) similarity computation, (iii) determination of the log-likelihood ratio (LLR) from the estimated conditional probability densities.
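The three-step flow above can be sketched as follows. All function arguments here are illustrative stand-ins rather than CEDAR-FOX APIs, and the sketch assumes the per-element conditional densities p_same and p_diff have already been estimated from training data:

```python
import math

def verify(doc_a, doc_b, extract, distance, p_same, p_diff):
    """Sketch of the verification flow: (i) writing-element extraction,
    (ii) similarity computation, (iii) LLR from the conditional densities.
    `extract` maps a document to {element_name: [occurrences]};
    `distance` compares two occurrences; `p_same`/`p_diff` are the
    estimated conditional densities of distances."""
    elems_a = extract(doc_a)          # (i) writing elements per document
    elems_b = extract(doc_b)
    llr = 0.0
    for key in elems_a.keys() & elems_b.keys():
        for x in elems_a[key]:
            for y in elems_b[key]:
                d = distance(x, y)    # (ii) dissimilarity of two occurrences
                llr += math.log(p_same(d)) - math.log(p_diff(d))  # (iii)
    return llr                        # positive -> same-writer conclusion
```

A toy call might use `extract=lambda doc: {"e": doc}` with numeric "occurrences" and step-function densities; the sign of the returned LLR then gives the decision, as described in Section 2.4.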

2.2 Document Features

The system computes three types of features: macro-features at the document level, micro-features at the character level, and style features from bigrams and words. Each of these features contributes to the final result to provide a confidence measure of whether two documents under consideration are from the same or different writers.

Macro-features capture the global characteristics of the writer's individual writing habit and style, which include entropy of gray values, binarization threshold, number of black pixels, number of interior contours, number of exterior contours, contour slope components consisting of horizontal (0 degree, or flat stroke), positive (45 or 225 degree), vertical (90 or 270 degree), and negative (135 or 315 degree), and average height and average slant per line. In the current study, 11 of the 13 macro-features were used in the experiments (all except entropy and number of black pixels, which were excluded due to their variability depending on page size and scanning).
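To illustrate why the two excluded macro-features scale with page size and scanning, here is a minimal sketch of both; the pixel representation (a flat list of 0-255 gray values) and the fixed threshold are assumptions for illustration, not the CEDAR-FOX implementation:

```python
import math
from collections import Counter

def gray_entropy(pixels):
    """Entropy of the gray-value histogram (one of the 13 macro-features).
    `pixels` is a flat list of 0-255 gray values."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def black_pixel_count(pixels, threshold=128):
    """Black-pixel count after thresholding at an assumed binarization
    level; an absolute count, so it grows with scanned page area."""
    return sum(1 for p in pixels if p < threshold)
```

Both quantities depend on how much paper was scanned and at what resolution, which is why the study dropped them in favor of the 11 page-size-independent macro-features.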

Micro-features are designed to capture the finer details at the character level. Each micro-feature is a gradient-based binary feature vector, which consists of 512 bits corresponding to gradient (192 bits), structural (192 bits), and concavity (128 bits) features, known as GSC features (described fully in [7]). The use of micro-features depends on the availability of recognized characters, i.e., character images associated with truth. Four possible methods are available in CEDAR-FOX to get recognized characters: (i) manually crop characters and label each with its truth; this labor-intensive method has the highest accuracy; (ii) automatically recognize characters using a built-in character recognizer; this method is error-prone for cursive handwriting, where there are few isolated characters; (iii) automatically recognize words using a word lexicon from which segmented and recognized characters are obtained; error rates depend on the size of the lexicon and can be as high as 40% for a page; and (iv) use transcript mapping to associate typed words in a transcript of the entire page with word images [5]; this involves typing the document content once, after which it can be reused with each scanned document. Since the full-page documents have the same content (CEDAR-letter), the transcript-mapping approach was used. This method has an 85% accuracy among words recognized. Since most words are recognized, they are also useful for extracting features of letter pairs and whole words, as discussed next.

Style features are based on the shapes of whole words and the shapes of letter pairs, known as bigrams. A bigram is a partial word that contains two connected characters, such as "th," "ed," etc. Style features are similar to micro-features (i.e., the GSC features) in that a binary feature vector is computed from the bigram or word image. While the bigram feature has the same size as the micro-feature (512 binary values), the word feature has 1024 bits corresponding to gradient (384 bits), structural (384 bits), and concavity (256 bits) features. As with micro-features, using these two style features requires word recognition to be done first. As mentioned above when describing micro-features, a transcript-mapping algorithm was used to perform the word recognition automatically. The recognized words are saved and used to compute the word-GSC features. Characters are then segmented from these words, and two consecutive segmented characters, both confirmed by a character recognizer, are combined as one bigram component.
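The bit layouts of the three GSC vector types described above can be summarized directly from the text; the dictionary names below are illustrative, and the extraction of the bits themselves is beyond this sketch:

```python
# GSC feature-vector layouts as described in the text (bit counts only).
GSC_LAYOUTS = {
    "character": {"gradient": 192, "structural": 192, "concavity": 128},  # 512 bits
    "bigram":    {"gradient": 192, "structural": 192, "concavity": 128},  # 512 bits
    "word":      {"gradient": 384, "structural": 384, "concavity": 256},  # 1024 bits
}

def vector_length(kind):
    """Total binary feature-vector length for a GSC feature type."""
    return sum(GSC_LAYOUTS[kind].values())
```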

2.3 Description of Similarity Computation

For macro-features, since they are all real-valued, the distance is simply the absolute value of their difference. For micro and style features, several methods have recently been evaluated, which has led to the choice of a so-called "correlation" measure to compute the dissimilarity between two binary feature vectors (described more fully in [8]).
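The exact "correlation" measure is specified in [8]; as an assumption, the sketch below shows one standard correlation-style dissimilarity for binary vectors, built from the 2x2 table of bit agreements and disagreements, scaled so that 0 means identical and 1 means maximally dissimilar:

```python
import math

def correlation_dissimilarity(x, y):
    """One standard correlation dissimilarity between two binary
    feature vectors x, y (equal-length sequences of 0/1):
    d = (1 - r) / 2, where r is the Pearson-style binary correlation
    computed from the match/mismatch counts s11, s00, s10, s01."""
    s11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    s00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    s10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    denom = math.sqrt((s10 + s11) * (s01 + s00) * (s11 + s01) * (s00 + s10))
    if denom == 0:              # degenerate all-zeros or all-ones vector
        return 0.0 if list(x) == list(y) else 1.0
    r = (s11 * s00 - s10 * s01) / denom
    return (1 - r) / 2          # 0 = identical, 1 = maximally dissimilar
```

On identical vectors the measure returns 0; on complementary vectors it returns 1, matching the behavior a dissimilarity function feeding the same-writer and different-writer densities would need.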

The distributions of dissimilarities conditioned on being from the same or different writer are used to compute likelihood functions for a given pair of samples. Such distributions can be learned from a training data set and represented either as probability tables or as probability density functions. Probability distributions for discrete-valued distances are represented as probability tables. In the case of continuous-valued distances, parametric methods are used to represent probability density functions. Parametric methods are more compact representations than nonparametric methods; for example, with the k-nearest-neighbor algorithm, a well-known nonparametric method, we need to store some or all training samples. This compactness results in a faster run-time algorithm. Parametric methods are also more robust, in the sense that nonparametric methods are more likely to over-fit and therefore generalize less accurately. Of course, the challenge is to find the right parametric model for the problem at hand. While the Gaussian density is appropriate for mean distance values that are high, the Gamma density is better for modeling distances, since distance is never negative-valued.
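A minimal sketch of fitting a Gamma density to non-negative distances follows; the method-of-moments estimator is an assumption for illustration, since the paper does not specify the estimator used:

```python
import math

def fit_gamma(distances):
    """Method-of-moments fit of a Gamma(shape k, scale theta) density
    to a list of non-negative distance values."""
    n = len(distances)
    mean = sum(distances) / n
    var = sum((d - mean) ** 2 for d in distances) / n
    k = mean ** 2 / var          # shape parameter
    theta = var / mean           # scale parameter
    return k, theta

def gamma_pdf(d, k, theta):
    """Gamma density; its support is d >= 0, so unlike a Gaussian it
    assigns no probability mass to negative distances."""
    return (d ** (k - 1) * math.exp(-d / theta)) / (math.gamma(k) * theta ** k)
```

With k = 1 the density reduces to the exponential distribution, a useful sanity check when validating a fitted model.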

2.4 Log Likelihood Ratio

When each document is characterized by more than one feature, CEDAR-FOX makes the assumption that the writing elements or features are statistically independent, although strictly speaking this is incorrect. We investigated more complex models, such as a Bayesian neural network, and obtained an improvement of only 1-2% in overall accuracy, which is not significant. There are several other justifications for choosing naïve Bayes. First, its functioning is simple to explain and modify; for example, since log-likelihood ratios are additive, we can easily observe the effects of adding and removing features. Its simplicity goes back to the earliest QDE literature [1], where there is reference to multiplying the probabilities of handwriting elements. Second, as has been observed in other machine learning tasks, more complex models tend to overfit the data, which can lead to poorer performance on large amounts of unseen data.

The two likelihoods, that the given pair of documents was written by the same individual or by different individuals, can be expressed as follows, assuming statistical independence of the features. For each writing element e_i, i = 1, ..., c, where c is the number of writing elements considered, we compute the distance d_i(j, k) between the jth occurrence of e_i in the first document and the kth occurrence of e_i in the second document. The likelihoods are computed as

L_s = ∏_{i=1}^{c} ∏_j ∏_k p_s(d_i(j, k))

L_d = ∏_{i=1}^{c} ∏_j ∏_k p_d(d_i(j, k))

The LLR in this case has the form

LLR = ∑_{i=1}^{c} ∑_j ∑_k [ln p_s(d_i(j, k)) − ln p_d(d_i(j, k))]

That is, the final LLR value is computed using all the features, considering each occurrence of each element in each document. The CEDAR-FOX verification system outputs the LLR of the two documents being tested. When the likelihood ratio (LR) is above 1, the LLR value is positive, and when the likelihood ratio is below 1, the LLR value is negative. Hence, if the final score obtained is positive, the system concludes that the two documents were written by the same writer. Similarly, if the final LLR score is negative, the system concludes that the two documents were written by different writers.

3 Questioned and Known Documents

A recent literary discovery which might pose a test case for advanced forensic handwriting analysis is the Hydrachos manuscript (H.), an April 1846 satirical newspaper, The PHILADal GAZETTE - EXTR<A>, with seven pen-and-ink drawings accompanying a 437-word handwritten commentary on U.S. and world news.

Measuring 40 x 25 cm, the document contains satirical news content, primarily from the United States, Great Britain, Italy, and China, on both the recto (Figure 1(b)) and verso (Figure 1(a)) sides.

Lexical, grammatical, thematic, visual, content, and situational analysis [9] all support the hypothesis that the manuscript's author is the New England author Herman Melville. Although Melville does not use the word Hydra<r>chos, many other phrases and allusions in the manuscript can be traced to his published writings. Qualitative stylistic analysis supports the attribution. In the 437-word document, Stritmatter [9] was able to trace 59 words and phrases, including Antidiluvian (sic); hoe cake (used in an identical context); and John Bull and Jonathan (as sobriquets for England and America). Analysis of diction and content is also consistent with Melville's authorship.

In addition, while the utility and accuracy of the writer verification system used in CEDAR-FOX is well established on a large data set, it has not been tested extensively on documents of historic origin. To help establish that the system is era-independent, or at least appropriate for comparison to Melville's time period, letters written by two of Melville's male relatives (Thomas Melville, Jr. and Allan Melville) were obtained, both produced in 1818.¹

4 Experiments and Results

To evaluate the documents according to the writer verification method described in the previous section, we performed the following sequence of steps to prepare the documents for analysis:

Step 1: Manual preprocessing: We modified the documents to remove the hand-drawn non-text images and to remove the most significant noise.

¹ The authors wish to acknowledge the New York Public Library for providing scans of the relevant documents.

(a) Hydrachos manuscript, verso

(b) Hydrachos manuscript, recto

Figure 1. The unknown document: the Hydrachos manuscript

Step 2: Pre-processing with CEDAR-FOX: Binarization, line and word segmentation, document-wide feature calculation, and automatic character recognition.

Step 3: Word segmentation correction: Some manual correction to the word segmentation was performed, since there were slight word segmentation errors.

Step 4: Transcript mapping and manual transcript corrections: Using the "transcript mapping" function and correct transcripts, the known and unknown documents were manually verified to be correctly word-truthed.

Some of the processed documents used are shown in Figure 2.

(a) Known document

(b) Unknown document

Figure 2. Pre-processed and truthed documents

Table 1. Overall LLR Results

Comparison                                           Total LLR   System Opinion
H. Melville Letter 1 vs. Hydrachos, verso               35.29    Identified As Same
H. Melville Letter 2 vs. Hydrachos, verso              129.42    Identified As Same
H. Melville Letter 1 vs. Hydrachos, recto               20.82    Highly Probable Same
H. Melville Letter 2 vs. Hydrachos, recto               60.12    Identified As Same
H. Melville Letter 1 vs. H. Melville Letter 2          333.52    Identified As Same
Hydrachos manuscript, verso vs. recto                   42.87    Identified As Same
Full Hydrachos manuscript vs. H. Melville Letter 1      56.11    Identified As Same
Full Hydrachos manuscript vs. H. Melville Letter 2     189.54    Identified As Same
T. Melville Letter vs. H. Melville Letter 1            -14.84    Indications Did Not
T. Melville Letter vs. H. Melville Letter 2             12.38    Indications Did
T. Melville Letter vs. Hydrachos, recto                -68.18    Identified As Different
T. Melville Letter vs. Hydrachos, verso                -73.03    Identified As Different
A. Melville Letter vs. H. Melville Letter 1           -671.41    Identified As Different
A. Melville Letter vs. H. Melville Letter 2           -386.09    Identified As Different
A. Melville Letter vs. Hydrachos, recto               -318.96    Identified As Different
A. Melville Letter vs. Hydrachos, verso               -448.61    Identified As Different
A. Melville Letter vs. T. Melville Letter             -284.46    Identified As Different

5 Results

Figure 3. Comparison of "ea," "ou," "oo," and "wh" between H. Melville's writing and the unknown document

We treated the problem as a case of four known documents (letters) and two unknown documents (verso and recto) and generated an LLR comparison value for each pair of documents. Comparisons of some selected bigram features are shown in Figure 3.

The comparison method was validated on a test set of 1,648 test cases [8]. When performing on the test cases using all features on documents containing different content, the overall error rate for the general population was 0-2.4%. These test documents were also used to map the LLR score onto a nine-point opinion scale, ranging in order as follows: identified as same, highly probable same, probably did, indications did, no conclusion, indications did not, probably did not, highly probable did not, identified as different. The overall LLR and opinion results for the documents are described in Table 1.
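The nine-point scale amounts to a threshold map over the LLR score. The cut-points below are hypothetical placeholders for illustration; the actual thresholds were calibrated on the 1,648 validation cases and are not given in the text:

```python
# Nine-point opinion scale from the paper; the numeric LLR cut-points
# are hypothetical placeholders, not the calibrated values.
OPINIONS = [
    (30.0,  "identified as same"),
    (15.0,  "highly probable same"),
    (5.0,   "probably did"),
    (1.0,   "indications did"),
    (-1.0,  "no conclusion"),
    (-5.0,  "indications did not"),
    (-15.0, "probably did not"),
    (-30.0, "highly probable did not"),
]

def opinion(llr):
    """Map an LLR score to the nine-point opinion scale by finding the
    first threshold the score meets; anything below the lowest
    threshold falls through to the strongest different-writer opinion."""
    for threshold, label in OPINIONS:
        if llr >= threshold:
            return label
    return "identified as different"
```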

The LLR scores are compound scores consisting of several feature components: the global (macro), character (micro), bigram, word, and lexeme features.

The characters were automatically extracted based upon the word truth. All words were split into candidate characters, and the automatically generated characters were validated by a character recognizer. Characters failing the automatic verification check were not used in the comparison, so as not to introduce artifacts into the comparisons.

Each acceptable character was compared exhaustively to each acceptable corresponding character in the comparison document. When the LLRs of all characters are combined, overall trends are observed. In this case, all characters led to positive results, adding information suggesting same authorship.

To further characterize the strength of confidence in the LLR scores, we modeled the 1,648 validation cases and evaluated which cases had higher or lower LLR scores; that is, we tabulated the LLR values of all known same-writer and different-writer cases (with full-page documents consisting of different content) and determined the percentage of such validation cases that had higher or lower LLR values than our current study's results. The results are presented in Table 2.
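The percentage computation behind Table 2 reduces to a simple tally over the validation LLR values; the validation data themselves come from the 1,648 cases of [8] and are not reproduced here, so the sketch takes them as an input list:

```python
def pct_above(validation_llrs, score):
    """Percentage of validation-case LLR values strictly exceeding a
    given comparison score, as used to contextualize each row of
    Table 2 (separately for the same-writer and different-writer sets)."""
    higher = sum(1 for v in validation_llrs if v > score)
    return 100.0 * higher / len(validation_llrs)
```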

Table 2. Confidence Score Breakdown

Comparison        % same writer   % diff writer
                  having > LLR    having > LLR
HM1 vs. Qverso        14.32            0
HM2 vs. Qverso         0               0
HM1 vs. Qrecto        41.99            0
HM2 vs. Qrecto         4.61            0
TM vs. HM1             0.97           74.51
TM vs. HM2            85.92            0.48
TM vs. Qrecto          0               1.21
TM vs. Qverso          0               8.73
AM vs. HM1             0               0
AM vs. HM2             0               0
AM vs. Qrecto          0               0
AM vs. Qverso          0               0
AM vs. TM              0               0

In short, none of the validation cases consisting of full-page documents written by different writers with different content had LLR values higher than those of the H. Melville letters vs. Hydrachos comparisons presented, implying that the documents we compared are more similar than any of the validation pairings. In the case of Herman Melville Known Letter 2 vs. the average of the verso and recto sides of the questioned document, the LLR score implies a similarity stronger than all but about 2% of the validation cases.

When comparing the T. Melville, Jr. letters to the Hydrachos manuscript, the system clearly identified the writers as different. When comparing the T. Melville, Jr. letters to the Herman Melville known letters, the results were essentially non-decisive, resulting in very low LLR scores considering the large amount of text compared. One of the two scores was (incorrectly) slightly positive. When examining the incorrect result in more detail, one major factor in the score was that the letters being compared had very few large words in common, which meant that the word comparison had to be based on very short words, which yield weaker results.

When comparing A. Melville's letters to Herman Melville's letters and to the Hydrachos document, the system reported very strongly that the writers were different. When comparing the T. Melville, Jr. and A. Melville letters to one another, the system again reported a strong result that they were different. In fact, in all five of these cases the results were so strong that 0% of the validation test cases, in either the same-writer or different-writer groups, matched them.

5.1 Conclusions

In all cases, the LLR numbers indicated that the proposed writer likely penned the questioned document; in addition, comparisons between the questioned document and known documents penned by various contemporary writers likely to have similar penmanship styles resulted in strong opinions that the documents were written by different writers. Further comparisons between the known different writers yielded reasonable results indicating that the documents were indeed recognized as being created by different writers, with a single exception, which was essentially non-conclusive. This helps to establish that the writer verification method described returns valid results for the time period in question. Further results could be generated by comparing other known documents containing similar phrasing to the questioned document, and by comparing these documents to additional known documents.

References

[1] G. R. Ball and S. N. Srihari. Comparison of statistical models for writer verification. In Proceedings of Document Recognition and Retrieval XVI, SPIE, January 2009.

[2] G. R. Ball, R. Stritmatter, and S. N. Srihari. Writer verification in historical documents. In Proceedings of Document Recognition and Retrieval XVII, San Jose, CA, January 2010. SPIE.

[3] D. Holmes. The analysis of literary style: a review. Journal of the Royal Statistical Society, Series A (General), 148(4):328-341, 1985.

[4] T. H. Howard-Hill, editor. Shakespeare and Sir Thomas More: Essays on the Play and its Shakespearean Interest. Cambridge University Press, Cambridge, 1989.

[5] C. Huang and S. Srihari. Mapping transcripts to handwritten text. 2006.

[6] R. D. Peng and N. Hengartner. Quantitative analysis of literary styles. The American Statistician, 56(3):175-185, 2002.

[7] S. N. Srihari, S.-H. Cha, H. Arora, and S. Lee. Individuality of handwriting. Journal of Forensic Sciences, 47(4):856-872, July 2002.

[8] S. N. Srihari, C. Huang, and H. Srinivasan. On the discriminability of the handwriting of twins. Journal of Forensic Sciences, 53(2):430-446, 2008.

[9] R. Stritmatter. The Great Hydrachos Satire of April 11, 1846: A Lost Literary Manuscript Attributed to Herman Melville. June 2009.