Reducing OCR errors in Gothic-script documents

Zurich Open Repository and Archive
University of Zurich, Main Library
Winterthurerstr. 190, CH-8057 Zurich
www.zora.uzh.ch

Year: 2011

Reducing OCR errors in Gothic-script documents

Furrer, L; Volk, M

Furrer, L; Volk, M. Reducing OCR errors in Gothic-script documents. In: 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), Hisar, 16 September 2011, 97-103. Postprint available at: http://www.zora.uzh.ch

Posted at the Zurich Open Repository and Archive, University of Zurich. http://www.zora.uzh.ch

Originally published at: Furrer, L; Volk, M. Reducing OCR errors in Gothic-script documents. In: 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), Hisar, 16 September 2011, 97-103.


Reducing OCR Errors in Gothic-Script Documents

Lenz Furrer
University of Zurich

[email protected]

Martin Volk
University of Zurich
[email protected]

Abstract

In order to improve OCR quality in texts originally typeset in Gothic script, we have built an automated correction system which is highly specialized for the given text. Our approach includes external dictionary resources as well as information derived from the text itself. The focus lies on testing and improving different methods for classifying words as correct or erroneous. Also, different techniques are applied to find and rate correction candidates. In addition, we are working on a web application that enables users to read and edit the digitized text online.

1 Introduction

The State Archive of Zurich currently runs a digitization project that aims to make almost 200 years of German governmental texts publicly available online. A part of this project comprises the resolutions by the Zurich Cantonal Government from 1887 to 1902, which are archived as printed volumes in the State Archive. These documents are typeset in Gothic script, also known as blackletter or Fraktur, as opposed to the antiqua fonts of modern typesetting. In a cooperation project of the State Archive and the Institute of Computational Linguistics at the University of Zurich, we digitize these texts and prepare them for public online access.

The tasks of the project are automatic image-to-text conversion (OCR) of the approximately 11,000 pages, the segmentation of the bulk of text into separate resolutions, the annotation of metadata (such as the date or the archival signature) as well as improving the text quality by automatically correcting OCR errors.

As for OCR, the data are most challenging: the texts contain not only Gothic type letters – which already lead to lower accuracy compared to antiqua texts – but also particular words, phrases and even whole paragraphs printed in antiqua font (cf. figure 1). Although we are lucky to have an OCR engine capable of processing mixed Gothic and antiqua texts, the alternation of the two fonts still has an impairing effect on the text quality. Since the interspersed antiqua tokens can be very short (e. g. the abbreviation Dr.), their diverting script is sometimes not recognized by the engine. This leads to heavily misrecognized words due to the different shapes of the typefaces; for example antiqua Landrecht (Engl.: "citizenship") is rendered as the completely illegible I>aii<lreclitt, which is clearly the result of using the inappropriate recognition algorithm for Gothic script.

Figure 1: Detail of a session header, followed by two resolutions, in German. The numbered titles as well as Roman numerals are typeset in antiqua letters.

In section 2 we present the OCR system used in the project. The focus of our work lies on section 3, which discusses our efforts in post-correcting OCR errors. Section 4 is concerned with a web application.

Figure 2: Output XML of Abbyy Recognition Server 3.0, representing the word "Aktum".

2 OCR system

We use Abbyy's Recognition Server 3.0 for OCR. Our main reason to decide in favour of this product was its capability of converting text with both antiqua and Gothic script. Another great advantage in comparison to Abbyy's FineReader and other commercial OCR software is its detailed XML output file, which contains a lot of information about the recognized text.

Figure 2 shows an excerpt of the XML output.

The fragment contains information about the five letters of the word Aktum (here meaning "governmental session"), which corresponds to the first word in figure 1. The first two XML tags, par and line, hold the contents and further information of a paragraph and a line respectively; e. g. we can see that this paragraph's alignment is centered. Each character is embraced by a charParams tag that has a series of attributes. The graphical location and dimensions of a letter are described by the four pixel positions l, t, r and b, indicating the left, top, right and bottom borders. The following boolean-valued attributes specify whether a character is at a left word boundary (wordStart), whether the word was found in the internal dictionary (wordFromDictionary) and whether it is a regular word or rather a number or a punctuation token (wordNormal, wordNumeric, wordIdentifier). charConfidence has a value between 0 and 100 indicating the recognition confidence for this character, while serifProbability rates its probability of having serifs on a scale from 0 to 255. The fourth character u shows an additional attribute suspicious, which only appears if it is set to true. It marks characters that are likely to be misrecognized (which is not the case here).

We are pleased with the rich output of the OCR system. Still, we can think of features that could make a happy user even happier. For example, post-correction systems could gain precision if the alternative recognition hypotheses made during OCR were accessible.1 Some of the information provided is not very reliable, such as the attribute "wordFromDictionary" that indicates if a word was found in the internal dictionary: on the one hand, a lot of vocabulary is not covered, whereas on the other, there are misrecognized words that are found in the dictionary (e. g. Bandirektion, which is an OCR error for German Baudirektion, Engl.: "building ministry"). While the XML attribute "suspicious" (designating spots with low recognition confidence) can be useful, the "charConfidence" value does not help a lot with locating erroneous word tokens.

An irritating aspect is that we found the dehyphenation to be done worse than in the previous engine for Gothic OCR (namely FineReader XIX). Words hyphenated at a line break (e. g. fakul-tative on the fourth line in figure 1) are often split into two halves that are no proper words; thus looking up the word parts in its dictionary does not help the OCR engine with deciding on the correct conversion of the characters. Since Abbyy's systems mark hyphenation with a special character when it is recognized, one can easily determine the recall by looking at the output. We found the Recognition Server's recall to be significantly lower than that of FineReader XIX, which is regrettable since it goes along with a higher error rate in hyphenated words.

The Recognition Server provides dictionaries for various languages, including so-called "Old German".2 In a series of tests the "Old German" dictionary has shown slightly better results than the Standard German dictionary for our texts. Some, but not all, of the regular spelling variations compared to modern orthography are covered by this historical dictionary; e. g. old German Urtheil (in contrast to modern German Urteil, Engl.: "judgement") is found in Abbyy's internal dictionary, whereas reduzirt (modern German reduziert, Engl.: "reduced") is not. It is also possible to include a user-defined list of words to improve the recognition accuracy. However, when we added a list of geographical terms to the system and compared the output before and after, there was hardly any improvement. Although the words from our list were recognized more accurately, the gain in the overall number of misrecognized words was almost completely offset by new OCR errors unrelated to the words added.

All in all, Abbyy's Recognition Server 3.0 serves our purposes of digitizing well, especially in respect of handling large amounts of data and pipeline processing.

1 Abbyy's FineReader Software Developer Kit (SDK) is capable of this.
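To illustrate how the per-character information described above can be consumed, the following sketch parses an Abbyy-style XML fragment and collects characters that are either marked suspicious or fall below a confidence threshold. The attribute names (l, t, r, b, charConfidence, suspicious) are taken from the text; the exact element nesting and the sample values are our own approximation, not verbatim engine output.

```python
# Sketch: pulling per-character confidence data out of Abbyy-style XML.
# The fragment below is a hand-made approximation of the charParams
# output described above, not real Recognition Server output.
import xml.etree.ElementTree as ET

SAMPLE = """
<par align="center">
  <line>
    <charParams l="10" t="5" r="20" b="25" wordStart="true"
                wordFromDictionary="false" charConfidence="87">A</charParams>
    <charParams l="21" t="5" r="30" b="25" wordStart="false"
                wordFromDictionary="false" charConfidence="95">k</charParams>
    <charParams l="31" t="5" r="40" b="25" wordStart="false"
                wordFromDictionary="false" charConfidence="40"
                suspicious="true">t</charParams>
  </line>
</par>
"""

def suspicious_chars(xml_text, threshold=60):
    """Return (character, confidence) pairs that the engine flagged as
    suspicious or that fall below the given confidence threshold."""
    root = ET.fromstring(xml_text)
    flagged = []
    for cp in root.iter("charParams"):
        conf = int(cp.get("charConfidence", "100"))
        if cp.get("suspicious") == "true" or conf < threshold:
            flagged.append((cp.text, conf))
    return flagged

print(suspicious_chars(SAMPLE))  # → [('t', 40)]
```

A post-correction system could use such a routine to restrict expensive candidate search to tokens containing at least one flagged character.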

3 Automated Post-Correction

The main emphasis of our project is on the post-correction of recognition errors. Our evaluation of a limited number of text samples yielded a word accuracy of 96.7%, which means that one word out of 30 contains an error (e. g. Regieruug instead of correct Regierung, Engl.: "government"). We aim to significantly reduce the number of misrecognized word tokens by identifying them in the OCR output and determining the originally intended word with its correct spelling.

In order to correct a maximum of errors in the corpus with the given resources, we opted for an automated approach to the post-correction. The great homogeneity of the texts concerning the layout, as well as the limited variation in orthography, are a good premise for achieving remarkable improvements. Having in mind the genre of our texts, we followed the idea of generating useful documents ready for reading and refrained from manually correcting or even re-keying the complete texts, as did e. g. the project Deutsches Textarchiv3, which has a very literary, almost artistic standard in digitizing first-edition books from 1780 to 1900.

The task resembles that of a spell-checker as found in modern text processing applications, with two major differences. First, the scale of our text data – with approximately 11 million words we expect more than 300,000 erroneous word forms – does not allow for an interactive approach, asking a human user for feedback on every occurrence of a suspicious word form. Therefore we need an automatic system that can account for corrections with high reliability. Second, dating from the late 19th century, the texts show historic orthography, which differs in many ways from the spelling encountered in modern dictionaries (e. g. historic Mittheilung vs. modern Mitteilung, Engl.: "message"). This means that using modern dictionary resources directly cannot lead to satisfactory results.

Additionally, the governmental resolutions typically contain many toponymical references, i. e. geographical terms such as Steinach (a stream or place name) or Schwarzenburg (a place name). Many of these are not covered by general-vocabulary dictionaries. We also find regional peculiarities, such as pronunciation variants or words that are not common elsewhere in the German-speaking area, e. g. Hülfstrupp vs. standard German Hilfstruppe, Engl.: "rescue team". Of course there is also a great amount of genre-specific vocabulary, i. e. administrative and legal jargon. We are hence challenged to build a fully-automated correction system with high precision that is aware of historic spelling and regional variants and covers geographical and governmental language.

2 The term is not to be confused with the linguistic epoch Old High German, which refers to texts from the first millennium A. D. Abbyy's "Old German" seems to cover some orthographical variation from the 19th, maybe the 18th century.

3.1 Classification

The core task of our correction system with respect to its precision is the classification routine that determines the correctness of every word. This part is evidently the basis for all correcting steps.

3 See www.deutschestextarchiv.de

Misclassifications are detrimental, as they can lead to false corrections. At the same time it is also the hardest task, as there are no formal criteria that universally describe orthography. Reynaert (2008) suggests building a corpus-derived lexicon that is mainly based on word frequencies and the distribution of similarly spelled words. However, this is sometimes misleading. For example, saumselig (Engl.: "dilatory") is a rare but correct word, which is found only once in the whole corpus, whereas Liegenschast is not correct (in fact, it is misrecognized for Liegenschaft, Engl.: "real estate"), but it is found sixteen times in this erroneous form. This shows that the frequency of word forms alone is not a satisfactory indicator to distinguish correct words from OCR errors; other information has to be used as well.

An interesting technique for reducing OCR errors is comparing the output of two or more different OCR systems. The fact that different systems make different errors is used to localize and correct OCR errors. For example, Volk et al. (2010) achieved improvements in OCR accuracy by combining the output of Abbyy FineReader 7 and Omnipage 17. However, this approach is not applicable to our project for the simple reason that there is, to our knowledge, for the time being only one state-of-the-art OCR system that recognizes mixed antiqua / Gothic script text.

The complex problem of correcting historical documents with erroneous tokens is addressed by Reffle (2011) with a two-channel model, which integrates spelling variation and error correction into one algorithm. To cover the unpredictable number of spelling variants in texts with historical orthography (or rather, depending on their age, texts without orthography), the use of static dictionaries is not reasonable, as they would have to be of enormous size. Luckily, the orthography of our corpus is comparatively modern and homogeneous, which means that there is a manageable number of deviations in comparison to modern spelling and between different parts of the corpus.

In our approach, we use a combination of various resources for this task, such as a large dictionary system for German that covers morphological variation and compounding (such as ging, a past form of gehen, Engl.: "to go", or German Niederdruckdampfsystem, a compound of four segments meaning "low-pressure steam system"), a list of local toponyms (since our texts deal mostly with events in the administered area) and the recognition confidence values of the OCR engine. Every word is judged as either correct or erroneous according to the information we gather from the various resources.

The historic and regional spelling deviations are modelled with a set of handwritten rules describing the regular differences. For example, with a rule stating that the sequence th corresponds to t in modern spelling, the old form Mittheilung can be turned into the standard form Mitteilung. While the former word is not found in the dictionary, the derived one is, which allows for the assumption that Mittheilung is a correctly spelled word.

The classification method is applied to single words, which are thus considered as correct or erroneous forms separately and without their original context. This reduces the number of correction candidates significantly, as not every word token has to be handled, but only the set of distinct word forms. However, this limits the correction to non-word errors, i. e. misrecognized tokens that cannot be interpreted as a proper word. For example, Fcuer is clearly a nonsense word, which should be spelled Feuer (Engl.: "fire"). In contrast, the two word forms Absatze and Absätze (Engl.: "paragraph") are two correct morphological variants of the same word, which are often confused during OCR. To decide which one of the two variants is correct in a particular occurrence, the word has to be judged in its original context, which is outside the scope of our project.

The decision on the correctness of a word is primarily based on the output of the German morphology system Gertwol by Lingsoft, in that we check every word for its analyzability. Although the analyses of Gertwol are reliable in most cases, there are two kinds of exceptions that can lead to false classifications. First, there are correct German words that are unknown to Gertwol, such as Latin words like recte (Engl.: "right") or proper names. Second, sometimes Gertwol finds analyses for misrecognized words; e. g. correct Regierungsrat (Engl.: "member of the government") is often rendered Negierungsrat, which is then analysed as a compound of Negierung (Engl.: "negation") and Rat (Engl.: "councillor"). To avoid these misclassifications we apply heuristics which include word lists for known frequent issues, the frequency of the word form and the feedback values of the OCR system concerning the recognition confidence.
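The core of the classification idea can be sketched as follows: a word counts as correct if it is found in the lexicon directly, or if applying a historic-spelling rule (here only th → t) yields a lexicon word. The tiny modern lexicon stands in for the full Gertwol morphology system and the heuristics described above, which we cannot reproduce here.

```python
# Minimal classification sketch: lexicon lookup plus one handwritten
# variation rule, as described in the text. Toy lexicon, not Gertwol.
MODERN_LEXICON = {"Mitteilung", "Urteil", "Feuer", "Regierung"}

VARIATION_RULES = [("th", "t")]  # historic "th" corresponds to modern "t"

def is_correct(word):
    """Judge a single word form as correct (True) or erroneous (False),
    without looking at its context."""
    if word in MODERN_LEXICON:
        return True
    for old, new in VARIATION_RULES:
        # Derive the modernized form and look that up instead.
        if old in word and word.replace(old, new) in MODERN_LEXICON:
            return True
    return False

print(is_correct("Mittheilung"))  # historic spelling, accepted → True
print(is_correct("Fcuer"))        # OCR garbage, rejected → False
```

As the text notes, real-word errors such as Absatze vs. Absätze slip through such a context-free check by design.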

3.2 Correction

The set of all words recognized as correct, plus their frequency, can now serve as a lexicon for correcting erroneous word tokens. This corpus-derived lexicon is by nature highly genre-specific, which is desirable. On the other hand, rare words might remain uncorrected if all of their few occurrences happen to be misspelled, in which case there will be no correction candidate in the lexicon. Due to the repetitive character of the texts – administrative work involves many recurrent issues – there is also a lot of repetition in the vocabulary across the corpus. This increases the chance that a word misrecognized in one place has a correct occurrence in another.

The procedure of finding correction candidates is algorithmically complex, since the orthographical similarity of words is not easily described in a formal way. There is no linear ordering that would group together similar words. In the simplest approach to searching for similar words in a large list, every word has to be compared to every other word, which has quadratic complexity with regard to the number of words. In our system we avoid this with two different approaches that are applied in sequence. In the first step, only a limited set of character substitutions is performed, whereas the second step allows for a broader range of deviations between an erroneous form and its correction.

Figure 3: OCR errors ranked by their frequency, from a sample text of approximately 50,000 characters.

3.2.1 Substitution of similar characters

In this approach we attempt to undo the most common OCR errors. When working with automatically recognized text, one will inevitably notice that a small set of errors accounts for the majority of misrecognized characters. At character level, OCR errors can be described as substitutions of single characters. Thus, by ranking these error types by their frequency, it can be shown that already a small sample of OCR errors resembles a Zipfian distribution,4 as is demonstrated in figure 3. For example, the 20 most frequent types of errors, which is less than 14% of the 143 types encountered, sum up to 50% of all occurrences of character errors. Table 1 shows the head of this error list as well as its tail. Among the top 6 we find pairs like u and n that are also often confounded when recognizing antiqua text, as well as typical Gothic-script confusions like d–b or s–f. Figure 4 illustrates the optical similarity of certain characters that can also challenge the human eye. For example, due to its improper printing, the fifth letter cannot be determined clearly as u or n.

rank   frequency   {correct}–{recognized}
  1    49          {u}–{n}
  2    38          {n}–{u}
  3    14          {a}–{ä}
  4    14          {}–{.}
  5    13          {d}–{b}
  6    11          {s}–{f}
  …     …          …
140     1          {o}–{p}
141     1          {r}–{?}
142     1          {§}–{8}
143     1          {ä}–{ö}

Table 1: Either end of a ranked list, showing character errors and their frequency.

Figure 4: Gothic characters of the Schwabacher Fraktur that are frequently confused by the OCR system. From left to right, the characters are: s (in its long form), f, u, n, u or n, B, V, R, N.

For our correction system we use a manually edited substitution table that is based on these observations. The substitution pairs are inverted and used as replacement operations that are applied to the misspelled word forms. The spelling variants produced are then searched for correct words that can be found in our dictionary. For example, given an erroneous token sprechcn and a substitution pair e, c stating that recognized c can represent actual e, the system produces the variants spreehcn, sprechen and spreehen. Consulting the dictionary finally uncovers sprechen (Engl.: "to speak") as a correct word, suggesting it as a correction for garbled sprechcn.

4 Zipf's Law states that, given a large corpus of natural language data, the rank of each word is inversely proportional to its frequency.
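The substitution step can be sketched as follows. For brevity the sketch applies one substitution per variant (the paper's example also combines substitutions); the substitution pairs and the corpus-derived lexicon are toy data.

```python
# Sketch of the first correction step: apply inverted substitution pairs
# from the character-error table and keep variants found in the lexicon.
LEXICON = {"sprechen", "Regierung", "Feuer"}  # toy corpus-derived lexicon

# (actual, recognized): recognized "c" may stand for actual "e", etc.
SUBSTITUTIONS = [("e", "c"), ("u", "n"), ("n", "u")]

def variants(token):
    """Generate spellings obtained by replacing a single occurrence of a
    commonly misrecognized character with its likely original."""
    seen = {token}
    for actual, recognized in SUBSTITUTIONS:
        for i, ch in enumerate(token):
            if ch == recognized:
                seen.add(token[:i] + actual + token[i + 1:])
    return seen

def correct(token):
    """Return lexicon words reachable via the substitution table."""
    return sorted(v for v in variants(token) if v in LEXICON)

print(correct("sprechcn"))  # → ['sprechen']
```

Because the table only contains frequent confusions, rarer errors such as {e}–{r} stay invisible to this routine, which is why a second, broader step follows.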

3.2.2 Searching shared trigrams

In a second step, we look for similar words more generally. This is applied to the erroneous words that could not be corrected by the method described above. There are various techniques to efficiently find similar words in large mounds of data, such as Reynaert's (2005) anagram hashing method, which treats words as multisets of characters and models edit operations as arithmetic functions. Mihov and Schulz (2004) combine finite-state automata with filtering methods that include a-tergo dictionaries.

In our approach, we use character n-grams to find similar correct words for each of the erroneous words. Since many of the corrections with a close editing distance have already been found in the previous step, we have to look for corrections in a wider spectrum now. In order to reduce the search space, we create an index of trigrams that points to the lexicon entries. Every word in the lexicon is split into overlapping sequences of three characters, which are then stored in a hashed data structure. For example, from ersten (Engl.: "first") we get the four trigrams ers, rst, ste and ten. Each trigram is used as a key that points to a list of all words containing this three-letter sequence, so ten not only leads to ersten, but also to warten (Engl.: "wait") and many other words with the substring ten.

Likewise, all trigrams are derived from each erroneous word form. All lexicon entries that share at least one trigram with the word in question can thus be accessed via the trigram index. The lexicon entries found are then ranked by the number of trigrams shared with the error word. To stay with the previous example, the misrecognized word form rrsten5 has three trigrams (rst, ste, ten) in common with ersten, but only one trigram (ten) is shared with warten.

The correction candidates found with this method are further challenged, e. g. by using the Levenshtein distance as a similarity measure and by rejecting corrections that affect the ending of a word.6

5 The correction pair ersten–rrsten is not found in the first step although its editing distance is 1 and thus minimal. But since the error {e}–{r} has a low frequency, it is not contained in the substitution table, which makes this correction invisible to the common-error routine.

3.3 Evaluation

In order to measure the effect of the correction system, we manually corrected a random sample of 100 resolutions (approximately 35,000 word tokens). Using the ISRI OCR-Evaluation Frontiers Toolkit (Rice, 1996), we determined the recognition accuracy before and after the correction step. As can be seen in table 2, the text quality in terms of word accuracy could be improved from 96.72% to 98.36%, which means that the total number of misrecognized words was reduced by half.
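The word-accuracy figure can be illustrated with a simplified sketch: accuracy as the fraction of ground-truth tokens that the OCR output reproduces exactly. The real ISRI toolkit performs a proper alignment of the two texts; the token-by-token comparison below assumes identical tokenization and is only meant to show what the percentage measures.

```python
# Hedged sketch of word accuracy: fraction of exactly matching tokens.
# Assumes ground truth and OCR output tokenize identically; the ISRI
# toolkit used in the paper aligns the texts properly instead.
def word_accuracy(ground_truth, ocr_output):
    truth_tokens = ground_truth.split()
    ocr_tokens = ocr_output.split()
    matches = sum(g == o for g, o in zip(truth_tokens, ocr_tokens))
    return matches / len(truth_tokens)

gt  = "die Regierung hat das Urteil gesprochen"
ocr = "die Regieruug hat das Urteil gesprochen"
print(f"{word_accuracy(gt, ocr):.2%}")  # 5 of 6 tokens correct → 83.33%
```

At the corpus scale reported above, a word accuracy of 96.72% over roughly 35,000 tokens corresponds to the 1,165 misrecognized words listed in table 2.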

4 Crowd Correction

Inspired by the Australian Newspaper Digitisation Program (Holley, 2009) and their online crowd-correction system, we are setting up an online application. We are working on a tool that allows for browsing through the corpus and reading the text in a synoptical view of both the recognized plain text and a scan image of the original printed page. Our online readers will have the possibility to immediately correct errors in the recognized text by clicking a word token for editing. The original word in the scan image is highlighted for the convenience of the user, which permits fast comparison of image and text, so the intended spelling of a word can be verified quickly.

Crowd correction combines easy access to the historic documents with manual correcting. Similar to the concept of publicly maintained services such as Wikipedia and its followers, information is provided free of charge. At the same time, the user has the possibility to give something back by improving the quality of the corpus. It is important to keep the technical barrier low, so that new users can start correcting straight away without needing to learn a new formalism. With our click-and-type tool the access is easy and intuitive. In consequence, however, the scope of the editing operations is limited to word level; it does not, for example, allow for rewriting whole passages.

                     plain (errors)   corrected (errors)   improvement
character accuracy   99.09% (2428)    99.39% (1622)        33.2%
word accuracy        96.72% (1165)    98.36% (584)         49.9%

Table 2: Evaluation results of the automated correction system.

6 Editing operations affecting a word's ending are a delicate issue in an inflecting language like German, since it is likely that we are dealing with morphological variation instead of an OCR error. Data sparseness of rare words can lead to unfortunate constellations. For example, the adjective form innere (Engl.: "inner") is present in the lexicon, while the inflectional variant innern is lacking. However, the latter form is indeed present in the corpus, but only in the misrecognized form iunern. As innere is the only available correction candidate in this situation, the option would be changing misspelled iunern into the grammatically inapt innere. For the sake of legibility, an unorthographical word is preferred to an ungrammatical form.

5 Conclusion

With this project we demonstrate how to build a highly specialized correction system for a specific text collection. We are using both general and specific resources, while the approach as a whole is widely generalizable. Of course, working with older texts with less standardized orthography than late 19th-century governmental resolutions may lead to a lower improvement, but the techniques we applied are not bound to a certain epoch. In the crowd-correction approach we see the possibility to further improve OCR output in combination with online access to the historic texts.

References

Rose Holley. 2009. How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs. D-Lib Magazine, 15(3/4).

Stoyan Mihov and Klaus U. Schulz. 2004. Fast Approximate Search in Large Dictionaries. Computational Linguistics, 30(4):451–477.

Ulrich Reffle. 2011. Efficiently generating correction suggestions for garbled tokens of historical language. Natural Language Engineering, 17(2):265–282.

Martin Reynaert. 2005. Text-Induced Spelling Correction. Ph.D. thesis, Tilburg University, Tilburg, NL.

Martin Reynaert. 2008. Non-Interactive OCR Post-Correction for Giga-Scale Digitization Projects. In Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2008), Berlin, 617–630.

Stephen V. Rice. 1996. Measuring the Accuracy of Page-Reading Systems. Ph.D. dissertation, University of Nevada, Las Vegas, NV.

Martin Volk, Torsten Marek, and Rico Sennrich. 2010. Reducing OCR Errors by Combining Two OCR Systems. In ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2010), Lisbon, Portugal, 16 August 2010, 61–65.