Automatic Article Extraction in Old Newspapers Digitized Collections

David Hebert, Thomas Palfray, Stephane Nicolas, Pierrick Tranouez, Thierry Paquet
LITIS, University of Rouen, UFR des Sciences et Techniques, F-76800 Saint Etienne du Rouvray
david.hebert@univ-rouen.fr, thomas.palfray@univ-rouen.fr, stephane.nicolas@univ-rouen.fr, pierrick.tranouez@univ-rouen.fr, thierry.paquet@univ-rouen.fr

ABSTRACT
We present a complete method for article segmentation in old newspapers, which deals with complex layout analysis of degraded documents. The designed workflow can process large amounts of documents and generates digital objects in METS/ALTO format in order to facilitate the indexing and browsing of information in digital libraries. The analysis of the document image is performed in two stages. In the first stage, pixels are labelled with a Conditional Random Field model so as to label the areas of interest at a low logical level. This first logical representation of the document content is then analysed in a second stage to obtain a higher logical representation, including article segmentation and reading order. This top-level structural analysis relies on the generation of an article separation grid applied recursively to the document image, allowing the analysis of any type of Manhattan page layout, even complex structures with multiple columns and overlapping entities. This method, which benefits from both a local analysis using a probabilistic model trained with machine learning procedures and a more global structural analysis using recursive rules, is evaluated on a dataset of daily local press document images covering several time periods and different page layouts, to prove its effectiveness.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

Keywords
page layout analysis, information extraction from document images, logical structure, article extraction in newspapers, document image labelling, conditional random field, structural analysis

1. INTRODUCTION
During the last twenty years, archives and national libraries worldwide have launched digitization programs of their historical collections in order to preserve them and, at the same time, provide remote access to a wider range of potential users thanks to the internet. Old newspaper archives are emblematic of this trend. However, digitization programs of such collections require indexing facilities for the document images obtained after scanning. Indeed, considering the large number of documents these collections contain, and the large number of articles each document can contain, smart indexing and retrieval are required. This is only conceivable through textual queries by users. This is why textual transcriptions of the digitized collections are necessary, as well as a deep understanding of the document structure, so as to extract each article of the collection. Standard OCR technologies cannot fully automate the whole process of automatic transcription generation and page layout understanding of old newspapers, and a specific digitization process dedicated to complex page layout analysis, character recognition (OCR), logical structure detection and reading order detection is required.

Physical structure extraction is the process that extracts the physical structure of the document, such as columns and lines of text, prior to the OCR process. Logical structure analysis is the process that gives access to the information units of the document and its organization. It gives access to descriptors (logical tags, also known as metadata) such as Title, Sub-Title, Chapter, Article, Paragraph, Captions, etc.


Generally, these two processes of document structure extraction (known as physical and logical layout analysis) operate separately and sequentially, one after the other. This is justified by the fact that, most of the time, physical segmentation of document images can be performed without the need for any additional knowledge, whereas logical layout extraction is generally performed thanks to the use of a document model (e.g. a style sheet) that expresses the relations between the physical and logical entities of the two representations of the document. However, it is now well known that difficult segmentation tasks must incorporate a recognition stage so as to improve their performance. Therefore, we have developed a new methodology dedicated to the logical labeling of old newspaper images. This method is intended to extract metadata from the digitized images, thanks to the joint use of a pixel-sequence classification method based on Conditional Random Field modelling, associated with a set of rules defining the concept of an article within a newspaper. Based on physical descriptors, the pixel labeling directly performs a logical analysis that gives us a first, low logical level of segmentation. Then, based on the detected logical entities, the set of rules brings a higher logical level to detect articles in a newspaper.

In the first part of this paper we review the related work in the literature. Then, the second part describes our method. The third part is dedicated to the evaluation of the approach, which was tested on 19th century issues of the "Journal de Rouen", a regional French newspaper. Finally, we conclude with a discussion of the potential of the method and future work.

2. RELATED WORK
Since 2001, the ICDAR conference has organized a document page segmentation competition [2] in which some of the proposed algorithms have goals similar to the system we propose in this paper. Nevertheless, the document dataset used for this competition contains only modern documents; therefore the proposed methods may be inefficient for older newspapers, which were printed with less efficient technology and were furthermore already altered at the time of their digitization. Among the contributions, we can mention the work described in [1], which is based on a pixel labeling stage that may be adaptable to old documents. In [3] an approach based on the detection of maximal empty rectangles to delimit columns and text blocks is described. This method is integrated in the open-source OCR system OCRopus. Although interesting, this method does not cope with the difficulties inherent to old documents (skewing, deformations, ...). A more interesting method taking these difficulties into account is described in [11]. The authors propose a multiscale approach to extract text blocks in old newspapers. This efficient method is limited to the detection of text blocks: neither the logical structure nor the reading order is provided. The approach we propose in this paper tries to bring a solution to this problem.

3. PROPOSED APPROACH
The method we present in this paper has been implemented as a complete system dedicated to processing a large amount of old newspaper images. It automatically analyzes the logical organization of pages so as to extract articles. This extraction process has to perform both a physical and a logical layout analysis. As opposed to the traditional approaches, which separate physical and logical analysis, we have designed a methodology that performs logical analysis both at the pixel level and at the text block level. First, we proceed to a logical labeling of the image at pixel level, which gives a low-level representation of the logical structure to be extracted. This image processing stage embeds some knowledge about the physical organization of the logical information to be extracted. It is based on a machine learning approach (Conditional Random Fields). The labelled image obtained is further analyzed during a second stage that recursively builds the logical entities of the document by referring to a generic layout model. This reconstruction stage can be viewed as a specific bidimensional parser that provides the parse tree of the document image. Finally, the system produces XML METS/ALTO output files. These files contain the logical structure describing the reading order of the articles, their physical layout composed of the detected text lines, and the associated characters recognized by an OCR. The following two sections provide an in-depth description of the two main steps of the methodology.

3.1 Logical labeling at pixel level
The proposed method for article extraction in newspaper document images relies on a first segmentation stage using a Conditional Random Field (CRF) model with multiscale quantization feature functions. This approach has been presented in detail in [8]; we recall only its most important steps here. Conditional Random Fields, introduced in 2001 by Lafferty et al. [9], have opened a new way for sequence and image analysis. In its original formulation, a CRF is a stochastic model of a process that accounts for the dependencies between a sequence of discrete observations (originally a word sequence) and a sequence of labels that can be associated with these observations (originally Part-Of-Speech tags). The application of CRFs to image labeling requires some adaptation so as to deal with numerical values instead of discrete observations. This adaptation can be viewed as a preprocessing step dedicated to providing the CRF with discrete observations extracted from the raw numerical pixel values of the image. In the field of computer vision, [7, 10] use the outputs of a neural network or an SVM to feed a CRF. CRFs have also been applied to document structure extraction: [12] uses a 2D-CRF based approach on top of a first neural network classification stage. Another example can be found in [4], where the CRF model is introduced at a second stage, after a pixel classification stage. Training these systems is particularly difficult because they require training two subsystems: a local classifier, which then feeds a CRF. In the proposed system, we use a CRF with multiscale quantization feature functions [8]. Such an approach requires training only one CRF. Physical descriptors of the image, made of run-length features, are quantized using several quantization functions and fed to the CRF. Let us define a linear quantization function with quantizer q that quantizes the continuous observation o as follows:

Q(o, q) : O → X,  o ↦ x = round(o / q)

Assuming o ranges within the interval [o_min, o_max], the quantization function can take only N_d discrete values, with N_d = (o_max − o_min) / q.
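As a minimal illustration (our own sketch, not the authors' code), the linear quantization function can be written directly from its definition:

```python
def quantize(o: float, q: float) -> int:
    """Linear quantization: map a continuous observation o to a discrete bin
    via x = round(o / q)."""
    return round(o / q)

# With o restricted to [0, 100] and q = 10, only
# N_d = (100 - 0) / 10 = 10 distinct bin widths are available.
```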

To avoid choosing a single value of q, we use multiple quantization functions. Let q_1, q_2, ..., q_N be a set of quantizers, each defining a quantization function Q_i(o) = Q(o, q_i). By choosing a dyadic law of quantizers, q_i = 2·q_{i−1} = q_1·2^{i−1}, we build a multiscale quantization scheme with the ability to keep most of the original information contained in the continuous features, without any assumption about the distribution of these features.

Figure 1: The CRF logical labelling at pixel level (b) for the image (a)
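The dyadic family of quantizers and its application to one observation can be sketched as follows (an illustrative sketch; the function names are ours):

```python
def dyadic_quantizers(q1: float, n: int) -> list:
    """Build N quantizers following the dyadic law q_i = q_1 * 2**(i-1)."""
    return [q1 * 2 ** (i - 1) for i in range(1, n + 1)]

def multiscale_quantize(o: float, quantizers: list) -> list:
    """Apply every quantization function Q_i(o) = round(o / q_i)
    to a single continuous observation o."""
    return [round(o / q) for q in quantizers]
```

With q_1 = 2 and 10 scales (the settings used later in the paper), the quantizers are 2, 4, 8, ..., 1024.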

With these multiscale quantizations, the CRF model can be written as in equation (1):

p(Y|X) = (1/Z(X)) · ∏_{t=1..T} exp( ∑_{k=1..K} λ_k f_k(y_{t−1}, y_t, Q_1(X), ..., Q_N(X), t) )    (1)

In the old newspaper article extraction workflow, we use a CRF with multiscale quantized feature functions on a set of physical descriptors made of vertical and horizontal run lengths, as described in [8]. We choose a first quantization step q_1 = 2 with the dyadic law q_i = 2·q_{i−1}, giving 10 quantization scales on our images. One strength of a CRF model is its ability to deal with contextual information in the observation space. Concretely, a decision at position t is made by taking into account not only the observation at this position but also the neighboring observations in a defined vicinity. The logical labeling uses a sequential CRF along the horizontal direction, with a context of the two previous and the two next observations around the current pixel to label. This system provides a fine segmentation of the image at pixel level, where each pixel is associated with a logical label specifying the logical function of the entity the pixel belongs to. The logical labels are the following:

• Vertical separator

• Horizontal separator

• Titles (composed of "title characters" and "title inter-words")

• Text lines (composed of "characters", "inter-characters" and "inter-words")

• Noise

• Background
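The run-length descriptors fed to the CRF are detailed in [8]; a plausible sketch of horizontal run-length extraction on one binarized image row, under our own simplifying assumptions (0 = background, 1 = ink), is:

```python
def horizontal_run_lengths(row):
    """For each pixel of a binarized row, return the length of the run of
    identical values the pixel belongs to (a simple run-length descriptor)."""
    runs = []
    i = 0
    while i < len(row):
        j = i
        while j < len(row) and row[j] == row[i]:
            j += 1
        # every pixel in the run is described by the run's length
        runs.extend([j - i] * (j - i))
        i = j
    return runs
```

The vertical descriptor would be obtained the same way along columns; both are then quantized at the 10 dyadic scales before being given to the CRF.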

An example of the labeling result is given in Figure 1. This labeling stage provides a precise description of the logical function of each pixel, but it does not provide the logical structure of the document as a whole. This is the goal of the second labeling stage, which we present in the following section.

3.2 Logical structure extraction
Logical structure extraction from document images aims at producing the parse tree of the document, where each node of the tree accounts for a particular entity of its logical organization, e.g. title, sub-title, paragraph, figure, caption, etc. In addition, the order in which these entities are organized, namely the logical order, is associated with the tree. Logical structure extraction is the process that takes as input a low-level representation of the document at pixel level and provides a high-level representation. Therefore, logical structure extraction can be viewed as a particular bi-dimensional parser. As already mentioned, this stage generally takes place after a first physical segmentation of the document. In the method we propose here, we exploit both the physical properties of the image entities and the labeling produced by the CRF. The approach is composed of two main steps: 1) the detection of labeled atomic entities in the image; 2) the reconstruction of higher-level entities thanks to a generic document model. We now detail these two steps.

3.2.1 Labelled atomic entities detection
The atomic entities on which the reconstruction process takes place are text lines, titles, and horizontal and vertical separators. These entities are detected in the image thanks to the CRF labelling stage. First, titles and text lines are detected by applying the following label merging rules:

text line = characters + inter-characters + inter-words

title = title characters + title inter-words

Despite the very good performance of the CRF labelling stage, some pixel labelling inconsistencies need to be detected and corrected before the reconstruction process can take place. Most of the inconsistencies occur when entities with different labels are connected together (for example, text entities and title entities being connected together). Such erroneous cases are corrected by relabeling the entities involved with the most frequent label (background pixels are not considered). The text lines are obtained by extracting the connected components labeled "text" in the resulting image. Despite the robustness of the extraction process, some text lines may be connected because of important deformations in the image due to degradations or digitization artefacts. Possibly connected text lines are detected by computing the average surface of the text lines in the whole document image: text entities with a surface much higher than the mean surface are considered erroneous. These situations are then corrected by a specific algorithm, which separates them. The results of this detection and labeling process are shown in Figure 2. Structural entities such as text lines, titles, and vertical and horizontal separators can then be extracted from the image.
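The majority-label correction can be sketched as follows (our own illustrative code; a connected component is assumed to be given as its list of pixel labels):

```python
from collections import Counter

def relabel_component(pixel_labels):
    """Resolve a mixed-label connected component by relabeling its pixels
    with the most frequent non-background label."""
    counts = Counter(l for l in pixel_labels if l != "background")
    if not counts:
        return pixel_labels  # nothing but background: leave untouched
    majority = counts.most_common(1)[0][0]
    return [majority if l != "background" else l for l in pixel_labels]
```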

3.2.2 Article extraction using a layout model

Figure 2: Labelled entities detection

Figure 3: (a): A section is the set of blocks of the same color. (b): The unambiguous reading order inside sections

The atomic entities of interest detected in the image match one of the following labels: title, text, horizontal separator, and vertical separator. The page layout of a newspaper follows precise layout rules: a layout model. A page layout model is made of a precise description of the allowed spatial physical organization of articles, as well as a description of the organization of an article. The model implemented in this study copes with complex multi-section and multi-column page layouts. A page is a section. A section can be composed of many sections that are organized sequentially, one below the other, or hierarchically, one inside the other. They are separated by a large horizontal separator that spans the whole width of the section. Each section can contain multiple columns, with a variable number of columns between sections. Columns are separated by vertical separators that span the whole height of the section. A section contains a sequence of articles (Figure 3(a)). An article begins with a "title" entity followed by at least one "text" entity, and ends with a "horizontal separator" entity or another "title" entity. The detection of the spatial organization of pages into sections appears to be the key issue that further enables the detection of articles. Indeed, articles are organized sequentially within each section and are separated by titles and/or horizontal separators.
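The article definition above (a title, then at least one text block, terminated by a horizontal separator or the next title) behaves like a small grammar over the ordered entity sequence. A hedged sketch, with label strings of our own choosing:

```python
def split_into_articles(entities):
    """Group an ordered sequence of entity labels into articles.
    An article starts at a "title" and collects the following "text"
    entities until a "horizontal_separator" or the next "title".
    Returns lists of entity indices, one list per article."""
    articles, current = [], None
    for i, label in enumerate(entities):
        if label == "title":
            if current:               # previous article ends at the new title
                articles.append(current)
            current = [i]
        elif label == "text" and current is not None:
            current.append(i)
        elif label == "horizontal_separator" and current:
            articles.append(current)  # separator closes the article
            current = None
    if current:
        articles.append(current)
    return articles
```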

Figure 4: The grid of cells containing at least one text line

Article extraction is therefore implemented in two main steps. First, the physical grid constituted by the vertical and horizontal separators is detected, and text blocks are assigned to their surrounding cells; section delimiters are also detected at the end of this process. Second, articles are easily detected, as they are made of the text blocks delimited between successive titles and horizontal separators, following the reading order of the section.

Detection of the separator grid and text blocks.
The first step of our text block detection consists in extending all the entities that act as logical separators. These are the entities labelled "separator" and those labelled "title". Extending a horizontal (resp. vertical) separator consists in extending its width (resp. height) until it touches the left and right vertical (resp. top and bottom horizontal) separators. We apply the following steps sequentially:

• Create the vertical and horizontal separator mask

• Connect the neighboring vertical separators

• Extend vertical separators as long as they do not cross a horizontal separator or a title

• Connect the neighboring horizontal separators

• Extend horizontal separators and titles as long as they do not cross a vertical separator
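The vertical-extension step above can be sketched on a small character grid (our own toy representation: '|' vertical separator, '-' horizontal separator, 'T' title, '.' anything else), not the production code:

```python
def extend_vertical(grid):
    """Grow every vertical separator ('|') upward and downward in its column
    until it would cross a horizontal separator ('-') or a title ('T')."""
    rows = len(grid)
    out = [row[:] for row in grid]
    for c in range(len(grid[0])):
        for r in range(rows):
            if grid[r][c] == "|":
                u = r - 1                 # grow upward
                while u >= 0 and out[u][c] not in ("-", "T"):
                    out[u][c] = "|"
                    u -= 1
                d = r + 1                 # grow downward
                while d < rows and out[d][c] not in ("-", "T"):
                    out[d][c] = "|"
                    d += 1
    return out
```

Horizontal extension is symmetric, with titles extended alongside the horizontal separators.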

At the end of this process, a grid covering the entire image is obtained, which serves to extract the articles. For that purpose, each grid cell is associated with the text lines it surrounds. The "title" entities are also associated with their surrounding grid cells. Cells containing no text line are removed. Finally, we obtain a list of text blocks made of the remaining grid cells (Figure 4).

Reading order detection.
Section detection is based on the detection of the horizontal section delimiters, i.e. the horizontal delimiters that span multiple columns. Text blocks are grouped within the section they belong to, and they are organized sequentially following a top-down, left-to-right reading order. By definition, a section follows this unambiguous reading order (see Figure 3(b)). Within each section, following the reading order, articles are made of the successive text blocks delimited between title entities and horizontal separators (Figure 5).

Figure 5: The final article segmentation

Table 1: Results of the whole process of logical segmentation into articles

#articles  #detected  #correct  %correct  %over-seg
   226        245        194      85.84      8.41
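The top-down, left-to-right ordering inside a section can be sketched as a simple sort (block coordinates and the column assignment are assumed already known):

```python
def reading_order(blocks):
    """Order the text blocks of one section: by column (left to right),
    then top to bottom within each column.
    Each block is a tuple (column_index, top_y, block_id)."""
    return [b[2] for b in sorted(blocks, key=lambda b: (b[0], b[1]))]
```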

4. RESULTS

4.1 Quantitative evaluation
This method was tested on a dataset containing 42 document images from early 19th century issues of a French regional newspaper called "Journal de Rouen". The results were checked manually by visual inspection. We determined the article detection rate and the article over-segmentation rate. These results are given in Table 1. The analysis of the errors produced shows that a great number of them are due to labeling errors produced by the CRF segmentation stage. For example, 14 text lines were erroneously labeled "title". This leads to the detection of 14 extra articles when applying the editorial rules, i.e. 28 articles are detected instead of 14. These errors produce over-segmentation cases most of the time.
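As a sanity check on Table 1, the two reported rates can be reproduced from the raw counts (the over-segmentation count of 19 is inferred by us from the reported 8.41%, so it is an assumption, not a figure from the paper):

```python
n_articles, n_correct, n_oversegmented = 226, 194, 19  # 19 is inferred

correct_rate = 100 * n_correct / n_articles        # matches the 85.84% of Table 1
overseg_rate = 100 * n_oversegmented / n_articles  # matches the 8.41% of Table 1
```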

4.2 Mass evaluation of the method
The proposed method has been used extensively during the digitization of the "Journal de Rouen". 21,550 documents, made of 4 pages on average, have been processed by the workflow described in this paper. These documents are newspapers published between 1767 and 1843. The layout is not constant and evolves over this large period of time. The simplest one is given by the running example of this paper (Figure 1(a)). Some examples of more complex layouts are shown in Figures 6 and 7. The results cannot be quantified but seem to follow the 85% found in the quantitative evaluation. Some regions of documents are more difficult to analyse, such as documents with tables or documents without any vertical separator between columns. The errors induced by missing separators could be avoided by extending the separator definition: at this time, a separator is a physical entity, but considering large white spaces as separators could solve these issues.

Figure 6: Some examples of layouts available in the collection and their associated article segmentation

5. CONCLUSION AND FUTURE WORK
We presented in this paper a logical segmentation method based on the analysis of low-level labeling results produced by a CRF model, using a set of rules defined by a generic layout model. The proposed method is able to segment the textual content of old newspapers with a complex Manhattan structure (multiple columns), using a small set of simple rules. With this method we obtain an article segmentation rate of 85.84% on a test dataset containing 42 images of the "Journal de Rouen", one of the oldest French regional newspapers. These first results are promising and allow us to identify two main directions for improvement. The first one consists in improving our CRF model, because we noted that the majority of the errors come from this stage. In a second step, we will further improve both the CRF model and the layout rules in order to take into account other important entities of the document structure, such as figures, pictures, captions and tables.

6. REFERENCES

[1] C. An, D. Yin, and H. S. Baird. Document segmentation using pixel-accurate ground truth. In ICPR, pages 245–248. IEEE, 2010.

[2] A. Antonacopoulos, S. Pletschacher, D. Bridson, and C. Papadopoulos. ICDAR 2009 page segmentation competition. In ICDAR, pages 1370–1374. IEEE Computer Society, 2009.

Figure 7: Some other examples of layouts available in the collection and their associated article segmentation

[3] T. M. Breuel. Two geometric algorithms for layout analysis. In Proceedings of the 5th International Workshop on Document Analysis Systems, DAS '02, pages 188–199, London, UK, 2002. Springer-Verlag.

[4] S. Chaudhury, M. Jindal, and S. Dutta Roy. Model-guided segmentation and layout labelling of document images using a hierarchical conditional random field. In Proceedings of the 3rd International Conference on Pattern Recognition and Machine Intelligence, PReMI '09, pages 375–380, Berlin, Heidelberg, 2009. Springer-Verlag.

[5] T.-M.-T. Do and T. Artieres. Conditional random fields for online handwriting recognition. In G. Lorette, editor, Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule, France, Oct. 2006. Université de Rennes 1, Suvisoft.

[6] S. Feng, R. Manmatha, and A. McCallum. Exploring the use of conditional random field models and HMMs for historical handwritten document recognition. In Proceedings of the 2nd IEEE International Conference on Document Image Analysis for Libraries (DIAL), pages 30–37, Washington, DC, USA, 2006.

[7] X. He, R. S. Zemel, and M. A. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In CVPR, pages 695–702, 2004.

[8] D. Hebert, T. Paquet, and S. Nicolas. Continuous CRF with multi-scale quantization feature functions: application to structure extraction in old newspapers. In International Conference on Document Analysis and Recognition (ICDAR), pages 493–497. IEEE, 2011.

[9] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[10] C.-H. Lee, S. Wang, A. Murtha, M. R. G. Brown, and R. Greiner. Segmenting brain tumors using pseudo-conditional random fields. In D. N. Metaxas, L. Axel, G. Fichtinger, and G. Székely, editors, MICCAI (1), volume 5241 of Lecture Notes in Computer Science, pages 359–366. Springer, 2008.

[11] A. Lemaitre, J. Camillerapp, and B. Couasnon. Approche perceptive pour la reconnaissance de filets bruités : application à la structuration de pages de journaux. In Actes du Xème Colloque International Francophone sur l'Écrit et le Document, CIFED'08, pages 61–66, France, 2008.

[12] S. Nicolas, J. Dardenne, T. Paquet, and L. Heutte. Document image segmentation using a 2D conditional random field model. In ICDAR, pages 407–411. IEEE Computer Society, 2007.