Image processing for historical newspaper archives

Takahiro SHIMA, Kengo TERASAWA and Toshio KAWASHIMA

Background

September 16, 2011 Historical Document Imaging and Processing 2

A fast full-text search system for newspaper archives is very useful

However, OCR is difficult for old newspapers

Word-spotting-based method (Terasawa et al. 2005, 2009)

Apply OCR only for segmentation

Character-segmented images are preferable when dealing with a large image database.

Even though character recognition is difficult, we expected that character segmentation could still be performed by OCR software.

However, the segmentation accuracy of OCR software as it stands is low (0%-40%).

Proposed methods

1. Column separation

2. Preprocessing that improves the step that follows (the main contribution of this study)

3. Character segmentation

Separated Columns

Character segmentation

Using an off-the-shelf OCR library: Typewritten document OCR library V7.0 (Media Drive Corporation)

[Figure: ideal result vs. actual result]

Obstacle factors

1. Ruled lines

2. Ruby characters

3. Errors in layout analysis

4. Noise

Solution 1: Ruled line removal

Apply the Run Length Smoothing Algorithm (N. Stamatopoulos et al. 2009)

Estimate the rough text line width from the sizes of connected components

Remove extremely large connected components

[Figure: source image and the same image after ruled line removal]
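The ruled line removal steps can be sketched as follows. This is a minimal pure-Python illustration and not the authors' implementation: the function names, the horizontal-only smoothing, the use of the median component extent as the "rough text line width", and the size factor of 3 are all assumptions made for the example.

```python
from collections import deque
from statistics import median

def rlsa_horizontal(img, max_gap):
    """Fill background runs of at most max_gap pixels lying between two foreground pixels."""
    out = [row[:] for row in img]
    for r, row in enumerate(img):
        last_fg = None
        for c, v in enumerate(row):
            if v:
                if last_fg is not None and 0 < c - last_fg - 1 <= max_gap:
                    for k in range(last_fg + 1, c):
                        out[r][k] = 1
                last_fg = c
    return out

def connected_components(img):
    """Return components as lists of (row, col) pixels, using 4-connectivity."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, comp = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                comps.append(comp)
    return comps

def extent(comp):
    """Longer side of the component's bounding box."""
    ys = [y for y, _ in comp]
    xs = [x for _, x in comp]
    return max(max(ys) - min(ys) + 1, max(xs) - min(xs) + 1)

def remove_large_components(img, max_gap=1, factor=3):
    """Erase components much larger than the typical one (assumed: median extent)."""
    comps = connected_components(rlsa_horizontal(img, max_gap))
    typical = median(extent(c) for c in comps)
    out = [row[:] for row in img]
    for comp in comps:
        if extent(comp) > factor * typical:
            for y, x in comp:
                out[y][x] = 0
    return out

# Toy page: a broken ruled line on top, three small "characters" below.
page = [
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
]
cleaned = remove_large_components(page)
```

The smoothing step joins the two halves of the broken line into one long component, which is why it is removed as a whole while the small character blobs survive.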

Solution 2: Ruby character removal

Connect characters vertically using an anisotropic Gaussian filter and apply the scale-space approach (R. Manmatha et al. 1996)

Thin connected regions are regarded as ruby characters and removed

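Anisotropic Gaussian filter:

$$G(x, y; \sigma_x, \sigma_y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[-\left(\frac{x^2}{2\sigma_x^2} + \frac{y^2}{2\sigma_y^2}\right)\right]$$

[Figure: anisotropic Gaussian filter output; binarized image; estimated body character regions with ruby characters removed]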
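A minimal sketch of the ruby removal idea, under stated assumptions: the anisotropic Gaussian is applied as two 1-D passes (it is separable in x and y), the sampled kernels are normalized to sum to 1 rather than using the continuous 1/(2πσxσy) factor, and "thin" is measured as the width of runs of occupied pixel columns rather than per connected region. The parameter values and function names are illustrative only and do not come from the slides.

```python
import math

def gaussian_kernel_1d(sigma):
    """Sampled 1-D Gaussian, truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(t * t) / (2 * sigma * sigma)) for t in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve_rows(img, kernel):
    """Convolve every row with the kernel; pixels outside the image count as 0."""
    radius = len(kernel) // 2
    width = len(img[0])
    return [[sum(kernel[k + radius] * row[c + k]
                 for k in range(-radius, radius + 1) if 0 <= c + k < width)
             for c in range(width)]
            for row in img]

def transpose(img):
    return [list(col) for col in zip(*img)]

def anisotropic_gaussian(img, sigma_x, sigma_y):
    """Blur along x with sigma_x, then along y with sigma_y."""
    blurred_x = convolve_rows(img, gaussian_kernel_1d(sigma_x))
    return transpose(convolve_rows(transpose(blurred_x), gaussian_kernel_1d(sigma_y)))

def thin_stripe_columns(img, sigma_x=0.5, sigma_y=2.0, threshold=0.2, min_width=2):
    """Columns belonging to vertical stripes narrower than min_width after blurring."""
    blurred = anisotropic_gaussian(img, sigma_x, sigma_y)
    occupied = [any(v > threshold for v in col) for col in transpose(blurred)]
    thin, start = [], None
    for c, on in enumerate(occupied + [False]):  # sentinel closes the last run
        if on and start is None:
            start = c
        elif not on and start is not None:
            if c - start < min_width:
                thin.extend(range(start, c))
            start = None
    return thin

# Toy vertical text line: two 4-pixel-wide body characters with 1-pixel-wide ruby beside them.
page = [[0] * 10 for _ in range(12)]
for r in (0, 1, 2, 3, 6, 7, 8, 9):
    for c in range(1, 5):
        page[r][c] = 1
for r in (1, 2, 7, 8):
    page[r][7] = 1

ruby_columns = thin_stripe_columns(page)
body_only = [[0 if c in ruby_columns else v for c, v in enumerate(row)] for row in page]
```

Choosing sigma_y much larger than sigma_x smears ink along the writing direction, so the characters of one vertical line merge into a stripe while neighbouring stripes stay separate.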

Solution 3: Text line segmentation

To improve the OCR software's layout analysis, column images are segmented into text lines

Layout analysis is more accurate on small regions
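The slides do not say how the text lines are found. One common approach, shown here only as an assumption, is a projection profile: for vertically written text, sum the ink in each pixel column and cut at the empty gaps. The function names `split_text_lines` and `crop` are hypothetical.

```python
def split_text_lines(column_img):
    """Return (left, right) pixel-column ranges, right exclusive, of vertical text lines."""
    profile = [sum(col) for col in zip(*column_img)]  # ink count per pixel column
    lines, start = [], None
    for x, ink in enumerate(profile + [0]):  # sentinel closes the last run
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            lines.append((start, x))
            start = None
    return lines

def crop(column_img, span):
    """Cut one text line out of the column image."""
    left, right = span
    return [row[left:right] for row in column_img]

# Toy column image containing two vertical text lines separated by blank pixel columns.
column_img = [
    [0, 1, 1, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
]
spans = split_text_lines(column_img)
first_line = crop(column_img, spans[0])
```

Each cropped strip could then be passed to the OCR library on its own, so that its layout analysis only has to handle a single text line.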

Solution 4: Noise removal

Noise regions are estimated from the pixel-value histogram and removed

[Figure: pixel-value histograms (0 to 255) of a noiseless image and a noisy image, with the character and noise components labeled]

[Figure: source noisy image and the same image after noise removal]
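The slide does not explain how the histogram is turned into a noise estimate. The sketch below shows one plausible reading, and it is purely an assumption: on a gray-level version of the page (0 is black, 255 is white), noise such as show-through forms a population lighter than the characters, so pixels lighter than a cutoff are repainted as background. The cutoff value and the function names are hypothetical.

```python
def histogram(gray):
    """256-bin histogram of pixel values."""
    counts = [0] * 256
    for row in gray:
        for v in row:
            counts[v] += 1
    return counts

def remove_light_noise(gray, char_max=80, background=255):
    """Keep pixels at or below char_max; repaint everything lighter as background."""
    return [[v if v <= char_max else background for v in row] for row in gray]

# Toy gray image: characters near 20, noise near 150, paper at 255.
gray = [
    [255, 20, 150, 255],
    [150, 25, 255, 150],
]
counts = histogram(gray)
cleaned = remove_light_noise(gray)
```

A cutoff that is too aggressive also erases faint character strokes, which would be consistent with the over-segmentation reported in the discussion slide.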

Character segmentation result using the proposed methods

[Figure: character segmentation without preprocessing vs. with the proposed method]

Experiment’s outline

Material: 771 pages of “Hakodate Shimbun” published in 1881

1. Column separation

Number of columns: 2379

2. Character segmentation on 20 columns

Number of characters: 17021 (excluding masthead and ruby characters)

Experimental results (column separation)

Number of columns   Detected columns   Correctly detected columns   Recall   False negatives   False positives
2379                2401               2340                         0.983    39                61

[Figures: a false negative on a two-column layout; a false detection of the masthead]
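The figures in the column separation table are related by simple arithmetic, sketched below. Precision is not named on the slide and is included here as an assumption about how the counts might be summarized. Note that 2340/2379 is about 0.9836, which matches the reported recall of 0.983 if the value was truncated, and 2340/2401 is about 0.975, which may be the source of the 97.5% figure quoted in the column separation discussion.

```python
def detection_metrics(ground_truth, detected, correct):
    """Derive error counts and ratios from the three raw counts in the table."""
    return {
        "false_negatives": ground_truth - correct,  # true columns that were missed
        "false_positives": detected - correct,      # detections that are not true columns
        "recall": correct / ground_truth,
        "precision": correct / detected,            # assumption: not named on the slide
    }

metrics = detection_metrics(ground_truth=2379, detected=2401, correct=2340)
print(metrics["false_negatives"], metrics["false_positives"])
print(f"recall={metrics['recall']:.4f} precision={metrics['precision']:.4f}")
```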

Experimental results (character segmentation)

Method                  Number of characters   Correctly detected regions   Recall   Over-segmentation   Connecting multiple characters   Ruby characters
Without preprocessing   17021                  5375                         0.316    428                 9913                             2197
Proposed method         17021                  16384                        0.963    626                 40                               191

[Figures: an example of over-segmentation; an example of connecting multiple characters]

Discussion (Character segmentation)

Errors from connecting multiple characters and false negatives are both reduced

Over-segmentation becomes worse

The noise removal step removes too many pixels

Conclusion

The proposed preprocessing methods achieve highly accurate character segmentation

The system has been implemented and is in public use via the Hakodate City Central Library website http://www.lib-hkd.jp/rein/

Future work

Suppressing over-segmentation

Handling tables and advertisements

Background

Finding the intended information in digital archives is becoming more difficult.

[Diagram: a newspaper archive is digitized into a digital archive; the user asks “Where are the articles on XXX?” and must search for them manually]

Summary

This study identifies the factors that make character segmentation difficult and removes them to improve segmentation accuracy.

Target materials

Japanese local newspaper “Hakodate Shimbun” (published 1878-1884, 3575 pages)

Characteristics of the material

Binary image (dithered), 6072 x 8600 px

Three-column layout

Irregular masthead

No illustrations

Horizontal and vertical ruled lines at column boundaries

Ruby characters

Noise caused by show-through or sanction seals

Outer frame detection

Hough transform

To reduce computational cost, the regions where the outer frame is expected to lie are estimated by projection

An edge detection filter is applied to make straight lines clearer
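The combination of projection and Hough voting can be sketched as below. This is a toy illustration under assumptions, not the authors' code: the input is taken to be an edge map already, only four angles are tried, and the threshold `fraction`, the `margin`, and the function names are hypothetical.

```python
import math
from collections import Counter

def likely_frame_columns(img, fraction=0.5, margin=1):
    """Pixel columns whose vertical projection reaches `fraction` of the height, widened by `margin`."""
    height, width = len(img), len(img[0])
    profile = [sum(col) for col in zip(*img)]
    keep = set()
    for x, ink in enumerate(profile):
        if ink >= fraction * height:
            keep.update(range(max(0, x - margin), min(width, x + margin + 1)))
    return keep

def hough_peak(points, thetas_deg=range(0, 180, 45)):
    """Vote in (rho, theta) space, with rho = x cos(theta) + y sin(theta), and return the top cell."""
    votes = Counter()
    for x, y in points:
        for deg in thetas_deg:
            t = math.radians(deg)
            votes[(round(x * math.cos(t) + y * math.sin(t)), deg)] += 1
    return votes.most_common(1)[0]

# Toy edge map: the left side of the outer frame at x = 2, plus a few text pixels.
edges = [[0] * 8 for _ in range(8)]
for y in range(8):
    edges[y][2] = 1
for x, y in [(5, 1), (6, 1), (5, 5), (7, 6)]:
    edges[y][x] = 1

band = likely_frame_columns(edges)
points = [(x, y) for y, row in enumerate(edges) for x, v in enumerate(row) if v and x in band]
(rho, theta_deg), votes = hough_peak(points)
```

Restricting the voting to the band found by projection means the text pixels never enter the accumulator, which is where the cost saving comes from.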

Column separation

Inner ruled lines are detected by the probabilistic Hough transform

It outputs line segments, and both endpoints of each segment are useful for column separation

Dilation is performed to remove discontinuities in the lines

Using knowledge of the target material

Rough columns are separated by detecting horizontal lines

The boundary line of the masthead is detected
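The effect of dilation on a broken ruled line can be illustrated in one dimension. The sketch below does not implement the probabilistic Hough transform itself; it only shows, under assumed names and parameters, how dilation bridges small breaks so that a line is reported as one segment with two endpoints instead of several fragments. In practice a library routine such as OpenCV's `cv2.HoughLinesP` returns each detected segment as a pair of endpoints; the slides do not state which implementation was used.

```python
def dilate_1d(line, radius=1):
    """Binary dilation: a pixel turns on if any pixel within `radius` of it is on."""
    return [int(any(line[max(0, i - radius):i + radius + 1])) for i in range(len(line))]

def segments(line):
    """Return (start, end) endpoints, both inclusive, of every run of foreground pixels."""
    runs, start = [], None
    for i, v in enumerate(line + [0]):  # sentinel closes the last run
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i - 1))
            start = None
    return runs

# Pixels sampled along a ruled line that is broken in two places.
ruled = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0]
before = segments(ruled)
after = segments(dilate_1d(ruled))
```

Dilation also lengthens each end of the line by `radius` pixels, so the reported endpoints are slightly outside the true ones; for locating column boundaries this small offset is usually acceptable.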

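Character segmentation accuracy needed for humanities research

Humanities researchers would consider a full-text search with 50% accuracy good enough (“Document-based research and information technology: from the field of history and classical studies” (文献研究と情報技術:史学・古典学の現場から), Susumu Hayashi (林 晋) et al., 2009)

Even at about 80% accuracy, it is more convenient if segmentation can be done with simple, fast work

The current level is about 0-50%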

Discussion (column separation)

False negatives

Damage to the outer frame or ruled lines

Unexpected layouts (e.g. four-column layout)

False detection of the masthead

Many vertical ruled lines adversely affect masthead detection

Column separation achieved an accuracy of 97.5%

Results on noisy image

[Figure: segmentation results on a noisy image, proposed method vs. without preprocessing]

Problems of the approach using OCR

Connecting multiple characters

Character recognition failed almost completely

櫨鷲難讐蹴.殿謎鷺,

麹”藻霞翫藩

織糞憶、、x

謬鑑ム醜轟蟹嚢欝

乏㌦癒・

象名子棉績熔胡ノ魂綱,

のたうじこべやムこれい

・・・

Example of a noisy page

Example of an irregular layout

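Why search document images?

Even with errors, being able to narrow down candidates easily is very useful

Issues of quality

It is difficult to treat an old document as a single, uniquely determined character string

What the text represents is for humanities scholars to judge

Issues of quantity

Vast amounts of material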

Processing speed

Column separation: about 1 minute per page

Character segmentation: about 1 minute per column, or 3-4 minutes per page

Feature computation for full-text search: about 20-40 minutes per page

Proposed method