Image Processing for Historical Newspaper Archives

Takahiro SHIMA, Kengo TERASAWA and Toshio KAWASHIMA

Background

September 16, 2011, Historical Document Imaging and Processing

A fast full-text search system for newspaper archives is very useful

However, OCR is difficult for old newspapers

Word-spotting-based method (Terasawa et al. 2005, 2009)

Apply OCR only for segmentation


Character-segmented images are preferable for handling a large image database.

Even though character recognition is difficult, we expected that character segmentation could be performed by OCR software.

However, the segmentation accuracy of current OCR software is low (0%-40%).

Proposed methods


1. Column Separation

2. Apply preprocessing that improves the subsequent character segmentation (main contribution of the study)

3. Character segmentation

Separated Columns


Character segmentation


Using an off-the-shelf OCR library: Typewritten document OCR library V7.0 (Media Drive Corporation)

[Figure panels: Ideal result; Real result]

Obstacle factors


1. Ruled lines

2. Ruby characters

3. Errors in layout analysis

4. Noise

Solution 1: Ruled line removal


Apply the Run Length Smoothing Algorithm (N. Stamatopoulos et al. 2009)

Estimate the rough text line width from the sizes of connected components

Remove extremely large connected components

[Figure panels: Source image; Ruled lines are removed]
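
The ruled line removal steps can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names (`rlsa_rows`, `connected_components`, `remove_ruled_lines`), the horizontal-only smoothing, the use of the median bounding-box extent as the rough text size, and the `factor` threshold are all assumptions made for this example. Images are lists of rows with 1 for ink and 0 for background.

```python
from collections import deque
from statistics import median


def rlsa_rows(img, max_gap):
    """Horizontal run-length smoothing: fill background runs of at most
    max_gap pixels that lie between two foreground pixels in a row."""
    out = [row[:] for row in img]
    for row in out:
        last_fg = None
        for x, value in enumerate(row):
            if value:
                if last_fg is not None and 0 < x - last_fg - 1 <= max_gap:
                    for k in range(last_fg + 1, x):
                        row[k] = 1
                last_fg = x
    return out


def connected_components(img):
    """Return the 4-connected foreground components as lists of (y, x)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue, comp = deque([(y, x)]), []
                while queue:
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                comps.append(comp)
    return comps


def remove_ruled_lines(img, max_gap=1, factor=3.0):
    """Smooth, label components, and erase those far larger than typical.

    The 'typical' size is the median bounding-box extent (an assumption).
    """
    comps = connected_components(rlsa_rows(img, max_gap))
    out = [row[:] for row in img]
    if not comps:
        return out
    extents = []
    for comp in comps:
        ys = [y for y, _ in comp]
        xs = [x for _, x in comp]
        extents.append(max(max(ys) - min(ys), max(xs) - min(xs)) + 1)
    typical = median(extents)
    for comp, extent in zip(comps, extents):
        if extent > factor * typical:
            for y, x in comp:
                out[y][x] = 0
    return out
```

On a real page the smoothing gap and the size factor would have to be tuned to the scan resolution; the values here are only placeholders.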

Solution 2: Ruby character removal


Connect characters vertically using an anisotropic Gaussian filter and apply the scale-space approach (R. Manmatha et al. 1996)

Thin connected regions are regarded as ruby characters and removed

[Figure panels: Filtered with anisotropic Gaussian; Binarized; Estimated body character regions (ruby characters are removed)]

\[
G(x, y; \sigma) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left(-\left(\frac{x^2}{2\sigma_x^2} + \frac{y^2}{2\sigma_y^2}\right)\right)
\]
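
A minimal sketch of the anisotropic Gaussian smoothing step, assuming zero padding at the image border and a truncated, renormalized kernel. The function names and the `sigma_x`, `sigma_y`, `radius_x`, `radius_y` parameters are illustrative and not taken from the slides. Choosing `sigma_y` larger than `sigma_x` spreads ink mainly in the vertical direction, so vertically stacked characters merge into one region while a neighbouring thin column of ruby characters stays separate.

```python
import math


def anisotropic_gaussian(x, y, sigma_x, sigma_y):
    """Value of the 2-D Gaussian with separate horizontal/vertical scales."""
    norm = 1.0 / (2.0 * math.pi * sigma_x * sigma_y)
    return norm * math.exp(-(x * x / (2.0 * sigma_x ** 2)
                             + y * y / (2.0 * sigma_y ** 2)))


def make_kernel(sigma_x, sigma_y, radius_x, radius_y):
    """Truncated kernel as {(dy, dx): weight}, renormalized to sum to 1."""
    kernel = {
        (dy, dx): anisotropic_gaussian(dx, dy, sigma_x, sigma_y)
        for dy in range(-radius_y, radius_y + 1)
        for dx in range(-radius_x, radius_x + 1)
    }
    total = sum(kernel.values())
    return {offset: weight / total for offset, weight in kernel.items()}


def smooth(img, kernel):
    """Filter a 2-D list image with the kernel (zero padding at borders)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for (dy, dx), weight in kernel.items():
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    acc += weight * img[yy][xx]
            out[y][x] = acc
    return out


def binarize(img, threshold):
    """Threshold the smoothed image back to 0/1."""
    return [[1 if v >= threshold else 0 for v in row] for row in img]
```

After binarizing the smoothed image, the width of each connected region could be measured and the thin ones discarded as ruby; the width criterion is not given in the slides, so it is left out here.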

Solution 3: Text line segmentation


To improve the OCR's layout analysis performance, column images are segmented into text lines

Layout analysis on a small region is more accurate
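
The slides do not say how the text lines are located. One common approach, shown here purely as an assumption, is a projection profile: for vertically written text, the ink in each pixel column is summed and the image is cut wherever the profile stays empty. The function name `split_text_lines` and the `min_gap` parameter are hypothetical.

```python
def split_text_lines(img, min_gap=1):
    """Return (start, end) x-ranges (end exclusive) of vertical text lines.

    img is a list of rows with 1 for ink; a line ends once at least
    min_gap consecutive pixel columns contain no ink.
    """
    width = len(img[0])
    profile = [sum(row[x] for row in img) for x in range(width)]
    lines, start, gap = [], None, 0
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                lines.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        lines.append((start, width - gap))
    return lines
```

Each returned range can then be cropped and passed to the OCR library on its own, which matches the observation that layout analysis is more accurate on small regions.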

Solution 4: Noise removal


Noise regions are estimated from the pixel value histogram and removed

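[Figure: pixel value histograms (255 to 0) of a noiseless image and a noisy image, with the character and noise components labeled]

[Figure panels: Source noisy image; After noise removal]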

Character segmentation result using the proposed methods


[Figure panels: Without preprocessing; Proposed method]

Outline of the experiment


Material: 771 pages of “Hakodate Shimbun” published in 1881

1. Column separation. Number of columns: 2379

2. Character segmentation on 20 columns. Number of characters: 17021 (excluding masthead and ruby characters)

Experimental results (column separation)


Number of columns   Detected columns   Correctly detected columns   Recall   False negatives   False positives
2379                2401               2340                         0.983    39                61
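
As a worked check of the column separation figures, the snippet below recomputes the derived values from the three counts (2379 ground-truth columns, 2401 detected, 2340 correct). The function name `detection_metrics` is illustrative. Precision is not listed in the table; it is derived here and comes to about 0.975, which appears to correspond to the 97.5% accuracy quoted in the column separation discussion. The recall evaluates to about 0.9836, which the table gives as 0.983.

```python
def detection_metrics(n_truth, n_detected, n_correct):
    """Recall, precision, and error counts from raw detection counts."""
    return {
        "recall": n_correct / n_truth,
        "precision": n_correct / n_detected,
        "false_negatives": n_truth - n_correct,
        "false_positives": n_detected - n_correct,
    }


columns = detection_metrics(n_truth=2379, n_detected=2401, n_correct=2340)
print(columns)
```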

[Figure panels: False negative (2-column layout); Masthead false detection]

Experimental results (character segmentation)


Method                  Number of characters   Correctly detected regions   Recall   Over-segmentation   Connecting multiple characters   Ruby characters
Without preprocessing   17021                  5375                         0.316    428                 9913                             2197
Proposed method         17021                  16384                        0.963    626                 40                               191

[Figure panels: Over-segmentation; Connecting multiple characters]

Discussion (Character segmentation)


Connecting multiple characters and false negative detection are improved

Over-segmentation becomes worse

Pixels are over-removed by noise removal

Conclusion


Highly accurate character segmentation is achieved by the proposed preprocessing methods

The system is implemented and publicly available via the Hakodate City Central Library website http://www.lib-hkd.jp/rein/

Future research

Suppression of over-segmentation

Handling of tables and advertisements

Background


Finding the intended information in digital archives is getting more difficult.

[Figure: a newspaper archive is digitized into a digital archive; a user asking "Where are the articles on XXX?" must search them manually]


Summary


This study identifies the factors that make character segmentation difficult and removes them to improve segmentation accuracy.


Target material: Japanese local newspaper “Hakodate Shimbun” (published 1878-1884, 3575 pages)


Characteristics of the material


Binary image (dithered) (6072x8600 px)

Three-column layout

Irregular masthead part

No illustrations

Horizontal and vertical ruled lines at column boundaries

Ruby characters

Noise caused by show-through or sanction seals


Outer frame detection


Hough transform

To reduce cost, the regions where the outer frame should exist are estimated by projection

An edge detection filter is applied to emphasize straight lines
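
The line detection step can be illustrated with a plain (non-probabilistic) Hough transform. This is a minimal pure-Python sketch under stated assumptions: the function name `hough_strongest_line`, the one-degree angle resolution, and the integer rounding of rho are choices made for this example, and the projection-based region estimate and edge filter from the slide are omitted. Each ink pixel votes for every line `rho = x*cos(theta) + y*sin(theta)` passing through it, and the most voted cell is returned.

```python
import math
from collections import Counter


def hough_strongest_line(img, n_theta=180):
    """Return (rho, theta_in_degrees, votes) of the most voted line.

    img is a list of rows with 1 for edge/ink pixels; theta is sampled
    at n_theta evenly spaced angles in [0, 180) degrees.
    """
    votes = Counter()
    for y, row in enumerate(img):
        for x, value in enumerate(row):
            if not value:
                continue
            for t in range(n_theta):
                theta = math.pi * t / n_theta
                rho = round(x * math.cos(theta) + y * math.sin(theta))
                votes[(rho, t)] += 1
    (rho, t), count = votes.most_common(1)[0]
    return rho, 180.0 * t / n_theta, count
```

Restricting the input to the band estimated by projection, as the slide describes, reduces the number of voting pixels and therefore the cost of the accumulation loop.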


Column separation


Inner ruled lines are detected by the probabilistic Hough transform

This outputs line segments; both endpoints of each segment are useful for column separation

Dilation is performed to remove line discontinuities

Using knowledge of the target material

Separating rough columns by detecting horizontal lines

Detecting the masthead's boundary line
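
The effect of dilation on a broken ruled line, and the use of horizontal lines to cut the page into rough columns, can be sketched as below. This example substitutes a simple row-coverage test for the probabilistic Hough transform named on the slide, so it should be read as an illustration of the idea only. The function names and the `radius` and `min_coverage` parameters are assumptions.

```python
def dilate_horizontal(img, radius):
    """Set a pixel if any pixel within `radius` in the same row is set."""
    width = len(img[0])
    return [
        [1 if any(row[max(0, x - radius): x + radius + 1]) else 0
         for x in range(width)]
        for row in img
    ]


def find_horizontal_rules(img, min_coverage=0.9):
    """Rows whose ink covers at least min_coverage of the image width."""
    width = len(img[0])
    return [y for y, row in enumerate(img) if sum(row) >= min_coverage * width]


def split_bands(height, rule_rows):
    """Row ranges (start, end exclusive) lying between the ruled rows."""
    bands, start = [], 0
    for y in sorted(rule_rows):
        if y > start:
            bands.append((start, y))
        start = y + 1
    if start < height:
        bands.append((start, height))
    return bands
```

In the usage below a rule with two one-pixel gaps is missed before dilation and found afterwards, which is the discontinuity problem the slide refers to.

```python
page = [
    [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],  # broken horizontal rule
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
]
rules = find_horizontal_rules(dilate_horizontal(page, radius=1))
bands = split_bands(len(page), rules)
```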

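Character segmentation accuracy required for humanities research


A full-text search accuracy of 50% would be considered good by humanities researchers (“文献研究と情報技術：史学・古典学の現場から” [Literature research and information technology: from the field of history and classics], Susumu Hayashi et al., 2009)

Even at about 80% accuracy, segmentation that can be done simply and quickly is more convenient

Currently the accuracy is about 0-50%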

Discussion (column separation)


False negatives

Damage to the outer frame or ruled lines

Unexpected layouts (e.g. four-column layout)

False detection of masthead

Many vertical ruled lines adversely affect masthead detection

The accuracy of column separation reached 97.5%

Results on noisy image


[Figure panels: Proposed method; Without preprocessing]

[Figure: pixel value histograms (255 to 0) of a noiseless image and a noisy image, with the character and noise components labeled]

Problems of the approach using OCR


Connecting multiple characters

Character recognition failed almost completely

櫨鷲難讐蹴．殿謎鷺，

麹”藻霞翫藩

織糞憶、、ｘ

謬鑑ム醜轟蟹嚢欝

乏㌦癒・

象名子棉績熔胡ノ魂綱，

のたうじこべやムこれい

・・・

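Example of a noisy page


Example of an irregular layout


Why search documents as images?


Even if there are errors, being able to narrow down candidates easily is very useful

Problems of quality

It is difficult to interpret an old document as a "uniquely determined character string"

What a text expresses is for humanities scholars to judge

Problems of quantity

An enormous amount of material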

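Processing speed


Column separation: about 1 minute per page

Character segmentation: about 1 minute per column, or 3-4 minutes per page

Feature computation for full-text search: about 20-40 minutes per page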
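
As a rough worked example of what these per-page times imply, the snippet below adds the three stages and scales them to a number of pages. It assumes strictly sequential processing on a single machine, which the slides do not state, so the totals are only an order-of-magnitude estimate. The names `PER_PAGE_MINUTES` and `total_minutes` are illustrative.

```python
# (low, high) minutes per page for each stage, taken from the list above.
PER_PAGE_MINUTES = {
    "column_separation": (1, 1),
    "character_segmentation": (3, 4),
    "feature_computation": (20, 40),
}


def total_minutes(pages):
    """Lower and upper bound on total processing time in minutes."""
    low = sum(lo for lo, _ in PER_PAGE_MINUTES.values()) * pages
    high = sum(hi for _, hi in PER_PAGE_MINUTES.values()) * pages
    return low, high


# The 771 pages used in the experiment.
low, high = total_minutes(771)
print(f"{low / 60:.0f} to {high / 60:.0f} hours")
```

Feature computation dominates the total, so it is the stage where any speed-up would matter most.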

Proposed method
