Image processing for historical newspaper archives
Takahiro SHIMA, Kengo TERASAWA and Toshio KAWASHIMA
Background
September 16, 2011 Historical Document Imaging and Processing 2
A fast full-text search system for newspaper archives is very useful.
However, OCR is difficult for old newspapers.
Word-spotting-based method (Terasawa et al. 2005, 2009)
Apply OCR only for segmentation
Character-segmented images are preferable when dealing with a large image database.
Even though character recognition is difficult, we expected that character segmentation could be performed by OCR software.
However, the segmentation accuracy of current OCR software is low (0% - 40%).
Proposed methods
1. Column Separation
2. Apply preprocessing steps that improve the subsequent process (main contribution of the study)
3. Character segmentation
Character segmentation
Using an off-the-shelf OCR library
Typewritten document OCR library V7.0 (Media Drive Corporation)
Figure: ideal result; real result
Obstacle factors
1. Ruled lines
2. Ruby characters
3. Errors in layout analysis
4. Noise
Solution1: Ruled lines removal
Applying the Run Length Smoothing Algorithm (N. Stamatopoulos et al. 2009)
The rough text line width is estimated from the sizes of connected components
Extremely large connected components are removed
Figure: source image; image after ruled lines are removed
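The three steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary image stored as lists of 0/1 with vertical text lines, and the names `rlsa_rows`, `estimate_line_width`, `remove_ruled_lines` and the parameters `max_gap` and `factor` are all hypothetical.

```python
from collections import deque
from statistics import median


def rlsa_rows(img, max_gap):
    """Run-length smoothing along rows: fill background runs of at most
    max_gap pixels that lie between two foreground pixels."""
    out = [row[:] for row in img]
    for r, row in enumerate(img):
        last_fg = None
        for c, v in enumerate(row):
            if v:
                if last_fg is not None and c - last_fg - 1 <= max_gap:
                    for k in range(last_fg + 1, c):
                        out[r][k] = 1
                last_fg = c
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def components(img):
    """4-connected components, each returned as a list of (row, col)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, comp = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            found.append(comp)
    return found


def extent(comp):
    """(height, width) of a component's bounding box."""
    ys = [y for y, _ in comp]
    xs = [x for _, x in comp]
    return max(ys) - min(ys) + 1, max(xs) - min(xs) + 1


def estimate_line_width(img, max_gap):
    """Smooth vertically so characters in one text line merge, then take
    the median component width as the rough text line width."""
    smoothed = transpose(rlsa_rows(transpose(img), max_gap))
    return median(extent(c)[1] for c in components(smoothed))


def remove_ruled_lines(img, max_gap=1, factor=3):
    """Erase components much longer than the estimated text line width.
    The factor of 3 is an arbitrary choice for this sketch."""
    limit = factor * estimate_line_width(img, max_gap)
    out = [row[:] for row in img]
    for comp in components(img):
        if max(extent(comp)) > limit:
            for y, x in comp:
                out[y][x] = 0
    return out


page = [
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],  # horizontal ruled line
]
cleaned = remove_ruled_lines(page)
```

In this toy page the two text lines are 2 pixels wide, so the 8-pixel ruled line exceeds the limit and is erased while the character blobs are left untouched.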
Solution2: Ruby characters removal
Connect characters vertically using an anisotropic Gaussian filter and apply the scale-space approach (R. Manmatha et al. 1996)
Thin connected regions are regarded as ruby characters and removed
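Anisotropic Gaussian:
$$G(x, y; \sigma_x, \sigma_y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left(-\left(\frac{x^2}{2\sigma_x^2} + \frac{y^2}{2\sigma_y^2}\right)\right)$$
Figure: filtered image; binarized image; estimated body character regions (ruby characters are removed)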
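A rough sketch of this step, under several assumptions: a single filter scale is used instead of a full scale-space analysis, "thin regions" are approximated by narrow runs of occupied pixel columns, and all names and parameter values (`remove_ruby`, `threshold`, `min_width`, the sigmas) are hypothetical rather than taken from the slides.

```python
import math


def anisotropic_gaussian_kernel(sigma_x, sigma_y, radius_x, radius_y):
    """Sample G(x, y; sigma_x, sigma_y) on a grid and normalize to sum 1."""
    kernel = []
    for y in range(-radius_y, radius_y + 1):
        row = []
        for x in range(-radius_x, radius_x + 1):
            g = math.exp(-(x * x / (2 * sigma_x ** 2) + y * y / (2 * sigma_y ** 2)))
            row.append(g / (2 * math.pi * sigma_x * sigma_y))
        kernel.append(row)
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def filter2d(img, kernel):
    """Filter the image with the kernel; pixels outside the image count as 0."""
    h, w = len(img), len(img[0])
    ry, rx = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dy in range(-ry, ry + 1):
                for dx in range(-rx, rx + 1):
                    y, x = r + dy, c + dx
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[dy + ry][dx + rx] * img[y][x]
            out[r][c] = acc
    return out


def remove_ruby(img, sigma_x=0.5, sigma_y=2.0, threshold=0.3, min_width=3):
    """Blur mostly along y so characters in a vertical line connect,
    binarize, and keep only pixel columns that belong to wide runs."""
    kernel = anisotropic_gaussian_kernel(sigma_x, sigma_y, radius_x=1, radius_y=3)
    blurred = filter2d(img, kernel)
    w = len(img[0])
    occupied = [any(row[c] >= threshold for row in blurred) for c in range(w)]
    keep = [False] * w
    c = 0
    while c < w:
        if occupied[c]:
            start = c
            while c < w and occupied[c]:
                c += 1
            if c - start >= min_width:  # wide run: body text line
                for k in range(start, c):
                    keep[k] = True
        else:
            c += 1
    return [[v if keep[x] else 0 for x, v in enumerate(row)] for row in img]


page = [[0] * 12 for _ in range(9)]
for r in (0, 1, 2, 3, 5, 6, 7, 8):  # body text, with a gap at row 4
    for c in range(2, 6):
        page[r][c] = 1
for r in (1, 2):                    # a thin ruby annotation beside it
    page[r][8] = 1
body_only = remove_ruby(page)
```

Choosing sigma_y larger than sigma_x is what makes the filter bridge the gaps between characters in the same vertical line without merging neighbouring lines.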
Solution3: Text line segmentation
To improve the OCR's layout analysis performance, column images are segmented into text lines
Layout analysis of a small region is more accurate
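The slide does not say how the text lines are found. One common way, shown here purely as an assumption, is a projection profile: for vertical text, count the ink in each pixel column and cut at empty columns. The function name `split_text_lines` and the `min_gap` parameter are hypothetical.

```python
def split_text_lines(img, min_gap=1):
    """Return (start, end) column ranges, end exclusive, of vertical text
    lines separated by at least min_gap empty pixel columns."""
    profile = [sum(col) for col in zip(*img)]
    lines = []
    start = None
    gap = 0
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                lines.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        lines.append((start, len(profile) - gap))
    return lines


def crop_columns(img, start, end):
    """Cut one text line out of the column image."""
    return [row[start:end] for row in img]


column_image = [
    [0, 1, 1, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 1, 0],
]
ranges = split_text_lines(column_image)
line_images = [crop_columns(column_image, s, e) for s, e in ranges]
```

Each cropped line image would then be passed to the OCR library separately, so that its layout analysis only has to deal with a small region.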
Solution4: Noise removal
Noise regions are estimated by pixel value histogram and removed
Figure: pixel value histograms (axis from 255 to 0) of a noiseless image and a noisy image, with the labels "Character" and "Noise"
Figure: source noisy image; image after noise removal
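The slides give no detail on how the histogram is used, so the following is only one possible reading. It assumes a dithered binary image (1 = black), turns it into local gray levels by averaging a small window, builds the histogram of those levels at black pixels, and separates the dark "character" group from the lighter "noise" group with Otsu's threshold, which is a standard technique substituted here for the unspecified method. All function names are hypothetical.

```python
def local_gray_levels(img, radius=1):
    """Gray level 0..255 from the share of white pixels in each pixel's
    (2*radius+1)^2 window; 0 means the window is entirely black.
    Pixels outside the image count as white."""
    h, w = len(img), len(img[0])
    area = (2 * radius + 1) ** 2
    levels = [[255] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            black = sum(
                img[y][x]
                for y in range(max(0, r - radius), min(h, r + radius + 1))
                for x in range(max(0, c - radius), min(w, c + radius + 1))
            )
            levels[r][c] = (255 * (area - black) + area // 2) // area
    return levels


def histogram_of_ink(img, levels):
    """Histogram of local gray levels, counted at black pixels only."""
    hist = [0] * 256
    for row, level_row in zip(img, levels):
        for v, level in zip(row, level_row):
            if v:
                hist[level] += 1
    return hist


def otsu_threshold(hist):
    """Level t maximizing the between-class variance of {<= t} and {> t}."""
    total = sum(hist)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0
    sum0 = 0
    for t, n in enumerate(hist):
        w0 += n
        sum0 += t * n
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def remove_noise(img, radius=1):
    """Keep black pixels whose neighbourhood is dark, drop the rest."""
    levels = local_gray_levels(img, radius)
    t = otsu_threshold(histogram_of_ink(img, levels))
    return [
        [1 if v and level <= t else 0 for v, level in zip(row, level_row)]
        for row, level_row in zip(img, levels)
    ]


noise_pixels = [(0, 8), (3, 8), (6, 8), (8, 10)]
noisy = [[0] * 11 for _ in range(9)]
for r in range(2, 7):          # a solid 5x5 "character"
    for c in range(1, 6):
        noisy[r][c] = 1
for r, c in noise_pixels:      # isolated dither dots
    noisy[r][c] = 1
denoised = remove_noise(noisy)
```

In this toy image the isolated dots are removed and the body of the block survives, although the sketch also trims the four corner pixels of the block.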
Character segmentation result using the proposed methods
Figure: result without preprocessing; result with the proposed method
Experiment’s outline
Material: 771 pages of “Hakodate Shimbun” published in 1881
1. Column separation
Number of columns: 2379
2. Character segmentation applied to 20 columns
Number of characters: 17021 (excluding masthead and ruby characters)
Experimental results (column separation)
Number of columns: 2379
Detected columns: 2401
Correctly detected columns: 2340
Recall: 0.983
False negatives: 39
False positives: 61
Figure: false negative (2-column layout); false detection of masthead
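The figures in this table are related by simple arithmetic, which can be checked as below. The function name `detection_metrics` is hypothetical, and "precision" is a label added here for the ratio of correct to detected columns; the slide itself reports only recall.

```python
def detection_metrics(n_truth, n_detected, n_correct):
    """Derive the usual detection figures from three raw counts."""
    return {
        "recall": n_correct / n_truth,
        "precision": n_correct / n_detected,
        "false_negatives": n_truth - n_correct,
        "false_positives": n_detected - n_correct,
    }


column_metrics = detection_metrics(n_truth=2379, n_detected=2401, n_correct=2340)
for name, value in column_metrics.items():
    print(name, value)
```

The false negative and false positive counts reproduce the 39 and 61 in the table, and the recall comes out at roughly 0.98. The correct-to-detected ratio is roughly 0.975, which appears to correspond to the 97.5% accuracy quoted on the later column separation discussion slide.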
Experimental results (character segmentation)
                                 Without preprocessing   Proposed method
Number of characters                             17021             17021
Correctly detected regions                        5375             16384
Recall                                           0.316             0.963
Over-segmentation                                  428               626
Connecting multiple characters                    9913                40
Ruby characters                                   2197               191
Figure: examples of over-segmentation and of connecting multiple characters
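As a quick check of these results, the recall values and the change in each error type can be recomputed from the raw counts. The dictionary layout and key names are illustrative only.

```python
TOTAL_CHARACTERS = 17021

results = {
    "without preprocessing": {
        "correct": 5375, "over_segmentation": 428, "connected": 9913, "ruby": 2197,
    },
    "proposed method": {
        "correct": 16384, "over_segmentation": 626, "connected": 40, "ruby": 191,
    },
}

# Recall = correctly detected regions / number of characters
recalls = {
    name: round(counts["correct"] / TOTAL_CHARACTERS, 3)
    for name, counts in results.items()
}

# Change in each error count when preprocessing is added
before = results["without preprocessing"]
after = results["proposed method"]
changes = {
    key: after[key] - before[key]
    for key in ("over_segmentation", "connected", "ruby")
}

print(recalls)
print(changes)
```

The recomputed recalls agree with the 0.316 and 0.963 in the table. The differences show that merged-character and ruby errors fall sharply while over-segmentation rises by 198 regions.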
Discussion (Character segmentation)
Errors from connecting multiple characters and false negative detections are reduced
Over-segmentation becomes worse, because pixels are over-removed by noise removal
Conclusion
Highly accurate character segmentation is achieved by the proposed preprocessing methods
The system is implemented and in public use via the Hakodate City Central Library website http://www.lib-hkd.jp/rein/
Future research
Suppression of over-segmentation
Measures against tables and advertisements
Background
Finding the intended information in digital archives is becoming more difficult.
Figure: a newspaper archive is digitized into a digital archive; a user asking "Where are the articles about XXX?" must search for them manually
Summary
This study identifies the factors that make character segmentation difficult and removes them to improve segmentation accuracy.
Target materials
Japanese local newspaper “Hakodate Shimbun” (published 1878-1884, 3575 pages)
Characteristics of the material
Binary image (dithered) (6072x8600 px)
Three-column layout
Irregular masthead part
No illustrations
Horizontal and vertical ruled lines at column boundaries
Ruby characters
Noise caused by show-through or sanction seals
Outer frame detection
Hough transform
To reduce computational cost, the regions where outer frames are likely to exist are estimated by projection
An edge detection filter is applied to emphasize straight lines
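A minimal sketch of the idea for one near-vertical frame edge: a column projection picks a narrow candidate band, and a small Hough accumulator votes only over pixels inside that band. This is an illustration under assumptions, not the authors' code. The edge detection filter is omitted, only angles within a few degrees of vertical are tried, and the names `candidate_band`, `hough_votes`, `strongest_vertical_line` and the `margin` parameter are hypothetical.

```python
import math


def candidate_band(img, margin=1):
    """Column range (inclusive) around the peak of the column projection."""
    profile = [sum(col) for col in zip(*img)]
    peak = max(range(len(profile)), key=profile.__getitem__)
    return max(0, peak - margin), min(len(profile) - 1, peak + margin)


def hough_votes(points, thetas_deg):
    """Accumulate votes for lines x*cos(theta) + y*sin(theta) = rho."""
    acc = {}
    for x, y in points:
        for t in thetas_deg:
            th = math.radians(t)
            rho = round(x * math.cos(th) + y * math.sin(th))
            acc[(rho, t)] = acc.get((rho, t), 0) + 1
    return acc


def strongest_vertical_line(img, margin=1, thetas_deg=range(-2, 3)):
    """Run the Hough vote only inside the band found by projection."""
    lo, hi = candidate_band(img, margin)
    points = [
        (x, y)
        for y, row in enumerate(img)
        for x in range(lo, hi + 1)
        if row[x]
    ]
    acc = hough_votes(points, thetas_deg)
    # Most votes wins; ties are broken in favour of the angle closest to 0.
    (rho, theta), votes = max(acc.items(), key=lambda kv: (kv[1], -abs(kv[0][1])))
    return rho, theta, votes


frame = [[0] * 12 for _ in range(10)]
for y in range(10):
    frame[y][3] = 1            # left edge of the outer frame
for y, x in ((2, 8), (5, 9), (7, 6)):
    frame[y][x] = 1            # unrelated ink outside the band
rho, theta, votes = strongest_vertical_line(frame)
```

Restricting the vote to the band means the accumulator never sees most of the page, which is the cost saving the slide refers to.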
Column separation
Inner ruled lines are detected by the probabilistic Hough transform
This outputs line segments; both endpoints of each line segment are useful for column separation
Dilation is performed to remove line discontinuities
Using knowledge of the target
Rough columns are separated by detecting horizontal lines
The masthead's boundary line is detected
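The effect of dilation on a broken ruled line can be shown in one dimension. This is only a sketch of the idea: the slides presumably dilate the 2-D image before the probabilistic Hough transform, whereas here a single pixel row stands in for the line. The function names and the `radius` parameter are hypothetical.

```python
def dilate_1d(row, radius):
    """Binary dilation of a pixel row with a window of 2*radius+1."""
    n = len(row)
    return [
        1 if any(row[max(0, i - radius):min(n, i + radius + 1)]) else 0
        for i in range(n)
    ]


def runs(row):
    """Inclusive (start, end) endpoints of each run of 1s."""
    segments, start = [], None
    for i, v in enumerate(row):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(row) - 1))
    return segments


def bridged_segments(row, radius):
    """Segments after closing small gaps, reported with the endpoints of
    the original ink so that dilation does not lengthen the line."""
    result = []
    for start, end in runs(dilate_1d(row, radius)):
        ink = [i for i in range(start, end + 1) if row[i]]
        result.append((ink[0], ink[-1]))
    return result


broken_line = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
print(runs(broken_line))
print(bridged_segments(broken_line, radius=1))
```

Without dilation the damaged line is reported as two short segments. After dilation it becomes a single segment whose two endpoints span the whole ruled line, which is the information needed for column separation.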
Character segmentation accuracy required for humanities research
A full-text search accuracy of 50% would be deemed good by humanities researchers (“文献研究と情報技術:史学・古典学の現場から,” Hayashi et al., 2009)
Being able to segment characters with simple, fast work, even at only about 80% accuracy, is more convenient
Currently the accuracy is about 0-50%
Discussion (column separation)
False negatives
Damage to the outer frame or ruled lines
Unexpected layouts (e.g. four-column layout)
False detection of the masthead
Many vertical ruled lines adversely affect masthead detection
The accuracy of column separation reached 97.5%
Results on noisy image
Figure: result with the proposed method; result without preprocessing
Figure: pixel value histograms (axis from 255 to 0) of a noiseless image and a noisy image, with the labels "Character" and "Noise"
Problems of the approach using OCR
Multiple characters are connected
Character recognition failed almost completely, for example:
櫨鷲難讐蹴.殿謎鷺,
麹”藻霞翫藩
織糞憶、、x
謬鑑ム醜轟蟹嚢欝
乏㌦癒・
象名子棉績熔胡ノ魂綱,
のたうじこべやムこれい
・・・
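Why image-based document search?
Even if there are errors, being able to narrow down candidates easily is highly useful
Issues of quality
It is difficult to treat an old document as a "uniquely determined character string"
What the text represents is for humanities scholars to judge
Issues of quantity
The volume of material is enormous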
Processing speed
Column separation: about 1 minute per page
Character segmentation: about 1 minute per column, or 3-4 minutes per page
Feature computation for full-text search: about 20-40 minutes per page
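A back-of-the-envelope estimate of what the per-page timings above imply for the whole collection. It assumes all 3575 pages mentioned on the target materials slide are processed one after another on a single machine, which the slides do not state.

```python
PAGES = 3575  # size of the whole collection (assumed to be processed in full)

# (low, high) minutes per page for each stage, as quoted above
MINUTES_PER_PAGE = {
    "column separation": (1, 1),
    "character segmentation": (3, 4),
    "feature computation": (20, 40),
}


def total_hours(pages, stages):
    """Sum the per-page ranges and scale to the whole collection."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return pages * low / 60, pages * high / 60


low_hours, high_hours = total_hours(PAGES, MINUTES_PER_PAGE)
print(f"{low_hours:.0f} to {high_hours:.0f} hours "
      f"({low_hours / 24:.0f} to {high_hours / 24:.0f} days)")
```

Under these assumptions the feature computation dominates, and the total is on the order of 1400 to 2700 hours of sequential processing.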
Proposed method