
Document Image Retrieval through Word Shape Coding

Shijian Lu, Member, IEEE, Li Linlin, Chew Lim Tan, Senior Member, IEEE

April 10, 2008 DRAFT


Abstract

This paper presents a document retrieval technique that is capable of searching document images

without OCR (optical character recognition). The proposed technique retrieves document images by a

new word shape coding scheme, which captures the document content through annotating each word

image by a word shape code. In particular, we annotate word images by using a set of topological shape

features including character ascenders/descenders, character holes, and character water reservoirs. With

the annotated word shape codes, document images can be retrieved by either query keywords or a query

document image. Experimental results show that the proposed document image retrieval technique is

fast, efficient, and tolerant of various types of document degradation.

Index Terms

Document image retrieval, document image analysis, word shape coding.

I. INTRODUCTION

With the proliferation of digital libraries and the promise of the paperless office, an increasing

number of document images of varying quality are being scanned and archived. Under the

traditional retrieval scenario, scanned document images need to be first converted to ASCII text

through OCR (optical character recognition) [12]. However, for the huge number of document

images archived in digital libraries, running OCR on all of them purely for retrieval is wasteful

and prohibitively expensive, particularly considering the arduous post-OCR

correction process. In addition, compared with the structured representation of documents via OCR,

image-based representation of documents is often more intuitive and more flexible because it

preserves the physical document layout and non-text components (such as embedded graphics)

much better [20]. Under such circumstances, a fast and efficient document image retrieval

technique will facilitate the location of imaged text information, or at least significantly

narrow the archived document images down to those of interest.

There is therefore a recent trend towards content-based document image retrieval techniques

without going through the OCR process. A large number of content-based image retrieval

techniques [16] have been reported. For the retrieval of document images, earlier works were

often based on the character shape coding that annotates character images by a set of pre-defined

codes. For example, Nakayama annotates character images by seven codes and then uses them


Fig. 1. The three topological character shape features in use: (a) the sample word image “shape”; (b) character ascenders and

descenders; (c) character holes; (d) character water reservoirs.

for content word detection [7] and document image categorization [6]. Similarly, Spitz et al.

take a character shape coding approach for language identification [3], word spotting [8], and

document image retrieval [9]. In [11], Tan et al. also propose a character shape coding scheme

that annotates character images based on the vertical component cut. In addition, a number of

image matching techniques [18], [19] have also been reported for word image spotting. The

major limitation of the above character shape coding techniques lies in their sensitivity to

character segmentation errors. For document images of low quality, the accuracy of the resultant

character shape codes is often severely degraded by segmentation errors resulting

from various types of document degradation.

To overcome the limitation of the character shape coding, we have proposed a number of

word shape coding schemes, which treat each word image as a single component and so are

much more tolerant of character segmentation errors. In our earlier work [21], the vertical

bar pattern is used for the word shape coding and document image retrieval. In [4], we code

word images by using character extremum points and the resultant word shape codes are then

used for the language identification. Later, the number of horizontal word cuts is incorporated

in [5] and then used for multilingual document image retrieval. In addition, we reported a

keyword spotting technique in [10] where each word image is annotated by a primitive string.

This paper presents a new word image annotation technique and its application to document

image retrieval by either query keywords or a query document image. We annotate word

images by a set of topological character shape features including character ascenders/descenders,


Fig. 2. The detection of word and text line images through the analysis of the horizontal and vertical projection profiles and

the illustration of the x line and base line of text.

character holes, and character water reservoirs illustrated in Figs. 1(b-d). Compared with the

coding schemes reported in our earlier works [4], [5], [21], the word annotation technique

presented in this paper has the following advantages. First, it is much faster because it does not

require the time-consuming connected component labeling. Second, the character shape features

in use are more tolerant of document skew and of variations in text fonts and styles.

Third and most importantly, its collision rate is much lower because of the distinguishability of

the three character shape features in use.

The rest of this paper is organized as follows. Section II describes the proposed word image

annotation scheme. The proposed document image retrieval techniques are presented in

Section III. Section IV presents and discusses experimental results. Finally, some concluding

remarks are drawn in Section V.

II. WORD IMAGE ANNOTATION

This section presents the proposed word image annotation technique. It is divided into

three subsections, which deal with document image preprocessing, word shape feature

extraction, and word image representation, respectively.

A. Document Image Preprocessing

Archived document images often suffer from various types of document degradation such as

impulse noise and low contrast. Therefore, document images need to be preprocessed so as to

extract the character shape features in use properly. In the proposed technique, document images

are first smoothed to suppress noise by a simple mean filter within a 3 × 3 window.


The filtered document images are then binarized. A large number of document binarization

techniques [2] have been reported and we directly make use of Otsu’s global method [1]. After

that, words and text lines are located through the analysis of the horizontal and vertical

document projection profiles illustrated in Fig. 2. For Latin-based document images, the horizontal

projection profile normally shows two peaks at the x line and base line of the text. In addition, due

to the blanks between adjacent words within the same text line, some zero-height segments of

significant length can also be detected from the vertical projection profile. Word and text line

images can thus be located based on the peaks and the zero-height segments of the horizontal

and vertical projection profiles, respectively.
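As an illustration of the projection-profile analysis described above, the following sketch (our own reconstruction, not the authors' code) splits a binarized text-line image into word images by locating sufficiently wide zero-height segments of the vertical projection profile; the `min_gap` parameter is a hypothetical minimum blank width separating adjacent words:

```python
import numpy as np

def segment_words(binary_line, min_gap=3):
    """Split one binarized text-line image (text pixels = 1) into word
    images by finding sufficiently wide zero-height segments of the
    vertical projection profile.  `min_gap` is a hypothetical minimum
    blank width (in pixels) treated as an inter-word space."""
    vproj = binary_line.sum(axis=0)              # vertical projection profile
    words, start = [], None
    for x, v in enumerate(vproj):
        if v > 0 and start is None:
            start = x                            # a word begins
        elif v == 0 and start is not None:
            if vproj[x:x + min_gap].sum() == 0:  # blank run is wide enough
                words.append((start, x))
                start = None
    if start is not None:                        # word touching the right edge
        words.append((start, len(vproj)))
    return words
```

Text lines themselves can be located analogously from the zero-height segments of the horizontal projection profile of the page.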

B. Word Shape Feature Extraction

This section presents the extraction of the three character shape features in use, namely,

character ascenders/descenders, character holes, and character reservoirs. Among them, character

ascenders and descenders can be simply located based on the observation that they lie above the

x line and below the base line of the text, respectively. Character holes and character reservoirs

can then be detected through the analysis of character white runs described below.

Scanning vertically (or horizontally) from top to bottom (or from left to right), a character

white run can be located by a beginning pixel BP and an ending pixel EP corresponding to “01”

and “10” illustrated in Fig. 3 (“1” and “0” denote white background pixels and gray foreground

pixels in Fig. 3). As we only need leftward and rightward reservoirs (to be discussed in the next

subsection), we scan word images vertically column by column. Clearly, two vertical white runs

from the two adjacent scanning columns are connected if they satisfy the following constraint:

BPc < EPa & EPc > BPa (1)

where [BPc EPc] and [BPa EPa] refer to the BP and EP of the white runs detected in the

current and adjacent scanning columns. Consequently, a set of connected vertical white runs

form a white run component whose centroid can be estimated as follows:

C_x = \frac{\sum_{i=1}^{N_r} (EP_{i,y} - BP_{i,y})\, BP_{i,x}}{\sum_{i=1}^{N_r} (EP_{i,y} - BP_{i,y})}, \qquad
C_y = \frac{\sum_{i=1}^{N_r} (EP_{i,y} - BP_{i,y})\, (EP_{i,y} + BP_{i,y} + 1)/2}{\sum_{i=1}^{N_r} (EP_{i,y} - BP_{i,y})}    (2)


Fig. 3. The illustration of the beginning pixel and ending pixel of a horizontal and a vertical white run.

where the denominator gives the number of pixels (component size) within the white run

component under study. The numerator instead gives the sums of the x and y coordinates of

pixels within the white run component. Parameter Nr refers to the number of white runs within

the white run component under study.
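The connectivity constraint (1) and the centroid estimate (2) can be transcribed directly; the run representation below (an `(x, bp_y, ep_y)` triple per vertical white run) is an assumed layout for illustration, not the authors' data structure:

```python
def runs_connected(bp_c, ep_c, bp_a, ep_a):
    """Constraint (1): vertical white runs in adjacent columns are
    connected iff their y extents overlap."""
    return bp_c < ep_a and ep_c > bp_a

def centroid(runs):
    """Equation (2).  `runs` is a list of (x, bp_y, ep_y) triples, one
    per constituent vertical white run of a component (assumed layout);
    ep_y - bp_y is the run length in pixels."""
    size = sum(ep - bp for _, bp, ep in runs)     # component size in pixels
    cx = sum((ep - bp) * x for x, bp, ep in runs) / size
    cy = sum((ep - bp) * (ep + bp + 1) / 2 for _, bp, ep in runs) / size
    return cx, cy
```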

Character holes and character reservoirs can be detected based on the openness or closedness

of the detected white run components shown in Figs. 1c and 1d. Generally, a white run component

is closed if all neighboring pixels on the left of the first and on the right of the last constituent

white run are text pixels. On the contrary, a white run component is open if some neighboring

pixels on the left of the first or on the right of the last constituent white run are background

pixels. Therefore, a leftward and rightward closed white run component results in a character

hole (such as the hole of character “o”). At the same time, a leftward (or rightward) open and

rightward (or leftward) closed white run component results in a leftward (or rightward) character

reservoir (such as the leftward reservoir of character “a”).

It should be noted that due to the document degradation, there normally exist a large number

of tiny concavities along the character stroke boundary. As a result, a large number of character

reservoirs of a small depth will be detected by the above vertical scanning process. However,

these small reservoirs are not desired; they can be identified based on their depth (Nr in


Equation (2) above) relative to the x height (the distance between x line and base line of the

text shown in Fig. 2). Generally, the relative depth of these undesired reservoirs is much smaller

than that of the desired ones. Our experiments show that a relative depth threshold of 0.2 is

capable of identifying character reservoirs of a small depth adequately.
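The classification logic of this subsection can be summarized in a small sketch; the openness flags and the use of the run count Nr as the component depth are our assumptions about how a white run component would be represented:

```python
def classify_component(left_open, right_open, n_runs, x_height,
                       min_rel_depth=0.2):
    """Classify one white run component.  `left_open`/`right_open` say
    whether background pixels border the first/last constituent run;
    `n_runs` plays the role of the depth Nr.  Reservoirs shallower than
    0.2 * x_height are rejected as stroke-boundary concavities."""
    if not left_open and not right_open:
        return "hole"                      # closed on both sides
    if n_runs < min_rel_depth * x_height:
        return None                        # tiny concavity: ignore
    if left_open and not right_open:
        return "leftward reservoir"
    if right_open and not left_open:
        return "rightward reservoir"
    return None                            # open on both sides: no feature
```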

C. Word Image Representation

Each word image can thus be annotated by a sequence of character codes for the three types

of character shape features. However, not every character has a code (e.g. “mnruvw” in Table I),

some characters (such as “hlIJL” in Table I) may share a code, and other

characters (such as “p”, “b”, and “x” in Table I) may be represented by more than one code. The idea here

is to represent a word as a linear sequence of codes rather than representing each and every

character in a word. To deal with character segmentation error, we particularly annotate word

images by using five shape features including character ascenders/descenders, character holes,

and leftward and rightward character reservoirs. We do not use upward and downward reservoirs

based on two observations. First, most character segmentation errors are due to the touching of

two or more adjacent characters at either x line or base line position but seldom at both x line

and base line positions. Second, a typical touching at the x line or base line position introduces

an upward or downward reservoir, which seldom affects leftward or rightward reservoirs.

The five shape features in use are annotated by two types of codes according to their vertical

alignment. Particularly, the first type is used when the five shape features have no vertically

aligned shape features (such as the hole of “o” and the rightward reservoir of “c”). In this

case, the five shape features (i.e. character ascenders/descenders, character holes, and leftward

and rightward character reservoirs from left to right) are annotated by “l”, “n”, “o”, “u”, and

“c”, respectively. The second type is used when the five shape features have vertically aligned

features (such as “e” whose hole lies right above its rightward reservoir). In this case, the shape

feature together with its vertical alignments usually determines a Roman letter uniquely. Under

such circumstances, we annotate the shape feature together with its vertical alignments by the

uniquely determined Roman letter. It should be noted that we annotate character descenders and

leftward reservoirs by “n” and “u” for the first type of codes because both “n” and “u” have no

desired shape features and so will not contribute any shape codes.

Table I shows the proposed coding scheme where 52 Roman letters and numbers 0–9 are


TABLE I

CODES OF 52 ROMAN LETTERS AND NUMBERS 0-9 BY USING THE THREE PROPOSED SHAPE FEATURES.

Characters        Shape code      Characters        Shape code

a                 a               b                 lo
c                 c               d                 ol
e                 e               f                 f
g                 g               h l I J L T 1 7   l
i                 i               j                 j
k t K             lc              o                 o
p                 no              q                 on
s                 s               x X               uc
y                 y               z                 z
A                 A               B 8               B
C G               C               D O 0 4           O
E                 E               F                 F
H M N U V W Y     ll              P                 P
Q                 Q               R                 R
S                 S               Z                 Z
2                 2               3                 3
5                 5               6                 6
9                 9               m n r u v w       (no code)

annotated by 35 codes. For example, character “b” is annotated by “lo” (the first type of code),

indicating a character hole (o) directly on the right of a character ascender (l). Character “a” is

coded by itself (the second type of code) because a leftward reservoir right above a character

hole uniquely indicates an entity of character “a”. Based on the coding scheme in Table I, the

word image “shape” in Fig. 1a can be represented by the code sequence “slanoe”, where “s”,

“l”, “a”, “no”, and “e” are converted from the five spelling characters, respectively. It should be

noted that character “g” in Table I may have two holes with one lying below the base line (for

serif “g”) or a single hole lying above a leftward reservoir (for sans-serif “g”). However, both

feature patterns uniquely indicate the character “g”.

The proposed word shape coding scheme is tolerant of character segmentation errors. For

example, though characters “ab” frequently touch at the base line position, they can still

be properly annotated by “alo”. Similarly, characters “rt” touching at the x line position

can be properly annotated as “lc”. In addition, though some fonts such as serifed ones may

produce a number of leftward and rightward reservoirs, the depth of a reservoir produced by a serif is

normally much smaller than that of real reservoirs (such as the rightward reservoir of “c”).

Therefore, reservoirs produced by serifs can be simply detected based on their depth relative to the

x height as described in the last subsection.
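For the query side, Table I amounts to a lookup table mapping each character to its shape code. The sketch below covers only the lowercase rows of Table I (uppercase letters and digits would be added analogously, and unlisted characters are simply skipped here):

```python
# Lowercase rows of Table I; an empty string means "no shape code".
CODES = {
    'a': 'a',  'b': 'lo', 'c': 'c',  'd': 'ol', 'e': 'e',  'f': 'f',
    'g': 'g',  'h': 'l',  'i': 'i',  'j': 'j',  'k': 'lc', 'l': 'l',
    'm': '',   'n': '',   'o': 'o',  'p': 'no', 'q': 'on', 'r': '',
    's': 's',  't': 'lc', 'u': '',   'v': '',   'w': '',   'x': 'uc',
    'y': 'y',  'z': 'z',
}

def word_shape_code(word):
    """Transliterate the spelling of a word into its word shape code."""
    return ''.join(CODES.get(ch, '') for ch in word)
```

For the sample word of Fig. 1a, `word_shape_code("shape")` yields “slanoe”, matching the code sequence derived in the text, and the touching pair “ab” yields “alo”.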


III. DOCUMENT IMAGE RETRIEVAL

Based on the word shape coding scheme described above, the content of document images can

be captured by the converted word shape codes. Similar to most content-based image retrieval,

document images can then be retrieved by either query keywords or a query document image

based on their content similarity.

A. Retrieval by Query Keywords

Similar to a Google search, which retrieves web pages containing the query keywords, our

document image retrieval works by matching the codes transliterated from the query keywords

and those converted from words within the archived document images. In particular, we define it as a

retrieval success if a document image containing any of the query keywords is retrieved. In

addition, we define it as a retrieval failure if a document image containing query keywords

is not retrieved or a document image containing no query keywords is retrieved. Acting as a

pre-screening procedure, such retrieval by query keywords significantly narrows the archived

document images down to those containing the query keywords, though it may not locate the

relevant document images accurately.

For text images, such retrieval by query keywords can be simply adapted for keyword

spotting, for which the word position needs to be determined. In addition, the page number

needs to be determined as well because query keywords may appear multiple

times on different pages. To locate the query keywords properly, we format each word image W

with a unique spelling as a word record as follows:

WR = [ WSC  ⟨p_1 blx_1 bly_1 w_1 h_1⟩ · · · ⟨p_i blx_i bly_i w_i h_i⟩ · · · ]    (3)

(3)

where WSC denotes the indexing word shape code converted from W. Terms pi, blxi, blyi,

wi, and hi, i = 1 · · · n, specify the page number, the position (blxi and blyi give the x and y

coordinates of the word's bottom-left corner), and the size (wi and hi refer to the word width and

height) of the ith occurrence of W, respectively. In our implemented system, all word records

are stored within a table where each record is indexed by the corresponding word shape code.


Word images can thus be located if their indexing word shape codes match those transliterated

from the query keywords.
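In an implementation, the word-record table of Equation (3) is naturally a hash map from word shape codes to occurrence lists; the class below is a minimal sketch with hypothetical names, not the authors' system:

```python
from collections import defaultdict

class WordIndex:
    """Table of word records (Equation (3)): each word shape code indexes
    a list of occurrences, stored as (page, blx, bly, w, h) tuples."""

    def __init__(self):
        self.records = defaultdict(list)

    def add(self, wsc, page, blx, bly, w, h):
        """Register one occurrence of a word with shape code `wsc`."""
        self.records[wsc].append((page, blx, bly, w, h))

    def spot(self, query_wsc):
        """Return every page/position whose indexing code matches the
        code transliterated from a query keyword."""
        return self.records.get(query_wsc, [])
```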

B. Retrieval by a Query Document Image

Similar to content-based retrieval of images from an image database, archived document images

can also be retrieved by a query document image according to the content similarity based on our

proposed word shape coding. To evaluate the document similarity, we first convert a document

image into a document vector. Particularly, each document vector element is composed of two

components including a word shape component and a word frequency component:

D = [(WSC1 : WON1), · · · , (WSCN : WONN)] (4)

where N is the number of unique words within the document image under study. WSCi and

WONi denote the word shape and word frequency components, respectively.

The document vector construction process can be summarized as follows. Given a word

shape code converted from a word within the document image under study, the corresponding

document vector is searched for an element with the same word shape code component. If such an

element exists, the word frequency component of that document vector element is increased

by one. Otherwise, a new document vector element is created and the corresponding word

shape and word frequency components are initialized with the converted word shape code and

one, respectively. The conversion process terminates when all words within the document image

under study have been converted and examined as described above. Finally, to compensate for the

variable document length, the frequency component of the converted document vector elements

is normalized by dividing by the number of words within the document image under study.
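The vector construction just described reduces to a frequency count with length normalization; a minimal sketch follows (the `stop_codes` argument anticipates the stop-word removal described later in this section, and its name is ours):

```python
from collections import Counter

def document_vector(word_shape_codes, stop_codes=frozenset()):
    """Equation (4): map each unique word shape code of a document to
    its frequency, normalized by the document length.  `stop_codes`
    would hold the transliterated stop-word template, whose elements
    are removed before the similarity evaluation."""
    total = len(word_shape_codes)
    counts = Counter(word_shape_codes)
    return {wsc: n / total for wsc, n in counts.items()
            if wsc not in stop_codes}
```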

The similarity between two document images can thus be evaluated based on the frequency

component of their document vectors. In particular, the similarity between the two document

vectors DV1 and DV2 can be evaluated by using the cosine measure as follows:

sim(DV_1, DV_2) = \frac{\sum_{i=1}^{V} DVF_{1,i} \cdot DVF_{2,i}}{\sqrt{\sum_{i=1}^{V} (DVF_{1,i})^2} \cdot \sqrt{\sum_{i=1}^{V} (DVF_{2,i})^2}}    (5)

where V defines the vocabulary size, which is equal to the number of unique word shape

codes within DV1 and DV2. DVF1,i and DVF2,i specify the word frequency information. In

particular, if the word shape code under study finds a match within DV1 and DV2, DVF1,i and


TABLE II

COLLISION RATES OF THE FOUR CODING SCHEMES, EVALUATED ON 57000 ENGLISH WORDS (WCR: WORD COLLISION RATE).

No. of collisions   WCR, scheme in [3]   WCR, scheme in [21]   WCR, scheme in [5]   WCR, present scheme
0                   0.2769               0.2317                0.3769               0.7221
1                   0.0602               0.0991                0.1368               0.1381
2                   0.0478               0.0587                0.0714               0.0479
3                   0.0335               0.0472                0.0518               0.0266
4                   0.0265               0.0335                0.0406               0.0171
5                   0.0255               0.0295                0.0305               0.0102
6                   0.0195               0.0262                0.0278               0.0077
7                   0.0154               0.0241                0.0224               0.0061
≥ 8                 0.3985               0.4495                0.2414               0.0230

DVF2,i are set to the corresponding word frequency (WON) components.

Otherwise, both are simply set at zero.
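Equation (5) over sparse document vectors stored as dictionaries can be sketched as follows (codes missing from a vector contribute zero, as described above):

```python
from math import sqrt

def cosine_sim(dv1, dv2):
    """Equation (5): cosine similarity of two document vectors stored as
    {word shape code: normalized frequency} dictionaries.  Codes absent
    from a vector contribute zero frequency."""
    dot = sum(f * dv2.get(wsc, 0.0) for wsc, f in dv1.items())
    n1 = sqrt(sum(f * f for f in dv1.values()))
    n2 = sqrt(sum(f * f for f in dv2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```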

It should be noted that documents normally contain a large number of stop words, which

greatly affect the document similarity because they frequently dominate the direction of the con-

verted document vectors. Therefore, stop words must be removed from the converted document

vectors before the document similarity evaluation. In our proposed technique, we simply utilize

the stop words provided by the Cross-Language Evaluation Forum (CLEF) [13]. In particular,

all listed stop words are first transliterated into a stop word template according to the coding

scheme in Table I. The converted document vectors are then updated by removing elements that

share the word shape component with the constructed stop word template.

IV. EXPERIMENTAL RESULTS

This section evaluates the performance of the proposed word image annotation and document

image retrieval techniques. Throughout the experiments, we use 252 text documents selected

from the Reuters-21578 corpus [15], in which every 63 documents deal with one specific topic.


A. Coding Performance

The proposed document retrieval techniques depend heavily on the performance of the

proposed word shape coding scheme. To retrieve document images properly, the collision rate

(the frequency of words that have different spellings but share the same word shape code) of the

word shape coding scheme should be as low as possible. In addition, the coding scheme should

be tolerant to various types of document degradation. In our experiments, we particularly compare

our word shape coding scheme with Spitz’s [3] and our earlier coding schemes that use character

extremum points [5] and vertical bar pattern [21], respectively.

We test the coding collision rate by using a dictionary that is composed of 57000 English

words. First, the 57000 English words are transliterated into word shape codes according to our

proposed word shape coding scheme and the other three. The coding collision rates are then

calculated and the results are shown in Table II. As Table II shows, our proposed word shape

coding scheme significantly outperforms the other three in terms of the coding collision rate.

Such experimental results can be explained by the fact that our coding scheme annotates 26

lowercase Roman letters by 18 codes, while the other three comparison schemes annotate 26

lowercase Roman letters by 6 [3], 9 [21], and 13 [5] codes, respectively.

The coding robustness is then tested on the 252 text documents described above. For each text

document, five test document images are first created including: 1) a synthetic image created by

Photoshop, 2-3) two noisy images by adding impulse noise (noise level = 0.05) and Gaussian

noise (σ = 0.08) to the synthetic image, and 4-5) two real images scanned at 600 dpi (dots per

inch) and 300 dpi, respectively. Therefore, five sets of test document images are created where

each set is composed of 252 document images. After that, words within the five sets of document

images are converted into word shape codes by using the four word image shape coding schemes.

Table III shows the coding accuracy under various types of document degradation.

As Table III shows, Spitz’s character shape coding scheme is the most accurate when the

document image is synthetic. However, for document images scanned at a low resolution, the

accuracy of Spitz’s coding scheme drops severely because of the dramatic increase of character

segmentation error. In addition, compared with our earlier word shape coding schemes [5], [4],

[21], the word shape coding scheme presented in this paper is more tolerant of noise. Furthermore,

the proposed coding scheme is fast. It is 5-8 times faster than our earlier coding schemes [5],


TABLE III

ACCURACY OF SPITZ'S CHARACTER SHAPE CODING SCHEME [3], OUR EARLIER TWO WORD SHAPE CODING SCHEMES [5], [21], AND THE WORD SHAPE CODING SCHEME PRESENTED IN THIS PAPER.

Coding scheme    Synthetic   Impulse noise   Gaussian noise   Scanned at 600 dpi   Scanned at 300 dpi
Method in [3]    0.9617      0.9379          0.9348           0.8372               0.5163
Method in [21]   0.9421      0.7218          0.7195           0.8586               0.8419
Method in [5]    0.9434      0.6583          0.6277           0.8526               0.8418
Present method   0.9524      0.9396          0.9288           0.9316               0.8692

[21] and up to 15 times faster than OCR (evaluated by Omnipage [14]). The speed advantage

can be explained by the fact that our word shape coding scheme needs neither time-consuming

connected component labeling nor complicated post-processing.

B. Retrieval by Query Keywords

The performance of the retrieval by query keywords is then evaluated. First, 137 frequent words

are selected from the 252 text documents as query keywords. The retrieval is then conducted

over the five sets of test document images described above. In our experiments, the retrieval

performance is evaluated by precision (P), recall (R), and the F1 rating [17] defined as follows:

P = \frac{\text{No. of correctly searched words}}{\text{No. of all searched words}}, \qquad
R = \frac{\text{No. of correctly searched words}}{\text{No. of all correct words}}, \qquad
F1 = \frac{2RP}{R + P}    (6)

where the retrieval precision (P) and recall (R) are averaged over all occurrences of the 137

selected keywords.
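The three measures of Equation (6) can be transcribed directly; the function and argument names below are ours:

```python
def retrieval_scores(n_correct, n_searched, n_relevant):
    """Equation (6): precision, recall, and F1 from the number of
    correctly searched words, all searched words, and all correct
    words, respectively."""
    p = n_correct / n_searched
    r = n_correct / n_relevant
    return p, r, 2 * r * p / (r + p)
```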

Table IV shows the experimental results where the retrieval precisions and recalls are evaluated

based on the number of word images searched by using the 137 query keywords. As Table IV

shows, our proposed word shape coding scheme consistently outperforms the other three in terms

of the retrieval precision, recall, and F1. In fact, such experimental results coincide with the

coding performance described in the last subsection.

C. Retrieval by a Query Document Image

The retrieval by a query document image is also evaluated based on the five sets of document

images described in Section IV. Instead of designing retrieval experiments, we simply evaluate the


TABLE IV

PERFORMANCE OF THE RETRIEVAL BY QUERY KEYWORDS EVALUATED BY PRECISION, RECALL, AND F1 PARAMETER.

             Method in [3]   Method in [21]   Method in [5]   Present method
Precision    0.6537          0.7848           0.8411          0.9488
Recall       0.9369          0.8261           0.8136          0.9026
F1           0.7750          0.8099           0.8182          0.9251

similarity between document images of the same and different topics. This is based on the belief

that document images can be ranked properly if their topic similarity can be gauged properly.

In addition, the similarity between the 252 ASCII text documents is also evaluated to verify the

performance of the proposed document image retrieval technique.

In our experiments, the five sets of test images are first converted into document vectors.

The similarity among them is then evaluated as described in Section III. B. In particular, the

similarity between documents of the same topic (315 images created from the 63 text documents

of one specific topic, described in Section IV-A) is evaluated as follows:

Sim = \frac{2}{M(M-1)} \sum_{i=1}^{M} \sum_{j=i+1}^{M} sim(DV_i, DV_j)    (7)

where M is the number of document images of the same topic. DVi and DVj denote the

document vectors (stop words removed) of two document images of the same topic under study.

The function sim() refers to the cosine similarity defined in Equation (5). The similarity between

document images of two different topics (630 document images, with each set of 315 created from the 63

text documents of one specific topic) is evaluated as follows:

Sim = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} sim(DV_i, DV_j)    (8)

where DVi and DVj denote the document vectors of two different topics instead.
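Equations (7) and (8) are plain averages over document pairs; the sketch below takes the similarity function (e.g. the cosine measure of Equation (5)) as a parameter and allows the two topic sets to differ in size, whereas the paper uses M for both:

```python
def intra_topic_sim(dvs, sim):
    """Equation (7): mean similarity over the M(M-1)/2 unordered pairs
    of document vectors drawn from one topic."""
    m = len(dvs)
    total = sum(sim(dvs[i], dvs[j])
                for i in range(m) for j in range(i + 1, m))
    return 2 * total / (m * (m - 1))

def inter_topic_sim(dvs_a, dvs_b, sim):
    """Equation (8): mean similarity over all cross-topic pairs of
    document vectors."""
    total = sum(sim(a, b) for a in dvs_a for b in dvs_b)
    return total / (len(dvs_a) * len(dvs_b))
```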

The upper part of Table V shows the similarity between document images of the same and

different topics. Clearly, the topic similarity between documents of the same topic is much larger

than that between documents of different topics. Archived document images can therefore be

ranked based on the similarity between their document vectors and the query document vector.

In addition, the similarity among the 252 text documents is also evaluated where document


TABLE V

SIMILARITIES BETWEEN DOCUMENTS OF THE SAME AND DIFFERENT TOPICS.

                      Class I   Class II   Class III   Class IV
Text image class I    0.3251    0.2391     0.0644      0.1088
Text image class II   0.2391    0.5896     0.0969      0.1436
Text image class III  0.0644    0.0969     0.2812      0.0326
Text image class IV   0.1088    0.1436     0.0326      0.2659

ASCII text class I    0.3624    0.2734     0.1238      0.1533
ASCII text class II   0.2734    0.6474     0.0873      0.1612
ASCII text class III  0.1238    0.0873     0.3916      0.1195
ASCII text class IV   0.1533    0.1612     0.1195      0.3026

vectors are constructed by using the ASCII text [12]. The lower part of Table V shows the

evaluated document similarity. As Table V shows, the topic similarities evaluated by the proposed

technique are close to those evaluated over the ASCII text, indicating that the proposed technique

captures the document topics properly. It also indicates that the proposed document

retrieval technique is comparable to the OCR-plus-search approach, whose performance would be somewhat

lower (depending on the OCR error) than that evaluated directly over the ASCII text.

V. CONCLUSION

This paper reports a document image retrieval technique that searches document images by

either query keywords or a query document image. A novel word image annotation technique

is presented, which captures the document content by converting each word image into a word

shape code. In particular, we convert word images by using a set of topological character shape

features including character ascenders/descenders, character holes, and character water reservoirs.

Experimental results show that the proposed word image annotation technique is fast, robust,

and capable of retrieving imaged documents effectively.

VI. ACKNOWLEDGMENTS

This research is supported by Agency for Science, Technology and Research (A*STAR),

Singapore, under grant no. 0421010085.


REFERENCES

[1] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.

[2] O. D. Trier and T. Taxt, “Evaluation of Binarization Methods for Document Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 3, pp. 312–315, 1995.

[3] A. L. Spitz, “Determination of Script and Language Content of Document Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 3, pp. 235–245, 1997.

[4] S. Lu and C. L. Tan, “Script and Language Identification in Noisy and Degraded Document Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 1, pp. 14–24, 2008.

[5] S. Lu and C. L. Tan, “Retrieval of Machine-printed Latin Documents through Word Shape Coding,” Pattern Recognition, vol. 41, no. 5, pp. 1816–1826, 2008.

[6] T. Nakayama, “Content-oriented categorization of document images,” International Conference on Computational Linguistics, pp. 818–823, 1996.

[7] T. Nakayama, “Modeling Content Identification from Document Images,” Fourth Conference on Applied Natural Language Processing, pp. 22–27, 1994.

[8] A. L. Spitz, “Using Character Shape Codes for Word Spotting in Document Images,” Shape, Structure and Pattern Recognition, World Scientific, pp. 382–389, 1995.

[9] A. F. Smeaton and A. L. Spitz, “Using character shape coding for information retrieval,” 4th International Conference on Document Analysis and Recognition, pp. 974–978, 1997.

[10] Y. Lu and C. L. Tan, “Information retrieval in document image databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1398–1410, 2004.

[11] C. L. Tan, W. Huang, Z. Yu, and Y. Xu, “Image document text retrieval without OCR,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 838–844, 2002.

[12] G. Salton, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.

[13] http://www.unine.ch/info/clef/.

[14] http://www.nuance.com/omnipage/.

[15] http://kdd.ics.uci.edu/databases/reuters21578.

[16] M. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-based Multimedia Information Retrieval: State-of-the-art and Challenges,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 2, no. 1, pp. 1–19, 2006.

[17] Y. Yang and X. Liu, “A Re-Examination of Text Categorization Methods,” 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49, 1999.

[18] S. Khoubyari and J. J. Hull, “Keyword location in noisy document image,” Second Annual Symposium on Document Analysis and Information Retrieval, pp. 217–231, 1993.

[19] F. R. Chen, D. S. Bloomberg, and L. D. Wilcox, “Spotting Phrases in Lines of Imaged Text,” SPIE Conference on Document Recognition II, pp. 256–269, 1995.

[20] T. M. Breuel, “The Future of Document Imaging in the Era of Electronic Documents,” Proceedings of the International Workshop on Document Analysis, pp. 275–296, 2005.

[21] C. L. Tan, W. Huang, Z. Yu, and Y. Xu, “Text retrieval from document images based on word shape analysis,” Applied Intelligence, vol. 18, no. 3, pp. 257–270, 2003.
