Multi Script Identification from Printed Words


Multi-script Identification from Printed Words

Saumya Jetley1, Kapil Mehrotra1, Atish Vaze1, Swapnil Belhe1

1Centre for Development of Advanced Computing (C-DAC), Pune, India
{saumyaj, kapilm, atishv, swapnilb}@cdac.in

Abstract. In today's multi-script scenario, documents contain page, paragraph, line and up to word level intermixing of different scripts. We need a script recognition approach that can perform well even at the lowest semantically-valid level of words so as to serve as a generic solution. The present paper proposes a combination of Histogram of Oriented Gradients (HoG) and Local Binary Patterns (LBP), extracted over words, to capture the unique and discriminative structural formations of different scripts. Tested over the MILE printed-word data set, this concatenated feature descriptor yields a state-of-the-art average recognition accuracy of 97.4% over a set of 11 Indian scripts.

In an end-to-end document recognition system it is correct to assume a skew correction unit prior to script identification. Depending on the amount of skew, the skew correction unit can either yield a correctly aligned document or an inverted one. For script identification in such scenarios, we introduce a novel set of features - Inversion Invariant HoG (II-HoG) and Inversion Invariant LBP (II-LBP). II-HoG and II-LBP are together used to identify the script irrespective of text inversion. After script recognition, a script-specific HoG and LBP feature combination is used to find the text alignment, i.e. 0° or 180°. For the same database, the first-level inversion-invariant script-identification accuracy over 11 scripts is 95.8%, a 1% gain over the existing best, while the second-level script-specific orientation-detection accuracy averages 97.7%.

1 Introduction

In today's multi-lingual and multi-script setting, script identification has become a necessity for document analysis. A single document commonly contains two or even three distinct scripts. In the Indian context, this bi-script and tri-script scenario is well presented by Pati and Ramakrishnan [1]. State local script and Roman, with Devanagari as an extra addition, are common combinations. Research works have also dealt with script pairs such as Farsi and Latin [2], Han and Roman [3], and Persian and Roman [4].

If the character-recognition engine were to work for even two scripts together, the number of classes would be prohibitively large, not to mention the inefficiency in terms of performance. Thus, it becomes important to identify a priori the script from the known set and accordingly send the text for character recognition. Also, pre-processing tasks like morphological de-noising and text-line/word/character segmentation tend to show script-biased behavior. With knowledge of the script, pre-processing modules tailored to the script can be called for improved performance. Further, script recognition has developed from its primitive role as a precursor to document recognition. It is now also being used for document retrieval based on script similarity [5]. Particularly in this context, multi-script identification becomes relevant. The recognition unit should be equipped to identify more than 2 or 3 different scripts at a time.

Having discussed the application scenarios and importance of bi-script, tri-script as well as multi-script recognition, we need to identify the level at which the approach should operate. [6] considers multi-script printed documents but assumes page-level script uniformity. [7,8,9,10,11] work with printed documents assuming text-block or line-level script uniformity. Methods presented in [12,2,3,1] perform script identification at word or character level, but none of these tackle possible text skew or inversion. Basically, the methods work for a fixed script set and/or assume a particular level of script uniformity (block, line or page level) and thus lack generality.

Our present work aims to address this very lacuna. The novelty of our work is fourfold: (a) our script identification approaches work successfully even for the lowest semantically-valid level of words, (b) given an adequate database, our approaches can be easily extended to any given script set, (c) our novel set of inversion-invariant features, II-HoG and II-LBP, is capable of identifying the script despite text inversion, and finally (d) we propose a complete module to identify the script, even for inverted text, and then find its orientation, i.e. 0° or 180°, for correction before further processing.

For a comparative experimental analysis, we test our approaches on the word-level MILE database of 11 Indian scripts [1]. Using our gradient plus texture feature combination (HoG and LBP), the 11-script recognition accuracy of 97.4% is the new state of the art. Also, for inversion-invariant script identification, our novel features II-HoG and II-LBP, at 95.8%, give a 1% accuracy gain over the existing best.

The complete paper is organized as follows: Section 2 discusses the related works and existing approaches, Sections 3 and 4 present the proposed approaches, Section 5 provides the experimental results and analysis, and Section 6 concludes the work.

2 Related Works and Existing Approaches

The task of script identification has been attempted at different levels: text block, text line, word and even component level.

Script identification at text block level is a commonly used idea. In [7], bi-dimensional empirical mode decomposition (BEMD) followed by extraction of local binary patterns (LBPs) is used to distinguish between English, French, Chinese, Japanese, Russian and Korean scripts over 128x128 sized text blocks. [13] uses texture features from co-occurrence histograms of wavelet-decomposed images; a KNN classifier is then used for block-level recognition of 8 different Indian scripts. In [11], wavelet-energy-based histogram moments are used with an SVM classifier for identification of 6 scripts: Arabic, Chinese, English, Hindi, Thai and Korean.

[8,9,10] perform script recognition at line level with the help of handcrafted structural and statistical features. Ghosh and Chaudhuri [8] introduce the idea of inversion-invariant script identification followed by script-specific orientation detection. However, the approach assumes line-level script uniformity and employs a hierarchical classification setup that is customized for the given script set. Aithal et al. [9] use the line-level horizontal projection profile and its statistical details to distinguish between Hindi, Kannada and English text in trilingual documents. Gopakumar et al. [10] mark out horizontals, verticals, right diagonals and left diagonals in a given text line and carry out zone-based gradient analysis for distinguishing between the 4 South Indian scripts of Kannada, Tamil, Telugu and Malayalam, along with English and Hindi.

The word-level approach adopted by Das et al. [12] shows the same trend of hand-made features and rule-based thresholds, here used to distinguish between Telugu, Hindi and English. Ma and Doermann [14] introduce texture features extracted using Gabor filters and apply them to a variety of bilingual dictionaries for word-level script identification. Following suit, Pati and Ramakrishnan [1] employ a combination of Gabor filters to distinguish between 11 Indian scripts, experimenting with both nearest-neighbour and SVM classifiers.

Another popular set of techniques makes use of component-level script identification, with majority-vote based extension to word, line and page level script identification. [2,5,6,3] employ component-level features and use SVM/KNN for classification. Khoddami and Behrad [2] present rotation and scale-invariant Curvature Scale Space features for identification of Farsi and Latin scripts. Chanda et al. [5] compare two distinct features, rotation-invariant Zernike moments and rotation-variant gradient features, for the task of identifying among 11 Indian scripts. Wang et al. [6] make use of Downgraded Pixel Density features from skeletonized characters for script identification. Chanda et al. [3] use histograms based on directional codes for character-level identification of Japanese, Korean, Chinese and Roman scripts.

To build a generic script recognition system we propose a word-level implementation. Also, a close observation of the variety of approaches presented above reveals one common idea: the ability, and hence wide use, of texture and/or gradient features to successfully distinguish between different scripts. Building on this, we present a combination of both gradient and texture features for script identification. The first proposed approach employs a concatenation of HoG and LBP descriptors, while the second approach introduces a combination of completely novel inversion-invariant II-HoG and II-LBP features. The latter can work at the word level to identify the script even when the text is inverted.

Fig. 1: (a) Structural highlights of 11 different Indian scripts. (b) Evaluation of Local Binary Pattern around a given pixel.

3 Proposed Approach I - HoG and LBP

3.1 Histogram of Oriented Gradients (HoG)

For a task like script identification, important discriminative information lies in the relative proportion of different gradients. As shown in Figure 1a (top to bottom: BE, HI, EN, GU, KA, MA, OD, PU, TA, TE, and UR), due to a necessary word headline (shirorekha) for Devanagari, Bangla and Gurumukhi, horizontal lines (or 0° gradients) are dominant in these scripts. It may depend on the font but, as a general observation, the character-level joints become less curved and increasingly sharp from Devanagari to Gurumukhi to Bengali. The Kannada script frequently shows a horizontal line with an upward curl, while Telugu has a highly common tick mark. Highlights of Oriya, Tamil and Malayalam are an inverted U-shape, vertical lines, and right and left bracket shapes respectively. Urdu is very different from any other Indian script: the majority of its lines have a slope of 0° or other angles in the upper half of the 1st quadrant. These and many other unique structural properties of different scripts, as elaborated in [1], motivate the use of gradient proportions for script identification.

We employ histogram of oriented gradients for script recognition at word level. The position of gradients within the text unit is not important, and HoG [15] is applied at the complete word level without considering any overlapping sub-blocks. Angles lie in the 0°-180° range and are divided into 36 bins based on the empirically evaluated bin size of 5°.
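The word-level histogram described above can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the central-difference gradient, the magnitude weighting of votes, the sum-to-one normalization and the function name `word_hog` are all assumptions, since the text fixes only the 0°-180° range and the 5° bin size.

```python
import math

def word_hog(image, bin_size_deg=5):
    """Orientation histogram over a whole word image (no sub-blocks).

    `image` is a list of rows of gray levels. Only interior pixels are used.
    Central differences and magnitude-weighted votes are assumptions.
    """
    n_bins = 180 // bin_size_deg            # 36 bins for a 5-degree bin size
    hist = [0.0] * n_bins
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dx = image[r][c + 1] - image[r][c - 1]   # horizontal gradient
            dy = image[r + 1][c] - image[r - 1][c]   # vertical gradient
            magnitude = math.hypot(dx, dy)
            if magnitude == 0:
                continue                    # flat region carries no orientation
            # Fold the signed angle into the unsigned [0, 180) range.
            angle = math.degrees(math.atan2(dy, dx)) % 180.0
            index = min(int(angle // bin_size_deg), n_bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

Because votes are pooled over the entire word, the descriptor length stays at 36 regardless of the word's width or height.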

3.2 Local Binary Patterns (LBP)

Local binary patterns capture the image texture embedded in the gray-level variations of the immediate neighborhood of image pixels. Figure 1b shows the computation of the local binary pattern around a particular pixel in an image. For an LBP-based image texture analysis, the count of each binary pattern value is summed up over the image to yield the LBP histogram. For a 3x3 window analysis, the 256 distinct binary pattern values yield a feature descriptor of the same length.
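A minimal sketch of the 3x3 LBP histogram follows. The neighbour ordering (and therefore which neighbour receives which power of two), the `>=` threshold against the centre pixel and the restriction to interior pixels are assumptions; Figure 1b defines the actual evaluation.

```python
def lbp_histogram(image):
    """256-bin histogram of 3x3 local binary patterns over interior pixels."""
    # Clockwise neighbour offsets starting at the top-left corner.
    # This ordering is an assumption made for illustration.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre sets its bit.
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            hist[code] += 1
    return hist
```

Eight binary comparisons give 2^8 = 256 possible codes, which is where the descriptor length comes from.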

3.3 HoG and LBP based classification

The two feature vectors extracted as above are concatenated to yield the final feature descriptor for the word image. Given 36 normalized HoG features and 256 LBP features, the total length of the descriptor adds up to 292. We use an SVM [16] for the task of multi-class classification. In order to handle non-linear class boundaries, the SVM uses a radial basis function kernel.
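As a small illustration of the two steps named above, the sketch below concatenates the descriptors and evaluates the standard radial basis function kernel, exp(-gamma * ||x - y||^2), between two vectors. The function names and the `gamma` value are hypothetical placeholders; the paper does not report its SVM hyperparameters, and the actual training is done with LIBSVM [16].

```python
import math

def concatenate_features(hog, lbp):
    """Join the HoG and LBP vectors into one word descriptor."""
    return list(hog) + list(lbp)

def rbf_kernel(x, y, gamma=0.1):
    """RBF kernel value exp(-gamma * ||x - y||^2); gamma is a placeholder."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Dummy all-zero vectors, used only to show the resulting length.
descriptor = concatenate_features([0.0] * 36, [0.0] * 256)
print(len(descriptor))  # 292
```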

This feature set is highly accurate on the word-level MILE database. With minimal training, it yields state-of-the-art results. However, these features are not invariant to text inversion.

4 Proposed Approach II - Inversion Invariant HoG (II-HoG) and Inversion Invariant LBP (II-LBP)

For an end-to-end printed document analysis system it would be correct to assume a skew-correction unit prior to the script identification module. Without the knowledge of script, the skew-correction module can make alignment errors. For acute angle skews, the skew-corrected text is properly aligned. However, for obtuse angle skews, the text may get inverted during de-skewing. To handle both these scenarios, our proposed system flow is presented in Figure 2.

Fig. 2: Proposed system flow

4.1 Skew Correction Unit

We experimented with the skew-correction module of the Leptonica library, based on the work by Bloomberg et al. [17]. Skew-corrected outputs for 2 different (Bangla) text orientations, as shown in Figure 3, confirm the idea behind the above proposed flow. The text block with an acute skew of 20°, Figure 3a, gets correctly aligned after de-skewing, while the text block with an obtuse skew of 150°, Figure 3b, gets inverted.

We introduce a completely novel set of inversion-tolerant features for script identification. For a given text segment, the output feature vector is the same regardless of its orientation (0° or 180°). Thus, inversion, if present, is ignored and the task becomes one of plain script discrimination.

Fig. 3: (a) Skew-corrected output for a text block rotated by 20° (clockwise) is correctly aligned. (b) Skew-corrected output for a text block rotated by 150° (anti-clockwise) is inverted.

4.2 Inversion Invariant HoG (II-HoG)

When the text is inverted, gradients in the 0°-90° range shift to the 90°-180° range and vice-versa. Inversion invariance can be achieved by either preventing this shift or staying independent of this shift. We have attempted to achieve invariance by staying independent. This is done by mapping all the gradients into the first quadrant, i.e. the 0°-90° range.

Gradient at a pixel is calculated as:

$$\varphi = \arctan\left(\frac{dy}{dx}\right)$$

where dy is the vertical gradient and dx is the horizontal gradient at a given pixel point.

The following ensures that the gradients stay between 0°-90°:

    if (dx < 0) dx = dx * -1
    if (dy < 0) dy = dy * -1

Thus, dx and dy values stay positive and as a result angles lie in the 0°-90° range. To keep the bin size as 5° the number of bins is reduced to 18.
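The folding step can be checked with a short sketch: taking absolute values of dx and dy before computing the angle gives the same histogram for a word image and its 180°-rotated copy. The central-difference gradient, magnitude weighting, normalization and the choice to place an angle of exactly 90° in the last of the 18 bins are assumptions. Section 4.4 counts 19 II-HoG features; how that extra feature arises is not stated, so this sketch follows the 18-bin figure given here.

```python
import math

def ii_hog(image, bin_size_deg=5, n_bins=18):
    """Orientation histogram with gradients folded into the first quadrant."""
    hist = [0.0] * n_bins
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dx = abs(image[r][c + 1] - image[r][c - 1])  # if dx < 0: dx = -dx
            dy = abs(image[r + 1][c] - image[r - 1][c])  # if dy < 0: dy = -dy
            if dx == 0 and dy == 0:
                continue
            angle = math.degrees(math.atan2(dy, dx))     # lies in [0, 90]
            # Exactly 90 degrees is clamped into the last bin (assumption).
            index = min(int(angle // bin_size_deg), n_bins - 1)
            hist[index] += math.hypot(dx, dy)
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def rotate_180(image):
    """Invert a row-major image (rotation by 180 degrees)."""
    return [row[::-1] for row in image[::-1]]
```

Rotating by 180° negates both dx and dy at every pixel, so the absolute values, and hence the bin each pixel votes for, are unchanged. The two histograms agree up to floating-point summation order.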

4.3 Inversion Invariant LBP (II-LBP)

For invariance to text inversion, we introduce a novel set of LBP features, i.e. II-LBP. Its evaluation is as shown in Figure 4a. Re-assignment of weights makes the decimal-equivalent inversion tolerant. As is illustrated in Figure 4a, the decimal value of the binary pattern for a given pixel remains the same despite the inversion of the pixel's neighborhood. Also, the 256 LBP values get reduced to a count of 31.
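The exact weight re-assignment is the one drawn in Figure 4a, which is not reproduced in this text. One assignment that is consistent with the stated count of 31 is to give diametrically opposite neighbours the same weight (1, 2, 4 and 8 for the four pairs), so that codes run from 0 to 2 x (1 + 2 + 4 + 8) = 30. The sketch below assumes that assignment, and which pair gets which weight is arbitrary; it should be read as an illustration of the idea rather than the authors' definition.

```python
def ii_lbp_histogram(image):
    """31-bin LBP histogram whose codes survive a 180-degree rotation."""
    # Opposite neighbours share a weight (assumed assignment, see lead-in).
    weighted_offsets = [((-1, -1), 1), ((1, 1), 1),
                        ((-1, 0), 2), ((1, 0), 2),
                        ((-1, 1), 4), ((1, -1), 4),
                        ((0, 1), 8), ((0, -1), 8)]
    hist = [0] * 31                      # codes 0..30
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = sum(weight
                       for (dr, dc), weight in weighted_offsets
                       if image[r + dr][c + dc] >= centre)
            hist[code] += 1
    return hist

def rotate_180(image):
    """Invert a row-major image (rotation by 180 degrees)."""
    return [row[::-1] for row in image[::-1]]
```

Under inversion each neighbour swaps places with its opposite, and since both carry the same weight the code at every pixel, and therefore the histogram, is unchanged.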

Fig. 4: (a) Inversion Invariance - Evaluation of II-LBP. (b) Recognition accuracies for the 55 bi-script scenarios evaluated using HoG & LBP feature set.

4.4 II-HoG and II-LBP

The final feature vector is a concatenation of the two feature descriptors evaluated as above. Both the techniques are invariant to text inversion and so is their combination. The complete feature vector has a reduced length of 50, with 31 features belonging to II-LBP and 19 to II-HoG. These features are learnt using a multi-class SVM based on a radial basis function kernel.

5 Experimental Results and Analysis

For a comparative analysis, we tested our approaches on the printed-word MILE Database compiled by Pati and Ramakrishnan [1]. This database contains 20,000 printed word binary samples for 11 different Indian scripts - Bangla (BE), Devanagari (HI), Roman (EN), Gujarati (GU), Kannada (KA), Malayalam (MA), Odiya (OD), Gurumukhi (PU), Tamil (TA), Telugu (TE) and Urdu (UR). To make the binary images suitable for texture as well as gradient analysis, we smooth them using a 3x3 averaging filter. For most practical purposes, we divide the database into 2,000 training samples and 18,000 testing samples.
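The smoothing step can be sketched as a plain 3x3 box filter. The border handling (edge pixels replicated outwards) and the function name `mean_filter_3x3` are assumptions, as the text does not say how image borders are treated.

```python
def mean_filter_3x3(image):
    """Smooth a row-major image with a 3x3 averaging (box) filter."""
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        # Replicate border pixels for out-of-range coordinates (assumption).
        return image[min(max(r, 0), rows - 1)][min(max(c, 0), cols - 1)]

    return [[sum(pixel(r + dr, c + dc)
                 for dr in (-1, 0, 1)
                 for dc in (-1, 0, 1)) / 9.0
             for c in range(cols)]
            for r in range(rows)]
```

On a binary image this turns hard 0/1 stroke edges into graded gray levels, which gives both the gradient and the texture descriptors more than two intensity values to work with.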

| No. of training samples | Test accuracy (in %) |
|---|---|
| 300 | 94.4 |
| 600 | 95.8 |
| 1000 | 96.6 |
| 2000 | 97.4 |

(a) 11-script test accuracy with increasing number of training samples

| Local script (with EN & HI) | Test accuracy (in %) |
|---|---|
| BE | 99.1 |
| GU | 99.2 |
| KA | 99.2 |
| MA | 98.3 |
| OD | 99.1 |
| PU | 97 |
| TA | 98.4 |
| TE | 99.3 |
| UR | 99.5 |
| μ (mean) | 98.8 |

(b) Tri-script recognition accuracies evaluated using HoG & LBP feature set

| Script | Test accuracy (in %) |
|---|---|
| BE | 99.3 |
| HI | 99.7 |
| EN | 95.4 |
| GU | 96.5 |
| KA | 98 |
| MA | 97.7 |
| OD | 98.3 |
| PU | 99.1 |
| TA | 95.6 |
| TE | 96.6 |
| UR | 98.4 |
| μ (mean) | 97.7 |

(c) Script-specific orientation detection accuracies using HoG and LBP combination

Table 1

5.1 HoG and LBP based classification

For the 11-script classification task, the SVM classifier is trained on an increasing number of training samples, from 300 to 2,000. The test results on the 18,000-sample set are compiled in Table 1a. With just 600 training samples (< 1/11th of the training samples assumed in [1]), the test accuracy becomes the new state of the art with a gain of 1%.

Using the same 600 training samples and 18,000 test samples, the accuracy results for the 55 bi-script scenarios are shown in Figure 4b. In [1], using their best configuration of Gabor features and SVM classifier, the three lowest bi-script recognition results are Telugu-Kannada (91%), Urdu-Gurumukhi (93.7%) and Gurumukhi-Hindi (94.2%); our approach betters these by 4.1%, 6% and 2% respectively. The average accuracy over the 55 bi-script scenarios is 99%, an average gain of 0.6%.
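The count of 55 follows from choosing 2 scripts out of 11. A short enumeration, using the script codes from the database description, makes this explicit:

```python
from itertools import combinations

# Script codes as listed for the MILE database.
SCRIPTS = ["BE", "HI", "EN", "GU", "KA", "MA", "OD", "PU", "TA", "TE", "UR"]

# Every unordered pair of distinct scripts is one bi-script scenario.
bi_script_pairs = list(combinations(SCRIPTS, 2))
print(len(bi_script_pairs))  # 55
```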

For the 9 tri-script scenarios of Roman and Devanagari with each of the 9 remaining local scripts, the recognition accuracies are presented in Table 1b. Our approach achieves a total accuracy increment of approx. 5.1%.

5.2 II-HoG and II-LBP based classification

As the feature vector length for the inversion-invariant descriptor is only 50, we could experiment with an increased number of training samples. Thus, we trained the approach on a set of 6,000 word images and tested it on 14,000 word images, both containing a mix of inverted and non-inverted samples. An 11-script test accuracy of 95.8% is achieved. Along with tolerance to text inversion, the feature set shows an average accuracy gain of 1% (against [1]) over 11 different Indian scripts. Given the high performance of this feature descriptor for the 11-script set and its similarity to HoG and LBP features, we are confident of top recognition results for the bi-script as well as the tri-script scenarios.

The next-level script-specific orientation detection is performed by the HoG and LBP combination. Test accuracy figures are shown in Table 1c. For each script, two classes are considered: one for non-inverted text and the other for inverted text. 600 word samples are used for training and 18,000 word samples for testing. The average orientation detection accuracy over 11 scripts is 97.7%.
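The two-level decision described in this section can be sketched as a small driver function. The classifier arguments are hypothetical callables standing in for the trained SVMs (one inversion-invariant script classifier, and one two-class orientation classifier per script); only the control flow is taken from the text.

```python
def rotate_180(image):
    """Invert a row-major image (rotation by 180 degrees)."""
    return [row[::-1] for row in image[::-1]]

def identify_and_orient(word_image, script_classifier, orientation_classifiers):
    """Identify the script first, then resolve 0 vs 180 degree orientation.

    script_classifier(image) returns a script label and is assumed to be
    trained on inversion-invariant features (II-HoG and II-LBP).
    orientation_classifiers[label](image) returns True for inverted text and
    is assumed to be trained on the script-specific HoG and LBP features.
    """
    script = script_classifier(word_image)
    inverted = orientation_classifiers[script](word_image)
    upright = rotate_180(word_image) if inverted else word_image
    return script, upright
```

Keeping the orientation classifiers in a dictionary keyed by script label mirrors the per-script two-class setup, so a new script can be added without retraining the existing orientation models.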

6 Conclusion

The present work uses a combination of gradient (HoG) and texture (LBP) features to yield state-of-the-art recognition accuracies over a set of 11 Indian scripts. It also introduces two completely new feature descriptors that are tolerant to image inversion - Inversion Invariant HoG and LBP. These are combined and used for script recognition in cases where the text may be inverted. These inversion-tolerant features also give high recognition results, surpassing the existing best by approx. 1%. Both the proposed approaches perform at the word level and can quickly adapt to any new script given sufficient data samples. Thus, our approach is generic and can easily be integrated into various practical document recognition systems for improved performance.

References

1. P. B. Pati and A. G. Ramakrishnan, "Word level multi-script identification," Pattern Recogn. Lett., pp. 1218-1229, 2008.

2. M. Khoddami and A. Behrad, "Farsi and latin script identification using curvature scale space features," in Neural Network Applications in Electrical Engineering (NEUREL), 2010 10th Symposium on, 2010, pp. 213-217.

3. S. Chanda, U. Pal, K. Franke, and F. Kimura, "Script identification: A han and roman script perspective," in Pattern Recognition (ICPR), 2010 20th International Conference on, 2010, pp. 2708-2711.

4. K. Roy, A. Alaei, and U. Pal, "Word-wise handwritten persian and roman script identification," in Frontiers in Handwriting Recognition (ICFHR), 2010 International Conference on, 2010, pp. 628-633.

5. S. Chanda, K. Franke, and U. Pal, "Identification of indic scripts on torn-documents," in Document Analysis and Recognition (ICDAR), 2011 International Conference on, 2011, pp. 713-717.

6. N. Wang, L. Lam, and C. Suen, "Noise tolerant script identification of printed oriental and english documents using a downgraded pixel density feature," in Pattern Recognition (ICPR), 2010 20th International Conference on, 2010, pp. 2037-2040.

7. J. Pan and Y. Tang, "A rotation-robust script identification based on bemd and lbp," in Wavelet Analysis and Pattern Recognition (ICWAPR), 2011 International Conference on, 2011, pp. 165-170.

8. S. Ghosh and B. Chaudhuri, "Composite script identification and orientation detection for indian text images," in Document Analysis and Recognition (ICDAR), 2011 International Conference on, 2011, pp. 294-298.

9. P. Aithal, G. Rajesh, D. Acharya, and N. Subbareddy, "Text line script identification for a tri-lingual document," in Computing Communication and Networking Technologies (ICCCNT), 2010 International Conference on, 2010, pp. 1-3.

10. R. Gopakumar, N. Subbareddy, K. Makkithaya, and D. Acharya, "Zone-based structural feature extraction for script identification from indian documents," in Industrial and Information Systems (ICIIS), 2010 International Conference on, 2010, pp. 420-425.

11. L. Zhou, X. Ping, E. Zheng, and L. Guo, "Script identification based on wavelet energy histogram moment features," in Signal Processing (ICSP), 2010 IEEE 10th International Conference on, 2010, pp. 980-983.

12. M. Das, D. Rani, and C. R. K. Reddy, "Heuristic based script identification from multilingual text documents," in Recent Advances in Information Technology (RAIT), 2012 1st International Conference on, 2012, pp. 487-492.

13. P. Hiremath, S. Shivashankar, J. Pujari, and V. Mouneswara, "Script identification in a handwritten document image using texture features," in Advance Computing Conference (IACC), 2010 IEEE 2nd International, 2010, pp. 110-114.

14. H. Ma and D. Doermann, "Word level script identification for scanned document images," in Proc. of Int. Conf. on Document Recognition and Retrieval (SPIE), 2004, pp. 178-191.

15. N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005, pp. 886-893.

16. C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011.

17. D. S. Bloomberg, G. E. Kopec, and L. Dasari, "Measuring document image skew and orientation," in IS&T/SPIE's Symposium on Electronic Imaging: Science & Technology. International Society for Optics and Photonics, 1995, pp. 302-316.