New Arabic OCR Technology with advanced learning of new fonts and characters

By Eng. Ahmed Hossam El Din

Agenda

• Technology Overview

• Technology Details

• Sample Result

Technology Overview

• The technology aims to give users a systematic way to recognize text, with the ability to learn new fonts through a factor-tuning process and to learn a new character within a given font.

• The technology mainly uses geometric features to describe segments, in the form of a vector of about 60 values assisted by a vector of flags.

• Overlapping characters, especially in traditional Arabic, are processed as a group of up to 3 connected characters.

• The technology is implemented in Visual Basic and is ready to be coded in any suitable language or technology.

[Figure: geometric features example, with labels for a closed branch, a vertical branch, a dot, and the vector of geometric features]

Technology Details

1. Image processing

2. Zone startup & learning new font

3. Line startup

4. Word startup

5. Segmentation

6. Classification

7. Connection

8. Recognition

9. Correction and learning new character

1- Image processing

• Scan in black and white (one-bit color).

• Crop to the desired area.

• Tilt the image so that the text lines are horizontal.

[Figure: tilt process]
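
The source does not say how the tilt angle is found. As a minimal sketch of one common approach, and not necessarily the one used here, the code below tries a few candidate slopes, shears the one-bit image by each, and keeps the slope that makes the row-by-row ink profile sharpest. The function names and the use of a vertical shear as a small-angle stand-in for rotation are assumptions made for illustration.

```python
def row_profile(image):
    """Number of ink pixels in each row of a binary image (list of rows)."""
    return [sum(1 for pixel in row if pixel) for row in image]


def shear(image, slope):
    """Shift every column x vertically by round(slope * x).

    Pixels that would leave the image are dropped. For small angles this
    approximates a rotation (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                new_y = y + round(slope * x)
                if 0 <= new_y < height:
                    out[new_y][x] = 1
    return out


def sharpness(image):
    """Sum of squared row counts: largest when ink is packed into few rows."""
    return sum(count * count for count in row_profile(image))


def estimate_tilt(image, candidate_slopes):
    """Return the candidate slope whose shear gives the sharpest row profile."""
    return max(candidate_slopes, key=lambda s: sharpness(shear(image, s)))
```

Applying `shear(image, estimate_tilt(image, slopes))` then yields the corrected image; a finer list of candidate slopes gives a finer correction at a higher cost.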

2- Zone startup & learning new font

• Assign predefined font factors

• Define the factors of a new font according to the hints shown beside the factors window (learning a new font).

[Figure: factors of current and new fonts]
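
To illustrate the idea of predefined font factors and a new font derived by tuning them, here is a small sketch. The source does not list the actual factors, so every field name below (`stroke_thickness`, `dot_max_size`) and the font names are hypothetical; only the limit of 3 connected characters comes from the overview.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FontFactors:
    """Tunable factors of one font. All field names are hypothetical."""
    name: str
    stroke_thickness: int          # expected pen thickness in pixels (assumed)
    dot_max_size: int              # largest box side still treated as a dot (assumed)
    max_connected_chars: int = 3   # the overview mentions groups of up to 3 characters


# Placeholder table of predefined fonts; the values are illustrative only.
PREDEFINED_FONTS = {
    "Font A": FontFactors("Font A", stroke_thickness=3, dot_max_size=6),
}


def learn_new_font(fonts, base_name, new_name, **tuned_factors):
    """Register a new font by copying a known font and overriding tuned factors."""
    new_font = replace(fonts[base_name], name=new_name, **tuned_factors)
    fonts[new_name] = new_font
    return new_font
```

Starting from an existing font and overriding only the factors that differ mirrors the "assign predefined factors, then tune" order of the two bullets.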

3- Line startup

• Detect start and end of the line

• Detect the upper and lower limits

• Detect the middle line

• Detect the font thickness
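
The four detections above can be sketched with simple pixel counting on a binary image. This is an illustration under assumptions, not the author's method: the middle line is taken to be the row with the most ink, and the font thickness is taken to be the most frequent height of a vertical run of ink. All function names are invented for the example.

```python
from collections import Counter


def find_line_limits(image):
    """Return (top, bottom) inclusive row ranges of consecutive rows with ink."""
    limits, top = [], None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and top is None:
            top = y
        elif not has_ink and top is not None:
            limits.append((top, y - 1))
            top = None
    if top is not None:
        limits.append((top, len(image) - 1))
    return limits


def line_start_end(image, top, bottom):
    """First and last column that contain ink between rows top..bottom."""
    columns = [x for x in range(len(image[0]))
               if any(image[y][x] for y in range(top, bottom + 1))]
    return columns[0], columns[-1]


def middle_line(image, top, bottom):
    """Row with the most ink (assumed to be where letters connect)."""
    return max(range(top, bottom + 1), key=lambda y: sum(image[y]))


def font_thickness(image, top, bottom):
    """Most frequent height of a vertical ink run within the line."""
    runs = Counter()
    for x in range(len(image[0])):
        run = 0
        for y in range(top, bottom + 2):       # one extra step closes the last run
            if y <= bottom and image[y][x]:
                run += 1
            elif run:
                runs[run] += 1
                run = 0
    return runs.most_common(1)[0][0]
```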

4- Word startup

• Adjust the middle line for the new position.

5- Segmentation

• Detect start of the segment

• Detect end of the segment

• (A segment is a bounded piece of the text image: any bump or dot above or below the middle line.)

• Record the boundary of the segment image.
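
As a rough sketch of this step, the code below scans the columns of a line and groups adjacent columns that have ink outside the middle-line band into one segment, recording its bounding box. Treating the middle line as a band of rows (`band_top` to `band_bottom`, assumed to come from the line startup step) and grouping by adjacent columns are simplifications made for the example; the actual segmentation rules are not given in the source.

```python
def segment_line(image, band_top, band_bottom):
    """Return bounding boxes (left, top, right, bottom) of bumps and dots.

    Only ink outside the rows band_top..band_bottom is considered, so the
    connecting stroke on the middle line does not join separate bumps.
    """
    height, width = len(image), len(image[0])

    def rows_outside_band(x):
        return [y for y in range(height)
                if image[y][x] and not band_top <= y <= band_bottom]

    boxes, current = [], None
    for x in range(width + 1):                 # one extra step closes the last box
        rows = rows_outside_band(x) if x < width else []
        if rows:
            if current is None:
                current = [x, min(rows), x, max(rows)]
            else:
                current[1] = min(current[1], min(rows))
                current[2] = x
                current[3] = max(current[3], max(rows))
        elif current is not None:
            boxes.append(tuple(current))
            current = None
    return boxes
```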

6- Classification

• Extract the geometric features of a single segment as a vector, after thresholding and normalization.

• Extract the segment's vector of deduced flags.
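
The roughly 60 features and the flags used by the technology are not listed in the source. The sketch below only illustrates the general shape of the step with four invented, size-normalized features and three invented flags; it covers normalization only, since the source does not specify the thresholding. Every name here is hypothetical.

```python
def geometric_features(bitmap):
    """Size-normalized features of a non-empty binary bitmap, each in [0, 1]."""
    height, width = len(bitmap), len(bitmap[0])
    ink = [(x, y) for y in range(height) for x in range(width) if bitmap[y][x]]
    count = len(ink)
    return [
        min(width, height) / max(width, height),        # squareness of the box
        count / (width * height),                       # ink density
        (sum(x for x, _ in ink) / count + 0.5) / width,   # horizontal centroid
        (sum(y for _, y in ink) / count + 0.5) / height,  # vertical centroid
    ]


def segment_flags(box, middle_row, thickness):
    """Yes/no properties deduced from a segment's bounding box.

    box is (left, top, right, bottom); middle_row and thickness are assumed
    to come from the line startup step.
    """
    left, top, right, bottom = box
    width, height = right - left + 1, bottom - top + 1
    return {
        "above_middle": bottom < middle_row,
        "below_middle": top > middle_row,
        "dot_like": max(width, height) <= 2 * thickness,
    }
```

Dividing by the box size is what makes the features comparable across font sizes, which is the role normalization plays in the bullet above.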

7- Connection

• Connect different successive segments into one character.

• Integrate their features into one feature vector (the classifier).
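
How the features are integrated is not described in the source. One simple possibility, shown here only as an assumption, is to take the union of the segments' bounding boxes and concatenate their feature vectors in order.

```python
def connect_segments(segments):
    """Merge successive segments into one character candidate.

    segments is a list of (box, features) pairs, where box is
    (left, top, right, bottom). Returns (merged_box, combined_features).
    """
    boxes = [box for box, _ in segments]
    merged_box = (
        min(box[0] for box in boxes),
        min(box[1] for box in boxes),
        max(box[2] for box in boxes),
        max(box[3] for box in boxes),
    )
    # Concatenation keeps each segment's features in reading order.
    combined = [value for _, features in segments for value in features]
    return merged_box, combined
```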

8- Recognition

• Get the character classes that match the classifier.

• Sort the candidate characters in weighted order.

• Choose the most probable candidate.

• Assign character position in the text corresponding to the character boundary in the image.
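
The matching, weighted sorting, and selection can be sketched as a nearest-template search. The weighted absolute difference used as the distance, the `max_distance` cutoff, and the template dictionary are assumptions for illustration; the source does not specify the matching rule.

```python
def rank_candidates(classifier, templates, weights, max_distance):
    """Return [(character, distance)] for matching templates, best first.

    templates maps each character to one stored feature vector. A template
    matches when its weighted distance to the classifier is within
    max_distance.
    """
    ranked = []
    for character, template in templates.items():
        if len(template) != len(classifier):
            continue                      # vectors of different shape cannot match
        distance = sum(w * abs(a - b)
                       for w, a, b in zip(weights, classifier, template))
        if distance <= max_distance:
            ranked.append((character, distance))
    ranked.sort(key=lambda item: item[1])
    return ranked


def recognize(classifier, templates, weights, max_distance):
    """Most probable character, or None when nothing matches."""
    ranked = rank_candidates(classifier, templates, weights, max_distance)
    return ranked[0][0] if ranked else None
```

Keeping the full ranked list, and not only the winner, leaves the runner-up candidates available to a later correction step.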

Text before cleaning

Automatic recognition and cleaning produce better results

عبير الرسى وهانى عزت وحازم 0القاهرة

أبودومة تعقد األحزاب والقوى السياسية اجتماعا غدا ،

لمناقشة االقترا ح الذى تقدم به النائب وحيد عبدالمجيد،

بش وضع معايير جديدة لتشكيل الجمعية التأسيسية

يمثلون مختلف 23للدستور، وتشمل مائة عضو، من بينهم

من النقابات المهنية ، ومثلهم من 8ا حزاب السياسية ، و

شباب الثورة د ومثلى الطالب د واتحاد ات العمال

من خبراء القان 1تحادات النوعية ، و 9والفالحين، وا

من 3من المؤسسات الدينية ، و 9والهيئات القضائية ، و

من 71السلطة التنفيذية الجي ، والشرطة ، والحكومة ع ،

الشخصيات العامة واقترح سامح عاشورنقيب المحامين

رئيس الجلس االستشارى ورئيس الجبهة الوطنية

بما يجعل الجمعية التأسيسية 6الديمقراطية ،تعديل المادة د

تستمد سلطاتها من الدستور فقط ، وال تخضع لسلطات

Text after cleaning


9- Correction and learning new character

• At the mouse position, the character is displayed both from the text and from the image. The user can correct it and teach it to the character set of the chosen font. The learning process is intelligent, so that it prevents illogical learning.

[Figure: learning of new character, with labels for the character from text, the character from image, and the input of the new character]

On a mouse click, the program shows the character from both the text and the image; the user can input the correct character, and the program will learn it.
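
A minimal sketch of the learning step with a guard against illogical input is given below. The source does not say which checks the program performs, so the three checks here (a label of 1 to 3 characters, a vector of the expected length, and no identical vector already stored under a different character) are assumptions, as are all the names.

```python
def learn_character(char_set, label, classifier, vector_length):
    """Store classifier as a template for label; return False if rejected.

    char_set maps each character (or group of up to 3 connected characters)
    to a list of stored feature vectors for the chosen font.
    """
    vector = list(classifier)
    if not label or len(label) > 3:
        return False                      # empty or overlong correction
    if len(vector) != vector_length:
        return False                      # wrong shape for this font
    for other, templates in char_set.items():
        if other != label and vector in templates:
            return False                  # same shape already means another character
    templates = char_set.setdefault(label, [])
    if vector not in templates:
        templates.append(vector)
    return True
```

Rejected corrections leave the character set unchanged, so one mistaken click cannot overwrite what was learned earlier.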

Sample Result