Turkish Fingerspelling Recognition System Using Axis of Least Inertia Based Fast Alignment

Post on 26-Jan-2023

A. Sattar and B.H. Kang (Eds.): AI 2006, LNAI 4304, pp. 473 – 481, 2006. © Springer-Verlag Berlin Heidelberg 2006

Oğuz Altun, Songül Albayrak, Ali Ekinci, and Behzat Bükün

Yıldız Technical University, Computer Engineering Department, Yıldız, İstanbul, Türkiye

{oguz, songul}@ce.yildiz.edu.tr, behzat.bukun@gmail.com, ali_ekinci@yahoo.com

Abstract. Fingerspelling is used in sign language to spell out names of people and places for which there is no sign or for which the sign is not known. In this work we describe a Turkish fingerspelling recognition system that recognizes all 29 letters of the Turkish alphabet. A single representative frame is extracted from each sign video, since that frame is sufficient for recognizing the letters. Processing a single frame instead of the whole video increases speed considerably. The skin regions in the representative frame are extracted by color segmentation in YCrCb space, and noise regions are then cleared by morphological opening. A novel fast alignment method that uses the angle of orientation between the axis of least inertia and the y axis is applied to the hand regions. This method compensates for small orientation differences but increases large ones. This is desirable when differentiating fingerspelling signs, some of which are close in shape but different in orientation. We also advise the use of a minimum bounding square, which helps in resizing without breaking the alignment. The binary values of this minimum bounding square are used directly as feature values, which allowed us to experiment with different classification schemes. Features such as mean radial distance and circularity are also used to increase the success rate. Classifiers such as kNN, SVM, Naïve Bayes, and RBF Network are evaluated, and 1NN and SVM are found to be the best two. The video database was created by 3 different signers; a set of 290 training videos and a separate set of 174 test videos are used in the experiments. The best classifiers, 1NN and SVM, achieved success rates of 99.43% and 98.85% respectively.

Keywords: Turkish Fingerspelling Recognition, Fast Alignment, Angle of orientation, Axis of Least Inertia, Minimum Bounding Square, Classification.

1 Introduction

Sign language is a visual means of communication using gestures, facial expression, and body language. It is used mainly by deaf people and people with hearing difficulties. There are two major types of communication in sign language. The first has a word based sign vocabulary, where gestures, facial expression, and body language are used for the most common words. The second has a letter based vocabulary and is called fingerspelling, a method of spelling words using hand movements. Fingerspelling is used to spell out names of people and places for which there is no sign, to spell words whose sign the signer does not know, or to clarify a sign that is not known by the person reading the signer [1].

Sign languages develop within their own communities and are not universal. For example, ASL (American Sign Language) is totally different from British Sign Language even though both countries speak English [2]. In automatic sign language recognition, there are successful systems for American Sign Language (SL) [3], Australian SL [4], and Chinese SL [5].

Previous approaches to word level sign recognition rely heavily on statistical models such as Hidden Markov Models (HMMs). A real-time ASL recognition system developed by Starner and Pentland [3] used colored gloves to track and identify the left and right hands. They extracted global features that represent the positions, the angle of the axis of least inertia, and the eccentricity of the bounding ellipse of the two hands. Using an HMM recognizer with a known grammar, they achieved 99.2% accuracy at the word level for 99 test sequences. For TSL (Turkish Sign Language), Haberdar and Albayrak [6] developed a recognition system from video using HMMs for the trajectories of the hands. The system achieved a word accuracy of 95.7% by concentrating only on the global features of the generated signs. It is the first comprehensive study on TSL and recognizes 50 isolated signs. This study was later improved with local features and performs person dependent recognition of 172 isolated signs in two stages with an accuracy of 93.31% [7].

For fingerspelling recognition, the most successful approaches are based on instrumented gloves, which provide information about finger positions. Lamar and Bhuiyant [8] achieved letter recognition rates ranging from 70% to 93%, using colored gloves and neural networks. More recently, Rebollar et al. [9] used a more sophisticated glove to classify 21 out of 26 letters with 100% accuracy. The worst case, letter 'U', achieved 78% accuracy. Isaacs and Foo [10] developed a two layer feed-forward neural network that recognizes the 24 static letters in the American Sign Language (ASL) alphabet using still input images. Their ASL fingerspelling recognition system achieves 99.9% accuracy with an SNR as low as 2. Feris, Turk and others [11] used a multi-flash camera with flashes strategically positioned to cast shadows along depth discontinuities in the scene, allowing efficient and accurate hand shape extraction. Altun et al. [12] increased the effect of fingers in Turkish fingerspelling shapes by thick edge detection and correlation with penalization. They achieved 99% accuracy on 203 sign videos of the 29 Turkish alphabet letters.

In this work, we have developed a signer independent fingerspelling recognition system for Turkish Sign Language (TSL). The representative frames are extracted from sign videos. Hand objects in these frames are segmented out by skin color in YCrCb space. These hand objects are aligned using the novel angle of orientation based fast alignment method. Then, the aligned object is moved into the center of a minimum bounding square and resized. The binary values of the minimum bounding square are used as classification features, in addition to binary object features such as mean radial distance and circularity. We experimented with different classification schemes and report their success rates.

The remainder of this paper is organized as follows: In Section 2 we describe the representative frame extraction, our fast alignment method, and the extraction of additional object features. Section 3 covers the video database we use. Section 4 lists the classification schemes we used. Finally, conclusions and future work are addressed in Section 5.

2 Feature Extraction

Unlike Turkish Sign Language word signs, Turkish fingerspelling signs have a static structure and can therefore be discriminated by shape alone, using a single representative frame. To take advantage of this and to increase processing speed, these representative frames are extracted and used for recognition. Fig. 1 shows representative frames for all 29 Turkish Alphabet letters.

In each representative frame, hand regions are determined by skin color. From the binary images that show hand and background pixels, the regions we are interested in are extracted, aligned and resized. In addition to aligned binary pixel values, binary object features are extracted to support maximum correlation based matching.

Each process is summarized below:

2.1 Representative Frame Extraction

In a Turkish fingerspelling video, representative frames are the ones with least hand movement. Hence, the frame whose distance to its successor is minimum can be chosen as a representative frame. Distance between successive frames f and f+1 is given by the sum of the city block distance between corresponding pixels:

Fig. 1. Representative frames for all 29 letters in Turkish Alphabet

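$$D_f = \sum_{n} \left( \left|\Delta R_n^f\right| + \left|\Delta G_n^f\right| + \left|\Delta B_n^f\right| \right), \qquad (1)$$

where $f$ iterates over frames, $n$ iterates over pixels, $R$, $G$, $B$ are the components of the pixel color, $\Delta R_n^f = R_n^{f+1} - R_n^f$, $\Delta G_n^f = G_n^{f+1} - G_n^f$, and $\Delta B_n^f = B_n^{f+1} - B_n^f$.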
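The selection rule of (1) can be sketched in a few lines of Python. This is a minimal illustration, not the system's C++ implementation: the frame layout (a flat list of (R, G, B) tuples) and the function names are assumptions made for the example.

```python
def frame_distance(frame_a, frame_b):
    """City block distance of Eq. (1) between two frames.

    Each frame is a flat list of (R, G, B) tuples of equal length
    (an assumed in-memory layout chosen for this sketch).
    """
    return sum(
        abs(ra - rb) + abs(ga - gb) + abs(ba - bb)
        for (ra, ga, ba), (rb, gb, bb) in zip(frame_a, frame_b)
    )


def representative_frame_index(frames):
    """Return the index f whose distance D_f to frame f+1 is smallest."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    distances = [frame_distance(frames[f], frames[f + 1])
                 for f in range(len(frames) - 1)]
    return min(range(len(distances)), key=distances.__getitem__)


# Toy video: four 2-pixel frames; the hand barely moves between frames 1 and 2.
video = [
    [(10, 10, 10), (200, 50, 50)],
    [(90, 80, 70), (120, 90, 60)],
    [(91, 80, 70), (120, 90, 61)],
    [(30, 30, 30), (10, 200, 10)],
]
print(representative_frame_index(video))
```

Only successive frame pairs are compared, so the cost grows linearly with the number of frames.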

2.2 Skin Detection by Color

For skin detection, YCrCb color-space has been found to be superior to other color spaces such as RGB and HSV [13]. Hence we convert the pixel values of images from RGB color space to YCrCb using (2).

$$Y = 0.299R + 0.587G + 0.114B, \quad Cr = B - Y, \quad Cb = R - Y \qquad (2)$$

In order to decrease noise, each of the Y, Cr and Cb components of the image is smoothed with the 2D Gaussian filter given by (3), where $\sigma$ is its standard deviation.

$$F(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \qquad (3)$$
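As a rough illustration of (3), the filter can be sampled on a small grid to obtain a discrete smoothing kernel. The grid size, the renormalization step, and the function name are assumptions of this sketch; the text does not give the kernel size or the value of σ used.

```python
import math


def gaussian_kernel(size, sigma):
    """Sample F(x, y) of Eq. (3) on a size x size grid centred at the origin.

    The samples are renormalised to sum to 1 so that smoothing keeps the
    mean intensity unchanged (an assumed, though common, convention).
    """
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = size // 2
    kernel = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(map(sum, kernel))
    return [[value / total for value in row] for row in kernel]


kernel = gaussian_kernel(size=5, sigma=1.0)
for row in kernel:
    print(" ".join(f"{value:.4f}" for value in row))
```

Each of the Y, Cr and Cb planes would then be convolved with this kernel.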

Chai and Bouzerdom [14] report that pixels that belong to the skin region have similar Cr and Cb values, and give a distribution of the pixel color in the Cr-Cb plane. Consequently, we classify a pixel as skin if its Y, Cr, and Cb values fall inside the ranges 135 < Cr < 180, 85 < Cb < 135 and Y > 80 (Fig. 2.a).
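The threshold rule above can be sketched as follows. The color conversion is an assumption: the sketch uses the conventional BT.601 definitions, with Cr derived from R − Y and Cb from B − Y, scaled and offset by 128 as OpenCV does for 8-bit images, since the quoted ranges lie around 128. The function names are hypothetical.

```python
def rgb_to_ycrcb(r, g, b):
    """Convert 8-bit RGB to (Y, Cr, Cb).

    Assumption: BT.601 weights with the scaling and 128 offset that
    OpenCV applies to 8-bit images; the text does not state the scaling.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128.0
    cb = (b - y) * 0.564 + 128.0
    return y, cr, cb


def is_skin(r, g, b):
    """Threshold rule from the text: 135 < Cr < 180, 85 < Cb < 135, Y > 80."""
    y, cr, cb = rgb_to_ycrcb(r, g, b)
    return 135 < cr < 180 and 85 < cb < 135 and y > 80


for name, rgb in [("skin-like", (200, 150, 120)),
                  ("blue", (30, 60, 200)),
                  ("white", (255, 255, 255))]:
    print(name, is_skin(*rgb))
```

Applying `is_skin` to every pixel yields the binary image that the later stages work on.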

After clearing small skin colored regions by morphological opening (Fig. 2.b), skin detection is completed.
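Morphological opening is an erosion followed by a dilation with the same structuring element. The sketch below assumes a 3x3 square element and treats pixels outside the image as background; the text does not specify either choice.

```python
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _inside(mask, y, x):
    return 0 <= y < len(mask) and 0 <= x < len(mask[0])


def erode(mask):
    """Keep a pixel only if its whole 3x3 neighbourhood is set."""
    return [[int(all(_inside(mask, y + dy, x + dx) and mask[y + dy][x + dx]
                     for dy, dx in OFFSETS))
             for x in range(len(mask[0]))] for y in range(len(mask))]


def dilate(mask):
    """Set a pixel if any pixel in its 3x3 neighbourhood is set."""
    return [[int(any(_inside(mask, y + dy, x + dx) and mask[y + dy][x + dx]
                     for dy, dx in OFFSETS))
             for x in range(len(mask[0]))] for y in range(len(mask))]


def opening(mask):
    """Erosion followed by dilation: removes regions smaller than the element."""
    return dilate(erode(mask))


# A 4x4 "hand" blob plus one isolated noise pixel in the top-right corner.
noisy = [
    [0, 0, 0, 0, 0, 0, 1],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = opening(noisy)
for row in cleaned:
    print("".join(str(v) for v in row))
```

In this toy mask the isolated pixel disappears while the square blob is restored by the dilation.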

Fig. 2. (a) Original image and detected skin regions after pixel classification, (b) result of the morphological opening.

Fig. 3. (a)-(b) The 'C' sign by two different signers. (c)-(d) The 'U' sign by two different signers.

2.3 Fast Alignment for Maximum Correlation Based Template Matching

Template matching is very sensitive to size and orientation changes, so a scheme that can compensate for them is needed. Eliminating orientation information totally is not appropriate, however, as depicted in Fig. 3. Fig. 3a-b show two 'C' signs that we must be able to match to each other, so we must compensate for the small orientation difference between them. Fig. 3c-d show two 'U' signs that we need to differentiate from the 'C' signs. 'U' signs and 'C' signs are quite similar in shape, but their orientation is a major differentiator. As a result, we need a scheme that not only compensates for small orientation differences of hand regions but also remains responsive to large ones.

Fig. 4. Axis of least second moment and the angle of orientation

We propose a fast alignment method that works by making the angle of orientation (θ) zero. The angle of orientation, given by (4), is the angle between the y axis and the axis of least second moment (shown in Fig. 4).

$$2\theta = \arctan\left(\frac{2 M_{11}}{M_{20} - M_{02}}\right) \qquad (4)$$

where $I(x,y) = 1$ for pixels on the object and 0 otherwise, $M_{11} = \sum_x \sum_y x\,y\,I(x,y)$, $M_{20} = \sum_x \sum_y x^2\,I(x,y)$, and $M_{02} = \sum_x \sum_y y^2\,I(x,y)$.
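A small sketch of (4) follows. Two points are assumptions of the example rather than statements from the text: coordinates are measured from the center of area before the moments are summed, and the plain arctangent is used, which keeps θ within ±45°. Under that reading, a slightly tilted upright shape and a slightly tilted sideways shape both receive only a small correction, so they remain far apart after alignment.

```python
import math


def angle_of_orientation(pixels):
    """Angle theta of Eq. (4), in radians, for a list of (x, y) object pixels.

    Assumption: moments are taken about the center of area.
    """
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    m11 = sum((x - cx) * (y - cy) for x, y in pixels)
    m20 = sum((x - cx) ** 2 for x, _ in pixels)
    m02 = sum((y - cy) ** 2 for _, y in pixels)
    if m20 == m02:
        # arctan of +/- infinity is +/- pi/2; with m11 == 0 no axis is preferred
        return math.copysign(math.pi / 4, m11) if m11 else 0.0
    return 0.5 * math.atan(2.0 * m11 / (m20 - m02))


upright = [(0.2 * t, t) for t in range(-5, 6)]    # tilted slightly off the y axis
sideways = [(t, 0.2 * t) for t in range(-5, 6)]   # tilted slightly off the x axis
print(math.degrees(angle_of_orientation(upright)))
print(math.degrees(angle_of_orientation(sideways)))
```

Both toy shapes yield an angle of about 11 degrees in magnitude, so rotating each by its own θ straightens it without turning the sideways shape upright.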

Fig. 5. Stages of fast alignment. (a) Original frame. (b) Detected skin regions. (c) Region of Interest (ROI). (d) Rotated ROI. (e) Resized bounding square with the object in the center.

We define the bounding square as the smallest square that completely encloses all the pixels of the object. The fast alignment process ends after the object is placed in the center of a bounding square and the bounding square is resized to a fixed, smaller resolution (Fig. 5).
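The bounding square step can be sketched as below. Padding the shorter side equally on both ends keeps the aspect ratio of the object intact when the square is resized. Nearest-neighbor sampling and the function names are assumptions of this sketch; the text does not name the interpolation method.

```python
def to_bounding_square(mask):
    """Crop the object and centre it in the smallest enclosing square."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not ys:
        raise ValueError("mask contains no object pixels")
    top, left = ys[0], xs[0]
    height, width = ys[-1] - top + 1, xs[-1] - left + 1
    side = max(height, width)
    off_y, off_x = (side - height) // 2, (side - width) // 2
    square = [[0] * side for _ in range(side)]
    for y in range(height):
        for x in range(width):
            square[off_y + y][off_x + x] = mask[top + y][left + x]
    return square


def resize_nearest(square, size):
    """Nearest-neighbour resize of a square binary image to size x size."""
    side = len(square)
    return [[square[(y * side) // size][(x * side) // size]
             for x in range(size)] for y in range(size)]


# A 2-pixel-wide, 4-pixel-tall bar placed off-centre in a 4x6 mask.
mask = [
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
]
square = to_bounding_square(mask)
small = resize_nearest(square, 2)
print(square)
print(small)
```

Calling `resize_nearest(square, 30)` would give a 30x30 grid of binary values.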

2.4 Additional Binary Object Features

Instead of using only the pixel values in the bounding square, additional binary object features are extracted to support the decision process.

These features include area, center of area, perimeter [15], angle of orientation (defined above), and circularity (defined as perimeter²/area). In addition, the mean radial distance $\mu_R$ is extracted:

$$\mu_R = \frac{1}{N} \sum_{n} \left\| (x_n, y_n) - (\bar{x}, \bar{y}) \right\| \qquad (5)$$

where $n$ iterates over all pixels, $N$ is the number of pixels, $(\bar{x}, \bar{y})$ is the center of area, $(x_n, y_n)$ is the coordinate of the $n$th pixel, and $\|\cdot\|$ denotes the Euclidean distance between two pixels. Another feature is the standard deviation of radial distance $\sigma_R$, defined as

$$\sigma_R = \left( \frac{1}{N} \sum_{n} \left[ \left\| (x_n, y_n) - (\bar{x}, \bar{y}) \right\| - \mu_R \right]^2 \right)^{1/2}. \qquad (6)$$

As the last binary object feature, a second circularity measure $C$ is computed by

$$C = \frac{\mu_R}{\sigma_R}. \qquad (7)$$

To summarize, 9 binary object features are added to the 30x30 binary values of the minimum bounding square.
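Equations (5) to (7) translate directly into code. The sketch below assumes the object is given as a list of (x, y) pixel coordinates and returns infinity for C when all radial distances are equal; both choices belong to the example, not to the text.

```python
import math


def radial_features(pixels):
    """Return (mu_R, sigma_R, C) of Eqs. (5)-(7) for a list of (x, y) pixels."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n      # center of area
    cy = sum(y for _, y in pixels) / n
    radii = [math.hypot(x - cx, y - cy) for x, y in pixels]
    mu = sum(radii) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in radii) / n)
    c = mu / sigma if sigma > 0 else math.inf
    return mu, sigma, c


# Four corners of a 2x2 square plus its centre pixel.
mu_r, sigma_r, circ = radial_features([(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)])
print(mu_r, sigma_r, circ)
```

Together with the 900 binary values of the 30x30 bounding square, the 9 object features give a 909-dimensional feature vector.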

3 Video Database

The training and test videos are acquired by a Philips PCVC840K CCD webcam. The capture resolution is set to 320x240 with 15 frames per second (fps). While programming is done in C++, the Intel OpenCV library routines are used for video capturing and some of the image processing tasks.

We have developed a Turkish Sign Language fingerspelling recognition system for the 29 static letters in Turkish Alphabet. The training set was created using three different signers. For training, they signed a total of 10 times for each letter, which sums up to 290 training videos. For testing, they signed a total of 6 times for each letter, which sums up to 174 test videos. Table 1 gives a summary of the distribution of the train and test video numbers for each signer. Notice that training and test sets are totally separated.

Table 1. Distribution of train and test video numbers for each signer

          Signer 1       Signer 2       Signer 3       Total
Letter    Train  Test    Train  Test    Train  Test    Train  Test
A         4      2       4      2       2      2       10     6
...
Z         4      2       4      2       2      2       10     6
Total                                                  290    174

Table 2. Success rates of most successful classifiers on fingerspelling data

Classifier                      Success Rate (%)
1NN [16]                        99.43
SVM [17]                        98.85
Random Forest [18]              97.13
RBF Network [19]                96.55
Multinomial Naive Bayes [20]    93.68
Naive Bayes [21]                88.51
J48 [22]                        85.63

4 Classification Comparison

A set of different classification algorithms are applied to the features extracted as explained in Section 2 and obtained results are sorted according to their success rates. These classification results are summarized in Table 2.

The most successful classifiers are one nearest neighbor (1NN) and support vector machine (SVM). These methods classified more than 98% of the test videos correctly, as seen in Table 2. The biggest problem is in the classification of the letter 'S', which is confused with 'Ş'. A second problem letter is 'G', which is confused with 'Ğ'. The confused characters are very similar to each other in shape, as seen in Fig. 6.
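For illustration, a one nearest neighbor classifier over such feature vectors can be sketched as follows. The distance measure is an assumption: squared Euclidean distance is used here, which on 0/1 pixel features equals the number of differing pixels. The 3x3 templates are hypothetical toy data, not real signs.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; on 0/1 features it equals the Hamming distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_1nn(sample, training_set):
    """Return the label of the training vector closest to sample.

    training_set is a list of (feature_vector, label) pairs.
    """
    _, label = min(training_set,
                   key=lambda pair: squared_distance(sample, pair[0]))
    return label


# Toy 3x3 "images" flattened row by row (hypothetical templates).
training_set = [
    ([1, 1, 1,
      1, 0, 0,
      1, 1, 1], "C"),
    ([1, 0, 1,
      1, 0, 1,
      1, 1, 1], "U"),
]
query = [1, 1, 1,
         1, 0, 0,
         1, 1, 0]  # the "C" template with one pixel flipped
print(classify_1nn(query, training_set))
```

Because the binary pixel values are treated as ordinary features, the same vectors can be passed unchanged to any of the other classifiers in Table 2.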

Fig. 6. Two difficult cases where our methods may fail. Left to right: 'S' and 'Ş', 'G' and 'Ğ'.

5 Conclusions and Future Work

A Turkish fingerspelling recognition system is tested and found to have more than 99% accuracy. The training and testing sets are created by multiple signers; as a consequence, the developed system is signer independent. The accuracy is the result of the fast alignment process we applied. This process brings objects with similar orientation into the same alignment, while bringing objects with a large orientation difference into different alignments. This is a desired result, because in fingerspelling recognition, shapes that belong to different letters can be very similar, and the orientation can be the main differentiator. After the alignment, to resize without breaking the alignment, the object is moved into the center of a minimum bounding square. The binary values in the minimum bounding square are used as the features. In addition, we used binary object features such as circularity and mean radial distance, which helped increase the success rate.

Our method is robust to the problem of occlusion of the hands, because the fast alignment method allows us to process an ensemble of one or more connected components in the same way.

The system is fast due to representing the sign video by a single frame, the speed of fast alignment process, and resizing the bounding square to a smaller resolution. The amount of resizing can be arranged for different applications.

Since we used binary pixel values as ordinary features, we are able to experiment with different classification algorithms, amongst which are kNN, SVM, RBF Network, Naïve Bayes, Random Forest, and J48 tree. The 1NN and SVM give the best success rates, 99.43% and 98.85% respectively.

Not all letters in the Turkish alphabet are representable by a single frame, 'Ş' being an example. The sign of this letter involves some movement that differentiates it from 'S'. In fact, this letter is the one that prevented us from achieving a 100% success rate. Still, representing the whole sign by a single frame is acceptable, since this work is a step towards a full Turkish Sign Language recognition system that can also recognize word signs. That system will incorporate not only shape but also movement, and the research on it is continuing.

The importance of successful segmentation of the skin and background regions cannot be overstated. In this work we assumed that there are no skin colored background regions and used color based segmentation in YCrCb space. The system's success depends on that assumption, and research on better skin segmentation is invaluable.

The fast alignment and classification schemes presented would work equally well in the existence of a face in the frame, even though in this study we used only hand regions when creating the fingerspelling video database.

Although we demonstrated the fast alignment method in the context of hand shape recognition, it is equally applicable to other problems where shape recognition is required, for example to the problem of shape retrieval.

References

1. http://www.british-sign.co.uk/learnbslsignlanguage/whatisfingerspelling.htm.

2. http://www.deaflibrary.org/asl.html.

3. Starner, T., Weaver, J., Pentland, A.: Real-time American sign language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 1371-1375

4. Holden, E.J., Lee, G., Owens, R.: Australian sign language recognition. Machine Vision and Applications 16 (2005) 312-320

5. Gao, W., Fang, G.L., Zhao, D.B., Chen, Y.Q.: A Chinese sign language recognition system based on SOFM/SRN/HMM. Pattern Recognition 37 (2004) 2389-2402

6. Haberdar, H., Albayrak, S.: Real Time Isolated Turkish Sign Language Recognition From Video Using Hidden Markov Models With Global Features. Lecture Notes in Computer Science LNCS 3733 (2005) 677

7. Haberdar, H., Albayrak, S.: Vision Based Real Time Isolated Turkish Sign Language Recognition. International Symposium on Methodologies for Intelligent Systems, Bari, Italy (2006)

8. Lamar, M., Bhuiyant, M.: Hand Alphabet Recognition Using Morphological PCA and Neural Networks. International Joint Conference on Neural Networks, Washington, USA (1999) 2839-2844

9. Rebollar, J., Lindeman, R., Kyriakopoulos, N.: A Multi-Class Pattern Recognition System for Practical Fingerspelling Translation. International Conference on Multimodal Interfaces, Pittsburgh, USA (200)

10. Isaacs, J., Foo, S.: Hand Pose Estimation for American Sign Language Recognition. Thirty-Sixth Southeastern Symposium on, IEEE System Theory (2004) 132-136

11. Feris, R., Turk, M., Raskar, R., Tan, K.: Exploiting Depth Discontinuities for Vision-Based Fingerspelling Recognition. 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops(CVPRW'04) (2004)

12. Altun, O., Albayrak, S., Ekinci, A., Bükün, B.: Increasing the Effect of Fingers in Fingerspelling Hand Shapes by Thick Edge Detection and Correlation with Penalization. PSIVT 2006 (2006)

13. Sazonov, V., Vezhnevetsi, V., Andreeva, A.: A survey on pixel based skin color detection techniques. Graphicon-2003 (2003) 85-92

14. Chai, D., Bouzerdom, A.: A Bayesian Approach To Skin Colour Classification. TENCON-2000 (2000)

15. Umbaugh, S.E.: Computer Vision and Image Processing: A Practical Approach Using CVIPtools. Prentice Hall (1998)

16. Aha, D.W., Kibler, D., Albert, M.K.: Instance-Based Learning Algorithms. Machine Learning 6 (1991) 37-66

17. Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murthy, K.R.K.: Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13 (2001) 637-649

18. Breiman, L.: Random forests. Machine Learning 45 (2001) 5-32

19. Fritzke, B.: Fast Learning with Incremental RBF Networks. Neural Processing Letters 1 (1994) 2-5

20. McCallum, A., Nigam, K.: A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98, Workshop on Learning for Text Categorization (1998)

21. John, G.H., Langley, P.: Estimating Continuous Distributions in Bayesian Classifiers. Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Mateo (1995) 338-345

22. Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA (1993)