Journal on Multimodal User Interfaces, Submission File, Draft, 30 June 2008

SPEECH AND SLIDING TEXT AIDED SIGN RETRIEVAL FROM HEARING IMPAIRED SIGN NEWS VIDEOS

Oya Aran, Ismail Ari, Lale Akarun

PILAB, Bogazici University, Istanbul, Turkey
{aranoya;ismailar;akarun}@boun.edu.tr

Erinc Dikici, Siddika Parlak, Murat Saraclar

BUSIM, Bogazici University, Istanbul, Turkey
{erinc.dikici;siddika.parlak;murat.saraclar}@boun.edu.tr

Pavel Campr, Marek Hruz

University of West Bohemia, Pilsen, Czech Republic

{campr;mhruz}@kky.zcu.cz

ABSTRACT

The objective of this study is to automatically extract annotated sign data from broadcast news recordings for the hearing impaired. These recordings present an excellent source for automatically generating annotated data: in news for the hearing impaired, the speaker also signs with the hands as she talks. On top of this, corresponding sliding text is superimposed on the video. The video of the signer can be segmented with the help of either the speech alone or both the speech and the text, generating segmented and annotated sign videos. We call this application Signiary, and aim to use it as a sign dictionary where users enter a word as text and retrieve videos of the related sign. The application can also be used to automatically create annotated sign databases for training recognizers.

KEYWORDS

speech recognition – sliding text recognition – sign language analysis – sequence clustering – hand tracking

1. INTRODUCTION

Sign language is the primary means of communication for deaf and mute people. Like spoken languages, it emerges naturally among the deaf people living in the same region. Thus, there exist a large number of sign languages all over the world. Some of them are well known, like American Sign Language (ASL), and some of them are known by only a very small group of deaf people who use them. Sign languages make use of hand gestures, body movements and facial expressions to convey information. Each language has its own signs, grammar, and word order, which is not necessarily the same as that of the spoken language of that region.

Turkish Sign Language (Turk Isaret Dili, TID) is a natural, full-fledged language used in the Turkish deaf community [1]. Its roots go back to the 16th century, to the time of the Ottoman Empire. TID has many regional dialects and is used throughout Turkey. It has its own vocabulary and grammar, and its own system of fingerspelling, and it is completely different from spoken Turkish in many aspects, especially in terms of word order and grammar [2].

In this work, we aim to exploit videos of the Turkish news for the hearing impaired in order to generate usable data for sign language education. For this purpose, we have recorded news videos from the Turkish Radio-Television (TRT) channel. The broadcast news is for the hearing impaired and consists of three major information sources: sliding text, speech and sign. Fig. 1 shows an example frame from the recordings.

Figure 1: An example frame from the news recordings. The three information sources are the speech, the sliding text and the signs.

The three sources in the video convey the same information via different modalities. The news presenter signs the words as she talks. However, since a Turkish spoken sentence and a Turkish sign sentence do not necessarily have the same word order [2], the signing in these news videos is not considered TID but can be called signed Turkish: the sign of each word is from TID, but their ordering would have been different in a proper TID sentence. Moreover, facial expressions and head/body movements, which are frequently used in TID, are not used in these signings. Since the signer also talks, no facial expression that uses lip movements can be made. In addition to the speech and sign information, a corresponding sliding text is superimposed on the video. Our methodology is to process the video to extract the information content in the sliding text and speech components and to use both the speech and the text to generate segmented and annotated sign videos. The main goal is to use this annotation to form a sign dictionary. Once the annotation is completed, unsupervised techniques are employed to check consistency among the retrieved signs, using a clustering of the signs.

The system flow of Signiary is illustrated in Fig. 2. The application receives the text input of the user and attempts to find the word in the news videos by using the speech. At this step the application returns several intervals from different videos that contain the entered word. If the resolution is high enough to analyze the lip movements, audio-visual analysis can be applied to increase speech recognition accuracy. Then, sliding text information is used to control and correct the result of the retrieval. This is done by searching for the word in the sliding text modality during each retrieved interval. If the word can also be retrieved by the sliding text modality, the interval is assumed to be correct. The sign intervals are extracted by analyzing the correlation of the signs with the speech. However, the signs in these intervals are not necessarily the same. First, there can be false alarms of the retrieval corresponding to some unrelated signs and, second, there are homophonic words that have the same phonology but different meanings and thus possibly different signs. Therefore, we need to cluster the signs that are retrieved by the speech recognizer.

Figure 2: Modalities and the system flow

In Sections 2, 3 and 4, we explain the analysis techniques and the results of the recognition experiments for the speech, text and sign modalities, respectively. The overall assessment of the system is given in Section 5. In Section 6, we give the details about the Signiary application and its graphical user interface. We conclude and discuss further improvement areas of the system in Section 7.

2. SPOKEN TERM DETECTION

Spoken term detection (STD) is a subfield of speech retrieval, which locates occurrences of a query in a spoken archive. In this work, STD is used as a tool to segment and retrieve the signs in the news videos based on speech information. After the location of the query is extracted with STD, the sign video corresponding to that time interval is displayed to the user. The block diagram of the STD system is given in Fig. 3.

The three main components of the system are: speech recognition, indexation and retrieval. The speech recognizer converts the audio data, extracted from the videos, into a symbolic representation (in terms of weighted finite state automata). The index, which is represented as a weighted finite state transducer, is built offline, whereas the retrieval is performed after the query is entered. The retrieval module uses the index to return the time, program and relevance information about the occurrences of the query. We now explain each of the components in detail.

Figure 3: Block diagram of the spoken term detection system

2.1. Speech recognition

Prior to recognition, the audio data is segmented into utterances based on energy, using the method explained in [3]. The speech signal corresponding to each utterance is converted into a textual representation using an HMM-based large vocabulary continuous speech recognition (LVCSR) system. The acoustic models consist of decision tree clustered triphones and the output probabilities are given by Gaussian mixture models. As the language model, we use word-based n-grams. The recognition networks and the output hypotheses are represented as weighted automata. The details of the ASR system can be found in [4].

The HTK toolkit [5] is used to produce the acoustic feature vectors and AT&T's FSM and DCD tools [6] are used for recognition.

The mechanism of an ASR system requires searching through a network of all possible word sequences. The one-best output is obtained by finding the most likely hypothesis. Alternative likely hypotheses can be represented using a lattice. To illustrate, the lattice output of a recognized utterance is shown in Fig. 4. The labels on the arcs are words and the weights indicate the probabilities of the arcs [7]. In Fig. 4, the circles represent states, where state '0' is the initial state and state '4' is the final state. An utterance hypothesis is a path between the initial state and one of the final states. The probability of each path is computed by multiplying the probabilities of the arcs on that path. For example, the probability of the path "iyi gUnler" is 0.73 × 0.975 ≈ 0.71, which is the highest of all hypotheses (the one-best).

Figure 4: An example lattice output for the utterance "iyi gUnler"


2.2. Indexation

Weighted automata indexation is an efficient method for the retrieval of uncertain data. Since the output of the ASR (hence, the input to the indexer) is a collection of alternative hypotheses represented as weighted automata, it is advantageous to represent the index as a finite state transducer. Therefore, the automata (output of the ASR) are turned into transducers such that the inputs are words and the outputs are the numbers of the utterances in which the word appears. By taking the union of these transducers, a single transducer is obtained, which is further optimized via weighted transducer determinization. The resulting index is optimal in search complexity: the search time is linear in the length of the input string. The weights in the index transducer correspond to expected counts, where the expectation is taken under the probability distribution given by the lattice [8].

2.3. Retrieval

Having built the index, we are now able to transduce input queries into utterance numbers. To accomplish this, the queries are represented as finite state automata and composed with the index transducer (via weighted finite state composition) [8]. The output is a list of all utterance numbers in which the query appears, together with the corresponding expected counts. Next, utterances are ranked based on their expected counts, and those higher than a particular threshold are retrieved. We apply forced alignment on the utterances to identify the starting time and duration of each term [9].
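To make the expected-count idea concrete, the following is a minimal sketch using plain Python dictionaries rather than the weighted finite-state transducers described above; the per-utterance word posteriors and the default threshold are illustrative assumptions, not the actual ASR lattice interface.

```python
# A simplified stand-in for the weighted-transducer index: word -> utterance
# -> expected count, where the expected count is approximated by the word's
# lattice posterior in that utterance. The input format is an assumption.
from collections import defaultdict

def build_index(word_posteriors_per_utterance):
    """word_posteriors_per_utterance: dict utt_id -> dict word -> posterior."""
    index = defaultdict(dict)
    for utt_id, posteriors in word_posteriors_per_utterance.items():
        for word, p in posteriors.items():
            index[word][utt_id] = index[word].get(utt_id, 0.0) + p
    return index

def retrieve(index, query, threshold=0.3):
    """Return (utterance id, expected count) pairs above the threshold,
    ranked by expected count, mirroring the thresholded retrieval above."""
    hits = index.get(query, {})
    return sorted(((u, c) for u, c in hits.items() if c >= threshold),
                  key=lambda uc: uc[1], reverse=True)
```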

As explained in Section 2.1, the use of lattices introduces more than one hypothesis for the same time interval, with different probabilities. Indexation estimates the expected count using these path probabilities. By setting a threshold on the expected count, different precision-recall points can be obtained, which results in a curve. On the other hand, the one-best hypothesis can be represented with only one point. Having a curve allows choosing an operating point by varying the threshold. Use of a higher threshold improves precision but recall falls. Conversely, a lower threshold value causes less probable documents to be retrieved; this increases recall but decreases precision.

The opportunity of choosing the operating point is a great advantage. Depending on the application, it may be desirable to retrieve all of the related documents or only the most probable ones. For our case, it is more convenient to operate at a point where precision is high.

2.4. Experiments and results

2.4.1. Evaluation

Speech recognition performance is evaluated by the word error rate (WER) metric, which was measured to be around 20% in our previous experiments.

The retrieval part is evaluated via precision-recall rates and the F-measure, which are calculated as follows. Given Q queries, let the reference transcriptions include R(q) occurrences of the query q, let A(q) be the total number of retrieved documents and C(q) the number of correctly retrieved documents. Then:

\[
\text{Precision} = \frac{1}{Q}\sum_{q=1}^{Q}\frac{C(q)}{A(q)}, \qquad
\text{Recall} = \frac{1}{Q}\sum_{q=1}^{Q}\frac{C(q)}{R(q)} \tag{1}
\]

and

\[
F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2}
\]
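A direct transcription of Equations (1) and (2) in Python; the per-query counts C(q), A(q) and R(q) are assumed to be given as a list of tuples.

```python
def precision_recall_f(per_query_counts):
    """per_query_counts: list of (C, A, R) tuples per query, following Eq. (1)-(2)."""
    Q = len(per_query_counts)
    precision = sum(C / A for C, A, _ in per_query_counts if A > 0) / Q
    recall = sum(C / R for C, _, R in per_query_counts if R > 0) / Q
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```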

Evaluation is done over 15 news videos, each with an approximate duration of 10 minutes. Our query set consists of all the words seen in the manual transcriptions (excluding foreign words and acronyms). Correct transcriptions are obtained manually and the time intervals are obtained by forced Viterbi alignment. The acoustic model of the speech recognizer is trained on our broadcast news corpus, which includes approximately 100 hours of speech. The language models are trained on a text corpus consisting of 96M words [4].

We consider a retrieved occurrence a correct match if its time interval falls within ±0.5 seconds of the correct interval; if not, we count a miss or a false alarm.

2.4.2. Results

We experimented with both lattice and one-best indexes. As mentioned before, the lattice approach corresponds to a curve in the plot, while the one-best approach is represented with a point. The resulting precision-recall graph is depicted in Fig. 5.

Figure 5: Precision-Recall for word-lattice and one-best hypotheses when no limit is set on the maximum number of retrieved documents. The marked lattice operating point with maximum F-measure is P = 90.9%, R = 71.2%, F = 80.3%.

In the plot, we see that the use of lattices performs better than one-best. The arrow points at the position where the maximum F-measure is achieved for the lattice. We also experimented with setting a limit on the number of retrieved occurrences. When only 10 occurrences are selected, the precision-recall results of the one-best and lattice approaches were similar. However, the max-F measure obtained by this method outperforms the previous one by 1%. A comparison of lattice and one-best performances (in F-measure) is given in Table 1 for both experiments. Use of lattices introduces a 1-1.5% improvement in F-measure. Since the broadcast news corpora are fairly noiseless, the improvement may seem minor; however, for noisy data this improvement is much higher [7].

             Max-F (%)   Max-F@10 (%)
Lattice      80.32       81.16
One-best     79.05       80.06

Table 1: Comparison of lattice and one-best ASR outputs on maximum F-measure performance


3. SLIDING TEXT RECOGNITION

3.1. Sliding Text Properties

The second modality consists of the overlaid sliding text, which includes a simultaneous transcription of the speech. We work with 20 DivX encoded color videos, having two different resolutions (288x352 and 288x364) and a 25 fps sampling rate. The sliding text band at the bottom of the screen contains a solid background and white characters with a constant font. The sliding text speed does not change considerably throughout the whole video sequence (4 pixels/frame). Therefore, each character appears on the screen for at least 2.5 seconds. An example of a frame with sliding text is shown in Fig. 6.

Figure 6: Frame snapshot of the broadcast news video

3.2. Baseline Method

Sliding text information extraction consists of three steps: extraction of the text line, character recognition and temporal alignment.

3.2.1. Text Line Extraction

Since the size and position of the sliding text are constant throughout the video, they are determined at the first frame and used in the rest of the operations. To find the position of the text, we first convert the RGB image into a binary image, using grayscale quantization and thresholding with the Otsu method [10]. Then we calculate the horizontal projection histogram of the binary image, i.e., the number of white pixels in each row. The text band appears as a peak in this representation, separated from the rest of the image. We apply a similar technique over the cropped text band, this time in the vertical projection direction, to eliminate the program logo. The sliding text is bounded by the remaining box, whose coordinates are defined as the text line position. Fig. 7 shows an example of a binary image with its horizontal and vertical projection histograms.
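As an illustration of this step, the sketch below locates the text band with OpenCV using Otsu binarization and projection histograms; the fraction used to pick peak rows and the logo-trimming rule are simplifying assumptions, not values from the paper.

```python
import cv2
import numpy as np

def find_text_band(frame_bgr, row_frac=0.5):
    """Return (top, bottom, left, right) of the sliding-text band."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Horizontal projection: white-pixel count per row; the text band shows up
    # as a band of rows with a high count.
    row_hist = (binary > 0).sum(axis=1)
    band_rows = np.flatnonzero(row_hist > row_frac * row_hist.max())
    top, bottom = band_rows.min(), band_rows.max()

    # Vertical projection inside the band, used here simply to keep the
    # columns that contain text (e.g. to drop an empty margin or logo area).
    band = binary[top:bottom + 1]
    col_hist = (band > 0).sum(axis=0)
    text_cols = np.flatnonzero(col_hist > 0)
    return top, bottom, text_cols.min(), text_cols.max()
```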

Since there is redundant information in successive frames, we do not extract text information from every frame. Experiments have shown that updating the text transcription once every 10 frames is optimal for achieving sufficient recognition accuracy. We call these the "sample frames". The other frames in between are used for noise removal and smoothing.

Figure 7: Binary image with horizontal and vertical projection histograms

Noise in the binary images stems mostly from the quantization operations in the color scale conversion. Considering the low resolution of our images, noise can cause two characters, or two distinct parts of a single character, to be combined, which complicates the segmentation of the text into characters. We apply morphological opening with a 2x2 structuring element to remove such effects of noise. To further smooth the appearance of the characters, we horizontally align the binary text images of the frames between two samples and, for each pixel position, decide on a 0 or 1 by voting.

Once the text image is obtained, the vertical projection histogram is calculated again to find the start and end positions of every character. Our algorithm assumes that two consecutive characters are perfectly separated by at least one black pixel column. Words are segmented by looking for spaces which exceed an adaptive threshold that takes into account the outliers in character spacing. To achieve proper alignment, only complete words are taken for transcription at each sample frame.
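A sketch of the character and word segmentation on the binarized band. The adaptive word-space rule (mean plus a multiple of the standard deviation of the inter-character gaps) is an assumption; the paper only states that the threshold accounts for outliers in character spacing.

```python
import numpy as np

def character_spans(band_binary):
    """Return (start, end) column indices of character candidates, assuming
    consecutive characters are separated by at least one all-black column."""
    has_ink = (band_binary > 0).any(axis=0).astype(np.int8)
    padded = np.concatenate(([0], has_ink, [0]))
    edges = np.flatnonzero(np.diff(padded))
    return list(zip(edges[::2], edges[1::2] - 1))

def word_break_indices(spans, k=2.0):
    """Indices i such that a word boundary lies between spans[i] and spans[i+1]."""
    gaps = np.array([spans[i + 1][0] - spans[i][1] - 1 for i in range(len(spans) - 1)])
    if gaps.size == 0:
        return []
    threshold = gaps.mean() + k * gaps.std()   # assumed form of the adaptive threshold
    return [i for i, g in enumerate(gaps) if g > threshold]
```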

Each character is individually cropped from the text image and saved, along with its start and end horizontal pixel positions.

3.2.2. Character Recognition

Since the font of the sliding text is constant in all videos, a template matching method is implemented for character recognition. A normalized Hamming distance is used to compare each binary character image to each template. The total number of matching pixels is divided by the size of the character image and used as a normalized similarity score. Let nij be the total number of pixel positions where the binary template pixel has value i and the character pixel has value j. Then the score is formulated as:

\[
s_j = \frac{n_{00} + n_{11}}{n_{00} + n_{01} + n_{10} + n_{11}} \tag{3}
\]

The character whose template gets the best score is assigned to that position. Matched characters are stored as a string. Fig. 8 depicts the first three sample text band images, their transcribed text and the corresponding pixel positions.
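The matching score of Equation (3) can be written directly as below; both images are assumed to be binary (0/1) arrays of the same, fixed size.

```python
import numpy as np

def hamming_score(template, char_img):
    """Fraction of agreeing pixel positions: (n00 + n11) / (n00 + n01 + n10 + n11)."""
    return (template == char_img).sum() / template.size

def recognize_character(char_img, templates):
    """templates: dict label -> binary template; returns the best-scoring label."""
    return max(templates, key=lambda label: hamming_score(templates[label], char_img))
```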

Figure 8: First three sample frames and their transcription results

3.2.3. Temporal Alignment

The sliding text is interpreted as a continuous band throughout the video. Since we process only selected sample frames, the calculated positions of each character should be aligned in space (and therefore in time) with their positions from the previous sample frame. This is done using frame shift and pixel shift values. For instance, in Figure 8, the "G" recognized at positions 159-167 in the first sample (a) and the one at positions 119-127 in the second sample (b) refer to the same character, since we investigate every 10th frame with 4 pixels of shift per frame. Small changes in these values (mainly due to noise) are compensated using shift-alignment comparison checks. Therefore, we obtain a unique pixel position (start-end) pair for each character seen on the text band.

The characters in successive samples which have the same unique position pair may not be identical, due to recognition errors. Continuing the previous example, we see that the parts of the figures which correspond to the positions in boldface are recognized as "G", "G" and "O", respectively. The final decision is made by majority voting: the character that is seen most often is assigned to that position pair. Therefore, in our case, we decide on "G" for that position pair.
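A small sketch of the alignment and voting just described: on-screen positions are mapped to band-global positions using the sampling step and sliding speed, and a majority vote picks the final character for each position pair. The small shift-alignment corrections for noise are omitted here.

```python
from collections import defaultdict, Counter

FRAME_STEP = 10    # every 10th frame is a sample frame
PIXEL_SHIFT = 4    # sliding speed in pixels per frame

def global_position(sample_index, start_px, end_px):
    """Map an on-screen (start, end) pair to a position on the full text band."""
    offset = sample_index * FRAME_STEP * PIXEL_SHIFT
    return (start_px + offset, end_px + offset)

def vote_characters(observations):
    """observations: iterable of (sample_index, start_px, end_px, label) tuples."""
    votes = defaultdict(Counter)
    for sample_index, start, end, label in observations:
        votes[global_position(sample_index, start, end)][label] += 1
    # Majority vote per unique position pair.
    return {pos: counter.most_common(1)[0][0] for pos, counter in votes.items()}
```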

3.2.4. Baseline Performance

We compare the transcribed text with the ground truth data and use the character recognition and word recognition rates as performance criteria. Even if only one character of a word is misrecognized, we label the word as erroneous.

Applying the baseline method on the same corpora as used for speech recognition, we achieved 94% character recognition accuracy and 70% word recognition accuracy.

3.2.5. Discussions

One of the most important challenges of character recognition was the combined effect of low resolution and noise. We work with frames of considerably low resolution; therefore, each character covers an area of barely 10-14 pixels in height and 2-10 pixels in width. In such a small area, any noise pixel distorts the image considerably, thus making it harder to achieve a reasonable score by template comparison.

The noise cancellation techniques described in Section 3.2.1 created another disadvantage, since they removed distinctive parts of some Turkish characters, such as erasing dots (İ, Ö, Ü) or pruning hooks (Ç, Ş). Fig. 9 shows examples of such characters, which, after the noise cancellation operations, look very much the same. A list of the top 10 commonly confused characters, along with their confusion rates, is given in Table 2.

Figure 9: Examples of commonly confused characters

Table 2: Top 10 commonly confused characters

Character (Original)   Character (Recognized)   Confusion Rate (%)
8                      B                        65.00
0                      O                        47.47
O                      O                        42.03
O                      0                        40.56
S                      S                        39.79
B                      8                        35.97
5                      S                        27.78
G                      O                        22.65
S                      S                        39.79
2                      Z                        13.48

As can be seen in Table 2, the highest confusion rates occur in the discrimination of similarly shaped characters and numbers, and of Turkish characters with diacritics.

3.3. Improvements over the Baseline

We made two major improvements over the baseline system to improve recognition accuracy. The first is to use Jaccard's coefficient as the template match score. Jaccard uses pixel comparison with a slightly different formulation. Following the same notation as in Section 3.2.2, we have [11]:

\[
s_j = \frac{n_{11}}{n_{11} + n_{10} + n_{01}} \tag{4}
\]

A second approach to improve recognition accuracy is correcting the transcriptions using Transformation Based Learning [12], [13].

Transformation Based Learning (TBL) is a supervised classification method commonly used in natural language processing. The aim of TBL is to find the most common changes on a training corpus and to construct a rule set which leads to the highest classification accuracy. A corpus of 11M characters, collected from a news website, is used as training data for TBL. The labeling problem is defined as choosing the most probable character among a set of alternatives. For the training data, the alternatives for each character are randomly created using the confusion rates presented in Table 2. Table 3 presents some of the learned rules and their rankings. For example, in the second row, the letter "O", which starts a new word and is succeeded by an "E", is suggested to be changed to a "G".
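As an illustration of how such rules are applied at correction time, the sketch below uses an assumed rule representation (left context or word start, character to change, right context, replacement); it is not the authors' TBL implementation, and the example word is made up.

```python
def apply_rules(text, rules):
    """Apply TBL-style correction rules in rank order. Each rule is an assumed
    (prev_ctx, target, next_ctx, replacement) tuple; '^' means word start."""
    chars = list(text)
    for prev_ctx, target, next_ctx, replacement in rules:
        for i, c in enumerate(chars):
            if c != target:
                continue
            prev = chars[i - 1] if i > 0 else " "
            nxt = chars[i + 1] if i + 1 < len(chars) else " "
            prev_ok = (prev_ctx == "^" and prev == " ") or prev == prev_ctx
            if prev_ok and nxt == next_ctx:
                chars[i] = replacement
    return "".join(chars)

# Rank-4 rule of Table 3: an "O" that starts a word and is followed by "E"
# becomes "G" (hypothetical example word).
print(apply_rules("OENEL KURUL", [("^", "O", "E", "G")]))   # -> "GENEL KURUL"
```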

The joint effect of using Jaccard's score for comparison and the TBL corrections on the transcribed text leads to character and word recognition accuracies of 98% and 89%, respectively.


Table 3: Learned rules by TBL training

Rule Rank   Context   Character to be Changed   Changed with
1           8         8                         B
4           OE        O                         G
5           A5        5                         S
12          Y0        0                         O
17          OO        O                         G
40          2O        O                         0

3.4. Integration of Sliding Text Recognition and Spoken Term Detection

The speech and sliding text modalities are integrated via a cascading strategy. The output of the sliding text recognition is used as a verifier of the spoken term detection.

The cascade is implemented as follows. In the STD part, the utterances whose relevance scores exceed a particular threshold are selected, as in Section 2.3. In the sliding text part, the spoken term detection hypotheses are checked against the sliding text recognition. Using the starting time information provided by STD, the interval of the starting time ±4 seconds is scanned in the sliding text output. The word which is closest to the query (in terms of normalized minimum edit distance) is assigned as the corresponding text result. The normalized distance is compared to another threshold; those below the distance threshold are assumed to be correct and returned to the user.
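A sketch of this verification step; the sliding-text output is assumed to be available as (word, time) pairs, and the default distance threshold is illustrative (the experiments below vary it between 0.1 and 1).

```python
def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def verify_hit(query, hit_time, text_words, window=4.0, max_norm_dist=0.3):
    """Check an STD hit against the sliding text within +/- `window` seconds."""
    candidates = [w for w, t in text_words if abs(t - hit_time) <= window]
    if not candidates:
        return False
    best = min(candidates, key=lambda w: edit_distance(query, w))
    norm_dist = edit_distance(query, best) / max(len(query), len(best))
    return norm_dist <= max_norm_dist
```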

Figure 10: Precision-Recall curves using only speech information and using both speech and sliding text information. The marked operating points are P = 97.5%, R = 50.8% for speech only and P = 98.5%, R = 56.8% for the sliding text aided system.

Evaluation of the sliding text integration is done as explained in Section 2.4.1. The resulting precision-recall curves are shown in Figure 10. To obtain the text aided curve, the probability threshold in STD is set to 0.3 (determined empirically) and the threshold on the normalized minimum edit distance is varied from 0.1 to 1. Sliding text integration improves the system performance by 1% in maximum precision, which is a considerable improvement in the high precision region. Using both text and speech, the maximum attainable precision is 98.5%. This point also has a higher recall than the maximum-precision point of the speech-only system.

4. SIGN ANALYSIS

4.1. Skin color detection

Skin color is widely used to aid segmentation in finding parts of the human body [14]. We learn skin colors from a training set and then create a model of them using a Gaussian Mixture Model (GMM).

Figure 11: Examples from the training data

4.1.1. Training of the GMM

We prepared a set of training data by extracting images from our input video sequences and manually selecting the skin colored pixels. In total, we processed six video segments. Some example training images are shown in Fig. 11. We use the RGB color space for color representation. In the current videos that we process, there are a total of two speakers/signers. The RGB color space gives us an acceptable detection rate with which we are able to perform acceptable tracking. However, we should change the color space to HSV when we expect a wider range of signers from different ethnicities [15].

The collected data are processed by the Expectation Maximization (EM) algorithm to train the GMM. After inspecting the spatial parameters of the data, we decided to use a mixture of five Gaussians.

Figure 12: Skin color probability distribution. Two probability levels are visible: the outside layer corresponds to a probability of 0.5 and the inner layer corresponds to a probability of 0.86.

The straightforward way of using the GMM for segmentation is to compute, for every pixel in the image, the probability of belonging to a skin segment. This computation takes a long time, given that the information we have is the mean, variance and gain of each Gaussian in the mixture. To reduce the time complexity of this operation, we have precomputed the probabilities and used a table look-up.

The values of the look-up table are computed from the probability density function given by the GMM, and we obtain a 256 x 256 x 256 look-up table containing the probabilities of all the colors of the RGB color space. The resulting model can be observed in Fig. 12.
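A minimal sketch of building such a look-up table with scikit-learn; the training pixels are assumed to be an (N, 3) array of RGB values, and the chunked evaluation is only there to keep memory in check.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_skin_lut(skin_pixels_rgb, n_components=5, chunk=1 << 18):
    """Fit a five-component GMM to skin pixels and tabulate its density over all
    256^3 RGB colors, normalized to [0, 1]."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(skin_pixels_rgb)

    axes = np.arange(256)
    grid = np.stack(np.meshgrid(axes, axes, axes, indexing="ij"), axis=-1)
    grid = grid.reshape(-1, 3).astype(np.float64)

    log_density = np.empty(len(grid))
    for start in range(0, len(grid), chunk):           # chunked scoring
        log_density[start:start + chunk] = gmm.score_samples(grid[start:start + chunk])
    lut = np.exp(log_density).reshape(256, 256, 256)
    return lut / lut.max()                             # indexed as lut[r, g, b]
```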

4.1.2. Adaptation of the general skin color model for a given video sequence

To handle different illuminations and different speakers accurately, we need to adapt the general look-up table, modeled from several speakers, into an adapted look-up table for each video. Since manual data editing is not possible at this stage, a quick automatic method is needed.

Figure 13: Examples of different lighting conditions resulting in different skin colors of the speaker, and the corresponding adapted skin color probability distributions.

We detect the face of the speaker [16] and store it in a separate image. This image is then segmented using the general look-up table. We generate a new look-up table by processing all the pixels of the segmented image and storing the pixels' colors at the proper positions in the new look-up table. As the information from one image is not sufficient, there may be big gaps in this look-up table. These gaps are smoothed by applying a convolution with a Gaussian kernel. To save time, each color component is processed separately. At the end of this step, we obtain a specialized look-up table for the current video.

The general look-up table and the new specialized look-up table are combined by a weighted average of the two tables, with more weight given to the speaker-specific table. This way, we eliminate improbable colors from the general look-up table, thus improving its effectiveness. Some examples of adapted look-up tables can be seen in Fig. 13.

4.1.3. Segmentation

The segmentation is straightforward. We compare the color of each pixel with the value in the look-up table and decide whether it belongs to the skin segment or not. For better performance, we blur the resulting probability image. That way, segments with low probability disappear and low probability regions near high probability regions are strengthened. For each frame, we create a mask of skin color segments by thresholding the probability image (see Fig. 14) to get a binary image that indicates skin and non-skin pixels.

Figure 14: The process of image segmentation. From left to right: original image, probability of skin color blurred for better performance, binary image as a result of thresholding, original image with the mask applied.
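The per-frame segmentation can then be sketched as a look-up, a blur and a threshold; the kernel size and threshold value are illustrative assumptions.

```python
import cv2
import numpy as np

def skin_mask(frame_bgr, lut, blur_ksize=9, threshold=0.5):
    """Return a binary skin mask for one frame using the (R, G, B)-indexed LUT."""
    b, g, r = cv2.split(frame_bgr)
    prob = lut[r, g, b].astype(np.float32)             # per-pixel skin probability
    prob = cv2.GaussianBlur(prob, (blur_ksize, blur_ksize), 0)
    return (prob > threshold).astype(np.uint8) * 255
```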

4.2. Hand and face tracking

The movement of the face and the hands is an important feature of sign language. The trajectory of the hands gives us a good idea about the performed sign, and the position of the head is mainly used for creating a local coordinate system and for normalizing the hand position.

From the resulting bounding rectangle of the face detector [16], we fit an ellipse around the face. The center of the ellipse is assumed to be the center of mass of the head. We use this point as the head's tracked position.

For hand tracking, we aim to retrieve the hands from the skin color image as separate blobs [17]. We define a blob as a connected component of skin color pixels. However, if there are skin-color-like parts in the image, the algorithm retrieves some blobs which do not belong to the signer, i.e., blobs that are neither the signer's head nor the signer's hands. These blobs should be eliminated for better performance.

4.2.1. Blob filtering

We eliminate some of the blobs with the following rules:

• The hand area should not be very small. Eliminate the blobs with small area.

• The head and the two hands are the three biggest blobs in the image when there is no occlusion. Use only the three biggest blobs.

• Eliminate blobs whose distances from the previous positions of already identified blobs exceed a threshold. We use a multiple of the signer's head width as the threshold. (A minimal sketch of these filtering rules is given below.)
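A minimal sketch of the three rules above, assuming each blob is a dict with an area and a center and that the previous head/hand centers are known; the area and distance constants are illustrative.

```python
import math

def filter_blobs(blobs, prev_centers, head_width, min_area=50, dist_factor=3.0):
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Rule 1: drop blobs that are too small to be a hand.
    blobs = [b for b in blobs if b["area"] >= min_area]
    # Rule 2: keep only the three biggest blobs (head and two hands).
    blobs = sorted(blobs, key=lambda b: b["area"], reverse=True)[:3]
    # Rule 3: drop blobs too far from any previously identified blob,
    # using a multiple of the head width as the threshold.
    if prev_centers:
        max_dist = dist_factor * head_width
        blobs = [b for b in blobs
                 if min(dist(b["center"], p) for p in prev_centers) <= max_dist]
    return blobs
```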

At the end of the filtering we expect to have blobs that correspond to the head and hands of the signer. However, when the hands occlude each other or the head, two or three objects form a single blob. We need to somehow know which objects of interest are in occlusion or whether they are so close to each other that they form a single blob.


4.2.2. Occlusion prediction

We apply an occlusion prediction algorithm as a first step in occlusion solving. We need to predict whether there will be an occlusion and among which blobs the occlusion will be. For this purpose, we use a simple strategy that predicts the new position, p_{t+1}, of a blob from its velocity and acceleration. The velocity and acceleration are calculated using:

\[
v_t = p_t - p_{t-1} \tag{5}
\]
\[
a_t = v_t - v_{t-1} \tag{6}
\]
\[
p_{t+1} = p_t + v_t + 0.5\,a_t \tag{7}
\]

The size of the blob is predicted with the same strategy. With the predicted positions and sizes, we check whether these blobs intersect. If there is an intersection, we identify the intersecting blobs and predict that there will be an occlusion between them.
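Equations (5)-(7) translate directly into the following prediction step; the axis-aligned intersection test for the predicted blobs is a simplifying assumption.

```python
import numpy as np

def predict_next(p_prev2, p_prev, p_now):
    """Predict the next blob position from the last three positions (Eqs. 5-7)."""
    v_prev = p_prev - p_prev2
    v_now = p_now - p_prev               # Eq. (5)
    a_now = v_now - v_prev               # Eq. (6)
    return p_now + v_now + 0.5 * a_now   # Eq. (7)

def boxes_intersect(center1, size1, center2, size2):
    """Axis-aligned overlap test between two predicted blobs."""
    return (abs(center1[0] - center2[0]) <= (size1[0] + size2[0]) / 2 and
            abs(center1[1] - center2[1]) <= (size1[1] + size2[1]) / 2)
```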

4.2.3. Occlusion solving

We solve occlusions by first finding the part of the blob that belongs to one of the occluding objects and dividing the blob into two parts (or three in the case that both hands and the head are in occlusion). Thus we always obtain three blobs which can then be tracked (unless one of the hands is out of view). Let us describe the particular cases of occlusion.

Two hands occlude each other. In this case we separate the combined blob into two blobs with a line. We find the bounding ellipse of the blob of the occluded hands. The minor axis of the ellipse is computed and a black line is drawn along this axis, forming two separate blobs (see Fig. 15). This approach only helps us to find approximate positions of the hand blobs in case of occlusion. The hand shape information, on the other hand, is error-prone since the shape of the occluded hand cannot be seen properly.

Figure 15: An example of hand occlusion. From left to right: original image, image with the minor axis drawn, result of tracking.

One hand occludes the head. Usually the blob of the head is much bigger than the blob of the hand. Therefore the same strategy as in the two-hand-occlusion case would have an unwanted effect: there would be two blobs of equal size. Instead, we use a template matching method [18] to find the hand template, collected at the previous frame, in the detected blob region. The template is a gray scale image defined by the hand's bounding box. When occlusion is detected, a region around the previous position of the hand is defined. We calculate the matching cost

\[
R(x, y) = \sum_{x'} \sum_{y'} \bigl(T(x', y') - I(x + x', y + y')\bigr)^2 \tag{8}
\]

where T is the template, I is the image we search in, x and y are the coordinates in the image, and x' and y' are the coordinates in the template. We search for the minimum of this cost at the intersection of this region and the combined blob where the occlusion is. We use the estimated position of the hand as in Fig. 16. The parameters of the ellipse are taken from the bounding ellipse of the hand. As a last step we collect a new template for the hand.

Figure 16: An example of hand and head occlusion. From left to right: original image, image with the ellipse drawn, result of tracking.

Both hands occlude the head. In this case, we apply the template matching method described above to each of the hands.
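For the hand-over-head cases, OpenCV's template matching with the squared-difference score corresponds to Equation (8); the sketch below searches a window around the previous hand position and omits the restriction to the occluded blob's mask. The search radius and the top-left interpretation of the hand position are assumptions.

```python
import cv2

def locate_hand(gray_frame, hand_template, prev_top_left, search_radius=40):
    """Find the hand template near its previous position; returns the new
    top-left corner of the best match."""
    x, y = prev_top_left
    h, w = hand_template.shape
    x0, y0 = max(0, x - search_radius), max(0, y - search_radius)
    region = gray_frame[y0:y0 + 2 * search_radius + h, x0:x0 + 2 * search_radius + w]

    # cv2.TM_SQDIFF computes sum((T - I)^2), i.e. Equation (8); the best
    # match is the location of the minimum.
    scores = cv2.matchTemplate(region, hand_template, cv2.TM_SQDIFF)
    _, _, min_loc, _ = cv2.minMaxLoc(scores)
    return (x0 + min_loc[0], y0 + min_loc[1])
```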

4.2.4. Tracking of hands

The tracking starts with the assumption that, the first time a hand is seen, it is assigned as the right hand if it is on the right side of the face, and vice versa for the left hand. This assumption is only used when there is no hand in the previous frame.

After the two hands are identified, the tracking continues by considering the previous positions of each hand. We always assume that the biggest blob closest to the previous head position belongs to the signer's head. The other two blobs belong to the hands. Since the blobs that are far away from the previous positions of the hands are eliminated in the filtering step, we either assign the closest blob to each hand or, if there are no blobs, we use the previous position of the corresponding hand. This is applied to compensate for mistakes of the skin color detector. If the hand blob cannot be found for a period of time (e.g., half a second), we assume that the hand is out of view.

4.3. Feature extraction

Features need to be extracted from the processed video sequences [19] for later stages such as sign recognition or clustering. We choose to use hand position, motion and simple hand shape features.

The output of the tracking algorithm introduced in the previous section is the position and bounding ellipse of the hands and the head during the whole video sequence. The position of the ellipse over time forms the trajectory, and the shape of the ellipse gives us some information about the hand or head orientation. The features extracted from the ellipse are its center of mass coordinates, x, y, its width, w, its height, h, and the angle between the ellipse's major axis and the x-axis, a. Thus, the tracking algorithm provides five features per object of interest, x, y, w, h, a, for each of the three objects, 15 features in total.

Gaussian smoothing of the measured data in time is applied to reduce noise. The width of the Gaussian kernel is five, i.e., we calculate the smoothed value from the two previous, the present and the two following frames.

Figure 17: Smoothed features: a 10-second example of four left-hand features (x and y position, bounding ellipse width and height)

We normalize all the features such that they are speaker independent and invariant to the source resolution. We define a new coordinate system: the average position of the head center is the new origin and the average width of the head ellipse is the new unit. Thus the normalization consists of a translation to the new origin and scaling. Figure 18 shows the trajectories of the head and the hands in the normalized coordinate system for a four-second video.

Figure 18: Trajectories of the head and hands in the normalized coordinate system

The coordinate transformations are calculated for all 15 features, which can be used for dynamical analysis of movements. We then calculate differences from the future and previous frames and include them in our feature vector:

\[
\dot{x}(t) = \tfrac{1}{2}\bigl(x(t) - x(t-1)\bigr) + \tfrac{1}{2}\bigl(x(t+1) - x(t)\bigr) \tag{9}
\]
\[
\phantom{\dot{x}(t)} = \tfrac{1}{2}x(t+1) - \tfrac{1}{2}x(t-1) \tag{10}
\]

In total, 30 features from tracking are provided: the 15 smoothed features obtained by the tracking algorithm and their 15 differences.
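A sketch of this post-processing with NumPy/SciPy; `features` is assumed to be a (frames, 15) array, and the Gaussian sigma is chosen so that the kernel effectively spans five frames, as stated above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def postprocess_features(features, sigma=1.0):
    """Smooth each feature over time and append central differences (Eqs. 9-10)."""
    smoothed = gaussian_filter1d(features, sigma=sigma, axis=0, truncate=2.0)
    diffs = np.zeros_like(smoothed)
    diffs[1:-1] = 0.5 * (smoothed[2:] - smoothed[:-2])   # 0.5*(x(t+1) - x(t-1))
    return np.concatenate([smoothed, diffs], axis=1)     # (frames, 30)
```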

4.4. Alignment and Clustering

The sign news, which contains continuous signing, was split into isolated signs by the speech recognizer (see Fig. 19). When the pronounced word and the performed sign are shifted with respect to each other, the recognized borders do not fit the sign exactly. We have observed that the spoken word usually precedes the corresponding sign. The starting border of the sign therefore has to be moved backwards in order to compensate for the delay between speech and signing; a typical shift of the starting instant is about 0.2 seconds backwards. In some cases the sign was delayed with respect to the speech, so it is suitable to move the ending instant of the sign forward as well. To keep the sequence as short as possible, we shifted the ending instant about 0.1 seconds forward. If we increase the shift of the beginning or the ending border, we increase the probability that the whole sign is present in the selected time interval, at the expense of including parts of the previous and next signs in the interval.

Figure 19: Timeline with marked borders from the speech recognizer

4.4.1. Distance Calculation

We take two signs from the news whose accompanying speech segments were recognized as the same word. We want to find out whether those two signs are the same or different, in the case of homonyms or mistakes of the speech recognizer. For this purpose we use clustering. We have to consider that we do not know the exact borders of the sign. After we extend the borders of those signs, as described above, the goal is to calculate the similarity of the two signs and to determine whether they are two different performances of the same sign or two totally different signs. If the signs are considered to be the same, they should be added to the same cluster.

We have evaluated our distance calculation methodology by manually selecting a short query sequence, which contains a single sign, and searching for it in a longer interval. We calculate the distance between the query sequence and other equal-length sequences extracted from the longer interval. Fig. 20 shows the distances between a 0.6-second query sequence and the equal-length sequences extracted from a 60-second interval. This way, we have extracted nearly 1500 sequences and calculated their distances to the query sequence. This experiment has shown that our distance calculation yields high distances for different signs and low distances for similar signs.

Figure 20: Distance calculation between a 0.6-second query sequence and equal-length sequences taken from a 60-second (1500-frame) interval

The distance between two sequences is calculated from the tracking features (x and y coordinates of both hands), the simple hand shape features (width and height of the fitted ellipse) and the approximate derivatives of all these features. We calculate the distance of two equal-length sequences as the mean value of the squared differences between corresponding feature values in the sequences. If we represent a feature sequence as a matrix S, where s(m, n) is the n-th feature in the m-th frame, S1 and S2 are the two compared sequences, N is the number of features and M is the number of frames, then the distance is calculated as:

\[
D(S_1, S_2) = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\bigl(s_1(m, n) - s_2(m, n)\bigr)^2 \tag{11}
\]

The squared difference in Equation 11 suppresses the contribution of small differences to the distance on the one hand, and emphasizes greater differences on the other.

Inspired by the described experiment of searching for a short query sequence in a longer interval, we extended the distance calculation to two sequences of different lengths. We take the shorter of the two compared signs and slide it along the longer sign. We find the time position where the distance between the short sign and the corresponding same-length interval of the longer sign is lowest. The distance at this time position, where the short sign and part of the longer sequence fit best, is considered the overall distance between the two signs.
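A direct sketch of Equation (11) and of the sliding extension to sequences of different lengths; both inputs are assumed to be (frames, features) arrays built from the 30 tracking features.

```python
import numpy as np

def frame_distance(s1, s2):
    """Equation (11): mean squared difference over frames and features."""
    return np.mean((s1 - s2) ** 2)

def sign_distance(seq_a, seq_b):
    """Slide the shorter sign along the longer one and keep the minimum distance."""
    short, long_ = (seq_a, seq_b) if len(seq_a) <= len(seq_b) else (seq_b, seq_a)
    m = len(short)
    return min(frame_distance(short, long_[t:t + m])
               for t in range(len(long_) - m + 1))
```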

We handle changes in signing speed in the following way: when the speeds of two instances of the same sign are different, the algorithm suppresses small variations and emphasizes larger variations as described before, so that the signs are classified into different clusters. This is useful when the desired behaviour is to separate even the same sign performed at different speeds. Other techniques can be used, such as multidimensional Dynamic Time Warping (DTW), which can handle multidimensional time-warped signals. Another approach for distance calculation and clustering is to use a supervised method, such as Hidden Markov Models (HMM). Every class (group of identical signs) has one model trained from manually selected signs. An unknown sign is classified into the class whose model gives the highest probability for the features of the unknown sign. In addition, an HMM can recognize the sign borders in the time interval and separate the sign from parts of the previous and following signs which may be present in the interval. We plan to use a supervised or a semi-supervised approach with labelled ground truth data, but currently we leave this as future work.

4.4.2. Clustering

To apply clustering for multiple signs, we use the followingmethod:

1. Calculate pairwise distances between all compared signs; store these distances in an upper triangular matrix.

2. Group the two most similar signs together and recalculate the distance matrix. The new distance matrix has one less row and column, and the distances of the new group are calculated as the average of the distances of the two signs in the new group.

3. Repeat step 2 until all signs are in one group.

4. Find the largest difference between two successive distances at which signs were grouped together in the previous steps, and mark it as the distance up to which the signs are in the same cluster (see Fig. 21).

We set the maximum number of clusters to three: one cluster for the main sign, one cluster for pronunciation differences or homonyms, and one cluster for unrelated signs due to mistakes of the speech recognizer or the tracker.
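The steps above amount to agglomerative clustering with the merge distance of a new group taken as the average of the two merged items, followed by a cut at the largest jump between merge levels; a SciPy-based sketch (with the three-cluster cap) is given below. Using SciPy's "weighted" linkage as a stand-in for step 2 is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_signs(distance_matrix, max_clusters=3):
    """distance_matrix: full symmetric matrix of pairwise sign distances."""
    condensed = squareform(distance_matrix, checks=False)
    merges = linkage(condensed, method="weighted")      # steps 1-3
    heights = merges[:, 2]
    if heights.size < 2:
        return np.ones(len(distance_matrix), dtype=int)
    # Step 4: cut where the jump between successive merge distances is largest.
    k = int(np.argmax(np.diff(heights)))
    cut = (heights[k] + heights[k + 1]) / 2
    labels = fcluster(merges, t=cut, criterion="distance")
    if labels.max() > max_clusters:                     # cap at three clusters
        labels = fcluster(merges, t=max_clusters, criterion="maxclust")
    return labels
```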

Figure 21 shows the dendrograms for two example signs, "ülke" and "başbakan". For the sign "ülke", nine of the retrievals are put into the main cluster and one retrieval is put into another cluster. That retrieval is a mistake of the speech recognizer, and the sign performed is completely different from the others. The clustering algorithm correctly separates it from the other signs. For the sign "başbakan", three clusters are formed. The first cluster is the one with the correct sign. The four retrievals in the second and third clusters are from a single video where the tracking could not be performed accurately as a result of bad skin detection. The two signs are shown in Figure 22.

Figure 21: Dendrograms grouping signs together at different distances, for two example signs: (a) "ülke" and (b) "başbakan". The red line shows the cluster separation level.

5. SYSTEM ASSESSMENT

The performance of the system is affected by several factors. In this section we indicate each of these factors and present performance evaluation results based on ground truth data.

We extracted ground truth data for a subset of 15 words, selected from the most frequent words in the news videos. The speech retrieval system retrieves at most the best 10 occurrences of the requested word. For each word, we annotated each retrieval to indicate whether the retrieval is correct, whether the sign is contained in the interval, whether the sign is a homonym or a pronunciation variant, and whether the tracking is correct.

For each selected word, 10 occurrences are retrieved, except for two words for which eight occurrences are retrieved. Among these 146 retrievals, 134 are annotated as correct, yielding a 91.8% correct retrieval accuracy for the selected 15 words. A summary of the results is given in Table 4.

Table 4: System performance evaluation

                            Accuracy (%)   # of Retrievals   Yes   No
Correct retrieval           91.8           146               134   12
Correct tracking            77.6           134               104   30
Interval contains sign      51.5           134               69    65
Ext. interval contains sign 78.4           134               105   29

Figure 22: Two example signs: (a) "ülke" and (b) "başbakan".

In 51.5% of the correct retrievals, the sign is contained in the interval. However, as explained in Section 4.4, prior to sign clustering we extend the beginning and the end of the retrieved interval by 0.2 seconds and 0.1 seconds, respectively, in order to increase the probability of containing the whole sign. Figure 23 shows the sign presence in the retrieved intervals, together with the number of retrievals that contain the sign when the interval is extended by different amounts. If we consider these extended intervals, the accuracy goes up to 78.4%.

The tracking accuracy is 77.6% over all of the retrievals. We consider the tracking erroneous even if a single frame in the interval is missed.

Figure 23: Sign presence in the spoken word intervals. Most of the intervals either contain the sign or the sign precedes the spoken word by 0.2-0.3 seconds.

The clustering accuracy for the selected signs is 73.3%. This is calculated by comparing the clustering result with the ground truth annotation. Signs that are different from the signs in the other retrievals should form a separate cluster. In 7.5% of the correct retrievals, the performed sign is different from the other signs, which can be either a pronunciation difference or a sign homonym. Similarly, if the tracking for a sign is not correct, it cannot be analyzed properly and such signs should also be in a different cluster. The results show that the clustering performance is acceptable. Most of the mistakes occur as a result of insufficient hand shape information: our current system uses only the width, the height and the angle of the bounding ellipse of the hand, which is not sufficient to discriminate different finger configurations or orientations. More detailed hand shape features should be incorporated into the system for better performance.

6. APPLICATION AND GUI

The user interface and the core search engine are separate components and communicate via TCP/IP socket connections. This design is also expected to be helpful in the future, since a web application is planned to be built on top of this service. The design can be seen in Fig. 24.

Figure 24: System Structure

In Fig. 25, a screenshot of the user interface is shown. There are five sections in it. The first is the "Search" section, where the user inputs a word or phrase using the letters of the Turkish alphabet and sends the query. The application communicates with the engine on the server side using a TCP/IP socket connection, retrieves the data and processes it to show the results in the "Search Results" section. Each query and its results are stored with the date and time in a tree structure for later referral. A "recent searches" menu is added to the search box, aiming to cache the searches and decrease the service time; the user can clear the search history to retrieve the results for recent searches from the server again. The relevance of the returned results (with respect to the speech) is shown using stars (0 to 10) in the tree structure to inform the user about the reliability of the found results. Moreover, the sign clusters are shown in parentheses next to each result.

Figure 25: Screenshot of the User Interface

When the user selects a result, it is loaded and played in the "Video Display" section. Either the original news video or the segmented news video is shown in this section, according to the user's "Show segmented video" selection. The segmented video is added separately to the figure above to indicate this. In addition to the video display, the "Properties" section informs the user about the date and time of the news, the starting time, the duration and the relevance of the result. "Player Controls" and "Options" enable the user to expand the duration to the left/right or to adjust the volume/speed of the video in order to analyze the sign in detail. Apart from using local videos for display, one can uncheck "Use local videos to show" and use the videos on the web; however, the speed of loading video files from the web is not satisfactory since the video files are very large.

7. CONCLUSIONS

We have developed a Turkish sign dictionary, Signiary, which can be used as a source of tutoring videos for novice signers. The dictionary is easy to extend by adding more videos, and with the corpus of broadcast news videos it provides a large-vocabulary dictionary. The application is currently accessible as a standalone application from our laboratory's web site [20] and will soon be accessible over the Internet.

Currently, the system processes the speech and the sliding text to search for the given query and clusters the results by processing the signs that are performed during the retrieved intervals. The clustering method uses only hand tracking features and simple hand shape features. For better clustering, we need to extract detailed hand shape features, since for most signs the hand shape is a discriminating feature. Another improvement that we leave as future work is sign alignment: the system currently attempts to cluster the signs with a very rough alignment, and a more detailed alignment is needed since the speed of the signs and their synchronization with the speech may differ.

8. ACKNOWLEDGMENTS

This work was developed during the eNTERFACE'07 Summer Workshop on Multimodal Interfaces, Istanbul, Turkey, and supported by the European 6th FP SIMILAR Network of Excellence, The Scientific and Technological Research Council of Turkey (TUBITAK) project 107E021, Bogazici University project BAP-03S106 and the Grant Agency of the Academy of Sciences of the Czech Republic, project No. 1ET101470416.

We thank Pınar Santemiz for her help in extracting the ground truth data that we use for the system assessment.

9. REFERENCES

[1] U. Zeshan, "Sign language in Turkey: The story of a hidden language", Turkic Languages, vol. 6, no. 2, pp. 229–274, 2002.

[2] U. Zeshan, "Aspects of Turk isaret dili (Turkish sign language)", Sign Language & Linguistics, vol. 6, no. 1, pp. 43–75, 2003.

[3] L. R. Rabiner and M. Sambur, "An algorithm for determining the endpoints of isolated utterances", Bell System Technical Journal, vol. 54, no. 2, 1975.

[4] E. Arisoy, H. Sak, and M. Saraclar, "Language Modeling for Automatic Turkish Broadcast News Transcription", in Proc. Interspeech, 2007.

[5] "HTK Speech Recognition Toolkit". http://htk.eng.cam.ac.uk.

[6] "AT&T FSM & DCD tools". http://www.research.att.com.

[7] M. Saraclar and R. Sproat, "Lattice-based search for spoken utterance retrieval", in HLT-NAACL, 2004.

[8] C. Allauzen, M. Mohri, and M. Saraclar, "General indexation of weighted automata - application to spoken utterance retrieval", in HLT-NAACL, 2004.

[9] S. Parlak and M. Saraclar, "Spoken term detection for Turkish broadcast news", in ICASSP, 2008.

[10] N. Otsu, "A threshold selection method from gray-level histograms", IEEE Trans. Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.

[11] J. D. Tubbs, "A note on binary template matching", Pattern Recognition, vol. 22, no. 4, pp. 359–365, 1989.

[12] E. Brill, "Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging", Computational Linguistics, vol. 21, no. 4, pp. 543–565, 1995.

[13] G. Ngai and R. Florian, "Transformation-based learning in the fast lane", in NAACL 2001, pp. 40–47, 2001.

[14] V. Vezhnevets, V. Sazonov, and A. Andreeva, "A survey on pixel-based skin color detection techniques", in Graphicon, pp. 85–92, 2003.

[15] Y. Raja, S. J. McKenna, and S. Gong, "Tracking and segmenting people in varying lighting conditions using colour", in IEEE International Conference on Face and Gesture Recognition, Nara, Japan, pp. 228–233, 1998.

[16] "OpenCV, Intel Open Source Computer Vision Library". http://opencvlibrary.sourceforge.net/.

[17] "OpenCV blob extraction library". http://opencvlibrary.sourceforge.net/cvBlobsLib.

[18] N. Tanibata, N. Shimada, and Y. Shirai, "Extraction of hand features for recognition of sign language words", in International Conference on Vision Interface, pp. 391–398, 2002.

[19] S. Ong and S. Ranganath, "Automatic sign language analysis: A survey and the future beyond lexical meaning", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 873–891, 2005.

[20] "Bogazici University, Perceptual Intelligence Laboratory (PILAB)". http://www.cmpe.boun.edu.tr/pilab.
