
An Efficient Method for Face Retrieval from Large Video Datasets

Thao Ngoc Nguyen
University of Science, Faculty of Information Technology
227 Nguyen Van Cu, Dist 5, Ho Chi Minh City, Vietnam
nnthao@fit.hcmus.edu.vn

Thanh Duc Ngo
The Graduate University for Advanced Studies
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
[email protected]

Duy-Dinh Le
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
[email protected]

Shin'ichi Satoh
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
[email protected]

Bac Hoai Le
University of Science, Faculty of Information Technology
227 Nguyen Van Cu, Dist 5, Ho Chi Minh City, Vietnam
lhbac@fit.hcmus.edu.vn

Duc Anh Duong
University of Science, Faculty of Information Technology
227 Nguyen Van Cu, Dist 5, Ho Chi Minh City, Vietnam
daduc@fit.hcmus.edu.vn

ABSTRACT
The human face is one of the most important objects in videos, since it provides rich information for spotting certain people of interest, such as government leaders in news video or the hero in a movie, and is the basis for interpreting facts. Therefore, detecting and recognizing faces appearing in video are essential tasks of many video indexing and retrieval applications. Due to large variations in pose, illumination conditions, occlusions, hairstyles, and facial expressions, robust face matching has been a challenging problem. In addition, when the number of faces in the dataset is huge, e.g. tens of millions of faces, a scalable matching method is needed. To this end, we propose an efficient method for face retrieval in large video datasets. To make the face retrieval robust, the faces of the same person appearing in individual shots are grouped into a single face track using a reliable tracking method. Retrieval is done by computing the similarity between the face tracks in the database and the input face track. For each face track, we select one representative face, and the similarity between two face tracks is the similarity between their two representative faces. The representative face is the mean face of a subset selected from the original face track. In this way, we can achieve high accuracy in retrieval while maintaining low computational cost. For the experiments, we extracted approximately 20 million faces from 370 hours of TRECVID video, a scale that has never been addressed by former attempts. The results, evaluated on a subset of 457,320 manually annotated faces, show that the proposed method is effective and scalable.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIVR '10, July 5-7, Xi'an, China
Copyright © 2010 ACM 978-1-4503-0117-6/10/07 ...$10.00.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Retrieval models; I.5.3 [Pattern Recognition]: Applications—Computer vision

General Terms
Algorithms, Experimentation, Performance

Keywords
face retrieval, face matching, face recognition, local binary patterns, TRECVID

1. INTRODUCTION
The human face is one of the most important objects in videos, especially in news programs, dramas, and movies. By extracting and organizing face information from such videos, we can facilitate multimedia applications that provide content-based access to video content, such as video retrieval, video indexing, and video mining [11, 10, 12, 2, 5, 13].

Current state-of-the-art face detectors [15], which can reliably and quickly detect frontal faces of different sizes and locations against complex backgrounds, can be used to extract faces from videos. However, recognizing faces of the same person is still a challenging task due to large variations in pose, illumination conditions, occlusions, hairstyles, and facial expressions. Figure 1 shows an example of these face variations.

When working on video datasets, one popular approach is to use face tracks¹ for matching instead of single faces as in static image datasets. The main idea is to take advantage of the abundance of frames in each sequence. For example, X. Liu and T. Chen [6] used adaptive Hidden Markov Models (HMMs) to model temporal dynamics for face recognition; A. Hadid and M. Pietikainen [3] proposed an efficient method for extracting representative faces from each face track using the Locally Linear Embedding (LLE) algorithm; and Sivic et al. [12] modeled each face track by a histogram of facial part appearance. These methods have shown better recognition performance than using a single face. However, the numbers of individuals and face tracks used in the experiments with these methods are rather small. Therefore, scalability was not taken into account.

¹ A face sequence containing the faces of one person.

Figure 1: Large variations in facial expressions, poses, illumination conditions, and occlusions make face recognition difficult. Best viewed in color.

When the dataset becomes huge (for example, the TRECVID datasets [14] contain several hundred hours of video), the above methods [6, 3, 12] must be revised, or a new scalable method developed, to handle the scale. The key to solving the scalability problem lies in how two face sequences are matched. The traditional approach, which computes the minimum distance among all pairs of faces in two face tracks, is not applicable when the number of faces in the dataset is huge, e.g. several million faces. S. Satoh and N. Katayama [10] proposed using the SR-tree to reduce the matching complexity of this approach.

We propose an alternative approach for efficient face track matching in large video datasets. The main idea is to select a subset of faces in each face track and compute one representative face for matching. Specifically, given a number k of faces to select from each face sequence, we divide the face track into k equal parts according to its temporal information. For each part, we select one face as its representative. These faces are represented as points in a high-dimensional feature space, and the mean of these points is taken as the representative face of the input face track. The similarity between two face tracks is defined as the similarity between their two representative faces. In this way, we achieve very low computational cost. Although this method is simple, comprehensive experiments on a dataset consisting of 457,320 faces in 1,511 face tracks extracted from the TRECVID dataset show that it achieves matching performance comparable to other state-of-the-art methods.

2. METHOD OVERVIEW

2.1 Face Track Extraction
There are several approaches that can be used to group faces into face tracks. For example, Sivic et al. track facial regions and connect them for grouping [12]. This approach is accurate but requires a high computational cost. To reduce the computational cost while maintaining accuracy, the approach proposed by Everingham et al. [2] uses tracked points obtained from the Kanade-Lucas-Tomasi (KLT) tracker. However, face tracks obtained by this method may be fragmented, since tracked points are sensitive to illumination changes, occlusions, and false face detections. We proposed a method [7] that successfully handles these cases.

We also use tracked points for grouping faces detected from an individual in a video sequence into a face track. Instead of generating interest points at a certain frame and tracking them through the frames of the input sequence as in [2], we re-generate tracked points to compensate for points lost due to occlusions and for newly appearing faces. When points are distracted by flashes, a simple flash-light detector is used to detect flash frames, which are then removed from the grouping process. This method has proven robust and efficient in experiments on various long video sequences from the TRECVID dataset. Its result (94.17%) outperformed Everingham et al.'s method (81.19%). For more details, refer to [7].

2.2 Face Track Representation
We use the LBP (local binary patterns) feature to represent the faces in each face track. The LBP feature, proposed by Ojala et al. [8], is a powerful method for texture description. It is invariant with respect to monotonic grey-scale changes; hence, no grey-scale normalization needs to be done prior to applying the LBP operator. This operator labels the pixels of an image by thresholding the neighborhood of each pixel with the center value and considering the result as a binary number. The LBP operator has also been extended to account for different neighborhood sizes [8]. In general, the operator LBP_{P,R} refers to a neighborhood of P equally spaced pixels on a circle of radius R that form a circularly symmetric neighbor set. It has been shown that certain bins contain more information than others. Therefore, it is possible to use only a subset of the 2^P local binary patterns to describe textured images.

In [8], Ojala et al. defined these fundamental patterns (also called "uniform" patterns) as those with a small number of bitwise transitions from 0 to 1 and vice versa. Accumulating the patterns with more than two transitions into a single bin yields an LBP descriptor, denoted LBP^{u2}_{P,R}, which has fewer than 2^P bins. In our experiments, the input image is divided into sub-images by an n×n grid; the LBP operator is applied to each sub-image to compute a k-bin histogram. Consequently, an (n×n×k)-dimensional feature vector is formed for each input image.
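As an illustration of this descriptor, the following sketch computes an LBP^{u2}_{8,1} grid histogram (a simplified implementation written for this text, not the authors' code; it assumes P = 8, R = 1, 59 bins per cell, and per-cell histogram normalization):

```python
import numpy as np

def uniform_lbp_histogram(image, grid=3):
    """Sketch of an LBP_{8,1}^{u2} grid descriptor: each pixel is compared
    with its 8 neighbors, non-uniform patterns (more than 2 bitwise
    transitions) share a single bin, and a 59-bin histogram is computed
    per grid cell, giving a grid*grid*59-dimensional vector."""
    # Map each of the 256 possible 8-bit patterns to one of 59 bins:
    # the 58 uniform patterns keep their own bins, the rest share bin 58.
    def transitions(p):
        bits = [(p >> i) & 1 for i in range(8)]
        return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    lut = np.zeros(256, dtype=np.int64)
    next_bin = 0
    for p in range(256):
        if transitions(p) <= 2:
            lut[p] = next_bin
            next_bin += 1
        else:
            lut[p] = 58  # all non-uniform patterns accumulate here

    img = np.asarray(image, dtype=np.int64)
    h, w = img.shape
    # Threshold the 8 neighbors of each interior pixel against the center.
    c = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (nb >= c).astype(np.int64) << bit
    bins = lut[codes]

    # One 59-bin histogram per grid cell, concatenated.
    feats = []
    ys = np.array_split(np.arange(bins.shape[0]), grid)
    xs = np.array_split(np.arange(bins.shape[1]), grid)
    for yidx in ys:
        for xidx in xs:
            cell = bins[np.ix_(yidx, xidx)]
            hist = np.bincount(cell.ravel(), minlength=59).astype(float)
            feats.append(hist / max(cell.size, 1))
    return np.concatenate(feats)  # length grid*grid*59

face = np.random.default_rng(0).integers(0, 256, size=(64, 64))
vec = uniform_lbp_histogram(face, grid=3)
print(vec.shape)  # (531,), i.e. 3*3*59
```

With a 3×3 grid and 59 bins, the vector has 531 dimensions, matching the 3×3×59 configuration used in the experiments below.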

We did not use the PCA feature for face representation, since it usually requires robust facial feature point detection (e.g., eyes, nose, and mouth) for normalization. Building such robust feature point detectors for real video data is expensive. Furthermore, using PCA requires additional computational cost for projecting from the original feature space to the eigenspace.

2.3 Matching Method
The main purpose of face retrieval in videos is to retrieve the face tracks that are relevant to a given query. To do so, a similarity estimation method must be used to determine how relevant they are.

Figure 2: The face track extraction method. Faces of the same person appearing in one shot are grouped into one face track.

Commonly, to compute the similarity of two face tracks, we can adopt the idea of hierarchical agglomerative clustering, in which each cluster is a face track and the distance between clusters is the distance between face tracks. There are two common approaches following this idea.

1. Single Linkage Clustering based Distance: The single linkage clustering based distance defines the distance between two clusters (i.e. two face tracks) as the minimum distance between elements (i.e. faces) of the clusters:

   D(A, B) = min_{x ∈ A, y ∈ B} d(x, y)   (1)

   where A and B are face tracks, and x and y are faces of A and B, respectively.

   This method is widely used in many state-of-the-art methods [10, 3, 12, 2].

2. Average Linkage Clustering based Distance: The average linkage clustering based distance defines the distance between two clusters (i.e. two face tracks) as the mean distance between elements (i.e. faces) of the clusters:

   D(A, B) = (1 / (|A| · |B|)) Σ_{x ∈ A} Σ_{y ∈ B} d(x, y)   (2)

   where A and B are face tracks, and x and y are faces of A and B, respectively.
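The two baseline distances can be sketched directly from Equations (1) and (2) (illustrative code, assuming each face track is an array of feature vectors and d is the Euclidean distance; function names follow the short labels Min-Min and Avg-Min used later in the paper):

```python
import numpy as np

def pairwise_dists(A, B):
    """All Euclidean distances d(x, y) between faces x in A and y in B.
    A and B are (n, dim) and (m, dim) arrays of face feature vectors."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def min_min(A, B):
    """Eq. (1): single linkage, the minimum distance over all face pairs."""
    return pairwise_dists(A, B).min()

def avg_min(A, B):
    """Eq. (2): average linkage, the mean distance over all face pairs."""
    return pairwise_dists(A, B).mean()

rng = np.random.default_rng(0)
track_a = rng.normal(size=(100, 531))  # 100 faces, 3x3x59-dim LBP features
track_b = rng.normal(size=(80, 531))
print(min_min(track_a, track_b) <= avg_min(track_a, track_b))  # True
```

Both functions touch every one of the |A|·|B| face pairs, which is exactly the cost the representative-face approach below avoids.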

These methods lead to a huge amount of computation, because they employ pair-wise matching and face tracks usually contain a large number of faces. To reduce the computational cost, representative faces can be used for matching instead of all the faces in the face tracks. One intuitive way of doing so is to choose the middle face as the representative face. However, this method does not work well when the face tracks have large variations, as shown in Figure 3.

We propose a robust but low-cost matching method, called k-Faces. This method, inspired by the idea of selecting representatives to reduce the computational cost, overcomes the aforementioned weakness. For each face track, we select one representative face. The similarity between two face tracks is the similarity between their two representatives. The representative face is the mean face of a subset selected from the original set of faces in the face track.

In particular, to select the representative face for each face track, k-Faces performs the following steps:

1. Divide the face track into k equal parts according to its temporal information. For example, with k equal to five, a face track F that has 100 faces will be divided into five parts, where each part comprises 20 faces extracted from consecutive frames.

2. For each part, select the middle face as the representative for that part. We then obtain a subset of k faces from the original set of faces in the face track.

3. Compute the mean face from this subset of k faces. Note that the mean face may not be a real face. We define the mean face (or representative face) as the 'face' whose feature vector is computed by averaging the feature vectors of the k faces from the previous step.

4. Finally, compute the Euclidean distance between the two 'mean faces'.

We expect that averaging multiple faces smooths out variations and therefore produces a better representative face. In this way, we can achieve high accuracy in retrieval while maintaining low computational cost. Figure 4 gives an example of selecting representatives when k = 3.
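The four steps above can be sketched as follows (an illustrative implementation, assuming a face track is an array of feature vectors in temporal order; the function names are ours, not the authors'):

```python
import numpy as np

def representative_face(track, k=5):
    """k-Faces: split the track into k temporally equal parts, take the
    middle face of each part, and average their feature vectors.
    track is an (n, dim) array of face features in temporal order."""
    parts = np.array_split(track, k)           # step 1: k equal parts
    middles = [p[len(p) // 2] for p in parts]  # step 2: middle face per part
    return np.mean(middles, axis=0)            # step 3: mean face (may not
                                               # correspond to any real face)

def k_faces_distance(track_a, track_b, k=5):
    """Step 4: Euclidean distance between the two mean faces."""
    ra = representative_face(track_a, k)
    rb = representative_face(track_b, k)
    return float(np.linalg.norm(ra - rb))

rng = np.random.default_rng(1)
query_track = rng.normal(size=(100, 531))
db_track = rng.normal(size=(100, 531))
print(k_faces_distance(query_track, db_track, k=5) >= 0.0)  # True
```

Note that each track's representative can be precomputed once offline, so matching a query against the whole database costs only one distance computation per track.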

After selecting an appropriate similarity estimation method, given a query face track, we compute the similarity between the query and each face track in the database; a list of nearest neighbors is then returned quickly by employing the LSH (Locality-Sensitive Hashing) technique for indexing [4].
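To illustrate the indexing step, here is a minimal random-hyperplane LSH sketch. This is the Charikar-style hash family, used here only as an assumption for illustration; the text does not specify the exact hash family used with [4]:

```python
import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    """Minimal random-hyperplane LSH index (assumed variant, for
    illustration): each vector hashes to the sign pattern of its
    projections onto n_bits random hyperplanes, so similar vectors
    tend to collide in the same bucket."""
    def __init__(self, dim, n_bits=16, seed=0):
        self.planes = np.random.default_rng(seed).normal(size=(n_bits, dim))
        self.buckets = defaultdict(list)

    def _key(self, v):
        return tuple((self.planes @ v) > 0)

    def add(self, track_id, rep_face):
        self.buckets[self._key(rep_face)].append(track_id)

    def query(self, rep_face):
        # Candidate face tracks whose representatives share the bucket;
        # a real system would re-rank them by exact Euclidean distance.
        return self.buckets.get(self._key(rep_face), [])

index = HyperplaneLSH(dim=531)
rep = np.random.default_rng(2).normal(size=531)
index.add("track_0042", rep)  # hypothetical track identifier
print(index.query(rep))  # ['track_0042']
```

A production index would use multiple hash tables to boost recall; this sketch shows only the bucketing idea that makes sub-linear retrieval possible.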

3. EXPERIMENTS

3.1 Dataset
We used the TRECVID news video datasets from 2004 to 2006. These datasets contain about 370 hours of video broadcasts in different languages, including English, Chinese, and Arabic. The total number of frames we processed was about 35 million, from which 157,524 face tracks with about 20 million faces were extracted. This amount is much larger than in previous studies such as [1, 9]. We filtered out short face tracks that had fewer than ten faces, which resulted in 35,836 face tracks. Then, 49 people were annotated from 1,511 face tracks containing 457,320 faces. Political figures such as George W. Bush, Hu Jintao, and Saddam Hussein were among them. Figure 5 shows statistics of the evaluated dataset.

Figure 3: Selecting the middle face in the face track might lead to poor matching performance due to loss of variation information.

Figure 4: The k-Faces matching method. A subset of k faces is selected from the face track to compute the mean face.

3.2 Evaluation Criteria
We evaluated the performance with measures that are commonly used in information retrieval: precision, recall, and average precision. Given a face track of the queried person, let Nret be the total number of face tracks returned, Nrel the number of relevant face tracks returned, and Nhit the total number of relevant face tracks in the dataset. Recall and precision are then calculated as follows:

   Recall = Nrel / Nhit   (3)

   Precision = Nrel / Nret   (4)

Average precision (AP) emphasizes returning more relevant face tracks earlier. It can be computed with the following formula:

   Average Precision = ( Σ_{r=1}^{Nret} Precision(r) × rel(r) ) / Nhit   (5)

where r is the rank, Nret the number of face tracks returned, Nhit the total number of relevant face tracks in the dataset, rel(r) a binary function indicating the relevance of the result at rank r, and Precision(r) the precision at cut-off rank r.

In addition, to evaluate the performance of multiple queries,we used mean average precision (MAP), which is the meanof the average precisions computed from the queries.
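These measures can be sketched as follows (illustrative code: a query's result is modeled as a ranked list of booleans marking which returned face tracks are relevant, with n_relevant_total playing the role of Nhit):

```python
def recall_precision(ranked_relevance, n_relevant_total):
    """Eqs. (3)-(4): recall and precision for a returned ranked list.
    ranked_relevance is a list of booleans (True = relevant face track)."""
    n_rel = sum(ranked_relevance)
    recall = n_rel / n_relevant_total
    precision = n_rel / len(ranked_relevance)
    return recall, precision

def average_precision(ranked_relevance, n_relevant_total):
    """Eq. (5): average precision, rewarding relevant tracks ranked early."""
    ap, n_rel = 0.0, 0
    for r, is_rel in enumerate(ranked_relevance, start=1):
        if is_rel:
            n_rel += 1
            ap += n_rel / r  # Precision(r) * rel(r)
    return ap / n_relevant_total

def mean_average_precision(queries):
    """MAP: the mean of per-query average precisions; each query is a
    pair (ranked_relevance, n_relevant_total)."""
    return sum(average_precision(rr, n) for rr, n in queries) / len(queries)

# A perfect ranking of all 3 relevant tracks gives AP = 1.0.
print(average_precision([True, True, True], 3))  # 1.0
```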

3.3 Results
We compared k-Faces with three other face matching methods: the single-face-based method (i.e. k-Faces with k = 1, picking the middle face of the face track as the representative face for matching), the Single Linkage Clustering based method, and the Average Linkage Clustering based method. For short, they are referred to as k-Faces, Single Face, Min-Min, and Avg-Min, respectively. The LBP feature is 3×3×59: it is extracted on a 3×3 grid over the input face, with each cell quantized into 59 bins.

Table 1: The MAP accuracy of four methods: Min-Min, Avg-Min, Single Face, and k-Faces (k=5).

Method          MAP (%)
Min-Min         56.93
k-Faces (k=5)   54.97
Avg-Min         53.69
Single Face     46.46

Figure 6: Precision-Recall of Min-Min, Avg-Min, Single Face, and k-Faces (k=5).

As shown in Table 1 and Figure 6, using one face for face track representation (i.e. Single Face) gave the worst results. Min-Min gave the best result, and k-Faces was comparable with Avg-Min and Min-Min.

The Single Face method uses middle faces to estimate the distance between two face tracks. It is obviously fast: experimental results showed that Single Face takes only six seconds on our dataset of 1,511 face tracks. However, because real-life videos have large variations, the method fails when the middle faces of two face tracks differ in pose, illumination conditions, etc. In contrast, using face tracks, which considers multiple faces, avoids this obstacle.

Figure 7 shows an example of the weakness of Single Face. For the query face track Q in Figure 7a, Single Face ranked the relevant face track A 10th (see Figure 7b) and face track B 43rd (see Figure 7c). The face in the rectangle is the representative face (middle face) chosen by Single Face. The middle faces of Q and A are similar in pose to each other, while B's is different from Q's. This explains why Single Face wrongly ranked A much higher than B.

Figure 5: Statistics of the evaluated dataset.

Figure 7: (a) The queried face track Q. (b) The returned face track A. (c) The returned face track B. The middle face is the face in the rectangle. The faces shown here are sampled from the real face track, whose number of faces is too large to show in full.

Table 2: The computational cost of four methods: Min-Min, Avg-Min, Single Face, and k-Faces (k=5).

Method          Time (seconds)
Min-Min         124,393
Avg-Min         124,119
k-Faces (k=5)   19
Single Face     6

From Table 1 and Figure 6, we note that k-Faces is comparable in performance to the other face-track-based matching methods, Min-Min and Avg-Min. In particular, its MAP accuracy was 1.28% higher than Avg-Min's and slightly lower (by 1.96%) than Min-Min's. However, k-Faces' advantage in speed is impressive: it is over 6,500 times faster than Avg-Min and Min-Min (see Table 2). k-Faces is a bit slower than Single Face due to the cost of computing the "mean face", but this cost is much smaller than the cost of computing all pair-wise distances in Min-Min.

Avg-Min usually gives bad results when the faces in a face track vary greatly from beginning to end. Averaging all pair-wise distances is a good way to eliminate noise; however, it makes the estimated distance deviate more from the actual observation, so a relevant face track may no longer be considered relevant. For example, given two face tracks A and B of the same person, if most of the faces in A are turned left while those in B are turned right, the distance may be large, and hence they would appear less relevant to each other. Meanwhile, by selecting an appropriate k, we can choose a representative for each variation and thus avoid the situation in which the majority overwhelms the minority. In Figure 8, both the query face track Q and the relevant face track R have large variations. R is ranked third by k-Faces (k = 5) but 94th by Avg-Min.

Although Min-Min is better than k-Faces overall, there are still cases in which Min-Min gives worse results. For example, two face tracks belonging to different people can be considered matched by Min-Min because they contain two faces that are unexpectedly similar to each other, as shown in Figure 9. Given a query face track Q, face track A contains the same person as Q, and face track B contains a different person. Min-Min ranked B third and A 11th, after many irrelevant face tracks. Because of large variations, Min-Min could not find a suitable minimum pair for face track A (see Figure 9a), while faces in Q and B were very similar to each other in pose and illumination conditions (see Figure 9b). In contrast, k-Faces, by averaging subsets of k faces of the face tracks, formed feature vectors that revealed the differences better. In this example, k-Faces ranked A at position 3 and the irrelevant face track B at a reasonable position of 196.

However, the k-Faces method depends on the choice of an appropriate subset of faces. Figure 10 shows an example in which Min-Min is better than k-Faces. Min-Min was successful in finding an extremely similar pair of faces (see Figure 10a), whereas k-Faces selected faces that were too different from each other in pose, so the mean faces were also different (see Figures 10b and 10c). Therefore, the question raised here is how to choose an appropriate k so that we can achieve high accuracy in retrieval while maintaining low computational cost.

To answer that question, we investigated different values of the parameter k. Figure 11 and Figure 12 show the accuracy evaluated by MAP and the computational cost for each k. We observed that the computational cost increases linearly with k, while the performance becomes stable from k = 5 onward. We therefore conclude that simply selecting the k with the highest MAP is not a good solution because of the trade-off between accuracy and computational cost: too small a k gives poor results, while too big a k gives slightly better results but unnecessarily consumes time.

Figure 11: The performance of the k-Faces method with different k, evaluated by MAP. Note that when k = 1, the 'mean face' corresponds to the middle face in the feature space (i.e. Single Face).

Figure 12: The computational cost of the k-Faces method with different k.

We also compared the performance of our method with that of a method using k-means clustering to select the representative faces for each face track. Figure 13 shows that the performances are comparable across feature configurations, while our method is simpler and requires less computational cost. This figure can also be used to consider the trade-off between accuracy and processing speed. For example, the feature 3.3.10, which stands for features extracted on a 3×3 grid and quantized into 10 bins, has 90 dimensions and can be computed faster than the feature 3.3.59.

Figure 8: (a) Faces in the queried face track. (b) Faces in the relevant face track R. (c) Five representative faces picked from the queried face track. (d) Five representative faces picked from the relevant face track R.

Figure 9: (a) The minimum pair computed by Min-Min that contains relevant people (left: query, right: relevant face track A). (b) The minimum pair computed by Min-Min that contains irrelevant people (left: query, right: irrelevant face track B). (c) Five representative faces picked from the queried face track. (d) Five representative faces picked from A. (e) Five representative faces picked from B.

Figure 10: (a) The minimum pair computed by Min-Min (left: query, right: relevant face track). (b) Five representative faces picked from the queried face track. (c) Five representative faces picked from the relevant face track.

Figure 13: The performance of the proposed method compared with that of the method using k-means clustering to select representative faces for each face track.

4. CONCLUSIONS
The proposed method provides a robust but low-cost way of performing face retrieval in videos compared to existing baseline methods. Real-life video data is huge, and the faces in it have large variations; hence, efficient methods like ours are essential. Our future research will investigate principled ways of selecting appropriate k values and features for representing faces in order to improve face retrieval performance.

5. REFERENCES
[1] T. L. Berg, A. C. Berg, J. Edwards, and D. A. Forsyth. Who's in the picture? In Advances in Neural Information Processing Systems, 2004.
[2] M. Everingham, J. Sivic, and A. Zisserman. "Hello, my name is... Buffy" – automatic naming of characters in TV video. In Proc. British Machine Vision Conf., pages 899–908, 2006.
[3] A. Hadid and M. Pietikainen. From still image to video-based face recognition: An experimental analysis. In Proc. Intl. Conf. on Automatic Face and Gesture Recognition, pages 813–818, 2004.
[4] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. 30th Symposium on Theory of Computing, pages 604–613, 1998.
[5] D.-D. Le, S. Satoh, M. Houle, and D. Nguyen. Finding important people in large news video databases using multimodal and clustering analysis. In Proc. 2nd IEEE Intl. Workshop on Multimedia Databases and Data Management, pages 127–136, 2007.
[6] X. Liu and T. Chen. Video-based face recognition using adaptive hidden Markov models. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 340–345, 2003.
[7] T. Ngo, D.-D. Le, S. Satoh, and D. Duong. Robust face track finding in video using tracked points. In Proc. Intl. Conf. on Signal-Image Technology & Internet-Based Systems, pages 59–64, 2008.
[8] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002.
[9] D. Ramanan, S. Baker, and S. Kakade. Leveraging archival video for building face datasets. In Proc. Intl. Conf. on Computer Vision, volume 1, pages 1–8, 2007.
[10] S. Satoh and N. Katayama. An efficient implementation and evaluation of robust face sequence matching. In Proc. 10th Intl. Conf. on Image Analysis and Processing, pages 266–271.
[11] S. Satoh, Y. Nakamura, and T. Kanade. Name-It: Naming and detecting faces in news videos. IEEE Multimedia, 6(1):22–35, 1999.
[12] J. Sivic, M. Everingham, and A. Zisserman. Person spotting: Video shot retrieval for face sets. In Proc. Intl. Conf. on Image and Video Retrieval, pages 226–236, 2005.
[13] J. Sivic, M. Everingham, and A. Zisserman. "Who are you?" – learning person specific classifiers from video. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, pages 1145–1152, 2009.
[14] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVID. In MIR '06: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 321–330, 2006.
[15] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.