Task-based evaluation of skin detection for communication and perceptual interfaces

This article was published in an Elsevier journal. The attached copy is furnished to the author for non-commercial research and education use, including for instruction at the author’s institution, sharing with colleagues and providing to institution administration. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/copyright


Author's personal copy

Task-based evaluation of skin detection for communication and perceptual interfaces

Stephen J. Schmugge a, M. Adeel Zaffar a, Leonid V. Tsap b, Min C. Shin a,*

a Department of Computer Science, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC 28223, USA
b Advanced Communications and Signal Processing Group, Systems Research Group, University of California, Lawrence Livermore National Laboratory, Livermore, CA 94551, USA

Received 20 February 2006; accepted 17 April 2007
Available online 22 May 2007

Abstract

Skin detection is frequently used as the first step for the tasks of face and gesture recognition in perceptual interfaces for human–computer interaction and communication. Thus, it is important for researchers using skin detection to choose the optimal method for their specific task. In this paper, we propose a novel method of measuring the performance of skin detection for a task. We have created an evaluation framework for the task of hand detection and executed this assessment using a large dataset containing 17 million pixels from 225 images taken under various conditions. The parameter set of the skin detection has been trained extensively. Five colorspace transformations, with and without the illuminance component, coupled with two color modeling approaches have been evaluated. The results indicate that the best performance is achieved by transforming to the SCT colorspace, using the illuminance component, and modeling the distribution with the histogram approach. Some conclusions, such as the SCT colorspace being one of the best colorspaces, are consistent with our previous work, while other findings differ, such as the YUV colorspace performing well in this work when it was one of the worst in our previous work. This indicates that performance measured at the pixel level might not be the ultimate indicator of performance at the task level of hand detection. We believe that users of skin detection will find our task-based results more relevant than traditional pixel-level results. However, we acknowledge that an evaluation is limited by its specific dataset and evaluation protocols.
© 2007 Elsevier Inc. All rights reserved.

Keywords: Skin detection; Hand detection; Empirical evaluation

1. Introduction

Skin detection is frequently used as the first step for important interface/communication tasks including hand detection, gesture recognition and face recognition. For the task to perform robustly, it needs to receive the best skin detection output. Therefore, it is important for the users of the skin detector (or the researchers of the tasks) to choose the optimal detection approach for their task.

Pre-existing evaluation frameworks are not used. Table 1 lists recent face and hand analysis tasks using skin detection. The reasoning behind selecting a specific colorspace transformation, an important step of skin detection, is shown under the column "Reason for Choosing." Of 16 recent papers on tasks using skin detection, two papers [1,2] considered the outcome of a previously established colorspace evaluation framework [3]. Sigal et al. [1] selected the best skin detection approach based on the conclusion from [3]. However, Hsu et al. [2], after considering the conclusions from [3], instead selected the detector based on a theoretical reasoning from [4]. In fact, 7 of the 16 papers stated no explicit reason for their selection. Three actually conducted their own studies to compare colorspaces for their decisions. The others based their decisions on the theoretical reasonings of "unique skin color," "perceptually uniform" or "chrominance separating." Simply, none of the 16 papers selected the colorspace

1047-3203/$ - see front matter © 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.jvcir.2007.04.008

* Corresponding author.
E-mail addresses: [email protected] (S.J. Schmugge), mazaffar@uncc.edu (M.A. Zaffar), [email protected] (L.V. Tsap), [email protected] (M.C. Shin).


J. Vis. Commun. Image R. 18 (2007) 487–495


transformation based on results or reasoning that has high relevance to their specific task of face or hand analysis. Because it is questionable that the best skin detector approach is used, it is also questionable that the application is performing at its best.

It is important to question why none of those 16 papers in Table 1 decided to select the skin detection algorithm based on the conclusions of a previous evaluation framework. We believe an evaluation framework must have the following attributes to provide results that are highly relevant to the users of skin detection. First, the performance needs to be measured at the "task level" rather than the typical "pixel level." Although a correlation might seem plausible, the correlation between performance measured at the pixel level and at the task level simply has not been studied. Second, the parameters of the algorithm need to be tuned extensively during the training session, and the optimized parameter settings should be applied on the testing dataset. The skin detector (like any other vision algorithm) produces different outputs with respect to the parameter setting, but it is rare to see a train-test scheme with extensive parameter tuning used in an evaluation study. Third, the evaluation needs a large and thorough dataset taken by varying the factors impacting performance, such as illumination conditions, indoor vs. outdoor scenes, and skin tones. The previous skin detection evaluation frameworks are listed in Table 2. We found that no evaluation framework satisfies all three criteria.

In this paper, we present a novel method of measuring the performance of a skin detector for the task of hand detection. The skin detectors are assessed by how well they prepare the input for the task of hand detection to perform accurately. We believe that the results from our framework will provide much more relevant information for the users of skin detection to choose the optimal detector for the task of hand detection. Twenty combinations of five colorspace transformations, dropping/keeping of the illuminance component, and two color modeling methods are evaluated using the receiver operating characteristic (ROC) curve. A large dataset of more than 17 million pixels in 225 images is collected by varying the attributes of skin tone, illumination conditions, background complexity and hand pose. The parameters of skin detection are extensively tuned. In total, our experiment consumed more than 4 days of continuous computation on 10 dual 3 GHz Xeon processor servers, which is equivalent to nearly 91 days on a single processor machine. A train and testing scheme is used to measure the performance on unseen data. The results indicate that the best performance is achieved by transforming to the SCT colorspace, using the illuminance component, and modeling the distribution with the histogram approach. Some conclusions, such as the SCT colorspace being one of the best colorspaces, are consistent with our previous work, while other findings differ, such as the YUV colorspace performing well in this work when it was one of the worst in our previous work. This indicates that the performance measured at the pixel level might not be a good indicator of the performance at the task level of hand detection.

Note that our results and conclusions are based on one particular hand detection method. Thus, we do not claim that our conclusions apply to a general task of hand detection. However, we believe that the users of skin detection will find our results much more relevant, thus helping them to make a more informed decision on choosing a skin detector.

Table 1
Applications using skin detection

Source | Task | CS | IC | Model | Benefit | Eval method
[5] | Face detect | YCbCr | Yes | Gaussian clusters | None explicit | Simple self eval
[6] | Face detect | NRGB | No | Gaussian | Illum. Inv. [7,8] | Prev skin algo [9]
[10] | Face detect | HSV and YCbCr | Yes | K-means wavelet | Illum. Inv. | Self eval
[11] | Face detect | NRGB | No | Ellipse | Illum. Inv. | Prev skin algo [7]
[12] | Face detect | Multiple | No | Gaussian | Illum. Inv. | Self eval
[13] | Face detect | NRGB | No | Mixture of Gaussian | Frequently used | Simple self eval
[14] | Face detect | YCbCr | Yes | Thresholding | Illum. Inv., Unique skin color | None explicit
[2] | Face detect | YCbCr | Yes | Ellipse | Skin color cluster | Eval work [3]*
[15] | Face recog | HSV | Yes | Gaussian mixture | Skin color cluster | Prev skin algo [16]
[17] | Gesture recog | YUV | Yes | Look-up table | None explicit | None explicit
[18] | Gesture recog | HSI | Yes | Gabor | None explicit | None explicit
[19] | Gesture recog | YUV | Yes | Gaussian | Skin color cluster | Prev skin algo [20]
[21] | Gesture recog | YIQ | No | Threshold | Skin color cluster | None explicit
[22] | Gesture recog | HSV | No | Thresholding | None explicit | None explicit
[23] | Gesture recog | HSI | Yes | Look-up table | Illum. Inv. | None explicit
[1] | Video seg | HSV | Yes | Histogram | Illum. Inv. | Prev eval [3]

Under "skin detection approach," the colorspace (CS), the usage of the illuminance component (IC) and the color modeling (Model) are listed. The benefits of the skin detection algorithm are listed as "Illum. Inv." (illumination invariance), "skin color cluster" (different skin tones are clustered together) and "unique" (the skin color is unique from the background, thus enabling detection in a complex background). The evaluation method used for deciding the skin detection algorithm is listed under "Eval method" as "prev eval" (referred to the outcome of a previous evaluation framework), "self eval" (created their own evaluation criteria and performed an extensive analysis), "simple self eval" (performed a simple, usually visual, comparison), "prev skin algo" (referred to the claims made by a previous skin detection algorithm) and "none explicit" (no explicit method is mentioned). *Rejected the claims from [3] and chose based on a theoretical reason instead.

2. Previous works

Albiol et al. [24] conducted a theoretical study and concluded that an invertible colorspace has its own corresponding skin detection scheme which produces optimal results. They validated their hypothesis by testing the RGB, YCbCr, HSV and CbCr colorspaces with a histogram approach to color modeling. In [25], a database of images captured from the web was used to construct a skin detector and to evaluate the performance of two skin-color modeling methods in the RGB colorspace. A scheme was proposed for detecting adult content in images. Results indicated a detection rate of 88% against a false alarm rate of 11%. They found that the histogram-based method was better than the mixture of Gaussians both in terms of detection rates and running time. A quantitative analysis of selected skin detection methods was conducted in [27]. The skin probability map in the RGB colorspace was determined to produce the best results. The performance of five chrominance spaces in the task of detecting faces in still images was measured in [3]. TSL (Tint Saturation Luminance) outperformed the other colorspaces with a 90.8% correct face detection rate, whereas CIELAB gave the worst face detection rate of 38.4%. The paper acknowledges that its idea of using segmentation with normalized colorspaces does not adequately take into account possible variations in illumination. Furthermore, it states that pixels with low RGB values hinder the performance of the detector in terms of differentiating between skin and non-skin pixels and contribute more towards the noise level. Zarit et al. [28] evaluated two skin detection schemes using five colorspaces. The detection methods were based on color histograms and built on LUTs (look-up tables) and Bayesian theory. The performance was evaluated in terms of the ratio of correctly detected skin pixels and the skin and non-skin error. The study concluded that the LUT, when used with HSV, gave the best results compared to the CIELAB and YCbCr colorspaces. Hence the paper stated that the performance of the detection methods varied with the choice of colorspace transformation.

3. Framework

3.1. Skin detection approach

In this paper, we assume that skin detection for the task of hand detection is a three-step process. First, the color of an input pixel is transformed from the RGB colorspace to one of five colorspaces. Second, the illuminance component of the transformed color is kept or dropped for further analysis. We call the color with the illuminance component "3D color" and without it "2D color." Third, the skin and non-skin color is modeled statistically using one of two approaches.

3.1.1. Colorspace transformations and their illuminance component

The input images are captured in the RGB (Red, Green, Blue) colorspace. The 2D color is computed by dropping the G component. The CIE color system is based on the CIE (Commission Internationale de l'Eclairage) primaries established in 1931. The CIEXYZ colorspace forms a cone-shaped space with Y as the luminance component. The CIELAB colorspace is computed from the CIEXYZ space. It attempts to linearize the color differences so that the same amount of color difference results in the same distance in the colorspace. The L component is the illuminance component. The HSI space separates the intensity (I) from the two other chromaticity components of hue and saturation. I is the illuminance component. The SCT (spherical coordinate transform) separates illumination information from color information [29]. L is the illuminance component. The YUV space is used for digital video and compression techniques. Y is the illuminance component.

For each colorspace, we dropped its illuminance component to form the 2D color. Note that the values of each component of the colorspaces are adjusted to the range of [0, 255] and quantized in 256 levels. The details of the colorspace transformation equations can be found in [4] except for the SCT [29].
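To make the transformation and quantization step concrete, the following is a minimal sketch for one of the five colorspaces (YUV), assuming the standard BT.601 coefficients; the exact equations the paper uses (from its references [4] and [29]) may differ in coefficients and scaling.

```python
def rgb_to_yuv_quantized(r, g, b):
    """Transform one RGB pixel to YUV, then rescale each component
    to [0, 255] quantized in 256 levels, as in Section 3.1.1.
    Coefficients are the standard BT.601 definition (an assumption;
    the paper cites [4] for its exact transformation equations)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # Y is the illuminance component
    u = 0.492 * (b - y)                     # chrominance
    v = 0.877 * (r - y)                     # chrominance

    def rescale(c, lo, hi):
        # Map the component's natural range [lo, hi] onto [0, 255].
        return max(0, min(255, int(round((c - lo) / (hi - lo) * 255))))

    return (rescale(y, 0.0, 255.0),
            rescale(u, -0.492 * 255, 0.492 * 255),
            rescale(v, -0.877 * 255, 0.877 * 255))

def drop_illuminance(yuv):
    """Form the '2D color' by dropping the illuminance (Y) component."""
    return yuv[1:]
```

Dropping the first component yields the 2D color used when the illuminance channel is discarded; for HSI, SCT and CIELAB the dropped channel would be I, L and L, respectively.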

3.1.2. Color modeling methods

We used a classifier incorporating Bayesian decision theory to classify a pixel color into the skin class (ωs) or the non-skin class (ωns). For each pixel, a feature (x) is created by using all three color components after colorspace transformation for the 3D color, or by using the two chrominance components after colorspace transformation for the 2D color. The posterior probability of ωs is computed as

p(ωs|x) = p(x|ωs) p(ωs) / p(x)

where p(x|ωs) is the class-conditional probability, p(ωs) is the prior probability, and p(x) is the evidence factor. Each feature vector (x) is classified as ωs if p(ωs|x)/p(ωns|x) > Tratio, or as ωns otherwise.

Table 2
Comparative studies on colorspaces for skin detection

Source | CS | IC | Model | Images | Pixels | Train-test | Parameter tuned | Metric | Task
Albiol et al. [24] | 4 | No | 1 | 7022 | 17 mil | No | No | ROC | Pixel
Jones and Rehg [25] | 1 | No | 2 | 18,696 | 2 mil | Yes | Yes | ROC | Pixel
Brand and Mason [26] | 2 | No | 1 | 18,696 | 2 mil | No | Yes | SA/FA | Pixel
Terrillon et al. [3] | 9 | No | 1 | 170 | 13 mil | No | No | TP/TN | Face
This work | 5 | Yes | 2 | 225 | 17 mil | Yes | Yes | ROC | Hand

Under "compared," the number of colorspaces (CS) and color modelings (Model) compared and whether the effect of the illuminance component (IC) was evaluated are listed.

Using the normal density approach, the class-conditional probability of a class is determined using the multivariate normal density equation [30]. In the histogram approach [25,1], the probability is modeled with a histogram for each class. The histograms are quantized into bins per channel. The class-conditional probability p(x|ωs) is computed as the ratio of each bin value (c[x]) to the sum of all bins.

For each skin detection approach, we used the best histogram bin size found in our previous work [31,32].
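A minimal sketch of the histogram-based Bayesian classifier described above; the bin size, priors and data structures here are illustrative choices, not the paper's trained settings.

```python
from collections import Counter

class HistogramSkinModel:
    """Histogram-based Bayesian skin classifier, sketching the
    approach of Section 3.1.2 (illustrative, not the trained model)."""

    def __init__(self, bins_per_channel=32):
        self.bins = bins_per_channel
        self.skin = Counter()      # bin counts c[x] for the skin class
        self.non_skin = Counter()  # bin counts for the non-skin class

    def _bin(self, color):
        # Quantize each [0, 255] component into one of `bins` bins.
        step = 256 // self.bins
        return tuple(c // step for c in color)

    def train(self, skin_pixels, non_skin_pixels):
        for px in skin_pixels:
            self.skin[self._bin(px)] += 1
        for px in non_skin_pixels:
            self.non_skin[self._bin(px)] += 1

    def likelihood_ratio(self, color):
        """p(ws|x)/p(wns|x) = p(x|ws)p(ws) / (p(x|wns)p(wns))."""
        b = self._bin(color)
        n_s = sum(self.skin.values())
        n_ns = sum(self.non_skin.values())
        p_x_s = self.skin[b] / n_s        # class-conditional: bin count / total
        p_x_ns = self.non_skin[b] / n_ns
        p_s = n_s / (n_s + n_ns)          # priors from training proportions
        p_ns = 1.0 - p_s
        if p_x_ns * p_ns == 0:
            return float('inf') if p_x_s * p_s > 0 else 0.0
        return (p_x_s * p_s) / (p_x_ns * p_ns)

    def is_skin(self, color, t_ratio=1.0):
        return self.likelihood_ratio(color) > t_ratio
```

Sweeping `t_ratio` over the trained model's unique likelihood-ratio values is what produces the ROC points used later in Section 3.4.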

3.2. Task of hand detection

We have selected the hand detection algorithm by Shin et al. [33]. An overview of the algorithm is shown in Fig. 1. Given color and range images of the scene, the algorithm performs skin detection in the SCT colorspace without the illuminance component using a minimum distance classifier. In this paper, we have varied the skin detection approaches. Small noise is removed using morphological operators. Then the regions that are small or have unlikely texture or shape are removed. The manipulating hand is detected by finding the closest region using the corresponding range image. We evaluate the skin detection approaches by measuring how well they prepare the input for the task of hand detection.

3.3. Dataset

We have collected a large dataset of 225 color and range images consisting of more than 17 million pixels using a Digiclops range camera with different illumination conditions (taken indoors and outdoors), background complexity, skin tones, and hand poses. In order to examine the effect of only the skin detectors (not the hand detection), we captured the images with the hand being the closest skin region. However, note that an image could contain skin-like colored regions that are closer than the hand. Four attributes of illumination, skin tone, hand pose and scene type are varied in the dataset. A full categorization and count of the dataset is displayed in Table 3.

The skin tone and illumination condition are important factors to assess, since one of the most frequently stated reasons for using a skin detector for the task of hand detection is its robust performance under those two factors. Since the hand detection involves segmentation at the region level, the hand poses are varied to examine the connectedness of the detected skin pixels. Note that open hands could get broken into multiple regions, leading to incorrect detection.

Fig. 1. Steps of hand detection.

Table 3
The number of images in categories of the dataset

Hand | Yes 177 | No 48
Outdoor | Dark Illum. 37 | Regular Illum. 58
Indoor | Dark Illum. 38 | Regular Illum. 92
Scene | Simple 101 | Complex 124
Skin | Light 52 | Medium 66 | Dark 59
Hand pose | Closed 62 | Open 54 | Open spread 61

3.4. Performance metrics

3.4.1. Ground truth

The ground truth of hand detection is specified by two values: (1) presence or absence of a hand, and (2) location of the hand. The hand location is described by using a bounding box. The ground truth is labeled manually.

3.4.2. Detection assessment

Each hand detection is assessed as a true positive (TP), true negative (TN), false positive (FP), or false negative (FN). The hand detection by the algorithm is classified as a TP if the hand is detected at the correct location. It is classified as a false positive if (1) a hand is detected in an image without a hand, or (2) a hand is detected in an incorrect location. The detected location is determined to be correct if

Aoverlap/AGT > Toverlap and Aoverlap/AMD > Toverlap

where Aoverlap is the size of the overlap between the ground truth (AGT) and algorithm-detected (AMD) hand regions. The value of 0.75 is used for Toverlap. We use TP and FP for measuring the performance using a receiver operating characteristic (ROC) curve as described in Section 3.4.3.
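The assessment rule can be sketched directly from the criteria above. The boxes are illustrative (x1, y1, x2, y2) tuples; the paper specifies only that hand locations are bounding boxes, not their representation.

```python
def box_overlap_area(a, b):
    """Overlap area of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def box_area(a):
    return (a[2] - a[0]) * (a[3] - a[1])

def assess_detection(gt_box, detected_box, t_overlap=0.75):
    """Classify one image's hand-detection result as TP/TN/FP/FN
    following Section 3.4.2. Either box may be None (no hand)."""
    if gt_box is None:
        return 'TN' if detected_box is None else 'FP'
    if detected_box is None:
        return 'FN'
    overlap = box_overlap_area(gt_box, detected_box)
    # Correct only if the overlap covers enough of BOTH regions.
    correct = (overlap / box_area(gt_box) > t_overlap and
               overlap / box_area(detected_box) > t_overlap)
    return 'TP' if correct else 'FP'
```

Requiring the overlap ratio against both AGT and AMD penalizes detections that are too large as well as too small.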

3.4.3. ROC analysis

The performance of a skin detector is computed by analyzing the receiver operating characteristic (ROC) curve. We apply the skin detection output for a value of the parameter Tratio (refer to Section 3.1.2) to the hand detection and assess the hand detection output as TP or FP. By applying this to multiple images, we obtain the number of TPs and FPs. We normalize the TPs by dividing by the total number of images with a hand, and normalize the FPs by dividing by the sum of the total number of images without a hand (the first criterion for FP) and the total number of images with a hand but without a TP (the second criterion for FP). Note that the second criterion for FP is added since an FP could occur in images without a hand and in images with a hand that did not have a true positive assessment. So, for a given Tratio, we compute the performance with the normalized TP and FP. Then, the area under the ROC curve (AUC) is determined by computing the area under the points along the ROC curve. These are the set of performance points that have the highest TP value for a given FP value. The ROC curves of the best and worst skin detectors are shown in Fig. 2.
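A sketch of the normalization and AUC computation just described; the `assessments` input is a hypothetical list of per-image labels, one per test image.

```python
def normalized_rates(assessments, n_hand_images, n_no_hand_images):
    """Normalized TP and FP fractions for one Tratio (Section 3.4.3).
    `assessments` is a list of 'TP'/'FP'/'FN'/'TN' labels."""
    tps = assessments.count('TP')
    fps = assessments.count('FP')
    tpf = tps / n_hand_images
    # FPs can occur in no-hand images and in hand images without a TP.
    fpf = fps / (n_no_hand_images + (n_hand_images - tps))
    return tpf, fpf

def auc(points):
    """Area under an ROC curve given (fpf, tpf) points, using the
    trapezoid rule over the points sorted by FPF."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Evaluating `normalized_rates` at every attempted Tratio and passing the resulting points to `auc` gives one AUC figure per skin detection approach.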

3.4.4. Parameter tuning

The performance of a skin detection algorithm varies greatly with respect to its parameter. The goal of parameter tuning is to find the set of parameter values that results in the set of highest TP values for given FP values (the points along the ROC curve). These points could be found by assessing every possible value of the parameter (Tratio). However, this is not a feasible option, especially considering that the possible range of Tratio could be [0, +∞). Therefore we propose the following scheme for parameter tuning. First, the effective parameters of Tratio are computed by referring to the trained skin color model. After training a skin detector using the manually segmented skin pixels, all unique values of p(ωs|x)/p(ωns|x) are found by calculating the ratio for each possible color. For histogram modeling, the number of unique values is much smaller due to its non-parametric nature. Second, we find the initial set of thresholds to attempt. Note that since the range of Tratio is very large, it is important to select values of Tratio that will cover a good range of performance. To do so, we first uniformly sample 1000 values from the unique sorted values of Tratio. We then evaluate at those 1000 values of Tratio and calculate the initial AUC (AUCbefore). Next, we sample another 1000 unique Tratio values and calculate the AUC again (AUCafter) with these newly added Tratio values. If AUCafter − AUCbefore > 0.001, then we sample another 1000 unused Tratio values and repeat until AUCafter − AUCbefore ≤ 0.001. For normal density modeling, 4110 Tratio values were attempted on average per training session, and 1035 Tratio values were attempted on average for the histogram approaches. The entire parameter tuning process (including training and testing) for all skin detection methods consumed more than 4 days of computation on our 10 dual 3 GHz Xeon processor servers.
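The sampling loop can be sketched as follows; `evaluate_auc` stands in for running the full hand-detection pipeline over a set of thresholds (a hypothetical callback, not a function from the paper).

```python
def tune_thresholds(unique_ratios, evaluate_auc, batch=1000, tol=0.001):
    """Sketch of the Tratio sampling scheme in Section 3.4.4: add
    uniform samples of unused threshold values, batch by batch, until
    the AUC improves by no more than `tol`."""
    remaining = sorted(set(unique_ratios))
    chosen = []
    auc_before = 0.0
    while remaining:
        # Uniformly sample up to `batch` of the not-yet-used values.
        step = max(1, len(remaining) // batch)
        sample = remaining[::step][:batch]
        sampled = set(sample)
        remaining = [t for t in remaining if t not in sampled]
        chosen.extend(sample)
        auc_after = evaluate_auc(chosen)
        if auc_after - auc_before <= tol:
            return chosen, auc_after
        auc_before = auc_after
    return chosen, auc_before
```

Because each batch is drawn uniformly from the sorted unused values, early iterations already span the full threshold range, and later batches only refine the curve.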

3.4.5. Train and testing

For training and testing, we performed two-fold cross validation by dividing the entire dataset into two folds, randomly choosing images with the constraints of keeping the number of ground-truthed skin images equal in each fold and keeping the ratio of hand images to non-hand images constant. The training involves two steps. First, the skin detector is trained by using the manually segmented skin and non-skin pixels in a representative set of images from the train set (refer to Fig. 3). On average, 1,497,600 pixels are used for training. Second, a set of Tratio values along the ROC curve is found through the parameter tuning process. Third, the test performance is computed by applying the Tratio values along the trained ROC curve on the test set of images.

Fig. 2. The ROC curve (TPF vs. FPF) for the best (the red solid line) and worst (the blue dotted line) performing skin detector. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)
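The constrained split can be sketched as a stratified deal. The `has_hand` and `gt_skin` field names are illustrative, not from the paper, and this sketch balances each stratum only approximately; an exact-equality constraint would need extra rebalancing.

```python
import random

def two_fold_split(images, seed=0):
    """Sketch of the two-fold split in Section 3.4.5: images are
    assigned randomly while keeping the count of ground-truthed skin
    images roughly equal per fold and the hand/non-hand ratio roughly
    constant. Each image is a dict with boolean 'has_hand' and
    'gt_skin' keys (illustrative field names)."""
    rng = random.Random(seed)
    folds = ([], [])
    # Stratify on the two constraints, then deal each stratum
    # alternately into the two folds.
    strata = {}
    for img in images:
        strata.setdefault((img['has_hand'], img['gt_skin']), []).append(img)
    for group in strata.values():
        rng.shuffle(group)
        for i, img in enumerate(group):
            folds[i % 2].append(img)
    return folds
```

Shuffling within each stratum randomizes which images land in which fold, while the alternating deal keeps the per-stratum counts within one image of equal.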

4. Results

We have executed the 20 (= 5 colorspaces × 2 (with or without illuminance component) × 2 modeling methods) combinations of skin detection approaches on 225 images. For each approach, we have performed two training and testing folds by splitting the dataset into halves. The entire analysis consumed more than 4 days of computing time on ten computer servers with dual Xeon processors, which is equivalent to 91 days of execution on a single workstation. The results of these experiments are limited to the hand detection method, dataset, and evaluation framework used.

4.1. Effect of colorspace transformation

For a given modeling approach and color dimension (with or without illuminance component), the performance varies greatly with respect to the colorspace transformation. For instance, with the 3D color and normal modeling, the performance could range from being the eighth best with YUV to the worst with CIELAB (refer to the second column in Table 4). This type of high variation in ranking among different colorspaces is shown with both modeling approaches and color dimensions.

The performance improvement due to colorspace transformation depends on the modeling and color dimension. All four colorspace transformations improve performance with the histogram modeling. However, all but one degrade performance in 3D color with normal density modeling, and all but one improve performance in 2D color with normal density modeling.

4.2. Effect of illuminance component

The illuminance component affects the performance mostly the same way for the two modeling approaches. When the illuminance component is used (3D color), the performance improved with the histogram modeling. The performance also improved for all but CIELAB using normal density color modeling. The average performance of normal density is 0.279 (in 3D color) and 0.228 (in 2D color). For the histogram modeling, the average performance is 0.417 (in 3D color) and 0.332 (in 2D color). This finding is similar to our previous work [31,32] measured at the pixel level, which showed that keeping the illuminance component yields higher performance in both color modelings for certain colorspaces.

Fig. 3. A sample of ground truth of skin pixels (left) in a color image (right) used for training skin detectors. Skin pixels are colored in black, non-skin pixels are colored in white, and pixels that were difficult and tedious to mark are colored in gray. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)

Table 4
The testing performance of the 20 combinations, shown as AUC (area under curve)

Colorspace | Normal 3D | Normal 2D | Histogram 3D | Histogram 2D
CIELAB | 0.052 (20) | 0.057 (19) | 0.443 (2) | 0.384 (6)
HSI | 0.289 (14) | 0.249 (16) | 0.439 (3) | 0.309 (12)
RGB | 0.354 (10) | 0.233 (18) | 0.374 (7) | 0.263 (15)
SCT | 0.336 (11) | 0.247 (17) | 0.450 (1) | 0.303 (13)
YUV | 0.366 (8) | 0.356 (9) | 0.387 (5) | 0.400 (4)

Each AUC is an average from two testings. A higher AUC value indicates better performance. The ranking is shown in parentheses. For the histogram approach, the performance is based on the trained histogram size.

Fig. 4. The warmer the color, the greater the number of colors for the bin in that category. The distribution in CIELAB 2D shows that the colors do not fit well under one Gaussian, but would fit better with multiple Gaussians, hence the poorer performance with the normal density modeler. YUV 2D color clusters into one centroid for both skin and non-skin, with a better fit for one Gaussian probability, yielding much better performance. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this paper.)



4.3. Effect of modeling methods

In general, the histogram model performed better than the normal density model. All five 3D colors and all five 2D colors performed better using histogram modeling than normal density modeling. The average performances are 0.254 for the normal density and 0.375 for the histogram. This finding is also similar to our previous work [31,32], which found that the histogram modeler performed better than the normal density modeler for most colorspaces.

4.4. Ranking

First, the performance of the twenty combinations varies greatly (Table 4); the best performance is nearly nine times better than the worst performance. The CIELAB colorspace degrades greatly when using the normal density modeler. Fig. 4 displays a histogram color distribution of CIELAB 2D and YUV 2D. It indicates that the distribution of skin and non-skin color in CIELAB 2D is not suitable for the normal density model, resulting in rather poor performance.

Fig. 5. Performance (average AUC) of each skin detector at various values of Toverlap (hand overlap threshold). Note that the ranking change between Toverlap values is minimal.

Table 5
The average AUC (area under curve) performance in each category

Outdoor | Dark Illum. 0.340 | Regular Illum. 0.429 | All Outdoor 0.329
Indoor | Dark Illum. 0.351 | Regular Illum. 0.309 | All Indoor 0.297
Scene | Simple 0.277 | Complex 0.345
Skin | Light 0.336 | Medium 0.423 | Dark 0.472
Hand pose | Closed 0.359 | Open 0.452 | Open Spread 0.415

Fig. 6. A sample of the hardest and easiest images with their accuracies. Scene {C = complex, S = simple}, Illumination {ID = indoor dark, IR = indoor regular, OD = outdoor dark, OR = outdoor regular}, Pose {CL = closed, O = open, OS = open spread}, Skin tone {L = light, M = medium, D = dark}.

Second, the best performance is achieved by transforming to the SCT colorspace, keeping the illuminance component, and modeling with the histogram approach. In the pixel-level evaluation [31,32], the SCT colorspace was found to perform well with this same combination. Surprisingly, the YUV colorspace, which was evaluated to be one of the worst colorspaces in 2D color in the pixel-level evaluation, is ranked fourth (2D color) and fifth (3D color) with the histogram model. This indicates that the correlation between pixel-level and task-level evaluation is not clear. It is beyond the scope of this paper to find the reasons for the disparity between the two evaluation frameworks.

Third, however, the ranking change was minimal with different Toverlap values (which were used in the evaluation). The results for the Toverlap values of 0.55, 0.65, 0.85, and 0.95 are shown in Fig. 5. Note that the curves of each skin detection approach intersect minimally, indicating that Toverlap plays a minimal role in the ranking.

4.5. Performance in categories and images

Table 5 shows the average performance for each category. The analysis reveals that the algorithm had the most challenges with indoor regular-illumination images with simple scenes and a light-skin person posing with the hand closed. The algorithm performed best with outdoor regular-illumination images with complex scenes and a dark-skin person with the hand open. With an easier category combination of outdoor scenes and dark-skin persons, the HSI 2D histogram skin detector performed the best with an AUC of 0.887. With a more difficult combination of indoor scenes and light-skin persons, the best skin detector performance was with CIELAB 3D histogram, with an AUC of 0.457. The great variance of performance for these two combination categories shows that the dataset was mixed with difficult and not-too-difficult images.

The accuracy for each image in the data set was also computed. For every tested threshold and skin detector, a histogram was built of how many times an image was correctly detected (TP and TN) and how many times it was incorrectly detected (FN and FP). Fig. 6 includes the images with the highest and lowest performance. The hardest non-hand images were indoor images with complex backgrounds. The easiest hand images had the hand in an open position, decreasing the difficulty of the hand ‘‘region’’ detection. The hardest hand images were those with shadows or glare on the hand. An interesting observation is that one of the hardest non-hand images had the same background as one of the easiest hand images, which indicates the difficulty of the complex background category.

5. Conclusions

This paper presented a novel method of evaluating skin detectors for the task of hand detection in perceptual interfaces and communication applications. The effect of three important aspects of skin detection (colorspace, illuminance component, and color modeling) was examined on a large dataset of 17 million pixels spanning different illumination conditions, skin tones, hand poses and background complexities. The parameters of the skin detectors were tuned extensively during the training process, and the trained parameter values were applied to the testing dataset to measure performance on unseen data. The resulting performance varied with the colorspace, illuminance component and color modeling approach. The best performance was achieved by transforming to the SCT colorspace, keeping the illuminance component, and modeling with the histogram approach. There is some similarity between the pixel-level and task-level results: the SCT and HSI colorspaces performed well at both levels, and keeping the illuminance component favored performance for most colorspaces with both color modelings at the pixel level and the task level. Some differences were also found; for instance, the YUV colorspace ranked low in the pixel-level evaluation but fourth in this study. The great diversity of categories in the dataset showed where the hand detection method failed and excelled with respect to the chosen skin detector. The framework of this evaluation is limited by the dataset, evaluation protocols, and hand detection algorithm. We hope that this evaluation will provide insight to users of skin detection for the task of hand detection.

References

[1] L. Sigal, S. Sclaroff, V. Athitsos, Skin color-based video segmentation under time-varying illumination, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (7) (2004) 862–877.

[2] R.L. Hsu, M.A. Mottaleb, A.K. Jain, Face detection in color images, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5) (2002) 696–706.

[3] J.C. Terrillon, M.N. Shirazi, H. Fukamachi, S. Akamatsu, Comparative performance of different skin chrominance models and chrominance spaces for the automatic detection of human faces in color images, in: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, March 2000, pp. 54–61.

[4] C. Poynton, A Technical Introduction to Digital Video, Wiley, New York, 1996.

[5] S.L. Phung, A. Bouzerdoum, D. Chai, A novel skin color model in YCbCr color space and its application to human face detection, in: Proceedings of International Conference on Image Processing, Sept. 2002, pp. I-289–I-292.

[6] J. Fritsch, S. Lang, A. Kleinehagenbrock, G.A. Fink, G. Sagerer, Improving adaptive skin color segmentation by incorporating results from face detection, in: 11th IEEE International Workshop on Robot and Human Interactive Communication, Sept. 2002, pp. 337–343.

[7] M. Soriano, B. Martinkauppi, S. Huovinen, M. Laaksonen, Skin detection in video under changing illumination conditions, in: Proceedings of 15th International Conference on Pattern Recognition, 2000, pp. 839–842.

[8] Y. Raja, S. McKenna, S. Gong, Colour model selection and adaptation in dynamic scenes, in: European Conference on Computer Vision, June 1998.

[9] M. Storring, H.J. Andersen, E. Granum, Physics-based modeling of human skin colour under mixed illuminants, Robotics and Autonomous Systems (2001) 131–142.

[10] C. Garcia, G. Tziritas, Face detection using quantized skin color regions merging and wavelet packet analysis, IEEE Transactions on Multimedia 1 (3) (1999) 264–277.

[11] A. Hadid, M. Pietikainen, B. Martinkauppi, Color-based face detection using skin locus model and hierarchical filtering, in: Proceedings of 16th International Conference on Pattern Recognition, August 2002, pp. 196–200.

[12] S. Srisuk, W. Kurutach, A new robust face detection in color images, in: Fifth IEEE International Conference on Automatic Face and Gesture Recognition, May 2002, pp. 291–296.

[13] H. Greenspan, J. Goldberger, I. Eshet, Mixture model for face-color modeling and segmentation, Pattern Recognition Letters 22 (14) (2001) 1525–1536.

[14] D. Chai, K.N. Ngan, Face segmentation using skin-color map in videophone applications, IEEE Transactions on Circuits and Systems for Video Technology 9 (4) (1999) 551–564.

[15] S.J. McKenna, S. Gong, Y. Raja, Modeling facial colour and identity with Gaussian mixtures, Pattern Recognition 31 (12) (1998) 1883–1892.

[16] M. Hunke, A. Waibel, Face locating and tracking for human–computer interaction, in: 28th Asilomar Conference on Signals, Systems, and Computers, 1994.

[17] S. Marcel, O. Bernier, J.E. Viallet, D. Collobert, Hand gesture recognition using input–output hidden Markov models, in: Fourth IEEE International Conference on Automatic Face and Gesture Recognition, March 2000, pp. 456–461.

[18] J. Triesch, C.V.D. Malsburg, A system for person-independent hand posture recognition against complex backgrounds, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (12) (2001) 1449–1453.

[19] L. Bretzner, I. Laptev, T. Lindeberg, Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering, in: Fifth IEEE International Conference on Automatic Face and Gesture Recognition, 2002.

[20] T. Lindeberg, Feature detection with automatic scale selection, International Journal of Computer Vision 30 (2) (1998) 77–116.

[21] H.K. Lee, J.H. Kim, An HMM-based threshold model approach for gesture recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (10) (1999) 961–973.

[22] Y. Sato, M. Saito, H. Koike, Real-time input of 3D pose and gestures of a user's hand and its applications for HCI, in: Proceedings of IEEE Conference on Virtual Reality, IEEE Computer Society, 2001, p. 79.

[23] Y. Zhu, G. Xu, D.J. Kriegman, A real-time approach to the spotting, representation, and recognition of hand gestures for human–computer interaction, Computer Vision and Image Understanding 85 (3) (2002) 189–208.

[24] A. Albiol, L. Torres, E. Delp, Optimum color spaces for skin detection, in: Proceedings of IEEE International Conference on Image Processing, October 2001, pp. 122–124.

[25] M.J. Jones, J.M. Rehg, Statistical color models with application to skin detection, International Journal of Computer Vision 46 (1) (2002) 81–96.

[26] J. Brand, J. Mason, A comparative assessment of three approaches to pixel-level human skin-detection, in: Proceedings of International Conference on Pattern Recognition, 2000, pp. I-1056–I-1059.

[27] J. Yang, W. Lu, A. Waibel, Skin color modeling and adaptation, in: Asian Conference on Computer Vision, 1998, pp. 687–694.

[28] B.D. Zarit, B.J. Super, F.K. Quek, Comparison of five color models in skin pixel classification, in: Proceedings of the International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, September 1999, pp. 58–63.

[29] M.W. Powell, R. Murphy, Position estimation of micro-rovers using a spherical coordinate transform color segmenter, in: Proceedings of IEEE Workshop on Photometric Modeling for Computer Vision and Graphics, June 1999, pp. 21–27.

[30] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, 2001.

[31] S. Jayaram, S. Schmugge, M.C. Shin, L.V. Tsap, Effect of colorspace transformation, the illuminance component, and color modeling on skin detection, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June–July 2004, pp. 813–818.

[32] S.J. Schmugge, S. Jayaram, L.V. Tsap, M.C. Shin, Objective evaluation of approaches of skin detection using ROC analysis, Computer Vision and Image Understanding (2007).

[33] M.C. Shin, L.V. Tsap, D.B. Goldgof, Gesture recognition using Bezier curves for visualization navigation from registered 3-D data, Pattern Recognition 37 (5) (2004) 1011–1024.
