Face segmentation
INTRODUCTION
1.1 Aim of the Thesis
This project presents an interactive algorithm to
automatically segment out a person’s face from a given
image that consists of a head-and-shoulders view of the
person and a complex background scene. The method involves
a fast, reliable, and effective algorithm that exploits the
spatial distribution characteristics of human skin color.
To fulfill this aim, the following objectives are
carried out:
1. Implement different types of image segmentation.
2. Derive a universal skin-color map and apply it to the
chrominance component of the input image to detect
pixels with a skin-color appearance.
3. Based on the spatial distribution of the detected
skin-color pixels and their corresponding luminance
values, apply a set of novel regularization processes
to reinforce regions of skin-color pixels that are
more likely to belong to the facial region and to
eliminate those that are not.
4. Illustrate the performance of the face segmentation
algorithm with simulation results on various
head-and-shoulders test images.
1.2 Scope of the Thesis
The main objective of this research is to design a
system that can find a person’s face from given image
data. This problem is commonly referred to as face
location, face extraction, or face segmentation.
Regardless of the terminology, they all share the same
objective. However, note that the problem usually deals
with finding the position and contour of a person’s face
since its location is unknown, but given the knowledge
of its existence. If this is not known, then there is
also a need to discriminate between “images containing
faces” and “images not containing faces.” This is known
as face detection. This thesis, however, focuses on face
segmentation. The significance of this problem can be
illustrated by its vast applications, as face
segmentation holds an important key to future advances
in human-to-human and human-to-machine communications.
The segmentation of a facial region provides a content-
based representation of the image where it can be used
for encoding, manipulation, enhancement, indexing,
modeling, pattern-recognition, and object-tracking
purposes.
1.3 Literature Survey
The research on face segmentation has been pursued at
a feverish pace, yet many problems remain to be fully and
convincingly solved, as the difficulty of the problem
depends highly on the complexity of the image content and
the application. Many existing methods only work well on
simple input images with a benign background and a frontal
view of the person’s face. To cope with more complicated
images and conditions, many more assumptions then have to
be made. Many of the
approaches proposed over the years involved the combination
of shape, motion, and statistical analysis. In recent
times, however, a new approach of using color information
has been introduced.
In this thesis, we discuss the color analysis
approach to face segmentation. The discussion includes the
derivation of a universal model of human skin color, the
use of appropriate color space, and the limitations of
color segmentation. We then present a practical solution to
the face-segmentation problem. This includes how to derive
a robust skin-color reference map and how to overcome the
limitations of color segmentation. In addition to face
segmentation, one of its applications on video coding will
be presented in further detail. It will explain how the
face-segmentation results can be exploited by an existing
video coder so that it encodes the area of interest (i.e.,
the facial region) with higher fidelity and hence produces
images with better rendered facial features.
1.4 Applications of the Thesis
Some practical applications include the following:
Coding area of interest with better quality: The subjective
quality of a very low-bit-rate encoded videophone sequence
can be improved by coding the facial image region that is
of interest to viewers at higher quality.
Content-based representation and MPEG-4: Face
segmentation is a useful tool for the MPEG-4 content-based
functionality. It provides content-based representation of
the image, which can subsequently be used for coding,
editing, or other interactivity purposes.
Three-dimensional (3-D) human face model fitting: The delimitation of the person’s face is the
fundamental requirement of 3-D human face model fitting
used in model-based coding, computer animation, and
morphing.
Image enhancement: Face segmentation information can be
used in a post-processing task for enhancing images, such
as the automatic
adjustment of tint in the facial region.
Face recognition: Finding the person’s face is the first
important step in human face recognition, classification,
and identification systems.
Face tracking: Face location can be used to design a video camera
system that tracks a person’s face in a room. It can be
used as part of an intelligent vision system or simply in
video surveillance.
1.5 Organization of the Thesis
The presented project thesis is organized into six main
chapters. Chapter 1 gives an overview of the aim of the
thesis, the technical approach, the literature survey, and
the organization of the thesis. Chapter 2 describes the
fundamentals of image segmentation, covering existing
methods of image segmentation and giving an idea of the
different file formats. Chapter 3 briefly explains color
analysis. Chapter 4 describes the face segmentation
algorithm and presents the simulation results obtained for
the implemented design. Chapter 5 presents the conclusion
and future scope, with references and an appendix at the
end.
Image segmentation
2.1 Segmentation:
In computer vision, segmentation refers to the process
of partitioning a digital image into multiple segments
(sets of pixels, also known as superpixels). The goal of
segmentation is to simplify and/or change the
representation of an image into something that is more
meaningful and easier to analyze. Image segmentation is
typically used to locate objects and boundaries (lines,
curves, etc.) in images. More precisely, image segmentation
is the process of assigning a label to every pixel in an
image such that pixels with the same label share certain
visual characteristics.
The result of image segmentation is a set of segments
that collectively cover the entire image, or a set of
contours extracted from the image (see edge detection).
Each of the pixels in a region is similar with respect to
some characteristic or computed property, such as color,
intensity, or texture. Adjacent regions are significantly
different with respect to the same characteristics.
Thresholding
Edge finding
Binary mathematical morphology
Gray-value mathematical morphology
In the analysis of the objects in images it
is essential that we can distinguish between the
objects of interest and "the rest." This latter group
is also referred to as the background. The techniques
that are used to find the objects of interest are
usually referred to as segmentation techniques -
segmenting the foreground from background. In this
section we will describe two of the most common
techniques, thresholding and edge finding, and we will
present techniques for improving the quality of the
segmentation result. It is important to understand
that:
1. There is no universally applicable segmentation
technique that will work for all images, and,
2. No segmentation technique is perfect.
2.1.1 THRESHOLDING:
This technique is based upon a simple concept. A
parameter called the brightness threshold θ is chosen and
applied to the image a[m, n] as follows:

If a[m, n] >= θ then a[m, n] = object = 1
Else a[m, n] = background = 0

This version of the algorithm assumes that we are
interested in light objects on a dark background. For dark
objects on a light background we would use:

If a[m, n] < θ then a[m, n] = object = 1
Else a[m, n] = background = 0
The output is the label "object" or "background"
which, due to its dichotomous nature, can be represented as
a Boolean variable "1" or "0". In principle, the test
condition could be based upon some property other than
simple brightness (for example, If (Redness{a[m, n]} >=
θred)), but the concept is clear.
The central question in thresholding then becomes: how
do we choose the threshold θ? While there is no universal
procedure for threshold selection that is guaranteed to
work on all images, there are a variety of alternatives.
Fixed threshold - One alternative is to use a threshold
that is chosen independently of the image data. If it is
known that one is dealing with very high-contrast images
where the objects are very dark and the background is
homogeneous and very light, then a constant threshold of
128 on a scale of 0 to 255 might be sufficiently accurate.
By accuracy we mean that the number of falsely-classified
pixels should be kept to a minimum.
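The fixed-threshold rule can be sketched in a few lines of NumPy; the function name, the example image, and the default θ = 128 are illustrative, not from the text:

```python
import numpy as np

def threshold_image(a, theta=128, light_objects=True):
    """Label each pixel object ("1") or background ("0").

    For light objects on a dark background, pixels with
    a[m, n] >= theta are labeled object; for dark objects on a
    light background the test is reversed.
    """
    a = np.asarray(a)
    mask = (a >= theta) if light_objects else (a < theta)
    return mask.astype(np.uint8)

# A high-contrast example: dark background (10, 30), light objects (200, 250).
img = np.array([[10, 200],
                [30, 250]])
light = threshold_image(img)                       # [[0 1], [0 1]]
dark = threshold_image(img, light_objects=False)   # [[1 0], [1 0]]
```

The Boolean output can be stored as a single bit per pixel; any further processing (morphology, labeling) operates on this binary mask.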
Histogram-derived thresholds - In most cases the threshold
is chosen from the brightness histogram of the region or
image that we wish to segment. An image and its associated
brightness histogram are shown in Figure 2.
(a) Image to be thresholded
(b) Brightness histogram of the image
Figure 2: Pixels below the threshold (a[m, n] < θ) will be
labeled as object pixels; those above the threshold will be
labeled as background pixels.
A variety of techniques have been devised to automatically
choose a threshold starting from the gray-value histogram,
{h[b] | b = 0, 1, ..., 2^B - 1}. Some of the most common
ones are presented below. Many of these algorithms can
benefit from a smoothing of the raw histogram data to
remove small fluctuations, but the smoothing algorithm must
not shift the peak positions. This translates into a
zero-phase smoothing algorithm, where typical values for
the window width W are 3 or 5:

h_smooth[b] = (1/W) * sum of h_raw[b - w]
over w = -(W-1)/2, ..., (W-1)/2
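A zero-phase smoothing of this kind can be sketched as a symmetric moving average; replicating the end bins at the histogram boundary is an implementation assumption:

```python
import numpy as np

def smooth_histogram(h, W=3):
    """Zero-phase smoothing of a raw histogram with a symmetric
    moving average of odd width W (typically 3 or 5).

    Because the window is centered on each bin, peak positions
    are not shifted.
    """
    h = np.asarray(h, dtype=float)
    padded = np.pad(h, W // 2, mode='edge')   # replicate the end bins
    return np.convolve(padded, np.ones(W) / W, mode='valid')

h = np.array([0, 4, 8, 4, 0, 0, 6, 12, 6, 0], dtype=float)
hs = smooth_histogram(h, W=3)
# Small fluctuations are averaged out, but the dominant peak stays in place.
```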
2.1.2 EDGE FINDING:
Thresholding produces a segmentation that yields all
the pixels that, in principle, belong to the object or
objects of interest in an image. An alternative to this is
to find those pixels that belong to the borders of the
objects. Techniques that are directed to this goal are
termed edge finding techniques. From our discussion on
mathematical morphology, we see that there is an intimate
relationship between edges and regions.
Gradient-based procedure - The central challenge to
edge finding techniques is to find procedures that produce
closed contours around the objects of interest. For objects
of particularly high SNR, this can be achieved by
calculating the gradient and then using a suitable
threshold. This is illustrated in Figure 4.
(a) SNR = 30 dB (b) SNR = 20 dB
Figure 4: Edge finding based on the Sobel gradient,
combined with the Isodata thresholding algorithm.
While the technique works well for the 30 dB image in
Figure 4a, it fails to provide an accurate determination of
those pixels associated with the object edges for the 20 dB
image in Figure 4b. A variety of smoothing techniques, as
described in Section 9.4, can be used to reduce the noise
effects before the gradient operator is applied.
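A gradient-based procedure of this kind can be sketched with the Sobel kernels; the explicit loops, the function name, and the 50% threshold are illustrative choices, not from the text:

```python
import numpy as np

def sobel_gradient_magnitude(a):
    """Approximate the gradient magnitude with 3x3 Sobel kernels.

    Border pixels are left at zero; a real implementation would
    pad or mirror the image edges.
    """
    a = np.asarray(a, dtype=float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    H, W = a.shape
    gx = np.zeros_like(a)
    gy = np.zeros_like(a)
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            win = a[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(win * kx)
            gy[i, j] = np.sum(win * ky)
    return np.hypot(gx, gy)

# A vertical step edge: the gradient is largest along the transition.
img = np.zeros((5, 6))
img[:, 3:] = 100.0
mag = sobel_gradient_magnitude(img)
edges = mag >= 0.5 * mag.max()   # simple gradient-magnitude threshold
```

For high-SNR images this already yields a closed contour along the step; for noisier images the smoothing mentioned above would be applied first.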
Zero-crossing based procedure - A more modern view to
handling the problem of edges in noisy images is to use the
zero crossings generated in the Laplacian of an image. The
rationale starts from the model of an ideal edge, a step
function, that has been blurred by an OTF such as Table 4
T.3 (out-of-focus), T.5 (diffraction-limited), or T.6
(general model) to produce the result shown in Figure 5.
Figure 5: Edge finding based on the zero crossing as
determined by the second derivative, the Laplacian. The
curves are not to scale.
The edge location is, according to the model, at that
place in the image where the Laplacian changes sign, the
zero crossing. As the Laplacian operation involves a second
derivative, this means a potential enhancement of noise in
the image at high spatial frequencies. To prevent enhanced
noise from dominating the search for zero crossings, a
smoothing is necessary.
The appropriate smoothing filter, from among the many
possibilities, should, according to Canny, have the
following properties:
In the frequency domain, (u,v) or (Ω,ψ), the filter
should be as narrow as possible to provide suppression
of high frequency noise, and;
In the spatial domain, (x, y) or [m, n], the filter
should be as narrow as possible to provide good
localization of the edge. Too wide a filter generates
uncertainty as to precisely where, within the filter
width, the edge is located.
The smoothing filter that simultaneously satisfies
both these properties, minimum bandwidth and minimum
spatial width, is the Gaussian filter. This means that the
image should be smoothed with a Gaussian of an appropriate
width σ, followed by application of the Laplacian. In
formula:

Zero-crossings of { ∇²(g2D(x, y) ⊗ a(x, y)) }

where g2D(x, y) is the two-dimensional Gaussian. The
derivative operation is linear and shift-invariant as
defined in eqs. (85) and (86). This means that the order of
the operators can be exchanged (eq. (4)) or combined into
one single filter (eq. (5)). This second approach leads to
the Marr-Hildreth formulation of the "Laplacian-of-
Gaussians" (LoG) filter:

Zero-crossings of { LoG(x, y) ⊗ a(x, y) }

where

LoG(x, y) = ((x² + y² - 2σ²) / σ⁴) · g2D(x, y)

Given the circular symmetry this can also be written as:

LoG(r) = ((r² - 2σ²) / σ⁴) · g2D(r)
This two-dimensional convolution kernel, which is sometimes
referred to as a "Mexican hat filter", is illustrated in
Figure 6.
(a) -LoG(x, y)
(b) LoG(r)
Figure 6: LoG filter with σ = 1.0.
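A sampled version of the Mexican-hat kernel can be sketched directly from the standard circularly symmetric form of the LoG; subtracting the mean so the finite grid sums exactly to zero is an implementation choice, not from the text:

```python
import numpy as np

def log_kernel(sigma=1.0, size=9):
    """Sample the Laplacian-of-Gaussian ("Mexican hat") kernel.

    Uses LoG(r) = ((r^2 - 2*sigma^2) / sigma^4) * g2D(r), with
    g2D the two-dimensional Gaussian.  The mean is subtracted so
    the discrete kernel responds with exactly zero on constant
    (and bilinear) regions.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    r2 = x ** 2 + y ** 2
    s2 = sigma ** 2
    g = np.exp(-r2 / (2 * s2)) / (2 * np.pi * s2)   # 2-D Gaussian
    k = (r2 - 2 * s2) / s2 ** 2 * g                 # Laplacian of the Gaussian
    return k - k.mean()

k = log_kernel(sigma=1.0, size=9)
# The negative peak sits at the center of the hat; the ring around it is positive.
```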
PLUS-based procedure - Among the zero crossing
procedures for edge detection, perhaps the most accurate is
the PLUS filter as developed by Verbeek and Van Vliet. The
filter is defined as the sum of the Laplacian and the
second derivative in the gradient direction (SDGD):

PLUS(a) = Lap(a) + SDGD(a)
Neither the derivation of the PLUS's properties nor an
evaluation of its accuracy is within the scope of this
section. Suffice it to say that, for positively curved
edges in gray value images, the Laplacian-based zero
crossing procedure overestimates the position of the edge
and the SDGD-based procedure underestimates the position.
This is true in both two-dimensional and three-dimensional
images with an error on the order of (σ/R)² where R is the
radius of curvature of the edge.
The PLUS operator has an error on the order of (σ/R)⁴ if
the image is sampled at, at least, 3x the usual Nyquist
sampling frequency or if we choose σ >= 2.7 and sample at
the usual Nyquist frequency.
All of the methods based on zero crossings in the
Laplacian must be able to distinguish between zero
crossings and zero values. While the former represent edge
positions, the latter can be generated by regions that are
no more complex than bilinear surfaces, that is, a(x,y) =
a0 + a1*x + a2*y + a3*x*y. To distinguish between these two
situations, we first find the zero crossing positions and
label them as "1" and all other pixels as "0". We then
multiply the resulting image by a measure of the edge
strength at each pixel. There are various measures for the
edge strength, all based on the gradient as described in
Section 9.5.1. One possibility, the use of a morphological
gradient as an edge strength measure, was first described
by Lee, Haralick, and Shapiro and is particularly
effective. After multiplication the image is then
thresholded (as above) to produce the final result. The
procedure is thus as follows:
Figure 7: General strategy for
edges based on zero crossings.
The results of these two edge finding techniques based on
zero crossings, LoG filtering and PLUS filtering, are shown
in Figure 7 for images with a 20 dB SNR.
a) Image SNR = 20 dB b) LoG filter c) PLUS filter
Figure 7: Edge finding using zero crossing algorithms LoG
and PLUS. In both algorithms σ =
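The first step of the zero-crossing strategy, marking strict sign changes in the (smoothed) Laplacian while ignoring plain zero values, can be sketched as below; the function name is illustrative:

```python
import numpy as np

def zero_crossings(lap):
    """Mark pixels where the Laplacian changes sign against a neighbor.

    Only strict sign changes count, so zero *values* produced by
    bilinear regions a0 + a1*x + a2*y + a3*x*y are not marked.  In
    the full procedure this mask would then be multiplied by a
    gradient-based edge-strength measure and thresholded.
    """
    lap = np.asarray(lap, dtype=float)
    s = np.sign(lap)
    z = np.zeros(lap.shape, dtype=np.uint8)
    z[:, :-1] |= (s[:, :-1] * s[:, 1:] < 0).astype(np.uint8)   # horizontal
    z[:-1, :] |= (s[:-1, :] * s[1:, :] < 0).astype(np.uint8)   # vertical
    return z

# A 1-D step edge's Laplacian profile: one crossing between columns 1 and 2.
lap = np.array([[1.0, 0.5, -0.5, -1.0]])
zc = zero_crossings(lap)
```

Note that a flat or bilinear region gives a Laplacian of all zeros, and the strict-inequality test correctly marks nothing there.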
Edge finding techniques provide, as the name suggests, an
image that contains a collection of edge pixels. Should the
edge pixels correspond to objects, as opposed to, say,
simple lines in the image, then a region-filling technique
may be required to provide the complete objects.
2.1.3 BINARY MATHEMATICAL MORPHOLOGY
The various algorithms that we have described for
mathematical morphology in Section 9.6 can be put together
to form powerful techniques for the processing of binary
images and gray level images. As binary images frequently
result from segmentation processes on gray level images,
the morphological processing of the binary result permits
the improvement of the segmentation result.
Salt-or-pepper filtering - Segmentation procedures
frequently result in isolated "1" pixels in a "0"
neighborhood (salt) or isolated "0" pixels in a "1"
neighborhood (pepper). The appropriate neighborhood
definition must be chosen as in Figure 3. Using the lookup
table formulation for Boolean operations in a 3 x 3
neighborhood that was described in association with Figure
43, salt filtering and pepper filtering are straightforward
to implement. We weight the different positions in the 3 x
3 neighborhood as follows (one assignment consistent with
the sums used below):

256    2    4
128    1    8
 64   32   16

For a 3 x 3 window in a[m, n] with values "0" or "1" we
then compute sum as the total of the weights at the
positions where the pixel value is "1". The result, sum, is
a number bounded by 0 <= sum <= 511.
Salt filter - The 4-connected and 8-connected versions
of this filter are the same and are given by the following
procedure:

i) Compute sum
ii) If (sum == 1) c[m, n] = 0
    Else c[m, n] = a[m, n]

Pepper filter - The 4-connected and 8-connected versions
of this filter are the following procedures:

4-connected:
i) Compute sum
ii) If (sum == 170) c[m, n] = 1
    Else c[m, n] = a[m, n]

8-connected:
i) Compute sum
ii) If (sum == 510) c[m, n] = 1
    Else c[m, n] = a[m, n]
Isolate objects with holes - To find objects with
holes we can use the following procedure which is
illustrated in Figure 8.
i) Segment image to produce binary mask representation
ii) Compute skeleton without end pixels
iii) Use salt filter to remove single skeleton pixels
iv) Propagate remaining skeleton pixels into original
binary mask.
a) Binary image b) Skeleton after salt filter c)
Objects with holes
Figure 8: Isolation of objects with holes using
morphological operations.
The binary objects are shown in gray and the
skeletons, after application of the salt filter, are shown
as a black overlay on the binary objects. Note that this
procedure uses no parameters other than the fundamental
choice of connectivity; it is free from "magic numbers." In
the example shown in Figure 8, the 8-connected definition
was used as well as the structuring element B = N8.
Filling holes in objects - To fill holes in objects we
use the following procedure, which is illustrated in Figure
9.
i) Segment image to produce binary representation of
objects
ii) Compute complement of binary image as a mask image
iii) Generate a seed image as the border of the image
iv) Propagate the seed into the mask
v) Complement result of propagation to produce final
result
a) Mask and Seed images
b) Objects with holes filled
Figure 9: Filling holes in objects.
The mask image is illustrated in gray in Figure 9a and the
seed image is shown in black in that same illustration.
When the object pixels are specified with a connectivity of
C = 8, then the propagation into the mask (background)
image should be performed with a connectivity of C = 4,
that is, dilations with the structuring element B = N4.
This procedure is also free of "magic numbers."
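The five hole-filling steps can be sketched with 4-connected geodesic propagation; the function name and the boolean representation are assumptions:

```python
import numpy as np

def fill_holes(mask):
    """Fill interior holes in a binary object mask.

    Steps: complement the mask, seed from the image border,
    propagate the seed inside the complement with 4-connected
    dilations (B = N4), then complement the propagation result.
    """
    mask = np.asarray(mask, dtype=bool)
    background = ~mask                       # ii) complement as mask image
    seed = np.zeros_like(mask)               # iii) seed = border of the image
    seed[0, :] = background[0, :]; seed[-1, :] = background[-1, :]
    seed[:, 0] = background[:, 0]; seed[:, -1] = background[:, -1]
    while True:                              # iv) propagate seed into mask
        grown = seed.copy()
        grown[1:, :] |= seed[:-1, :]
        grown[:-1, :] |= seed[1:, :]
        grown[:, 1:] |= seed[:, :-1]
        grown[:, :-1] |= seed[:, 1:]
        grown &= background                  # stay inside the complement
        if np.array_equal(grown, seed):
            break
        seed = grown
    return ~seed                             # v) complement the propagation
```

Everything not reachable from the border through the background is, by definition, either object or hole, so the final complement returns the objects with their holes filled.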
Removing border-touching objects - Objects that are
connected to the image border are not suitable for
morphological operations that are illustrated in Figure 10.
i) Segment image to produce binary mask image of
objects
ii) Generate a seed image as the border of the image
iii) Propagate the seed into the mask
iv) Compute XOR of the propagation result and the mask
image as final result
a) Mask and Seed images
b) Remaining objects
Figure 10: Removing objects touching borders.
The mask image is illustrated in gray in Figure 10a and the
seed image is shown in black in that same illustration. If
the structuring element used in the propagation is B = N4,
then objects are removed that are 4-connected with the
image boundary. If B = N8 is used then objects that are
8-connected with the boundary are removed.
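The four steps for removing border-touching objects can be sketched with 4-connected propagation (B = N4) and a final XOR; the function name is illustrative:

```python
import numpy as np

def remove_border_objects(mask):
    """Remove binary objects that are 4-connected to the image border.

    Seed from the border pixels of the mask itself, propagate
    within the mask with dilations using B = N4, then XOR the
    propagation result with the mask.
    """
    mask = np.asarray(mask, dtype=bool)
    seed = np.zeros_like(mask)               # ii) seed = border of the image
    seed[0, :] = mask[0, :]; seed[-1, :] = mask[-1, :]
    seed[:, 0] = mask[:, 0]; seed[:, -1] = mask[:, -1]
    while True:                              # iii) propagate seed into mask
        grown = seed.copy()
        grown[1:, :] |= seed[:-1, :]
        grown[:-1, :] |= seed[1:, :]
        grown[:, 1:] |= seed[:, :-1]
        grown[:, :-1] |= seed[:, 1:]
        grown &= mask
        if np.array_equal(grown, seed):
            break
        seed = grown
    return mask ^ seed                       # iv) XOR leaves interior objects
```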
Exo-skeleton - The exo-skeleton of a set of objects is
the skeleton of the background that contains the objects.
The exo-skeleton produces a partition of the image into
regions each of which contains one object. The actual
skeletonization is performed without the preservation of
end pixels and with the border set to "0." The procedure is
described below and the result is illustrated in Figure 11.
i) Segment image to produce binary image
ii) Compute complement of binary image
iii) Compute the skeleton of the result of step ii with
the border set to "0"
Figure 11: Exo-skeleton.
2.1.4 GRAY-VALUE MATHEMATICAL MORPHOLOGY
Gray-value morphological processing techniques can be
used for practical problems such as shading correction. In
this section several other techniques will be presented.
Top-hat transform - The isolation of gray-value objects
that are convex can be accomplished with the top-hat
transform as developed by Meyer. Depending upon whether we
are dealing with light objects on a dark background or dark
objects on a light background, the transform is defined as:

Light objects - TopHat(A, B) = A - max(min(A))
Dark objects - TopHat(A, B) = min(max(A)) - A

where the structuring element B is chosen to be bigger than
the objects in question and, if possible, to have a convex
shape. Because of the properties of opening and closing,
TopHat(A, B) >= 0. An example of this technique is shown in
Figure 13.
The original image including shading is processed by a 15 x
1 structuring element as described above to produce
the desired result. Note that the transform for dark
objects has been defined in such a way as to yield
"positive" objects as opposed to "negative" objects. Other
definitions are, of course, possible.
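Both top-hat variants can be sketched with flat local min/max filters standing in for erosion and dilation over B; the edge padding, helper name, and test image are assumptions:

```python
import numpy as np

def _local(a, size, fn):
    """fn (np.min or np.max) over a flat size x size structuring element B."""
    half = size // 2
    p = np.pad(a, half, mode='edge')
    H, W = a.shape
    return np.array([[fn(p[i:i + size, j:j + size]) for j in range(W)]
                     for i in range(H)])

def tophat_light(a, size):
    """Top-hat for light objects: A - open(A, B) = A - max(min(A))."""
    a = np.asarray(a, dtype=float)
    return a - _local(_local(a, size, np.min), size, np.max)

def tophat_dark(a, size):
    """Top-hat for dark objects: close(A, B) - A = min(max(A)) - A."""
    a = np.asarray(a, dtype=float)
    return _local(_local(a, size, np.max), size, np.min) - a

# A linear shading ramp with one small bright object on top of it:
a = np.tile(np.arange(8.0), (8, 1))
a[4, 4] += 50.0
th = tophat_light(a, 3)   # the 3x3 element B is bigger than the 1-px object
```

The opening removes the small bright object but follows the ramp, so the difference isolates the object while suppressing the shading, as in the 15 x 1 example above.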
Thresholding - A simple estimate of a locally-varying
threshold surface can be derived from morphological
processing as follows:

Threshold surface - θ[m, n] = (max(A) + min(A)) / 2
Once again, we suppress the notation for the structuring
element B under the max and min operations to keep the
notation simple. Its use, however, is understood.
(A) Original
(a) Light object transform (b) Dark object transform
Figure 13: Top-hat transforms.
Local contrast stretching - Using morphological
operations we can implement a technique for local contrast
stretching. That is, the amount of stretching that will be
applied in a neighborhood will be controlled by the
original contrast in that neighborhood. The morphological
gradient defined in eq. may also be seen as related to a
measure of the local contrast in the window defined by the
structuring element B:
The procedure for local contrast stretching is given by:

c[m, n] = scale · (a[m, n] - min(A)) / (max(A) - min(A))
The max and min operations are taken over the structuring
element B. The effect of this procedure is illustrated in
Figure 14. It is clear that this local operation is an
extended version of the point operation for contrast
stretching.
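The local stretching rule can be sketched as follows, with the min and max taken over a flat window standing in for B; the scale of 255, the edge padding, and the flat-region guard are assumptions:

```python
import numpy as np

def local_contrast_stretch(a, size=3, scale=255.0):
    """Local contrast stretching over the structuring element B.

    Each pixel is rescaled between the local min (erosion) and the
    local max (dilation) of a flat size x size window; flat
    neighborhoods are left at zero to avoid division by zero.
    """
    a = np.asarray(a, dtype=float)
    half = size // 2
    p = np.pad(a, half, mode='edge')
    H, W = a.shape
    lo = np.array([[p[i:i + size, j:j + size].min() for j in range(W)]
                   for i in range(H)])
    hi = np.array([[p[i:i + size, j:j + size].max() for j in range(W)]
                   for i in range(H)])
    rng = np.where(hi > lo, hi - lo, 1.0)
    return scale * (a - lo) / rng

out = local_contrast_stretch(np.array([[10.0, 20.0],
                                       [10.0, 30.0]]))
```

In a low-contrast neighborhood the small local range is stretched to the full output scale, which is exactly what distinguishes this from the global point operation.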
Figure 14: Local contrast stretching (before and after
examples). Using standard test images illustrates the power
of this local morphological filtering approach.
Facial Image Processing and Analysis:
Face recognition systems are progressively becoming
popular as means of extracting biometric information. Face
recognition has a critical role in biometric systems and is
attractive for numerous applications including visual
surveillance and security. Because of the general public
acceptance of face images on various documents, face
recognition has a great potential to become the next
generation biometric technology of choice. Face images are
also the only biometric information available in some
legacy databases and international terrorist watch-lists
and can be acquired even without subjects' cooperation.
Though there has been a great deal of progress in face
detection and recognition in the last few years, many
problems remain unsolved. Research on face detection must
confront with many challenging problems, especially when
dealing with outdoor illumination, pose variation with
large rotation angles, low image quality, low resolution,
occlusion, and background changes in complex real-life
scenes. The design of face recognition algorithms that are
effective over a wide range of viewpoints, complex outdoor
lighting, occlusions, facial expressions, and aging of
subjects, is still a major area of research. Before one
claims that the facial image processing / analysis system
is reliable, rigorous testing and verification on real-
world datasets must be performed, including databases for
face analysis and tracking in digital video. 3D head model
assisted recognition is another research area where new
solutions are urgently needed to enhance robustness of
today's recognition systems and enable real-time, face-
oriented processing and analysis of visual data. Thus,
vigorous research is needed to solve such outstanding
challenging problems and propose advanced solutions and
systems for emerging applications of facial image
processing and analysis.
Recent progress in face detection and recognition explores
emerging themes such as digital video, 3D, near infrared,
occlusion and disguise, long-term aging, and the lack of
sufficient training data. Topics of interest include, but
are not limited to, the following:
New sensors or data sources
3D-based face recognition
Near infrared imaging for face recognition
Video-based face recognition
Preprocessing
Image preprocessing for face detection /
recognition
Color-based facial image processing and analysis
De-blurring and super-resolution for robust face
detection / recognition
Face and feature detection
Face detection for best-shot selection
Facial feature detection and extraction
3D head modeling and face tracking
Methods
Subspace / kernel methods for face recognition
Non-linear methods for face modeling
Bionic face representation
Ensemble learning for face classification
Key problems
Outdoor illumination
Large pose variations
Mid-term and long-term aging
Occlusion and disguise
Low quality and low resolution
Generalization problem due to lack of enough
training examples
Dataset and evaluation
Challenging datasets
Video-based datasets
Statistical performance evaluation
Applications and other topics
Facial gesture recognition
Real-time processing solutions and systems
Face-based surveillance, biometrics, and multimedia
applications
Face recognition in compressed domain
2.2 Face Location:
Face Retrieval:
The face retrieval problem, known as face detection,
can be defined as follows: given an arbitrary black and
white, still image, find the location and size of every
human face it contains. There are many applications in
which human face detection plays a very important role: it
represents the first step in a fully automatic face
recognition system, it can be used in image database
indexing/searching by content, in surveillance systems and
in human-computer interfaces. It also provides insight on
how to approach other pattern recognition problems
involving deformable textured objects. At the same time, it
is one of the harder problems in pattern recognition.
We've designed an inductive learning detection method that
produces a maximally specific hypothesis consistent with
the training data. Three different sets of features were
considered for defining the concept of a human face. The
performance achieved is as follows: 85% detection rate, a
false alarm rate of 0.04 % of the number of windows
analyzed and 1 minute detection time on a 320 x 240 image
on a Sun Ultrasparc 1.
2.3 Color Image Processing: Methods and Applications
Color Image Processing: Methods and
Applications embraces two decades of extraordinary growth in
the technologies and applications for color image
processing. The book offers comprehensive coverage of state-
of-the-art systems, processing techniques, and emerging
applications of digital color imaging.
To elucidate the significant progress in
specialized areas, the editors invited renowned authorities
to address specific research challenges and recent trends in
their area of expertise. The book begins by focusing on
color fundamentals, including color management, gamut
mapping, and color constancy. The remaining chapters detail
the latest techniques and approaches to contemporary and
traditional color image processing and analysis for a broad
spectrum of sophisticated applications, including:
Vector and semantic processing
Secure imaging
Object recognition and feature detection
Facial and retinal image analysis
Digital camera image processing
Spectral and superresolution imaging
Image and video colorization
Virtual restoration of artwork
Video shot segmentation and surveillance
2.4 Videophone Communication:
Videophone:
A videophone is a telephone with a video screen, and is
capable of full duplex (bi-directional) video and audio
transmissions for communication between people in real-time.
It was the first form of videotelephony, later to be
followed by videoconferencing, webcams, and finally
telepresence.
At the dawn of the technology, videotelephony also included
image phones which would exchange still images between units
every few seconds over conventional POTS-type telephone
lines, essentially the same as slow scan TV systems.
Currently videophones are particularly useful to the deaf
and speech-impaired who can use them with sign language, and
also with video relay services to communicate with hearing
persons. Videophones are also very useful to those with
mobility issues or those who are located in distant places
and are in need of telemedical or tele-educational services.
2.5 Technology:
Video coding:
Video coding is the field in electrical engineering and
computer science that deals with representation of video
data, for storage and/or transmission, for both analog and
digital video. Though video coding is often considered to be
only for natural video, it can also be applied to synthetic
(computer generated) video, i.e., graphics. Many
representations take advantage of features of the Human
Visual System to achieve an efficient representation.
The goals of video coding are to represent the video data
accurately and compactly, to provide means to navigate the
video (i.e., search forward and backward, random access,
etc.), and to support additional author and content benefits
such as text (subtitles), meta-information for
searching/browsing, and digital rights management.
The biggest challenge is to reduce the size of the video
data using video compression; for this reason, the terms
"video coding" and "video compression" are often used
interchangeably. The search for efficient video compression
techniques has dominated much of the research activity in
video coding since the early 1980s. The first major
milestone was H.261, from which JPEG adopted the idea of
using the DCT; since then, many other advancements have been
made in algorithms such as motion estimation. Since
approximately 2000, the focus has shifted more toward
metadata and video search, resulting in MPEG-7 and MPEG-21.
H.261:
H.261 is an ITU-T video coding standard, ratified in
November 1988.[1][2] It was originally designed for
transmission over ISDN lines, on which data rates are
multiples of 64 kbit/s. It is one member of the H.26x family
of video coding standards in the domain of the ITU-T Video
Coding Experts Group (VCEG). The coding algorithm was
designed to operate at video bit rates between 40 kbit/s and
2 Mbit/s. The standard supports two video frame sizes: CIF
(352×288 luma with 176×144 chroma) and QCIF (176×144 luma
with 88×72 chroma), using a 4:2:0 sampling scheme. It also
has a backward-compatible trick for sending still-picture
graphics with 704×576 luma resolution and 352×288 chroma
resolution (added in a later revision in 1993).
2.6 History:
Whilst H.261 was preceded in 1984 by H.120 (which also
underwent a revision in 1988 of some historic importance) as
a digital video coding standard, H.261 was the first truly
practical digital video coding standard (in terms of product
support in significant quantities). In fact, all subsequent
international video coding standards (MPEG-1 Part 2,
H.262/MPEG-2 Part 2, H.263, MPEG-4 Part 2, and H.264/MPEG-4
Part 10) have been based closely on the H.261 design.
Additionally, the methods used by the H.261 development
committee to collaboratively develop the standard have
remained the basic operating process for subsequent
standardization work in the field (see S. Okubo, "Reference
model methodology-A tool for the collaborative creation of
video coding standards", Proceedings of the IEEE, vol. 83,
no. 2, Feb. 1995, pp. 139–150). The coding algorithm uses a
hybrid of motion compensated inter-picture prediction and
spatial transform coding with scalar quantization, zig-zag
scanning and entropy encoding.
H.261 design:
The basic processing unit of the design is called a
macroblock, and H.261 was the first standard in which the
macroblock concept appeared. Each macroblock consists of a
16x16 array of luma samples and two corresponding 8x8 arrays
of chroma samples, using 4:2:0 sampling and a YCbCr color
space.
The inter-picture prediction reduces temporal redundancy,
with motion vectors used to help the codec compensate for
motion. Whilst only integer-valued motion vectors are
supported in H.261, a blurring filter can be applied to the
prediction signal — partially mitigating the lack of
fractional-sample motion vector precision. Transform coding
using an 8x8 discrete cosine transform (DCT) reduces the
spatial redundancy. Scalar quantization is then applied to
round the transform coefficients to the appropriate
precision determined by a step size control parameter, and
the quantized transform coefficients are zig-zag scanned and
entropy coded (using a "run-level" variable-length code) to
remove statistical redundancy.
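The zig-zag scan and run-level coding described above can be sketched as follows. This is an illustrative Python sketch (function names are ours), not code from the H.261 specification:

```python
def zigzag_order(n=8):
    # Traverse the anti-diagonals of an n x n block, alternating
    # direction, as in the H.261/JPEG zig-zag scan pattern.
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def run_level_encode(block):
    # Emit (run-of-zeros, nonzero-level) pairs from the zig-zag scan,
    # which is the essence of a "run-level" variable-length code.
    pairs, run = [], 0
    for i, j in zigzag_order(len(block)):
        v = block[i][j]
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs
```

Because quantization pushes most high-frequency coefficients to zero, the zig-zag order tends to produce long zero runs, which the (run, level) pairs represent compactly.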
The H.261 standard actually only specifies how to decode the
video. Encoder designers were left free to design their own
encoding algorithms, as long as their output was constrained
properly to allow it to be decoded by any decoder made
according to the standard. Encoders are also left free to
perform any pre-processing they want to their input video,
and decoders are allowed to perform any post-processing they
want to their decoded video prior to display. One effective
post-processing technique that became a key element of the
best H.261-based systems is called deblocking filtering.
This reduces the appearance of block-shaped artifacts caused
by the block-based motion compensation and spatial transform
parts of the design. Indeed, blocking artifacts are probably
a familiar phenomenon to almost everyone who has watched
digital video. Deblocking filtering has since become an
integral part of the most recent standard, H.264 (although
even when using H.264, additional post-processing is still
allowed and can enhance visual quality if performed well).
Design refinements introduced in later standardization
efforts have resulted in significant improvements in
compression capability relative to the H.261 design. This
has resulted in H.261 becoming essentially obsolete,
although it is still used as a backward-compatibility mode
in some video conferencing systems and for some types of
internet video. However, H.261 remains a major historical
milestone in the development of the field of video coding.
2.7 Quantization:
Quantization is the procedure of constraining something from
a continuous set of values (such as the real numbers) to a
discrete set (such as the integers). Quantization in
specific domains is discussed in:
• Quantization (signal processing)
  o Quantization (image processing)
  o Quantization (sound processing)
  o Quantization (music)
• Quantization (physics)
  o Canonical quantization
  o Spatial quantization
  o Charge quantization
• Quantization (linguistics)
• Quantization (noncommutative mathematics)
  o Noncommutative geometry
Quantization (image processing):
Quantization, involved in image processing, is a lossy
compression technique achieved by compressing a range of
values to a single quantum value. When the number of
discrete symbols in a given stream is reduced, the stream
becomes more compressible. For example, reducing the number
of colors required to represent a digital image makes it
possible to reduce its file size. Specific applications
include DCT data quantization in JPEG and DWT data
quantization in JPEG 2000.
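The core idea above, collapsing a range of input values onto a single quantum value so the stream has fewer distinct symbols, can be sketched as follows (an illustrative sketch, not code from the source):

```python
def quantize(value, step):
    # Map a value to the nearest multiple of the step size,
    # collapsing a whole range of inputs onto one quantum value.
    return round(value / step) * step

# Six distinct input values collapse to just two quantum values,
# making the resulting stream more compressible.
vals = [13, 14, 15, 33, 34, 35]
coarse = [quantize(v, 8) for v in vals]
```

A larger step size gives fewer distinct symbols (better compression) at the cost of a larger quantization error.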
Color quantization:
Color quantization reduces the number of colors used in an
image; this is important for displaying images on devices
that support a limited number of colors and for efficiently
compressing certain kinds of images. Most bitmap editors and
many operating systems have built-in support for color
quantization. Popular modern color quantization algorithms
include the nearest color algorithm (for fixed palettes),
the median cut algorithm, and an algorithm based on octrees.
It is common to combine color quantization with dithering to
create an impression of a larger number of colors and
eliminate banding artifacts.
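The nearest color algorithm mentioned above, for a fixed palette, can be sketched as follows (an illustrative sketch; the function name is ours):

```python
def nearest_color(pixel, palette):
    # Pick the palette entry with the smallest squared RGB distance
    # to the input pixel -- the "nearest color" rule for fixed palettes.
    return min(palette,
               key=lambda c: sum((p - q) ** 2 for p, q in zip(pixel, c)))
```

Median cut and octree methods differ in that they derive the palette itself from the image before mapping each pixel to its nearest entry.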
2.8 Frequency quantization for image compression:
The human eye is fairly good at seeing small differences in
brightness over a relatively large area, but not so good at
distinguishing the exact strength of a high frequency
(rapidly varying) brightness variation. This fact allows one
to reduce the amount of information required by ignoring the
high frequency components. This is done by simply dividing
each component in the frequency domain by a constant for
that component, and then rounding to the nearest integer.
This is the main lossy operation in the whole process. As a
result of this, it is typically the case that many of the
higher frequency components are rounded to zero, and many of
the rest become small positive or negative numbers.
As human vision is also more sensitive to luminance than
chrominance, further compression can be obtained by working
in a non-RGB color space which separates the two (e.g.
YCbCr), and quantizing the channels separately.[1]
Quantization matrices:
A typical video codec works by breaking the picture into
discrete blocks (8×8 pixels in the case of MPEG[1]). These
blocks can then be subjected to discrete cosine transform
(DCT) to separate out the low frequency and high frequency
components in both the horizontal and vertical direction.[1]
The resulting block (the same size as the original block) is
then divided by the quantization matrix, and each entry
rounded. The coefficients of quantization matrices are often
specifically designed to keep certain frequencies in the
source to avoid losing image quality. Many video encoders
(such as DivX, Xvid, and 3ivx) and compression standards
(such as MPEG-2 and H.264/AVC) allow custom matrices to be
used. Alternatively, the extent of the reduction may be
varied by multiplying the quantizer matrix by a scaling
factor, the quantizer scale code, prior to performing the
division.[1]
The DCT coefficient matrix of a block is divided
element-wise by a common quantization matrix, and each
result is rounded to the nearest integer. For example, a DC
coefficient of −415, divided by the corresponding
quantization-matrix entry of 16 and rounded to the nearest
integer, becomes −26.
Typically this process results in matrices with values
primarily in the upper left (low-frequency) corner. By using
a zig-zag ordering to group the non-zero entries and
run-length encoding, the quantized matrix can be stored much
more efficiently than the non-quantized version.
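The division-and-rounding step can be sketched in Python. The matrix below is the widely published JPEG luminance quantization table (used here as a stand-in for the "common quantization matrix"); the function name is ours:

```python
# JPEG "quality 50" luminance quantization matrix
# (the example table from Annex K of the JPEG standard).
Q50 = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def quantize_block(dct_block, qmatrix=Q50):
    # Divide each DCT coefficient by its matrix entry and round
    # to the nearest integer; this is the main lossy step.
    return [[round(c / q) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(dct_block, qmatrix)]
```

With a DC coefficient of −415, the top-left entry gives round(−415 / 16) = −26, matching the worked example in the text.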
COLOR ANALYSIS
The use of color information has been introduced to the
face-locating problem in recent years, and it has gained
increasing attention since then. Several recent publications
have reported this study, and they have all shown,
in one way or another, that color is a powerful descriptor
that has practical use in the extraction of face location.
The color information is typically used for region rather
than edge segmentation. We classify the region segmentation
into two general approaches, as illustrated in Fig. 1. One
approach is to employ color as a feature for partitioning an
image into a set of homogeneous regions. For instance, the
color component of the image can be used in the region
growing technique, as demonstrated in [24], or as a basis
for a simple thresholding technique, as shown in [23]. The
other approach, however, makes use of color as a feature for
identifying a specific object in an image. In this case, the
skin color can be used to identify the human face. This is
feasible because human faces have a special color
distribution that differs significantly (although not
entirely) from those of the background objects. Hence this
approach requires a color map that models the skin-color
distribution characteristics.
Fig. 1. The use of color information for region
segmentation.
The skin-color map can be derived in two ways, on
account of the fact that not all faces have identical color
features. One approach is to predefine or manually obtain
the map such that it suits only an individual color feature.
For example, here we obtain the skin-color feature of the
subject in a standard head-and-shoulders test image called
Foreman. Although this is a
Fig. 2. Foreman image with a white contour highlighting the
facial region.
color image in YCrCb format, its gray-scale version is
shown in Fig. 2. The figure also shows a white contour
highlighting the facial region. The histograms of the color
information (i.e., Cr and Cb values) bounded within this
contour are obtained as shown in Fig. 3. The diagrams show
that the chrominance values in the facial region are
narrowly distributed, which implies that the skin color is
fairly uniform. Therefore, this individual color feature can
simply be defined by the presence of Cr values within, say,
136 and 156, and Cb values within 110 and 123. Using these
ranges of values, we managed to locate the subject’s face in
another frame of Foreman and also in a different scene (a
standard test image called Carphone), as can be seen in Fig.
4. This approach was suggested in the past by Li and
Forchheimer; however, a detailed procedure for the modeling
of individual color features and their choice of color space
was not disclosed.
Fig. 3. Histograms of Cr and Cb components in the facial
region.
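The individual color feature just described amounts to a pair of range checks on the chrominance values. A minimal sketch (the function name is ours), using the Cr and Cb ranges quoted above for the Foreman subject:

```python
def is_skin(cr, cb, cr_range=(136, 156), cb_range=(110, 123)):
    # A pixel has skin-color appearance if both chrominance
    # values fall within the individually derived ranges.
    return cr_range[0] <= cr <= cr_range[1] and cb_range[0] <= cb <= cb_range[1]
```

The same test with a different (Cr, Cb) window is what a universal map would apply; only the range values change.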
In another approach, the skin-color map can be designed by
adopting a histogramming technique on a given set of
training data and subsequently used as a reference for any
human face. Such a method was successfully adopted by the
authors, by Sobottka and Pitas, and by Cornall and Pang.
Of the two approaches, the first is likely to produce
better segmentation results in terms of reliability and
accuracy by virtue of using a precise map. However, it comes
at the expense of a face-segmentation process that is either
too restrictive, because it uses a predefined map, or
requires human interaction to manually define the necessary
map. Therefore, the second approach is
more practical and appealing, as it attempts to cater to all
personal color features in an automatic manner, albeit in a
less precise way. This, however, raises a very important
issue regarding the coverage of all human races with one
reference map. In addition, the general use of a skin-color
model for region segmentation prompts two other questions,
namely, which color space to use and how to distinguish
other parts of the body and background objects with skin-
color appearance from the actual facial region.
3.1 Color Space:
An image can be presented in a number of different color
space models.
• RGB:
This stands for the three primary colors: red, green,
and blue. It is a hardware-oriented model and is well known
for its color-monitor display purpose.
• HSV:
An acronym for hue-saturation-value. Hue is a color
attribute that describes a pure color, while saturation defines
the relative purity or the amount of white light mixed with
a hue; value refers to the brightness of the image. This model
is commonly used for image analysis.
• YCrCb:
This is yet another hardware-oriented model.
However, unlike the RGB space, here the luminance is
separated from the chrominance data. The Y value represents
the luminance (or brightness) component, while the Cr and Cb
values, also known as the color difference signals,
represent the chrominance component of the image. These are
some, but certainly not all, of the color space models
available in image processing. Therefore, it is important to
choose the appropriate color space for modeling human skin
color. The factors that need to be considered are application
and effectiveness. The intended purpose of the face segmentation
will usually determine which color space to use; at the same
time, it is essential that an effective and robust skin-color
model can be derived from the given color space. For
instance, in this paper, we propose the use of the YCrCb
color space and the reason is twofold. First, an effective
use of the chrominance information for modeling human skin
color can be achieved in this color space. Second, this
format is typically used in video coding, and therefore the
use of the same, instead of another, format for segmentation
will avoid the extra computation required in conversion. On
the other hand, both Sobottka and Pitas and Saxe and Foulds
have opted for the HSV color space, as it is compatible with
human color perception, and the hue and saturation components
have been reported also to be sufficient for discriminating
color information for modeling skin color. However, this
color space is not suitable for video coding. Hunke and
Waibel and Graf et al. used a normalized RGB color space. The
normalization was employed to minimize the dependence on the
luminance values. On this note, it is interesting to point
out that unlike the YCrCb and HSV color spaces, whereby the
brightness component is decoupled from the color information
of the image, in the RGB color space it is not. Therefore,
Graf et al. have suggested preprocessing calibration in order
to cope with unknown lighting conditions. From this point of
view, the skin-color model derived from the RGB color space
will be inferior to those obtained from the YCrCb or HSV
color spaces. Based on the same reasoning, we hypothesize
that a skin-color model can remain effective regardless of
the variation of skin color (e.g., black, white, or yellow)
if the derivation of the model is independent of the
brightness information of the image.
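The decoupling of brightness from color that this hypothesis relies on comes from a linear transform of RGB. A sketch of the full-range (JFIF-style) BT.601 conversion, offered as an illustration rather than as the document's own procedure:

```python
def rgb_to_ycbcr(r, g, b):
    # Full-range (JFIF) BT.601 conversion: Y carries the brightness,
    # while Cb and Cr carry the color-difference (chrominance) signals.
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

Note that for a neutral gray pixel the chrominance components sit at the midpoint 128, so darkness or fairness of skin moves Y while leaving Cr and Cb largely unchanged, which is the intuition behind the luminance-independent skin-color model.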
3.2 Limitations of Color Segmentation
A simple region segmentation based on the skin-color
map can provide accurate and reliable results if there is a
good contrast between skin color and those of the background
objects. However, if the color characteristic of the
background is similar to that of the skin, then pinpointing
the exact face location is more difficult, as there will be
more falsely detected background regions with skin-color
appearance. Note that in the context of face segmentation,
other parts of the body are also considered as background
objects. There are a number of methods to discriminate
between the face and the background objects, including the
use of other cues such as motion and shape. Provided that
the temporal information is available and there is a priori
knowledge of a stationary background and no camera motion,
motion analysis can be incorporated into the face-
localization system to identify nonmoving skin-color regions
as background objects. Alternatively, shape analysis
involving ellipse fitting can also be employed to identify
the facial region from among the detected skin-color
regions. It is a common observation that the appearance of a
human face resembles an oval shape, and therefore it can be
approximated by an ellipse [2]. In this paper, however, we
propose a set of regularization processes that are based on
the spatial distribution and the corresponding luminance
values of the detected skin-color pixels. This approach
overcomes the restriction of
motion analysis and avoids the extensive computation of the
ellipse-fitting method. The details will be discussed in the
next section along with our proposed method for face
segmentation.
In addition to poor color contrast, there are other
limitations of color segmentation when an input image is
taken in some particular lighting conditions. The color
process will encounter some difficulty when the input image
has:
• a “bright spot” on the subject’s face due to reflection of
intense lighting;
• a dark shadow on the face as a result of the use of strong
directional lighting that has partially blackened the facial
region;
• been captured with the use of color filters.
Note that these types of images (particularly in cases 1 and
2) are
posing great technical challenges not only to the color
segmentation approach but also to a wide range of other
face-segmentation approaches, especially those that utilize
edge image, intensity image, or facial feature-points
extraction. However, we have found that the color analysis
approach is immune to moderate illumination changes and
shading resulting from a slightly unbalanced light source,
as these conditions do not alter the chrominance
characteristics of the skin-color model.
FACE-SEGMENTATION ALGORITHM
In this section, we present our methodology to perform
face segmentation. Our proposed approach is automatic in the
sense that it uses an unsupervised segmentation algorithm,
and hence no manual adjustment of any design parameter is
needed in order to suit any particular input image.
Moreover, the algorithm can be implemented in real time, and
its underlying assumptions are minimal. In fact, the only
principal assumption is that the person’s face must be
present in the given image, since we are locating and not
detecting whether there is a face. Thus, the input
information required by the algorithm is a single color
image that consists of a head-and-shoulders view of the
person and a background scene, and the facial region can be
as small as a 32×32-pixel window (or 1%) of a CIF-size
(352×288) input image. The format of the input image is to
follow the YCrCb color space, based on the reason given in
the previous section. The spatial sampling frequency ratio
of Y, Cr, and Cb is 4 : 1 : 1. So, for a CIF-size image, Y
has 288 lines and 352 pixels per line, while both Cr and Cb
have 144 lines and 176 pixels per line each. The algorithm
consists of five operating stages, as outlined in Fig. 5. It
begins by employing a low-level process like color
segmentation in the first stage, then uses higher level
operations that involve some heuristic knowledge about the
local connectivity of the skin-color pixels in the later
stages. Thus, each stage makes full use of the result
yielded by its preceding stage in order to refine the output
result. Consequently, all the stages must be carried out
progressively according to the given sequence.
A detailed description of each stage is presented below. For
illustration purposes, we will use a studio-based
head-and-shoulders image called Miss America to present the intermediate
results obtained from each stage of the algorithm.
A. Stage One—Color Segmentation
The first stage of the algorithm involves the use of
color information in a fast, low-level region segmentation
process. The aim is to classify pixels of the input image
into skin color and non-skin color. To do so, we have
devised a skin-color reference map in YCrCb color space.
We have found that a skin-color region can be identified by
the presence of a certain set of chrominance (i.e., Cr and
Cb) values narrowly and consistently distributed in the
YCrCb color space. The location of these chrominance values
has been found and can be illustrated using the CIE
chromaticity diagram as shown in Fig. 7. We denote RCr and
RCb as the respective ranges of Cr and Cb values that
correspond to skin color, which subsequently define our
skin-color reference map. The ranges that we found to be the
most suitable for all the input images that we have tested
are RCr = [133, 173] and RCb = [77, 127]. This map has been
proven, in our experiments, to be
very robust against different types of skin color. Our
conjecture is that the different skin color that we
perceived from the video image cannot be differentiated from
the chrominance information of that image region. So, a map
that is derived from Cr and Cb chrominance values will
remain effective regardless of skin-color
variation. Moreover, our intuitive justification for the
manifestation of similar Cr and Cb distributions of skin
color of all races is that the apparent difference in skin
color that viewers perceived is mainly due to the darkness
or fairness of the skin; these features are characterized by
the difference in the brightness of the color, which is
governed by Y but not Cr and Cb. With this skin-color
reference map, the color segmentation can now begin. Since
we are utilizing only the color information, the
segmentation requires only the chrominance component of the
input image. Consider an input image of M × N pixels, for
which the dimension of Cr and Cb is therefore (M/2) × (N/2).
The output of the color segmentation, and hence stage one of
the algorithm, is a bitmap OA of size (M/2) × (N/2),
described as

OA(x, y) = 1 if Cr(x, y) ∈ RCr and Cb(x, y) ∈ RCb, and 0
otherwise,

for x = 0, ..., (M/2) − 1 and y = 0, ..., (N/2) − 1.
The output pixel at point (x, y) is classified as skin color
and set to one if both the Cr and Cb values at that point
fall inside their respective ranges RCr and RCb. Otherwise, the
pixel is classified as non-skin color and set to zero. To
illustrate this, we perform color segmentation on the input
image of Miss America, and the bitmap produced can be seen in
Fig. 8. The output value of one is shown in black, while the
value of zero is shown in white (this convention will be
used throughout this paper).
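Stage one thus reduces to a per-pixel range test over the two chrominance planes. A sketch (names are ours; the default ranges are the reference-map values quoted above, included here purely for illustration):

```python
def color_segment(cr_plane, cb_plane, rcr=(133, 173), rcb=(77, 127)):
    # Build the stage-one bitmap: 1 for pixels whose Cr and Cb values
    # both fall inside the skin-color reference map, 0 otherwise.
    return [[1 if rcr[0] <= cr <= rcr[1] and rcb[0] <= cb <= rcb[1] else 0
             for cr, cb in zip(cr_row, cb_row)]
            for cr_row, cb_row in zip(cr_plane, cb_plane)]
```

Only the chrominance planes are consulted, which is why the map is expected to hold across variations in skin brightness.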
Among all the stages, this first stage is the most
vital. Based on our model of human skin color, the color
segmentation has to remove as many pixels as possible that
are unlikely to belong to the facial region while catering
for a wide variety of skin color. However, if it falsely
removes too many pixels that belong to the facial region,
then the error will propagate down the remaining stages of
the algorithm, consequently causing a failure to the entire
algorithm. Nevertheless, the result of color segmentation is
the detection of pixels in a facial area and may also
include other areas where the chrominance values coincide
with those of the skin color (as is the case in Fig. 8).
Hence the successive operating stages of the algorithm are
used to remove these unwanted areas.
B. Stage Two—Density Regularization
This stage considers the bitmap produced by the
previous stage to contain the facial region that is
corrupted by noise. The noise may appear as small holes in
the facial region due to undetected facial features such as
eyes and mouth, or it may also appear as objects with skin-
color appearance in the background scene. Therefore, this
stage performs simple morphological operations such as dilation
to fill in any small hole in the facial area and erosion to
remove any small object in the background area. The
intention is not necessarily to remove the noise entirely
but to reduce its amount and size. To distinguish between
these two areas, we first need to identify regions of the
bitmap that have higher probability of being the facial
region. The probability measure that we used is derived from
our observation that the facial color is very uniform, and
therefore the skin-color pixels belonging to the facial
region will appear in a large cluster, while the skin-color
pixels belonging to the background may appear as large
clusters or small isolated objects. Thus, we study the
density distribution of the skin-color pixels detected in
stage one. An array of density values, called the density
map D(x, y), is computed as

D(x, y) = sum over i = 0..3 and j = 0..3 of OA(4x + i, 4y + j),

where OA denotes the output bitmap of stage one.
It first partitions the output bitmap of stage one into
nonoverlapping groups of 4×4 pixels, then counts the number
of skin-color pixels within each group and assigns this
value to the corresponding point of the density map.
According to the density value, we classify each point into
three types, namely, zero (D = 0), intermediate (0 < D < 16),
and full (D = 16). A group of points with zero density value
will represent a nonfacial region, while a group of
full-density points will signify a cluster of skin-color
pixels and a high probability of belonging to a facial
region. Any point of intermediate density value will
indicate the presence of noise. The density map of Miss
America with the three density classifications is depicted
in Fig. 9. The point of zero
density is shown in white, intermediate density in gray, and
full density in black. Once the density map is derived, we
can then begin the process that we termed as density
regularization. This involves the following three steps.
1. Discard all points at the edge of the density map, i.e.,
set D(0, y) = D((M/8) − 1, y) = D(x, 0) = D(x, (N/8) − 1) = 0
for all x = 0, ..., (M/8) − 1 and y = 0, ..., (N/8) − 1.
2. Erode any full-density point (i.e., set it to 0) if it is
surrounded by fewer than five other full-density points in
its local 3×3 neighborhood.
3. Dilate any point of either zero or intermediate density
(i.e., set it to 16) if there are more than two full-density
points in its local 3×3 neighborhood.
After this process, the density map is converted to the
output bitmap of stage two, OB, as

OB(x, y) = 1 if D(x, y) = 16, and 0 otherwise.

The result of stage two for the Miss America image is
displayed in Fig. 10. Note that this bitmap is now four
times lower in spatial resolution than the output bitmap of
stage one.
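The density computation and the three-way classification described above can be sketched as follows (an illustrative sketch; names are ours, and the edge-discarding and erode/dilate passes of the numbered steps are omitted):

```python
def density_map(bitmap):
    # Count skin-color pixels in each non-overlapping 4x4 group
    # of the stage-one bitmap.
    m, n = len(bitmap), len(bitmap[0])
    return [[sum(bitmap[4 * x + i][4 * y + j]
                 for i in range(4) for j in range(4))
             for y in range(n // 4)]
            for x in range(m // 4)]

def classify(d):
    # Three-way classification of a density value.
    return 'zero' if d == 0 else 'full' if d == 16 else 'intermediate'
```

A fully skin-colored 4×4 group yields the maximum density of 16, marking a probable facial cluster; intermediate values flag noise.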
C. Stage Three—Luminance Regularization
We have found that in a typical videophone image, the
brightness is nonuniform throughout the facial region, while
the background region tends to have a more even distribution
of brightness. Hence, based on this characteristic,
background region that was previously detected due to its
skin-color appearance can be further eliminated. The
analysis employed in this stage involves the spatial
distribution characteristic of the luminance values since
they define the brightness of the image. We use standard
deviation as the statistical measure of the distribution.
Note that the size of the previously obtained bitmap is
(M/8) × (N/8); hence each point corresponds to a group of
8×8 luminance values, denoted by W(x, y), in the original
input image. For every skin-color pixel in the bitmap, we
calculate the standard deviation, denoted as σ(x, y), of its
corresponding group of luminance values, using

σ(x, y) = sqrt( E[W(x, y)^2] − (E[W(x, y)])^2 ),

where E[·] denotes the mean over the 64 luminance values in
the group.
Fig. 11 depicts the standard deviation values calculated for
the Miss America image. If the standard deviation is below a
value of two, then the corresponding 8×8-pixel region is
considered too uniform and therefore unlikely to be part of
the facial region. As a result, the output bitmap of stage
three, denoted as OC, is derived as

OC(x, y) = 1 if OB(x, y) = 1 and σ(x, y) ≥ 2, and 0
otherwise,

where OB is the output bitmap of stage two.
The output bitmap of this stage for the Miss America image is
presented in Fig. 12. The figure shows that a significant
portion of the unwanted background region was eliminated at
this stage.
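The per-block luminance measure of this stage can be sketched as follows, assuming the threshold of two quoted above (names are ours):

```python
from statistics import pstdev

def too_uniform(luma_block, threshold=2.0):
    # A block whose luminance standard deviation falls below the
    # threshold is considered too flat to belong to the facial
    # region and is eliminated.
    samples = [v for row in luma_block for v in row]
    return pstdev(samples) < threshold
```

The population standard deviation (`pstdev`) is used since all 64 samples of the 8×8 block are available, not a sample of them.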
D. Stage Four—Geometric Correction
We performed a horizontal and vertical scanning process
to identify the presence of any odd structure in the
previously obtained bitmap, and subsequently removed it.
This is to ensure that a correct geometric shape of the
facial region is obtained. However, prior to the scanning
process, we will attempt to further remove any more noise by
using a technique similar to that initially introduced in
stage two. Therefore, a pixel in the bitmap with the value
of one will remain a detected pixel if there are more than
three other pixels, in its local 3×3 neighborhood, with the
same value. At the same time, a pixel with a value of zero
will be reconverted to a value of one (i.e., as a potential
pixel of the facial region) if it is surrounded by more than
five pixels, in its local 3×3 neighborhood, with a value of
one. These simple procedures will ensure that noise
appearing on the facial region is filled in and that
isolated noise objects on the background are removed.
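The neighbor-counting rules above can be sketched as a single pass over the bitmap (an illustrative sketch; border handling is simplified and names are ours):

```python
def geometric_correction(bitmap):
    # Keep a 1-pixel only if more than three of its 3x3 neighbors
    # are 1; promote a 0-pixel to 1 if more than five neighbors are 1.
    m, n = len(bitmap), len(bitmap[0])

    def neighbors(x, y):
        # Count 1-valued neighbors in the local 3x3 window,
        # excluding the center pixel itself.
        return sum(bitmap[i][j]
                   for i in range(max(0, x - 1), min(m, x + 2))
                   for j in range(max(0, y - 1), min(n, y + 2))
                   if (i, j) != (x, y))

    out = [[0] * n for _ in range(m)]
    for x in range(m):
        for y in range(n):
            k = neighbors(x, y)
            if bitmap[x][y] == 1:
                out[x][y] = 1 if k > 3 else 0
            else:
                out[x][y] = 1 if k > 5 else 0
    return out
```

Isolated noise pixels lose their value of one, while small holes surrounded by detected pixels are filled in, as described.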
E. Stage Five—Contour Extraction
In this final stage, we convert the output bitmap of
stage four back to the dimension of (M/2) × (N/2). To
achieve the increase in spatial resolution, we utilize the
edge information already made available by the color
segmentation in stage one. Therefore, all the boundary
points in the previous bitmap will be mapped into the
corresponding group of 4×4 pixels, with the value of each
pixel as defined in the output bitmap of stage one. The
representative output bitmap of this final stage of the
algorithm is shown in Fig. 14.
We then commence the horizontal scanning process on the
“filtered” bitmap. We search for any short continuous run of
pixels that are assigned with the value of one. For a
CIF-size image, the threshold for a group of connected pixels
to belong to the facial region is four. Therefore, any group
of less than four horizontally connected pixels with the
value of one will be eliminated and assigned to zero. A
similar process is then performed in the vertical direction.
The rationale behind this method is that, based on our
observation, any such short horizontal or vertical run of
pixels with the value of one is unlikely to be part of a
reasonable-size and well-detected facial region. As a
result, the output bitmap of this stage should contain the
facial region with minimal or no noise.
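The run-filtering described above can be sketched for a single row as follows (an illustrative sketch; the same pass is then repeated column-wise):

```python
def remove_short_runs(row, min_len=4):
    # Zero out any horizontal run of 1s shorter than min_len,
    # since such short runs are unlikely to belong to a
    # reasonably sized facial region.
    out, i, n = row[:], 0, len(row)
    while i < n:
        if out[i] == 1:
            j = i
            while j < n and out[j] == 1:
                j += 1          # find the end of this run of 1s
            if j - i < min_len:
                for k in range(i, j):
                    out[k] = 0  # eliminate the short run
            i = j
        else:
            i += 1
    return out
```

With the CIF-size threshold of four, a run of three connected pixels is cleared while a run of five survives.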
Simulation Results and Discussions
The proposed skin-color reference map is intended to
work on a wide range of skin color, including that of people
of European, Asian, and African descent. Therefore, to show
that it works on subjects with skin color other than white
(as is the case with the Miss America image), we have used the
same map to perform the color-segmentation process on
subjects with black and yellow skin color. The results
obtained were very good, as can be seen in Fig. 15. The
skin-color pixels were correctly identified, in both input
images, with only a small amount of noise appearing, as
expected, in the facial regions and background scenes, which
can be removed by the remaining stages of the algorithm. We
have further tested the skin-color map with 30 samples of
images. Skin colors were grouped into three classes: white,
yellow, and black. Ten samples, each of which contained the
facial region of a different subject captured in a different
lighting condition, were taken from each class to form the
test set. We have constructed three normalized histograms
for each sample in the separate Y, Cr, and Cb components.
The normalization process was used to account for the
variation of facial-region size in each sample. We have then
taken the average results from the ten samples of each
class. These average normalized histogram results are
presented in Fig. 16. Since all samples were taken from
different and unknown lighting conditions, the histograms of
the Y component for all three classes cannot be used to
verify whether the variations of luminance values in these
image samples were caused by the different skin color or by
the different lighting conditions. However, the use of such
samples illustrated that the variation in illumination does
not seem to affect the skin-color distribution in the Cr and
Cb components. On the other hand, the histograms of Cr and
Cb components for all three classes clearly showed that the
chrominance values are indeed narrowly distributed, and
more important, that the distributions are consistent across
different classes. This demonstrated that an effective skin-
color reference map could be achieved based on the Cr and Cb
components of the input image. The face-segmentation
algorithm with this universal skin-color reference map was
tested on many head-and-shoulders images. Here we emphasize
that the face-segmentation process was designed to be
completely automatic, and therefore the same design
parameters and rules (including the reference skin-color map
and the heuristic) as described in the previous section were
applied to all the test images. The test set now contained
20 images from each class of skin color. Therefore, a total
of 60 images of different subjects, background complexities,
and lighting conditions from the three classes were used.
Fig. 16. Histograms of Y, Cr, and Cb values of different
facial skin colours: (a) white, (b) yellow, and (c) black.
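The per-sample normalized histograms used in this analysis can be sketched as follows. This is a Python/NumPy sketch; the function names and the 8-bit value range are assumptions:

```python
import numpy as np

def normalized_histogram(component, mask, bins=256, value_range=(0, 256)):
    """Histogram of one color component (Y, Cr, or Cb) over the
    facial-region mask, normalized by the region size so that
    samples with different face sizes are comparable."""
    values = component[mask > 0]
    hist, _ = np.histogram(values, bins=bins, range=value_range)
    return hist / max(values.size, 1)

def average_histograms(samples):
    """Average the normalized histograms of several
    (component, mask) samples from one skin-color class."""
    hists = [normalized_histogram(c, m) for c, m in samples]
    return np.mean(hists, axis=0)
```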
Fig. 19. (a) Foreground MBs and (b) background MBs; (c)
coded by RM8 and (d) coded by H.261FB; (e) magnified image
of (c); (f) magnified image of (d).
Fig. 20. Bit rates achieved by RM8 and H.261FB coders at a
target bit rate of 192 kbits/s.
CONCLUSION
The color analysis approach to face segmentation was
discussed. In this approach, the face location can be
identified by performing region segmentation with the use of
a skin-color map. This is feasible because human faces have
a special color distribution characteristic that differs
significantly from those of the background objects. We have
found that pixels belonging to the facial region of an image
in YCrCb color space exhibit similar chrominance values.
Furthermore, a consistent range of chrominance values was
also discovered from many different facial images, which
include people of European, Asian, and African descent. This
led us to the derivation of a skin-color map that models the
facial color of all human races. With this universal skin-
color map, we classified the pixels of the input image into
skin color and non-skin color. Consequently, a bitmap is
produced containing the facial region corrupted by noise.
The noise may appear as small holes on the facial region due
to undetected facial features, or it may appear as objects
with skin-color appearance in the background scene. To cope
with this noise and at the same time refine the facial-
region detection, we have proposed a set of novel region-
based regularization processes that are based on the spatial
distribution of the detected skin-color pixels and their
corresponding luminance values. All the operations are
unsupervised and low in computational complexity.
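The chrominance-based classification summarized above can be sketched as follows. This is a Python/NumPy sketch; the function name and the exact Cr/Cb ranges are illustrative assumptions, not the map derived in this work:

```python
import numpy as np

def skin_bitmap(cr, cb, cr_range=(133, 173), cb_range=(77, 127)):
    """Classify pixels as skin (1) or non-skin (0) using only the
    chrominance components, producing the initial binary bitmap.

    The Cr/Cb ranges given here are assumed example values; an
    actual map would be derived from histogram analysis of many
    facial samples, as described in the simulation section.
    """
    return ((cr >= cr_range[0]) & (cr <= cr_range[1]) &
            (cb >= cb_range[0]) & (cb <= cb_range[1])).astype(np.uint8)
```

Because only Cr and Cb are tested, the classification is independent of the luminance component, which is consistent with the observation that illumination variations leave the chrominance distribution largely unaffected.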
Future Scope: This technique can be extended to segment and
track a person's face across the frames of a video sequence.
The detected skin color could also be compared against a
stored reference for a known person; a significant
difference between the detected and reference skin color
would suggest that the person in the video is not the
original person.
BIBLIOGRAPHY
1. D. Chai and K. N. Ngan, “Foreground/background video
coding scheme,” in Proc. IEEE Int. Symp. Circuits
Syst., Hong Kong, June 1997, vol. II, pp. 1448–1451.
2. A. Eleftheriadis and A. Jacquin, “Model-assisted
coding of video teleconferencing sequences at low bit
rates,” in Proc. IEEE Int. Symp. Circuits Syst.,
London, U.K., June 1994, vol. 3, pp. 177–180.
3. K. Aizawa and T. Huang, “Model-based image coding:
Advanced video coding techniques for very low-rate
applications,” Proc. IEEE, vol. 83, pp. 259–271, Feb.
1995.
4. V. Govindaraju, D. B. Sher, R. K. Srihari, and S. N.
Srihari, “Locating human faces in newspaper
photographs,” in Proc. IEEE Computer Vision Pattern
Recognition Conf., San Diego, CA, June 1989, pp. 549–
554.
5. G. Sexton, “Automatic face detection for
videoconferencing,” in Proc. Inst. Elect. Eng.
Colloquium Low Bit Rate Image Coding, May 1990, pp.
9/1–9/3.
6. V. Govindaraju, S. N. Srihari, and D. B. Sher, “A
computational model for face location,” in Proc. Int.
Conf. Computer Vision, Dec. 1990, pp. 718–721.
7. H. Li, “Segmentation of the facial area for
videophone applications,” Electron. Lett., vol. 28,
pp. 1915–1916, Sept. 1992.
8. S. Shimada, “Extraction of scenes containing a
specific person from image sequences of a real-world
scene,” in Proc. IEEE TENCON’92, Melbourne, Australia,
Nov. 1992, pp. 568–572.
9. M. Menezes de Sequeira and F. Pereira, “Knowledge-
based videotelephone sequence segmentation,” in Proc.
SPIE Visual Commun. And Image Processing, vol. 2094,
Nov. 1993, pp. 858–869.
10. G. Yang and T. S. Huang, “Human face detection
in a complex background,” Pattern Recognit., vol. 27,
no. 1, pp. 53–63, Jan. 1994.
11. A. Eleftheriadis and A. Jacquin, “Automatic face
location detection and tracking for model-assisted
coding of video teleconferencing sequences at low-
rates,” Signal Process. Image Commun., vol. 7, nos. 4–
6, pp. 231–248, Nov. 1995.
Video track

clc; clear all; close all;

[filename, pathname] = uigetfile('*.avi', 'select the source video');
str2 = '.bmp';
file = aviinfo(filename);           % get information about the video file
frm_cnt = file.NumFrames;           % number of frames in the video file
frmt = aviread(filename);
for i = 1:25
    frm(i) = aviread(filename, i);  % read one frame of the video file
    frm_name = frame2im(frm(i));
    filename1 = strcat(num2str(i), str2);
    imwrite(frm_name, filename1);   % save the frame as a bitmap image
end

%%%% face recognition in video %%%%
for p = 1:25
    filename_1 = strcat(num2str(p), str2);  % form the filename
    I = imread(filename_1);
    skin_region = skin(I);                  % skin segmentation (see below)
    se = strel('disk', 5);
    se2 = strel('disk', 3);
    er = imerode(skin_region, se2);
    cl = imclose(er, se);
    dil = imdilate(cl, se);                 % morphological dilation
    dil = imdilate(dil, se);
    cl2 = imclose(dil, se);
    d2 = imfill(cl2, 'holes');              % morphological hole filling
    facearea = bwdist(~d2);  % minimal Euclidean distance to a non-face pixel
    face(:,:,1) = double(I(:,:,1)) .* d2;
    face(:,:,2) = double(I(:,:,2)) .* d2;
    face(:,:,3) = double(I(:,:,3)) .* d2;
    face_a = uint8(face);
    vwiw = strcat(num2str(p), str2);
    imwrite(face_a, vwiw);
end

for k = 1:25
    cimg = strcat(num2str(k), str2);
    v = imread(cimg);
    [Y, map] = rgb2ind(v, 128);
    F(:,:,k) = im2frame(flipud(Y), map);
end
[h, w, p] = size(F(1).cdata);
hf = figure('Name', 'TRACKED VIDEO');
set(hf, 'position', [w h w h]);
movie(gcf, F);
delete '*.bmp'
[h, w, p] = size(frm(1).cdata);
hf = figure('Name', 'ORIGINAL VIDEO');
set(hf, 'position', [w h w h]);
movie(gcf, frm);
skin track

function segment = skin(I)
I = double(I);
% rgb2hsv returns a single m-by-n-by-3 array; extract the hue plane
% (the original three-output call would raise an error)
hsv_img = rgb2hsv(uint8(I));
hue = hsv_img(:,:,1);
% Note: the standard BT.601 Cb equation uses a coefficient of
% -0.148 on the red channel; the original +0.148 and its matching
% thresholds are kept here unchanged.
cb = 0.148*I(:,:,1) - 0.291*I(:,:,2) + 0.439*I(:,:,3) + 128;
cr = 0.439*I(:,:,1) - 0.368*I(:,:,2) - 0.071*I(:,:,3) + 128;
[w, h] = size(I(:,:,1));
segment = zeros(w, h);               % preallocate the output bitmap
for i = 1:w
    for j = 1:h
        if 140 <= cr(i,j) && cr(i,j) <= 165 && ...
           140 <= cb(i,j) && cb(i,j) <= 195 && ...
           0.01 <= hue(i,j) && hue(i,j) <= 0.1
            segment(i,j) = 1;
        else
            segment(i,j) = 0;
        end
    end
end
im(:,:,1) = I(:,:,1) .* segment;
im(:,:,2) = I(:,:,2) .* segment;
im(:,:,3) = I(:,:,3) .* segment;