A fuzzy inference approach to template-based visual tracking
Machine Vision and Applications manuscript No.(will be inserted by the editor)
J. Pablo Ramirez-Paredes · Raul E. Sanchez-Yanez · Victor Ayala-Ramirez
the date of receipt and acceptance should be inserted later
Abstract The tracking of visual features using appearance models is a well studied but still open area of computer vision. In the absence of knowledge about the structural constraints of the tracked object, the validity of the model can be compromised if only appearance information is used. We propose a fuzzy inference scheme that can be used to selectively update a given template-based model in tracking tasks. This allows us to track moving objects under translation, rotation and scale changes with minimal feature drift. Moreover, no rigidity constraint needs to be enforced on the moving target. Some experiments have been performed using several targets, and the results are very close to the ground truth paths. The computational cost of our approach is low enough to allow its application in real-time tracking using modest hardware requirements.
Keywords Target tracking, deformable template,
fuzzy system, real-time vision.
1 Introduction
Many approaches have been proposed to address the problem of object tracking in computer vision. In most instances, researchers have used extracted features such as edges, interest points, color blobs, etc., as pointed out by Chaumette [1]. While it might be clear that using full image templates is the object representation with the most information content, the interpretation of this information is a difficult task. As an object moves in the observed scene, its appearance changes. An open problem is the tracking of objects that change their 3D pose over time. It is evident that the appearance of an object will change under such a transformation, and this object can even disappear from the field of view.

J-P Ramirez-Paredes · R. E. Sanchez-Yanez · V. Ayala-Ramirez
Engineering Division, Universidad de Guanajuato, Campus Irapuato-Salamanca. Salamanca, Gto., Mexico.
Tel.: +52-464-6479940
Fax: +52-464-6479940 ext 2311
E-mail: [email protected]
For tracking applications, the need of comparing single template models for matching arises. There have been some recent advances in this field, like that of Bigot et al. [2]. They propose the use of M-estimators to obtain a mean pattern image, useful for classification and retrieval applications. Their focus is on diffeomorphic deformations, so other transformations derived from perspective projection are not considered. While not directly dealing with object tracking, these approaches deal with the fundamental template matching problem, introducing deformations.
Modeling the object using several facets extracted from different points of view is a way of coping with the aforementioned problem. Jurie and Dhome [3] track objects by storing several templates and matching them. These template collections allow them to estimate the 3D pose of the target. They discuss the different appearance of objects as they are observed from different points of view.
Hager and Belhumeur [4] attempt to track objects that change their appearance using models of illumination and geometric changes. They achieve good performance, although the construction of such models requires sets of training images of the model. This prevents their approach from working with novel objects from arbitrary image sequences. A more recent work by Hager and Stewart [5] uses the Sum of Squared Differences (SSD) as a search function, using different kernels to extract information from feature histograms. This certainly outperforms a pure SSD tracker, although they focus on using color as the feature to track instead of using the raw pixel data of the templates.
Color-based features have been used successfully to track objects with appearance changes, even over varying backgrounds. Collins et al. [6] achieve this by selecting among a set of features derived from the RGB color space, using a variance ratio to establish their discriminability. They also propose a measure for distraction, that is, objects similar to the one tracked, and discuss a way to minimize its effects. The mean shift algorithm is used as the fundamental tracking mechanism. This algorithm, coupled with an SSD tracker of SIFT features, is the approach proposed by Zhou et al. [7]. They report results that outperform either of those techniques by itself.
Another set of solutions for the tracking of deformable templates attempts to overcome changes of appearance due not only to 3D displacements but also to the absence of rigidity constraints. An example of these methods is the Active Appearance Models (AAM) by Cootes et al. [8]. They use angle relationships in a mesh formed by several image patches. This approach is mainly used to track and match face images, and it also uses eigenimages to reduce the search space. A derivation of the AAM, the online appearance models proposed by Jepson et al. [9], combines several approaches, including steerable pyramids with a generative model, in order to track nonrigid objects under various transformations. In general, active appearance models use deformable grids to track feature sets, and some of these approaches attempt to overcome partial occlusion, like that of Gross et al. [10].
After reviewing the current literature on tracking nonrigid objects, a question arises: is it possible to propose a tracker capable of dealing with object appearance changes while delivering near real-time performance? Even with more computing power increasingly available in new devices, the complexity of algorithms with real-time goals has to be limited to achieve good performance. In this paper, we use the model template similarity as the single object property. By using a fuzzy logic approach, we are able to track objects in scenarios where standard trackers based on correlation-like measures fail, all while keeping a fairly low computational complexity. This allows us to reach 10 frames per second using off-the-shelf hardware.
The main contribution of this work is the possibility of tracking image templates in real time over sequences where their appearance changes. The results of our experiments show that, even if there is some feature drift, the selected objects are tracked successfully. The final position error is low even for objects that gradually experience simultaneous rotation and translation in space, or that change their configuration.
This paper is organized as follows: in Section 2 an overview of the proposed system is presented. Next, in Section 3, the process of selecting and extracting candidate image templates for tracking is described, including the use of a Kalman filter for linear position prediction. Section 4 contains the description of the fuzzy inference method that is applied to image templates in order to track them. This completes the description of the developed system. Section 5 presents execution results of our implementation that show its real-time performance, and Section 6 includes experiments to evaluate its performance. Finally, Section 7 is a discussion of the results of our approach.
2 Overview
At its core, our approach consists of a template-based tracker, modified to introduce an update procedure for the template under tracking. As shown in Fig. 1, the basic building blocks for our system include the image acquisition, the search for the reference or model template, and its location. First, some fuzzy sets to be used in the system have to be initialized using correlation data from the object to be tracked or from another experimental source, as will be described in Section 4. Then, there is an image capture stage, where an image coming from a camera or a previously stored file is acquired. If the system is at time step zero, it needs to be initialized with the model of the target to be tracked. In the case of appearance-based trackers, like in our approach, the target needs to be salient enough to be detectable in the following images. In order to select suitable targets, we use an interest point detector to provide features as cues.
From the next iteration onwards, the model template at that step k is compared with those included in a Region of Interest (ROI) for the tracker. We use a correlation measure to estimate the position of the target in the current frame. At the end of this position estimation stage, the position information used by the tracker is updated in another stage that provides a location for the object.
The location at this time step serves as the basis for the search in the next iteration, since we perform the search for the model inside a region of interest only. To cope with object appearance changes, we introduced a model update method that is based on the principles of fuzzy logic.
Fig. 1: Overview of the complete tracking system for our approach.

Fuzzy logic has been applied successfully in several areas of engineering. Control systems, in particular, have been formulated on the basis of fuzzy logic [11]. This approach to reasoning under uncertainty has special appeal in areas where human knowledge needs to be translated into machine instructions. By using fuzzy logic, it is possible to construct inference engines that gather human expert knowledge. In the area of computer vision, for instance, motion analysis using fuzzy logic to detect certain patterns of interest has been explored by Anderson et al. [12] and by Chen et al. [13].
In our proposal, we have applied fuzzy logic to determine if the target model needs to be updated and, if so, to what extent the model needs to be modified in order to better represent the current appearance of the target. This results in a novel method to update the model in a target tracking system that requires a low computational cost for its implementation.
3 Template Selection and Tracking
As was mentioned before, our approach to object tracking starts from the selection of a template with good features to track. After a certain image point is selected, the image template formed by the surrounding pixel neighborhood is used as initialization data for the tracker. As the tracked object undergoes different kinds of motion, its appearance changes. This effect appears also if the object remains fixed and the camera moves. Using a correlation-based measure, we determine the degree of similarity between the current model template and the template that is most alike within a ROI. This ROI is the image region to which the search is restricted.

In tracking tasks, the question of which features to track is of the utmost importance. When we follow a particular object with no rigidity constraints, we have to rely on correlation-like methods to detect an image template from frame to frame. For these methods not to fail, the visual distinctiveness of the template with respect to its surroundings is crucial. Several methods [14,15] have achieved the detection of such regions.
In this work we use the Harris and Stephens detector [16], a well accepted technique that is based on computing the autocorrelation function for local pixel neighborhoods. By using this detector, we obtain some tolerance to noise and small changes in translation or rotation of the object. There is a modification of this kind of detector that provides invariance under affine transformations as well [17], although Mikolajczyk et al. [18] have shown in a performance comparison that little is gained by deviating from the original detector for object tracking.
The detector by Harris and Stephens provides a robust method to extract regions of an image with enough information content to make them worth tracking. The response of this detector is isotropic with respect to the intensity changes. Specifically, it is able to detect edges and corners at any angle.

A procedural formulation of this detector is as follows: first, the image gradients in the horizontal and vertical directions are obtained. These gradients are computed from the convolution of small pixel neighborhoods for each image location with the first-order differential of a Gaussian kernel. This kernel has a standard deviation σ.
Ix(x, y, σ) = I(x, y) ⋆ Gx(x, y, σ) (1)
Iy(x, y, σ) = I(x, y) ⋆ Gy(x, y, σ) (2)
The autocorrelation matrix, M, is computed from the directional derivatives. Such a matrix contains the information needed to determine if the point under analysis is an interest point. The notation here is slightly simplified, omitting the arguments of functions Ix and Iy.

M(x, y, σ) = [ Ix²   IxIy
               IxIy  Iy²  ]    (3)
Fig. 2: An image with its interest points marked with
squares.
The determination of the interest point quality of a certain image location depends on a response function R(x, y, σ). This function peaks around interest points, but nonmaximal suppression is needed to isolate individual locations.
R(x, y, σ) = det(M(x, y, σ)) − α tr²(M(x, y, σ))    (4)
A sample of the results from this interest point detector is given in Figure 2. The regions with few interest points detected are discarded for tracking purposes in this work, since their lack of saliency would cause an early failure of our approach. For some applications, the Harris detector has been superseded by the use of SIFT features, as proposed by Lowe [14]. The SURF detector by Bay et al. [19] is a fast alternative. Although these developments provide features that can be detected at different scales, our approach could use them only in the initialization stage to select good candidate regions to track. After that, the performance of the tracker would depend mostly on the fuzzy inference component.
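As an illustration, the response of Eq. 4 can be computed with standard array operations. The following sketch assumes NumPy; the finite-difference gradients, the Gaussian window size and α = 0.04 are conventional choices of ours, not values fixed by the paper.

```python
import numpy as np

def harris_response(img, sigma=1.0, alpha=0.04):
    """Harris-Stephens corner response R = det(M) - alpha * tr(M)^2 (Eqs. 1-4).

    A minimal sketch: gradients via finite differences and a separable
    Gaussian window to accumulate the autocorrelation matrix M.
    """
    img = img.astype(float)
    # Image gradients (stand-ins for the Gaussian-derivative filters Gx, Gy)
    Iy, Ix = np.gradient(img)
    # Entries of the autocorrelation matrix M, Eq. 3
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy
    # Gaussian window: one 1-D kernel applied along both axes
    r = int(3 * sigma)
    t = np.arange(-r, r + 1)
    g = np.exp(-t**2 / (2 * sigma**2))
    g /= g.sum()
    def smooth(a):
        a = np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 0, a)
        return np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 1, a)
    Sxx, Syy, Sxy = smooth(Ixx), smooth(Iyy), smooth(Ixy)
    # Response function, Eq. 4
    det_M = Sxx * Syy - Sxy**2
    tr_M = Sxx + Syy
    return det_M - alpha * tr_M**2
```

Nonmaximal suppression over this response map would then yield the isolated interest points shown in Figure 2.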
3.1 Similarity measure
In order to find an image template in a defined image neighborhood, it is necessary to establish some criterion to measure the similarity of the template compared to the area surrounding each pixel in the neighborhood. Full convolution or correlation is computationally expensive, so several alternatives have been proposed. In an image registration survey, Zitova and Flusser [20] described and analyzed several correlation-like methods, some of which we use. For our application, it is crucial to have a function with a constant domain as the similarity measure. Because of this constraint, we settled on the use of the Normalized Correlation Coefficient (NCC). Our system is aimed to run on a platform with a framegrabbing interface at a performance of around 10 frames per second. Computing the NCC for each point in the image requires a large number of operations. A first approach to reduce the number of operations to be performed is to restrict the object search to the pixel neighborhood of the object, a common heuristic in template-based methods, as discussed by Yilmaz et al. [21].
NCC(x, y) = Σ_W [W − E(W)][I(x, y) − E(I(x, y))] / √( Σ_W [W − E(W)]² Σ_I [I(x, y) − E(I(x, y))]² )    (5)
The NCC can be obtained for each image point as described in Eq. 5. Here, the function E(W) denotes the expected value of the data inside some square window W of the image under analysis. An image window centered at the point (x, y) of the sought template is denoted as I(x, y). The normalization that distinguishes the NCC from simple correlation begins by subtracting the expected value of each window before performing the correlation itself. The result must also be divided by the standard deviation of the data windows. These operations make the NCC a bounded quantity in the interval [−1, 1], similar to computing a projection between normalized vectors. As in that case, comparing identical patterns results in a value of 1. The additional operations to perform make the NCC more costly than simple correlation to implement. An alternative metric that can be used to locate a template in a region is the SSD, shown in Eq. 6.
SSD(x, y) = Σ_{m,n∈N} [W(m, n) − I(x + m, y + n)]²    (6)
Minimizing this measure gives the approximate location of the object sought. Computing this quantity takes fewer operations than the NCC, but its value is not clearly bounded, since high-energy regions of an image compared to the template may have a high SSD value. A compromise between precise target location and a well defined similarity metric can be reached by combining the SSD and NCC measures. In this work, the SSD is used first to obtain an estimate of the location of the target at any moment. Then, the NCC between the original template and the one found using the SSD is obtained as the similarity measure. It is worth mentioning that these template matching algorithms are used on grayscale versions of the processed images. The object search performed in this work does not use color information.
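The SSD-then-NCC combination can be realized directly as sketched below (NumPy assumed; the function names are ours, not from the paper). The SSD of Eq. 6 is minimized by brute force over a ROI to locate the target, and the NCC of Eq. 5 then scores the match.

```python
import numpy as np

def ssd_search(roi, tmpl):
    """Locate tmpl inside roi by minimizing the SSD of Eq. 6.

    Returns the (row, col) of the best-matching window's top-left corner.
    A brute-force sketch; real-time code restricts roi to a small window
    around the last known position.
    """
    th, tw = tmpl.shape
    best, best_pos = np.inf, (0, 0)
    for r in range(roi.shape[0] - th + 1):
        for c in range(roi.shape[1] - tw + 1):
            d = roi[r:r+th, c:c+tw] - tmpl
            ssd = np.sum(d * d)
            if ssd < best:
                best, best_pos = ssd, (r, c)
    return best_pos

def ncc(a, b):
    """Normalized Correlation Coefficient of Eq. 5 for two equal-size patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0
```

The NCC value produced by the second function is what later feeds the Similarity linguistic variable of the fuzzy stage.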
3.2 Searching for the tracked object
Fast object localization is a priority when dealing with tracking problems. Most of the time it is unfeasible to perform a global search to locate a template. Hence, the concept of a search window has to be introduced. In our case, we have empirically determined that most 2D translations over the image plane can be followed using a search window of one tenth of the total image side length.
When analyzing each frame, instead of maximizing the NCC with regression methods we only search for the local maximum at the discrete locations on the image grid. This can be considered as a measurement contaminated with quantization noise. In order to both overcome this limitation and to have a way to predict motion of the template under tracking, we use a Kalman filter. The Kalman filter can estimate the state of a linear system, assuming a Gaussian distribution of the system state.

In our implementation we considered a simplified model for object motion, where the velocity component is ẋ = x_k − x_{k−1}. Computing ẏ is achieved in a similar way. The state vector for such a system is given by x = [x y ẋ ẏ]ᵀ.
The Kalman filter consists of two stages. First, a time update stage changes the a priori state estimation and estimation error covariance. The a priori qualifier is due to the measurements not having taken place yet. This is illustrated in the following equations.

x⁻_k = A x_{k−1} + B u_{k−1}    (7)

P⁻_k = A P_{k−1} Aᵀ + Q    (8)
In these equations, x⁻_k is the a priori state vector estimation at time step k. The vector u_{k−1} is the desired state input. The a posteriori state vector estimation from the previous time step, x_{k−1}, is required for the current step. Also in this stage, the a priori estimation error covariance, P⁻_k, is updated using the a posteriori estimation error covariance from the previous step, P_{k−1}, and Q, which is the process error covariance. It is constant and should be given at the beginning of the procedure.
The state transition matrix, A, describes the dynamics of the system. The simplified model used for object motion mentioned before gives rise to the following state transition matrix.

A = [ 1 0 1 0
      0 1 0 1
      0 0 1 0
      0 0 0 1 ]    (9)
Matrix B, representing any operations to be performed over the state inputs, is equal to zero in our implementation, since no state inputs are considered for the model.
After the a priori estimations have been updated, the measurement update stage takes place. It is in this stage that the Kalman gain is computed. The matrix K_k determines the amount by which the state measurements affect the state estimation, and is computed as follows.

K_k = P⁻_k Hᵀ (H P⁻_k Hᵀ + R)⁻¹    (10)
This Kalman gain equation is the optimal solution that minimizes the a posteriori estimation error covariance. In Eq. 10, R is the measurement error covariance. This is constant and must be determined or approximated beforehand. The remaining equations from this stage update the a posteriori state and error covariance estimations. To this effect, the measurement vector z_k is compared with a measurement prediction H x⁻_k.

x_k = x⁻_k + K_k (z_k − H x⁻_k)    (11)

P_k = (I − K_k H) P⁻_k    (12)
It is important to note that we use the Kalman filter as a method to improve localization of the tracked object and that it is mostly decoupled from the template updating stage. There is one point where both stages interact, though: the fuzzy inference stage can influence the time update stage of the Kalman filter. We address this particular case in the following paragraphs.
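The two stages of Eqs. 7 to 12, together with the interaction just mentioned, can be sketched as follows. This is a minimal sketch assuming NumPy; the covariances Q and R are placeholders that must be tuned for a real sequence.

```python
import numpy as np

# Constant-velocity model for template position (Eqs. 7-12).
# State x = [x, y, x_dot, y_dot]; B = 0 because no control input is used.
A = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])   # Eq. 9
H = np.array([[1., 0., 0., 0.],    # only the (x, y) position is measured
              [0., 1., 0., 0.]])

def kalman_step(x, P, z, Q, R):
    """One predict/update cycle. Passing z = None mimics the Stop output of
    the fuzzy stage: only the time update runs, the measurement is skipped."""
    x_pred = A @ x                    # Eq. 7 (the B u term is zero)
    P_pred = A @ P @ A.T + Q          # Eq. 8
    if z is None:
        return x_pred, P_pred
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)   # Eq. 10
    x_new = x_pred + K @ (z - H @ x_pred)                    # Eq. 11
    P_new = (np.eye(4) - K @ H) @ P_pred                     # Eq. 12
    return x_new, P_new
```

Fed with the template positions returned by the SSD search, the filter both smooths the quantized measurements and predicts where to center the next ROI.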
4 Fuzzy Inference for Object Tracking
In the proposed approach, the model update procedure is based on the fuzzy similarity of the appearance of the target in the current frame with respect to its appearance in the previous frame of the sequence being processed. In order to evaluate the similarity using a fuzzy approach, we define a Similarity linguistic variable. The parameters of the membership functions of the fuzzy sets associated with the linguistic labels were determined by using a fuzzy clustering method.

For a number of targets, we have perturbed their appearance by translating and rotating the target in the image. See Fig. 3 for an example. For each of these positions, we have computed the NCC value of the target patch with respect to the previous one. Figure 4 shows the values of the NCC between frames over time for the previous example.
The NCC dataset is then fed into a Fuzzy C-means (FCM) algorithm in order to find the parameters of the fuzzy sets associated with the linguistic variable Similarity. Specifically, three mean values from the FCM algorithm become the main parameters of their corresponding fuzzy sets.

The FCM method assigns to each datum a membership function corresponding to each class. After some iterations the membership of each datum tends to increase for a certain class and to diminish for the others. A detailed discussion of the FCM method and its extensions has been given by Hathaway et al. [22], but an overview is presented here for the sake of completeness. The FCM algorithm solves the problem of minimizing the data point to cluster center distance, represented as the function of Eq. 13.
J(U, v) = Σ_{i=1}^{c} Σ_{k=1}^{n} U_{ik}^m ||x_k − v_i||²    (13)
Here, v_i is the center of the ith data cluster, while x_k is the kth point in the data set. In our application these are scalar quantities, but the FCM algorithm is useful for multidimensional data sets. There is a parameter, called the fuzzification constant, set to m > 1. If a value close to m = 1 is chosen, the FCM method behaves like the k-means method, while a value m > 2 increases the convergence time.
Matrix U contains the degree of membership for every datum in every cluster. The set of all the cluster centers is represented as a single matrix v. The membership of datum x_k in cluster i is denoted as U_{ik}. The FCM alternates between updating the cluster centers and updating the membership matrix. Updating the cluster centers is done using Eq. 14.

v_i = ( Σ_{k=1}^{n} U_{ik}^m x_k ) / ( Σ_{k=1}^{n} U_{ik}^m )    (14)
This must be done for each of the cluster centers, so i = 1, . . . , c. Updating the membership matrix U is done element by element, as in Eq. 15. The function d_{ij} is a distance metric, usually Euclidean, between some datum x_i and the jth cluster center. In this case, d_{ij} = ||x_i − v_j||².

U_{ij} = [ Σ_{k=1}^{c} (d_{ij} / d_{ik})^{1/(m−1)} ]⁻¹    (15)
Fig. 3: Samples of a training set used for an object.

Fig. 4: Plot of similarity change over time, for the sequence of Figure 3. The resulting parameters are sl = 0.859, sm = 0.935 and sh = 0.985.

The algorithm stops when the error is below some threshold ε. A useful and often applied criterion is to take the max norm of the difference between the membership matrix at iteration t−1 and at the current step t. This criterion is shown in Eq. 16.
max_{ij} { |U_{ij}^(t−1) − U_{ij}^(t)| } < ε    (16)
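A compact version of this clustering procedure, used to derive sl, sm and sh from the NCC samples, might look like the sketch below (NumPy assumed; m = 2, the tolerance and the random initialization are typical defaults of ours, not values from the paper).

```python
import numpy as np

def fcm(data, c=3, m=2.0, eps=1e-5, max_iter=200, seed=0):
    """Fuzzy C-means on a 1-D data set, following Eqs. 13-16.

    Returns the sorted cluster centers; for the NCC data these become
    the constants sl, sm and sh of the Similarity fuzzy sets.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(data, dtype=float)
    n = x.size
    U = rng.random((c, n))
    U /= U.sum(axis=0)                       # memberships sum to 1 per datum
    for _ in range(max_iter):
        Um = U ** m
        v = (Um @ x) / Um.sum(axis=1)        # center update, Eq. 14
        d = (x[None, :] - v[:, None]) ** 2 + 1e-12   # squared distances
        # Membership update, Eq. 15: U[i,k] = 1 / sum_j (d[i,k]/d[j,k])^(1/(m-1))
        U_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (1.0 / (m - 1)),
                             axis=1)
        if np.max(np.abs(U_new - U)) < eps:  # stopping criterion, Eq. 16
            U = U_new
            break
        U = U_new
    return np.sort(v)
```

On a 1-D set with three well separated modes, the three returned centers land near the mode locations, which is exactly how the Similarity set parameters of Fig. 4 were obtained.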
We empirically determined that the data gathered from the NCC change over time describe three different events. In the first case, the visual target experiences slight appearance variations due to illumination changes and other noise-generating phenomena. The second case refers to a change of pose of the target, due to translation or rotation of the object. As expected, the change in the NCC is greater than in the first case. Finally, the last case corresponds to a loss of visibility of the target, due to severe occlusion or a change in pose that exceeds the capacity of the system to recognize it. This three-cluster hypothesis translates into Low, Medium and High similarity situations that must be coped with.
Fig. 5: The fuzzy inference system used to update the model.
4.1 Template updating
An overview of the proposed method to apply fuzzy inference for model update is shown in Fig. 5. Here, the similarity measure at each time step and its change with respect to the previous frame are used as inputs in a fuzzy inference engine. This stage of our scheme contains two Fuzzy Associative Memories (FAMs), describing the input-output fuzzy relations.

For both FAMs, the NCC between the model template and the template found by the search procedure is used as an input. This quantity is then transformed into the linguistic variables Similarity and Similarity Difference. The variable Similarity has three associated fuzzy sets, L, M and H. These sets describe Low, Medium and High Similarity, respectively, as shown in Fig. 6. Meanwhile, the Similarity Difference takes its value from the current NCC and its value from a previous iteration. This variable has two associated fuzzy sets, S and B, describing Small and Big Similarity Differences, respectively.
The fuzzy sets associated with each linguistic variable are constructed from numerical constants obtained from experimental data. The constants sl, sm and sh are taken from the results of the Fuzzy C-means clustering of the NCC data, and satisfy the constraints sl < sm < sh, sl ≥ 0 and sh ≤ 1. If there is no prior information from which to compute these constants, as in the case of tracking a previously unknown object, some predefined set of values can be used. A set of useful values in this situation is sl = 0.25, sm = 0.5 and sh = 0.75, as some empirical tests have shown. This selection is not optimal, and values from training sets should be preferred.
Similarity changes over time, even for relatively stable scenes. The amount of this change is relevant information that we use to aid in the formulation of rules for the fuzzy inference system. The Similarity is denoted by s(k), where k is the current discrete time step and s(k) ∈ [0, 1]. The Similarity Difference is captured by the following equation.
Fig. 6: Membership functions for the Similarity fuzzy
sets.
d(k) = |s(k)− s(k − 1)| (17)
Using these two linguistic variables, s(k) and d(k), we build membership functions for several fuzzy sets. Trapezoidal functions are used for all membership functions. In the case of s(k), the membership functions for each of these sets are summarized in Eqs. 18 to 20. In these equations, µX(y) denotes the membership of the quantity y in the fuzzy set X.
µL(s) = { 1,                          0 ≤ s ≤ sl
        { −(s − sl)/(sm − sl) + 1,    sl < s ≤ sm    (18)

µM(s) = { (s − sl)/(sm − sl),         sl < s ≤ sm
        { −(s − sm)/(sh − sm) + 1,    sm < s ≤ sh    (19)

µH(s) = { (s − sm)/(sh − sm),         sm < s ≤ sh
        { 1,                          sh < s ≤ 1     (20)
Fig. 7: Fuzzy sets for the Similarity Difference input variable.

Fig. 8: Membership functions for the output variable, the fusion coefficient ψ.

The second input variable, d(k), has two fuzzy sets related to it, and they are represented in Fig. 7. The first one corresponds to small changes in Similarity, the other to large changes. The shapes of their membership functions are determined by two constants, ds and db. These quantities are related to the membership function boundaries for the fuzzy sets involving s(k). The Similarity Difference in a single time step is a bounded quantity, since we use the absolute value of d(k) and |d(k)| ≤ 1 ∀ k. In the mean case, where Similarity changes from one frame to the next are small, d(k) has a value on the order of 1×10⁻³. Hence, the values of ds and db are given by Eqs. 21 and 22. A useful value for γ is 1/10. This value was determined using experimental data gathered from the same image sequences that are used to determine the Similarity fuzzy set boundaries.
ds = γ|sh − sm| (21)
db = γ|sh − sl| (22)
The first FAM determines whether the object should be tracked or not, since there are cases in which the system should consider the target to be lost. This means that the model should not be updated with the available visual information, and the measurements of the Kalman filter should not be updated either. To take these instances into account, we used a single-rule Mamdani-Assilian model with a singleton output. The output variable is Boolean and controls whether the tracker should update its position measurement or not, and also if the model template should be updated. The output fuzzy sets are Track and Stop, and they determine if the tracker should update its measurements. The output is the result of evaluating a single rule:

IF s is L AND d is S THEN STOP

      L    M    H
S     0    F3   F1
B     F4   F2   0

Table 1: Decision table for the output variable Fusion.
The behaviour of the tracker when this FAM has a Stop output is equivalent to waiting for an occlusion to clear. The object position is not updated, and the model template is not modified either. If the object remains in a position covered by the ROI after the occlusion clears, the tracking procedure resumes. This strategy works as long as the occluding object enters the ROI quickly enough to trigger the Stop response and not become part of the tracked template data.
The second FAM is the mechanism that enables the system to incorporate new template information into the model template. An output coefficient that is used for the model fusion is computed from the Similarity and Similarity Difference inputs.
In order to determine the fusion coefficient we use the Mamdani implication method of inference. The fuzzy IF-THEN rules are based on the Similarity and Similarity Difference values as antecedents, and they have the fusion coefficient as their consequent. The fusion coefficient, ψ, has four related fuzzy sets. The membership functions for those sets are F1 to F4, describing Low Fusion, Medium-Low Fusion, Medium-High Fusion and High Fusion, respectively. The membership functions for the output fuzzy sets are shown in Fig. 8. The summary of the control rules for the fusion output is the fuzzy associative memory given in Table 1. Eqs. 23 to 26 describe the aggregation that is used to compute the membership of each of the output fuzzy sets.
µF1 = min[µH(s), µS(d)]    (23)

µF2 = min[µM(s), µB(d)]    (24)

µF3 = min[µM(s), µS(d)]    (25)

µF4 = min[µL(s), µB(d)]    (26)
Since ψ needs to be a crisp quantity for our system, we used a defuzzification method at the end of the inference engine. The center of sums is a fast alternative to the centroid method. The center of sums equation for our system is shown in Eq. 27.
Fig. 9: Example of a decision surface for the ψ output
variable.
ψ′ = ( ∫ ψ Σ_{k=1}^{4} µ_Fk(ψ) dψ ) / ( ∫ Σ_{k=1}^{4} µ_Fk(ψ) dψ )    (27)
The decision surface formed by the output of this FAM depends upon the values of sl, sm and sh from the NCC clustering data or from manually selected values. An example for a particular case of these values is given in Fig. 9.
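The whole second FAM, from the two crisp inputs to the defuzzified ψ, can be sketched as below. NumPy is assumed; the input sets follow Eqs. 18 to 22 with the fallback constants for unknown objects, but the triangular output sets F1 to F4 and their peak positions are illustrative assumptions of ours, since the paper defines those shapes graphically in Fig. 8.

```python
import numpy as np

# Fallback Similarity constants and the derived difference constants
sl, sm, sh = 0.25, 0.50, 0.75
gamma = 0.1
ds, db = gamma * abs(sh - sm), gamma * abs(sh - sl)   # Eqs. 21-22

def mu_L(s):                                  # Eq. 18
    if s <= sl: return 1.0
    return max(0.0, (sm - s) / (sm - sl))

def mu_M(s):                                  # Eq. 19
    if sl < s <= sm: return (s - sl) / (sm - sl)
    if sm < s <= sh: return (sh - s) / (sh - sm)
    return 0.0

def mu_H(s):                                  # Eq. 20
    if s >= sh: return 1.0
    return max(0.0, (s - sm) / (sh - sm))

def mu_S(d):   # Small difference: assumed trapezoid over [ds, db]
    if d <= ds: return 1.0
    if d >= db: return 0.0
    return (db - d) / (db - ds)

def mu_B(d):   # Big difference
    return 1.0 - mu_S(d)

PEAKS = [0.125, 0.375, 0.625, 0.875]   # assumed centers of F1..F4

def fusion_coefficient(s, d, n=201):
    """Mamdani min-inference over Table 1, defuzzified with the
    center of sums of Eq. 27 on a discretized psi axis."""
    w = [min(mu_H(s), mu_S(d)),   # F1, Eq. 23: low fusion
         min(mu_M(s), mu_B(d)),   # F2, Eq. 24
         min(mu_M(s), mu_S(d)),   # F3, Eq. 25
         min(mu_L(s), mu_B(d))]   # F4, Eq. 26: high fusion
    psi = np.linspace(0.0, 1.0, n)
    total = np.zeros(n)
    for wk, pk in zip(w, PEAKS):
        tri = np.clip(1.0 - abs(psi - pk) / 0.125, 0.0, 1.0)
        total += np.minimum(wk, tri)   # center of sums: clipped sets summed
    return float((psi * total).sum() / total.sum()) if total.sum() > 0 else 0.0
```

As expected from Table 1, a very similar and stable target yields a small ψ (the model is mostly preserved), while a dissimilar target with a large Similarity Difference yields a large ψ (the model absorbs the new appearance quickly).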
4.2 Model fusion
The appearance model of the visual target is updated using the output from the fuzzy inference stage. The model is mixed with a new measurement using a fusion function. In our tracker implementation, we do not modify the size of the template under inspection. For each discrete location on the model template, an update procedure occurs at every iteration. Both the amount of information that is preserved and the amount that is updated are determined by the quantity derived from the fuzzy inference stage, ψ. We call this quantity the fusion coefficient. The fusion coefficient at time step k is denoted by ψ_k.

Each pixel of the updated model is a function of the old model template, the new matching template and the fusion coefficient. This function is called the model fusion function and is given in Eq. 28.
M_k(x, y, \psi_k) = (1 - \psi_k)\, M_{k-1}(x, y, \psi_{k-1}) + \psi_k\, I_k(x, y)  (28)
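A minimal sketch of the fusion step in Eq. 28, assuming grayscale templates stored as NumPy arrays (`fuse_model` and the array values are hypothetical):

```python
import numpy as np

def fuse_model(model, match, psi):
    """Model fusion function (Eq. 28): blend the previous model template
    with the newly matched image patch, weighted by the fusion coefficient.
    psi = 0 keeps the old model; psi = 1 replaces it with the new patch."""
    model = np.asarray(model, dtype=np.float64)
    match = np.asarray(match, dtype=np.float64)
    return (1.0 - psi) * model + psi * match

# Hypothetical usage: a flat gray model and a brighter new measurement
old_model = np.full((5, 5), 100.0)
new_patch = np.full((5, 5), 200.0)
updated = fuse_model(old_model, new_patch, psi=0.25)  # every pixel -> 125.0
```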
5 Implementation issues
One of the goals that the proposed tracker was meant
to achieve is low computational complexity. This is a
Template size   11      31      51      71      91
NCC             0.586   28.034  201.333 770.126 2030.536
SSD+NCC         0.204   5.480   38.278  138.871 362.763

Table 2: Average processing time per frame, in milliseconds, for different template sizes. An NCC-based tracker and an SSD-based tracker with NCC computation for the target are considered.
problem that every template-based tracker must overcome. The main obstacle for this kind of tracker is the complexity of the correlation algorithm. The most basic form of correlation is generally an O(N²) operation. This can be improved upon by using the convolution theorem of the Fourier transform and the Fast Fourier Transform algorithm. The best case scenario for the FFT is of O(N log N) complexity. Lewis [23] noted that there is a trade-off that makes the FFT inconvenient to use below a certain template size. Coupled with the fact that real-time implementations often restrict the search to a ROI, under the assumption that the translational velocity of the object in the image plane remains below some bound, the exact template size at which the FFT becomes more convenient is difficult to establish.
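The convolution-theorem route mentioned above can be sketched as follows; `fft_correlate` is an illustrative name, and note that without extra padding the result is a circular cross-correlation, so scores near the borders wrap around:

```python
import numpy as np

def fft_correlate(image, template):
    """Cross-correlation via the convolution theorem: multiply the image
    spectrum by the conjugate of the template spectrum, with the template
    zero-padded to the image size. Returns circular correlation scores."""
    H, W = image.shape
    F_img = np.fft.rfft2(image, s=(H, W))
    F_tpl = np.fft.rfft2(template, s=(H, W))
    return np.fft.irfft2(F_img * np.conj(F_tpl), s=(H, W))

# Hypothetical check: correlating with a 2x2 box of ones sums each window
image = np.arange(16, dtype=np.float64).reshape(4, 4)
template = np.ones((2, 2))
scores = fft_correlate(image, template)  # scores[1, 1] == image[1:3, 1:3].sum()
```

For normalized scores (NCC) or for templates near the image borders, extra normalization and padding would be needed on top of this sketch.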
Moreover, our fuzzy inference scheme depends on the use of the NCC, as detailed before. This is an operation even more costly than simple correlation. The fact that the expected value for different pixel windows needs to be calculated increases the complexity. A way to overcome this is using an image representation called an integral image. These integral images are used for feature extraction by Viola and Jones [24], and can be used to obtain the sum of the pixel values in any given area by performing only four operations.
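A sketch of the integral-image idea, assuming a cumulative sum padded with a leading row and column of zeros so that any rectangular sum reduces to the four lookups mentioned above:

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over both axes, zero-padded so that
    ii[y, x] holds the sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def area_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from four lookups (Viola-Jones style)."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
# Sum of the 2x2 block img[1:3, 1:3] = 5 + 6 + 9 + 10 = 30
total = area_sum(ii, 1, 1, 3, 3)
```

This is what makes the window means (and hence the NCC denominators) cheap: every window sum costs four lookups regardless of the window size.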
In this work, the necessity for a bounded quantity to represent template similarity is limited to the fuzzy inference stage. This means that the initial location of the template can be obtained by any alternative procedure. Since the SSD, the NCC and simple correlation can all be used to find the best match between the template and a region of the current image, any of them can be used. It is important to note that using the faster alternatives to the NCC carries potential pitfalls. With the SSD, regions of low energy in the image can give rise to false positives, while the same thing may happen with regions of high energy and correlation.
As a compromise, we use the SSD to locate the template position at any given time. Using this spatial information we proceed to get the NCC between the template and the neighborhood of its location. It is this NCC that allows us to compute the amount of fusion between the current template and the previous one. The speedup resulting from this approach is noticeable even at small template sizes. Table 2 shows the average processing time per frame obtained by tracking image regions of different sizes, comparing using the NCC only to using the SSD+NCC approach. These rates include the time taken by our Kalman filter implementation and the fuzzy inference stage. The time values computed for this table were obtained by using an inexpensive video camera. For this hardware configuration, the camera had a resolution of 640×480 pixels and a frame rate of 6 fps. The machine ran at 1.67 GHz.
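The SSD-then-NCC compromise can be sketched as below; the exhaustive `ssd_locate` search and the tiny synthetic image are illustrative assumptions, not the optimized ROI-restricted implementation:

```python
import numpy as np

def ssd_locate(image, template):
    """Exhaustive SSD search: return the top-left corner of the window
    minimizing the sum of squared differences with the template."""
    th, tw = template.shape
    best, best_pos = None, (0, 0)
    for y in range(image.shape[0] - th + 1):
        for x in range(image.shape[1] - tw + 1):
            window = image[y:y + th, x:x + tw].astype(np.float64)
            ssd = np.sum((window - template) ** 2)
            if best is None or ssd < best:
                best, best_pos = ssd, (y, x)
    return best_pos

def ncc(a, b):
    """Normalized cross-correlation between two equally sized patches,
    bounded in [-1, 1]; this is the similarity fed to the fuzzy stage."""
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

# Hypothetical usage: plant a patch in a blank image and recover it
image = np.zeros((12, 12))
template = (np.arange(9, dtype=np.float64) + 1).reshape(3, 3)
image[4:7, 5:8] = template
y, x = ssd_locate(image, template)
similarity = ncc(image[y:y + 3, x:x + 3], template)
```

Only the cheap SSD pass scans the whole search area; the bounded NCC value is computed once, at the located position, for the fuzzy inference stage.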
6 Tracker evaluation
For the purpose of assessing the performance of our method, a series of tests involving comparisons with a baseline SSD tracker were conducted. Determining the performance of a tracking method is an important and non-trivial task. It is evident that this evaluation is not possible in any satisfactory manner if some ground truth is not available. We use sequences for which ground truth is widely available. This allows us to measure the error of our method and compare it to results offered by others. Video or image sequences captured from real world scenery are difficult to use in performance evaluation. The project called Context Aware Vision using Image-based Active Recognition (CAVIAR) [25] from the European Commission uses image sequences tagged by humans. These sequences are useful to an extent, although they allow some uncertainty to enter the process, in the form of subjectivity from the volunteers.
Another approach to benchmarking tracking algorithms is the use of synthetic sequences. The work by McCane et al. [26] is one example of this. Their benchmark is oriented towards optical flow algorithms. However, the motion vectors provided as ground truth can be used to generate motion paths useful to track the objects. It can be argued that this kind of benchmark is more precise, as it uses optical flow data generated with the images. This benchmark is well suited for the initial testing of trackers.
We ran a set of tests using the benchmark by McCane et al. The comparison between the given optical flow vectors and the output from our Kalman filter at each time step provides evidence of the performance of our approach. Without noticeable noise and camera jitter, the performance of our tracker is close to that of a full optical flow algorithm. Besides these tests, we also used image sequences from the aforementioned CAVIAR project. These sequences contain significantly complex foreground and background objects whose appearance may change in time. The ground truth for these sequences was generated by human observers and provides a natural, qualitative performance assessment.
A pure SSD tracker, including the Kalman filter, is the basis for comparison. This tracker has good performance under pure X-Y translation of the target. Its shortcomings are Z translation (scale change) and all kinds of rotation. The fuzzy tracker is aimed at these problems and our results show that it is successful in a number of situations. The following case descriptions contain the tests that were performed, and discussions about the results.
6.1 Case I: the medium sequence
The first sequence used to test our approach is the image sequence medium from the benchmark of McCane et al. This sequence includes translation and rotation of a 3D model which is on top of a checkerboard pattern. The sequence consists of 10 images, 400 × 300 pixels in size. Fig. 10 contains the starting and ending frames for this sequence, with a marker placed upon the tracked region. The evolution of the tracked template is shown as well. In this figure one can qualitatively assert that the tracker is, in fact, overcoming a large translation over the Z axis. A direct comparison of the location of the tracked template in some frames according to the optical flow algorithm and our approach is shown there. This comparison shows a remarkable qualitative similarity between the ground truth and our tracker. It is relevant to note that the templates shown for our tracker are the result of applying the model fusion function, instead of pixel neighborhoods from the benchmark frames. Hence, this image mosaic does not reflect the exact appearance of the object, showing fused information instead.
These rotations and translations cause an early failure of an SSD-only tracker, but our fuzzy variant tracks the object successfully.

The error for both trackers, expressed as the Euclidean distance from the center of the target to the position in the ground truth, in pixels, is shown in Fig. 12. The error for our tracker is low considering the nature of the image sequence.
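The error measure used throughout this section can be sketched as a per-frame Euclidean distance; the three-frame track below is fabricated purely for illustration:

```python
import numpy as np

def mean_tracking_error(track, ground_truth):
    """Mean Euclidean distance (in pixels) between tracked target centers
    and ground-truth positions, one (x, y) pair per frame."""
    track = np.asarray(track, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    return float(np.mean(np.linalg.norm(track - ground_truth, axis=1)))

# Hypothetical three-frame example with a constant (3, 4) offset
track = [(10, 10), (20, 15), (30, 20)]
truth = [(13, 14), (23, 19), (33, 24)]
err = mean_tracking_error(track, truth)  # -> 5.0 for every frame
```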
6.2 Case II: the street sequence
Another interesting sequence in the benchmark set of McCane et al. is called street. This sequence combines camera movement with objects moving inside the scene. The size of each image is 200 × 200 pixels. This low
Fig. 10: Evolution of the tracked template for the medium sequence over time.
Fig. 11: Evolution of the tracked template for the street sequence over time.
[Plot: Euclidean distance to the target (pixels) vs. frame number, for the Fuzzy Tracker and the SSD Tracker.]
Fig. 12: Errors for the medium sequence.
resolution causes great appearance changes in textured areas, even under small camera displacements. Fig. 11 shows the starting and ending frames of the street sequence, with an overlaid marker for the tracked zone. The evolution over time of the model template for this sequence, compared to the benchmark, is also shown.

The optical flow vectors reach a point of near constant Y coordinate around frame 30. This corresponds with the frames where the tracked feature is no longer
[Plot: Euclidean distance to the target (pixels) vs. frame number, for the Fuzzy Tracker and the SSD Tracker.]
Fig. 13: Errors for the street sequence.
visible. Likewise, our tracker loses the target completely at this point, since the occlusion is total. The occlusion lasts long enough for our fuzzy tracker to update the model template with information from the occluding object. The error in pixels over time can be found in Fig. 13.
Fig. 14: Evolution of the tracked template for the Walk2 sequence over time.
6.3 Case III: the Walk2 sequence
The CAVIAR sequences provide a challenging data set. We ran some tests using this benchmark and obtained promising results. The sequence titled Walk2 consists of images taken from a surveillance camera in a commercial venue. These images contain moving humans. The camera itself had a very wide angle lens attached, adding radial distortion that increased the difficulty. Perspective correction of the image is possible, since the sequence included information to compute the homography using the floor as a plane. We refrained from applying any perspective correction to the images, since our purpose was to evaluate the performance of our tracker under non-rigid transformations.
Some frames from the sequence are illustrated by
Fig. 14, with the evolution of the model template over
time.
Our tracker was able to follow the target, a human walking over a concrete floor, until it disappeared from the field of view. Error from our tracker compared to the ground truth is low and more constant than in other cases, as Fig. 15 shows.

As in other implementations of template-based target tracking systems, ours lacks robustness when both background and foreground change simultaneously. However, the use of the fuzzy inference system reduces feature drift and delays the loss of the target. Prolonged occlusion of the target results in its loss. Nevertheless, our tracking system can handle partial occlusions during short time intervals without problems.
6.4 Error comparison
As a useful metric, we present the tracking errors for the sequences presented above. Table 3 contains the mean Euclidean distances from the center of the detected regions to the feature position as it appears in the ground truth. These averages were taken from the
[Plot: Euclidean distance to the target (pixels) vs. frame number, for the Fuzzy Tracker and the SSD Tracker.]
Fig. 15: Errors for the CAVIAR Walk2 sequence.
Sequence        Medium   Street   Walk2
Fuzzy Tracker   7.394    5.617    10.777
SSD Tracker     15.019   18.218   18.726

Table 3: Average error, by sequence, for the fuzzy tracker and the SSD-only tracker. This represents the average Euclidean distance to the tracked feature.
whole sequences, except for the Walk2 case, since the SSD tracker loses the target early. In that particular case, the average error only takes into account the portion of the sequence where both trackers successfully follow the object or feature.

The error for our approach is smaller than that of a pure SSD-based tracker. The SSD-only approach is more sensitive to appearance changes of the target, making it more vulnerable to noise and object rotations and translations.
7 Conclusion
We propose a novel approach to template-based object
tracking in images, using fuzzy logic to make decisions
regarding model updates. From different tests, we could observe that the tracker discussed is able to cope with some degree of 3D transformations and also with some nonrigid transformations. The only requirement for our tracker to work is a saliency assumption. This means that the tracked regions must be selected using an interest point detector.
The limitations of our approach include severe occlusion and degradation over time of the saliency of the tracked region. Another problem that requires further analysis is the case of simultaneously dynamic foreground and background. This causes the tracker to drift over time.
An important result from this work is that the hardware requirements for a real-time implementation of our tracker are fairly low. The search window does not need to be large, assuming that no large translations would take place. We consider that the work discussed in this paper shows that fuzzy reasoning is useful in dealing with problems that seem difficult for most object trackers. We also concede that there are cases in which our method could fail, and we have tried to point them out.
Acknowledgements J. Pablo Ramirez-Paredes would like to thank the Mexican National Council for Science and Technology (CONACyT) for the scholarship grant number 253761/213845.
References
1. F. Chaumette. Image moments: a general and useful set of features for visual servoing. IEEE Transactions on Robotics, 20(4):713-723, Aug. 2004.
2. J. Bigot, S. Gadat, and J-M. Loubes. Statistical M-estimation and consistency in large deformable models for image warping. Journal of Mathematical Imaging and Vision, 34(3):270-290, 2009.
3. F. Jurie and M. Dhome. Real time 3D template matching. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages 791-796, 2001.
4. G. D. Hager and P. N. Belhumeur. Efficient region tracking with parametric models of geometry and illumination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(10):1025-1039, Oct. 1998.
5. G. D. Hager, M. Dewan, and C. V. Stewart. Multiple kernel tracking with SSD. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 790-797, 2004.
6. R. T. Collins, Y. Liu, and M. Leordeanu. Online selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1631-1643, 2005.
7. H. Zhou, Y. Yuan, and C. Shi. Object tracking using SIFT features and mean shift. Computer Vision and Image Understanding, 113(3):345-352, 2009.
8. T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681-685, June 2001.
9. A. D. Jepson, D. J. Fleet, and T. F. El-Maraghi. Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1296-1311, Oct. 2003.
10. R. Gross, I. Matthews, and S. Baker. Active appearance models with occlusion. Image and Vision Computing, 24(6):593-604, 2006.
11. S. H. Lee, R. J. Howlett, and S. D. Walters. Small engine control by fuzzy logic. Journal of Intelligent and Fuzzy Systems, 15:207-217, 2004.
12. D. Anderson, J. M. Keller, M. Skubic, X. Chen, and Z. He. Recognizing falls from silhouettes. In IEEE 2006 International Conference of the Engineering in Medicine and Biology Society, pages 6388-6391, 2006.
13. X. Chen, Z. He, D. Anderson, J. M. Keller, and M. Skubic. Adaptive silhouette extraction and human tracking in complex and dynamic environments. In IEEE International Conference on Image Processing, pages 561-564, Oct. 2006.
14. D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150-1157, 1999.
15. T. Kadir and M. Brady. Saliency, scale and image description. International Journal of Computer Vision, 45(2):83-105, 2004.
16. C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of The Fourth Alvey Vision Conference, pages 147-151, 1988.
17. K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1):63-86, 2004.
18. K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615-1630, Oct. 2005.
19. H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346-359, 2008.
20. B. Zitova and J. Flusser. Image registration methods: a survey. Image and Vision Computing, 21(11):977-1000, Oct. 2003.
21. A. Yilmaz, O. Javed, and M. Shah. Object tracking: a survey. ACM Computing Surveys, 38(4), 2006.
22. R. J. Hathaway, J. C. Bezdek, and Y. Hu. Generalized fuzzy c-means clustering strategies using Lp norm distances. IEEE Transactions on Fuzzy Systems, 8(5):576-582, Oct. 2000.
23. J. P. Lewis. Fast template matching. In Vision Interface 95, pages 120-123, 1995.
24. P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 57(2):137-154, 2002.
25. European Commission. CAVIAR Project. IST 2001 37540.
26. B. McCane, K. Novins, D. Crannitch, and B. Galvin. On benchmarking optical flow. Computer Vision and Image Understanding, 84:126-143, 2001.