Logotype detection to support semantic-based video annotation

Signal Processing: Image Communication 22 (2007) 669–679
www.elsevier.com/locate/image

J.R. Cózar a,*, N. Guil a, J.M. González-Linares a, E.L. Zapata a, E. Izquierdo b

a Department of Computer Architecture, University of Málaga, Spain
b Multimedia and Vision Research Group, Electronic Engineering Department, Queen Mary, University of London, UK

Received 29 May 2007; accepted 30 May 2007

0923-5965/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.image.2007.05.006
* Corresponding author. Tel.: +34 95 2132822; fax: +34 95 2132790.
E-mail addresses: [email protected] (J.R. Cózar), [email protected] (N. Guil), [email protected] (J.M. González-Linares), [email protected] (E.L. Zapata), [email protected] (E. Izquierdo).

Abstract

In conventional video production, logotypes are used to convey information about the content originator or the actual video content. Logotypes contain information that is critical to infer genre, class and other important semantic features of video. This paper presents a framework to support semantic-based video classification and annotation. The backbone of the proposed framework is a technique for logotype extraction and recognition. The method consists of two main processing stages. The first stage performs temporal and spatial segmentation by calculating the minimal luminance variance region (MLVR) for a set of frames. Non-linear diffusion filters (NLDF) are used at this stage to reduce noise in the shape of the logotype. In the second stage, logotype classification and recognition are achieved. The earth mover's distance (EMD) is used as a metric to decide whether the detected MLVR belongs to one of the following logotype categories: learned or candidate. Learned logos are semantically annotated shapes available in the database. The semantic characterization of such logos is obtained through an iterative learning process. Candidate logos are non-annotated shapes extracted during the first processing stage. They are assigned to clusters grouping different instances of logos of similar shape. Using these clusters, false logotypes are removed and different instances of the same logo are averaged to obtain a unique prototype representing the underlying noisy cluster. Experiments involving several hours of MPEG video and around 1000 candidate logotypes have been carried out to show the robustness of both the detection and classification processes.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Semantic video classification; Minimal luminance variance; Non-linear diffusion filter; Earth mover's distance; Logotype classification

1. Introduction

Logotypes convey information that can be crucial to infer semantics implicit in broadcasted videos. It is a common practice for broadcasters and specific


TV programs, e.g., news, talk shows, advertisements, etc., to superimpose a specific logotype on the broadcasted material. Usually such logotypes refer to the actual video content or the content creator and, as such, they can be used to support automatic semantic-based video annotation. Logotypes extracted from broadcasted videos can be annotated in a database by indicating their shape and the shots or scenes where they appear. Then, video retrieval tools can be used to search for a particular TV program or to group different pieces of video


material with related contents. As a consequence, accurate logotype detection can be efficiently exploited for semantic-based video classification, video retrieval, aggregation and summarization.

In this paper, a framework to support semantic-based video classification and annotation is described. The backbone of the proposed framework is a technique for logotype extraction and recognition. The aim is to accurately extract and recognize opaque (or semitransparent) static logos embedded in a video sequence and to use this information in a video cataloguing system to annotate sequences containing the same logo. In practical applications, new logotypes are continuously created as new channels and TV programs are broadcasted. Thus, an efficient and robust tool for logotype detection should be able to continuously update the metadata base in an iterative process in which new logos are learned as they are detected and classified.

Many works related to logo detection in document analysis have been reported in the literature [4]. However, few approaches consider logo detection in conventional video sequences. Logo detection techniques have been used to differentiate advertisements from TV programs in [1]. This approach assumes that a logo exists if an area with stable contours can be found in the image. The authors claim that their approach does not require any supervised training and can easily be used for any type of logo without human interaction. In [5], a neural network is trained using two sets of logo and non-logo examples to detect a transparent logo. It obtains a good detection rate at the expense of a rather large training set. In [7], color outliers are used to detect pixels different from the background. No temporal information is used, so many false detections can arise. The work presented in [13] shows an application for logo removal. The logotype is detected by exploiting frame differences in the video sequence. This procedure fails in video with low motion activity. In this case, however, the authors propose to use a logo database and to search for logos using a Bayesian approach. The detection accuracy is improved by assuming that the probability of logos appearing in the four corners of the video frames is higher than in the center. This prior knowledge is combined with a neural network-based classifier.

Other works argue correctly that logos can provide a helpful visual cue for finding related news stories. Usually, a logo is defined as the small graphic or picture that appears behind the anchor

person on the screen. In [8], it is assumed that each broadcasted TV channel contains some representative semantic objects, including the channel logotype, that are displayed only during news programs. Channel logotype detection and tracking is performed to automatically classify news events in conventional broadcasting material. The approach relies on the use of logotype models stored in a database. Information about logo position and scale helps to identify the channel and the type of news. In [6], the detection of logos is used to mark news stories, as an alternative approach to tracking them. Here, from each logo, three sets of 2D Haar coefficients are computed (one for each of the RGB channels). The logo's feature vector is formed by selecting the coefficients representing the overall averages and the low-frequency coefficients of the three color channels. However, in all these works, logos must be known a priori.

There are related works that try to identify known brand logotypes in video data. In [9], certain variability in the logotype appearance must be allowed. In practice, due to the high computational cost of this method, only a few logotypes can be identified simultaneously. Similarly, in [3], a search for specific instances of brand logos is performed. Logo detection is achieved by exploiting homogeneously colored regions surrounding large intensity frame differences.

Contrasting these and other approaches from the literature, this paper presents a multistage method for video cataloguing based on logo detection and recognition. The proposed approach consists of two processing stages. The first stage detects the minimal luminance variance region (MLVR) in every frame by applying temporal segmentation to the luminance variance image (LVI). Since noise affects the correct extraction of potential logos, spatial segmentation is performed to filter noise and discard false positives. Detected MLVRs are then compared to an available database of previously annotated or learned logotypes. If a match is found, the video annotation process is triggered. Otherwise, the MLVR is labeled as a candidate logotype. Candidate

logotypes are defined as potential logos that need to be validated. The validation is performed during the classification process. Here, similar candidate logotypes are grouped together. Specific logotypes are then promoted to learned logotypes, if specific conditions are satisfied. Otherwise, the corresponding shapes are declared false positives and discarded. Fig. 1 outlines the proposed system and


Fig. 1. Block diagram of our logotype detection system.

Fig. 2. A video containing different logo sequences.


the most important modules will be elaborated in the following sections.

The remainder of the paper is organized as follows. The next section introduces the definition of the MLVR. Section 3 is devoted to the description of the proposed temporal segmentation technique. Different types of artefacts originating in the temporal segmentation step are analyzed in Section 4. Some of these artefacts are corrected by applying a spatial segmentation process, which is detailed in Section 5. Section 6 describes the classification process and how new logos are recognized and added to the database. Section 7 reports selected results from the conducted experimental evaluation. The paper closes with conclusions in Section 8.

2. Minimal luminance variance region

A video is a set of frames ordered through time. These frames can include one or several superimposed logos. In this work, we call a sequence a set of consecutive frames containing the same logo layout. Fig. 2 shows an example of a video consisting of three sequences: logo 'smiley' (f_i frames), no logo (f_j frames) and logo 'heart' (f_k frames).

Let F = (f_1, f_2, ..., f_n) be a set of temporally ordered frames represented by the luminance value of their pixels. Let m_i(F) be the minimum luminance value taken by pixel i from frames f_1 to f_n. Similarly, M_i(F) represents the maximum luminance value achieved by pixel i in the same range of frames. Then, each pixel value of the luminance variance image, LVI(F), is calculated according to the following expression:

LVI_i(F) = M_i(F) − m_i(F),  1 ≤ i ≤ p,  (1)

where p is the number of pixels in a frame. Thus, the LVI(F) is an image with the same size as the video frames in which each pixel is represented by the difference between the maximum and minimum luminance values at the corresponding pixel location along the sequence of frames.

The MLVR(F) can be extracted from the LVI(F) by applying a threshold, t_s. Therefore, each pixel of the MLVR(F) is set according to the following expression:

MLVR_i(F) = 1 if LVI_i(F) < t_s, 0 elsewhere,  1 ≤ i ≤ p.  (2)

From the previous expression, it is clear that two factors will affect the MLVR(F) calculation and, consequently, the accuracy of the logotype segmentation: the set of frames used for variance calculation, F, and the value of the threshold t_s. We propose a temporal segmentation technique to calculate LVI(F) which tries to select the best set of frames, F, for the MLVR(F) calculation. Then, a spatial segmentation is applied to extract the MLVR(F). Results are improved by using a non-linear diffusion filter (NLDF) to reduce noise.
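
As a concrete sketch of Eqs. (1) and (2), the LVI and MLVR can be computed in a few lines of NumPy. The toy frame stack, sizes and threshold below are illustrative values, not the paper's settings:

```python
import numpy as np

def compute_lvi(frames):
    """Luminance variance image (Eq. 1): per-pixel maximum minus
    minimum luminance over the frame set F."""
    stack = np.asarray(frames, dtype=np.uint8)   # shape (n, H, W)
    m = stack.min(axis=0).astype(np.int16)       # m_i(F)
    M = stack.max(axis=0).astype(np.int16)       # M_i(F)
    return (M - m).astype(np.uint8)              # LVI_i(F) = M_i - m_i

def compute_mlvr(lvi, t_s):
    """Minimal luminance variance region (Eq. 2): binary mask of
    pixels whose luminance variation stays below threshold t_s."""
    return (lvi < t_s).astype(np.uint8)

# Toy example: a static 'logo' block keeps luminance 200 while the
# background varies randomly from frame to frame.
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(40, 8, 8), dtype=np.uint8)
frames[:, 2:4, 2:4] = 200                        # static logo block
lvi = compute_lvi(frames)
mlvr = compute_mlvr(lvi, t_s=100)
```

After 40 random frames the background range saturates, so only the static block survives the threshold.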

3. Temporal segmentation

Image areas where a logo is overlaid show luminance variance values in a narrower interval than the rest of the image, depending on the logo transparency. Thus, the objective of the temporal segmentation is to find areas in the frame with low intensity variation through time. Both the first frame of the sequence and its duration are unknown, making the detection process more difficult.

Logotypes are usually placed at any of the four corners of the frame. Thus, these four image areas are considered regions of interest (ROIs). Moreover, their size is limited, since logos should not perturb


Fig. 3. Block diagram for the temporal segmentation stage.

Fig. 4. MPEG quantification noise reduction with a NLDF: original image (left side) and filtered image (right side).

Fig. 5. Evolution of the high (H), medium (M) and low (L) luminance variance areas of the LVI as the number of processed frames grows (5, 15, 25 and 40, respectively).


video viewing. Furthermore, logo areas do not significantly change from frame to frame. Thus, without losing generality, only I-frames of MPEG video are considered and analyzed in this paper.^1

These two strategies can be used to reduce computational costs in order to meet real-time constraints on a mid-range computer. Fig. 3 summarizes the temporal segmentation stage.

The NLDF introduced in [11] has two interesting properties that are used here for video frame enhancement before the LVI extraction: (1) the diffusion coefficient can be chosen to vary spatially in such a way as to encourage intra-region smoothing in preference to inter-region smoothing, reducing MPEG codification noise; (2) region boundaries remain sharp, so logo shape information is not distorted. Fig. 4 depicts a frame detail before (left) and after (right) the application of this filter.
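
A minimal sketch of a Perona-Malik-style non-linear diffusion with these two properties is given below, assuming the classical exponential edge-stopping function; the parameter values (kappa, lam, n_iter) are illustrative, not the paper's settings:

```python
import numpy as np

def perona_malik(img, n_iter=15, kappa=30.0, lam=0.2):
    """Non-linear diffusion sketch: the coefficient g(|grad I|) is
    large inside smooth regions and small across strong edges, so
    codec noise is smoothed while boundaries stay sharp."""
    u = img.astype(np.float64)
    for _ in range(n_iter):
        # Finite differences to the four neighbours (wrap-around at
        # the border; the edge-stopping function suppresses flux there).
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        g = lambda d: np.exp(-(d / kappa) ** 2)  # edge-stopping function
        u = u + lam * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u
```

Within a noisy but homogeneous region the gradients are small, g is close to 1 and diffusion proceeds; across a strong luminance edge g collapses to 0 and the boundary is preserved.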

The LVI is created by recording the maximum and minimum intensity value of each pixel for each newly processed frame. Fig. 5 shows an example of the evolution of the LVI computation for a real sequence through time. Frame regions with low luminance variance (corresponding to logotypes) are shown in black, and regions with high luminance variance (corresponding to the background) have light colors. Initially, the whole ROI is considered a null-variation area (black). The per-pixel maximum and minimum values are updated as new sequence frames are added, and the pixel luminance variation areas are modified accordingly. After the last image, the addition of new frames does not change the LVI, and the logo shape can be segmented from this last image. On the other hand, black regions in letter-box programs could lead to errors by being detected as low luminance variance regions. Nevertheless, these errors are avoided by detecting these black pixels and labeling them as high-variance pixels in the LVI.

The main problem in the LVI computation is the choice of n, the number of frames processed before the LVI becomes stable and can be correctly segmented. It is not possible to establish a fixed value for the number of frames n, as it depends strongly on the characteristics of the video; e.g., a dynamic video needs fewer frames to generate a stable LVI than a static one. It seems plausible that a large

^1 Using a standard codec, the I-frame rate is around two frames per second. This value is increased by the codec when the level of change from one frame to the next is high enough.

value for n can guarantee better detection for logos that stay static for a long time period. However, an erroneous LVI is generated if a logo transition occurs within the n frames, as will be discussed in the next section. Thus, in order to reduce the negative effects of the unknown sequence duration and transition locations in a video, the number of frames used for the LVI computation must be minimized. By using the minimum number of frames, several LVIs will be computed for the same sequence, as shown in Fig. 6. This redundancy will be used in the logo classification stage as a condition to validate new logos. In fact, only LVI instances present at the middle of the sequence are considered, and LVIs obtained during transitions are detected and discarded in later stages.

Fig. 7 summarizes the algorithm proposed to compute the value of n. An important assumption of our system is that video frames are processed on-line, so properties of future frames cannot be used in the estimation of n. Our algorithm needs two values to delimit the logotypes that the system detects: the minimum, A_Min, and the maximum, A_Max, logo


Fig. 7. Flow chart for n selection.

Fig. 6. A video containing two different sequences and the LVIs extracted from them.


area. During the LVI computation, its pixels are roughly segmented into three populations, as in Fig. 5: H (high variability, more than 150 gray levels), M (medium variability, between 100 and 150 gray levels) and L (low variability, less than 100 gray levels). While the area of low-variability pixels (A_L, the black zone in Fig. 5) exceeds A_Max, nothing can be assumed, as not enough changes have yet occurred through the frames. On the other hand, if this number is below A_Min, logotype presence can be discarded. Finally, we can consider that a valid LVI for logo shape segmentation is detected when the H, M and L pixel populations do not change with the addition of new frames and A_L remains between these minimum and maximum limits. Notice that the threshold used for the L population allows the correct

segmentation of logotypes with a moderate degree of transparency.
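
The n-selection rule of Fig. 7 can be sketched as an on-line loop, under the assumption that "the populations do not change" means the H/M/L counts stay identical for a few consecutive frames; the helper name select_lvi and the stability window are our own choices, not the paper's:

```python
import numpy as np

def select_lvi(frames, a_min, a_max, stable_for=5):
    """Process frames on-line; return (lvi, n) once the LVI is valid,
    or (None, k) when logo presence can be discarded (A_L < A_Min)."""
    lo = hi = None
    prev, stable = None, 0
    for k, f in enumerate(frames, start=1):
        f = f.astype(np.int16)
        lo = f if lo is None else np.minimum(lo, f)
        hi = f if hi is None else np.maximum(hi, f)
        lvi = hi - lo
        pops = (int((lvi < 100).sum()),                    # L population
                int(((lvi >= 100) & (lvi <= 150)).sum()),  # M population
                int((lvi > 150).sum()))                    # H population
        a_l = pops[0]
        if a_l < a_min:
            return None, k            # logotype presence discarded
        if a_l > a_max:
            prev, stable = pops, 0    # not enough change yet
            continue
        stable = stable + 1 if pops == prev else 0
        prev = pops
        if stable >= stable_for:
            return lvi, k             # valid LVI after n = k frames
    return None, len(frames)
```
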

4. Characterization of temporal segmentation errors

Two classes of problems can arise during the calculation of the LVI, leading to bad segmentation results. One appears in LVIs extracted from a single sequence, and the other occurs when the LVI is computed during a transition between two different sequences.

4.1. Sequence analysis

A sequence can have two possible states during its duration: presence or absence of logotype(s). For each LVI extracted from a sequence, we consider a logotype present if there are two well-defined regions: one with high variability corresponding to the background, and another with low variability corresponding to the logotype (as shown in the previous section). On the other hand, we decide that a logotype is absent if the variability is high enough in almost every pixel through time.

Two kinds of errors can arise when there is not enough variability in the content of the n frames used for the LVI computation. These errors are caused because high-variation areas do not fully emerge and cannot be distinguished from the shapes of the logos. If a real logo is present in the video sequence, the detection is incomplete and we can consider that a Type I error (added pixels) occurs. If no logotype is present, false positives can be detected.

4.2. Transition analysis

In a video broadcast, sequences of logotype presence and absence of unknown duration alternate. The LVI extracted during these transitions can introduce errors due to:

Logotype → non-logotype transition: The current LVI detection process should be stopped before the logo disappears in the next sequence, otherwise a false negative error will take place.

Non-logotype → logotype transition: It could be a problem only if there are high-variability pixels in the frame region prior to the true logotype appearance. This error is called a Type II error (missed pixels).

Logotype 'A' → logotype 'B' transition: The problem here is similar to the previous one. The shape of the detected logo will be obtained as the intersection of both logo 'A' and logo 'B' shapes (Type II error).

Fig. 8 (view in columns) summarizes the types of errors that can appear during logotype detection: (a) false positive; (b) Type I error (added pixels), due to insufficient activity in the available frames; (c) Type II error (missed pixels), caused by a non-logotype to logotype or a logotype A to logotype B transition; and (d) noise, originating from high video compression rates. For each column, the first row represents the LVI obtained during the temporal segmentation process and the second and third rows

Fig. 8. Examples of errors in logotype detection: (a) false positive, (b) Type I error (inside circle), (c) Type II error (inside circle), (d) codification noise.

Fig. 9. Histograms of the LVIs for a set of synthetic test videos (rows: opaque and transparent logotypes; columns: no, low and high compression).

show the detected (after binarization) and expected logo, respectively. The following spatial segmentation and logo database classification stages will try to correct these errors.

5. Spatial segmentation

In this stage a spatial segmentation of the LVI is conducted to obtain the MLVR, which represents a mask with the logo shape.

Even supposing that enough frames can be used to calculate the LVI, the opacity degree of the logotype and the noise generated by the video codec can complicate the calculation of a correct value for t_s. This is illustrated in Fig. 9, where the histograms of the LVIs extracted from a set of synthetic test videos are shown. Typically, the luminance values taken by pixels placed in similar positions along consecutive frames are correlated. However, if enough frames are used to calculate the LVI, including several shots and camera motion effects, we can consider the luminance values to have a random behavior over these longer time intervals. Thus, the synthetic videos are built by assigning a random value to the luminance of every pixel. In addition, two types of logotypes are overlaid: one fully opaque (top row) and another with a transparency level of 25% (bottom row). Each column corresponds to a different compression level in MPEG: none, low (4 Mbps) and high (100 kbps). The leftmost column (no compression) corresponds to



Fig. 11. Extracting the MLVR from (a) the unfiltered LVI and (b) the LVI after applying a NLDF (see Fig. 10).


a theoretical model. It can be observed that, in this case and with a semitransparent logotype, a peak occurs at the lower values of the histogram due to the logotype transparency. This peak is not an isolated bar at 0 (as with the opaque logotype), but a smaller version of the peak that corresponds to the background (the higher values of the histogram). On the other hand, as the compression level grows, the peaks widen and they can overlap if there are too many compression losses (histograms on the right). In these circumstances, the use of a fixed threshold can generate severe errors.

Thus, prior to the choice of a threshold, the LVI image is enhanced by applying a NLDF for intra-region smoothing while keeping the edges. Fig. 10 shows two histograms of a LVI: before and after applying the NLDF. In the filtered LVI, a correct selection of the threshold is more achievable. Then, the MLVR is obtained by binarizing the LVI with a threshold computed using Otsu's method [10]. Fig. 11 compares the results of applying the thresholds to the LVI and the filtered LVI. The Type I

error was avoided in this example.

The binary image obtained is analyzed and connected components are found. Connected components with fewer than eight pixels are considered noise and removed from the MLVR. After the whole spatial segmentation process, the final area of the MLVR is checked, and the region is considered a false positive if its area is below the minimum searched (A_Min).
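
The two operations of this stage (Otsu binarization and small-component removal) might be sketched as follows; the function names are illustrative, and the flood-fill component search is one possible implementation, not necessarily the authors':

```python
import numpy as np

def otsu_threshold(lvi):
    """Otsu's method on the LVI histogram: pick the gray level that
    maximises the between-class variance."""
    hist = np.bincount(lvi.ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var, w0, sum0 = 0, -1.0, 0.0, 0.0
    for t in range(256):
        w0 += hist[t]
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        m0 = sum0 / w0                        # mean of the lower class
        m1 = (sum_all - sum0) / (total - w0)  # mean of the upper class
        var = w0 * (total - w0) * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def clean_mlvr(mlvr, min_cc=8):
    """Drop connected components smaller than min_cc pixels
    (4-connectivity flood fill)."""
    out = np.zeros_like(mlvr)
    seen = np.zeros(mlvr.shape, dtype=bool)
    h, w = mlvr.shape
    for sy in range(h):
        for sx in range(w):
            if mlvr[sy, sx] and not seen[sy, sx]:
                comp, stack = [], [(sy, sx)]
                seen[sy, sx] = True
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and mlvr[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if len(comp) >= min_cc:       # keep only sizeable components
                    for y, x in comp:
                        out[y, x] = 1
    return out
```
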

6. Logotype databases

The previous sections indicated how the segmentation is applied to a set of video frames, F, in order to generate the MLVR(F) of that set. Now, shape-based characteristics of the detected MLVR(F) are used to search for similar logos previously stored in a database called the learned logotypes database. As

Fig. 10. Histograms of a LVI before and after applying the NLDF.

a result of this searching process, two types of actions can be taken:

Logo annotation: If a similar logo is found in the learned logotypes database, the video is annotated and a logo tracking process is started to check the logo's presence until it disappears. Information about the time of appearance or disappearance, duration, etc. can be used to enrich the semantic information of cataloged videos.

Database updating: If no similar logo is located in the learned logotypes database, the MLVR(F) is assumed to be a new logotype. However, instead of inserting it immediately into the learned logotypes database, the MLVR(F) is labeled as a candidate logotype and inserted into a candidate logotypes database. In this database, the candidate logotypes are grouped according to the similarity among their shape-based characteristics to form logo clusters. If the number of candidate logotypes in a cluster is high enough, a prototype of this cluster is promoted to a learned logotype and inserted into the learned logotypes database. This process closes the database updating loop.

The two advantages of using candidate logos are:

Candidate logotype averaging: Different detections of the same logo can have different appearances due to errors during the segmentation processes. In order to alleviate this problem, a new logo is not added definitively to the database (as a learned logotype) until several similar instances of



Fig. 12. Cluster of candidate logotypes (first three) promoted to a learned logotype, with noise removed after combination (right).

Fig. 13. Example EMDs obtained using Eq. (3).


the same (candidate) logotype are detected. Thus, clusters are represented by the average binary shape of their members (the prototype). Once a cluster has enough similar logotypes, its average is labeled a learned logotype. Fig. 12 shows an example of this situation, where part of the cluster of logo '3' is represented. Note that the noise from each detection is removed after combining the logotypes.
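
The averaging step can be sketched in one line of NumPy, assuming the prototype is the per-pixel mean of the member masks re-binarized by majority vote (the 0.5 cutoff is our illustrative choice), so that noise pixels present in only one member vanish:

```python
import numpy as np

def cluster_prototype(shapes):
    """Prototype of a candidate cluster: average binary shape of its
    members, thresholded by majority vote."""
    mean = np.mean(np.asarray(shapes, dtype=np.float64), axis=0)
    return (mean >= 0.5).astype(np.uint8)
```
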

False positive logotype discarding: Taking into account how the segmentation process is performed, several instances of the same logo will be detected during that time. Thus, if the number of candidate logotypes in a specific cluster stays very low, they are assumed to be false positives and, consequently, are discarded.

6.1. Logo features and metric

MLVRs are binary masks, typically small in size. Most binary object descriptors (such as statistical-based or contour-based ones) do not provide enough accuracy for logo representation. A good descriptor is the shape itself, but it would produce a long feature vector. An important reduction of the feature vector size, without a great loss of accuracy, can be achieved if the x-axis and y-axis shape projections are used. One of the best metrics to compare the feature vectors of logos L_i and L_j is the earth mover's distance (EMD) [12], given by the following expression:

distance(L_i, L_j) = sqrt( EMD(p_x^{L_i}, p_x^{L_j})^2 + EMD(p_y^{L_i}, p_y^{L_j})^2 ),  (3)

where p_x^S and p_y^S are the x-axis and y-axis projections of shape S, respectively.

The EMD is based on a solution to the transportation problem, for which efficient algorithms are available. The shape projections are aligned with respect to their centroids to make the EMD invariant to logo displacements [2].

Fig. 13 shows some examples of the EMD-based distance between database logos. The number at the bottom-left of each picture is the EMD between a logo model shape (similar to the top of the ranking) and the represented logo shape. The first logos in the ranking belonged to the same cluster during the logo learning process.

6.2. Database updating process

As previously introduced, two databases are created from scratch and updated as new logos appear in the videos. Database building is implemented by a robust mechanism that considers two possible states of the logos: candidate and learned. Each state is saved in a different database of the corresponding name.

Learned logos are shapes which have been promoted to logotypes. Candidate logos are shapes which have passed the segmentation stage but have not been validated yet. In order to validate a candidate logo, several instances of the same logo must be detected. Thus, the validation process involves both the grouping of similar logos and the promotion of a prototype of the cluster to a learned logo.

As shown in Fig. 1, the detected MLVR(F) is used to search for a similar learned logotype in the learned logotypes database. This is accomplished by using expression (3) to compare the MLVR(F) with the set of learned logotypes, LL. Then, a hit is


Fig. 14. PDFs of the EMD-based distance between learnt logos and logos belonging to the same and different classes.

J.R. Cozar et al. / Signal Processing: Image Communication 22 (2007) 669–679 677

assumed if the following expression is fulfilled:

min_{L_i ∈ LL} distance(MLVR(F), L_i) < t_l,   (4)

and, as a consequence, the learned logotype L_i achieving the minimum starts to be tracked.

If no learned logotype satisfies the previous expression, then MLVR(F) is labeled as a candidate logotype and, if possible, is included in an existing candidate logotype cluster. This is accomplished by again applying expression (3), now to MLVR(F) and the set of cluster prototypes. Thus, MLVR(F) will be included in cluster i if the following expression is fulfilled:

distance(MLVR(F), L_i) < t_c,   L_i ∈ PL,   (5)

where PL is the set of candidate cluster prototypes. Finally, if the previous expression is not fulfilled, a new cluster is created with the current MLVR(F).

Temporal information can be used to speed up the search for the cluster to which a logo belongs because, when a video is analyzed, it is very probable that similar MLVRs from the same sequence appear in consecutive order.

7. Experimental results

The whole logo detection process is divided into two main phases for testing purposes: segmentation, and database searching and updating. The thresholds needed for database updating are also estimated in this section.

7.1. Segmentation

In order to test the validity of the segmentation subsystem, 31,983 I-frames, corresponding to 6 h of 45 broadcasted video sequences of PAL size from different TV channels, have been analyzed. In these videos, 485 LVIs were detected in the temporal segmentation stage. After spatial segmentation and filtering, 50 LVIs of false logos (low variability frames) were rejected. The remaining LVIs were used to obtain the MLVRs of 376 different instances of 42 valid logos and 59 false positives. During the database updating stage, all these false positives were discarded.

7.2. Database

A TV logos database has been created in order to test both the logo recognition accuracy and the learning robustness. These logos are the result of segmenting different versions of the video database, for a total of 15 h 30 min of processed video. From these sequences we have extracted 1058 candidate logotypes using the proposed algorithm. These candidates have been manually annotated into 42 different logo classes plus one non-logo class. The logo classes contain 696 logotypes and the remaining 362 candidates belong to a generic non-logo class. This non-logo class contains very noisy images and poorly detected logotypes.

The previously annotated logotypes have been divided into two sets. The first set is for training and contains 21 classes. For each class, 15 logotypes have been selected and a prototype has been built. The second set is for testing and contains the remaining logotypes.

7.2.1. t_l estimation

In order to find the best EMD-based distance threshold, t_l, to decide whether a logo belongs to a known class, the distances to the prototypes have been computed for all the logos in the training set. These distances were used to create the inter-class and intra-class EMD-based distance probability distribution functions (pdfs) (see Fig. 14). These pdfs are used to compute the EMD-based distance threshold t_l (see expression (4)) that minimizes the number of misclassified logos. This value is 2.9 and results in 1.7% of wrongly classified logos.
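The threshold search described above reduces to a one-dimensional sweep. The sketch below works on the raw distance samples rather than the fitted pdfs (an assumption for simplicity): an intra-class distance at or above t counts as a miss, an inter-class distance below t counts as a false hit, and the candidate thresholds are the observed distance values themselves.

```python
def best_threshold(intra, inter):
    """Pick the threshold t minimizing misclassifications:
    intra-class distances >= t (missed same-class logos) plus
    inter-class distances < t (wrongly matched logos).
    Returns (threshold, error_count)."""
    candidates = sorted(set(intra) | set(inter))
    best_t, best_err = None, None
    # also try a value above every sample (accept nothing as a hit)
    for t in candidates + [candidates[-1] + 1.0]:
        err = sum(d >= t for d in intra) + sum(d < t for d in inter)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

With well-separated pdfs the error can reach zero; in the paper's data the distributions overlap, which is why the optimal t_l = 2.9 still misclassifies 1.7% of the training logos.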

The validity of this threshold is checked by classifying all the logos in the test set. Logos are included in the candidate database if they are farther than t_l from the nearest prototype, as they are not yet known. The rate of wrongly classified logos in the test set is 4.9%. The main source of error is the similarity among some logo projections.


7.2.2. t_c estimation

Logos are assigned to a cluster if their distance to the cluster prototype is lower than a threshold, t_c. This cluster prototype is refined during the learning period using the mean shape of the logos belonging to it. An upper bound on the performance of this clustering algorithm can be obtained by using the final prototype (learned logo) shape as the cluster prototype. Depending on the maximum distance allowed between a cluster prototype and the MLVR to be assigned to that cluster, different precision and recall values are obtained. These values are defined for each logo cluster_i linked to class_i as follows:

precision_i(t_c) = Cluster_AND_Class / Total_Cluster,   (6)

recall_i(t_c) = Cluster_AND_Class / Total_Class,   (7)

where Cluster_AND_Class is the number of logos assigned to cluster_i and belonging to class_i, Total_Cluster is the total number of logos assigned to cluster_i, and Total_Class is the total number of logos belonging to class_i.

Fig. 15 depicts the mean value of precision versus recall for all the logo clusters in the training set: precision (recall) increases as recall (precision) decreases. For our logo learning approach, a high precision value is preferred over a high recall value, as it is better to reject a noisy logo than to accept a wrong one. Thus, a good choice for the precision is 90%, which corresponds to a recall of 81%. These results are reached with a threshold value t_c of 2.3. Applying this threshold to cluster the logo test set results in a precision of 85% and a recall of 81%.

False positives not discarded by the segmentationprocess must be removed from the logo database.

Fig. 15. Precision vs recall for the logo training set.

A logo will be considered noise if, after a predefined period of time since it was inserted into the database, no more logos have been added to the cluster it belongs to. The validity of this criterion was checked in our logo test database. After more than 15 h of processed video, clusters of 'noise' logos containing several instances were studied. Erroneous logo promotion can occur only in two infrequent situations: (1) with very noisy versions of the same logo (different enough from their respective prototypes); and (2) in the unlikely case that noise spots are very similar. User feedback could be used to purge erroneous learned logos from the database.
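The staleness criterion above amounts to a simple filter over the candidate clusters. In this sketch the `last_update` timestamp field and the `max_idle` period are hypothetical names, since the paper does not specify the bookkeeping:

```python
def purge_stale_clusters(clusters, now, max_idle):
    """Drop candidate clusters that received no new instance for longer
    than max_idle, implementing the paper's noise-removal criterion.
    Each cluster is assumed to carry a 'last_update' timestamp."""
    return [c for c in clusters if now - c['last_update'] <= max_idle]
```

Running this periodically keeps one-off noise shapes from ever accumulating enough instances to be promoted.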

8. Conclusions

A framework for on-line video cataloguing based on logo detection has been introduced. Logos are assumed to be image regions with low luminance variance along time. Thus, our method uses a multistage process to apply temporal and spatial segmentation in order to compute a minimal luminance variance region (MLVR). The temporal segmentation module processes temporally ordered frames to obtain the low luminance variance regions. The spatial segmentation module selects a threshold to generate the binary logo mask. In addition, it also carries out a filtering process which allows detection and removal of false positives.

Once the MLVR has been generated, it is compared with previously stored logos (learned logos) for annotation purposes. If no match is found, the MLVR is labeled as a candidate logo and is either associated to an existing cluster of candidate logotypes or used to define a new cluster. When the number of elements in a candidate logotype cluster becomes reasonably large, a prototype of this cluster is promoted to learned logo. If the size of a given cluster remains small over a long time, the whole cluster is declared a false positive, removed, and not further considered in the recognition process.

Promising results have been achieved by applying the presented framework to a set of broadcasted videos. In the future, we plan to test other features and classifiers to detect animated and more transparent logos. In this sense, future work will study the use of color information and other classification and learning techniques such as neural networks or support vector machines.


Acknowledgments

This work was supported in part by the Ministry of Education and Science (CICYT) of Spain under contract TIC2003-06623 and Tedial contract 8.06/29.1821.

References

[1] A. Albiol, M.J. Fulla, A. Albiol, L. Torres, Detection of TV commercials, in: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 3, 2004, pp. 541–544.
[2] S. Cohen, L. Guibas, The earth mover's distance under transformation sets, in: Proceedings of the IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1076–1083.
[3] R.J. Den Hollander, A. Hanjalic, Logo recognition in video stills by string matching, in: IEEE International Conference on Image Processing, vol. 3, 2003, pp. 517–520.
[4] D. Doermann, The indexing and retrieval of document images: a survey, Comput. Vision Image Understanding 70 (3) (1998) 287–298.
[5] S. Duffner, C. García, A neural scheme for robust detection of transparent logos in TV programs, in: Lecture Notes in Computer Science, vol. 4132, Springer, Berlin, 2006, pp. 14–23.
[6] P. Duygulu, J. Pan, D.A. Forsyth, Towards auto-documentary: tracking the evolution of news stories, in: Proceedings of the 12th ACM International Conference on Multimedia, 2004, pp. 820–827.
[7] A. Ekin, R. Braspenning, Spatial detection of TV channel logos as outliers from the content, in: Proceedings of SPIE—The International Society for Optical Engineering, vol. 6077, 2006.
[8] B. Gunsel, A. Ferman, A.M. Tekalp, Temporal video segmentation using unsupervised clustering and semantic object tracking, J. Electron. Imaging 7 (3) (1998) 592–604.
[9] D. Hall, F. Pelisson, O. Riff, J. Crowley, Brand identification using Gaussian derivative histograms, Mach. Vision Appl. 16 (1) (2004) 41–46.
[10] N. Otsu, A threshold selection method from grey-level histograms, IEEE Trans. Systems Man Cybernet. 9 (1) (1979) 62–66.
[11] P. Perona, J. Malik, Scale-space and edge detection using anisotropic diffusion, IEEE Trans. Pattern Anal. Mach. Intell. 12 (7) (1990) 629–639.
[12] Y. Rubner, C. Tomasi, L. Guibas, The earth mover's distance as a metric for image retrieval, Internat. J. Comput. Vision 40 (2) (2000) 99–121.
[13] W.-Q. Yan, J. Wang, M.S. Kankanhalli, Automatic video logo detection and removal, Multimedia Syst. 10 (5) (2005) 379–391.