The Cut Detection Issue in the Animation Movie Domain


Bogdan Ionescu 1,2, Patrick Lambert 1, Didier Coquin 1, Vasile Buzuloiu 2

1 University of Savoie, LISTIC, BP.80439, 74944 Annecy-le-Vieux Cedex, France

Email: {Patrick.Lambert, Didier.Coquin}@univ-savoie.fr

2 University "Politehnica" Bucharest, LAPI, 061071, Bucharest, Romania

Email: {BIonescu,Buzuloiu}@alpha.imag.pub.ro

Abstract— In this paper we propose an improved cut detection algorithm adapted to the animation movie domain. A cut is the direct concatenation of two different shots and produces an important visual discontinuity in the video stream. As color is a major feature of animation movies (each movie has its own particular color palette), the proposed approach works by thresholding the distances between frame color histograms. To overcome the difficulties raised by the peculiarities of animation movies, several improvements have been adopted. The frames are divided into quadrants to reduce the influence of entering/exiting objects. The second-order derivative is used to reduce the influence of fast object/camera motion. For the frame classification, an automatic threshold estimation is proposed. Moreover, to reduce false detections, we propose an algorithm to detect a particular color effect, named "short color changes" (e.g. lightning, explosions, flashes). The proposed cut detection algorithm achieves better results than conventional histogram-based and motion-discontinuity-based approaches, as shown by tests conducted on several animation movies.

Index Terms— video segmentation, histogram-based cut detection, automatic threshold estimation, short color changes detection, animation movies, video indexing.

Nowadays the computer vision and image processing domains are mainly focused on the analysis and processing of digital multimedia content. Digital images, videos and music are an inseparable part of our day-to-day life. The recent evolution of digital imaging devices, storage devices, portable multimedia devices, wireless transmission protocols and LAN/WAN infrastructures has made access to large amounts of digital multimedia content easier. However, large collections of data require content annotation/indexation, the process of attaching content-based labels to the data to facilitate retrieval. In such large collections the availability of unindexed data is effectively very low, as for the end-user there is no trace of its existence.

In the case of video annotation, whether syntactic or semantic, video parsing is generally required as a first processing step. Detecting the video shot boundaries, that is recovering the elementary

This paper is based on "Improved Cut Detection for the Segmentation of Animation Movies" by B. Ionescu, V. Buzuloiu, P. Lambert, D. Coquin, which appeared in the Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006. © 2006 IEEE.

video units, provides the basis for nearly all existing video abstraction and high-level analysis techniques [18]. To assemble the movie, the shots are linked together using video transitions.

Basically, the existing video transitions are divided into two major categories: sharp transitions, known as cuts, which result from the direct concatenation of two different shots without any visual effect, and gradual transitions, such as fades, dissolves, mattes, etc., which imply the use of special optical effects (for a literature survey see [18] [8]). Of the existing transitions, cuts are the most frequently used: for example, 30 minutes of video contain approximately 300 cuts, while the number of gradual transitions is significantly lower. That is because cuts are used for the common shot-to-shot transition, while gradual transitions generally carry special meanings: dissolves are often used to change the time of the action [18], fades introduce a change of place [15], etc.

Thanks to the "International Animated Film Festival" [3], which has taken place at Annecy, France, every year since 1960, a very large database of animation movies is available. Managing thousands of videos is a tedious task, therefore an automatic indexation system is required. Such a system will have to be adapted to the peculiarities of animation movies, which differ from natural ones in many respects (see Figure 1):

• typically the events do not follow their natural course: objects or characters emerge or vanish without respecting any physical rule,

• sometimes the movements are not continuous, and the predominant motion is object motion [27],

• many particular color effects are used, for example the "short color changes" or SCC [11],

• artistic concepts are used: painting concepts, theatrical concepts,

• various animation techniques are used: 3D graphics, puppet/object animation, paper drawing, etc. (see Animaquid [3]),

• every animation movie has its own particular color palette, unlike conventional movies, which almost all have the same color distribution [14],

• understanding the movie content is sometimes difficult; some animation experts say that more than 30% of the animation movies from [3] apparently do not have any logical meaning.


Figure 1. The particularity of animation movies (pictures from different movies).

In this paper we address the problem of video segmentation by means of cut detection, within the framework of content-based indexation of animation movies [13]. We propose a histogram-based cut detection algorithm which has been specially adapted to handle the peculiarities of this domain.

The remainder of the article is organized as follows: Section I presents state-of-the-art cut detection techniques, while Section II describes the proposed cut detection approach. The experimental results are presented in Section III. Finally, Section IV draws conclusions and discusses future improvements.

I. PREVIOUS WORK

The basis of the cut detection process is the analysis of the similarity of the frames surrounding the cut, which should display significant changes in their visual content. Generally speaking, cut detection consists of several computation steps, as presented in [8]. First, feature extraction is performed, where features depict various aspects of the visual content of the video frames. Then, a metric distance is used to quantify the feature variation between neighboring frames. The discontinuity value is typically the magnitude of this variation and serves as input for the detector. Finally, cuts are detected by comparing these values against a threshold to identify whenever a significant visual discontinuity has occurred.

Existing cut detection algorithms differ in the features they use to measure the cut discontinuity. They can be divided into several categories: intensity/color-based, edge/contour-based, motion-based, and methods working in the compressed domain [18] [8] [24].

The simplest intensity/color-based method to measure the cut visual discontinuity is the thresholding of the pixel-wise intensity difference between frames, as in one of the first approaches, proposed in [22]. These methods, despite their good performance, proved to be very sensitive to the presence of noise and motion. To be more invariant, other cut detection methods use the difference between color/gray-level histograms, like the approach proposed in [33]. However, histograms are sensitive to intensity fluctuations. One solution to improve their invariance is the use of an appropriate color space and distance measure [18]. For example, [4] proposes as similarity measure the histogram intersection by means of color distances in the Cb-Cr/r-b color space. In the approach proposed in [20], the similarity measure is based on histogram intersection and on the difference between pixel-block mean colors. Performance studies proved that histogram-based methods are generally the most efficient compared to other existing approaches [6] [18]. For example, [6] proposes a performance study of several different approaches: histogram-based, MPEG compressed-domain and motion-based. The best detection ratio is achieved using a histogram-based method, namely the intersection of histograms computed in the Munsell color space, which also proved to have a medium complexity.

On the other hand, the edge/contour-based methods rely on the fact that the object edges in the images before and after the cut are not the same. The existing approaches propose different ways to measure the amount of edge/contour change in the image. In [35] the analysis of the number of entering/exiting edge pixels is performed by computing the Edge Change Ratio (ECR). The combination of color histograms computed in the YUV color space and the Edge Matching Rate (EMR) is used in [16]. The methods proposed in [17] combine color, edge and motion information. The edge/contour-based approaches are less efficient than the histogram-based methods and have a higher computational complexity. However, their major advantage is that they can also serve for the detection of gradual transitions, like fades or dissolves [18].

From a different point of view, a cut also produces a discontinuity in the motion flow, which is the observation used by the motion-based approaches. For example, in [19] the image similarity is measured using motion estimation and pixel displacements in the image. In [23] the cut detection is performed using motion compensation with pixel-block correlation as a cost function. Generally speaking, motion-based cut detection techniques are also less efficient than histogram-based methods [6]. Moreover, motion estimation requires high computational time [18]. Nevertheless, motion-based approaches are interesting in the case of movies with predominant motion, where the other methods tend to fail.

In the case of methods using the compressed domain, the detection is performed directly on the MPEG coefficients. For example, in [21] the visual discontinuities produced by cuts are converted into significant distance values between MPEG coefficients. Another approach is the one proposed in [31], where the cut detection is performed on the MPEG-2 stream by analyzing the number of macro-block predictions in type-B images (bidirectional compression). The recent evolution of video compression standards makes the segmentation task easier; the MPEG-7 video standard will provide the video shot information directly, thus making the video parsing task obsolete [32]. The main advantage of the methods working in the compressed domain is that decompressing the data (required by all the other methods) is no longer needed, which drastically reduces the computational time; they are the best candidates for real-time detection. However, the detection precision can be less accurate compared to the image-based methods, because the MPEG coefficients are sometimes inconsistent. One solution is a compromise between the two, decompressing the data only to a certain level of detail.


Figure 2. The proposed cut detection approach: I(ks) is an image at the time moment k·s, with s the analysis step and k the discrete time index; Dmean() is the obtained distance sequence and D̈mean() its second-order derivative; SCC stands for "short color changes". (Diagram: pre-processing by spatial subsampling and color reduction; per-quadrant histograms h1-h4 for each retained frame; mean distance; distance accumulation; smoothing; thresholding with SCC checking, yielding the cuts and SCCs.)

Apart from the cut detection approaches mentioned above, we can mention other methods which exploit sources of information different from the conventional ones. For example, [8] proposes a statistical method for the segmentation of generic movies (applicable to all kinds of video material) which uses the minimization of the mean detection-error probability. An interesting approach is proposed in [7]: the detection is performed using several morphological operators applied to a 2D representation of the movie's temporal evolution, referred to as the visual rhythm, in which each image of the movie is represented by its diagonal. In this way cuts are converted into vertical transitions and the detection problem becomes a segmentation task. Generally speaking, these other methods are not yet as established as the conventional approaches, as they have not benefited from enough experimental tests to be considered better performing. Typically the proposed evaluation tests have been performed and analyzed for particular applications, and the comparison with other approaches is either limited or missing.

II. THE PROPOSED CUT DETECTION

One major feature of animation movies is the color distribution: almost every animation movie has its own particular color palette (see Figure 1), which is the artist's signature. The proposed cut detection method exploits this

particularity by using a color histogram-based approach. It consists of several processing steps; the method diagram is depicted in Figure 2.

To overcome the difficulties raised by the peculiarities of animation movies, several improvements are adopted. First, the frames are divided into quadrants to reduce the influence of entering/exiting objects on the detection (one predominant motion in animation movies is object motion [27]). For each quadrant of each retained frame (denoted in Figure 2 with numbers from 1 to 4) we compute one color histogram. Then, for each retained frame, 4 Euclidean distances are computed between the 4 quadrant histograms and the corresponding ones of the next retained frame; the 4 values are reduced to one by taking their mean. For the entire movie, this leads to a distance sequence Dmean(). Further on, the second-order derivative is applied to the Dmean() sequence to reduce the influence of repetitive motion (i.e. object/camera) on the detection. For the detection, an automatic threshold is computed using statistical measures of the obtained distance sequence. Moreover, using a modified camera-flash detector, false detections are reduced by detecting the SCCs. An SCC ("short color change") is an animation-movie-specific color effect (i.e. lightning, explosions, flashes) which is usually wrongly detected as a cut due to the significant visual change it implies.

A. Video pre-processing

To reduce the computational complexity, several pre-processing steps are performed. First, the movie is temporally subsampled: we retain only the images whose discrete time index is a multiple of s, with s the analysis step. Temporally subsampling the frames does not affect the cut detection accuracy if the right value of s is chosen, and it reduces the processing time by a factor of at least s. That is because cuts generally occur in animation movies at time intervals of no more than 3 or 4 seconds. Also, the movie frame rate is 25 images/s, which provides more than enough information for the cut detection (even for the visualization task, as the visual continuity limit is at 15 images/s). After several experimental tests for different values of s, s ∈ {1, ..., 10}, we decided to use s = 2: only one image out of two is retained. Using higher step values, s > 2, typically transforms smooth global camera motions, or other similar global transitions, into important frame differences, and thus into false detections.

Secondly, the retained frames are spatially subsampled: only one pixel (typically the center one) is retained for each non-overlapping image block of n × n pixels, with n ∈ {2, 3, 4, ...}. The value of n is determined from the original image size so as to achieve a final resolution of around 100 × 100 pixels. As the proposed detection method uses a statistical approach (i.e. histograms), decreasing the image resolution down to this experimentally determined limit does not affect its accuracy.

To compute the color histograms, the number of colors has to be reduced, as the original movies come in true color (a 16-million-color palette). After studying the influence of several color reduction methods on the cut detection [12], we decided to use the Floyd-Steinberg error diffusion algorithm run in the XYZ color space [30]. The colors are chosen from a standard predefined palette, namely the non-dithering 216-color Webmaster palette, which delivers a good compromise between the total number of colors and the color richness [13]. The visual quality loss that typically occurs with fixed-palette color quantization is limited here because animation movies naturally use reduced color palettes. Moreover, the histogram comparison task is simplified because every frame is represented with the same color palette. Using an adaptive approach instead, representing each image with its own specific palette, would produce a high number of small undesired variations of the same elementary colors; computing the distance between different images would then be a difficult and inaccurate task, as different color palettes would have to be compared [25].
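For illustration, the pre-processing stage can be sketched in Python with NumPy as below. This is a simplified sketch, not the authors' implementation: frames are assumed to be uint8 RGB arrays, the helper name and defaults are ours, and the palette mapping uses plain nearest-color quantization in RGB instead of the paper's Floyd-Steinberg error diffusion in the XYZ color space.

```python
import numpy as np

def preprocess(frames, s=2, n=4):
    """Simplified pre-processing sketch: temporal subsampling (keep one
    frame out of s), spatial subsampling (one pixel per n x n block) and
    mapping onto the 216-color web-safe palette. Nearest-color mapping
    in RGB is used here instead of the Floyd-Steinberg error diffusion
    in XYZ described in the paper."""
    retained = frames[::s]                        # temporal subsampling
    out = []
    for frame in retained:                        # frame: (H, W, 3) uint8
        small = frame[::n, ::n, :].astype(np.float64)
        # quantize each channel to the 6 web-safe levels {0, 51, ..., 255}
        levels = np.clip(np.rint(small / 51.0), 0, 5).astype(np.int32)
        # combine per-channel levels into one palette index in [0, 215]
        out.append(levels[..., 0] * 36 + levels[..., 1] * 6 + levels[..., 2])
    return out
```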

B. Color histogram computation

Animation movies have some peculiarities that must be specifically addressed. One of the most important issues when dealing with color histograms is the movement of objects in the scene, as large moving objects may produce noticeable differences between the histograms of successive frames.

To study the influence of object size on the global color histogram we consider the following situations: an image containing an object the size of one quadrant, thus 1/4 the size of the image, I1/4; an image with an object 1/16th the size of the image, I1/16; and the image without any object, which serves as the reference image, Iref (see Figure 3; the image quadrants are depicted with white lines).

Figure 3. Several images containing different object sizes (movie "Francois le Vaillant") [5].

We now evaluate the image distances as for the cut detection. The Euclidean distance is computed between the color histograms obtained for the images I1/4 and I1/16, denoted H1/4() and H1/16() respectively, and the reference histogram Href() of the image Iref. The obtained distance values are presented in Table I.

image            Iref    I1/4    I1/16
dE(Hn, Href)     0       0.3     0.04

TABLE I. HISTOGRAM DISTANCE EVALUATION (n ∈ {1/4, 1/16}).

Only objects the size of an image quadrant or larger can change the global color histogram significantly, thus leading to false positives: here the obtained distance value is 0.3, which is greater than the detection threshold, usually around 0.2 (see Section II-D). Consequently, to localize partial image changes which should not be detected as cuts, we decided to divide the frames into only four regions (i.e. quadrants).

For the detection, four color histograms, one per quadrant, are computed for each retained frame Ic(k·s), where c denotes the subsampled and color-reduced version, k is the discrete time index and s is the analysis step (see Figure 2): Hj(Ic(k), i), where j = 1, ..., 4 is the quadrant index and i is the color index in the predefined Webmaster palette. Then, four Euclidean distances, denoted Dj(k), are computed between each quadrant histogram of the current frame Ic(k) and the corresponding one of the next retained frame Ic(k+1):

$$D_j(k) = \left( \sum_{i=1}^{N_{color}} \left[ H_j(I^c(k+1), i) - H_j(I^c(k), i) \right]^2 \right)^{1/2} \qquad (1)$$

where Ncolor = 216 is the total number of colors of the used palette.

Instead of using all four distance values, and thus four distance sequences for the entire movie, we compute a single average histogram distance sequence, Dmean(), as the arithmetic mean of the Dj(), j = 1, ..., 4:

$$D_{mean}(k) = \frac{1}{4} \sum_{j=1}^{4} D_j(k) \qquad (2)$$

with k = 0, ..., [Nmovie/s], where Nmovie is the total number of movie frames. The newly obtained distance sequence will be used as the basis of the cut detection procedure explained in the sequel.

The advantages of using a single frame-difference measure instead of four are multiple. First, it simplifies the cut/non-cut decision, as only one value has to be compared with the detection threshold. Secondly, it reduces the false positives due to the localization of the color change: if only one quadrant changes significantly between two consecutive frames, this will barely affect the mean distance value; only a major change in the image (i.e. 3 quadrants out of 4 drastically changing) will trigger the detection. Finally, thorough tests have shown that, provided an accurate threshold selection procedure is devised, the method based on the Dmean() measure leads to improved results (see Section II-D).
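As a minimal sketch of equations (1) and (2), continuing the NumPy example above (the per-quadrant histogram normalization is our assumption; the paper does not state how the histograms are normalized):

```python
import numpy as np

def quadrant_histograms(img, n_colors=216):
    """One color histogram per quadrant of a palette-index image,
    normalized by quadrant size (our assumption) so the distances
    do not depend on the image resolution."""
    h, w = img.shape
    quads = [img[:h // 2, :w // 2], img[:h // 2, w // 2:],
             img[h // 2:, :w // 2], img[h // 2:, w // 2:]]
    return [np.bincount(q.ravel(), minlength=n_colors) / q.size
            for q in quads]

def d_mean(hists_a, hists_b):
    """Equations (1)-(2): mean of the four quadrant-wise Euclidean
    distances between two consecutive retained frames."""
    return float(np.mean([np.linalg.norm(ha - hb)
                          for ha, hb in zip(hists_a, hists_b)]))
```

Applying d_mean() to the histograms of every pair of consecutive retained frames yields the Dmean() sequence.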

C. Second order derivative

Continuous camera/object motion is one source of cut detection false positives, as it introduces significant and repetitive visual changes in the movie flow. A cut is represented in the dissimilarity function by a "Low-High-Low" (L-H-L) distance value pattern (see Figure 4):

• low value: a strong color similarity between two neighboring frames (i.e. a low value of Dmean()), which is the case for the two images before the cut,

• high value: a strong color dissimilarity between the next two consecutive frames (i.e. a high value of Dmean()), which is the case for the cut frames,

• low value: a strong color similarity for the next two neighboring frames (i.e. a low value of Dmean()), which is the case for the first two frames after the cut.

This pattern suggests using p-order derivatives to improve the detection: applied to the Dmean() sequence, they preserve the isolated important distance values (i.e. cuts) while reducing the repetitive high values (i.e. motion).

Figure 4. Cut emphasis using second-order derivatives: the x-axis corresponds to time; negative values are set to 0 (marked with the red ×).

A simple temporal gradient computed on the Dmean() sequence does not take more than a pair of neighbors into account, thus higher-order derivatives should be used for accurate cut detection.

Tests performed using p-order derivatives have shown that the higher the p value, the lower the obtained cut detection rate, as the distance values decline while p increases. The second-order derivative led to the best compromise between detection rate and false-positive incidence. Moreover, the second-order derivative is the best match for our problem, which consists in detecting the L-H-L cut pattern (see Figure 4).

Figure 5. Cut emphasis using the second-order derivative, for a segment containing cuts and continuous camera motion.

The new second-order derivative distance sequence D̈mean() is given by:

$$\ddot{D}_{mean}(k+1) = D_{mean}(k+1) - 2 \cdot D_{mean}(k) + D_{mean}(k-1) \qquad (3)$$

where k is the discrete time index. Negative values are set to 0 as they contain redundant information. Cuts are then sought among the local maxima of the D̈mean() sequence.
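Equation (3), including the clamping of negative values, can be sketched as follows (assuming the Dmean() sequence is stored as a NumPy array):

```python
import numpy as np

def second_derivative(d_mean):
    """Equation (3): discrete second-order derivative of the distance
    sequence; negative values are clamped to 0 as in the paper."""
    d = np.asarray(d_mean, dtype=float)
    dd = np.zeros_like(d)
    # dd(k+1) = d(k+1) - 2*d(k) + d(k-1), for k = 1, ..., len(d)-2
    dd[2:] = d[2:] - 2.0 * d[1:-1] + d[:-2]
    return np.maximum(dd, 0.0)
```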

The advantage of using the second-order derivative is illustrated in Figure 5 for an animation movie segment containing continuous camera motion: the influence of the continuous camera motion is drastically reduced and the real cuts are much better emphasized.

D. Automatic threshold estimation

For the final frame classification, the D̈mean() discontinuity values are compared against a threshold τcut. The choice of the detection threshold is decisive for the success of the detection algorithm: too high a threshold value leads to many misdetections, while too low a value results in a great number of false detections.

The existing threshold estimation methods are divided into: heuristic approaches [2], where the threshold is set to a fixed, empirically determined value; statistical approaches [36], where the threshold is estimated using the statistical distribution of the discontinuity function; adaptive approaches [28], where the threshold is computed within a limited time window and adapted to the local discontinuity; mixed approaches [9], which take advantage of several different approaches by combining them; and optimal approaches [29], which use statistical detection theory to find the optimal threshold value. For a state of the art refer to [18] [8].

In our case an empirical approach is not appropriate, as each animation movie has its own particular color distribution (one characteristic of animation movies, see [13]); the threshold value should be adapted to the color characteristics of each movie. On the other hand, this consideration actually simplifies the threshold estimation task: a simple global threshold adapted to each movie should be efficient enough, rather than more complex approaches like locally adapted thresholds.

Figure 6. Threshold estimation: red line - D̈mean(), blue line - the proposed threshold τcut, green line - τG from equation (4) (r = 1).

Our approach uses as a starting point one of the most common techniques, the statistical modeling of the discontinuity function (i.e. D̈mean()) with a Gaussian distribution:

$$\tau_G = \mu_D + r \cdot \sigma_D \qquad (4)$$

where τG is the threshold value, µD and σD are the mean and the standard deviation of the discontinuity function, and r is an empirical parameter related to the false detection probability.

As cut frames are rare in a movie, taking τG as threshold would typically lead to a high false detection ratio, the threshold being set too low (see the green line in Figure 6).

To avoid this problem, our threshold estimation proceeds in two steps. First, we detect all the significant local maxima of the D̈mean() sequence: a value D̈mean(k) is considered a local maximum if the following conditions are satisfied:

$$\ddot{D}_{mean}(k) > \mu_D, \quad \ddot{D}_{mean}(k-1) < \mu_D, \quad \ddot{D}_{mean}(k+1) < \mu_D$$

This step ensures that only the representative discontinuity values are considered. The proposed threshold τcut is then set to the average value of all the retained local maxima (see Figure 6).

The cut detection is then performed by thresholding the D̈mean() sequence. More precisely, a cut is detected at the discrete time index k whenever the following conditions are satisfied:

$$\ddot{D}_{mean}(k) > \tau_{cut}, \quad \ddot{D}_{mean}(k+1) < \tau_{cut} \qquad (5)$$

The first condition ensures that an important visual discontinuity has occurred (i.e. a cut) while avoiding false cut situations (i.e. visual discontinuities generated by motion, etc.); the second condition ensures that the frame discontinuity is followed by frame visual similarity (i.e. the shot images occurring right after the cut). Experimental results show that this choice leads to a very good detection rate (see Section III).

E. False positive reduction

Animation movies contain many visual color effects, e.g. the SCCs or "short color changes" described in [13]. An SCC is a short (in time) dramatic color change and typically corresponds to explosions, lightning, flashes and similar color effects. Several examples are illustrated in Figure 7.

Figure 7. Several SCC examples from the movie "Francois le Vaillant" [3] (the x-axis is the temporal axis).

Generally, the SCCs do not produce a shot change, but due to the visual discontinuity they create they are mistakenly detected as cuts. Detecting the SCCs therefore improves the cut detection by removing part of the false positives: every detected cut is additionally checked for the presence of the SCC color effect (see Figure 2 in Section II).

Figure 8. The SCC detection approach (D is the Euclidean distance, I(k) is the movie frame at time index k, τcut is the same threshold as for the cut detection, and l is an integer).

The proposed SCC detection algorithm was inspired by flashlight detection in natural movies [10]: similarly to a camera flash, an SCC starts with an important change of the image color distribution and ends with almost the same image as the starting one (see Figure 8).

The Euclidean distances between frame global color histograms are computed successively, starting with the detected cut image, Ic(k), where c denotes the spatially subsampled and color-reduced version (see Section II-B). Instead of computing the distances, dE(), between consecutive frames, as for the cut detection, they are computed between the Ic(k) color histogram, Hk (the reference histogram), and the histograms Hk+l of the neighboring frames, Ic(k+l), where l is an integer ranging from 2 to Lmax, with Lmax the maximal SCC length (typically 10 frames, experimentally determined).

During the SCC the histogram distance should be important, thus greater than a certain threshold. As the SCC detection algorithm uses the same discontinuity measure as the cut detection, the threshold value is set to τcut (see Section II-D). Therefore, if for a given l value the distance between the two histograms, Hk and Hk+l, is lower than τcut, then an SCC is detected between the frames k and k + l and the previously detected cut is disregarded. The detection algorithm is as follows:

if (cut detected at the moment k) then
    l ← 2
    Hk ← GlobalHistogramComputation(Ic(k))
    Hk+l ← GlobalHistogramComputation(Ic(k+l))
    Dl ← dE(Hk, Hk+l)
    if (Dl > τcut) then
        repeat
            l ← l + 1
            Hk+l ← GlobalHistogramComputation(Ic(k+l))
            Dl ← dE(Hk, Hk+l)
        while [(Dl > τcut) and (l < Lmax)]
        if (Dl < τcut) then
            SCC detected between k and k+l
        endif
    endif
endif
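For illustration only, a direct Python transcription of this algorithm could look as follows. global_histogram() is a hypothetical whole-image counterpart of the quadrant histograms sketched earlier; following the paper, the same threshold τcut is reused for the raw histogram distances, which assumes both quantities live on comparable scales.

```python
import numpy as np

def global_histogram(img, n_colors=216):
    """Whole-image color histogram of a palette-index image (global
    counterpart of the quadrant histograms used for cut detection)."""
    return np.bincount(img.ravel(), minlength=n_colors) / img.size

def is_scc(frames, k, tau_cut, l_max=10):
    """SCC check for a cut detected at frame k: if the global histogram
    distance between frame k and frame k+l falls back below tau_cut for
    some l <= l_max, the 'cut' was in fact a short color change."""
    h_ref = global_histogram(frames[k])
    for l in range(2, l_max + 1):
        if k + l >= len(frames):
            break
        d_l = np.linalg.norm(h_ref - global_histogram(frames[k + l]))
        if d_l < tau_cut:
            return True   # SCC detected between frames k and k+l
    return False
```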

Several experimental tests are presented in Section III.

III. EXPERIMENTAL RESULTS

The performance of the proposed methods was evaluated using the conventional precision/recall ratios, defined as:

$$precision = \frac{GD}{GD + FD}, \quad recall = \frac{GD}{N_{total}} \qquad (6)$$

where GD is the number of good detections, FD is the number of false detections and Ntotal is the total number of real transitions (i.e. cuts or SCCs). Several tests were performed. To constitute the detection ground truth, all the SCCs and cuts were manually labeled using specially developed software.
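Equation (6) translates directly into code:

```python
def precision_recall(gd, fd, n_total):
    """Equation (6): gd = good detections, fd = false detections,
    n_total = total number of real transitions in the ground truth."""
    return gd / (gd + fd), gd / n_total
```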

A. SCC detection

First, the proposed "short color change" (SCC) detection algorithm was validated using 14 short animation movies from [3], with a total duration of 101 min 47 s and containing 120 SCCs. The obtained results are presented in Table II.

We obtained a good detection ratio, with an overall precision and recall of 93% and 88.3%, respectively. The false detections are typically related, first, to shot changes occurring between color-similar frames and, second, to global intensity fluctuations in the image. The misdetections are mainly due to dissimilarity between the SCC start and end images (some examples are presented in Figure 9).

B. Cut detection

The proposed cut detection algorithm was tested on two long animation movies¹ with a total duration of 158 minutes, containing 3166 cuts: movie 1, 85 minutes and 1597 cuts, and movie 2, 73 minutes and 1569 cuts.

¹ Due to copyright protection we do not have permission to mention their names.

movie   Ntotal   GD (good det.)   FD (false det.)
#1         1            1                0
#2         0            0                0
#3        39           38                2
#4         7            6                0
#5         0            0                0
#6         7            5                0
#7         0            0                1
#8         0            0                0
#9         4            4                0
#10        2            1                0
#11       45           40                0
#12        3            2                0
#13       12            9                5
#14        0            0                0

TABLE II. THE SCC DETECTION RESULTS.

Figure 9. Detection error examples: false detection (top, movie "The Hill Farm"), misdetection (bottom, movie "Francois le Vaillant") [5].


The results are compared with those achieved with two other methods. The first is the conventional histogram-based approach, referred to as the "histogram method", where the detection is performed directly by thresholding the frame color histogram distances (an adaptation of the method proposed in [34]); the second is a motion-based approach, referred to as the "motion method", which uses as discontinuity function the time evolution of the number of motion-discontinuity blocks obtained through block-based motion estimation (an adaptation of the method proposed in [26]). The detection results are presented in Table III.

Method          GD     FD    precision   recall
histogram      2806    199    93.37%     88.63%
motion         2993    355    89.39%     94.53%
proposed       2931    157    94.92%     92.6%
proposed+SCC   2931    127    95.97%     92.6%

TABLE III. THE CUT DETECTION RESULTS.

The histogram method leads to the lowest recall ratio, 88.63%, and thus the smallest number of good detections, while the motion method achieved the best recall ratio, 94.53%, but with a higher false detection ratio (precision 89.39%).

Using the proposed method we obtained both high precision and recall ratios, above 92% (see Table III). The recall obtained with the histogram method was improved by 4%, meaning 125 additional cuts were successfully detected, while the precision of the motion method was improved by 5.5%, thus 178 cuts were additionally detected correctly. Moreover, using the proposed cut detection in conjunction with the SCC detection procedure, the precision reached almost 96%, a further improvement of 1% (see Table III).

The obtained false detections are mostly due to very fast camera motion, while the misdetections are due to color similarity between the cut frames and to the occurrence of gradual transitions (i.e. fades, dissolves).

A comparative study of the error situations for the three methods is presented in Table IV. The labels S1 to S4 denote particular movie segments, namely: S1 corresponds to a cut situation where the frames before and after the cut are color-similar, as the second is obtained by zooming into a region of the first; S2 corresponds to a global camera motion; S3 corresponds to a cut followed by a fast camera motion; S4 corresponds to a fast object motion. We use the following notation: "Yes" stands for cut detected and "No" means cut not detected.

Method               histogram       proposed   motion
situation S1 (cut)   No              No         Yes
situation S2         Yes             No         No
situation S3 (cut)   Yes             Yes        No
situation S4         Yes (delayed)   Yes        No

TABLE IV. A COMPARATIVE STUDY OF THE DETECTION ERRORS.

The three methods performed as follows:

• situation S1: due to the strong color similarity, only the motion-based method (motion) successfully detected the cut,

• situation S2: due to the camera motion right after the cut, the classical histogram-based method (histogram) produced a false detection,

• situation S3: due to the very fast camera motion occurring after the cut, the motion-based method (motion) failed to detect the cut,

• situation S4: due to the very fast character motion, the proposed method produced a false detection, while the classical histogram-based method (histogram) produced its false detection several frames later.

The proposed cut detection method was also tested in the ARGOS video tools evaluation campaign [1] on a set of 21 documentary movies with a total duration of 10 hours and 24 minutes. The proposed adaptive thresholding approach provided good detection results: we achieved a global precision ratio of 91% and a recall ratio of 90.5%. Most of the false positives were in this case due to the presence of noise, discontinuous motion and random changes of the global image illumination.

IV. CONCLUSIONS

In this paper we proposed an improved histogram-based cut detection technique adapted to the shot segmentation of animation movies. To manage the difficulties raised by the peculiarities of animation movies, several improvements were proposed. First, to reduce the influence of entering/exiting objects (i.e. characters), the frames are divided into quadrants, thus allowing us to localize the discontinuity (only objects the size of one quadrant or more influence the detection). To reduce the influence of continuous visual changes (global camera motion, color changes, video transitions, etc.), second-order derivatives are applied to the mean Euclidean distances between frame histograms. Moreover, for the frame classification we propose an automatic global threshold estimation which adapts to the specificity of each movie. Finally, to reduce some of the false positives, the detected cuts are additionally checked for the presence of a specific visual effect, called SCC ("short color changes"), which is otherwise mistakenly detected as a cut.

The proposed method achieved a very good recognition ratio compared to a conventional histogram-based approach and to a motion-based approach. The obtained recall and precision are both above 92%, an improvement of around 3% (more than 100 cuts). Also, by using the proposed method in conjunction with the SCC detection, the precision ratio was improved by a further 1%.

Future improvements will consist in merging this approach with the motion-based method, to improve the invariance to fast motion and the detection of cuts occurring between color-similar images. A possible strategy is based on the use of a confidence measure describing the quality of the detection for both methods. A multimodal analysis is also to be considered, by including the analysis of the movie soundtrack (particular sounds are related to cuts).

ACKNOWLEDGMENT

We would like to thank CITIA - "City of Moving Images" [3] and the Folimage Animation Company [5] for providing us with the animation movies and for their collaboration during the development of the animation movie project.

REFERENCES

[1] ARGOS. Evaluation campaign for surveillance tools of video content. http://www.irit.fr/argos, 2006.

[2] F. Arman, A. Hsu, and M.Y. Chiu. Image processing on compressed data for large video databases. ACM International Conference on Multimedia, pages 267-272, Anaheim, USA, August 1993.

[3] CITIA - City of Moving Images. http://www.annecy.org.

[4] M.S. Drew, Z.N. Li, and X. Zhong. Video dissolve and wipe detection via spatio-temporal images of chromatic histogram differences. IEEE International Conference on Image Processing, 3:929-932, 2000.

[5] Folimage Animation Company. http://www.folimage.com.

[6] U. Gargi, R. Kasturi, and S.H. Strayer. Performance characterization and comparison of video-shot-change detection methods. IEEE Transactions on Circuits and Systems for Video Technology, 10(1), 2000.

[7] S.J.F. Guimaraes, M. Couprie, A. de A. Araujo, and N.J. Leite. Video segmentation based on 2D image analysis. Pattern Recognition Letters, (24):947-957, 2003.

[8] A. Hanjalic. Shot-boundary detection: unraveled and resolved? IEEE Transactions on Circuits and Systems for Video Technology, 12(2):90-105, February 2002.

[9] A. Hanjalic, M. Ceccarelli, R.L. Lagendijk, and J. Biemond. Automation of systems enabling search on stored video data. SPIE Storage and Retrieval for Image and Video Databases V, 3022:427-438, February 1997.

[10] W.J. Heng and K.N. Ngan. Post shot boundary detection technique: flashlight scene determination. 5th International Symposium on Signal Processing and Its Applications, pages 447-450, 1999.

[11] B. Ionescu, V. Buzuloiu, P. Lambert, and D. Coquin. Improved cut detection for the segmentation of animation movies. IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.

[12] B. Ionescu, D. Coquin, P. Lambert, and V. Buzuloiu. The influence of the color reduction on cut detection in animation movies. 20th GRETSI Conference on Signal and Image Processing, Louvain-la-Neuve, Belgium, September 2005.

[13] B. Ionescu, D. Coquin, P. Lambert, and V. Buzuloiu. Fuzzy semantic action and color characterization of animation movies in the video indexing task context. Springer LNCS, 4398:119-135, 2007.

[14] B. Ionescu, P. Lambert, D. Coquin, and V. Buzuloiu. Fuzzy color-based semantic characterization of animation movies. IS&T CGIV - 3rd European Conference on Color in Graphics, Imaging, and Vision, Leeds, United Kingdom, June 2006.

[15] J.M. Corridoni and A. Del Bimbo. Film semantic analysis. Proceedings of Computer Architectures for Machine Perception, pages 202-209, Como, Italy, September 1995.

[16] S.H. Kim and R.H. Park. Robust video indexing for video sequences with complex brightness variation. IASTED International Conference on Signal and Image Processing, pages 410-414, Kauai, Hawaii, 2002.

[17] R. Lienhart. Dynamic video summarization of home video. SPIE Storage and Retrieval for Media Databases, 3972:378-389, January 2000.

[18] R. Lienhart. Reliable transition detection in videos: a survey and practitioner's guide. International Journal of Image and Graphics, 1(3):469-486, 2001.

[19] G. Lupatini, C. Saraceno, and R. Leonardi. Scene break detection: a comparison. Research Issues in Data Engineering, Workshop on Continuous Media Databases and Applications, pages 34-41, 1998.

[20] Y.F. Ma, J. Sheng, Y. Chen, and H.J. Zhang. MSR-Asia at TREC-10 video track: shot boundary detection task. 10th Text Retrieval Conference, page 371, 2001.

[21] J. Meng, Y. Juan, and S.F. Chang. Scene change detection in an MPEG compressed video sequence. SPIE, 2419:14-25, February 1995.

[22] K. Otsuji, Y. Tonomura, and Y. Ohba. Video browsing using brightness data. SPIE Visual Communications and Image Processing, 1606:980-989, 1991.

[23] S.V. Porter, M. Mirmehdi, and B.T. Thomas. Video cut detection using frequency domain correlation. 15th International Conference on Pattern Recognition, pages 413-416, Barcelona, Spain, 2000.

[24] W. Ren and S. Singh. Video transition: modeling and prediction. Pattern Analysis and Neural Networks, PANN, http://www.dcs.ex.ac.uk/research/pann/pdf/pannSS089.PDF, 2003.

[25] Y. Rubner, L. Guibas, and C. Tomasi. The earth mover's distance, multi-dimensional scaling and color-based image retrieval. Proc. of the ARPA Image Understanding Workshop, 1997.

[26] B. Shahraray. Scene change detection and content-based sampling of video sequences. SPIE Digital Video Compression: Algorithms and Technologies, 2419:2-13, February 1995.

[27] C.G.M. Snoek and M. Worring. Multimodal video indexing: a review of the state-of-the-art. Multimedia Tools and Applications, 25(1):5-35, 2005.

[28] B.T. Truong, C. Dorai, and S. Venkatesh. New enhancements to cut, fade, and dissolve detection processes in video segmentation. ACM Multimedia, pages 219-227, November 2000.

[29] N. Vasconcelos and A. Lippman. Statistical models of video structure for content analysis and characterization. IEEE Transactions on Image Processing, 9:3-19, January 2000.

[30] N.D. Venkata, B.L. Evans, and V. Monga. Color error diffusion halftoning. IEEE Signal Processing Magazine, 20(4):51-58, 2003.

[31] W.A.C. Fernando, C.N. Canagarajah, and D.R. Bull. Scene change detection algorithms for content-based video indexing and retrieval. IEE Electronics and Communication Engineering Journal, pages 117-126, June 2001.

[32] Y. Wang, Z. Liu, and J.-C. Huang. Multimedia content analysis using both audio and visual clues. IEEE Signal Processing Magazine, 17(6):12-36, November 2000.

[33] B.-L. Yeo and B. Liu. Rapid scene analysis on compressed video. IEEE Transactions on Circuits and Systems for Video Technology, 5:533-544, December 1995.

[34] B.-L. Yeo and B. Liu. Rapid scene analysis on compressed video. IEEE Transactions on Circuits and Systems for Video Technology, 5:533-544, December 1995.

[35] R. Zabih, J. Miller, and K. Mai. A feature-based algorithm for detecting and classifying production effects. Multimedia Systems, 7:119-128, 1999.

[36] H. Zhang, A. Kankanhalli, and S.W. Smoliar. Automatic partitioning of full-motion video. Multimedia Systems, 1(1):10-28, 1993.

Bogdan Ionescu is currently a temporary assistant professor with the University of Savoie, Polytech Savoie, France. He received a BS degree in applied electronics (2002) and an MS degree in computing systems (2003) from the University "Politehnica" of Bucharest, Romania. In 2007 he received a PhD degree in Electronics, Electrotechnics, Telecommunications and Automatics jointly from the University of Savoie, France, and the University "Politehnica" of Bucharest, Romania, for which he was granted the "Magna cum laude" award. His scientific interests cover electronics engineering, artificial intelligence, image processing, computer vision, software engineering, and computer science.

Patrick Lambert graduated in electrical engineering from the Grenoble National Polytechnic Institute (INPG), Grenoble, France, in 1978. He received the Ph.D. degree in signal processing in 1983. He is currently an Assistant Professor of electrical engineering at the University of Savoie (Technological Academic Institute) and works in the Informatics, Systems, Information and Knowledge Processing Laboratory (LISTIC), Annecy, France. His research interests include image and video indexation, data fusion, and fuzzy logic.

Didier Coquin received his Ph.D. degree in signal processing and telecommunications from the University of Rennes I, France, in 1991. He is currently an Associate Professor of Telecommunications and Networking Engineering at the University of Savoie and does research in LISTIC (Informatics, Systems, Information and Knowledge Processing Laboratory), Annecy, France. His research interests include image processing, pattern recognition, image/video indexation, data fusion and fuzzy logic.

Vasile Buzuloiu is a professor with the Department of Applied Electronics and Information Engineering. He also heads the Image Processing and Analysis Laboratory, University "Politehnica" Bucharest, Romania, and is a Research Associate with CERN, Geneva, Switzerland. He holds an M.S. degree in electronics from the University "Politehnica" Bucharest (1959), an M.S. degree in mathematics from "Universitatea Bucuresti" (1971), and a Ph.D. degree in electronics from the University "Politehnica" Bucharest (1980). He received the "Traian Vuia" Award of the Romanian Academy (1985) for the first digital image analysis system developed in Romania. He is a Member of IEEE, SPIE, the Color Group (Great Britain), and the Romanian Society for Applied Mathematics. Since 1995, he has been the Director of the Spring International School on Multidimensional Signal Processing and Analysis: Methods, Algorithms, Technologies, Applications, organized yearly at the University "Politehnica" Bucharest. His scientific interests cover mathematical modeling, statistical decisions, encryption, digital signal processing, image processing and analysis systems, and image processing applications.
