
J. Vis. Commun. Image R. 21 (2010) 283–294


IM(S)2: Interactive movie summarization system

Mehdi Ellouze a,*, Nozha Boujemaa b, Adel M. Alimi a,1

a REGIM: Research Group on Intelligent Machines, University of Sfax, ENIS, BP 1173, Sfax 3038, Tunisia
b INRIA: IMEDIA Team, BP 105 Rocquencourt, 78153 Le Chesnay Cedex, France


Article history:
Received 15 December 2008
Accepted 18 January 2010
Available online 25 January 2010

Keywords: Video analysis; Video summarization; Users' preferences; Interactive multimedia system; Content analysis; Pattern recognition; Genetic algorithm; One-class SVM

1047-3203/$ - see front matter © 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.jvcir.2010.01.007

* Corresponding author. Fax: +216 74 275 595. E-mail addresses: [email protected] (M. Ellouze), [email protected] (N. Boujemaa), [email protected] (A.M. Alimi).
1 Fax: +33 13963 5674.

The need for summarization methods and systems has become more and more crucial as audio-visual material continues its rapid growth. This paper presents a novel vision and a novel system for movie summarization. A video summary is an audio-visual document displaying the essential parts of an original document. However, the definition of the term ‘‘essential” is user-dependent. The advantage of this work, unlike others, is the involvement of users in the summarization process. By means of IM(S)2, people generate on the fly customized video summaries responding to their preferences. IM(S)2 is made up of an offline part and an online part. In the offline part, we segment the movies into shots and compute features describing them. In the online part, users indicate their preferences by selecting interesting shots. The system then analyzes the selected shots to bring out the user's preferences. Finally, the system generates a summary of the whole movie that puts more focus on the user's preferences. To show the efficiency of IM(S)2, it was tested on the database of the European project MUSCLE, made up of five movies. We invited 10 users to evaluate the usability of our system by generating a semi-supervised summary for every movie of the database and judging its quality at the end. The results obtained are encouraging and show the merits of our approach.

© 2010 Elsevier Inc. All rights reserved.

1. Introduction

Nowadays, the movie industry is flourishing as video content consumption is growing, and this presents challenging problems for the scientific community. In fact, in the everyday search for entertainment material, people are becoming more and more interested in the products and services of the video content industry, and especially in movies [43].

Besides, the increasing use of new technologies such as the Internet brings information sources close to people. Web sites such as YouTube, Google Video or Dailymotion make thousands of video sequences available every day. Consulting video sequences has become part of the daily routine of the majority of Internet users.

All this has created a problem, namely: how can I find what I need in a minimum of time without being distracted? [41].

Video summarization systems try to solve this problem. Summarizing a video document consists in producing a reduced video sequence showing the important events of the original sequence; after consulting the summary, viewers can get an idea of the context and the semantics of the original sequence [18,19,21].



Many video summarization systems have been proposed in the literature. They differ in two respects: the form of the generated summary and the rules (assumptions) on which the authors based their system to generate a summary that is shorter than the original sequence but gives an appropriate overview of the whole content. However, what is appropriate for one viewer may be inappropriate for others. In fact, selecting the parts of the original video that should be included in the summary is still a challenging problem. No clear criteria have been established to decide what should and what should not be included in the final summary. This has also posed another challenging problem, that of evaluation, which has been solved in part in the TRECVID evaluation campaigns [9,20].

Aware of the importance of this problem, we introduce in this paper a new vision for video summarization: ‘‘a tailor-made summary for each user”. Every user has to specify what should be included in the final summary. Contrary to the majority of existing video summarization systems, which either neglect or pay little attention to users' preferences, the contribution of this paper is a complete framework for user-oriented video summarization. We propose a novel summarization system that involves users in the process of movie summarization.

We have been motivated by the lessons learned from the study on ‘‘requirement specifications for personalized multimedia summarization” by Agnihotri et al. [1]. The authors organized a panel


which gathered five experienced researchers in the fields of video indexing, video summarization and user studies in order to analyze user needs. Over four separate brainstorming sessions, the panel examined four questions: (i) Who needs summarization and why? (ii) What should summaries contain? (iii) How should summaries be personalized? and (iv) How can results be validated?

According to this work, the brainstorming sessions concluded unanimously that ‘‘summary is user context dependent, and so it is important for summarization to be personalized to cover user aspects: preferences, context, time, location, and task. They concluded also that both content requirements and user requirements should be validated”.

Moreover, we concentrate our efforts on movies. The choice of movies as an application field is justified by people's growing need to watch movies and by the growing number of produced movies. Indeed, according to IMDb [17], there is now a stock of 328,530 movies representing a total of 740,803 h.

The rest of the paper is organized as follows: in Section 2, we discuss works related to video summarization. In Section 3, we state the problem and present how some works include users in the summarization process. In Section 4, we present our approach. Results of our approach are shown in Section 5. We conclude with directions for future work.

2. Related work

Recently, two prominent papers [32,49] have presented an exhaustive analysis of the majority of works proposed in the field of video summarization.

A video summarization system aims at generating a reduced version of the original sequence containing the essential information. The summary may take two forms:

• Storyboard: a set of keyframes extracted from the original sequence giving an overview of the whole video sequence [49].

• Video skim: this type of summary consists of a set of video excerpts extracted from the original sequence because they have been judged important. They are joined by either a cut or a gradual effect [49].

We can classify existing works according to the form of the generated video summary.

2.1. Storyboards

Early works such as [31] and [46] produced storyboards by random or uniform sampling. At each important visual change, a single keyframe is selected to appear in the storyboard. These kinds of approaches are no longer used because they suffer from a lot of redundancy. Later works are based on the shot unit, so a storyboard is made up of keyframes representing all or part of the video shots. In [54], the authors select the first image of every shot, and if there is an important change between two successive frames, the shot is represented by more than one image. The authors of [52] proposed to summarize video by a set of keyframes of different sizes. The selection of keyframes is based on eliminating uninteresting and redundant shots. Selected keyframes are sized according to the importance of the shots from which they have been extracted. In [45], the authors represent every shot in the storyboard by a set of frames arranged according to the type of camera work. More sophisticated works such as [53] and [55] use classification to eliminate redundancy, and others such as [47] use motion information to measure the activity of a shot and to estimate the number of frames that must be selected for every shot.

2.2. Video skim

Although the storyboard is interesting for getting an overview of the whole video sequence, it suffers from a lack of semantics due to the absence of temporal and auditory information. That is why more effort has been devoted to video skims. In the literature, we find two kinds of video skims:

• the summary sequence, which provides users with an impression of the entire video content;

• the highlights sequence, which provides users with exciting scenes and events.

In the category of systems generating highlights-oriented summaries, we can cite the VAbstract system [37], which selects important segments in action movies. The selected segments either present high contrast or contain frames having the largest differences. The VAbstract system targets action events. The MOCA project [22], an improved version of the VAbstract system, tries to locate high-level events such as gunfire, explosions and fights to improve the quality of the generated trailer.

In the sports field, there are also many systems that try to generate highlights sequences [2,23,36,39]. In the context of sport, highlights depend on the competition itself. They correspond in general to goals, tries, points, red cards, penalties, substitutions, etc.

In the category of works generating summary-oriented skims, we can cite the Informedia system [42] from Carnegie Mellon University, which transcribes the audio track of the video sequence and uses NLP techniques to locate important segments. It also uses the visual track to locate important segments through analyzing motion information and detecting faces. The system fuses the results of the two analyses to obtain an audio-visual summary. IBM, Microsoft and Mitsubishi labs proposed systems [33,35,38] that summarize the original video sequence by changing the playback speed: important segments are played back slowly and non-important segments quickly. The difference between the three systems lies in the criteria used to delimit important segments.

More sophisticated works use psychological and perceptual features to generate video skims. In [28], the authors simulate user attention by a function that evaluates a given video excerpt using visual and auditory features such as contrast, motion, audio volume, caption text, etc. To find the best summary, the authors maximize this function. In [26], the authors proceed in nearly the same way: they assign scores to shots according to high-level features that generally attract the viewer's attention, such as face occurrences, text occurrences, voices, camera zooming, explosion noises, etc. The summarization process is considered as a constraint-satisfaction problem aiming to satisfy the viewer's attention by means of the evoked features. In our work [9], we also consider the problem of summarization as an optimization problem. During our participation in TRECVID 2008, we proposed a system that summarizes BBC rushes by shots selected according to fixed criteria [9]. A genetic algorithm was used to compute the optimal summary. In [51], Tsoneva et al. propose a novel approach for video summarization. Their approach uses the subtitles and the scripts to segment movies into meaningful parts called subscenes. The subscenes are then ranked through an importance function, and finally the important subscenes are selected to form the final summary. The evaluation of the generated summaries shows the merits of using textual information in the summarization process.


3. Problem statement

A video summary should describe the original content briefly and concisely. It must be shorter than the original sequence and give an appropriate overview of the whole content. However, what is appropriate for one viewer may be inappropriate for others. That is why a new tendency of works tries to involve users in the process of video summarization. The first work about including users in the summarization process is [12], in which the authors suggest a summarization process in two stages. The first one consists in computing descriptors such as color and motion, and the second consists in generating the summary according to one simple user preference, namely the number of keyframes of the summary. It is a preference specified in the MPEG-7 User Preference Description scheme. In [13], the authors propose a system that produces on-the-fly storyboards and allows users to select the number of keyframes and the processing time. In [34], the authors collect user preferences through weights over a set of high-level features. The major drawback of this approach is the fact that users operate directly on technical parameters that may not be easily understood. More sophisticated works such as [44] try to collect the preferences of a given person using his own pictures stored in his personal computer. They base their work on the assumption that nowadays it is common, especially among young people, to store thousands of photos on one's own PC. These photos generally reflect the user's taste, personality and lifestyle. The authors judge that these photos may be used to estimate users' preferences. They categorize the personal photo library to extract pertinent classes. After that, they classify the keyframes of the video to be summarized according to the detected classes to bring out clips meaningful to users.

Even if some efforts have been made to integrate users in the summarization process, they remain insufficient. In fact, using users' personal photos to generate summaries, or asking users about the duration (video skim) or the number of frames (storyboard) of the targeted summary, does not reflect the real users' preferences.

Even if Parshin and Chen [34] proposed an adjustable system that tries to gather a set of information related to users' preferences, it remains difficult for users to understand technical parameters and to express their preferences in terms of technical features.

Our contribution is the design of a system that collects the user's preferences appropriately and tries to generate a summary that reflects these preferences and describes the original content briefly and concisely. It combines successful methods from computer vision, video processing and pattern recognition [6,9,11,15,40] into an integrated architecture.

Fig. 1. Overview of the IM(S)2 system: an offline part (shot and scene segmentation, feature computation, features database) and an online part (user preferences, video summarization, summary).

4. Proposed approach

4.1. System framework

A video summary is an audio-visual document displaying the essential parts of the original document. The term essential depends on the user and on the audio-visual document itself. In fact, it is only after knowing the context of the movie that a user may express his/her preferences. Indeed, we may have a preference for one kind of scene in action movies and for another kind in dramatic movies, etc.

Our idea consists in displaying to every user an overview of the movie that can help him express his preferences easily and efficiently. We have been inspired by interactive image indexing systems, in which users operate directly on the database and try, through many iterations, to reach the targeted class.

In our previous work [11], for instance, the system displays at each round two images randomly selected from a database containing thousands of images. The user is required to choose, among the two images, the one that is closest to the targeted class. Little by little, through many rounds, the system delimits the targeted class.

The framework that we propose may be summarized as follows: the user takes a look at an overview and interacts with it to express his preferences, and the system replies by generating a summary which at the same time displays the essential parts of the original movie and includes the user's preferences. Fig. 1 illustrates our framework schematically.

This kind of framework is original in the context of video summarization, which is why many questions may be asked [3,4]:

• What kind of overview do we have to display to the user?
• How will the user express his preferences?
• How many rounds will there be between the user and the system?
• How will the system bring out and understand the user's preferences?
• How will the system generate a summary that takes into consideration the user's preferences and covers the essential parts of the movie at the same time?


Fig. 2. GUI of IM(S)2.


We will try, in the following sections, to answer all these questions.

4.2. Movie overview

An overview should be simple and quick to understand. It must give an idea about the semantics of the original movie by informing about the actors involved, the places in which the story takes place, the important events, etc.

Clustering the shots of the movie and showing a representative image for each cluster was one of the solutions that we considered. However, this kind of overview may hide many details that are important to understand the context of the movie. As mentioned by Hanjalic et al. in [16]: ‘‘Humans tend to remember different events after watching a movie and think in terms of events during the video retrieval process. Such an event can be a dialog, action scene or, generally, any series of shots unified by location or dramatic incident. Therefore, an event as a whole should be treated as an elementary retrieval unit in advanced movie retrieval systems”.

For this reason, we think that segmenting the movie into scenes and making an overview showing all the scenes of the movie is a suitable solution.

Generally, in one movie we find between 1500 and 2500 shots and between 50 and 80 scenes. A scene gathers a set of correlated shots that share the same actors and take place at the same moment and in the same place. So, presenting a mosaic image composed of small images representing the scenes of the movie can remove a lot of redundancy and give a general overview of the whole movie. This overview also gives the user the opportunity to decide, before starting the summarization process, whether the movie is interesting, thus avoiding wasting time summarizing and browsing it needlessly.

4.3. User interaction

After watching the overview, the user is required to express his/her preferences. These preferences are generally centered on some types of scenes (action scenes, dialog scenes, landscape, romance), or around some contents such as locations (forest, mountain, city, buildings, indoor, outdoor, etc.), actors, time (day and night) or simply special kinds of objects (plane, tank, car, etc.).

As it is impossible to predict and model all users' preferences and to implement appropriate detectors, the user has to interact directly with the system by selecting, directly from the overview, shots that represent what he would like to watch in the summary. Besides, we ask him to specify the upper duration that the generated summary must not exceed (see Fig. 2).

The system will analyze the selected shots, locate the user's center of interest inside the movie and finally take his preferences into consideration to generate a summary not exceeding the fixed duration (see Fig. 3).

In the following sections, we present the two main parts of IM(S)2: the offline part and the online part. In the offline part, the movies of our database are segmented into shots and features are computed to describe these shots. In the online part, users interact with the system to generate summaries taking their preferences into consideration.

4.4. Offline part

4.4.1. Scene segmentation and overview making

Semantically speaking, a movie is considered as a set of scenes rather than a set of shots, because scenes, contrary to shots, are semantic units. The database of the European project MUSCLE, on which we tested our system, is provided with the ground truth of the shot boundaries.

To segment movies into scenes, we used our system called ‘‘Scene Pathfinder”, which was tested on films of different cinematographic genres and gives encouraging results [8].

To formulate the overview of a given movie, we perform a clustering inside every detected scene to extract the shot clusters. To accomplish this, we relied on the work of Casillas et al. [7]. The representative shot of every cluster is the longest one. An image composed of the keyframes of the representative shots is computed to obtain a representative image of the scene. At most four shots, representing the four largest clusters, are selected, in order not to disturb the vision and attention of the user and to let him focus on the core of the scene. If the user is interested in a given scene, he can zoom in, and all the clusters of the scene are shown (see Fig. 2).

Fig. 3. Constraints taken into consideration by the IM(S)2 system: the user's preferences (content: locations, time, persons; type: romance, dialogue, action; duration) and the summary constraints (no redundancy, temporal coverage) feed the online summarization process.
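Returning to the overview construction, the mosaic composition itself is simple image tiling. The following Python fragment is a minimal sketch (the authors' system is implemented in Matlab); the function name, the 2 × 2 layout and the thumbnail size are assumptions, and keyframes are expected as already-resized RGB arrays.

```python
import numpy as np

def scene_mosaic(keyframes, thumb_h=90, thumb_w=160):
    """Tile up to four cluster keyframes (thumb_h x thumb_w x 3 uint8
    arrays) into a 2 x 2 mosaic representing one scene."""
    canvas = np.zeros((2 * thumb_h, 2 * thumb_w, 3), dtype=np.uint8)
    for idx, frame in enumerate(keyframes[:4]):   # four largest clusters only
        r, c = divmod(idx, 2)
        canvas[r * thumb_h:(r + 1) * thumb_h,
               c * thumb_w:(c + 1) * thumb_w] = frame
    return canvas
```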

If a cluster is selected by a user, all the shots of the cluster are considered as positive examples representing the preferences of this user.

4.4.2. Shot representation

The major problem in multimedia remains the semantic gap. Until now, we do not have features that describe the content of images well. In our case, the features used must describe as many as possible of the concepts present in the selected shots.

The preferences of a user may be divided along two axes: type and content. In fact, by selecting some specific shots, the user may inform about:

• His preference for some types of shots: action shots, dialog shots, landscape, romance, etc. Action shots are characterized by a high tempo, a lot of motion and a lot of sound effects, whereas the other types are related to special content: dialog shots are related to the presence of the actors of the movie, landscape shots are related to special texture or color (greenery, sunset, etc.), and romance shots are characterized by a high percentage of skin color in the image.

• His preference for special contents such as locations (forest, mountain, city, buildings, indoor, outdoor, etc.), actors, time (day and night) or objects (planes, cars, tanks, etc.).

For this reason, we used two types of descriptors in our system: tempo descriptors and content descriptors.

Fig. 4. The shot frequency feature: the reference shot is placed at the center of a 60-s interval.

4.4.3. Tempo descriptors and shot clustering

In order to know whether the user is interested in action scenes, we have to delimit them in the movie. To do that, we perform an unsupervised clustering to classify the shots of the movie into two classes: action shots and non-action shots. This classification is done using the Fuzzy C-Means classifier [5] and tempo features. Indeed, compared to non-action scenes (dialogue, romance, etc.), which are characterized by a calm tempo, action scenes (fight, war, car chase, etc.) display three important phenomena. First, they contain a lot of motion: motion of objects (actors, cars, etc.) and camera motion (pan, tilt, zoom, etc.). The second phenomenon is the use of special sound effects to excite and stimulate the viewer's attention, by amplifying the actors' voices, introducing explosion and gunfire sounds from time to time, etc. The third important phenomenon is the duration and the number of shots per minute. Action scenes are filmed by many cameras; for this reason, the filmmaker switches permanently between all the cameras to film the scene from many views.

The features that we use to quantify these three phenomena are:

• The shot activity: we use the Lucas–Kanade optical flow [27] to estimate the direction and speed of object motion from one frame to another in the same shot.

• The shot audio energy: we compute the short-time average energy of the audio track of every shot. The short-time average energy of a discrete signal s of N samples is defined as follows:

\(E(\mathrm{shot}) = \frac{1}{N} \sum_{i=1}^{N} s(i)^2\)    (1)

• The shot frequency: we place every shot at the center of a one-minute interval and count the number of shots that belong to this interval. This number is the third feature of every shot (see Fig. 4).

As a result of a classification done by the Fuzzy C-Means classifier with the number of classes set to two, we obtained compact zones composed of consecutive action shots, representing action scenes [8].
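As a rough sketch of this step, the fragment below computes the second and third tempo features and runs a minimal Fuzzy C-Means on the resulting three-dimensional vectors. It is a Python illustration rather than the authors' Matlab code; the shot-activity values are assumed to be precomputed from the Lucas–Kanade optical flow, and all function names are hypothetical.

```python
import numpy as np

def short_time_energy(samples):
    """Eq. (1): mean squared amplitude of a shot's audio samples."""
    s = np.asarray(samples, dtype=float)
    return np.mean(s ** 2)

def shot_frequency(centers, i, half_window=30.0):
    """Number of shots whose centers fall in a one-minute window
    centered on shot i (the shot frequency feature of Fig. 4)."""
    return int(np.sum(np.abs(centers - centers[i]) <= half_window))

def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
    """Minimal Fuzzy C-Means; returns centroids and the membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))       # random initial memberships
    for _ in range(iters):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]       # fuzzy-weighted centroids
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)     # standard FCM update
    return V, U

# X stacks, per shot: activity (precomputed), audio energy, shot frequency.
# fuzzy_cmeans(X)[1].argmax(axis=1) splits the shots into two classes.
```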

4.4.4. Content descriptors

The quality of the descriptors is very important in any video processing work. In fact, the quality of these descriptors is crucial to bridge the semantic gap.


In our system, the goal of the content descriptors is to provide important visual cues to describe the content of non-action shots (dialog, romance and landscape) and the content of shots in general: location, time, objects, concepts, themes, etc.

The content descriptors used in IM(S)2 have already been implemented and integrated in our content-based image retrieval engine called IKONA [6].

In general, every concept or object is related to special colors, special textures and special edges. That is why the descriptors integrated into IKONA are varied and cover the colors, the textures and the edges of images.

To describe the color content, we use a standard HSV histogram computed on 120 bins. The HSV histogram informs about the color distribution of an image; however, it suffers from a lack of spatial information because all pixels are equally important. For this reason, we also use two other descriptors. First, LapRGB, a histogram computed on 216 bins, uses the edge information to weight every color pixel; the idea is that pixels situated on edges and corners make an important contribution to the histogram. Second, ProbRGB, a histogram also computed on 216 bins, uses the texture information to weight every color pixel, favoring the contribution of pixels where there is important color activity.
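As an illustration of an edge-weighted color histogram, the sketch below weights each pixel's vote by its Laplacian magnitude. This is only one plausible reading of LapRGB, not the IKONA implementation: the 6 × 6 × 6 = 216-bin layout and the use of OpenCV are assumptions.

```python
import cv2
import numpy as np

def lap_rgb_histogram(bgr):
    """Edge-weighted RGB histogram (6 x 6 x 6 = 216 bins): each pixel's
    vote is weighted by its Laplacian magnitude, so pixels on edges and
    corners contribute more (one plausible reading of LapRGB)."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    w = np.abs(cv2.Laplacian(gray, cv2.CV_64F)).ravel()
    rgb = bgr.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(rgb, bins=(6, 6, 6),
                             range=[(0, 256)] * 3, weights=w)
    hist = hist.ravel()
    return hist / (hist.sum() + 1e-12)            # normalize to unit mass
```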

We also use the Fourier histogram descriptor, which exploits the texture information to obtain an exact measure of the energy information and its distribution in a given image. This histogram is computed on 64 bins.

Besides, in order to get a general idea about the different shapes that a given image may contain, we use two descriptors: the Hough histogram and the Local Edge Oriented Histogram. The Hough histogram is computed on 49 bins and gathers information about how pixels in the image behave with respect to the tangent line to the local edge passing through each pixel and with respect to the distance from the origin point. In addition, we compute the Local Edge Oriented Histogram (LEOH) on 32 bins to get an idea about pixels belonging to horizontal and vertical lines.

The vector gathering all the evoked features has 697 dimensions, which is very high. To reduce the dimensionality, we used linear principal component analysis. A previous study [10], done on several image databases in a query-by-example context, shows that we may reduce the dimension of the resulting vector by a factor of about 5 while losing less than 3%.
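A minimal sketch of this reduction, assuming a hypothetical matrix of 697-dimensional shot descriptors; scikit-learn stands in for whatever PCA implementation the authors used, and 139 components approximates the five-fold reduction mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(2000, 697)        # placeholder: one descriptor per shot

pca = PCA(n_components=139)          # ~697 / 5, the five-fold reduction
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))
```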

4.5. Online part

In the online part, the user is required to specify the duration of the targeted summary and to select the shots that he judges correspond, or are close, to his preferences. Relying on these selected shots, the system will generate a summary including shots that may be interesting. However, a summary also has to give an idea about the whole content of the original sequence. For this reason, other constraints will be taken into consideration, namely: ensuring good temporal coverage, avoiding redundancy and ensuring a smooth summary. We consider the summarization process as an optimization problem in which we try to find a compromise among all the evoked constraints.

4.5.1. Summary duration

The user has to specify the duration of the summary (see Fig. 1). This duration is not necessarily the duration of the generated summary but an upper duration that the summary must not exceed.

4.5.2. Learning users' interests

The shots selected by the user will be used to understand his preferences: the types and contents of shots that he is targeting.

To know whether the user is interested in action scenes, we cluster the shots into action and non-action shots. If one of the selected shots belongs to the cluster of action shots, we assume that the user is interested in this kind of shot. Consequently, two shots from every action scene (the two shots that have the highest motion) will be included in the final summary, as in the sketch below.
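This selection rule reduces to keeping the two highest-motion shots of each action scene; a hedged sketch with hypothetical names:

```python
def action_shot_picks(action_scenes, motion):
    """Keep, from every action scene (a list of shot ids), the two
    shots with the highest motion activity."""
    picks = []
    for shots in action_scenes:
        picks.extend(sorted(shots, key=lambda s: motion[s], reverse=True)[:2])
    return picks

# e.g. action_shot_picks([[10, 11, 12], [40, 41]], motion={10: 0.9, ...})
```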

For the other types of shots, as they are related to content, a content classification is performed to know whether they belong to the user's center of interest.

The aim of this classification is to discriminate between interesting and non-interesting shots. However, we are confronted with a particular classification problem because we have only positive examples. Starting from the shots selected by the user (positive examples) and the content features computed in the offline part, we have to classify the shots into two categories: the shots that may be of great interest to the user and the shots that are not particularly interesting.

This situation is not new and has been treated in other fields such as text categorization [29] and gene prediction [30]. In [29] and [30], the authors solved this problem using the one-class SVM, a type of SVM which uses a few samples of one class (either positive or negative) to enclose all samples that belong to this class. The encouraging results obtained in [29] and [30] incited us to use this type of SVM to classify movie shots into interesting and non-interesting.

In the literature, there are many formulations of the one-class SVM. The most widespread one, which gives the best results, is that of Schölkopf et al. [40], in which the authors consider the origin as the only member of the second class after transforming the features through a kernel function. They then try to separate the positive samples from the origin by a hyperplane.

In our context, we can have preferences that are non-linearly separable and have irregular shapes. One of the advantages of SVMs is their flexibility toward the class shape, thanks to the use of kernels. The study done in [29] in the context of document classification shows that the radial kernel produces in general the best results relative to other kernels. This conclusion is not surprising, because the radial kernel is known for its ability to capture different class shapes. The form of the radial kernel that we use is the following:

\(K(X, Y) = e^{-\gamma \|X - Y\|^2}\)    (2)

We chose the LIBSVM toolbox, available at [24], which provides an implementation of the one-class SVM. We use its standard parameters.

Every shot is represented by its middle frame. The positive examples are the middle frames of the selected shots. The one-class SVM classifies the shots, using the content features evoked in Section 4.4.4 and computed on the middle frames, to extract all the interesting shots. The extracted shots represent the basic elements of the final summary.
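The following Python sketch illustrates this classification step with scikit-learn's OneClassSVM, which implements the Schölkopf et al. formulation on top of libsvm with the radial kernel of Eq. (2). The feature matrix, the selected-shot indices and the value of nu are placeholders; the authors' actual setup uses LIBSVM directly with its standard parameters.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Content features of the middle frames of ALL movie shots (placeholder
# data), and the indices of the shots the user selected (positives only).
shot_features = np.random.rand(1800, 139)
selected = [12, 57, 300, 641]

# One-class SVM with the radial (RBF) kernel K(x, y) = exp(-gamma ||x-y||^2).
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
ocsvm.fit(shot_features[selected])

# +1 = predicted interesting, -1 = not interesting.
interesting = np.where(ocsvm.predict(shot_features) == 1)[0]
```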

4.5.3. Generating the summary

This step aims at finding a compromise between what a user likes to watch and what a summary must be.

If the user is interested in action scenes, we take from every action scene the two shots that have the highest motion and integrate them in the summary. We add up the durations of these shots and subtract their sum from the upper duration. The remaining time is used to integrate other shots that represent the user's content preferences.

From the shots classified by the one-class SVM as having interesting content, we select some to be included in the video skim. The integration of these shots is not random.

Fig. 5. Results of the genetic algorithm compared to other approaches tested on the TRECVID benchmark. (a) Distribution of ‘‘Presence of repeated material” scores and (b) distribution of ‘‘Pleasant tempo” scores.

Fig. 6. Encoding of the genetic algorithm: a chromosome representing a summary is a string of bits over the candidate shots selected by the user (e.g. 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0); shots included in the summary are encoded by 1, the remaining ones by 0.


Indeed, we must remember that a summary has to inform about the whole content of the original sequence (temporal and spatial coverage), at a pleasant tempo and without redundancy.

Genetic algorithms [15] are able to support heterogeneous criteria in the evaluation, and they are naturally suited to incremental selection, which may be applied to streaming media such as video. We consider the summarization process as a constrained optimization problem and propose to solve it with genetic algorithms.

The reason for choosing genetic algorithms is the encouraging results we obtained in the summarization task of the TRECVID 2008 evaluation campaign [48].

4.5.3.1. TRECVID benchmark and genetic algorithms. During the last two years, TRECVID has added a summarization task to its evaluation campaigns. In our participation in TRECVID 2008 [9], we computed summaries for the BBC rushes (the TRECVID summarization database) using genetic algorithms, selecting from every rush the shots considered important to obtain an informative skim.

We generate randomly a set of summaries (the initial population). Then, we run the genetic algorithm (selection, crossover, mutation, etc.) many times on this population with the hope of enhancing the quality of the summaries.

In the TRECVID campaign, the evaluation of generated summaries is done by assessors who judge the summaries and attribute to every one of them, for a set of criteria, scores ranging from 1 (worst) to 5 (best).

We ranked first and second among 43 participants in the criteria ‘‘Pleasant tempo” and ‘‘Presence of repeated material”, respectively (see Fig. 5). This shows the merits of using genetic algorithms in the summarization task and justifies why we reuse them here. However, some modifications are made to adapt them to the context of real-time movie summarization.

In our current context, we have to generate a summary responding to the following criteria:

Fig. 7. Position of summary shots in the movie relative to the reference shots: reference shots are placed every 5 min over the movie shots, and each shot is either included in the summary or not.

• not exceeding the remaining duration (after deducting the total duration of the action shots from the upper duration fixed by the user);

• including the maximum number of shots (from those judged interesting);

• presenting the minimum of redundancy;
• having a pleasant tempo;
• being spread over the whole movie.

4.5.3.2. Features of the genetic algorithm. To take into consideration the constraints evoked in the previous section, we suggest the following features for our genetic algorithm.

Binary encoding: we chose to encode our chromosomes (summaries) with binary encoding because of its popularity and relative simplicity. In binary encoding, every chromosome is a string of bits (0, 1). In our genetic solution, the bits of a given chromosome correspond to the shots representing the user's preferences. We use 1s to denote selected shots (see Fig. 6).

Fitness function: the fitness function aims at evaluating a given summary. This function has to favor smooth summaries that cover the whole movie, present a minimum of redundancy and do not exceed the duration specified by the user.


Fig. 9. The crossover operation between two chromosomes representing two summaries: the two parents exchange the tails of their bit strings after a cut point, producing two children.

Fig. 10. The mutation operation: a randomly chosen bit of the chromosome is flipped.


Let S be a video summary composed of m shots, S = {shot_i, 1 ≤ i ≤ m}. We evaluate the chromosome representing the summary S by maximizing the following fitness function:

\(Fit(S) = \min(Dist(S)) \times \max(Dist(S)) \times Nshots(S) \times SD(S) \times SS(S, U)\)    (3)

Let RS be the set of shots of the movie called reference shots. The first shot of RS is the first shot of the movie, and every two successive shots of RS are separated by 5 min (see Fig. 7):

\(RS = \{shot_i \mid shot_0 = \text{first shot of the movie},\ shot_{i+1} - shot_i = 5\ \text{min}\}\)

\(U = RS \cup S\), where the shots of U are temporally ordered.

A summary has to cover the major parts of the original movie and should be as smooth as possible; in fact, video jerkiness may disturb the users.

SS(S, U) is a score computed to evaluate the continuity (smoothness) of the summary and the coverage of the movie by the summary (through the reference shots). It is based on a measure called the Temporal Cost Function (TCF); SS and TCF are introduced in [25]. SS has the following form:

\(SS = \sum_{(i-1,\,i) \in U} TCF(|sh_{i-1} - sh_i|)\)    (4)

\(\text{where } TCF(x) = 1.25^{\,x \cdot k_z},\ k_z\ \text{a constant}\)    (5)

Nshots(S) is the number of shots selected in the chromosome. We have to include a maximum of shots so as to include a maximum of the ground truth.

Dist(S) is the vector containing the pairwise distances computed between the shots of the summary. Maximizing the term min(Dist) × max(Dist) penalizes redundancy that may occur in the final summary. The distance used to compute the difference between two shots is called Hist and is defined as the complement of the histogram intersection. The distance between two shots A and B is computed as follows:

\(Hist(A, B) = 1 - \left( \sum_{k=1}^{256} \min(H_A(k), H_B(k)) \middle/ \sum_{k=1}^{256} H_A(k) \right)\)    (6)
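Eq. (6) translates directly into a few lines; a minimal sketch, assuming the 256-bin shot histograms are stored as arrays:

```python
import numpy as np

def hist_distance(h_a, h_b):
    """Eq. (6): one minus the histogram intersection of two shot
    histograms, normalized by the total mass of the first histogram."""
    h_a, h_b = np.asarray(h_a, float), np.asarray(h_b, float)
    return 1.0 - np.minimum(h_a, h_b).sum() / h_a.sum()
```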

SD is a weight computed on the summary duration according to a Gaussian-shaped function (represented in Fig. 8) of the following form:

\(SD(x) = \exp\left(-\left[\frac{\ln(2)\,\bigl(2(x - \mathit{Max\_Duration})\bigr)^2}{\mathit{Max\_Duration}^2}\right]^2\right)\)    (7)

The closer the duration of the summary is to the maximum duration (the remaining duration), the closer this score is to 1. Since the summary must not exceed the specified duration, this coefficient penalizes summaries exceeding it.

Fig. 8. Duration weighting.

The crossover operation: crossover consists in exchanging part of the genetic material of two parents to construct two new chromosomes (see Fig. 9). This technique is used to explore the space of solutions by proposing new chromosomes that may improve the fitness function. In our genetic algorithm, the crossover operation is classic; the crossover probability is fixed at 0.75.

The mutation operation: after a crossover is performed, mutation takes place. Mutation is intended to prevent all solutions in the population from falling into a local optimum (see Fig. 10). The mutation probability is fixed at 0.02.
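To make the whole optimization concrete, here is a compact, hedged sketch of such a genetic algorithm in Python (the authors' system uses Matlab's ‘‘gatool”). All data are random placeholders, the TCF constant k_z is assumed negative so that the cost decays with the temporal gap, and the fitness follows Eqs. (3)–(7) as reconstructed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Placeholder problem data (all hypothetical) --------------------------
n_cand = 40                                     # candidate shots, one bit each
starts = np.sort(rng.uniform(0, 7200, n_cand))  # shot start times (s)
durs = rng.uniform(2, 15, n_cand)               # shot durations (s)
H = rng.random((n_cand, 256))                   # per-shot 256-bin histograms
max_dur = 180.0                                 # remaining duration budget (s)
refs = np.arange(0.0, 7200.0, 300.0)            # reference shots every 5 min
KZ = -0.05                                      # assumed (negative) TCF constant

def hist_dist(a, b):                            # Eq. (6)
    return 1.0 - np.minimum(a, b).sum() / a.sum()

def fitness(chrom):                             # Eq. (3)
    idx = np.flatnonzero(chrom)
    if len(idx) < 2:
        return 0.0
    d = [hist_dist(H[i], H[j]) for i in idx for j in idx if i < j]
    x = durs[idx].sum()                         # summary duration
    sd = np.exp(-(np.log(2) * (2 * (x - max_dur)) ** 2 / max_dur ** 2) ** 2)
    u = np.sort(np.concatenate([refs, starts[idx]]))
    ss = np.sum(1.25 ** (np.abs(np.diff(u)) * KZ))   # Eqs. (4)-(5)
    return min(d) * max(d) * len(idx) * sd * ss

def evolve(pop_size=30, gens=100, pc=0.75, pm=0.02):
    pop = rng.integers(0, 2, (pop_size, n_cand))
    for _ in range(gens):
        fit = np.array([fitness(c) for c in pop])
        p = fit / fit.sum() if fit.sum() > 0 else None   # roulette selection
        parents = pop[rng.choice(pop_size, pop_size, p=p)]
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):              # one-point crossover
            if rng.random() < pc:
                cut = int(rng.integers(1, n_cand))
                children[i, cut:] = parents[i + 1, cut:]
                children[i + 1, cut:] = parents[i, cut:]
        children[rng.random(children.shape) < pm] ^= 1   # bit-flip mutation
        pop = children
    return max(pop, key=fitness)

best = evolve()    # bits set to 1 are the shots kept in the summary
```

In the real system, the candidate bits would cover only the shots the one-class SVM judged interesting, and the action-scene shots chosen earlier are added outside this optimization.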

5. Experimental results

5.1. Dataset and test bed

To show the effectiveness of IM(S)2, we have been inspired by the evaluation done in the TRECVID workshop [48] (the video summarization task) and by many works that evaluate the summaries generated by their systems [13,14,25,50] through the MOS (Mean Opinion Score). We invited 10 users (assessors) to test IM(S)2 on a database composed of 5 movies of the European project MUSCLE (Multimedia Understanding through Semantics, Computation and Learning). The 10 users are not familiar with the system, but they are regular movie consumers. They belong to different age categories and have different backgrounds (students, PhD students, professors and office workers). The list of movies is presented in Table 1. IM(S)2 is entirely implemented in Matlab; we use the Matlab genetic algorithm toolbox ‘‘gatool”. The hardware platform is a PC with a 2.66 GHz processor and 1 GB of RAM.

5.2. Results

The system displays to every user an overview of the movie composed of its scenes. Then, the user selects the shots that correspond to his preferences. The system studies these preferences and generates a summary.

Table 1. Details of the MUSCLE database.

Movie                  Genre   Duration
Cold Mountain          Drama   2 h 30 min
Jackie Brown           Drama   2 h 30 min
Platoon                Action  1 h 55 min
Secret Window          Horror  1 h 30 min
The Lord of the Rings  Action  2 h 45 min

Fig. 12. Distribution of ‘‘Respecting the user's preferences” scores according to movies (Cold Mountain, Jackie Brown, Platoon, The Lord of the Rings, Secret Window).


After watching the summary, every user is asked to evaluate its quality, in an assessment similar to that of the TRECVID workshop. The measures used in our evaluation are:

• the summary gives a clear idea about the story of the movie;
• the summary includes the preferences of the user;
• the summary has a pleasant tempo and is easy to understand;
• the lack of redundancy.

As in TRECVID, every user is asked to give a score ranging from 1 (worst) to 5 (best) to indicate how well the summaries respect the criteria. The overall results of the individual measures are presented as Tukey-style boxplots. Tukey boxplots (also known as box-and-whisker plots) give a graphic image of several key measures of a distribution, including the minimum, maximum, median and the 25th and 75th percentiles (the middle 50%).

5.2.1. Giving a clear idea about the story of the movie

The first goal of a video summary is to give a clear idea about the story of the original video sequence. For this reason, we asked the 10 users whether the generated summaries give a clear idea about the stories of the original sequences. To every user we gave the textual summary of the original movie found on the official movie producer's website. Then we asked them how well the textual summary matches the generated video skim. After that, we plot for every movie the distribution of its scores (see Fig. 11).

All the scores are around four. The fitness function played a key role here: it favors summaries that contain a maximum of shots, due to the term Nshots(S) (maximum of events), with a minimum of redundancy, due to the term min(Dist(S)) × max(Dist(S)) (all included events are original), and covering the major parts of the movie, due to the term SS(S, U) (all included events are spread over the entire movie).

However, we notice a small difference between action movies and non-action movies: action movies have the best scores. Shots of non-action movies are long compared to shots of action movies; for this reason, summaries of non-action movies contain fewer shots than those of action movies. This reduces the coverage of the movie by the summary and so affects the understandability of the non-action movies.

5.2.2. Respecting the user's preferences

We plot for every movie the distribution of the scores of its summaries with respect to the criterion ‘‘respecting the user's preferences” (see Fig. 12). Generally, users are satisfied with the output of the system: users retrieve the content that they targeted. This confirms our technical choices. It confirms the efficiency of the one-class SVM and of the content features that we use in bridging the semantic gap and in understanding the users' preferences in terms of content information.

Fig. 11. Distribution of ‘‘Giving a clear idea about the story of the movie” scores according to movies (Cold Mountain, Jackie Brown, Platoon, The Lord of the Rings, Secret Window).

The high scores obtained for the action movies (Platoon and The Lord of the Rings) also confirm the efficiency of the tempo features and of Fuzzy C-Means in locating action scenes.

Besides, what we also noticed when watching the test sessions is that users do not focus on only one preference. For example, a user may select from the overview shots showing soldiers fighting in the forest and a close-up shot showing another actor. Regarding the obtained results, we can also deduce that the one-class SVM is able to understand different concepts (heterogeneous preferences). We think that this is due to the use of the radial kernel.

Moreover, specifying the preferences directly in the movies by means of the overview makes the context of the movies easier to understand. It avoids the ambiguity that may occur when specifying the preferences using keywords, for example. Preferences in the form of images are more concrete and more closely linked to the context of the original movies.

In fact, some users suggested specifying their preferences directly by typing the targeted concepts. For example, in the case of the movie ‘‘Platoon”, a user may ask for shots showing a soldier in the forest. Instead of selecting a shot showing a soldier in the forest to express this preference, he may write it as follows: Concept {Soldier} + Concept {Forest}.

However, this would require heavy concept-based video indexing to extract the concepts present in every movie according to an ontology. Users could then navigate this ontology and specify their preferences.

5.2.3. Pleasant tempo and easy to understand

Taking into consideration the smoothness of the video summary was useful. Introducing the term SS(S, U) in the fitness function makes the generated summaries easy to understand (not jumpy) and gives them a pleasant tempo. Besides, we used fades as transitions between the shots of the generated summaries to overcome the change-blindness issue and to make the summaries more comfortable to watch. This has an important impact on the scores attributed to the generated summaries, which are generally around 4 (see Fig. 13).

Fig. 13. Distribution of ‘‘Pleasant tempo and easy to understand” scores according to movies (Cold Mountain, Jackie Brown, Platoon, The Lord of the Rings, Secret Window).

Fig. 14. Distribution of ‘‘Lack of redundancy” scores according to movies (Cold Mountain, Jackie Brown, Platoon, The Lord of the Rings, Secret Window).



5.2.4. Lack of redundancy

Our strategy of integrating the users' preferences into the summaries may produce a sort of redundancy. In fact, the preferences of a given user may be centered on a given context (content), and so this context may be present throughout the entire summary, which would give an impression of redundancy. Here, the genetic algorithm played a key role in avoiding this problem. Our participation in TRECVID 2008 showed that genetic algorithms are efficient at this kind of task, and these results are confirmed once more. Indeed, we plot for every movie the distribution of the scores of its summaries for the criterion ‘‘lack of redundancy”, and the results are high for all movies (see Fig. 14).

5.3. Comparison results

We compared our system with Parshin's system [34], which is the only summarization system in the literature that proposes an effective interaction with users and tries to collect effective preferences related to the content of the movie. In fact, in [12] the only preference is the number of keyframes of the summary (summary size). In [13], the preferences are the number of keyframes and the processing time. In [44], users are required to provide their own photos so that their preferences can be deduced.

Parshin's system is based on quantifying the preferences of the user through some high-level features, namely the place of action (indoor or outdoor), the time for outdoor shots (day or night), the percentage of human skin, the average quantity of motion and the duration of the semantic segments in the shots (speech, music, silence and noise).

Table 2. Comparison results of our system with Parshin's system.

                        Our system                        Parshin's system
                  First user     Second user        First user     Second user
Movie/criterion   CS RP PT LR    CS RP PT  LR       CS  RP  PT  LR    CS  RP  PT  LR
Cold Mountain      4  4  4  5     5  5  4   4        3   3   3   4     2   3   4   5
Jackie Brown       3  4  5  5     4  3  4   5        2   3   2   5     2   2   2   3
Platoon            4  3  3  5     4  4  4   5        3   4   1   4     3   3   2   5
Secret Window      5  4  4  4     3  4  3   4        4   3   2   4     3   3   3   4
The Lord of the Rings 4  4  4  5  4  4  3   4        2   3   3   4     2   3   2   3
Average            4 3.8  4 4.8   4  4 3.6 4.4      2.8 3.2 2.2 4.2   2.4 2.8 2.6  4

CS: gives a clear idea about the story of the movie. RP: respects the user's preferences. PT: pleasant tempo and easy to understand. LR: lack of redundancy.

Two of the ten users who judged our summarization system were invited to test Parshin's system on the MUSCLE database. The results of the comparison are shown in Table 2.

Although we explained all the features and their impact on the generated summaries, the users did not much appreciate specifying their preferences by quantifying high-level features. They consider that the used features are general and are not related to the exact context of every movie. Besides, they consider that their preferences are more complicated than ‘‘indoor/outdoor” or ‘‘day/night”. For this reason, we perform better on the criterion of respecting the users' preferences.

In our system, we pay a lot of attention to the quality of the generated summaries. Indeed, we try to avoid jumpy and non-smooth summaries and to ensure, through the SS function, that the summaries cover the whole original sequence. Contrary to our system, Parshin's system does not take these two constraints into consideration. It is simply based on tracking some high-level features in the movie: all shots showing these high-level features are automatically integrated, and no effort is made to check whether the selected shots are well spread over the entire movie. That is why we perform better on the criteria ‘‘giving a clear idea about the story of the movie” and ‘‘pleasant tempo and easy to understand”.

Parshin's system gives good results on the criterion ‘‘lack of redundancy” because it clusters the shots to gather similar ones. The score attributed to every shot of a cluster takes into consideration the ‘‘originality” of this shot relative to the other shots of the cluster. The results of our system and those of Parshin's system on this criterion are nearly the same.

6. Conclusions and perspectives

We propose the IM(S)2 system for generating user-oriented video summaries. Contrary to existing systems, which either neglect or ignore the users' preferences, our system involves users directly in the summarization process. In our work, we tried to encompass all the preferences that a user may have, such as the type of scenes (action, romance, dialog), their contents (characters, locations, time, etc.) and the duration of the final summary.

To demonstrate the effectiveness of IM(S)2, we invited 10 users to judge the quality of the generated summaries according to four criteria. The results are generally encouraging; they show that one-class SVMs are successful in bringing out the users' preferences and that the genetic algorithms used are efficient in generating optimal summaries, confirming the results obtained in the TRECVID workshop.

However, the IM(S)2 system still needs some improvements. In the near future we will try to enhance the interaction between the user and the system, and in particular to refine the way in which users specify their preferences.



Within a given shot, a user may be interested only in particular objects or concepts. We therefore propose that users specify, inside the shots, the regions or objects of interest instead of selecting whole shots.

The encouraging results obtained on the movie corpus motivate us to extend our system to other types of corpora. In fact, we have begun to investigate adapting the architecture of our system to news and documentary videos.

Acknowledgments

The authors would like to thank several individuals and groups for making the implementation of this system possible. The authors acknowledge the financial support of this work by grants from the General Direction of Scientific Research and Technological Renovation (DGRSRT), Tunisia, under the ARUB program 01/UR/11/02. We are also grateful to EGIDE and INRIA, France, for sponsoring this work and the three-month research placement of Mehdi Ellouze from 1/11/2007 to 31/1/2008 in the INRIA IMEDIA Team, during which parts of this work were done. We are also grateful to the European project MUSCLE and to Prof. Constantine Kotropoulos from Aristotle University of Thessaloniki for providing the data.

References

[1] L. Agnihotri, N. Dimitrova, J. Kender, J. Zimmerman, Study on requirement specifications for personalized multimedia summarization, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Baltimore, USA, 2003, pp. 757–760.

[2] J. Assfalg, M. Bertini, A. Del Bimbo, W. Nunziati, P. Pala, Soccer highlights detection and recognition using HMMs, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, 2002, pp. 825–828.

[3] A. Benammar, Enhancing query reformulation by combining content and hypertext analyses, in: Proceedings of the European Conference on Information Systems, 2004.

[4] A. Benammar, G. Hubert, J. Mothe, Automatic profile reformulation using a local document analysis, in: Proceedings of the European Conference on IR Research, 2002, pp. 124–134.

[5] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum, New York, 1981.

[6] N. Boujemaa, J. Fauqueur, M. Ferecatu, F. Fleuret, V. Gouet, B.L. Saux, H. Sahbi, IKONA: interactive generic and specific image retrieval, in: Proceedings of the International Workshop on Multimedia, Rocquencourt, France, 2001.

[7] A. Casillas, M.T. Gonzalez, R. Martinez, Document clustering into an unknown number of clusters using a genetic algorithm, in: Proceedings of the International Conference on Text, Speech and Dialogue, Ceske Budejovice, Czech Republic, 2003, pp. 43–49.

[8] M. Ellouze, N. Boujemaa, A.M. Alimi, Scene pathfinder: unsupervised clustering techniques for movie scenes extraction, Multimedia Tools and Applications, in press, doi:10.1007/s11042-009-0325-5.

[9] M. Ellouze, H. Karray, A.M. Alimi, REGIM, Research Group on Intelligent Machines, Tunisia, at TRECVID 2008, BBC Rushes Summarization, in: Proceedings of the International Conference ACM Multimedia, TRECVID BBC Rushes Summarization Workshop, Vancouver, British Columbia, Canada, 2008, pp. 105–108.

[10] M. Ferecatu, Image retrieval with active relevance feedback using both visual and keyword-based descriptors, Ph.D. thesis, University of Versailles Saint-Quentin-en-Yvelines, 2005.

[11] M. Ferecatu, N. Boujemaa, M. Crucianu, Semantic interactive image retrieval combining visual and conceptual content description, Multimedia Systems 13 (2008) 309–322.

[12] A.M. Ferman, A.M. Tekalp, Two-stage hierarchical video summary extraction to match low-level user browsing preferences, IEEE Transactions on Multimedia 5 (2003) 244–256.

[13] M. Furini, F. Geraci, M. Montangero, VISTO: Visual STOryboard for web video browsing, in: Proceedings of the ACM International Conference on Image and Video Retrieval, Amsterdam, The Netherlands, 2007, pp. 635–642.

[14] M. Furini, G. Vittorio, An audio-video summarization scheme based on audio and video analysis, in: Proceedings of the Consumer Communications and Networking Conference, Las Vegas, USA, 2006, pp. 1209–1213.

[15] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1989.

[16] A. Hanjalic, R.L. Lagendijk, J. Biemond, Automatically segmenting movies into logical story units, in: Proceedings of the International Conference on Visual Information Systems, Amsterdam, The Netherlands, 1999, pp. 229–236.

[17] IMDB, http://www.imdb.com/, Last visited June 2008.

[18] H. Karray, M. Ellouze, A.M. Alimi, KKQ: K-frames and K-words extraction for quick news story browsing, International Journal of Information and Communication Technology 1 (2008) 69–76.

[19] H. Karray, M. Ellouze, A.M. Alimi, Indexing video summaries for quick video browsing, Springer, London, 2009, pp. 77–95.

[20] H. Karray, A. Wali, N. Elleuch, A. BenAmmar, M. Ellouze, I. Feki, A.M. Alimi, REGIM at TRECVID 2008: high-level features extraction and video search, TRECVID 2008.

[21] M. Kherallah, H. Karray, M. Ellouze, A.M. Alimi, Toward an interactive device for quick news story browsing, in: Proceedings of the International Conference on Pattern Recognition, Florida, USA, 2008, pp. 1–4.

[22] R. Lienhart, S. Pfeiffer, W. Effelsberg, Video abstracting, Communications of the ACM (1997) 55–62.

[23] B.X. Li, M.I. Sezan, Event detection and summarization in American football broadcast video, in: Proceedings of the Symposium on Electronic Imaging: Science and Technology: Storage and Retrieval for Media Databases, 2002, pp. 202–213.

[24] LIBSVM 2.0, http://www.csie.ntu.edu.tw, Last visited May 2008.

[25] W.-N. Lie, K.-C. Hsu, Video summarization based on semantic feature analysis and user preference, in: Proceedings of the IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing, Taichung, Taiwan, 2008, pp. 486–491.

[26] S. Lu, I. King, M.R. Lyu, Video summarization using greedy method in a constraint satisfaction framework, in: Proceedings of the International Conference on Distributed Multimedia Systems, Florida, USA, 2003, pp. 456–461.

[27] B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the International Joint Conference on Artificial Intelligence, 1981, pp. 674–679.

[28] Y.F. Ma, L. Lu, H.J. Zhang, M.J. Li, A user attention model for video summarization, in: Proceedings of ACM Multimedia, Juan-les-Pins, France, 2002, pp. 533–542.

[29] L.M. Manevitz, M. Yousef, One-class SVMs for document classification, Journal of Machine Learning Research 2 (2002) 139–154.

[30] D. Marcu, The automatic construction of large-scale corpora for summarization research, in: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, California, USA, 1999, pp. 137–144.

[31] M. Mills, A magnifier tool for video data, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, California, USA, 1992, pp. 93–98.

[32] A.G. Money, H. Agius, Video summarization: a conceptual framework and survey of the state of the art, Journal of Visual Communication and Image Representation (2007) 121–143.

[33] N. Omoigui, L. He, A. Gupta, J. Grudin, E. Sanocki, Time-compression: system concerns, usage, and benefits, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Pennsylvania, USA, 1999, pp. 136–143.

[34] V. Parshin, L. Chen, Video summarization based on user-defined constraints and preferences, in: Proceedings of the Conference Recherche d'Information Assistée par Ordinateur, 2004.

[35] K. Peker, A. Divakaran, Adaptive fast playback-based video skimming using a compressed-domain visual complexity measure, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, 2004, pp. 2055–2058.

[36] M. Petkovic, V. Mihajlovic, M. Jonker, S. Djordjevic-Kajan, Multi-modal extraction of highlights from TV Formula 1 programs, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, 2002, pp. 817–820.

[37] S. Pfeiffer, R. Lienhart, S. Fischer, W. Effelsberg, Abstracting digital movies automatically, Journal of Visual Communication and Image Representation 7 (1996) 345–353.

[38] D. Ponceleon, A. Amir, CueVideo: automated multimedia indexing and retrieval, in: Proceedings of the ACM Multimedia Conference, Florida, USA, 1999, p. 199.

[39] D. Sadlier, N. O'Connor, Event detection in field sports video using audio-visual features and a support vector machine, IEEE Transactions on Circuits and Systems for Video Technology (2005) 1225–1233.

[40] B. Schölkopf, J.C. Platt, J.S. Taylor, A.J. Smola, Estimating the support of a high-dimensional distribution, Neural Computation 13 (2001) 1443–1471.

[41] A.F. Smeaton, B. Lehane, N.E. O'Connor, C. Brady, G. Craig, Automatically selecting shots for action movie trailers, in: Proceedings of the ACM International Workshop on Multimedia Information Retrieval, New York, USA, 2006, pp. 231–238.

[42] M.A. Smith, T. Kanade, Video skimming and characterization through the combination of image and language understanding techniques, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Puerto Rico, USA, 1997, pp. 775–781.

[43] STUDIO4NETWORKS, http://www.studio4networks.com/, Last visited June 2008.

[44] Y. Takeuchi, M. Sugimoto, Video summarization using personal photo libraries, in: Proceedings of the ACM International Workshop on Multimedia Information Retrieval, Santa Barbara, USA, 2006, pp. 213–222.

[45] Y. Taniguchi, A. Akutsu, Y. Tonomura, Panorama Excerpts: extracting and packing panoramas for video browsing, in: Proceedings of ACM Multimedia, Seattle, USA, 1997, pp. 427–436.

[46] Y. Taniguchi, A. Akutsu, Y. Tonomura, H. Hamada, An intuitive and efficient access interface to real-time incoming video based on automatic indexing, in: Proceedings of the ACM International Conference on Multimedia, San Francisco, USA, 1995, pp. 25–33.

[47] C. Toklu, S.P. Liou, M. Das, Video abstract: a hybrid approach to generate semantically meaningful video summaries, in: Proceedings of the IEEE International Conference on Multimedia and Expo, New York, USA, 2000, pp. 268–271.

[48] TRECVID, 2003–2008, TREC Video Retrieval Evaluation: <http://www-nlpir.nist.gov/projects/trecvid/>, Last visited May 2008.

[49] B.T. Truong, S. Venkatesh, Video abstraction: a systematic review and classification, ACM Transactions on Multimedia Computing, Communications and Applications 3 (2007).

[50] T. Tsoneva, Automated summarization of movies and TV series on a semantic level, Ph.D. thesis, University of Eindhoven, 2007.

[51] T. Tsoneva, M. Barbieri, H. Weda, Automated summarization of narrative video on a semantic level, in: Proceedings of the International Conference on Semantic Computing, California, USA, 2007, pp. 169–176.

[52] S. Uchihashi, J. Foote, A. Girgensohn, J. Boreczky, Video Manga: generating semantically meaningful video summaries, in: Proceedings of ACM Multimedia, Florida, USA, 1999, pp. 383–392.

[53] X.D. Yu, L. Wang, Q. Tian, P. Xue, Multi-level video representation with application to keyframe extraction, in: Proceedings of the International Conference on Multimedia Modelling, Brisbane, Australia, 2004, pp. 117–121.

[54] H.J. Zhang, D. Zhong, S.W. Smoliar, An integrated system for content-based video retrieval and browsing, Pattern Recognition 30 (1997) 643–658.

[55] Y. Zhuang, Y. Rui, T.S. Huang, S. Mehrotra, Adaptive key frame extraction using unsupervised clustering, in: Proceedings of the IEEE International Conference on Image Processing, Chicago, USA, 1998, pp. 73–82.