Event Oriented Dictionary Learning for Complex Event Detection


Yan Yan, Yi Yang, Deyu Meng, Member, IEEE, Gaowen Liu, Wei Tong, Alexander G. Hauptmann, and Nicu Sebe, Senior Member, IEEE

Abstract— Complex event detection is a retrieval task with the goal of finding videos of a particular event in a large-scale unconstrained Internet video archive, given example videos and text descriptions. Nowadays, different multimodal fusion schemes of low-level and high-level features are extensively investigated and evaluated for the complex event detection task. However, how to effectively select high-level, semantically meaningful concepts from a large pool to assist complex event detection is rarely studied in the literature. In this paper, we propose a novel strategy to automatically select semantically meaningful concepts for the event detection task based on both the event-kit text descriptions and the concepts' high-level feature descriptions. Moreover, we introduce a novel event oriented dictionary representation based on the selected semantic concepts. Toward this goal, we leverage training images (frames) of selected concepts from the Semantic Indexing dataset, with a pool of 346 concepts, in a novel supervised multi-task $\ell_p$-norm dictionary learning framework. Extensive experimental results on the TRECVID multimedia event detection dataset demonstrate the efficacy of our proposed method.

Index Terms— Complex event detection, concept selection, event oriented dictionary learning, supervised multi-task dictionary learning.

Manuscript received July 7, 2014; revised October 17, 2014 and February 11, 2015; accepted March 4, 2015. This work was supported in part by the MIUR Cluster Project Active Ageing at Home, in part by the European Commission Project xLiMe, in part by the Australian Research Council Discovery Projects, in part by the U.S. Army Research Office under Grant W911NF-13-1-0277, and in part by the National Science Foundation under Grant IIS-1251187. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Gang Hua.

Y. Yan, G. Liu, and N. Sebe are with the Department of Information Engineering and Computer Science, University of Trento, Trento 38123, Italy (e-mail: [email protected]; [email protected]; [email protected]).

Y. Yang is with the Centre for Quantum Computation and Intelligent Systems, University of Technology at Sydney, Sydney, NSW 2007, Australia (e-mail: [email protected]).

D. Meng is with the Institute for Information and System Sciences, Xi'an Jiaotong University, Xi'an 710049, China (e-mail: dymeng@mail.xjtu.edu.cn).

W. Tong is with the General Motors Research and Development Center, Warren, MI 48090 USA (e-mail: [email protected]).

A. G. Hauptmann is with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2015.2413294

I. INTRODUCTION

COMPLEX event detection in unconstrained videos has received much attention in the research community recently [1]–[3]. It is a retrieval task with the goal of detecting videos of a particular event in a large-scale Internet video archive, given an event-kit. An event-kit consists of example videos and text descriptions of the event. Unlike traditional action recognition of atomic actions from videos, such as 'walking' or 'jumping', complex event detection aims to detect more complex events such as 'Birthday party', 'Changing a vehicle tire', etc.

An event is a higher level semantic abstraction of video sequences than a concept and consists of many concepts. For example, a 'Birthday party' event can be described by multiple concepts, such as objects (e.g., boy, cake), actions (e.g., talking, walking) and scenes (e.g., at home, in a restaurant). A concept can be detected in a shorter video sequence or even in a single frame, but an event is usually contained in a longer video clip.

Traditional approaches for complex event detection rely on fusing the classification outputs of multiple low-level features [1], e.g. SIFT, STIP, and MOSIFT [4]. Recently, representing videos using high-level features, such as concept detectors [5], appears promising for the complex event detection task. However, state-of-the-art concept detector based approaches for complex event detection have not considered which concepts should be included in the training concept list. This induces redundancy of concepts [2], [6] in the concept list used for vocabulary construction. For example, some concepts are highly unlikely to help detect a certain event; e.g. 'cows' or 'football' are not helpful for detecting events like 'Landing a fish' or 'Working on a sewing project'. Therefore, removing uncorrelated concepts from the vocabulary construction tends to eliminate such redundancy and potentially boosts complex event detection performance.

Intuitively, complex event detection is expected to be more accurate and faster when we build a specific dictionary representation for each event. In this paper, we investigate how to learn a concept-driven, event oriented representation for complex event detection. There are two main issues to be considered in accomplishing this goal. The first is which concepts should be included in the vocabulary construction of the learning framework. Since we want to learn an event oriented dictionary representation, how to properly select qualified concepts for each event is the key issue. This raises the problem of how to optimally select the necessary and meaningful concepts from a large pool of concepts for each event. The second is how to design an effective dictionary learning framework that seamlessly learns the common knowledge from both the low-level features and the high-level concept features.

To facilitate reading, we first introduce the abbreviations used in the paper. SIN stands for the Semantic Indexing dataset [7] containing 346 different categories (concepts) of images, such as car, adult, etc. SIN-MED stands for the high-level concept features using the SIN concept list, representing each MED video by a 346D feature (each dimension represents a concept).

Fig. 1. The event oriented dictionary learning framework.

The overview of our framework is shown in Fig. 1. Firstly, we design a novel method to automatically select semantically meaningful concepts for each MED event based on both the MED event-kit text descriptions and the SIN-MED high-level concept feature representations. Then, we leverage training samples of the selected concepts from the SIN dataset into a jointly supervised multi-task dictionary learning framework. An event specific, semantically meaningful dictionary is learned by embedding the feature representations of the original datasets (both the MED dataset and the SIN dataset) into a hidden shared subspace. We add label information to the learning framework to facilitate the event oriented dictionary learning process. Therefore, the learned sparse codes carry intrinsic discriminative information and naturally lead to effective complex event detection. Moreover, a novel $\ell_p$-norm multi-task dictionary learning formulation is proposed to strengthen the flexibility of the traditional $\ell_1$-norm dictionary learning problem.

To summarize, the contributions of this paper are as follows:

• We propose a novel approach for concept selection and present one of the first works making a comprehensive evaluation of automatic concept selection strategies for event detection;
• We propose event oriented dictionary learning for event detection;
• We construct a supervised multi-task dictionary learning framework which is capable of learning an event oriented dictionary via leveraging information from selected semantic concepts;
• We propose a novel $\ell_p$-norm multi-task dictionary learning framework which is more flexible than the traditional $\ell_1$-norm dictionary learning problem.

II. RELATED WORK

To highlight our research contributions, we now review the related work on (a) Event Detection, (b) Dictionary Learning and (c) Multi-task Learning.

A. Event Detection

With the success of event detection in structured videos, complex event detection from general unconstrained videos, such as those obtained from Internet video sharing web sites like YouTube, has received increasing attention in recent years. Unlike traditional action recognition from videos of atomic actions, such as 'walking' or 'jumping', complex event detection aims to detect more complex events such as 'Birthday party', 'Attempting board trick', 'Changing a vehicle tire', etc. Tamrakar et al. [1] and Lan et al. [8] evaluated different low-level appearance as well as spatio-temporal features, appropriately quantized and aggregated into Bag-of-Words (BoW) descriptors, for NIST TRECVID Multimedia Event Detection. Jiang et al. [9] proposed a method for high-level and low-level feature fusion based on collective classification in three steps: training a classifier from low-level features, encoding high-level features into graphs, and diffusing the scores on the established graph to obtain the final prediction. Natarajan et al. [10] evaluated a large set of low-level audio and visual features as well as high-level information from object detection, speech and video text OCR for event detection. They combined multiple features using a multi-stage feature fusion strategy, with feature-level early fusion using multiple kernel learning (MKL), score-level fusion using Bayesian model combination (BayCom), and weighted average fusion using video specific weights. Tang et al. [11] tackled the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. A conditional model was trained in a max-margin framework, able to automatically discover discriminative and interesting segments of video while simultaneously achieving competitive accuracies on difficult detection and recognition tasks.

Recently, representing video in terms of multi-modal low-level features, e.g. SIFT, STIP, Dense Trajectory, Mel-Frequency Cepstral Coefficients (MFCC), Automatic Speech Recognition (ASR), and Optical Character Recognition (OCR), combined with early or late fusion schemes, is the state of the art [12] for event detection. Despite their good performance, low-level features are incapable of capturing the inherent semantic information in an event. Comparatively, high-level concept features have been shown to be promising for event detection [5]. High-level concept representation approaches have become feasible nowadays due to the availability of large labeled training collections such as ImageNet and TRECVID. However, there are still few research works on how to automatically select useful concepts for event detection. Oh et al. [13] used Latent SVMs for concept weighting and Brown et al. [14] ordered concepts by their discrimination power for each event-kit query. Different from [13] and [14], we focus on learning an event oriented dictionary from the perspective of adapting the selected concepts.


B. Dictionary Learning

Dictionary learning (also called sparse coding) has been shown to be able to find succinct representations of stimuli and to model data vectors as a linear combination of a few elements from a dictionary. Dictionary learning has recently been successfully applied to a variety of problems in computer vision. Yang et al. [15] proposed a spatial pyramid matching approach based on SIFT sparse codes for image classification. The method used selective sparse coding instead of traditional vector quantization to extract salient properties of appearance descriptors of local image patches. Elad and Aharon [16] addressed the image denoising problem, where zero-mean white homogeneous Gaussian additive noise was to be removed from a given image. The approach taken was based on sparse and redundant representations over trained dictionaries. Using the K-SVD algorithm, the authors obtained a dictionary that described the image content effectively. For the image segmentation problem, Mairal et al. [17] proposed an energy formulation with both sparse reconstruction and class discrimination components, jointly optimized during dictionary learning. The approach improved over the state of the art in image segmentation experiments.

Different optimization algorithms have also been proposed to solve dictionary learning problems. Aharon et al. [18] proposed the K-SVD algorithm for adapting dictionaries in order to achieve sparse signal representations. K-SVD is an iterative method that alternates between sparse coding of the examples based on the current dictionary and a process of updating the dictionary atoms to better fit the data. The update of the dictionary columns was combined with an update of the sparse representations, thereby accelerating convergence. Lee et al. [19] presented efficient sparse coding algorithms based on iteratively solving two convex optimization problems: an $\ell_1$-regularized least squares problem and an $\ell_2$-constrained least squares problem. To learn a discriminative dictionary for sparse coding, a label consistent K-SVD (LC-KSVD) algorithm was proposed in [20]. In addition to using class labels of training data, the authors also associated label information with each dictionary item (columns of the dictionary matrix) to enforce discriminability in the sparse codes during the dictionary learning process. More specifically, a new label consistency constraint was introduced and combined with the reconstruction error and the classification error to form a unified objective function. To effectively handle very large training sets and dynamic training data changing over time, Mairal et al. [21] proposed an online optimization algorithm for dictionary learning, based on stochastic approximations, which scales up to large datasets with millions of training samples.

However, as far as we know, there is no research work on how to learn a dictionary representation at the event level for event detection, nor on how to simultaneously leverage semantic information to learn an event oriented dictionary.

C. Multi-Task Learning

Multi-task learning [22] methods aim to simultaneously learn classification/regression models for a set of related tasks. This typically leads to better models compared to a learner that does not account for task relationships. One way to capture the task relatedness among multiple related tasks is to constrain all models to share a common set of features. This motivates group sparsity, i.e. $\ell_{2,p}$-norm regularized learning [23]. Joint feature learning using $\ell_{2,p}$-norm regularization performs well in ideal cases. In practical applications, however, simply using the $\ell_{2,p}$-norm regularization may not be effective for dealing with dirty data which may not fall into a single structure. To this end, the dirty model for multi-task learning was proposed in [24]. Another way to capture the task relationship is to constrain the models from different tasks to share a low-dimensional subspace via the trace norm [25]. The assumption that all models share a common low-dimensional subspace is too restrictive in some applications. To this end, an extension that learns incoherent sparse and low-rank patterns simultaneously was proposed in [26].

Many multi-task learning algorithms assume that all learning tasks are related. In practical applications, however, the tasks may exhibit a more sophisticated group structure where the models of tasks from the same group are closer to each other than those from a different group. There have been many works along this line of research [27], [28], known as clustered multi-task learning (CMTL). Moreover, most multi-task learning formulations assume that all tasks are relevant, which is however not the case in many real-world applications. Robust multi-task learning (RMTL) is aimed at identifying irrelevant (outlier) tasks when learning from multiple tasks [29].

However, there is little work on multi-task learning applied to the dictionary learning problem. The only related theoretical work is [30], which only provides theoretical bounds on the generalization error of dictionary learning for multi-task learning and transfer learning. Multi-task learning has received considerable attention in the computer vision community and has been successfully applied to many computer vision problems, such as image classification [31], head-pose estimation [32], visual tracking [33], multi-view action recognition [34] and egocentric activity recognition [35]. However, to our knowledge, no previous multi-task dictionary learning works have considered the problem of complex event detection.

III. BUILDING AN EVENT SPECIFIC CONCEPT POOL

Concepts, which are related to objects, actions, scenes, attributes, etc., are usually the basic elements for the description of an event. Given the availability of large labeled training collections such as ImageNet and TRECVID, there exists a large pool of concept detectors for event descriptions. However, selecting important concepts is the key issue for concept vocabulary construction. For example, the event 'Landing a fish' is composed of concepts such as 'adult', 'waterscape', 'outdoor' and 'fish'. If the concepts related to an event can be accurately selected, redundancy and unrelated information will be suppressed, potentially improving the event recognition performance. In order to select useful concepts for a specific event, we propose a novel concept selection strategy based on the combination of text and visual information provided by the event-kit descriptions, which combines (i) text-based semantic relatedness from linguistic knowledge of the MED event-kit text description and (ii) Elastic-Net feature selection from the visual high-level representation.

Fig. 2. Linguistic-based concept selection strategy with an example of 'E007: Changing a vehicle tire' in the MED event-kit text description and a corresponding example video provided by NIST.

A. Linguistic: Text-Based Semantic Relatedness

The most widely used resources in Natural Language Processing (NLP) to calculate the semantic relatedness of concepts are WordNet [36] and Wikipedia [37]. Detailed event-kit text descriptions for each MED event are provided by NIST [38]. In this paper, we explore the semantic similarity between each term in the event-kit text description and the 346 SIN visual concept names based on WordNet. Fig. 2 shows an example of the event-kit text description for 'Changing a vehicle tire'.

Intuitively, the more information two concepts share in common, the more similar they are, and the information shared by two concepts is indicated by the information content of the concepts that subsume them in the taxonomy. As illustrated in Fig. 2, we calculate the similarity between each term in the event-kit text descriptions and the 346 SIN visual concept names based on the similarity measurement proposed in [39]. This measurement defines the similarity of two words $w_{1i}$ and $w_{2j}$ as:

$$\mathrm{sim}(w_{1i}, w_{2j}) = \frac{2\,\pi(lcs)}{\pi(w_{1i}) + \pi(w_{2j})}$$

where $w_{1i} \in \{\text{event-kit text descriptions}\}$, $i = 1, \ldots, N_{event\_kit}$, and $w_{2j} \in \{\text{SIN visual concept names}\}$, $j = 1, \ldots, 346$. $lcs$ denotes the lowest common subsumer of two words in the WordNet hierarchy. $\pi$ denotes the information content of a word and is computed as $\pi(w) = \log p(w)$, where $p(w)$ is the probability of encountering an instance of $w$ in the union of WordNet and the event-kit text descriptions. The probability $p(w) = freq(w)/N$ can be estimated from the relative frequency of $w$ and the total number of words $N$ in the union of WordNet and the event-kit text descriptions [40]. In this way, we expect to properly capture the semantic similarity between subjects (e.g. human, crowd) and objects (e.g. animal, vehicle) based on the WordNet hierarchy. Finally, we construct a 346D event-level feature vector representation for each event (each dimension corresponds to a SIN visual concept name) using the MED event-kit text description. A threshold (thr = 0.5 in our experiments) is set to select useful concepts into our final semantic concept list.
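To make this linguistic selection step concrete, the sketch below scores SIN concept names against event-kit terms with Lin's information-content similarity, which NLTK exposes over WordNet synsets. The toy word lists, the Brown-corpus IC file standing in for the paper's WordNet + event-kit frequency estimates, the noun-only synset lookup, and the max-over-synsets pooling are all our assumptions, not details from the paper.

```python
# A minimal sketch of the text-based concept selection, assuming NLTK's
# WordNet interface; the Brown IC corpus is a stand-in for the paper's
# WordNet + event-kit word statistics.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # stand-in information-content stats

def lin_sim(word1, word2, ic=brown_ic):
    """Best Lin similarity over all noun synset pairs of the two words."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            try:
                best = max(best, s1.lin_similarity(s2, ic) or 0.0)
            except Exception:  # no common subsumer / POS mismatch
                continue
    return best

def select_concepts(event_kit_terms, sin_concepts, thr=0.5):
    """Keep SIN concepts whose best match to any event-kit term passes thr."""
    scores = {c: max(lin_sim(t, c) for t in event_kit_terms)
              for c in sin_concepts}
    return {c: s for c, s in scores.items() if s >= thr}

# Hypothetical terms for 'E007: Changing a vehicle tire'.
print(select_concepts(['tire', 'wrench', 'car'], ['vehicle', 'animal', 'kitchen']))
```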

B. Visual High-Level Representation: Elastic-Net Concept Selection

Concept detectors provide a high-level semantic representation for videos with complex contents, which is beneficial for developing powerful retrieval or filtering systems for consumer media [5]. In our case, we first use the SIN dataset to train 346 semantic concept models. We adopt the approach of [41] to extract keyframes from the MED dataset. The trained 346 semantic concept models are used to predict the 346 semantic concepts present in the keyframes of MED videos. Once we have the prediction score of each concept on each keyframe, the keyframe can be represented as a 346D SIN-MED feature indicating the estimated concept probabilities. Finally, the video-level SIN-MED feature is computed as the average of the keyframe-level SIN-MED features.

Fig. 3. Visual high-level semantic representation with Elastic-Net concept selection.

To select the useful concepts for each specific event, we adopt Elastic-Net [42] concept selection, as illustrated in Fig. 3, given the intuition that the learner would generally choose the most representative SIN-MED feature dimensions (concepts) to differentiate events. Elastic-Net is formulated as follows:

$$\min_{u} \|l - Fu\|_2^2 + \alpha_1 \|u\|_1 + \alpha_2 \|u\|_2^2$$

where $l \in \{0, 1\}^n$ indicates the event labels, $F \in \mathbb{R}^{n \times b}$ is the SIN-MED feature matrix ($n$ is the number of samples and $b$ is the SIN-MED feature dimension) and $u \in \mathbb{R}^b$ is the parameter to be optimized. Each dimension of $u$ corresponds to one semantic concept when $F$ is the high-level SIN-MED feature. $\alpha_1$ and $\alpha_2$ are the regularization parameters. We use Elastic-Net instead of LASSO due to the high correlation between concepts in the SIN concept list [7] (see Fig. 4).

Fig. 4. The correlation demonstration between SIN visual concepts. High correlations between contextually related clusters 'G4:car', 'G7:nature', 'G11:urban-scene' (in red) and negative correlations between contextually unrelated clusters 'G7:nature', 'G8:indoor' (in blue) can be easily observed. (Figure is best viewed in color and under zoom).

While LASSO (when $\alpha_2 = 0$) tends to select only a small number of variables from a group and ignore the others, Elastic-Net is capable of automatically taking such correlation information into account through adding the quadratic term $\|u\|_2^2$ to the penalty. We can adjust the value of $\alpha_1$ to control the sparsity degree, i.e., how many semantic concepts are selected in our problem. The selected concepts are the dimensions corresponding to non-zero values of $u$.
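As a concrete illustration of this selection step, the sketch below fits scikit-learn's ElasticNet to a toy label vector and feature matrix and keeps the dimensions with non-zero coefficients. Note that scikit-learn parametrizes the penalty with a single alpha and an l1_ratio rather than separate $\alpha_1$ and $\alpha_2$, so the mapping of the paper's regularization parameters, like the toy shapes, is an assumption of ours.

```python
# A minimal sketch of Elastic-Net concept selection, assuming scikit-learn;
# (alpha, l1_ratio) stand in for the paper's (alpha1, alpha2) up to scaling.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, b = 200, 346                               # samples x SIN-MED dimension
F = rng.random((n, b))                        # toy concept-probability features
l = (F[:, 0] + F[:, 5] > 1.0).astype(float)   # toy binary event labels

model = ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10000)
model.fit(F, l)

selected = np.flatnonzero(model.coef_)          # non-zero dims = kept concepts
top10 = np.argsort(-np.abs(model.coef_))[:10]   # rank by coefficient magnitude
print('selected concept indices:', selected[:20])
print('top-10 concepts:', top10)
```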

C. Union of Selected Concepts

To sum up, we form the union of the semantic concepts selected by the text-based semantic relatedness described in Section III-A and the visual high-level semantic representation described in Section III-B as the final list of selected concepts for each MED event. All the confidence values used are normalized to 0-1 using Z-score normalization and are used for ranking the concepts. In our paper, we give equal weights to the textual and visual selection methods for the final late fusion of confidence scores. The top 10 ranked concepts are finally selected for semantic adaptation. Since we select the most important concepts from the pool, the top 10 ranked concepts are usually already enough to describe the event (see Fig. 8). To evaluate the effectiveness of our proposed strategy, we compare the selected concepts with the groundtruth (we use a human labeled concept list as the groundtruth for each MED event).
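The late fusion just described fits in a few lines; the sketch below z-scores the two confidence vectors, rescales them to 0-1, averages them with the equal weights used above, and keeps the top 10 concepts. The min-max rescaling after the z-score is only our reading of the normalization sentence, and the helper names are ours.

```python
# A minimal sketch of the equal-weight late fusion of textual and visual
# concept confidences, assuming 346-dimensional score vectors.
import numpy as np

def zscore_01(x):
    """Z-score, then rescale to [0, 1] (our reading of the normalization)."""
    z = (x - x.mean()) / (x.std() + 1e-12)
    return (z - z.min()) / (z.max() - z.min() + 1e-12)

def fuse_and_rank(text_scores, visual_scores, k=10):
    fused = 0.5 * zscore_01(text_scores) + 0.5 * zscore_01(visual_scores)
    return np.argsort(-fused)[:k]   # indices of the top-k concepts

rng = np.random.default_rng(0)
print(fuse_and_rank(rng.random(346), rng.random(346)))
```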

IV. EVENT ORIENTED DICTIONARY LEARNING

After we select semantically meaningful concepts for each event, we can leverage training samples of the selected concepts from the SIN dataset into a supervised multi-task dictionary learning framework. In this section, we investigate how to learn an event oriented dictionary representation. To accomplish this goal, we first propose our multi-task dictionary learning framework and then introduce its supervised setting.

A. Multi-Task Dictionary Learning

Given $K$ tasks (e.g. $K = 2$ in our case: one task is the MED dataset and the other is the subset of the SIN dataset whose samples are collected from the selected concepts for each event), each task consists of data samples denoted by $X_k = \{x_k^1, x_k^2, \ldots, x_k^{n_k}\} \in \mathbb{R}^{n_k \times d}$ $(k = 1, \ldots, K)$, where $x_k^i \in \mathbb{R}^d$ is a $d$-dimensional feature vector and $n_k$ is the number of samples in the $k$-th task. We are going to learn a shared subspace across all tasks, obtained by an orthonormal projection $W \in \mathbb{R}^{d \times s}$, where $s$ is the dimensionality of the subspace. In this learned subspace, the data distributions of all tasks should be similar to each other. Therefore, we can code all tasks together in the shared subspace and achieve better coding quality. The benefits of this strategy are: (i) we can improve each individual coding quality by transferring knowledge across all tasks; (ii) we can discover the relationship among different datasets via coding analysis. Such a purpose can be realized through the following optimization problem:

$$\min_{T_k, C_k, W, D} \; \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2$$

$$\text{s.t.} \quad \begin{cases} W^T W = I \\ (T_k)_{j\cdot}(T_k)_{j\cdot}^T \le 1, \quad \forall j = 1, \ldots, l \\ D_{j\cdot} D_{j\cdot}^T \le 1, \quad \forall j = 1, \ldots, l \end{cases} \tag{1}$$

where $T_k \in \mathbb{R}^{l \times d}$ is an overcomplete dictionary ($l > d$) with $l$ prototypes for the $k$-th task, $(T_k)_{j\cdot}$ in the constraints denotes the $j$-th row of $T_k$, and $C_k \in \mathbb{R}^{n_k \times l}$ corresponds to the sparse representation coefficients of $X_k$. In the third term of Eqn. (1), $X_k$ is projected by $W$ into the subspace to explore the relationship among different tasks. $D \in \mathbb{R}^{l \times s}$ is the dictionary learned in the datasets' shared subspace. $D_{j\cdot}$ in the constraints denotes the $j$-th row of $D$. $I$ is the identity matrix. $(\cdot)^T$ denotes the transpose operator. $\lambda_1$ and $\lambda_2$ are the regularization parameters. The first constraint guarantees the learned $W$ to be orthonormal, and the second and third constraints prevent the learned dictionaries from becoming arbitrarily large. In our objective function, we learn a dictionary $T_k$ for each task $k$ and one shared dictionary $D$ among the $K$ tasks. Since one task in our model uses samples from the SIN dataset of selected semantically meaningful concepts, the shared learned dictionary $D$ is the event oriented dictionary. When $\lambda_2 = 0$, Eqn. (1) reduces to traditional dictionary learning on separate tasks.
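To ground the notation, here is a minimal numpy sketch of the objective value in Eqn. (1); the per-task loop mirrors the three sums and all array shapes follow the definitions above. The helper name and the omission of the constraints are our simplifications for illustration.

```python
# A minimal sketch of the Eqn. (1) objective: per-task reconstruction,
# sparsity, and shared-subspace terms; constraints are not enforced here.
import numpy as np

def mtdl_objective(Xs, Cs, Ts, W, D, lam1, lam2):
    """Xs[k]: n_k x d, Cs[k]: n_k x l, Ts[k]: l x d, W: d x s, D: l x s."""
    val = 0.0
    for X, C, T in zip(Xs, Cs, Ts):
        val += np.linalg.norm(X - C @ T, 'fro') ** 2       # reconstruction
        val += lam1 * np.abs(C).sum()                      # l1 sparsity
        val += lam2 * np.linalg.norm(X @ W - C @ D, 'fro') ** 2  # shared subspace
    return val
```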


B. Supervised Multi-Task Dictionary Learning

It is well known that the traditional dictionary learning framework is not directly applicable to classification, and the learned dictionary has merely been used for signal reconstruction [17]. To circumvent this problem, researchers have developed several algorithms to learn a classification-oriented dictionary in a supervised fashion by exploring the label information. In this subsection, we extend our proposed multi-task dictionary learning of Eqn. (1) to be suitable for event detection.

Assuming that the $k$-th task has $m_k$ classes, the label information of the $k$-th task is $Y_k = \{y_k^1, y_k^2, \ldots, y_k^{n_k}\} \in \mathbb{R}^{n_k \times m_k}$ $(k = 1, \ldots, K)$, where $y_k^i = [0, \ldots, 0, 1, 0, \ldots, 0]$ (the position of the non-zero element indicates the class). $\Theta_k \in \mathbb{R}^{l \times m_k}$ is the parameter of the $k$-th task classifier. Inspired by [43], we consider the following optimization problem:

$$\min_{T_k, C_k, \Theta_k, W, D} \; \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k \Theta_k\|_F^2$$

$$\text{s.t.} \quad \begin{cases} W^T W = I \\ (T_k)_{j\cdot}(T_k)_{j\cdot}^T \le 1, \quad \forall j = 1, \ldots, l \\ D_{j\cdot} D_{j\cdot}^T \le 1, \quad \forall j = 1, \ldots, l \end{cases} \tag{2}$$

Compared with Eqn. (1), we add the last term to Eqn. (2) to enforce the model to involve discriminative information for classification. This objective function can simultaneously achieve a desired dictionary with good representation power and support optimal discrimination of the classes in the multi-task setting.

1) Optimization: To solve the proposed objective problem of Eqn. (2), we adopt the alternating minimization algorithm to optimize it with respect to $D$, $T_k$, $C_k$, $\Theta_k$ and $W$ respectively, in five steps as follows:

Step 1 (Fixing $T_k$, $C_k$, $W$, $\Theta_k$, Optimize $D$): If we stack $X = [X_1^T, \ldots, X_K^T]^T$ and $C = [C_1^T, \ldots, C_K^T]^T$, Eqn. (2) is equivalent to:

$$\min_{D} \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 = \min_{D} \|XW - CD\|_F^2 \quad \text{s.t.} \quad D_{j\cdot} D_{j\cdot}^T \le 1, \; \forall j = 1, \ldots, l$$

This is equivalent to the dictionary update stage in the traditional dictionary learning algorithm. We adopt the dictionary update strategy of [21, Algorithm 2] to efficiently solve it.

Step 2 (Fixing $D$, $C_k$, $W$, $\Theta_k$, Optimize $T_k$): Eqn. (2) is equivalent to:

$$\min_{T_k} \|X_k - C_k T_k\|_F^2 \quad \text{s.t.} \quad (T_k)_{j\cdot}(T_k)_{j\cdot}^T \le 1, \; \forall j = 1, \ldots, l$$

This is also equivalent to the dictionary update stage in traditional dictionary learning for the $K$ tasks. We adopt the dictionary update strategy of [21, Algorithm 2] to efficiently solve it.

Step 3 (Fixing $T_k$, $W$, $D$, $\Theta_k$, Optimize $C_k$): Eqn. (2) is equivalent to:

$$\min_{C_k} \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k \Theta_k\|_F^2$$

This formulation can be decoupled into $(n_1 + n_2 + \ldots + n_K)$ distinct problems:

$$\min_{c_k^i} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \left( \|x_k^i - c_k^i T_k\|_2^2 + \lambda_1 \|c_k^i\|_1 + \lambda_2 \|x_k^i W - c_k^i D\|_2^2 + \lambda_3 \|y_k^i - c_k^i \Theta_k\|_2^2 \right)$$

We adopt the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [44] to solve this problem. FISTA solves optimization problems of the form $\min_{\mu} f(\mu) + r(\mu)$, where $f(\mu)$ is convex and smooth, and $r(\mu)$ is convex but non-smooth. We adopt FISTA since it is a popular tool for solving many convex smooth/non-smooth problems and its effectiveness has been verified in many applications. In our setting, we denote the smooth part as $f(c_k^i) = \|x_k^i - c_k^i T_k\|_2^2 + \lambda_2 \|x_k^i W - c_k^i D\|_2^2 + \lambda_3 \|y_k^i - c_k^i \Theta_k\|_2^2$ and the non-smooth part as $g(c_k^i) = \lambda_1 \|c_k^i\|_1$.
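For reference, the sketch below is a generic FISTA loop for this per-sample coding step, with a soft-thresholding proximal step for $g$ and the smooth term $f$ defined above. The fixed step size 1/L (with L a Lipschitz bound from the spectral norm of the quadratic form), the iteration count, and the row-vector convention are our implementation choices; the paper does not specify them.

```python
# A minimal FISTA sketch for the per-sample coding step; c is a row vector
# of length l, and the fixed step size comes from a Lipschitz bound.
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista_code(x, y, T, W, D, Theta, lam1, lam2, lam3, n_iter=100):
    """min_c ||x - cT||^2 + lam2*||xW - cD||^2 + lam3*||y - cTheta||^2 + lam1*||c||_1."""
    l = T.shape[0]
    grad_mat = T @ T.T + lam2 * (D @ D.T) + lam3 * (Theta @ Theta.T)
    rhs = x @ T.T + lam2 * (x @ W) @ D.T + lam3 * y @ Theta.T
    L = 2.0 * np.linalg.norm(grad_mat, 2)      # Lipschitz constant of grad f
    c = np.zeros(l); z = c.copy(); t = 1.0
    for _ in range(n_iter):
        grad = 2.0 * (z @ grad_mat - rhs)      # gradient of the smooth part f
        c_new = soft_threshold(z - grad / L, lam1 / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = c_new + ((t - 1.0) / t_new) * (c_new - c)  # momentum step
        c, t = c_new, t_new
    return c
```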

Step 4 (Fixing $D$, $C_k$, $W$, $T_k$, Optimize $\Theta_k$): Eqn. (2) is equivalent to:

$$\min_{\Theta_k} \|Y_k - C_k \Theta_k\|_F^2$$

Setting $\frac{\partial}{\partial \Theta_k} = 0$, we obtain $\Theta_k = (C_k^T C_k)^{-1} C_k^T Y_k$.

Step 5 (Fixing $T_k$, $C_k$, $D$, $\Theta_k$, Optimize $W$): If we stack $X = [X_1^T, \ldots, X_K^T]^T$ and $C = [C_1^T, \ldots, C_K^T]^T$, Eqn. (2) is equivalent to:

$$\min_{W} \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 = \min_{W} \|XW - CD\|_F^2 \quad \text{s.t.} \quad W^T W = I$$

Substituting $D = (C^T C)^{-1} C^T X W$ back into the above function, we obtain

$$\min_{W} \left\|\left(I - C(C^T C)^{-1} C^T\right) X W\right\|_F^2 = \min_{W} \mathrm{tr}\left(W^T X^T \left(I - C(C^T C)^{-1} C^T\right) X W\right) \quad \text{s.t.} \quad W^T W = I$$

The optimal $W$ is composed of the eigenvectors of the matrix $X^T(I - C(C^T C)^{-1} C^T)X$ corresponding to the $s$ smallest eigenvalues.
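Steps 4 and 5 both have closed forms, so they translate directly into a few lines of numpy. Using lstsq in place of the explicit inverse, and scipy.linalg.eigh's subset_by_index to extract the s smallest eigenvectors, are implementation choices of ours rather than details from the paper.

```python
# A minimal sketch of the closed-form updates in Steps 4 and 5, assuming
# stacked data X (n x d) and codes C (n x l); lstsq replaces the explicit
# matrix inverse for numerical stability.
import numpy as np
from scipy.linalg import eigh

def update_theta(C_k, Y_k):
    """Theta_k = (C_k^T C_k)^{-1} C_k^T Y_k via a least-squares solve."""
    return np.linalg.lstsq(C_k, Y_k, rcond=None)[0]

def update_W(X, C, s):
    """Eigenvectors of X^T (I - C(C^T C)^{-1} C^T) X for the s smallest eigenvalues."""
    n = X.shape[0]
    P = C @ np.linalg.lstsq(C, np.eye(n), rcond=None)[0]  # C (C^T C)^{-1} C^T
    M = X.T @ (np.eye(n) - P) @ X
    _, W = eigh(M, subset_by_index=(0, s - 1))            # ascending eigenvalues
    return W
```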

We summarize our algorithm for solving Eqn. (2) as Algorithm 1.


Algorithm 1 Supervised Multi-Task Dictionary Learning

C. Supervised Multi-Task $\ell_p$-Norm Dictionary Learning

In the literature, it has been shown that using non-convex $\ell_p$-norm minimization ($0 \le p < 1$) can often yield better results than convex $\ell_1$-norm minimization. Inspired by this, we extend our supervised multi-task dictionary learning model to a supervised multi-task $\ell_p$-norm dictionary learning model.

Assuming that the $k$-th task has $m_k$ classes, the label information of the $k$-th task is $Y_k = \{y_k^1, y_k^2, \ldots, y_k^{n_k}\} \in \mathbb{R}^{n_k \times m_k}$ $(k = 1, \ldots, K)$, where $y_k^i = [0, \ldots, 0, 1, 0, \ldots, 0]$ (the position of the non-zero element indicates the class). $\Theta_k \in \mathbb{R}^{l \times m_k}$ is the parameter of the $k$-th task classifier. We formulate our supervised multi-task $\ell_p$-norm dictionary learning problem as follows:

$$\min_{T_k, C_k, \Theta_k, W, D} \; \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_p^p + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k \Theta_k\|_F^2$$

$$\text{s.t.} \quad \begin{cases} W^T W = I \\ (T_k)_{j\cdot}(T_k)_{j\cdot}^T \le 1, \quad \forall j = 1, \ldots, l \\ D_{j\cdot} D_{j\cdot}^T \le 1, \quad \forall j = 1, \ldots, l \end{cases} \tag{3}$$

Compared with Eqn. (2), we replace the traditional sparse coding $\ell_1$-norm term $\|C_k\|_1$ with the more flexible $\ell_p$-norm term $\|C_k\|_p^p$. Since we can adjust the value of $p$ ($0 \le p < 1$) in our framework, our algorithm is more flexible in controlling the sparseness of the feature representation, thus usually resulting in better performance than traditional $\ell_1$-norm sparse coding.

To solve the proposed problem of Eqn. (3), we adopt the alternating minimization algorithm to optimize it with respect to $D$, $T_k$, $C_k$, $\Theta_k$ and $W$ respectively. The update rules for $D$, $T_k$, $\Theta_k$ and $W$ are the same as for Eqn. (2); the only difference is in the update rule of $C_k$. Various algorithms have been proposed for $\ell_p$-norm non-convex sparse coding [45]–[47]. In this paper, we adopt the Generalized Iterated Shrinkage Algorithm (GISA) [48] to solve the proposed problem.
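The core of GISA is a scalar generalized shrinkage (thresholding) operator that replaces the soft-thresholding step when the penalty is $\lambda|x|^p$; the sketch below follows our reading of the operator in [48]. The threshold formula and fixed-point update are reproduced from that paper to the best of our knowledge, and the inner iteration count is arbitrary.

```python
# A minimal sketch of the generalized shrinkage operator used in GISA for
# the penalty lam * |x|^p (0 <= p < 1); applied elementwise to the codes.
import numpy as np

def gst(y, lam, p, n_inner=10):
    """Generalized soft-thresholding: argmin_x 0.5*(x - y)^2 + lam*|x|^p."""
    # Threshold below which the minimizer is exactly zero.
    tau = (2.0 * lam * (1.0 - p)) ** (1.0 / (2.0 - p)) \
        + lam * p * (2.0 * lam * (1.0 - p)) ** ((p - 1.0) / (2.0 - p))
    y = np.asarray(y, dtype=float)
    x = np.zeros_like(y)
    mask = np.abs(y) > tau
    t = np.abs(y[mask])
    for _ in range(n_inner):                   # fixed-point iteration
        t = np.abs(y[mask]) - lam * p * t ** (p - 1.0)
    x[mask] = np.sign(y[mask]) * t
    return x

# Example: shrink a toy code vector with p = 0.6, the paper's best setting.
print(gst(np.array([-1.5, -0.2, 0.05, 0.8, 2.0]), lam=0.3, p=0.6))
```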

Algorithm 2 Supervised Multi-Task $\ell_p$-Norm Dictionary Learning

TABLE I. 18 EVENTS OF MED10 AND MED11

We summarize our algorithm for solving Eqn. (3) as Algorithm 2.

After the optimized $\Theta$ is obtained, the final classification of a test video can be obtained based on its sparse coefficient $c_k^i$, which carries the discriminative information. We can simply apply the linear classifier $c_k^i \Theta_k$ to obtain the predicted score of the video.

V. EXPERIMENTS

In this section, we conduct extensive experiments to test our proposed method using large-scale real world datasets.

A. Datasets

The TRECVID MED10 (P001-P003) and MED11 (E001-E015) datasets are used in our experiments. The datasets consist of 9746 videos from 18 events of interest, with 100-200 examples per event; the rest of the videos are from the background class. The details are listed in Table I.

The TRECVID Semantic Indexing Task (SIN) [7] contains annotations for 346 semantic concepts on 400,000 keyframes from web videos. The 346 concepts are related to objects, actions, scenes, attributes and non-visual concepts, which are all basic elements for an event, e.g. kitchen, boy, girl, bus. For the sake of better understanding and easier concept selection, we manually divide the 346 visual concepts into 15 groups, which are listed in Table II.

TABLE II. 15 GROUPS OF SIN 346 VISUAL CONCEPTS (THE NUMBER OF CONCEPTS FOR EACH GROUP IS IN PARENTHESES)

B. Evaluation Metrics

1) Average Precision (AP): A measure combining recall and precision for ranked retrieval results. The average precision is the mean of the precision scores after each relevant sample is retrieved. A higher number indicates better performance (a short computation sketch follows this list).

2) PMiss@TER = 12.5: An official evaluation metric for event detection as defined by NIST [38]. It is defined as the point at which the ratio between the probability of missed detection and the probability of false alarm is 12.5:1. A lower number indicates better performance.

3) Normalized Detection Cost (NDC): An official evaluation metric for event detection as defined by NIST [38]. It is a weighted linear combination of the system's missed detection and false alarm probabilities. NDC measures the performance of a detection system in the context of an application profile using error rate estimates calculated on a test set. A lower number indicates better performance.
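To make the AP definition above concrete, here is a minimal sketch that computes it directly as the mean of the precision values at each relevant retrieved sample, and cross-checks against scikit-learn's average_precision_score; the toy labels and scores are invented, and the two values agree here because the toy scores are distinct.

```python
# A minimal sketch of average precision as defined above: the mean of the
# precision scores taken after each relevant sample is retrieved.
import numpy as np
from sklearn.metrics import average_precision_score

def average_precision(labels, scores):
    order = np.argsort(-np.asarray(scores))      # rank by descending score
    ranked = np.asarray(labels)[order]
    hits = np.cumsum(ranked)
    precision_at_hits = hits[ranked == 1] / (np.flatnonzero(ranked == 1) + 1)
    return precision_at_hits.mean()

y = [1, 0, 1, 0, 0, 1]
s = [0.9, 0.8, 0.7, 0.5, 0.4, 0.2]
print(average_precision(y, s), average_precision_score(y, s))
```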

C. Experiment Settings and Comparison Methods

There are 3104 videos used for training and 6642 videos used for testing in our experiments. We use three representative features: SIFT, Color SIFT (CSIFT) and Motion SIFT (MOSIFT) [4]. SIFT and CSIFT describe the gradient and color information of images. MOSIFT describes both the optical flow and gradient information of video clips. Finally, 768D SIFT-BoW, CSIFT-BoW and MOSIFT-BoW features are extracted to represent each video. We set the regularization parameters in the range {0.01, 0.1, 1, 10, 100}. The subspace dimensionality s is set by searching the grid {200, 400, 600}. For the experiments in the paper, we try three different dictionary sizes: {768, 1024, 1280}. To evaluate the multi-task $\ell_p$-norm dictionary learning algorithm, the parameter p is tuned in the range {0.2, 0.4, 0.6, 0.8, 1}.

Fig. 5. Example results of the semantic concept selection proposed in Section III on the events (top left) Attempting board trick, (top right) Feeding animal, (bottom left) Flash mob gathering, (bottom right) Making a sandwich. The font size in the figure reflects the ranking value for concepts (the larger the font, the higher the value).

TABLE III. AP PERFORMANCE FOR EACH MED EVENT USING TEXT (T), VISUAL (V) AND TEXT + VISUAL (T + V) INFORMATION FOR CONCEPT SELECTION. THE LAST COLUMN SHOWS THE NUMBER OF CONCEPTS IN THE TOP 10 THAT COINCIDE WITH THE GROUNDTRUTH

We compare our proposed event oriented dictionary learning method with the following important baselines:

• Support Vector Machine (SVM): SVM has been widely used by several research groups for MED and has shown its robustness [8], [49], [50], so we use it as one of the comparison algorithms (RBF, Histogram Intersection and $\chi^2$ kernels are used respectively);

• Single Task Supervised Dictionary Learning (ST-SDL): performing supervised dictionary learning on each task separately;
• Pooling Tasks Supervised Dictionary Learning (PT-SDL): performing single task supervised dictionary learning by simply aggregating data from all tasks;
• Multiple Kernel Transfer Learning (MKTL) [51]: a method incorporating prior features into a multiple kernel learning framework. We use the code provided by the authors (http://homes.esat.kuleuven.be/~ttommasi/source_code_ICCV11.html);
• Dirty Model Multi-Task Learning (DMMTL) [24]: a state-of-the-art multi-task learning method imposing $\ell_1/\ell_q$-norm regularization. We use the code provided by the MALSAR toolbox (http://www.public.asu.edu/~jye02/Software/MALSAR/);
• Multiple Kernel Learning Latent Variable Approach (MKLLVA) [52]: a multiple kernel learning latent variable approach for complex video event detection;
• Random Concept Selection Strategy (RCSS): performing our proposed supervised multi-task dictionary learning without the concept selection strategy (leveraging random samples);
• Simple Concept Selection Strategy (SCSS): each concept is independently used as a predictor for each event, yielding a certain average precision performance for each event. The concepts are ranked according to the AP measurement for each event, and the top 10 concepts are retained.

TABLE IV. COMPARISON OF AVERAGE DETECTION ACCURACY OF DIFFERENT METHODS FOR THE SIFT FEATURE. BEST RESULTS ARE HIGHLIGHTED IN BOLD

Fig. 6. Comparison of AP performance of different methods for each MED event. (Figure is best viewed in color and under zoom).

D. Experimental Results

We first calculate the covariance matrix of the 346 SIN concepts, shown in Fig. 4, where the visual concepts are grouped and sorted as in Table II. As shown in Fig. 4, the covariance matrix shows high within-cluster correlations along the diagonal and also relatively high correlations between contextually related clusters (in red), such as 'G4:car', 'G7:nature' and 'G11:urban-scene'. One can also easily observe negative correlations between contextually unrelated clusters (in blue), such as 'G7:nature' and 'G8:indoor'. This gives us the intuition that visual concept co-occurrence exists. Therefore, removing redundant concepts and selecting related concepts for each event is expected to be helpful for the event detection task.

Fig. 5 shows the results of our concept selection strategy for the events 'Attempting board trick', 'Feeding animal', 'Flash mob gathering' and 'Making a sandwich'. From Fig. 5, we can observe that the selected concepts are reasonably consistent with human selections.

To better assess the effectiveness of our proposed concept selection strategy, we compare our selected top 10 concepts with the groundtruth (we use the ranking list of human labeled concepts as the groundtruth for each MED event). The results are listed in the last column of Table III, showing the number of concepts in the top 10 that coincide with the groundtruth. The AP performance for event detection based on text information, visual information and their combination is also shown in Table III. The benefit of using both text and visual information for concept selection can be concluded from Table III.

Table IV shows the average detection results over the 18 MED events for the different comparison methods. We have the following observations: (1) Comparing ST-SDL with SVM, we observe that performing supervised dictionary learning is better than SVM, which shows the effectiveness of dictionary learning for MED. (2) Comparing PT-SDL with ST-SDL, leveraging knowledge from the SIN dataset improves the performance for MED. (3) Our concept selection strategy for semantic dictionary learning performs the best for MED among all the comparison methods. (4) Our proposed method outperforms SVM by 8%, 6% and 10% with respect to AP, PMiss@TER = 12.5 and MinNDC, respectively. (5) Considering the difficulty of the MED dataset and the typically low AP performance on MED, the absolute 8% AP improvement is very significant.

Fig. 6 shows the AP results for each MED event. Our proposed method achieves the best performance for 13 of the 18 events. It is also interesting to notice that the larger improvements in Fig. 6, such as 'E004: Wedding ceremony', 'E005: Working wood working project' and 'E009: Getting a vehicle unstuck', usually correspond to a higher number of selected concepts that coincide with the groundtruth. This gives us evidence of the effectiveness of our proposed automatic concept selection strategy.

Fig. 7 illustrates the MAP performance of the different methods based on SIFT, CSIFT and MOSIFT features. It can be easily observed that our proposed supervised multi-task dictionary learning with our concept selection strategy outperforms SVM by more than 8%.

Fig. 7. Comparison of MAP performance of different methods for different types of features.

Fig. 8. MAP performance variation with respect to the number of selected concepts.

Fig. 9. MAP performance variation with respect to (left) different dictionary sizes; (right) different subspace dimensionality.

Moreover, we evaluate our proposed method with respect to different numbers of selected concepts, dictionary sizes and subspace dimensionality settings based on the SIFT feature. Fig. 8 shows that the proposed method achieves the best MAP results when the number of selected concepts is 10. Fig. 9 (left) shows that the proposed method achieves the best MAP results when the dictionary size is 1024; too large or too small a dictionary size tends to hamper the performance. Fig. 9 (right) shows that the best MAP result is achieved when the subspace dimensionality is 400 (dictionary size = 1024); overly large or small subspace dimensionality also degrades the performance.

Finally, we also study the parameter sensitivity of the proposed method in Fig. 10. Here, we fix $\lambda_3 = 1$ (discriminative information contribution fixed) and $p = 0.6$, and analyze the regularization parameters $\lambda_1$ and $\lambda_2$. As shown in Fig. 10 (left), we observe that the proposed method is more sensitive to $\lambda_2$ than to $\lambda_1$, which demonstrates the importance of the subspace for multi-task dictionary learning. Moreover, to understand the influence of the parameter $p$ on our proposed supervised $\ell_p$-norm dictionary learning algorithm, we also perform a sensitivity experiment. Fig. 10 (right) demonstrates that the best performance of the supervised $\ell_p$-norm dictionary learning algorithm is achieved when $p = 0.6$. More than 2% additional MAP can be achieved if we adopt the $\ell_p$-norm model compared with the fixed $\ell_1$-norm model. This shows the suboptimality of traditional $\ell_1$-norm sparse coding compared with the flexible $\ell_p$-norm sparse coding.

Fig. 10. Sensitivity study of parameters on (left) $\lambda_1$ and $\lambda_2$ with fixed $p$ and $\lambda_3$; (right) $p$ and $\lambda_3$ with fixed $\lambda_1$ and $\lambda_2$.

Fig. 11. Convergence rate study on (left) supervised multi-task dictionary learning and (right) supervised multi-task $\ell_p$-norm dictionary learning.

The proposed iterative approaches monotonically decrease the objective function values of Eqn. (2) and Eqn. (3) until convergence. Fig. 11 shows the convergence rate curves of our algorithms. It can be observed that the objective function values converge quickly: our approaches usually converge after at most 6 iterations for supervised multi-task dictionary learning and 7 iterations for supervised multi-task $\ell_p$-norm dictionary learning (precision = $10^{-6}$).

Regarding the computational cost of our proposed algorithm for supervised multi-task $\ell_p$-norm dictionary learning, we train our model for the TRECVID MED dataset in 5 hours with cross-validation on a workstation with Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz × 17 processors. Our proposed event oriented dictionary learning approach can be easily parallelized on multi-core computers due to its 'event oriented' nature, which means that our algorithms would be scalable to large-scale problems.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have first investigated the possibility of automatically selecting semantically meaningful concepts for complex event detection, based on both the MED event-kit text descriptions and the high-level concept feature descriptions. Then, we attempted to learn an event oriented dictionary representation based on the selected semantic concepts. To this aim, we leveraged training samples of selected concepts from the SIN dataset in a novel jointly supervised multi-task dictionary learning framework.


Extensive experimental results on the TRECVID MED dataset showed that our proposed method outperforms several important baseline methods. Our proposed method outperformed SVM by up to 8% MAP, which shows the effectiveness of dictionary learning for TRECVID MED. More than 6% and 3% MAP improvements were achieved compared with ST-SDL and PT-SDL respectively, which shows the advantage of the multi-task setting in our proposed framework. To show the benefit of the concept selection strategy, we compared RCSS to our method and showed that RCSS achieves 4% less MAP.

For some sparse coding problems, non-convex ℓp-norm minimization (0 ≤ p < 1) can often obtain better results than the convex ℓ1-norm minimization. Inspired by this, we extended our supervised multi-task dictionary learning model to a supervised multi-task ℓp-norm dictionary learning model. We evaluated the influence of the ℓp-norm parameter p in our proposed problem and found that more than 2% MAP can be gained by adopting the more flexible ℓp-norm model compared with the fixed ℓ1-norm model.
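The building block of the ℓp-norm sparse coding step is a scalar shrinkage operator. As one concrete instance, the generalized shrinkage-thresholding (GST) operator of Zuo et al. [48] solves min_x 0.5(x − y)^2 + λ|x|^p for 0 < p < 1; the sketch below is a straightforward NumPy transcription of that operator, offered as an illustration rather than our full solver.

```python
# Minimal sketch of the GST operator [48] for lp-norm shrinkage, 0 < p < 1.
import numpy as np

def gst(y, lam, p, n_iter=10):
    # Threshold below which the minimizer of 0.5*(x-y)^2 + lam*|x|^p is 0.
    tau = ((2 * lam * (1 - p)) ** (1 / (2 - p))
           + lam * p * (2 * lam * (1 - p)) ** ((p - 1) / (2 - p)))
    y_abs = np.abs(y)
    keep = y_abs > tau               # entries that survive thresholding
    x = np.where(keep, y_abs, 0.0)
    for _ in range(n_iter):          # fixed-point iteration for |x|
        x = np.where(keep,
                     y_abs - lam * p * np.maximum(x, 1e-12) ** (p - 1),
                     0.0)
    return np.sign(y) * x

# Example: gst(np.array([-2.0, 0.1, 3.0]), lam=0.5, p=0.6)
```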

Overall, the proposed multi-task dictionary learning solutions are novel in the context of complex event detection, which is a relevant and important research problem in applications such as image and video understanding and surveillance. Future research involves (i) the integration of knowledge from multiple sources (video, audio, text) and the incorporation of kernel learning in our framework, and (ii) the use of deep structures instead of a shallow single-layer model in the proposed problem, since deep learning has achieved remarkable success in many different fields of image processing and computer vision.

ACKNOWLEDGEMENT

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of ARO, the National Science Foundation, or the U.S. Government.

REFERENCES

[1] A. Tamrakar et al., "Evaluation of low-level features and their combinations for complex event detection in open source videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3681–3688.
[2] Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe, and A. G. Hauptmann, "Complex event detection via multi-source video attributes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2627–2633.
[3] Y. Yan, Y. Yang, H. Shen, D. Meng, G. Liu, N. Sebe, and A. G. Hauptmann, "Complex event detection via event oriented dictionary learning," in Proc. AAAI Conf. Artif. Intell., 2015.
[4] M.-Y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-09-161, 2009.
[5] C. G. M. Snoek and A. W. M. Smeulders, "Visual-concept search solved?" Computer, vol. 43, no. 6, pp. 76–78, Jun. 2010.
[6] C. Sun and R. Nevatia, "ACTIVE: Activity concept transitions in video event classification," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 913–920.
[7] SIN. (2013). NIST TRECVID Semantic Indexing. [Online]. Available: http://www-nlpir.nist.gov/projects/tv2013/tv2013.html#sin
[8] Z.-Z. Lan et al., "Informedia E-lamp @ TRECVID 2013 multimedia event detection and recounting (MED and MER)," in Proc. NIST TRECVID Workshop, 2013.
[9] L. Jiang, A. G. Hauptmann, and G. Xiang, "Leveraging high-level and low-level features for multimedia event detection," in Proc. ACM Int. Conf. Multimedia, 2012, pp. 449–458.
[10] P. Natarajan et al., "Multimodal feature fusion for robust event detection in web videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1298–1305.
[11] K. Tang, D. Koller, and L. Fei-Fei, "Learning latent temporal structure for complex event detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1250–1257.
[12] Z.-Z. Lan, L. Bao, S.-I. Yu, W. Liu, and A. G. Hauptmann, "Double fusion for multimedia event detection," in Proc. Int. Conf. Multimedia Modeling, 2012, pp. 173–185.
[13] S. Oh et al., "Multimedia event detection with multimodal feature fusion and temporal concept localization," Mach. Vis. Appl., vol. 25, no. 1, pp. 49–69, Jan. 2014.
[14] L. Brown et al., "IBM Research and Columbia University TRECVID-2013 multimedia event detection (MED), multimedia event recounting (MER), surveillance event detection (SED), and semantic indexing (SIN) systems," in Proc. NIST TRECVID Workshop, 2013.
[15] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1794–1801.
[16] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Discriminative learned dictionaries for local image analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[18] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[19] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2006, pp. 801–808.
[20] Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1697–1704.
[21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 689–696.
[22] T. Evgeniou and M. Pontil, "Regularized multi-task learning," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 109–117.
[23] A. Argyriou, T. Evgeniou, and M. Pontil, "Multi-task feature learning," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2007, pp. 1–8.
[24] A. Jalali, P. K. Ravikumar, S. Sanghavi, and C. Ruan, "A dirty model for multi-task learning," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2010.
[25] S. Ji and J. Ye, "An accelerated gradient method for trace norm minimization," in Proc. Int. Conf. Mach. Learn., 2009, pp. 457–464.
[26] J. Chen, J. Liu, and J. Ye, "Learning incoherent sparse and low-rank patterns from multiple tasks," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, p. 22.
[27] L. Jacob, F. R. Bach, and J.-P. Vert, "Clustered multi-task learning: A convex formulation," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2008.
[28] J. Zhou, J. Chen, and J. Ye, "Clustered multi-task learning via alternating structure optimization," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2011, pp. 702–710.
[29] J. Chen, J. Zhou, and J. Ye, "Integrating low-rank and group-sparse structures for robust multi-task learning," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 42–50.
[30] A. Maurer, M. Pontil, and B. Romera-Paredes, "Sparse coding for multitask and transfer learning," in Proc. Int. Conf. Mach. Learn., 2013, pp. 343–351.
[31] X.-T. Yuan and S. Yan, "Visual classification with multi-task joint sparse representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3493–3500.
[32] Y. Yan, E. Ricci, R. Subramanian, O. Lanz, and N. Sebe, "No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1177–1184.
[33] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, "Robust visual tracking via multi-task sparse learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2042–2049.
[34] Y. Yan, E. Ricci, R. Subramanian, G. Liu, and N. Sebe, "Multitask linear discriminant analysis for view invariant action recognition," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5599–5611, Dec. 2014.
[35] Y. Yan, E. Ricci, G. Liu, and N. Sebe, "Recognizing daily activities from first-person videos with multi-task clustering," in Proc. Asian Conf. Comput. Vis., 2014, pp. 1–16.
[36] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. Cambridge, MA, USA: MIT Press, 1998.
[37] M. Strube and S. P. Ponzetto, "WikiRelate! Computing semantic relatedness using Wikipedia," in Proc. AAAI Conf. Artif. Intell., 2006, pp. 1–6.
[38] NIST. (2013). NIST TRECVID Multimedia Event Detection. [Online]. Available: http://www.nist.gov/itl/iad/mig/med13.cfm
[39] D. Lin, "An information-theoretic definition of similarity," in Proc. 15th Int. Conf. Mach. Learn., 1998, pp. 296–304.
[40] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy," in Proc. 14th Int. Joint Conf. Artif. Intell., 1995, pp. 448–453.
[41] J. Luo, C. Papin, and K. Costello, "Towards extracting semantically meaningful key frames from personal video clips: From humans to computers," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 2, pp. 289–301, Feb. 2009.
[42] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Statist. Soc., Ser. B, vol. 67, no. 2, pp. 301–320, Apr. 2005.
[43] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2691–2698.
[44] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[45] I. F. Gorodnitsky and B. D. Rao, "Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm," IEEE Trans. Signal Process., vol. 45, no. 3, pp. 600–616, Mar. 1997.
[46] Y. She, "Thresholding-based iterative selection procedures for model selection and shrinkage," Electron. J. Statist., vol. 3, pp. 384–415, Nov. 2009.
[47] D. Krishnan and R. Fergus, "Fast image deconvolution using hyper-Laplacian priors," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2009, pp. 1033–1041.
[48] W. Zuo, D. Meng, L. Zhang, X. Feng, and D. Zhang, "A generalized iterated shrinkage algorithm for non-convex sparse coding," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 217–224.
[49] L. Cao et al., "IBM Research and Columbia University TRECVID-2011 multimedia event detection (MED) system," in Proc. NIST TRECVID Workshop, Dec. 2011.
[50] D. Oneata et al., "AXES at TRECVid 2012: KIS, INS, and MED," in Proc. NIST TRECVID Workshop, 2012.
[51] L. Jie, T. Tommasi, and B. Caputo, "Multiclass transfer learning from unconstrained priors," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 1863–1870.
[52] A. Vahdat, K. Cannons, G. Mori, S. Oh, and I. Kim, "Compositional models for video event detection: A multiple kernel learning latent variable approach," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1185–1192.

Yan Yan received the Ph.D. degree from the University of Trento, in 2014. He is currently a Post-Doctoral Researcher with the MHUG Group, University of Trento. His research interests include machine learning and its application to computer vision and multimedia analysis.

Yi Yang received the Ph.D. degree in computer science from Zhejiang University. He was a Post-Doctoral Research Fellow with the School of Computer Science, Carnegie Mellon University, from 2011 to 2013. He is currently a Senior Lecturer with the Centre for Quantum Computation and Intelligent Systems, University of Technology at Sydney, Sydney. His research interests include multimedia and computer vision.

Deyu Meng (M'13) received the B.Sc., M.Sc., and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 2001, 2004, and 2008, respectively. He was a Visiting Scholar with Carnegie Mellon University, Pittsburgh, PA, USA, from 2012 to 2014. He is currently an Associate Professor with the Institute for Information and System Sciences, School of Mathematics and Statistics, Xi'an Jiaotong University. His current research interests include principal component analysis, nonlinear dimensionality reduction, feature extraction and selection, compressed sensing, and sparse machine learning methods.

Gaowen Liu received the B.S. degree in automation from Qingdao University, China, in 2006, and the M.S. degree in system engineering from the Nanjing University of Science and Technology, China, in 2008. She is currently pursuing the Ph.D. degree with the MHUG Group, University of Trento, Italy. Her research interests include machine learning and its application to computer vision and multimedia analysis.

Wei Tong received the Ph.D. degree in computer science from Michigan State University, in 2010. He is currently a Researcher with the General Motors Research and Development Center. His research interests include machine learning, computer vision, and autonomous driving.

Alexander G. Hauptmann received the B.A. and M.A. degrees in psychology from Johns Hopkins University, Baltimore, MD, the degree in computer science from the Technische Universität Berlin, Berlin, Germany, in 1984, and the Ph.D. degree in computer science from Carnegie Mellon University (CMU), Pittsburgh, PA, in 1991. He worked on speech and machine translation from 1984 to 1994, when he joined the Informedia project for digital video analysis and retrieval, and led the development and evaluation of news-on-demand applications. He is currently a Faculty Member with the Department of Computer Science and the Language Technologies Institute, CMU. His research interests include several different areas, including man–machine communication, natural language processing, speech understanding and synthesis, video analysis, and machine learning.


Nicu Sebe (M'01–SM'11) received the Ph.D. degree from Leiden University, The Netherlands, in 2001. He is currently with the Department of Information Engineering and Computer Science, University of Trento, Italy, where he leads the research in the areas of multimedia information retrieval and human behavior understanding. He is a Senior Member of the Association for Computing Machinery and a fellow of the International Association for Pattern Recognition. He was the General Co-Chair of FG 2008 and ACM Multimedia 2013, and the Program Chair of CIVR 2007 and 2010, and ACM Multimedia 2007 and 2011. He will be the Program Chair of ECCV 2016 and ICCV 2017.