IEEE TRANSACTIONS ON IMAGE PROCESSING 1
Event Oriented Dictionary Learning for Complex Event Detection
Yan Yan, Yi Yang, Deyu Meng, Member, IEEE, Gaowen Liu, Wei Tong, Alexander G. Hauptmann, and Nicu Sebe, Senior Member, IEEE
Abstract— Complex event detection is a retrieval task with the goal of finding videos of a particular event in a large-scale unconstrained Internet video archive, given example videos and text descriptions. Nowadays, different multimodal fusion schemes of low-level and high-level features are extensively investigated and evaluated for the complex event detection task. However, how to effectively select semantically meaningful high-level concepts from a large pool to assist complex event detection is rarely studied in the literature. In this paper, we propose a novel strategy to automatically select semantically meaningful concepts for the event detection task based on both the event-kit text descriptions and the concepts' high-level feature descriptions. Moreover, we introduce a novel event oriented dictionary representation based on the selected semantic concepts. Toward this goal, we leverage training images (frames) of the selected concepts from the Semantic Indexing dataset, with a pool of 346 concepts, into a novel supervised multi-task ℓp-norm dictionary learning framework. Extensive experimental results on the TRECVID multimedia event detection dataset demonstrate the efficacy of our proposed method.

Index Terms— Complex event detection, concept selection, event oriented dictionary learning, supervised multi-task dictionary learning.
I. INTRODUCTION
COMPLEX event detection in unconstrained videos has received much attention in the research community recently [1]–[3]. It is a retrieval task with the goal of detecting videos of a particular event in a large-scale Internet video archive, given an event-kit. An event-kit consists of example videos and text descriptions of the event. Unlike traditional
Manuscript received July 7, 2014; revised October 17, 2014 and February 11, 2015; accepted March 4, 2015. This work was supported in part by the MIUR Cluster Project Active Ageing at Home, in part by the European Commission Project xLiMe, in part by the Australian Research Council Discovery Projects, in part by the U.S. Army Research Office under Grant W911NF-13-1-0277, and in part by the National Science Foundation under Grant IIS-1251187. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Gang Hua.
Y. Yan, G. Liu, and N. Sebe are with the Department of Information Engineering and Computer Science, University of Trento, Trento 38123, Italy (e-mail: yan@disi.unitn.it; gaowen.liu@unitn.it; sebe@disi.unitn.it).
Y. Yang is with the Centre for Quantum Computation and Intelligent Systems, University of Technology at Sydney, Sydney, NSW 2007, Australia (e-mail: yee.i.yang@gmail.com).
D. Meng is with the Institute for Information and System Sciences, Xi'an Jiaotong University, Xi'an 710049, China (e-mail: dymeng@mail.xjtu.edu.cn).
W. Tong is with the General Motors Research and Development Center, Warren, MI 48090 USA (e-mail: tongweig@gmail.com).
A. G. Hauptmann is with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: alex@cs.cmu.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2015.2413294
action recognition of atomic actions from videos, such as 'walking' or 'jumping', complex event detection aims to detect more complex events such as 'Birthday party', 'Changing a vehicle tire', etc.

An event is a higher level semantic abstraction of video sequences than a concept and consists of many concepts. For example, a 'Birthday party' event can be described by multiple concepts, such as objects (e.g., boy, cake), actions (e.g., talking, walking) and scenes (e.g., at home, in a restaurant). A concept can be detected in a shorter video sequence or even in a single frame, but an event is usually contained in a longer video clip.
Traditional approaches for complex event detection rely on fusing the classification outputs of multiple low-level features [1], e.g., SIFT, STIP, MoSIFT [4]. Recently, representing videos using high-level features, such as concept detectors [5], has appeared promising for the complex event detection task. However, the state-of-the-art concept detector based approaches for complex event detection have not considered which concepts should be included in the training concept list. This induces redundancy of concepts [2], [6] in the concept list used for vocabulary construction. For example, some concepts are highly unlikely to help detect a certain event, e.g., 'cows' or 'football' are not helpful to detect events like 'Landing a fish' or 'Working on a sewing project'. Therefore, removing the uncorrelated concepts from the vocabulary construction tends to eliminate such redundancy and potentially boosts the complex event detection performance.
Intuitively, complex event detection is expected to be more accurate and faster when we build a specific dictionary representation for each event. In this paper, we investigate how to learn a concept-driven event oriented representation for complex event detection. There are two important issues to be considered for accomplishing this goal. The first issue is which concepts should be included in the vocabulary construction of the learning framework. Since we want to learn an event oriented dictionary representation, properly selecting qualified concepts for each event in the learning framework is the key issue. This raises the problem of how to optimally select the necessary and meaningful concepts from a large pool of concepts for each event. The second issue is how to design an effective dictionary learning framework to seamlessly learn the common knowledge from both the low-level features and the high-level concept features.
To facilitate reading, we first introduce the abbreviations used in the paper. SIN stands for the Semantic Indexing dataset [7] containing 346 different categories (concepts)
1057-7149 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 1. The event oriented dictionary learning framework.
of images, such as car, adult, etc. MED stands for the TRECVID Multimedia Event Detection dataset. SIN-MED stands for the high-level concept features using the SIN concept list, representing each MED video by a 346D feature (each dimension represents a concept).
The overview of our framework is shown in Fig. 1. Firstly, we design a novel method to automatically select semantically meaningful concepts for each MED event based on both the MED event-kit text descriptions and the SIN-MED high-level concept feature representations. Then, we leverage training samples of the selected concepts from the SIN dataset in a jointly supervised multi-task dictionary learning framework. An event specific, semantically meaningful dictionary is learned by embedding the feature representations of the original datasets (both the MED dataset and the SIN dataset) into a hidden shared subspace. We add label information to the learning framework to facilitate the event oriented dictionary learning process. Therefore, the learned sparse codes carry intrinsic discriminative information and naturally lead to effective complex event detection. Moreover, a novel ℓp-norm multi-task dictionary learning formulation is proposed to strengthen the flexibility of the traditional ℓ1-norm dictionary learning problem.
To summarize, the contributions of this paper are as follows:
• We propose a novel approach for concept selection and present one of the first works making a comprehensive evaluation of automatic concept selection strategies for event detection;
• We propose the event oriented dictionary learning for event detection;
• We construct a supervised multi-task dictionary learning framework which is capable of learning an event oriented dictionary via leveraging information from selected semantic concepts;
• We propose a novel ℓp-norm multi-task dictionary learning framework which is more flexible than the traditional ℓ1-norm dictionary learning problem.
II. RELATED WORK

To highlight our research contributions, we now review the related work on (a) Event Detection, (b) Dictionary Learning and (c) Multi-task Learning.

A. Event Detection
A. Event Detection 118
With the success of event detection in structured videos, 119
complex event detection from general unconstrained videos, 120
such as those obtained from internet video sharing web sites 121
like YouTube, has received increasing attention in recent 122
years. Unlike traditional action recognition from videos of 123
atomic actions, such as ‘walking’ or ‘jumping’, complex 124
event detection aims to detect more complex events such 125
as ‘Birthday party’, ‘Attempting board trick’, ‘Changing 126
a vehicle tire’, etc. Tamrakar et al. [1] and Lan et al. [8] 127
evaluated different low-level appearance as well as spatio- 128
temporal features, appropriately quantized and aggregated 129
into Bag-of-Words (BoW) descriptors for NIST TRECVID 130
Multimedia Event Detection. Jiang et al. [9] proposed a 131
method for high-level and low-level feature fusion based on 132
collective classification from three steps which are training a 133
classifier from low-level features, encoding high-level features 134
into graphs, and diffusing the scores on the established graph 135
to obtain the final prediction. Natarajan et al. [10] evaluated 136
a large set of low-level audio and visual features as well as 137
high-level information from object detection, speech and video 138
text OCR for event detection. They combined multiple features 139
using a multi-stage feature fusion strategy with feature level 140
early fusion using multiple kernel learning (MKL) and score 141
level fusion using Bayesian model combination (BayCom) 142
and weighted average fusion using video specific weights. 143
Tang et al. [11] tackled the problem of understanding the 144
temporal structure of complex events in highly varying videos 145
obtained from the Internet. A conditional model was trained 146
in a max-margin framework able to automatically discover 147
discriminative and interesting segments of video, while 148
simultaneously achieving competitive accuracies on difficult 149
detection and recognition tasks. 150
Recently, representing video in terms of multi-modal low-level features, e.g., SIFT, STIP, Dense Trajectory, Mel-Frequency Cepstral Coefficients (MFCC), Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), combined with early or late fusion schemes, is the state of the art [12] for event detection. Despite their good performance, low-level features are incapable of capturing the inherent semantic information in an event. Comparatively, high-level concept features have been shown to be promising for event detection [5]. High-level concept representation approaches have become feasible nowadays due to the availability of large labeled training collections such as ImageNet and TRECVID. However, there are still few research works on how to automatically select useful concepts for event detection. Oh et al. [13] used Latent SVMs for concept weighting and Brown et al. [14] ordered concepts by their discrimination power for each event-kit query. Different from [13] and [14], we focus on learning an event oriented dictionary from the perspective of adapting the selected concepts.
B. Dictionary Learning

Dictionary learning (also called sparse coding) has been shown to be able to find succinct representations of stimuli and model data vectors as a linear combination of a few elements from a dictionary. Dictionary learning has recently been applied successfully to a variety of problems in computer vision. Yang et al. [15] proposed a spatial pyramid matching approach based on SIFT sparse codes for image classification. The method used selective sparse coding instead of the traditional vector quantization to extract salient properties of appearance descriptors of local image patches. Elad and Aharon [16] addressed the image denoising problem, where zero-mean white and homogeneous Gaussian additive noise was to be removed from a given image. The approach taken was based on sparse and redundant representations over trained dictionaries. Using the K-SVD algorithm, the authors obtained a dictionary that described the image content effectively. For the image segmentation problem, Mairal et al. [17] proposed an energy formulation with both sparse reconstruction and class discrimination components, jointly optimized during dictionary learning. The approach improved over the state of the art in image segmentation experiments.
Different optimization algorithms have also been proposed to solve dictionary learning problems. Aharon et al. [18] proposed a novel K-SVD algorithm for adapting dictionaries in order to achieve sparse signal representations. K-SVD is an iterative method that alternates between sparse coding of the examples based on the current dictionary and a process of updating the dictionary atoms to better fit the data. The update of the dictionary columns was combined with an update of the sparse representations, thereby accelerating the convergence. Lee et al. [19] presented efficient sparse coding algorithms that were based on iteratively solving two convex optimization problems: an ℓ1-regularized least squares problem and an ℓ2-constrained least squares problem. To learn a discriminative dictionary for sparse coding, a label consistent K-SVD (LC-KSVD) algorithm was proposed in [20]. In addition to using class labels of training data, the authors also associated label information with each dictionary item (columns of the dictionary matrix) to enforce discriminability in sparse codes during the dictionary learning process. More specifically, a new label consistency constraint was introduced and combined with the reconstruction error and the classification error to form a unified objective function. To effectively handle very large training sets and dynamic training data changing over time, Mairal et al. [21] proposed an online optimization algorithm for dictionary learning, based on stochastic approximations, which scales up to large datasets with millions of training samples.

However, as far as we know, there is no research work on how to learn a dictionary representation at the event level for event detection, nor on how to simultaneously leverage semantic information to learn an event oriented dictionary.
C. Multi-Task Learning

Multi-task learning [22] methods aim to simultaneously learn classification/regression models for a set of related tasks. This typically leads to better models as compared to a learner that does not account for task relationships. One way to capture the task relatedness among multiple related tasks is to constrain all models to share a common set of features. This motivates group sparsity, i.e., the ℓ2,p-norm regularized learning [23]. Joint feature learning using ℓ2,p-norm regularization performs well in ideal cases. In practical applications, however, simply using the ℓ2,p-norm regularization may not be effective for dealing with dirty data which may not fall into a single structure. To this end, the dirty model for multi-task learning was proposed in [24]. Another way to capture the task relationship is to constrain the models from different tasks to share a low-dimensional subspace via the trace norm [25]. The assumption that all models share a common low-dimensional subspace is too restrictive in some applications. To this end, an extension that learns incoherent sparse and low-rank patterns simultaneously was proposed in [26].

Many multi-task learning algorithms assume that all learning tasks are related. In practical applications, however, the tasks may exhibit a more sophisticated group structure where the models of tasks from the same group are closer to each other than those from a different group. There have been many works along this line of research [27], [28], known as clustered multi-task learning (CMTL). Moreover, most multi-task learning formulations assume that all tasks are relevant, which is however not the case in many real-world applications. Robust multi-task learning (RMTL) aims at identifying irrelevant (outlier) tasks when learning from multiple tasks [29].

However, there is little work on multi-task learning applied to the dictionary learning problem. The only related theoretical work is [30], which provides theoretical bounds on the generalization error of dictionary learning for multi-task learning and transfer learning. Multi-task learning has received considerable attention in the computer vision community and has been successfully applied to many computer vision problems, such as image classification [31], head-pose estimation [32], visual tracking [33], multi-view action recognition [34] and egocentric activity recognition [35]. However, to our knowledge, no previous multi-task learning works have considered the problem of complex event detection.
III. BUILDING AN EVENT SPECIFIC CONCEPT POOL

Concepts, which are related to objects, actions, scenes, attributes, etc., are usually the basic elements for the description of an event. Given the availability of large labeled training collections such as ImageNet and TRECVID, there exists a large pool of concept detectors for event descriptions. However, selecting important concepts is the key issue for concept vocabulary construction. For example, the event 'Landing a fish' is composed of concepts such as 'adult', 'waterscape', 'outdoor' and 'fish'. If the concepts related to an event can be accurately selected, redundancy and unrelated information will be suppressed, potentially improving the event recognition performance. In order to select useful concepts for a specific event, we propose a novel concept selection strategy based on the combination of the text and visual information provided by the event-kit descriptions, namely (i) text-based semantic relatedness from linguistic knowledge
Fig. 2. Linguistic-based concept selection strategy with an example of 'E007: Changing a vehicle tire' in the MED event-kit text description and a corresponding example video provided by NIST.
of the MED event-kit text description and (ii) Elastic-Net feature selection from the visual high-level representation.
A. Linguistic: Text-Based Semantic Relatedness

The most widely used resources in Natural Language Processing (NLP) to calculate the semantic relatedness of concepts are WordNet [36] and Wikipedia [37]. Detailed event-kit text descriptions for each MED event are provided by NIST [38]. In this paper, we explore the semantic similarity between each term in the event-kit text description and the 346 SIN visual concept names based on WordNet. Fig. 2 shows an example of the event-kit text description for 'Changing a vehicle tire'.
Intuitively, the more information two concepts share in common, the more similar they are, and the information shared by two concepts is indicated by the information content of the concepts that subsume them in the taxonomy. As illustrated in Fig. 2, we calculate the similarity between each term in the event-kit text descriptions and the 346 SIN visual concept names based on the similarity measurement proposed in [39]. This measurement defines the similarity of two words w_{1i} and w_{2j} as:

sim(w_{1i}, w_{2j}) = \frac{2\,\pi(lcs)}{\pi(w_{1i}) + \pi(w_{2j})}

where w_{1i} ∈ {event-kit text descriptions}, i = 1, . . . , N_{event\_kit}, and w_{2j} ∈ {SIN visual concept names}, j = 1, . . . , 346. lcs denotes the lowest common subsumer of the two words in the WordNet hierarchy. π denotes the information content of a word and is computed as
Fig. 3. Visual high-level semantic representation with Elastic-Net concept selection.
π(w) = log p(w), where p(w) is the probability of encountering an instance of w in the union of WordNet and the event-kit text descriptions. The probability p(w) = freq(w)/N can be estimated from the relative frequency of w among all N words in the union of WordNet and the event-kit text descriptions [40]. In this way, we expect to properly capture the semantic similarity between subjects (e.g., human, crowd) and objects (e.g., animal, vehicle) based on the WordNet hierarchy. Finally, we construct a 346D event-level feature vector representation for each event (each dimension corresponds to a SIN visual concept name) using the MED event-kit text description. A threshold (thr = 0.5 in our experiments) is set to select useful concepts into our final semantic concept list.
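As a sanity check of the similarity measure above, here is a minimal, self-contained Python sketch on a hypothetical toy taxonomy with hypothetical word frequencies (WordNet itself is not used; `PARENT` and `FREQ` are invented purely for illustration):

```python
import math

# Hypothetical toy taxonomy: child -> parent (a WordNet-like fragment).
PARENT = {
    "car": "vehicle", "truck": "vehicle",
    "vehicle": "entity", "animal": "entity",
    "dog": "animal",
}

# Hypothetical corpus frequencies; N is the total word count.
FREQ = {"entity": 1000, "vehicle": 200, "animal": 150,
        "car": 80, "truck": 40, "dog": 60}
N = sum(FREQ.values())

def ancestors(w):
    """Return w followed by all its ancestors up to the root."""
    chain = [w]
    while w in PARENT:
        w = PARENT[w]
        chain.append(w)
    return chain

def lcs(w1, w2):
    """Lowest common subsumer: first ancestor of w1 that also subsumes w2."""
    a2 = set(ancestors(w2))
    for a in ancestors(w1):
        if a in a2:
            return a
    return None

def info(w):
    """Information content pi(w) = log p(w), with p(w) = freq(w) / N."""
    return math.log(FREQ[w] / N)

def sim(w1, w2):
    """Similarity of [39]: 2 * pi(lcs) / (pi(w1) + pi(w2))."""
    return 2 * info(lcs(w1, w2)) / (info(w1) + info(w2))
```

On this toy hierarchy, `sim("car", "truck")` exceeds `sim("car", "dog")` because their lowest common subsumer ('vehicle') is more informative than 'entity', matching the intuition behind the measure.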
B. Visual High-Level Representation: Elastic-Net Concept Selection

Concept detectors provide a high-level semantic representation for videos with complex contents, which is beneficial for developing powerful retrieval or filtering systems for consumer media [5]. In our case, we first use the SIN dataset to train 346 semantic concept models. We adopt the approach of [41] to extract keyframes from the MED dataset. The trained 346 semantic concept models are used to predict the 346 semantic concepts present in the keyframes of MED videos. Once we have the prediction score of each concept on each keyframe, the keyframe can be represented as a 346D SIN-MED feature indicating the estimated concept probabilities. Finally, the video-level SIN-MED feature is computed as the average of the keyframe-level SIN-MED features.
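The keyframe-to-video pooling described above is a single averaging step; a minimal numpy sketch (the number of keyframes is illustrative):

```python
import numpy as np

def video_level_feature(keyframe_scores):
    """Average keyframe-level concept scores into one video-level
    SIN-MED descriptor (one dimension per concept)."""
    keyframe_scores = np.asarray(keyframe_scores, dtype=float)
    return keyframe_scores.mean(axis=0)

# Hypothetical example: 5 keyframes, each scored by 346 concept detectors.
scores = np.random.default_rng(0).random((5, 346))
video_feat = video_level_feature(scores)  # shape: (346,)
```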
To select the useful concepts for each specific event, we adopt Elastic-Net [42] concept selection as illustrated in Fig. 3, given the intuition that the learner generally would like to choose the most representative SIN-MED feature dimensions (concepts) to differentiate events. Elastic-Net is formulated as follows:

\min_u \; \|l - Fu\|_2^2 + \alpha_1 \|u\|_1 + \alpha_2 \|u\|_2^2

where l ∈ {0, 1}^n indicates the event labels, F ∈ R^{n×b} is the SIN-MED feature matrix (n is the number of samples
Fig. 4. Demonstration of the correlation between SIN visual concepts. High correlations between contextually related clusters 'G4:car', 'G7:nature', 'G11:urban-scene' (in red) and negative correlations between contextually unrelated clusters 'G7:nature', 'G8:indoor' (in blue) can be easily observed. (Figure is best viewed in color and under zoom.)
and b is the SIN-MED feature dimension), and u ∈ R^b is the parameter vector to be optimized. Each dimension of u corresponds to one semantic concept when F is the high-level SIN-MED feature. α1 and α2 are the regularization parameters. We use Elastic-Net instead of LASSO due to the high correlation between concepts in the SIN concept list [7] (see Fig. 4). While LASSO (obtained when α2 = 0) tends to select only a small number of variables from a group and ignore the others, Elastic-Net is capable of automatically taking such correlation information into account by adding the quadratic term ‖u‖_2^2 to the penalty. We can adjust the value of α1 to control the degree of sparsity, i.e., how many semantic concepts are selected in our problem. The selected concepts are the dimensions corresponding to the non-zero values of u.
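A minimal numpy sketch of this Elastic-Net selection, solved with a plain proximal-gradient (ISTA) loop rather than any particular solver; the toy data, the parameter values, and the 1e-3 support threshold are all hypothetical:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of the l1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def elastic_net(F, l, alpha1, alpha2, n_iter=500):
    """ISTA sketch of min_u ||l - F u||_2^2 + alpha1 ||u||_1 + alpha2 ||u||_2^2."""
    u = np.zeros(F.shape[1])
    # Step size from the Lipschitz constant of the smooth part's gradient.
    L = 2.0 * (np.linalg.norm(F, 2) ** 2 + alpha2)
    for _ in range(n_iter):
        grad = 2.0 * F.T @ (F @ u - l) + 2.0 * alpha2 * u
        u = soft_threshold(u - grad / L, alpha1 / L)
    return u

# Hypothetical toy data: 3 of 20 "concepts" actually explain the labels.
rng = np.random.default_rng(0)
F = rng.normal(size=(100, 20))
u_true = np.zeros(20)
u_true[[2, 5, 11]] = [1.5, -2.0, 1.0]
l = F @ u_true + 0.01 * rng.normal(size=100)
u = elastic_net(F, l, alpha1=1.0, alpha2=0.1)
selected = np.flatnonzero(np.abs(u) > 1e-3)  # indices of selected concepts
```

With α1 large enough, the coordinates tied to uninformative dimensions are driven exactly to zero, so `selected` recovers the planted support.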
C. Union of Selected Concepts

To sum up, we form the union of the semantic concepts selected by the text-based semantic relatedness described in Section III-A and the visual high-level semantic representation described in Section III-B as the final list of selected concepts for each MED event. All the confidence values used are normalized using Z-score normalization and are used for ranking the concepts. In our paper, we give equal weights to the textual and visual selection methods for the final late fusion of confidence scores. The top 10 ranked concepts are finally selected to provide the semantic information for adaptation. Since we select the most important concepts from the pool, the top 10 ranked concepts are usually already enough to describe the event (see Fig. 8). To evaluate the effectiveness of our proposed strategy, we compare the selected concepts with the groundtruth (we use a human labeled concept list as the groundtruth for each MED event).
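The fusion step above can be sketched as follows; the equal 0.5 weights and top_k = 10 follow the text, while the random confidence vectors are placeholders for the actual textual and visual scores:

```python
import numpy as np

def zscore(x):
    """Z-score normalize a vector of confidence values."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def fuse_and_select(text_scores, visual_scores, top_k=10):
    """Equal-weight late fusion of the two confidence lists, then top-k."""
    fused = 0.5 * zscore(text_scores) + 0.5 * zscore(visual_scores)
    return np.argsort(fused)[::-1][:top_k]

# Hypothetical confidences over the 346-concept pool.
rng = np.random.default_rng(1)
text_conf = rng.random(346)
visual_conf = rng.random(346)
top10 = fuse_and_select(text_conf, visual_conf)  # 10 concept indices
```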
IV. EVENT ORIENTED DICTIONARY LEARNING

After we select semantically meaningful concepts for each event, we can leverage training samples of the selected concepts from the SIN dataset in a supervised multi-task dictionary learning framework. In this section, we investigate how to learn an event oriented dictionary representation. To accomplish this goal, we first propose our multi-task dictionary learning framework and then introduce its supervised setting.
A. Multi-Task Dictionary Learning

Given K tasks (e.g., K = 2 in our case: one task is the MED dataset and the other is the subset of the SIN dataset whose samples are collected from the concepts selected for each event), each task consists of data samples denoted by X_k = {x_k^1, x_k^2, . . . , x_k^{n_k}} ∈ R^{n_k×d} (k = 1, . . . , K), where x_k^i ∈ R^d is a d-dimensional feature vector and n_k is the number of samples in the k-th task. We are going to learn a shared subspace across all tasks, obtained by an orthonormal projection W ∈ R^{d×s}, where s is the dimensionality of the subspace. In this learned subspace, the data distributions of all tasks should be similar to each other. Therefore, we can code all tasks together in the shared subspace and achieve better coding quality. The benefits of this strategy are: (i) we can improve each individual coding quality by transferring knowledge across all tasks; (ii) we can discover the relationship among different datasets via coding analysis. Such a purpose can be realized through the following optimization problem:

\min_{T_k, C_k, W, D} \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2

s.t. W^T W = I, (T_k)_{j·}(T_k)_{j·}^T ≤ 1, D_{j·}D_{j·}^T ≤ 1, ∀ j = 1, . . . , l    (1)
where T_k ∈ R^{l×d} is an overcomplete dictionary (l > d) with l prototypes for the k-th task, (T_k)_{j·} in the constraints denotes the j-th row of T_k, and C_k ∈ R^{n_k×l} corresponds to the sparse representation coefficients of X_k. In the third term of Eqn. (1), X_k is projected by W into the subspace to explore the relationship among different tasks. D ∈ R^{l×s} is the dictionary learned in the datasets' shared subspace. D_{j·} in the constraints denotes the j-th row of D. I is the identity matrix. (·)^T denotes the transpose operator. λ1 and λ2 are the regularization parameters. The first constraint guarantees the learned W to be orthonormal, and the second and third constraints prevent the learned dictionaries from becoming arbitrarily large. In our objective function, we learn a dictionary T_k for each task k and one dictionary D shared among the K tasks. Since one task in our model uses samples from the SIN dataset of the selected semantically meaningful concepts, the shared learned dictionary D is the event oriented dictionary. When λ2 = 0, Eqn. (1) reduces to traditional dictionary learning on separate tasks.
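To make the dimensions of Eqn. (1) concrete, here is a small numpy sketch that merely evaluates the objective on random matrices of compatible shapes (all sizes are hypothetical):

```python
import numpy as np

def objective_eq1(Xs, Cs, Ts, W, D, lam1, lam2):
    """Value of the multi-task objective of Eqn. (1):
    sum_k ||X_k - C_k T_k||_F^2 + lam1 sum_k ||C_k||_1
    + lam2 sum_k ||X_k W - C_k D||_F^2."""
    val = 0.0
    for X, C, T in zip(Xs, Cs, Ts):
        val += np.linalg.norm(X - C @ T, "fro") ** 2
        val += lam1 * np.abs(C).sum()
        val += lam2 * np.linalg.norm(X @ W - C @ D, "fro") ** 2
    return val

# Hypothetical sizes: K = 2 tasks, d = 16 features, l = 32 atoms, s = 8.
rng = np.random.default_rng(0)
d, natoms, s = 16, 32, 8
Xs = [rng.normal(size=(n, d)) for n in (40, 30)]     # task data X_k
Cs = [rng.normal(size=(X.shape[0], natoms)) for X in Xs]  # sparse codes C_k
Ts = [rng.normal(size=(natoms, d)) for _ in Xs]      # per-task dictionaries T_k
W, _ = np.linalg.qr(rng.normal(size=(d, s)))         # orthonormal projection W
D = rng.normal(size=(natoms, s))                     # shared dictionary D
val = objective_eq1(Xs, Cs, Ts, W, D, lam1=0.1, lam2=1.0)
```

The QR factor gives an orthonormal W, satisfying the first constraint of Eqn. (1) by construction.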
B. Supervised Multi-Task Dictionary Learning

It is well known that the traditional dictionary learning framework is not directly applicable to classification, and the learned dictionary has merely been used for signal reconstruction [17]. To circumvent this problem, researchers have developed several algorithms to learn a classification-oriented dictionary in a supervised learning fashion by exploiting the label information. In this subsection, we extend our proposed multi-task dictionary learning of Eqn. (1) to be suitable for event detection.

Assuming that the k-th task has m_k classes, the label information of the k-th task is Y_k = {y_k^1, y_k^2, . . . , y_k^{n_k}} ∈ R^{n_k×m_k} (k = 1, . . . , K), with y_k^i = [0, . . . , 0, 1, 0, . . . , 0] (the position of the non-zero element indicates the class). Θ_k ∈ R^{l×m_k} is the parameter of the k-th task classifier. Inspired by [43], we consider the following optimization problem:

\min_{T_k, C_k, Θ_k, W, D} \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k Θ_k\|_F^2

s.t. W^T W = I, (T_k)_{j·}(T_k)_{j·}^T ≤ 1, D_{j·}D_{j·}^T ≤ 1, ∀ j = 1, . . . , l    (2)
Compared with Eqn. (1), we add the last term in Eqn. (2) to enforce that the model involves discriminative information for classification. This objective function can simultaneously achieve a dictionary with good representation power and support optimal discrimination of the classes in the multi-task setting.
1) Optimization: To solve the proposed objective problem of Eqn. (2), we adopt an alternating minimization algorithm, optimizing with respect to D, T_k, C_k, Θ_k and W in the following five steps:
Step 1 (Fixing T_k, C_k, W, Θ_k, Optimize D): If we stack X = [X_1^T, . . . , X_K^T]^T and C = [C_1^T, . . . , C_K^T]^T, Eqn. (2) is equivalent to:

\min_D \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 = \min_D \|XW - CD\|_F^2
s.t. D_{j·}D_{j·}^T ≤ 1, ∀ j = 1, . . . , l

This is equivalent to the dictionary update stage in a traditional dictionary learning algorithm. We adopt the dictionary update strategy of [21, Algorithm 2] to efficiently solve it.
Step 2 (Fixing D, C_k, W, Θ_k, Optimize T_k): Eqn. (2) is equivalent to:

\min_{T_k} \|X_k - C_k T_k\|_F^2
s.t. (T_k)_{j·}(T_k)_{j·}^T ≤ 1, ∀ j = 1, . . . , l

This is also equivalent to the dictionary update stage in traditional dictionary learning, for each of the K tasks. We adopt the dictionary update strategy of [21, Algorithm 2] to efficiently solve it.
Step3 (Fixing Tk, W, D, Θk, Optimize Ck): Eqn.(2) is 477
equivalent to: 478
minCk
K∑
k=1
‖Xk − CkTk‖2F + λ1
K∑
k=1
‖Ck‖1 479
+ λ2
K∑
k=1
‖XkW − CkD‖2F + λ3
K∑
k=1
‖Yk − Ck�k‖2F 480
This formulation can be decoupled into (n1 +n2 + . . .+nk) 481
distinct problems: 482
minci
k
K∑
k=1
nk∑
i=1
(∥∥∥xi
k − cikTk
∥∥∥2
2+ λ1
∥∥∥cik
∥∥∥1
483
+ λ2
∥∥∥xikW − ci
kD∥∥∥
2
2+ λ3
∥∥∥yik − ci
k�k
∥∥∥2
2) 484
We adopt the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [44] to solve this problem. FISTA solves optimization problems of the form $\min_{\mu} f(\mu) + g(\mu)$, where $f(\mu)$ is convex and smooth, and $g(\mu)$ is convex but non-smooth. We adopt FISTA since it is a popular tool for solving many convex smooth/non-smooth problems and its effectiveness has been verified in many applications. In our setting, the smooth part is $f(c_k^i) = \|x_k^i - c_k^i T_k\|_2^2 + \lambda_2 \|x_k^i W - c_k^i D\|_2^2 + \lambda_3 \|y_k^i - c_k^i \Theta_k\|_2^2$ and the non-smooth part is $g(c_k^i) = \lambda_1 \|c_k^i\|_1$.
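A generic FISTA iteration for this smooth-plus-ℓ1 structure can be sketched as follows (our simplified column-vector illustration, not the authors' code; grad_f and the Lipschitz constant L are supplied by the caller):

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1 (element-wise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista(grad_f, L, lam, c0, n_iter=200):
    """FISTA [44] for min_c f(c) + lam * ||c||_1, where grad_f is the
    gradient of the smooth part f and L is its Lipschitz constant."""
    c, z, t = c0.copy(), c0.copy(), 1.0
    for _ in range(n_iter):
        c_next = soft_threshold(z - grad_f(z) / L, lam / L)  # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0    # momentum schedule
        z = c_next + ((t - 1.0) / t_next) * (c_next - c)     # extrapolation
        c, t = c_next, t_next
    return c
```

For a lasso-type instance $\min_c \|Ac - b\|_2^2 + \lambda \|c\|_1$ one would pass $grad\_f : c \mapsto 2A^T(Ac - b)$ and $L = 2\|A\|_2^2$.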
Step 4 (Fixing D, Ck, W, Tk, optimize Θk): Eqn.(2) is equivalent to:

$$\min_{\Theta_k} \|Y_k - C_k \Theta_k\|_F^2$$

Setting the derivative $\partial / \partial \Theta_k = 0$, we obtain $\Theta_k = (C_k^T C_k)^{-1} C_k^T Y_k$.
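This closed-form update is a standard multi-output least-squares solve; a small sketch (names are ours, and lstsq is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

def update_classifier(C_k, Y_k):
    """Closed-form update Theta_k = (C_k^T C_k)^{-1} C_k^T Y_k,
    computed as a least-squares solve for stability."""
    Theta_k, *_ = np.linalg.lstsq(C_k, Y_k, rcond=None)
    return Theta_k
```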
Step 5 (Fixing Tk, Ck, D, Θk, optimize W): If we stack $X = [X_1^T, \ldots, X_K^T]^T$ and $C = [C_1^T, \ldots, C_K^T]^T$, Eqn.(2) is equivalent to:

$$\min_{W} \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 = \min_{W} \|XW - CD\|_F^2 \quad \text{s.t.}\ W^T W = I$$

Substituting $D = (C^T C)^{-1} C^T X W$ back into the above function, we achieve:

$$\min_{W} \|(I - C(C^T C)^{-1} C^T) X W\|_F^2 = \min_{W} \operatorname{tr}(W^T X^T (I - C(C^T C)^{-1} C^T) X W) \quad \text{s.t.}\ W^T W = I$$

The optimal W is composed of the eigenvectors of the matrix $X^T (I - C(C^T C)^{-1} C^T) X$ corresponding to the s smallest eigenvalues.
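The W-update thus reduces to one eigendecomposition; a compact sketch (our illustration, with the projector $C(C^TC)^{-1}C^T$ formed via a least-squares solve):

```python
import numpy as np

def update_subspace(X, C, s):
    """W = eigenvectors of X^T (I - C (C^T C)^{-1} C^T) X
    associated with the s smallest eigenvalues; columns are orthonormal."""
    n = X.shape[0]
    P = C @ np.linalg.lstsq(C, np.eye(n), rcond=None)[0]  # projector onto col(C)
    M = X.T @ (np.eye(n) - P) @ X
    M = (M + M.T) / 2.0                # symmetrize against round-off
    _, eigvecs = np.linalg.eigh(M)     # eigh returns eigenvalues in ascending order
    return eigvecs[:, :s]              # W in R^{d x s}, with W^T W = I
```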
We summarize our algorithm for solving Eqn.(2) as Algorithm 1.
Algorithm 1 Supervised Multi-Task Dictionary Learning
C. Supervised Multi-Task ℓp-Norm Dictionary Learning

In the literature it has been shown that non-convex ℓp-norm minimization (0 ≤ p < 1) can often yield better results than convex ℓ1-norm minimization. Inspired by this, we extend our supervised multi-task dictionary learning model to a supervised multi-task ℓp-norm dictionary learning model.

Assuming that the k-th task has $m_k$ classes, the label information of the k-th task is $Y_k = \{y_k^1, y_k^2, \ldots, y_k^{n_k}\} \in \mathbb{R}^{n_k \times m_k}$ (k = 1, ..., K), where $y_k^i = [0, \ldots, 0, 1, 0, \ldots, 0]$ (the position of the non-zero element indicates the class). $\Theta_k \in \mathbb{R}^{l \times m_k}$ is the parameter of the k-th task classifier. We formulate our supervised multi-task ℓp-norm dictionary learning problem as follows:
$$\min_{T_k, C_k, \Theta_k, W, D} \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_p^p + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k \Theta_k\|_F^2$$
$$\text{s.t.}\ \begin{cases} W^T W = I \\ (T_k)_{j\cdot} (T_k)_{j\cdot}^T \le 1, & \forall j = 1, \ldots, l \\ D_{j\cdot} D_{j\cdot}^T \le 1, & \forall j = 1, \ldots, l \end{cases} \tag{3}$$
Compared with Eqn.(2), we replace the traditional sparse coding ℓ1-norm term $\|C_k\|_1$ with the more flexible ℓp-norm term $\|C_k\|_p^p$. Since we can adjust the value of p (0 ≤ p < 1) in our framework, our algorithm can control the sparseness of the feature representation more flexibly, usually resulting in better performance than traditional ℓ1-norm sparse coding.

To solve the proposed problem of Eqn.(3), we adopt the alternating minimization algorithm to optimize it with respect to D, Tk, Ck, Θk and W. The update rules for D, Tk, Θk and W are the same as for Eqn.(2); the only difference is in the update rule of Ck. Various algorithms have been proposed for ℓp-norm non-convex sparse coding [45]–[47]. In this paper, we adopt the Generalized Iterated Shrinkage Algorithm (GISA) [48] to solve the proposed problem.
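At its core, GISA applies a generalized soft-thresholding operator element-wise; the following sketch paraphrases our reading of [48] (the threshold formula and the fixed-point inner loop are quoted from memory and should be checked against the original):

```python
import numpy as np

def gst(y, lam, p, n_inner=10):
    """Generalized soft-thresholding (cf. GISA [48]) for the scalar
    proximal problem min_x 0.5*(x - y)^2 + lam*|x|^p with 0 <= p < 1,
    applied element-wise to the array y."""
    # threshold below which the minimizer is exactly zero
    tau = (2.0 * lam * (1.0 - p)) ** (1.0 / (2.0 - p)) \
        + lam * p * (2.0 * lam * (1.0 - p)) ** ((p - 1.0) / (2.0 - p))
    x = np.abs(y).astype(float)
    out = np.zeros_like(x)
    mask = x > tau
    z = x[mask]
    for _ in range(n_inner):                 # fixed-point iteration
        z = x[mask] - lam * p * z ** (p - 1.0)
    out[mask] = z
    return np.sign(y) * out
```

For p = 1 this degenerates to the familiar soft-thresholding rule; for p < 1 the threshold is larger and surviving coefficients shrink less, which is the extra flexibility exploited above.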
Algorithm 2 Supervised Multi-Task ℓp-Norm Dictionary Learning
TABLE I
18 EVENTS OF MED10 AND MED11
We summarize our algorithm for solving Eqn.(3) as Algorithm 2.

After the optimized Θ is obtained, the final classification of a test video can be obtained based on its sparse coefficient $c_k^i$, which carries the discriminative information. We can simply apply the linear classifier $c_k^i \Theta_k$ to obtain the predicted score of the video.
V. EXPERIMENTS

In this section, we conduct extensive experiments to test our proposed method on large-scale real-world datasets.

A. Datasets

The TRECVID MED10 (P001-P003) and MED11 (E001-E015) datasets are used in our experiments. The datasets consist of 9746 videos covering 18 events of interest, with 100-200 examples per event; the rest of the videos are from the background class. The details are listed in Table I.
The TRECVID Semantic Indexing Task (SIN) [7] contains annotations for 346 semantic concepts on 400,000 keyframes from web videos. The 346 concepts are related to objects, actions, scenes, attributes and non-visual concepts, which are all basic elements of an event, e.g. kitchen, boy, girl, bus. For the sake of better understanding and easy concept selection, we manually divide the 346 visual concepts into 15 groups, which are listed in Table II.

TABLE II
15 GROUPS OF SIN 346 VISUAL CONCEPTS (THE NUMBER OF
CONCEPTS FOR EACH GROUP IS IN PARENTHESES)
B. Evaluation Metrics

1) Average Precision (AP): a measure combining recall and precision for ranked retrieval results. The average precision is the mean of the precision scores obtained after each relevant sample is retrieved. A higher number indicates better performance.
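This definition of AP can be sketched directly (a minimal illustration over a binary relevance list sorted by descending detection score; not tied to any particular evaluation toolkit):

```python
def average_precision(ranked_labels):
    """AP = mean of precision@i, taken at each rank i where a relevant
    sample appears; ranked_labels holds 1/0 relevance in score order."""
    hits, precisions = 0, []
    for i, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

For the ranking [1, 0, 1, 0] this yields (1/1 + 2/3)/2 = 5/6.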
2) PMiss@TER=12.5: an official evaluation metric for event detection defined by NIST [38]. It is the Missed Detection probability at the operating point where the ratio between the probability of Missed Detection and the probability of False Alarm is 12.5:1. A lower number indicates better performance.
3) Normalized Detection Cost (NDC): an official evaluation metric for event detection defined by NIST [38]. It is a weighted linear combination of the system's Missed Detection and False Alarm probabilities. NDC measures the performance of a detection system in the context of an application profile, using error rate estimates calculated on a test set. A lower number indicates better performance.
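The weighted combination can be sketched as follows; the cost weights and target prior below (C_Miss = 80, C_FA = 1, P_Target = 0.001) are the TRECVID MED profile values as we recall them, and should be treated as assumptions rather than quoted from [38]:

```python
def normalized_detection_cost(p_miss, p_fa,
                              c_miss=80.0, c_fa=1.0, p_target=0.001):
    """Weighted linear combination of Missed Detection and False Alarm
    probabilities, normalized so a trivial all-negative system scores 1.0."""
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / norm
```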
C. Experiment Settings and Comparison Methods

There are 3104 videos used for training and 6642 videos used for testing in our experiments. We use three representative features: SIFT, Color SIFT (CSIFT) and Motion SIFT (MOSIFT) [4]. SIFT and CSIFT describe the gradient and color information of images. MOSIFT describes both the optical flow and gradient information of video clips. Finally, 768-D SIFT-BoW, CSIFT-BoW and MOSIFT-BoW features are extracted to represent each video. We set the regularization parameters in the range {0.01, 0.1, 1, 10, 100}. The subspace dimensionality s is set by searching the grid {200, 400, 600}. For the experiments in the paper, we try three different dictionary sizes: {768, 1024, 1280}. To evaluate the multi-task ℓp-norm dictionary learning algorithm, the parameter p is tuned in the range {0.2, 0.4, 0.6, 0.8, 1}. We compare our proposed event oriented dictionary learning method with the following important baselines:

• Support Vector Machine (SVM): SVM has been widely used by several research groups for MED and has shown its robustness [8], [49], [50], so we use it as one of the comparison algorithms (RBF, Histogram Intersection and χ2 kernels are used respectively);
Fig. 5. Example results of the semantic concept selection proposed in Section III on the events (top left) Attempting board trick, (top right) Feeding animal, (bottom left) Flash mob gathering, (bottom right) Making a sandwich. The font size in the figure reflects the ranking value for concepts (the larger the font, the higher the value).
TABLE III
AP PERFORMANCE FOR EACH MED EVENT USING TEXT (T),
VISUAL (V) AND TEXT + VISUAL (T + V) INFORMATION
FOR CONCEPT SELECTION. THE LAST COLUMN SHOWS
THE NUMBER OF CONCEPTS IN THE TOP 10 THAT
COINCIDE WITH THE GROUNDTRUTH
• Single Task Supervised Dictionary Learning (ST-SDL): performing supervised dictionary learning on each task separately;
• Pooling Tasks Supervised Dictionary Learning (PT-SDL): performing single task supervised dictionary learning by simply aggregating data from all tasks;
• Multiple Kernel Transfer Learning (MKTL) [51]: a method incorporating prior features into a multiple kernel learning framework. We use the code provided by the author1;
• Dirty Model Multi-Task Learning (DMMTL) [24]: a state-of-the-art multi-task learning method imposing ℓ1/ℓq-norm regularization. We use the code provided by the MALSAR toolbox2;
• Multiple Kernel Learning Latent Variable Approach (MKLLVA) [52]: a multiple kernel learning latent variable approach for complex video event detection;
• Random Concept Selection Strategy (RCSS): performing our proposed supervised multi-task dictionary learning

1http://homes.esat.kuleuven.be/~ttommasi/source_code_ICCV11.html
2http://www.public.asu.edu/~jye02/Software/MALSAR/
TABLE IV
COMPARISON OF AVERAGE DETECTION ACCURACY OF DIFFERENT METHODS FOR THE SIFT FEATURE. BEST RESULTS ARE HIGHLIGHTED IN BOLD
Fig. 6. Comparison of AP performance of different methods for each MED event. (Figure is best viewed in color and under zoom).
without involving the concept selection strategy (leveraging random samples);
• Simple Concept Selection Strategy (SCSS): Each concept is independently used as a predictor for each event, yielding a certain average precision performance for each event. The concepts are ranked according to the AP measurement for each event, and the top 10 concepts are retained.
D. Experimental Results

We first calculate the covariance matrix of the 346 SIN concepts, shown in Fig. 4, where the visual concepts are grouped and sorted as in Table II. As shown in Fig. 4, the covariance matrix shows high within-cluster correlations along the diagonal and also relatively high correlations between contextually related clusters (in red), such as 'G4:car', 'G7:nature' and 'G11:urban-scene'. One can also easily observe negative correlations between contextually unrelated clusters (in blue), such as 'G7:nature' and 'G8:indoor'. This gives us the intuition that visual concept co-occurrence exists. Therefore, removing redundant concepts and selecting related concepts for each event is expected to be helpful for the event detection task.

Fig. 5 shows the results of our concept selection strategy for the events 'Attempting board trick', 'Feeding animal', 'Flash mob gathering' and 'Making a sandwich'. From Fig. 5, we can observe that the selected concepts are reasonably consistent with human selections.

To better assess the effectiveness of our proposed concept selection strategy, we compare our selected top 10 concepts with the groundtruth (we use the ranking list of human labeled concepts as the groundtruth for each MED event). The results are listed in the last column of Table III, showing the number of concepts in the top 10 that coincide with the groundtruth. The AP performance for event detection based on text information, visual information and their combination is also shown in Table III. The benefit of using both text and visual information for concept selection can be seen from Table III.
Table IV shows the average detection results over the 18 MED events for the different comparison methods. We have the following observations: (1) Comparing ST-SDL with SVM, we observe that performing supervised dictionary learning is better than SVM, which shows the effectiveness of dictionary learning for MED. (2) Comparing PT-SDL with ST-SDL, leveraging knowledge from the SIN dataset improves the performance for MED. (3) Our concept selection strategy for semantic dictionary learning performs the best for MED among all the comparison methods. (4) Our proposed method outperforms SVM by 8%, 6% and 10% with respect to AP, PMiss@TER=12.5 and MinNDC respectively. (5) Considering the difficulty of the MED dataset and the typically low AP performance on MED, the absolute 8% AP improvement is very significant.

Fig. 6 shows the AP results for each MED event. Our proposed method achieves the best performance for 13 out of a total of 18 events. It is also interesting to notice that the larger improvements in Fig. 6, such as 'E004: Wedding ceremony', 'E005: Working on a woodworking project' and 'E009: Getting a vehicle unstuck', usually correspond to a higher number of selected concepts that coincide with the groundtruth. This gives us evidence of the effectiveness of our proposed automatic concept selection strategy.

Fig. 7 illustrates the MAP performance of the different methods based on the SIFT, CSIFT and MOSIFT features. It can be easily observed that our proposed supervised multi-task dictionary learning with our concept selection strategy outperforms SVM by more than 8%.

Moreover, we evaluate our proposed method with respect to different numbers of selected concepts, dictionary sizes and
Fig. 7. Comparison of MAP performance of different methods for different types of features.
Fig. 8. MAP performance variation with respect to the number of selected concepts.
Fig. 9. MAP performance variation with respect to (left) different dictionary sizes; (right) different subspace dimensionality.
different subspace dimensionality settings based on the SIFT feature. Fig. 8 shows that the proposed method achieves the best MAP results when the number of selected concepts is 10. Fig. 9 (left) shows that the proposed method achieves the best MAP results when the dictionary size is 1024; a dictionary that is too large or too small tends to hamper the performance. Fig. 9 (right) shows that the best MAP result is achieved when the subspace dimensionality is 400 (dictionary size = 1024). Too large or too small a subspace dimensionality also degrades the performance.
Finally, we also study the parameter sensitivity of the proposed method in Fig. 10. Here, we fix λ3 = 1 (discriminative information contribution fixed) and p = 0.6, and analyze the regularization parameters λ1 and λ2. As shown in Fig. 10 (left), we observe that the proposed method is more sensitive to λ2 than to λ1, which demonstrates the importance of the subspace for multi-task dictionary learning. Moreover, to understand the influence of the parameter p in our proposed supervised ℓp-norm dictionary learning algorithm, we also perform an experiment on its sensitivity. Fig. 10 (right) demonstrates that the best performance for the
Fig. 10. Sensitivity study of parameters on (left) λ1 and λ2 with fixed p and λ3; (right) p and λ3 with fixed λ1 and λ2.
Fig. 11. Convergence rate study on (left) supervised multi-task dictionary learning and (right) supervised multi-task ℓp-norm dictionary learning.
supervised ℓp-norm dictionary learning algorithm is achieved when p = 0.6. A MAP improvement of more than 2% can be achieved if we adopt the ℓp-norm model instead of the fixed ℓ1-norm model. This shows the suboptimality of traditional ℓ1-norm sparse coding compared with the flexible ℓp-norm sparse coding.

The proposed iterative approaches monotonically decrease the objective function values in Eqn.(2) and Eqn.(3) until convergence. Fig. 11 shows the convergence rate curves of our algorithms. It can be observed that the objective function values converge quickly: our approaches usually converge within at most 6 iterations for supervised multi-task dictionary learning and 7 iterations for supervised multi-task ℓp-norm dictionary learning (precision = 10−6).
Regarding the computational cost of our proposed algorithm for supervised multi-task ℓp-norm dictionary learning, we train our model on the TRECVID MED dataset in 5 hours with cross-validation on a workstation with Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz × 17 processors. Our proposed event oriented dictionary learning approach can be easily parallelized on multi-core computers due to its 'event oriented' nature. This means that our algorithms would be scalable for large-scale problems.
VI. CONCLUSION AND FUTURE WORK

In this paper, we first investigated the possibility of automatically selecting semantically meaningful concepts for complex event detection based on both the MED events-kit text descriptions and the high-level concept feature descriptions. We then attempted to learn an event oriented dictionary representation based on the selected semantic concepts. To this aim, we leveraged training samples of selected concepts from the SIN dataset in a novel jointly supervised multi-task dictionary learning framework.
Extensive experimental results on the TRECVID MED dataset showed that our proposed method outperforms several important baseline methods. Our proposed method outperformed SVM by up to 8% MAP, which shows the effectiveness of dictionary learning for TRECVID MED. MAP improvements of more than 6% and 3% were achieved compared with ST-SDL and PT-SDL respectively, which shows the advantage of the multi-task setting in our proposed framework. To show the benefit of the concept selection strategy, we compared RCSS to our method and showed that RCSS achieves 4% less MAP.

For some sparse coding problems, non-convex ℓp-norm minimization (0 ≤ p < 1) can often obtain better results than convex ℓ1-norm minimization. Inspired by this, we extended our supervised multi-task dictionary learning model to a supervised multi-task ℓp-norm dictionary learning model. We evaluated the influence of the ℓp-norm parameter p in our proposed problem and found that a MAP improvement of more than 2% can be achieved by adopting the more flexible ℓp-norm model instead of the fixed ℓ1-norm model.
Overall, the proposed multi-task dictionary learning solutions are novel in the context of complex event detection, which is a relevant and important research problem in applications such as image and video understanding and surveillance. Future research involves (i) the integration of knowledge from multiple sources (video, audio, text) and the incorporation of kernel learning in our framework, and (ii) the use of deep structures instead of a shallow single-layer model in the proposed problem, since deep learning has achieved great success in many different fields of image processing and computer vision.
ACKNOWLEDGMENT

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of ARO, the National Science Foundation or the U.S. Government.
REFERENCES

[1] A. Tamrakar et al., "Evaluation of low-level features and their combinations for complex event detection in open source videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3681–3688.
[2] Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe, and A. G. Hauptmann, "Complex event detection via multi-source video attributes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2627–2633.
[3] Y. Yan, Y. Yang, H. Shen, D. Meng, G. Liu, N. Sebe, and A. G. Hauptmann, "Complex event detection via event oriented dictionary learning," in Proc. AAAI Conf. Artif. Intell., 2015.
[4] M.-Y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-09-161, 2009.
[5] C. G. M. Snoek and A. W. M. Smeulders, "Visual-concept search solved?" Computer, vol. 43, no. 6, pp. 76–78, Jun. 2010.
[6] C. Sun and R. Nevatia, "ACTIVE: Activity concept transitions in video event classification," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 913–920.
[7] SIN. (2013). NIST TRECVID Semantic Indexing. [Online]. Available: http://www-nlpir.nist.gov/projects/tv2013/tv2013.html#sin
[8] Z.-Z. Lan et al., "Informedia E-lamp @ TRECVID 2013 multimedia event detection and recounting (MED and MER)," in Proc. NIST TRECVID Workshop, 2013.
[9] L. Jiang, A. G. Hauptmann, and G. Xiang, "Leveraging high-level and low-level features for multimedia event detection," in Proc. ACM Int. Conf. Multimedia, 2012, pp. 449–458.
[10] P. Natarajan et al., "Multimodal feature fusion for robust event detection in web videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1298–1305.
[11] K. Tang, D. Koller, and L. Fei-Fei, "Learning latent temporal structure for complex event detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1250–1257.
[12] Z.-Z. Lan, L. Bao, S.-I. Yu, W. Liu, and A. G. Hauptmann, "Double fusion for multimedia event detection," in Proc. Int. Conf. Multimedia Modeling, 2012, pp. 173–185.
[13] S. Oh et al., "Multimedia event detection with multimodal feature fusion and temporal concept localization," Mach. Vis. Appl., vol. 25, no. 1, pp. 49–69, Jan. 2014.
[14] L. Brown et al., "IBM Research and Columbia University TRECVID-2013 multimedia event detection (MED), multimedia event recounting (MER), surveillance event detection (SED), and semantic indexing (SIN) systems," in Proc. NIST TRECVID Workshop, 2013.
[15] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1794–1801.
[16] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Discriminative learned dictionaries for local image analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[18] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[19] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2006, pp. 801–808.
[20] Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1697–1704.
[21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 689–696.
[22] T. Evgeniou and M. Pontil, "Regularized multi-task learning," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 109–117.
[23] A. Argyriou, T. Evgeniou, and M. Pontil, "Multi-task feature learning," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2007, pp. 1–8.
[24] A. Jalali, P. K. Ravikumar, S. Sanghavi, and C. Ruan, "A dirty model for multi-task learning," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2010.
[25] S. Ji and J. Ye, "An accelerated gradient method for trace norm minimization," in Proc. Int. Conf. Mach. Learn., 2009, pp. 457–464.
[26] J. Chen, J. Liu, and J. Ye, "Learning incoherent sparse and low-rank patterns from multiple tasks," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, p. 22.
[27] L. Jacob, F. R. Bach, and J.-P. Vert, "Clustered multi-task learning: A convex formulation," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2008.
[28] J. Zhou, J. Chen, and J. Ye, "Clustered multi-task learning via alternating structure optimization," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2011, pp. 702–710.
[29] J. Chen, J. Zhou, and J. Ye, "Integrating low-rank and group-sparse structures for robust multi-task learning," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 42–50.
[30] A. Maurer, M. Pontil, and B. Romera-Paredes, "Sparse coding for multitask and transfer learning," in Proc. Int. Conf. Mach. Learn., 2013, pp. 343–351.
[31] X.-T. Yuan and S. Yan, "Visual classification with multi-task joint sparse representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3493–3500.
[32] Y. Yan, E. Ricci, R. Subramanian, O. Lanz, and N. Sebe, "No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1177–1184.
[33] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, "Robust visual tracking via multi-task sparse learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2042–2049.
[34] Y. Yan, E. Ricci, R. Subramanian, G. Liu, and N. Sebe, "Multitask linear discriminant analysis for view invariant action recognition," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5599–5611, Dec. 2014.
[35] Y. Yan, E. Ricci, G. Liu, and N. Sebe, "Recognizing daily activities from first-person videos with multi-task clustering," in Proc. Asian Conf. Comput. Vis., 2014, pp. 1–16.
[36] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. Cambridge, MA, USA: MIT Press, 1998.
[37] M. Strube and S. P. Ponzetto, "WikiRelate! Computing semantic relatedness using Wikipedia," in Proc. AAAI Conf. Artif. Intell., 2006, pp. 1–6.
[38] NIST. (2013). NIST TRECVID Multimedia Event Detection. [Online]. Available: http://www.nist.gov/itl/iad/mig/med13.cfm
[39] D. Lin, "An information-theoretic definition of similarity," in Proc. 15th Int. Conf. Mach. Learn., 1998, pp. 296–304.
[40] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy," in Proc. 14th Int. Joint Conf. Artif. Intell., 1995, pp. 448–453.
[41] J. Luo, C. Papin, and K. Costello, "Towards extracting semantically meaningful key frames from personal video clips: From humans to computers," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 2, pp. 289–301, Feb. 2009.
[42] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Statist. Soc., Ser. B, vol. 67, no. 2, pp. 301–320, Apr. 2005.
[43] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2691–2698.
[44] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[45] I. F. Gorodnitsky and B. D. Rao, "Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm," IEEE Trans. Signal Process., vol. 45, no. 3, pp. 600–616, Mar. 1997.
[46] Y. She, "Thresholding-based iterative selection procedures for model selection and shrinkage," Electron. J. Statist., vol. 3, pp. 384–415, Nov. 2009.
[47] D. Krishnan and R. Fergus, "Fast image deconvolution using hyper-Laplacian priors," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2009, pp. 1033–1041.
[48] W. Zuo, D. Meng, L. Zhang, X. Feng, and D. Zhang, "A generalized iterated shrinkage algorithm for non-convex sparse coding," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 217–224.
[49] L. Cao et al., "IBM Research and Columbia University TRECVID-2011 multimedia event detection (MED) system," in Proc. NIST TRECVID Workshop, Dec. 2011.
[50] D. Oneata et al., "AXES at TRECVid 2012: KIS, INS, and MED," in Proc. NIST TRECVID Workshop, 2012.
[51] L. Jie, T. Tommasi, and B. Caputo, "Multiclass transfer learning from unconstrained priors," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 1863–1870.
[52] A. Vahdat, K. Cannons, G. Mori, S. Oh, and I. Kim, "Compositional models for video event detection: A multiple kernel learning latent variable approach," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1185–1192.
Yan Yan received the Ph.D. degree from the University of Trento, in 2014. He is currently a Post-Doctoral Researcher with the MHUG Group, University of Trento. His research interests include machine learning and its application to computer vision and multimedia analysis.

Yi Yang received the Ph.D. degree in computer science from Zhejiang University. He was a Post-Doctoral Research Fellow with the School of Computer Science, Carnegie Mellon University, from 2011 to 2013. He is currently a Senior Lecturer with the Centre for Quantum Computation and Intelligent Systems, University of Technology, Sydney. His research interests include multimedia and computer vision.

Deyu Meng (M'13) received the B.Sc., M.Sc., and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 2001, 2004, and 2008, respectively. He was a Visiting Scholar with Carnegie Mellon University, Pittsburgh, PA, USA, from 2012 to 2014. He is currently an Associate Professor with the Institute for Information and System Sciences, School of Mathematics and Statistics, Xi'an Jiaotong University. His current research interests include principal component analysis, nonlinear dimensionality reduction, feature extraction and selection, compressed sensing, and sparse machine learning methods.

Gaowen Liu received the B.S. degree in automation from Qingdao University, China, in 2006, and the M.S. degree in system engineering from the Nanjing University of Science and Technology, China, in 2008. She is currently pursuing the Ph.D. degree with the MHUG Group, University of Trento, Italy. Her research interests include machine learning and its application to computer vision and multimedia analysis.

Wei Tong received the Ph.D. degree in computer science from Michigan State University, in 2010. He is currently a Researcher with the General Motors Research and Development Center. His research interests include machine learning, computer vision, and autonomous driving.

Alexander G. Hauptmann received the B.A. and M.A. degrees in psychology from Johns Hopkins University, Baltimore, MD, the degree in computer science from the Technische Universität Berlin, Berlin, Germany, in 1984, and the Ph.D. degree in computer science from Carnegie Mellon University (CMU), Pittsburgh, PA, in 1991. He worked on speech and machine translation from 1984 to 1994, when he joined the Informedia project for digital video analysis and retrieval, and led the development and evaluation of news-on-demand applications. He is currently a Faculty Member with the Department of Computer Science and the Language Technologies Institute, CMU. His research interests include several different areas, including man-machine communication, natural language processing, speech understanding and synthesis, video analysis, and machine learning.

Nicu Sebe (M'01–SM'11) received the Ph.D. degree from Leiden University, The Netherlands, in 2001. He is currently with the Department of Information Engineering and Computer Science, University of Trento, Italy, where he leads the research in the areas of multimedia information retrieval and human behavior understanding. He is a Senior Member of the Association for Computing Machinery and a Fellow of the International Association for Pattern Recognition. He was the General Co-Chair of FG 2008 and ACM Multimedia 2013, and the Program Chair of CIVR 2007 and 2010, and ACM Multimedia 2007 and 2011. He will be the Program Chair of ECCV 2016 and ICCV 2017.