ECCV#165
A Dictionary Learning Approach for Classification: Separating the Particularity and the Commonality
Shu Kong and Donghui Wang
Dept. of Computer Science and Technology, Zhejiang University, Hangzhou, P.R. China
{aimerykong, dhwang}@zju.edu.cn
Abstract. Empirically, we find that, despite the class-specific features owned by the objects appearing in images, the objects from different categories usually share some common patterns, which do not contribute to discriminating them. Concentrating on this observation and under the general dictionary learning (DL) framework, we propose a novel method to explicitly learn a common pattern pool (the commonality) and class-specific dictionaries (the particularity) for classification. We call our method DL-COPAR; it can learn the most compact and most discriminative class-specific dictionaries for classification. The proposed DL-COPAR is extensively evaluated both on synthetic data and on benchmark image databases in comparison with existing DL-based classification methods. The experimental results demonstrate that DL-COPAR achieves very promising performance in various applications, such as face recognition, handwritten digit recognition, scene classification and object recognition.
1 Introduction
Dictionary learning (DL), as a particular sparse signal model, has risen to prominence in recent years. It aims to learn an (overcomplete) dictionary in which only a few atoms can be linearly combined to well approximate a given signal. DL-based methods have achieved state-of-the-art performance in many application fields, such as image denoising [1] and image classification [2, 3].
DL methods were originally designed to learn a dictionary which can faithfully represent signals, so they do not work well for classification tasks. To circumvent this problem, researchers have recently developed several approaches to learn a classification-oriented dictionary. By exploiting the label information, most DL-based classification methods learn such an adaptive dictionary in one of two ways: either directly forcing the dictionary to be discriminative, or making the sparse coefficients discriminative (usually by simultaneously learning a classifier) to promote the discrimination power of the dictionary. For the first case, as an example, Ramirez et al. advocate learning class-specific sub-dictionaries for each class with a novel penalty term that makes the sub-dictionaries incoherent [4]. Most methods belong to the latter case. Mairal et al. propose a supervised DL method that embeds a logistic loss function to simultaneously learn a classifier [3]. Zhang and Li propose the discriminative K-SVD method to achieve a desired dictionary which has good representation power while supporting optimal discrimination of the classes [5]. Furthermore, Jiang et al. add a label-consistency term to K-SVD to make the sparse coefficients more discriminative, thus the discrimination power of
the dictionary is further improved. Beyond these two scenarios, Yang et al. propose the Fisher discrimination DL method to simultaneously learn class-specific sub-dictionaries and to make the coefficients more discriminative through the Fisher criterion [6].
Besides, as an empirical observation, despite the highly discriminative features owned by the objects from different categories, the images containing the objects usually share some common patterns, which do not contribute to recognizing them. Such common patterns include, for example, the background of the objects in object categorization, and the noses in expression recognition (the nose is almost motionless across various expressions). Under this observation, it is not feasible to treat the (inter-category) atoms of the overall dictionary equally without discrimination. In the inspiring work of Ramirez et al. [4], a DL method is proposed that adds a structured incoherence penalty term (DLSI) to learn C sub-dictionaries for C classes (each one corresponding to a specific class), discarding the most coherent atoms among these sub-dictionaries as the shared features. These shared features would otherwise confuse the representation, because the coherent atoms from different sub-dictionaries can be used for representation interchangeably. DLSI then uses the combination of all the sub-dictionaries as an overall dictionary for the final classification. Even though some improvements are achieved by DLSI, the common patterns remain hidden in the sub-dictionaries. In the work of Wang et al. [7], an automatic group sparse coding (AutoGSC) method is proposed to discover the hidden shared data patterns with a common dictionary and C individual dictionaries for the C groups. However, AutoGSC is a clustering approach based on sparse coding and is not adapted for classification.
Inspired by the aforementioned works, we propose a novel DL-based approach for classification tasks, named DL-COPAR in this paper. Given training data with label information, we explicitly learn the class-specific feature dictionaries (the particularity) and the common pattern pool (the commonality). The particularity is most discriminative for classification regardless of its representation power, and the commonality is separated out to merely contribute to the representation of the data from all classes. With the help of the commonality, the overall dictionary can be more compact and more discriminative for classification. To evaluate the proposed DL-COPAR, we perform an extensive series of experiments both on synthetic data and on benchmark image databases. The experimental results demonstrate that DL-COPAR achieves very promising performance in various applications, such as face recognition, handwritten digit recognition, scene classification and object recognition.
2 Learning the Commonality and the Particularity
To derive our DL-COPAR, we first review the classical dictionary learning (DL) model. Suppose there are N training data (possibly from different categories) denoted by x_i ∈ R^d (i = 1, ..., N). DL learns a dictionary D ∈ R^{d×K} from them by alternately minimizing the objective function f over D and the coefficient matrix A = [a_1, ..., a_N] ∈ R^{K×N}:

$$\{\mathbf{A},\mathbf{D}\} = \arg\min_{\mathbf{D},\mathbf{A}} \Big\{ f \equiv \sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{D}\mathbf{a}_i\|_2^2 + \lambda\|\mathbf{a}_i\|_1 \Big\} \quad \text{s.t. } \|\mathbf{d}_k\|_2^2 = 1, \; \forall k = 1,\ldots,K. \tag{1}$$
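Eq.(1) is commonly solved by alternating between a sparse-coding step (D fixed) and a dictionary-update step (A fixed). The following is a minimal numpy sketch of that alternation, using ISTA for the LASSO sub-problem and a MOD-style least-squares dictionary update; it is an illustration of the generic scheme, not the feature-sign solver the paper adopts later, and all function names are ours.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_ista(X, D, lam, n_iter=200):
    # ISTA for min_A ||X - D A||_F^2 + lam * ||A||_1 (column-wise LASSO)
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        G = 2.0 * D.T @ (D @ A - X)               # gradient of the quadratic term
        A = soft_threshold(A - G / L, lam / L)
    return A

def dict_update(X, A):
    # MOD-style update D = X A^T (A A^T)^+, then project atoms to unit norm
    D = X @ A.T @ np.linalg.pinv(A @ A.T)
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0
    return D / norms

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 100))                   # toy data, d = 16, N = 100
D = dict_update(X, rng.standard_normal((24, 100)))   # random feasible init, K = 24
for _ in range(5):                                   # alternate Eq.(1)'s two sub-problems
    A = sparse_code_ista(X, D, lam=0.1)
    D = dict_update(X, A)
```

The unit-norm constraint of Eq.(1) is enforced by renormalizing the atoms after each dictionary update.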
If K > d, then D is called an overcomplete dictionary. Suppose there are C classes and X = [X_1, ..., X_c, ..., X_C] ∈ R^{d×N} is the dataset, wherein X_c ∈ R^{d×N_c} (N = Σ_{c=1}^C N_c) represents the data from the c-th class, and a signal x_i ∈ R^d from this class is indexed by i ∈ I_c. Although the dictionary D learned by Eq.(1) from X is adapted for representing the data, it is not suitable for classifying them. To obtain a classification-oriented dictionary, an intuitive way is to learn C class-specific dictionaries D_c (c = 1, ..., C), one per class; the reconstruction error can then be used for classification [4, 8]. Besides, as we observe in the real world, the class-specific dictionaries D_c from different categories usually share some common patterns/atoms, which do not contribute to classification but are essential for reconstruction in DL. These common/coherent atoms can thus be used interchangeably to represent a query datum, which can compromise classification performance. For this reason, to improve classification performance, we explicitly separate the coherent atoms by learning the commonality D_{C+1}, which provides the common bases for all categories. Denote the overall dictionary as D = [D_1, ..., D_c, ..., D_C, D_{C+1}] ∈ R^{d×K}, in which K = Σ_{c=1}^{C+1} K_c, D_c ∈ R^{d×K_c} stands for the particularity of the c-th class, and D_{C+1} ∈ R^{d×K_{C+1}} is the commonality. I is the identity matrix of appropriate size.

First of all, a learned dictionary D ought to represent every sample x_i well, i.e. x_i ≈ D a_i, where the coefficient a_i = [θ_i^{(1)}; ...; θ_i^{(C)}; θ_i^{(C+1)}] ∈ R^K and θ_i^{(c)} ∈ R^{K_c} is the part corresponding to the sub-dictionary D_c. Besides the overall dictionary, it is also expected that a sample from the c-th class can be well represented by the cooperative efforts of the c-th particularity D_c and the commonality D_{C+1}. Therefore, we renew the objective function f:

$$f \equiv \sum_{c=1}^{C}\sum_{i\in\mathcal{I}_c}\Big\{ \|\mathbf{x}_i - \mathbf{D}\mathbf{a}_i\|_2^2 + \lambda\|\mathbf{a}_i\|_1 + \|\mathbf{x}_i - \mathbf{D}_c\boldsymbol{\theta}_i^{(c)} - \mathbf{D}_{C+1}\boldsymbol{\theta}_i^{(C+1)}\|_2^2 \Big\}, \tag{2}$$
where θ_i^{(c)} and θ_i^{(C+1)} are the corresponding coefficients of the two sub-dictionaries. Actually, the last term of Eq.(2) is the same as the objective formulation of [7] if the sparsity penalty is ignored. For mathematical brevity, we introduce a selection operator Q_c = [q_1^c, ..., q_j^c, ..., q_{K_c}^c] ∈ R^{K×K_c}, in which the j-th column of Q_c is of the form:

$$\mathbf{q}_j^c = \big[\underbrace{0,\ldots,0}_{\sum_{m=1}^{c-1} K_m},\; \underbrace{\overbrace{0,\ldots,0}^{j-1},\,1,\,0,\ldots,0}_{K_c},\; \underbrace{0,\ldots,0}_{\sum_{m=c+1}^{C+1} K_m}\big]^T. \tag{3}$$
Therefore, we have Q_c^T Q_c = I, D_c = D Q_c and θ_i^{(c)} = Q_c^T a_i ∈ R^{K_c}. Let Q̄_c = [Q_c, Q_{C+1}]; then the coefficients corresponding to the particularity and the commonality are

$$\begin{pmatrix}\boldsymbol{\theta}_i^{(c)} \\ \boldsymbol{\theta}_i^{(C+1)}\end{pmatrix} = \bar{\mathbf{Q}}_c^T \mathbf{a}_i \in \mathbb{R}^{K_c + K_{C+1}}.$$

By introducing the terse notation φ(A_c) = Σ_{i=1}^{N_c} ||a_i^c||_1 for A_c = [a_1^c, ..., a_{N_c}^c], Eq.(2) can be rewritten as:

$$f \equiv \sum_{c=1}^{C}\Big\{ \|\mathbf{X}_c - \mathbf{D}\mathbf{A}_c\|_F^2 + \lambda\phi(\mathbf{A}_c) + \|\mathbf{X}_c - \mathbf{D}\bar{\mathbf{Q}}_c\bar{\mathbf{Q}}_c^T\mathbf{A}_c\|_F^2 \Big\}. \tag{4}$$
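The selection operator of Eq.(3) is just a block of an identity matrix; the following is a minimal numpy sketch of its construction (0-based indexing, illustrative names), verifying the identities Q_c^T Q_c = I and D Q_c = D_c used above.

```python
import numpy as np

def selection_operator(sizes, c):
    # Q_c from Eq.(3): picks out of the length-K coefficient vector the block
    # belonging to sub-dictionary D_c; sizes = [K_1, ..., K_{C+1}], c is 0-based
    K = sum(sizes)
    offset = sum(sizes[:c])
    Qc = np.zeros((K, sizes[c]))
    for j in range(sizes[c]):
        Qc[offset + j, j] = 1.0
    return Qc

sizes = [3, 4, 2]                    # say, two particularities and a commonality
Q1 = selection_operator(sizes, 1)
assert np.allclose(Q1.T @ Q1, np.eye(4))       # Q_c^T Q_c = I
D = np.arange(2 * 9).reshape(2, 9)
assert np.array_equal(D @ Q1, D[:, 3:7])       # D Q_c slices out D_c
```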
In this way, we first guarantee the representation power of the overall dictionary, and then require the particularity, in conjunction with the commonality, to well represent the data of this category. However, merely doing this is not sufficient to learn a discriminative dictionary, because other class-specific dictionaries may share similar bases with that of the c-th one, i.e. the atoms from different class-specific dictionaries can still be coherent and thus be used interchangeably for representing the query data. To circumvent this problem, we force the coefficients, except the parts corresponding to the c-th particularity and the commonality, to be zero. Mathematically, denote Q_{/c} = [Q_1, ..., Q_{c-1}, Q_{c+1}, ..., Q_C, Q_{C+1}] and Q̄_{/c} = [Q_1, ..., Q_{c-1}, Q_{c+1}, ..., Q_C]; we force ||Q̄_{/c}^T A_c||_F^2 = 0. Thus, we add a new term to the objective function:

$$f \equiv \sum_{c=1}^{C}\Big\{ \|\mathbf{X}_c - \mathbf{D}\mathbf{A}_c\|_F^2 + \|\bar{\mathbf{Q}}_{/c}^T\mathbf{A}_c\|_F^2 + \lambda\phi(\mathbf{A}_c) + \|\mathbf{X}_c - \mathbf{D}\bar{\mathbf{Q}}_c\bar{\mathbf{Q}}_c^T\mathbf{A}_c\|_F^2 \Big\}. \tag{5}$$
It is worth noting that Eq.(5) can still fail to capture the common patterns. For example, bases of the real common patterns may appear in several particularities, which makes the learned particularities redundant and less discriminative. Therefore, it is desirable to drive the common patterns into the commonality and preserve the class-specific features in the particularity. For this purpose, we add an incoherence term Q(D_i, D_j) = ||D_i^T D_j||_F^2 to the objective function. This penalty term has been used among the class-specific sub-dictionaries in [4], but we also consider the incoherence of the commonality with the particularities. We then derive the objective function of the proposed DL-COPAR:

$$f \equiv \sum_{c=1}^{C}\Big\{ \|\mathbf{X}_c - \mathbf{D}\mathbf{A}_c\|_F^2 + \|\bar{\mathbf{Q}}_{/c}^T\mathbf{A}_c\|_F^2 + \|\mathbf{X}_c - \mathbf{D}\bar{\mathbf{Q}}_c\bar{\mathbf{Q}}_c^T\mathbf{A}_c\|_F^2 + \lambda\phi(\mathbf{A}_c) \Big\} + \eta\sum_{c=1}^{C+1}\sum_{\substack{j=1 \\ j\neq c}}^{C+1} \mathcal{Q}(\mathbf{D}_c,\mathbf{D}_j). \tag{6}$$
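The incoherence term Q(D_i, D_j) = ||D_i^T D_j||_F^2 is cheap to evaluate; a small numpy sketch (illustrative name) shows that it vanishes exactly when the two sub-dictionaries are mutually orthogonal and grows when atoms are shared:

```python
import numpy as np

def incoherence(Di, Dj):
    # Q(D_i, D_j) = ||D_i^T D_j||_F^2 from Eq.(6); zero iff every atom of D_i
    # is orthogonal to every atom of D_j
    return np.linalg.norm(Di.T @ Dj, 'fro') ** 2

# orthogonal sub-dictionaries incur no penalty ...
Di = np.eye(4)[:, :2]
Dj = np.eye(4)[:, 2:]
assert incoherence(Di, Dj) == 0.0
# ... while a shared (coherent) atom is penalized
assert incoherence(Di, Di[:, :1]) > 0.0
```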
3 Optimization Step
At first sight, the objective function in Eq.(6) appears complex to solve, but we show it can be easily optimized through an alternating optimization process.
3.1 Fix D to update Ac
When fixing D to update A_c, we omit the terms independent of A_c from Eq.(6):

$$f \equiv \sum_{c=1}^{C}\Big\{ \|\mathbf{X}_c - \mathbf{D}\mathbf{A}_c\|_F^2 + \|\bar{\mathbf{Q}}_{/c}^T\mathbf{A}_c\|_F^2 + \lambda\phi(\mathbf{A}_c) + \|\mathbf{X}_c - \mathbf{D}\bar{\mathbf{Q}}_c\bar{\mathbf{Q}}_c^T\mathbf{A}_c\|_F^2 \Big\}
= \sum_{c=1}^{C}\Bigg\{ \bigg\| \begin{pmatrix}\mathbf{X}_c \\ \mathbf{X}_c \\ \mathbf{0}\end{pmatrix} - \begin{pmatrix}\mathbf{D} \\ \mathbf{D}\bar{\mathbf{Q}}_c\bar{\mathbf{Q}}_c^T \\ \bar{\mathbf{Q}}_{/c}^T\end{pmatrix}\mathbf{A}_c \bigg\|_F^2 + \lambda\phi(\mathbf{A}_c) \Bigg\}. \tag{7}$$
Proposition 1. Denote $\tilde{\mathbf{X}}_c = \begin{pmatrix}\mathbf{X}_c \\ \mathbf{X}_c \\ \mathbf{0}\end{pmatrix}$ and $\tilde{\mathbf{D}} = [\tilde{\mathbf{d}}_1,\ldots,\tilde{\mathbf{d}}_k,\ldots,\tilde{\mathbf{d}}_K]$, where $\tilde{\mathbf{d}}_k$ is the ℓ2-norm unitized vector of the k-th column of $\begin{pmatrix}\mathbf{D} \\ \mathbf{D}\bar{\mathbf{Q}}_c\bar{\mathbf{Q}}_c^T \\ \bar{\mathbf{Q}}_{/c}^T\end{pmatrix}$. Then, when fixing D to minimize Eq.(6) over A_c, the minimum is reached when A_c equals (1/√2)Ã, wherein Ã is the result of:

$$\tilde{\mathbf{A}} = \arg\min_{\tilde{\mathbf{A}}} \|\tilde{\mathbf{X}}_c - \tilde{\mathbf{D}}\tilde{\mathbf{A}}\|_F^2 + \frac{\lambda}{\sqrt{2}}\phi(\tilde{\mathbf{A}}).$$

Proof. Denote $\mathbf{D}' = \begin{pmatrix}\mathbf{D} \\ \mathbf{D}\bar{\mathbf{Q}}_c\bar{\mathbf{Q}}_c^T \\ \bar{\mathbf{Q}}_{/c}^T\end{pmatrix}$. Through simple derivations, it is easy to show that the Euclidean length of each column of D′ is √2; then

$$\begin{aligned}
\mathbf{A}_c &= \arg\min_{\mathbf{A}_c}\Big\{ \big\|\tilde{\mathbf{X}}_c - \tfrac{1}{\sqrt{2}}\mathbf{D}'\,\sqrt{2}\mathbf{A}_c\big\|_F^2 + \lambda\phi(\mathbf{A}_c) \Big\} \\
&= \arg\min_{\mathbf{A}_c}\Big\{ \big\|\tilde{\mathbf{X}}_c - \tilde{\mathbf{D}}\,\sqrt{2}\mathbf{A}_c\big\|_F^2 + \tfrac{\lambda}{\sqrt{2}}\phi(\sqrt{2}\mathbf{A}_c) \Big\}.
\end{aligned} \tag{8}$$
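The √2 column-length claim in the proof can be checked numerically: whichever block a column index k falls in, exactly two of the three stacked parts contribute a unit-length piece. A small numpy sketch under assumed block sizes (all names are ours):

```python
import numpy as np

# sizes of [D_1, D_2, D_{C+1}] and a random overall dictionary with unit atoms
sizes, d = [3, 4, 2], 6
K = sum(sizes)
rng = np.random.default_rng(0)
D = rng.standard_normal((d, K))
D /= np.linalg.norm(D, axis=0)

def block(c):
    # selection operator Q_c for the c-th block (0-based), as in Eq.(3)
    off = sum(sizes[:c])
    Q = np.zeros((K, sizes[c]))
    Q[off:off + sizes[c]] = np.eye(sizes[c])
    return Q

c = 0
Qbar_c = np.hstack([block(c), block(2)])                      # [Q_c, Q_{C+1}]
Qbar_nc = np.hstack([block(i) for i in range(2) if i != c])   # bar Q_{/c}
D_prime = np.vstack([D, D @ Qbar_c @ Qbar_c.T, Qbar_nc.T])    # stacked dictionary of Eq.(7)

# every column of D' has Euclidean length sqrt(2), as Proposition 1 claims
assert np.allclose(np.linalg.norm(D_prime, axis=0), np.sqrt(2.0))
```

Columns inside block c (and the commonality) get d_k twice and a zero indicator; columns of the other particularities get d_k once plus a unit indicator from Q̄_{/c}^T, so both cases have squared length 2.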
Proposition 1 indicates that, to update the coefficients A_c with D fixed, we can simply solve a LASSO problem. Throughout this paper, we adopt the feature-sign search algorithm [9] to solve this LASSO problem.
3.2 Fix Ac to update D
To update the overall dictionary D = [D_1, ..., D_C, D_{C+1}], we turn to an iterative approach, i.e. updating D_c while fixing all the other D_i's. Besides the c-th class-specific dictionary, the common pattern pool also contributes to fitting the signals from the c-th class, so optimizing D_{C+1} differs from optimizing D_c. We elaborate the optimization steps below.
Update the particularity D_c. Without loss of generality, we concentrate on the optimization of the c-th class-specific dictionary D_c while fixing the other D_i's. Specifically, we denote A_c^{(i)} = Q_i^T A_c for i = 1, ..., C+1. By dropping the unrelated terms, we update the c-th particularity D_c as below:

$$\mathbf{D}_c = \arg\min_{\mathbf{D}_c} \Big\|\mathbf{X}_c - \sum_{\substack{i=1 \\ i\neq c}}^{C+1}\mathbf{D}_i\mathbf{A}_c^{(i)} - \mathbf{D}_c\mathbf{A}_c^{(c)}\Big\|_F^2 + \big\|\mathbf{X}_c - \mathbf{D}_{C+1}\mathbf{A}_c^{(C+1)} - \mathbf{D}_c\mathbf{A}_c^{(c)}\big\|_F^2 + 2\eta\sum_{\substack{i=1 \\ i\neq c}}^{C+1}\mathcal{Q}(\mathbf{D}_c,\mathbf{D}_i). \tag{9}$$
Let X̄_c = X_c − D_{C+1} A_c^{(C+1)}, Y_c = X_c − Σ_{i=1, i≠c}^{C+1} D_i A_c^{(i)} and B = D Q_{/c}; then we have:

$$\mathbf{D}_c = \arg\min_{\mathbf{D}_c} \|\mathbf{Y}_c - \mathbf{D}_c\mathbf{A}_c^{(c)}\|_F^2 + \|\bar{\mathbf{X}}_c - \mathbf{D}_c\mathbf{A}_c^{(c)}\|_F^2 + 2\eta\|\mathbf{D}_c^T\mathbf{B}\|_F^2. \tag{10}$$
We propose to update D_c = [d_c^1, ..., d_c^{K_c}] atom by atom, i.e. updating d_c^k while fixing the other columns. Specifically, denote A_c^{(c)} = [a_{(1)}; ...; a_{(K_c)}] ∈ R^{K_c×N_c}, wherein a_{(k)} ∈ R^{1×N_c} is the k-th row of A_c^{(c)}. Let Ỹ_c = Y_c − Σ_{j≠k} d_c^j a_{(j)} and X̃_c = X̄_c − Σ_{j≠k} d_c^j a_{(j)}; then we get:

$$\mathbf{d}_c^k = \arg\min_{\mathbf{d}_c^k}\Big\{ g(\mathbf{d}_c^k) \equiv \|\tilde{\mathbf{Y}}_c - \mathbf{d}_c^k\mathbf{a}_{(k)}\|_F^2 + \|\tilde{\mathbf{X}}_c - \mathbf{d}_c^k\mathbf{a}_{(k)}\|_F^2 + 2\eta\|{\mathbf{d}_c^k}^T\mathbf{B}\|_F^2 \Big\}. \tag{11}$$

Setting the first derivative of g(d_c^k) w.r.t. d_c^k to zero, i.e. ∂g(d_c^k)/∂d_c^k = 0, we obtain the updated d_c^k as below:

$$\mathbf{d}_c^k = \frac{1}{2}\big(\|\mathbf{a}_{(k)}\|_2^2\mathbf{I} + \eta\mathbf{B}\mathbf{B}^T\big)^{-1}\big(\tilde{\mathbf{Y}}_c + \tilde{\mathbf{X}}_c\big)\mathbf{a}_{(k)}^T. \tag{12}$$

Note that, as a dictionary atom, d_c^k ought to be unitized, i.e. d_c^k ← d_c^k/||d_c^k||_2. Along with this, the corresponding coefficient row should be multiplied by ||d_c^k||_2, i.e. a_{(k)} ← ||d_c^k||_2 a_{(k)}.
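The closed-form atom update of Eq.(12), followed by the unitization and coefficient rescale (which together leave the product d_c^k a_{(k)} unchanged), can be sketched in a few lines of numpy; shapes and names below are illustrative.

```python
import numpy as np

def update_atom(Y_t, X_t, a_k, B, eta):
    # closed-form stationary point of g(d_c^k) in Eq.(11), i.e. Eq.(12),
    # followed by unit-normalization of the atom and the matching rescale of a_(k)
    d_dim = Y_t.shape[0]
    lhs = (a_k @ a_k) * np.eye(d_dim) + eta * (B @ B.T)   # ||a_(k)||^2 I + eta B B^T
    d = 0.5 * np.linalg.solve(lhs, (Y_t + X_t) @ a_k)
    scale = np.linalg.norm(d)
    return d / scale, scale * a_k                          # d_c^k a_(k) is preserved

rng = np.random.default_rng(1)
d_dim, n, nB = 8, 20, 5
Y_t = rng.standard_normal((d_dim, n))   # residual Y tilde
X_t = rng.standard_normal((d_dim, n))   # residual X tilde
a_k = rng.standard_normal(n)            # the k-th coefficient row a_(k)
B = rng.standard_normal((d_dim, nB))    # atoms of the other sub-dictionaries
d, a_new = update_atom(Y_t, X_t, a_k, B, eta=0.5)
assert np.isclose(np.linalg.norm(d), 1.0)
```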
Update the shared feature pool D_{C+1}. Denote B = D Q̄_{/(C+1)}, i.e. B = [D_1, ..., D_C]. By dropping the unrelated terms, we update D_{C+1} as below:

$$\mathbf{D}_{C+1} = \arg\min_{\mathbf{D}_{C+1}} \sum_{c=1}^{C}\Big\{ \Big\|\mathbf{X}_c - \sum_{i=1}^{C}\mathbf{D}_i\mathbf{A}_c^{(i)} - \mathbf{D}_{C+1}\mathbf{A}_c^{(C+1)}\Big\|_F^2 + \big\|\mathbf{X}_c - \mathbf{D}_c\mathbf{A}_c^{(c)} - \mathbf{D}_{C+1}\mathbf{A}_c^{(C+1)}\big\|_F^2 \Big\} + 2\eta\|\mathbf{D}_{C+1}^T\mathbf{B}\|_F^2. \tag{13}$$

Denote Y_c = X_c − Σ_{i=1}^C D_i A_c^{(i)} and X̄_c = X_c − D_c A_c^{(c)} (note that X̄_c and Y_c here are different from the ones in the optimization of D_c); then we have:

$$\mathbf{D}_{C+1} = \arg\min_{\mathbf{D}_{C+1}} \|\mathbf{Y} - \mathbf{D}_{C+1}\mathbf{A}^{(C+1)}\|_F^2 + \|\bar{\mathbf{X}} - \mathbf{D}_{C+1}\mathbf{A}^{(C+1)}\|_F^2 + 2\eta\|\mathbf{D}_{C+1}^T\mathbf{B}\|_F^2, \tag{14}$$

where A^{(C+1)} = [A_1^{(C+1)}, ..., A_C^{(C+1)}], X̄ = [X̄_1, ..., X̄_C], and Y = [Y_1, ..., Y_C]. As well, we choose to update D_{C+1} atom by atom, and the k-th column is updated by:

$$\mathbf{d}_{C+1}^k = \frac{1}{2}\big(\|\mathbf{a}_{(k)}\|_2^2\mathbf{I} + \eta\mathbf{B}\mathbf{B}^T\big)^{-1}\big(\tilde{\mathbf{X}} + \tilde{\mathbf{Y}}\big)\mathbf{a}_{(k)}^T, \tag{15}$$

where A^{(C+1)} = [a_{(1)}; ...; a_{(K_{C+1})}] ∈ R^{K_{C+1}×N}, Ỹ = Y − Σ_{j≠k} d_{C+1}^j a_{(j)}, and X̃ = X̄ − Σ_{j≠k} d_{C+1}^j a_{(j)}. Similarly, we unitize d_{C+1}^k to get the unit-length atom and scale the coefficients accordingly, a_{(k)} ← ||d_{C+1}^k||_2 a_{(k)}.

The overall algorithm is summarized in Algorithm 1. Note that the value of the objective function decreases at each iteration, therefore the algorithm converges.
4 Experimental Validation
In this section, we perform a series of experiments to evaluate the proposed DL-COPAR. First, a synthetic dataset is used to demonstrate the power of our method in learning the commonality and the particularity. Then, we conduct experiments comparing our method with some state-of-the-art approaches on four publicly available benchmarks for four real-world recognition tasks, i.e. face, hand-written digit, scene, and object recognition.
Algorithm 1 Learning the Commonality and Particularity
Require: training dataset [X_1, ..., X_C] with labels, the sizes K_c of the C+1 desired sub-dictionaries, and λ;
Ensure: ||d_i||_2 = 1, for ∀i = 1, ..., K;
1: initialize D_c for c = 1, ..., C, C+1;
2: while stop criterion is not reached do
3:   update the coefficients A_c by solving the LASSO problem Eq.(8);
4:   update D_c atom by atom through Eq.(12), with the unitization process and correspondingly scaled coefficients;
5:   update D_{C+1} atom by atom through Eq.(15), with the unitization process and correspondingly scaled coefficients;
6: end while
7: return the learned dictionaries D_1, ..., D_C, D_{C+1}.
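Since the objective value decreases monotonically, a relative-decrease test is a natural stopping criterion for Algorithm 1's outer loop. Below is a minimal sketch of that alternating skeleton with generic update callbacks; the demo instantiates it on a plain least-squares factorization (no sparsity), so the objective and the two update functions are placeholders for steps 3-5, not the paper's actual solvers.

```python
import numpy as np

def alternate_until_converged(objective, update_A, update_D, D0, A0,
                              tol=1e-4, max_iter=100):
    # alternating scheme of Algorithm 1 with a relative-decrease stopping rule;
    # update_A / update_D stand in for steps 3-5 (Eq.(8), Eq.(12), Eq.(15))
    D, A = D0, A0
    prev = objective(D, A)
    for _ in range(max_iter):
        A = update_A(D, A)
        D = update_D(D, A)
        cur = objective(D, A)
        if prev - cur <= tol * max(prev, 1.0):   # monotone decrease => converged
            break
        prev = cur
    return D, A

# toy instance: alternating least squares for ||X - D A||_F^2
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 30))
obj = lambda D, A: np.linalg.norm(X - D @ A) ** 2
upd_A = lambda D, A: np.linalg.lstsq(D, X, rcond=None)[0]
upd_D = lambda D, A: np.linalg.lstsq(A.T, X.T, rcond=None)[0].T
D, A = alternate_until_converged(obj, upd_A, upd_D,
                                 rng.standard_normal((5, 3)), np.zeros((3, 30)))
```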
4.1 Experimental setup
We employ the K-SVD algorithm [10] to initialize the particularities and the commonality. In detail, to initialize the c-th class-specific dictionary, we perform K-SVD on the data from the c-th category, and to initialize the common pattern pool, we perform K-SVD on the whole training set. Note that we adopt the feature-sign search algorithm [9] for the sparse coding problem throughout our experiments.
In the classification stage, we adopt two reconstruction-error-based classification schemes and the linear SVM classifier for the different classification tasks. The two reconstruction-error-based schemes are used for face and hand-written digit recognition: one is a global classifier (GC) and the other is a local classifier (LC). Note that throughout our experiments, the sparsity parameter γ below is tuned so that about 20% of the elements of the coefficient vector are nonzero.
GC. For a testing sample y, we code it over the overall learned dictionary D: â = argmin_a ||y − Da||_2^2 + γ||a||_1, where γ is a constant. Denote â = [θ^{(1)}; ...; θ^{(C)}; θ^{(C+1)}], where θ^{(c)} is the coefficient vector associated with sub-dictionary D_c. Then the reconstruction error by class c is e_c = ||y − D̄_c θ̄^{(c)}||_2^2, where D̄_c = [D_c, D_{C+1}] and θ̄^{(c)} = [θ^{(c)}; θ^{(C+1)}]. Finally, the predicted label ĉ is calculated by ĉ = argmin_c e_c.

LC. For a testing sample y, we directly calculate C reconstruction errors for the C classes: e_c = min_{a_c} ||y − D̄_c a_c||_2^2 + γ||a_c||_1, where D̄_c = [D_c, D_{C+1}]. The final identity of y is ĉ = argmin_c e_c.

The linear SVM is used for scene and object recognition within the spatial pyramid matching (SPM) framework [11], but in the dictionary learning stage we adopt the proposed DL-COPAR. In detail, we first extract SIFT descriptors (128 in length) [12] from 16×16 pixel patches, densely sampled from each image on a grid with 8-pixel step size. Over the extracted SIFT descriptors from the different categories, we use the proposed DL-COPAR to train an overall dictionary D ∈ R^{128×K} concatenated from the particularities and the commonality. Then we encode the descriptors of the image over the learned dictionary. Partitioning the image into 2^l × 2^l segments at scales l = 0, 1, 2, we use the max pooling technique to pool the coefficients over each segment into a single vector. Thus, there are 1 + 2×2 + 4×4 = 21 pooled vectors, which are concatenated into the 21K-dimensional vector for the final representation. Note that, different from the pooling process in existing SPM-based methods, we only use the coefficients corresponding to the particularities, discarding the ones corresponding to the commonality. Finally, we concatenate all the intercepted pooled vectors as the representation of the image. In all four recognition experiments, we run each experiment 10 times and report the average recognition rate.
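The GC and LC schemes described above can be sketched in a few lines of numpy. The sketch below uses a plain ISTA coder in place of the feature-sign solver the experiments actually use, and all function names are illustrative; `parts` is the list of learned particularities D_1, ..., D_C and `common` is D_{C+1}.

```python
import numpy as np

def ista(y, D, gamma, n_iter=300):
    # plain ISTA for a = argmin_a ||y - D a||_2^2 + gamma * ||a||_1
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - 2.0 * D.T @ (D @ a - y) / L
        a = np.sign(a) * np.maximum(np.abs(a) - gamma / L, 0.0)
    return a

def classify_gc(y, parts, common, gamma=0.05):
    # GC: one coding over the overall dictionary, then per-class residuals
    # over [D_c, D_{C+1}] using the matching pieces of the code
    D = np.hstack(parts + [common])
    a = ista(y, D, gamma)
    errs, off = [], 0
    for Dc in parts:
        theta = np.concatenate([a[off:off + Dc.shape[1]], a[-common.shape[1]:]])
        errs.append(np.linalg.norm(y - np.hstack([Dc, common]) @ theta) ** 2)
        off += Dc.shape[1]
    return int(np.argmin(errs))

def classify_lc(y, parts, common, gamma=0.05):
    # LC: one independent coding per class over [D_c, D_{C+1}]
    errs = []
    for Dc in parts:
        Dbar = np.hstack([Dc, common])
        a = ista(y, Dbar, gamma)
        errs.append(np.linalg.norm(y - Dbar @ a) ** 2)
    return int(np.argmin(errs))

# toy check: orthonormal sub-dictionaries, query equal to an atom of class 1
parts = [np.eye(6)[:, :2], np.eye(6)[:, 2:4]]
common = np.eye(6)[:, 4:]
y = np.eye(6)[:, 2]
```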
4.2 Synthetic Example
The synthetic dataset consists of two classes of gray-scale images of size 30×30. Both categories have three common basis images denoted by F_i^S for i = 1, 2, 3, as illustrated in the right panel of Fig. 1, where dark color represents the value zero and white represents 255. Category 1 has four individual basis images denoted by F_{1i}^I for i = 1, ..., 4, as shown in the left panel of Fig. 1. Similarly, the four basis images of category 2 are denoted by F_{2i}^I for i = 1, ..., 4 in Fig. 1. The images used in our experiment are generated by adding Gaussian noise with zero mean and 0.01 variance. Fig. 2 displays several examples from the dataset, the top row for the first class and the bottom row for the second class. Note that a similar synthetic dataset is used in [7], but ours differs in two ways: we use Gaussian noise instead of uniform random noise, and noise is added to both the dark and the white regions rather than merely to the dark region as in [7]. Therefore, in our synthetic dataset it is more difficult to separate the common patterns from the individual features.
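A dataset of this kind is easy to regenerate. The sketch below uses hypothetical stand-in patterns for the basis images F_i^S and F_{1i}^I (the paper's actual patterns are shown in Fig. 1, not reproduced here), and an assumed noise scale, since the text gives the variance relative to an unspecified intensity range.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-ins for the 30x30 basis images (value 0 = dark, 255 = white)
common = [np.zeros((30, 30)) for _ in range(3)]    # shared bases F^S_i
class1 = [np.zeros((30, 30)) for _ in range(4)]    # class-1 bases F^I_{1i}
for i, b in enumerate(common):
    b[i * 10:(i + 1) * 10, :5] = 255.0
for i, b in enumerate(class1):
    b[:5, i * 7:(i + 1) * 7] = 255.0

def sample(bases_common, bases_class, sigma=0.1 * 255):
    # a sample = random nonnegative combination of shared + class bases,
    # plus Gaussian noise over the whole image (dark and white regions alike)
    w = rng.uniform(0, 1, len(bases_common) + len(bases_class))
    img = sum(wi * b for wi, b in zip(w, bases_common + bases_class))
    return img + rng.normal(0.0, sigma, (30, 30))

x = sample(common, class1)
assert x.shape == (30, 30)
```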
We compare the learned dictionaries with those learned by two closely related DL-based classification methods: the Fisher Discriminative Dictionary Learning method (FDDL) [6] and the method based on DL with structured incoherence (DLSI) [4]. As a supervised DL method, FDDL learns class-specific dictionaries for each class and makes them most discriminative through the Fisher criterion. Fig. 3 shows the dictionary learned by FDDL, one row per class. It is clear that the dictionaries learned by FDDL are quite discriminative and each class-specific dictionary can faithfully represent the signals from the corresponding class. However, even though FDDL succeeds in learning most discriminative dictionaries, it cannot separate the common patterns from the individual features; thus, the dictionaries can be too redundant for classification. DLSI can learn the common features besides the class-specific dictionaries, as Fig. 4 shows. Actually, DLSI learns class-specific sub-dictionaries, and then selects the most coherent atoms among the different individual dictionaries as the common features. From this figure, we can see that the learned individual dictionaries are still mixed with the common features, and the separated common features are too redundant.
On the contrary, as expected, the proposed DL-COPAR can correctly separate the common features and the individual features, as shown in Fig. 5. As a comparison, we also plot the particularities initialized by K-SVD. There is no doubt that any signal can be well represented by the corresponding class-specific dictionary and the shared pattern pool. This phenomenon reveals that DL-COPAR can learn the most compact and most discriminative dictionaries for classification. Fig. 6 demonstrates the convergence curve of running DL-COPAR on the synthetic dataset, which shows the proposed method converges within about 15 steps in this case.
Fig. 1. The real class-specific features and the common patterns (left panel: the class-specific basis images; right panel: the common basis images).

Fig. 2. Examples of the synthetic dataset.

Fig. 3. The class-specific dictionaries learned by FDDL [6].

Fig. 4. The class-specific dictionaries and the shared features learned by DLSI [4].

Fig. 5. The initialized particularities and the learned particularities and commonality by the proposed DL-COPAR.

Fig. 6. The objective function value vs. number of iterations on the synthetic data set.
Table 1. The recognition rates (%) of various methods on the Extended YaleB face database.

Method:   SRC   D-KSVD  LC-SVD  DLSI  FDDL  kNN   SVM   DL-COPAR (LC)  DL-COPAR (GC)
Accuracy: 97.2  94.1    96.7    96.5  97.9  82.5  96.2  96.9           98.3
4.3 Face recognition
The Extended YaleB [13] database contains 2,414 frontal face images of 38 persons, about 68 images per individual. It is challenging due to varying illumination conditions and expressions; the original images are cropped to 192×168 pixels. We use random faces [14, 15, 5] as the feature descriptors, which are obtained by projecting a face image onto a 504-dimensional vector with a randomly generated matrix drawn from a zero-mean normal distribution. We randomly select half of the images (about 32 images per individual) for training and the rest for testing. The learned dictionary consists of 570 atoms, corresponding to an average of 15 atoms per person, with the last 5 as the common features. We compare the performance of our DL-COPAR with that of D-KSVD [5], SRC [14], DLSI [4], LC-SVD [15] and FDDL [6].
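The "random faces" descriptor described above is a random linear projection. A minimal numpy sketch (the normalization of the projection matrix is our convention choice; the paper does not specify one):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_faces(images, dim=504):
    # project each vectorized face image onto `dim` random directions drawn
    # from a zero-mean normal distribution (the "random faces" descriptor)
    d = images.shape[1]
    R = rng.standard_normal((dim, d)) / np.sqrt(dim)  # scaling is a convention choice
    return images @ R.T

faces = rng.random((10, 192 * 168))      # ten fake 192x168 faces, vectorized
feats = random_faces(faces)
assert feats.shape == (10, 504)
```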
A direct comparison is shown in Table 1. We can see that D-KSVD, which only uses coding coefficients for classification, does not work well on this dataset. LC-SVD, which employs the label information to improve the discrimination of the learned overall dictionary and also uses the coefficients for the final classification, achieves better classification performance than D-KSVD, but still no better than SRC. DLSI aims to make the class-specific dictionaries incoherent and thereby facilitates the discrimination of the dictionaries; it employs the reconstruction error for classification, and achieves no better results than SRC. FDDL utilizes the Fisher criterion to make the coefficients more discriminative, thus improving the discrimination of the individual dictionaries. It has achieved state-of-the-art classification performance on this database, with accuracy higher than SRC. However, our DL-COPAR with GC achieves the best classification performance, about 0.4% higher than FDDL. We can also see that when DL-COPAR uses LC for classification, it cannot produce a satisfactory outcome. The reason is that the size of each class-specific dictionary is too small to faithfully represent the data, so the collaboration of these sub-dictionaries is of crucial significance.
4.4 Hand-written digit recognition
USPS¹ is a widely used handwritten digit data set with 7,291 training and 2,007 testing images. We compare the proposed DL-COPAR with several methods, including the reconstructive DL method with linear and bilinear classifier models (denoted REC-L and REC-BL) reported in [3], the supervised DL method with generative and discriminative training (denoted SDL-G and SDL-D) reported in [3], the result of sparse representation for signal classification (denoted SRSC) by [16], DLSI [4]

¹ http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html
Fig. 7. The learned particularities (a) and commonality (b) by DL-COPAR on the USPS database.
Table 2. Error rates (%) of various methods on the USPS data set.

Method:     SRSC  REC-L  REC-BL  SDL-G  SDL-D  DLSI  FDDL  kNN  SVM  DL-COPAR (LC)  DL-COPAR (GC)
Error Rate: 6.05  6.83   4.38    6.67   3.54   3.98  3.69  5.2  4.2  3.61           4.70
and FDDL [6]. Additionally, some results of problem-specific methods (i.e. the kNN method and SVM with a Gaussian kernel) reported in [4] are also listed. The original images, of 16×16 resolution, are directly used here.
A direct comparison is shown in Table 2; these results are either originally reported here or taken from [6]. As we can see from the table, our DL-COPAR with the LC classification scheme outperforms all the other methods except SDL-D. Actually, the optimization of SDL-D is much more complex than that of DL-COPAR, and it uses much more information in the dictionary learning and classification process, such as a learned classifier on the coding coefficients, the sparsity of the coefficients and the reconstruction error. It is worth noting that DL-COPAR only trains 30 codewords for each class and 8 atoms as the shared commonality, whereas FDDL learns 90 dictionary atoms for each digit even though it achieves a result very close to that of DL-COPAR. Fig. 7 illustrates the learned particularities and the commonality of DL-COPAR.
4.5 Scene recognition
We also try DL-COPAR on the scene classification task on the 15-scene dataset, compiled by several researchers [22, 23, 17]. This dataset contains 4,484 images in total, falling into 15 categories, with the number of images per category ranging from 200 to 400. Following the common setup on this database [11, 17], we randomly select 100 images per category as the training set and use the rest for testing. An overall dictionary of 1024 visual words is constituted. For DL-COPAR, a 60-visual-word particularity is learned for each category, and the commonality consists of 124 shared pattern bases; thus the size of the concatenated dictionary is 15×60 + 124 = 1024. Note that the size of the final representation of each image is 21×15×60 = 18900, which is less than that of other SPM methods, 21×1024 = 21504, because we discard the codes corresponding to the commonality.
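The dictionary and feature-size bookkeeping above can be verified in a few lines:

```python
# sanity-check of the dictionary bookkeeping in the 15-scene setup
C, per_class, shared = 15, 60, 124
assert C * per_class + shared == 1024            # concatenated dictionary size
cells = sum((2 ** l) ** 2 for l in range(3))     # 1 + 4 + 16 = 21 SPM cells
assert cells == 21
assert cells * C * per_class == 18900            # final length (commonality codes dropped)
assert cells * 1024 == 21504                     # standard SPM length, for comparison
```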
ECCV#165
12 S. Kong and D.H. Wang
Table 3. The recognition rates (%) of various methods on the 15-scene database.

Method    KSPM [17]  ScSPM [11]  LLC [18]  GLP [19]  Boureau et al. [20]  Boureau et al. [21]  DL-COPAR
Accuracy  81.40      80.28       79.24     83.20     83.3                 84.9                 85.37
Table 4. The recognition rates (%) of various methods on the Caltech101 dataset. 15 and 30 images per class are randomly selected for training, shown in the second and third rows, respectively.

Method       KSPM [17]  ScSPM [11]  LLC [18]  GLP [19]  Boureau et al. [20]  Boureau et al. [21]  DL-COPAR
15 training  56.44      67.00       65.43     70.34     -                    -                    75.10
30 training  64.40      73.20       73.44     82.60     77.3                 75.70                83.26
We compare the proposed DL-COPAR with several recent methods based on SPM and sparse coding. Direct comparisons are shown in Table 3. KSPM [17], the popular nonlinear kernel SPM using spatial-pyramid histograms and Chi-square kernels, achieves 81.40% accuracy. Compared with KSPM, ScSPM [11], the baseline method in our paper, employs sparse coding, max pooling and a linear SVM for classification, achieving an 80.28% classification rate. LLC [18] incorporates locality relations into sparse coding, achieving 79.24%. GLP [19] focuses on a more sophisticated pooling method, which learns a weight for each SIFT descriptor during pooling by enhancing discrimination and incorporating local spatial relations. At a high computational cost (an eigenvalue decomposition of a large matrix and a gradient ascent process), GLP achieves 83.20% accuracy on this dataset. In [21], 84.9% classification accuracy is achieved through macrofeatures, which jointly encode the SIFT descriptors of a small neighborhood and learn a discriminative dictionary.
The 85.37% accuracy achieved by our method is higher than that of all the other methods, and is also competitive with the current state of the art of 88.1% reported in [24] and 89.75% in [25]. It is worth noting that DL-COPAR uses only SIFT descriptors and solves the classical LASSO problem for sparse coding, whereas the method in [24] combines 14 different low-level features, and that in [25] involves a more complicated coding process, Laplacian sparse coding, which requires a well-estimated similarity matrix and multiple optimization stages.
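For concreteness, the classical LASSO sparse-coding step mentioned above can be sketched with a plain ISTA (iterative soft-thresholding) solver. This is a minimal illustration, not the paper's actual solver or settings; the dictionary D, signal x and sparsity weight lam are synthetic.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(D, x, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 over a fixed dictionary D."""
    L = np.linalg.norm(D, ord=2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)               # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms, as is customary
x = D[:, :5] @ rng.standard_normal(5)          # signal spanned by a few atoms
a = lasso_ista(D, x)
print("nonzeros:", np.count_nonzero(a))        # the resulting code is sparse
```

In the paper's setting the same kind of solve is applied to each SIFT descriptor over the concatenated particularity-plus-commonality dictionary.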
4.6 Object recognition
The Caltech101 dataset [26] contains 9144 images in 101 object classes, including animals, vehicles, flowers, etc., plus one background category. Each class contains 31 to 800 images with significant shape variation. As suggested by the original dataset and by previous studies, we randomly select 15 and 30 images per class for training, respectively, and report classification accuracies averaged over the 102 classes. The size of the overall dictionary is set to K = 2048, a popular setting in the community, and DL-COPAR learns 17 visual words for each category and 314 for the shared pattern
Learning Dictionary-Based Particularity and Commonality for Classification 13
Fig. 8. Some example images from 15-scene and Caltech101. Despite their class-specific features, the images of different categories share some common patterns, such as the clouds in (a) and the flora background in (b).
pool, 17 × 102 + 314 = 2048 in total. As in the scene recognition experiment, the final image representation (of dimension 21 × 17 × 102 = 36414) is still much smaller than that of other SPM methods (21 × 2048 = 43008), as a result of discarding the sparse codes corresponding to the commonality.
Table 4 shows the direct comparisons. Our DL-COPAR consistently outperforms the other SPM-based methods, achieving 83.26% accuracy, which is competitive with the state-of-the-art performance of 84.3% achieved by a group-sensitive multiple kernel method [27]. Compared with ScSPM, the baseline method in this experiment, DL-COPAR improves the classification performance by margins of more than 8% and 10% when 15 and 30 training images are randomly selected, respectively. As illustrated by Fig. 8, we conjecture that DL-COPAR achieves such high classification accuracy on both 15-scene and Caltech101 because, for example, the complex background is intensively encoded over the learned commonality, whose corresponding coefficients are then discarded; omitting these common patterns thus improves classification performance.
5 Conclusion
Under the empirical observation that images (objects) from different categories usually share some common patterns that are not helpful for classification but essential for representation, we propose a novel dictionary learning approach, dubbed DL-COPAR, to explicitly learn a shared pattern pool (the commonality) and class-specific dictionaries (the particularity). The combination of a class's particularity and the commonality can faithfully represent the samples from that class, while the particularities themselves are more discriminative and more compact for classification. Through experiments on a synthetic dataset and several public benchmarks for various applications, we show that DL-COPAR achieves very promising performance.
Acknowledgements. We are grateful for financial support from the 973 Program (No. 2010CB327904) and the Natural Science Foundation of China (No. 61071218).
References
1. Elad, M., Aharon, M.: Image denoising via learned dictionaries and sparse representation. CVPR (2006)
2. Bradley, D.M., Bagnell, J.A.: Differentiable sparse coding. NIPS (2008)
3. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Supervised dictionary learning. NIPS (2008)
4. Ramirez, I., Sprechmann, P., Sapiro, G.: Classification and clustering via dictionary learning with structured incoherence and shared features. CVPR (2010)
5. Zhang, Q., Li, B.: Discriminative K-SVD for dictionary learning in face recognition. CVPR (2010)
6. Yang, M., Zhang, L., Feng, X., Zhang, D.: Fisher discrimination dictionary learning for sparse representation. ICCV (2011)
7. Wang, F., Lee, N., Sun, J., Hu, J., Ebadollahi, S.: Automatic group sparse coding. AAAI (2011)
8. Yang, M., Zhang, L., Yang, J., Zhang, D.: Metaface learning for sparse representation based face recognition. ICIP (2010)
9. Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. NIPS (2007)
10. Aharon, M., Elad, M., Bruckstein, A.M.: K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Processing 54 (2006) 4311–4322
11. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. CVPR (2009)
12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
13. Georghiades, A., Belhumeur, P., Kriegman, D.: From few to many: Illumination cone models for face recognition under variable lighting and pose. PAMI (2001)
14. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via sparse representation. PAMI (2009)
15. Jiang, Z., Lin, Z., Davis, L.S.: Learning a discriminative dictionary for sparse coding via label consistent K-SVD. CVPR (2011)
16. Huang, K., Aviyente, S.: Sparse representation for signal classification. NIPS (2006)
17. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR (2006)
18. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. CVPR (2010)
19. Feng, J., Ni, B., Tian, Q., Yan, S.: Geometric ℓp-norm feature pooling for image classification. CVPR (2011)
20. Boureau, Y.L., Le Roux, N., Bach, F., Ponce, J., LeCun, Y.: Ask the locals: Multi-way local pooling for image recognition. ICCV (2011)
21. Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. CVPR (2010)
22. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV (2001)
23. Li, F.F., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. CVPR (2005)
24. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. CVPR (2010)
25. Gao, S., Tsang, I.W.H., Chia, L.T., Zhao, P.: Local features are not lonely - Laplacian sparse coding for image classification. CVPR (2010)
26. Li, F.F., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. IEEE CVPR Workshop on Generative-Model Based Vision (2004)
27. Yang, J., Li, Y., Tian, Y., Duan, L., Gao, W.: Group-sensitive multiple kernel learning for object categorization. ICCV (2009)