A Dictionary Learning Approach for Classification - CiteSeerX



ECCV#165


A Dictionary Learning Approach for Classification: Separating the Particularity and the Commonality

Shu Kong and Donghui Wang

Dept. of Computer Science and Technology, Zhejiang University, Hangzhou, P.R. China
{aimerykong, dhwang}@zju.edu.cn

Abstract. Empirically, we find that, despite the class-specific features owned by the objects appearing in images, objects from different categories usually share some common patterns that do not contribute to discriminating them. Concentrating on this observation and working under the general dictionary learning (DL) framework, we propose a novel method to explicitly learn a common pattern pool (the commonality) and class-specific dictionaries (the particularity) for classification. We call our method DL-COPAR; it can learn the most compact and most discriminative class-specific dictionaries for classification. The proposed DL-COPAR is extensively evaluated both on synthetic data and on benchmark image databases, in comparison with existing DL-based classification methods. The experimental results demonstrate that DL-COPAR achieves very promising performance in various applications, such as face recognition, handwritten digit recognition, scene classification and object recognition.

1 Introduction

Dictionary learning (DL), as a particular sparse signal model, has risen to prominence in recent years. It aims to learn an (overcomplete) dictionary in which only a few atoms need to be linearly combined to approximate a given signal well. DL-based methods have achieved state-of-the-art performance in many application fields, such as image denoising [1] and image classification [2, 3].

DL methods were originally designed to learn a dictionary that faithfully represents signals, so they do not work well for classification tasks. To circumvent this problem, researchers have recently developed several approaches to learn a classification-oriented dictionary. By exploiting the label information, most DL-based classification methods learn such an adaptive dictionary in one of two ways: either directly forcing the dictionary to be discriminative, or making the sparse coefficients discriminative (usually by simultaneously learning a classifier) to promote the discrimination power of the dictionary. For the first case, as an example, Ramirez et al. advocate learning a class-specific sub-dictionary for each class with a novel penalty term that makes the sub-dictionaries incoherent [4]. Most methods belong to the latter case. Mairal et al. propose a supervised DL method that embeds a logistic loss function to simultaneously learn a classifier [3]. Zhang and Li propose the discriminative K-SVD method to obtain a dictionary that has good representation power while supporting optimal discrimination of the classes [5]. Furthermore, Jiang et al. add a label-consistency term to K-SVD to make the sparse coefficients more discriminative, so that the discrimination power of



the dictionary is further improved. Beyond these two scenarios, Yang et al. propose the Fisher discrimination DL method, which simultaneously learns class-specific sub-dictionaries and makes the coefficients more discriminative through the Fisher criterion [6].

Besides, in our empirical observation, despite the highly discriminative features owned by objects from different categories, the images containing the objects usually share some common patterns that do not contribute to recognizing them. Such common patterns include, for example, the background of the objects in object categorization, or the nose in expression recognition (the nose is almost motionless across expressions). Given this observation, it is not sensible to treat the (inter-category) atoms of the overall dictionary equally, without discrimination. In the inspiring work of Ramirez et al. [4], a DL method (DLSI) is proposed that adds a structured incoherence penalty term to learn C sub-dictionaries for C classes (each corresponding to a specific class) and discards the most coherent atoms among these sub-dictionaries as the shared features. These shared features would otherwise confuse the representation, because coherent atoms from different sub-dictionaries can be used interchangeably. DLSI then uses the combination of all the sub-dictionaries as an overall dictionary for the final classification. Even though DLSI achieves some improvements, the common patterns remain hidden in the sub-dictionaries. In the work of Wang et al. [7], an automatic group sparse coding (AutoGSC) method is proposed to discover the hidden shared data patterns with a common dictionary and C individual dictionaries for the C groups. However, AutoGSC is a clustering approach based on sparse coding and is not adapted for classification.

Inspired by the aforementioned works, we propose a novel DL-based approach for classification tasks, named DL-COPAR in this paper. Given training data with label information, we explicitly learn the class-specific feature dictionaries (the particularity) and the common pattern pool (the commonality). The particularity is maximally discriminative for classification in addition to its representation power, and the commonality is separated out to contribute merely to the representation of the data from all classes. With the help of the commonality, the overall dictionary can be more compact and more discriminative for classification. To evaluate the proposed DL-COPAR, we perform an extensive series of experiments both on synthetic data and on benchmark image databases. The experimental results demonstrate that DL-COPAR achieves very promising performance in various applications, such as face recognition, handwritten digit recognition, scene classification and object recognition.

2 Learning the Commonality and the Particularity

To derive our DL-COPAR, we first review the classical dictionary learning (DL) model. Suppose there are N training data (possibly from different categories) denoted by x_i ∈ R^d (i = 1, …, N). DL learns a dictionary D ∈ R^{d×K} from them by alternately minimizing the objective function f over D and the coefficient matrix A = [a_1, …, a_N] ∈ R^{K×N}:

  {A, D} = argmin_{D,A} { f ≡ Σ_{i=1}^{N} ‖x_i − D a_i‖_2^2 + λ‖a_i‖_1 }   s.t. ‖d_k‖_2^2 = 1, ∀k = 1, …, K.   (1)
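The alternating scheme behind Eq.(1) can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: the paper solves the sparse-coding step with feature-sign search [9], whereas we substitute plain ISTA (iterative soft-thresholding) here; all function names are ours.

```python
import numpy as np

def soft_threshold(Z, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_code(X, D, lam, n_iter=200):
    """Approximate argmin_A ||X - D A||_F^2 + lam * ||A||_1 with ISTA."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth part
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ A - X)
        A = soft_threshold(A - grad / L, lam / L)
    return A

def dictionary_update(X, A, D):
    """One sweep of atom-wise least-squares updates, each projected to unit norm."""
    for k in range(D.shape[1]):
        ak = A[k:k + 1, :]                     # k-th coefficient row (1 x N)
        R = X - D @ A + D[:, k:k + 1] @ ak     # residual with atom k removed
        dk = R @ ak.T
        nrm = np.linalg.norm(dk)
        if nrm > 1e-12:
            D[:, k] = (dk / nrm).ravel()       # keep ||d_k||_2 = 1
    return D

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
D = rng.standard_normal((20, 30))
D /= np.linalg.norm(D, axis=0)
for _ in range(5):                             # alternate the two sub-problems
    A = sparse_code(X, D, lam=0.1)
    D = dictionary_update(X, A, D)
A = sparse_code(X, D, lam=0.1)                 # re-code against the final dictionary
```

The unit-norm projection in the dictionary step enforces the constraint ‖d_k‖_2 = 1 of Eq.(1).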



If K > d, then D is called an overcomplete dictionary. Suppose there are C classes and X = [X_1, …, X_c, …, X_C] ∈ R^{d×N} is the dataset, where X_c ∈ R^{d×N_c} (N = Σ_{c=1}^{C} N_c) contains the data from the cth class, and a signal x_i ∈ R^d from this class is indexed by i ∈ I_c. Although the dictionary D learned from X by Eq.(1) is adapted to representing the data, it is not suitable for classifying them. To obtain a classification-oriented dictionary, an intuitive way is to learn C class-specific dictionaries D_c, one per class (c = 1, …, C), and then use the reconstruction error for classification [4, 8]. Besides, as we observe in the real world, the class-specific dictionaries D_c of different categories usually share some common patterns/atoms, which do not contribute to classification but are essential for reconstruction in DL. These common/coherent atoms can therefore be used interchangeably to represent a query datum, which compromises classification performance. For this reason, to improve classification performance, we explicitly separate the coherent atoms out by learning the commonality D_{C+1}, which provides the common bases for all categories. Denote the overall dictionary by D = [D_1, …, D_c, …, D_C, D_{C+1}] ∈ R^{d×K}, in which K = Σ_{c=1}^{C+1} K_c, D_c ∈ R^{d×K_c} stands for the particularity of the cth class, and D_{C+1} ∈ R^{d×K_{C+1}} is the commonality. I is the identity matrix of appropriate size.

First of all, a learned dictionary D ought to represent every sample x_i well, i.e. x_i ≈ D a_i, where the coefficient a_i = [θ_i^{(1)}; …; θ_i^{(C)}; θ_i^{(C+1)}] ∈ R^K and θ_i^{(c)} ∈ R^{K_c} is the part corresponding to the sub-dictionary D_c. Besides the overall dictionary, it is also expected that a sample from the cth class can be well represented by the cooperative efforts of the cth particularity D_c and the commonality D_{C+1}. Therefore, we renew the objective function f:

  f ≡ Σ_{c=1}^{C} Σ_{i∈I_c} { ‖x_i − D a_i‖_2^2 + λ‖a_i‖_1 + ‖x_i − D_c θ_i^{(c)} − D_{C+1} θ_i^{(C+1)}‖_2^2 },   (2)

where θ_i^{(c)} and θ_i^{(C+1)} are the coefficients corresponding to the two sub-dictionaries.
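The per-sample objective of Eq.(2) is easy to evaluate directly, which makes a useful sanity check when implementing the model. A minimal sketch with hypothetical names; the sub-dictionaries and matching coefficient segments are passed as lists, with the commonality last:

```python
import numpy as np

def copar_fit_terms(x, D_parts, a_parts, c, lam):
    """Evaluate the three terms of Eq.(2) for one sample x of class c:
    overall reconstruction error, l1 sparsity, and the reconstruction error
    using only the c-th particularity plus the commonality (the last entry
    of D_parts / a_parts)."""
    D = np.hstack(D_parts)
    a = np.concatenate(a_parts)
    overall = np.sum((x - D @ a) ** 2)
    sparsity = lam * np.abs(a).sum()
    partial = np.sum((x - D_parts[c] @ a_parts[c]
                        - D_parts[-1] @ a_parts[-1]) ** 2)
    return overall + sparsity + partial

rng = np.random.default_rng(1)
D_parts = [rng.standard_normal((8, 4)) for _ in range(4)]   # 3 classes + commonality
a_parts = [rng.standard_normal(4) for _ in range(4)]
x = rng.standard_normal(8)
val = copar_fit_terms(x, D_parts, a_parts, c=0, lam=0.1)
```

When the coefficient segments of all the other classes are zero, the overall and partial reconstruction errors coincide, as the model intends for a well-coded sample.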

Actually, the last term of Eq.(2) is the same as the objective formulation of [7] when the sparsity penalty is ignored. For mathematical brevity, we introduce a selection operator Q_c = [q_1^c, …, q_j^c, …, q_{K_c}^c] ∈ R^{K×K_c}, in which the jth column of Q_c has a single nonzero entry:

  q_j^c = [0, …, 0, 1, 0, …, 0]^T,   (3)

with the 1 in position Σ_{m=1}^{c-1} K_m + j, i.e. preceded by Σ_{m=1}^{c-1} K_m zeros, sitting at the jth place of the K_c-entry block of class c, and followed by Σ_{m=c+1}^{C+1} K_m zeros.
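The selection operator of Eq.(3) is just a block of the K×K identity matrix. A small sketch (0-based indexing; names ours):

```python
import numpy as np

def selection_operator(sizes, c):
    """Build Q_c of Eq.(3): a K x K_c slice of the K x K identity that selects
    the columns of the c-th sub-dictionary from the overall dictionary.
    `sizes` lists [K_1, ..., K_{C+1}]; c is 0-based here."""
    K = sum(sizes)
    start = sum(sizes[:c])
    Q = np.zeros((K, sizes[c]))
    Q[start:start + sizes[c], :] = np.eye(sizes[c])
    return Q

sizes = [3, 4, 2]                    # two particularities and a commonality
Q1 = selection_operator(sizes, 1)    # selects the 2nd sub-dictionary
```

With this construction, D @ Q1 returns exactly the columns of the second sub-dictionary, and Q1.T @ Q1 is the K_c × K_c identity, matching the properties used right after Eq.(3).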

Therefore, we have Q_c^T Q_c = I, D_c = D Q_c and θ_i^{(c)} = Q_c^T a_i ∈ R^{K_c}. Let Q̄_c = [Q_c, Q_{C+1}]; then the coefficients corresponding to the particularity and the commonality are [θ_i^{(c)}; θ_i^{(C+1)}] = Q̄_c^T a_i ∈ R^{K_c + K_{C+1}}. Introducing the terse notation φ(A_c) = Σ_{i=1}^{N_c} ‖a_i^c‖_1 for A_c = [a_1^c, …, a_{N_c}^c], Eq.(2) can be rewritten as:

  f ≡ Σ_{c=1}^{C} { ‖X_c − D A_c‖_F^2 + λφ(A_c) + ‖X_c − D Q̄_c Q̄_c^T A_c‖_F^2 }.   (4)

In this way, we first guarantee the representation power of the overall dictionary, and then require the particularity, in conjunction with the commonality, to represent the data of this category well. However, merely doing this is not sufficient to learn a discriminative dictionary, because other class-specific dictionaries may share similar bases with the cth one, i.e. atoms from different class-specific dictionaries can still be coherent and thus be used interchangeably for representing query data. To circumvent this problem, we force the coefficients, except the parts corresponding to the cth particularity and the commonality, to be zero. Mathematically, denote Q_{/c} = [Q_1, …, Q_{c-1}, Q_{c+1}, …, Q_C, Q_{C+1}] and Q̄_{/c} = [Q_1, …, Q_{c-1}, Q_{c+1}, …, Q_C]; we force ‖Q̄_{/c}^T A_c‖_F^2 = 0. Thus, we add a new term to the objective function:

  f ≡ Σ_{c=1}^{C} { ‖X_c − D A_c‖_F^2 + ‖Q̄_{/c}^T A_c‖_F^2 + λφ(A_c) + ‖X_c − D Q̄_c Q̄_c^T A_c‖_F^2 }.   (5)

It is worth noting that Eq.(5) can fail to capture the common patterns. For example, bases of the real common patterns may appear in several particularities, which makes the learned particularities redundant and less discriminative. Therefore, it is desirable to drive the common patterns into the commonality and preserve the class-specific features in the particularity. For this purpose, we add an incoherence term Q(D_i, D_j) = ‖D_i^T D_j‖_F^2 to the objective function. This penalty term has been used among the class-specific sub-dictionaries in [4], but we also consider the incoherence of the commonality with the particularities. We then arrive at the objective function of the proposed DL-COPAR:

  f ≡ Σ_{c=1}^{C} { ‖X_c − D A_c‖_F^2 + ‖Q̄_{/c}^T A_c‖_F^2 + ‖X_c − D Q̄_c Q̄_c^T A_c‖_F^2 + λφ(A_c) } + η Σ_{c=1}^{C+1} Σ_{j=1, j≠c}^{C+1} Q(D_c, D_j).   (6)
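The incoherence term Q(D_i, D_j) = ‖D_i^T D_j‖_F^2 and its sum over all ordered pairs in Eq.(6) can be computed directly. A sketch (names ours):

```python
import numpy as np

def incoherence_penalty(D_parts, eta):
    """eta * sum over ordered pairs c != j of ||D_c^T D_j||_F^2, i.e. the
    incoherence term of Eq.(6) over all particularities and the commonality."""
    total = 0.0
    for c, Dc in enumerate(D_parts):
        for j, Dj in enumerate(D_parts):
            if c != j:
                total += np.sum((Dc.T @ Dj) ** 2)
    return eta * total
```

As intended, sub-dictionaries built from disjoint columns of the identity (mutually orthogonal atoms) incur zero penalty, while duplicated sub-dictionaries are penalized.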

3 Optimization Step

At first sight, the objective function in Eq.(6) seems complex to solve, but we show it can be optimized easily through an alternating optimization process.

3.1 Fix D to update Ac

When fixing D to update Ac, we omit the terms independent of Ac from Eq.(6):

  f ≡ Σ_{c=1}^{C} { ‖X_c − D A_c‖_F^2 + ‖Q̄_{/c}^T A_c‖_F^2 + λφ(A_c) + ‖X_c − D Q̄_c Q̄_c^T A_c‖_F^2 }
    = Σ_{c=1}^{C} { ‖ [X_c; X_c; 0] − [D; D Q̄_c Q̄_c^T; Q̄_{/c}^T] A_c ‖_F^2 + λφ(A_c) },   (7)

where the three residual blocks are stacked vertically.

Proposition 1. Denote X̃_c = [X_c; X_c; 0] and D̃ = [d̃_1, …, d̃_k, …, d̃_K], where d̃_k is the ℓ2-normalized kth column of [D; D Q̄_c Q̄_c^T; Q̄_{/c}^T]. Then, when fixing D to minimize Eq.(6) over A_c, the minimum is reached at A_c = (1/√2) Ã, where Ã is the solution of

  Ã = argmin_Ã ‖X̃_c − D̃ Ã‖_F^2 + (λ/√2) φ(Ã).

Proof. Denote D' = [D; D Q̄_c Q̄_c^T; Q̄_{/c}^T]. Through simple derivations, it is easy to show that the Euclidean length of each column of D' is √2, hence D' = √2 D̃ and

  A_c = argmin_{A_c} { ‖X̃_c − (1/√2) D' · √2 A_c‖_F^2 + λφ(A_c) }
      = argmin_{A_c} { ‖X̃_c − D̃ (√2 A_c)‖_F^2 + (λ/√2) φ(√2 A_c) }.   (8)

Proposition 1 indicates that, to update the coefficients A_c with D fixed, we can simply solve a LASSO problem. Throughout this paper, we adopt the feature-sign search algorithm [9] to solve this LASSO problem.
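The √2 column-norm claim in the proof is easy to verify numerically: each column of the stacked matrix picks up norm 1 from the top block and norm 1 from exactly one of the other two blocks. A sketch with C = 2 classes (sizes and names ours):

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [3, 4, 2]                      # K_1, K_2, K_{C+1}; the class of interest is c = 0
K = sum(sizes)
D = rng.standard_normal((10, K))
D /= np.linalg.norm(D, axis=0)         # unit-norm atoms, as required by Eq.(1)

def block_select(sizes, picks):
    """Selection matrix keeping the sub-dictionaries listed in `picks` (0-based)."""
    K = sum(sizes)
    cols = []
    for c in picks:
        s = sum(sizes[:c])
        E = np.zeros((K, sizes[c]))
        E[s:s + sizes[c]] = np.eye(sizes[c])
        cols.append(E)
    return np.hstack(cols)

Qc_bar = block_select(sizes, [0, 2])   # [Q_c, Q_{C+1}]: particularity + commonality
Q_rest = block_select(sizes, [1])      # the remaining particularities
D_stack = np.vstack([D, D @ Qc_bar @ Qc_bar.T, Q_rest.T])
col_norms = np.linalg.norm(D_stack, axis=0)
```

Every column of the stacked matrix indeed has norm √2, so dividing the dictionary by √2 and rescaling the code as in Proposition 1 turns Eq.(7) into a standard unit-norm LASSO.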

3.2 Fix Ac to update D

To update the overall dictionary D = [D_1, …, D_C, D_{C+1}], we turn to an iterative approach, i.e. updating D_c while fixing all the other D_i's. Besides the cth class-specific dictionary, the common pattern pool also contributes to fitting the signals from the cth class, so optimizing D_{C+1} differs from optimizing D_c. We elaborate the optimization steps below.

Update the particularity D_c. Without loss of generality, we concentrate on optimizing the cth class-specific dictionary D_c while fixing the other D_i's. Specifically, we denote A_c^{(i)} = Q_i^T A_c for i = 1, …, C+1. Dropping the unrelated terms, we update the cth particularity D_c by:

  D_c = argmin_{D_c} ‖X_c − Σ_{i=1, i≠c}^{C+1} D_i A_c^{(i)} − D_c A_c^{(c)}‖_F^2 + ‖X_c − D_{C+1} A_c^{(C+1)} − D_c A_c^{(c)}‖_F^2 + 2η Σ_{i=1, i≠c}^{C+1} Q(D_c, D_i).   (9)

Let X̂_c = X_c − D_{C+1} A_c^{(C+1)}, Y_c = X_c − Σ_{i=1, i≠c}^{C+1} D_i A_c^{(i)}, and B = D Q_{/c}; then we have:

  D_c = argmin_{D_c} ‖Y_c − D_c A_c^{(c)}‖_F^2 + ‖X̂_c − D_c A_c^{(c)}‖_F^2 + 2η ‖D_c^T B‖_F^2.   (10)

We propose to update D_c = [d_c^1, …, d_c^{K_c}] atom by atom, i.e. updating d_c^k while fixing the other columns. Specifically, denote A_c^{(c)} = [a_{(1)}; …; a_{(K_c)}] ∈ R^{K_c×N_c}, wherein a_{(k)} ∈ R^{1×N_c} is the kth row of A_c^{(c)}. Let Ỹ_c = Y_c − Σ_{j≠k} d_c^j a_{(j)} and X̃_c = X̂_c − Σ_{j≠k} d_c^j a_{(j)}; then we get:

  d_c^k = argmin_{d_c^k} { g(d_c^k) ≡ ‖Ỹ_c − d_c^k a_{(k)}‖_F^2 + ‖X̃_c − d_c^k a_{(k)}‖_F^2 + 2η ‖(d_c^k)^T B‖_F^2 }.   (11)

Setting the first derivative of g(d_c^k) w.r.t. d_c^k to zero, i.e. ∂g(d_c^k)/∂d_c^k = 0, we obtain the updated d_c^k:

  d_c^k = (1/2) (‖a_{(k)}‖_2^2 I + η B B^T)^{-1} (Ỹ_c + X̃_c) a_{(k)}^T.   (12)

Note that, as a dictionary atom, d_c^k ought to be unitized, i.e. d_c^k ← d_c^k / ‖d_c^k‖_2. Along with this, the corresponding coefficient row should be multiplied by ‖d_c^k‖_2, i.e. a_{(k)} ← ‖d_c^k‖_2 a_{(k)}.
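The closed-form atom update of Eq.(12), followed by the unit-norm projection and the compensating rescaling of the coefficient row, can be sketched as below (names ours; the residuals Ỹ_c and X̃_c are assumed precomputed):

```python
import numpy as np

def update_atom(Y_res, X_res, a_k, B, eta):
    """Closed-form minimizer of Eq.(11), i.e. Eq.(12), followed by the
    unit-norm projection; returns the unit atom and the rescaled row."""
    d_dim = Y_res.shape[0]
    M = np.sum(a_k ** 2) * np.eye(d_dim) + eta * B @ B.T
    d = 0.5 * np.linalg.solve(M, (Y_res + X_res) @ a_k)
    scale = np.linalg.norm(d)
    return d / scale, scale * a_k

rng = np.random.default_rng(3)
d_dim, N = 6, 12
Y_res = rng.standard_normal((d_dim, N))    # stands in for Y~_c
X_res = rng.standard_normal((d_dim, N))    # stands in for X~_c
a_k = rng.standard_normal(N)               # k-th coefficient row a_(k)
B = rng.standard_normal((d_dim, 5))        # atoms of all the other sub-dictionaries
d, a_scaled = update_atom(Y_res, X_res, a_k, B, eta=0.5)
```

Before normalization, the returned atom satisfies the stationarity condition 4(‖a_{(k)}‖_2^2 I + η B B^T) d = 2(Ỹ_c + X̃_c) a_{(k)}^T, i.e. the gradient of g vanishes.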

Update the shared feature pool D_{C+1}. Denote B = D Q_{/(C+1)}, i.e. B = [D_1, …, D_C]. Dropping the unrelated terms, we update D_{C+1} by:

  D_{C+1} = argmin_{D_{C+1}} Σ_{c=1}^{C} { ‖X_c − Σ_{i=1}^{C} D_i A_c^{(i)} − D_{C+1} A_c^{(C+1)}‖_F^2 + ‖X_c − D_c A_c^{(c)} − D_{C+1} A_c^{(C+1)}‖_F^2 } + 2η ‖D_{C+1}^T B‖_F^2.   (13)

Denote Y_c = X_c − Σ_{i=1}^{C} D_i A_c^{(i)} and X̂_c = X_c − D_c A_c^{(c)} (note that X̂_c and Y_c here differ from the ones in the optimization of D_c); then we have:

  D_{C+1} = argmin_{D_{C+1}} ‖Y − D_{C+1} A^{(C+1)}‖_F^2 + ‖X̂ − D_{C+1} A^{(C+1)}‖_F^2 + 2η ‖D_{C+1}^T B‖_F^2,   (14)

where A^{(C+1)} = [A_1^{(C+1)}, …, A_C^{(C+1)}], X̂ = [X̂_1, …, X̂_C], and Y = [Y_1, …, Y_C].

Likewise, we choose to update D_{C+1} atom by atom, and the kth column is updated by:

  d_{C+1}^k = (1/2) (‖a_{(k)}‖_2^2 I + η B B^T)^{-1} (X̃ + Ỹ) a_{(k)}^T,   (15)

where A^{(C+1)} = [a_{(1)}; …; a_{(K_{C+1})}] ∈ R^{K_{C+1}×N}, Ỹ = Y − Σ_{j≠k} d_{C+1}^j a_{(j)}, and X̃ = X̂ − Σ_{j≠k} d_{C+1}^j a_{(j)}. Similarly, we unitize d_{C+1}^k to obtain a unit-length atom, with the coefficients scaled as a_{(k)} ← ‖d_{C+1}^k‖_2 a_{(k)}.

The overall algorithm is summarized in Algorithm 1. Note that the value of the objective function decreases at each iteration, therefore the algorithm converges.

4 Experimental Validation

In this section, we perform a series of experiments to evaluate the proposed DL-COPAR. First, a synthetic dataset is used to demonstrate the power of our method in learning the commonality and the particularity. Then, we compare our method with some state-of-the-art approaches on four publicly available benchmarks covering four real-world recognition tasks: face, hand-written digit, scene, and object recognition.


Algorithm 1 Learning the Commonality and Particularity
Require: training dataset [X_1, …, X_C] with labels; the sizes K_c of the C+1 desired sub-dictionaries; parameters λ and η;
Ensure: ‖d_i‖_2 = 1 for all i = 1, …, K;
1: initialize D_c for c = 1, …, C, C+1;
2: while the stopping criterion is not reached do
3:   update the coefficients A_c by solving the LASSO problem of Eq.(8);
4:   update D_c atom by atom through Eq.(12), unitizing each atom and correspondingly scaling the related coefficients;
5:   update D_{C+1} atom by atom through Eq.(15), unitizing each atom and correspondingly scaling the related coefficients;
6: end while
7: return the learned dictionaries D_1, …, D_C, D_{C+1}.

4.1 Experimental setup

We employ the K-SVD algorithm [10] to initialize the particularities and the commonality. In detail, to initialize the cth class-specific dictionary, we perform K-SVD on the data from the cth category; to initialize the common pattern pool, we perform K-SVD on the whole training set. Note that we adopt the feature-sign search algorithm [9] for the sparse coding problem throughout our experiments.

In the classification stage, we adopt two reconstruction-error-based classification schemes and the linear SVM classifier for the different classification tasks. The two reconstruction-error-based schemes, a global classifier (GC) and a local classifier (LC), are used for face and hand-written digit recognition. Note that throughout our experiments, the sparsity parameter γ below is tuned to make about 20% of the coefficient elements nonzero.

GC. For a testing sample y, we code it over the overall learned dictionary D: a = argmin_a ‖y − D a‖_2^2 + γ‖a‖_1, where γ is a constant. Write a = [θ^{(1)}; …; θ^{(C)}; θ^{(C+1)}], where θ^{(c)} is the coefficient vector associated with the sub-dictionary D_c. The reconstruction error for class c is then e_c = ‖y − D̂_c θ̂^{(c)}‖_2^2, where D̂_c = [D_c, D_{C+1}] and θ̂^{(c)} = [θ^{(c)}; θ^{(C+1)}]. Finally, the predicted label is c = argmin_c e_c.

LC. For a testing sample y, we directly calculate C reconstruction errors, one per class: e_c = min_{a_c} ‖y − D̂_c a_c‖_2^2 + γ‖a_c‖_1, where D̂_c = [D_c, D_{C+1}]. The final identity of y is c = argmin_c e_c.

The linear SVM is used for scene and object recognition within the spatial pyramid matching (SPM) framework [11], but in the dictionary learning stage we adopt the proposed DL-COPAR. In detail, we first extract SIFT descriptors (of length 128) [12] from 16×16-pixel patches, densely sampled from each image on a grid with an 8-pixel step size. Over the SIFT descriptors extracted from the different categories, we use the proposed DL-COPAR to train an overall dictionary D ∈ R^{128×K} concatenating the particularities and the commonality. We then encode the descriptors of each image over the learned dictionary. Partitioning the image into 2^l × 2^l segments at scales l = 0, 1, 2, we use max pooling to pool the coefficients over each segment into a single vector. Thus there are 1 + 2×2 + 4×4 = 21 pooled vectors, which are concatenated into a 21K-dimensional vector for the final representation. Note that, different from the pooling process in existing SPM-based methods, we only use the coefficients corresponding to the particularities, discarding those of the commonality. Finally, we concatenate all the pooled vectors as the representation of the image. In all four recognition experiments, we run each experiment 10 times and report the average recognition rate.
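The global classifier (GC) described above can be sketched as follows. The ℓ1 coder is passed in as a black box because the paper uses feature-sign search; the splitting of the code into per-class segments and the [D_c, D_{C+1}] residual scoring follow the text. All names are ours.

```python
import numpy as np

def gc_classify(y, D_parts, code_fn):
    """Global classifier (GC): code y once over the overall dictionary, then
    score class c by the residual of [D_c, D_{C+1}] with the matching
    coefficient segments theta^(c) and theta^(C+1).  `code_fn(y, D)` is any
    l1 coder; the last entry of D_parts is the commonality."""
    C = len(D_parts) - 1
    D = np.hstack(D_parts)
    a = code_fn(y, D)
    bounds = np.cumsum([P.shape[1] for P in D_parts])[:-1]
    thetas = np.split(a, bounds)              # theta^(1), ..., theta^(C+1)
    errs = [np.sum((y - D_parts[c] @ thetas[c]
                      - D_parts[-1] @ thetas[-1]) ** 2) for c in range(C)]
    return int(np.argmin(errs))
```

For instance, with a least-squares coder standing in for the ℓ1 coder, a sample lying in the span of class 0's particularity receives label 0, because only that class's residual vanishes.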

4.2 Synthetic Example

The synthetic dataset consists of two classes of gray-scale images of size 30×30. Both categories share three common basis images, denoted F_i^S for i = 1, 2, 3, as illustrated in the right panel of Fig. 1, where dark represents the value 0 and white represents 255. Category 1 has four individual basis images, denoted F_{1i}^I for i = 1, …, 4, as shown in the left panel of Fig. 1; similarly, the four basis images of category 2 are denoted F_{2i}^I. The images used in our experiment are generated by adding Gaussian noise with zero mean and 0.01 variance. Fig. 2 displays several examples from the dataset, the top row from the first class and the bottom row from the second. Note that a similar synthetic dataset is used in [7], but ours differs in two ways: we use Gaussian noise instead of uniform random noise, and noise is added to both the dark and the white regions rather than merely to the dark region as in [7]. Our synthetic dataset therefore makes it harder to separate the common patterns from the individual features.

We compare the learned dictionaries with those learned by two closely related DL-based classification methods: the Fisher Discriminative Dictionary Learning method (FDDL) [6] and the method based on DL with structured incoherence (DLSI) [4]. As a supervised DL method, FDDL learns class-specific dictionaries for each class and makes them discriminative through the Fisher criterion. Fig. 3 shows the dictionary learned by FDDL, one row per class. Clearly, the dictionaries learned by FDDL are highly discriminative, and each class-specific dictionary can faithfully represent the signals of the corresponding class. However, even though FDDL succeeds in learning discriminative dictionaries, it cannot separate the common patterns from the individual features; the dictionaries can thus be too redundant for classification. DLSI can learn the common features in addition to the class-specific dictionaries, as Fig. 4 shows. Actually, DLSI learns class-specific sub-dictionaries and then selects the most coherent atoms among the different individual dictionaries as the common features. From this figure, we can see that the learned individual dictionaries are still mixed with the common features, and that the separated common features are too redundant.

On the contrary, as expected, the proposed DL-COPAR correctly separates the common features from the individual features, as shown in Fig. 5. For comparison, we also plot the particularities initialized by K-SVD. Any signal can be well represented by the corresponding class-specific dictionary together with the shared pattern pool. This reveals that DL-COPAR can learn the most compact and most discriminative dictionaries for classification. Fig. 6 shows the convergence curve of DL-COPAR on the synthetic dataset; the proposed method converges within about 15 iterations in this case.


Fig. 1. The real class-specific features and the common patterns (left panel: the class-specific basis images; right panel: the common basis images).

Fig. 2. Examples of the synthetic dataset.

Fig. 3. The class-specific dictionaries learned by FDDL [6].

Fig. 4. The class-specific dictionaries and the shared features learned by DLSI [4].

Fig. 5. The initialized particularities, and the learned particularities and commonality by the proposed DL-COPAR.

Fig. 6. The objective function value vs. number of iterations on the synthetic dataset.


Table 1. The recognition rates (%) of various methods on the Extended YaleB face database.

Method          Accuracy
SRC             97.2
D-KSVD          94.1
LC-SVD          96.7
DLSI            96.5
FDDL            97.9
kNN             82.5
SVM             96.2
DL-COPAR (LC)   96.9
DL-COPAR (GC)   98.3

4.3 Face recognition

The Extended YaleB database [13] contains 2,414 frontal face images of 38 persons, about 68 images per individual. It is challenging due to varying illumination conditions and expressions; the original images are cropped to 192×168 pixels. We use random faces [14, 15, 5] as the feature descriptors, obtained by projecting each face image onto a 504-dimensional vector with a randomly generated matrix drawn from a zero-mean normal distribution. We randomly select half of the images (about 32 per individual) for training and the rest for testing. The learned dictionary consists of 570 atoms, which correspond to an average of 15 atoms per person, with the last 5 serving as the common features. We compare the performance of our DL-COPAR with that of D-KSVD [5], SRC [14], DLSI [4], LC-SVD [15] and FDDL [6].
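The random-faces descriptor is a single random projection of the vectorized image. A sketch (the row normalization of the projection matrix is our choice; the paper only specifies a zero-mean normal random matrix, and we use downscaled stand-in images instead of the 192×168 crops):

```python
import numpy as np

def random_faces(images, out_dim=504, seed=0):
    """'Random faces' descriptor: project each vectorized face image onto
    out_dim dimensions with a random Gaussian matrix.  Row normalization
    is an assumption on our part, not specified in the text."""
    rng = np.random.default_rng(seed)
    flat = images.reshape(len(images), -1)
    R = rng.standard_normal((out_dim, flat.shape[1]))
    R /= np.linalg.norm(R, axis=1, keepdims=True)
    return flat @ R.T

faces = np.random.default_rng(1).random((5, 48, 42))   # downscaled stand-ins
feats = random_faces(faces)
```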

A direct comparison is shown in Table 1. We can see that D-KSVD, which only uses the coding coefficients for classification, does not work well on this dataset. LC-KSVD, which employs the label information to improve the discrimination of the learned overall dictionary and also uses the coefficients for the final classification, achieves better performance than D-KSVD, but still no better than SRC. DLSI aims to make the class-specific dictionaries incoherent and thereby facilitates their discrimination; it employs the reconstruction error for classification, and achieves no better results than SRC. FDDL utilizes the Fisher criterion to make the coefficients more discriminative, thus improving the discrimination of the individual dictionaries; it has achieved state-of-the-art classification performance on this database, with accuracy higher than that of SRC. However, our DL-COPAR with GC achieves the best classification performance, about 0.4% higher than that of FDDL. Note that when DL-COPAR uses LC for classification, it cannot produce a satisfactory outcome. The reason is that each class-specific dictionary is too small to faithfully represent the data on its own, so the collaboration of these sub-dictionaries is of crucial significance.
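The global reconstruction-error (GC) scheme discussed above can be sketched as follows: code a test sample over the concatenated dictionary, then assign it to the class whose particularity, together with the commonality, yields the smallest reconstruction error. This is only an illustrative sketch, not the paper's implementation; the tiny synthetic dictionaries, the ISTA solver and all parameter values are our own assumptions.

```python
import numpy as np

def soft(z, t):
    # soft-thresholding: the proximal operator of the l1 norm
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(x, D, lam=0.01, iters=500):
    # ISTA for min_a 0.5*||x - D a||^2 + lam*||a||_1
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        a = soft(a - D.T @ (D @ a - x) / L, lam / L)
    return a

def classify_gc(x, parts, common):
    # Code x over [D_1, ..., D_C, D_0]; for each class, reconstruct with that
    # class's particularity plus the shared commonality, and pick the class
    # with the smallest reconstruction error.
    D = np.hstack(parts + [common])
    a = sparse_code(x, D)
    bounds = np.cumsum([0] + [P.shape[1] for P in parts])
    a0 = a[bounds[-1]:]                      # coefficients over the commonality
    errs = [np.linalg.norm(x - parts[c] @ a[bounds[c]:bounds[c + 1]] - common @ a0)
            for c in range(len(parts))]
    return int(np.argmin(errs))

rng = np.random.default_rng(1)
unit = lambda M: M / np.linalg.norm(M, axis=0)
parts = [unit(rng.normal(size=(10, 3))) for _ in range(2)]   # two class particularities
common = unit(rng.normal(size=(10, 2)))                      # shared commonality
x = parts[0] @ np.array([1.0, 0.5, 0.0]) + common @ np.array([0.3, 0.0])
print(classify_gc(x, parts, common))
```

Because the sample is synthesized from class 0's atoms plus the commonality, class 0 attains a far smaller residual than class 1, so the sketch predicts class 0.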

4.4 Hand-written digit recognition

USPS 1 is a widely used handwritten digit data set with 7,291 training and 2,007 testing images. We compare the proposed DL-COPAR with several methods, including the reconstructive DL method with linear and bilinear classifier models (denoted REC-L and REC-BL) reported in [3], the supervised DL method with generative training and discriminative training (denoted SDL-G and SDL-D) reported in [3], sparse representation for signal classification (denoted SRSC) [16], DLSI [4]

1 http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html


Learning Dictionary-Based Particularity and Commonality for Classification 11

Fig. 7. The learned particularities (a) and the commonality (b) by DL-COPAR on the USPS database.

Table 2. Error rates (%) of various methods on the USPS data set.

Method      SRSC  REC-L  REC-BL  SDL-G  SDL-D  DLSI  FDDL  kNN  SVM  DL-COPAR (LC)  DL-COPAR (GC)
Error Rate  6.05  6.83   4.38    6.67   3.54   3.98  3.69  5.2  4.2  3.61           4.70

and FDDL [6]. Additionally, results of some problem-specific methods (i.e., the kNN method and SVM with a Gaussian kernel) reported in [4] are also listed. The original images, of 16 × 16 resolution, are directly used here.

A direct comparison is shown in Table 2; these results are either originally reported here or taken from [6]. As we can see from the table, our DL-COPAR with the LC classification scheme outperforms all the other methods except SDL-D. In fact, the optimization of SDL-D is much more complex than that of DL-COPAR, and it uses much more information in the dictionary learning and classification process, such as a learned classifier on the coding coefficients, the sparsity of the coefficients and the reconstruction error. It is worth noting that DL-COPAR trains only 30 codewords for each class plus 8 atoms as the shared commonality, whereas FDDL learns a dictionary of 90 atoms for each digit yet achieves a result only very close to that of DL-COPAR. Fig. 7 illustrates the learned particularities and the commonality, respectively.

4.5 Scene recognition

We also apply DL-COPAR to the scene classification task on the 15-scene dataset, which was compiled by several researchers [22, 23, 17]. This dataset contains 4,484 images in total, falling into 15 categories, with the number of images per category ranging from 200 to 400. Following the common setup on this database [11, 17], we randomly select 100 images per category as the training set and use the rest for testing. An overall 1024-visual-word dictionary is constructed: a 60-visual-word particularity is learned for each category, and the commonality consists of 124 shared pattern bases, so the size of the concatenated dictionary is 15 × 60 + 124 = 1024. Note that the size of the final representation of each image is 21 × 15 × 60 = 18,900, which is smaller than that of other SPM methods (21 × 1024 = 21,504), because we discard the codes corresponding to the commonality.
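The dictionary and representation sizes quoted above follow from the standard three-level spatial pyramid, which has 1 + 4 + 16 = 21 pooling cells. A quick arithmetic check:

```python
# Three-level spatial pyramid: 1 + 4 + 16 = 21 pooling cells.
cells = 1 + 4 + 16

classes, per_class, common = 15, 60, 124
dict_size = classes * per_class + common   # concatenated dictionary size
ours = cells * classes * per_class         # our representation: commonality codes discarded
spm = cells * dict_size                    # standard SPM representation over the full dictionary
print(dict_size, ours, spm)                # 1024 18900 21504
```

The same bookkeeping applies to the Caltech101 experiment in Section 4.6 with 102 classes, 17 atoms per class and 314 shared atoms.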


Table 3. The recognition rates (%) of various methods on 15-Scenes database.

Method    KSPM [17]  ScSPM [11]  LLC [18]  GLP [19]  Boureau et al. [20]  Boureau et al. [21]  DL-COPAR
Accuracy  81.40      80.28       79.24     83.20     83.3                 84.9                 85.37

Table 4. The recognition rates (%) of various methods on the Caltech101 dataset. 15 and 30 images per class are randomly selected for training, shown in the second and third rows, respectively.

Method       KSPM [17]  ScSPM [11]  LLC [18]  GLP [19]  Boureau et al. [20]  Boureau et al. [21]  DL-COPAR
15 training  56.44      67.00       65.43     70.34     -                    -                    75.10
30 training  64.40      73.20       73.44     82.60     77.3                 75.70                83.26

We compare the proposed DL-COPAR with several recent methods based on SPM and sparse coding; the direct comparisons are shown in Table 3. KSPM [17], the popular nonlinear kernel SPM using spatial-pyramid histograms and Chi-square kernels, achieves 81.40% accuracy. Compared with KSPM, ScSPM [11], the baseline method in our paper, employs sparse coding, max pooling and a linear SVM for classification, achieving an 80.28% classification rate. LLC [18] considers locality relations in sparse coding, achieving 79.24%. GLP [19] focuses on a more sophisticated pooling method, which learns a weight for each SIFT descriptor in the pooling process by enhancing discrimination and incorporating local spatial relations; at a high computational cost (eigenvalue decomposition of a large matrix and a gradient ascent process), GLP achieves 83.20% accuracy on this dataset. In [21], 84.9% classification accuracy is achieved through macrofeatures, which jointly encode SIFT descriptors from a small neighborhood and learn a discriminative dictionary.

The 85.37% accuracy achieved by our method is clearly higher than that of all the other methods, and is also competitive with the current state of the art of 88.1% reported in [24] and 89.75% in [25]. It is worth noting that our DL-COPAR uses only SIFT descriptors and solves the classical LASSO problem for sparse coding, while the method in [24] combines 14 different low-level features, and that in [25] involves a more complicated coding process called Laplacian sparse coding, which requires a well-estimated similarity matrix and multiple optimization stages.

4.6 Object recognition

The Caltech101 dataset [26] contains 9,144 images in 101 object classes (including animals, vehicles, flowers, etc.) plus one background category. Each class contains 31 to 800 images with significant variance in shape. As suggested by the original dataset and also by previous studies, we randomly select 15 and 30 images per class for training and report the classification accuracies averaged over the 102 classes. The size of the overall dictionary is set to K = 2048, a popular setting in the community: DL-COPAR learns 17 visual words for each category and 314 atoms for the shared pattern pool (17 × 102 + 314 = 2048 in total).


Fig. 8. Some example images from Scene15 and Caltech101. Obviously, despite the class-specific features, the images of different categories share some common patterns, such as the clouds in (a) and the flora background in (b).

Similar to the scene recognition experiment, the final image representation (21 × 17 × 102 = 36,414 dimensions) is still much smaller than that of other SPM methods (21 × 2048 = 43,008), as a result of discarding the sparse codes corresponding to the commonality.

Table 4 shows the direct comparisons. Our DL-COPAR consistently outperforms the other SPM-based methods, achieving 83.26% accuracy, which is competitive with the state-of-the-art performance of 84.3% achieved by a group-sensitive multiple kernel method [27]. Compared with ScSPM, the baseline method in this experiment, DL-COPAR improves the classification performance by margins of more than 8% and 10% when 15 and 30 images are randomly selected for training, respectively. As illustrated by Fig. 8, we conjecture that DL-COPAR achieves such high classification accuracy on both 15-scene and Caltech101 because, for example, the complex backgrounds are intensively encoded over the learned commonality, while the part of the coefficient vector corresponding to the commonality is discarded; omitting these common patterns thus improves the classification performance.

5 Conclusion

Based on the empirical observation that images (objects) from different categories usually share some common patterns which are not helpful for classification but essential for representation, we propose a novel dictionary learning approach, dubbed DL-COPAR, to explicitly learn the shared pattern pool (the commonality) and the class-specific dictionaries (the particularity). The combination of a class's particularity and the commonality can faithfully represent the samples from that class, while the particularities remain more discriminative and more compact for classification. Through experiments on a synthetic dataset and several public benchmarks for various applications, we show that the proposed DL-COPAR achieves very promising performance.

Acknowledgements. We are grateful for financial support from the 973 Program (No. 2010CB327904) and the Natural Science Foundations of China (No. 61071218).

References

1. Elad, M., Aharon, M.: Image denoising via learned dictionaries and sparse representation. CVPR (2006)


2. Bradley, D.M., Bagnell, J.A.: Differentiable sparse coding. NIPS (2008)
3. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Supervised dictionary learning. NIPS (2008)
4. Ramirez, I., Sprechmann, P., Sapiro, G.: Classification and clustering via dictionary learning with structured incoherence and shared features. CVPR (2010)
5. Zhang, Q., Li, B.: Discriminative K-SVD for dictionary learning in face recognition. CVPR (2010)
6. Yang, M., Zhang, L., Feng, X., Zhang, D.: Fisher discrimination dictionary learning for sparse representation. ICCV (2011)
7. Wang, F., Lee, N., Sun, J., Hu, J., Ebadollahi, S.: Automatic group sparse coding. AAAI (2011)
8. Yang, M., Zhang, L., Yang, J., Zhang, D.: Metaface learning for sparse representation based face recognition. ICIP (2010)
9. Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. NIPS (2007)
10. Aharon, M., Elad, M., Bruckstein, A.M.: K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Processing 54 (2006) 4311–4322
11. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. CVPR (2009)
12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
13. Georghiades, A., Belhumeur, P., Kriegman, D.: From few to many: Illumination cone models for face recognition under variable lighting and pose. PAMI (2001)
14. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via sparse representation. PAMI (2009)
15. Jiang, Z., Lin, Z., Davis, L.S.: Learning a discriminative dictionary for sparse coding via label consistent K-SVD. CVPR (2011)
16. Huang, K., Aviyente, S.: Sparse representation for signal classification. NIPS (2006)
17. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR (2006)
18. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Learning locality-constrained linear coding for image classification. CVPR (2010)
19. Feng, J., Ni, B., Tian, Q., Yan, S.: Geometric ℓp-norm feature pooling for image classification. CVPR (2011)
20. Boureau, Y.L., Le Roux, N., Bach, F., Ponce, J., LeCun, Y.: Ask the locals: Multi-way local pooling for image recognition. ICCV (2011)
21. Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. CVPR (2010)
22. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV (2001)
23. Li, F.F., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. CVPR (2005)
24. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. CVPR (2010)
25. Gao, S., Tsang, I.W.H., Chia, L.T., Zhao, P.: Local features are not lonely - Laplacian sparse coding for image classification. CVPR (2010)
26. Li, F.F., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. IEEE CVPR Workshop on Generative-Model Based Vision (2004)
27. Yang, J., Li, Y., Tian, Y., Duan, L., Gao, W.: Group-sensitive multiple kernel learning for object categorization. ICCV (2009)